Commit
General adjustments:
- cleaned up the repository by removing and adjusting several files
- updated README.md to reflect changes from the original repository
- added db_clear.py to easily clear the entire database

main_st.py:
- removed the cache refresh button (the underlying issue is now fixed)
- cleaned up multiple parts of the code

db_build.py:
- cleaned up the code
Showing 11 changed files with 239 additions and 3,398 deletions.
```diff
@@ -1,5 +1,5 @@
 # GGML Models
-models/*.bin
+models/*

 # Data
 data/*
```
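The change broadens the ignore rule from `models/*.bin` to `models/*`, so GGUF files (and anything else dropped into `models/`) are ignored too, not just GGML `.bin` binaries. A rough Python sketch of the effect, using `fnmatch` (which only approximates real gitignore semantics; the file names below are hypothetical):

```python
from fnmatch import fnmatch

# Patterns from the updated ignore file
patterns = ["models/*", "data/*"]

def is_ignored(path: str) -> bool:
    """Rough approximation of gitignore matching for flat repo paths."""
    return any(fnmatch(path, pat) for pat in patterns)

# The old "models/*.bin" pattern would have missed GGUF files;
# the new "models/*" catches every file in the folder.
print(is_ignored("models/llama-2-7b-chat.Q4_K_M.gguf"))   # True
print(is_ignored("models/llama-2-7b.ggmlv3.q8_0.bin"))    # True
print(is_ignored("README.md"))                            # False
```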
# Running Llama 2 and other Open-Source LLMs on CPU Inference Locally for Document Q&A

### Clearly explained guide for running quantized open-source LLM applications on CPUs using Llama 2, C Transformers, GGML, and LangChain

## Preface
This is a fork of Kenneth Leung's original repository, which adjusts the original code in several ways:
- A Streamlit front end makes the application more user-friendly
- Follow-up questions are now possible thanks to a memory implementation
- Users can now choose between different models
- Multiple other optimisations

**Step-by-step guide on TowardsDataScience**: https://towardsdatascience.com/running-llama-2-on-cpu-inference-for-document-q-a-3d636037a3d8
___
## Context
- Third-party commercial large language model (LLM) providers like OpenAI's GPT-4 have democratized LLM use via simple API calls.
- However, there are instances where teams require self-managed or private model deployment, for reasons such as data privacy and residency rules.
- The proliferation of open-source LLMs has opened up a vast range of options, reducing our reliance on third-party providers.
- When open-source LLMs are hosted locally, on-premise or in the cloud, dedicated compute capacity becomes a key issue. While GPU instances may seem the obvious choice, the costs can easily skyrocket beyond budget.
- This project shows how to run quantized versions of open-source LLMs on local CPU inference for document question-and-answer (Q&A).
<br><br>
![Alt text](assets/diagram_flow.png)
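The cost argument above is largely a memory question: quantization shrinks a model enough to fit comfortably in ordinary CPU RAM. A back-of-the-envelope calculation for a 7B-parameter model (figures are approximate and ignore activation and runtime overhead):

```python
PARAMS = 7e9  # Llama-2-7B has roughly 7 billion parameters

def model_size_gib(params: float, bits_per_weight: float) -> float:
    """Approximate in-memory weight size, ignoring overhead."""
    return params * bits_per_weight / 8 / 2**30

print(f"fp16 : {model_size_gib(PARAMS, 16):.1f} GiB")  # ~13.0 GiB
print(f"8-bit: {model_size_gib(PARAMS, 8):.1f} GiB")   # ~6.5 GiB
print(f"4-bit: {model_size_gib(PARAMS, 4):.1f} GiB")   # ~3.3 GiB
```

At 4-bit precision the weights fit within the RAM of a typical laptop, which is what makes CPU-only inference practical here.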
___
## Quickstart
- Ensure you have downloaded the model of your choice in GGUF format and placed it into the `models/` folder. Some examples:
    - https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
    - https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF
- Fill the `data/` folder with the .pdf, .doc(x), or .txt files you want to ask questions about
- To build a FAISS database with information about your files, launch a terminal from the project directory and run: <br>
`python db_build.py`
- To start asking questions about your files, run: <br>
`streamlit run main_st.py`
- Choose which model to use for Q&A and adjust the parameters to your liking

![Alt text](assets/qa_output.png)
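Before running `db_build.py`, a quick sanity check of the expected layout can save a failed run. The snippet below is a hypothetical helper (not part of the repository) that counts the inputs the Quickstart expects, assuming the `models/` and `data/` folders described above:

```python
from pathlib import Path

def preflight(root: str = ".") -> dict:
    """Count GGUF models and supported documents under a project root."""
    base = Path(root)
    models_dir, data_dir = base / "models", base / "data"
    models = sorted(models_dir.glob("*.gguf")) if models_dir.is_dir() else []
    doc_exts = {".pdf", ".doc", ".docx", ".txt"}
    docs = ([p for p in data_dir.iterdir() if p.suffix.lower() in doc_exts]
            if data_dir.is_dir() else [])
    return {"models": len(models), "documents": len(docs)}

print(preflight())  # e.g. {'models': 1, 'documents': 1} once both folders are filled
```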
___
## Tools
- **LangChain**: Framework for developing applications powered by language models
- **LlamaCPP**: Python bindings for Transformer models implemented in C/C++
- **FAISS**: Open-source library for efficient similarity search and clustering of dense vectors
- **Sentence-Transformers (all-MiniLM-L6-v2)**: Open-source pre-trained transformer model that embeds text into a 384-dimensional dense vector space for tasks like clustering or semantic search
- **Llama-2-7B-Chat**: Open-source fine-tuned Llama 2 model designed for chat dialogue, leveraging publicly available instruction datasets and over 1 million human annotations
- **Poetry**: Tool for dependency management and Python packaging
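How the embedding model and FAISS fit together: each document chunk is embedded into a 384-dimensional vector, and answering a query means retrieving the stored vectors closest to the query vector. A minimal NumPy sketch of that retrieval step (FAISS performs the same search far faster and at scale; the vectors here are random stand-ins, not real embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 384  # output dimension of all-MiniLM-L6-v2

# Pretend corpus: 5 document chunks, each embedded as a unit vector
chunks = rng.normal(size=(5, DIM))
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)

# Pretend query embedding: a slightly perturbed copy of chunk 3,
# so retrieval has a known correct answer
query = chunks[3] + 0.01 * rng.normal(size=DIM)
query /= np.linalg.norm(query)

# For unit vectors, cosine similarity is just the dot product
scores = chunks @ query
best = int(np.argmax(scores))
print(best)  # 3 — the chunk the query was derived from
```

In the real pipeline, the top-scoring chunks are passed to the LLM as context for answering the question.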
___
## Files and Content
- `/assets`: Images relevant to the project
- `/config`: Configuration files for the LLM application
- `/data`: Dataset used for this project (i.e., the Manchester United FC 2022 Annual Report, a 177-page PDF document)
- `/models`: Binary file of the GGUF quantized LLM model (i.e., Llama-2-7B-Chat)
- `/src`: Python code for the key components of the LLM application, namely `llm.py`, `utils.py`, and `prompts.py`
- `/vectorstore`: FAISS vector store for documents
- `db_build.py`: Python script to ingest the dataset and generate the FAISS vector store
- `db_clear.py`: Python script to clear the previously built database
- `main_st.py`: Main Python script to launch the Streamlit application
- `main.py`: Python script to launch an older version of the application in the terminal, mainly used for testing purposes
- `requirements.txt`: List of Python dependencies (and versions)
___
## References
- https://github.com/marella/ctransformers
- https://huggingface.co/TheBloke
- https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML
- https://python.langchain.com/docs/ecosystem/integrations/ctransformers
- https://ggml.ai
- https://github.com/rustformers/llm/blob/main/crates/ggml/README.md
- https://www.mdpi.com/2189676
- https://github.com/abetlen/llama-cpp-python
- https://python.langchain.com/docs/integrations/llms/llamacpp
`db_clear.py` (new file, 40 lines):
```python
# =========================
# Module: Vector DB Clear
# =========================
import os

import box
import yaml

# Import config vars
with open('config/config.yml', 'r', encoding='utf8') as ymlfile:
    cfg = box.Box(yaml.safe_load(ymlfile))


def delete_files_and_clear_content(folder_path, file_to_clear):
    try:
        # Get a list of all entries in the folder
        files = os.listdir(folder_path)

        # Loop through the list and delete each file (subfolders are left alone)
        for file in files:
            file_path = os.path.join(folder_path, file)
            if os.path.isfile(file_path):
                os.remove(file_path)
                print(f"{file} deleted successfully.")

        print(f"All files in '{folder_path}' have been deleted.")
    except FileNotFoundError:
        print(f"Folder not found at path: {folder_path}")

    # Clear the contents of the specified file. Check for existence first:
    # opening a missing file in 'w' mode would silently create it instead
    # of raising FileNotFoundError.
    if os.path.isfile(file_to_clear):
        with open(file_to_clear, 'w', encoding='utf8') as clear_file:
            clear_file.truncate(0)
        print(f"Contents of '{file_to_clear}' cleared successfully.")
    else:
        print(f"{file_to_clear} not found.")


if __name__ == "__main__":
    folder_path = cfg.DB_FAISS_PATH
    file_to_clear = os.path.join(cfg.DATA_PATH, cfg.LOG_FILE)

    delete_files_and_clear_content(folder_path, file_to_clear)
```
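The script's delete-then-truncate behaviour can be rehearsed on throwaway files before pointing it at the real vector store. The `clear` helper below is a simplified, hypothetical stand-in for `delete_files_and_clear_content` (explicit paths, no config file), run against a temporary directory:

```python
import os
import tempfile

def clear(folder_path: str, file_to_clear: str) -> None:
    """Delete every file in folder_path, then truncate file_to_clear."""
    for name in os.listdir(folder_path):
        path = os.path.join(folder_path, name)
        if os.path.isfile(path):
            os.remove(path)
    if os.path.isfile(file_to_clear):
        # Opening in 'w' mode truncates the file to zero bytes
        open(file_to_clear, 'w').close()

with tempfile.TemporaryDirectory() as tmp:
    # Fake vector store with two stale index files, plus a non-empty log
    store = os.path.join(tmp, "vectorstore")
    os.mkdir(store)
    for name in ("index.faiss", "index.pkl"):
        with open(os.path.join(store, name), "w") as f:
            f.write("stale")
    log = os.path.join(tmp, "log.txt")
    with open(log, "w") as f:
        f.write("old entries")

    clear(store, log)
    print(sorted(os.listdir(store)), os.path.getsize(log))  # [] 0
```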