---
title: privateGPT Walkthrough
author: Aayush Agrawal
date: "2023-05-22"
categories: [NLP, Deep Learning, LLMs]
image: "blog_logo.png"
format:
    html:
        code-fold: false
        number-sections: true     
---

> A code walkthrough of [privateGPT](https://github.com/imartinez/privateGPT) repo on how to build your own offline GPT Q&A system.

Large Language Models (LLMs) have surged in popularity, pushing the boundaries of natural language processing. OpenAI's GPT-3.5 is a prime example, changing how we interact with technology and sparking innovation. In particular, LLMs are well suited to building question-answering applications over knowledge bases. In this blog post, we look at this week's top trending GitHub repository, the [privateGPT repository](https://github.com/imartinez/privateGPT), and walk through its code.

<figure align = "center">
    <img src="./privateGPT_trending.png" style="width:100%">
<figcaption align = "center">
        Fig. 1: Private GPT on GitHub's top trending chart
</figcaption>
</figure>

# What is privateGPT?

One of the primary concerns with online interfaces like OpenAI's ChatGPT or other Large Language Model systems is data privacy, data control, and potential data leakage. The [privateGPT repository](https://github.com/imartinez/privateGPT) presents a fully offline alternative for interacting with personal documents. It is built with open-source tools and technology, making it possible to use LLM capabilities without compromising data privacy or risking data leakage.

<figure align = "center">
    <img src="./privateGPT_githubSnapshot.png" style="width:80%">
<figcaption align = "center">
        Fig.2: [privateGPT](https://github.com/imartinez/privateGPT) on GitHub. At the time of writing, the repo had 19K+ stars and 2K+ forks.
</figcaption>
</figure>

# Running privateGPT locally 

To run privateGPT locally, users need to install the necessary packages, configure specific variables, and provide their knowledge base for question-answering purposes. Additional information on the installation process and usage can be found in the repository documentation or by referring to a [dedicated blog post on the topic](https://www.codingthesmartway.com/privategpt-the-ultimate-solution-for-offline-secure-language-processing-that-turns-your-pdfs-into-interactive-ai-dialogues/).

Essentially, you can run it by executing the `privateGPT.py` script:

```
python privateGPT.py
```

<figure align = "center">
    <img src="./demo_query.png" style="width:100%">
<figcaption align = "center">
        Fig.3: Invoking [privateGPT](https://github.com/imartinez/privateGPT) locally and asking a question.
</figcaption>
</figure>

The response also mentions the sources it looked up for context.

<figure align = "center">
    <img src="./demo_response.png" style="width:100%">
<figcaption align = "center">
        Fig.4: [privateGPT](https://github.com/imartinez/privateGPT) response.
</figcaption>
</figure>

# Code Walkthrough

[privateGPT](https://github.com/imartinez/privateGPT/blob/main/ingest.py) code comprises two pipelines:

1. **Ingestion Pipeline:** This pipeline is responsible for loading your documents, splitting them into chunks, and generating embeddings for them. The chunks and their embeddings are stored in an embedding database.

2. **Q&A Interface:** This interface accepts user prompts, the embedding database, and an open-source large language model (LLM) as inputs. It uses these inputs to generate responses to the user's queries.


## Ingestion Pipeline

Let's delve into the [ingestion pipeline](https://github.com/imartinez/privateGPT/blob/main/ingest.py) for a closer examination. The ingestion pipeline encompasses the following steps:

1. Identifying files with supported extensions and loading the whole knowledge base from the source directory.

2. Splitting the documents into smaller chunks based on the `chunk_size` and `chunk_overlap` parameters.

3. Initializing the `HuggingFaceEmbeddings` class of `langchain`. This involves loading a pre-trained embedding model from the `sentence_transformers` library.

4. Initializing the Chroma database from `langchain.vectorstores`. This step takes the chunked text and the initialized embedding model, embeds the chunks, and saves them in the embedding database on disk.

<figure align = "center">
    <img src="./ingest_pipeline.png" style="width:100%">
<figcaption align = "center">
        Fig.5: Ingestion Pipeline
</figcaption>
</figure>

Let's look at these steps one by one.

### Identifying and loading files from the source directory


First, we import the required libraries and various text loaders from `langchain.document_loaders`.

```python
import os
import glob
from typing import List
from multiprocessing import Pool
from tqdm import tqdm
from langchain.document_loaders import (
    CSVLoader,
    EverNoteLoader,
    PDFMinerLoader,
    TextLoader,
    UnstructuredEmailLoader,
    UnstructuredEPubLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    UnstructuredODTLoader,
    UnstructuredPowerPointLoader,
    UnstructuredWordDocumentLoader,
)
from langchain.docstore.document import Document
```

Next, we define the mapping between each file extension and its respective `langchain` document loader. You can read the document loader [documentation](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html) for more available loaders.

```python
# Map file extensions to document loaders and their arguments
LOADER_MAPPING = {
    ".csv": (CSVLoader, {}),
    ".doc": (UnstructuredWordDocumentLoader, {}),
    ".docx": (UnstructuredWordDocumentLoader, {}),
    ".enex": (EverNoteLoader, {}),
    ".epub": (UnstructuredEPubLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
    ".md": (UnstructuredMarkdownLoader, {}),
    ".odt": (UnstructuredODTLoader, {}),
    ".pdf": (PDFMinerLoader, {}),
    ".ppt": (UnstructuredPowerPointLoader, {}),
    ".pptx": (UnstructuredPowerPointLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
}
```

Next, we define our single document loader.

```python
def load_single_document(file_path: str) -> Document:
    # Find the extension of the file
    ext = "." + file_path.rsplit(".", 1)[-1]
    if ext in LOADER_MAPPING:
        # Find the appropriate loader class and arguments
        loader_class, loader_args = LOADER_MAPPING[ext]
        # Create an instance of the document loader
        loader = loader_class(file_path, **loader_args)
        # Return the loaded document
        return loader.load()[0]
    raise ValueError(f"Unsupported file extension '{ext}'")

git_dir = "../../../../privateGPT/"
loaded_document = load_single_document(git_dir+'source_documents/state_of_the_union.txt')
print(f'Type of loaded document {type(loaded_document)}')
loaded_document
```

```
Type of loaded document <class 'langchain.schema.Document'>

Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizen
```

The `load_single_document` function accomplishes the following steps:

1. Extracts the file extension from the given file path.
2. Retrieves the corresponding document loader and its arguments from the previously defined `LOADER_MAPPING` dictionary.
3. Creates an instance of the appropriate document loader.
4. Loads the document using the instantiated loader.
5. Returns the loaded document.
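
One detail of step 1 is worth a small illustration: the `rsplit`-based extraction is case-sensitive, and for a path without any dot it yields the whole path prefixed with a dot. The sketch below is not part of the repo. It uses a hypothetical helper, `normalized_extension`, and a toy mapping (`TOY_MAPPING`) in place of `LOADER_MAPPING`, and it assumes you would want `.TXT` and `.txt` treated alike.

```python
import os


def normalized_extension(file_path: str) -> str:
    """Return the lower-cased extension including the dot, or '' if there is none."""
    # os.path.splitext splits on the last dot of the final path component
    _, ext = os.path.splitext(file_path)
    return ext.lower()


# Toy stand-in for LOADER_MAPPING: extension -> label (hypothetical, for illustration)
TOY_MAPPING = {".txt": "text", ".pdf": "pdf"}


def pick_loader(file_path: str) -> str:
    """Look up the loader label for a file, mirroring the dispatch in load_single_document."""
    ext = normalized_extension(file_path)
    if ext not in TOY_MAPPING:
        raise ValueError(f"Unsupported file extension '{ext}'")
    return TOY_MAPPING[ext]


print(pick_loader("notes/Report.TXT"))
```

Whether to normalize case is a design choice; the repo's version keeps the lookup strict, which is also a reasonable default.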

We can see that `load_single_document` returns a document of type `langchain.schema.Document`, which according to the [documentation](https://docs.langchain.com/docs/components/schema/document) consists of `page_content` (the content of the data) and `metadata` (auxiliary pieces of information describing attributes of the data).

```python
def load_documents(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """
    Loads all documents from the source documents directory, ignoring specified files
    """
    all_files = []
    for ext in LOADER_MAPPING:
        # Find all files within the source directory that match the extensions in LOADER_MAPPING
        all_files.extend(
            glob.glob(os.path.join(source_dir, f"**/*{ext}"), recursive=True)
        )

    # Drop files from all_files that are listed in ignored_files
    filtered_files = [file_path for file_path in all_files if file_path not in ignored_files]

    # Spin up a pool of worker processes
    with Pool(processes=os.cpu_count()) as pool:
        results = []
        with tqdm(total=len(filtered_files), desc='Loading new documents', ncols=80) as pbar:
            # Load each document from the filtered files list using load_single_document
            for i, doc in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
                results.append(doc)
                pbar.update()

    return results
```

The `load_documents` function carries out the following steps:

1. Initializes an empty list called `all_files`.
2. For each extension in the `LOADER_MAPPING` dictionary, it searches for all the files with that extension in the source directory and adds them to the `all_files` list.
3. Creates a new list named `filtered_files` by removing the files listed in the `ignored_files` list from the `all_files` list.
4. Executes a parallel loading operation on all the files in the `filtered_files` list using the `load_single_document` function, and appends the results to the `results` list.
5. Returns the list of loaded documents.
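
The discovery and parallel-loading steps can be tried without `langchain` installed. The following is a minimal sketch under stated assumptions: `EXTENSIONS` is a toy subset of the mapping keys, `read_length` is a hypothetical stand-in for `load_single_document`, and a `ThreadPool` is swapped in for the process pool (it exposes the same `imap_unordered` method) so the example runs anywhere.

```python
import glob
import os
import tempfile
from multiprocessing.pool import ThreadPool

EXTENSIONS = [".txt", ".md"]  # toy subset standing in for the LOADER_MAPPING keys


def find_files(source_dir, ignored_files=()):
    """Recursively collect files with a known extension, minus the ignored ones."""
    found = []
    for ext in EXTENSIONS:
        pattern = os.path.join(source_dir, f"**/*{ext}")
        found.extend(glob.glob(pattern, recursive=True))
    return [path for path in found if path not in ignored_files]


def read_length(path):
    """Hypothetical stand-in for load_single_document: (file name, character count)."""
    with open(path, encoding="utf8") as handle:
        return os.path.basename(path), len(handle.read())


def load_all(source_dir, ignored_files=()):
    files = find_files(source_dir, ignored_files)
    # ThreadPool is used here instead of a process pool; the API is the same
    with ThreadPool(processes=2) as pool:
        # imap_unordered yields results in completion order, so sort for a stable view
        return sorted(pool.imap_unordered(read_length, files))


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "nested"))
    samples = [("a.txt", "hello"), (os.path.join("nested", "b.md"), "# hi"), ("c.log", "skip")]
    for name, body in samples:
        with open(os.path.join(root, name), "w", encoding="utf8") as handle:
            handle.write(body)
    print(load_all(root))
```

Because `imap_unordered` returns results as workers finish, the order of the loaded documents is not guaranteed to match the order of the file list.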

```python
loaded_documents = load_documents(git_dir+'source_documents')
print(f"Length of loaded documents: {len(loaded_documents)}")
loaded_documents[0]
```

```
Loading new documents: 100%|█████████████████████| 1/1 [00:00<00:00, 246.69it/s]

Length of loaded documents: 1

Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizen
```

You can see we have loaded the `state_of_the_union.txt` file from the [privateGPT repo](https://github.com/imartinez/privateGPT/tree/main/source_documents). As this is the only file in that directory, the length of loaded documents is one.

### Splitting the documents into smaller chunks

We have now seen how to load multiple documents with different extensions using the `load_documents` function. The next step is to look at the `process_documents` function, which loads large documents and splits them into smaller chunks.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap = 50

def process_documents(source_dir: str, ignored_files: List[str] = []) -> List[Document]:
    """
    Load documents and split in chunks
    """
    print(f"Loading documents from {source_dir}")
    documents = load_documents(source_dir, ignored_files)
    if not documents:
        print("No new documents to load")
        exit(0)
    print(f"Loaded {len(documents)} new documents from {source_dir}")
    # Load text splitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    # Split text
    texts = text_splitter.split_documents(documents)
    print(f"Split into {len(texts)} chunks of text (max. {chunk_size} tokens each)")
    return texts

processed_documents = process_documents(git_dir+'source_documents')
```

```
Loading documents from ../../../../privateGPT/source_documents

Loading new documents: 100%|█████████████████████| 1/1 [00:00<00:00, 315.74it/s]

Loaded 1 new documents from ../../../../privateGPT/source_documents
Split into 90 chunks of text (max. 500 tokens each)
```





The `process_documents` function performs the following steps:

1. Loads all the documents from the `source_dir` directory using the `load_documents` function.
2. Initializes an instance of `RecursiveCharacterTextSplitter` from the `langchain.text_splitter` module, providing the `chunk_size` and `chunk_overlap` parameters. This class is responsible for splitting a list of documents into smaller overlapping chunks. Note that with the splitter's default length function, `chunk_size` is measured in characters, even though the log message above says "tokens". See the [`RecursiveCharacterTextSplitter` documentation](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html).
3. Uses the `split_documents` method of the `RecursiveCharacterTextSplitter` instance to split the loaded documents into smaller chunks.
4. Returns the resulting list of the smaller document chunks.
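
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It is a plain fixed-size sliding window over characters, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraph and line boundaries); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of at most chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# 1200 characters of repeating alphabet as toy input
sample = "".join(chr(97 + i % 26) for i in range(1200))
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(chunk) for chunk in chunks])
# The tail of one chunk reappears at the head of the next
print(chunks[0][-50:] == chunks[1][:50])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.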

### Initializing the embedding model

Next, we load our embedding module which converts the smaller document chunks from previous steps to embeddings.

```python
from langchain.embeddings import HuggingFaceEmbeddings
EMBEDDINGS_MODEL_NAME = "all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL_NAME)

print("Testing on a single query.")
embedded_vector = embeddings.embed_query("What is your name?")
print(f"Size of embedded vector: {len(embedded_vector)}")
```

```
Testing on a single query.
Size of embedded vector: 384
```


The given code snippet carries out the following steps:

1. Imports the `HuggingFaceEmbeddings` class from the `langchain.embeddings` module. This class loads and wraps [SentenceTransformers](https://www.sbert.net/) embedding models, which generate dense vector representations of sentences. You can refer to the [HuggingFaceEmbeddings documentation](https://python.langchain.com/en/latest/modules/models/text_embedding/examples/sentence_transformers.html) for more details.
2. Loads the [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from the `sentence_transformers` library. This model is specifically designed to map sentences and paragraphs into a 384-dimensional dense vector space. It is commonly used for tasks such as semantic search and similarity analysis.

We can see that embedding a sample query returns a 384-dimensional vector.
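
Embedding vectors are useful because texts with similar meaning end up close together. One common way to measure that closeness is cosine similarity, sketched below with the standard library on made-up 3-dimensional vectors. Both the toy vectors and the choice of cosine similarity are assumptions for illustration; the walkthrough does not state which metric the vector database uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for 384-dimensional embeddings
query_vec = [0.9, 0.1, 0.0]
similar_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, similar_vec) > cosine_similarity(query_vec, unrelated_vec))
```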

### Embed smaller text and save it in the vector database

The next step involves utilizing the document chunks and the embedding model to store the documents and their corresponding embeddings in a vector database.

```python
from chromadb.config import Settings
from langchain.vectorstores import Chroma

PERSIST_DIRECTORY = git_dir+"db"
# Define the Chroma settings
CHROMA_SETTINGS = Settings(
        chroma_db_impl='duckdb+parquet',
        persist_directory=PERSIST_DIRECTORY,
        anonymized_telemetry=False
)
# Create the embedding database
db = Chroma.from_documents(processed_documents, embeddings, persist_directory=PERSIST_DIRECTORY, client_settings=CHROMA_SETTINGS)
db.persist()
```

```
Using embedded DuckDB with persistence: data will be stored in: ../../../../privateGPT/db
```


The given code snippet performs the following operations:

1. It imports the `Settings` class from the `chromadb.config` module and the `Chroma` class from the `langchain.vectorstores` module.
2. It creates an instance of the `Settings` class named `CHROMA_SETTINGS`, providing several configuration parameters:
   - `chroma_db_impl` is set to `'duckdb+parquet'`, specifying the implementation to be used for the Chroma vector database.
   - `persist_directory` is set to the `PERSIST_DIRECTORY` variable defined earlier, indicating the directory where the vector database will be saved.
   - `anonymized_telemetry` is set to `False`, which disables the collection of anonymized telemetry data.
3. It creates a vector database by calling the `Chroma.from_documents()` method. This method takes the following arguments:
   - `processed_documents`: The list of processed documents obtained from the previous step.
   - `embeddings`: The embeddings object/model used to generate the document embeddings.
   - `persist_directory`: The directory where the vector database will be persisted, specified by the `PERSIST_DIRECTORY` variable.
   - `client_settings`: The settings object (`CHROMA_SETTINGS`) containing configuration parameters for the vector database.
4. It calls `db.persist()` to write the index to disk for future retrieval tasks.

```python
# Test the semantic retrieval
db.similarity_search(query="What is the American Rescue Plan?", k=4)
```

[Document(page_content='The American Rescue Plan gave schools money to hire teachers and help students make up for lost learning.  \n\nI urge every parent to make sure your school does just that. And we can all play a part—sign up to be a tutor or a mentor. \n\nChildren were also struggling before the pandemic. Bullying, violence, trauma, and the harms of social media.', metadata={'source': '../../../../privateGPT/source_documents/state_of_the_union.txt'}),
 Document(page_content='It fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans.  \n\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \n\nAnd as my Dad used to say, it gave people a little breathing room. \n\nAnd unlike the $2 Trillion tax cut passed in the previous administration that benefitted the top 1% of Americans, the American Rescue Plan helped working people—and left no one behind.', metada

```python
db = None
```

To test semantic retrieval, we use the `similarity_search` method shown above. It takes a text query as input and returns the top `k` most similar document chunks from the vector database (here, `k=4`).
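
Conceptually, a top-`k` search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below shows that idea as a brute-force scan over a tiny in-memory index. It is an illustration only: the labels, the 2-dimensional vectors, the Euclidean metric, and the `top_k` helper are all assumptions, and a real vector store may use a different metric or an approximate index.

```python
import heapq
import math


def top_k(query_vec, index, k=4):
    """Return the texts of the k entries in index whose vectors are nearest to query_vec.

    index is a list of (text, vector) pairs; distance is Euclidean.
    """
    def distance(item):
        _, vec = item
        return math.dist(query_vec, vec)

    return [text for text, _ in heapq.nsmallest(k, index, key=distance)]


# Made-up 2-dimensional "embeddings" for four chunk labels
toy_index = [
    ("schools", [0.9, 0.1]),
    ("vaccines", [0.7, 0.3]),
    ("taxes", [0.1, 0.9]),
    ("energy", [0.2, 0.8]),
]

print(top_k([1.0, 0.0], toy_index, k=2))
```

With the query vector `[1.0, 0.0]`, the two nearest entries are `"schools"` and `"vaccines"`, in that order.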

## Question & Answer Interface

Let's explore the [Q&A interface](https://github.com/imartinez/privateGPT/blob/main/privateGPT.py) in more detail. The Q&A interface consists of the following steps:

1. Load the vector database and prepare it for the retrieval task.
2. Load a pre-trained large language model from [LlamaCpp](https://github.com/ggerganov/llama.cpp) or [GPT4All](https://github.com/nomic-ai/gpt4all).
3. Prompt the user for a query and generate a response using the `RetrievalQA` pipeline from `langchain.chains`.


<figure align = "center">
    <img src="./q_a_pipeline.png" style="width:100%">
<figcaption align = "center">
        Fig.6: Question Answering Pipeline
</figcaption>
</figure>

Let’s look at these steps one by one.

### Load the vector database

First, we import the required libraries and load the persisted vector database.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from chromadb.config import Settings
git_dir = "../../../../privateGPT/"
PERSIST_DIRECTORY = git_dir+"db"
EMBEDDINGS_MODEL_NAME = "all-MiniLM-L6-v2"

# Define the Chroma settings
CHROMA_SETTINGS = Settings(
        chroma_db_impl='duckdb+parquet',
        persist_directory=PERSIST_DIRECTORY,
        anonymized_telemetry=False
)

embeddings = HuggingFaceEmbeddings(model_name=EMBEDDINGS_MODEL_NAME)
db = Chroma(persist_directory=PERSIST_DIRECTORY, embedding_function=embeddings, client_settings=CHROMA_SETTINGS)
retriever = db.as_retriever()
```

```
Using embedded DuckDB with persistence: data will be stored in: ../../../../privateGPT/db
```


The given code snippet carries out the following steps:

1. Loads the embedding model using the `HuggingFaceEmbeddings` class, the same model that was used to create the embedding store.
2. Instantiates a Chroma vector database from the persisted directory that was created earlier.
3. Wraps the vector database in a retriever by calling `as_retriever()`.

```python
# Testing the retriever
retriever.vectorstore.similarity_search(query="What is the American Rescue Plan?")
```

[Document(page_content='The American Rescue Plan gave schools money to hire teachers and help students make up for lost learning.  \n\nI urge every parent to make sure your school does just that. And we can all play a part—sign up to be a tutor or a mentor. \n\nChildren were also struggling before the pandemic. Bullying, violence, trauma, and the harms of social media.', metadata={'source': '../../../../privateGPT/source_documents/state_of_the_union.txt'}),
 Document(page_content='It fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans.  \n\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \n\nAnd as my Dad used to say, it gave people a little breathing room. \n\nAnd unlike the $2 Trillion tax cut passed in the previous administration that benefitted the top 1% of Americans, the American Rescue Plan helped working people—and left no one behind.', metada

### Load a pre-trained large language model

```python
from langchain.llms import GPT4All

MODEL_PATH = git_dir+"models/ggml-gpt4all-j-v1.3-groovy.bin"
MODEL_N_CTX = 1000

# Prepare the LLM
llm = GPT4All(model=MODEL_PATH, n_ctx=MODEL_N_CTX, backend='gptj', callbacks=None, verbose=False)
```

```
gptj_model_load: loading model from '../../../../privateGPT/models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
```


The code snippet above creates an instance of the `GPT4All` class named `llm`, which represents the large language model (LLM) backed by a [GPT4All](https://github.com/nomic-ai/gpt4all) model. The `GPT4All` constructor takes the following arguments:

- `model`: The path to the GPT4All model file, specified by the `MODEL_PATH` variable.
- `n_ctx`: The context size, or maximum length of input sequences, specified by the `MODEL_N_CTX` variable.
- `backend`: The backend to use for the LLM. In this case, it is set to `'gptj'`.
- `callbacks`: The callbacks to be used during LLM execution. In this case, it is set to `None`.
- `verbose`: A boolean flag indicating whether to print verbose output during LLM execution. In this case, it is set to `False`.

### Prompt the user with a query and generate a response

```python
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True)
query = "What is American rescue plan?"

# Get the answer from the chain
res = qa(query)
answer, docs = res['result'], res['source_documents']

# Print the result
print("\n\n> Question:")
print(query)
print("\n> Answer:")
print(answer)

# Print the relevant sources used for the answer
for document in docs:
    print("\n> " + document.metadata["source"] + ":")
    print(document.page_content)
```



> Question:
What is American rescue plan?

> Answer:
 The American Rescue Plan is a program that provides funding to schools to hire teachers and help students make up for lost learning due to the COVID-19 pandemic. It also provides economic relief for tens of millions of Americans by helping them put food on their table, keep a roof over their heads, and cut the cost of health insurance. The plan also helps working people by providing breathing room and giving them a little breathing room. It is a program that helps millions of families on Affordable Care Act plans save $2,400 a year on their health care premiums and combat climate change by cutting energy costs for families an average of $500 a year.

> ../../../../privateGPT/source_documents/state_of_the_union.txt:
The American Rescue Plan gave schools money to hire teachers and help students make up for lost learning.  

I urge every parent to make sure your school does just that. And we can all play a part—sign up to be a tutor 

Firstly, an instance of the `RetrievalQA` class named `qa` is created using the `from_chain_type` method. The `RetrievalQA` class is a chain specifically designed for question-answering tasks over an index. Please refer to the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html) for further details. The `from_chain_type` method takes the following arguments:

- `llm`: The Language Model instance (`llm`) that was created previously.
- `chain_type`: A string representing the type of chain to be used. In this case, it is set to `"stuff"`, which places all retrieved chunks into a single prompt. Other chain types are available for question answering; please consult the [documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/question_answering.html) for more information.
- `retriever`: The retriever created from the Chroma vector database, used to fetch relevant document chunks for the given query.
- `return_source_documents`: A boolean flag indicating whether to return the source documents along with the answer. In this case, it is set to `True`.

Next, the `qa` instance is used to process a query. The chain retrieves the relevant chunks, passes them to the LLM as context, and returns a response that includes the query, the answer (`result`), and the source documents (`source_documents`) used as context for generating the answer.

Finally, the answer and source documents are printed out for display.
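
To give a feel for what the `"stuff"` chain type does, the sketch below concatenates the retrieved chunks and the question into one prompt string. It is a simplified illustration: the helper name `build_stuff_prompt` and the template wording are assumptions, not the exact template `langchain` ships.

```python
def build_stuff_prompt(question: str, chunks: list) -> str:
    """Place every retrieved chunk into a single prompt, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened stand-ins for the chunks a retriever might return
retrieved = [
    "The American Rescue Plan gave schools money to hire teachers.",
    "It delivered immediate economic relief for tens of millions of Americans.",
]
print(build_stuff_prompt("What is American rescue plan?", retrieved))
```

Since every chunk lands in the same prompt, the combined length of the retrieved chunks plus the question has to fit within the model's context size.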

# Conclusion

In this blog post, we explored privateGPT and walked through the code for its ingestion pipeline and Q&A interface. I hope it has been valuable in understanding how privateGPT is implemented, and I encourage you to try privateGPT on your own knowledge base.

I hope you enjoyed reading it. If you have any feedback on the code or the blog post, feel free to comment below or reach out on [LinkedIn](https://www.linkedin.com/in/aayushmnit/).