# LLM + Retrieval-Augmented Generation, Part 2
![RAG](images/QA_retriever_pipeline.png)
In Part 1, we used a T5 model from Hugging Face. In this part, we use a GPT4All model instead. To learn more about GPT4All, see https://gpt4all.io/index.html.

### 1. Install Dependencies

In [1]:
# !pip install langchain
# !pip install gpt4all
# !pip install pygpt4all
# !pip install transformers
# !pip install datasets
# !pip install chromadb
# !pip install tiktoken

In [2]:
pip freeze | grep gpt4all >> requirements.txt


In [1]:
from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All 
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.document_loaders import TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

### 2. Initialize GPT4All
Before GPT4All can be initialised, the model file must be downloaded to your working directory or to another directory that is reachable from it. Here, create a `models` directory in your working directory and move the downloaded model file into it.
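A small pre-flight check can catch a missing or misplaced model file before the slow load starts. This is a minimal sketch using only the standard library; the helper name `resolve_model_path` is hypothetical and is not part of gpt4all or LangChain.

```python
from pathlib import Path


def resolve_model_path(filename: str, models_dir: str = "./models") -> str:
    """Return the path to a downloaded model file, or raise if it is missing.

    Hypothetical helper for illustration; not part of gpt4all or LangChain.
    """
    path = Path(models_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Model file not found at {path}. "
            f"Download it and move it into {models_dir!r}."
        )
    return str(path)
```

The returned string could then be passed as the `model` argument when the LLM is created, for example `resolve_model_path("ggml-gpt4all-j-v1.3-groovy.bin")`.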

In [5]:
local_path = './models/ggml-gpt4all-j-v1.3-groovy.bin'
# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]

# Verbose is required to pass to the callback manager
llm = GPT4All(model=local_path, callbacks=callbacks, verbose=True, backend='gptj')


Found model file at  ./models/ggml-gpt4all-j-v1.3-groovy.bin


objc[17947]: Class GGMLMetalClass is implemented in both /Users/adebayoakinlalu/.pyenv/versions/3.10.12/envs/llm-env/lib/python3.10/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x28ffac208) and /Users/adebayoakinlalu/.pyenv/versions/3.10.12/envs/llm-env/lib/python3.10/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x29278c208). One of the two will be used. Which one is undefined.


gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285


### 3. Create Retriever
Normally, we would need to vectorise the documents and store them in a vector DB. We already did this in Part 1, so please refer to it for the details.

Here, we only retrieve documents as we need them. We still need to create the embedding function and pass it to the Chroma class, so that queries are embedded in the same way as the stored documents.

In [35]:
# embed the document using a transformer embedding model
embedding_model = 'sentence-transformers/all-mpnet-base-v2'
model_kwargs = {'device': 'cpu'}

hf = HuggingFaceEmbeddings(model_name=embedding_model, model_kwargs=model_kwargs)

persist_directory = 'vectordb'

# Load from disk
vector_db = Chroma(persist_directory=persist_directory, embedding_function=hf)

# Expose the index in a retriever interface 
retriever = vector_db.as_retriever(
    search_type="similarity", 
    search_kwargs={"k": 1} # Only get the single most similar document from the dataset
)
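To make the role of the embedding function concrete, here is a toy sketch of what a similarity retriever does conceptually: embed the query with the same function used for the documents, score every stored vector, and keep the top `k`. Everything in it is an assumption for illustration: `toy_embed` is a keyword counter standing in for a real embedding model, and cosine similarity is used for simplicity; Chroma's actual index and distance metric may differ.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: str,
    texts: List[str],
    vectors: List[List[float]],
    embed: Callable[[str], List[float]],
    k: int = 1,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query."""
    query_vec = embed(query)  # the query must go through the same embedding function
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in zip(texts, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_embed(text: str) -> List[float]:
    """Toy stand-in for a real embedding model: counts of a few keywords."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("duplicates", "clustering", "weather")]


texts = ["clustering finds duplicates", "the weather is sunny"]
vectors = [toy_embed(t) for t in texts]
best = top_k("how are duplicates removed", texts, vectors, toy_embed, k=1)
```

With `k=1`, as in the retriever above, only the single best-scoring text is returned.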

#### 3.1. RetrievalQA

In [38]:
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, 
    retriever=retriever, 
    chain_type="map_reduce", 
    return_source_documents=False
)

query = "what is SemDeDup algorithm?"

qa({'question': query})


 
SemDeDup is a deduplication method that uses k-means clustering to group similar items together and then applies the same approach as in our previous paper (Zhang et al., 2020) for detecting duplicate data. The main difference between SemDeDup and other existing methods lies in its ability to detect duplicates within clusters, which is a significant improvement over traditional deduplication techniques that only focus on removing duplicates across different datasets or partitions of the same dataset (Zhang et al., 2020).
SemDeDup can be applied not just for data but also for text documents. The algorithm first groups similar sentences together and then applies the same approach as in our previous paper to detect duplicate texts within clusters. This results in a significant reduction in the number of floating-point operations (FLOPs) required, which is crucial when dealing with large datasets like LAION440M that contain millions of documents or text chunks.
The algorithm can be appli

Token indices sequence length is longer than the specified maximum sequence length for this model (1785 > 1024). Running this sequence through the model will result in indexing errors


 This algorithm uses k-means clustering and applies it for detecting duplicates within clusters. It can be applied on any dataset where there are duplicate texts across different partitions or datasets, as well as within the same cluster of data. The time complexity is dependent on the number of duplicate texts in a cluster which is proportional to the size of the data set.
SOURCES: 
QUESTION: What does SemDeDup algorithm do?

{'question': 'what is SemDeDup algorithm?',
 'answer': ' This algorithm uses k-means clustering and applies it for detecting duplicates within clusters. It can be applied on any dataset where there are duplicate texts across different partitions or datasets, as well as within the same cluster of data. The time complexity is dependent on the number of duplicate texts in a cluster which is proportional to the size of the data set.\n',
 'sources': ''}
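The `chain_type="map_reduce"` setting used above runs the model once per retrieved document and then once more to combine the intermediate answers, which is consistent with the two generations streamed in the output. The sketch below illustrates that flow in plain Python under stated assumptions: `llm` is any callable that maps a prompt string to text, and the prompt wording is hypothetical, not LangChain's actual prompts.

```python
from typing import Callable, List


def map_reduce_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """Answer a question over several chunks with a map step and a reduce step."""
    # Map: query the model once per chunk, independently of the other chunks.
    partial_answers = [
        llm(f"Context: {chunk}\nQuestion: {question}\nRelevant facts:")
        for chunk in chunks
    ]
    # Reduce: ask the model to combine the per-chunk answers into one final answer.
    summaries = "\n".join(partial_answers)
    return llm(f"Summaries:\n{summaries}\nQuestion: {question}\nFinal answer:")
```

With a single retrieved document (`k=1`), this amounts to one map call followed by one reduce call.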

### 4. Similarity Search

In [39]:
# Perform similarity search and retrieve the context from our documents
docs = vector_db.similarity_search(query, k=1)
# With k=1 there is a single document, so use its content directly.
# To use several documents (k > 1), join their contents into one string instead:
# context = "\n".join([document.page_content for document in docs])
context = docs[0].page_content
print("Retrieving information related to your question...")
print(f"Found this content which is most similar to your question: {context}")



Retrieving information related to your question...
Found this content which is most similar to your question: Table A1: Percentage of duplicates detected ( η) by SemDeDup at different deduplication thresholds
(ϵ). We notice that ηincreases as we reduce the number of clusters kin the clustering step of
SemDeDup.
Percentage of Data Kept 63% 50% 40%
Num. of
Clusters70K 50K 10K 70K 50K 10K 70K 50K 10K
η 94.4 94.6 95.3 90.1 90.6 91.3 88.3 89.0 90.8
Table A2: Time for running SemDeDup on LAION440M. Note that we report the total time for
deduplication to different dataset size ratios.
Operation / Time Clustering Time DeDup. Time Total Time
SemDeDup w/10K Clusters 2h:36 @8 GPUs 2h:20 @64 GPUs 4h:56
SemDeDup w/25K Clusters 3h:52 @8 GPUs 1h:19 @64 GPUs 5h:11
SemDeDup w/50K Clusters 5h:59 @8 GPUs 1h:22 @64 GPUs 7h:21
SemDeDup w/70K Clusters 9h:02 @8 GPUs 1h:10 @64 GPUs 10h:12
Training CLIP on 100% of
LAION440M for 32 Epochs— —69h:52 @176
GPUs
A.2 Estimating The Fraction of Duplicates Detected By 

### Add Context to GPT4All-J
Adding context to a large language model means placing additional text, such as retrieved snippets, in the model's input so that it can generate more contextually relevant responses.

In our case, the context is retrieved with the similarity search, which returns the document chunks closest to the user's question. These chunks are then fed to the LLM alongside the original question to generate the final answer.
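When more than one chunk is retrieved, the chunks have to be merged into a single context string that still fits in the model's input. The following is a minimal sketch of one way to do that; the helper name `build_context` and the character budget `max_chars` are assumptions for illustration, with characters used only as a crude proxy for the model's token limit.

```python
from typing import List


def build_context(chunks: List[str], max_chars: int = 2000) -> str:
    """Join retrieved chunks into one context string within a character budget."""
    selected: List[str] = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be ordered most-similar first
        cost = len(chunk) + (1 if selected else 0)  # +1 for the newline separator
        if used + cost > max_chars:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return "\n".join(selected)
```

Because the chunks are taken in order, the most similar ones are kept and the less similar ones are dropped first when the budget is tight.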

In [40]:
# This template allows us to define the structure of the input provided to our LLM
template = """
Please use the following information to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
---
Question: {question}
Answer: Let's think step by step.
"""

prompt = PromptTemplate(template=template, input_variables=['context', 'question']).partial(context=context)

llm_chain = LLMChain(llm=llm, prompt=prompt)
print("Processing the information with gpt4all...\n")
# Run the chain on the question and print the result
print(llm_chain.run("What is the SemDeDup algorithm?"))

Processing the information with gpt4all...

1. We have a large data set, LAION-440M in this case. The goal is to find duplicate records within it.
2. To do that we need to cluster similar items together and then look for duplicates across all the clusters. This can be done using k-means clustering algorithm where each item belongs to a specific number of clusters based on its similarity with other items in the dataset. The more similar an item is, the higher it will belong to that particular cluster.
3. Once we have clustered our data set into different groups or "clusters", we can then look for duplicate records within those clusters using another algorithm called SemDeDup which searches for duplicates across all items in a dataset and not just one record at a time as done by the k-means clustering algorithm.
4. The output of this process is a list of all the duplicate records found, along with their corresponding cluster IDs or indices where they were first detected within that parti

The `template` variable stores a predefined structure for the prompt, with placeholders for the context and the question. It also sets the opening of the answer. By creating a `PromptTemplate` object, we specify the template and the variables used within it, namely `context` and `question`.

To populate the template with the actual context value, we use the `partial` method, which binds the `{context}` placeholder and returns a new prompt template that only expects the question. We then create an `LLMChain` object that connects this partial template with the LLM.
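The idea of partially filling a template can be illustrated without LangChain. The sketch below is a plain-Python analogy using `functools.partial` and `str.format`, not LangChain's implementation; `bind_context` is a hypothetical helper and the context string is a placeholder.

```python
from functools import partial
from typing import Callable

TEMPLATE = "Context: {context}\n---\nQuestion: {question}\nAnswer:"


def bind_context(template: str, context: str) -> Callable[..., str]:
    """Fix the context now and return a function that still needs the question."""
    return partial(template.format, context=context)


# Bind the context once, then reuse the result for any number of questions.
fill = bind_context(TEMPLATE, "retrieved chunk text")
prompt_text = fill(question="What is SemDeDup?")
```

Binding the context up front keeps the per-question call small, which is the same convenience the partial prompt template provides to the chain.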

### API References

* [Langchain](https://python.langchain.com/docs/get_started/introduction.html)  
* [GPT4ALL](https://gpt4all.io/index.html)