# LLM + Retrieval Augmented Generation
![RAG](../images/QA_retriever_pipeline.png)
Imagine you have research papers in PDF format that you want to consult by asking questions. Retrieval Augmented Generation (RAG) makes this possible: it is the process of fetching up-to-date or context-specific data from an external database and making it available to an LLM when asking it to generate a response.

To be able to do this, we need an open source language model, a vector database and a composer. Fortunately, there are freely available open source python libraries to create this solution. For simplicity, we will use the following:

* Pre-trained T5 model from Huggingface as LLM
* ChromaDB as vector database 
* Langchain as application tools.
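
Before wiring these libraries together, the retrieve-then-generate idea can be sketched in plain Python. This is only a toy illustration under stated assumptions: the helper names (`tokenize`, `retrieve`, `build_prompt`) and the two sample passages are made up for this example, word overlap stands in for a real vector search, and no LLM is actually called.

```python
# Toy illustration of retrieve-then-generate; no real LLM or vector database is involved.

def tokenize(text):
    """Lowercase a string and split it into a set of whitespace-separated words."""
    return set(text.lower().split())

def retrieve(question, passages, k=1):
    """Return the k passages that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_passages):
    """Place the retrieved passages ahead of the question, as a RAG prompt would."""
    context = "\n".join(context_passages)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical passages, written only for this example.
passages = [
    "SemDeDup removes semantic duplicates from web-scale datasets.",
    "Gist tokens compress prompts for language models.",
]
question = "What does SemDeDup remove?"
prompt = build_prompt(question, retrieve(question, passages, k=1))
print(prompt)
```

In the rest of the article, the word-overlap ranking is replaced by an embedding model plus a vector database, and the prompt is sent to a real model.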

### Using Chromadb, Huggingface and Langchain

I am fond of downloading AI research papers from arXiv. Sometimes I have time to read them and sometimes I don't. In this article, I will demonstrate how to access relevant information from these documents by asking questions. This task is known as Question Answering.

In addition, the article serves as an introduction to the chromadb, Huggingface transformers and langchain libraries.

### 1.  Install transformers, chromadb and langchain
Other libraries that may be required include:

* pypdf
* unstructured
* tabulate
* pdf2image
* unstructured[local-inference]


In [2]:
# !pip install --upgrade  transformers
# !pip install --upgrade  datasets
# !pip install --upgrade  chromadb
# !pip install --upgrade  langchain
# !pip install pypdf
# !pip install unstructured 
# !pip install tabulate
# !pip install pdf2image
# !pip install unstructured[local-inference]
# !pip install rich 

### 2. Load Model Architecture
T5 and its variants are amazing text to text generation models. For this purpose, we will use google/flan-t5-base. We will use transformers to load the model.

In [2]:
import torch 
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

architecture = 'google/flan-t5-base'

#'MBZUAI/LaMini-T5-223M'

tokenizer = T5Tokenizer.from_pretrained(architecture)

model = T5ForConditionalGeneration.from_pretrained(architecture)

pipe = pipeline('text2text-generation', 
                model=model, 
                tokenizer=tokenizer,
                max_length=100,
                temperature=0,
                top_p=0.95,
                repetition_penalty=1.2)

# llm = HuggingFacePipeline(pipeline=pipe)


def load_llm(pipe):
    llm = HuggingFacePipeline(pipeline=pipe)
    return llm

llm = load_llm(pipe)



### 3.  Load and process documents 
For simplicity, we will vectorise four AI research publications from arXiv. The documents cover the following topics:
* SemDeDup: Data-efficient learning at web-scale through semantic deduplication
* DETRs Beat YOLOs on Real-time Object Detection
* Low-code LLM: Visual Programming over LLMs
* Learning to Compress Prompts with Gist Tokens

In [3]:
import os 
from langchain.document_loaders import PyPDFLoader, DirectoryLoader

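# List the files in the local 'avixs' folder
pdfs = os.listdir('avixs')

# Create one loader per PDF
loaders = [PyPDFLoader(f'avixs/{pdf}') for pdf in pdfs]

# Load every PDF; each page becomes one document
documents = []
for loader in loaders:
    documents.extend(loader.load())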

Alternatively:

In [4]:
def load_documents(path, cls):
    return DirectoryLoader(path, loader_cls=cls, show_progress=True).load()


documents = load_documents('avixs', PyPDFLoader)

  0%|          | 0/4 [00:00<?, ?it/s]

100%|██████████| 4/4 [00:02<00:00,  1.71it/s]


### 4. Use Question Answering (QA) Methods to query the documents
Langchain provides several methods to perform question answering tasks. We will cover some of them here. Another thing to consider is the chain type, which determines how much document text is passed to the LLM each time we ask a question. Langchain has four chain types:

* `stuff` is the default chain type; it places ALL of the text from the documents in a single prompt.  

* `map_reduce` separates the text into batches, feeds each batch with the question to the LLM separately, and generates the final answer from the answers to each batch.  

* `refine` separates the text into batches, feeds the first batch to the LLM, then feeds the resulting answer together with the next batch, and so on. It refines the answer by going through all batches.  

* `map_rerank` separates the text into batches, feeds each batch to the LLM, returns a score for how fully each answer addresses the question, and picks the final answer from the highest-scoring batch answers.
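
The difference between the first three chain types is mostly a matter of how many times the model is called and what each call sees. The following is a minimal sketch of that control flow, not Langchain's implementation: `CountingLLM` is a hypothetical stand-in for a real model, the prompt wording is invented, and `map_rerank` is left out because it needs a model that returns a score.

```python
class CountingLLM:
    """Hypothetical stand-in for a model: counts calls instead of generating text."""
    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer {self.calls}"

def stuff_chain(llm, docs, question):
    """One call: every document is placed in a single prompt."""
    return llm("\n".join(docs) + f"\nQuestion: {question}")

def map_reduce_chain(llm, docs, question):
    """One call per document, then one more call to combine the partial answers."""
    partial = [llm(f"{doc}\nQuestion: {question}") for doc in docs]
    return llm("\n".join(partial) + f"\nQuestion: {question}")

def refine_chain(llm, docs, question):
    """Answer from the first document, then revise that answer with each later one."""
    answer = llm(f"{docs[0]}\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context: {doc}\nQuestion: {question}")
    return answer

docs = ["batch one", "batch two", "batch three"]
counts = {}
for name, chain in [("stuff", stuff_chain), ("map_reduce", map_reduce_chain), ("refine", refine_chain)]:
    llm = CountingLLM()
    chain(llm, docs, "what is SemDeDup algorithm?")
    counts[name] = llm.calls
print(counts)  # {'stuff': 1, 'map_reduce': 4, 'refine': 3}
```

With a small context window such as T5's, `stuff` is the cheapest option but the first to overflow, which is one reason to prefer `map_reduce` when many documents are involved.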


#### 4.1. Simplest QA - load_qa_chain
`load_qa_chain` is Langchain's simplest method for answering questions. It loads a chain that performs QA over the input documents.

In [5]:
from langchain.chains.question_answering import load_qa_chain 

chain = load_qa_chain(llm=llm, chain_type="map_reduce")

query = "what is SemDeDup algorithm?"

chain.run(input_documents=documents, question=query)

Token indices sequence length is longer than the specified maximum sequence length for this model (1059 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2600 > 1024). Running this sequence through the model will result in indexing errors


'SemDeDup is a deduplication algorithm for deduplicating embeddings.'

#### 4.2 Use Vector Database 
The objective here is to leverage a vector database to enhance the QA process. First, we need to vectorise the documents with an embedding model and store the vectors in a vector database, which in our case is Chroma DB.
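
To make "vectorise and store" concrete, here is a toy in-memory version of the idea. It is a sketch under loud assumptions: `embed` uses word counts instead of a neural embedding model, `ToyVectorStore` is a hypothetical class that only mimics the add-then-search pattern, and the sample texts are shortened paper titles.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a sparse word-count vector (a real model returns dense floats)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks them by cosine similarity."""
    def __init__(self):
        self.rows = []

    def add(self, texts):
        self.rows.extend((t, embed(t)) for t in texts)

    def similarity_search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add([
    "semantic deduplication of web data",
    "real-time object detection",
    "compress prompts with gist tokens",
])
print(store.similarity_search("object detection in real-time", k=1))
# ['real-time object detection']
```

A real vector database does the same two things at scale: it stores one vector per chunk and returns the chunks whose vectors are closest to the query vector.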

We will use a sentence transformer embedding model from Huggingface.

In [8]:
# pip install sentence_transformers

In [10]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Named model_name so it does not overwrite the T5 `model` loaded earlier
model_name = 'sentence-transformers/all-mpnet-base-v2'
model_kwargs = {'device': 'cpu'}

# sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
# embedding_hf = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

def call_embedding_hf(model_name, model_kwargs):
    return HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

embedding_hf = call_embedding_hf(model_name, model_kwargs)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


`RetrievalQA` provides a better solution than `load_qa_chain`. It first retrieves the relevant chunks of text from the vector database and passes only those chunks to the LLM.

Before we can use `RetrievalQA` from langchain, we need to use `CharacterTextSplitter` from langchain to split the documents into smaller chunks.

In [12]:
from langchain.vectorstores import Chroma 
from langchain.text_splitter import CharacterTextSplitter

# split the documents into smaller chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Set persist location for the vector store
persist_directory = 'vectordb' 

# Create the vectorstore to store the embeddings and use as a search index
vector_db = Chroma.from_documents(documents=texts, embedding=embedding_hf,  persist_directory=persist_directory)
# vectordb =Chroma(persist_directory=persist_directory, embedding_function=embedding_hf)

# Persist the vectorstore to disk
vector_db.persist()
# vector_db = None 

# Expose the index in a retriever interface 
retriever = vector_db.as_retriever(
    search_type="similarity", 
    search_kwargs={"k": 2}
)
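
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding-window splitter. This is a sketch only: the hypothetical `split_text` below cuts fixed-width character windows, whereas the real `CharacterTextSplitter` first splits on a separator and then merges the pieces up to `chunk_size`, so its chunk boundaries will differ.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into character windows of at most chunk_size, overlapping by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

print(split_text("abcdefghij", chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))  # ['abcd', 'defg', 'ghij']
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when an answer straddles a chunk boundary, at the cost of storing slightly more text.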

How documents are retrieved from the vector store depends on the specific task.

There are two main ways to retrieve documents relevant to a query: Similarity Search and Max Marginal Relevance Search (MMR Search). Similarity Search is the default, but you can use MMR by setting the `search_type` parameter:

`vector_db.as_retriever(search_type="mmr")`

The retriever arguments include:

* `k` defines how many documents are returned; defaults to 4.
* `score_threshold` allows you to set a minimum relevance for documents returned by the retriever, if you are using the "similarity_score_threshold" search type.
* `fetch_k` determines the number of documents to pass to the MMR algorithm; defaults to 20.
* `lambda_mult` controls the diversity of results returned by the MMR algorithm, with 1 being minimum diversity and 0 being maximum. Defaults to 0.5.
* `filter` allows you to define a filter on what documents should be retrieved, based on the documents' metadata. This has no effect if the Vectorstore doesn't store any metadata.

**Some examples of how these parameters can be used:**  
Retrieve more documents with higher diversity- useful if your dataset has many similar documents  
`vector_db.as_retriever(search_type="mmr", search_kwargs={'k': 6, 'lambda_mult': 0.25})`

Fetch more documents for the MMR algorithm to consider, but only return the top 5  
`vector_db.as_retriever(search_type="mmr", search_kwargs={'k': 5, 'fetch_k': 50})`

Only retrieve documents that have a relevance score above a certain threshold  
`vector_db.as_retriever(search_type="similarity_score_threshold", search_kwargs={'score_threshold': 0.8})`

Only get the single most similar document from the dataset  
`vector_db.as_retriever(search_kwargs={'k': 1})`

Use a filter to only retrieve documents from a specific paper  
`vector_db.as_retriever(search_kwargs={'filter': {'paper_title':'GPT-4 Technical Report'}})`
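
How `k`, `fetch_k` and `lambda_mult` interact in MMR is easier to see in a small worked example. The function below is a minimal sketch of greedy maximal marginal relevance written for this article, not the code Langchain runs; the two-dimensional vectors are made up so the arithmetic can be checked by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=4, fetch_k=20, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by maximal marginal relevance."""
    # Step 1: keep only the fetch_k documents most similar to the query.
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance/novelty trade-off.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr(query, docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches plain similarity search; lowering it to 0.3 makes the near-duplicate document 1 lose to the less similar but more novel document 2.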



#### 4.2.1. RetrievalQA

In [14]:
from langchain.chains import RetrievalQA 

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever, 
    chain_type="map_reduce", 
    return_source_documents=False
)

query = "what is SemDeDup algorithm?"

qa({'query': query})

Token indices sequence length is longer than the specified maximum sequence length for this model (1628 > 1024). Running this sequence through the model will result in indexing errors


{'query': 'what is SemDeDup algorithm?',
 'result': 'SemDeDup searches for duplicates within clusters.'}

#### 4.2.2 VectorstoreIndexCreator 
`VectorstoreIndexCreator` is a wrapper around `RetrievalQA`. It is a higher-level abstraction that lets you do the same job in a few lines of code.

In [15]:
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator(
    text_splitter=text_splitter,
    embedding=embedding_hf,
    # vectorstore_cls=Chroma,
    vectorstore_kwargs={"persist_directory": persist_directory},
).from_documents(documents)

index.query(llm=llm, question=query, chain_type="map_reduce")

# InvalidDimensionException: Dimensionality of (768) does not match index dimensionality (384)


Token indices sequence length is longer than the specified maximum sequence length for this model (1728 > 1024). Running this sequence through the model will result in indexing errors


'SemDeDup searches for duplicates within clusters'

#### 4.2.3. ConversationalRetrievalChain 
In addition to what `RetrievalQA` offers, `ConversationalRetrievalChain` accepts a `chat_history` parameter that carries the chat history, which can be used for follow-up questions.

In essence,

`ConversationalRetrievalChain` = `RetrievalQA` + `conversation memory`
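
The "memory" part can be sketched as one extra step before retrieval: a follow-up question is first rewritten into a standalone question using the chat history. The code below is a minimal sketch of that flow under stated assumptions, not Langchain's implementation: `condense_question`, `conversational_qa`, `echo_llm` and `fixed_retriever` are hypothetical names, the prompt wording is invented, and the stub model only numbers its replies.

```python
def condense_question(llm, question, chat_history):
    """Rewrite a follow-up as a standalone question; pass it through when there is no history."""
    if not chat_history:
        return question
    transcript = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
    return llm(f"Chat history:\n{transcript}\nFollow-up: {question}\nStandalone question:")

def conversational_qa(llm, retrieve, question, chat_history):
    """Condense the question, retrieve context for it, then ask the model for an answer."""
    standalone = condense_question(llm, question, chat_history)
    context = "\n".join(retrieve(standalone))
    answer = llm(f"Context:\n{context}\nQuestion: {standalone}\nAnswer:")
    return {"question": question, "answer": answer}

prompts = []  # every prompt the stub model receives, in order

def echo_llm(prompt):
    """Stub model: records the prompt and returns a numbered reply."""
    prompts.append(prompt)
    return f"reply {len(prompts)}"

def fixed_retriever(query):
    """Stub retriever: always returns the same single chunk."""
    return ["chunk about deduplication"]

history = []
first = conversational_qa(echo_llm, fixed_retriever, "what is SemDeDup?", history)
history.append(("what is SemDeDup?", first["answer"]))
second = conversational_qa(echo_llm, fixed_retriever, "elaborate on it", history)
print(len(prompts))  # 3: one call for the first turn, two for the follow-up
```

The first turn costs a single model call because there is no history to condense; every later turn costs two, one to rewrite the question and one to answer it.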

To use `ConversationalRetrievalChain`, we can reuse the `retriever` we created for `RetrievalQA`.

In [17]:
from langchain.chains import ConversationalRetrievalChain 

conversationQA = ConversationalRetrievalChain.from_llm(
    llm=llm, 
    retriever=retriever, 
    chain_type="map_reduce", 
    return_source_documents=True
    )

chat_history = []
result = conversationQA({'question': query, 'chat_history': chat_history})

print(result['answer'])

chat_history = [(query, result['answer'])]
query2 = "elaborate on a deduplication algorithm"
result = conversationQA({'question': query2, 'chat_history': chat_history})

result['answer']

Token indices sequence length is longer than the specified maximum sequence length for this model (1628 > 1024). Running this sequence through the model will result in indexing errors


SemDeDup searches for duplicates within clusters.


Token indices sequence length is longer than the specified maximum sequence length for this model (1471 > 1024). Running this sequence through the model will result in indexing errors


'Deduplication with Threshold (eps)=0.03'

### Summary:
In conclusion, the key elements to remember include:

* **Embeddings**: We use Huggingface embeddings. There are other options, such as OpenAI embeddings.

* **TextSplitter**: We use the Character Text Splitter, which splits the text on a single separator string. Please see the documentation for other splitters.

* **VectorStore**: We use Chroma as the vector database where the vectorised text is stored. Other popular vector databases are FAISS, Milvus, Pinecone, Weaviate, etc.

* **Retrievers**: We use a VectorStoreRetriever, which is backed by a VectorStore. To retrieve text, there are two search types to leverage: `similarity` and `mmr`.

  * Similarity search selects the text chunks whose vectors are most similar to the question vector.

  * MMR search uses maximum marginal relevance, which optimizes for similarity to the query AND diversity among the selected documents.

* **Chain_Type** can be any of these: `stuff`, `map_reduce`, `refine` or `map_rerank`.