<a href="https://colab.research.google.com/github/ash-rulz/RAG/blob/main/RAG_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG based QA using Langchain

We use the FLAN-T5 model in this exercise. Its predecessor is the T5 model, introduced in the paper *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* (see this [video](https://www.youtube.com/watch?v=91iLu6OOrwk) for more details on the paper, or a more detailed video [here](https://www.youtube.com/watch?v=Axo0EtMUK90)). FLAN-T5 itself is based on the paper *Scaling Instruction-Finetuned Language Models* (see [video](https://www.youtube.com/watch?v=SHMsdAPo2Ls)). In short, FLAN-T5 is a Fine-tuned LANguage model built on T5.

Next, we use [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which maps sentences to a 384-dimensional vector space. These sentence embeddings are stored in a FAISS vector store.

The relevant documents are retrieved using *RetrievalQA* from *langchain*.





---
Prerequisites:
1. A "data" folder containing the PDF (`LetterToIndustry.pdf`) and the evaluation dataset (`RagEvalGT.csv`) must be available.
---



# Pre-requisites
Create a folder called data and place in it the PDF to be stored in the vector database. The code below loads the file from `data/LetterToIndustry.pdf`.

# Step1: Split the PDF

In [1]:
!pip install -q -U langchain pypdf

In [2]:
#Load the pdf to memory
from langchain.document_loaders import PyPDFLoader
pdfLoader = PyPDFLoader("data/LetterToIndustry.pdf")
documents = pdfLoader.load()

In [3]:
len(documents)

2

In [4]:
#Split the file to chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
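text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # maximum number of characters per chunk
    chunk_overlap=100,   # characters shared between neighbouring chunks
    # Note: '(?=>\. )' looks like a typo for the lookbehind '(?<=\. )'. Recent LangChain
    # versions also match separators literally unless is_separator_regex=True is passed.
    separators=['\n\n', '\n', '(?=>\. )', ' ', ''])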
docs = text_splitter.split_documents(documents)

In [5]:
docs[3]

Document(page_content='• The thesis  comprises  30 credit  points,  i.e. 100%  studies  for a semester  \n(the academic year in Sweden is divided in two semesters).  \n• The thesis  is organized  as a course  and runs  between  fixed  dates  \n(Jan - June  for the spring  semester  or Sept -Jan for the fall semester).  \n• The thesis  is individual  work  and the student  has to write  a \nsingle -authored master thesis report.  \n• The proposed project needs to be accepted  by the course \nexaminer.', metadata={'source': 'data/LetterToIndustry.pdf', 'page': 0})
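
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries the separators in order so that chunks end at natural boundaries); the helper name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each printed window starts with the last four characters of the previous one, which is the same idea as the 100-character overlap used above.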

# Step2: Create vector store
Here we use all-MiniLM-L6-v2 to create the sentence embedding and the embeddings are stored in the [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss) vector store.

In [6]:
!pip install -U -q sentence-transformers

In [7]:
from langchain.embeddings import HuggingFaceEmbeddings

model_path = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=model_path,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [8]:
!pip install -q faiss-gpu

In [9]:
#Create vector store
from langchain.vectorstores import FAISS
db = FAISS.from_documents(docs, embeddings)

In [10]:
#Example of documents retrieved from the vector DB for a question
question = "How many credits does the thesis comprise of?"
searchDocs = db.similarity_search_with_score(question)
print(searchDocs[0])

(Document(page_content='• The thesis  comprises  30 credit  points,  i.e. 100%  studies  for a semester  \n(the academic year in Sweden is divided in two semesters).  \n• The thesis  is organized  as a course  and runs  between  fixed  dates  \n(Jan - June  for the spring  semester  or Sept -Jan for the fall semester).  \n• The thesis  is individual  work  and the student  has to write  a \nsingle -authored master thesis report.  \n• The proposed project needs to be accepted  by the course \nexaminer.', metadata={'source': 'data/LetterToIndustry.pdf', 'page': 0}), 0.6723114)
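
The second element of the tuple (0.6723114) is a distance rather than a similarity: LangChain's FAISS wrapper uses an L2 index by default, so a lower score means a closer match. The sketch below reproduces that ranking idea in plain Python using squared Euclidean distance on made-up 3-dimensional vectors. The vectors, the chunk labels and the helper names are illustrative assumptions; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: list[float], store: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k entries closest to the query, smallest distance first."""
    scored = [(name, squared_l2(query, vec)) for name, vec in store.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Toy "embeddings" (hypothetical values, 3 dimensions instead of 384).
toy_store = {
    "chunk about credit points": [0.9, 0.1, 0.0],
    "chunk about supervision": [0.1, 0.8, 0.2],
    "chunk about publication": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_store, k=2)
for name, distance in ranked:
    print(f"{distance:.2f}  {name}")
```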


# Step3: Create the generator
Here we use the FLAN-T5 model which is fine-tuned on many tasks including QA tasks.


In [11]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms import HuggingFacePipeline

model_name_flan = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name_flan)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_flan)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer,max_new_tokens=200)
llm = HuggingFacePipeline(
    pipeline = pipe,
    model_kwargs={"temperature": 0, "max_length": 1000000},
)

# Step4: Create the prompt template
During inference, the query is sent to the vector store, which returns multiple candidate documents. The best-matching ones become the context in the prompt template.

The first method uses a plain PromptTemplate and is the simpler approach.

In [12]:
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
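
At query time the two placeholders are filled in before the text reaches the model. A rough stand-in using only Python's built-in `str.format` is sketched below. It mirrors the placeholder substitution, not the full behavior of LangChain's `PromptTemplate`, and the shortened template text and the sample chunks are assumptions for illustration.

```python
# Shortened version of the template above, with the same two placeholders.
toy_template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

# Retrieved chunks are joined into one context string (sample text, not real output).
retrieved_chunks = [
    "The thesis comprises 30 credit points.",
    "The thesis is individual work.",
]
filled_prompt = toy_template.format(
    context="\n\n".join(retrieved_chunks),
    question="How many credits does the thesis comprise of?",
)
print(filled_prompt)
```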

The second method uses ChatPromptTemplate and is the more sophisticated approach. More information on the difference between the two templates is available [here](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/).

In [13]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

# Step5: Create a retriever
We can use the RetrievalQA chain from langchain for this. See [this](https://docs.smith.langchain.com/cookbook/hub-examples/retrieval-qa-chain).

In [14]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k" : 3}),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
    )

The second, more sophisticated approach builds the chain with the LangChain Expression Language. It is explained in more detail in the [LangChain expression language notebook](https://colab.research.google.com/drive/1yFgTXd3sUa83-QWUDQWoyywduCIlT2E6#scrollTo=n34J771m_hh8&line=4&uniqifier=1).

In [15]:
from operator import itemgetter

from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

base_retriever = db.as_retriever(search_kwargs={"k" : 3})
retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | llm, "context": itemgetter("context")}
)
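
The data flow of this chain can be hard to read from the pipe syntax alone. The sketch below mimics the same three stages with plain Python functions and dictionaries. `fake_retriever` and `fake_llm` are hypothetical stand-ins, not LangChain objects, so only the shape of the intermediate dictionaries carries over.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for base_retriever: returns canned 'documents'."""
    return ["The thesis comprises 30 credit points."]


def fake_llm(prompt_text: str) -> str:
    """Stand-in for the FLAN-T5 pipeline: returns a fixed answer."""
    return "30"


def toy_chain(inputs: dict) -> dict:
    # Stage 1: build {"context", "question"} from the incoming question.
    stage1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Stage 2: pass everything through unchanged (context is re-assigned to itself).
    stage2 = {**stage1, "context": stage1["context"]}
    # Stage 3: format the prompt, call the model, and keep the context alongside.
    prompt_text = (
        f"### CONTEXT\n{stage2['context']}\n\n"
        f"### QUESTION\nQuestion: {stage2['question']}\n"
    )
    return {"response": fake_llm(prompt_text), "context": stage2["context"]}


toy_result = toy_chain({"question": "How many credits does the thesis comprise of?"})
print(toy_result)
```

Returning the context next to the response is what later makes it possible to evaluate retrieval and generation separately.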

In [16]:
result = qa_chain({"query": question})
print(result)

{'query': 'How many credits does the thesis comprise of?', 'result': '30'}


In [17]:
result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)

Token indices sequence length is longer than the specified maximum sequence length for this model (527 > 512). Running this sequence through the model will result in indexing errors


{'response': '30', 'context': [Document(page_content='• The thesis  comprises  30 credit  points,  i.e. 100%  studies  for a semester  \n(the academic year in Sweden is divided in two semesters).  \n• The thesis  is organized  as a course  and runs  between  fixed  dates  \n(Jan - June  for the spring  semester  or Sept -Jan for the fall semester).  \n• The thesis  is individual  work  and the student  has to write  a \nsingle -authored master thesis report.  \n• The proposed project needs to be accepted  by the course \nexaminer.', metadata={'source': 'data/LetterToIndustry.pdf', 'page': 0}), Document(page_content='programming, database management and big data  analytics.\n \n \nA majority  of the master  theses  in the program  are written  in collaboration  with  the \nindustry, based on a specific problem that the firm wants to solve. This letter sets \nout the requirements for the thesis work as an aid for the firm. In particular, here \nare some  important  things  to keep  in mi

In [18]:
result['context'][0]

Document(page_content='• The thesis  comprises  30 credit  points,  i.e. 100%  studies  for a semester  \n(the academic year in Sweden is divided in two semesters).  \n• The thesis  is organized  as a course  and runs  between  fixed  dates  \n(Jan - June  for the spring  semester  or Sept -Jan for the fall semester).  \n• The thesis  is individual  work  and the student  has to write  a \nsingle -authored master thesis report.  \n• The proposed project needs to be accepted  by the course \nexaminer.', metadata={'source': 'data/LetterToIndustry.pdf', 'page': 0})

In [19]:
result['context'][1]

Document(page_content='programming, database management and big data  analytics.\n \n \nA majority  of the master  theses  in the program  are written  in collaboration  with  the \nindustry, based on a specific problem that the firm wants to solve. This letter sets \nout the requirements for the thesis work as an aid for the firm. In particular, here \nare some  important  things  to keep  in mind while  discussing  a potential  project:  \n \n• The thesis  comprises  30 credit  points,  i.e. 100%  studies  for a semester', metadata={'source': 'data/LetterToIndustry.pdf', 'page': 0})

In [20]:
result['context'][2]

Document(page_content='applications should therefore be based on statistically oriented  \nmethods.  \n• The master  thesis  is a scientific  work . This  means  for example  that  \nthe proposed  solution  in the thesis  needs  to relate  to existing work  in \nthe scientific literature, and the advantages and disadvantages of \nthe proposed  solution  needs  to be critically  assessed.  \n• A master thesis is a public document  the results from the thesis work will', metadata={'source': 'data/LetterToIndustry.pdf', 'page': 1})
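
The warning printed by cell 17 (527 > 512) indicates that the prompt built from three retrieved chunks is longer than the maximum sequence length the tokenizer reports for this model. One possible mitigation, sketched below, is to drop the lowest-ranked chunks until the prompt fits. The whitespace-based `estimate_tokens` is a crude assumption standing in for the real tokenizer, and `fit_chunks_to_budget` is a hypothetical helper, not part of LangChain.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: one token per whitespace-separated word.
    A real tokenizer would usually produce more tokens than this."""
    return len(text.split())


def fit_chunks_to_budget(chunks: list[str], question: str, max_tokens: int) -> list[str]:
    """Keep chunks in ranked order while question + chunks stay within max_tokens."""
    used = estimate_tokens(question)
    kept = []
    for chunk in chunks:  # chunks are assumed to be sorted best-first
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


ranked_chunks = ["one two three four", "five six seven", "eight nine ten eleven twelve"]
kept_chunks = fit_chunks_to_budget(ranked_chunks, question="how many credits", max_tokens=10)
print(kept_chunks)
```

Other options with the same goal would be a smaller `chunk_size` or a lower `k` in the retriever.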

# Step6: Evaluation of the RAG pipeline
We will evaluate using RAGAS ([paper](https://arxiv.org/abs/2309.15217)).

In [3]:
!pip install datasets -q

In [22]:
from datasets import Dataset
eval_dataset = Dataset.from_csv("data/RagEvalGT.csv", encoding='latin1', sep = ';')

In [23]:
eval_dataset

Dataset({
    features: ['question', 'ground_truth'],
    num_rows: 8
})

The dataset for RAGAS needs to follow a particular structure; see the [data preparation page](https://docs.ragas.io/en/latest/howtos/applications/data_preparation.html).


1. For the question, we load it from the question field from the csv.
2. For the answer, we get the response from the generator.
3. For the contexts, we get the contexts retrieved by the retriever.
4. For the ground truths, we get the answer from the csv.
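
As a quick sanity check on that structure, the sketch below builds one row by hand and validates the field types. The column names follow the four items above as they are used later in this notebook (`ground_truths` is a list here). The `validate_row` helper and the sample values are illustrative assumptions, and the expected column names may differ between RAGAS versions.

```python
def validate_row(row: dict) -> None:
    """Raise AssertionError if the row does not match the four-field layout."""
    assert isinstance(row["question"], str), "question must be a string"
    assert isinstance(row["answer"], str), "answer must be a string"
    for key in ("contexts", "ground_truths"):
        assert isinstance(row[key], list), f"{key} must be a list"
        assert all(isinstance(item, str) for item in row[key]), f"{key} must hold strings"


# Hand-written sample row (values are made up for illustration).
example_row = {
    "question": "How many credit points does the thesis comprise of?",
    "answer": "30",
    "contexts": ["The thesis comprises 30 credit points."],
    "ground_truths": ["30 credit points"],
}
validate_row(example_row)
print(sorted(example_row))
```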



In [24]:
!pip install tqdm -q

In [25]:
from tqdm import tqdm
import pandas as pd

def create_ragas_dataset(rag_chain, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_chain.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"],
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

In [26]:
basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)
basic_qa_ragas_dataset

100%|██████████| 8/8 [02:19<00:00, 17.48s/it]


Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 8
})

In [27]:
basic_qa_ragas_dataset[2]

{'question': 'How many credit points does the thesis comprise of?',
 'answer': '30',
 'contexts': ['• The thesis  comprises  30 credit  points,  i.e. 100%  studies  for a semester  \n(the academic year in Sweden is divided in two semesters).  \n• The thesis  is organized  as a course  and runs  between  fixed  dates  \n(Jan - June  for the spring  semester  or Sept -Jan for the fall semester).  \n• The thesis  is individual  work  and the student  has to write  a \nsingle -authored master thesis report.  \n• The proposed project needs to be accepted  by the course \nexaminer.',
  'programming, database management and big data  analytics.\n \n \nA majority  of the master  theses  in the program  are written  in collaboration  with  the \nindustry, based on a specific problem that the firm wants to solve. This letter sets \nout the requirements for the thesis work as an aid for the firm. In particular, here \nare some  important  things  to keep  in mind while  discussing  a potential  p

In [28]:
# Save the dataset to a Parquet file
save_path = '/content/basic_qa_ragas_dataset.parquet'
basic_qa_ragas_dataset.to_pandas().to_parquet(save_path)

In [29]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 8
})

In [7]:
from datasets import Dataset
import pandas as pd

save_path = '/content/basic_qa_ragas_dataset.parquet'
ragas_eval_dataset =  Dataset.from_pandas(pd.read_parquet(save_path))
ragas_eval_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 8
})

Now that we have got the dataset in the structure expected by the RAGAS module, we can go ahead and evaluate our whole pipeline.

In [1]:
!pip install -q -U ragas

RAGAS evaluation needs an OpenAI API key, which can be set up [here](https://platform.openai.com/api-keys). Usage can be tracked [here](https://platform.openai.com/usage).

In [17]:
import os
import openai
import getpass

open_ai_key = getpass.getpass('Enter your OPENAI API Key')
os.environ['OPENAI_API_KEY'] = open_ai_key

Enter your OPENAI API Key··········


In [12]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)
from ragas import evaluate
eval_result = evaluate(
  ragas_eval_dataset,
  metrics=[
      context_precision,
      faithfulness,
      answer_relevancy,
      context_recall,
      context_relevancy,
      answer_correctness,
      answer_similarity
  ],
)
eval_result

evaluating with [context_precision]


100%|██████████| 1/1 [00:08<00:00,  8.12s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [00:14<00:00, 14.63s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:06<00:00,  6.32s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:04<00:00,  4.96s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [00:02<00:00,  2.76s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:06<00:00,  6.45s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.61it/s]


{'context_precision': 0.6875, 'faithfulness': 0.7500, 'answer_relevancy': 0.7136, 'context_recall': 1.0000, 'context_relevancy': 0.0275, 'answer_correctness': 0.6130, 'answer_similarity': 0.8897}
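
To see at a glance where the pipeline is weakest, the scores can be sorted in ascending order. The sketch copies the numbers reported above into a plain dictionary rather than reading them from `eval_result`, since the exact type of that object is not shown here.

```python
# Scores copied from the evaluation output above.
scores = {
    "context_precision": 0.6875,
    "faithfulness": 0.7500,
    "answer_relevancy": 0.7136,
    "context_recall": 1.0000,
    "context_relevancy": 0.0275,
    "answer_correctness": 0.6130,
    "answer_similarity": 0.8897,
}

# Sort from weakest to strongest metric.
ranked_scores = sorted(scores.items(), key=lambda item: item[1])
for name, value in ranked_scores:
    print(f"{name:<20} {value:.4f}")
```

In this run `context_relevancy` (0.0275) is the lowest score and `context_recall` (1.0) the highest.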

# Notes



1.   While this works for simple PDF documents, complicated PDF documents gave very poor results.



# References


1.   [Post](https://www.linkedin.com/pulse/get-insight-from-your-business-data-build-llm-application-jain/) by Ashish Jain
2.   [Blog](https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed) by Onkar Mishra
3.   [Langchain course](https://learn.deeplearning.ai/langchain-chat-with-your-data/lesson/1/introduction) on DeepLearning.ai



# To-Do (if time is available)
1. Check out [Langchain Expression Language](https://www.youtube.com/watch?v=moJRxxEddzU)
2. Read up on Langchain chunking (RecursiveCharacterTextSplitter). A possible [lead](https://www.youtube.com/watch?v=eqOfr4AGLk8).

# Next steps:
1. Explain each metric reported in RAGAS. Fine-tune.
2. Check what details are needed for the Harry Potter data. Both eval and pdf document to parse.