<a href="https://colab.research.google.com/github/ash-rulz/TextMining/blob/main/TextMiningProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG
This notebook implements a retrieval-augmented generation (RAG) pipeline for answering questions about the first book in the Harry Potter series, "Harry Potter and the Sorcerer's Stone".

The whole process can be summarized as follows:
1. **PDF splitter**: A PDF version of the book is parsed and split into smaller chunks.
2. **Sentence embedding**: These chunks are transformed into sentence embeddings. The sentence embedder used for this is *sentence-transformers/all-MiniLM-L6-v2*.
3. **Vector DB**: The embeddings are stored in a vector DB. The vector DB used here is FAISS.
4. **Generator**: A generator based on an LLM is created. The generator used here is *google/flan-t5-large*.
5. **Retriever chain**: A retriever chain is then created. Its input is a question, which is converted into a sentence embedding and compared with the embeddings in the vector DB. The k most similar chunks are retrieved and passed as *context* in a custom prompt template, along with the original question. This prompt is passed to the generator to produce the answer.
6. **Evaluation**: The whole RAG pipeline is evaluated using the RAGAS framework.
7. **Improvements**: We try out different ways to improve the evaluation scores.
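
The retrieve-then-generate flow in steps 2 to 5 can be sketched without any of the libraries used below. This is only a toy illustration: the bag-of-words `embed` function stands in for the sentence embedder, a plain list stands in for FAISS, and the names `embed`, `cosine`, `retrieve` and `build_prompt` are hypothetical helpers, not part of LangChain. The prompt layout loosely mirrors the template used later in this notebook.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a sentence embedder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, contexts):
    """Place the retrieved chunks and the question into a prompt."""
    context_block = "\n".join(contexts)
    return f"### CONTEXT\n{context_block}\n\n### QUESTION\nQuestion: {question}\n"

# Tiny in-memory "vector DB": three short chunks
chunks = [
    "Filch owned a cat called Mrs. Norris.",
    "She patrolled the corridors alone.",
    "The students all hated him.",
]
question = "What is the name of Filch's cat?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)  # this prompt would then be passed to the generator
```

In the real pipeline the generator receives this prompt, so the quality of the answer depends directly on whether the retrieved chunk actually contains it.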

In [1]:
!pip install -q -U langchain pypdf

In [2]:
#Load the pdf to memory
from langchain.document_loaders import PyPDFLoader
pdfLoader = PyPDFLoader("Book.pdf")
documents = pdfLoader.load()

In [3]:
#Split the file to chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
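text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    # Note: separators are matched literally unless is_separator_regex=True is passed,
    # so the regex-style third entry is unlikely to ever match as written.
    separators=['\n\n', '\n', r'(?=>\. )', ' ', ''])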
docs = text_splitter.split_documents(documents)
len(docs)

1189
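
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window sketch. It is not what `RecursiveCharacterTextSplitter` does internally: the real splitter also tries to break at the listed separators, so its chunks are usually shorter than `chunk_size`. The function name `chunk_with_overlap` is a hypothetical helper used only for illustration.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 1200 characters of dummy text
text = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(text)
print([len(c) for c in chunks])  # [500, 500, 400]
```

With a 100-character overlap, a sentence that straddles a chunk boundary is more likely to appear whole in at least one chunk, at the cost of storing some text twice.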

In [4]:
!pip install -U -q sentence-transformers faiss-gpu

In [5]:
#Store the documents in a vector store
from langchain.embeddings import HuggingFaceEmbeddings

model_path = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = HuggingFaceEmbeddings(
    model_name=model_path,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

#Create vector store
from langchain.vectorstores import FAISS
db = FAISS.from_documents(docs, embeddings)

In [6]:
#Example of documents retrieved from the vector DB for a question
question = "What is the name of Filch's cat?"
searchDocs = db.similarity_search_with_score(question)
print(searchDocs[0])

(Document(page_content="106Filch owned a cat called Mrs. Norris, a scrawny, dust-colored creature\nwith bulging, lamp like eyes just like Filch's. She patrolled thecorridors alone. Break a rule in front of her, put just one toe out ofline, and she'd whisk off for Filch, who'd appear, wheezing, two secondslater. Filch knew the secret passageways of the school better thananyone (except perhaps the Weasley twins) and could pop up as suddenly\nas any of the ghosts. The students all hated him, and it was the dearest", metadata={'source': 'Book.pdf', 'page': 106}), 0.747094)
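
A note on reading the score: to my knowledge, LangChain's FAISS wrapper returns an L2-based distance by default, so a lower score means a closer match. This is an assumption about the installed version and index type, not something verified in this notebook. The sketch below uses made-up 3-dimensional vectors to show how L2 distance and cosine similarity relate once vectors are normalized to unit length (squared L2 distance equals `2 - 2 * cosine`).

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Made-up vectors for illustration only
query = normalize([1.0, 2.0, 0.5])
close = normalize([1.1, 1.9, 0.4])
far = normalize([-1.0, 0.2, 3.0])

for name, vec in [("close", close), ("far", far)]:
    d = l2_distance(query, vec)
    c = cosine_similarity(query, vec)
    # For unit vectors, d**2 and 2 - 2*c agree up to rounding
    print(name, round(d, 3), round(c, 3), round(d * d, 3), round(2 - 2 * c, 3))
```

Since `normalize_embeddings` is set to `False` above, the stored vectors are not guaranteed to be unit length, so the distance and cosine rankings may differ slightly in this notebook.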


In [7]:
#Create a generator
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM,pipeline
from langchain import HuggingFacePipeline

model_name_flan = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name_flan)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_flan)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer,max_new_tokens=200)
llm = HuggingFacePipeline(
    pipeline = pipe,
    model_kwargs={"temperature": 0, "max_length": 1000000},
)


In [8]:
#Evidence of hallucination by the T5 model
question = "What is the name of Filch's cat?"
llm_result = llm.invoke(question)
llm_result

'sam'

In [9]:
#Create a retriever
from langchain.prompts import ChatPromptTemplate
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

base_retriever = db.as_retriever(search_kwargs={"k" : 3})
retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | llm, "context": itemgetter("context")}
)

In [10]:
#Evidence of how the RAG improved the result
result = retrieval_augmented_qa_chain.invoke({"question" : question})
result['response']

Token indices sequence length is longer than the specified maximum sequence length for this model (552 > 512). Running this sequence through the model will result in indexing errors


'Mrs. Norris'
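
The warning above shows that the filled-in prompt (552 tokens) exceeds the 512-token limit reported for the model, so part of the context may be handled poorly. One possible mitigation, sketched here under stated assumptions, is to keep only as many retrieved chunks as fit a token budget. `fit_contexts` is a hypothetical helper; the default whitespace counter is a rough stand-in, and in this notebook one could instead pass something like `lambda s: len(tokenizer(s)["input_ids"])`.

```python
def fit_contexts(contexts, question, max_tokens=512, count_tokens=None, reserved=40):
    """Keep the leading contexts that fit in the token budget.

    contexts are assumed to be ordered most relevant first.
    reserved is a rough allowance for the fixed template text.
    """
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())  # crude proxy for a real tokenizer
    budget = max_tokens - reserved - count_tokens(question)
    kept = []
    for ctx in contexts:
        cost = count_tokens(ctx)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(ctx)
        budget -= cost
    return kept

# Three dummy chunks of 200 "tokens" each
contexts = ["a " * 200, "b " * 200, "c " * 200]
question = "What is the name of Filch's cat?"
kept = fit_contexts(contexts, question)
print(len(kept))  # 2
```

Dropping the least relevant chunk trades some recall for a prompt that stays inside the model's input window; lowering `k` or `chunk_size` would be alternative ways to reach the same goal.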

# Evaluation

In [11]:
!pip install -q datasets tqdm

In [12]:
#Get the ground truth data from QAEval
from datasets import Dataset
#eval_dataset = Dataset.from_csv("QAEval.csv", encoding='latin1', sep = ';')
eval_dataset = Dataset.from_csv("QAEval.csv")
eval_dataset

Dataset({
    features: ['question', 'ground_truth'],
    num_rows: 141
})

The `QAEval.csv` file containing the ground-truth data is created in the [EvalDataGenerator](https://colab.research.google.com/github/ash-rulz/TextMining/blob/main/EvalDataGenerator.ipynb) notebook.

In [13]:
#Format the data into the RAGAS structure
from tqdm import tqdm
import pandas as pd

def create_ragas_dataset(rag_chain, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_chain.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"],
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)
basic_qa_ragas_dataset

100%|██████████| 141/141 [16:45<00:00,  7.13s/it]


Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 141
})

In [14]:
# Save the dataset to a Parquet file
save_path = '/content/basic_qa_ragas_dataset.parquet'
basic_qa_ragas_dataset.to_pandas().to_parquet(save_path)

In [13]:
from datasets import Dataset
import pandas as pd

save_path = '/content/basic_qa_ragas_dataset.parquet'
ragas_eval_dataset =  Dataset.from_pandas(pd.read_parquet(save_path))
ragas_eval_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 141
})
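
Because the evaluation step calls a paid API, it can be worth checking the reloaded rows first. The sketch below is a minimal sanity check, assuming each row is a plain dict with the four columns shown above (as one would get when iterating over the dataset). `find_invalid_rows` is a hypothetical helper, not part of RAGAS or `datasets`.

```python
# Expected columns and their Python types
REQUIRED = {"question": str, "answer": str, "contexts": list, "ground_truths": list}

def find_invalid_rows(rows):
    """Return (row_index, column, problem) tuples for rows that look malformed."""
    problems = []
    for i, row in enumerate(rows):
        for col, typ in REQUIRED.items():
            if col not in row:
                problems.append((i, col, "missing"))
            elif not isinstance(row[col], typ):
                problems.append((i, col, "wrong type"))
            elif len(row[col]) == 0:
                problems.append((i, col, "empty"))
    return problems

# Illustrative rows only; the second has no retrieved contexts
rows = [
    {"question": "What is the name of Filch's cat?", "answer": "Mrs. Norris",
     "contexts": ["Filch owned a cat called Mrs. Norris"], "ground_truths": ["Mrs. Norris"]},
    {"question": "What is the name of Filch's cat?", "answer": "I don't know",
     "contexts": [], "ground_truths": ["Mrs. Norris"]},
]
problems = find_invalid_rows(rows)
print(problems)  # [(1, 'contexts', 'empty')]
```

In the notebook this could be called as `find_invalid_rows(ragas_eval_dataset)`, though that usage is untested here.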

In [14]:
!pip install -q -U ragas openai

RAGAS evaluation requires an OpenAI API key.

In [17]:
import os
import openai
import getpass

open_ai_key = getpass.getpass('Enter your OPENAI API Key')
os.environ['OPENAI_API_KEY'] = open_ai_key

In [19]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)
from ragas import evaluate
eval_result = evaluate(
  ragas_eval_dataset,
  metrics=[
      context_precision,
      faithfulness,
      answer_relevancy,
      context_recall,
      context_relevancy,
      answer_correctness,
      answer_similarity
  ],
)
eval_result

evaluating with [context_precision]


100%|██████████| 10/10 [02:13<00:00, 13.31s/it]


evaluating with [faithfulness]


100%|██████████| 10/10 [02:57<00:00, 17.77s/it]


evaluating with [answer_relevancy]


100%|██████████| 10/10 [02:41<00:00, 16.13s/it]


evaluating with [context_recall]


100%|██████████| 10/10 [02:11<00:00, 13.13s/it]


evaluating with [context_relevancy]


100%|██████████| 10/10 [02:04<00:00, 12.42s/it]


evaluating with [answer_correctness]


100%|██████████| 10/10 [01:34<00:00,  9.49s/it]


evaluating with [answer_similarity]


100%|██████████| 10/10 [00:17<00:00,  1.74s/it]


{'context_precision': 0.2996, 'faithfulness': 0.4255, 'answer_relevancy': 0.6323, 'context_recall': 0.6390, 'context_relevancy': 0.1007, 'answer_correctness': 0.4832, 'answer_similarity': 0.8691}
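
To decide where to focus the improvements, the scores above can be sorted and grouped. The grouping below into retrieval-side and generation-side metrics is my own rough reading of the metric names, not an official RAGAS classification.

```python
# Scores copied from the evaluation output above
scores = {
    "context_precision": 0.2996,
    "faithfulness": 0.4255,
    "answer_relevancy": 0.6323,
    "context_recall": 0.6390,
    "context_relevancy": 0.1007,
    "answer_correctness": 0.4832,
    "answer_similarity": 0.8691,
}

# Weakest metrics first
ranked = sorted(scores.items(), key=lambda kv: kv[1])
for name, value in ranked:
    print(f"{name:<20} {value:.4f}")

# Assumed grouping: context_* metrics describe retrieval, the rest describe the answer
retrieval = [v for k, v in scores.items() if k.startswith("context_")]
generation = [v for k, v in scores.items() if not k.startswith("context_")]
retrieval_mean = sum(retrieval) / len(retrieval)
generation_mean = sum(generation) / len(generation)
print(f"retrieval mean:  {retrieval_mean:.4f}")
print(f"generation mean: {generation_mean:.4f}")
```

The two lowest scores are `context_relevancy` and `context_precision`, which suggests that the retrieval step (chunking, embedding model, and `k`) is a reasonable first place to look for improvements.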