<a href="https://colab.research.google.com/github/aswinaus/Evals/blob/main/RAG_with_Evaluation_RAGAS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain langchain_community langchain_openai chromadb pymupdf nest_asyncio --quiet
# Chat model and embeddings both come from the langchain_openai package
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import ChatPromptTemplate

from langchain_core.runnables import (
    RunnableParallel,
    RunnablePassthrough
)
from langchain_core.output_parsers import StrOutputParser

**import nest_asyncio:** Imports the nest_asyncio library, which lets asyncio code run inside an event loop that is already running.

**nest_asyncio.apply():** Patches the asyncio event loop so that it can be re-entered. Notebook environments such as Colab already run an event loop, so without this patch asynchronous code that tries to start its own loop can fail with a "running event loop" error.

**import os:** Imports os, the standard Python library for interacting with the operating system. It is used later to set the OPENAI_API_KEY environment variable.
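To make the conflict concrete, here is a minimal sketch that uses only the standard library. It reproduces the error nest_asyncio.apply() is meant to work around by calling asyncio.run() from code that is already running inside an event loop. The function names inner and outer are illustrative and do not appear elsewhere in this notebook.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # Inside this coroutine an event loop is already running,
    # which is the situation a notebook cell is in.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

The printed message is the RuntimeError text raised by asyncio when a second loop is started from inside a running one.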

In [None]:
import nest_asyncio
import os
nest_asyncio.apply()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pymupdf

In [None]:
# Path to the data directory on your mounted Google Drive
data_dir = '/content/drive/MyDrive'

In [None]:
doc = pymupdf.open(f"{data_dir}/RAG/data/TP/Intel_Financial_Statements_Year_Ended_2017.pdf")

In [None]:
# Extract the text of each page to validate that the PDF opened correctly
for page in doc:
    text = page.get_text()
    # print(text)  # uncomment to inspect the extracted text

In [None]:
import chromadb
from langchain_openai import OpenAIEmbeddings

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
pages=[]
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,
    chunk_overlap=50
)
loader = PyMuPDFLoader(f"{data_dir}/RAG/data/TP/Intel_Financial_Statements_Year_Ended_2017.pdf")
# load_and_split applies the token-based splitter defined above (300-token chunks, 50-token overlap)
pages_to_persist = loader.load_and_split(text_splitter)
pages.extend(pages_to_persist)

In [None]:
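# Optional second pass: re-split the chunks by character count.
# Note: `splits` is not used later; the vector store below is built from `pages`.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
splits = text_splitter.split_documents(pages)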
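To illustrate what chunk_size and chunk_overlap control, the sketch below implements a fixed-window, character-based chunker. It is a simplified stand-in and not the algorithm used by RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraph and line breaks. The helper name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk repeats the last character of the previous one, which is the role the overlap plays: context that straddles a chunk boundary is not lost entirely.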

In [None]:
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [None]:
# Create the vector store with Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Embed every chunk in `pages` and store the vectors on Google Drive
vectordb = Chroma.from_documents(
    documents=pages,
    embedding=OpenAIEmbeddings(openai_api_key=os.environ["OPENAI_API_KEY"]),
    persist_directory=f"{data_dir}/RAG/VectorDB/chroma_db_RAG",
)
# With chromadb >= 0.4 data is persisted automatically and this call is deprecated
vectordb.persist()
# The default retriever returns the 4 most similar chunks per query
retriever = vectordb.as_retriever()

In [None]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
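As a quick check of what format_docs produces, the sketch below runs it on two stand-in objects. FakeDoc is a hypothetical replacement for a LangChain Document and carries only the page_content attribute that format_docs reads. The sample strings are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    # Hypothetical stand-in for a LangChain Document; only page_content is needed
    page_content: str


def format_docs(docs):
    # Join the text of every document, separated by a blank line
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDoc("Goodwill note"), FakeDoc("Intangible assets note")]
formatted = format_docs(docs)
print(formatted)
```

The blank line between chunks keeps them visually separate when the combined string is placed into the prompt.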

**RunnablePassthrough.assign():** RunnablePassthrough is a LangChain runnable that passes its input through unchanged. Its .assign() method returns a runnable that keeps every key of the input dictionary and adds new key-value pairs computed from that input.

**context= :** The new key is called context. Its value is whatever the expression on the right side of the equals sign returns, here the text of the chunks retrieved for the question.

**lambda x: :** An anonymous (lambda) function. It receives one argument x, the input dictionary containing the user's question, and its return value becomes the value of the context key.
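The effect of .assign() can be mimicked in a few lines of plain Python. The sketch below is a simplified stand-in written for illustration, not LangChain's implementation. The helper name assign and the placeholder context string are assumptions.

```python
def assign(**new_fields):
    """Simplified stand-in for RunnablePassthrough.assign (not the LangChain code)."""
    def run(inputs: dict) -> dict:
        # Keep every existing key and add one computed key per keyword argument
        return {**inputs, **{key: fn(inputs) for key, fn in new_fields.items()}}
    return run


# The lambda receives the whole input dictionary, just like in the chain
add_context = assign(context=lambda x: f"chunks retrieved for: {x['question']}")
output = add_context({"question": "What is goodwill?"})
print(output)
```

The output dictionary still contains question and now also has context, which is why the prompt template downstream can fill both placeholders.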

**vectordb.similarity_search(x["question"], k=10):** This call performs the retrieval step.

**vectordb** is the Chroma vector database containing the embeddings of the chunks from the document loaded earlier (Intel Financial Statements).

**similarity_search** is a method that searches the vector database for the chunks most similar to a given query.

**x["question"]** provides the user's question as the query.

**k=10** specifies that the 10 most similar chunks are retrieved.
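The idea behind a top-k similarity search can be sketched without a vector database. The example below ranks a handful of texts by cosine similarity to a query vector. It is a toy under stated assumptions: the 2-dimensional vectors are made up, the helper names cosine and top_k are hypothetical, and cosine similarity is chosen for illustration only, since the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical 2-dimensional embeddings; real embeddings have many more dimensions
store = [
    ("chunk about goodwill", [1.0, 0.0]),
    ("chunk about revenue", [0.0, 1.0]),
    ("chunk about intangibles", [0.9, 0.1]),
]
results = top_k([1.0, 0.0], store, k=2)
print(results)
```

A larger k returns more chunks, which raises the chance of including the needed passage but also adds more unrelated text to the prompt.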

In [None]:
#Creating a RAG Pipeline
from operator import itemgetter
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# RAG
template = """You are an AI language model Accounting assistant. Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0, openai_api_key=os.environ["OPENAI_API_KEY"])
final_rag_chain = (
    #{"context": retriever | format_docs, "question": RunnablePassthrough()}

    RunnablePassthrough.assign(
        context=lambda x: format_docs(vectordb.similarity_search(x["question"], k=10)),
    )

    #| RunnablePassthrough.assign(debug_context=lambda x: print(f"Context before prompt: {x['context']}"))
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
question="Can you let me know the Identified intangible assets subject to amortization and show the difference between 2016 and 2017?"

In [None]:
final_rag_chain.invoke({"question":question})

In [None]:
questions = [
    "Can you get the total amount of Goodwill and Identified Intangible Assets?",
    "How much in intangibles, such as Goodwill and other identified intangible assets, did Intel gain by acquiring Altera, in millions?",
    "Can you list all the Intel Goodwill activities for year 2017 along with figures in millions?",
    "Can you let me know how much was spent on Data Center Group for 2016 and 2017 and show the difference between 2016 and 2017?",
    "Can you let me know the Identified intangible assets subject to amortization and show the difference between 2016 and 2017?",
    ]
ground_truth = [
    "The total amount of Goodwill is $10,278 million, and the total amount of Identified Intangible Assets is $7,566 million.",
    "Intel gained $13,014 million in intangibles such as Goodwill and other identified intangible assets by acquiring Altera.",
    "Sure, here are the Intel Goodwill activities for the year 2017 along with figures in millions:- Client Computing Group: $4,356;- Data Center Group: $5,421;- Internet of Things Group: $1,126;- Programmable Solutions Group: $2,490;- All other: $10,996;Total: $24,389 million",
    "In 2016, the amount spent on the Data Center Group was $7,520 million, and in 2017, it was $8,395 million. The difference between the two years is $875 million, with an increase in spending on the Data Center Group from 2016 to 2017.",
    "The Identified intangible assets subject to amortization for 2016 were $8,686 million, and for 2017, they were $10,577 million. The difference between 2016 and 2017 is $1,891 million.",
    ]
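Because the ground-truth answers contain derived figures, it can help to confirm that they are internally consistent before using them as a reference. The sketch below recomputes the two year-over-year differences and the goodwill total from the figures quoted in the ground_truth strings above. It checks only the arithmetic, not whether the figures match the PDF. The variable names are illustrative.

```python
# Figures in millions of dollars, copied from the ground_truth strings
dcg_2016, dcg_2017 = 7_520, 8_395
amortized_2016, amortized_2017 = 8_686, 10_577
goodwill_by_segment = [4_356, 5_421, 1_126, 2_490, 10_996]

dcg_diff = dcg_2017 - dcg_2016                      # Data Center Group change
amortized_diff = amortized_2017 - amortized_2016    # intangibles subject to amortization
goodwill_total = sum(goodwill_by_segment)           # sum over the listed segments

print(dcg_diff, amortized_diff, goodwill_total)  # prints: 875 1891 24389
```

All three results agree with the values stated in the ground-truth answers.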

In [None]:
!pip install datasets --quiet
from datasets import Dataset

In [None]:
answers  = []
contexts = []

# Run each question through the chain and record the answer plus the context it was built from
for question in questions:
    answers.append(final_rag_chain.invoke({"question": question}))
    # Use the same search as the chain (k=10) so the evaluated contexts match what the LLM saw
    contexts.append([doc.page_content for doc in vectordb.similarity_search(question, k=10)])

# Preparing the dataset
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truth
}

# Convert dict to dataset
dataset = Dataset.from_dict(data)

dataset.to_pandas()
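The evaluation dataset needs one entry per question in every column, and each contexts entry should be a list of strings. A small validation step, sketched below, can catch a mismatch before the dataset is built. The helper name check_eval_columns is hypothetical and the sample rows are placeholders, not results from this notebook.

```python
def check_eval_columns(data: dict) -> int:
    """Return the number of rows, or raise if the columns are inconsistent."""
    if not data:
        raise ValueError("no columns supplied")
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # Each row of "contexts" must be a list of chunk strings
    for row in data.get("contexts", []):
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            raise ValueError("each contexts entry must be a list of strings")
    return next(iter(lengths.values()))


# Placeholder rows for illustration only
sample = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["c1", "c2"], ["c3"]],
    "ground_truth": ["g1", "g2"],
}
n_rows = check_eval_columns(sample)
print(n_rows)
```

In this notebook the same check could be run on the data dictionary before it is converted into a dataset.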

In [None]:
!pip install ragas --quiet
import ragas

In [None]:
#!git clone https://github.com/aswinaus/rag_dataset_ragas.git
#%cd rag_dataset_ragas

In [None]:
#from datasets import load_dataset
#dataset = load_dataset('json', data_files='RAGDataset.json')
#dataset = dataset['train']
#print(dataset)

The following code evaluates the Retrieval Augmented Generation (RAG) system with the ragas library. RAG systems combine information retrieval (finding relevant documents) with text generation (creating answers).

**from ragas import evaluate:** Imports the evaluate function, the main entry point for scoring a RAG system on a dataset.

**from ragas.metrics import (...):** Imports the specific metrics used here. Each one judges a different aspect of the system's performance.

**faithfulness:** Measures whether the generated answer is factually consistent with the retrieved context, that is, whether each claim in the answer is supported by the retrieved chunks.

**answer_relevancy:** Assesses how directly the generated answer addresses the user's question. Incomplete or off-topic answers score lower.

**context_recall:** Evaluates how much of the information in the ground-truth answer can be found in the retrieved context. A higher recall means the retriever missed less of what was needed.

**context_precision:** Measures how much of the retrieved context is actually relevant and whether relevant chunks are ranked near the top. A higher precision means less irrelevant material was retrieved.
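For intuition about the two context metrics, the sketch below computes classic set-based precision and recall over chunk identifiers. This is a simplified analogy, not how ragas computes its scores: ragas relies on model judgments, whereas this example assumes the relevant chunks are already known. The chunk ids and the helper name precision_recall are hypothetical.

```python
def precision_recall(retrieved: list, relevant: set) -> tuple:
    """Set-based precision and recall for one query."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # share of retrieved that is relevant
    recall = hits / len(relevant) if relevant else 0.0       # share of relevant that was retrieved
    return precision, recall


# Hypothetical chunk ids: 4 chunks retrieved, 3 chunks are actually relevant
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c9"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Retrieving more chunks tends to raise recall and lower precision, which is the trade-off to keep in mind when choosing k.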

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

df = result.to_pandas()
df