# Retriever & RAG evaluation with RAGAS

Part-1:

1. Chunk PDF file
2. Create a vector database

Part-2:

3. Setup RAG pipeline
4. Generate responses for test inputs
5. Create eval dataset & save to file system


Amazon 10-K filing:
https://ir.aboutamazon.com/sec-filings/default.aspx

#### Dependencies
!pip install sentence-transformers chromadb

## Utility functions

In [1]:
# def print_metadata(result_docs):
#     for result in result_docs:
#         print(result.metadata)



## Part-1 : Retriever : ChromaDB

### 1. Load document and chunk

Use the PyPDFLoader to read the PDF and the RecursiveCharacterTextSplitter to split it into chunks.

In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Reads the PDF and splits it into chunks of the specified size and overlap
def create_chunks_from_pdf(pdf_source, chunk_size, chunk_overlap):
    # pdf_source is the path to the PDF, for example:
    # pdf_source = "octank_financial_10K.pdf"
    # pdf_source = "amzn-10k/amz-10k-2024.pdf"

    # Use the PDF loader
    pdf_loader = PyPDFLoader(pdf_source)

    # Load the documents
    documents = pdf_loader.load()

    # Text splitter
    pdf_text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap,
        keep_separator = False,
        strip_whitespace = True,
    )

    # Chunked docs
    chunked_documents = pdf_text_splitter.split_documents(documents)

    return chunked_documents
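
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification for illustration only: `sliding_window_chunks` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` works, since that splitter prefers to break on separators such as paragraphs and sentences before falling back to smaller units.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the effect the overlap is meant to have: text near a chunk boundary stays retrievable from either side.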

### 2. ChromaDB Vectorstore : Generate the embeddings for the chunks & add them to the store

In this sample we are using ChromaDB, but you may use any vector store.

https://python.langchain.com/docs/integrations/vectorstores/chroma



#### Parameters (check the docs for additional parameters):
https://api.python.langchain.com/en/stable/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma

* collection_name (optional, defaults to "langchain")
* embedding_function (optional)
* persist_directory (optional)
* collection_metadata (optional)
* client (optional)


In [3]:
from langchain_community.vectorstores import Chroma


def setup_chroma_db(embedding_function, chunked_documents, collection_name, collection_metadata):

    # Create the collection; the chunks are embedded with the supplied
    # embedding_function (all-MiniLM-L6-v2 in this notebook) when they are added
    vector_store = Chroma(
        collection_name=collection_name,
        collection_metadata=collection_metadata,
        embedding_function=embedding_function,
    )
    vector_store.add_documents(chunked_documents)

    return vector_store
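
As a rough illustration of what any vector store does at query time, the sketch below ranks stored texts by cosine similarity to the query. Everything in it is a hypothetical stand-in: `toy_embed` is a bag-of-words counter rather than a sentence-transformer model, and `ToyVectorStore` is not the Chroma API, only the same idea at toy scale.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (embedding, text) pairs in memory and ranks them per query."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((toy_embed(text), text))

    def similarity_search(self, query, k=2):
        query_vector = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "net sales increased in north america",
    "aws operating income grew",
    "inventory risk is managed with forecasts",
])
top = store.similarity_search("how did aws operating income change", k=1)
print(top)
```

A real store replaces the word counts with dense model embeddings and the linear scan with an index, but the `k` nearest texts are still what the retriever hands to the RAG prompt.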


## Part-2 : RAG pipeline for processing


In [4]:
from IPython.display import JSON
from dotenv import load_dotenv
import os
from langchain.prompts import PromptTemplate
import warnings
import sys
import json
warnings.filterwarnings("ignore")


# Load the file that contains the API keys (adjust this path for your environment)
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

# Add the parent directory to the import path so that the utils package can be imported
sys.path.append('../')

### 3. Setup RAG pipeline

The RAG pipeline generates the response for each test question.

In [5]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate


def setup_rag_pipeline(rag_llm):
    # Set up the prompt
    prompt = PromptTemplate(
        template=""" You are a smart agent who uses only the provided context to answer the given question. 
                     Your answers are concise and to the point.
                     Keep your responses under 1000 characters in length.
                     \n\n Question: \n {question} 
                     \n\n Context: \n {context}
        """,
        input_variables=["question", "context"]
    )
    
    
    
    # Setup RAG pipeline
    rag_pipeline = prompt | rag_llm

    return rag_pipeline
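
Conceptually, `prompt | rag_llm` is a two-stage chain: the first stage fills the template placeholders, the second sends the rendered text to the model. The sketch below mimics that flow in plain Python under stated assumptions: `PROMPT_TEMPLATE` is a condensed copy of the template text, and `fake_llm` and `run_pipeline` are hypothetical stand-ins, not LangChain objects.

```python
PROMPT_TEMPLATE = (
    "You are a smart agent who uses only the provided context to answer the given question.\n"
    "Your answers are concise and to the point.\n"
    "Keep your responses under 1000 characters in length.\n\n"
    "Question:\n{question}\n\n"
    "Context:\n{context}\n"
)


def fake_llm(prompt_text: str) -> str:
    """Hypothetical stand-in for rag_llm: reports the prompt size instead of calling a model."""
    return f"(stub answer for a prompt of {len(prompt_text)} characters)"


def run_pipeline(inputs: dict) -> str:
    # Stage 1: fill the template placeholders, as the prompt stage does
    prompt_text = PROMPT_TEMPLATE.format(**inputs)
    # Stage 2: hand the rendered prompt to the model stage
    return fake_llm(prompt_text)


inputs = {"question": "What is AWS?", "context": "AWS is a segment."}
rendered = PROMPT_TEMPLATE.format(**inputs)
answer = run_pipeline(inputs)
print(rendered)
print(answer)
```

Printing the rendered prompt like this is a cheap way to confirm that the question and the retrieved context land where the template expects them before spending tokens on a real model.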

### 4. Generate responses for test inputs

In [6]:
# Utility function
def split_eval_input_reference(user_input_reference_file_path):
    user_input = []
    reference = []

    # Explicit encoding avoids platform-dependent defaults
    with open(user_input_reference_file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    for item in data:
        user_input.append(item.get("user_input"))  # .get returns None if the key is missing
        reference.append(item.get("reference"))    # .get returns None if the key is missing

    return user_input, reference

# Read the user_input_reference file and split it into two lists
# filepath = "amzn-10k/amz-10k-2024-QA.json"
# user_input, reference = split_eval_input_reference(filepath)
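
The function above expects a JSON array of objects with `user_input` and `reference` keys. The following sketch writes a small file in that shape, reads it back the same way, and flags questions that have no reference. The records and the file name are hypothetical placeholders, not content from the actual QA file.

```python
import json
import os
import tempfile

# Hypothetical sample records in the expected shape
sample_pairs = [
    {"user_input": "What is the fiscal year?", "reference": "Placeholder reference answer."},
    {"user_input": "A question with no reference yet"},  # "reference" key is missing
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-QA.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample_pairs, f)

    with open(sample_path, "r", encoding="utf-8") as f:
        sample_data = json.load(f)

# Same splitting logic as the utility function: .get yields None for a missing key
sample_questions = [item.get("user_input") for item in sample_data]
sample_references = [item.get("reference") for item in sample_data]

# Questions without a reference cannot be scored by reference-based metrics
missing_reference = [q for q, r in zip(sample_questions, sample_references) if r is None]
print(missing_reference)
# ['A question with no reference yet']
```

Because `.get` silently returns `None`, a check like `missing_reference` is worth running once on the real file so that gaps surface before the evaluation step rather than during it.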

In [7]:
def create_context_from_docs(result_docs):
    context = ''
    for result in result_docs:
        context = context + "\n" + result.page_content
    return context

In [8]:
# Note: uses the notebook-level vector_store and rag_pipeline created in the step 5 cell
def generate_context_response_evaluation_dataset(k, user_input, reference):
    # Gather the retrieved contexts and the response for each user_input
    retrieved_contexts = []
    responses = []
    
    for question in user_input:
        print("RAG pipeline processing: ", question)
        
        result_docs = vector_store.similarity_search(question, k = k)
        # result_docs = vector_store.max_marginal_relevance_search(question, k = k, lambda_mult = 0.5)
        
        context = create_context_from_docs(result_docs)
    
        # Create a list of contexts retrieved from the vector store
        context_list = [doc.page_content for doc in result_docs]

        retrieved_contexts.append(context_list)
        response = rag_pipeline.invoke(input={"question": question, "context": context})
        # A chat model returns a message object; keep only its text so the dataset can be saved as JSON
        responses.append(getattr(response, "content", response))
    
    # Create the evaluation dataset
    eval_dataset = {
        "user_input" : user_input,
        "retrieved_contexts"  : retrieved_contexts,
        "reference" : reference,
        "response" : responses
    }

    return eval_dataset
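
The dataset built above is column-oriented: four parallel lists, one entry per question. Evaluation tooling often works on one record per sample instead, so the sketch below shows one way to check that the columns line up and to pivot them into rows. `columns_to_rows` and the toy values are hypothetical and only illustrate the shape.

```python
def columns_to_rows(dataset: dict) -> list[dict]:
    """Turn {"col": [v1, v2, ...]} into [{"col": v1}, {"col": v2}, ...]."""
    lengths = {key: len(values) for key, values in dataset.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    keys = list(dataset)
    return [dict(zip(keys, row)) for row in zip(*dataset.values())]


# Toy values standing in for real questions, contexts, references and responses
toy_dataset = {
    "user_input": ["Q1", "Q2"],
    "retrieved_contexts": [["c1a", "c1b"], ["c2a"]],
    "reference": ["R1", "R2"],
    "response": ["A1", "A2"],
}

rows = columns_to_rows(toy_dataset)
print(rows[0])
# {'user_input': 'Q1', 'retrieved_contexts': ['c1a', 'c1b'], 'reference': 'R1', 'response': 'A1'}
```

The length check matters because the four lists are filled independently; if one question fails midway, a silent misalignment would pair responses with the wrong references.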



### 5. Create eval dataset & save to file system

In [9]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

from utils.create_llm import create_gpt_llm, create_cohere_llm, create_ollama_llm, create_hugging_face_llm, create_google_llm, create_ai21_llm

# Initialize the inputs
pdf_source = "amzn-10k/amz-10k-2024.pdf"                          # Document to be indexed
user_input_reference_file_path="amzn-10k/amz-10k-2024-QA.json"    # A JSON file with user_input and reference data

# Setup chunking
chunk_size = 900
chunk_overlap = 150
k=5

# 1. Chunk the docs
chunked_documents = create_chunks_from_pdf(pdf_source, chunk_size, chunk_overlap)
print("Number of chunks: ", len(chunked_documents))

# 2. Create the vector store
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
collection_metadata = {'embedding': 'all-MiniLM-L6-v2', 'chunk_size': chunk_size, 'chunk_overlap': chunk_overlap}
vector_store = setup_chroma_db(embedding_function, chunked_documents, "financial-analysis", collection_metadata)

# 3. Setup the LLM for RAG pipeline
rag_llm = create_cohere_llm(args={"temperature":0})
rag_pipeline = setup_rag_pipeline(rag_llm)

# 4. Read the eval input and references, then generate the contexts and responses
user_input, reference = split_eval_input_reference(user_input_reference_file_path)
eval_dataset = generate_context_response_evaluation_dataset(k, user_input, reference)



Number of chunks:  456
RAG pipeline processing:  What are Amazon's primary revenue sources?
RAG pipeline processing:  How much did Amazon's net sales grow in 2024?
RAG pipeline processing:  What is the operating income for 2024?
RAG pipeline processing:  How does Amazon manage its inventory risk?
RAG pipeline processing:  What is the significance of AWS to Amazon's overall business?
RAG pipeline processing:  How does Amazon handle cybersecurity risks?
RAG pipeline processing:  What are Amazon's major operating expenses?
RAG pipeline processing:  How does Amazon manage its cash flow?
RAG pipeline processing:  What are Amazon's key risks in international operations?
RAG pipeline processing:  How does Amazon address competition in its markets?


In [10]:
# Output file used for evaluation
eval_filepath = f"amzn-10k/amz-10k-2024-eval-dataset-cohere-chunk{chunk_size}-overlap{chunk_overlap}-k{k}.json" 

# Save the dataset to file system
with open(eval_filepath, 'w', encoding='utf-8') as f:  # 'w' for writing, utf-8 encoding
    json.dump(eval_dataset, f, indent=4, ensure_ascii=False)  # indent for pretty printing; ensure_ascii=False writes non-ASCII characters as-is

print("Created evaluation dataset file : ", eval_filepath)

Created evaluation dataset file :  amzn-10k/amz-10k-2024-eval-dataset-cohere-chunk900-overlap150-k5.json
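
Before handing the saved file to an evaluation run, it can help to load it back and sanity-check its structure. The sketch below does a round trip through a temporary file with toy data; `check_eval_dataset` is a hypothetical helper, and the checks it performs (expected keys, equal column lengths, non-empty contexts) are assumptions about what is worth verifying, not requirements taken from RAGAS.

```python
import json
import os
import tempfile

EXPECTED_KEYS = {"user_input", "retrieved_contexts", "reference", "response"}


def check_eval_dataset(dataset: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    missing = EXPECTED_KEYS - dataset.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    lengths = {key: len(dataset[key]) for key in sorted(EXPECTED_KEYS)}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")
    for i, contexts in enumerate(dataset["retrieved_contexts"]):
        if not contexts:
            problems.append(f"row {i} has no retrieved contexts")
    return problems


# Toy dataset standing in for the real one
toy_eval = {
    "user_input": ["Q1"],
    "retrieved_contexts": [["c1a", "c1b"]],
    "reference": ["R1"],
    "response": ["A1"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy-eval-dataset.json")
    with open(toy_path, "w", encoding="utf-8") as f:
        json.dump(toy_eval, f, indent=4, ensure_ascii=False)
    with open(toy_path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

problems = check_eval_dataset(reloaded)
print(problems)
# []
```

Running the same check against the real file path would catch a truncated or partially written dataset early, which is cheaper than discovering it through a failed metric computation.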
