# RAG Advanced Ingestion
In this notebook, we will cover how to apply advanced techniques to the ingestion pipeline of a retrieval-augmented generation (RAG) system!

# Notebook Setup

In [1]:
# Performing the necessary pip installs
import os
if 'KAGGLE_URL_BASE' in os.environ:
    from pip_install import perform_pip_install
    # perform_pip_install()

In [13]:
# Importing the necessary Python libraries
import json
import yaml

import pandas as pd
from datasets import Dataset
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision
)

In [3]:
# Loading the API keys from Kaggle Secrets
if 'KAGGLE_URL_BASE' in os.environ:
    from load_api_keys import load_api_keys
    api_keys = load_api_keys()
    
# Loading the API keys from local file
else:
    with open('../keys/api_keys.yaml', 'r') as file:
        api_keys = yaml.safe_load(file)

In [4]:
# Loading in the sample datasets from file
if 'KAGGLE_URL_BASE' in os.environ:
    df_kis = pd.read_csv('/kaggle/input/synthetic-it-related-knowledge-items/synthetic_knowledge_items.csv')
    df_validation = pd.read_csv('/kaggle/input/sample-rag-knowledge-item-dataset/rag_sample_qas_from_kis.csv')
else:
    df_kis = pd.read_csv('../data/synthetic_knowledge_items.csv')
    df_validation = pd.read_csv('../data/rag_sample_qas_from_kis.csv')

In [5]:
# Dropping alt_ki_text from the df_kis DataFrame
df_kis.drop(columns = ['alt_ki_text'], inplace = True)

# Viewing the first few rows of the knowledge item dataframe
df_kis.head()

Unnamed: 0,ki_topic,ki_text
0,Setting Up a Mobile Device for Company Email,**Setting Up a Mobile Device for Company Email...
1,Resetting a Forgotten PIN,**Resetting a Forgotten PIN**\n\nIf you have f...
2,Configuring VPN Access for Remote Workers,**Configuring VPN Access for Remote Workers**\...
3,Troubleshooting Issues with Microsoft Office,**Troubleshooting Issues with Microsoft Office...
4,Setting Up a Conference Call on Cisco Webex,"To set up a conference call on Cisco Webex, fo..."


In [6]:
# Dropping any unnecessary columns from the validation DataFrame
df_validation.drop(columns = ['ki_topic', 'ki_text'], inplace = True)

# Renaming the remaining columns
df_validation.rename(columns = {
    'sample_question': 'question',
    'sample_ground_truth': 'ground_truth'
}, inplace = True)

# Viewing the first few rows of the validation dataframe
df_validation.head()

Unnamed: 0,question,ground_truth
0,"""How do I set up my company email on my mobile...",To set up your company email on your mobile de...
1,"I forgot my PIN, how can I reset it?","Don't worry, I'm here to help To reset your fo..."
2,How do I set up VPN access on my laptop so I c...,To set up VPN access on your laptop and access...
3,"""My Microsoft Word keeps freezing every time I...",I'd be happy to help you troubleshoot the issu...
4,How do I set up a conference call on Cisco Web...,To set up a conference call on Cisco Webex wit...


# Chunking Strategies

In [7]:
# Creating the ground truth simulation prompt template
ANSWER_GENERATION_PROMPT = '''You are an expert evaluator for question-answering systems. Your task is to provide the ideal answer based on the given question and context. Please follow these guidelines:

1. Question: {question}

2. Context: {context}

3. Instructions:
   - Carefully analyze the question and the provided context.
   - Formulate a comprehensive and accurate answer based solely on the information given in the context.
   - Ensure your answer directly addresses the question.
   - Include all relevant information from the context, but do not add any external knowledge.
   - If the context doesn't contain enough information to fully answer the question, state this clearly and provide the best possible partial answer.
   - Use a formal, objective tone.

Remember, your goal is to provide the ideal answer that should be used as the benchmark for evaluating the AI's performance.'''

In [17]:
# Setting the OpenAI API key as an environment variable
os.environ['OPENAI_API_KEY'] = api_keys['OPENAI_API_KEY']

# Setting up the embedding algorithm
embedding_algorithm = OpenAIEmbeddings()

# Setting up the chat model
chat_model = ChatOpenAI(model = 'chatgpt-4o-latest')

# Creating the prompt engineering template to generate the simulated ground truth
answer_generation_prompt = ChatPromptTemplate.from_messages(messages = [
    HumanMessagePromptTemplate.from_template(template = ANSWER_GENERATION_PROMPT)
])

# Creating the inference chain to generate the simulated answer
answer_generation_chain = answer_generation_prompt | chat_model

In [18]:
def pandas_to_ragas(df):
    '''
    Converts a Pandas DataFrame into a Ragas-compatible dataset
    
    Inputs:
        - df (Pandas DataFrame): The input DataFrame to be converted
        
    Returns:
        - ragas_testset (Hugging Face Dataset): A Hugging Face dataset compatible with the Ragas framework
    '''
    # Ensure all text columns are strings and handle NaN values
    text_columns = ['question', 'ground_truth', 'answer']
    for col in text_columns:
        df[col] = df[col].fillna('').astype(str)
        
    # Convert 'contexts' to a list of lists
    df['contexts'] = df['contexts'].fillna('').astype(str).apply(lambda x: [x] if x else [])
    
    # Converting the DataFrame to a dictionary
    data_dict = df[['question', 'contexts', 'answer', 'ground_truth']].to_dict('list')
    
    # Loading the dictionary as a Hugging Face dataset
    ragas_testset = Dataset.from_dict(data_dict)
    
    return ragas_testset
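
As a quick illustration of the layout that `pandas_to_ragas` produces, the sketch below builds the same four columns from plain Python lists, without pandas or Hugging Face `datasets`. The helper name `to_context_list` and the sample rows are hypothetical; only the column names and the wrapping of a single context string in a list come from the function above.

```python
def to_context_list(context):
    """Wrap a single context string in a list, mirroring the lambda in pandas_to_ragas."""
    # Missing values become an empty string first, then an empty list
    text = '' if context is None else str(context)
    return [text] if text else []

# Hypothetical rows, for illustration only
rows = [
    {
        'question': 'How do I reset my PIN?',
        'contexts': 'Open the PIN reset page.',
        'answer': 'Use the reset page.',
        'ground_truth': 'Use the reset page.',
    },
    {
        'question': 'How do I set up VPN?',
        'contexts': None,
        'answer': '',
        'ground_truth': 'Install the VPN client.',
    },
]

# Column-oriented dictionary, the same shape that Dataset.from_dict receives
data_dict = {
    'question': [row['question'] for row in rows],
    'contexts': [to_context_list(row['contexts']) for row in rows],
    'answer': [row['answer'] for row in rows],
    'ground_truth': [row['ground_truth'] for row in rows],
}

print(data_dict['contexts'])
```

Each entry in `contexts` is a list of strings, so a question with no retrieved context ends up with an empty list rather than an empty string.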

In [None]:
# Defining different chunk sizes to experiment with
chunk_sizes = [100, 200, 500, 1000]

# Initializing dictionary to store FAISS indexes and DataFrame for evaluation results
faiss_indexes = {}
df_chunking_results = pd.DataFrame(columns = ['chunk_size', 'answer_correctness', 'answer_relevancy', 'faithfulness', 'context_recall', 'context_precision'])
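
To build some intuition for what these chunk sizes mean, here is a minimal sketch of fixed-width character chunking with overlap. It is a simplification and an assumption on my part: `RecursiveCharacterTextSplitter` splits on separators such as paragraph and line breaks before falling back to smaller pieces, so its chunk counts will differ. The function name `split_fixed_width`, the overlap default of 20, and the filler text are hypothetical.

```python
def split_fixed_width(text, chunk_size, chunk_overlap=20):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Hypothetical stand-in for a knowledge item: 2,000 characters of filler text
sample_text = 'x' * 2000

for chunk_size in [100, 200, 500, 1000]:
    chunks = split_fixed_width(sample_text, chunk_size)
    print(f'chunk_size = {chunk_size}: {len(chunks)} chunks')
```

Smaller chunks mean more vectors to embed and search, with each one carrying less surrounding context; larger chunks mean fewer vectors that each carry more of the original document.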

In [26]:
# Iterating through each chunk size
for chunk_size in chunk_sizes:

    # Checking if the chunk size has already been evaluated
    if chunk_size not in df_chunking_results['chunk_size'].values:

        # Creating a text splitter with the current chunk size
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = chunk_size,
            chunk_overlap = 20,
            length_function = len
        )
        
        # Loading and splitting the documents
        documents = DataFrameLoader(df_kis, page_content_column = 'ki_text').load()
        chunks = text_splitter.split_documents(documents)
        
        # Creating FAISS index for the current chunk size
        faiss_index = FAISS.from_documents(chunks, embedding_algorithm)
        
        # Storing the index
        faiss_indexes[chunk_size] = faiss_index

        # Creating a retriever from the FAISS index
        retriever = faiss_index.as_retriever(search_kwargs = {'k': 1})
        
        # Generating answers using the retriever and answer generation chain
        df_validation['contexts'] = df_validation['question'].apply(lambda q: retriever.invoke(q)[0].page_content)
        df_validation['answer'] = df_validation.apply(
            lambda row: answer_generation_chain.invoke({
                'question': row['question'],
                'context': row['contexts']
            }).content,
            axis = 1
        )
        
        # Converting the DataFrame to a Ragas-compatible dataset
        ragas_testset = pandas_to_ragas(df_validation)
        
        # Evaluating the chunking strategy using ragas
        result = evaluate(
            dataset = ragas_testset,
            llm = chat_model,
            metrics = [
                answer_correctness,
                answer_relevancy,
                faithfulness,
                context_recall,
                context_precision
            ]
        )
        
        # Storing the evaluation results in the DataFrame
        new_row = pd.DataFrame({
            'chunk_size': [chunk_size],
            'answer_correctness': [result['answer_correctness']],
            'answer_relevancy': [result['answer_relevancy']],
            'faithfulness': [result['faithfulness']],
            'context_recall': [result['context_recall']],
            'context_precision': [result['context_precision']]
        })

        df_chunking_results = pd.concat([df_chunking_results, new_row], ignore_index = True)
    else:
        print(f"Chunk size {chunk_size} already evaluated. Skipping...")

print(f"Created FAISS indexes for chunk sizes: {list(faiss_indexes.keys())}")
print("Evaluation results:")
print(df_chunking_results)

Evaluating: 100%|██████████| 50/50 [01:18<00:00,  1.56s/it]
Evaluating: 100%|██████████| 50/50 [01:12<00:00,  1.44s/it]
Evaluating: 100%|██████████| 50/50 [00:57<00:00,  1.16s/it]
Evaluating: 100%|██████████| 50/50 [01:16<00:00,  1.54s/it]


Created FAISS indexes for chunk sizes: [100, 200, 500, 1000]
Evaluation results:
  chunk_size  answer_correctness  answer_relevancy  faithfulness  \
0        100            0.460825          0.675054      0.378456   
1        200            0.396596          0.671702      0.590422   
2        500            0.457008          0.774391      0.553076   
3       1000            0.617736          0.967614      0.698524   

   context_recall  context_precision  
0        0.314758                0.4  
1        0.335341                0.2  
2        0.278637                0.7  
3        0.413057                1.0  
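
One simple way to read the table above is to look up the best-scoring chunk size for each metric. The sketch below does this with plain Python, using the scores copied from this run's output; the variable names are my own. In this run, a chunk size of 1000 scores highest on all five metrics, though a different dataset or model could rank them differently.

```python
# Scores copied from the evaluation results table above
results = {
    100: {'answer_correctness': 0.460825, 'answer_relevancy': 0.675054, 'faithfulness': 0.378456,
          'context_recall': 0.314758, 'context_precision': 0.4},
    200: {'answer_correctness': 0.396596, 'answer_relevancy': 0.671702, 'faithfulness': 0.590422,
          'context_recall': 0.335341, 'context_precision': 0.2},
    500: {'answer_correctness': 0.457008, 'answer_relevancy': 0.774391, 'faithfulness': 0.553076,
          'context_recall': 0.278637, 'context_precision': 0.7},
    1000: {'answer_correctness': 0.617736, 'answer_relevancy': 0.967614, 'faithfulness': 0.698524,
           'context_recall': 0.413057, 'context_precision': 1.0},
}

metrics = ['answer_correctness', 'answer_relevancy', 'faithfulness', 'context_recall', 'context_precision']

# For each metric, find the chunk size with the highest score
best_by_metric = {
    metric: max(results, key=lambda size: results[size][metric])
    for metric in metrics
}

for metric, size in best_by_metric.items():
    print(f'{metric}: best chunk_size = {size}')
```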


In [28]:
# Saving the chunking experiment results to file
df_chunking_results.to_csv('../data/chunking_experiment_results.csv', index = False)

# Viewing the chunking experiment results
df_chunking_results.head()

# Optimized Indexing for Fast Retrieval
Now that we have covered chunking strategies, let's talk about indexing. To be completely transparent, I struggled to emulate this because pretty much every vector database comes with some sort of indexing algorithm built in! The one you will run into most often is **Hierarchical Navigable Small World (HNSW)**, a graph-based approximate nearest neighbor algorithm. There are other indexing options, but we won't cover them here. Again, this is because whether you choose Pinecone, Weaviate, AWS OpenSearch, Chroma, FAISS, or one of the many other options out there, you will generally find this kind of indexing optimization built in or available as a configuration option.

Instead, we'll focus on comparing retrieval speeds between an index with no special optimizations applied and the FAISS in-memory vector store that we used in the previous chunking section.
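
One way to picture an index with no special optimizations is an exhaustive scan: compare the query against every stored vector and keep the closest ones. The sketch below times such a scan in plain Python over random vectors. It is a toy illustration under my own assumptions (the function names, the vector count, the dimensionality, and the use of Euclidean distance are all hypothetical) and is not the FAISS comparison itself.

```python
import math
import random
import time

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query, vectors, k=1):
    """Return the indices of the k vectors closest to the query via exhaustive scan."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]

# Hypothetical data: 2,000 random vectors with 32 dimensions each
random.seed(42)
dim, n_vectors = 32, 2000
vectors = [[random.random() for _ in range(dim)] for _ in range(n_vectors)]

# Using a stored vector as the query, so its own index should come back first
query = vectors[123]

start = time.perf_counter()
nearest = brute_force_search(query, vectors, k=1)
elapsed = time.perf_counter() - start

print(f'Nearest index: {nearest[0]}, search took {elapsed * 1000:.2f} ms')
```

Every query touches all stored vectors, so search time grows linearly with the size of the collection. Approximate indexes such as HNSW trade a small amount of recall to avoid that full scan.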