# 📱 Building Robust RAG Systems step by step! 🤖

- In this exciting notebook, we'll walk through creating an advanced Retrieval Augmented Generation (RAG) system to intelligently answer questions about building effective RAG solutions.
- Get ready to level up your knowledge retrieval skills! 🚀

## Loading the Data...

- The items in blue in the diagram below show some of my early design choices.
- Thanks to the standardized, flexible LangChain APIs, I was able to experiment with alternatives 🔬

![image.png](./diagrams/langchain-rag-loader.png)

## Retrieving the Data...

- Important: the retriever must use the same embedding model that was used to load the data; otherwise the query vectors and the stored vectors are not comparable.

![image.png](./diagrams/langchain-rag-retriever.png)
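
One way to guard against a mismatch is to record the embedding model's name and vector size at load time and compare them before querying. The sketch below is illustrative only: `EmbeddingSpec` and `assert_compatible` are hypothetical names, not part of LangChain or Qdrant, and the recorded values are assumed for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of which embedding model produced a set of vectors."""
    model: str
    dimensions: int


def assert_compatible(indexed: EmbeddingSpec, query: EmbeddingSpec) -> None:
    """Raise if the retriever's embedding model differs from the one used at load time."""
    if indexed != query:
        raise ValueError(
            f"Collection was indexed with {indexed}, but the retriever uses {query}"
        )


# Spec recorded when the data was loaded (values assumed for illustration).
indexed_spec = EmbeddingSpec(model="text-embedding-3-small", dimensions=1536)

# Same model at query time: passes silently.
assert_compatible(indexed_spec, EmbeddingSpec("text-embedding-3-small", 1536))

# Different model at query time: rejected before any search is attempted.
try:
    assert_compatible(indexed_spec, EmbeddingSpec("text-embedding-3-large", 3072))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In practice the spec could live next to the collection name in the notebook's parameters, so changing one without the other fails loudly.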

## Imports

In [1]:
%pip install -qU pypdf pymupdf 
%pip install -qU langchain langchain-core langchain-community langchain-experimental langchain-text-splitters 
%pip install -qU langchain-openai langchain-cohere
%pip install -qU langchain-groq langchain-anthropic
%pip install -qU langchain-chroma langchain-qdrant langchain-pinecone faiss-cpu


Note: you may need to restart the kernel to use updated packages.


## 🛠️ Assembling Our AI Toolkit

In [2]:
import os
from langchain import hub
from langchain_groq import ChatGroq

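# ChatGroq and OpenAIEmbeddings read GROQ_API_KEY and OPENAI_API_KEY from the
# environment, so fail early with a clear message if either one is missing.
for key in ("GROQ_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set")

llm = ChatGroq(model="llama3-70b-8192", temperature=1)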

QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
QDRANT_API_URL = os.getenv("QDRANT_URL")

# LangSmith tracing: LANGCHAIN_ENDPOINT, LANGCHAIN_API_KEY, and LANGCHAIN_TRACING_V2
# are read directly from the environment, so only the project name is set here.
os.environ["LANGCHAIN_PROJECT"] = "RAG Architecture Amplified"

# Leverage a prompt from the LangChain hub
LLAMA3_PROMPT = hub.pull("rlm/rag-prompt-llama3")

In [3]:
# Notebook parameters: data source, collection name, and the test question

LOAD_NEW_DATA = True
# FILE_PATH = "https://singjupost.com/wp-content/uploads/2014/07/Steve-Jobs-iPhone-2007-Presentation-Full-Transcript.pdf"
# FILE_PATH = "https://arxiv.org/pdf/2309.15217"
# FILE_PATH = "https://arxiv.org/pdf/2405.17813"
# FILE_PATH = "https://arxiv.org/pdf/2406.05085"
FILE_PATH = "https://arxiv.org/pdf/2212.10496"
COLLECTION_NAME = "rag_evaluation"
QUESTION = "provide a step by step plan to guide companies in establishing a robust approach to evaluating Retrieval Augmented Generation (RAG) solutions."

## 🧩 Piecing Together the Perfect RAG System

Building a high-performance RAG system is like solving a complex puzzle. Each piece - the document loader, text splitter, embeddings, and vector store - must be carefully chosen to fit together seamlessly.

In this section, we'll walk through the key implementation choices we've made for each component, and how they contribute to a powerful, efficient, and flexible RAG solution.

### 📄 Intelligent Document Loading
- **PyMuPDFLoader**: For lightning-fast processing of complex PDFs 
- **UnstructuredHTMLLoader**: When web pages are the name of the game
- **CSVLoader**: Tabular data? No problem!

In [4]:
# Document Loader Concepts - https://python.langchain.com/v0.2/docs/concepts/#document-loaders
# PDF: https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/
# HTML:  https://python.langchain.com/v0.2/docs/how_to/document_loader_html/
# Microsoft Office files:  https://python.langchain.com/v0.2/docs/how_to/document_loader_office_file/
from langchain_community.document_loaders import (
    PyPDFLoader,
    PyMuPDFLoader,
    DirectoryLoader,
    UnstructuredHTMLLoader,
    BSHTMLLoader,
    SpiderLoader,
    JSONLoader,
    UnstructuredMarkdownLoader,
    CSVLoader,
)

In [5]:
# I chose the PyMuPDFLoader for its speed, ability to handle complex PDFs, and more extensive metadata.

DOCUMENT_LOADER = PyMuPDFLoader
# Alternatives (class objects, not strings, so DOCUMENT_LOADER(FILE_PATH) keeps working):
# DOCUMENT_LOADER = PyPDFLoader
# DOCUMENT_LOADER = DirectoryLoader
# DOCUMENT_LOADER = UnstructuredHTMLLoader
# DOCUMENT_LOADER = BSHTMLLoader
# DOCUMENT_LOADER = SpiderLoader
# DOCUMENT_LOADER = JSONLoader
# DOCUMENT_LOADER = UnstructuredMarkdownLoader
# DOCUMENT_LOADER = CSVLoader

### ✂️ Strategic Text Splitting
- **RecursiveCharacterTextSplitter**: The smart way to keep related info together
- **TokenTextSplitter**: For when token limits matter most
- **Hugging Face tokenizers**: Measure chunk length with the tokenizer your model actually uses (via `from_huggingface_tokenizer`)

In [6]:
# Text Splitters concepts - https://python.langchain.com/v0.2/docs/concepts/#text-splitters
# Splitting by Token using HF tokenizers:  https://python.langchain.com/v0.2/docs/how_to/split_by_token/#hugging-face-tokenizer
# Use of RecursiveCharacterTextSplitter to split code - https://python.langchain.com/v0.2/docs/how_to/code_splitter/
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveJsonSplitter,
    Language,
)
from langchain_experimental.text_splitter import SemanticChunker

In [7]:
# select the text splitter to use
# worth investigating using the RecursiveCharacterTextSplitter with the length_function based on a tokenizer VS the TokenTextSplitter

TEXT_SPLITTER = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
    )
# Alternatives: each must be instantiated with its own arguments before use, as above
# TEXT_SPLITTER = TokenTextSplitter
# TEXT_SPLITTER = MarkdownHeaderTextSplitter
# TEXT_SPLITTER = RecursiveJsonSplitter
# TEXT_SPLITTER = SemanticChunker

### 🪢 Powerful Embeddings
- **OpenAIEmbeddings**: Harnessing the power of cutting-edge language models
- **CohereEmbeddings**: When diversity and flexibility are key

In [8]:
# Embedding Model Concepts - https://python.langchain.com/v0.2/docs/concepts/#embedding-models
# Text Embedding Models - https://python.langchain.com/v0.2/docs/how_to/embed_text/
# Hugging Face embeddings supported through langchain-huggingface python library
# note ability to cache embeddings

from langchain_openai import OpenAIEmbeddings
from langchain_cohere import CohereEmbeddings


In [9]:
# select the embedding model to use
EMBEDDING_MODEL = OpenAIEmbeddings(
    model="text-embedding-3-small"
    )
# EMBEDDING_MODEL = CohereEmbeddings()
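
The caching note in the imports cell can be illustrated with a small wrapper that memoizes embeddings by content hash, so re-running the notebook does not re-embed unchanged chunks. This is a sketch under assumptions: `CachedEmbedder` and `fake_embed` are hypothetical stand-ins (a real setup would call the embedding model's `embed_documents`), and the cache here lives only in memory.

```python
import hashlib
from typing import Callable, Dict, List


class CachedEmbedder:
    """Hypothetical wrapper that embeds each distinct text only once."""

    def __init__(self, embed_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hash the content so identical chunks share one cache entry.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Send only uncached texts to the model (deduplicated, order preserved).
        missing = list(dict.fromkeys(t for t in texts if self._key(t) not in self._cache))
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(t)] for t in texts]


calls: List[List[str]] = []  # records every batch sent to the underlying model


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call; returns toy 2-d vectors."""
    calls.append(list(texts))
    return [[float(len(t)), float(t.count(" "))] for t in texts]


embedder = CachedEmbedder(fake_embed)
first = embedder.embed_documents(["hello world", "rag"])
second = embedder.embed_documents(["rag", "new chunk"])  # "rag" is served from the cache
```

LangChain also provides a `CacheBackedEmbeddings` wrapper for persistent caching; check its documentation for the supported stores before relying on it.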

### 🗄️ Blazing-Fast Vector Stores
- **Qdrant**: The high-performance, scalable choice for demanding workloads
- **Chroma**: A lightweight, open-source option that is easy to run locally while prototyping
- **Pinecone**: Fully-managed simplicity and reliability at scale

In [10]:
# import vector stores - https://python.langchain.com/v0.2/docs/concepts/#vector-stores
# after installing additional python dependencies, I started seeing protobuf errors with the Chroma vector store
from qdrant_client import QdrantClient

# from langchain_chroma import Chroma
from langchain_qdrant import Qdrant
from langchain_pinecone import Pinecone
from langchain_community.vectorstores import FAISS



## Initialize the Vector Store client

In [11]:
# Create a Qdrant client instance
client = QdrantClient(url=QDRANT_API_URL, api_key=QDRANT_API_KEY, prefer_grpc=True)

# Initialize the Qdrant vector store
qdrant = Qdrant(
    client=client,
    collection_name=COLLECTION_NAME,
    embeddings=EMBEDDING_MODEL
)

## 🆕 Time for New Docs? Let's Check!

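The `LOAD_NEW_DATA` flag is a key part of our simple data ingestion pipeline. When set to `True`, the cells below load, split, and store new documents; when `False`, those cells are skipped and the retriever works against whatever is already in the collection.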

### 📥 Ingesting Fresh Docs: Embracing Adaptability 

By using a flag like `LOAD_NEW_DATA`, we can control when new data is ingested without modifying the code itself. This supports rapid experimentation and iteration, as we can test our RAG system with different datasets by simply toggling the flag.
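
The notebook sets the flag as a constant in the parameters cell. If you would rather toggle ingestion without editing any cell, one option is to read it from an environment variable. The helper `env_flag`, the accepted spellings, and the variable name below are assumptions for illustration.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Set here only so the example is self-contained; normally this comes from the shell.
os.environ["LOAD_NEW_DATA"] = "false"

load_new_data = env_flag("LOAD_NEW_DATA", default=True)
print(f"load_new_data: {load_new_data}")
```

With this in place, running the notebook with `LOAD_NEW_DATA=true` in the environment would re-ingest, and any other value would reuse the existing collection.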

In this case, we're using `PyMuPDFLoader` to load a PDF file, but the beauty of this setup is that we can easily switch to other loaders like `UnstructuredHTMLLoader` for HTML files or `CSVLoader` for CSV data by changing the `DOCUMENT_LOADER` variable. This flexibility is crucial for adapting our pipeline to experiment with various data sources.
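
If the data sources vary, the loader choice can also be derived from the file name instead of being edited by hand. The sketch below only picks a loader class name; the mapping and the `pick_loader_name` helper are illustrative assumptions, and extension-less URLs (such as the arXiv link used here) fall back to a default.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension -> loader class name in langchain_community.document_loaders.
# The mapping itself is an assumption for illustration; adjust it to your data.
LOADER_BY_EXTENSION = {
    ".pdf": "PyMuPDFLoader",
    ".html": "UnstructuredHTMLLoader",
    ".htm": "UnstructuredHTMLLoader",
    ".csv": "CSVLoader",
    ".md": "UnstructuredMarkdownLoader",
}


def pick_loader_name(path_or_url: str, default: str = "PyMuPDFLoader") -> str:
    """Return the loader class name for a path or URL, falling back to the default."""
    suffix = PurePosixPath(urlparse(path_or_url).path).suffix.lower()
    return LOADER_BY_EXTENSION.get(suffix, default)


for source in ("https://arxiv.org/pdf/2212.10496", "notes/page.HTML", "data/results.csv"):
    print(f"{source} -> {pick_loader_name(source)}")
```

The returned name can then be resolved to a class with `getattr` on the `langchain_community.document_loaders` module. Some loaders need more than a path, so a real registry would likely store constructor arguments alongside each name.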

In [12]:
# run loader if LOAD_NEW_DATA is True
if LOAD_NEW_DATA:
    loader = DOCUMENT_LOADER(FILE_PATH)
    docs = loader.load()

In [13]:
# Document Loader validation
if LOAD_NEW_DATA:
    print(f"len(docs): {len(docs)}")
    print(f"\ndocs[0].page_content[0:100]:\n{docs[0].page_content[0:100]}")
    print(f"\ndocs[0].metadata):\n{docs[0].metadata}")

    print(f"\ndocs[1].page_content[0:100]:\n{docs[1].page_content[0:100]}")
    print(f"\ndocs[1].metadata):\n{docs[1].metadata}")

    print(f"\ndocs[-2].page_content[0:100]:\n{docs[-2].page_content[0:100]}")
    print(f"\ndocs[-2].metadata):\n{docs[-2].metadata}")

    print(f"\ndocs[-1].page_content[0:100]:\n{docs[-1].page_content[0:100]}")
    print(f"\ndocs[-1].metadata):\n{docs[-1].metadata}")

len(docs): 11

docs[0].page_content[0:100]:
Precise Zero-Shot Dense Retrieval without Relevance Labels
Luyu Gao∗†
Xueguang Ma∗‡
Jimmy Lin‡
Jamie

docs[0].metadata):
{'source': 'https://arxiv.org/pdf/2212.10496', 'file_path': 'https://arxiv.org/pdf/2212.10496', 'page': 0, 'total_pages': 11, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20221221014304Z', 'modDate': 'D:20221221014304Z', 'trapped': ''}

docs[1].page_content[0:100]:
HyDE
GPT
Contriever
how long does it take to remove
wisdom tooth
It usually takes between 30
minutes

docs[1].metadata):
{'source': 'https://arxiv.org/pdf/2212.10496', 'file_path': 'https://arxiv.org/pdf/2212.10496', 'page': 1, 'total_pages': 11, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'LaTeX with hyperref', 'producer': 'pdfTeX-1.40.21', 'creationDate': 'D:20221221014304Z', 'modDate': 'D:20221221014

## ✂️ Intelligent Text Splitting

Once our data is loaded, the next step is splitting it into manageable chunks. We're using the `RecursiveCharacterTextSplitter` for this, which intelligently splits text while keeping related pieces together.

The splitter works by recursively dividing the text on an ordered list of separators (by default paragraph breaks, then newlines, then spaces, and finally individual characters) until each chunk is within our desired `chunk_size`. The `chunk_overlap` parameter repeats some text between neighboring chunks to maintain context.

By adjusting these parameters, we can fine-tune the output to suit our specific use case. For example, a larger `chunk_size` results in fewer, longer chunks, while more `chunk_overlap` helps preserve context across chunks.
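
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified, pure-Python sketch of the idea. It is not LangChain's implementation: `split_recursive`, `merge_pieces`, and `chunk_text` are hypothetical helpers, and unlike the real splitter this version always re-joins pieces with a single space.

```python
from typing import List, Sequence


def split_recursive(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Break text into pieces of at most chunk_size characters, coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces: List[str] = []
    for part in text.split(separators[0]):
        pieces.extend(split_recursive(part, chunk_size, separators[1:]))
    return pieces


def merge_pieces(pieces: List[str], chunk_size: int, chunk_overlap: int,
                 joiner: str = " ") -> List[str]:
    """Greedily pack pieces into chunks, repeating a tail of each chunk as overlap."""
    chunks: List[str] = []
    window: List[str] = []  # pieces that make up the chunk being built

    def joined_len(parts: List[str]) -> int:
        return len(joiner.join(parts))

    for piece in pieces:
        if window and joined_len(window + [piece]) > chunk_size:
            chunks.append(joiner.join(window))
            # Drop leading pieces until the remaining tail fits the overlap
            # budget and still leaves room for the next piece.
            while window and (joined_len(window) > chunk_overlap
                              or joined_len(window + [piece]) > chunk_size):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(joiner.join(window))
    return chunks


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    return merge_pieces(split_recursive(text, chunk_size), chunk_size, chunk_overlap)


sample = "aa bb cc dd ee"
no_overlap = chunk_text(sample, chunk_size=5, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=5, chunk_overlap=2)
larger_chunks = chunk_text(sample, chunk_size=8, chunk_overlap=0)

print(no_overlap)     # ['aa bb', 'cc dd', 'ee']
print(with_overlap)   # ['aa bb', 'bb cc', 'cc dd', 'dd ee']
print(larger_chunks)  # ['aa bb cc', 'dd ee']
```

The three outputs mirror the trade-off described above: overlap repeats text across chunk boundaries at the cost of more chunks, while a larger `chunk_size` yields fewer, longer chunks.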

In [14]:
if LOAD_NEW_DATA:
    text_splitter = TEXT_SPLITTER
    splits = text_splitter.split_documents(docs)

In [15]:
# Text splitter validation: inspect the chunks before they go into the vector store
if LOAD_NEW_DATA:
    print(f"len(splits): {len(splits)}")

    print(f"\nsplits[0]:\n{splits[0]}")
    print(f"\nsplits[1]:\n{splits[1]}")
    print(f"\nsplits[-2]:\n{splits[-2]}")
    print(f"\nsplits[-1]:\n{splits[-1]}")

    for i, split in enumerate(splits):
        print(f"\nSplit # {i}:")
        # print page number from split.metadata

        print(f"split.metadata.get('page'): {split.metadata.get('page')}")
        print(f"len(splits[{i}]): {len(split.page_content)}")
        print(f"splits[{i}][0:25]: {split.page_content[0:25]}")

len(splits): 53

splits[0]:
page_content='Precise Zero-Shot Dense Retrieval without Relevance Labels\nLuyu Gao∗†\nXueguang Ma∗‡\nJimmy Lin‡\nJamie Callan†\n†Language Technologies Institute, Carnegie Mellon University\n‡David R. Cheriton School of Computer Science, University of Waterloo\n{luyug, callan}@cs.cmu.edu, {x93ma, jimmylin}@uwaterloo.ca\nAbstract\nWhile dense retrieval has been shown effec-\ntive and efﬁcient across tasks and languages,\nit remains difﬁcult to create effective fully\nzero-shot dense retrieval systems when no rel-\nevance label is available.\nIn this paper, we\nrecognize the difﬁculty of zero-shot learning\nand encoding relevance.\nInstead, we pro-\npose to pivot through Hypothetical Document\nEmbeddings (HyDE). Given a query, HyDE ﬁrst\nzero-shot instructs an instruction-following\nlanguage model (e.g. InstructGPT) to gen-\nerate a hypothetical document.\nThe docu-\nment captures relevance patterns but is unreal\nand may contain false details. Then, an un-\nsu

## 🗄️ Supercharging Our RAG System with Qdrant

With our text split into manageable chunks, it's time to vectorize and store them for fast retrieval. That's where Qdrant comes in - a state-of-the-art vector database that offers unparalleled performance, scalability, and flexibility.

Qdrant uses the HNSW algorithm for fast approximate nearest-neighbor search. Qdrant reports up to 4x higher requests per second than alternatives and memory savings of up to 97% from its vector compression features, while its flexible storage options allow us to tune for our specific needs.

But Qdrant isn't just fast - it's also versatile. With support for filtered search (combining vector similarity with payload conditions), sparse vectors for hybrid search, and rich JSON payloads, Qdrant enables querying patterns that go beyond simple similarity search.

And with a robust set of enterprise features like multitenancy, access control, and backup/recovery, Qdrant is ready to scale with our RAG system as it grows.

By leveraging Qdrant's speed, efficiency, and flexibility, we're building a knowledge base that can rapidly retrieve the most relevant information for any query. Whether we're serving a small prototype or a massive production system, Qdrant has us covered.

So let's dive in and see how Qdrant can supercharge our RAG system! 🚀
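
HNSW is an approximate index; the exact search it approximates is simple to write down. The sketch below is a brute-force cosine-similarity search over toy 3-d vectors, meant only to show what "similarity search" computes. It is not how Qdrant is implemented, and the vectors and texts are made up.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(text, cosine_similarity(query, vector)) for text, vector in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "collection": (chunk text, embedding) pairs with invented values.
toy_index = [
    ("chunk about evaluation", [1.0, 0.0, 0.0]),
    ("chunk about retrieval", [0.0, 1.0, 0.0]),
    ("chunk about both", [1.0, 1.0, 0.0]),
]

hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```

Brute force scores every stored vector for every query, which is why large collections rely on an index such as HNSW to trade a little accuracy for much less work.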

In [16]:
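# Embed the chunks and store them in Qdrant.
# from_documents is a classmethod that builds its own client from url/api_key,
# so call it on the class rather than on the existing `qdrant` instance.
if LOAD_NEW_DATA:
    from_splits = Qdrant.from_documents(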
        url=QDRANT_API_URL,
        api_key=QDRANT_API_KEY,
        prefer_grpc=True,
        documents=splits,
        collection_name=COLLECTION_NAME,
        embedding=EMBEDDING_MODEL
    )

## 🔍 Implementing a Robust Vector Store Retriever

In [17]:
# Concepts:  https://python.langchain.com/v0.2/docs/concepts/#retrievers
# Vector Store as Retriever:  https://python.langchain.com/v0.2/docs/how_to/vectorstore_retriever/
# Including Similarity Search Scores:  https://python.langchain.com/v0.2/docs/how_to/add_scores_retriever/

retriever = qdrant.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5}
)
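
With `search_type="similarity_score_threshold"`, the retriever only returns hits whose relevance score is at least `score_threshold`. The filtering step can be pictured with a plain-Python sketch; `filter_by_score` and the scored hits below are hypothetical, and real scores depend on the embedding model and distance metric.

```python
from typing import List, Tuple


def filter_by_score(hits: List[Tuple[str, float]],
                    score_threshold: float) -> List[Tuple[str, float]]:
    """Keep hits scoring at or above the threshold, best first."""
    kept = [hit for hit in hits if hit[1] >= score_threshold]
    return sorted(kept, key=lambda hit: hit[1], reverse=True)


# Invented (chunk label, relevance score) pairs for illustration.
scored_hits = [
    ("intro", 0.42),
    ("evaluation metrics", 0.81),
    ("references", 0.18),
    ("RAGAS abstract", 0.67),
]

relevant = filter_by_score(scored_hits, score_threshold=0.5)
strict = filter_by_score(scored_hits, score_threshold=0.9)

print(relevant)  # [('evaluation metrics', 0.81), ('RAGAS abstract', 0.67)]
print(strict)    # []
```

The second call shows the failure mode to watch for: a threshold that is too strict leaves the LLM with no context at all, so it is worth checking how many documents come back for a few representative questions.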

## 🧠 Constructing the RAG Chain for Question Answering

In [18]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": LLAMA3_PROMPT | llm, "context": itemgetter("context")}
)
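
To make the data flow explicit, the chain can be read as three plain steps: fetch context for the question, carry both forward, and return the model's answer alongside the context it saw. The sketch uses stub functions (`fake_retriever`, `fake_prompt`, `fake_llm`) in place of the real components, so it mirrors only the shape of the inputs and outputs, not LangChain's execution model.

```python
from typing import Dict, List


def fake_retriever(question: str) -> List[str]:
    """Stand-in for the vector store retriever."""
    return [f"passage relevant to: {question}"]


def fake_prompt(inputs: Dict) -> str:
    """Stand-in for the hub prompt: fills context and question into a template."""
    context = "\n".join(inputs["context"])
    return f"Context:\n{context}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt: str) -> str:
    """Stand-in for the chat model."""
    return f"answer based on {len(prompt)} prompt characters"


def run_rag(inputs: Dict[str, str]) -> Dict[str, object]:
    # Step 1: build {"context": retrieved docs, "question": original question}.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the assign step re-attaches the same context, so state is unchanged.
    state = {**state, "context": state["context"]}
    # Step 3: answer from prompt + model, and pass the context through for inspection.
    return {"response": fake_llm(fake_prompt(state)), "context": state["context"]}


result = run_rag({"question": "How do I evaluate RAG?"})
print(sorted(result))  # ['context', 'response']
```

In the real chain the `response` value is a chat message object rather than a string, which is why later cells read `response["response"].content`.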

In [19]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

                      +---------------------------------+                        
                      | Parallel<context,question>Input |                        
                      +---------------------------------+                        
                           ****                   ****                           
                       ****                           ***                        
                     **                                  ****                    
+--------------------------------+                           **                  
| Lambda(itemgetter('question')) |                            *                  
+--------------------------------+                            *                  
                 *                                            *                  
                 *                                            *                  
                 *                                            *                  
     +----------

## 🎉 Moment of Truth: Testing Our RAG System!

In [20]:
response = retrieval_augmented_qa_chain.invoke({"question" : QUESTION})

In [21]:
# Show the raw answer text: the "response" key holds an AIMessage, and .content is its text
response["response"].content


"Here is a step-by-step plan to guide companies in establishing a robust approach to evaluating Retrieval Augmented Generation (RAG) solutions:\n\n1. Identify the evaluation metrics: Define the key performance indicators (KPIs) that align with the business objectives, such as accuracy, relevance, and fluency.\n2. Select a reference-free evaluation framework: Utilize a framework like RAGAS, which provides a reference-free evaluation approach for RAG pipelines, considering multiple dimensions, including the retrieval system's ability to identify relevant context passages and the LLM's ability to exploit these passages.\n3. Automate the evaluation process: Develop an automated evaluation process that can assess the RAG system's performance on various metrics, ensuring consistency and reproducibility."

In [22]:
print(response["response"].content)

Here is a step-by-step plan to guide companies in establishing a robust approach to evaluating Retrieval Augmented Generation (RAG) solutions:

1. Identify the evaluation metrics: Define the key performance indicators (KPIs) that align with the business objectives, such as accuracy, relevance, and fluency.
2. Select a reference-free evaluation framework: Utilize a framework like RAGAS, which provides a reference-free evaluation approach for RAG pipelines, considering multiple dimensions, including the retrieval system's ability to identify relevant context passages and the LLM's ability to exploit these passages.
3. Automate the evaluation process: Develop an automated evaluation process that can assess the RAG system's performance on various metrics, ensuring consistency and reproducibility.


In [23]:
for i, context_instance in enumerate(response["context"]):
    print(f"\nvector store CONTEXT # {i}:")
    print(f"Page # : {context_instance.metadata.get('page')}")
    print(f"context.page_content:\n{context_instance.page_content}")
    print(f"context.metadata:\n{context_instance.metadata}")


vector store CONTEXT # 0:
Page # : 0
context.page_content:
RAGAS: Automated Evaluation of Retrieval Augmented Generation
Shahul Es†, Jithin James†, Luis Espinosa-Anke∗♢, Steven Schockaert∗
†Exploding Gradients
∗CardiffNLP, Cardiff University, United Kingdom
♢AMPLYFI, United Kingdom
shahules786@gmail.com,jamesjithin97@gmail.com
{espinosa-ankel,schockaerts1}@cardiff.ac.uk
Abstract
We introduce RAGAS (Retrieval Augmented
Generation Assessment), a framework for
reference-free evaluation of Retrieval Aug-
mented Generation (RAG) pipelines.
RAG
systems are composed of a retrieval and an
LLM based generation module, and provide
LLMs with knowledge from a reference textual
database, which enables them to act as a natu-
ral language layer between a user and textual
databases, reducing the risk of hallucinations.
Evaluating RAG architectures is, however, chal-
lenging because there are several dimensions to
consider: the ability of the retrieval system to
identify relevant and focused context p