# 📱 Building Robust RAG Systems Step by Step! 🤖

- In this exciting notebook, we'll walk through creating an advanced Retrieval Augmented Generation (RAG) system to intelligently answer questions about building effective RAG solutions.
- Get ready to level up your knowledge retrieval skills! üöÄ
- Check out the notes under the "Piecing Together the Perfect RAG System" section for links to the LangChain v0.2 docs. Watch out for those breaking changes!

### Things you'll learn üöÄ

- 🏗️ LangSmith - Set up a crude end-to-end framework to test and evaluate the RAG solution
- 🔍 Qdrant - initialize a vector store retriever that can run independently from the document loader 💪
- 🎨 LCEL - the ASCII art helps bring this concept home and confirms that the flow is set up properly 📚

### For next time...

- 🧠 Some of the key themes I'll do a deep dive on soon are captured [in this LangSmith Trace](https://smith.langchain.com/public/aebb43d1-5b39-42ad-ac22-f20c3efcda1b/r) 💡

## Loading the Data...

- the items in blue show some of my early decisions
- thanks to the standardization and flexibility of the LangChain APIs, I was able to experiment 🔬

![image.png](./diagrams/langchain-rag-loader.png)

## Retrieving the Data...

- important to remember to use the same embedding model for the retriever that was used to load the data: query vectors are only comparable to stored vectors produced by the same model

![image.png](./diagrams/langchain-rag-retriever.png)
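A minimal sketch of why the embedding model must match. The output sizes below (1536 for OpenAI `text-embedding-3-small`, 1024 for Cohere `embed-english-v3.0`) are taken from the providers' documentation and included as assumptions; `EMBEDDING_DIMS` and `embedding_mismatch` are hypothetical helpers, not LangChain or Qdrant APIs. Even when two models happen to share a vector size, their vectors live in unrelated spaces, so similarity scores across models are meaningless.

```python
from typing import Optional

# Assumed output sizes for two embedding models (hypothetical lookup table)
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,  # OpenAI
    "embed-english-v3.0": 1024,      # Cohere
}


def embedding_mismatch(collection_model: str, query_model: str) -> Optional[str]:
    """Describe the problem if load-time and query-time models differ, else return None."""
    if collection_model == query_model:
        return None
    return (
        f"collection embedded with {collection_model} "
        f"({EMBEDDING_DIMS.get(collection_model, '?')} dims), "
        f"queries embedded with {query_model} "
        f"({EMBEDDING_DIMS.get(query_model, '?')} dims)"
    )


print(embedding_mismatch("text-embedding-3-small", "embed-english-v3.0"))
```

A check like this could be run before building the retriever, for example by recording the model name alongside the collection name.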

## Lessons Learned

### Things I'm excited to learn next! üéâ


- go back and revisit some of the alternatives at each step of the RAG data pipeline (document loader, text splitter, embedding models)

## 🛠️ Assembling Our AI Toolkit

In [1]:
%pip install -qU pypdf pymupdf 
%pip install -qU langchain langchain-core langchain-community langchain-experimental langchain-text-splitters 
%pip install -qU langchain-openai langchain-cohere
%pip install -qU langchain-groq langchain-anthropic
%pip install -qU langchain-chroma langchain-qdrant langchain-pinecone faiss-cpu


Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from langchain import hub
from langchain_groq import ChatGroq

# Fail fast on missing credentials. Re-assigning os.environ[...] = os.getenv(...) is a
# no-op when a variable is set and raises a TypeError (None is not a str) when it isn't.
REQUIRED_VARS = ["GROQ_API_KEY", "OPENAI_API_KEY", "QDRANT_API_KEY", "QDRANT_URL", "LANGCHAIN_API_KEY"]
missing = [var for var in REQUIRED_VARS if not os.getenv(var)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {missing}")

# ChatGroq reads GROQ_API_KEY from the environment
llm = ChatGroq(model="llama3-70b-8192", temperature=0.1)

QDRANT_API_KEY = os.getenv("QDRANT_API_KEY")
QDRANT_API_URL = os.getenv("QDRANT_URL")

# LangSmith tracing
os.environ["LANGCHAIN_PROJECT"] = "AirBnB PDF Jun18"
os.environ.setdefault("LANGCHAIN_ENDPOINT", "https://api.smith.langchain.com")
os.environ.setdefault("LANGCHAIN_TRACING_V2", "true")

# Leverage a prompt from the LangChain hub
LLAMA3_PROMPT = hub.pull("rlm/rag-prompt-llama3")

In [43]:
# Notebook parameters: set LOAD_NEW_DATA = True to (re)ingest FILE_PATH into the COLLECTION_NAME collection

LOAD_NEW_DATA = False
# FILE_PATH = "https://singjupost.com/wp-content/uploads/2014/07/Steve-Jobs-iPhone-2007-Presentation-Full-Transcript.pdf"
# FILE_PATH = "https://arxiv.org/pdf/2309.15217"
# FILE_PATH = "https://arxiv.org/pdf/2405.17813"
# FILE_PATH = "https://arxiv.org/pdf/2406.05085"
# FILE_PATH = "https://arxiv.org/pdf/2212.10496"
FILE_PATH = "/home/donbr/aie3-bootcamp/AIE3/Week 3/Day 2/files/airbnb.pdf"
COLLECTION_NAME = "airbnb_pdf_rec_1000_200"
# QUESTION = "provide a step by step plan to guide companies in establishing a robust approach to evaluating Retrieval Augmented Generation (RAG) solutions."
# QUESTION = "What is Airbnb's 'Description of Business'?"
# QUESTION = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
QUESTION = "What is the 'maximum number of shares to be sold under the 10b5-1 Trading plan' by Brian Chesky?"

## 🧩 Piecing Together the Perfect RAG System

Building a high-performance RAG system is like solving a complex puzzle. Each piece (the document loader, text splitter, embeddings, and vector store) must be carefully chosen to fit together seamlessly.

In this section, we'll walk through the key implementation choices we've made for each component, and how they contribute to a powerful, efficient, and flexible RAG solution.

### 📄 Intelligent Document Loading
- **PyMuPDFLoader**: For lightning-fast processing of complex PDFs
- **UnstructuredHTMLLoader**: When web pages are the name of the game
- **CSVLoader**: Tabular data? No problem!

In [44]:
# Document Loader Concepts - https://python.langchain.com/v0.2/docs/concepts/#document-loaders
# PDF: https://python.langchain.com/v0.2/docs/how_to/document_loader_pdf/
# HTML:  https://python.langchain.com/v0.2/docs/how_to/document_loader_html/
# Microsoft Office files:  https://python.langchain.com/v0.2/docs/how_to/document_loader_office_file/
from langchain_community.document_loaders import (
    PyPDFLoader,
    PyMuPDFLoader,
    DirectoryLoader,
    UnstructuredHTMLLoader,
    BSHTMLLoader,
    SpiderLoader,
    JSONLoader,
    UnstructuredMarkdownLoader,
    CSVLoader,
)

In [45]:
# I chose PyMuPDFLoader for its speed, its ability to handle complex PDFs, and its more extensive metadata.
# Assign the loader class itself (not a string) so it can be called as DOCUMENT_LOADER(FILE_PATH).
# Some alternatives need extra arguments (e.g., JSONLoader requires a jq_schema).

DOCUMENT_LOADER = PyMuPDFLoader
# DOCUMENT_LOADER = PyPDFLoader
# DOCUMENT_LOADER = DirectoryLoader
# DOCUMENT_LOADER = UnstructuredHTMLLoader
# DOCUMENT_LOADER = BSHTMLLoader
# DOCUMENT_LOADER = SpiderLoader
# DOCUMENT_LOADER = JSONLoader
# DOCUMENT_LOADER = UnstructuredMarkdownLoader
# DOCUMENT_LOADER = CSVLoader

### ✂️ Strategic Text Splitting
- **RecursiveCharacterTextSplitter**: The smart way to keep related info together
- **TokenTextSplitter**: For when token limits matter most
- **Hugging Face tokenizers**: Measure chunk length in model tokens via a splitter's `from_huggingface_tokenizer` constructor

In [46]:
# Text Splitters concepts - https://python.langchain.com/v0.2/docs/concepts/#text-splitters
# Splitting by Token using HF tokenizers:  https://python.langchain.com/v0.2/docs/how_to/split_by_token/#hugging-face-tokenizer
# Use of RecursiveCharacterTextSplitter to split code - https://python.langchain.com/v0.2/docs/how_to/code_splitter/
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveJsonSplitter,
    Language,
)
from langchain_experimental.text_splitter import SemanticChunker

In [47]:
# Select the text splitter to use.
# Worth investigating: RecursiveCharacterTextSplitter with a tokenizer-based length_function vs. TokenTextSplitter

TEXT_SPLITTER = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
    )
# Alternatives must be instantiated too, e.g.:
# TEXT_SPLITTER = TokenTextSplitter(chunk_size=256, chunk_overlap=32)
# TEXT_SPLITTER = SemanticChunker(EMBEDDING_MODEL)  # EMBEDDING_MODEL must be defined first
# MarkdownHeaderTextSplitter and RecursiveJsonSplitter expose split_text / split_json rather than split_documents

### 🪢 Powerful Embeddings
- **OpenAIEmbeddings**: Harnessing the power of cutting-edge language models
- **CohereEmbeddings**: When diversity and flexibility are key

In [48]:
# Embedding Model Concepts - https://python.langchain.com/v0.2/docs/concepts/#embedding-models
# Text Embedding Models - https://python.langchain.com/v0.2/docs/how_to/embed_text/
# Hugging Face embeddings supported through langchain-huggingface python library
# Note: embeddings can be cached to avoid re-computing them (see CacheBackedEmbeddings)

from langchain_openai import OpenAIEmbeddings
from langchain_cohere import CohereEmbeddings


In [49]:
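# Select the embedding model to use.
# The same model must be used at ingestion time and at query time.
EMBEDDING_MODEL = OpenAIEmbeddings(
    model="text-embedding-3-small"
    )
# EMBEDDING_MODEL = CohereEmbeddings(model="embed-english-v3.0")  # different vector size: needs a new collection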
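LangChain's `CacheBackedEmbeddings` is the library's way to cache embeddings. To show the idea behind it, here is a hand-rolled sketch (not LangChain's implementation): a wrapper that keys each vector by a hash of the model name plus the text, so repeated texts skip the API call and vectors from different models never collide. `make_cached_embedder` and `fake_embed` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def make_cached_embedder(
    model_name: str, embed: Callable[[str], List[float]]
) -> Callable[[str], List[float]]:
    """Wrap `embed` so each (model, text) pair is computed only once."""
    cache: Dict[str, List[float]] = {}

    def cached(text: str) -> List[float]:
        # Include the model name in the key so vectors from different models never mix
        key = hashlib.sha256(f"{model_name}:{text}".encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = embed(text)
        return cache[key]

    return cached


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a paid embedding API call."""
    calls.append(text)
    return [float(len(text))]


embed = make_cached_embedder("text-embedding-3-small", fake_embed)
embed("cash and cash equivalents")
embed("cash and cash equivalents")  # served from the cache
embed("10b5-1 trading plan")
print(len(calls))  # 2
```

This sketch keeps the cache in memory only; a persistent store would be needed for the cache to survive a kernel restart.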

### 🗄️ Blazing-Fast Vector Stores
- **Qdrant**: The high-performance, scalable choice for demanding workloads
- **Chroma**: Lightweight and open source; easy to run locally for prototyping
- **Pinecone**: Fully managed simplicity and reliability at scale

In [50]:
# import vector stores - https://python.langchain.com/v0.2/docs/concepts/#vector-stores
# after installing additional python dependencies, I started seeing protobuf errors with the Chroma vector store
from qdrant_client import QdrantClient

# from langchain_chroma import Chroma
from langchain_qdrant import Qdrant
from langchain_pinecone import Pinecone
from langchain_community.vectorstores import FAISS

### Initialize the Vector Store client

In [51]:
# Create a Qdrant client instance
client = QdrantClient(url=QDRANT_API_URL, api_key=QDRANT_API_KEY, prefer_grpc=True)

# Initialize the Qdrant vector store
qdrant = Qdrant(
    client=client,
    collection_name=COLLECTION_NAME,
    embeddings=EMBEDDING_MODEL
)

## 🆕 Time for New Docs? Let's Check!

The `LOAD_NEW_DATA` flag is a key part of our simple data ingestion pipeline. When it is `True`, the notebook loads `FILE_PATH`, splits it, embeds the chunks, and stores them in the `COLLECTION_NAME` collection in Qdrant. When it is `False`, ingestion is skipped and the retriever queries the collection that already exists.

### 📥 Ingesting Fresh Docs: Embracing Adaptability

By using a flag like `LOAD_NEW_DATA`, we control when new data is ingested without touching the ingestion code itself. This supports rapid experimentation and iteration: we can test our RAG system against different datasets by changing `FILE_PATH` and `COLLECTION_NAME` and toggling the flag.

In this case, we're using `PyMuPDFLoader` to load a PDF file, but the beauty of this setup is that we can easily switch to other loaders like `UnstructuredHTMLLoader` for HTML files or `CSVLoader` for CSV data by changing the `DOCUMENT_LOADER` variable. This flexibility is crucial for adapting our pipeline to experiment with various data sources.
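The flag pattern can be sketched in plain Python, with each stage passed in as a function so a stage can be swapped the same way `DOCUMENT_LOADER` is. This is a minimal sketch under stated assumptions: `maybe_ingest` and the lambda stages are hypothetical stand-ins, not LangChain objects.

```python
from typing import Callable, List, Optional


def maybe_ingest(
    load_new_data: bool,
    load: Callable[[], List[str]],
    split: Callable[[List[str]], List[str]],
    store: Callable[[List[str]], None],
) -> Optional[int]:
    """Run load -> split -> store only when the flag is set; return the number of chunks stored."""
    if not load_new_data:
        # Skip ingestion entirely; the retriever will query the existing collection
        return None
    docs = load()
    chunks = split(docs)
    store(chunks)
    return len(chunks)


# Usage with stand-in stages (hypothetical stubs)
stored: List[str] = []
count = maybe_ingest(
    True,
    load=lambda: ["page one text", "page two text"],
    split=lambda docs: [word for doc in docs for word in doc.split()],
    store=stored.extend,
)
print(count, maybe_ingest(False, load=lambda: [], split=lambda d: d, store=stored.extend))  # 6 None
```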

In [52]:
# run loader if LOAD_NEW_DATA is True
if LOAD_NEW_DATA:
    loader = DOCUMENT_LOADER(FILE_PATH)
    docs = loader.load()

In [53]:
# Document Loader validation: spot-check the first two and last two pages
if LOAD_NEW_DATA:
    print(f"len(docs): {len(docs)}")
    for i in (0, 1, -2, -1):
        print(f"\ndocs[{i}].page_content[0:100]:\n{docs[i].page_content[0:100]}")
        print(f"\ndocs[{i}].metadata:\n{docs[i].metadata}")

### ✂️ Intelligent Text Splitting

Once our data is loaded, the next step is splitting it into manageable chunks. We're using the `RecursiveCharacterTextSplitter` for this, which intelligently splits text while keeping related pieces together.

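The splitter works by recursively dividing the text on a list of separators, coarsest first (by default paragraph breaks, then newlines, then spaces, then individual characters), until each chunk is within our desired `chunk_size`, which is measured in characters by default. The `chunk_overlap` parameter carries up to that many characters from the end of one chunk into the start of the next to maintain context.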
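A simplified, pure-Python sketch of the recursive idea, assuming LangChain's default separator order (`"\n\n"`, `"\n"`, `" "`, `""`). It is not LangChain's implementation: it ignores `chunk_overlap` and drops separators at chunk boundaries. `recursive_split` is a hypothetical helper.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")  # coarsest first


def recursive_split(
    text: str, chunk_size: int, separators: Sequence[str] = DEFAULT_SEPARATORS
) -> List[str]:
    """Split text into chunks of at most chunk_size characters, trying coarser separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut into chunk_size-character slices
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: recurse with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=8))  # ['aaa bbb', 'ccc ddd', 'eee']
```

The paragraph break is tried first, so "aaa bbb" stays whole; only the oversized second paragraph falls through to splitting on spaces.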

By adjusting these parameters, we can fine-tune the output to suit our specific use case. For example, a larger `chunk_size` results in fewer, longer chunks, while more `chunk_overlap` helps preserve context across chunks.
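To see how `chunk_size` and `chunk_overlap` interact, here is a fixed-size sliding-window sketch. This is a simpler technique than recursive splitting, swapped in only to show the arithmetic: each new chunk starts `chunk_size - chunk_overlap` characters after the previous one, so more overlap means more (and more redundant) chunks. `window_chunks` is a hypothetical helper.

```python
from typing import List


def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into chunk_size-character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))  # 2,500 characters
for overlap in (0, 100, 200):
    # prints 0 5, then 100 6, then 200 8
    print(overlap, len(window_chunks(sample, 500, overlap)))
```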

In [54]:
if LOAD_NEW_DATA:
    text_splitter = TEXT_SPLITTER
    splits = text_splitter.split_documents(docs)

In [55]:
# Text splitter validation: inspect a few chunks, then the page, length, and opening text of every chunk
if LOAD_NEW_DATA:
    print(f"len(splits): {len(splits)}")

    print(f"\nsplits[0]:\n{splits[0]}")
    print(f"\nsplits[1]:\n{splits[1]}")
    print(f"\nsplits[-2]:\n{splits[-2]}")
    print(f"\nsplits[-1]:\n{splits[-1]}")

    for i, split in enumerate(splits):
        print(f"\nSplit # {i}:")
        # page number recorded by the document loader in split.metadata
        print(f"split.metadata.get('page'): {split.metadata.get('page')}")
        print(f"len(splits[{i}]): {len(split.page_content)}")
        print(f"splits[{i}][0:25]: {split.page_content[0:25]}")

### 🗄️ Supercharging Our RAG System with Qdrant

With our text split into manageable chunks, it's time to embed and store them for fast retrieval. That's where Qdrant comes in: an open-source vector database built for fast, scalable similarity search.

Qdrant uses the HNSW (Hierarchical Navigable Small World) graph index for fast approximate nearest-neighbor search. According to Qdrant's own benchmarks, it delivers up to 4x higher requests per second than alternatives, and Qdrant states that its quantization options can reduce memory usage by up to 97%. Flexible storage options (vectors and payloads in memory or on disk) let us tune the speed/memory trade-off for our specific needs.

But Qdrant isn't just fast; it's also versatile. It supports payload filtering combined with vector search, sparse vectors for hybrid (dense + sparse) search, and rich JSON payloads, enabling querying patterns that go beyond simple similarity search.
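To make "filtering combined with vector search" concrete, here is a brute-force, pure-Python sketch: payload conditions narrow the candidate set, then cosine similarity ranks what remains. Qdrant itself uses HNSW indexes and its own filter syntax; the point layout, the `page` payload key, and `filtered_search` are illustrative assumptions.

```python
import math
from typing import Any, Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def filtered_search(
    query: Sequence[float],
    points: List[Dict[str, Any]],
    must: Dict[str, Any],
    k: int = 2,
) -> List[Tuple[float, Dict[str, Any]]]:
    """Keep points whose payload matches every key in `must`, then rank them by similarity."""
    candidates = [
        p for p in points if all(p["payload"].get(key) == val for key, val in must.items())
    ]
    scored = [(cosine(query, p["vector"]), p["payload"]) for p in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]


points = [
    {"vector": [1.0, 0.0], "payload": {"page": 1, "text": "cash equivalents"}},
    {"vector": [0.9, 0.1], "payload": {"page": 2, "text": "10b5-1 plan"}},
    {"vector": [0.0, 1.0], "payload": {"page": 2, "text": "risk factors"}},
]
hits = filtered_search([1.0, 0.0], points, must={"page": 2})
print([payload["text"] for _, payload in hits])  # ['10b5-1 plan', 'risk factors']
```

The page-1 point is the closest vector overall but is excluded by the filter before ranking.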

And with a robust set of enterprise features like multitenancy, access control, and backup/recovery, Qdrant is ready to scale with our RAG system as it grows.

By leveraging Qdrant's speed, efficiency, and flexibility, we're building a knowledge base that can rapidly retrieve the most relevant information for any query. Whether we're serving a small prototype or a massive production system, Qdrant has us covered.

So let's dive in and see how Qdrant can supercharge our RAG system! üöÄ

In [56]:
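# Store the chunks in Qdrant.
# from_documents is a classmethod: it embeds the chunks, creates the collection if it
# doesn't exist, and returns a new vector store. Re-running it adds the chunks again.
if LOAD_NEW_DATA:
    from_splits = Qdrant.from_documents(
        documents=splits,
        embedding=EMBEDDING_MODEL,
        url=QDRANT_API_URL,
        api_key=QDRANT_API_KEY,
        prefer_grpc=True,
        collection_name=COLLECTION_NAME,
    )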

## üîç Implementing a Robust Vector Store Retriever

- depends on the "Initialize the Vector Store client" section above

In [57]:
# Concepts:  https://python.langchain.com/v0.2/docs/concepts/#retrievers
# Vector Store as Retriever:  https://python.langchain.com/v0.2/docs/how_to/vectorstore_retriever/
# Including Similarity Search Scores:  https://python.langchain.com/v0.2/docs/how_to/add_scores_retriever/

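# Return only chunks whose relevance score is >= 0.5; if none qualify, the retriever returns an empty list
retriever = qdrant.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5}
)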
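A brute-force sketch of what a score threshold does, assuming the score is cosine similarity (in LangChain the score is the vector store's relevance score, whose scale depends on the distance metric). `threshold_search` is a hypothetical helper; the toy vectors are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def threshold_search(
    query: Sequence[float],
    corpus: List[Tuple[Sequence[float], str]],
    score_threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Return (score, text) pairs whose score is >= score_threshold, best first."""
    scored = [(cosine(query, vec), text) for vec, text in corpus]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


corpus = [([1.0, 0.0], "10b5-1 plan"), ([0.6, 0.8], "share count"), ([0.0, 1.0], "office leases")]
hits = threshold_search([1.0, 0.0], corpus, score_threshold=0.5)
print([text for _, text in hits])  # ['10b5-1 plan', 'share count']
```

Unlike a fixed top-k search, the number of results varies per question, and a threshold set too high can leave the LLM with no context at all.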

## 🧠 Constructing the RAG Chain for Question Answering

In [58]:
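from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # Retrieve chunks for the question, keeping the question alongside them
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # Pass both keys through (re-assigning "context" to itself is effectively a no-op)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # Fill the Llama 3 RAG prompt, call the LLM, and return the retrieved context with the answer
    | {"response": LLAMA3_PROMPT | llm, "context": itemgetter("context")}
)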
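The chain's data flow can be traced with plain Python dictionaries. In this sketch, hypothetical stub functions stand in for `retriever` and `LLAMA3_PROMPT | llm`; it shows which keys exist after each step, including that `question` is not part of the final output.

```python
from operator import itemgetter
from typing import Any, Dict, List


def fake_retriever(question: str) -> List[str]:
    """Hypothetical stand-in for `retriever`: returns 'documents' for a question."""
    return [f"doc about {question}"]


def fake_prompt_and_llm(inputs: Dict[str, Any]) -> str:
    """Hypothetical stand-in for `LLAMA3_PROMPT | llm`: needs both 'question' and 'context'."""
    return f"Answer to {inputs['question']!r} using {len(inputs['context'])} doc(s)"


def run_chain(inputs: Dict[str, Any]) -> Dict[str, Any]:
    # Step 1: parallel map -> {"context": retrieved docs, "question": original question}
    step1 = {
        "context": fake_retriever(itemgetter("question")(inputs)),
        "question": itemgetter("question")(inputs),
    }
    # Step 2: RunnablePassthrough.assign keeps every key and sets "context" to itself
    step2 = {**step1, "context": itemgetter("context")(step1)}
    # Step 3: parallel map -> {"response": prompt | llm output, "context": docs}
    return {
        "response": fake_prompt_and_llm(step2),
        "context": itemgetter("context")(step2),
    }


result = run_chain({"question": "10b5-1 plan"})
print(sorted(result))  # ['context', 'response']
```

Returning `context` next to `response` is what lets us inspect the retrieved chunks after invoking the chain.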

In [59]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

                      +---------------------------------+                        
                      | Parallel<context,question>Input |                        
                      +---------------------------------+                        
                           ****                   ****                           
                       ****                           ***                        
                     **                                  ****                    
+--------------------------------+                           **                  
| Lambda(itemgetter('question')) |                            *                  
+--------------------------------+                            *                  
                 *                                            *                  
                 *                                            *                  
                 *                                            *                  
     +----------

## 🎉 Moment of Truth: Testing Our RAG System!

In [60]:
response = retrieval_augmented_qa_chain.invoke({"question" : QUESTION})

In [61]:
# Extract the answer text: response["response"] is an AIMessage, so read its .content
response["response"].content


'The maximum number of shares to be sold under the 10b5-1 Trading plan by Brian Chesky is 1,146,000.'

In [62]:
print(response["response"].content)

The maximum number of shares to be sold under the 10b5-1 Trading plan by Brian Chesky is 1,146,000.


### Thanks to LangSmith, this custom code is no longer required

```python
for i, context_instance in enumerate(response["context"]):
    print(f"\nvector store CONTEXT # {i}:")
    print(f"Page # : {context_instance.metadata.get('page')}")
    print(f"context.page_content:\n{context_instance.page_content}")
    print(f"context.metadata:\n{context_instance.metadata}")