
# Comprehensive Tutorial: Building a Retrieval-Augmented Generation (RAG) System

Welcome to this in-depth tutorial on Retrieval-Augmented Generation (RAG)! RAG is a powerful technique that enhances the capabilities of Large Language Models (LLMs) by allowing them to access and utilize external knowledge sources. This is particularly useful for tasks requiring up-to-date information or domain-specific knowledge that wasn't part of the LLM's original training data.

In this hands-on code lab, we will guide you step by step through the process of building a RAG system. We'll start by ingesting a research paper, explore various text chunking strategies, create vector embeddings, set up a vector store for efficient retrieval, and finally build and query a RAG pipeline using popular frameworks like LangChain. The goal is to provide you with a solid understanding of the RAG architecture and the practical skills to implement it.

## Concepts Covered

Throughout this tutorial, we will cover the following key concepts essential for understanding and building RAG systems:

*   **PDF Ingestion and Text Extraction:** How to programmatically download and extract textual content from PDF documents.
*   **Text Chunking Strategies:** The importance of breaking down large texts into smaller, manageable chunks. We will explore and implement:
    *   Fixed-Size Chunking
    *   Sliding Window Chunking
    *   Semantic Chunking (e.g., sentence-based or regex-based)
    *   Advanced Chunking with LangChain Text Splitters
*   **Vector Embeddings:** Understanding how text is converted into numerical representations (vectors) that capture semantic meaning.
*   **Vector Indexing and Retrieval:** Using vector databases/libraries (like FAISS) to store embeddings and perform efficient similarity searches.
*   **Building a RAG Pipeline:** Integrating the components (retriever, LLM) to create a functional RAG system, primarily using the LangChain framework.
*   **Querying and Evaluation:** How to ask questions to your RAG system and interpret the results.
*   **Optimization and Best Practices:** Tips for improving the performance and production-readiness of RAG systems.

## Resources and Tools

We will be using the following Python libraries and tools. The first code cell in this notebook will handle their installation.

*   **Core Python Libraries:**
    *   `requests`: For making HTTP requests (e.g., downloading the PDF).
    *   `pypdf` (or `PyPDF2`): For reading and extracting text from PDF files.
    *   `numpy`: For numerical operations, especially with embeddings.
    *   `re`: Python's regular expression module, useful for some chunking methods.
*   **Embeddings and Vector Stores:**
    *   `sentence-transformers`: For generating high-quality text embeddings.
    *   `faiss-cpu` (or `faiss-gpu` if you have a compatible GPU): A library for efficient similarity search and clustering of dense vectors.
*   **LLM and RAG Frameworks:**
    *   `transformers`: Provides access to a wide range of pre-trained models from Hugging Face, including LLMs.
    *   `langchain`: A comprehensive framework for developing applications powered by language models, including RAG pipelines. We'll install it together with `langchain_community`, which provides community-maintained integrations.
    *   `torch`: The deep learning framework that `sentence-transformers` depends on and that `transformers` uses as its backend in this notebook.

## Paper for Demonstration

For this tutorial, we will be working with the seminal paper that introduced the RAG concept:

*   **Title:** Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
*   **Authors:** Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel
*   **ArXiv URL:** [https://arxiv.org/pdf/2005.11401.pdf](https://arxiv.org/pdf/2005.11401.pdf)

We will download this PDF, extract its content, and use it as the knowledge base for our RAG system.

## 1. Setup: Installing Necessary Libraries

Before we begin, we need to install the Python libraries required for this tutorial. The following code cell uses `pip` to install `requests`, `pypdf`, `sentence-transformers`, `faiss-cpu`, `transformers`, `torch`, `langchain`, and `langchain_community`. If you are running this in an environment like Google Colab, these installations will be specific to your current session. If you are running locally, ensure you have a virtual environment set up to avoid conflicts with other projects.

In [None]:
# This cell installs the necessary libraries.
# It's generally recommended to run this in a virtual environment.
# If using Google Colab, this will install packages for the current session.

!pip install requests pypdf sentence-transformers faiss-cpu transformers torch langchain langchain_community -q


**Explanation of the Installation Cell:**

*   `!pip install ...`: The `!` symbol allows us to run shell commands directly from a Jupyter Notebook cell. `pip` is Python's package installer.
*   `requests`: Used for making HTTP requests to download files from the internet, like our PDF paper.
*   `pypdf`: A library for working with PDF files. We'll use it to read the content of the downloaded paper.
*   `sentence-transformers`: This library provides an easy way to use state-of-the-art sentence, text, and image embedding models. We'll use it to convert our text chunks into vector embeddings.
*   `faiss-cpu`: Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. The `-cpu` version is for systems without a dedicated NVIDIA GPU. If you have a GPU, you might consider `faiss-gpu`. We'll use FAISS to create an index of our embeddings for fast retrieval.
*   `transformers`: From Hugging Face, this library provides thousands of pre-trained models for Natural Language Processing (NLP) tasks, including the LLMs we might use for the generation part of RAG.
*   `torch`: PyTorch is an open-source machine learning framework that many of these libraries (like `transformers` and `sentence-transformers`) depend on.
*   `langchain` and `langchain_community`: LangChain is a framework designed to simplify the creation of applications using large language models. The `langchain_community` package supplies community-maintained integrations (such as vector store and model wrappers) that pipelines like RAG commonly need.
*   `-q`: This flag stands for "quiet" and reduces the amount of output during the installation process, keeping our notebook cleaner.
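
Because the install command pins no versions, it can help to check what was actually resolved before moving on. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the package list simply mirrors the install command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as they were passed to pip in the install cell.
packages = [
    "requests", "pypdf", "sentence-transformers", "faiss-cpu",
    "transformers", "torch", "langchain", "langchain_community",
]

for name in packages:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Recording these versions alongside your results makes it easier to reproduce the notebook later, since library APIs (LangChain's in particular) change between releases.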

## 2. Data Ingestion: Downloading and Parsing the RAG Paper PDF

The first step in our RAG pipeline is to acquire the data that will serve as our knowledge base. In this case, it's the RAG research paper from arXiv. We need to download the PDF and then extract the raw text from it. We'll use the `requests` library to fetch the PDF from its URL and `pypdf` to parse its content.

In [None]:
import requests
from io import BytesIO
from pypdf import PdfReader  # pypdf is the maintained successor of PyPDF2; older setups may import PyPDF2.PdfReader instead

pdf_url = "https://arxiv.org/pdf/2005.11401.pdf"

# Download the PDF
response = requests.get(pdf_url)
response.raise_for_status()  # Ensure the request was successful

# Read the PDF content from the response
pdf_file = BytesIO(response.content)
reader = PdfReader(pdf_file)

# Extract text from each page
extracted_text = ""
for page_num, page in enumerate(reader.pages):
    extracted_text += page.extract_text() + "\n"  # Add a newline character between pages for better separation

print(f"Successfully extracted {len(extracted_text):,} characters from the PDF.")
# You can print a small portion of the text to verify
# print(extracted_text[:1000])

Successfully extracted 69,135 characters from the PDF.


**Explanation of the Data Ingestion Code:**

1.  **Import Libraries:**
    *   `requests`: To send an HTTP GET request to the PDF URL.
    *   `BytesIO` (from `io`): To treat the downloaded binary content of the PDF as an in-memory file-like object, which `pypdf` can read.
    *   `PdfReader` (from `pypdf`): The primary class used to read and parse PDF files.
2.  **Define PDF URL:**
    *   `pdf_url`: Stores the direct link to the RAG paper PDF on arXiv.
3.  **Download PDF:**
    *   `response = requests.get(pdf_url)`: Sends a GET request to the URL. The server's response (including the PDF content) is stored in the `response` object.
    *   `response.raise_for_status()`: This is a good practice. It checks if the HTTP request was successful (e.g., status code 200 OK). If there was an error (like 404 Not Found or 500 Server Error), it will raise an HTTPError exception.
4.  **Read PDF Content:**
    *   `pdf_file = BytesIO(response.content)`: `response.content` holds the raw bytes of the PDF. `BytesIO` creates an in-memory binary stream from these bytes. This is necessary because `PdfReader` expects a file-like object.
    *   `reader = PdfReader(pdf_file)`: Initializes a `PdfReader` object with the PDF data.
5.  **Extract Text:**
    *   `extracted_text = ""`: Initializes an empty string to accumulate the text from all pages.
    *   `for page_num, page in enumerate(reader.pages):`: Iterates through each page in the PDF. `reader.pages` is a list-like object containing all pages of the document.
    *   `extracted_text += page.extract_text() + "\n"`: For each `page` object, `page.extract_text()` attempts to extract all textual content. We append this text to our `extracted_text` string, adding a newline character (`\n`) after each page's content to maintain some separation, which can be helpful for later chunking.
6.  **Print Summary:**
    *   `print(f"Successfully extracted {len(extracted_text):,} characters from the PDF.")`: Prints the total number of characters extracted, giving us a sense of the volume of text we're working with. The `:,` in the f-string formats the number with commas for readability.
    *   The commented-out line `# print(extracted_text[:1000])` can be used to quickly inspect the beginning of the extracted text to ensure it looks reasonable.
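
Two small safeguards can make the ingestion steps above more robust. The sketch below is illustrative only and uses two hypothetical helpers that are not part of the notebook: `looks_like_pdf` checks for the `%PDF-` marker that PDF files begin with (useful when a server returns an HTML error page with a 200 status), and `join_page_texts` joins per-page text while tolerating pages that yield no text.

```python
from typing import Iterable, Optional


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the marker %PDF-."""
    return data.startswith(b"%PDF-")


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text with newlines, treating None or empty pages as empty strings."""
    return "\n".join(text or "" for text in page_texts)


print(looks_like_pdf(b"%PDF-1.5\n..."))           # True
print(looks_like_pdf(b"<html>Not found</html>"))  # False
print(repr(join_page_texts(["Page one", None, "Page three"])))
```

In the notebook, these could be applied as `looks_like_pdf(response.content)` before constructing the `PdfReader`, and `join_page_texts(page.extract_text() for page in reader.pages)` in place of the accumulation loop.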

## 3. Text Chunking Strategies

Once we have the raw text, the next crucial step is **chunking**. Large language models have a limited context window (the amount of text they can consider at one time). Therefore, we need to break down the lengthy extracted PDF text into smaller, semantically meaningful pieces or "chunks". These chunks will then be converted into embeddings and stored in a vector database for retrieval.

The way we chunk text can significantly impact the RAG system's performance. If chunks are too small, they might not contain enough context to answer a question. If they are too large, they might exceed the LLM's context window or dilute the relevant information with noise. We will explore a few common chunking strategies.

### 3.1 Fixed-Size Chunking

This is the simplest chunking method. The text is split into chunks of a predetermined fixed size (e.g., a certain number of characters). While easy to implement, it can be problematic as it might split sentences or even words in half, disrupting semantic meaning.

In [None]:
from typing import List

def fixed_size_chunker(text: str, chunk_size: int) -> List[str]:
    "Splits text into fixed-size chunks."
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Define a chunk size (e.g., 1000 characters)
char_chunk_size = 1000
fixed_chunks = fixed_size_chunker(extracted_text, char_chunk_size)

print(f"Number of fixed-size chunks: {len(fixed_chunks)}")
# You can inspect a chunk to see its content
# print(f"\nExample chunk (first 200 chars):\n{fixed_chunks[0][:200]}...")
# print(f"\nExample chunk (last 200 chars):\n...{fixed_chunks[0][-200:]}")

Number of fixed-size chunks: 70


**Explanation of Fixed-Size Chunking Code:**

1.  **Import `List`:** From the `typing` module, `List` is used for type hinting, indicating that the function will return a list of strings.
2.  **`fixed_size_chunker` Function:**
    *   Takes two arguments: `text` (the string to be chunked) and `chunk_size` (the desired number of characters per chunk).
    *   Uses a list comprehension: `[text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]`.
        *   `range(0, len(text), chunk_size)`: Generates starting indices for each chunk. It starts at `0`, goes up to the length of the text, and increments by `chunk_size` in each step.
        *   `text[i:i + chunk_size]`: Slices the text from the current starting index `i` up to `i + chunk_size`. This creates a chunk of the specified size.
    *   Returns a list of these text chunks.
3.  **Define Chunk Size and Apply:**
    *   `char_chunk_size = 1000`: Sets the chunk size to 1000 characters. This value is often chosen based on the embedding model's input limits and empirical testing.
    *   `fixed_chunks = fixed_size_chunker(extracted_text, char_chunk_size)`: Calls the function with our extracted PDF text and the defined chunk size.
4.  **Print Summary:**
    *   `print(f"Number of fixed-size chunks: {len(fixed_chunks)}")`: Shows how many chunks were created. This helps in understanding the scale of data for the next steps (embedding and indexing).
    *   The commented-out lines can be used to inspect the content of a sample chunk, which is useful for verifying if the chunking is happening as expected and to observe potential issues like mid-sentence breaks.
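
To make the mid-word split problem concrete, here is a tiny worked example. It repeats the `fixed_size_chunker` definition so that it runs on its own; the sample sentence and the chunk size of 8 are arbitrary choices for illustration.

```python
from typing import List


def fixed_size_chunker(text: str, chunk_size: int) -> List[str]:
    """Splits text into fixed-size chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "The quick brown fox"
chunks = fixed_size_chunker(sample, 8)
print(chunks)  # ['The quic', 'k brown ', 'fox']
```

The word "quick" is cut between the first and second chunk, and the last chunk is shorter than `chunk_size` because the text length is not a multiple of 8. No characters are lost: concatenating the chunks reproduces the input.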

### 3.2 Sliding Window Chunking

To mitigate some issues of fixed-size chunking (like cutting off information abruptly at chunk boundaries), sliding window chunking introduces an **overlap** between consecutive chunks. This means that the end of one chunk will also be the beginning of the next chunk for a certain number of characters (the overlap size). This helps to preserve context around the boundaries.

For example, if `chunk_size` is 1000 and `overlap` is 200:

*   Chunk 1: characters 0-999
*   Chunk 2: characters 800-1799 (overlaps with Chunk 1 from 800-999)
*   Chunk 3: characters 1600-2599 (overlaps with Chunk 2 from 1600-1799)

And so on.
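
The ranges listed above can be reproduced with a few lines of code. This is a small sketch with a hypothetical helper, `window_spans`, that returns index ranges instead of text; the text length of 2600 is an assumption chosen so that exactly three windows result.

```python
from typing import List, Tuple


def window_spans(text_length: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return inclusive (first_char, last_char) index pairs for each sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("Overlap must be non-negative and less than chunk size.")
    spans = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)  # exclusive end, clipped to the text
        spans.append((start, end - 1))
        if end == text_length:
            break
        start += chunk_size - overlap  # step forward, keeping `overlap` characters
    return spans


print(window_spans(2600, 1000, 200))
# [(0, 999), (800, 1799), (1600, 2599)]
```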

In [None]:
def sliding_window_chunker(text: str, chunk_size: int, overlap: int) -> List[str]:
    "Splits text into chunks with a sliding window and overlap."
    if overlap >= chunk_size:
        raise ValueError("Overlap size must be less than chunk size.")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text): # Reached the end of the text
            break
        start += (chunk_size - overlap) # Move the window, considering the overlap
    return chunks

# Define chunk size and overlap
# Common practice: overlap is 10-20% of chunk_size
overlap_size = 200
sliding_chunks = sliding_window_chunker(extracted_text, char_chunk_size, overlap_size)

print(f"Number of sliding-window chunks: {len(sliding_chunks)}")
# You can inspect a few chunks to see the overlap
# if len(sliding_chunks) > 1:
#     print(f"\nExample chunk 1 (last 300 chars):\n...{sliding_chunks[0][-300:]}")
#     print(f"\nExample chunk 2 (first 300 chars):\n{sliding_chunks[1][:300]}...")

Number of sliding-window chunks: 87


**Explanation of Sliding Window Chunking Code:**

1.  **`sliding_window_chunker` Function:**
    *   Takes `text`, `chunk_size`, and `overlap` as input.
    *   Includes a `ValueError` check to ensure `overlap` is less than `chunk_size`, which is a logical requirement for this method.
    *   Initializes an empty list `chunks` and a `start` index at `0`.
    *   Uses a `while` loop that continues as long as `start` is within the bounds of the text.
        *   `end = start + chunk_size`: Calculates the end position of the current chunk.
        *   `chunks.append(text[start:end])`: Appends the sliced chunk to the list.
        *   `if end >= len(text): break`: If the end of the current chunk reaches or goes beyond the text length, we have processed the entire text, so the loop breaks.
        *   `start += (chunk_size - overlap)`: This is the key step for the sliding window. The next chunk starts not immediately after the current one ends, but `overlap` characters before. So, the step size is `chunk_size - overlap`.
    *   Returns the list of `chunks`.
2.  **Define Parameters and Apply:**
    *   `overlap_size = 200`: Sets the overlap. A common heuristic is 10-20% of the `chunk_size`. Here, `char_chunk_size` is 1000, so 200 is 20%.
    *   `sliding_chunks = sliding_window_chunker(extracted_text, char_chunk_size, overlap_size)`: Calls the function.
3.  **Print Summary:**
    *   `print(f"Number of sliding-window chunks: {len(sliding_chunks)}")`: Shows the total number of chunks. Note that with overlap, you will generally get more chunks than with fixed-size chunking for the same `chunk_size` if the overlap is greater than 0.
    *   The commented-out lines are useful for inspecting the end of one chunk and the beginning of the next to verify that the overlap is working correctly.
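
The chunk counts reported so far (70 fixed-size chunks and 87 sliding-window chunks for 69,135 characters) can be checked with a little arithmetic: each window advances by `chunk_size - overlap` characters, so a non-empty text of length n yields ceil((n - overlap) / (chunk_size - overlap)) chunks, with a minimum of one. The helper below, `expected_chunk_count`, is a hypothetical name introduced for this sketch.

```python
def expected_chunk_count(text_length: int, chunk_size: int, overlap: int = 0) -> int:
    """Number of chunks the sliding-window loop produces (overlap=0 gives fixed-size)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("Overlap must be non-negative and less than chunk size.")
    if text_length == 0:
        return 0
    step = chunk_size - overlap
    # Ceiling division using integers only; non-empty text always yields at least one chunk.
    return max(1, -(-(text_length - overlap) // step))


print(expected_chunk_count(69_135, 1000))       # 70
print(expected_chunk_count(69_135, 1000, 200))  # 87
```

Going from 70 to 87 chunks is the storage and embedding cost of a 20% overlap on this document.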

### 3.3 Semantic Chunking (Regex-Based on Sentences)

Fixed-size and sliding window chunkers operate purely on character counts and don't understand the *meaning* or structure of the text. **Semantic chunking** aims to divide text along natural semantic boundaries, such as sentences or paragraphs. This can lead to more coherent and contextually relevant chunks.

One way to achieve a basic form of semantic chunking is to first split the text into sentences and then group a fixed number of sentences into each chunk. We can use regular expressions (`re` module) for a simple sentence tokenization, though more advanced NLP libraries like NLTK or spaCy offer more robust sentence splitting.

**Note:** Simple regex-based sentence splitting might not be perfect for all texts (e.g., handling abbreviations like "Mr." or complex sentence structures). For production systems, consider more sophisticated sentence tokenizers.
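
The abbreviation problem is easy to reproduce. The snippet below applies a split-after-punctuation pattern to a made-up sentence containing "Mr."; the constant name `SENTENCE_BOUNDARY` is introduced here for illustration.

```python
import re

# Split on whitespace that directly follows ., ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

text = "Mr. Smith read the paper. He liked it."
sentences = SENTENCE_BOUNDARY.split(text)
print(sentences)
# ['Mr.', 'Smith read the paper.', 'He liked it.']
```

The pattern cannot tell the period in "Mr." from a sentence-ending period, so the first sentence is split in two. Dedicated tokenizers such as those in NLTK or spaCy are designed to handle cases like this.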

In [None]:
import re

def regex_sentence_tokenizer(text: str) -> List[str]:
    "Splits text into sentences using a basic regex."
    # Split on runs of whitespace that directly follow a period, exclamation mark, or question mark.
    # The (?<=[.!?]) is a positive lookbehind: it checks for the punctuation without consuming it, so the punctuation stays attached to its sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s] # Remove any empty strings that might result from splitting

def semantic_chunker_by_sentence(text: str, max_sentences_per_chunk: int) -> List[str]:
    "Chunks text by grouping a fixed number of sentences."
    sentences = regex_sentence_tokenizer(text)

    chunks = []
    for i in range(0, len(sentences), max_sentences_per_chunk):
        chunk = " ".join(sentences[i:i + max_sentences_per_chunk])
        chunks.append(chunk)
    return chunks

# Define the number of sentences per chunk
sentences_per_chunk = 5 # Group 5 sentences into one chunk
semantic_chunks = semantic_chunker_by_sentence(extracted_text, sentences_per_chunk)

print(f"Number of semantic (sentence-based) chunks: {len(semantic_chunks)}")
# You can inspect a chunk
# if semantic_chunks:
#     print(f"\nExample semantic chunk:\n{semantic_chunks[0]}")

Number of semantic (sentence-based) chunks: 140


**Explanation of Semantic Chunking (Regex-Based) Code:**

1.  **Import `re`:** Python's built-in regular expression module.
2.  **`regex_sentence_tokenizer` Function:**
    *   Takes the input `text`.
    *   `text.strip()`: Removes leading/trailing whitespace from the text.
    *   `re.split(r"(?<=[.!?])\s+", ...)`: This is the core of the sentence splitting.
        *   `r"(?<=[.!?])\s+"`: The regular expression pattern.
            *   `(?<=[.!?])`: This is a **positive lookbehind assertion**. It matches a position that is immediately preceded by a period (`.`), an exclamation mark (`!`), or a question mark (`?`). The punctuation itself is not included in the split delimiter.
            *   `\s+`: Matches one or more whitespace characters (spaces, tabs, newlines). This is what we actually split by.
        *   The result is a list of strings, where each string is ideally a sentence.
    *   `[s for s in sentences if s]`: Filters out any empty strings, which `re.split` can return when the input text is empty. Runs of multiple spaces do not cause empty strings, because `\s+` consumes the whole run.
3.  **`semantic_chunker_by_sentence` Function:**
    *   Takes `text` and `max_sentences_per_chunk` as input.
    *   `sentences = regex_sentence_tokenizer(text)`: First, tokenizes the entire text into individual sentences.
    *   Initializes an empty list `chunks`.
    *   Iterates through the `sentences` list with a step of `max_sentences_per_chunk`: `for i in range(0, len(sentences), max_sentences_per_chunk):`.
    *   `chunk = " ".join(sentences[i:i + max_sentences_per_chunk])`: For each iteration, it takes a slice of sentences (e.g., sentences 0-4, then 5-9, etc.) and joins them back together with a space in between to form a single chunk string.
    *   `chunks.append(chunk)`: Adds the formed chunk to the list.
    *   Returns the list of sentence-grouped `chunks`.
4.  **Define Parameters and Apply:**
    *   `sentences_per_chunk = 5`: We decide to group 5 sentences into each chunk. This number is a hyperparameter that can be tuned.
    *   `semantic_chunks = semantic_chunker_by_sentence(extracted_text, sentences_per_chunk)`: Calls the chunking function.
5.  **Print Summary:**
    *   `print(f"Number of semantic (sentence-based) chunks: {len(semantic_chunks)}")`: Displays the total number of chunks created using this method.
    *   The commented-out lines allow for inspection of a sample chunk to see if it correctly groups sentences.
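
A small worked example shows what the grouping produces. The sketch below condenses the tokenizer and chunker into a single hypothetical function, `group_sentences`, so that it runs on its own; the sample text is made up.

```python
import re
from typing import List


def group_sentences(text: str, sentences_per_chunk: int) -> List[str]:
    """Split text into sentences with a basic regex, then join them in fixed-size groups."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


sample = "RAG retrieves passages. It then generates text! Does it help? Often it does. That is all."
print(group_sentences(sample, 2))
# ['RAG retrieves passages. It then generates text!', 'Does it help? Often it does.', 'That is all.']
```

Every chunk ends on a sentence boundary, but chunk lengths now vary with sentence length, and the final chunk may contain fewer sentences than requested. If your embedding model has a strict input limit, it is worth checking the longest chunk.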

### 3.4 LangChain Text Splitters

LangChain provides a variety of sophisticated text splitters that are often more robust and configurable than implementing them from scratch. One of the most commonly used is the `RecursiveCharacterTextSplitter`. This splitter tries to split text based on a list of characters (by default `["\n\n", "\n", " ", ""]`) and recursively tries to keep semantically related pieces of text together. It also supports `chunk_size` and `chunk_overlap`.

Using LangChain for this task is generally recommended as it handles many edge cases and offers more advanced options.

In [None]:
# Note: newer LangChain releases expose this class from the `langchain_text_splitters` package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the LangChain text splitter
lc_text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # The maximum size of each chunk (in characters)
    chunk_overlap=200,      # The number of characters to overlap between chunks
    length_function=len,    # Function to measure chunk length (usually len for characters)
    add_start_index=True    # Record each chunk's start index in its metadata (only applies when splitting Document objects)
)

# LangChain splitters can work directly with text or with LangChain `Document` objects.
# If we have raw text:
langchain_chunks_from_text = lc_text_splitter.split_text(extracted_text)

# If we have LangChain Document objects (which we will use later):
# from langchain_core.documents import Document
# doc = Document(page_content=extracted_text, metadata={"source": pdf_url})
# langchain_chunks_from_doc = lc_text_splitter.split_documents([doc])

print(f"Number of chunks using LangChain RecursiveCharacterTextSplitter: {len(langchain_chunks_from_text)}")

# LangChain chunks are often `Document` objects if split from documents, or strings if split from text.
# Let's inspect the type and content of a chunk if split_text was used:
# if langchain_chunks_from_text:
#     print(f"Type of a LangChain chunk (from text): {type(langchain_chunks_from_text[0])}")
#     print(f"Content of first chunk (first 200 chars): {langchain_chunks_from_text[0][:200]}...")

# If you used split_documents, each element would be a Document object:
# if langchain_chunks_from_doc:
#     print(f"Type of a LangChain chunk (from document): {type(langchain_chunks_from_doc[0])}")
#     print(f"Content of first chunk (first 200 chars): {langchain_chunks_from_doc[0].page_content[:200]}...")
#     print(f"Metadata of first chunk: {langchain_chunks_from_doc[0].metadata}")

Number of chunks using LangChain RecursiveCharacterTextSplitter: 88


**Explanation of LangChain Text Splitter Code:**

1.  **Import `RecursiveCharacterTextSplitter`:** This is one of LangChain's most versatile text splitters.
2.  **Initialize the Splitter:**
    *   `lc_text_splitter = RecursiveCharacterTextSplitter(...)`: Creates an instance of the splitter.
    *   `chunk_size=1000`: Specifies the target maximum size for each chunk (measured by `length_function`).
    *   `chunk_overlap=200`: Sets the number of characters of overlap between adjacent chunks. This helps maintain context continuity, similar to our manual sliding window approach but often more intelligently handled.
    *   `length_function=len`: Defines how the "size" of a chunk is measured. `len` typically means character count.
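    *   `add_start_index=True`: An optional parameter that, if set, adds a `start_index` entry to each chunk's metadata, giving its starting character position in the original document. This can be useful for referencing back to the source. It only takes effect for `Document`-based methods such as `split_documents`; `split_text` returns plain strings, which carry no metadata.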
3.  **Splitting Text:**
    *   `langchain_chunks_from_text = lc_text_splitter.split_text(extracted_text)`: If you have a single block of raw text, you can use the `split_text` method. It returns a list of strings, where each string is a chunk.
4.  **Splitting Documents (Commented Out):**
    *   LangChain often works with `Document` objects, which are simple containers for text content and associated metadata (like source, page number, etc.).
    *   The commented-out section shows how you would typically create a `Document` object and then use `lc_text_splitter.split_documents([doc])`. This method takes a list of `Document` objects and returns a list of new `Document` objects, where each new document is a chunk of the original(s). The metadata from the original document can be preserved or modified in the chunks.
    *   We will use the `split_documents` approach more extensively later when building the full RAG pipeline with LangChain.
5.  **Print Summary and Inspect:**
    *   `print(f"Number of chunks using LangChain...")`: Shows the number of chunks created.
    *   The commented-out inspection lines show how to check the type and content of the generated chunks. If `split_text` is used, chunks are strings. If `split_documents` is used, chunks are `Document` objects, and you access their content via `chunk.page_content` and metadata via `chunk.metadata`.

The `RecursiveCharacterTextSplitter` is powerful because it tries to split based on a hierarchy of separators (e.g., double newlines, then single newlines, then spaces). This often results in more semantically coherent chunks than simple fixed-size splitting, especially for well-structured text.
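
To build intuition for the separator hierarchy, here is a deliberately simplified pure-Python sketch of the idea. It is not LangChain's implementation (it has no overlap handling, for instance), and the function name `recursive_split` is hypothetical: it splits on the coarsest separator first, packs pieces greedily up to `chunk_size`, and only falls back to finer separators for pieces that are still too large.

```python
from typing import List, Sequence


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ", "")) -> List[str]:
    """Simplified illustration of hierarchical splitting (no overlap support)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate            # piece still fits in the current chunk
            continue
        if current:
            chunks.append(current)         # close the current chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece                # start a new chunk with this piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
print(recursive_split(text, chunk_size=30))
# ['Para one is short.', 'Para two is a bit longer than', 'the limit allows.']
```

The first paragraph fits within the limit and is kept whole; only the second paragraph, which is too long, is broken up further, and then at word boundaries rather than mid-word.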

## 4. Vector Embeddings and Indexing with FAISS

After chunking our text, the next step is to convert these textual chunks into a numerical format that machine learning models can understand. This process is called **embedding**. Each chunk will be transformed into a dense vector (a list of numbers) where semantically similar chunks have vectors that are close to each other in the vector space.

We will use the `sentence-transformers` library to generate these embeddings. This library provides pre-trained models that are excellent at creating meaningful sentence and paragraph embeddings.

Once we have the embeddings, we need an efficient way to store them and search for the most similar ones to a given query embedding. This is where a **vector index** (or vector store/database) comes in. We will use **FAISS (Facebook AI Similarity Search)**, a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. It's highly optimized for speed and can handle very large datasets.
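
Conceptually, an exact (brute-force) L2 search compares the query vector against every stored vector and returns the closest ones. The pure-Python sketch below illustrates that idea on made-up three-dimensional vectors; real embeddings have hundreds of dimensions, and the function names and toy values here are assumptions for illustration only. It ranks by squared Euclidean distance, which gives the same ordering as the true distance.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query: Sequence[float], vectors: Sequence[Sequence[float]],
            k: int) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for the k vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings": the first two point in a similar direction, the third does not.
vectors = [
    [0.9, 0.1, 0.0],  # e.g. a chunk about retrieval
    [0.8, 0.2, 0.1],  # e.g. another chunk about retrieval
    [0.0, 0.1, 0.9],  # e.g. an unrelated chunk
]
query = [1.0, 0.0, 0.0]

top_indices = [i for i, _ in nearest(query, vectors, k=2)]
print(top_indices)  # [0, 1]
```

A library like FAISS performs the same kind of comparison with heavily optimized routines, and also offers approximate index types for collections too large to scan exhaustively.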

### 4.1 Generating Embeddings and Building a FAISS Index

Let's choose one of our chunking methods to proceed. For this demonstration, the `semantic_chunks` (grouped by sentences) or `langchain_chunks_from_text` are good candidates as they attempt to preserve semantic coherence. We will proceed with `semantic_chunks` for this part, but you could adapt this to use any list of text chunks.

First, we load a pre-trained sentence transformer model. Then, we encode our text chunks to get their embeddings. Finally, we create a FAISS index and add these embeddings to it.

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# 1. Load a pre-trained Sentence Transformer model
#    "all-MiniLM-L6-v2" is a popular choice: fast and good quality.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Determine which chunks to embed
if "semantic_chunks" not in globals() or not semantic_chunks:
    print("Warning: semantic_chunks not found or empty. Please run a chunking cell first.")
    # Try a LangChain-generated fallback if available
    if "langchain_chunks_from_text" in globals() and langchain_chunks_from_text:
        chunks_to_embed = langchain_chunks_from_text
        print(f"Using {len(chunks_to_embed)} langchain_chunks_from_text for embedding.")
    else:
        # Final fallback to a dummy placeholder
        chunks_to_embed = ["This is a placeholder chunk as no other chunks were found."]
        print("Using placeholder text for embedding as no suitable chunks were found.")
else:
    chunks_to_embed = semantic_chunks
    print(f"Using {len(semantic_chunks)} semantic_chunks for embedding.")

# 2. Generate embeddings for our text chunks
#    The .encode() method converts a list of sentences/text into a numpy array of embeddings.
chunk_embeddings = embedding_model.encode(
    chunks_to_embed,
    convert_to_numpy=True,
    show_progress_bar=True
)

print(f"Shape of the embeddings matrix: {chunk_embeddings.shape}")  # (num_chunks, embedding_dim)

# 3. Build a FAISS index
embedding_dimension = chunk_embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(embedding_dimension)  # exact L2 search

# 4. Add the chunk embeddings to the FAISS index
faiss_index.add(chunk_embeddings.astype(np.float32))  # FAISS requires float32

print(f"FAISS index built. Total vectors in index: {faiss_index.ntotal}")



Using 140 semantic_chunks for embedding.


Shape of the embeddings matrix: (140, 384)
FAISS index built. Total vectors in index: 140


**Explanation of Embedding Generation and FAISS Indexing Code:**

1.  **Import Libraries:**
    *   `SentenceTransformer` from `sentence_transformers`: For loading embedding models and encoding text.
    *   `faiss`: The FAISS library for vector indexing.
    *   `numpy`: For numerical operations, especially handling the embedding arrays.
2.  **Load Embedding Model:**
    *   `embedding_model = SentenceTransformer("all-MiniLM-L6-v2")`: Initializes a pre-trained sentence embedding model. `all-MiniLM-L6-v2` is a widely used model known for its balance of speed and performance. It maps sentences and paragraphs to a 384-dimensional dense vector space.
3.  **Select Chunks for Embedding:**
    *   The code selects `semantic_chunks` by default. It includes a check to see if this variable exists and is not empty. If not, it tries to fall back to `langchain_chunks_from_text` or a placeholder to ensure the notebook can run sequentially. In a real workflow, you would ensure your desired chunks are prepared.
4.  **Generate Embeddings:**
    *   `chunk_embeddings = embedding_model.encode(chunks_to_embed, convert_to_numpy=True, show_progress_bar=True)`: This is the core step for creating embeddings.
        *   `chunks_to_embed`: The list of text strings (our chunks) to be encoded.
        *   `convert_to_numpy=True`: Ensures the output is a NumPy array, which is convenient for FAISS.
        *   `show_progress_bar=True`: Displays a progress bar, which is helpful for larger datasets.
    *   The resulting `chunk_embeddings` is a 2D NumPy array where each row is the embedding vector for a chunk, and the number of columns is the dimensionality of the embedding model (384 for `all-MiniLM-L6-v2`).
5.  **Build FAISS Index:**
    *   `embedding_dimension = chunk_embeddings.shape[1]`: Gets the dimensionality of the embeddings from the shape of the `chunk_embeddings` matrix.
    *   `faiss_index = faiss.IndexFlatL2(embedding_dimension)`: Creates a FAISS index.
        *   `IndexFlatL2`: One of the simplest FAISS index types. It performs an exact (brute-force) search: every query vector is compared against every stored vector using L2 (Euclidean) distance, so search time grows linearly with the size of the index. That is fine for our 140 chunks but becomes slow at the scale of millions of vectors. For larger collections, FAISS offers approximate nearest neighbor (ANN) indexes such as `IndexIVFPQ`, which trade some recall for much faster search and lower memory use.
6.  **Add Embeddings to Index:**
    *   `faiss_index.add(chunk_embeddings.astype(np.float32))`: Adds our generated chunk embeddings to the FAISS index. FAISS requires input vectors of type `float32`, so we call `.astype(np.float32)` to guarantee compatibility.
7.  **Print Confirmation:**
    *   `print(f"FAISS index built. Total vectors in index: {faiss_index.ntotal}")`: Confirms that the index is built and shows how many vectors it contains, which should match the number of chunks we embedded.
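
To make the exact search concrete, here is a minimal pure-Python sketch of what a flat L2 index does conceptually: score every stored vector against the query and keep the `k` smallest distances. The function name `flat_l2_search` and the toy vectors are illustrative assumptions; this is not FAISS code, which performs the same comparison in optimized native code. Like `IndexFlatL2`, the sketch reports squared Euclidean distances.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(vectors: Sequence[Sequence[float]],
                   query: Sequence[float],
                   k: int) -> List[Tuple[float, int]]:
    """Brute-force search: return (squared_distance, index) for the k nearest vectors."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        sq_dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((sq_dist, idx))
    scored.sort()  # smallest distance first
    return scored[:k]


# Hypothetical 2-dimensional "embeddings", chosen only for illustration
toy_vectors = [[0, 0], [1, 0], [3, 4]]
nearest = flat_l2_search(toy_vectors, query=[1, 1], k=2)
print(nearest)
```

Because every stored vector is visited once per query, the cost grows linearly with the number of vectors, which is why approximate indexes become attractive for very large collections.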

### 4.2 Performing Similarity Search (Retrieval)

Now that we have our FAISS index populated with the embeddings of our document chunks, we can perform a similarity search. This involves taking a query (e.g., a question), embedding it using the *same* sentence transformer model, and then using FAISS to find the chunks in our index whose embeddings are closest to the query embedding.

In [None]:
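from typing import List, Tuple  # Tuple is needed for the return annotation

def search_faiss_index(query: str, model, index, chunks_list: List[str], k: int = 3) -> Tuple[List[str], np.ndarray]:
    """Encodes a query, searches the FAISS index, and returns the k most similar chunks with their distances."""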
    # 1. Encode the query using the same model
    query_embedding = model.encode([query], convert_to_numpy=True).astype(np.float32)

    # 2. Search the FAISS index
    #    index.search returns two NumPy arrays:
    #    D: distances to the k nearest neighbors
    #    I: indices of the k nearest neighbors in the original dataset
    distances, indices = index.search(query_embedding, k)

    # 3. Retrieve the actual text chunks
    retrieved_chunks = [chunks_list[i] for i in indices[0]]

    return retrieved_chunks, distances[0]

# Example query
test_query = "What is Retrieval-Augmented Generation?"

# Perform the search
# Ensure chunks_to_embed is the same list of chunks that was used to build the FAISS index
retrieved_docs, retrieved_distances = search_faiss_index(test_query, embedding_model, faiss_index, chunks_to_embed, k=3)

print(f"Query: {test_query}")
print("Retrieved Chunks:")
for i, (doc, dist) in enumerate(zip(retrieved_docs, retrieved_distances)):
    print(f"Rank {i+1} (Distance: {dist:.4f}):")
    print(f"{doc[:500]}...") # Print the first 500 characters of the chunk

Query: What is Retrieval-Augmented Generation?
Retrieved Chunks:
Rank 1 (Distance: 0.8292):
This said, RAG techniques may work well in these settings, and
could represent promising future work. 6 Discussion
In this work, we presented hybrid generation models with access to parametric and non-parametric
memory. We showed that our RAG models obtain state of the art results on open-domain QA. We
found that people prefer RAG’s generation over purely parametric BART, ﬁnding RAG more factual
and speciﬁc. We conducted an thorough investigation of the learned retrieval component, validating
it...
Rank 2 (Distance: 0.9175):
The non-parametric memory index does not consist of trainable parameters, but does consists of 21M
728 dimensional vectors, consisting of 15.3B values. These can be easily be stored at 8-bit ﬂoating
point precision to manage memory and disk footprints. H Retrieval Collapse
In preliminary experiments, we observed that for some tasks such as story generation [ 11], the
retriev

**Explanation of Similarity Search Code:**

1.  **`search_faiss_index` Function:**
    *   Takes the `query` string, the loaded `model` (SentenceTransformer), the `index` (FAISS index), the original `chunks_list` (the list of text chunks corresponding to the embeddings in the index), and `k` (the number of top results to retrieve) as input.
    *   **Encode Query:** `query_embedding = model.encode([query], ...)`: The input query string is embedded using the *exact same* `embedding_model` that was used for the document chunks. This is crucial because similarity is only meaningful within the vector space defined by one model. The query is passed as a one-element list `[query]` so that `encode` returns a 2D array of shape `(1, 384)`, which is the batch format `index.search` expects. We also cast the result to `np.float32`.
    *   **Search Index:** `distances, indices = index.search(query_embedding, k)`:
        *   The `index.search()` method takes a 2D array of query vectors (one row per query) and `k`.
        *   It returns two arrays (named `D` and `I` in the FAISS documentation):
            *   `distances`: A 2D array containing the distances from each query to its `k` nearest neighbors. For `IndexFlatL2` these are *squared* L2 distances. Each row corresponds to one query.
            *   `indices`: A 2D array containing the 0-based positions of these `k` nearest neighbors, in the order the vectors were added to the FAISS index. Each row corresponds to one query.
    *   **Retrieve Chunks:** `retrieved_chunks = [chunks_list[i] for i in indices[0]]`:
        *   Since we searched with a single query, `indices[0]` gives us the array of indices of the `k` most similar chunks from our original `chunks_list`.
        *   A list comprehension is used to fetch the actual text content of these chunks using these indices.
    *   Returns the `retrieved_chunks` (list of strings) and their corresponding `distances`.
2.  **Example Query and Search:**
    *   `test_query = "What is Retrieval-Augmented Generation?"`: Defines a sample question.
    *   `retrieved_docs, retrieved_distances = search_faiss_index(...)`: Calls our search function.
        *   It is critical that `chunks_to_embed` passed here is the *exact same list of strings* whose embeddings were added to `faiss_index` in the same order. The indices returned by FAISS refer to the order in which vectors were added.
3.  **Print Results:**
    *   The code then iterates through the `retrieved_docs` and `retrieved_distances`, printing each retrieved chunk along with its rank and distance to the query. Smaller distances mean higher similarity for L2 distance.
    *   `doc[:500]` prints only the first 500 characters of each retrieved chunk for brevity.

This retrieval step is the "R" in RAG. We have successfully retrieved relevant pieces of information from our knowledge base (the RAG paper) based on a user query. The next step in a full RAG system would be to feed these retrieved chunks, along with the original query, to a Large Language Model (LLM) to generate a comprehensive answer.
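
The distances printed above are easier to interpret with one identity: for unit-length vectors, squared L2 distance and cosine similarity are related by ‖a − b‖² = 2 − 2·cos(a, b). The sketch below checks this numerically in plain Python. It assumes the embeddings are unit-normalized, which you should verify for your own model (for example, by checking that each row's norm is close to 1); all helper names are illustrative.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_from_squared_l2(sq_dist: float) -> float:
    """Convert a squared L2 distance between unit vectors into cosine similarity."""
    return 1.0 - sq_dist / 2.0


# Toy vectors, assumed for illustration only
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

sq = squared_l2(a, b)
print(f"squared L2: {sq:.4f}")
print(f"cosine via identity: {cosine_from_squared_l2(sq):.4f}")
print(f"cosine via dot product: {dot(a, b):.4f}")
```

Under that assumption, a squared distance of 0.8292 (the Rank 1 result) would correspond to a cosine similarity of roughly 0.585.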

## 5. Building a RAG Pipeline with LangChain

While we have manually performed the steps of chunking, embedding, indexing, and retrieval, LangChain provides a powerful and streamlined framework to build RAG systems with much less boilerplate code. LangChain abstracts many of these components, allowing us to easily plug and play different LLMs, vector stores, and retrieval strategies.

We will now rebuild our RAG system using LangChain components. This typically involves:
1.  **Document Loading:** Using a LangChain `DocumentLoader` to load the data (e.g., `PyPDFLoader` for PDFs directly from a URL or local path).
2.  **Text Splitting:** Using a LangChain `TextSplitter` (like the `RecursiveCharacterTextSplitter` we saw earlier) to chunk the loaded documents.
3.  **Embedding Model:** Defining an embedding model using LangChain wrappers (e.g., `SentenceTransformerEmbeddings`).
4.  **Vector Store:** Creating a vector store (e.g., LangChain's `FAISS` wrapper) from the chunked documents and the embedding model. This handles both embedding and indexing internally.
5.  **Retriever:** Obtaining a retriever interface from the vector store, which can be used to fetch relevant documents for a query.
6.  **LLM:** Setting up a Large Language Model (e.g., using `HuggingFacePipeline` to wrap a model from the Hugging Face Hub).
7.  **QA Chain:** Constructing a Question-Answering chain (like `RetrievalQA`) that combines the retriever and the LLM to answer questions based on the retrieved context.

### 5.1 Setting up the LLM with HuggingFacePipeline

First, let's set up the Language Model we will use for the generation part. LangChain provides a `HuggingFacePipeline` wrapper that allows us to use any model from the Hugging Face Hub that supports a relevant task (like `text2text-generation` or `text-generation`). For this demo, we'll use a relatively small and fast model, `google/flan-t5-base`.

In [None]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline # Updated import path

# Define the Hugging Face pipeline for text-to-text generation
# google/flan-t5-base is a good, relatively lightweight model for demonstration
# device=0 selects the first GPU; use device=-1 to run on CPU instead.
hf_llm_pipeline = pipeline(
    task="text2text-generation",
    model="google/flan-t5-base",
    device=0,                # Use -1 for CPU, or device index for GPU (e.g., 0)
    max_length=256,         # Maximum length of the generated text
    do_sample=False,          # Set to True to use sampling, False for deterministic output
    temperature=0.7,          # Controls randomness if do_sample=True (0.0-1.0). Lower is more deterministic.
    top_p=0.9                 # Nucleus sampling parameter if do_sample=True
)

# Wrap the Hugging Face pipeline in LangChain's HuggingFacePipeline class
llm = HuggingFacePipeline(pipeline=hf_llm_pipeline)

print("LLM (google/flan-t5-base via HuggingFacePipeline) initialized.")

# You can test the LLM directly (optional)
# test_llm_output = llm.invoke("Translate to German: Hello, how are you?")
# print(f"LLM Test Output: {test_llm_output}")

Device set to use cuda:0


LLM (google/flan-t5-base via HuggingFacePipeline) initialized.



**Explanation of LLM Setup Code:**

1.  **Import Libraries:**
    *   `pipeline` from `transformers`: This is a high-level utility from the Hugging Face `transformers` library to easily load pre-trained models and their tokenizers for various tasks.
    *   `HuggingFacePipeline` from `langchain_community.llms` (note the updated import path for newer LangChain versions): This LangChain class wraps a Hugging Face `pipeline` object, making it compatible with the LangChain ecosystem.
2.  **Create Hugging Face Pipeline:**
    *   `hf_llm_pipeline = pipeline(...)`: Initializes the Hugging Face pipeline.
        *   `task="text2text-generation"`: Specifies the task. `google/flan-t5-base` is a sequence-to-sequence model suitable for tasks like translation, summarization, and question answering where input text is transformed into output text.
        *   `model="google/flan-t5-base"`: The identifier of the pre-trained model from the Hugging Face Model Hub.
        *   `device=0`: Instructs the pipeline to run on the first CUDA GPU (the cell output confirms `cuda:0`). If you have no GPU, or PyTorch was installed without GPU support, set this to `-1` to run on the CPU.
        *   `max_length=256`: Sets the maximum number of tokens the model can generate in its output.
        *   `do_sample=False`: When `False`, the model uses greedy decoding, picking the most probable next token at each step, leading to deterministic output. If `True`, it uses sampling methods (controlled by `temperature`, `top_p`, etc.) which can produce more diverse and creative outputs but are non-deterministic.
        *   `temperature` and `top_p`: These parameters control the sampling process if `do_sample=True`. `temperature` rescales the token probabilities (lower values make the output less random), and `top_p` (nucleus sampling) restricts sampling to the smallest set of most probable tokens whose cumulative probability reaches `top_p`. They are ignored when `do_sample=False`, and recent `transformers` versions may print a warning about this. We include them here for completeness, as you might want to experiment with sampling.
3.  **Wrap in LangChain `HuggingFacePipeline`:**
    *   `llm = HuggingFacePipeline(pipeline=hf_llm_pipeline)`: Creates the LangChain LLM object by passing our configured Hugging Face pipeline to it. This `llm` object can now be used in LangChain chains.
4.  **Confirmation and Optional Test:**
    *   A print statement confirms initialization.
    *   The commented-out lines show how you can directly invoke the LangChain `llm` object with a prompt to test if it's working and generating text as expected. This is a good sanity check.
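
To illustrate how `do_sample` and `top_p` differ, here is a small pure-Python sketch of greedy selection versus nucleus (top-p) filtering over a toy next-token distribution. The token probabilities and function names are made up for illustration; this is a conceptual sketch, not the `transformers` implementation.

```python
from typing import Dict


def greedy_pick(probs: Dict[str, float]) -> str:
    """Greedy decoding (do_sample=False): always take the most probable token."""
    return max(probs, key=probs.get)


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution
toy_probs = {"retrieval": 0.5, "generation": 0.3, "index": 0.15, "banana": 0.05}

print(greedy_pick(toy_probs))
print(nucleus_filter(toy_probs, top_p=0.9))
```

With `top_p=0.9`, the unlikely token `"banana"` is removed before sampling, while greedy decoding ignores the distribution's tail entirely and always returns the single most probable token.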

### 5.2 Creating the LangChain RAG Pipeline (Loading, Splitting, Embedding, Storing, Retrieving)

Now, let's use LangChain components to handle the data preparation and retrieval parts: loading the PDF, splitting it into chunks, embedding those chunks, and creating a FAISS vector store that can act as a retriever.

In [None]:
from langchain_community.document_loaders import PyPDFLoader # Updated import path
from langchain.text_splitter import RecursiveCharacterTextSplitter # Already imported, but good to remember
from langchain_community.vectorstores import FAISS as LangchainFAISS # Updated import path, aliased to avoid conflict
from langchain_community.embeddings import SentenceTransformerEmbeddings # Updated import path

# 1. Load the PDF using PyPDFLoader
#    PyPDFLoader can load from a URL or a local file path.
#    It loads the PDF and splits it into Document objects, typically one per page.
pdf_loader = PyPDFLoader(pdf_url) # Using the pdf_url defined earlier
pages_as_documents = pdf_loader.load() # This returns a list of LangChain Document objects

print(f"Loaded {len(pages_as_documents)} pages from the PDF as LangChain Documents.")
# if pages_as_documents:
#     print(f"Content of first page (first 200 chars): {pages_as_documents[0].page_content[:200]}...")
#     print(f"Metadata of first page: {pages_as_documents[0].metadata}")

# 2. Split the loaded documents into smaller chunks
#    We use the same RecursiveCharacterTextSplitter as before, but this time on Document objects.
lc_doc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    add_start_index=True # Good for referencing later
)
document_chunks = lc_doc_splitter.split_documents(pages_as_documents)

print(f"Split the PDF into {len(document_chunks)} smaller Document chunks.")
# if document_chunks:
#     print(f"Content of first chunk (first 200 chars): {document_chunks[0].page_content[:200]}...")
#     print(f"Metadata of first chunk: {document_chunks[0].metadata}") # Notice metadata like page number is often preserved

# 3. Initialize the embedding model using LangChain wrapper
#    We use the same SentenceTransformer model for consistency.
lc_embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# 4. Create a FAISS vector store from the document chunks and embeddings
#    LangChain's FAISS wrapper handles the embedding process internally when given documents and an embedding model.
#    This might take a moment as it embeds all chunks.
vector_store = LangchainFAISS.from_documents(document_chunks, lc_embedding_model)

print("FAISS vector store created and populated with document chunk embeddings.")

# 5. Obtain a retriever from the vector store
#    The retriever is an interface that can find relevant documents given a query.
#    `search_kwargs={"k": 3}` means it will retrieve the top 3 most similar documents by default.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

print("Retriever created from FAISS vector store. Ready to retrieve documents.")

# You can test the retriever (optional)
# retrieved_by_langchain = retriever.get_relevant_documents("What are the main components of RAG?")
# print(f"\nRetrieved {len(retrieved_by_langchain)} documents using LangChain retriever:")
# for i, doc in enumerate(retrieved_by_langchain):
#     print(f"Doc {i+1} (first 150 chars): {doc.page_content[:150]}...")
#     print(f"Doc {i+1} Metadata: {doc.metadata}")


Loaded 19 pages from the PDF as LangChain Documents.
Split the PDF into 92 smaller Document chunks.


FAISS vector store created and populated with document chunk embeddings.
Retriever created from FAISS vector store. Ready to retrieve documents.


**Explanation of LangChain Data Pipeline Code:**

1.  **Import LangChain Components:**
    *   `PyPDFLoader`: For loading PDF files into LangChain `Document` objects.
    *   `RecursiveCharacterTextSplitter`: For splitting `Document` objects into smaller chunks (also `Document` objects).
    *   `FAISS` (as `LangchainFAISS`): LangChain's wrapper for the FAISS vector store. It simplifies creating an index from documents and an embedding model.
    *   `SentenceTransformerEmbeddings`: LangChain's wrapper for using sentence-transformer models to generate embeddings.
2.  **Load PDF with `PyPDFLoader`:**
    *   `pdf_loader = PyPDFLoader(pdf_url)`: Initializes the loader with the URL of our RAG paper.
    *   `pages_as_documents = pdf_loader.load()`: Calls the `load` method, which downloads the PDF, parses it, and returns a list of `Document` objects. Typically, `PyPDFLoader` creates one `Document` per page of the PDF. Each `Document` contains `page_content` (the text of that page) and `metadata` (e.g., `source` URL and `page` number).
3.  **Split Documents with `RecursiveCharacterTextSplitter`:**
    *   `lc_doc_splitter = RecursiveCharacterTextSplitter(...)`: Initializes the splitter with desired `chunk_size` and `chunk_overlap`, similar to how we used it for raw text, but now it will operate on `Document` objects.
    *   `document_chunks = lc_doc_splitter.split_documents(pages_as_documents)`: This method takes the list of page-level `Document` objects and splits them into smaller `Document` chunks. Importantly, LangChain splitters try to preserve or update metadata. For example, the page number from the original page document might be carried over to the chunks derived from it.
4.  **Initialize Embedding Model with `SentenceTransformerEmbeddings`:**
    *   `lc_embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")`: Creates a LangChain-compatible embedding model object. We specify the same `model_name` as before to ensure consistency in the embedding space.
5.  **Create FAISS Vector Store with `LangchainFAISS.from_documents`:**
    *   `vector_store = LangchainFAISS.from_documents(document_chunks, lc_embedding_model)`: This is a very convenient LangChain method. It takes the list of `Document` chunks and the embedding model. Internally, it performs two key operations:
        1.  It iterates through each `document_chunk`, gets its `page_content`, and uses the `lc_embedding_model` to generate its vector embedding.
        2.  It then creates a FAISS index and adds all these (text, embedding) pairs to it.
    *   The resulting `vector_store` object is now a fully functional FAISS index wrapped in a LangChain interface.
6.  **Obtain Retriever from Vector Store:**
    *   `retriever = vector_store.as_retriever(search_kwargs={"k": 3})`: The vector store can be converted into a `Retriever` object. A retriever is a generic LangChain interface for fetching relevant documents based on a query.
        *   `search_kwargs={"k": 3}`: This argument configures the retriever to return the top 3 most similar documents when queried. This `k` value can be adjusted based on your needs.
7.  **Confirmation and Optional Test:**
    *   Print statements confirm the steps.
    *   The commented-out section shows how to use the `retriever.get_relevant_documents("Your query here")` method to directly test the retrieval part. It should return a list of `Document` objects (our chunks) that are most relevant to the query, along with their metadata.
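
The interaction between `chunk_size`, `chunk_overlap`, and the `start_index` metadata can be seen in a simplified pure-Python sketch. Note the assumption: this is a plain fixed-window splitter, not the algorithm of `RecursiveCharacterTextSplitter`, which additionally tries to break at separators such as paragraph and line boundaries. The function name and the dictionary layout are illustrative.

```python
from typing import Dict, List, Union


def split_with_overlap(text: str, chunk_size: int,
                       chunk_overlap: int) -> List[Dict[str, Union[int, str]]]:
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "start_index": start,  # character offset in the source text
            "page_content": text[start:start + chunk_size],
        })
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so with the notebook's settings (1000 and 200) a new chunk would begin every 800 characters in this simplified scheme.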

### 5.3 Creating and Running the `RetrievalQA` Chain

With the retriever and the LLM set up, we can now combine them into a `RetrievalQA` chain. This chain will:
1.  Take a user's query.
2.  Use the `retriever` to fetch relevant document chunks from our FAISS vector store.
3.  Combine the query and the content of these retrieved chunks into a prompt.
4.  Pass this prompt to the `llm` to generate an answer.

In [None]:
from langchain.chains import RetrievalQA

# 1. Create the RetrievalQA chain
#    Chain types: "stuff", "map_reduce", "refine", "map_rerank".
#    "stuff" is simplest: concatenates all retrieved docs into the prompt.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                            # Our initialized LangChain LLM wrapper
    chain_type="stuff",                 # How to handle retrieved documents
    retriever=retriever,                # Our FAISS-backed retriever
    return_source_documents=True,       # Return the source documents used for the answer
    verbose=False                       # Set True for detailed chain logs
)

print("RetrievalQA chain created.")

# 2. Define a test query for the RAG system
rag_query = "How is the RAG model structured and what are its key components according to the paper?"

# 3. Run the QA chain with the query
#    This does: 1) retrieve docs, 2) stuff them into the prompt, 3) generate answer
rag_response = qa_chain({"query": rag_query})

# 4. Print out the query and result
print(f"\nQuery: {rag_response['query']}")
print(f"\nGenerated Answer:\n{rag_response['result']}")

# 5. If source documents were returned, display snippets and metadata
source_docs = rag_response.get("source_documents", [])
if source_docs:
    print("\nSource Documents Used:")
    for i, doc in enumerate(source_docs, start=1):
        snippet = doc.page_content[:200].replace("\n", " ")
        print(f"\nSource {i} Snippet: {snippet}...")
        print(f"Source {i} Metadata: {doc.metadata}")
        print("-" * 40)


RetrievalQA chain created.


Query: How is the RAG model structured and what are its key components according to the paper?

Generated Answer:
The RAG model uses the input sequencex to retrieve text documents z and use them as additional context when generating the target sequence y

Source Documents Used:

Source 1 Snippet: blob/master/examples/rag/README.md and an interactive demo of a RAG model can be found at https://huggingface.co/rag/ 2https://github.com/pytorch/fairseq 3https://github.com/huggingface/transformers 1...
Source 1 Metadata: {'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-04-13T00:48:38+00:00', 'author': '', 'keywords': '', 'moddate': '2021-04-13T00:48:38+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'https://arxiv.org/pdf/2005.11401.pdf', 'total_pages': 19, 'page': 16, 'page_label': '17', 'start_index': 2357}
-------------------

**Explanation of `RetrievalQA` Chain Code:**

1.  **Import `RetrievalQA`:** This is the LangChain class for creating a standard retrieval-augmented question-answering chain.
2.  **Create `RetrievalQA` Chain:**
    *   `qa_chain = RetrievalQA.from_chain_type(...)`: Initializes the chain.
        *   `llm=llm`: Passes our LangChain-wrapped LLM (`google/flan-t5-base` in this case).
        *   `chain_type="stuff"`: This is a crucial parameter that defines how the retrieved documents are incorporated into the prompt for the LLM.
            *   **`stuff`**: This is the simplest method. It takes all the retrieved document chunks and "stuffs" their content directly into the prompt along with the user's query. It's effective if the total length of the retrieved text plus the query fits within the LLM's context window. If it exceeds the context window, an error will occur or the text will be truncated.
            *   Other `chain_type` options (more advanced, for handling more documents or longer texts):
                *   `map_reduce`: Processes each chunk individually with the LLM (map step), then combines these individual responses (reduce step).
                *   `refine`: Processes the first chunk, then iteratively refines the answer by processing subsequent chunks one by one, feeding the previous answer and the new chunk to the LLM.
                *   `map_rerank`: Processes each chunk individually, scores the answer, and returns the highest-scoring one.
            For our demonstration with `k=3` retrieved documents and a small LLM, `stuff` is usually appropriate.
        *   `retriever=retriever`: Passes our configured FAISS retriever.
        *   `return_source_documents=True`: If `True`, the chain's output will include the list of actual document chunks that were retrieved and used to generate the answer. This is very useful for transparency, debugging, and allowing users to see the evidence for the answer.
        *   `verbose=False`: If set to `True`, LangChain will print detailed logs about the chain's execution steps, which can be helpful for understanding its internal workings or debugging.
3.  **Define Query:**
    *   `rag_query = "..."`: A sample question to ask our RAG system, related to the content of the RAG paper.
4.  **Run the Chain:**
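    *   `rag_response = qa_chain({"query": rag_query})`: Executes the QA chain. The input to a `RetrievalQA` chain is a dictionary where the key is `"query"` and the value is the user's question string. Recent LangChain versions deprecate calling the chain object directly in favor of `qa_chain.invoke({"query": rag_query})`, which takes the same dictionary.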
    *   The chain will internally perform the retrieval, prompt construction, and LLM call.
5.  **Print Results:**
    *   The `rag_response` is a dictionary. The main generated answer is typically under the key `"result"`. The original query is usually echoed back under the `"query"` key.
    *   If `return_source_documents=True` was set, the response dictionary will also contain a key `"source_documents"` whose value is a list of the LangChain `Document` objects (chunks) that were retrieved and used.
    *   The code iterates through these source documents and prints a snippet of their content and their metadata (which might include the original PDF page number, start index, etc.), providing context for the generated answer.
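
As a rough illustration of what the `stuff` strategy does before the LLM is called, the sketch below concatenates retrieved chunks and the question into a single prompt string. The template wording and the function name `build_stuff_prompt` are assumptions made for this example; LangChain uses its own internal prompt template, which may differ.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate all retrieved chunks into one context block and
    append the user's question (hypothetical template)."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Toy chunks standing in for retrieved Document.page_content values
toy_retrieved = [
    "RAG combines a retriever with a generator.",
    "The retriever returns documents relevant to the input.",
]
prompt = build_stuff_prompt("What are the components of RAG?", toy_retrieved)
print(prompt)
print(f"Prompt length in characters: {len(prompt)}")
```

Because everything lands in one prompt, its length grows with both `k` and `chunk_size`, which is why `stuff` only works while the combined text fits the model's context window.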

## 6. Optimization and Production Best Practices

Building a basic RAG system is a great start, but for real-world applications, several optimizations and best practices should be considered to improve performance, relevance, and robustness. This section will discuss some key areas, primarily through explanations, as implementing each would significantly expand the notebook.

### 6.1 Choosing the Right Components

*   **Embedding Models:**
    *   The `all-MiniLM-L6-v2` model used is good for general purposes and speed. However, for specific domains or higher accuracy, explore other models from `sentence-transformers` (e.g., `all-mpnet-base-v2` for better performance, or domain-specific embeddings if available). Consider models from Hugging Face's Massive Text Embedding Benchmark (MTEB) leaderboard.
    *   **Finetuning Embeddings:** For highly specialized domains, finetuning an embedding model on your own data can yield significant improvements in retrieval quality.
*   **LLMs:**
    *   The Flan-T5 family starts at the very small `google/flan-t5-small`; this notebook uses `google/flan-t5-base`, which is still lightweight. For better generation quality, consider larger models like `Flan-T5-large`, or models from other families like Llama, Mistral, or GPT (if using APIs like OpenAI's).
    *   The choice depends on the trade-off between performance, cost (if using API-based models), and computational resources.
*   **Vector Stores:**
    *   FAISS `IndexFlatL2` provides exact search. For very large datasets (millions+ vectors), this can be slow. Explore FAISS's Approximate Nearest Neighbor (ANN) indexes like `IndexIVFPQ` (Inverted File with Product Quantization). These trade a small amount of accuracy for a large speedup in search.
    *   Consider managed vector databases like Pinecone, Weaviate, Milvus, Qdrant, or ChromaDB, especially for production systems. They offer scalability, persistence, metadata filtering, and other advanced features.
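
A quick back-of-the-envelope calculation shows why index choice matters at scale. The sketch below estimates raw vector storage for a flat `float32` index and for product-quantized codes. It deliberately ignores overhead such as vector IDs, coarse centroids, and codebooks, and the parameter values (48 sub-quantizers, 8 bits each, one million vectors) are assumptions chosen only for illustration.

```python
def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed float32 vectors."""
    return num_vectors * dim * bytes_per_value


def pq_code_bytes(num_vectors: int, num_subquantizers: int,
                  bits_per_code: int = 8) -> int:
    """Raw storage for product-quantized codes (overhead not included)."""
    return num_vectors * num_subquantizers * bits_per_code // 8


# This notebook: 140 chunks with 384-dimensional embeddings
notebook_bytes = flat_index_bytes(140, 384)
print(f"Notebook index: {notebook_bytes:,} bytes")

# Hypothetical corpus of one million chunks
flat_million = flat_index_bytes(1_000_000, 384)
pq_million = pq_code_bytes(1_000_000, num_subquantizers=48)
print(f"Flat, 1M vectors: {flat_million / 1e9:.2f} GB")
print(f"PQ codes, 1M vectors: {pq_million / 1e6:.0f} MB")
```

At the size of this notebook, the flat index occupies only a few hundred kilobytes, so exact search is the sensible default here; compression and approximate search start to pay off only for much larger collections.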

### 6.2 Advanced Chunking and Preprocessing

*   **More Sophisticated Semantic Chunking:** Instead of regex or fixed sentence counts, use NLP libraries (spaCy, NLTK) for more accurate sentence/paragraph tokenization. LangChain also offers splitters like `SpacyTextSplitter` or `NLTKTextSplitter`. Consider chunking based on logical sections if your documents have clear structure (e.g., splitting by HTML headers, Markdown sections).
*   **Chunk Size and Overlap Tuning:** Experiment with different `chunk_size` and `chunk_overlap` values. Optimal values depend on your data, embedding model, and LLM context window. Too small chunks might lack context; too large might contain too much noise or exceed LLM limits.
*   **Metadata Enrichment:** When creating chunks (LangChain `Document` objects), add as much relevant metadata as possible (e.g., original document title, section, page number, creation date). This metadata can be used for filtering during retrieval or for providing more context to the LLM.
*   **Cleaning Text:** Preprocess text by removing irrelevant content (e.g., headers/footers from PDFs if they are noisy, boilerplate text, special characters that might confuse embedding models or LLMs).
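
As a minimal sketch of how cleaning, overlap tuning, and metadata enrichment fit together, the plain-Python helpers below normalize whitespace and then produce character-based chunks that carry their own metadata. The names `clean_text` and `chunk_with_metadata`, and the metadata keys, are hypothetical. In a LangChain pipeline the same information would usually live in `Document.metadata`.

```python
import re

def clean_text(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def chunk_with_metadata(text, chunk_size=200, chunk_overlap=50, source="unknown"):
    """Split text into overlapping character windows, attaching metadata to each."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": {"source": source, "start": start, "chunk_id": len(chunks)},
        })
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

raw = "Retrieval-Augmented   Generation\n\ncombines retrieval with generation. " * 5
chunks = chunk_with_metadata(clean_text(raw), chunk_size=80, chunk_overlap=20, source="demo.pdf")
for chunk in chunks[:2]:
    print(chunk["metadata"], repr(chunk["text"][:40]))
```

Re-running this with different `chunk_size` and `chunk_overlap` values, and inspecting where the boundaries fall, is a cheap way to start the tuning described above.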

### 6.3 Improving Retrieval Quality

*   **Number of Retrieved Chunks (`k`):** The `k` you pass when creating a retriever from the vector store, e.g. `vector_store.as_retriever(search_kwargs={"k": k})`, is important. Too few chunks might miss crucial information; too many might overwhelm the LLM or introduce noise. Tune this based on your LLM's context window and the nature of your queries/documents.
*   **Re-ranking:** After initial retrieval (e.g., top 20 chunks using vector similarity), use a more sophisticated (but potentially slower) re-ranking model to reorder these top chunks for relevance. Cross-encoder models are often used for this. LangChain has components like `ContextualCompressionRetriever` that can integrate re-rankers.
*   **Query Expansion/Transformation:** Sometimes the user's query might not be optimal for vector search. Techniques like query expansion (adding synonyms or related terms), query rewriting (using an LLM to rephrase the query for better retrieval), or HyDE (Hypothetical Document Embeddings - generating a hypothetical answer and embedding that for search) can improve retrieval.
*   **Hybrid Search:** Combine dense vector search (like FAISS) with traditional sparse keyword search (like BM25 from Elasticsearch or Whoosh). This can be beneficial as keyword search excels at matching specific terms/entities, while vector search excels at semantic similarity. LangChain supports some forms of hybrid search.
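
One common way to combine a dense ranking with a keyword ranking is Reciprocal Rank Fusion (RRF), which needs only the rank positions and not the raw scores. The sketch below is an illustration, not the method used earlier in this tutorial. The function name and chunk ids are hypothetical, and `rrf_k=60` is the constant commonly used with RRF (it is unrelated to the retriever's `k`).

```python
def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Each id receives 1 / (rrf_k + rank) from every list it appears in,
    so ids ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical chunk ids returned by a vector search and a BM25 search.
dense_ranking = ["c1", "c2", "c3"]
sparse_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks found by both retrievers (`c1`, `c3`) outrank those found by only one, which is the behavior hybrid search is after. The fused list could then be passed to a re-ranker or truncated to the top `k`.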

### 6.4 Prompt Engineering for RAG

*   The way you structure the prompt given to the LLM (containing the original query and the retrieved context) is critical.
*   **Clear Instructions:** Instruct the LLM to base its answer *only* on the provided context. Tell it what to do if the answer isn't found in the context (e.g., "If the context doesn't provide an answer, say so.").
*   **Formatting Context:** Clearly demarcate the retrieved context chunks in the prompt. Numbering them or using special tokens can help the LLM differentiate between different pieces of information.
*   **Iterate and Test:** Prompt engineering is often an iterative process. Test different prompt structures to see what works best for your specific LLM and use case.
*   LangChain's `RetrievalQA` chain uses a default prompt, but you can customize it using the `chain_type_kwargs` parameter (e.g., `chain_type_kwargs={"prompt": custom_prompt_template}`).
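
The prompt guidelines above can be sketched as a small framework-independent helper that numbers the retrieved chunks and states the fallback instruction explicitly. The function name `build_rag_prompt` and the exact wording of the instructions are assumptions to adapt and test against your own LLM.

```python
def build_rag_prompt(question, chunks):
    """Assemble a RAG prompt with numbered context chunks and clear instructions."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What does RAG add to an LLM?",
    ["RAG retrieves external documents at query time.",
     "The retrieved text is placed in the prompt as context."],
)
print(prompt)
```

Numbering the chunks also lets you ask the model to cite the chunk numbers it relied on, which makes answers easier to verify.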

### 6.5 Evaluation

*   **Retrieval Metrics:** Evaluate the retriever component separately. Metrics include:
    *   `Hit Rate`: Percentage of queries for which at least one relevant document is retrieved in the top k.
    *   `Mean Reciprocal Rank (MRR)`: Considers the rank of the first relevant document.
    *   `Normalized Discounted Cumulative Gain (NDCG)`: Considers the position and relevance scores of all retrieved documents.
*   **Generation Metrics:** Evaluate the quality of the final generated answer. This is harder and often requires human evaluation. Some automated metrics (with caveats) include:
    *   `BLEU`, `ROUGE`, `METEOR`: Typically used for machine translation and summarization, can give some indication of similarity to reference answers.
    *   **LLM-as-a-Judge:** Use a powerful LLM (like GPT-4) to score the generated answers based on criteria like faithfulness (to the source), relevance (to the query), and coherence.
*   **End-to-End Evaluation:** Frameworks like RAGAs (RAG Assessment) provide tools and metrics specifically designed for evaluating RAG pipelines, considering aspects like faithfulness, answer relevance, and context relevance.
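
The three retrieval metrics are simple enough to compute by hand on a small labelled set. The sketch below assumes you have, for each query, a ranked list of retrieved chunk ids and a set of ids judged relevant. The function names and the tiny dataset are hypothetical.

```python
import math

def hit_rate(results, relevant, k):
    """Fraction of queries with at least one relevant id in the top k."""
    hits = sum(
        1 for q, ranked in results.items()
        if any(doc in relevant[q] for doc in ranked[:k])
    )
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the first relevant id (0 if none is retrieved)."""
    total = 0.0
    for q, ranked in results.items():
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(results)

def ndcg_at_k(ranked, gains, k):
    """NDCG for one query; gains maps id -> graded relevance (missing = 0)."""
    dcg = sum(
        gains.get(doc, 0) / math.log2(i + 1)
        for i, doc in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
relevant = {"q1": {"b"}, "q2": {"x"}}

print(hit_rate(results, relevant, k=3))         # 0.5
print(mean_reciprocal_rank(results, relevant))  # 0.25
print(ndcg_at_k(["a", "b", "c"], {"a": 3, "b": 2, "c": 1}, k=3))  # 1.0
```

Here `q1` finds its relevant chunk at rank 2 and `q2` finds nothing, giving a hit rate of 0.5 and an MRR of 0.25. The NDCG example scores 1.0 because the chunks are already in ideal order.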

### 6.6 Handling Failures and Edge Cases

*   **No Relevant Documents Found:** What should the system do if the retriever finds no relevant documents, or if the retrieved documents have very low similarity scores? The LLM should be prompted to indicate that it cannot answer based on the provided information.
*   **Conflicting Information:** If retrieved chunks contain conflicting information, how should the LLM handle it? This is a challenging area, and sophisticated prompting or a multi-step reasoning process might be needed.
*   **LLM Hallucinations:** Even with RAG, LLMs can sometimes hallucinate or generate plausible but incorrect information. Explicitly instructing the model in the prompt to stick to the provided context reduces this risk, and returning the source documents alongside the answer lets users verify it.
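
A simple guard for the "no relevant documents" case is to filter retrieved chunks by a similarity threshold before calling the LLM, and to return the sources with every answer. The sketch below is a minimal illustration: the function names, the `min_score` value, and the `generate` callable (a stand-in for your LLM call) are all assumptions. It also assumes higher scores mean more similar, so L2 distances from an index such as `IndexFlatL2`, where lower is better, would need to be converted first.

```python
def select_context(scored_chunks, min_score=0.3, k=4):
    """Keep at most k chunks whose similarity score reaches the threshold."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in kept[:k]]

def answer_or_refuse(question, scored_chunks, generate, min_score=0.3):
    """Call the generator only when enough relevant context was retrieved."""
    context = select_context(scored_chunks, min_score=min_score)
    if not context:
        return {
            "answer": "I cannot answer this from the provided documents.",
            "sources": [],
        }
    return {"answer": generate(question, context), "sources": context}

def fake_generate(question, context):
    """Placeholder for an LLM call, used only to keep the example runnable."""
    return f"Answer based on {len(context)} chunk(s)."

print(answer_or_refuse("q", [("chunk A", 0.82), ("chunk B", 0.11)], fake_generate))
print(answer_or_refuse("q", [("chunk B", 0.11)], fake_generate))
```

A suitable threshold depends on the embedding model and the score scale, so it is best chosen by inspecting scores for queries with known answers.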

Implementing these optimizations often involves more complex code and deeper experimentation, but they are crucial for building robust and reliable RAG systems for production environments.

## 7. Conclusion and Further Resources

Congratulations on completing this tutorial on Retrieval-Augmented Generation! We have journeyed from the basics of data ingestion and text chunking to building a functional RAG pipeline using both manual steps with FAISS and streamlined approaches with LangChain.

**Key Takeaways:**

*   **RAG Architecture:** You now understand the core components of a RAG system: a retriever that fetches relevant information from a knowledge base, and a generator (LLM) that uses this information to produce an answer.
*   **Importance of Chunking:** Effective chunking is vital for providing LLMs with digestible and contextually relevant pieces of information.
*   **Embeddings and Vector Stores:** Converting text to embeddings and using vector stores like FAISS are fundamental for efficient semantic search.
*   **LangChain for RAG:** LangChain significantly simplifies the development of RAG pipelines by providing modular components for document loading, splitting, embedding, vector storage, retrieval, and LLM integration.
*   **Iterative Process:** Building and optimizing a RAG system is an iterative process involving experimentation with different components, parameters, and prompting strategies.

**Where to Go From Here?**

The field of RAG is rapidly evolving. Here are some resources and directions for further exploration:

*   **LangChain Documentation:** The official LangChain Python documentation ([https://python.langchain.com/](https://python.langchain.com/)) is an invaluable resource for exploring more advanced features, different integrations (LLMs, vector stores, tools), and example use cases.
*   **Hugging Face:** Explore the Hugging Face Hub ([https://huggingface.co/models](https://huggingface.co/models)) for a vast collection of pre-trained embedding models and LLMs. The `transformers` library documentation is also essential.
*   **Sentence Transformers Library:** Dive deeper into the `sentence-transformers` library ([https://www.sbert.net/](https://www.sbert.net/)) for advanced embedding techniques and model choices.
*   **FAISS Documentation:** For large-scale applications, understanding the different index types and optimization strategies in FAISS ([https://faiss.ai/](https://faiss.ai/)) can be very beneficial.
*   **Advanced RAG Techniques:** Research topics like:
    *   **Hybrid Search:** Combining keyword and semantic search.
    *   **Re-ranking Models:** Improving the relevance of retrieved documents.
    *   **Query Transformations:** Techniques like HyDE or query expansion.
    *   **Self-Correcting/Reflective RAG:** Systems that can iteratively refine their retrieval or generation process.
    *   **Multi-Modal RAG:** Extending RAG to handle images, audio, or other data types in addition to text.
*   **Evaluation Frameworks:** Look into tools like RAGAs ([https://github.com/explodinggradients/ragas](https://github.com/explodinggradients/ragas)) for systematically evaluating your RAG pipelines.
*   **Research Papers:** Stay updated with the latest research in RAG by following conferences like NeurIPS, ICML, ACL, EMNLP, and looking for papers on arXiv.

Building robust RAG systems requires a blend of understanding NLP concepts, practical engineering skills, and iterative experimentation. We hope this tutorial has provided you with a strong foundation to build upon. Happy coding!