# Simple RAG (Retrieval-Augmented Generation) Implementation with LangChain 1.0+

## Overview
This notebook demonstrates a complete RAG pipeline using **LangChain 1.0+ with LCEL** (LangChain Expression Language).

### What is RAG?
RAG combines retrieval of relevant documents with generation from a Large Language Model (LLM):
1. **Retrieval**: Find relevant information from a knowledge base
2. **Augmentation**: Add retrieved context to the prompt
3. **Generation**: LLM generates answers based on the context
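
The three steps above can be sketched without any libraries. This is only a toy illustration: the knowledge base, the word-overlap scoring, and the `generate` stub are all assumptions made for this example, and a real pipeline would use embeddings and an actual LLM call instead.

```python
import re
from collections import Counter

# Hypothetical three-entry knowledge base (made up for this sketch)
knowledge_base = [
    "FAISS is a library for fast similarity search over vectors.",
    "The Transformer architecture relies entirely on attention mechanisms.",
    "PyPDFLoader extracts text from PDF files page by page.",
]

def tokenize(text):
    """Lowercase the text and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(query, docs, k=1):
    """Retrieval: rank documents by how many query words they share."""
    query_words = Counter(tokenize(query))

    def overlap(doc):
        return sum((Counter(tokenize(doc)) & query_words).values())

    return sorted(docs, key=overlap, reverse=True)[:k]

def augment(query, context_docs):
    """Augmentation: place the retrieved text in the prompt next to the question."""
    context = "\n".join(context_docs)
    return f"Context: {context}\n\nQuestion: {query}"

def generate(prompt):
    """Generation: placeholder for an LLM call; it only reports the prompt size."""
    return f"(an LLM would answer here, given a {len(prompt)}-character prompt)"

query = "What does the Transformer rely on?"
top_docs = retrieve(query, knowledge_base)
prompt = augment(query, top_docs)
answer = generate(prompt)
print(prompt)
print(answer)
```

Word overlap is used here only because it needs no external service; the rest of this notebook replaces it with embedding-based similarity.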

### Pipeline Flow:
```
PDF Documents → Load → Split into Chunks → Create Embeddings → Store in Vector DB
                                                                        ↓
User Query → Retrieve Similar Chunks → Combine with Query → LLM → Answer
```

### Components Used:
- **Document Loader**: PyPDFLoader (for PDF processing)
- **Text Splitter**: RecursiveCharacterTextSplitter (smart chunking)
- **Embeddings**: OpenAI text-embedding-3-small (vector representations)
- **Vector Store**: FAISS (fast similarity search)
- **LLM**: OpenAI GPT-4-Turbo or GPT-3.5-Turbo
- **Chain Builder**: LCEL (LangChain Expression Language)

### LangChain 1.0+ Features:
- ✅ Modern LCEL syntax with pipe operator `|`
- ✅ More readable and composable chains
- ✅ Better streaming support
- ✅ Type-safe operations

---

## 1. Installation & Setup

First, install all required packages. Make sure you have Python 3.10+ installed (LangChain 1.0 no longer supports Python 3.9).

In [None]:
# Install required packages
# Uncomment and run ONE of the following options:

# Option 1: Install from requirements.txt (RECOMMENDED)
# !pip install -r requirements.txt

# Option 2: Install all packages individually
# !pip install langchain langchain-core langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf python-dotenv tiktoken jupyter notebook

# Option 3: Quick install (if you're having import issues)
# !pip install --upgrade langchain langchain-core langchain-openai langchain-community langchain-text-splitters

## 2. Import Required Libraries

Import all necessary modules with explanations of what each does.

### Verify Installation

If you encounter import errors, run this cell first to check package versions:

In [None]:
#!pip install langchain langchain-core langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf python-dotenv tiktoken

In [1]:
# Check installed package versions
import sys
from importlib.metadata import version

try:
    import langchain
    print(f"✓ langchain: {langchain.__version__}")
except ImportError:
    print("✗ langchain not installed")

try:
    import langchain_core
    print(f"✓ langchain-core: {langchain_core.__version__}")
except ImportError:
    print("✗ langchain-core not installed - REQUIRED!")
    print("  Run: pip install langchain-core")

try:
    import langchain_openai
    print(f"✓ langchain-openai: {version('langchain-openai')}")
except ImportError:
    print("✗ langchain-openai not installed")

try:
    import langchain_community
    print(f"✓ langchain-community: {langchain_community.__version__}")
except ImportError:
    print("✗ langchain-community not installed")

print(f"\nPython version: {sys.version}")
print("\nIf any packages are missing, run:")
print("pip install langchain langchain-core langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf python-dotenv tiktoken")

✓ langchain: 1.0.5
✓ langchain-core: 1.0.5
✓ langchain-openai: 1.0.3
✓ langchain-community: 0.4.1

Python version: 3.13.9 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 19:09:58) [MSC v.1929 64 bit (AMD64)]

If any packages are missing, run:
pip install langchain langchain-core langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf python-dotenv tiktoken


In [2]:
# Standard library imports
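import os
from pathlib import Path

# Workaround for the "OMP: Error #15" kernel crash that occurs when more than one
# OpenMP runtime is loaded alongside FAISS. The OpenMP error message itself calls this
# setting unsafe and unsupported, so prefer a single OpenMP runtime where possible.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"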

# Environment variable management - for secure API key handling
from dotenv import load_dotenv

# LangChain Document Loaders - for loading PDF documents
from langchain_community.document_loaders import PyPDFLoader

# LangChain Text Splitters - for breaking documents into manageable chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

# OpenAI Integration - for embeddings and LLM
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

# Vector Store - FAISS for efficient similarity search
from langchain_community.vectorstores import FAISS

# LangChain Core Components
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print("✓ All imports successful!")
print("✓ Compatible with LangChain 1.0+")

✓ All imports successful!
✓ Compatible with LangChain 1.0+


## 3. Environment Configuration

### Setting up OpenAI API Key

You have two options:
1. **Recommended**: Create a `.env` file with `OPENAI_API_KEY=your_key_here`
2. **Alternative**: Set it directly in code (not recommended for production)

Get your API key from: https://platform.openai.com/api-keys

In [3]:
# Load environment variables from .env file
load_dotenv()

# Verify API key is loaded
if not os.getenv("OPENAI_API_KEY"):
    print("⚠️  WARNING: OPENAI_API_KEY not found!")
    print("Please set it in .env file or uncomment the line below:")
    # os.environ["OPENAI_API_KEY"] = "your_api_key_here"
else:
    print("✓ OpenAI API Key loaded successfully!")
    print(f"✓ Key starts with: {os.getenv('OPENAI_API_KEY')[:8]}...")

✓ OpenAI API Key loaded successfully!
✓ Key starts with: sk-proj-...


## 4. Document Loading

### Loading PDF Documents

PyPDFLoader extracts text from PDF files page by page. Each page becomes a separate document with metadata (page number, source file).

**How it works:**
- Reads PDF files and extracts text content
- Preserves page numbers for source tracking
- Returns Document objects with `.page_content` and `.metadata`

**Note**: Update the `pdf_path` variable to point to your PDF file(s).
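
The metadata attached to each page is what makes source tracking possible later in the pipeline. The sketch below uses a hypothetical `Page` dataclass as a stand-in for a loaded document (it is not LangChain's class) and a hypothetical `cite` helper; the metadata keys `source`, `page`, and `page_label` are the ones visible in this notebook's outputs, where `page` is a 0-based index.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical stand-in with the same two fields as a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def cite(page):
    """Build a human-readable source reference from the page metadata."""
    meta = page.metadata
    # 'page' is a 0-based index; 'page_label' is the printed page number when present
    label = meta.get("page_label") or str(meta.get("page", 0) + 1)
    return f"{meta.get('source', 'unknown')}, p. {label}"

page = Page(
    page_content="Attention Visualizations ...",
    metadata={"source": "attention.pdf", "page": 12, "page_label": "13"},
)
print(cite(page))
```

A helper like this can be applied to retrieved chunks as well, since splitting preserves each page's metadata.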

In [4]:
# ===== CONFIGURATION: Update this path to your PDF file =====
pdf_path = "attention.pdf"  # Change this to your PDF file path
# =============================================================

# Check if file exists
if not os.path.exists(pdf_path):
    print(f"⚠️  ERROR: File '{pdf_path}' not found!")
    print("Please update the pdf_path variable with your PDF file location.")
else:
    # Initialize the PDF loader
    loader = PyPDFLoader(pdf_path)
    
    # Load all pages from the PDF
    # Each page becomes a separate Document object
    documents = loader.load()
    
    # Display information about loaded documents
    print(f"✓ Loaded {len(documents)} pages from '{pdf_path}'")
    print(f"\n--- First Document Preview ---")
    print(f"Content (first 500 chars): {documents[0].page_content[:500]}...")
    print(f"\nMetadata: {documents[0].metadata}")
    print(f"\nTotal characters across all pages: {sum(len(doc.page_content) for doc in documents):,}")

✓ Loaded 15 pages from 'attention.pdf'

--- First Document Preview ---
Content (first 500 chars): Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz ...

Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'tota

### Loading Multiple PDFs (Optional)

If you have multiple PDF files, you can load them all at once:

In [5]:
# Example: Loading multiple PDFs from a directory
# Set pdf_directory to your folder; if it exists, its pages replace `documents` from the previous cell

pdf_directory = "./pdfs"  # Directory containing your PDFs
all_documents = []

if os.path.exists(pdf_directory):
    pdf_files = list(Path(pdf_directory).glob("*.pdf"))
    print(f"Found {len(pdf_files)} PDF files")
    
    for pdf_file in pdf_files:
        loader = PyPDFLoader(str(pdf_file))
        docs = loader.load()
        all_documents.extend(docs)
        print(f"  ✓ Loaded {len(docs)} pages from {pdf_file.name}")
    
    print(f"\nTotal pages loaded: {len(all_documents)}")
    documents = all_documents  # Use this for the rest of the pipeline

Found 2 PDF files
  ✓ Loaded 19 pages from rag.pdf
  ✓ Loaded 21 pages from ragsurvey.pdf

Total pages loaded: 40


## 5. Text Splitting

### Why Split Documents?
- LLMs have token limits (e.g., 4K, 8K, 128K tokens)
- Smaller chunks = more precise retrieval
- Balance: chunks must be large enough to contain meaningful context but small enough to be specific

### RecursiveCharacterTextSplitter
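This splitter tries to keep related text together by recursively splitting on a list of separators, moving to the next one only when a piece is still too large:
1. Paragraphs (`\n\n`)
2. Lines (`\n`)
3. Words (` `)
4. Characters (`""`, as a last resort)

Sentence boundaries (`. `) are not in this list; add them to `separators` if you want sentence-aware splits.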
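
The recursive fallback through separators can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the `recursive_split` function is hypothetical, and unlike `RecursiveCharacterTextSplitter` it does not merge small pieces back together or add overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so that every returned piece is at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty separator): cut at fixed positions
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    current, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(current):
        # Parts that are still too long fall through to the next, finer separator
        pieces.extend(recursive_split(part, chunk_size, remaining))
    return pieces

sample = "aaaa bbbb\n\ncccc"
pieces = recursive_split(sample, chunk_size=6)
print(pieces)

hard_cut = recursive_split("abcdefghij", chunk_size=4)
print(hard_cut)
```

The first call splits on the paragraph break and then on the space; the second contains no separators at all, so it ends in the character-level cut.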

**Parameters:**
- `chunk_size=1024`: Target size for each chunk (in characters)
- `chunk_overlap=128`: Overlap between chunks to maintain context continuity
- Overlap prevents important information from being split across chunks
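
To make the effect of the two parameters concrete, here is a minimal sliding-window sketch. It assumes plain fixed-size windows, which is simpler than what the recursive splitter does, and the `chunk_with_overlap` function is a hypothetical helper written for this example.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

windows = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for window in windows:
    print(window)
```

Each window starts with the last three characters of the previous one, so text near a boundary appears in both neighbouring chunks.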

In [6]:
# Initialize the text splitter with recommended settings
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,        # Maximum characters per chunk (roughly 200-250 tokens)
    chunk_overlap=128,      # Characters overlap between chunks (maintains context)
    length_function=len,    # Function to measure chunk length
    separators=["\n\n", "\n", " ", ""]  # Try to split on paragraphs first, then lines, etc.
)

# Split the documents into chunks
# This creates smaller, manageable pieces while preserving semantic meaning
chunks = text_splitter.split_documents(documents)

# Display splitting results
print(f"✓ Split {len(documents)} documents into {len(chunks)} chunks")
print(f"\nAverage chunk size: {sum(len(chunk.page_content) for chunk in chunks) / len(chunks):.0f} characters")

# Preview a few chunks
print(f"\n--- Chunk Examples ---")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} (length: {len(chunk.page_content)} chars):")
    print(f"{chunk.page_content[:200]}...")
    print(f"Metadata: {chunk.metadata}")

✓ Split 40 documents into 218 chunks

Average chunk size: 904 characters

--- Chunk Examples ---

Chunk 1 (length: 1008 chars):
Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks
Patrick Lewis†‡, Ethan Perez⋆,
Aleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,
Mike Lewis†, W...
Metadata: {'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-04-13T00:48:38+00:00', 'author': '', 'keywords': '', 'moddate': '2021-04-13T00:48:38+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'pdfs\\rag.pdf', 'total_pages': 19, 'page': 0, 'page_label': '1'}

Chunk 2 (length: 1023 chars):
memory have so far been only investigated for extractive downstream tasks. We
explore a general-purpose fine-tuning recipe for retrieval-augmented generation
(RAG) — models which combine pre

## 6. Creating Embeddings

### What are Embeddings?
Embeddings are vector representations of text that capture semantic meaning. Similar texts have similar vectors.

**Example**: 
- "dog" and "puppy" → similar vectors (close in vector space)
- "dog" and "spaceship" → different vectors (far apart)
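
The idea of "close" and "far apart" can be illustrated with cosine similarity. The three-dimensional vectors below are invented for this sketch; real embeddings come from the model and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, NOT real model output
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.9, 0.15],
    "spaceship": [0.1, 0.05, 0.95],
}

sim_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_ship = cosine_similarity(toy_vectors["dog"], toy_vectors["spaceship"])
print(f"dog vs puppy:     {sim_puppy:.3f}")
print(f"dog vs spaceship: {sim_ship:.3f}")
```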

### OpenAI text-embedding-3-small
- **Dimensions**: 1536 (each text becomes a 1536-dimensional vector)
- **Cost**: $0.00002 per 1,000 tokens (very affordable)
- **Performance**: 62.3% on MTEB benchmark
- **Speed**: Fast and efficient

**Alternative**: `text-embedding-3-large` for higher quality (64.6% MTEB) at higher cost
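
As a rough worked example of the pricing above, the cost of embedding the chunks from the splitting step can be estimated from character counts. The figure of about 4 characters per token is an assumption (a common rule of thumb for English text), and the `estimate_embedding_cost` helper is hypothetical; actual billing is based on the model's real token counts.

```python
PRICE_PER_1K_TOKENS = 0.00002   # USD, the text-embedding-3-small price quoted above
CHARS_PER_TOKEN = 4             # assumption: rough average for English text

def estimate_embedding_cost(num_chunks, avg_chars_per_chunk):
    """Estimate the embedding cost in USD from chunk count and average size."""
    total_chars = num_chunks * avg_chars_per_chunk
    estimated_tokens = total_chars / CHARS_PER_TOKEN
    return estimated_tokens / 1000 * PRICE_PER_1K_TOKENS

# 218 chunks averaging 904 characters, as reported by the splitting cell
cost = estimate_embedding_cost(num_chunks=218, avg_chars_per_chunk=904)
print(f"Estimated cost: ${cost:.6f}")
```

Under these assumptions the whole index costs roughly a tenth of a cent to embed.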

In [7]:
# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # Latest, cost-effective embedding model
    #dimensions=128
    # Alternative: "text-embedding-3-large" for better quality
)

# Test the embeddings with a sample text
sample_text = "This is a test sentence to demonstrate embeddings."

from tenacity import retry, wait_random_exponential, stop_after_attempt

# Retry with exponential backoff to ride out transient rate limits
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def embed(text):
    return embeddings.embed_query(text)

sample_embedding = embed(sample_text)

print("✓ Embeddings model initialized: text-embedding-3-small")
print(f"✓ Embedding dimension: {len(sample_embedding)}")
print(f"✓ Sample embedding (first 10 values): {sample_embedding[:10]}")
print(f"\nℹ️  Each chunk will be converted to a {len(sample_embedding)}-dimensional vector for similarity search")


### Google Gemini Embeddings (Optional)

In [None]:
# Below is the code for Gemini Embeddings:

# from langchain_google_genai import GoogleGenerativeAIEmbeddings

# embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# # Test the embeddings with a sample text
# sample_text = "This is a test sentence to demonstrate embeddings."
# sample_embedding = embeddings.embed_query(sample_text)

# print("✓ Embeddings model initialized: models/embedding-001")
# print(f"✓ Embedding dimension: {len(sample_embedding)}")
# print(f"✓ Sample embedding (first 10 values): {sample_embedding[:10]}")
# print(f"✓ Each chunk will be converted to a {len(sample_embedding)}-dimensional vector for similarity search")

## 7. Creating Vector Store (FAISS)

### What is a Vector Store?
A vector store (or vector database) stores embeddings and enables fast similarity search.

### FAISS (Facebook AI Similarity Search)
- **Fast**: Optimized for billion-scale vector search
- **Local**: Runs on your machine, no cloud dependency
- **Efficient**: Uses advanced indexing algorithms

### How Similarity Search Works:
1. Convert query to embedding vector
2. Find the vectors in the database most similar to the query vector (LangChain's FAISS wrapper defaults to Euclidean (L2) distance; cosine similarity is a common alternative)
3. Return the corresponding text chunks
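
The search steps above can be sketched as a brute-force scan. This is only an illustration: the two-dimensional vectors and chunk labels are made up, the `top_k` helper is hypothetical, and FAISS uses optimized index structures rather than a Python loop.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vector, index, k=2):
    """Return the k (distance, text) pairs closest to the query vector."""
    scored = [(l2_distance(query_vector, vector), text) for text, vector in index]
    return sorted(scored)[:k]

# Made-up (text, vector) pairs standing in for embedded chunks
toy_index = [
    ("chunk about attention heads", [0.9, 0.1]),
    ("chunk about training data", [0.2, 0.8]),
    ("chunk about self-attention", [0.8, 0.2]),
]

query_vector = [1.0, 0.0]   # step 1: the embedded query (invented here)
results = top_k(query_vector, toy_index, k=2)   # step 2: nearest vectors
for distance, text in results:                  # step 3: corresponding chunks
    print(f"{distance:.3f}  {text}")
```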

**This cell will:**
1. Convert all chunks to embeddings (may take a minute for large documents)
2. Build a FAISS index
3. Save to disk for future use

### FAISS-CPU Version Compatibility Issue

If you encounter issues with `faiss-cpu` installation, try:

```bash
uv pip uninstall faiss-cpu
uv pip install faiss-cpu==1.12.0
```

Or for conda users:
```bash
conda install -c conda-forge faiss-cpu==1.12.0
```

In [22]:
!pip show faiss-cpu

Name: faiss-cpu
Version: 1.12.0
Summary: A library for efficient similarity search and clustering of dense vectors.
Home-page: 
Author: 
Author-email: Kota Yamaguchi <yamaguchi_kota@cyberagent.co.jp>
License-Expression: MIT AND BSD-3-Clause
Location: C:\Users\exact\anaconda3\envs\myenv\Lib\site-packages
Requires: numpy, packaging
Required-by: 


In [17]:
# Create FAISS vector store from document chunks
# This step converts each chunk to an embedding and stores it
print(f"Creating FAISS index from {len(chunks)} chunks...")
print("This may take a minute depending on the number of chunks...")

vectorstore = FAISS.from_documents(
    documents=chunks,      # Our split document chunks
    embedding=embeddings   # OpenAI embedding model
)

print(f"✓ FAISS vector store created successfully!")
print(f"✓ Indexed {len(chunks)} document chunks")

# Save the vector store to disk for later use
# This allows you to reload the index without re-processing documents
vectorstore_path = "./faiss_index"
vectorstore.save_local(vectorstore_path)
print(f"✓ Vector store saved to '{vectorstore_path}'")
print(f"\nℹ️  You can reload this index later using: FAISS.load_local('{vectorstore_path}', embeddings, allow_dangerous_deserialization=True)")

Creating FAISS index from 402 chunks...
This may take a minute depending on the number of chunks...


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [16]:
import time

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Retry each batch with exponential backoff.
# Note: retries help with transient rate limits, not with an exhausted quota (insufficient_quota).
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def get_embeddings(texts):
    return embeddings.embed_documents(texts)

# Batch size (adjust to your rate limits; smaller is safer)
BATCH_SIZE = 50

texts = [c.page_content for c in chunks]
all_vectors = []
for i in range(0, len(texts), BATCH_SIZE):
    batch = texts[i:i + BATCH_SIZE]
    print(f"Embedding batch {i} → {i + len(batch)} ...")
    all_vectors.extend(get_embeddings(batch))
    time.sleep(0.5)  # cool down slightly

# Build the index from precomputed (text, vector) pairs
vectorstore = FAISS.from_embeddings(
    text_embeddings=list(zip(texts, all_vectors)),
    embedding=embeddings,
    metadatas=[c.metadata for c in chunks],
)
vectorstore.save_local("./faiss_index")

print("✓ Vector store created & saved successfully!")


Embedding batch 0 → 50 ...


RetryError: RetryError[<Future at 0x21c3becdf40 state=finished raised RateLimitError>]

### ChromaDB Vector Store (Optional)

In [7]:
# ChromaDB has better Python 3.13 support. To use it instead of FAISS, replace the FAISS cells above with:

# from langchain_community.vectorstores import Chroma

# # Create ChromaDB vector store
# print(f"Creating ChromaDB from {len(chunks)} chunks...")
# vectorstore = Chroma.from_documents(
#     documents=chunks,
#     embedding=embeddings,
#     persist_directory="./chroma_db"
# )
# print("✓ ChromaDB vector store created!")

Creating ChromaDB from 49 chunks...
✓ ChromaDB vector store created!


### Loading a Saved Vector Store (Optional)

If you've already created a vector store, you can load it instead of recreating:

In [None]:
# Uncomment to load an existing vector store instead of creating a new one
# vectorstore_path = "./faiss_index"
# vectorstore = FAISS.load_local(
#     vectorstore_path,
#     embeddings,
#     allow_dangerous_deserialization=True  # Required for loading pickled data
# )
# print(f"✓ Loaded existing vector store from '{vectorstore_path}'")

## 8. Creating a Retriever

In [20]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",    # Nearest-neighbor search (L2 distance by default in FAISS)
    search_kwargs={"k": 4}        # Retrieve top 4 most relevant chunks
)

print("✓ Retriever configured successfully")
print(f"  - Search type: similarity")
print(f"  - Number of documents to retrieve (k): 4")

# Test the retriever with a sample query
# Note: In LangChain 1.0+, use .invoke() instead of .get_relevant_documents()
test_query = "What is the main topic of this document?"
retrieved_docs = retriever.invoke(test_query)  # LangChain 1.0+ method

print(f"\n--- Retriever Test ---")
print(f"Query: '{test_query}'")
print(f"Retrieved {len(retrieved_docs)} documents:")

for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(f"  Content preview: {doc.page_content[:150]}...")
    print(f"  Metadata: {doc.metadata}")

NameError: name 'vectorstore' is not defined

## 9. Configuring the Language Model (LLM)

### LLM Selection
The LLM generates the final answer based on retrieved context.

### Available Models:
1. **gpt-4-turbo-2024-04-09**: Most capable, best quality, slower, more expensive
2. **gpt-4o**: Fast GPT-4 level performance, good balance
3. **gpt-3.5-turbo**: Fast and cheap, good for simpler queries

### Temperature:
- **0**: Deterministic, focused answers (recommended for factual Q&A)
- **0.7**: More creative, varied responses
- **1.0**: Most creative, less predictable

### Max Tokens:
Controls the maximum length of the generated response.
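
The effect of temperature is usually described as rescaling token scores before sampling. The sketch below shows that textbook formulation with made-up scores; how the OpenAI API applies temperature internally is not shown in this notebook, so treat this as an illustration of the concept only. A temperature of 0 is conventionally treated as always picking the top-scoring token, so the hypothetical function below requires a positive value.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.5]   # made-up scores for three candidate tokens
sharp = softmax_with_temperature(toy_logits, 0.1)
flat = softmax_with_temperature(toy_logits, 1.0)
print("temperature 0.1:", [round(p, 4) for p in sharp])
print("temperature 1.0:", [round(p, 4) for p in flat])
```

At the lower temperature nearly all probability lands on the top candidate, which is why low values suit factual Q&A.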

In [27]:
# Initialize the ChatOpenAI model
llm = ChatOpenAI(
    model="gpt-4-turbo-2024-04-09",  # Choose your model
    # Alternative options:
    # model="gpt-4o",           # Faster GPT-4-level performance, good balance
    # model="gpt-3.5-turbo",    # Faster and cheaper option
    temperature=0,         # 0 = deterministic, factual responses (recommended for Q&A)
    max_tokens=2000,       # Maximum length of response
)

print("✓ LLM configured successfully")
print(f"  - Model: gpt-4-turbo-2024-04-09")
print(f"  - Temperature: 0 (deterministic)")
print(f"  - Max tokens: 2000")

# Test the LLM with a simple query
test_response = llm.invoke("Say 'Hello, I am ready to answer questions!'")
print(f"\nLLM Test Response: {test_response.content}")

#   📝 Explanation of Parameters:

#   Model Selection:

#   # Option 1: Best quality (slower, more expensive)
#   llm = ChatOpenAI(model="gpt-4-turbo-2024-04-09")

#   # Option 2: Fast GPT-4 performance (balanced)
#   llm = ChatOpenAI(model="gpt-4o")

#   # Option 3: Fast and cheap (good for testing)
#   llm = ChatOpenAI(model="gpt-3.5-turbo")

#   Temperature:

#   temperature=0    # Deterministic, focused (best for factual Q&A)
#   temperature=0.7  # More creative, varied responses
#   temperature=1.0  # Most creative, less predictable

#   Max Tokens:

#   max_tokens=2000  # Controls maximum response length

✓ LLM configured successfully
  - Model: gpt-4-turbo-2024-04-09
  - Temperature: 0 (deterministic)
  - Max tokens: 2000

LLM Test Response: Hello, I am ready to answer questions!


## 10. Creating the RAG Chain (LangChain 1.0+ LCEL)

### What is a RAG Chain?
The RAG chain combines retrieval and generation into a single workflow:
1. User asks a question
2. Retriever finds relevant documents
3. Documents are formatted as context
4. LLM generates answer using the context

### LangChain 1.0+ LCEL (LangChain Expression Language)
LangChain 1.0+ uses LCEL, a declarative way to build chains using the pipe operator `|`.

**Benefits:**
- More intuitive and readable
- Better streaming support
- Easier to debug and modify
- Type-safe and composable

**Components:**
- **RunnablePassthrough**: Passes input through unchanged
- **Pipe operator (|)**: Chains components together
- **StrOutputParser**: Converts LLM output to string
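
To show how pipe-style composition works in principle, here is a library-free sketch. The `Step` class and `fan_out` helper are hypothetical stand-ins written for this example, not LangChain's implementation, and the retriever is a stub that returns a fixed sentence.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


def fan_out(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


passthrough = Step(lambda value: value)   # passes the input through unchanged
fake_retriever = Step(lambda question: ["Attention relates positions of a sequence."])
format_docs = Step(lambda docs: "\n\n".join(docs))
prompt = Step(lambda d: f"Context: {d['context']}\n\nQuestion: {d['question']}")
to_string = Step(str)                     # plays the role of an output parser

chain = fan_out(context=fake_retriever | format_docs, question=passthrough) | prompt | to_string
result = chain.invoke("What is attention?")
print(result)
```

The question reaches both branches of `fan_out`: one turns it into context, the other passes it along untouched, which mirrors the role `RunnablePassthrough` plays in the chain built later in this section.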

In [28]:
# Step 1: Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",    # Nearest-neighbor search (L2 distance by default in FAISS)
    search_kwargs={"k": 4}       # Retrieve top 4 most relevant chunks
)

print("✓ Retriever configured successfully")
print(f"  - Search type: similarity")
print(f"  - Number of documents to retrieve (k): 4")

# Test the retriever with a sample query
test_query = "What is the main topic of this document?"
retrieved_docs = retriever.invoke(test_query)

print(f"\n--- Retriever Test ---")
print(f"Query: '{test_query}'")
print(f"Retrieved {len(retrieved_docs)} documents:")

for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(f"  Content preview: {doc.page_content[:150]}...")
    print(f"  Metadata: {doc.metadata}")

✓ Retriever configured successfully
  - Search type: similarity
  - Number of documents to retrieve (k): 4

--- Retriever Test ---
Query: 'What is the main topic of this document?'
Retrieved 4 documents:

Document 1:
  Content preview: Attention Visualizations
Input-Input Layer5
It
is
in
this
spirit
that
a
majority
of
American
governments
have
passed
new
laws
since
2009
making
the
re...
  Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 12, 'page_label': '13'}

Document 2:
  Content preview: Input-Input Layer5
The
Law
will
never
be
perfect
,
but
its
application
should
be
just
-
this
is
what
we
are
missing
,
in
my
opinion
.
<EOS>
<pad>
Th

In [14]:
# Define the prompt template for the RAG system
# This tells the LLM how to use the retrieved context
system_prompt = (
    "You are a helpful assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer based on the context, say that you don't know. "
    "Keep the answer concise and accurate.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)

# Create the prompt template
prompt = ChatPromptTemplate.from_template(system_prompt)

# Helper function to format documents
def format_docs(docs):
    """Format retrieved documents into a single string."""
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain using LangChain 1.0+ LCEL (LangChain Expression Language)
# This uses the pipe operator (|) to chain components together
rag_chain = (
    {
        "context": retriever | format_docs,  # Retrieve docs and format them
        "question": RunnablePassthrough()      # Pass through the question
    }
    | prompt           # Format with prompt template
    | llm              # Generate answer with LLM
    | StrOutputParser() # Parse output to string
)

print("✓ RAG chain created successfully using LangChain 1.0+ LCEL!")
print("\nRAG Pipeline Flow:")
print("  1. User provides a query")
print("  2. Retriever finds top 4 relevant chunks")
print("  3. Chunks are formatted as context")
print("  4. Context + question are formatted with prompt template")
print("  5. LLM generates answer based on context")
print("  6. Answer is parsed and returned to user")

✓ RAG chain created successfully using LangChain 1.0+ LCEL!

RAG Pipeline Flow:
  1. User provides a query
  2. Retriever finds top 4 relevant chunks
  3. Chunks are formatted as context
  4. Context + question are formatted with prompt template
  5. LLM generates answer based on context
  6. Answer is parsed and returned to user


In [29]:
# Example Query 1: General question about the document
query1 = "What is the main topic or subject of this document?"

print(f"Query: {query1}")
print("\nProcessing...\n")

# With LangChain 1.0+, we invoke the chain with the question directly
answer = rag_chain.invoke(query1)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(answer)
print("\n" + "=" * 80)

# To see which documents were retrieved, we can call the retriever separately
print("\nSOURCE DOCUMENTS USED:")
print("=" * 80)
retrieved_docs = retriever.invoke(query1)
for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(f"  Source: {doc.metadata}")
    print(f"  Content: {doc.page_content[:200]}...")
    print("-" * 80)

Query: What is the main topic or subject of this document?

Processing...

ANSWER:
The main topic of the document is the introduction and explanation of the Transformer network architecture, which is based solely on attention mechanisms and does not use recurrence or convolutions. This architecture is used in sequence transduction models that involve an encoder and a decoder connected through an attention mechanism.


SOURCE DOCUMENTS USED:

Document 1:
  Source: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 12, 'page_label': '13'}
  Content: Attention Visualizations
Input-Input Layer5
It
is
in
this
spirit
that
a
majority
of
American
governments


In [16]:
# Example Query 2: Summarization
query2 = "Can you summarize the key points from this document?"

print(f"Query: {query2}")
print("\nProcessing...\n")

answer = rag_chain.invoke(query2)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(answer)
print("\n" + "=" * 80)

Query: Can you summarize the key points from this document?

Processing...

ANSWER:
The document discusses the imperfections of the law and emphasizes that while the law itself may not be perfect, its application should be just. It also mentions that since 2009, a majority of American governments have passed laws that make the registration or voting process more difficult. Additionally, the document includes technical details about attention mechanisms in neural networks, specifically focusing on how attention heads in layer 5 of a 6-layer encoder model are involved in tasks like anaphora resolution and tracking long-distance dependencies in sentences. The document is part of a larger work related to the Transformer model, a network architecture that relies solely on attention mechanisms, eliminating the need for recurrence and convolutions in sequence transduction models.



### Try Your Own Query

In [17]:
# Example Query 3: Your custom question
# Replace this with your own question!
custom_query = "What specific details are mentioned about attention mechanisms?"

print(f"Query: {custom_query}")
print("\nProcessing...\n")

answer = rag_chain.invoke(custom_query)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(answer)
print("\n" + "=" * 80)

Query: What specific details are mentioned about attention mechanisms?

Processing...

ANSWER:
The provided context mentions several specific details about attention mechanisms:

1. **Self-attention**: This is an attention mechanism that relates different positions of a single sequence to compute a representation of the sequence. It has been used successfully in tasks like reading comprehension, abstractive summarization, textual entailment, and learning task-independent sentence representations.

2. **Multi-Head Attention**: This is used to counteract the averaging of attention-weighted positions. It involves multiple attention heads which can attend to different parts of the sequence simultaneously, allowing for a richer representation.

3. **End-to-end memory networks**: These use a recurrent attention mechanism instead of sequence-aligned recurrence and have shown good performance in simple-language question answering and language modeling tasks.

4. **Transformer model**: This mod

In [18]:
# Example Query 4: Another custom question
# Replace this with your own question!
custom_query = "What are the applications of Attention Mechanism?"

print(f"Query: {custom_query}")
print("\nProcessing...\n")

response3 = rag_chain.invoke(custom_query)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(response3)
print("\n" + "=" * 80)

Query: What are the applications of Attention Mechanism?

Processing...

ANSWER:
The applications of the attention mechanism include reading comprehension, abstractive summarization, textual entailment, learning task-independent sentence representations, simple-language question answering, and language modeling tasks.



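### FAISS Kernel Crash Issue

If the kernel dies during retrieval with the OpenMP error below, more than one OpenMP runtime has been loaded. The `KMP_DUPLICATE_LIB_OK=TRUE` setting in the imports cell is the workaround the error message itself suggests, while also warning that it is unsafe.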

Testing retrieval with query: 'What is the main topic of this document?'
OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/