# RAG Pipeline: PDF to FAISS Vector Store

This notebook demonstrates how to:
1. Load and chunk a PDF document
2. Create embeddings and store them in FAISS
3. Save the index locally as .faiss and .pkl files
4. Load the saved index and create a retriever for RAG applications

## 1. Install Required Libraries

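Install the dependencies before running the cells below. Judging by the imports in this notebook, you need `langchain`, `langchain-community`, `langchain-openai`, `faiss-cpu` (or `faiss-gpu`), `pypdf`, and `python-dotenv`.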
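
A quick way to see which dependencies are still missing is to probe for each import and print a matching `pip install` command. The sketch below does that. The package list is an assumption inferred from the imports used later in this notebook (`faiss-cpu` is assumed rather than `faiss-gpu`), and `missing_packages` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Import name -> pip package name. Assumed from the imports used in this
# notebook; adjust to your environment (e.g. faiss-gpu instead of faiss-cpu).
REQUIRED = {
    "langchain": "langchain",
    "langchain_community": "langchain-community",
    "langchain_openai": "langchain-openai",
    "faiss": "faiss-cpu",
    "pypdf": "pypdf",
    "dotenv": "python-dotenv",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return pip names of packages whose top-level module cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages. Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```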

## 2. Import Required Libraries

In [1]:
import os
import pickle
from typing import List, Optional
import faiss
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("All libraries imported successfully!")

All libraries imported successfully!


## 3. Configuration

Set your PDF path and output directory here:

In [10]:
# Configuration
PDF_PATH = "../data/dc2523af.pdf"  # Change this to your PDF path
INDEX_SAVE_DIR = "../faiss_index"  # Directory to save the FAISS index
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # HuggingFace model name; currently unused, because section 4 uses OpenAI embeddings

# Text splitting parameters
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

print(f"PDF Path: {PDF_PATH}")
print(f"Index Save Directory: {INDEX_SAVE_DIR}")
print(f"Embedding Model: {EMBEDDING_MODEL}")

PDF Path: ../data/dc2523af.pdf
Index Save Directory: ../faiss_index
Embedding Model: all-MiniLM-L6-v2


In [4]:
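import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load OPENAI_API_KEY from a local .env file, if one exists
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file")

# Chat model for the generation step of a RAG chain (not needed for indexing)
model_name = "gpt-4o"
llm = ChatOpenAI(model=model_name)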

## 4. Initialize Components

In [5]:
# Initialize embeddings
print("🔄 Loading embedding model...")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

print("Components initialized successfully!")

🔄 Loading embedding model...
Components initialized successfully!
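
To build intuition for `CHUNK_SIZE` and `CHUNK_OVERLAP`, the sketch below splits a string into fixed-size character windows that share a few characters with their neighbours. This is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on the listed separators, so its chunks are usually shorter than `chunk_size`. `sliding_chunks` is a hypothetical helper written for illustration only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks_demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks_demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.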


## 5. Load and Process PDF

In [6]:
# Load PDF
print(f"Loading PDF from {PDF_PATH}...")
loader = PyPDFLoader(PDF_PATH)
pages = loader.load()

print(f"Loaded {len(pages)} pages from PDF")
print(f"First page preview (first 200 chars): {pages[0].page_content[:200]}...")

Loading PDF from ../data/dc2523af.pdf...
Loaded 69 pages from PDF
First page preview (first 200 chars): ...


## 6. Split Documents into Chunks

In [7]:
# Split documents into chunks
print("Splitting documents into chunks")
chunks = text_splitter.split_documents(pages)

print(f"Created {len(chunks)} chunks")
print(f"First chunk preview: {chunks[0].page_content[:200]}...")
print(f"Average chunk length: {sum(len(chunk.page_content) for chunk in chunks) // len(chunks)} characters")

Splitting documents into chunks
Created 130 chunks
First chunk preview: 1 Introduction ........................................................................................  5
1.1 Ownership of this document 5
1.2	 API	Definition	and	Overview	 5
1.3 Purpose 6
1.4 Scope ...
Average chunk length: 767 characters
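
The average alone can hide very short or very long chunks, and the preview above shows that the first chunk is table-of-contents text. A fuller summary of chunk lengths could look like the sketch below; `chunk_length_stats` is a hypothetical helper, and the demo strings are placeholders rather than data from the PDF.

```python
from statistics import mean, median


def chunk_length_stats(texts: list[str]) -> dict[str, float]:
    """Summarize the character lengths of a list of chunk texts."""
    if not texts:
        raise ValueError("texts must contain at least one chunk")
    lengths = [len(text) for text in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Placeholder chunks of 100, 900 and 1000 characters
print(chunk_length_stats(["a" * 100, "b" * 900, "c" * 1000]))
```

In this notebook you would call it as `chunk_length_stats([chunk.page_content for chunk in chunks])`.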


## 7. Create FAISS Index

In [8]:
# Create FAISS vectorstore
print("Creating FAISS index")
vectorstore = FAISS.from_documents(chunks, embeddings)

print("FAISS index created successfully!")
print(f"Index contains {vectorstore.index.ntotal} vectors")

Creating FAISS index


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


FAISS index created successfully!
Index contains 130 vectors
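
Conceptually, a flat (exhaustive) index answers a query by comparing the query vector against every stored vector and returning the closest ones. The sketch below shows that idea in pure Python with toy 3-dimensional vectors. It assumes the index uses L2 distance, which is believed to be the LangChain default for `FAISS.from_documents`; check your installed version. `nearest_neighbors` is a hypothetical helper, not the FAISS API, and real embeddings have far more dimensions.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(query: list[float], vectors: list[list[float]], k: int):
    """Return (position, squared distance) for the k stored vectors closest to query."""
    scored = ((i, squared_l2(query, vector)) for i, vector in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings": the query is identical to vector 1 and close to vector 2
stored = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(nearest_neighbors([1.0, 0.0, 0.0], stored, k=2))
```

The exhaustive scan is exact but costs one distance computation per stored vector, which is fine for the 130 vectors here.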


## 8. Save Index Locally

In [11]:
# Create directory if it doesn't exist
os.makedirs(INDEX_SAVE_DIR, exist_ok=True)

# Save FAISS index locally
print(f"Saving FAISS index to {INDEX_SAVE_DIR}...")
vectorstore.save_local(INDEX_SAVE_DIR)

print("Index saved successfully!")

# List files in the directory
files = os.listdir(INDEX_SAVE_DIR)
print(f"Files created: {files}")

Saving FAISS index to ../faiss_index...
Index saved successfully!
Files created: ['index.faiss', 'index.pkl']


## 9. Load Saved Index

In [12]:
# Load the saved FAISS index
print(f"Loading FAISS index from {INDEX_SAVE_DIR}")
loaded_vectorstore = FAISS.load_local(
    INDEX_SAVE_DIR, 
    embeddings,
    allow_dangerous_deserialization=True
)

print("Index loaded successfully!")
print(f"Loaded index contains {loaded_vectorstore.index.ntotal} vectors")

Loading FAISS index from ../faiss_index
Index loaded successfully!
Loaded index contains 130 vectors
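
`allow_dangerous_deserialization=True` is required because `index.pkl` is a pickle file, and unpickling a file from an untrusted source can execute arbitrary code. One possible safeguard, sketched below, is to record a SHA-256 digest when you save the index and compare it before loading. `file_sha256` and `digest_matches` are hypothetical helpers, and the check only detects that a file has changed; it does not make pickle itself safe.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def file_sha256(path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def digest_matches(path, expected_hex: str) -> bool:
    """Compare a file's current digest with the one recorded at save time."""
    return hmac.compare_digest(file_sha256(path), expected_hex)


# Demo on a throwaway file standing in for index.pkl
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "index.pkl"
    demo_file.write_bytes(b"demo contents")
    recorded = file_sha256(demo_file)
    print(digest_matches(demo_file, recorded))  # True

    demo_file.write_bytes(b"tampered contents")
    print(digest_matches(demo_file, recorded))  # False
```

In practice you would store the recorded digest somewhere the index files cannot overwrite, and call `FAISS.load_local` only when the check passes.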


## 10. Create Retriever for RAG

In [13]:
# Create retriever
retriever = loaded_vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}  # Number of documents to retrieve
)

print("Retriever created successfully!")
print("Retriever configured to return top 4 most similar documents")

Retriever created successfully!
Retriever configured to return top 4 most similar documents


## 11. Test the Retriever

In [None]:
# Test the retriever with a sample query
test_query = "What is the main topic of this document?"  

print(f"Testing retriever with query: '{test_query}'")
# invoke() supersedes the deprecated get_relevant_documents()
results = retriever.invoke(test_query)

print(f"Retrieved {len(results)} documents")
print("\n" + "="*50)
for i, doc in enumerate(results):
    print(f"\nDocument {i+1}:")
    print(f"Content: {doc.page_content[:300]}...")
    print(f"Metadata: {doc.metadata}")
    print("-" * 30)

Testing retriever with query: 'What is the main topic of this document?'


INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Retrieved 4 documents


Document 1:
Content: contain	no	explicit	technology	or	protocol	restrictions;	rather,	the	document	offers	best	practices-
based	guidelines	that	ensure	that	UAE	digital	government	APIs	are	effective,	designed	correctly,	
secure	and	provide	value.
1.3.	Purpose...
Metadata: {'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign 15.0 (Macintosh)', 'creationdate': '2021-04-29T13:43:36+04:00', 'moddate': '2021-05-02T11:10:50+04:00', 'trapped': '/False', 'source': '../data/dc2523af.pdf', 'total_pages': 69, 'page': 7, 'page_label': '7'}
------------------------------

Document 2:
Content: UAE Government API First Guidlines
4
1.
Ownership
of this document
Introduction
1.1....
Metadata: {'producer': 'Adobe PDF Library 15.0', 'creator': 'Adobe InDesign 15.0 (Macintosh)', 'creationdate': '2021-04-29T13:43:36+04:00', 'moddate': '2021-05-02T11:10:50+04:00', 'trapped': '/False', 'source': '../data/dc2523af.pdf', 'total_pages': 69, 'page': 4, 'page_label': '4'}
-----

## 12. Utility Functions for Reusability

In [None]:
def load_or_create_vectorstore(pdf_path: str, index_dir: str, force_recreate: bool = False):
    """
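    Load an existing vectorstore, or create and save a new one if none exists.

    Relies on the module-level `embeddings` and `text_splitter` defined earlier
    in this notebook.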
    
    Args:
        pdf_path: Path to PDF file
        index_dir: Directory for FAISS index
        force_recreate: Whether to recreate index even if it exists
    
    Returns:
        FAISS vectorstore object
    """
    # Load only when a saved index file is present, not merely a non-empty directory
    if not force_recreate and os.path.exists(os.path.join(index_dir, "index.faiss")):
        print(f"📂 Loading existing index from {index_dir}")
        return FAISS.load_local(
            index_dir, 
            embeddings, 
            allow_dangerous_deserialization=True
        )
    else:
        print(f"🔄 Creating new index from {pdf_path}")
        # Load and process PDF
        loader = PyPDFLoader(pdf_path)
        pages = loader.load()
        chunks = text_splitter.split_documents(pages)
        
        # Create and save vectorstore
        vectorstore = FAISS.from_documents(chunks, embeddings)
        os.makedirs(index_dir, exist_ok=True)
        vectorstore.save_local(index_dir)
        
        return vectorstore

def create_retriever(pdf_path: str, index_dir: str, k: int = 4, force_recreate: bool = False):
    """
    Create a retriever from a PDF, reusing a saved index when one exists.
    
    Args:
        pdf_path: Path to PDF file
        index_dir: Directory for FAISS index
        k: Number of documents to retrieve
        force_recreate: Whether to recreate index
    
    Returns:
        LangChain retriever object
    """
    vectorstore = load_or_create_vectorstore(pdf_path, index_dir, force_recreate)
    return vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": k}
    )

print("✅ Utility functions defined!")

## 13. Example: Using the Utility Functions

In [None]:
# Example of using utility functions
# This will load existing index or create new one if needed
example_retriever = create_retriever(
    pdf_path=PDF_PATH,
    index_dir=INDEX_SAVE_DIR,
    k=3,  # Retrieve top 3 documents
    force_recreate=False  # Set to True to force recreation
)

print("✅ Example retriever created!")

# Test with a different query
test_query_2 = "main concepts"  # Change this for your specific content
results_2 = example_retriever.invoke(test_query_2)  # invoke() supersedes the deprecated get_relevant_documents()

print(f"\n🔍 Query: '{test_query_2}'")
print(f"📄 Found {len(results_2)} relevant documents")
for i, doc in enumerate(results_2):
    print(f"\nDoc {i+1}: {doc.page_content[:150]}...")

## 14. Summary

🎉 **Congratulations!** You have successfully:

1. ✅ Loaded and chunked a PDF document
2. ✅ Created embeddings using OpenAI's `text-embedding-3-large` model
3. ✅ Built a FAISS vector index
4. ✅ Saved the index locally (creates .faiss and .pkl files)
5. ✅ Loaded the saved index
6. ✅ Created a retriever for RAG applications

### Files Created:
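- `index.faiss`: The serialized FAISS index holding the embedding vectors
- `index.pkl`: The pickled document store (chunk text and metadata) plus the mapping from index positions to document IDs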

### Next Steps:
- Integrate this retriever with your RAG chain
- Experiment with different embedding models
- Try different chunk sizes and overlap values
- Add more documents to expand your knowledge base
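
As a starting point for the first item above, a RAG chain needs the retrieved chunks turned into a single context string for the prompt. The sketch below is one way to do that, tagging each chunk with the `source` and `page_label` metadata keys that appear in the retriever output earlier. `format_context` is a hypothetical helper, and the demo uses `SimpleNamespace` stand-ins rather than real LangChain `Document` objects so that it runs without any dependencies.

```python
from types import SimpleNamespace


def format_context(docs, max_chars_per_doc: int = 500) -> str:
    """Join retrieved documents into one prompt-ready string with source tags."""
    blocks = []
    for number, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        # Prefer the printed page label; fall back to the raw page value
        page = doc.metadata.get("page_label", doc.metadata.get("page", "?"))
        text = doc.page_content[:max_chars_per_doc].strip()
        blocks.append(f"[{number}] {source}, page {page}\n{text}")
    return "\n\n".join(blocks)


# Stand-ins exposing the same two attributes used above (placeholder content)
demo_docs = [
    SimpleNamespace(
        page_content="APIs should be secure and provide value.",
        metadata={"source": "guide.pdf", "page": 7, "page_label": "7"},
    ),
    SimpleNamespace(
        page_content="Ownership of this document.",
        metadata={"source": "guide.pdf", "page": 4, "page_label": "4"},
    ),
]
print(format_context(demo_docs))
```

With the notebook's retriever, `format_context(retriever.invoke(question))` would give the context to place in a prompt for `llm`.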