# Retrieval-Augmented Generation (RAG) with LangChain, Pinecone, and ChromaDB for book.pdf

This notebook demonstrates **Retrieval-Augmented Generation (RAG)** using **LangChain** with **Pinecone** and **ChromaDB** as vector databases, processing a user-provided PDF document (`book.pdf`). The RAG system extracts text from the PDF, chunks it, creates embeddings, stores them in both vector databases, and answers queries based on the document content. It also compares the performance of Pinecone and ChromaDB.

## Objectives
- Extract text from `book.pdf`
- Chunk the text and create vector embeddings
- Store embeddings in Pinecone and ChromaDB
- Set up RAG pipelines for question answering
- Compare Pinecone and ChromaDB performance

## Prerequisites
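- A PDF to index, referred to as `book.pdf` throughout (the run shown here uses `data/AI for Everyone.pdf`)
- Pinecone account and API key (free tier available), needed only if you enable the Pinecone cells
- Groq API key for the hosted LLM used in the code (`llama-3.1-8b-instant`), or Ollama installed locally with a model (e.g., `llama3`) if you switch to the commented-out `OllamaLLM` line
- Required libraries: `langchain`, `langchain-community`, `langchain-groq`, `pinecone-client`, `chromadb`, `pypdf`, `sentence-transformers` (plus `langchain-ollama` for the Ollama option)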

Let's get started!

## Setup and Imports

Install and import the necessary libraries. Supply your own Groq API key, and your Pinecone credentials if you enable the Pinecone cells.

In [7]:
#!pip install -qU langchain langchain-community langchain-groq langchain-ollama pinecone-client chromadb sentence-transformers pypdf

import os
#import pinecone
import chromadb
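# Loaders, vector stores, and embeddings live in langchain_community;
# the older langchain.* import paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Pinecone, Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq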
import time

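# Set Groq API key: read it from the environment or prompt for it.
# Never hardcode a secret in a notebook.
if "GROQ_API_KEY" not in os.environ:
    from getpass import getpass
    os.environ["GROQ_API_KEY"] = getpass("Groq API key: ")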

## Step 1: Load and Preprocess book.pdf

We'll use `PyPDFLoader` to extract text from the PDF and split it into chunks suitable for retrieval. The run shown here loads `data/AI for Everyone.pdf`; change the path to point at your own `book.pdf`.

In [8]:
# Load PDF
loader = PyPDFLoader("data/AI for Everyone.pdf")
documents = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

print(f"Number of document chunks: {len(docs)}")
print("Sample chunk:")
print(docs[0].page_content[:200] + "...")

Number of document chunks: 78
Sample chunk:
AI For Everyone
Om Prabhu
19D170018
Undergraduate, Department of Energy Science and Engineering
Indian Institute of Technology Bombay
Last updated January 31, 2021
NOTE: This document is a brief compi...
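
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this sketch ignores. The helper name `sliding_chunks` and the stand-in text are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining text is fully covered by the previous window's overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "0123456789" * 250  # 2,500 characters of stand-in text (not from the PDF)
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])   # 3 [1000, 1000, 700]
print(chunks[0][-100:] == chunks[1][:100])     # True: neighbours share 100 characters
```

A larger overlap lowers the chance that a sentence is split across two chunks, at the cost of more chunks to embed and store.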


## Step 2: Create Embeddings

We'll use Hugging Face's `all-MiniLM-L6-v2` model to create 384-dimensional embeddings for the document chunks, compatible with both Pinecone and ChromaDB.

In [9]:
# Initialize Hugging Face embeddings
embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

# Test embedding
sample_text = "This is a test sentence from the book."
sample_embedding = embeddings.embed_query(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")
print(f"Sample embedding (first 5 values): {sample_embedding[:5]}")

Embedding dimension: 384
Sample embedding (first 5 values): [0.03519676998257637, 0.07600128650665283, 0.023663559928536415, 0.14537909626960754, -0.01117265410721302]
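
Retrieval works by comparing the query embedding with each stored chunk embedding. The commented-out Pinecone cell requests `metric='cosine'`, so a small pure-Python sketch of cosine similarity may help. The three-dimensional vectors below are made-up stand-ins for the 384-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]        # toy vectors, not real embeddings
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 3.0, 0.0]

print(round(cosine_similarity(query_vec, same_direction), 6))  # 1.0
print(cosine_similarity(query_vec, orthogonal))                # 0.0
```

Because cosine similarity ignores vector length, only the direction of an embedding affects its ranking.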


## Step 3: Store Embeddings in Pinecone

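We'll create a Pinecone index, store the document embeddings, and set up a LangChain vector store for retrieval. The cell below is commented out and was not executed in this run. It also uses the older module-level `pinecone` client calls (`pinecone.list_indexes()`, `pinecone.create_index(...)`); newer releases of the Pinecone client replace these with a client object, so adapt the cell to your installed version before enabling it.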

In [None]:
# # Create Pinecone index
# index_name = 'book-rag'
# if index_name not in pinecone.list_indexes():
#     pinecone.create_index(index_name, dimension=384, metric='cosine')

# # Initialize Pinecone index
# index = pinecone.Index(index_name)

# # Store documents in Pinecone
# pinecone_store = Pinecone.from_documents(docs, embeddings, index_name=index_name)

# print(f"Pinecone index '{index_name}' created and populated with {len(docs)} documents.")

## Step 4: Store Embeddings in ChromaDB

We'll create a ChromaDB collection, store the document embeddings, and set up a LangChain vector store.

In [10]:
# Initialize ChromaDB client
chroma_client = chromadb.PersistentClient(path='./chroma_db')

# Store documents in ChromaDB
chroma_store = Chroma.from_documents(
    docs,
    embeddings,
    client=chroma_client,
    collection_name='book-rag',
    persist_directory='./chroma_db'
)

print(f"ChromaDB collection 'book-rag' created and populated with {len(docs)} documents.")

InternalError: Database error: error returned from database: (code: 14) unable to open database file
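
SQLite error code 14 is `SQLITE_CANTOPEN`: the database file could not be opened. ChromaDB's persistent client keeps its data in a SQLite file under the given path, so one plausible cause is a directory that cannot be created or written, or one held by another process. The actual cause of this failure is not known from the output above. The sketch below, with a hypothetical helper `check_persist_dir`, only rules out the directory-permission case before you retry the cell.

```python
import os
import tempfile


def check_persist_dir(path: str) -> bool:
    """Return True if `path` exists (or can be created) and a file can be written inside it."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating and discarding a temporary file proves the directory is writable.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError as exc:
        print(f"Cannot use {path!r} as a persist directory: {exc}")
        return False


print(check_persist_dir("./chroma_db"))
```

If the check passes and the error persists, restarting the kernel (to release any client still holding the database) or pointing `path` at a fresh directory are reasonable next things to try.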

## Step 5: Set Up RAG Pipeline

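We'll use LangChain's `RetrievalQA` chain with a Groq-hosted LLM (`llama-3.1-8b-instant`, via `ChatGroq`) to create the RAG pipeline. Only the ChromaDB chain is active in this run; the Pinecone chain is commented out, and a local Ollama model (`llama3`) is left as a commented-out alternative.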

In [5]:
# Initialize LLM (Groq-hosted Llama 3.1 8B)
# Local alternative (requires: from langchain_ollama import OllamaLLM):
# llm = OllamaLLM(model='llama3')
llm = ChatGroq(
    model="llama-3.1-8b-instant",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)
# Create RetrievalQA chains
# pinecone_qa = RetrievalQA.from_chain_type(
#     llm=llm,
#     chain_type='stuff',
#     retriever=pinecone_store.as_retriever(search_kwargs={'k': 3})
# )

chroma_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=chroma_store.as_retriever(search_kwargs={'k': 3})
)

print("RAG pipeline initialized for ChromaDB.")

RAG pipeline initialized for ChromaDB.
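
With `chain_type='stuff'`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and the LLM is called once. The sketch below illustrates the idea only; the prompt wording, the helper `build_stuff_prompt`, and the example chunks are hypothetical, not LangChain's actual template or text from the PDF.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up stand-ins for the k=3 chunks a retriever would return.
retrieved = [
    "ANI systems are built for one narrow task.",
    "AGI would handle any intellectual task a human can.",
    "Most current AI value comes from supervised learning.",
]
prompt = build_stuff_prompt("How do ANI and AGI differ?", retrieved)
print(prompt)
```

The approach is simple and needs one LLM call, but the prompt grows with `k` and `chunk_size`, so both must fit in the model's context window.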


## Step 6: Test RAG Pipelines

We'll test the RAG pipeline with sample queries and measure the response time for each. The first query is generic; the other two target the sample document, so replace them with questions that match your own PDF. With the Pinecone chain enabled, the same loop compares the outputs and timings of both stores.

In [6]:
# Sample queries (modify based on your book.pdf content)
queries = [
    "What is the main topic of the book?",
    "What is the main difference between Artificial Narrow Intelligence (ANI) and Artificial General Intelligence (AGI)?",
    "What is one real-life example of how supervised learning is used?"
]

# Test and compare
for query in queries:
    print(f"\nQuery: {query}")
    
    # # Pinecone
    # start_time = time.time()
    # pinecone_result = pinecone_qa.run(query)
    # pinecone_time = time.time() - start_time
    # print(f"Pinecone Response (Time: {pinecone_time:.2f}s):")
    # print(pinecone_result)
    
    # ChromaDB
    start_time = time.time()
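    # .run() is deprecated; invoke() returns a dict with the answer under "result"
    chroma_result = chroma_qa.invoke({"query": query})["result"]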
    chroma_time = time.time() - start_time
    print(f"ChromaDB Response (Time: {chroma_time:.2f}s):")
    print(chroma_result)


Query: What is the main topic of the book?
ChromaDB Response (Time: 1.98s):
The main topic of the book "AI For Everyone" is Artificial Intelligence (AI).

Query: What is the main difference between Artificial Narrow Intelligence (ANI) and Artificial General Intelligence (AGI)?
ChromaDB Response (Time: 0.76s):
The main difference between Artificial Narrow Intelligence (ANI) and Artificial General Intelligence (AGI) is their scope and capabilities.

Artificial Narrow Intelligence (ANI) is a type of AI that is designed to perform a specific task or a narrow set of tasks. It can excel in a particular area, such as:

- Recognizing images
- Translating languages
- Playing chess
- Driving a car

ANI is incredibly valuable in specific industries due to its narrow application, but it is limited to a particular domain.

On the other hand, Artificial General Intelligence (AGI) is a type of AI that is designed to perform any intellectual task that a human can. It has the ability to reason, learn, and apply its knowledge across a wide range
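
Each time above comes from a single call and covers retrieval plus LLM generation. The first query was the slowest (1.98 s versus 0.76 s), which is consistent with one-time warm-up costs, although this run cannot confirm that. A fairer comparison repeats each call and reports the median. The sketch below assumes a hypothetical stand-in function `fake_qa` in place of the real chain.

```python
import statistics
import time


def time_calls(fn, arg, repeats: int = 5):
    """Call fn(arg) `repeats` times; return (last result, first duration, median duration)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(arg)
        durations.append(time.perf_counter() - start)
    return result, durations[0], statistics.median(durations)


def fake_qa(query: str) -> str:
    """Hypothetical stand-in for a RAG chain call; sleeps briefly instead of calling an LLM."""
    time.sleep(0.01)
    return f"answer to: {query}"


result, first, median = time_calls(fake_qa, "What is the main topic of the book?")
print(f"first call: {first:.3f}s, median of 5: {median:.3f}s")
```

To isolate the vector store from the LLM, time the retriever call on its own rather than the whole chain.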

## Step 7: Comparison of Pinecone and ChromaDB

### Pinecone
- **Pros**:
  - Fully managed, scalable vector database
  - Optimized for real-time similarity search
  - Easy integration with LangChain
  - Ideal for production with high throughput
- **Cons**:
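  - Requires an account and API key; the free tier caps index size (100,000 vectors at the time of writing, so check the current limits)
  - Usage-based pricing can become costly at large scale
- **Performance**: Not measured in this run because the Pinecone cells are commented out. Every query includes a network round trip to the hosted index; the ~500 ms figure for small indexes is a rough estimate, not a benchmark.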

### ChromaDB
- **Pros**:
  - Open-source and free
  - Easy to set up locally or in Docker
  - Persists data locally
  - Great for prototyping and development
- **Cons**:
  - Self-hosted, requiring infrastructure management
  - Less optimized for large-scale production
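- **Performance**: Runs locally, so retrieval speed depends on your hardware and may lag a managed service on large datasets. In this run, end-to-end answers (retrieval plus LLM generation) took between 0.76 s and 1.98 s.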

## Recommendations
- **Use Pinecone** for production-grade applications requiring scalability and managed infrastructure.
- **Use ChromaDB** for prototyping, local testing, or projects where self-hosting is preferred.

## Notes
- If `book.pdf` is large, adjust `chunk_size` and `chunk_overlap` to balance retrieval accuracy and performance.
- If the PDF contains scanned images, `PyPDFLoader` will extract little or no text; run OCR first with a tool like `pytesseract`.
- Response times depend on network latency (Pinecone and the Groq API), hardware (ChromaDB and the embedding model), and LLM performance.

## Cleanup

Delete the Pinecone index and ChromaDB collection to free resources.

In [None]:
# Cleanup Pinecone
# if index_name in pinecone.list_indexes():
#     pinecone.delete_index(index_name)
#     print(f"Deleted Pinecone index '{index_name}'.")

# Cleanup ChromaDB
chroma_client.delete_collection('book-rag')
print("Deleted ChromaDB collection 'book-rag'.")

## Explanation

- **PDF Loading and Splitting**: Used `PyPDFLoader` to extract text from `book.pdf` and `RecursiveCharacterTextSplitter` to create manageable chunks.
- **Embeddings**: Employed Hugging Face's `all-MiniLM-L6-v2` for 384-dimensional embeddings.
- **Pinecone**: Outlined how to create a cloud-based index, store embeddings, and set up a `Pinecone` vector store (these cells are commented out in this run).
- **ChromaDB**: Created a local collection, stored embeddings, and set up a `Chroma` vector store.
- **RAG Pipeline**: Used `RetrievalQA` with a Groq-hosted `llama-3.1-8b-instant` LLM to answer queries, retrieving the top-3 relevant chunks.
- **Comparison**: Compared Pinecone and ChromaDB qualitatively on setup, performance, and use cases.

## Next Steps
- **Modify Queries**: Replace the sample queries with ones specific to `book.pdf` content (e.g., key concepts, chapters, or examples).
- **Handle Large PDFs**: If `book.pdf` is large, test with smaller `chunk_size` (e.g., 500) or increase `k` in `search_kwargs`.
- **OCR for Scanned PDFs**: If the PDF is scanned, use `pdf2image` to render pages as images and `pytesseract` to extract their text.
- **Advanced RAG**: Add prompt templates, hybrid search, or reranking for better results.
- **Deployment**: Create a UI with Chainlit or deploy the RAG system as an API.
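
As one illustration of the hybrid-search idea, results from a vector search and a keyword search can be merged with reciprocal rank fusion (RRF), which scores each document by the sum of `1 / (rank_constant + rank)` over the rankings it appears in. This is a minimal sketch; the chunk IDs are made up, and `rank_constant=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rank_constant: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Sort by descending score; break ties by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["chunk-12", "chunk-03", "chunk-40"]   # hypothetical semantic-search ranking
keyword_hits = ["chunk-03", "chunk-77", "chunk-12"]  # hypothetical keyword-search ranking

print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['chunk-03', 'chunk-12', 'chunk-77', 'chunk-40']
```

RRF uses only rank positions, so the two searches do not need comparable score scales.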
