### Building a RAG System with LangChain and ChromaDB
#### Introduction
Retrieval-Augmented Generation (RAG) combines a large language model with retrieval from an external knowledge source, so that answers are grounded in the retrieved documents. This notebook walks through building a complete RAG system using:

- LangChain: A framework for developing applications powered by language models
- ChromaDB: An open-source vector database for storing and retrieving embeddings
- OpenAI: For embeddings and the language model (you can substitute other providers)

In [1]:
import os 
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
## LangChain Imports
from langchain_text_splitters import  RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
## Vectorstores
from langchain_community.vectorstores import Chroma

## Utility imports
import numpy as np
from typing import List




In [3]:
# Import the splitter from the updated package
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Sample text
text = """LangChain is a framework for building applications with LLMs.
It supports document loading, text splitting, embeddings, and RAG pipelines.
RecursiveCharacterTextSplitter allows splitting text into coherent chunks."""

# Create the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,      # max characters per chunk
    chunk_overlap=10,   # overlap between chunks
    separators=["\n\n", "\n", " ", ""]  # recursive separators
)

# Split the text
chunks = splitter.split_text(text)

# Print the resulting chunks
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk!r}")


In [4]:
# RAG Architecture Overview
print("""
RAG (Retrieval-Augmented Generation) Architecture:

1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert user query to embedding
6. Similarity Search: Find relevant chunks from vector store
7. Context Augmentation: Combine retrieved chunks with query
8. Response Generation: LLM generates answer using context

Benefits of RAG:
- Reduces hallucinations
- Provides up-to-date information
- Allows citing sources
- Works with domain-specific knowledge
""")


RAG (Retrieval-Augmented Generation) Architecture:

1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert user query to embedding
6. Similarity Search: Find relevant chunks from vector store
7. Context Augmentation: Combine retrieved chunks with query
8. Response Generation: LLM generates answer using context

Benefits of RAG:
- Reduces hallucinations
- Provides up-to-date information
- Allows citing sources
- Works with domain-specific knowledge
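
The eight steps above can be sketched end to end without any external service. This is a toy illustration under loud assumptions, not how LangChain or ChromaDB implement them: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the list `store` stands in for ChromaDB, and the final prompt is printed instead of being sent to an LLM.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    words = (w.strip(".,?!:;").lower() for w in text.split())
    return Counter(w for w in words if w)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1-2: load documents and split them into chunks (one sentence per chunk here)
documents = [
    "Supervised learning uses labeled data. Unsupervised learning finds patterns in unlabeled data.",
    "Transformers use attention mechanisms. CNNs are effective for image processing.",
]
chunks = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]

# Steps 3-4: embed every chunk and keep (chunk, vector) pairs in an in-memory "store"
store = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=2):
    # Steps 5-6: embed the query and rank the stored chunks by similarity
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, k=2):
    # Step 7: combine the retrieved chunks with the query
    context = "\n".join(retrieve(query, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Step 8 would send this prompt to an LLM; here it is only printed
prompt = build_prompt("What does supervised learning use?")
print(prompt)
```

The rest of the notebook replaces each stand-in with a real component: OpenAI embeddings for `embed`, ChromaDB for `store`, and a chat model for the final step.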



### 1. Sample Data

In [5]:
## create sample documents
sample_docs = [
    """
    Machine Learning Fundamentals
    
    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experience without being explicitly programmed. There are three main 
    types of machine learning: supervised learning, unsupervised learning, and reinforcement 
    learning. Supervised learning uses labeled data to train models, while unsupervised 
    learning finds patterns in unlabeled data. Reinforcement learning learns through 
    interaction with an environment using rewards and penalties.
    """,
    
    """
    Deep Learning and Neural Networks
    
    Deep learning is a subset of machine learning based on artificial neural networks. 
    These networks are inspired by the human brain and consist of layers of interconnected 
    nodes. Deep learning has revolutionized fields like computer vision, natural language 
    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly 
    effective for image processing, while Recurrent Neural Networks (RNNs) and Transformers 
    excel at sequential data processing.
    """,
    
    """
    Natural Language Processing (NLP)
    
    NLP is a field of AI that focuses on the interaction between computers and human language. 
    Key tasks in NLP include text classification, named entity recognition, sentiment analysis, 
    machine translation, and question answering. Modern NLP heavily relies on transformer 
    architectures like BERT, GPT, and T5. These models use attention mechanisms to understand 
    context and relationships between words in text.
    """
]

sample_docs


['\n    Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through \n    interaction with an environment using rewards and penalties.\n    ',
 '\n    Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of interconnected \n    nodes. Deep learning has revolutionized fields like computer vision, natural language \n    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly \n    effective f

In [6]:
## Save sample documents to files
import tempfile
temp_dir = tempfile.mkdtemp()

for i,doc in enumerate(sample_docs):
    with open(f"{temp_dir}/doc_{i}.txt","w") as f:
        f.write(doc)
        
print(f"Sample documents created in: {temp_dir}")

Sample documents created in: C:\Users\ajits\AppData\Local\Temp\tmpabmaonoc


In [7]:
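## Save sample documents to the "data" directory that the loader in the next section reads
import tempfile
temp_dir = tempfile.mkdtemp()  # fresh temp directory, inspected in the next cell
os.makedirs("data", exist_ok=True)

for i, doc in enumerate(sample_docs):
    with open(os.path.join("data", f"doc_{i}.txt"), "w", encoding="utf-8") as f:
        f.write(doc)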



In [8]:
temp_dir

'C:\\Users\\ajits\\AppData\\Local\\Temp\\tmp_9r4je9i'

### 2. Document Loading

In [9]:
from langchain_community.document_loaders import DirectoryLoader,TextLoader

# Load documents from directory
loader = DirectoryLoader(
    "data", 
    glob="*.txt", 
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'}
)
documents = loader.load()

print(f"Loaded {len(documents)} documents")
print(f"\nFirst document preview:")
print(documents[0].page_content[:200] + "...")


Loaded 3 documents

First document preview:

    Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experience without being explicitly programmed. Ther...


In [10]:
documents

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='\n    Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through \n    interaction with an environment using rewards and penalties.\n    '),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='\n    Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of interconnected \n    nodes. Deep learning has revolutionized fields like computer vision, natur

### 3. Document Splitting

In [11]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Maximum size of each chunk, in characters
    chunk_overlap=50,     # Overlap between chunks to maintain context
    length_function=len,
    separators=[" "]      # Split on spaces only; pass several separators for a hierarchy
)

chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"\nChunk example:")
print(f"Content: {chunks[0].page_content[:150]}...")
print(f"Metadata: {chunks[0].metadata}")

Created 5 chunks from 3 documents

Chunk example:
Content: Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn 
    and improve from experie...
Metadata: {'source': 'data\\doc_0.txt'}
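
To see what `chunk_size` and `chunk_overlap` mean in isolation, here is a minimal fixed-window sketch. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on separators and merges the pieces); the helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

pieces = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so consecutive chunks repeat the last three characters of their predecessor. That repetition is what keeps a sentence cut at a chunk boundary partly visible in both chunks.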


In [12]:
chunks

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='data. Reinforcement learning learns through \n    interaction with an environment using rewards and penalties.'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of in

### Embedding Models

In [13]:
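# load_dotenv() has already copied the key from .env into os.environ; this line raises TypeError if the key is missing
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")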

In [14]:
sample_text = "Machine Learning is Fascinating"
embeddings = OpenAIEmbeddings()
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x000001E3E0745010>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x000001E3E0745940>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [15]:
vector = embeddings.embed_query(sample_text)
vector

[-0.02159224823117256,
 0.021384380757808685,
 0.013251559808850288,
 -0.02277449518442154,
 -0.005300623830407858,
 0.020539917051792145,
 -0.0049823266454041,
 0.00708698621019721,
 -0.016005804762244225,
 -0.043054576963186264,
 0.012348635122179985,
 0.0372602678835392,
 -0.017266003414988518,
 -0.0019714944064617157,
 -0.004670525435358286,
 0.006262011826038361,
 0.03728625178337097,
 0.011822470463812351,
 -0.0013170361053198576,
 -0.003621443407610059,
 -0.02795819379389286,
 0.03167382627725601,
 -0.004888136871159077,
 -0.03801378980278969,
 -0.0238787904381752,
 0.0029133944772183895,
 -0.0002450158353894949,
 -0.04261285811662674,
 -0.011465197429060936,
 -0.004322996828705072,
 0.026022426784038544,
 -0.003501269966363907,
 -0.020448975265026093,
 -0.03627289831638336,
 -0.00841863825917244,
 -0.01050380989909172,
 -0.0006296926876530051,
 -0.004586079157888889,
 -0.017071127891540527,
 0.0006162949721328914,
 0.024645302444696426,
 0.024294527247548103,
 -0.00387153425253

### Initialize the ChromaDB Vector Store and Store the Chunks as Vectors

In [16]:
chunks

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='data. Reinforcement learning learns through \n    interaction with an environment using rewards and penalties.'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of in

In [17]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

## Create a ChromaDB vector store
persist_directory = "./chroma_db"

## Initialize ChromaDB with OpenAI Embeddings
vectorstore = Chroma.from_documents(
    documents = chunks,
    embedding = OpenAIEmbeddings(),
    persist_directory = persist_directory,
    collection_name = "rag_collection"
)

print(f"Vector store created with {vectorstore._collection.count()} vectors")
print(f"Persisted to: {persist_directory}")


Vector store created with 5 vectors
Persisted to: ./chroma_db


In [18]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Test Chroma
embeddings = OpenAIEmbeddings()
vectordb = Chroma(collection_name="test", embedding_function=embeddings)
print("Chroma loaded successfully")


Chroma loaded successfully




### Test Similarity Search

In [19]:
query = "What are the types of Machine Learning"

similar_docs = vectorstore.similarity_search(query,k=3)
similar_docs

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn \n    and improve from experience without being explicitly programmed. There are three main \n    types of machine learning: supervised learning, unsupervised learning, and reinforcement \n    learning. Supervised learning uses labeled data to train models, while unsupervised \n    learning finds patterns in unlabeled data. Reinforcement learning learns through'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of interconnected \n    nodes. Deep learning has revolutionized fields like computer vision, natural language \n    processing, and speech recognition. Convolutional Neural Networks (

In [20]:
query="what is NLP?"

similar_docs=vectorstore.similarity_search(query,k=3)
similar_docs

[Document(metadata={'source': 'data\\doc_2.txt'}, page_content='Natural Language Processing (NLP)\n\n    NLP is a field of AI that focuses on the interaction between computers and human language. \n    Key tasks in NLP include text classification, named entity recognition, sentiment analysis, \n    machine translation, and question answering. Modern NLP heavily relies on transformer \n    architectures like BERT, GPT, and T5. These models use attention mechanisms to understand \n    context and relationships between words in text.'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of interconnected \n    nodes. Deep learning has revolutionized fields like computer vision, natural language \n    processing, and speech recognition. Convolutional Neural Networks (CNNs) are part

In [21]:
print(f"Query: {query}")
print(f"\nTop {len(similar_docs)} similar chunks:")
for i, doc in enumerate(similar_docs):
    print(f"\n--- Chunk {i+1} ---")
    print(doc.page_content[:200] + "...")
    print(f"Source: {doc.metadata.get('source', 'Unknown')}")

Query: what is NLP?

Top 3 similar chunks:

--- Chunk 1 ---
Natural Language Processing (NLP)

    NLP is a field of AI that focuses on the interaction between computers and human language. 
    Key tasks in NLP include text classification, named entity recogn...
Source: data\doc_2.txt

--- Chunk 2 ---
Deep Learning and Neural Networks

    Deep learning is a subset of machine learning based on artificial neural networks. 
    These networks are inspired by the human brain and consist of layers of i...
Source: data\doc_1.txt

--- Chunk 3 ---
Neural Networks (RNNs) and Transformers 
    excel at sequential data processing....
Source: data\doc_1.txt


### Advanced Similarity Search With Scores

In [23]:
results_scores = vectorstore.similarity_search_with_score(query,k=3)
results_scores

[(Document(metadata={'source': 'data\\doc_2.txt'}, page_content='Natural Language Processing (NLP)\n\n    NLP is a field of AI that focuses on the interaction between computers and human language. \n    Key tasks in NLP include text classification, named entity recognition, sentiment analysis, \n    machine translation, and question answering. Modern NLP heavily relies on transformer \n    architectures like BERT, GPT, and T5. These models use attention mechanisms to understand \n    context and relationships between words in text.'),
  0.23328782618045807),
 (Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks. \n    These networks are inspired by the human brain and consist of layers of interconnected \n    nodes. Deep learning has revolutionized fields like computer vision, natural language \n    processing, and speech recognition. Convolutional Neura

#### Understanding Similarity Scores
The score returned by `similarity_search_with_score` is a distance between the query embedding and each chunk embedding, so its meaning depends on the distance metric the collection uses:

ChromaDB default: squared L2 (Euclidean) distance

- Lower scores = MORE similar (closer in vector space)
- Score of 0 = identical vectors
- Range for unit-length embeddings (such as OpenAI's): 0 to 4; unnormalized vectors can score higher


Cosine (if configured):

- ChromaDB returns cosine distance (1 - cosine similarity), so lower scores still mean MORE similar
- Range: 0 to 2 (0 meaning the vectors point in the same direction)
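
For unit-length vectors, squared L2 distance and cosine similarity carry the same ranking information. The NumPy sketch below checks that relationship by hand; the three vectors are made up for illustration, and nothing here goes through ChromaDB.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Made-up 3-dimensional "embeddings", for illustration only
query = normalize([1.0, 2.0, 2.0])
close = normalize([1.0, 2.0, 1.5])
far = normalize([-2.0, 0.5, 0.0])

for name, vec in [("close", close), ("far", far)]:
    cos_sim = float(np.dot(query, vec))         # higher = more similar
    cos_dist = 1.0 - cos_sim                    # lower = more similar
    sq_l2 = float(np.sum((query - vec) ** 2))   # lower = more similar
    # For unit vectors, squared L2 distance equals 2 * cosine distance
    assert np.isclose(sq_l2, 2.0 * cos_dist)
    print(f"{name}: cosine similarity={cos_sim:.3f}, "
          f"cosine distance={cos_dist:.3f}, squared L2={sq_l2:.3f}")
```

Because the two distances differ only by a constant factor on normalized vectors, switching between them changes the score values but not the order of the results.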

#### Initialize the LLM, Prompt Template, and RAG Chain, Then Query the RAG System

In [24]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name = "gpt-3.5-turbo"
)

In [25]:
test_response = llm.invoke("What are Large Language Models?")
test_response

AIMessage(content="Large Language Models (LLMs) are a type of artificial intelligence model that is trained with massive amounts of text data in order to generate human-like responses to natural language inputs. These models are designed to understand and generate human language at a high level of proficiency, and they have been used in a variety of applications such as machine translation, text generation, and chatbots. Some examples of popular LLMs include GPT-3, BERT, and OpenAI's DALL-E.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 98, 'prompt_tokens': 13, 'total_tokens': 111, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-CwYt07j8aYeYepKWueh0OqO4INguf', 'service_tier': 

In [26]:
from langchain.chat_models.base import init_chat_model

llm = init_chat_model("openai:gpt-3.5-turbo")
## llm = init_chat_model("groq:")
llm

ChatOpenAI(profile={'max_input_tokens': 16385, 'max_output_tokens': 4096, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': False, 'structured_output': False, 'image_url_inputs': False, 'pdf_inputs': False, 'pdf_tool_message': False, 'image_tool_message': False, 'tool_choice': True}, client=<openai.resources.chat.completions.completions.Completions object at 0x000001E3E37FA5D0>, async_client=<openai.resources.chat.completions.completions.AsyncCompletions object at 0x000001E3E37F8050>, root_client=<openai.OpenAI object at 0x000001E3E3819350>, root_async_client=<openai.AsyncOpenAI object at 0x000001E3E3818180>, model_kwargs={}, openai_api_key=SecretStr('**********'), stream_usage=True)

In [27]:
llm.invoke("What is AI")

AIMessage(content='AI, or artificial intelligence, refers to the simulation of human intelligence processes by machines, especially computer systems. This includes learning, reasoning, problem-solving, perception, and language understanding. AI technology aims to mimic cognitive functions such as learning and problem-solving to enable machines to perform tasks that typically require human intelligence.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 61, 'prompt_tokens': 10, 'total_tokens': 71, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'id': 'chatcmpl-CwYuKEfPWj1OqUNzxUNcoftd5k3h0', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019ba960-ad94-7210-ba6b-7

### Modern RAG Chain

In [5]:
# Vector store
from langchain_community.vectorstores import Chroma

# Embeddings
from langchain_openai import OpenAIEmbeddings

# Text splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Prompts
from langchain_core.prompts import ChatPromptTemplate

# Chains (on LangChain 1.x the legacy chain helpers live in the separate langchain-classic package)
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains import create_retrieval_chain

In [8]:
# Vector store
from langchain_community.vectorstores import Chroma

# Embeddings
from langchain_openai import OpenAIEmbeddings

# Text splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Documents
from langchain_core.documents import Document

# LLM
from langchain_openai import ChatOpenAI

# Prompt templates
from langchain_core.prompts import PromptTemplate

In [9]:
# Imports this cell relies on
from langchain_core.documents import Document
from langchain_core.prompts import PromptTemplate

# 1. Embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# 2. Chroma vectorstore
vectorstore = Chroma(
    collection_name="my_collection",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)

# 3. Example document
docs = [Document(page_content="LangChain makes building RAG easy!")]

# 4. Add documents
vectorstore.add_documents(docs)

# 5. Splitter (optional if you have long docs)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# 6. Retriever
retriever = vectorstore.as_retriever()

# 7. LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# 8. Prompt
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="Answer the question based on the following context:\n{context}\nQuestion: {question}"
)

# 9. Manual retrieval + LLM (invoke() replaces the older get_relevant_documents() and predict() calls)
query = "What is LangChain?"
docs = retriever.invoke(query)
context = "\n".join([doc.page_content for doc in docs])
response = llm.invoke(prompt.format(context=context, question=query)).content

print(response)

In [29]:
# On LangChain 1.x these helpers are provided by the langchain-classic package (pip install langchain-classic)
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains.retrieval import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate