### Building a RAG System with LangChain and ChromaDB
#### Introduction
 Retrieval-Augmented Generation (RAG) is a powerful technique that combines the capabilities of large
 language models with external knowledge retrieval. 
 This notebook will walk you through building a
 complete RAG system using:
- LangChain: A framework for developing applications powered by language models
- ChromaDB: An open-source vector database for storing and retrieving embeddings
- OpenAI: For embeddings and language model (you can substitute with other providers)

In [1]:
import os 
from dotenv import load_dotenv
load_dotenv()

True

In [2]:
gemini = os.getenv("GEMINI_API_KEY")

In [3]:
# Recursive character text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document

# Vector stores
from langchain_community.vectorstores import Chroma


# Utility imports
import numpy as np
from typing import List

### RAG (Retrieval-Augmented Generation) Architecture:
1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert user query to embedding
6. Similarity Search: Find relevant chunks from vector store
7. Context Augmentation: Combine retrieved chunks with query
8. Response Generation: LLM generates answer using context
### Benefits of RAG:
- Reduces hallucinations
- Provides up-to-date information
- Allows citing sources
- Works with domain-specific knowledge

#### 1. Sample Data

In [4]:
# create sample data

sample_docs = [
    """Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn
    and improve from experience without being explicitly programmed. There are three main
    types of machine learning: supervised learning, unsupervised learning, and reinforcement
    learning. Supervised learning uses labeled data to train models, while unsupervised
    learning finds patterns in unlabeled data. Reinforcement learning learns through
    interaction with an environment using rewards and penalties.
    """,
    
    """
    Deep Learning and Neural Networks

    Deep learning is a subset of machine learning based on artificial neural networks.
    These networks are inspired by the human brain and consist of layers of interconnected
    nodes. Deep learning has revolutionized fields like computer vision, natural language
    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly
    effective for image processing, while Recurrent Neural Networks (RNNs) and Transformers
    excel at sequential data processing.
    """,

    """Natural Language Processing (NLP)

    NLP is a field of AI that focuses on the interaction between computers and human language.
    Key tasks in NLP include text classification, named entity recognition, sentiment analysis,
    machine translation, and question answering. Modern NLP heavily relies on transformer
    architectures like BERT, GPT, and T5. These models use attention mechanisms to understand
    context and relationships between words in text.
    """

]

In [5]:
sample_docs

['Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables systems to learn\n    and improve from experience without being explicitly programmed. There are three main\n    types of machine learning: supervised learning, unsupervised learning, and reinforcement\n    learning. Supervised learning uses labeled data to train models, while unsupervised\n    learning finds patterns in unlabeled data. Reinforcement learning learns through\n    interaction with an environment using rewards and penalties.\n    ',
 '\n    Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural networks.\n    These networks are inspired by the human brain and consist of layers of interconnected\n    nodes. Deep learning has revolutionized fields like computer vision, natural language\n    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly\n    effective for image proces

In [6]:
import tempfile

temp_dir = tempfile.mkdtemp()

for i, doc in enumerate(sample_docs):
    with open(f"{temp_dir}/doc_{i}.txt", "w") as f:
        f.write(doc)

In [7]:
temp_dir

'C:\\Users\\monda\\AppData\\Local\\Temp\\tmp57ggche3'

#### 2. Document Loading

In [8]:
from langchain_community.document_loaders import DirectoryLoader

# Load documents from directory

loader = DirectoryLoader(
    "data",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={'encoding':'utf-8'}
)

documents = loader.load()

print(f"Loaded {len(documents)} documents")
print(f"\nFirst document preview: \n")
print(documents[0].page_content[:200]+"...")

Loaded 3 documents

First document preview: 

Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables systems to learn
    and improve from experience without being explicitly programmed. There are ...


#### 3. Document Splitting

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500, # Maximum size of each chunk
    chunk_overlap = 50, # Overlap between chunks to maintain context
    length_function = len, 
    separators=["\n\n", "\n", ". ", " ", ""] # Hierarchy of separators
)

In [10]:
chunks = text_splitter.split_documents(documents)

#### 4. Embedding Models

In [11]:
# Hugging face embedding models

embeddings = HuggingFaceEmbeddings(
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
)

  from .autonotebook import tqdm as notebook_tqdm


In [12]:
sample_text = "Machine learning is amazing"
embedding1 = embeddings.embed_query(sample_text)

#### 5. Initialize the ChromaDB Vector store and stores the chunks in Vector representation

In [13]:
# create a Chromadb vector store
persist_directory = "./chroma_db"

# Initialize chromadb with huggingFace embeddings
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=HuggingFaceEmbeddings(),
    persist_directory=persist_directory,
    collection_name="rag_collection"
)

print(f"Vector store created with {vector_store._collection.count()} vectors")
print(f"Persisted to: {persist_directory}")

Vector store created with 28 vectors
Persisted to: ./chroma_db


#### 6. Test similarity test

In [14]:
query = "What are the types of machine learning and what is nlp?"

similar_docs = vector_store.similarity_search_with_score(query, k=5)

In [15]:
similar_docs

[(Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning is a subset of artificial intelligence that enables systems to learn\n    and improve from experience without being explicitly programmed. There are three main\n    types of machine learning: supervised learning, unsupervised learning, and reinforcement\n    learning. Supervised learning uses labeled data to train models, while unsupervised\n    learning finds patterns in unlabeled data. Reinforcement learning learns through'),
  0.5402185916900635),
 (Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning is a subset of artificial intelligence that enables systems to learn\n    and improve from experience without being explicitly programmed. There are three main\n    types of machine learning: supervised learning, unsupervised learning, and reinforcement\n    learning. Supervised learning uses labeled data to train models, while unsupervised\n    learning finds patterns in unlabe

In [16]:
chunks

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine Learning Fundamentals'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning is a subset of artificial intelligence that enables systems to learn\n    and improve from experience without being explicitly programmed. There are three main\n    types of machine learning: supervised learning, unsupervised learning, and reinforcement\n    learning. Supervised learning uses labeled data to train models, while unsupervised\n    learning finds patterns in unlabeled data. Reinforcement learning learns through'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='interaction with an environment using rewards and penalties.'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep learning is a subset of machine learning based on artificial neural networks.\n    These net

#### 2. Understanding Similarity Scores
The similarity score represents how closely related a document chunk is to your query. The scoring depends on the distance
metric used: 

ChromaDB default: Uses L2 distance (Euclidean distance)

- Lower scores = MORE similar (closer in vector space)
- Score of 0 = identical vectors
- Typical range: 0 to 2 (but can be higher)

Cosine similarity (if configured):
- Higher scores = MORE similar
- Range: -1 to 1 (1 being identical)


In [17]:
# Loading the gemini api key
import os
from dotenv import load_dotenv
load_dotenv()

gemini = os.getenv("GEMINI_API_KEY")

In [18]:
# Initiating gemini

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=gemini
)

In [19]:
test = llm.predict("What is a LLM ?")

  test = llm.predict("What is a LLM ?")


In [20]:
test

'An **LLM** stands for **Large Language Model**.\n\nIt\'s a type of artificial intelligence (AI) program that has been trained on a massive amount of text data to understand, generate, and process human language.\n\nLet\'s break down what each part of the name signifies:\n\n1.  **Large:**\n    *   **Parameters:** LLMs typically have billions, even trillions, of parameters (the internal variables that the model learns during training). This vast number allows them to capture incredibly complex patterns and relationships in language.\n    *   **Data:** They are trained on enormous datasets, often comprising a significant portion of the publicly available text on the internet (books, articles, websites, code, etc.).\n    *   **Compute:** Training and running these models require immense computational power, usually involving thousands of specialized processors (GPUs).\n\n2.  **Language:**\n    *   This refers to human language (natural language). LLMs are designed to work with text – read

#### Modern RAG chain

In [22]:
from langchain.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

In [23]:
## convert a vector store to retriever

retriever = vector_store.as_retriever(
    seacrch_kwargs = {"k":3} ## Retrieve top 3 chunks
)

In [24]:
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000001FC238F45C0>, search_kwargs={})

In [25]:
## create a prompt template

system_prompts = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question,
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise

Context: {context}"""

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompts),
    ("human", "{input}")
])

In [26]:
prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question,\nIf you don't know the answer, just say that you don't know.\nUse three sentences maximum and keep the answer concise\n\nContext: {context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])

In [27]:
# Create stuff document chain

from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(llm, prompt)
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question,\nIf you don't know the answer, just say that you don't know.\nUse three sentences maximum and keep the answer concise\n\nContext: {context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])
| ChatGoogleGenerativeAI(model='models/gemini-2.5-flash', google_api_key=SecretStr('**********'), client=<google.ai.generativelanguage_v1b

In [28]:
# Create the final RAG chain

rag_chain = create_retrieval_chain(retriever, document_chain)

In [29]:
rag_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000001FC238F45C0>, search_kwargs={}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question,\nIf you d

In [35]:
rag_chain.invoke({"input":"What are neural networks ?"})

{'input': 'What are neural networks ?',
 'context': [Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep learning is a subset of machine learning based on artificial neural networks.\n    These networks are inspired by the human brain and consist of layers of interconnected\n    nodes. Deep learning has revolutionized fields like computer vision, natural language\n    processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly\n    effective for image processing, while Recurrent Neural Networks (RNNs) and Transformers\n    excel at sequential data processing.'),
  Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep learning is a subset of machine learning based on artificial neural networks.\n    These networks are inspired by the human brain and consist of layers of interconnected\n    nodes. Deep learning has revolutionized fields like computer vision, natural language\n    processing, and speech recognition. Convolutional