### Building a RAG System with LangChain and ChromaDB

Introduction

Retrieval-Augmented Generation (RAG) combines a large language model with external knowledge retrieval: passages relevant to a question are fetched from a document store and handed to the model as context for its answer. This notebook walks through building a complete RAG system using:

- LangChain: A framework for developing applications powered by language models

- ChromaDB: An open-source vector database for storing and retrieving embeddings

- OpenAI: For embeddings and the chat model (other providers can be substituted)

In [43]:
import os
from dotenv import load_dotenv
from pathlib import Path
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores import Chroma
from typing import List
import warnings
warnings.filterwarnings('ignore')

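load_dotenv()
# load_dotenv() already copies OPENAI_API_KEY from .env into os.environ.
# Fail early with a clear message if it is still missing; assigning
# os.getenv(...) back into os.environ would raise TypeError when it is None.
if not os.getenv('OPENAI_API_KEY'):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")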


### 1. Sample Data

In [44]:
sample_docs = [
    """
    Machine Learning Fundamentals

    Machine learning is a subset of artificial intelligence that enables
    systems to learn and improve from experience without being explicitly
    programmed. There are three main types of machine learning: supervised
    learning, unsupervised learning, and reinforcement learning. Supervised
    learning uses labeled data to train models, while unsupervised learning
    finds patterns in unlabeled data. Reinforcement learning learns through
    interaction with an environment using rewards and penalties.
    """,

    """
    Deep Learning and Neural Networks

    Deep learning is a subset of machine learning based on artificial neural
    networks. These networks are inspired by the human brain and consist of
    layers of interconnected nodes. Deep learning has revolutionized fields
    like computer vision, natural language processing, and speech recognition.
    Convolutional Neural Networks (CNNs) are particularly effective for image
    processing, while Recurrent Neural Networks (RNNs) and Transformers excel
    at sequential data processing.
    """,

    """
    Natural Language Processing (NLP)

    NLP is a field of AI that focuses on the interaction between computers and
    human language. Key tasks in NLP include text classification, named entity
    recognition, sentiment analysis, machine translation, and question answering.
    Modern NLP heavily relies on transformer architectures like BERT, GPT, and T5.
    These models use attention mechanisms to understand context and relationships
    between words in text.
    """
]


Save each sample document to its own `.txt` file in the `data` directory.

In [45]:


path = Path("data")
path.mkdir(parents=True, exist_ok=True)

for i, doc in enumerate(sample_docs):
    (path / f"doc_{i}.txt").write_text(doc, encoding="utf-8")

print("✅ Sample docs created")


✅ Sample docs created


### 2. Load Documents

In [46]:
loader = DirectoryLoader(
    path,
    glob='*.txt',
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'}
)

docs = loader.load()
docs

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='\n    Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that enables\n    systems to learn and improve from experience without being explicitly\n    programmed. There are three main types of machine learning: supervised\n    learning, unsupervised learning, and reinforcement learning. Supervised\n    learning uses labeled data to train models, while unsupervised learning\n    finds patterns in unlabeled data. Reinforcement learning learns through\n    interaction with an environment using rewards and penalties.\n    '),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='\n    Deep Learning and Neural Networks\n\n    Deep learning is a subset of machine learning based on artificial neural\n    networks. These networks are inspired by the human brain and consist of\n    layers of interconnected nodes. Deep learning has revolutionized fields\n    like computer vision, na

### 3. Split Documents

In [47]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=30,
    length_function=len,
    separators=['\n\n', '\n', '. ', '']
)

splitted_docs = text_splitter.split_documents(docs)

print(f"total docs after split: {len(splitted_docs)}\n")
splitted_docs
# for i, v in enumerate(splitted_docs):
#     print(f"chunk {i}\n {splitted_docs[i].page_content}\n")

total docs after split: 9



[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine Learning Fundamentals'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning is a subset of artificial intelligence that enables\n    systems to learn and improve from experience without being explicitly\n    programmed. There are three main types of machine learning: supervised\n    learning, unsupervised learning, and reinforcement learning. Supervised'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='learning uses labeled data to train models, while unsupervised learning\n    finds patterns in unlabeled data. Reinforcement learning learns through\n    interaction with an environment using rewards and penalties.'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep learning is a subset of machine learning based on artificial neural\n    networks. Thes
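
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter first tries to break on the listed separators); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which is the role `chunk_overlap=30` plays in the cell above: a sentence cut at a chunk boundary still appears with some of its surrounding text in the neighbouring chunk.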

### 4. Create Embeddings and Store Them in ChromaDB

In [48]:
persist_directory = "./chroma_db"

vector_store = Chroma.from_documents(
    documents=splitted_docs,
    embedding=OpenAIEmbeddings(),
    persist_directory=persist_directory,
    collection_name='rag_collection'
)

print(f"vector store created with '{vector_store._collection.count()}' vectors")
print(f'location: {persist_directory}')

vector store created with '18' vectors
location: ./chroma_db
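
The count of 18 is twice the 9 chunks reported by the splitter. One plausible explanation (an inference, the notebook does not say so) is that this cell was run twice against the same persisted collection, so every chunk was added again. One way to make re-runs idempotent is to derive a stable ID from each chunk and pass the IDs when adding documents. The sketch below shows only the ID derivation with the standard library; `chunk_id` is a hypothetical helper, and the commented usage assumes the vector store accepts an `ids` argument and overwrites entries whose ID already exists.

```python
import hashlib


def chunk_id(source: str, content: str) -> str:
    """Stable ID: the same source and text always hash to the same value."""
    digest = hashlib.sha256(f"{source}\n{content}".encode("utf-8"))
    return digest.hexdigest()[:16]


# Stand-ins for (metadata["source"], page_content) pairs from the splitter.
chunks = [
    ("data/doc_0.txt", "Machine Learning Fundamentals"),
    ("data/doc_1.txt", "Deep Learning and Neural Networks"),
]

first_run = [chunk_id(src, text) for src, text in chunks]
second_run = [chunk_id(src, text) for src, text in chunks]
print(first_run == second_run)  # True: re-running yields the same IDs

# Assumed usage with the notebook's objects (not executed here):
# ids = [chunk_id(d.metadata["source"], d.page_content) for d in splitted_docs]
# Chroma.from_documents(documents=splitted_docs, ids=ids, ...)
```

If two chunks from the same file can have identical text, including the chunk index in the hashed string keeps their IDs distinct.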


### 5. Test Similarity Search

In [49]:
query = "what are the types of machine learning"

similar_docs = vector_store.similarity_search(query, k=3)
similar_docs

[Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning is a subset of artificial intelligence that enables\n    systems to learn and improve from experience without being explicitly\n    programmed. There are three main types of machine learning: supervised\n    learning, unsupervised learning, and reinforcement learning. Supervised'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine learning is a subset of artificial intelligence that enables\n    systems to learn and improve from experience without being explicitly\n    programmed. There are three main types of machine learning: supervised\n    learning, unsupervised learning, and reinforcement learning. Supervised'),
 Document(metadata={'source': 'data\\doc_0.txt'}, page_content='Machine Learning Fundamentals')]

In [50]:
query = "what is NLP?"
similar_docs = vector_store.similarity_search(query, k=5)
similar_docs

[Document(metadata={'source': 'data\\doc_2.txt'}, page_content='Natural Language Processing (NLP)'),
 Document(metadata={'source': 'data\\doc_2.txt'}, page_content='Natural Language Processing (NLP)'),
 Document(metadata={'source': 'data\\doc_2.txt'}, page_content='NLP is a field of AI that focuses on the interaction between computers and\n    human language. Key tasks in NLP include text classification, named entity\n    recognition, sentiment analysis, machine translation, and question answering.'),
 Document(metadata={'source': 'data\\doc_2.txt'}, page_content='NLP is a field of AI that focuses on the interaction between computers and\n    human language. Key tasks in NLP include text classification, named entity\n    recognition, sentiment analysis, machine translation, and question answering.'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks')]

In [51]:
query = 'what is deep learning?'
similar_docs = vector_store.similarity_search(query, k=3)
similar_docs

[Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks'),
 Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep learning is a subset of machine learning based on artificial neural\n    networks. These networks are inspired by the human brain and consist of\n    layers of interconnected nodes. Deep learning has revolutionized fields')]

#### Advanced Similarity Search with Scores

In [52]:
vector_store.similarity_search_with_score(query, k=3)

[(Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks'),
  0.2128422111272812),
 (Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep Learning and Neural Networks'),
  0.21299368143081665),
 (Document(metadata={'source': 'data\\doc_1.txt'}, page_content='Deep learning is a subset of machine learning based on artificial neural\n    networks. These networks are inspired by the human brain and consist of\n    layers of interconnected nodes. Deep learning has revolutionized fields'),
  0.22426888346672058)]

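### Understanding Similarity Scores
The score returned by `similarity_search_with_score` is a distance between the query embedding and the chunk embedding, so its meaning depends on the distance metric the collection uses:

**ChromaDB default: L2 (Chroma computes the squared Euclidean distance)**
- Lower scores = MORE similar (closer in vector space)
- Score of 0 = identical vectors
- Range: 0 to 4 for unit-length embeddings such as OpenAI's; unbounded for unnormalized vectors

**Cosine (if configured on the collection):**
- Chroma reports cosine distance (1 - cosine similarity), so lower scores are still MORE similar
- Range: 0 to 2 (0 = same direction); the underlying cosine similarity runs from -1 to 1, with 1 being identical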
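
The relationship between the two metrics can be checked with plain Python. The sketch below uses made-up 3-dimensional vectors (not real embeddings) and shows that, for unit-length vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so both metrics rank neighbours in the same order.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Illustrative vectors only; real embeddings have hundreds of dimensions.
query = normalize([1.0, 2.0, 0.5])
near = normalize([1.0, 1.8, 0.7])
far = normalize([-1.0, 0.2, 3.0])

for name, vec in [("near", near), ("far", far)]:
    d2 = squared_l2(query, vec)
    cos = cosine_similarity(query, vec)
    print(f"{name}: squared L2 = {d2:.4f}, cosine similarity = {cos:.4f}, "
          f"2 * (1 - cos) = {2 * (1 - cos):.4f}")
```

The vector with the smaller squared L2 distance is also the one with the higher cosine similarity, which is why the choice between the two metrics does not change the ranking for normalized embeddings.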


### 6. Initialize LLM, RAG Chain, Prompt Template

In [53]:
llm = ChatOpenAI(
    model='gpt-4o',
)

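# `predict` is deprecated in recent LangChain releases; `invoke` returns
# a message object whose text is in `.content`.
llm.invoke("What is LLM?").content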


"LLM can refer to a couple of different things depending on the context:\n\n1. **Master of Laws (LL.M.)**: This is a postgraduate academic degree in law. It is typically pursued by individuals who already hold a primary law degree (such as a Juris Doctor (JD) in the United States or a Bachelor of Laws (LLB) in other countries) and seek to gain specialized knowledge in a particular area of law, such as international law, human rights, tax law, or intellectual property.\n\n2. **Large Language Model**: In the context of artificial intelligence and machine learning, LLM stands for Large Language Model. These are advanced AI models trained to understand, generate, and manipulate human language on a large scale. Examples of large language models include OpenAI's GPT (Generative Pre-trained Transformer) series, such as GPT-3 and GPT-4. They are used in various applications, including chatbots, automated writing, language translation, and more.\n\nThe specific meaning of LLM will depend on the

### Modern RAG Chain

In [54]:
from langchain.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

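# Create the retriever. The keyword is `search_kwargs` (plural); a misspelled
# `search_kwarg` is silently ignored, which leaves search_kwargs={} in the repr.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})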

retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000019251C1EB90>, search_kwargs={})

In [55]:
## Create a prompt template
system_prompt = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Context: {context}"""

prompt = ChatPromptTemplate.from_messages([
    ('system', system_prompt),
    ('human', "{input}")
])

In [56]:
prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question.\nIf you don't know the answer, just say that you don't know.\nUse three sentences maximum and keep the answer concise.\n\nContext: {context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])

In [57]:
document_chain = create_stuff_documents_chain(llm=llm, prompt=prompt)
document_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question.\nIf you don't know the answer, just say that you don't know.\nUse three sentences maximum and keep the answer concise.\n\nContext: {context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])
| ChatOpenAI(client=<openai.resources.chat.completions.completions.Completions object at 0x0000019251C7D890>, async_client=<openai.resour

This chain:
- Takes retrieved documents
- "Stuffs" them into the prompt's {context} placeholder
- Sends the complete prompt to the LLM
- Returns the LLM's response
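
The "stuffing" step can be imitated with plain string formatting. The sketch below is not LangChain's implementation: `stuff_documents` is a hypothetical helper, the double-newline separator is an assumption, and the template is a shortened copy of the system prompt above.

```python
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks.\n"
    "Use the following pieces of retrieved context to answer the question.\n\n"
    "Context: {context}"
)


def stuff_documents(contents: list[str], question: str,
                    separator: str = "\n\n") -> list[tuple[str, str]]:
    """Join every retrieved chunk into one context string and build chat messages."""
    context = separator.join(contents)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Stand-ins for the page_content of retrieved chunks.
retrieved = [
    "Deep Learning and Neural Networks",
    "Deep learning is a subset of machine learning based on artificial neural networks.",
]
messages = stuff_documents(retrieved, "what is deep learning?")
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```

Because every chunk is concatenated into a single prompt, this approach only works while the combined context fits in the model's context window, which is one reason to keep `k` and `chunk_size` small.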


In [58]:
rag_chain = create_retrieval_chain(retriever, document_chain)
rag_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000019251C1EB90>, search_kwargs={}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks.\nUse the following pieces of retrieved context to answer the question.\nIf you don't 

In [59]:
query = "what is artificial intelligence?"

result = rag_chain.invoke({"input": query})

In [60]:
print(result['answer'])

Artificial intelligence (AI) is a branch of computer science that focuses on creating systems capable of performing tasks that typically require human intelligence, such as understanding language, recognizing patterns, solving problems, and making decisions. AI encompasses various subfields, including machine learning, natural language processing, and robotics. Its goal is to enable machines to mimic human-like cognitive functions.
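
The dictionary returned by the retrieval chain carries the retrieved documents under `context` alongside `input` and `answer`, so it is possible to check what an answer was grounded in. The sketch below uses a stand-in `FakeDocument` dataclass and a hand-written `result` (both hypothetical, so it runs without an API key); with the real chain you would pass the `result` from the cell above to `unique_sources`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in with the two attributes used from LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(result: dict) -> list[str]:
    """Return the distinct `source` values of the retrieved chunks, in first-seen order."""
    seen = []
    for doc in result.get("context", []):
        source = doc.metadata.get("source", "<unknown>")
        if source not in seen:
            seen.append(source)
    return seen


# Hand-written example shaped like the chain's output; not a real model response.
result = {
    "input": "what is deep learning?",
    "context": [
        FakeDocument("Deep Learning and Neural Networks", {"source": "data/doc_1.txt"}),
        FakeDocument("Deep Learning and Neural Networks", {"source": "data/doc_1.txt"}),
        FakeDocument("Machine Learning Fundamentals", {"source": "data/doc_0.txt"}),
    ],
    "answer": "(model answer would appear here)",
}
print(unique_sources(result))
# ['data/doc_1.txt', 'data/doc_0.txt']
```

Inspecting the context is worthwhile here: the sample documents mention artificial intelligence only in passing and never mention robotics, so the answer above appears to draw partly on the model's own knowledge rather than on the retrieved chunks.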
