### Building a RAG System with LangChain and ChromaDB
#### Introduction
Retrieval-Augmented Generation (RAG) is a powerful technique that combines the capabilities of large language models with external knowledge retrieval. This notebook will walk you through building a complete RAG system using:

- LangChain: A framework for developing applications powered by language models
- ChromaDB: An open-source vector database for storing and retrieving embeddings
- OpenAI: For embeddings and language model (you can substitute with other providers)

In [None]:
import os
from dotenv import load_dotenv
load_dotenv()

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader  
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document

#vectorestore
from langchain.vectorstores import Chroma
from langchain_community.vectorstores import Chroma 

##utility imports
import numpy as np
from typing import List, Dict, Any


In [None]:
# RAG Architecture Overview
print("""
RAG (Retrieval-Augmented Generation) Architecture:

1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert user query to embedding
6. Similarity Search: Find relevant chunks from vector store
7. Context Augmentation: Combine retrieved chunks with query
8. Response Generation: LLM generates answer using context

Benefits of RAG:
- Reduces hallucinations
- Provides up-to-date information
- Allows citing sources
- Works with domain-specific knowledge
""")

In [None]:
#1 create sample data

def create_sample_data() -> List[Document]:
    sample_text = """
    LangChain is a framework for developing applications powered by language models.
    It provides modular components for building LLM applications, including document loaders,
    text splitters, embeddings, and vector stores.
    
    ChromaDB is a vector database that allows efficient storage and retrieval of high-dimensional vectors.
    It is commonly used in RAG architectures to store embeddings of text chunks.
    """
    return [Document(page_content=sample_text, metadata={"source": "sample_data"})]


#1 Load sample data
documents = create_sample_data()
print(f"Loaded {len(documents)} documents.")
print("Sample document content:", documents[0].page_content)
print("Sample document metadata:", documents[0].metadata)


In [16]:

#2 Split documents into smaller chunks
def split_documents(
        documents: List[Document], 
        chunk_size: int = 100, 
        chunk_overlap: int = 5) -> List[Document]:
    
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_documents(documents)
print("Splitting documents into chunks...")
chunked_documents = split_documents(documents)
print(f"Created {len(chunked_documents)} chunks from {len(documents)} documents.")

for i, chunk in enumerate(chunked_documents):
    print(f"Chunk {i+1}: {chunk.page_content[:50]}...")  # Print first 50 characters of each chunk




Splitting documents into chunks...
Created 6 chunks from 1 documents.
Chunk 1: LangChain is a framework for developing applicatio...
Chunk 2: It provides modular components for building LLM ap...
Chunk 3: text splitters, embeddings, and vector stores....
Chunk 4: ChromaDB is a vector database that allows efficien...
Chunk 5: vectors....
Chunk 6: It is commonly used in RAG architectures to store ...


In [19]:

# 3 Initialize HuggingFace embeddings
# Generate embeddings for the chunks
def generate_embeddings(documents: List[Document]) -> List[np.ndarray]:
    embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return [embeddings_model.embed_documents(chunk.page_content) for chunk in chunked_documents]

print("Generating embeddings for chunks...")
embeddings = generate_embeddings(chunked_documents)
print(f"Generated embeddings for {len(embeddings)} chunks.") 

for i, embedding in enumerate(embeddings):
    print(f"Embedding {i+1}: {embedding[:5]}...")  # Print first 5 values of each embedding



Generating embeddings for chunks...
Generated embeddings for 6 chunks.
Embedding 1: [[-0.029210269451141357, -0.00813598558306694, 0.03420502319931984, 0.040295638144016266, 0.07426184415817261, 0.059225041419267654, 0.08180944621562958, 0.037137411534786224, 0.032914865761995316, -0.03143712505698204, 0.060038626194000244, -0.07254357635974884, 0.024705274030566216, -0.004340950399637222, -0.01321366336196661, 0.018410898745059967, -0.08305701613426208, -0.014857887290418148, -0.12271381169557571, 0.0023262619506567717, -0.033769641071558, 0.027208659797906876, -0.020789364352822304, 0.02001935988664627, -0.01195458322763443, -0.016626890748739243, -0.021979263052344322, -0.004876251798123121, 0.0012886389158666134, -0.08734939992427826, 0.007423533126711845, 0.1130385547876358, 0.058595214039087296, -0.019217340275645256, 0.017139475792646408, -0.07475003600120544, -0.07950087636709213, -0.05808822810649872, 0.03741088882088661, 0.059456147253513336, -0.08294069021940231, -0.08918190

In [None]:
# 4 Store embeddings in ChromaDB
## Create a Chromdb vector store
persist_directory="./chroma_db"

embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

def store_embeddings_in_chroma(documents: List[Document], embeddings: List[np.ndarray]) -> Chroma:
    chroma_db = Chroma.from_documents(documents, embeddings_model, persist_directory=persist_directory,collection_name="rag_collection")
    return chroma_db
print("Storing embeddings in ChromaDB...")
chroma_db = store_embeddings_in_chroma(chunked_documents, embeddings_model)
print(f"Stored {len(chroma_db)} embeddings in ChromaDB.")
print("ChromaDB vector store created successfully.")



Storing embeddings in ChromaDB...
Stored 6 embeddings in ChromaDB.
ChromaDB vector store created successfully.
