
# 📚 Retrieval-Augmented Generation (RAG) with LangChain

This repository demonstrates a simple **Retrieval-Augmented Generation (RAG)** pipeline using:
- **LLM**: OpenAI GPT (`gpt-4o-mini`) via LangChain
- **API**: OpenAI API for the LLM; embeddings come from a HuggingFace `sentence-transformers` model (OpenAI embeddings are optional)
- **Vector Database**: ChromaDB
- **Retriever**: Chroma Retriever

---


In [None]:
import os
from getpass import getpass
if not os.environ.get('OPENAI_API_KEY'):  # never hard-code or commit a real key
    os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')


## 🔑 Step 1: Install Dependencies

We first install the required Python packages:
- `langchain-community` and `langchain-openai` for LLM and RAG tools  
- `faiss-cpu` or `chromadb` for vector database options  
- `tiktoken` for tokenization  
- `youtube-transcript-api` (optional, for transcript input sources)
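
Before moving on, it can help to check that the packages are actually importable. The sketch below probes for them without importing anything heavy. The helper name `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions about what these distributions expose (for example, `faiss-cpu` is normally imported as `faiss`).

```python
from importlib.util import find_spec

# Import names assumed for the pip packages listed above.
REQUIRED_MODULES = [
    "langchain_community",
    "langchain_openai",
    "chromadb",
    "tiktoken",
]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


# An empty list means every module in REQUIRED_MODULES was found.
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

If the list is not empty, re-run the install cell for the packages it names.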


In [None]:
!pip install -q \
    youtube-transcript-api \
    langchain-community \
    langchain-openai \
    chromadb \
    sentence-transformers \
    faiss-cpu \
    tiktoken \
    python-dotenv


## 🔑 Step 2: Import Libraries and Set API Key

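Each library is imported in the step that first uses it, so the cell below only upgrades `youtube-transcript-api` to its latest release.  
Keep your own `OPENAI_API_KEY` in a `.env` file or an environment variable rather than hard-coding it in the notebook.  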
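
One way to make a missing key fail early with a readable message is a small guard like the one below. This is a minimal sketch using only the standard library; `require_env` is a hypothetical helper, not part of LangChain or the OpenAI client.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file first."
        )
    return value


# Typical use at the top of the notebook (commented out so the sketch runs anywhere):
# api_key = require_env("OPENAI_API_KEY")
```

Failing at this point is easier to diagnose than an authentication error raised later, in the middle of a chain call.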


In [None]:
!pip install -q youtube-transcript-api --upgrade


## 🔑 Step 3: Define the LLM

We define our **ChatOpenAI model** with:
- `model='gpt-4o-mini'`
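- `temperature=2` (sampling randomness; the OpenAI API accepts values from 0 to 2, so this is the most random setting, and values near 0 give more repeatable answers)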
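
The effect of `temperature` can be illustrated with the standard temperature-scaled softmax. This is a sketch of the general technique with made-up logits, not a description of how the OpenAI service is implemented. `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all probability mass sits on the top-scoring token. At 2.0 the alternatives become much more likely to be sampled, which is usually undesirable when the answer should stay close to the retrieved context.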


In [None]:
from langchain_openai import ChatOpenAI

In [None]:
llm = ChatOpenAI(model='gpt-4o-mini',temperature=2)

In [None]:
llm.invoke('when did rohit sharma retired from test career?').content




## 🔑 Step 4: Setup Vector Database (ChromaDB)

We use **ChromaDB** as our vector store.  
Steps:
1. Convert text/documents into embeddings  
2. Store them in **ChromaDB**  
3. Query them using a **Retriever**
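
The three steps above can be sketched with a toy in-memory store. Everything here is an illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class, not the ChromaDB API.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a real pipeline uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by similarity."""

    def __init__(self):
        self._rows = []

    def add_texts(self, texts):
        # Steps 1 and 2: embed each text and store it with its vector.
        for text in texts:
            self._rows.append((text, embed(text)))

    def search(self, query, k=2):
        # Step 3: embed the query and return the k most similar texts.
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "the retriever returns the most similar chunks",
    "embeddings map text to vectors",
    "bananas are yellow",
])
print(store.search("which chunks does the retriever return", k=1))
```

A real vector database adds persistence and approximate nearest-neighbor indexing, but the embed, store, and rank flow is the same.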


In [None]:
# Open file and read everything
with open("rohit sharma.txt", "r", encoding="utf-8") as f:
    transcript = f.read()
# 'transcript' now holds the full content of "rohit sharma.txt"

In [None]:
len(transcript)

53107

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=220, chunk_overlap=200)
chunks = splitter.create_documents([transcript])

In [None]:
len(chunks)

994
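
With `chunk_size=220` and `chunk_overlap=200`, neighboring chunks can share up to 200 characters, which is why the retrieval results further down look nearly identical. The fixed-window sketch below shows how overlap shrinks the step between chunks. `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally splits on separators such as newlines, so its chunk counts will differ from this simplified version.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size windows that overlap by `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(sliding_chunks(sample, chunk_size=4, chunk_overlap=2))
print(sliding_chunks(sample, chunk_size=4, chunk_overlap=0))
```

A smaller overlap relative to the chunk size yields fewer, more distinct chunks, so the top-`k` results cover more of the source text.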

In [None]:
chunks[10]

Document(metadata={}, page_content='Full name\t\nRohit Gurunath Sharma\nBorn\t30 April 1987 (age 38)\nNagpur, Maharashtra, India\nNickname\t\nHitman [1]\nShana [2]\nBatting\tRight-handed\nBowling\tRight-arm off break\nRole\tTop-order batter\nInternational information')

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [None]:
# create_documents already returns Document objects, so a plain copy is enough
docs = list(chunks)

In [None]:
docs[10]

Document(metadata={}, page_content='Full name\t\nRohit Gurunath Sharma\nBorn\t30 April 1987 (age 38)\nNagpur, Maharashtra, India\nNickname\t\nHitman [1]\nShana [2]\nBatting\tRight-handed\nBowling\tRight-arm off break\nRole\tTop-order batter\nInternational information')

In [None]:
# Optional alternative (not used below); reads OPENAI_API_KEY from the environment
embeddings = OpenAIEmbeddings()

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

In [None]:
# !pip install langchain chromadb openai tiktoken pypdf langchain_openai langchain-community

In [None]:
vector_store = Chroma(embedding_function=embedding_function, persist_directory='my_chroma_db', collection_name='sample')

In [None]:
vector_store

<langchain_community.vectorstores.chroma.Chroma at 0x7e5f17ae03e0>

In [None]:
vector_store.add_documents(docs)

['36ff9b4b-74b5-4165-81ce-46dde06cd21e',
 '60a68574-4e73-4aa4-8851-b61750ffc82f',
 'c2145bad-d4ac-4799-9558-03e06f302531',
 ...

In [None]:
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 4})

In [None]:
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7e5f17ae03e0>, search_kwargs={'k': 4})

In [None]:
retriever.invoke('who was the winning captain of icc t20 world cup 2024 and icc champions trophy 2025?')

[Document(metadata={}, page_content='was also a member of the teams that won the 2007 T20 World Cup, the 2013 ICC Champions Trophy and was the winning captain of the 2025 ICC Champions Trophy, where he was also the player of the match in the final.'),
 Document(metadata={}, page_content='from T20Is.[4][5] He was also a member of the teams that won the 2007 T20 World Cup, the 2013 ICC Champions Trophy and was the winning captain of the 2025 ICC Champions Trophy, where he was also the player of the match'),
 Document(metadata={}, page_content="victory at the 2024 Men's T20 World Cup, he announced his retirement from T20Is.[4][5] He was also a member of the teams that won the 2007 T20 World Cup, the 2013 ICC Champions Trophy and was the winning captain of the"),
 Document(metadata={}, page_content='he announced his retirement from T20Is.[4][5] He was also a member of the teams that won the 2007 T20 World Cup, the 2013 ICC Champions Trophy and was the winning captain of the 2025 ICC Champi

In [None]:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    You are a helpful assistant.
    Answer ONLY from the provided transcript context.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """
    )

In [None]:
question = 'when did rohit sharma retired from test career?'
retrieved_docs = retriever.invoke(question)

In [None]:
retrieved_docs

[Document(metadata={}, page_content='"Rohit Sharma announces retirement from Test Cricket". ESPNCricinfo. 7 May 2025.\n "India captain Rohit Sharma announces Test retirement". International Cricket Council. 7 May 2025. Retrieved 7 May 2025.'),
 Document(metadata={}, page_content='"Rohit Sharma retires: A look at cricketer\'s numbers for India in Tests". The Indian Express. 7 May 2025. Retrieved 9 May 2025.\n "Rohit Sharma announces retirement from Test Cricket". ESPNCricinfo. 7 May 2025.'),
 Document(metadata={}, page_content='Sharma has spent his entire domestic first-class career at Mumbai. In December 2009, he made his highest career score of 309 not out in the Ranji Trophy against Gujarat.[31] In October 2013, upon the retirement of Ajit'),
 Document(metadata={}, page_content='"Rohit Sharma: India\'s Test cricket captain retires from longest format". BBC Sport. 7 May 2025. Retrieved 9 May 2025.')]

In [None]:
context = '\n\n'.join(doc.page_content for doc in retrieved_docs)

In [None]:
print(context)

"Rohit Sharma announces retirement from Test Cricket". ESPNCricinfo. 7 May 2025.
 "India captain Rohit Sharma announces Test retirement". International Cricket Council. 7 May 2025. Retrieved 7 May 2025.

"Rohit Sharma retires: A look at cricketer's numbers for India in Tests". The Indian Express. 7 May 2025. Retrieved 9 May 2025.
 "Rohit Sharma announces retirement from Test Cricket". ESPNCricinfo. 7 May 2025.

Sharma has spent his entire domestic first-class career at Mumbai. In December 2009, he made his highest career score of 309 not out in the Ranji Trophy against Gujarat.[31] In October 2013, upon the retirement of Ajit

"Rohit Sharma: India's Test cricket captain retires from longest format". BBC Sport. 7 May 2025. Retrieved 9 May 2025.


In [None]:
final_prompt = prompt.invoke({'context':context, 'question':question})

In [None]:
print(final_prompt.text)


    You are a helpful assistant.
    Answer ONLY from the provided transcript context.

    Context:
    "Rohit Sharma announces retirement from Test Cricket". ESPNCricinfo. 7 May 2025.
 "India captain Rohit Sharma announces Test retirement". International Cricket Council. 7 May 2025. Retrieved 7 May 2025.

"Rohit Sharma retires: A look at cricketer's numbers for India in Tests". The Indian Express. 7 May 2025. Retrieved 9 May 2025.
 "Rohit Sharma announces retirement from Test Cricket". ESPNCricinfo. 7 May 2025.

Sharma has spent his entire domestic first-class career at Mumbai. In December 2009, he made his highest career score of 309 not out in the Ranji Trophy against Gujarat.[31] In October 2013, upon the retirement of Ajit

"Rohit Sharma: India's Test cricket captain retires from longest format". BBC Sport. 7 May 2025. Retrieved 9 May 2025.

    Question:
    when did rohit sharma retired from test career?

    Answer:
    


In [None]:
answer = llm.invoke(final_prompt)

In [None]:
print(answer.content)

Rohit Sharma announced his retirement from Test cricket on 7 May 2025.


In [None]:
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Download only the tokenizer data that word_tokenize needs
nltk.download('punkt')
nltk.download('punkt_tab')  # used by newer NLTK releases

In [None]:
# Define reference and generated responses
reference = "Rohit sharma is the captain of ODI team india"
generated = "ODI team india is captained by rohit sharma"

# Tokenize
ref_tokens = [nltk.word_tokenize(reference.lower())]
gen_tokens = nltk.word_tokenize(generated.lower())

# Calculate BLEU Score
smoothie = SmoothingFunction().method4
bleu_score = sentence_bleu(ref_tokens, gen_tokens, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)

print(f"BLEU score: {bleu_score:.4f}")

BLEU score: 0.1917
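
The low score for what is essentially a paraphrase becomes clearer when the clipped n-gram precisions behind BLEU are computed by hand. The sketch below uses plain `split()` for tokenization, which gives the same tokens as `word_tokenize` for these two punctuation-free sentences. `modified_precision` here is a hypothetical helper written for illustration, and the brevity penalty and smoothing are left out.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram matches and total candidate n-grams, as (matched, total)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched, sum(cand_counts.values())


reference = "rohit sharma is the captain of odi team india".split()
generated = "odi team india is captained by rohit sharma".split()

for n in range(1, 5):
    matched, total = modified_precision(reference, generated, n)
    print(f"{n}-gram precision: {matched}/{total}")
```

The sentences share 6 of 8 unigrams but only 3 of 7 bigrams, 1 of 6 trigrams, and no 4-grams, so BLEU penalizes the reordering even though the meaning is preserved. The zero 4-gram count is also why a smoothing function is needed to get a non-zero score.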



## 🔑 Step 5: Build the RAG Pipeline

- User provides a query  
- Retriever searches **ChromaDB** for relevant chunks  
- LLM generates a response from the retrieved chunks only, as the prompt instructs  
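
The flow in the list above can be sketched with plain functions before wiring it up with LangChain runnables. All names here (`fake_retriever`, `join_chunks`, `build_prompt`, `fake_llm`, `rag_answer`) are hypothetical stand-ins so the example runs without an API key or a vector store.

```python
def fake_retriever(question):
    """Stand-in for the Chroma retriever: returns fixed text chunks."""
    return ["Chunk A about the topic.", "Chunk B with more detail."]


def join_chunks(chunks):
    """Concatenate retrieved chunks into a single context string."""
    return "\n\n".join(chunks)


def build_prompt(inputs):
    """Fill a simple template with the context and the question."""
    return (
        "Answer ONLY from the provided context.\n\n"
        f"Context:\n{inputs['context']}\n\n"
        f"Question:\n{inputs['question']}\n\n"
        "Answer:"
    )


def fake_llm(prompt):
    """Stand-in for the chat model: just reports the prompt size."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def rag_answer(question):
    # Retrieve, format, fill the prompt, then call the model, in that order.
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return fake_llm(build_prompt(inputs))


print(rag_answer("What is chunk A about?"))
```

The LangChain version below expresses the same sequence declaratively, with the retriever and the pass-through question running as parallel branches that feed the prompt.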


In [None]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

In [None]:
def format_docs(retrieved_docs):
    """Join the page content of the retrieved documents into one context string."""
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

In [None]:
parallel_chain = RunnableParallel({'context': retriever | RunnableLambda(format_docs), 'question': RunnablePassthrough()})

In [None]:
parallel_chain.invoke('who is ritika?')

{'context': 'longtime girlfriend, Ritika Sajdeh on 13 December 2015 whom he first met in 2008. They welcomed their first child, a daughter born on 30 December 2018.[17] Sharma is a practitioner of the meditation technique Sahaj\n\nRitika Sajdeh on 13 December 2015 whom he first met in 2008. They welcomed their first child, a daughter born on 30 December 2018.[17] Sharma is a practitioner of the meditation technique Sahaj Marg.[18] Rohit and\n\n"Rohit Sharma and wife Ritika bat for animal charity". mid-day.com. Mid-Day Infomedia Ltd. 16 November 2017. Retrieved 4 August 2019.\n\nwhom he first met in 2008. They welcomed their first child, a daughter born on 30 December 2018.[17] Sharma is a practitioner of the meditation technique Sahaj Marg.[18] Rohit and Ritika welcomed their second child, a',
 'question': 'who is ritika?'}

In [None]:
parser = StrOutputParser()

In [None]:
main_chain = parallel_chain | prompt | llm | parser

In [None]:
main_chain.invoke('who is ritika?')

"Ritika Sajdeh is Rohit Sharma's longtime girlfriend, whom he married on 13 December 2015, and with whom he has two children."

In [None]:
main_chain.invoke('how many centuries rohit sharma has made in 2019 world cup?')

'Rohit Sharma made five centuries in the 2019 World Cup.'

In [None]:
main_chain.invoke('when did rohit shatma retired in t20?')

"Rohit Sharma announced his retirement from T20Is after the team's victory at the 2024 Men's T20 World Cup."

In [None]:
main_chain.invoke("how many cars rohit sharma have?")

'The transcript does not provide any information about the number of cars Rohit Sharma has.'


## 🎯 Summary

In this project, we implemented:
- **LLM**: OpenAI GPT-4o-mini  
- **Retriever**: Chroma Retriever  
- **Vector Database**: ChromaDB  
- **API**: OpenAI API for the LLM (+ HuggingFace `sentence-transformers` embeddings)  

This notebook can be uploaded to **GitHub** as a demonstration of your RAG project, provided the API key is loaded from the environment rather than hard-coded 🚀
