<a href="https://colab.research.google.com/github/deepakgarg08/llm-diary/blob/main/llm_chronicles_basic_rag_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
!pip -q install langchain openai chromadb tiktoken sentence_transformers langchainhub langchain_openai langchain_chroma langchain_community

In [13]:
# Adapted from https://python.langchain.com/docs/use_cases/question_answering/

import os
from langchain import hub
# from langchain.chat_models import ChatOpenAI
from langchain_openai import ChatOpenAI
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.vectorstores import Chroma
from langchain_chroma import Chroma
from langchain.schema import Document

In [14]:
# We'll be using GPT-3.5 Turbo for inference
# os.environ['OPENAI_API_KEY'] = ""
api_key = os.getenv('OPENAI_API_KEY')
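
Since `os.getenv` silently returns `None` when the variable is missing, it can help to fail early with a clear message. The helper below is a minimal sketch; the function name `require_api_key` and the error text are illustrative assumptions, not part of the original notebook.

```python
import os

def require_api_key(environ=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the given mapping, or raise a clear error.

    Hypothetical helper: the name and message are illustrative only.
    """
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the chain.")
    return key
```

Calling `require_api_key()` at the top of the notebook would surface a missing key before any embedding or retrieval work is done.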

# 1 - Process dataset into LangChain Documents

We start by fetching a dataset that contains transcripts of the first 20 episodes of the Huberman Lab Podcast on health and fitness.

Each episode is represented as a plain-text file, starting with the YouTube URL of the episode and the title, which we'll parse as metadata. The actual transcript starts after the "TRANSCRIPT" separator.
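
To make the layout concrete, here is a small sketch that parses an in-memory sample instead of a real file. The sample text, its URL and its title are invented for illustration. The positions (URL on the first line, title on the third, transcript after a `TRANSCRIPT` line) are assumptions inferred from the parsing code used later in this notebook.

```python
def parse_episode(text):
    """Split one episode file into (url, title, transcript).

    Assumes the layout described above: URL on line 0, title on line 2,
    and the transcript after a line containing only "TRANSCRIPT".
    """
    lines = text.splitlines(keepends=True)
    url = lines[0].strip()
    title = lines[2].strip()
    start = lines.index("TRANSCRIPT\n") + 1
    return url, title, "".join(lines[start:])

# Hypothetical sample, not taken from the dataset
sample = (
    "https://www.youtube.com/watch?v=EXAMPLE\n"
    "\n"
    "A Hypothetical Episode Title\n"
    "\n"
    "TRANSCRIPT\n"
    "(0:00:00) Introduction\n"
    "Welcome to the podcast.\n"
)

url, title, transcript = parse_episode(sample)
```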

In [15]:
!wget https://github.com/kyuz0/llm-chronicles/raw/main/datasets/huberman-lab-transcripts.tgz
!tar xzf huberman-lab-transcripts.tgz

--2025-06-12 10:22:49--  https://github.com/kyuz0/llm-chronicles/raw/main/datasets/huberman-lab-transcripts.tgz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/datasets/huberman-lab-transcripts.tgz [following]
--2025-06-12 10:22:50--  https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/datasets/huberman-lab-transcripts.tgz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 639359 (624K) [application/octet-stream]
Saving to: ‘huberman-lab-transcripts.tgz’


2025-06-12 10:22:50 (11.3 MB/s) - ‘huberman-lab-transcripts.tgz’ saved [639359/639359]



We'll process each episode and load it into a LangChain Document object (https://js.langchain.com/docs/modules/data_connection/document_loaders/how_to/creating_documents). This object has two main attributes:

- page_content: the actual content we want to index and search semantically
- metadata: any associated metadata, in our case the title and YouTube URL.

In [17]:
def process_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()

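    # Extract URL (first line) and title (third line).
    # Note: str.strip() does not remove a byte-order mark (U+FEFF), so a
    # title can keep a leading '\ufeff', as the metadata output shows.
    url = lines[0].strip()
    title = lines[2].strip()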

    # Extract page content after "TRANSCRIPT"
    transcript_index = lines.index('TRANSCRIPT\n')
    page_content = ''.join(lines[transcript_index + 1:])

    return Document(page_content=page_content, metadata={'source': url, 'title': title})


def create_documents_from_directory(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.txt'):
            doc = process_txt_file(os.path.join(directory_path, filename))
            documents.append(doc)
    return documents

# Example usage
directory_path = 'huberman-lab-transcripts'
docs = create_documents_from_directory(directory_path)
len(docs)


20

In [18]:
docs[0].metadata

{'source': 'https://www.youtube.com/watch?v=uuP-1ioh4LY', 'title': '\ufeff'}
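
The `'\ufeff'` in this title is a byte-order mark (BOM) that `str.strip()` leaves in place, because Python does not treat U+FEFF as whitespace. One possible cleanup is sketched below; `clean_title` is a hypothetical helper name and is not part of the original code.

```python
def clean_title(raw):
    """Remove any byte-order mark (U+FEFF) and surrounding whitespace."""
    return raw.replace("\ufeff", "").strip()

# A title in the shape seen in this dataset's metadata
cleaned = clean_title("\ufeffTools for Managing Stress & Anxiety | Huberman Lab Podcast #10\n")
```

A title that consists only of the BOM, as in the output above, becomes an empty string. Such files may need their title read from a different line.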

In [19]:
docs[0].page_content[:200]

"\n\n  (0:00:00) Introduction\nWelcome to the Huberman Lab Podcast where we discuss science, and science-based tools for everyday life. My name is Andrew Huberman, and I'm a professor of neurobiology and "

# 2 - Splitting the documents into chunks

We'll now proceed to split the transcripts into smaller chunks.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/chunks.png)

In [20]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
len(all_splits)

4581
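
To build intuition for `chunk_size`, `chunk_overlap` and `add_start_index`, here is a deliberately simplified sliding-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, which this toy version does not attempt. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.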

In [21]:
all_splits[1].page_content

"Our first sponsor is InsideTracker. InsideTracker analyzes data from your blood and DNA to help you better understand your body and health and health needs. I've been getting my blood tested for many years now. Because, it just turns out that many of the things that are important to our health and wellbeing can only be detected in a blood test or a DNA test. InsideTracker makes that really easy. They can come to your house to take those samples if you like, or you can go to a nearby clinic as well. The major problem with most blood tests and DNA tests is that it's very hard to make sense of the information you get. You get a lot of numbers related to metabolic factors, endocrine factors, et"

# 3 - Embedding chunks and loading into a vector database

This is a key preparation step for us to be able to perform semantic search on the transcripts.

- **BGE Embeddings**: BGE models on Hugging Face are among the best-performing open-source embedding models. BGE is created by the Beijing Academy of Artificial Intelligence (BAAI): https://huggingface.co/BAAI/bge-large-en. The code below loads the smaller `BAAI/bge-base-en` variant.
- **Chroma**: Chroma is an open-source vector database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine. For a more comprehensive list of vector databases, see https://www.datacamp.com/blog/the-top-5-vector-databases.

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/vector-store.png)

In [26]:
# from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
model_name = "BAAI/bge-base-en"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    # model_kwargs={'device': 'cuda'},
    model_kwargs={'device': 'cpu'},
    encode_kwargs=encode_kwargs
)
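
The `normalize_embeddings` flag matters because, once vectors have unit length, their dot product equals their cosine similarity. The sketch below checks this with toy two-dimensional vectors in plain Python. The vectors are made up and are far smaller than real BGE embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
raw_cosine = cosine(a, b)
normalized_dot = dot(normalize(a), normalize(b))
```

Both values agree (up to floating-point error), which is why a store that ranks by dot product behaves like a cosine-similarity search on normalized embeddings.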


In [27]:
vectorstore = Chroma.from_documents(documents=all_splits, embedding=bge_embeddings)

KeyboardInterrupt: 

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [None]:
# Retrievers are runnables; invoke() replaces the deprecated get_relevant_documents()
retrieved_docs = retriever.invoke(
    "How do I find my temperature minimum?"
)
len(retrieved_docs)



5
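
Conceptually, a similarity retriever with `k=5` scores every stored chunk against the query embedding and keeps the five best. The sketch below shows that idea as a brute-force search over toy unit vectors. It is an illustration under simplifying assumptions, not how Chroma is implemented, and `top_k` is a hypothetical name.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, indexed, k):
    """Return the k texts whose vectors score highest against query_vec.

    `indexed` is a list of (text, vector) pairs; vectors are assumed to be
    unit length so the dot product acts as cosine similarity.
    """
    scored = [(dot(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy index with made-up two-dimensional "embeddings"
toy_index = [
    ("sleep", [1.0, 0.0]),
    ("food", [0.0, 1.0]),
    ("jetlag", [0.8, 0.6]),
]
best = top_k([1.0, 0.0], toy_index, k=2)
```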

In [None]:
retrieved_docs

[Document(page_content="of you might even know your body mass index, some of you might know other things about your biology that have fancy names, but everyone should know their temperature minimum. Your temperature minimum doesn't require a thermometer to measure, although you could measure it. Your temperature minimum is the point in every 24 hour cycle when your temperature is lowest. Now, how do you measure that without a thermometer? It tends to fall 90 minutes to two hours before your average waking time. So I want to repeat that, your temperature minimum tends to fall 90 minutes to two hours before your average waking time. So let's say you're not traveling and your typical wake up time is 5:30 AM. Your", metadata={'source': 'https://www.youtube.com/watch?v=NAATB55oxeQ', 'start_index': 28734, 'title': 'Find Your Temperature Minimum to Defeat Jetlag, Shift Work & Sleeplessness | Huberman Lab Podcast #4'}),
 Document(page_content='(00:31:01) Your Temperature Minimum', metadata={'s

# 4 - Full RAG Chain

Let's now put everything together to build a fully functional RAG chain using the LangChain Expression Language: https://python.langchain.com/docs/expression_language/.
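
The expression language composes steps with the `|` operator, and a dictionary of steps runs every branch on the same input. The toy classes below imitate that behaviour in plain Python to show the idea. They are an illustrative assumption and do not reflect how LangChain's runnables are implemented.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # Feed this step's output into the next step
        return Step(lambda value: other.invoke(self.invoke(value)))

def parallel(**branches):
    """Run every branch on the same input and collect results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})

toy_chain = (
    parallel(
        context=Step(lambda q: "ctx for " + q),  # plays the retriever's role
        question=Step(lambda q: q),              # plays the passthrough's role
    )
    | Step(lambda d: f"{d['context']} / {d['question']}")
)
toy_result = toy_chain.invoke("why?")
```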

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/retrieval.png)

In [None]:
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])
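
To see what the LLM actually receives, the sketch below fills the template text shown above using plain string formatting. `SimpleNamespace` objects stand in for LangChain documents, and the chunk texts and question are made up.

```python
from types import SimpleNamespace

# Template text copied from the prompt output above
TEMPLATE = (
    "You are an assistant for question-answering tasks. Use the following pieces of "
    "retrieved context to answer the question. If you don't know the answer, just say "
    "that you don't know. Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)

def format_docs(docs):
    """Join chunk texts with blank lines, as in the chain above."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for retrieved documents (hypothetical content)
fake_docs = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]
filled = TEMPLATE.format(question="What is a chunk?", context=format_docs(fake_docs))
```

Printing `filled` shows the retrieved chunks placed in the `Context:` slot and separated by blank lines, which is the only view of the transcripts the model gets.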

In [None]:
rag_chain.invoke("What are some good ways to increase motivation?")

'Some good ways to increase motivation include raising your heart rate through activities like sprinting or cycling, and then practicing calming the mind while in this heightened state of activation. Deep breathing exercises can also increase adrenaline and cortisol levels, which can help increase motivation.'

# 5 - Quoting sources

One of the advantages of RAG systems is that they allow us to quote the sources that were provided to the LLM to answer the question. We can use a modified chain to return the metadata belonging to each source.
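
Several retrieved chunks can come from the same episode, so the returned metadata may list one source more than once. Below is a minimal sketch for collapsing such a list into unique references. The helper name `unique_sources` is hypothetical, and it assumes each metadata dict has the `source` and `title` keys set when the documents were created.

```python
def unique_sources(metadatas):
    """Return (title, url) pairs, keeping the first occurrence of each URL."""
    seen = set()
    references = []
    for meta in metadatas:
        url = meta["source"]
        if url not in seen:
            seen.add(url)
            # Drop any byte-order mark left in the title
            title = meta["title"].replace("\ufeff", "").strip()
            references.append((title, url))
    return references

# Metadata in the shape returned by the chain below
sample_metadata = [
    {"source": "https://www.youtube.com/watch?v=ntfcfJ28eiU", "start_index": 61433,
     "title": "\ufeffTools for Managing Stress & Anxiety | Huberman Lab Podcast #10"},
    {"source": "https://www.youtube.com/watch?v=JPX8g8ibKFc", "start_index": 101836,
     "title": "Using Cortisol & Adrenaline to Boost Our Energy & Immune System Function | Huberman Lab Podcast #18"},
    {"source": "https://www.youtube.com/watch?v=JPX8g8ibKFc", "start_index": 33758,
     "title": "Using Cortisol & Adrenaline to Boost Our Energy & Immune System Function | Huberman Lab Podcast #18"},
]
references = unique_sources(sample_metadata)
```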

![picture](https://raw.githubusercontent.com/kyuz0/llm-chronicles/main/5.3%20-%20RAG/references.png)

In [None]:
from operator import itemgetter

from langchain.schema.runnable import RunnableMap

rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableMap(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

rag_chain_with_source.invoke("What are some good ways to increase motivation?")

{'documents': [{'source': 'https://www.youtube.com/watch?v=ntfcfJ28eiU',
   'start_index': 61433,
   'title': '\ufeffTools for Managing Stress & Anxiety | Huberman Lab Podcast #10'},
  {'source': 'https://www.youtube.com/watch?v=JPX8g8ibKFc',
   'start_index': 101836,
   'title': 'Using Cortisol & Adrenaline to Boost Our Energy & Immune System Function | Huberman Lab Podcast #18'},
  {'source': 'https://www.youtube.com/watch?v=xaE9XyMMAHY',
   'start_index': 43168,
   'title': '\ufeffSupercharge Exercise Performance & Recover with Cooling | Huberman Lab Podcast #19'},
  {'source': 'https://www.youtube.com/watch?v=JPX8g8ibKFc',
   'start_index': 33758,
   'title': 'Using Cortisol & Adrenaline to Boost Our Energy & Immune System Function | Huberman Lab Podcast #18'},
  {'source': 'https://www.youtube.com/watch?v=JPX8g8ibKFc',
   'start_index': 40360,
   'title': 'Using Cortisol & Adrenaline to Boost Our Energy & Immune System Function | Huberman Lab Podcast #18'}],
 'answer': 'Some good 