# Memory in LLM Applications — A Deep Dive

## Learning Objectives
By the end of this worksheet, you will:
1. **Understand why memory matters** — see firsthand that LLM API calls are stateless
2. **Implement short-term memory** — conversation history, sliding window, and summarization
3. **Work with embeddings** — generate vectors, compute similarity, build a mini vector store
4. **Build RAG from scratch** — chunk documents, embed, retrieve, and generate grounded answers

---

## Setup & Prerequisites

In [1]:
import sys
!{sys.executable} -m pip install --quiet google-genai google-adk chromadb numpy

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opentelemetry-exporter-otlp 1.29.0 requires opentelemetry-exporter-otlp-proto-grpc==1.29.0, but you have opentelemetry-exporter-otlp-proto-grpc 1.38.0 which is incompatible.
opentelemetry-exporter-otlp 1.29.0 requires opentelemetry-exporter-otlp-proto-http==1.29.0, but you have opentelemetry-exporter-otlp-proto-http 1.38.0 which is incompatible.

In [2]:
import os
import getpass

GOOGLE_API_KEY = getpass.getpass("Enter your Google/Gemini API key: ")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

print("API key configured successfully")

Enter your Google/Gemini API key:  ········


API key configured successfully


In [3]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

MODEL_ID = "gemini-2.0-flash"
EMBEDDING_MODEL = "gemini-embedding-001"

# Quick test
response = client.models.generate_content(
    model=MODEL_ID,
    contents="Say 'Memory workshop ready!' and nothing else."
)
print(response.text)

Memory workshop ready!



---

# Part 1: The Problem — Why Memory Matters (5 min)

Every call to an LLM API is **stateless**. The model has no memory of previous calls. Let's prove it.

## 1.1 Demo: A Forgetful Chatbot

We'll make 3 separate API calls and see if the model remembers anything between them.

In [4]:
def stateless_call(user_message):
    """Each call is independent — no memory of previous calls."""
    response = client.models.generate_content(
        model=MODEL_ID,
        contents=user_message
    )
    return response.text


# Call 1: Introduce yourself
print("--- Call 1 ---")
print("User: My name is Dr. Balamurali. I am the co-founder of AI Kyro.")
reply1 = stateless_call("My name is Dr. Balamurali. I am the co-founder of AI Kyro.")
print(f"LLM:  {reply1}")

print()

# Call 2: Ask the model your name
print("--- Call 2 ---")
print("User: What is my name?")
reply2 = stateless_call("What is my name?")
print(f"LLM:  {reply2}")

print()

# Call 3: Ask about your profession
print("--- Call 3 ---")
print("User: What do I teach?")
reply3 = stateless_call("What do I teach?")
print(f"LLM:  {reply3}")

--- Call 1 ---
User: My name is Dr. Balamurali. I am the co-founder of AI Kyro.
LLM:  Okay, Dr. Balamurali, it's nice to meet you. I understand you are the co-founder of AI Kyro. Is there anything specific you'd like to discuss or any information you'd like to share about AI Kyro? I'm ready to listen and assist in any way I can. For example, are you looking for help with:

*   **Generating content for AI Kyro?** (e.g., website copy, social media posts, articles)
*   **Exploring potential applications of AI Kyro's technology?**
*   **Brainstorming ideas for marketing or outreach?**
*   **Refining your elevator pitch?**
*   **Something else entirely?**

Just let me know!


--- Call 2 ---
User: What is my name?
LLM:  I am an AI language model, and I do not know your name. You have not told me your name.


--- Call 3 ---
User: What do I teach?
LLM:  To help me give you the best advice, tell me more about yourself! For example:

*   **What is your educational background and expertise?** (e.

### Key Insight

The LLM has **no idea** who you are in Calls 2 and 3. Each API call starts from a blank slate.

```
Call 1: "Hi, I'm Dr. Bala"      → "Nice to meet you!"
Call 2: "What is my name?"       → "I don't know your name"
Call 3: "What do I teach?"       → "I have no information about you"
```

**This is the fundamental problem.** To build any useful chatbot or agent, we need to give it memory ourselves.

---

# Part 2: Short-Term Memory — Conversation History (10 min)

The simplest form of memory: keep a running list of all messages and send it every time.

## 2.1 Manual Chat History

We maintain a `history` list and append both the user message and the model's reply each turn.

In [5]:
class ChatWithHistory:
    """A chatbot that remembers by sending the full conversation each time."""

    def __init__(self, system_instruction="You are a helpful assistant."):
        self.history = []  # This IS the memory
        self.system_instruction = system_instruction

    def chat(self, user_message):
        # Append the new user message to history
        self.history.append(
            types.Content(role="user", parts=[types.Part(text=user_message)])
        )

        # Send the ENTIRE history to the model
        response = client.models.generate_content(
            model=MODEL_ID,
            contents=self.history,
            config=types.GenerateContentConfig(
                system_instruction=self.system_instruction
            )
        )

        assistant_reply = response.text

        # Append the assistant's reply to history
        self.history.append(
            types.Content(role="model", parts=[types.Part(text=assistant_reply)])
        )

        return assistant_reply

    def show_history_size(self):
        total_chars = sum(
            len(part.text) for msg in self.history for part in msg.parts
        )
        print(f"Messages in history: {len(self.history)}")
        print(f"Total characters:    {total_chars:,}")
        print(f"Estimated tokens:    ~{total_chars // 4:,}")

In [6]:
bot = ChatWithHistory()

# Same 3 messages — but now with history!
print("--- Turn 1 ---")
print("User: My name is Dr. Bala. I teach AI courses.")
print(f"LLM:  {bot.chat('My name is Dr. Bala. I teach AI courses.')}")

print()
print("--- Turn 2 ---")
print("User: What is my name?")
print(f"LLM:  {bot.chat('What is my name?')}")

print()
print("--- Turn 3 ---")
print("User: What do I teach?")
print(f"LLM:  {bot.chat('What do I teach?')}")

print()
bot.show_history_size()

--- Turn 1 ---
User: My name is Dr. Bala. I teach AI courses.
LLM:  Okay, Dr. Bala. It's a pleasure to meet you. As an AI assistant, I can definitely appreciate your work in teaching AI! How can I help you today? Perhaps you need help with:

*   **Brainstorming lecture ideas?**
*   **Finding resources for your students?**
*   **Explaining a complex AI concept in a simple way?**
*   **Developing assignments or projects?**
*   **Or something else entirely?**

Just let me know what you're working on.


--- Turn 2 ---
User: What is my name?
LLM:  Your name is Dr. Bala, as you mentioned earlier.


--- Turn 3 ---
User: What do I teach?
LLM:  You teach AI courses.


Messages in history: 6
Total characters:    578
Estimated tokens:    ~144


### It remembers! But there's a catch...

Every turn, we send **everything** back to the model. Let's see what happens with a long conversation.

In [7]:
# Simulate a long conversation to show context window filling up
long_bot = ChatWithHistory(system_instruction="You are a travel advisor. Keep answers brief (1-2 sentences).")

travel_questions = [
    "I'm planning a trip to Japan in April.",
    "What's the best city to see cherry blossoms?",
    "How many days should I spend in Tokyo?",
    "What about Kyoto? Is it worth visiting?",
    "Should I get a Japan Rail Pass?",
    "What's the food I must try?",
    "Is it expensive compared to Europe?",
    "Any tips for navigating the subway?",
    "What about visiting Mount Fuji?",
    "Should I learn some Japanese phrases?",
    "What souvenirs should I bring back?",
    "How's the weather in April?",
    "Any day trips from Tokyo you recommend?",
    "What about Osaka? Is it different from Tokyo?",
    "Should I book hotels in advance?",
]

for i, question in enumerate(travel_questions, 1):
    reply = long_bot.chat(question)
    if i % 5 == 0:
        print(f"After {i} turns:")
        long_bot.show_history_size()
        print()

After 5 turns:
Messages in history: 10
Total characters:    739
Estimated tokens:    ~184

After 10 turns:
Messages in history: 20
Total characters:    1,411
Estimated tokens:    ~352

After 15 turns:
Messages in history: 30
Total characters:    2,075
Estimated tokens:    ~518



Notice how the token count keeps growing. For Gemini Flash, the context window is ~1M tokens, but:
- **Cost increases** with every turn (you pay per token sent)
- **Latency increases** as the model processes more text
- **Eventually**, even large context windows fill up

We need smarter strategies.
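
The cost point can be made concrete with a little arithmetic. The sketch below assumes every message is the same size (a hypothetical `tokens_per_message=17`, roughly the ~518 estimated tokens spread over 30 messages in the run above) and ignores the system instruction, so treat the numbers as illustrative only. On turn t the full history holds 2t - 1 messages, so n turns send n² messages' worth of tokens in total: per-call cost grows linearly, cumulative cost grows quadratically.

```python
def cumulative_tokens_sent(turns, tokens_per_message=17):
    """Estimate total input tokens over `turns` turns when the full history is resent each call.

    tokens_per_message is an illustrative assumption, not a measured constant.
    """
    total = 0
    messages_in_history = 0
    for _ in range(turns):
        messages_in_history += 1                            # new user message is appended
        total += messages_in_history * tokens_per_message   # the whole history is sent
        messages_in_history += 1                            # the model reply is stored too
    return total


for n in (15, 100):
    print(f"{n:>3} turns -> ~{cumulative_tokens_sent(n):,} input tokens in total")
```

Under these assumptions, 15 turns send about 3,825 input tokens in total and 100 turns about 170,000.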

## 2.2 Sliding Window Memory

Keep only the last **N** messages. Old messages are dropped.

In [8]:
class SlidingWindowChat:
    """Sends only the last N messages (not pairs) to the model."""

    def __init__(self, window_size=6, system_instruction="You are a helpful assistant."):
        self.full_history = []     # Everything (for comparison)
        self.window_size = window_size  # Max messages to keep
        self.system_instruction = system_instruction

    def chat(self, user_message):
        # Append to full history (for tracking)
        self.full_history.append(
            types.Content(role="user", parts=[types.Part(text=user_message)])
        )

        # Use only the sliding window
        window = self.full_history[-self.window_size :]

        response = client.models.generate_content(
            model=MODEL_ID,
            contents=window,
            config=types.GenerateContentConfig(
                system_instruction=self.system_instruction
            )
        )

        assistant_reply = response.text

        self.full_history.append(
            types.Content(role="model", parts=[types.Part(text=assistant_reply)])
        )

        return assistant_reply

    def show_stats(self):
        window = self.full_history[-self.window_size :]
        window_chars = sum(len(p.text) for m in window for p in m.parts)
        total_chars = sum(len(p.text) for m in self.full_history for p in m.parts)
        print(f"Total messages:    {len(self.full_history)}")
        print(f"Window messages:   {len(window)}")
        print(f"Window tokens:     ~{window_chars // 4:,}")
        print(f"Saved tokens:      ~{(total_chars - window_chars) // 4:,}")

In [9]:
# Demo: sliding window forgets early messages
sw_bot = SlidingWindowChat(window_size=6)  # Only last 6 messages (3 turns)

# Turn 1: Give it information
print("--- Turn 1 ---")
print(f"LLM: {sw_bot.chat('My name is Dr. Bala. I live in Chennai.')}")

# Turns 2-5: Fill the window with other topics
filler = [
    "What is the capital of France?",
    "Tell me about photosynthesis in one sentence.",
    "What year was Python created?",
    "Name three programming paradigms.",
]
for q in filler:
    sw_bot.chat(q)
print(f"\n(Added {len(filler)} filler turns)\n")

# Turn 6: Can it still remember?
print("--- Turn 6 ---")
print("User: What is my name and where do I live?")
print(f"LLM:  {sw_bot.chat('What is my name and where do I live?')}")

print()
sw_bot.show_stats()

--- Turn 1 ---
LLM: Okay, Dr. Bala from Chennai. How can I help you today?


(Added 4 filler turns)

--- Turn 6 ---
User: What is my name and where do I live?
LLM:  As a large language model, I have no way of knowing your name or where you live. I have no memory of past conversations and no access to your personal information. You would need to tell me that information directly.


Total messages:    12
Window messages:   6
Window tokens:     ~120
Saved tokens:      ~93


The sliding window **forgets** your name because that message fell outside the window. Token usage per call stays bounded, but important context is lost.
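
To see exactly which messages survive, here is a minimal sketch that rebuilds the shape of the demo with plain `(role, text)` tuples instead of `types.Content`. `select_window` is a hypothetical helper, and its extra trimming step (dropping leading model replies so the window starts on a user turn) is an optional refinement that `SlidingWindowChat` does not perform.

```python
def select_window(history, window_size):
    """Return the last `window_size` messages, trimmed to start on a user turn."""
    window = history[-window_size:]
    # Optional refinement: skip leading model replies whose question was cut off.
    while window and window[0][0] != "user":
        window = window[1:]
    return window


# Rebuild the demo: 5 completed turns, then the 6th user question.
history = [("user", "My name is Dr. Bala. I live in Chennai."), ("model", "reply 1")]
for turn in range(2, 6):
    history.append(("user", f"filler question {turn}"))
    history.append(("model", f"reply {turn}"))
history.append(("user", "What is my name and where do I live?"))

window = select_window(history, window_size=6)
for role, text in window:
    print(f"{role:>5}: {text}")
print("Name message still visible:", history[0] in window)
```

The raw slice of 6 starts with the reply to filler question 3, so the trimmed window holds 5 messages starting at filler question 4. The introduction from turn 1 is not among them.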

Can we keep the best of both worlds?

## 2.3 Summarization Strategy

Use the LLM to **summarize** older messages into a compact paragraph. Keep:
- A summary of everything old
- The full recent messages

This preserves key facts while saving tokens.
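
Summarization is not free: each compaction is an extra LLM call. To see how often it fires, the sketch below simulates only the length of the history under the rule used by the `SummarizingChat` class in this section (summarize once the history exceeds twice `recent_window`, then keep the last `recent_window` messages). `compaction_schedule` is a hypothetical helper for illustration and makes no API calls.

```python
def compaction_schedule(turns, recent_window=6):
    """Simulate history length only and report (turn, messages_summarized) pairs."""
    history_len = 0
    events = []
    for turn in range(1, turns + 1):
        history_len += 1                       # user message appended
        if history_len > recent_window * 2:    # same trigger as SummarizingChat.chat
            events.append((turn, history_len - recent_window))
            history_len = recent_window        # only the recent tail is kept
        history_len += 1                       # model reply appended
    return events


print(compaction_schedule(15, recent_window=6))
# [(7, 7), (11, 8), (15, 8)]
```

With `recent_window=6`, the first compaction happens on turn 7 and then roughly every four turns after that.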

In [10]:
class SummarizingChat:
    """Summarizes old messages to save tokens while preserving key facts."""

    def __init__(self, recent_window=6, system_instruction="You are a helpful assistant."):
        self.history = []
        self.summary = ""  # Running summary of older messages
        self.recent_window = recent_window
        self.system_instruction = system_instruction
        self.total_messages_seen = 0

    def _summarize_messages(self, messages):
        """Ask the LLM to summarize a block of messages."""
        conversation_text = ""
        for msg in messages:
            role = "User" if msg.role == "user" else "Assistant"
            conversation_text += f"{role}: {msg.parts[0].text}\n"

        prompt = (
            "Summarize the following conversation into a concise paragraph. "
            "Preserve all key facts (names, preferences, decisions, numbers). "
            "Write it as a third-person summary.\n\n"
            f"Previous summary: {self.summary}\n\n"
            f"New conversation:\n{conversation_text}"
        )

        response = client.models.generate_content(model=MODEL_ID, contents=prompt)
        return response.text

    def chat(self, user_message):
        self.total_messages_seen += 1

        self.history.append(
            types.Content(role="user", parts=[types.Part(text=user_message)])
        )

        # If history is too long, summarize the older portion
        if len(self.history) > self.recent_window * 2:
            older_messages = self.history[: -self.recent_window]
            self.summary = self._summarize_messages(older_messages)
            self.history = self.history[-self.recent_window :]
            print(f"  [Summarized {len(older_messages)} older messages]")

        # Build the prompt: summary + recent messages
        contents = []
        if self.summary:
            contents.append(
                types.Content(
                    role="user",
                    parts=[types.Part(text=f"[Conversation summary so far: {self.summary}]")]
                )
            )
            contents.append(
                types.Content(
                    role="model",
                    parts=[types.Part(text="Understood, I'll keep this context in mind.")]
                )
            )
        contents.extend(self.history)

        response = client.models.generate_content(
            model=MODEL_ID,
            contents=contents,
            config=types.GenerateContentConfig(
                system_instruction=self.system_instruction
            )
        )

        assistant_reply = response.text

        self.history.append(
            types.Content(role="model", parts=[types.Part(text=assistant_reply)])
        )

        return assistant_reply

    def show_stats(self):
        history_chars = sum(len(p.text) for m in self.history for p in m.parts)
        summary_chars = len(self.summary)
        print(f"Total turns seen:    {self.total_messages_seen}")
        print(f"Messages in memory:  {len(self.history)}")
        print(f"Summary length:      {summary_chars} chars (~{summary_chars // 4} tokens)")
        print(f"Recent history:      {history_chars} chars (~{history_chars // 4} tokens)")
        print(f"Total sent per call: ~{(history_chars + summary_chars) // 4} tokens")

In [11]:
# Demo: Summarizing chat preserves facts across many turns
sum_bot = SummarizingChat(recent_window=6)

# Give it lots of facts across many turns
facts_and_questions = [
    "My name is Dr. Bala. I live in Chennai and teach AI courses.",
    "My favorite programming language is Python.",
    "I have 2 cats named Pixel and Byte.",
    "I prefer dark mode in all my editors.",
    "My birthday is March 15th.",
    "I'm currently researching multi-agent systems.",
    "What is the speed of light?",
    "Explain gradient descent in one sentence.",
    "What is a transformer architecture?",
    "Name the planets in our solar system.",
    "What is the difference between AI and ML?",
    "Tell me about attention mechanisms.",
]

for i, msg in enumerate(facts_and_questions, 1):
    reply = sum_bot.chat(msg)
    if i <= 6 or i == len(facts_and_questions):
        print(f"Turn {i}: {msg[:50]}...")
        print(f"  → {reply[:100]}...\n")

print("=" * 60)
sum_bot.show_stats()

Turn 1: My name is Dr. Bala. I live in Chennai and teach A...
  → Okay, Dr. Bala. It's a pleasure to meet you (virtually)! So you're Dr. Bala, an AI instructor based ...

Turn 2: My favorite programming language is Python....
  → That's great! Python is a fantastic choice, especially for AI. It's known for its readability, exten...

Turn 3: I have 2 cats named Pixel and Byte....
  → Pixel and Byte! Those are absolutely perfect names for a programmer's cats, especially an AI instruc...

Turn 4: I prefer dark mode in all my editors....
  → Ah, a person of culture! Dark mode is definitely the way to go. It's much easier on the eyes, especi...

Turn 5: My birthday is March 15th....
  → Okay, Dr. Bala! Noted. March 15th. I'll try my best to remember that. If you'd like, closer to the d...

Turn 6: I'm currently researching multi-agent systems....
  → Multi-agent systems are a fascinating and rapidly developing area of AI! That's a great topic to be ...

  [Summarized 7 older messages]
  [Su

In [12]:
# Now test: does it still remember early facts?
print("--- Testing Memory After Summarization ---\n")

print("Q: What is my name?")
print(f"A: {sum_bot.chat('What is my name?')}\n")

print("Q: What are my cats' names?")
print(f"A: {sum_bot.chat('What are my cats names?')}\n")

print("Q: What am I researching?")
print(f"A: {sum_bot.chat('What am I currently researching?')}\n")

print("\n--- Current Summary ---")
print(sum_bot.summary)

--- Testing Memory After Summarization ---

Q: What is my name?
A: I don't have a name in the way humans do. I am a large language model, an AI assistant. You can call me assistant if you like!


Q: What are my cats' names?
A: You haven't told me your cats' names. However, I remember that Dr. Bala's cats are named Pixel and Byte. Since you are not Dr. Bala, I do not know your cats' names.


Q: What am I researching?
  [Summarized 8 older messages]
A: You are currently researching multi-agent systems.



--- Current Summary ---
Dr. Bala, the AI instructor from Chennai who teaches AI courses and prefers dark mode with a birthday of March 15th, is researching multi-agent systems and knows the speed of light is approximately 299,792,458 meters per second. After requesting a one-sentence explanation of gradient descent, Dr. Bala, who still uses Python as their favorite programming language and owns two cats named Pixel and Byte, inquired about transformer architecture, the planets in our so

### Comparison: Full History vs Sliding Window vs Summarization

| Strategy | Tokens/call | Remembers early facts? | Cost at 100 turns |
|----------|------------|----------------------|-------------------|
| Full History | Grows linearly | Yes | High |
| Sliding Window | Fixed (small) | No | Low |
| Summarization | Fixed (medium) | Mostly (compressed, can lose detail) | Medium |
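
In the test run above, the bot answered "What is my name?" as if the question were about itself and treated Dr. Bala as a third party. One plausible explanation (an assumption, not something verified here) is that the third-person summary never says that Dr. Bala is the user. Below is a minimal sketch of a more explicit framing. `build_summary_preamble` is a hypothetical helper, and whether it changes the model's behavior would need to be tested.

```python
def build_summary_preamble(summary, user_label="the user"):
    """Wrap a third-person summary so it is explicit whose facts it contains."""
    if not summary.strip():
        return ""  # nothing to inject before the first summarization
    return (
        "[Conversation summary so far. The person described below is "
        f"{user_label}, who is the one writing these messages: {summary.strip()}]"
    )


print(build_summary_preamble("Dr. Bala lives in Chennai and has two cats, Pixel and Byte."))
```

The returned string could take the place of the `[Conversation summary so far: ...]` text that `SummarizingChat.chat` sends as the first message.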

---

## 2.4 Bonus: ADK Session Memory

Google's Agent Development Kit (ADK) provides built-in session management. Let's see how it handles memory automatically.

In [None]:
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types as genai_types


# Create a simple agent
memory_agent = Agent(
    name="memory_bot",
    model=MODEL_ID,
    instruction="You are a friendly assistant. Remember everything the user tells you.",
)

# ADK's session service manages conversation history automatically
session_service = InMemorySessionService()

runner = Runner(
    agent=memory_agent,
    app_name="memory_demo",
    session_service=session_service,
)

# Create a session — this is where memory lives
session = await session_service.create_session(
    app_name="memory_demo",
    user_id="dr_bala"
)

print(f"Session created: {session.id}")
print("User ID: dr_bala")

In [None]:
async def chat_with_agent(runner, session, user_message, user_id="dr_bala"):
    """Send a message to the ADK agent and get a response."""
    user_content = genai_types.Content(
        role="user", parts=[genai_types.Part(text=user_message)]
    )

    final_response = ""
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session.id,
        new_message=user_content,
    ):
        if event.content and event.content.parts:
            for part in event.content.parts:
                if part.text:
                    final_response += part.text

    return final_response


# Test ADK memory across turns
messages = [
    "My name is Dr. Bala. I teach AI at a university in Chennai.",
    "My favorite framework is Google ADK.",
    "What is my name and what do I teach?",
    "What is my favorite framework?",
]

for msg in messages:
    reply = await chat_with_agent(runner, session, msg)
    print(f"User: {msg}")
    print(f"Agent: {reply}")
    print()

In [None]:
# Inspect the session state — ADK stores the full conversation
updated_session = await session_service.get_session(
    app_name="memory_demo",
    user_id="dr_bala",
    session_id=session.id
)

print(f"Events in session: {len(updated_session.events)}")
print("\nSession stores the full conversation history automatically.")
print("ADK's InMemorySessionService = our ChatWithHistory, but built-in!")

### Key Takeaway

ADK's `InMemorySessionService` handles short-term memory automatically. It stores conversation history in the session and sends it with every call — exactly what we built manually in section 2.1.

But for **long-term memory** (remembering facts across sessions), we need something more powerful: **embeddings and vector stores**.

---

# Part 3: Embeddings — The Foundation of Long-Term Memory (15 min)

Embeddings convert text into numerical vectors that capture **meaning**. Similar texts have similar vectors.

## 3.1 What Are Embeddings?

Let's generate embeddings for a few sentences and look at what we get.

In [13]:
import numpy as np


def get_embedding(text):
    """Get an embedding vector for a piece of text using Gemini."""
    response = client.models.embed_content(
        model=EMBEDDING_MODEL,
        contents=text
    )
    return response.embeddings[0].values


# Embed a few sentences
sentences = [
    "I love dogs",
    "I adore puppies",
    "The stock market crashed today",
    "Machine learning is fascinating",
    "Neural networks can learn patterns",
    "The weather is sunny and warm",
]

embeddings = {}
for sentence in sentences:
    emb = get_embedding(sentence)
    embeddings[sentence] = emb
    print(f"\n'{sentence}'")
    print(f"  Vector length: {len(emb)}")
    print(f"  First 5 values: {[round(v, 4) for v in emb[:5]]}")
    print(f"  ...it's just a list of {len(emb)} numbers!")


'I love dogs'
  Vector length: 3072
  First 5 values: [-0.0145, 0.0165, 0.0292, -0.0666, 0.0096]
  ...it's just a list of 3072 numbers!

'I adore puppies'
  Vector length: 3072
  First 5 values: [-0.0257, -0.0105, 0.0354, -0.0529, -0.0145]
  ...it's just a list of 3072 numbers!

'The stock market crashed today'
  Vector length: 3072
  First 5 values: [0.0105, 0.015, -0.0145, -0.0712, -0.0025]
  ...it's just a list of 3072 numbers!

'Machine learning is fascinating'
  Vector length: 3072
  First 5 values: [-0.0132, 0.0286, 0.0133, -0.0755, -0.0168]
  ...it's just a list of 3072 numbers!

'Neural networks can learn patterns'
  Vector length: 3072
  First 5 values: [-0.0069, 0.0201, 0.009, -0.0352, -0.0229]
  ...it's just a list of 3072 numbers!

'The weather is sunny and warm'
  Vector length: 3072
  First 5 values: [-0.0092, -0.003, -0.0123, -0.0604, -0.0209]
  ...it's just a list of 3072 numbers!


### What just happened?

Each sentence was converted into a **vector** — a list of 3072 floating-point numbers. These numbers encode the **meaning** of the text in a way that allows mathematical comparison.

## 3.2 Similarity Search

If embeddings capture meaning, then sentences with similar meanings should have similar vectors. Let's verify with **cosine similarity**.
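
For reference, the quantity computed in the next cell is the standard definition of cosine similarity, shown here with a small worked example. The 2-D vectors are made up for illustration; real embeddings here have 3072 dimensions.

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
\]

% Worked example with a = (1, 0) and b = (1, 1):
\[
\mathbf{a} \cdot \mathbf{b} = 1, \qquad
\lVert \mathbf{a} \rVert = 1, \qquad
\lVert \mathbf{b} \rVert = \sqrt{2}, \qquad
\cos(\mathbf{a}, \mathbf{b}) = \frac{1}{\sqrt{2}} \approx 0.707
\]
```

The value always lies between -1 and 1, and it depends only on the angle between the vectors, not on their lengths.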

In [14]:
def cosine_similarity(vec_a, vec_b):
    """Compute cosine similarity between two vectors."""
    a = np.array(vec_a)
    b = np.array(vec_b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Compare all pairs
print("Cosine Similarity Matrix")
print("=" * 80)

# Short labels for display
labels = [s[:30] for s in sentences]

# Print header
print(f"{'':>32}", end="")
for i in range(len(sentences)):
    print(f"  [{i}]", end="")
print()

for i, s1 in enumerate(sentences):
    print(f"[{i}] {labels[i]:>28}", end="")
    for j, s2 in enumerate(sentences):
        sim = cosine_similarity(embeddings[s1], embeddings[s2])
        print(f" {sim:.2f}", end="")
    print()

Cosine Similarity Matrix
                                  [0]  [1]  [2]  [3]  [4]  [5]
[0]                  I love dogs 1.00 0.81 0.54 0.60 0.52 0.54
[1]              I adore puppies 0.81 1.00 0.53 0.61 0.53 0.57
[2] The stock market crashed today 0.54 0.53 1.00 0.54 0.53 0.55
[3] Machine learning is fascinatin 0.60 0.61 0.54 1.00 0.66 0.58
[4] Neural networks can learn patt 0.52 0.53 0.53 0.66 1.00 0.55
[5] The weather is sunny and warm 0.54 0.57 0.55 0.58 0.55 1.00


In [15]:
# Highlight the interesting pairs
print("\nInteresting Pairs:")
print("-" * 60)

pairs = [
    ("I love dogs", "I adore puppies", "Similar meaning"),
    ("I love dogs", "The stock market crashed today", "Different topics"),
    ("Machine learning is fascinating", "Neural networks can learn patterns", "Related ML topics"),
    ("The stock market crashed today", "The weather is sunny and warm", "Unrelated topics"),
]

for s1, s2, label in pairs:
    sim = cosine_similarity(embeddings[s1], embeddings[s2])
    bar = "█" * int(sim * 30)
    print(f"\n  {label}:")
    print(f"    '{s1}' vs '{s2}'")
    print(f"    Similarity: {sim:.4f}  {bar}")


Interesting Pairs:
------------------------------------------------------------

  Similar meaning:
    'I love dogs' vs 'I adore puppies'
    Similarity: 0.8146  ████████████████████████

  Different topics:
    'I love dogs' vs 'The stock market crashed today'
    Similarity: 0.5353  ████████████████

  Related ML topics:
    'Machine learning is fascinating' vs 'Neural networks can learn patterns'
    Similarity: 0.6631  ███████████████████

  Unrelated topics:
    'The stock market crashed today' vs 'The weather is sunny and warm'
    Similarity: 0.5484  ████████████████


### Key Insight

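- "I love dogs" and "I adore puppies" have **high** similarity (~0.81) even though the only word they share is "I"!
- "I love dogs" and "The stock market crashed today" score noticeably **lower** (~0.54) because they are about completely different things. With this model even unrelated sentences land around 0.5 to 0.6, so compare scores against each other rather than against zero.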

Embeddings capture **semantic meaning**, not just word overlap.

## 3.3 Build a Mini Vector Store

Now let's build a simple **vector store**: a searchable collection of facts stored as embeddings. This is the retrieval half of RAG.

In [17]:
class MiniVectorStore:
    """A simple in-memory vector store. This is what ChromaDB/Pinecone/Weaviate do at scale."""

    def __init__(self):
        self.documents = []  # List of (text, embedding) tuples

    def add(self, text):
        """Add a document to the store."""
        embedding = get_embedding(text)
        self.documents.append((text, embedding))

    def search(self, query, top_k=3):
        """Find the most similar documents to a query."""
        query_embedding = get_embedding(query)

        # Compute similarity with every document
        results = []
        for text, doc_embedding in self.documents:
            sim = cosine_similarity(query_embedding, doc_embedding)
            results.append((sim, text))

        # Sort by similarity (highest first)
        results.sort(reverse=True)
        return results[:top_k]

    def __len__(self):
        return len(self.documents)

In [18]:
# Build a knowledge base about a user
store = MiniVectorStore()

user_facts = [
    "Dr. Bala is a professor who teaches AI and Machine Learning.",
    "Dr. Bala lives in Chennai, India.",
    "Dr. Bala's favorite language is Python.",
    "Dr. Bala prefers dark mode in all applications.",
    "Dr. Bala has two cats named Pixel and Byte.",
    "Dr. Bala is researching multi-agent systems and autonomous AI.",
    "Dr. Bala enjoys biryani and filter coffee.",
    "Dr. Bala uses Google Colab and VS Code for development.",
    "Dr. Bala's birthday is on March 15th.",
    "Dr. Bala prefers concise explanations with code examples.",
]

for fact in user_facts:
    store.add(fact)

print(f"Vector store loaded with {len(store)} facts.")

Vector store loaded with 10 facts.


In [19]:
# Query the vector store
queries = [
    "What does the user prefer?",
    "Tell me about pets",
    "What food does he like?",
    "What is his job?",
    "What tools does he use?",
]

for query in queries:
    print(f"\nQuery: '{query}'")
    results = store.search(query, top_k=3)
    for rank, (score, text) in enumerate(results, 1):
        print(f"  {rank}. [{score:.4f}] {text}")


Query: 'What does the user prefer?'
  1. [0.6632] Dr. Bala prefers dark mode in all applications.
  2. [0.6214] Dr. Bala prefers concise explanations with code examples.
  3. [0.6026] Dr. Bala enjoys biryani and filter coffee.

Query: 'Tell me about pets'
  1. [0.6515] Dr. Bala has two cats named Pixel and Byte.
  2. [0.5646] Dr. Bala's favorite language is Python.
  3. [0.5630] Dr. Bala prefers concise explanations with code examples.

Query: 'What food does he like?'
  1. [0.6707] Dr. Bala enjoys biryani and filter coffee.
  2. [0.6024] Dr. Bala's favorite language is Python.
  3. [0.5760] Dr. Bala prefers concise explanations with code examples.

Query: 'What is his job?'
  1. [0.6070] Dr. Bala prefers concise explanations with code examples.
  2. [0.5942] Dr. Bala's birthday is on March 15th.
  3. [0.5932] Dr. Bala's favorite language is Python.

Query: 'What tools does he use?'
  1. [0.6315] Dr. Bala uses Google Colab and VS Code for development.
  2. [0.6151] Dr. Bala prefers co

### This is RAG at its core!

What we just built:
1. **Store** facts as embeddings
2. **Query** with natural language
3. **Retrieve** the most relevant facts

The only step missing: pass the retrieved facts to an LLM to generate a grounded answer.
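
That last step can be sketched without calling any API: take `(score, text)` pairs in the shape `MiniVectorStore.search` returns and fold them into a prompt. `build_rag_prompt`, the `min_score` cutoff, and the prompt wording are illustrative assumptions. The resulting string would then be passed as `contents` to `client.models.generate_content`.

```python
def build_rag_prompt(question, results, min_score=0.0):
    """Build a grounded prompt from (score, text) retrieval results."""
    # Keep only facts at or above the (hypothetical) similarity cutoff.
    facts = [text for score, text in results if score >= min_score]
    context = "\n".join(f"- {fact}" for fact in facts) or "- (no relevant facts found)"
    return (
        "Answer the question using only the facts below. "
        "If the facts do not contain the answer, say so.\n\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )


# Scores and texts copied from the 'Tell me about pets' query above.
results = [
    (0.6515, "Dr. Bala has two cats named Pixel and Byte."),
    (0.5646, "Dr. Bala's favorite language is Python."),
    (0.5630, "Dr. Bala prefers concise explanations with code examples."),
]
print(build_rag_prompt("What pets does Dr. Bala have?", results, min_score=0.6))
```

A cutoff like `min_score=0.6` keeps the weakly related second and third hits out of the prompt, though a sensible threshold depends on the embedding model and would need tuning.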

---

# Part 4: RAG — Retrieval-Augmented Generation (15 min)

RAG = **Retrieve** relevant context + **Augment** the prompt + **Generate** a grounded answer.

Let's build it from scratch with a real document.

## 4.1 Build RAG from Scratch

We'll use a sample course syllabus as our document, chunk it, embed it, store in ChromaDB, and query it.

In [20]:
# Step 1: Create a sample document (a course syllabus)
SYLLABUS = """
CS 601: Advanced Artificial Intelligence — Spring 2025
Instructor: Dr. Bala | Office: Room 302, CS Building | Office Hours: Mon/Wed 2-4 PM

Course Description:
This graduate-level course covers advanced topics in AI including deep learning architectures,
natural language processing, computer vision, reinforcement learning, and multi-agent systems.
Students will gain hands-on experience building AI systems using modern frameworks.

Prerequisites:
- CS 501: Introduction to Machine Learning (or equivalent)
- Strong programming skills in Python
- Linear algebra and probability theory

Grading Policy:
- Assignments (4 total): 40%
- Midterm Exam: 20%
- Final Project: 30%
- Class Participation: 10%

Assignment Policy:
Late submissions receive a 10% penalty per day, up to 3 days. After 3 days, no submissions
are accepted. One assignment may be dropped (the lowest score). Group work is allowed for
assignments in teams of up to 3 students.

Week-by-Week Schedule:
Week 1-2: Review of ML fundamentals, gradient descent, backpropagation
Week 3-4: Deep learning architectures — CNNs, RNNs, LSTMs
Week 5-6: Transformer architecture and attention mechanisms
Week 7: Midterm Exam
Week 8-9: Large Language Models — GPT, BERT, fine-tuning
Week 10-11: Prompt engineering and in-context learning
Week 12-13: AI agents, tool use, and multi-agent systems
Week 14-15: Final project presentations

Required Textbooks:
- "Deep Learning" by Goodfellow, Bengio, and Courville
- "Speech and Language Processing" by Jurafsky and Martin (3rd edition, online)

Final Project:
Students must propose and implement an AI system that solves a real-world problem.
Projects can be individual or in teams of up to 3. A 10-page report and a 15-minute
presentation are required. The project proposal is due by Week 8.

Academic Integrity:
All work must be original. Use of AI tools (ChatGPT, Copilot) is allowed for learning
but all AI-generated code must be clearly attributed. Plagiarism will result in a failing
grade for the course.

Tools and Platforms:
- Python 3.10+
- PyTorch or TensorFlow
- Google Colab (free GPU access)
- Hugging Face Transformers library
- Weights & Biases for experiment tracking
"""

print(f"Document length: {len(SYLLABUS)} characters, ~{len(SYLLABUS) // 4} tokens")

Document length: 2185 characters, ~546 tokens


In [21]:
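# Step 2: Group the document's paragraphs into chunks

def chunk_document(text, chunk_size=300, overlap=50):
    """Group paragraphs into chunks of roughly chunk_size characters.

    Note: `overlap` is currently unused. Chunks break on paragraph
    boundaries and do not overlap. A single paragraph longer than
    chunk_size becomes its own oversized chunk.
    """
    # Split by double newlines (paragraphs)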
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) < chunk_size:
            current_chunk += "\n" + para if current_chunk else para
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = para

    if current_chunk:
        chunks.append(current_chunk)

    return chunks


chunks = chunk_document(SYLLABUS)

print(f"Document split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i + 1} ({len(chunk)} chars) ---")
    print(chunk[:120] + "...")
    print()

Document split into 9 chunks:

--- Chunk 1 (138 chars) ---
CS 601: Advanced Artificial Intelligence — Spring 2025
Instructor: Dr. Bala | Office: Room 302, CS Building | Office Hou...

--- Chunk 2 (293 chars) ---
Course Description:
This graduate-level course covers advanced topics in AI including deep learning architectures,
natur...

--- Chunk 3 (264 chars) ---
Prerequisites:
- CS 501: Introduction to Machine Learning (or equivalent)
- Strong programming skills in Python
- Linear...

--- Chunk 4 (241 chars) ---
Assignment Policy:
Late submissions receive a 10% penalty per day, up to 3 days. After 3 days, no submissions
are accept...

--- Chunk 5 (441 chars) ---
Week-by-Week Schedule:
Week 1-2: Review of ML fundamentals, gradient descent, backpropagation
Week 3-4: Deep learning ar...

--- Chunk 6 (154 chars) ---
Required Textbooks:
- "Deep Learning" by Goodfellow, Bengio, and Courville
- "Speech and Language Processing" by Jurafsk...

--- Chunk 7 (247 chars) ---
Final Project:
Students
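
The `overlap` parameter of `chunk_document` is not used by the paragraph-based splitter above. One common alternative, sketched here as an assumption rather than the notebook's method, is a fixed-size character window in which each chunk repeats the last `overlap` characters of the previous one, so text cut at a boundary also appears at the start of the next chunk. `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(text, chunk_size=300, overlap=50):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 700-character string so the window arithmetic is easy to check
sample = "".join(str(i % 10) for i in range(700))
overlap_chunks = chunk_with_overlap(sample, chunk_size=300, overlap=50)
print([len(c) for c in overlap_chunks])
```

Character windows ignore paragraph boundaries, so in practice the two ideas are often combined: split on paragraphs first, then carry a short tail of the previous chunk into the next one.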

In [22]:
# Step 3: Store chunks in ChromaDB
import chromadb

chroma_client = chromadb.Client()  # In-memory (ephemeral)

# get_or_create_collection keeps this cell safe to re-run in the same session
collection = chroma_client.get_or_create_collection(
    name="syllabus",
    metadata={"description": "CS 601 course syllabus"}
)

# Embed and store each chunk (upsert overwrites an existing id instead of duplicating it)
for i, chunk in enumerate(chunks):
    embedding = get_embedding(chunk)
    collection.upsert(
        ids=[f"chunk_{i}"],
        embeddings=[embedding],
        documents=[chunk],
        metadatas=[{"chunk_index": i}]
    )

print(f"Stored {collection.count()} chunks in ChromaDB.")

Stored 9 chunks in ChromaDB.


In [None]:
# Step 4: Build the RAG pipeline

def rag_query(question, top_k=3):
    """Full RAG pipeline: embed query → retrieve chunks → generate answer."""

    # 1. Embed the question
    query_embedding = get_embedding(question)

    # 2. Retrieve top-k relevant chunks from ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    retrieved_chunks = results["documents"][0]
    distances = results["distances"][0]  # lower distance = closer match

    # 3. Build the augmented prompt
    context = "\n\n---\n\n".join(retrieved_chunks)

    prompt = (
        f"Answer the following question based ONLY on the provided context. "
        f"If the answer is not in the context, say 'This information is not in the syllabus.'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        f"Answer:"
    )

    # 4. Generate the answer
    response = client.models.generate_content(model=MODEL_ID, contents=prompt)

    return {
        "answer": response.text,
        "sources": retrieved_chunks,
        "distances": distances
    }


# Test the RAG pipeline
questions = [
    "What is the late submission policy?",
    "When is the midterm exam?",
    "Can I use ChatGPT for assignments?",
    "What textbooks are required?",
    "How much is the final project worth?",
]

for q in questions:
    result = rag_query(q)
    print(f"Q: {q}")
    print(f"A: {result['answer']}")
    print(f"   (Retrieved {len(result['sources'])} chunks)")
    print()

Q: What is the late submission policy?
A: Late submissions receive a 10% penalty per day, up to 3 days. After 3 days, no submissions are accepted.

   (Retrieved 3 chunks)

Q: When is the midterm exam?
A: Week 7

   (Retrieved 3 chunks)

Q: Can I use ChatGPT for assignments?
A: Use of AI tools (ChatGPT, Copilot) is allowed for learning but all AI-generated code must be clearly attributed.

   (Retrieved 3 chunks)

Q: What textbooks are required?
A: - "Deep Learning" by Goodfellow, Bengio, and Courville
- "Speech and Language Processing" by Jurafsky and Martin (3rd edition, online)

   (Retrieved 3 chunks)
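
`rag_query` also returns `distances`, which read the opposite way from the similarity scores in Part 3: smaller means closer. Assuming the collection uses ChromaDB's default squared-L2 space and the embeddings are unit-length (both are assumptions here, not something the notebook checks), squared L2 distance and cosine similarity are related by `d = 2 - 2 * cos`. The sketch below checks that relationship on toy vectors in plain Python.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors, normalized so the identity d = 2 - 2 * cos applies
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

cos_ab = cosine_similarity(a, b)
dist_ab = squared_l2(a, b)
print(f"cosine={cos_ab:.4f}  squared_l2={dist_ab:.4f}  2-2cos={2 - 2 * cos_ab:.4f}")
```

If the embeddings are not unit-length, the identity no longer holds and the two rankings can differ, so it is worth checking the vector norms before comparing scores across the two parts.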



## 4.2 With vs Without RAG

Let's compare answers to the same questions — one with RAG (grounded in the document) and one without (pure LLM generation).

In [None]:
def compare_with_without_rag(question):
    """Compare RAG-augmented answers vs raw LLM answers."""

    # WITHOUT RAG — raw LLM call
    raw_prompt = (
        f"Answer this question about the CS 601 Advanced AI course taught by Dr. Bala:\n\n"
        f"{question}"
    )
    raw_response = client.models.generate_content(model=MODEL_ID, contents=raw_prompt)
    raw_answer = raw_response.text

    # WITH RAG — grounded in the actual document
    rag_result = rag_query(question)
    rag_answer = rag_result["answer"]

    print(f"Question: {question}")
    print(f"{'=' * 70}")
    print(f"WITHOUT RAG (LLM guesses):")
    print(f"  {raw_answer[:300]}")
    print()
    print(f"WITH RAG (grounded in document):")
    print(f"  {rag_answer[:300]}")
    print(f"{'=' * 70}\n")

In [None]:
# Compare on specific factual questions
comparison_questions = [
    "What percentage of the grade comes from assignments?",
    "What is the late penalty per day?",
    "When is the project proposal due?",
    "What topics are covered in weeks 8-9?",
]

for q in comparison_questions:
    compare_with_without_rag(q)

### Key Observations

| Aspect | Without RAG | With RAG |
|--------|------------|----------|
| **Accuracy** | May guess or hallucinate specific numbers | Numbers taken from the document (when the right chunk is retrieved) |
| **Reliability** | May sound confident but be wrong | Grounded in actual source material |
| **Traceability** | No source attribution | Can show which chunks were used |
| **Cost** | Fewer tokens (no context) | More tokens (context included) |

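RAG spends extra input tokens on retrieved context and, in return, lets the model **quote document-specific facts** (percentages, deadlines, schedules) instead of guessing them. The benefit depends on retrieval surfacing the right chunk.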
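
To put a rough number on the token cost, the sketch below reuses the notebook's 4-characters-per-token rule of thumb (`len(text) // 4`) to compare prompt sizes with and without retrieved context. The chunk lengths are the first three sizes printed in Step 2, used purely as sample values. `estimate_tokens` is a hypothetical helper, and real tokenizer counts will differ.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token."""
    return len(text) // 4


question = "What is the late penalty per day?"

# Stand-in chunks with sizes 138, 293 and 264 characters (sample values only)
retrieved = ["x" * 138, "x" * 293, "x" * 264]

# Without RAG the prompt is just the question
raw_tokens = estimate_tokens(question)

# With RAG the retrieved chunks are joined and placed before the question
rag_prompt = "\n\n---\n\n".join(retrieved) + "\n\n" + question
rag_tokens = estimate_tokens(rag_prompt)

print(f"Without RAG: ~{raw_tokens} tokens")
print(f"With RAG:    ~{rag_tokens} tokens (+{rag_tokens - raw_tokens})")
```

The overhead scales with `top_k` and chunk size, so those two settings are the main levers for controlling per-query cost.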

## 4.3 Bonus: RAG with ADK Agent

Let's combine ADK's agent framework with our RAG pipeline. The agent gets a tool to search the syllabus.

In [None]:
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService


def search_syllabus(query: str) -> dict:
    """Search the CS 601 course syllabus for relevant information.

    Use this tool when the user asks about course policies, schedule,
    grading, assignments, exams, or any course-related information.

    Args:
        query: The search query describing what information to find.

    Returns:
        A dictionary containing relevant passages from the syllabus.
    """
    query_embedding = get_embedding(query)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    return {
        "relevant_passages": results["documents"][0],
        "num_results": len(results["documents"][0])
    }


# Create the RAG-powered agent
rag_agent = Agent(
    name="syllabus_assistant",
    model=MODEL_ID,
    instruction=(
        "You are a helpful course assistant for CS 601: Advanced AI. "
        "When students ask about the course, ALWAYS use the search_syllabus tool "
        "to find accurate information before answering. "
        "Base your answers only on the retrieved information. "
        "If the information isn't in the syllabus, say so."
    ),
    tools=[search_syllabus],
)

# Set up runner and session
rag_session_service = InMemorySessionService()
rag_runner = Runner(
    agent=rag_agent,
    app_name="rag_demo",
    session_service=rag_session_service,
)

rag_session = await rag_session_service.create_session(
    app_name="rag_demo",
    user_id="student_1"
)

print("RAG Agent ready! It has conversation memory (ADK) + document memory (ChromaDB).")

In [None]:
# Chat with the RAG agent
student_questions = [
    "Hi, I'm a new student. What are the prerequisites for this course?",
    "What's the grading breakdown?",
    "I submitted my assignment 2 days late. How much will I lose?",
    "Can I work with a friend on the final project?",
]

for question in student_questions:
    reply = await chat_with_agent(rag_runner, rag_session, question, user_id="student_1")
    print(f"Student: {question}")
    print(f"Agent:   {reply}")
    print()

---

# Summary: The Memory Stack

```
┌─────────────────────────────────────────────────┐
│           Long-Term Memory (Part 3-4)           │
│   Embeddings + Vector Store + RAG               │
│   → Remembers facts across sessions             │
│   → Retrieves relevant knowledge on demand      │
├─────────────────────────────────────────────────┤
│         Short-Term Memory (Part 2)              │
│   Conversation History / Sliding Window /       │
│   Summarization / ADK Sessions                  │
│   → Remembers within a session                  │
│   → Manages token budget                        │
├─────────────────────────────────────────────────┤
│           No Memory (Part 1)                    │
│   Raw API Call                                  │
│   → Stateless, forgets everything               │
└─────────────────────────────────────────────────┘
```

## What We Built Today

| Technique | Type | Remembers | Token Cost | Implementation |
|-----------|------|-----------|------------|----------------|
| Raw API call | None | Nothing | Minimal | `client.models.generate_content()` |
| Chat history | Short-term | Full session | Grows linearly | `messages` list |
| Sliding window | Short-term | Last N turns | Fixed | `messages[-N:]` |
| Summarization | Short-term | Compressed history | Fixed (medium) | LLM summarizes old messages |
| ADK sessions | Short-term | Full session | Grows linearly | `InMemorySessionService` |
| Embeddings | Long-term | Semantic facts | Per-query | `embed_content()` + cosine similarity |
| Mini vector store | Long-term | Stored facts | Per-query | Python dict + embeddings |
| RAG (ChromaDB) | Long-term | Documents | Per-query + context | Chunk → Embed → Retrieve → Generate |
| RAG + ADK agent | Both | Session + Documents | Combined | Agent with search tool |

## Next Steps
- **Persistent storage**: Use `DatabaseSessionService` (ADK) or persistent ChromaDB for data that survives restarts
- **Hybrid memory**: Combine conversation history + user fact store + document RAG
- **Memory management**: Implement importance scoring — not all facts are worth remembering
- **Evaluation**: Measure retrieval quality with precision/recall metrics
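
For the evaluation item, here is a minimal sketch of precision and recall at k for a single query, assuming you have hand-labelled which chunk ids are relevant. The ids follow the `chunk_{i}` pattern used when storing chunks, but the labels and the retrieved list are made up for illustration, and `precision_recall_at_k` is a hypothetical helper.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    # Precision: share of retrieved chunks that are relevant
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of relevant chunks that were retrieved
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Made-up example: 3 chunks retrieved, 2 chunks labelled relevant
retrieved = ["chunk_3", "chunk_0", "chunk_6"]
relevant = {"chunk_3", "chunk_2"}

p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={p:.2f}  recall@3={r:.2f}")
```

Averaging these two numbers over a small set of labelled questions gives a simple baseline for comparing chunk sizes or `top_k` values.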

In [None]:
# Clean up
chroma_client.delete_collection("syllabus")
print("Workshop complete! ChromaDB collection cleaned up.")