# Building a Chatbot from Scratch with LangChain

## Tutorial Overview

In this beginner-friendly tutorial, you'll learn to build a chatbot step by step:

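1. **Basic Chatbot (No Memory)** - Start with a simple chatbot
2. **Understanding the Problem** - See why memory matters
3. **Adding Short-Term Memory** - Make the chatbot remember conversations
4. **Interactive Chat Loop** - Talk to the chatbot from the keyboard
5. **Cost Management** - Trim or summarize long conversations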

### Prerequisites

```bash
pip install langchain langchain-openai langgraph python-dotenv
```

You'll also need an OpenAI API key set as an environment variable.
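
A small check like the one below can fail fast with a readable message when the key is missing. This is a minimal sketch: `require_env` is a hypothetical helper name, not part of LangChain, and the example passes an explicit mapping so it runs without a real key.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value


# Demonstration with an explicit mapping instead of the real environment.
demo_key = require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"})
print(demo_key)
```

In a real session you would call `require_env("OPENAI_API_KEY")` with no second argument, before creating the model.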

---

## Part 1: Basic Chatbot Without Memory

Let's start with the simplest possible chatbot. This chatbot can respond to questions but **cannot remember previous messages**.

### Step 1: Import Required Libraries

In [1]:
import os
import pprint

from dotenv import load_dotenv
from langchain.agents import create_agent
from langchain_openai import ChatOpenAI

# Load OPENAI_API_KEY from a .env file (pass dotenv_path=... if yours lives elsewhere)
load_dotenv()

# Or set the key directly before creating the model:
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# The chat model used throughout this tutorial
llm = ChatOpenAI(model="gpt-5-nano")

### Step 2: Create a Simple Agent (No Memory)

The `create_agent` function is the standard way to build agents in LangChain. It's built on LangGraph but provides a simpler interface for beginners.

**Key Parameters:**
- `model`: The LLM to use (here, the `llm` object created above, which wraps `gpt-5-nano`)
- `tools`: Functions the agent can use (empty for now)
- `system_prompt`: Instructions for the chatbot's behavior

In [2]:
# Create a basic agent without memory
simple_agent = create_agent(
    model=llm,
    tools=[],  # No tools for now, just conversation
    system_prompt="You are a helpful assistant. Answer questions concisely."
)

print("Simple agent created successfully!")

Simple agent created successfully!


### Step 3: Test the Basic Chatbot

In [3]:
# First message
response1 = simple_agent.invoke({
    "messages": [{"role": "user", "content": "Hi! My name is Bipin."}]
})

print("Bot:", response1["messages"][-1].content)

Bot: Nice to meet you, Bipin! How can I help today? If you’re not sure, tell me what you’d like to do—ask questions, get help with writing or coding, learn something new, or plan something.


In [4]:
# Second message - asking about the name
response2 = simple_agent.invoke({
    "messages": [{"role": "user", "content": "What's my name?"}]
})

print("Bot:", response2["messages"][-1].content)

Bot: I don’t know your name. What would you like me to call you?


---

## Part 2: The Problem - No Memory!

### What Just Happened?

You probably noticed the chatbot **doesn't remember your name**! This is because:

1. Each `invoke()` call is independent
2. We only sent the current message, not the conversation history
3. The agent has no way to remember previous interactions

### Why This Is a Problem

- Users expect chatbots to remember context
- Real conversations build on previous messages
- Without memory, the chatbot feels robotic and unhelpful

### The Solution: Short-Term Memory

We need to:
1. Store conversation history
2. Send the full history with each new message
3. Use a "checkpointer" to persist the conversation
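
To make the first two points concrete, here is a minimal sketch of keeping the history by hand. It assumes a stand-in `fake_model` function in place of a real LLM call (a hypothetical helper, so the example runs without an API key); the checkpointer introduced in Part 3 automates the same bookkeeping.

```python
def fake_model(messages):
    """Stand-in for an LLM: it can only answer from the messages it is given."""
    prefix = "My name is "
    for msg in messages:
        text = msg["content"]
        if msg["role"] == "user" and text.startswith(prefix):
            return "Your name is " + text[len(prefix):].rstrip(".") + "."
    return "I don't know your name."


# Without history: each call sees only the current message.
fake_model([{"role": "user", "content": "My name is Bipin."}])
stateless_reply = fake_model([{"role": "user", "content": "What's my name?"}])

# With history: append every turn and send the whole list each time.
history = []


def chat(user_text):
    history.append({"role": "user", "content": user_text})
    reply = fake_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


chat("My name is Bipin.")
stateful_reply = chat("What's my name?")

print("Stateless:", stateless_reply)
print("Stateful: ", stateful_reply)
```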

---

## Part 3: Adding Short-Term Memory

LangChain agents get built-in memory management through LangGraph **checkpointers**. A checkpointer saves the conversation state after each call so the agent can load it again on the next one.

### Step 1: Import Memory Components

In [None]:
from langgraph.checkpoint.memory import MemorySaver

# MemorySaver stores conversation history in memory
# For production, you'd use a database checkpointer (SQLite, Postgres, etc.)

### Step 2: Create an Agent with Memory

The key difference: we add a `checkpointer` parameter to save conversation state.

In [6]:
# Create a checkpointer to store conversation history
memory = MemorySaver()

# Create an agent WITH memory
agent_with_memory = create_agent(
    model=llm,
    tools=[],
    system_prompt="You are a helpful assistant. Remember details from our conversation.",
    checkpointer=memory  # This enables memory!
)

print("Agent with memory created!")

Agent with memory created!


### Step 3: Understanding Thread IDs

To maintain separate conversations, we use **thread IDs**. Think of a thread as a conversation session.

- Each thread has its own memory
- Same thread ID = same conversation
- Different thread ID = new conversation

In [7]:
# Configuration with thread ID
config = {"configurable": {"thread_id": "conversation_1"}}

# First message in the conversation
response1 = agent_with_memory.invoke(
    {"messages": [{"role": "user", "content": "Hi! My name is Bipin."}]},
    config  # Pass the config to maintain thread
)

print("Bot:", response1["messages"][-1].content)

Bot: Hi Bipin! Nice to meet you. How can I help you today? I can assist with writing, explanations, coding, learning a topic, planning, or just chatting. Tell me what you’re aiming to do, and we’ll dive in.


In [8]:
# Second message - the bot should remember the name now!
response2 = agent_with_memory.invoke(
    {"messages": [{"role": "user", "content": "What's my name?"}]},
    config  # Same config = same conversation thread
)

print("Bot:", response2["messages"][-1].content)

Bot: Your name is Bipin. Nice to meet you again, Bipin! What would you like to work on today?


In [9]:
# Let's have a longer conversation to test memory
response3 = agent_with_memory.invoke(
    {"messages": [{"role": "user", "content": "I like pizza and cats."}]},
    config
)

print("Bot:", response3["messages"][-1].content)

Bot: Nice to hear that, Bipin! Want to dive into one of these?

- Pizza: recipe ideas, topping combos, crust types, or diet-friendly options
- Cats: care tips, breed info, behavior tricks, or fun cat facts
- A fun blend: a cat-themed pizza idea or a short story featuring a pizza-loving cat

Tell me what you’d like to explore or if you have a specific question.


In [10]:
# Ask about earlier information
response4 = agent_with_memory.invoke(
    {"messages": [{"role": "user", "content": "What do I like?"}]},
    config
)

print("Bot:", response4["messages"][-1].content)

Bot: You like pizza and cats. Great combos!

What would you like to do next? Here are a few options:
- Pizza: recipe ideas, topping combos, or crust types
- Cats: care tips, breed info, behavior tricks
- A fun blend: a cat-themed pizza idea or a short story about a pizza-loving cat

Tell me which you’d prefer or ask me anything specific.


In [11]:
response4

{'messages': [HumanMessage(content='Hi! My name is Bipin.', additional_kwargs={}, response_metadata={}, id='03a29872-7974-4577-8c30-1d565a46fecd'),
  AIMessage(content='Hi Bipin! Nice to meet you. How can I help you today? I can assist with writing, explanations, coding, learning a topic, planning, or just chatting. Tell me what you’re aiming to do, and we’ll dive in.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 700, 'prompt_tokens': 30, 'total_tokens': 730, ...}, 'model_provider': 'openai', 'model_name': 'gpt-5-nano-2025-08-07', ...}, ...),
  ... (output truncated)

### Step 4: Viewing Conversation History

You can inspect what the agent remembers by looking at the messages in the state.

In [12]:
# View all messages in the conversation
print("\n=== Full Conversation History ===")
for i, msg in enumerate(response4["messages"], 1):
    role = msg.__class__.__name__.replace("Message", "")
    content = msg.content
    print(f"{i}. [{role}]: {content}")


=== Full Conversation History ===
1. [Human]: Hi! My name is Bipin.
2. [AI]: Hi Bipin! Nice to meet you. How can I help you today? I can assist with writing, explanations, coding, learning a topic, planning, or just chatting. Tell me what you’re aiming to do, and we’ll dive in.
3. [Human]: What's my name?
4. [AI]: Your name is Bipin. Nice to meet you again, Bipin! What would you like to work on today?
5. [Human]: I like pizza and cats.
6. [AI]: Nice to hear that, Bipin! Want to dive into one of these?

- Pizza: recipe ideas, topping combos, crust types, or diet-friendly options
- Cats: care tips, breed info, behavior tricks, or fun cat facts
- A fun blend: a cat-themed pizza idea or a short story featuring a pizza-loving cat

Tell me what you’d like to explore or if you have a specific question.
7. [Human]: What do I like?
8. [AI]: You like pizza and cats. Great combos!

What would you like to do next? Here are a few options:
- Pizza: recipe ideas, topping combos, or crust types
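- Cats: care tips, breed info, behavior tricks
- A fun blend: a cat-themed pizza idea or a short story about a pizza-loving cat

Tell me which you’d prefer or ask me anything specific.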

### Step 5: Starting a New Conversation

Use a different thread ID to start a fresh conversation with no shared memory.

In [None]:
# New thread = new conversation = no memory of Bipin
new_config = {"configurable": {"thread_id": "conversation_2"}}

response_new = agent_with_memory.invoke(
    {"messages": [{"role": "user", "content": "What's my name?"}]},
    new_config
)

print("Bot (new conversation):", response_new["messages"][-1].content)
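
When every new session needs its own thread, generating the ID avoids accidental reuse. A minimal sketch, assuming a hypothetical helper named `new_thread_config` that builds the same config shape used above:

```python
import uuid


def new_thread_config(prefix="conversation"):
    """Build a config dict with a unique thread_id for a fresh conversation."""
    return {"configurable": {"thread_id": f"{prefix}_{uuid.uuid4().hex}"}}


config_a = new_thread_config()
config_b = new_thread_config()

# Two calls give two different threads, so the conversations share no memory.
print(config_a["configurable"]["thread_id"])
print(config_b["configurable"]["thread_id"])
```

Store the generated ID alongside the user's session if you want to resume the same conversation later.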

---

## Part 4: Building a Simple Interactive Chat Loop

Let's create a simple chat interface where you can type messages and get responses.

In [None]:
def chat_with_bot():
    """
    Simple interactive chat loop.
    Type 'quit', 'exit', or 'bye' to end the session.
    """
    print("\n=== Chat with your AI Assistant ===")
    print("Type 'quit', 'exit', or 'bye' to end the session\n")
    
    # Create a new conversation thread
    thread_config = {"configurable": {"thread_id": "interactive_chat"}}
    
    while True:
        # Get user input
        user_message = input("You: ")
        
        # Exit condition
        if user_message.lower() in ['quit', 'exit', 'bye']:
            print("Goodbye!")
            break
        
        # Send message to agent
        response = agent_with_memory.invoke(
            {"messages": [{"role": "user", "content": user_message}]},
            thread_config
        )
        
        # Print bot response
        print(f"Bot: {response['messages'][-1].content}\n")

# Uncomment to start chatting:
# chat_with_bot()

---

## Summary: What You Learned

### 1. Basic Chatbot (No Memory)
```python
agent = create_agent(
    model=llm,  # the ChatOpenAI instance created in Part 1
    tools=[],
    system_prompt="You are helpful."
)
```
**Problem:** Forgets everything after each message.

### 2. The Memory Problem
- Without memory, chatbots can't maintain context
- Each conversation starts from scratch
- Poor user experience

### 3. Chatbot with Short-Term Memory
```python
memory = MemorySaver()
agent = create_agent(
    model=llm,  # the ChatOpenAI instance created in Part 1
    tools=[],
    system_prompt="You are helpful.",
    checkpointer=memory  # Enables memory!
)

config = {"configurable": {"thread_id": "chat_1"}}
agent.invoke({"messages": [...]}, config)
```
**Solution:** Remembers conversation within a thread.

### Key Concepts

1. **create_agent()** - The standard way to build agents in LangChain
2. **Checkpointer** - Saves conversation state (MemorySaver, SQLite, Postgres)
3. **Thread ID** - Identifies a conversation session
4. **Short-term memory** - Remembers within a conversation thread

### Next Steps

- **Add tools** to let your agent perform actions
- **Use persistent storage** (SQLite, Postgres) instead of in-memory
- **Implement long-term memory** across multiple conversations
- **Add message summarization** for very long conversations
- **Stream responses** for real-time output

---

## Bonus: Understanding How Memory Works

### Behind the Scenes

When you use a checkpointer:

1. **First message**: Agent processes it and saves the state
2. **Second message**: Agent loads previous state, adds new message, processes, saves updated state
3. **Continue**: This repeats for every message
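
The load, append, process, save cycle can be sketched in plain Python. This is an illustration only: `ToyCheckpointer` and `invoke` are hypothetical names, and the real checkpointer stores richer state than a list of tuples.

```python
class ToyCheckpointer:
    """Illustrative in-memory store: one message list per thread_id."""

    def __init__(self):
        self._threads = {}

    def load(self, thread_id):
        return list(self._threads.get(thread_id, []))

    def save(self, thread_id, messages):
        self._threads[thread_id] = list(messages)


def invoke(checkpointer, thread_id, user_text, model):
    messages = checkpointer.load(thread_id)    # 1. load previous state
    messages.append(("user", user_text))       # 2. add the new message
    reply = model(messages)                    # 3. process the full history
    messages.append(("assistant", reply))
    checkpointer.save(thread_id, messages)     # 4. save the updated state
    return messages


def echo_model(messages):
    """Stand-in for an LLM: reports how many messages it was shown."""
    return f"seen {len(messages)} message(s)"


store = ToyCheckpointer()
invoke(store, "conversation_1", "Hi!", echo_model)
second = invoke(store, "conversation_1", "Still there?", echo_model)
other = invoke(store, "conversation_2", "Hi!", echo_model)

print(second[-1])  # the second call in the same thread saw the earlier turn
print(other[-1])   # a different thread starts from an empty history
```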

### Message Types

- **HumanMessage**: Messages from the user
- **AIMessage**: Messages from the assistant
- **SystemMessage**: Instructions for the agent (system prompt)
- **ToolMessage**: Results from tool executions

### Memory Types in LangChain

| Type | Scope | Use Case |
|------|-------|----------|
| **Short-term** | Within a thread | Single conversation session |
| **Long-term** | Across threads | User preferences, facts across sessions |

This tutorial covered **short-term memory**. For production chatbots, you'll often want both!

---

## Part 5: Intelligent Cost Management - Trimming and Summarizing Messages

### The Cost Problem

As conversations grow longer (e.g., 20+ questions), the history sent to the model grows with every message. This leads to:

- **Higher API costs**: You're charged for every token in the conversation history
- **Slower responses**: More tokens = more processing time
- **Context window limits**: Eventually, you'll hit the model's maximum token limit

### The Solution: Two Intelligent Approaches

1. **Message Trimming**: Keep only recent messages (fast but loses context)
2. **Message Summarization**: Compress old messages into a summary (preserves context, slightly more cost)

Let's learn both approaches!

---

## Practice Exercises for Part 5

Try these to reinforce your learning:

1. **Modify the trigger threshold**: Change the summarization trigger to 1000 tokens and observe the difference
2. **Custom trimming logic**: Keep the first message, middle message, and last 2 messages
3. **Cost comparison**: Run the same 20-question conversation with all three approaches and compare token counts
4. **Hybrid approach**: Create middleware that trims after 10 messages AND summarizes at 1000 tokens
5. **Smart summarization**: Write a custom summary prompt that focuses on extracting user preferences only

### Bonus Challenge

Create a chatbot that:
- Uses trimming for the first 10 messages
- Switches to summarization after 10 messages
- Keeps a running summary of user preferences separately

This combination gives you speed early on and context preservation later!

---

## Part 5 Summary: Key Takeaways

### What You Learned

1. **The Cost Problem**: Long conversations = growing token counts = higher costs
2. **Message Trimming**: Keep only recent messages (simple, fast, loses old context)
3. **Message Summarization**: Compress old messages into summaries (preserves context, slight overhead)
4. **Token Counting**: How to measure and optimize token usage

### Best Practices

1. **For Customer Support**: Use summarization (need full conversation context)
2. **For Quick Q&A**: Use trimming (only recent context matters)
3. **For Production**: Combine both strategies with smart thresholds
4. **Monitor Costs**: Track token usage to optimize your configuration

### Code Reference

```python
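# Trimming approach
from langchain.agents.middleware import before_model
from langchain.messages import RemoveMessage
from langgraph.graph.message import REMOVE_ALL_MESSAGES

@before_model
def trim_middleware(state, runtime):
    messages = state["messages"]
    if len(messages) <= 5:
        return None
    # Clear the stored history, then re-add the first and last 4 messages
    kept = [messages[0]] + messages[-4:]
    return {"messages": [RemoveMessage(id=REMOVE_ALL_MESSAGES), *kept]}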

# Summarization approach
from langchain.agents.middleware import SummarizationMiddleware

SummarizationMiddleware(
    model="gpt-4o-mini",
    trigger={"tokens": 2000},
    keep={"messages": 8}
)
```

### What's Next?

- Experiment with different trigger thresholds
- Try combining multiple middleware functions
- Implement custom summarization prompts
- Add persistent storage (SQLite, Postgres) for production use

---

## Cost Calculation Example

Let's calculate potential savings with a 20-question conversation:

### Without Memory Management
- Message 1: 100 tokens sent
- Message 2: 200 tokens sent (prev + new)
- Message 3: 300 tokens sent
- ...
- Message 20: 2000 tokens sent
- **Total tokens across all calls**: 100 × (1 + 2 + ... + 20) = 21,000 tokens

### With Summarization (trigger at 500 tokens, keep 4 messages)
- Messages 1-5: Normal growth (100 + 200 + 300 + 400 + 500 = 1,500 tokens)
- Messages 6-20: Each call sends ~600 tokens (summary + last 4 messages), so 15 × 600 = 9,000 tokens
- **Total tokens across all calls**: ~10,500 tokens (plus the small cost of the summarization calls themselves)
- **Savings**: ~50% reduction!

### With Trimming (keep last 4 messages)
- Messages 1-3: Normal growth (100 + 200 + 300 = 600 tokens)
- Messages 4-20: Each call sends only ~400 tokens (last 4 messages), so 17 × 400 = 6,800 tokens
- **Total tokens across all calls**: ~7,400 tokens
- **Savings**: ~65% reduction!

**Note**: These are simplified calculations. Actual savings depend on message lengths and conversation patterns.
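
The arithmetic above can be reproduced with a few lines of Python. This sketch uses the same simplifying assumption (every call adds about 100 tokens to the history) and models trimming and summarization as a cap on the tokens sent per call; the function name and the cap values of 400 and 600 are taken from the example, not measured.

```python
TOKENS_PER_TURN = 100  # assumed growth of the history per call
TURNS = 20


def total_tokens_sent(turns, per_turn, cap=None):
    """Sum the tokens sent over all calls; `cap` bounds the history per call."""
    total = 0
    for n in range(1, turns + 1):
        sent = per_turn * n            # full history after n turns
        if cap is not None:
            sent = min(sent, cap)      # trimming or summarization bound
        total += sent
    return total


full = total_tokens_sent(TURNS, TOKENS_PER_TURN)
summarized = total_tokens_sent(TURNS, TOKENS_PER_TURN, cap=600)
trimmed = total_tokens_sent(TURNS, TOKENS_PER_TURN, cap=400)

for label, total in [("full history", full), ("summarization", summarized), ("trimming", trimmed)]:
    print(f"{label}: {total} tokens, {1 - total / full:.0%} saved")
```

Under this model the exact sums are 21,000, 10,500, and 7,400 tokens. The trimming total counts the first three calls at their actual size (100, 200, and 300 tokens) rather than at 400.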

In [None]:
# Production-ready chatbot with intelligent memory management
production_memory = MemorySaver()

production_agent = create_agent(
    model=llm,
    tools=[],
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",  # Cheap model for summarization
            trigger=[
                {"tokens": 2000},    # Trigger at 2000 tokens OR
                {"messages": 15}     # Trigger at 15 messages
            ],
            keep={"messages": 8}    # Keep last 8 messages (4 conversation turns)
        )
    ],
    checkpointer=production_memory
)

print("Production-ready chatbot created!")
print("\nConfiguration:")
print("- Summarizes when: >2000 tokens OR >15 messages")
print("- Keeps intact: Last 8 messages")
print("- Summarization model: gpt-4o-mini (cost-effective)")
print("- Main model: gpt-5-nano (the llm defined in Part 1)")

---

## Real-World Example: Production-Ready Configuration

Here's a recommended setup for a production chatbot that balances cost and context retention:

In [None]:
# Import token counting utilities
from langchain_core.messages.utils import count_tokens_approximately

def analyze_token_usage(response):
    """
    Analyze token usage in a conversation.
    """
    messages = response["messages"]
    
    # Count tokens in all messages
    total_tokens = count_tokens_approximately(messages)
    
    print(f"Total messages: {len(messages)}")
    print(f"Approximate tokens: {total_tokens}")
    
    # Count messages by type
    human_msgs = [m for m in messages if m.__class__.__name__ == "HumanMessage"]
    ai_msgs = [m for m in messages if m.__class__.__name__ == "AIMessage"]
    
    print(f"Human messages: {len(human_msgs)}")
    print(f"AI messages: {len(ai_msgs)}")
    
    return total_tokens

# Test with our previous responses
print("=== Without Management (Full History) ===")
analyze_token_usage(response4)

print("\n=== With Trimming ===")
# You would analyze the trimming response here

print("\n=== With Summarization ===")
# You would analyze the summary response here

---

## Advanced: Token Counting

Let's estimate how many tokens we're using with each approach. The count is approximate: it comes from a character-based heuristic, not the model's real tokenizer.
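
To show the idea behind an approximate count, here is a minimal character-based estimator. It is a sketch, not the implementation of `count_tokens_approximately`: the function name `estimate_tokens` is hypothetical, and the defaults (about 4 characters per token, 3 extra tokens per message) are assumptions chosen for illustration.

```python
import math


def estimate_tokens(messages, chars_per_token=4.0, overhead_per_message=3):
    """Rough estimate: characters / chars_per_token plus a fixed per-message overhead."""
    total = 0
    for msg in messages:
        total += math.ceil(len(msg["content"]) / chars_per_token) + overhead_per_message
    return total


conversation = [
    {"role": "user", "content": "Hi! My name is Bipin."},
    {"role": "assistant", "content": "Hi Bipin! Nice to meet you."},
]
print("Approximate tokens:", estimate_tokens(conversation))
```

An estimate like this is good enough for deciding when to trim or summarize; for billing, rely on the `token_usage` values returned in the response metadata.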

---

## Comparing the Three Approaches

| Approach | Memory | Cost Efficiency | Context Retention | Best For |
|----------|--------|----------------|-------------------|----------|
| **No Management** | Unlimited growth | Poor (tokens per call grow linearly, total cost quadratically) | Perfect | Short conversations |
| **Trimming** | Recent messages only | Excellent | Limited | Cost-sensitive, recent context matters |
| **Summarization** | Summary + recent | Good | Excellent | Long conversations, full context needed |

### How Summarization Works

When the token count exceeds the threshold (500 tokens):

1. The middleware takes old messages (except the last 4)
2. Sends them to a cheaper model (gpt-4o-mini) with a prompt: "Summarize this conversation"
3. Replaces the old messages with a single summary message
4. Keeps recent messages intact for immediate context

**Result:** The agent remembers details from the beginning of the conversation (via summary) AND recent context!
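
The four steps above can be sketched as a plain function. This is an illustration under stated assumptions, not the middleware's actual code: `compact_history`, `count_tokens`, and `summarize` are hypothetical names, and the two stand-ins replace the real token counter and the call to the summarization model.

```python
def compact_history(messages, count_tokens, summarize, max_tokens=500, keep_last=4):
    """Replace older messages with one summary once the history is too long."""
    if count_tokens(messages) <= max_tokens or len(messages) <= keep_last:
        return messages                      # nothing to do yet
    old = messages[:-keep_last]              # step 1: everything except the last 4
    recent = messages[-keep_last:]           # step 4: kept intact
    summary = {                              # steps 2-3: one summary message
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    return [summary] + recent


def count_tokens(messages):
    """Stand-in counter: roughly 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def summarize(messages):
    """Stand-in for the call to a cheaper summarization model."""
    return f"{len(messages)} earlier messages condensed."


history = [{"role": "user", "content": f"Fact number {i}: " + "x" * 400} for i in range(8)]
compacted = compact_history(history, count_tokens, summarize)
print(len(history), "messages ->", len(compacted), "messages")
```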

In [None]:
# Test summarization with a long conversation
summary_config = {"configurable": {"thread_id": "summary_test"}}

# Same messages as before
messages_to_send = [
    "Hi! My name is Bipin.",
    "I live in Mumbai.",
    "I work as a software engineer.",
    "I love playing cricket.",
    "My favorite food is biryani.",
    "I have a dog named Max.",
    "I also enjoy reading science fiction books.",
    "My favorite author is Isaac Asimov.",
    "What's my name?",
    "What do you know about me?",  # This should include summarized info!
]

print("Sending messages to agent with summarization...\n")

for i, user_msg in enumerate(messages_to_send, 1):
    response = agent_with_summarization.invoke(
        {"messages": [{"role": "user", "content": user_msg}]},
        summary_config
    )
    bot_response = response["messages"][-1].content
    
    print(f"{i}. You: {user_msg}")
    print(f"   Bot: {bot_response[:150]}...")  # Truncate for readability
    print()

### Testing Summarization

Let's test with the same long conversation and see how the agent preserves context through summarization.

### Understanding the SummarizationMiddleware Parameters

- `model`: The LLM to use for creating summaries (use a cheaper model like gpt-4o-mini)
- `trigger`: When to start summarization
  - `{"tokens": 500}` - Trigger when conversation exceeds 500 tokens
  - `{"messages": 10}` - Trigger when there are 10+ messages
  - Can use both: `[{"tokens": 500}, {"messages": 10}]` (OR logic)
- `keep`: How many recent messages to keep intact
  - `{"messages": 4}` - Keep last 4 messages
  - `{"tokens": 200}` - Keep last 200 tokens worth of messages
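
The OR logic for `trigger` can be illustrated with a small stand-alone function. This is a sketch of the behavior described above, using the dict shape shown in this tutorial; `should_summarize` is a hypothetical name, and treating a threshold as reached at "greater than or equal" is an assumption.

```python
def should_summarize(token_count, message_count, trigger):
    """Return True if any one of the trigger conditions is met (OR logic)."""
    conditions = trigger if isinstance(trigger, list) else [trigger]
    for condition in conditions:
        if "tokens" in condition and token_count >= condition["tokens"]:
            return True
        if "messages" in condition and message_count >= condition["messages"]:
            return True
    return False


both = [{"tokens": 500}, {"messages": 10}]
print(should_summarize(600, 3, both))   # token threshold reached
print(should_summarize(100, 12, both))  # message threshold reached
print(should_summarize(100, 3, both))   # neither threshold reached
```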

In [None]:
# Import summarization middleware
from langchain.agents.middleware import SummarizationMiddleware

# Create an agent with automatic summarization
summary_memory = MemorySaver()

agent_with_summarization = create_agent(
    model=llm,
    tools=[],
    middleware=[
        SummarizationMiddleware(
            model="gpt-4o-mini",  # Use a cheaper model for summarization
            trigger={"tokens": 500},  # Trigger summarization at 500 tokens
            keep={"messages": 4}  # Keep the last 4 messages intact
        )
    ],
    checkpointer=summary_memory
)

print("Agent with summarization middleware created!")

---

## Approach 2: Message Summarization (Intelligent Compression)

Instead of deleting old messages, we can **summarize** them! This preserves important context while reducing token count.

### Key Observations with Trimming

- The agent still remembers your name, because the middleware always keeps the first message
- But it may forget details from the middle of the conversation, such as where you work (trimmed messages)
- The message count stays bounded (doesn't grow indefinitely)
- **Cost savings**: Only recent messages are sent to the LLM each time

### Pros and Cons of Trimming

**Pros:**
- Fast and simple
- Guaranteed cost savings
- No extra LLM calls needed

**Cons:**
- Loses information from trimmed messages
- May forget important context from earlier in the conversation
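
A plain-Python sketch makes the trade-off visible. It applies the same rule as the trimming middleware (keep the first message plus the last 4) to a list of tuples; `trim_history` is a hypothetical helper, and the bot replies are placeholders.

```python
def trim_history(messages, keep_last=4, min_len=5):
    """Keep the first message plus the last `keep_last`; leave short histories alone."""
    if len(messages) <= min_len:
        return messages
    return [messages[0]] + messages[-keep_last:]


facts = [
    "Hi! My name is Bipin.",
    "I live in Mumbai.",
    "I work as a software engineer.",
    "I love playing cricket.",
    "My favorite food is biryani.",
    "I have a dog named Max.",
]

# Interleave user messages with placeholder bot replies.
history = []
for fact in facts:
    history += [("user", fact), ("assistant", "OK.")]

trimmed = trim_history(history)
kept = [text for role, text in trimmed if role == "user"]
print(kept)
```

The name survives because it is in the first message, while the city and the job are gone, which is why a question about them may no longer be answerable.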

In [None]:
# Simulate a long conversation
trim_config = {"configurable": {"thread_id": "trim_test"}}

# Send multiple messages
messages_to_send = [
    "Hi! My name is Bipin.",
    "I live in Mumbai.",
    "I work as a software engineer.",
    "I love playing cricket.",
    "My favorite food is biryani.",
    "I have a dog named Max.",
    "What's my name?",  # This should still work
    "Where do I work?",  # This might be forgotten due to trimming
]

print("Sending messages to agent with trimming...\n")

for i, user_msg in enumerate(messages_to_send, 1):
    response = agent_with_trimming.invoke(
        {"messages": [{"role": "user", "content": user_msg}]},
        trim_config
    )
    bot_response = response["messages"][-1].content
    
    print(f"{i}. You: {user_msg}")
    print(f"   Bot: {bot_response}")
    print(f"   Messages in history: {len(response['messages'])}")
    print()

### Testing Message Trimming

Let's simulate a long conversation with 10+ messages and see how trimming works.

In [None]:
# Create an agent with message trimming middleware
from langgraph.checkpoint.memory import MemorySaver

trim_memory = MemorySaver()

agent_with_trimming = create_agent(
    model=llm,
    tools=[],
    middleware=[trim_messages_middleware],  # Add our trimming middleware
    checkpointer=trim_memory
)

print("Agent with message trimming created!")

In [None]:
# Import required modules for message trimming
from langchain.messages import RemoveMessage
from langgraph.graph.message import REMOVE_ALL_MESSAGES
from langchain.agents import AgentState
from langchain.agents.middleware import before_model
from langgraph.runtime import Runtime
from typing import Any

# Define a middleware function that trims messages
@before_model
def trim_messages_middleware(state: AgentState, runtime: Runtime) -> dict[str, Any] | None:
    """
    Keep only the last few messages to fit context window.
    This runs BEFORE each model call.
    """
    messages = state["messages"]
    
    # If we have 5 or fewer messages, no need to trim
    if len(messages) <= 5:
        return None  # No changes needed
    
    # Keep the first message (here the user's opening message; the system prompt is not stored in state)
    first_msg = messages[0]
    
    # Keep the last 4 messages (2 turns of conversation)
    recent_messages = messages[-4:]
    
    # Create new message list
    new_messages = [first_msg] + recent_messages
    
    # Return the trimmed messages
    return {
        "messages": [
            RemoveMessage(id=REMOVE_ALL_MESSAGES),  # Remove all existing messages
            *new_messages  # Add back only the messages we want to keep
        ]
    }

print("Trim middleware function created!")

---

## Approach 1: Message Trimming with Middleware

The simplest approach: automatically keep only the last N messages. LangChain provides middleware that runs **before** the model is called.