# 04 - RAG with Memory (Conversational RAG)

**Architecture:** RAG with Conversational Memory

**Complexity:** ⭐⭐

**Use Cases:**
- Chatbots and conversational AI
- Customer support systems  
- Interactive Q&A sessions

**Key Feature:** Maintains chat history to handle follow-up questions and anaphoric references.

**Example:**
```
User: "What is RAG?"
Bot: "RAG is Retrieval-Augmented Generation..."
User: "What are its main components?"  ← References "RAG" from context
Bot: "The main components of RAG are..."  ← Understands reference
```

## 1. Setup

In [10]:
import sys
sys.path.append('../..')

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from shared.config import OPENAI_VECTOR_STORE_PATH, DEFAULT_MODEL, DEFAULT_K
from shared.utils import load_vector_store, print_section_header, format_docs
from shared.prompts import MEMORY_RAG_PROMPT

print_section_header("Setup: RAG with Memory")

# Load vector store
embeddings = OpenAIEmbeddings()
vectorstore = load_vector_store(OPENAI_VECTOR_STORE_PATH, embeddings)

# Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": DEFAULT_K})

# Initialize LLM
llm = ChatOpenAI(model=DEFAULT_MODEL, temperature=0)

print("\n✅ Setup complete!")


SETUP: RAG WITH MEMORY

✓ Loaded vector store from /Users/gianlucamazza/Workspace/notebooks/llm_rag/notebooks/advanced_architectures/../../data/vector_stores/openai_embeddings

✅ Setup complete!


## 2. Memory Setup

We'll use `RunnableWithMessageHistory` to add conversational memory to our RAG chain.

In [11]:
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

print_section_header("Memory Configuration")

# Session store (in-memory for demo)
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

print("✓ Memory store configured")

# Build base chain
# The retriever needs just the "input" string, not the whole dict
base_chain = (
    RunnablePassthrough.assign(
        context=lambda x: format_docs(retriever.invoke(x["input"]))
    )
    | MEMORY_RAG_PROMPT
    | llm
    | StrOutputParser()
)

# Wrap with memory
conversational_chain = RunnableWithMessageHistory(
    base_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

print("✓ Conversational chain created with memory")
print("\n💡 Chat history is maintained per session_id")


MEMORY CONFIGURATION

✓ Memory store configured
✓ Conversational chain created with memory

💡 Chat history is maintained per session_id
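
Conceptually, the wrapper loads the history for the given `session_id`, passes it to the chain under `chat_history`, and appends the new exchange afterwards. The sketch below illustrates that cycle with the standard library only. It is a simplified stand-in, not LangChain's implementation: `sessions`, `get_history`, `invoke_with_history`, and `echo_chain` are hypothetical names, and messages are plain `(role, content)` tuples rather than LangChain message objects.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (role, content); stand-in for LangChain message objects

# Hypothetical in-memory store, playing the role of the `store` dict above.
sessions: Dict[str, List[Message]] = {}


def get_history(session_id: str) -> List[Message]:
    """Return the message list for a session, creating it on first use."""
    return sessions.setdefault(session_id, [])


def invoke_with_history(
    chain: Callable[[dict], str], user_input: str, session_id: str
) -> str:
    """Load history, run the chain, then record the new exchange."""
    history = get_history(session_id)
    # The chain receives a copy of the prior turns under "chat_history".
    answer = chain({"input": user_input, "chat_history": list(history)})
    history.append(("human", user_input))
    history.append(("ai", answer))
    return answer


def echo_chain(inputs: dict) -> str:
    """Toy chain: reports how many prior messages it was given."""
    return f"seen {len(inputs['chat_history'])} prior messages"


first = invoke_with_history(echo_chain, "What is RAG?", "user_123")
second = invoke_with_history(echo_chain, "What are its main components?", "user_123")
other = invoke_with_history(echo_chain, "Hello", "user_456")
print(first, second, other, sep="\n")
```

The second call in `user_123` sees the two messages from the first exchange, while `user_456` starts empty, which is the per-session isolation the notebook relies on.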


## 3. Test Conversational Flow

In [12]:
print_section_header("Conversational Test")

session_id = "user_123"

# First question
print("User: What is RAG?\n")
response1 = conversational_chain.invoke(
    {"input": "What is RAG?"},
    config={"configurable": {"session_id": session_id}}
)
print(f"Bot: {response1}\n")
print("=" * 80)

# Follow-up with anaphoric reference
print("\nUser: What are its main components?\n")
response2 = conversational_chain.invoke(
    {"input": "What are its main components?"},
    config={"configurable": {"session_id": session_id}}
)
print(f"Bot: {response2}\n")
print("=" * 80)

# Another follow-up
print("\nUser: How do I implement it?\n")
response3 = conversational_chain.invoke(
    {"input": "How do I implement it?"},
    config={"configurable": {"session_id": session_id}}
)
print(f"Bot: {response3}")

print("\n✅ Conversation maintained context successfully!")


CONVERSATIONAL TEST

User: What is RAG?

Bot: RAG (Retrieval-Augmented Generation) is a pattern where a model retrieves relevant information from an external index at query time and uses it to generate a grounded answer.

Key pieces:
- Indexing: a separate pipeline that ingests data from sources and builds an index.
- Retrieval and generation: at run time, fetch the most relevant chunks from the index and pass them to the model to produce the response.

Common implementations:
- Agentic RAG: the LLM decides when/how to call a simple search tool - good general-purpose approach.
- Two-step RAG chain: retrieve first, then a single LLM call - fast and effective for simple queries.

Tooling:
- You can view step-by-step execution, latency, and metadata in a LangSmith trace.
- For more control, use LangGraph to add steps like grading document relevance or rewriting queries (see the Agentic RAG tutorial).


User: What are its main components?

Bot: Core components of RAG

- Ingestion and inde

## 4. View Chat History

In [13]:
print_section_header("Chat History")

history = store[session_id]
print(f"Messages in session '{session_id}': {len(history.messages)}\n")

for i, msg in enumerate(history.messages, 1):
    role = "User" if msg.type == "human" else "Bot"
    content = msg.content[:150] + "..." if len(msg.content) > 150 else msg.content
    print(f"{i}. {role}: {content}\n")


CHAT HISTORY

Messages in session 'user_123': 6

1. User: What is RAG?

2. Bot: RAG (Retrieval-Augmented Generation) is a pattern where a model retrieves relevant information from an external index at query time and uses it to gen...

3. User: What are its main components?

4. Bot: Core components of RAG

- Ingestion and indexing (offline)
  - Connectors to data sources
  - Text cleaning, chunking/splitting
  - Embedding model to...

5. User: How do I implement it?

6. Bot: Here’s a practical way to implement RAG, from simplest to more advanced, using the pieces we discussed.

Baseline (two-step RAG)
- Choose a chat model...



## 5. Comparison: Memory vs No Memory

In [14]:
from shared.prompts import RAG_PROMPT_TEMPLATE

print_section_header("Comparison: Memory vs No Memory")

# Simple RAG (no memory)
simple_chain = (
    {"context": retriever | format_docs, "input": RunnablePassthrough()}
    | RAG_PROMPT_TEMPLATE
    | llm
    | StrOutputParser()
)

# Test with anaphoric query
query = "What are the advantages of using it?"

print(f"Query (after discussing RAG): '{query}'\n")
print("=" * 80)

print("\n[Simple RAG - NO MEMORY]")
try:
    simple_response = simple_chain.invoke(query)
    print(simple_response[:300])
except Exception as e:
    print(f"Cannot answer: {e}")

print("\n" + "=" * 80)
print("\n[Memory RAG - WITH MEMORY]")
memory_response = conversational_chain.invoke(
    {"input": query},
    config={"configurable": {"session_id": session_id}}
)
print(memory_response[:300])

print("\n" + "=" * 80)
print("\n💡 Memory RAG understands 'it' refers to RAG from conversation context!")


COMPARISON: MEMORY VS NO MEMORY

Query (after discussing RAG): 'What are the advantages of using it?'


[Simple RAG - NO MEMORY]
It’s not clear what “it” refers to. Based on the context, there are two relevant sets of advantages:

Advantages of treating search as a tool the LLM uses
- Search only when needed – The LLM can handle greetings, follow-ups, and simple queries without triggering unnecessary searches. [From “Benefits


[Memory RAG - WITH MEMORY]
Key advantages of RAG

- Higher factual accuracy: grounds answers in retrieved sources, reducing hallucinations.
- Up-to-date knowledge: you can update the index without retraining the model.
- Domain adaptation without fine-tuning: plug in proprietary docs/policies and get expert answers fast.
- Ex


💡 Memory RAG understands 'it' refers to RAG from conversation context!
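
One caveat about the comparison: in the chain from section 2 the retriever is called with `x["input"]` only, so the follow-up query still reaches the vector store without any history. The LLM resolves "it" from `chat_history`, but the retrieved documents may be less relevant than they could be. A more robust option is to have an LLM rewrite the follow-up into a standalone question before retrieval (LangChain provides `create_history_aware_retriever` for this). The sketch below shows a much cheaper heuristic instead: prefixing the query with the most recent user turn. `build_retrieval_query` is a hypothetical helper, and the tuple-based history is an assumption made to keep the example self-contained.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content); stand-in for LangChain message objects


def build_retrieval_query(
    history: List[Message], question: str, max_user_turns: int = 1
) -> str:
    """Prefix the question with the most recent user turns (naive heuristic)."""
    if max_user_turns <= 0:
        return question
    user_turns = [content for role, content in history if role == "human"]
    recent = user_turns[-max_user_turns:]
    return " ".join(recent + [question])


history = [
    ("human", "What is RAG?"),
    ("ai", "RAG is Retrieval-Augmented Generation..."),
]
query = build_retrieval_query(history, "What are the advantages of using it?")
print(query)
```

The combined string gives the embedding search the term "RAG" to match on. Whether this improves retrieval depends on the data, so treat it as a starting point to evaluate rather than a fix.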


## Summary

### Architecture: RAG with Memory

**Flow:**
```
User Query → Retriever → Prompt (+ Chat History) → LLM → Response
                                                            ↓
                                                     Update History
```

**Key Components:**
- `RunnableWithMessageHistory`: LCEL wrapper for memory
- `ChatMessageHistory`: Stores conversation
- `MessagesPlaceholder`: Injects history into prompt

**Advantages:**
✅ Handles follow-up questions  
✅ Understands anaphoric references ("it", "that", "them")  
✅ More natural conversations  
✅ Context accumulates over session  

**Limitations:**
- Higher cost (more tokens in context)
- Memory can grow large
- Privacy considerations (stores conversations)
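
To make the cost point concrete, the following back-of-the-envelope sketch estimates prompt size when the full history is resent on every turn. The numbers are assumptions chosen for illustration (1,200 tokens for instructions, retrieved context, and the current question; 300 tokens per stored question/answer pair), not measurements from this notebook.

```python
def prompt_tokens_at_turn(turn: int, base_tokens: int, exchange_tokens: int) -> int:
    """Prompt size at a 1-indexed turn when all earlier exchanges are resent."""
    return base_tokens + (turn - 1) * exchange_tokens


def total_prompt_tokens(turns: int, base_tokens: int, exchange_tokens: int) -> int:
    """Cumulative prompt tokens billed over the first `turns` turns."""
    return sum(
        prompt_tokens_at_turn(t, base_tokens, exchange_tokens)
        for t in range(1, turns + 1)
    )


BASE_TOKENS = 1200      # assumed: instructions + retrieved context + question
EXCHANGE_TOKENS = 300   # assumed: one stored question/answer pair

for n in (1, 5, 10, 20):
    per_turn = prompt_tokens_at_turn(n, BASE_TOKENS, EXCHANGE_TOKENS)
    cumulative = total_prompt_tokens(n, BASE_TOKENS, EXCHANGE_TOKENS)
    print(f"turn {n:>2}: prompt={per_turn:>5} tokens, cumulative={cumulative:>6} tokens")
```

Per-turn prompt size grows linearly with the number of turns, so cumulative cost grows quadratically. That is the motivation for capping or summarizing history in long sessions.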

**Production Tips:**
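- Limit history size, for example by trimming messages with `trim_messages` from `langchain_core.messages` before they reach the prompt (the legacy `ConversationBufferWindowMemory` class targets the older chain API rather than `RunnableWithMessageHistory`)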
- Implement conversation summarization for long sessions
- Store sessions in database (Redis, PostgreSQL)
- Add conversation timeout/expiry
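
Two of these tips, limiting history size and expiring idle sessions, can be sketched with the standard library alone. The class below is a minimal illustration under stated assumptions, not a LangChain API: `BoundedSessionStore`, its parameters, and the tuple-based messages are all hypothetical, and a production system would more likely delegate both concerns to the backing store (for example, key expiry in Redis).

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Message = Tuple[str, str]  # (role, content); stand-in for LangChain message objects


class BoundedSessionStore:
    """In-memory sessions with a sliding message window and idle expiry."""

    def __init__(
        self,
        max_messages: int = 6,
        ttl_seconds: float = 1800.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_messages = max_messages
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._sessions: Dict[str, Tuple[float, Deque[Message]]] = {}

    def _purge_expired(self) -> None:
        """Drop every session idle for longer than the TTL."""
        now = self._clock()
        expired = [
            sid
            for sid, (last_seen, _) in self._sessions.items()
            if now - last_seen > self._ttl
        ]
        for sid in expired:
            del self._sessions[sid]

    def append(self, session_id: str, role: str, content: str) -> None:
        """Add a message; the deque silently discards the oldest when full."""
        self._purge_expired()
        _, messages = self._sessions.get(
            session_id, (0.0, deque(maxlen=self._max_messages))
        )
        messages.append((role, content))
        self._sessions[session_id] = (self._clock(), messages)

    def messages(self, session_id: str) -> List[Message]:
        """Return the current window, or an empty list if expired or unknown."""
        self._purge_expired()
        entry = self._sessions.get(session_id)
        return list(entry[1]) if entry else []


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
bounded_store = BoundedSessionStore(
    max_messages=4, ttl_seconds=60.0, clock=lambda: fake_now[0]
)
for i in range(3):
    bounded_store.append("user_123", "human", f"question {i}")
    bounded_store.append("user_123", "ai", f"answer {i}")

window = bounded_store.messages("user_123")
fake_now[0] = 61.0  # simulate 61 seconds of inactivity
after_expiry = bounded_store.messages("user_123")
print(window)
print(after_expiry)
```

A fixed window is the simplest policy, but it drops early context entirely. Summarizing older turns, as the tips suggest, trades extra LLM calls for retaining that context.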

**Next:** [05_branched_rag.ipynb](05_branched_rag.ipynb) - Multi-query parallel retrieval