# RAG with Short-term and Long-term Memory
Building a conversational RAG chatbot with STM and LTM using LangChain, ChromaDB, and OpenAI

**Objectives**

- Maintain conversations across multiple turns

- Recall information from previous interactions

- Retrieve domain-specific knowledge

- Scale to long-term personalized interactions

To achieve this, we designed a Conversational Retrieval-Augmented Generation (RAG) System equipped with both:

- Short-Term Memory (STM) for multi-turn dialogue

- Vector-Based Long-Term Memory (LTM) for storing persistent knowledge
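As a rough mental model of the two layers, the sketch below uses plain Python only. It is not the LangChain or Chroma implementation: `ShortTermMemory`, `LongTermMemory`, and the word-overlap scoring are hypothetical stand-ins for the chat history and the vector store used later in this notebook.

```python
from collections import defaultdict, deque


class ShortTermMemory:
    """Keeps only the most recent turns of each session (hypothetical stand-in)."""

    def __init__(self, max_turns=6):
        self._sessions = defaultdict(lambda: deque(maxlen=max_turns))

    def add(self, session_id, role, text):
        self._sessions[session_id].append((role, text))

    def history(self, session_id):
        return list(self._sessions[session_id])


class LongTermMemory:
    """Stores facts across sessions; word overlap stands in for embeddings."""

    def __init__(self):
        self._facts = []

    def add(self, text):
        self._facts.append(text)

    def search(self, query, k=1):
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(f.lower().split())), f) for f in self._facts]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in scored[:k] if score > 0]


stm = ShortTermMemory(max_turns=2)
stm.add("s1", "user", "Hi")
stm.add("s1", "assistant", "Hello!")
stm.add("s1", "user", "What is an agent?")

ltm = LongTermMemory()
ltm.add("User birthday is 14th October")
ltm.add("User likes hiking")

recent = stm.history("s1")                    # the oldest turn has been dropped
recalled = ltm.search("when is my birthday")  # found regardless of session
```

The short-term layer forgets by design, while the long-term layer is searched by relevance rather than recency.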

In [None]:
#!pip install "langchain>=1.0.7" "langchain-community>=0.4.1" "langchain-openai>=1.0.3" "langchain_groq>=1.0.1" "langchain_google_genai>=3.0.3" "langchain-chroma>=1.0.0" "langchain-text-splitters>=1.0.0" "html2text>=2025.4.15"

In [2]:
# Initialize LLM
import os
import langchain
from langchain_openai import ChatOpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini")
response = llm.invoke("What is the capital of France?")

print(f"answer is: {response.content}")

answer is: The capital of France is Paris.


## Build a Chroma Vector DB
We will scrape a web page, split it into chunks, and store the chunk embeddings in the vector DB.

In [3]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_transformers import Html2TextTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma  # provided by the langchain-chroma package installed above
from langchain_openai import OpenAIEmbeddings

# Load webpage
urls = ["https://lilianweng.github.io/posts/2023-06-23-agent/"]
loader = WebBaseLoader(urls)
docs = loader.load()

# HTML → text
html2text = Html2TextTransformer()
docs = html2text.transform_documents(docs)

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = splitter.split_documents(docs)

# Vectorstore
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

USER_AGENT environment variable not set, consider setting it to identify your requests.
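To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph and line boundaries, which this hypothetical `split_with_overlap` helper ignores.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-width character chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.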


In [4]:
# quick similarity search test
results = vectorstore.similarity_search("What is a LLM agent?", k=3)
print("\n🔍 Top 3 matching chunks:")
# results
for i, doc in enumerate(results):
    print(f"\n[{i+1}] Content: {doc.page_content[:100]}...")
    print(f"\n[{i+1}] Metadata: {doc.metadata}...")
    # print(f"\n[{i+1}] Similarity Score: {doc}...")
    print("=================================================================")



🔍 Top 3 matching chunks:

[1] Content: Proof-of-Concept Examples Challenges Citation References Building agents with LLM (large language mo...

[1] Metadata: {'language': 'en', 'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log", 'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The a

## Define Conversational Prompt

In [7]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
You are a helpful Q&A assistant. Use the context below and the conversation history
to answer the user's last question. If the answer is not in the context, say so.

Conversation history:
{chat_history}

Retrieved context:
{context}

User question:
{question}
""")
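To see roughly what the model receives once the three placeholders are filled, the template can be rendered by hand. This sketch uses plain `str.format` rather than `ChatPromptTemplate`; the `format_history` helper and the sample turns are hypothetical and only illustrate one readable way to flatten a history into text.

```python
TEMPLATE = """Conversation history:
{chat_history}

Retrieved context:
{context}

User question:
{question}"""


def format_history(turns):
    """Render (role, text) pairs as one 'role: text' line per turn."""
    return "\n".join(f"{role}: {text}" for role, text in turns)


turns = [
    ("human", "What is an AI agent?"),
    ("ai", "A system that plans and acts."),
]

rendered = TEMPLATE.format(
    chat_history=format_history(turns),
    context="Agents combine planning, memory, and tool use.",
    question="Did you remember my first question?",
)
print(rendered)
```

Because the history is part of the prompt text, every earlier turn adds to the token count of each new request.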

## Define RAG Pipeline

In [8]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Retriever + formatting pipeline
from operator import itemgetter

def format_docs(docs):
    # convert list of Document objects → string
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {
        # send ONLY the question string → retriever → format docs
        "context": itemgetter("question") | retriever | RunnableLambda(format_docs),

        # send the question string into the prompt
        "question": itemgetter("question"),

        # send chat history array as string (RunnableWithMessageHistory injects this)
        "chat_history": itemgetter("chat_history"),
    }
    | prompt
    | llm
)
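The dictionary at the head of the chain fans the same input out to several branches and collects the results under the keys the prompt expects. A minimal sketch of that data flow in plain Python follows; `fake_retriever` and `run_parallel` are hypothetical names, and this `format_docs` joins plain strings rather than `Document` objects.

```python
def fake_retriever(question):
    """Hypothetical retriever: returns canned chunks instead of querying Chroma."""
    return ["Agents use planning.", "Agents use memory."]


def format_docs(chunks):
    return "\n\n".join(chunks)


def run_parallel(branches, inputs):
    """Call every branch on the same input and collect the results by key."""
    return {key: branch(inputs) for key, branch in branches.items()}


branches = {
    "context": lambda x: format_docs(fake_retriever(x["question"])),
    "question": lambda x: x["question"],
    "chat_history": lambda x: x["chat_history"],
}

prompt_inputs = run_parallel(
    branches, {"question": "What is an agent?", "chat_history": []}
)
print(prompt_inputs)
```

Only the `context` branch does real work; the other two pass values through so the prompt receives all three fields at once.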


## Define Multi-turn Conversation with Short-term Memory

In [9]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

# in-memory session store
store = {}

def get_history(session_id):
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

conversation_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_history,
    input_messages_key="question",        # the new user input
    history_messages_key="chat_history"   # pipeline expects {chat_history}
)
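Conceptually, a history wrapper performs a load, invoke, append cycle on every call. The sketch below imitates that cycle with a toy chain. `HistoryWrapper`, `get_session`, and `echo_chain` are hypothetical and do not reproduce the internals of `RunnableWithMessageHistory`.

```python
class HistoryWrapper:
    """Hypothetical stand-in for the load-invoke-append cycle of a history wrapper."""

    def __init__(self, chain, get_history):
        self.chain = chain
        self.get_history = get_history

    def invoke(self, inputs, session_id):
        history = self.get_history(session_id)
        # Pass a copy so the chain sees the history as it was before this turn.
        answer = self.chain({**inputs, "chat_history": list(history)})
        history.append(("human", inputs["question"]))
        history.append(("ai", answer))
        return answer


sessions = {}


def get_session(session_id):
    return sessions.setdefault(session_id, [])


def echo_chain(inputs):
    """Toy chain: reports how many earlier messages it was given."""
    return f"seen {len(inputs['chat_history'])} earlier messages"


bot = HistoryWrapper(echo_chain, get_session)
first = bot.invoke({"question": "What is an AI agent?"}, "memory_test_01")
second = bot.invoke({"question": "How do LLM-based agents work?"}, "memory_test_01")
other = bot.invoke({"question": "Hello"}, "another_session")
```

Histories are keyed by session ID, so a new ID starts with an empty conversation even within the same process.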

## Test the Short-term Memory

In [None]:
session_id = "memory_test_01"

print("\n---------------- TURN 1 ----------------")
response1 = conversation_rag_chain.invoke(
    {"question": "What is an AI agent?"},
    config={"configurable": {"session_id": session_id}}
)
print("A1:", response1)

print("\n---------------- TURN 2 ----------------")
response2 = conversation_rag_chain.invoke(
    {"question": "How do LLM-based agents work?"},
    config={"configurable": {"session_id": session_id}}
)
print("A2:", response2)

print("\n---------------- TURN 3 ----------------")
response3 = conversation_rag_chain.invoke(
    {"question": "Did you remember my first question?"},
    config={"configurable": {"session_id": session_id}}
)
print("A3:", response3)

print("\n---------------- TURN 4 ----------------")
response4 = conversation_rag_chain.invoke(
    {"question": "Summarize everything we have discussed so far."},
    config={"configurable": {"session_id": session_id}}
)
print("A4:", response4)

print("\n---------------- TURN 5 ----------------")
response5 = conversation_rag_chain.invoke(
    {"question": "Based on our discussion, what should I learn next?"},
    config={"configurable": {"session_id": session_id}}
)
print("A5:", response5)

print("\n---------------- TURN 6 ----------------")
response6 = conversation_rag_chain.invoke(
    {"question": "What was my second question again?"},
    config={"configurable": {"session_id": session_id}}
)
print("A6:", response6)


---------------- TURN 1 ----------------




A1: content='The context does not provide a specific definition for an AI agent. However, based on general knowledge, an AI agent can be described as a system or algorithm that employs artificial intelligence techniques to perform tasks in an autonomous manner, often making decisions based on its environment and goals. AI agents can plan, learn from experience, and interact with their surroundings, utilizing various functionalities such as memory and tool use.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 79, 'prompt_tokens': 456, 'total_tokens': 535, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CegbPrcQMxx0grTEph2vJVoRHkwAe', 'service_tier': 'default', 'finish

## Continuous Chat with Short-term Memory

In [13]:

def chat_with_rag(session_id="chat_user"):
    print("RAG Chatbot Ready! (type 'exit' or 'quit' to stop)")
    print("-----------------------------------------------------")

    while True:
        user_input = input("You: ")

        if user_input.lower() in ["exit", "quit", "bye"]:
            print("Bot: Goodbye!")
            break

        # invoke the chain with memory
        response = conversation_rag_chain.invoke(
            {"question": user_input},
            config={"configurable": {"session_id": session_id}},
        )

        print(f"Bot: {response}\n")


# start chatbot
chat_with_rag()

RAG Chatbot Ready! (type 'exit' or 'quit' to stop)
-----------------------------------------------------
Bot: content='Hello! How can I assist you today?' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 421, 'total_tokens': 430, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_51db84afab', 'id': 'chatcmpl-CegdBGFQQM6hKFXP6BhPtFgJezaV1', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='lc_run--6a979394-896b-4c27-8ade-cba955f89fd3-0' usage_metadata={'input_tokens': 421, 'output_tokens': 9, 'total_tokens': 430, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}

Bot: content="I'm just a computer pro

In [14]:
session_id="chat_user" #"run--8eb56645-b013-4d1e-98b2-66e1f123dd86-0"

chat_with_rag(session_id)


RAG Chatbot Ready! (type 'exit' or 'quit' to stop)
-----------------------------------------------------
Bot: content='Your name is Bibhu.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 6, 'prompt_tokens': 3917, 'total_tokens': 3923, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 2944}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CegfovzEfYL5sCOoxVYIHmT0XkHjR', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='lc_run--386d652e-1196-4942-a7be-ea0fd8cbe502-0' usage_metadata={'input_tokens': 3917, 'output_tokens': 6, 'total_tokens': 3923, 'input_token_details': {'audio': 0, 'cache_read': 2944}, 'output_token_details': {'audio': 0, 'reasoning': 0}}

Bot: Goodbye!


## Build Long-term Memory

In [15]:
# Create a Chroma DB as the long-term memory store

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma  # provided by the langchain-chroma package

ltm_embeddings = OpenAIEmbeddings()

# Create a separate Chroma DB for long-term memory
ltm_vectorstore = Chroma(
    collection_name="long_term_memory",
    embedding_function=ltm_embeddings
)

# Long-term memory retriever
ltm_retriever = ltm_vectorstore.as_retriever(search_kwargs={"k": 3})
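The `k` setting asks the retriever for the three stored memories closest to the query embedding. A minimal sketch of top-k retrieval follows, using toy 2-D vectors in place of real embeddings. Cosine similarity is an assumption made for illustration; the metric the store actually uses depends on its configuration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, items, k=3):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy 2-D vectors stand in for real embeddings.
memories = [
    ("birthday is 14th October", [1.0, 0.0]),
    ("likes hiking", [0.0, 1.0]),
    ("born in autumn", [0.8, 0.6]),
    ("prefers tea", [-1.0, 0.0]),
]

nearest = top_k([1.0, 0.0], memories, k=3)
print(nearest)
```

A top-k search always returns k items when at least k exist, so weakly related memories can still reach the prompt.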




In [16]:
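# Long-term memory update

def save_to_ltm(question, answer):
    """Store important conversational info as long-term memory."""
    # The chain returns an AIMessage; store its text rather than its repr
    answer_text = getattr(answer, "content", answer)
    text = f"User asked: {question}\nAssistant answered: {answer_text}"
    ltm_vectorstore.add_texts([text])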
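Storing every exchange makes the long-term store grow with material that is rarely worth recalling. One option is to filter before saving. The heuristic below is purely illustrative: `is_worth_saving`, `save_if_important`, and the regular expression are hypothetical, and a list stands in for the vector store.

```python
import re

# Hypothetical heuristic: first-person statements often carry personal facts.
PERSONAL_PATTERN = re.compile(r"\b(my|i am|i'm|i have|i like|i live)\b", re.IGNORECASE)


def is_worth_saving(question):
    """Return True for statements that look like personal facts, not questions."""
    text = question.strip()
    if text.endswith("?"):
        return False
    return bool(PERSONAL_PATTERN.search(text))


saved = []  # stands in for the vector store


def save_if_important(question, answer_text):
    if is_worth_saving(question):
        saved.append(f"User said: {question}\nAssistant answered: {answer_text}")


save_if_important("My birthday is on 14th October.", "Noted.")
save_if_important("What is backpropagation?", "An algorithm for computing gradients.")
save_if_important("When is my birthday?", "14th October.")
print(len(saved))
```

A keyword rule like this will miss facts phrased differently; it only shows where such a filter would sit in the save path.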

In [17]:
## Define RAG Pipeline with Long-term Memory

def join_text(data):
    return "\n\n".join([d.page_content for d in data])

rag_chain = (
    {
        "rag_context": itemgetter("question") | retriever | RunnableLambda(format_docs),
        "ltm_context": itemgetter("question") | ltm_retriever | RunnableLambda(format_docs),
        "question": itemgetter("question"),
        "chat_history": itemgetter("chat_history"),
    }
    | ChatPromptTemplate.from_template("""
You are a helpful AI assistant. Use ALL sources below:

1. Short-term chat history:
{chat_history}

2. Long-term memory (LTM):
{ltm_context}

3. Retrieved knowledge (RAG):
{rag_context}

Answer the final user question:
{question}
""")
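    | llm
)

# Re-wrap the new chain: conversation_rag_chain was built around the earlier
# rag_chain object, so reassigning rag_chain alone would leave it unchanged.
conversation_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_history,
    input_messages_key="question",
    history_messages_key="chat_history",
)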
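One Python detail matters whenever a chain is redefined after it has been wrapped: a wrapper keeps a reference to the object it was constructed with, so assigning a new value to the original variable name does not change what the wrapper runs. The sketch below shows this with plain functions; `Wrapper`, `chain_v1`, and `chain_v2` are hypothetical stand-ins.

```python
class Wrapper:
    """Hypothetical wrapper that stores the chain object it was given."""

    def __init__(self, chain):
        self.chain = chain

    def invoke(self, question):
        return self.chain(question)


def chain_v1(question):
    return f"[RAG only] {question}"


def chain_v2(question):
    return f"[RAG + LTM] {question}"


rag = chain_v1
wrapped = Wrapper(rag)

rag = chain_v2                      # rebinding the name does not touch `wrapped`
stale = wrapped.invoke("When is my birthday?")

wrapped = Wrapper(rag)              # rebuild the wrapper around the new chain
fresh = wrapped.invoke("When is my birthday?")

print(stale)
print(fresh)
```

In a notebook this is easy to miss, because the earlier wrapper remains in scope and keeps answering without raising any error.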


In [18]:
def chatbot_fn(message, history):
    session_id = "memory_test_02"  # new session for the LTM test

    answer = conversation_rag_chain.invoke(
        {"question": message},
        config={"configurable": {"session_id": session_id}},
    )

    # Save the exchange into long-term memory (text only, not the message repr)
    save_to_ltm(message, answer.content)

    return answer.content

In [20]:
session_id = "long_term_test_01"

print("=== TURN 1: Store a personal fact in LTM ===")
response1 = conversation_rag_chain.invoke(
    {"question": "My birthday is on 14th October."},
    config={"configurable": {"session_id": session_id}}
)
print("A1:", response1)

# Save to LTM (store the answer text, not the full message object)
save_to_ltm("My birthday is on 14th October.", response1.content)


print("\n=== TURN 2–5: Distract the model with unrelated queries ===")
unrelated_questions = [
    "Explain the difference between supervised and unsupervised learning.",
    "What is the capital of France?",
    "Tell me something about reinforcement learning.",
    "What is backpropagation?"
]

for i, q in enumerate(unrelated_questions, start=2):
    print(f"\n--- TURN {i}: {q} ---")
    r = conversation_rag_chain.invoke(
        {"question": q},
        config={"configurable": {"session_id": session_id}}
    )
    print(f"A{i}:", r)
    # these do NOT go to LTM; they only distract



print("\n=== TURN 6: Critical: Ask LTM-dependent question ===")
response6 = conversation_rag_chain.invoke(
    {"question": "When is my birthday?"},
    config={"configurable": {"session_id": session_id}}
)
print("A6:", response6)


print("\n=== TURN 7: Even stronger LTM recall test ===")
response7 = conversation_rag_chain.invoke(
    {"question": "Earlier you learned a fact about my personal life. What was it?"},
    config={"configurable": {"session_id": session_id}}
)
print("A7:", response7)


print("\n=== TURN 8: Ask for reasoning using long-term memory ===")
response8 = conversation_rag_chain.invoke(
    {"question": "Since my birthday is on that date, what zodiac sign am I?"},
    config={"configurable": {"session_id": session_id}}
)
print("A8:", response8)


print("\n=== TURN 9: Ask to explain how it remembered ===")
response9 = conversation_rag_chain.invoke(
    {"question": "How did you remember my birthday even after many unrelated questions?"},
    config={"configurable": {"session_id": session_id}}
)
print("A9:", response9)


=== TURN 1: Store a personal fact in LTM ===
A1: content='The retrieved context does not contain information related to birthdays or personal celebrations. Sorry, I cannot provide an answer to your statement about your birthday.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 28, 'prompt_tokens': 432, 'total_tokens': 460, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_560af6e559', 'id': 'chatcmpl-CegiS4T1t9mVvXPrGpYNrfJ3B5XIS', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='lc_run--1f463811-b603-4531-8cd5-54ebc0896d43-0' usage_metadata={'input_tokens': 432, 'output_tokens': 28, 'total_tokens': 460, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_d