# Conversational Q&A Chatbot with Memory
* Notebook by Adam Lang
* Date: 1/25/2025

# Overview of Chatbot Memory
* In a question and answer application such as a chatbot we want to be able to allow users to have a "back and forth" conversation.
* This means that the application has "memory" of previous questions/queries and answers so that information may be included in all future interactions the user has with the chatbot.
  * In a sense whis is like performing "Real-time RAG" on a real-time conversation.

# Project Overview
* In this notebook we will go over how to add logic to a Generative AI application to historical messages.
* **Specifically we will build a chatbot with memory that can converse with external website data.**

## Approaches to Chatbot Memory
1. Chains
  * This includes a retrieval step.

2. Agents
  * We give an LLM the ability to have discrtion over whether and how to execute retrieval steps or multiple steps.

# Install Dependencies

In [7]:
%%capture
!pip install langchain langchain_community langchain_groq langchain_openai langsmith langchain_core langchain-chroma langchain-huggingface

In [6]:
%%capture
!pip install bs4 ## for html scraping

In [9]:
%%capture
!pip install -U transformers
!pip install -U sentence-transformers

# Setup open source LLM via GROQ API

In [1]:
import os
from getpass import getpass

GROQ_API_KEY = getpass("Enter your GROQ API KEY: ")

Enter your GROQ API KEY: ··········


In [2]:
## setup groq env
os.environ['GROQ_API_KEY'] = GROQ_API_KEY

In [5]:
from langchain_groq import ChatGroq

## load LLM from GROQ API
llm = ChatGroq(groq_api_key=GROQ_API_KEY,
               model_name="Llama3-8b-8192")
llm

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x7800c8e52210>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x7800c8e52e50>, model_name='Llama3-8b-8192', model_kwargs={}, groq_api_key=SecretStr('**********'))

# Import Libraries
* Note about `create_stuff_documents_chain`
  * This chain stuffs or inserts a list of documents into the prompt and sends it to the LLM.
  * It is most ideal for applications where documents are small, and only a few are used at a time.
  * The stuff chain would fail if the document tokens exceed the LLM limit.

In [48]:
from langchain_chroma import Chroma # vector DB
from langchain_community.document_loaders import WebBaseLoader
import bs4 ## to scrape web data for WebBaseLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

## langchain chain functions
from langchain.chains import create_retrieval_chain ## interface to vector DB
from langchain.chains.combine_documents import create_stuff_documents_chain ## stuff document chain combines documents

## langchain chat history/memory functions
from langchain.chains import create_history_aware_retriever ##history/memory retriever
from langchain_core.prompts import MessagesPlaceholder ## key:val pairs for messages
from langchain_core.messages import AIMessage, HumanMessage


# Setup Embedding Model
* We will load an open-source embedding model from hugging face.

In [28]:
## setup embeddings via SentenceTransformers
#from sentence_transformers import SentenceTransformer

## load embeddings from HF
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

# Load Data via WebBaseLoader

In [20]:
## 1. load, chunk and index contents of a blog to creat retriever
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header") #html fields
        )
    ),
)
## load web docs
docs = loader.load()

In [22]:
## view a doc
docs[:10]

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes

# Chunk and Split Data
* Now that we loaded the data we need to chunk and split it before creating embeddings and storing it in a vector store.

In [29]:
## 1. init text splitter
text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000,
                                              chunk_overlap=200)

## 2. Create chunks
chunks = text_splitter.split_documents(docs)


## 3. Store Embeddings in Vector DB
vectorstore=Chroma.from_documents(documents=chunks,embedding=embeddings)

## 4. Create retriever
retriever=vectorstore.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x77ff7c0a1d10>, search_kwargs={})

# Prompt Template
* We now create the prompt template.

In [30]:
## prompt template
system_prompt = (
    """
    1. You are an expert assistant for question-answering tasks.
    2. Use the following pieces of retrieved context to answer the question.
    3. If you do not know the answer, say that you don't know the answer.
    4. Answer the question in three sentences maximum and keep the answer concise.
    "\n\n"
    "{context}"
    """
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),

    ]
)

# Create Chain
* This is the question-answer chain
* **This uses the `create_stuff_documents_chain`**

In [31]:
## create Q&A chain
question_answer_chain=create_stuff_documents_chain(llm, prompt)

## create RAG chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [39]:
## invoke chain
response=rag_chain.invoke({"input":"What is React?"})
response

{'input': 'What is React?',
 'context': [Document(id='678896b0-58cc-42ee-832a-5cbc9579f011', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent'}, page_content='ReAct (Yao et al. 2023) integrates reasoning and acting within LLM by extending the action space to be a combination of task-specific discrete actions and the language space. The former enables LLM to interact with the environment (e.g. use Wikipedia search API), while the latter prompting LLM to generate reasoning traces in natural language.\nThe ReAct prompt template incorporates explicit steps for LLM to think, roughly formatted as:\nThought: ...\nAction: ...\nObservation: ...\n... (Repeated many times)'),
  Document(id='20fb21a8-e2b7-48fb-905c-89255dd827c5', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent'}, page_content='Case Studies#\nScientific Discovery Agent#\nChemCrow (Bran et al. 2023) is a domain-specific example in which LLM is augmented with 13 expert-designed tools t

In [40]:
## get exact generative answer
response['answer']

'ReAct is a framework that integrates reasoning and acting within Large Language Models (LLMs) by extending the action space to combine task-specific discrete actions and the language space. This enables LLMs to interact with the environment and generate reasoning traces in natural language.'

## Summary
* We can see the retrieved contextual documents above.
* And finally we can see the generative answer from the context of the retrieved documents.

# Adding Chat History (Memory)
* What we will see below is that there is NO CONTEXT OR MEMORY in the rag_chain. We will fix this.

In [41]:
## invoke chain without memory
rag_chain.invoke({"input":"How do we create it?"})

{'input': 'How do we create it?',
 'context': [Document(id='033c89eb-5c21-40f9-978e-92b559b808c2', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent'}, page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.\nTask decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.'),
  Document(id='12ab73f5-3d22-48e9-8bcf-410b8ef1a1e1', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent'}, page_content="Prompt LM with 10

## Build chat history/memory

In [43]:
from langchain.chains import create_history_aware_retriever ##history/memory retriever
from langchain_core.prompts import MessagesPlaceholder ## key:val pairs for messages


## 1. contextual system prompt
context_q_system_prompt = (
    """
    Given a chat history and the latest user question which
    may reference previous context in the chat history, do the following:
      1. Formulate a standalone question which can be understood  without the chat history.
      2. Do NOT answer the question, ONLY reformulate it if needed and otherwise return it as is.
    """
)

## 2. contextual prompt
context_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", context_q_system_prompt),
        MessagesPlaceholder("chat_history"), ## conversation history stored here
        ("human", "{input}"),
    ]
)


In [44]:
## 3. create new retriever for history -- retrieves results from vector DB based on history
history_aware_retriever=create_history_aware_retriever(llm,
                                                       retriever,
                                                       context_q_prompt)
history_aware_retriever

RunnableBinding(bound=RunnableBranch(branches=[(RunnableLambda(lambda x: not x.get('chat_history', False)), RunnableLambda(lambda x: x['input'])
| VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_chroma.vectorstores.Chroma object at 0x77ff7c0a1d10>, search_kwargs={}))], default=ChatPromptTemplate(input_variables=['chat_history', 'input'], input_types={'chat_history': list[typing.Annotated[typing.Union[typing.Annotated[langchain_core.messages.ai.AIMessage, Tag(tag='ai')], typing.Annotated[langchain_core.messages.human.HumanMessage, Tag(tag='human')], typing.Annotated[langchain_core.messages.chat.ChatMessage, Tag(tag='chat')], typing.Annotated[langchain_core.messages.system.SystemMessage, Tag(tag='system')], typing.Annotated[langchain_core.messages.function.FunctionMessage, Tag(tag='function')], typing.Annotated[langchain_core.messages.tool.ToolMessage, Tag(tag='tool')], typing.Annotated[langchain_core.messages.ai.AIMessageChunk, Tag(tag='AIMessageChu

In [46]:
## 4. create prompt
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"), ## history/memory placeholder
        ("human", "{input}"),
    ]
)

## Create Chain with memory

In [47]:
## chain with memory
question_answer_chain=create_stuff_documents_chain(llm,qa_prompt)

## rag chain with memory
rag_chain=create_retrieval_chain(history_aware_retriever, question_answer_chain)

In [49]:
from langchain_core.messages import AIMessage, HumanMessage

chat_history = [] #store chat_history in list
## query
question = "What is React?"
response1=rag_chain.invoke({"input":question,"chat_history":chat_history})

## append chat history
chat_history.extend(
    [
        HumanMessage(content=question),
        AIMessage(content=response1['answer'])
    ]
)

## query 2
question2 = "Tell me more about it."
response2=rag_chain.invoke({"input":question,"chat_history":chat_history})
print(response2['answer'])

I apologize for the mistake earlier. According to the context, ReAct (not React) is a system that integrates reasoning and acting within a Large Language Model (LLM) by extending the action space to include both task-specific discrete actions and language space.


In [50]:
## view chat history
chat_history

[HumanMessage(content='What is React?', additional_kwargs={}, response_metadata={}),
 AIMessage(content='React is a system that integrates reasoning and acting within a Large Language Model (LLM) by extending the action space to include both task-specific discrete actions and language space. This allows the LLM to interact with the environment and generate reasoning traces in natural language.', additional_kwargs={}, response_metadata={})]

# Additional Chat History -- using session id's

In [51]:
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

store = {}

## function to get session history
def get_session_history(session_id: str) -> BaseChatMessageHistory:
  if session_id not in store:
    store[session_id] = ChatMessageHistory()
  return store[session_id]


## create rag chain
conversation_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

In [53]:
## invoke based on session id
conversation_rag_chain.invoke(
    {"input": "What is Maximum Inner Product Search?"},
    config={
        "configurable": {"session_id": "abc123"}
    }, # stores a key "abc123" in the `store` dict
)["answer"]

'Maximum Inner Product Search (MIPS) is a technique that allows for fast search of the nearest neighbors in a vector space by computing the maximum inner product between query vectors and stored vectors. This is often used in applications where the goal is to find the most similar or relevant items or entities to a given query, such as in recommendation systems, information retrieval, or computer vision.'

In [54]:
## continue same session id conversation
conversation_rag_chain.invoke(
    {"input": "What are some example algorithms?"},
    config={
        "configurable": {"session_id": "abc123"}
    }, # stores a key "abc123" in the `store` dict
)["answer"]

'Some example algorithms for Maximum Inner Product Search (MIPS) include: \n\n1. LSH (Locality-Sensitive Hashing), which uses a hashing function to map similar input items to the same buckets.\n2. ANNOY (Approximate Nearest Neighbors Oh Yeah), which uses random projection trees to search for nearest neighbors.\n3. FAISS (Facebook AI Similarity Search), which applies vector quantization to partition the vector space and search for nearest neighbors.\n4. ScaNN (Scalable Nearest Neighbors), which uses anisotropic vector quantization to search for nearest neighbors.\n\nThese algorithms are designed to provide fast and efficient search results while sacrificing some accuracy.'