# Conversational RAG with Query History Retriever

https://python.langchain.com/v0.2/docs/tutorials/qa_chat_history/

https://cloud.google.com/blog/products/ai-machine-learning/to-tune-or-not-to-tune-a-guide-to-leveraging-your-data-with-llms

## LangChain Chroma
https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/

## 1. Setup LLM

In [1]:
from dotenv import load_dotenv
import sys
import json

from langchain.prompts import PromptTemplate

# Load the file that contains the API keys - OPENAI_API_KEY
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

# setting path
sys.path.append('../')

from utils.create_chat_llm import create_gpt_chat_llm, create_cohere_chat_llm

# Try with GPT
llm = create_gpt_chat_llm()

## 2. Setup vector database

In [2]:
# 1. Load a couple of Blogs 
from langchain_community.document_loaders import WebBaseLoader

# Sample blogs on RAG that we will add to vector database
url1 = "https://cloud.google.com/blog/products/ai-machine-learning/to-tune-or-not-to-tune-a-guide-to-leveraging-your-data-with-llms"
url2 = "https://aws.amazon.com/blogs/aws/build-rag-and-agent-based-generative-ai-applications-with-new-amazon-titan-text-premier-model-available-in-amazon-bedrock/"

loader = WebBaseLoader(
    web_paths=(url1,url2)
)

docs = loader.load()


In [3]:
# 2. Chunk the blogs
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunked_documents = text_splitter.split_documents(docs)

In [4]:
# 3. Add chunks to the ChromaDB
from langchain_community.vectorstores import Chroma
# from langchain_chroma import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# load it into Chroma using default embedding all-MiniLM-L6-v2
collection_name = 'sample-blog'
collection_metadata = {'embedding': 'all-MiniLM-L6-v2'}

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

vector_store = Chroma(collection_name=collection_name, collection_metadata=collection_metadata, embedding_function=embedding_function)
vector_store.add_documents(chunked_documents)

retriever = vector_store.as_retriever()

## 2. Chat history retriever


An LLM call rewrites the retriever's *input* using the conversation context, so that a follow-up question can be searched as a standalone query.

#### Without history
query -> retriever
#### With history
(query, conversation history) -> LLM -> rephrased query -> retriever

https://api.python.langchain.com/en/latest/chains/langchain.chains.history_aware_retriever.create_history_aware_retriever.html

###### Prompt requires the *chat_history* 

If `chat_history` is empty, the input is passed directly to the retriever. If `chat_history` is present, the prompt and the LLM are used to generate a search query, and that search query is then passed to the retriever.
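
The branching described above can be sketched in plain Python. This is only an illustration of the control flow, not LangChain's implementation: `history_aware_retrieve`, `fake_rephrase`, and `fake_retrieve` are hypothetical stand-ins for the LLM and the vector store retriever, and messages are modeled as simple `(role, text)` tuples.

```python
from typing import Callable, List, Tuple

# (role, text) tuples stand in for LangChain message objects (an assumption for this sketch)
Message = Tuple[str, str]


def history_aware_retrieve(
    query: str,
    chat_history: List[Message],
    rephrase: Callable[[str, List[Message]], str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Send the query through the rephrasing step only when history exists."""
    if not chat_history:
        # No history: the raw input goes directly to the retriever.
        return retrieve(query)
    # History present: produce a standalone query first, then retrieve with it.
    standalone_query = rephrase(query, chat_history)
    return retrieve(standalone_query)


# Hypothetical stand-ins so the sketch runs without an LLM or a vector store.
def fake_rephrase(query: str, chat_history: List[Message]) -> str:
    return "How is RAG different from fine-tuning?"


def fake_retrieve(query: str) -> List[str]:
    return [f"doc matching: {query}"]


first = history_aware_retrieve("What is RAG?", [], fake_rephrase, fake_retrieve)
followup = history_aware_retrieve(
    "How is it different than fine tuning?",
    [("human", "What is RAG?"), ("ai", "RAG stands for Retrieval-Augmented Generation.")],
    fake_rephrase,
    fake_retrieve,
)
print(first)
print(followup)
```

In the first call the retriever sees the user's text unchanged; in the second it sees the rephrased question, which no longer depends on the pronoun "it".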

#### MessagesPlaceholder
Used for passing a list of messages (here, the chat history) into the prompt.

https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.chat.MessagesPlaceholder.html

In [5]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import MessagesPlaceholder

# This is the prompt used for generating the input/query from chat history and user input
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

# This is where you create the retriever
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

## 3. History aware conversation retriever chain

In [6]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# This is the prompt that is sent to the LLM for generating the response to the user input
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
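
To make the data flow of this chain easier to picture, here is a rough plain-Python sketch of one turn: retrieve chunks, "stuff" them into the `{context}` slot of the system prompt, place the chat history between the system and human messages, and generate an answer. It is a simplification under stated assumptions, not LangChain's code: `stuff_documents`, `run_rag_turn`, the shortened `SYSTEM_TEMPLATE`, the `"\n\n"` separator, and the `"context"` key in the result are all illustrative choices (the notebook itself only reads `response["answer"]`).

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (role, text), a stand-in for LangChain messages

# Shortened, hypothetical version of the system prompt used in the notebook
SYSTEM_TEMPLATE = "Use the following retrieved context to answer the question.\n\n{context}"


def stuff_documents(chunks: List[str], separator: str = "\n\n") -> str:
    """Concatenate all retrieved chunks into a single context string."""
    return separator.join(chunks)


def run_rag_turn(
    user_input: str,
    chat_history: List[Message],
    retrieve: Callable[[str, List[Message]], List[str]],
    generate: Callable[[List[Message]], str],
) -> Dict[str, object]:
    """Retrieve, build the message list, and generate an answer for one turn."""
    chunks = retrieve(user_input, chat_history)
    messages = (
        [("system", SYSTEM_TEMPLATE.format(context=stuff_documents(chunks)))]
        + list(chat_history)          # history sits where the placeholder is
        + [("human", user_input)]
    )
    return {"input": user_input, "context": chunks, "answer": generate(messages)}


# Hypothetical stand-ins for the history-aware retriever and the LLM
result = run_rag_turn(
    "What is RAG?",
    [],
    retrieve=lambda query, history: ["chunk A", "chunk B"],
    generate=lambda messages: f"answer built from {len(messages)} messages",
)
print(result["answer"])
```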

## 4. Tests

Because our chain includes a "chat_history" input, the caller needs to manage the chat history. We can achieve this by appending input and output messages to a list:

In [7]:
from langchain_core.messages import AIMessage, HumanMessage

chat_history = []

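def invoke_llm(user_input):
    """Run one conversational turn and record it in chat_history."""
    response = rag_chain.invoke({"input": user_input, "chat_history": chat_history})
    # Append this turn so the next call can resolve references such as "it"
    chat_history.extend(
        [
            HumanMessage(content=user_input),
            AIMessage(content=response["answer"]),
        ]
    )
    return response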
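
Since the caller owns the history list, it grows by two messages per turn and is sent to the LLM on every call. One possible way to keep it bounded is sketched below; this is an optional variation, not part of the tutorial, and `append_turn` and `max_turns` are hypothetical names. Messages are modeled as `(role, text)` tuples so the sketch runs without LangChain.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, text), a stand-in for HumanMessage / AIMessage


def append_turn(
    chat_history: List[Message],
    user_input: str,
    answer: str,
    max_turns: int = 5,
) -> List[Message]:
    """Record one turn in place and keep only the most recent max_turns turns."""
    chat_history.extend([("human", user_input), ("ai", answer)])
    # Each turn contributes two messages; drop the oldest ones beyond the cap.
    excess = len(chat_history) - 2 * max_turns
    if excess > 0:
        del chat_history[:excess]
    return chat_history


history: List[Message] = []
for i in range(4):
    append_turn(history, f"question {i}", f"answer {i}", max_turns=2)

print(history)
```

Dropping old turns trades some long-range context for a smaller prompt; whether that trade is acceptable depends on the conversation.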

In [8]:
input = "What is RAG?"

response = invoke_llm(input)

response['answer']

'RAG stands for Retrieval-Augmented Generation. It is an approach that ensures model outputs are grounded on your data by searching for information relevant to a query within your data and passing that information into the prompt. RAG allows for the retrieval of new context from data with each interaction, supporting fresh, updated, private, and large-scale data.'

In [9]:
input = "How is it different than fine tuning?"
response = invoke_llm(input)
response['answer']

'RAG focuses on ensuring model outputs are grounded on data by searching for relevant information within the data, while fine-tuning involves giving specific instructions to a model for a well-defined task, such as classification. RAG allows for the retrieval of new context from data with each interaction, while fine-tuning is more suitable for tasks like structured outputs from free text or classification.'