# CPI Q/A bot

This chatbot answers questions about Consumer Price Index (CPI) changes in the province of British Columbia (BC) in April 2024 by retrieving context from a proprietary data source and from the web.  The proprietary data source is a PDF report highlighting CPI changes in BC in April 2024 over a 12-month period.  The web data is retrieved with the You.com API.  The chatbot is implemented as a parallel chain in Langchain.

The code below uses the following package versions:
- openai==1.30.3
- langchain==0.2.1
- langchain_community==0.2.1
- langchain_openai==0.1.7
- langchain_text_splitters==0.2.0
- langchain_core==0.2.1
- numpy==1.26.4
- pandas==2.2.2

In [119]:
import langchain
import os

In [91]:
os.environ["YDC_API_KEY"] = "<Insert your YDC API key here>"
os.environ["OPENAI_API_KEY"] = "<Insert your OpenAI key here>"
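
Because both retrievers and the LLM fail at call time if a key is missing, it can help to check the keys up front. The helper below is a minimal sketch; the name `require_env` is hypothetical and not part of Langchain or the You.com client.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset, empty, or still holds
    the "<Insert ...>" placeholder used in this notebook.
    """
    value = os.environ.get(name, "")
    if not value or value.startswith("<Insert"):
        raise RuntimeError(f"Environment variable {name} is not set to a real key")
    return value
```

Calling `require_env("YDC_API_KEY")` and `require_env("OPENAI_API_KEY")` right after setting the keys surfaces a forgotten placeholder before any API request is made.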

## Instantiating the You.com Retriever in Langchain

Langchain provides a You.com retriever.  For more information, please visit: https://python.langchain.com/v0.1/docs/integrations/retrievers/you-retriever/

In [92]:
from langchain_community.retrievers.you import YouRetriever

ydc_retriever = YouRetriever(num_web_results = 10)

In [93]:
# Let's test it out
response = ydc_retriever.invoke("British Columbia’s Consumer Price Index (CPI) in April 2024 was 2.9% higher (unadjusted) than in April 2023.  How does this compare to the Canadian CPI?")
# Let's take a look at the first 3 responses
response[:3]

[Document(page_content='Mail to: BC Stats, Box 9410 Stn Prov Govt, Victoria BC V8W 9V1', metadata={'url': 'https://www2.gov.bc.ca/gov/content/data/statistics/economy/consumer-price-index', 'thumbnail_url': None, 'title': 'Consumer Price Index (CPI) - Province of British Columbia', 'description': "Looking for more data? Explore the B.C. Government's extensive collection of datasets, applications and web services · Please send your questions and service requests to BC Stats here"}),
 Document(page_content='Consumer Price Index (CPI) data', metadata={'url': 'https://www2.gov.bc.ca/gov/content/data/statistics/economy/consumer-price-index', 'thumbnail_url': None, 'title': 'Consumer Price Index (CPI) - Province of British Columbia', 'description': "Looking for more data? Explore the B.C. Government's extensive collection of datasets, applications and web services · Please send your questions and service requests to BC Stats here"}),
 Document(page_content="Shelter inflation has been a thorn 

## Creating a Vector DB retriever based on data from a PDF File

We are going to load a PDF file using the PyPDFLoader in Langchain.  We will then use the RecursiveCharacterTextSplitter in Langchain to split the documents into chunks that can be vectorized.  The vectorized chunks of text will be stored in a Facebook AI Similarity Search (FAISS) vector store.  This vector store will then be converted into a Langchain retriever.

In [94]:
from langchain_community.document_loaders import PyPDFLoader

# The PDF file we are using can be downloaded from: https://www2.gov.bc.ca/assets/gov/data/statistics/economy/cpi/cpi_highlights.pdf
# load the PDF file
loader = PyPDFLoader("bc_cpi_highlights.pdf")
docs = loader.load()

In [95]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# split the document into chunks, and vectorize these chunks in a FAISS database
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)
notes = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents=notes, embedding=embeddings)
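
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified, dependency-free sketch of fixed-size chunking with overlap. It is not the `RecursiveCharacterTextSplitter` algorithm, which additionally tries to split on separators such as paragraph and line breaks; the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so a sentence cut
    at a boundary still appears intact in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)` yields three chunks, each starting with the last character of the previous one.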

In [96]:
# test out the similarity search
query = "How much did food prices increase in April 2024?"
response = db.similarity_search(query, k=3)
response[0].page_content

'(excluding fish, seafood, and other marine products) \n(+2.1%). At the same time, fruit, fruit preparations, \nand nut s was the only major food category to \ndecrease in price (- 0.1%)  \nBritish Columbians paid more for both  health (+2.7%) \nand personal (+ 2.0%) care  when compared to \n12-months ago. Services, instead of items within \nthese categories, had the largest price increase. Personal services (such a hairdressing) cost 4.8% \nmore when compared to 12 -months ago, while the \ncost of health care services (such as eye and dental \ncare) increased by 4.3%.  Consumer Price \nIndex   \n \n \nReference date:  April  2024  Issue:  #24-04 Released:  May 21 , 2024 \n      \n-5.8-1.91.92.22.32.62.82.96.8\nClothing & FootwearHouseholdRecreationAlc., Tob., & CannabisHealth & PersonalFoodTransportationAll-itemsShelterInflation by Category\n% change, same month previous year'
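
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The brute-force sketch below illustrates that idea with toy 2-D vectors and Euclidean distance. The vectors, the labels, and the `top_k` helper are made up for illustration, and FAISS itself uses optimized index structures rather than this loop.

```python
import math

def top_k(query_vec, indexed, k=3):
    """Return the texts of the k entries closest to query_vec.

    `indexed` is a list of (text, vector) pairs; closeness is Euclidean distance.
    """
    scored = [(math.dist(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [text for _, text in scored[:k]]

# Toy "embeddings": real embedding vectors have hundreds or thousands of dimensions
toy_index = [
    ("food prices", [0.9, 0.1]),
    ("shelter costs", [0.1, 0.9]),
    ("grocery inflation", [0.8, 0.3]),
]
nearest = top_k([1.0, 0.0], toy_index, k=2)
```

Here `nearest` contains the two food-related entries, because their vectors lie closest to the query vector.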

In [97]:
# Create the retriever
faiss_retriever = db.as_retriever()

## Create an Ensemble Retriever using the You.Com Retriever and the FAISS Retriever

The Ensemble Retriever in Langchain combines the ranked results of multiple retrievers into a single list.  We will create an Ensemble Retriever that uses the You.com retriever and the FAISS vector store retriever defined above as its constituent retrievers, weighted equally.

In [98]:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers = [ydc_retriever, faiss_retriever], weights = [0.5, 0.5]
)
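
As far as we understand, `EnsembleRetriever` merges the ranked lists with weighted Reciprocal Rank Fusion: each document scores `weight / (rank + c)` in every list where it appears, and the scores are summed. The sketch below is a plain-Python illustration of that formula, not Langchain's code; the constant `c = 60`, the function name `weighted_rrf`, and the sample document labels are assumptions.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    A document earns weight / (rank + c) from each list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

# Hypothetical results from the web retriever and the PDF retriever
web_results = ["national CPI article", "BC CPI page", "shelter inflation news"]
pdf_results = ["BC CPI page", "food price chunk"]
fused = weighted_rrf([web_results, pdf_results], weights=[0.5, 0.5])
```

A document returned by both retrievers ("BC CPI page" here) accumulates score from both lists and therefore rises to the top of the fused ranking.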

## Instantiate the LLM

In [99]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)

## Create the Prompt Template

In [100]:
system_prompt = """
You are an assistant that answers questions pertaining to CPI (Consumer Price Index).  Please utilize the following retrieved context from the web and from a proprietary
datasource to provide an accurate answer to the question.  Please try and utilize numbers where applicable to substantiate your answer.  If you do not know the answer, simply say you do not 
know the answer.  Please keep the response concise.

{context}
"""

In [101]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

## Create a basic chain without chat history

We will test our chain first without chat history.

In [102]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

qa_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(ensemble_retriever, qa_chain)
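
The "stuff" strategy concatenates the text of all retrieved documents and places it in the `{context}` slot of the prompt. A minimal sketch of that step, assuming the documents are joined with a blank line; the helper `stuff_context` and the sample strings are hypothetical.

```python
def stuff_context(template: str, page_contents: list[str], separator: str = "\n\n") -> str:
    """Join the document texts and substitute them for {context} in the template."""
    return template.format(context=separator.join(page_contents))

template = "Answer using the context below.\n\n{context}"
prompt_text = stuff_context(template, ["BC CPI rose 2.9%.", "Canada CPI rose 2.7%."])
```

Because every retrieved document is included verbatim, the prompt grows with the number and size of the chunks, which is one reason the chunk size and the number of web results are worth tuning.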

In [103]:
response = rag_chain.invoke({"input": "How did the CPI in April 2024 in BC compare to the national CPI in Canada?"})

In [104]:
response["answer"]

"In April 2024, the Consumer Price Index (CPI) in British Columbia increased by 2.9% compared to April 2023. Nationally, Canada's CPI was up by 2.7% over the same period. Therefore, the CPI increase in British Columbia was slightly higher than the national average."

## Add chat history to our chatbot

Chat history is an integral component of any chat application, as the input query might require additional conversational context to be understood by the LLM.  We are going to add chat history to our chatbot, and contextualize the input prompts with chat history.

In [105]:
# Create a prompt that utilizes the chat history as context to reformulate the most recent input, as a standalone question that the LLM can comprehend
from langchain.chains import create_history_aware_retriever

contextualize_q_system_prompt = """
Given a chat history and the latest question, which might reference context in the chat history, formulate a standalone question, which can be understood without chat history.
Do not answer the question; just reformulate it if necessary, and otherwise return it as is.
"""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

In [106]:
# Create a chain that takes conversation history and contextualizes the prompt
history_aware_retriever = create_history_aware_retriever(llm, ensemble_retriever, contextualize_q_prompt)
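
The history-aware retriever adds one step before retrieval. By our reading of `create_history_aware_retriever`, when there is prior conversation the LLM rewrites the latest question into a standalone query, and when the history is empty the input goes to the retriever unchanged. The control flow can be sketched with stand-in functions; `fake_rewrite` and `fake_retrieve` below are hypothetical placeholders for the LLM call and the ensemble retriever.

```python
from typing import Callable

def history_aware_retrieve(
    question: str,
    chat_history: list[str],
    rewrite: Callable[[str, list[str]], str],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Rewrite the question only when there is history, then retrieve."""
    query = rewrite(question, chat_history) if chat_history else question
    return retrieve(query)

# Stand-ins for demonstration only: a real rewrite would call the LLM
def fake_rewrite(question: str, history: list[str]) -> str:
    return f"{question} (context: {history[-1]})"

def fake_retrieve(query: str) -> list[str]:
    return [f"result for: {query}"]
```

This is why a follow-up such as "How does that compare to the increase in food prices across the nation?" can still retrieve relevant documents: the retriever receives a reformulated question rather than the ambiguous "that".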

In [107]:
# rejig qa prompt to include the chat history
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

In [108]:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

# statefully manage session history
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
        
    return store[session_id]

In [109]:
# create chains that include message history
qa_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

In [110]:
from langchain_core.runnables.history import RunnableWithMessageHistory

# create final chain that ties everything together

conversation_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key = "input",
    history_messages_key = "chat_history",
    output_messages_key = "answer"
)
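
Roughly, `RunnableWithMessageHistory` looks up the history for the given `session_id`, passes it to the wrapped chain under `chat_history`, and afterwards appends the new question and answer to that history. The sketch below mimics this bookkeeping in plain Python; `with_history`, `echo_chain`, `demo_store`, and the tuple-based message format are illustrative assumptions, not Langchain APIs.

```python
from typing import Callable

# One message list per session ID, analogous to the `store` dictionary above
demo_store: dict[str, list[tuple[str, str]]] = {}

def with_history(chain: Callable[[dict], str], session_id: str, user_input: str) -> str:
    """Call `chain` with the session's history, then record the new turn."""
    history = demo_store.setdefault(session_id, [])
    # pass a copy so the chain sees only the messages from before this turn
    answer = chain({"input": user_input, "chat_history": list(history)})
    history.append(("human", user_input))
    history.append(("ai", answer))
    return answer

def echo_chain(inputs: dict) -> str:
    """Stand-in for the RAG chain: reports how much history it received."""
    return f"{len(inputs['chat_history'])} prior messages; you asked: {inputs['input']}"

first = with_history(echo_chain, "xyz_789", "Food prices?")
second = with_history(echo_chain, "xyz_789", "And nationally?")
```

The second call sees the two messages recorded by the first, while a call with a different `session_id` would start from an empty history.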

## Let's try it out!

In [111]:
conversation_rag_chain.invoke({"input": "How much did food prices increase in April 2024 in BC compared to April 2023?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'Food prices in British Columbia increased by 2.6% in April 2024 compared to April 2023.'

In [112]:
conversation_rag_chain.invoke({"input": "How does that compare to the increase in food prices across the nation?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'In April 2024, food prices in British Columbia increased by 2.6% compared to April 2023. Nationally, food prices in Canada increased by 2.3% over the same period. Thus, the increase in food prices in British Columbia was slightly higher than the national average.'

In [113]:
conversation_rag_chain.invoke({"input": "What contributed to the rising food prices in BC in April 2024?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'Several factors contributed to the rising food prices in British Columbia in April 2024:\n\n1. **Beef and Veal Prices**: Beef and veal prices rose by 0.8% in April 2024 and were 7.0% higher than in April 2023. This increase was driven by tight supplies and strong demand.\n\n2. **Pork Prices**: Wholesale pork prices increased by 2.9% in April 2024 and were 18.3% higher than in April 2023 due to higher demand.\n\n3. **Poultry Prices**: Although poultry prices decreased by 0.6% in April 2024, they were still 0.9% higher than in April 2023.\n\n4. **General Food Inflation**: The overall food price index in British Columbia climbed to 179.2 (+5.4%) over the latest 12-month average, reflecting broader food inflation trends.\n\nThese specific increases in meat prices, along with general food inflation trends, contributed to the overall rise in food prices in British Columbia in April 2024.'

In [114]:
conversation_rag_chain.invoke({"input": "How did the CPI in April 2024 in BC compare to the national CPI in Canada?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'In April 2024, the Consumer Price Index (CPI) in British Columbia increased by 2.9% compared to April 2023. Nationally, the CPI in Canada increased by 2.7% over the same period. Therefore, the CPI increase in British Columbia was slightly higher than the national average.'

## Create a Python class that encapsulates a chatbot that retrieves context from PDF files and the web

Let's create a Python class that encapsulates the code above.  This will enable users to easily create chatbots for new use cases with custom PDF files and prompts.

In [131]:
import secrets
class PDF_QA_Bot:
    def __init__(self, llm: ChatOpenAI, pdf_files: list[str], system_prompt: str, num_web_results_to_fetch: int = 10):
        
        self._llm = llm
        
        docs = self._load_pdf_documents(pdf_files)
        
        # split the docs into chunks, vectorize the chunks and load them into a vector store
        db = self._create_vector_store(docs)
        
        # create Langchain retriever from the vector store
        self._faiss_retriever = db.as_retriever()
        
        # create YDC retriever
        self._ydc_retriever = YouRetriever(num_web_results = num_web_results_to_fetch)

        # create ensemble retriever 
        self._ensemble_retriever = EnsembleRetriever(
            retrievers = [self._ydc_retriever, self._faiss_retriever], weights = [0.5, 0.5]
        )

        # create the system prompt from the user input
        self._system_prompt = system_prompt + "\n\n" + "{context}"

        self._contextualize_q_system_prompt = """
        Given a chat history and the latest question, which might reference context in the chat history, formulate a standalone question, which can be understood without chat history.
        Do not answer the question; just reformulate it if necessary, and otherwise return it as is.
        """

        # Create a prompt that utilizes the chat history as context to reformulate the most recent input, as a standalone question that the LLM can comprehend
        self._contextualize_q_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self._contextualize_q_system_prompt),
                MessagesPlaceholder("chat_history"),
                ("human", "{input}")
            ]
        )

        # use the instance's own LLM, retriever, and prompt rather than the notebook globals
        self._history_aware_retriever = create_history_aware_retriever(
            self._llm, self._ensemble_retriever, self._contextualize_q_prompt
        )

        # the QA prompt must use the system prompt that contains the {context} placeholder
        self._qa_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self._system_prompt),
                MessagesPlaceholder("chat_history"),
                ("human", "{input}")
            ]
        )

        self._messages_store = {}
        self._session_id = self._generate_session_id()

        # create chains that include message history, built from the instance's own components
        self._qa_chain = create_stuff_documents_chain(self._llm, self._qa_prompt)
        self._rag_chain = create_retrieval_chain(self._history_aware_retriever, self._qa_chain)

        # create final chain that ties everything together
        self._conversation_rag_chain = RunnableWithMessageHistory(
            self._rag_chain,
            self._get_session_history,
            input_messages_key = "input",
            history_messages_key = "chat_history",
            output_messages_key = "answer"
        )
        
    def _load_pdf_documents(self, pdf_files: list[str]) -> list:
        docs = []
        for file in pdf_files:
            file_loader = PyPDFLoader(file)
            docs.extend(file_loader.load())
        return docs
    
    def _create_vector_store(self, docs: list):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunked_docs = text_splitter.split_documents(docs)
        embeddings = OpenAIEmbeddings()
        return FAISS.from_documents(documents=chunked_docs, embedding=embeddings)
    
    
    def _get_session_history(self, session_id: str) -> BaseChatMessageHistory:
        """Statefully manage chat history, keyed by session ID, in this instance's own store"""
        if session_id not in self._messages_store:
            self._messages_store[session_id] = ChatMessageHistory()
            
        return self._messages_store[session_id]
    
    def _generate_session_id(self) -> str:
        session_id = secrets.token_urlsafe(16)
        return session_id
    
    def invoke_bot(self, input_str: str) -> str:
        input = {"input": input_str}
        config = {"configurable": {"session_id": self._session_id}}
        output = self._conversation_rag_chain.invoke(input, config)
        return output["answer"]
        

        

## Let's try it out!

In [132]:
conversational_rag_system_prompt = """You are an assistant that answers questions pertaining to CPI (Consumer Price Index).  Please utilize the following retrieved context from the web and from a proprietary
datasource to provide an accurate answer to the question.  Please try and utilize numbers where applicable to substantiate your answer.  If you do not know the answer, simply say you do not 
know the answer.  Please keep the response concise."""
conversational_rag = PDF_QA_Bot(llm, pdf_files=["bc_cpi_highlights.pdf"], system_prompt=conversational_rag_system_prompt, num_web_results_to_fetch=10)

In [133]:
conversational_rag.invoke_bot("How much did food prices increase in April 2024 in BC compared to April 2023?")

'Food prices in British Columbia increased by 2.6% in April 2024 compared to April 2023.'

In [134]:
conversational_rag.invoke_bot("How does that compare to the increase in food prices across the nation?")

'The increase in food prices in British Columbia (2.6%) in April 2024 compared to April 2023 is slightly higher than the national increase of 2.3% for the same period.'

In [135]:
conversational_rag.invoke_bot("What contributed to the rising food prices in BC in April 2024?")

'Several factors contributed to the rising food prices in British Columbia in April 2024:\n\n1. **Supply Constraints**: Tight supplies of certain products, such as beef and veal, which saw significant price increases due to strong demand and limited availability.\n2. **Higher Demand**: Increased demand for products like wholesale pork, which saw a substantial price rise due to higher demand after previous declines.\n3. **Global Events**: International events and climate conditions impacting harvests, such as wildfires and flooding, which have affected the supply chain and increased costs.\n4. **Inflation**: General inflationary pressures across various sectors, including transportation and energy, which indirectly affect food prices.\n5. **Corporate Behavior**: Allegations of price gouging by major grocery chains, which have been a subject of media and government attention.\n\nThese factors combined to drive the overall increase in food prices in the region.'