# LangChain Q & A bot with Web and PDF data retrieval

Let's first create a chatbot that retrieves context from a proprietary data source and the web to answer questions about Consumer Price Index (CPI) changes in the province of British Columbia (BC) in April 2024. The proprietary data source is a PDF report highlighting CPI changes in BC in April 2024 over a 12-month period. The web data needed to answer the questions is retrieved using the You.com API. The chatbot is implemented as a parallel chain in LangChain.

## Install all required packages

In [40]:
%%capture
! pip install openai==1.30.3
! pip install langchain==0.2.1
! pip install langchain_community==0.2.1
! pip install langchain_openai==0.1.7
! pip install langchain_text_splitters==0.2.0
! pip install langchain_core==0.2.1
! pip install numpy==1.26.4
! pip install pandas==2.2.2
! pip install python-dotenv==1.0.1
! pip install pypdf==4.2.0
! pip install faiss-cpu==1.8.0


In [41]:
import langchain
import os

In [42]:
# The YDC_API_KEY and OPENAI_API_KEY should be defined in a .env file
# Let's load the API keys in from the .env file
import dotenv
dotenv.load_dotenv(".env", override=True)


True
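
Before calling either API, it can help to confirm that both keys were actually loaded. The sketch below assumes the two variable names mentioned in the comment above (`YDC_API_KEY` and `OPENAI_API_KEY`); the helper `missing_keys` is a hypothetical name used for illustration and is not part of `dotenv`.

```python
import os

# Key names taken from the comment in the cell above
REQUIRED_KEYS = ("YDC_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]

# Report any key that did not make it into the environment
missing = missing_keys()
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
```

An empty list means both keys are present; any names printed point to entries that are missing from the `.env` file or left blank.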

## Instantiating the You.com Retriever in LangChain

LangChain provides a You.com retriever.  For more information, please visit: https://python.langchain.com/v0.1/docs/integrations/retrievers/you-retriever/

In [43]:
from langchain_community.retrievers.you import YouRetriever

ydc_retriever = YouRetriever(num_web_results = 10)

In [44]:
# Let's test it out
response = ydc_retriever.invoke("Has the inflation in Canada dropped in 2024?")
# Let's take a look at the first 3 responses
response[:3]

[Document(page_content="OTTAWA, Feb 20 (Reuters) - Canada's annual inflation rate slowed significantly more than expected to 2.9% in January and core price measures also eased, data showed on Tuesday, bringing forward bets for an early interest rate cut.", metadata={'url': 'https://www.reuters.com/world/americas/canadas-inflation-rate-drops-more-than-expected-29-january-2024-02-20/', 'thumbnail_url': None, 'title': "Canada's inflation rate slows and bolsters bets on early rate cut | Reuters", 'description': "Canada's annual inflation rate slowed significantly more than expected to 2.9% in January and core price measures also eased, data showed on Tuesday, bringing forward bets for an early interest rate cut."}),
 Document(page_content='The BoC projects headline inflation will remain around 3% in the first half of 2024, before cooling down to 2.5% by end-year. The central bank said last month that while interest rates had helped to bring down overall inflation, which touched a peak of 8

## Creating a Vector DB retriever based on data from a PDF File

We are going to load a PDF file using the PyPDFLoader in LangChain. We will then use the RecursiveCharacterTextSplitter in LangChain to split the documents into chunks that can be vectorized. The vectorized chunks of text will be stored in a Facebook AI Similarity Search (FAISS) vector store. This vector store will be converted into a LangChain retriever.

In [45]:
from langchain_community.document_loaders import PyPDFLoader

# The PDF file we are using can be downloaded from: https://www2.gov.bc.ca/assets/gov/data/statistics/economy/cpi/cpi_highlights.pdf
# load the PDF file
loader = PyPDFLoader("bc_cpi_highlights.pdf")
docs = loader.load()

In [46]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split the document into chunks, and vectorize these chunks in a FAISS database
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)
notes = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents=notes, embedding=embeddings)
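
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch of a fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph and line breaks); the function `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share some text.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows, each sharing chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# With chunk_size=4 and chunk_overlap=1, adjacent chunks share one character
example_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
```

The overlap means a sentence that straddles a chunk boundary is more likely to appear intact in at least one chunk, at the cost of storing some text twice.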

In [47]:
# test out the similarity search
query = "How much did food prices increase in April 2024?"
response = db.similarity_search(query, k=3)
response[0].page_content

'(excluding fish, seafood, and other marine products) \n(+2.1%). At the same time, fruit, fruit preparations, \nand nut s was the only major food category to \ndecrease in price (- 0.1%)  \nBritish Columbians paid more for both  health (+2.7%) \nand personal (+ 2.0%) care  when compared to \n12-months ago. Services, instead of items within \nthese categories, had the largest price increase. Personal services (such a hairdressing) cost 4.8% \nmore when compared to 12 -months ago, while the \ncost of health care services (such as eye and dental \ncare) increased by 4.3%.  Consumer Price \nIndex   \n \n \nReference date:  April  2024  Issue:  #24-04 Released:  May 21 , 2024 \n      \n-5.8-1.91.92.22.32.62.82.96.8\nClothing & FootwearHouseholdRecreationAlc., Tob., & CannabisHealth & PersonalFoodTransportationAll-itemsShelterInflation by Category\n% change, same month previous year'

In [48]:
# Create the retriever
faiss_retriever = db.as_retriever()

## Create an Ensemble Retriever using the You.Com Retriever and the FAISS Retriever

The Ensemble Retriever in LangChain ensembles results from multiple retrievers.  We will create an Ensemble Retriever with the FAISS vector store retriever and the You.com retriever that we defined above as constituent retrievers.

In [49]:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers = [ydc_retriever, faiss_retriever], weights = [0.5, 0.5]
)
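
To my understanding, the Ensemble Retriever merges the ranked lists from its constituent retrievers using weighted Reciprocal Rank Fusion (RRF). The sketch below shows that idea in plain Python on document identifiers. The function name `weighted_rrf`, the example identifiers, and the smoothing constant `c=60` (the value conventionally used for RRF) are assumptions for illustration, not a copy of LangChain's implementation.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists: each appearance adds weight / (rank + c) to a document's score."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")

    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)

    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the web retriever and the PDF retriever
web_hits = ["reuters_article", "boc_forecast", "bc_report_p1"]
pdf_hits = ["bc_report_p1", "bc_report_p2"]
fused = weighted_rrf([web_hits, pdf_hits], weights=[0.5, 0.5])
```

In this example `bc_report_p1` ranks first after fusion because it appears in both lists, even though it is only third in the web results. Adjusting `weights` shifts how much each retriever's ranking counts.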

## Instantiate the LLM

In [50]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)

## Create the Prompt Template

In [51]:
system_prompt = """
You are an assistant that answers questions pertaining to CPI (Consumer Price Index).  Please utilize the following retrieved context from the web and from a proprietary
datasource to provide an accurate answer to the question.  Please try and utilize numbers where applicable to substantiate your answer.  If you do not know the answer, simply say you do not 
know the answer.  Please keep the response concise.

{context}
"""

In [52]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

## Create a basic chain without chat history

We will test our chain first without chat history.

In [53]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

qa_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(ensemble_retriever, qa_chain)

In [54]:
response = rag_chain.invoke({"input": "How did the CPI in April 2024 in BC compare to the national CPI in Canada?"})

In [55]:
response["answer"]

"In April 2024, British Columbia's Consumer Price Index (CPI) was 2.9% higher than in April 2023. Nationally, Canada's CPI increased by 2.7% over the same period. Therefore, the CPI growth in British Columbia was slightly higher than the national average."

## Add chat history to our chatbot

Chat history is an integral component of any chat application, as the input query might require additional conversational context to be understood by the LLM.  We are going to add chat history to our chatbot and contextualize the input prompts with chat history.

In [56]:
# Create a prompt that utilizes the chat history as context to reformulate the most recent input, as a standalone question that the LLM can comprehend
from langchain.chains import create_history_aware_retriever

contextualize_q_system_prompt = """
Given a chat history and the latest question, which might reference context in the chat history, formulate a standalone question, which can be understood without chat history.
Do not answer the question, just reformulate the question if necessary and return it as is otherwise.
"""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

In [57]:
# Create a chain that takes conversation history and contextualizes the prompt
history_aware_retriever = create_history_aware_retriever(llm, ensemble_retriever, contextualize_q_prompt)

In [58]:
# rejig qa prompt to include the chat history
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

In [59]:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

# statefully manage session history
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
        
    return store[session_id]
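
The store keeps one history per session ID, so separate conversations do not leak into each other. The stand-in below shows the same pattern with plain lists in place of `ChatMessageHistory`; the names `sketch_store` and `get_history_sketch` are hypothetical and exist only for this illustration.

```python
# Maps a session ID to its list of (role, text) messages
sketch_store: dict[str, list[tuple[str, str]]] = {}

def get_history_sketch(session_id: str) -> list[tuple[str, str]]:
    """Return the message list for session_id, creating an empty one on first use."""
    return sketch_store.setdefault(session_id, [])

# Two sessions accumulate messages independently
get_history_sketch("session_a").append(("human", "How much did food prices increase?"))
get_history_sketch("session_b").append(("human", "What about shelter costs?"))
```

Looking up the same session ID again returns the same list, which is what lets follow-up questions see earlier turns.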

In [60]:
# create chains that include message history
qa_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

In [61]:
from langchain_core.runnables.history import RunnableWithMessageHistory

# create final chain that ties everything together

conversation_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key = "input",
    history_messages_key = "chat_history",
    output_messages_key = "answer"
)

## Let's try it out!

In [62]:
conversation_rag_chain.invoke({"input": "How much did food prices increase in April 2024 in BC compared to April 2023?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'In April 2024, food prices in British Columbia increased by 2.6% compared to April 2023.'

In [63]:
conversation_rag_chain.invoke({"input": "How does that compare to the increase in food prices across the nation?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'In April 2024, food prices in British Columbia increased by 2.6% compared to April 2023. Nationally, food prices in Canada increased by 2.3% over the same period. Therefore, the increase in food prices in British Columbia was slightly higher than the national average.'

In [67]:
conversation_rag_chain.invoke({"input": "What contributed to the rising food prices in BC in April 2024?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'The rising food prices in British Columbia in April 2024 were influenced by several factors:\n\n1. **General Inflation**: Overall inflationary pressures contributed to the rise in food prices.\n2. **Supply Chain Disruptions**: Ongoing issues in the supply chain, including transportation and logistics challenges, impacted the availability and cost of food items.\n3. **Climate Events**: Adverse climate events such as wildfires and flooding affected agricultural production, leading to higher prices for certain food products.\n4. **Increased Production Costs**: Rising costs for labor, energy, and other inputs necessary for food production also played a role.\n5. **High Demand for Specific Products**: Increased demand for items like meat, bakery goods, and vegetables drove up prices in these categories.\n\nThese factors combined to push food prices higher in British Columbia in April 2024.'

In [66]:
conversation_rag_chain.invoke({"input": "How did the CPI in April 2024 in BC compare to the national CPI in Canada?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

"In April 2024, the Consumer Price Index (CPI) in British Columbia increased by 2.9% compared to April 2023. Nationally, Canada's CPI was up by 2.7% over the same period. Therefore, the CPI increase in British Columbia was slightly higher than the national average."

## Create a Python class that encapsulates a chatbot that retrieves context from PDF files and the web

Let's create a Python class that encapsulates the code above.  This will enable users to easily create chatbots for new use cases with custom PDF files and prompts.

In [68]:
import secrets
class PDF_QA_Bot:
    def __init__(self, llm: ChatOpenAI, pdf_files: list[str], system_prompt: str, num_web_results_to_fetch: int = 10):
        
        self._llm = llm
        
        docs = self._load_pdf_documents(pdf_files)
        
        # split the docs into chunks, vectorize the chunks and load them into a vector store
        db = self._create_vector_store(docs)
        
        # create LangChain retriever from the vector store
        self._faiss_retriever = db.as_retriever()
        
        # create YDC retriever
        self._ydc_retriever = YouRetriever(num_web_results = num_web_results_to_fetch)

        # create ensemble retriever 
        self._ensemble_retriever = EnsembleRetriever(
            retrievers = [self._ydc_retriever, self._faiss_retriever], weights = [0.5, 0.5]
        )

        # create the system prompt from the user input
        self._system_prompt = system_prompt + "\n\n" + "{context}"

        self._contextualize_q_system_prompt = """
        Given a chat history and the latest question, which might reference context in the chat history, formulate a standalone question, which can be understood without chat history.
        Do not answer the question, just reformulate the question if necessary and return it as is otherwise.
        """

        # Create a prompt that utilizes the chat history as context to reformulate the most recent input, as a standalone question that the LLM can comprehend
        self._contextualize_q_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self._contextualize_q_system_prompt),
                MessagesPlaceholder("chat_history"),
                ("human", "{input}")
            ]
        )

        self._history_aware_retriever = create_history_aware_retriever(self._llm, self._ensemble_retriever, self._contextualize_q_prompt)

        self._qa_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self._system_prompt),
                MessagesPlaceholder("chat_history"),
                ("human", "{input}")
            ]
        )

        self._messages_store = {}
        self._session_id = self._generate_session_id()

        # create chains that include message history
        self._qa_chain = create_stuff_documents_chain(self._llm, self._qa_prompt)
        self._rag_chain = create_retrieval_chain(self._history_aware_retriever, self._qa_chain)

        # create final chain that ties everything together
        self._conversation_rag_chain = RunnableWithMessageHistory(
            self._rag_chain,
            self._get_session_history,
            input_messages_key = "input",
            history_messages_key = "chat_history",
            output_messages_key = "answer"
        )
        
    def _load_pdf_documents(self, pdf_files: list[str]) -> list:
        docs = []
        for file in pdf_files:
            file_loader = PyPDFLoader(file)
            docs.extend(file_loader.load())
        return docs
    
    def _create_vector_store(self, docs: list):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunked_docs = text_splitter.split_documents(docs)
        embeddings = OpenAIEmbeddings()
        return FAISS.from_documents(documents=chunked_docs, embedding=embeddings)
    
    
    def _get_session_history(self, session_id: str) -> BaseChatMessageHistory:
        """Statefully manage chat history, keyed by session ID"""
        if session_id not in self._messages_store:
            self._messages_store[session_id] = ChatMessageHistory()

        # Look up by the session_id argument that was checked above
        return self._messages_store[session_id]
    
    def _generate_session_id(self) -> str:
        session_id = secrets.token_urlsafe(16)
        return session_id
    
    def invoke_bot(self, input_str: str) -> str:
        # Named "payload" to avoid shadowing the built-in input()
        payload = {"input": input_str}
        config = {"configurable": {"session_id": self._session_id}}
        output = self._conversation_rag_chain.invoke(payload, config=config)
        return output["answer"]
        

        

## Let's try it out on a new use case with a different document!

We will utilize our object to implement a chatbot that answers questions about student housing and residences at the University of Toronto.  The chatbot retrieves relevant context from a PDF brochure about student housing options at the University of Toronto and the web using the You.com API to answer questions.

In [69]:
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)
conversational_rag_system_prompt = """You are an assistant that answers questions pertaining to student housing and residence options at the University of Toronto (UofT).  Please utilize relevant context from the web and a brochure about student housing options
to provide an accurate response to the question.  If you do not know the answer to a question, simply say you do not know the answer."""

# The PDF file being used in this example can be downloaded from: https://studentlife.utoronto.ca/wp-content/uploads/Student-Housing-Brochure.pdf
conversational_rag = PDF_QA_Bot(llm, pdf_files=["Student-Housing-Brochure.pdf"], system_prompt=conversational_rag_system_prompt, num_web_results_to_fetch=10)

In [70]:
conversational_rag.invoke_bot("Which residences at the University of Toronto do not offer a meal plan?")

'The residences at the University of Toronto that do not offer a mandatory meal plan are:\n\n- Innis College\n- Woodsworth College\n- Graduate House\n- University Family Housing\n\nResidents of these accommodations can purchase optional meal plans for use around campus if they wish. For more information on optional meal plans, you can visit [foodservices.utoronto.ca](http://foodservices.utoronto.ca).'

In [71]:
conversational_rag.invoke_bot("What food establishments are located nearby the Innis College?")

"Nearby Innis College, you can find several food establishments that cater to various tastes and dietary preferences:\n\n1. **Innis Cafe**: Located within Innis College, this cafe offers freshly-made, healthy food with options for vegan and halal diets. It is known for its friendly staff and quick service, making it a popular spot for students.\n\n2. **MeeT You 177 on College**: Just minutes from U of T, this place offers bubble tea and Chinese food, including favorites like Kung Pao chicken, dumplings, and beef with vegetables.\n\n3. **Einstein's Pub**: Located on College Street, this pub is a go-to spot for students, especially after exams. It offers a variety of craft beers and a simple menu of bar food, including fantastic wings.\n\n4. **Diabolos' Coffee Bar**: Situated in the Junior Common Room of University College, this student-owned and operated cafe offers fair trade coffee, tea, and snacks, including freshly baked muffins and vegan treats.\n\n5. **Second Cup at New College**:

In [33]:
conversational_rag.invoke_bot("Which residences are female only at UofT?")

"At the University of Toronto, the residences that are female-only are:\n\n1. **Loretto College**: This residence is specifically for female students and is affiliated with St. Michael's College.\n\nThese residences provide a supportive environment tailored to female students. If you have specific needs or preferences, it's always a good idea to contact the residence directly for more detailed information."

In [34]:
conversational_rag.invoke_bot("Is a meal plan mandatory at Loretto College?")

'Yes, a meal plan is mandatory at Loretto College. All residents are required to participate in the meal plan because there are no cooking facilities in the dorms. The dining hall offers a variety of food options, including vegan and halal choices, to cater to different dietary needs.'

In [35]:
conversational_rag.invoke_bot("What is the cost of a meal plan at Loretto College?")

'The cost of the meal plan at Loretto College is included in the overall residence fees. For the 2023-24 academic year, the total cost for living at Loretto College ranges from $17,284 to $18,714. This fee covers both the accommodation and the mandatory meal plan.'

In [36]:
conversational_rag.invoke_bot("What room types are offered at Loretto College?")

'At Loretto College, the room types offered include:\n\n1. **Double Rooms**: Available for first-year students. Two double rooms share an ensuite washroom.\n2. **Single Rooms**: Available for upper-year students. Single rooms have an individual sink in the room and share a common washroom with other single rooms.\n\nThese room types provide different levels of privacy and convenience, catering to the needs of both first-year and upper-year students.'

In [37]:
conversational_rag.invoke_bot("Are there any attractions, museums, fun things to do close to Loretto College?")

"Yes, there are several attractions, museums, and fun activities near Loretto College, given its prime location in downtown Toronto. Here are some notable options:\n\n1. **Royal Ontario Museum (ROM)**: Just steps away from Loretto College, the ROM is one of North America's largest museums, featuring art, culture, and nature from around the globe.\n\n2. **Yorkville**: This vibrant neighborhood is known for its upscale shopping, dining, and cultural attractions. It's a great place to explore boutique stores, art galleries, and trendy cafes.\n\n3. **Bata Shoe Museum**: A unique museum dedicated to footwear from around the world, showcasing a collection of over 13,000 shoes.\n\n4. **Gardiner Museum**: Specializing in ceramics, this museum offers engaging exhibitions and hands-on workshops.\n\n5. **Casa Loma**: A short distance away, this historic castle offers tours, beautiful gardens, and themed events.\n\n6. **Queen’s Park**: A large public park that is perfect for a relaxing walk or a p

In [38]:
conversational_rag.invoke_bot("What is there to do in the Yorkville neighborhood?")

'Yorkville is one of Toronto\'s most vibrant and upscale neighborhoods, offering a wide range of activities and attractions. Here are some things you can do in Yorkville:\n\n### Shopping:\n1. **Designer Boutiques**: Explore high-end designer stores like Chanel, Gucci, Hermès, Louis Vuitton, and Prada. This stretch of Bloor Street is known as the "Mink Mile" due to its concentration of luxury brands.\n2. **Holt Renfrew**: Visit this flagship luxury department store, which features brands like Giorgio Armani, Dolce & Gabbana, and Valentino, along with a salon, spa, restaurant, and cafe.\n3. **Yorkville Village Shopping Center**: A chic shopping center with trendy fashion and lifestyle brands.\n\n### Dining:\n1. **High-End Restaurants**: Enjoy fine dining at renowned restaurants offering a variety of international cuisines.\n2. **Cafes and Bistros**: Relax at one of the many quaint cafes and bistros scattered throughout the neighborhood.\n\n### Art and Culture:\n1. **Art Galleries**: Visi

In [39]:
conversational_rag.invoke_bot("What restaurants are located in the neighborhood?")

"Yorkville is home to a diverse and impressive array of restaurants, catering to various tastes and preferences. Here are some notable dining options in the neighborhood:\n\n### Fine Dining:\n1. **Sassafraz**: Known for its contemporary Canadian cuisine, Sassafraz is set in a beautifully restored Victorian rowhouse and offers a magical dining experience.\n2. **Café Boulud**: Located in the Four Seasons Hotel, this French brasserie by the renowned Chef Daniel Boulud offers a luxurious dining experience.\n3. **ONE Restaurant**: Chef Mark McEwan’s culinary concept at The Hazelton Hotel, known for its French and Italian flavors, craft cocktails, and a spacious patio.\n\n### Casual and Contemporary Dining:\n4. **Planta**: An upscale, plant-based restaurant that promotes environmental sustainability with innovative vegan cuisine.\n5. **Alobar Yorkville**: A Michelin-starred restaurant offering a chic dining experience with a focus on upscale cocktails and contemporary dishes.\n6. **Dimmi Bar