# LangChain Q & A bot with Web and PDF data retrieval

Let's first create a chatbot that retrieves context from a proprietary data source and the web to answer questions about Consumer Price Index (CPI) changes in the province of British Columbia (BC) in April 2024.  The proprietary data source is a PDF report highlighting CPI changes in BC in April 2024 over a 12-month period.  The web data needed to answer the question is retrieved using the You.com API.  The chatbot is implemented as a parallel chain in LangChain.

## Install all required packages

In [3]:
%%capture
! pip install openai==1.30.3
! pip install langchain==0.2.1
! pip install langchain_community==0.2.1
! pip install langchain_openai==0.1.7
! pip install langchain_text_splitters==0.2.0
! pip install langchain_core==0.2.1
! pip install numpy==1.26.4
! pip install pandas==2.2.2
! pip install python-dotenv==1.0.1
! pip install pypdf==4.2.0
! pip install faiss-cpu==1.8.0


In [4]:
import langchain
import os

In [5]:
# The YDC_API_KEY and OPENAI_API_KEY should be defined in a .env file
# Let's load the API keys in from the .env file
import dotenv
dotenv.load_dotenv(".env", override=True)


True

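Before going further, it can help to confirm that both keys were actually loaded. The following is a minimal sketch, assuming the keys are exposed as environment variables named `YDC_API_KEY` and `OPENAI_API_KEY`, as the comment above states. The helper `missing_keys` is a hypothetical name introduced here for illustration and is not part of any library.

```python
import os

# Names taken from the comment in the cell above.
REQUIRED_KEYS = ["YDC_API_KEY", "OPENAI_API_KEY"]


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```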
## Instantiating the You.com Retriever in LangChain

LangChain provides a You.com retriever.  For more information, please visit: https://python.langchain.com/v0.1/docs/integrations/retrievers/you-retriever/

In [6]:
from langchain_community.retrievers.you import YouRetriever

ydc_retriever = YouRetriever(num_web_results = 10)

In [7]:
# Let's test it out
response = ydc_retriever.invoke("Has the inflation in Canada dropped in 2024?")
# Let's take a look at the first 3 responses
response[:3]

[Document(page_content="OTTAWA, Feb 20 (Reuters) - Canada's annual inflation rate slowed significantly more than expected to 2.9% in January and core price measures also eased, data showed on Tuesday, bringing forward bets for an early interest rate cut.", metadata={'url': 'https://www.reuters.com/world/americas/canadas-inflation-rate-drops-more-than-expected-29-january-2024-02-20/', 'thumbnail_url': None, 'title': "Canada's inflation rate slows and bolsters bets on early rate cut | Reuters", 'description': "Canada's annual inflation rate slowed significantly more than expected to 2.9% in January and core price measures also eased, data showed on Tuesday, bringing forward bets for an early interest rate cut."}),
 Document(page_content='The BoC projects headline inflation will remain around 3% in the first half of 2024, before cooling down to 2.5% by end-year. The central bank said last month that while interest rates had helped to bring down overall inflation, which touched a peak of 8

## Creating a Vector DB retriever based on data from a PDF File

We are going to load a PDF file using the PyPDFLoader in LangChain.  We will then use the RecursiveCharacterTextSplitter in LangChain to split the documents into chunks that can be vectorized.  The vectorized chunks of text will be stored in a Facebook AI Similarity Search (FAISS) vector store.  This vector store will be converted into a LangChain retriever.

In [8]:
from langchain_community.document_loaders import PyPDFLoader

# The PDF file we are using can be downloaded from: https://www2.gov.bc.ca/assets/gov/data/statistics/economy/cpi/cpi_highlights.pdf
# load the PDF file
loader = PyPDFLoader("bc_cpi_highlights.pdf")
docs = loader.load()

In [9]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
# import from the dedicated langchain_text_splitters package installed above
from langchain_text_splitters import RecursiveCharacterTextSplitter

# split the document into chunks, and vectorize these chunks in a FAISS database
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 100)
notes = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(documents=notes, embedding=embeddings)

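To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` is implemented: the real splitter tries to break on separators such as paragraph and line breaks before falling back to smaller units. The function `chunk_text` and the sample string are hypothetical and only illustrate how consecutive chunks share characters at their boundary.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters that overlap
    by `chunk_overlap` characters (simplified, separator-agnostic)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```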
In [10]:
# test out the similarity search
query = "How much did food prices increase in April 2024?"
response = db.similarity_search(query, k=3)
response[0].page_content

'(excluding fish, seafood, and other marine products) \n(+2.1%). At the same time, fruit, fruit preparations, \nand nut s was the only major food category to \ndecrease in price (- 0.1%)  \nBritish Columbians paid more for both  health (+2.7%) \nand personal (+ 2.0%) care  when compared to \n12-months ago. Services, instead of items within \nthese categories, had the largest price increase. Personal services (such a hairdressing) cost 4.8% \nmore when compared to 12 -months ago, while the \ncost of health care services (such as eye and dental \ncare) increased by 4.3%.  Consumer Price \nIndex   \n \n \nReference date:  April  2024  Issue:  #24-04 Released:  May 21 , 2024 \n      \n-5.8-1.91.92.22.32.62.82.96.8\nClothing & FootwearHouseholdRecreationAlc., Tob., & CannabisHealth & PersonalFoodTransportationAll-itemsShelterInflation by Category\n% change, same month previous year'

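Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with hand-written 2-D vectors and a brute-force scan, assuming Euclidean (L2) distance as the metric. The vectors and the helper `top_k` are hypothetical; real embeddings have many more dimensions, and FAISS uses optimized index structures rather than a Python loop.

```python
import math


def top_k(query_vec, doc_vecs, k):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Toy "embeddings" for four chunks (hypothetical values).
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
query_vec = (1.0, 0.0)

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)
```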
In [11]:
# Create the retriever
faiss_retriever = db.as_retriever()

## Create an Ensemble Retriever using the You.com Retriever and the FAISS Retriever

The Ensemble Retriever in LangChain combines the results of multiple retrievers into a single ranked list, reranking them with weighted Reciprocal Rank Fusion.  We will create an Ensemble Retriever with the FAISS vector store retriever and the You.com retriever defined above as its constituent retrievers, weighted equally.

In [12]:
from langchain.retrievers import EnsembleRetriever

ensemble_retriever = EnsembleRetriever(
    retrievers = [ydc_retriever, faiss_retriever], weights = [0.5, 0.5]
)

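The ensemble merges the two ranked result lists into one. A minimal sketch of weighted Reciprocal Rank Fusion is shown below, assuming the commonly used smoothing constant `c = 60` and treating each document as a plain string identifier. The function `weighted_rrf` and the document ids are hypothetical and only illustrate the scoring idea: a document found by both retrievers accumulates score from each list.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (rank + c)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


web_results = ["w1", "shared", "w2"]   # hypothetical web retriever ranking
pdf_results = ["shared", "p1"]         # hypothetical vector store ranking

fused = weighted_rrf([web_results, pdf_results], weights=[0.5, 0.5])
print(fused)
```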
## Instantiate the LLM

In [13]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)

## Create the Prompt Template

In [14]:
system_prompt = """
You are an assistant that answers questions pertaining to CPI (Consumer Price Index).  Please utilize the following retrieved context from the web and from a proprietary
datasource to provide an accurate answer to the question.  Please try and utilize numbers where applicable to substantiate your answer.  If you do not know the answer, simply say you do not 
know the answer.  Please keep the response concise.

{context}
"""

In [15]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

## Create a basic chain without chat history

We will test our chain first without chat history.

In [16]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

qa_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(ensemble_retriever, qa_chain)

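The "stuff" strategy places all retrieved documents into the `{context}` slot of the prompt. The sketch below mimics that step with plain string formatting, assuming the documents are joined with a blank line between them. The helper `stuff_documents`, the template, and the two toy document texts are hypothetical stand-ins, not the LangChain implementation.

```python
def stuff_documents(template: str, doc_texts: list[str], separator: str = "\n\n") -> str:
    """Join the retrieved texts and substitute them into the {context} slot."""
    return template.format(context=separator.join(doc_texts))


template = "Answer using the context below.\n\n{context}"
toy_docs = [
    "BC all-items CPI: +2.9%",       # stand-in for a PDF chunk
    "Canada all-items CPI: +2.7%",   # stand-in for a web snippet
]

prompt_text = stuff_documents(template, toy_docs)
print(prompt_text)
```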
In [17]:
response = rag_chain.invoke({"input": "How did the CPI in April 2024 in BC compare to the national CPI in Canada?"})

In [18]:
response["answer"]

"In April 2024, the Consumer Price Index (CPI) in British Columbia (BC) increased by 2.9% compared to April 2023. Nationally, Canada's CPI was up 2.7% over the same period. Therefore, BC's CPI increase was slightly higher than the national average."

## Add chat history to our chatbot

Chat history is an integral component of any chat application, as the input query might require additional conversational context to be understood by the LLM.  We are going to add chat history to our chatbot and contextualize the input prompts with chat history.

In [19]:
# Create a prompt that utilizes the chat history as context to reformulate the most recent input, as a standalone question that the LLM can comprehend
from langchain.chains import create_history_aware_retriever

contextualize_q_system_prompt = """
Given a chat history and the latest question, which might reference context in the chat history, formulate a standalone question, which can be understood without chat history.
Do not answer the question, just reformulate the question if necessary and return it as is otherwise.
"""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

In [20]:
# Create a chain that takes conversation history and contextualizes the prompt
history_aware_retriever = create_history_aware_retriever(llm, ensemble_retriever, contextualize_q_prompt)

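The control flow of a history-aware retriever can be sketched without any LLM: when there is no chat history the question goes to the retriever unchanged, and otherwise it is first rewritten into a standalone query. In the sketch below, `fake_rewrite` and `fake_retrieve` are hypothetical stand-ins for the LLM call and the ensemble retriever, used only to make the branching visible.

```python
def history_aware_retrieve(question, chat_history, rewrite, retrieve):
    """Rewrite the question only when prior turns exist, then retrieve."""
    if not chat_history:
        query = question
    else:
        query = rewrite(chat_history, question)
    return retrieve(query)


def fake_rewrite(chat_history, question):
    # Stand-in for the LLM: attach the most recent turn as context.
    return f"{question} (context: {chat_history[-1]})"


def fake_retrieve(query):
    # Stand-in for the retriever: echo the query it received.
    return [f"results for: {query}"]


print(history_aware_retrieve("How does that compare nationally?", [],
                             fake_rewrite, fake_retrieve))
print(history_aware_retrieve("How does that compare nationally?",
                             ["food prices in BC"],
                             fake_rewrite, fake_retrieve))
```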
In [21]:
# rejig qa prompt to include the chat history
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}")
    ]
)

In [22]:
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

# statefully manage session history
store = {}

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
        
    return store[session_id]

In [23]:
# create chains that include message history
qa_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)

In [24]:
from langchain_core.runnables.history import RunnableWithMessageHistory

# create final chain that ties everything together

conversation_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key = "input",
    history_messages_key = "chat_history",
    output_messages_key = "answer"
)

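What the history wrapper adds can be pictured as three steps: look up the history for the session, call the chain with it, then record the new human and AI turns. The following sketch reproduces that bookkeeping with plain lists and a toy chain. The names `run_with_history`, `echo_chain`, and `toy_sessions` are hypothetical, and the real `RunnableWithMessageHistory` stores message objects rather than tuples.

```python
def run_with_history(chain, sessions, session_id, user_input):
    """Invoke `chain` with the session's history, then append both turns."""
    history = sessions.setdefault(session_id, [])
    answer = chain(user_input, list(history))  # pass a copy of prior turns
    history.append(("human", user_input))
    history.append(("ai", answer))
    return answer


def echo_chain(user_input, history):
    # Toy chain: report which turn of the conversation this is.
    return f"turn {len(history) // 2 + 1}: {user_input}"


toy_sessions = {}
first = run_with_history(echo_chain, toy_sessions, "abc", "hello")
second = run_with_history(echo_chain, toy_sessions, "abc", "again")
other = run_with_history(echo_chain, toy_sessions, "xyz", "hi")
print(first, second, other, sep="\n")
```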
## Let's try it out!

In [25]:
conversation_rag_chain.invoke({"input": "How much did food prices increase in April 2024 in BC compared to April 2023?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'In British Columbia, food prices increased by 2.6% in April 2024 compared to April 2023.'

In [26]:
conversation_rag_chain.invoke({"input": "How does that compare to the increase in food prices across the nation?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'Nationally, food prices in Canada increased by 2.3% in April 2024 compared to April 2023. Therefore, the increase in food prices in British Columbia (2.6%) was slightly higher than the national average.'

In [27]:
conversation_rag_chain.invoke({"input": "What contributed to the rising food prices in BC in April 2024?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

'Several factors contributed to the rising food prices in British Columbia in April 2024:\n\n1. **Beef and Veal Prices**: Beef and veal prices rose by 7.0% compared to April 2023, driven by tight supplies and strong demand.\n\n2. **Wholesale Pork Prices**: Wholesale pork prices increased by 18.3% over the year, due to higher demand after previous declines in 2022 and 2023.\n\n3. **Energy Costs**: Although the energy price index decreased by 3.9%, fluctuations in energy costs can indirectly affect food prices through transportation and production costs.\n\n4. **Global Events**: Factors such as climate events, including wildfires and flooding, have adversely affected harvests, contributing to higher food prices.\n\n5. **Supply Chain Issues**: Disruptions in the supply chain and increased costs for packaging and transportation also played a role.\n\nThese combined factors led to the overall increase in food prices observed in the province.'

In [28]:
conversation_rag_chain.invoke({"input": "How did the CPI in April 2024 in BC compare to the national CPI in Canada?"}, config = {"configurable": {"session_id": "xyz_789"}})["answer"]

"In April 2024, British Columbia's Consumer Price Index (CPI) increased by 2.9% compared to April 2023. Nationally, Canada's CPI rose by 2.7% over the same period. Thus, the CPI increase in British Columbia was slightly higher than the national average."

## Create a Python class that encapsulates a chatbot retrieving context from PDF files and the web

Let's create a Python class that encapsulates the code above.  This will enable users to easily create chatbots for new use cases with custom PDF files and prompts.

In [29]:
import secrets
class PDF_QA_Bot:
    def __init__(self, llm: ChatOpenAI, pdf_files: list[str], system_prompt: str, num_web_results_to_fetch: int = 10):
        
        self._llm = llm
        
        docs = self._load_pdf_documents(pdf_files)
        
        # split the docs into chunks, vectorize the chunks and load them into a vector store
        db = self._create_vector_store(docs)
        
        # create LangChain retriever from the vector store
        self._faiss_retriever = db.as_retriever()
        
        # create YDC retriever
        self._ydc_retriever = YouRetriever(num_web_results = num_web_results_to_fetch)

        # create ensemble retriever 
        self._ensemble_retriever = EnsembleRetriever(
            retrievers = [self._ydc_retriever, self._faiss_retriever], weights = [0.5, 0.5]
        )

        # create the system prompt from the user input
        self._system_prompt = system_prompt + "\n\n" + "{context}"

        self._contextualize_q_system_prompt = """
        Given a chat history and the latest question, which might reference context in the chat history, formulate a standalone question, which can be understood without chat history.
        Do not answer the question, just reformulate the question if necessary and return it as is otherwise.
        """

        # Create a prompt that utilizes the chat history as context to reformulate the most recent input, as a standalone question that the LLM can comprehend
        self._contextualize_q_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self._contextualize_q_system_prompt),
                MessagesPlaceholder("chat_history"),
                ("human", "{input}")
            ]
        )

        self._history_aware_retriever = create_history_aware_retriever(self._llm, self._ensemble_retriever, self._contextualize_q_prompt)

        self._qa_prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self._system_prompt),
                MessagesPlaceholder("chat_history"),
                ("human", "{input}")
            ]
        )

        self._messages_store = {}
        self._session_id = self._generate_session_id()

        # create chains that include message history
        self._qa_chain = create_stuff_documents_chain(self._llm, self._qa_prompt)
        self._rag_chain = create_retrieval_chain(self._history_aware_retriever, self._qa_chain)

        # create final chain that ties everything together
        self._conversation_rag_chain = RunnableWithMessageHistory(
            self._rag_chain,
            self._get_session_history,
            input_messages_key = "input",
            history_messages_key = "chat_history",
            output_messages_key = "answer"
        )
        
    def _load_pdf_documents(self, pdf_files: list[str]) -> list:
        docs = []
        for file in pdf_files:
            file_loader = PyPDFLoader(file)
            docs.extend(file_loader.load())
        return docs
    
    def _create_vector_store(self, docs: list):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunked_docs = text_splitter.split_documents(docs)
        embeddings = OpenAIEmbeddings()
        return FAISS.from_documents(documents=chunked_docs, embedding=embeddings)
    
    
    def _get_session_history(self, session_id: str) -> BaseChatMessageHistory:
        """Statefully manage chat history"""
        if session_id not in self._messages_store:
            self._messages_store[session_id] = ChatMessageHistory()
            
        # look up the history by the requested session_id, not the bot's default one
        return self._messages_store[session_id]
    
    def _generate_session_id(self) -> str:
        session_id = secrets.token_urlsafe(16)
        return session_id
    
    def invoke_bot(self, input_str: str) -> str:
        # named payload to avoid shadowing the built-in input()
        payload = {"input": input_str}
        config = {"configurable": {"session_id": self._session_id}}
        output = self._conversation_rag_chain.invoke(payload, config)
        return output["answer"]
        

        

## Let's try it out on a new use case with a different document!

We will use our class to implement a chatbot that answers questions about student housing and residences at the University of Toronto.  To answer questions, the chatbot retrieves relevant context from a PDF brochure about student housing options at the University of Toronto and from the web, using the You.com API.

In [31]:
llm = ChatOpenAI(model="gpt-4o", temperature=0.5)
conversational_rag_system_prompt = """You are an assistant that answers questions pertaining to student housing and residence options at the University of Toronto (UofT).  Please utilize relevant context from the web and a brochure about student housing options
to provide an accurate response to the question.  If you do not know the answer to a question, simply say you do not know the answer."""

# The PDF file being used in this example can be downloaded from: https://studentlife.utoronto.ca/wp-content/uploads/Student-Housing-Brochure.pdf
conversational_rag = PDF_QA_Bot(llm, pdf_files=["Student-Housing-Brochure.pdf"], system_prompt=conversational_rag_system_prompt, num_web_results_to_fetch=10)

In [43]:
conversational_rag.invoke_bot("Which residences at the University of Toronto do not offer a meal plan?")

'At the University of Toronto, the following residences do not offer a mandatory meal plan:\n\n1. **Innis College**\n2. **Woodsworth College**\n3. **Graduate House**\n4. **University Family Housing**\n\nResidents of these locations have the option to purchase meal plans if desired. For more information about optional meal plans, you can visit [U of T Food Services](https://foodservices.utoronto.ca).'

In [44]:
conversational_rag.invoke_bot("What food establishments are located near Innis College?")

"Near Innis College at the University of Toronto, there are several food establishments that students frequent:\n\n1. **Innis Cafe**: Located within Innis College, this cafe offers freshly-made, healthy food with vegan and halal options. It's known for its friendly staff and quick service.\n\n2. **MeeT You 177 on College**: Just minutes from U of T, this spot offers bubble tea and Chinese food, including favorites like Kung Pao chicken, dumplings, and beef with vegetables.\n\n3. **Einstein's Pub and Café**: A popular pub on College Street known for its craft beers, bar food, and fantastic wings. It's a go-to spot for students, especially after exams.\n\n4. **Diabolos' Coffee Bar**: Located in the Junior Common Room of University College, this student-owned café offers fair trade coffee, tea, snacks, freshly baked muffins, and vegan treats.\n\n5. **Second Cup at Myhal Centre for Engineering Innovation and Entrepreneurship**: Offers a variety of sweet treats, healthy snacks, and fair tra

In [45]:
conversational_rag.invoke_bot("Which residences are female only at UofT?")

"At the University of Toronto, the following residences are female-only:\n\n1. **Loretto College**: Located on the St. George campus, Loretto College is an all-women's residence affiliated with St. Michael's College and the University of Toronto.\n\n2. **Annesley Hall**: Part of Victoria College, Annesley Hall is another all-women's residence. It is also a National Historic Site and was the first university residence for women in Canada.\n\nThese residences provide a supportive environment specifically for female students."

In [36]:
conversational_rag.invoke_bot("Is a meal plan mandatory at Loretto College?")

'Yes, a meal plan is mandatory at Loretto College. All residents are required to have a meal plan because there are no cooking facilities in any of the dorms. The meal plan includes access to the dining hall, which offers a variety of food stations.'

In [37]:
conversational_rag.invoke_bot("What is the cost of a meal plan at Loretto College?")

'The cost of living at Loretto College, which includes the mandatory meal plan, ranges from $17,284 to $18,714 for the 2023-24 academic year. This fee encompasses both accommodation and the meal plan.'

In [42]:
conversational_rag.invoke_bot("What room types are offered at Loretto College?")

'Loretto College offers two main types of dormitory-style rooms:\n\n1. **Double Rooms**: These are available primarily for first-year students. Each double room is shared by two students and comes with an ensuite washroom that is shared with another double room (i.e., four students share one ensuite washroom).\n\n2. **Single Rooms**: These are typically available for upper-year students. Each single room has an individual sink, and residents share common washrooms located on each floor.\n\nBoth room types come fully furnished with essential amenities such as a bed, desk, chair, desk lamp, bookshelf, mirror, and plenty of closet and storage space. Additionally, weekly cleaning services are provided for all rooms.'

In [39]:
conversational_rag.invoke_bot("Are there any attractions, museums, fun things to do close to Loretto College?")

"Yes, there are several attractions, museums, and fun activities close to Loretto College at the University of Toronto:\n\n1. **Royal Ontario Museum (ROM)**: Just steps from Loretto College, the ROM is Canada's largest museum and one of North America's most renowned cultural institutions. It features art, culture, and nature from around the globe.\n\n2. **Bata Shoe Museum**: Located a short walk away, this museum offers a unique perspective on history and culture through its extensive collection of footwear from different eras and regions.\n\n3. **Gardiner Museum**: Specializing in ceramic art, the Gardiner Museum is nearby and provides a range of exhibits and hands-on activities.\n\n4. **Yorkville**: This trendy neighborhood is known for its upscale boutiques, galleries, cafes, and restaurants. It's a great place to explore, shop, and dine.\n\n5. **Queen’s Park**: A beautiful green space perfect for a walk, picnic, or simply relaxing outdoors. It’s also home to the Ontario Legislative

In [40]:
conversational_rag.invoke_bot("What is there to do in the Yorkville neighborhood?")

"Yorkville, one of Toronto's most vibrant and upscale neighborhoods, offers a wide range of activities and attractions. Here are some highlights:\n\n1. **Royal Ontario Museum (ROM)**: A must-visit, the ROM showcases art, culture, and nature from around the world with extensive collections and fascinating exhibits.\n\n2. **Gardiner Museum**: Specializing in ceramics, this museum offers unique exhibits and interactive activities.\n\n3. **Bata Shoe Museum**: Discover the history of footwear from around the world with this museum's extensive and unique collection.\n\n4. **Shopping on Mink Mile**: Yorkville is known for its high-end shopping, with luxury retailers like Burberry, Gucci, Hermès, Louis Vuitton, and Chanel lining Bloor Street West.\n\n5. **Yorkville Park**: A small, beautifully designed park that celebrates the history and diversity of Canadian landscapes. It's a great spot for a rest or people-watching.\n\n6. **Art Galleries**: Yorkville is home to several prestigious gallerie

In [41]:
conversational_rag.invoke_bot("What restaurants are located in the neighborhood?")

"Yorkville is renowned for its diverse and upscale dining options. Here are some notable restaurants in the neighborhood:\n\n1. **Sassafraz**: A beloved local landmark, Sassafraz offers contemporary Canadian cuisine in a beautifully restored Victorian rowhouse. It's known for its seasonal menu and elegant setting.\n\n2. **Café Boulud**: Located in the Four Seasons Hotel, this French brasserie by Chef Daniel Boulud features a seasonally changing menu rooted in French tradition.\n\n3. **ONE Restaurant**: Chef Mark McEwan’s culinary concept at The Hazelton Hotel, offering a contemporary dining experience with a sumptuous bar and spacious tree-lined patio.\n\n4. **Kasa Moto**: A sprawling two-story contemporary Japanese restaurant and lounge, known for its inventive sushi and chic ambiance.\n\n5. **Alobar Yorkville**: A hidden gem serving as a restaurant and bar, offering a full-scale dining experience with a chic and polished interior.\n\n6. **Jacques Bistro Du Parc**: A classic French bi