# LangChain: Q&A over Documents

This lesson builds a question-answering tool over your own documents. An example would be a tool that lets you query a product catalog for items of interest.

In [36]:
#pip install --upgrade langchain

In [1]:
# Start by loading the environment variables, as we always do.
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
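
If a required variable is missing from the `.env` file, later cells fail with less obvious errors. A minimal sketch of an early check follows. The helper `require_env` and the variable `DEMO_API_KEY` are hypothetical names used only for illustration. In this notebook you would presumably check `OPENAI_API_KEY`, which is an assumption about your setup rather than something the cell above states.

```python
import os

def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value

# Demo with a throwaway variable so the sketch runs without real credentials.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))  # not-a-real-key
```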

Note: LLMs do not always produce the same results. When executing the code in your notebook, you may get slightly different answers than those in the lecture.

In [2]:
# This block only initialises the `llm_model` variable, nothing else.
# It accounts for the deprecation of the older LLM model.
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"
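
The date check above can also be wrapped in a small function, which makes the cutoff behaviour easy to verify. This is only a sketch: `pick_model`, `TARGET_DATE`, and the `today` parameter are hypothetical names that do not appear in the original notebook.

```python
import datetime

# Cutoff date taken from the cell above.
TARGET_DATE = datetime.date(2024, 6, 12)

def pick_model(today=None):
    """Return the model name to use for a given date (defaults to the current date)."""
    if today is None:
        today = datetime.datetime.now().date()
    # Strictly after the cutoff we use the unpinned model name.
    return "gpt-3.5-turbo" if today > TARGET_DATE else "gpt-3.5-turbo-0301"

print(pick_model(datetime.date(2024, 6, 12)))  # gpt-3.5-turbo-0301
print(pick_model(datetime.date(2024, 6, 13)))  # gpt-3.5-turbo
```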

Now we're going to import some things that will help us when building this chain. 

In [3]:
# We're going to import the retrieval QA chain. This will do retrieval over some documents. 
from langchain.chains import RetrievalQA

# We're going to import our favorite chat language model, ChatOpenAI.
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI

# We're going to import a document loader. This is going to be used to load some proprietary data that 
# we're going to combine with the language model. 
# In this case it's going to be in a CSV. So we're going to import the CSV loader.
from langchain.document_loaders import CSVLoader

# Finally, we're going to import a vector store. 
# There are many different types of vector stores and we'll cover what exactly these are later on but we're going to get started with the "DocArrayInMemorySearch" vector store. 
# This is really nice because it's an in-memory vector store and it doesn't require connecting to an external database of any kind so it makes it really easy to get started.
from langchain.vectorstores import DocArrayInMemorySearch

# We're also going to import display and Markdown, two common utilities for displaying information in our notebooks.
from IPython.display import display, Markdown


In [4]:
# We've provided a CSV of outdoor clothing that we're going to use to combine with the language model. 
# Here we're going to initialize a loader, the CSV loader, with a path to the file.
file = '../data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [5]:
# We're next going to import an index, the "VectorstoreIndexCreator".
# This will help us create a vector store really easily.
from langchain.indexes import VectorstoreIndexCreator

In [42]:
#pip install docarray

In [6]:
# To create an index, we're going to specify two things. 
# First, we're going to specify the vector store class. 
# As mentioned before, we're going to use this vector store, as it's a particularly easy one to get started with. 
# After it's been created, we're then going to call "from_loaders", which takes in a list of document loaders. 
# We've only got one loader that we really care about, so that's what we're passing in here. 
# It's now been created and we can start to ask questions about it.
from langchain.embeddings import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings()
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embedding_model
).from_loaders([loader])



In [7]:
# Below we'll cover what exactly happened under the hood, 
# so let's not worry about that for now. Here, we'll start with a query.
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

**Note**:
- The notebook uses `langchain==0.0.179` and `openai==0.27.7`
- For these library versions, `VectorstoreIndexCreator` uses `text-davinci-003` as the base model, which has been deprecated since 1 January 2024.
- The replacement model, `gpt-3.5-turbo-instruct`, will be used instead for the `query`.
- The `response` format may differ from the one shown in the class because of this replacement model.

In [8]:
llm_replacement_model = OpenAI(temperature=0, 
                               model='gpt-3.5-turbo-instruct')

# We'll then create a response using "index.query" and pass in this query.
response = index.query(query, 
                       llm = llm_replacement_model)



In [9]:
# Again, we'll cover what's going on under the hood down below. 
# We've gotten back a table in markdown with names and descriptions for all shirts with sun protection. 
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rating, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rating, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rating, moisture-wicking, abrasion-resistant, fits over swimsuit | SPF 50+, blocks 98% of harmful UV rays |

goto: slide #21, 22, 23, 24

## Step By Step

In [10]:
# So above, we created this chain in only a few lines of code.
# But let's now do it a bit more step-by-step and understand what exactly is going on under the hood. 
# The first step is similar to above. 
# We're going to create a document loader, loading from that CSV with all the descriptions of the products that we want to do question answering over.
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [11]:
# We can then load documents from this document loader. 
docs = loader.load()

In [12]:
# If we look at the individual documents, we can see that each document corresponds to one of the products in the CSV. 
docs[0]

Document(metadata={'source': '../data/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")
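
To make the shape of that `page_content` less mysterious, here is a rough sketch of how a CSV row can be turned into "column: value" text with the standard library. It is not the `CSVLoader` implementation. The sample rows are made up, and `rows_to_documents` is a hypothetical helper. The unnamed first column is what produces the leading `: 0` seen in the output above.

```python
import csv
import io

# Made-up rows with the same layout as the catalog:
# an unnamed index column, then name and description.
SAMPLE_CSV = ",name,description\n0,Trail Hat,Wide brim hat\n1,Sun Shirt,UPF 50+ shirt\n"

def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, in column order.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents

sample_docs = rows_to_documents(SAMPLE_CSV)
print(sample_docs[0]["page_content"])
```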

In [13]:
# Previously, we talked about creating chunks. 
# Because these documents are already so small, we actually don't need to do any chunking here. 
# And so we can create embeddings directly. 
# To create embeddings, we're going to use OpenAI's embedding class.
# We can import it and initialize it here.
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [14]:
# If we want to see what these embeddings do, we can actually take a look at what happens when we embed a particular piece of text. 
embed = embeddings.embed_query("Hi my name is Ankit")

In [15]:
# In this case, we embedded the sentence "Hi my name is Ankit".
# If we take a look at this embedding, we can see that there are over a thousand different elements.
print(len(embed))

1536


In [16]:
# Each of these elements is a different numerical value.
# Combined, this creates the overall numerical representation for this piece of text.
print(embed[:5])

[-0.010378659221592458, 0.0007889397309350342, -0.0179676963998228, -0.014404589573688809, -0.016902568897655755]
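
Embeddings are useful because texts with similar meaning end up with vectors that point in similar directions. A common way to measure this is cosine similarity. The sketch below uses tiny hand-written vectors rather than real 1536-dimensional embeddings, so the numbers are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings" for illustration only.
v_shirt = [0.9, 0.1, 0.0]
v_tee = [0.8, 0.2, 0.1]
v_boot = [0.0, 0.2, 0.9]

# The two shirt-like vectors are closer to each other than to the boot vector.
print(cosine_similarity(v_shirt, v_tee) > cosine_similarity(v_shirt, v_boot))  # True
```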


In [17]:
# We want to create embeddings for all the pieces of text that we just loaded and then we also want to store them in a vector store.
# We can do that by using the "from_documents" method on the vector store. 
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)
# This method takes in a list of documents and an embedding object, and creates an overall vector store. 

In [18]:
# We can now use this vector store to find pieces of text similar to an incoming query. 
# So let's look at the query, "Please suggest a shirt with sunblocking". 
query = "Please suggest a shirt with sunblocking"

In [19]:
# If we use the similarity search method on the vector store and pass in a query, we will get back a list of documents. 
docs = db.similarity_search(query)

In [20]:
len(docs)

4

In [21]:
# We can see that it returns four documents, and if we look at the first one, we can see that it is indeed a sun-blocking shirt. 
docs[0]

Document(metadata={'source': '../data/OutdoorClothingCatalog_1000.csv', 'row': 255}, page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.')
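
Conceptually, a similarity search embeds the query, scores every stored text against it, and returns the top `k` matches. The sketch below shows that idea with a crude word-count "embedding" standing in for a real model. `toy_embed`, `cosine`, and the tiny `catalog` list are assumptions made for illustration, and `k=4` simply mirrors the four results seen above.

```python
import math
from collections import Counter

def toy_embed(text):
    """Word counts as a sparse vector (a crude stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_search(texts, query, k=4):
    """Return the k texts most similar to the query, best match first."""
    query_vector = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(toy_embed(t), query_vector), reverse=True)
    return ranked[:k]

catalog = [
    "sun shield shirt with upf 50 protection",
    "waterproof hiking boots",
    "tropical plaid shirt with sun protection",
    "wool winter socks",
    "insulated camping mug",
]
top = similarity_search(catalog, "shirt with sun protection", k=2)
print(top)
```

A real vector store computes the embeddings once at indexing time instead of on every query, which is what `from_documents` did above.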

So, how do we use this to do question answering over our own documents? 

In [22]:
# First, we need to create a retriever from this vector store.
# A retriever is a generic interface that can be underpinned by any method that takes in a query and returns documents. 
# Vector stores with embeddings are one such method, although there are plenty of others, some less advanced and some more advanced. 
retriever = db.as_retriever()
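
The point that a retriever is just an interface can be shown with a sketch that has no vector store at all. Below, `Retriever` is a hypothetical protocol and `KeywordRetriever` a deliberately naive implementation. Neither is LangChain's class, and the method name is borrowed only for familiarity.

```python
from typing import List, Protocol

class Retriever(Protocol):
    """Anything that maps a query string to a list of documents."""
    def get_relevant_documents(self, query: str) -> List[str]: ...

class KeywordRetriever:
    """Naive retriever: returns texts that share at least one word with the query."""
    def __init__(self, texts):
        self.texts = list(texts)

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        return [t for t in self.texts if query_words & set(t.lower().split())]

def first_hit(retriever: Retriever, query: str) -> str:
    """Works with any object that satisfies the Retriever protocol."""
    hits = retriever.get_relevant_documents(query)
    return hits[0] if hits else ""

keyword_retriever = KeywordRetriever(["Sun Shield Shirt", "Campside Oxfords"])
print(first_hit(keyword_retriever, "shirt"))  # Sun Shield Shirt
```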

In [23]:
# Next, because we want to do text generation and return a natural language response, 
# we're going to create a language model, and we're going to use ChatOpenAI.

llm_model = "gpt-4o"
llm = ChatOpenAI(temperature = 0.0, model=llm_model)



In [24]:
# If we were doing this by hand, we would combine the documents into a single piece of text. 
qdocs = "".join(doc.page_content for doc in docs)

In [25]:
# So we'd do something like this: join all the page content in the documents into a variable, 
# and then pass this variable, together with the question (or a variant of it) such as 
# "Please list all your shirts with sun protection in a table in markdown and summarize each one.", into the language model. 
# Let's call the model.
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 
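
The by-hand approach joins the documents with an empty string, so the end of one product description runs straight into the start of the next. A slightly more careful sketch separates documents with blank lines and puts the question on its own line. `build_stuff_prompt` is a hypothetical helper and the sample strings are made up.

```python
def build_stuff_prompt(page_contents, question):
    """Join document texts with blank lines and append the question at the end."""
    context = "\n\n".join(page_contents)
    return f"{context}\n\nQuestion: {question}"

prompt = build_stuff_prompt(
    ["name: Sun Shield Shirt", "name: Men's TropicVibe Shirt"],
    "Which shirts have sun protection?",
)
print(prompt)
```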




In [27]:
# And if we print out the response here, we can see that we get back a table exactly as we asked for. 
display(Markdown(response))

Certainly! Below is a markdown table summarizing the shirts with sun protection:

| Name                               | Description                                                                                                           | Size & Fit                  | Fabric & Care                                                                 | Additional Features                                                                 |
|------------------------------------|-----------------------------------------------------------------------------------------------------------------------|-----------------------------|--------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| Sun Shield Shirt                   | High-performance sun shirt with UPF 50+ protection, blocks 98% of harmful UV rays.                                     | Slightly Fitted             | 78% nylon, 22% Lycra Xtra Life fiber. Handwash, line dry.                      | Moisture-wicking, abrasion-resistant, fits over swimsuits. Recommended by The Skin Cancer Foundation. |
| Men's Plaid Tropic Shirt           | Ultracomfortable shirt with UPF 50+ protection, originally designed for fishing, great for travel.                     | Not specified               | 52% polyester, 48% nylon. Machine washable and dryable.                        | Wrinkle-free, quick-drying, front and back cape venting, two front bellows pockets.  |
| Men's TropicVibe Shirt             | Lightweight sun-protection shirt with UPF 50+, ideal for hot weather and strong UV rays.                               | Traditional Fit             | Shell: 71% Nylon, 29% Polyester. Lining: 100% Polyester knit mesh. Machine wash and dry. | Wrinkle-resistant, front and back cape venting, two front bellows pockets.           |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ protection, relaxed fit, made of 100% polyester.                                | Traditional Fit             | 100% polyester. Wrinkle-resistant.                                             | Front and back cape venting, two front bellows pockets.                              |

### Summary:
- **Sun Shield Shirt**: Offers high-performance sun protection with a slightly fitted design, moisture-wicking, and abrasion resistance. Ideal for wearing over swimsuits.
- **Men's Plaid Tropic Shirt**: Designed for fishing and travel, this shirt is ultracomfortable with quick-drying and wrinkle-free features.
- **Men's TropicVibe Shirt**: Provides lightweight sun protection with a traditional fit, perfect for hot weather with its venting features.
- **Men's Tropical Plaid Short-Sleeve Shirt**: The lightest option for hot weather, offering a relaxed fit and excellent sun protection with venting for cool breezes.

In [28]:
# All these steps can be encapsulated in a LangChain chain,
# so here we create a RetrievalQA chain.
# It does retrieval and then question answering over the retrieved documents. 
# To create such a chain, we pass in a few different things: 
# 1st: the language model, which will be used for text generation at the end. 
# 2nd: the chain type. We use "stuff", the simplest method: it stuffs all the docs into the context and makes one call to the language model. 
# 3rd: the retriever, which will be used to fetch the docs and pass them to the language model. 

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
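
The three ingredients above fit together roughly as follows: retrieve, stuff the results into one prompt, and make a single model call. The sketch below uses stand-in functions so it runs without an API key. `make_stuff_qa`, `fake_retrieve`, and `fake_llm` are hypothetical names, and this is an illustration of the "stuff" idea, not the `RetrievalQA` implementation.

```python
def make_stuff_qa(retrieve, llm):
    """Build a question-answering function from a retriever and an LLM callable."""
    def run(query):
        docs = retrieve(query)                       # 1. fetch relevant documents
        context = "\n\n".join(docs)                  # 2. stuff them into one context
        prompt = f"{context}\n\nQuestion: {query}"
        return llm(prompt)                           # 3. a single call to the model
    return run

seen_prompts = []

def fake_retrieve(query):
    # Stand-in for a vector store retriever; always returns the same two texts.
    return ["Sun Shield Shirt: UPF 50+", "Tropic Shirt: UPF 50+"]

def fake_llm(prompt):
    # Stand-in for a real model; records the prompt and reports its size.
    seen_prompts.append(prompt)
    return f"answered from a prompt of {len(prompt.splitlines())} lines"

qa_sketch = make_stuff_qa(fake_retrieve, fake_llm)
answer = qa_sketch("Which shirts have sun protection?")
print(answer)
```

Because everything goes into one prompt, this approach only works while the retrieved documents fit in the model's context window.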

In [29]:
# create a query 
query = "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [30]:
# run the chain on this query 
response = qa_stuff.run(query)





> Entering new RetrievalQA chain...

> Finished chain.


In [31]:
display(Markdown(response))

Here is a table summarizing the shirts with sun protection:

| Name                                      | Summary                                                                                                                                                                                                 |
|-------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | A lightweight, wrinkle-resistant shirt made of 100% polyester, offering UPF 50+ sun protection. It features front and back cape venting and two front bellows pockets. Ideal for hot weather.             |
| Men's Plaid Tropic Shirt, Short-Sleeve    | Designed for fishing, this shirt is made of 52% polyester and 48% nylon, providing UPF 50+ protection. It is wrinkle-free, quick-drying, and includes cape venting and two front bellows pockets.         |
| Men's TropicVibe Shirt, Short-Sleeve      | This shirt offers a traditional fit with a shell of 71% nylon and 29% polyester, and a 100% polyester knit mesh lining. It provides UPF 50+ protection, is wrinkle-resistant, and features cape venting.    |
| Sun Shield Shirt                          | Made of 78% nylon and 22% Lycra Xtra Life fiber, this shirt is slightly fitted and offers UPF 50+ protection. It wicks moisture, is abrasion-resistant, and fits comfortably over a swimsuit.              |

Each shirt provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays.

In [32]:
# Same thing as above: both amount to the same thing. 
# That's what makes LangChain interesting: you can execute it in one line, or you can break things down step by step. 
response = index.query(query, llm=llm)

In [33]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

goto: slide #25 