# LangChain: Q&A over Documents

Develop a tool to query a product catalog for items of interest

- LLMs can only inspect a few thousand words at a time

- For longer documents, we need embeddings and vector databases

In [2]:
# Import relevant libraries
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [4]:
from langchain.chains import RetrievalQA # chain that retrieves documents and answers questions over them
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader # document loader
from langchain.vectorstores import DocArrayInMemorySearch # in-memory vector store
from IPython.display import display, Markdown

In [5]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [6]:
from langchain.indexes import VectorstoreIndexCreator # import index to create vector database

In [7]:
# Create the vector database by specifying two things: the vector store class and a list of document loaders
# Requires: pip install docarray

index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

## Embeddings & Vector Database

**Embeddings**:<br>
- Numerical representation of text

- Captures semantic meaning of the text 

- Pieces of text with similar content will have similar vectors -> allow comparison of text pieces in a vector space
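
One common way to compare two embeddings is cosine similarity. The sketch below uses hypothetical 3-dimensional vectors (`sun_shirt`, `uv_top`, `hiking_boot`) invented for illustration; real embeddings have far more dimensions and come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumed values, not produced by a model)
sun_shirt = [0.9, 0.1, 0.0]
uv_top = [0.8, 0.2, 0.1]
hiking_boot = [0.0, 0.2, 0.9]

# Texts with similar content should score higher than unrelated ones
print(cosine_similarity(sun_shirt, uv_top))
print(cosine_similarity(sun_shirt, hiking_boot))
```

The first score should come out much closer to 1 than the second, which is what "similar vectors" means in practice.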


**Vector databases**:<br>
- Stores the vector representations of text 

- Create a vector database by populating it with chunks of text from the documents:
  + First, break up a big document into smaller chunks so that only the relevant chunks are passed to the LLM
  + Then, create an embedding for each chunk
  + Store the embeddings in a vector database
=> This is what happened when we created the index<br>
<br>

- An incoming query is transformed into an embedding and compared with the chunk embeddings stored in the vector database
- The text of the most similar chunks is passed to the LLM in the prompt (as context) to get back the final answer
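
The store-then-search flow described above can be sketched in plain Python. This is only an illustration: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is an invented class, not the `DocArrayInMemorySearch` implementation.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store: keeps each chunk next to its embedding."""

    def __init__(self, chunks):
        self.entries = [(chunk, toy_embed(chunk)) for chunk in chunks]

    def similarity_search(self, query, k=2):
        # Embed the query, rank the stored chunks by similarity, return the top k texts
        query_vector = toy_embed(query)
        ranked = sorted(self.entries, key=lambda entry: cosine(query_vector, entry[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

chunks = [
    "sun shield shirt with upf 50+ sun protection",
    "waterproof hiking boots with rubber outsole",
    "tropical plaid shirt with sun protection",
]
store = ToyVectorStore(chunks)
top = store.similarity_search("shirt with sun protection", k=2)
print(top)
```

Only the text of the returned chunks, not their vectors, would then be placed in the prompt.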

![Alt text](vector_database.png)  

In [8]:
# Create a query
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [9]:
# Create a response
response = index.query(query)

In [10]:
# Print out response in markdown
display(Markdown(response))



| Name | Description |
| --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets |
| Men's Plaid Tropic Shirt, Short-Sleeve | UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Men's TropicVibe Shirt, Short-Sleeve | UPF 50+ rated, 71% Nylon, 29% Polyester, 100% Polyester knit mesh, machine wash and dry, front and back cape venting, two front bellows pockets |
| Sun Shield Shirt by | UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, handwash, line dry, wicks moisture, fits comfortably over swimsuit, abrasion resistant |

All four shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant

# Step By Step (Under-the-hood Process)

In [11]:
# Create document_loader - load the CSV file
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [12]:
docs = loader.load() # Load "documents" from the previously defined document_loader

In [13]:
docs[0] # Each "document" is actually one of the products in the CSV file

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})
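
The output above suggests that each CSV row becomes one document whose text is a list of `column: value` lines, with the row number kept in the metadata. A rough sketch of that transformation, using only the standard library and a made-up two-row CSV (`raw` is an assumption, and this is not `CSVLoader`'s actual code), could look like this:

```python
import csv
import io

# Hypothetical two-row CSV standing in for the catalog file
raw = 'name,description\nSun Hat,"Wide brim, UPF 50+"\nTrail Sock,Merino blend\n'

documents = []
for row_number, row in enumerate(csv.DictReader(io.StringIO(raw))):
    # One "column: value" line per field, joined into a single text blob
    page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
    documents.append({"page_content": page_content, "metadata": {"row": row_number}})

print(documents[0]["page_content"])
```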

In [14]:
# No need to create chunks since the "documents" are already small
# Create embeddings directly with OpenAI embedding class
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [15]:
# Create embedding for a piece of text - embed_query() method
embed = embeddings.embed_query("Hi my name is Harrison")

In [16]:
print(len(embed)) # 1536 elements

1536


In [17]:
print(embed[:5]) # numerical values

[-0.021867522969841957, 0.006806864403188229, -0.01818099617958069, -0.03910486772656441, -0.014066680334508419]


In [18]:
# Create a vector database containing all embeddings from text of the document
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [19]:
# Create a query
query = "Please suggest a shirt with sunblocking"

In [20]:
# Look up similar vectors to the query in the vector database - return a list of "documents"
docs = db.similarity_search(query)

In [21]:
len(docs) # 4 documents (similarity_search returns k=4 results by default)

4

In [22]:
docs[0] 

Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255})

In [23]:
# Create a retriever - a generic interface that takes in a query and returns documents
retriever = db.as_retriever() 

In [34]:
# Instantiate the chat model that will return a response in natural language
llm = ChatOpenAI(temperature=0.0)

In [25]:
# Combine the "documents" into a single piece of text stored in a single variable
qdocs = "".join(doc.page_content for doc in docs)

In [26]:
# Pass the combined text, followed by the question, to the LLM
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
    shirts with sun protection in a table in markdown and summarize each one.") 

In [27]:
display(Markdown(response))

| Name | Description |
| --- | --- |
| Sun Shield Shirt | High-performance sun shirt with UPF 50+ sun protection, moisture-wicking, and abrasion-resistant fabric. Fits comfortably over swimsuits. Recommended by The Skin Cancer Foundation. |
| Men's Plaid Tropic Shirt | Ultracomfortable shirt with UPF 50+ sun protection, wrinkle-free fabric, and front/back cape venting. Made with 52% polyester and 48% nylon. |
| Men's TropicVibe Shirt | Men's sun-protection shirt with built-in UPF 50+ and front/back cape venting. Made with 71% nylon and 29% polyester. |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ sun protection, front/back cape venting, and two front bellows pockets. Made with 100% polyester and is wrinkle-resistant. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are made with high-performance fabrics that are moisture-wicking, abrasion-resistant, and/or wrinkle-free. Some have front/back cape venting for added comfort in hot weather. The Sun Shield Shirt is recommended by The Skin Cancer Foundation.

In [28]:
# The RetrievalQA chain combines all the steps above: it retrieves documents and answers the question over them
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, # pass in the LLM
    chain_type="stuff", # "stuff" puts all retrieved documents into the context and makes one call to the LLM
    retriever=retriever, # interface for fetching documents to pass to the language model
    verbose=True
)

## Methods to pass document to the LLM

**1. Stuff method**
- Simplest method: stuff all data into the prompt as context to pass to the language model
- Pros: Makes a single call to the LLM. The LLM has access to all the data at once
- Cons: LLMs have a limited context length, so for large documents or many documents this method won't work because the prompt exceeds the context length<br>
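
A minimal sketch of the stuff method, assuming the documents are plain strings. `build_stuff_prompt` and `fits_in_context` are hypothetical helpers written for illustration, and the word-count budget is a crude stand-in for a real token limit.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def fits_in_context(prompt, max_words=3000):
    # Crude budget check in words; real limits are measured in tokens
    return len(prompt.split()) <= max_words

documents = [
    "name: Sun Shield Shirt\ndescription: UPF 50+ rated.",
    "name: Tropic Shirt\ndescription: Cape venting, UPF 50+.",
]
prompt = build_stuff_prompt(documents, "Which shirts offer sun protection?")
print(prompt)
print(fits_in_context(prompt))
```

Everything goes to the model in a single call, which is why the approach breaks down once the combined text no longer fits.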

![Alt text](pass_method.png)

**2. Map_reduce**
- Passes each chunk, along with the query, to a language model in a separate call and collects the individual responses
- Uses another LLM call to summarize all the individual responses into a final answer
- Pros: Can operate over any number of documents, and the individual calls can run in parallel
- Cons: Takes many more calls and treats each document independently, so information spread across documents can be missed
- Common use case: summarization
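
The map and reduce phases can be sketched with a stubbed model. `fake_llm` below is a hypothetical stand-in that records each prompt and returns a numbered tag, so the call pattern is visible without a real API; the prompt wording is also an assumption.

```python
def map_reduce(documents, question, llm):
    # Map: one independent call per document (these could run in parallel)
    partial_answers = [llm(f"Context: {doc}\nQuestion: {question}") for doc in documents]
    # Reduce: one extra call that combines the partial answers
    combined = "\n".join(partial_answers)
    return llm(f"Combine these partial answers into one final answer:\n{combined}")

calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model call: records the prompt, returns a tag."""
    calls.append(prompt)
    return f"answer-{len(calls)}"

result = map_reduce(["doc A", "doc B", "doc C"], "Any sun shirts?", fake_llm)
print(len(calls), result)
```

Three documents lead to four calls here: one per document plus the final combining call.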

**3. Refine**
- Iteratively builds upon the response from the previous document 
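- Good for combining information and building up an answer over time
- Cons: Leads to longer answers and takes longer, because each call depends on the previous one and the calls cannot run in parallel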
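
A sketch of the refine loop under the same assumptions as before: `fake_llm` is a hypothetical stub that records prompts, and the prompt wording is invented for illustration.

```python
def refine(documents, question, llm):
    answer = ""
    for doc in documents:
        # Each call sees the answer so far, so the calls must run in order
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {doc}\n"
            "Refine the existing answer using the new context."
        )
    return answer

prompts = []

def fake_llm(prompt):
    """Hypothetical stand-in for a model call: records the prompt, returns a tag."""
    prompts.append(prompt)
    return f"draft-{len(prompts)}"

final = refine(["doc A", "doc B", "doc C"], "Any sun shirts?", fake_llm)
print(final)
```

Because every prompt embeds the previous draft, the number of sequential calls grows with the number of documents.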

**4. Map_rerank**
- Makes a single call to the language model for each document and asks it to return a score along with its answer
- Selects the answer with the highest score
- Note: The LLM must be carefully instructed on how to score, e.g. to give a high score when the document is relevant to the query
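
The rerank step can be sketched as follows. `fake_scoring_llm` is a hypothetical stub that scores by word overlap between document and question; a real model would produce the score itself, guided by the prompt instructions.

```python
def map_rerank(documents, question, llm):
    # One call per document; each call returns an (answer, score) pair
    scored = [llm(doc, question) for doc in documents]
    # Keep only the answer with the highest score
    return max(scored, key=lambda pair: pair[1])

def fake_scoring_llm(doc, question):
    """Hypothetical stand-in: the score is the number of words shared with the question."""
    overlap = len(set(doc.lower().split()) & set(question.lower().split()))
    return (f"Based on: {doc}", overlap)

best_answer, best_score = map_rerank(
    ["hiking boots", "sun protection shirt", "wool socks"],
    "shirt with sun protection",
    fake_scoring_llm,
)
print(best_answer, best_score)
```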

In [29]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [30]:
# Run the RetrievalQA chain on the query defined above
response = qa_stuff.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [31]:
# Similar result to the manual approach above
display(Markdown(response))

| Shirt Number | Name | Description |
| --- | --- | --- |
| 618 | Men's Tropical Plaid Short-Sleeve Shirt | This shirt is made of 100% polyester and is wrinkle-resistant. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun's UV rays. |
| 374 | Men's Plaid Tropic Shirt, Short-Sleeve | This shirt is made with 52% polyester and 48% nylon. It is machine washable and dryable. It has front and back cape venting, two front bellows pockets, and is rated to UPF 50+. |
| 535 | Men's TropicVibe Shirt, Short-Sleeve | This shirt is made of 71% Nylon and 29% Polyester. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun's UV rays. |
| 255 | Sun Shield Shirt | This shirt is made of 78% nylon and 22% Lycra Xtra Life fiber. It is handwashable and line dry. It is rated UPF 50+ for superior protection from the sun's UV rays. It is abrasion-resistant and wicks moisture for quick-drying comfort. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are all designed to be lightweight and comfortable in hot weather. They all have front and back cape venting that lets in cool breezes and two front bellows pockets. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. The Men's Plaid Tropic Shirt, Short-Sleeve is made with 52% polyester and 48% nylon. The Men's TropicVibe Shirt, Short-Sleeve is made of 71% Nylon and 29% Polyester. The Sun Shield Shirt is made of 78% nylon and 22% Lycra Xtra Life fiber and is abrasion-resistant and wicks moisture for quick-drying comfort.

In [32]:
# The index can also answer the query directly, using the same LLM
response = index.query(query, llm=llm)

In [33]:
# The creation of the vector database can be customized, e.g. by specifying the embeddings here
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])