# LangChain: Q&A over Documents

For example, you could build a tool that lets you query a product catalog for items of interest.

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("config.env")) # load environment variables from the local config.env file

In [2]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [3]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [4]:
from langchain.indexes import VectorstoreIndexCreator

In [5]:
#pip install docarray

# pip install ipywidgets
# jupyter nbextension enable --py widgetsnbextension

In [6]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [7]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [8]:
response = index.query(query)

In [9]:
display(Markdown(response))



| Name | Description |
| --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets |
| Men's Plaid Tropic Shirt, Short-Sleeve | UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Men's TropicVibe Shirt, Short-Sleeve | UPF 50+ rated, 71% nylon, 29% polyester, 100% polyester knit mesh, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Sun Shield Shirt by | UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, handwash, line dry, wicks moisture, abrasion resistant |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are all made of different materials and have different features such as cape venting, bellows pockets, and wrinkle-resistance.

## What is going on underneath the hood?

We want to combine an LLM with a large collection of our own documents.

But LLMs can only inspect a **few thousand words** at a time. If we have really large documents, how can we get the language models to answer questions about **everything**?


This is where embeddings and vector databases come into play.

### Embeddings
- An embedding vector captures the content/meaning of a piece of text
- Texts with similar content have similar vectors

![embeddings](images/embeddings.png)
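
To make "similar content, similar vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common similarity measure. The three-dimensional vectors are made-up illustrative values, not output from a real embedding model, and the helper name `cosine_similarity` is our own.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (illustrative values only).
toy_vectors = {
    "sun shirt": [0.9, 0.1, 0.0],
    "UV-protective top": [0.8, 0.2, 0.1],
    "hiking boots": [0.1, 0.9, 0.3],
}

query_vec = toy_vectors["sun shirt"]
sim_related = cosine_similarity(query_vec, toy_vectors["UV-protective top"])
sim_unrelated = cosine_similarity(query_vec, toy_vectors["hiking boots"])

# The related item scores higher than the unrelated one.
print(sim_related > sim_unrelated)
```

With real embeddings the idea is the same, only the vectors are much longer and come from the model rather than being written by hand.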

### Vector Database
- The large document is first broken down into smaller chunks
- An embedding is created for each chunk

![VectorDatabase](images/vectordatabase.png)

- The embeddings are stored in a vector database (this creates the index)

**Runtime**

- The embedding of the incoming query is compared with all the vectors in the vector database
- The `n` most similar chunks are picked and passed to the language model together with the query

![Index](images/index.png)
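
The index-then-query flow above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, `split_into_chunks` splits naively by word count, and a plain list plays the role of the vector database.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def split_into_chunks(text, max_words=10):
    """Naive chunking: cut the text every max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

document = (
    "This sun shirt is rated UPF 50+ for sun protection. "
    "These canvas oxfords have thick cushioning and a rubber outsole. "
    "This fleece jacket keeps you warm on cold winter hikes."
)

# Index time: embed every chunk once and store (chunk, vector) pairs.
vector_index = [(chunk, toy_embed(chunk)) for chunk in split_into_chunks(document)]

def top_n(query, index, n=2):
    """Query time: embed the query and return the n most similar chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]

results = top_n("shirt with sun protection", vector_index, n=1)
print(results[0])
```

A real vector store replaces the linear scan in `top_n` with a proper similarity search, but the two phases (embed and store, then embed the query and compare) stay the same.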

In [11]:
loader = CSVLoader(file_path=file)
docs = loader.load()

In [17]:
len(docs)

1000

In [18]:
docs[0]

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on.\n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size.\n\nSpecs: Approx. weight: 1 lb.1 oz. per pair.\n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported.\n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

In [19]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [26]:
embed = embeddings.embed_query("Hi my name is Kloping")

In [27]:
print(len(embed))
print(embed[:10])

1536
[-0.020690198987722397, -0.005927876103669405, 0.003980600740760565, -0.03587321192026138, -0.010249488987028599, 0.038116879761219025, -0.01532324030995369, -0.006667267065495253, 0.0012469255598261952, 0.008694218471646309]


In [28]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [35]:
query = "Please suggest a shirt with sunblocking"

In [36]:
docs = db.similarity_search(query)

In [37]:
len(docs)

4

In [38]:
docs[0]

Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays.\n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255})

Now let's create a question-answering mechanism based on these documents.

In [39]:
retriever = db.as_retriever()

In [40]:
llm = ChatOpenAI(temperature = 0.0)

In [41]:
# Combine the retrieved documents into a single piece of text
qdocs = "".join(doc.page_content for doc in docs)

In [42]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 

In [43]:
display(Markdown(response))

| Name | Description |
| --- | --- |
| Sun Shield Shirt | High-performance sun shirt with UPF 50+ sun protection, moisture-wicking fabric, and abrasion resistance. Recommended by The Skin Cancer Foundation. |
| Men's Plaid Tropic Shirt | Ultracomfortable shirt with UPF 50+ sun protection, front and back cape venting, and two front bellows pockets. Made with 52% polyester and 48% nylon. |
| Men's TropicVibe Shirt | Men's sun-protection shirt with built-in UPF 50+ and wrinkle-resistant fabric. Features front and back cape venting and two front bellows pockets. |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ sun protection, front and back cape venting, and two front bellows pockets. Made with 100% polyester and is wrinkle-resistant. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They also feature additional benefits such as moisture-wicking fabric, wrinkle resistance, and venting for cool breezes.

In [46]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff",  # stuffs all the documents into context
    retriever=retriever, 
    verbose=True
)

In [47]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [48]:
response = qa_stuff.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [49]:
display(Markdown(response))

| Shirt Name | Description |
| --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Rated UPF 50+ for superior protection from the sun's UV rays. Made of 100% polyester and is wrinkle-resistant. With front and back cape venting that lets in cool breezes and two front bellows pockets. Provides the highest rated sun protection possible. |
| Men's Plaid Tropic Shirt, Short-Sleeve | Rated to UPF 50+, helping you stay cool and dry. Made with 52% polyester and 48% nylon, this shirt is machine washable and dryable. Additional features include front and back cape venting, two front bellows pockets and an imported design. With UPF 50+ coverage, you can limit sun exposure and feel secure with the highest rated sun protection available. |
| Men's TropicVibe Shirt, Short-Sleeve | Built-in UPF 50+ has the lightweight feel you want and the coverage you need when the air is hot and the UV rays are strong. Made of 71% Nylon, 29% Polyester. Wrinkle resistant. Front and back cape venting lets in cool breezes. Two front bellows pockets. Provides the highest rated sun protection possible. |
| Sun Shield Shirt | High-performance sun shirt is guaranteed to protect from harmful UV rays. Made of 78% nylon, 22% Lycra Xtra Life fiber. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. |

Each shirt provides UPF 50+ sun protection, blocking 98% of the sun's harmful rays. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. The Men's Plaid Tropic Shirt, Short-Sleeve is made with 52% polyester and 48% nylon, and is machine washable and dryable. The Men's TropicVibe Shirt, Short-Sleeve is made of 71% Nylon, 29% Polyester, and is wrinkle-resistant. The Sun Shield Shirt is made of 78% nylon, 22% Lycra Xtra Life fiber, and is abrasion-resistant.

In [50]:
# Equivalent one-liner: query the index directly, passing the same LLM
response = index.query(query, llm=llm)

In [51]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

## Stuff method
Stuffing is the simplest method. You simply stuff all data into the prompt as context to pass to the language model.

- **Pros**: it makes a single call to the LLM. The LLM has access to all the data at once.
- **Cons**: LLMs have a limited context length, so this does not work for large documents or many documents: the resulting prompt would be larger than the context length.
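
As a rough illustration of stuffing and its limit, the sketch below joins all chunks into one prompt and fails when the prompt grows too large. The function name `build_stuff_prompt` and the `max_words` budget are hypothetical; real context limits are measured in tokens, not words.

```python
def build_stuff_prompt(chunks, question, max_words=3000):
    """Concatenate every chunk plus the question into one prompt.

    max_words is a hypothetical budget standing in for the model's real
    context limit, which is measured in tokens rather than words.
    """
    context = "\n\n".join(chunks)
    prompt = f"{context}\n\nQuestion: {question}"
    if len(prompt.split()) > max_words:
        raise ValueError("Prompt exceeds the context budget; stuffing will not work.")
    return prompt

chunks = [
    "name: Sun Shield Shirt. UPF 50+ rated.",
    "name: Women's Campside Oxfords. Soft canvas.",
]
prompt = build_stuff_prompt(chunks, "Which items offer sun protection?")
print(prompt)
```

The single resulting prompt is what makes this method cheap (one call), and the size check is what makes it break down for large inputs.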

## Map reduce

Map reduce takes all the chunks, passes each one along with the query to a language model, collects the responses, and then uses another language model call to summarize the individual responses into a final answer.

![MapReduce](images/map_reduce.png)

The individual queries (one per chunk) can be run in parallel.

But
- it takes more calls
- it treats each chunk/document as independent
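
The map and reduce steps can be sketched as follows. To keep the example runnable offline, `fake_llm` is a hypothetical stand-in for a real model call with hard-coded toy behavior, and the prompt wording is our own assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call, so the sketch runs offline."""
    if prompt.startswith("Combine"):
        # Reduce step: keep only the partial answers that found something.
        lines = prompt.splitlines()[1:]
        return "; ".join(line for line in lines if line != "nothing relevant")
    # Map step: report the chunk only if it mentions UPF.
    chunk = prompt.split("Context: ", 1)[1]
    return chunk if "UPF" in chunk else "nothing relevant"

def map_reduce(chunks, question, llm):
    map_prompts = [f"Question: {question}\nContext: {chunk}" for chunk in chunks]
    # Map: each chunk is queried independently, so the calls can run in parallel.
    with ThreadPoolExecutor() as pool:
        partial_answers = list(pool.map(llm, map_prompts))
    # Reduce: one extra call merges the partial answers into a final answer.
    reduce_prompt = "Combine these answers:\n" + "\n".join(partial_answers)
    return llm(reduce_prompt)

chunks = [
    "Sun Shield Shirt, UPF 50+",
    "Women's Campside Oxfords, soft canvas",
    "Men's TropicVibe Shirt, UPF 50+",
]
answer = map_reduce(chunks, "Which shirts have sun protection?", fake_llm)
print(answer)
```

Note that three chunks cost four calls here (three map calls plus one reduce call), and no map call ever sees another chunk.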

## Refine

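Refine runs the individual questioning process iteratively: each call builds upon the answer from the previous document/chunk.

It's good for combining information and building up an answer over time.

But
- it takes more time, because each call depends on the previous one and the calls cannot run in parallel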

![Refine](images/refine.png)
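
A minimal sketch of the refine loop, assuming a hypothetical `fake_llm` with hard-coded toy behavior in place of a real model and a prompt layout of our own choosing:

```python
def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call, so the sketch runs offline."""
    # Parse the two fields this sketch puts in every prompt.
    existing = prompt.split("Existing answer: ", 1)[1].split("\n", 1)[0]
    chunk = prompt.split("New context: ", 1)[1]
    if "UPF" not in chunk:
        return existing  # nothing to add, keep the answer as it is
    return chunk if existing == "(none)" else f"{existing}; {chunk}"

def refine(chunks, question, llm):
    answer = "(none)"
    # Each call sees the previous answer, so the calls must run one after another.
    for chunk in chunks:
        prompt = (
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context: {chunk}"
        )
        answer = llm(prompt)
    return answer

chunks = [
    "Sun Shield Shirt, UPF 50+",
    "Women's Campside Oxfords, soft canvas",
    "Men's TropicVibe Shirt, UPF 50+",
]
final_answer = refine(chunks, "Which shirts have sun protection?", fake_llm)
print(final_answer)
```

Unlike map reduce, the loop carries state from one chunk to the next, which is why it can combine information across chunks but cannot be parallelized.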

## Map Rerank

Map rerank makes an individual call for each chunk/document and also asks the LLM to return a score. The final answer is the one with the highest score.

This relies on the LLM knowing what the score should mean, so the prompt needs an extra instruction that tells the model how to score its answers.

![MapRerank](images/map_rerank.png)
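
The scoring idea can be sketched as below. The `fake_llm` function, its toy scoring rule, and the `answer | score` reply format are all hypothetical choices made so the example runs without a real model.

```python
def fake_llm(prompt):
    """Hypothetical stand-in for an LLM told to reply as 'answer | score'."""
    chunk = prompt.split("Context: ", 1)[1]
    # Toy scoring rule: count how often the chunk mentions sun or UPF.
    score = 50 * (chunk.lower().count("sun") + chunk.count("UPF"))
    return f"{chunk} | {min(score, 100)}"

def map_rerank(chunks, question, llm):
    # The extra instruction that tells the model how to score its answer.
    instructions = "Answer the question and rate your confidence from 0 to 100 as 'answer | score'."
    scored = []
    for chunk in chunks:
        reply = llm(f"{instructions}\nQuestion: {question}\nContext: {chunk}")
        answer, score = reply.rsplit(" | ", 1)
        scored.append((int(score), answer))
    # Keep only the answer with the highest self-reported score.
    return max(scored, key=lambda pair: pair[0])[1]

chunks = [
    "Sun Shield Shirt, UPF 50+",
    "Women's Campside Oxfords, soft canvas",
    "Men's TropicVibe Shirt, UPF 50+",
]
best = map_rerank(chunks, "Which shirt has sun protection?", fake_llm)
print(best)
```

Because only the top-scoring answer survives, this approach suits questions whose answer lives in a single chunk rather than ones that need information combined across chunks.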