# LangChain: Q&A over Documents

Although LLMs are powerful, they do not know about information they were not trained on. If you want to use an LLM to answer questions about documents it was not trained on, you have to give it information about those documents. The most common way to do this is through "retrieval augmented generation".

The idea of retrieval augmented generation is that, when given a question, you first do a retrieval step to fetch any relevant documents. You then pass those documents, along with the original question, to the language model and have it generate a response. To do this, however, you first need your documents in a format that can be queried in this way. This page goes over the high-level ideas behind those two steps:

#### 1. Ingestion
To use a language model to interact with your data, you first have to get the data into a suitable format. That format is an ``Index``. By putting data into an ``Index``, you make it easy for any downstream steps to interact with it.

There are several types of indexes, but by far the most common one is a Vectorstore. Ingesting documents into a vectorstore can be done with the following steps:

1. Load documents (using a Document Loader)
2. Split documents (using a Text Splitter)
3. Create embeddings for documents (using a Text Embedding Model)
4. Store documents and embeddings in a vectorstore
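
The four steps above can be sketched without any library. This is a minimal, dependency-free illustration and not LangChain's implementation: `load_documents`, `split_text`, `toy_embed`, and `InMemoryStore` are hypothetical names, and the "embedding" is a plain bag-of-words count standing in for a learned embedding model.

```python
from collections import Counter
from dataclasses import dataclass, field


def load_documents(rows):
    """Step 1: turn raw records into plain-text documents (stand-in for a Document Loader)."""
    return [f"name: {row['name']}\ndescription: {row['description']}" for row in rows]


def split_text(text, chunk_size=200):
    """Step 2: cut a document into fixed-size character chunks (stand-in for a Text Splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embed(text):
    """Step 3: toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


@dataclass
class InMemoryStore:
    """Step 4: keep each chunk next to its embedding."""
    entries: list = field(default_factory=list)

    def add(self, chunk):
        self.entries.append((chunk, toy_embed(chunk)))


# Two made-up rows, used only for illustration.
rows = [
    {"name": "Sun Shield Shirt", "description": "UPF 50+ rated sun protection."},
    {"name": "Waterhog Dog Mat", "description": "Keeps dirt and water off your floors."},
]

store = InMemoryStore()
for document in load_documents(rows):
    for chunk in split_text(document):
        store.add(chunk)

print(len(store.entries), "chunks stored")
```

Both sample documents are shorter than `chunk_size`, so each one ends up as a single chunk in the store.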

#### 2. Generation
Now that we have an Index, how do we use this to do generation? This can be broken into the following steps:

1. Receive user question
2. Look up documents in the index relevant to the question
3. Construct a PromptValue from the question and any relevant documents (using a PromptTemplate).
4. Pass the PromptValue to a model
5. Get back the result and return to the user.
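
The five generation steps can be traced in a few lines of plain Python. This is a rough sketch under stated assumptions: `retrieve`, `build_prompt`, and `fake_model` are hypothetical helpers, retrieval is done by simple word overlap rather than embeddings, and `fake_model` is a placeholder that does not call any real LLM.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=2):
    """Step 2: rank documents by how many question words they share (toy relevance score)."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents):
    """Step 3: fill a fixed template with the retrieved context and the question."""
    context = "\n\n".join(documents)
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"


def fake_model(prompt):
    """Step 4: placeholder for a real LLM call; it only reports the prompt size."""
    return f"(model received {len(prompt)} characters)"


documents = [
    "Sun Shield Shirt: UPF 50+ rated sun protection shirt.",
    "Waterhog Dog Mat: keeps dirt and water off floors.",
    "Tropic Shirt: short-sleeve shirt with sun protection.",
]

question = "Which shirt has sun protection?"  # Step 1: receive the user question
relevant = retrieve(question, documents)
prompt = build_prompt(question, relevant)
answer = fake_model(prompt)  # Step 5: this is what would be returned to the user
print(answer)
```

Only the two shirt documents reach the prompt; the dog mat shares no words with the question and is dropped by the retrieval step.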

Let's look at a simple example first and then at a more detailed one.

In [6]:
import openai
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())  # add .env to .gitignore
openai.api_key = os.getenv("OPENAI_API_KEY")

In [7]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader 
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

## Simple Example

In [None]:
# %%capture
# %pip install docarray

In [8]:
# Initialize the CSV loader with a path to a CSV file
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')

In [9]:
# Import an index to create a vectorstore
from langchain.indexes import VectorstoreIndexCreator

In [11]:
# Initialize a vectorstore index by passing in a vectorstore class and a list of document loaders
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [12]:
# Ask a question
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [13]:
# Create a response
response = index.query(query)

In [14]:
display(Markdown(response))



| Name | Description |
| --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets |
| Men's Plaid Tropic Shirt, Short-Sleeve | UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Men's TropicVibe Shirt, Short-Sleeve | UPF 50+ rated, 71% Nylon, 29% Polyester, 100% Polyester knit mesh, machine wash and dry, front and back cape venting, two front bellows pockets |
| Sun Shield Shirt by | UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, handwash, line dry, wicks moisture, fits comfortably over swimsuit, abrasion resistant |

All four shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant

## Detailed Example

General idea: we want to combine an LLM with a large collection of documents. The **key issue** is that an LLM can only inspect a few thousand words at a time. So how can we answer questions about large documents? This is where embeddings and vector stores come into play.

### Embeddings
Embeddings create numerical representations of pieces of text. These numerical representations capture the semantic meaning of the text: if two pieces of text have similar meanings, they will have similar embeddings. This is useful when we decide which pieces of text to include when passing context to the LLM to answer a question.
<center><img src="imgs/embedding_vectors.JPG" width="450" height="500"/></center>
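
To make "similar embeddings" concrete, here is a small sketch that compares vectors with cosine similarity. Cosine similarity is one common choice and is an assumption here; a given vector store may use a different metric. The three-dimensional vectors are hand-made for illustration and were not produced by any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors; real embeddings have many more dimensions.
sun_shirt = [0.9, 0.1, 0.0]
uv_top = [0.8, 0.2, 0.1]
dog_mat = [0.1, 0.0, 0.9]

similar = cosine_similarity(sun_shirt, uv_top)
dissimilar = cosine_similarity(sun_shirt, dog_mat)

print(f"sun shirt vs. UV top:  {similar:.3f}")
print(f"sun shirt vs. dog mat: {dissimilar:.3f}")
```

The two clothing vectors point in nearly the same direction and score close to 1, while the dog mat vector scores much lower.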

### Vector Store/[Database](https://medium.com/sopmac-ai/vector-databases-as-memory-for-your-ai-agents-986288530443)
The most common type of ``index`` is one that creates numerical embeddings (with an Embedding Model) for each document. A vectorstore stores Documents and associated embeddings, and provides fast ways to look up relevant Documents by embeddings.

It's populated with chunks of text from incoming documents. When we get a big incoming document, we split it into chunks of text and then create an embedding for each chunk. We then store the embeddings and the associated text in the vectorstore. This allows us to look up relevant chunks by their embeddings. In other words, we can pick out the chunks of text that fit in the LLM context.
<center><img src="imgs/vector_store.JPG" width="500" height="500"/></center>
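
Splitting a big document into chunks might look like the following sketch. `split_into_chunks` is a hypothetical helper that cuts on character counts; the overlap between neighbouring chunks is an assumption added here (a common option so that a sentence cut at a boundary still appears whole in one chunk) and is not something the text above prescribes.

```python
def split_into_chunks(text, chunk_size=100, overlap=20):
    """Cut text into chunks of at most chunk_size characters that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


# A 250-character dummy document made of repeating letters.
text = "".join(chr(ord("a") + i % 26) for i in range(250))
chunks = split_into_chunks(text, chunk_size=100, overlap=20)

for number, chunk in enumerate(chunks, start=1):
    print(number, len(chunk))
```

Each chunk would then be embedded and stored on its own, so a lookup returns only the pieces that matter instead of the whole document.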

Since the returned values can now fit in the LLM context, we can pass the picked chunks in the prompt to the LLM to get back the final answer:

<center><img src="imgs/process_embeddings.JPG" width="400" height="200"/></center>

Let's see how this works in code:

In [35]:
from langchain.document_loaders import CSVLoader 

# Create document loader
loader = CSVLoader(file_path=file, encoding='utf-8')

In [36]:
# Load documents from the document loader
docs = loader.load()

In [37]:
# Each document corresponds to a product in the csv file
docs[1]

Document(page_content=': 1\nname: Recycled Waterhog Dog Mat, Chevron Weave\ndescription: Protect your floors from spills and splashing with our ultradurable recycled Waterhog dog mat made right here in the USA. \n\nSpecs\nSmall - Dimensions: 18" x 28". \nMedium - Dimensions: 22.5" x 34.5".\n\nWhy We Love It\nMother nature, wet shoes and muddy paws have met their match with our Recycled Waterhog mats. Ruggedly constructed from recycled plastic materials, these ultratough mats help keep dirt and water off your floors and plastic out of landfills, trails and oceans. Now, that\'s a win-win for everyone.\n\nFabric & Care\nVacuum or hose clean.\n\nConstruction\n24 oz. polyester fabric made from 94% recycled materials.\nRubber backing.\n\nAdditional Features\nFeatures an -exclusive design.\nFeatures thick and thin fibers for scraping dirt and absorbing water.\nDries quickly and resists fading, rotting, mildew and shedding.\nUse indoors or out.\nMade in the USA.\n\nHave questions? Reach out to

Since these documents are already small, we don't need to split them into chunks. We can create embeddings directly for each document. We will use OpenAI's embedding class to create an embedding for each document:



In [39]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

If you want to see what these embeddings look like, let's embed a piece of text:

In [40]:
embed = embeddings.embed_query("Hi my name is Ahmad")

In [41]:
print(len(embed))

1536


The first five values of the embedding for the previous text are:

In [42]:
print(embed[:5])

[-0.026241900399327278, 0.0029073397163301706, -0.019630862399935722, -0.024295246228575706, -0.027834616601467133]


But we want to create embeddings for **all** pieces of text we've just loaded and store them in a vector store:

In [43]:
# Create a vectorstore from documents and embeddings
db = DocArrayInMemorySearch.from_documents(
    docs,      # list of documents
    embeddings # embedding object
)

Now we can use the vector store to find pieces of text that are similar to a particular query:

In [44]:
query = "Please suggest a shirt with sunblocking"

In [45]:
similar_docs = db.similarity_search(query)

In [46]:
len(similar_docs)

4
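
The search returned four documents, which appears to correspond to the default `k=4` of `similarity_search`. Conceptually, the lookup embeds the query and keeps the `k` stored vectors closest to it. The sketch below shows that idea with a hypothetical `top_k_search` helper, hand-made two-dimensional vectors, and cosine similarity as an assumed metric; it does not reproduce how `DocArrayInMemorySearch` works internally.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k_search(query_vector, store, k=4):
    """Return the texts of the k stored vectors most similar to the query vector."""
    best = heapq.nlargest(
        k, store, key=lambda item: cosine_similarity(query_vector, item[1])
    )
    return [text for text, _ in best]


# (text, vector) pairs with made-up vectors, for illustration only.
store = [
    ("sun shirt", [0.9, 0.1]),
    ("rain jacket", [0.5, 0.5]),
    ("dog mat", [0.1, 0.9]),
    ("sun hat", [0.8, 0.2]),
    ("hiking boots", [0.3, 0.7]),
]

query_vector = [1.0, 0.0]  # pretend this is the embedded query
results = top_k_search(query_vector, store, k=2)
print(results)
```

A real vector store typically uses an index structure instead of scoring every entry, but the result it returns has the same shape: the `k` best matches, most similar first.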

In [47]:
list(similar_docs)

[Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255}),
 Document(page_content=": 374\nname: Men's Plaid Tropic Shirt, Short-Sleeve\ndescription: Our Ultracomfortable sun protection is rated to UPF

How can we use this to answer questions over documents?

* First, we need to create a retriever from our vector store. A retriever is a generic interface that can be underpinned by any method that takes a query and returns documents.

> NOTE: Vector store and embeddings are not the only way to take a query and return documents. There are [other methods](https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a) (less or more advanced) that can be used to do this. 

In [48]:
# Initialize a retriever
retriever = db.as_retriever()
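
Because a retriever is only an interface, anything with a "query in, documents out" method can play that role. The sketch below illustrates this with a hypothetical `Retriever` protocol and a keyword-based `KeywordRetriever`; the names, the `retrieve` method, and the use of plain strings instead of `Document` objects are assumptions made for illustration and do not match LangChain's actual classes.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that maps a query string to a list of documents."""

    def retrieve(self, query: str) -> List[str]:
        ...


class KeywordRetriever:
    """Toy retriever: returns every document sharing at least one word with the query."""

    def __init__(self, documents: List[str]):
        self.documents = documents

    def retrieve(self, query: str) -> List[str]:
        query_words = set(query.lower().split())
        return [
            doc for doc in self.documents
            if query_words & set(doc.lower().split())
        ]


def gather_context(retriever: Retriever, query: str) -> str:
    """Downstream code depends only on the interface, not on how retrieval is done."""
    return "\n".join(retriever.retrieve(query))


keyword_retriever = KeywordRetriever(["sun shirt", "dog mat", "sun hat"])
context = gather_context(keyword_retriever, "sun protection")
print(context)
```

Swapping the keyword matcher for an embedding-based lookup would not change `gather_context`, which is the point of having the interface.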

Since we want to do text generation and return a natural language answer, we will import a language model:

In [49]:
llm = ChatOpenAI(temperature = 0.0)

We can combine the returned documents (similar to the query!) into a single piece of text:

In [50]:
qdocs = "".join(doc.page_content for doc in similar_docs)
qdocs

': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.: 374\nname: Men\'s Plaid Tropic Shirt, Short-Sleeve\ndescription: Our Ultracomfortable sun protection is rated to UPF 50+, helping you stay cool and dry. Originally designed for fishing, this lightest hot-weather shirt offers UPF 50+ c

Pass the combined documents into a prompt to get back the answer from the LLM:

In [51]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 


In [52]:
display(Markdown(response))

| Name | Description |
| --- | --- |
| Sun Shield Shirt | High-performance sun shirt with UPF 50+ sun protection, moisture-wicking, and abrasion-resistant fabric. Recommended by The Skin Cancer Foundation. |
| Men's Plaid Tropic Shirt | Ultracomfortable shirt with UPF 50+ sun protection, wrinkle-free fabric, and front/back cape venting. Made with 52% polyester and 48% nylon. |
| Men's TropicVibe Shirt | Men's sun-protection shirt with built-in UPF 50+ and front/back cape venting. Made with 71% nylon and 29% polyester. |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ sun protection, front/back cape venting, and two front bellows pockets. Made with 100% polyester. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are made with high-performance fabrics that are moisture-wicking, wrinkle-resistant, and abrasion-resistant. The Men's Plaid Tropic Shirt and Men's Tropical Plaid Short-Sleeve Shirt both have front/back cape venting for added breathability. The Sun Shield Shirt is recommended by The Skin Cancer Foundation.

All previous steps can be encapsulated in a LangChain chain. Here, we can create a ``RetrievalQA`` chain that does the **retrieval** + the **question answering** over the documents. To create such a chain, we need to pass the following:
* A language model to generate the text
* Chain type: 
    * ``stuff``: puts ALL of the text from the retrieved documents into a single prompt.
    * ``map_reduce``: separates the texts into batches (as an example, you can define the batch size with ``llm=OpenAI(batch_size=5)``), feeds each batch together with the question to the LLM separately, and produces the final answer from the answers to each batch.
    * ``refine``: separates the texts into batches, feeds the first batch to the LLM, then feeds the answer together with the second batch to the LLM, and so on. The answer is refined as it goes through all the batches.
    * ``map_rerank``: separates the texts into batches, feeds each batch to the LLM, has it return a score for how fully the batch answers the question, and produces the final answer from the highest-scoring answers.

<center><img src="imgs/chain_typ.JPG" width="900" height="400"/></center>
<center><img src="imgs/chain_typ2.JPG" width="900" height="300"/></center>

> NOTE: ``stuff`` and ``map_reduce`` are the most common types of chains.


* A retriever: an interface to fetch relevant documents for a query

In [55]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
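
To see how the chain types listed above differ in the number and shape of model calls, here is a schematic sketch. It is not LangChain's implementation: `stuff`, `map_reduce`, and `refine` are hypothetical functions with made-up prompt wording, each document is treated as its own batch, and `CountingLLM` is a stand-in that records prompts instead of calling a real model. ``map_rerank`` is left out for brevity.

```python
class CountingLLM:
    """Stand-in for a real model that records every prompt it receives."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"answer #{len(self.prompts)}"


def stuff(llm, question, docs):
    """One call: every document goes into a single prompt."""
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {question}")


def map_reduce(llm, question, docs):
    """One call per document, then one more call to combine the partial answers."""
    partial_answers = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partial_answers)
    return llm(f"{combined}\n\nCombine these partial answers. Question: {question}")


def refine(llm, question, docs):
    """Sequential calls: each step sees the previous answer plus the next document.

    Assumes docs contains at least one document.
    """
    answer = llm(f"{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\n\nNew context: {doc}\n\n"
            f"Refine the answer. Question: {question}"
        )
    return answer


docs = ["chunk one", "chunk two", "chunk three"]
question = "Which shirts have sun protection?"

for strategy in (stuff, map_reduce, refine):
    llm_stub = CountingLLM()
    strategy(llm_stub, question, docs)
    print(strategy.__name__, "->", len(llm_stub.prompts), "model call(s)")
```

With three documents, `stuff` makes one call, `refine` makes three sequential calls, and `map_reduce` makes four (three independent ones plus the combining call). The independent calls can run in parallel, whereas each `refine` call depends on the previous answer.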

Now, we can create a query and use the chain to get back the answer:

In [56]:
query = "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [57]:
response = qa_stuff.run(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [58]:
display(Markdown(response))

| Shirt Number | Name | Description |
| --- | --- | --- |
| 618 | Men's Tropical Plaid Short-Sleeve Shirt | This shirt is made of 100% polyester and is wrinkle-resistant. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun's UV rays. |
| 374 | Men's Plaid Tropic Shirt, Short-Sleeve | This shirt is made with 52% polyester and 48% nylon. It is machine washable and dryable. It has front and back cape venting, two front bellows pockets, and is rated to UPF 50+. |
| 535 | Men's TropicVibe Shirt, Short-Sleeve | This shirt is made of 71% Nylon and 29% Polyester. It has front and back cape venting that lets in cool breezes and two front bellows pockets. It is rated UPF 50+ for superior protection from the sun's UV rays. |
| 255 | Sun Shield Shirt | This shirt is made of 78% nylon and 22% Lycra Xtra Life fiber. It is handwashable and line dry. It is rated UPF 50+ for superior protection from the sun's UV rays. It is abrasion-resistant and wicks moisture for quick-drying comfort. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are all designed to be lightweight and comfortable in hot weather. They all have front and back cape venting that lets in cool breezes and two front bellows pockets. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. The Men's Plaid Tropic Shirt, Short-Sleeve is made with 52% polyester and 48% nylon. The Men's TropicVibe Shirt, Short-Sleeve is made of 71% Nylon and 29% Polyester. The Sun Shield Shirt is made of 78% nylon and 22% Lycra Xtra Life fiber and is abrasion-resistant and wicks moisture for quick-drying comfort.

> Remember: We can do all the previous steps in one line:

In [32]:
response = index.query(query, llm=llm)

> We can also customize the index when creating it:

In [33]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])