# LangChain: Q&A over Documents

An example might be a tool that would allow you to query a product catalog for items of interest.

One of the most common and complex applications that 
people are building with LLMs is a system that can answer 
questions about a document. 
Given a piece of text, perhaps extracted from a 
PDF file, a webpage, or a company's 
internal document collection, can you use an LLM to answer 
questions about its content, so that 
users gain a deeper understanding and 
get the information they need? 

This is really powerful because it combines 
language models with data that they weren't 
originally trained on, which makes them much 
more flexible and adaptable to your use case. It's also 
really exciting because we'll start to move beyond language models, prompts, and output 
parsers and introduce more of the key components 
of LangChain, such as embedding models and vector stores. 
 

In [None]:
#pip install --upgrade langchain

In [None]:
# environment
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [None]:
from langchain.chains import RetrievalQA
# This will do retrieval over some documents.

from langchain.chat_models import ChatOpenAI

from langchain.document_loaders import CSVLoader
# used to load some proprietary data that we're going to combine with the language model.
# In this case it's going to be in a CSV.

from langchain.vectorstores import DocArrayInMemorySearch
# We're going to import a vector store.
# This is really nice because it's an in-memory vector store and it doesn't require connecting to an 
# external database of any kind so it makes it really easy to get started. 

from IPython.display import display, Markdown
# common utilities for displaying information in Jupyter notebooks. 

In [None]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
# Here we're going to initialize a loader, the CSV loader, with a path to this file. 

In [None]:
from langchain.indexes import VectorstoreIndexCreator
# Next, we import an index creator, the "VectorstoreIndexCreator". 
# This will help us create a vector store really easily. 

In [None]:
#pip install docarray

To create it, we first specify the vector store class. 
As mentioned before, we're going to use DocArrayInMemorySearch, as 
it's a particularly easy one to get started with. 
Once the creator has been constructed, we call "from_loaders", which takes 
a list of document loaders. 

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [None]:
query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [None]:
response = index.query(query)
# create a response using "index.query" and pass in this query

In [None]:
display(Markdown(response))

*OUTPUT*
<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Men's Tropical Plaid Short-Sleeve Shirt</td>
      <td>UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets</td>
    </tr>
    <tr>
      <td>Men's Plaid Tropic Shirt, Short-Sleeve</td>
      <td>UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets</td>
    </tr>
    <tr>
      <td>Men's TropicVibe Shirt, Short-Sleeve</td>
      <td>UPF 50+ rated, 71% nylon, 29% polyester, 100% polyester knit mesh, wrinkle-resistant, front and back cape venting, two front bellows pockets</td>
    </tr>
    <tr>
      <td>Sun Shield Shirt by</td>
      <td>UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, wicks moisture, fits comfortably over swimsuit, abrasion resistant</td>
    </tr>
  </tbody>
</table>

```
All of the shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are all wrinkle-resistant and have front and back cape venting and two front bellows pockets. The Men's Tropical Plaid Short-
```

We've gotten back a table in markdown with names 
and descriptions for all shirts with sun protection. 
We've also got a nice little summary that the 
language model has provided us. 

1. **LLM (Large Language Model):** Refers to a large-scale language model, like GPT-3, capable of processing and generating human-like text.

2. **Question Answering System:** An application built using LLMs that can answer questions based on a given document's content, helping users gain a deeper understanding of the information.

3. **Embeddings:** Numerical representations of pieces of text that capture their semantic meaning, allowing for comparison of similarity between pieces of text.

4. **Vector Store/Database:** A storage system for storing embeddings of various documents, enabling efficient retrieval and comparison of relevant texts.

5. **Retriever:** A component used to fetch documents based on an incoming query.

6. **ChatOpenAI:** LangChain's wrapper around OpenAI's chat models, used here for text generation and natural language responses.

7. **LangChain:** A framework for composing and executing complex language model workflows.

8. **Stuff Method:** A simple approach in LangChain where all documents are combined into one prompt and sent to the language model for generating a single response.

9. **Map_reduce Method:** A method in LangChain where documents are processed independently in parallel and then summarized into a final answer.

10. **Refine Method:** A method in LangChain where answers are built iteratively based on previous responses, allowing for combining information and generating longer answers.

11. **Map_rerank Method:** An experimental method in LangChain where individual calls to the language model are made for each document, and the highest scoring response is selected.

These concepts are relevant to building a powerful and flexible system for question-answering tasks using LLMs and vector stores.

![Question](immagini/15_question.png)

We want to use language models and 
combine them with a lot of our documents. 
But there's a key issue: 
language models can only inspect a few thousand 
words at a time. 
So if we have really large documents, how can we get 
the language model to answer questions about everything 
that's in there? 
This is where embeddings and vector stores come into play. First, 
let's talk about embeddings.

Embeddings create numerical representations 
for pieces of text. 
This numerical representation captures the semantic 
meaning of the piece of text that it's been run over. 
Pieces of text with similar content will have similar vectors. 
This lets us compare pieces of text in the vector space. 

![Question](immagini/16_question.png)

In the example, we have three sentences. 
The first two are about pets, while the third is about a car. 
If we look at the representation in the numeric space, 
we can see that the vectors for the two sentences about pets 
are very similar to each other, 
whereas neither of them is similar to the vector for the 
sentence about a car. 
This lets us easily figure out which pieces of 
text are like each other, which will be very useful as 
we think about which pieces of text we want to include when 
passing to the language model to answer a question. 
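
As a rough illustration of this comparison, here is a minimal sketch using cosine similarity, one common way to compare embedding vectors. The text above does not say which similarity measure is used, so cosine similarity is an assumption, and the three-dimensional vectors are invented by hand for illustration; real embeddings come from a model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT produced by a real embedding model.
toy_embeddings = {
    "pet sentence 1": [0.9, 0.8, 0.1],
    "pet sentence 2": [0.8, 0.9, 0.2],
    "car sentence": [0.1, 0.2, 0.9],
}

pets_similarity = cosine_similarity(
    toy_embeddings["pet sentence 1"], toy_embeddings["pet sentence 2"]
)
car_similarity = cosine_similarity(
    toy_embeddings["pet sentence 1"], toy_embeddings["car sentence"]
)

# The two pet vectors score much higher than the pet/car pair.
print(f"pets vs pets: {pets_similarity:.2f}")
print(f"pets vs car:  {car_similarity:.2f}")
```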

![Question](immagini/17_question.png)

The next component that we're going to cover is 
the vector database. 
A vector database is a way to store the 
vector representations that we created in the previous step. 
We populate it with chunks of text 
coming from incoming documents. 
When we get a big incoming document, we first break it 
up into smaller chunks. 
This is useful because 
we may not be able to pass the whole document to the 
language model. With small chunks, 
we can pass only the most relevant 
ones to the language model. 
We then create an embedding for each of these chunks 
and store those embeddings in a vector database. 
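
The chunking step can be sketched as follows. This is a simplified, character-based splitter written for illustration; it is not LangChain's text splitter, and the function name `split_into_chunks`, the default sizes, and the overlap between neighbouring chunks are all assumptions.

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into windows of at most chunk_size characters.

    Each window starts `overlap` characters before the previous one ended,
    so that a sentence cut at a boundary still appears whole in one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# A stand-in for a long document.
document = "".join(str(i % 10) for i in range(450))
chunks = split_into_chunks(document)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be embedded and stored together with its vector.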

That's what happens when we create the index. 
Now that we've got this index, we can use it during 
runtime to find the pieces of text most 
relevant to an incoming query. 
When a query comes in, we first create an 
embedding for that query. 
We then compare it to all the vectors 
in the vector database, and we pick the n most similar. 
These are then returned, and we can pass those in the prompt 
to the language model to get back a final answer.
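
The query-time lookup described above can be sketched with a plain Python list standing in for the vector database. The stored vectors and the query vector are invented for illustration, cosine similarity is assumed as the similarity measure, and `top_n` is a hypothetical helper rather than a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n(query_vector, store, n=2):
    """Return the n stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:n]]


# Hypothetical (text, vector) pairs; a real store would hold model embeddings.
store = [
    ("sun shirt", [0.9, 0.1, 0.0]),
    ("rain jacket", [0.1, 0.9, 0.1]),
    ("sun hat", [0.8, 0.2, 0.1]),
    ("hiking boots", [0.0, 0.2, 0.9]),
]

# In practice this vector would come from embedding the incoming query.
query_vector = [1.0, 0.0, 0.0]
most_similar = top_n(query_vector, store, n=2)
print(most_similar)
```

The texts returned by `top_n` are what would be placed in the prompt for the language model.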

![Question](immagini/18_question.png)

In [None]:
loader = CSVLoader(file_path=file)
# We're going to create a document loader, loading from that CSV with all the descriptions of the 
# products that we want to do question answering over. 

In [None]:
docs = loader.load()
# load documents from this document loader

In [None]:
docs[0]

*OUTPUT*

```
Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})
```

If we look at the individual documents, we can see that each 
document corresponds to one of the products in the CSV. 
Previously, we talked about creating chunks. 
Because these documents are already so small, 
we actually don't need to do any chunking here. 
And so we can create embeddings directly. 

In [None]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
embed = embeddings.embed_query("Hi my name is Harrison")
# Let's use the "embed_query" method on the embeddings object to create an embedding for a particular piece of text.

In [None]:
print(len(embed))

*OUTPUT*

1536

In [None]:
print(embed[:5])
# this creates the overall numerical representation for this piece of text.

*OUTPUT*

[-0.021913960576057434, 0.006774206645786762, -0.018190348520874977, -0.039148248732089996, -0.014089343138039112]

In [None]:
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

We want to create embeddings for all the 
pieces of text that we just loaded, and 
we also want to store them in a vector store. 
We can do both by using the "from_documents" method 
on the vector store. 
This method takes in a list of documents and 
an embedding object, and creates an overall vector store. 
We can now use this vector store to find pieces of text 
similar to an incoming query. 

In [None]:
query = "Please suggest a shirt with sunblocking"

In [None]:
docs = db.similarity_search(query)

In [None]:
len(docs)

*OUTPUT*

4

In [None]:
docs[0]

*OUTPUT*

```
Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 255})
```

## How do we use this to do question answering over our own documents?
First, we need to create a 
retriever from this vector store. A retriever is a 
generic interface that can be underpinned by any 
method that takes in a query and returns documents.
Vector stores and embeddings are one such method to do so, 
although there are plenty of different methods, 
some less advanced, some more advanced. 
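
To make the idea of a generic interface concrete, here is a minimal sketch of one of the "less advanced" options: a retriever based on plain keyword overlap, with no embeddings at all. The `Retriever` type alias and `make_keyword_retriever` are hypothetical names for illustration and are not LangChain's retriever classes.

```python
from typing import Callable, List

# In this simplified view, a retriever is any callable that maps
# a query string to a list of matching texts.
Retriever = Callable[[str], List[str]]


def make_keyword_retriever(texts: List[str]) -> Retriever:
    """Build a retriever that returns texts sharing at least one word with the query."""

    def retrieve(query: str) -> List[str]:
        query_words = set(query.lower().split())
        return [t for t in texts if query_words & set(t.lower().split())]

    return retrieve


catalog = ["Sun Shield Shirt", "Campside Oxfords", "Tropical Plaid Shirt"]
retrieve = make_keyword_retriever(catalog)
matches = retrieve("shirt with sun protection")
print(matches)
```

Anything that consumes a `Retriever` does not need to know whether the documents were found through keywords, a vector store, or something else.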

In [None]:
retriever = db.as_retriever()

Because we want to do text generation and 
return a natural language response, we also need a 
language model. We're going to use ChatOpenAI.

In [None]:
llm = ChatOpenAI(temperature = 0.0)

In [None]:
# If we were doing this by hand
qdocs = "".join(doc.page_content for doc in docs)

In [None]:
# We join all the page content in the documents into a variable 
# and then would pass this variable or a variant on the question into the language model. 
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 

In [None]:
# display the response
display(Markdown(response))

*OUTPUT*
![Responde](immagini/21_responde.png)
```
All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They also have additional features such as moisture- wicking, wrinkle-free fabric, and front/back cape venting for added comfort. The Sun Shield Shirt is recommended by The Skin Cancer Foundation for its effective UV protection.
```

The retriever we created above is just an 
interface for fetching documents. 
It will be used to fetch the documents and pass 
them to the language model. 

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

In [None]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [None]:
response = qa_stuff.run(query)

![Responde](immagini/22_responde.png)

In [None]:
display(Markdown(response))

![Responde](immagini/23_responde.png)

Remember that we can still do it pretty easily with just the one line that we had up above. 
So, these two things equate to the same result. 
And that's part of the interesting stuff about LangChain. You 
can do it in one line, or you can look at 
the individual things and break it down into 
five more detailed ones. 

The five more detailed ones let you set 
more specifics about what exactly is going on, but the one-liner 
is easy to get started. So up to you as to how you'd prefer 
to go forward.

In [None]:
# customize the index
response = index.query(query, llm=llm)

The same level of customization that you have when you build 
it by hand is also available when you create the 
index here. 

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

![Responde](immagini/24_responde.png)

![Responde](immagini/25_responde.png)

The stuff method is really nice because it's 
pretty simple. You just put all of the documents into one prompt, send that to 
the language model, and get back one response. 
So it's quite simple to understand what's going 
on, it's quite cheap, and it works pretty well. 
But it doesn't always work. 
If you remember, when we fetched the 
documents in the notebook, we only got four documents back, 
and they were relatively small. 
But what if you wanted to do the same type of question 
answering over lots of different chunks, too many to fit into a single prompt? 
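
A minimal sketch of the stuff approach, written without LangChain so that the single model call is visible. The prompt wording, the helper names, and `recording_llm` (a stand-in for a real model that just records what it receives) are all assumptions made for illustration; this is not the prompt that LangChain itself uses.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every document into one context block, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


def stuff_answer(documents, question, llm):
    """Answer with exactly one model call, however many documents there are."""
    return llm(build_stuff_prompt(documents, question))


received_prompts = []


def recording_llm(prompt):
    """Hypothetical stand-in for a language model call."""
    received_prompts.append(prompt)
    return f"(answer based on a prompt of {len(prompt)} characters)"


docs_text = ["Sun Shield Shirt: UPF 50+ rated.", "Campside Oxfords: soft canvas."]
answer = stuff_answer(docs_text, "Which items offer sun protection?", recording_llm)
print(answer)
```

Because the prompt grows with every document, this approach stops working once the combined text exceeds what the model can read at once.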

Then there are a few different methods that we can use. 
The first is "map_reduce". 
This takes each chunk, passes it along with the 
question to a language model, gets back a response, and then uses 
another language model call to summarize all of the 
individual responses into a final answer. 
This is really powerful because it can operate 
over any number of documents, 
and because the individual questions can be run in parallel. 
But it does take a lot more calls, and it treats 
all the documents as independent, which may not always 
be what you want. 

"Refine", which is another method, 
also loops over many documents, 
but it does so iteratively: it builds upon the 
answer from the previous document. 
So this is really good for combining information and 
building up an answer over time. It will generally lead to longer 
answers. 
It's also not as fast, because now the calls aren't independent: 
each depends on the result of the previous call. 
This means that it often takes a good 
while longer, and it takes just about as many calls as "map_reduce". 

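
The difference between the two strategies can be sketched side by side. The prompts and function names below are assumptions made for illustration, and `counting_llm` is a hypothetical stand-in for a model that only counts how often it is called; neither function is LangChain's implementation.

```python
def map_reduce_answer(documents, question, llm):
    """One independent call per document, then one call to combine the results."""
    # Map step: these calls do not depend on each other, so they could run in parallel.
    partial_answers = [
        llm(f"Context: {doc}\nQuestion: {question}") for doc in documents
    ]
    # Reduce step: a final call that merges the partial answers.
    combined = "\n".join(partial_answers)
    return llm(
        f"Combine these partial answers into one final answer to '{question}':\n{combined}"
    )


def refine_answer(documents, question, llm):
    """One call per document, each building on the answer produced so far."""
    answer = ""
    for doc in documents:
        # Each call needs the previous answer, so the calls must run in sequence.
        answer = llm(
            f"Existing answer: {answer}\nNew context: {doc}\n"
            f"Question: {question}\nRefine the existing answer."
        )
    return answer


calls = []


def counting_llm(prompt):
    """Hypothetical stand-in for a language model call."""
    calls.append(prompt)
    return f"answer {len(calls)}"


documents = ["chunk one", "chunk two", "chunk three"]
map_reduce_answer(documents, "What is covered?", counting_llm)
print("map_reduce calls:", len(calls))
```

In this sketch, map_reduce makes one call per document plus a final combining call, while refine makes one call per document but cannot parallelize them.
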
"Map_rerank" is a pretty interesting and a bit more 
experimental one, where you make a single call to the language model 
for each document and also ask it to return a score. 
You then select the answer with the highest score. 
This relies on the language model to know 
what the score should be, so you often have to tell it, "Hey, 
it should be a high score if it's relevant to the document", and really 
refine the instructions there. Similar to "map_reduce", all 
the calls are independent, so you 
can batch them and it's relatively fast. But again, you're making a bunch 
of language model calls, so it will be 
a bit more expensive. 

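
A minimal sketch of the rerank idea. Here `toy_scored_llm` is a hypothetical stand-in that scores each document by word overlap with the question; a real setup would instead instruct the language model to return an answer together with a score, and the function names are assumptions rather than LangChain APIs.

```python
def map_rerank_answer(documents, question, scored_llm):
    """Call the model once per document and keep the highest-scoring answer."""
    # Each call is independent of the others, so they could be batched.
    candidates = [scored_llm(doc, question) for doc in documents]
    best_answer, _best_score = max(candidates, key=lambda pair: pair[1])
    return best_answer


def toy_scored_llm(document, question):
    """Hypothetical stand-in: returns (answer, score) using word overlap as the score."""
    overlap = set(document.lower().split()) & set(question.lower().split())
    return f"Based on: {document}", len(overlap)


documents = ["wool socks for winter", "sun protection shirt", "rain jacket"]
best = map_rerank_answer(documents, "shirt with sun protection", toy_scored_llm)
print(best)
```
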
The most common of these methods is the "stuff" method, 
which we used in the notebook to combine 
all the retrieved documents into one prompt. 
The second most common is the "map_reduce" method, which takes these chunks 
and sends them individually to the language model. 
These methods, "stuff", "map_reduce", "refine", and "map_rerank", can also 
be used for lots of other chains besides just 
question answering. 
For example, a really common use case of the "map_reduce" 
chain is summarization, where you have a really long document 
and you want to recursively summarize 
pieces of information in it. 
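
A rough sketch of that recursive summarization pattern. The `summarize` argument stands for a language model call; the stand-in used below simply keeps the first word of its input, and the function name and `group_size` parameter are assumptions made for illustration.

```python
def recursive_summarize(texts, summarize, group_size=3):
    """Summarize groups of texts, then summarize the summaries, until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not texts:
        return ""
    # Summarize each group of neighbouring texts with one call per group.
    summaries = [
        summarize(" ".join(texts[i:i + group_size]))
        for i in range(0, len(texts), group_size)
    ]
    if len(summaries) == 1:
        return summaries[0]
    # More than one summary left: repeat the process on the summaries themselves.
    return recursive_summarize(summaries, summarize, group_size)


summarize_calls = []


def first_word_summarizer(text):
    """Hypothetical stand-in for a model: 'summarizes' by keeping the first word."""
    summarize_calls.append(text)
    return text.split()[0]


pieces = ["a b", "c d", "e f", "g h", "i j", "k l", "m n"]
final_summary = recursive_summarize(pieces, first_word_summarizer)
print(final_summary, len(summarize_calls))
```
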
That's it for question answering over documents. 
As you may have noticed, there's a lot going on in the 
different chains that we have here.