# LangChain: Data Connection (Q&A over Documents)

An LLM's knowledge is restricted to its training set. Suppose the model was trained on data up to 2021 and is asked about a company founded in 2023. It may then generate a plausible but entirely fabricated description, a phenomenon known as "**hallucination**".

Managing **hallucinations** is tricky, especially in applications where accuracy and reliability are paramount, such as customer-service chatbots, knowledge-base assistants, or AI tutors.

One promising strategy to mitigate hallucination is the use of **retrievers** in tandem with LLMs. A **retriever** fetches relevant information from a trusted knowledge base, much as a search engine does, and the LLM is then explicitly prompted to restate that information without inventing additional details.

Efficient **retrievers** are built using embedding models that map texts to vectors. These vectors are then stored in specialized databases called vector stores.

-----------

Many LLM applications require user-specific data that is not part of the model's training set. LangChain gives you the building blocks to load, transform, store and query your data.

https://python.langchain.com/docs/modules/data_connection/


## Q&A over Documents

One of the most common complex applications that people build with an LLM is a system that answers questions about a document.

Given a piece of text, perhaps extracted from a PDF file, a webpage, or a company's internal document collection, can you use an LLM to answer questions about its content, so that users gain a deeper understanding and get the information they need?

This is powerful because it combines language models with data they weren't originally trained on, which makes them much more flexible and adaptable to your use case.

**Vector Store (DocArrayInMemorySearch):**
- It is an in-memory vector store that doesn't require a connection to any external database, which makes it easy to get started.
- It is a document index provided by DocArray that stores documents in memory. It is a great starting point for small datasets, where you may not want to launch a database server.

  https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/docarray_in_memory

**DocArray:**

https://docs.docarray.org/

**Different approaches:**
- **RetrievalQA**: Retrieves the relevant documents and answers questions over them.
- **VectorstoreIndexCreator**: Allows us to create a vector store easily.

**Ref**:
- https://towardsdatascience.com/4-ways-of-question-answering-in-langchain-188c6707cc5a
- https://medium.com/@JerryYu/langchain-llm-application-development-a6b9e25b83d6
- https://learn.deeplearning.ai/langchain/lesson/5/question-and-answer


In [5]:
!pip install openai
!pip install langchain
!pip install python-dotenv
!pip install docarray # For DocArrayInMemorySearch Vector Store

# This is required if we use the OpenAI embedding model (text-embedding-ada-002)
# https://openai.com/blog/new-and-improved-embedding-model
!pip install tiktoken

# This is required for working with HuggingFace models and tokenizers
!pip install sentence_transformers



In [1]:
from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [2]:
%cd gdrive/MyDrive/ColabEnv/env/

/content/gdrive/MyDrive/ColabEnv/env


In [3]:
import os
import openai
from dotenv import load_dotenv, find_dotenv

# Specify the path to the .env file in your Google Drive "ColabEnv" folder
#env_file_path = '/content/gdrive/My Drive/ColabEnv/.env'

# Load the environment variables from the .env file
#load_dotenv(env_file_path)

_ = load_dotenv(find_dotenv()) # Load a local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
# If you would like to use HuggingFace embeddings, make sure your .env file contains an API token for HuggingFace (HUGGINGFACEHUB_API_TOKEN)

## Embeddings

We want to combine **language models** with our documents, but there's a key issue: **language models** can only inspect a few thousand words at a time. So if we have really large documents, how can we get the language model to answer questions about everything in them?

This is where **embeddings** and vector stores come into play. First, let's talk about embeddings.

**Embeddings** create numerical representations for pieces of text.

This numerical representation captures the **semantic meaning** of the text it was computed from. Pieces of text with **similar content** will have **similar vectors**.
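To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" (assumption: these values are purely illustrative).
sun_shirt = [0.9, 0.1, 0.0]
uv_tee = [0.8, 0.2, 0.1]
hiking_boot = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(sun_shirt, uv_tee)
sim_unrelated = cosine_similarity(sun_shirt, hiking_boot)

# The two sun-protection items point in nearly the same direction.
print(sim_related > sim_unrelated)
```

A value close to 1 means the vectors point in nearly the same direction, and a value close to 0 means they are unrelated.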

By default, LangChain's `VectorstoreIndexCreator` uses **Chroma** as the vector store to index and search embeddings. Here we use the in-memory **DocArrayInMemorySearch** vector store instead.


### **Tokeniser**

**tiktoken** is a fast BPE tokeniser for use with OpenAI's models.

Byte pair encoding (BPE) is a way of converting text into tokens.
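The core idea of BPE can be sketched in a few lines: start from single characters and repeatedly merge the most frequent adjacent pair into one token. This is a toy illustration only. It is not how tiktoken is implemented (tiktoken works on bytes and applies a merge table learned in advance), and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from characters and apply `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


tokens = toy_bpe("aaabdaaabac", num_merges=3)
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Joining the tokens always gives back the original string, so the encoding loses no information.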

https://github.com/openai/tiktoken

Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

Splitting text strings into tokens is useful because GPT models see text in the form of tokens.

#### **Tokens Count**

How can I tell how many tokens a string has before I embed it?

https://platform.openai.com/docs/guides/embeddings/how-can-i-tell-how-many-tokens-a-string-has-before-i-embed-it

Knowing how many tokens are in a text string can tell you

- whether the string is too long for a text model to process and
- how much an OpenAI API call costs (as usage is priced by token)
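Both checks above reduce to counting tokens. The sketch below keeps the tokenizer pluggable so that it runs offline. The whitespace encoder is a crude stand-in and does not match how OpenAI models tokenize, and the price used is a made-up number.

```python
def count_tokens(text, encode):
    """Number of tokens that `encode` produces for `text`."""
    return len(encode(text))


def fits_in_context(text, encode, max_tokens):
    """True if `text` is within a model's token limit."""
    return count_tokens(text, encode) <= max_tokens


def estimate_cost(text, encode, price_per_1k_tokens):
    """Cost of sending `text`, given a price per 1,000 tokens."""
    return count_tokens(text, encode) / 1000 * price_per_1k_tokens


# Stand-in encoder (assumption): split on whitespace. Real BPE tokenizers
# split text differently, so real counts will not match this one.
whitespace_encode = str.split

text = "tiktoken is great!"
n = count_tokens(text, whitespace_encode)
print(n)  # 3
print(fits_in_context(text, whitespace_encode, max_tokens=8))
print(estimate_cost(text, whitespace_encode, price_per_1k_tokens=0.5))  # hypothetical price
```

With tiktoken installed, passing `tiktoken.get_encoding("cl100k_base").encode` as `encode` should give the real token count for that encoding.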

In [4]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import OpenAIEmbeddings

embeddingsOpenAI = OpenAIEmbeddings()
embeddingsHuggingFace = HuggingFaceEmbeddings()

In [5]:
# https://python.langchain.com/docs/modules/data_connection/text_embedding/#embed_query

# To see what these embeddings do, we can look at what happens when we embed a particular piece of text.
# Use the "embed_query" method on each embeddings object to create an embedding for that text.
embedOpenAI = embeddingsOpenAI.embed_query("Hi my name is Harrison")
embedHuggingFace = embeddingsHuggingFace.embed_query("Hi my name is Harrison")

# The OpenAI embedding has over a thousand elements (1536); the HuggingFace one has 768.
# Each element is a numerical value. Combined, they form the overall numerical representation of the text.
print(len(embedOpenAI))
print(len(embedHuggingFace))

print(embedOpenAI[:5])
print(embedHuggingFace[:5])

1536
768
[-0.02221718244254589, 0.0065193031914532185, -0.01825656369328499, -0.03920383378863335, -0.01435881294310093]
[0.03336688503623009, -0.004217776469886303, 0.005565300118178129, -0.007028088439255953, 0.010666918009519577]


## Load Document (CSV) and Create Vector DB (In Memory)

A **vector database** is a way to store the vector representations that we created in the previous step.

We create this vector database by populating it with **chunks** of text from incoming documents. When we get a big incoming document, we first break it up into smaller **chunks**.

This is useful because we may not be able to pass the whole document to the language model. With small chunks, we can pass only the most relevant ones to the language model. We then create an **embedding** for each of these chunks and store those embeddings in a vector database.

That's what happens when we create the index.
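As a rough sketch of the chunking step, the function below cuts a string into fixed-size character windows with a small overlap, so that text near a boundary appears in both neighbouring chunks. This is an illustration under simple assumptions (character-based sizes, a hypothetical function name), not LangChain's text splitter.

```python
def split_into_chunks(text, chunk_size, overlap=0):
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


document = "abcdefghijklmnopqrstuvwxyz"
chunks = split_into_chunks(document, chunk_size=10, overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk would then be embedded and stored on its own.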

We can use the **index** at runtime to find the pieces of text most relevant to an incoming query. When a query comes in, we first create an embedding for that query. We then compare it to all the vectors in the vector database and pick the n most similar.

These are then returned, and we can pass them in the prompt to the language model to get back a final answer. The cells below build this pipeline in only a few lines of code.
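The query-time lookup can be sketched with a tiny in-memory store. Everything here is a stand-in: `toy_embed` just counts words instead of calling an embedding model, and `ToyVectorStore` is a hypothetical class, not the DocArray or LangChain implementation.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self, texts):
        self.entries = [(text, toy_embed(text)) for text in texts]

    def similarity_search(self, query, n=4):
        query_vec = toy_embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query_vec, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:n]]


store = ToyVectorStore([
    "sun shield shirt with UPF 50+ sun protection",
    "waterproof hiking boots with rubber outsole",
    "tropical tee with sun protection",
    "wool socks for cold weather",
])

top = store.similarity_search("shirt with sun protection", n=2)
print(top)
```

A real vector store does the same ranking over model-generated embeddings, usually with an index that avoids comparing the query against every stored vector.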

In [6]:
# Import a document loader. This is going to be used to load some proprietary data that we're going to combine with the language model.
from langchain.document_loaders import CSVLoader

# Create a document loader for the CSV containing the product descriptions that we want to do question answering over.
# Loading from it yields one document per product (one per CSV row).

# Previously, we talked about creating chunks. Because these documents are already so small, we don't need to do any chunking here.
# CSV of outdoor clothing
file = "OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file)
docs = loader.load()

print(len(docs)) #CSV has 1000 entries
docs[0]

1000


Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})

In [8]:
from langchain.vectorstores import DocArrayInMemorySearch

"""
We want to create embeddings for all the pieces of text that we just loaded, and then we also want to store them in a vector store.
We can do that by using the "from_documents" method on the vector store.
This method takes in a list of documents and an embedding object, and creates the overall vector store.
"""

# https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/docarray_in_memory#using-docarrayinmemorysearch

dbDocArrayInMemorySearchVectorStore = DocArrayInMemorySearch.from_documents(
    docs,
    embeddingsHuggingFace
)

print(type(dbDocArrayInMemorySearchVectorStore))

<class 'langchain.vectorstores.docarray.in_memory.DocArrayInMemorySearch'>


## Method 1: Using QnA Chain (load_qa_chain)

`load_qa_chain` provides the most generic interface for answering questions.

It loads a chain that does QA over the documents you pass in. With the default `stuff` chain type, it puts ALL of the text from those documents into the prompt.
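Conceptually, "stuffing" just joins every document into one context block and places the question after it. The sketch below shows that idea. The prompt wording and the function name are illustrative assumptions, not LangChain's actual template.

```python
def build_stuff_prompt(documents, question, separator="\n\n"):
    """Join all document texts into one prompt, followed by the question."""
    context = separator.join(documents)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


docs_text = ["name: Sun Shield Shirt", "name: Sunrise Tee"]
prompt = build_stuff_prompt(
    docs_text,
    "Which shirts offer sun protection?",
    separator="<<<<<>>>>>",
)
print(prompt)
```

Because every document goes into a single prompt, this approach only works while the combined text fits in the model's context window. The `separator` argument plays the same role as the `document_separator` option passed to the RetrievalQA chain later in this notebook.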

In [9]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

llm = ChatOpenAI(temperature = 0.0)

# query = "Please suggest a shirt with sunblocking"
query =  "Please list all your shirts with sun protection"
# query =  "Please list all your shirts with sun protection in a table in markdown and summarize each one."

# Vector DB Search
searchResultDocs1 = dbDocArrayInMemorySearchVectorStore.similarity_search(query)
print(len(searchResultDocs1))
print(searchResultDocs1[0])

# The default chain_type="stuff" uses ALL of the text from the documents in the prompt.
chain = load_qa_chain(llm, chain_type="stuff")

chain.run(input_documents=searchResultDocs1, question=query)

4
page_content=": 679\nname: Women's Tropical Tee, Sleeveless\ndescription: Our five-star sleeveless button-up shirt has a fit to flatter and SunSmart™ protection to block the sun’s harmful UV rays. Size & Fit: Slightly Fitted: Softly shapes the body. Falls at hip. Fabric & Care: Shell: 71% nylon, 29% polyester. Cape lining: 100% polyester. Built-in SunSmart™ UPF 50+ rated – the highest rated sun protection possible. Machine wash and dry. Additional Features: Updated design with smoother buttons. Wrinkle resistant. Low-profile pockets and side shaping offer a more flattering fit. Front and back cape venting. Two front pockets, tool tabs and eyewear loop. Imported. Sun Protection That Won't Wear Off: Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays." metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 679}


"1. Women's Tropical Tee, Sleeveless\n2. Sun Shield Shirt\n3. Sunrise Tee\n4. Girls' Ocean Breeze Long-Sleeve Stripe Shirt"

## Method 2: Using RetrievalQA Chain

All of the above steps can be encapsulated by the **RetrievalQA** chain.

The **RetrievalQA** chain actually uses `load_qa_chain` under the hood. It retrieves the most relevant chunks of text and feeds them to the language model.
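The composition can be pictured as one function that first retrieves and then answers. The sketch below uses stand-ins for both halves: a keyword matcher instead of a vector-store retriever, and a function that lists product names instead of an LLM call. All names are hypothetical.

```python
def make_retrieval_qa(retrieve, answer_from_documents):
    """Combine a retriever and a QA step into a single callable."""
    def run(query):
        documents = retrieve(query)
        return answer_from_documents(documents, query)
    return run


CATALOG = [
    "Sun Shield Shirt: UPF 50+ sun protection",
    "Campside Oxfords: soft canvas shoe",
    "Sunrise Tee: UPF 50+ sun protection",
]


def keyword_retrieve(query):
    """Stand-in retriever: keep documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in CATALOG if words & set(doc.lower().split())]


def fake_answer(documents, question):
    """Stand-in for the LLM call: list the product names that were retrieved."""
    return "; ".join(doc.split(":")[0] for doc in documents)


qa = make_retrieval_qa(keyword_retrieve, fake_answer)
result = qa("sun protection")
print(result)  # Sun Shield Shirt; Sunrise Tee
```

In the notebook, the retriever comes from `as_retriever()` on the vector store and the answering step is the QA chain, but the retrieve-then-answer shape is the same.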

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Enable LangChain debug output

#import langchain
#langchain.debug = True

llm = ChatOpenAI(temperature = 0.0)

# Create a Retriever from that index (db)
retriever = dbDocArrayInMemorySearchVectorStore.as_retriever()

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # this is one of many methods to do Q&A
    retriever=retriever,
    verbose=True,
    chain_type_kwargs={"document_separator" : "<<<<<>>>>>"}
)

query =  "Please list all your shirts with sun protection"

"""
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."
"""

response = qa_stuff.run(query)
#display(Markdown(response))
print(response)



> Entering new RetrievalQA chain...

> Finished chain.
1. Women's Tropical Tee, Sleeveless - Provides SunSmart™ UPF 50+ rated sun protection.
2. Sun Shield Shirt - Offers UPF 50+ rated sun protection.
3. Sunrise Tee - Features built-in SunSmart™ UPF 50+ rated sun protection.
4. Girls' Ocean Breeze Long-Sleeve Stripe Shirt - Provides UPF 50+ rated sun protection.


## Method 3: Using VectorstoreIndexCreator

There are three main steps going on after the documents are loaded:

- Splitting documents into chunks
- Creating embeddings for each document
- Storing documents and embeddings in a vectorstore

https://python.langchain.com/docs/modules/data_connection/retrievers/#walkthrough

https://python.langchain.com/docs/modules/data_connection/retrievers/#one-line-index-creation




In [11]:
# Import the index creator, the vector store, and the CSV loader.
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.document_loaders import CSVLoader

# CSV of outdoor clothing
file = "OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file)

# Create a vectorstore index from loaders.
# The "from_loaders" method takes in a list of document loaders. We have only one loader here, so that's all we pass in.

# By default this uses the OpenAI embeddings, but you may get RateLimit errors if you don't have a paid OpenAI account.
#index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch).from_loaders([loader])

# Using open-source Hugging Face embeddings
index = VectorstoreIndexCreator(vectorstore_cls=DocArrayInMemorySearch, embedding=embeddingsHuggingFace).from_loaders([loader])

print(index)
print(type(index))

query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

response = index.query(query)
print(response)

from IPython.display import display, Markdown
display(Markdown(response))


vectorstore=<langchain.vectorstores.docarray.in_memory.DocArrayInMemorySearch object at 0x79900d588f70>
<class 'langchain.indexes.vectorstore.VectorStoreIndexWrapper'>



| Name | Description |
| --- | --- |
| Women's Tropical Tee, Sleeveless | Our five-star sleeveless button-up shirt has a fit to flatter and SunSmart™ protection to block the sun’s harmful UV rays. |
| Sun Shield Shirt | Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. |
| Men's Plaid Tropic Shirt, Short-Sleeve | Our Ultracomfortable sun protection is rated to UPF 50+, helping you stay cool and dry. |
| Tropical Breeze Shirt | Beat the heat in this lightweight, breathable long-sleeve men’s UPF shirt, offering superior SunSmart™ protection from the sun’s harmful rays. |

All of the shirts listed provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are all made with high-performance fabrics that are wrinkle-resistant and quickly evaporate perspiration. The Women's Tropical Tee and the Tropical Breeze Shirt are both made with a combination of nylon and polyester, while the Sun Shield Shirt is made with 78% nylon and 22% Lyc

## Method 4: Using call_as_llm

In [None]:
# Answer the question by calling the LLM directly with the retrieved text
from langchain.chat_models import ChatOpenAI

# Use the above vector store to find pieces of text similar to an incoming query.
query = "Please suggest a shirt with sunblocking"
searchResultDocs = dbDocArrayInMemorySearchVectorStore.similarity_search(query)
print(len(searchResultDocs))

llm = ChatOpenAI(temperature = 0.0)
# Concatenate the text of all retrieved documents into a single string
qdocs = "".join(doc.page_content for doc in searchResultDocs)

#print(type(qdocs))
#print(qdocs)

response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.")

4


KeyboardInterrupt: ignored