<a href="https://www.kaggle.com/code/yannicksteph/nlp-llm-langchain-rag-qa-data?scriptVersionId=157432537" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# | NLP | LLM | LangChain RAG | QA Data |

## Natural Language Processing (NLP) and Large Language Models (LLM) with LangChain and Retrieval Augmented Generation (RAG) for Question Answering on Your Own Data

![Learning](https://t3.ftcdn.net/jpg/06/14/01/52/360_F_614015247_EWZHvC6AAOsaIOepakhyJvMqUu5tpLfY.jpg)


# <b>1 <span style='color:#78D118'>|</span> Overview</b>

In this notebook we're going to augment the knowledge base of our LLM with additional data:
- We will walk through how to load data with a `DocumentLoader`, split it into chunks, and store it in a vector database using `ChromaDB`.
- We will then use that vector database for question answering on our own data.

In the retrieval augmented generation (RAG) framework, an LLM retrieves contextual documents from an external dataset as part of its execution. This is useful when we want to ask questions about specific documents (e.g., PDFs, videos, etc.).

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_1.png?raw=true" alt="Learning" width="100%">


## Learning Objectives

By the end of this notebook, you will be able to:
1. Add external local data to your LLM's knowledge base via a vector database.
2. Construct a Question-Answer (QA) LLMChain to "talk to your data."
3. Load external data sources from remote locations and store them in a vector database.
4. Leverage different retrieval methods to search over your data.

<img src="https://deepsense.ai/wp-content/uploads/2023/10/LangChain-announces-partnership-with-deepsense.jpeg" alt="Learning" width="50%">


[Using-langchain-for-question-answering-on-own-data](https://medium.com/@onkarmishra/using-langchain-for-question-answering-on-own-data-3af0a82789ed)

### Setup


In [1]:
import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = "Fill"

In [2]:
%%capture

!pip install chromadb==0.4.10 tiktoken==0.3.3 sqlalchemy==2.0.15
!pip install langchain==0.0.249
!pip install --force-reinstall pydantic==1.10.6 
!pip install sentence_transformers

In [3]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain, ConversationalRetrievalChain, ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain.schema import messages_from_dict, messages_to_dict
from langchain.memory.chat_message_histories.in_memory import ChatMessageHistory
from langchain.agents import Tool
from langchain.agents import initialize_agent
from langchain.agents import AgentType

In [4]:
cache_dir = "./cache"

In [5]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

# <b>2 <span style='color:#78D118'>|</span> Question-Answer (QA) LLMChain to "talk to your data."</b>

We're going to build a special type of LLMChain that will enable us to ask questions of our data. We will be able to "speak to our data".

### Step 1 - Document Loading

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_2.png?raw=true" alt="Learning" width="30%" style="opacity: 0.7;">

Document loading is the first step in the retrieval augmented generation (RAG) framework, where a large language model (LLM) retrieves contextual documents from external datasets during execution. RAG is particularly valuable when we want to ask questions about specific documents, such as PDFs or videos. Before an application can chat over this data, the data must first be loaded into a format it can process.

LangChain provides document loaders to handle the intricacies of accessing and converting data from various formats and sources into a standardized format. This includes structured and unstructured data sources such as websites, databases, YouTube, arXiv, Twitter, and Hacker News, as well as proprietary sources like Figma, Notion, Airbyte, Stripe, and Airtable. These loaders accommodate diverse data types like PDF, HTML, JSON, Word, PowerPoint, or tabular formats, transforming them into standard document objects with content and metadata.

LangChain offers over 80 different document loaders, including `PyPDFLoader` for loading PDFs, `YoutubeLoader` for handling YouTube videos, `WebBaseLoader` for loading web pages from URLs, and `NotionDirectoryLoader` for retrieving data from Notion. Each loader produces a list of documents, where each document contains page content and associated metadata. Large documents then need to be split into smaller chunks, because retrieval augmented generation depends on retrieving only the passages relevant to a question.
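
As a rough illustration of what a loader returns, the sketch below defines a hypothetical `Document` class and a toy `load_text_file` function. Neither is LangChain's implementation; the names are assumptions chosen to mirror the page content and metadata fields described above.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Document:
    """Hypothetical stand-in for a loader's output: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[Document]:
    """Toy loader: read one UTF-8 text file and return it as a one-item list."""
    text = Path(path).read_text(encoding="utf-8")
    return [Document(page_content=text, metadata={"source": path})]


# Usage: write a small file, then load it back as a list of documents.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("To be, or not to be.", encoding="utf-8")
    docs = load_text_file(str(sample))

print(len(docs), docs[0].page_content, docs[0].metadata)
```

Returning a list even for a single file keeps the interface uniform with loaders that yield one document per page or per record.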

To show how well we can scale the vector database, let's load a large document. For this we'll get data from Project Gutenberg, which hosts thousands of free-to-access texts.

We'll download the complete works of William Shakespeare using the `GutenbergLoader`, which works with Project Gutenberg: https://www.gutenberg.org





In [6]:
from langchain.document_loaders import GutenbergLoader

loader = GutenbergLoader(
    "https://www.gutenberg.org/cache/epub/100/pg100.txt"
)

# load() returns a list of Document objects
document = loader.load()

# Preview the first 100 words of the first document
extrait = ' '.join(document[0].page_content.split()[:100])
display(extrait + " .......")


'The Project Gutenberg eBook of The Complete Works of William Shakespeare This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook. Title: The Complete Works of William Shakespeare .......'

### Step 2 - Document Splitting


<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_3.png?raw=true" alt="Learning" width="50%">


Document Splitting in LangChain is a crucial process that involves breaking down documents into smaller, semantically relevant chunks. This is essential for maintaining meaningful relationships between the chunks, ensuring accurate information retrieval. To achieve this, LangChain employs RecursiveCharacterTextSplitter and CharacterTextSplitter, each with specified chunk sizes and overlaps. The splitters use different methods, such as character or token-based splitting, and consider metadata, language specifics, and document types.
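
To make chunk size and overlap concrete, here is a minimal sketch of a sliding-window splitter. The function `split_with_overlap` is a hypothetical helper, not LangChain's `CharacterTextSplitter`, which first splits on a separator and then merges pieces; this version simply cuts by character position.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.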

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_4.png?raw=true" alt="Learning" width="50%">

LangChain offers several text splitters, each with methods for creating and splitting documents. They share the same underlying logic but differ in how they choose chunk boundaries, how they measure chunk length, and how they handle metadata.

`CharacterTextSplitter` splits on a single separator (a blank line, `"\n\n"`, by default), while `RecursiveCharacterTextSplitter` tries a list of separators in order (paragraph breaks, newlines, spaces, then individual characters) until the chunks are small enough. Custom delimiters can be supplied to either splitter.
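
The idea of falling back through a list of separators can be sketched as follows. This is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical function, it ignores overlap, and it is not the code LangChain ships.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: fall back to a hard cut by character position.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


parts = recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7)
print(parts)  # ['aaa bbb', 'ccc', 'ddd eee']
```

The first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and is kept intact.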

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_5.png?raw=true" alt="Learning" width="50%">

Additionally, LangChain provides TokenTextSplitter for token-based splitting, considering the token count, which is beneficial for language models with token-specific context windows. The context-aware splitting also extends to MarkdownHeaderTextSplitter, preserving header metadata during chunking.
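
A token-based splitter measures length in tokens rather than characters. The sketch below assumes whitespace-separated words as a stand-in for real tokens; an actual tokenizer (such as the one behind `TokenTextSplitter`) would produce different boundaries. `split_by_token_count` is a hypothetical name.

```python
def split_by_token_count(text: str, max_tokens: int) -> list[str]:
    """Group whitespace-separated words into chunks of at most max_tokens words."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()  # assumption: one word == one token
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


pieces = split_by_token_count("Speak but one rhyme and I am satisfied", max_tokens=3)
print(pieces)  # ['Speak but one', 'rhyme and I', 'am satisfied']
```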

Ultimately, LangChain's Document Splitting prepares data for storage in a vector store, enhancing the efficiency of information retrieval.



In [7]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Chunk size of 1024 characters with an overlap of 256
# (building the vector database index from these chunks takes approx. 10 minutes with this embedding model)
text_splitter = CharacterTextSplitter(
    chunk_size=1024, 
    chunk_overlap=256
) 
texts = text_splitter.split_documents(document)


### Step 3 - Vector Stores and Embeddings

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_6.png?raw=true" alt="Learning" width="60%">

Vector Store and Embeddings in LangChain play a crucial role in the retrieval augmented generation (RAG) framework. After splitting documents into smaller chunks, LangChain employs embeddings to create numerical representations of text, enabling the comparison of semantically similar content. These embeddings are stored in a vector store, a database facilitating easy retrieval of similar vectors, a process essential for finding relevant documents when answering questions.

We divided our document into smaller sections and created embeddings for efficient retrieval when answering questions. Using vector stores, we store these embeddings, making it easy to locate similar content. The process involves document splitting, embedding creation, and storing in a vector store.

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_7.png?raw=true" alt="Learning" width="50%">

The vector store acts as a database for quick lookup of similar vectors, aiding in finding relevant documents for questions. For question answering, we generate embeddings for the question, compare them with stored vectors, select the most similar ones, and pass them along with the question to a language model to obtain the answer.
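
The lookup described above can be sketched without any external library. The example assumes a toy bag-of-words embedding in place of a neural embedding model, and `embed`, `cosine`, and `similarity_search` are hypothetical helpers rather than Chroma's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "romeo climbs the orchard wall",
    "the merchant of venice owes a debt",
    "juliet calls for romeo at night",
]
top = similarity_search("where is romeo", chunks, k=2)
print(top)
```

A vector store does the same ranking, but it computes the chunk embeddings once, stores them, and uses an index so that it does not have to compare the query against every chunk.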

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_8.png?raw=true" alt="Learning" width="50%">


We use Chroma as the vector store and persist it for future use. With a similarity search, we ask a question and retrieve the most relevant chunks. There are edge cases to keep in mind: duplicate chunks can produce duplicate results, and a lack of structured information can affect search accuracy.



Now we'll create embeddings for our document so we can store it in a vector store and feed the data into an LLM. We'll use a [sentence-transformers model](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models/) for our embeddings.


For this system we'll leverage the [ChromaDB vector database](https://www.trychroma.com/) and load in the chunks we just created.

In [8]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"

embeddings = HuggingFaceEmbeddings(
    model_name=model_name, 
    cache_folder=cache_dir
)  # Use a pre-cached model

vectordb = Chroma.from_documents(
    texts, 
    embeddings, 
    persist_directory=cache_dir
)




#### Similarity Search
We will now ask questions using the similarity search method and pass k, which specifies the number of documents that we want to return.

In [9]:
question = "Romeo!"

docs = vectordb.similarity_search(question,k=2)

# Check the number of documents returned
print(len(docs))

# Check the content of the first document
print(docs[0].page_content)

# Persist the database to use it later
vectordb.persist()


2
Romeo! My cousin Romeo! Romeo!





MERCUTIO.


He is wise,


And on my life hath stol’n him home to bed.





BENVOLIO.


He ran this way, and leap’d this orchard wall:


Call, good Mercutio.





MERCUTIO.


Nay, I’ll conjure too.


Romeo! Humours! Madman! Passion! Lover!


Appear thou in the likeness of a sigh,


Speak but one rhyme, and I am satisfied;


Cry but ‘Ah me!’ Pronounce but Love and dove;


Speak to my gossip Venus one fair word,


One nickname for her purblind son and heir,


Young Abraham Cupid, he that shot so trim


When King Cophetua lov’d the beggar-maid.


He heareth not, he stirreth not, he moveth not;


The ape is dead, and I must conjure him.


I conjure thee by Rosaline’s bright eyes,


By her high forehead and her scarlet lip,


By her fine foot, straight leg, and quivering thigh,


And the demesnes that there adjacent lie,


That in thy likeness thou appear to us.





BENVOLIO.


An if he hear thee, thou wilt anger him.





MERCUTIO.


### Step 4 - Retrieval in LangChain

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_9.png?raw=true" alt="Learning" width="70%">

Retrieval is crucial in our retrieval augmented generation (RAG) flow, addressing challenges in question-answering over documents. LangChain introduces advanced mechanisms like Self-query and Contextual Compression for improved retrieval during query time.

Exploring advanced retrieval methods:
1. **Maximum Marginal Relevance (MMR):**
   - Ensures diversity in search results.
   - Selects a diverse set of documents, overcoming limitations of semantic search.
   
<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_10.png?raw=true" alt="Learning" width="50%">

2. **Metadata Inclusion:**
   - Addresses specificity in search results.
   - Uses metadata filters to refine search queries, enhancing result accuracy.

3. **Self Query:**
   - Leverages language models to extract query string and metadata filters.
   - Eliminates manual filter specification, enhancing efficiency.

4. **Contextual Compression:**
   - Improves the quality of retrieved documents.
   - Extracts relevant segments, reducing the need for expensive language model calls.
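
Of the methods listed above, Maximum Marginal Relevance is the easiest to illustrate in a few lines. The sketch below is a plain-Python version under stated assumptions: `mmr` and `cosine` are hypothetical helpers, the vectors are made up, and `lambda_mult` is an assumed name for the weight between relevance (1.0) and diversity (0.0).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: list[float], doc_vecs: list[list[float]],
        k: int, lambda_mult: float = 0.5) -> list[int]:
    """Return indices of up to k documents, balancing relevance and diversity."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
doc_vecs = [[1.0, 0.9], [1.0, 0.9], [0.2, 1.0]]  # documents 0 and 1 are duplicates
diverse = mmr(query, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)  # [0, 2] [0, 1]
```

With `lambda_mult=1.0` the search returns both duplicates, which is the limitation of plain semantic search; lowering it makes the second pick the less similar but non-redundant document.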
   
<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_11.png?raw=true" alt="Learning" width="50%">

### Step 5 - Question Answering

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_12.png?raw=true" alt="Learning" width="70%">

**Question Answering with Retrieval:**
This step uses the documents retrieved in the Retrieval phase: a language model answers the question based on both the retrieved documents and the original question.

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_13.png?raw=true" alt="Learning" width="50%">

**RetrievalQA Chain:**
After retrieving the relevant splits from the vector store, we optionally compress them and send them, along with a system prompt and the human question, to the language model. By default, all retrieved documents are passed into the same context window; when there are too many documents to fit, methods like MapReduce, Refine, and MapRerank are available. In practice this means loading the vector database, performing a similarity search, and initializing the language model.

**RetrievalQA Chain with Prompt:**
A prompt template guides how the context is used to answer the question. The RetrievalQA chain is initialized with a language model, the vector database, and the prompt template. The chain then answers questions while taking the source documents into account.
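
A prompt template is essentially a string with placeholders for the retrieved context and the question. The sketch below uses plain `str.format`; the wording of `QA_TEMPLATE` and the helper `build_qa_prompt` are assumptions for illustration, not the template LangChain uses internally.

```python
QA_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_qa_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and fill them into the template with the question."""
    context = "\n\n".join(chunks)
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_qa_prompt(
    ["Romeo! My cousin Romeo!", "He ran this way, and leap'd this orchard wall."],
    "Who is calling Romeo?",
)
print(prompt)
```

Telling the model to admit when it does not know is a common way to discourage answers that are not grounded in the retrieved context.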

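**RetrievalQA Chain with MapReduce, Refine, and MapRerank:**
These techniques handle multiple documents in different ways. MapReduce calls the language model on each document individually and then combines the results, Refine processes the documents sequentially and updates the answer with each new document, and MapRerank scores each per-document answer and returns the highest-scoring one. Be aware of the trade-offs in speed and result quality.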

**RetrievalQA Limitations:**
The RetrievalQA chain does not preserve conversational history. Without a concept of memory, follow-up questions lack context, which is why chatbot applications need memory added on top.
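
To see why history matters, consider a follow-up such as "Does he die?", which is meaningless on its own. A minimal sketch of one possible remedy, assuming a hypothetical `prompt_with_history` helper that simply prepends earlier turns to the new question:

```python
def prompt_with_history(history: list[tuple[str, str]], question: str) -> str:
    """Prepend earlier (question, answer) turns so a follow-up keeps its context."""
    turns = [f"Human: {q}\nAI: {a}" for q, a in history]
    turns.append(f"Human: {question}\nAI:")
    return "\n".join(turns)


history = [("Who is the main character in the Merchant of Venice?", "ANTONIO")]
prompt = prompt_with_history(history, "Does he die?")
print(prompt)
```

With the earlier turn included, the model can resolve "he" to the character named in the previous answer.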

Now that we're working with larger documents, we should be mindful of the input sequence limitations that our LLM has. 

#### Chain types for combining documents:

- [`stuff`](https://docs.langchain.com/docs/components/chains/index_related_chains#stuffing) - Stuffing is the simplest method, whereby you simply stuff all the related data into the prompt as context to pass to the language model.
- [`map_reduce`](https://docs.langchain.com/docs/components/chains/index_related_chains#map-reduce) - This method involves running an initial prompt on each chunk of data (for summarization tasks, this could be a summary of that chunk; for question-answering tasks, it could be an answer based solely on that chunk). A second prompt is then run to combine all of these initial outputs into a single result.
- [`refine`](https://docs.langchain.com/docs/components/chains/index_related_chains#refine) - This method involves running an initial prompt on the first chunk of data, generating some output. For the remaining documents, that output is passed in, along with the next document, asking the LLM to refine the output based on the new document.
- [`map_rerank`](https://docs.langchain.com/docs/components/chains/index_related_chains#map-rerank) - This method involves running an initial prompt on each chunk of data, that not only tries to complete a task but also gives a score for how certain it is in its answer. The responses are then ranked according to this score, and the highest score is returned.
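
The four strategies above differ mainly in how many times the LLM is called and with what input. The sketch below makes that visible with a fake LLM that only counts calls. All function names, prompts, and the scoring rule are assumptions for illustration, not LangChain's implementation.

```python
from typing import Callable


def stuff(llm: Callable[[str], str], chunks: list[str], question: str) -> str:
    """One call with every chunk placed in the same prompt."""
    context = "\n".join(chunks)
    return llm(f"Context:\n{context}\nQuestion: {question}")


def map_reduce(llm: Callable[[str], str], chunks: list[str], question: str) -> str:
    """One call per chunk, then one extra call to combine the partial answers."""
    partial = [llm(f"Context:\n{c}\nQuestion: {question}") for c in chunks]
    return llm("Combine these answers:\n" + "\n".join(partial) + f"\nQuestion: {question}")


def refine(llm: Callable[[str], str], chunks: list[str], question: str) -> str:
    """Answer from the first chunk, then refine sequentially with each later chunk."""
    answer = llm(f"Context:\n{chunks[0]}\nQuestion: {question}")
    for c in chunks[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context:\n{c}\nQuestion: {question}")
    return answer


def map_rerank(scored_llm: Callable[[str], tuple[str, float]],
               chunks: list[str], question: str) -> str:
    """One call per chunk returning (answer, score); keep the highest-scoring answer."""
    scored = [scored_llm(f"Context:\n{c}\nQuestion: {question}") for c in chunks]
    return max(scored, key=lambda pair: pair[1])[0]


def make_fake_llm():
    """Fake LLM that records every prompt it receives."""
    calls: list[str] = []

    def fake_llm(prompt: str) -> str:
        calls.append(prompt)
        return f"answer {len(calls)}"

    return fake_llm, calls


def fake_scored_llm(prompt: str) -> tuple[str, float]:
    """Fake scorer: echoes the chunk and pretends confidence grows with 'Romeo' mentions."""
    return prompt.splitlines()[1], float(prompt.count("Romeo"))


chunks = ["chunk A", "chunk B", "chunk C"]
call_counts = {}
for name, strategy in [("stuff", stuff), ("map_reduce", map_reduce), ("refine", refine)]:
    llm, calls = make_fake_llm()
    strategy(llm, chunks, "Who is the main character?")
    call_counts[name] = len(calls)
print(call_counts)  # {'stuff': 1, 'map_reduce': 4, 'refine': 3}

plays = ["Antonio signs the bond", "Romeo loves Juliet", "Romeo! My cousin Romeo!"]
best = map_rerank(fake_scored_llm, plays, "Who is calling?")
print(best)  # Romeo! My cousin Romeo!
```

`stuff` is the cheapest but requires everything to fit in one context window; the other strategies trade extra calls for the ability to handle more chunks.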

<img src="https://github.com/YanSte/NLP-LLM-LangChain-Question-Answer-Data/blob/main/img_14.png?raw=true" alt="Learning" width="50%">

#### Creating our Document QA LLM Chain
With our data now in vector form we need an LLM and a chain to take our queries and create tasks for our LLM to perform.

In [17]:
from langchain.llms import HuggingFacePipeline

# We want to make this a retriever, so we need to convert our index.  
# This will create a wrapper around the functionality of our vector database 
# so we can search for similar documents/chunks in the vectorstore and retrieve the results:
retriever = vectordb.as_retriever()

# This chain will be used to do QA on the document. We will need
# 1 - A LLM to do the language interpretation
# 2 - A vector database that can perform document retrieval
# 3 - Specification on how to deal with this data

hf_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large",
    task="text2text-generation",
    model_kwargs={
        # "temperature": 0,
        "do_sample": True,
        "max_length": 2048,
        "cache_dir": cache_dir,
    },
)

In [18]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=hf_llm, 
    chain_type="refine", 
    retriever=retriever
)
query = "Who is the main character in the Merchant of Venice?"
query_results_venice = qa.run(query)
print("#" * 12)
query_results_venice


############


'ANTONIO'

In [19]:
qa = RetrievalQA.from_chain_type(
    llm=hf_llm, 
    chain_type="refine", 
    retriever=retriever
)
query = "What happens to Romeo and Juliet?"
query_results_romeo = qa.run(query)
print("#" * 12)
query_results_romeo


############


'Management confined Romeo and Juliet at the Tower of London. Romeo is killed by Benito Mussolini. Her body was infused with poison which eventually caused her death'

In [16]:
qa = RetrievalQA.from_chain_type(
    llm=hf_llm, 
    chain_type="refine", 
    retriever=retriever
)
query = "Does King John die?"
query_results_john = qa.run(query)
print("#" * 12)
query_results_john


############


'King John died 1166'