<a href="https://colab.research.google.com/github/arnabd64/Aadhar-Card-Entity-Extract/blob/main/Langchain_Day_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Langchain Document Retrieval Query

In this notebook we will build a chatbot capable of Retrieval Augmented Generation (RAG). RAG is a process in which we supply the Large Language Model with data it has not seen during its training and ask questions based on that unseen data.

There are a lot of components involved in building the chain and we will be covering only the important ones.

1. Document Loaders
2. Text Splitter
3. Embeddings & Document Embeddings
4. Vector Store

# Install Libraries

In [1]:
! pip install --progress-bar=off --no-cache-dir \
    langchain==0.2.10 \
    langchain-community==0.2.10 \
    langchain-chroma \
    langchain-text-splitters \
    chromadb \
    pypdf \
    python-dotenv \
> install.log

In [2]:
import os
import dotenv
assert dotenv.load_dotenv('./.env'), 'Unable to load ./.env'

# Load the Components

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.embeddings.ollama import OllamaEmbeddings
import chromadb

# 1. Document Loader

A document loader is a Langchain module that loads and processes documents. There are loaders for many formats, including PDF, plain text, Markdown, HTML web pages and more. A `Document`, according to Langchain, is a piece of text along with optional metadata.

Langchain Documentation: [Document Loaders](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/)

Here we are going to query the famous 2017 academic paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762). You can download the PDF from the link or run the following command in Google Colab:

```bash
wget -O Attention-is-all-you-need.pdf https://arxiv.org/pdf/1706.03762
```

In [49]:
pdf_file = PyPDFLoader('/content/Attention-is-all-you-need.pdf')

# 2. Text Splitter

Feeding an entire document to a Large Language Model raises two issues. First, computation time increases because of the large amount of text sent as input. Second, the input may be longer than the model's context window, which can lead the model to hallucinate.

The solution to this issue is to split the document into smaller chunks. Instead of feeding the entire document to the LLM, we feed only the chunks that contain the information needed to answer the user's question.

Langchain Documentation: [Text Splitter]
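
To make the chunking idea concrete, here is a minimal sketch of fixed-size splitting with overlap in plain Python. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraphs and sentences. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 256, chunk_overlap: int = 16) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters, so text cut at a
    boundary still appears, in part, in both neighbouring chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


# 600 characters of dummy text, split with the same settings used in this notebook
sample_text = "".join(str(i % 10) for i in range(600))
sample_chunks = split_with_overlap(sample_text, chunk_size=256, chunk_overlap=16)
print([len(c) for c in sample_chunks])  # -> [256, 256, 120]
```

The last 16 characters of each chunk are repeated at the start of the next one, which is what `chunk_overlap=16` asks for.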

In [4]:
DOCUMENT_PATH = '/content/Attention-is-all-you-need.pdf'

# load the document
pdf_file = PyPDFLoader(DOCUMENT_PATH)

# load the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=256, chunk_overlap=16)

# Split the Document into Chunks
documents = pdf_file.load_and_split(text_splitter)

print(f"Total Chunks: {len(documents)}")

Total Chunks: 193


In [5]:
DOCUMENT_STORE_NAME = 'my_documents'

# create the vector store and add the documents
# (from_documents is a classmethod, so the collection name and client are passed to it directly)
index = Chroma.from_documents(
    documents,
    OllamaEmbeddings(base_url=os.getenv('HOST'), model=os.getenv('EMBED')),
    collection_name = DOCUMENT_STORE_NAME,
    client = chromadb.PersistentClient(path=DOCUMENT_STORE_NAME)
)
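
What a vector store does at query time can be sketched without any external service. The toy `embed` function below is a hypothetical word-count stand-in for a real embedding model such as the one served by Ollama, and `search` returns the stored chunks whose vectors have the highest cosine similarity to the query. Chroma's actual indexing is more sophisticated; this only illustrates the idea.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, store: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k stored texts most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


toy_chunks = [
    "the transformer relies entirely on attention",
    "recurrent networks process tokens one at a time",
    "we trained on eight gpus for twelve hours",
]
# "Indexing": store each chunk next to its vector
toy_store = [(c, embed(c)) for c in toy_chunks]

print(search("what does the transformer rely on", toy_store, k=1))
# -> ['the transformer relies entirely on attention']
```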

In [6]:
search_settings = {'search_type': 'mmr', 'search_kwargs': {'k': 5, 'score_threshold': 0.3}}
index = index.as_retriever(**search_settings)
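
The `'mmr'` search type stands for maximal marginal relevance: it prefers results that are relevant to the query but not redundant with results already chosen. Below is a minimal sketch on hand-made 2-D vectors, assuming cosine similarity and a trade-off parameter `lambda_mult`. The helpers `unit` and `mmr_select` and the toy vectors are illustrative and are not Langchain's implementation.

```python
import math


def unit(angle_deg):
    """2-D unit vector pointing at the given angle (a toy 'embedding')."""
    r = math.radians(angle_deg)
    return [math.cos(r), math.sin(r)]


def cos_sim(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cos_sim(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cos_sim(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = unit(0)
doc_vecs = [
    unit(10),   # 0: most relevant
    unit(12),   # 1: near-duplicate of document 0
    unit(-40),  # 2: less relevant, but covers a different direction
]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # -> [0, 2]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # -> [0, 1]
```

With `lambda_mult=1.0` only relevance counts, so the near-duplicate is returned; with `0.5` the more diverse document wins the second slot.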

In [7]:
from langchain_core.prompts import PromptTemplate
from langchain_community.llms.ollama import Ollama
from langchain_core.runnables import RunnablePassthrough, RunnableSequence, RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from langchain.globals import set_debug

In [8]:
template = """
You are a helpful AI assistant who answers user's question in simple language
using the provided documents.

Documents: {context}

{input}
"""
prompt = PromptTemplate.from_template(template)
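
To see the kind of text the model eventually receives, the placeholders can be filled by hand with Python's built-in `str.format`. This is only an illustration with made-up document text and hypothetical variable names; in the chain, the `{context}` slot is filled with the retrieved chunks and `{input}` with the user's question.

```python
demo_template = """
You are a helpful AI assistant who answers user's question in simple language
using the provided documents.

Documents: {context}

{input}
"""

# Made-up stand-ins for chunks returned by the retriever
demo_chunks = [
    "The Transformer relies entirely on attention.",
    "It uses no recurrence.",
]

# Join the chunks with blank lines and substitute both placeholders
filled_prompt = demo_template.format(
    context="\n\n".join(demo_chunks),
    input="What is a Transformer?",
)
print(filled_prompt)
```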

In [9]:
llm = Ollama(
    base_url = os.getenv('HOST'),
    model = os.getenv('LLM'),
    temperature = 0.8,
    timeout = 600,
    keep_alive = 3600
)

In [10]:
question = "Explain the Transformer Model"

In [11]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

In [12]:
combine_docs = create_stuff_documents_chain(llm, prompt)

In [13]:
query_chain = {"input": RunnablePassthrough()} | create_retrieval_chain(index, combine_docs) | RunnableLambda(lambda x: x['answer'])

In [14]:
response = query_chain.invoke("What is a Transformer?")

In [15]:
response

" A Transformer is a type of model architecture proposed in a certain work. It doesn't use recurrence and instead relies entirely on an attention mechanism to create global dependencies between input and output. This architecture is used for tasks like translation. The Transformer can be parallelized more than other models, allowing it to achieve a high level of performance even after being trained for as little as twelve hours on eight P100 GPUs, reaching a new state-of-the-art in translation quality."

In [30]:
from langchain_core.runnables import RunnableWithMessageHistory
from langchain_community.chat_message_histories import FileChatMessageHistory

In [38]:
chat_chain = RunnableWithMessageHistory(
    query_chain,
    lambda x: FileChatMessageHistory('chat-history.json', encoding='utf-8')
)
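
The second argument to `RunnableWithMessageHistory` is a factory that receives a session ID and returns the message history for that session. A minimal standard-library sketch of that idea, with one JSON file per session, is shown here. The class `JsonFileHistory`, the function `get_history` and the file naming scheme are hypothetical and not part of Langchain.

```python
import json
import tempfile
from pathlib import Path


class JsonFileHistory:
    """Toy chat history stored as a JSON list of {role, content} messages."""

    def __init__(self, path):
        self.path = Path(path)

    def messages(self):
        if not self.path.exists():
            return []
        return json.loads(self.path.read_text(encoding="utf-8"))

    def add(self, role, content):
        msgs = self.messages()
        msgs.append({"role": role, "content": content})
        self.path.write_text(json.dumps(msgs, ensure_ascii=False), encoding="utf-8")


def get_history(session_id, folder):
    """Factory: each session ID maps to its own history file."""
    return JsonFileHistory(Path(folder) / f"chat-history-{session_id}.json")


with tempfile.TemporaryDirectory() as folder:
    history = get_history("328", folder)
    history.add("human", "What is a Transformer?")
    history.add("ai", "A model architecture based on attention.")

    same_session = get_history("328", folder).messages()   # sees both messages
    other_session = get_history("999", folder).messages()  # starts empty

print(len(same_session), len(other_session))  # -> 2 0
```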

In [43]:
question = "What are these six identical layers?"
set_debug(True)
response = chat_chain.invoke(question, config={'configurable': {'session_id': "328"}})

[chain/start] [chain:RunnableWithMessageHistory] Entering Chain run with input:
{
  "input": "What are these six identical layers?"
}
[chain/start] [chain:RunnableWithMessageHistory > chain:load_history] Entering Chain run with input:
{
  "input": "What are these six identical layers?"
}
[chain/end] [chain:RunnableWithMessageHistory > chain:load_history] [2ms] Exiting Chain run with output:
[outputs]
[chain/start] [chain:RunnableWithMessageHistory > chain:RunnableBranch] Entering Chain run with input:
[inputs]
[chain/start] [chain:RunnableWithMessageHistory > chain:RunnableBranch > chain:RunnableWithMessageHistoryInAsyncMode] Entering Chain run with input:
[inputs]
[chain/end] [chain:RunnableWithMessageHistory > chain:RunnableBranch > chain:RunnableWithMessageHistoryInAsyncMode] [1ms] Exiting Chain run with output:
{
  "output": false
}


In [44]:
print(response)

 The six identical layers in a Transformer model are each composed of two sub-layers. The first sub-layer is a multi-head self-attention mechanism, which helps the model understand the relationships between different parts of the input sequence. The second sub-layer is a simple feed-forward neural network, which processes the output from the attention mechanism. These layers help the Transformer to process and analyze various components of a given input data in natural language processing tasks.
