# Q&A with LangChain

This notebook demonstrates how to use LangChain to build a chatbot that references a custom knowledge base.

Suppose you have some text documents (PDFs, blog posts, Notion pages, etc.) and want to ask questions about their contents. LLMs, given their proficiency in understanding text, are a great tool for this.

### [LangChain](https://python.langchain.com/docs/get_started/introduction)
[**LangChain**](https://python.langchain.com/docs/get_started/introduction) provides a simple framework for connecting LLMs to your own data sources. Because LLMs are trained only up to a fixed point in time and do not contain an enterprise's proprietary knowledge, they cannot answer questions about new or proprietary information on their own. LangChain addresses this by making it straightforward to supply that information to the model at query time.

<div class="alert alert-block alert-info">
    
⚠️ The notebook after this one, `03_llama_index_simple.ipynb`, contains the same functionality as this notebook but uses LlamaIndex instead of LangChain. Ultimately, we recommend reading about LangChain vs. LlamaIndex and picking the framework, or the components of each, that make the most sense for you.

</div>

![data_connection](./imgs/data_connection_langchain.jpeg)

### Step 1: Integrate TensorRT-LLM to LangChain [*(Connector)*](https://docs.llamaindex.ai/en/stable/examples/llm/nvidia_tensorrt.html)

In [None]:
!pip install langchain_nvidia_trt
!pip install langchain langchain_community langchain_text_splitters langchain-chroma
!pip install tqdm transformers spacy datasets sentence_transformers
# pypdf is required by the PyPDFLoader used in Step 3
!pip install "unstructured[pdf]" pypdf
!pip install nltk
!pip install pymilvus

In [None]:
from langchain_nvidia_trt.llms import TritonTensorRTLLM

# Connect to the TRT-LLM Llama 2 model served by Triton at the URL below.
# Replace "llm" with the hostname of the system where Llama 2 is hosted.
triton_url = "llm:8001"
pload = {
    "tokens": 500,
    "server_url": triton_url,
    "model_name": "ensemble",
}
llm = TritonTensorRTLLM(**pload)
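
Before running the later cells, it can help to confirm that something is listening at the Triton address. The following is a minimal sketch that uses only the Python standard library. `split_host_port` and `is_port_open` are hypothetical helper names introduced here, not part of LangChain or Triton, and a successful TCP connection only shows that the port is open, not that the model is loaded.

```python
import socket
from typing import Tuple


def split_host_port(url: str) -> Tuple[str, int]:
    """Split a "host:port" string such as "llm:8001" into its parts."""
    host, sep, port = url.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {url!r}")
    return host, int(port)


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Parse the same address that is passed to the connector above
host, port = split_host_port("llm:8001")
print(host, port)
```

Calling `is_port_open(host, port)` gives a quick yes/no answer, which is often easier to interpret than an error raised in the middle of a chain.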

#### Note: Follow this step for Nemotron models
If you deployed a TRT-LLM optimized Nemotron model by following the steps [here](../RetrievalAugmentedGeneration/README.md#6-qa-chatbot----nemotron-model), uncomment the lines in the cell below and run it. It uses a custom wrapper to talk to the model server.

In [25]:
# from triton_trt_llm import TensorRTLLM
# llm = TensorRTLLM(server_url ="llm:8000", model_name="ensemble", tokens=500, streaming=False)

### Step 2: Create a Prompt Template [*(Model I/O)*](https://python.langchain.com/docs/modules/model_io/)

A [**prompt template**](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) is a common paradigm in LLM development.

A prompt template is a predefined set of instructions that is provided to the LLM and guides the output the model produces. It can contain few-shot examples and guidance, and is a quick way to engineer the LLM's responses. Llama 2 accepts the [prompt format](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) shown in `LLAMA_PROMPT_TEMPLATE`, which we construct from:
- The system prompt
- The context
- The user's question

LangChain allows you to [create custom wrappers for your LLM](https://python.langchain.com/docs/modules/model_io/models/llms/custom_llm) in case you want to use your own LLM or a different wrapper than the one that is supported in LangChain. Since we are using a custom Llama 2 model hosted on Triton with TRT-LLM, we have written a custom wrapper for our LLM.

In [26]:
from langchain.prompts import PromptTemplate

# Single-turn Llama 2 chat layout: one <s>[INST] opener, the system prompt
# wrapped in <<SYS>> ... <</SYS>>, then the user turn closed by [/INST].
LLAMA_PROMPT_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "Use the following context to answer the user's question. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n"
    "<</SYS>>\n\n"
    "Context: {context} Question: {question} Only return the helpful answer below and nothing else. Helpful answer: [/INST]"
)

LLAMA_PROMPT = PromptTemplate.from_template(LLAMA_PROMPT_TEMPLATE)
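
To see what the model actually receives, the sketch below fills the three parts (system prompt, context, question) with plain Python string formatting. It assumes the single-turn layout described in the linked Hugging Face post. `LLAMA2_SINGLE_TURN` and `build_prompt` are illustrative names, and the context string is a toy value rather than retrieved text.

```python
SYSTEM_PROMPT = (
    "Use the following context to answer the user's question. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)

# Assumed single-turn layout: system prompt inside <<SYS>> tags, then the user turn
LLAMA2_SINGLE_TURN = "<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def build_prompt(context: str, question: str) -> str:
    """Combine the system prompt, retrieved context, and user question."""
    user = (
        f"Context: {context} Question: {question} "
        "Only return the helpful answer below and nothing else. Helpful answer:"
    )
    return LLAMA2_SINGLE_TURN.format(system=SYSTEM_PROMPT, user=user)


prompt = build_prompt("<retrieved chunks go here>", "Who released Llama 2?")
print(prompt)
```

Printing the filled prompt is a cheap way to catch stray or duplicated control tokens before sending anything to the model.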

### Step 3: Load Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)
LangChain provides a variety of [document loaders](https://python.langchain.com/docs/integrations/document_loaders) that load various types of documents (HTML, PDF, code) from many different sources and locations (private S3 buckets, public websites).

Document loaders load data from a source as **Documents**. A **Document** is a piece of text (the page_content) and associated metadata. Document loaders provide a ``load`` method for loading data as documents from a configured source. 
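
The shape of a Document and of a loader's `load` method can be sketched with the standard library alone. `SimpleDocument` and `InMemoryPageLoader` are hypothetical stand-ins written for this illustration, not LangChain classes, and the `source` and `page` metadata keys are an assumption about what a per-page PDF loader might attach.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """A piece of text plus metadata describing where it came from."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class InMemoryPageLoader:
    """Toy loader: turns a list of page strings into one document per page."""

    def __init__(self, source: str, pages: List[str]) -> None:
        self.source = source
        self.pages = pages

    def load(self) -> List[SimpleDocument]:
        return [
            SimpleDocument(page_content=text, metadata={"source": self.source, "page": i})
            for i, text in enumerate(self.pages)
        ]


docs = InMemoryPageLoader("llama2_paper.pdf", ["Abstract ...", "1 Introduction ..."]).load()
print(docs[1].metadata)
```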

In this example, we use LangChain's `PyPDFLoader` to load the Llama 2 research paper from Meta. [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) is an alternative that handles PDFs as well as other file types.

[Here](https://python.langchain.com/docs/integrations/document_loaders) are some of the other document loaders available from LangChain.

In [27]:
# ! wget -O "llama2_paper.pdf" -nc --user-agent="Mozilla" https://arxiv.org/pdf/2307.09288.pdf
import requests

url = "https://arxiv.org/pdf/2307.09288.pdf"  # Llama 2 paper on arXiv
response = requests.get(url, timeout=60)  # Fail instead of hanging on a stalled connection

# Check if the download was successful
if response.status_code == 200:
    with open("llama2_paper.pdf", "wb") as f:  # Save the PDF next to the notebook
        f.write(response.content)
    print("File downloaded successfully!")
else:
    print(f"Error downloading file: {response.status_code}")


File downloaded successfully!


In [35]:
# PyPDFLoader needs the `pypdf` package and returns one Document per PDF page.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("llama2_paper.pdf")
data = loader.load()


### Step 4: Transform Documents [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)
Once documents have been loaded, they are often transformed. One method of transformation is known as **chunking**, which breaks down large pieces of text, for example, a long document, into smaller segments. This technique is valuable because it helps [optimize the relevance of the content returned from the vector database](https://www.pinecone.io/learn/chunking-strategies/). 

LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we use a [``SentenceTransformersTokenTextSplitter``](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html). The ``SentenceTransformersTokenTextSplitter`` is a specialized text splitter for use with the sentence-transformer models. The default behaviour is to split the text into chunks that fit the token window of the sentence transformer model that you would like to use. This sentence transformer model is used to generate the embeddings from documents. 

Text splitting has some subtleties: semantically related text should ideally stay in the same chunk, which is one reason consecutive chunks are made to overlap (by 200 tokens in the cell below).
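
The idea of fixed-size chunks with overlap can be sketched as a sliding window over a token list. This is a simplified illustration and not the library's implementation: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for the model's real tokens.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], tokens_per_chunk: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of tokens_per_chunk that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be >= 0 and smaller than tokens_per_chunk")
    stride = tokens_per_chunk - chunk_overlap  # how far the window advances each step
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + tokens_per_chunk]))
        if start + tokens_per_chunk >= len(tokens):
            break  # the window already reached the end of the text
        start += stride
    return chunks


words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, tokens_per_chunk=4, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With 4 tokens per chunk and an overlap of 2, each window starts 2 tokens after the previous one, so every boundary sentence appears in two chunks. A larger overlap keeps more shared context at the cost of more chunks to embed and store.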

In [39]:
import time
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

TEXT_SPLITTER_MODEL = "intfloat/e5-large-v2"
TEXT_SPLITTER_TOKENS_PER_CHUNK = 510
TEXT_SPLITTER_CHUNK_OVERLAP = 200

text_splitter = SentenceTransformersTokenTextSplitter(
    model_name=TEXT_SPLITTER_MODEL,
    tokens_per_chunk=TEXT_SPLITTER_TOKENS_PER_CHUNK,
    chunk_overlap=TEXT_SPLITTER_CHUNK_OVERLAP,
)
start_time = time.time()
documents = text_splitter.split_documents(data)
print(f"--- {time.time() - start_time} seconds ---")

--- 2.60556960105896 seconds ---


In [40]:
# Optional alternative (disabled): sentence-based chunking with NLTK.
# import time
# from nltk.tokenize import sent_tokenize  # needs the NLTK "punkt" tokenizer data
# from langchain_core.documents import Document

# TEXT_SPLITTER_TOKENS_PER_CHUNK = 510

# def split_by_sentences(text, max_sentences=5):
#   """Splits text into chunks with a maximum number of sentences."""
#   chunks = []
#   current_chunk = []
#   sentences = sent_tokenize(text)
#   for sentence in sentences:
#     current_chunk.append(sentence)
#     if len(current_chunk) >= max_sentences:
#       chunks.append(" ".join(current_chunk))
#       current_chunk = []
#   # Add the last chunk (if any)
#   if current_chunk:
#     chunks.append(" ".join(current_chunk))
#   return chunks

# start_time = time.time()
# full_text = " ".join(page.page_content for page in data)  # `data` is a list of Documents, not a string
# chunks = split_by_sentences(full_text, max_sentences=TEXT_SPLITTER_TOKENS_PER_CHUNK//3)  # Adjust as needed
# documents = [Document(page_content=chunk) for chunk in chunks]  # later cells expect Documents
# print(f"--- {time.time() - start_time} seconds ---")


Let's view a sample of content that is chunked together in the documents.

In [41]:
documents[40].page_content

'##ze. during this phase, we seek to optimize the following objective : arg max [UNK], [UNK] [ r ( g | p ) ] ( 3 ) we iteratively improve the policy by sampling prompts pfrom our dataset dand generations gfrom the policy πand use the ppo algorithm and loss function to achieve this objective. the final reward function we use during optimization, r ( g | p ) = [UNK] ( g | p ) −βdkl ( πθ ( g | p ) [UNK] ( g | p ) ) ( 4 ) contains a penalty term for diverging from the original policy π0. as was observed in other works ( stiennon et al., 2020 ; ouyang et al., 2022 ), we find this constraint is useful for training stability, and to reduce reward hackingwherebywewouldachievehighscoresfromtherewardmodelbutlowscoresfromhumanevaluation. we define rcto be a piecewise combination of the safety ( rs ) and helpfulness ( rh ) reward models. we havetaggedpromptsinourdatasetthatmightelicitpotentiallyunsaferesponsesandprioritizethescores from the safety model. the threshold of 0. 15is chosen for filteri

### Step 5: Generate Embeddings and Store Embeddings in the Vector Store [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)

#### a) Generate Embeddings
[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text. This allows you to quickly and efficiently find other pieces of text that are similar. The embedding model used below is [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2).

LangChain provides a wide variety of [embedding models](https://python.langchain.com/docs/integrations/text_embedding) from many providers and makes it simple to swap out the models. 

When a user sends a query, it is embedded with the same model that was used to embed the documents. Because the query and document vectors then live in the same space, the documents most similar (relevant) to the query can be found by comparing vectors.
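
Comparing a query vector with document vectors is commonly done with cosine similarity. The sketch below uses tiny hand-written 3-dimensional vectors as stand-ins for real embeddings, which have far more dimensions. The vectors and labels are made up for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy "embeddings": the query points in nearly the same direction as the first chunk
query_vector = [1.0, 0.0, 1.0]
chunk_vectors = {
    "safety evaluation": [0.9, 0.1, 0.8],
    "tokenizer details": [0.0, 1.0, 0.1],
}

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```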

#### b) Store Document Embeddings in the Vector Store
Once the document embeddings are generated, they are stored in a vector store so that at query time we can:
1) Embed the user query and
2) Retrieve the stored embedding vectors that are most similar to the query embedding.

A vector store takes care of storing the embedded data and performing a vector search.
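
Those two responsibilities, storing vectors and searching them, fit in a few lines of plain Python. `TinyVectorStore` is a hypothetical in-memory toy, not how Milvus works internally: it normalizes vectors on insertion and ranks by dot product with a brute-force scan, whereas a real vector database uses indexes to avoid comparing against every stored vector.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class TinyVectorStore:
    """Brute-force in-memory store: keeps (unit vector, text) pairs."""

    def __init__(self) -> None:
        self._rows: List[Tuple[List[float], str]] = []

    def add(self, vector: Sequence[float], text: str) -> None:
        self._rows.append((normalize(vector), text))

    def similarity_search(self, query_vector: Sequence[float], k: int = 4) -> List[str]:
        query = normalize(query_vector)
        scored = ((sum(q * v for q, v in zip(query, vec)), text) for vec, text in self._rows)
        # nlargest returns the k best-scoring rows, highest score first
        return [text for _, text in heapq.nlargest(k, scored, key=lambda pair: pair[0])]


store = TinyVectorStore()
store.add([1.0, 0.0], "safety")
store.add([0.0, 1.0], "pretraining")
store.add([0.8, 0.6], "red teaming")

results = store.similarity_search([1.0, 0.1], k=2)
print(results)
```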

LangChain provides support for a [great selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/). 

<div class="alert alert-block alert-info">
    
⚠️ For this workflow, [Milvus](https://milvus.io/) vector database is running as a microservice. 

</div>

In [46]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Milvus
import time


# Running the model on CPU as we want to conserve GPU memory.
# In the production deployment (API server shown as part of the 5th notebook) we run the model on GPU
model_name = "intfloat/e5-large-v2"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": False}  # Keep the raw (unnormalized) embedding vectors

hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

start_time = time.time()

try:
    vectorstore = Milvus.from_documents(
        documents=documents,
        embedding=hf_embeddings,
        connection_args={"host": "milvus", "port": 19530},
    )
except Exception as e:
    # Later cells need `vectorstore`, so report the failure and stop here.
    print(f"Error creating the Milvus vector store: {e}")
    raise

print(f"--- {time.time() - start_time} seconds ---")



In [None]:
# Simple Example: Retrieve Documents from the Vector Database
# note: this is just for demonstration purposes of a similarity search
question = "Can you talk about safety evaluation of llama2 chat?"
docs = vectorstore.similarity_search(question)
print(docs[2].page_content)

> ### Simple Example: Retrieve Documents from the Vector Database [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)
> Given a user query, the splits most relevant to the question are returned through a **similarity search**, also known as a semantic search, which matches on meaning. A lexical search, by contrast, looks for literal matches of the query words or their variants without understanding the overall meaning of the query. A semantic search tends to return more relevant results than a lexical search.
![vector_stores.jpeg](./imgs/vector_stores.jpeg)
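
The weakness of lexical matching is easy to reproduce. The sketch below scores texts by word overlap (Jaccard similarity), which is one simple form of lexical search chosen here for illustration. `lexical_score` and the example strings are made up.

```python
def lexical_score(query: str, text: str) -> float:
    """Jaccard overlap between the word sets of the query and the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words or not text_words:
        return 0.0
    return len(query_words & text_words) / len(query_words | text_words)


query = "car repair"
candidates = ["automobile maintenance guide", "car rental prices"]

scores = {text: lexical_score(query, text) for text in candidates}
print(scores)
```

The maintenance guide shares no words with the query and scores zero, while the rental page scores higher because it repeats the word "car". A semantic search over embeddings is meant to rank the maintenance guide first, since it is closer in meaning.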

### Step 6: Compose a streamed answer using a Chain
We have already integrated the Llama 2 TRT-LLM model through the LangChain connector, loaded and transformed documents, and generated and stored document embeddings in a vector database. To finish the pipeline, we combine these components with a [chain](https://python.langchain.com/docs/modules/chains/).

A [LangChain chain](https://python.langchain.com/docs/modules/chains/) combines components. In this case, we use the [LangChain Expression Language](https://python.langchain.com/docs/expression_language/why) (LCEL) to build the chain.

We fill the prompt placeholders (`context` and `question`), pipe the prompt to our TRT-LLM connector as shown below, and finally stream the result.

In [None]:
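from langchain_core.runnables import RunnablePassthrough
import time


def format_docs(docs):
    # Join the retrieved chunks into one plain-text context string
    return "\n\n".join(doc.page_content for doc in docs)


chain = (
    {"context": vectorstore.as_retriever() | format_docs, "question": RunnablePassthrough()}
    | LLAMA_PROMPT
    | llm
)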
start_time = time.time()
for token in chain.stream(question):
    print(token, end="", flush=True)
print(f"\n--- {time.time() - start_time} seconds ---")
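
The `|` syntax can look like magic, so here is a toy model of how piping composes steps. This is a sketch of the idea using Python operator overloading, not LangChain's actual `Runnable` implementation. `Step` and the three example steps are hypothetical, and the "LLM" is a stand-in that only upper-cases its input.

```python
from typing import Any, Callable


class Step:
    """Wraps a function so that `a | b` means: run a, then feed its output to b."""

    def __init__(self, func: Callable[[Any], Any]) -> None:
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Stand-ins for the retriever, the prompt template, and the LLM
retrieve = Step(lambda q: {"context": "<retrieved text>", "question": q})
fill_prompt = Step(lambda d: f"Context: {d['context']} Question: {d['question']}")
fake_llm = Step(str.upper)

toy_chain = retrieve | fill_prompt | fake_llm
answer = toy_chain.invoke("what is rag?")
print(answer)
```

In LangChain itself, the dictionary at the head of a chain is coerced into a step that runs its values in parallel on the same input, which is how `context` and `question` are both derived from the single question string.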