#**Using LLaMA 2.0/3.0 and LangChain for Question-Answering on Your Own Data**
You can build a chatbot-style Question-Answering (QA) system over a PDF document of your choice using Meta's [Llama-2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) or [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model together with the LangChain framework and the FAISS library.

##Getting Started
You can use the open source **Llama-2-7b-chat** or **Llama-3-8B** model with both Hugging Face transformers and LangChain. However, you first have to request access to the Llama 2 or Llama 3 models via the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and agree to share your account details with Meta on the Hugging Face model page for [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) or [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B). Access is typically granted within a few minutes to a few hours.

🚨 Note that your Hugging Face account email **MUST** match the email you provided on the Meta website, or your request will not be approved.

If you’re using Google Colab to run the code, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4 in your notebook. You will need ~8GB of GPU RAM for inference; running on CPU is impractically slow.
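As a rough sanity check on the memory figure above, the sketch below estimates the size of the model weights alone at different precisions. The parameter counts (7e9 and 8e9) are nominal assumptions, and the estimate ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
# Weights-only memory estimate; a back-of-the-envelope sketch, not a measurement.
# Assumption: nominal parameter counts (7e9, 8e9); real checkpoints differ slightly.

def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Return the approximate size of the weights in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)

for name, params in [("Llama-2-7b", 7e9), ("Llama-3-8B", 8e9)]:
    for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
        size = weight_memory_gib(params, bits)
        print(f"{name:11s} {label:9s} ~{size:5.1f} GiB")
```

Under these assumptions the 16-bit weights of a 7B model come to about 13 GiB, which leaves little headroom on a 16 GB T4, while 4-bit weights fit comfortably.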

##Installing the Libraries
First, install all the required libraries with `pip install`.

In [1]:
!pip install accelerate transformers tokenizers
!pip install bitsandbytes einops
!pip install xformers
!pip install langchain
!pip install faiss-gpu
!pip install sentence_transformers
!pip install pypdf
!pip install langchain-community langchain-core langchain-huggingface


##Initializing the Hugging Face Pipeline
You have to initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires the following three things that you must initialize:

1.   An LLM, in this case `meta-llama/Llama-2-7b-chat-hf` or `meta-llama/Meta-Llama-3-8B`.
2.   The respective tokenizer for the model.
3.   A stopping criteria object.

Uncomment the `model_id` based on your preference.

In [2]:
model_id = 'meta-llama/Llama-2-7b-chat-hf'
# model_id = 'meta-llama/Meta-Llama-3-8B'

You have to initialize the model and move it to a CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5–10 minutes.

You also need to generate an access token so that your code can download the model from Hugging Face. To do so, go to your Hugging Face Profile > Settings > Access Token > New Token > Generate a Token. Copy the token and add it to the code below.

Either set the `hf_auth` Hugging Face token manually or read it from Colab Secrets.
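Outside Colab, one option is to fall back to an environment variable. The helper below is a hypothetical sketch (the name `resolve_hf_token` is not part of any library); it tries Colab Secrets first and then reads `HF_TOKEN` from the environment.

```python
import os
from typing import Optional

def resolve_hf_token(name: str = "HF_TOKEN") -> Optional[str]:
    """Hypothetical helper: try Colab Secrets, then an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
        token = userdata.get(name)
        if token:
            return token
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook; fall through to the environment variable.
        pass
    return os.environ.get(name)

hf_auth = resolve_hf_token()
print("token found" if hf_auth else "no token found; set HF_TOKEN first")
```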

In [3]:
# begin initializing HF items, you need an access token
# Either use Colab Secrets
from google.colab import userdata
hf_auth = userdata.get('HF_TOKEN')
# or manually set token value
# hf_auth = 'hf_.......'

from torch import cuda, bfloat16
import transformers

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)


model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")




Model loaded on cuda:0


The pipeline requires a tokenizer, which translates human-readable plaintext into the token IDs the LLM reads. Each Llama model was trained with its own tokenizer (Llama 2 or Llama 3), which can be initialized with this code:

In [4]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    token=hf_auth
)


Now, we need to define the *stopping criteria* for the model. Stopping criteria let us specify when the model should stop generating text. Without them, the model tends to go off on a tangent after answering the initial question.

In [5]:
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[1, 29871, 13, 29950, 7889, 29901], [1, 29871, 13, 28956, 13]]

You have to convert these stop token ids into `LongTensor` objects.

In [6]:
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13, 29950,  7889, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0')]

You can do a quick spot check that no `<unk>` token IDs (`0`) appear in `stop_token_ids`. There are none, so we can move on to building the stopping criteria object, which checks whether any of these token ID sequences has just been generated.

In [7]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
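The suffix comparison performed by `StopOnTokens` can be sketched with plain Python lists. The stop IDs are copied from the output shown earlier, where the leading `1` is the Llama 2 BOS token; the generated sequence is made up for illustration, and whether real generations tokenize to exactly these IDs depends on the tokenizer.

```python
from typing import List, Sequence

def ends_with_any(generated: Sequence[int], stop_sequences: List[Sequence[int]]) -> bool:
    """Return True if `generated` ends with any of the stop sequences."""
    for stop in stop_sequences:
        n = len(stop)
        if n and len(generated) >= n and list(generated[-n:]) == list(stop):
            return True
    return False

# IDs for '\nHuman:' as printed above, including the leading BOS id (1).
with_bos = [[1, 29871, 13, 29950, 7889, 29901]]
# Illustrative variant with the BOS id dropped.
without_bos = [seq[1:] for seq in with_bos]

# Made-up generated sequence: BOS appears only at position 0.
generated = [1, 15043, 29889, 29871, 13, 29950, 7889, 29901]

print(ends_with_any(generated, with_bos))     # False: BOS never occurs mid-sequence
print(ends_with_any(generated, without_bos))  # True
```

If the stop sequences never seem to trigger in your run, the BOS id is one thing to check; passing `add_special_tokens=False` to the tokenizer call is one way to omit it.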

You are ready to initialize the Hugging Face pipeline. There are a few additional parameters that we must define here. Comments are included in the code for further explanation.

In [8]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs; values near 0 are close to deterministic, higher values are more random
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
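To see why a low `temperature` makes the output less random, here is a minimal sketch of temperature-scaled softmax over made-up scores for three candidate tokens. It illustrates the general idea only and is not the code path the pipeline uses internally.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At `0.1` nearly all probability mass lands on the top-scoring token, while larger values spread it across the alternatives.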

Run this code to confirm that everything is working fine.

In [9]:
res = generate_text("Explain me the difference between Data Lakehouse and Data Warehouse.")
print(res[0]["generated_text"])

Explain me the difference between Data Lakehouse and Data Warehouse. Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of data but have different architectures, use cases, and benefits. A data lakehouse is a centralized repository that stores all the raw data from various sources in its original form, without transforming or processing it. On the other hand, a data warehouse is a structured repository that stores data in a specific format, typically after cleaning, transforming, and aggregating it.

Here are some key differences between a data lakehouse and a data warehouse:

1. Data Structure: A data lakehouse stores data in its raw, unprocessed form, while a data warehouse stores data in a structured format, typically after cleaning, transforming, and aggregating it.
2. Data Sources: A data lakehouse can ingest data from various sources, including databases, APIs, fil

##Implementing HF Pipeline in LangChain
Now, wrap the Hugging Face pipeline in LangChain. The output should be much the same, since the same pipeline still does the generation. However, this wrapper lets you use LangChain’s agent tooling, chains, and so on with **Llama 2/3**.

In [10]:
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

# checking again that everything is working fine
llm.invoke("Explain me the difference between Data Lakehouse and Data Warehouse.")

'Explain me the difference between Data Lakehouse and Data Warehouse. Unterscheidung between data lakehouse and data warehouse is a common topic of discussion in the data engineering community, as both are designed to store large amounts of data but have different architectures, use cases, and benefits. A data warehouse is a centralized repository that stores data in a structured manner, typically for querying and analysis. A data lakehouse, on the other hand, is a storage system that allows for flexible schema-on-read, meaning that the structure of the data can change over time without affecting existing queries or applications.\n\nIn this article, we will explore the key differences between these two concepts and help you determine which one best fits your needs.\n\nKey Differences Between Data Lakehouse and Data Warehouse:\n\n1. Structure: A data warehouse stores data in a structured manner, with well-defined schemas and tables. In contrast, a data lakehouse has a flexible schema-on

##Ingesting Data using Document Loader
Ingest the data with the `PyPDFLoader` document loader, which reads a `PDF` file using the `pypdf` Python library and returns one document per page.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("code.pdf")
documents = loader.load()

##Splitting in Chunks using Text Splitters
Next, split the text into small chunks. Initialize a `RecursiveCharacterTextSplitter` and call `split_documents` on the loaded documents; here each chunk is at most 1000 characters, with a 20-character overlap between neighboring chunks.

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
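The effect of `chunk_size` and `chunk_overlap` can be sketched with a simplified fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally (it first tries to split on separators such as paragraph and line breaks); the function name `split_with_overlap` is hypothetical and only illustrates the overlap idea.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 20) -> List[str]:
    """Simplified sketch: slide a fixed-size window with the given overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2500 made-up characters
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])            # [1000, 1000, 540]
print(chunks[0][-20:] == chunks[1][:20])   # True: neighbors share 20 characters
```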

##Creating Embeddings and Storing in Vector Store
Create an embedding for each chunk of text and store it in the vector store (here, FAISS). The `all-mpnet-base-v2` Sentence Transformer converts each chunk into a vector as it is added to the vector store.

In [16]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# storing embeddings in the vector store
vectorstore = FAISS.from_documents(all_splits, embeddings)
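Conceptually, retrieval from a vector store means finding the stored vectors closest to the query vector. The sketch below does a brute-force search with Euclidean distance over made-up 3-dimensional vectors; the distance metric is an assumption for illustration, and FAISS itself is far more efficient than this loop.

```python
import math
from typing import List, Sequence, Tuple

def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k closest vectors."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Made-up toy "embeddings"; all-mpnet-base-v2 produces 768-dimensional vectors.
stored = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.2, 0.1]]
query = [1.0, 0.0, 0.0]

for index, distance in top_k(query, stored):
    print(index, round(distance, 3))
```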


##Initializing Chain
Initialize a `ConversationalRetrievalChain`. This chain gives you a chatbot with memory that relies on the vector store to find relevant information in your document.

Additionally, you can return the source documents used to answer the question by passing the optional parameter `return_source_documents=True` when constructing the chain.

In [17]:
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)
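At a high level, the chain retrieves relevant chunks and places them in a prompt ahead of the question. The sketch below shows that "stuffing" step in plain Python. The instruction sentence is the one that appears in the chain's output in this notebook; the `Question:` and `Helpful Answer:` labels and the function name `build_prompt` are assumptions for illustration.

```python
from typing import List

INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)

def build_prompt(context_chunks: List[str], question: str) -> str:
    """Join retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(context_chunks)
    # The 'Question:' / 'Helpful Answer:' labels are assumed, not taken from the source.
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

# Made-up chunks standing in for retriever results.
chunks_found = ["2.1 The study leave means the leave granted to an employee ...",
                "2.2 The Syndicate may grant study leave ..."]
print(build_prompt(chunks_found, "What is study leave policy"))
```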

Now, it’s time to do some Question-Answering on your own data!

In [18]:
chat_history = []

query = "What is study leave policy"
result = chain.invoke({"question": query, "chat_history": chat_history})

print(result['answer'])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

66 STUDY LEAVE STATUTES 1987 
 
CHAPTER - I 
GENERAL  
1.1  These Statutes shall be called Shah Abdul Latif 
University Study leave Statutes, 1987. 
 
1.2  These Statutes shall have come into force with 
immediate effect. 
 
1.3  These Statutes will not apply in case of such 
persons who were deputed for training abroad 
before commencement of these Statutes. 
 
CHAPTER - II 
 
THE STUDY LEAVE. 
 
2.1  The study leave means the leave granted to an 
employee to enable him to pursue a special 
course of study or for the purpose of higher 
research work in a subject related to his work in 
the University as determined by the Syndicate. 
 
2.2  The Syndicate may grant study leave to a 
University employee who holds teaching, research 
or administrative post with not less than 03 years 
satisfactory service against clear vacancy.

This time, your previous question and answer are passed in as chat history, which lets you ask follow-up questions.

In [19]:
chat_history = [(query, result["answer"])]
query = "what is sabbatical leave policy"
result = chain.invoke({"question": query, "chat_history": chat_history})

print(result['answer'])

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

66 STUDY LEAVE STATUTES 1987 
 
CHAPTER - I 
GENERAL  
1.1  These Statutes shall be called Shah Abdul Latif 
University Study leave Statutes, 1987. 
 
1.2  These Statutes shall have come into force with 
immediate effect. 
 
1.3  These Statutes will not apply in case of such 
persons who were deputed for training abroad 
before commencement of these Statutes. 
 
CHAPTER - II 
 
THE STUDY LEAVE. 
 
2.1  The study leave means the leave granted to an 
employee to enable him to pursue a special 
course of study or for the purpose of higher 
research work in a subject related to his work in 
the University as determined by the Syndicate. 
 
2.2  The Syndicate may grant study leave to a 
University employee who holds teaching, research 
or administrative post with not less than 03 years 
satisfactory service against clear vacancy.
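For longer conversations, the pattern of recording each (question, answer) pair can be wrapped in a small helper that appends to the history instead of replacing it. The names `ask` and `fake_chain` are hypothetical; `fake_chain` is a stand-in so the sketch runs without a model.

```python
from typing import Callable, Dict, List, Tuple

ChatHistory = List[Tuple[str, str]]

def ask(chain_fn: Callable[[Dict], Dict], question: str, history: ChatHistory) -> str:
    """Call the chain, then record the (question, answer) pair in the history."""
    result = chain_fn({"question": question, "chat_history": history})
    answer = result["answer"]
    history.append((question, answer))
    return answer

def fake_chain(inputs: Dict) -> Dict:
    """Stand-in for a real chain: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to '{inputs['question']}' after {turns} earlier turn(s)"}

history: ChatHistory = []
print(ask(fake_chain, "What is study leave policy", history))
print(ask(fake_chain, "what is sabbatical leave policy", history))
print(len(history))  # 2
```

With the real chain, `chain.invoke` could be passed as `chain_fn`, since it takes the same dictionary and returns a dictionary with an `answer` key.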

In [20]:
print(result['source_documents'])

[Document(metadata={'source': 'code.pdf', 'page': 65}, page_content='66 STUDY LEAVE STATUTES 1987 \n \nCHAPTER - I \nGENERAL  \n1.1  These Statutes shall be called Shah Abdul Latif \nUniversity Study leave Statutes, 1987. \n \n1.2  These Statutes shall have come into force with \nimmediate effect. \n \n1.3  These Statutes will not apply in case of such \npersons who were deputed for training abroad \nbefore commencement of these Statutes. \n \nCHAPTER - II \n \nTHE STUDY LEAVE. \n \n2.1  The study leave means the leave granted to an \nemployee to enable him to pursue a special \ncourse of study or for the purpose of higher \nresearch work in a subject related to his work in \nthe University as determined by the Syndicate. \n \n2.2  The Syndicate may grant study leave to a \nUniversity employee who holds teaching, research \nor administrative post with not less than 03 years \nsatisfactory service against clear vacancy. No \nemployee appointed against leave vacancy OR on \ncontract basi
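The raw `source_documents` list is hard to read, so a short citation per source can help. The sketch below works on plain metadata dictionaries shaped like the ones in the output above; `format_sources` is a hypothetical helper. Note that `PyPDFLoader` page indices start at 0, which is why page `65` corresponds to the printed page 66.

```python
from typing import Dict, List

def format_sources(metadatas: List[Dict]) -> List[str]:
    """Turn loader metadata into readable citations, de-duplicated in order."""
    seen = set()
    citations = []
    for meta in metadatas:
        # Page indices are 0-based, so add 1 for display.
        label = f"{meta['source']}, p. {meta['page'] + 1}"
        if label not in seen:
            seen.add(label)
            citations.append(label)
    return citations

# Stand-in metadata mirroring the shape shown in the output above.
example = [{"source": "code.pdf", "page": 65}, {"source": "code.pdf", "page": 65}]
print(format_sources(example))  # ['code.pdf, p. 66']
```

With the real result, the input would be `[doc.metadata for doc in result['source_documents']]`.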

##Finally
You now have the capability to do question-answering on your own data using a powerful language model. You can develop it further into a chatbot application using [Streamlit](https://streamlit.io).


##References
1. [Medium Article](https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476)
2. [GitHub repository](https://github.com/murtuza753/llama2-faiss-langchain-qa-rag/tree/main)