**Author: [Dev Kumar Maan](https://www.linkedin.com/in/dev-kumar-maan-3a6369180/)**

**Institution: National Institute of Technology, Delhi**

**Email: dev.maan02@gmail.com**

# Natural Language Query Agent (Dataset Preparation)

The goal of this project is to build a Natural Language Query Agent that uses Large Language Models (LLMs) to give concise answers to straightforward questions over a large collection of lecture notes.

This notebook offers a comprehensive guide to preparing the dataset for use in our final pipeline, facilitating answers to conversational questions.

> The project draws on the following data sources:

- [Stanford LLMs Lecture Notes](https://stanford-cs324.github.io/winter2022/lectures/)

- [Awesome LLM Milestone Papers](https://github.com/Hannibal046/Awesome-LLM#milestone-papers)

- [An Extensive Paper List (and Various Resources) on NLP for Social Good](https://github.com/zhijing-jin/NLP4SocialGood_Papers?tab=readme-ov-file)

#### **Important Note**

You must request access to the Llama 2 models on the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and agree to share your [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) account details with Meta. Make sure the email on your Hugging Face account matches the one you submitted on the Meta website, as mismatched emails may lead to the request being denied. Approval typically takes from a few minutes to a few hours.

In [94]:
from torch import cuda, bfloat16
import transformers
import torch
import pickle
from transformers import StoppingCriteria, StoppingCriteriaList
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain

import warnings
warnings.filterwarnings("ignore")

To set up a text-generation pipeline with Hugging Face transformers, you need to initialize three components:

- A large language model (LLM). We use `meta-llama/Llama-2-7b-chat-hf`.
- The tokenizer that corresponds to the model.
- A stopping criteria object.

In [4]:
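model_id = 'meta-llama/Llama-2-7b-chat-hf'
# Use the current GPU if one is available, otherwise fall back to the CPU.
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Load the weights in 4-bit NF4 precision with double quantization to reduce
# GPU memory use; computation runs in bfloat16.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)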

hf_auth = '<add your access token here>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

model.eval()
print(f"Model loaded on {device}")




Model loaded on cuda:0


The pipeline needs a tokenizer to convert human-readable text into the token IDs that the LLM consumes. The Llama 2 7B models were trained with the Llama 2 7B tokenizer, which can be initialized as follows:

In [5]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)




Next, define the stopping criteria, which determine when the model should stop generating text. Without them, the model may keep producing text that drifts away from the original question.

In [7]:
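stop_list = ['\nHuman:', '\n```\n']
# Tokenize without special tokens: the Llama tokenizer otherwise prepends a BOS
# token, which never appears at the end of generated text.
stop_token_ids = [tokenizer(x, add_special_tokens=False)['input_ids'] for x in stop_list]
stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]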

In [8]:
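class StopOnTokens(StoppingCriteria):
    """Stop generation once the output ends with any of the stop sequences."""
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            # Compare the last len(stop_ids) tokens with the stop sequence,
            # skipping sequences longer than the text produced so far.
            if len(input_ids[0]) >= len(stop_ids) and torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False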

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
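
The check that `StopOnTokens` performs amounts to a suffix comparison on token IDs. The following is a minimal sketch of that idea using plain Python lists instead of tensors, so it runs without a GPU or model access. The helper name `ends_with_stop` and the token IDs are made up for illustration; they are not the IDs the Llama 2 tokenizer produces.

```python
from typing import List, Sequence


def ends_with_stop(generated: Sequence[int], stop_sequences: List[Sequence[int]]) -> bool:
    """Return True if `generated` ends with any of the stop sequences."""
    for stop in stop_sequences:
        n = len(stop)
        # Skip empty stop sequences and ones longer than the text so far.
        if n == 0 or n > len(generated):
            continue
        # Compare only the last n token IDs with the stop sequence.
        if list(generated[-n:]) == list(stop):
            return True
    return False


# Made-up token IDs, for illustration only.
stops = [[13, 29950, 7889], [13, 28956, 13]]

print(ends_with_stop([1, 450, 1234, 13, 28956, 13], stops))  # True
print(ends_with_stop([1, 450, 1234], stops))                 # False
```

Generation is meant to halt as soon as such a check returns `True` for the tokens produced so far.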

The helper functions below print the model's responses and their references.

In [95]:
def print_ans(result):
    """Print the answer, wrapped at 12 words per line."""
    print('\n')
    answer = result['answer']
    words = answer.split()
    count = 0
    for word in words:
        # Stop printing at the word "Unhelpful".
        if word == "Unhelpful":
            break
        count += 1
        print(word, end=' ')
        if count == 12:
            print('')
            count = 0
    print('\n')

def print_ref(result):
    """Print each distinct web source of the retrieved documents once."""
    links = set()
    for x in result['source_documents']:
        y = x.metadata['source']
        if y.startswith("http"):
            links.add(y)
    for x in links:
        print(x)
    print('\n')

def print_output(result):
    print_ans(result)
    print_ref(result)
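
To make the expected shape of `result` concrete, here is a small sketch that feeds a hand-built stand-in through two pure helpers. Everything in it is an assumption for illustration: `wrap_words`, `unique_links`, `fake_result`, and the `example.com` URLs are hypothetical, and `SimpleNamespace` merely imitates a document object with a `metadata` dictionary. Unlike the notebook's helpers, this variant returns values instead of printing, keeps the links in first-seen order, and omits the "Unhelpful" cut-off.

```python
from types import SimpleNamespace


def wrap_words(text, words_per_line=12):
    """Return `text` re-wrapped so each line holds at most `words_per_line` words."""
    words = text.split()
    lines = [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]
    return "\n".join(lines)


def unique_links(source_documents):
    """Return the distinct http(s) sources in first-seen order."""
    seen = []
    for doc in source_documents:
        src = doc.metadata.get("source", "")
        if src.startswith("http") and src not in seen:
            seen.append(src)
    return seen


# Hypothetical stand-in for the dictionary returned by the chain.
fake_result = {
    "answer": "A language model is a probability distribution over sequences of tokens.",
    "source_documents": [
        SimpleNamespace(metadata={"source": "https://example.com/a"}),
        SimpleNamespace(metadata={"source": "https://example.com/a"}),
        SimpleNamespace(metadata={"source": "/content/local.pdf"}),
    ],
}

print(wrap_words(fake_result["answer"], words_per_line=6))
print(unique_links(fake_result["source_documents"]))
```

Returning strings and lists rather than printing makes helpers like these easier to test, at the cost of one extra `print` call at the call site.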

In [9]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    stopping_criteria=stopping_criteria,
    temperature=0.1,
    max_new_tokens=512,
    repetition_penalty=1.1
)

In [10]:
llm = HuggingFacePipeline(pipeline=generate_text)

In [11]:
## Load the dataset

with open('/content/ema_dataset.pkl', 'rb') as file:
    all_splits = pickle.load(file)

In [12]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)
vectorstore = FAISS.from_documents(all_splits, embeddings)


In [27]:
chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)

## How to interact with the model?

The loop keeps prompting you for queries until you enter one of the stop words in `stop_list`. For each query, the model prints an answer followed by the source links it drew on, so you can check that the answer is not hallucinated.
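
The way the chat history accumulates can be tried out without loading the model. The sketch below replaces the real chain with a hypothetical stand-in, `echo_chain`, and reads queries from a list rather than from `input()`. The names `run_chat` and `echo_chain` are assumptions made for this illustration and are not part of LangChain.

```python
def run_chat(queries, chain, stop_words):
    """Feed queries to `chain` until a stop word appears; return the chat history."""
    chat_history = []
    for raw in queries:
        query = raw.strip().lower()
        if query in stop_words:
            break
        # The chain receives the new question plus all earlier (question, answer) pairs.
        result = chain({"question": query, "chat_history": chat_history})
        chat_history.append((query, result["answer"]))
    return chat_history


def echo_chain(inputs):
    """Hypothetical stand-in for the retrieval chain: numbers each turn."""
    turn = len(inputs["chat_history"]) + 1
    return {"answer": f"answer {turn} to: {inputs['question']}"}


history = run_chat(
    ["What is a language model?", "Thanks", "never asked"],
    echo_chain,
    {"thanks", "ok"},
)
print(history)
# [('what is a language model?', 'answer 1 to: what is a language model?')]
```

Because each call receives the pairs collected so far, follow-up questions can refer back to earlier turns.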

In [93]:
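# Words that end the conversation. This reuses the name `stop_list`; the stop
# token IDs were already built above, so the stopping criteria are unaffected.
stop_list = ["thank u", "thank you", "thanks", "okay", "ok", "cool"]
chat_history = []
while True:
    query = input('Enter your Query: ')
    # Normalize the query so the stop-word check ignores case and stray spaces.
    query = query.strip().lower()
    if query in stop_list:
        break

    result = chain({"question": query, "chat_history": chat_history})
    print_output(result)

    # Keep each (question, answer) pair so follow-up queries have context.
    chat_history = chat_history + [(query, result["answer"])]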

Enter your Query: What is a language model?


A language model is a probability distribution over sequences of tokens. 

https://stanford-cs324.github.io/winter2022/lectures/introduction/


Enter your Query: What is adaptability?


The term "adaptable" refers to the ability of a language model to 
adjust its performance to better suit a particular context or task. This 
can involve modifying the model's parameters or fine-tuning the model on a 
small set of in-context examples. 

https://stanford-cs324.github.io/winter2022/lectures/adaptation/


Enter your Query: What are some milestone model architectures and papers in the last few years?


Yes, here are some recent milestone models and papers in natural language 
processing: * GPT-3 (2020): A large-scale transformer-based language model developed by OpenAI 
that achieved state-of-the-art results on a wide range of natural language processing 
tasks. * BERT (2018): A pre-trained language model developed by Google that 
achieved state-