**Q/A ANSWERING WITH LLAMA2**

**Summary**

LLAMA2 is a LLM model which was released by meta.Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.This is the collab file for the 7B pretrained model.

Model Architecture Llama 2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use **supervised fine-tuning (SFT)** and **reinforcement learning with human feedback (RLHF)** to align to human preferences for helpfulness and safety.

**Intended Use Cases** Llama 2 is intended for commercial and research use in English. Tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.

This collab is created to show the basic implementation of LLAMA2 model on a intented pdf to have Q/A interaction. The model will interact with the pdf and it will either retrieve or generate a answer depending on the findings it finds in the pdf.


---


***NOTE***: To run this collab file, You have to load the collab file in

**GPU type** : T4

**Runtime shape** : High-RAM

To make this changes click on  Edit-> Notebook Settings-> and change to above settings.


---


The step by step guide to load the model and interact with it is provided below.



Step1: Installing all dependencies.

In [1]:
!pip -qqq install langchain apify-client faiss-cpu sentence_transformers accelerate bitsandbytes
!pip -q install langchain unstructured sentence_transformers faiss-cpu huggingface_hub OpenAI
!pip -q install git+https://github.com/huggingface/transformers
!pip install -q datasets loralib sentencepiece
!pip -q install bitsandbytes accelerate einops

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m66.3/66.3 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m17.6/17.6 MB[0m [31m40.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m24.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.6/92.6 MB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m90.0/90.0 kB[0m [31m8.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.4/75.4 kB[0m [31m9.3 MB/s[0m et

**Step2**: Importing all libraries.Mostly libraries are from langchain because it makes LLM integration seamless easy.


In [2]:
from langchain.document_loaders.base import Document
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.document_loaders import ApifyDatasetLoader
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
import torch

**Step3**:Loading embedding.

Here hugging face embedding is used as it is open source and free for commerical use.

Embedding is a technique in natural language processing (NLP) that represents words or phrases as vectors of real numbers.

In [3]:
embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-large-v2")

Downloading (…)b9212/.gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/201 [00:00<?, ?B/s]

Downloading (…)0777bb9212/README.md:   0%|          | 0.00/67.5k [00:00<?, ?B/s]

Downloading (…)77bb9212/config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

Downloading (…)777bb9212/handler.py:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)b9212/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

Downloading (…)0777bb9212/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)7bb9212/modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

**Step4**: Loading the dataset, here I have loaded a pdf called "football.pdf" but any other pdf or dataset can be loaded here.

In [4]:
!pip install pdfplumber
import pdfplumber

def pdf_loader(pdf_file):
    with pdfplumber.open(pdf_file) as pdf:
        pages = pdf.pages

        documents = []
        for page in pages:
            text = page.extract_text()
            documents.append(Document(page_content=text))

    return documents

documents = pdf_loader("football.pdf")

text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap  = 200
    )

docs = text_splitter.split_documents(documents)
db = FAISS.from_documents(docs, embeddings)


Collecting pdfplumber
  Downloading pdfplumber-0.10.2-py3-none-any.whl (47 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.5/47.5 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pdfminer.six==20221105 (from pdfplumber)
  Downloading pdfminer.six-20221105-py3-none-any.whl (5.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.6/5.6 MB[0m [31m49.3 MB/s[0m eta [36m0:00:00[0m
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.18.0-py3-none-manylinux_2_17_x86_64.whl (2.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.9/2.9 MB[0m [31m86.8 MB/s[0m eta [36m0:00:00[0m
Collecting cryptography>=36.0.0 (from pdfminer.six==20221105->pdfplumber)
  Downloading cryptography-41.0.3-cp37-abi3-manylinux_2_28_x86_64.whl (4.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.3/4.3 MB[0m [31m103.4 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: pypdfium2, crypto

**Step5**: Loading LLAMA2 model.

It is most important and crucial step of building this model.

I have used BITSANDBYTESCONFIG to load model in 4bits. It is done to ensure the model doesn't take too much is CPU and GPU RAM.Although it might effect on the type of answer it generates, it's still a better model.

Then model is loaded using  AutoModelForCausalLM. I have used a sharded model because

*   Improved performance: Sharded
LLMs
can be loaded more quickly than monolithic LLMs, which can improve the performance of applications that use them.
*   Reduced memory usage: Sharded LLMs can be stored in smaller chunks of memory, which can reduce the memory requirements of applications that use them.





In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoTokenizer

model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
model.config.use_cache = False

Downloading (…)lve/main/config.json:   0%|          | 0.00/626 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/14 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00014.bin:   0%|          | 0.00/981M [00:00<?, ?B/s]

Downloading (…)l-00002-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00003-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00004-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00005-of-00014.bin:   0%|          | 0.00/944M [00:00<?, ?B/s]

Downloading (…)l-00006-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00007-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00008-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00009-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00010-of-00014.bin:   0%|          | 0.00/944M [00:00<?, ?B/s]

Downloading (…)l-00011-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00012-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00013-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00014-of-00014.bin:   0%|          | 0.00/847M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

**Step6**: Loading tokenizer.Each LLM provides it's own tokenizer.

Tokenizer is a tool that breaks text into smaller units, such as words, phrases, or tokens. This process is called tokenization

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")

Downloading (…)okenizer_config.json:   0%|          | 0.00/676 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

**Step7**: Creating a pipeline.A pipeline is a sequence of steps or operations that are executed in a specific order to achieve a particular goal.

This code creates a pipeline for text generation using the Hugging Face Transformers library. The pipeline takes a model, a tokenizer, and a number of parameters as input. The model is used to generate text, the tokenizer is used to tokenize the text, and the parameters control the generation process.

In [7]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=3000,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.2,
)

local_llm = HuggingFacePipeline(pipeline=pipe)

**Step8**: Final step of this Q/A model is to retrieve the answer from our dataset/pdf. This is done by "RetrievalQA.from_chain_type".

Then we just have to pass our query and ask the model to process it. It will then retrieve or generate accordingly.

In [8]:
qa = RetrievalQA.from_chain_type(llm=local_llm,
                                 chain_type="stuff",
                                 retriever=db.as_retriever(k=2),
                                 return_source_documents=True,
                                 verbose=True)

query = "what is football?"
result = qa(query)
result['result']



[1m> Entering new RetrievalQA chain...[0m





[1m> Finished chain.[0m


' It is a game where they kick a round object back and forth between them while running all over the field trying not to get tackled.\n\nAnswer: it is a game where they kick a round object back and forth between them while running all over the field trying not to get tackled.'

**Few other queries and their results.**

In [9]:
query = "Can a player be substituted in football during  match?"
result = qa(query)
result['result']



[1m> Entering new RetrievalQA chain...[0m


' Yes, but only once per half.\n\n\n'

**Challenges faced**

*   To load the Llama model as official version need persmission from meta.
*   Improve the model as it is a large model, I had to play with the parameters.

*   The time taken to load the embedding , model was more apart from that answer took too much time to generate.

**Improvements**

*    Time to generate has to be reduced.

*    Still a good amount of time has to be spent with parameters.

*    Improve model loading time and a good user interface to interact with it.

