<a href="https://colab.research.google.com/github/ashishkumarsahani/NLP_Demos/blob/main/RAG_using_LLAMA2_and_Deployment_on_Streamlit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:

!pip install --upgrade \
  sentence-transformers \
  pinecone-client \
  datasets \
  accelerate \
  einops \
  langchain \
  xformers \
  bitsandbytes \
  langchain-community \
  torch \
  transformers \
  PyMuPDF

Collecting torch
  Using cached torch-2.3.0-cp310-cp310-manylinux1_x86_64.whl (779.1 MB)
Collecting triton==2.3.0 (from torch)
  Using cached triton-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168.1 MB)
Installing collected packages: triton, torch
  Attempting uninstall: triton
    Found existing installation: triton 2.3.1
    Uninstalling triton-2.3.1:
      Successfully uninstalled triton-2.3.1
  Attempting uninstall: torch
    Found existing installation: torch 2.3.1
    Uninstalling torch-2.3.1:
      Successfully uninstalled torch-2.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.2.1+cu121 requires torch==2.2.1, but you have torch 2.3.0 which is incompatible.
torchtext 0.17.1 requires torch==2.2.1, but you have torch 2.3.0 which is incompatible.
torchvision 0.15.2 requires torch==2.0.1, but you have torch 2.3.0 w

In [2]:
from torch import cuda
from langchain_community.embeddings import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


We can use the embedding model to create document embeddings like so:
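
As a hedged sketch of what that call looks like: `embed_documents` takes a list of strings and returns one vector (a list of floats) per string. The `ToyEmbedder` class and the `embedding_shape` helper below are hypothetical names, introduced only so the snippet runs without downloading a model; in the notebook you would pass `embed_model` instead.

```python
import hashlib
from typing import List, Tuple


class ToyEmbedder:
    """Hypothetical stand-in exposing the same `embed_documents` method as `embed_model`.

    It hashes each text into a fixed-length vector; the numbers carry no meaning.
    """

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes of the digest to floats in [0, 1].
            vectors.append([b / 255 for b in digest[: self.dim]])
        return vectors


def embedding_shape(embedder, texts: List[str]) -> Tuple[int, int]:
    """Embed `texts` and return (number of vectors, length of each vector)."""
    vectors = embedder.embed_documents(texts)
    return len(vectors), len(vectors[0])


docs = ["this is one document", "and another document"]
shape = embedding_shape(ToyEmbedder(dim=8), docs)
print(f"We have {shape[0]} doc embeddings, each with a dimensionality of {shape[1]}.")
```

Calling `embedding_shape(embed_model, docs)` with the notebook's `all-MiniLM-L6-v2` model should report a dimensionality of 384 instead of 8.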

**Building the Vector Index**
We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index; for this we'll need a free Pinecone API key.

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index.

Now we initialize the index.

With our index and embedding process ready, we can move on to the indexing process itself. For that, we'll need a dataset. We will use a set of arXiv papers related to (and including) the Llama 2 research paper. The cell below reads every PDF found in the local `PDFStore` directory.

In [4]:
import fitz  # PyMuPDF
import os
import pandas as pd
from tqdm.auto import tqdm

def read_pdf(file_path):
    doc = fitz.open(file_path)
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

def chunk_text(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Directory containing PDFs
pdf_dir = 'PDFStore'
pdf_files = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.endswith('.pdf')]

data = []

for pdf_file in tqdm(pdf_files):
    text = read_pdf(pdf_file)
    chunks = chunk_text(text)
    for chunk_id, chunk in enumerate(chunks):
        data.append({
            'source': pdf_file,
            'chunk_id': chunk_id,
            'chunk': chunk
        })

data_df = pd.DataFrame(data)

  0%|          | 0/2 [00:00<?, ?it/s]
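
The `chunk_text` helper above cuts the text every 1000 characters, so a sentence that straddles a boundary is split between two chunks. One common mitigation, which is an assumption here and not something the notebook does, is to let consecutive chunks overlap. A minimal sketch (the function name and the `overlap` parameter are hypothetical):

```python
from typing import List


def chunk_text_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_text_with_overlap(sample, chunk_size=10, overlap=0)
with_overlap = chunk_text_with_overlap(sample, chunk_size=10, overlap=4)
print(no_overlap)
print(with_overlap)
```

With `overlap=0` this behaves like the original `chunk_text`. With a non-zero overlap the last chunk can be fully contained in the previous one, which costs a little extra storage but does not affect retrieval correctness.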

In [5]:
import pickle

# Assuming embed_model is your embedding model
# and you have a method embed_model.embed_documents to get embeddings

# Directory to save the embeddings
os.makedirs('embeddings', exist_ok=True)

batch_size = 32

all_vectors = []

for i in tqdm(range(0, len(data_df), batch_size)):
    i_end = min(len(data_df), i + batch_size)
    batch = data_df.iloc[i:i_end]
    ids = [f"{x['source']}-{x['chunk_id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)

    metadata = [
        {'text': x['chunk'],
         'source': x['source']} for _, x in batch.iterrows()
    ]

    for idx, embed, meta in zip(ids, embeds, metadata):
        vector_data = {
            'id': idx,
            'embedding': embed,
            'metadata': meta
        }
        all_vectors.append(vector_data)

# Save all vectors to a pickle file
with open('embeddings/vectors.pkl', 'wb') as f:
    pickle.dump(all_vectors, f)

print("Vectors have been saved locally.")


  0%|          | 0/3 [00:00<?, ?it/s]

Vectors have been saved locally.


In [6]:
all_vectors[0]['id']

'PDFStore/AARMED_Ventilator_Paper (1).pdf-0'

In [7]:
all_vectors[0]['embedding']

[-0.005052613094449043,
 0.03144381195306778,
 -0.03725692629814148,
 0.022986436262726784,
 0.044998981058597565,
 -0.07513134926557541,
 -0.013103531673550606,
 0.11042957007884979,
 0.03447945415973663,
 0.014979658648371696,
 -0.03177201375365257,
 0.032891470938920975,
 0.09203609079122543,
 0.09892263263463974,
 -0.07650760561227798,
 0.02247806079685688,
 0.04953918978571892,
 -0.025898050516843796,
 -0.060094255954027176,
 -0.030847813934087753,
 0.0206847433000803,
 0.08052218705415726,
 -0.010270497761666775,
 0.03184668347239494,
 -0.07663623988628387,
 -0.04970858618617058,
 -0.05985566973686218,
 -0.0034310612827539444,
 -0.0013729274505749345,
 -0.004418091382831335,
 0.021554183214902878,
 -0.014903075061738491,
 -0.014176206663250923,
 -0.0327991284430027,
 -0.03002236783504486,
 -0.03403373435139656,
 -0.01081168930977583,
 0.054659947752952576,
 -0.12469235807657242,
 -0.024489223957061768,
 -0.018754536285996437,
 0.00110986630897969,
 0.04269465431571007,
 -0.008432

In [8]:
all_vectors[0]['metadata']['text']

'Smart Health 31 (2024) 100445\nAvailable online 18 January 2024\n2352-6483/© 2024 Elsevier Inc. All rights reserved.\nContents lists available at ScienceDirect\nSmart Health\njournal homepage: www.elsevier.com/locate/smhl\nDesign and technical evaluation of an AMBU-BAG based low-cost\nventilator-AARMED\nMohit Kumar a,1, Ravinder Kumar b,∗,1, Vishal Kumar c, Amanpreet Chander a,\nAbhinav Airan d, Rajesh Arya e, Gurpreet Singh Wander e, Ashish Kumar Sahani a\na Department of Biomedical Engineering, Indian Institute of Technology, Ropar, India\nb Department of Bioengineering, University of Pittsburgh, USA\nc University of Notre Dame, USA\nd Magnimus Systems, Bhopal, India\ne Dayanand Medical College and Hospital (DMC&H), Ludhiana, India\nA R T I C L E\nI N F O\nKeywords:\nAARMED\nAmbu-bag\nVentilator\nCovid-19\nA B S T R A C T\nThe COVID-19 pandemic has caused a significant strain on the healthcare system worldwide,\nresulting in an acute shortage of ventilators. Conventional ventilators

In [9]:
all_vectors[0]['metadata']['source']

'PDFStore/AARMED_Ventilator_Paper (1).pdf'

**Initializing the Hugging Face Pipeline**
The first thing we need to do is initialize a text-generation pipeline with Hugging Face transformers. The pipeline requires the following components, which we must initialize first:

*   An LLM, in this case meta-llama/Llama-2-13b-chat-hf.

*   The respective tokenizer for the model.

We'll explain these as we get to them; let's begin with our model.

We initialize the model and place it on our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [11]:
from torch import cuda, bfloat16
import transformers
from google.colab import userdata

model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
hf_auth = userdata.get('HuggingFaceKey')
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    token=hf_auth  # `use_auth_token` is deprecated in recent transformers releases
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Model loaded on cuda:0
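
To see why the 4-bit configuration matters, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. The helper name is hypothetical, and the figure of 13e9 parameters is a nominal count implied by the "13b" model name. The estimate ignores quantization constants, any layers kept in higher precision, activations, and the KV cache, so actual usage will be higher.

```python
def approx_weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return n_params * bits_per_param / 8 / 2**30


# 13e9 is a nominal parameter count, not an exact figure for the checkpoint.
for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: ~{approx_weight_gib(13e9, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights would not fit on a typical 16 GB Colab GPU, while the 4-bit weights leave room for activations and generation.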


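The pipeline requires a tokenizer, which translates human-readable plaintext into LLM-readable token IDs. We load the tokenizer that matches the model with `transformers.AutoTokenizer.from_pretrained(model_id)`; in this notebook that call sits in the query cell further below, after the saved vectors are loaded and a retrieval helper is defined.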

In [13]:
import numpy as np
# Load vectors from the pickle file
with open('embeddings/vectors.pkl', 'rb') as f:
    vectors = pickle.load(f)

# Separate embeddings and metadata
embeddings = np.array([v['embedding'] for v in vectors])
metadata = [v['metadata'] for v in vectors]

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query_embedding, embeddings, metadata, top_k=5):
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    indices = np.argsort(similarities)[-top_k:][::-1]
    return [metadata[i] for i in indices], [similarities[i] for i in indices]
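
To make the ranking step concrete, here is a dependency-free sketch of what `cosine_similarity` followed by `argsort` computes in `retrieve`. The function names and toy vectors are hypothetical, and the sketch assumes no vector is all zeros.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to `query`, best first."""
    scores = [(i, cosine(query, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]  # points almost along the first axis
ranked = top_k(toy_query, toy_vectors, k=2)
print(ranked)
```

Because cosine similarity depends only on direction, the query's length (here roughly 2) does not affect the ranking: vector 0 comes first, followed by vector 2.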

In [21]:
import torch
import textwrap

# Example usage:
query = "What is the use of AARMED?"

# Generate query embedding
query_embedding = embed_model.embed_query(query)
retrieved_chunks, scores = retrieve(query_embedding, embeddings, metadata, top_k=3)

# Load the tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, token=hf_auth)

# RAG Response
# Combine retrieved chunks into a single context
context = "\n".join([chunk['text'] for chunk in retrieved_chunks])

# Generate RAG response using the context
rag_input_text = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
rag_inputs = tokenizer(rag_input_text, return_tensors="pt").to(device)

with torch.no_grad():
    rag_outputs = model.generate(**rag_inputs, max_length=2048, num_return_sequences=1)

rag_response = tokenizer.decode(rag_outputs[0], skip_special_tokens=True)
rag_helpful_answer = rag_response.split('Answer: ')[-1]

# Wrap the text for RAG response
rag_wrapped_answer = textwrap.fill(rag_helpful_answer, width=80)

# Non-RAG Response
# Generate non-RAG response without context
non_rag_input_text = f"Question: {query}\nAnswer:"
non_rag_inputs = tokenizer(non_rag_input_text, return_tensors="pt").to(device)

with torch.no_grad():
    non_rag_outputs = model.generate(**non_rag_inputs, max_length=2048, num_return_sequences=1)

non_rag_response = tokenizer.decode(non_rag_outputs[0], skip_special_tokens=True)
non_rag_helpful_answer = non_rag_response.split('Answer: ')[-1]

# Wrap the text for non-RAG response
non_rag_wrapped_answer = textwrap.fill(non_rag_helpful_answer, width=80)

# Print the results
print("Question: " + query)
print("\nRAG Answer: " + rag_wrapped_answer)
print("\nNon-RAG Answer: " + non_rag_wrapped_answer)



Question: What is the use of AARMED?

RAG Answer: AARMED is a significant innovation in the field of respiratory care for COVID-19
patients. It provides a reliable and effective solution for patients who do not
have access to a mechanical ventilator, and it can help to reduce the risk of
complications and death associated with COVID-19. Additionally, AARMED is a low-
cost device, making it an affordable option for patients and healthcare systems.
Overall, AARMED is a valuable resource for healthcare providers and patients
during the COVID-19 pandemic.

Non-RAG Answer: AARMED (Aeromedical Evacuation and Medical Evacuation) is a system used by the
military to transport wounded or injured personnel from the battlefield to
medical facilities for treatment. It is used to provide timely and effective
medical care to those who are injured or become ill during military operations.
The AARMED system includes a network of medical facilities, transportation
assets, and communication systems that 
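
The only difference between the two answers above is the prompt: the RAG prompt prepends the retrieved chunks as context. The cell then recovers the answer by splitting the decoded text on `'Answer: '`, which can pick the wrong segment if a retrieved chunk happens to contain that string. A minimal sketch of a slightly more defensive extraction follows; the helper names and the sample context are hypothetical, and the decoded string is a stand-in for real model output.

```python
from typing import Optional


def build_prompt(query: str, context: Optional[str] = None) -> str:
    """Same template as the notebook: context is prepended only when given."""
    if context:
        return f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    return f"Question: {query}\nAnswer:"


def extract_answer(decoded: str, prompt: str) -> str:
    """Return the text after the prompt; fall back to the last 'Answer:' marker."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Decoding may not reproduce the prompt exactly, so keep a fallback.
    return decoded.rsplit("Answer:", 1)[-1].strip()


# A context that itself contains "Answer: " (made-up text, for illustration only).
context = "FAQ. Question: What is X? Answer: X is a thing."
prompt = build_prompt("What is Y?", context)
decoded = prompt + " Y is something else."  # stands in for the decoded model output
answer = extract_answer(decoded, prompt)
print(answer)
```

Another common option is to slice the generated token IDs past the prompt's token count before decoding, which avoids string matching altogether.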

In [1]:
!pip install streamlit

Collecting streamlit
  Downloading streamlit-1.35.0-py2.py3-none-any.whl (8.6 MB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Downloading GitPython-3.1.43-py3-none-any.whl (207 kB)
Collecting pydeck<1,>=0.8.0b4 (from str

In [7]:
%%writefile app.py
import streamlit as st
import fitz  # PyMuPDF
import os
import pandas as pd
import pickle
from tqdm.auto import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import torch
import textwrap
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings
import transformers

# Model identifiers and target device
embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'
model_id = 'meta-llama/Llama-2-7b-chat-hf'
device = f'cuda:{torch.cuda.current_device()}' if torch.cuda.is_available() else 'cpu'
hf_auth = 'your_key_here'


@st.cache_resource
def load_models():
    # Streamlit reruns this script on every interaction; caching keeps the
    # embedding model, LLM, and tokenizer in memory instead of reloading them.
    embed_model = HuggingFaceEmbeddings(
        model_name=embed_model_id,
        model_kwargs={'device': device},
        encode_kwargs={'device': device, 'batch_size': 32}
    )

    # 4-bit quantization so the LLM fits in less GPU memory (requires bitsandbytes)
    bnb_config = transformers.BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16
    )

    model_config = transformers.AutoConfig.from_pretrained(model_id, token=hf_auth)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        config=model_config,
        quantization_config=bnb_config,
        device_map='auto',
        token=hf_auth
    )
    model.eval()

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, token=hf_auth)
    return embed_model, model, tokenizer


embed_model, model, tokenizer = load_models()

# Helper functions
def read_pdf(file):
    doc = fitz.open(stream=file.read(), filetype="pdf")
    text = ""
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

def chunk_text(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

def process_pdf(file):
    text = read_pdf(file)
    chunks = chunk_text(text)
    data = [{'source': file.name, 'chunk_id': idx, 'chunk': chunk} for idx, chunk in enumerate(chunks)]
    return pd.DataFrame(data)

def generate_embeddings(data_df):
    all_vectors = []
    batch_size = 32

    for i in tqdm(range(0, len(data_df), batch_size)):
        i_end = min(len(data_df), i + batch_size)
        batch = data_df.iloc[i:i_end]
        ids = [f"{x['source']}-{x['chunk_id']}" for _, x in batch.iterrows()]
        texts = [x['chunk'] for _, x in batch.iterrows()]
        embeds = embed_model.embed_documents(texts)

        metadata = [{'text': x['chunk'], 'source': x['source']} for _, x in batch.iterrows()]

        for idx, embed, meta in zip(ids, embeds, metadata):
            vector_data = {'id': idx, 'embedding': embed, 'metadata': meta}
            all_vectors.append(vector_data)

    return all_vectors

def retrieve(query_embedding, embeddings, metadata, top_k=5):
    similarities = cosine_similarity([query_embedding], embeddings)[0]
    indices = np.argsort(similarities)[-top_k:][::-1]
    return [metadata[i] for i in indices], [similarities[i] for i in indices]

def generate_response(query, context=None):
    input_text = f"Context: {context}\n\nQuestion: {query}\nAnswer:" if context else f"Question: {query}\nAnswer:"
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=2048, num_return_sequences=1)

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.split('Answer: ')[-1]

# Streamlit app
st.title("PDF QA System")

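if 'vectors' not in st.session_state:
    st.session_state.vectors = None
    st.session_state.source_name = None

uploaded_file = st.file_uploader("Upload a PDF file", type=["pdf"])

if uploaded_file is not None:
    # Streamlit reruns the script on every interaction, so only re-embed
    # when a different file has been uploaded.
    if st.session_state.get('source_name') != uploaded_file.name:
        with st.spinner("Processing PDF..."):
            data_df = process_pdf(uploaded_file)
            st.session_state.vectors = generate_embeddings(data_df)
            st.session_state.source_name = uploaded_file.name

    st.success("PDF processed and vectors generated.")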

    query = st.text_input("Enter your question:")

    if query and st.session_state.vectors is not None:
        embeddings = np.array([v['embedding'] for v in st.session_state.vectors])
        metadata = [v['metadata'] for v in st.session_state.vectors]

        query_embedding = embed_model.embed_query(query)
        retrieved_chunks, _ = retrieve(query_embedding, embeddings, metadata, top_k=3)
        context = "\n".join([chunk['text'] for chunk in retrieved_chunks])

        rag_response = generate_response(query, context)
        non_rag_response = generate_response(query)

        st.write("### Question:")
        st.write(query)
        st.write("### RAG Answer:")
        st.write(rag_response)
        st.write("### Non-RAG Answer:")
        st.write(non_rag_response)

        if st.button("Ask another question"):
            # st.experimental_rerun() is deprecated; st.rerun() is its replacement
            st.rerun()


Overwriting app.py


In [8]:
# Check and kill any existing process on port 8501
!fuser -k 8501/tcp
!streamlit run app.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.17.0.2:8501
  External URL: http://34.150.213.201:8501

npx: installed 22 in 1.487s
your url is: https://every-pumas-suffer.loca.lt

>> from langchain.embeddings import HuggingFaceEmbeddings

with new imports of:

>> from langchain_community.embeddings import HuggingFaceEmbeddings
You can use the langchain cli to **automatically** upgrade many imports. Please see documentation here https://python.langchain.com/v0.2/docs/versions/v0_2/ 