[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/llm-field-guide/llama-2/llama-2-13b-retrievalqa.ipynb)

# Chaabi Assignment

At the time of writing, you must first request access to Llama 2 models via [this form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) (access is typically granted within a few hours). If you need guidance on getting access please refer to the beginning of this [article](https://www.pinecone.io/learn/llama-2/) or [video](https://youtu.be/6iHVJyX2e50?t=175).

---

Note that running this on CPU is slow. If running on Google Colab you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab.

---

We start by doing a `pip install` of all required libraries.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0

## Initializing the Hugging Face Embedding Pipeline

We begin by initializing the embedding pipeline that will handle the transformation of our docs into vector embeddings. We will use the `sentence-transformers/all-MiniLM-L6-v2` model for embedding.

In [2]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

We can use the embedding model to create document embeddings like so:

In [3]:
docs = [
    "this is one document",
    "and another document"
]

embeddings = embed_model.embed_documents(docs)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 2 doc embeddings, each with a dimensionality of 384.


## Building the Vector Index

We now need to use the embedding pipeline to build our embeddings and store them in a Pinecone vector index. To begin, we'll initialize our index. For this we'll need a [free Pinecone API key](https://app.pinecone.io/).

In [4]:
import os
import pinecone

# get API key from app.pinecone.io and environment from console
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY') or 'YOUR_API_KEY',
    environment=os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'
)

Now we initialize the index.

In [5]:
import time

index_name = 'llama-2-rag'

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=len(embeddings[0]),
        metric='cosine'
    )
    # wait for index to finish initialization
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
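
The index above is created with `metric='cosine'`, so query results are ranked by cosine similarity between vectors of equal dimensionality. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch. The function name `cosine_similarity` and the toy vectors are our own and are not part of the Pinecone client.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# toy 3-dimensional vectors (the embeddings in this notebook have 384 dimensions)
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

The dimensionality check mirrors why we pass `dimension=len(embeddings[0])` when creating the index: every stored vector and every query vector must have the same length.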

Now we connect to the index:

In [6]:
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.27555,
 'namespaces': {'': {'vector_count': 27555}},
 'total_vector_count': 27555}

With our index and embedding process ready we can move on to the indexing process itself. For that, we'll need a dataset. We will use a CSV of BigBasket product listings (`bigBasketProducts.csv`).

In [7]:
import pandas as pd

In [8]:
data = pd.read_csv("/content/bigBasketProducts.csv")

In [9]:
data.drop(columns = ["index"], inplace = True)

In [11]:
data['description'] = (
    'Product: ' + data['product']
    + ' Category: ' + data['category']
    + ' Sub_Category: ' + data['sub_category']
    + ' Brand: ' + data['brand']
    + ' Type: ' + data['type']
    + ' Description: ' + data['description']
    + ' Sale_Price: ' + data['sale_price'].astype(str)
    + ' Market_Price: ' + data['market_price'].astype(str)
    + ' Rating: ' + data['rating'].astype(str)
)

Before we embed and index the documents, we preview the data and fill in missing values:

In [12]:
data.head()

| | product | category | sub_category | brand | sale_price | market_price | type | rating | description |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Garlic Oil - Vegetarian Capsule 500 mg | Beauty & Hygiene | Hair Care | Sri Sri Ayurveda | 220.0 | 220.0 | Hair Oil & Serum | 4.1 | Product: Garlic Oil - Vegetarian Capsule 500 m... |
| 1 | Water Bottle - Orange | Kitchen, Garden & Pets | Storage & Accessories | Mastercook | 180.0 | 180.0 | Water & Fridge Bottles | 2.3 | Product: Water Bottle - Orange Category: Kitch... |
| 2 | Brass Angle Deep - Plain, No.2 | Cleaning & Household | Pooja Needs | Trm | 119.0 | 250.0 | Lamp & Lamp Oil | 3.4 | Product: Brass Angle Deep - Plain, No.2 Catego... |
| 3 | Cereal Flip Lid Container/Storage Jar - Assort... | Cleaning & Household | Bins & Bathroom Ware | Nakoda | 149.0 | 176.0 | Laundry, Storage Baskets | 3.7 | Product: Cereal Flip Lid Container/Storage Jar... |
| 4 | Creme Soft Soap - For Hands & Body | Beauty & Hygiene | Bath & Hand Wash | Nivea | 162.0 | 162.0 | Bathing Bars & Soaps | 4.4 | Product: Creme Soft Soap - For Hands & Body Ca... |


In [13]:
data["product"].isna().sum()


1

In [14]:
# assign the result back instead of calling fillna in place on a column
data["description"] = data["description"].fillna("")
data["category"] = data["category"].fillna("")
data["product"] = data["product"].fillna("")

In [15]:
batch_size = 32

for i in range(0, len(data), batch_size):
    i_end = min(len(data), i + batch_size)
    batch = data.iloc[i:i_end]
    # use row positions as vector IDs
    ids = [str(j) for j in range(i, i_end)]
    texts = batch['description'].tolist()
    embeds = embed_model.embed_documents(texts)
    # get metadata to store in Pinecone
    metadata = [
        {'text': x['description'],
         'source': x['category'],
         'title': x['product']} for _, x in batch.iterrows()
    ]
    # add to Pinecone
    index.upsert(vectors=zip(ids, embeds, metadata))
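
The loop above walks the dataframe in fixed-size slices and uses row positions as vector IDs. The slicing logic on its own can be sketched as follows. The helper name `batch_ranges` and the row count of 70 are made up for illustration.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) pairs that cover range(n_rows) in chunks."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)

# toy example: 70 rows in batches of 32; the last batch is shorter
for start, end in batch_ranges(70, 32):
    ids = [str(j) for j in range(start, end)]
    print(start, end, len(ids))
```

Because the IDs depend only on row position, re-running the indexing cell upserts to the same IDs rather than adding new vectors.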

In [16]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.27555,
 'namespaces': {'': {'vector_count': 27555}},
 'total_vector_count': 27555}

## Initializing the Hugging Face Pipeline

The first thing we need to do is initialize a `text-generation` pipeline with Hugging Face transformers. The pipeline requires two things that we must initialize first:

* An LLM, in this case `meta-llama/Llama-2-7b-chat-hf`.

* The respective tokenizer for the model.

We'll explain these as we get to them. Let's begin with our model.

We initialize the model and move it to our CUDA-enabled GPU. On Colab, downloading and initializing the model can take 5-10 minutes.

In [17]:
from torch import cuda, bfloat16
import transformers

model_id = 'meta-llama/Llama-2-7b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
# begin initializing HF items; these need a Hugging Face access token
# (replace the placeholder with your own token)
hf_auth = '<YOUR_HF_TOKEN>'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)
model.eval()
print(f"Model loaded on {device}")

Model loaded on cuda:0
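
To get a feel for why we load the model in 4-bit, here is some back-of-the-envelope arithmetic for the memory taken by the weights alone. This is a rough sketch: the parameter count of 7 billion is a nominal figure, the helper `weight_memory_gb` is our own, and the estimate ignores quantization constants, activations, and the generation cache.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9  # nominal parameter count; an approximation

print(weight_memory_gb(n_params, 16))  # 16-bit weights
print(weight_memory_gb(n_params, 4))   # 4-bit weights
```

Under these assumptions the 16-bit weights come close to filling a 16 GB T4 on their own, while the 4-bit weights leave room for everything else.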


The pipeline requires a tokenizer which handles the translation of human readable plaintext to LLM readable token IDs. The Llama 2 7B models were trained using the Llama 2 7B tokenizer, which we initialize like so:

In [18]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)
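
To make the idea of translating text to token IDs concrete, here is a toy word-level tokenizer. It is only an illustration: the class `ToyTokenizer` and its five-entry vocabulary are invented for this sketch, and the real Llama 2 tokenizer works on subword pieces with a far larger vocabulary.

```python
class ToyTokenizer:
    """Word-level vocabulary lookup; illustrative only."""

    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}
        self.unk_id = self.token_to_id["<unk>"]

    def encode(self, text):
        # lowercase, split on whitespace, map unknown words to <unk>
        return [self.token_to_id.get(w, self.unk_id) for w in text.lower().split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

toy = ToyTokenizer(["<unk>", "which", "soap", "is", "best"])
ids = toy.encode("Which soap is best")
print(ids)
print(toy.decode(ids))
```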

Now we're ready to initialize the HF pipeline. There are a few additional parameters that we must define here. Comments explaining these have been included in the code.

In [19]:
tokenizer.model_max_length = 2048

In [20]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    temperature=0.0,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
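
The `repetition_penalty` parameter discourages the model from repeating tokens it has already produced. A minimal sketch of the idea, assuming the common rule of dividing positive scores and multiplying negative scores by the penalty, looks like this. The function `apply_repetition_penalty` and the toy logits are ours and only illustrate the rule; they are not the transformers implementation.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        # dividing a positive score or multiplying a negative one both lower it
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = [2.0, -1.0, 0.5]  # toy scores for a 3-token vocabulary
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=2.0))
```

A penalty of 1.0 leaves the scores unchanged, so a value such as 1.1 is a mild nudge away from tokens that already appeared.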

Confirm this is working:

In [21]:
res = generate_text("Which Soap gives your skin the best care that it must get")
print(res[0]["generated_text"])

Which Soap gives your skin the best care that it must get. Unterscheidung between bar soap and liquid soap. Bar soaps are made from a mixture of oils, fats, and lye, while liquid soaps are made from a mixture of water, oil, and surfactants. Bar soaps are generally considered to be more moisturizing than liquid soaps because they contain more oils and fats, which can help to lock in moisture and protect the skin. On the other hand, liquid soaps are often considered to be more convenient and easier to use than bar soaps because they are less messy and don't require as much effort to apply.

In conclusion, both bar and liquid soaps have their own advantages and disadvantages when it comes to caring for your skin. It is important to choose a soap that is appropriate for your skin type and needs, and to use it consistently and correctly in order to achieve the best results.


Now to implement this in LangChain:

In [22]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

In [23]:
llm(prompt="Which Soap gives your skin the best care that it must get")

". Unterscheidung between bar soap and liquid soap. Bar soaps are made from a mixture of oils, fats, and lye, while liquid soaps are made from a mixture of water, oil, and surfactants. Bar soaps are generally considered to be more moisturizing than liquid soaps because they contain more oils and fats, which can help to lock in moisture and protect the skin. On the other hand, liquid soaps are often considered to be more convenient and easier to use than bar soaps because they are less messy and don't require as much effort to apply.\n\nIn conclusion, both bar and liquid soaps have their own advantages and disadvantages when it comes to caring for your skin. It is important to choose a soap that is appropriate for your skin type and needs, and to use it consistently and correctly in order to achieve the best results."

We still get the same output as we're not really doing anything differently here, but we have now wrapped **Llama 2 7B Chat** for use in LangChain. With this we can begin using LangChain's advanced agent tooling, chains, etc., with **Llama 2**.

## Initializing a RetrievalQA Chain

For **R**etrieval **A**ugmented **G**eneration (RAG) in LangChain we need to initialize either a `RetrievalQA` or `RetrievalQAWithSourcesChain` object. For both of these we need an `llm` (which we have initialized) and a Pinecone index — but initialized within a LangChain vector store object.

Let's begin by initializing the LangChain vector store. We do it like so:

In [24]:
from langchain.vectorstores import Pinecone

text_field = 'text'  # field in metadata that contains text content

vectorstore = Pinecone(
    index, embed_model.embed_query, text_field
)
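
The `text_field` argument names the metadata key that holds the document text. A minimal sketch of how a stored metadata record could be split into page content and remaining metadata is shown below. It assumes the text field is removed from the metadata, which is consistent with the search results in this notebook, where `metadata` holds only `source` and `title`. The helper `match_to_document` is hypothetical and the stored text is shortened.

```python
def match_to_document(match_metadata, text_field="text"):
    """Split one match's metadata into (page_content, remaining_metadata)."""
    metadata = dict(match_metadata)  # copy so the input is not mutated
    page_content = metadata.pop(text_field)
    return page_content, metadata

stored = {
    "text": "Product: Water Bottle - Orange ...",  # shortened for the example
    "source": "Kitchen, Garden & Pets",
    "title": "Water Bottle - Orange",
}
content, meta = match_to_document(stored)
print(content)
print(meta)
```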

We can confirm this works like so:

In [29]:
query = 'This Product contains Garlic Oil '

vectorstore.similarity_search(
    query,  # the search query
    k=3  # returns top 3 most relevant chunks of text
)

[Document(page_content='Product: Garlic Oil - Vegetarian Capsule 500 mg Category: Beauty & Hygiene Sub_Category: Hair Care Brand: Sri Sri Ayurveda  Type: Hair Oil & Serum Description: This Product contains Garlic Oil that is known to help proper digestion, maintain proper cholesterol levels, support cardiovascular and also build immunity.  For Beauty tips, tricks & more visit https://bigbasket.blog/ Sale_Price: 220.0 Market_Price: 220.0 Rating: nan', metadata={'source': 'Beauty & Hygiene', 'title': 'Garlic Oil - Vegetarian Capsule 500 mg'}),
 Document(page_content='Product: Garlic Oil - Vegetarian Capsule 500 mg Category: Beauty & Hygiene Sub_Category: Hair Care Brand: Sri Sri Ayurveda  Type: Hair Oil & Serum Description: This Product contains Garlic Oil that is known to help proper digestion, maintain proper cholesterol levels, support cardiovascular and also build immunity.  For Beauty tips, tricks & more visit https://bigbasket.blog/ Sale_Price: 220.0 Market_Price: 220.0 Rating: 4.1

Looks good! Now we can put our `vectorstore` and `llm` together to create our RAG pipeline.

In [26]:
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever()
)
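
With `chain_type='stuff'`, the retrieved documents are placed together into a single prompt alongside the question. The sketch below shows the idea in plain Python. The function `build_stuff_prompt`, the prompt wording, and the placeholder documents are assumptions made for illustration and are not taken from LangChain.

```python
def build_stuff_prompt(question, docs):
    """Concatenate all retrieved texts into a single prompt (the 'stuff' idea)."""
    context = "\n\n".join(docs)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

docs = ["Product: Soap A ...", "Product: Soap B ..."]  # placeholder texts
print(build_stuff_prompt("Which soap is gentle on skin?", docs))
```

Since every retrieved document goes into one prompt, the number and length of retrieved texts must fit within the model's context window.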

Let's begin asking questions! First let's try *without* RAG:

In [27]:
llm('Which Soap gives your skin the best care that it must get?')

"\n everybody has their own preferences when it comes to soap, but there are a few key ingredients that can help give your skin the best care possible. Here are some of the most important ones to look for:\n\n1. Glycerin: This humectant helps draw moisture into your skin, keeping it hydrated and soft. Look for soaps that contain at least 5% glycerin.\n2. Salicylic acid: This beta hydroxy acid helps exfoliate your skin, unclogging pores and reducing inflammation. It's great for acne-prone skin or sensitive skin that needs a gentle exfoliant.\n3. Tea tree oil: This essential oil has antibacterial properties that can help fight acne and reduce inflammation. It's also a natural antiseptic, which can help prevent infections.\n4. Vitamin E: This powerful antioxidant can help protect your skin from damage caused by free radicals, which can lead to premature aging. It also helps nourish and moisturize your skin.\n5. Oatmeal: This gentle ingredient is great for soothing dry, irritated skin. It 

Hmm, that's not what we meant... What if we use our RAG pipeline?

In [28]:
rag_pipeline('Which Soap gives your skin the best care that it must get')

{'query': 'Which Soap gives your skin the best care that it must get',
 'result': ' Based on the information provided, I would recommend the StBotanica Sensual Amber Handmade Luxury Soap. It contains amber oil that provides a relaxing fragrance even after the cleansing ritual is over. The soap is made with premium natural oils, including glycerin, which helps to moisturize and nourish the skin. Additionally, the soap contains plant oils and butter that provide antioxidants and anti-aging benefits. All of these ingredients suggest that this soap would be gentle and nourishing on the skin, providing the best possible care.'}

This looks *much* better! Let's try some more.

Okay, it looks like the LLM with no RAG is less than ideal, so let's stop embarrassing the poor LLM and stick with RAG + LLM. Let's ask the same question to our RAG pipeline.