<a href="https://colab.research.google.com/github/aniket-work/llama2_rag_langchain/blob/main/llama2_rag_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Let's Do LLM: RAG (Retrieval-Augmented Generation)

To start, we install the Python packages this notebook needs using `pip`, the package manager for Python, in Google Colab. A single command installs all of them, each pinned to a specific version: `transformers`, `sentence-transformers`, `pinecone-client`, `datasets`, `accelerate`, `einops`, `langchain`, `xformers`, and `bitsandbytes`.

## Prerequisites 🛠️

Colab comes with Python and pip preinstalled. To confirm which pip version the runtime uses, run the following command in a code cell:

```python
pip --version
```

## Installing Packages 📦

To install the specified packages with their respective versions, open a code cell in your Colab notebook and run the following command:

```python
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0
```

In [7]:
pip --version


pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)


## Command Explanation

Let's break down this command:

- `!pip install`: This is the basic `pip` command for installing packages.

- `-qU`: These are optional flags:
  - `-q` stands for quiet mode, which suppresses most of the output, making it less verbose.
  - `-U` stands for upgrade. It tells pip to upgrade already-installed packages to the newest version that satisfies the requirement; since each requirement here is pinned with `==`, that is exactly the listed version.

- The packages to install are listed one after the other, each with its version specified using `==`. For example, `transformers==4.31.0` means we want to install version 4.31.0 of the `transformers` package.
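
After the install finishes, it can be useful to confirm that the runtime really has the pinned versions. The following is a minimal sketch using only the standard library; the helper names `parse_pins` and `check_pins` are made up for this illustration, and only two of the pins are listed to keep it short.

```python
from importlib import metadata


def parse_pins(requirements):
    """Turn 'name==version' strings into a {name: version} dict."""
    pins = {}
    for req in requirements:
        name, _, version = req.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


# Two pins copied from the install command above
pins = parse_pins(["transformers==4.31.0", "langchain==0.0.240"])
for name, (expected, installed) in check_pins(pins).items():
    status = "ok" if installed == expected else "mismatch"
    print(f"{name}: expected {expected}, found {installed} ({status})")
```

A `mismatch` line usually means the install cell has not been run in the current runtime, or that the runtime needs a restart to pick up the new versions.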


## Build Embedding Pipeline

In [9]:
from torch import cuda
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

In [10]:
import time
import pinecone


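# Replace the placeholders with your own Pinecone API key and environment
pinecone.init(
    api_key='xxx',
    environment='xxx'
)

index_name = 'llama-2-rag'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        # Match the index dimension to the size of one embedding vector
        dimension=len(embed_model.embed_documents(["test1", "test2"])[0]),
        metric='cosine'
    )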

    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)
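
The index above is created with `metric='cosine'`, so vectors are compared by the angle between them rather than by their length. As a rough illustration of what that metric computes, here is a plain-Python sketch; the three-dimensional toy vectors are assumptions for readability, whereas the real embeddings from `embed_model` are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction as query, twice the length
orthogonal = [0.0, 3.0, 0.0]       # shares no component with query

print(cosine_similarity(query, same_direction))  # close to 1: treated as a near-perfect match
print(cosine_similarity(query, orthogonal))      # 0: treated as unrelated
```

Because the score ignores vector length, two chunks that point in the same direction in embedding space score as similar even if their magnitudes differ.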

In [11]:
from datasets import load_dataset

data = load_dataset(
    'jamescalam/llama-2-arxiv-papers-chunked',
    split='train'
)


In [14]:
import pinecone
import pandas as pd

# Connect to the index created earlier
index_name = 'llama-2-rag'
index = pinecone.Index(index_name)

# load_dataset returns a Hugging Face Dataset, which has no .iloc or
# .iterrows(); convert it to a pandas DataFrame before batching
if not isinstance(data, pd.DataFrame):
    data = data.to_pandas()

def process_and_upload_batch(data, embed_model, index, batch_size=32):
    """Embed the rows of `data` in batches and upsert them into the index."""
    for i in range(0, len(data), batch_size):
        i_end = min(len(data), i + batch_size)
        batch = data.iloc[i:i_end]
        ids, texts, embeds, metadata = prepare_data_for_upload(batch, embed_model)
        # Each vector is an (id, embedding, metadata) tuple
        index.upsert(vectors=list(zip(ids, embeds, metadata)))

def prepare_data_for_upload(batch, embed_model):
    """Build ids, texts, embeddings, and metadata for one batch of rows."""
    # A chunk is identified by the paper's DOI plus its chunk number
    ids = [f"{x['doi']}-{x['chunk-id']}" for _, x in batch.iterrows()]
    texts = [x['chunk'] for _, x in batch.iterrows()]

    # Embed all chunk texts of the batch in one call
    embeds = embed_model.embed_documents(texts)

    # Store the chunk text and its origin alongside each vector
    metadata = [
        {
            'text': x['chunk'],
            'source': x['source'],
            'title': x['title']
        }
        for _, x in batch.iterrows()
    ]
    return ids, texts, embeds, metadata

# Usage
batch_size = 32
process_and_upload_batch(data, embed_model, index, batch_size)
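
To make the batching loop easier to follow, here is a small standalone sketch of the same index arithmetic. `batch_ranges` is a hypothetical helper written for this illustration; it yields the `(i, i_end)` pairs that the upload loop slices with, without touching Pinecone or the embedding model.

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(n_rows, start + batch_size)


# 70 rows in batches of 32: two full batches and a final partial one
print(list(batch_ranges(70, 32)))  # [(0, 32), (32, 64), (64, 70)]

# With 4838 rows and a batch size of 32, the loop makes 152 upsert calls
print(len(list(batch_ranges(4838, 32))))  # 152
```

The `min(...)` call is what keeps the final batch from running past the end of the data when the row count is not a multiple of the batch size.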


In [15]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.04838,
 'namespaces': {'': {'vector_count': 4838}},
 'total_vector_count': 4838}

In [17]:
import torch
import transformers

# Check for CUDA availability and set the device accordingly
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define the model identifier
model_id = 'meta-llama/Llama-2-13b-chat-hf'

# Llama 2 is a gated model: the token must belong to a Hugging Face
# account that has been granted access to it
hf_auth = 'HF_AUTH_TOKEN'  # Replace with your actual authentication token

# Load the model configuration
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

# Configure quantization settings
quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16  # Use torch.float16 for better compatibility
)

# Initialize the model; device_map='auto' places the quantized weights
# on the available device(s)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=quantization_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# Do not call model.to(device) here: 4-bit quantized models are already
# placed by device_map, and .to() is not supported for them
model.eval()

# Print the device on which the model is loaded
print(f"Model loaded on {device}")
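
To see why the cell above loads the model in 4-bit, a back-of-the-envelope estimate of the weight memory helps. This is only a sketch: it assumes roughly 13 billion parameters (read off the model name) and counts the weights alone, ignoring quantization constants, activations, and the key/value cache, so real usage will be higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # assumption: about 13 billion parameters

print(f"float16: {weight_memory_gb(n_params, 16):.1f} GB")  # float16: 26.0 GB
print(f"4-bit:   {weight_memory_gb(n_params, 4):.1f} GB")   # 4-bit:   6.5 GB
```

Under these assumptions, 4-bit loading cuts the weight footprint to about a quarter of the float16 size, which is what makes a 13B model plausible on a single Colab GPU.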


Downloading (…)lve/main/config.json:   0%|          | 0.00/587 [00:00<?, ?B/s]

OSError: ignored