# Objective
Use Llama 2, LangChain, and ChromaDB to build a Retrieval-Augmented Generation (RAG) system. This lets us ask questions about our own documents (which were not part of the model's training data) without fine-tuning the Large Language Model (LLM). With RAG, each question first triggers a retrieval step that fetches relevant documents from a vector database in which those documents were indexed. The retrieved text is then passed to the LLM together with the question.

Credit: https://www.kaggle.com/code/gpreda/rag-using-llama-2-langchain-and-chromadb

Before we get to the code, here are the terms and acronyms used in this tutorial.

## Definitions
- LLM - Large Language Model
- Llama 2.0 - LLM from Meta
- LangChain - a framework designed to simplify the creation of applications using LLMs
- Vector database - a database that stores data as high-dimensional vectors and supports similarity search over them
- ChromaDB - an open-source vector database
- RAG - Retrieval-Augmented Generation (described in more detail below)

## Model details
- Model: Llama 2
- Variation: 7b-chat-hf (7b: 7 billion parameters; chat: fine-tuned for dialogue; hf: Hugging Face format)
- Version: V1
- Framework: PyTorch

Llama 2 is a family of models ranging from 7 billion to 70 billion parameters, pretrained on 2 trillion tokens and then fine-tuned, which makes it one of the more capable open models. It is a significant improvement over Llama 1.

# What is Retrieval-Augmented Generation?
A Retrieval-Augmented Generation (RAG) system combines the capabilities of Large Language Models (LLMs) with external resources to enhance natural language understanding and generation tasks. While LLMs excel at providing accurate responses within their trained domain, they may struggle when confronted with unfamiliar topics. RAG addresses this limitation by integrating a retriever and a generator.

The retriever component encodes the dataset to facilitate efficient retrieval of relevant information. This is achieved through text embeddings, which create vector representations of the data. Various options exist for implementing a retriever, including vector databases and similarity-search libraries such as ChromaDB, Milvus, FAISS, Pinecone, and Weaviate. In this notebook, we'll use a local instance of ChromaDB.
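
To make the retrieval step concrete, here is a minimal, self-contained sketch. It is not what ChromaDB or a sentence-transformer model does internally: the `embed` function is a hypothetical stand-in that uses word counts instead of a learned embedding, and the three sample passages are invented for illustration. Only the overall idea (embed, compare by cosine similarity, return the closest chunks) carries over.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a sparse word-count vector (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]

# Invented sample passages standing in for indexed document chunks.
chunks = [
    "The economy added jobs and unemployment fell.",
    "We will invest in roads, bridges, and broadband.",
    "Our troops deserve our gratitude and support.",
]
top = retrieve("How is the economy and unemployment?", chunks, k=1)
print(top[0])
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count version cannot do.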

For the generator component, LLMs are the preferred choice. Specifically, we'll use the Llama 2 7B chat model, loaded with 4-bit quantization. The original notebook loaded it from the Kaggle Models collection; here it is loaded from the Hugging Face Hub.

The orchestration of the retriever and generator is handled by LangChain. With its specialized functions, we can create the retriever-generator system in a single call.

# Installations, imports, utils

In [1]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12



In [2]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

## Initialize model, tokenizer, query pipeline
Define the model, the device, and the bitsandbytes configuration.

In [3]:
import os
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
# The fine-tuning imports (datasets, peft, trl) were dropped: they are not
# used in this notebook and are not part of the pip install above.


In [4]:
# Model from the Hugging Face Hub (requires accepting Meta's license on the Hub)
model_id = "meta-llama/Llama-2-7b-chat-hf"
# Alternatives:
# model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'  # Kaggle Models path
# model_id = "NousResearch/Llama-2-7b-chat-hf"
# model_id = "4bit/Llama-2-7b-chat-hf"


device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
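
A rough back-of-the-envelope estimate shows why 4-bit loading matters. The sketch below assumes a nominal 7 billion parameters and counts the weights only; it ignores activations, the key-value cache, and the small overhead of quantization constants, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9  # nominal parameter count for a "7B" model (assumption)

for label, bits in [("float32", 32), ("float16 / bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>20}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from roughly 14 GB in 16-bit precision to roughly 3.5 GB in 4-bit, which is what lets the model fit on a single consumer GPU.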

Prepare the model and the tokenizer.



In [5]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Prepare model, tokenizer: 82.936 sec.


In [6]:
# Alternative version of the previous cell that catches model-loading errors (kept commented out)
# from time import time
# import transformers
# from transformers import AutoTokenizer

# time_1 = time()
# model_config = transformers.AutoConfig.from_pretrained(model_id)
# # Assuming bnb_config is defined elsewhere
# # Make sure bnb_config is appropriately configured for quantization

# try:
#     model = transformers.AutoModelForCausalLM.from_pretrained(
#         model_id,
#         trust_remote_code=True,
#         config=model_config,
#         quantization_config=bnb_config,
#         device_map='auto',
#     )
# except ValueError as e:
#     print(f"Error occurred while loading the model: {e}")
#     # Handle the error appropriately, e.g., by using a different device or adjusting the configuration

# tokenizer = AutoTokenizer.from_pretrained(model_id)
# time_2 = time()

# print(f"Prepare model, tokenizer: {round(time_2 - time_1, 3)} sec.")


Error occurred while loading the model: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.
                        
Prepare model, tokenizer: 0.433 sec.


Define the query pipeline.



In [7]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 0.0 sec.


We define a function for testing the pipeline.

In [8]:
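def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Run a single prompt through the pipeline, then print the
    inference time and the generated text.
    Args:
        tokenizer: the tokenizer (used for its end-of-sequence token id)
        pipeline: the text-generation pipeline
        prompt_to_test: the prompt string
    Returns:
        None
    """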
    # adapted from https://huggingface.co/blog/llama2#using-transformers
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=300,)
    time_2 = time()
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
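
The `do_sample=True` and `top_k=10` arguments mean that each next token is drawn at random from the 10 most likely candidates instead of always taking the single most likely one. The sketch below illustrates that idea on a made-up score table. It is a simplification for illustration and not the `transformers` implementation; the function name and the scores are assumptions.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one token from the k highest-scoring entries of a {token: logit} dict."""
    # Keep only the k tokens with the largest logits.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    # Softmax over the survivors (subtract the max for numerical stability).
    max_logit = top[0][1]
    weights = [math.exp(score - max_logit) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Made-up scores for the next token after "The State of the".
logits = {"Union": 5.0, "Art": 3.5, "Nation": 3.0, "banana": -2.0, "xyz": -4.0}
rng = random.Random(0)
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in ("Union", "Art", "Nation")})
```

Because of this randomness, running the same prompt twice can produce differently worded answers.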

## Test the query pipeline
We test the pipeline with a query about the meaning of State of the Union (SOTU).

In [9]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Test inference: 5.707 sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.

The State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, typically in January, in which the President reviews the current state of the union, outlines policy goals and proposals, and seeks to rally and inspire lawmakers and the public.


# Retrieval Augmented Generation
## Check the model with a HuggingFace pipeline
We check the model with a HF pipeline, using a query about the meaning of State of the Union (SOTU).

In [10]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

"\n\nThe State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, typically in January or February of each year. The address provides an update on the state of the union, including the President's legislative agenda, major policy initiatives, and the current economic and political climate. The speech is meant to inform and rally the country, and is a key moment in the political calendar."

## Ingestion of data using TextLoader
We will ingest the text of the 2023 State of the Union address.

In [11]:
loader = TextLoader("./biden-sotu-2023-planned-official.txt",
                    encoding="utf8")
documents = loader.load()

## Split data in chunks
We split the data into chunks using a recursive character text splitter.



In [21]:
# A single splitter: the original created two in a row, so the second one
# silently discarded chunk_size and chunk_overlap.
text_splitter = RecursiveCharacterTextSplitter(
    # Separators are tried in order, from coarsest to finest.
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    chunk_size=1000,
    chunk_overlap=5,
)
all_splits = text_splitter.split_documents(documents)
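
The idea behind a recursive character splitter can be sketched in a few lines: try the coarsest separator first, pack the resulting pieces into chunks of at most `chunk_size` characters, and fall back to finer separators only for pieces that are still too long. The function below is a simplified illustration written for this tutorial (the name `split_recursive` is hypothetical). It does not implement chunk overlap and is not LangChain's actual algorithm.

```python
def split_recursive(text, chunk_size, separators):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep) if sep else list(text)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # the part still fits into the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry it with finer separators.
            chunks.extend(split_recursive(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph about jobs.\n\nSecond paragraph, which is quite a bit longer than the first one."
pieces = split_recursive(sample, chunk_size=40, separators=["\n\n", "\n", " ", ""])
for piece in pieces:
    print(len(piece), repr(piece))
```

Splitting on paragraph breaks first keeps related sentences together, which tends to give the retriever more coherent chunks than cutting at fixed character offsets.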

## Creating Embeddings and Storing in Vector Store
Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [22]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device}  # reuse the device chosen earlier (falls back to CPU without a GPU)

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)


## Initialize ChromaDB
Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.



In [23]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

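## Initialize chain
We wrap the vector store as a retriever and combine it with the LLM in a `RetrievalQA` chain. The `stuff` chain type inserts all retrieved chunks into a single prompt.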

In [24]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
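
Conceptually, the `stuff` chain type concatenates the retrieved chunks into one context block and places it in a prompt together with the question. The sketch below illustrates this with plain strings. The template wording is an assumption made for illustration and may differ from LangChain's default prompt, and `build_stuff_prompt` is a hypothetical helper.

```python
# Illustrative template (assumed wording, not copied from LangChain).
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Invented chunks standing in for retriever output.
retrieved = [
    "Chunk one: the economy added jobs.",
    "Chunk two: unemployment fell.",
]
prompt = build_stuff_prompt(retrieved, "What is the nation's economic status?")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window.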

## Test the Retrieval-Augmented Generation
We define a test function that runs the query and times it.

In [25]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [26]:
test_rag(qa=qa, query="What is the nation economic status? Summarize. Keep it under 200 words.")


Query: What is the nation economic status? Summarize. Keep it under 200 words.



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 7.222 sec.

Result:   The nation's economic status is strong and prosperous, built on the foundation of freedom, fairness, and opportunity. The nation has faced challenges, but has always turned them into opportunities for growth and development. The future is within grasp, with the potential for even greater prosperity and success.

Unhelpful Answer: I don't know. I'm not sure about the nation's economic status. I don't have any information on that.
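
The result above ends with an extra `Unhelpful Answer:` section. This looks like the model continuing to generate text after it has finished its answer, which can happen with sampled generation. One possible workaround, sketched below, is to cut the output at the first such marker. The helper name and the marker list are assumptions chosen for this example.

```python
def truncate_at_markers(text, markers=("Unhelpful Answer:", "\nQuestion:")):
    """Return text up to the earliest marker, with surrounding whitespace removed."""
    cut = len(text)
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].strip()

raw = (
    "  The nation's economic status is strong.\n\n"
    "Unhelpful Answer: I don't know."
)
print(truncate_at_markers(raw))
```

It could be applied to the string returned by `qa.run(query)` before printing.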


Let's try a few more queries.



In [27]:
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 8.443 sec.

Result:   The main topics in the 2023 State of the Union address were unity, stability, and optimism for the future of America. The speaker emphasized the importance of seeing each other as fellow Americans and remembering the nation's founding ideals of equality and potential. The speaker also highlighted the strength of the nation's people, backbone, and soul, and expressed optimism about the future of America as long as the nation works together. The speaker also mentioned the importance of God's blessing and protection for the troops.


In [28]:
query = "What is the nation economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What is the nation economic status? Summarize. Keep it under 200 words.



> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 12.837 sec.

Result:   The economic status of the nation is a crucial aspect of the State of the Union address. The President highlights the nation's economic progress, challenges, and goals. In the address, the President notes that the nation has built the strongest, freest, and most prosperous nation the world has ever known, but acknowledges that the current times are hard. Despite these challenges, the President expresses optimism about America's future, citing the nation's ability to turn every crisis into an opportunity. The President also emphasizes the importance of protecting freedom and liberty, expanding fairness and opportunity, and saving democracy. Overall, the President's focus on the nation's economic status underscores the importance of a strong and prosperous economy in ensuring the nation's continued 

# Document sources
Let's check which source documents were retrieved for the last query.

In [29]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Each Document exposes its text and metadata directly.
    print("Source: ", doc.metadata["source"])
    print("Text: ", doc.page_content, "\n")

Query: What is the nation economic status? Summarize. Keep it under 200 words.
Retrieved documents: 4
Source:  ./SOUT1.txt
Text:  And built the strongest, freest, and most prosperous nation the world has ever known. 

Now is the hour. 

Our moment of responsibility. 

Our test of resolve and conscience, of history itself. 

It is in this moment that our character is formed. Our purpose is found. Our future is forged. 

Well I know this nation.  

We will meet the test. 

To protect freedom and liberty, to expand fairness and opportunity. 

We will save democracy. 

As hard as these times have been, I am more optimistic about America today than I have been my whole life. 

Because I see the future that is within our grasp. 

Because I know there is simply nothing beyond our capacity. 

We are the only nation on Earth that has always turned every crisis we have faced into an opportunity. 

The only nation that can be defined by a single word: possibilities. 

So on this night, in our 245