In [1]:
from google.colab import drive

drive.mount("/content/drive")
DRIVE_ROOT = "/content/drive/MyDrive"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from huggingface_hub import login

login("****")

# Introduction

## Objective

Use Llama 2, LangChain and ChromaDB to build a Retrieval Augmented Generation (RAG) system. This lets us ask questions about our own documents (which were not part of the training data) without fine-tuning the large language model (LLM). With RAG, a question first triggers a retrieval step that fetches the relevant documents from a vector database where they were indexed; the retrieved text is then passed to the LLM together with the question.

## Definitions

- LLM - Large Language Model
- Llama 2 - an LLM from Meta
- LangChain - a framework designed to simplify building applications with LLMs
- Vector database - a database that stores and retrieves data as high-dimensional vectors
- ChromaDB - vector database
- RAG - Retrieval Augmented Generation

## Model details

- Model: Llama 2
- Variation: 7b-chat-hf (7b: 7 billion parameters, chat: fine-tuned for dialogue, hf: Hugging Face Transformers format)
- Version: V1
- Framework: PyTorch

Llama 2 models are pretrained on 2 trillion tokens and come in sizes from 7 to 70 billion parameters, which makes them among the most capable openly available models. They are a significant improvement over Llama 1.

## What is a Retrieval Augmented Generation (RAG) system?

LLMs have proven their ability to understand context and give accurate answers across NLP tasks such as summarization and Q&A when prompted. They answer well when the question concerns information they were trained on, but they tend to hallucinate when the topic is something they do "not know", i.e. information that was not in their training data. Retrieval Augmented Generation addresses this by combining external resources with the LLM. A RAG system therefore has two main components: a retriever and a generator.

The retriever encodes our data so that the relevant parts can be fetched easily at query time. The encoding is done with text embeddings, i.e. a model trained to produce vector representations of text. A vector database is a natural choice for implementing the retriever, and there are many options, both open source and commercial, for example ChromaDB, Milvus, FAISS, Pinecone and Weaviate. In this notebook we use a local, persistent instance of ChromaDB.
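To make the retrieval idea concrete, here is a minimal sketch that ranks text chunks by cosine similarity to a query. It is an illustration under loud assumptions: the word-count `embed` function is a toy stand-in for a neural embedding model, the in-memory list stands in for a vector database, and the example chunks are invented for the demo.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, invented for this illustration.
chunks = [
    "The economy added new jobs and unemployment fell.",
    "We must protect our troops and support veterans.",
    "Inflation is coming down and the economy is growing.",
]
top = retrieve("How is the economy doing?", chunks, k=2)
print(top)
```

A real retriever replaces `embed` with a learned model, so that semantically related text scores highly even when it shares no words with the query.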

For the generator, the obvious choice is an LLM. In this notebook we use the Llama 2 7B chat model from the Hugging Face Hub, quantized to 4 bits at load time with `bitsandbytes`.

The retriever and the generator are orchestrated with LangChain. Its `RetrievalQA.from_chain_type` function builds the retriever-generator chain in a single call.

# Installations, imports, utils

In [3]:
!pip install langchain_community bitsandbytes accelerate chromadb



In [4]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [5]:
model_id = "meta-llama/Llama-2-7b-chat-hf"

device = f"cuda:{cuda.current_device()}" if cuda.is_available() else "cpu"

# set quantization configuration to load large model with less GPU memory
# this requires the 'bitsandbytes' library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16,
)
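As a rough illustration of why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 7 billion parameters and ignores quantization constants, activations and the KV cache, so real usage will be higher. `weight_memory_gb` is a hypothetical helper written for this example, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal "7B"; the exact parameter count differs slightly

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink from about 14 GB to about 3.5 GB, which is what lets the model fit on a single modest GPU.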

Prepare the model and the tokenizer.

In [6]:
time_1 = time()

model_config = AutoConfig.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2 - time_1, 3)} sec.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Prepare model, tokenizer: 10.295 sec.


Define the query pipeline.

In [7]:
time_1 = time()
query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)
time_2 = time()
print(f"Prepare pipeline: {round(time_2 - time_1, 3)} sec.")

Device set to use cuda:0


Prepare pipeline: 0.058 sec.


In [8]:
def test_model(tokenizer, pipeline, prompt_to_test):
    """
    Perform a query
    print the result

    Args:
        tokenizer: the tokenizer
        pipeline: the pipeline
        prompt_to_test: the prompt
    Return:
        None
    """
    time_1 = time()
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,  # at each generation step, sample from the 10 most probable tokens
        num_return_sequences=1,  # return a single generated text per call
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,
        truncation=True,
    )
    time_2 = time()
    print(f"Test inference: {round(time_2 - time_1, 3)} sec.")
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")
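To make the `top_k=10` setting more concrete, here is a minimal pure-Python sketch of top-k sampling: keep the k highest-scoring tokens, apply a softmax over them, and draw one at random. The vocabulary and logits are made up and `sample_top_k` is a hypothetical helper; the real pipeline does the equivalent on tensors over the full vocabulary.

```python
import math
import random


def sample_top_k(logits: dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k highest-scoring candidates."""
    # Keep only the k tokens with the largest logits.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Softmax over the survivors (subtract the max for numerical stability).
    max_score = max(score for _, score in top)
    weights = [math.exp(score - max_score) for _, score in top]
    tokens = [tok for tok, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented logits for illustration only.
logits = {"the": 2.0, "a": 1.5, "union": 0.3, "zebra": -4.0}
rng = random.Random(0)  # fixed seed so the demo is repeatable
samples = {sample_top_k(logits, k=2, rng=rng) for _ in range(200)}
print(samples)
```

With `k=2`, only `"the"` and `"a"` can ever be drawn; low-probability tokens such as `"zebra"` are excluded no matter how many samples are taken.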

## Test the query pipeline

We test the pipeline by asking for a definition of the State of the Union address.

In [9]:
test_model(tokenizer,
           query_pipeline,
           "Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")

Test inference: (4, 3) sec.
Result: Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.

The State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, in which the President reports on the current state of the union and outlines legislative priorities for the upcoming year.


# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline

We check the model through LangChain's `HuggingFacePipeline` wrapper, using the same question about the State of the Union (SOTU) address.

In [10]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
llm.invoke("Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.")


'Please explain what is the State of the Union address. Give just a definition. Keep it in 100 words.\n\nThe State of the Union address is an annual speech given by the President of the United States to a joint session of Congress, typically in January, in which the President reviews the current state of the nation, highlights accomplishments and challenges, and proposes policies and legislation for the upcoming year.'
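The returned string starts with the prompt itself, because the text-generation pipeline returns the prompt followed by the completion. One way to keep only the answer is a small post-processing step, sketched below. `strip_prompt` is a hypothetical helper written for this example, and the strings in the demo are invented.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove an echoed prompt from the start of generated text, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated.strip()


# Invented strings for illustration.
prompt = "Define RAG in one sentence."
generated = prompt + "\n\nRAG combines retrieval with text generation."
print(strip_prompt(generated, prompt))
```

The Transformers text-generation pipeline also accepts `return_full_text=False`, which may be the simpler route when calling the pipeline directly.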

## Ingestion of data using Text loader

We will ingest the 2023 State of the Union address, delivered in February 2023.

In [12]:
loader = TextLoader("/content/drive/MyDrive/Colab Notebooks/input/RAG using Llama 2, Langchain and ChromaDB/biden-sotu-2023-planned-official.txt", encoding="utf8")
documents = loader.load()

## Split data in chunks

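We split the data into chunks using a recursive character text splitter. Each chunk holds up to 1000 characters, with a 20-character overlap between neighbouring chunks.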

In [13]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(documents)
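The effect of `chunk_size` and `chunk_overlap` can be sketched with a simplified fixed-width splitter. This is a stand-in under stated assumptions, not the library's algorithm: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph breaks, newlines and spaces rather than at arbitrary character positions. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbours, which reduces the chance that the retriever misses it.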

## Creating Embeddings and Storing in Vector Store

Create the embeddings with the `all-mpnet-base-v2` Sentence Transformers model, loaded through LangChain's `HuggingFaceEmbeddings` wrapper.

In [14]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": device}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs
)


Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [15]:
vectordb = Chroma.from_documents(
    documents=all_splits,
    embedding=embeddings,
    persist_directory="/content/drive/MyDrive/Colab Notebooks/input/RAG using Llama 2, Langchain and ChromaDB/chroma_db",
)

## Initialize chain

In [16]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
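With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt ahead of the question. The sketch below imitates that step in plain Python. The instruction sentence is the one visible in this notebook's chain output; the rest of the template layout (`Question:` / `Helpful Answer:`) and the `build_stuff_prompt` helper are assumptions made for illustration, not LangChain's actual code.

```python
# Assumed template layout; only the first sentence is taken from the notebook output.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and insert them, with the question, into the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


# Invented chunks for illustration.
demo_chunks = ["First retrieved passage.", "Second retrieved passage."]
demo_prompt = build_stuff_prompt(demo_chunks, "What is discussed?")
print(demo_prompt)
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, so the number and size of retrieved chunks matter.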

## Test the Retrieval-Augmented Generation

We define a test function that runs the query and times it.

In [18]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.invoke({"query": query})["result"]
    time_2 = time()
    print(f"Inference time: {round(time_2 - time_1, 3)} sec.")
    print("\nResult:", result)

In [19]:
query = "What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What were the main topics in the State of the Union in 2023? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 9.995 sec.

Result: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

on the state of the union. And here is my report. Because the soul of this nation is strong, because the backbone of this nation is strong, because the people of this nation are strong, the State of the Union is strong. As I stand here tonight, I have never been more optimistic about the future of America. We just have to remember who we are. We are the United States of America and there is nothing, nothingbeyond our capacity if we do it together. May God bless you all. May God protect our troops.

peace,not just in Europe, but everywhere. Before I came to office, the story was about how the People’s Republic of China was increasing its power and America was falling in the world. Not anymore. I’ve made clear with President Xi that we seek competition, not conflic

In [20]:
query = "What is the nation economic status? Summarize. Keep it under 200 words."
test_rag(qa, query)

Query: What is the nation economic status? Summarize. Keep it under 200 words.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 8.457 sec.

Result: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

forward. Of never giving up. A story that is unique among all nations. We are the only country that has emerged from every crisis stronger than when we entered it. That is what we are doing again. Two years ago, our economy was reeling. As I stand here tonight, we have created a record 12 million new jobs, more jobs created in two years than any president has ever created in four years. Two years ago, COVID had shut down our businesses, closed our schools, and robbed us of so much. Today, COVID no longer controls our lives. And two years ago, our democracy faced its greatest threat since the Civil War. Today, though bruised, our democracy 

## Document sources

Let's check the document sources for the last query.

In [22]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Each retrieved Document exposes its metadata and text directly.
    print(f"Source: {doc.metadata['source']}")
    print(f"Text: {doc.page_content}\n")

Query: What is the nation economic status? Summarize. Keep it under 200 words.
Retrieved documents: 4
Source: /content/drive/MyDrive/Colab Notebooks/input/RAG using Llama 2, Langchain and ChromaDB/biden-sotu-2023-planned-official.txt
Text: forward. Of never giving up. A story that is unique among all nations. We are the only country that has emerged from every crisis stronger than when we entered it. That is what we are doing again. Two years ago, our economy was reeling. As I stand here tonight, we have created a record 12 million new jobs, more jobs created in two years than any president has ever created in four years. Two years ago, COVID had shut down our businesses, closed our schools, and robbed us of so much. Today, COVID no longer controls our lives. And two years ago, our democracy faced its greatest threat since the Civil War. Today, though bruised, our democracy remains unbowed and unbroken. As we gather here tonight, we are writing the next chapter in the great American st

# Conclusions

We used LangChain, ChromaDB and Llama 2 as the LLM to build a Retrieval Augmented Generation solution. For testing, we used the State of the Union address from February 2023.