# Llama 2 with retrieval augmented generation (RAG)

**Retrieval-Augmented Generation (RAG)** has emerged as a transformative technique. By integrating external knowledge retrieval mechanisms with language generation, RAG enables models to produce more accurate and contextually relevant responses. This approach is especially beneficial when addressing intricate scientific inquiries that require up-to-date information and nuanced understanding.

The foundation of RAG lies in its ability to dynamically access a large corpus of information, retrieving relevant documents or snippets that enhance the model's responses. This process bridges the gap between the static knowledge embedded within a pre-trained model and the dynamic nature of user queries. As a result, RAG significantly enhances the effectiveness of language models in real-world applications, particularly in fields such as science, where the complexity of questions often exceeds the pre-existing knowledge of the model.

To exemplify the application of RAG, we utilize the **LLaMA 2** models released by [Meta](https://about.meta.com/) on July 18, 2023. Trained on an extensive dataset, LLaMA 2 includes various sizes—7B, 13B parameters—providing a powerful foundation for fine-tuning. 

For fine-tuning these models with RAG, we employ **[QLoRA (Efficient Finetuning of Quantized LLMs)](https://arxiv.org/abs/2305.14314)**. This innovative method quantizes pretrained models to just 4 bits and incorporates Low-Rank Adapters, allowing us to fine-tune LLaMA 2 efficiently on a single GPU. The process is supported by the **[PEFT library](https://huggingface.co/docs/peft/)**, facilitating effective model adaptation.

### Importance of RAG

RAG not only improves the accuracy of generated responses but also enhances the model's ability to handle a wider variety of questions by leveraging external knowledge sources. This makes it a crucial component in the development of intelligent systems capable of engaging in complex discussions and providing informative answers in real-time.

## Libraries

In [1]:
# Import libraries 
import bitsandbytes as bnb
import datasets
import pandas as pd
import numpy as np
from collections import Counter
from peft import get_peft_model, LoraConfig
import torch
import logging
import time

# Add utils Python function to the notebook
!cp ../kaggle_competition_v2/kaggle/input/utils/functions.py .

In [2]:
# Utils built
import functions

## Setup

In [3]:
## Accessing to Hugging Face cli
# This login is necessary to download the base Llama LLM models from Hugging Face 
hf_token = ''
#!huggingface-cli login --token $hf_token


# Log file (logs.txt) created for the outputs from some specific cells
logger = logging.getLogger()
logger.setLevel(logging.INFO)

FORMAT = '%(asctime)s %(message)s'
logging.basicConfig(format=FORMAT, filename="logs.txt", filemode='a')
logger = logging.getLogger('modelTraining')

## Load the base model

In [4]:
# Current LLM model path from Hugging Face used for the inference
model_name = 'meta-llama/Llama-2-13b-chat-hf'
#model_name = 'lmsys/vicuna-13b-v1.5-16k'

In [5]:
# Load the model in a 4 bit format to optimise the computation and store memory required
bnb_config = functions.create_bnb_config()

# Loads the specific model from the Hugging Face hub, together with its tokenizer
# It uses GPU if available
model, tokenizer = functions.load_model(model_name, bnb_config)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



## Merge model with the trained weights

In [6]:
# Current LLM model used for the inference
model_llama = '13b_chat'

# Reference for the current model finetuning
## The finetuning features reference can be found on a csv file within the next directory:
## ./model_training_features_specifications.csv
reference = '08'

In [7]:
# The weights correspond to the finetuning performed through the finetuning_example_0.ipynb notebook.
## This finetuning is performed through Q-LORA on a percentage of the total parameters of the model
trained_weights_dir = f"./finetuned_models/final_checkpoint_{model_llama}_{reference}"

# Matrix sumations
## Import the trained LORA weights
lora_config = LoraConfig.from_pretrained(trained_weights_dir)
print('Section 1')

# Merge the trained weights with the base model
model = get_peft_model(model, lora_config)

Section 1


## Load dataset to evaluate

In [8]:
# Load the test dataset given by the competition
data = pd.read_csv('./datasets/train.csv')
data.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


## Perform retrieval augmented generation (RAG)

In [9]:
# Perform the function to get the context for each question based on sentence transformers
## The contexts are taken from a dataset of 270K documents from Wikipedia
data_with_context = functions.get_contexts()
#data_with_context = pd.read_csv('./context/test_context.csv')

# Depict the dataset in which now the context is present
data_with_context.head()

Unnamed: 0,prompt,context,A,B,C,D,E
0,Which of the following statements accurately d...,The MOND type behavior is suppressed in this r...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,Which of the following is an accurate definiti...,Many of these systems evolve in a self-similar...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,Which of the following statements accurately d...,It is possible that this usage is related with...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,What is the significance of regularization in ...,Renormalization is distinct from regularizatio...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,Which of the following statements accurately d...,Several qualitative observations can be made o...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [10]:
# Format the dataframe aimed to be built as prompts for the model
data_to_test = pd.DataFrame(columns = ['context', 'question', 'options'])
# The context is added to the prompt.
data_to_test['context'] = data_with_context['context']
data_to_test['question'] = data_with_context['prompt']
data_to_test['options'] = "A, " + data_with_context['A'].astype(str) + "\nB, " + data_with_context['B'].astype(str) + "\nC, " + data_with_context['C'].astype(str) + "\nD, " + data_with_context['D'].astype(str) + "\nE, " + data_with_context['E'].astype(str)
data_to_test.head()

Unnamed: 0,context,question,options
0,The MOND type behavior is suppressed in this r...,Which of the following statements accurately d...,"A, MOND is a theory that reduces the observed ..."
1,Many of these systems evolve in a self-similar...,Which of the following is an accurate definiti...,"A, Dynamic scaling refers to the evolution of ..."
2,It is possible that this usage is related with...,Which of the following statements accurately d...,"A, The triskeles symbol was reconstructed as a..."
3,Renormalization is distinct from regularizatio...,What is the significance of regularization in ...,"A, Regularizing the mass-energy of an electron..."
4,Several qualitative observations can be made o...,Which of the following statements accurately d...,"A, The angular spacing of features in the diff..."


## Prompt engineering

In [11]:
# Configure instruction message tags
B_INST, E_INST = "[INST]", "[/INST]"
# Configure system message tags
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

### Option 0

In [12]:
# The system prompt is created in a way that gives a pretty clear instruction an sets a 
# very narrow possibilities for the model on how it must answer.
## The system prompt contains examples that depicts how the answer has to be.
system_prompt = "<s>" + B_SYS + """Assistant will answer a multi choice question based on the context by giving 3 and only 3 letters from the options given.
Assistant will answer the question with its own knowledge in case the answer is not in the context.
Assistant must separate the letters by comma.
Assistant must give the order of the letters from the most likely correct to the less likely correct.
Assistant will only use the letters: A,B,C,D or E.

Here is a previous conversation between the Assistant and the Question and Context of the user:

<<Context:>>
Viruses are microscopic infectious agents composed of genetic material enclosed in a protein coat, requiring
a host cell to replicate. Protozoa, on the other hand, are single-celled eukaryotic microorganisms found in 
aquatic environments, with some being parasitic and causing diseases. Cells, as the fundamental units of life, 
possess a cell membrane, genetic material, and the ability to perform essential life processes. Gymnosperms, a 
group of seed-producing plants, produce "naked seeds" on cones and include species like pine trees and cycads. 
Finally, mesophilic organisms, including bacteria and yeasts, thrive at moderate temperatures and are crucial 
in fermentation processes like cheese and yogurt production, where they facilitate the transformation of raw 
materials into dairy products.

<<Question:>>
What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Options:>>
A, viruses
B, protozoa
C, cells
D, gymnosperms
E, mesophilic organisms

<<Assistant:>>
E,C,B
<<End>>

\n<<Context:>>
Various natural phenomena influence Earth's atmospheric and oceanic behaviors. One such phenomenon is hurricanes,
which are intense tropical storms characterized by strong winds and heavy rainfall. They form over warm ocean waters
and can have a significant impact on weather patterns. The concept of the tropical effect refers to the conditions
associated with high temperatures and humidity in tropical regions but is not directly related to global wind
patterns. Similarly, the muon effect and centrifugal effect have specific scientific contexts but do not play a
primary role in determining the direction of global winds. In contrast, the Coriolis effect is the key factor
responsible for the observed northeast to southwest and northwest to southeast movement of global winds in the
Northern and Southern Hemispheres, respectively. This phenomenon is a result of the Earth's rotation and 
significantly influences atmospheric circulation patterns worldwide.

<<Question:>> 
What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?

<<Options:>>
A, hurricanes
B, tropical effect
C, muon effect
D, centrifugal effect
E, coriolis effect

<<Assistant:>>
E,C,A
<<End>>""" + E_SYS

## A sample test

### Option 0

In [13]:
# Specify the model input as the concatenation of both prompts
text = system_prompt + functions.human_prompt(data_to_test = data_to_test, n = 0)

# Specify device, GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tokenize input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Get answer
# Adjust max_new_tokens variable to 10 (maximum number of tokens the model can generate to answer the input)
outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

# Decode output & print it
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output)

<<SYS>>
Assistant will answer a multi choice question based on the context by giving 3 and only 3 letters from the options given.
Assistant will answer the question with its own knowledge in case the answer is not in the context.
Assistant must separate the letters by comma.
Assistant must give the order of the letters from the most likely correct to the less likely correct.
Assistant will only use the letters: A,B,C,D or E.

Here is a previous conversation between the Assistant and the Question and Context of the user:

<<Context:>>
Viruses are microscopic infectious agents composed of genetic material enclosed in a protein coat, requiring
a host cell to replicate. Protozoa, on the other hand, are single-celled eukaryotic microorganisms found in 
aquatic environments, with some being parasitic and causing diseases. Cells, as the fundamental units of life, 
possess a cell membrane, genetic material, and the ability to perform essential life processes. Gymnosperms, a 
group of seed-prod

In [14]:
# Get the answered options in a formated way within a list
res = output.replace('</s>', ' ').split(':>>')[-1].strip().split('\n')[0].strip().split('.')[0].strip().split(',')
print(res)

['D', 'E']


## Inference calculations

### Option 0

In [15]:
# Variable initialisation for the loop over the dataset test rows 
ans_temp = []
device = "cuda:0"
results = []
time_start = time.time()
repetitions = []

# The loop is ran over all the rows of the dataset and saves the answers it gives for a later metric evaluation
for i in (range(data_to_test.shape[0])):
#for i in range(5):
    t = []
    tmp = data_to_test.iloc[i]
    t.append(i)
    
    # Specify device
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    
    # Call the recursive function that uses the model
    res, answer_list, c = functions.recursive_inference_1(tokenizer = tokenizer, device = device, model = model, system_prompt = system_prompt,
                                                          human_prompt = functions.human_prompt(n = i, data_to_test = data_to_test), max_repetitions = 1)
    
    # Save the whole results given by the model
    results.append((i, res))
    
    # Save the number of repetitions for the modelto give the 3 answers
    repetitions.append(c)
    
    # Check wheather the characters are correct or not. If not they are printed
    for j, _ in enumerate(answer_list):
        if answer_list[j] not in ['A', 'B', 'C', 'D', 'E']:
            print(f"{i}:{answer_list[j]}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t")
            del answer_list[j]
    
    if len(answer_list) != 0:
        t.extend(answer_list)
    else:
        # When there is not an answer, the space is filled with a - mark
        t.append('-')
        t.append('-')
        t.append('-')
    
    ans_temp.append(t)

    # Progress percentage
    print(f'% Completed: {np.round((i + 1)/200, 3)}\t\t\t\t', end = '\r')

# The time duration is calculated
time_end = time.time()
print(f'Elapsed time: {np.round((time_end - time_start)/60, 3)} min')

24:0																		
58:Giordano Bruno																
122:32																
Elapsed time: 30.726 min


In [16]:
# Real number of times the model was used
print(f'Number repetitions {sum(repetitions) + 200}')

# Sample with maximum repetitions
print(f'Max sample repetitions {max(repetitions) + 1}')

Number repetitions 324
Max sample repetitions 2


## Output csv

In [17]:
# Save the total answer for each result with the respective reference and model especifications
results = pd.DataFrame(results, columns=['id', 'text'])
results.to_csv(f'./outputs/complete_results_llama2_{model_llama}_{reference}.csv', index=False)

In [18]:
# Show some of the answers given by the model
ans_temp[:10]

[[0, 'D', 'E'],
 [1, 'A', 'B'],
 [2, 'C', 'B', 'A'],
 [3, 'B', 'C'],
 [4, 'D', 'B', 'A'],
 [5, 'B', 'C', 'E'],
 [6, 'A', 'B', 'C'],
 [7, 'D', 'B', 'E'],
 [8, 'C', 'B'],
 [9, 'A', 'B', 'C']]

In [25]:
# Save the predictions into a Pandas dataframe
ans = pd.DataFrame(ans_temp, columns=['id', 'prediction1', 'prediction2', 'prediction3'])#, 'prediction4', 'prediction5'])
#ans.drop(['prediction4'], inplace=True, axis=1)
ans.fillna('-', inplace=True)
ans.head()

Unnamed: 0,id,prediction1,prediction2,prediction3
0,0,D,E,-
1,1,A,B,-
2,2,C,B,A
3,3,B,C,-
4,4,D,B,A


In [26]:
# Check how many predicitions still don't have the correct answer
ans[ans['prediction3'] == '-']

Unnamed: 0,id,prediction1,prediction2,prediction3
0,0,D,E,-
1,1,A,B,-
3,3,B,C,-
8,8,C,B,-
14,14,B,C,-
...,...,...,...,...
191,191,B,C,-
192,192,B,C,-
194,194,B,C,-
196,196,B,C,-


In [27]:
# The output directory in which the results ready to submit will be saved
reference_inference = '08'
output_path = f'./outputs/submission_llama2_{model_llama}_{reference_inference}.csv'

In [28]:
# Save the predictions in the correct specified output format
cols_to_delete = ['prediction1', 'prediction2', 'prediction3']
ans['prediction'] = ans[cols_to_delete].apply(lambda x: ' '.join(x.values.astype(str)), axis=1)
ans.drop(cols_to_delete, axis=1, inplace=True)

# Export the Pandas dataframe to csv file
ans.to_csv(output_path, index=False)

## Metric evaluation

In [29]:
%%capture cap --no-stderr
# Capture the output of this cell into the logs.txt file

# Read the output and use the metric calculation available at the functions.py file (mapk)
df = pd.read_csv(output_path)
answer_df = pd.read_csv('datasets/train.csv')
answer = answer_df['answer'].tolist()
df['prediction'] = df['prediction'].str.split()
prediction= df['prediction'].tolist()
res = functions.mapk(answer, prediction, 3)
# Here all the impressions will be save into the txt log file
print(res)

In [30]:
# Bring from the log.txt file the result from the previous cell
logger.info(f'\nMetric result for model {model_llama} with reference {reference}: %s', cap.stdout)
print(cap.stdout)

0.7283333333333333



In [31]:
# Check some whole answers given by the model 
i = 2
print(results.iloc[i]['text'][-80:])
print('-------------------------------------')
ans.iloc[i]

he island's central location in the Mediterranean.

<<Assistant:>>
C,B,A
<<End>>
-------------------------------------


id                2
prediction    C B A
Name: 2, dtype: object

In [32]:
# Check the log file lines
with open('logs.txt') as f:
    lines = f.readlines()
lines[-30:]

['2023-10-10 16:14:23,126 \n',
 'Metric result for model 13b_chat with reference 087: 0.8483333333333334\n',
 '\n',
 '2023-10-10 16:14:56,008 \n',
 'Metric result for model 13b_chat with reference 086: 0.3055555555555555\n',
 '\n',
 '2023-10-11 16:08:22,825 Load pretrained SentenceTransformer: ./kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2\n',
 '2023-10-11 16:11:44,721 Load pretrained SentenceTransformer: ./kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2\n',
 '2023-10-11 16:12:52,811 Load pretrained SentenceTransformer: ./kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2\n',
 '2023-10-11 16:15:45,329 Load pretrained SentenceTransformer: ./kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2\n',
 '2023-10-11 16:24:54,128 Load pretrained SentenceTransformer: ./kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniL