# Llama 2 inference with QLoRA: more approaches

**LLaMA 2** is the second generation of Meta's Large Language Model (LLM). It was trained on 2 trillion tokens over six months.

In this notebook, we explore several approaches to the Kaggle competition. To use LLaMA 2 efficiently, we rely on **QLoRA**, a fine-tuning method that quantizes the model to 4 bits and adds Low-Rank Adapters on top of the quantized weights. This reduces the resources needed for fine-tuning enough to work on a single GPU. By comparing different prompting strategies, we aim to improve the performance of LLaMA 2 on the competition's task.
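
To make the resource savings concrete, the following back-of-the-envelope sketch estimates the memory needed for the weights at two precisions and the number of trainable parameters one LoRA adapter adds to a square projection matrix. The parameter count and hidden size correspond to the 13B model used later; the rank of 16 is an assumed value for illustration, not taken from this notebook's training configuration, and the estimate ignores activations, quantization constants and optimizer state.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


n_params = 13_000_000_000   # approximate size of Llama 2 13B
hidden = 5120               # hidden size of Llama 2 13B
rank = 16                   # assumed LoRA rank, for illustration only

fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
full = hidden * hidden
adapter = lora_params(hidden, hidden, rank)

print(f"fp16 weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
print(f"one {hidden}x{hidden} projection: {full:,} params, LoRA adapter: {adapter:,} params")
```

Under these assumptions the adapter for one projection is well under 1% of the size of the matrix it adapts, which is why only the adapters need to be trained and stored.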

## Libraries

In [1]:
# Import libraries 
import bitsandbytes as bnb
import datasets
import pandas as pd
import numpy as np
from collections import Counter
from langchain import PromptTemplate
from peft import LoraConfig, get_peft_model
import torch
import logging
import time

# Copy the Python utility functions into the notebook's working directory
!cp ../kaggle_competition_v2/kaggle/input/utils/functions.py .

In [2]:
# Custom utility functions
import functions

## Setup

In [3]:
# Log in to the Hugging Face CLI
# The login is required to download the base Llama models from Hugging Face
hf_token = ''
#!huggingface-cli login --token $hf_token

# Log file (logs.txt) created for the outputs from some specific cells
logger = logging.getLogger()
logger.setLevel(logging.INFO)

FORMAT = '%(asctime)s %(message)s'
logging.basicConfig(format=FORMAT, filename="logs.txt", filemode='a')
logger = logging.getLogger('modelTraining')

## Download llama 2 model

In [4]:
# Current LLM model used for the inference
model_llama = '13b_chat'

# Reference for the current model finetuning
## The finetuning features reference can be found on a csv file within the net directory
## ./model_training_features_specifications.csv
reference = '08'

# Current LLM model path from Hugging Face used for inference
model_name = 'meta-llama/Llama-2-13b-chat-hf'
#model_name = 'lmsys/vicuna-13b-v1.5-16k'

In [5]:
# Load the model in 4-bit format to reduce the computation and memory required
bnb_config = functions.create_bnb_config()

# Loads the specific model from the Hugging Face hub, together with its tokenizer
# It uses GPU if available
model, tokenizer = functions.load_model(model_name, bnb_config)


## Load the base model and merge it with the trained weights

In [6]:
# The weights trained through QLoRA, to be merged with the base model
trained_weights_dir = f"./finetuned_models/llama2/final_checkpoint_{model_llama}_{reference}"

In [7]:
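# Matrix summations
## LoraConfig.from_pretrained reads only the adapter configuration, and get_peft_model would
## attach freshly initialised adapters, so PeftModel.from_pretrained is used to load the trained weights
from peft import PeftModel
print('Section 1')

# Attach the trained QLoRA adapter weights to the base model
model = PeftModel.from_pretrained(model, trained_weights_dir)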

Section 1
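
Conceptually, merging a LoRA adapter into a base weight matrix is the matrix sum W' = W + (alpha / r) · B·A. The toy sketch below illustrates this with plain Python lists. The matrices and the `alpha` and `r` values are made up for illustration, and the helper names are hypothetical rather than part of `peft`.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 base weight with a rank-1 adapter (B is 2x1, A is 1x2)
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5],
     [1.0]]

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```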


## Load dataset to evaluate

In [8]:
# Load the competition dataset used for evaluation (train.csv, which includes the answers)
data = pd.read_csv('./datasets/train.csv')
data.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [9]:
# Build the dataframe from which the model prompts will be generated
data_to_test = pd.DataFrame(columns = ['question', 'options'])
# The answer options are concatenated into a single column
data_to_test['id'] = data['id']
data_to_test['question'] = data['prompt']
data_to_test['options'] = "A, " + data['A'].astype(str) + "\nB, " + data['B'].astype(str) + "\nC, " + data['C'].astype(str) + "\nD, " + data['D'].astype(str) + "\nE, " + data['E'].astype(str)
data_to_test.head()

Unnamed: 0,question,options,id
0,Which of the following statements accurately d...,"A, MOND is a theory that reduces the observed ...",0
1,Which of the following is an accurate definiti...,"A, Dynamic scaling refers to the evolution of ...",1
2,Which of the following statements accurately d...,"A, The triskeles symbol was reconstructed as a...",2
3,What is the significance of regularization in ...,"A, Regularizing the mass-energy of an electron...",3
4,Which of the following statements accurately d...,"A, The angular spacing of features in the diff...",4


## Prompt engineering

In [10]:
# Configure instruction message tags
B_INST, E_INST = "[INST]", "[/INST]"
# Configure system message tags
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
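
For reference, the Llama 2 chat models expect the system message wrapped in the `<<SYS>>` tags inside the first `[INST]` block. A minimal sketch of that layout follows. `build_chat_prompt` is a hypothetical helper, not one of the functions in `functions.py`, and the beginning-of-sequence token is left to the tokenizer.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_chat_prompt(system_message: str, user_message: str) -> str:
    """Wrap a system and a user message in the Llama 2 chat tags."""
    return f"{B_INST} {B_SYS}{system_message.strip()}{E_SYS}{user_message.strip()} {E_INST}"


prompt = build_chat_prompt("Answer with a single letter.", "Which option is correct?")
print(prompt)
```

The system prompts in this notebook are assembled differently (they start directly with the `<<SYS>>` tag), so this is only the reference layout.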

### Approximation 0

The system prompt instructs the model to return the 3 options it considers most likely to be correct, ordered from most to least likely, in the format:
* A,B,C
* B,C,A
* C,D,E
* etc

In [11]:
system_prompt = "<s>" + B_SYS + """Assistant will answer a multi choice question by giving 3 and only 3 letters from the options given.
Assistant must separate the letters by comma.
Assistant must give the order of the letters from the most likely correct to the less likely correct.
Assistant will not give explanation in the answer.
Assistant will only use the letters: A,B,C,D or E.
Assistant will not give less than 3 letters for answer.
Assistant must not use special characters in the answer.

Here is a previous conversation between the Assistant and the Question of the user:

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Options:>>
A. viruses
B. protozoa
C. cells
D. gymnosperms
E. mesophilic organisms

<<Assistant:>>
E,C,B
<<End>>

\n<<Question:>> What is the least dangerous radioactive decay

<<Options:>>
A. zeta decay
B. beta decay
C. gamma decay
D. alpha decay
E. all of the above

<<Assistant:>>
D,C,B
<<End>>

\n<<Question:>> What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?

<<Options:>>
A. hurricanes
B. tropical effect
C. muon effect
D. centrifugal effect
E. coriolis effect

<<Assistant:>>
E,C,A
<<End>>

\n<<Question:>> Kilauea in hawaii is the world's most continuously active volcano. very active volcanoes characteristically eject red-hot rocks and lava rather than this?

<<Options:>>
A. carbon and smog
B. smoke and ash
C. greenhouse gases
D. magma
E. fire

<<Assistant:>>
B,E,A
<<End>>""" + E_SYS
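
The per-question user prompt is produced by `functions.human_prompt_without_context`, whose source is not shown here. A hypothetical sketch that mirrors the few-shot layout above might look like the following. The function name `build_user_prompt`, the exact spacing and the assumption that the prompt ends right after the `<<Assistant:>>` tag are all guesses for illustration.

```python
def build_user_prompt(question: str, options: str) -> str:
    """Format one question in the same layout as the few-shot examples (assumed layout)."""
    return (
        f"\n<<Question:>> {question}\n\n"
        f"<<Options:>>\n{options}\n\n"
        "<<Assistant:>>\n"
    )


example = build_user_prompt("Which planet is closest to the Sun?",
                            "A, Venus\nB, Mercury\nC, Mars\nD, Earth\nE, Jupiter")
print(example)
```

Ending the prompt on the `<<Assistant:>>` tag means the model's continuation is the answer itself, which is what the later parsing step relies on.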

### Approximation 1

The system prompt instructs the model to return the single option it considers most likely to be correct, in the format:
* A
* B
* C
* D
* E

In [12]:
system_prompt1 = "<s>" + B_SYS + """Assistant will answer a multi choice question by giving the correct option letter.
Assistant will not give explanation in the answer.
Assistant will only use a letter from the next: A,B,C,D or E.
Assistant must not use special characters in the answer.

Here is a previous conversation between the Assistant and the Question of the user:

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Options:>>
A. viruses
B. protozoa
C. cells
D. gymnosperms
E. mesophilic organisms

<<Assistant:>>
E
<<End>>

\n<<Question:>> What is the least dangerous radioactive decay

<<Options:>>
A. zeta decay
B. beta decay
C. gamma decay
D. alpha decay
E. all of the above

<<Assistant:>>
D
<<End>>

\n<<Question:>> What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?

<<Options:>>
A. hurricanes
B. tropical effect
C. muon effect
D. centrifugal effect
E. coriolis effect

<<Assistant:>>
E
<<End>>

\n<<Question:>> Kilauea in hawaii is the world's most continuously active volcano. very active volcanoes characteristically eject red-hot rocks and lava rather than this?

<<Options:>>
A. carbon and smog
B. smoke and ash
C. greenhouse gases
D. magma
E. fire

<<Assistant:>>
B
<<End>>""" + E_SYS

### Approximation 2

Here each question is paired with a single option, and the system prompt instructs the model to say whether that option is correct or incorrect. This is repeated for each of the 5 options, and the answer has the format:
* correct
* incorrect

In [13]:
system_prompt2 = "<s>" + B_SYS + """Assistant will answer a question by saying whether the answer is correct or not.
Assistant will not give explanation in the answer.
Assistant will only use the options: correct - incorrect
Assistant must not use special characters in the answer.

Here is a previous conversation between the Assistant and the Question of the user:

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Option:>>
viruses

<<Assistant:>>
incorrect
<<End>>

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Option:>>
protozoa

<<Assistant:>>
incorrect
<<End>>

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Option:>>
cells

<<Assistant:>>
incorrect
<<End>>

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Option:>>
gymnosperms

<<Assistant:>>
incorrect
<<End>>

\n<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Option:>>
mesophilic organisms

<<Assistant:>>
correct
<<End>>""" + E_SYS

## A sample test

### Approximation 0

The system prompt instructs the model to return the 3 options it considers most likely to be correct, ordered from most to least likely, in the format:
* A,B,C
* B,C,A
* C,D,E
* etc

In [14]:
# Specify the model input as the concatenation of both prompts
text = system_prompt + functions.human_prompt_without_context(data_to_test = data_to_test, n = 0)

# Specify device, GPU if available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tokenize input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Get answer
# Adjust max_new_tokens variable to 10 (maximum number of tokens the model can generate to answer the input)
outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

# Decode output & print it
output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output)

<<SYS>>
Assistant will answer a multi choice question by giving 3 and only 3 letters from the options given.
Assistant must separate the letters by comma.
Assistant must give the order of the letters from the most likely correct to the less likely correct.
Assistant will not give explanation in the answer.
Assistant will only use the letters: A,B,C,D or E.
Assistant will not give less than 3 letters for answer.
Assistant must not use special characters in the answer.

Here is a previous conversation between the Assistant and the Question of the user:


<<Question:>> What type of organism is commonly used in preparation of foods such as cheese and yogurt

<<Options:>>
A. viruses
B. protozoa
C. cells
D. gymnosperms
E. mesophilic organisms

<<Assistant:>>
E,C,B
<<End>>


<<Question:>> What is the least dangerous radioactive decay

<<Options:>>
A. zeta decay
B. beta decay
C. gamma decay
D. alpha decay
E. all of the above

<<Assistant:>>
D,C,B
<<End>>


<<Question:>> What phenomenon makes g

### Approximation 1

The system prompt instructs the model to return the single option it considers most likely to be correct, in the format:
* A
* B
* C
* D
* E

In [15]:
# Run the same sample several times and count how often each option is returned.
# The most frequent option goes in first position, and the remaining options are sorted
# by number of appearances.
results = []
repetitions = 10

for i in range(repetitions):
    # Specify the model input as the concatenation of both prompts
    text = system_prompt1 + functions.human_prompt_without_context(data_to_test = data_to_test, n = 0)

    # Specify device, GPU if available
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt").to(device)

    # Get answer
    # Adjust max_new_tokens variable to 6 (maximum number of tokens the model can generate to answer the input)
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=6, pad_token_id=tokenizer.eos_token_id)
    
    # Decode output & append it
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    res = output.replace('</s>', ' ').split(':>>')[-1].strip().split('\n')[0].strip().split('.')[0].strip().split(',')
    
    results.append(res[0])

# Count and sort the options by the number of appearances & print it
sorted_counter = dict(sorted(Counter(results).items(), key = lambda x: x[1], reverse = True))
print(sorted_counter)

{'D': 7, 'B': 2, 'A': 1}
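
The vote counting above can be turned into a ranked top-3 prediction with `Counter.most_common`. This is a minimal sketch: `rank_by_votes` is a hypothetical helper, and the ordering of ties (by first appearance) is a property of `Counter` rather than a deliberate choice.

```python
from collections import Counter


def rank_by_votes(votes, k=3):
    """Return up to k distinct answers, most frequent first."""
    return [answer for answer, _ in Counter(votes).most_common(k)]


votes = ["D", "B", "D", "D", "A", "D", "B", "D", "D", "D"]
ranked = rank_by_votes(votes)
print(ranked)
```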


In [16]:
# Get the answered options as a formatted list
res = output.replace('</s>', ' ').split(':>>')[-1].strip().split('\n')[0].strip().split('.')[0].strip().split(',')
print(res)

['D']
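
The chained `split` calls are compact but hard to read, and they can break if the model adds extra text. One alternative, shown here as a sketch rather than as the notebook's method, is to take the text after the last `<<Assistant:>>` tag and extract the standalone option letters from its first line with a regular expression. `parse_answer` is a hypothetical name.

```python
import re


def parse_answer(decoded: str, tag: str = "<<Assistant:>>") -> list:
    """Extract standalone option letters A-E from the first line after the last tag."""
    tail = decoded.rsplit(tag, 1)[-1].strip()
    first_line = tail.splitlines()[0] if tail else ""
    return re.findall(r"\b[A-E]\b", first_line)


decoded = "<<Assistant:>>\nE,C,B\n<<End>>\n\n<<Question:>> Q\n<<Assistant:>>\nD, A ,B\n<<End>>"
parsed = parse_answer(decoded)
print(parsed)
```

Because the pattern requires word boundaries on both sides, letters inside longer words (for example the `E` in `<<End>>`) are not picked up.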


### Approximation 2

Here each question is paired with a single option, and the system prompt instructs the model to say whether that option is correct or incorrect. This is repeated for each of the 5 options, and the answer has the format:
* correct
* incorrect

In [17]:
# Query the model once per option of a single sample, asking whether that option is the correct answer to the question.
# After a number of iterations per question-option pair, the options are sorted by their number of 'correct' verdicts
# and given as the final output for the sample.
results = []

for i in ['A', 'B', 'C', 'D', 'E']:
    # Specify the model input as the concatenation of both prompts
    text = system_prompt2 + functions.human_prompt_one_option_answer(data = data, opt = i, n = 0)

    # Specify device, GPU if available
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt").to(device)

    # Get answer
    # Adjust max_new_tokens variable to 8 (maximum number of tokens the model can generate to answer the input)
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=8, pad_token_id=tokenizer.eos_token_id)

    # Decode output & append it
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    res = output.replace('</s>', ' ').split(':>>')[-1].strip().split('\n')[0].strip().split('.')[0].strip().split(',')
    
    results.append(res[0])

print(results)

['correct', 'incorrect', 'incorrect', 'correct', 'incorrect']
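
To turn per-option verdicts into a ranked prediction, one possibility is to count how often each option is judged `correct` over several repetitions and keep the three options with the most votes. The sketch below assumes the verdicts are collected in a dictionary keyed by option letter. `rank_options` and that data layout are hypothetical; the logic actually used lives in `functions.create_correct_incorrect_questions`, which is not shown here.

```python
def rank_options(verdicts, k=3):
    """Rank option letters by their number of 'correct' verdicts, highest first."""
    scores = {opt: sum(v == "correct" for v in vs) for opt, vs in verdicts.items()}
    # sorted() is stable, so options with equal scores keep their original order
    return sorted(scores, key=scores.get, reverse=True)[:k]


verdicts = {
    "A": ["correct", "incorrect", "correct"],
    "B": ["incorrect", "incorrect", "incorrect"],
    "C": ["incorrect", "incorrect", "correct"],
    "D": ["correct", "correct", "correct"],
    "E": ["incorrect", "incorrect", "incorrect"],
}
top3 = rank_options(verdicts)
print(top3)
```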


## Inference calculations

### Approximation 0

The system prompt instructs the model to return the 3 options it considers most likely to be correct, ordered from most to least likely, in the format:
* A,B,C
* B,C,A
* C,D,E
* etc

In [18]:
# Variable initialisation for the loop over the dataset test rows
ans_temp = []
device = "cuda:0"
results = []
time_start = time.time()
repetitions = []

# The loop is run over all the rows of the dataset and saves the answers it gives for a later metric evaluation
for i in (range(data_to_test.shape[0])):
#for i in range(5):
    t = []
    tmp = data_to_test.iloc[i]
    t.append(tmp['id'])
    
    # Specify device
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    
    # Call the recursive function that uses the model
    res, answer_list, c = functions.recursive_inference_1(tokenizer, device, model, system_prompt, 
                                                          functions.human_prompt_without_context(data_to_test = data_to_test, n = i),
                                                          max_repetitions = 1)
    
    # Save the whole results given by the model
    results.append((i, res))
    
    # Save the number of repetitions the model needed to give the 3 answers
    repetitions.append(c)
    
    # Check whether the characters are valid options; report and drop the invalid ones
    # (a filtered copy is built because deleting while iterating can skip elements)
    for a in answer_list:
        if a not in ['A', 'B', 'C', 'D', 'E']:
            print(f"{i}:{a}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t")
    answer_list = [a for a in answer_list if a in ['A', 'B', 'C', 'D', 'E']]
    
    # Fill with '-' if the model gave none of the 3 expected options
    if len(answer_list) != 0:
        t.extend(answer_list)
    else:
        t.append('-')
        t.append('-')
        t.append('-')
    
    ans_temp.append(t)

    # Progress percentage, relative to the number of rows actually processed
    print(f'% Completed: {np.round(100 * (i + 1)/data_to_test.shape[0], 1)}\t\t\t\t', end = '\r')

# The time duration is calculated
time_end = time.time()
print(f'Elapsed time: {np.round((time_end - time_start)/60, 3)} min')

Elapsed time: 0.058 min


In [19]:
# Real number of times the model was used
print(f'Number repetitions {sum(repetitions) + 200}')

# Sample with maximum repetitions
print(f'Max sample repetitions {max(repetitions) + 1}')

Number repetitions 200
Max sample repetitions 1
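
Rows with fewer than three valid letters would not line up with the three prediction columns used later in the notebook. A small helper, sketched here under the assumption that each row should hold exactly three entries, validates, de-duplicates and pads in one step. `normalize_predictions` is a hypothetical name and is not part of `functions.py`.

```python
VALID = ("A", "B", "C", "D", "E")


def normalize_predictions(answers, k=3, filler="-"):
    """Keep valid, non-repeated option letters in order, then pad or cut to k entries."""
    cleaned = []
    for a in answers:
        if a in VALID and a not in cleaned:
            cleaned.append(a)
    return (cleaned + [filler] * k)[:k]


row = normalize_predictions(["D", "x", "A"])
print(row)
```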


### Approximation 1

The system prompt instructs the model to return the single option it considers most likely to be correct, in the format:
* A
* B
* C
* D
* E

In [20]:
# Variable initialisation for the loop over the dataset test rows
ans_temp = []
device = "cuda:0"
results = []
time_start = time.time()
repetitions = 10

# The loop is run over all the rows of the dataset and saves the answers it gives for a later metric evaluation
for i in range(data_to_test.shape[0]):
#for i in range(5):
    t = []
    tmp = data_to_test.iloc[i]
    t.append(tmp['id'])
    
    # Specify device
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    
    # Call the recursive function that uses the model
    res, answer_list = functions.get_most_likely_answers(tokenizer, device, model, system_prompt1,
                                                         functions.human_prompt_without_context(data_to_test = data_to_test, n = i),
                                                         repetitions = 1)
    
    # Save the whole results given by the model
    results.append((i, res))
    
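    # Check whether the characters are valid options; report and drop the invalid ones
    # (a filtered copy is built because deleting while iterating can skip elements)
    for a in answer_list:
        if a not in ['A', 'B', 'C', 'D', 'E']:
            print(f"{i}:{a}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t")
    answer_list = [a for a in answer_list if a in ['A', 'B', 'C', 'D', 'E']]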
    
    # Fill with '-' if the model gave none of the 3 expected options
    if len(answer_list) != 0:
        t.extend(answer_list)
    else:
        t.append('-')
        t.append('-')
        t.append('-')
    
    ans_temp.append(t)

    # Progress percentage, relative to the number of rows actually processed
    print(f'% Completed: {np.round(100 * (i + 1)/data_to_test.shape[0], 1)}\t\t\t\t', end = '\r')

# The time duration is calculated
time_end = time.time()
print(f'Elapsed time: {np.round((time_end - time_start)/60, 3)} min')

Elapsed time: 0.054 min


### Approximation 2

Here each question is paired with a single option, and the system prompt instructs the model to say whether that option is correct or incorrect. This is repeated for each of the 5 options, and the answer has the format:
* correct
* incorrect

In [21]:
# Variable initialisation for the loop over the dataset test rows
ans_temp = []
device = "cuda:0"
time_start = time.time()

# The loop is run over all the rows of the dataset and saves the answers it gives for a later metric evaluation
for i in range(data.shape[0]):
#for i in range(5):
    t = []
    tmp = data.iloc[i]
    t.append(tmp['id'])
    
    # Specify device
    device = torch.device(device if torch.cuda.is_available() else "cpu")
    
    # Call the recursive function that uses the model
    answer_list = functions.create_correct_incorrect_questions(tokenizer, device, model, system_prompt2, data, i)
    
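    # Check whether the characters are valid options; report and drop the invalid ones
    # (a filtered copy is built because deleting while iterating can skip elements)
    for a in answer_list:
        if a not in ['A', 'B', 'C', 'D', 'E']:
            print(f"{i}:{a}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t")
    answer_list = [a for a in answer_list if a in ['A', 'B', 'C', 'D', 'E']]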
    
    # Fill with '-' if the model gave none of the 3 expected options
    if len(answer_list) != 0:
        t.extend(answer_list)
    else:
        t.append('-')
        t.append('-')
        t.append('-')
    
    ans_temp.append(t)

    # Progress percentage, relative to the number of rows actually processed
    print(f'% Completed: {np.round(100 * (i + 1)/data.shape[0], 1)}\t\t\t\t', end = '\r')

# The time duration is calculated
time_end = time.time()
print(f'Elapsed time: {np.round((time_end - time_start)/60, 3)} min')

Elapsed time: 0.563 min


## Output csv

In [22]:
# Save the complete answer for each result, tagged with the respective reference and model specifications
results = pd.DataFrame(results, columns=['id', 'text'])
results.to_csv(f'./outputs/complete_results_llama2_{model_llama}_{reference}.csv', index=False)

In [23]:
# Show some of the answers given by the model
ans_temp[:10]

[[0, 'D', 'A']]

In [26]:
# Save the predictions into a Pandas dataframe
ans = pd.DataFrame(ans_temp, columns=['id', 'prediction1', 'prediction2', 'prediction3'])#, 'prediction4', 'prediction5'])
#ans.drop(['prediction4', 'prediction5'], inplace=True, axis=1)
ans.fillna('-', inplace=True)
ans.head()

Unnamed: 0,prediction1,prediction2,prediction3
0,0,D,A


In [27]:
# Check how many rows are still missing a third prediction
ans[ans['prediction3'] == '-']

Unnamed: 0,prediction1,prediction2,prediction3


In [28]:
# Save the predictions in the correct specified output format
cols = ['prediction1', 'prediction2', 'prediction3']
ans['prediction'] = ans[cols].apply(lambda x: ' '.join(x.values.astype(str)), axis=1)

In [29]:
# The output directory in which the results ready to submit will be saved
output_path = f'./outputs/submission_llama2_{model_llama}_{reference}.csv'

In [30]:
# Export the Pandas dataframe to csv file
cols_to_delete = ['prediction1', 'prediction2', 'prediction3']
ans.drop(cols_to_delete, axis=1, inplace=True)
ans.to_csv(output_path, index=False)

## Metric evaluation

In [32]:
%%capture cap --no-stderr
# Capture the output of this cell so it can be written to the logs.txt file

# Read the output and use the metric calculation available at the functions.py file (mapk)
df = pd.read_csv(output_path)
answer_df = pd.read_csv('datasets/train.csv')
answer = answer_df['answer'].tolist()
df['prediction'] = df['prediction'].str.split()
prediction= df['prediction'].tolist()
res = functions.mapk(answer, prediction, 3)
print(res)

In [33]:
# Write the metric captured in the previous cell to the logs.txt file and print it
logger.info(f'\nMetric result for model {model_llama} with reference {reference}: %s', cap.stdout)
print(cap.stdout)

0.5
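
`functions.mapk` is not shown in this notebook; with `k = 3` it presumably computes mean average precision at 3. The sketch below is an assumed equivalent for the case of exactly one correct answer per question, where the score for a question is 1/rank of the correct letter within the first three predictions, or 0 if it is absent. The name `map_at_k` and the sample data are made up for illustration.

```python
def map_at_k(answers, predictions, k=3):
    """Mean average precision at k when each question has a single correct answer."""
    total = 0.0
    for truth, ranked in zip(answers, predictions):
        for rank, guess in enumerate(ranked[:k], start=1):
            if guess == truth:
                total += 1.0 / rank
                break
    return total / len(answers) if answers else 0.0


answers = ["D", "A", "C"]
predictions = [["D", "A", "B"], ["B", "A", "C"], ["A", "B", "E"]]
score = map_at_k(answers, predictions)
print(score)
```

In this toy example the first question scores 1, the second 1/2 and the third 0, so the mean is 0.5.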



In [37]:
# Check some whole answers given by the model
i = 2
print(results.iloc[i]['text'][-80:])
print('-------------------------------------')
ans.iloc[i]

-------------------------------------


In [38]:
# Check the log file lines
with open('logs.txt') as f:
    lines = f.readlines()
lines[-10:]

['2023-10-11 21:55:39,162 Load pretrained SentenceTransformer: ./kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2\n',
 '2023-10-12 20:20:37,336 \n',
 'Metric result for model 13b_chat with reference 08: 0.7458333333333335\n',
 '\n',
 '2023-10-12 20:56:25,200 \n',
 'Metric result for model 13b_chat with reference 08: 0.7283333333333333\n',
 '\n',
 '2023-10-12 23:57:16,447 \n',
 'Metric result for model 13b_chat with reference 08: 0.5\n',
 '\n']