## LMSYS - Chatbot Arena Human Preference Predictions Using QLoRA Fine-tuned Llama 3

This is one of the notebooks I created for Kaggle competition: [LMSYS - Chatbot Arena Human Preference Predictions](https://www.kaggle.com/competitions/lmsys-chatbot-arena). 

Basically, this competition challenges you to predict which responses users will prefer (or a tie) in a head-to-head battle between chatbots powered by large language models (LLMs). You'll be given a dataset of conversations from the Chatbot Arena, where different LLMs generate answers to user prompts. I didn't formally attend this competition, but I think it is a good opportunity to gain practical experience in LLM fine-tuning as the dataset mainly consists of text data: question, response of model a and response of model b.

I used Llama 3 - 8B as the basic model for classification and fine-tuned the model using QLoRA. QLoRA, stands for, Quantized Low Rank Adaptation, which is a parameter efficient fine tuning method. Since I was fine tuning on a consumer-level GPU (RTX 4090) with my own desktop, I used the smallest model (8B) and leveraged quantization and low rank adaption to save memory usage and maintain an accpetable training speed.

I tried different combination of training parameters and this notebook includes the latest version. It seems that the model still has improvement space. However, to restart the training with new parameters will take a lot of time. I am considering combining the base model and adapter as the new base model and do fine-tuning on the new base model. However, currently, when trying to do this, it seems my GPU is not working. I need to do more research on how to realize it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

from sklearn.metrics import accuracy_score, f1_score, log_loss, confusion_matrix

In [2]:
VERSION = 'v7'

RESULT_PATH = './' + VERSION + '_results/' 
if not os.path.exists(RESULT_PATH):
    os.makedirs(RESULT_PATH)

PATH = '/kaggle/input/lmsys-chatbot-arena/'
PATH = 'C:/Users/zyc71/Data Science Projects/LMSYS - Chatbot Arena Human Preference Predictions/Data/'

PERSONAL_PATH = 'C:/Users/zyc71/Data Science Projects/LMSYS - Chatbot Arena Human Preference Predictions/'

TRAIN_CSV = PATH + 'train.csv'
TEST_CSV = PATH + 'test.csv'
SUBM_CSV = PATH + 'sample_submission.csv'

## Text Data Processing

In [3]:
# Preprocess the text
def preprocess_text(df, cols = ['prompt', 'response_a', 'response_b']):
    for col in cols:
        # Remove the [" and "] that appear at the beginning and the end of text.
        df[col] = df[col].str.replace(pat = r'^(\[")', repl = '', regex = True).str.replace(pat = r'("\])$', repl = '', regex = True)
    
    return df
        
# Caculate length of response
def calculate_length(df):
    df['len_prompt'] = df['prompt'].str.len()
    df['len_a'] = df['response_a'].str.len()
    df['len_b'] = df['response_b'].str.len()
    df['len_diff'] = df['len_a'] - df['len_b']

    return df

# Count the tokens of response
def count_tokens(df):
    df['token_num_prompt'] = df['prompt'].apply(lambda x: len(word_tokenize(x)))
    df['token_num_a'] = df['response_a'].apply(lambda x: len(word_tokenize(x)))
    df['token_num_b'] = df['response_b'].apply(lambda x: len(word_tokenize(x)))
    df['token_num_diff'] = df['token_num_a'] - df['token_num_b']
    
    return df

# Count the sentenses of response
def count_sentenses(df):
    df['sentense_num_prompt'] = df['prompt'].apply(lambda x: len(sent_tokenize(x)))
    df['sentense_num_a'] = df['response_a'].apply(lambda x: len(sent_tokenize(x)))
    df['sentense_num_b'] = df['response_b'].apply(lambda x: len(sent_tokenize(x)))
    df['sentense_num_diff'] = df['sentense_num_a'] - df['sentense_num_b']
    
    return df

 # Create prompts for prediction use
def create_prompt(df):
    df['prompt_for_pred'] = ('Based on user prompt, please find out whether user likes response of Model A or response of Model B or think they are tied.\n\n' 
    + 'User prompt:\n' + df['prompt'] + '\n\n' + 'Model A:\n' + df['response_a'] + '\n\n' + 'Model B:\n' + df['response_b'])
     
    return df

# Create unified target column - 'winner_model_a': 0, 'winner_model_b': 1, 'winner_tie': 2
# NOTICE: This is only applicable to train data
def create_label_column(df):
    df['label'] = df.apply(lambda row: 0 if row['winner_model_a'] == 1 else 1 if row['winner_model_b'] == 1 else 2, axis = 1)

    return df



# Pipe all the processing functions
def pipe_processing_function(df):
    df = (
        df.pipe(preprocess_text)
        .pipe(calculate_length)
        .pipe(count_tokens)
        .pipe(count_sentenses)
        .pipe(create_prompt)
    )

    return df

In [4]:
# %%time

# df_train = pipe_processing_function(df_train)
# df_train = create_label_column(df_train)
# df_train.to_csv(PATH + 'processed_train.csv')

In [5]:
# Load preprocessed data to save time
df_train = pd.read_csv(PATH + 'processed_train.csv')

In [6]:
df_train.head(3)

Unnamed: 0.1,Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie,...,token_num_prompt,token_num_a,token_num_b,token_num_diff,sentense_num_prompt,sentense_num_a,sentense_num_b,sentense_num_diff,prompt_for_pred,label
0,0,30192,gpt-4-1106-preview,gpt-4-0613,Is it morally right to try to have a certain p...,The question of whether it is morally right to...,"As an AI, I don't have personal beliefs or opi...",1,0,0,...,36,827,241,586,3,26,12,14,"Based on user prompt, please find out whether ...",0
1,1,53567,koala-13b,gpt-4-0613,What is the difference between marriage licens...,A marriage license is a legal document that al...,A marriage license and a marriage certificate ...,0,1,0,...,45,591,655,-64,3,15,20,-5,"Based on user prompt, please find out whether ...",1
2,2,65089,gpt-3.5-turbo-0613,mistral-medium,explain function calling. how would you call a...,Function calling is the process of invoking or...,Function calling is the process of invoking a ...,0,0,1,...,11,212,423,-211,2,5,11,-6,"Based on user prompt, please find out whether ...",2


## LLM Fine Tuning Preparation

In [7]:
import os
from random import randrange
from functools import partial
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForCausalLM,
                          LlamaForSequenceClassification,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          HfArgumentParser,
                          Trainer,
                          TrainingArguments,
                          DataCollatorForLanguageModeling,
                          EarlyStoppingCallback,
                          pipeline,
                          logging,
                          set_seed)

import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel, AutoPeftModelForCausalLM
from sklearn.model_selection import train_test_split

# from trl import SFTTrainer
# from google.colab import drive
# drive.mount('/content/drive')

In [8]:
# To use Llama3, we need to register for an access token
# access_token = 'Your Llama Token Here.'
from huggingface_hub import login
login(token = access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\zyc71\.cache\huggingface\token
Login successful


In [9]:
model_name = 'meta-llama/Meta-Llama-3-8B'
# model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'
# model_name = 'mistralai/Mistral-7B-v0.3'

# If GPU is available, device = 'auto' will proritize the use of GPU.
device = 'auto'
# device = 'cuda' if torch.cuda.is_available() else 'auto'

## Load Tokenizer

In [10]:
tokenizer = AutoTokenizer.from_pretrained(
    model_name, 
    use_auth_token = True, # The token to use as HTTP bearer authorization for remote files. Passing use_auth_token=True is required when you want to use a private model.
    use_fast = True, # load the fast version of the tokenizer
    padding_side = 'left',
    truncation_side = 'right',
)


# Add a self defined padding token
# Note: Llama has no pad and unknown token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    
tokenizer.pad_token = tokenizer.pad_token


def tokenize_text(batch):
    return tokenizer(batch['prompt_for_pred'],
                     padding = 'max_length',
                     max_length = 1024,
                     truncation = True,
                     return_tensors = 'pt', # return pytorch tensor
                     )

tokenizer.pad_token_id

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


128256

## QLoRA Configuration

In [11]:
# Quantization configuration to reduce memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True, # Quantize the model to 4-bits when load it
    bnb_4bit_quant_type = 'nf4', # Use a special 4-bit data type for weights initialized from a normal distribution
    bnb_4bit_compute_dtype = torch.bfloat16, # Use bfloat16 for faster computation
    bnb_4bit_use_double_quant = True, # Use a nested quantization scheme to quantize the already quantized weights
)


# Use quantization method load Llama3 for classification
model = LlamaForSequenceClassification.from_pretrained(
        model_name,
        num_labels = 3, # We have 3 classes: model a wins, model b wins and tie
        quantization_config = bnb_config,
        device_map = device,
        trust_remote_code = True
)


# Gradient checkpointing is a technique used to trade off memory usage for computation time during backpropagation
# When activated, it is used to reduce memory consumption by saving only certain intermediate activations during the forward pass 
# and recomputing others during the backward pass. This trades off increased computation time for reduced memory usage.
model.gradient_checkpointing_enable()


# what prepare_mode_for_kbit_training() does? - https://anelmusic13.medium.com/turn-your-llm-into-a-mafioso-code-explanation-companion-8cef7dfee80a
# use_gradient_checkpointing is True by default in prepare_model_for_kbit_training()
model = prepare_model_for_kbit_training(model)


# Since I create a pad token for fine-tuning, some changes are needed
# To avoid ValueError: Cannot handle batch sizes > 1 if no padding token is defined.
model.config.pad_token_id = tokenizer.pad_token_id

# Resize token embeddings
model.resize_token_embeddings(len(tokenizer))

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Embedding(128257, 4096)

In [12]:
print(model)

LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(128257, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )


In [13]:
#LoRA cofiguration

peft_config = LoraConfig(
    r = 4, # the dimension of the low-rank matrices
    lora_alpha = 16, # scaling factor to control the influence of the LoRA relative to the original model weights. A higher lora_alpha gives more weight to the low-rank updates
    lora_dropout = 0.1, # dropout probability of the LoRA layers
    bias = 'none', 
    task_type = 'SEQ_CLS', # Need to set 'SEQ_CLS' for classification
    target_modules = ['q_proj', 'v_proj']
    # target_modules=["q_proj",
    #     "k_proj",
    #     "v_proj",
    #     "o_proj",
        # "gate_proj",
        # "up_proj",
        # "down_proj",
        # "lm_head",
                   # ]
)

# Once the LoraConfig is setup, create a PeftModel with the get_peft_model() function. 
# It takes a base model - which you can load from the Transformers library - and the LoraConfig containing the parameters for how to configure a model for training with LoRA.
model = get_peft_model(model, peft_config)

In [14]:
# print_trainable_parameters() is a method that can be used after get_peft_model()
model.print_trainable_parameters()

trainable params: 1,716,224 || all params: 7,506,657,280 || trainable%: 0.0229


## Dataset Splitting

In [15]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

feature_cols = [col for col in df_train.columns if col not in ['winner_model_a', 'winner_model_b', 'winner_tie', 'label']]
X_train_val, X_test, y_train_val, y_test = train_test_split(df_train[feature_cols], df_train['label'], test_size = 0.05, random_state = 42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size = 0.1, random_state = 42)

train_dataset = Dataset.from_pandas(pd.concat([X_train[['prompt_for_pred']], y_train], axis = 1))
val_dataset = Dataset.from_pandas(pd.concat([X_val[['prompt_for_pred']], y_val], axis = 1))
test_dataset = Dataset.from_pandas(pd.concat([X_test[['prompt_for_pred']], y_test], axis = 1))

In [16]:
# Tokenize the prompt using pre-defined function
train_ds = train_dataset.map(tokenize_text, batched = True)
val_ds = val_dataset.map(tokenize_text, batched = True)

Map:   0%|          | 0/49142 [00:00<?, ? examples/s]

Map:   0%|          | 0/5461 [00:00<?, ? examples/s]

## Fine Tuning

In [17]:
# Define metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    pred_probas = pred.predictions
    accuracy = accuracy_score(labels, preds)
    macro_f1_score = f1_score(labels, preds, average='macro')
    log_loss_score = log_loss(y_true = labels, y_pred = pred_probas)
    return {'log_loss_score': log_loss_score, 'accuracy': accuracy, 'macro_f1_score': macro_f1_score}

In [18]:
# # For test use
# train_ds = train_ds.select(range(5000))
# val_ds = val_ds.select(range(1000))

In [19]:
# Training parameter configuration
epochs = 30
batch_size = 10
# Gradient accumulation is a way to virtually increase the batch size during training
'''
For example, if you want to use a batch size of 256 but can only fit a batch size of 64 into GPU memory, you can perform gradient accumulation 
over four batches of size 64. That is, set batch_size = 64 and gradient_accumulation_steps = 4
'''
gradient_accumulation_steps = 6

training_args = TrainingArguments(
    output_dir = RESULT_PATH,
    logging_dir = RESULT_PATH + 'logs',
    
    num_train_epochs = epochs,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    gradient_accumulation_steps = gradient_accumulation_steps, # Number of updates steps to accumulate the gradients for, before performing a backward/update pass
    
    learning_rate = 2e-4, # The initial learning rate for Adam.
    # warmup_ratio = 0.01, # Ratio of total training steps used for a linear warmup from 0 to learning_rate
    weight_decay = 0.05, # Regularization parameter for L2 penalty
    
    lr_scheduler_type = 'polynomial',
    fp16 = True, # Mixed precision training can speed up the computations by reducing some variables to fp16 instead of keeping all variables in fp32
    optim = 'paged_adamw_32bit', # The optimizer to use for training the model, 'paged_adamw_32bit' is a variant of the AdamW optimizer designed to be more efficient on 32-bit GPUs.

    logging_strategy = 'steps',
    eval_strategy = 'steps',
    save_strategy = 'steps',
    logging_steps = 0.07, # Number of update steps between two logs
    eval_steps = 0.07, # Number of update steps between two evaluations
    save_steps = 0.07, # Number of updates steps before two checkpoint saves
    
    load_best_model_at_end = True, # Whether or not to load the best model found during training at the end of training
    metric_for_best_model = 'eval_loss',
)


In [25]:
%%time
# DataCollatorWithPadding handles cases where input sequences have different lengths by dynamically padding them within a batch
from transformers import DataCollatorWithPadding
# For DataCollatorWithPadding, the default parameters are: padding = True, max_length = None, return_tensors = 'pt' 
# In fact, if we use DataCollatorWithPadding, we don't need to pad the input sequence originally.

trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_ds,
    eval_dataset = val_ds,
    tokenizer = tokenizer,
    data_collator = DataCollatorWithPadding(tokenizer = tokenizer, padding = 'max_length', max_length = 1024, return_tensors = 'pt'),
    compute_metrics = compute_metrics,
)
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


Step,Training Loss,Validation Loss,Log Loss Score,Accuracy,Macro F1 Score
1720,1.3473,1.229279,3.686457,0.412012,0.395194
3440,1.1395,1.195851,3.383888,0.404138,0.398865
5160,1.0896,1.17121,3.345467,0.420985,0.405652
6880,1.0599,1.154451,3.169676,0.424831,0.416609
8600,1.0448,1.160874,3.276569,0.418238,0.416523
10320,1.0321,1.163804,3.404346,0.427211,0.409752
12040,1.0225,1.159704,3.374507,0.428493,0.416291
13760,1.0107,1.149961,3.200396,0.414393,0.413192
15480,1.0026,1.152832,3.377131,0.425563,0.414514
17200,0.9955,1.150648,3.32708,0.426112,0.410535




CPU times: total: 10h 13min 34s
Wall time: 3d 1h 7min 25s


TrainOutput(global_step=24570, training_loss=1.0446081152959814, metrics={'train_runtime': 263245.0468, 'train_samples_per_second': 5.6, 'train_steps_per_second': 0.093, 'total_flos': 6.322321783640398e+19, 'train_loss': 1.0446081152959814, 'epoch': 29.99389623601221})

In [26]:
# Save model and tokenizer
trainer.save_model(RESULT_PATH +'model')

tokenizer.save_pretrained(RESULT_PATH + 'tokenizer')



('./v7_results/tokenizer\\tokenizer_config.json',
 './v7_results/tokenizer\\special_tokens_map.json',
 './v7_results/tokenizer\\tokenizer.json')

## Inference

In inference part, I duplicates some codes above so that I can run inference part directly after the model is fine-tuned.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

from sklearn.metrics import accuracy_score, f1_score, log_loss, confusion_matrix


from random import randrange
from functools import partial
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForCausalLM,
                          LlamaForSequenceClassification,
                          AutoTokenizer,
                          BitsAndBytesConfig,
                          HfArgumentParser,
                          Trainer,
                          TrainingArguments,
                          DataCollatorForLanguageModeling,
                          EarlyStoppingCallback,
                          pipeline,
                          logging,
                          set_seed)

import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel, AutoPeftModelForCausalLM
from sklearn.model_selection import train_test_split

In [2]:
VERSION = 'v7'

RESULT_PATH = './' + VERSION + '_results/' 
if not os.path.exists(RESULT_PATH):
    os.makedirs(RESULT_PATH)

PATH = '/kaggle/input/lmsys-chatbot-arena/'
PATH = 'C:/Users/zyc71/Data Science Projects/LMSYS - Chatbot Arena Human Preference Predictions/Data/'

PERSONAL_PATH = 'C:/Users/zyc71/Data Science Projects/LMSYS - Chatbot Arena Human Preference Predictions/'

TRAIN_CSV = PATH + 'train.csv'
TEST_CSV = PATH + 'test.csv'
SUBM_CSV = PATH + 'sample_submission.csv'

In [3]:
# access_token = 'Your Llama Token Here.'
from huggingface_hub import login
login(token = access_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\zyc71\.cache\huggingface\token
Login successful


In [4]:
device ='auto'

# base model is what I fine tuned
base_model_name = 'meta-llama/Meta-Llama-3-8B'
# Below is where I save the fine tuned adapter
model_name = RESULT_PATH + 'model'

In [5]:
# Load saved tokenizer
tokenizer = AutoTokenizer.from_pretrained(RESULT_PATH + 'tokenizer')

def tokenize_text(batch):
    return tokenizer(batch['prompt_for_pred'],
                     padding = 'max_length',
                     max_length = 1024,
                     truncation = True,
                     return_tensors = 'pt',
                     )

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
# Load base model with quantization

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = 'nf4',
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = True,
)


#Load the base model with default precision
model = LlamaForSequenceClassification.from_pretrained(base_model_name, 
                                                       num_labels = 3, 
                                                       quantization_config = bnb_config, 
                                                       device_map = device, 
                                                       trust_remote_code = True,
                                                      )

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at meta-llama/Meta-Llama-3-8B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
# To avoid ValueError: Cannot handle batch sizes > 1 if no padding token is defined.
model.config.pad_token_id = tokenizer.pad_token_id

# Resize token embeddings
model.resize_token_embeddings(len(tokenizer))

Embedding(128257, 4096)

In [8]:
# Load and activate the adapter on top of the base model
model = PeftModel.from_pretrained(model, model_name)

In [9]:
# Merge the adapter with the base model
'''
While LoRA is significantly smaller and faster to train, you may encounter latency issues during inference due to separately loading the base model 
and the LoRA adapter. To eliminate latency, use the merge_and_unload() function to merge the adapter weights with the base model. This allows you to 
use the newly merged model as a standalone model. The merge_and_unload() function doesn’t keep the adapter weights in memory.
'''
model = model.merge_and_unload()



In [10]:
# Load preprocessed data to save time
df_train = pd.read_csv(PATH + 'processed_train.csv')

In [11]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

feature_cols = [col for col in df_train.columns if col not in ['winner_model_a', 'winner_model_b', 'winner_tie', 'label']]
X_train_val, X_test, y_train_val, y_test = train_test_split(df_train[feature_cols], df_train['label'], test_size = 0.1, random_state = 42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size = 0.2, random_state = 42)

train_dataset = Dataset.from_pandas(pd.concat([X_train[['prompt_for_pred']], y_train], axis = 1))
val_dataset = Dataset.from_pandas(pd.concat([X_val[['prompt_for_pred']], y_val], axis = 1))
test_dataset = Dataset.from_pandas(pd.concat([X_test[['prompt_for_pred']], y_test], axis = 1))

In [12]:
# Tokenize the prompt using pre-defined function
# train_ds = train_dataset.map(tokenize_text, batched = True)
# val_ds = val_dataset.map(tokenize_text, batched = True)

infer_ds = test_dataset.select(range(0,1000)).map(tokenize_text, batched = True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [13]:
from tqdm import tqdm

predicted = []

for text in tqdm(infer_ds['prompt_for_pred']):
    # Tokenize the text and create a batch with a single data point
    tokenized = tokenizer(text, return_tensors = 'pt', padding = 'max_length', truncation = True, max_length = 1024)

    # Perform inference on the single data point
    output = model(**tokenized)
    logits = output.logits
    logits = logits.float()

    # Calculate class probabilities
    class_probabilities = torch.nn.functional.softmax(logits, dim = 1)

    predicted.append(class_probabilities)
    
concatenated_tensor = torch.cat(predicted)
predicted = concatenated_tensor.detach().cpu().numpy()
predicted

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:59<00:00,  5.56it/s]


array([[0.5650298 , 0.16093798, 0.2740322 ],
       [0.13291879, 0.61656404, 0.25051716],
       [0.2341742 , 0.3793606 , 0.38646522],
       ...,
       [0.17246105, 0.62455326, 0.20298564],
       [0.51112574, 0.36297372, 0.12590058],
       [0.2586679 , 0.5066017 , 0.23473048]], dtype=float32)

In [15]:
from sklearn.metrics import f1_score, accuracy_score, log_loss

def get_classification_report(p, y):
    probabilities = p

    labels = np.array(y)

    # Threshold probabilities if needed
    thresholded_predictions = np.argmax(probabilities, axis=1)

    f1 = f1_score(labels, thresholded_predictions, average='macro')
    accuracy = accuracy_score(labels, thresholded_predictions)
    logloss = log_loss(labels, probabilities)
    
    
    # Confusion matrix
    cm = confusion_matrix(labels, thresholded_predictions)
    
    # # Plot confusion matrix
    # sns.heatmap(cm, annot=True, fmt='g', cmap='Blues', xticklabels=['Non-Hate', 'Hate'], yticklabels=['Non-Hate', 'Hate'])
    # plt.xlabel('Predicted')
    # plt.ylabel('True')
    # plt.title('Confusion Matrix')
    # plt.show()

    print({"F1_Score": f1, 'Accuracy': accuracy,"Log_Loss": logloss})

metrics = get_classification_report(predicted, infer_ds['label'])

{'F1_Score': 0.4448254396328788, 'Accuracy': 0.448, 'Log_Loss': 1.1355878386265164}


## GPU Memory Clean

In [16]:
# empty_cache() is usually not working, restart the kernel and use the code in next cell to test whether GPU memory usage has been reset
torch.cuda.empty_cache()

In [17]:
import torch
print("torch.cuda.memory_allocated: %fGB"%(torch.cuda.memory_allocated(0)/1024/1024/1024))
print("torch.cuda.memory_reserved: %fGB"%(torch.cuda.memory_reserved(0)/1024/1024/1024))
print("torch.cuda.max_memory_reserved: %fGB"%(torch.cuda.max_memory_reserved(0)/1024/1024/1024))

torch.cuda.memory_allocated: 13.258203GB
torch.cuda.memory_reserved: 13.816406GB
torch.cuda.max_memory_reserved: 13.824219GB


## Future Plan

It seems that the model still has improvement space. However, to restart the training with new parameters will take a lot of time. I am considering combining the base model and adapter as the new base model and do fine-tuning on the new base model. However, currently, when trying to do this, it seems my GPU is not working. I need to do more research on how to realize it.