# Lightweight Fine-Tuning Project

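Choices made for this project:

* PEFT technique: LoRA (`r=8`, applied to the `query` and `value` attention modules)
* Model: `bert-base-uncased` with a two-label sequence classification head
* Evaluation approach: weighted precision, recall, and F1 on a held-out test split, computed with `Trainer.evaluate()` before and after fine-tuning
* Fine-tuning dataset: the first 5,000 rows of `AiresPucrs/sentiment-analysis`, split 90/10 into train and test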

## Loading and Evaluating a Foundation Model

Load a pre-trained Hugging Face model and evaluate its performance before fine-tuning. This step also loads the matching tokenizer and the dataset.

In [1]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
# Load the pre-trained model and tokenizer
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
from transformers import pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
# bert-base-uncased already defines a [PAD] token, so no pad_token override is needed

print(tokenizer.model_max_length)  # Check the maximum length of the tokenizer

512


### Load the Dataset

Load a labeled sentiment-analysis dataset and split it into train and test sets.

In [3]:
# Load the sentiment-analysis dataset (datasets tried earlier are kept below, commented out)
from datasets import load_dataset, DatasetDict

# Login using e.g. `huggingface-cli login` to access this dataset
# ds = load_dataset("Josephgflowers/Finance-Instruct-500k", split="train[:5000]")

# Just read the first 5000 entries only due to resource limits
# ds = load_dataset("talkmap/banking-conversation-corpus", split="train[:5000]")

# ds = load_dataset("KidzRizal/twitter-sentiment-analysis", split="train[:5000]")

dataset_name = "AiresPucrs/sentiment-analysis"
ds = load_dataset(dataset_name, split="train[:5000]")

# split into train and test sets
ds = ds.train_test_split(test_size=0.1)
# explore the dataset
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})


### Quickly check the model

Use the first 8 texts from the test set that are shorter than 512 characters to check how the model performs
> The same set of prompts will be used before and after training

In [4]:
check_df = ds['test'].to_pandas()
check_df = check_df.loc[check_df['text'].str.len() < 512][:8]
check_df


| index | text | label |
|---|---|---|
| 1 | the magnificent greta garbo is in top form in ... | 1 |
| 20 | cates is insipid and unconvincing kline over a... | 0 |
| 23 | thriller is the greatest music video of all ti... | 1 |
| 24 | sorry to go against the flow but i thought thi... | 0 |
| 25 | i found this movie to be a simple yet wonderfu... | 1 |
| 28 | worst dcom i have seen ever well maybe not as ... | 0 |
| 33 | michael is king this film contains some of the... | 1 |
| 37 | guy pearce almost looks like flynn and this re... | 0 |


Check the model before training

In [5]:

orig_classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Quickly check the model
for prompt in check_df['text'].tolist():
    print(orig_classifier(prompt, truncation=True, max_length=512))

Device set to use cpu


[{'label': 'NEGATIVE', 'score': 0.5052009224891663}]
[{'label': 'NEGATIVE', 'score': 0.5264928340911865}]
[{'label': 'NEGATIVE', 'score': 0.5301784873008728}]
[{'label': 'NEGATIVE', 'score': 0.5146786570549011}]
[{'label': 'NEGATIVE', 'score': 0.5180830359458923}]
[{'label': 'NEGATIVE', 'score': 0.5140610933303833}]
[{'label': 'NEGATIVE', 'score': 0.5085684657096863}]
[{'label': 'NEGATIVE', 'score': 0.5569984912872314}]


**Before training**, the model simply picks *NEGATIVE* for every prompt, with scores barely above 0.5. This is expected given the warning message: the classification head is newly initialized and needs to be trained.

### Preprocess the Data

#### Tokenize the dataset

In [6]:
# quick check that things are working

inputs = tokenizer(ds['train'][0]['text'], max_length=512, padding="max_length", truncation=True, return_tensors="pt")
inputs['input_ids'].shape
#print(tokenizer.decode(inputs['input_ids']))
outputs = model(**inputs)  # Forward pass with the tokenized inputs
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[0.2005, 0.0361]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
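The scores reported by the pipeline are, as far as I understand its default behavior for a two-label model, softmax probabilities of the logits. The minimal sketch below applies a hand-written softmax to the logits printed above to show why an untrained head produces scores close to 0.5. The `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the forward pass above (index 0 = NEGATIVE, 1 = POSITIVE)
logits = [0.2005, 0.0361]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.541, 0.459]
```

Because the two logits differ by less than 0.2, the winning class gets only about 54% of the probability mass, which matches the near-0.5 scores seen in the quick check.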


Define a function that tokenizes the text and truncates it to at most 512 tokens. Padding is deferred to the data collator at batch time.

In [7]:
# tokenizer function
def tokenize_func(examples):
    return tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        # return_tensors="pt",
    )


In [8]:
# Do the simple tokenization first and drop the un-used features.

tokenized_datasets = {}
for split in ds.keys():
    tokenized_datasets[split] = ds[split].map(
        tokenize_func,
        batched=True,
        remove_columns=["text"],
    )


Map: 100%|██████████| 4500/4500 [00:00<00:00, 9660.96 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 9227.38 examples/s]


In [9]:
tokenized_datasets

{'train': Dataset({
     features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 4500
 }),
 'test': Dataset({
     features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
     num_rows: 500
 })}

### Define the compute metric

In [10]:
# define a compute metric function
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

def compute_metrics(eval_preds):
    # Convert logits to predictions
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        "precision": precision_score(y_true=labels, y_pred=predictions, average="weighted"),
        "recall": recall_score(y_true=labels, y_pred=predictions, average="weighted"),
        "f1": f1_score(y_true=labels, y_pred=predictions, average="weighted"),
    }
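To make `average="weighted"` concrete, here is a minimal pure-Python sketch that computes the same support-weighted averages by hand. The helper `weighted_scores` and the toy labels are hypothetical and only illustrate the arithmetic; the notebook itself relies on scikit-learn.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F1 (the idea behind average="weighted")."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        p_cls = tp / predicted if predicted else 0.0  # 0 when the class is never predicted
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if (p_cls + r_cls) else 0.0
        weight = count / n  # each class contributes in proportion to its true frequency
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return precision, recall, f1

# Hypothetical toy labels: 3 negatives, 5 positives, classifier leaning NEGATIVE
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1]
precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

One consequence worth noting: support-weighted recall always equals plain accuracy, so the `eval_recall` values reported in this notebook can be read as the fraction of test examples classified correctly.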


### Evaluate the base model using the test dataset

In [11]:
# define the training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./data/base_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    logging_dir="./data/logs",
    logging_steps=10,
)

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
)

base_results = trainer.evaluate()
print(base_results)



{'eval_loss': 0.7026612162590027, 'eval_model_preparation_time': 0.0012, 'eval_precision': 0.46263948497854074, 'eval_recall': 0.482, 'eval_f1': 0.3644460991147514, 'eval_runtime': 95.7588, 'eval_samples_per_second': 5.221, 'eval_steps_per_second': 0.658}


### Set Up PEFT for LoRA Training

In [12]:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,
    #lora_alpha=32,
    #lora_dropout=0.1,
    target_modules=["query", "value"],
)

lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()


trainable params: 296,450 || all params: 109,780,228 || trainable%: 0.2700
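The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard `bert-base-uncased` architecture (12 encoder layers, hidden size 768) and assumes that, for the `SEQ_CLS` task type, the classification head stays trainable alongside the LoRA matrices. The variable names are illustrative.

```python
# Architecture facts for bert-base-uncased
num_layers = 12
hidden_size = 768

rank = 8               # r=8 from the LoraConfig above
targets_per_layer = 2  # "query" and "value"
num_labels = 2

# Each adapted linear layer gains A (rank x hidden) and B (hidden x rank)
lora_per_module = rank * hidden_size + hidden_size * rank
lora_params = num_layers * targets_per_layer * lora_per_module

# Classification head: weight matrix plus bias
classifier_params = hidden_size * num_labels + num_labels

trainable = lora_params + classifier_params
all_params = 109_780_228  # total reported by print_trainable_parameters()
percent = f"{100 * trainable / all_params:.4f}"
print(trainable, percent)  # 296450 0.2700
```

The LoRA matrices account for 294,912 parameters and the classifier head for the remaining 1,538, which together match the 296,450 reported above.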


#### Set the training Arguments and the Trainer

In [13]:
from transformers import TrainingArguments

save_path = "./data/lora-finetuned-sentiment-analysis"

training_args = TrainingArguments(
    output_dir=save_path,
    num_train_epochs=2,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=1e-5,
    weight_decay=0.01,
    load_best_model_at_end=False,
    push_to_hub=False,
)

In [14]:
# Train
from transformers import Trainer
from transformers import DataCollatorWithPadding

# let the data_collator handle the batching jobs
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
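Since the tokenizing function truncates but does not pad, the examples in a batch have different lengths, and the data collator pads them when the batch is assembled. The following is a minimal sketch of that idea in plain Python. It is not the library's implementation, `pad_batch` is a hypothetical helper, and the token ids are made up apart from BERT's `[CLS]` (101), `[SEP]` (102), and `[PAD]` (0).

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized reviews of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a fixed 512 tokens, avoids spending compute on padding for short reviews, which matters when training on CPU.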


#### Now train the model. Without a GPU, this takes a long time (about 1.8 hours on CPU in this run)

In [15]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 0.7097 | 0.690214 | 0.550119 | 0.538 | 0.519664 |
| 2 | 0.7066 | 0.687738 | 0.583362 | 0.556 | 0.525589 |




TrainOutput(global_step=1126, training_loss=0.6983768745171663, metrics={'train_runtime': 6503.1135, 'train_samples_per_second': 1.384, 'train_steps_per_second': 0.173, 'total_flos': 2107817639208960.0, 'train_loss': 0.6983768745171663, 'epoch': 2.0})
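The numbers in `TrainOutput` can be sanity-checked with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the default `per_device_train_batch_size` of 8, since the training arguments for this run do not set it.

```python
import math

train_rows = 4500          # size of the train split
batch_size = 8             # TrainingArguments default, assumed here
epochs = 2                 # num_train_epochs
train_runtime = 6503.1135  # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / train_runtime

print(total_steps)                   # 1126
print(round(samples_per_second, 3))  # 1.384
```

Both values agree with `global_step=1126` and `train_samples_per_second=1.384`, and the runtime works out to roughly 1.8 hours of CPU time for two epochs.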

#### Evaluate the model **after** training

In [16]:
# Evaluate the fine-tuned model on the test split
results = trainer.evaluate()
print(results)



{'eval_loss': 0.6877377033233643, 'eval_precision': 0.5833615837484636, 'eval_recall': 0.556, 'eval_f1': 0.5255887306691166, 'eval_runtime': 97.5221, 'eval_samples_per_second': 5.127, 'eval_steps_per_second': 0.646, 'epoch': 2.0}


#### Check the response for the same set of prompts after training

In [17]:
from transformers import pipeline

new_clfr = pipeline("sentiment-analysis", model=lora_model, tokenizer=tokenizer)

for prompt in check_df['text'].tolist():
    print(new_clfr(prompt, truncation=True, max_length=512))

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.5196443200111389}]
[{'label': 'NEGATIVE', 'score': 0.5295662879943848}]
[{'label': 'POSITIVE', 'score': 0.5149076581001282}]
[{'label': 'POSITIVE', 'score': 0.5009469389915466}]
[{'label': 'POSITIVE', 'score': 0.5080550909042358}]
[{'label': 'NEGATIVE', 'score': 0.5076899528503418}]
[{'label': 'NEGATIVE', 'score': 0.5017227530479431}]
[{'label': 'NEGATIVE', 'score': 0.5228018760681152}]


###  Save the PEFT Tuned model to disk


In [18]:
# Save the LoRA adapter weights and the tokenizer

# save_path = "./data/lora-finetuned-sentiment-analysis" # (already defined above)
lora_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


('./data/lora-finetuned-sentiment-analysis/tokenizer_config.json',
 './data/lora-finetuned-sentiment-analysis/special_tokens_map.json',
 './data/lora-finetuned-sentiment-analysis/vocab.txt',
 './data/lora-finetuned-sentiment-analysis/added_tokens.json',
 './data/lora-finetuned-sentiment-analysis/tokenizer.json')

## Performing Inference with a Saved PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [23]:
# Load the fine-tuned model for inference; save_path is already defined above
from transformers import pipeline, AutoTokenizer
from peft import AutoPeftModelForSequenceClassification

loaded_lora_model = AutoPeftModelForSequenceClassification.from_pretrained(
    save_path,
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

tokenizer = AutoTokenizer.from_pretrained(save_path)

new_clfr = pipeline("sentiment-analysis", model=loaded_lora_model, tokenizer=tokenizer)

for prompt in check_df['text'].tolist():
    print(new_clfr(prompt, truncation=True, max_length=512))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.5196443200111389}]
[{'label': 'NEGATIVE', 'score': 0.5295662879943848}]
[{'label': 'POSITIVE', 'score': 0.5149076581001282}]
[{'label': 'POSITIVE', 'score': 0.5009469389915466}]
[{'label': 'POSITIVE', 'score': 0.5080550909042358}]
[{'label': 'NEGATIVE', 'score': 0.5076899528503418}]
[{'label': 'NEGATIVE', 'score': 0.5017227530479431}]
[{'label': 'NEGATIVE', 'score': 0.5228018760681152}]


### Evaluate the model just loaded from disk

In [25]:
trainer.model = loaded_lora_model
print(trainer.evaluate())



{'eval_loss': 0.6877377033233643, 'eval_precision': 0.5833615837484636, 'eval_recall': 0.556, 'eval_f1': 0.5255887306691166, 'eval_runtime': 94.6162, 'eval_samples_per_second': 5.285, 'eval_steps_per_second': 0.666, 'epoch': 2.0}


## Conclusion

The fine-tuned model does better than the original model on the 500-example test set: weighted precision rose from 0.463 to 0.583, recall from 0.482 to 0.556, and F1 from 0.364 to 0.526. The improvement is visible but small, and the prediction scores still sit close to 0.5, so after two epochs on 4,500 examples the model is only slightly better than chance. Training on the whole dataset would likely improve the results further.

The model loaded from disk has the same performance as the LoRA model right after training,
as expected.
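The before/after comparison can be tabulated in a few lines. The metric values below are copied from the two `evaluate()` outputs in this notebook; the dictionary names are illustrative.

```python
# Metrics from trainer.evaluate() before and after LoRA fine-tuning
base = {"precision": 0.46263948497854074, "recall": 0.482, "f1": 0.3644460991147514}
tuned = {"precision": 0.5833615837484636, "recall": 0.556, "f1": 0.5255887306691166}

# Absolute improvement per metric
deltas = {name: tuned[name] - base[name] for name in base}
for name, delta in deltas.items():
    print(f"{name:<10} {base[name]:.3f} -> {tuned[name]:.3f} ({delta:+.3f})")
```

F1 shows the largest gain (about 0.16), mostly because the untrained model predicted a single class for nearly every example.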