# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: PEFT (Parameter-Efficient Fine-Tuning) is a family of methods, and a Hugging Face library of the same name, for adapting large pretrained models to downstream applications without fine-tuning all of the model's parameters, which is prohibitively costly. The technique used here is LoRA (Low-Rank Adaptation of Large Language Models), a popular and lightweight method that significantly reduces the number of trainable parameters: it inserts a small number of new weights into the model and trains only those.

* Model: The chosen model is DistilBERT (`distilbert-base-uncased`), a successful language model that uses the attention mechanism to improve its performance.

* Evaluation approach: Since this is a classification task, I monitor accuracy, the number of correctly classified samples divided by the total number of samples. The categorical cross-entropy loss is also monitored; it measures the discrepancy between the true labels and the predicted class probabilities.

* Fine-tuning dataset: I chose Auditor Sentiment, a sentiment analysis dataset from the financial domain. The data can be found [here](https://huggingface.co/datasets/FinanceInc/auditor_sentiment) and has the following description: ***Auditor sentiment dataset of sentences from financial news. The dataset consists of several thousand sentences from English language financial news categorized by sentiment.***
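
The LoRA idea described in the first bullet can be sketched in a few lines of NumPy. This is an illustration only, not the `peft` implementation: the names `W`, `A`, `B` and `lora_forward`, as well as the initialization scale, are assumptions. The shapes (hidden size 768, rank 16, alpha 32) mirror the configuration used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 768, 16, 32                 # hidden size, LoRA rank, LoRA scaling factor

W = rng.normal(size=(d, d))               # frozen pretrained weight (never updated)
A = rng.normal(scale=0.01, size=(r, d))   # trainable down-projection (assumed init)
B = np.zeros((d, r))                      # trainable up-projection, zero at start


def lora_forward(x):
    """Frozen path plus the scaled low-rank update: x W^T + (alpha / r) * x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d))               # a batch of 4 hidden vectors

# Because B starts at zero, the adapted layer initially matches the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)

full_params = W.size                      # weights a full fine-tune would update
lora_params = A.size + B.size             # weights LoRA trains instead
print(full_params, lora_params)           # 589824 24576
```

For a single 768 x 768 projection, the low-rank pair trains roughly 4% of the weights that a full update would touch.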

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
from datasets import load_dataset

# load sentiment analysis financial data
dataset = load_dataset(
    "FinanceInc/auditor_sentiment", 
    split="train").train_test_split(
        test_size=0.2, 
        shuffle=True, 
        seed=23
)

splits = ["train", "test"]

# View the dataset characteristics
dataset["train"]

Generating train split: 100%|██████████| 3877/3877 [00:00<00:00, 518388.11 examples/s]
Generating test split: 100%|██████████| 969/969 [00:00<00:00, 386191.62 examples/s]


Dataset({
    features: ['sentence', 'label'],
    num_rows: 3101
})

There are two different sets of data:

In [2]:
dataset.keys()

dict_keys(['train', 'test'])

Data sizes for training and testing:

In [3]:
print(f"Training set dimensions: {dataset['train'].shape[0]} rows and {dataset['train'].shape[1]} columns.")
print(f"Testing set dimensions: {dataset['test'].shape[0]} rows and {dataset['test'].shape[1]} columns.")

Training set dimensions: 3101 rows and 2 columns.
Testing set dimensions: 776 rows and 2 columns.
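
These sizes follow from `test_size=0.2` applied to the 3,877 source rows. The sketch below reproduces them under the assumption that the test share is rounded up and the remainder goes to training; `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(3877, 0.2)
print(n_train, n_test)  # 3101 776
```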


Data type:

In [4]:
type(dataset["train"])

datasets.arrow_dataset.Dataset

Checking some samples:

In [5]:
for i in range(10):
    print(f"Sentence: {dataset['train'][i]['sentence']}")
    print(f"Label: {dataset['train'][i]['label']}")
    print("_"*100)

Sentence: ---------------------------------------------------------------------- -------------- Munich , 14 January 2008 : BAVARIA Industriekapital AG closed the acquisition of Elcoteq Communications Technology GmbH in Offenburg , Germany , with the approval of the
Label: 1
____________________________________________________________________________________________________
Sentence: However , sales volumes in the food industry are expected to remain at relatively good levels in Finland and in Scandinavia , Atria said .
Label: 2
____________________________________________________________________________________________________
Sentence: The optimization of the steel components heating process will reduce the energy consumption .
Label: 2
____________________________________________________________________________________________________
Sentence: Each share is entitled to one vote .
Label: 1
_______________________________________________________________________________________________

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Let's use a lambda function to tokenize all the examples
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["sentence"], truncation=True), batched=True
    )

# Inspect the available columns in the dataset
tokenized_dataset["train"]



Dataset({
    features: ['sentence', 'label', 'input_ids', 'attention_mask'],
    num_rows: 3101
})
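
The tokenizer call above only truncates, so the rows keep different lengths and padding is left to the data collator at batch time. A minimal pure-Python sketch of that dynamic padding follows. `pad_batch` is a hypothetical helper, the token ids are made up for illustration, and the pad id of 0 is an assumption that matches the `padding_idx=0` shown in the model printout.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids (made up for illustration): three sequences of different length.
batch = [[101, 1996, 102], [101, 1996, 3316, 2097, 102], [101, 102]]
padded = pad_batch(batch)
print(padded["input_ids"][2])       # [101, 102, 0, 0, 0]
print(padded["attention_mask"][2])  # [1, 1, 0, 0, 0]
```

Padding per batch rather than to a global maximum keeps short batches short, which is the usual reason for deferring it to the collator.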

In [7]:
from transformers import AutoModelForSequenceClassification

# label: a label corresponding to the class as a string: 'positive' - (2), 'neutral' - (1), or 'negative' - (0)

id2label_dict = {
    2:'positive', 
    1: 'neutral', 
    0: 'negative'
} 

label2id_dict = {
    'positive':2, 
    'neutral':1, 
    'negative':0
} 


model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,
    id2label = id2label_dict,
    label2id = label2id_dict,
)

    
# Freeze the DistilBERT base so that only the classification head is trained
# Hint: Check the documentation at https://huggingface.co/transformers/v4.2.2/training.html
for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
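
With the base frozen, only the newly initialized head named in the warning stays trainable. The sketch below counts its parameters, assuming `pre_classifier` is a 768 to 768 linear layer and `classifier` is a 768 to 3 linear layer. Those shapes are inferred from the hidden size and `num_labels=3`; they are not printed in this notebook.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


hidden, num_labels = 768, 3
pre_classifier = linear_params(hidden, hidden)   # assumed shape: 768 -> 768
classifier = linear_params(hidden, num_labels)   # assumed shape: 768 -> 3
head_total = pre_classifier + classifier
print(pre_classifier, classifier, head_total)    # 590592 2307 592899
```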


In [8]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [9]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# set the number of epochs in the experiment
NUM_EPOCHS = 20



# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        # Set the learning rate
        learning_rate = 2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16, 
        per_device_eval_batch_size=16, 
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.867723,0.592784
2,No log,0.821184,0.597938
3,0.878700,0.776846,0.634021
4,0.878700,0.748419,0.641753
5,0.878700,0.720939,0.658505
6,0.769000,0.70093,0.67268
7,0.769000,0.688914,0.679124
8,0.713100,0.675683,0.694588
9,0.713100,0.669286,0.695876
10,0.713100,0.656425,0.701031


TrainOutput(global_step=3880, training_loss=0.7070649294509102, metrics={'train_runtime': 172.6696, 'train_samples_per_second': 359.183, 'train_steps_per_second': 22.471, 'total_flos': 952228605710790.0, 'train_loss': 0.7070649294509102, 'epoch': 20.0})
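
The `global_step=3880` reported above can be reproduced from the training set size and batch size. The sketch assumes a single device and no gradient accumulation, neither of which is stated in the notebook. It also assumes a logging interval of 500 steps, which would explain why the first two epochs show "No log".

```python
import math

train_rows, batch_size, epochs = 3101, 16, 20
steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 194 3880

# Assumed logging interval: first epoch that ends after at least one logging step.
logging_steps = 500
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(first_logged_epoch)  # 3
```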

In [10]:
trainer.evaluate()

{'eval_loss': 0.6204381585121155,
 'eval_accuracy': 0.7164948453608248,
 'eval_runtime': 1.3257,
 'eval_samples_per_second': 585.355,
 'eval_steps_per_second': 36.962,
 'epoch': 20.0}
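
To make the two monitored quantities concrete, here is a small NumPy sketch that computes accuracy and categorical cross-entropy by hand. The logits and labels are toy values made up for illustration, and the helper names are hypothetical. As a reference point, a model that guesses uniformly over three classes has a cross-entropy of ln 3, about 1.0986.

```python
import numpy as np


def cross_entropy(logits, labels):
    """Mean categorical cross-entropy from raw logits (numerically stable log-softmax)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def accuracy(logits, labels):
    """Share of rows whose highest-scoring class equals the label."""
    return (np.argmax(logits, axis=1) == labels).mean()


# Toy values, made up for illustration (classes: 0 negative, 1 neutral, 2 positive).
logits = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 0.3], [1.0, 0.9, 0.2]])
labels = np.array([0, 1, 2, 1])

acc = accuracy(logits, labels)
uniform_loss = cross_entropy(np.zeros((4, 3)), labels)
print(acc)           # 0.75
print(uniform_loss)  # ln(3), about 1.0986
```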

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [11]:
from peft import LoraConfig, get_peft_model 

config = LoraConfig(
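    r=16,  # rank of the low-rank update matrices
    lora_alpha=32,  # scaling factor applied to the LoRA update
    target_modules=["q_lin", "k_lin", "v_lin"],  # attention query, key and value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",  # kept as run; "SEQ_CLS" is the task type intended for sequence classification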
)

            
lora_model = get_peft_model(model, config)

In [12]:
lora_model.print_trainable_parameters()

trainable params: 442,368 || all params: 67,398,147 || trainable%: 0.6563503889802786
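
The trainable count above can be checked by hand. The sketch assumes each targeted projection receives one pair of LoRA matrices of shapes (16 x 768) and (768 x 16) with no extra bias; the six blocks and the hidden size of 768 are taken from the model printout.

```python
hidden, rank = 768, 16
layers, targets_per_layer = 6, 3             # q_lin, k_lin, v_lin in each of 6 blocks

per_module = rank * hidden + hidden * rank   # A is (r x 768), B is (768 x r)
trainable = per_module * targets_per_layer * layers
all_params = 67_398_147                      # total reported above

print(trainable)                             # 442368
print(round(100 * trainable / all_params, 2))  # 0.66
```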


In [13]:
lora_trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",
        # Set the learning rate
        learning_rate = 2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16, 
        per_device_eval_batch_size=16, 
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

lora_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.55813,0.75
2,No log,0.498131,0.792526
3,0.562800,0.463955,0.819588
4,0.562800,0.436276,0.829897
5,0.562800,0.421365,0.829897
6,0.479000,0.406964,0.835052
7,0.479000,0.403874,0.83634
8,0.428200,0.397858,0.83634
9,0.428200,0.388594,0.844072
10,0.428200,0.379351,0.840206


TrainOutput(global_step=3880, training_loss=0.4185263800866825, metrics={'train_runtime': 270.882, 'train_samples_per_second': 228.956, 'train_steps_per_second': 14.324, 'total_flos': 961997139562950.0, 'train_loss': 0.4185263800866825, 'epoch': 20.0})

In [14]:
lora_trainer.evaluate()

{'eval_loss': 0.3607477843761444,
 'eval_accuracy': 0.8518041237113402,
 'eval_runtime': 1.4421,
 'eval_samples_per_second': 538.11,
 'eval_steps_per_second': 33.979,
 'epoch': 20.0}

Accuracy on the test set improves with PEFT from 71.65% to 85.18%.
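
The gain can be restated in terms of test sentences, using the `eval_accuracy` values returned by the two `evaluate()` calls and the 776-row test set.

```python
test_rows = 776
before = 0.7164948453608248   # eval_accuracy before LoRA
after = 0.8518041237113402    # eval_accuracy after LoRA

correct_before = round(before * test_rows)
correct_after = round(after * test_rows)
gain_points = round(100 * (after - before), 2)

print(correct_before, correct_after)  # 556 661
print(gain_points)                    # 13.53
```

In other words, 105 more of the 776 test sentences are classified correctly after LoRA training.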

In [15]:
lora_model.save_pretrained("lora_model")


In [16]:
!ls

LightweightFineTuning.ipynb  data  logs  lora_model  train_data


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [18]:
from peft import PeftModel, PeftConfig, AutoPeftModelForSequenceClassification, AutoPeftModelForCausalLM

from transformers import AutoModelForSequenceClassification

peft_model_id = "lora_model"

config = PeftConfig.from_pretrained(peft_model_id)


In [19]:
config.base_model_name_or_path

'distilbert-base-uncased'

In [None]:
# Do not execute: this throws an error (possibly because the base model name is passed instead of the adapter directory, peft_model_id)
#model_from_disk = AutoPeftModelForSequenceClassification.from_pretrained(
#    config.base_model_name_or_path,
#)

In [20]:
# Load the base model with AutoModelForSequenceClassification, since the AutoPeft loader fails
model_from_disk = AutoModelForSequenceClassification.from_pretrained(
    config.base_model_name_or_path,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
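
One possible way to attach the saved adapter to a freshly loaded base model is sketched below. It is an untested sketch rather than the method used in this notebook: `load_lora_classifier` is a hypothetical helper, and it assumes the installed `peft` version accepts the adapter directory written by `save_pretrained("lora_model")`.

```python
def load_lora_classifier(adapter_dir="lora_model", num_labels=3):
    """Rebuild the base classifier, then attach the LoRA weights stored in adapter_dir."""
    # Imports live inside the function so the sketch can be read without the libraries installed.
    from peft import PeftConfig, PeftModel
    from transformers import AutoModelForSequenceClassification

    config = PeftConfig.from_pretrained(adapter_dir)
    base = AutoModelForSequenceClassification.from_pretrained(
        config.base_model_name_or_path,
        num_labels=num_labels,  # must match the head used during training
    )
    # The adapter directory, not the base model name, is what gets passed here.
    return PeftModel.from_pretrained(base, adapter_dir)
```

Note that an adapter directory stores the LoRA weights, plus any modules listed in `modules_to_save`. The classification head trained earlier is not necessarily included, so a model rebuilt this way may not reproduce the accuracy above unless the head is saved and restored as well.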


ImportError: cannot import name 'DataCollatorForMultipleChoice' from 'transformers' (/opt/conda/lib/python3.10/site-packages/transformers/__init__.py)

In [27]:
trainer_loaded_model = Trainer(
    model=model,  # in-memory model trained above (LoRA layers included), not model_from_disk
    #data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer), 
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

In [28]:
import pandas as pd

df = pd.DataFrame(tokenized_dataset["test"])
df.head()

Unnamed: 0,sentence,label,input_ids,attention_mask
0,The Efore plant at Saarijarvi in central Finla...,1,"[101, 1996, 1041, 29278, 2063, 3269, 2012, 784...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,The companies will divest to UPM Fray Bentos p...,1,"[101, 1996, 3316, 2097, 11529, 3367, 2000, 203...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,The tower 's engineers have created an 18 degr...,1,"[101, 1996, 3578, 1005, 1055, 6145, 2031, 2580...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,Finnish silicon wafer technology company Okmet...,1,"[101, 6983, 13773, 11333, 7512, 2974, 2194, 79...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,"According to Kesko , the company agreed with t...",2,"[101, 2429, 2000, 17710, 21590, 1010, 1996, 21...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [29]:
import numpy as np

df = df[["sentence", "label"]].copy()  # copy so the column assignments below do not act on a slice

# Replace <br /> tags in the text with spaces
df["sentence"] = df["sentence"].str.replace("<br />", " ")

# Add the model predictions to the dataframe
predictions = trainer_loaded_model.predict(tokenized_dataset["test"])
df["predicted_label"] = np.argmax(predictions[0], axis=1)

df.head(10)

Unnamed: 0,sentence,label,predicted_label
0,The Efore plant at Saarijarvi in central Finla...,1,1
1,The companies will divest to UPM Fray Bentos p...,1,1
2,The tower 's engineers have created an 18 degr...,1,1
3,Finnish silicon wafer technology company Okmet...,1,1
4,"According to Kesko , the company agreed with t...",2,1
5,"`` After this purchase , Cramo will become the...",2,1
6,"According to HKScan Finland , the plan is to i...",2,2
7,"According to Arokarhu , some of the purchases ...",0,1
8,"Barclays Plc ( LSE : BARC ) ( NYSE : BCS ) , C...",1,1
9,There will be return flights from Stuttgart ev...,1,1
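
A quick way to see where the errors fall is to tabulate true against predicted classes. The sketch below does this for the ten rows displayed above only, so it is not representative of the full test set. The label names follow the `id2label` mapping defined earlier.

```python
from collections import Counter

id2label = {0: "negative", 1: "neutral", 2: "positive"}

# Labels and predictions copied from the ten rows displayed above.
labels = [1, 1, 1, 1, 2, 2, 2, 0, 1, 1]
preds = [1, 1, 1, 1, 1, 1, 2, 1, 1, 1]

confusion = Counter((id2label[t], id2label[p]) for t, p in zip(labels, preds))
for (true_name, pred_name), count in sorted(confusion.items()):
    print(f"true={true_name:<8} predicted={pred_name:<8} count={count}")

accuracy = sum(t == p for t, p in zip(labels, preds)) / len(labels)
print(accuracy)  # 0.7
```

In this small sample, every mistake is a non-neutral sentence predicted as neutral.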


In [32]:
score = df["label"] == df["predicted_label"]
print(score)

0       True
1       True
2       True
3       True
4      False
       ...  
771     True
772     True
773     True
774     True
775     True
Length: 776, dtype: bool


In [34]:
score.sum() / len(score)

0.8518041237113402