# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

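* PEFT technique: LoRA (Low-Rank Adaptation), applied to the attention projection layers (`q_lin`, `k_lin`, `v_lin`, `out_lin`).
* Model: DistilBERT (`distilbert-base-uncased`) with a sequence classification head.
* Evaluation approach: Accuracy on the validation split, computed with the Hugging Face `Trainer` class before and after fine-tuning.
* Fine-tuning dataset: MASSIVE 1.1: A 1M-Example Multilingual Natural Language Understanding Dataset with 52 Typologically-Diverse Languages (https://huggingface.co/datasets/AmazonScience/massive). This project uses the `en-US` configuration and predicts the `scenario` field (18 classes).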

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [4]:
from datasets import load_dataset

splits = ["train", "validation", "test"]
# Load the train, validation and test splits of the en-US configuration of AmazonScience/massive.
# Passing a list to `split` returns a list of datasets in the same order, so zip them into a dict.
dataset = dict(zip(splits, load_dataset("AmazonScience/massive", "en-US", split=splits)))
dataset

{'train': Dataset({
     features: ['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments'],
     num_rows: 11514
 }),
 'validation': Dataset({
     features: ['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments'],
     num_rows: 2033
 }),
 'test': Dataset({
     features: ['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments'],
     num_rows: 2974
 })}

In [5]:
# Keep a random 50% of each split to speed up training (fixed seed for reproducibility)
ds = {}
for split in splits:
    portion = int(len(dataset[split]) * 0.5)
    ds[split] = dataset[split].shuffle(seed=42).select(range(portion))
ds

{'train': Dataset({
     features: ['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments'],
     num_rows: 5757
 }),
 'validation': Dataset({
     features: ['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments'],
     num_rows: 1016
 }),
 'test': Dataset({
     features: ['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt', 'worker_id', 'slot_method', 'judgments'],
     num_rows: 1487
 })}

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = {}

for split in splits:
    # Tokenize the utterances; padding is left to DataCollatorWithPadding at batch time
    tokenized_dataset[split] = ds[split].map(
        lambda x: tokenizer(x["utt"], truncation=True), batched=True
    )
    # Use the `scenario` field as the classification target
    tokenized_dataset[split] = tokenized_dataset[split].rename_column("scenario", "label")

In [7]:
from transformers import AutoModelForSequenceClassification

# Build the label mappings from the dataset's ClassLabel feature
label_names = tokenized_dataset["train"].features["label"].names
id2label = dict(enumerate(label_names))
label2id = {label: i for i, label in enumerate(label_names)}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label_names),  # 18 scenario classes
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# Freeze the DistilBERT encoder; only the newly initialized classification head stays trainable
for param in model.base_model.parameters():
    param.requires_grad = False

In [9]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    """Compute accuracy from the (logits, labels) pair the Trainer passes in."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/voice_assistant_interactions",
        # Set the learning rate
        learning_rate= 2e-05,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=10,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# Evaluate without training to get the baseline accuracy of the pre-trained model
trainer.evaluate()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 2.902759075164795,
 'eval_accuracy': 0.09448818897637795,
 'eval_runtime': 1.664,
 'eval_samples_per_second': 610.592,
 'eval_steps_per_second': 38.462}
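
As a quick sanity check on the metric, the sketch below applies the same argmax-and-compare logic as `compute_metrics` to a tiny hand-made batch. The logits and labels are made up for illustration and are not from the dataset. For reference, uniform random guessing over 18 classes would give roughly 1/18 ≈ 5.6% accuracy, so a baseline of about 9.4% is in the range one would expect from a randomly initialized classification head.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Same accuracy logic as in the notebook, returned as a plain float."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for 4 examples and 3 classes (illustrative only)
toy_logits = np.array([
    [2.0, 0.1, 0.3],  # predicted class 0
    [0.2, 1.5, 0.1],  # predicted class 1
    [0.3, 0.2, 0.9],  # predicted class 2
    [1.1, 0.4, 0.2],  # predicted class 0
])
toy_labels = np.array([0, 1, 1, 0])  # third example is misclassified

toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)

# Expected accuracy of uniform random guessing over 18 classes
chance_accuracy = 1 / 18
print(round(chance_accuracy, 4))
```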

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [10]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [11]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.01,
    bias="none",
    task_type='SEQ_CLS',
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"]
)

lora_model = get_peft_model(model, config)
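
To make the configuration above concrete, here is a minimal NumPy sketch of the low-rank update that LoRA adds to a single 768×768 linear layer, using the same `r=8` and `lora_alpha=16`. It illustrates the idea only and is not PEFT's actual implementation: the variable names and the initialization scale of `A` are assumptions, and dropout and bias are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 768, 768, 8, 16
scaling = lora_alpha / r  # the low-rank path is scaled by alpha / r

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection (assumed init scale)
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized


def lora_forward(x, W, A, B, scaling):
    """Base linear output plus the scaled low-rank correction."""
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(4, d_in))
base_out = x @ W.T
lora_out = lora_forward(x, W, A, B, scaling)

# Because B starts at zero, the adapted layer initially matches the base layer
print(np.allclose(lora_out, base_out))

# Trainable parameters added per adapted layer: r * (d_in + d_out)
lora_params_per_layer = A.size + B.size
print(lora_params_per_layer)
```

Only `A` and `B` would receive gradients during training, which is why the adapter adds so few trainable parameters compared with the frozen weight `W`.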

In [12]:
lora_model.config.architectures

['DistilBertForMaskedLM']

In [13]:
lora_model.print_trainable_parameters()

trainable params: 1,503,780 || all params: 67,866,660 || trainable%: 2.21578607227761
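
The reported counts can be cross-checked with a little arithmetic. The sketch below assumes 6 transformer blocks with 4 adapted 768×768 projections each, plus the `pre_classifier` and `classifier` layers shown in the initialization warning. Under those assumptions the reported trainable total equals the LoRA parameters plus twice the classification head. One possible reading, which is an inference from the arithmetic and not something the notebook states, is that PEFT keeps both the original head and a trainable copy for `SEQ_CLS` tasks and counts both.

```python
hidden, r, num_labels = 768, 8, 18
num_layers, targets_per_layer = 6, 4  # q_lin, k_lin, v_lin, out_lin in each block

# Each adapted layer adds A (r x hidden) and B (hidden x r)
lora_params = num_layers * targets_per_layer * r * (hidden + hidden)

# Classification head: weights plus biases
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels
head_params = pre_classifier + classifier

reported_trainable = 1_503_780
reported_total = 67_866_660

print(lora_params)                    # LoRA adapter parameters
print(head_params)                    # classification head parameters
print(lora_params + 2 * head_params)  # compare with reported_trainable
print(100 * reported_trainable / reported_total)  # trainable percentage
```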


In [14]:
lora_trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data/voice_assistant_interactions_lora",
        # Set the learning rate
        learning_rate= 2e-05,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=10,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

lora_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.972813 | 0.493110 |
| 2 | 2.319200 | 1.063983 | 0.730315 |
| 3 | 1.150000 | 0.760502 | 0.808071 |
| 4 | 1.150000 | 0.638603 | 0.830709 |
| 5 | 0.760200 | 0.573173 | 0.848425 |
| 6 | 0.635400 | 0.535031 | 0.853346 |
| 7 | 0.566500 | 0.510437 | 0.857283 |
| 8 | 0.566500 | 0.498108 | 0.859252 |
| 9 | 0.539900 | 0.490194 | 0.855315 |
| 10 | 0.519100 | 0.487926 | 0.857283 |


Checkpoint destination directory ./data/voice_assistant_interactions_lora/checkpoint-360 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/voice_assistant_interactions_lora/checkpoint-720 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/voice_assistant_interactions_lora/checkpoint-1080 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/voice_assistant_interactions_lora/checkpoint-1440 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/voice_assistant_interactions_lora/checkpoint-1800 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/voice_assistant_interactions_lora/checkpoint-2160 already exists and is non-empty.

TrainOutput(global_step=3600, training_loss=0.916069999270969, metrics={'train_runtime': 194.5722, 'train_samples_per_second': 295.88, 'train_steps_per_second': 18.502, 'total_flos': 592220421776088.0, 'train_loss': 0.916069999270969, 'epoch': 10.0})
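
The step counts in the output above follow from the dataset size and batch size. The sketch below reproduces them, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook, but both are consistent with the numbers).

```python
import math

train_examples = 5757   # rows in the subsampled train split
batch_size = 16         # per_device_train_batch_size
epochs = 10             # num_train_epochs
train_runtime = 194.5722  # seconds, from TrainOutput

# The last batch of each epoch is partial, so round up
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# With save_strategy="epoch", a checkpoint is written at the end of every epoch
checkpoint_steps = [steps_per_epoch * e for e in range(1, epochs + 1)]

samples_per_second = train_examples * epochs / train_runtime

print(steps_per_epoch, total_steps)
print(checkpoint_steps)
print(samples_per_second)
```

This matches `global_step=3600` and the `checkpoint-360`, `checkpoint-720`, ... directory names in the log.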

In [15]:
lora_model.save_pretrained("distilbert-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [16]:
from peft import AutoPeftModelForSequenceClassification

# Reload the base model and attach the LoRA adapter weights saved in "distilbert-lora"
peft_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "distilbert-lora",
    num_labels=18,
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
peft_model.config.architectures

['DistilBertForMaskedLM']

In [18]:
peft_trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="./data/voice_assistant_interactions_peft",
        # Set the learning rate
        learning_rate= 2e-05,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        # Evaluate and save the model after each epoch
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=3,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# Evaluate the reloaded PEFT model on the validation split (no further training)
peft_trainer.evaluate()

{'eval_loss': 0.48792585730552673,
 'eval_accuracy': 0.8572834645669292,
 'eval_runtime': 1.2575,
 'eval_samples_per_second': 807.927,
 'eval_steps_per_second': 50.893}

In [20]:
# Build a DataFrame of utterances with their predicted and true labels
import pandas as pd

# First use of the held-out test split
review_items = tokenized_dataset["test"].select(range(20))

results = peft_trainer.predict(review_items)
df = pd.DataFrame(
    {
        "utterance": [item["utt"] for item in review_items],
        "predicted_label": [id2label[i] for i in results.predictions.argmax(axis=1)],
        "label": [id2label[i] for i in results.label_ids],
    }
)
# Show full cell contents without truncation
pd.set_option("display.max_colwidth", None)
df

| | utterance | predicted_label | label |
|---|---|---|---|
| 0 | i need you to mark next monday | calendar | calendar |
| 1 | let's play my most played song list | play | play |
| 2 | disable alarm for three p. m. | alarm | alarm |
| 3 | cancel alarm | alarm | alarm |
| 4 | remind me about my schedule for the afternoon | calendar | calendar |
| 5 | pause the book | play | play |
| 6 | hi can you please turn lower the lights | iot | iot |
| 7 | when is easter in the year two thousand and eighteen | datetime | datetime |
| 8 | speak loudly | audio | audio |
| 9 | how do i make a turkey | cooking | cooking |


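Accuracy comparison (validation split, 1,016 examples):

* Pre-trained model with an untrained classification head: `eval_accuracy` = 0.09448818897637795 (about 9.4%)
* PEFT (LoRA) model after 10 epochs of training: `eval_accuracy` = 0.8572834645669292 (about 85.7%)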
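
To put the two accuracies reported above in concrete terms, the sketch below converts them back into counts of correctly classified validation examples, using the 1,016-row validation subset from earlier in the notebook. It only re-derives numbers from the reported metrics and does not run the model.

```python
validation_examples = 1016  # rows in the subsampled validation split

baseline_accuracy = 0.09448818897637795  # before fine-tuning
peft_accuracy = 0.8572834645669292       # after LoRA fine-tuning

# Recover the number of correct predictions behind each accuracy
baseline_correct = round(baseline_accuracy * validation_examples)
peft_correct = round(peft_accuracy * validation_examples)

# Absolute improvement in accuracy
absolute_gain = peft_accuracy - baseline_accuracy

print(baseline_correct, peft_correct)
print(round(absolute_gain, 4))
```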