# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA. I kept this part simple; LoRA trains a small set of low-rank adapter weights instead of updating the full model, which makes fine-tuning cheaper.
* Model: distilbert-base-uncased. Case does not matter in the dataset I chose, and I wanted a model that was fast and lightweight. DistilBERT has historically worked well for sequence classification tasks.
* Evaluation approach: accuracy on the held-out split, taking the argmax of the model's logits as the predicted class.
* Fine-tuning dataset: clinc_oos. I wanted to apply text classification concepts to a problem with more than two classes.
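
The core idea behind LoRA can be sketched in a few lines of NumPy. This is an illustration only, not how the `peft` library is implemented internally: the matrix names `W`, `A`, `B`, the helper `lora_forward`, and the random stand-in values are all assumptions, while the rank of 8, the alpha of 32, and the 768-dimensional layer size match the settings and model used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 768, 8, 32              # hidden size, LoRA rank, LoRA alpha
W = rng.normal(size=(d, d))           # frozen pre-trained weight (stand-in values)
A = rng.normal(size=(r, d)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init so training starts from W


def lora_forward(x):
    """Frozen path plus the scaled low-rank update: x W^T + (alpha / r) * x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)


x = rng.normal(size=(2, d))
# With B = 0 the adapted layer reproduces the frozen layer exactly.
same_at_init = np.allclose(lora_forward(x), x @ W.T)

full_params = W.size                  # weights in the frozen matrix
lora_params = A.size + B.size         # trainable weights in the adapter
print(same_at_init, full_params, lora_params)
```

For one 768 x 768 layer, the adapter trains 12,288 weights instead of 589,824, which is where the savings come from.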

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
# imports

from datasets import load_dataset, Features, Value, ClassLabel
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, AutoPeftModelForSequenceClassification, AutoPeftModelForCausalLM, PeftConfig


In [2]:
# load a dataset 

dataset = load_dataset('clinc/clinc_oos', 'small', split='train').train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

# limit the dataset to 3 classes only for simplicity
dataset = dataset.filter(lambda x: x["intent"] in (0, 1, 9))

# change column names for ease
dataset = dataset.rename_column("intent", "labels")

# change label 9 to 2 for ease
def change_label(x):
    if (x["labels"] == 9):
        x['labels'] = 2
    return x

dataset = dataset.map(change_label)

splits = ["train", "test"]

print(dataset["train"])

print(dataset["test"])
print(dataset["test"]['labels'])


Dataset({
    features: ['text', 'labels'],
    num_rows: 115
})
Dataset({
    features: ['text', 'labels'],
    num_rows: 35
})
[0, 2, 2, 0, 2, 0, 2, 1, 0, 2, 0, 0, 2, 1, 1, 2, 0, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 0, 1, 1, 1, 0, 0]
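
The printed labels make it easy to check how balanced the test split is and what a trivial "always predict the most common class" baseline would score. The sketch below copies the label list from the output above; the variable names are illustrative.

```python
from collections import Counter

# Test-set labels exactly as printed above.
test_labels = [0, 2, 2, 0, 2, 0, 2, 1, 0, 2, 0, 0, 2, 1, 1, 2, 0, 1,
               2, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 0, 1, 1, 1, 0, 0]

counts = Counter(test_labels)
majority_label, majority_count = counts.most_common(1)[0]
majority_accuracy = majority_count / len(test_labels)

print(dict(sorted(counts.items())))              # per-class counts
print(majority_label, round(majority_accuracy, 3))
```

The classes are uneven (11, 8, and 16 examples), so always predicting class 2 would already reach about 45.7% accuracy. That is a useful reference point when reading the baseline numbers.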


In [3]:
# tokenize the dataset

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["text"], truncation=True), batched=True
    )


In [4]:
# load a model

model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=3,
    id2label={0: 'restaurant_reviews', 1: 'nutrition_info', 2: 'accept_reservations'},
    label2id={'restaurant_reviews': 0, 'nutrition_info': 1, 'accept_reservations': 2},
)

# freeze the DistilBERT encoder so its weights are not updated (the classification head is left as-is)

for param in model.base_model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# evaluate the model as-is on the three-class subset (no training is run here; only evaluate() is called)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
#     print(predictions)
    predictions = np.argmax(predictions, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data2/intent_analysis",
        learning_rate=0.001,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1, 
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
#     data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer.evaluate()

# {'eval_loss': 1.1439210176467896,
#  'eval_accuracy': 0.2,
#  'eval_runtime': 5.6686,
#  'eval_samples_per_second': 6.174,
#  'eval_steps_per_second': 0.882}



{'eval_loss': 1.0956395864486694,
 'eval_accuracy': 0.3142857142857143,
 'eval_runtime': 2.217,
 'eval_samples_per_second': 15.787,
 'eval_steps_per_second': 2.255}
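
To put the baseline loss in context, it helps to compare it with the cross-entropy of a model that guesses uniformly across the three classes. The short calculation below is a sanity check rather than part of the original run; the variable names are illustrative, and `reported_loss` is copied from the output above.

```python
import math

num_classes = 3

# A model that assigns equal probability to every class has cross-entropy
# -log(1 / num_classes) = log(num_classes), whatever the true label is.
chance_loss = math.log(num_classes)

reported_loss = 1.0956395864486694    # eval_loss from the output above
gap = reported_loss - chance_loss

print(round(chance_loss, 4), round(gap, 4))
```

The reported loss sits within about 0.003 of log(3), so the untrained classification head is effectively guessing. The reported accuracy of 0.3143 corresponds to 11 of the 35 test examples being classified correctly.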

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [8]:
# print(model)

In [13]:
# set up LoRA

lora_config = LoraConfig(
    r=8, 
    lora_alpha=32,
    lora_dropout=0.05, 
    # target the query, key, and value projection layers in each attention block
    target_modules=['q_lin', 'k_lin', 'v_lin'],
    bias='none',
    task_type="SEQ_CLS" 
)

lora_model = get_peft_model(model, lora_config)

lora_model.print_trainable_parameters()
# trainable params: 1,406,982 || all params: 67,769,862 || trainable%: 2.076117552076467

trainable params: 1,406,982 || all params: 67,769,862 || trainable%: 2.076117552076467
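
The trainable-parameter count can be reproduced by hand. The arithmetic below is one accounting that matches the printed figure. It rests on an assumption that is not confirmed anywhere in this notebook: that the classification head is counted twice, once for the original modules (which were never frozen) and once for the trainable copy that PEFT keeps for `SEQ_CLS` tasks.

```python
hidden, rank, num_labels = 768, 8, 3
num_layers, targets_per_layer = 6, 3    # q_lin, k_lin, v_lin in each of 6 blocks

# Each adapted Linear layer gets A (rank x hidden) and B (hidden x rank).
lora_params = num_layers * targets_per_layer * (rank * hidden + hidden * rank)

# Classification head: pre_classifier (768 -> 768) and classifier (768 -> 3),
# each with a bias vector.
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

# Assumption: the head appears twice among the trainable parameters.
trainable = lora_params + 2 * head_params
trainable_pct = 100 * trainable / 67_769_862   # "all params" from the output above

print(lora_params, head_params, trainable, round(trainable_pct, 3))
```

Under this accounting, only 221,184 of the 1,406,982 trainable parameters belong to the LoRA adapters themselves; the rest is the classification head.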


In [15]:
# train the LoRA model for 10 epochs

trainerLoRA = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data3/peft_intent_analysis",
        learning_rate=0.001,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=10,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
#     data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainerLoRA.train()

# RESULTS WITH 10 EPOCHS (no explicit data_collator; Trainer pads each batch using the tokenizer):
# Epoch	Training Loss	Validation Loss	Accuracy
# 1	No log	0.009005	1.000000
# 2	No log	0.000003	1.000000
# 3	No log	0.000002	1.000000
# 4	No log	0.000003	1.000000
# 5	No log	0.000003	1.000000
# 6	No log	0.000002	1.000000
# 7	No log	0.000002	1.000000
# 8	No log	0.000002	1.000000
# 9	No log	0.000002	1.000000
# 10	No log	0.000002	1.000000

# TrainOutput(global_step=150, training_loss=0.0034041416645050047, metrics={'train_runtime': 217.7729, 'train_samples_per_second': 5.281, 'train_steps_per_second': 0.689, 'total_flos': 4283999239500.0, 'train_loss': 0.0034041416645050047, 'epoch': 10.0})

# looks like better accuracy and better loss


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.009005,1.0
2,No log,3e-06,1.0
3,No log,2e-06,1.0
4,No log,3e-06,1.0
5,No log,3e-06,1.0
6,No log,2e-06,1.0
7,No log,2e-06,1.0
8,No log,2e-06,1.0
9,No log,2e-06,1.0
10,No log,2e-06,1.0


TrainOutput(global_step=150, training_loss=0.0034041416645050047, metrics={'train_runtime': 217.7729, 'train_samples_per_second': 5.281, 'train_steps_per_second': 0.689, 'total_flos': 4283999239500.0, 'train_loss': 0.0034041416645050047, 'epoch': 10.0})
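
The `global_step=150` value and the "No log" entries in the Training Loss column both follow from the dataset size and batch size. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of logging every 500 steps, since `logging_steps` is not set in this notebook.

```python
import math

train_examples, batch_size, epochs = 115, 8, 10

steps_per_epoch = math.ceil(train_examples / batch_size)   # the last batch is partial
total_steps = steps_per_epoch * epochs

default_logging_steps = 500    # assumed default, because logging_steps was not set
logs_training_loss = total_steps >= default_logging_steps

print(steps_per_epoch, total_steps, logs_training_loss)
```

Because the whole run is only 150 steps, the first logging point is never reached. Passing a smaller `logging_steps` value, for example 15, would be one way to record the training loss once per epoch.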

In [16]:
trainerLoRA.evaluate()
# {'eval_loss': 1.6552737633901415e-06,
#  'eval_accuracy': 1.0,
#  'eval_runtime': 2.6453,
#  'eval_samples_per_second': 13.231,
#  'eval_steps_per_second': 1.89,
#  'epoch': 10.0}

{'eval_loss': 1.6552737633901415e-06,
 'eval_accuracy': 1.0,
 'eval_runtime': 2.6453,
 'eval_samples_per_second': 13.231,
 'eval_steps_per_second': 1.89,
 'epoch': 10.0}

In [19]:
# save LoRA model
trainerLoRA.save_model("./data3/lora_trainer")
lora_model.save_pretrained("./data3/lora_model")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [25]:
# load the saved LoRA model

saved_model = AutoPeftModelForSequenceClassification.from_pretrained(
    './data3/lora_model',
    num_labels=3,
    ignore_mismatched_sizes=True,
)

saved_model.eval()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): Linear(
                  in_features=768, out_features=768, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=8, bias=Fal

In [26]:
# freeze parameters so nothing gets updated

for param in saved_model.base_model.parameters():
    param.requires_grad = False

In [27]:
# evaluate the saved model

trainer_saved_model = Trainer(
    model=saved_model,
    args=TrainingArguments(
        output_dir="./data/LoRA_model4",
        learning_rate=0.001,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
#     data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer_saved_model.evaluate()
# {'eval_loss': 1.6552737633901415e-06,
#  'eval_accuracy': 1.0,
#  'eval_runtime': 2.2884,
#  'eval_samples_per_second': 15.294,
#  'eval_steps_per_second': 2.185}

{'eval_loss': 1.6552737633901415e-06,
 'eval_accuracy': 1.0,
 'eval_runtime': 2.2884,
 'eval_samples_per_second': 15.294,
 'eval_steps_per_second': 2.185}

Final notes:

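* The baseline model was poor, at about chance level for three classes: loss ≈ 1.096 and accuracy ≈ 31.4% in the run shown above (an earlier run gave loss ≈ 1.144 and accuracy = 20%; the classification head is randomly initialized, so the baseline varies between runs).
* The LoRA model, while training only about 2.1% of all parameters, was much better, with loss ≈ 1.7e-06 and accuracy = 100%.
* The LoRA model was saved and loaded correctly: the reloaded model reproduces the same loss (≈ 1.7e-06) and accuracy (100%) on the test set.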

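However, the final accuracy of 100% is a bit suspicious. The test split has only 35 examples and was also used for evaluation after every epoch during training, so more training data and a separate, untouched test set would be necessary to make a final evaluation.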
