<a href="https://colab.research.google.com/github/danjshaw/ece57000_finalProject/blob/main/LoRA_NLU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Preliminary LoRA implementation using the loralib library, following the training examples from the Hugging Face NLP course (https://huggingface.co/learn/nlp-course/en/).

Package installations

In [1]:
!pip install datasets
!pip install evaluate
!pip install loralib


In [2]:
import torch
import torch.nn as nn
import numpy as np

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


LoRA implementation.

In [4]:
class Lora(nn.Module):
  def __init__(self, linear, rank=8, alpha=16):
    super().__init__()
    self.linear = linear
    self.scale = alpha / rank

    # Matrix A: Gaussian-initialized, shape (in_features, rank)
    self.A = nn.Parameter(torch.randn(linear.in_features, rank))

    # Matrix B: zero-initialized, shape (rank, out_features), so the update starts at zero
    self.B = nn.Parameter(torch.zeros(rank, linear.out_features))

  def forward(self, x):
    # h = W_0(x) + (alpha / rank) * x A B
    return self.linear(x) + (self.scale * (x @ self.A @ self.B))
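
As a sanity check on the update rule in the Lora class, the NumPy sketch below (not part of the original notebook) checks two properties: with B initialized to zeros the wrapped layer gives the same output as the frozen layer, and a trained low-rank update can be folded into a single weight matrix. The toy dimensions, the name `lora_forward`, and the random stand-in for a trained B are assumptions for illustration, and the bias term of `nn.Linear` is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 6, 4, 2, 4    # toy sizes, chosen for illustration
scale = alpha / rank

W0 = rng.normal(size=(d_out, d_in))      # frozen weight, stored (out, in) like nn.Linear
A = rng.normal(size=(d_in, rank))        # Gaussian init, as in the Lora class
B = np.zeros((rank, d_out))              # zero init, so the update starts at zero
x = rng.normal(size=(3, d_in))           # batch of 3 inputs

def lora_forward(x, W0, A, B, scale):
    # Base projection plus the scaled low-rank update
    return x @ W0.T + scale * (x @ A @ B)

# 1) With B = 0 the wrapped layer reproduces the frozen layer.
same_at_init = np.allclose(lora_forward(x, W0, A, B, scale), x @ W0.T)

# 2) A non-zero B (random here, standing in for a trained one) can be merged into W0.
B_trained = rng.normal(size=(rank, d_out))
W_merged = W0 + scale * (A @ B_trained).T
merged_matches = np.allclose(
    lora_forward(x, W0, A, B_trained, scale), x @ W_merged.T
)

print(same_at_init, merged_matches)
```

The second check is why a LoRA-adapted layer need not add inference cost: once training is done, the update can be added to the frozen weight.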

In [5]:
def init_lora_layers(rank, alpha, model):
  model_type = model.__class__.__name__.lower()
  if "roberta" in model_type:
    # Wrap the query and value projections of every encoder layer with LoRA
    for layer in model.roberta.encoder.layer:
      s = layer.attention.self
      s.query = Lora(s.query, rank, alpha)
      s.value = Lora(s.value, rank, alpha)

  elif "deberta" in model_type:
    # Same wrapping; here the projections are named query_proj and value_proj
    for layer in model.deberta.encoder.layer:
      s = layer.attention.self
      s.query_proj = Lora(s.query_proj, rank, alpha)
      s.value_proj = Lora(s.value_proj, rank, alpha)

  # Train only the LoRA matrices A and B. Every other parameter is frozen,
  # including the newly initialized classifier head.
  for name, param in model.named_parameters():
    if not (name.endswith(".A") or name.endswith(".B")):
      param.requires_grad = False
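
The freezing step in init_lora_layers selects parameters by name. The sketch below applies a suffix-based version of that test to a few illustrative parameter names, written the way they would appear once `query` is wrapped in Lora, to show which tensors stay trainable. The helper `is_lora_param` and the last, hypothetical name are assumptions for illustration only.

```python
# Illustrative parameter names after wrapping `query` in the Lora module
names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.linear.weight",
    "roberta.encoder.layer.0.attention.self.query.A",
    "roberta.encoder.layer.0.attention.self.query.B",
    "roberta.encoder.layer.0.attention.output.LayerNorm.weight",
    "classifier.dense.weight",
    "classifier.out_proj.weight",
]

def is_lora_param(name):
    # Look only at the final attribute name, not at the whole dotted path
    return name.rsplit(".", 1)[-1] in {"A", "B"}

trainable = [n for n in names if is_lora_param(n)]
frozen_head = [n for n in names if n.startswith("classifier.") and not is_lora_param(n)]

print(trainable)
print(frozen_head)

# Hypothetical name: a plain substring test would match it, the suffix test does not
hypothetical = "encoder.Adapter.weight"
print("A" in hypothetical, is_lora_param(hypothetical))
```

Neither classifier name matches, so under this rule the classification head receives no gradient updates. That is consistent with the trainable-parameter count of 294912 printed for roberta-base with LoRA, which leaves no room for head weights.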

Check trainable parameters

In [6]:
def get_trainable_parameters(model):
  trainable_parameters = 0
  parameters = 0
  for param in model.parameters():
    # count the total number of parameters
    parameters += param.numel()
    if param.requires_grad:
      # count the total number of trainable parameters
      trainable_parameters += param.numel()
  return parameters, trainable_parameters
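
The counts returned by get_trainable_parameters can be cross-checked by hand. The sketch below assumes the published model configurations (12 layers with hidden size 768 for roberta-base, 24 layers with hidden size 1024 for roberta-large) and two adapted projections per layer, as in init_lora_layers. The function name `lora_param_count` is hypothetical.

```python
def lora_param_count(num_layers, hidden_size, rank, adapted_per_layer=2):
    # Each adapted projection adds A (hidden x rank) and B (rank x hidden)
    per_projection = hidden_size * rank + rank * hidden_size
    return num_layers * adapted_per_layer * per_projection

base = lora_param_count(num_layers=12, hidden_size=768, rank=8)
large = lora_param_count(num_layers=24, hidden_size=1024, rank=8)
print(base, large)  # 294912 786432

# Share of parameters that is trainable for roberta-base with LoRA,
# using the total reported in the notebook output (124942082)
print(f"{base / 124_942_082:.4%}")
```

Both values agree with the trainable-parameter counts printed in the LoRA runs, and 124647170 + 294912 = 124942082 accounts for the difference between the full fine-tuning and LoRA totals for roberta-base.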

Define hyperparameters and a training function for GLUE

In [22]:
roberta_base_hyperparameters = {
    "mnli": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 16,
      "epochs": 30,
      "learning-rate": 5e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "sst2": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 16,
      "epochs": 60,
      "learning-rate": 5e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "mrpc": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 16,
      "epochs": 30,
      "learning-rate": 4e-04,
      "weight-decay": 0.1,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "cola": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 32,
      "epochs": 80,
      "learning-rate": 4e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "qnli": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 32,
      "epochs": 25,
      "learning-rate": 4e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "qqp": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 16,
      "epochs": 25,
      "learning-rate": 4e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "rte": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 32,
      "epochs": 80,
      "learning-rate": 5e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "stsb": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 16,
      "epochs": 40,
      "learning-rate": 4e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    }
}
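
The "lr-schedule" and "warmup-ratio" entries describe a learning rate that ramps up linearly from zero and then decays linearly back to zero. The sketch below is a plain-Python approximation of that multiplier. It is meant to mirror the linear schedule in transformers, but the function name, the rounding of the warmup length, and the step counts are assumptions, not the library's code.

```python
import math

def linear_warmup_decay(step, total_steps, warmup_ratio):
    # Multiplier applied to the peak learning rate at a given optimizer step
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total_steps = 1000   # hypothetical run length
peak_lr = 4e-4       # the "learning-rate" value listed for several tasks
for step in (0, 30, 60, 530, 1000):
    print(step, peak_lr * linear_warmup_decay(step, total_steps, warmup_ratio=0.06))
```

With a warmup ratio of 0.06, roughly the first 6% of steps are spent ramping up, so the listed learning rate is reached only once, at the end of warmup.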

In [8]:
roberta_large_hyperparameters = {
    "mnli": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 4,
      "epochs": 10,
      "learning-rate": 3e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 128
    },
    "sst2": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 4,
      "epochs": 10,
      "learning-rate": 4e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 128
    },
    "mrpc": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 4,
      "epochs": 20,
      "learning-rate": 3e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 512
    },
    "cola": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 4,
      "epochs": 20,
      "learning-rate": 2e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 128
    },
    "qnli": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 4,
      "epochs": 10,
      "learning-rate": 2e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 128
    },
    "qqp": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 4,
      "epochs": 20,
      "learning-rate": 3e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 512
    },
    "rte": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 8,
      "epochs": 20,
      "learning-rate": 4e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 512
    },
    "stsb": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.06,
      "batch-size": 8,
      "epochs": 30,
      "learning-rate": 2e-04,
      "weight-decay": 0,
      "rank": 8,
      "alpha": 16,
      "max-seq-len": 512
    }
}

In [9]:
deberta_hyperparameters = {
    "mnli": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 8,
      "epochs": 5,
      "learning-rate": 1e-04,
      "weight-decay": 0,
      "cls-dropout": 0.15,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 256
    },
    "sst2": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 8,
      "epochs": 16,
      "learning-rate": 6e-05,
      "weight-decay": 0.01,
      "cls-dropout": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 128
    },
    "mrpc": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 32,
      "epochs": 30,
      "learning-rate": 2e-04,
      "weight-decay": 0.01,
      "cls-dropout": 0,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 128
    },
    "cola": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 4,
      "epochs": 10,
      "learning-rate": 1e-04,
      "weight-decay": 0,
      "cls-dropout": 0.1,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 256
    },
    "qnli": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 6,
      "epochs": 8,
      "learning-rate": 1e-04,
      "weight-decay": 0.01,
      "cls-dropout": 0.1,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 512
    },
    "qqp": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 8,
      "epochs": 11,
      "learning-rate": 1e-04,
      "weight-decay": 0.01,
      "cls-dropout": 0.2,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 320
    },
    "rte": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 4,
      "epochs": 11,
      "learning-rate": 2e-04,
      "weight-decay": 0.01,
      "cls-dropout": 0.2,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 320
    },
    "stsb": {
      "lr-schedule": "linear",
      "warmup-ratio": 0.1,
      "batch-size": 4,
      "epochs": 10,
      "learning-rate": 2e-04,
      "weight-decay": 0.1,
      "cls-dropout": 0.2,
      "rank": 8,
      "alpha": 8,
      "max-seq-len": 128
    }
}
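
With three hyperparameter tables, one option is to key them by model name so that an unknown model or task fails loudly instead of silently picking up another model's settings. The sketch below uses small stand-in tables with values copied from the CoLA entries. The registry name `HYPERPARAMETERS`, the helper `get_hyperparameters`, and the DeBERTa checkpoint name are assumptions for illustration.

```python
# Stand-in tables; in the notebook these would be the three dictionaries defined above
roberta_base_hp = {"cola": {"rank": 8, "alpha": 8}}
roberta_large_hp = {"cola": {"rank": 8, "alpha": 16}}
deberta_hp = {"cola": {"rank": 8, "alpha": 8, "cls-dropout": 0.1}}

HYPERPARAMETERS = {
    "roberta-base": roberta_base_hp,
    "roberta-large": roberta_large_hp,
    "microsoft/deberta-v2-xxlarge": deberta_hp,  # assumed checkpoint name
}

def get_hyperparameters(model_name, task):
    # Raise a descriptive error rather than falling back to another model's table
    try:
        return HYPERPARAMETERS[model_name][task]
    except KeyError:
        raise KeyError(
            f"No hyperparameters for model={model_name!r}, task={task!r}"
        ) from None

print(get_hyperparameters("roberta-large", "cola"))
```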

In [41]:
from transformers import DataCollatorWithPadding
from datasets import load_dataset
import evaluate
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# From https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=TlqNaB8jIrJW
# Look up table for GLUE columns
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp" : ("question1", "question2"),
    "rte" : ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

def glue_train(model_name, task, lora):
  hyperparameters = roberta_base_hyperparameters if model_name=="roberta-base" else \
                    roberta_large_hyperparameters if model_name=="roberta-large" else \
                    deberta_hyperparameters
  h = hyperparameters[task]


  # From https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/text_classification.ipynb#scrollTo=TlqNaB8jIrJW
  num_labels = 3 if task.startswith("mnli") else 1 if task=="stsb" else 2
  model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)
  # The truncation length is applied per call in tokenize_function below
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  raw_datasets = load_dataset("glue", task)

  def tokenize_function(example):
    col1, col2 = task_to_keys[task]
    if col2 is None:
      return tokenizer(example[col1], truncation=True, max_length=h["max-seq-len"])
    return tokenizer(example[col1], example[col2], truncation=True, max_length=h["max-seq-len"])

  tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
  data_collator = DataCollatorWithPadding(tokenizer=tokenizer, pad_to_multiple_of=8)

  def compute_metrics(eval_pred):
    metric = evaluate.load("glue", task)
    predictions, labels = eval_pred
    if task == "stsb":
      predictions = predictions[:,0]
    else:
      predictions = np.argmax(predictions, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

  metric_name = "pearson" if task == "stsb" else "matthews_correlation" if task == "cola" else "accuracy"

  training_args = TrainingArguments(
      f"{model_name}-finetuned-{task}", # output_dir
      warmup_ratio=h["warmup-ratio"],
      lr_scheduler_type=h["lr-schedule"],
      per_device_train_batch_size=h["batch-size"],
      per_device_eval_batch_size=h["batch-size"],
      eval_strategy="epoch",
      save_strategy="epoch",
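      num_train_epochs=5,   # hard-coded; the table value h["epochs"] is not used yet
      learning_rate=2e-5,   # hard-coded; the table value h["learning-rate"] is not used yet
      weight_decay=0.01,    # hard-coded; the table value h["weight-decay"] is not used yet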
      load_best_model_at_end=True,
      metric_for_best_model=metric_name,
      report_to="none"
      )

  # Handle mnli-mm and mnli expected keys for eval_dataset
  validation_key = "validation_mismatched" if task=="mnli-mm" else \
                "validation_matched" if task=="mnli" else "validation"

  if (lora):
    init_lora_layers(h["rank"], h["alpha"], model)
  num_params, num_trainable_params = get_trainable_parameters(model)
  print(f"Parameters={num_params}; Trainable Parameters={num_trainable_params}")

  # defaults to AdamW optimizer
  trainer = Trainer(
      model,
      training_args,
      train_dataset=tokenized_datasets["train"],
      eval_dataset=tokenized_datasets[validation_key],
      data_collator=data_collator,
      tokenizer=tokenizer,
      compute_metrics=compute_metrics
  )

  trainer.train()
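
For CoLA, compute_metrics takes the argmax over the logits and reports the Matthews correlation. The sketch below reproduces that computation in plain Python on made-up logits and labels, to make the metric concrete. The data and the local `matthews_corrcoef` helper are assumptions, and returning 0.0 when the denominator is zero is a convention chosen here.

```python
import math

def matthews_corrcoef(labels, preds):
    # Confusion-matrix counts for a binary task with labels 0 and 1
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up logits for six examples, two classes each
logits = [[0.2, 1.1], [1.5, -0.3], [0.1, 0.4], [0.9, 0.8], [-0.2, 0.6], [1.0, 0.1]]
labels = [1, 0, 1, 1, 0, 0]

# Argmax over the class dimension, as in compute_metrics
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
mcc = matthews_corrcoef(labels, preds)
print(preds, round(mcc, 4))  # [1, 0, 1, 0, 1, 0] 0.3333
```

Unlike accuracy, the score is 0 for a model that always predicts the same class, which makes it a stricter measure on an imbalanced dataset.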

roberta-base fully fine-tuned on CoLA

In [42]:
glue_train("roberta-base", "cola", False)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Parameters=124647170; Trainable Parameters=124647170


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.497359,0.438259
2,0.450400,0.400729,0.578171
3,0.450400,0.522429,0.585536
4,0.220900,0.632975,0.565194
5,0.220900,0.685817,0.608229
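
The "No log" entry and the repeated training-loss values in this table are consistent with the training loss being logged every 500 optimizer steps rather than once per epoch. The sketch below assumes the default `logging_steps=500`, a single device, no gradient accumulation, and 8551 CoLA training examples (the figure from the GLUE dataset card). The function name is hypothetical.

```python
import math

def last_logged_step_by_epoch(num_examples, batch_size, epochs, logging_steps=500):
    # For each epoch, find the most recent step at which a training loss was logged
    steps_per_epoch = math.ceil(num_examples / batch_size)
    rows = []
    for epoch in range(1, epochs + 1):
        end_step = epoch * steps_per_epoch
        last_log = (end_step // logging_steps) * logging_steps
        rows.append((epoch, end_step, last_log if last_log > 0 else None))
    return rows

for row in last_logged_step_by_epoch(num_examples=8551, batch_size=32, epochs=5):
    print(row)
```

Under these assumptions epoch 1 ends at step 268, before the first log, and epochs 2 and 3 both report the loss logged at step 500, while epochs 4 and 5 report the one from step 1000. This matches the pattern of repeated values in the table.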


roberta-base fine-tuned on CoLA using LoRA

In [43]:
glue_train("roberta-base", "cola", True)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Parameters=124942082; Trainable Parameters=294912


Epoch,Training Loss,Validation Loss,Matthews Correlation
1,No log,0.506533,0.136113
2,0.547100,0.44524,0.505905
3,0.547100,0.503934,0.448177
4,0.443300,0.544461,0.454491
5,0.443300,0.521536,0.489032


roberta-large fully fine-tuned on CoLA

In [None]:
glue_train("roberta-large", "cola", False)

roberta-large fine-tuned on CoLA using LoRA

In [None]:
glue_train("roberta-large", "cola", True)

roberta-base fully fine-tuned on MRPC

In [44]:
glue_train("roberta-base", "mrpc", False)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Parameters=124647170; Trainable Parameters=124647170


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.384034,0.830882,0.871985
2,No log,0.33758,0.884804,0.916519
3,0.419500,0.412299,0.867647,0.897338
4,0.419500,0.521754,0.892157,0.923077
5,0.169000,0.570815,0.887255,0.917266


roberta-base fine-tuned on MRPC using LoRA

In [32]:
glue_train("roberta-base", "mrpc", True)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Parameters=124942082; Trainable Parameters=294912


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.62486,0.683824,0.812227
2,No log,0.626203,0.683824,0.812227
3,0.639800,0.624259,0.683824,0.812227
4,0.639800,0.624604,0.683824,0.812227
5,0.631900,0.624006,0.683824,0.812227
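
The identical accuracy and F1 in every epoch are what a model that predicts the positive class for every example would produce. The sketch below checks that arithmetic, assuming 408 MRPC validation examples of which 279 are positive. The positive count is inferred from the reported accuracy, not read from the dataset, and the function name is hypothetical.

```python
def all_positive_scores(num_positive, num_total):
    # Scores for a classifier that predicts label 1 for every example
    tp = num_positive
    fp = num_total - num_positive
    fn = 0
    accuracy = tp / num_total
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

acc, f1 = all_positive_scores(num_positive=279, num_total=408)
print(round(acc, 6), round(f1, 6))  # 0.683824 0.812227
```

This matches the reported values to six decimals, which suggests, without proving it, that the LoRA run on MRPC has collapsed to a constant prediction under the current settings.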


roberta-large fine-tuned on MRPC using LoRA

In [34]:
glue_train("roberta-large", "mrpc", True)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Parameters=356148226; Trainable Parameters=786432


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.665,0.624615,0.683824,0.812227
2,0.6458,0.623957,0.683824,0.812227
3,0.6421,0.62396,0.683824,0.812227
4,0.6664,0.624091,0.683824,0.812227
5,0.6805,0.623982,0.683824,0.812227


roberta-base fine-tuned on STS-B using LoRA

In [33]:
glue_train("roberta-base", "stsb", True)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Parameters=124941313; Trainable Parameters=294912


Epoch,Training Loss,Validation Loss,Pearson,Spearmanr
1,No log,2.713139,0.106171,0.101542
2,2.505600,2.541281,-0.038081,-0.032583
3,2.159000,2.540647,0.044612,0.046134
4,2.159000,2.640274,0.13063,0.134497
5,2.169500,2.572468,0.112225,0.118948
