## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

#### Exercise 1.1


In [1]:
from datasets import load_dataset, get_dataset_split_names

In [2]:
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")

splits = get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes")
print("Splits:", splits)

Splits: ['train', 'validation', 'test']


In [3]:
print("Train size:", len(dataset["train"]))
print("Validation size:", len(dataset["validation"]))
print("Test size:", len(dataset["test"]))

Train size: 8530
Validation size: 1066
Test size: 1066


In [4]:
print("Example:\n", dataset["train"][211])

Example:
 {'text': '[a] rare , beautiful film .', 'label': 1}


#### Exercise 1.2

In [5]:
from transformers import AutoTokenizer, AutoModel

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sample = dataset["test"]["text"][260]
print(sample)

encoded_input = tokenizer(sample, return_tensors="pt")
print(encoded_input)

intensely romantic , thought-provoking and even an engaging mystery .
{'input_ids': tensor([[  101, 20531,  6298,  1010,  2245,  1011,  4013, 22776,  1998,  2130,
          2019, 11973,  6547,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [6]:
output = model(**encoded_input)
print(output)

BaseModelOutput(last_hidden_state=tensor([[[-0.2474, -0.2550, -0.1434,  ..., -0.2345,  0.4026,  0.1096],
         [-0.0094,  0.2953,  0.0051,  ..., -0.3797,  0.2477, -0.3513],
         [-0.2699,  0.1544,  0.3303,  ..., -0.5992,  0.1053, -0.4761],
         ...,
         [-0.1478, -0.0413,  0.4556,  ..., -0.2515,  0.2923, -0.3443],
         [-0.5207, -0.7747, -0.1066,  ...,  0.3429,  0.4019, -0.4375],
         [ 0.9744, -0.0602, -0.1639,  ..., -0.0061, -0.3364, -0.0374]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)


In [7]:
print("\nLast hidden state shape:", output.last_hidden_state.shape)


Last hidden state shape: torch.Size([1, 15, 768])


In [8]:
sample = dataset["test"]["text"][29]
print(sample)

encoded_input = tokenizer(sample, return_tensors="pt")

output = model(**encoded_input)
print("\nLast hidden state shape:", output.last_hidden_state.shape)

a soul-stirring documentary about the israeli/palestinian conflict as revealed through the eyes of some children who remain curious about each other against all odds .

Last hidden state shape: torch.Size([1, 31, 768])


#### Exercise 1.3: A Stable Baseline

In [9]:
sample = "i want to see if [CLS] is at the beginning or at the end of the sequence"
encoded = tokenizer(sample, return_tensors="pt")

print("Token IDs:", encoded["input_ids"])
print("Decoded tokens:", tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))

Token IDs: tensor([[ 101, 1045, 2215, 2000, 2156, 2065,  101, 2003, 2012, 1996, 2927, 2030,
         2012, 1996, 2203, 1997, 1996, 5537,  102]])
Decoded tokens: ['[CLS]', 'i', 'want', 'to', 'see', 'if', '[CLS]', 'is', 'at', 'the', 'beginning', 'or', 'at', 'the', 'end', 'of', 'the', 'sequence', '[SEP]']


#### So the [CLS] token is the first element of the output sequence! (The literal `[CLS]` typed in the sample text is also encoded as the special token, id 101.)

In [10]:
import torch
from tqdm import tqdm

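def extract_features(texts, tokenizer, model, device=None):
    # Fall back to the CPU when no GPU is available.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    model.eval()  # disable dropout so the features are deterministic

    hidden_size = model.config.hidden_size  # 768 for distilbert-base-uncased
    features = torch.zeros((len(texts), hidden_size))

    with torch.no_grad():  # no gradients are needed for feature extraction
        for i, text in enumerate(tqdm(texts, desc="Extracting features")):
            encoded = tokenizer(text, return_tensors="pt").to(device)

            output = model(**encoded)

            # The [CLS] token sits at position 0 of the output sequence.
            cls_vec = output.last_hidden_state[:, 0, :]

            features[i, :] = cls_vec.cpu()

    return features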


In [11]:
text_train = extract_features(dataset["train"]["text"], tokenizer, model)
labels_train = dataset["train"]["label"]

text_validation = extract_features(dataset["validation"]["text"], tokenizer, model)
labels_validation = dataset["validation"]["label"]

text_test = extract_features(dataset["test"]["text"], tokenizer, model)
labels_test = dataset["test"]["label"]


Extracting features: 100%|██████████| 8530/8530 [01:08<00:00, 125.34it/s]
Extracting features: 100%|██████████| 1066/1066 [00:08<00:00, 128.84it/s]
Extracting features: 100%|██████████| 1066/1066 [00:08<00:00, 126.44it/s]


In [12]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

clf = LinearSVC()
clf.fit(text_train, labels_train)

validation_predictions = clf.predict(text_validation)
test_predictions = clf.predict(text_test)

print(f"Validation accuracy: {accuracy_score(labels_validation, validation_predictions):.3f}")
print(f"Test accuracy: {accuracy_score(labels_test, test_predictions):.3f}")

Validation accuracy: 0.822
Test accuracy: 0.798


-----
### Exercise 2: Fine-tuning Distilbert

#### Exercise 2.1: Token Preprocessing

In [13]:
def tokenize(dataset_element):
    encoded = tokenizer(dataset_element["text"])

    return {
        "text": dataset_element["text"],
        "label": dataset_element["label"],
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
    }

In [14]:
tokenized_train = dataset["train"].map(tokenize)
tokenized_val = dataset["validation"].map(tokenize)
tokenized_test = dataset["test"].map(tokenize)

In [15]:
print(tokenized_test[29])

{'text': 'a soul-stirring documentary about the israeli/palestinian conflict as revealed through the eyes of some children who remain curious about each other against all odds .', 'label': 1, 'input_ids': [101, 1037, 3969, 1011, 18385, 4516, 2055, 1996, 5611, 1013, 9302, 4736, 2004, 3936, 2083, 1996, 2159, 1997, 2070, 2336, 2040, 3961, 8025, 2055, 2169, 2060, 2114, 2035, 10238, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base DistilBERT model for fine-tuning on a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.

In [16]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
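
The warning above lists four newly initialized tensors: a `pre_classifier` and a `classifier` layer, each with a weight and a bias. As a rough, library-free sketch of what such a head computes, the snippet below takes the hidden state at position 0 (the `[CLS]` token) and passes it through two linear layers. It assumes a ReLU between the two layers and leaves out dropout; the toy sizes and the helper names (`linear`, `random_linear`, `classification_head`) are illustrative and not taken from the `transformers` source.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector, with `weight` stored as the rows of W."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def random_linear(in_dim, out_dim, rng):
    """Randomly initialized layer, standing in for the 'newly initialized' weights."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def classification_head(last_hidden_state, pre, cls):
    """Map (batch, seq_len, hidden) activations to (batch, num_labels) logits."""
    logits = []
    for sequence in last_hidden_state:
        pooled = sequence[0]                                  # position 0 = [CLS]
        hidden = [max(0.0, h) for h in linear(pooled, *pre)]  # pre_classifier + ReLU (assumed)
        logits.append(linear(hidden, *cls))                   # classifier: one score per label
    return logits

rng = random.Random(0)
hidden_size, num_labels = 4, 2   # toy sizes; the real model uses a hidden size of 768
pre = random_linear(hidden_size, hidden_size, rng)
cls = random_linear(hidden_size, num_labels, rng)

# One fake sequence of 5 token vectors, i.e. a tensor of shape (1, 5, 4).
batch = [[[rng.random() for _ in range(hidden_size)] for _ in range(5)]]
logits = classification_head(batch, pre, cls)
print(len(logits), len(logits[0]))
```

Only the first token vector of each sequence reaches the head, which is why the position of `[CLS]` checked in Exercise 1.3 matters.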


#### Exercise 2.3: Fine-tuning Distilbert

Finally. In this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and your function to compute performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.
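
To make step 1 a little more concrete, here is a minimal, library-free sketch of what dynamic padding amounts to: every example in a batch is extended to the length of the longest one, and the attention mask marks the added positions with 0. The helper `pad_batch` is hypothetical, the pad token id of 0 is an assumption about this tokenizer, and the token ids other than 101 and 102 are made up. The real `DataCollatorWithPadding` also converts the result to tensors.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padding goes on the right; masked positions are ignored by attention.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

# Two tokenized examples of different lengths (101 = [CLS], 102 = [SEP]).
features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 2003, 2307, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding is decided per batch rather than once for the whole dataset, short reviews are not padded out to the length of the longest review in the corpus.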

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index), which provides evaluation metrics. However, I found that its insufferable layers of abstraction get in the way of actually computing metrics. I suggest just using the Scikit-learn metrics...

In [56]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [57]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  
    acc = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}


In [42]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output_dir",     
    overwrite_output_dir=True,
    do_train=True,
    eval_strategy="epoch",
    num_train_epochs=100,
    logging_dir="./logs/exercise2_3",
    logging_strategy="epoch",
    metric_for_best_model="accuracy",
    save_strategy="epoch",
    save_only_model=True,
    save_total_limit=1,           
    load_best_model_at_end=True,  
    seed=29,
    disable_tqdm=False,
)

PyTorch: setting up devices
average_tokens_across_devices is True but world size is 1. Setting it to False automatically.
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [43]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


In [44]:
trainer.train()

The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8,530
  Num Epochs = 100
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 106,700
  Number of trainable parameters = 66,955,010


Epoch,Training Loss,Validation Loss


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1066
  Batch size = 8
Saving model checkpoint to ./output_dir\checkpoint-1067
Configuration saved in ./output_dir\checkpoint-1067\config.json
Model weights saved in ./output_dir\checkpoint-1067\model.safetensors
tokenizer config file saved in ./output_dir\checkpoint-1067\tokenizer_config.json
Special tokens file saved in ./output_dir\checkpoint-1067\special_tokens_map.json

TrainOutput(global_step=106700, training_loss=0.00850103682809972, metrics={'train_runtime': 10427.2789, 'train_samples_per_second': 81.805, 'train_steps_per_second': 10.233, 'total_flos': 9804603488767296.0, 'train_loss': 0.00850103682809972, 'epoch': 100.0})

In [45]:
# Evaluate on validation set
eval_results = trainer.evaluate()
print(eval_results)

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1066
  Batch size = 8


{'eval_loss': 1.0913594961166382, 'eval_accuracy': 0.8433395872420263, 'eval_precision': 0.8546511627906976, 'eval_recall': 0.8273921200750469, 'eval_f1': 0.8408007626310772, 'eval_runtime': 15.8252, 'eval_samples_per_second': 67.361, 'eval_steps_per_second': 8.468, 'epoch': 100.0}


In [58]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding

# Data collator (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    eval_strategy="epoch",      
    save_strategy="epoch",  
    save_total_limit=1,           
    load_best_model_at_end=True,   
    metric_for_best_model="accuracy",
    greater_is_better=True,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=30,
    logging_strategy="epoch" 
)

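# Trainer. Note: as the notebook is written, `model` is the same object that was
# already fine-tuned in the 100-epoch run above, so this run continues from those
# weights rather than from the pretrained checkpoint.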
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train and evaluate
trainer.train()

PyTorch: setting up devices
average_tokens_across_devices is True but world size is 1. Setting it to False automatically.
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the Training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8,530
  Num Epochs = 30
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 4,020
  Number of trainable parameters = 66,955,010


Epoch,Training Loss,Validation Loss


The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1066
  Batch size = 64
Saving model checkpoint to ./results\checkpoint-134
Configuration saved in ./results\checkpoint-134\config.json
Model weights saved in ./results\checkpoint-134\model.safetensors
tokenizer config file saved in ./results\checkpoint-134\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-134\special_tokens_map.json


TrainOutput(global_step=4020, training_loss=0.0009202774501549112, metrics={'train_runtime': 3209.7005, 'train_samples_per_second': 79.727, 'train_steps_per_second': 1.252, 'total_flos': 3698470624796808.0, 'train_loss': 0.0009202774501549112, 'epoch': 30.0})
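
The "Total optimization steps" values in the two training logs above can be checked by hand: each epoch takes one step per batch, and the last, smaller batch still counts. The sketch below assumes a single device and a gradient accumulation of 1, as both logs report; the function name is hypothetical.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps when every batch, including the last partial one, is one step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# First run: batch size 8 for 100 epochs; second run: batch size 64 for 30 epochs.
first_run = total_optimization_steps(8530, batch_size=8, num_epochs=100)
second_run = total_optimization_steps(8530, batch_size=64, num_epochs=30)
print(first_run, second_run)  # 106700 4020
```

The per-epoch step counts (1067 and 134) also match the names of the first checkpoints saved in each run, `checkpoint-1067` and `checkpoint-134`.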

In [62]:
import pandas as pd

results = trainer.evaluate()

# Show the metrics of the run that just finished, not the earlier `eval_results`.
df_results = pd.DataFrame([results])
df_results

The following columns in the Evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1066
  Batch size = 64


| | eval_loss | eval_accuracy | eval_precision | eval_recall | eval_f1 | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.55037 | 0.847092 | 0.855769 | 0.834897 | 0.845204 | 2.9078 | 366.598 | 5.846 | 30.0 |


-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* Distilbert model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune Distilbert on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?
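
To get a feel for how much cheaper these options might be, the following back-of-the-envelope sketch counts trainable parameters for two strategies: training only the classification head, and adding low-rank (LoRA-style) adapters on top of that. Several numbers here are assumptions rather than facts from this notebook: 6 transformer layers, a 768-to-768 `pre_classifier` followed by a 768-to-2 `classifier`, and rank-8 adapters on two square projection matrices per layer. Only the total of 66,955,010 parameters comes from the training log.

```python
HIDDEN = 768                # hidden size seen in the Exercise 1 outputs
NUM_LAYERS = 6              # assumption: transformer blocks in distilbert-base-uncased
NUM_LABELS = 2              # binary sentiment labels
TOTAL_PARAMS = 66_955_010   # "Number of trainable parameters" from the training log

def linear_params(in_dim, out_dim):
    """Weights plus biases of one fully connected layer."""
    return in_dim * out_dim + out_dim

def lora_params(rank, adapted_per_layer=2):
    """Each adapted HIDDEN x HIDDEN matrix gets A (rank x HIDDEN) and B (HIDDEN x rank)."""
    return NUM_LAYERS * adapted_per_layer * rank * (HIDDEN + HIDDEN)

# Strategy 1: freeze the backbone and train only the head (assumed shapes).
head_only = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)

# Strategy 2: head plus rank-8 adapters on two projections per layer (assumed setup).
lora_plus_head = lora_params(rank=8) + head_only

for name, count in [("head only", head_only), ("LoRA + head", lora_plus_head)]:
    print(f"{name}: {count:,} trainable ({100 * count / TOTAL_PARAMS:.2f}% of the full model)")
```

Under these assumptions both strategies train roughly 1% of the parameters, and the head-only option needs no change to the `Trainer` setup beyond freezing the backbone weights.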

In [22]:
# Your code here.

#### Exercise 3.2: Fine-tuning a CLIP Model (harder)

Use a (small) CLIP model like [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16) and evaluate its zero-shot performance on a small image classification dataset like ImageNette or TinyImageNet. Fine-tune (using a parameter-efficient method!) the CLIP model to see how much improvement you can squeeze out of it.

**Note**: There are several ways to adapt the CLIP model; you could fine-tune the image encoder, the text encoder, or both. Or, you could experiment with prompt learning.

**Tip**: CLIP probably already works very well on ImageNet and ImageNet-like images. For extra fun, look for an image classification dataset with different image types (e.g. *sketches*).
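
As a reminder of what "zero-shot" means for CLIP, the sketch below classifies an image by comparing its embedding with the embeddings of one text prompt per class and taking a softmax over the scaled cosine similarities. Everything numeric here is made up for illustration: the three-dimensional embeddings stand in for real encoder outputs, the prompt template is one common choice, and the `logit_scale` of 100 is an assumed temperature, not a value read from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities, one probability per class prompt."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    peak = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["dog", "cat", "truck"]
prompts = [f"a photo of a {name}" for name in class_names]

# Toy embeddings standing in for the text and image encoder outputs.
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
predicted = class_names[probs.index(max(probs))]
print(predicted)
```

No classifier is trained at any point: changing the set of classes only means changing the list of prompts, which is also why prompt learning is one of the adaptation options mentioned above.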

In [23]:
# Your code here.

#### Exercise 3.3: Choose your Own Adventure

There are a *ton* of interesting and fun models on the HuggingFace hub. Pick one that does something interesting and adapt it in some way to a new task. Or, combine two or more models into something more interesting or fun. The sky's the limit.

**Note**: Reach out to me by email or on the Discord if you are unsure about anything.

In [24]:
# Your code here.