## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

### Exercise 1: Sentiment Analysis (warm up)

In this first exercise we will start from a pre-trained BERT transformer and build up a model able to perform text sentiment analysis. Transformers are complex beasts, so we will build up our pipeline in several explorative and incremental steps.

#### Exercise 1.1: Dataset Splits and Pre-trained model
There are many sentiment analysis datasets, but we will use one of the smallest ones available: the [Cornell Rotten Tomatoes movie review dataset](cornell-movie-review-data/rotten_tomatoes), which consists of 5,331 positive and 5,331 negative processed sentences from the Rotten Tomatoes movie reviews.

**Your first task**: Load the dataset and figure out what splits are available and how to get them. Spend some time exploring the dataset to see how it is organized. Note that we will be using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/en/index) library for downloading, accessing, splitting, and batching data for training and evaluation.

## **Import**

In [None]:
# Import standard libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#import models
from transformers import AutoTokenizer, AutoModel
from transformers import pipeline

# Import PyTorch
import torch

#import Datasets
from datasets import load_dataset


### ***Further settings***

In [2]:
# Set device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# fix random seed for reproducibility
torch.manual_seed(808)

<torch._C.Generator at 0x2121ce5f5f0>

### **Load the dataset and split checking**

In [3]:

# Load the dataset from HuggingFace (all splits, then each split on its own)
dataset_id = "cornell-movie-review-data/rotten_tomatoes"
dataset = load_dataset(dataset_id)
ds_train = load_dataset(dataset_id, split="train")
ds_test = load_dataset(dataset_id, split="test")
ds_validation = load_dataset(dataset_id, split="validation")

# Available splits and label values
print("Available splits:", dataset.keys())
print(f"Labels in the dataset: {np.unique(ds_train['label'])}")

# Number of examples in each split
for split_name, split_data in dataset.items():
    print(f"{split_name}: {len(split_data)} examples")

# Some rows from the training set
print("\nExamples from the training set:")
print(dataset["train"][0])   # first example
print(dataset["train"][4])   # fifth example



Available splits: dict_keys(['train', 'validation', 'test'])
Labels in the dataset: [0 1]
train: 8530 examples
validation: 1066 examples
test: 1066 examples

Examples from the training set:
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
{'text': "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one .", 'label': 1}


#### Exercise 1.2: A Pre-trained BERT and Tokenizer

The model we will use is a *very* small BERT transformer called [Distilbert](https://huggingface.co/distilbert/distilbert-base-uncased). This model was trained (using self-supervised learning) on the same corpus as BERT, using the full BERT base model as a *teacher*.

**Your next task**: Load the Distilbert model and corresponding tokenizer. Use the tokenizer on a few samples from the dataset and pass the tokens through the model to see what outputs are provided. I suggest you use the [`AutoModel`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class (and the `from_pretrained()` method) to load the model and `AutoTokenizer` to load the tokenizer.

### ***DistilBERT***

In [None]:

# Load the pre-trained DistilBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Take a single sample from the training set
sample_text = ds_train[8]["text"]
print("Sample text:", sample_text)

# Tokenize the sample
inputs = tokenizer(sample_text, return_tensors="pt", truncation=True, padding=True)

# Forward pass without gradient calculation (faster, less memory)
with torch.no_grad():
    outputs = model(**inputs)

print(f"\nmodel output: {outputs.keys()}")

# Extract the last hidden states
last_hidden_states = outputs.last_hidden_state


print("\nShape of last hidden states:", last_hidden_states.shape)
print("Last hidden states for the sample:")
print(last_hidden_states)


Sample text: steers turns in a snappy screenplay that curls at the edges ; it's so clever you want to hate it . but he somehow pulls it off .

model output: odict_keys(['last_hidden_state'])

Shape of last hidden states: torch.Size([1, 34, 768])
Last hidden states for the sample:
tensor([[[-0.1363, -0.0400,  0.0322,  ..., -0.1565,  0.5009,  0.2881],
         [ 0.2580,  0.4807,  0.1554,  ...,  0.0087,  0.6204, -0.1539],
         [-0.3299, -0.2383,  0.1396,  ..., -0.1244,  0.6183, -0.1483],
         ...,
         [ 0.0929, -0.4927,  0.1931,  ...,  0.0601,  0.1228, -0.1329],
         [ 0.6391,  0.1454, -0.3272,  ..., -0.0744, -0.1469, -0.4849],
         [-0.2033,  0.6055,  0.6094,  ..., -0.0866,  0.5897,  0.0293]]])


#### Exercise 1.3: A Stable Baseline

In this exercise I want you to:
1. Use Distilbert as a *feature extractor* to extract representations of the text strings from the dataset splits.
2. Train a classifier (your choice, but an SVM from Scikit-learn is an easy choice).
3. Evaluate performance on the validation and test splits.

These results are our *stable baseline* -- the **starting** point on which we will (hopefully) improve in the next exercise.

**Hint**: There are a number of ways to implement the feature extractor, but probably the best is to use a [feature extraction `pipeline`](https://huggingface.co/tasks/feature-extraction). You will need to interpret the output of the pipeline and extract only the `[CLS]` token from the *last* transformer layer. *How can you figure out which output that is?*
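
One way to answer the question in the hint is to look at the shape of what the pipeline returns. The sketch below assumes that, for a list of texts, the feature-extraction pipeline returns one nested list per text with shape `[1, seq_len, hidden_dim]`, and it uses random mock data (with a shrunken `hidden_dim` of 4) instead of a real model so that it runs offline. The helper name `cls_from_pipeline_output` is hypothetical.

```python
import numpy as np

def cls_from_pipeline_output(features):
    """Return an (n_texts, hidden_dim) array of [CLS] vectors.

    Assumes each element of `features` has shape [1, seq_len, hidden_dim],
    where seq_len may differ from text to text.
    """
    # f[0] drops the leading batch axis; f[0][0] is the first token, i.e. [CLS].
    return np.array([f[0][0] for f in features])

# Mock pipeline output for two texts of different lengths (assumption:
# random numbers stand in for real hidden states).
rng = np.random.default_rng(0)
mock_features = [
    rng.normal(size=(1, 6, 4)).tolist(),
    rng.normal(size=(1, 9, 4)).tolist(),
]

cls_vectors = cls_from_pipeline_output(mock_features)
print(cls_vectors.shape)  # (2, 4)
```

Indexing only once (`f[0]`) would keep the full `(seq_len, hidden_dim)` matrix for every text, and those matrices cannot be stacked because their lengths differ.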

### **Feature extraction with Distilbert and SVM (Linear) Training**

### *Checking CLS token position*

In [22]:
text = " welcome to Florence!"
encoding = tokenizer(text, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])

print("Tokens:", tokens)

Tokens: ['[CLS]', 'welcome', 'to', 'florence', '!', '[SEP]']


In [None]:
# Initialize feature-extraction pipeline with DistilBERT
feature_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

def extract_cls_embeddings(texts):
    # Extract the [CLS] token embedding for each text in a batch
    features = feature_extractor(texts, truncation=True, padding=True)

    # Each element of `features` has shape [1, seq_len, hidden_dim]:
    # f[0] drops the batch axis and f[0][0] is the first token, i.e. [CLS]
    cls_embeddings = [f[0][0] for f in features]
    return np.array(cls_embeddings)

X_train = extract_cls_embeddings(list(ds_train["text"]))
y_train = np.array(ds_train["label"])

X_val = extract_cls_embeddings(list(ds_validation["text"]))
y_val = np.array(ds_validation["label"])

X_test = extract_cls_embeddings(list(ds_test["text"]))
y_test = np.array(ds_test["label"])

# Train an SVM classifier
clf = SVC(kernel='linear') 
clf.fit(X_train, y_train)

# Evaluate on validation and test sets
val_preds = clf.predict(X_val)
test_preds = clf.predict(X_test)

val_acc = accuracy_score(y_val, val_preds)
test_acc = accuracy_score(y_test, test_preds)

print(f"Validation accuracy: {val_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")


Device set to use cuda:0


In [25]:
# Initialize feature-extraction pipeline with DistilBERT
feature_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

def extract_token_embeddings(texts):
    # Diagnostic: inspect what the pipeline returns for a batch of texts
    features = feature_extractor(texts, truncation=True, padding=True)
    # f[0] only drops the leading batch axis, so each entry is still a
    # (seq_len, 768) matrix; the [CLS] vector is its first row, f[0][0]
    return [f[0] for f in features]  # list of per-text matrices, not an np.array

# Inspect the shapes for a sample batch: the first 60 texts from the training set
sample_texts = list(ds_train["text"])[:60]
token_embeddings = extract_token_embeddings(sample_texts)

for i, emb in enumerate(token_embeddings):
    print(f"Embedding {i} shape: {np.array(emb).shape}")


Device set to use cuda:0


Embedding 0 shape: (47, 768)
Embedding 1 shape: (52, 768)
Embedding 2 shape: (10, 768)
Embedding 3 shape: (24, 768)
Embedding 4 shape: (28, 768)
Embedding 5 shape: (32, 768)
Embedding 6 shape: (11, 768)
Embedding 7 shape: (22, 768)
Embedding 8 shape: (34, 768)
Embedding 9 shape: (17, 768)
Embedding 10 shape: (17, 768)
Embedding 11 shape: (19, 768)
Embedding 12 shape: (28, 768)
Embedding 13 shape: (16, 768)
Embedding 14 shape: (15, 768)
Embedding 15 shape: (26, 768)
Embedding 16 shape: (9, 768)
Embedding 17 shape: (10, 768)
Embedding 18 shape: (26, 768)
Embedding 19 shape: (16, 768)
Embedding 20 shape: (13, 768)
Embedding 21 shape: (42, 768)
Embedding 22 shape: (21, 768)
Embedding 23 shape: (27, 768)
Embedding 24 shape: (21, 768)
Embedding 25 shape: (15, 768)
Embedding 26 shape: (5, 768)
Embedding 27 shape: (22, 768)
Embedding 28 shape: (46, 768)
Embedding 29 shape: (30, 768)
Embedding 30 shape: (15, 768)
Embedding 31 shape: (27, 768)
Embedding 32 shape: (9, 768)
Embedding 33 shape: (35

-----
### Exercise 2: Fine-tuning Distilbert

In this exercise we will fine-tune the Distilbert model to (hopefully) improve sentiment analysis performance.

#### Exercise 2.1: Token Preprocessing

The first thing we need to do is *tokenize* our dataset splits. Our current datasets return a dictionary with *strings*, but we want *input token ids* (i.e. the output of the tokenizer). This is easy enough to do by hand, but the HuggingFace `Dataset` class provides convenient, efficient, and *lazy* methods. See the documentation for [`Dataset.map`](https://huggingface.co/docs/datasets/v3.5.0/en/package_reference/main_classes#datasets.Dataset.map).

**Tip**: Verify that your new datasets return, for every element: `text`, `label`, `input_ids`, and `attention_mask`.

In [None]:
# Your code here.

#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base Distilbert model for fine-tuning for a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.
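
To make the idea of a head "connected to the `[CLS]` token" concrete, the following is a conceptual sketch in plain PyTorch. The class `ClsHead` is hypothetical and is not the head that `AutoModelForSequenceClassification` builds (which may contain additional layers); it only illustrates selecting the first token of the last hidden state and projecting it to class logits. A random tensor stands in for the encoder output.

```python
import torch
from torch import nn

class ClsHead(nn.Module):
    """Hypothetical linear classifier over the [CLS] hidden state."""

    def __init__(self, hidden_dim=768, num_labels=2):
        super().__init__()
        # Randomly initialized; these weights are what fine-tuning must learn.
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, last_hidden_state):
        cls = last_hidden_state[:, 0]   # (batch, hidden_dim): first token only
        return self.classifier(cls)     # (batch, num_labels): one logit per class

torch.manual_seed(0)
hidden = torch.randn(4, 34, 768)  # stand-in for encoder output (batch, seq_len, hidden)
head = ClsHead()
logits = head(hidden)
print(logits.shape)  # torch.Size([4, 2])
```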

In [1]:
# Your code here.

#### Exercise 2.3: Fine-tuning Distilbert

Finally. In this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and function to compute performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.
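
The evaluation function from step 2 can be sketched as follows. This is one possible version for the binary case, written with NumPy only so that it runs anywhere; the Scikit-learn metrics would work equally well in its place. The example logits and labels are made up for illustration.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Compute binary classification metrics from a (logits, labels) pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class = largest logit
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "accuracy": float(np.mean(preds == labels)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up logits for four examples and their true labels.
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.5], [-0.3, 0.8]])
labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((logits, labels))
print(metrics)
```

Here the third example is misclassified, so the accuracy is 0.75, the precision 1.0, the recall 2/3, and the F1 score 0.8.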

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index) which provides evaluation metrics. However, I found that its insufferable layers of abstraction get in the way of actually computing metrics. I suggest just using the Scikit-learn metrics...

In [2]:
# Your code here.

-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* Distilbert model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune Distilbert on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?
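
One possible answer to the question in the hint is to freeze most of the parameters and train only a small part of the model, which leaves the training loop untouched. The sketch below shows this on a hypothetical `TinyClassifier` standing in for the real model; the helper names and the `"classifier"` prefix are assumptions, and for Distilbert you would need to inspect `named_parameters()` to find the right prefixes.

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in for an encoder followed by a classification head."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
        self.classifier = nn.Linear(32, 2)

    def forward(self, x):
        return self.classifier(self.encoder(x))

def freeze_all_but(model, trainable_prefixes):
    # Keep gradients only for parameters whose name starts with a given prefix.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(tuple(trainable_prefixes))

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

tiny_model = TinyClassifier()
total = count_trainable(tiny_model)

freeze_all_but(tiny_model, ["classifier"])
trainable = count_trainable(tiny_model)
print(f"trainable parameters: {trainable} / {total}")
```

Frozen parameters receive no gradients, so the optimizer has far less state to keep and update, at the cost of a less flexible model.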

In [3]:
# Your code here.

#### Exercise 3.2: Fine-tuning a CLIP Model (harder)

Use a (small) CLIP model like [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16) and evaluate its zero-shot performance on a small image classification dataset like ImageNette or TinyImageNet. Fine-tune (using a parameter-efficient method!) the CLIP model to see how much improvement you can squeeze out of it.
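
Zero-shot classification with CLIP boils down to comparing an image embedding against one text embedding per class and picking the most similar one. The sketch below illustrates only that comparison step, using NumPy and made-up embeddings: the class names, the embedding dimension of 8, and the noisy "image" vectors are all assumptions standing in for the outputs of the real CLIP image and text encoders.

```python
import numpy as np

def zero_shot_predict(image_embeds, text_embeds):
    """Assign each image to the class whose text embedding is most similar."""
    # L2-normalize so that the dot product equals the cosine similarity.
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    similarity = image_embeds @ text_embeds.T  # (n_images, n_classes)
    return similarity.argmax(axis=-1)

rng = np.random.default_rng(0)
class_names = ["dog", "church", "truck"]  # hypothetical class names

# Stand-in text embeddings, one per class prompt (orthogonal for simplicity).
text_embeds = np.eye(len(class_names), 8)

# Stand-in image embeddings: the matching text embedding plus a little noise.
true_labels = np.array([2, 0, 1, 1])
image_embeds = text_embeds[true_labels] + 0.05 * rng.normal(size=(4, 8))

preds = zero_shot_predict(image_embeds, text_embeds)
accuracy = float(np.mean(preds == true_labels))
print([class_names[i] for i in preds], accuracy)
```

With a real model, the same accuracy computation over a labeled validation set gives the zero-shot baseline against which any fine-tuned variant can be compared.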

**Note**: There are several ways to adapt the CLIP model; you could fine-tune the image encoder, the text encoder, or both. Or, you could experiment with prompt learning.

**Tip**: CLIP probably already works very well on ImageNet and ImageNet-like images. For extra fun, look for an image classification dataset with different image types (e.g. *sketches*).

In [4]:
# Your code here.

#### Exercise 3.3: Choose your Own Adventure

There are a *ton* of interesting and fun models on the HuggingFace hub. Pick one that does something interesting and adapt it in some way to a new task. Or, combine two or more models into something more interesting or fun. The sky's the limit.

**Note**: Reach out to me by email or on the Discord if you are unsure about anything.

In [5]:
# Your code here.