## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

### Exercise 1: Sentiment Analysis (warm up)

In this first exercise we will start from a pre-trained BERT transformer and build up a model able to perform text sentiment analysis. Transformers are complex beasts, so we will build up our pipeline in several explorative and incremental steps.

#### Exercise 1.1: Dataset Splits and Pre-trained model
There are many sentiment analysis datasets, but we will use one of the smallest ones available: the [Cornell Rotten Tomatoes movie review dataset](cornell-movie-review-data/rotten_tomatoes), which consists of 5,331 positive and 5,331 negative processed sentences from the Rotten Tomatoes movie reviews.

**Your first task**: Load the dataset and figure out what splits are available and how to get them. Spend some time exploring the dataset to see how it is organized. Note that we will be using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/en/index) library for downloading, accessing, splitting, and batching data for training and evaluation.

In [None]:
import torch
import tqdm as notebook_tqdm
from datasets import load_dataset


# Load the Rotten Tomatoes dataset from the Hugging Face Hub
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")
for split in ds:
    print(f"Split '{split}' contains {len(ds[split])} examples")

# Visualize the structure of the dataset
print("\nData structure:")
print(ds["train"].features)

for i in range(3):
    example = ds["train"][i]
    print(f"\nExample {i+1}:")
    print(f"Text: {example['text']}")
    print(f"Label: {example['label']} (0=negative, 1=positive)")



Split 'train' contains 8530 examples
Split 'validation' contains 1066 examples
Split 'test' contains 1066 examples

Data structure:
{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}

Example 1:
Text: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
Label: 1 (0=negative, 1=positive)

Example 2:
Text: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
Label: 1 (0=negative, 1=positive)

Example 3:
Text: effective but too-tepid biopic
Label: 1 (0=negative, 1=positive)


#### Exercise 1.2: A Pre-trained BERT and Tokenizer

The model we will use is a *very* small BERT transformer called [Distilbert](https://huggingface.co/distilbert/distilbert-base-uncased). This model was trained (using self-supervised learning) on the same corpus as BERT, but using the full BERT base model as a *teacher*.

**Your next task**: Load the Distilbert model and corresponding tokenizer. Use the tokenizer on a few samples from the dataset and pass the tokens through the model to see what outputs are provided. I suggest you use the [`AutoModel`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class (and its `from_pretrained()` method) to load the model and `AutoTokenizer` to load the tokenizer.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel

# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a batch of texts
texts = [ds["train"][0]["text"], ds["train"][1]["text"]]  # First two examples from the training set
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print("\nTokenized inputs:")
print(inputs)

with torch.no_grad():  # no gradient calculation needed
    outputs = model(**inputs)

# Inspect the model outputs
print("\nModel outputs:")
print(outputs)
print("\nLast hidden state shape:", outputs.last_hidden_state.shape)


Tokenized inputs:
{'input_ids': tensor([[  101,  1996,  2600,  2003, 16036,  2000,  2022,  1996,  7398,  2301,
          1005,  1055,  2047,  1000, 16608,  1000,  1998,  2008,  2002,  1005,
          1055,  2183,  2000,  2191,  1037, 17624,  2130,  3618,  2084,  7779,
         29058,  8625, 13327,  1010,  3744,  1011, 18856, 19513,  3158,  5477,
          4168,  2030,  7112, 16562,  2140,  1012,   102,     0,     0,     0,
             0,     0],
        [  101,  1996,  9882,  2135,  9603, 13633,  1997,  1000,  1996,  2935,
          1997,  1996,  7635,  1000, 11544,  2003,  2061,  4121,  2008,  1037,
          5930,  1997,  2616,  3685, 23613,  6235,  2522,  1011,  3213,  1013,
          2472,  2848,  4027,  1005,  1055,  4423,  4432,  1997,  1046,  1012,
          1054,  1012,  1054,  1012, 23602,  1005,  1055,  2690,  1011,  3011,
          1012,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1

#### Exercise 1.3: A Stable Baseline

In this exercise I want you to:
1. Use Distilbert as a *feature extractor* to extract representations of the text strings from the dataset splits;
2. Train a classifier (your choice, but an SVM from Scikit-learn is an easy choice).
3. Evaluate performance on the validation and test splits.

These results are our *stable baseline* -- the **starting** point on which we will (hopefully) improve in the next exercise.

**Hint**: There are a number of ways to implement the feature extractor, but probably the best is to use a [feature extraction `pipeline`](https://huggingface.co/tasks/feature-extraction). You will need to interpret the output of the pipeline and extract only the `[CLS]` token from the *last* transformer layer. *How can you figure out which output that is?*
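
One way to answer that question is to look at shapes. The sketch below assumes that the `feature-extraction` pipeline returns, for each input text, a nested list of shape `(1, seq_len, hidden_size)`. The arrays are random stand-ins rather than real model output, and `fake_pipeline_output` is a hypothetical helper introduced only for illustration. Since the tokenizer places `[CLS]` (id 101 in the tokenized output above) at position 0, the `[CLS]` vector of each text is element `[0, 0]`.

```python
import numpy as np

def fake_pipeline_output(seq_len, hidden_size=768, seed=0):
    """Hypothetical stand-in for one per-text result: (1, seq_len, hidden_size)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(1, seq_len, hidden_size)).tolist()

# Two texts of different lengths, one nested list per text.
outputs = [fake_pipeline_output(47, seed=1), fake_pipeline_output(52, seed=2)]

for out in outputs:
    arr = np.array(out)
    print(arr.shape)  # axes: batch (size 1), tokens, features

# Position 0 on the token axis is [CLS]; keep only that vector for each text.
cls_features = np.stack([np.array(out)[0, 0] for out in outputs])
print(cls_features.shape)  # (2, 768)
```

The result is one fixed-size vector per text, which is the form a Scikit-learn classifier expects.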

In [None]:
from transformers import pipeline
import numpy as np

def extract_features(texts, batch_size=32):
    """Extract features from a list of texts using a pre-trained transformer model.
     Args:
         texts (list of str): List of input texts.
         batch_size (int): Number of texts to process in each batch.
     
     Returns:
         np.ndarray: Array of extracted features.
     """
    
    feature_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
    all_features = []
    

    for i in range(0, len(texts), batch_size):
        # Select the texts for the current batch
        batch_texts = texts[i:i + batch_size]
        # Extract features for the current batch
        outputs = feature_extractor(batch_texts)
        # Extract the CLS token features (first token) from each output
        batch_features = [np.array(output)[0, 0] for output in outputs]
        all_features.extend(batch_features)
    
    return np.array(all_features)




In [None]:
# Extract features and labels for each split
train_features = extract_features(ds['train']['text'])
train_labels = ds['train']['label']
print("\nTrain features shape:", train_features.shape)
print("Train labels length:", len(train_labels))

val_features = extract_features(ds['validation']['text'])
val_labels = ds['validation']['label']
print("\nValidation features shape:", val_features.shape)
print("Validation labels length:", len(val_labels))

test_features = extract_features(ds['test']['text'])
test_labels = ds['test']['label']
print("\nTest features shape:", test_features.shape)
print("Test labels length:", len(test_labels))

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Device set to use cuda:0



Train features shape: (8530, 768)
Train labels length: 8530


Device set to use cuda:0



Validation features shape: (1066, 768)
Validation labels length: 1066

Test features shape: (1066, 768)
Test labels length: 1066


In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load a LinearSVC classifier and train it on the extracted features
print("\nTraining LinearSVC...")
classifier = LinearSVC(max_iter=10000, random_state=42)
classifier.fit(train_features, train_labels)

# Evaluate the classifier on the validation and test sets
print("\nEvaluating on validation set...")
val_pred = classifier.predict(val_features)
val_accuracy = accuracy_score(val_labels, val_pred)
print(f"Validation Accuracy: {val_accuracy:.4f}")

# Test the classifier on the test set
print("\nEvaluating on test set...")
test_pred = classifier.predict(test_features)
test_accuracy = accuracy_score(test_labels, test_pred)
print(f"Test Accuracy: {test_accuracy:.4f}")
print("\nTest Classification Report:")
print(classification_report(test_labels, test_pred))


Training LinearSVC...

Evaluating on validation set...
Validation Accuracy: 0.8218

Evaluating on test set...
Test Accuracy: 0.7983

Test Classification Report:
              precision    recall  f1-score   support

           0       0.79      0.81      0.80       533
           1       0.81      0.78      0.80       533

    accuracy                           0.80      1066
   macro avg       0.80      0.80      0.80      1066
weighted avg       0.80      0.80      0.80      1066



-----
### Exercise 2: Fine-tuning Distilbert

In this exercise we will fine-tune the Distilbert model to (hopefully) improve sentiment analysis performance.

#### Exercise 2.1: Token Preprocessing

The first thing we need to do is *tokenize* our dataset splits. Our current datasets return a dictionary with *strings*, but we want *input token ids* (i.e. the output of the tokenizer). This is easy enough to do by hand, but the HuggingFace `Dataset` class provides convenient, efficient, and *lazy* methods. See the documentation for [`Dataset.map`](https://huggingface.co/docs/datasets/v3.5.0/en/package_reference/main_classes#datasets.Dataset.map).

**Tip**: Verify that your new datasets are returning for every element: `text`, `label`, `input_ids`, and `attention_mask`.
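
When mapping in batched mode, the mapped function receives a dictionary of *columns* (each a list) rather than a single example, and returns new columns of the same length. The following plain-Python mock illustrates only that calling convention. `toy_tokenize`, `toy_map`, and `VOCAB` are hypothetical names, the whitespace "tokenizer" is a stand-in, and none of this is the actual `Dataset.map` implementation.

```python
VOCAB = {}  # toy vocabulary, filled on the fly

def toy_tokenize(batch):
    """Toy whitespace 'tokenizer' following the batched calling convention."""
    input_ids, attention_mask = [], []
    for text in batch["text"]:
        ids = [VOCAB.setdefault(word, len(VOCAB) + 1) for word in text.split()]
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

def toy_map(columns, fn, batch_size=2):
    """Simplified imitation of a batched map over a dict of columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # The new columns are added next to the existing ones.
    return {**columns, **new_columns}

data = {
    "text": ["effective but too-tepid biopic", "a fine film", "not a fine film"],
    "label": [1, 1, 0],
}
mapped = toy_map(data, toy_tokenize)
print(list(mapped))  # ['text', 'label', 'input_ids', 'attention_mask']
print(mapped["input_ids"])
```

Checking the column names after mapping, as in the tip above, is a quick way to confirm the function returned what you expected.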

In [None]:
# Tokenization function for dataset mapping
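def tokenize(examples):
    """Tokenize a batch of examples (called by `Dataset.map` with `batched=True`).

    Args:
        examples (dict): Batch from the dataset; its "text" field is a list of strings.
    Returns:
        dict: Tokenizer output ("input_ids" and "attention_mask") for every text.
    """
    # Note: padding="max_length" pads every example to 512 tokens here, so the
    # DataCollatorWithPadding used later has no dynamic padding left to do.
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512, return_tensors=None)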

In [None]:
# Apply the tokenization function to the entire dataset using map
ds = ds.map(tokenize, batched=True, num_proc=4)
print(f"Columns: {ds['train'].column_names}")
print(f"example: {ds['train'][0].keys()} ")
print(f"example text: {ds['train'][0]['text']}")
print(f"example label: {ds['train'][0]['label']}")
print(f"example tokenized: {ds['train'][0]['input_ids']}")
print(f"example attention mask: {ds['train'][0]['attention_mask']}")

Columns: ['text', 'label', 'input_ids', 'attention_mask']
example: dict_keys(['text', 'label', 'input_ids', 'attention_mask']) 
example text: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
example label: 1
example tokenized: [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base Distilbert model for fine-tuning for a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.
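
To make the idea of a classification head concrete, here is a minimal NumPy sketch. It assumes a head of the form linear, ReLU, linear applied to the `[CLS]` vector, with dropout omitted. The names `W_pre` and `W_cls` mirror the `pre_classifier` and `classifier` weights that the loading message reports as newly initialized, but the initialization scale and the random "encoder output" are assumptions made for illustration only.

```python
import numpy as np

hidden_size, num_labels = 768, 2
rng = np.random.default_rng(0)

# Randomly initialized head parameters (the init scale is an assumption).
W_pre = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b_pre = np.zeros(hidden_size)
W_cls = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b_cls = np.zeros(num_labels)

def classification_head(last_hidden_state):
    """Map (batch, seq_len, hidden) encoder output to (batch, num_labels) logits."""
    cls_vector = last_hidden_state[:, 0]                # [CLS] token of the last layer
    hidden = np.maximum(cls_vector @ W_pre + b_pre, 0)  # pre_classifier + ReLU
    return hidden @ W_cls + b_cls                       # classifier

last_hidden_state = rng.normal(size=(4, 52, hidden_size))  # stand-in for encoder output
logits = classification_head(last_hidden_state)
print(logits.shape)  # (4, 2)

# Number of new parameters introduced by the head.
n_head_params = W_pre.size + b_pre.size + W_cls.size + b_cls.size
print(n_head_params)  # 592130
```

Only these head parameters start from random values; everything else is loaded from the pre-trained checkpoint.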

In [None]:
from transformers import AutoModelForSequenceClassification

# Use AutoModelForSequenceClassification to load a new classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, id2label={0: "negative", 1: "positive"}, label2id={"negative": 0, "positive": 1})

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Exercise 2.3: Fine-tuning Distilbert

Finally. In this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and function to compute performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.
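
The batch construction mentioned in step 1 can be sketched in a few lines of plain Python. This is a simplified illustration of dynamic padding, not the library's implementation: `pad_batch` is a hypothetical helper, and a padding id of 0 is assumed (it matches the zeros in the tokenized output shown earlier).

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

features = [
    {"input_ids": [101, 1996, 2600, 102], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 1996, 2600, 2003, 16036, 102], "attention_mask": [1, 1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch only pays off if the examples were *not* already padded to a fixed length during tokenization; sequences that are all 512 tokens long leave nothing to shorten.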

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index) which provides evaluation metrics. However, I found that its insufferable layers of abstraction get in the way of actually computing metrics. I suggest just using the Scikit-learn metrics...
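
For reference, this is what those metrics amount to when computed from scratch for the binary case. It is a sketch with toy logits, `binary_metrics` is a hypothetical name, and label 1 is assumed to be the positive class (the Scikit-learn default for `average='binary'`).

```python
import numpy as np

def binary_metrics(logits, labels, positive=1):
    """Accuracy, precision, recall and F1 from raw logits and integer labels."""
    predictions = np.argmax(logits, axis=-1)
    tp = int(np.sum((predictions == positive) & (labels == positive)))
    fp = int(np.sum((predictions == positive) & (labels != positive)))
    fn = int(np.sum((predictions != positive) & (labels == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = float(np.mean(predictions == labels))
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy logits for four examples; the predicted classes are [0, 1, 1, 0].
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 1, 1])
metrics = binary_metrics(logits, labels)
print(metrics)
```

Here one positive example is missed and there are no false positives, so precision is 1.0 while recall is 2/3.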

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(eval_predictions):
    """ Compute evaluation metrics.
        Args:
            eval_predictions (tuple): A tuple containing logits and true labels.
        Returns:
            dict: A dictionary with accuracy, precision, recall, and F1 score.
        """
    logits, labels = eval_predictions
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    accuracy = accuracy_score(labels, predictions)
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [None]:
import inspect
from transformers import TrainingArguments

# Inspect the signature of TrainingArguments to understand its parameters
print(inspect.signature(TrainingArguments.__init__))



In [None]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# load a data collator that will dynamically pad the inputs received
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Define training arguments
training_args = TrainingArguments(
    output_dir="./Distilbert_results",
    report_to="wandb",
    run_name="distilbert-rotten-tomatoes-run",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5, 
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,  
    logging_dir="./logs",
    logging_steps=10,
    fp16=True,  # Use mixed precision
    metric_for_best_model="accuracy",
    load_best_model_at_end=True)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=evaluate
)

# Train the model
trainer.train()
trainer.save_model("./Distilbert_rotten_tomatoes_model")
tokenizer.save_pretrained("./Distilbert_rotten_tomatoes_tokenizer")


wandb: Currently logged in as: elena-daveri00 (elena-daveri00-universit-di-firenze) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3094 | 0.353392 | 0.850844 | 0.859167 | 0.813758 | 0.909944 |
| 2 | 0.2215 | 0.389317 | 0.84334 | 0.846929 | 0.827957 | 0.866792 |
| 3 | 0.1333 | 0.479846 | 0.84803 | 0.85 | 0.839122 | 0.861163 |


('./Distilbert_rotten_tomatoes_tokenizer/tokenizer_config.json',
 './Distilbert_rotten_tomatoes_tokenizer/special_tokens_map.json',
 './Distilbert_rotten_tomatoes_tokenizer/vocab.txt',
 './Distilbert_rotten_tomatoes_tokenizer/added_tokens.json',
 './Distilbert_rotten_tomatoes_tokenizer/tokenizer.json')

In [12]:
trainer.evaluate(ds["test"])

{'eval_loss': 0.3902638852596283,
 'eval_accuracy': 0.8330206378986866,
 'eval_f1': 0.842756183745583,
 'eval_precision': 0.7963272120200334,
 'eval_recall': 0.8949343339587242,
 'eval_runtime': 92.8216,
 'eval_samples_per_second': 11.484,
 'eval_steps_per_second': 0.366,
 'epoch': 3.0}

-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* Distilbert model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune Distilbert on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?
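
LoRA is one of the methods available in PEFT and the one used in the solution below (with `r=8` and `lora_alpha=32`). The idea is to freeze a weight matrix $W$ and learn only a low-rank update, so that the layer computes $Wx + \frac{\alpha}{r} BAx$. The NumPy sketch below illustrates this under stated assumptions: the random initialization scales are placeholders, and the parameter arithmetic assumes 6 transformer layers of width 768 with four adapted projections each, plus a fully trainable classification head.

```python
import numpy as np

d, r, alpha = 768, 8, 32
rng = np.random.default_rng(0)

W = rng.normal(scale=0.02, size=(d, d))  # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))  # trainable, small random init (assumed scale)
B = np.zeros((d, r))                     # trainable, zero init

def lora_linear(x):
    """Frozen projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d))
# With B = 0 the adapted layer initially reproduces the frozen layer.
print(np.allclose(lora_linear(x), x @ W.T))  # True

# Trainable-parameter arithmetic: 4 adapted projections in each of 6 layers.
lora_params = 6 * 4 * (A.size + B.size)
head_params = (d * d + d) + (d * 2 + 2)  # pre_classifier + 2-label classifier
print(lora_params, head_params, lora_params + head_params)  # 294912 592130 887042
```

The total agrees with the trainable-parameter count printed by `print_trainable_parameters()` below, which suggests that both head layers remain trainable alongside the adapters; that reading is an inference from the arithmetic rather than something the output states.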

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
from peft import LoraConfig, get_peft_model
import numpy as np

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Define LoRA configuration
lora_config = LoraConfig(
    r=8,                      # rank of the low-rank update (typically 4-16)
    lora_alpha=32,            # scaling factor
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],  # LoRA target modules for DistilBERT
    lora_dropout=0.1,
    bias="none",              # "none" | "all" | "lora_only"
    task_type="SEQ_CLS"       # sequence classification
)
# Target modules in each DistilBERT attention block:
#   q_lin: query projection layer
#   k_lin: key projection layer
#   v_lin: value projection layer
#   out_lin: output projection layer

# Apply LoRA to the model.
# Note: if the cells were run in order, `model` is the instance already fine-tuned in
# Exercise 2.3; reload it with from_pretrained(model_name, ...) to start from the pre-trained weights.
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./distilbert-peft",
    report_to="wandb",
    run_name="distilbert-rotten-tomatoes-peft-run",
    eval_strategy="epoch",      
    save_strategy="epoch",
    learning_rate=2e-4,         # With LoRA we use a higher learning rate (e.g. 1e-4 to 5e-4)
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True        
)


trainable params: 887,042 || all params: 67,842,052 || trainable%: 1.3075


In [14]:

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=evaluate
)

trainer.train()
trainer.save_model("./Distilbert_rotten_tomatoes_peft_model")
tokenizer.save_pretrained("./Distilbert_rotten_tomatoes_peft_tokenizer")
trainer.evaluate(ds["test"])



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.2055 | 0.358162 | 0.861163 | 0.86422 | 0.845601 | 0.883677 |
| 2 | 0.2134 | 0.395202 | 0.85272 | 0.851185 | 0.860153 | 0.842402 |
| 3 | 0.1726 | 0.407888 | 0.855535 | 0.856075 | 0.852886 | 0.859287 |


{'eval_loss': 0.43398192524909973,
 'eval_accuracy': 0.8339587242026266,
 'eval_f1': 0.8371665133394665,
 'eval_precision': 0.8212996389891697,
 'eval_recall': 0.8536585365853658,
 'eval_runtime': 92.9287,
 'eval_samples_per_second': 11.471,
 'eval_steps_per_second': 0.366,
 'epoch': 3.0}

In [15]:
from datasets import load_dataset

# Load the Twitter Sentiment Analysis dataset from the Hugging Face Hub
# Same as before but with 3 labels (negative, neutral, positive) and a bigger dataset
ds_twitter = load_dataset("tweet_eval", "sentiment")
for split in ds_twitter:
    print(f"Split '{split}' contains {len(ds_twitter[split])} examples")

print("\nData structure:")
print(ds_twitter["train"].features)

for i in range(3):
    example = ds_twitter["train"][i]
    print(f"\nExample {i+1}:")
    print(f"Text: {example['text']}")
    print(f"Label: {example['label']} (0=negative, 1=neutral, 2=positive)")  # 3 labels

# Apply the tokenization function to the entire dataset using map
ds_twitter = ds_twitter.map(tokenize, batched=True, num_proc=4)
print(f"Columns: {ds_twitter['train'].column_names}")
print(f"example: {ds_twitter['train'][0].keys()} ")
print(f"example text: {ds_twitter['train'][0]['text']}")
print(f"example label: {ds_twitter['train'][0]['label']}")
print(f"example tokenized: {ds_twitter['train'][0]['input_ids']}")
print(f"example attention mask: {ds_twitter['train'][0]['attention_mask']}")

Split 'train' contains 45615 examples
Split 'test' contains 12284 examples
Split 'validation' contains 2000 examples

Data structure:
{'text': Value('string'), 'label': ClassLabel(names=['negative', 'neutral', 'positive'])}

Example 1:
Text: "QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"
Label: 2 (0=negative, 1=neutral, 2=positive)

Example 2:
Text: "Ben Smith / Smith (concussion) remains out of the lineup Thursday, Curtis #NHL #SJ"
Label: 1 (0=negative, 1=neutral, 2=positive)

Example 3:
Text: Sorry bout the stream last night I crashed out but will be on tonight for sure. Then back to Minecraft in pc tomorrow night.
Label: 1 (0=negative, 1=neutral, 2=positive)
Columns: ['text', 'label', 'input_ids', 'attention_mask']
example: dict_keys(['text', 'label', 'input_ids', 'attention_mask']) 
example text: "QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLu

In [16]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
from peft import LoraConfig, get_peft_model
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Load the pre-trained model and attach a new 3-label classification head
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=3,  # for Twitter sentiment: 0=negative, 1=neutral, 2=positive
    id2label={0: "negative", 1: "neutral", 2: "positive"},
    label2id={"negative": 0, "neutral": 1, "positive": 2}
)


lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)


peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()


def evaluate_multiclass(eval_predictions):
    logits, labels = eval_predictions
    predictions = np.argmax(logits, axis=-1)
    # Use 'macro' average for multi-class classification
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='macro')
    accuracy = accuracy_score(labels, predictions)
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Same training arguments as before
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=ds_twitter["train"],
    eval_dataset=ds_twitter["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=evaluate_multiclass
)

trainer.train()
trainer.save_model("./Distilbert_tweet_eval_peft_model")
tokenizer.save_pretrained("./Distilbert_tweet_eval_peft_tokenizer")
trainer.evaluate(ds_twitter["test"])

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 887,811 || all params: 67,843,590 || trainable%: 1.3086


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.642 | 0.656143 | 0.714 | 0.689136 | 0.725259 | 0.674762 |
| 2 | 0.593 | 0.634364 | 0.7245 | 0.694394 | 0.721402 | 0.679254 |
| 3 | 0.556 | 0.627691 | 0.729 | 0.713712 | 0.712907 | 0.714563 |


{'eval_loss': 0.6897343993186951,
 'eval_accuracy': 0.6872354281992836,
 'eval_f1': 0.6853383354151851,
 'eval_precision': 0.6778643904462941,
 'eval_recall': 0.6953055425995139,
 'eval_runtime': 1066.0211,
 'eval_samples_per_second': 11.523,
 'eval_steps_per_second': 0.36,
 'epoch': 3.0}

In [20]:
trainer.evaluate(ds_twitter["test"])

{'eval_loss': 0.7098513245582581,
 'eval_accuracy': 0.6803972647346141,
 'eval_f1': 0.6716565079575956,
 'eval_precision': 0.6818705188599271,
 'eval_recall': 0.6748206667169643,
 'eval_runtime': 1084.6351,
 'eval_samples_per_second': 11.325,
 'eval_steps_per_second': 0.354,
 'epoch': 3.0}

#### Exercise 3.2: Fine-tuning a CLIP Model (harder)

Use a (small) CLIP model like [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16) and evaluate its zero-shot performance on a small image classification dataset like ImageNette or TinyImageNet. Fine-tune (using a parameter-efficient method!) the CLIP model to see how much improvement you can squeeze out of it.

**Note**: There are several ways to adapt the CLIP model; you could fine-tune the image encoder, the text encoder, or both. Or, you could experiment with prompt learning.

**Tip**: CLIP probably already works very well on ImageNet and ImageNet-like images. For extra fun, look for an image classification dataset with different image types (e.g. *sketches*).
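
As a starting point, zero-shot classification with CLIP boils down to comparing an image embedding against the embeddings of one text prompt per class. The sketch below shows only that comparison step. The embeddings are random stand-ins (in practice they would come from the CLIP image and text encoders), the embedding size of 512 and the scale of 100 are assumptions, and the class names and the prompt template are illustrative.

```python
import numpy as np

def zero_shot_predict(image_embeds, text_embeds, logit_scale=100.0):
    """Class probabilities from cosine similarity between images and prompts."""
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = logit_scale * image_embeds @ text_embeds.T   # (n_images, n_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
class_names = ["tench", "church", "parachute"]          # illustrative labels
prompts = [f"a photo of a {name}" for name in class_names]
text_embeds = rng.normal(size=(len(prompts), 512))      # stand-in for encoded prompts

# Stand-in images: each one is a noisy copy of one prompt embedding.
true_labels = np.array([2, 0, 1, 1])
image_embeds = text_embeds[true_labels] + 0.1 * rng.normal(size=(4, 512))

probs = zero_shot_predict(image_embeds, text_embeds)
predictions = probs.argmax(axis=-1)
print([class_names[i] for i in predictions])
print("accuracy:", float((predictions == true_labels).mean()))
```

Because the class set is defined purely by the prompts, the same function can be reused before and after fine-tuning to measure how much the adaptation helped.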

In [4]:
# Your code here.

#### Exercise 3.3: Choose your Own Adventure

There are a *ton* of interesting and fun models on the HuggingFace hub. Pick one that does something interesting and adapt it in some way to a new task. Or, combine two or more models into something more interesting or fun. The sky's the limit.

**Note**: Reach out to me by email or on the Discord if you are unsure about anything.

In [5]:
# Your code here.