
# Home Exercise 3: Hyperparameters and Evaluation
In this third home exercise, you will use the knowledge from Tutorial 4 to experiment with hyperparameters, create a test set, and evaluate your final model on the created test set.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

## **DistilBERT: Hyperparameters and Evaluation**

Use the code of Tutorial 4 to load and fine-tune the `distilbert-base-cased` model on the small subset of the `imdb` Movie Review Dataset. For convenience, the code of Tutorial 4 required for this exercise is already provided in the code cells below.

👋 ⚒ When creating the dataset splits in the code cell below, additionally create a test set to be used after the training. Make sure that your test set does not contain any of the sentences contained in the training or validation set and is approximately of the same size as the validation set.
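
One way to sanity-check the non-overlap requirement is to compare the index ranges used for each split. The sketch below is only an illustration: the `splits` dictionary and the `overlapping_pairs` helper are hypothetical names, and the ranges assume all three splits are selected from one shuffled copy of the dataset.

```python
# Hypothetical index ranges over one shuffled dataset: 128 train, 32 val, 32 test.
splits = {
    "train": range(0, 128),
    "val": range(128, 160),
    "test": range(160, 192),
}


def overlapping_pairs(index_splits):
    """Return the pairs of split names that share at least one index."""
    names = list(index_splits)
    overlaps = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if set(index_splits[first]) & set(index_splits[second]):
                overlaps.append((first, second))
    return overlaps


sizes = {name: len(indices) for name, indices in splits.items()}
print(sizes)
print(overlapping_pairs(splits))
```

Disjoint indices only guarantee that no row is reused. If the source data contained the same review text at two different indices, comparing the `text` fields directly would be the stricter check.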

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install accelerate --upgrade



In [None]:
from datasets import load_dataset, DatasetDict
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer

# Load IMDB dataset for sentiment analysis (contains labeled reviews: positive, negative)
imdb_dataset = load_dataset("imdb")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")


# Truncate each IMDB example to its first 100 whitespace-separated words
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:100]),
        'label': example['label']
    }

# Shuffle once to ensure consistency and prevent data leaking from one set to another
shuffled_dataset = imdb_dataset['train'].shuffle(seed=24)

# Initialize training, validation, and test sets using non-overlapping indices
small_imdb_dataset = DatasetDict(
    train=shuffled_dataset.select(range(128)).map(truncate),        # NB: .map() applies a function (here: truncate) to all elements of a dataset (here: range 0, 128)
    val=shuffled_dataset.select(range(128, 160)).map(truncate),
    test=shuffled_dataset.select(range(160, 192)).map(truncate)
)


def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_dataset = small_imdb_dataset.map(tokenize_function, batched=True, batch_size=16)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)        # padding applied to batch while training


👋 ⚒ For this exercise, we will use the Hugging Face Trainer class to experiment with hyperparameters. Try to find a set of hyperparameter settings that achieves the highest possible accuracy on the **validation set** with the small dataset and model in this setup.

**Optional:** If you want to follow a more systematic route, feel free to use available frameworks for hyperparameter optimization, such as [Optuna](https://optuna.org/).
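
As a rough illustration of what a systematic search looks like, here is a minimal grid-search sketch in plain Python. It does not use Optuna, and everything in it is an assumption made for illustration: the `search_space` values, the helper names, and in particular `dummy_train_and_validate`, which returns a made-up score so that the sketch runs without training a model.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not recommendations.
search_space = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "weight_decay": [0.0, 0.1, 0.2],
    "num_train_epochs": [3, 5],
}


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def best_config(space, train_and_validate):
    """Return the configuration with the highest score and that score."""
    best, best_score = None, float("-inf")
    for config in grid(space):
        score = train_and_validate(config)
        if score > best_score:
            best, best_score = config, score
    return best, best_score


def dummy_train_and_validate(config):
    """Placeholder score so the sketch runs without training a model."""
    return -abs(config["learning_rate"] - 2e-5) - abs(config["weight_decay"] - 0.1)


configs = list(grid(search_space))
print(len(configs))
print(best_config(search_space, dummy_train_and_validate))
```

In a real run, the placeholder would be replaced by a function that builds `TrainingArguments` from `config`, trains a freshly initialized model, and returns the `eval_accuracy` reported on the validation set. The test set should stay out of this loop entirely.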

In [None]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification
from transformers import set_seed
from transformers import EarlyStoppingCallback

set_seed(24)

model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)
accuracy = evaluate.load("accuracy")

arguments = TrainingArguments(      # specify hyperparameters and settings for training
    output_dir="sample_cl_trainer",
    per_device_train_batch_size=16,     # hyperparameter: batch size (examples processed simultaneously)
    per_device_eval_batch_size=16,
    logging_steps=4,            # log training metrics every 4 training steps
    num_train_epochs=5,     # hyperparameter: how many full passes over dataset; NB: one training epoch == one full pass over training data
    eval_strategy="epoch",          # run validation at the end of each epoch
    save_strategy="epoch",          # save model at the end of each epoch
    learning_rate=2e-5,     # hyperparameter: controls learning step size
    weight_decay=0.2,      # hyperparameter: helps prevent overfitting (penalty to large weights)
    load_best_model_at_end=True,
    report_to='none',
    seed=24
)


def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'],        # change to test when you do your final evaluation: see below
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=2))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6906,0.685231,0.71875
2,0.6761,0.688702,0.46875
3,0.6434,0.67131,0.71875
4,0.6085,0.659854,0.75
5,0.5857,0.655104,0.75


TrainOutput(global_step=40, training_loss=0.6477518439292907, metrics={'train_runtime': 841.5586, 'train_samples_per_second': 0.76, 'train_steps_per_second': 0.048, 'total_flos': 31941201500928.0, 'train_loss': 0.6477518439292907, 'epoch': 5.0})

## **Evaluation**

**1) Training Loss:**

**Definition:** The average loss computed on the training dataset after each epoch. It reflects how well the model is fitting the training data. A high loss means the model's predictions are far from the true labels (poor performance), while a low loss means the predictions are close to the true labels (good performance).

**Ideal Trend:** Training loss should decrease consistently as the model learns the patterns in the training data.

**In Practice:** A steadily decreasing training loss suggests the model is learning. If it falls far below the validation loss, however, the model may be overfitting; ideally both losses decrease at a similar rate.


**2) Validation Loss:**

**Definition:** The average loss computed on the validation dataset, which is not used for training. It measures the model's ability to generalize to unseen data.

**Ideal Trend:** Validation loss should decrease initially and stabilize as training progresses. If it starts increasing while training loss decreases, it indicates overfitting.

**In Practice:** Validation loss that closely follows training loss suggests good generalization, while a divergence indicates overfitting.
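
The relationship between the two losses can be checked numerically. The sketch below uses the per-epoch values from the training log of this run; the helper names `loss_gaps` and `best_epoch` are hypothetical.

```python
# Per-epoch values copied from the training log of this run.
train_loss = [0.6906, 0.6761, 0.6434, 0.6085, 0.5857]
val_loss = [0.685231, 0.688702, 0.67131, 0.659854, 0.655104]


def loss_gaps(train, val):
    """Validation loss minus training loss for each epoch."""
    return [round(v - t, 6) for t, v in zip(train, val)]


def best_epoch(val):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val)), key=val.__getitem__) + 1


gaps = loss_gaps(train_loss, val_loss)
print(gaps)
print(best_epoch(val_loss))
```

A growing gap while the validation loss is still falling is a weaker warning sign than a validation loss that starts to rise.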


**3) Accuracy:**

**Definition:** The proportion of correctly classified samples in the dataset. Here, it is computed on the validation set.

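**Ideal Trend:** Accuracy should increase as the model learns and improves its predictions. Stagnation or a decrease in accuracy can indicate overfitting or poor generalization. With only 32 validation examples, a single example shifts accuracy by about 3 percentage points, so small fluctuations should be read with caution.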

**In Practice:** Accuracy complements validation loss. High accuracy with low validation loss indicates good generalization, whereas high accuracy with high loss might signal overconfidence.
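
To make the difference between accuracy and loss concrete, here is a small sketch in plain Python. The probabilities are made up for illustration, and `fraction_correct` and `mean_cross_entropy` are hypothetical helper names; this is not the `evaluate` metric used in the notebook.

```python
import math


def fraction_correct(predicted_labels, true_labels):
    """Proportion of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def mean_cross_entropy(positive_probs, true_labels):
    """Average negative log-likelihood for binary labels (1 = positive)."""
    losses = []
    for prob, label in zip(positive_probs, true_labels):
        prob_of_truth = prob if label == 1 else 1.0 - prob
        losses.append(-math.log(prob_of_truth))
    return sum(losses) / len(losses)


# Made-up probabilities for the positive class on four examples.
true_labels = [1, 0, 1, 0]
models = {
    "cautious": [0.6, 0.4, 0.6, 0.6],            # last example mildly wrong
    "overconfident": [0.99, 0.01, 0.99, 0.999],  # last example confidently wrong
}

results = {}
for name, probs in models.items():
    predicted = [1 if p >= 0.5 else 0 for p in probs]
    results[name] = (
        fraction_correct(predicted, true_labels),
        mean_cross_entropy(probs, true_labels),
    )
    print(name, results[name])
```

Both hypothetical models make the same single mistake and therefore share the same accuracy, but the overconfident one is penalized much more heavily by the loss for that one wrong prediction.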

👋 ⚒ Change the following code cell so that, instead of evaluating a single sentence, your trained model (make sure to use the correct checkpoint!) is evaluated on the entire newly created test set.

This might also be a good occasion to get familiar with the [Hugging Face documentation and tutorials](https://huggingface.co/docs/transformers/index).
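
To work out which checkpoint directories exist, it helps to know that the Trainer names them `checkpoint-<global_step>`. The sketch below computes the step count at the end of each epoch for the settings used in this notebook. It assumes a single device, no gradient accumulation, and no early stop; `checkpoint_steps` is a hypothetical helper.

```python
import math


def checkpoint_steps(num_examples, batch_size, num_epochs):
    """Global step at the end of each epoch (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]


# 128 training examples, batch size 16, 5 epochs, one checkpoint saved per epoch.
steps = checkpoint_steps(num_examples=128, batch_size=16, num_epochs=5)
checkpoint_dirs = [f"sample_cl_trainer/checkpoint-{step}" for step in steps]
print(checkpoint_dirs)
```

The last value should match the `global_step` reported by `trainer.train()`. If early stopping ends training sooner, the later directories will not exist.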

In [None]:
import torch

test_str = "I love this movie!"

# Load the fine-tuned model from the correct checkpoint.
# This run took 40 training steps (global_step=40), and epoch 5 has the lowest
# validation loss, so the matching directory is checkpoint-40; checkpoint-160
# does not exist for this run. trainer.state.best_model_checkpoint holds the same path.
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-40")
model_inputs = tokenizer(test_str, return_tensors="pt")
with torch.no_grad():  # no gradients are needed for inference
    prediction = torch.argmax(fine_tuned_model(**model_inputs).logits)

print(["NEGATIVE", "POSITIVE"][prediction])

# Evaluate on the entire test set. trainer.evaluate() uses trainer.model, which
# already holds the best checkpoint because load_best_model_at_end=True.
test_results = trainer.evaluate(eval_dataset=small_tokenized_dataset['test'])

# Print the evaluation metrics
print("Test Set Evaluation Results:")
print(f"Accuracy: {test_results['eval_accuracy']:.4f}")
print(f"Loss: {test_results['eval_loss']:.4f}")

POSITIVE


Test Set Evaluation Results:
Accuracy: 0.6250
Loss: 0.6829


In [None]:
### Revision:
# Just for the record and in case you need it, here is an alternative method to
# evaluate on the test set (other than trainer.evaluate(test_set)):

import evaluate
from transformers import AutoModelForSequenceClassification
import torch

metric = evaluate.load("accuracy")
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-32")

model_inputs = tokenizer(small_tokenized_dataset['test']['text'], padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():  # no gradients are needed for inference
    outputs = fine_tuned_model(**model_inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
# Use a new name so the `accuracy` metric object used by compute_metrics is not overwritten
test_accuracy = metric.compute(predictions=predictions, references=small_tokenized_dataset['test']['label'])
print(test_accuracy)

# And a version for batch processing large test sets:

import evaluate
import torch
from torch.utils.data import DataLoader

metric = evaluate.load("accuracy")
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-40")
eval_dataloader = DataLoader(small_tokenized_dataset['test'], batch_size=8)

fine_tuned_model.eval()

for batch in eval_dataloader:
    # Named `inputs` rather than `input` to avoid shadowing the Python built-in
    inputs = tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = fine_tuned_model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch['label'])

metric.compute()