<a href="https://colab.research.google.com/github/ch2ohch2oh/plm-notebooks/blob/main/gpt2_finetune_imdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets torch scikit-learn



In [2]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [3]:
from datasets import load_dataset

# Load the IMDB dataset
imdb = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
print(imdb)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [5]:
# `imdb` was already loaded in cell 3, so it is reused here.
# Hold out part of the original training split for evaluation during fine-tuning.
# train_test_split returns two new splits, 'train' and 'test', both drawn from imdb['train'].
train_test_split_dataset = imdb["train"].train_test_split(test_size=0.05, seed=42)  # 95% for training, 5% for evaluation

# Assign the splits to small_train_dataset and small_eval_dataset
small_train_dataset = train_test_split_dataset["train"]
small_eval_dataset = train_test_split_dataset["test"]

print("New dataset splits:")
print(small_train_dataset)
print(small_eval_dataset)

New dataset splits:
Dataset({
    features: ['text', 'label'],
    num_rows: 23750
})
Dataset({
    features: ['text', 'label'],
    num_rows: 1250
})
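
The row counts printed above follow directly from `test_size=0.05`. A minimal sketch of the arithmetic in plain Python; `split_sizes` is a hypothetical helper for illustration and is not how `datasets` implements the split internally:

```python
def split_sizes(num_rows: int, test_size: float) -> tuple[int, int]:
    """Return (train_rows, eval_rows) for a fractional hold-out."""
    eval_rows = round(num_rows * test_size)  # rows held out for evaluation
    train_rows = num_rows - eval_rows        # everything else stays in training
    return train_rows, eval_rows

train_rows, eval_rows = split_sizes(25_000, 0.05)
print(train_rows, eval_rows)
```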


In [6]:
from transformers import AutoTokenizer

# Load the tokenizer for gpt2
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

# Create a function to tokenize the text
def tokenize_function(examples):
    # padding="max_length" pads every review to the tokenizer's maximum length (1024 tokens for GPT-2).
    # truncation=True cuts reviews that are longer than that maximum.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply the tokenizer to the entire dataset using map()
tokenized_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/23750 [00:00<?, ? examples/s]

Map:   0%|          | 0/1250 [00:00<?, ? examples/s]
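
`padding="max_length"` brings every review to the same fixed length, so short reviews end up carrying many pad tokens. The toy sketch below shows what truncation, right padding, and the accompanying attention mask look like on plain lists of token ids. `pad_batch` is a hypothetical helper written for illustration, not the tokenizer's implementation; 50256 is GPT-2's end-of-text token id, which this notebook reuses as the pad token.

```python
PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in this notebook

def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Truncate or right-pad each sequence to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:max_length]                    # truncation
        n_pad = max_length - len(kept)             # number of pad tokens needed
        input_ids.append(kept + [pad_id] * n_pad)  # right padding
        attention_mask.append([1] * len(kept) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

ids, mask = pad_batch([[10, 11, 12], [20, 21, 22, 23, 24, 25]], max_length=5)
print(ids)
print(mask)
```

Padding to the longest sequence in each batch, rather than to a fixed maximum, is a common way to reduce this overhead. That is a design alternative, not what the notebook does.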

In [7]:
from transformers import AutoModelForSequenceClassification

# Load the model for sequence classification with 2 labels
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
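
Setting `pad_token_id` on the model config matters because, according to the Transformers documentation, `GPT2ForSequenceClassification` classifies from the last token of each sequence and uses the pad id to locate the last non-padding token in a padded row. A rough pure-Python illustration of that lookup, assuming right padding; `last_non_pad_index` is a hypothetical helper and the token ids are made up, so this is not the library's code:

```python
def last_non_pad_index(input_ids, pad_id):
    """Return the index of the last token that is not pad_id, or -1 if the row is all padding."""
    for i in range(len(input_ids) - 1, -1, -1):  # scan from the right
        if input_ids[i] != pad_id:
            return i
    return -1

row = [464, 3807, 373, 50256, 50256]  # three real tokens followed by two pad tokens
print(last_non_pad_index(row, pad_id=50256))
```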


In [8]:
import numpy as np
from sklearn.metrics import accuracy_score
from transformers import TrainingArguments, Trainer

# 1. Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}

# 2. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./results",          # Directory to save checkpoints
    num_train_epochs=10,             # Total number of training epochs
    per_device_train_batch_size=10,  # Batch size for training (kept small to save GPU memory)
    per_device_eval_batch_size=10,   # Batch size for evaluation (kept small to save GPU memory)
    warmup_steps=60,                 # Number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir="./logs",            # Directory for storing logs
    logging_steps=10,                # Log the training loss every 10 steps
    eval_strategy="steps",           # Evaluate every eval_steps steps
    eval_steps=100,                  # Perform evaluation every 100 steps (batches)
    report_to="none",                # Disable logging integrations such as W&B
    fp16=True,                       # Mixed precision; without it, training is far slower
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
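
With `warmup_steps=60` and the `Trainer` defaults (a linear scheduler and a peak learning rate of 5e-5, neither of which is set explicitly above), the learning rate ramps up linearly for 60 steps and then decays linearly toward zero. A sketch of that shape follows. `linear_warmup_lr` is a hypothetical helper, and `total_steps=23_750` is an assumption based on 23,750 training examples, batch size 10, a single GPU, and 10 epochs.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=60, total_steps=23_750):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale from peak_lr down to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 30, 60, 23_750):
    print(step, linear_warmup_lr(step))
```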


In [9]:
from transformers import Trainer

tokenizer.pad_token = tokenizer.eos_token

# Create the Trainer instance with updated training_args
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    compute_metrics=compute_metrics,
)

# Start fine-tuning
trainer.train()  # Expect roughly 1 it/s on Colab's Tesla T4 GPU

Step  Training Loss  Validation Loss  Accuracy
200   0.319          0.32278          0.8992
400   0.2146         0.32917          0.904
600   0.3802         0.340143         0.9064
800   0.274          0.26203          0.908
1000  0.362          0.267196         0.9152
1200  0.1498         0.334249         0.9016


KeyboardInterrupt: 
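
The settings above imply a long run. A back-of-the-envelope estimate, assuming a single GPU, no gradient accumulation, and the roughly 1 it/s mentioned in the training cell (evaluation time is ignored):

```python
import math

train_rows = 23_750    # size of the training split created earlier
batch_size = 10        # per_device_train_batch_size
epochs = 10            # num_train_epochs
its_per_second = 1.0   # rough figure from the notebook comment; an assumption, not a measurement

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
hours = total_steps / its_per_second / 3600

print(steps_per_epoch, total_steps, round(hours, 1))
```

Under these assumptions the full 10 epochs would take several hours, while the log above stops after about 1,200 steps.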