###  1- Train base model

In this project, you will build a news topic classifier using the [GPT-2](https://huggingface.co/docs/transformers/en/model_doc/gpt2) model from the [Hugging Face Transformers](https://huggingface.co/transformers/) library.

The dataset used for training and evaluation is the [AG News Topic Classification Dataset](https://huggingface.co/datasets/sh0416/ag_news). It is drawn from the AG corpus, a collection of more than 1 million news articles gathered from more than 2,000 sources over more than a year. The classification subset used here contains 120,000 training and 7,600 test samples, and each article is labeled with one of four topics: World, Sports, Business, or Science/Technology.


In [1]:
import gc
import torch

gc.collect()       # Python garbage collection
torch.cuda.empty_cache()  # Free up the GPU cache

In [2]:
# Import the dataset loader from the datasets package
from datasets import load_dataset

# Load the train and test splits of the AG News dataset
splits = ["train", "test"]  # Dataset splits to load
# load_dataset returns one Dataset per requested split, in the same order
ds = dict(zip(splits, load_dataset("ag_news", split=splits)))

# For each split, shuffle the dataset and keep a 10% subset
for split in splits:
    subset_size = len(ds[split]) // 10
    print(f"{split}: {subset_size} samples")  # Number of samples in the subset
    ds[split] = ds[split].shuffle(seed=42).select(range(subset_size))  # Shuffle, then keep the first 10%



train: 12000 samples
test: 760 samples
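
The shuffle-then-select pattern above can be illustrated without the `datasets` library. The following is a minimal sketch in plain Python; `sample_subset` is a hypothetical helper, not part of any library, and it mimics only the idea (a fixed seed makes the subset reproducible), not the exact shuffling algorithm that `datasets` uses.

```python
import random

def sample_subset(items, divisor=10, seed=42):
    """Return a reproducible random subset holding 1/divisor of `items` (hypothetical helper)."""
    rng = random.Random(seed)      # Local RNG so the global random state is untouched
    shuffled = list(items)         # Copy to avoid mutating the caller's data
    rng.shuffle(shuffled)
    return shuffled[: len(shuffled) // divisor]

# Toy stand-in for a split with 120 examples
toy_split = list(range(120))
subset_a = sample_subset(toy_split)
subset_b = sample_subset(toy_split)
print(len(subset_a), subset_a == subset_b)
```

Shuffling before selecting matters when a dataset is stored in some order, since taking the first rows of an ordered split could skew the class balance.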


#### Pre-process Datasets

Next, we preprocess the datasets by converting the text into tokens that the model can understand. You might wonder why the text isn't already tokenized. The reason is that different models use different tokenizers; by tokenizing as part of the training workflow, we stay free to use whichever tokenizer the chosen model requires.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

def preprocess_function(examples):
    """Preprocess the AG News dataset by returning tokenized examples."""
    return tokenizer(examples["text"], truncation=True, padding="max_length")


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)


# Show the first example of the tokenized training set
print(tokenized_ds["train"][0]["input_ids"])

[43984, 75, 13410, 1582, 47557, 416, 8956, 29560, 7941, 423, 3181, 867, 11684, 290, 4736, 287, 19483, 284, 257, 17369, 11, 262, 1110, 706, 1248, 661, 3724, 287, 23171, 379, 257, 1964, 7903, 13, 50256, 50256, 50256, ... (output truncated; the padding token ID 50256 repeats)
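
The long run of 50256 values at the end of the output is padding: `padding="max_length"` fills every sequence up to the model's maximum length, and the pad token was set to GPT-2's end-of-sequence token. A minimal sketch of that behavior in plain Python follows; `pad_or_truncate` is a hypothetical helper for illustration, and the maximum length of 8 is chosen only to keep the example short.

```python
PAD_TOKEN_ID = 50256  # GPT-2's end-of-sequence ID, reused here as the pad token

def pad_or_truncate(input_ids, max_length, pad_token_id=PAD_TOKEN_ID):
    """Truncate or right-pad a list of token IDs to exactly `max_length` (hypothetical helper)."""
    ids = input_ids[:max_length]                      # Truncation
    attention_mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)
    ids = ids + [pad_token_id] * padding              # Right padding
    attention_mask = attention_mask + [0] * padding   # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}

short = pad_or_truncate([43984, 75, 13410], max_length=8)
long = pad_or_truncate(list(range(20)), max_length=8)
print(short["input_ids"])
print(short["attention_mask"])
print(long["input_ids"])
```

With GPT-2's maximum length of 1,024 positions, a short news snippet is mostly padding, which likely adds to the training time at this setting.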

#### Load and Configure the Model

Next, we will load the model and freeze most of its parameters, keeping only the classification head trainable.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=4,
    # Map the class names to label IDs 0-3 (label2id) and back (id2label)
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
)

# Freeze all the parameters of the base model so that only the classification head is trained
# Hint: Check the documentation at https://huggingface.co/transformers/v4.2.2/training.html
for param in model.base_model.parameters():
    # Exclude this parameter from gradient updates
    param.requires_grad = False



Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=4, bias=False)
)
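
Freezing the base model leaves very few trainable parameters. The following sketch recomputes the counts from the layer shapes printed above; it assumes that each `Conv1D` layer carries a bias term (the printout does not show this), and the helper names are hypothetical.

```python
HIDDEN = 768       # Hidden size from the printed architecture
VOCAB = 50257      # Rows of the token embedding (wte)
POSITIONS = 1024   # Rows of the position embedding (wpe)
LAYERS = 12        # Number of GPT2Block modules
NUM_LABELS = 4

def conv1d_params(nx, nf):
    """Weights plus bias of a Conv1D layer mapping nx inputs to nf outputs (bias assumed)."""
    return nx * nf + nf

def layer_norm_params(size):
    """Scale and shift vectors of a LayerNorm with elementwise_affine=True."""
    return 2 * size

block = (
    layer_norm_params(HIDDEN)              # ln_1
    + conv1d_params(HIDDEN, 3 * HIDDEN)    # attn.c_attn
    + conv1d_params(HIDDEN, HIDDEN)        # attn.c_proj
    + layer_norm_params(HIDDEN)            # ln_2
    + conv1d_params(HIDDEN, 4 * HIDDEN)    # mlp.c_fc
    + conv1d_params(4 * HIDDEN, HIDDEN)    # mlp.c_proj
)
frozen = VOCAB * HIDDEN + POSITIONS * HIDDEN + LAYERS * block + layer_norm_params(HIDDEN)
trainable = HIDDEN * NUM_LABELS            # score head, bias=False

print(f"frozen: {frozen:,}")
print(f"trainable: {trainable:,}")
print(f"trainable share: {trainable / (frozen + trainable):.4%}")
```

On the loaded model, the trainable count can be checked directly with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.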


#### Time to Train the Model!

We're now ready to train our model! To make this process easier, we'll use the `Trainer` class from the 🤗 Transformers library. This class provides a convenient high-level interface that handles most of the training logic for us.

Before setting up the `Trainer`, we'll define a function to calculate the accuracy of our model, which we'll use as an evaluation metric.
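
As a rough illustration of the metric, the sketch here computes accuracy from raw model scores (logits) in plain Python. The function name `accuracy_from_logits` and the toy values are made up for this example; the notebook's own `compute_metrics` applies the same argmax-and-compare logic with NumPy.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label (hypothetical helper)."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]  # Argmax per row
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)

# Toy logits for three articles over the four classes
toy_logits = [
    [2.1, 0.3, -1.0, 0.5],   # Highest score at index 0
    [0.2, 0.1, 1.7, 0.4],    # Highest score at index 2
    [0.0, 3.2, 0.9, -0.5],   # Highest score at index 1
]
toy_labels = [0, 2, 3]       # The third label disagrees with the prediction
print(accuracy_from_logits(toy_logits, toy_labels))
```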

This is also a good moment to introduce the concept of a **Data Collator**. As explained in the Hugging Face documentation:

> A data collator is an object that creates a batch from a list of dataset samples. These samples come from either the training or evaluation dataset.

> In order to form proper batches, data collators might apply some preprocessing steps, such as padding the sequences to the same length.
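
To make the quoted description concrete, here is a minimal sketch of what a padding collator does: it takes a list of samples of different lengths and pads them to the longest one in the batch. `collate_with_padding` is a hypothetical stand-in written in plain Python, not the implementation of `DataCollatorWithPadding`, which also converts the batch to tensors.

```python
def collate_with_padding(samples, pad_token_id=50256):
    """Pad every sample to the length of the longest sample in the batch (hypothetical helper)."""
    longest = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        padding = longest - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * padding)
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * padding)
        batch["labels"].append(s["label"])
    return batch

# Two toy samples of different lengths
toy_samples = [
    {"input_ids": [11, 22, 33], "label": 1},
    {"input_ids": [44], "label": 3},
]
toy_batch = collate_with_padding(toy_samples)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

In this notebook the sequences are already padded to the maximum length during preprocessing, so the collator has no extra padding to add.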



In [6]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# Define a function to compute the metrics for evaluation
def compute_metrics(eval_pred):
    predictions, labels = eval_pred  # Unpack predictions and labels
    predictions = np.argmax(predictions, axis=1)  # Index of the highest logit (the predicted class) for each sample
    return {"accuracy": (predictions == labels).mean()}  # Calculate accuracy as the fraction of correct predictions

# Set the padding token ID in the model configuration to match the tokenizer's padding token ID
model.config.pad_token_id = tokenizer.pad_token_id

# Enable gradient checkpointing to reduce memory usage during training
model.gradient_checkpointing_enable()

# Initialize the Hugging Face Trainer class to handle the training and evaluation loop
trainer = Trainer(
    model=model,  # The model to be trained
    args=TrainingArguments(
        per_device_train_batch_size=1,  # Training batch size per device (kept small to limit memory use)
        per_device_eval_batch_size=1,  # Evaluation batch size per device (kept small to limit memory use)
        fp16=True,  # Enable mixed precision training
        gradient_accumulation_steps=2,  # Accumulate gradients over 2 steps to simulate a larger batch size
        num_train_epochs=1,  # Number of training epochs
        weight_decay=0.01,  # Weight decay for regularization
        evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
        save_strategy="epoch",  # Save the model at the end of each epoch
        load_best_model_at_end=True,  # Load the best model (based on evaluation) at the end of training
    ),
    train_dataset=tokenized_ds["train"],  # Tokenized training dataset
    eval_dataset=tokenized_ds["test"],  # Tokenized evaluation dataset
    tokenizer=tokenizer,  # Tokenizer used for preprocessing
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # Data collator to handle padding dynamically
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Start the training process
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.621 | 0.584402 | 0.831579 |


TrainOutput(global_step=6000, training_loss=0.9549507649739584, metrics={'train_runtime': 24765.93, 'train_samples_per_second': 0.485, 'train_steps_per_second': 0.242, 'total_flos': 6271235260416000.0, 'train_loss': 0.9549507649739584, 'epoch': 1.0})
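
The step count and throughput in the training output follow from the settings above. The short check here redoes the arithmetic; it assumes a single device, which the output does not state explicitly.

```python
train_samples = 12000            # 10% subset of the training split
per_device_batch_size = 1
gradient_accumulation_steps = 2
num_epochs = 1
train_runtime = 24765.93         # Seconds, from the TrainOutput above

# One optimizer step happens after every `gradient_accumulation_steps` batches
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
optimizer_steps = train_samples // effective_batch_size * num_epochs

print(optimizer_steps)                               # Compare with global_step
print(round(train_samples / train_runtime, 3))       # Compare with train_samples_per_second
print(round(optimizer_steps / train_runtime, 3))     # Compare with train_steps_per_second
print(round(train_runtime / 3600, 1), "hours")
```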

#### Evaluate the model

To evaluate the model, simply call the `evaluate` method on the `trainer` object. This will test the model on the evaluation dataset and calculate the metrics defined in the `compute_metrics` function.

In [7]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 0.5844020247459412,
 'eval_accuracy': 0.8315789473684211,
 'eval_runtime': 1435.8574,
 'eval_samples_per_second': 0.529,
 'eval_steps_per_second': 0.529,
 'epoch': 1.0}
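
The reported accuracy can be turned back into a count of correct predictions. The check here is plain arithmetic on the numbers shown above.

```python
eval_samples = 760                     # 10% subset of the test split
eval_accuracy = 0.8315789473684211     # From trainer.evaluate()
eval_runtime = 1435.8574               # Seconds

correct = round(eval_accuracy * eval_samples)
print(correct, "of", eval_samples, "correct;", eval_samples - correct, "misclassified")
print(round(eval_samples / eval_runtime, 3))   # Compare with eval_samples_per_second
```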

#### View the results

Let's examine two examples along with their labels and predicted values.

In [8]:
import pandas as pd

# Convert the test dataset into a pandas DataFrame
df = pd.DataFrame(tokenized_ds["test"])

# Select only the "text" and "label" columns for analysis
df = df[["text", "label"]]

# Replace HTML line breaks with spaces in the "text" column
df["text"] = df["text"].str.replace("<br />", " ")

# Use the trained model to make predictions on the test dataset
predictions = trainer.predict(tokenized_ds["test"])

# Add a new column "predicted_label" to the DataFrame with the predicted labels
# The predicted label is the index of the highest logit in the model's output
df["predicted_label"] = np.argmax(predictions.predictions, axis=1)

# Display the first two rows of the DataFrame to verify the results
df.head(2)

| | text | label | predicted_label |
|---|---|---|---|
| 0 | Indian board plans own telecast of Australia s... | 1 | 0 |
| 1 | Stocks Higher on Drop in Jobless Claims A shar... | 2 | 2 |


#### Examine Incorrect Predictions

Let's review some examples where the model made incorrect predictions.

In [9]:
# Set the display option for pandas to show the full content of the "text" column without truncation
pd.set_option("display.max_colwidth", None)

# Filter the DataFrame to show only the rows where the actual label ("label") does not match the predicted label ("predicted_label")
# Display the first two rows of these mismatched predictions for analysis
df[df["label"] != df["predicted_label"]].head(2)

| | text | label | predicted_label |
|---|---|---|---|
| 0 | Indian board plans own telecast of Australia series The Indian cricket board said on Wednesday it was making arrangements on its own to broadcast next month #39;s test series against Australia, which is under threat because of a raging TV rights dispute. | 1 | 0 |
| 5 | China's inflation rate slows sharply but problems remain (AFP) AFP - China's inflation rate eased sharply in October as government efforts to cool the economy began to really bite, with food prices, one of the main culprits, showing some signs of slowing, official data showed. | 0 | 2 |
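
The numeric labels are easier to read as class names, and tallying (true, predicted) pairs shows which topics get confused. The sketch here uses the `id2label` mapping from the model configuration and only the three rows displayed in this section, so the counts are illustrative rather than a summary of the full test set. `tally_errors` is a hypothetical helper.

```python
from collections import Counter

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def tally_errors(pairs):
    """Count misclassified (true, predicted) label pairs by class name (hypothetical helper)."""
    return Counter(
        (id2label[true], id2label[pred]) for true, pred in pairs if true != pred
    )

# (label, predicted_label) for the rows shown above
shown_rows = [(1, 0), (2, 2), (0, 2)]
errors = tally_errors(shown_rows)
for (true_name, pred_name), count in errors.items():
    print(f"{true_name} predicted as {pred_name}: {count}")
```

With the DataFrame from the notebook, the same pairs could be built with `zip(df["label"], df["predicted_label"])`.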


In [10]:
# Save the model
trainer.save_model("models/gpt2_ag_news")


#### Upload the Model to Hugging Face

In this step, we upload the fine-tuned model to the Hugging Face Hub so that it can be shared with the community or reused in other projects. Make sure your Hugging Face token is available as the `HF_TOKEN` environment variable (for example, in a `.env` file) for authentication.


In [None]:
# !pip install huggingface_hub python-dotenv
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

# Load environment variables (including the Hugging Face token) from a .env file
load_dotenv()
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise ValueError("Hugging Face token not found. Please set the 'HF_TOKEN' environment variable.")


api = HfApi(token=hf_token)
api.upload_folder(
    folder_path="./models/gpt2_ag_news",
    repo_id="elewah/gpt2-ag-news",
    repo_type="model",
)



CommitInfo(commit_url='https://huggingface.co/elewah/gpt2-ag-news/commit/18c304fcf760006d21176865eed7ffb24a76e266', commit_message='Upload folder using huggingface_hub', commit_description='', oid='18c304fcf760006d21176865eed7ffb24a76e266', pr_url=None, repo_url=RepoUrl('https://huggingface.co/elewah/gpt2-ag-news', endpoint='https://huggingface.co', repo_type='model', repo_id='elewah/gpt2-ag-news'), pr_revision=None, pr_num=None)