# 🧠 Fine-Tuning DistilBERT on AG News Dataset

**Goal:**  
Fine-tune a pre-trained DistilBERT model to classify short news items (a headline plus a brief description) into one of 4 categories:
- World 🌍  
- Sports 🏀  
- Business 💼  
- Science/Technology 💻  

**Why this project?**  
You’ll learn how fine-tuning works for *multi-class text classification* — a perfect next step after IMDb sentiment analysis.

**Concepts Covered:**
- Tokenization  
- Model fine-tuning  
- Evaluation  
- Testing on custom text  
- Saving & reusing models  

**Dataset:** AG News (from Hugging Face Datasets)  
**Model:** distilbert-base-uncased  
**Framework:** Hugging Face Transformers


In [1]:
# ============================================
# Step 1: Install Required Libraries
# ============================================
# ❓ Why? → These libraries provide:
# - Transformers (for pre-trained models)
# - Datasets (to load AG News easily)
# - Evaluate (to measure accuracy)

!pip install transformers datasets evaluate -q



## 📦 Step 2: Import Required Libraries

We import the necessary Python packages for:
- Loading and processing the dataset  
- Tokenizing text  
- Loading a pre-trained DistilBERT model  
- Training and evaluation  


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np


## 🗞 Step 3: Load the AG News Dataset

The **AG News dataset** contains 120,000 training samples and 7,600 test samples.  
Each article belongs to one of four categories:  
- 0: World  
- 1: Sports  
- 2: Business  
- 3: Sci/Tech  
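
The label ids above can be kept in a small lookup table so that predictions are readable later. This is a minimal sketch; `label_name` is a hypothetical helper, not part of the dataset or the library.

```python
# Label ids as listed above, in the AG News class order.
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_name(label_id: int) -> str:
    """Hypothetical helper: map an integer label to its category name."""
    return ID2LABEL[label_id]


print(label_name(2))
```

The sample printed by the next cell carries `'label': 2`, which this table reads as Business.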


In [3]:
dataset = load_dataset("ag_news")
print(dataset)

print("\nSample example from dataset:")
print(dataset["train"][0])



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

Sample example from dataset:
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2}


## 🔤 Step 4: Tokenize the Dataset

Before feeding text into DistilBERT, we must convert it into token IDs (integers).  
We’ll use the `distilbert-base-uncased` tokenizer, which:
- Converts text → tokens → token IDs  
- Pads or truncates every example to the same length (128 tokens in the cell below)  
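
To make the pad/truncate step concrete, here is a toy sketch that works on lists of integers. It is not the real WordPiece tokenizer: `pad_or_truncate` is a hypothetical function, the token ids are illustrative, and the real tokenizer also keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]      # cut sequences that are too long
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_length - len(ids)         # 0 when nothing needs padding
    return ids + [pad_id] * padding, mask + [0] * padding


# A short sequence is padded; the ids here are made up for illustration.
short_ids, short_mask = pad_or_truncate([101, 2813, 2358, 102], max_length=6)
# A long sequence is truncated.
long_ids, long_mask = pad_or_truncate(range(10), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask tells the model which positions are padding so they can be ignored.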


In [4]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(preprocess_function, batched=True)



## 🧩 Step 5: Load the Pre-trained Model

We’ll use a **DistilBERT model** already trained on large English text corpora.  
We’ll fine-tune it for our specific task (news classification) with 4 output labels.  


In [5]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
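
The warning above names the four tensors that make up the new classification head. A rough NumPy sketch of what that head computes follows. It assumes the base model's hidden size of 768, uses random stand-in values instead of real encoder output, stores weights as (input, output) for readability, and leaves out dropout.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels, batch_size = 768, 4, 2

# Stand-in for the encoder output at the first ([CLS]) position.
cls_hidden = rng.normal(size=(batch_size, hidden_size))

# Randomly initialised head, named after the tensors listed in the warning.
pre_classifier_weight = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
pre_classifier_bias = np.zeros(hidden_size)
classifier_weight = rng.normal(scale=0.02, size=(hidden_size, num_labels))
classifier_bias = np.zeros(num_labels)

# Linear layer + ReLU, then a linear projection to one score per class.
hidden = np.maximum(cls_hidden @ pre_classifier_weight + pre_classifier_bias, 0.0)
logits = hidden @ classifier_weight + classifier_bias

print(logits.shape)
```

Because these weights start out random, the logits are meaningless until fine-tuning, which is what the warning is pointing out.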


In [6]:
# 1) Upgrade to a recent transformers release. Recent releases name the
#    TrainingArguments option `eval_strategy` (formerly `evaluation_strategy`).
!pip install -q --upgrade transformers datasets evaluate accelerate

# 2) Verify the transformers version. If pip upgraded a package that was
#    already imported, restart the runtime so the new version is loaded.
import transformers
print("transformers version:", transformers.__version__)


transformers version: 4.57.1


## ⚙️ Step 6: Define Training Parameters

These **hyperparameters** control how the model learns:
- `learning_rate`: step size of each weight update  
- `per_device_train_batch_size`: number of samples per update on each device  
- `num_train_epochs`: number of full passes over the training set  
- `weight_decay`: regularization that shrinks the weights to reduce overfitting  
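
These settings also determine how many optimizer steps the run takes. The sketch below works that out for the values used in the training cell, assuming a single device and no gradient accumulation (both assumptions, not stated in the notebook).

```python
import math

train_examples = 120_000       # size of the AG News training split
per_device_batch_size = 16     # per_device_train_batch_size
num_devices = 1                # assumption: a single GPU
gradient_accumulation = 1      # assumption: no gradient accumulation
epochs = 3                     # num_train_epochs

effective_batch = per_device_batch_size * num_devices * gradient_accumulation
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 7500 22500
```

With `logging_steps=100`, that is 75 training-loss log entries per epoch under the same assumptions.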


In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # checkpoints are written here
    eval_strategy="epoch",             # evaluate at the end of every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,                 # log the training loss every 100 steps
)


## 📏 Step 7: Define Evaluation Metric

We’ll use **accuracy** as our main metric:  
the fraction of evaluation examples whose predicted category matches the true label.
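
As a small worked example of the metric, the sketch below computes accuracy by hand with NumPy. The logits and labels are made up for illustration.

```python
import numpy as np

# One row of class scores per example (made-up values).
logits = np.array([
    [2.1, 0.3, -1.0, 0.2],   # highest score at index 0
    [0.1, 3.0, 0.2, -0.5],   # highest score at index 1
    [0.0, 0.2, 0.1, 1.5],    # highest score at index 3
    [1.2, 0.1, 0.9, 0.3],    # highest score at index 0
])
labels = np.array([0, 1, 3, 2])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)
toy_accuracy = float((predictions == labels).mean())

print(predictions, toy_accuracy)  # [0 1 3 0] 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.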


In [8]:
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


Downloading builder script: 0.00B [00:00, ?B/s]

## 🤖 Step 8: Create Trainer Object

The `Trainer` API from Hugging Face automates:
- Training  
- Evaluation  
- Logging  


In [10]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent releases
    compute_metrics=compute_metrics,
)




## 🔥 Step 9: Fine-tune the Model

Now we train the model —  
it adjusts its weights slightly to fit the AG News task.
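
How "slightly" the weights move depends on the learning rate at each step. The sketch below assumes the `Trainer` defaults are in effect (a linear decay schedule with no warmup) and that the run has 22,500 steps, which holds for one device with batch size 16. `linear_decay_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 22_500  # assumption: 7,500 steps per epoch for 3 epochs
for step in (0, 11_250, 22_500):
    print(step, linear_decay_lr(step, total_steps))
```

Under these assumptions the rate starts at 2e-5, is halved by the midpoint of training, and reaches zero at the final step.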


In [None]:
trainer.train()




| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1964        | 0.175731        | 0.941316 |
| 2     | 0.1209        | 0.181708        | 0.947632 |
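
To put the logged accuracies in concrete terms, the sketch below converts them into counts of correct and incorrect predictions. It assumes evaluation ran over all 7,600 test examples.

```python
test_examples = 7_600
reported_accuracy = {1: 0.941316, 2: 0.947632}  # epoch -> accuracy from the log

correct_counts = {}
for epoch, acc in reported_accuracy.items():
    correct = round(acc * test_examples)
    correct_counts[epoch] = correct
    print(f"epoch {epoch}: {correct} correct, {test_examples - correct} wrong")
```

This gives 7,154 correct after epoch 1 and 7,202 after epoch 2. The validation loss rose slightly over the same interval even though accuracy improved.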


## 🧮 Step 10: Evaluate the Model

After training, we evaluate on the AG News test split (the same `eval_dataset` used for the per-epoch evaluation), which the model was not trained on.


In [None]:
results = trainer.evaluate()
print("\n✅ Evaluation Results:")
print(results)
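
The concepts list mentions testing on custom text. Once the model has produced logits for a new piece of text, turning them into a readable prediction might look like the sketch below. The logits are made up, and the label table repeats the mapping from Step 3.

```python
import numpy as np

ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits standing in for the model output on one custom text.
logits = np.array([[-1.2, 4.0, 0.3, -0.5]])

probs = softmax(logits)
pred_id = int(probs.argmax(axis=-1)[0])
pred_label = ID2LABEL[pred_id]

print(pred_label, round(float(probs[0, pred_id]), 3))
```

Softmax does not change which class wins. It only rescales the scores so they sum to 1 and can be read as a confidence.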


In [1]:
  !git clone https://github.com/alijawad832833-DS/-Fine-Tuning-DistilBERT-on-AG-News-Dataset.git


Cloning into '-Fine-Tuning-DistilBERT-on-AG-News-Dataset'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (3/3), done.


In [2]:
!cp /content/Fine-Tuning-DistilBERT-on-AG-News.ipynb /content/-Fine-Tuning-DistilBERT-on-AG-News-Dataset/


cp: cannot stat '/content/Fine-Tuning-DistilBERT-on-AG-News.ipynb': No such file or directory
