## Project Overview

This project fine-tunes a pre-trained DistilBERT model to classify news articles from the AG News dataset. The Hugging Face datasets library handles data loading and preprocessing, and the Transformers library supplies the model, the tokenizer, and the Trainer API used for fine-tuning.

## Workflow Summary

- Dataset: AG News (from Hugging Face datasets), with 120,000 training and 7,600 test examples
- Model: pre-trained DistilBERT (distilbert-base-uncased) with a four-class classification head
- Preprocessing: tokenization using DistilBertTokenizerFast
- Fine-tuning: Hugging Face Trainer API with custom training arguments
- Evaluation: performance on the test set is measured with accuracy as the key metric, alongside weighted precision, recall, and F1.
- Confusion matrix: a confusion matrix over the test-set predictions helps analyze prediction errors and class-wise performance.
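
The confusion matrix mentioned in the workflow can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the notebook's own code: the function name `confusion_matrix_counts` and the example labels and predictions are hypothetical.

```python
def confusion_matrix_counts(labels, preds, num_classes):
    """Count (true label, predicted label) pairs.

    Rows index the true class and columns index the predicted class.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted in zip(labels, preds):
        matrix[true_label][predicted] += 1
    return matrix


# Hypothetical labels and predictions for the four AG News classes
# (0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech).
example_labels = [0, 0, 1, 1, 2, 2, 3, 3]
example_preds = [0, 2, 1, 1, 2, 3, 3, 3]

cm = confusion_matrix_counts(example_labels, example_preds, num_classes=4)
for row in cm:
    print(row)
```

In the notebook itself, the labels and predictions would presumably come from `trainer.predict(tokenized_datasets["test"])` (its `label_ids` field and the argmax of its `predictions` field), and `sklearn.metrics.confusion_matrix` produces a table with the same row and column layout.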

## Inference

After fine-tuning, the model can be used to classify new news text samples into their respective categories. The inference process involves tokenizing input sentences, passing them through the fine-tuned model, and interpreting the predicted class label along with confidence scores.
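
The last step of that process, turning raw model outputs into a label and a confidence score, can be illustrated without any deep learning library. The sketch below assumes the model emits one logit per class; the logit values are made up for illustration, and the helper names `softmax` and `interpret` are hypothetical.

```python
import math

# AG News class order: 0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(logits):
    """Return the most likely label and its probability (the confidence score)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


# Hypothetical logits for a single sports headline.
example_logits = [0.2, 4.1, -1.3, 0.5]
label, confidence = interpret(example_logits)
print(f"{label} (confidence: {confidence:.2f})")
```

The confidence is simply the softmax probability of the winning class, so it reflects how far the top logit is from the others rather than a calibrated likelihood of being correct.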

In [1]:
!pip install transformers
!pip install 'accelerate>=0.26.0'
!pip install -U datasets huggingface_hub
!pip install fsspec==2023.9.2


In [2]:
from datasets import load_dataset
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
import torch



In [3]:
# Load dataset
dataset = load_dataset("ag_news")


In [4]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


In [5]:
# Tokenizer and model
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=4)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
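
With `padding="max_length"` and no explicit `max_length`, every example is padded to the model's maximum length (512 tokens for distilbert-base-uncased). A rough sketch of what that costs compared with padding each batch only to its longest example is shown below; the token lengths are hypothetical, and `padded_token_count` is an illustrative helper, not part of any library.

```python
def padded_token_count(lengths, batch_size, fixed_length=None):
    """Total tokens processed after padding.

    If fixed_length is given, every example is padded to it; otherwise each
    batch is padded to the length of its longest example (dynamic padding).
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length is not None else max(batch)
        total += target * len(batch)
    return total


# Hypothetical token counts for eight short news snippets.
example_lengths = [38, 52, 47, 61, 44, 70, 55, 49]

fixed = padded_token_count(example_lengths, batch_size=4, fixed_length=512)
dynamic = padded_token_count(example_lengths, batch_size=4)
print(f"fixed padding: {fixed} tokens, dynamic padding: {dynamic} tokens")
```

If training time matters, one option is to drop the `padding` argument from `tokenize_function` and let a `DataCollatorWithPadding` pad each batch instead; whether that is worthwhile depends on how long the actual articles are.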



In [7]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)

    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [8]:
# Step 4: Define training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    report_to="none",
)
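
To get a feel for what these arguments imply, the number of optimizer steps per epoch can be estimated from the dataset size and batch size. This is a back-of-the-envelope sketch that assumes no gradient accumulation and that the last partial batch is kept; `steps_per_epoch` is a hypothetical helper and the device count is an assumption, since the notebook does not state it.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Batches needed to see every example once."""
    effective_batch_size = per_device_batch_size * num_devices
    return math.ceil(num_examples / effective_batch_size)


# 120,000 training examples with per_device_train_batch_size=16.
train_steps_one_device = steps_per_epoch(120_000, 16)
train_steps_two_devices = steps_per_epoch(120_000, 16, num_devices=2)

# 7,600 test examples with per_device_eval_batch_size=16.
eval_steps_two_devices = steps_per_epoch(7_600, 16, num_devices=2)

print(train_steps_one_device, train_steps_two_devices, eval_steps_two_devices)
```

The evaluation output later in the notebook reports `eval_runtime` of about 77.6 seconds at about 3.07 steps per second, or roughly 238 evaluation steps, which would be consistent with an effective evaluation batch size of 32 (two devices).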

In [9]:
# Step 5: Fine-tune the model
from transformers import Trainer

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

In [None]:
# Fine-tune the model
trainer.train()







In [11]:
# Step 6: Evaluate the fine-tuned model on the test set
trainer.evaluate()

{'eval_loss': 0.1745474934577942,
 'eval_accuracy': 0.9411842105263157,
 'eval_f1': 0.9411615420294874,
 'eval_precision': 0.9411549547036132,
 'eval_recall': 0.9411842105263157,
 'eval_runtime': 77.5854,
 'eval_samples_per_second': 97.957,
 'eval_steps_per_second': 3.068,
 'epoch': 1.0}
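
Note that `eval_recall` and `eval_accuracy` above are identical. That is expected with `average='weighted'`: weighting each class's recall by its share of the examples sums the true positives over all classes and divides by the total, which is the definition of accuracy. The sketch below checks this on a small hypothetical set of labels and predictions; the helper functions are illustrative, not the scikit-learn implementations.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(labels, preds) if t == p)
    return correct / len(labels)


def weighted_recall(labels, preds):
    """Per-class recall averaged with weights proportional to class support."""
    total = len(labels)
    score = 0.0
    for cls in set(labels):
        support = sum(1 for t in labels if t == cls)
        true_pos = sum(1 for t, p in zip(labels, preds) if t == cls and p == cls)
        score += (support / total) * (true_pos / support)
    return score


# Hypothetical, deliberately imbalanced labels and predictions.
example_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
example_preds = [0, 0, 1, 1, 1, 2, 2, 0, 2, 3]

acc = accuracy(example_labels, example_preds)
rec = weighted_recall(example_labels, example_preds)
print(f"accuracy={acc:.4f}, weighted recall={rec:.4f}")
```

Weighted precision and F1 do not collapse to accuracy in the same way, which is why they differ slightly in the output above.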

### Inference

In [12]:
import torch
import torch.nn.functional as F

# Map label numbers to actual AG News categories
label_names = {
    0: "World 🌍",
    1: "Sports 🏅",
    2: "Business 💼",
    3: "Sci/Tech 🔬"
}

def predict_ag_news(text):
    model.eval()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)  # Ensure model is on same device as input

    # Tokenize and move input to device
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    inputs = {key: val.to(device) for key, val in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probs = F.softmax(outputs.logits, dim=1)
        predicted_class = torch.argmax(probs, dim=1).item()
        confidence = probs[0][predicted_class].item()

    label = label_names.get(predicted_class, f"Class {predicted_class}")
    print(f"\nText: {text}")
    print(f"Predicted Category: {label} (Confidence: {confidence:.2f})")


In [13]:
# 0: World
predict_ag_news("Tensions rise in the Middle East as diplomats call for peace talks.")

# 1: Sports
predict_ag_news("The football team won the championship after a tough game.")

# 2: Business
predict_ag_news("Stock markets rallied after a positive earnings report.")

# 3: Sci/Tech
predict_ag_news("NASA is preparing to launch a new satellite next month.")


Text: Tensions rise in the Middle East as diplomats call for peace talks.
Predicted Category: World 🌍 (Confidence: 1.00)

Text: The football team won the championship after a tough game.
Predicted Category: Sports 🏅 (Confidence: 0.96)

Text: Stock markets rallied after a positive earnings report.
Predicted Category: Business 💼 (Confidence: 0.95)

Text: NASA is preparing to launch a new satellite next month.
Predicted Category: Sci/Tech 🔬 (Confidence: 0.98)
