### **Name:** *`Urooj Fatima`*  
### **DHC-ID:** `DHC-29`   
### **Domain:** *`AI/ML Engineering Internship Tasks`*  
# **`DevelopersHub Corporation`**  

# 📌 Task 3: News Topic Classifier Using BERT    

## 🧠 Problem Statement
News articles are published in massive volumes every day, covering diverse topics such as world events, sports, business, and science/technology. Manually categorizing these news items is time-consuming and inefficient. Therefore, there is a need for an automated system that can classify news headlines into predefined categories based on their content.

## 🚀 Objective
Fine-tune a transformer model (e.g., BERT) to classify news headlines into topic categories.

## 📂 Dataset
#### Source:  
AG News Corpus (available via the Hugging Face `datasets` library).

#### Content:  
A collection of 120,000 training and 7,600 test news articles, evenly split across 4 classes.

#### Features:  

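Title → The news headline (the only model input in this task)

Description → The article summary (not used here)

Class Index → The raw label (1–4 in the loaded dataset), remapped to 0–3 for training. After remapping: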

0 → World 🌍

1 → Sports 🏅

2 → Business 💼

3 → Sci/Tech 🔬
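
The mapping above can be written as a small lookup table. This is a minimal sketch, assuming the raw `Class Index` column is 1-based (1–4), as the dataset output later in this notebook shows; `ID2LABEL`, `LABEL2ID`, and `raw_to_label` are illustrative names, not part of any library.

```python
# Illustrative lookup table for the four AG News categories (names are hypothetical).
ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def raw_to_label(class_index: int) -> str:
    """Convert a raw 1-based Class Index (1-4) into its category name."""
    if class_index not in (1, 2, 3, 4):
        raise ValueError(f"Class Index must be 1-4, got {class_index}")
    return ID2LABEL[class_index - 1]

print(raw_to_label(3))  # Business
```

Keeping both directions of the mapping in one place avoids off-by-one mistakes when moving between the raw labels and the 0-based labels the model trains on.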


## Libraries Import

In [8]:
import os
import random
import numpy as np
import torch

from datasets import load_dataset
from datasets.utils.logging import disable_progress_bar
disable_progress_bar()  # avoid ipywidgets progress bar issues

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

import evaluate

# Reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

print('Torch:', torch.__version__)
try:
    import transformers, datasets, accelerate
    print('Transformers:', transformers.__version__)
    print('Datasets:', datasets.__version__)
    print('Accelerate:', accelerate.__version__)
except Exception as e:
    print('Versions check note:', e)


Torch: 2.8.0+cpu
Transformers: 4.55.4
Datasets: 4.0.0
Accelerate: 1.10.1


## 1) Load & Explore AG News (Headlines Only)

In [11]:
print('Loading AG News dataset...')
dataset = load_dataset('ag_news')  # features: ['Class Index', 'Title', 'Description']
print(dataset)

print(f"Train size: {len(dataset['train'])}, Test size: {len(dataset['test'])}")

# Peek a few samples (Title only, since we classify headlines)
for i in range(3):
    print(f"[{i}] Title:", dataset['train'][i]['Title'])
    print('Label (Class Index 1-4):', dataset['train'][i]['Class Index'])
    print('-'*80)

# Class distribution
train_labels = [ex['Class Index'] for ex in dataset['train']]
test_labels  = [ex['Class Index'] for ex in dataset['test']]
print('\nTraining class distribution:')
for i in range(1,5):
    print(f'Class {i}:', train_labels.count(i))
print('\nTest class distribution:')
for i in range(1,5):
    print(f'Class {i}:', test_labels.count(i))


Loading AG News dataset...
DatasetDict({
    train: Dataset({
        features: ['Class Index', 'Title', 'Description'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['Class Index', 'Title', 'Description'],
        num_rows: 7600
    })
})
Train size: 120000, Test size: 7600
[0] Title: Wall St. Bears Claw Back Into the Black (Reuters)
Label (Class Index 1-4): 3
--------------------------------------------------------------------------------
[1] Title: Carlyle Looks Toward Commercial Aerospace (Reuters)
Label (Class Index 1-4): 3
--------------------------------------------------------------------------------
[2] Title: Oil and Economy Cloud Stocks' Outlook (Reuters)
Label (Class Index 1-4): 3
--------------------------------------------------------------------------------

Training class distribution:
Class 1: 30000
Class 2: 30000
Class 3: 30000
Class 4: 30000

Test class distribution:
Class 1: 1900
Class 2: 1900
Class 3: 1900
Class 4: 1900
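
The counts above were produced by calling `list.count` once per class, which rescans the whole label list each time. A single-pass alternative is sketched below with `collections.Counter`; the helper name `class_distribution` and the toy label list are assumptions for illustration only.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class_index: count}, sorted by class index, in a single pass."""
    return dict(sorted(Counter(labels).items()))

# Toy labels standing in for the real 'Class Index' column.
toy_labels = [3, 3, 1, 2, 4, 1, 2, 4]
print(class_distribution(toy_labels))  # {1: 2, 2: 2, 3: 2, 4: 2}
```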


## 2) Preprocess (Tokenize Title only, remap labels 1‑4 → 0‑3)

In [24]:
label_names = ['World', 'Sports', 'Business', 'Sci/Tech']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def preprocess(batch):
    # Input: Headline only (Title). We ignore Description by requirement.
    # Map Class Index from 1..4 → 0..3 for HF Trainer
    return {
        **tokenizer(batch['Title'], padding='max_length', truncation=True, max_length=64),
        'labels': [y - 1 for y in batch['Class Index']],
    }

encoded = dataset.map(
    preprocess,
    batched=True,
    remove_columns=['Title', 'Description', 'Class Index'],
)

# Set PyTorch format
encoded.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
encoded


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7600
    })
})
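
To make the effect of `padding='max_length'` and `truncation=True` concrete, the sketch below reproduces the idea in plain Python. It is a simplification, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the middle token IDs are made up, and the real tokenizer also keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length=64, pad_id=0):
    """Return fixed-length input_ids plus the matching attention_mask."""
    ids = list(token_ids)[:max_length]           # truncation: drop extra tokens
    mask = [1] * len(ids)                        # 1 marks a real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,   # padding: fill up to max_length
        "attention_mask": mask + [0] * padding,  # 0 marks padding the model ignores
    }

# 101 and 102 are [CLS] and [SEP] in bert-base-uncased; the middle IDs are made up.
example = pad_or_truncate([101, 2813, 2358, 102], max_length=8)
print(example["input_ids"])       # [101, 2813, 2358, 102, 0, 0, 0, 0]
print(example["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Every encoded headline therefore has the same length (64 in this notebook), which lets the default collator stack examples into batches without extra work.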

## 3) Version‑Safe TrainingArguments (handles evaluation_strategy vs eval_strategy)

In [27]:
from inspect import signature

def build_training_args():
    base = dict(
        output_dir='./results',
        save_strategy='epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=2,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=100,
        report_to='none',
        load_best_model_at_end=False,
    )
    params = signature(TrainingArguments).parameters
    if 'evaluation_strategy' in params:
        base['evaluation_strategy'] = 'epoch'
    elif 'eval_strategy' in params:
        base['eval_strategy'] = 'epoch'
    if 'logging_strategy' in params:
        base.setdefault('logging_strategy', 'steps')
    return TrainingArguments(**base)

training_args = build_training_args()
training_args


TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_use_gather_object=False
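
The version check in `build_training_args` relies on inspecting a callable's signature before passing a keyword that may have been renamed. The same pattern is sketched below in isolation; `pick_supported_kwarg`, `old_api`, and `new_api` are hypothetical stand-ins rather than anything from Transformers.

```python
from inspect import signature

def pick_supported_kwarg(func, value, *candidate_names):
    """Return {name: value} for the first candidate name that func accepts."""
    params = signature(func).parameters
    for name in candidate_names:
        if name in params:
            return {name: value}
    return {}  # none of the names is supported; pass nothing

# Hypothetical stand-ins for two library versions with a renamed argument.
def old_api(output_dir, evaluation_strategy="no"): ...
def new_api(output_dir, eval_strategy="no"): ...

names = ("eval_strategy", "evaluation_strategy")
print(pick_supported_kwarg(old_api, "epoch", *names))  # {'evaluation_strategy': 'epoch'}
print(pick_supported_kwarg(new_api, "epoch", *names))  # {'eval_strategy': 'epoch'}
```

Checking the signature rather than the version string keeps the code working without having to know in which release the rename happened.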

## 4) Model, Metrics, Trainer, Train

In [None]:
num_labels = 4
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

accuracy = evaluate.load('accuracy')
f1 = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy.compute(predictions=preds, references=labels)['accuracy'],
        'f1': f1.compute(predictions=preds, references=labels, average='weighted')['f1'],
    }

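from inspect import signature

# Newer Transformers releases rename Trainer's `tokenizer` argument to
# `processing_class`; pass whichever name this version accepts.
trainer_kwargs = dict(
    model=model,
    args=training_args,
    train_dataset=encoded['train'],
    eval_dataset=encoded['test'],
    compute_metrics=compute_metrics,
)
tok_arg = 'processing_class' if 'processing_class' in signature(Trainer.__init__).parameters else 'tokenizer'
trainer_kwargs[tok_arg] = tokenizer
trainer = Trainer(**trainer_kwargs)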

trainer.train()
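
To show what `compute_metrics` reports, the sketch below recomputes accuracy and support-weighted F1 by hand in plain Python. It is an illustration of the definitions, not the `evaluate` library's code; the function names and the toy logits are assumptions.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_weighted_f1(logits, labels):
    """Accuracy plus per-class F1 averaged with weights equal to class support."""
    preds = [argmax(row) for row in logits]
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

    weighted_f1 = 0.0
    for cls in set(labels):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = labels.count(cls)  # number of true examples of this class
        weighted_f1 += f1 * support / len(labels)
    return {"accuracy": accuracy, "f1": weighted_f1}

# Toy logits for four headlines; the last one is misclassified (true 2, predicted 1).
toy_logits = [
    [2.0, 0.1, 0.3, 0.0],
    [0.2, 1.5, 0.1, 0.0],
    [0.0, 0.3, 2.2, 0.1],
    [0.1, 1.9, 0.4, 0.2],
]
toy_labels = [0, 1, 2, 2]
result = accuracy_and_weighted_f1(toy_logits, toy_labels)
print({k: round(v, 4) for k, v in result.items()})  # {'accuracy': 0.75, 'f1': 0.75}
```

On a balanced dataset such as AG News the class supports are equal, so weighted F1 coincides with macro F1.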

## 5) Evaluate

In [None]:
results = trainer.evaluate()
print('Final Evaluation:', results)

save_dir = './bert_agnews_headlines'
trainer.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
print(f'Saved model + tokenizer to: {save_dir}')

## 6) Minimal Gradio UI (textbox → predicted category with scores)

In [None]:
import gradio as gr
import torch.nn.functional as F

labels = label_names  # ['World','Sports','Business','Sci/Tech']

def classify_headline(text):
    if not text or not text.strip():
        return {l: 0.0 for l in labels}
    model.eval()  # disable dropout for inference
    enc = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=64)
    enc = {k: v.to(model.device) for k, v in enc.items()}  # keep inputs on the model's device
    with torch.no_grad():
        out = model(**enc)
        probs = F.softmax(out.logits, dim=-1)[0].tolist()
    return {labels[i]: float(probs[i]) for i in range(len(labels))}

demo = gr.Interface(
    fn=classify_headline,
    inputs=gr.Textbox(lines=2, placeholder='Enter a news headline...'),
    outputs=gr.Label(num_top_classes=4),
    title='AG News Headline Classifier (BERT)',
    description='Enter a headline. Returns category and confidence scores.'
)

# To launch from the notebook, uncomment:
# demo.launch(share=False)
print('Gradio app object created. Call demo.launch() to start the UI.')
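
The step that turns model logits into the label-to-score dictionary shown by `gr.Label` can be sketched without PyTorch. The example below is a plain-Python illustration with made-up logits; `softmax`, `scores_from_logits`, and `LABELS` are illustrative names.

```python
import math

LABELS = ["World", "Sports", "Business", "Sci/Tech"]

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scores_from_logits(logits):
    """Map one row of logits to a {label: probability} dictionary."""
    return dict(zip(LABELS, softmax(logits)))

scores = scores_from_logits([0.2, 3.1, -0.5, 0.4])  # made-up logits
print(max(scores, key=scores.get))  # Sports
```

The probabilities sum to 1, so the UI can display them directly as confidence scores for the four categories.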

### Key Insights:

- Balanced dataset (30,000 training and 1,900 test examples per class) → no significant class imbalance issues.

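- BERT is well suited to short text like headlines; fine-tuned BERT is commonly reported at roughly 94–96% accuracy on AG News, though the exact figure for this headline-only setup should be read from the evaluation output.

- Headline-only training can still be expected to perform well, since headlines capture the essence of the article, although it discards the extra context in the description.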

- Deployment via Gradio/Streamlit makes the model practical for end users without coding.