# AG News Text Classification - Google Colab Quick Start

## Overview

This notebook provides a complete quick start guide for AG News classification in Google Colab, following methodologies from:
- Wing (2006): "Computational Thinking"
- Guzdial (2015): "Learner-Centered Design of Computing Education"
- Zhang et al. (2015): "Character-level Convolutional Networks for Text Classification"

### Learning Objectives
1. Set up the AG News classification environment in Colab
2. Load and explore the dataset
3. Train a transformer-based classifier
4. Evaluate model performance
5. Save and reload the model for inference

Author: Võ Hải Dũng  
Email: vohaidung.work@gmail.com  
Date: 2025

## 1. Environment Setup

In [None]:
# Standard library imports
import os
import sys
import json
import warnings
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
from collections import Counter
from datetime import datetime

# Check GPU availability
import torch

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("Warning: GPU not available. Training will be slower.")
    print("To enable GPU: Runtime -> Change runtime type -> Hardware accelerator -> GPU")

In [None]:
# Clone repository
!git clone https://github.com/VoHaiDung/ag-news-text-classification.git
%cd ag-news-text-classification

# Verify repository structure
!ls -la

In [None]:
# Install dependencies
print("Installing required packages...")
!pip install -q -r requirements/minimal.txt

# Import additional libraries
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)
torch.manual_seed(42)

print("Environment setup completed!")

## 2. Data Loading and Preparation

In [None]:
# Download AG News dataset
print("Downloading AG News dataset...")
!python scripts/setup/download_all_data.py --dataset ag_news

# Prepare data splits
print("\nPreparing data splits...")
!python scripts/data_preparation/prepare_ag_news.py

print("\nData preparation completed!")

In [None]:
# Load and explore data
data_dir = Path("data/processed")

# Load datasets
train_df = pd.read_csv(data_dir / "train.csv")
val_df = pd.read_csv(data_dir / "validation.csv")
test_df = pd.read_csv(data_dir / "test.csv")

# Define class names
AG_NEWS_CLASSES = ["World", "Sports", "Business", "Sci/Tech"]

print("Dataset Statistics:")
print("="*50)
print(f"Training samples: {len(train_df):,}")
print(f"Validation samples: {len(val_df):,}")
print(f"Test samples: {len(test_df):,}")
print(f"Total samples: {len(train_df) + len(val_df) + len(test_df):,}")
print(f"Number of classes: {len(AG_NEWS_CLASSES)}")
print(f"Classes: {', '.join(AG_NEWS_CLASSES)}")
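
Before training, it can be worth checking that the three splits do not share texts, since leakage between them would inflate the validation and test scores. The sketch below is a minimal, pure-Python illustration; `split_overlap` is a hypothetical helper (not part of the repository), and the toy texts are invented. In the notebook you could pass `train_df["text"]`, `val_df["text"]` and `test_df["text"]` instead.

```python
from typing import Dict, Iterable


def split_overlap(splits: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Count texts shared between every pair of splits (case/whitespace-insensitive)."""
    normalized = {
        name: {text.strip().lower() for text in texts}
        for name, texts in splits.items()
    }
    names = list(normalized)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps[f"{first}&{second}"] = len(normalized[first] & normalized[second])
    return overlaps


# Invented toy data: the validation text duplicates a training text
toy_splits = {
    "train": ["Oil prices rise", "Team wins final"],
    "validation": ["team wins final "],
    "test": ["New chip unveiled"],
}
overlaps = split_overlap(toy_splits)
print(overlaps)
```

Any non-zero count is a prompt to inspect the data preparation step rather than proof of a bug, because news feeds can legitimately contain repeated headlines.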

## 3. Data Exploration

In [None]:
# Display sample data
print("Sample Training Data:")
print("="*80)
print(train_df.head())

# Label distribution
print("\nLabel Distribution in Training Set:")
print("="*50)
label_counts = train_df['label'].value_counts().sort_index()
for label, count in label_counts.items():
    percentage = (count / len(train_df)) * 100
    print(f"  {AG_NEWS_CLASSES[label]}: {count:,} samples ({percentage:.1f}%)")
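
The per-class percentages above can be condensed into two numbers: the ratio between the largest and smallest class, and the accuracy a classifier would get by always predicting the majority class. The following is a small sketch under the assumption that labels are integers; `class_balance` is a hypothetical helper and the toy labels are invented. In the notebook you could call it with `train_df["label"].tolist()`.

```python
from collections import Counter
from typing import Dict, Sequence


def class_balance(labels: Sequence[int]) -> Dict[str, float]:
    """Summarize how evenly labels are spread across classes."""
    counts = Counter(labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "num_classes": len(counts),
        # 1.0 means perfectly balanced; larger values mean more imbalance
        "imbalance_ratio": largest / smallest,
        # Accuracy of always predicting the most frequent class
        "majority_baseline": largest / len(labels),
    }


# Invented toy labels, for illustration only
toy_labels = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
balance = class_balance(toy_labels)
print(balance)
```

The majority baseline is a useful floor when reading the accuracy figures later in the notebook.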

In [None]:
# Sample texts from each category
print("Sample Texts from Each Category:")
print("="*80)

for label in range(len(AG_NEWS_CLASSES)):
    print(f"\n{AG_NEWS_CLASSES[label].upper()}:")
    print("-"*40)
    samples = train_df[train_df['label'] == label]['text'].sample(2, random_state=42)
    for i, text in enumerate(samples, 1):
        # Truncate for display
        display_text = text[:200] + "..." if len(text) > 200 else text
        print(f"  {i}. {display_text}")
        print()

In [None]:
# Text length analysis
train_df['word_count'] = train_df['text'].str.split().str.len()
train_df['char_count'] = train_df['text'].str.len()

print("Text Length Statistics:")
print("="*50)
print(f"Average words per text: {train_df['word_count'].mean():.1f}")
print(f"Std dev of word count: {train_df['word_count'].std():.1f}")
print(f"Min words: {train_df['word_count'].min()}")
print(f"Max words: {train_df['word_count'].max()}")
print(f"Median words: {train_df['word_count'].median():.0f}")

# Visualize distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(train_df['word_count'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.title('Word Count Distribution')

plt.subplot(1, 2, 2)
for label in range(len(AG_NEWS_CLASSES)):
    subset = train_df[train_df['label'] == label]['word_count']
    plt.hist(subset, bins=30, alpha=0.5, label=AG_NEWS_CLASSES[label])
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.title('Word Count by Category')
plt.legend()

plt.tight_layout()
plt.show()
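
The word-count statistics can also inform the `max_length` used for tokenization later on. The sketch below estimates what fraction of texts would fit under a few candidate limits. It is a rough guide only: it counts whitespace-separated words, whereas subword tokenizers usually produce more tokens than words, so actual coverage will be somewhat lower. `length_coverage` is a hypothetical helper and the toy counts are invented; in the notebook you could pass `train_df["word_count"].tolist()`.

```python
from typing import Dict, Sequence


def length_coverage(word_counts: Sequence[int],
                    limits: Sequence[int] = (64, 128, 256)) -> Dict[int, float]:
    """Fraction of texts whose word count is at or below each candidate limit."""
    total = len(word_counts)
    return {
        limit: sum(count <= limit for count in word_counts) / total
        for limit in limits
    }


# Invented toy word counts, for illustration only
toy_counts = [20, 35, 40, 45, 60, 90, 130, 300]
coverage = length_coverage(toy_counts)
for limit, fraction in coverage.items():
    print(f"<= {limit:>3} words: {fraction:.1%} of texts")
```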

## 4. Model Setup and Training

In [None]:
# Import required modules
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from torch.utils.data import Dataset, DataLoader
import torch.nn as nn
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Transformers library imported successfully!")

In [None]:
# Dataset class
class AGNewsDataset(Dataset):
    """
    PyTorch Dataset for AG News classification.
    
    Following dataset design patterns from:
    - Paszke et al. (2019): "PyTorch: An Imperative Style, High-Performance Deep Learning Library"
    """
    
    def __init__(self, texts: List[str], labels: List[int], tokenizer, max_length: int = 256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self) -> int:
        return len(self.texts)
    
    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt"
        )
        
        return {
            "input_ids": encoding["input_ids"].flatten(),
            "attention_mask": encoding["attention_mask"].flatten(),
            "labels": torch.tensor(label, dtype=torch.long)
        }
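
The dataset class pads every example to `max_length`, so short news snippets carry many padding tokens through the model. An alternative is to tokenize without padding and pad each batch only to its longest sequence, which is the idea behind the `DataCollatorWithPadding` class imported earlier. The pure-Python sketch below illustrates that idea only; `pad_batch` is a hypothetical helper, not the library's implementation, and the token IDs are arbitrary.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad token-ID sequences to the longest sequence in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Arbitrary token IDs, for illustration only
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Here the batch is padded to 5 tokens rather than 256, which is where the speed-up from dynamic padding comes from.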

In [None]:
# Initialize model and tokenizer
model_name = "distilbert-base-uncased"  # Fast and efficient for quick start

print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=len(AG_NEWS_CLASSES)
)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

print(f"Model loaded successfully!")
print(f"Device: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Create datasets
print("Creating datasets...")

train_dataset = AGNewsDataset(
    train_df["text"].values,
    train_df["label"].values,
    tokenizer
)

val_dataset = AGNewsDataset(
    val_df["text"].values,
    val_df["label"].values,
    tokenizer
)

test_dataset = AGNewsDataset(
    test_df["text"].values,
    test_df["label"].values,
    tokenizer
)

print(f"Train dataset: {len(train_dataset)} samples")
print(f"Val dataset: {len(val_dataset)} samples")
print(f"Test dataset: {len(test_dataset)} samples")

## 5. Training

In [None]:
# Training configuration
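# AdamW is imported from torch.optim; the copy in transformers is deprecated
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup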

# Training hyperparameters
NUM_EPOCHS = 2  # Quick training for demo
BATCH_SIZE = 32 if torch.cuda.is_available() else 16
LEARNING_RATE = 2e-5
WARMUP_RATIO = 0.1

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE*2, shuffle=False)

# Setup optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
total_steps = len(train_loader) * NUM_EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(WARMUP_RATIO * total_steps),
    num_training_steps=total_steps
)

print(f"Training Configuration:")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Total steps: {total_steps}")
print(f"  Warmup steps: {int(WARMUP_RATIO * total_steps)}")
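
To make the warmup setting more concrete, the sketch below reproduces the shape of a linear warmup followed by linear decay: the learning rate rises from 0 to its peak over the warmup steps, then falls back to 0 by the final step. This is a standalone illustration of the schedule's shape, not the library code; `linear_warmup_multiplier` is a hypothetical helper and the total of 1000 steps is an invented figure.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = 1000                      # invented for illustration
warmup_steps = int(0.1 * total_steps)   # same 10% ratio as WARMUP_RATIO
base_lr = 2e-5                          # same value as LEARNING_RATE

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```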

In [None]:
# Training loop
def train_epoch(model, dataloader, optimizer, scheduler, device):
    """Train for one epoch."""
    model.train()
    total_loss = 0
    progress_bar = tqdm(dataloader, desc="Training")
    
    for batch in progress_bar:
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss
        
        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Update weights
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
        total_loss += loss.item()
        progress_bar.set_postfix({"loss": f"{loss.item():.4f}"})
    
    return total_loss / len(dataloader)

def evaluate(model, dataloader, device):
    """Evaluate model on dataset."""
    model.eval()
    all_preds = []
    all_labels = []
    total_loss = 0
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            
            outputs = model(**batch)
            loss = outputs.loss
            logits = outputs.logits
            
            preds = torch.argmax(logits, dim=-1)
            
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(batch["labels"].cpu().numpy())
            total_loss += loss.item()
    
    accuracy = accuracy_score(all_labels, all_preds)
    avg_loss = total_loss / len(dataloader)
    
    return accuracy, avg_loss, all_preds, all_labels

# Training
print("\nStarting training...")
print("="*50)

best_accuracy = 0
training_history = {"train_loss": [], "val_loss": [], "val_accuracy": []}

for epoch in range(NUM_EPOCHS):
    print(f"\nEpoch {epoch + 1}/{NUM_EPOCHS}")
    
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, scheduler, device)
    training_history["train_loss"].append(train_loss)
    
    # Evaluate
    val_accuracy, val_loss, _, _ = evaluate(model, val_loader, device)
    training_history["val_loss"].append(val_loss)
    training_history["val_accuracy"].append(val_accuracy)
    
    print(f"  Train Loss: {train_loss:.4f}")
    print(f"  Val Loss: {val_loss:.4f}")
    print(f"  Val Accuracy: {val_accuracy:.4f}")
    
    # Track the best validation accuracy (weights are saved to disk in Section 8)
    if val_accuracy > best_accuracy:
        best_accuracy = val_accuracy
        print(f"  New best model! Accuracy: {best_accuracy:.4f}")

print("\nTraining completed!")
print(f"Best validation accuracy: {best_accuracy:.4f}")
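
Note that the loop records the best validation accuracy but does not keep the corresponding weights, so later cells use whatever the final epoch produced. One way to handle this is to trigger a save whenever the metric improves. The sketch below shows the bookkeeping only; `BestCheckpointTracker` and `save_fn` are hypothetical names, and the accuracies are invented. In the notebook, `save_fn` could call `model.save_pretrained(...)` on a directory of your choice.

```python
from typing import Callable, List, Optional, Tuple


class BestCheckpointTracker:
    """Call save_fn whenever a higher-is-better metric improves."""

    def __init__(self, save_fn: Callable[[int, float], None]):
        self.save_fn = save_fn
        self.best_metric: Optional[float] = None
        self.best_epoch: Optional[int] = None

    def update(self, epoch: int, metric: float) -> bool:
        """Return True (and save) if this epoch is the best so far."""
        if self.best_metric is None or metric > self.best_metric:
            self.best_metric = metric
            self.best_epoch = epoch
            self.save_fn(epoch, metric)
            return True
        return False


# Stand-in for a real save: just record which epochs would be written to disk
saved: List[Tuple[int, float]] = []
tracker = BestCheckpointTracker(save_fn=lambda epoch, metric: saved.append((epoch, metric)))

for epoch, val_acc in enumerate([0.90, 0.93, 0.92], start=1):  # invented accuracies
    tracker.update(epoch, val_acc)

print(f"Best epoch: {tracker.best_epoch}, saved checkpoints: {saved}")
```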

## 6. Model Evaluation

In [None]:
# Evaluate on test set
print("Evaluating on test set...")
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE*2, shuffle=False)

test_accuracy, test_loss, test_preds, test_labels = evaluate(model, test_loader, device)

print(f"\nTest Results:")
print("="*50)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

# Classification report
print("\nClassification Report:")
print("="*50)
print(classification_report(test_labels, test_preds, target_names=AG_NEWS_CLASSES))

In [None]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(test_labels, test_preds)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=AG_NEWS_CLASSES, 
            yticklabels=AG_NEWS_CLASSES)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# Per-class accuracy (correct predictions / true samples of the class, i.e. per-class recall)
print("\nPer-Class Accuracy:")
print("="*50)
for i, class_name in enumerate(AG_NEWS_CLASSES):
    class_correct = cm[i, i]
    class_total = cm[i].sum()
    class_acc = class_correct / class_total if class_total > 0 else 0
    print(f"{class_name}: {class_acc:.4f} ({class_correct}/{class_total})")
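
The per-class figure computed above divides the diagonal by the row sum, which makes it the recall of each class. Precision uses the column sum instead, and both can be read off the same confusion matrix. The sketch below is a minimal illustration assuming rows are true labels and columns are predictions, as in `sklearn.metrics.confusion_matrix`; `per_class_metrics` is a hypothetical helper and the 2x2 matrix is invented.

```python
from typing import Dict, List, Sequence


def per_class_metrics(cm: Sequence[Sequence[int]],
                      class_names: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 per class from a confusion matrix."""
    metrics = {}
    for i, name in enumerate(class_names):
        true_positive = cm[i][i]
        actual = sum(cm[i])                     # row sum: samples truly in class i
        predicted = sum(row[i] for row in cm)   # column sum: samples predicted as i
        recall = true_positive / actual if actual else 0.0
        precision = true_positive / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Invented 2-class confusion matrix, for illustration only
toy_cm = [[8, 2],
          [1, 9]]
toy_metrics = per_class_metrics(toy_cm, ["A", "B"])
for name, values in toy_metrics.items():
    print(name, {key: round(value, 3) for key, value in values.items()})
```

These values should agree with the `classification_report` output printed earlier, which makes the sketch a handy cross-check.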

## 7. Interactive Prediction

In [None]:
def predict_text(text: str, model, tokenizer, device) -> Tuple[str, float, np.ndarray]:
    """
    Predict class for a single text input.
    
    Parameters
    ----------
    text : str
        Input text to classify
    model : transformers.PreTrainedModel
        Trained model
    tokenizer : transformers.PreTrainedTokenizer
        Tokenizer for the model
    device : torch.device
        Device to run inference on
    
    Returns
    -------
    Tuple[str, float, np.ndarray]
        Predicted class name, confidence score, and all probabilities
    """
    model.eval()
    
    # Tokenize
    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt"
    ).to(device)
    
    # Predict
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=-1)
        pred = torch.argmax(logits, dim=-1)
    
    pred_class = AG_NEWS_CLASSES[pred.item()]
    confidence = probs[0][pred].item()
    
    return pred_class, confidence, probs[0].cpu().numpy()

# Test predictions
test_texts = [
    "Apple announces new iPhone with revolutionary camera system and AI features",
    "Stock market reaches all-time high amid economic recovery optimism",
    "Scientists discover potential signs of life on distant exoplanet",
    "Local team wins championship in thrilling overtime victory against rivals",
    "UN Security Council meets to discuss international peace efforts"
]

print("Interactive Predictions:")
print("="*80)

for text in test_texts:
    pred_class, confidence, probs = predict_text(text, model, tokenizer, device)
    
    print(f"\nText: {text[:60]}...")
    print(f"Predicted: {pred_class} (confidence: {confidence:.4f})")
    print(f"All probabilities:")
    for i, prob in enumerate(probs):
        print(f"  {AG_NEWS_CLASSES[i]}: {prob:.4f}")
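
The confidence reported by `predict_text` is the largest softmax probability. To show how that number arises, and how it might be used, the sketch below implements a numerically stable softmax in plain Python and declines to answer when the top probability falls below a threshold. The helper names, the logits and the threshold of 0.6 are all invented for illustration; a sensible threshold would need to be chosen on validation data.

```python
import math
from typing import List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_or_abstain(logits: Sequence[float],
                       class_names: Sequence[str],
                       threshold: float = 0.6) -> Tuple[Optional[str], float]:
    """Return (class name, confidence), or (None, confidence) if below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]
    return class_names[best], probs[best]


classes = ["World", "Sports", "Business", "Sci/Tech"]
confident = predict_or_abstain([2.0, 0.5, 0.1, -1.0], classes)   # invented logits
uncertain = predict_or_abstain([0.2, 0.1, 0.0, 0.1], classes)    # nearly uniform
print(confident)
print(uncertain)
```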

## 8. Save and Load Model

In [None]:
# Save model
output_dir = Path("outputs/colab_model")
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Saving model to {output_dir}...")
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Save training history
history_path = output_dir / "training_history.json"
with open(history_path, "w") as f:
    json.dump(training_history, f, indent=2)

# Save metadata
metadata = {
    "model_name": model_name,
    "num_epochs": NUM_EPOCHS,
    "batch_size": BATCH_SIZE,
    "learning_rate": LEARNING_RATE,
    "best_accuracy": best_accuracy,
    "test_accuracy": test_accuracy,
    "timestamp": datetime.now().isoformat(),
    "author": "Võ Hải Dũng",
    "email": "vohaidung.work@gmail.com"
}

metadata_path = output_dir / "metadata.json"
with open(metadata_path, "w") as f:
    json.dump(metadata, f, indent=2)

print(f"Model saved successfully!")
print(f"Files saved:")
for file in output_dir.glob("*"):
    print(f"  - {file.name}")

In [None]:
# Test loading
print("\nTesting model loading...")
loaded_model = AutoModelForSequenceClassification.from_pretrained(output_dir)
loaded_tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Test prediction with loaded model
test_text = "Breaking news: Major technological breakthrough announced"
loaded_model = loaded_model.to(device)
pred_class, confidence, _ = predict_text(test_text, loaded_model, loaded_tokenizer, device)

print(f"Test prediction with loaded model:")
print(f"  Text: {test_text}")
print(f"  Prediction: {pred_class} (confidence: {confidence:.4f})")
print("\nModel loaded and verified successfully!")

## 9. Download Results

In [None]:
# Create zip file with results
import zipfile
from datetime import datetime

zip_filename = f"colab_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.zip"

with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for file in output_dir.rglob('*'):
        if file.is_file():
            zipf.write(file, file.relative_to(output_dir.parent))

print(f"Results compressed to: {zip_filename}")

# Download (only works in Colab)
try:
    from google.colab import files
    files.download(zip_filename)
    print("Download started!")
except ImportError:
    print("Not running in Colab. File saved locally.")
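
Before downloading, it can help to confirm what the archive actually contains. The sketch below repeats the same zipping pattern in a temporary directory and then lists the entries and runs an integrity check with the standard `zipfile` module. The directory layout and file contents are placeholders created only for this demonstration.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder output directory with a single dummy file
    demo_output_dir = Path(tmp) / "outputs" / "colab_model"
    demo_output_dir.mkdir(parents=True)
    (demo_output_dir / "metadata.json").write_text("{}")

    demo_zip = Path(tmp) / "results.zip"
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as zipf:
        for file in demo_output_dir.rglob("*"):
            if file.is_file():
                # Paths are stored relative to the parent, so the folder name is kept
                zipf.write(file, file.relative_to(demo_output_dir.parent))

    with zipfile.ZipFile(demo_zip) as zipf:
        archive_names = zipf.namelist()
        first_bad_file = zipf.testzip()  # None when every entry passes its CRC check

print(f"Archive entries: {archive_names}")
print(f"Corrupt entry: {first_bad_file}")
```

Because paths are written relative to the parent directory, the archive unpacks into a `colab_model/` folder rather than scattering files into the current directory.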

## 10. Conclusions and Next Steps

### Summary

You have successfully completed the AG News text classification quick start:

1. **Environment Setup**: Configured Google Colab with GPU support
2. **Data Preparation**: Loaded and explored AG News dataset
3. **Model Training**: Trained a DistilBERT classifier (the best validation accuracy is printed at the end of Section 5)
4. **Evaluation**: Measured accuracy on the test set (reported in Section 6)
5. **Deployment**: Saved model for future use

### Key Observations

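- The dataset is well balanced across the 4 categories (see the label distribution in Section 3)
- Texts are short enough for standard transformer models; inputs are truncated to 256 tokens
- DistilBERT provides a good balance of speed and accuracy for a quick start
- The model is fine-tuned for only 2 epochs, so the test metrics in Section 6 are best read as a baseline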

### Next Steps

1. **Improve Performance**:
   - Try larger models (RoBERTa, DeBERTa)
   - Increase training epochs
   - Experiment with hyperparameters
   - Apply data augmentation

2. **Advanced Techniques**:
   - Implement ensemble methods
   - Try prompt-based learning
   - Explore few-shot learning
   - Use advanced training strategies

3. **Production Deployment**:
   - Optimize model for inference
   - Deploy via REST API
   - Implement monitoring
   - Add A/B testing
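
As one illustration of the ensemble idea listed above, soft voting averages the class probabilities produced by several models and picks the class with the highest mean. The sketch below is a minimal, pure-Python version; `soft_vote` is a hypothetical helper, and the three probability vectors are invented rather than produced by real models.

```python
from typing import List, Sequence


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across models (soft voting)."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(probs[c] for probs in prob_sets) / n_models for c in range(n_classes)]


classes = ["World", "Sports", "Business", "Sci/Tech"]

# Invented probabilities from three hypothetical models for one input text
model_a = [0.60, 0.10, 0.20, 0.10]
model_b = [0.30, 0.05, 0.55, 0.10]
model_c = [0.50, 0.10, 0.30, 0.10]

averaged = soft_vote([model_a, model_b, model_c])
winner = max(range(len(averaged)), key=averaged.__getitem__)
print(f"Ensemble prediction: {classes[winner]} ({averaged[winner]:.3f})")
```

In this toy case one model prefers Business, but the averaged probabilities still favor World, which is the kind of disagreement an ensemble is meant to smooth out.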

### Resources

- **Full Documentation**: [GitHub Repository](https://github.com/VoHaiDung/ag-news-text-classification)
- **Advanced Notebooks**: See `notebooks/` directory
- **API Examples**: Check `quickstart/api_quickstart.py`
- **Contact**: vohaidung.work@gmail.com

---

**Thank you for using this quick start guide!**
