# Text Preprocessing Tutorial for AG News Classification

## Overview

This notebook demonstrates comprehensive text preprocessing techniques following methodologies from:
- Pennington et al. (2014): "GloVe: Global Vectors for Word Representation"
- Bojanowski et al. (2017): "Enriching Word Vectors with Subword Information"
- Kudo & Richardson (2018): "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

### Tutorial Objectives
1. Implement text cleaning and normalization
2. Apply tokenization for transformer models
3. Prepare data compatible with model training
4. Extract features for analysis
5. Optimize preprocessing pipeline

Author: Võ Hải Dũng  
Email: vohaidung.work@gmail.com  
Date: 2025

## 1. Environment Setup

In [None]:
# Standard library imports
import sys
import re
import string
import unicodedata
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
from collections import Counter
import warnings

# Data manipulation and NLP imports
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer
import nltk

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# Project imports
PROJECT_ROOT = Path("../..").resolve()
sys.path.insert(0, str(PROJECT_ROOT))

from src.data.datasets.ag_news import AGNewsDataset, AGNewsConfig
from src.data.preprocessing.text_cleaner import TextCleaner, CleaningConfig
from src.data.preprocessing.tokenization import Tokenizer, TokenizationConfig
from src.data.preprocessing.feature_extraction import FeatureExtractor, FeatureExtractionConfig
from src.utils.io_utils import safe_load, safe_save, ensure_dir
from src.utils.logging_config import setup_logging
from src.utils.reproducibility import set_seed
from configs.constants import (
    AG_NEWS_CLASSES,
    AG_NEWS_NUM_CLASSES,
    DATA_DIR,
    OUTPUT_DIR
)

# Configuration
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
set_seed(42)
logger = setup_logging('preprocessing_tutorial')

# Download NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

print("Text Preprocessing Tutorial")
print("="*50)
print(f"Project Root: {PROJECT_ROOT}")

## 2. Load Sample Data

In [None]:
# Load AG News dataset (matching model training configuration)
config = AGNewsConfig(
    data_dir=DATA_DIR / "processed",
    max_samples=1000  # Same as model training tutorial
)

dataset = AGNewsDataset(config, split="train")

# Extract sample texts
sample_texts = dataset.texts[:10]
sample_labels = dataset.labels[:10]

print("Sample Data Loaded:")
print(f"  Total samples: {len(dataset)}")
print(f"  Sample texts: {len(sample_texts)}")

# Display first sample
print("\nFirst sample:")
# Look up the class name by the sample's label ID, not by position 0
print(f"  Label: {sample_labels[0]} ({dataset.label_names[sample_labels[0]]})")
print(f"  Text: {sample_texts[0][:200]}...")
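
A class name should be looked up by the sample's label ID rather than by its position in the list. A minimal sketch of that lookup, assuming the commonly used AG News class order (World, Sports, Business, Sci/Tech); `ASSUMED_CLASSES` and `label_to_name` are hypothetical names, and the project's `AG_NEWS_CLASSES` constant may be ordered differently:

```python
from typing import Sequence

# Assumed class order (common AG News convention); not taken from the project code.
ASSUMED_CLASSES = ["World", "Sports", "Business", "Sci/Tech"]

def label_to_name(label_id: int, class_names: Sequence[str] = ASSUMED_CLASSES) -> str:
    """Map an integer label ID to its class name, rejecting out-of-range IDs."""
    if not 0 <= label_id < len(class_names):
        raise ValueError(f"label {label_id} outside 0..{len(class_names) - 1}")
    return class_names[label_id]

example_labels = [2, 0, 3]  # hypothetical label IDs
print([label_to_name(label) for label in example_labels])
```

Raising on an out-of-range ID surfaces label-encoding mistakes early, for example labels stored as 1..4 instead of 0..3.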

## 3. Text Preprocessing for DeBERTa Model

In [None]:
# Initialize tokenizer (same as in model training)
model_name = 'microsoft/deberta-v3-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Tokenizer Configuration:")
print("="*50)
print(f"Model: {model_name}")
print(f"Vocab size: {tokenizer.vocab_size}")
print(f"Max length: 256")
print(f"Special tokens: {tokenizer.special_tokens_map}")

# Tokenize sample text
sample_text = sample_texts[0]
encoding = tokenizer(
    sample_text,
    truncation=True,
    max_length=256,
    padding='max_length',
    return_tensors='pt'
)

print(f"\nTokenization Example:")
print(f"  Original text length: {len(sample_text)} chars")
print(f"  Input IDs shape: {encoding['input_ids'].shape}")
print(f"  Attention mask shape: {encoding['attention_mask'].shape}")

# Decode to verify
decoded = tokenizer.decode(encoding['input_ids'][0], skip_special_tokens=True)
print(f"  Decoded text: {decoded[:100]}...")
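
To make the effect of `truncation=True`, `max_length`, and `padding='max_length'` concrete, here is a simplified pure-Python sketch of what happens to one sequence of token IDs. The function name, token IDs, and `pad_id=0` are illustrative assumptions. Real tokenizers also keep the closing special token when truncating, which this sketch does not model.

```python
from typing import Dict, List

def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut a sequence to max_length, then right-pad it and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }

short = pad_or_truncate([101, 7, 8, 102], max_length=6)   # shorter than max_length: padded
long = pad_or_truncate(list(range(1, 11)), max_length=6)  # longer than max_length: truncated
print(short)
print(long)
```

Every output has exactly `max_length` positions, which is why the shapes printed above are always `(1, 256)` regardless of the text length.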

## 4. Batch Preprocessing Pipeline

In [None]:
def preprocess_for_training(
    texts: List[str],
    labels: List[int],
    tokenizer: AutoTokenizer,
    max_length: int = 256
) -> Dict[str, torch.Tensor]:
    """
    Preprocess texts for model training.
    
    This function matches the preprocessing used in model training.
    """
    # Tokenize all texts
    encoding = tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding='max_length',
        return_tensors='pt'
    )
    
    # Add labels
    encoding['labels'] = torch.tensor(labels, dtype=torch.long)
    
    return encoding

# Preprocess batch of samples
batch_size = 8  # Same as model training
batch_texts = dataset.texts[:batch_size]
batch_labels = dataset.labels[:batch_size]

batch_encoding = preprocess_for_training(
    batch_texts,
    batch_labels,
    tokenizer,
    max_length=256
)

print("Batch Preprocessing Results:")
print("="*50)
print(f"Batch size: {batch_size}")
print(f"Input IDs shape: {batch_encoding['input_ids'].shape}")
print(f"Attention mask shape: {batch_encoding['attention_mask'].shape}")
print(f"Labels shape: {batch_encoding['labels'].shape}")

# Check padding statistics
# .item() converts the 0-dim tensor to a plain int for printing and division
padding_tokens = (batch_encoding['input_ids'] == tokenizer.pad_token_id).sum().item()
total_tokens = batch_encoding['input_ids'].numel()
print(f"\nPadding Statistics:")
print(f"  Total tokens: {total_tokens}")
print(f"  Padding tokens: {padding_tokens}")
print(f"  Padding ratio: {padding_tokens/total_tokens:.2%}")
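
The padding ratio depends heavily on the padding strategy. The sketch below compares padding every sequence to a fixed length of 256 with padding only to the longest sequence in the batch (what `padding='longest'` does in the Hugging Face tokenizer). The token counts are hypothetical, not measured on AG News.

```python
from typing import List

def padding_ratio(lengths: List[int], padded_length: int) -> float:
    """Fraction of positions that are padding when each sequence is padded to padded_length."""
    clipped = [min(n, padded_length) for n in lengths]  # longer sequences are truncated
    total = padded_length * len(clipped)
    return (total - sum(clipped)) / total

token_lengths = [48, 61, 55, 70, 52, 66, 59, 45]  # hypothetical token counts, batch of 8
fixed = padding_ratio(token_lengths, padded_length=256)
dynamic = padding_ratio(token_lengths, padded_length=max(token_lengths))
print(f"padding='max_length': {fixed:.2%}")
print(f"padding='longest':    {dynamic:.2%}")
```

For short news snippets, a high ratio under fixed-length padding suggests that a smaller `max_length` or per-batch padding could cut wasted computation, at the cost of variable tensor shapes.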

## 5. Text Cleaning Options

In [None]:
# Different cleaning strategies for preprocessing
from src.data.preprocessing.text_cleaner import get_minimal_cleaner, get_aggressive_cleaner

# Minimal cleaning (recommended for transformers)
minimal_cleaner = get_minimal_cleaner()

# Aggressive cleaning (optional)
aggressive_cleaner = get_aggressive_cleaner()

print("Text Cleaning Comparison:")
print("="*50)

test_text = "Check out this AMAZING deal at https://example.com! Only $99.99 #BestDeal"

print(f"Original text:")
print(f"  {test_text}")

minimal_cleaned = minimal_cleaner.clean(test_text)
print(f"\nMinimal cleaning (for transformers):")
print(f"  {minimal_cleaned}")

aggressive_cleaned = aggressive_cleaner.clean(test_text)
print(f"\nAggressive cleaning (optional):")
print(f"  {aggressive_cleaned}")

# Apply to AG News sample
sample_cleaned = minimal_cleaner.clean(sample_texts[0])
print(f"\nAG News sample (minimal cleaning):")
print(f"  Before: {sample_texts[0][:100]}...")
print(f"  After:  {sample_cleaned[:100]}...")
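
The project's `get_minimal_cleaner` and `get_aggressive_cleaner` are not shown in this notebook, so the following is only a sketch of what the two strategies might do, written with the standard library. The function names and the exact rules are assumptions for illustration.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")

def minimal_clean(text: str) -> str:
    """Normalize Unicode and collapse whitespace; keep case, punctuation, and URLs."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def aggressive_clean(text: str) -> str:
    """Additionally remove URLs, lowercase, and replace non-alphanumeric characters."""
    text = minimal_clean(text)
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "Check out this  AMAZING deal at https://example.com! Only $99.99 #BestDeal"
print(minimal_clean(raw))
print(aggressive_clean(raw))
```

The aggressive variant turns `$99.99` into `99 99` and discards casing, which illustrates why lighter cleaning is usually preferred for subword tokenizers that were pretrained on natural text.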

## 6. Feature Extraction for Analysis

In [None]:
# Extract features for data analysis
feature_config = FeatureExtractionConfig(
    extract_length_features=True,
    extract_readability_features=True,
    extract_pos_features=False  # Skip for speed
)

feature_extractor = FeatureExtractor(feature_config)

# Extract features from samples
features_list = []
for text in sample_texts[:5]:
    features = feature_extractor.extract(text)
    features_list.append(features)

# Convert to DataFrame for analysis
features_df = pd.DataFrame(features_list)

print("Extracted Features Summary:")
print("="*50)
print(features_df.describe().round(2))

# Per-sample feature values, labelled with each sample's class
print("\nFeature Values by Sample:")
for i, features in enumerate(features_list):
    # Look up the class name by the sample's label ID, not by loop index
    label_name = dataset.label_names[sample_labels[i]]
    print(f"\nSample {i} ({label_name}):")
    print(f"  Word count: {features['word_count']}")
    print(f"  Char count: {features['char_count']}")
    print(f"  Avg word length: {features.get('avg_word_length', 0):.2f}")
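
The internals of `FeatureExtractor` are not shown here. As a rough illustration of the length features printed above, a whitespace-based version could look like the sketch below; `length_features` is a hypothetical helper, and the project's extractor may tokenize differently.

```python
from typing import Dict

def length_features(text: str) -> Dict[str, float]:
    """Compute simple length statistics using whitespace-separated words."""
    words = text.split()
    word_count = len(words)
    return {
        "char_count": len(text),
        "word_count": word_count,
        # Guard against empty input to avoid division by zero
        "avg_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
    }

example = length_features("Stocks rallied on Wall Street")
print(example)
```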

## 7. DataLoader Integration

In [None]:
from torch.utils.data import DataLoader, Dataset

class PreprocessedDataset(Dataset):
    """
    Dataset with integrated preprocessing for model training.
    """
    
    def __init__(self, texts: List[str], labels: List[int], 
                 tokenizer: AutoTokenizer, max_length: int = 256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        text = self.texts[idx]
        label = self.labels[idx]
        
        # Tokenize
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        
        return {
            # squeeze(0) drops only the batch dimension added by return_tensors='pt'
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Create dataset
preprocessed_dataset = PreprocessedDataset(
    texts=dataset.texts[:100],
    labels=dataset.labels[:100],
    tokenizer=tokenizer,
    max_length=256
)

# Create DataLoader
dataloader = DataLoader(
    preprocessed_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    pin_memory=torch.cuda.is_available()
)

# Test DataLoader
batch = next(iter(dataloader))

print("DataLoader Integration Test:")
print("="*50)
print(f"Dataset size: {len(preprocessed_dataset)}")
print(f"Number of batches: {len(dataloader)}")
print(f"\nBatch shapes:")
print(f"  Input IDs: {batch['input_ids'].shape}")
print(f"  Attention mask: {batch['attention_mask'].shape}")
print(f"  Labels: {batch['labels'].shape}")
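
The number of batches reported by `len(dataloader)` follows directly from the dataset size and batch size. A small worked example, where `num_batches` is a hypothetical helper that mirrors the behaviour of `DataLoader` with and without `drop_last`:

```python
import math

def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches per epoch; the final partial batch is kept unless drop_last is set."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

default_count = num_batches(100, 8)                   # 12 full batches plus one partial batch
dropped_count = num_batches(100, 8, drop_last=True)   # partial batch discarded
last_batch_size = 100 - 12 * 8                        # samples left for the partial batch
print(default_count, dropped_count, last_batch_size)
```

With 100 samples and a batch size of 8, the last batch holds only 4 samples, so code that assumes a fixed batch dimension should either handle that case or set `drop_last=True`.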

## 8. Preprocessing Performance Optimization

In [None]:
import time

def benchmark_preprocessing(texts: List[str], tokenizer: AutoTokenizer) -> Dict[str, float]:
    """
    Benchmark preprocessing performance.
    """
    results = {}
    
    # Single-text processing: one tokenizer call per text.
    # time.perf_counter() is a monotonic, high-resolution clock suited to benchmarking.
    start = time.perf_counter()
    for text in texts:
        _ = tokenizer(
            text,
            truncation=True,
            max_length=256,
            padding='max_length',
            return_tensors='pt'
        )
    results['sequential'] = time.perf_counter() - start
    
    # Batch processing: one tokenizer call for the whole list
    start = time.perf_counter()
    _ = tokenizer(
        texts,
        truncation=True,
        max_length=256,
        padding='max_length',
        return_tensors='pt'
    )
    results['batch'] = time.perf_counter() - start
    
    results['speedup'] = results['sequential'] / results['batch']
    
    return results

# Benchmark with different batch sizes
test_sizes = [10, 50, 100]

print("Preprocessing Performance Benchmark:")
print("="*50)

for size in test_sizes:
    test_texts = dataset.texts[:size]
    results = benchmark_preprocessing(test_texts, tokenizer)
    
    print(f"\nBatch size: {size}")
    print(f"  Sequential: {results['sequential']:.3f} seconds")
    print(f"  Batch: {results['batch']:.3f} seconds")
    print(f"  Speedup: {results['speedup']:.1f}x")
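
A single timed run can be noisy, especially for small inputs. One common remedy is to repeat the measurement and keep the fastest run. The sketch below shows the idea with a stand-in workload; `best_of` is a hypothetical helper, and in this notebook the lambda would wrap a tokenizer call instead.

```python
import time
from typing import Callable

def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over several runs of fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload: whitespace-split 2000 short strings
texts = ["some example headline text"] * 2000
elapsed = best_of(lambda: [t.split() for t in texts])
print(f"best of 5: {elapsed:.6f} seconds")
```

Taking the minimum filters out one-off delays such as cache warm-up or background activity, which would otherwise distort the sequential-versus-batch comparison.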

## 9. Save Preprocessed Data

In [None]:
# Preprocess and save data for model training
def save_preprocessed_dataset(
    texts: List[str],
    labels: List[int],
    tokenizer: AutoTokenizer,
    output_path: Path,
    max_length: int = 256
):
    """
    Save preprocessed data compatible with model training.
    """
    ensure_dir(output_path.parent)
    
    # Preprocess all texts
    print("Preprocessing texts...")
    encoding = tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding='max_length',
        return_tensors='pt'
    )
    
    # Create dataset dictionary
    dataset_dict = {
        'input_ids': encoding['input_ids'],
        'attention_mask': encoding['attention_mask'],
        'labels': torch.tensor(labels, dtype=torch.long),
        'metadata': {
            'num_samples': len(texts),
            'max_length': max_length,
            # Read the name from the tokenizer itself rather than a notebook-level global
            'tokenizer': tokenizer.name_or_path,
            'preprocessing_date': pd.Timestamp.now().isoformat()
        }
    }
    
    # Save to disk
    torch.save(dataset_dict, output_path)
    
    print(f"\nPreprocessed data saved to: {output_path}")
    print(f"  File size: {output_path.stat().st_size / 1024**2:.2f} MB")
    
    return output_path

# Save preprocessed data
output_file = OUTPUT_DIR / "tutorial" / "preprocessed_agnews.pt"
saved_path = save_preprocessed_dataset(
    texts=dataset.texts[:500],
    labels=dataset.labels[:500],
    tokenizer=tokenizer,
    output_path=output_file,
    max_length=256
)

# Verify saved data
# weights_only=True restricts unpickling to tensors and plain Python containers
loaded_data = torch.load(saved_path, weights_only=True)
print("\nVerification of saved data:")
print(f"  Input IDs shape: {loaded_data['input_ids'].shape}")
print(f"  Attention mask shape: {loaded_data['attention_mask'].shape}")
print(f"  Labels shape: {loaded_data['labels'].shape}")
print(f"  Metadata: {loaded_data['metadata']}")
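
The reported file size can be sanity-checked with simple arithmetic. The estimate below assumes that `input_ids` and `attention_mask` are stored as 64-bit integers (8 bytes per value), like the `torch.long` labels, and it ignores the small overhead of metadata and serialization. `estimated_tensor_bytes` is a hypothetical helper.

```python
def estimated_tensor_bytes(num_samples: int, max_length: int, bytes_per_value: int = 8) -> int:
    """Approximate raw tensor storage for input_ids, attention_mask, and labels."""
    per_matrix = num_samples * max_length * bytes_per_value  # input_ids or attention_mask
    labels = num_samples * bytes_per_value
    return 2 * per_matrix + labels

size = estimated_tensor_bytes(num_samples=500, max_length=256)
print(f"{size} bytes = {size / 1024**2:.2f} MiB")
```

For 500 samples at 256 tokens this comes to 2,052,000 bytes, or about 1.96 MiB. A saved file that is much larger or smaller than the estimate would be worth investigating.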

## 10. Conclusions and Best Practices

### Key Takeaways

1. **Preprocessing for Transformers**:
   - Use the tokenizer that belongs to the target model (DeBERTa-v3)
   - Keep `max_length` consistent (256 tokens) between preprocessing and training
   - Apply minimal text cleaning to preserve information

2. **Batch Processing**:
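   - Tokenizing a list of texts in one call is typically faster than looping over single texts; Section 8 measures the speedup
   - Use the same batch size as training (8)
   - Enable padding and truncation so every sequence has the same length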

3. **Data Compatibility**:
   - Ensure preprocessing matches model training setup
   - Save preprocessed data for reproducibility
   - Include metadata for tracking

4. **Performance Optimization**:
   - Use DataLoader with multiple workers
   - Enable pin_memory for GPU training
   - Cache preprocessed data when possible

### Best Practices

1. **Always use the same tokenizer** for preprocessing and model training
2. **Maintain consistent parameters** (max_length, padding strategy)
3. **Validate preprocessed data** before training
4. **Monitor preprocessing time** for optimization
5. **Save preprocessed versions** for reproducibility
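
As an illustration of the validation step, the sketch below checks a few invariants that preprocessed data should satisfy before training. It works on plain Python lists so that it stays self-contained. The function name, `pad_id=0`, and the specific checks are assumptions rather than part of the project code.

```python
from typing import List

def validate_encoded(
    input_ids: List[List[int]],
    attention_mask: List[List[int]],
    labels: List[int],
    max_length: int,
    num_classes: int,
    pad_id: int = 0,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not (len(input_ids) == len(attention_mask) == len(labels)):
        problems.append("row counts differ between input_ids, attention_mask, and labels")
    for i, (ids, mask) in enumerate(zip(input_ids, attention_mask)):
        if len(ids) != max_length or len(mask) != max_length:
            problems.append(f"row {i}: sequence length is not {max_length}")
        elif any(m == 0 and t != pad_id for t, m in zip(ids, mask)):
            problems.append(f"row {i}: masked position holds a non-pad token")
    for i, y in enumerate(labels):
        if not 0 <= y < num_classes:
            problems.append(f"row {i}: label {y} outside 0..{num_classes - 1}")
    return problems

ok = validate_encoded([[5, 6, 0, 0], [7, 8, 9, 0]], [[1, 1, 0, 0], [1, 1, 1, 0]], [0, 3], 4, 4)
bad = validate_encoded([[5, 6, 7, 0]], [[1, 1, 0, 0]], [4], 4, 4)
print(ok)
print(bad)
```

With tensors, the same checks could be applied after converting with `.tolist()`, or rewritten as vectorized comparisons for larger datasets.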

### Next Steps

- Continue to: `03_model_training_basics.ipynb` for model training
- Explore: `04_evaluation_tutorial.ipynb` for model evaluation
- Review: Documentation at `docs/user_guide/data_preparation.md`