# Text Preprocessing Tutorial for AG News Classification

## Overview

This notebook demonstrates comprehensive text preprocessing techniques following methodologies from:
- Pennington et al. (2014): "GloVe: Global Vectors for Word Representation"
- Bojanowski et al. (2017): "Enriching Word Vectors with Subword Information"
- Kudo & Richardson (2018): "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing"

### Tutorial Objectives
1. Implement text cleaning and normalization
2. Apply various tokenization strategies
3. Extract linguistic features
4. Handle special characters and noise
5. Prepare text for model input

Author: Võ Hải Dũng  
Email: vohaidung.work@gmail.com  
Date: 2025

## 1. Environment Setup

In [None]:
# Standard library imports
import sys
import re
import string
import unicodedata
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
from collections import Counter
import warnings

# Data manipulation and NLP imports
import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel
import nltk
import spacy

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from tqdm.auto import tqdm

# Project imports
PROJECT_ROOT = Path("../..").resolve()
sys.path.insert(0, str(PROJECT_ROOT))

from src.data.datasets.ag_news import AGNewsDataset, AGNewsConfig
from src.data.preprocessing.text_cleaner import TextCleaner, CleaningConfig
from src.data.preprocessing.tokenization import Tokenizer, TokenizationConfig
from src.data.preprocessing.feature_extraction import FeatureExtractor, FeatureExtractionConfig
from src.data.preprocessing.sliding_window import SlidingWindowProcessor
from src.data.preprocessing.prompt_formatter import PromptFormatter
from src.utils.io_utils import safe_load, safe_save, ensure_dir
from src.utils.logging_config import setup_logging
from configs.constants import (
    AG_NEWS_CLASSES,
    DATA_DIR,
    OUTPUT_DIR
)

# Configuration
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
logger = setup_logging('preprocessing_tutorial')

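# Download NLTK resources.
# Newer NLTK releases look for 'punkt_tab' and 'averaged_perceptron_tagger_eng',
# older ones for 'punkt' and 'averaged_perceptron_tagger'; requesting both is harmless.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet',
                 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
    nltk.download(resource, quiet=True)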

print("Text Preprocessing Tutorial")
print("="*50)
print(f"Project Root: {PROJECT_ROOT}")

## 2. Load Sample Data

In [None]:
# Load AG News dataset
config = AGNewsConfig(
    data_dir=DATA_DIR / "processed",
    max_samples=1000  # Limit for tutorial
)

dataset = AGNewsDataset(config, split="train")

# Extract sample texts
sample_texts = dataset.texts[:10]
sample_labels = dataset.labels[:10]

print("Sample Data Loaded:")
print(f"  Total samples: {len(dataset)}")
print(f"  Sample texts: {len(sample_texts)}")

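# Display first sample
print("\nFirst sample:")
# Index label_names by the sample's label (assumes integer class ids), not by position 0
print(f"  Label: {sample_labels[0]} ({dataset.label_names[sample_labels[0]]})")
print(f"  Text: {sample_texts[0][:200]}...")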

## 3. Basic Text Cleaning

In [None]:
# Initialize text cleaner
cleaning_config = CleaningConfig(
    lowercase=True,
    remove_punctuation=False,
    remove_digits=False,
    remove_urls=True,
    remove_emails=True,
    remove_html_tags=True,
    remove_special_chars=False,
    normalize_whitespace=True
)

cleaner = TextCleaner(cleaning_config)

# Clean sample texts
cleaned_texts = [cleaner.clean(text) for text in sample_texts]

print("Text Cleaning Examples:")
print("="*50)

for i in range(3):
    print(f"\nSample {i+1}:")
    print(f"  Original: {sample_texts[i][:150]}...")
    print(f"  Cleaned:  {cleaned_texts[i][:150]}...")
    
    # Calculate reduction
    orig_len = len(sample_texts[i])
    clean_len = len(cleaned_texts[i])
    reduction = (1 - clean_len/orig_len) * 100
    print(f"  Length reduction: {reduction:.1f}%")
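The `TextCleaner` class above is project code, so its internals are not shown here. As a rough illustration of what the enabled options (`remove_html_tags`, `remove_urls`, `remove_emails`, `lowercase`, `normalize_whitespace`) might amount to, here is a minimal standard-library sketch. The function name `simple_clean` and the regular expressions are assumptions made for this example, not the project's actual implementation.

```python
import html
import re

# Hypothetical patterns; the project's TextCleaner may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
TAG_RE = re.compile(r"<[^>]+>")


def simple_clean(text: str) -> str:
    """Strip HTML tags, URLs and emails, lowercase, and collapse whitespace."""
    text = html.unescape(text)      # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)    # drop HTML tags
    text = URL_RE.sub(" ", text)    # drop URLs
    text = EMAIL_RE.sub(" ", text)  # drop email addresses
    text = text.lower()
    return " ".join(text.split())   # collapse runs of whitespace


example = ("Stocks <b>rally</b> &amp; bonds slip. "
           "Details: https://example.com/a?b=1 or mail desk@example.com")
print(simple_clean(example))
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; the final `split`/`join` removes the extra spaces.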

## 4. Advanced Text Normalization

In [None]:
def advanced_normalize(text: str) -> str:
    """
    Apply advanced text normalization techniques.
    
    Following normalization practices from:
        Jurafsky & Martin (2023): "Speech and Language Processing"
    """
    # Unicode normalization
    text = unicodedata.normalize('NFKD', text)
    
    # Remove accents
    text = ''.join([c for c in text if not unicodedata.combining(c)])
    
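    # Map typographic apostrophes (U+2019) to ASCII so the lookups below can match;
    # NFKD normalization does not do this
    text = text.replace('\u2019', "'")
    
    # Expand contractions. Order matters: whole words ("won't", "can't") must be
    # replaced before the generic "n't" suffix. "'d" is ambiguous (would/had) and
    # is expanded to "would" here for simplicity. Keys are lowercase, so the input
    # should already be lowercased.
    contractions = {
        "won't": "will not",
        "can't": "cannot",
        "n't": " not",
        "'re": " are",
        "'ve": " have",
        "'ll": " will",
        "'d": " would",
        "'m": " am"
    }
    
    for contraction, expansion in contractions.items():
        text = text.replace(contraction, expansion)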
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply advanced normalization
normalized_texts = [advanced_normalize(text) for text in cleaned_texts]

print("Advanced Normalization Examples:")
print("="*50)

test_text = "It won't be easy, but we'll succeed! Café résumé naïve."
normalized = advanced_normalize(test_text)

print(f"Original:   {test_text}")
print(f"Normalized: {normalized}")

# Apply to sample
print(f"\nSample text normalization:")
print(f"  Before: {cleaned_texts[0][:100]}...")
print(f"  After:  {normalized_texts[0][:100]}...")
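Plain substring replacement is simple, but it is case-sensitive and can match inside longer strings. The sketch below shows one possible alternative that anchors each contraction with a regex word boundary and ignores case. The names `CONTRACTIONS`, `expand_contractions`, and `strip_accents` are hypothetical, the contraction list is a small illustrative subset, and the ambiguous `'d` (would/had) is deliberately left out.

```python
import re
import unicodedata

# Illustrative subset; whole-word forms must come before the generic suffixes.
CONTRACTIONS = [
    (r"\bwon't\b", "will not"),
    (r"\bcan't\b", "cannot"),
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
    (r"'ll\b", " will"),
    (r"'m\b", " am"),
]


def expand_contractions(text: str) -> str:
    """Expand a few English contractions, case-insensitively."""
    # Map typographic apostrophes (U+2019, U+2018) to the ASCII apostrophe first.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    for pattern, expansion in CONTRACTIONS:
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    return text


def strip_accents(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(expand_contractions("It won\u2019t work, but we\u2019ll see"))
print(strip_accents("Café résumé naïve"))
```

Note that the replacement text is always lowercase, so a sentence-initial "Won't" becomes "will not"; that is acceptable when the text is lowercased anyway, as in this tutorial.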

## 5. Tokenization Strategies

In [None]:
# Different tokenization approaches
print("Tokenization Strategies Comparison:")
print("="*50)

sample_text = normalized_texts[0]

# 1. Whitespace tokenization
whitespace_tokens = sample_text.split()
print(f"\n1. Whitespace tokenization:")
print(f"   Tokens: {len(whitespace_tokens)}")
print(f"   Sample: {whitespace_tokens[:10]}")

# 2. NLTK tokenization
from nltk.tokenize import word_tokenize
nltk_tokens = word_tokenize(sample_text)
print(f"\n2. NLTK tokenization:")
print(f"   Tokens: {len(nltk_tokens)}")
print(f"   Sample: {nltk_tokens[:10]}")

# 3. Subword tokenization (WordPiece, as used by BERT; RoBERTa and GPT-2 use BPE)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
subword_tokens = tokenizer.tokenize(sample_text)
print(f"\n3. Subword tokenization (BERT):")
print(f"   Tokens: {len(subword_tokens)}")
print(f"   Sample: {subword_tokens[:10]}")

# 4. Character-level tokenization
char_tokens = list(sample_text)
print(f"\n4. Character tokenization:")
print(f"   Tokens: {len(char_tokens)}")
print(f"   Sample: {char_tokens[:20]}")
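To make the subword step less of a black box, the following is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece-style tokenizers apply to each word at inference time. The vocabulary `toy_vocab` is invented for illustration; a real BERT vocabulary has roughly 30,000 entries, and the real tokenizer also handles casing, punctuation, and a maximum word length.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"token", "##ization", "##ize", "un", "##known", "##s"}
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokens", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because rare words decompose into known pieces, only words with no matching pieces at all fall back to the unknown token.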

## 6. Linguistic Feature Extraction

In [None]:
def extract_linguistic_features(text: str) -> Dict[str, Any]:
    """
    Extract linguistic features from text.
    
    Following feature engineering practices from:
        Zheng & Casari (2018): "Feature Engineering for Machine Learning"
    """
    features = {}
    
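    # Basic statistics
    raw_words = text.split()
    features['char_count'] = len(text)
    features['word_count'] = len(raw_words)
    # Rough estimate: counts terminal punctuation, so abbreviations and decimals inflate it
    features['sentence_count'] = text.count('.') + text.count('!') + text.count('?')
    # Guard against empty input, where np.mean([]) would return nan with a warning
    features['avg_word_length'] = float(np.mean([len(word) for word in raw_words])) if raw_words else 0.0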
    
    # Vocabulary richness
    words = text.lower().split()
    features['unique_words'] = len(set(words))
    features['lexical_diversity'] = features['unique_words'] / len(words) if words else 0
    
    # POS tagging
    pos_tags = nltk.pos_tag(nltk.word_tokenize(text))
    pos_counts = Counter([tag for word, tag in pos_tags])
    
    features['noun_count'] = sum(count for tag, count in pos_counts.items() if tag.startswith('NN'))
    features['verb_count'] = sum(count for tag, count in pos_counts.items() if tag.startswith('VB'))
    features['adj_count'] = sum(count for tag, count in pos_counts.items() if tag.startswith('JJ'))
    
    # Special patterns
    features['has_numbers'] = bool(re.search(r'\d', text))
    features['has_urls'] = bool(re.search(r'https?://', text))
    features['uppercase_ratio'] = sum(1 for c in text if c.isupper()) / len(text) if text else 0
    
    return features

# Extract features from samples
print("Linguistic Feature Extraction:")
print("="*50)

for i in range(3):
    features = extract_linguistic_features(sample_texts[i])
    
    print(f"\nSample {i+1} features:")
    for key, value in list(features.items())[:8]:  # Show first 8 features
        if isinstance(value, float):
            print(f"  {key:20}: {value:.3f}")
        else:
            print(f"  {key:20}: {value}")
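Whitespace splitting leaves punctuation attached to words, so "cat" and "cat." count as different types and lexical diversity is slightly overstated. A small sketch of the same statistics computed on regex-extracted words is shown below; the pattern `WORD_RE` and the function `basic_text_stats` are assumptions for this example and are tuned to plain English text only.

```python
import re
from typing import Dict

# Letters/digits with an optional apostrophe suffix (e.g. "don't"); English-centric.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def basic_text_stats(text: str) -> Dict[str, float]:
    """Word count, unique words, type-token ratio, and mean word length."""
    words = WORD_RE.findall(text.lower())
    n = len(words)
    unique = len(set(words))
    return {
        "word_count": n,
        "unique_words": unique,
        "lexical_diversity": unique / n if n else 0.0,  # type-token ratio
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
    }


print(basic_text_stats("The cat saw the cat."))
```

The type-token ratio depends strongly on text length, so it is best compared across texts of similar size.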

## 7. Stop Words and Filtering

In [None]:
from nltk.corpus import stopwords

# Get stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text: str, custom_stopwords: Optional[set] = None) -> str:
    """
    Remove stopwords from text.
    
    Following stopword removal strategies from:
        Manning et al. (2008): "Introduction to Information Retrieval"
    """
    words = text.lower().split()
    
    # Combine default and custom stopwords
    all_stopwords = stop_words.copy()
    if custom_stopwords:
        all_stopwords.update(custom_stopwords)
    
    # Filter words
    filtered_words = [word for word in words if word not in all_stopwords]
    
    return ' '.join(filtered_words)

# Apply stopword removal
print("Stopword Removal Examples:")
print("="*50)

# Add domain-specific stopwords for news
news_stopwords = {'said', 'says', 'according', 'reported', 'news'}

for i in range(2):
    original = normalized_texts[i]
    filtered = remove_stopwords(original, news_stopwords)
    
    print(f"\nSample {i+1}:")
    print(f"  Original ({len(original.split())} words):")
    print(f"    {original[:150]}...")
    print(f"  Filtered ({len(filtered.split())} words):")
    print(f"    {filtered[:150]}...")
    print(f"  Reduction: {(1 - len(filtered.split())/len(original.split()))*100:.1f}%")
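Since `remove_stopwords` compares whitespace-separated tokens, a stopword followed by punctuation (for example "said,") is not filtered. One possible refinement, sketched here with a tiny hand-written stopword set instead of the NLTK list, strips surrounding punctuation before the lookup. The names `TOY_STOPWORDS` and `remove_stopwords_punct_aware` are hypothetical.

```python
import string
from typing import Iterable, List

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"the", "a", "of", "on", "said", "and"}


def remove_stopwords_punct_aware(text: str, stopwords: Iterable[str]) -> str:
    """Drop tokens whose punctuation-stripped, lowercased form is a stopword."""
    stop = set(stopwords)
    kept: List[str] = []
    for token in text.split():
        core = token.strip(string.punctuation).lower()
        # Tokens that are pure punctuation (empty core) are dropped as well.
        if core and core not in stop:
            kept.append(token)
    return " ".join(kept)


print(remove_stopwords_punct_aware("The CEO said, on Monday, profits rose.", TOY_STOPWORDS))
```

Unlike the function above, this variant keeps the original casing and punctuation of the surviving tokens, which may or may not be desirable depending on the downstream model.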

## 8. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Prepare texts for vectorization
texts_for_vectorization = [remove_stopwords(advanced_normalize(text)) 
                          for text in dataset.texts[:100]]

print("Text Vectorization Methods:")
print("="*50)

# 1. Bag of Words
count_vectorizer = CountVectorizer(max_features=100)
bow_matrix = count_vectorizer.fit_transform(texts_for_vectorization)

print("\n1. Bag of Words (BoW):")
print(f"   Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"   Matrix shape: {bow_matrix.shape}")
print(f"   Sparsity: {(1 - bow_matrix.nnz / np.prod(bow_matrix.shape))*100:.2f}%")  # share of zero entries

# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(texts_for_vectorization)

print("\n2. TF-IDF:")
print(f"   Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"   Matrix shape: {tfidf_matrix.shape}")
print(f"   Sparsity: {(1 - tfidf_matrix.nnz / np.prod(tfidf_matrix.shape))*100:.2f}%")  # share of zero entries

# Show top features
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_scores = tfidf_matrix.sum(axis=0).A1
top_indices = tfidf_scores.argsort()[-10:][::-1]

print("\nTop 10 TF-IDF features:")
for idx in top_indices:
    print(f"   {feature_names[idx]:20} : {tfidf_scores[idx]:.3f}")
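To show where the TF-IDF weights come from, here is a small pure-Python sketch for pre-tokenized documents. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization, which is the form scikit-learn documents for `TfidfVectorizer` with default settings. The function name `tfidf` and the three toy documents are made up for illustration, and the sketch ignores n-grams and `max_features`.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Smoothed TF-IDF with L2-normalized rows for pre-tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Toy corpus, invented for this example.
docs = [["oil", "prices", "rise"], ["oil", "stocks", "fall"], ["team", "wins", "final"]]
rows = tfidf(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "oil" receives a lower weight than "prices" because it also occurs in the second document and is therefore less distinctive.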

## 9. Transformer Tokenization

In [None]:
# Compare different transformer tokenizers
print("Transformer Tokenization Comparison:")
print("="*50)

sample_text = sample_texts[0]

tokenizers = {
    'BERT': 'bert-base-uncased',
    'RoBERTa': 'roberta-base',
    'DeBERTa': 'microsoft/deberta-v3-base'
}

for name, model_name in tokenizers.items():
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Tokenize
    encoding = tokenizer(
        sample_text,
        truncation=True,
        max_length=128,
        padding='max_length',
        return_tensors='pt'
    )
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'][0])
    
    # Count non-padding tokens via the attention mask (1 = real token, 0 = padding);
    # this avoids hard-coding each model's pad-token string
    non_padding = int(encoding['attention_mask'].sum())
    
    print(f"\n{name} Tokenizer:")
    print(f"  Vocab size: {tokenizer.vocab_size}")
    print(f"  Tokens (non-padding): {non_padding}")
    print(f"  First 10 tokens: {tokens[:10]}")
    print(f"  Special tokens: {list(tokenizer.special_tokens_map.keys())}")
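The `padding='max_length'` and `truncation=True` arguments produce fixed-length inputs plus an attention mask. The sketch below illustrates the idea on a plain list of token ids. The function `pad_or_truncate` is a hypothetical helper, the ids are arbitrary, and real tokenizers additionally keep the closing special token when they truncate, which this simplified version does not.

```python
from typing import List, Tuple


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                # truncate on the right
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


input_ids, attention_mask = pad_or_truncate([101, 7592, 102], max_length=5)
print(input_ids)
print(attention_mask)
print("non-padding tokens:", sum(attention_mask))
```

Summing the mask gives the number of real tokens regardless of which pad token a given model uses.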

## 10. Preprocessing Pipeline

In [None]:
class PreprocessingPipeline:
    """
    Complete preprocessing pipeline for AG News.
    
    Following pipeline design from:
        Pedregosa et al. (2011): "Scikit-learn: Machine Learning in Python"
    """
    
    def __init__(self, 
                 lowercase: bool = True,
                 remove_stopwords: bool = True,
                 normalize: bool = True,
                 tokenizer_name: str = 'bert-base-uncased'):
        
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.normalize = normalize
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.stop_words = set(stopwords.words('english'))
        
    def process(self, text: str) -> Dict[str, Any]:
        """Process single text through pipeline."""
        
        # Step 1: Basic cleaning
        if self.lowercase:
            text = text.lower()
        
        # Step 2: Normalization
        if self.normalize:
            text = advanced_normalize(text)
        
        # Step 3: Remove stopwords
        if self.remove_stopwords:
            words = text.split()
            text = ' '.join([w for w in words if w not in self.stop_words])
        
        # Step 4: Tokenization
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=256,
            padding='max_length',
            return_tensors='pt'
        )
        
        return {
            'processed_text': text,
            # squeeze(0) removes only the batch dimension, leaving shape (max_length,)
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'num_tokens': encoding['attention_mask'].sum().item()
        }

# Create and test pipeline
pipeline = PreprocessingPipeline()

print("Preprocessing Pipeline Test:")
print("="*50)

# Process samples
for i in range(3):
    result = pipeline.process(sample_texts[i])
    
    print(f"\nSample {i+1}:")
    print(f"  Original length: {len(sample_texts[i])} chars")
    print(f"  Processed length: {len(result['processed_text'])} chars")
    print(f"  Number of tokens: {result['num_tokens']}")
    print(f"  Input shape: {result['input_ids'].shape}")

## 11. Batch Processing

In [None]:
def batch_preprocess(texts: List[str], 
                    batch_size: int = 32,
                    show_progress: bool = True) -> Dict[str, torch.Tensor]:
    """
    Batch preprocessing for efficiency.
    
    Following batch processing optimization from:
        Howard & Ruder (2018): "Universal Language Model Fine-tuning for Text Classification"
    """
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    
    all_input_ids = []
    all_attention_masks = []
    
    # Process in batches
    num_batches = (len(texts) + batch_size - 1) // batch_size
    
    iterator = range(0, len(texts), batch_size)
    if show_progress:
        iterator = tqdm(iterator, desc="Processing batches", total=num_batches)
    
    for i in iterator:
        batch_texts = texts[i:i+batch_size]
        
        # Batch tokenization
        encoding = tokenizer(
            batch_texts,
            truncation=True,
            max_length=256,
            padding='max_length',
            return_tensors='pt'
        )
        
        all_input_ids.append(encoding['input_ids'])
        all_attention_masks.append(encoding['attention_mask'])
    
    # Concatenate all batches
    return {
        'input_ids': torch.cat(all_input_ids, dim=0),
        'attention_mask': torch.cat(all_attention_masks, dim=0)
    }

# Process dataset in batches
print("Batch Processing Performance:")
print("="*50)

import time

# Test with different batch sizes
test_texts = dataset.texts[:200]
batch_sizes = [8, 16, 32, 64]

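for batch_size in batch_sizes:
    # perf_counter is a monotonic, high-resolution clock suited to timing.
    # Note: batch_preprocess reloads the tokenizer on each call, so the
    # measured time includes that load as well as tokenization.
    start_time = time.perf_counter()
    result = batch_preprocess(test_texts, batch_size=batch_size, show_progress=False)
    elapsed = time.perf_counter() - start_time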
    
    print(f"\nBatch size {batch_size}:")
    print(f"  Processing time: {elapsed:.2f} seconds")
    print(f"  Throughput: {len(test_texts)/elapsed:.0f} samples/sec")
    print(f"  Output shape: {result['input_ids'].shape}")
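The batching logic itself is independent of the tokenizer and can be checked in isolation. The sketch below separates the slicing and the ceiling-division batch count into two small helpers; `iter_batches` and `num_batches` are names chosen for this example.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(n_items: int, batch_size: int) -> int:
    """Ceiling division without floating point."""
    return (n_items + batch_size - 1) // batch_size


sizes = [len(batch) for batch in iter_batches(list(range(200)), 64)]
print(sizes, num_batches(200, 64))
```

With 200 texts and a batch size of 64, the last batch holds only the 8 remaining items, which is why the batch count uses ceiling rather than floor division.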

## 12. Save Preprocessed Data

In [None]:
# Save preprocessed data for future use
def save_preprocessed_data(texts: List[str], 
                          labels: List[int],
                          output_path: Path):
    """
    Save preprocessed data efficiently.
    
    Following data serialization best practices from:
        PyTorch Documentation: "Saving and Loading Models"
    """
    ensure_dir(output_path.parent)
    
    # Preprocess all texts
    print("Preprocessing texts...")
    processed = batch_preprocess(texts, batch_size=32)
    
    # Create dataset dictionary
    dataset_dict = {
        'input_ids': processed['input_ids'],
        'attention_mask': processed['attention_mask'],
        'labels': torch.tensor(labels, dtype=torch.long),
        'metadata': {
            'num_samples': len(texts),
            'max_length': 256,
            'tokenizer': 'bert-base-uncased',
            'preprocessing_date': pd.Timestamp.now().isoformat()
        }
    }
    
    # Save to disk
    torch.save(dataset_dict, output_path)
    
    print(f"\nPreprocessed data saved to: {output_path}")
    print(f"  File size: {output_path.stat().st_size / 1024**2:.2f} MB")
    
    return output_path

# Save sample preprocessed data
output_file = OUTPUT_DIR / "tutorial" / "preprocessed_sample.pt"
saved_path = save_preprocessed_data(
    texts=dataset.texts[:500],
    labels=dataset.labels[:500],
    output_path=output_file
)

# Verify saved data
loaded_data = torch.load(saved_path)
print("\nVerification of saved data:")
print(f"  Input IDs shape: {loaded_data['input_ids'].shape}")
print(f"  Attention mask shape: {loaded_data['attention_mask'].shape}")
print(f"  Labels shape: {loaded_data['labels'].shape}")
print(f"  Metadata: {loaded_data['metadata']}")
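Saved tensors are only reusable if they were produced with the same settings. One way to guard against silently mixing caches, sketched here under the assumption that the settings fit in a JSON-serializable dictionary, is to derive a short key from them and include it in the file name. The helper `cache_key` and the example file name are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(settings: Dict[str, Any]) -> str:
    """Stable 16-character hash of the preprocessing settings."""
    # sort_keys makes the key independent of dictionary insertion order.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


settings = {"tokenizer": "bert-base-uncased", "max_length": 256}
key = cache_key(settings)
print(f"preprocessed_{key}.pt")  # hypothetical file name
```

Changing any setting, such as `max_length`, yields a different key, so a stale file is never picked up by accident.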

## 13. Conclusions and Best Practices

### Key Takeaways

1. **Text Cleaning**:
   - Remove noise while preserving information
   - Balance between aggressive and minimal cleaning
   - Consider domain-specific requirements

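2. **Tokenization**:
   - Match the tokenizer to the model: a pretrained transformer must be used with its own tokenizer
   - Subword tokenization handles out-of-vocabulary (OOV) words better than word-level tokenization
   - Weigh computational cost against accuracy (character-level sequences are much longer than subword ones)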

3. **Feature Engineering**:
   - Extract both statistical and linguistic features
   - Combine multiple feature types for robustness
   - Validate feature importance

4. **Efficiency**:
   - Use batch processing for large datasets
   - Cache preprocessed data
   - Optimize pipeline for production

### Best Practices

1. **Always preserve original data** for reproducibility
2. **Document preprocessing steps** clearly
3. **Validate preprocessing** impact on model performance
4. **Use consistent preprocessing** across train/val/test
5. **Monitor preprocessing time** for optimization

### Next Steps

- Continue to: `03_model_training_basics.ipynb` for model training
- Explore: `05_prompt_engineering.ipynb` for advanced formatting
- Review: Documentation at `docs/user_guide/data_preparation.md`