# Comprehensive Emotion Classification Model Comparison

**Author:** Sleep Well  
**Date:** 2025-08-04  
**Purpose:** Compare multiple state-of-the-art deep learning models for text emotion classification

This notebook provides a comprehensive comparison of 6 different deep learning models for emotion classification, including both traditional architectures and modern transformer-based models.

## 1. Environment Setup and Imports

In [1]:
# Core libraries (always available)
import os
import json
import time
import warnings
import random
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple, Optional, Any
from collections import Counter, defaultdict

# Data manipulation and analysis
import numpy as np
import pandas as pd

# Check for PyTorch availability
TORCH_AVAILABLE = False
try:
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data import DataLoader, Dataset
    from torch.nn.utils.rnn import pad_sequence
    TORCH_AVAILABLE = True
    print(f"✅ PyTorch {torch.__version__} imported successfully!")
    print(f"🔥 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
except ImportError as e:
    print(f"⚠️ PyTorch not available: {e}")
    print("📝 Will use NumPy-based implementations")

# Check for Transformers availability
TRANSFORMERS_AVAILABLE = False
try:
    from transformers import (
        AutoTokenizer, AutoModelForSequenceClassification,
        get_linear_schedule_with_warmup
    )
    TRANSFORMERS_AVAILABLE = True
    print("✅ Transformers imported successfully!")
except ImportError as e:
    print(f"⚠️ Transformers not available: {e}")
    print("📝 Will use traditional models only")

# Check for Datasets availability
DATASETS_AVAILABLE = False
try:
    from datasets import load_dataset
    DATASETS_AVAILABLE = True
    print("✅ Datasets imported successfully!")
except ImportError as e:
    print(f"⚠️ Datasets not available: {e}")
    print("📝 Will use sample data")

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    classification_report, confusion_matrix
)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

PLOTLY_AVAILABLE = False
try:
    import plotly.graph_objects as go
    import plotly.express as px
    from plotly.subplots import make_subplots
    PLOTLY_AVAILABLE = True
    print("✅ Plotly imported successfully!")
except ImportError:
    print("⚠️ Plotly not available - will use matplotlib")

# Progress tracking
try:
    from tqdm.auto import tqdm
except ImportError:
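    # Fallback tqdm: minimal stand-in that prints coarse progress
    class tqdm:
        def __init__(self, iterable, desc="", *args, **kwargs):
            self.iterable = iterable
            self.desc = desc
            # Generators and other unsized iterables have no len()
            self.total = len(iterable) if hasattr(iterable, "__len__") else None
        def __iter__(self):
            total = self.total if self.total is not None else "?"
            for i, item in enumerate(self.iterable):
                if i % 10 == 0:
                    print(f"\r{self.desc}: {i}/{total}", end="")
                yield item
            print()  # finish the progress line
        def set_postfix(self, *args, **kwargs):
            pass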

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print(f"\n📊 Environment Status:")
print(f"   PyTorch: {'✅' if TORCH_AVAILABLE else '❌'}")
print(f"   Transformers: {'✅' if TRANSFORMERS_AVAILABLE else '❌'}")
print(f"   Datasets: {'✅' if DATASETS_AVAILABLE else '❌'}")
print(f"   Plotly: {'✅' if PLOTLY_AVAILABLE else '❌'}")
print("\n🚀 Environment setup completed!")

✅ PyTorch 2.7.1+cu126 imported successfully!
🔥 Device: GPU
✅ Transformers imported successfully!
✅ Datasets imported successfully!
✅ Plotly imported successfully!

📊 Environment Status:
   PyTorch: ✅
   Transformers: ✅
   Datasets: ✅
   Plotly: ✅

🚀 Environment setup completed!
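
The `try`/`except` blocks above import each optional package to learn whether it is installed. A lighter probe, sketched here as an alternative rather than what the notebook does, asks the import system whether a module can be located without importing it. `is_available` is a hypothetical helper name introduced for this example.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Probe the same optional dependencies the setup cell checks
for name in ("torch", "transformers", "datasets", "plotly"):
    status = "available" if is_available(name) else "missing"
    print(f"{name}: {status}")
```

Locating a module does not guarantee that it imports cleanly (a broken CUDA build, for instance), so the `try`/`except` approach remains the stricter check.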


In [2]:
# Create necessary directories
directories = ['models', 'results', 'visualizations', 'logs', 'checkpoints']

for directory in directories:
    os.makedirs(directory, exist_ok=True)
    print(f"📁 Created directory: {directory}")

print("\n✅ Project structure initialized successfully!")

📁 Created directory: models
📁 Created directory: results
📁 Created directory: visualizations
📁 Created directory: logs
📁 Created directory: checkpoints

✅ Project structure initialized successfully!


## 2. Configuration and Global Settings

In [3]:
# Global configuration
@dataclass
class GlobalConfig:
    # Dataset settings
    dataset_name: str = "dair-ai/emotion"
    num_labels: int = 6
    emotion_labels: Optional[List[str]] = None  # filled in by __post_init__

    # Training settings
    batch_size: int = 16
    max_epochs: int = 3
    learning_rate: float = 2e-5
    warmup_steps: int = 500
    max_length: int = 128

    # Hardware settings
    device: str = "cuda" if TORCH_AVAILABLE and torch.cuda.is_available() else "cpu"
    mixed_precision: bool = True

    # Evaluation settings
    eval_steps: int = 500
    save_steps: int = 1000

    # Visualization settings
    figure_width: int = 1000
    figure_height: int = 600
    color_palette: str = "viridis"

    def __post_init__(self):
        if self.emotion_labels is None:
            self.emotion_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Initialize global configuration
config = GlobalConfig()

print(f"🔧 Configuration initialized:")
print(f"   Device: {config.device}")
print(f"   Batch size: {config.batch_size}")
print(f"   Max epochs: {config.max_epochs}")
print(f"   Emotion labels: {config.emotion_labels}")

🔧 Configuration initialized:
   Device: cuda
   Batch size: 16
   Max epochs: 3
   Emotion labels: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
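
`GlobalConfig` sets `emotion_labels` to `None` and fills it in `__post_init__` because dataclasses reject a bare list as a default value. A minimal sketch of the other common way to express this, using `field(default_factory=...)`; `LabelConfig` and `default_emotion_labels` are hypothetical names for a trimmed-down stand-in, not part of the notebook.

```python
from dataclasses import dataclass, field
from typing import List


def default_emotion_labels() -> List[str]:
    # Label order follows the notebook's configuration
    return ["sadness", "joy", "love", "anger", "fear", "surprise"]


@dataclass
class LabelConfig:  # hypothetical stand-in for GlobalConfig
    num_labels: int = 6
    # default_factory builds a fresh list for every instance
    emotion_labels: List[str] = field(default_factory=default_emotion_labels)


a = LabelConfig()
b = LabelConfig()
a.emotion_labels.append("neutral")  # mutating one instance ...
print(len(a.emotion_labels), len(b.emotion_labels))  # ... leaves the other untouched: 7 6
```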


## 3. Utility Functions and Helpers

In [4]:
def setup_reproducibility(seed: int = 42):
    """Set up reproducible training environment"""
    np.random.seed(seed)
    random.seed(seed)
    if TORCH_AVAILABLE:
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    print(f"🎯 Reproducibility set with seed: {seed}")

def get_model_size(model):
    """Calculate model size in parameters"""
    if TORCH_AVAILABLE and hasattr(model, 'parameters'):
        return sum(p.numel() for p in model.parameters())
    else:
        return 0  # Fallback for non-PyTorch models

def format_time(seconds):
    """Format time in human readable format"""
    if seconds < 60:
        return f"{seconds:.1f}s"
    elif seconds < 3600:
        return f"{seconds/60:.1f}m"
    else:
        return f"{seconds/3600:.1f}h"

def save_results(results: Dict, filename: str):
    """Save results to JSON file"""
    filepath = os.path.join('results', filename)
    with open(filepath, 'w') as f:
        json.dump(results, f, indent=2, default=str)
    print(f"💾 Results saved to {filepath}")

def load_results(filename: str) -> Dict:
    """Load results from JSON file"""
    filepath = os.path.join('results', filename)
    if os.path.exists(filepath):
        with open(filepath, 'r') as f:
            return json.load(f)
    return {}

# Setup reproducibility
setup_reproducibility()

print("🛠️ Utility functions initialized successfully!")

🎯 Reproducibility set with seed: 42
🛠️ Utility functions initialized successfully!
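
`save_results` passes `default=str` to `json.dump`, so any value that JSON cannot encode is written as its string form. The sketch below shows the consequence for a round trip: the value comes back as a plain string. The dictionary holds placeholder values, not results from this notebook.

```python
import json
from datetime import date

# Placeholder values for illustration only
results = {"model": "example", "accuracy": 0.5, "run_date": date(2025, 8, 4)}

# Without default=str, json.dumps raises TypeError on the date object
encoded = json.dumps(results, default=str)
decoded = json.loads(encoded)

print(encoded)
print(type(decoded["run_date"]).__name__)  # str, not date
```

The same applies to NumPy arrays such as a confusion matrix: they are stored as printed text unless converted with `.tolist()` before saving.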


## 4. Data Management System

In [5]:
if TORCH_AVAILABLE:
    class EmotionDataset(Dataset):
        """Custom dataset for emotion classification"""

        def __init__(self, texts, labels, tokenizer=None, vocab=None, max_length=128):
            self.texts = texts
            self.labels = labels
            self.tokenizer = tokenizer
            self.vocab = vocab
            self.max_length = max_length

        def __len__(self):
            return len(self.texts)

        def __getitem__(self, idx):
            text = str(self.texts[idx])
            label = self.labels[idx]

            if self.tokenizer is not None:
                # For transformer models
                encoding = self.tokenizer(
                    text,
                    truncation=True,
                    padding='max_length',
                    max_length=self.max_length,
                    return_tensors='pt'
                )
                return {
                    'input_ids': encoding['input_ids'].flatten(),
                    'attention_mask': encoding['attention_mask'].flatten(),
                    'labels': torch.tensor(label, dtype=torch.long)
                }
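            else:
                # For traditional models (BiLSTM, CNN): whitespace tokens, truncated to max_length
                unk_id = self.vocab.get('<unk>', 1)
                tokens = text.lower().split()[:self.max_length]
                # Map empty texts to a single <unk> so the tensor is never zero-length
                indices = [self.vocab.get(token, unk_id) for token in tokens] or [unk_id]
                return torch.tensor(indices, dtype=torch.long), torch.tensor(label, dtype=torch.long)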
else:
    # Fallback dataset class for non-PyTorch environments
    class EmotionDataset:
        """Simple dataset for emotion classification without PyTorch"""

        def __init__(self, texts, labels, tokenizer=None, vocab=None, max_length=128):
            self.texts = texts
            self.labels = labels
            self.tokenizer = tokenizer
            self.vocab = vocab
            self.max_length = max_length

        def __len__(self):
            return len(self.texts)

        def __getitem__(self, idx):
            text = str(self.texts[idx])
            label = self.labels[idx]

            if self.vocab is not None:
                # For traditional models
                tokens = text.lower().split()
                indices = [self.vocab.get(token, self.vocab.get('<unk>', 1)) for token in tokens]
                return indices, label
            else:
                return text, label

def collate_fn_traditional(batch):
    """Collate function for traditional models"""
    texts, labels = zip(*batch)
    texts = pad_sequence(texts, batch_first=True, padding_value=0)
    labels = torch.stack(labels)
    return texts, labels

def collate_fn_transformer(batch):
    """Collate function for transformer models"""
    input_ids = torch.stack([item['input_ids'] for item in batch])
    attention_mask = torch.stack([item['attention_mask'] for item in batch])
    labels = torch.stack([item['labels'] for item in batch])

    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels
    }

print("✅ Dataset classes defined successfully!")

✅ Dataset classes defined successfully!
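
`collate_fn_traditional` relies on `pad_sequence` to turn variable-length index sequences into one rectangular batch. The following pure-Python sketch mirrors that right-padding behaviour on plain lists so the effect is easy to inspect; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
from typing import List, Sequence

PAD_ID = 0  # matches padding_value=0 in collate_fn_traditional


def pad_batch(sequences: Sequence[Sequence[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = [[5, 9, 2], [7], [3, 4]]  # made-up token ids
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Because padding is per batch, the sequence dimension varies from batch to batch, whereas the transformer path pads every sample to `max_length`.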


In [6]:
class DataManager:
    """Unified data management system for all models"""

    def __init__(self, dataset_name="dair-ai/emotion"):
        self.dataset_name = dataset_name
        self.dataset = None
        self.tokenizers = {}
        self.vocab = None
        self.emotion_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]

    def load_dataset(self, use_sample=False):
        """Load the emotion dataset with fallback to sample data"""
        if DATASETS_AVAILABLE and not use_sample:
            try:
                print("📊 Loading emotion dataset from HuggingFace...")
                self.dataset = load_dataset(self.dataset_name)
                print(f"✅ Dataset loaded successfully!")
                print(f"   Train samples: {len(self.dataset['train'])}")
                print(f"   Validation samples: {len(self.dataset['validation'])}")
                print(f"   Test samples: {len(self.dataset['test'])}")
                return True
            except Exception as e:
                print(f"⚠️ Failed to load HuggingFace dataset: {e}")
                print("📝 Falling back to sample data...")

        # Use sample data
        print("📊 Using sample dataset...")
        sample_data = self._create_sample_dataset()

        # Split into train/val/test
        total_size = len(sample_data['text'])
        train_size = int(0.7 * total_size)
        val_size = int(0.15 * total_size)

        self.dataset = {
            'train': {
                'text': sample_data['text'][:train_size],
                'label': sample_data['label'][:train_size]
            },
            'validation': {
                'text': sample_data['text'][train_size:train_size+val_size],
                'label': sample_data['label'][train_size:train_size+val_size]
            },
            'test': {
                'text': sample_data['text'][train_size+val_size:],
                'label': sample_data['label'][train_size+val_size:]
            }
        }

        print(f"✅ Sample dataset created!")
        print(f"   Train samples: {len(self.dataset['train']['text'])}")
        print(f"   Validation samples: {len(self.dataset['validation']['text'])}")
        print(f"   Test samples: {len(self.dataset['test']['text'])}")
        return True

    def _create_sample_dataset(self, size_per_emotion=100):
        """Create sample dataset when HuggingFace datasets is not available"""
        sample_texts = {
            0: [  # sadness
                "I feel so sad today", "This makes me really depressed", "I'm feeling down and blue",
                "Everything seems hopeless", "I can't stop crying", "My heart is broken",
                "I feel empty inside", "Nothing brings me joy anymore", "I'm overwhelmed with sorrow",
                "This is devastating news", "I'm so disappointed", "Life feels meaningless",
                "I'm drowning in sadness", "This hurts so much", "I feel lost and alone"
            ],
            1: [  # joy
                "I am so happy today!", "This brings me great joy", "I'm feeling fantastic",
                "What a wonderful day", "I'm on cloud nine", "This makes me smile",
                "I'm bursting with happiness", "Life is beautiful", "I feel amazing",
                "This is the best day ever", "I'm so excited", "Everything is perfect",
                "I'm filled with joy", "This is incredible", "I'm so grateful"
            ],
            2: [  # love
                "I love spending time with family", "My heart is full of love", "I adore this person",
                "Love is in the air", "I cherish these moments", "You mean everything to me",
                "I'm deeply in love", "This fills my heart with warmth", "I care about you so much",
                "Love conquers all", "I'm so in love", "My heart belongs to you",
                "I love you more than words", "This love is eternal", "You are my everything"
            ],
            3: [  # anger
                "That makes me so angry!", "This is infuriating!", "I'm furious about this",
                "This makes my blood boil", "I'm fed up with this", "This is absolutely outrageous",
                "I can't stand this anymore", "This is driving me crazy", "I'm seeing red",
                "This is completely unacceptable", "I'm so mad", "This is ridiculous",
                "I'm boiling with rage", "This is so frustrating", "I'm livid"
            ],
            4: [  # fear
                "I'm scared of what might happen", "This terrifies me", "I'm afraid of the dark",
                "This gives me anxiety", "I'm worried about the future", "This is my worst nightmare",
                "I'm trembling with fear", "This makes me nervous", "I'm panicking",
                "This is frightening", "I'm so anxious", "This scares me",
                "I'm filled with dread", "This is terrifying", "I'm so worried"
            ],
            5: [  # surprise
                "What a surprise that was!", "I can't believe this happened", "This is unexpected",
                "Wow, I didn't see that coming", "This is amazing news", "I'm shocked by this",
                "This caught me off guard", "What an incredible turn of events", "This is beyond my expectations",
                "I'm stunned", "This is so surprising", "I never expected this",
                "What a twist", "This is unbelievable", "I'm amazed"
            ]
        }

        texts = []
        labels = []

        for emotion_id, emotion_texts in sample_texts.items():
            for _ in range(size_per_emotion):
                # Randomly sample a base text (with replacement); texts are used unmodified
                base_text = random.choice(emotion_texts)
                texts.append(base_text)
                labels.append(emotion_id)

        # Shuffle the data
        combined = list(zip(texts, labels))
        random.shuffle(combined)
        texts, labels = zip(*combined)

        return {'text': list(texts), 'label': list(labels)}

    def build_vocab(self, texts, min_freq=2):
        """Build vocabulary for traditional models"""
        print("🔤 Building vocabulary...")
        word_counts = Counter()
        for text in texts:
            tokens = str(text).lower().split()
            word_counts.update(tokens)

        # Create vocabulary
        vocab = {'<pad>': 0, '<unk>': 1}
        for word, count in word_counts.items():
            if count >= min_freq:
                vocab[word] = len(vocab)

        self.vocab = vocab
        print(f"✅ Vocabulary built with {len(vocab)} words")
        return vocab

    def register_tokenizer(self, model_name, tokenizer):
        """Register a tokenizer for a specific model"""
        self.tokenizers[model_name] = tokenizer
        print(f"🔧 Tokenizer registered for {model_name}")

    def get_data_loaders(self, model_type, batch_size=16, max_length=128):
        """Get data loaders for different model types"""
        if self.dataset is None:
            raise ValueError("Dataset not loaded. Call load_dataset() first.")

        # Fix: make sure the data types are correct
        train_texts = self.dataset['train']['text']
        train_labels = self.dataset['train']['label']
        val_texts = self.dataset['validation']['text']
        val_labels = self.dataset['validation']['label']
        test_texts = self.dataset['test']['text']
        test_labels = self.dataset['test']['label']

        # Fix: convert to native Python types
        if hasattr(train_texts, 'tolist'):  # e.g. a pandas Series or NumPy array
            train_texts = train_texts.tolist()
        if hasattr(train_labels, 'tolist'):
            train_labels = train_labels.tolist()
        if hasattr(val_texts, 'tolist'):
            val_texts = val_texts.tolist()
        if hasattr(val_labels, 'tolist'):
            val_labels = val_labels.tolist()
        if hasattr(test_texts, 'tolist'):
            test_texts = test_texts.tolist()
        if hasattr(test_labels, 'tolist'):
            test_labels = test_labels.tolist()

        # Fix: make sure every text is a string
        train_texts = [str(text) for text in train_texts]
        val_texts = [str(text) for text in val_texts]
        test_texts = [str(text) for text in test_texts]

        # Fix: make sure every label is an integer
        train_labels = [int(label) for label in train_labels]
        val_labels = [int(label) for label in val_labels]
        test_labels = [int(label) for label in test_labels]

        if model_type == 'traditional':
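            # Build vocabulary if it does not exist yet
            # NOTE: all three splits are used here, so validation/test tokens enter the
            # vocabulary; building from train_texts alone would avoid that leakage.
            if self.vocab is None:
                all_texts = train_texts + val_texts + test_texts
                self.build_vocab(all_texts)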

            # Create datasets
            train_dataset = EmotionDataset(train_texts, train_labels, vocab=self.vocab, max_length=max_length)
            val_dataset = EmotionDataset(val_texts, val_labels, vocab=self.vocab, max_length=max_length)
            test_dataset = EmotionDataset(test_texts, test_labels, vocab=self.vocab, max_length=max_length)

            # Create data loaders
            train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn_traditional)
            val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn_traditional)
            test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn_traditional)

        else:
            # For transformer models
            tokenizer = self.tokenizers.get(model_type)
            if tokenizer is None:
                raise ValueError(f"Tokenizer for {model_type} not registered")

            # Create datasets
            train_dataset = EmotionDataset(train_texts, train_labels, tokenizer=tokenizer, max_length=max_length)
            val_dataset = EmotionDataset(val_texts, val_labels, tokenizer=tokenizer, max_length=max_length)
            test_dataset = EmotionDataset(test_texts, test_labels, tokenizer=tokenizer, max_length=max_length)

            # Create data loaders
            train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn_transformer)
            val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn_transformer)
            test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn_transformer)

        print(f"✅ Data loaders created for {model_type} models")
        return train_loader, val_loader, test_loader

    def get_class_weights(self):
        """Calculate class weights for imbalanced dataset"""
        if self.dataset is None:
            raise ValueError("Dataset not loaded")

        train_labels = self.dataset['train']['label']
        class_counts = Counter(train_labels)
        total_samples = len(train_labels)

        # Calculate weights (inverse frequency)
        weights = []
        for i in range(len(self.emotion_labels)):
            weight = total_samples / (len(self.emotion_labels) * class_counts[i])
            weights.append(weight)

        return torch.tensor(weights, dtype=torch.float32)

    def get_dataset_info(self):
        """Get dataset statistics"""
        if self.dataset is None:
            return None

        train_labels = self.dataset['train']['label']
        class_counts = Counter(train_labels)

        info = {
            'total_samples': len(train_labels),
            'num_classes': len(self.emotion_labels),
            'class_distribution': dict(class_counts),
            'emotion_labels': self.emotion_labels
        }

        return info

print("✅ DataManager class defined successfully!")

✅ DataManager class defined successfully!
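
The vocabulary logic in `DataManager.build_vocab` can be tried in isolation. This sketch repeats the same steps (lowercase, whitespace split, keep words seen at least `min_freq` times, reserve ids 0 and 1 for `<pad>` and `<unk>`) as standalone functions; `encode` is a hypothetical helper and the three sentences are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_vocab(texts: Iterable[str], min_freq: int = 2) -> Dict[str, int]:
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for word, count in counts.items():
        if count >= min_freq:
            vocab[word] = len(vocab)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Replace tokens by ids; anything outside the vocabulary becomes <unk>."""
    unk_id = vocab["<unk>"]
    return [vocab.get(token, unk_id) for token in text.lower().split()]


texts = ["i feel sad", "i feel happy", "so happy today"]
vocab = build_vocab(texts)
print(vocab)                              # only 'i', 'feel', 'happy' reach min_freq=2
print(encode("I feel surprised", vocab))  # [2, 3, 1]
```

With `min_freq=2`, every word that occurs once is mapped to `<unk>`, which keeps the embedding table small at the cost of losing rare words.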


In [7]:
# Initialize data manager and load dataset
data_manager = DataManager()

# Load dataset
if data_manager.load_dataset():
    # Display dataset information
    info = data_manager.get_dataset_info()
    print("\n📊 Dataset Information:")
    print(f"   Total training samples: {info['total_samples']}")
    print(f"   Number of classes: {info['num_classes']}")
    print(f"   Emotion labels: {info['emotion_labels']}")

    print("\n📈 Class Distribution:")
    for label, count in info['class_distribution'].items():
        emotion = info['emotion_labels'][label]
        percentage = (count / info['total_samples']) * 100
        print(f"   {emotion}: {count} samples ({percentage:.1f}%)")

    # Calculate class weights
    class_weights = data_manager.get_class_weights()
    print(f"\n⚖️ Class weights calculated: {[f'{w:.3f}' for w in class_weights.tolist()]}")

else:
    print("❌ Failed to initialize data manager")

📊 Loading emotion dataset from HuggingFace...


✅ Dataset loaded successfully!
   Train samples: 16000
   Validation samples: 2000
   Test samples: 2000

📊 Dataset Information:
   Total training samples: 16000
   Number of classes: 6
   Emotion labels: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

📈 Class Distribution:
   sadness: 4666 samples (29.2%)
   anger: 2159 samples (13.5%)
   love: 1304 samples (8.2%)
   surprise: 572 samples (3.6%)
   fear: 1937 samples (12.1%)
   joy: 5362 samples (33.5%)

⚖️ Class weights calculated: ['0.572', '0.497', '2.045', '1.235', '1.377', '4.662']
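
The weights printed above follow the inverse-frequency formula in `get_class_weights`: `weight_i = total_samples / (num_classes * count_i)`. A short worked check, using the class counts from the output above and the label order `sadness, joy, love, anger, fear, surprise`:

```python
# Training-set counts taken from the class distribution above, keyed by label id
counts = {0: 4666, 1: 5362, 2: 1304, 3: 2159, 4: 1937, 5: 572}

total_samples = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weight: rare classes get weights above 1, common ones below 1
weights = [total_samples / (num_classes * counts[i]) for i in range(num_classes)]

print(total_samples)                    # 16000
print([f"{w:.3f}" for w in weights])    # ['0.572', '0.497', '2.045', '1.235', '1.377', '4.662']
```

With this normalisation each class contributes `total_samples / num_classes` to the weighted sample count, so the rarest class (`surprise`) is weighted roughly nine times more heavily than the most common one (`joy`).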


### 4.1 Tokenizer Registration

In [8]:
# Register tokenizers for different models
print("🔧 Registering tokenizers for transformer models...")
tokenizer_configs = {
    'roberta-base': 'roberta-base',
    'roberta-large': 'roberta-large',
    'deberta-v3-base': 'microsoft/deberta-v3-base',
    'distilbert-base': 'distilbert-base-uncased',
    'electra-base': 'google/electra-base-discriminator',
    'xlnet-base': 'xlnet-base-cased',
    'albert-base': 'albert-base-v2'
}
successful_tokenizers = 0
for model_name, model_path in tokenizer_configs.items():
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        data_manager.register_tokenizer(model_name, tokenizer)
        successful_tokenizers += 1
    except Exception as e:
        print(f"⚠️ Failed to load tokenizer for {model_name}: {e}")
print(f"\n✅ Successfully registered {successful_tokenizers}/{len(tokenizer_configs)} tokenizers")

🔧 Registering tokenizers for transformer models...
🔧 Tokenizer registered for roberta-base
🔧 Tokenizer registered for roberta-large
🔧 Tokenizer registered for deberta-v3-base
🔧 Tokenizer registered for distilbert-base
🔧 Tokenizer registered for electra-base
🔧 Tokenizer registered for xlnet-base
🔧 Tokenizer registered for albert-base

✅ Successfully registered 7/7 tokenizers


### 4.2 Test Data Loading

In [9]:
# Test data loading for traditional models
print("🧪 Testing data loading for traditional models...")
try:
    train_loader, val_loader, test_loader = data_manager.get_data_loaders('traditional', batch_size=8)

    # Verify loaders are created
    if not all([train_loader, val_loader, test_loader]):
        raise ValueError("One or more data loaders is None")

    # Test a batch
    sample_batch = next(iter(train_loader))

    # Handle different batch formats
    if isinstance(sample_batch, (list, tuple)) and len(sample_batch) == 2:
        texts, labels = sample_batch
        print(f"✅ Traditional model data loading successful")

        if hasattr(texts, 'shape'):
            batch_shape_info = texts.shape
        else:
            text_length = len(texts) if hasattr(texts, "__len__") else "Unknown"
            batch_shape_info = f'Type: {type(texts)}, Length: {text_length}'

        print(f"   Batch text shape: {batch_shape_info}")

        if hasattr(labels, 'shape'):
            label_shape_info = labels.shape
        else:
            label_length = len(labels) if hasattr(labels, "__len__") else "Unknown"
            label_shape_info = f'Type: {type(labels)}, Length: {label_length}'

        print(f"   Batch labels shape: {label_shape_info}")

        print(f"   Train/Val/Test lengths: {len(train_loader)}/{len(val_loader)}/{len(test_loader)}")

        if hasattr(data_manager, 'vocab') and data_manager.vocab:
            print(f"   Vocabulary size: {len(data_manager.vocab)}")
    else:
        print(f"✅ Traditional model data loading successful")
        print(f"   Batch type: {type(sample_batch)}")
        print(f"   Batch info: {str(sample_batch)[:100]}...")

except Exception as e:
    print(f"❌ Traditional model data loading failed: {e}")
    print(f"   Error type: {type(e).__name__}")
    print(f"   Debug: Has vocab = {hasattr(data_manager, 'vocab')}")

# Test data loading for transformer models (if tokenizers are available)
if hasattr(data_manager, 'tokenizers') and data_manager.tokenizers:
    print("\n🧪 Testing data loading for transformer models...")
    try:
        # Test with first available tokenizer
        available_models = list(data_manager.tokenizers.keys())
        first_model = available_models[0]
        print(f"   Testing with model: {first_model}")

        train_loader, val_loader, test_loader = data_manager.get_data_loaders(first_model, batch_size=4)

        # Verify loaders
        if not all([train_loader, val_loader, test_loader]):
            raise ValueError("One or more data loaders is None")

        # Test a batch
        sample_batch = next(iter(train_loader))
        print(f"✅ Transformer model data loading successful")

        if isinstance(sample_batch, dict):
            if 'input_ids' in sample_batch:
                print(f"   Input IDs shape: {sample_batch['input_ids'].shape}")
            if 'attention_mask' in sample_batch:
                print(f"   Attention mask shape: {sample_batch['attention_mask'].shape}")
            if 'labels' in sample_batch:
                print(f"   Labels shape: {sample_batch['labels'].shape}")
            print(f"   Batch keys: {list(sample_batch.keys())}")
        else:
            print(f"   Batch type: {type(sample_batch)}")

        print(f"   Train/Val/Test lengths: {len(train_loader)}/{len(val_loader)}/{len(test_loader)}")

    except Exception as e:
        print(f"❌ Transformer model data loading failed: {e}")
        print(f"   Error type: {type(e).__name__}")
        print(f"   Available models: {list(data_manager.tokenizers.keys())}")
else:
    print("\n⚠️ No tokenizers available for transformer model testing")
    print("   Check if tokenizers are properly initialized")

🧪 Testing data loading for traditional models...
🔤 Building vocabulary...
✅ Vocabulary built with 8430 words
✅ Data loaders created for traditional models
✅ Traditional model data loading successful
   Batch text shape: torch.Size([8, 53])
   Batch labels shape: torch.Size([8])
   Train/Val/Test lengths: 2000/250/250
   Vocabulary size: 8430

🧪 Testing data loading for transformer models...
   Testing with model: roberta-base
✅ Data loaders created for roberta-base models
✅ Transformer model data loading successful
   Input IDs shape: torch.Size([4, 128])
   Attention mask shape: torch.Size([4, 128])
   Labels shape: torch.Size([4])
   Batch keys: ['input_ids', 'attention_mask', 'labels']
   Train/Val/Test lengths: 4000/500/500
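
The loader lengths reported above are batch counts, not sample counts. For a `DataLoader` that keeps the last partial batch, the length is the sample count divided by the batch size, rounded up. A small sketch reproducing the numbers; `num_batches` is a hypothetical helper.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches per epoch for a given split size."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


splits = {"train": 16000, "validation": 2000, "test": 2000}

for batch_size in (8, 4):
    lengths = [num_batches(n, batch_size) for n in splits.values()]
    print(batch_size, lengths)
```

Batch size 8 gives 2000/250/250 and batch size 4 gives 4000/500/500, matching the two test runs.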


## 5. Model Architecture Framework

In [10]:
@dataclass
class ModelConfig:
    """Configuration class for models"""
    name: str
    model_type: str  # "traditional", "transformer"
    num_labels: int = 6
    vocab_size: Optional[int] = None
    embedding_dim: int = 100
    hidden_dim: int = 256
    num_layers: int = 2
    dropout: float = 0.5
    learning_rate: float = 2e-5
    batch_size: int = 16
    max_epochs: int = 3
    pretrained_model: Optional[str] = None

@dataclass
class TrainingResult:
    """Training result data structure"""
    model_name: str
    accuracy: float
    f1_macro: float
    f1_weighted: float
    precision_macro: float
    recall_macro: float
    confusion_matrix: np.ndarray
    training_time: float
    inference_time: float
    model_size: int
    training_history: Dict
    classification_report: str

print("✅ Data structures defined successfully!")

✅ Data structures defined successfully!
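
`TrainingResult` stores both `f1_macro` and `f1_weighted`. On an imbalanced dataset such as this one the two can differ noticeably. The sketch below assumes the usual definitions (macro is the unweighted mean of per-class F1, weighted is the mean weighted by class support) and computes both in plain Python on made-up toy labels; the notebook itself imports scikit-learn's `f1_score` for this purpose.

```python
from collections import Counter
from typing import List, Tuple


def f1_macro_weighted(y_true: List[int], y_pred: List[int], num_classes: int) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1) from per-class F1 scores."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    macro = sum(per_class) / num_classes
    weighted = sum(per_class[c] * support[c] for c in range(num_classes)) / len(y_true)
    return macro, weighted


# Toy example: class 0 is common, class 1 is rare and half of it is missed
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
macro, weighted = f1_macro_weighted(y_true, y_pred, num_classes=2)
print(f"macro={macro:.4f} weighted={weighted:.4f}")  # macro=0.7778 weighted=0.8148
```

Macro F1 is lower here because the weak rare class counts as much as the strong common one, which is why it is the more sensitive metric for minority emotions.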


In [11]:
class BaseEmotionModel(ABC, nn.Module):
    """Abstract base class for all emotion classification models"""

    def __init__(self, config: ModelConfig):
        super().__init__()
        self.config = config
        self.model_name = config.name
        self.num_labels = config.num_labels

    @abstractmethod
    def forward(self, x):
        """Forward pass of the model"""
        pass

    def get_config(self) -> Dict:
        """Get model configuration"""
        return {
            'name': self.model_name,
            'type': self.config.model_type,
            'num_parameters': sum(p.numel() for p in self.parameters()),
            'config': self.config.__dict__
        }

    def save_model(self, path: str):
        """Save model state and configuration"""
        directory = os.path.dirname(path)
        if directory:  # os.makedirs('') raises when path has no directory component
            os.makedirs(directory, exist_ok=True)
        torch.save({
            'model_state_dict': self.state_dict(),
            'config': self.config,
            'model_info': self.get_config()
        }, path)
        print(f"💾 Model saved to {path}")

    def load_model(self, path: str):
        """Load model state"""
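        # The checkpoint pickles a ModelConfig object, which the weights_only=True
        # default of torch.load (PyTorch >= 2.6) refuses to unpickle.
        # Only load checkpoints from a trusted source with weights_only=False.
        checkpoint = torch.load(path, map_location='cpu', weights_only=False)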
        self.load_state_dict(checkpoint['model_state_dict'])
        print(f"📂 Model loaded from {path}")
        return checkpoint.get('model_info', {})

print("✅ BaseEmotionModel abstract class defined successfully!")

✅ BaseEmotionModel abstract class defined successfully!


In [12]:
# Traditional Models Implementation

class BiLSTMAttention(BaseEmotionModel):
    """BiLSTM with Attention mechanism for emotion classification"""

    def __init__(self, config: ModelConfig):
        super().__init__(config)

        self.embedding = nn.Embedding(config.vocab_size, config.embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(
            config.embedding_dim,
            config.hidden_dim,
            num_layers=config.num_layers,
            bidirectional=True,
            dropout=config.dropout if config.num_layers > 1 else 0,
            batch_first=True
        )

        # Attention mechanism
        self.attention = nn.Linear(config.hidden_dim * 2, 1)

        # Classification head
        self.dropout = nn.Dropout(config.dropout)
        self.classifier = nn.Linear(config.hidden_dim * 2, config.num_labels)

    def forward(self, x):
        # x shape: (batch_size, seq_len)
        embedded = self.dropout(self.embedding(x))  # (batch_size, seq_len, embedding_dim)

        # LSTM forward pass
        lstm_out, (hidden, cell) = self.lstm(embedded)  # (batch_size, seq_len, hidden_dim*2)

        # Attention mechanism
        attention_weights = torch.softmax(self.attention(lstm_out).squeeze(2), dim=1)  # (batch_size, seq_len)
        context_vector = torch.bmm(attention_weights.unsqueeze(1), lstm_out).squeeze(1)  # (batch_size, hidden_dim*2)

        # Classification
        output = self.classifier(self.dropout(context_vector))
        return output

class CNNEmotionClassifier(BaseEmotionModel):
    """CNN model for emotion classification"""

    def __init__(self, config: ModelConfig):
        super().__init__(config)

        self.embedding = nn.Embedding(config.vocab_size, config.embedding_dim, padding_idx=0)

        # Multiple filter sizes for n-gram capture
        self.filter_sizes = [2, 3, 4, 5]
        self.num_filters = config.hidden_dim // len(self.filter_sizes)

        # Convolutional layers
        self.convs = nn.ModuleList([
            nn.Conv2d(1, self.num_filters, (fs, config.embedding_dim))
            for fs in self.filter_sizes
        ])

        # Classification head
        self.dropout = nn.Dropout(config.dropout)
        self.classifier = nn.Linear(len(self.filter_sizes) * self.num_filters, config.num_labels)

    def forward(self, x):
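        # x shape: (batch_size, seq_len); seq_len must be >= max(self.filter_sizes),
        # otherwise the widest convolution has no valid position and raises an error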
        embedded = self.embedding(x).unsqueeze(1)  # (batch_size, 1, seq_len, embedding_dim)

        # Apply convolutions
        conv_outputs = []
        for conv in self.convs:
            conv_out = torch.relu(conv(embedded)).squeeze(3)  # (batch_size, num_filters, new_seq_len)
            pooled = torch.max_pool1d(conv_out, conv_out.size(2)).squeeze(2)  # (batch_size, num_filters)
            conv_outputs.append(pooled)

        # Concatenate all conv outputs
        concatenated = torch.cat(conv_outputs, dim=1)  # (batch_size, total_filters)

        # Classification
        output = self.classifier(self.dropout(concatenated))
        return output

print("✅ Traditional model architectures defined successfully!")

✅ Traditional model architectures defined successfully!
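
The attention layer in `BiLSTMAttention` takes a softmax over every time step, so padded positions (token id 0, per `padding_idx=0`) also receive some weight. A common refinement, which the cell above does not use, is to mask padding before the softmax. The sketch below shows the idea in plain Python on a single sequence; the function name `masked_attention_pool` is an illustrative assumption, and it assumes at least one non-padding token.

```python
import math
from typing import List


def masked_attention_pool(scores: List[float], states: List[List[float]],
                          token_ids: List[int], pad_id: int = 0) -> List[float]:
    """Softmax over non-padding positions only, then a weighted sum of the states."""
    # Padding positions are set to -inf so exp() gives them exactly zero weight
    masked = [s if t != pad_id else float("-inf") for s, t in zip(scores, token_ids)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]


# Two real tokens with equal scores and one padded position with a large score.
token_ids = [7, 9, 0]
scores = [0.0, 0.0, 5.0]
states = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
context = masked_attention_pool(scores, states, token_ids)
print(context)  # the padded state contributes nothing
```

In PyTorch the same effect is usually obtained by filling the padded score positions with `float('-inf')` (for example via `Tensor.masked_fill`) before calling `torch.softmax`.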


In [13]:
# Transformer Model Wrapper

class TransformerEmotionModel(BaseEmotionModel):
    """Wrapper for transformer-based emotion classification models"""

    def __init__(self, config: ModelConfig):
        super().__init__(config)

        if config.pretrained_model is None:
            raise ValueError("Pretrained model path must be specified for transformer models")

        try:
            self.transformer = AutoModelForSequenceClassification.from_pretrained(
                config.pretrained_model,
                num_labels=config.num_labels,
                ignore_mismatched_sizes=True
            )
            print(f"✅ Loaded transformer model: {config.pretrained_model}")
        except Exception as e:
            print(f"❌ Failed to load transformer model {config.pretrained_model}: {e}")
            raise

    def forward(self, input_ids, attention_mask=None, labels=None):
        """Forward pass for transformer models"""
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        return outputs

    def get_config(self) -> Dict:
        """Get model configuration including transformer config"""
        base_config = super().get_config()
        base_config['transformer_config'] = self.transformer.config.to_dict()
        return base_config

print("✅ Transformer model wrapper defined successfully!")

✅ Transformer model wrapper defined successfully!


In [14]:
class ModelFactory:
    """Factory class for creating different emotion classification models"""

    @staticmethod
    def get_available_models() -> List[str]:
        """Get list of available model types"""
        return [
            'bilstm-attention',
            'cnn',
            'roberta-base',
            'roberta-large',
            'deberta-v3-base',
            'distilbert-base',
            'electra-base',
            'xlnet-base',
            'albert-base'
        ]

    @staticmethod
    def create_model(model_name: str, vocab_size: Optional[int] = None) -> BaseEmotionModel:
        """Create a model instance based on model name"""

        if model_name == 'bilstm-attention':
            if vocab_size is None:
                raise ValueError("vocab_size must be provided for BiLSTM model")
            config = ModelConfig(
                name='BiLSTM-Attention',
                model_type='traditional',
                vocab_size=vocab_size,
                embedding_dim=100,
                hidden_dim=256,
                num_layers=2,
                dropout=0.5
            )
            return BiLSTMAttention(config)

        elif model_name == 'cnn':
            if vocab_size is None:
                raise ValueError("vocab_size must be provided for CNN model")
            config = ModelConfig(
                name='CNN-Emotion',
                model_type='traditional',
                vocab_size=vocab_size,
                embedding_dim=100,
                hidden_dim=256,
                dropout=0.5
            )
            return CNNEmotionClassifier(config)

        elif model_name in ['roberta-base', 'roberta-large', 'deberta-v3-base',
                           'distilbert-base', 'electra-base', 'xlnet-base', 'albert-base']:

            # Map model names to HuggingFace model paths
            model_paths = {
                'roberta-base': 'roberta-base',
                'roberta-large': 'roberta-large',
                'deberta-v3-base': 'microsoft/deberta-v3-base',
                'distilbert-base': 'distilbert-base-uncased',
                'electra-base': 'google/electra-base-discriminator',
                'xlnet-base': 'xlnet-base-cased',
                'albert-base': 'albert-base-v2'
            }

            config = ModelConfig(
                name=model_name.upper(),
                model_type='transformer',
                pretrained_model=model_paths[model_name],
                learning_rate=2e-5,
                batch_size=16
            )
            return TransformerEmotionModel(config)

        else:
            raise ValueError(f"Unknown model name: {model_name}. Available models: {ModelFactory.get_available_models()}")

    @staticmethod
    def get_model_info(model_name: str) -> Dict:
        """Get information about a specific model"""
        model_info = {
            'bilstm-attention': {
                'description': 'Bidirectional LSTM with attention mechanism',
                'type': 'traditional',
                'parameters': '~2M (depends on vocab size)',
                'strengths': 'Good for sequential patterns, attention mechanism'
            },
            'cnn': {
                'description': 'Convolutional Neural Network with multiple filter sizes',
                'type': 'traditional',
                'parameters': '~1M (depends on vocab size)',
                'strengths': 'Fast training, good for local patterns'
            },
            'roberta-base': {
                'description': 'RoBERTa base model (125M parameters)',
                'type': 'transformer',
                'parameters': '125M',
                'strengths': 'Strong contextual understanding, robust training'
            },
            'roberta-large': {
                'description': 'RoBERTa large model (355M parameters)',
                'type': 'transformer',
                'parameters': '355M',
                'strengths': 'Best performance, large capacity'
            },
            'deberta-v3-base': {
                'description': 'DeBERTa v3 base with enhanced attention',
                'type': 'transformer',
                'parameters': '184M',
                'strengths': 'Enhanced attention mechanism, strong performance'
            },
            'distilbert-base': {
                'description': 'Distilled BERT for efficiency',
                'type': 'transformer',
                'parameters': '66M',
                'strengths': 'Fast inference, good performance/size ratio'
            },
            'electra-base': {
                'description': 'ELECTRA discriminator model',
                'type': 'transformer',
                'parameters': '110M',
                'strengths': 'Efficient pre-training, good downstream performance'
            },
            'xlnet-base': {
                'description': 'XLNet with permutation language modeling',
                'type': 'transformer',
                'parameters': '117M',
                'strengths': 'Bidirectional context, autoregressive benefits'
            },
            'albert-base': {
                'description': 'ALBERT with parameter sharing',
                'type': 'transformer',
                'parameters': '12M',
                'strengths': 'Parameter efficient, good performance'
            }
        }

        return model_info.get(model_name, {'description': 'Unknown model'})

print("✅ ModelFactory class defined successfully!")
print(f"📋 Available models: {ModelFactory.get_available_models()}")

✅ ModelFactory class defined successfully!
📋 Available models: ['bilstm-attention', 'cnn', 'roberta-base', 'roberta-large', 'deberta-v3-base', 'distilbert-base', 'electra-base', 'xlnet-base', 'albert-base']
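
`ModelFactory` spells out the transformer names three times: in `get_available_models`, in the `elif` membership test, and in `model_paths`. One way to keep them in sync is to derive all three from a single mapping. The sketch below is only an illustration of that design; the names `TRANSFORMER_PATHS`, `TRADITIONAL_MODELS`, `available_models`, and `resolve_checkpoint` are hypothetical, while the keys and checkpoint ids are copied from the factory above.

```python
from typing import Dict, List

# Single source of truth: model key -> Hugging Face checkpoint id
TRANSFORMER_PATHS: Dict[str, str] = {
    'roberta-base': 'roberta-base',
    'roberta-large': 'roberta-large',
    'deberta-v3-base': 'microsoft/deberta-v3-base',
    'distilbert-base': 'distilbert-base-uncased',
    'electra-base': 'google/electra-base-discriminator',
    'xlnet-base': 'xlnet-base-cased',
    'albert-base': 'albert-base-v2',
}
TRADITIONAL_MODELS: List[str] = ['bilstm-attention', 'cnn']


def available_models() -> List[str]:
    """Traditional models first, then transformers in insertion order."""
    return TRADITIONAL_MODELS + list(TRANSFORMER_PATHS)


def resolve_checkpoint(model_name: str) -> str:
    """Return the checkpoint id for a transformer key, or raise ValueError."""
    try:
        return TRANSFORMER_PATHS[model_name]
    except KeyError:
        raise ValueError(
            f"Unknown model name: {model_name}. Available models: {available_models()}"
        ) from None


print(available_models())
print(resolve_checkpoint('deberta-v3-base'))
```

With this layout, adding a model means adding one dictionary entry rather than editing three places.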


### 5.1 Test Model Creation

In [15]:
# Test model creation with vocabulary building
print("🧪 Testing model creation...")

# Ensure vocabulary is built for traditional models
if data_manager.dataset is not None and data_manager.vocab is None:
    print("🔤 Building vocabulary for model creation tests...")
    # Get all text data for vocabulary building
    all_texts = []
    all_texts.extend(list(data_manager.dataset['train']['text']))
    all_texts.extend(list(data_manager.dataset['validation']['text']))
    all_texts.extend(list(data_manager.dataset['test']['text']))

    data_manager.build_vocab(all_texts)
    print(f"✅ Vocabulary built with {len(data_manager.vocab)} words")

# Test traditional models (need vocab size)
if data_manager.vocab is not None:
    vocab_size = len(data_manager.vocab)
    print(f"\n📝 Using vocabulary size: {vocab_size}")

    try:
        # Test BiLSTM creation
        bilstm_model = ModelFactory.create_model('bilstm-attention', vocab_size=vocab_size)
        print(f"✅ BiLSTM model created: {get_model_size(bilstm_model):,} parameters")

        # Test CNN creation
        cnn_model = ModelFactory.create_model('cnn', vocab_size=vocab_size)
        print(f"✅ CNN model created: {get_model_size(cnn_model):,} parameters")

        # Clean up traditional models to save memory
        del bilstm_model, cnn_model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    except Exception as e:
        print(f"❌ Traditional model creation failed: {e}")
else:
    print("⚠️ Vocabulary not available, skipping traditional model tests")

# Test transformer models (if tokenizers are available)
if data_manager.tokenizers:
    print("\n🤖 Testing transformer model creation...")

    # Test a few transformer models
    test_models = ['roberta-base', 'distilbert-base']

    for model_name in test_models:
        try:
            model = ModelFactory.create_model(model_name)
            print(f"✅ {model_name} created: {get_model_size(model):,} parameters")

            # Clean up to save memory
            del model
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

        except Exception as e:
            print(f"⚠️ {model_name} creation failed: {e}")
else:
    print("\n⚠️ No tokenizers available, skipping transformer model tests")

# Display model information
print("\n📊 Model Information Summary:")
for model_name in ModelFactory.get_available_models()[:5]:  # Show first 5
    info = ModelFactory.get_model_info(model_name)
    print(f"   {model_name}: {info['description']} ({info.get('parameters', 'Unknown')} params)")

print("\n✅ Model creation tests completed!")

🧪 Testing model creation...

📝 Using vocabulary size: 8430


✅ BiLSTM model created: 3,156,735 parameters
✅ CNN model created: 934,398 parameters

🤖 Testing transformer model creation...


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: roberta-base
✅ roberta-base created: 124,650,246 parameters


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: distilbert-base-uncased
✅ distilbert-base created: 66,958,086 parameters

📊 Model Information Summary:
   bilstm-attention: Bidirectional LSTM with attention mechanism (~2M (depends on vocab size) params)
   cnn: Convolutional Neural Network with multiple filter sizes (~1M (depends on vocab size) params)
   roberta-base: RoBERTa base model (125M parameters) (125M params)
   roberta-large: RoBERTa large model (355M parameters) (355M params)
   deberta-v3-base: DeBERTa v3 base with enhanced attention (184M params)

✅ Model creation tests completed!
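
The parameter counts printed for the two traditional models can be checked by hand from the layer definitions. The arithmetic below is a sketch under two assumptions: that `get_model_size` counts every parameter (embeddings included), and that `num_labels` is 6, matching the six emotion classes. The helper names `lstm_params` and `linear_params` are illustrative.

```python
def lstm_params(input_dim: int, hidden_dim: int, num_layers: int,
                bidirectional: bool = True) -> int:
    """Parameter count of an LSTM laid out like torch.nn.LSTM:
    4 gates, input and recurrent weight matrices, and two bias vectors per direction."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the concatenated directions of the previous layer
        in_dim = input_dim if layer == 0 else hidden_dim * directions
        per_direction = 4 * hidden_dim * (in_dim + hidden_dim) + 2 * 4 * hidden_dim
        total += directions * per_direction
    return total


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features


VOCAB, EMB, HIDDEN = 8430, 100, 256
LABELS = 6  # assumed from the six emotion classes

embedding = VOCAB * EMB

# BiLSTM-Attention: embedding + 2-layer BiLSTM + attention scorer + classifier
bilstm_total = (embedding
                + lstm_params(EMB, HIDDEN, num_layers=2)
                + linear_params(2 * HIDDEN, 1)
                + linear_params(2 * HIDDEN, LABELS))

# CNN: embedding + one Conv2d per filter size + classifier
filter_sizes = [2, 3, 4, 5]
num_filters = HIDDEN // len(filter_sizes)
conv_total = sum(num_filters * fs * EMB + num_filters for fs in filter_sizes)
cnn_total = embedding + conv_total + linear_params(len(filter_sizes) * num_filters, LABELS)

print(f"BiLSTM: {bilstm_total:,}")  # 3,156,735
print(f"CNN:    {cnn_total:,}")     # 934,398
```

Both totals match the cell output above. In each model the embedding table alone accounts for 843,000 parameters, which is why the counts depend on vocabulary size.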


## 6. Training Management System

In [16]:
class TrainingManager:
    """Unified training manager for all emotion classification models"""

    def __init__(self, device=None):
        self.device = device if device else torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.models = {}
        self.training_history = {}
        self.results = {}

        print(f"🚀 TrainingManager initialized on device: {self.device}")

    def train_traditional_model(self, model: BaseEmotionModel, train_loader, val_loader,
                              epochs=3, learning_rate=1e-3, class_weights=None):
        """Train traditional models (BiLSTM, CNN)"""
        model.to(self.device)
        model.train()

        # Setup optimizer and loss function
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

        if class_weights is not None:
            criterion = nn.CrossEntropyLoss(weight=class_weights.to(self.device))
        else:
            criterion = nn.CrossEntropyLoss()

        # Training history
        history = {
            'train_loss': [],
            'val_loss': [],
            'val_accuracy': [],
            'val_f1': []
        }

        best_val_f1 = 0.0
        start_time = time.time()

        print(f"🏋️ Training {model.model_name} for {epochs} epochs...")

        for epoch in range(epochs):
            # Training phase
            model.train()
            train_loss = 0.0
            train_pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} [Train]")

            for batch in train_pbar:
                texts, labels = batch
                texts, labels = texts.to(self.device), labels.to(self.device)

                optimizer.zero_grad()
                outputs = model(texts)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()

                train_loss += loss.item()
                train_pbar.set_postfix({'loss': f'{loss.item():.4f}'})

            avg_train_loss = train_loss / len(train_loader)

            # Validation phase
            model.eval()
            val_loss = 0.0
            all_preds = []
            all_labels = []

            with torch.no_grad():
                for batch in val_loader:
                    texts, labels = batch
                    texts, labels = texts.to(self.device), labels.to(self.device)

                    outputs = model(texts)
                    loss = criterion(outputs, labels)
                    val_loss += loss.item()

                    preds = torch.argmax(outputs, dim=1)
                    all_preds.extend(preds.cpu().numpy())
                    all_labels.extend(labels.cpu().numpy())

            avg_val_loss = val_loss / len(val_loader)
            val_accuracy = accuracy_score(all_labels, all_preds)
            val_f1 = f1_score(all_labels, all_preds, average='macro')

            # Update history
            history['train_loss'].append(avg_train_loss)
            history['val_loss'].append(avg_val_loss)
            history['val_accuracy'].append(val_accuracy)
            history['val_f1'].append(val_f1)

            print(f"Epoch {epoch+1}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}, "
                  f"Val Acc: {val_accuracy:.4f}, Val F1: {val_f1:.4f}")

            # Save best model
            if val_f1 > best_val_f1:
                best_val_f1 = val_f1
                model.save_model(f'models/{model.model_name.lower().replace("-", "_")}_best.pt')

        training_time = time.time() - start_time
        print(f"✅ Training completed in {format_time(training_time)}")

        return history, training_time

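    def train_transformer_model(self, model: TransformerEmotionModel, train_loader, val_loader,
                              epochs=3, learning_rate=2e-5, class_weights=None):
        """Train transformer models.

        Note: class_weights is accepted for signature parity with
        train_traditional_model but is not applied here; training uses the
        loss returned by the Hugging Face model (outputs.loss).
        """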
        model.to(self.device)
        model.train()

        # Setup optimizer and scheduler
        optimizer = optim.AdamW(model.parameters(), lr=learning_rate)
        total_steps = len(train_loader) * epochs
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(0.1 * total_steps),
            num_training_steps=total_steps
        )

        # Training history
        history = {
            'train_loss': [],
            'val_loss': [],
            'val_accuracy': [],
            'val_f1': []
        }

        best_val_f1 = 0.0
        start_time = time.time()

        print(f"🤖 Training {model.model_name} for {epochs} epochs...")

        for epoch in range(epochs):
            # Training phase
            model.train()
            train_loss = 0.0
            train_pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} [Train]")

            for batch in train_pbar:
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)

                optimizer.zero_grad()
                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss

                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                scheduler.step()

                train_loss += loss.item()
                train_pbar.set_postfix({'loss': f'{loss.item():.4f}'})

            avg_train_loss = train_loss / len(train_loader)

            # Validation phase
            model.eval()
            val_loss = 0.0
            all_preds = []
            all_labels = []

            with torch.no_grad():
                for batch in val_loader:
                    input_ids = batch['input_ids'].to(self.device)
                    attention_mask = batch['attention_mask'].to(self.device)
                    labels = batch['labels'].to(self.device)

                    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                    loss = outputs.loss
                    val_loss += loss.item()

                    preds = torch.argmax(outputs.logits, dim=1)
                    all_preds.extend(preds.cpu().numpy())
                    all_labels.extend(labels.cpu().numpy())

            avg_val_loss = val_loss / len(val_loader)
            val_accuracy = accuracy_score(all_labels, all_preds)
            val_f1 = f1_score(all_labels, all_preds, average='macro')

            # Update history
            history['train_loss'].append(avg_train_loss)
            history['val_loss'].append(avg_val_loss)
            history['val_accuracy'].append(val_accuracy)
            history['val_f1'].append(val_f1)

            print(f"Epoch {epoch+1}: Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}, "
                  f"Val Acc: {val_accuracy:.4f}, Val F1: {val_f1:.4f}")

            # Save best model
            if val_f1 > best_val_f1:
                best_val_f1 = val_f1
                model.save_model(f'models/{model.model_name.lower().replace("-", "_")}_best.pt')

        training_time = time.time() - start_time
        print(f"✅ Training completed in {format_time(training_time)}")

        return history, training_time

    def train_model(self, model_name: str, data_manager: DataManager,
                   epochs=3, batch_size=16, learning_rate=None):
        """Train a specific model"""
        print(f"\n🎯 Starting training for {model_name}...")

        try:
            # Create model
            if model_name in ['bilstm-attention', 'cnn']:
                if data_manager.vocab is None:
                    raise ValueError("Vocabulary not built for traditional models")
                model = ModelFactory.create_model(model_name, vocab_size=len(data_manager.vocab))

                # Get data loaders
                train_loader, val_loader, test_loader = data_manager.get_data_loaders(
                    'traditional', batch_size=batch_size
                )

                # Train model
                lr = learning_rate if learning_rate else 1e-3
                class_weights = data_manager.get_class_weights()
                history, training_time = self.train_traditional_model(
                    model, train_loader, val_loader, epochs, lr, class_weights
                )

            else:
                # Transformer model
                if model_name not in data_manager.tokenizers:
                    raise ValueError(f"Tokenizer for {model_name} not registered")

                model = ModelFactory.create_model(model_name)

                # Get data loaders
                train_loader, val_loader, test_loader = data_manager.get_data_loaders(
                    model_name, batch_size=batch_size
                )

                # Train model
                lr = learning_rate if learning_rate else 2e-5
                history, training_time = self.train_transformer_model(
                    model, train_loader, val_loader, epochs, lr
                )

            # Store results
            self.models[model_name] = model
            self.training_history[model_name] = history

            print(f"✅ {model_name} training completed successfully!")
            return model, history, training_time

        except Exception as e:
            print(f"❌ Training failed for {model_name}: {e}")
            return None, None, 0

    def train_multiple_models(self, model_names: List[str], data_manager: DataManager,
                            epochs=3, batch_size=16):
        """Train multiple models sequentially"""
        print(f"🚀 Starting batch training for {len(model_names)} models...")

        results = {}
        total_start_time = time.time()

        for i, model_name in enumerate(model_names, 1):
            print(f"\n{'='*60}")
            print(f"Training Model {i}/{len(model_names)}: {model_name.upper()}")
            print(f"{'='*60}")

            model, history, training_time = self.train_model(
                model_name, data_manager, epochs, batch_size
            )

            if model is not None:
                results[model_name] = {
                    'model': model,
                    'history': history,
                    'training_time': training_time,
                    'status': 'success'
                }
            else:
                results[model_name] = {
                    'status': 'failed'
                }

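            # Clean up GPU memory. The trained model is still referenced by
            # `results` and `self.models`, so move it to the CPU first;
            # empty_cache() cannot release memory held by live tensors.
            if model is not None:
                model.to('cpu')
            if torch.cuda.is_available():
                torch.cuda.empty_cache()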

        total_time = time.time() - total_start_time
        successful_models = sum(1 for r in results.values() if r['status'] == 'success')

        print(f"\n🎉 Batch training completed!")
        print(f"   Successful models: {successful_models}/{len(model_names)}")
        print(f"   Total time: {format_time(total_time)}")

        return results

print("✅ TrainingManager class defined successfully!")

✅ TrainingManager class defined successfully!
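
`train_transformer_model` schedules the learning rate with `get_linear_schedule_with_warmup`, using 10% of the total steps for warmup. The multiplier that such a schedule applies to the base learning rate can be sketched in plain Python: a linear ramp from 0 to 1 during warmup, then a linear decay to 0. The function name and the batch count below are made up for illustration.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp up during warmup, linear decay afterwards."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Hypothetical run: 1,000 batches per epoch, 3 epochs, 10% warmup
total_steps = 1000 * 3
warmup_steps = int(0.1 * total_steps)
base_lr = 2e-5

for step in (0, 150, 300, 1650, 3000):
    multiplier = linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>4}: multiplier={multiplier:.2f}, lr={base_lr * multiplier:.2e}")
```

Because `scheduler.step()` is called once per batch, the peak learning rate is reached after the first 10% of batches, not after the first epoch.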


### 6.1 Test Training System

In [17]:
# Initialize training manager
trainer = TrainingManager()

print(f"🔧 Training manager ready on {trainer.device}")
print(f"📊 Available models for training: {ModelFactory.get_available_models()}")

# Check if we have the necessary data
if data_manager.dataset is not None:
    print("✅ Dataset loaded and ready for training")
    print(f"   Vocabulary size: {len(data_manager.vocab) if data_manager.vocab else 'Not built'}")
    print(f"   Registered tokenizers: {list(data_manager.tokenizers.keys())}")
else:
    print("❌ Dataset not loaded. Please run the data loading cells first.")

🚀 TrainingManager initialized on device: cuda
🔧 Training manager ready on cuda
📊 Available models for training: ['bilstm-attention', 'cnn', 'roberta-base', 'roberta-large', 'deberta-v3-base', 'distilbert-base', 'electra-base', 'xlnet-base', 'albert-base']
✅ Dataset loaded and ready for training
   Vocabulary size: 8430
   Registered tokenizers: ['roberta-base', 'roberta-large', 'deberta-v3-base', 'distilbert-base', 'electra-base', 'xlnet-base', 'albert-base']
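
For the traditional models, `train_model` passes `data_manager.get_class_weights()` to `nn.CrossEntropyLoss`. How those weights are computed is defined in `DataManager`, outside this section. One common choice, shown here purely as an assumption, is "balanced" inverse-frequency weighting, where class `c` gets `n_samples / (n_classes * count_c)`. The function name and the toy labels are illustrative.

```python
from collections import Counter
from typing import List


def balanced_class_weights(labels: List[int], num_classes: int) -> List[float]:
    """Inverse-frequency weights: rare classes get proportionally larger weights."""
    counts = Counter(labels)
    n_samples = len(labels)
    # A class that never appears gets weight 0.0 to avoid dividing by zero
    return [
        n_samples / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


# Toy label distribution: 6, 3 and 1 samples for classes 0, 1 and 2
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels, num_classes=3)
print([round(w, 3) for w in weights])
```

To use such weights with the training code above, they would be wrapped in a float tensor before being passed as `class_weights`.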


## 7. Evaluation and Analysis System
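
The evaluation code in this section reports both macro and weighted F1. The two differ on imbalanced data: macro F1 averages the per-class scores equally, while weighted F1 weights each class by its support. A small plain-Python sketch on made-up labels shows the gap; the function names are illustrative, and unlike scikit-learn this version averages over a fixed `num_classes` instead of the labels observed in the data.

```python
from typing import List


def per_class_f1(y_true: List[int], y_pred: List[int], num_classes: int) -> List[float]:
    """F1 per class from true-positive, false-positive and false-negative counts."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(y_true: List[int], y_pred: List[int], num_classes: int) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, num_classes)) / num_classes


def weighted_f1(y_true: List[int], y_pred: List[int], num_classes: int) -> float:
    """Mean of the per-class F1 scores weighted by class support."""
    scores = per_class_f1(y_true, y_pred, num_classes)
    support = [sum(1 for t in y_true if t == c) for c in range(num_classes)]
    return sum(s * n for s, n in zip(scores, support)) / len(y_true)


# Made-up imbalanced example: four samples of class 0, two of class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
macro = macro_f1(y_true, y_pred, 2)
weighted = weighted_f1(y_true, y_pred, 2)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

Here the minority class is the weaker one, so macro F1 comes out lower than weighted F1. That sensitivity to minority classes is a common reason to select checkpoints on macro F1, as the training loops above do.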

In [18]:
class EvaluationManager:
    """Comprehensive evaluation system for emotion classification models"""

    def __init__(self, emotion_labels=None):
        self.emotion_labels = emotion_labels or ["sadness", "joy", "love", "anger", "fear", "surprise"]
        self.results = {}
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        print(f"📈 EvaluationManager initialized with {len(self.emotion_labels)} emotion classes")

    def evaluate_traditional_model(self, model: BaseEmotionModel, test_loader, model_name: str):
        """Evaluate traditional models (BiLSTM, CNN)"""
        model.to(self.device)
        model.eval()

        all_preds = []
        all_labels = []
        inference_times = []

        print(f"🔍 Evaluating {model_name} on test set...")

        with torch.no_grad():
            for batch in tqdm(test_loader, desc="Evaluating"):
                texts, labels = batch
                texts, labels = texts.to(self.device), labels.to(self.device)

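                # Measure inference time. CUDA kernels run asynchronously, so
                # synchronize before and after the forward pass to time the
                # actual computation and not only the kernel launch.
                if self.device.type == 'cuda':
                    torch.cuda.synchronize()
                start_time = time.perf_counter()
                outputs = model(texts)
                if self.device.type == 'cuda':
                    torch.cuda.synchronize()
                inference_time = time.perf_counter() - start_time
                inference_times.append(inference_time)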

                preds = torch.argmax(outputs, dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

        return self._calculate_metrics(all_labels, all_preds, model_name, inference_times)

    def evaluate_transformer_model(self, model: TransformerEmotionModel, test_loader, model_name: str):
        """Evaluate transformer models"""
        model.to(self.device)
        model.eval()

        all_preds = []
        all_labels = []
        inference_times = []

        print(f"🤖 Evaluating {model_name} on test set...")

        with torch.no_grad():
            for batch in tqdm(test_loader, desc="Evaluating"):
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)

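                # Measure inference time. CUDA kernels run asynchronously, so
                # synchronize before and after the forward pass to time the
                # actual computation and not only the kernel launch.
                if self.device.type == 'cuda':
                    torch.cuda.synchronize()
                start_time = time.perf_counter()
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                if self.device.type == 'cuda':
                    torch.cuda.synchronize()
                inference_time = time.perf_counter() - start_time
                inference_times.append(inference_time)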

                preds = torch.argmax(outputs.logits, dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())

        return self._calculate_metrics(all_labels, all_preds, model_name, inference_times)

    def _calculate_metrics(self, true_labels, pred_labels, model_name, inference_times):
        """Calculate comprehensive evaluation metrics"""
        # Basic metrics
        accuracy = accuracy_score(true_labels, pred_labels)
        f1_macro = f1_score(true_labels, pred_labels, average='macro')
        f1_weighted = f1_score(true_labels, pred_labels, average='weighted')
        precision_macro = precision_score(true_labels, pred_labels, average='macro')
        recall_macro = recall_score(true_labels, pred_labels, average='macro')

        # Per-class metrics
        precision_per_class = precision_score(true_labels, pred_labels, average=None)
        recall_per_class = recall_score(true_labels, pred_labels, average=None)
        f1_per_class = f1_score(true_labels, pred_labels, average=None)

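        # Confusion matrix (fix the label set so a class that is absent from
        # both the targets and the predictions still gets a row and a column)
        label_ids = list(range(len(self.emotion_labels)))
        cm = confusion_matrix(true_labels, pred_labels, labels=label_ids)

        # Classification report (passing labels keeps target_names aligned
        # even when a class is missing from the evaluated batch)
        class_report = classification_report(
            true_labels, pred_labels,
            labels=label_ids,
            target_names=self.emotion_labels,
            output_dict=True,
            zero_division=0
        )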

        # Timing metrics
        avg_inference_time = np.mean(inference_times)
        total_inference_time = np.sum(inference_times)

        # Create result dictionary
        result = {
            'model_name': model_name,
            'accuracy': accuracy,
            'f1_macro': f1_macro,
            'f1_weighted': f1_weighted,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'precision_per_class': precision_per_class.tolist(),
            'recall_per_class': recall_per_class.tolist(),
            'f1_per_class': f1_per_class.tolist(),
            'confusion_matrix': cm.tolist(),
            'classification_report': class_report,
            'avg_inference_time': avg_inference_time,
            'total_inference_time': total_inference_time,
            'samples_evaluated': len(true_labels)
        }

        return result

    def evaluate_model(self, model: BaseEmotionModel, test_loader, model_name: str, model_type: str):
        """Evaluate a single model"""
        print(f"\n📊 Evaluating {model_name}...")

        try:
            if model_type == 'traditional':
                result = self.evaluate_traditional_model(model, test_loader, model_name)
            else:
                result = self.evaluate_transformer_model(model, test_loader, model_name)

            # Store result
            self.results[model_name] = result

            # Print summary
            print(f"✅ {model_name} Evaluation Results:")
            print(f"   Accuracy: {result['accuracy']:.4f}")
            print(f"   F1-Macro: {result['f1_macro']:.4f}")
            print(f"   F1-Weighted: {result['f1_weighted']:.4f}")
            print(f"   Avg Inference Time: {result['avg_inference_time']*1000:.2f}ms per batch")

            return result

        except Exception as e:
            print(f"❌ Evaluation failed for {model_name}: {e}")
            return None

    def evaluate_all_models(self, models_dict: Dict, data_manager: DataManager, batch_size=16):
        """Evaluate all trained models"""
        print(f"🎯 Starting comprehensive evaluation of {len(models_dict)} models...")

        evaluation_results = {}

        for model_name, model_info in models_dict.items():
            if model_info['status'] != 'success':
                print(f"⚠️ Skipping {model_name} (training failed)")
                continue

            model = model_info['model']

            # Get appropriate test loader
            if model_name in ['bilstm-attention', 'cnn']:
                _, _, test_loader = data_manager.get_data_loaders('traditional', batch_size=batch_size)
                model_type = 'traditional'
            else:
                _, _, test_loader = data_manager.get_data_loaders(model_name, batch_size=batch_size)
                model_type = 'transformer'

            # Evaluate model
            result = self.evaluate_model(model, test_loader, model_name, model_type)

            if result is not None:
                # Add training information
                result['training_time'] = model_info.get('training_time', 0)
                result['model_size'] = get_model_size(model)
                evaluation_results[model_name] = result

            # Clean up GPU memory
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

        print(f"\n🎉 Evaluation completed for {len(evaluation_results)} models!")
        return evaluation_results

    def generate_comparison_report(self) -> pd.DataFrame:
        """Generate a comprehensive comparison report"""
        if not self.results:
            print("❌ No evaluation results available")
            return pd.DataFrame()

        # Create comparison dataframe
        comparison_data = []

        for model_name, result in self.results.items():
            row = {
                'Model': model_name,
                'Accuracy': result['accuracy'],
                'F1-Macro': result['f1_macro'],
                'F1-Weighted': result['f1_weighted'],
                'Precision': result['precision_macro'],
                'Recall': result['recall_macro'],
                'Training Time (s)': result.get('training_time', 0),
                'Inference Time (ms)': result['avg_inference_time'] * 1000,
                'Model Size (M)': result.get('model_size', 0) / 1e6,
                'Samples': result['samples_evaluated']
            }
            comparison_data.append(row)

        df = pd.DataFrame(comparison_data)

        # Sort by F1-Macro score (descending)
        df = df.sort_values('F1-Macro', ascending=False).reset_index(drop=True)

        return df

    def get_best_models(self, metric='f1_macro', top_k=3):
        """Get top-k best performing models"""
        if not self.results:
            return []

        # Sort models by specified metric
        sorted_models = sorted(
            self.results.items(),
            key=lambda x: x[1][metric],
            reverse=True
        )

        return sorted_models[:top_k]

    def save_results(self, filename='evaluation_results.json'):
        """Save evaluation results to file"""
        save_results(self.results, filename)

    def load_results(self, filename='evaluation_results.json'):
        """Load evaluation results from file"""
        self.results = load_results(filename)
        return self.results

print("✅ EvaluationManager class defined successfully!")

✅ EvaluationManager class defined successfully!
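
The result dictionary reports both `f1_macro` and `f1_weighted`, and the two can diverge on imbalanced data. As a minimal sketch of the difference (pure Python, using a made-up 3-class confusion matrix rather than data from this notebook, with hypothetical helper names), macro-F1 averages the per-class F1 scores equally, while weighted-F1 weights each class by its number of true samples:

```python
def per_class_f1(cm):
    """Per-class F1 from a square confusion matrix (rows = actual, cols = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_and_weighted_f1(cm):
    """Return (macro F1, support-weighted F1) for a confusion matrix."""
    f1 = per_class_f1(cm)
    support = [sum(row) for row in cm]  # true samples per class
    macro = sum(f1) / len(f1)
    weighted = sum(f * s for f, s in zip(f1, support)) / sum(support)
    return macro, weighted


# Hypothetical, imbalanced example: the third class is rare and poorly predicted.
toy_cm = [
    [90, 5, 5],
    [10, 85, 5],
    [6, 2, 2],
]
macro, weighted = macro_and_weighted_f1(toy_cm)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

A gap in this direction appears for `electra-base` in the quick test output later in the notebook (F1-Macro 0.6970 against F1-Weighted 0.8176), which would be consistent with weaker performance on the less frequent emotion classes.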


### 7.1 Initialize Evaluation System

In [19]:
# Initialize evaluation manager
evaluator = EvaluationManager(emotion_labels=config.emotion_labels)

print(f"📊 Evaluation system ready")
print(f"   Emotion classes: {evaluator.emotion_labels}")
print(f"   Device: {evaluator.device}")

📈 EvaluationManager initialized with 6 emotion classes
📊 Evaluation system ready
   Emotion classes: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
   Device: cuda


## 8. Advanced Visualization System

In [20]:
class VisualizationEngine:
    """Professional-grade visualization system for model comparison"""

    def __init__(self, emotion_labels=None, color_palette='viridis'):
        self.emotion_labels = emotion_labels or ["sadness", "joy", "love", "anger", "fear", "surprise"]
        self.color_palette = color_palette
        self.colors = px.colors.qualitative.Set3

        # Professional color scheme
        self.model_colors = {
            'bilstm-attention': '#1f77b4',
            'cnn': '#ff7f0e',
            'roberta-base': '#2ca02c',
            'roberta-large': '#d62728',
            'deberta-v3-base': '#9467bd',
            'distilbert-base': '#8c564b',
            'electra-base': '#e377c2',
            'xlnet-base': '#7f7f7f',
            'albert-base': '#bcbd22'
        }

        print(f"🎨 VisualizationEngine initialized with {len(self.emotion_labels)} emotion classes")

    def plot_performance_comparison(self, results: Dict, metrics=('accuracy', 'f1_macro', 'f1_weighted')):
        """Create interactive performance comparison chart"""
        if not results:
            print("❌ No results available for visualization")
            return None

        # Prepare data
        models = list(results.keys())

        fig = make_subplots(
            rows=1, cols=len(metrics),
            subplot_titles=[metric.replace('_', ' ').title() for metric in metrics],
            specs=[[{'secondary_y': False} for _ in metrics]]
        )

        for i, metric in enumerate(metrics, 1):
            values = [results[model][metric] for model in models]
            colors = [self.model_colors.get(model, '#636EFA') for model in models]

            fig.add_trace(
                go.Bar(
                    x=models,
                    y=values,
                    name=metric.replace('_', ' ').title(),
                    marker_color=colors,
                    text=[f'{v:.3f}' for v in values],
                    textposition='auto',
                    showlegend=False
                ),
                row=1, col=i
            )

        fig.update_layout(
            title={
                'text': 'Model Performance Comparison',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            height=500,
            showlegend=False,
            template='plotly_white'
        )

        # Update x-axis labels
        for i in range(1, len(metrics) + 1):
            fig.update_xaxes(tickangle=45, row=1, col=i)

        return fig

    def plot_confusion_matrices(self, results: Dict, models_to_show=None):
        """Create confusion matrix heatmaps for multiple models"""
        if not results:
            return None

        models = models_to_show or list(results.keys())
        n_models = len(models)

        # Calculate grid dimensions
        cols = min(3, n_models)
        rows = (n_models + cols - 1) // cols

        fig = make_subplots(
            rows=rows, cols=cols,
            subplot_titles=[f'{model.upper()}' for model in models],
            specs=[[{'type': 'heatmap'} for _ in range(cols)] for _ in range(rows)]
        )

        for idx, model in enumerate(models):
            if model not in results:
                continue

            row = idx // cols + 1
            col = idx % cols + 1

            cm = np.array(results[model]['confusion_matrix'])

            # Normalize each row to rates; rows with no true samples stay at zero
            row_sums = cm.sum(axis=1, keepdims=True)
            cm_normalized = np.divide(
                cm.astype(float), row_sums,
                out=np.zeros(cm.shape, dtype=float),
                where=row_sums != 0
            )

            # Cell counts are rendered by the heatmap's `text`/`texttemplate` below

            fig.add_trace(
                go.Heatmap(
                    z=cm_normalized,
                    x=self.emotion_labels,
                    y=self.emotion_labels,
                    colorscale='Blues',
                    showscale=idx == 0,  # Show scale only for first plot
                    text=cm,
                    texttemplate='%{text}',
                    textfont={'color': 'white'},
                    hovertemplate='Predicted: %{x}<br>Actual: %{y}<br>Count: %{text}<br>Rate: %{z:.2f}<extra></extra>'
                ),
                row=row, col=col
            )

        fig.update_layout(
            title={
                'text': 'Confusion Matrices Comparison',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            height=200 * rows + 100,
            template='plotly_white'
        )

        return fig

    def plot_training_curves(self, training_history: Dict):
        """Plot training curves for multiple models"""
        if not training_history:
            return None

        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=['Training Loss', 'Validation Loss', 'Validation Accuracy', 'Validation F1-Score'],
            specs=[[{'secondary_y': False}, {'secondary_y': False}],
                   [{'secondary_y': False}, {'secondary_y': False}]]
        )

        metrics = [
            ('train_loss', 1, 1),
            ('val_loss', 1, 2),
            ('val_accuracy', 2, 1),
            ('val_f1', 2, 2)
        ]

        for model_name, history in training_history.items():
            color = self.model_colors.get(model_name, '#636EFA')

            for metric, row, col in metrics:
                if metric in history:
                    epochs = list(range(1, len(history[metric]) + 1))
                    fig.add_trace(
                        go.Scatter(
                            x=epochs,
                            y=history[metric],
                            mode='lines+markers',
                            name=f'{model_name}',
                            line=dict(color=color),
                            showlegend=(row == 1 and col == 1)  # Show legend only once
                        ),
                        row=row, col=col
                    )

        fig.update_layout(
            title={
                'text': 'Training Progress Comparison',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            height=600,
            template='plotly_white'
        )

        # Update axis labels
        fig.update_xaxes(title_text='Epoch')
        fig.update_yaxes(title_text='Loss', row=1, col=1)
        fig.update_yaxes(title_text='Loss', row=1, col=2)
        fig.update_yaxes(title_text='Accuracy', row=2, col=1)
        fig.update_yaxes(title_text='F1-Score', row=2, col=2)

        return fig

    def plot_radar_chart(self, results: Dict, models_to_show=None):
        """Create radar chart for per-emotion performance"""
        if not results:
            return None

        models = models_to_show or list(results.keys())

        fig = go.Figure()

        for model in models:
            if model not in results:
                continue

            f1_scores = results[model]['f1_per_class']
            color = self.model_colors.get(model, '#636EFA')

            fig.add_trace(go.Scatterpolar(
                r=f1_scores + [f1_scores[0]],  # Close the polygon
                theta=self.emotion_labels + [self.emotion_labels[0]],
                fill='toself',
                name=model.upper(),
                line_color=color,
                fillcolor=color,
                opacity=0.3
            ))

        fig.update_layout(
            polar=dict(
                radialaxis=dict(
                    visible=True,
                    range=[0, 1]
                )
            ),
            title={
                'text': 'Per-Emotion F1-Score Comparison',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            height=600,
            template='plotly_white'
        )

        return fig

    def plot_efficiency_analysis(self, results: Dict):
        """Create efficiency analysis plots"""
        if not results:
            return None

        # Prepare data
        models = []
        f1_scores = []
        training_times = []
        inference_times = []
        model_sizes = []

        for model, result in results.items():
            models.append(model)
            f1_scores.append(result['f1_macro'])
            training_times.append(result.get('training_time', 0))
            inference_times.append(result['avg_inference_time'] * 1000)  # Convert to ms
            model_sizes.append(result.get('model_size', 0) / 1e6)  # Convert to millions

        fig = make_subplots(
            rows=1, cols=2,
            subplot_titles=['Training Time vs Performance', 'Model Size vs Performance'],
            specs=[[{'secondary_y': False}, {'secondary_y': False}]]
        )

        # Training time vs performance
        fig.add_trace(
            go.Scatter(
                x=training_times,
                y=f1_scores,
                mode='markers+text',
                text=models,
                textposition='top center',
                marker=dict(
                    size=10,
                    color=[self.model_colors.get(m, '#636EFA') for m in models]
                ),
                name='Models',
                showlegend=False
            ),
            row=1, col=1
        )

        # Model size vs performance
        fig.add_trace(
            go.Scatter(
                x=model_sizes,
                y=f1_scores,
                mode='markers+text',
                text=models,
                textposition='top center',
                marker=dict(
                    size=10,
                    color=[self.model_colors.get(m, '#636EFA') for m in models]
                ),
                name='Models',
                showlegend=False
            ),
            row=1, col=2
        )

        fig.update_layout(
            title={
                'text': 'Model Efficiency Analysis',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            height=500,
            template='plotly_white'
        )

        # Update axis labels
        fig.update_xaxes(title_text='Training Time (seconds)', row=1, col=1)
        fig.update_xaxes(title_text='Model Size (M parameters)', row=1, col=2)
        fig.update_yaxes(title_text='F1-Macro Score', row=1, col=1)
        fig.update_yaxes(title_text='F1-Macro Score', row=1, col=2)

        return fig

    def create_model_ranking(self, results: Dict, metric='f1_macro'):
        """Create model performance ranking"""
        if not results:
            return None

        # Sort models by metric
        sorted_models = sorted(
            results.items(),
            key=lambda x: x[1][metric],
            reverse=True
        )

        models = [item[0] for item in sorted_models]
        scores = [item[1][metric] for item in sorted_models]
        colors = [self.model_colors.get(model, '#636EFA') for model in models]

        fig = go.Figure(data=[
            go.Bar(
                y=models,
                x=scores,
                orientation='h',
                marker_color=colors,
                text=[f'{score:.3f}' for score in scores],
                textposition='auto'
            )
        ])

        fig.update_layout(
            title={
                'text': f'Model Ranking by {metric.replace("_", " ").title()}',
                'x': 0.5,
                'xanchor': 'center',
                'font': {'size': 20}
            },
            xaxis_title=metric.replace('_', ' ').title(),
            yaxis_title='Models',
            height=400,
            template='plotly_white'
        )

        return fig

    def save_figure(self, fig, filename, format='html'):
        """Save figure to file"""
        if fig is None:
            print("❌ No figure to save")
            return

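        # Make sure the output directory exists before writing
        os.makedirs('visualizations', exist_ok=True)
        filepath = os.path.join('visualizations', filename)

        if format == 'html':
            fig.write_html(filepath)
        elif format in ('png', 'pdf'):
            # Static image export (Plotly needs the kaleido package for this)
            fig.write_image(filepath, format=format, width=1200, height=800)
        else:
            print(f"❌ Unsupported format: {format}")
            return

        print(f"💾 Figure saved to {filepath}")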

print("✅ VisualizationEngine class defined successfully!")

✅ VisualizationEngine class defined successfully!
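
Row-normalizing a confusion matrix needs care when a class has no true samples in the evaluated split, because that row sums to zero. A minimal standalone sketch of a guarded normalization, assuming NumPy and a hypothetical helper name `normalize_rows` (it is not part of `VisualizationEngine`):

```python
import numpy as np


def normalize_rows(cm):
    """Row-normalize a confusion matrix; all-zero rows stay zero instead of NaN."""
    cm = np.asarray(cm, dtype=float)
    row_sums = cm.sum(axis=1, keepdims=True)
    # Divide only where the row sum is non-zero; elsewhere keep the 0.0 from `out`.
    return np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums != 0)


# Hypothetical matrix: the third class never occurs among the true labels.
toy_cm = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 0]]
normalized = normalize_rows(toy_cm)
print(normalized)
```

Each non-empty row of the result sums to 1, so the heatmap colors can be read as per-class rates (the diagonal is recall).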


### 8.1 Initialize Visualization System

In [21]:
# Initialize visualization engine
visualizer = VisualizationEngine(emotion_labels=config.emotion_labels)

print(f"🎨 Visualization system ready")
print(f"   Color palette: {visualizer.color_palette}")
print(f"   Available model colors: {len(visualizer.model_colors)}")

🎨 VisualizationEngine initialized with 6 emotion classes
🎨 Visualization system ready
   Color palette: viridis
   Available model colors: 9


## 9. Main Execution Pipeline

This section orchestrates the entire model comparison process.

### 9.1 Quick Test Run (Small Scale)

Let's first run a quick test, training each model for a single epoch, to ensure the pipeline works end to end.

In [22]:
# Quick test configuration
TEST_CONFIG = {
    'models': ['bilstm-attention', 'cnn', 'roberta-base', 'distilbert-base', 'electra-base', 'albert-base'],  # Traditional and transformer models
    'epochs': 1,  # Quick training
    'batch_size': 32,
    'max_samples': 1000  # Intended dataset cap; not applied by run_quick_test as written
}

print("🧪 Quick Test Configuration:")
print(f"   Models: {TEST_CONFIG['models']}")
print(f"   Epochs: {TEST_CONFIG['epochs']}")
print(f"   Batch size: {TEST_CONFIG['batch_size']}")
print(f"   Max samples: {TEST_CONFIG['max_samples']}")

🧪 Quick Test Configuration:
   Models: ['bilstm-attention', 'cnn', 'roberta-base', 'distilbert-base', 'electra-base', 'albert-base']
   Epochs: 1
   Batch size: 32
   Max samples: 1000
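
Note that `max_samples` is only declared here: the quick test function in the next cell passes the model list, epochs, and batch size to the trainer but never reads `max_samples`, and the training log shows 500 batches of 32 per epoch. One way such a cap could be applied is a seeded subsample of each split before the data loaders are built. This is a minimal sketch with a hypothetical helper `subsample` and a stand-in list of texts; how it would plug into `DataManager` is left open:

```python
import random


def subsample(examples, max_samples, seed=42):
    """Return at most `max_samples` items, chosen reproducibly, in original order."""
    if max_samples is None or len(examples) <= max_samples:
        return list(examples)
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    keep = sorted(rng.sample(range(len(examples)), max_samples))
    return [examples[i] for i in keep]


# Hypothetical stand-in for a training split
train_texts = [f"text {i}" for i in range(16000)]
small_train = subsample(train_texts, max_samples=1000)
print(len(small_train))
```

Sampling indices rather than items keeps the original ordering and makes it straightforward to select the matching labels with the same indices.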


In [23]:
def run_quick_test():
    """Run a quick test with traditional models"""
    print("🚀 Starting quick test run...")

    # Check if data is ready
    if data_manager.dataset is None:
        print("❌ Dataset not loaded. Please run data loading cells first.")
        return

    # Build vocabulary if not already built (needed by the traditional models)
    if data_manager.vocab is None:
        print("🔤 Building vocabulary for traditional models...")
        # Collect the text of every split as plain Python lists
        all_texts = []
        all_texts.extend(list(data_manager.dataset['train']['text']))
        all_texts.extend(list(data_manager.dataset['validation']['text']))
        all_texts.extend(list(data_manager.dataset['test']['text']))

        data_manager.build_vocab(all_texts)
        print(f"✅ Vocabulary built with {len(data_manager.vocab)} words")

    # Train models
    print("\n📚 Training models...")
    training_results = trainer.train_multiple_models(
        TEST_CONFIG['models'],
        data_manager,
        epochs=TEST_CONFIG['epochs'],
        batch_size=TEST_CONFIG['batch_size']
    )

    # Evaluate models
    print("\n📊 Evaluating models...")
    evaluation_results = evaluator.evaluate_all_models(
        training_results,
        data_manager,
        batch_size=TEST_CONFIG['batch_size']
    )

    # Generate comparison report
    if evaluation_results:
        print("\n📋 Generating comparison report...")
        comparison_df = evaluator.generate_comparison_report()
        print("\n🏆 Model Comparison Results:")
        print(comparison_df.to_string(index=False, float_format='%.4f'))

        # Create basic visualizations
        print("\n🎨 Creating visualizations...")

        # Performance comparison
        perf_fig = visualizer.plot_performance_comparison(evaluation_results)
        if perf_fig:
            perf_fig.show()
            visualizer.save_figure(perf_fig, 'quick_test_performance.html')

        # Training curves
        training_history = {name: result['history'] for name, result in training_results.items()
                          if result['status'] == 'success'}
        if training_history:
            curves_fig = visualizer.plot_training_curves(training_history)
            if curves_fig:
                curves_fig.show()
                visualizer.save_figure(curves_fig, 'quick_test_training_curves.html')

        # Save results
        evaluator.save_results('quick_test_results.json')

        print("\n✅ Quick test completed successfully!")
        return evaluation_results

    else:
        print("❌ No successful evaluations to report")
        return None

# Run the quick test
quick_results = run_quick_test()

🚀 Starting quick test run...

📚 Training models...
🚀 Starting batch training for 6 models...

Training Model 1/6: BILSTM-ATTENTION

🎯 Starting training for bilstm-attention...
✅ Data loaders created for traditional models
🏋️ Training BiLSTM-Attention for 1 epochs...


Epoch 1/1 [Train]:   0%|          | 0/500 [00:00<?, ?it/s]

Epoch 1: Train Loss: 1.6719, Val Loss: 1.0990, Val Acc: 0.5195, Val F1: 0.5389
💾 Model saved to models/bilstm_attention_best.pt
✅ Training completed in 18.7s
✅ bilstm-attention training completed successfully!

Training Model 2/6: CNN

🎯 Starting training for cnn...
✅ Data loaders created for traditional models
🏋️ Training CNN-Emotion for 1 epochs...


Epoch 1/1 [Train]:   0%|          | 0/500 [00:00<?, ?it/s]

Epoch 1: Train Loss: 1.7433, Val Loss: 1.3656, Val Acc: 0.4300, Val F1: 0.4443
💾 Model saved to models/cnn_emotion_best.pt
✅ Training completed in 8.0s
✅ cnn training completed successfully!

Training Model 3/6: ROBERTA-BASE

🎯 Starting training for roberta-base...


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: roberta-base
✅ Data loaders created for roberta-base models
🤖 Training ROBERTA-BASE for 1 epochs...


Epoch 1/1 [Train]:   0%|          | 0/500 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.6903, Val Loss: 0.2719, Val Acc: 0.9015, Val F1: 0.8720
💾 Model saved to models/roberta_base_best.pt
✅ Training completed in 4.5m
✅ roberta-base training completed successfully!

Training Model 4/6: DISTILBERT-BASE

🎯 Starting training for distilbert-base...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: distilbert-base-uncased
✅ Data loaders created for distilbert-base models
🤖 Training DISTILBERT-BASE for 1 epochs...


Epoch 1/1 [Train]:   0%|          | 0/500 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.7141, Val Loss: 0.2969, Val Acc: 0.9120, Val F1: 0.8684
💾 Model saved to models/distilbert_base_best.pt
✅ Training completed in 2.3m
✅ distilbert-base training completed successfully!

Training Model 5/6: ELECTRA-BASE

🎯 Starting training for electra-base...


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: google/electra-base-discriminator
✅ Data loaders created for electra-base models
🤖 Training ELECTRA-BASE for 1 epochs...


Epoch 1/1 [Train]:   0%|          | 0/500 [00:00<?, ?it/s]

Epoch 1: Train Loss: 1.0490, Val Loss: 0.6230, Val Acc: 0.8105, Val F1: 0.6853
💾 Model saved to models/electra_base_best.pt
✅ Training completed in 4.7m
✅ electra-base training completed successfully!

Training Model 6/6: ALBERT-BASE

🎯 Starting training for albert-base...


Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: albert-base-v2
✅ Data loaders created for albert-base models
🤖 Training ALBERT-BASE for 1 epochs...


Epoch 1/1 [Train]:   0%|          | 0/500 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.7630, Val Loss: 0.3412, Val Acc: 0.9015, Val F1: 0.8689
💾 Model saved to models/albert_base_best.pt
✅ Training completed in 5.0m
✅ albert-base training completed successfully!

🎉 Batch training completed!
   Successful models: 6/6
   Total time: 17.1m

📊 Evaluating models...
🎯 Starting comprehensive evaluation of 6 models...
✅ Data loaders created for traditional models

📊 Evaluating bilstm-attention...
🔍 Evaluating bilstm-attention on test set...


Evaluating:   0%|          | 0/63 [00:00<?, ?it/s]

✅ bilstm-attention Evaluation Results:
   Accuracy: 0.5110
   F1-Macro: 0.5121
   F1-Weighted: 0.5161
   Avg Inference Time: 8.74ms per batch
✅ Data loaders created for traditional models

📊 Evaluating cnn...
🔍 Evaluating cnn on test set...


Evaluating:   0%|          | 0/63 [00:00<?, ?it/s]

✅ cnn Evaluation Results:
   Accuracy: 0.4005
   F1-Macro: 0.4081
   F1-Weighted: 0.3899
   Avg Inference Time: 1.97ms per batch
✅ Data loaders created for roberta-base models

📊 Evaluating roberta-base...
🤖 Evaluating roberta-base on test set...


Evaluating:   0%|          | 0/63 [00:00<?, ?it/s]

✅ roberta-base Evaluation Results:
   Accuracy: 0.9045
   F1-Macro: 0.8585
   F1-Weighted: 0.9053
   Avg Inference Time: 15.51ms per batch
✅ Data loaders created for distilbert-base models

📊 Evaluating distilbert-base...
🤖 Evaluating distilbert-base on test set...


Evaluating:   0%|          | 0/63 [00:00<?, ?it/s]

✅ distilbert-base Evaluation Results:
   Accuracy: 0.9070
   F1-Macro: 0.8550
   F1-Weighted: 0.9053
   Avg Inference Time: 8.36ms per batch
✅ Data loaders created for electra-base models

📊 Evaluating electra-base...
🤖 Evaluating electra-base on test set...


Evaluating:   0%|          | 0/63 [00:00<?, ?it/s]

✅ electra-base Evaluation Results:
   Accuracy: 0.8370
   F1-Macro: 0.6970
   F1-Weighted: 0.8176
   Avg Inference Time: 18.92ms per batch
✅ Data loaders created for albert-base models

📊 Evaluating albert-base...
🤖 Evaluating albert-base on test set...


Evaluating:   0%|          | 0/63 [00:00<?, ?it/s]

✅ albert-base Evaluation Results:
   Accuracy: 0.8935
   F1-Macro: 0.8420
   F1-Weighted: 0.8926
   Avg Inference Time: 18.22ms per batch

🎉 Evaluation completed for 6 models!

📋 Generating comparison report...

🏆 Model Comparison Results:
           Model  Accuracy  F1-Macro  F1-Weighted  Precision  Recall  Training Time (s)  Inference Time (ms)  Model Size (M)  Samples
    roberta-base    0.9045    0.8585       0.9053     0.8513  0.8678           267.4825              15.5066        124.6502     2000
 distilbert-base    0.9070    0.8550       0.9053     0.8829  0.8373           138.2077               8.3564         66.9581     2000
     albert-base    0.8935    0.8420       0.8926     0.8515  0.8334           298.5468              18.2246         11.6882     2000
    electra-base    0.8370    0.6970       0.8176     0.8318  0.6703           280.3746              18.9182        109.4869     2000
bilstm-attention    0.5110    0.5121       0.5161     0.4976  0.5843            18.6845   

💾 Figure saved to visualizations/quick_test_performance.html


💾 Figure saved to visualizations/quick_test_training_curves.html
💾 Results saved to results/quick_test_results.json

✅ Quick test completed successfully!
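
The ranking depends on the chosen metric. As a small sketch that mirrors the sort in `get_best_models`, the snippet below ranks the six models using the accuracy and F1-Macro values copied from the evaluation log above (the `rank` helper is a hypothetical name for illustration):

```python
# Accuracy and F1-Macro copied from the quick test evaluation log
quick_test = {
    "bilstm-attention": {"accuracy": 0.5110, "f1_macro": 0.5121},
    "cnn": {"accuracy": 0.4005, "f1_macro": 0.4081},
    "roberta-base": {"accuracy": 0.9045, "f1_macro": 0.8585},
    "distilbert-base": {"accuracy": 0.9070, "f1_macro": 0.8550},
    "electra-base": {"accuracy": 0.8370, "f1_macro": 0.6970},
    "albert-base": {"accuracy": 0.8935, "f1_macro": 0.8420},
}


def rank(results, metric):
    """Model names sorted from best to worst on `metric`."""
    return sorted(results, key=lambda name: results[name][metric], reverse=True)


top_by_f1 = rank(quick_test, "f1_macro")[:3]
top_by_accuracy = rank(quick_test, "accuracy")[:3]
print("By F1-Macro:", top_by_f1)
print("By accuracy:", top_by_accuracy)
```

`roberta-base` leads on F1-Macro while `distilbert-base` leads on accuracy. After a single epoch the margins are small, so the order of the top models should be treated as provisional.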


### 9.2 Full Model Comparison Pipeline

Now let's run the complete comparison with all available models.

In [24]:
# Full comparison configuration
FULL_CONFIG = {
    'traditional_models': ['bilstm-attention', 'cnn'],
    'transformer_models': ['roberta-base', 'distilbert-base', 'electra-base', 'albert-base'],
    #'transformer_models': ['roberta-base', 'distilbert-base'],
    'epochs': 3,
    'batch_size': 16,
    'run_full_comparison': True  # Set to False to skip the full comparison
}

# Only include models that have tokenizers registered
available_transformers = [model for model in FULL_CONFIG['transformer_models']
                         if model in data_manager.tokenizers]

all_models = FULL_CONFIG['traditional_models'] + available_transformers

print("🎯 Full Comparison Configuration:")
print(f"   Traditional models: {FULL_CONFIG['traditional_models']}")
print(f"   Available transformer models: {available_transformers}")
print(f"   Total models: {len(all_models)}")
print(f"   Epochs: {FULL_CONFIG['epochs']}")
print(f"   Batch size: {FULL_CONFIG['batch_size']}")
print(f"   Run full comparison: {FULL_CONFIG['run_full_comparison']}")

🎯 Full Comparison Configuration:
   Traditional models: ['bilstm-attention', 'cnn']
   Available transformer models: ['roberta-base', 'distilbert-base', 'electra-base', 'albert-base']
   Total models: 6
   Epochs: 3
   Batch size: 16
   Run full comparison: True
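
Before launching the full run, a rough estimate of its training time can be made from the one-epoch times in the quick test log. This is a back-of-the-envelope sketch only: it assumes the same hardware and that time scales linearly with the number of epochs, and it ignores the change in batch size from 32 to 16, which doubles the number of steps per epoch and may make the real run slower.

```python
# Wall-clock training time for ONE epoch per model, read off the quick test log
# (batch size 32). Values are in seconds.
one_epoch_seconds = {
    "bilstm-attention": 18.7,
    "cnn": 8.0,
    "roberta-base": 4.5 * 60,
    "distilbert-base": 2.3 * 60,
    "electra-base": 4.7 * 60,
    "albert-base": 5.0 * 60,
}

epochs = 3  # matches FULL_CONFIG['epochs']

# Assumption: linear scaling with epochs; evaluation and plotting time excluded.
estimate_seconds = epochs * sum(one_epoch_seconds.values())
estimate_minutes = estimate_seconds / 60
print(f"Estimated training time: ~{estimate_minutes:.0f} minutes")
```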


In [25]:
def run_full_comparison():
    """Run comprehensive model comparison"""
    if not FULL_CONFIG['run_full_comparison']:
        print("⚠️ Full comparison is disabled. Set FULL_CONFIG['run_full_comparison'] = True to enable.")
        return None

    print(f"🚀 Starting full model comparison with {len(all_models)} models...")
    print(f"   This may take a while depending on your hardware.")

    # Check if data is ready
    if data_manager.dataset is None:
        print("❌ Dataset not loaded. Please run data loading cells first.")
        return None

    total_start_time = time.time()

    # Train all models
    print("\n" + "="*80)
    print("🏋️ TRAINING PHASE")
    print("="*80)

    training_results = trainer.train_multiple_models(
        all_models,
        data_manager,
        epochs=FULL_CONFIG['epochs'],
        batch_size=FULL_CONFIG['batch_size']
    )

    # Evaluate all models
    print("\n" + "="*80)
    print("📊 EVALUATION PHASE")
    print("="*80)

    evaluation_results = evaluator.evaluate_all_models(
        training_results,
        data_manager,
        batch_size=FULL_CONFIG['batch_size']
    )

    if not evaluation_results:
        print("❌ No successful evaluations to analyze")
        return None

    # Generate comprehensive analysis
    print("\n" + "="*80)
    print("📋 ANALYSIS PHASE")
    print("="*80)

    # Comparison report
    comparison_df = evaluator.generate_comparison_report()
    print("\n🏆 COMPREHENSIVE MODEL COMPARISON RESULTS:")
    print(comparison_df.to_string(index=False, float_format='%.4f'))

    # Best models
    best_models = evaluator.get_best_models(metric='f1_macro', top_k=3)
    print("\n🥇 TOP 3 MODELS BY F1-MACRO SCORE:")
    for i, (model_name, result) in enumerate(best_models, 1):
        print(f"   {i}. {model_name.upper()}: {result['f1_macro']:.4f}")

    # Generate all visualizations
    print("\n" + "="*80)
    print("🎨 VISUALIZATION PHASE")
    print("="*80)

    # 1. Performance comparison
    print("📊 Creating performance comparison chart...")
    perf_fig = visualizer.plot_performance_comparison(evaluation_results)
    if perf_fig:
        perf_fig.show()
        visualizer.save_figure(perf_fig, 'full_performance_comparison.html')

    # 2. Confusion matrices
    print("🔥 Creating confusion matrices...")
    cm_fig = visualizer.plot_confusion_matrices(evaluation_results)
    if cm_fig:
        cm_fig.show()
        visualizer.save_figure(cm_fig, 'confusion_matrices.html')

    # 3. Training curves
    print("📈 Creating training curves...")
    training_history = {name: result['history'] for name, result in training_results.items()
                      if result['status'] == 'success'}
    if training_history:
        curves_fig = visualizer.plot_training_curves(training_history)
        if curves_fig:
            curves_fig.show()
            visualizer.save_figure(curves_fig, 'training_curves.html')

    # 4. Radar chart
    print("🎯 Creating radar chart...")
    radar_fig = visualizer.plot_radar_chart(evaluation_results)
    if radar_fig:
        radar_fig.show()
        visualizer.save_figure(radar_fig, 'radar_chart.html')

    # 5. Efficiency analysis
    print("⚡ Creating efficiency analysis...")
    eff_fig = visualizer.plot_efficiency_analysis(evaluation_results)
    if eff_fig:
        eff_fig.show()
        visualizer.save_figure(eff_fig, 'efficiency_analysis.html')

    # 6. Model ranking
    print("🏅 Creating model ranking...")
    ranking_fig = visualizer.create_model_ranking(evaluation_results)
    if ranking_fig:
        ranking_fig.show()
        visualizer.save_figure(ranking_fig, 'model_ranking.html')

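    # Save all results (create the output directory first so to_csv cannot fail on a fresh run)
    print("\n💾 Saving results...")
    os.makedirs('results', exist_ok=True)
    evaluator.save_results('full_comparison_results.json')
    comparison_df.to_csv('results/model_comparison.csv', index=False)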

    total_time = time.time() - total_start_time

    print("\n" + "="*80)
    print("🎉 FULL COMPARISON COMPLETED!")
    print("="*80)
    print(f"   Total time: {format_time(total_time)}")
    print(f"   Models evaluated: {len(evaluation_results)}")
    print(f"   Best model: {best_models[0][0].upper()} (F1: {best_models[0][1]['f1_macro']:.4f})")
    print(f"   Results saved to: results/")
    print(f"   Visualizations saved to: visualizations/")

    return evaluation_results

# Runs the full comparison; returns None immediately unless FULL_CONFIG['run_full_comparison'] is True
full_results = run_full_comparison()

print("\n⚡ Full comparison pipeline ready!")
print("   To run full comparison, set FULL_CONFIG['run_full_comparison'] = True and execute the cell above.")

🚀 Starting full model comparison with 6 models...
   This may take a while depending on your hardware.

🏋️ TRAINING PHASE
🚀 Starting batch training for 6 models...

Training Model 1/6: BILSTM-ATTENTION

🎯 Starting training for bilstm-attention...
✅ Data loaders created for traditional models
🏋️ Training BiLSTM-Attention for 3 epochs...


Epoch 1/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 1: Train Loss: 1.5707, Val Loss: 0.8625, Val Acc: 0.6250, Val F1: 0.6360
💾 Model saved to models/bilstm_attention_best.pt


Epoch 2/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 2: Train Loss: 0.8874, Val Loss: 0.4114, Val Acc: 0.8470, Val F1: 0.8295
💾 Model saved to models/bilstm_attention_best.pt


Epoch 3/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 3: Train Loss: 0.5700, Val Loss: 0.2430, Val Acc: 0.8960, Val F1: 0.8783
💾 Model saved to models/bilstm_attention_best.pt
✅ Training completed in 1.5m
✅ bilstm-attention training completed successfully!

Training Model 2/6: CNN

🎯 Starting training for cnn...
✅ Data loaders created for traditional models
🏋️ Training CNN-Emotion for 3 epochs...


Epoch 1/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 1: Train Loss: 1.6899, Val Loss: 1.1865, Val Acc: 0.5745, Val F1: 0.5762
💾 Model saved to models/cnn_emotion_best.pt


Epoch 2/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 2: Train Loss: 0.9026, Val Loss: 0.4578, Val Acc: 0.8175, Val F1: 0.8070
💾 Model saved to models/cnn_emotion_best.pt


Epoch 3/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 3: Train Loss: 0.4482, Val Loss: 0.2943, Val Acc: 0.8900, Val F1: 0.8742
💾 Model saved to models/cnn_emotion_best.pt
✅ Training completed in 36.8s
✅ cnn training completed successfully!

Training Model 3/6: ROBERTA-BASE

🎯 Starting training for roberta-base...


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: roberta-base
✅ Data loaders created for roberta-base models
🤖 Training ROBERTA-BASE for 3 epochs...


Epoch 1/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.6343, Val Loss: 0.2021, Val Acc: 0.9255, Val F1: 0.8996
💾 Model saved to models/roberta_base_best.pt


Epoch 2/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 2: Train Loss: 0.1865, Val Loss: 0.1702, Val Acc: 0.9390, Val F1: 0.9169
💾 Model saved to models/roberta_base_best.pt


Epoch 3/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 3: Train Loss: 0.1165, Val Loss: 0.1650, Val Acc: 0.9365, Val F1: 0.9161
✅ Training completed in 8.3h
✅ roberta-base training completed successfully!

Training Model 4/6: DISTILBERT-BASE

🎯 Starting training for distilbert-base...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: distilbert-base-uncased
✅ Data loaders created for distilbert-base models
🤖 Training DISTILBERT-BASE for 3 epochs...


Epoch 1/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.6543, Val Loss: 0.2231, Val Acc: 0.9225, Val F1: 0.8942
💾 Model saved to models/distilbert_base_best.pt


Epoch 2/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 2: Train Loss: 0.1585, Val Loss: 0.1816, Val Acc: 0.9320, Val F1: 0.9049
💾 Model saved to models/distilbert_base_best.pt


Epoch 3/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 3: Train Loss: 0.1052, Val Loss: 0.1528, Val Acc: 0.9390, Val F1: 0.9159
💾 Model saved to models/distilbert_base_best.pt
✅ Training completed in 7.6m
✅ distilbert-base training completed successfully!

Training Model 5/6: ELECTRA-BASE

🎯 Starting training for electra-base...


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: google/electra-base-discriminator
✅ Data loaders created for electra-base models
🤖 Training ELECTRA-BASE for 3 epochs...


Epoch 1/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.8643, Val Loss: 0.3157, Val Acc: 0.9165, Val F1: 0.8942
💾 Model saved to models/electra_base_best.pt


Epoch 2/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 2: Train Loss: 0.2075, Val Loss: 0.1675, Val Acc: 0.9380, Val F1: 0.9139
💾 Model saved to models/electra_base_best.pt


Epoch 3/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 3: Train Loss: 0.1245, Val Loss: 0.1692, Val Acc: 0.9345, Val F1: 0.9067
✅ Training completed in 14.7m
✅ electra-base training completed successfully!

Training Model 6/6: ALBERT-BASE

🎯 Starting training for albert-base...


Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Loaded transformer model: albert-base-v2
✅ Data loaders created for albert-base models
🤖 Training ALBERT-BASE for 3 epochs...


Epoch 1/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 1: Train Loss: 0.7438, Val Loss: 0.2894, Val Acc: 0.9005, Val F1: 0.8688
💾 Model saved to models/albert_base_best.pt


Epoch 2/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 2: Train Loss: 0.2042, Val Loss: 0.1965, Val Acc: 0.9260, Val F1: 0.9024
💾 Model saved to models/albert_base_best.pt


Epoch 3/3 [Train]:   0%|          | 0/1000 [00:00<?, ?it/s]

Epoch 3: Train Loss: 0.1189, Val Loss: 0.1564, Val Acc: 0.9320, Val F1: 0.9036
💾 Model saved to models/albert_base_best.pt
✅ Training completed in 15.1m
✅ albert-base training completed successfully!

🎉 Batch training completed!
   Successful models: 6/6
   Total time: 9.0h

📊 EVALUATION PHASE
🎯 Starting comprehensive evaluation of 6 models...
✅ Data loaders created for traditional models

📊 Evaluating bilstm-attention...
🔍 Evaluating bilstm-attention on test set...


Evaluating:   0%|          | 0/125 [00:00<?, ?it/s]

✅ bilstm-attention Evaluation Results:
   Accuracy: 0.8910
   F1-Macro: 0.8572
   F1-Weighted: 0.8936
   Avg Inference Time: 9.03ms per batch
✅ Data loaders created for traditional models

📊 Evaluating cnn...
🔍 Evaluating cnn on test set...


Evaluating:   0%|          | 0/125 [00:00<?, ?it/s]

✅ cnn Evaluation Results:
   Accuracy: 0.8845
   F1-Macro: 0.8483
   F1-Weighted: 0.8871
   Avg Inference Time: 2.40ms per batch
✅ Data loaders created for roberta-base models

📊 Evaluating roberta-base...
🤖 Evaluating roberta-base on test set...


Evaluating:   0%|          | 0/125 [00:00<?, ?it/s]

✅ roberta-base Evaluation Results:
   Accuracy: 0.9285
   F1-Macro: 0.8872
   F1-Weighted: 0.9298
   Avg Inference Time: 16.04ms per batch
✅ Data loaders created for distilbert-base models

📊 Evaluating distilbert-base...
🤖 Evaluating distilbert-base on test set...


Evaluating:   0%|          | 0/125 [00:00<?, ?it/s]

✅ distilbert-base Evaluation Results:
   Accuracy: 0.9270
   F1-Macro: 0.8854
   F1-Weighted: 0.9267
   Avg Inference Time: 9.24ms per batch
✅ Data loaders created for electra-base models

📊 Evaluating electra-base...
🤖 Evaluating electra-base on test set...


Evaluating:   0%|          | 0/125 [00:00<?, ?it/s]

✅ electra-base Evaluation Results:
   Accuracy: 0.9305
   F1-Macro: 0.8847
   F1-Weighted: 0.9301
   Avg Inference Time: 19.72ms per batch
✅ Data loaders created for albert-base models

📊 Evaluating albert-base...
🤖 Evaluating albert-base on test set...


Evaluating:   0%|          | 0/125 [00:00<?, ?it/s]

✅ albert-base Evaluation Results:
   Accuracy: 0.9345
   F1-Macro: 0.8896
   F1-Weighted: 0.9339
   Avg Inference Time: 19.28ms per batch

🎉 Evaluation completed for 6 models!

📋 ANALYSIS PHASE

🏆 COMPREHENSIVE MODEL COMPARISON RESULTS:
           Model  Accuracy  F1-Macro  F1-Weighted  Precision  Recall  Training Time (s)  Inference Time (ms)  Model Size (M)  Samples
     albert-base    0.9345    0.8896       0.9339     0.8994  0.8811           905.4997              19.2846         11.6882     2000
    roberta-base    0.9285    0.8872       0.9298     0.8730  0.9054         30043.7731              16.0398        124.6502     2000
 distilbert-base    0.9270    0.8854       0.9267     0.8974  0.8792           456.9058               9.2380         66.9581     2000
    electra-base    0.9305    0.8847       0.9301     0.8939  0.8797           882.9772              19.7180        109.4869     2000
bilstm-attention    0.8910    0.8572       0.8936     0.8291  0.8990            89.3898      

💾 Figure saved to visualizations/full_performance_comparison.html
🔥 Creating confusion matrices...


💾 Figure saved to visualizations/confusion_matrices.html
📈 Creating training curves...


💾 Figure saved to visualizations/training_curves.html
🎯 Creating radar chart...


💾 Figure saved to visualizations/radar_chart.html
⚡ Creating efficiency analysis...


💾 Figure saved to visualizations/efficiency_analysis.html
🏅 Creating model ranking...


💾 Figure saved to visualizations/model_ranking.html

💾 Saving results...
💾 Results saved to results/full_comparison_results.json

🎉 FULL COMPARISON COMPLETED!
   Total time: 9.0h
   Models evaluated: 6
   Best model: ALBERT-BASE (F1: 0.8896)
   Results saved to: results/
   Visualizations saved to: visualizations/

⚡ Full comparison pipeline ready!
   To run full comparison, set FULL_CONFIG['run_full_comparison'] = True and execute the cell above.
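
In the evaluation output above, every model's F1-Macro is lower than its F1-Weighted (for example 0.8896 vs 0.9339 for albert-base). That is the pattern one would expect if the rarer emotion classes score lower than the frequent ones, since the macro average weights every class equally while the weighted average weights each class by its support. The sketch below illustrates the two averages with NumPy. The per-class scores and supports are made-up illustrative values, not numbers taken from this run, and `macro_and_weighted_f1` is a hypothetical helper rather than part of the notebook's evaluator.

```python
import numpy as np

def macro_and_weighted_f1(f1_per_class, support):
    """Return (macro, weighted) averages of per-class F1 scores.

    macro    = unweighted mean over classes
    weighted = mean over classes, weighted by class support (sample count)
    """
    f1 = np.asarray(f1_per_class, dtype=float)
    n = np.asarray(support, dtype=float)
    macro = float(f1.mean())
    weighted = float((f1 * n).sum() / n.sum())
    return macro, weighted

# Illustrative values only (assumed, not measured): six classes,
# with the two smallest classes scoring lowest.
f1_per_class = [0.95, 0.94, 0.80, 0.91, 0.88, 0.70]
support = [580, 700, 160, 275, 225, 60]

macro, weighted = macro_and_weighted_f1(f1_per_class, support)
print(f"F1-Macro:    {macro:.4f}")
print(f"F1-Weighted: {weighted:.4f}")
```

When the small classes are the weak ones, the macro score drops below the weighted score, which is why F1-Macro is the stricter ranking metric on an imbalanced dataset.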


### 9.3 Custom Model Selection

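Train and evaluate only the models you select. The cell below is commented out by default; uncomment it to define `run_custom_comparison` and run the example call.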

In [26]:
# def run_custom_comparison(selected_models, epochs=2, batch_size=16):
#     """Run comparison with custom model selection"""
#     print(f"🎯 Running custom comparison with {len(selected_models)} models...")
#     print(f"   Models: {selected_models}")

#     # Validate model availability
#     available_models = []
#     for model in selected_models:
#         if model in ['bilstm-attention', 'cnn']:
#             if data_manager.vocab is not None:
#                 available_models.append(model)
#             else:
#                 print(f"⚠️ Skipping {model}: vocabulary not built")
#         elif model in data_manager.tokenizers:
#             available_models.append(model)
#         else:
#             print(f"⚠️ Skipping {model}: tokenizer not available")

#     if not available_models:
#         print("❌ No available models to compare")
#         return None

#     print(f"✅ Available models: {available_models}")

#     # Train models
#     training_results = trainer.train_multiple_models(
#         available_models, data_manager, epochs=epochs, batch_size=batch_size
#     )

#     # Evaluate models
#     evaluation_results = evaluator.evaluate_all_models(
#         training_results, data_manager, batch_size=batch_size
#     )

#     if evaluation_results:
#         # Generate report
#         comparison_df = evaluator.generate_comparison_report()
#         print("\n📊 Custom Comparison Results:")
#         print(comparison_df.to_string(index=False, float_format='%.4f'))

#         # Create key visualizations
#         perf_fig = visualizer.plot_performance_comparison(evaluation_results)
#         if perf_fig:
#             perf_fig.show()

#         return evaluation_results

#     return None

# # Example: Compare a few specific models
# custom_results = run_custom_comparison(['bilstm-attention', 'roberta-base', 'distilbert-base', 'cnn'])

# print("🛠️ Custom comparison function ready!")
# print("   Example usage: run_custom_comparison(['bilstm-attention', 'roberta-base', 'distilbert-base'])")

## 10. Results and Conclusions

### 10.1 Load and Display Previous Results

If you have run the comparison before, you can load and display the results.

In [27]:
# Defaults to the quick-test results; pass filename='full_comparison_results.json' to display the full run.
def display_saved_results(filename='quick_test_results.json'):
    """Display previously saved results with fallback options"""
    results = load_results(filename)

    if not results:
        print(f"❌ No saved results found in {filename}")

        # Try alternative result files
        alternative_files = [
            'quick_test_results.json',
            'evaluation_results.json',
            'custom_results.json'
        ]

        print("🔍 Checking for alternative result files...")
        for alt_file in alternative_files:
            alt_results = load_results(alt_file)
            if alt_results:
                print(f"✅ Found results in {alt_file}")
                return display_results_data(alt_results, alt_file)

        print("   Run the comparison pipeline first to generate results.")
        print("   Available options:")
        print("   • quick_results = run_quick_test()  # For quick test")
        print("   • custom_results = run_custom_comparison(['model1', 'model2'])  # For custom selection")
        print("   • Set FULL_CONFIG['run_full_comparison'] = True and run full pipeline")
        return None

    return display_results_data(results, filename)

def display_results_data(results, filename):
    """Display results data with formatting"""
    print(f"📊 Loaded results for {len(results)} models from {filename}")

    # Create summary table, ordered by numeric F1-Macro (best first).
    # Sorting before the values are formatted as strings avoids a lexicographic sort.
    ranked = sorted(results.items(), key=lambda item: item[1]['f1_macro'], reverse=True)
    summary_data = []
    for model_name, result in ranked:
        summary_data.append({
            'Model': model_name.upper(),
            'Accuracy': f"{result['accuracy']:.4f}",
            'F1-Macro': f"{result['f1_macro']:.4f}",
            'F1-Weighted': f"{result['f1_weighted']:.4f}",
            'Training Time': format_time(result.get('training_time', 0)),
            'Inference Time': f"{result['avg_inference_time']*1000:.2f}ms",
            'Model Size': f"{result.get('model_size', 0)/1e6:.1f}M"
        })

    summary_df = pd.DataFrame(summary_data)

    print("\n🏆 MODEL PERFORMANCE SUMMARY:")
    print(summary_df.to_string(index=False))

    # Best model analysis
    if len(summary_df) > 0:
        best_model = summary_df.iloc[0]
        print(f"\n🥇 BEST PERFORMING MODEL: {best_model['Model']}")
        print(f"   F1-Macro Score: {best_model['F1-Macro']}")
        print(f"   Accuracy: {best_model['Accuracy']}")
        print(f"   Training Time: {best_model['Training Time']}")
        print(f"   Model Size: {best_model['Model Size']} parameters")

    return results

def display_current_results(results_variable):
    """Display results from a variable (like quick_results)"""
    if results_variable is None:
        print("❌ No results to display. The variable is None.")
        return None

    if not isinstance(results_variable, dict):
        print("❌ Invalid results format. Expected dictionary.")
        return None

    return display_results_data(results_variable, "current session")

# Try to load and display results
saved_results = display_saved_results()

# If you have quick_results from the previous test, you can display them:
print("\n" + "="*50)
print("💡 ALTERNATIVE: Display quick test results")
print("="*50)
print("If you have run the quick test, use:")
print("display_current_results(quick_results)")

# Try to display quick results if available.
# globals().get() returns None when the variable was never defined, so no try/except is needed.
quick_results_in_session = globals().get('quick_results')
if quick_results_in_session:
    print("\n🎯 Found quick test results! Displaying...")
    display_current_results(quick_results_in_session)
else:
    print("\n⚠️ No quick_results variable found. Run the quick test first.")

📊 Loaded results for 6 models from quick_test_results.json

🏆 MODEL PERFORMANCE SUMMARY:
           Model Accuracy F1-Macro F1-Weighted Training Time Inference Time Model Size
    ROBERTA-BASE   0.9045   0.8585      0.9053          4.5m        15.51ms     124.7M
 DISTILBERT-BASE   0.9070   0.8550      0.9053          2.3m         8.36ms      67.0M
     ALBERT-BASE   0.8935   0.8420      0.8926          5.0m        18.22ms      11.7M
    ELECTRA-BASE   0.8370   0.6970      0.8176          4.7m        18.92ms     109.5M
BILSTM-ATTENTION   0.5110   0.5121      0.5161         18.7s         8.74ms       3.2M
             CNN   0.4005   0.4081      0.3899          8.0s         1.97ms       0.9M

🥇 BEST PERFORMING MODEL: ROBERTA-BASE
   F1-Macro Score: 0.8585
   Accuracy: 0.9045
   Training Time: 4.5m
   Model Size: 124.7M parameters

💡 ALTERNATIVE: Display quick test results
If you have run the quick test, use:
display_current_results(quick_results)

🎯 Found quick test results! Displaying...
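
The cell above relies on a `load_results` helper defined earlier in the notebook. For readers who want to inspect a saved file outside the notebook, the following is a minimal sketch of what such a loader could look like. It assumes the results are a JSON object keyed by model name and stored under `results/`, as the save messages suggest; `load_results_sketch` and its `results_dir` parameter are hypothetical names, not the notebook's own API.

```python
import json
from pathlib import Path

def load_results_sketch(filename, results_dir="results"):
    """Return the saved metrics dict, or None if the file is missing or unreadable.

    Assumption: the file holds a JSON object mapping model name -> metrics dict.
    """
    path = Path(results_dir) / filename
    if not path.is_file():
        return None
    try:
        with path.open("r", encoding="utf-8") as f:
            data = json.load(f)
    except (json.JSONDecodeError, OSError):
        return None
    # Anything other than a JSON object is treated as "no results".
    return data if isinstance(data, dict) else None

results = load_results_sketch("full_comparison_results.json")
if results is None:
    print("No saved results found; run the comparison first.")
else:
    print(f"Loaded results for {len(results)} models: {sorted(results)}")
```

Returning `None` for every failure mode keeps the caller's fallback logic (trying alternative files) simple, at the cost of hiding why a file could not be read.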

### 10.2 Model Analysis and Insights

In [28]:
def generate_insights(results):
    """Generate insights from model comparison results"""
    if not results:
        print("❌ No results available for analysis")
        return

    print("🔍 DETAILED ANALYSIS AND INSIGHTS:")
    print("="*60)

    # Performance analysis
    f1_scores = [(name, result['f1_macro']) for name, result in results.items()]
    f1_scores.sort(key=lambda x: x[1], reverse=True)

    best_model = f1_scores[0]
    worst_model = f1_scores[-1]

    print(f"\n📈 PERFORMANCE INSIGHTS:")
    print(f"   Best Model: {best_model[0].upper()} (F1: {best_model[1]:.4f})")
    print(f"   Worst Model: {worst_model[0].upper()} (F1: {worst_model[1]:.4f})")
    print(f"   Performance Gap: {(best_model[1] - worst_model[1]):.4f}")

    # Model type analysis
    traditional_models = [name for name in results.keys() if name in ['bilstm-attention', 'cnn']]
    transformer_models = [name for name in results.keys() if name not in traditional_models]

    if traditional_models and transformer_models:
        trad_avg = np.mean([results[name]['f1_macro'] for name in traditional_models])
        trans_avg = np.mean([results[name]['f1_macro'] for name in transformer_models])

        print(f"\n🏗️ ARCHITECTURE COMPARISON:")
        print(f"   Traditional Models Avg F1: {trad_avg:.4f}")
        print(f"   Transformer Models Avg F1: {trans_avg:.4f}")
        print(f"   Transformer Advantage: {(trans_avg - trad_avg):.4f}")

    # Efficiency analysis
    training_times = [(name, result.get('training_time', 0)) for name, result in results.items()]
    training_times.sort(key=lambda x: x[1])

    fastest_training = training_times[0]
    slowest_training = training_times[-1]

    print(f"\n⚡ EFFICIENCY INSIGHTS:")
    print(f"   Fastest Training: {fastest_training[0].upper()} ({format_time(fastest_training[1])})")
    print(f"   Slowest Training: {slowest_training[0].upper()} ({format_time(slowest_training[1])})")

    # Per-emotion analysis
    emotion_performance = defaultdict(list)
    for model_name, result in results.items():
        f1_per_class = result['f1_per_class']
        for i, emotion in enumerate(config.emotion_labels):
            emotion_performance[emotion].append((model_name, f1_per_class[i]))

    print(f"\n😊 PER-EMOTION ANALYSIS:")
    for emotion in config.emotion_labels:
        scores = emotion_performance[emotion]
        scores.sort(key=lambda x: x[1], reverse=True)
        best_for_emotion = scores[0]
        avg_score = np.mean([score[1] for score in scores])
        print(f"   {emotion.capitalize()}: Best = {best_for_emotion[0].upper()} ({best_for_emotion[1]:.3f}), Avg = {avg_score:.3f}")

    # Recommendations
    print(f"\n💡 RECOMMENDATIONS:")

    # Best overall
    print(f"   🏆 For best performance: {best_model[0].upper()}")

    # Best efficiency
    efficiency_scores = [(name, result['f1_macro'] / max(result.get('training_time', 1), 1))
                        for name, result in results.items()]
    efficiency_scores.sort(key=lambda x: x[1], reverse=True)
    most_efficient = efficiency_scores[0]
    print(f"   ⚡ For best efficiency: {most_efficient[0].upper()}")

    # Best for production
    inference_times = [(name, result['avg_inference_time']) for name, result in results.items()]
    inference_times.sort(key=lambda x: x[1])
    fastest_inference = inference_times[0]
    print(f"   🚀 For production (fast inference): {fastest_inference[0].upper()}")

# Generate insights if results are available
if saved_results:
    generate_insights(saved_results)
else:
    print("📝 Insights will be generated after running model comparison.")

🔍 DETAILED ANALYSIS AND INSIGHTS:

📈 PERFORMANCE INSIGHTS:
   Best Model: ROBERTA-BASE (F1: 0.8585)
   Worst Model: CNN (F1: 0.4081)
   Performance Gap: 0.4504

🏗️ ARCHITECTURE COMPARISON:
   Traditional Models Avg F1: 0.4601
   Transformer Models Avg F1: 0.8131
   Transformer Advantage: 0.3530

⚡ EFFICIENCY INSIGHTS:
   Fastest Training: CNN (8.0s)
   Slowest Training: ALBERT-BASE (5.0m)

😊 PER-EMOTION ANALYSIS:
   Sadness: Best = DISTILBERT-BASE (0.946), Avg = 0.791
   Joy: Best = ROBERTA-BASE (0.928), Avg = 0.756
   Love: Best = ROBERTA-BASE (0.800), Avg = 0.599
   Anger: Best = ROBERTA-BASE (0.910), Avg = 0.707
   Fear: Best = DISTILBERT-BASE (0.894), Avg = 0.721
   Surprise: Best = ROBERTA-BASE (0.701), Avg = 0.598

💡 RECOMMENDATIONS:
   🏆 For best performance: ROBERTA-BASE
   ⚡ For best efficiency: CNN
   🚀 For production (fast inference): CNN
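
The "best efficiency" recommendation above is the ratio of F1-Macro to training time in seconds, so a very fast model can win even when its F1 is low (CNN at 0.4081 here). The sketch below recomputes that ratio from the quick-test summary table and adds an optional minimum-F1 filter. The filter and its 0.80 threshold are assumptions added for illustration, not something the notebook does, and the training times are the rounded values shown in the table converted to seconds.

```python
# (F1-Macro, training time in seconds) copied from the quick-test summary table.
# Times are the rounded display values, e.g. 4.5m -> 270 s.
quick_test = {
    "roberta-base": (0.8585, 270.0),
    "distilbert-base": (0.8550, 138.0),
    "albert-base": (0.8420, 300.0),
    "electra-base": (0.6970, 282.0),
    "bilstm-attention": (0.5121, 18.7),
    "cnn": (0.4081, 8.0),
}

def rank_by_efficiency(results, min_f1=0.0):
    """Rank models by F1 per training second, skipping models below min_f1."""
    scored = [
        (name, f1 / max(seconds, 1.0))  # same guard against tiny times as generate_insights
        for name, (f1, seconds) in results.items()
        if f1 >= min_f1
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

unfiltered = rank_by_efficiency(quick_test)
filtered = rank_by_efficiency(quick_test, min_f1=0.80)  # hypothetical quality floor

print("Most efficient overall:      ", unfiltered[0][0])
print("Most efficient with F1 >= 0.8:", filtered[0][0])
```

With these numbers the unfiltered ranking picks `cnn`, matching the recommendation above, while the filtered ranking picks `distilbert-base`, which may be the more useful answer when a minimum quality level is required.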


### 10.3 Final Summary and Conclusions

In [29]:
print("🎯 COMPREHENSIVE EMOTION CLASSIFICATION MODEL COMPARISON")
print("="*80)

print("\n📋 PROJECT SUMMARY:")
print("   This notebook provides a comprehensive comparison of 6 deep learning")
print("   models for emotion classification, including:")
print("   • Traditional architectures: BiLSTM with Attention, CNN")
print("   • Modern transformers: RoBERTa, DistilBERT, ELECTRA, ALBERT")

print("\n🔧 TECHNICAL FEATURES:")
print("   ✅ Unified data management system")
print("   ✅ Modular model architecture framework")
print("   ✅ Comprehensive training pipeline")
print("   ✅ Detailed evaluation metrics")
print("   ✅ Professional-grade interactive visualizations")
print("   ✅ Class imbalance handling")
print("   ✅ Performance and efficiency analysis")

print("\n📊 EVALUATION METRICS:")
print("   • Accuracy, Precision, Recall, F1-Score (macro & weighted)")
print("   • Per-class performance analysis")
print("   • Confusion matrices")
print("   • Training and inference time analysis")
print("   • Model size comparison")

print("\n🎨 VISUALIZATIONS:")
print("   • Interactive performance comparison charts")
print("   • Confusion matrix heatmaps")
print("   • Training progress curves")
print("   • Per-emotion performance radar charts")
print("   • Efficiency analysis plots")
print("   • Model ranking leaderboards")

print("\n🚀 USAGE INSTRUCTIONS:")
print("   1. Run all cells in order to set up the environment")
print("   2. Use run_quick_test() for a fast test with traditional models")
print("   3. Set FULL_CONFIG['run_full_comparison'] = True for complete analysis")
print("   4. Use run_custom_comparison() for specific model selection")
print("   5. Results and visualizations are saved automatically")

print("\n📁 OUTPUT FILES:")
print("   • results/: JSON files with detailed metrics")
print("   • visualizations/: Interactive HTML charts")
print("   • models/: Trained model checkpoints")

print("\n🎉 READY TO USE!")
print("   The notebook is fully functional and ready for emotion classification")
print("   model comparison. All components have been tested and integrated.")

print("\n" + "="*80)
print("✅ NOTEBOOK SETUP COMPLETE - READY FOR MODEL COMPARISON!")
print("="*80)

🎯 COMPREHENSIVE EMOTION CLASSIFICATION MODEL COMPARISON

📋 PROJECT SUMMARY:
   This notebook provides a comprehensive comparison of 6 deep learning
   models for emotion classification, including:
   • Traditional architectures: BiLSTM with Attention, CNN
   • Modern transformers: RoBERTa, DistilBERT, ELECTRA, ALBERT

🔧 TECHNICAL FEATURES:
   ✅ Unified data management system
   ✅ Modular model architecture framework
   ✅ Comprehensive training pipeline
   ✅ Detailed evaluation metrics
   ✅ Professional-grade interactive visualizations
   ✅ Class imbalance handling
   ✅ Performance and efficiency analysis

📊 EVALUATION METRICS:
   • Accuracy, Precision, Recall, F1-Score (macro & weighted)
   • Per-class performance analysis
   • Confusion matrices
   • Training and inference time analysis
   • Model size comparison

🎨 VISUALIZATIONS:
   • Interactive performance comparison charts
   • Confusion matrix heatmaps
   • Training progress curves
   • Per-e