# Phase 1: Viral Shorts Prediction - Updated for Actual Dataset

**Project:** Miles - Multimodal Analysis of Short-Form Video Algorithms  
**Author:** Cheney Yoon  
**Course:** APS360 - Applied Fundamentals of Deep Learning

## 📋 Overview

This notebook implements the complete Phase 1 training pipeline:

1. **Environment Setup** - Install dependencies, mount Google Drive
2. **Data Pipeline** - Download and preprocess YouTube Shorts dataset
3. **Baseline Model** - Train logistic regression baseline (target: AUROC ≥ 0.65)
4. **Multimodal Model** - Train BERT + ResNet-50 fusion model (target: AUROC ≥ 0.75); this run trains the text-only configuration (`use_vision=False`)
5. **Evaluation** - Comprehensive metrics and visualization
6. **Model Export** - Save for production deployment


## 1️⃣ Environment Setup

In [None]:
# CELL 0: Clean slate (run this first!)
import shutil
import os

print("Cleaning up old files...")

# Delete old checkpoints
if os.path.exists('experiments/checkpoints'):
    shutil.rmtree('experiments/checkpoints')
    print("✅ Deleted old checkpoints")

# Delete old MLflow runs (optional)
if os.path.exists('mlruns'):
    shutil.rmtree('mlruns')
    print("✅ Deleted old MLflow runs")

# Recreate directories
os.makedirs('experiments/checkpoints', exist_ok=True)
os.makedirs('mlruns', exist_ok=True)

print("\n🎯 Clean slate ready! Now restart runtime and run cells in order.")

Cleaning up old files...
✅ Deleted old checkpoints

🎯 Clean slate ready! Now restart runtime and run cells in order.


In [None]:
# Check GPU
!nvidia-smi

Fri Nov  7 21:30:13 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             48W /  400W |    3659MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

import os
project_dir = '/content/drive/MyDrive/Colab Notebooks/Miles'
os.makedirs(project_dir, exist_ok=True)
%cd {project_dir}

print(f"Working directory: {os.getcwd()}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/Miles
Working directory: /content/drive/MyDrive/Colab Notebooks/Miles


In [None]:
# Install dependencies
!pip install -q torch torchvision transformers accelerate
!pip install -q datasets pandas numpy pillow pyarrow
!pip install -q scikit-learn scipy mlflow pyyaml tqdm
!pip install -q matplotlib seaborn

print("✅ Dependencies installed!")

✅ Dependencies installed!


## 2️⃣ Import Modules

In [None]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

# Make the project's src/ packages (data, models, training) importable
sys.path.append(os.path.join(os.getcwd(), 'src'))

import torch
import torch.nn as nn
import pandas as pd
import numpy as np

# Import dataset adapter (NEW!)
from data.dataset_adapter import (
    prepare_dataset_for_training,
    get_available_scalar_features,
    get_dataset_summary
)
from data.download import download_dataset
from data.preprocessing import preprocess_dataset
from data.dataset import create_train_val_test_split, create_data_loaders

from models.baseline import BaselineModel
from models.fusion_model import MultimodalViralityPredictor

from training.utils import load_config, set_seed, get_device, save_checkpoint
from training.evaluate import evaluate_model, print_evaluation_report
from training.train import train_model

print("✅ Modules imported!")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

✅ Modules imported!
PyTorch: 2.8.0+cu126
CUDA: True
GPU: NVIDIA A100-SXM4-40GB


## 3️⃣ Download & Prepare Dataset

**NEW:** A dataset adapter maps the dataset's actual columns onto the names the training pipeline expects.

In [None]:
# Path to your manually uploaded CSV file
csv_path = 'data/youtube_shorts_tiktok_trends_2025.csv'

# Load the CSV directly with pandas
print(f"Loading dataset from: {csv_path}")
df_raw = pd.read_csv(csv_path)

print(f"\n✅ Loaded {len(df_raw):,} videos")
print(f"\nActual columns in dataset:")
print(df_raw.columns.tolist()[:20], "...")  # Show first 20 columns


Loading dataset from: data/youtube_shorts_tiktok_trends_2025.csv

✅ Loaded 48,079 videos

Actual columns in dataset:
['platform', 'country', 'region', 'language', 'category', 'hashtag', 'title_keywords', 'author_handle', 'sound_type', 'music_track', 'week_of_year', 'duration_sec', 'views', 'likes', 'comments', 'shares', 'saves', 'engagement_rate', 'trend_label', 'source_hint'] ...


In [None]:
# Basic filtering: keep English-language videos with all critical fields present
print("Filtering dataset...")

# Filter for English language
if 'language' in df_raw.columns:
    df_filtered = df_raw[df_raw['language'] == 'en'].copy()
    print(f"  After English filter: {len(df_filtered):,} videos")
else:
    df_filtered = df_raw.copy()

# Drop rows with missing critical fields
critical_cols = ['row_id', 'title', 'views', 'likes']
df_filtered = df_filtered.dropna(subset=critical_cols)
print(f"  After removing nulls: {len(df_filtered):,} videos")

# Optional: Sample for faster testing (comment out for full training)
# df_filtered = df_filtered.sample(10000, random_state=42)
# print(f"  After sampling: {len(df_filtered):,} videos")

Filtering dataset...
  After English filter: 9,542 videos
  After removing nulls: 9,542 videos


In [None]:
# Prepare dataset using adapter (handles column mapping automatically)
print("\n" + "="*70)
print("Preparing dataset with adapter...")
print("="*70)

df_prepared = prepare_dataset_for_training(
    df_filtered,
    text_column='title',
    create_viral_labels=True,
    viral_threshold_percentile=80.0  # Top 20% = viral
)

print(f"\n✅ Dataset prepared!")
print(f"Final shape: {df_prepared.shape}")


Preparing dataset with adapter...

✅ Dataset prepared!
Final shape: (9542, 59)
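
With `viral_threshold_percentile=80.0`, the top 20% of videos end up labelled viral. The adapter's internals are not shown in this notebook, so the following is only a minimal sketch of percentile-based labelling, assuming the label comes from thresholding a single engagement score. `percentile` and `label_top_percentile` are hypothetical helpers, not part of `data.dataset_adapter`, and the scores are made up.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (q / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def label_top_percentile(scores, threshold_percentile=80.0):
    """Label a score 1 (viral) when it lies strictly above the percentile cut-off."""
    cutoff = percentile(scores, threshold_percentile)
    return [1 if s > cutoff else 0 for s in scores], cutoff


# Hypothetical engagement scores for ten videos (not taken from the dataset)
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels, cutoff = label_top_percentile(scores, threshold_percentile=80.0)
viral_rate = sum(labels) / len(labels)
print(f"cutoff={cutoff:.1f}, viral rate={viral_rate:.0%}")
```

Because the cut-off is a percentile of the data itself, the viral rate is fixed by construction at roughly 20%, regardless of how engagement is distributed.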


In [None]:
# Get dataset summary
summary = get_dataset_summary(df_prepared)

print("\n📊 Dataset Summary:")
print(f"Total videos: {summary['total_videos']:,}")
if 'viral_count' in summary:
    print(f"Viral videos: {summary['viral_count']:,} ({summary['viral_percentage']:.1f}%)")
if 'platforms' in summary:
    print(f"Platforms: {summary['platforms']}")

# Show sample
print("\nSample data:")
display_cols = ['video_id', 'title', 'views', 'likes', 'engagement_velocity', 'is_viral']
display(df_prepared[display_cols].head())


📊 Dataset Summary:
Total videos: 9,542
Viral videos: 1,909 (20.0%)
Platforms: {'TikTok': 5756, 'YouTube': 3786}

Sample data:


| | video_id | title | views | likes | engagement_velocity | is_viral |
|---|---|---|---|---|---|---|
| 2 | 0d88a011235a82244995ef52961f9502 | Football skills in 60s | 7385 | 363 | 671.36 | 0 |
| 4 | d696b4f0a50ea70e7cb5021be7e198ec | POV: Budget 😂 | 16174 | 832 | 2695.67 | 0 |
| 6 | 1e3f2a7357af75024849730b4404354a | 24 Hours in Istanbul 🧠 | 27099 | 1868 | 1426.26 | 0 |
| 7 | c29ed021d91080e21ddb11e08ad9e9db | Perfect — Cover 📱 | 172755 | 8917 | 24679.29 | 1 |
| 11 | 52cee7fe1f79c8c549bd00f702e22f9b | Makeup Basics You Need | 41879 | 5320 | 8375.8 | 0 |


## 4️⃣ Create Train/Val/Test Splits

In [None]:
# Create splits
print("Creating train/val/test splits (70/15/15)...")

train_df, val_df, test_df = create_train_val_test_split(
    df_prepared,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15,
    stratify_column='is_viral',
    random_seed=42
)

# Save splits
os.makedirs('data/processed', exist_ok=True)
train_df.to_parquet('data/processed/train.parquet', index=False)
val_df.to_parquet('data/processed/val.parquet', index=False)
test_df.to_parquet('data/processed/test.parquet', index=False)

print(f"\n✅ Splits created:")
print(f"  Train: {len(train_df):,} samples ({100*train_df['is_viral'].mean():.1f}% viral)")
print(f"  Val:   {len(val_df):,} samples ({100*val_df['is_viral'].mean():.1f}% viral)")
print(f"  Test:  {len(test_df):,} samples ({100*test_df['is_viral'].mean():.1f}% viral)")

Creating train/val/test splits (70/15/15)...

✅ Splits created:
  Train: 6,679 samples (20.0% viral)
  Val:   1,431 samples (20.0% viral)
  Test:  1,432 samples (20.0% viral)
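
`create_train_val_test_split` lives in the project's `data.dataset` module and is not shown here. To illustrate what a stratified 70/15/15 split does (each class is split separately so every subset keeps the same viral rate), here is a minimal pure-Python sketch. `stratified_split` is a hypothetical helper and its rounding rule is an assumption, not the project's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class before slicing
        n_train = round(len(indices) * train_ratio)
        n_val = round(len(indices) * val_ratio)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Hypothetical label vector with a 20% positive rate
labels = [1] * 200 + [0] * 800
train_idx, val_idx, test_idx = stratified_split(labels)

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    rate = sum(labels[i] for i in idx) / len(idx)
    print(f"{name}: {len(idx)} samples, {rate:.1%} viral")
```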


## 5️⃣ Baseline Model Training

In [None]:
print("="*70)
print("Training Baseline (Logistic Regression + TF-IDF)")
print("="*70)

baseline = BaselineModel(max_features=5000, ngram_range=(1, 1))
baseline.fit(train_df, label_column='is_viral', text_columns=['title'])

# Evaluate
test_metrics_baseline = baseline.evaluate(test_df, label_column='is_viral', text_columns=['title'])

print(f"\n{'='*70}")
print(f"Baseline Test AUROC: {test_metrics_baseline['auroc']:.4f}")
print(f"Target: ≥ 0.65")
print(f"{'='*70}")

# Save
os.makedirs('experiments/checkpoints', exist_ok=True)
baseline.save('experiments/checkpoints/baseline_model.pkl')
print("\n✅ Baseline saved!")

Training Baseline (Logistic Regression + TF-IDF)
              precision    recall  f1-score   support

   Not Viral       0.80      0.50      0.62      1145
       Viral       0.21      0.51      0.29       287

    accuracy                           0.51      1432
   macro avg       0.51      0.51      0.46      1432
weighted avg       0.68      0.51      0.55      1432

[[577 568]
 [140 147]]

Baseline Test AUROC: 0.4879
Target: ≥ 0.65

✅ Baseline saved!
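
AUROC is the probability that a randomly chosen viral video receives a higher score than a randomly chosen non-viral one, so 0.5 corresponds to random ranking and the baseline's 0.4879 is marginally below that. A minimal sketch of this pairwise definition follows. `pairwise_auroc` is a hypothetical helper written for illustration (it is quadratic in the number of samples), not the function `BaselineModel.evaluate` uses.

```python
def pairwise_auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy examples: a partially correct ranking and an uninformative scorer
partial = pairwise_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
constant = pairwise_auroc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
print(f"partial ranking: {partial:.2f}, constant scores: {constant:.2f}")
```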


## 6️⃣ Multimodal Model Setup

In [None]:
# Get available scalar features from dataset
scalar_features = get_available_scalar_features(df_prepared)

print(f"Using {len(scalar_features)} scalar features:")
for feat in scalar_features:
    print(f"  - {feat}")

# Update config with actual feature count
config = load_config('src/configs/training_config.yaml')
config['model']['num_scalar_features'] = len(scalar_features)

# ============================================================
# Rebalance the multi-task loss so the regression term does not dominate
# ============================================================
config['training']['regression_weight'] = 0.05  # Reduced from 0.3
config['training']['classification_weight'] = 0.95  # Increased from 0.7

print(f"\n✅ Using {len(scalar_features)} features (updated from config default)")
print(f"✅ Updated loss weights: cls={config['training']['classification_weight']}, reg={config['training']['regression_weight']}")

Using 18 scalar features:
  - views
  - likes
  - comments
  - shares
  - saves
  - engagement_rate
  - completion_rate
  - like_rate
  - comment_ratio
  - share_rate
  - save_rate
  - upload_hour
  - publish_dayofweek
  - is_weekend
  - duration_sec
  - title_length
  - has_emoji
  - creator_avg_views

✅ Using 18 features (updated from config default)
✅ Updated loss weights: cls=0.95, reg=0.05
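
To see why lowering `regression_weight` helps, here is a small sketch assuming the trainer combines the two task losses as a weighted sum (the combination rule inside `train_model` is not shown in this notebook). `combine_losses` is a hypothetical helper and the loss values are illustrative, not measured.

```python
def combine_losses(classification_loss, regression_loss,
                   classification_weight, regression_weight):
    """Weighted sum of the two task losses plus the regression term's share of it."""
    cls_term = classification_weight * classification_loss
    reg_term = regression_weight * regression_loss
    total = cls_term + reg_term
    return total, reg_term / total


# Illustrative values: a large regression loss, e.g. from an unscaled target
cls_loss, reg_loss = 0.45, 2.0

old_total, old_share = combine_losses(cls_loss, reg_loss, 0.7, 0.3)
new_total, new_share = combine_losses(cls_loss, reg_loss, 0.95, 0.05)

print(f"weights 0.70/0.30: total={old_total:.4f}, regression share={old_share:.1%}")
print(f"weights 0.95/0.05: total={new_total:.4f}, regression share={new_share:.1%}")
```

With these made-up numbers the regression term shrinks from about two thirds of the total loss to under one fifth, so the gradient signal is driven mainly by the classification objective.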


In [None]:
# Setup
set_seed(config['seed'])
device = get_device(config['hardware']['device'])

print(f"Device: {device}")
print(f"Seed: {config['seed']}")

Device: cuda
Seed: 42


In [None]:
# Create DataLoaders with actual feature set
print("Creating DataLoaders...")

from data.dataset import ViralShortsDataset, collate_multimodal_batch
from torch.utils.data import DataLoader

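# Create datasets from the scalar feature list.
# NOTE: the encoding and normalization cells in section 7 modify train_df/val_df/test_df
# in place; run them before this cell if ViralShortsDataset copies its inputs.
train_dataset = ViralShortsDataset(
    train_df,
    text_column='title',
    scalar_columns=scalar_features,  # ← Uses the filtered numeric features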
    label_column='is_viral',
    velocity_column='engagement_velocity',
    text_max_length=128,
    use_images=False,
    augment_images=False
)

val_dataset = ViralShortsDataset(
    val_df,
    text_column='title',
    scalar_columns=scalar_features,
    label_column='is_viral',
    velocity_column='engagement_velocity',
    text_max_length=128,
    use_images=False,
    augment_images=False
)

test_dataset = ViralShortsDataset(
    test_df,
    text_column='title',
    scalar_columns=scalar_features,
    label_column='is_viral',
    velocity_column='engagement_velocity',
    text_max_length=128,
    use_images=False,
    augment_images=False
)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2, collate_fn=collate_multimodal_batch)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=2, collate_fn=collate_multimodal_batch)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=2, collate_fn=collate_multimodal_batch)

print(f"✅ DataLoaders created:")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches: {len(val_loader)}")
print(f"  Test batches: {len(test_loader)}")

Creating DataLoaders...
✅ DataLoaders created:
  Train batches: 209
  Val batches: 45
  Test batches: 45
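
The batch counts follow from the split sizes and `batch_size=32`: with the default `drop_last=False`, a `DataLoader` over a map-style dataset yields `ceil(n / batch_size)` batches. A quick check of that arithmetic (`num_batches` is a hypothetical helper, not a PyTorch function):

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a fixed-size dataset."""
    if drop_last:
        return num_samples // batch_size  # the final partial batch is discarded
    return math.ceil(num_samples / batch_size)


# Split sizes reported by the split cell
split_sizes = {"train": 6679, "val": 1431, "test": 1432}
batches = {name: num_batches(n, batch_size=32) for name, n in split_sizes.items()}
print(batches)
```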


In [None]:
# Initialize model (text-only mode) - UNFROZE ENCODERS
print("Initializing multimodal model (text-only mode)...")

model = MultimodalViralityPredictor(
    num_scalar_features=len(scalar_features),  # ← Uses filtered count
    freeze_encoders=False,
    fusion_hidden_dims=[1024, 256],
    dropout_rates=[0.3, 0.2],
    use_text=True,
    use_vision=False
).to(device)

params = model.count_parameters()
print(f"\n✅ Model initialized:")
print(f"  Total params: {params['total']:,}")
print(f"  Trainable: {params['trainable']:,}")
print(f"  Frozen: {params['frozen']:,}")

Initializing multimodal model (text-only mode)...

✅ Model initialized:
  Total params: 110,551,299
  Trainable: 110,551,299
  Frozen: 0


In [None]:
# RELOAD MODULES - Run this cell to apply code fixes
import importlib
import sys

# Reload training modules to pick up fixes
if 'training.utils' in sys.modules:
    importlib.reload(sys.modules['training.utils'])
if 'training.evaluate' in sys.modules:
    importlib.reload(sys.modules['training.evaluate'])
if 'training.train' in sys.modules:
    importlib.reload(sys.modules['training.train'])

# Re-import after reload
from training.utils import load_config, set_seed, get_device, save_checkpoint
from training.evaluate import evaluate_model, print_evaluation_report
from training.train import train_model

print("✅ Modules reloaded with latest fixes!")

✅ Modules reloaded with latest fixes!


In [None]:
# ========================================
# HOTFIX: Patch torch.load in training.utils
# ========================================
import training.utils
import torch

# Store original load_checkpoint function
original_load_checkpoint = training.utils.load_checkpoint

def patched_load_checkpoint(checkpoint_path, model, optimizer=None, scheduler=None, device=None):
    """Patched version with weights_only=False"""
    import logging
    logger = logging.getLogger(__name__)
    logger.info(f"Loading checkpoint from {checkpoint_path}")

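    # FIX: PyTorch 2.6+ defaults to weights_only=True, which rejects checkpoints that
    # pickle non-tensor objects. weights_only=False can execute arbitrary pickled code,
    # so only load checkpoints you created yourself.
    checkpoint = torch.load(checkpoint_path, map_location=device, weights_only=False)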

    model.load_state_dict(checkpoint['model_state_dict'])

    if optimizer and 'optimizer_state_dict' in checkpoint:
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

    if scheduler and 'scheduler_state_dict' in checkpoint and checkpoint['scheduler_state_dict']:
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])

    epoch = checkpoint.get('epoch', 0)
    metrics = checkpoint.get('metrics', {})

    logger.info(f"Checkpoint loaded successfully from epoch {epoch}")
    return model, optimizer, scheduler, epoch, metrics

# Monkey-patch the function
training.utils.load_checkpoint = patched_load_checkpoint

print("✅ Hotfix applied! torch.load will now use weights_only=False")

✅ Hotfix applied! torch.load will now use weights_only=False


## 7️⃣ Training

In [None]:
# ============================================================
# PRE-PROCESSING: Handle categorical features
# ============================================================
print("="*70)
print("Pre-processing categorical features...")
print("="*70)

# Identify categorical vs numeric features
numeric_features = []
categorical_features = []

for col in scalar_features:
    if pd.api.types.is_numeric_dtype(train_df[col]):
        numeric_features.append(col)
    else:
        categorical_features.append(col)

print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

# Encode categorical features
from sklearn.preprocessing import LabelEncoder

for col in categorical_features:
    le = LabelEncoder()
    # Fit on all unique values across all splits
    all_values = pd.concat([train_df[col], val_df[col], test_df[col]]).unique()
    le.fit(all_values)

    # Transform all splits
    train_df[col] = le.transform(train_df[col])
    val_df[col] = le.transform(val_df[col])
    test_df[col] = le.transform(test_df[col])

    print(f"  Encoded {col}: {len(le.classes_)} categories")

# Now all features are numeric
scalar_features = numeric_features + categorical_features
print(f"\n✅ All {len(scalar_features)} features are now numeric")
print("="*70)

Pre-processing categorical features...
Numeric features (18): ['views', 'likes', 'comments', 'shares', 'saves', 'engagement_rate', 'completion_rate', 'like_rate', 'comment_ratio', 'share_rate', 'save_rate', 'upload_hour', 'is_weekend', 'duration_sec', 'title_length', 'has_emoji', 'creator_avg_views', 'publish_dayofweek']
Categorical features (0): []

✅ All 18 features are now numeric


In [None]:

# ============================================================
# FIX: Normalize dataset features
# ============================================================
from sklearn.preprocessing import StandardScaler
import numpy as np

print("="*70)
print("Normalizing features...")
print("="*70)

# 1. Fit scaler on training data (all features are now numeric)
scaler = StandardScaler()
scaler.fit(train_df[scalar_features])

# 2. Transform all splits
train_df[scalar_features] = scaler.transform(train_df[scalar_features])
val_df[scalar_features] = scaler.transform(val_df[scalar_features])
test_df[scalar_features] = scaler.transform(test_df[scalar_features])

print(f"✅ Normalized {len(scalar_features)} scalar features")

# 3. Min-max scale engagement_velocity with the training-set range
#    (val/test values can fall slightly outside [0, 1])
velocity_min = train_df['engagement_velocity'].min()
velocity_max = train_df['engagement_velocity'].max()

train_df['engagement_velocity'] = (train_df['engagement_velocity'] - velocity_min) / (velocity_max - velocity_min)
val_df['engagement_velocity'] = (val_df['engagement_velocity'] - velocity_min) / (velocity_max - velocity_min)
test_df['engagement_velocity'] = (test_df['engagement_velocity'] - velocity_min) / (velocity_max - velocity_min)

print(f"✅ Normalized engagement_velocity to [0, 1]")
print(f"   Train range: [{train_df['engagement_velocity'].min():.4f}, {train_df['engagement_velocity'].max():.4f}]")

# 4. Class weights for the imbalanced labels (inverse class frequency).
#    NOTE: nothing later in this notebook passes class_weights to train_model
#    or to the evaluation criterion.
viral_count = train_df['is_viral'].sum()
non_viral_count = len(train_df) - viral_count
class_weights = torch.tensor([
    len(train_df) / (2 * non_viral_count),  # weight for non-viral
    len(train_df) / (2 * viral_count)        # weight for viral
], dtype=torch.float32).to(device)

print(f"✅ Calculated class weights: non-viral={class_weights[0]:.3f}, viral={class_weights[1]:.3f}")
print("="*70)


Normalizing features...
✅ Normalized 18 scalar features
✅ Normalized engagement_velocity to [0, 1]
   Train range: [0.0000, 1.0000]
✅ Calculated class weights: non-viral=0.625, viral=2.500
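
The class-weight formula in the cell above, `n / (2 * count)`, is the two-class case of the usual "balanced" heuristic `n_samples / (n_classes * class_count)`. A minimal sketch with hypothetical counts at the same 80/20 ratio reproduces the printed weights (`balanced_class_weights` is a hypothetical helper):

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]


# Hypothetical counts with a 20% viral rate: [non-viral, viral]
weights = balanced_class_weights([800, 200])
print(f"non-viral={weights[0]:.3f}, viral={weights[1]:.3f}")
```

In PyTorch such weights are typically applied through the `weight` argument of `nn.CrossEntropyLoss`. Whether `train_model` builds its criterion that way is not shown in this notebook.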


In [None]:
# DIAGNOSTIC: Check data types
print("Checking data after encoding and normalization...")
print("\nTrain DataFrame dtypes:")
for col in scalar_features:
    print(f"  {col}: {train_df[col].dtype} - sample: {train_df[col].iloc[0]}")

print("\nChecking for string values...")
for col in scalar_features:
    if train_df[col].dtype == 'object':
        print(f"❌ ERROR: {col} is still type 'object' (strings!)")
        print(f"   Sample values: {train_df[col].unique()[:5]}")

Checking data after encoding and normalization...

Train DataFrame dtypes:
  views: float64 - sample: -0.10646334373044782
  likes: float64 - sample: 0.05859236202427371
  comments: float64 - sample: 0.2300179659545931
  shares: float64 - sample: -0.09174063255630419
  saves: float64 - sample: 0.4172526950912015
  engagement_rate: float64 - sample: 0.8070714281172682
  completion_rate: float64 - sample: 1.365713751840436
  like_rate: float64 - sample: 0.6532623358670496
  comment_ratio: float64 - sample: 1.1594903704244173
  share_rate: float64 - sample: -0.03816888040516234
  save_rate: float64 - sample: 1.5968765298666459
  upload_hour: float64 - sample: -0.36966530046072654
  is_weekend: float64 - sample: 0.7848677727870719
  duration_sec: float64 - sample: 1.079104503949965
  title_length: float64 - sample: 1.607371959935594
  has_emoji: float64 - sample: 1.0715778970518122
  creator_avg_views: float64 - sample: -1.2795278498982428
  publish_dayofweek: float64 - sample: 0.348731430
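
The standardized values above come from statistics fitted on the training split only and then reused for validation and test data. The sketch below shows that pattern in plain Python, assuming population standard deviation (as `StandardScaler` uses) and a min-max range taken from the training data. All helper names and numbers are hypothetical.

```python
import statistics


def fit_standardizer(train_values):
    """Return (mean, std) from training data; a zero std falls back to 1.0."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    return [(v - mean) / std for v in values]


def min_max_scale(values, low, high, clip=False):
    """Scale with a range fitted elsewhere; optionally clip into [0, 1]."""
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    scaled = [(v - low) / span for v in values]
    if clip:
        scaled = [min(1.0, max(0.0, s)) for s in scaled]
    return scaled


train = [1.0, 2.0, 3.0, 4.0, 5.0]
val = [0.0, 6.0]  # values outside the training range

mean, std = fit_standardizer(train)
val_standardized = standardize(val, mean, std)

low, high = min(train), max(train)
val_scaled = min_max_scale(val, low, high)
val_clipped = min_max_scale(val, low, high, clip=True)
print(val_standardized, val_scaled, val_clipped)
```

Unseen data can land outside `[0, 1]` after min-max scaling with the training range. Clipping is one option, but it is an assumption here and is not something the notebook does.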

In [None]:
print("="*70)
print("Starting Training (Text-Only Multimodal Model)")
print("="*70)
print("\nEstimated time: 2-3 hours on A100 GPU\n")

# ============================================================
# GLOBAL FIX: Patch torch.load to default to weights_only=False
# (PyTorch 2.6+ defaults to True). weights_only=False unpickles
# arbitrary objects, so only load checkpoints you trust.
# ============================================================
import torch

# Guard against wrapping torch.load again when this cell is re-run
if not getattr(torch.load, "_weights_only_patched", False):
    _original_torch_load = torch.load

    def patched_torch_load(f, map_location=None, pickle_module=None, *, weights_only=None, mmap=None, **kwargs):
        """Global patch for torch.load to use weights_only=False by default"""
        if weights_only is None:
            weights_only = False
        return _original_torch_load(f, map_location=map_location, pickle_module=pickle_module,
                                    weights_only=weights_only, mmap=mmap, **kwargs)

    patched_torch_load._weights_only_patched = True
    torch.load = patched_torch_load

print("✅ Global torch.load patch applied\n")
# ============================================================

trained_model, train_results = train_model(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    config=config,
    device=device
)

print(f"\n{'='*70}")
print(f"Training Complete!")
print(f"Best Validation AUROC: {train_results['best_auroc']:.4f}")
print(f"{'='*70}")

Starting Training (Text-Only Multimodal Model)

Estimated time: 2-3 hours on A100 GPU

✅ Global torch.load patch applied

Training Complete!
Best Validation AUROC: 0.8451


## 8️⃣ Final Evaluation

In [None]:
# Evaluate on test set
print("="*70)
print("Final Evaluation on Test Set")
print("="*70)

classification_criterion = nn.CrossEntropyLoss()
regression_criterion = nn.MSELoss()

test_metrics = evaluate_model(
    trained_model,
    test_loader,
    device,
    classification_criterion,
    regression_criterion,
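    classification_weight=0.7,  # NOTE: differs from the training config (0.95 / 0.05), so the
    regression_weight=0.3,      # total loss reported here is not comparable to the training loss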
    return_predictions=True
)

print_evaluation_report(test_metrics, "Test Set Evaluation")

# Check success criteria
success_auroc = test_metrics['auroc'] >= 0.75
success_mae = test_metrics['velocity_mae'] <= 0.30

print("\n" + "="*70)
print("Success Criteria:")
print(f"  AUROC ≥ 0.75: {'✅ PASS' if success_auroc else '❌ FAIL'} ({test_metrics['auroc']:.4f})")
print(f"  MAE ≤ 0.30:   {'✅ PASS' if success_mae else '❌ FAIL'} ({test_metrics['velocity_mae']:.4f})")
print("="*70)

Final Evaluation on Test Set

                         Test Set Evaluation                          

Classification Metrics:
  AUROC:                    0.8551
  Average Precision (PR-AUC): 0.5780
  Accuracy:                 0.8017
  F1 Score:                 0.0207
  Precision:                1.0000
  Recall:                   0.0105
  Precision_at_10% Recall:     0.2004

Confusion Matrix:
  TP:     3  |  FP:     0
  FN:   284  |  TN:  1145

Regression Metrics (Engagement Velocity):
  MAE:  0.0310
  RMSE: 0.0397
  R²:   -0.3757

Loss Metrics:
  Total Loss:          0.3206
  Classification Loss: 0.4574
  Regression Loss:     0.0016


Success Criteria:
  AUROC ≥ 0.75: ✅ PASS (0.8551)
  MAE ≤ 0.30:   ✅ PASS (0.0310)
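
The threshold-dependent metrics follow directly from the confusion matrix above: only 3 of the 287 viral test videos are predicted viral at the decision threshold used by `evaluate_model`, which is why recall and F1 are near zero even though AUROC, which depends only on ranking, is 0.8551. A small sketch reproducing the reported numbers (`threshold_metrics` is a hypothetical helper):

```python
def threshold_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the test-set confusion matrix reported above
metrics = threshold_metrics(tp=3, fp=0, fn=284, tn=1145)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Accuracy of about 0.80 is what predicting "not viral" for every video would also achieve on a 20% positive rate, so it says little about how well viral videos are detected.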


## 9️⃣ Model Comparison & Export

In [None]:
# Compare models
comparison = pd.DataFrame({
    'Model': ['Baseline (TF-IDF)', 'Multimodal (BERT Text-Only)'],
    'AUROC': [test_metrics_baseline['auroc'], test_metrics['auroc']],
    'Accuracy': [test_metrics_baseline['accuracy'], test_metrics['accuracy']],
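    'F1 Score': [0, test_metrics['f1']],  # 0 is a placeholder: the baseline's F1 is not read from its metrics
    'Velocity MAE': [0, test_metrics['velocity_mae']]  # 0 is a placeholder: the baseline does not predict velocity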
})

print("\nModel Comparison:")
display(comparison)

improvement = test_metrics['auroc'] - test_metrics_baseline['auroc']
print(f"\nAUROC Improvement: {improvement:+.4f} ({100*improvement/test_metrics_baseline['auroc']:.1f}% relative)")


Model Comparison:


| | Model | AUROC | Accuracy | F1 Score | Velocity MAE |
|---|---|---|---|---|---|
| 0 | Baseline (TF-IDF) | 0.487867 | 0.505587 | 0.0 | 0.0 |
| 1 | Multimodal (BERT Text-Only) | 0.855104 | 0.801676 | 0.02069 | 0.030975 |



AUROC Improvement: +0.3672 (75.3% relative)


In [None]:
# Export model
print("Exporting model...")

os.makedirs('experiments/exported_models', exist_ok=True)
torch.save(trained_model.state_dict(), 'experiments/exported_models/model_state_dict.pt')
torch.save({
    'model_state_dict': trained_model.state_dict(),
    'config': config,
    'scalar_features': scalar_features,
    'test_metrics': {k: float(v) if isinstance(v, (int, float, np.number)) else str(v)
                     for k, v in test_metrics.items() if k != 'predictions'}
}, 'experiments/exported_models/model_full.pt')

print("✅ Model exported!")

Exporting model...
✅ Model exported!


## 📊 Final Summary

In [None]:
import json

print("="*70)
print("PHASE 1 COMPLETE!")
print("="*70)

print(f"\n📊 Dataset:")
print(f"  Total videos: {len(df_prepared):,}")
print(f"  Viral rate: {100 * df_prepared['is_viral'].mean():.1f}%")
print(f"  Scalar features: {len(scalar_features)}")

print(f"\n🎯 Results:")
print(f"  Baseline AUROC:    {test_metrics_baseline['auroc']:.4f}")
print(f"  Multimodal AUROC:  {test_metrics['auroc']:.4f} {'✅' if success_auroc else '❌'}")
print(f"  Velocity MAE:      {test_metrics['velocity_mae']:.4f} {'✅' if success_mae else '❌'}")

# Save results
results = {
    'dataset_size': len(df_prepared),
    'viral_rate': float(df_prepared['is_viral'].mean()),
    'num_features': len(scalar_features),
    'baseline_auroc': float(test_metrics_baseline['auroc']),
    'multimodal_auroc': float(test_metrics['auroc']),
    'velocity_mae': float(test_metrics['velocity_mae']),
    'success': bool(success_auroc and success_mae)  # cast so NumPy booleans serialize to JSON
}

with open('experiments/phase1_results.json', 'w') as f:
    json.dump(results, f, indent=2)

print(f"\n💾 Saved to: experiments/phase1_results.json")
print("\n🎉 Phase 1 complete!")

PHASE 1 COMPLETE!

📊 Dataset:
  Total videos: 9,542
  Viral rate: 20.0%
  Scalar features: 18

🎯 Results:
  Baseline AUROC:    0.4879
  Multimodal AUROC:  0.8551 ✅
  Velocity MAE:      0.0310 ✅

💾 Saved to: experiments/phase1_results.json

🎉 Phase 1 complete!
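
The `success` flag saved above combines only the multimodal AUROC and velocity MAE checks. The overview also sets a baseline target (AUROC ≥ 0.65), so a per-target report may be a useful complement. The sketch below is one hypothetical way to do that, using the thresholds and results stated in this notebook; `TARGETS` and `check_targets` are illustrative names.

```python
import operator

# Targets stated in the overview and in the success-criteria cell
TARGETS = {
    "baseline_auroc": (operator.ge, 0.65),
    "multimodal_auroc": (operator.ge, 0.75),
    "velocity_mae": (operator.le, 0.30),
}


def check_targets(results, targets=TARGETS):
    """Return a pass/fail flag per metric instead of a single combined flag."""
    return {
        name: bool(compare(results[name], threshold))
        for name, (compare, threshold) in targets.items()
    }


# Values printed in the final summary
results = {"baseline_auroc": 0.4879, "multimodal_auroc": 0.8551, "velocity_mae": 0.0310}
report = check_targets(results)
for name, passed in report.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```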
