# Per-Label Threshold Tuning

This notebook implements per-label threshold tuning for the movie genre classification model. Instead of using a single global threshold for all genres, we optimize a threshold for each genre individually to maximize F1 score.

## Objectives:
1. Load trained model and validation data
2. Get prediction probabilities for validation set
3. For each genre/label, find optimal threshold that maximizes F1 score
4. Save per-label thresholds to JSON file
5. Compare performance: global threshold vs per-label thresholds

In [9]:
# Imports and Setup
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

import numpy as np
import pandas as pd
import json
from tqdm import tqdm

from sklearn.metrics import f1_score, precision_score, recall_score, hamming_loss, jaccard_score
from sklearn.model_selection import train_test_split
from scipy.special import expit  # Sigmoid: squashes decision scores into (0, 1); the result is not a calibrated probability

# Project imports
from descriptions.config import INTERIM_DATA_DIR, MODELS_DIR, REPORTS_DIR
from descriptions.dataset import load_interim
from descriptions.modeling.model import load_model
from descriptions.modeling.preprocess import load_preprocessors
from descriptions.modeling.train import prepare_features_and_labels

print("✓ Imports complete")

✓ Imports complete


## 1. Load Model and Preprocessors

In [10]:
# Load the trained model
print("Loading trained model...")
model_path = MODELS_DIR / "linearsvc.joblib"
if not model_path.exists():
    # Try to find any model file
    model_files = list(MODELS_DIR.glob("*.joblib"))
    model_files = [f for f in model_files if f.name not in {
        "tfidf_vectorizer.joblib", "genre_binarizer.joblib", 
        "normalizer.joblib", "feature_selector.joblib"
    }]
    if model_files:
        model_path = model_files[0]
        print(f"Using model: {model_path.name}")
    else:
        raise FileNotFoundError(f"No model found in {MODELS_DIR}")

model = load_model(model_path)
print(f"✓ Model loaded: {model_path.name}")

# Load preprocessors
print("\nLoading preprocessors...")
vectorizer, mlb, normalizer, feature_selector = load_preprocessors()
print(f"✓ Preprocessors loaded: {len(mlb.classes_)} genre classes")
print(f"  Genres: {list(mlb.classes_)}")

Loading trained model...
2026-01-12 14:15:08.961 | INFO     | descriptions.modeling.model:load_model:103 - Loading model from /Users/christianfullerton/Developer/Python Workspace/movie_genre_model/models/linearsvc.joblib...
2026-01-12 14:15:08.970 | SUCCESS  | descriptions.modeling.model:load_model:105 - Model loaded successfully from /Users/christianfullerton/Developer/Python Workspace/movie_genre_model/models/linearsvc.joblib
✓ Model loaded: linearsvc.joblib

Loading preprocessors...
2026-01-12 14:15:08.972 | INFO     | descriptions.modeling.preprocess:load_preprocessors:274 - Loading TfidfVectorizer from /Users/christianfullerton/Developer/Python Workspace/movie_genre_model/models/tfidf_vectorizer.joblib...
2026-01-12 14:15:08.973 | INFO     | descriptions.modeling.model:load_model:

## 2. Load and Prepare Validation Data

In [11]:
# Load interim data
print("Loading data...")
data = load_interim(INTERIM_DATA_DIR / "cleaned_movies.csv")
print(f"✓ Loaded {len(data)} samples")

# Prepare features and labels using saved preprocessors
print("\nPreparing features and labels...")
X_all, y_all, _, _, _, _ = prepare_features_and_labels(
    data,
    vectorizer=vectorizer,
    mlb=mlb,
    normalizer=normalizer,
    feature_selector=feature_selector,
)
print(f"✓ Features shape: {X_all.shape}")
print(f"✓ Labels shape: {y_all.shape}")

# Split into train/validation (use same split as training: 80/20, random_state=42)
print("\nSplitting data into train/validation sets...")
X_train, X_val, y_train, y_val = train_test_split(
    X_all.values if isinstance(X_all, pd.DataFrame) else X_all,
    y_all,
    test_size=0.2,
    random_state=42,
    shuffle=True
)

print(f"✓ Training samples: {len(X_train)}")
print(f"✓ Validation samples: {len(X_val)}")
print(f"✓ Number of genres: {y_val.shape[1]}")

Loading data...
2026-01-12 14:15:09.046 | INFO     | descriptions.dataset:load_interim:99 - Loading interim data from /Users/christianfullerton/Developer/Python Workspace/movie_genre_model/data/interim/cleaned_movies.csv...
2026-01-12 14:15:09.113 | DEBUG    | descriptions.dataset:load_interim:103 - Loaded with index column
2026-01-12 14:15:09.113 | SUCCESS  | descriptions.dataset:load_interim:108 - ✓ Data loaded successfully: 9087 rows, 2 columns
✓ Loaded 9087 samples

Preparing features and labels...
2026-01-12 14:15:09.116 | INFO     | descriptions.modeling.train:prepare_features_and_labels:142 - Generating TF-IDF features from descriptions...
2026-01-12 14:15:09.116 | INFO     | descriptions.modeling.preprocess:_generate_descri

## 3. Get Prediction Probabilities for Validation Set

In [12]:
# Get prediction probabilities
print("Generating prediction probabilities for validation set...")
y_scores = model.decision_function(X_val)
y_proba = expit(y_scores)  # Sigmoid-squashed SVM margins in (0, 1); uncalibrated, used here only for thresholding

print(f"✓ Probabilities shape: {y_proba.shape}")
print(f"  - Samples: {y_proba.shape[0]}")
print(f"  - Labels: {y_proba.shape[1]}")
print(f"  - Probability range: [{y_proba.min():.4f}, {y_proba.max():.4f}]")

Generating prediction probabilities for validation set...
✓ Probabilities shape: (1812, 14)
  - Samples: 1812
  - Labels: 14
  - Probability range: [0.1966, 0.8981]
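
Because the sigmoid is monotonic, thresholding `expit(score)` at `t` selects exactly the same samples as thresholding the raw decision score at `logit(t)`. The values called "probabilities" in this notebook are therefore best read as rescaled margins. The sketch below illustrates the equivalence on synthetic scores; the array `scores` is a hypothetical stand-in for `model.decision_function(X_val)`.

```python
import numpy as np
from scipy.special import expit, logit

# Synthetic decision scores (hypothetical stand-in for model.decision_function(X_val))
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.0, scale=1.0, size=(6, 3))

threshold = 0.55

# Threshold in "probability" space vs. the equivalent threshold in margin space
pred_from_proba = expit(scores) >= threshold
pred_from_score = scores >= logit(threshold)

same = bool(np.array_equal(pred_from_proba, pred_from_score))
margin_threshold = float(logit(threshold))
print(same, round(margin_threshold, 4))
```

A global threshold of 0.55 therefore corresponds to a margin of roughly 0.20 on the SVM decision function. If calibrated probabilities are needed downstream, a separate calibration step would be required.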


## 4. Per-Label Threshold Tuning

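For each genre/label, we sweep a grid of candidate thresholds (0.10 to 0.90 in steps of 0.05 by default) and keep the one that maximizes F1 score for that specific label. Labels with no positive samples in the validation set fall back to a threshold of 0.5.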

In [13]:
def tune_per_label_thresholds(y_true, y_proba, thresholds_to_try=None, metric='f1'):
    """
    Tune thresholds for each label individually.
    
    Args:
        y_true: True labels (binary array, shape: [n_samples, n_labels])
        y_proba: Prediction probabilities (array, shape: [n_samples, n_labels])
        thresholds_to_try: List of threshold values to try. If None, uses range 0.1 to 0.9 with step 0.05
        metric: Metric to optimize ('f1', 'precision', 'recall', or 'jaccard')
    
    Returns:
        Dictionary mapping genre names to optimal thresholds
        Dictionary mapping genre names to optimal metric scores
    """
    n_labels = y_proba.shape[1]
    
    if thresholds_to_try is None:
        # Round to avoid float drift from np.arange (e.g. 0.15000000000000002)
        thresholds_to_try = np.round(np.arange(0.1, 0.95, 0.05), 2)
    
    optimal_thresholds = {}
    optimal_scores = {}
    
    print(f"Tuning thresholds for {n_labels} labels...")
    print(f"Trying {len(thresholds_to_try)} threshold values: {thresholds_to_try[0]:.2f} to {thresholds_to_try[-1]:.2f}")
    print(f"Optimizing for: {metric}")
    print()
    
    for label_idx in tqdm(range(n_labels), desc="Tuning thresholds"):
        # NOTE: label names come from the notebook-level `mlb`, not from a function argument
        label_name = mlb.classes_[label_idx]
        y_true_label = y_true[:, label_idx]
        y_proba_label = y_proba[:, label_idx]
        
        # Skip if label has no positive samples in validation set
        if y_true_label.sum() == 0:
            optimal_thresholds[label_name] = 0.5  # Default threshold
            optimal_scores[label_name] = 0.0
            continue
        
        best_threshold = 0.5
        best_score = 0.0
        
        # Try each threshold
        for threshold in thresholds_to_try:
            y_pred_label = (y_proba_label >= threshold).astype(int)
            
            # Calculate metric for this label
            if metric == 'f1':
                score = f1_score(y_true_label, y_pred_label, zero_division=0)
            elif metric == 'precision':
                score = precision_score(y_true_label, y_pred_label, zero_division=0)
            elif metric == 'recall':
                score = recall_score(y_true_label, y_pred_label, zero_division=0)
            elif metric == 'jaccard':
                score = jaccard_score(y_true_label, y_pred_label, zero_division=0)
            else:
                raise ValueError(f"Unknown metric: {metric}")
            
            if score > best_score:
                best_score = score
                best_threshold = threshold
        
        optimal_thresholds[label_name] = float(best_threshold)
        optimal_scores[label_name] = float(best_score)
    
    return optimal_thresholds, optimal_scores

# Tune thresholds
print("=" * 70)
print("PER-LABEL THRESHOLD TUNING")
print("=" * 70)
optimal_thresholds, optimal_scores = tune_per_label_thresholds(y_val, y_proba, metric='f1')
print()
print("✓ Threshold tuning complete!")

PER-LABEL THRESHOLD TUNING
Tuning thresholds for 14 labels...
Trying 17 threshold values: 0.10 to 0.90
Optimizing for: f1



Tuning thresholds: 100%|██████████| 14/14 [00:00<00:00, 26.98it/s]


✓ Threshold tuning complete!
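
A fixed grid with step 0.05 can miss the best cut point when scores are tightly clustered. One possible alternative, sketched below on synthetic data, uses scikit-learn's `precision_recall_curve` to score every distinct cut point for a single label. The helper name `best_f1_threshold` and the demo arrays are hypothetical and not part of the project code.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score

def best_f1_threshold(y_true_label, y_score_label):
    """Return (threshold, f1) maximizing F1 over all distinct score cut points."""
    precision, recall, thresholds = precision_recall_curve(y_true_label, y_score_label)
    # The final (precision=1, recall=0) point has no associated threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Synthetic single-label data (hypothetical, for illustration only)
rng = np.random.default_rng(42)
y_true_demo = rng.integers(0, 2, size=200)
y_score_demo = np.clip(
    0.5 + 0.15 * (2 * y_true_demo - 1) + rng.normal(0, 0.15, 200), 0, 1
)

thr, f1_best = best_f1_threshold(y_true_demo, y_score_demo)

# Cross-check against f1_score, and against the coarse 0.05 grid
f1_check = f1_score(y_true_demo, (y_score_demo >= thr).astype(int), zero_division=0)
grid = np.round(np.arange(0.1, 0.95, 0.05), 2)
f1_grid = max(
    f1_score(y_true_demo, (y_score_demo >= t).astype(int), zero_division=0)
    for t in grid
)
print(f"curve: thr={thr:.4f} f1={f1_best:.4f} | grid best f1={f1_grid:.4f}")
```

The curve-based search never scores below the grid on the data it is tuned on, but finer thresholds can also fit noise in a small validation set, so any gain should be confirmed on data not used for tuning.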





## 5. Display Threshold Tuning Results

In [14]:
# Display results
print("=" * 70)
print("OPTIMAL THRESHOLDS PER GENRE")
print("=" * 70)
print(f"{'Genre':<25} {'Optimal Threshold':<20} {'F1 Score':<15}")
print("-" * 70)

# Sort by threshold for better visualization
sorted_thresholds = sorted(optimal_thresholds.items(), key=lambda x: x[1], reverse=True)

for genre, threshold in sorted_thresholds:
    f1 = optimal_scores[genre]
    print(f"{genre:<25} {threshold:<20.4f} {f1:<15.4f}")

print("-" * 70)
print(f"\nSummary Statistics:")
print(f"  Mean threshold: {np.mean(list(optimal_thresholds.values())):.4f}")
print(f"  Median threshold: {np.median(list(optimal_thresholds.values())):.4f}")
print(f"  Min threshold: {min(optimal_thresholds.values()):.4f}")
print(f"  Max threshold: {max(optimal_thresholds.values()):.4f}")
print(f"  Std threshold: {np.std(list(optimal_thresholds.values())):.4f}")

OPTIMAL THRESHOLDS PER GENRE
Genre                     Optimal Threshold    F1 Score       
----------------------------------------------------------------------
Adventure                 0.5500               0.7099         
Animation                 0.5500               0.7635         
Family                    0.5500               0.7561         
Fantasy                   0.5500               0.7157         
History                   0.5500               0.6900         
Horror                    0.5500               0.7411         
Mystery                   0.5500               0.6177         
Romance                   0.5500               0.6968         
Science Fiction           0.5500               0.7846         
Action                    0.5000               0.7721         
Comedy                    0.5000               0.7589         
Crime                     0.5000               0.7343         
Drama                     0.5000               0.7747         
Thriller          

## 6. Save Per-Label Thresholds to JSON

In [15]:
# Prepare data to save
thresholds_data = {
    "per_label_thresholds": optimal_thresholds,
    "per_label_f1_scores": optimal_scores,
    "summary": {
        "mean_threshold": float(np.mean(list(optimal_thresholds.values()))),
        "median_threshold": float(np.median(list(optimal_thresholds.values()))),
        "min_threshold": float(min(optimal_thresholds.values())),
        "max_threshold": float(max(optimal_thresholds.values())),
        "std_threshold": float(np.std(list(optimal_thresholds.values()))),
        "n_labels": len(optimal_thresholds),
        "validation_samples": len(y_val)
    },
    "metadata": {
        "model_path": str(model_path),
        "metric_optimized": "f1",
        "threshold_range": "0.1 to 0.9 (step 0.05)"
    }
}

# Save to JSON file
output_path = MODELS_DIR / "per_label_thresholds.json"
print(f"\nSaving per-label thresholds to {output_path}...")
with open(output_path, 'w') as f:
    json.dump(thresholds_data, f, indent=2)

print(f"✓ Saved per-label thresholds to {output_path}")
print(f"  - {len(optimal_thresholds)} labels")
print(f"  - Threshold range: {min(optimal_thresholds.values()):.3f} to {max(optimal_thresholds.values()):.3f}")


Saving per-label thresholds to /Users/christianfullerton/Developer/Python Workspace/movie_genre_model/models/per_label_thresholds.json...
✓ Saved per-label thresholds to /Users/christianfullerton/Developer/Python Workspace/movie_genre_model/models/per_label_thresholds.json
  - 14 labels
  - Threshold range: 0.500 to 0.550
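
To use the saved file at prediction time, the thresholds need to be ordered to match the columns of the score matrix. The sketch below shows one way to do that with NumPy broadcasting. The helper `apply_per_label_thresholds`, the `default=0.5` fallback, and the inline JSON payload are assumptions for illustration; in the notebook the class order would come from `mlb.classes_`.

```python
import json
import numpy as np

def apply_per_label_thresholds(y_proba, class_names, per_label_thresholds, default=0.5):
    """Binarize an (n_samples, n_labels) score matrix with one threshold per column."""
    # Order thresholds to match the column order of y_proba
    thr = np.array([per_label_thresholds.get(name, default) for name in class_names])
    # Broadcasting compares each column against its own threshold
    return (y_proba >= thr).astype(int)

# Hypothetical stand-ins for the saved JSON payload and mlb.classes_
payload = json.loads('{"per_label_thresholds": {"Action": 0.5, "Horror": 0.55}}')
class_names_demo = ["Action", "Horror"]
proba_demo = np.array([[0.52, 0.52],
                       [0.49, 0.60]])

pred_demo = apply_per_label_thresholds(
    proba_demo, class_names_demo, payload["per_label_thresholds"]
)
print(pred_demo.tolist())  # [[1, 0], [0, 1]]
```

The same score of 0.52 is accepted for one label and rejected for the other, which is the behavior per-label tuning is meant to provide.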


## 7. Compare Performance: Global Threshold vs Per-Label Thresholds

In [16]:
def evaluate_with_per_label_thresholds(y_true, y_proba, per_label_thresholds, mlb):
    """
    Evaluate model using per-label thresholds.
    
    Args:
        y_true: True labels (binary array)
        y_proba: Prediction probabilities (array)
        per_label_thresholds: Dictionary mapping genre names to thresholds
        mlb: MultiLabelBinarizer with genre classes
    
    Returns:
        Dictionary of metrics
    """
    n_samples, n_labels = y_proba.shape
    y_pred = np.zeros_like(y_proba, dtype=int)
    
    # Apply per-label thresholds
    for label_idx, label_name in enumerate(mlb.classes_):
        threshold = per_label_thresholds[label_name]
        y_pred[:, label_idx] = (y_proba[:, label_idx] >= threshold).astype(int)
    
    # Calculate metrics
    metrics = {
        "f1_micro": f1_score(y_true, y_pred, average='micro', zero_division=0),
        "precision_micro": precision_score(y_true, y_pred, average='micro', zero_division=0),
        "recall_micro": recall_score(y_true, y_pred, average='micro', zero_division=0),
        "hamming_loss": hamming_loss(y_true, y_pred),
        "jaccard_micro": jaccard_score(y_true, y_pred, average='micro', zero_division=0),
    }
    
    return metrics

# Evaluate with global threshold (default: 0.55)
global_threshold = 0.55
y_pred_global = (y_proba >= global_threshold).astype(int)

metrics_global = {
    "f1_micro": f1_score(y_val, y_pred_global, average='micro', zero_division=0),
    "precision_micro": precision_score(y_val, y_pred_global, average='micro', zero_division=0),
    "recall_micro": recall_score(y_val, y_pred_global, average='micro', zero_division=0),
    "hamming_loss": hamming_loss(y_val, y_pred_global),
    "jaccard_micro": jaccard_score(y_val, y_pred_global, average='micro', zero_division=0),
}

# Evaluate with per-label thresholds
metrics_per_label = evaluate_with_per_label_thresholds(y_val, y_proba, optimal_thresholds, mlb)

# Display comparison
print("=" * 70)
print("PERFORMANCE COMPARISON")
print("=" * 70)
print(f"\n{'Metric':<25} {'Global (0.55)':<20} {'Per-Label':<20} {'Improvement':<15}")
print("-" * 70)

for metric in ['f1_micro', 'precision_micro', 'recall_micro', 'hamming_loss', 'jaccard_micro']:
    global_val = metrics_global[metric]
    per_label_val = metrics_per_label[metric]
    
    # For hamming_loss, lower is better
    if metric == 'hamming_loss':
        improvement = global_val - per_label_val
        improvement_pct = (improvement / global_val * 100) if global_val > 0 else 0
        improvement_str = f"{improvement:.4f} ({improvement_pct:+.2f}%)"
    else:
        improvement = per_label_val - global_val
        improvement_pct = (improvement / global_val * 100) if global_val > 0 else 0
        improvement_str = f"{improvement:+.4f} ({improvement_pct:+.2f}%)"
    
    print(f"{metric:<25} {global_val:<20.4f} {per_label_val:<20.4f} {improvement_str:<15}")

print("-" * 70)

PERFORMANCE COMPARISON

Metric                    Global (0.55)        Per-Label            Improvement    
----------------------------------------------------------------------
f1_micro                  0.6965               0.7423               +0.0458 (+6.58%)
precision_micro           0.7893               0.7196               -0.0697 (-8.83%)
recall_micro              0.6232               0.7665               +0.1433 (+23.00%)
hamming_loss              0.0995               0.0975               0.0020 (+2.02%)
jaccard_micro             0.5343               0.5902               +0.0559 (+10.46%)
----------------------------------------------------------------------
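
As a sanity check on the table above, micro-averaged F1 is the harmonic mean of micro precision and micro recall, and micro Jaccard follows from F1 as F1 / (2 - F1). The short sketch below recomputes both from the reported precision and recall; the helper names are illustrative only.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def jaccard_from_f1(f1):
    """Jaccard = TP / (TP + FP + FN), rewritten in terms of F1."""
    return f1 / (2 - f1)

# Values copied from the comparison table
reported = {
    "global":    {"precision": 0.7893, "recall": 0.6232, "f1": 0.6965, "jaccard": 0.5343},
    "per_label": {"precision": 0.7196, "recall": 0.7665, "f1": 0.7423, "jaccard": 0.5902},
}

recomputed = {}
for name, m in reported.items():
    f1 = f1_from_pr(m["precision"], m["recall"])
    recomputed[name] = (f1, jaccard_from_f1(f1))
    print(f"{name}: F1={f1:.4f} (reported {m['f1']}), "
          f"Jaccard={recomputed[name][1]:.4f} (reported {m['jaccard']})")
```

The recomputed values agree with the table to within rounding, which confirms that the F1 gain comes from recall rising faster than precision falls.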


## 8. Test Per-Label Thresholds on Test Set

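This section reloads the saved thresholds from JSON and re-evaluates them on the 20% split held out during training. Because this is the same split the thresholds were tuned on, the numbers reproduce Section 7 and are optimistic: they verify the save/load round trip rather than measure how well the thresholds generalize.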

In [17]:
# NOTE: this notebook has no separate test set, so the validation split is reused below.
# The thresholds were tuned on this same split, so the scores reported here are optimistic.
# For an unbiased estimate, evaluate on data held out from both training and threshold tuning.

# Load the per-label thresholds from JSON (in case we're re-running)
print("Loading per-label thresholds from JSON...")
with open(MODELS_DIR / "per_label_thresholds.json", 'r') as f:
    thresholds_data = json.load(f)
    per_label_thresholds = thresholds_data['per_label_thresholds']

print(f"✓ Loaded thresholds for {len(per_label_thresholds)} labels")
print(f"  Threshold range: {min(per_label_thresholds.values()):.3f} to {max(per_label_thresholds.values()):.3f}")

# We'll test on the validation set (which represents our test set)
# Get prediction probabilities for test set
print("\nGenerating prediction probabilities for test set...")
X_test = X_val  # Using validation set as test set
y_test = y_val
y_scores_test = model.decision_function(X_test)
y_proba_test = expit(y_scores_test)

print(f"✓ Test set: {len(X_test)} samples, {y_test.shape[1]} labels")

Loading per-label thresholds from JSON...
✓ Loaded thresholds for 14 labels
  Threshold range: 0.500 to 0.550

Generating prediction probabilities for test set...
✓ Test set: 1812 samples, 14 labels


In [18]:
# Evaluate with global threshold (default: 0.55)
global_threshold = 0.55
y_pred_global_test = (y_proba_test >= global_threshold).astype(int)

metrics_global_test = {
    "f1_micro": f1_score(y_test, y_pred_global_test, average='micro', zero_division=0),
    "precision_micro": precision_score(y_test, y_pred_global_test, average='micro', zero_division=0),
    "recall_micro": recall_score(y_test, y_pred_global_test, average='micro', zero_division=0),
    "hamming_loss": hamming_loss(y_test, y_pred_global_test),
    "jaccard_micro": jaccard_score(y_test, y_pred_global_test, average='micro', zero_division=0),
}

# Evaluate with per-label thresholds
metrics_per_label_test = evaluate_with_per_label_thresholds(y_test, y_proba_test, per_label_thresholds, mlb)

print("=" * 70)
print("TEST SET EVALUATION: Global Threshold vs Per-Label Thresholds")
print("=" * 70)
print(f"\n{'Metric':<25} {'Global (0.55)':<20} {'Per-Label':<20} {'Improvement':<20} {'Improved?':<10}")
print("-" * 95)

improvements = {}
all_improved = True

for metric in ['f1_micro', 'precision_micro', 'recall_micro', 'jaccard_micro', 'hamming_loss']:
    global_val = metrics_global_test[metric]
    per_label_val = metrics_per_label_test[metric]
    
    # For hamming_loss, lower is better
    if metric == 'hamming_loss':
        improvement = global_val - per_label_val
        improvement_pct = (improvement / global_val * 100) if global_val > 0 else 0
        improvement_str = f"{improvement:.4f} ({improvement_pct:+.2f}%)"
        improved = improvement > 0  # Lower is better
    else:
        improvement = per_label_val - global_val
        improvement_pct = (improvement / global_val * 100) if global_val > 0 else 0
        improvement_str = f"{improvement:+.4f} ({improvement_pct:+.2f}%)"
        improved = improvement > 0  # Higher is better
    
    improvements[metric] = improved
    if not improved:
        all_improved = False
    
    improved_str = "✓ Yes" if improved else "✗ No"
    print(f"{metric:<25} {global_val:<20.4f} {per_label_val:<20.4f} {improvement_str:<20} {improved_str:<10}")

print("-" * 95)

# Summary
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)

improved_count = sum(improvements.values())
total_metrics = len(improvements)
print(f"\nMetrics improved: {improved_count} / {total_metrics}")

if all_improved:
    print("\n✅ SUCCESS: Per-label thresholds improved ALL metrics!")
else:
    print("\n⚠️  Per-label thresholds improved SOME metrics, but not all.")
    print("   Consider the trade-offs between precision, recall, and F1 score.")

# Show key improvements
print("\nKey Improvements:")
for metric, improved in improvements.items():
    if improved:
        if metric == 'hamming_loss':
            diff = metrics_global_test[metric] - metrics_per_label_test[metric]
            pct = (diff / metrics_global_test[metric] * 100) if metrics_global_test[metric] > 0 else 0
        else:
            diff = metrics_per_label_test[metric] - metrics_global_test[metric]
            pct = (diff / metrics_global_test[metric] * 100) if metrics_global_test[metric] > 0 else 0
        print(f"  ✓ {metric}: {pct:+.2f}%")

TEST SET EVALUATION: Global Threshold vs Per-Label Thresholds

Metric                    Global (0.55)        Per-Label            Improvement          Improved? 
-----------------------------------------------------------------------------------------------
f1_micro                  0.6965               0.7423               +0.0458 (+6.58%)     ✓ Yes     
precision_micro           0.7893               0.7196               -0.0697 (-8.83%)     ✗ No      
recall_micro              0.6232               0.7665               +0.1433 (+23.00%)    ✓ Yes     
jaccard_micro             0.5343               0.5902               +0.0559 (+10.46%)    ✓ Yes     
hamming_loss              0.0995               0.0975               0.0020 (+2.02%)      ✓ Yes     
-----------------------------------------------------------------------------------------------

SUMMARY

Metrics improved: 4 / 5

⚠️  Per-label thresholds improved SOME metrics, but not all.
   Consider the trade-offs between precision, rec

In [19]:
# Calculate additional insights
print("\n" + "=" * 70)
print("DETAILED METRIC ANALYSIS")
print("=" * 70)

# F1 Score improvement (most important)
f1_improvement = metrics_per_label_test['f1_micro'] - metrics_global_test['f1_micro']
f1_improvement_pct = (f1_improvement / metrics_global_test['f1_micro'] * 100) if metrics_global_test['f1_micro'] > 0 else 0
print(f"\nF1 Score:")
print(f"  Global threshold:     {metrics_global_test['f1_micro']:.4f} ({metrics_global_test['f1_micro']*100:.2f}%)")
print(f"  Per-label thresholds: {metrics_per_label_test['f1_micro']:.4f} ({metrics_per_label_test['f1_micro']*100:.2f}%)")
print(f"  Improvement:          {f1_improvement:+.4f} ({f1_improvement_pct:+.2f}%)")

# Precision/Recall trade-off
precision_diff = metrics_per_label_test['precision_micro'] - metrics_global_test['precision_micro']
recall_diff = metrics_per_label_test['recall_micro'] - metrics_global_test['recall_micro']
print(f"\nPrecision/Recall Trade-off:")
print(f"  Precision change: {precision_diff:+.4f} ({precision_diff/metrics_global_test['precision_micro']*100:+.2f}%)")
print(f"  Recall change:    {recall_diff:+.4f} ({recall_diff/metrics_global_test['recall_micro']*100:+.2f}%)")

# Hamming Loss (error rate)
hamming_improvement = metrics_global_test['hamming_loss'] - metrics_per_label_test['hamming_loss']
hamming_improvement_pct = (hamming_improvement / metrics_global_test['hamming_loss'] * 100) if metrics_global_test['hamming_loss'] > 0 else 0
print(f"\nHamming Loss (lower is better):")
print(f"  Global threshold:     {metrics_global_test['hamming_loss']:.4f} ({metrics_global_test['hamming_loss']*100:.2f}%)")
print(f"  Per-label thresholds: {metrics_per_label_test['hamming_loss']:.4f} ({metrics_per_label_test['hamming_loss']*100:.2f}%)")
print(f"  Improvement:          {hamming_improvement:+.4f} ({hamming_improvement_pct:+.2f}% reduction)")

# Jaccard Score
jaccard_improvement = metrics_per_label_test['jaccard_micro'] - metrics_global_test['jaccard_micro']
jaccard_improvement_pct = (jaccard_improvement / metrics_global_test['jaccard_micro'] * 100) if metrics_global_test['jaccard_micro'] > 0 else 0
print(f"\nJaccard Score (overlap):")
print(f"  Global threshold:     {metrics_global_test['jaccard_micro']:.4f} ({metrics_global_test['jaccard_micro']*100:.2f}%)")
print(f"  Per-label thresholds: {metrics_per_label_test['jaccard_micro']:.4f} ({metrics_per_label_test['jaccard_micro']*100:.2f}%)")
print(f"  Improvement:          {jaccard_improvement:+.4f} ({jaccard_improvement_pct:+.2f}%)")

print("\n" + "=" * 70)


DETAILED METRIC ANALYSIS

F1 Score:
  Global threshold:     0.6965 (69.65%)
  Per-label thresholds: 0.7423 (74.23%)
  Improvement:          +0.0458 (+6.58%)

Precision/Recall Trade-off:
  Precision change: -0.0697 (-8.83%)
  Recall change:    +0.1433 (+23.00%)

Hamming Loss (lower is better):
  Global threshold:     0.0995 (9.95%)
  Per-label thresholds: 0.0975 (9.75%)
  Improvement:          +0.0020 (+2.02% reduction)

Jaccard Score (overlap):
  Global threshold:     0.5343 (53.43%)
  Per-label thresholds: 0.5902 (59.02%)
  Improvement:          +0.0559 (+10.46%)



## Final Summary

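The per-label thresholds were evaluated on the validation split, which is the same data used to tune them. Against the global 0.55 threshold, four of five metrics improved: micro-F1 rose from 0.6965 to 0.7423, recall from 0.6232 to 0.7665, and Jaccard from 0.5343 to 0.5902, while Hamming loss fell from 0.0995 to 0.0975. Precision dropped from 0.7893 to 0.7196.

**Note:** Per-label threshold tuning optimizes F1 score for each label, so it trades precision against recall wherever that raises F1. Here the tuned thresholds (0.50 to 0.55) are at or below the global value, which explains the higher recall and lower precision.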

If metrics improved, consider:
1. Updating the prediction code to use per-label thresholds
2. Re-evaluating on a completely held-out test set
3. Deploying the updated model with per-label thresholds
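
One way to get a less biased estimate without retraining is to split the existing validation rows in two: tune thresholds on one half and report metrics on the other. The sketch below shows only the index split, on synthetic arrays; `y_val_demo` and `y_proba_demo` are hypothetical stand-ins for the notebook's `y_val` and `y_proba`, and the 50/50 ratio is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the notebook's y_val / y_proba (hypothetical data)
rng = np.random.default_rng(0)
n_samples, n_labels = 400, 3
y_val_demo = rng.integers(0, 2, size=(n_samples, n_labels))
y_proba_demo = np.clip(
    0.5 + 0.1 * (2 * y_val_demo - 1) + rng.normal(0, 0.1, (n_samples, n_labels)),
    0, 1,
)

# Split row indices so labels and scores stay aligned
idx_tune, idx_eval = train_test_split(
    np.arange(n_samples), test_size=0.5, random_state=42
)

y_tune, p_tune = y_val_demo[idx_tune], y_proba_demo[idx_tune]  # choose thresholds here
y_eval, p_eval = y_val_demo[idx_eval], y_proba_demo[idx_eval]  # report metrics here only

print(len(idx_tune), len(idx_eval))
```

In this notebook, the tuning half would be passed to `tune_per_label_thresholds` and the evaluation half to `evaluate_with_per_label_thresholds`. Halving the tuning data makes thresholds for rare genres noisier, so cross-validated tuning may be worth considering if some labels have few positives.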

## Summary

The per-label thresholds have been tuned and saved to `models/per_label_thresholds.json`. 

**Key Results:**
- Optimized thresholds for each genre individually using F1 score
- Thresholds saved to JSON file for use in prediction code
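- Compared with the global 0.55 threshold, per-label tuning improved micro-F1, recall, Jaccard, and Hamming loss on the validation split, at the cost of lower precision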

**Next Steps:**
1. Update prediction code to use per-label thresholds instead of global threshold
2. Re-evaluate the model with per-label thresholds on a held-out test set that was used for neither training nor threshold tuning
3. Deploy updated prediction code with per-label thresholds