# Difficulty Binning Analysis with Real Data

This notebook explores the current difficulty binning algorithm used in the songbook generator and visualizes how it behaves with real song data from Google Drive.

## Current Algorithm Overview

The current binning logic (from `generator/worker/difficulty.py`) uses a **relative approach**:

1. Find the minimum difficulty value among all songs
2. Use a hardcoded maximum of 5.0
3. Normalize all difficulties to a 0-1 range: `(difficulty - min) / (5.0 - min)`
4. Digitize into bins using `np.linspace(0, 1, num_bins + 1)`

Because the minimum is recomputed from whichever songs are selected, bin assignments are **relative to the current song selection**: the same song can land in a different bin depending on which other songs are included.
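The four steps above can be sketched in a few lines of NumPy. This is an illustrative reimplementation based only on the description in this notebook, not the code in `generator/worker/difficulty.py`. The function name `relative_bins`, the clipping of the top edge into the last bin, and the handling of a zero-width range are all assumptions.

```python
import numpy as np


def relative_bins(difficulties, num_bins=5, max_difficulty=5.0):
    """Sketch of the relative binning steps described above (hypothetical helper)."""
    d = np.asarray(difficulties, dtype=float)
    lowest = d.min()                       # step 1: minimum of the current selection
    span = max_difficulty - lowest         # step 2: hardcoded maximum of 5.0
    if span <= 0:
        # Assumption: with no usable range, put everything in the first bin.
        return np.ones(len(d), dtype=int)
    normalized = (d - lowest) / span       # step 3: scale to the 0-1 range
    edges = np.linspace(0, 1, num_bins + 1)
    bins = np.digitize(normalized, edges)  # step 4: indices 1..num_bins+1
    # Assumption: fold the top edge (normalized == 1.0) into the last bin.
    return np.clip(bins, 1, num_bins)


# The same two songs (3.0 and 4.0) in two different selections.
print(relative_bins([1.0, 3.0, 4.0]))  # selection includes an easy song
print(relative_bins([3.0, 4.0]))       # easy song removed
```

With the easy song present, the 3.0 and 4.0 songs fall in bins 3 and 4. Without it they drop to bins 1 and 3, even though their difficulty values did not change.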

## Goals

- Visualize how the current algorithm distributes real songs from Google Drive into bins
- Explore how bin assignments change with different real song selections
- Compare the relative approach with possible absolute schemes using real data
- Identify inconsistencies and edge cases in the live songbook collection

## Prerequisites

This notebook requires Google Cloud authentication with service account impersonation permissions. Ensure you have:
1. Authenticated with `gcloud auth application-default login`
2. Permission to impersonate the songbook-generator service account
3. Access to the Google Drive folder containing song sheets


In [None]:
# Standard library imports
import sys
import warnings
from pathlib import Path
from typing import List

# Data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import ipywidgets as widgets
from ipywidgets import interactive

# Add the project root to the Python path so we can import from generator
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Set up matplotlib for better plots
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10
sns.set_style("whitegrid")
sns.set_palette("husl")

print("✅ Imports completed successfully")

In [None]:
# Import project modules
from generator.worker.difficulty import assign_difficulty_bins
from generator.worker.models import File
from generator.common.gdrive import GoogleDriveClient
from generator.common.config import get_settings
from generator.worker.pdf import init_services

print("✅ Project modules imported successfully")

## Initialize Google Drive Connection

In [None]:
# Initialize Google Drive services using existing authentication patterns
def init_gdrive_services():
    """Initialize Google Drive services using the same patterns as CLI tools."""
    settings = get_settings()

    # Use the same initialization as songbook-tools CLI
    credential_config = settings.gcp_credentials
    drive, cache = init_services(
        scopes=credential_config.scopes,
        target_principal=credential_config.principal,
    )

    gdrive = GoogleDriveClient(cache=cache, drive=drive)
    print("✅ Google Drive services initialized")
    return gdrive, settings


# Initialize the services
gdrive, settings = init_gdrive_services()

## Fetch Real Song Data from Google Drive

In [None]:
def fetch_real_songs(
    gdrive_client: GoogleDriveClient, settings
) -> tuple[List[File], List[File]]:
    """Fetch song files from Google Drive.

    Returns (all_files, files_with_a_numeric_difficulty).
    """
    print("📥 Fetching real songs from Google Drive...")

    # Use the same folder as the songbook generator
    songs_folder_id = settings.gdrive_songs_folder_id

    # Fetch all PDF files from the songs folder
    files = list(
        gdrive_client.list_files(
            parent_folder_id=songs_folder_id,
            mime_types=["application/pdf"],
            include_trashed=False,
        )
    )

    print(f"📊 Found {len(files)} song files")

    # Filter to only songs with difficulty ratings
    songs_with_difficulty = []
    songs_without_difficulty = []

    for file in files:
        difficulty_value = file.properties.get("difficulty", "")
        if difficulty_value and difficulty_value.strip():
            try:
                float(difficulty_value)
                songs_with_difficulty.append(file)
            except (ValueError, TypeError):
                songs_without_difficulty.append(file)
        else:
            songs_without_difficulty.append(file)

    print(f"   Songs with difficulty: {len(songs_with_difficulty)}")
    print(f"   Songs without difficulty: {len(songs_without_difficulty)}")

    # Show some examples
    if len(songs_with_difficulty) > 0:
        print("\n🎵 Sample songs with difficulties:")
        for i, song in enumerate(songs_with_difficulty[:5]):
            diff = song.properties.get("difficulty", "N/A")
            print(f"   {i + 1}. {song.name} (difficulty: {diff})")

    return files, songs_with_difficulty


# Fetch the real song data
all_songs, songs_with_difficulty = fetch_real_songs(gdrive, settings)

## Helper Functions for Real Data Analysis

In [None]:
def analyze_binning(files: List[File], num_bins: int = 5) -> pd.DataFrame:
    """Apply difficulty binning and return a DataFrame with the results."""
    # Make a copy to avoid modifying the originals
    files_copy = [
        File(
            id=f.id,
            name=f.name,
            properties=f.properties.copy(),
            mimeType=f.mimeType,
            parents=f.parents.copy(),
        )
        for f in files
    ]

    # Apply the binning algorithm
    assign_difficulty_bins(files_copy, num_bins=num_bins)

    # Extract data for analysis
    data = []
    for f in files_copy:
        try:
            difficulty = float(f.properties.get("difficulty", -1))
        except (ValueError, TypeError):
            difficulty = -1

        bin_assigned = int(f.properties.get("difficulty_bin", 0))

        data.append(
            {
                "name": f.name,
                "difficulty": difficulty,
                "bin": bin_assigned,
                "has_difficulty": difficulty != -1,
            }
        )

    return pd.DataFrame(data)


def visualize_binning(df: pd.DataFrame, title: str):
    """Create visualizations for difficulty binning results."""
    # Only analyze songs with difficulty values
    valid_df = df[df["has_difficulty"]].copy()

    # Check before creating the figure so an empty input leaves no blank plot
    if len(valid_df) == 0:
        print(f"⚠️ No songs with difficulty values found in {title}")
        return

    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Plot 1: Difficulty distribution by bin
    for bin_num in sorted(valid_df["bin"].unique()):
        bin_data = valid_df[valid_df["bin"] == bin_num]["difficulty"]
        axes[0].hist(bin_data, alpha=0.7, label=f"Bin {bin_num}", bins=20)

    axes[0].set_xlabel("Difficulty")
    axes[0].set_ylabel("Number of Songs")
    axes[0].set_title("Difficulty Distribution by Bin")
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Plot 2: Box plot of difficulties by bin
    sns.boxplot(data=valid_df, x="bin", y="difficulty", ax=axes[1])
    axes[1].set_title("Difficulty Ranges by Bin")
    axes[1].set_xlabel("Bin")
    axes[1].set_ylabel("Difficulty")
    axes[1].grid(True, alpha=0.3)

    # Plot 3: Scatter plot with bin colors
    scatter = axes[2].scatter(
        range(len(valid_df)),
        valid_df["difficulty"],
        c=valid_df["bin"],
        cmap="viridis",
        alpha=0.7,
    )
    axes[2].set_xlabel("Song Index")
    axes[2].set_ylabel("Difficulty")
    axes[2].set_title("Songs by Difficulty and Bin Assignment")
    axes[2].grid(True, alpha=0.3)
    plt.colorbar(scatter, ax=axes[2], label="Bin")

    plt.suptitle(title, fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()

    # Print summary statistics
    print(f"\n📊 Summary for {title}:")
    print(f"   Total songs: {len(df)}")
    print(f"   Songs with difficulty: {len(valid_df)}")
    print(
        f"   Difficulty range: {valid_df['difficulty'].min():.2f} - {valid_df['difficulty'].max():.2f}"
    )
    print(f"   Bins used: {sorted(valid_df['bin'].unique())}")

    # Show bin distribution
    bin_counts = valid_df["bin"].value_counts().sort_index()
    print("\n   Songs per bin:")
    for bin_num in sorted(bin_counts.index):
        count = bin_counts[bin_num]
        avg_diff = valid_df[valid_df["bin"] == bin_num]["difficulty"].mean()
        print(f"     Bin {bin_num}: {count:2d} songs (avg difficulty: {avg_diff:.2f})")


def compare_binning_approaches(real_files: List[File], num_bins: int = 5):
    """Compare current relative binning with absolute binning approaches."""
    # Current approach
    current_df = analyze_binning(real_files, num_bins)

    # Create visualizations showing the comparison
    valid_songs = current_df[current_df["has_difficulty"]].copy()

    if len(valid_songs) == 0:
        print("⚠️ No songs with difficulty values found for comparison")
        return

    # Simulate absolute binning with fixed ranges
    def absolute_binning(difficulty):
        if difficulty < 1.5:
            return 1
        elif difficulty < 2.5:
            return 2
        elif difficulty < 3.5:
            return 3
        elif difficulty < 4.5:
            return 4
        else:
            return 5

    valid_songs["absolute_bin"] = valid_songs["difficulty"].apply(absolute_binning)

    # Create comparison visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))

    # Current relative binning
    for bin_num in sorted(valid_songs["bin"].unique()):
        bin_data = valid_songs[valid_songs["bin"] == bin_num]["difficulty"]
        axes[0, 0].hist(bin_data, alpha=0.7, label=f"Bin {bin_num}", bins=20)
    axes[0, 0].set_title("Current Relative Binning")
    axes[0, 0].set_xlabel("Difficulty")
    axes[0, 0].set_ylabel("Number of Songs")
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Absolute binning simulation
    for bin_num in sorted(valid_songs["absolute_bin"].unique()):
        bin_data = valid_songs[valid_songs["absolute_bin"] == bin_num]["difficulty"]
        axes[0, 1].hist(bin_data, alpha=0.7, label=f"Bin {bin_num}", bins=20)
    axes[0, 1].set_title("Simulated Absolute Binning")
    axes[0, 1].set_xlabel("Difficulty")
    axes[0, 1].set_ylabel("Number of Songs")
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Comparison scatter plots
    axes[1, 0].scatter(
        valid_songs["difficulty"], valid_songs["bin"], alpha=0.6, c="blue"
    )
    axes[1, 0].set_title("Current: Difficulty vs Bin Assignment")
    axes[1, 0].set_xlabel("Difficulty")
    axes[1, 0].set_ylabel("Assigned Bin")
    axes[1, 0].grid(True, alpha=0.3)

    axes[1, 1].scatter(
        valid_songs["difficulty"], valid_songs["absolute_bin"], alpha=0.6, c="red"
    )
    axes[1, 1].set_title("Absolute: Difficulty vs Bin Assignment")
    axes[1, 1].set_xlabel("Difficulty")
    axes[1, 1].set_ylabel("Assigned Bin")
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Show songs that would change bins
    changes = valid_songs[valid_songs["bin"] != valid_songs["absolute_bin"]].copy()
    print(
        f"\n🔄 Songs that would change bins: {len(changes)}/{len(valid_songs)} ({len(changes) / len(valid_songs) * 100:.1f}%)"
    )

    if len(changes) > 0:
        print("\nExamples of songs that would change bins:")
        sample_changes = changes.head(10)[["name", "difficulty", "bin", "absolute_bin"]]
        sample_changes.columns = [
            "Song Name",
            "Difficulty",
            "Current Bin",
            "Absolute Bin",
        ]
        print(sample_changes.to_string(index=False))


print("✅ Helper functions defined successfully")

## Analyze Current Binning with Real Data

In [None]:
# Analyze the full collection
print("🔍 Analyzing current binning algorithm with all real songs")
print("=" * 70)

if len(songs_with_difficulty) > 0:
    full_collection_df = analyze_binning(songs_with_difficulty)
    visualize_binning(full_collection_df, "Full Real Song Collection")
else:
    print("⚠️ No songs with difficulty values found. Cannot perform analysis.")

## Interactive Analysis with Real Data Subsets

In [None]:
# Interactive widget to explore different subsets of real songs
def interactive_real_data_analysis(
    min_difficulty=1.0, max_difficulty=5.0, max_songs=50
):
    """Interactive analysis using real song data with filtering."""
    if len(songs_with_difficulty) == 0:
        print("⚠️ No songs with difficulty values available for analysis")
        return

    # Filter songs by difficulty range
    filtered_songs = []
    for song in songs_with_difficulty:
        try:
            diff = float(song.properties.get("difficulty", 0))
            if min_difficulty <= diff <= max_difficulty:
                filtered_songs.append(song)
        except (ValueError, TypeError):
            continue

    # Limit number of songs if requested
    if max_songs and len(filtered_songs) > max_songs:
        # Sort by difficulty for consistent selection
        filtered_songs = sorted(
            filtered_songs, key=lambda s: float(s.properties.get("difficulty", 0))
        )
        filtered_songs = filtered_songs[:max_songs]

    print(
        f"📊 Analyzing {len(filtered_songs)} songs with difficulty {min_difficulty}-{max_difficulty}"
    )

    if len(filtered_songs) > 0:
        subset_df = analyze_binning(filtered_songs)
        title = f"Real Songs: Difficulty {min_difficulty}-{max_difficulty} (n={len(filtered_songs)})"
        visualize_binning(subset_df, title)
    else:
        print("⚠️ No songs found in the specified difficulty range")


# Create interactive widget
if len(songs_with_difficulty) > 0:
    # Get actual difficulty range from the data
    all_difficulties = []
    for song in songs_with_difficulty:
        try:
            diff = float(song.properties.get("difficulty", 0))
            all_difficulties.append(diff)
        except (ValueError, TypeError):
            continue

    if all_difficulties:
        min_real_diff = min(all_difficulties)
        max_real_diff = max(all_difficulties)
        print(
            f"📈 Real difficulty range in data: {min_real_diff:.2f} - {max_real_diff:.2f}"
        )

        interactive_plot = interactive(
            interactive_real_data_analysis,
            min_difficulty=widgets.FloatSlider(
                value=min_real_diff,
                min=min_real_diff,
                max=max_real_diff,
                step=0.1,
                description="Min Difficulty:",
            ),
            max_difficulty=widgets.FloatSlider(
                value=max_real_diff,
                min=min_real_diff,
                max=max_real_diff,
                step=0.1,
                description="Max Difficulty:",
            ),
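            max_songs=widgets.IntSlider(
                # Clamp so the slider stays valid with fewer than 50 (or 10) songs
                value=min(50, max(10, len(songs_with_difficulty))),
                min=10,
                max=max(10, len(songs_with_difficulty)),
                step=10,
                description="Max Songs:",
            ),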
        )
        display(interactive_plot)
    else:
        print("⚠️ No valid difficulty values found in real data")
else:
    print("⚠️ No songs with difficulty values available for interactive analysis")

## Demonstrate Inconsistency with Real Data

In [None]:
# Demonstrate how the same real songs get different bins in different contexts
print("🎯 Demonstration: Same Real Songs, Different Contexts")
print("=" * 60)

if len(songs_with_difficulty) >= 20:
    # Sort songs by difficulty
    sorted_songs = sorted(
        songs_with_difficulty, key=lambda s: float(s.properties.get("difficulty", 0))
    )

    # Create two different contexts with the same core songs
    # Context 1: Easy to medium songs (lower half of difficulty range)
    mid_point = len(sorted_songs) // 2
    easy_context = sorted_songs[: mid_point + 5]  # Include some overlap

    # Context 2: Medium to hard songs (upper half of difficulty range)
    hard_context = sorted_songs[mid_point - 5 :]  # Include some overlap

    print(f"Context 1 (Easy-Medium): {len(easy_context)} songs")
    easy_df = analyze_binning(easy_context)
    visualize_binning(easy_df, "Context 1: Easy-Medium Songs")

    print(f"\nContext 2 (Medium-Hard): {len(hard_context)} songs")
    hard_df = analyze_binning(hard_context)
    visualize_binning(hard_df, "Context 2: Medium-Hard Songs")

    # Find overlapping songs and show how their bins changed
    easy_song_ids = {f.id for f in easy_context}
    hard_song_ids = {f.id for f in hard_context}
    overlap_ids = easy_song_ids & hard_song_ids

    if overlap_ids:
        print(
            f"\n🔄 {len(overlap_ids)} overlapping songs; showing those whose bin differs between contexts:"
        )

        easy_lookup = {
            f.id: (f.name, easy_df[easy_df.name == f.name])
            for f in easy_context
            if f.id in overlap_ids
        }
        hard_lookup = {
            f.id: (f.name, hard_df[hard_df.name == f.name])
            for f in hard_context
            if f.id in overlap_ids
        }

        for song_id in list(overlap_ids)[:10]:  # Show first 10 examples
            if song_id in easy_lookup and song_id in hard_lookup:
                easy_name, easy_row = easy_lookup[song_id]
                hard_name, hard_row = hard_lookup[song_id]

                if not easy_row.empty and not hard_row.empty:
                    easy_bin = easy_row.iloc[0]["bin"]
                    hard_bin = hard_row.iloc[0]["bin"]
                    difficulty = easy_row.iloc[0]["difficulty"]

                    if easy_bin != hard_bin:
                        print(
                            f"   '{easy_name[:50]}' (difficulty {difficulty:.2f}): Bin {easy_bin} → Bin {hard_bin}"
                        )
else:
    print(
        "⚠️ Need at least 20 songs with difficulty values to demonstrate context changes"
    )

## Compare Relative vs Absolute Binning

In [None]:
# Compare current relative binning with absolute binning using real data
print("⚖️ Comparison: Relative vs Absolute Binning")
print("=" * 50)

if len(songs_with_difficulty) > 0:
    compare_binning_approaches(songs_with_difficulty)
else:
    print("⚠️ No songs with difficulty values available for comparison")

## Edge Cases and Real Data Insights

In [None]:
# Analyze edge cases in the real data
print("⚠️ Edge Cases and Insights from Real Data")
print("=" * 50)

if len(songs_with_difficulty) > 0:
    # Get all difficulties
    real_difficulties = []
    for song in songs_with_difficulty:
        try:
            diff = float(song.properties.get("difficulty", 0))
            real_difficulties.append(diff)
        except (ValueError, TypeError):
            continue

    print("📊 Real data statistics:")
    print(f"   Number of songs: {len(real_difficulties)}")
    print(
        f"   Difficulty range: {min(real_difficulties):.2f} - {max(real_difficulties):.2f}"
    )
    print(f"   Mean difficulty: {np.mean(real_difficulties):.2f}")
    print(f"   Standard deviation: {np.std(real_difficulties):.2f}")

    # Test edge case: What if we only had very easy songs?
    very_easy_songs = [
        s
        for s in songs_with_difficulty
        if float(s.properties.get("difficulty", 0)) <= 2.0
    ]

    if len(very_easy_songs) >= 5:
        print("\n🔸 Edge Case: Only Very Easy Songs (≤2.0 difficulty)")
        easy_df = analyze_binning(very_easy_songs)
        visualize_binning(easy_df, "Edge Case: Only Very Easy Songs")

    # Test edge case: What if we only had very hard songs?
    very_hard_songs = [
        s
        for s in songs_with_difficulty
        if float(s.properties.get("difficulty", 0)) >= 4.0
    ]

    if len(very_hard_songs) >= 5:
        print("\n🔸 Edge Case: Only Very Hard Songs (≥4.0 difficulty)")
        hard_df = analyze_binning(very_hard_songs)
        visualize_binning(hard_df, "Edge Case: Only Very Hard Songs")

    # Test what happens with outliers
    outlier_threshold = np.mean(real_difficulties) + 2 * np.std(real_difficulties)
    outlier_songs = [
        s
        for s in songs_with_difficulty
        if float(s.properties.get("difficulty", 0)) >= outlier_threshold
    ]

    if len(outlier_songs) > 0:
        print(
            f"\n🔸 Outlier Analysis: {len(outlier_songs)} songs with difficulty ≥{outlier_threshold:.2f}"
        )
        print("   With the maximum fixed at 5.0, hard outliers should not move other songs' bins; the plots below check this")

        # Compare with and without outliers
        songs_without_outliers = [
            s
            for s in songs_with_difficulty
            if float(s.properties.get("difficulty", 0)) < outlier_threshold
        ]

        if len(songs_without_outliers) >= 10:
            print("\n   Without outliers:")
            no_outlier_df = analyze_binning(songs_without_outliers)
            visualize_binning(no_outlier_df, "Real Data Without Outliers")

            print("\n   With outliers:")
            with_outlier_df = analyze_binning(songs_with_difficulty)
            visualize_binning(with_outlier_df, "Real Data With Outliers")

else:
    print("⚠️ No songs with difficulty values available for edge case analysis")

## Key Findings and Recommendations

Based on the analysis of real song data from Google Drive, this notebook demonstrates several critical issues with the current difficulty binning algorithm:

### 🚨 Key Issues Identified

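1. **Inconsistent binning**: The same song gets different bin assignments depending on what other songs are in the collection
2. **Sensitivity to the easiest song**: The minimum is the only data-driven bound (the maximum is fixed at 5.0), so adding or removing a single very easy song shifts the assignments of every other song; hard songs at or below 5.0 do not move the bin edges
3. **Over-amplification**: A narrow selection near the top of the scale (for example 4.0-5.0) is stretched across the full bin spread, while an equally narrow selection of easy songs is compressed into the lowest bins
4. **Context dependency**: A song's bin cannot be predicted from its difficulty value alone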
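
Issue 3 can be checked by hand from the normalization in the overview. The sketch below applies the same formula to two equally narrow selections. `normalize_and_bin` is a hypothetical stand-in for the real function, and clipping the top edge into the last bin is an assumption.

```python
import numpy as np


def normalize_and_bin(difficulties, num_bins=5, max_difficulty=5.0):
    """Hypothetical stand-in: (d - min) / (5.0 - min), then equal-width bins.

    Assumes at least one song is below max_difficulty (no zero-width range).
    """
    d = np.asarray(difficulties, dtype=float)
    normalized = (d - d.min()) / (max_difficulty - d.min())
    edges = np.linspace(0, 1, num_bins + 1)
    # Assumption: clip so a normalized value of exactly 1.0 stays in the last bin.
    return np.clip(np.digitize(normalized, edges), 1, num_bins)


hard_only = [4.0, 4.25, 4.5, 4.75]  # spans 0.75 difficulty points
easy_only = [1.0, 1.25, 1.5, 1.75]  # same width, low end of the scale

print(normalize_and_bin(hard_only))  # spread over bins 1 to 4
print(normalize_and_bin(easy_only))  # all in bin 1
```

The hard-only selection spreads over four bins while the easy-only selection collapses into one, so how much a narrow range is amplified depends on how close its minimum is to 5.0.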

### 📋 Recommendations

1. **Implement absolute binning** with fixed difficulty ranges (e.g., below 1.5 = Bin 1, 1.5 up to 2.5 = Bin 2, and so on, as simulated in `compare_binning_approaches`)
2. **Consider percentile-based binning** using historical data to ensure consistent distribution
3. **Add configuration options** for binning method selection
4. **Expand test coverage** for the identified edge cases
5. **Monitor binning consistency** in production to detect when algorithms produce unexpected results
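
Recommendations 1 and 2 could look roughly like the sketch below. The fixed edges mirror the thresholds used by `absolute_binning` in this notebook. The names `bins_from_edges` and `percentile_edges` and the reference collection are hypothetical and use made-up values for illustration; none of them are part of the generator.

```python
import numpy as np

# Fixed thresholds, matching absolute_binning() earlier in this notebook.
ABSOLUTE_EDGES = np.array([1.5, 2.5, 3.5, 4.5])


def bins_from_edges(difficulties, edges):
    """Map difficulties to bins 1..len(edges)+1 using fixed, precomputed edges."""
    return np.digitize(np.asarray(difficulties, dtype=float), edges) + 1


def percentile_edges(reference_difficulties, num_bins=5):
    """Derive edges once from a reference collection (assumed historical data)."""
    quantiles = np.linspace(0, 1, num_bins + 1)[1:-1]  # interior quantiles only
    return np.quantile(np.asarray(reference_difficulties, dtype=float), quantiles)


# Absolute binning: a song's bin depends only on its own difficulty.
print(bins_from_edges([1.0, 3.0, 4.0], ABSOLUTE_EDGES))
print(bins_from_edges([3.0, 4.0], ABSOLUTE_EDGES))

# Percentile binning: edges come from a made-up reference set, then stay fixed.
reference = np.linspace(1.0, 5.0, 9)  # illustrative values only
edges = percentile_edges(reference)
print(bins_from_edges([1.0, 3.0, 4.0], edges))
print(bins_from_edges([3.0, 4.0], edges))
```

In both variants the 3.0 and 4.0 songs keep the same bin whether or not the 1.0 song is in the selection, because the edges no longer depend on the selection.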

The analysis with real data provides concrete evidence that the current relative approach creates unpredictable user experiences and should be replaced with a more stable algorithm.
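
One way to quantify that instability, and to monitor it as recommendation 5 suggests, is the share of songs whose bin differs between two runs. The helper below is a hypothetical sketch that assumes bin assignments are available as plain `{song_id: bin}` dictionaries; the song ids and bins are made up.

```python
def bin_change_rate(bins_a, bins_b):
    """Fraction of songs present in both runs whose bin differs (hypothetical helper)."""
    shared = bins_a.keys() & bins_b.keys()
    if not shared:
        return 0.0
    changed = sum(1 for song_id in shared if bins_a[song_id] != bins_b[song_id])
    return changed / len(shared)


# Made-up assignments for two overlapping selections.
full_run = {"song-a": 1, "song-b": 3, "song-c": 4}
subset_run = {"song-b": 1, "song-c": 4, "song-d": 5}

print(bin_change_rate(full_run, subset_run))
```

A rate near zero across typical selections would indicate a stable scheme; a high rate points to the context dependence described above.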
