# Difficulty Binning Analysis

This notebook explores the current difficulty binning algorithm used in the songbook generator and visualizes how it behaves, first with controlled sample data and then, optionally, with real song data from Google Drive.

## Current Algorithm Overview

The current binning logic (from `generator/worker/difficulty.py`) uses a **relative approach**:

1. Find the minimum difficulty value among all songs
2. Use a hardcoded maximum of 5.0
3. Normalize all difficulties to a 0-1 range: `(difficulty - min) / (5.0 - min)`
4. Digitize into bins using `np.linspace(0, 1, num_bins + 1)`

Because the minimum comes from the selection itself, bin assignments are **relative to the current song selection**: the same song can be assigned different bins depending on which other songs are included.
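
The four steps above can be sketched in plain Python. This is a minimal illustration, not the code in `generator/worker/difficulty.py`: the function name `relative_bins`, the clamping of normalized values to the 0-1 range, and the fallback scaler of 1.0 are assumptions inferred from the tests and edge cases later in this notebook. `bisect_right` stands in for `np.digitize`, which returns the same index for increasing bin edges.

```python
from bisect import bisect_right
from typing import List

HARDCODED_MAX = 5.0  # step 2 of the description above


def relative_bins(difficulties: List[float], num_bins: int = 5) -> List[int]:
    """Sketch of the relative binning steps (hypothetical helper)."""
    min_diff = min(difficulties)  # step 1: minimum of the current selection
    scaler = HARDCODED_MAX - min_diff
    if scaler <= 0:
        scaler = 1.0  # assumed fallback when every difficulty is >= 5.0

    # Equal-width edges on [0, 1], like np.linspace(0, 1, num_bins + 1)
    edges = [i / num_bins for i in range(num_bins + 1)]

    bins = []
    for d in difficulties:
        normalized = (d - min_diff) / scaler  # step 3
        normalized = min(max(normalized, 0.0), 1.0)  # assumed clamp
        # step 4: 1-based bin index, capped so that 1.0 stays in the top bin
        bins.append(min(bisect_right(edges, normalized), num_bins))
    return bins
```

Under these assumptions, `relative_bins([1.0, 2.5])` puts the song rated 2.5 in bin 2, while `relative_bins([2.0, 2.5])` puts the same rating in bin 1.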

## Goals

- Visualize how the current algorithm distributes real songs into bins
- Explore how bin assignments change with different song selections
- Compare the relative approach with possible absolute schemes
- Identify inconsistencies and edge cases


In [None]:
# Standard library imports
import sys
import warnings
from pathlib import Path
from typing import List, Optional

# Data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
import ipywidgets as widgets
from ipywidgets import interactive

# Add the project root to the Python path so we can import from generator
project_root = Path().absolute().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

# Set up matplotlib for better plots
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["font.size"] = 10
sns.set_style("whitegrid")
sns.set_palette("husl")

print("✅ Imports completed successfully")

In [None]:
# Import project modules
from generator.worker.difficulty import assign_difficulty_bins
from generator.worker.models import File
# These imports are for the real data section (commented out by default)
# from generator.common.gdrive import GoogleDriveClient
# from generator.common.caching import init_cache
# from generator.common.config import get_settings
# from generator.worker.pdf import init_services

print("✅ Project modules imported successfully")

## Helper Functions for Analysis

In [None]:
def create_sample_files(
    difficulties: List[float], names: Optional[List[str]] = None
) -> List[File]:
    """Create a list of File objects with specified difficulty values for testing."""
    if names is None:
        names = [f"Song {i + 1}" for i in range(len(difficulties))]

    files = []
    for i, (diff, name) in enumerate(zip(difficulties, names)):
        properties = {"difficulty": str(diff)} if diff != -1 else {}
        files.append(File(id=f"file_{i}", name=name, properties=properties))
    return files


def analyze_binning(files: List[File], num_bins: int = 5) -> pd.DataFrame:
    """Apply difficulty binning and return a DataFrame with the results."""
    # Make a copy to avoid modifying the originals
    files_copy = [
        File(
            id=f.id,
            name=f.name,
            properties=f.properties.copy(),
            mimeType=f.mimeType,
            parents=f.parents.copy(),
        )
        for f in files
    ]

    # Apply the binning algorithm
    assign_difficulty_bins(files_copy, num_bins=num_bins)

    # Extract data for analysis
    data = []
    for f in files_copy:
        try:
            difficulty = float(f.properties.get("difficulty", -1))
        except (ValueError, TypeError):
            difficulty = -1

        bin_assigned = int(f.properties.get("difficulty_bin", 0))

        data.append(
            {
                "name": f.name,
                "difficulty": difficulty,
                "bin": bin_assigned,
                "has_difficulty": difficulty != -1,
            }
        )

    return pd.DataFrame(data)


def plot_binning_analysis(df: pd.DataFrame, title: str = "Difficulty Binning Analysis"):
    """Create visualizations of the binning results."""
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle(title, fontsize=16)

    # Filter out songs without difficulty values for most plots
    df_with_diff = df[df["has_difficulty"]]

    # 1. Scatter plot: Difficulty vs Bin Assignment
    if len(df_with_diff) > 0:
        axes[0, 0].scatter(
            df_with_diff["difficulty"],
            df_with_diff["bin"],
            alpha=0.7,
            s=50,
            color="steelblue",
        )
        axes[0, 0].set_xlabel("Original Difficulty Value")
        axes[0, 0].set_ylabel("Assigned Bin")
        axes[0, 0].set_title("Difficulty vs Assigned Bin")
        axes[0, 0].grid(True, alpha=0.3)

        # Add min/max lines
        min_diff = df_with_diff["difficulty"].min()
        axes[0, 0].axvline(
            x=min_diff, color="red", linestyle="--", alpha=0.7, label=f"Min: {min_diff}"
        )
        axes[0, 0].axvline(
            x=5.0, color="orange", linestyle="--", alpha=0.7, label="Hardcoded Max: 5.0"
        )
        axes[0, 0].legend()
    else:
        axes[0, 0].text(
            0.5,
            0.5,
            "No songs with difficulty values",
            ha="center",
            va="center",
            transform=axes[0, 0].transAxes,
        )
        axes[0, 0].set_title("Difficulty vs Assigned Bin")

    # 2. Histogram: Distribution of Difficulty Values
    if len(df_with_diff) > 0:
        axes[0, 1].hist(
            df_with_diff["difficulty"],
            bins=20,
            alpha=0.7,
            color="lightcoral",
            edgecolor="black",
        )
        axes[0, 1].set_xlabel("Difficulty Value")
        axes[0, 1].set_ylabel("Count")
        axes[0, 1].set_title(
            f"Distribution of Difficulty Values (n={len(df_with_diff)})"
        )
        axes[0, 1].axvline(
            x=df_with_diff["difficulty"].mean(),
            color="red",
            linestyle="--",
            alpha=0.7,
            label=f"Mean: {df_with_diff['difficulty'].mean():.2f}",
        )
        axes[0, 1].legend()
    else:
        axes[0, 1].text(
            0.5,
            0.5,
            "No difficulty values to plot",
            ha="center",
            va="center",
            transform=axes[0, 1].transAxes,
        )
        axes[0, 1].set_title("Distribution of Difficulty Values")

    # 3. Bar plot: Songs per Bin
    bin_counts = df["bin"].value_counts().sort_index()
    axes[1, 0].bar(
        bin_counts.index,
        bin_counts.values,
        alpha=0.7,
        color="lightgreen",
        edgecolor="black",
    )
    axes[1, 0].set_xlabel("Assigned Bin")
    axes[1, 0].set_ylabel("Number of Songs")
    axes[1, 0].set_title("Distribution of Songs Across Bins")
    axes[1, 0].set_xticks(range(max(bin_counts.index) + 1))

    # Add count labels on bars
    for i, count in enumerate(bin_counts.values):
        axes[1, 0].text(
            bin_counts.index[i], count + 0.1, str(count), ha="center", va="bottom"
        )

    # 4. Box plot: Difficulty Distribution by Bin
    if len(df_with_diff) > 0 and len(df_with_diff["bin"].unique()) > 1:
        df_with_diff.boxplot(column="difficulty", by="bin", ax=axes[1, 1])
        axes[1, 1].set_xlabel("Assigned Bin")
        axes[1, 1].set_ylabel("Difficulty Value")
        axes[1, 1].set_title("Difficulty Range by Bin")
        # pandas' boxplot(by=...) overwrites the figure suptitle; restore ours
        fig.suptitle(title, fontsize=16)
    else:
        axes[1, 1].text(
            0.5,
            0.5,
            "Insufficient data for box plot",
            ha="center",
            va="center",
            transform=axes[1, 1].transAxes,
        )
        axes[1, 1].set_title("Difficulty Range by Bin")

    plt.tight_layout()
    plt.show()


def compare_binning_schemes(
    difficulties: List[float], names: Optional[List[str]] = None, num_bins: int = 5
):
    """Compare the current relative binning with an absolute binning scheme."""
    if names is None:
        names = [f"Song {i + 1}" for i in range(len(difficulties))]

    # Create files and apply current algorithm
    files = create_sample_files(difficulties, names)
    df_relative = analyze_binning(files, num_bins)

    # Create absolute binning (fixed ranges)
    df_absolute = df_relative.copy()

    # Define absolute bin ranges (below 2 = bin 1, 2-3 = bin 2, ..., 5 and above = bin 5)
    def absolute_bin(difficulty):
        if difficulty == -1:
            return 0  # No difficulty value
        elif difficulty < 2.0:
            return 1
        elif difficulty < 3.0:
            return 2
        elif difficulty < 4.0:
            return 3
        elif difficulty < 5.0:
            return 4
        else:
            return 5

    df_absolute["bin_absolute"] = df_absolute["difficulty"].apply(absolute_bin)

    # Create comparison plot
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # Plot 1: Current (Relative) Binning
    valid_data = df_relative[df_relative["has_difficulty"]]
    if len(valid_data) > 0:
        axes[0].scatter(
            valid_data["difficulty"],
            valid_data["bin"],
            alpha=0.7,
            s=50,
            color="steelblue",
        )
        axes[0].set_xlabel("Difficulty Value")
        axes[0].set_ylabel("Assigned Bin")
        axes[0].set_title("Current (Relative) Binning")
        axes[0].grid(True, alpha=0.3)

        min_diff = valid_data["difficulty"].min()
        axes[0].axvline(
            x=min_diff, color="red", linestyle="--", alpha=0.7, label=f"Min: {min_diff}"
        )
        axes[0].axvline(
            x=5.0, color="orange", linestyle="--", alpha=0.7, label="Max: 5.0"
        )
        axes[0].legend()

    # Plot 2: Absolute Binning
    valid_data_abs = df_absolute[df_absolute["has_difficulty"]]
    if len(valid_data_abs) > 0:
        axes[1].scatter(
            valid_data_abs["difficulty"],
            valid_data_abs["bin_absolute"],
            alpha=0.7,
            s=50,
            color="green",
        )
        axes[1].set_xlabel("Difficulty Value")
        axes[1].set_ylabel("Assigned Bin")
        axes[1].set_title("Absolute Binning (Fixed Ranges)")
        axes[1].grid(True, alpha=0.3)

        # Add bin boundary lines
        for boundary in [1, 2, 3, 4, 5]:
            axes[1].axvline(x=boundary, color="gray", linestyle=":", alpha=0.5)

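    # Plot 3: Comparison of bin distributions
    bin_counts_rel = df_relative["bin"].value_counts().sort_index()
    bin_counts_abs = df_absolute["bin_absolute"].value_counts().sort_index()

    # Align both counts on the union of bins so a bin used by only one scheme is still plotted
    all_bins = bin_counts_rel.index.union(bin_counts_abs.index)
    bin_counts_rel = bin_counts_rel.reindex(all_bins, fill_value=0)
    bin_counts_abs = bin_counts_abs.reindex(all_bins, fill_value=0)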

    x = np.arange(len(bin_counts_rel.index))
    width = 0.35

    axes[2].bar(
        x - width / 2,
        bin_counts_rel.values,
        width,
        label="Relative",
        alpha=0.7,
        color="steelblue",
    )
    axes[2].bar(
        x + width / 2,
        [bin_counts_abs.get(i, 0) for i in bin_counts_rel.index],
        width,
        label="Absolute",
        alpha=0.7,
        color="green",
    )

    axes[2].set_xlabel("Bin")
    axes[2].set_ylabel("Number of Songs")
    axes[2].set_title("Bin Distribution Comparison")
    axes[2].set_xticks(x)
    axes[2].set_xticklabels(bin_counts_rel.index)
    axes[2].legend()

    plt.tight_layout()
    plt.show()

    return df_relative, df_absolute


print("✅ Helper functions defined")

## Test with Sample Data

Let's start by understanding the algorithm with some controlled sample data.

In [None]:
# Test 1: Simple linear distribution (from the unit tests)
print("📊 Test 1: Simple Linear Distribution")
print("=" * 50)

difficulties_linear = [1.0, 2.0, 3.0, 4.0, 5.0]
files_linear = create_sample_files(difficulties_linear)
df_linear = analyze_binning(files_linear)

print("Input difficulties:", difficulties_linear)
print("\nBinning results:")
print(df_linear[["name", "difficulty", "bin"]].to_string(index=False))

plot_binning_analysis(df_linear, "Test 1: Linear Distribution (1.0 to 5.0)")

In [None]:
# Test 2: Clamping behavior (values above 5.0)
print("📊 Test 2: Clamping Behavior")
print("=" * 50)

difficulties_clamping = [2.0, 3.5, 5.0, 6.0, 7.5]
files_clamping = create_sample_files(
    difficulties_clamping,
    ["Easy Song", "Medium Song", "Hard Song", "Very Hard Song", "Extreme Song"],
)
df_clamping = analyze_binning(files_clamping)

print("Input difficulties:", difficulties_clamping)
print("\nBinning results:")
print(df_clamping[["name", "difficulty", "bin"]].to_string(index=False))

plot_binning_analysis(df_clamping, "Test 2: Clamping Behavior (includes values > 5.0)")

In [None]:
# Test 3: Impact of missing values
print("📊 Test 3: Impact of Missing Values")
print("=" * 50)

difficulties_missing = [1.5, -1, 3.0, -1, 4.5]  # -1 represents missing values
files_missing = create_sample_files(
    difficulties_missing,
    ["Song A", "Song B (no diff)", "Song C", "Song D (no diff)", "Song E"],
)
df_missing = analyze_binning(files_missing)

print("Input difficulties:", difficulties_missing, "(-1 = missing)")
print("\nBinning results:")
print(df_missing[["name", "difficulty", "bin"]].to_string(index=False))

plot_binning_analysis(df_missing, "Test 3: Impact of Missing Difficulty Values")

## Interactive Exploration

Now let's create interactive widgets to explore how different song selections affect binning.

In [None]:
# Interactive function to demonstrate how song selection affects binning
def interactive_binning_demo(
    min_diff=1.0,
    max_diff=5.0,
    num_songs=10,
    num_bins=5,
    include_outliers=False,
    outlier_value=6.0,
):
    """Interactive demo showing how song selection affects relative binning."""

    # Generate sample difficulties
    base_difficulties = np.linspace(min_diff, max_diff, num_songs).tolist()

    if include_outliers:
        base_difficulties.extend([outlier_value, outlier_value + 0.5])

    # Create files and analyze
    files = create_sample_files(base_difficulties)
    df = analyze_binning(files, num_bins)

    print(
        f"📈 Generated {len(base_difficulties)} songs with difficulties from {min(base_difficulties):.1f} to {max(base_difficulties):.1f}"
    )
    print(f"🎯 Using {num_bins} bins")

    if include_outliers:
        print(
            f"⚠️  Included outliers at {outlier_value:.1f} and {outlier_value + 0.5:.1f}"
        )

    # Show binning statistics
    valid_songs = df[df["has_difficulty"]]
    if len(valid_songs) > 0:
        actual_min = valid_songs["difficulty"].min()
        actual_max = valid_songs["difficulty"].max()
        scaler = 5.0 - actual_min

        print("\n📊 Binning Algorithm Parameters:")
        print(f"   • Actual min difficulty: {actual_min:.2f}")
        print("   • Hardcoded max: 5.0")
        print(f"   • Scaler (5.0 - min): {scaler:.2f}")
        print(
            f"   • Normalization range: 0 to {(actual_max - actual_min) / scaler:.2f}"
        )

    plot_binning_analysis(df, f"Interactive Demo: {num_songs} songs, {num_bins} bins")

    return df


# Create the interactive widget
interactive_widget = interactive(
    interactive_binning_demo,
    min_diff=widgets.FloatSlider(
        value=1.0, min=0.5, max=3.0, step=0.1, description="Min Difficulty:"
    ),
    max_diff=widgets.FloatSlider(
        value=5.0, min=3.0, max=6.0, step=0.1, description="Max Difficulty:"
    ),
    num_songs=widgets.IntSlider(
        value=10, min=3, max=30, step=1, description="# of Songs:"
    ),
    num_bins=widgets.IntSlider(
        value=5, min=3, max=10, step=1, description="# of Bins:"
    ),
    include_outliers=widgets.Checkbox(value=False, description="Include Outliers"),
    outlier_value=widgets.FloatSlider(
        value=6.0, min=5.5, max=8.0, step=0.1, description="Outlier Value:"
    ),
)

display(interactive_widget)

## Demonstration: How Song Selection Affects Binning

This section demonstrates the key issue with the relative binning approach: **the same song gets different bin assignments depending on what other songs are included**.
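
As a hand-worked illustration using the normalization formula from the overview, take a song rated 2.5. The two minimums below, 1.0 and 2.5, are the minimums of the two collections defined in the next cell:

$$
\frac{2.5 - 1.0}{5.0 - 1.0} = 0.375
\qquad\text{versus}\qquad
\frac{2.5 - 2.5}{5.0 - 2.5} = 0
$$

With five equal-width bins of width 0.2, the value 0.375 falls in the second interval and 0 falls in the first, so the rating alone does not determine the bin.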

In [None]:
# Demonstrate how the same song gets different bins in different contexts
print("🎯 Demonstration: Same Songs, Different Contexts")
print("=" * 60)

# Define some reference songs
reference_songs = {
    "Wonderwall": 2.5,
    "Hotel California": 3.5,
    "Stairway to Heaven": 4.0,
}

# Scenario 1: Reference songs plus easy songs (collection minimum: 1.0)
easy_collection = {
    "Twinkle Twinkle": 1.0,
    "Mary Had a Little Lamb": 1.2,
    **reference_songs,  # Our reference songs
    "Happy Birthday": 1.5,
}

# Scenario 2: Reference songs plus very hard songs (collection minimum: 2.5)
mixed_collection = {
    **reference_songs,  # Same reference songs
    "Flight of the Bumblebee": 5.5,
    "Classical Gas": 5.8,
    "Eruption": 6.0,
    "YYZ": 6.2,
}

# Analyze both scenarios
print("\n📊 Scenario 1: Easy Collection")
difficulties_easy = list(easy_collection.values())
names_easy = list(easy_collection.keys())
files_easy = create_sample_files(difficulties_easy, names_easy)
df_easy = analyze_binning(files_easy)

print("\n📊 Scenario 2: Mixed Collection (with very hard songs)")
difficulties_mixed = list(mixed_collection.values())
names_mixed = list(mixed_collection.keys())
files_mixed = create_sample_files(difficulties_mixed, names_mixed)
df_mixed = analyze_binning(files_mixed)

# Compare the bin assignments for our reference songs
print("\n🔍 Bin Assignment Comparison for Reference Songs:")
print("-" * 60)
comparison_data = []
for song_name in reference_songs.keys():
    easy_bin = df_easy[df_easy["name"] == song_name]["bin"].iloc[0]
    mixed_bin = df_mixed[df_mixed["name"] == song_name]["bin"].iloc[0]
    difficulty = reference_songs[song_name]

    comparison_data.append(
        {
            "Song": song_name,
            "Difficulty": difficulty,
            "Easy Collection Bin": easy_bin,
            "Mixed Collection Bin": mixed_bin,
            "Difference": mixed_bin - easy_bin,
        }
    )

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

print("\n⚠️  Notice how the same songs get different bin assignments!")
print(
    "   This happens because the algorithm normalizes relative to the minimum difficulty in each collection."
)

# Visualize both scenarios
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle("Same Songs, Different Bin Assignments", fontsize=16)

# Plot scenario 1
valid_easy = df_easy[df_easy["has_difficulty"]]
axes[0].scatter(
    valid_easy["difficulty"], valid_easy["bin"], alpha=0.7, s=80, color="lightblue"
)
axes[0].set_xlabel("Difficulty Value")
axes[0].set_ylabel("Assigned Bin")
axes[0].set_title("Scenario 1: Easy Collection")
axes[0].grid(True, alpha=0.3)

# Highlight reference songs
for song in reference_songs.keys():
    song_data = valid_easy[valid_easy["name"] == song]
    if len(song_data) > 0:
        axes[0].scatter(
            song_data["difficulty"],
            song_data["bin"],
            s=150,
            color="red",
            alpha=0.8,
            marker="*",
            label=f"{song} (bin {song_data['bin'].iloc[0]})",
        )

axes[0].legend(bbox_to_anchor=(1.05, 1), loc="upper left")

# Plot scenario 2
valid_mixed = df_mixed[df_mixed["has_difficulty"]]
axes[1].scatter(
    valid_mixed["difficulty"], valid_mixed["bin"], alpha=0.7, s=80, color="lightgreen"
)
axes[1].set_xlabel("Difficulty Value")
axes[1].set_ylabel("Assigned Bin")
axes[1].set_title("Scenario 2: Mixed Collection")
axes[1].grid(True, alpha=0.3)

# Highlight reference songs
for song in reference_songs.keys():
    song_data = valid_mixed[valid_mixed["name"] == song]
    if len(song_data) > 0:
        axes[1].scatter(
            song_data["difficulty"],
            song_data["bin"],
            s=150,
            color="red",
            alpha=0.8,
            marker="*",
            label=f"{song} (bin {song_data['bin'].iloc[0]})",
        )

axes[1].legend(bbox_to_anchor=(1.05, 1), loc="upper left")

plt.tight_layout()
plt.show()

## Comparison: Relative vs Absolute Binning

Let's compare the current relative approach with a potential absolute binning scheme.
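
To make "absolute" concrete, here is a minimal sketch of a fixed-range lookup. The name `absolute_bin_sketch` and the `ABSOLUTE_EDGES` constant are hypothetical. The boundaries mirror the example ranges used by the helper function earlier in this notebook and are not a recommendation for final values.

```python
from bisect import bisect_right
from typing import Optional

# Hypothetical fixed boundaries: below 2 = bin 1, 2-3 = bin 2, ..., 5 and above = bin 5
ABSOLUTE_EDGES = [2.0, 3.0, 4.0, 5.0]


def absolute_bin_sketch(difficulty: Optional[float]) -> int:
    """Return a bin that depends only on the song's own difficulty."""
    if difficulty is None:
        return 0  # bin 0 is reserved for songs without a difficulty value
    # Count how many boundaries the difficulty has reached, then shift to 1-based
    return 1 + bisect_right(ABSOLUTE_EDGES, difficulty)
```

Because the boundaries never look at the rest of the collection, a song rated 2.5 gets bin 2 no matter which other songs are selected.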

In [None]:
# Compare binning schemes with a realistic dataset
print("🎯 Comparison: Relative vs Absolute Binning")
print("=" * 50)

# Create a realistic distribution of song difficulties
np.random.seed(42)  # For reproducible results
realistic_difficulties = (
    [1.0, 1.2, 1.5]  # Very easy songs
    + [2.0, 2.1, 2.3, 2.5, 2.7, 2.8]  # Easy songs
    + [3.0, 3.2, 3.4, 3.5, 3.6, 3.8, 3.9]  # Medium songs
    + [4.0, 4.1, 4.3, 4.5, 4.7]  # Hard songs
    + [5.0, 5.2, 5.5, 5.8, 6.0, 6.5]  # Very hard songs (including some > 5.0)
)

realistic_names = [f"Song {i + 1:02d}" for i in range(len(realistic_difficulties))]

print(f"📊 Analyzing {len(realistic_difficulties)} songs")
print(
    f"   Difficulty range: {min(realistic_difficulties):.1f} to {max(realistic_difficulties):.1f}"
)

df_relative, df_absolute = compare_binning_schemes(
    realistic_difficulties, realistic_names, num_bins=5
)

# Show detailed comparison
print("\n📈 Binning Statistics Comparison:")
print("-" * 50)

valid_data = df_relative[df_relative["has_difficulty"]]

print("RELATIVE BINNING:")
rel_counts = df_relative["bin"].value_counts().sort_index()
for bin_num, count in rel_counts.items():
    if bin_num > 0:  # Skip bin 0 (no difficulty)
        bin_difficulties = valid_data[valid_data["bin"] == bin_num]["difficulty"]
        if len(bin_difficulties) > 0:
            print(
                f"  Bin {bin_num}: {count:2d} songs, difficulty range {bin_difficulties.min():.1f}-{bin_difficulties.max():.1f}"
            )

print("\nABSOLUTE BINNING:")
abs_counts = df_absolute["bin_absolute"].value_counts().sort_index()
for bin_num, count in abs_counts.items():
    if bin_num > 0:  # Skip bin 0 (no difficulty)
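        # Filter df_absolute with its own mask; indexing valid_data with a mask
        # from another frame breaks once any song lacks a difficulty value
        bin_difficulties = df_absolute.loc[
            df_absolute["has_difficulty"] & (df_absolute["bin_absolute"] == bin_num),
            "difficulty",
        ]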
        if len(bin_difficulties) > 0:
            print(
                f"  Bin {bin_num}: {count:2d} songs, difficulty range {bin_difficulties.min():.1f}-{bin_difficulties.max():.1f}"
            )

# Show songs that change bins
print("\n🔄 Songs that change bins between schemes:")
print("-" * 50)
different_bins = df_relative[df_relative["bin"] != df_absolute["bin_absolute"]]
different_bins = different_bins[
    different_bins["has_difficulty"]
]  # Only show songs with difficulty values

if len(different_bins) > 0:
    for idx, row in different_bins.iterrows():
        print(
            f"  {row['name']:12s}: {row['difficulty']:4.1f} -> Relative: bin {row['bin']}, Absolute: bin {df_absolute.loc[idx, 'bin_absolute']}"
        )
else:
    print("  No differences found (surprising!)")

print(
    f"\n📊 {len(different_bins)} out of {len(valid_data)} songs ({len(different_bins) / len(valid_data) * 100:.1f}%) change bins"
)

## Edge Cases and Inconsistencies

Let's explore some edge cases that could cause problems with the current algorithm.

In [None]:
print("⚠️  Edge Cases and Inconsistencies")
print("=" * 50)

# Edge Case 1: All difficulties are >= 5.0
print("\n🔍 Edge Case 1: All difficulties >= 5.0 (scaler becomes <= 0)")
edge_case_1 = [5.0, 5.5, 6.0, 6.5, 7.0]
files_edge_1 = create_sample_files(edge_case_1)
df_edge_1 = analyze_binning(files_edge_1)

print(f"Input: {edge_case_1}")
print("Expected: 5.0 - min <= 0, so the scaler should fall back to 1.0")
print(df_edge_1[["name", "difficulty", "bin"]].to_string(index=False))

# Edge Case 2: Very small range of difficulties
print("\n🔍 Edge Case 2: Very small range of difficulties")
edge_case_2 = [2.0, 2.1, 2.15, 2.2, 2.25]
files_edge_2 = create_sample_files(edge_case_2)
df_edge_2 = analyze_binning(files_edge_2)

print(f"Input: {edge_case_2} (range: {max(edge_case_2) - min(edge_case_2):.2f})")
print(
    "Expected from the formula: dividing by (5.0 - min) = 3.0 keeps every normalized value below 0.1, so these songs should share the lowest bin"
)
print(df_edge_2[["name", "difficulty", "bin"]].to_string(index=False))

# Edge Case 3: Only one song with difficulty
print("\n🔍 Edge Case 3: Only one song with difficulty")
edge_case_3 = [-1, -1, 3.5, -1, -1]  # Only one song has difficulty
files_edge_3 = create_sample_files(
    edge_case_3, ["No diff 1", "No diff 2", "Only song", "No diff 3", "No diff 4"]
)
df_edge_3 = analyze_binning(files_edge_3)

print(f"Input: {edge_case_3} (-1 = no difficulty)")
print(
    "Result: The one song with difficulty is the minimum, so it normalizes to 0 whatever its value"
)
print(df_edge_3[["name", "difficulty", "bin"]].to_string(index=False))

# Visualize edge cases
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
fig.suptitle("Edge Cases in Difficulty Binning", fontsize=16)

# Plot edge case 1
valid_1 = df_edge_1[df_edge_1["has_difficulty"]]
axes[0].scatter(valid_1["difficulty"], valid_1["bin"], alpha=0.7, s=80, color="red")
axes[0].set_xlabel("Difficulty")
axes[0].set_ylabel("Bin")
axes[0].set_title("All Difficulties ≥ 5.0")
axes[0].grid(True, alpha=0.3)
axes[0].axvline(x=5.0, color="orange", linestyle="--", alpha=0.7, label="Hardcoded max")
axes[0].legend()

# Plot edge case 2
valid_2 = df_edge_2[df_edge_2["has_difficulty"]]
axes[1].scatter(valid_2["difficulty"], valid_2["bin"], alpha=0.7, s=80, color="orange")
axes[1].set_xlabel("Difficulty")
axes[1].set_ylabel("Bin")
axes[1].set_title("Very Small Range")
axes[1].grid(True, alpha=0.3)

# Plot edge case 3
valid_3 = df_edge_3[df_edge_3["has_difficulty"]]
if len(valid_3) > 0:
    axes[2].scatter(
        valid_3["difficulty"], valid_3["bin"], alpha=0.7, s=80, color="green"
    )
axes[2].set_xlabel("Difficulty")
axes[2].set_ylabel("Bin")
axes[2].set_title("Only One Song")
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary of issues found
print("\n❌ Issues Identified:")
print("-" * 30)
print("1. Same songs get different bins depending on the collection")
print(
    "2. Adding an easier song lowers the minimum and can shift bins for all existing songs"
)
print(
    "3. Narrow difficulty ranges can collapse into a single bin because the max is fixed at 5.0"
)
print("4. Edge cases with scaler <= 0 need special handling")
print("5. Single song collections become degenerate")
print("\n✅ Potential Solutions:")
print("-" * 30)
print("1. Use absolute binning with fixed ranges")
print("2. Use percentile-based binning instead of linear")
print("3. Add minimum range requirements")
print("4. Use historical data to set stable ranges")

## Fetch Real Data from Google Drive

⚠️ **Note**: This section requires valid Google Cloud credentials configured for the project. If you're running this notebook without proper credentials, you can skip this section and work with the sample data above.

If you have credentials set up, uncomment and run the cells below to analyze real song data from Google Drive.

In [None]:
# Uncomment this cell to fetch real data (requires Google Cloud credentials)

# def fetch_real_song_data():
#     """Fetch real song data from Google Drive."""
#     try:
#         settings = get_settings()
#         credential_config = settings.google_cloud.credentials.get("songbook-generator")
#
#         if not credential_config:
#             print("❌ No credential config found. Make sure you have proper authentication set up.")
#             return None
#
#         # Initialize services
#         drive, cache = init_services(
#             scopes=credential_config.scopes,
#             target_principal=credential_config.principal,
#         )
#
#         gdrive_client = GoogleDriveClient(cache=cache, drive=drive)
#
#         # Get song sheet folder IDs from settings
#         folder_ids = settings.song_sheets.folder_ids
#         print(f"📂 Fetching files from {len(folder_ids)} folder(s)...")
#
#         all_files = []
#         for folder_id in folder_ids:
#             files = gdrive_client.query_drive_files(folder_id)
#             all_files.extend(files)
#             print(f"   Found {len(files)} files in folder {folder_id}")
#
#         print(f"📊 Total files found: {len(all_files)}")
#
#         # Filter files that have difficulty values
#         files_with_difficulty = []
#         for f in all_files:
#             if 'difficulty' in f.properties:
#                 try:
#                     difficulty = float(f.properties['difficulty'])
#                     if difficulty > 0:  # Valid difficulty
#                         files_with_difficulty.append(f)
#                 except (ValueError, TypeError):
#                     pass  # Skip invalid difficulty values
#
#         print(f"🎵 Files with valid difficulty values: {len(files_with_difficulty)}")
#
#         return files_with_difficulty
#
#     except Exception as e:
#         print(f"❌ Error fetching data: {e}")
#         print("Make sure you have proper Google Cloud authentication configured.")
#         return None

# # Fetch real data
# print("🔍 Attempting to fetch real song data from Google Drive...")
# real_files = fetch_real_song_data()

# if real_files:
#     print("\n✅ Successfully fetched real data!")
#
#     # Analyze real data
#     real_df = analyze_binning(real_files)
#
#     print(f"\n📊 Real Data Analysis:")
#     print(f"   Total songs: {len(real_df)}")
#     print(f"   Songs with difficulty: {len(real_df[real_df['has_difficulty']])}")
#
#     valid_real = real_df[real_df['has_difficulty']]
#     if len(valid_real) > 0:
#         print(f"   Difficulty range: {valid_real['difficulty'].min():.2f} to {valid_real['difficulty'].max():.2f}")
#         print(f"   Mean difficulty: {valid_real['difficulty'].mean():.2f}")
#
#         # Plot real data analysis
#         plot_binning_analysis(real_df, f"Real Song Data Analysis ({len(valid_real)} songs)")
#
#         # Show sample of real songs
#         print("\n🎵 Sample of real songs with difficulties:")
#         sample_songs = valid_real.sample(min(10, len(valid_real)))
#         print(sample_songs[['name', 'difficulty', 'bin']].to_string(index=False))
#
# else:
#     print("\n⏭️  Skipping real data analysis - using sample data only")
#     print("   To enable real data fetching, ensure you have:")
#     print("   1. Google Cloud credentials configured")
#     print("   2. Access to the song sheet folders")
#     print("   3. Proper configuration in your settings")

print("\n⚠️  Real data fetching is commented out.")
print("   Uncomment the code above if you have Google Cloud credentials configured.")

## Summary and Recommendations

Based on our analysis of the difficulty binning algorithm, here are the key findings and recommendations:

In [None]:
print("📋 SUMMARY OF FINDINGS")
print("=" * 50)

print("\n🔍 Current Algorithm Behavior:")
print(
    "   • Uses RELATIVE binning based on the minimum difficulty in the current song selection"
)
print(
    "   • Normalizes difficulties to 0-1 range using (difficulty - min) / (5.0 - min)"
)
print(
    "   • Distributes songs into bins using equal-width intervals in normalized space"
)
print("   • Handles edge cases with fallback scaler = 1.0 when all difficulties ≥ 5.0")

print("\n❌ Key Problems Identified:")
print(
    "   1. 🎯 INCONSISTENT BINNING: Same song gets different bins depending on collection"
)
print(
    "   2. 📈 SENSITIVITY TO THE MINIMUM: Adding an easier song lowers the minimum and shifts other assignments"
)
print(
    "   3. 🔍 NARROW RANGES: With the max fixed at 5.0, closely spaced difficulties can collapse into one bin"
)
print("   4. ⚠️  EDGE CASE ISSUES: Single songs, all-high difficulties, etc.")
print(
    "   5. 🎵 UNPREDICTABLE FOR USERS: Musicians can't predict which bin their song will be in"
)

print("\n✅ Recommended Solutions:")
print(
    "   1. 📊 ABSOLUTE BINNING: Use fixed difficulty ranges (e.g., 1-2=easy, 2-3=medium, etc.)"
)
print("   2. 📈 PERCENTILE-BASED: Use historical data to define percentile boundaries")
print("   3. 🎯 HYBRID APPROACH: Combine absolute ranges with occasional recalibration")
print("   4. 📋 CONFIGURATION: Make binning parameters configurable and documented")
print("   5. ✅ VALIDATION: Add tests for edge cases and consistency checks")

print("\n🎯 Immediate Actions for Issue #121:")
print("   • Document current behavior and its limitations")
print("   • Implement absolute binning as an alternative")
print("   • Add configuration option to choose binning method")
print("   • Expand test coverage for edge cases")
print("   • Consider user impact when changing binning method")

print("\n" + "=" * 50)
print(
    "✨ This analysis provides concrete data to support improvements to the difficulty grading system!"
)

## Next Steps

1. **Share this analysis** with the development team and stakeholders
2. **Gather feedback** on preferred binning approach (absolute vs. relative vs. hybrid)
3. **Implement chosen solution** with proper configuration options
4. **Add comprehensive tests** for edge cases identified here
5. **Document the binning algorithm** so users understand how their songs will be classified
6. **Plan migration strategy** if changing existing binning behavior
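
To help compare the options when gathering feedback, here is a minimal sketch of the percentile-based alternative mentioned in the summary. It assumes the cut points are computed once from a fixed reference set of historical difficulties and then reused for every selection. The function names and the reference values are hypothetical and exist only for illustration.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List


def percentile_edges(reference: List[float], num_bins: int = 5) -> List[float]:
    """Cut points that split the reference difficulties into num_bins groups."""
    return quantiles(reference, n=num_bins)


def percentile_bin(difficulty: float, edges: List[float]) -> int:
    """1-based bin for a difficulty, given precomputed cut points."""
    return 1 + bisect_right(edges, difficulty)


# Hypothetical historical difficulties, for illustration only
reference = [float(d) for d in range(1, 11)]
edges = percentile_edges(reference)
```

Since `edges` is derived from the reference set and not from the current selection, a song's bin stays stable until the reference set is deliberately recalibrated.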

---

**This notebook can be re-run with real data** from your Google Drive once proper authentication is configured. The interactive widgets allow for further exploration of how different parameters affect the binning behavior.