# Scheduling Trends Analysis

This notebook replicates the calculations and charts from the **"Scheduling Trends"** section of the Streamlit app.

## Objective

Analyze the **probability that an observation will be scheduled** based on:
- **Priority** (numeric)
- **Visibility** (total visibility hours)
- **Required time** (requested hours)

We include **interaction terms** to capture combined effects (e.g., if visibility = 0, the probability should be ~0 regardless of priority).

In [None]:
# Imports
import sys
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

# Add src directory to path
# Assumes the notebook runs from <project_root>/notebooks or from the project root itself
cwd = Path.cwd()
project_root = cwd.parent if cwd.name == 'notebooks' else cwd
sys.path.insert(0, str(project_root / 'src'))

from tsi.modeling.tendencias import (
    compute_empirical_rates,
    smooth_trend,
    fit_logistic_with_interactions,
    predict_probs,
    create_prediction_grid,
)
from tsi.plots.tendencias import (
    bar_rate_by_priority,
    loess_trend,
    heatmap_visibility_priority,
    pred_curve_vs_visibility,
)

print("✅ Imports completed")

## 1. Data Loading

We load the preprocessed sample dataset. This CSV is generated from JSON files using the `preprocess_schedules.py` script.

In [None]:
# Load sample data from the preprocessed CSV
data_path = project_root / 'data' / 'schedule.csv'

if not data_path.exists():
    raise FileNotFoundError(
        f"⚠️ Sample data file not found: {data_path}\n"
        f"To create it, run: python preprocess_schedules.py --schedule data/schedule.json --visibility data/possible_periods.json --output {data_path}"
    )

df_raw = pd.read_csv(data_path)
print(f"✅ Data loaded: {len(df_raw)} rows, {len(df_raw.columns)} columns")
print(f"Source: {data_path.name}")

In [None]:
# Prepare data (similar to what the app does)
df = df_raw.copy()

# Normalize columns (adjust according to actual schema)
# The app uses 'scheduled_flag', 'requested_hours', etc.

# If CSV doesn't have 'requested_hours', create it
if 'requestedDurationSec' in df.columns and 'requested_hours' not in df.columns:
    df['requested_hours'] = df['requestedDurationSec'] / 3600.0

# Verify required columns
required_cols = ['priority', 'total_visibility_hours', 'requested_hours', 'scheduled_flag']
missing = [col for col in required_cols if col not in df.columns]

if missing:
    print(f"❌ Missing columns: {missing}")
    print("\nAvailable columns:")
    print(df.columns.tolist())
else:
    print("✅ All required columns are present")
    print(f"\nData summary:")
    print(df[required_cols].describe())

## 2. Exploratory Analysis

We verify distributions and key values.

In [None]:
# Basic statistics
print("=" * 60)
print("BASIC STATISTICS")
print("=" * 60)

total_obs = len(df)
n_scheduled = int(df['scheduled_flag'].sum())  # int() keeps the ':,' format clean if the flag is stored as float
rate_scheduled = n_scheduled / total_obs * 100
n_zero_vis = (df['total_visibility_hours'] == 0).sum()

print(f"Total observations: {total_obs:,}")
print(f"Scheduled: {n_scheduled:,} ({rate_scheduled:.1f}%)")
print(f"Unscheduled: {total_obs - n_scheduled:,}")
print(f"Visibility = 0: {n_zero_vis:,} ({n_zero_vis/total_obs*100:.1f}%)")

print(f"\nPriority: min={df['priority'].min():.1f}, max={df['priority'].max():.1f}, mean={df['priority'].mean():.2f}")
print(f"Visibility: min={df['total_visibility_hours'].min():.1f}, max={df['total_visibility_hours'].max():.1f}, mean={df['total_visibility_hours'].mean():.2f}")
print(f"Required time: min={df['requested_hours'].min():.2f}, max={df['requested_hours'].max():.2f}, mean={df['requested_hours'].mean():.2f}")

## 3. Empirical Rates by Priority

We calculate the scheduling rate for each priority level.

In [None]:
# Calculate empirical rates
empirical = compute_empirical_rates(df, n_bins=10)

print("Empirical rates by priority:")
print(empirical.by_priority)

# Chart
fig = bar_rate_by_priority(
    empirical.by_priority,
    library='plotly',
    title='Scheduling rate by priority',
)
fig.show()

**Observations:**
- Observations with higher priority have higher scheduling rates.
- Sample size (n) varies between priority levels.
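The per-priority rate is the mean of the 0/1 scheduling flag within each priority group. The sketch below shows that computation on made-up toy data. It is not the implementation of `compute_empirical_rates`; the toy values and the output column names (`n`, `scheduled`, `rate`) are assumptions for illustration.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "priority": [1, 1, 2, 2, 2, 3],
    "scheduled_flag": [0, 1, 1, 1, 0, 1],
})

# Rate = mean of the 0/1 flag per priority level; n = group size.
rates = (
    toy.groupby("priority")["scheduled_flag"]
    .agg(n="count", scheduled="sum", rate="mean")
    .reset_index()
)
print(rates)
```

Keeping `n` next to `rate` makes it easy to spot priority levels whose rate rests on only a handful of observations.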

## 4. Smoothed Trends

We apply smoothing (weighted moving average) to see continuous trends.

In [None]:
# Trend: visibility → scheduling rate
smooth_vis = smooth_trend(
    df,
    x_col='total_visibility_hours',
    y_col='scheduled_flag',
    bandwidth=0.3,
)

fig_vis = loess_trend(
    smooth_vis,
    library='plotly',
    title='Smoothed trend: Visibility',
    x_label='Visibility (hours)',
)
fig_vis.show()

In [None]:
# Trend: required time → scheduling rate
smooth_time = smooth_trend(
    df,
    x_col='requested_hours',
    y_col='scheduled_flag',
    bandwidth=0.3,
)

fig_time = loess_trend(
    smooth_time,
    library='plotly',
    title='Smoothed trend: Required time',
    x_label='Required time (hours)',
)
fig_time.show()

**Observations:**
- Visibility shows a positive relationship with scheduling rate.
- Required time may show a non-linear relationship (very long observations are less likely to be scheduled).
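One common way to build a weighted moving average over a binary outcome is a Gaussian-kernel average (Nadaraya-Watson). The sketch below is an assumption about the general technique, not the actual `smooth_trend` code: the function `kernel_smooth` is hypothetical, and its `bandwidth` is expressed in the units of `x`, which may differ from how the `bandwidth=0.3` argument above is interpreted.

```python
import numpy as np

def kernel_smooth(x, y, grid, bandwidth):
    """Gaussian-kernel weighted average of y, evaluated at each grid point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # weights[i, j] is the influence of observation j on grid point i.
    z = (grid[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * z**2)
    return (weights @ y) / weights.sum(axis=1)

# Synthetic data: the chance of y == 1 grows linearly with x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = (rng.uniform(size=500) < x / 10).astype(int)

grid = np.linspace(0, 10, 21)
smoothed = kernel_smooth(x, y, grid, bandwidth=1.0)
print(np.round(smoothed, 2))
```

Because each output is a weighted average of 0/1 values, the smoothed curve always stays in [0, 1] and can be read directly as a local scheduling rate.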

## 5. 2D Heatmap: Visibility × Priority

We visualize the interaction between visibility and priority.

In [None]:
fig_heatmap = heatmap_visibility_priority(
    df,
    library='plotly',
    n_bins_vis=10,
    n_bins_priority=10,
)
fig_heatmap.show()

**Observations:**
- The upper right corner (high priority + high visibility) shows the highest rates.
- When visibility is low, even with high priority, the rate is low.
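The table behind such a heatmap can be approximated by binning both variables and averaging the scheduling flag per cell. This is a minimal sketch on synthetic data, assuming equal-width bins. The helper columns `vis_bin` and `prio_bin` are hypothetical, and `heatmap_visibility_priority` may bin differently.

```python
import numpy as np
import pandas as pd

# Synthetic data: column names follow the notebook, values are random.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "priority": rng.uniform(0, 10, 400),
    "total_visibility_hours": rng.uniform(0, 100, 400),
})
toy["scheduled_flag"] = (rng.uniform(size=400) < 0.5).astype(int)

# Equal-width bins on each axis.
toy["vis_bin"] = pd.cut(toy["total_visibility_hours"], bins=5)
toy["prio_bin"] = pd.cut(toy["priority"], bins=5)

# Mean of the 0/1 flag per (priority bin, visibility bin) cell.
rate_table = toy.pivot_table(
    index="prio_bin",
    columns="vis_bin",
    values="scheduled_flag",
    aggfunc="mean",
    observed=False,
)
print(rate_table.round(2))
```

Cells with few observations produce noisy rates, so it helps to inspect the per-cell counts (`aggfunc="count"`) before reading patterns into the heatmap.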

## 6. Logistic Model with Interactions

### 6.1 Decision: Exclude visibility = 0?

**Justification:**
- Observations with visibility = 0 **can never be scheduled** (not visible from the telescope).
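- Including them in training adds guaranteed negatives that the logistic fit must absorb, which can distort the coefficients estimated for observable targets.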
- **Decision:** Exclude them from training, but show their behavior in empirical charts.

In [None]:
# Train model WITHOUT visibility = 0
print("=" * 60)
print("LOGISTIC MODEL WITH INTERACTIONS")
print("=" * 60)

model_result = fit_logistic_with_interactions(
    df,
    exclude_zero_visibility=True,
    class_weight='balanced',
)

print(f"\n✅ Model trained successfully")
print(f"Samples: {model_result.n_samples:,}")
print(f"Scheduled: {model_result.n_scheduled:,}")
print(f"Accuracy: {model_result.accuracy:.3f}")
# Compare against None so that a legitimate AUC of 0.0 is not reported as N/A
print(f"AUC: {model_result.auc_score:.3f}" if model_result.auc_score is not None else "AUC: N/A")
print(f"\nFeatures: {model_result.feature_names}")

### 6.2 Interaction Terms

The model includes:
- `priority_num × visibility`
- `visibility × required_time`
- `priority_num × required_time`

These terms capture combined effects (e.g., high priority only helps if there's visibility).
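The three products listed above can be built as explicit columns before fitting a logistic regression. The sketch below uses scikit-learn on synthetic data and is not the code inside `fit_logistic_with_interactions`: the helper `add_interactions`, the toy column names, the synthetic coefficients, and the `StandardScaler` step are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def add_interactions(frame):
    """Return the three main effects plus their pairwise products."""
    out = frame[["priority", "visibility", "required_time"]].copy()
    out["priority_x_visibility"] = out["priority"] * out["visibility"]
    out["visibility_x_required_time"] = out["visibility"] * out["required_time"]
    out["priority_x_required_time"] = out["priority"] * out["required_time"]
    return out

# Synthetic data in which priority only helps when visibility is available.
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    "priority": rng.uniform(0, 10, n),
    "visibility": rng.uniform(0, 100, n),
    "required_time": rng.uniform(0.5, 5, n),
})
logit = -3 + 0.01 * toy["priority"] * toy["visibility"] - 0.3 * toy["required_time"]
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Scaling is added here only to help the solver converge.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
features = add_interactions(toy)
model.fit(features, y)
probs = model.predict_proba(features)[:, 1]
print(f"Mean predicted probability: {probs.mean():.3f}")
```

With a product term such as `priority_x_visibility`, the effect of priority on the log-odds depends on visibility, which is what lets the model represent "high priority only helps if there's visibility".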

## 7. Model Predictions

We generate predictions for different combinations of priority and visibility.

In [None]:
# Create prediction grid
vis_min = df['total_visibility_hours'].min()
vis_max = df['total_visibility_hours'].max()
# First 5 distinct priority values in ascending order (these are the lowest values, not the highest)
priority_levels = sorted(df['priority'].dropna().unique())[:5]

grid = create_prediction_grid(
    visibility_range=(vis_min, vis_max),
    priority_levels=priority_levels,
    tiempo_requerido=df['requested_hours'].median(),
    n_points=100,
)

# Predict
grid_with_probs = predict_probs(grid, model_result)

print(f"Predictions generated: {len(grid_with_probs)} rows")
print(grid_with_probs.head())

In [None]:
# Probability vs visibility chart
fig_pred = pred_curve_vs_visibility(
    grid_with_probs,
    library='plotly',
    tiempo_fijo=df['requested_hours'].median(),
)
fig_pred.show()

**Interpretation:**
- The curves show how scheduling probability increases with visibility.
- Higher priority levels have elevated curves.
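- As visibility approaches 0, all curves should converge toward a low probability (~0) if the interaction terms capture the effect as intended. Because the model was trained without visibility = 0 rows, predictions at that boundary are extrapolations.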
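A prediction grid of this kind is the Cartesian product of a visibility range and a set of priority levels, with required time held fixed. The sketch below shows one way to build it with pandas. The function `make_grid` is hypothetical and only mirrors the arguments passed to `create_prediction_grid` above; the real function may return different column names.

```python
import numpy as np
import pandas as pd

def make_grid(visibility_range, priority_levels, required_time, n_points=100):
    """Cross every priority level with evenly spaced visibility values."""
    vis = np.linspace(visibility_range[0], visibility_range[1], n_points)
    index = pd.MultiIndex.from_product(
        [priority_levels, vis],
        names=["priority", "total_visibility_hours"],
    )
    grid = index.to_frame(index=False)
    grid["requested_hours"] = required_time  # held constant across the grid
    return grid

grid_demo = make_grid((0.0, 50.0), [1, 2, 3], required_time=1.5, n_points=100)
print(grid_demo.head())
print(f"Rows: {len(grid_demo)}")
```

Holding required time at a single value (the notebook uses the median) means each curve isolates how probability changes with visibility for one priority level.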

## 8. Model Validation

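We evaluate the model with additional metrics. These are in-sample metrics computed on the same rows used for training, so they likely overstate performance on unseen data.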

In [None]:
# Predictions on training set
df_train = df[df['total_visibility_hours'] > 0].copy()  # Same data used in training
df_train_pred = predict_probs(df_train, model_result)

y_true = df_train_pred['scheduled_flag'].astype(int).values
y_pred_proba = df_train_pred['prob_planificada'].values
y_pred = (y_pred_proba > 0.5).astype(int)

# Metrics
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_pred_proba)
cm = confusion_matrix(y_true, y_pred)

print("=" * 60)
print("VALIDATION METRICS")
print("=" * 60)
print(f"Accuracy: {acc:.3f}")
print(f"AUC: {auc:.3f}")
print(f"\nConfusion Matrix:")
print(cm)
print(f"\nTrue Negatives: {cm[0, 0]}")
print(f"False Positives: {cm[0, 1]}")
print(f"False Negatives: {cm[1, 0]}")
print(f"True Positives: {cm[1, 1]}")

## 9. Conclusions

### Key Findings:
1. **Priority:** Positive correlation with scheduling rate.
2. **Visibility:** Critical factor. Visibility = 0 implies probability ≈ 0.
3. **Interactions:** The interaction terms let the model express that high priority is only effective if there's visibility.
4. **Required time:** Moderate effect. Very long observations are less likely to be scheduled.

### Technical Decisions:
- **Exclusion of visibility = 0:** Justified for the model (impossible cases).
- **Interaction terms:** Improve predictive capability by capturing combined effects.
- **Balanced weighting:** Weights each class inversely to its frequency, which is useful if scheduled and unscheduled observations are imbalanced.
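In scikit-learn, `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count_of_class)`. The sketch below checks that formula against `compute_class_weight` on a made-up 90/10 label vector. It assumes that `fit_logistic_with_interactions` forwards `class_weight` to a scikit-learn estimator, which the notebook does not confirm.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 90 unscheduled (0) and 10 scheduled (1).
y_demo = np.array([0] * 90 + [1] * 10)

# Manual formula: n_samples / (n_classes * count per class).
manual_weights = len(y_demo) / (2 * np.bincount(y_demo))

# The same weights as computed by scikit-learn.
sk_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y_demo,
)
print("manual: ", manual_weights)
print("sklearn:", sk_weights)
```

The minority class receives the larger weight, so errors on scheduled observations cost more during fitting when they are rare.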

### Next Steps:
- Temporal cross-validation (if data from multiple iterations exists).
- Feature importance to quantify the impact of each variable.
- Explore non-linear models (Random Forest, XGBoost) for comparison.
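As a first step toward out-of-sample validation, a stratified k-fold cross-validation gives an AUC estimate that does not reuse the training rows. The sketch below runs on synthetic data with a plain `LogisticRegression`. It is not temporal cross-validation, which would require an iteration or date column that this notebook does not show.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic features and a noisy binary target, for illustration only.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(600, 3))
signal = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=1.0, size=600)
y_demo_cv = (signal > 0).astype(int)

# Stratification keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo_cv,
    cv=cv,
    scoring="roc_auc",
)
print(f"AUC per fold: {np.round(auc_scores, 3)}")
print(f"Mean AUC: {auc_scores.mean():.3f}")
```

The spread of the per-fold scores is as informative as the mean: a wide spread suggests the in-sample metrics are unstable.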