 ## Question 1
 **Question:**  
 You are tasked by the scouting department with creating a machine learning model that predicts which NCAA players will become good 3PT shooters in the NBA. How would you approach this problem? Be sure to explain what datasets you would want to use in your answer.

 **Answer:**

 ### Problem Definition
 The goal is to identify which NCAA players are most likely to become reliable NBA shooters—defined here as hitting **≥36% from three on meaningful volume (250+ attempts) in their first three NBA seasons**. This gives scouts a forward-looking tool to prioritize prospects.

 ### Data Sources
 - **Primary:** NCAA box scores and season splits from Sports-Reference; NBA outcomes from the NBA Stats API and Basketball-Reference.
 - **Premium (if available):** Synergy Sports for shot-type context (catch-and-shoot vs. off-dribble, contested vs. uncontested).
 - **Linking:** Draft records to connect each NCAA player’s college profile to their NBA career.

 ### Key Features
 - **3PA rate:** Shows willingness to take threes.
 - **FT%:** A reliable indicator of shooting touch.
 - **Age on Draft Night:** Aligns with NBA development curves; younger players with the same stats usually have more upside.
 - **Empirical Bayes (EB)-smoothed NCAA 3P%:** Shrinks raw makes/attempts toward the league average, with more shrinkage the fewer attempts a player has, so small samples aren’t over- or under-weighted.
 - **Shot selection & role metrics:** Synergy adds insight into whether a player’s threes come in translatable contexts (spot-ups vs. forced off-dribble).
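
The EB smoothing in the feature list can be sketched as a beta-binomial shrinkage. This is a minimal illustration rather than the pipeline's actual code: the function name, the prior mean (`prior_mean`), and the prior strength (`prior_attempts`) are assumed values that would in practice be estimated from the historical NCAA shooting distribution.

```python
def eb_three_point_pct(makes: int, attempts: int,
                       prior_mean: float = 0.34,
                       prior_attempts: float = 200.0) -> float:
    """Shrink a raw 3P% toward a prior mean (beta-binomial posterior mean).

    prior_mean and prior_attempts are illustrative assumptions; in practice
    they would be estimated from the population of NCAA shooters.
    """
    if attempts < 0 or makes < 0 or makes > attempts:
        raise ValueError("need 0 <= makes <= attempts")
    alpha = prior_mean * prior_attempts          # prior "makes"
    beta = (1.0 - prior_mean) * prior_attempts   # prior "misses"
    return (makes + alpha) / (attempts + alpha + beta)


# A hot 12-for-20 sample (60% raw) is pulled strongly toward the prior ...
small_sample = eb_three_point_pct(12, 20)
# ... while 120-for-300 (40% raw) on real volume moves much less.
large_sample = eb_three_point_pct(120, 300)
print(round(small_sample, 3), round(large_sample, 3))
```

With these assumed prior values, the 60% shooter on 20 attempts ends up below the 40% shooter on 300 attempts, which is the behaviour the feature is meant to provide.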

 ### Controls & Adjustments
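 - Add a regime flag for NCAA 3P% (before vs. after the men’s three-point line moved back to the international distance in 2019–20) to account for different difficulty environments.
 - Normalize volume stats by pace and possessions (e.g., 3PA per 100 possessions) so players on faster teams aren’t inflated.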
 - Recognize positional context: guards are expected to take higher-volume threes, while bigs may show value even at lower volume.
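
The first two adjustments can be sketched as small helper functions. The function names are hypothetical, the 2019 cutoff comes from the line move mentioned above, and the sketch assumes a team-possession estimate is available for each player-season.

```python
def line_regime(season_start_year: int) -> int:
    """Return 1 for seasons played with the deeper line (2019-20 onward), else 0."""
    return 1 if season_start_year >= 2019 else 0


def threes_per_100_possessions(three_pa: int, team_possessions: float) -> float:
    """Three-point attempts per 100 team possessions (possession count is an assumed input)."""
    if team_possessions <= 0:
        raise ValueError("team_possessions must be positive")
    return 100.0 * three_pa / team_possessions


# Hypothetical player-season: 150 attempts over 2,000 team possessions in 2018-19.
print(line_regime(2018), threes_per_100_possessions(150, 2000))
```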

 ### Modeling Approach
 - **Y variable:** An Empirical Bayes estimate of NBA 3P% in Years 1–3 (shrunk toward league average for stability). The probability output additionally needs a binary label: whether the player cleared the bar set in the problem definition.
 - **Features:** The NCAA features above feed a model (regularized regression or gradient-boosted trees) that outputs:
     1. The probability that the player clears the ≥36% on 250+ attempts bar (roughly NBA league average on real volume) in Years 1–3.
     2. A stabilized NBA 3P% projection with uncertainty bands.
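
Both outputs need targets built from NBA shooting totals. The sketch below shows one way to derive them, using the 36% and 250-attempt thresholds from the problem definition. The `NbaShooting` container, the function names, and the prior strength are assumptions for illustration, as is the choice to treat players below the volume bar as negatives.

```python
from dataclasses import dataclass


@dataclass
class NbaShooting:
    """Combined three-point totals over a player's first three NBA seasons."""
    makes: int
    attempts: int


def good_shooter_label(s: NbaShooting, min_pct: float = 0.36,
                       min_attempts: int = 250) -> int:
    """1 if the player met both the accuracy and the volume bar, else 0.

    Assumption: players below the volume bar count as negatives.
    """
    if s.attempts < min_attempts:
        return 0
    return int(s.makes / s.attempts >= min_pct)


def eb_target(s: NbaShooting, league_pct: float = 0.36,
              prior_attempts: float = 250.0) -> float:
    """NBA 3P% shrunk toward league average (prior values are assumptions)."""
    return (s.makes + league_pct * prior_attempts) / (s.attempts + prior_attempts)


# Hypothetical players: a high-volume shooter and a low-volume hot hand.
volume_shooter = NbaShooting(makes=200, attempts=500)
hot_hand = NbaShooting(makes=50, attempts=100)
print(good_shooter_label(volume_shooter), round(eb_target(volume_shooter), 3))
print(good_shooter_label(hot_hand), round(eb_target(hot_hand), 3))
```

The classifier would be trained on the binary label and the regressor on the shrunk percentage, so the two outputs stay consistent with the same underlying shooting record.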

 ### Why This Works
 Empirical Bayes on both sides (college and NBA) prevents us from overrating noisy samples. The regime flag ensures fair comparisons across eras. The features (3PA rate, FT%, role, and age) are exactly the indicators that scouts already care about—the model simply quantifies them against 10+ years of draft history. This way, we aren’t replacing the eye test, just giving scouts an evidence-based lens to spot hidden value or validate what they already see on film.

 ---

 ## Question 2
 **Question:**  
 You’ve built the machine learning model from question 1, and now a scout wants help understanding it. Explain how machine learning models would work in this specific context in one or two paragraphs, using simple language.

 **Answer:**

 Think of this like having a very experienced scout who has tracked every draft class since 2010. The model looks at hundreds of players who went from college to the NBA and learns the patterns: younger players with solid free throw touch and high three-point volume usually develop into strong NBA shooters, while older players with similar stats tend to plateau.

 For any new prospect, we feed in their college profile—3PA rate, FT%, age, role, and their EB-adjusted NCAA 3P% (which smooths out hot streaks or cold stretches). The model compares them to historical players and gives two outputs: (1) the probability that they’ll become an above-average NBA shooter in their first three years, and (2) a stabilized projection of their NBA 3P%. We also highlight the “why” in plain terms—for example: “19 years old, high volume, strong FT%—looks similar to Player X at the same stage.”

 This doesn’t replace your eye test. Instead, it adds another layer: it can uncover overlooked shooters or flag concerns about hyped players whose stats don’t usually translate. And because the outcome variable is an EB-smoothed NBA 3P% over Y1–Y3, it reflects true shooting skill instead of just early streaks or slumps.


---

## Question 3

**Prompt:**  
Please download the NBA Dataset available for free here. Using only the data provided in that dataset, rate the 10 best, 10 most average, and 10 worst individual player seasons. Only include the regular season, use players with at least 500 minutes played in that specific season, and only go back as far as the 2010-11 season. You may define and calculate these metrics in any way you wish.

List the results of your code for all 3 rankings. Only ranks, player names, and seasons are needed for each set. *This question is required.*

---

### Answer

This analysis evaluates individual NBA player seasons from 2010-11 onwards using a dual-metric approach to capture both all-around impact and per-minute efficiency. I developed rankings using PIE (Player Impact Estimate) and Game Score per 36 minutes, then built predictive models to forecast 2025-26 season performance.

**Key Findings:**
- **Top performers:** LeBron James dominates the PIE rankings (four seasons from 2010-11 through 2013-14), while modern big men lead Game Score/36 (Embiid 2023-24, Jokic 2024-25, Giannis 2019-20)
- **Predictive models:** R² = 0.852 for PIE prediction and R² = 0.707 for Game Score/36
- **2025-26 Projections:** Embiid, Jokic, and Giannis expected to lead both metrics

---

## Results

### PIE Rankings – All-Around Impact Leaders

#### Top 10 Seasons

| Rank | Player            | Season   |
|------|-------------------|----------|
| 1    | LeBron James      | 2012-13  |
| 2    | LeBron James      | 2011-12  |
| 3    | Russell Westbrook | 2016-17  |
| 4    | Kevin Durant      | 2013-14  |
| 5    | Nikola Jokic      | 2021-22  |
| 6    | Nikola Jokic      | 2024-25  |
| 7    | Kevin Durant      | 2012-13  |
| 8    | Joel Embiid       | 2023-24  |
| 9    | LeBron James      | 2010-11  |
| 10   | LeBron James      | 2013-14  |

#### Middle 10 Seasons (Around Median)

| Rank | Player             | Season   |
|------|--------------------|----------|
| 1    | Patrick Beverley   | 2018-19  |
| 2    | Bub Carrington     | 2024-25  |
| 3    | Omer Asik          | 2013-14  |
| 4    | J.J. Barea         | 2010-11  |
| 5    | Carmelo Anthony    | 2020-21  |
| 6    | Earl Clark         | 2012-13  |
| 7    | Cory Joseph        | 2015-16  |
| 8    | Jaxson Hayes       | 2021-22  |
| 9    | Quentin Grimes     | 2022-23  |
| 10   | Steve Novak        | 2011-12  |

#### Bottom 10 Seasons

| Rank | Player             | Season   |
|------|--------------------|----------|
| 1    | Ronnie Price       | 2010-11  |
| 2    | Gary Harris        | 2014-15  |
| 3    | Terrance Ferguson  | 2019-20  |
| 4    | Kevin Seraphin     | 2010-11  |
| 5    | Ryan Hollins       | 2011-12  |
| 6    | Jason Collins      | 2010-11  |
| 7    | Doron Lamb         | 2012-13  |
| 8    | Rashad Vaughn      | 2015-16  |
| 9    | Terrance Ferguson  | 2017-18  |
| 10   | John Lucas III     | 2013-14  |

---

### Game Score per 36 Rankings – Per-Minute Efficiency Leaders

#### Top 10 Seasons

| Rank | Player                  | Season   |
|------|-------------------------|----------|
| 1    | Joel Embiid             | 2023-24  |
| 2    | Nikola Jokic            | 2024-25  |
| 3    | Giannis Antetokounmpo   | 2019-20  |
| 4    | Nikola Jokic            | 2021-22  |
| 5    | Shai Gilgeous-Alexander | 2024-25  |
| 6    | Giannis Antetokounmpo   | 2021-22  |
| 7    | Joel Embiid             | 2022-23  |
| 8    | Nikola Jokic            | 2022-23  |
| 9    | Giannis Antetokounmpo   | 2023-24  |
| 10   | Nikola Jokic            | 2023-24  |

#### Middle 10 Seasons (Around Median)

| Rank | Player                | Season   |
|------|-----------------------|----------|
| 1    | Jabari Walker         | 2023-24  |
| 2    | Roy Hibbert           | 2014-15  |
| 3    | Max Strus             | 2024-25  |
| 4    | Sam Hauser            | 2023-24  |
| 5    | Julian Champagnie     | 2024-25  |
| 6    | Justise Winslow       | 2021-22  |
| 7    | Naji Marshall         | 2023-24  |
| 8    | Shabazz Napier        | 2017-18  |
| 9    | Terance Mann          | 2021-22  |
| 10   | Michael Carter-Williams | 2014-15 |

#### Bottom 10 Seasons

| Rank | Player             | Season   |
|------|--------------------|----------|
| 1    | DeShawn Stevenson  | 2011-12  |
| 2    | Terrance Ferguson  | 2019-20  |
| 3    | Stephen Graham     | 2010-11  |
| 4    | Mike Miller        | 2014-15  |
| 5    | Shawne Williams    | 2011-12  |
| 6    | Rashad Vaughn      | 2015-16  |
| 7    | Semi Ojeleye       | 2017-18  |
| 8    | Jason Collins      | 2010-11  |
| 9    | Doron Lamb         | 2012-13  |
| 10   | Anthony Brown      | 2015-16  |

---

## Question 4

**Prompt:**  
Describe how you calculated these rankings and why you chose that approach. *This question is required.*

---

### Answer 4

#### Methodology & Rationale

**Why These Two Metrics?**  
I chose a complementary pair that captures different aspects of basketball excellence:

1. **PIE (Player Impact Estimate)**
   - Measures a player's share of all box-score events in a game (both teams combined)
   - Captures all-around, high-usage seasons
   - Formula: (Player's net box-score contributions) / (Total net contributions from both teams)
   - For each game, PIE is calculated as a player’s contributions divided by the combined contributions of both teams, so all players’ PIE in a game sum to 1. A season PIE is then the weighted average of these game values, where weights are the game totals.
   - Best for identifying star-level, ball-dominant impact

2. **Game Score per 36 Minutes**
   - Hollinger's efficiency metric scaled to per-36 minutes
   - Rewards scoring, rebounds, assists; penalizes missed shots and turnovers
   - Formula: Points + 0.4×FGM - 0.7×FGA - 0.4×(FTA-FTM) + 0.7×OREB + 0.3×DREB + STL + 0.7×AST + 0.7×BLK - 0.4×PF - TOV
   - Best for per-minute production regardless of playing time
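
Both formulas can be written out as short functions. The PIE term below uses the NBA's standard PIE weights, which reproduce the `PIE Num` column in the detailed tables later in this document. The `BoxLine` container and the function names are illustrative and not taken from the project code.

```python
from dataclasses import dataclass


@dataclass
class BoxLine:
    """Box-score totals for a player (or for a whole game, both teams combined)."""
    pts: float
    fgm: float
    fga: float
    ftm: float
    fta: float
    oreb: float
    dreb: float
    ast: float
    stl: float
    blk: float
    pf: float
    tov: float


def pie_events(b: BoxLine) -> float:
    """Net box-score events, used for both the PIE numerator and denominator."""
    return (b.pts + b.fgm + b.ftm - b.fga - b.fta + b.dreb + 0.5 * b.oreb
            + b.ast + b.stl + 0.5 * b.blk - b.pf - b.tov)


def season_pie(player_events, game_totals) -> float:
    """Game-total-weighted average of per-game PIE.

    Weighting each game's PIE by its game total reduces to
    sum(player events) / sum(game totals).
    """
    return sum(player_events) / sum(game_totals)


def game_score(b: BoxLine) -> float:
    """Hollinger's Game Score for one box line."""
    return (b.pts + 0.4 * b.fgm - 0.7 * b.fga - 0.4 * (b.fta - b.ftm)
            + 0.7 * b.oreb + 0.3 * b.dreb + b.stl + 0.7 * b.ast
            + 0.7 * b.blk - 0.4 * b.pf - b.tov)


def game_score_per36(b: BoxLine, minutes: float) -> float:
    """Game Score scaled to 36 minutes of playing time."""
    return game_score(b) * 36.0 / minutes


# LeBron James 2012-13 season totals, copied from the detailed PIE table.
lebron = BoxLine(pts=2036, fgm=765, fga=1354, ftm=403, fta=535, oreb=97,
                 dreb=513, ast=551, stl=129, blk=67, pf=110, tov=226)
print(pie_events(lebron))                       # 2254.0, the PIE Num column
print(round(pie_events(lebron) / 12890.5, 6))   # 0.174857, the PIE column
```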

**Data Filtering & Preprocessing**

- **Filters Applied:**
  - Seasons: 2010-11 through 2024-25
  - Minutes: Minimum 500 minutes played
  - Regular season only
  - Final dataset: 4,423 rows

- **Sort Order and Tiebreakers (applied in sequence):**
  1. Primary metric (PIE or Game Score/36)
  2. season_pie DESC
  3. total_minutes DESC
  4. ts_pct DESC

- **Side Notes:**
  - For ‘most average,’ I defined it as the 10 player-seasons closest to the median value of the distribution, since NBA performance data is skewed by extreme outliers. This captures the truly ‘middle’ players rather than being pulled upward by superstars.
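
The sort order and the median-based definition of "most average" can be sketched as below. The column names `season_pie`, `total_minutes`, `ts_pct`, and `game_score_per36` follow the ones used in this write-up; the `rank_seasons` function and the row layout (one dict per player-season) are assumptions for illustration.

```python
from statistics import median


def rank_seasons(rows, metric, n=10):
    """Return (top, middle, bottom) lists of n player-seasons for a metric.

    Ties on the metric are broken by season_pie, total_minutes, then ts_pct,
    all descending. "Middle" means closest to the median of the metric.
    """
    def sort_key(r):
        return (r[metric], r["season_pie"], r["total_minutes"], r["ts_pct"])

    ordered = sorted(rows, key=sort_key, reverse=True)
    top = ordered[:n]
    bottom = ordered[::-1][:n]  # worst season first
    med = median(r[metric] for r in rows)
    middle = sorted(rows, key=lambda r: abs(r[metric] - med))[:n]
    return top, middle, bottom


# Five hypothetical player-seasons with Game Score/36 values 1.0 to 5.0.
rows = [
    {"player": f"P{i}", "game_score_per36": float(i), "season_pie": 0.05,
     "total_minutes": 1000, "ts_pct": 0.55}
    for i in range(1, 6)
]
top, middle, bottom = rank_seasons(rows, "game_score_per36", n=1)
print(top[0]["player"], middle[0]["player"], bottom[0]["player"])  # P5 P3 P1
```

Ranking by distance to the median rather than the mean keeps the "middle" group anchored to a typical rotation player even when a handful of superstar seasons stretch the upper tail.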

---

## Appendix: Technical Implementation

### Predictive Analytics – 2025–26 Season Forecasts

#### Machine Learning Pipeline

I completed the main rankings quickly enough to use the remaining time to build forecasts for the 2025–26 season. I developed a simple prediction system using lagged features to project both PIE and Game Score per 36.

- **Key Features:**
  - 39 engineered features including advanced metrics, usage rates, and performance consistency
  - Lag-based approach using prior season performance as primary predictors
  - Feature importance filtering to prevent overfitting
  - Random Forest models optimized for each target metric

- **Model Performance:**
  - PIE Prediction Model: R² = 0.852, RMSE = 0.011
    - Top predictors (permutation importance): Previous PIE (1.063), Production/36 lag (0.003), Free throw attempts lag (0.002)
  - Game Score/36 Prediction Model: R² = 0.707, RMSE = 2.161
    - Top predictors (permutation importance): Production/36 lag (0.809), Offensive impact lag (0.058), Two-way impact lag (0.007)
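
The lag-based approach pairs each player-season with the same player's previous season. A minimal sketch follows, assuming rows are dicts keyed by `personId` and `season_start_year` and using the `_lag1` suffix from the project's feature names. The `add_lag1` helper itself is hypothetical, and dropping rows without a prior season is one way to implement a complete-case filter.

```python
def add_lag1(rows, columns):
    """Attach `<col>_lag1` values taken from the same player's previous season.

    Rows with no prior season are dropped, which acts as a complete-case filter.
    """
    by_key = {(r["personId"], r["season_start_year"]): r for r in rows}
    out = []
    for r in rows:
        prev = by_key.get((r["personId"], r["season_start_year"] - 1))
        if prev is None:
            continue  # rookie season or gap year: no lag available
        lagged = dict(r)
        for c in columns:
            lagged[f"{c}_lag1"] = prev[c]
        out.append(lagged)
    return out


# Hypothetical data: player 1 has two consecutive seasons, player 2 has one.
rows = [
    {"personId": 1, "season_start_year": 2022, "season_pie": 0.10},
    {"personId": 1, "season_start_year": 2023, "season_pie": 0.12},
    {"personId": 2, "season_start_year": 2023, "season_pie": 0.08},
]
train = add_lag1(rows, ["season_pie"])
print(train)
```

Only player 1's 2023 row survives, carrying the 2022 PIE as `season_pie_lag1`. This is also why the first season in the data cannot be used as a training target.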

#### 2025–26 Projections

**PIE Leaderboard (Projected)**

| Rank | Player                  | Predicted PIE |
|------|-------------------------|--------------|
| 1    | Joel Embiid             | 0.1482       |
| 2    | Nikola Jokic            | 0.1474       |
| 3    | Giannis Antetokounmpo   | 0.1432       |
| 4    | Shai Gilgeous-Alexander | 0.1399       |
| 5    | Luka Doncic             | 0.1396       |

**Game Score/36 Leaderboard (Projected)**

| Rank | Player                | Predicted GS/36 |
|------|-----------------------|-----------------|
| 1    | Giannis Antetokounmpo | 27.68           |
| 2    | Nikola Jokic          | 27.36           |
| 3    | Joel Embiid           | 25.37           |
| 4    | Anthony Davis         | 24.91           |
| 5    | Luka Doncic           | 24.02           |

#### Key Insights

- **Historical Patterns:**
  - PIE leaders = high-usage stars in their prime (LeBron 2010–11 through 2013–14, Westbrook’s 2016–17 triple-double season).
  - Game Score/36 leaders = modern bigs (Jokic, Embiid, Giannis).
  - Divergence shows two routes to excellence: all-around dominance vs. efficient per-minute production.
- **Predictive Insights:**
  - Previous season performance is the strongest predictor.
  - PIE persistence is higher (R² = 0.85) → stable measure.
  - Game Score/36 is more volatile due to injuries and role changes.

#### Future Production Enhancements

- **Infrastructure & Orchestration**
  - Apache Airflow for automated retraining
  - Apache Kafka for real-time streaming
  - MLflow for experiment tracking
  - FastAPI server for real-time predictions

- **Advanced Analytics & Features**
  - Integrate PER, VORP, EWA
  - SHAP/LIME for explainability
  - Recursive Feature Elimination (RFE)
  - Bayesian hierarchical models for pooling

- **Model Enhancement**
  - Position/team/league-level predictions
  - Uncertainty quantification with intervals
  - Real-time updating
  - Ensembles combining multiple models

- **User Experience**
  - React/Vite dashboards
  - Predicted vs. historical side-by-side
  - Custom metric builders for scouts
  - Mobile-friendly delivery

- **Production Monitoring**
  - Model drift detection
  - A/B testing framework
  - Performance dashboards
  - Automated alerts on performance changes

#### Implementation Details

- **Code Structure**
  1. `data_loader.py` – preprocessing and filtering
  2. `feature_engineering.py` – advanced + lagged metrics
  3. `ml_pipeline.py` – Random Forest training, feature filtering
  4. `prediction_engine.py` – 2025–26 forecasting
  5. `ranking_system.py` – Top/Middle/Bottom 10 with tiebreakers

- **Feature Engineering:**
  - Advanced metrics: TS%, eFG%, usage, ratings
  - Consistency: variance of game scores
  - Context: minutes tier, team win%, season experience
  - Lag features: prior-season performance

- **Model Architecture:**
  - Random Forest Regressor, 100 estimators
  - Permutation importance, 10-fold CV
  - Minimum importance threshold = 0.001 (features below it are dropped)
  - Separate models per metric

- **Data Quality Assurance:**
  - Lag validation across seasons
  - Missing data handled with complete cases
  - Outlier validation
  - Reproducible with fixed seeds/versioning
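
The permutation-importance filter listed under Model Architecture can be illustrated without any ML library. This is a sketch of the technique only: the project uses a Random Forest, whereas `predict` here is a stand-in function, and the helper names and toy data are assumptions.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=5, seed=42):
    """Mean drop in R² when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - r2_score(y, [predict(r) for r in shuffled]))
        importances.append(sum(drops) / n_repeats)
    return importances


def keep_features(names, importances, threshold=0.001):
    """Keep features whose importance reaches the minimum threshold."""
    return [n for n, imp in zip(names, importances) if imp >= threshold]


# Toy data: the target depends only on the first column; the second is noise.
X = [[float(i), float((i * 7) % 5)] for i in range(30)]
y = [2.0 * row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model that uses only the first feature."""
    return 2.0 * row[0]


imps = permutation_importance(predict, X, y)
print(keep_features(["season_pie_lag1", "noise"], imps))
```

Because importance is measured as a drop in R², a dominant predictor can score above 1 when shuffling it makes predictions worse than simply guessing the mean.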

---

## Detailed Rankings (With Context)

### PIE

#### Top 10 Seasons by PIE (with context)

| Rank | Player Name      | Season   | PIE      | PIE%  | PIE Num | PIE Den   | Points | FGM | FTM | FGA | FTA | DREB | OREB | AST | STL | BLK | PF  | TOV | Minutes  |
|------|------------------|----------|----------|-------|---------|-----------|--------|-----|-----|-----|-----|------|------|-----|-----|-----|-----|-----|----------|
| 1    | LeBron James     | 2012-13  | 0.174857 | 17.49 | 2254.0  | 12890.5   | 2036.0 | 765 | 403 |1354 | 535 | 513  | 97   | 551 |129  | 67  |110  |226  | 2835.00  |
| 2    | LeBron James     | 2011-12  | 0.174495 | 17.45 | 1683.0  | 9645.0    | 1683.0 | 621 | 387 |1169 | 502 | 398  | 94   | 387 |115  | 50  | 96  |213  | 2297.00  |
| 3    | Russell Westbrook| 2016-17  | 0.171554 | 17.16 | 2466.0  | 14374.5   | 2558.0 | 824 | 710 |1941 | 840 | 727  |137   | 840 |132  | 31  |190  |438  | 2761.00  |
| 4    | Kevin Durant     | 2013-14  | 0.169186 | 16.92 | 2339.5  | 13828.0   | 2593.0 | 849 | 703 |1688 | 805 | 540  | 58   | 445 |103  | 59  |174  |285  | 3087.00  |
| 5    | Nikola Jokic     | 2021-22  | 0.166170 | 16.62 | 2536.5  | 15264.5   | 2004.0 | 764 | 379 |1311 | 468 | 813  |206   | 584 |109  | 63  |191  |281  | 2440.00  |
| 6    | Nikola Jokic     | 2024-25  | 0.163343 | 16.33 | 2670.5  | 16349.0   | 2071.0 | 786 | 361 |1364 | 451 | 692  |200   | 716 |127  | 45  |160  |230  | 2557.28  |
| 7    | Kevin Durant     | 2012-13  | 0.162572 | 16.26 | 2243.5  | 13800.0   | 2280.0 | 731 | 679 |1433 | 750 | 594  | 46   | 374 |116  |105  |143  |280  | 3076.00  |
| 8    | Joel Embiid      | 2023-24  | 0.162201 | 16.22 | 1195.5  | 7370.5    | 1217.0 | 412 | 342 | 768 | 389 | 302  | 80   | 197 | 39  | 59  | 96  |130  | 1140.00  |
| 9    | LeBron James     | 2010-11  | 0.161252 | 16.13 | 2030.0  | 12589.0   | 2111.0 | 758 | 503 |1485 | 663 | 510  | 80   | 554 |124  | 50  |163  |284  | 3027.00  |
| 10   | LeBron James     | 2013-14  | 0.158951 | 15.90 | 2075.5  | 13057.5   | 2089.0 | 767 | 439 |1353 | 585 | 452  | 81   | 488 |121  | 26  |126  |270  | 2863.00  |

#### Middle 10 Seasons by PIE (with context)

| Rank | Player Name        | Season   | PIE      | PIE%  | PIE Num | PIE Den   | Points | FGM | FTM | FGA | FTA | DREB | OREB | AST | STL | BLK | PF  | TOV | Minutes  |
|------|--------------------|----------|----------|-------|---------|-----------|--------|-----|-----|-----|-----|------|------|-----|-----|-----|-----|-----|----------|
| 1    | Patrick Beverley   | 2018-19  | 0.044128 | 4.41  | 674.5   | 15285.0   | 596.0  | 194 | 96  | 477 | 123 | 312  | 76   | 300 | 67  | 43  |265  | 85  | 2097.00  |
| 2    | Bub Carrington     | 2024-25  | 0.044140 | 4.41  | 751.5   | 17025.5   | 801.0  | 297 | 69  | 743 | 85  | 303  | 33   | 360 | 53  | 20  |190  |140  | 2419.13  |
| 3    | Omer Asik          | 2013-14  | 0.044143 | 4.41  | 377.0   | 8540.5    | 280.0  | 101 | 78  | 190 |126  | 277  |101   | 25  | 14  | 37  | 92  | 59  | 947.00   |
| 4    | J.J. Barea         | 2010-11  | 0.044121 | 4.41  | 601.0   | 13621.5   | 769.0  | 285 |133  | 649 |157  | 130  | 29   | 317 | 30  |  1  |136  |136  | 1631.00  |
| 5    | Carmelo Anthony    | 2020-21  | 0.044149 | 4.41  | 699.5   | 15844.0   |1035.0  | 367 |155  | 870 |176  | 221  | 41   | 110 | 53  | 42  |168  | 69  | 1894.00  |
| 6    | Earl Clark         | 2012-13  | 0.044116 | 4.41  | 430.0   | 9747.0    | 426.0  | 170 | 51  | 386 | 74  | 242  | 82   | 65  | 36  | 44  |102  | 61  | 1332.00  |
| 7    | Cory Joseph        | 2015-16  | 0.044110 | 4.41  | 588.5   | 13341.5   | 677.0  | 257 |133  | 585 |174  | 171  | 39   | 250 | 63  | 20  |131  |102  | 2003.00  |
| 8    | Jaxson Hayes       | 2021-22  | 0.044106 | 4.41  | 609.0   | 13807.5   | 654.0  | 245 |144  | 398 |188  | 200  |115   | 43  | 33  | 55  |155  | 54  | 1363.00  |
| 9    | Quentin Grimes     | 2022-23  | 0.044102 | 4.41  | 627.5   | 14228.5   | 799.0  | 282 | 78  | 602 | 98  | 180  | 49   | 150 | 47  | 26  |177  | 69  | 2086.00  |
| 10   | Steve Novak        | 2011-12  | 0.044089 | 4.41  | 350.0   | 7938.5    | 477.0  | 161 | 22  | 336 | 26  | 95   | 9    | 12  | 16  |  9  | 59  | 21  | 998.00   |

#### Bottom 10 Seasons by PIE (with context)

| Rank | Player Name        | Season   | PIE      | PIE%  | PIE Num | PIE Den   | Points | FGM | FTM | FGA | FTA | DREB | OREB | AST | STL | BLK | PF  | TOV | Minutes  |
|------|--------------------|----------|----------|-------|---------|-----------|--------|-----|-----|-----|-----|------|------|-----|-----|-----|-----|-----|----------|
| 1    | Ronnie Price       | 2010-11  | 0.004207 | 0.42  | 40.5    | 9627.5    | 197.0  | 74  | 29  | 210 | 39  | 39   | 22   | 56  | 42  |  5  |106  | 55  | 689.00   |
| 2    | Gary Harris        | 2014-15  | 0.004483 | 0.45  | 41.0    | 9146.5    | 188.0  | 66  | 35  | 217 | 47  | 43   | 21   | 29  | 39  |  7  | 71  | 38  | 692.00   |
| 3    | Terrance Ferguson  | 2019-20  | 0.004869 | 0.49  | 46.5    | 9549.5    | 209.0  | 74  | 15  | 199 | 19  | 50   | 23   | 45  | 26  | 16  |144  | 30  |1146.00   |
| 4    | Kevin Seraphin     | 2010-11  | 0.005166 | 0.52  | 49.0    | 9486.0    | 154.0  | 66  | 22  | 147 | 31  | 72   | 80   | 10  | 17  | 28  |126  | 42  | 604.00   |
| 5    | Ryan Hollins       | 2011-12  | 0.005233 | 0.52  | 31.5    | 6020.0    | 131.0  | 46  | 39  | 84  | 75  | 48   | 34   |  9  |  5  | 17  | 78  | 35  | 505.00   |
| 6    | Jason Collins      | 2010-11  | 0.005796 | 0.58  | 44.5    | 7677.5    | 96.0   | 34  | 27  | 71  | 41  | 72   | 30   | 22  |  9  |  9  | 97  | 26  | 570.00   |
| 7    | Doron Lamb         | 2012-13  | 0.005845 | 0.58  | 46.0    | 7870.0    | 154.0  | 60  | 20  |163  | 34  | 38   |  8   | 32  | 13  |  0  | 52  | 26  | 560.00   |
| 8    | Rashad Vaughn      | 2015-16  | 0.007004 | 0.70  | 86.5    |12350.5    | 217.0  | 81  | 12  |266  | 15  | 77   | 11   | 39  | 29  | 16  | 73  | 28  | 965.00   |
| 9    | Terrance Ferguson  | 2017-18  | 0.007394 | 0.74  | 80.5    |10887.0    | 189.0  | 70  |  9  |169  | 10  | 28   | 19   | 19  | 24  | 10  | 83  | 11  | 730.00   |
| 10   | John Lucas III     | 2013-14  | 0.007493 | 0.75  | 51.0    | 6806.0    | 159.0  | 62  | 10  |190  | 16  | 27   | 12   | 42  | 14  |  0  | 41  | 22  | 572.00   |

---

### Game Score

#### Top 10 Seasons by Game Score per 36 (with context)

| Rank | Player Name              | Season   | GS/36   | PTS/36 | FGM/36 | FGA/36 | FTM/36 | FTA/36 | OREB/36 | DREB/36 | STL/36 | AST/36 | BLK/36 | PF/36 | TOV/36 |
|------|--------------------------|----------|---------|--------|--------|--------|--------|--------|---------|---------|--------|--------|--------|--------|--------|
| 1    | Joel Embiid              | 2023-24  | 32.2674 | 38.432 | 13.011 | 24.253 | 10.800 | 12.284 | 2.526   | 9.537   | 1.232  | 6.221  | 1.863  | 3.032  | 4.105  |
| 2    | Nikola Jokic             | 2024-25  | 29.6739 | 29.154 | 11.065 | 19.202 | 5.082  | 6.349  | 2.815   | 9.742   | 1.788  |10.079  | 0.633  | 2.252  | 3.238  |
| 3    | Giannis Antetokounmpo    | 2019-20  | 29.1506 | 35.185 | 12.985 | 23.626 | 7.502  |11.864  | 2.670   |13.597   | 1.223  | 6.849  | 1.203  | 3.629  | 4.301  |
| 4    | Nikola Jokic             | 2021-22  | 28.7543 | 29.567 | 11.272 | 19.343 | 5.592  | 6.905  | 3.039   |11.995   | 1.608  | 8.616  | 0.930  | 2.818  | 4.146  |
| 5    | Shai Gilgeous-Alexander  | 2024-25  | 28.3994 | 34.434 | 11.932 | 23.093 | 8.303  | 9.251  | 0.921   | 4.344   | 1.828  | 6.708  | 1.058  | 2.296  | 2.557  |
| 6    | Giannis Antetokounmpo    | 2021-22  | 28.3919 | 33.213 | 11.430 | 20.654 | 9.174  |12.708  | 2.223   |10.684   | 1.194  | 6.437  | 1.510  | 3.517  | 3.633  |
| 7    | Joel Embiid              | 2022-23  | 28.3874 | 34.912 | 11.643 | 21.239 |10.571  |12.331  | 1.807   | 8.908   | 1.056  | 4.382  | 1.791  | 3.279  | 3.614  |
| 8    | Nikola Jokic             | 2022-23  | 28.2147 | 26.591 | 10.164 | 16.080 | 5.365  | 6.530  | 2.628   |10.227   | 1.369  |10.668  | 0.740  | 2.738  | 3.886  |
| 9    | Giannis Antetokounmpo    | 2023-24  | 28.2119 | 31.862 | 12.057 | 19.805 | 7.296  |11.124  | 2.894   | 9.366   | 1.198  | 6.861  | 1.151  | 2.971  | 3.532  |
| 10   | Nikola Jokic             | 2023-24  | 27.9987 | 27.592 | 10.947 | 18.635 | 4.641  | 5.685  | 2.892   | 9.959   | 1.439  | 9.367  | 0.931  | 2.596  | 3.132  |

#### Middle 10 Seasons by Game Score per 36 (with context)

| Rank | Player Name            | Season   | GS/36   | PTS/36 | FGM/36 | FGA/36 | FTM/36 | FTA/36 | OREB/36 | DREB/36 | STL/36 | AST/36 | BLK/36 | PF/36 | TOV/36 |
|------|------------------------|----------|---------|--------|--------|--------|--------|--------|---------|---------|--------|--------|--------|--------|--------|
| 1    | Jabari Walker          | 2023-24  | 11.6792 | 13.739 | 5.037  |10.973  | 2.788  | 3.688  | 3.283   | 7.690   | 0.899  | 1.574  | 0.450  | 3.733  | 1.462  |
| 2    | Roy Hibbert            | 2014-15  | 11.6767 | 15.284 | 6.041  |13.531  | 3.202  | 3.888  | 2.973   | 7.318   | 0.343  | 1.601  | 2.382  | 4.116  | 2.039  |
| 3    | Max Strus              | 2024-25  | 11.6758 | 13.439 | 4.708  |10.643  | 0.799  | 0.970  | 1.512   | 4.679   | 0.742  | 4.508  | 0.342  | 2.967  | 1.541  |
| 4    | Sam Hauser             | 2023-24  | 11.6745 | 14.772 | 5.166  |11.645  | 0.362  | 0.385  | 0.816   | 4.916   | 0.884  | 1.699  | 0.566  | 2.152  | 0.657  |
| 5    | Julian Champagnie      | 2024-25  | 11.6744 | 15.330 | 5.181  |12.497  | 1.649  | 1.824  | 1.261   | 4.754   | 1.164  | 2.018  | 0.660  | 2.154  | 1.397  |
| 6    | Justise Winslow        | 2021-22  | 11.6832 | 13.152 | 5.280  |12.336  | 1.872  | 3.168  | 2.400   | 7.296   | 1.680  | 4.032  | 1.200  | 3.216  | 2.352  |
| 7    | Naji Marshall          | 2023-24  | 11.6723 | 13.401 | 5.025  |10.746  | 1.707  | 2.244  | 1.422   | 5.373   | 1.454  | 3.856  | 0.316  | 2.718  | 1.896  |
| 8    | Shabazz Napier         | 2017-18  | 11.6712 | 15.456 | 5.352  |12.744  | 2.784  | 3.312  | 0.624   | 3.456   | 1.944  | 3.600  | 0.336  | 2.016  | 2.160  |
| 9    | Terance Mann           | 2021-22  | 11.6696 | 13.787 | 5.281  |10.909  | 2.024  | 2.593  | 1.644   | 5.075   | 0.870  | 3.304  | 0.332  | 2.814  | 1.328  |
| 10   | Michael Carter-Williams| 2014-15  | 11.6677 | 16.435 | 6.193  |15.635  | 3.437  | 4.951  | 1.089   | 4.917   | 1.888  | 7.520  | 0.510  | 2.841  | 4.304  |

#### Bottom 10 Seasons by Game Score per 36 (with context)

| Rank | Player Name        | Season   | GS/36   | PTS/36 | FGM/36 | FGA/36 | FTM/36 | FTA/36 | OREB/36 | DREB/36 | STL/36 | AST/36 | BLK/36 | PF/36 | TOV/36 |
|------|--------------------|----------|---------|--------|--------|--------|--------|--------|---------|---------|--------|--------|--------|--------|--------|
| 1    | DeShawn Stevenson  | 2011-12  | 3.2811  | 5.686  | 1.883  | 6.608  | 0.346  | 0.615  | 0.269   | 3.612   | 0.730  | 1.575  | 0.154  | 2.267  | 0.730  |
| 2    | Terrance Ferguson  | 2019-20  | 3.4524  | 6.565  | 2.325  | 6.251  | 0.471  | 0.597  | 0.723   | 1.571   | 0.817  | 1.414  | 0.503  | 4.524  | 0.942  |
| 3    | Stephen Graham     | 2010-11  | 3.5149  | 7.540  | 3.093  | 7.695  | 1.199  | 1.469  | 0.657   | 4.099   | 0.541  | 1.547  | 0.039  | 4.099  | 1.469  |
| 4    | Mike Miller        | 2014-15  | 3.6801  | 5.822  | 1.976  | 6.089  | 0.160  | 0.214  | 0.214   | 4.647   | 0.748  | 2.457  | 0.214  | 3.953  | 1.228  |
| 5    | Shawne Williams    | 2011-12  | 3.7728  | 8.136  | 3.024  |10.584  | 0.576  | 0.792  | 1.440   | 3.456   | 0.720  | 1.152  | 0.792  | 3.168  | 0.936  |
| 6    | Rashad Vaughn      | 2015-16  | 3.8462  | 8.095  | 3.022  | 9.923  | 0.448  | 0.560  | 0.410   | 2.873   | 1.082  | 1.455  | 0.597  | 2.723  | 1.045  |
| 7    | Semi Ojeleye       | 2017-18  | 3.9658  | 6.313  | 2.104  | 6.086  | 0.809  | 1.327  | 1.198   | 4.014   | 0.680  | 0.647  | 0.129  | 2.946  | 0.809  |
| 8    | Jason Collins      | 2010-11  | 3.9663  | 6.063  | 2.147  | 4.484  | 1.705  | 2.589  | 1.895   | 4.547   | 0.568  | 1.389  | 0.568  | 6.126  | 1.642  |
| 9    | Doron Lamb         | 2012-13  | 4.1079  | 9.900  | 3.857  |10.479  | 1.286  | 2.186  | 0.514   | 2.443   | 0.836  | 2.057  | 0.000  | 3.343  | 1.671  |
| 10   | Anthony Brown      | 2015-16  | 4.1775  | 7.065  | 2.396  | 7.741  | 1.044  | 1.229  | 0.553   | 3.747   | 0.860  | 1.167  | 0.307  | 2.089  | 0.922  |

---


In [1]:
%%writefile src/heat_data_scientist_2025/utils/config.py
# src/heat_data_scientist_2025/utils/config.py
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Tuple, List, Dict, Optional
import difflib
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field

class Settings(BaseSettings):
    # Allow overriding via .env or real environment variables;
    # repo_root defaults to three directory levels above this file.
    repo_root: Path = Path(__file__).resolve().parents[3]
    heat_data_root: Path = Field(default=Path("data"), alias="HEAT_DATA_ROOT")

    model_config = SettingsConfigDict(
        env_file=".env",          # loads .env if present
        env_prefix="",            # we already use explicit aliases
        case_sensitive=False,     # friendlier on Windows
        extra="ignore",           # ignore unknown envs
    )

    @property
    def data_root(self) -> Path:
        return (self.repo_root / self.heat_data_root
                if not self.heat_data_root.is_absolute()
                else self.heat_data_root)


S = Settings()

_REPO_ROOT = S.repo_root
DATA_ROOT  = S.data_root

RAW_DIR =       DATA_ROOT / "raw" / "heat_data_scientist_2025"
PROCESSED_DIR = DATA_ROOT / "processed" / "heat_data_scientist_2025"
QUALITY_DIR =   DATA_ROOT / "quality" / "quality_reports"
ML_DATASET_PATH = PROCESSED_DIR / "nba_ml_dataset.parquet"
NBA_CATALOG_PATH = PROCESSED_DIR / "nba_data_catalog.md"
SQLITE_PATH = PROCESSED_DIR / "nba.sqlite"
RANKINGS_PATH = PROCESSED_DIR / "nba_rankings_results.txt"

# EDA needs
EDA_OUT_DIR = PROCESSED_DIR / "eda"

# ML Pipeline paths
ML_MODELS_DIR = PROCESSED_DIR / "models"
ML_PREDICTIONS_DIR = PROCESSED_DIR / "predictions"
ML_EVALUATION_DIR = PROCESSED_DIR / "evaluation"

# production yaml
PROJECT_ROOT: Path = Path("src")
COLUMN_SCHEMA_PATH: Path = PROJECT_ROOT / "heat_data_scientist_2025" / "data" / "column_schema.yaml"

# Project Settings:
start_season = 2009
end_season = 2024
final_top_data_amt = 10
season_type = 'Regular Season'
minutes_total_minimum_per_season = 500

# --- ML Pipeline Configuration ---
class MLPipelineConfig:
    """Configuration for ML Pipeline automation"""
    
    # Target and prediction settings
    TARGET_COLUMN = "season_pie"
    PREDICTION_YEAR = 2025
    SOURCE_YEAR = 2024  # Use 2024 data to predict 2025
    
    # Training settings
    TRAIN_START_YEAR = 2011  # First year with reliable lag features
    TEST_YEARS = [2023, 2024]  # Hold out for validation
    
    # Model settings
    DEFAULT_STRATEGY = "filter_complete"  # "filter_complete", "two_stage", "auto"
    RANDOM_STATE = 42
    
    # Required lag features for complete cases
    REQUIRED_LAG_FEATURES = [
        "season_pie",
        "pts_per36", "ast_per36", "reb_per36",
        "ts_pct", "efg_pct",
        "usage_events_per_min",
        "games_played", "total_minutes",
        "defensive_per36", "production_per36",
        "win_pct", "team_win_pct_final"
    ]
    
    # Feature engineering settings
    NULL_STRATEGY = "diagnose_only"
    CREATE_LAG_YEARS = [1]  # Create lag1 features
    
    # Model parameters
    MODEL_PARAMS = {
        'random_forest': {
            'n_estimators': 200,
            'max_depth': 12,
            'min_samples_split': 5,
            'min_samples_leaf': 2,
            'random_state': RANDOM_STATE,
            'n_jobs': -1
        },
        'xgboost': {
            'n_estimators': 200,
            'max_depth': 8,
            'learning_rate': 0.05,
            'subsample': 0.8,
            'random_state': RANDOM_STATE,
            'n_jobs': -1
        }
    }
    
    # Evaluation settings
    EVALUATION_METRICS = ['r2', 'rmse', 'mae', 'mape']
    TOP_N_PREDICTIONS = 50  # Top N players for leaderboard
    
    # Feature importance settings
    MIN_FEATURE_IMPORTANCE = 0.001
    TOP_FEATURES_COUNT = 20
    
    # Output settings
    SAVE_ENGINEERED_DATA = True
    SAVE_MODELS = True
    SAVE_PREDICTIONS = True
    SAVE_EVALUATION = True
    
    # Automation settings
    AUTO_FEATURE_SELECTION = True
    AUTO_HYPERPARAMETER_TUNING = False  # Set to True for automated tuning
    CROSS_VALIDATION_FOLDS = 5

# Initialize ML config
ML_CONFIG = MLPipelineConfig()

# --- Kaggle dataset handle ---
KAGGLE_DATASET = "eoinamoore/historical-nba-data-and-player-box-scores"

# --- Tables we actually care about right now ---
IMPORTANT_TABLES = ["PlayerStatistics", "TeamStatistics"]

# --- Table -> CSV filename mapping (simple & explicit) ---
KAGGLE_TABLE_TO_CSV = {
    "Players": "Players.csv",
    "PlayerStatistics": "PlayerStatistics.csv",
    "TeamStatistics": "TeamStatistics.csv",
}
PARQUET_DIR = Path(DATA_ROOT) / "parquet_cache"

# Canonical ML export list (final parquet column order)
ML_EXPORT_COLUMNS = [
    # IDs & core season
    "personId", "player_name", "season",

    # playing time & outcomes
    "games_played", "total_minutes",
    "win_pct", "home_games_pct", "avg_plus_minus", "total_plus_minus",

    # efficiency & per-36
    "season_pie", "ts_pct", "fg_pct", "fg3_pct", "ft_pct",
    "pts_per36", "ast_per36", "reb_per36",
    "usage_per_min", "efficiency_per_game",

    # raw season totals
    "total_points", "total_assists", "total_rebounds",
    "total_steals", "total_blocks", "total_turnovers",
    "total_fgm", "total_fga", "total_ftm", "total_fta", "total_3pm", "total_3pa",

    # player bio & role
    "height", "bodyWeight", "draftYear", "draftRound", "draftNumber",
    "birthdate", "country", "position",

    # share-of-team season metrics
    "share_pts", "share_ast", "share_reb",
    "share_stl", "share_blk",
    "share_fga", "share_fgm",
    "share_3pa", "share_3pm",
    "share_fta", "share_ftm",
    "share_tov", "share_reb_off", "share_reb_def", "share_pf",
    "season_game_score_total", "game_score_per36",
]

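def _first_existing(candidates: Iterable[Path]) -> Path:
    # materialize first so a generator isn't exhausted before the error message is built
    candidates = list(candidates)
    for p in candidates:
        if p.exists():
            return p
    raise FileNotFoundError(
        "None of the candidate files exist:\n" + "\n".join(str(c) for c in candidates)
    )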


# training features
season_pie_numerical_features = [
    "season_start_year",
    # lagged features
    "season_pie_lag1", "ts_pct_lag1", "efg_pct_lag1", "fg_pct_lag1", "fg3_pct_lag1", "ft_pct_lag1",
    "pts_per36_lag1", "ast_per36_lag1", "reb_per36_lag1", "defensive_per36_lag1",
    "production_per36_lag1", "stocks_per36_lag1", "three_point_rate_lag1", "ft_rate_lag1",
    "pts_per_shot_lag1", "ast_to_tov_lag1", "usage_events_per_min_lag1", "usage_per_min_lag1",
    "games_played_lag1", "total_minutes_lag1", "total_points_lag1", "total_assists_lag1",
    "total_rebounds_lag1", "total_steals_lag1", "total_blocks_lag1", "total_fga_lag1",
    "total_fta_lag1", "total_3pa_lag1", "total_3pm_lag1", "total_tov_lag1", "win_pct_lag1",
    "avg_plus_minus_lag1", "team_win_pct_final_lag1",
    "offensive_impact_lag1", "two_way_impact_lag1", "efficiency_volume_score_lag1",
    "versatility_score_lag1", "shooting_score_lag1"
]
# same lag-1 feature set as season_pie; copied so each target's list can diverge later
game_score_per36_numerical_features = list(season_pie_numerical_features)

nominal_categoricals = []
ordinal_categoricals = ["minutes_tier"] 
y_variables = ["season_pie", "game_score_per36"]



@dataclass(frozen=True)
class Paths:
    repo_root: Path = _REPO_ROOT
    data_root: Path = DATA_ROOT
    raw_dir: Path = RAW_DIR
    processed_dir: Path = PROCESSED_DIR
    quality_dir: Path = QUALITY_DIR
    ml_dataset_path: Path = ML_DATASET_PATH
    nba_catalog_path: Path = NBA_CATALOG_PATH
    sqlite_path: Path = SQLITE_PATH
    rankings_path: Path = RANKINGS_PATH
    eda_out_dir: Path = EDA_OUT_DIR
    column_schema_path: Path = COLUMN_SCHEMA_PATH
    
    # ML Pipeline paths
    ml_models_dir: Path = ML_MODELS_DIR
    ml_predictions_dir: Path = ML_PREDICTIONS_DIR
    ml_evaluation_dir: Path = ML_EVALUATION_DIR

    def ensure_ml_dirs(self) -> None:
        """Create ML pipeline directories if they don't exist."""
        self.ml_models_dir.mkdir(parents=True, exist_ok=True)
        self.ml_predictions_dir.mkdir(parents=True, exist_ok=True)
        self.ml_evaluation_dir.mkdir(parents=True, exist_ok=True)

    def csv(self, name: str) -> Path:
        """Return a path under RAW_DIR for an exact filename, with rich diagnostics on failure."""
        p = (self.raw_dir / name).resolve()
        if p.exists():
            return p

        raw_exists = self.raw_dir.exists()
        all_csvs = []
        if raw_exists:
            try:
                all_csvs = sorted([q.name for q in self.raw_dir.glob("*.csv")])
            except Exception:
                all_csvs = []

        suggestions = difflib.get_close_matches(name, all_csvs, n=5, cutoff=0.5) if all_csvs else []

        lines = [
            f"Expected CSV not found: {p}",
            f"RAW_DIR: {self.raw_dir}  (exists={raw_exists})",
            f"DATA_ROOT (override with HEAT_DATA_ROOT): {self.data_root}",
            f"CSV files found in RAW_DIR ({len(all_csvs)}): {all_csvs[:25]}{' ...' if len(all_csvs) > 25 else ''}",
        ]
        if suggestions:
            lines.append(f"Closest names: {suggestions}")
        lines.append("If your data lives elsewhere, set the environment variable HEAT_DATA_ROOT to that base folder.")
        raise FileNotFoundError("\n".join(lines))

    def csv_any(self, *names: str) -> Path:
        return _first_existing([(self.raw_dir / n).resolve() for n in names])

    # canonical asset getters (kaggle tables + optional league schedule)
    def players_csv(self) -> Path:           return self.csv("Players.csv")
    def playerstats_csv(self) -> Path:       return self.csv("PlayerStatistics.csv")
    def teamstats_csv(self) -> Path:         return self.csv("TeamStatistics.csv")
    def games_csv(self) -> Path:             return self.csv("Games.csv")
    def teams_csv(self) -> Path:             return self.csv("Teams.csv")
    def team_histories_csv(self) -> Path:    return self.csv_any("CoachHistory.csv", "Coaches.csv")
    def league_schedule_csv(self) -> Path:   return (self.raw_dir / "LeagueSchedule24_25.csv")

    # ML Pipeline file getters
    def model_path(self, model_name: str, year: int) -> Path:
        """Get path for saving/loading trained models."""
        return self.ml_models_dir / f"{model_name}_{year}_model.pkl"
    
    def predictions_path(self, target: str, year: int | None = None) -> Path:
        """
        Flexible getter for per-target predictions path.

        Accepts either:
        - (target, year) e.g., ("season_pie", 2025)
        - single string with trailing year, e.g., "season_pie_2025"
        - single string that already includes a *_predictions_YYYY tag, we'll normalize it

        Produces: .../predictions/{target}_predictions_{year}.parquet
        """
        self.ensure_ml_dirs()

        safe_target = str(target).strip().lower()

        # If year isn't separately provided, try to parse it from the 'target' string
        if year is None:
            # tolerate patterns like "season_pie_2025" or "season_pie_predictions_2025"
            import re
            m = re.search(r'(\d{4})$', safe_target)
            if m:
                year = int(m.group(1))
                # strip known suffixes to get the target core
                safe_target = re.sub(r'(_predictions)?_\d{4}$', '', safe_target)
            else:
                raise TypeError("predictions_path() requires 'year' or a target string ending with a 4-digit year.")

        return self.ml_predictions_dir / f"{safe_target}_predictions_{int(year)}.parquet"


    def leaderboard_path(self, target: str, year: int | None = None) -> Path:
        """
        Flexible getter for per-target leaderboard path.

        Accepts either:
        - (target, year) e.g., ("game_score_per36", 2025)
        - single string with trailing year, e.g., "game_score_per36_2025"
        - single string that already includes a *_leaderboard_YYYY tag, we'll normalize it

        Produces: .../predictions/{target}_leaderboard_{year}.csv
        """
        self.ensure_ml_dirs()

        safe_target = str(target).strip().lower()

        if year is None:
            import re
            m = re.search(r'(\d{4})$', safe_target)
            if m:
                year = int(m.group(1))
                safe_target = re.sub(r'(_leaderboard)?_\d{4}$', '', safe_target)
            else:
                raise TypeError("leaderboard_path() requires 'year' or a target string ending with a 4-digit year.")

        return self.ml_predictions_dir / f"{safe_target}_leaderboard_{int(year)}.csv"

    
    def evaluation_path(self, model_name: str, year: int) -> Path:
        """Get path for saving/loading evaluation results."""
        return self.ml_evaluation_dir / f"{model_name}_{year}_evaluation.json"
    
    def feature_importance_path(self, model_name: str, year: int) -> Path:
        """Get path for saving/loading feature importance."""
        return self.ml_evaluation_dir / f"{model_name}_{year}_feature_importance.csv"

CFG = Paths()

if __name__ == "__main__":
    print("Config paths:")
    print(f"Raw dir: {CFG.raw_dir}")
    print(f"Processed dir: {CFG.processed_dir}")
    print(f"ML models dir: {CFG.ml_models_dir}")
    print(f"ML predictions dir: {CFG.ml_predictions_dir}")
    print(f"ML evaluation dir: {CFG.ml_evaluation_dir}")
    
    print("\nML Config:")
    print(f"Target column: {ML_CONFIG.TARGET_COLUMN}")
    print(f"Prediction year: {ML_CONFIG.PREDICTION_YEAR}")
    print(f"Required lag features: {len(ML_CONFIG.REQUIRED_LAG_FEATURES)}")


Overwriting src/heat_data_scientist_2025/utils/config.py
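
The flexible getters `predictions_path()` and `leaderboard_path()` in the config above accept either a `(target, year)` pair or a single string ending in a four-digit year. The snippet below is a minimal standalone sketch of that parsing rule, useful for checking which inputs normalize to which filename. The helper name `split_target_and_year` and its `tag` parameter are hypothetical (they are not part of `config.py`); only the regular expressions mirror the ones used there.

```python
import re
from typing import Optional, Tuple


def split_target_and_year(
    target: str, year: Optional[int] = None, tag: str = "predictions"
) -> Tuple[str, int]:
    """Hypothetical helper: return (target_core, year) using the same rule as the path getters."""
    safe_target = str(target).strip().lower()
    if year is None:
        # The year must be the last four characters of the string.
        m = re.search(r"(\d{4})$", safe_target)
        if m is None:
            raise TypeError("need 'year' or a target string ending with a 4-digit year")
        year = int(m.group(1))
        # Drop "_<year>" and, if present, the "_<tag>" suffix in front of it.
        safe_target = re.sub(rf"(_{re.escape(tag)})?_\d{{4}}$", "", safe_target)
    return safe_target, int(year)


for raw in ("season_pie_2025", "Season_PIE_predictions_2025"):
    core, yr = split_target_and_year(raw)
    print(f"{core}_predictions_{yr}.parquet")
```

Both inputs resolve to the same filename stem. A target such as `game_score_per36` passed without a year raises `TypeError`, because its trailing digits are only two characters long.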


In [6]:
%%writefile src/heat_data_scientist_2025/data/kaggle_pull.py
"""
quick pass:
- pull kaggle tables, compute per-game PIE -> roll to season -> rank top/mid/bottom
- rules: 2010+ seasons, >=500 minutes gate (applied at the *end*), tie-breakers as noted
- also spit out Game Score per 36 (gs36) companions
- keeping this all python-first; no duckdb materialization here
"""

from __future__ import annotations

from typing import Dict, Iterable
from pathlib import Path

import pandas as pd
import numpy as np
import kagglehub as kh
from kagglehub import KaggleDatasetAdapter as KDA

from src.heat_data_scientist_2025.utils.config import (
    CFG,
    KAGGLE_DATASET,
    IMPORTANT_TABLES,
    KAGGLE_TABLE_TO_CSV,
    start_season as CFG_START_SEASON,
    season_type as CFG_SEASON_TYPE,
    minutes_total_minimum_per_season as CFG_MIN_SEASON_MINUTES,
)

# ---------------------------
# Utilities
# ---------------------------

# note: season strings like 2010-11 based on aug–jul boundary
# aug or later -> season starts that calendar year; otherwise it started the previous year
def _season_from_timestamp(ts: pd.Series) -> pd.Series:
    """Create season strings like '2010-11' (Aug–Jul season boundary)."""
    dt = pd.to_datetime(ts, errors="coerce", utc=False)
    start_year = np.where(dt.dt.month >= 8, dt.dt.year, dt.dt.year - 1)
    end_year = start_year + 1
    start_s = pd.Series(start_year, index=ts.index).astype(str)
    end_s = (pd.Series(end_year, index=ts.index) % 100).astype(str).str.zfill(2)
    return start_s + "-" + end_s

# tiny helper: divide but don't explode on zeros/NaNs
# returns 0 when denom <= 0; good enough for rate features here
def _safe_div(num: pd.Series, den: pd.Series) -> pd.Series:
    """Safe division: returns 0 where denom <= 0 or NaN."""
    den = den.fillna(0)
    out = pd.Series(np.zeros(len(num)), index=num.index, dtype="float64")
    mask = den > 0
    out.loc[mask] = (num[mask] / den[mask]).astype("float64")
    return out

# bulk numeric coercion; keep errors as NaN and move on
def _to_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    for c in cols:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce")
    return df

# quick join QA: sizes + unmatched keys sample; prints only, no exceptions
# i use this when debugging merges; safe to leave in
def _join_diagnostics(
    left: pd.DataFrame,
    right: pd.DataFrame,
    merged: pd.DataFrame,
    on: list[str] | str,
    how: str,
    left_name: str = "left",
    right_name: str = "right",
    key_sample_n: int = 5,
) -> None:
    """Compact report about a merge: size deltas and unmatched keys (QA only)."""
    on_cols = [on] if isinstance(on, str) else list(on)
    print(f"[JOIN] '{left_name}' x '{right_name}' on {on_cols} (how='{how}')")
    print(f"       {left_name} rows:  {len(left):,}")
    print(f"       {right_name} rows: {len(right):,}")
    print(f"       merged rows:       {len(merged):,}")

    if how in ("left", "inner"):
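        left_keys = left[on_cols].drop_duplicates()
        right_keys = right[on_cols].drop_duplicates()
        merged_keys = merged[on_cols].drop_duplicates()
        # compare against right_keys, not merged_keys: a left join keeps every
        # left key in `merged`, so unmatched keys would never show up otherwise
        left_only = (
            left_keys.merge(right_keys, on=on_cols, how="left", indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns=["_merge"])
        )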
        right_only = (
            right_keys.merge(merged_keys, on=on_cols, how="left", indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns=["_merge"])
        )
        if len(left_only):
            print(f"       ⚠ Unmatched in {right_name}: {len(left_only):,} key(s) from {left_name}")
            if key_sample_n > 0:
                print(left_only.head(key_sample_n))
        if len(right_only):
            print(f"       ℹ Extra keys in {right_name} not used: {len(right_only):,}")
            if key_sample_n > 0:
                print(right_only.head(key_sample_n))

# ---------------------------
# Load + filters
# ---------------------------

# rough filter pass:
# - tag seasons on both tables
# - keep only {season_type} + seasons >= start
# - compute season minutes, but *don’t* gate yet (see PIE bias note)
def enforce_criteria_python(
    players_df: pd.DataFrame | None,
    player_stats_df: pd.DataFrame,
    team_stats_df: pd.DataFrame,
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE,
    minutes_total_minimum_per_season: int = CFG_MIN_SEASON_MINUTES,
    defer_minutes_gate: bool = True,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Adds 'season' to player/team stats; filters player stats by gameType and start season
    (team stats are only season-tagged here, not filtered).
    Computes per (personId, season) minutes_total but defers gating by default.
    """
    ps = player_stats_df.copy()
    ps["season"] = _season_from_timestamp(ps["gameDate"])
    ts = team_stats_df.copy()
    ts["season"] = _season_from_timestamp(ts["gameDate"])

    ps = _to_numeric(ps, [
        "numMinutes","points","assists","blocks","steals",
        "fieldGoalsAttempted","fieldGoalsMade","freeThrowsAttempted","freeThrowsMade",
        "threePointersAttempted","threePointersMade",
        "reboundsDefensive","reboundsOffensive","reboundsTotal",
        "foulsPersonal","turnovers","plusMinusPoints","home","win"
    ])
    for bcol in ["home","win"]:
        if bcol in ps.columns:
            ps[bcol] = ps[bcol].fillna(0).astype(int)

    ps = ps.loc[
        (ps["gameType"] == season_type) &
        (ps["season"].str.slice(0, 4).astype(int) >= start_season)
    ].copy()

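    # DataFrame.get(col, "") returns a plain str when the column is missing,
    # which has no .fillna(); fall back to an empty-string Series instead
    fn = ps["firstName"].fillna("") if "firstName" in ps.columns else pd.Series("", index=ps.index)
    ln = ps["lastName"].fillna("") if "lastName" in ps.columns else pd.Series("", index=ps.index)
    ps["player_name"] = (fn.astype(str) + " " + ln.astype(str)).str.strip()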

    season_minutes = (
        ps.groupby(["personId","season"], as_index=False)["numMinutes"]
          .sum().rename(columns={"numMinutes": "minutes_total"})
    )
    ps = ps.merge(season_minutes, on=["personId","season"], how="left")

    if not defer_minutes_gate:
        ps = ps.loc[ps["minutes_total"] >= minutes_total_minimum_per_season].copy()

    players_out = players_df.copy() if players_df is not None else None
    return players_out, ps, ts

# ---------------------------
# PIE (per game) + season aggregation
# ---------------------------

# core math: build per-game PIE exactly once, *then* roll up
# heads-up: drop 0-minute rows; also sanity check that sum(pie) ~ 1 per game
def compute_player_game_pie(
    filtered_player_stats: pd.DataFrame,
    drop_zero_minute_games: bool = True,
    validate_game_sums: bool = True,
    atol: float = 1e-9,
    rtol: float = 1e-6,
) -> pd.DataFrame:
    """Compute per-game PIE; denominator is sum of numerators across ALL players in that game."""
    required = [
        "points","fieldGoalsMade","freeThrowsMade","fieldGoalsAttempted","freeThrowsAttempted",
        "reboundsDefensive","reboundsOffensive","assists","steals","blocks","foulsPersonal","turnovers",
        "gameId","numMinutes"
    ]
    missing = [c for c in required if c not in filtered_player_stats.columns]
    if missing:
        raise KeyError(f"Missing columns for PIE calculation: {missing}")

    df = filtered_player_stats.copy()
    df = _to_numeric(df, required)

    if drop_zero_minute_games:
        before = len(df)
        df = df.loc[df["numMinutes"] > 0].copy()
        if before != len(df):
            print(f"[PIE] Dropped zero-minute rows: {before - len(df):,} (kept {len(df):,})")

    df["pie_numerator"] = (
        df["points"]
        + df["fieldGoalsMade"]
        + df["freeThrowsMade"]
        - df["fieldGoalsAttempted"]
        - df["freeThrowsAttempted"]
        + df["reboundsDefensive"]
        + 0.5 * df["reboundsOffensive"]
        + df["assists"]
        + df["steals"]
        + 0.5 * df["blocks"]
        - df["foulsPersonal"]
        - df["turnovers"]
    )

    den = (
        df.groupby("gameId", as_index=False)["pie_numerator"]
          .sum().rename(columns={"pie_numerator": "pie_denominator"})
    )

    before_n = len(df)
    merged = df.merge(den, on="gameId", how="left", validate="many_to_one")
    _join_diagnostics(df, den, merged, on="gameId", how="left",
                      left_name="player_game", right_name="game_denominators")

    merged["pie_denominator"] = pd.to_numeric(merged["pie_denominator"], errors="coerce")
    merged = merged.loc[merged["pie_denominator"] > 0].copy()
    merged["game_pie"] = _safe_div(merged["pie_numerator"], merged["pie_denominator"])

    if validate_game_sums:
        sums = merged.groupby("gameId", as_index=False)["game_pie"].sum()
        bad = sums.loc[~np.isclose(sums["game_pie"], 1.0, rtol=rtol, atol=atol)]
        if not bad.empty:
            print(f"[PIE][QA] Σ(game_pie) != 1 within tol for {len(bad):,} game(s). "
                  f"Max deviation: {float((bad['game_pie'] - 1.0).abs().max()):.3e}")
            print(bad.head(5))

    after_n = len(merged)
    if after_n > before_n:
        raise ValueError(f"[PIE] Unexpected row gain after merge/filter: before={before_n}, after={after_n}")

    if merged[["pie_denominator","game_pie"]].isna().any().any():
        raise ValueError("[PIE] Nulls detected after join.")

    return merged
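

# worked example of the per-game PIE arithmetic above
# (illustrative numbers only, not taken from the dataset):
#   player line: 20 PTS, 8 FGM, 2 FTM, 15 FGA, 3 FTA, 5 DREB, 1 OREB,
#                4 AST, 1 STL, 0 BLK, 2 PF, 3 TOV
#   pie_numerator = 20 + 8 + 2 - 15 - 3 + 5 + 0.5*1 + 4 + 1 + 0.5*0 - 2 - 3 = 17.5
#   if the numerators of all players in that game sum to 175.0 (pie_denominator),
#   game_pie = 17.5 / 175.0 = 0.10, i.e. this player accounts for 10% of the game's PIE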

# season rollup:
# - totals, %s, PIE season-level, per-36s, usage/eff
# - pick primary team by minutes and attach final team win%
# - minutes gate at the end to avoid denominator bias
def build_player_season_table_python(
    player_game_with_pie: pd.DataFrame,
    team_stats_df: pd.DataFrame,
    minutes_total_minimum_per_season: int = CFG_MIN_SEASON_MINUTES,
) -> pd.DataFrame:
    """Aggregate player-game rows to player-season (PIE, TS%, per-36, GS36, team ctx)."""
    pg = player_game_with_pie.copy()

    pg = _to_numeric(pg, [
        "numMinutes","points","assists","blocks","steals",
        "reboundsDefensive","reboundsOffensive","reboundsTotal",
        "foulsPersonal","turnovers","fieldGoalsAttempted","fieldGoalsMade",
        "freeThrowsAttempted","freeThrowsMade","threePointersAttempted","threePointersMade",
        "plusMinusPoints","game_pie","home","win"
    ])

    g = pg.groupby(["personId","player_name","season"], as_index=False).agg(
        games_played=("gameId","nunique"),
        total_minutes=("numMinutes","sum"),
        total_points=("points","sum"),
        total_assists=("assists","sum"),
        total_rebounds=("reboundsTotal","sum"),
        total_oreb=("reboundsOffensive","sum"),
        total_dreb=("reboundsDefensive","sum"),
        total_blocks=("blocks","sum"),
        total_stls=("steals","sum"),
        total_pf=("foulsPersonal","sum"),
        total_tov=("turnovers","sum"),
        total_fga=("fieldGoalsAttempted","sum"),
        total_fgm=("fieldGoalsMade","sum"),
        total_fta=("freeThrowsAttempted","sum"),
        total_ftm=("freeThrowsMade","sum"),
        total_3pa=("threePointersAttempted","sum"),
        total_3pm=("threePointersMade","sum"),
        total_plus_minus=("plusMinusPoints","sum"),
        avg_plus_minus=("plusMinusPoints","mean"),
        wins=("win","sum"),
        home_games=("home","sum"),
        season_pie_num=("pie_numerator","sum"),
        season_pie_den=("pie_denominator","sum"),
    )

    # Shooting
    g["fg_pct"]  = _safe_div(g["total_fgm"], g["total_fga"])
    g["fg3_pct"] = _safe_div(g["total_3pm"], g["total_3pa"])
    g["ft_pct"]  = _safe_div(g["total_ftm"], g["total_fta"])
    g["ts_pct"]  = _safe_div(g["total_points"], 2.0 * (g["total_fga"] + 0.44 * g["total_fta"]))

    # PIE season (weighted) + percent-style display
    g["season_pie"] = _safe_div(g["season_pie_num"], g["season_pie_den"])
    g["season_pie_pct"] = 100.0 * g["season_pie"]

    # Optional: simple per-game average of game_pie (diagnostic only)
    game_mean = (pg.groupby(["personId","season"], as_index=False)["game_pie"]
                   .mean()
                   .rename(columns={"game_pie":"season_pie_avg_game"}))
    g = g.merge(game_mean, on=["personId","season"], how="left")

    # Per-36 (incl. GS inputs)
    g["pts_per36"] = _safe_div(g["total_points"] * 36.0, g["total_minutes"])
    g["ast_per36"] = _safe_div(g["total_assists"] * 36.0, g["total_minutes"])
    g["reb_per36"] = _safe_div(g["total_rebounds"] * 36.0, g["total_minutes"])
    g["fgm_per36"]  = _safe_div(g["total_fgm"]  * 36.0, g["total_minutes"])
    g["fga_per36"]  = _safe_div(g["total_fga"]  * 36.0, g["total_minutes"])
    g["ftm_per36"]  = _safe_div(g["total_ftm"]  * 36.0, g["total_minutes"])
    g["fta_per36"]  = _safe_div(g["total_fta"]  * 36.0, g["total_minutes"])
    g["oreb_per36"] = _safe_div(g["total_oreb"] * 36.0, g["total_minutes"])
    g["dreb_per36"] = _safe_div(g["total_dreb"] * 36.0, g["total_minutes"])
    g["stl_per36"]  = _safe_div(g["total_stls"] * 36.0, g["total_minutes"])
    g["blk_per36"]  = _safe_div(g["total_blocks"]* 36.0, g["total_minutes"])
    g["pf_per36"]   = _safe_div(g["total_pf"]   * 36.0, g["total_minutes"])
    g["tov_per36"]  = _safe_div(g["total_tov"]  * 36.0, g["total_minutes"])

    # Usage/Efficiency
    g["usage_per_min"] = _safe_div(
        (g["total_fga"] + 0.44 * g["total_fta"] + g["total_tov"]), g["total_minutes"]
    )
    g["efficiency_per_game"] = _safe_div(
        (g["total_points"] + g["total_rebounds"] + g["total_assists"]
         + g["total_stls"] + g["total_blocks"]
         - (g["total_fga"] - g["total_fgm"])
         - (g["total_fta"] - g["total_ftm"])
         - g["total_tov"]),
        g["games_played"]
    )

    g["win_pct"] = _safe_div(g["wins"], g["games_played"])
    g["home_games_pct"] = _safe_div(g["home_games"], g["games_played"])

    # primary team by minutes, then attach team final W/L%
    ts = team_stats_df.copy()
    if "season" not in ts.columns:
        ts["season"] = _season_from_timestamp(ts["gameDate"])
    ts = _to_numeric(ts, ["seasonWins","seasonLosses"])
    ts["seasonWins"]   = ts["seasonWins"].fillna(0)
    ts["seasonLosses"] = ts["seasonLosses"].fillna(0)

    team_final = (
        ts.groupby(["teamCity","teamName","season"], as_index=False)
          .agg(final_wins=("seasonWins","max"), final_losses=("seasonLosses","max"))
    )
    team_final["team_win_pct_final"] = _safe_div(
        team_final["final_wins"], (team_final["final_wins"] + team_final["final_losses"])
    )

    def _pick(col_lower: str, col_camel: str) -> str:
        if col_lower in pg.columns: return col_lower
        if col_camel in pg.columns: return col_camel
        raise KeyError(f"Expected one of '{col_lower}' or '{col_camel}' in player-game columns.")

    team_city_col = _pick("playerteamCity", "playerTeamCity")
    team_name_col = _pick("playerteamName", "playerTeamName")

    per_team_minutes = (
        pg.groupby(["personId","season",team_city_col,team_name_col], as_index=False)
          .agg(minutes_on_team=("numMinutes","sum"))
    )
    idx = per_team_minutes.groupby(["personId","season"])["minutes_on_team"].idxmax()
    main_team = per_team_minutes.loc[idx, ["personId","season",team_city_col,team_name_col]].copy()
    main_team = main_team.rename(columns={team_city_col: "teamCity", team_name_col: "teamName"})

    g_team = g.merge(main_team, on=["personId","season"], how="left", validate="one_to_one")

    merged = g_team.merge(
        team_final[["teamCity","teamName","season","team_win_pct_final"]],
        on=["teamCity","teamName","season"], how="left", validate="many_to_one"
    )
    _join_diagnostics(g_team, team_final, merged,
                      on=["teamCity","teamName","season"], how="left",
                      left_name="player_season", right_name="team_final")

    merged = merged.rename(columns={
        "total_stls": "total_steals",
        "total_oreb": "total_reb_off",
        "total_dreb": "total_reb_def",
    })

    # Game Score (season total) + per-36
    merged["season_game_score_total"] = (
        merged["total_points"]
        + 0.4 * merged["total_fgm"]
        - 0.7 * merged["total_fga"]
        - 0.4 * (merged["total_fta"] - merged["total_ftm"])
        + 0.7 * merged["total_reb_off"]
        + 0.3 * merged["total_reb_def"]
        + merged["total_steals"]
        + 0.7 * merged["total_assists"]
        + 0.7 * merged["total_blocks"]
        - 0.4 * merged["total_pf"]
        - merged["total_tov"]
    )
    merged["game_score_per36"] = _safe_div(merged["season_game_score_total"] * 36.0, merged["total_minutes"])

    # minutes gate AFTER derived metrics
    before = len(merged)
    merged = merged.loc[merged["total_minutes"] >= minutes_total_minimum_per_season].copy()
    after = len(merged)
    if before != after:
        print(f"[GATE] Applied {minutes_total_minimum_per_season} min gate at player-season: "
              f"dropped {before - after:,} row(s).")

    return merged


# ---------------------------
# Ranking cores
# ---------------------------

# optional explainability bits: break PIE into pos/neg buckets + normalized shares
# helps reason about *why* a season sorted where it did
def _compute_pie_component_context(df: pd.DataFrame) -> pd.DataFrame:
    """Optional PIE component shares for extended context."""
    out = df.copy()
    required = [
        "total_points","total_fgm","total_ftm","total_fga","total_fta",
        "total_reb_def","total_reb_off","total_assists","total_steals","total_blocks",
        "total_pf","total_tov",
        "season_pie_num","season_pie_den",
    ]
    missing = [c for c in required if c not in out.columns]
    if missing:
        raise KeyError(f"[context] missing columns: {missing}")

    pie_pos = (
        out["total_points"] + out["total_fgm"] + out["total_ftm"]
        + out["total_reb_def"] + 0.5*out["total_reb_off"]
        + out["total_assists"] + out["total_steals"] + 0.5*out["total_blocks"]
    )
    pie_neg = (out["total_fga"] + out["total_fta"] + out["total_pf"] + out["total_tov"])
    pie_base = pie_pos + pie_neg

    def _share(x: pd.Series) -> pd.Series:
        den = pie_base.replace(0, np.nan)
        return (x / den).fillna(0.0)

    out["pie_pos"] = pie_pos
    out["pie_neg"] = pie_neg
    out["pie_base"] = pie_base

    out["share_points"] = _share(out["total_points"])
    out["share_makes"]  = _share(out["total_fgm"] + out["total_ftm"])
    out["share_reb"]    = _share(out["total_reb_def"] + 0.5*out["total_reb_off"])
    out["share_ast"]    = _share(out["total_assists"])
    out["share_stl"]    = _share(out["total_steals"])
    out["share_blk"]    = _share(0.5*out["total_blocks"])
    out["share_fga"]    = _share(out["total_fga"])
    out["share_fta"]    = _share(out["total_fta"])
    out["share_pf"]     = _share(out["total_pf"])
    out["share_tov"]    = _share(out["total_tov"])
    return out

# gs36 sort: main metric desc, then minutes, then games/ts% depending on mode
# include_context=True adds the exact per-36 inputs used by Game Score
def rank_seasons_by_gs36(
    player_season_df: pd.DataFrame,
    top_n: int = 10,
    middle_n: int = 10,
    bottom_n: int = 10,
    tie_breaker: str = "python",  # "python" or "duckdb" (ts_pct as 3rd tie-breaker)
    include_context: bool = False,
) -> dict[str, pd.DataFrame]:
    df = player_season_df.copy()
    needed = ["personId","player_name","season","game_score_per36","total_minutes","games_played"]
    missing = [c for c in needed if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns for GS36 ranking: {missing}")

    if tie_breaker == "duckdb":
        if "ts_pct" not in df.columns:
            df["ts_pct"] = np.nan
        sort_cols_top = [("game_score_per36", False), ("total_minutes", False), ("ts_pct", False), ("player_name", True)]
    else:
        sort_cols_top = [("game_score_per36", False), ("total_minutes", False), ("games_played", False), ("player_name", True)]

    top = df.sort_values([c for c,_ in sort_cols_top],
                         ascending=[a for _,a in sort_cols_top]).head(top_n).copy()
    # same sort keys as `top`, with only the primary metric flipped to ascending
    bottom = df.sort_values(
        [c for c, _ in sort_cols_top],
        ascending=[True, False, False, True]
    ).head(bottom_n).copy()

    med = df["game_score_per36"].median(skipna=True)
    df["dist_to_median"] = (df["game_score_per36"] - med).abs()
    middle = df.sort_values(
        ["dist_to_median"] + [c for c,_ in sort_cols_top],
        ascending=[True] + [a for _,a in sort_cols_top]
    ).head(middle_n).drop(columns=["dist_to_median"]).copy()

    def _minimal(block: pd.DataFrame) -> pd.DataFrame:
        out = block.loc[:, ["player_name","season","game_score_per36","games_played","total_minutes"]].copy()
        out["game_score_per36"] = pd.to_numeric(out["game_score_per36"], errors="coerce").round(6)
        out["total_minutes"] = pd.to_numeric(out["total_minutes"], errors="coerce").round(1)
        return out.reset_index(drop=True)

    if not include_context:
        return {"top": _minimal(top), "middle": _minimal(middle), "bottom": _minimal(bottom)}

    GS_INPUT_PER36 = [
        "pts_per36", "fgm_per36", "fga_per36", "ftm_per36", "fta_per36",
        "oreb_per36", "dreb_per36", "stl_per36", "ast_per36", "blk_per36",
        "pf_per36", "tov_per36",
    ]
    def _with_gs_context(block: pd.DataFrame) -> pd.DataFrame:
        out = block.reset_index(drop=True).copy()
        out.insert(0, "Rank", range(1, len(out) + 1))
        selected = ["Rank","player_name","season","game_score_per36"] + [c for c in GS_INPUT_PER36 if c in out.columns]
        out["game_score_per36"] = pd.to_numeric(out["game_score_per36"], errors="coerce").round(6)
        for c in GS_INPUT_PER36:
            if c in out.columns:
                out[c] = pd.to_numeric(out[c], errors="coerce").round(3)
        return out.loc[:, selected]

    return {"top": _with_gs_context(top), "middle": _with_gs_context(middle), "bottom": _with_gs_context(bottom)}

# pie sort: season_pie desc, then minutes, then games or ts% (mode)
# include_context=True w/ pie_only_context shows only the inputs + the PIE parts
def rank_seasons_by_pie(
    player_season_df: pd.DataFrame,
    top_n: int = 10,
    middle_n: int = 10,
    bottom_n: int = 10,
    tie_breaker: str = "python",
    include_context: bool = False,
    context_cols: list[str] | None = None,
    round_map: dict[str, int] | None = None,
    pie_only_context: bool = True,
) -> dict[str, pd.DataFrame]:
    df = player_season_df.copy()
    needed = ["personId","player_name","season","season_pie","total_minutes","games_played"]
    missing = [c for c in needed if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns for PIE ranking: {missing}")

    if "season_pie_pct" not in df.columns:
        df["season_pie_pct"] = 100.0 * df["season_pie"]

    if tie_breaker == "duckdb":
        if "ts_pct" not in df.columns:
            df["ts_pct"] = np.nan
        sort_cols_top = [("season_pie", False), ("total_minutes", False), ("ts_pct", False), ("player_name", True)]
    else:
        sort_cols_top = [("season_pie", False), ("total_minutes", False), ("games_played", False), ("player_name", True)]

    top = df.sort_values([c for c,_ in sort_cols_top],
                         ascending=[a for _,a in sort_cols_top]).head(top_n).copy()
    bottom = df.sort_values(
        [sort_cols_top[0][0], sort_cols_top[1][0], sort_cols_top[2][0], sort_cols_top[3][0]],
        ascending=[True, False, False, True]
    ).head(bottom_n).copy()

    med = df["season_pie"].median(skipna=True)
    df["dist_to_median"] = (df["season_pie"] - med).abs()
    middle = df.sort_values(
        ["dist_to_median"] + [c for c,_ in sort_cols_top],
        ascending=[True] + [a for _,a in sort_cols_top]
    ).head(middle_n).drop(columns=["dist_to_median"]).copy()

    def _minimal(block: pd.DataFrame) -> pd.DataFrame:
        out = block.loc[:, ["player_name","season","season_pie","season_pie_pct","games_played","total_minutes"]].copy()
        out["season_pie"] = pd.to_numeric(out["season_pie"], errors="coerce").round(6)
        out["season_pie_pct"] = pd.to_numeric(out["season_pie_pct"], errors="coerce").round(2)
        out["total_minutes"] = pd.to_numeric(out["total_minutes"], errors="coerce").round(1)
        return out.reset_index(drop=True)

    if not include_context:
        return {"top": _minimal(top), "middle": _minimal(middle), "bottom": _minimal(bottom)}

    if pie_only_context:
        PIE_INPUT_TOTALS = [
            "total_points",
            "total_fgm", "total_ftm", "total_fga", "total_fta",
            "total_reb_def", "total_reb_off",
            "total_assists", "total_steals", "total_blocks",
            "total_pf", "total_tov",
            "total_minutes"
        ]
        PIE_METRICS = ["season_pie", "season_pie_pct", "season_pie_num", "season_pie_den"]

        def _pie_only(block: pd.DataFrame) -> pd.DataFrame:
            out = block.reset_index(drop=True).copy()
            out.insert(0, "Rank", range(1, len(out) + 1))
            selected = ["Rank", "player_name", "season"] \
                       + [c for c in PIE_METRICS if c in out.columns] \
                       + [c for c in PIE_INPUT_TOTALS if c in out.columns]
            out["season_pie"] = pd.to_numeric(out["season_pie"], errors="coerce").round(6)
            out["season_pie_pct"] = pd.to_numeric(out["season_pie_pct"], errors="coerce").round(2)
            return out.loc[:, selected]

        return {"top": _pie_only(top), "middle": _pie_only(middle), "bottom": _pie_only(bottom)}

    # extended context (unchanged except we expose season_pie_pct if present)
    if context_cols is None:
        context_cols = [
            "season_pie_num","season_pie_den","pie_pos","pie_neg","pie_base",
            "share_points","share_makes","share_reb","share_ast","share_stl","share_blk",
            "share_fga","share_fta","share_pf","share_tov","season_pie_pct"
        ]
    df_ctx = _compute_pie_component_context(df)

    def _join_keys(block_cols, ctx_cols):
        if all(k in block_cols and k in ctx_cols for k in ("personId","season")):
            return ["personId","season"]
        if all(k in block_cols and k in ctx_cols for k in ("player_name","season")):
            return ["player_name","season"]
        raise KeyError("No suitable join keys found; expected (personId, season) or (player_name, season).")

    def _with_context(block: pd.DataFrame) -> pd.DataFrame:
        join_keys = _join_keys(block.columns, df_ctx.columns)
        cand_cols = [c for c in context_cols if c in df_ctx.columns or c in block.columns]
        # prefer columns already on the block (e.g. season_pie_pct); pull only the rest from df_ctx
        need = [c for c in cand_cols if c not in block.columns and c in df_ctx.columns]
        merged = block.merge(df_ctx[join_keys + need].drop_duplicates(), on=join_keys, how="left")
        merged = merged.reset_index(drop=True)
        merged.insert(0, "Rank", range(1, len(merged) + 1))

        round_map_local = round_map or {
            "season_pie": 6, "season_pie_pct": 2,
            "share_points": 3, "share_makes": 3, "share_reb": 3, "share_ast": 3, "share_stl": 3, "share_blk": 3,
            "share_fga": 3, "share_fta": 3, "share_pf": 3, "share_tov": 3,
        }
        for col, nd in round_map_local.items():
            if col in merged.columns:
                merged[col] = pd.to_numeric(merged[col], errors="coerce").round(nd)

        keep = ["Rank"] + list(block.columns) + [c for c in need if c in merged.columns]
        return merged.loc[:, keep]

    return {"top": _with_context(top), "middle": _with_context(middle), "bottom": _with_context(bottom)}


# ---------------------------
# Generic printers/formatters/exporters
# ---------------------------

# helper wrapper: choose metric + include context frames
def build_rankings_with_context_generic(
    player_season_df: pd.DataFrame,
    metric: str = "pie",  # "pie" or "gs36"
    top_n: int = 10, middle_n: int = 10, bottom_n: int = 10,
    tie_breaker: str = "python",
) -> dict[str, pd.DataFrame]:
    if metric == "pie":
        return rank_seasons_by_pie(
            player_season_df, top_n=top_n, middle_n=middle_n, bottom_n=bottom_n,
            tie_breaker=tie_breaker, include_context=True, pie_only_context=True
        )
    elif metric == "gs36":
        return rank_seasons_by_gs36(
            player_season_df, top_n=top_n, middle_n=middle_n, bottom_n=bottom_n,
            tie_breaker=tie_breaker, include_context=True
        )
    else:
        raise ValueError("metric must be 'pie' or 'gs36'")

# print the context frames in a readable way (meh but handy)
def print_rankings_with_context_generic(
    player_season_df: pd.DataFrame,
    metric: str = "pie",
    **kwargs,
) -> dict[str, pd.DataFrame]:
    ctx = build_rankings_with_context_generic(player_season_df, metric=metric, **kwargs)
    label = "PIE" if metric == "pie" else "Game Score per 36"
    # headings report the actual bucket sizes (top_n/middle_n/bottom_n are configurable)
    print(f"\n=== Top {len(ctx['top'])} seasons by {label} (with context) ===")
    print(ctx["top"].to_string(index=False))
    print(f"\n=== Middle {len(ctx['middle'])} seasons by {label} (with context) ===")
    print(ctx["middle"].to_string(index=False))
    print(f"\n=== Bottom {len(ctx['bottom'])} seasons by {label} (with context) ===")
    print(ctx["bottom"].to_string(index=False))
    return ctx

# squash to submission columns (Rank, Player, Season) for each bucket
def format_rankings_for_submission_generic(
    player_season_df: pd.DataFrame,
    metric: str = "pie",
    top_n: int = 10,
    middle_n: int = 10,
    bottom_n: int = 10,
    tie_breaker: str = "python",
) -> dict[str, pd.DataFrame]:
    if metric == "pie":
        ranks = rank_seasons_by_pie(player_season_df, top_n=top_n, middle_n=middle_n, bottom_n=bottom_n, tie_breaker=tie_breaker)
        title = "PIE"
    elif metric == "gs36":
        ranks = rank_seasons_by_gs36(player_season_df, top_n=top_n, middle_n=middle_n, bottom_n=bottom_n, tie_breaker=tie_breaker)
        title = "Game Score per 36"
    else:
        raise ValueError("metric must be 'pie' or 'gs36'")

    def _fmt(df: pd.DataFrame, rank_start: int = 1) -> pd.DataFrame:
        out = df[["player_name","season"]].copy()
        out.insert(0, "Rank", range(rank_start, rank_start + len(out)))
        out = out.rename(columns={"player_name":"Player","season":"Season"})
        return out

    out = {
        "top": _fmt(ranks["top"], rank_start=1),
        "middle": _fmt(ranks["middle"], rank_start=1),
        "bottom": _fmt(ranks["bottom"], rank_start=1),
    }
    out["__title__"] = title  # carry label for printing
    return out

# print-only version for the submission tables
def print_rankings_for_submission_generic(
    player_season_df: pd.DataFrame,
    metric: str = "pie",
    top_n: int = 10,
    middle_n: int = 10,
    bottom_n: int = 10,
    tie_breaker: str = "python",
) -> dict[str, pd.DataFrame]:
    sub = format_rankings_for_submission_generic(
        player_season_df, metric=metric, top_n=top_n, middle_n=middle_n, bottom_n=bottom_n, tie_breaker=tie_breaker
    )
    label = sub.pop("__title__")
    # headings report the actual bucket sizes (top_n/middle_n/bottom_n are configurable)
    print(f"\n=== Top {len(sub['top'])} seasons by {label} ===")
    print(sub["top"].to_string(index=False))
    print(f"\n=== Middle {len(sub['middle'])} seasons by {label} (around median) ===")
    print(sub["middle"].to_string(index=False))
    print(f"\n=== Bottom {len(sub['bottom'])} seasons by {label} ===")
    print(sub["bottom"].to_string(index=False))
    return sub

# dump the context frames to csvs side-by-side (top/mid/bottom)
def export_context_rankings(
    ctx_ranks: dict[str, pd.DataFrame],
    output_dir: Path | str,
    filename_prefix: str,
) -> None:
    outdir = Path(output_dir)
    outdir.mkdir(parents=True, exist_ok=True)
    for k, df in ctx_ranks.items():
        path = outdir / f"{filename_prefix}_{k}.csv"
        df.to_csv(path, index=False)
        print(f"[EXPORT] wrote {k} → {path}")

# plain .txt logs for the submission tables (easier to eyeball in diffs)
def export_submission_txt(
    submission_ranks: dict[str, pd.DataFrame],
    output_dir: Path | str | None,
    filename_prefix: str,
    heading_label: str,
) -> None:
    """
    Writes three files: {prefix}_top.txt, {prefix}_middle.txt, {prefix}_bottom.txt
    with a heading line that includes `heading_label`.
    """
    outdir = Path(".") if output_dir is None else Path(output_dir)
    outdir.mkdir(parents=True, exist_ok=True)
    for rank_type, df in submission_ranks.items():
        filename = f"{filename_prefix}_{rank_type}.txt"
        filepath = outdir / filename
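        # explicit UTF-8 so player names with non-ASCII characters write cleanly on any platform
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(f"=== {rank_type.title()} {len(df)} seasons by {heading_label} ===\n")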
            f.write(df.to_string(index=False))
            f.write("\n")
        print(f"[EXPORT] Saved {rank_type} rankings to: {filepath}")

# ---------------------------
# Kaggle loader
# ---------------------------

# thin wrapper around kagglehub to get named tables as pandas
# raises with a readable list if something didn't load
def load_nba_csv_tables(
    table_names: Iterable[str] = IMPORTANT_TABLES,
    dataset: str = KAGGLE_DATASET,
) -> Dict[str, pd.DataFrame]:
    result: Dict[str, pd.DataFrame] = {}
    missing_map = {}

    for t in table_names:
        csv_name = KAGGLE_TABLE_TO_CSV.get(t)
        if not csv_name:
            missing_map[t] = "No CSV mapping in KAGGLE_TABLE_TO_CSV"
            continue
        try:
            df = kh.dataset_load(
                KDA.PANDAS, dataset, csv_name, pandas_kwargs={"low_memory": False}
            )
            result[t] = df
            print(f"[INFO] Loaded via KaggleHub/PANDAS: {csv_name} -> '{t}' (rows={len(df):,})")
        except Exception as e:
            missing_map[t] = str(e)

    if missing_map:
        missing_str = "\n".join([f"- {t}: {why}" for t, why in missing_map.items()])
        raise RuntimeError(
            f"Some tables could not be loaded.\nDataset: {dataset}\nIssues:\n{missing_str}"
        )
    return result

# convenience one-shot: load, filter, compute game-level PIE
# note: minutes gate deferred to season aggregation step
def build_player_game_table_python(
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE,
    minutes_total_minimum_per_season: int = CFG_MIN_SEASON_MINUTES,
) -> pd.DataFrame:
    """
    Loads via KaggleHub → applies basic criteria in Python → computes PIE at game level.
    IMPORTANT: Defer minutes gate to player-season (avoid biasing PIE denominators).
    """
    dfs = load_nba_csv_tables(["PlayerStatistics","TeamStatistics"])
    _, ps_basic, _ = enforce_criteria_python(
        None,
        dfs["PlayerStatistics"],
        dfs["TeamStatistics"],
        start_season=start_season,
        season_type=season_type,
        minutes_total_minimum_per_season=minutes_total_minimum_per_season,
        defer_minutes_gate=True,
    )
    ps_pie = compute_player_game_pie(ps_basic)
    return ps_pie

# ---------------------------
# Main
# ---------------------------

# basic CLI-ish entrypoint; prints shapes + writes out submission/context artifacts
if __name__ == "__main__":
    print("=== NBA Rankings (PIE & GS36) — Submission + Context ===")

    # 1) Player-game with PIE
    print("\n1. Building player-game table with PIE...")
    try:
        pg = build_player_game_table_python(
            start_season=CFG_START_SEASON,
            season_type=CFG_SEASON_TYPE,
            minutes_total_minimum_per_season=CFG_MIN_SEASON_MINUTES,
        )
        print(f"[SUCCESS] player_game_with_pie shape: {pg.shape}")
    except Exception as e:
        print(f"[ERROR] Failed to build player-game table: {e}")
        pg = None

    # 2) Player-season + rankings
    if pg is not None:
        print("\n2. Building player-season table and generating rankings...")
        try:
            dfs = load_nba_csv_tables(["TeamStatistics"])
            season_df = build_player_season_table_python(pg, dfs["TeamStatistics"])
            print(f"[SUCCESS] player_season shape: {season_df.shape}")

            # Persist ML dataset
            season_df.to_parquet(CFG.ml_dataset_path, index=False)

            # ---- PIE ----
            # tie_breaker='duckdb' uses ts_pct as 3rd tie-breaker to match spec
            pie_submission = print_rankings_for_submission_generic(
                season_df, metric="pie", tie_breaker="duckdb"
            )
            export_submission_txt(
                pie_submission, output_dir=CFG.processed_dir,
                filename_prefix="nba_pie_rankings", heading_label="PIE"
            )

            pie_ctx = print_rankings_with_context_generic(
                season_df, metric="pie", top_n=10, middle_n=10, bottom_n=10, tie_breaker="duckdb"
            )
            export_context_rankings(
                pie_ctx, output_dir=CFG.processed_dir, filename_prefix="nba_pie_rankings_context"
            )

            # ---- GS36 ----
            gs36_submission = print_rankings_for_submission_generic(
                season_df, metric="gs36"
            )
            export_submission_txt(
                gs36_submission, output_dir=CFG.processed_dir,
                filename_prefix="nba_gamescore36_rankings", heading_label="Game Score per 36"
            )

            gs36_ctx = print_rankings_with_context_generic(
                season_df, metric="gs36", top_n=10, middle_n=10, bottom_n=10
            )
            export_context_rankings(
                gs36_ctx, output_dir=CFG.processed_dir, filename_prefix="nba_gamescore36_rankings_context"
            )

        except Exception as e:
            print(f"[ERROR] Failed to generate rankings: {e}")

    print("\n=== Complete ===")

Overwriting src/heat_data_scientist_2025/data/kaggle_pull.py
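
The "middle" bucket in `rank_seasons_by_pie` and `rank_seasons_by_gs36` is chosen by distance to the median, with the top-bucket sort order as the tie-breaker. The sketch below mirrors that idea in plain Python so it can be checked without the Kaggle data. The toy rows and the `middle_bucket` helper are hypothetical and not part of `kaggle_pull.py`, and the third tie-breaker (`games_played` or `ts_pct`) is left out for brevity.

```python
from statistics import median

# Hypothetical toy rows: (player_name, season, season_pie, total_minutes)
rows = [
    ("A", "2019-20", 0.20, 2400.0),
    ("B", "2019-20", 0.05, 1500.0),
    ("C", "2020-21", 0.11, 2000.0),
    ("D", "2020-21", 0.09, 1800.0),
    ("E", "2021-22", 0.10, 2100.0),
]

def middle_bucket(rows, n):
    """Return the n rows whose season_pie is closest to the median.

    Ties on distance fall back to higher season_pie, then more minutes,
    then player name (ascending), like the top-bucket sort order.
    """
    med = median(r[2] for r in rows)
    ordered = sorted(
        rows,
        key=lambda r: (abs(r[2] - med), -r[2], -r[3], r[0]),
    )
    return ordered[:n]

middle = middle_bucket(rows, 3)
print([r[0] for r in middle])
```

Here the median PIE is 0.10, so `E` comes first, and `C` is listed ahead of `D` because the two are essentially equidistant from the median and `C` has the higher PIE.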


In [3]:
%%writefile src/heat_data_scientist_2025/data/load_data_utils.py
import pandas as pd
import time

def load_data_optimized(
    DATA_PATH: str,
    debug: bool = False,
    drop_null_rows: bool = False,
    drop_null_how: str = 'any',  # 'any' or 'all'
    drop_null_subset: list | None = None,  # list of column names or None for all columns
    use_sample: bool = False,
    sample_size: int = 1000,
):
    """Load data with performance optimizations and enhanced debug diagnostics.

    Parameters:
    - DATA_PATH: Path to the parquet file.
    - debug: If True, prints detailed dataset diagnostics.
    - drop_null_rows: If True, drops rows based on null criteria.
    - drop_null_how: 'any' to drop rows with any nulls, 'all' to drop rows with all nulls.
    - drop_null_subset: List of columns to consider when dropping nulls; defaults to all.
    - use_sample: If True, keeps only the first `sample_size` rows of the parquet file.
    - sample_size: Number of rows to keep when use_sample is True.

    Returns:
    - df: Loaded (and optionally filtered) DataFrame.
    """
    print("Loading data for enhanced comprehensive EDA...")
    start_time = time.time()

    # 1. Load data
    if DATA_PATH is None:
        raise ValueError("DATA_PATH must be provided.")
    df = pd.read_parquet(DATA_PATH)
    if use_sample:
        # keep only the first sample_size rows (the full file is still read first)
        print(f"⚡ Using a sample: first {sample_size:,} rows of the parquet file.")
        df = df.head(sample_size)


    # 2. Drop null rows if requested
    if drop_null_rows:
        before = len(df)
        # Determine which subset to use for dropna
        subset_desc = "all columns" if drop_null_subset is None else f"subset={drop_null_subset}"
        print(f"→ Applying null dropping: how='{drop_null_how}', {subset_desc}")
        if drop_null_subset is None:
            df = df.dropna(how=drop_null_how)
        else:
            # Defensive: resolve requested columns case-insensitively against the frame
            # (the player-season columns are lower/camel case, so forcing upper case never matches)
            lookup = {str(c).lower(): c for c in df.columns}
            missing_cols = [c for c in drop_null_subset if c.lower() not in lookup]
            if missing_cols:
                print(f"⚠️ Warning: drop_null_subset columns not found in dataframe and will be ignored: {missing_cols}")
            valid_subset = [lookup[c.lower()] for c in drop_null_subset if c.lower() in lookup]
            df = df.dropna(how=drop_null_how, subset=valid_subset if valid_subset else None)
        dropped = before - len(df)
        print(f"✓ Dropped {dropped:,} rows by null criteria (how='{drop_null_how}', subset={drop_null_subset}); remaining {len(df):,} rows")

    # 3. Debug diagnostics
    if debug:
        print("========== Dataset Debug Details ============")
        print(f"Total rows       : {df.shape[0]:,}")
        print(f"Total columns    : {df.shape[1]:,}")
        print(f"Columns          : {df.columns.tolist()}")

        total = len(df)
        null_counts = df.isnull().sum()
        non_null_counts = total - null_counts
        null_percent = (null_counts / total) * 100
        dtype_info = df.dtypes

        null_summary = pd.DataFrame({
            'dtype'          : dtype_info,
            'null_count'     : null_counts,
            'non_null_count' : non_null_counts,
            'null_percent'   : null_percent
        }).sort_values(by='null_percent', ascending=False)

        print("---- Nulls Summary (per column) ----")
        # show every column without permanently changing the global pandas display option
        with pd.option_context('display.max_rows', None):
            print(null_summary)

    load_time = time.time() - start_time
    print(f"✓ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]:,} columns in {load_time:.2f}s")
    return df



if __name__ == "__main__":
    from src.heat_data_scientist_2025.utils.config import CFG
    df = load_data_optimized(
        CFG.ml_dataset_path,
        debug=True,
        # use_sample=True,
        # drop_null_rows=True,
        # drop_null_subset=['AAV']
    )
    print(df.columns.tolist())
    print(df.head())
    print(df.shape)
    # unique values for season
    print(df['season'].unique())
    

    # df = load_data_optimized(
    #     FINAL_DATA_PATH,
    #     debug=True,
    #     use_sample=True,
    # )
    # print(df.columns.tolist())

Overwriting src/heat_data_scientist_2025/data/load_data_utils.py
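
The debug branch of `load_data_optimized` builds a per-column null summary sorted by null percentage. As a rough, dependency-free illustration of what that table contains, the sketch below computes the same counts over a list of dicts. The toy records and the `null_summary` helper are hypothetical stand-ins, not part of `load_data_utils.py`.

```python
# Hypothetical toy records standing in for DataFrame rows; None marks a null.
records = [
    {"player_name": "A", "season": "2019-20", "ts_pct": 0.58},
    {"player_name": "B", "season": "2019-20", "ts_pct": None},
    {"player_name": "C", "season": None, "ts_pct": None},
    {"player_name": "D", "season": "2020-21", "ts_pct": 0.61},
]

def null_summary(records):
    """Per-column null counts and percentages, sorted by null_percent descending."""
    total = len(records)
    columns = list(records[0].keys()) if records else []
    summary = []
    for col in columns:
        nulls = sum(1 for row in records if row.get(col) is None)
        summary.append({
            "column": col,
            "null_count": nulls,
            "non_null_count": total - nulls,
            "null_percent": 100.0 * nulls / total,
        })
    # Python's sort is stable, so columns with equal percentages keep their input order
    return sorted(summary, key=lambda s: s["null_percent"], reverse=True)

summary = null_summary(records)
for row in summary:
    print(row)
```

Columns with the most missing values come first, which is usually the quickest way to spot fields that need a `drop_null_subset` entry or an imputation step.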


In [4]:
%%writefile src/heat_data_scientist_2025/data/feature_engineering.py
"""
Feature engineering for NBA player-season data.

Current columns:
Index(['personId', 'player_name', 'season', 'games_played', 'total_minutes',
       'total_points', 'total_assists', 'total_rebounds', 'total_reb_off',
       'total_reb_def', 'total_blocks', 'total_steals', 'total_pf',
       'total_tov', 'total_fga', 'total_fgm', 'total_fta', 'total_ftm',
       'total_3pa', 'total_3pm', 'total_plus_minus', 'avg_plus_minus', 'wins',
       'home_games', 'season_pie_num', 'season_pie_den', 'fg_pct', 'fg3_pct',
       'ft_pct', 'ts_pct', 'season_pie', 'pts_per36', 'ast_per36', 'reb_per36',
       'usage_per_min', 'efficiency_per_game', 'win_pct', 'home_games_pct',
       'teamCity', 'teamName', 'team_win_pct_final'],
      dtype='object')

"""


from __future__ import annotations
from typing import List, Tuple, Optional, Dict
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


# quick check: makes sure we have the columns we need
def require_columns(df: pd.DataFrame, cols: List[str], context: str) -> None:
    """Check if required columns exist, throw error if missing."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns for {context}: {missing}")


# helper: finds first column that actually exists in the data  
def _first_present(df: pd.DataFrame, candidates: List[str]) -> Optional[str]:
    """Return first column name that exists in df, else None."""
    for c in candidates:
        if c in df.columns:
            return c
    return None


# helper: marks each player's first season row (used to verify lag nulls)
def compute_first_season_mask(df: pd.DataFrame, season_col: str = "season_start_year") -> pd.Series:
    """Return boolean mask for each player's earliest season."""
    require_columns(df, ["personId", season_col], "compute_first_season_mask")
    idx = (df.sort_values(["personId", season_col])
           .groupby("personId", group_keys=False)
           .head(1)).index
    # wrap in a Series aligned to df so the return value matches the annotation
    return pd.Series(df.index.isin(idx), index=df.index)


# parse season string like '2023-24' into year 2023
def add_season_start_year(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Extract numeric start year from season string like 'YYYY-YY'."""
    out = df.copy()
    if "season" not in out.columns:
        raise ValueError("Need 'season' column")
    
    out["season_start_year"] = (
        out["season"].astype(str).str.extract(r"(\d{4})")[0].astype(int)
    )
    return out, ["season_start_year"]


# experience stuff - years in league and rough groupings
def add_experience_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Add experience features from draft year if available."""
    out = df.copy()
    created: List[str] = []

    # cumulative games played (simple running total, in season order)
    if {"personId", "games_played", "season_start_year"}.issubset(out.columns):
        out = out.sort_values(["personId", "season_start_year"])
        out["games_played_total"] = out.groupby("personId")["games_played"].cumsum()
        created.append("games_played_total")

    # years since draft (if we have draft year)
    if "draftYear" in out.columns and "season_start_year" in out.columns:
        out["years_experience"] = (out["season_start_year"] - out["draftYear"]).clip(lower=0)
        
        # rough experience buckets
        def exp_bucket(exp):
            if pd.isna(exp): return "Unknown"
            if exp <= 2: return "Rookie/Sophomore"  
            if exp <= 5: return "Young Player"
            if exp <= 9: return "Prime Years"
            if exp <= 15: return "Veteran"
            return "Elder Statesman"
        
        out["experience_bucket"] = out["years_experience"].apply(exp_bucket)
        created.extend(["years_experience", "experience_bucket"])
        
    return out, created


# advanced box score metrics - efficiency stuff mostly
def add_advanced_metrics(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Create advanced metrics from basic box score stats."""
    out = df.copy()
    created: List[str] = []

    # find the columns we need (handles different naming)
    fga = _first_present(out, ["total_fga"])
    fta = _first_present(out, ["total_fta"])
    fgm = _first_present(out, ["total_fgm"])
    tpa = _first_present(out, ["total_3pa"])
    tpm = _first_present(out, ["total_3pm"])
    ast = _first_present(out, ["total_assists"])
    blk = _first_present(out, ["total_blocks"])
    stl = _first_present(out, ["total_steals"])
    pts = _first_present(out, ["total_points"])
    mins = _first_present(out, ["total_minutes"])
    reb = _first_present(out, ["total_rebounds"])
    dreb = _first_present(out, ["total_reb_def"])
    oreb = _first_present(out, ["total_reb_off"])
    tov = _first_present(out, ["total_tov", "total_turnovers"])

    # true shooting attempts estimate
    if fga and fta:
        out["ts_attempts"] = out[fga] + 0.44 * out[fta]
        created.append("ts_attempts")
        
    # shooting rates and efficiency
    if fga and tpa:
        out["three_point_rate"] = np.where(out[fga] > 0, out[tpa] / out[fga], 0.0)
        created.append("three_point_rate")
    if fga and fta:
        out["ft_rate"] = np.where(out[fga] > 0, out[fta] / out[fga], 0.0)
        created.append("ft_rate")
    if fga and fgm and tpm:
        out["efg_pct"] = np.where(out[fga] > 0, (out[fgm] + 0.5 * out[tpm]) / out[fga], 0.0)
        created.append("efg_pct")
    if fga and pts:
        out["pts_per_shot"] = np.where(out[fga] > 0, out[pts] / out[fga], 0.0)
        created.append("pts_per_shot")

    # defensive and overall production per 36
    if mins and blk and stl:
        out["defensive_per36"] = np.where(out[mins] > 0, (out[blk] + out[stl]) * 36 / out[mins], 0.0)
        out["stocks_per36"] = out["defensive_per36"].copy()  # same thing
        created.extend(["defensive_per36", "stocks_per36"])
    if mins and pts and ast and reb:
        out["production_per36"] = np.where(out[mins] > 0, (out[pts] + out[ast] + out[reb]) * 36 / out[mins], 0.0)
        created.append("production_per36")
    if mins and tov:
        out["tov_per36"] = np.where(out[mins] > 0, out[tov] * 36 / out[mins], 0.0)
        created.append("tov_per36")

    # rebounding shares
    if reb and dreb and oreb:
        total_reb_safe = out[reb].replace(0, np.nan)
        out["dreb_share"] = (out[dreb] / total_reb_safe).fillna(0.0)
        out["oreb_share"] = (out[oreb] / total_reb_safe).fillna(0.0)
        created.extend(["dreb_share", "oreb_share"])

    # usage events (shots, fts, turnovers)
    if fga and fta and tov and mins:
        out["usage_events_total"] = out[fga] + 0.44 * out[fta] + out[tov]
        out["usage_events_per_min"] = np.where(out[mins] > 0, out["usage_events_total"] / out[mins], 0.0)
        created.extend(["usage_events_total", "usage_events_per_min"])

    # assist to turnover ratio
    if ast and tov:
        out["ast_to_tov"] = np.where(out[tov] > 0, out[ast] / out[tov], out[ast])
        created.append("ast_to_tov")

    return out, created


# usage and shot creation features
def add_usage_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Usage and shot creation metrics."""
    out = df.copy()
    created: List[str] = []
    
    # total usage from per-minute usage  
    if "usage_per_min" in out.columns:
        min_col = _first_present(out, ["total_minutes"])
        if min_col:
            out["total_usage"] = out["usage_per_min"] * out[min_col]
            created.append("total_usage")

    # shot creation (shots + assists)
    fga_col = _first_present(out, ["total_fga"])
    fta_col = _first_present(out, ["total_fta"])
    ast_col = _first_present(out, ["total_assists"])
    min_col = _first_present(out, ["total_minutes"])
    
    if fga_col and fta_col and ast_col:
        out["shot_creation"] = out[fga_col] + out[fta_col] + out[ast_col]
        if min_col:
            out["shot_creation_per36"] = np.where(out[min_col] > 0, out["shot_creation"] * 36 / out[min_col], 0.0)
        else:
            out["shot_creation_per36"] = 0.0
        created.extend(["shot_creation", "shot_creation_per36"])
        
    return out, created


# rolling averages and trends (optional - currently disabled)
def add_rolling_features(df: pd.DataFrame, window: int = 3, stats: Optional[List[str]] = None) -> Tuple[pd.DataFrame, List[str]]:
    """Rolling means and linear trends for key stats."""
    if stats is None:
        stats = ["season_pie", "ts_pct", "pts_per36", "ast_per36", "efficiency_per_game"]
        
    require_columns(df, ["season_start_year", "personId"], "add_rolling_features")
    
    out = df.copy().sort_values(["personId", "season_start_year"])
    created: List[str] = []
    
    valid_stats = [s for s in stats if s in out.columns]
    if not valid_stats:
        return out, created
        
    gp = out.groupby("personId")
    
    for stat in valid_stats:
        # rolling mean
        col_roll = f"{stat}_rollmean_{window}"
        out[col_roll] = gp[stat].rolling(window, min_periods=1).mean().reset_index(level=0, drop=True)
        created.append(col_roll)
        
        # linear trend slope
        def slope(x: pd.Series) -> float:
            arr = x.values
            if arr.size < 2:
                return float("nan")
            X = np.arange(len(arr)).reshape(-1, 1)
            y = arr.ravel()
            try:
                model = LinearRegression().fit(X, y)
                return float(model.coef_[0])
            except Exception:
                return float("nan")
                
        col_slope = f"{stat}_trend_{window}"  
        out[col_slope] = gp[stat].rolling(window, min_periods=2).apply(slope, raw=False).reset_index(level=0, drop=True)
        created.append(col_slope)
        
    return out, created


# figures out which numeric columns to lag automatically
def _build_lag_stat_list_auto(df: pd.DataFrame, season_col: str = "season_start_year") -> List[str]:
    """Auto-select numeric columns to lag, excluding obvious problem ones."""
    numeric = df.select_dtypes(include=[np.number]).columns.tolist()
    
    # stuff we definitely don't want to lag
    exclude_exact = {
        "personId", season_col, "games_played_total", "forecast_season", 
        "source_season", "season_pie_num", "season_pie_den"
    }
    
    base = [c for c in numeric if c not in exclude_exact 
            and not c.endswith("_lag1") and "trend_" not in c and "rollmean_" not in c]
    
    # skip columns with id/index type names
    bad_substrings = ("_id", "_idx", "_code")
    base = [c for c in base if not any(s in c.lower() for s in bad_substrings)]
    
    return base


# creates lagged features by player
def add_lag_features(df: pd.DataFrame, stats: Optional[List[str]] = None,
                    lags: Tuple[int, ...] = (1,), season_col: str = "season_start_year") -> Tuple[pd.DataFrame, List[str]]:
    """Add lag features by player-season.

    Lags are positional within each player's sorted seasons, so a lag-k column
    is null for that player's first k seasons.
    """
    out = df.copy()
    created: List[str] = []
    require_columns(out, ["personId", season_col], "add_lag_features")
    
    # auto-detect stats to lag if not provided
    if stats is None:
        stats = _build_lag_stat_list_auto(out, season_col=season_col)
    else:
        # filter to numeric columns only
        num_cols = set(out.select_dtypes(include=[np.number]).columns)
        stats = [s for s in stats if s in num_cols]
    
    out = out.sort_values(["personId", season_col])
    gp = out.groupby("personId", group_keys=False)
    
    # create lag columns
    for col in stats:
        for k in lags:
            name = f"{col}_lag{k}"
            out[name] = gp[col].shift(k)
            created.append(name)
    
    # add helper columns
    out["has_prior_season"] = gp.cumcount() > 0
    created.append("has_prior_season")
    
    return out, created


# minutes and availability features  
def add_minutes_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Playing time and availability metrics."""
    out = df.copy()
    created: List[str] = []
    
    if "games_played" in out.columns and "total_minutes" in out.columns:
        out["minutes_per_game"] = np.where(out["games_played"] > 0, out["total_minutes"] / out["games_played"], 0.0)
        out["games_pct"] = out["games_played"] / 82.0
        created.extend(["minutes_per_game", "games_pct"])
        
        # playing time tiers
        out["minutes_tier"] = pd.cut(out["minutes_per_game"], bins=[0, 15, 25, 35, 48], 
                                   labels=["Bench", "Role Player", "Starter", "Star"], include_lowest=True)
        created.append("minutes_tier")
        
        # total minutes tiers (handles duplicate edges)
        try:
            out["total_minutes_tier"] = pd.qcut(out["total_minutes"], q=5, 
                                              labels=["Very Low", "Low", "Medium", "High", "Very High"])
        except ValueError:
            # method="first" gives tied values distinct ranks, so the quantile edges stay unique
            ranks = out["total_minutes"].rank(method="first")
            out["total_minutes_tier"] = pd.qcut(ranks, q=5,
                                              labels=["Very Low", "Low", "Medium", "High", "Very High"])
        created.append("total_minutes_tier")
    
    return out, created


# shooting performance relative to league average by season  
def add_performance_consistency(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Shooting performance vs league medians and composite score."""
    out = df.copy()
    created: List[str] = []
    
    need = ["season_start_year", "fg_pct", "fg3_pct", "ft_pct"]
    missing = [c for c in need if c not in out.columns]
    if missing:
        return out, created
        
    # season-level medians for comparison
    grp = out.groupby("season_start_year", group_keys=False)
    league_medians = ["fg_league_med", "fg3_league_med", "ft_league_med"]
    out["fg_league_med"] = grp["fg_pct"].transform("median")
    out["fg3_league_med"] = grp["fg3_pct"].transform("median") 
    out["ft_league_med"] = grp["ft_pct"].transform("median")
    
    # differences from league median
    out["fg_vs_league"] = out["fg_pct"] - out["fg_league_med"]
    out["fg3_vs_league"] = out["fg3_pct"] - out["fg3_league_med"]
    out["ft_vs_league"] = out["ft_pct"] - out["ft_league_med"]
    created.extend(["fg_vs_league", "fg3_vs_league", "ft_vs_league"])
    
    # composite shooting score
    out["shooting_score"] = out["fg_pct"] * 0.4 + out["fg3_pct"] * 0.3 + out["ft_pct"] * 0.3
    created.append("shooting_score")
    
    # clean up temp columns
    out.drop(columns=league_medians, inplace=True)
    return out, created


# composite features combining multiple stats
def create_composite_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Create composite impact metrics."""
    out = df.copy()
    created: List[str] = []
    
    # make sure we have the base columns (fill with nan if missing);
    # ts_pct is included because offensive_impact reads it below
    for need in ["pts_per36", "ast_per36", "reb_per36", "defensive_per36", "ts_pct"]:
        if need not in out.columns:
            out[need] = np.nan
    
    # offensive impact score
    out["offensive_impact"] = out["pts_per36"] * 0.4 + out["ast_per36"] * 0.3 + out["ts_pct"] * 100 * 0.3
    created.append("offensive_impact")
    
    # two-way impact (offense + defense)
    out["two_way_impact"] = out["offensive_impact"] + out["defensive_per36"] * 10
    created.append("two_way_impact")
    
    # efficiency x volume
    if "efficiency_per_game" in out.columns and "total_usage" in out.columns:
        out["efficiency_volume_score"] = out["efficiency_per_game"] * out["total_usage"]  
        created.append("efficiency_volume_score")
    
    # versatility score (above median in multiple areas)
    scoring_contrib = (out["pts_per36"] > out["pts_per36"].median()).astype(int)
    assist_contrib = (out["ast_per36"] > out["ast_per36"].median()).astype(int)
    rebound_contrib = (out["reb_per36"] > out["reb_per36"].median()).astype(int) 
    defense_contrib = (out["defensive_per36"] > out["defensive_per36"].median()).astype(int)
    out["versatility_score"] = scoring_contrib + assist_contrib + rebound_contrib + defense_contrib
    created.append("versatility_score")
    
    return out, created


# helper for season-normalized z-scores
def _zscore_by_season(df: pd.DataFrame, col: str, season_col: str) -> pd.Series:
    """Z-score within each season, clipped to avoid extreme outliers."""
    if col not in df.columns:
        return pd.Series(np.nan, index=df.index)
    g = df.groupby(season_col)[col]
    z = (df[col] - g.transform("mean")) / (g.transform("std").replace(0, np.nan))
    return z.clip(-3, 3).fillna(0.0)


# portability index - how well skills transfer between situations  
def build_portability_index(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Portability index based on transferable skills."""
    out = df.copy()
    created: List[str] = []
    season_col = "season_start_year"
    require_columns(out, [season_col], "build_portability_index")
    
    # season-normalized z-scores for key components
    z = {}
    for col in ["ts_pct", "efg_pct", "pts_per_shot", "fg3_pct", "three_point_rate", "ft_pct",
                "stocks_per36", "dreb_share", "oreb_share", "ast_per36", "ast_to_tov", "usage_per_min"]:
        if col in out.columns:
            z[col] = _zscore_by_season(out, col, season_col)
        else:
            z[col] = pd.Series(0.0, index=out.index)
    
    # component scores
    score_eff = (z["ts_pct"] + z["efg_pct"] + z["pts_per_shot"]) / 3.0
    shoot_abil = (z["fg3_pct"] + z["three_point_rate"] + z["ft_pct"]) / 3.0  
    def_abil = z["stocks_per36"]
    
    # rebounding versatility (good at both, penalty for imbalance)
    reb_mean = (z["dreb_share"] + z["oreb_share"]) / 2.0
    reb_gap = (z["dreb_share"] - z["oreb_share"]).abs()
    def_vers = reb_mean - 0.25 * reb_gap
    
    pass_abil = (z["ast_per36"] + z["ast_to_tov"]) / 2.0
    
    # usage with diminishing returns (too much usage can hurt portability)
    usage_term = z["usage_per_min"] - 0.15 * (z["usage_per_min"] ** 2)
    
    # weighted combination (weights sum to 1.0)
    out["portability_index"] = (0.16 * score_eff + 0.40 * shoot_abil + 0.08 * def_abil + 
                               0.05 * def_vers + 0.25 * pass_abil + 0.06 * usage_term)
    created.append("portability_index")
    
    # save component scores too
    out["pi_scoring_eff"] = score_eff
    out["pi_shooting"] = shoot_abil  
    out["pi_defense"] = def_abil
    out["pi_versatility"] = def_vers
    out["pi_passing"] = pass_abil
    out["pi_usage_term"] = usage_term
    created.extend(["pi_scoring_eff", "pi_shooting", "pi_defense", "pi_versatility", "pi_passing", "pi_usage_term"])
    
    return out, created


# main feature engineering function
def engineer_features(df: pd.DataFrame, drop_null_lag_rows: bool = True, verbose: bool = False) -> pd.DataFrame:
    """
    Build all features and optionally drop first-season rows with null lags.
    
    Args:
        df: Input dataframe with player-season data
        drop_null_lag_rows: If True, drop rows where lag features are null (default True)
        verbose: Print progress info (default False)
    
    Returns:
        Processed dataframe with all engineered features
    """
    if verbose:
        print("Starting feature engineering...")
        
    original_shape = df.shape
    out = df.copy()
    
    # check we have required base columns
    required_base = ["personId", "season", "games_played", "total_minutes", "season_pie"]
    missing_base = [c for c in required_base if c not in out.columns]
    if missing_base:
        raise ValueError(f"Missing required columns: {missing_base}")
    
    # 1. parse season to get numeric year
    if verbose:
        print("Parsing seasons...")
    out, _ = add_season_start_year(out)
    
    # 2. sort by player and season for all subsequent operations
    out = out.sort_values(["personId", "season_start_year"])
    
    # 3. build features step by step
    if verbose:
        print("Adding experience features...")
    out, _ = add_experience_features(out)
    
    if verbose:
        print("Adding advanced metrics...")
    out, _ = add_advanced_metrics(out)
    
    if verbose:
        print("Adding usage features...")
    out, _ = add_usage_features(out)
    
    if verbose:
        print("Adding minutes features...")  
    out, _ = add_minutes_features(out)
    
    if verbose:
        print("Adding performance consistency...")
    out, _ = add_performance_consistency(out)
    
    if verbose:
        print("Creating composite features...")
    out, _ = create_composite_features(out)
    
    if verbose:
        print("Building portability index...")
    out, _ = build_portability_index(out)
    
    # 4. add lag features (creates nulls in first seasons)
    if verbose:
        print("Creating lag features...")
    out, _ = add_lag_features(out, lags=[1])
    
    # 5. clean up any infinite values
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    inf_cols = []
    for col in numeric_cols:
        if np.isinf(out[col]).any():
            inf_cols.append(col)
            out[col] = out[col].replace([np.inf, -np.inf], np.nan)
    
    if verbose and inf_cols:
        print(f"Cleaned infinite values in {len(inf_cols)} columns")
    
    # 6. handle lag nulls
    lag_cols = [c for c in out.columns if c.endswith("_lag1")]
    if lag_cols:
        # check that lag nulls are only in first seasons (as expected)
        lag_nulls_mask = out[lag_cols].isnull().any(axis=1)
        first_season_mask = compute_first_season_mask(out)
        
        # simple validation
        nulls_are_first_seasons = (lag_nulls_mask == first_season_mask).all()
        
        if nulls_are_first_seasons:
            if verbose:
                print(f"✓ Lag nulls confirmed as first seasons only ({lag_nulls_mask.sum()} rows)")
            
            if drop_null_lag_rows:
                rows_before = len(out)
                out = out[~lag_nulls_mask].copy()
                rows_dropped = rows_before - len(out)
                if verbose:
                    print(f"Dropped {rows_dropped} first-season rows with null lags")
        else:
            print("⚠️ Warning: Some lag nulls are not from first seasons - no rows were dropped; check data quality")
    
    if verbose:
        print(f"Feature engineering complete: {original_shape[0]} → {len(out)} rows, {original_shape[1]} → {len(out.columns)} columns")
    
    return out


if __name__ == "__main__":
    from src.heat_data_scientist_2025.data.load_data_utils import load_data_optimized
    from src.heat_data_scientist_2025.utils.config import CFG

    df = load_data_optimized(
        CFG.ml_dataset_path,
        debug=True,
        drop_null_rows=True,
    )

    
    # run feature engineering
    try:
        df_eng = engineer_features(df, drop_null_lag_rows=True, verbose=True)
        print(f"✓ Success! Result shape: {df_eng.shape}")
        print("Sample lag columns created:", [c for c in df_eng.columns if c.endswith('_lag1')][:5])
        print(f"First season rows dropped: {len(df) - len(df_eng)}")
    except Exception as e:
        print(f"✗ Error: {e}")
        raise  # re-raise: everything below needs df_eng, which is undefined after a failure

    # print columns
    print(df_eng.columns)

    # feature lists that must all be present in the engineered dataset
    numerical_features = [
        # lagged features
        "season_pie_lag1", "ts_pct_lag1", "efg_pct_lag1", "fg_pct_lag1", "fg3_pct_lag1", "ft_pct_lag1",
        "pts_per36_lag1", "ast_per36_lag1", "reb_per36_lag1", "defensive_per36_lag1",
        "production_per36_lag1", "stocks_per36_lag1", "three_point_rate_lag1", "ft_rate_lag1",
        "pts_per_shot_lag1", "ast_to_tov_lag1", "usage_events_per_min_lag1", "usage_per_min_lag1",
        "games_played_lag1", "total_minutes_lag1", "total_points_lag1", "total_assists_lag1",
        "total_rebounds_lag1", "total_steals_lag1", "total_blocks_lag1", "total_fga_lag1",
        "total_fta_lag1", "total_3pa_lag1", "total_3pm_lag1", "total_tov_lag1", "win_pct_lag1",
        "avg_plus_minus_lag1", "team_win_pct_final_lag1",
        "offensive_impact_lag1", "two_way_impact_lag1", "efficiency_volume_score_lag1",
        "versatility_score_lag1", "shooting_score_lag1", "season"
    ]
    
    nominal_categoricals = []
    ordinal_categoricals = ["minutes_tier"] 
    y_variables = ["season_pie", "game_score_per36"]
    
    # check that every listed feature is in df_eng
    for feature in numerical_features:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"
    for feature in nominal_categoricals:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"
    for feature in ordinal_categoricals:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"
    for feature in y_variables:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"


Overwriting src/heat_data_scientist_2025/data/feature_engineering.py
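
As a quick sanity check on the module written above, the lag logic in `add_lag_features` can be reproduced on a toy frame. This is a minimal sketch rather than the module itself: the data and the `toy` name are made up for illustration, and only `personId`, `season_start_year`, `season_pie`, and the `_lag1` suffix are taken from the code above.

```python
import pandas as pd

# Hypothetical toy data: two players, deliberately out of order.
toy = pd.DataFrame({
    "personId": [2, 1, 1, 2, 1],
    "season_start_year": [2021, 2020, 2019, 2020, 2021],
    "season_pie": [0.12, 0.10, 0.08, 0.11, 0.13],
})

# Sort first so that shift(1) means "previous row for the same player".
toy = toy.sort_values(["personId", "season_start_year"])
gp = toy.groupby("personId", group_keys=False)

toy["season_pie_lag1"] = gp["season_pie"].shift(1)
toy["has_prior_season"] = gp.cumcount() > 0

# Lag nulls should line up exactly with each player's first season.
first_season = ~toy["has_prior_season"]
nulls_only_in_first = bool((toy["season_pie_lag1"].isna() == first_season).all())

print(toy)
print("lag nulls only in first seasons:", nulls_only_in_first)
```

Note that `shift(1)` lags by row, not by calendar year. If a player skips a season, the "lag" comes from two or more years earlier, which is the situation `audit_lag_feature_integrity` in the pipeline module is meant to surface.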

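The season-normalized z-scores behind `build_portability_index` are easier to reason about on a small example. The sketch below mirrors the logic of `_zscore_by_season` under the assumption that it is applied to one column at a time; the function name `zscore_by_season`, the `toy` frame, and its values are illustrative and not part of the project.

```python
import numpy as np
import pandas as pd

def zscore_by_season(df: pd.DataFrame, col: str, season_col: str) -> pd.Series:
    """Z-score within each season, clipped to [-3, 3]; undefined scores become 0."""
    g = df.groupby(season_col)[col]
    std = g.transform("std").replace(0, np.nan)  # avoid dividing by zero spread
    z = (df[col] - g.transform("mean")) / std
    return z.clip(-3, 3).fillna(0.0)

# Hypothetical data: 2020 has real spread, 2021 has none.
toy = pd.DataFrame({
    "season_start_year": [2020, 2020, 2020, 2021, 2021],
    "ft_pct": [0.70, 0.80, 0.90, 0.75, 0.75],
})
toy["ft_z"] = zscore_by_season(toy, "ft_pct", "season_start_year")

# 2021 has zero spread, so both of its rows fall back to 0.0.
print(toy)
```

Filling undefined scores with 0.0 treats a player as league-average for that season on that component, which keeps the weighted index defined but can hide a column that is constant or missing for a whole season.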

In [5]:
%%writefile src/heat_data_scientist_2025/ml/enhanced_ml_pipeline.py
"""
Enhanced Automated ML Pipeline with Feature Evaluation and Multi-Target Support
===============================================================================

New functions for:
1. Feature validation and evaluation
2. Permutation importance calculation
3. Feature filtering based on importance
4. Multi-target prediction support
5. Feature restriction to specified lists only
"""

from __future__ import annotations
import pandas as pd
import numpy as np
import pickle
import json
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Import existing modules
from src.heat_data_scientist_2025.utils.config import CFG, ML_CONFIG
from src.heat_data_scientist_2025.data.feature_engineering import (
    engineer_features
)
from dataclasses import dataclass

# --- Encoding and validation helpers ---

# explicit ordinal orders
ORDINAL_ORDERS = {
    "minutes_tier": ["Bench", "Role Player", "Starter", "Star"],
}

def _freeze_label_encoder(le: LabelEncoder) -> dict:
    """
    Convert fitted LabelEncoder to a simple mapping dict (value -> code),
    so we can transform robustly (and detect unknowns) without calling .transform().
    """
    classes = list(le.classes_)
    mapping = {cls: idx for idx, cls in enumerate(classes)}
    reverse = {idx: cls for idx, cls in enumerate(classes)}
    return {"forward": mapping, "reverse": reverse}

def ensure_numeric_matrix(df_like: pd.DataFrame, context: str = "X") -> None:
    """
    Hard-stop if any object columns exist. Print sample offending values.
    """
    obj_cols = [c for c in df_like.columns if df_like[c].dtype == "object"]
    if obj_cols:
        examples = {c: df_like[c].dropna().astype(str).unique()[:5].tolist() for c in obj_cols}
        msg = [
            f"[ensure_numeric_matrix] {context} contains object/string columns:",
            f"  Columns: {obj_cols}",
            f"  Sample values: {examples}",
            "  -> Encode these before calling model.predict / model.fit."
        ]
        raise ValueError("\n".join(msg))

def validate_target_feature_separation(target: str, feature_names: list[str], verbose: bool = True) -> bool:
    """
    Prevent blatant leakage: a feature that is literally the target or same-season alias.
    Allowed: *_lag1 variants of the target.
    """
    bad = []
    for f in feature_names:
        # strict disallow the bare target
        if f == target:
            bad.append(f)
        # disallow target-like aliases without lag
        if f.startswith(target + "_") and not f.endswith("_lag1"):
            bad.append(f)
    ok = len(bad) == 0
    if verbose:
        if ok:
            print(f"[leakage-check] OK for target '{target}'.")
        else:
            print(f"[leakage-check] Potential leakage for target '{target}': {bad}")
    return ok

def create_target_specific_features(
    target: str,
    numerical_features: list[str],
    nominal_categoricals: list[str],
    ordinal_categoricals: list[str],
) -> list[str]:
    """
    Choose features for a target. Uses your config lists when available.
    Falls back to filtering provided feature lists.
    """
    from src.heat_data_scientist_2025.utils.config import (
        season_pie_numerical_features as PIE_NUMS,
        game_score_per36_numerical_features as GS36_NUMS,
    )
    # config lists take precedence and are used as-is;
    # numerical_features is only the fallback for other targets
    if target == "season_pie":
        numeric = list(PIE_NUMS)
    elif target == "game_score_per36":
        numeric = list(GS36_NUMS)
    else:
        numeric = list(numerical_features)  # generic fallback

    return list(dict.fromkeys(numeric + nominal_categoricals + ordinal_categoricals))  # keep order, remove dups

@dataclass
class EncodersBundle:
    """
    Stores all encoders / mappings for re-use at prediction time and for decoding.
    """
    nominal_maps: dict         # {col: {"forward": {val->code}, "reverse": {code->val}}}
    ordinal_maps: dict         # {col: {"forward": {val->code}, "reverse": {code->val}}}
    raw_label_encoders: dict   # {col: fitted LabelEncoder}  (kept for compatibility / inspection)

def encode_categoricals(
    df: pd.DataFrame,
    nominal_categoricals: list[str],
    ordinal_categoricals: list[str],
    strict: bool = True,
    verbose: bool = True
) -> tuple[pd.DataFrame, EncodersBundle]:
    """
    Encode categoricals with strong debugs and easy reversibility.

    - Ordinals use explicit domain order (e.g., minutes_tier).
    - Nominals use LabelEncoder, but we freeze to dicts for robust transform.
    - We don't silently coerce unknowns; we raise unless strict=False.
    """
    out = df.copy()
    raw_label_encoders: dict[str, LabelEncoder] = {}
    nominal_maps: dict[str, dict] = {}
    ordinal_maps: dict[str, dict] = {}

    # Nominal
    for col in nominal_categoricals:
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        le = LabelEncoder()
        le.fit(ser.to_numpy())
        raw_label_encoders[col] = le
        maps = _freeze_label_encoder(le)
        nominal_maps[col] = maps
        out[col] = ser.map(maps["forward"])
        # enforce no NaNs after mapping unless strict=False
        unknown = out[col].isna()
        if unknown.any():
            unseen = sorted(ser[unknown].dropna().unique().tolist())
            msg = (f"[encode_categoricals] Unknown nominal categories in '{col}': {unseen}. "
                   f"Known={list(maps['forward'].keys())}")
            if strict:
                raise ValueError(msg)
            else:
                print("WARN:", msg)
                out.loc[unknown, col] = -1  # explicit bucket for unknown

        out[col] = out[col].astype("int32")

    # Ordinal
    for col in ordinal_categoricals:
        if col not in out.columns:
            continue
        ser = out[col].astype("string")
        # choose explicit order if provided
        order = ORDINAL_ORDERS.get(col)
        if order is None:
            # if no explicit order is known, try to keep category order if any
            if isinstance(out[col].dtype, pd.CategoricalDtype) and out[col].cat.ordered:
                order = list(out[col].cat.categories.astype("string"))
            else:
                raise ValueError(f"[encode_categoricals] No explicit order for ordinal '{col}'. "
                                 f"Provide ORDINAL_ORDERS['{col}']=[...]")
        # allow 'Unknown' as explicit category so we can map NA deterministically
        if "Unknown" not in order:
            order = order + ["Unknown"]
        forward = {lvl: i for i, lvl in enumerate(order)}
        reverse = {i: lvl for lvl, i in forward.items()}
        ordinal_maps[col] = {"forward": forward, "reverse": reverse}

        ser = ser.fillna("Unknown")
        out[col] = ser.map(forward)
        unknown = out[col].isna()
        if unknown.any():
            unseen = sorted(ser[unknown].dropna().unique().tolist())
            msg = (f"[encode_categoricals] Unknown ordinal categories in '{col}': {unseen}. "
                   f"Allowed={order}")
            if strict:
                raise ValueError(msg)
            else:
                print("WARN:", msg)
                out.loc[unknown, col] = forward["Unknown"]

        out[col] = out[col].astype("int16")

    if verbose:
        print(f"[encode_categoricals] Encoded: {len(nominal_maps)} nominal, {len(ordinal_maps)} ordinal")

    return out, EncodersBundle(nominal_maps, ordinal_maps, raw_label_encoders)

def apply_encoders_to_frame(
    df: pd.DataFrame,
    encoders: EncodersBundle,
    strict: bool = True,
    verbose: bool = True
) -> pd.DataFrame:
    """
    Apply frozen encoders to a new frame (e.g., prediction set) without calling LabelEncoder.transform.
    Raises on unknowns unless strict=False.
    """
    out = df.copy()

    # Nominal
    for col, maps in encoders.nominal_maps.items():
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        out[col] = ser.map(maps["forward"])
        unk = out[col].isna()
        if unk.any():
            unseen = sorted(ser[unk].dropna().unique().tolist())
            msg = (f"[apply_encoders] Unknown nominal categories in '{col}': {unseen}. "
                   f"Known={list(maps['forward'].keys())}")
            if strict:
                raise ValueError(msg)
            else:
                print("WARN:", msg)
                out.loc[unk, col] = -1
        out[col] = out[col].astype("int32")

    # Ordinal
    for col, maps in encoders.ordinal_maps.items():
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        out[col] = ser.map(maps["forward"])
        unk = out[col].isna()
        if unk.any():
            unseen = sorted(ser[unk].dropna().unique().tolist())
            msg = (f"[apply_encoders] Unknown ordinal categories in '{col}': {unseen}. "
                   f"Allowed={list(maps['forward'].keys())}")
            if strict:
                raise ValueError(msg)
            else:
                print("WARN:", msg)
                out.loc[unk, col] = maps["forward"]["Unknown"]
        out[col] = out[col].astype("int16")

    if verbose:
        obj_cols = [c for c in out.columns if out[c].dtype == "object"]
        if obj_cols:
            print(f"[apply_encoders] WARN: still object cols after encoding: {obj_cols}")

    return out

def audit_lag_feature_integrity(
    df: pd.DataFrame,
    person_col: str = "personId",
    season_col: str = "season_start_year",
    lag_pairs: list[tuple[str, str]] = [("season_pie_lag1", "season_pie")],
    verbose: bool = True
) -> None:
    """
    Audit that lag columns are true 1-season lags and not same-season leakage.
    Prints diagnostics only; mismatches are reported, not raised.
    """
    if verbose:
        print("\n[audit] Verifying lag feature integrity...")
    if not {person_col, season_col}.issubset(df.columns):
        print("[audit] Skipping: missing required id/season cols.")
        return

    # sort once so every derived Series shares the same index order
    d = df.sort_values([person_col, season_col])
    g = d.groupby(person_col, group_keys=False)
    prev_year = g[season_col].shift(1)
    year_gap = d[season_col] - prev_year

    # rows whose previous row is exactly the prior season (non-first rows only)
    has_prev = g.cumcount() > 0
    valid_prev = has_prev & (year_gap == 1)
    if verbose:
        total = int(has_prev.sum())
        good = int(valid_prev.sum())
        print(f"[audit] Consecutive season pairs: {good}/{total} ({(good/total*100 if total else 100):.1f}%)")

    for lag_col, base_col in lag_pairs:
        if lag_col not in d.columns or base_col not in d.columns:
            print(f"[audit] Skipping pair (missing cols): {lag_col}, {base_col}")
            continue
        expected = g[base_col].shift(1)
        # count NaN vs NaN as a match so missing values are not flagged as mismatches
        same = (d[lag_col] == expected) | (d[lag_col].isna() & expected.isna())
        mism = ~same & valid_prev
        mism_ct = int(mism.sum())
        if verbose:
            print(f"[audit] {lag_col} vs shift({base_col}): mismatches among consecutive seasons = {mism_ct}")
            if mism_ct:
                print(d.loc[mism, [person_col, season_col, lag_col, base_col]].head(5))


def validate_and_evaluate_features(df: pd.DataFrame, 
                                 numerical_features: List[str],
                                 nominal_categoricals: List[str],
                                 ordinal_categoricals: List[str],
                                 y_variables: List[str],
                                 verbose: bool = True) -> Dict[str, Any]:
    """
    Validate that specified features exist and evaluate their completeness.
    Robust to empty feature groups (no division by zero in prints).
    """
    if verbose:
        print("🔍 VALIDATING AND EVALUATING SPECIFIED FEATURES")
        print("=" * 55)
    
    results = {
        'numerical_features': {},
        'nominal_categoricals': {},
        'ordinal_categoricals': {},
        'y_variables': {},
        'missing_features': [],
        'available_features': [],
        'feature_completeness': {}
    }
    
    all_specified_features = numerical_features + nominal_categoricals + ordinal_categoricals + y_variables

    # Per-group validation with safe printing
    groups = [
        ('numerical_features', numerical_features),
        ('nominal_categoricals', nominal_categoricals),
        ('ordinal_categoricals', ordinal_categoricals),
        ('y_variables', y_variables),
    ]
    for feature_type, features in groups:
        available = [f for f in features if f in df.columns]
        missing = [f for f in features if f not in df.columns]

        results[feature_type]['available'] = available
        results[feature_type]['missing'] = missing
        results[feature_type]['availability_pct'] = (len(available) / len(features) * 100) if features else 100.0

        # Calculate completeness only for available features
        for feature in available:
            non_null_count = df[feature].notna().sum()
            total_count = len(df)
            completeness_pct = (non_null_count / total_count * 100.0) if total_count else 0.0
            results['feature_completeness'][feature] = {
                'non_null_count': int(non_null_count),
                'total_count': int(total_count),
                'completeness_pct': float(completeness_pct),
            }

        if verbose:
            print(f"\n📊 {feature_type.replace('_', ' ').title()}:")
            if len(features) == 0:
                print("   Available: 0/0 (n/a — no features specified)")
                print("   Completeness: n/a")
            else:
                pct = (len(available) / len(features) * 100.0)
                print(f"   Available: {len(available)}/{len(features)} ({pct:.1f}%)")
                if missing:
                    print(f"   Missing: {missing}")
                if available:
                    print(f"   Completeness:")
                    # show at most 5 to avoid flooding logs
                    for feature in available[:5]:
                        comp_pct = results['feature_completeness'][feature]['completeness_pct']
                        print(f"     {feature:<25} {comp_pct:5.1f}%")
                    if len(available) > 5:
                        print(f"     ... and {len(available)-5} more")

    # Overall summary (robust even if all lists were empty, though in practice y_variables is non-empty)
    total_specified = len(all_specified_features)
    total_available = len([f for f in all_specified_features if f in df.columns])
    overall_availability = (total_available / total_specified * 100.0) if total_specified else 100.0

    results['overall'] = {
        'total_specified': int(total_specified),
        'total_available': int(total_available),
        'availability_pct': float(overall_availability)
    }
    results['missing_features'] = [f for f in all_specified_features if f not in df.columns]
    results['available_features'] = [f for f in all_specified_features if f in df.columns]

    if verbose:
        print(f"\n📈 OVERALL FEATURE AVAILABILITY:")
        if total_specified == 0:
            print("   Specified features: 0 (n/a)")
            print("   Available features: 0 (n/a)")
            print("   Availability: n/a")
        else:
            print(f"   Specified features: {total_specified}")
            print(f"   Available features: {total_available}")
            print(f"   Availability: {overall_availability:.1f}%")

        if results['missing_features']:
            print(f"\n⚠️  Missing Features ({len(results['missing_features'])}):")
            for feature in results['missing_features'][:10]:
                print(f"     {feature}")
            if len(results['missing_features']) > 10:
                print(f"     ... and {len(results['missing_features'])-10} more")

    return results


def calculate_permutation_importance(X_train: pd.DataFrame, 
                                   y_train: pd.Series,
                                   X_test: pd.DataFrame, 
                                   y_test: pd.Series,
                                   target_name: str,
                                   numerical_features: List[str],
                                   model: Optional[Any] = None,
                                   n_repeats: int = 10,
                                   random_state: int = 42,
                                   verbose: bool = True) -> pd.DataFrame:
    """
    Calculate permutation importance for numerical features for a specific target.
    Hard fail on unexpected nulls to avoid masking upstream issues.
    """
    if verbose:
        print(f"\n🔄 CALCULATING PERMUTATION IMPORTANCE FOR {target_name.upper()}")
        print("=" * 60)

    # Select the numerical features that exist
    available_numerical = [f for f in numerical_features if f in X_train.columns]
    if not available_numerical:
        if verbose:
            print("⚠️  No numerical features available for permutation importance")
        return pd.DataFrame(columns=['feature','importance_mean','importance_std','target','importance_lower','importance_upper'])

    X_train_num = X_train[available_numerical].copy()
    X_test_num  = X_test[available_numerical].copy()

    # Strict null checks (no imputation here)
    n_tr = int(X_train_num.isna().sum().sum())
    n_te = int(X_test_num.isna().sum().sum())
    n_ytr = int(y_train.isna().sum())
    n_yte = int(y_test.isna().sum())
    if n_tr or n_te or n_ytr or n_yte:
        raise ValueError(
            f"[permutation_importance] Unexpected nulls found "
            f"(X_train NA={n_tr}, X_test NA={n_te}, y_train NA={n_ytr}, y_test NA={n_yte}). "
            f"Upstream should use 'filter_complete' to guarantee completeness."
        )

    # Train model if not provided
    if model is None:
        model = RandomForestRegressor(
            n_estimators=100,
            random_state=random_state,
            n_jobs=-1
        )
        model.fit(X_train_num, y_train)

    if verbose:
        print(f"   📊 Features for importance: {len(available_numerical)}")
        print(f"   🎯 Target: {target_name}")
        print(f"   🔄 Repeats: {n_repeats}")

    perm = permutation_importance(
        model, X_test_num, y_test,
        n_repeats=n_repeats,
        random_state=random_state,
        scoring='r2'
    )

    importance_df = (
        pd.DataFrame({
            'feature': available_numerical,
            'importance_mean': perm.importances_mean,
            'importance_std': perm.importances_std,
            'target': target_name
        })
        .sort_values('importance_mean', ascending=False)
        .reset_index(drop=True)
    )
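    # Approximate normal band (mean ± 1.96 * std across permutation repeats);
    # a rough spread indicator, not a formal confidence interval.
    importance_df['importance_lower'] = importance_df['importance_mean'] - 1.96 * importance_df['importance_std']
    importance_df['importance_upper'] = importance_df['importance_mean'] + 1.96 * importance_df['importance_std']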

    if verbose and not importance_df.empty:
        print(f"\n📈 TOP 10 MOST IMPORTANT FEATURES FOR {target_name.upper()}:")
        print("-" * 70)
        for i, (_, row) in enumerate(importance_df.head(10).iterrows(), 1):
            print(f"{i:2d}. {row['feature']:<30} {row['importance_mean']:8.4f} ± {row['importance_std']:6.4f}")

    return importance_df



def filter_features_by_importance(importance_df: pd.DataFrame,
                                min_importance: float = 0.001,
                                max_features: Optional[int] = None,
                                target_name: str = "unknown",
                                verbose: bool = True) -> List[str]:
    """
    Filter features based on permutation importance scores.
    
    Returns:
        List of important feature names
    """
    if importance_df.empty:
        return []
    
    if verbose:
        print(f"\n✂️  FILTERING FEATURES FOR {target_name.upper()}")
        print("=" * 50)
    
    # Filter by minimum importance
    important_features = importance_df[
        importance_df['importance_mean'] > min_importance
    ].copy()
    
    # Apply max features limit if specified (explicit None check so 0 is not treated as "no limit")
    if max_features is not None and len(important_features) > max_features:
        important_features = important_features.head(max_features)
    
    feature_names = important_features['feature'].tolist()
    
    if verbose:
        total_features = len(importance_df)
        kept_features = len(feature_names)
        removed_features = total_features - kept_features
        
        print(f"   🎯 Min importance threshold: {min_importance}")
        if max_features:
            print(f"   📊 Max features limit: {max_features}")
        print(f"   ✅ Features kept: {kept_features}/{total_features} ({kept_features/total_features*100:.1f}%)")
        print(f"   ❌ Features removed: {removed_features}")
        
        if kept_features > 0:
            print(f"   📈 Importance range: {important_features['importance_mean'].min():.4f} to {important_features['importance_mean'].max():.4f}")
    
    return feature_names


def create_game_score_per36_feature(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """
    Create the game_score_per36 feature, which the base feature engineering does not produce.
    Game Score = PTS + 0.4*FG - 0.7*FGA - 0.4*(FTA-FT) + 0.7*ORB + 0.3*DRB + STL + 0.7*AST + 0.7*BLK - 0.4*PF - TOV
    It is computed from season totals and then scaled to a per-36-minute rate.
    """
    out = df.copy()
    created = []
    
    # Required columns for game score calculation
    required_cols = ['total_points', 'total_fgm', 'total_fga', 'total_fta', 'total_ftm', 
                    'total_reb_off', 'total_reb_def', 'total_steals', 'total_assists', 
                    'total_blocks', 'total_pf', 'total_tov', 'total_minutes']
    
    missing_cols = [c for c in required_cols if c not in out.columns]
    if missing_cols:
        print(f"[create_game_score_per36] Warning: Missing columns {missing_cols}")
        # Fallback proxy: season_pie * 36 when that column exists, otherwise the
        # constant 0.1 * 36. This is a rough stand-in, not a true Game Score.
        out['game_score_per36'] = out.get('season_pie', 0.1) * 36
        return out, ['game_score_per36']
    
    # Calculate Game Score from season totals
    out['game_score_total'] = (
        out['total_points'] + 
        0.4 * out['total_fgm'] - 
        0.7 * out['total_fga'] - 
        0.4 * (out['total_fta'] - out['total_ftm']) +
        0.7 * out['total_reb_off'] + 
        0.3 * out['total_reb_def'] + 
        out['total_steals'] + 
        0.7 * out['total_assists'] + 
        0.7 * out['total_blocks'] - 
        0.4 * out['total_pf'] - 
        out['total_tov']
    )
    
    # Convert to per-36 minute rate
    out['game_score_per36'] = np.where(
        out['total_minutes'] > 0,
        out['game_score_total'] * 36 / out['total_minutes'],
        0.0
    )
    
    created.extend(['game_score_total', 'game_score_per36'])
    return out, created


def create_multi_target_datasets(df_engineered: pd.DataFrame,
                                numerical_features: List[str],
                                nominal_categoricals: List[str], 
                                ordinal_categoricals: List[str],
                                y_variables: List[str],
                                strategy: str = "filter_complete",
                                test_seasons: Optional[List] = None,
                                season_col: str = "season_start_year",
                                verbose: bool = True) -> Dict[str, Any]:
    """
    Create ML datasets for multiple target variables with target-specific features.
    - Encodes categoricals with explicit ordinal order (e.g., minutes_tier).
    - Returns encoders for re-use at prediction time.
    - Runs a lag audit for sanity (prints only).
    """
    if verbose:
        print(f"\n📋 CREATING MULTI-TARGET ML DATASETS")
        print("=" * 45)
        print(f"🎯 Targets: {y_variables}")
        print(f"📊 Strategy: {strategy}")

    # Defensive: audit lag features (print-only)
    try:
        audit_lag_feature_integrity(df_engineered, verbose=verbose)
    except Exception as e:
        print(f"[audit] ERROR during lag integrity check: {e}")

    # 1) Encode categoricals ONCE and keep encoders for reuse
    if verbose:
        print("[step] Encoding categoricals...")
    df_processed, enc_bundle = encode_categoricals(
        df_engineered,
        nominal_categoricals=nominal_categoricals,
        ordinal_categoricals=ordinal_categoricals,
        strict=True,
        verbose=verbose
    )

    # 2) Temporal split config
    if test_seasons is None:
        test_seasons = ML_CONFIG.TEST_YEARS

    train_mask = ~df_processed[season_col].isin(test_seasons)
    test_mask = df_processed[season_col].isin(test_seasons)

    results = {
        'datasets': {},
        'encoders': enc_bundle,                # reused at prediction time
        'label_encoders': enc_bundle.raw_label_encoders,  # keep old key for compatibility
        'train_seasons': sorted(df_processed[train_mask][season_col].unique()),
        'test_seasons': sorted(df_processed[test_mask][season_col].unique())
    }

    # 3) Build target-specific datasets
    for target in y_variables:
        if target not in df_processed.columns:
            if verbose:
                print(f"⚠️  Target '{target}' not found in data, skipping")
            continue

        if verbose:
            print(f"\n🎯 Processing target: {target}")

        target_features = create_target_specific_features(
            target, numerical_features, nominal_categoricals, ordinal_categoricals
        )
        available_features = [f for f in target_features if f in df_processed.columns]

        # Leakage sanity
        if not validate_target_feature_separation(target, available_features, verbose):
            if verbose:
                print(f"   ⚠️  Skipping {target} due to target leakage candidates")
            continue

        if verbose:
            print(f"   📊 Target-specific features: {len(available_features)}")

        # Filter for complete cases if requested
        target_mask = df_processed[target].notna()
        if strategy == "filter_complete":
            feat_mask = df_processed[available_features].notna().all(axis=1)
            full_mask = target_mask & feat_mask
        else:
            full_mask = target_mask

        train_data = df_processed[train_mask & full_mask]
        test_data  = df_processed[test_mask & full_mask]

        if len(train_data) == 0 or len(test_data) == 0:
            if verbose:
                print(f"   ⚠️  Insufficient data for {target} (train: {len(train_data)}, test: {len(test_data)})")
            continue

        # Hard stop on leftover NA: do not mask upstream issues. With
        # strategy="filter_complete" this should never trigger; any other
        # strategy must deliver complete features some other way.
        n_train_na = train_data[available_features + [target]].isna().sum().sum()
        n_test_na  = test_data[available_features + [target]].isna().sum().sum()
        if n_train_na or n_test_na:
            raise ValueError(
                f"[create_multi_target_datasets] Found NA with strategy='{strategy}' "
                f"for {target} (train NA={n_train_na}, test NA={n_test_na}). Diagnose upstream lag/joins."
            )

        X_train = train_data[available_features].copy()
        y_train = train_data[target].copy()
        X_test  = test_data[available_features].copy()
        y_test  = test_data[target].copy()

        results['datasets'][target] = {
            'X_train': X_train,
            'y_train': y_train,
            'X_test':  X_test,
            'y_test':  y_test,
            'feature_names': available_features,
            'train_size': len(X_train),
            'test_size': len(X_test),
            'target_name': target
        }

        if verbose:
            print(f"   ✅ Train: {len(X_train)}, Test: {len(X_test)}, Features: {len(available_features)}")

    return results



def train_multi_target_models(datasets: Dict[str, Any],
                             numerical_features: List[str],
                             importance_threshold: float = 0.001,
                             max_features_per_target: Optional[int] = None,
                             n_importance_repeats: int = 10,
                             verbose: bool = True) -> Dict[str, Any]:
    """
    Train models for multiple targets with permutation importance filtering.
    Uses target-specific features to prevent leakage.
    
    Returns:
        Dict containing trained models, importance scores, and filtered features
    """
    if verbose:
        print(f"\n🤖 TRAINING MULTI-TARGET MODELS WITH IMPORTANCE FILTERING")
        print("=" * 65)
    
    results = {
        'models': {},
        'importance_scores': {},
        'filtered_features': {},
        'evaluation_metrics': {}
    }
    
    for target_name, data in datasets['datasets'].items():
        if verbose:
            print(f"\n🎯 Training model for: {target_name}")
            print("-" * 40)
        
        X_train = data['X_train']
        y_train = data['y_train']
        X_test = data['X_test']
        y_test = data['y_test']
        feature_names = data['feature_names']
        
        # Get target-specific numerical features (subset of all numerical features)
        target_numerical = [f for f in numerical_features if f in feature_names]
        
        if verbose:
            print(f"   📊 Using {len(target_numerical)} target-specific numerical features")
        
        # No model is fitted here: calculate_permutation_importance() trains its
        # own RandomForest on the numerical subset when `model` is None, so a
        # model fitted on all columns at this point would go unused.
        
        # Calculate permutation importance (only for target-specific numerical features)
        importance_df = calculate_permutation_importance(
            X_train, y_train, X_test, y_test,
            target_name, target_numerical, None,  # Pass None to train new model
            n_repeats=n_importance_repeats, verbose=verbose
        )
        
        results['importance_scores'][target_name] = importance_df
        
        # Filter features based on importance
        if not importance_df.empty:
            important_features = filter_features_by_importance(
                importance_df, importance_threshold, max_features_per_target,
                target_name, verbose=verbose
            )
            
            # Add back categorical features (they weren't in permutation importance)
            # Exclude 'prediction_season' to avoid unseen category issues
            categorical_features = [f for f in feature_names 
                                  if f not in numerical_features and f in X_train.columns and f != "prediction_season"]
            final_features = important_features + categorical_features
            
            # Retrain with filtered features
            if final_features:
                X_train_filtered = X_train[final_features]
                X_test_filtered = X_test[final_features]
                
                final_model = RandomForestRegressor(
                    n_estimators=100,
                    random_state=42,
                    n_jobs=-1
                )
                final_model.fit(X_train_filtered, y_train)
                
                # Evaluate filtered model
                y_pred = final_model.predict(X_test_filtered)
                metrics = {
                    'r2': r2_score(y_test, y_pred),
                    'rmse': np.sqrt(mean_squared_error(y_test, y_pred)),
                    'mae': mean_absolute_error(y_test, y_pred)
                }
                
                results['models'][target_name] = final_model
                results['filtered_features'][target_name] = final_features
                results['evaluation_metrics'][target_name] = metrics
                
                if verbose:
                    print(f"   ✅ Final model R²: {metrics['r2']:.3f}, RMSE: {metrics['rmse']:.4f}")
                    print(f"   📊 Final features: {len(final_features)} ({len(important_features)} numerical + {len(categorical_features)} categorical)")
            else:
                if verbose:
                    print(f"   ⚠️  No features passed importance threshold for {target_name}")
        else:
            if verbose:
                print(f"   ⚠️  Could not calculate importance for {target_name}")
    
    return results


def save_feature_importance_results(results: Dict[str, Any],
                                  output_dir: Path,
                                  verbose: bool = True) -> None:
    """
    Save feature importance results and final feature lists.
    """
    if verbose:
        print(f"\n💾 SAVING FEATURE IMPORTANCE RESULTS")
        print("=" * 40)
    
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    
    # Save importance scores for each target
    for target_name, importance_df in results['importance_scores'].items():
        if not importance_df.empty:
            importance_path = output_dir / f"{target_name}_permutation_importance.csv"
            importance_df.to_csv(importance_path, index=False)
            if verbose:
                print(f"   📈 {target_name} importance: {importance_path}")
    
    # Save filtered features for each target
    filtered_features_summary = {}
    for target_name, features in results['filtered_features'].items():
        filtered_features_summary[target_name] = {
            'features': features,
            'count': len(features)
        }
    
    features_path = output_dir / "filtered_features_summary.json"
    with open(features_path, 'w') as f:
        json.dump(filtered_features_summary, f, indent=2)
    
    if verbose:
        print(f"   📋 Filtered features: {features_path}")
    
    # Save evaluation metrics
    metrics_path = output_dir / "model_evaluation_metrics.json"
    with open(metrics_path, 'w') as f:
        json.dump(results['evaluation_metrics'], f, indent=2, default=str)
    
    if verbose:
        print(f"   📊 Evaluation metrics: {metrics_path}")


def print_final_results(results: Dict[str, Any], verbose: bool = True) -> None:
    """
    Print comprehensive final results summary.
    """
    if not verbose:
        return
        
    print(f"\n🎉 FINAL RESULTS SUMMARY")
    print("=" * 30)
    
    for target_name in results['models'].keys():
        print(f"\n🎯 TARGET: {target_name.upper()}")
        print("-" * 40)
        
        # Model performance
        if target_name in results['evaluation_metrics']:
            metrics = results['evaluation_metrics'][target_name]
            print(f"📊 Model Performance:")
            print(f"   R²: {metrics['r2']:.4f}")
            print(f"   RMSE: {metrics['rmse']:.4f}") 
            print(f"   MAE: {metrics['mae']:.4f}")
        
        # Feature count
        if target_name in results['filtered_features']:
            features = results['filtered_features'][target_name]
            print(f"📈 Features Used: {len(features)}")
        
        # Top importance scores
        if target_name in results['importance_scores']:
            importance_df = results['importance_scores'][target_name]
            if not importance_df.empty:
                print(f"🏆 Top 5 Most Important Features:")
                top_5 = importance_df.head(5)
                for i, (_, row) in enumerate(top_5.iterrows(), 1):
                    print(f"   {i}. {row['feature']:<25} {row['importance_mean']:.4f}")


def create_target_specific_features(target_name: str, 
                                   base_numerical_features: List[str],
                                   nominal_categoricals: List[str],
                                   ordinal_categoricals: List[str]) -> List[str]:
    """
    Create target-specific feature lists to prevent leakage.

    We only forbid the *current* target (and trivially identical transforms),
    but DO allow *lagged* versions (e.g., season_pie_lag1) because those are valid
    predictors for next season.

    Returns:
        List of features appropriate for the specific target
    """
    target_name = str(target_name).strip()

    # Disallow contemporaneous target columns (and obvious direct twins)
    hard_exclusions = {
        'season_pie': {'season_pie'},               # forbid current target only
        'game_score_per36': {'game_score_per36'},   # forbid current target only
    }
    exclusions = hard_exclusions.get(target_name, set())

    # keep only allowed features; lags remain allowed (e.g., *_lag1)
    safe_numerical = [f for f in base_numerical_features if f not in exclusions]

    # Combine all feature types
    all_features = safe_numerical + nominal_categoricals + ordinal_categoricals
    return all_features


def validate_target_feature_separation(target_name: str, 
                                       features: List[str], 
                                       verbose: bool = True) -> bool:
    """
    Validate that features don't contain contemporaneous target leakage.

    We explicitly allow lagged target features (e.g., season_pie_lag1).
    """
    target_name = str(target_name).strip()
    contemporaneous = {target_name}

    leaked = [f for f in features if f in contemporaneous]
    if leaked:
        if verbose:
            print(f"⚠️  TARGET LEAKAGE DETECTED for {target_name}: {leaked} (current target in features)")
        return False

    if verbose:
        print(f"✅ No contemporaneous target leakage for {target_name} (lags allowed)")
    return True


def generate_and_save_predictions(
    df_engineered: pd.DataFrame,
    datasets: Dict[str, Any],
    model_results: Dict[str, Any],
    season_col: str = "season_start_year",
    id_cols: list[str] = ["personId", "player_name"],
    verbose: bool = True
) -> dict[str, Path]:
    """
    Generate predictions for ML_CONFIG.PREDICTION_YEAR using models and features
    learned on historical data. Applies the SAME encoders used during dataset creation.

    Saves one parquet per target:
      .../predictions/{target}_predictions_{PREDICTION_YEAR}.parquet
    """
    enc_bundle: EncodersBundle = datasets.get("encoders")
    if enc_bundle is None:
        raise RuntimeError("[generate_and_save_predictions] Missing encoders in 'datasets'. "
                           "Ensure create_multi_target_datasets() returned 'encoders'.")

    pred_year   = ML_CONFIG.PREDICTION_YEAR
    source_year = ML_CONFIG.SOURCE_YEAR

    # Base rows for predictions: use last observed season (source_year)
    base = df_engineered.loc[df_engineered[season_col] == source_year].copy()
    if verbose:
        print(f"[predict] Base rows from season={source_year}: {len(base)}")

    saved_paths: dict[str, Path] = {}

    for target, model in model_results.get("models", {}).items():
        final_feats = model_results["filtered_features"].get(target)
        if not final_feats:
            print(f"[predict] ⚠️ No final features recorded for target '{target}', skipping.")
            continue

        if verbose:
            print(f"\n[predict] Target={target}")
            print(f"[predict] Using {len(final_feats)} features")

        # Assemble X_pred from base
        missing = [c for c in final_feats if c not in base.columns]
        if missing:
            raise KeyError(f"[predict] Base frame missing features for '{target}': {missing}")

        X_pred_raw = base[final_feats].copy()

        # Apply encoders to X_pred to match training
        X_pred = apply_encoders_to_frame(X_pred_raw, enc_bundle, strict=True, verbose=verbose)

        # Final guard: no strings allowed
        ensure_numeric_matrix(X_pred, context=f"X_pred ({target})")

        # Predict
        y_hat = model.predict(X_pred)

        # Assemble output
        pred_df = base[id_cols + [season_col]].copy()
        pred_df["prediction_season"] = int(pred_year)
        pred_df[f"{target}_pred"] = y_hat

        # Save
        path = CFG.predictions_path(target, year=pred_year)
        path.parent.mkdir(parents=True, exist_ok=True)
        pred_df.to_parquet(path, index=False)
        saved_paths[target] = path

        if verbose:
            print(f"[predict] Saved {target} predictions → {path}")

    return saved_paths




# Enhanced AutomatedMLPipeline class with multi-target support
class EnhancedMultiTargetMLPipeline:
    """
    Enhanced ML Pipeline with multi-target support and feature importance filtering.
    """
    
    def __init__(self, 
                 numerical_features: List[str],
                 nominal_categoricals: List[str],
                 ordinal_categoricals: List[str], 
                 y_variables: List[str],
                 importance_threshold: float = 0.001,
                 max_features_per_target: Optional[int] = None,
                 verbose: bool = True):
        
        self.numerical_features = numerical_features
        self.nominal_categoricals = nominal_categoricals
        self.ordinal_categoricals = ordinal_categoricals
        self.y_variables = y_variables
        self.importance_threshold = importance_threshold
        self.max_features_per_target = max_features_per_target
        self.verbose = verbose
        
        self.results = {}
        
        # Ensure directories exist
        CFG.ensure_ml_dirs()
    
    def run_complete_pipeline(self, df_engineered: pd.DataFrame) -> Dict[str, Any]:
        """
        Run the complete enhanced pipeline with multi-target support.
        """
        if self.verbose:
            print("🚀 ENHANCED MULTI-TARGET ML PIPELINE")
            print("=" * 50)
            print(f"🎯 Targets: {self.y_variables}")
            print(f"📊 Numerical features: {len(self.numerical_features)}")
            print(f"🏷️  Categorical features: {len(self.nominal_categoricals + self.ordinal_categoricals)}")
        
        # Step 1: Validate features
        validation_results = validate_and_evaluate_features(
            df_engineered, self.numerical_features, self.nominal_categoricals,
            self.ordinal_categoricals, self.y_variables, self.verbose
        )
        
        # Step 2: Create multi-target datasets
        datasets = create_multi_target_datasets(
            df_engineered, self.numerical_features, self.nominal_categoricals,
            self.ordinal_categoricals, self.y_variables, verbose=self.verbose
        )
        
        # Step 3: Train models with importance filtering
        model_results = train_multi_target_models(
            datasets, self.numerical_features, self.importance_threshold,
            self.max_features_per_target, verbose=self.verbose
        )
        
        # Step 4: Save importance results
        save_feature_importance_results(
            model_results, CFG.ml_evaluation_dir, self.verbose
        )
        
        # Step 5: Generate and save predictions
        saved_pred_paths = generate_and_save_predictions(
            df_engineered, datasets, model_results, verbose=self.verbose
        )
        
        # Step 6: Print final summary
        print_final_results(model_results, self.verbose)
        
        # Combine all results
        self.results = {
            'feature_validation': validation_results,
            'datasets': datasets,
            'model_results': model_results,
            'saved_predictions': {k: str(v) for k, v in saved_pred_paths.items()},
            'config': {
                'numerical_features': self.numerical_features,
                'nominal_categoricals': self.nominal_categoricals,
                'ordinal_categoricals': self.ordinal_categoricals,
                'y_variables': self.y_variables,
                'importance_threshold': self.importance_threshold,
                'max_features_per_target': self.max_features_per_target
            }
        }
        
        return self.results


# Example usage function
def run_enhanced_pipeline_example():
    """
    Example of how to use the enhanced pipeline with your specified features.
    """
    
    # Your specified features
    numerical_features = [
        # lagged features
        "season_pie_lag1", "ts_pct_lag1", "efg_pct_lag1", "fg_pct_lag1", "fg3_pct_lag1", "ft_pct_lag1",
        "pts_per36_lag1", "ast_per36_lag1", "reb_per36_lag1", "defensive_per36_lag1",
        "production_per36_lag1", "stocks_per36_lag1", "three_point_rate_lag1", "ft_rate_lag1",
        "pts_per_shot_lag1", "ast_to_tov_lag1", "usage_events_per_min_lag1", "usage_per_min_lag1",
        "games_played_lag1", "total_minutes_lag1", "total_points_lag1", "total_assists_lag1",
        "total_rebounds_lag1", "total_steals_lag1", "total_blocks_lag1", "total_fga_lag1",
        "total_fta_lag1", "total_3pa_lag1", "total_3pm_lag1", "total_tov_lag1", "win_pct_lag1",
        "avg_plus_minus_lag1", "team_win_pct_final_lag1",
        "offensive_impact_lag1", "two_way_impact_lag1", "efficiency_volume_score_lag1",
        "versatility_score_lag1", "shooting_score_lag1"
    ]
    
    nominal_categoricals = ["prediction_season"]
    ordinal_categoricals = ["minutes_tier"] 
    y_variables = ["season_pie", "game_score_per36"]
    
    # Load your engineered data
    from src.heat_data_scientist_2025.data.load_data_utils import load_data_optimized
    
    df = load_data_optimized(CFG.ml_dataset_path, drop_null_rows=True)
    df_engineered, _, _ = engineer_features(df, verbose=True)
    
    # Add game_score_per36 if needed
    if 'game_score_per36' in y_variables and 'game_score_per36' not in df_engineered.columns:
        df_engineered, _ = create_game_score_per36_feature(df_engineered)
    
    # Create and run enhanced pipeline
    pipeline = EnhancedMultiTargetMLPipeline(
        numerical_features=numerical_features,
        nominal_categoricals=nominal_categoricals,
        ordinal_categoricals=ordinal_categoricals,
        y_variables=y_variables,
        importance_threshold=0.001,  # Adjust threshold as needed
        max_features_per_target=30,  # Limit features per target
        verbose=True
    )
    
    results = pipeline.run_complete_pipeline(df_engineered)
    
    return results


if __name__ == "__main__":
    # Run the example
    results = run_enhanced_pipeline_example()
    
    print("\n✅ Enhanced pipeline completed!")
    print("Check the output directories for saved feature importance and model results.")


Overwriting src/heat_data_scientist_2025/ml/enhanced_ml_pipeline.py
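
The module written above ranks numerical features by permutation importance on the held-out seasons and keeps those above a small threshold. The sketch below reproduces that idea on synthetic data, as an illustration only: the feature names (`signal_lag1`, `weak_lag1`, `noise_lag1`), the data, and the 300/100 split are hypothetical, and the 0.001 threshold simply mirrors the pipeline's default.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (hypothetical): one strong driver, one weak driver, one pure-noise feature
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "signal_lag1": rng.normal(size=n),
    "weak_lag1": rng.normal(size=n),
    "noise_lag1": rng.normal(size=n),
})
y = 3.0 * X["signal_lag1"] + 0.5 * X["weak_lag1"] + rng.normal(scale=0.1, size=n)

# Simple ordered split standing in for the train/test season split
X_train, X_test = X.iloc[:300], X.iloc[300:]
y_train, y_test = y.iloc[:300], y.iloc[300:]

model = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=1)
model.fit(X_train, y_train)

# Importance = mean drop in test R² when a single column is shuffled
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42, scoring="r2"
)

importance_df = (
    pd.DataFrame({
        "feature": list(X.columns),
        "importance_mean": perm.importances_mean,
        "importance_std": perm.importances_std,
    })
    .sort_values("importance_mean", ascending=False)
    .reset_index(drop=True)
)

# Keep features whose mean importance clears the threshold
min_importance = 0.001
kept = importance_df.loc[
    importance_df["importance_mean"] > min_importance, "feature"
].tolist()

print(importance_df)
print("kept:", kept)
```

Because importance is measured on held-out rows, a feature the forest merely memorised during training tends to score near zero (or slightly negative) and is dropped by the threshold.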

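As a worked check of the Game Score formula used in both modules, the sketch below computes a per-36 value from one made-up season line. The helper name `game_score_per36` and the stat values are hypothetical; the column names follow the `total_*` convention in the code. Returning NaN for zero minutes is an assumption here (it matches the leaderboard loader, whereas `create_game_score_per36_feature` uses 0.0).

```python
import numpy as np
import pandas as pd


def game_score_per36(totals: pd.DataFrame) -> pd.Series:
    """Game Score from season totals, scaled to 36 minutes (NaN when minutes are 0)."""
    gs_total = (
        totals["total_points"]
        + 0.4 * totals["total_fgm"]
        - 0.7 * totals["total_fga"]
        - 0.4 * (totals["total_fta"] - totals["total_ftm"])
        + 0.7 * totals["total_reb_off"]
        + 0.3 * totals["total_reb_def"]
        + totals["total_steals"]
        + 0.7 * totals["total_assists"]
        + 0.7 * totals["total_blocks"]
        - 0.4 * totals["total_pf"]
        - totals["total_tov"]
    )
    minutes = totals["total_minutes"]
    # where() turns non-positive minutes into NaN, so the division yields NaN there
    return (gs_total * 36.0 / minutes.where(minutes > 0)).rename("game_score_per36")


# Made-up season lines: one regular contributor, one player who never took the floor
season = pd.DataFrame({
    "total_points": [1000, 0], "total_fgm": [400, 0], "total_fga": [800, 0],
    "total_fta": [200, 0], "total_ftm": [150, 0],
    "total_reb_off": [50, 0], "total_reb_def": [250, 0],
    "total_steals": [60, 0], "total_assists": [200, 0], "total_blocks": [30, 0],
    "total_pf": [100, 0], "total_tov": [120, 0], "total_minutes": [1800, 0],
})

result = game_score_per36(season)
print(result)  # first row: 751 total Game Score * 36 / 1800 minutes = 15.02
```

The first row works out by hand as 1000 + 160 - 560 - 20 + 35 + 75 + 60 + 140 + 21 - 40 - 120 = 751, which over 1,800 minutes gives 15.02 per 36.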

In [6]:
%%writefile src/heat_data_scientist_2025/ml/leaderboard_compare.py
"""
Enhanced leaderboard_compare.py with fixes for column naming and code cleanup
"""

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple, Optional
import re
import numpy as np
import pandas as pd

from src.heat_data_scientist_2025.utils.config import CFG, ML_CONFIG

# ---------- Improved Helper Functions ----------

def _season_str(start_year: int) -> str:
    """Convert year to season string format (e.g., 2024 -> '2024-25')."""
    return f"{start_year}-{str((start_year + 1) % 100).zfill(2)}"

def _as_float(s) -> pd.Series:
    """Convert a Series or scalar to a float Series; unparseable values become NaN."""
    if not isinstance(s, pd.Series):
        s = pd.Series([s])  # wrap scalar values
    return pd.to_numeric(s, errors="coerce").astype(float)

def _normalize_season(s: pd.Series) -> pd.Series:
    """Normalize season strings to YYYY-YY format."""
    s = s.astype(str).str.strip()
    # Already in correct format
    mask = s.str.match(r"^\d{4}-\d{2}$")
    if mask.all():
        return s
    # Convert single year to season format
    yo = s.str.extract(r"(\d{4})")[0]
    ok = yo.notna()
    out = s.copy()
    out.loc[ok] = yo[ok] + "-" + (yo[ok].astype(int).add(1) % 100).astype(str).str.zfill(2)
    return out

def _find_prediction_column(df: pd.DataFrame, metric: str, verbose: bool = True) -> str:
    """
    Find the prediction column, trying several naming conventions before falling back.
    
    Tries in order:
    1. Direct metric name (e.g., 'game_score_per36')
    2. {metric}_pred pattern (e.g., 'game_score_per36_pred')
    3. predicted_{metric} pattern (e.g., 'predicted_game_score_per36')
    4. pred_{metric} pattern (e.g., 'pred_game_score_per36')
    5. The single numeric column whose name contains 'pred', if exactly one exists
    
    Args:
        df: DataFrame containing predictions
        metric: Target metric name
        verbose: Whether to print debug info
        
    Returns:
        Column name containing predictions
        
    Raises:
        KeyError: If no suitable prediction column found
    """
    candidates = [
        metric,                           # Direct: 'game_score_per36'
        f"{metric}_pred",                 # Suffix: 'game_score_per36_pred'  
        f"predicted_{metric}",            # Prefix: 'predicted_game_score_per36'
        f"pred_{metric}",                 # Alt prefix: 'pred_game_score_per36'
    ]
    
    # Try exact matches first
    for candidate in candidates:
        if candidate in df.columns:
            if verbose:
                print(f"[pred-col] Found exact match: '{candidate}' for metric '{metric}'")
            return candidate
    
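    # Fallback: any numeric column containing 'pred', excluding the
    # 'prediction_season' metadata column that is saved alongside predictions
    pred_cols = [
        c for c in df.columns
        if 'pred' in c.lower()
        and c != 'prediction_season'
        and pd.api.types.is_numeric_dtype(df[c])
    ]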
    if len(pred_cols) == 1:
        if verbose:
            print(f"[pred-col] Using fallback: '{pred_cols[0]}' for metric '{metric}'")
        return pred_cols[0]
    elif len(pred_cols) > 1:
        raise KeyError(
            f"Ambiguous prediction columns for metric '{metric}': {pred_cols}. "
            f"Expected one of: {candidates}"
        )
    
    # No suitable column found
    raise KeyError(
        f"No prediction column found for metric '{metric}'. "
        f"Tried: {candidates}. Available columns: {list(df.columns)}"
    )

# ---------- Core Data Loading Functions ----------

def _load_hist_minimal(metric: str, minutes_gate: int = 500, verbose: bool = True) -> pd.DataFrame:
    """Load minimal historical frame from the parquet dataset (≤2024)."""
    need_cols = ["player_name", "season", "games_played", "total_minutes", metric]
    df = pd.read_parquet(CFG.ml_dataset_path)
    
    # Keep ≤ 2024 seasons only (rows whose season has no parseable year are dropped)
    start_year = pd.to_numeric(
        df["season"].astype(str).str.extract(r"^(\d{4})")[0], errors="coerce"
    )
    df = df.loc[start_year <= 2024].copy()

    # Handle missing game_score_per36 (compute if possible)
    if metric == "game_score_per36" and "game_score_per36" not in df.columns:
        required_for_gs = {
            "total_points", "total_fgm", "total_fga", "total_fta", "total_ftm",
            "total_reb_off", "total_reb_def", "total_steals", "total_assists",
            "total_blocks", "total_pf", "total_tov", "total_minutes"
        }
        if required_for_gs.issubset(df.columns):
            if verbose:
                print(f"[COMPUTE] Calculating missing {metric} from component stats")
            game_score_total = (
                df["total_points"] + 0.4 * df["total_fgm"] - 0.7 * df["total_fga"]
                - 0.4 * (df["total_fta"] - df["total_ftm"]) + 0.7 * df["total_reb_off"]
                + 0.3 * df["total_reb_def"] + df["total_steals"] + 0.7 * df["total_assists"]
                + 0.7 * df["total_blocks"] - 0.4 * df["total_pf"] - df["total_tov"]
            )
            df["game_score_per36"] = np.where(
                df["total_minutes"] > 0, 
                game_score_total * 36.0 / df["total_minutes"], 
                np.nan
            )
        else:
            if verbose:
                print(f"[WARN] Cannot compute {metric} - missing required columns")
            df[metric] = np.nan

    # Ensure required columns exist and are numeric
    for c in ["games_played", "total_minutes", metric]:
        if c not in df.columns:
            df[c] = np.nan
        df[c] = _as_float(df[c])

    # Apply minutes gate and clean up
    df = df.loc[df["total_minutes"] >= minutes_gate].copy()
    return df[need_cols].dropna(subset=[metric])

def _load_predictions(metric: str, prediction_year: int, verbose: bool = True) -> pd.DataFrame:
    """
    Load predictions with flexible column detection and explicit error handling.
    """
    pth = CFG.predictions_path(metric, prediction_year)
    if verbose:
        print(f"[LOAD] Loading predictions: {pth}")
    
    if not pth.exists():
        raise FileNotFoundError(f"Predictions file not found: {pth}")
        
    df = pd.read_parquet(pth)
    if verbose:
        print(f"[LOAD] Loaded {len(df):,} prediction rows with columns: {list(df.columns)}")

    # Use improved column detection
    pred_col = _find_prediction_column(df, metric, verbose=verbose)
    
    # Ensure required ID columns
    required_id_cols = ["player_name"]
    missing_id = [c for c in required_id_cols if c not in df.columns]
    if missing_id:
        raise KeyError(f"Predictions missing required columns: {missing_id}")

    # Build output dataframe
    out = pd.DataFrame({
        "player_name": df["player_name"].astype(str),
        "season": _season_str(prediction_year),
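        metric: _as_float(df[pred_col]),
        # Optional columns: default to an all-NaN Series (not a scalar) so
        # _as_float always receives a Series aligned to df's index.
        "games_played": _as_float(df.get("games_played", pd.Series(np.nan, index=df.index))),
        "total_minutes": _as_float(df.get("total_minutes", pd.Series(np.nan, index=df.index))),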
        "source": "pred"
    })
    
    # Remove null predictions
    before_len = len(out)
    out = out.dropna(subset=[metric])
    if verbose and len(out) != before_len:
        print(f"[CLEAN] Removed {before_len - len(out)} null predictions")
    
    return out

# ---------- Ranking and Comparison Functions ----------

def _rank_three_buckets(df: pd.DataFrame, metric: str, top_n: int = 10, 
                        middle_n: int = 10, bottom_n: int = 10) -> Dict[str, pd.DataFrame]:
    """
    Create top/middle/bottom rankings with consistent tie-breaking.
    
    Tie-breakers (in order):
    - Top: metric desc, total_minutes desc, games_played desc, player_name asc
    - Bottom: metric asc, total_minutes desc, games_played desc, player_name asc  
    - Middle: distance to median asc, then same as top
    """
    use = df.loc[df[metric].notna()].copy()
    
    # Ensure tie-breaker columns exist (fill NaN with 0 for sorting only)
    for col in ["total_minutes", "games_played"]:
        if col not in use.columns:
            use[col] = 0.0
        else:
            use[col] = _as_float(use[col]).fillna(0.0)

    # Top rankings (highest metric values)
    top = (
        use.sort_values([metric, "total_minutes", "games_played", "player_name"],
                        ascending=[False, False, False, True], kind="stable")
           .head(top_n).copy()
    )

    # Bottom rankings (lowest metric values)  
    bottom = (
        use.sort_values([metric, "total_minutes", "games_played", "player_name"],
                        ascending=[True, False, False, True], kind="stable")
           .head(bottom_n).copy()
    )

    # Middle rankings (closest to median)
    median_val = use[metric].median(skipna=True)
    use_middle = use.copy()
    use_middle["__dist_to_median"] = (use_middle[metric] - median_val).abs()
    middle = (
        use_middle.sort_values(["__dist_to_median", metric, "total_minutes", "games_played", "player_name"],
                               ascending=[True, False, False, False, True], kind="stable")
                  .head(middle_n)
                  .drop(columns=["__dist_to_median"]).copy()
    )

    return {"top": top, "middle": middle, "bottom": bottom}

def _build_new_boards(hist_df: pd.DataFrame, preds_df: pd.DataFrame,
                      metric: str, prediction_year: int) -> Tuple[Dict[str, pd.DataFrame], Dict[str, pd.DataFrame]]:
    """
    Combine historical and prediction data to create new leaderboards with notes.
    
    Returns:
        (boards_dict, closest_dict) where boards_dict contains the final top/middle/bottom 10
        and closest_dict contains the near-miss predictions for each category.
    """
    # Ensure consistent schema
    hist_df = hist_df.copy()
    hist_df["source"] = "historical"
    
    preds_df = preds_df.copy()
    # Ensure common columns exist with defaults
    for col in ["games_played", "total_minutes"]:
        if col not in hist_df.columns:
            hist_df[col] = np.nan
        if col not in preds_df.columns: 
            preds_df[col] = np.nan

    # Combine datasets
    combined = pd.concat([hist_df, preds_df], ignore_index=True, sort=False)
    
    # Generate rankings from combined data
    boards = _rank_three_buckets(combined, metric)

    # Add rank numbers and prediction notes
    season_tag = _season_str(prediction_year)
    for bucket_name, board_df in boards.items():
        board_df = board_df.copy()
        board_df.insert(0, "Rank", range(1, len(board_df) + 1))
        
        # Add notes for predictions
        is_prediction = (board_df.get("source", "historical") == "pred") | (board_df["season"] == season_tag)
        board_df["Notes"] = np.where(is_prediction, f"NEW {season_tag} prediction", "")
        
        boards[bucket_name] = board_df

    # Generate "closest miss" lists for predictions not in top 10 of each bucket
    closest = _generate_closest_predictions(combined, boards, metric, prediction_year)

    return boards, closest

def _generate_closest_predictions(combined_df: pd.DataFrame, boards: Dict[str, pd.DataFrame], 
                                  metric: str, prediction_year: int, k: int = 10) -> Dict[str, pd.DataFrame]:
    """Generate lists of predictions that were closest to making each leaderboard."""
    season_tag = _season_str(prediction_year)
    predictions_only = combined_df.loc[combined_df["source"] == "pred"].copy()
    
    closest = {}
    
    for bucket_name, board_df in boards.items():
        # Get predictions that didn't make this board
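        # Vectorized (player_name, season) membership test; also safe on empty frames
        board_players = pd.MultiIndex.from_frame(board_df[["player_name", "season"]])
        pred_keys = pd.MultiIndex.from_frame(predictions_only[["player_name", "season"]])
        missed_preds = predictions_only.loc[~pred_keys.isin(board_players)].copy()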
        
        if bucket_name == "top":
            # For top board: predictions with highest metric values that didn't make it
            cutoff_value = board_df[metric].min() if not board_df.empty else float('-inf')
            missed_preds["gap_to_cutoff"] = cutoff_value - missed_preds[metric] 
            closest_missed = (
                missed_preds.sort_values(["gap_to_cutoff", metric], ascending=[True, False])
                           .head(k).copy()
            )
            
        elif bucket_name == "bottom":
            # For bottom board: predictions with lowest metric values that didn't make it  
            cutoff_value = board_df[metric].max() if not board_df.empty else float('inf')
            missed_preds["gap_to_cutoff"] = missed_preds[metric] - cutoff_value
            closest_missed = (
                missed_preds.sort_values(["gap_to_cutoff", metric], ascending=[True, True])
                           .head(k).copy()
            )
            
        else:  # middle
            # For middle board: predictions closest to median that didn't make it
            median_val = combined_df[metric].median(skipna=True)
            missed_preds["gap_to_median"] = (missed_preds[metric] - median_val).abs()
            closest_missed = (
                missed_preds.sort_values(["gap_to_median", metric], ascending=[True, False])
                           .head(k).copy()
            )

        # Add helpful metadata
        closest_missed["Notes"] = f"Closest {season_tag} prediction to {bucket_name}"
        closest[f"closest_{bucket_name}"] = closest_missed.reset_index(drop=True)
    
    return closest

# ---------- Main Public API ----------

def build_leaderboards_with_predictions(metrics: Tuple[str, ...] = ("game_score_per36", "season_pie"),
                                        prediction_year: int = 2025,
                                        save: bool = True, 
                                        verbose: bool = True) -> Dict[str, Dict[str, pd.DataFrame]]:
    """
    Build comprehensive leaderboards combining historical data with predictions.
    
    Creates top/middle/bottom-10 leaderboards for each metric, includes the predictions
    for `prediction_year`, adds notes for newcomers, and generates 'closest miss' lists.
    
    Args:
        metrics: Tuple of metric names to process
        prediction_year: Year of predictions to include
        save: Whether to save results to CSV files
        verbose: Whether to print progress information
        
    Returns:
        Dictionary of results: {metric: {"boards": {...}, "closest": {...}}}
    """
    if verbose:
        print(f"\n🏆 BUILDING COMPREHENSIVE LEADERBOARDS")
        print("=" * 45)
        print(f"📊 Metrics: {metrics}")
        print(f"📅 Prediction year: {prediction_year}")
    
    results = {}
    output_dir = CFG.ml_predictions_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    for metric in metrics:
        if verbose:
            print(f"\n📈 Processing metric: {metric}")
            print("-" * 30)
        
        try:
            # Load historical data and predictions
            hist_df = _load_hist_minimal(metric, verbose=verbose)
            pred_df = _load_predictions(metric, prediction_year, verbose=verbose)
            
            if verbose:
                print(f"✅ Historical data: {len(hist_df):,} player-seasons")
                print(f"✅ Predictions: {len(pred_df):,} players for {prediction_year}")

            # Build combined leaderboards
            boards, closest = _build_new_boards(hist_df, pred_df, metric, prediction_year)
            results[metric] = {"boards": boards, "closest": closest}

            if save:
                # Save main leaderboards
                for bucket_name, board_df in boards.items():
                    board_path = output_dir / f"{metric}_{bucket_name}_leaderboard_{prediction_year}_with_predictions.csv"
                    board_df.to_csv(board_path, index=False)
                    if verbose:
                        print(f"💾 Saved {bucket_name} leaderboard: {board_path.name}")

                # Save closest miss lists  
                closest_path = output_dir / f"{metric}_closest_misses_{prediction_year}.csv"
                closest_combined = pd.concat(
                    closest.values(), 
                    keys=list(closest.keys())
                ).reset_index(level=0).rename(columns={"level_0": "category"})
                closest_combined.to_csv(closest_path, index=False)
                if verbose:
                    print(f"💾 Saved closest misses: {closest_path.name}")

            # Print summary for this metric
            if verbose:
                print(f"\n📋 {metric.upper()} SUMMARY:")
                for bucket_name, board_df in boards.items():
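                    # Guard against a missing "source" column and report the actual
                    # board size and prediction year instead of hardcoded values.
                    pred_count = int((board_df["source"] == "pred").sum()) if "source" in board_df.columns else 0
                    print(f"   {bucket_name.title()}: {pred_count}/{len(board_df)} are {prediction_year} predictions")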

        except Exception as e:
            print(f"❌ Error processing {metric}: {str(e)}")
            if verbose:
                import traceback
                traceback.print_exc()
            continue

    if verbose:
        print(f"\n✅ Completed leaderboard generation for {len(results)} metrics")
        
    return results

def create_simple_leaderboards_from_predictions(metrics: Tuple[str, ...] = ("game_score_per36", "season_pie"),
                                                prediction_year: int = 2025,
                                                top_n: int = 50,
                                                verbose: bool = True) -> Dict[str, pd.DataFrame]:
    """
    Create simple CSV leaderboards from prediction parquet files.
    
    This creates the basic leaderboard files that other analysis functions expect,
    with improved column detection and error handling.
    """
    if verbose:
        print(f"\n📊 CREATING SIMPLE LEADERBOARDS")
        print("=" * 35)
    
    CFG.ensure_ml_dirs()
    results = {}
    
    for metric in metrics:
        if verbose:
            print(f"\n🏆 Processing {metric}")
            print("-" * 25)
        
        try:
            # Load predictions 
            pred_path = CFG.predictions_path(metric, prediction_year)
            if not pred_path.exists():
                print(f"❌ Predictions not found: {pred_path}")
                continue
                
            preds_df = pd.read_parquet(pred_path)
            if verbose:
                print(f"✅ Loaded {len(preds_df):,} predictions")

            # Find prediction column using improved detection
            pred_col = _find_prediction_column(preds_df, metric, verbose=verbose)
            
            # Create leaderboard
            leaderboard_cols = ["player_name", pred_col]
            missing_cols = [c for c in leaderboard_cols if c not in preds_df.columns]
            if missing_cols:
                print(f"❌ Missing columns: {missing_cols}")
                continue
            
            # Build leaderboard dataframe
            leaderboard_df = preds_df[leaderboard_cols].copy()
            leaderboard_df = leaderboard_df.dropna(subset=[pred_col])
            
            # Add season if not present
            if "season" not in leaderboard_df.columns:
                leaderboard_df["season"] = _season_str(prediction_year)
            
            # Sort and rank
            leaderboard_df = leaderboard_df.sort_values(
                [pred_col, "player_name"], 
                ascending=[False, True]
            ).head(top_n).reset_index(drop=True)
            
            # Clean up column names and add rank
            leaderboard_df = leaderboard_df.rename(columns={pred_col: metric})
            leaderboard_df.insert(0, "rank", range(1, len(leaderboard_df) + 1))
            leaderboard_df[metric] = leaderboard_df[metric].round(6)
            
            # Save leaderboard
            lb_path = CFG.leaderboard_path(metric, prediction_year)
            leaderboard_df.to_csv(lb_path, index=False)
            
            if verbose:
                print(f"✅ Saved: {lb_path.name}")
                print(f"📊 Top 3: {leaderboard_df.head(3)['player_name'].tolist()}")
            
            results[metric] = leaderboard_df
            
        except Exception as e:
            print(f"❌ Error with {metric}: {str(e)}")
            continue
    
    if verbose:
        print(f"\n✅ Created {len(results)} simple leaderboards")
    
    return results


Overwriting src/heat_data_scientist_2025/ml/leaderboard_compare.py
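
The module written above resolves the prediction column in two stages: an exact match against a list of candidate names, then a fallback to the single numeric column whose name contains `pred`. The sketch below illustrates that lookup on toy frames. The candidate names (`<metric>_pred` and `predicted_<metric>`) are an assumption based on the naming conventions mentioned in the pipeline runner's docstring, and `find_prediction_column` is a hypothetical stand-in, not the module's own `_find_prediction_column`.

```python
import pandas as pd


def find_prediction_column(df: pd.DataFrame, metric: str) -> str:
    """Illustrative two-stage lookup for the column holding predictions."""
    # Stage 1: exact match against assumed naming conventions.
    candidates = [f"{metric}_pred", f"predicted_{metric}"]
    for name in candidates:
        if name in df.columns:
            return name

    # Stage 2: fall back to a numeric column whose name contains "pred",
    # but only when exactly one such column exists.
    numeric_pred = [
        c for c in df.columns
        if "pred" in str(c).lower() and pd.api.types.is_numeric_dtype(df[c])
    ]
    if len(numeric_pred) == 1:
        return numeric_pred[0]

    # Zero or several matches: refuse to guess.
    raise KeyError(
        f"Cannot resolve prediction column for {metric!r}; "
        f"tried {candidates}, numeric 'pred' columns: {numeric_pred}"
    )


# Toy frames (made-up column names and values).
exact = pd.DataFrame({"player_name": ["A"], "predicted_season_pie": [0.12]})
fallback = pd.DataFrame({"player_name": ["A"], "model_pred_value": [0.12]})
ambiguous = pd.DataFrame({"pred_a": [0.1], "pred_b": [0.2]})

print(find_prediction_column(exact, "season_pie"))
print(find_prediction_column(fallback, "season_pie"))
```

Raising on ambiguity rather than picking the first match keeps a renamed or duplicated column from silently feeding the wrong values into a leaderboard.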

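When `game_score_per36` is missing, `_load_hist_minimal` rebuilds it from season totals using the Game Score weights and rescales the result to 36 minutes. A worked example with made-up totals may make the arithmetic easier to check. The dictionary keys and numbers below are illustrative only and are not taken from the dataset.

```python
import math


def game_score_per36(t: dict) -> float:
    """Game Score from season totals, rescaled to a per-36-minute rate."""
    total = (
        t["points"] + 0.4 * t["fgm"] - 0.7 * t["fga"]
        - 0.4 * (t["fta"] - t["ftm"]) + 0.7 * t["reb_off"]
        + 0.3 * t["reb_def"] + t["steals"] + 0.7 * t["assists"]
        + 0.7 * t["blocks"] - 0.4 * t["pf"] - t["tov"]
    )
    # No minutes played means the rate is undefined.
    if t["minutes"] <= 0:
        return math.nan
    return total * 36.0 / t["minutes"]


# Hypothetical season line.
season = {
    "points": 1000, "fgm": 400, "fga": 800, "fta": 200, "ftm": 150,
    "reb_off": 50, "reb_def": 250, "steals": 60, "assists": 200,
    "blocks": 30, "pf": 120, "tov": 100, "minutes": 1800,
}

# Total Game Score is 763 here, so the per-36 rate is 763 * 36 / 1800.
print(f"{game_score_per36(season):.2f}")  # prints 15.26
```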

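The ranking helpers above build three boards (highest values, lowest values, and values closest to the median) and then list the predictions that narrowly missed each board. The following toy sketch walks through the same idea on five made-up rows. It is a simplification under stated assumptions: the column name `metric` is a placeholder, board membership is checked by `player_name` alone rather than by `(player_name, season)`, and `games_played` is omitted from the tie-breakers.

```python
import pandas as pd

# Toy player-seasons; names and values are made up.
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E"],
    "metric": [0.30, 0.30, 0.20, 0.10, 0.25],
    "total_minutes": [1500, 2000, 1800, 900, 1200],
    "source": ["historical", "pred", "pred", "historical", "pred"],
})

keys = ["metric", "total_minutes", "player_name"]

# Top: metric descending, ties broken by minutes (desc), then name (asc).
top = toy.sort_values(keys, ascending=[False, False, True]).head(2)

# Bottom: same keys, but metric ascending.
bottom = toy.sort_values(keys, ascending=[True, False, True]).head(2)

# Middle: rows closest to the median of the metric.
median_val = toy["metric"].median()
middle = (
    toy.assign(dist=(toy["metric"] - median_val).abs())
       .sort_values(["dist"] + keys, ascending=[True, False, False, True])
       .head(1)
       .drop(columns="dist")
)

# Closest misses for the top board: predictions that did not make it,
# ordered by how far they fall below the board's lowest value.
cutoff = top["metric"].min()
missed = toy[(toy["source"] == "pred") & ~toy["player_name"].isin(top["player_name"])]
closest = (
    missed.assign(gap_to_cutoff=cutoff - missed["metric"])
          .sort_values("gap_to_cutoff")
)

print("top:", top["player_name"].tolist())
print("bottom:", bottom["player_name"].tolist())
print("middle:", middle["player_name"].tolist())
print("closest to top:", closest["player_name"].tolist())
```

Players A and B tie on the metric, so B ranks first on minutes. E sits exactly on the median, and E is also the prediction nearest the top-board cutoff.
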
In [1]:
# %%writefile src/heat_data_scientist_2025/run_pie_gs36_prediction_pipeline.py
#!/usr/bin/env python3
"""
Multi-Target ML Pipeline Runner
==============================================

This script now properly handles column naming mismatches and includes
comprehensive error handling and code cleanup.

Key fixes:
1. Fixed column name detection for predictions (_pred vs predicted_)
2. Removed duplicate/unused functions  
3. Improved error handling and debugging
4. More efficient code structure
5. Better progress reporting
"""

import sys
import json
import pandas as pd
from pathlib import Path
from typing import Dict, List

from src.heat_data_scientist_2025.data.load_data_utils import load_data_optimized
from src.heat_data_scientist_2025.utils.config import CFG
from src.heat_data_scientist_2025.data.feature_engineering import engineer_features
from src.heat_data_scientist_2025.ml.enhanced_ml_pipeline import (
    EnhancedMultiTargetMLPipeline,
    create_game_score_per36_feature
)

def main():
    """
    Run the enhanced pipeline end to end, returning None if a required stage fails.
    """
    
    # Import feature configurations
    from src.heat_data_scientist_2025.utils.config import (
        season_pie_numerical_features,
        game_score_per36_numerical_features, 
        nominal_categoricals,
        ordinal_categoricals,
        y_variables
    )
    
    print("🚀 FIXED ENHANCED MULTI-TARGET PIPELINE")
    print("=" * 45)
    print(f"🎯 Targets: {', '.join(y_variables)}")
    print(f"📊 Total Features: {len(season_pie_numerical_features) + len(ordinal_categoricals)}")
    print("=" * 45)
    
    # Load and process data
    print("\n📊 LOADING AND ENGINEERING DATA")
    print("=" * 35)
    
    try:
        df = load_data_optimized(CFG.ml_dataset_path, drop_null_rows=True)
        df_engineered = engineer_features(df, verbose=True)
        print("✅ Feature engineering completed successfully")
        
        # Show top performers for verification
        if "season_pie" in df_engineered.columns:
            top_pie = df_engineered[["player_name", "season_pie"]].sort_values("season_pie", ascending=False).head(5)
            print(f"\n🏆 Top 5 season_pie players:")
            for _, row in top_pie.iterrows():
                print(f"   {row['player_name']}: {row['season_pie']:.6f}")
        
    except Exception as e:
        print(f"❌ Data loading failed: {str(e)}")
        return None
    
    # Add game_score_per36 if needed
    if 'game_score_per36' in y_variables and 'game_score_per36' not in df_engineered.columns:
        print("\n🎯 Computing game_score_per36...")
        try:
            df_engineered, _ = create_game_score_per36_feature(df_engineered)
            print("✅ game_score_per36 feature added")
        except Exception as e:
            print(f"❌ Failed to add game_score_per36: {str(e)}")
            return None
    
    # Run ML pipeline
    print("\n🤖 RUNNING ML PIPELINE")
    print("=" * 25)
    
    try:
        pipeline = EnhancedMultiTargetMLPipeline(
            numerical_features=season_pie_numerical_features,  # Will be overridden per target
            nominal_categoricals=nominal_categoricals,
            ordinal_categoricals=ordinal_categoricals,
            y_variables=y_variables,
            importance_threshold=0.001,
            max_features_per_target=30,
            verbose=True
        )
        
        results = pipeline.run_complete_pipeline(df_engineered)
        print("✅ ML Pipeline completed successfully")
        
    except Exception as e:
        print(f"❌ ML Pipeline failed: {str(e)}")
        return None
    
    # Create simple leaderboards (FIXED)
    print("\n🏆 CREATING SIMPLE LEADERBOARDS")
    print("=" * 35)
    
    try:
        # Import the FIXED leaderboard function
        from src.heat_data_scientist_2025.ml.leaderboard_compare import create_simple_leaderboards_from_predictions
        
        simple_leaderboards = create_simple_leaderboards_from_predictions(
            metrics=("game_score_per36", "season_pie"),
            prediction_year=2025,
            top_n=50,
            verbose=True
        )
        print("✅ Simple leaderboards created successfully")
        
    except Exception as e:
        print(f"❌ Simple leaderboard creation failed: {str(e)}")
        print("⚠️  Continuing without simple leaderboards...")
        simple_leaderboards = {}
    
    # Create comprehensive leaderboards (FIXED)
    print("\n🏆 CREATING COMPREHENSIVE LEADERBOARDS")
    print("=" * 40)
    
    try:
        # Import the FIXED comprehensive leaderboard function  
        from src.heat_data_scientist_2025.ml.leaderboard_compare import build_leaderboards_with_predictions
        
        comprehensive_leaderboards = build_leaderboards_with_predictions(
            metrics=("game_score_per36", "season_pie"),
            prediction_year=2025,
            save=True,
            verbose=True
        )
        print("✅ Comprehensive leaderboards created successfully")
        
    except Exception as e:
        print(f"❌ Comprehensive leaderboard creation failed: {str(e)}")
        # The message is already printed above; show the stack trace for debugging
        import traceback
        traceback.print_exc()
        # Continue without comprehensive leaderboards
        comprehensive_leaderboards = {}
    
    # Save comprehensive results
    try:
        save_comprehensive_results_summary(results, simple_leaderboards, comprehensive_leaderboards)
        print("✅ Results summary saved")
    except Exception as e:
        print(f"⚠️  Failed to save comprehensive summary: {str(e)}")
    
    # Final summary
    print("\n" + "="*60)
    print("🎉 PIPELINE COMPLETED!")  
    print("="*60)
    print("✅ ML models trained and predictions generated")
    print("✅ Feature importance analysis completed")
    if simple_leaderboards:
        print("✅ Simple leaderboards created")
    if comprehensive_leaderboards:
        print("✅ Comprehensive leaderboards with historical comparison")
    print(f"📁 All results saved to: {CFG.ml_predictions_dir}")
    
    return {
        'ml_results': results,
        'simple_leaderboards': simple_leaderboards,
        'comprehensive_leaderboards': comprehensive_leaderboards
    }


def save_comprehensive_results_summary(ml_results: Dict, 
                                       simple_lb: Dict, 
                                       comp_lb: Dict) -> None:
    """Save a comprehensive summary with improved error handling."""
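    try:
        summary_path = CFG.ml_evaluation_dir / "comprehensive_results_summary.json"
        # Make sure the target directory exists before writing the summary
        summary_path.parent.mkdir(parents=True, exist_ok=True)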
        
        # Build summary with defensive programming
        summary = {
            'pipeline_status': 'completed',
            'timestamp': pd.Timestamp.now().isoformat(),
            'pipeline_config': {
                'numerical_features_count': len(ml_results.get('config', {}).get('numerical_features', [])),
                'categorical_features_count': len(
                    ml_results.get('config', {}).get('nominal_categoricals', []) + 
                    ml_results.get('config', {}).get('ordinal_categoricals', [])
                ),
                'target_variables': ml_results.get('config', {}).get('y_variables', []),
                'prediction_year': 2025
            },
            'ml_performance': {},
            'leaderboard_status': {
                'simple_leaderboards': len(simple_lb),
                'comprehensive_leaderboards': len(comp_lb)
            }
        }
        
        # Add ML performance metrics safely
        if 'model_results' in ml_results:
            model_results = ml_results['model_results']
            if 'evaluation_metrics' in model_results:
                summary['ml_performance'] = model_results['evaluation_metrics']
            
            # Add feature importance summary
            if 'importance_scores' in model_results:
                summary['feature_importance'] = {}
                for target, importance_df in model_results['importance_scores'].items():
                    if not importance_df.empty:
                        summary['feature_importance'][target] = {
                            'top_feature': importance_df.iloc[0]['feature'],
                            'top_importance': float(importance_df.iloc[0]['importance_mean']),
                            'significant_features_count': len(importance_df[importance_df['importance_mean'] > 0.001])
                        }
        
        # Add leaderboard summaries
        if simple_lb:
            summary['top_predictions_simple'] = {}
            for metric, lb_df in simple_lb.items():
                if not lb_df.empty:
                    summary['top_predictions_simple'][metric] = {
                        'winner': lb_df.iloc[0]['player_name'],
                        'value': float(lb_df.iloc[0][metric]),
                        'total_predictions': len(lb_df)
                    }
        
        # Save summary
        with open(summary_path, 'w') as f:
            json.dump(summary, f, indent=2, default=str)
        
        print(f"📋 Summary saved: {summary_path}")
        
    except Exception as e:
        print(f"⚠️  Failed to save comprehensive summary: {str(e)}")


def load_and_analyze_saved_results(target_name: str, year: int,
                                   id_cols: tuple = ("personId", "player_name"),  # immutable default
                                   verbose: bool = True) -> None:
    """
    ENHANCED: Utility to load saved predictions/leaderboard with improved error handling.
    
    Now handles:
    - Better column name detection and fallbacks
    - More informative error messages  
    - Graceful handling of missing files
    - Defensive programming for column access
    - Clearer guidance when files are missing
    """
    CFG.ensure_ml_dirs()

    if verbose:
        print(f"\n📊 Analyzing saved results for target='{target_name}', year={year}")
        print("=" * 60)

    # === PREDICTIONS ANALYSIS ===
    pred_path = CFG.predictions_path(target_name, year)
    if not pred_path.exists():
        print(f"❌ Predictions not found at: {pred_path}")
        # Try alternative path formats
        try:
            pred_path_alt = CFG.predictions_path(f"{target_name}_{year}")
            if pred_path_alt.exists():
                pred_path = pred_path_alt
                print(f"✅ Found alternative path: {pred_path}")
        except Exception:
            print("❌ No alternative prediction paths found")
            print("💡 Ensure the ML pipeline has been run to generate predictions")

    if pred_path.exists():
        try:
            preds = pd.read_parquet(pred_path)
            print(f"✅ Loaded predictions: {len(preds):,} rows from {pred_path}")
            
            # Find prediction columns (either the *_pred or the predicted_* convention)
            pred_cols = [c for c in preds.columns
                         if str(c).endswith("_pred") or str(c).startswith("predicted_")]
            if not pred_cols:
                print("⚠️  No *_pred or predicted_* column found.")
                print(f"Available columns: {list(preds.columns)}")
            else:
                for c in pred_cols:
                    col_data = pd.to_numeric(preds[c], errors='coerce')
                    print(f"   {c}: min={col_data.min():.6f}, median={col_data.median():.6f}, max={col_data.max():.6f}")
            
            # Build display columns defensively
            safe_id_cols = [col for col in id_cols if col in preds.columns]
            season_cols = [col for col in ['season_start_year', 'prediction_season', 'season'] if col in preds.columns]
            
            display_cols = safe_id_cols + pred_cols + season_cols
            
            if display_cols:
                print("\n📋 Sample predictions:")
                print(preds.head(5)[display_cols].to_string(index=False))
            else:
                print("⚠️  No suitable columns found for display")
                print(f"Available columns: {list(preds.columns)}")
                
        except Exception as e:
            print(f"❌ Error loading predictions: {str(e)}")
    else:
        print("❌ Could not locate predictions parquet.")

    # === LEADERBOARD ANALYSIS ===
    lb_path = CFG.leaderboard_path(target_name, year)
    if not lb_path.exists():
        print(f"\n❌ Simple leaderboard not found at: {lb_path}")
        print("💡 This suggests create_simple_leaderboards_from_predictions() hasn't been run")
        print("💡 Add this step to your pipeline after generating predictions")
        
        # Try alternative path formats
        try:
            lb_path_alt = CFG.leaderboard_path(f"{target_name}_{year}")
            if lb_path_alt.exists():
                lb_path = lb_path_alt
                print(f"✅ Found alternative leaderboard: {lb_path}")
        except Exception:
            print("❌ No alternative leaderboard paths found")

    if lb_path.exists():
        try:
            lb = pd.read_csv(lb_path)
            print(f"\n✅ Loaded leaderboard (top {len(lb):,}) from {lb_path}")
            print("\n🏆 Top 10 leaderboard:")
            print(lb.head(10).to_string(index=False))
        except Exception as e:
            print(f"❌ Error loading leaderboard: {str(e)}")
    else:
        print("\n❌ Could not locate leaderboard CSV.")
        print("💡 The missing step is creating simple leaderboards from predictions")
        print("💡 This should happen automatically in the updated pipeline")
        
    # === FILE STATUS SUMMARY ===
    print(f"\n📁 FILE STATUS SUMMARY for {target_name}_{year}")
    print("-" * 50)
    pred_status = "✅ EXISTS" if pred_path.exists() else "❌ MISSING"
    lb_status = "✅ EXISTS" if lb_path.exists() else "❌ MISSING"
    print(f"Predictions: {pred_status} - {pred_path}")
    print(f"Leaderboard: {lb_status} - {lb_path}")
    
    if not lb_path.exists():
        print("\n🔧 TO FIX: Add create_simple_leaderboards_from_predictions() to your pipeline")
        print("   This function should run after predictions are generated but before analysis")


def debug_predictions_and_leaderboards(verbose: bool = True) -> None:
    """
    DEBUG UTILITY: Diagnose prediction files and leaderboard creation issues.
    Call this if you encounter errors.
    """
    print("\n🔍 DEBUGGING PREDICTIONS AND LEADERBOARDS")
    print("=" * 50)
    
    for target in ["season_pie", "game_score_per36"]:
        print(f"\n📊 Checking {target}:")
        print("-" * 30)
        
        # Check if prediction file exists
        pred_path = CFG.predictions_path(target, 2025)
        if pred_path.exists():
            print(f"✅ Predictions exist: {pred_path}")
            try:
                df = pd.read_parquet(pred_path)
                print(f"   📋 Columns: {list(df.columns)}")
                print(f"   📊 Shape: {df.shape}")
                
                # Check for prediction columns
                pred_cols = [c for c in df.columns if 'pred' in c.lower()]
                print(f"   🎯 Prediction columns: {pred_cols}")
                
                if pred_cols:
                    pred_col = pred_cols[0]
                    pred_values = pd.to_numeric(df[pred_col], errors='coerce')
                    print(f"   📈 {pred_col} range: {pred_values.min():.6f} to {pred_values.max():.6f}")
                    print(f"   👤 Top player: {df.loc[pred_values.idxmax(), 'player_name']}")
                
            except Exception as e:
                print(f"   ❌ Error reading predictions: {str(e)}")
        else:
            print(f"❌ Predictions missing: {pred_path}")
        
        # Check leaderboard  
        lb_path = CFG.leaderboard_path(target, 2025)
        if lb_path.exists():
            print(f"✅ Leaderboard exists: {lb_path}")
        else:
            print(f"❌ Leaderboard missing: {lb_path}")


if __name__ == "__main__":
    # Run the main pipeline
    print("🚀 Starting FIXED Enhanced Multi-Target Pipeline...")
    
    try:
        results = main()
        
        if results is None:
            print("\n❌ Pipeline failed - running diagnostics...")
            debug_predictions_and_leaderboards()
            sys.exit(1)
        else:
            print("\n✅ Pipeline completed successfully!")
            
            # Optional: Run diagnostics for verification
            print("\n🔍 Running verification diagnostics...")
            debug_predictions_and_leaderboards()
            
    except KeyboardInterrupt:
        print("\n⏹️  Pipeline interrupted by user")
        sys.exit(1)
    except Exception as e:
        print(f"\n💥 Unexpected error: {str(e)}")
        print("\n🔍 Running diagnostics...")
        debug_predictions_and_leaderboards()
        sys.exit(1)


🚀 Starting FIXED Enhanced Multi-Target Pipeline...
🚀 FIXED ENHANCED MULTI-TARGET PIPELINE
🎯 Targets: season_pie, game_score_per36
📊 Total Features: 40

📊 LOADING AND ENGINEERING DATA
Loading data for enhanced comprehensive EDA...
→ Applying null dropping: how='any', all columns
✓ Dropped 0 rows by null criteria (how='any', subset=None); remaining 5,575 rows
✓ Dataset loaded: 5,575 rows × 55 columns in 0.04s
Starting feature engineering...
Parsing seasons...
Adding experience features...
Adding advanced metrics...
Adding usage features...
Adding minutes features...
Adding performance consistency...
Creating composite features...
Building portability index...
Creating lag features...
✓ Lag nulls confirmed as first seasons only (1152 rows)
Dropped 1152 first-season rows with null lags
Feature engineering complete: 5575 → 4423 rows, 55 → 174 columns
✅ Feature engineering completed successfully

🏆 Top 5 season_pie players:
   LeBron James: 0.174857
   LeBron James: 0.174495
   Russell Westb