---

# Question 3

**Project 3:**
Please download the NBA Dataset available for free here. Using only the data provided in that dataset, rate the 10 best, 10 most average, and 10 worst individual player seasons. Only include the regular season, use players with at least 500 minutes played in that specific season, and only go back as far as the 2010-11 season. You may define and calculate these metrics in any way you wish.

List the results of your code for all 3 rankings. Only ranks, player names, and seasons are needed for each set. *This question is required.*

---

## Answer

This analysis rates individual NBA player seasons from 2010-11 through 2024-25 using two metrics that capture all-around impact and per-minute efficiency. After also computing PER, VORP, and EWA, I selected **PIE (Player Impact Estimate)** and **Game Score per 36 minutes** as the pairing: they surface complementary routes to excellence, and PIE was the most predictable target in the forecasting models (R² ≈ 0.842, versus 0.728 to 0.739 for the other four).

**Key Findings:**

* **All-around dominance (PIE):** LeBron James dominates the early 2010s with four of the top ten seasons; Russell Westbrook’s 2016–17 and Nikola Jokić’s 2021–22 and 2024–25 seasons headline later years.
* **Per-minute excellence (GS/36):** Modern bigs (Embiid, Jokić, Giannis) hold nine of the top ten seasons, all from 2019–20 onward.
* **Predictive models:** Final R² ≈ **0.842** for PIE and **0.733** for Game Score/36 (details in Appendix).
* **2025–26 projections:** Leaderboards for PIE, GS/36, PER, VORP, and EWA included in Appendix.

---

## Results

### PIE Rankings – All-Around Impact Leaders

#### Top 10 Seasons

| Rank | Player            | Season  |
| ---- | ----------------- | ------- |
| 1    | LeBron James      | 2012-13 |
| 2    | LeBron James      | 2011-12 |
| 3    | Russell Westbrook | 2016-17 |
| 4    | Kevin Durant      | 2013-14 |
| 5    | Nikola Jokic      | 2021-22 |
| 6    | Nikola Jokic      | 2024-25 |
| 7    | Kevin Durant      | 2012-13 |
| 8    | Joel Embiid       | 2023-24 |
| 9    | LeBron James      | 2010-11 |
| 10   | LeBron James      | 2013-14 |

#### Middle 10 Seasons (Around Median)

| Rank | Player           | Season  |
| ---- | ---------------- | ------- |
| 1    | Patrick Beverley | 2018-19 |
| 2    | Bub Carrington   | 2024-25 |
| 3    | Omer Asik        | 2013-14 |
| 4    | J.J. Barea       | 2010-11 |
| 5    | Carmelo Anthony  | 2020-21 |
| 6    | Earl Clark       | 2012-13 |
| 7    | Cory Joseph      | 2015-16 |
| 8    | Jaxson Hayes     | 2021-22 |
| 9    | Quentin Grimes   | 2022-23 |
| 10   | Steve Novak      | 2011-12 |

#### Bottom 10 Seasons

| Rank | Player            | Season  |
| ---- | ----------------- | ------- |
| 1    | Ronnie Price      | 2010-11 |
| 2    | Gary Harris       | 2014-15 |
| 3    | Terrance Ferguson | 2019-20 |
| 4    | Kevin Seraphin    | 2010-11 |
| 5    | Ryan Hollins      | 2011-12 |
| 6    | Jason Collins     | 2010-11 |
| 7    | Doron Lamb        | 2012-13 |
| 8    | Rashad Vaughn     | 2015-16 |
| 9    | Terrance Ferguson | 2017-18 |
| 10   | John Lucas III    | 2013-14 |

---

### Game Score per 36 Rankings – Per-Minute Efficiency Leaders

#### Top 10 Seasons

| Rank | Player                  | Season  |
| ---- | ----------------------- | ------- |
| 1    | Joel Embiid             | 2023-24 |
| 2    | Nikola Jokic            | 2024-25 |
| 3    | Giannis Antetokounmpo   | 2019-20 |
| 4    | Nikola Jokic            | 2021-22 |
| 5    | Shai Gilgeous-Alexander | 2024-25 |
| 6    | Giannis Antetokounmpo   | 2021-22 |
| 7    | Joel Embiid             | 2022-23 |
| 8    | Nikola Jokic            | 2022-23 |
| 9    | Giannis Antetokounmpo   | 2023-24 |
| 10   | Nikola Jokic            | 2023-24 |

#### Middle 10 Seasons (Around Median)

| Rank | Player                  | Season  |
| ---- | ----------------------- | ------- |
| 1    | Jabari Walker           | 2023-24 |
| 2    | Roy Hibbert             | 2014-15 |
| 3    | Max Strus               | 2024-25 |
| 4    | Sam Hauser              | 2023-24 |
| 5    | Julian Champagnie       | 2024-25 |
| 6    | Justise Winslow         | 2021-22 |
| 7    | Naji Marshall           | 2023-24 |
| 8    | Shabazz Napier          | 2017-18 |
| 9    | Terance Mann            | 2021-22 |
| 10   | Michael Carter-Williams | 2014-15 |

#### Bottom 10 Seasons

| Rank | Player            | Season  |
| ---- | ----------------- | ------- |
| 1    | DeShawn Stevenson | 2011-12 |
| 2    | Terrance Ferguson | 2019-20 |
| 3    | Stephen Graham    | 2010-11 |
| 4    | Mike Miller       | 2014-15 |
| 5    | Shawne Williams   | 2011-12 |
| 6    | Rashad Vaughn     | 2015-16 |
| 7    | Semi Ojeleye      | 2017-18 |
| 8    | Jason Collins     | 2010-11 |
| 9    | Doron Lamb        | 2012-13 |
| 10   | Anthony Brown     | 2015-16 |

---

# Question 4

**Project 4:**
Describe how you calculated these rankings and why you chose that approach. *This question is required.*

---

## Answer 4

### Methodology & Rationale

**Why These Two Metrics?**

1. **PIE (Player Impact Estimate)** — all-around dominance

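   * *Purpose:* Measures a player’s share of all box-score events in the games he played.
   * *Formula:* (PTS + FGM + FTM − FGA − FTA + DREB + 0.5×OREB + AST + STL + 0.5×BLK − PF − TOV) ÷ (the same expression summed over both teams).
   * *Aggregation:* Game-level PIE sums to \~1 across all players in a game; season PIE is the season-total numerator divided by the season-total denominator (the "PIE Num" and "PIE Den" columns in the detailed tables), which is equivalent to weighting each game's PIE by its game total.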
   * *Strengths:* Captures high-usage, ball-dominant impact across categories.

2. **Game Score per 36 Minutes** — per-minute efficiency

   * *Purpose:* Hollinger’s comprehensive single-game metric scaled to per-36 minutes.
   * *Formula:* Points + 0.4×FGM − 0.7×FGA − 0.4×(FTA−FTM) + 0.7×OREB + 0.3×DREB + STL + 0.7×AST + 0.7×BLK − 0.4×PF − TOV.
   * *Strengths:* Rewards production and penalizes inefficiency; the per-36 scaling normalizes for playing time.
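
As a rough illustration of the two formulas above, the sketch below evaluates both on box-score lines taken from the detailed tables in this report (LeBron James 2012-13 season totals and Joel Embiid 2023-24 per-36 rates). The `BoxLine` container and the function names are hypothetical and are not taken from the project code. It also assumes GS/36 can be obtained by applying Game Score to per-36 rates, which holds because the formula is linear in every stat.

```python
from dataclasses import dataclass


@dataclass
class BoxLine:
    """Box-score counts for one player (a game, a season, or per-36 rates)."""
    pts: float
    fgm: float
    fga: float
    ftm: float
    fta: float
    oreb: float
    dreb: float
    ast: float
    stl: float
    blk: float
    pf: float
    tov: float


def pie_numerator(b: BoxLine) -> float:
    # Standard PIE weighting: half credit for offensive rebounds and blocks.
    return (b.pts + b.fgm + b.ftm - b.fga - b.fta + b.dreb + 0.5 * b.oreb
            + b.ast + b.stl + 0.5 * b.blk - b.pf - b.tov)


def game_score(b: BoxLine) -> float:
    # Hollinger's Game Score, term by term as written in the formula above.
    return (b.pts + 0.4 * b.fgm - 0.7 * b.fga - 0.4 * (b.fta - b.ftm)
            + 0.7 * b.oreb + 0.3 * b.dreb + b.stl + 0.7 * b.ast
            + 0.7 * b.blk - 0.4 * b.pf - b.tov)


# Season totals from the "Top 10 Seasons by PIE" table.
lebron_2013 = BoxLine(pts=2036, fgm=765, fga=1354, ftm=403, fta=535, oreb=97,
                      dreb=513, ast=551, stl=129, blk=67, pf=110, tov=226)
pie_den = 12890.5  # "PIE Den" column: the same expression over both teams
season_pie = pie_numerator(lebron_2013) / pie_den

# Per-36 rates from the "Top 10 Seasons by Game Score per 36" table.
embiid_per36 = BoxLine(pts=38.432, fgm=13.011, fga=24.253, ftm=10.800,
                       fta=12.284, oreb=2.526, dreb=9.537, ast=6.221,
                       stl=1.232, blk=1.863, pf=3.032, tov=4.105)
gs36 = game_score(embiid_per36)

print(round(season_pie, 4), round(gs36, 1))
```

Both results agree with the tabulated values (PIE 0.174857 and GS/36 32.2674) up to the rounding of the published per-36 rates.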

**Why not PER, VORP, or EWA?**
All five were computed, but PIE + GS/36 offered the clearest *complementarity*:

* **PER** overlaps substantially with PIE (both reflect all-around efficiency).
* **VORP**/**EWA** are valuable but track closely with PIE/playing time; they added less differentiation.
* **GS/36** exposes a distinct dimension: *extreme per-minute output* (often modern bigs).

**Data Filtering & Quality Control**

* **Seasons:** 2010-11 through 2024-25
* **Minutes:** ≥ 500 minutes in that season
* **Season type:** Regular season only
* **Qualified player-seasons:** 4,423
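
The filters above could be applied to game-level rows roughly as follows. This is a minimal sketch: the row fields (`player`, `season`, `game_type`, `minutes`) are hypothetical stand-ins for the dataset's actual column names, and it assumes `season` stores the starting year (2010 for 2010-11).

```python
from collections import defaultdict

MIN_MINUTES = 500
FIRST_SEASON = 2010  # assumed encoding: 2010 means the 2010-11 season


def qualified_player_seasons(game_rows):
    """Sum regular-season minutes per (player, season); keep totals >= 500."""
    minutes = defaultdict(float)
    for row in game_rows:
        if row["game_type"] != "Regular Season":
            continue  # playoffs and other game types are excluded
        if row["season"] < FIRST_SEASON:
            continue  # seasons before 2010-11 are out of scope
        minutes[(row["player"], row["season"])] += row["minutes"]
    return {key: total for key, total in minutes.items() if total >= MIN_MINUTES}


# Made-up rows, for illustration only.
sample_rows = [
    {"player": "A", "season": 2012, "game_type": "Regular Season", "minutes": 300.0},
    {"player": "A", "season": 2012, "game_type": "Regular Season", "minutes": 250.0},
    {"player": "A", "season": 2012, "game_type": "Playoffs", "minutes": 400.0},
    {"player": "B", "season": 2012, "game_type": "Regular Season", "minutes": 499.0},
    {"player": "C", "season": 2009, "game_type": "Regular Season", "minutes": 900.0},
]
qualified = qualified_player_seasons(sample_rows)
```

Only player A's 2012 season survives: playoff minutes are ignored, player B falls one minute short, and player C's season predates the window.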

**Tiebreakers (in order):**

1. Primary metric (PIE or GS/36)
2. **season\_pie** (DESC)
3. **total\_minutes** (DESC)
4. **ts\_pct** (DESC)
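
One way to express this tiebreaker chain is a single sort on a tuple key. The sketch below covers only the top-N ranking, and it assumes each player-season is a dict carrying the columns named above plus the primary metric; the `rank_top` helper and the sample rows are hypothetical.

```python
def rank_top(rows, primary, n=10):
    """Sort descending by the primary metric, then by each tiebreaker in order."""
    def key(r):
        return (r[primary], r["season_pie"], r["total_minutes"], r["ts_pct"])
    return sorted(rows, key=key, reverse=True)[:n]


# Made-up rows: A, B, and C tie on the primary metric.
rows = [
    {"player": "A", "game_score_per36": 20.0, "season_pie": 0.10, "total_minutes": 1500, "ts_pct": 0.55},
    {"player": "B", "game_score_per36": 20.0, "season_pie": 0.12, "total_minutes": 1200, "ts_pct": 0.50},
    {"player": "C", "game_score_per36": 20.0, "season_pie": 0.12, "total_minutes": 1800, "ts_pct": 0.48},
    {"player": "D", "game_score_per36": 25.0, "season_pie": 0.09, "total_minutes": 900, "ts_pct": 0.60},
]
order = [r["player"] for r in rank_top(rows, "game_score_per36")]
```

D ranks first on the primary metric; C and B beat A on `season_pie`, and C beats B on `total_minutes`.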

**Defining “Most Average”:**
Chosen as the 10 player-seasons *closest to the median* of each distribution (not the mean) to counter right-skew from superstar outliers.
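
A minimal sketch of that selection, assuming each player-season is a dict and that "closest" means smallest absolute distance to the median; the `closest_to_median` helper and the sample values are hypothetical.

```python
from statistics import median


def closest_to_median(rows, metric, n=10):
    """Return the n rows whose metric lies nearest the median of all rows."""
    mid = median(r[metric] for r in rows)
    return sorted(rows, key=lambda r: abs(r[metric] - mid))[:n]


# Made-up PIE% values with one superstar outlier.
rows = [
    {"player": "A", "pie_pct": 1.0},
    {"player": "B", "pie_pct": 3.0},
    {"player": "C", "pie_pct": 4.0},
    {"player": "D", "pie_pct": 5.0},
    {"player": "E", "pie_pct": 17.0},
]
middle = [r["player"] for r in closest_to_median(rows, "pie_pct", n=3)]
```

Here the median is 4.0 while the mean is 6.0, so anchoring on the mean would pull the "average" group toward the outlier.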

---

## Appendix: Technical Implementation

### Predictive Analytics – 2025–26 Season Forecasts

#### Machine Learning Pipeline

A multi-target pipeline forecasts **PIE**, **Game Score/36**, **PER**, **VORP**, and **EWA** from lagged and engineered features (usage, efficiency, minutes, consistency, composites, portability). Features whose importance falls below 0.001 are dropped to limit overfitting, and a separate **Random Forest Regressor** is fit for each target.
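
The two preprocessing ideas in this paragraph, lag-1 features and importance-based filtering, can be sketched in plain Python. This is an illustration under assumptions rather than the pipeline's actual code: the helper names are hypothetical, rows without a prior season are assumed to be dropped, features at or above the threshold are assumed to be kept, and the importance values are made up.

```python
def add_lag1(rows, columns):
    """Attach <col>_lag1 from the same player's previous season.

    Rows with no prior season in the data are dropped (assumed behavior).
    """
    by_key = {(r["personId"], r["season"]): r for r in rows}
    out = []
    for r in rows:
        prev = by_key.get((r["personId"], r["season"] - 1))
        if prev is None:
            continue
        lagged = dict(r)
        for col in columns:
            lagged[f"{col}_lag1"] = prev[col]
        out.append(lagged)
    return out


def select_features(importances, threshold=0.001):
    """Keep feature names whose importance meets the threshold."""
    return [name for name, imp in importances.items() if imp >= threshold]


# Made-up player-seasons: player 1 has three consecutive seasons, player 2 has one.
rows = [
    {"personId": 1, "season": 2022, "season_pie": 0.10},
    {"personId": 1, "season": 2023, "season_pie": 0.12},
    {"personId": 1, "season": 2024, "season_pie": 0.11},
    {"personId": 2, "season": 2024, "season_pie": 0.05},
]
lagged = add_lag1(rows, ["season_pie"])

# Illustrative importances, not values from the fitted models.
kept = select_features({"season_pie_lag1": 0.62, "ts_pct_lag1": 0.05, "draftRound": 0.0004})
```

Only player 1's 2023 and 2024 rows gain a lag value, and `draftRound` is filtered out because its importance falls below the threshold.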

**Model Performance (Final Results):**

| Target            | R²    | RMSE  | MAE   |
| ----------------- | ----- | ----- | ----- |
| PIE               | 0.842 | 0.011 | 0.009 |
| Game Score per 36 | 0.733 | 2.06  | 1.61  |
| PER               | 0.739 | 2.19  | 1.71  |
| EWA               | 0.728 | 0.16  | 0.12  |
| VORP              | 0.728 | 0.16  | 0.12  |
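
For reference, the three columns follow the standard definitions sketched below. The report does not show how they were computed, so this is a generic illustration; the `regression_metrics` helper and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Made-up values: three exact predictions and one miss by 1.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

RMSE and MAE are in the units of the target, which is why the PIE row (a share near 0.1) shows much smaller errors than the GS/36 and PER rows.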

**Top Feature Importance by Target (summary):**

* **PIE:** season\_pie\_lag1 (dominant), plus prior PER, plus/minus, fouls/36, team win%.
* **GS/36:** game\_score\_per36\_lag1 (dominant), then BPM, adjusted PER (aPER), production consistency.
* **PER:** season\_PER\_lag1, aPER, with usage/rebounding supporting.
* **EWA & VORP:** strongly tied to lagged PIE/VORP/EWA and efficiency composites.

---

### 2025–26 Projections (Leaderboards)

#### PIE — Projected (Top / Median / Bottom)

| **Top**                             |   | **Median**                  |   | **Bottom**                     |
| ----------------------------------- | - | --------------------------- | - | ------------------------------ |
| 1. Joel Embiid — 0.1510             |   | 21. LaMelo Ball — 0.0994    |   | 41. LeBron James — 0.0898      |
| 2. Shai Gilgeous-Alexander — 0.1465 |   | 22. Jarrett Allen — 0.0993  |   | 42. Chet Holmgren — 0.0879     |
| 3. Nikola Jokic — 0.1457            |   | 23. Bam Adebayo — 0.0980    |   | 43. Jimmy Butler — 0.0878      |
| 4. Luka Doncic — 0.1390             |   | 24. Jamal Murray — 0.0978   |   | 44. Mikal Bridges — 0.0878     |
| 5. Giannis Antetokounmpo — 0.1387   |   | 25. De'Aaron Fox — 0.0976   |   | 45. Nikola Vucevic — 0.0876    |
| 6. Jalen Brunson — 0.1281           |   | 26. Julius Randle — 0.0976  |   | 46. Pascal Siakam — 0.0875     |
| 7. Jayson Tatum — 0.1238            |   | 27. Damian Lillard — 0.0965 |   | 47. Jalen Duren — 0.0872       |
| 8. Domantas Sabonis — 0.1236        |   | 28. Trae Young — 0.0961     |   | 48. Evan Mobley — 0.0869       |
| 9. Kyrie Irving — 0.1151            |   | 29. James Harden — 0.0961   |   | 49. Jaren Jackson Jr. — 0.0868 |
| 10. Anthony Edwards — 0.1144        |   | 30. Alperen Sengun — 0.0955 |   | 50. Tyler Herro — 0.0857       |

#### Game Score per 36 — Projected (Top / Median / Bottom)

| **Top**                              |   | **Median**                      |   | **Bottom**                       |
| ------------------------------------ | - | ------------------------------- | - | -------------------------------- |
| 1. Nikola Jokic — 27.7971            |   | 21. Damian Lillard — 20.2389    |   | 41. Trae Young — 18.1972         |
| 2. Giannis Antetokounmpo — 27.3828   |   | 22. Tyrese Maxey — 20.1450      |   | 42. James Harden — 18.1172       |
| 3. Luka Doncic — 26.7261             |   | 23. Kevin Durant — 20.0465      |   | 43. Collin Sexton — 18.0140      |
| 4. Joel Embiid — 26.7252             |   | 24. Kyrie Irving — 20.0203      |   | 44. Karl-Anthony Towns — 17.9754 |
| 5. Shai Gilgeous-Alexander — 25.6449 |   | 25. Tyrese Haliburton — 19.9375 |   | 45. Paolo Banchero — 17.9333     |
| 6. Victor Wembanyama — 23.0008       |   | 26. Devin Booker — 19.8419      |   | 46. Cade Cunningham — 17.7493    |
| 7. LeBron James — 22.8802            |   | 27. Donovan Mitchell — 19.7953  |   | 47. DeMar DeRozan — 17.7321      |
| 8. Anthony Davis — 22.7452           |   | 28. Pascal Siakam — 19.7534     |   | 48. Tyus Jones — 17.6267         |
| 9. Ja Morant — 22.3199               |   | 29. T.J. McConnell — 19.7319    |   | 49. RJ Barrett — 17.5536         |
| 10. Kawhi Leonard — 21.7732          |   | 30. Alperen Sengun — 19.3413    |   | 50. Mark Williams — 17.5137      |

#### PER — Projected (Top / Median / Bottom)

| **Top**                              |   | **Median**                         |   | **Bottom**                       |
| ------------------------------------ | - | ---------------------------------- | - | -------------------------------- |
| 1. Nikola Jokic — 30.5859            |   | 21. Tyrese Haliburton — 21.1995    |   | 41. Tyrese Maxey — 19.5191       |
| 2. Giannis Antetokounmpo — 29.7691   |   | 22. Donovan Mitchell — 21.1801     |   | 42. Paolo Banchero — 19.4764     |
| 3. Shai Gilgeous-Alexander — 29.6761 |   | 23. Clint Capela — 21.1135         |   | 43. Dereck Lively II — 19.4000   |
| 4. Joel Embiid — 28.2572             |   | 24. Kevin Durant — 21.0905         |   | 44. Nic Claxton — 19.3747        |
| 5. Luka Doncic — 27.6348             |   | 25. Jarrett Allen — 20.5474        |   | 45. Anthony Edwards — 19.3656    |
| 6. Victor Wembanyama — 23.7901       |   | 26. Jimmy Butler — 20.4167         |   | 46. Deandre Ayton — 19.2686      |
| 7. Anthony Davis — 23.4941           |   | 27. Trayce Jackson-Davis — 20.3812 |   | 47. Pascal Siakam — 19.2615      |
| 8. Kawhi Leonard — 23.4735           |   | 28. Jamal Murray — 20.2887         |   | 48. Jakob Poeltl — 19.2434       |
| 9. Ja Morant — 23.0264               |   | 29. LaMelo Ball — 20.2292          |   | 49. Chet Holmgren — 19.2422      |
| 10. Andre Drummond — 22.8885         |   | 30. Rudy Gobert — 20.1405          |   | 50. Isaiah Hartenstein — 19.1179 |

#### VORP — Projected (Top / Median / Bottom)

| **Top**                             |   | **Median**                      |   | **Bottom**                      |
| ----------------------------------- | - | ------------------------------- | - | ------------------------------- |
| 1. Nikola Jokic — 1.5278            |   | 21. Tyrese Maxey — 0.9200       |   | 41. Kyle Kuzma — 0.7636         |
| 2. Giannis Antetokounmpo — 1.3621   |   | 22. Victor Wembanyama — 0.9129  |   | 42. Karl-Anthony Towns — 0.7411 |
| 3. Shai Gilgeous-Alexander — 1.3093 |   | 23. Jalen Brunson — 0.8797      |   | 43. Jimmy Butler — 0.7368       |
| 4. Luka Doncic — 1.2767             |   | 24. James Harden — 0.8686       |   | 44. Paul George — 0.7294        |
| 5. Joel Embiid — 1.1523             |   | 25. LaMelo Ball — 0.8613        |   | 45. Collin Sexton — 0.7281      |
| 6. Kyrie Irving — 1.0590            |   | 26. Kawhi Leonard — 0.8500      |   | 46. Bam Adebayo — 0.7276        |
| 7. Ja Morant — 1.0477               |   | 27. Tyrese Haliburton — 0.8429  |   | 47. Rudy Gobert — 0.7228        |
| 8. De'Aaron Fox — 1.0468            |   | 28. Kristaps Porzingis — 0.8361 |   | 48. Julius Randle — 0.7224      |
| 9. Jayson Tatum — 1.0449            |   | 29. Zion Williamson — 0.8351    |   | 49. Nikola Vucevic — 0.7141     |
| 10. Domantas Sabonis — 1.0415       |   | 30. Stephen Curry — 0.8308      |   | 50. Jaylen Brown — 0.7112       |

#### EWA — Projected (Top / Median / Bottom)

| **Top**                             |   | **Median**                      |   | **Bottom**                      |
| ----------------------------------- | - | ------------------------------- | - | ------------------------------- |
| 1. Nikola Jokic — 1.5278            |   | 21. Tyrese Maxey — 0.9200       |   | 41. Kyle Kuzma — 0.7636         |
| 2. Giannis Antetokounmpo — 1.3621   |   | 22. Victor Wembanyama — 0.9129  |   | 42. Karl-Anthony Towns — 0.7411 |
| 3. Shai Gilgeous-Alexander — 1.3093 |   | 23. Jalen Brunson — 0.8797      |   | 43. Jimmy Butler — 0.7368       |
| 4. Luka Doncic — 1.2767             |   | 24. James Harden — 0.8686       |   | 44. Paul George — 0.7294        |
| 5. Joel Embiid — 1.1523             |   | 25. LaMelo Ball — 0.8613        |   | 45. Collin Sexton — 0.7281      |
| 6. Kyrie Irving — 1.0590            |   | 26. Kawhi Leonard — 0.8500      |   | 46. Bam Adebayo — 0.7276        |
| 7. Ja Morant — 1.0477               |   | 27. Tyrese Haliburton — 0.8429  |   | 47. Rudy Gobert — 0.7228        |
| 8. De'Aaron Fox — 1.0468            |   | 28. Kristaps Porzingis — 0.8361 |   | 48. Julius Randle — 0.7224      |
| 9. Jayson Tatum — 1.0449            |   | 29. Zion Williamson — 0.8351    |   | 49. Nikola Vucevic — 0.7141     |
| 10. Domantas Sabonis — 1.0415       |   | 30. Stephen Curry — 0.8308      |   | 50. Jaylen Brown — 0.7112       |

---

### Future Production Enhancements

* **Infrastructure & Orchestration:** Airflow (automated retraining), Kafka (streaming), MLflow (tracking), FastAPI (serving).
* **Advanced Analytics & Features:** Integrate PER/VORP/EWA, SHAP/LIME, RFE, Bayesian hierarchical pooling.
* **Model Enhancements:** Injury & play-type features, position/team/league-level predictions, uncertainty intervals, real-time updates, ensembles.
* **User Experience:** React/Vite dashboards, predicted vs. historical comparisons, custom metric builders, mobile-friendly UI.
* **Monitoring:** Drift detection, A/B testing, performance dashboards, automated alerts.

---

## Detailed Rankings (With Context)

### PIE

#### Top 10 Seasons by PIE (with context)

| Rank | Player Name       | Season  | PIE      | PIE%  | PIE Num | PIE Den | Points | FGM | FTM | FGA  | FTA | DREB | OREB | AST | STL | BLK | PF  | TOV | Minutes |
| ---- | ----------------- | ------- | -------- | ----- | ------- | ------- | ------ | --- | --- | ---- | --- | ---- | ---- | --- | --- | --- | --- | --- | ------- |
| 1    | LeBron James      | 2012-13 | 0.174857 | 17.49 | 2254.0  | 12890.5 | 2036.0 | 765 | 403 | 1354 | 535 | 513  | 97   | 551 | 129 | 67  | 110 | 226 | 2835.00 |
| 2    | LeBron James      | 2011-12 | 0.174495 | 17.45 | 1683.0  | 9645.0  | 1683.0 | 621 | 387 | 1169 | 502 | 398  | 94   | 387 | 115 | 50  | 96  | 213 | 2297.00 |
| 3    | Russell Westbrook | 2016-17 | 0.171554 | 17.16 | 2466.0  | 14374.5 | 2558.0 | 824 | 710 | 1941 | 840 | 727  | 137  | 840 | 132 | 31  | 190 | 438 | 2761.00 |
| 4    | Kevin Durant      | 2013-14 | 0.169186 | 16.92 | 2339.5  | 13828.0 | 2593.0 | 849 | 703 | 1688 | 805 | 540  | 58   | 445 | 103 | 59  | 174 | 285 | 3087.00 |
| 5    | Nikola Jokic      | 2021-22 | 0.166170 | 16.62 | 2536.5  | 15264.5 | 2004.0 | 764 | 379 | 1311 | 468 | 813  | 206  | 584 | 109 | 63  | 191 | 281 | 2440.00 |
| 6    | Nikola Jokic      | 2024-25 | 0.163343 | 16.33 | 2670.5  | 16349.0 | 2071.0 | 786 | 361 | 1364 | 451 | 692  | 200  | 716 | 127 | 45  | 160 | 230 | 2557.28 |
| 7    | Kevin Durant      | 2012-13 | 0.162572 | 16.26 | 2243.5  | 13800.0 | 2280.0 | 731 | 679 | 1433 | 750 | 594  | 46   | 374 | 116 | 105 | 143 | 280 | 3076.00 |
| 8    | Joel Embiid       | 2023-24 | 0.162201 | 16.22 | 1195.5  | 7370.5  | 1217.0 | 412 | 342 | 768  | 389 | 302  | 80   | 197 | 39  | 59  | 96  | 130 | 1140.00 |
| 9    | LeBron James      | 2010-11 | 0.161252 | 16.13 | 2030.0  | 12589.0 | 2111.0 | 758 | 503 | 1485 | 663 | 510  | 80   | 554 | 124 | 50  | 163 | 284 | 3027.00 |
| 10   | LeBron James      | 2013-14 | 0.158951 | 15.90 | 2075.5  | 13057.5 | 2089.0 | 767 | 439 | 1353 | 585 | 452  | 81   | 488 | 121 | 26  | 126 | 270 | 2863.00 |

#### Middle 10 Seasons by PIE (with context)

| Rank | Player Name      | Season  | PIE      | PIE% | PIE Num | PIE Den | Points | FGM | FTM | FGA | FTA | DREB | OREB | AST | STL | BLK | PF  | TOV | Minutes |
| ---- | ---------------- | ------- | -------- | ---- | ------- | ------- | ------ | --- | --- | --- | --- | ---- | ---- | --- | --- | --- | --- | --- | ------- |
| 1    | Patrick Beverley | 2018-19 | 0.044128 | 4.41 | 674.5   | 15285.0 | 596.0  | 194 | 96  | 477 | 123 | 312  | 76   | 300 | 67  | 43  | 265 | 85  | 2097.00 |
| 2    | Bub Carrington   | 2024-25 | 0.044140 | 4.41 | 751.5   | 17025.5 | 801.0  | 297 | 69  | 743 | 85  | 303  | 33   | 360 | 53  | 20  | 190 | 140 | 2419.13 |
| 3    | Omer Asik        | 2013-14 | 0.044143 | 4.41 | 377.0   | 8540.5  | 280.0  | 101 | 78  | 190 | 126 | 277  | 101  | 25  | 14  | 37  | 92  | 59  | 947.00  |
| 4    | J.J. Barea       | 2010-11 | 0.044121 | 4.41 | 601.0   | 13621.5 | 769.0  | 285 | 133 | 649 | 157 | 130  | 29   | 317 | 30  | 1   | 136 | 136 | 1631.00 |
| 5    | Carmelo Anthony  | 2020-21 | 0.044149 | 4.41 | 699.5   | 15844.0 | 1035.0 | 367 | 155 | 870 | 176 | 221  | 41   | 110 | 53  | 42  | 168 | 69  | 1894.00 |
| 6    | Earl Clark       | 2012-13 | 0.044116 | 4.41 | 430.0   | 9747.0  | 426.0  | 170 | 51  | 386 | 74  | 242  | 82   | 65  | 36  | 44  | 102 | 61  | 1332.00 |
| 7    | Cory Joseph      | 2015-16 | 0.044110 | 4.41 | 588.5   | 13341.5 | 677.0  | 257 | 133 | 585 | 174 | 171  | 39   | 250 | 63  | 20  | 131 | 102 | 2003.00 |
| 8    | Jaxson Hayes     | 2021-22 | 0.044106 | 4.41 | 609.0   | 13807.5 | 654.0  | 245 | 144 | 398 | 188 | 200  | 115  | 43  | 33  | 55  | 155 | 54  | 1363.00 |
| 9    | Quentin Grimes   | 2022-23 | 0.044102 | 4.41 | 627.5   | 14228.5 | 799.0  | 282 | 78  | 602 | 98  | 180  | 49   | 150 | 47  | 26  | 177 | 69  | 2086.00 |
| 10   | Steve Novak      | 2011-12 | 0.044089 | 4.41 | 350.0   | 7938.5  | 477.0  | 161 | 22  | 336 | 26  | 95   | 9    | 12  | 16  | 9   | 59  | 21  | 998.00  |

#### Bottom 10 Seasons by PIE (with context)

| Rank | Player Name       | Season  | PIE      | PIE% | PIE Num | PIE Den | Points | FGM | FTM | FGA | FTA | DREB | OREB | AST | STL | BLK | PF  | TOV | Minutes |
| ---- | ----------------- | ------- | -------- | ---- | ------- | ------- | ------ | --- | --- | --- | --- | ---- | ---- | --- | --- | --- | --- | --- | ------- |
| 1    | Ronnie Price      | 2010-11 | 0.004207 | 0.42 | 40.5    | 9627.5  | 197.0  | 74  | 29  | 210 | 39  | 39   | 22   | 56  | 42  | 5   | 106 | 55  | 689.00  |
| 2    | Gary Harris       | 2014-15 | 0.004483 | 0.45 | 41.0    | 9146.5  | 188.0  | 66  | 35  | 217 | 47  | 43   | 21   | 29  | 39  | 7   | 71  | 38  | 692.00  |
| 3    | Terrance Ferguson | 2019-20 | 0.004869 | 0.49 | 46.5    | 9549.5  | 209.0  | 74  | 15  | 199 | 19  | 50   | 23   | 45  | 26  | 16  | 144 | 30  | 1146.00 |
| 4    | Kevin Seraphin    | 2010-11 | 0.005166 | 0.52 | 49.0    | 9486.0  | 154.0  | 66  | 22  | 147 | 31  | 72   | 80   | 10  | 17  | 28  | 126 | 42  | 604.00  |
| 5    | Ryan Hollins      | 2011-12 | 0.005233 | 0.52 | 31.5    | 6020.0  | 131.0  | 46  | 39  | 84  | 75  | 48   | 34   | 9   | 5   | 17  | 78  | 35  | 505.00  |
| 6    | Jason Collins     | 2010-11 | 0.005796 | 0.58 | 44.5    | 7677.5  | 96.0   | 34  | 27  | 71  | 41  | 72   | 30   | 22  | 9   | 9   | 97  | 26  | 570.00  |
| 7    | Doron Lamb        | 2012-13 | 0.005845 | 0.58 | 46.0    | 7870.0  | 154.0  | 60  | 20  | 163 | 34  | 38   | 8    | 32  | 13  | 0   | 52  | 26  | 560.00  |
| 8    | Rashad Vaughn     | 2015-16 | 0.007004 | 0.70 | 86.5    | 12350.5 | 217.0  | 81  | 12  | 266 | 15  | 77   | 11   | 39  | 29  | 16  | 73  | 28  | 965.00  |
| 9    | Terrance Ferguson | 2017-18 | 0.007394 | 0.74 | 80.5    | 10887.0 | 189.0  | 70  | 9   | 169 | 10  | 28   | 19   | 19  | 24  | 10  | 83  | 11  | 730.00  |
| 10   | John Lucas III    | 2013-14 | 0.007493 | 0.75 | 51.0    | 6806.0  | 159.0  | 62  | 10  | 190 | 16  | 27   | 12   | 42  | 14  | 0   | 41  | 22  | 572.00  |

---

### Game Score

#### Top 10 Seasons by Game Score per 36 (with context)

| Rank | Player Name             | Season  | GS/36   | PTS/36 | FGM/36 | FGA/36 | FTM/36 | FTA/36 | OREB/36 | DREB/36 | STL/36 | AST/36 | BLK/36 | PF/36 | TOV/36 |
| ---- | ----------------------- | ------- | ------- | ------ | ------ | ------ | ------ | ------ | ------- | ------- | ------ | ------ | ------ | ----- | ------ |
| 1    | Joel Embiid             | 2023-24 | 32.2674 | 38.432 | 13.011 | 24.253 | 10.800 | 12.284 | 2.526   | 9.537   | 1.232  | 6.221  | 1.863  | 3.032 | 4.105  |
| 2    | Nikola Jokic            | 2024-25 | 29.6739 | 29.154 | 11.065 | 19.202 | 5.082  | 6.349  | 2.815   | 9.742   | 1.788  | 10.079 | 0.633  | 2.252 | 3.238  |
| 3    | Giannis Antetokounmpo   | 2019-20 | 29.1506 | 35.185 | 12.985 | 23.626 | 7.502  | 11.864 | 2.670   | 13.597  | 1.223  | 6.849  | 1.203  | 3.629 | 4.301  |
| 4    | Nikola Jokic            | 2021-22 | 28.7543 | 29.567 | 11.272 | 19.343 | 5.592  | 6.905  | 3.039   | 11.995  | 1.608  | 8.616  | 0.930  | 2.818 | 4.146  |
| 5    | Shai Gilgeous-Alexander | 2024-25 | 28.3994 | 34.434 | 11.932 | 23.093 | 8.303  | 9.251  | 0.921   | 4.344   | 1.828  | 6.708  | 1.058  | 2.296 | 2.557  |
| 6    | Giannis Antetokounmpo   | 2021-22 | 28.3919 | 33.213 | 11.430 | 20.654 | 9.174  | 12.708 | 2.223   | 10.684  | 1.194  | 6.437  | 1.510  | 3.517 | 3.633  |
| 7    | Joel Embiid             | 2022-23 | 28.3874 | 34.912 | 11.643 | 21.239 | 10.571 | 12.331 | 1.807   | 8.908   | 1.056  | 4.382  | 1.791  | 3.279 | 3.614  |
| 8    | Nikola Jokic            | 2022-23 | 28.2147 | 26.591 | 10.164 | 16.080 | 5.365  | 6.530  | 2.628   | 10.227  | 1.369  | 10.668 | 0.740  | 2.738 | 3.886  |
| 9    | Giannis Antetokounmpo   | 2023-24 | 28.2119 | 31.862 | 12.057 | 19.805 | 7.296  | 11.124 | 2.894   | 9.366   | 1.198  | 6.861  | 1.151  | 2.971 | 3.532  |
| 10   | Nikola Jokic            | 2023-24 | 27.9987 | 27.592 | 10.947 | 18.635 | 4.641  | 5.685  | 2.892   | 9.959   | 1.439  | 9.367  | 0.931  | 2.596 | 3.132  |

#### Middle 10 Seasons by Game Score per 36 (with context)

| Rank | Player Name             | Season  | GS/36   | PTS/36 | FGM/36 | FGA/36 | FTM/36 | FTA/36 | OREB/36 | DREB/36 | STL/36 | AST/36 | BLK/36 | PF/36 | TOV/36 |
| ---- | ----------------------- | ------- | ------- | ------ | ------ | ------ | ------ | ------ | ------- | ------- | ------ | ------ | ------ | ----- | ------ |
| 1    | Jabari Walker           | 2023-24 | 11.6792 | 13.739 | 5.037  | 10.973 | 2.788  | 3.688  | 3.283   | 7.690   | 0.899  | 1.574  | 0.450  | 3.733 | 1.462  |
| 2    | Roy Hibbert             | 2014-15 | 11.6767 | 15.284 | 6.041  | 13.531 | 3.202  | 3.888  | 2.973   | 7.318   | 0.343  | 1.601  | 2.382  | 4.116 | 2.039  |
| 3    | Max Strus               | 2024-25 | 11.6758 | 13.439 | 4.708  | 10.643 | 0.799  | 0.970  | 1.512   | 4.679   | 0.742  | 4.508  | 0.342  | 2.967 | 1.541  |
| 4    | Sam Hauser              | 2023-24 | 11.6745 | 14.772 | 5.166  | 11.645 | 0.362  | 0.385  | 0.816   | 4.916   | 0.884  | 1.699  | 0.566  | 2.152 | 0.657  |
| 5    | Julian Champagnie       | 2024-25 | 11.6744 | 15.330 | 5.181  | 12.497 | 1.649  | 1.824  | 1.261   | 4.754   | 1.164  | 2.018  | 0.660  | 2.154 | 1.397  |
| 6    | Justise Winslow         | 2021-22 | 11.6832 | 13.152 | 5.280  | 12.336 | 1.872  | 3.168  | 2.400   | 7.296   | 1.680  | 4.032  | 1.200  | 3.216 | 2.352  |
| 7    | Naji Marshall           | 2023-24 | 11.6723 | 13.401 | 5.025  | 10.746 | 1.707  | 2.244  | 1.422   | 5.373   | 1.454  | 3.856  | 0.316  | 2.718 | 1.896  |
| 8    | Shabazz Napier          | 2017-18 | 11.6712 | 15.456 | 5.352  | 12.744 | 2.784  | 3.312  | 0.624   | 3.456   | 1.944  | 3.600  | 0.336  | 2.016 | 2.160  |
| 9    | Terance Mann            | 2021-22 | 11.6696 | 13.787 | 5.281  | 10.909 | 2.024  | 2.593  | 1.644   | 5.075   | 0.870  | 3.304  | 0.332  | 2.814 | 1.328  |
| 10   | Michael Carter-Williams | 2014-15 | 11.6677 | 16.435 | 6.193  | 15.635 | 3.437  | 4.951  | 1.089   | 4.917   | 1.888  | 7.520  | 0.510  | 2.841 | 4.304  |

#### Bottom 10 Seasons by Game Score per 36 (with context)

| Rank | Player Name       | Season  | GS/36  | PTS/36 | FGM/36 | FGA/36 | FTM/36 | FTA/36 | OREB/36 | DREB/36 | STL/36 | AST/36 | BLK/36 | PF/36 | TOV/36 |
| ---- | ----------------- | ------- | ------ | ------ | ------ | ------ | ------ | ------ | ------- | ------- | ------ | ------ | ------ | ----- | ------ |
| 1    | DeShawn Stevenson | 2011-12 | 3.2811 | 5.686  | 1.883  | 6.608  | 0.346  | 0.615  | 0.269   | 3.612   | 0.730  | 1.575  | 0.154  | 2.267 | 0.730  |
| 2    | Terrance Ferguson | 2019-20 | 3.4524 | 6.565  | 2.325  | 6.251  | 0.471  | 0.597  | 0.723   | 1.571   | 0.817  | 1.414  | 0.503  | 4.524 | 0.942  |
| 3    | Stephen Graham    | 2010-11 | 3.5149 | 7.540  | 3.093  | 7.695  | 1.199  | 1.469  | 0.657   | 4.099   | 0.541  | 1.547  | 0.039  | 4.099 | 1.469  |
| 4    | Mike Miller       | 2014-15 | 3.6801 | 5.822  | 1.976  | 6.089  | 0.160  | 0.214  | 0.214   | 4.647   | 0.748  | 2.457  | 0.214  | 3.953 | 1.228  |
| 5    | Shawne Williams   | 2011-12 | 3.7728 | 8.136  | 3.024  | 10.584 | 0.576  | 0.792  | 1.440   | 3.456   | 0.720  | 1.152  | 0.792  | 3.168 | 0.936  |
| 6    | Rashad Vaughn     | 2015-16 | 3.8462 | 8.095  | 3.022  | 9.923  | 0.448  | 0.560  | 0.410   | 2.873   | 1.082  | 1.455  | 0.597  | 2.723 | 1.045  |
| 7    | Semi Ojeleye      | 2017-18 | 3.9658 | 6.313  | 2.104  | 6.086  | 0.809  | 1.327  | 1.198   | 4.014   | 0.680  | 0.647  | 0.129  | 2.946 | 0.809  |
| 8    | Jason Collins     | 2010-11 | 3.9663 | 6.063  | 2.147  | 4.484  | 1.705  | 2.589  | 1.895   | 4.547   | 0.568  | 1.389  | 0.568  | 6.126 | 1.642  |
| 9    | Doron Lamb        | 2012-13 | 4.1079 | 9.900  | 3.857  | 10.479 | 1.286  | 2.186  | 0.514   | 2.443   | 0.836  | 2.057  | 0.000  | 3.343 | 1.671  |
| 10   | Anthony Brown     | 2015-16 | 4.1775 | 7.065  | 2.396  | 7.741  | 1.044  | 1.229  | 0.553   | 3.747   | 0.860  | 1.167  | 0.307  | 2.089 | 0.922  |

---

### Alternative Metrics Analysis

Beyond PIE and GS/36, I trained/validated models on **PER**, **VORP**, and **EWA**:

* **PER:** R² = 0.739; strong, but overlaps with PIE in what it rewards.
* **VORP:** R² = 0.728; valuable, but closely aligned with PIE/EWA and time on court.
* **EWA:** R² = 0.728; practical for front offices. Its fit metrics and projected leaderboard are identical to VORP's in these results, so the two targets are effectively interchangeable here.

**Conclusion:** PIE + GS/36 remain the core outputs for complementary perspectives: overall impact vs. per-minute production.

---


In [None]:
import os
# os.getcwd() returns an absolute path, so compare only its final component.
if os.path.basename(os.getcwd()) != "docker_dev_template":
    os.chdir("../../")

In [3]:
%%writefile src/heat_data_scientist_2025/utils/config.py
# src/heat_data_scientist_2025/utils/config.py
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Tuple, List, Dict, Optional
import difflib
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field

class Settings(BaseSettings):
    # Overridable via .env or real environment variables; repo_root defaults to the directory containing src/
    repo_root: Path = Path(__file__).resolve().parents[3]
    heat_data_root: Path = Field(default=Path("data"), alias="HEAT_DATA_ROOT")

    model_config = SettingsConfigDict(
        env_file=".env",          # loads .env if present
        env_prefix="",            # we already use explicit aliases
        case_sensitive=False,     # friendlier on Windows
        extra="ignore",           # ignore unknown envs
    )

    @property
    def data_root(self) -> Path:
        return (self.repo_root / self.heat_data_root
                if not self.heat_data_root.is_absolute()
                else self.heat_data_root)


S = Settings()

_REPO_ROOT = S.repo_root
DATA_ROOT  = S.data_root

RAW_DIR =       DATA_ROOT / "raw" / "heat_data_scientist_2025"
PROCESSED_DIR = DATA_ROOT / "processed" / "heat_data_scientist_2025"
QUALITY_DIR =   DATA_ROOT / "quality" / "quality_reports"
ML_DATASET_PATH = PROCESSED_DIR / "nba_ml_dataset.parquet"
NBA_CATALOG_PATH = PROCESSED_DIR / "nba_data_catalog.md"
SQLITE_PATH = PROCESSED_DIR / "nba.sqlite"
RANKINGS_PATH = PROCESSED_DIR / "nba_rankings_results.txt"

# EDA needs
EDA_OUT_DIR = PROCESSED_DIR / "eda"

# ML Pipeline paths
ML_MODELS_DIR = PROCESSED_DIR / "models"
ML_PREDICTIONS_DIR = PROCESSED_DIR / "predictions"
ML_EVALUATION_DIR = PROCESSED_DIR / "evaluation"

# production yaml
PROJECT_ROOT: Path = Path("src")
COLUMN_SCHEMA_PATH: Path = PROJECT_ROOT / "heat_data_scientist_2025" / "data" / "column_schema.yaml"

# Project Settings:
start_season = 2009  # start one season early so lag-1 features exist for 2010-11 in the ML pipeline
project_start_season = 2010  # rankings only use 2010-11 onward, per the project brief
end_season = 2024
final_top_data_amt = 10
season_type = 'Regular Season'
minutes_total_minimum_per_season = 500

# --- ML Pipeline Configuration ---
class MLPipelineConfig:
    """Configuration for ML Pipeline automation"""
    
    # Target and prediction settings
    TARGET_COLUMN = "season_pie"
    PREDICTION_YEAR = 2025
    SOURCE_YEAR = 2024  # Use 2024 data to predict 2025
    
    # Training settings
    TEST_YEARS = [2023, 2024]  # Hold out for validation
    
    # Model settings
    RANDOM_STATE = 42
    
    # Required lag features for complete cases
    REQUIRED_LAG_FEATURES = [
        "season_pie",
        "pts_per36", "ast_per36", "reb_per36",
        "ts_pct", "efg_pct",
        "usage_events_per_min",
        "games_played", "total_minutes",
        "defensive_per36", "production_per36",
        "win_pct", "team_win_pct_final"
    ]
    
    # Feature engineering settings
    CREATE_LAG_YEARS = [1]  # Create lag1 features
    
    # Model parameters
    MODEL_PARAMS = {
        'random_forest': {
            'n_estimators': 200,
            'max_depth': 12,
            'min_samples_split': 5,
            'min_samples_leaf': 2,
            'random_state': RANDOM_STATE,
            'n_jobs': -1
        },
        'xgboost': {
            'n_estimators': 200,
            'max_depth': 8,
            'learning_rate': 0.05,
            'subsample': 0.8,
            'random_state': RANDOM_STATE,
            'n_jobs': -1
        }
    }
    
    # Evaluation settings
    EVALUATION_METRICS = ['r2', 'rmse', 'mae', 'mape']
    
    # Feature importance settings
    MIN_FEATURE_IMPORTANCE = 0.001
    TOP_FEATURES_COUNT = 20
    
    # Output settings
    SAVE_ENGINEERED_DATA = True
    SAVE_MODELS = True
    SAVE_PREDICTIONS = True
    SAVE_EVALUATION = True
    
    # Automation settings
    AUTO_FEATURE_SELECTION = True
    AUTO_HYPERPARAMETER_TUNING = False  # Set to True for automated tuning
    CROSS_VALIDATION_FOLDS = 5

# Initialize ML config
ML_CONFIG = MLPipelineConfig()

# --- Kaggle dataset handle ---
KAGGLE_DATASET = "eoinamoore/historical-nba-data-and-player-box-scores"

# --- Tables we actually care about right now ---
IMPORTANT_TABLES = ["PlayerStatistics", "TeamStatistics"]

# --- Table -> CSV filename mapping (simple & explicit) ---
KAGGLE_TABLE_TO_CSV = {
    "PlayerStatistics": "PlayerStatistics.csv",
    "TeamStatistics": "TeamStatistics.csv",
}
PARQUET_DIR = Path(DATA_ROOT) / "parquet_cache"

# Canonical ML export list (final parquet column order)
ML_EXPORT_COLUMNS = [
    # IDs & core season
    "personId", "player_name", "season",

    # playing time & outcomes
    "games_played", "total_minutes",
    "win_pct", "home_games_pct", "avg_plus_minus", "total_plus_minus",

    # efficiency & per-36
    "season_pie", "ts_pct", "fg_pct", "fg3_pct", "ft_pct",
    "pts_per36", "ast_per36", "reb_per36",
    "usage_per_min", "efficiency_per_game",

    # raw season totals
    "total_points", "total_assists", "total_rebounds",
    "total_steals", "total_blocks", "total_turnovers",
    "total_fgm", "total_fga", "total_ftm", "total_fta", "total_3pm", "total_3pa",

    # player bio & role
    "height", "bodyWeight", "draftYear", "draftRound", "draftNumber",
    "birthdate", "country", "position",

    # share-of-team season metrics
    "share_pts", "share_ast", "share_reb",
    "share_stl", "share_blk",
    "share_fga", "share_fgm",
    "share_3pa", "share_3pm",
    "share_fta", "share_ftm",
    "share_tov", "share_reb_off", "share_reb_def", "share_pf",
    "season_game_score_total", "game_score_per36",
]

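def _first_existing(candidates: Iterable[Path]) -> Path:
    """Return the first candidate path that exists on disk."""
    # Materialize once so a generator argument can still be listed in the error message
    candidates = list(candidates)
    for p in candidates:
        if p.exists():
            return p
    raise FileNotFoundError(
        "None of the candidate files exist:\n" + "\n".join(str(c) for c in candidates)
    )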


# training features
numerical_features = [
    "season_start_year",

    "season_pie_lag1", "ts_pct_lag1", "efg_pct_lag1", "fg_pct_lag1", "fg3_pct_lag1", "ft_pct_lag1",
    "pts_per36_lag1", "ast_per36_lag1", "reb_per36_lag1", "defensive_per36_lag1",
    "production_per36_lag1", "stocks_per36_lag1", "three_point_rate_lag1", "ft_rate_lag1",
    "pts_per_shot_lag1", "ast_to_tov_lag1", "usage_events_per_min_lag1", "usage_per_min_lag1",
    "games_played_lag1", "total_minutes_lag1", "total_points_lag1", "total_assists_lag1",
    "total_rebounds_lag1", "total_steals_lag1", "total_blocks_lag1", "total_fga_lag1",
    "total_fta_lag1", "total_3pa_lag1", "total_3pm_lag1", "total_tov_lag1", "win_pct_lag1",
    "avg_plus_minus_lag1", "team_win_pct_final_lag1",
    "offensive_impact_lag1", "two_way_impact_lag1", "efficiency_volume_score_lag1",
    "versatility_score_lag1", "shooting_score_lag1",


    "total_reb_off_lag1", "total_reb_def_lag1",
    "total_fgm_lag1", "total_ftm_lag1",
    "total_plus_minus_lag1",
    "wins_lag1", "home_games_lag1",
    "season_game_score_total_lag1", "game_score_per36_lag1",
    "season_uPER_lag1", "season_aPER_lag1", "season_PER_lag1",
    "season_BPM_lag1", "season_max_games_lag1", "season_VORP_lag1", "season_EWA_lag1",
    "ts_attempts_lag1",
    "usage_events_total_lag1", "total_usage_lag1",
    "shot_creation_lag1", "shot_creation_per36_lag1",
    "minutes_per_game_lag1",
    "games_pct_lag1",
    "home_games_pct_lag1",
    "fg_vs_league_lag1", "fg3_vs_league_lag1", "ft_vs_league_lag1",
    "portability_index_lag1", "pi_scoring_eff_lag1", "pi_shooting_lag1",
    "pi_defense_lag1", "pi_versatility_lag1", "pi_passing_lag1", "pi_usage_term_lag1",

    
    "fgm_per36_lag1", "fga_per36_lag1", "ftm_per36_lag1", "fta_per36_lag1",
    "oreb_per36_lag1", "dreb_per36_lag1", "stl_per36_lag1", "blk_per36_lag1",
    "pf_per36_lag1", "tov_per36_lag1",
]

nominal_categoricals = []
ordinal_categoricals = ["minutes_tier"] 
y_variables = ["season_pie", "game_score_per36", "season_PER", "season_EWA", "season_VORP"]



@dataclass(frozen=True)
class Paths:
    repo_root: Path = _REPO_ROOT
    data_root: Path = DATA_ROOT
    raw_dir: Path = RAW_DIR
    processed_dir: Path = PROCESSED_DIR
    quality_dir: Path = QUALITY_DIR
    ml_dataset_path: Path = ML_DATASET_PATH
    nba_catalog_path: Path = NBA_CATALOG_PATH
    sqlite_path: Path = SQLITE_PATH
    rankings_path: Path = RANKINGS_PATH
    eda_out_dir: Path = EDA_OUT_DIR
    column_schema_path: Path = COLUMN_SCHEMA_PATH
    
    # ML Pipeline paths
    ml_models_dir: Path = ML_MODELS_DIR
    ml_predictions_dir: Path = ML_PREDICTIONS_DIR
    ml_evaluation_dir: Path = ML_EVALUATION_DIR

    def ensure_ml_dirs(self) -> None:
        """Create ML pipeline directories if they don't exist."""
        self.ml_models_dir.mkdir(parents=True, exist_ok=True)
        self.ml_predictions_dir.mkdir(parents=True, exist_ok=True)
        self.ml_evaluation_dir.mkdir(parents=True, exist_ok=True)

    def csv(self, name: str) -> Path:
        """Return a path under RAW_DIR for an exact filename, with rich diagnostics on failure."""
        p = (self.raw_dir / name).resolve()
        if p.exists():
            return p

        raw_exists = self.raw_dir.exists()
        all_csvs = []
        if raw_exists:
            try:
                all_csvs = sorted([q.name for q in self.raw_dir.glob("*.csv")])
            except Exception:
                all_csvs = []

        suggestions = difflib.get_close_matches(name, all_csvs, n=5, cutoff=0.5) if all_csvs else []

        lines = [
            f"Expected CSV not found: {p}",
            f"RAW_DIR: {self.raw_dir}  (exists={raw_exists})",
            f"DATA_ROOT (override with HEAT_DATA_ROOT): {self.data_root}",
            f"CSV files found in RAW_DIR ({len(all_csvs)}): {all_csvs[:25]}{' ...' if len(all_csvs) > 25 else ''}",
        ]
        if suggestions:
            lines.append(f"Closest names: {suggestions}")
        lines.append("If your data lives elsewhere, set the environment variable HEAT_DATA_ROOT to that base folder.")
        raise FileNotFoundError("\n".join(lines))

    def csv_any(self, *names: str) -> Path:
        return _first_existing([(self.raw_dir / n).resolve() for n in names])

    # canonical asset getters (Kaggle tables + optional league schedule)
    def playerstats_csv(self) -> Path:       return self.csv("PlayerStatistics.csv")
    def teamstats_csv(self) -> Path:         return self.csv("TeamStatistics.csv")
    def games_csv(self) -> Path:             return self.csv("Games.csv")
    def teams_csv(self) -> Path:             return self.csv("Teams.csv")
    def team_histories_csv(self) -> Path:    return self.csv_any("CoachHistory.csv", "Coaches.csv")
    def league_schedule_csv(self) -> Path:   return (self.raw_dir / "LeagueSchedule24_25.csv")

    # ML Pipeline file getters
    def model_path(self, model_name: str, year: int) -> Path:
        """Get path for saving/loading trained models."""
        return self.ml_models_dir / f"{model_name}_{year}_model.pkl"
    
    def predictions_path(self, target: str, year: int | None = None) -> Path:
        """
        Flexible getter for per-target predictions path.

        Accepts either:
        - (target, year) e.g., ("season_pie", 2025)
        - single string with trailing year, e.g., "season_pie_2025"
        - single string that already includes a *_predictions_YYYY tag, we'll normalize it

        Produces: .../predictions/{target}_predictions_{year}.parquet
        """
        self.ensure_ml_dirs()

        safe_target = str(target).strip().lower()

        # If year isn't separately provided, try to parse it from the 'target' string
        if year is None:
            # tolerate patterns like "season_pie_2025" or "season_pie_predictions_2025"
            import re
            m = re.search(r'(\d{4})$', safe_target)
            if m:
                year = int(m.group(1))
                # strip known suffixes to get the target core
                safe_target = re.sub(r'(_predictions)?_\d{4}$', '', safe_target)
            else:
                raise TypeError("predictions_path() requires 'year' or a target string ending with a 4-digit year.")

        return self.ml_predictions_dir / f"{safe_target}_predictions_{int(year)}.parquet"


    def leaderboard_path(self, target: str, year: int | None = None) -> Path:
        """
        Flexible getter for per-target leaderboard path.

        Accepts either:
        - (target, year) e.g., ("game_score_per36", 2025)
        - single string with trailing year, e.g., "game_score_per36_2025"
        - single string that already includes a *_leaderboard_YYYY tag, we'll normalize it

        Produces: .../predictions/{target}_leaderboard_{year}.csv
        """
        self.ensure_ml_dirs()

        safe_target = str(target).strip().lower()

        if year is None:
            import re
            m = re.search(r'(\d{4})$', safe_target)
            if m:
                year = int(m.group(1))
                safe_target = re.sub(r'(_leaderboard)?_\d{4}$', '', safe_target)
            else:
                raise TypeError("leaderboard_path() requires 'year' or a target string ending with a 4-digit year.")

        return self.ml_predictions_dir / f"{safe_target}_leaderboard_{int(year)}.csv"

    
    def evaluation_path(self, model_name: str, year: int) -> Path:
        """Get path for saving/loading evaluation results."""
        return self.ml_evaluation_dir / f"{model_name}_{year}_evaluation.json"
    
    def feature_importance_path(self, model_name: str, year: int) -> Path:
        """Get path for saving/loading feature importance."""
        return self.ml_evaluation_dir / f"{model_name}_{year}_feature_importance.csv"

CFG = Paths()

if __name__ == "__main__":
    print("Config paths:")
    print(f"Raw dir: {CFG.raw_dir}")
    print(f"Processed dir: {CFG.processed_dir}")
    print(f"ML models dir: {CFG.ml_models_dir}")
    print(f"ML predictions dir: {CFG.ml_predictions_dir}")
    print(f"ML evaluation dir: {CFG.ml_evaluation_dir}")
    
    print("\nML Config:")
    print(f"Target column: {ML_CONFIG.TARGET_COLUMN}")
    print(f"Prediction year: {ML_CONFIG.PREDICTION_YEAR}")
    print(f"Required lag features: {len(ML_CONFIG.REQUIRED_LAG_FEATURES)}")


Overwriting src/heat_data_scientist_2025/utils/config.py
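
The `predictions_path` and `leaderboard_path` getters in the config above accept either an explicit year or a target string that ends in one. The following is a minimal standalone sketch of that normalization, written without the `Paths` class so it can be run in isolation. The names `normalize_target` and `predictions_file`, and the `predictions` base directory, are illustrative assumptions rather than part of the module.

```python
from __future__ import annotations

import re
from pathlib import Path


def normalize_target(target: str, year: int | None = None) -> tuple[str, int]:
    """Split strings like 'season_pie_predictions_2025' into (core target, year)."""
    safe_target = str(target).strip().lower()
    if year is None:
        # Only a trailing 4-digit run counts as a year ('per36' does not match)
        m = re.search(r"(\d{4})$", safe_target)
        if m is None:
            raise TypeError("need 'year' or a target ending with a 4-digit year")
        year = int(m.group(1))
        # Drop the year and an optional '_predictions' tag in front of it
        safe_target = re.sub(r"(_predictions)?_\d{4}$", "", safe_target)
    return safe_target, int(year)


def predictions_file(base_dir: Path, target: str, year: int | None = None) -> Path:
    """Build the per-target parquet path (hypothetical helper mirroring the getter)."""
    core, yr = normalize_target(target, year)
    return base_dir / f"{core}_predictions_{yr}.parquet"


if __name__ == "__main__":
    base = Path("predictions")  # assumed directory name, for illustration only
    for spec in [("season_pie", 2025), ("season_pie_2025", None),
                 ("Season_PIE_predictions_2025", None)]:
        print(predictions_file(base, *spec).name)
```

All three call styles resolve to the same file name. As in the original getter, the suffix is only stripped when the year is parsed from the string, so passing both an explicit `year` and a target that already carries a `_predictions_YYYY` tag would keep the tag in the name.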


In [4]:
# %%writefile src/heat_data_scientist_2025/data/kaggle_pull.py
"""
NBA statistics analysis - PIE, Game Score per 36, and PER calculations
Load kaggle data, compute metrics, rank players, export results
"""

from __future__ import annotations
from typing import Dict, Iterable
from pathlib import Path
import pandas as pd
import numpy as np
import kagglehub as kh
from kagglehub import KaggleDatasetAdapter as KDA

from src.heat_data_scientist_2025.utils.config import (
    CFG,
    KAGGLE_DATASET,
    IMPORTANT_TABLES,
    KAGGLE_TABLE_TO_CSV,
    start_season as CFG_START_SEASON,
    season_type as CFG_SEASON_TYPE,
    minutes_total_minimum_per_season as CFG_MIN_SEASON_MINUTES,
)

# Core utilities
def _season_from_timestamp(ts: pd.Series) -> pd.Series:
    """Convert timestamps to season strings like '2010-11'"""
    dt = pd.to_datetime(ts, errors="coerce", utc=False)
    start_year = np.where(dt.dt.month >= 8, dt.dt.year, dt.dt.year - 1)
    end_year = start_year + 1
    start_s = pd.Series(start_year, index=ts.index).astype(str)
    end_s = (pd.Series(end_year, index=ts.index) % 100).astype(str).str.zfill(2)
    return start_s + "-" + end_s

def _safe_div(num: pd.Series, den: pd.Series) -> pd.Series:
    """Safe element-wise division - returns 0 where the denominator is missing or <= 0"""
    if not hasattr(den, "fillna") or not hasattr(num, "index"):
        raise TypeError(f"Expected pandas Series, got {type(num)}, {type(den)}")
    
    den = den.fillna(0)
    out = pd.Series(np.zeros(len(num)), index=num.index, dtype="float64")
    mask = den > 0
    out.loc[mask] = (num[mask] / den[mask]).astype("float64")
    return out

def _safe_div_scalar(num: float, den: float) -> float:
    """Safe division for scalars"""
    return 0.0 if pd.isna(den) or den <= 0 else num / den

def _to_numeric(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Convert columns to numeric, keeping NaN for errors"""
    for c in cols:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce")
    return df

def _resolve_player_team_cols(df: pd.DataFrame) -> tuple[str, str]:
    """Find player team columns across different naming conventions"""
    options = [
        ("playerteamCity", "playerteamName"),
        ("playerTeamCity", "playerTeamName"),
        ("teamCity", "teamName"),
    ]
    for city_col, name_col in options:
        if city_col in df.columns and name_col in df.columns:
            return city_col, name_col
    raise KeyError("Could not find player team columns")

# Data loading and filtering
def enforce_criteria_python(
    players_df: pd.DataFrame | None,
    player_stats_df: pd.DataFrame,
    team_stats_df: pd.DataFrame,
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE,
    minutes_total_minimum_per_season: int = CFG_MIN_SEASON_MINUTES,
    defer_minutes_gate: bool = True,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Apply season filters and compute player minutes totals"""
    
    ps = player_stats_df.copy()
    ps["season"] = _season_from_timestamp(ps["gameDate"])
    ts = team_stats_df.copy()
    ts["season"] = _season_from_timestamp(ts["gameDate"])

    # Convert numeric columns
    numeric_cols = [
        "numMinutes","points","assists","blocks","steals",
        "fieldGoalsAttempted","fieldGoalsMade","freeThrowsAttempted","freeThrowsMade",
        "threePointersAttempted","threePointersMade",
        "reboundsDefensive","reboundsOffensive","reboundsTotal",
        "foulsPersonal","turnovers","plusMinusPoints","home","win"
    ]
    ps = _to_numeric(ps, numeric_cols)
    
    for col in ["home","win"]:
        if col in ps.columns:
            ps[col] = ps[col].fillna(0).astype(int)

    # Filter by season and game type
    ps = ps.loc[
        (ps["gameType"] == season_type) &
        (ps["season"].str.slice(0, 4).astype(int) >= start_season)
    ].copy()

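    # Create player names (a missing name column falls back to empty strings;
    # DataFrame.get with a plain "" default would fail on .fillna)
    empty = pd.Series("", index=ps.index, dtype="object")
    fn = ps["firstName"].fillna("") if "firstName" in ps.columns else empty
    ln = ps["lastName"].fillna("") if "lastName" in ps.columns else empty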
    ps["player_name"] = (fn + " " + ln).str.strip()

    # Calculate season minutes
    season_minutes = (
        ps.groupby(["personId","season"], as_index=False)["numMinutes"]
          .sum().rename(columns={"numMinutes": "minutes_total"})
    )
    ps = ps.merge(season_minutes, on=["personId","season"], how="left")

    if not defer_minutes_gate:
        ps = ps.loc[ps["minutes_total"] >= minutes_total_minimum_per_season].copy()

    players_out = players_df.copy() if players_df is not None else None
    return players_out, ps, ts

# PIE calculation
def compute_player_game_pie(
    filtered_player_stats: pd.DataFrame,
    drop_zero_minute_games: bool = True,
    validate_game_sums: bool = True,
    atol: float = 1e-9,
    rtol: float = 1e-6,
) -> pd.DataFrame:
    """Compute per-game PIE values"""
    
    required_cols = [
        "points","fieldGoalsMade","freeThrowsMade","fieldGoalsAttempted","freeThrowsAttempted",
        "reboundsDefensive","reboundsOffensive","assists","steals","blocks","foulsPersonal","turnovers",
        "gameId","numMinutes"
    ]
    missing = [c for c in required_cols if c not in filtered_player_stats.columns]
    if missing:
        raise KeyError(f"Missing PIE columns: {missing}")

    df = filtered_player_stats.copy()
    df = _to_numeric(df, required_cols)

    if drop_zero_minute_games:
        before = len(df)
        df = df.loc[df["numMinutes"] > 0].copy()
        print(f"Dropped {before - len(df):,} zero-minute games")

    # Calculate PIE numerator
    df["pie_numerator"] = (
        df["points"] + df["fieldGoalsMade"] + df["freeThrowsMade"]
        - df["fieldGoalsAttempted"] - df["freeThrowsAttempted"]
        + df["reboundsDefensive"] + 0.5 * df["reboundsOffensive"]
        + df["assists"] + df["steals"] + 0.5 * df["blocks"]
        - df["foulsPersonal"] - df["turnovers"]
    )

    # Calculate game denominators
    game_totals = (
        df.groupby("gameId", as_index=False)["pie_numerator"]
          .sum().rename(columns={"pie_numerator": "pie_denominator"})
    )

    # Merge and calculate PIE
    merged = df.merge(game_totals, on="gameId", how="left", validate="many_to_one")
    merged["pie_denominator"] = pd.to_numeric(merged["pie_denominator"], errors="coerce")
    merged = merged.loc[merged["pie_denominator"] > 0].copy()
    merged["game_pie"] = _safe_div(merged["pie_numerator"], merged["pie_denominator"])

    # Validate game sums
    if validate_game_sums:
        sums = merged.groupby("gameId", as_index=False)["game_pie"].sum()
        bad_games = sums.loc[~np.isclose(sums["game_pie"], 1.0, rtol=rtol, atol=atol)]
        if not bad_games.empty:
            print(f"Warning: {len(bad_games):,} games don't sum to 1.0")

    return merged

# Team game calculations for PER
def compute_team_game_totals_and_pace(filtered_player_stats: pd.DataFrame) -> pd.DataFrame:
    """Calculate team totals and pace for PER computation"""
    
    df = filtered_player_stats.copy()
    if "season" not in df.columns:
        df["season"] = _season_from_timestamp(df["gameDate"])

    numeric_cols = [
        "numMinutes","points","assists","turnovers","reboundsOffensive","reboundsDefensive",
        "reboundsTotal","foulsPersonal","fieldGoalsAttempted","fieldGoalsMade",
        "freeThrowsAttempted","freeThrowsMade","threePointersMade"
    ]
    df = _to_numeric(df, [c for c in numeric_cols if c in df.columns])

    team_city_col, team_name_col = _resolve_player_team_cols(df)

    # Aggregate team totals per game
    team_totals = df.groupby(["gameId", "season", team_city_col, team_name_col], as_index=False).agg(
        team_min=("numMinutes","sum"),
        team_pts=("points","sum"),
        team_ast=("assists","sum"),
        team_tov=("turnovers","sum"),
        team_orb=("reboundsOffensive","sum"),
        team_drb=("reboundsDefensive","sum"),
        team_trb=("reboundsTotal","sum"),
        team_pf=("foulsPersonal","sum"),
        team_fga=("fieldGoalsAttempted","sum"),
        team_fgm=("fieldGoalsMade","sum"),
        team_fta=("freeThrowsAttempted","sum"),
        team_ftm=("freeThrowsMade","sum"),
        team_3pm=("threePointersMade","sum"),
    ).rename(columns={team_city_col: "teamCity", team_name_col: "teamName"})

    # Calculate possessions
    team_totals["team_poss"] = (
        team_totals["team_fga"] - team_totals["team_orb"] + 
        team_totals["team_tov"] + 0.44 * team_totals["team_fta"]
    )

    # Get opponent possessions: self-join on gameId, then keep only the rows that
    # pair a team with the other team in that game (a join on gameId alone also
    # pairs every team with itself)
    pairs = team_totals[["gameId","teamCity","teamName"]].merge(
        team_totals[["gameId","teamCity","teamName","team_poss","team_min"]],
        on="gameId", suffixes=("","_opp")
    )
    pairs = pairs[
        (pairs["teamCity"] != pairs["teamCity_opp"]) |
        (pairs["teamName"] != pairs["teamName_opp"])
    ]
    opponents = pairs.rename(columns={
        "team_poss": "opp_poss", "team_min": "opp_min"
    })[["gameId","teamCity","teamName","opp_poss","opp_min"]].drop_duplicates()

    merged = team_totals.merge(opponents, on=["gameId","teamCity","teamName"], how="left")

    # Calculate pace and assist ratio
    merged["pace"] = 48.0 * _safe_div((merged["team_poss"] + merged["opp_poss"]) / 2.0, merged["team_min"] / 5.0)
    merged["tm_ast_over_fg"] = _safe_div(merged["team_ast"], merged["team_fgm"])

    return merged

# League constants for PER
def compute_league_constants_per_season(team_game_df: pd.DataFrame) -> pd.DataFrame:
    """Calculate league-wide constants for PER"""
    
    tg = team_game_df.copy()
    tg["pace_x_min"] = tg["pace"] * tg["team_min"]

    league_stats = tg.groupby("season", as_index=False).agg(
        lgFG=("team_fgm","sum"),
        lgFGA=("team_fga","sum"),
        lgFT=("team_ftm","sum"),
        lgFTA=("team_fta","sum"),
        lgAST=("team_ast","sum"),
        lgORB=("team_orb","sum"),
        lgDRB=("team_drb","sum"),
        lgTRB=("team_trb","sum"),
        lgTOV=("team_tov","sum"),
        lgPF=("team_pf","sum"),
        lgPTS=("team_pts","sum"),
        tot_min=("team_min","sum"),
        pace_x_min=("pace_x_min","sum"),
    )

    # Calculate derived constants
    league_stats["lgPace"] = _safe_div(league_stats["pace_x_min"], league_stats["tot_min"])
    league_stats["VOP"] = _safe_div(
        league_stats["lgPTS"],
        league_stats["lgFGA"] - league_stats["lgORB"] + league_stats["lgTOV"] + 0.44 * league_stats["lgFTA"]
    )
    league_stats["DRB_PCT"] = _safe_div(
        league_stats["lgTRB"] - league_stats["lgORB"], 
        league_stats["lgTRB"]
    )

    # Basketball-Reference factor calculation
    ast_over_fg = _safe_div(league_stats["lgAST"], league_stats["lgFG"])
    fg_over_ft = _safe_div(league_stats["lgFG"], league_stats["lgFT"])
    league_stats["factor"] = (2.0/3.0) - _safe_div(0.5 * ast_over_fg, 2.0 * fg_over_ft)
    
    league_stats["ft_per_pf"] = _safe_div(league_stats["lgFT"], league_stats["lgPF"])
    league_stats["fta_per_pf"] = _safe_div(league_stats["lgFTA"], league_stats["lgPF"])

    return league_stats[["season","VOP","DRB_PCT","factor","ft_per_pf","fta_per_pf","lgPace"]]

# PER calculation
def compute_player_game_per(
    filtered_player_stats: pd.DataFrame,
    team_game_df: pd.DataFrame,
    league_constants_df: pd.DataFrame,
    drop_zero_minute_games: bool = True,
) -> pd.DataFrame:
    """Calculate per-game PER values"""
    
    ps = filtered_player_stats.copy()
    if drop_zero_minute_games:
        ps = ps.loc[pd.to_numeric(ps["numMinutes"], errors="coerce") > 0].copy()

    required_cols = [
        "gameId","season","personId","numMinutes","threePointersMade","assists",
        "fieldGoalsMade","fieldGoalsAttempted","freeThrowsMade","freeThrowsAttempted",
        "turnovers","reboundsOffensive","reboundsTotal","steals","blocks","foulsPersonal","points"
    ]
    
    numeric_cols = [c for c in required_cols if c in ps.columns and c != "season"]
    ps = _to_numeric(ps, numeric_cols)
    
    missing = [c for c in required_cols if c not in ps.columns]
    if missing:
        raise KeyError(f"Missing PER columns: {missing}")

    # Join with team data
    t_city_col, t_name_col = _resolve_player_team_cols(ps)
    ps_with_team = ps.merge(
        team_game_df[["gameId","teamCity","teamName","pace","tm_ast_over_fg"]],
        left_on=["gameId", t_city_col, t_name_col],
        right_on=["gameId","teamCity","teamName"],
        how="left"
    )

    # Join with league constants
    ps_final = ps_with_team.merge(league_constants_df, on="season", how="left")

    # Calculate PER components
    MP = pd.to_numeric(ps_final["numMinutes"], errors="coerce")
    FG, FGA = ps_final["fieldGoalsMade"], ps_final["fieldGoalsAttempted"]
    FT, FTA = ps_final["freeThrowsMade"], ps_final["freeThrowsAttempted"]
    AST, TOV = ps_final["assists"], ps_final["turnovers"]
    ORB, TRB = ps_final["reboundsOffensive"], ps_final["reboundsTotal"]
    STL, BLK, PF = ps_final["steals"], ps_final["blocks"], ps_final["foulsPersonal"]
    TPM = ps_final["threePointersMade"]

    # Team-game values (tm_ast_fg, tmPace) and league-season constants
    tm_ast_fg = ps_final["tm_ast_over_fg"].fillna(0.0)
    VOP = ps_final["VOP"].fillna(0.0)
    DRBP = ps_final["DRB_PCT"].fillna(0.0)
    factor = ps_final["factor"].fillna(0.0)
    ft_per_pf = ps_final["ft_per_pf"].fillna(0.0)
    fta_per_pf = ps_final["fta_per_pf"].fillna(0.0)
    lgPace = ps_final["lgPace"].fillna(0.0)
    tmPace = ps_final["pace"].fillna(0.0)

    # uPER calculation
    positive_box = (
        TPM + (2.0/3.0) * AST + (2.0 - factor * tm_ast_fg) * FG
        + 0.5 * FT * (2.0 - tm_ast_fg / 3.0) + VOP * (1.0 - DRBP) * (TRB - ORB)
        + VOP * DRBP * ORB + VOP * STL + VOP * DRBP * BLK
    )
    
    negative_box = (
        VOP * TOV + VOP * DRBP * (FGA - FG)
        + VOP * 0.44 * (0.44 + 0.56 * DRBP) * (FTA - FT)
        + PF * (ft_per_pf - 0.44 * fta_per_pf * VOP)
    )
    
    uPER = _safe_div(positive_box - negative_box, MP)
    
    # Pace adjustment
    pace_adj = _safe_div(lgPace, tmPace)
    aPER = uPER * pace_adj

    result = ps_final[["personId","gameId","season","numMinutes"]].copy()
    result["uPER"] = pd.to_numeric(uPER, errors="coerce")
    result["aPER"] = pd.to_numeric(aPER, errors="coerce")
    
    return result

# NEW: BPM calculation for VORP
def compute_player_game_bpm(
    filtered_player_stats: pd.DataFrame,
    team_game_df: pd.DataFrame,
    league_constants_df: pd.DataFrame,
    drop_zero_minute_games: bool = True,
) -> pd.DataFrame:
    """Calculate simplified Box Plus/Minus (BPM) for VORP computation"""
    
    ps = filtered_player_stats.copy()
    if drop_zero_minute_games:
        ps = ps.loc[pd.to_numeric(ps["numMinutes"], errors="coerce") > 0].copy()

    required_cols = [
        "gameId","season","personId","numMinutes","points","assists","reboundsTotal",
        "steals","blocks","turnovers","foulsPersonal","fieldGoalsMade","fieldGoalsAttempted",
        "freeThrowsMade","freeThrowsAttempted","threePointersMade"
    ]
    
    # Handle column name variations: map the spaced spelling onto the expected name
    if "reboundsTotal" not in ps.columns and "rebounds Total" in ps.columns:
        ps["reboundsTotal"] = ps["rebounds Total"]
    
    numeric_cols = [c for c in required_cols if c in ps.columns and c != "season"]
    ps = _to_numeric(ps, numeric_cols)
    
    missing = [c for c in required_cols if c not in ps.columns]
    if missing:
        raise KeyError(f"Missing BPM columns: {missing}")
    
    # Join with team data to get pace
    t_city_col, t_name_col = _resolve_player_team_cols(ps)
    ps_with_team = ps.merge(
        team_game_df[["gameId","teamCity","teamName","pace","team_poss"]],
        left_on=["gameId", t_city_col, t_name_col],
        right_on=["gameId","teamCity","teamName"],
        how="left"
    )

    # Join with league constants for league averages
    ps_final = ps_with_team.merge(league_constants_df, on="season", how="left")

    # Minutes played; the rates below are normalized by minutes only
    MP = pd.to_numeric(ps_final["numMinutes"], errors="coerce")
    team_pace = ps_final["pace"].fillna(100.0)  # joined for reference; not applied to the rates below
    
    # Scaling denominator: stat * 100 / player_poss_per100 == stat * 48 / MP,
    # so each "per100" value is a per-48-minute rate standing in for per-100 possessions
    player_poss_per100 = 100.0 * MP / 48.0
    
    # Simplified BPM calculation based on these box score rates
    # This is a simplified version - full BPM requires more complex calculations
    pts_per100 = _safe_div(ps_final["points"] * 100.0, player_poss_per100)
    reb_per100 = _safe_div(ps_final.get("reboundsTotal", ps_final.get("rebounds Total", 0)) * 100.0, player_poss_per100)
    ast_per100 = _safe_div(ps_final["assists"] * 100.0, player_poss_per100)
    stl_per100 = _safe_div(ps_final["steals"] * 100.0, player_poss_per100)
    blk_per100 = _safe_div(ps_final["blocks"] * 100.0, player_poss_per100)
    tov_per100 = _safe_div(ps_final["turnovers"] * 100.0, player_poss_per100)
    
    # Calculate True Shooting Percentage
    tsa = ps_final["fieldGoalsAttempted"] + 0.44 * ps_final["freeThrowsAttempted"]
    ts_pct = _safe_div(ps_final["points"], 2.0 * tsa)
    
    # Simplified BPM formula (coefficients based on statistical impact)
    # These are approximated coefficients - full BPM uses regression analysis
    # Scale to approximate BPM range (-10 to +10)
    bpm = (
        0.15 * pts_per100 +
        0.12 * reb_per100 +
        0.18 * ast_per100 +
        0.25 * stl_per100 +
        0.15 * blk_per100 -
        0.20 * tov_per100 +
        5.0 * (ts_pct - 0.53)  # TS% adjustment (league average ~53%)
        - 5.0  # Baseline adjustment to center around 0
    )

    result = ps_final[["personId","gameId","season","numMinutes"]].copy()
    result["game_bpm"] = pd.to_numeric(bpm, errors="coerce").fillna(0.0)
    
    return result

# Season aggregation
def build_player_season_table_python(
    player_game_with_pie: pd.DataFrame,
    team_stats_df: pd.DataFrame,
    minutes_total_minimum_per_season: int = CFG_MIN_SEASON_MINUTES,
    player_game_per: pd.DataFrame | None = None,
    player_game_bpm: pd.DataFrame | None = None,  # NEW parameter for VORP/EWA
) -> pd.DataFrame:
    """Aggregate game-level stats to season level"""
    
    pg = player_game_with_pie.copy()
    
    # Convert numeric columns
    numeric_cols = [
        "numMinutes","points","assists","blocks","steals","reboundsDefensive","reboundsOffensive",
        "reboundsTotal","foulsPersonal","turnovers","fieldGoalsAttempted","fieldGoalsMade",
        "freeThrowsAttempted","freeThrowsMade","threePointersAttempted","threePointersMade",
        "plusMinusPoints","game_pie","home","win"
    ]
    pg = _to_numeric(pg, numeric_cols)

    # Season aggregations
    season_stats = pg.groupby(["personId","player_name","season"], as_index=False).agg(
        games_played=("gameId","nunique"),
        total_minutes=("numMinutes","sum"),
        total_points=("points","sum"),
        total_assists=("assists","sum"),
        total_rebounds=("reboundsTotal","sum"),
        total_oreb=("reboundsOffensive","sum"),
        total_dreb=("reboundsDefensive","sum"),
        total_blocks=("blocks","sum"),
        total_stls=("steals","sum"),
        total_pf=("foulsPersonal","sum"),
        total_tov=("turnovers","sum"),
        total_fga=("fieldGoalsAttempted","sum"),
        total_fgm=("fieldGoalsMade","sum"),
        total_fta=("freeThrowsAttempted","sum"),
        total_ftm=("freeThrowsMade","sum"),
        total_3pa=("threePointersAttempted","sum"),
        total_3pm=("threePointersMade","sum"),
        total_plus_minus=("plusMinusPoints","sum"),
        avg_plus_minus=("plusMinusPoints","mean"),
        wins=("win","sum"),
        home_games=("home","sum"),
        season_pie_num=("pie_numerator","sum"),
        season_pie_den=("pie_denominator","sum"),
    )

    # Calculate shooting percentages
    season_stats["fg_pct"] = _safe_div(season_stats["total_fgm"], season_stats["total_fga"])
    season_stats["fg3_pct"] = _safe_div(season_stats["total_3pm"], season_stats["total_3pa"])
    season_stats["ft_pct"] = _safe_div(season_stats["total_ftm"], season_stats["total_fta"])
    season_stats["ts_pct"] = _safe_div(
        season_stats["total_points"], 
        2.0 * (season_stats["total_fga"] + 0.44 * season_stats["total_fta"])
    )

    # PIE calculations
    season_stats["season_pie"] = _safe_div(season_stats["season_pie_num"], season_stats["season_pie_den"])
    season_stats["season_pie_pct"] = 100.0 * season_stats["season_pie"]

    # Per-36 minutes stats
    per36_cols = [
        ("pts_per36", "total_points"),
        ("ast_per36", "total_assists"), 
        ("reb_per36", "total_rebounds"),
        ("fgm_per36", "total_fgm"),
        ("fga_per36", "total_fga"),
        ("ftm_per36", "total_ftm"),
        ("fta_per36", "total_fta"),
        ("oreb_per36", "total_oreb"),
        ("dreb_per36", "total_dreb"),
        ("stl_per36", "total_stls"),
        ("blk_per36", "total_blocks"),
        ("pf_per36", "total_pf"),
        ("tov_per36", "total_tov"),
    ]
    
    for per36_col, total_col in per36_cols:
        season_stats[per36_col] = _safe_div(season_stats[total_col] * 36.0, season_stats["total_minutes"])

    # Additional efficiency metrics
    season_stats["usage_per_min"] = _safe_div(
        season_stats["total_fga"] + 0.44 * season_stats["total_fta"] + season_stats["total_tov"],
        season_stats["total_minutes"]
    )
    
    season_stats["efficiency_per_game"] = _safe_div(
        (season_stats["total_points"] + season_stats["total_rebounds"] + season_stats["total_assists"]
         + season_stats["total_stls"] + season_stats["total_blocks"]
         - (season_stats["total_fga"] - season_stats["total_fgm"])
         - (season_stats["total_fta"] - season_stats["total_ftm"])
         - season_stats["total_tov"]),
        season_stats["games_played"]
    )

    season_stats["win_pct"] = _safe_div(season_stats["wins"], season_stats["games_played"])
    season_stats["home_games_pct"] = _safe_div(season_stats["home_games"], season_stats["games_played"])

    # Get team information
    ts = team_stats_df.copy()
    if "season" not in ts.columns:
        ts["season"] = _season_from_timestamp(ts["gameDate"])
    ts = _to_numeric(ts, ["seasonWins","seasonLosses"])
    ts["seasonWins"] = ts["seasonWins"].fillna(0)
    ts["seasonLosses"] = ts["seasonLosses"].fillna(0)

    team_final_records = (
        ts.groupby(["teamCity","teamName","season"], as_index=False)
          .agg(final_wins=("seasonWins","max"), final_losses=("seasonLosses","max"))
    )
    team_final_records["team_win_pct_final"] = _safe_div(
        team_final_records["final_wins"], 
        team_final_records["final_wins"] + team_final_records["final_losses"]
    )

    # Find primary team for each player
    team_city_col = "playerteamCity" if "playerteamCity" in pg.columns else "playerTeamCity"
    team_name_col = "playerteamName" if "playerteamName" in pg.columns else "playerTeamName"
    
    player_team_minutes = (
        pg.groupby(["personId","season",team_city_col,team_name_col], as_index=False)
          .agg(minutes_on_team=("numMinutes","sum"))
    )
    
    primary_team_idx = player_team_minutes.groupby(["personId","season"])["minutes_on_team"].idxmax()
    primary_teams = player_team_minutes.loc[primary_team_idx, ["personId","season",team_city_col,team_name_col]].copy()
    primary_teams = primary_teams.rename(columns={team_city_col: "teamCity", team_name_col: "teamName"})

    # Merge team data
    season_with_team = season_stats.merge(primary_teams, on=["personId","season"], how="left")
    season_final = season_with_team.merge(
        team_final_records[["teamCity","teamName","season","team_win_pct_final"]],
        on=["teamCity","teamName","season"], how="left"
    )

    # Rename for consistency
    season_final = season_final.rename(columns={
        "total_stls": "total_steals",
        "total_oreb": "total_reb_off", 
        "total_dreb": "total_reb_def",
    })

    # Game Score calculation
    season_final["season_game_score_total"] = (
        season_final["total_points"]
        + 0.4 * season_final["total_fgm"]
        - 0.7 * season_final["total_fga"] 
        - 0.4 * (season_final["total_fta"] - season_final["total_ftm"])
        + 0.7 * season_final["total_reb_off"]
        + 0.3 * season_final["total_reb_def"]
        + season_final["total_steals"]
        + 0.7 * season_final["total_assists"]
        + 0.7 * season_final["total_blocks"]
        - 0.4 * season_final["total_pf"]
        - season_final["total_tov"]
    )
    season_final["game_score_per36"] = _safe_div(
        season_final["season_game_score_total"] * 36.0, 
        season_final["total_minutes"]
    )

    # Add PER if available
    if player_game_per is not None and not player_game_per.empty:
        per_df = player_game_per.copy()
        per_df = _to_numeric(per_df, ["numMinutes","uPER","aPER"])

        # Minutes-weighted PER aggregation
        def calculate_weighted_per(group):
            total_minutes = group["numMinutes"].sum()
            if total_minutes > 0:
                return pd.Series({
                    "minutes_per_season": total_minutes,
                    "season_aPER": (group["aPER"] * group["numMinutes"]).sum() / total_minutes,
                    "season_uPER": (group["uPER"] * group["numMinutes"]).sum() / total_minutes,
                })
            return pd.Series({
                "minutes_per_season": 0,
                "season_aPER": 0,
                "season_uPER": 0,
            })

        per_aggregated = (per_df.groupby(["personId","season"])
                         .apply(calculate_weighted_per, include_groups=False)
                         .reset_index())

        # League average aPER for scaling
        def league_aper_by_season(group):
            total_minutes = group["numMinutes"].sum()
            if total_minutes > 0:
                return (group["aPER"] * group["numMinutes"]).sum() / total_minutes
            return 0

        league_aper = (per_df.groupby("season")
                      .apply(league_aper_by_season, include_groups=False)
                      .rename("lg_aPER")
                      .reset_index())

        per_aggregated = per_aggregated.merge(league_aper, on="season", how="left")
        per_aggregated["per_scale"] = _safe_div(
            pd.Series(15.0, index=per_aggregated.index), 
            per_aggregated["lg_aPER"]
        )
        per_aggregated["season_PER"] = per_aggregated["season_aPER"] * per_aggregated["per_scale"]

        season_final = season_final.merge(
            per_aggregated[["personId","season","season_uPER","season_aPER","season_PER"]],
            on=["personId","season"], how="left"
        )

    # NEW: Add VORP and EWA if BPM available
    if player_game_bpm is not None and not player_game_bpm.empty:
        bpm_df = player_game_bpm.copy()
        bpm_df = _to_numeric(bpm_df, ["numMinutes","game_bpm"])

        # Minutes-weighted BPM aggregation
        def calculate_weighted_bpm(group):
            total_minutes = group["numMinutes"].sum()
            if total_minutes > 0:
                return pd.Series({
                    "season_BPM": (group["game_bpm"] * group["numMinutes"]).sum() / total_minutes,
                })
            return pd.Series({"season_BPM": 0})

        bpm_aggregated = (bpm_df.groupby(["personId","season"])
                         .apply(calculate_weighted_bpm, include_groups=False)
                         .reset_index())

        season_final = season_final.merge(
            bpm_aggregated[["personId","season","season_BPM"]],
            on=["personId","season"], how="left"
        )
        
        # Approximate the season length by the most games any single player appeared in
        # (usually 82; fewer in shortened seasons, occasionally more for a player traded mid-season)
        team_games_per_season = season_final.groupby("season")["games_played"].max().reset_index()
        team_games_per_season = team_games_per_season.rename(columns={"games_played": "season_max_games"})
        season_final = season_final.merge(team_games_per_season, on="season", how="left")
        
        # Calculate a VORP-style value
        # Conventional definition: VORP = (BPM - (-2.0)) * (% of possessions played) * (team games / 82)
        # Here the share of possessions is approximated by the share of available minutes, and the
        # result is additionally divided by 2.77. That divisor is a constant rescaling: it changes
        # the magnitude of the values but not the ranking order.
        # Note: minutes_pct * games_pct reduces to total_minutes / (82 * 48).
        replacement_level = -2.0
        minutes_in_season = season_final["total_minutes"]
        max_possible_minutes = season_final["season_max_games"] * 48  # 48 minutes per game max
        
        minutes_pct = _safe_div(minutes_in_season, max_possible_minutes)
        games_pct = _safe_div(season_final["season_max_games"], pd.Series(82, index=season_final.index))
        
        season_final["season_VORP"] = _safe_div(
            (season_final["season_BPM"] - replacement_level) * minutes_pct * games_pct,
            pd.Series(2.77, index=season_final.index)  # Constant scaling divisor
        )
        
        # EWA (Estimated Wins Added) is set equal to the VORP-style value above as a proxy.
        # It is not derived separately from PER here, so the VORP and EWA rankings are identical.
        season_final["season_EWA"] = season_final["season_VORP"].copy()

    # Apply minutes filter
    before_filter = len(season_final)
    season_final = season_final.loc[season_final["total_minutes"] >= minutes_total_minimum_per_season].copy()
    after_filter = len(season_final)
    if before_filter != after_filter:
        print(f"Applied {minutes_total_minimum_per_season} minute filter: removed {before_filter - after_filter:,} player-seasons")

    return season_final

# Unified ranking system
def rank_seasons_by_metric(
    player_season_df: pd.DataFrame,
    metric: str,
    top_n: int = 10,
    middle_n: int = 10,
    bottom_n: int = 10,
    tie_breaker: str = "python",
    include_context: bool = False,
) -> dict[str, pd.DataFrame]:
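    """Rank player seasons by the specified metric.

    Returns a dict with "top", "middle", and "bottom" DataFrames. "middle" holds the
    seasons whose metric value is closest to the median. Ties are broken by total
    minutes, then by ts_pct (tie_breaker="duckdb") or games played (any other value),
    then by player name.
    """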
    
    df = player_season_df.copy()
    
    # Define metric configurations
    metric_config = {
        "pie": {
            "main_col": "season_pie",
            "display_cols": ["player_name","season","season_pie","season_pie_pct","games_played","total_minutes"],
            "context_cols": [
                "total_points", "total_fgm", "total_ftm", "total_fga", "total_fta",
                "total_reb_def", "total_reb_off", "total_assists", "total_steals", 
                "total_blocks", "total_pf", "total_tov", "total_minutes"
            ],
            "title": "PIE"
        },
        "gs36": {
            "main_col": "game_score_per36", 
            "display_cols": ["player_name","season","game_score_per36","games_played","total_minutes"],
            "context_cols": [
                "pts_per36", "fgm_per36", "fga_per36", "ftm_per36", "fta_per36",
                "oreb_per36", "dreb_per36", "stl_per36", "ast_per36", "blk_per36",
                "pf_per36", "tov_per36"
            ],
            "title": "Game Score per 36"
        },
        "per": {
            "main_col": "season_PER",
            "display_cols": ["player_name","season","season_PER","games_played","total_minutes"],
            "context_cols": [
                "season_uPER", "season_aPER", "total_points", "total_fgm", "total_fga",
                "total_ftm", "total_fta", "total_3pm", "total_assists", "total_steals",
                "total_blocks", "total_reb_off", "total_rebounds", "total_tov", "total_pf"
            ],
            "title": "PER"
        },
        "vorp": {  # NEW: VORP configuration
            "main_col": "season_VORP",
            "display_cols": ["player_name","season","season_VORP","games_played","total_minutes"],
            "context_cols": [
                "season_BPM", "total_points", "total_assists", "total_rebounds", 
                "total_steals", "total_blocks", "total_tov", "ts_pct"
            ],
            "title": "VORP"
        },
        "ewa": {  # NEW: EWA configuration
            "main_col": "season_EWA",
            "display_cols": ["player_name","season","season_EWA","games_played","total_minutes"],
            "context_cols": [
                "season_BPM", "season_VORP", "total_points", "total_assists", 
                "total_rebounds", "total_steals", "total_blocks", "total_tov", "ts_pct"
            ],
            "title": "EWA"
        }
    }
    
    if metric not in metric_config:
        raise ValueError(f"Metric must be one of: {list(metric_config.keys())}")
    
    config = metric_config[metric]
    main_col = config["main_col"]
    
    # Check required columns
    required = ["personId","player_name","season",main_col,"total_minutes","games_played"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns for {metric} ranking: {missing}")

    # Set up sorting
    if tie_breaker == "duckdb" and "ts_pct" not in df.columns:
        df["ts_pct"] = np.nan
    
    sort_cols = [(main_col, False), ("total_minutes", False)]
    if tie_breaker == "duckdb":
        sort_cols.append(("ts_pct", False))
    else:
        sort_cols.append(("games_played", False))
    sort_cols.append(("player_name", True))

    # Get rankings
    sort_columns = [col for col, _ in sort_cols]
    sort_ascending = [asc for _, asc in sort_cols]
    
    top = df.sort_values(sort_columns, ascending=sort_ascending).head(top_n).copy()
    bottom = df.sort_values(
        [main_col] + sort_columns[1:],
        ascending=[True] + sort_ascending[1:]
    ).head(bottom_n).copy()

    # Middle (closest to median)
    median_val = df[main_col].median(skipna=True)
    df["dist_to_median"] = (df[main_col] - median_val).abs()
    middle = df.sort_values(
        ["dist_to_median"] + sort_columns,
        ascending=[True] + sort_ascending
    ).head(middle_n).drop(columns=["dist_to_median"]).copy()

    def format_basic_ranking(ranking_df: pd.DataFrame) -> pd.DataFrame:
        """Format basic ranking display"""
        result = ranking_df[config["display_cols"]].copy()
        
        # Round numeric columns appropriately
        if main_col in result.columns:
            if metric == "pie":
                result["season_pie"] = result["season_pie"].round(6)
                result["season_pie_pct"] = result["season_pie_pct"].round(2)
            elif metric == "gs36":
                result["game_score_per36"] = result["game_score_per36"].round(6)
            elif metric == "per":
                result["season_PER"] = result["season_PER"].round(3)
            elif metric in ["vorp", "ewa"]:  # NEW: VORP and EWA rounding
                result[main_col] = result[main_col].round(3)
        
        if "total_minutes" in result.columns:
            result["total_minutes"] = result["total_minutes"].round(1)
            
        return result.reset_index(drop=True)

    def format_context_ranking(ranking_df: pd.DataFrame) -> pd.DataFrame:
        """Format ranking with context columns"""
        if not include_context:
            return format_basic_ranking(ranking_df)
            
        result = ranking_df.copy().reset_index(drop=True)
        result.insert(0, "Rank", range(1, len(result) + 1))
        
        # Select relevant columns
        base_cols = ["Rank", "player_name", "season"]
        metric_cols = [main_col]
        if metric == "pie":
            metric_cols.append("season_pie_pct")
        context_cols = [c for c in config["context_cols"] if c in result.columns]
        
        selected_cols = base_cols + metric_cols + context_cols
        result = result[[c for c in selected_cols if c in result.columns]]
        
        # Apply rounding
        if metric == "pie":
            if "season_pie" in result.columns:
                result["season_pie"] = result["season_pie"].round(6)
            if "season_pie_pct" in result.columns:
                result["season_pie_pct"] = result["season_pie_pct"].round(2)
        elif metric == "gs36":
            if "game_score_per36" in result.columns:
                result["game_score_per36"] = result["game_score_per36"].round(6)
            # Round per-36 stats to 3 decimals
            for col in context_cols:
                if col.endswith("_per36") and col in result.columns:
                    result[col] = result[col].round(3)
        elif metric == "per":
            for col in ["season_PER", "season_uPER", "season_aPER"]:
                if col in result.columns:
                    result[col] = result[col].round(3)
        elif metric in ["vorp", "ewa"]:  # NEW: VORP and EWA context formatting
            # Round VORP/EWA and related metrics
            for col in [main_col, "season_BPM", "season_VORP", "season_EWA"]:
                if col in result.columns:
                    result[col] = result[col].round(3)
            if "ts_pct" in result.columns:
                result["ts_pct"] = result["ts_pct"].round(3)
        
        return result

    # Pick the formatter once instead of repeating the same condition per bucket
    formatter = format_context_ranking if include_context else format_basic_ranking
    return {
        "top": formatter(top),
        "middle": formatter(middle),
        "bottom": formatter(bottom),
    }

# Output formatting and export functions
def print_rankings(player_season_df: pd.DataFrame, metric: str, **kwargs) -> dict[str, pd.DataFrame]:
    """Print formatted rankings for a metric"""
    
    rankings = rank_seasons_by_metric(player_season_df, metric=metric, **kwargs)
    
    # Display titles; any metric not listed here falls back to metric.upper()
    metric_titles = {"pie": "PIE", "gs36": "Game Score per 36", "per": "PER"}
    title = metric_titles.get(metric, metric.upper())
    
    context_suffix = " (with context)" if kwargs.get("include_context", False) else ""
    
    # Report the actual row counts so the headings stay correct when
    # top_n / middle_n / bottom_n are overridden through kwargs
    print(f"\n=== Top {len(rankings['top'])} seasons by {title}{context_suffix} ===")
    print(rankings["top"].to_string(index=False))
    print(f"\n=== Middle {len(rankings['middle'])} seasons by {title}{context_suffix} ===")
    print(rankings["middle"].to_string(index=False))
    print(f"\n=== Bottom {len(rankings['bottom'])} seasons by {title}{context_suffix} ===")
    print(rankings["bottom"].to_string(index=False))
    
    return rankings

def format_for_submission(rankings: dict[str, pd.DataFrame]) -> dict[str, pd.DataFrame]:
    """Convert rankings to submission format (Rank, Player, Season)"""
    
    def format_submission_df(df: pd.DataFrame) -> pd.DataFrame:
        result = df[["player_name","season"]].copy()
        result.insert(0, "Rank", range(1, len(result) + 1))
        result = result.rename(columns={"player_name":"Player","season":"Season"})
        return result
    
    return {
        "top": format_submission_df(rankings["top"]),
        "middle": format_submission_df(rankings["middle"]),
        "bottom": format_submission_df(rankings["bottom"])
    }

# Export rankings to one or more file formats (a single format string is still accepted)
def export_rankings(
    rankings: dict[str, pd.DataFrame],
    output_dir: Path | str,
    filename_prefix: str,
    file_format: str | Iterable[str] = "csv"
) -> None:
    """Export rankings dict to one or more file formats.

    Supported formats:
      - "csv": writes <prefix>_<top|middle|bottom>.csv
      - "txt": writes monospace tables via DataFrame.to_string()

    You can pass a single string (e.g., "csv") or an iterable like ("csv", "txt").
    """

    # -- normalize file_format to an iterable without breaking old calls
    if isinstance(file_format, str):
        formats = [file_format.lower()]
    else:
        formats = [str(fmt).lower() for fmt in file_format]

    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    for rank_type, df in rankings.items():
        for fmt in formats:
            if fmt == "csv":
                filepath = output_path / f"{filename_prefix}_{rank_type}.csv"
                df.to_csv(filepath, index=False)
            elif fmt == "txt":
                filepath = output_path / f"{filename_prefix}_{rank_type}.txt"
                with open(filepath, "w", encoding="utf-8") as f:
                    f.write(df.to_string(index=False) + "\n")
            else:
                raise ValueError(f"Unsupported file format: {fmt}")

            print(f"Exported {rank_type} rankings to: {filepath}")


# Data loading functions
def load_nba_data(table_names: Iterable[str] = IMPORTANT_TABLES) -> Dict[str, pd.DataFrame]:
    """Load NBA data tables from Kaggle"""
    
    result = {}
    failed = {}

    for table in table_names:
        csv_filename = KAGGLE_TABLE_TO_CSV.get(table)
        if not csv_filename:
            failed[table] = "No CSV mapping found"
            continue
            
        try:
            df = kh.dataset_load(
                KDA.PANDAS, KAGGLE_DATASET, csv_filename, 
                pandas_kwargs={"low_memory": False}
            )
            result[table] = df
            print(f"Loaded {csv_filename} -> '{table}' ({len(df):,} rows)")
        except Exception as e:
            failed[table] = str(e)

    if failed:
        error_msg = "\n".join([f"- {t}: {e}" for t, e in failed.items()])
        raise RuntimeError(f"Failed to load tables:\n{error_msg}")
        
    return result

# Export of the filtered base tables used downstream
def export_base_csvs(
    player_stats: pd.DataFrame,
    team_stats: pd.DataFrame,
    outdir: Path | str | None = None,
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE,
) -> None:
    """
    Save the filtered Kaggle bases actually used downstream.
    """
    base_dir = Path(outdir) if outdir else (Path(CFG.processed_dir) / "base")
    base_dir.mkdir(parents=True, exist_ok=True)

    ps_path = base_dir / f"player_statistics_used_{season_type}_from_{start_season}.csv"
    ts_path = base_dir / f"team_statistics_used_{season_type}_from_{start_season}.csv"

    player_stats.to_csv(ps_path, index=False)
    team_stats.to_csv(ts_path, index=False)
    print(f"Saved base CSVs:\n- {ps_path}\n- {ts_path}")


def build_player_game_data(
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE, 
    minutes_minimum: int = CFG_MIN_SEASON_MINUTES,
    export_bases: bool = False,                 # also write the filtered base tables to CSV
    bases_outdir: Path | str | None = None,     # target directory; None uses processed_dir/base
) -> pd.DataFrame:
    """Build player-game data with PIE calculations"""
    data = load_nba_data(["PlayerStatistics","TeamStatistics"])
    _, player_stats, team_stats = enforce_criteria_python(   # keep team_stats for the optional export
        None, data["PlayerStatistics"], data["TeamStatistics"],
        start_season=start_season, season_type=season_type,
        minutes_total_minimum_per_season=minutes_minimum,
        defer_minutes_gate=True,
    )

    # save the exact bases we used (once)
    if export_bases:
        export_base_csvs(
            player_stats=player_stats,
            team_stats=team_stats,
            outdir=bases_outdir,
            start_season=start_season,
            season_type=season_type,
        )

    return compute_player_game_pie(player_stats)


def build_per_data(
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE,
    minutes_minimum: int = CFG_MIN_SEASON_MINUTES
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Build complete PER calculation data"""
    
    print("Loading data for PER calculations...")
    data = load_nba_data(["PlayerStatistics", "TeamStatistics"])

    _, player_stats, _ = enforce_criteria_python(
        None, data["PlayerStatistics"], data["TeamStatistics"],
        start_season=start_season, season_type=season_type,
        minutes_total_minimum_per_season=minutes_minimum,
        defer_minutes_gate=True
    )
    print(f"Filtered player stats: {player_stats.shape}")

    team_game_data = compute_team_game_totals_and_pace(player_stats)
    print(f"Team game data: {team_game_data.shape}")

    league_constants = compute_league_constants_per_season(team_game_data)
    print(f"League constants: {league_constants.shape}")

    player_per_data = compute_player_game_per(player_stats, team_game_data, league_constants)
    print(f"Player PER data: {player_per_data.shape}")

    return team_game_data, league_constants, player_per_data

# NEW: Build VORP/EWA data
def build_vorp_ewa_data(
    start_season: int = CFG_START_SEASON,
    season_type: str = CFG_SEASON_TYPE,
    minutes_minimum: int = CFG_MIN_SEASON_MINUTES
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Build complete VORP and EWA calculation data"""
    
    print("Loading data for VORP/EWA calculations...")
    data = load_nba_data(["PlayerStatistics", "TeamStatistics"])

    _, player_stats, _ = enforce_criteria_python(
        None, data["PlayerStatistics"], data["TeamStatistics"],
        start_season=start_season, season_type=season_type,
        minutes_total_minimum_per_season=minutes_minimum,
        defer_minutes_gate=True
    )
    print(f"Filtered player stats: {player_stats.shape}")

    team_game_data = compute_team_game_totals_and_pace(player_stats)
    print(f"Team game data: {team_game_data.shape}")

    league_constants = compute_league_constants_per_season(team_game_data)
    print(f"League constants: {league_constants.shape}")

    player_bpm_data = compute_player_game_bpm(player_stats, team_game_data, league_constants)
    print(f"Player BPM data: {player_bpm_data.shape}")

    return team_game_data, league_constants, player_bpm_data, player_stats

# Main execution
if __name__ == "__main__":
    print("=== NBA Player Rankings Analysis ===")

    # 1) Player-game + export used bases
    print("\n1. Building player-game data...")
    try:
        player_game_data = build_player_game_data(
            export_bases=True,                       # save the filtered base tables once
            bases_outdir=Path(CFG.processed_dir)/"base"  # optional; defaults to processed/base
        )
        print(f"Success: {player_game_data.shape} player-game records")
    except Exception as e:
        print(f"Error building player-game data: {e}")
        player_game_data = None

    # Build PER data
    print("\n2. Building PER data...")
    try:
        team_games, league_consts, player_per = build_per_data()
        print("Success: PER data complete")
    except Exception as e:
        print(f"Error building PER data: {e}")
        player_per = None

    # NEW: Build VORP/EWA data
    print("\n3. Building VORP/EWA data...")
    try:
        team_games_vorp, league_consts_vorp, player_bpm, player_stats_vorp = build_vorp_ewa_data()
        print("Success: VORP/EWA data complete")
    except Exception as e:
        print(f"Error building VORP/EWA data: {e}")
        player_bpm = None

    # Build season-level data
    if player_game_data is not None:
        print("\n4. Building season-level data...")
        try:
            team_data = load_nba_data(["TeamStatistics"])
            season_data = build_player_season_table_python(
                player_game_data, 
                team_data["TeamStatistics"],
                player_game_per=player_per,
                player_game_bpm=player_bpm  # NEW parameter
            )
            print(f"Success: {season_data.shape} player-season records")

            # Save dataset
            season_data.to_parquet(CFG.ml_dataset_path, index=False)

            # Generate and export rankings for each metric (UPDATED with VORP and EWA)
            for metric in ["pie", "gs36", "per", "vorp", "ewa"]:
                required_col = {
                    "pie": "season_pie",
                    "gs36": "game_score_per36", 
                    "per": "season_PER",
                    "vorp": "season_VORP",
                    "ewa": "season_EWA"
                }[metric]
                
                if required_col not in season_data.columns:
                    print(f"Skipping {metric.upper()} - data not available")
                    continue

                print(f"\n=== {metric.upper()} Rankings ===")
                
                # Basic rankings
                basic_rankings = rank_seasons_by_metric(season_data, metric, tie_breaker="duckdb")
                submission_format = format_for_submission(basic_rankings)
                
                # Print and export submission format
                title = {
                    "pie": "PIE", 
                    "gs36": "Game Score per 36", 
                    "per": "PER",
                    "vorp": "VORP",
                    "ewa": "EWA"
                }[metric]
                print(f"\n=== Top 10 seasons by {title} ===")
                print(submission_format["top"].to_string(index=False))
                print(f"\n=== Middle 10 seasons by {title} ===")
                print(submission_format["middle"].to_string(index=False))
                print(f"\n=== Bottom 10 seasons by {title} ===") 
                print(submission_format["bottom"].to_string(index=False))
                
                export_rankings(
                    submission_format, CFG.processed_dir,
                    f"nba_{metric}_rankings", "txt"
                )
                
                # Context rankings: print_rankings computes, prints, and returns them,
                # so a separate rank_seasons_by_metric call is not needed
                context_rankings = print_rankings(
                    season_data,
                    metric,
                    tie_breaker="duckdb",
                    include_context=True
                )

                # Export contextual rankings as CSV and as readable TXT tables
                export_rankings(
                    context_rankings,
                    CFG.processed_dir,
                    f"nba_{metric}_rankings_context",
                    file_format=("csv", "txt")
                )


        except Exception as e:
            print(f"Error in season analysis: {e}")

    print("\n=== Analysis Complete ===")


=== NBA Player Rankings Analysis ===

1. Building player-game data...
Loaded PlayerStatistics.csv -> 'PlayerStatistics' (1,627,438 rows)
Loaded TeamStatistics.csv -> 'TeamStatistics' (143,758 rows)
Saved base CSVs:
- C:\docker_projects\docker_dev_template\data\processed\heat_data_scientist_2025\base\player_statistics_used_Regular Season_from_2009.csv
- C:\docker_projects\docker_dev_template\data\processed\heat_data_scientist_2025\base\team_statistics_used_Regular Season_from_2009.csv
Dropped 87,193 zero-minute games
Success: (397558, 41) player-game records

2. Building PER data...
Loading data for PER calculations...
Loaded PlayerStatistics.csv -> 'PlayerStatistics' (1,627,438 rows)
Loaded TeamStatistics.csv -> 'TeamStatistics' (143,758 rows)
Filtered player stats: (484751, 38)
Team game data: (38092, 22)
League constants: (16, 7)
Player PER data: (397558, 6)
Success: PER data complete

3. Building VORP/EWA data...
Loading data for VORP/EWA calculations...
Loaded PlayerStatistics.csv 
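
For reference, the Game Score per 36 arithmetic used in the season table above can be checked by hand on a single season line. The sketch below is a standalone restatement of the same weights, not the pipeline code: the helper names `game_score_total` and `game_score_per36` and the stat line are hypothetical, and returning NaN for a zero-minute season is an assumption about what `_safe_div` does.

```python
import math


def game_score_total(pts, fgm, fga, ftm, fta, oreb, dreb, stl, ast, blk, pf, tov):
    """Season Game Score total, using the same weights as the season table code."""
    return (
        pts
        + 0.4 * fgm
        - 0.7 * fga
        - 0.4 * (fta - ftm)   # missed free throws
        + 0.7 * oreb
        + 0.3 * dreb
        + stl
        + 0.7 * ast
        + 0.7 * blk
        - 0.4 * pf
        - tov
    )


def game_score_per36(total, minutes):
    """Scale a season total to a per-36-minute rate (NaN when no minutes were played)."""
    if minutes <= 0:
        return math.nan  # assumption: mirrors a "safe" division
    return total * 36.0 / minutes


# Hypothetical season line, for illustration only
total = game_score_total(
    pts=2000, fgm=700, fga=1400, ftm=400, fta=500,
    oreb=100, dreb=500, stl=100, ast=400, blk=50, pf=150, tov=200,
)
per36 = game_score_per36(total, minutes=2800)
print(round(total, 1), round(per36, 2))
```

Because the rate divides by total minutes, two players with the same totals but different playing time get different scores, which is why the 500-minute floor matters for this metric.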

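The "middle" bucket in `rank_seasons_by_metric` is defined as the seasons whose metric value is closest to the median. A minimal plain-Python sketch of that idea follows, assuming a simplified tie-break (higher metric first, then player name) rather than the full minutes and `ts_pct` ordering used in the pipeline; the function name `closest_to_median` and the toy rows are hypothetical.

```python
from statistics import median


def closest_to_median(rows, key, n):
    """Return the n rows whose `key` value lies closest to the median of `key`."""
    med = median(r[key] for r in rows)
    ranked = sorted(
        rows,
        # Smallest distance first; simplified tie-break: higher metric, then name
        key=lambda r: (abs(r[key] - med), -r[key], r["player_name"]),
    )
    return ranked[:n]


# Toy data, for illustration only
toy_rows = [
    {"player_name": "A", "season_pie": 0.05},
    {"player_name": "B", "season_pie": 0.09},
    {"player_name": "C", "season_pie": 0.10},
    {"player_name": "D", "season_pie": 0.12},
    {"player_name": "E", "season_pie": 0.20},
]

middle = closest_to_median(toy_rows, "season_pie", 3)
middle_names = [r["player_name"] for r in middle]
print(middle_names)
```

Using distance to the median rather than the mean keeps the "most average" list from being pulled toward the handful of extreme seasons at either end of the distribution.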
In [5]:
%%writefile src/heat_data_scientist_2025/data/load_data_utils.py
"""
load_data_utils
"""
import pandas as pd
import time

def load_data_optimized(
    DATA_PATH: str,
    debug: bool = False,
    drop_null_rows: bool = False,
    drop_null_how: str = 'any',  # 'any' or 'all'
    drop_null_subset: list | None = None,  # list of column names or None for all columns
    use_sample: bool = False,
    sample_size: int = 1000,
):
    """Load a parquet dataset with optional null filtering and debug diagnostics.

    Parameters:
    - DATA_PATH: Path to the parquet file.
    - debug: If True, prints detailed dataset diagnostics.
    - drop_null_rows: If True, drops rows based on null criteria.
    - drop_null_how: 'any' to drop rows with any nulls, 'all' to drop rows with all nulls.
    - drop_null_subset: List of columns to consider when dropping nulls; defaults to all.
    - use_sample: If True, keeps only the first `sample_size` rows of the file.
    - sample_size: Number of rows to keep when `use_sample` is True.

    Returns:
    - df: Loaded (and optionally filtered) DataFrame.
    """
    print("Loading data for enhanced comprehensive EDA...")
    start_time = time.time()

    # 1. Load data (the path is required in both modes)
    if DATA_PATH is None:
        raise ValueError("DATA_PATH must be provided.")
    df = pd.read_parquet(DATA_PATH)
    if use_sample:
        # The sample is simply the first `sample_size` rows of the real parquet file
        print(f"⚡ Using the first {sample_size:,} rows of the parquet file as a sample.")
        df = df.head(sample_size)


    # 2. Drop null rows if requested
    if drop_null_rows:
        before = len(df)
        # Determine which subset to use for dropna
        subset_desc = "all columns" if drop_null_subset is None else f"subset={drop_null_subset}"
        print(f"→ Applying null dropping: how='{drop_null_how}', {subset_desc}")
        if drop_null_subset is None:
            df = df.dropna(how=drop_null_how)
        else:
            # Defensive: match column names exactly (the dataset uses mixed case, e.g. 'personId')
            # and warn about any requested columns that do not exist
            missing_cols = [c for c in drop_null_subset if c not in df.columns]
            if missing_cols:
                print(f"⚠️ Warning: drop_null_subset columns not found in dataframe and will be ignored: {missing_cols}")
            valid_subset = [c for c in drop_null_subset if c in df.columns]
            # If no requested column exists, dropna falls back to checking all columns
            df = df.dropna(how=drop_null_how, subset=valid_subset if valid_subset else None)
        dropped = before - len(df)
        print(f"✓ Dropped {dropped:,} rows by null criteria (how='{drop_null_how}', subset={drop_null_subset}); remaining {len(df):,} rows")

    # 3. Debug diagnostics
    if debug:
        print("========== Dataset Debug Details ============")
        print(f"Total rows       : {df.shape[0]:,}")
        print(f"Total columns    : {df.shape[1]:,}")
        print(f"Columns          : {df.columns.tolist()}")

        total = len(df)
        null_counts = df.isnull().sum()
        non_null_counts = total - null_counts
        null_percent = (null_counts / total) * 100
        dtype_info = df.dtypes

        null_summary = pd.DataFrame({
            'dtype'          : dtype_info,
            'null_count'     : null_counts,
            'non_null_count' : non_null_counts,
            'null_percent'   : null_percent
        }).sort_values(by='null_percent', ascending=False)

        print("---- Nulls Summary (per column) ----")
        # Show every row for this printout only, without changing the global pandas option
        with pd.option_context('display.max_rows', None):
            print(null_summary)

    load_time = time.time() - start_time
    print(f"✓ Dataset loaded: {df.shape[0]:,} rows × {df.shape[1]:,} columns in {load_time:.2f}s")
    return df



if __name__ == "__main__":
    from src.heat_data_scientist_2025.utils.config import CFG
    df = load_data_optimized(
        CFG.ml_dataset_path,
        debug=True,
        # use_sample=True,
        # drop_null_rows=True,
        # drop_null_subset=['AAV']
    )
    print(df.columns.tolist())
    print(df.head())
    print(df.shape)
    # unique values for season
    print(df['season'].unique())
    

    # df = load_data_optimized(
    #     FINAL_DATA_PATH,
    #     debug=True,
    #     use_sample=True,
    # )
    # print(df.columns.tolist())

Overwriting src/heat_data_scientist_2025/data/load_data_utils.py
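
The null-dropping step of `load_data_optimized` can be tried without a parquet file. The following is a minimal sketch under stated assumptions: `drop_nulls_with_subset` is a hypothetical helper (not part of the module above), the toy frame is invented for illustration, and column names are matched exactly. Unlike the loader, the sketch leaves the frame untouched when none of the requested columns exist.

```python
import pandas as pd


def drop_nulls_with_subset(df, how="any", subset=None):
    """Hypothetical helper: drop null rows, ignoring subset columns that do not exist."""
    if subset is None:
        # No subset given: consider every column
        return df.dropna(how=how)
    valid = [c for c in subset if c in df.columns]
    missing = [c for c in subset if c not in df.columns]
    if missing:
        print(f"Ignoring unknown columns: {missing}")
    if not valid:
        # Assumption of this sketch: nothing to check, so nothing is dropped
        return df.copy()
    return df.dropna(how=how, subset=valid)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "season_pie": [0.12, None, 0.09],
    "total_minutes": [1800.0, 900.0, None],
})

only_pie = drop_nulls_with_subset(toy, subset=["season_pie", "not_a_column"])
any_null = drop_nulls_with_subset(toy)
print(len(only_pie), len(any_null))
```

Restricting the check to `season_pie` keeps the row whose only null is in `total_minutes`, while the unrestricted call drops it as well.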


In [6]:
%%writefile src/heat_data_scientist_2025/data/feature_engineering.py
"""
Feature engineering for NBA player-season data.

Current columns:
Index(['personId', 'player_name', 'season', 'games_played', 'total_minutes',
       'total_points', 'total_assists', 'total_rebounds', 'total_reb_off',
       'total_reb_def', 'total_blocks', 'total_steals', 'total_pf',
       'total_tov', 'total_fga', 'total_fgm', 'total_fta', 'total_ftm',
       'total_3pa', 'total_3pm', 'total_plus_minus', 'avg_plus_minus', 'wins',
       'home_games', 'season_pie_num', 'season_pie_den', 'fg_pct', 'fg3_pct',
       'ft_pct', 'ts_pct', 'season_pie', 'pts_per36', 'ast_per36', 'reb_per36',
       'usage_per_min', 'efficiency_per_game', 'win_pct', 'home_games_pct',
       'teamCity', 'teamName', 'team_win_pct_final', "season_PER", "season_VORP", "season_EWA"],
      dtype='object')
"""


from __future__ import annotations
from typing import List, Tuple, Optional, Dict
import numpy as np
import pandas as pd


# quick check: makes sure we have the columns we need
def require_columns(df: pd.DataFrame, cols: List[str], context: str) -> None:
    """Check if required columns exist, throw error if missing."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns for {context}: {missing}")


# helper: finds first column that actually exists in the data  
def _first_present(df: pd.DataFrame, candidates: List[str]) -> Optional[str]:
    """Return first column name that exists in df, else None."""
    for c in candidates:
        if c in df.columns:
            return c
    return None





# parse season string like '2023-24' into year 2023
def add_season_start_year(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Extract numeric start year from season string like 'YYYY-YY'."""
    out = df.copy()
    if "season" not in out.columns:
        raise ValueError("Need 'season' column")
    
    out["season_start_year"] = (
        out["season"].astype(str).str.extract(r"(\d{4})")[0].astype(int)
    )
    return out, ["season_start_year"]


# experience stuff - years in league and rough groupings
def add_experience_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Add cumulative games played and, if draft year is available, years of experience."""
    out = df.copy()
    created: List[str] = []

    # cumulative games played (simple running total)
    if "personId" in out.columns and "games_played" in out.columns:
        out = out.sort_values(["personId", "season_start_year"])
        out["games_played_total"] = out.groupby("personId")["games_played"].cumsum()
        created.append("games_played_total")

    # years since draft (if we have draft year)
    if "draftYear" in out.columns and "season_start_year" in out.columns:
        out["years_experience"] = (out["season_start_year"] - out["draftYear"]).clip(lower=0)
        
        # rough experience buckets
        def exp_bucket(exp):
            if pd.isna(exp): return "Unknown"
            if exp <= 2: return "Rookie/Sophomore"  
            if exp <= 5: return "Young Player"
            if exp <= 9: return "Prime Years"
            if exp <= 15: return "Veteran"
            return "Elder Statesman"
        
        out["experience_bucket"] = out["years_experience"].apply(exp_bucket)
        created.extend(["years_experience", "experience_bucket"])
        
    return out, created


# advanced box score metrics - efficiency stuff mostly
def add_advanced_metrics(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Create advanced metrics from basic box score stats."""
    out = df.copy()
    created: List[str] = []

    # find the columns we need (handles different naming)
    fga = _first_present(out, ["total_fga"])
    fta = _first_present(out, ["total_fta"])
    fgm = _first_present(out, ["total_fgm"])
    tpa = _first_present(out, ["total_3pa"])
    tpm = _first_present(out, ["total_3pm"])
    ast = _first_present(out, ["total_assists"])
    blk = _first_present(out, ["total_blocks"])
    stl = _first_present(out, ["total_steals"])
    pts = _first_present(out, ["total_points"])
    mins = _first_present(out, ["total_minutes"])
    reb = _first_present(out, ["total_rebounds"])
    dreb = _first_present(out, ["total_reb_def"])
    oreb = _first_present(out, ["total_reb_off"])
    tov = _first_present(out, ["total_tov", "total_turnovers"])

    # true shooting attempts estimate
    if fga and fta:
        out["ts_attempts"] = out[fga] + 0.44 * out[fta]
        created.append("ts_attempts")
        
    # shooting rates and efficiency
    if fga and tpa:
        out["three_point_rate"] = np.where(out[fga] > 0, out[tpa] / out[fga], 0.0)
        created.append("three_point_rate")
    if fga and fta:
        out["ft_rate"] = np.where(out[fga] > 0, out[fta] / out[fga], 0.0)
        created.append("ft_rate")
    if fga and fgm and tpm:
        out["efg_pct"] = np.where(out[fga] > 0, (out[fgm] + 0.5 * out[tpm]) / out[fga], 0.0)
        created.append("efg_pct")
    if fga and pts:
        out["pts_per_shot"] = np.where(out[fga] > 0, out[pts] / out[fga], 0.0)
        created.append("pts_per_shot")

    # defensive and overall production per 36
    if mins and blk and stl:
        out["defensive_per36"] = np.where(out[mins] > 0, (out[blk] + out[stl]) * 36 / out[mins], 0.0)
        out["stocks_per36"] = out["defensive_per36"].copy()  # same thing
        created.extend(["defensive_per36", "stocks_per36"])
    if mins and pts and ast and reb:
        out["production_per36"] = np.where(out[mins] > 0, (out[pts] + out[ast] + out[reb]) * 36 / out[mins], 0.0)
        created.append("production_per36")
    if mins and tov:
        out["tov_per36"] = np.where(out[mins] > 0, out[tov] * 36 / out[mins], 0.0)
        created.append("tov_per36")

    # rebounding shares
    if reb and dreb and oreb:
        total_reb_safe = out[reb].replace(0, np.nan)
        out["dreb_share"] = (out[dreb] / total_reb_safe).fillna(0.0)
        out["oreb_share"] = (out[oreb] / total_reb_safe).fillna(0.0)
        created.extend(["dreb_share", "oreb_share"])

    # usage events (shots, fts, turnovers)
    if fga and fta and tov and mins:
        out["usage_events_total"] = out[fga] + 0.44 * out[fta] + out[tov]
        out["usage_events_per_min"] = np.where(out[mins] > 0, out["usage_events_total"] / out[mins], 0.0)
        created.extend(["usage_events_total", "usage_events_per_min"])

    # assist to turnover ratio
    if ast and tov:
        out["ast_to_tov"] = np.where(out[tov] > 0, out[ast] / out[tov], out[ast])
        created.append("ast_to_tov")

    return out, created


# usage and shot creation features
def add_usage_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Usage and shot creation metrics."""
    out = df.copy()
    created: List[str] = []
    
    # total usage from per-minute usage  
    if "usage_per_min" in out.columns:
        min_col = _first_present(out, ["total_minutes"])
        if min_col:
            out["total_usage"] = out["usage_per_min"] * out[min_col]
            created.append("total_usage")

    # shot creation (shots + assists)
    fga_col = _first_present(out, ["total_fga"])
    fta_col = _first_present(out, ["total_fta"])
    ast_col = _first_present(out, ["total_assists"])
    min_col = _first_present(out, ["total_minutes"])
    
    if fga_col and fta_col and ast_col:
        out["shot_creation"] = out[fga_col] + out[fta_col] + out[ast_col]
        if min_col:
            out["shot_creation_per36"] = np.where(out[min_col] > 0, out["shot_creation"] * 36 / out[min_col], 0.0)
        else:
            out["shot_creation_per36"] = 0.0
        created.extend(["shot_creation", "shot_creation_per36"])
        
    return out, created





# figures out which numeric columns to lag automatically
def _build_lag_stat_list_auto(df: pd.DataFrame, season_col: str = "season_start_year") -> List[str]:
    """Auto-select numeric columns to lag, excluding obvious problem ones."""
    numeric = df.select_dtypes(include=[np.number]).columns.tolist()
    
    # stuff we definitely don't want to lag
    exclude_exact = {
        "personId", season_col, "games_played_total", "forecast_season", 
        "source_season", "season_pie_num", "season_pie_den"
    }
    
    # filter out excluded columns and existing lag columns
    base = [c for c in numeric if c not in exclude_exact 
            and not c.endswith("_lag1")]
    
    # skip columns with id/index type names
    bad_substrings = ("_id", "_idx", "_code")
    base = [c for c in base if not any(s in c.lower() for s in bad_substrings)]
    
    return base


# creates lagged features by player
def add_lag_features(df: pd.DataFrame, stats: Optional[List[str]] = None, 
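                    lags: List[int] = [1], season_col: str = "season_start_year") -> Tuple[pd.DataFrame, List[str]]:
    """Add lag features per player.

    Each lag is the player's previous row after sorting by season, so a missed
    season is skipped over rather than left null. Nulls come from a player's
    first rows, or from nulls already present in the lagged column.
    """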
    out = df.copy()
    created: List[str] = []
    require_columns(out, ["personId", season_col], "add_lag_features")
    
    # auto-detect stats to lag if not provided
    if stats is None:
        stats = _build_lag_stat_list_auto(out, season_col=season_col)
    else:
        # filter to numeric columns only
        num_cols = set(out.select_dtypes(include=[np.number]).columns)
        stats = [s for s in stats if s in num_cols]
    
    out = out.sort_values(["personId", season_col])
    gp = out.groupby("personId", group_keys=False)
    
    # create lag columns
    for col in stats:
        for k in lags:
            name = f"{col}_lag{k}"
            out[name] = gp[col].shift(k)
            created.append(name)
    
    # add helper column to identify first seasons (used for lag validation)
    out["has_prior_season"] = gp.cumcount() > 0
    created.append("has_prior_season")
    
    return out, created


# minutes and availability features  
def add_minutes_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Playing time and availability metrics."""
    out = df.copy()
    created: List[str] = []
    
    if "games_played" in out.columns and "total_minutes" in out.columns:
        out["minutes_per_game"] = np.where(out["games_played"] > 0, out["total_minutes"] / out["games_played"], 0.0)
        out["games_pct"] = out["games_played"] / 82.0
        created.extend(["minutes_per_game", "games_pct"])
        
        # playing time tiers
        out["minutes_tier"] = pd.cut(out["minutes_per_game"], bins=[0, 15, 25, 35, 48], 
                                   labels=["Bench", "Role Player", "Starter", "Star"], include_lowest=True)
        created.append("minutes_tier")
        
        # total minutes tiers (handles duplicate edges)
        try:
            out["total_minutes_tier"] = pd.qcut(out["total_minutes"], q=5, 
                                              labels=["Very Low", "Low", "Medium", "High", "Very High"])
        except ValueError:
            # Duplicate bin edges: rank with method="first" so every rank is unique before binning
            ranks = out["total_minutes"].rank(method="first")
            out["total_minutes_tier"] = pd.qcut(ranks, q=5,
                                              labels=["Very Low", "Low", "Medium", "High", "Very High"])
        created.append("total_minutes_tier")
    
    return out, created


# shooting performance relative to the league median by season
def add_performance_consistency(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Shooting performance vs league medians and composite score."""
    out = df.copy()
    created: List[str] = []
    
    need = ["season_start_year", "fg_pct", "fg3_pct", "ft_pct"]
    missing = [c for c in need if c not in out.columns]
    if missing:
        return out, created
        
    # season-level medians for comparison
    grp = out.groupby("season_start_year", group_keys=False)
    league_medians = ["fg_league_med", "fg3_league_med", "ft_league_med"]
    out["fg_league_med"] = grp["fg_pct"].transform("median")
    out["fg3_league_med"] = grp["fg3_pct"].transform("median") 
    out["ft_league_med"] = grp["ft_pct"].transform("median")
    
    # differences from league median
    out["fg_vs_league"] = out["fg_pct"] - out["fg_league_med"]
    out["fg3_vs_league"] = out["fg3_pct"] - out["fg3_league_med"]
    out["ft_vs_league"] = out["ft_pct"] - out["ft_league_med"]
    created.extend(["fg_vs_league", "fg3_vs_league", "ft_vs_league"])
    
    # composite shooting score
    out["shooting_score"] = out["fg_pct"] * 0.4 + out["fg3_pct"] * 0.3 + out["ft_pct"] * 0.3
    created.append("shooting_score")
    
    # clean up temp columns
    out.drop(columns=league_medians, inplace=True)
    return out, created


# composite features combining multiple stats
def create_composite_features(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Create composite impact metrics."""
    out = df.copy()
    created: List[str] = []
    
    # make sure we have the base columns (fill with nan if missing);
    # ts_pct is included because offensive_impact reads it below
    for need in ["pts_per36", "ast_per36", "reb_per36", "defensive_per36", "ts_pct"]:
        if need not in out.columns:
            out[need] = np.nan
    
    # offensive impact score
    out["offensive_impact"] = out["pts_per36"] * 0.4 + out["ast_per36"] * 0.3 + out["ts_pct"] * 100 * 0.3
    created.append("offensive_impact")
    
    # two-way impact (offense + defense)
    out["two_way_impact"] = out["offensive_impact"] + out["defensive_per36"] * 10
    created.append("two_way_impact")
    
    # efficiency x volume
    if "efficiency_per_game" in out.columns and "total_usage" in out.columns:
        out["efficiency_volume_score"] = out["efficiency_per_game"] * out["total_usage"]  
        created.append("efficiency_volume_score")
    
    # versatility score (above median in multiple areas)
    scoring_contrib = (out["pts_per36"] > out["pts_per36"].median()).astype(int)
    assist_contrib = (out["ast_per36"] > out["ast_per36"].median()).astype(int)
    rebound_contrib = (out["reb_per36"] > out["reb_per36"].median()).astype(int) 
    defense_contrib = (out["defensive_per36"] > out["defensive_per36"].median()).astype(int)
    out["versatility_score"] = scoring_contrib + assist_contrib + rebound_contrib + defense_contrib
    created.append("versatility_score")
    
    return out, created


# helper for season-normalized z-scores
def _zscore_by_season(df: pd.DataFrame, col: str, season_col: str) -> pd.Series:
    """Z-score within each season, clipped to avoid extreme outliers."""
    if col not in df.columns:
        return pd.Series(np.nan, index=df.index)
    g = df.groupby(season_col)[col]
    z = (df[col] - g.transform("mean")) / (g.transform("std").replace(0, np.nan))
    return z.clip(-3, 3).fillna(0.0)


# portability index - how well skills transfer between situations  
def build_portability_index(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    """Portability index based on transferable skills."""
    out = df.copy()
    created: List[str] = []
    season_col = "season_start_year"
    require_columns(out, [season_col], "build_portability_index")
    
    # season-normalized z-scores for key components
    z = {}
    for col in ["ts_pct", "efg_pct", "pts_per_shot", "fg3_pct", "three_point_rate", "ft_pct",
                "stocks_per36", "dreb_share", "oreb_share", "ast_per36", "ast_to_tov", "usage_per_min"]:
        if col in out.columns:
            z[col] = _zscore_by_season(out, col, season_col)
        else:
            z[col] = pd.Series(0.0, index=out.index)
    
    # component scores
    score_eff = (z["ts_pct"] + z["efg_pct"] + z["pts_per_shot"]) / 3.0
    shoot_abil = (z["fg3_pct"] + z["three_point_rate"] + z["ft_pct"]) / 3.0  
    def_abil = z["stocks_per36"]
    
    # rebounding versatility (good at both, penalty for imbalance)
    reb_mean = (z["dreb_share"] + z["oreb_share"]) / 2.0
    reb_gap = (z["dreb_share"] - z["oreb_share"]).abs()
    def_vers = reb_mean - 0.25 * reb_gap
    
    pass_abil = (z["ast_per36"] + z["ast_to_tov"]) / 2.0
    
    # usage with diminishing returns (too much usage can hurt portability)
    usage_term = z["usage_per_min"] - 0.15 * (z["usage_per_min"] ** 2)
    
    # weighted combination (weights sum to 1.0)
    out["portability_index"] = (0.16 * score_eff + 0.40 * shoot_abil + 0.08 * def_abil + 
                               0.05 * def_vers + 0.25 * pass_abil + 0.06 * usage_term)
    created.append("portability_index")
    
    # save component scores too
    out["pi_scoring_eff"] = score_eff
    out["pi_shooting"] = shoot_abil  
    out["pi_defense"] = def_abil
    out["pi_versatility"] = def_vers
    out["pi_passing"] = pass_abil
    out["pi_usage_term"] = usage_term
    created.extend(["pi_scoring_eff", "pi_shooting", "pi_defense", "pi_versatility", "pi_passing", "pi_usage_term"])
    
    return out, created


# main feature engineering function
def engineer_features(df: pd.DataFrame, drop_null_lag_rows: bool = True, verbose: bool = False) -> pd.DataFrame:
    """
    Build all features and optionally drop first-season rows with null lags.
    
    Args:
        df: Input dataframe with player-season data
        drop_null_lag_rows: If True, drop rows where lag features are null (default True)
        verbose: Print progress info (default False)
    
    Returns:
        Processed dataframe with all engineered features
    """
    if verbose:
        print("Starting feature engineering...")
        
    original_shape = df.shape
    out = df.copy()
    
    # check we have required base columns
    required_base = ["personId", "season", "games_played", "total_minutes", "season_pie"]
    missing_base = [c for c in required_base if c not in out.columns]
    if missing_base:
        raise ValueError(f"Missing required columns: {missing_base}")
    
    # 1. parse season to get numeric year
    if verbose:
        print("Parsing seasons...")
    out, _ = add_season_start_year(out)
    
    # 2. sort by player and season for all subsequent operations
    out = out.sort_values(["personId", "season_start_year"])
    
    # 3. build features step by step
    if verbose:
        print("Adding experience features...")
    out, _ = add_experience_features(out)
    
    if verbose:
        print("Adding advanced metrics...")
    out, _ = add_advanced_metrics(out)
    
    if verbose:
        print("Adding usage features...")
    out, _ = add_usage_features(out)
    
    if verbose:
        print("Adding minutes features...")  
    out, _ = add_minutes_features(out)
    
    if verbose:
        print("Adding performance consistency...")
    out, _ = add_performance_consistency(out)
    
    if verbose:
        print("Creating composite features...")
    out, _ = create_composite_features(out)
    
    if verbose:
        print("Building portability index...")
    out, _ = build_portability_index(out)
    
    # 4. add lag features (creates nulls in first seasons)
    if verbose:
        print("Creating lag features...")
    out, _ = add_lag_features(out, lags=[1])
    
    # 5. clean up any infinite values
    numeric_cols = out.select_dtypes(include=[np.number]).columns
    inf_cols = []
    for col in numeric_cols:
        if np.isinf(out[col]).any():
            inf_cols.append(col)
            out[col] = out[col].replace([np.inf, -np.inf], np.nan)
    
    if verbose and inf_cols:
        print(f"Cleaned infinite values in {len(inf_cols)} columns")
    
    # 6. handle lag nulls using has_prior_season instead of separate mask
    lag_cols = [c for c in out.columns if c.endswith("_lag1")]
    if lag_cols:
        lag_nulls_mask = out[lag_cols].isnull().any(axis=1)
        
        # use has_prior_season to identify first seasons (created in add_lag_features)
        if "has_prior_season" in out.columns:
            first_season_mask = ~out["has_prior_season"]
        else:
            # defensive fallback: without the flag, treat no row as a first season,
            # so every lag null is reported by the check below
            first_season_mask = pd.Series(False, index=out.index)
        
        # validate expectation: lag nulls should be subset of first seasons
        # this is slightly safer against occasional upstream data quirks
        unexpected_nulls = lag_nulls_mask & (~first_season_mask)
        if unexpected_nulls.any():
            print("⚠️ Warning: Some lag nulls are not from first seasons - check data quality")
        
        if drop_null_lag_rows:
            rows_before = len(out)
            out = out[~lag_nulls_mask].copy()
            rows_dropped = rows_before - len(out)
            if verbose:
                print(f"Dropped {rows_dropped} rows with null lag values (normally first seasons)")
    
    if verbose:
        print(f"Feature engineering complete: {original_shape[0]} → {len(out)} rows, {original_shape[1]} → {len(out.columns)} columns")
    
    return out


if __name__ == "__main__":
    from src.heat_data_scientist_2025.data.load_data_utils import load_data_optimized
    from src.heat_data_scientist_2025.utils.config import (CFG, numerical_features, nominal_categoricals, ordinal_categoricals, y_variables)

    df = load_data_optimized(
        CFG.ml_dataset_path,
        debug=True,
        drop_null_rows=True,
    )

    
    # run feature engineering
    try:
        df_eng = engineer_features(df, drop_null_lag_rows=True, verbose=True)
        print(f"✓ Success! Result shape: {df_eng.shape}")
        print("lag columns created:", [c for c in df_eng.columns if c.endswith('_lag1')])
        print(f"First season rows dropped: {len(df) - len(df_eng)}")
    except Exception as e:
        print(f"✗ Error: {e}")
        raise  # re-raise so the checks below never run with df_eng undefined
        
    #print columns
    print(df_eng.columns)

    # check that every configured feature and target column exists in df_eng
    for feature in numerical_features:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"
    for feature in nominal_categoricals:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"
    for feature in ordinal_categoricals:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"
    for feature in y_variables:
        assert feature in df_eng.columns, f"{feature} is not in the dataset"


Overwriting src/heat_data_scientist_2025/data/feature_engineering.py
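
The lag step in `add_lag_features` shifts each stat by one row within a player, which is worth seeing on a toy frame. The sketch below uses invented data and mirrors the `groupby("personId")` plus `shift(1)` pattern from the module. The `year_gap` and `lag_is_consecutive` columns are assumptions added here for illustration; the module itself only creates the lag columns and `has_prior_season`.

```python
import pandas as pd

# Toy player-season rows, invented for illustration; player 1 has no 2022 season
toy = pd.DataFrame({
    "personId": [1, 1, 1, 2, 2],
    "season_start_year": [2020, 2021, 2023, 2021, 2022],
    "season_pie": [0.10, 0.12, 0.15, 0.08, 0.09],
})

toy = toy.sort_values(["personId", "season_start_year"])
gp = toy.groupby("personId", group_keys=False)

# Previous row's value for the same player (null in each player's first row)
toy["season_pie_lag1"] = gp["season_pie"].shift(1)
toy["has_prior_season"] = gp.cumcount() > 0

# Illustrative extras: years since the player's previous row,
# and whether the lag really comes from the season immediately before
toy["year_gap"] = gp["season_start_year"].diff()
toy["lag_is_consecutive"] = toy["year_gap"] == 1

print(toy)
```

For player 1, the 2023 row receives its lag from 2021, so `has_prior_season` is true while `lag_is_consecutive` is false. A check of this kind is one way to decide whether such rows should be kept for forecasting.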


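The portability index relies on season-normalized z-scores and a usage term with diminishing returns. The sketch below reproduces both calculations on invented numbers, following the formulas in `_zscore_by_season` and `build_portability_index`; the toy frame and the `z_grid` values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy shooting data, invented for illustration; the 2023 group has zero variance
toy = pd.DataFrame({
    "season_start_year": [2022, 2022, 2022, 2023, 2023],
    "ts_pct": [0.50, 0.55, 0.60, 0.58, 0.58],
})

g = toy.groupby("season_start_year")["ts_pct"]
# A zero standard deviation becomes NaN so the division cannot produce inf
std = g.transform("std").replace(0, np.nan)
z = ((toy["ts_pct"] - g.transform("mean")) / std).clip(-3, 3).fillna(0.0)
print(z.round(3).tolist())

# Usage term from the index: z - 0.15 * z^2, evaluated across the clipped range
z_grid = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
usage_term = z_grid - 0.15 * z_grid ** 2
print(usage_term.round(3).tolist())
```

Because the quadratic peaks at z = 1 / 0.3 ≈ 3.33, which lies outside the clip range of ±3, the usage term flattens for high-usage players but never actually decreases.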
In [7]:
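%%writefile src/heat_data_scientist_2025/ml/ml_pipeline.py
"""
ML pipeline: categorical encoding, target-leakage checks, lag audits, and
feature validation for player-season models.
"""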

from __future__ import annotations

# Standard library imports
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
import json

# Third-party imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import warnings

warnings.filterwarnings("ignore")

# Local imports
from src.heat_data_scientist_2025.utils.config import CFG, ML_CONFIG
from src.heat_data_scientist_2025.data.feature_engineering import engineer_features

# Ordinal feature hierarchies
ORDINAL_ORDERS = {
    "minutes_tier": ["Bench", "Role Player", "Starter", "Star"],
}


# Convert fitted LabelEncoder to stable dictionaries
def _freeze_label_encoder(le: LabelEncoder) -> dict:
    classes = list(le.classes_)
    mapping = {cls: idx for idx, cls in enumerate(classes)}
    reverse = {idx: cls for idx, cls in enumerate(classes)}
    return {"forward": mapping, "reverse": reverse}


# Validate data contains only numeric columns for model training
def ensure_numeric_matrix(df_like: pd.DataFrame, context: str = "X") -> None:
    obj_cols = [c for c in df_like.columns if df_like[c].dtype == "object"]
    if obj_cols:
        examples = {c: df_like[c].dropna().astype(str).unique()[:5].tolist() for c in obj_cols}
        msg = [
            f"[ensure_numeric_matrix] {context} contains object/string columns:",
            f"  Columns: {obj_cols}",
            f"  Sample values: {examples}",
            "  -> Encode these before model training.",
        ]
        raise ValueError("\n".join(msg))


# Check for potential target leakage in feature names
def validate_target_feature_separation(target: str, feature_names: List[str], verbose: bool = True) -> bool:
    bad = [f for f in feature_names if f == target or (f.startswith(target + "_") and not f.endswith("_lag1"))]
    ok = len(bad) == 0
    if verbose and not ok:
        print(f"[leakage-check] Potential leakage for target '{target}': {bad}")
    return ok


# Build feature list excluding contemporaneous target
def create_target_specific_features(
    target_name: str,
    base_numerical_features: List[str],
    nominal_categoricals: List[str],
    ordinal_categoricals: List[str],
) -> List[str]:
    target_name = str(target_name).strip()
    exclusions = {target_name}
    safe_numerical = [f for f in base_numerical_features if f not in exclusions]
    return safe_numerical + nominal_categoricals + ordinal_categoricals


# Container for categorical encoders
@dataclass
class EncodersBundle:
    nominal_maps: dict
    ordinal_maps: dict
    raw_label_encoders: dict


# Encode categorical features with stable mappings
def encode_categoricals(
    df: pd.DataFrame,
    nominal_categoricals: List[str],
    ordinal_categoricals: List[str],
    strict: bool = True,
    verbose: bool = True,
) -> Tuple[pd.DataFrame, EncodersBundle]:
    out = df.copy()
    raw_label_encoders: dict[str, LabelEncoder] = {}
    nominal_maps: dict[str, dict] = {}
    ordinal_maps: dict[str, dict] = {}

    # Handle nominal categories
    for col in nominal_categoricals:
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        le = LabelEncoder()
        le.fit(ser.to_numpy())
        raw_label_encoders[col] = le
        maps = _freeze_label_encoder(le)
        nominal_maps[col] = maps
        out[col] = ser.map(maps["forward"]).astype("Int32")
        if strict and out[col].isna().any():
            unseen = sorted(ser[out[col].isna()].dropna().unique().tolist())
            raise ValueError(f"Unknown nominal categories in '{col}': {unseen}")
        out[col] = out[col].fillna(-1).astype("int32")

    # Handle ordinal categories
    for col in ordinal_categoricals:
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        order = ORDINAL_ORDERS.get(col)
        if order is None:
            raise ValueError(f"No explicit order for ordinal '{col}'. Add to ORDINAL_ORDERS.")
        if "Unknown" not in order:
            order = order + ["Unknown"]
        forward = {lvl: i for i, lvl in enumerate(order)}
        reverse = {i: lvl for lvl, i in forward.items()}
        ordinal_maps[col] = {"forward": forward, "reverse": reverse}

        out[col] = ser.map(forward).astype("Int16")
        if strict and out[col].isna().any():
            unseen = sorted(ser[out[col].isna()].dropna().unique().tolist())
            raise ValueError(f"Unknown ordinal categories in '{col}': {unseen}. Allowed={order}")
        out[col] = out[col].fillna(forward["Unknown"]).astype("int16")

    if verbose:
        print(f"Encoded: {len(nominal_maps)} nominal, {len(ordinal_maps)} ordinal columns")

    return out, EncodersBundle(nominal_maps, ordinal_maps, raw_label_encoders)


# Apply saved encoders to new data
def apply_encoders_to_frame(
    df: pd.DataFrame, encoders: EncodersBundle, strict: bool = True, verbose: bool = True
) -> pd.DataFrame:
    out = df.copy()

    for col, maps in encoders.nominal_maps.items():
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        out[col] = ser.map(maps["forward"]).astype("Int32")
        if strict and out[col].isna().any():
            unseen = sorted(ser[out[col].isna()].dropna().unique().tolist())
            raise ValueError(f"Unknown nominal categories in '{col}': {unseen}")
        out[col] = out[col].fillna(-1).astype("int32")

    for col, maps in encoders.ordinal_maps.items():
        if col not in out.columns:
            continue
        ser = out[col].astype("string").fillna("Unknown")
        out[col] = ser.map(maps["forward"]).astype("Int16")
        if strict and out[col].isna().any():
            unseen = sorted(ser[out[col].isna()].dropna().unique().tolist())
            raise ValueError(f"Unknown ordinal categories in '{col}': {unseen}")
        out[col] = out[col].fillna(maps["forward"]["Unknown"]).astype("int16")

    return out


# Validate lag feature integrity across seasons
def audit_lag_feature_integrity(
    df: pd.DataFrame,
    person_col: str = "personId",
    season_col: str = "season_start_year",
    lag_pairs: List[Tuple[str, str]] = [("season_pie_lag1", "season_pie")],
    verbose: bool = True,
) -> None:
    if not {person_col, season_col}.issubset(df.columns):
        if verbose:
            print("Skipping lag audit: missing id/season columns")
        return

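    # Sort once and reuse the sorted frame so every Series below shares one index order;
    # comparing df[...] against a shifted Series from a re-sorted frame raises
    # "Can only compare identically-labeled Series objects" when df is not pre-sorted.
    d = df.sort_values([person_col, season_col])
    g = d.groupby(person_col, group_keys=False)
    prev_year = g[season_col].shift(1)
    year_gap = d[season_col] - prev_year

    has_prev = g.cumcount() > 0
    valid_prev = has_prev & (year_gap == 1)
    if verbose:
        total = int(has_prev.sum())
        good = int(valid_prev.sum())
        pct = (good / total * 100) if total else 100.0
        print(f"Consecutive season pairs: {good}/{total} ({pct:.1f}%)")

    for lag_col, base_col in lag_pairs:
        if lag_col not in d.columns or base_col not in d.columns:
            continue
        expected = g[base_col].shift(1)
        # NaN != NaN evaluates to True, so exclude rows where both sides are missing
        both_nan = d[lag_col].isna() & expected.isna()
        mism = (d[lag_col] != expected) & ~both_nan & valid_prev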
        mism_ct = int(mism.sum())
        if verbose:
            print(f"{lag_col} vs {base_col}: {mism_ct} mismatches")


# Comprehensive feature availability analysis
def validate_and_evaluate_features(
    df: pd.DataFrame,
    numerical_features: List[str],
    nominal_categoricals: List[str],
    ordinal_categoricals: List[str],
    y_variables: List[str],
    verbose: bool = True,
) -> Dict[str, Any]:
    results = {
        "numerical_features": {},
        "nominal_categoricals": {},
        "ordinal_categoricals": {},
        "y_variables": {},
        "missing_features": [],
        "available_features": [],
        "feature_completeness": {},
    }

    groups = [
        ("numerical_features", numerical_features),
        ("nominal_categoricals", nominal_categoricals),
        ("ordinal_categoricals", ordinal_categoricals),
        ("y_variables", y_variables),
    ]
    
    for feature_type, features in groups:
        available = [f for f in features if f in df.columns]
        missing = [f for f in features if f not in df.columns]

        results[feature_type]["available"] = available
        results[feature_type]["missing"] = missing
        results[feature_type]["availability_pct"] = (len(available) / len(features) * 100) if features else 100.0

        for feature in available:
            non_null_count = int(df[feature].notna().sum())
            total_count = int(len(df))
            completeness_pct = (non_null_count / total_count * 100.0) if total_count else 0.0
            results["feature_completeness"][feature] = {
                "non_null_count": non_null_count,
                "total_count": total_count,
                "completeness_pct": float(completeness_pct),
            }

        if verbose and features:
            pct = (len(available) / len(features) * 100.0)
            print(f"{feature_type.replace('_', ' ').title()}: {len(available)}/{len(features)} ({pct:.1f}%)")

    all_specified = numerical_features + nominal_categoricals + ordinal_categoricals + y_variables
    total_spec = len(all_specified)
    total_avail = len([f for f in all_specified if f in df.columns])
    
    results["overall"] = {
        "total_specified": int(total_spec),
        "total_available": int(total_avail),
        "availability_pct": float((total_avail / total_spec * 100.0) if total_spec else 100.0),
    }
    results["missing_features"] = [f for f in all_specified if f not in df.columns]
    results["available_features"] = [f for f in all_specified if f in df.columns]

    return results


# Calculate permutation importance for numerical features
def calculate_permutation_importance(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    y_test: pd.Series,
    target_name: str,
    numerical_features: List[str],
    n_repeats: int = 10,
    random_state: int = 42,
    verbose: bool = True,
) -> pd.DataFrame:
    available = [f for f in numerical_features if f in X_train.columns]
    if not available:
        return pd.DataFrame(columns=[
            "feature", "importance_mean", "importance_std", "target", "importance_lower", "importance_upper"
        ])

    Xtr = X_train[available].copy()
    Xte = X_test[available].copy()

    if Xtr.isna().any().any() or Xte.isna().any().any() or y_train.isna().any() or y_test.isna().any():
        raise ValueError("Unexpected nulls found in importance calculation data")

    base_model = RandomForestRegressor(n_estimators=100, random_state=random_state, n_jobs=-1)
    base_model.fit(Xtr, y_train)

    perm = permutation_importance(
        base_model, Xte, y_test, n_repeats=n_repeats, random_state=random_state, scoring="r2"
    )

    df_imp = (
        pd.DataFrame(
            {
                "feature": available,
                "importance_mean": perm.importances_mean,
                "importance_std": perm.importances_std,
                "target": target_name,
            }
        )
        .sort_values("importance_mean", ascending=False)
        .reset_index(drop=True)
    )
    df_imp["importance_lower"] = df_imp["importance_mean"] - 1.96 * df_imp["importance_std"]
    df_imp["importance_upper"] = df_imp["importance_mean"] + 1.96 * df_imp["importance_std"]

    if verbose and not df_imp.empty:
        print(f"Top 5 important features for {target_name}:")
        for i, (_, row) in enumerate(df_imp.head(5).iterrows(), start=1):
            print(f" {i:2d}. {row['feature']:<25} {row['importance_mean']:7.4f}")

    return df_imp


# Filter features by importance threshold
def filter_features_by_importance(
    importance_df: pd.DataFrame,
    min_importance: float = 0.001,
    max_features: Optional[int] = None,
    target_name: str = "unknown",
    verbose: bool = True,
) -> List[str]:
    if importance_df.empty:
        return []

    kept = importance_df[importance_df["importance_mean"] > min_importance].copy()
    if max_features and len(kept) > max_features:
        kept = kept.head(max_features)

    feature_names = kept["feature"].tolist()

    if verbose:
        total = len(importance_df)
        k = len(feature_names)
        print(f"Kept {k}/{total} features for {target_name} (threshold={min_importance})")

    return feature_names


# Create game score per 36 minutes feature
def create_game_score_per36_feature(df: pd.DataFrame) -> Tuple[pd.DataFrame, List[str]]:
    out = df.copy()
    
    req = [
        "total_points", "total_fgm", "total_fga", "total_fta", "total_ftm",
        "total_reb_off", "total_reb_def", "total_steals", "total_assists", 
        "total_blocks", "total_pf", "total_tov", "total_minutes",
    ]
    
    missing = [c for c in req if c not in out.columns]
    if missing:
        print(f"Missing game score columns {missing}; using season_pie proxy")
        out["game_score_per36"] = out.get("season_pie", 0.1) * 36.0
        return out, ["game_score_per36"]

    out["game_score_total"] = (
        out["total_points"]
        + 0.4 * out["total_fgm"]
        - 0.7 * out["total_fga"]
        - 0.4 * (out["total_fta"] - out["total_ftm"])
        + 0.7 * out["total_reb_off"]
        + 0.3 * out["total_reb_def"]
        + out["total_steals"]
        + 0.7 * out["total_assists"]
        + 0.7 * out["total_blocks"]
        - 0.4 * out["total_pf"]
        - out["total_tov"]
    )
    out["game_score_per36"] = np.where(
        out["total_minutes"] > 0, out["game_score_total"] * 36.0 / out["total_minutes"], 0.0
    )

    return out, ["game_score_total", "game_score_per36"]


# Build train/test datasets for multiple targets
def create_multi_target_datasets(
    df_engineered: pd.DataFrame,
    numerical_features: List[str],
    nominal_categoricals: List[str],
    ordinal_categoricals: List[str],
    y_variables: List[str],
    strategy: str = "filter_complete",
    test_seasons: Optional[List[int]] = None,
    season_col: str = "season_start_year",
    verbose: bool = True,
) -> Dict[str, Any]:
    if verbose:
        print(f"Creating datasets for {len(y_variables)} targets using {strategy} strategy")

    # Audit lag features
    try:
        audit_lag_feature_integrity(df_engineered, verbose=verbose)
    except Exception as e:
        print(f"Lag audit error: {e}")

    # Encode categoricals once for all targets
    df_processed, enc_bundle = encode_categoricals(
        df_engineered, nominal_categoricals, ordinal_categoricals, verbose=verbose
    )

    if test_seasons is None:
        test_seasons = ML_CONFIG.TEST_YEARS

    train_mask = ~df_processed[season_col].isin(test_seasons)
    test_mask = df_processed[season_col].isin(test_seasons)

    results = {
        "datasets": {},
        "encoders": enc_bundle,
        "label_encoders": enc_bundle.raw_label_encoders,
        "train_seasons": sorted(df_processed[train_mask][season_col].unique()),
        "test_seasons": sorted(df_processed[test_mask][season_col].unique()),
    }

    for target in y_variables:
        if target not in df_processed.columns:
            if verbose:
                print(f"Target '{target}' not found; skipping")
            continue

        target_features = create_target_specific_features(
            target, numerical_features, nominal_categoricals, ordinal_categoricals
        )
        available_features = [f for f in target_features if f in df_processed.columns]

        if not validate_target_feature_separation(target, available_features, verbose):
            continue

        target_mask = df_processed[target].notna()
        if strategy == "filter_complete":
            feat_mask = df_processed[available_features].notna().all(axis=1)
            full_mask = target_mask & feat_mask
        else:
            full_mask = target_mask

        train_data = df_processed[train_mask & full_mask]
        test_data = df_processed[test_mask & full_mask]

        if len(train_data) == 0 or len(test_data) == 0:
            if verbose:
                print(f"Insufficient data for {target} (train={len(train_data)}, test={len(test_data)})")
            continue

        X_train = train_data[available_features].copy()
        y_train = train_data[target].copy()
        X_test = test_data[available_features].copy()
        y_test = test_data[target].copy()

        results["datasets"][target] = {
            "X_train": X_train,
            "y_train": y_train,
            "X_test": X_test,
            "y_test": y_test,
            "feature_names": available_features,
            "train_size": len(X_train),
            "test_size": len(X_test),
            "target_name": target,
        }

        if verbose:
            print(f"{target}: train={len(X_train)}, test={len(X_test)} samples")

    return results


# Train models with importance-based feature selection
def train_multi_target_models(
    datasets: Dict[str, Any],
    numerical_features: List[str],
    importance_threshold: float = 0.001,
    max_features_per_target: Optional[int] = None,
    n_importance_repeats: int = 10,
    verbose: bool = True,
) -> Dict[str, Any]:
    results = {
        "models": {},
        "importance_scores": {},
        "filtered_features": {},
        "evaluation_metrics": {},
    }

    for target_name, data in datasets["datasets"].items():
        if verbose:
            print(f"\nTraining model for {target_name}")

        X_train, y_train = data["X_train"], data["y_train"]
        X_test, y_test = data["X_test"], data["y_test"]
        feature_names = data["feature_names"]

        target_numerical = [f for f in numerical_features if f in feature_names]

        # Rank numerical features by permutation importance.
        # Note: importance is scored on the test split, so the test metrics reported
        # below come from features selected on that same split and may be optimistic.
        importance_df = calculate_permutation_importance(
            X_train, y_train, X_test, y_test, target_name, target_numerical, 
            n_repeats=n_importance_repeats, verbose=verbose
        )
        results["importance_scores"][target_name] = importance_df

        if importance_df.empty:
            # No numerical features to rank: fall back to a model on all available features
            model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
            model.fit(X_train, y_train)
            results["models"][target_name] = model
            results["filtered_features"][target_name] = feature_names
            y_pred = model.predict(X_test)
            results["evaluation_metrics"][target_name] = {
                "r2": float(r2_score(y_test, y_pred)),
                "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
                "mae": float(mean_absolute_error(y_test, y_pred)),
            }
            continue

        # Filter by importance
        important_features = filter_features_by_importance(
            importance_df, importance_threshold, max_features_per_target, target_name, verbose
        )
        categorical_features = [f for f in feature_names if f not in numerical_features 
                              and f in X_train.columns and f != "prediction_season"]
        final_features = important_features + categorical_features

        if not final_features:
            if verbose:
                print(f"No features passed threshold for {target_name}")
            continue

        # Train final model
        X_train_f = X_train[final_features]
        X_test_f = X_test[final_features]

        final_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
        final_model.fit(X_train_f, y_train)

        y_pred = final_model.predict(X_test_f)
        # Cast to plain floats so the metrics serialize cleanly to JSON
        metrics = {
            "r2": float(r2_score(y_test, y_pred)),
            "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
            "mae": float(mean_absolute_error(y_test, y_pred)),
        }

        results["models"][target_name] = final_model
        results["filtered_features"][target_name] = final_features
        results["evaluation_metrics"][target_name] = metrics

        if verbose:
            print(f"Final R²: {metrics['r2']:.3f}, Features: {len(final_features)}")

    return results


# Save importance results and model metrics
def save_feature_importance_results(results: Dict[str, Any], output_dir: Path, verbose: bool = True) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    for target_name, importance_df in results["importance_scores"].items():
        if not importance_df.empty:
            p = out / f"{target_name}_permutation_importance.csv"
            importance_df.to_csv(p, index=False)
            if verbose:
                print(f"Saved {target_name} importance to {p}")

    filtered_summary = {
        t: {"features": feats, "count": len(feats)} for t, feats in results["filtered_features"].items()
    }
    (out / "filtered_features_summary.json").write_text(json.dumps(filtered_summary, indent=2))

    (out / "model_evaluation_metrics.json").write_text(
        json.dumps(results["evaluation_metrics"], indent=2, default=str)
    )
    
    if verbose:
        print(f"Saved summaries to {out}")


# Print final model performance summary
def print_final_results(results: Dict[str, Any], verbose: bool = True) -> None:
    if not verbose:
        return

    print("\nFINAL MODEL RESULTS")
    print("-" * 30)

    for target_name in results["models"].keys():
        print(f"\n{target_name.upper()}")
        
        if target_name in results["evaluation_metrics"]:
            m = results["evaluation_metrics"][target_name]
            print(f"  R²: {m['r2']:.4f}, RMSE: {m['rmse']:.4f}, MAE: {m['mae']:.4f}")

        if target_name in results["filtered_features"]:
            feats = results["filtered_features"][target_name]
            print(f"  Features: {len(feats)}")


# Generate predictions for future seasons
def generate_and_save_predictions(
    df_engineered: pd.DataFrame,
    datasets: Dict[str, Any],
    model_results: Dict[str, Any],
    season_col: str = "season_start_year",
    id_cols: List[str] = ["personId", "player_name"],
    verbose: bool = True,
) -> Dict[str, Path]:
    enc_bundle: EncodersBundle = datasets.get("encoders")
    if enc_bundle is None:
        raise RuntimeError("Missing encoders; run create_multi_target_datasets() first")

    pred_year = ML_CONFIG.PREDICTION_YEAR
    source_year = ML_CONFIG.SOURCE_YEAR

    base = df_engineered.loc[df_engineered[season_col] == source_year].copy()
    if verbose:
        print(f"Generating predictions from {source_year} data: {len(base)} rows")

    saved_paths = {}

    for target, model in model_results.get("models", {}).items():
        final_feats = model_results["filtered_features"].get(target)
        if not final_feats:
            continue

        missing = [c for c in final_feats if c not in base.columns]
        if missing:
            raise KeyError(f"Missing features for {target}: {missing}")

        X_pred_raw = base[final_feats].copy()
        X_pred = apply_encoders_to_frame(X_pred_raw, enc_bundle, verbose=False)
        ensure_numeric_matrix(X_pred, context=f"X_pred ({target})")

        y_hat = model.predict(X_pred)

        pred_df = base[id_cols + [season_col]].copy()
        pred_df["prediction_season"] = int(pred_year)
        pred_df[f"{target}_pred"] = y_hat

        path = CFG.predictions_path(target, year=pred_year)
        path.parent.mkdir(parents=True, exist_ok=True)
        pred_df.to_parquet(path, index=False)
        saved_paths[target] = path

        if verbose:
            print(f"Saved {target} predictions to {path}")

    return saved_paths


# Main pipeline orchestrator
class MLPipeline:
    """Multi-target ML pipeline with importance filtering and consistent encodings"""

    def __init__(
        self,
        numerical_features: List[str],
        nominal_categoricals: List[str],
        ordinal_categoricals: List[str],
        y_variables: List[str],
        importance_threshold: float = 0.001,
        max_features_per_target: Optional[int] = None,
        verbose: bool = True,
    ):
        self.numerical_features = numerical_features
        self.nominal_categoricals = nominal_categoricals
        self.ordinal_categoricals = ordinal_categoricals
        self.y_variables = y_variables
        self.importance_threshold = importance_threshold
        self.max_features_per_target = max_features_per_target
        self.verbose = verbose
        self.results = {}
        CFG.ensure_ml_dirs()

    # Run complete pipeline from feature validation to predictions
    def run_complete_pipeline(self, df_engineered: pd.DataFrame) -> Dict[str, Any]:
        if self.verbose:
            print("MULTI-TARGET ML PIPELINE")
            print(f"Targets: {len(self.y_variables)}")
            print(f"Features: {len(self.numerical_features)} numerical + {len(self.nominal_categoricals + self.ordinal_categoricals)} categorical")

        # Validate feature availability
        validation_results = validate_and_evaluate_features(
            df_engineered, self.numerical_features, self.nominal_categoricals,
            self.ordinal_categoricals, self.y_variables, self.verbose
        )

        # Create datasets
        datasets = create_multi_target_datasets(
            df_engineered, self.numerical_features, self.nominal_categoricals,
            self.ordinal_categoricals, self.y_variables, verbose=self.verbose
        )

        # Train models
        model_results = train_multi_target_models(
            datasets, self.numerical_features, self.importance_threshold,
            self.max_features_per_target, verbose=self.verbose
        )

        # Save results
        save_feature_importance_results(model_results, CFG.ml_evaluation_dir, self.verbose)
        saved_pred_paths = generate_and_save_predictions(
            df_engineered, datasets, model_results, verbose=self.verbose
        )
        print_final_results(model_results, self.verbose)

        self.results = {
            "feature_validation": validation_results,
            "datasets": datasets,
            "model_results": model_results,
            "saved_predictions": {k: str(v) for k, v in saved_pred_paths.items()},
            "config": {
                "numerical_features": self.numerical_features,
                "nominal_categoricals": self.nominal_categoricals,
                "ordinal_categoricals": self.ordinal_categoricals,
                "y_variables": self.y_variables,
                "importance_threshold": self.importance_threshold,
                "max_features_per_target": self.max_features_per_target,
            },
        }
        return self.results

# -----
# Example entrypoint (kept for parity; ok to remove in production scripts)
# -----
def run_enhanced_pipeline_example():
    """Example usage with your default feature lists; returns results dict."""

    numerical_features = [
        # lagged features
        "season_pie_lag1",
        "ts_pct_lag1",
        "efg_pct_lag1",
        "fg_pct_lag1",
        "fg3_pct_lag1",
        "ft_pct_lag1",
        "pts_per36_lag1",
        "ast_per36_lag1",
        "reb_per36_lag1",
        "defensive_per36_lag1",
        "production_per36_lag1",
        "stocks_per36_lag1",
        "three_point_rate_lag1",
        "ft_rate_lag1",
        "pts_per_shot_lag1",
        "ast_to_tov_lag1",
        "usage_events_per_min_lag1",
        "usage_per_min_lag1",
        "games_played_lag1",
        "total_minutes_lag1",
        "total_points_lag1",
        "total_assists_lag1",
        "total_rebounds_lag1",
        "total_steals_lag1",
        "total_blocks_lag1",
        "total_fga_lag1",
        "total_fta_lag1",
        "total_3pa_lag1",
        "total_3pm_lag1",
        "total_tov_lag1",
        "win_pct_lag1",
        "avg_plus_minus_lag1",
        "team_win_pct_final_lag1",
        "offensive_impact_lag1",
        "two_way_impact_lag1",
        "efficiency_volume_score_lag1",
        "versatility_score_lag1",
        "shooting_score_lag1",
    ]

    nominal_categoricals = ["prediction_season"]
    ordinal_categoricals = ["minutes_tier"]
    y_variables = ["season_pie", "game_score_per36"]

    from src.heat_data_scientist_2025.data.load_data_utils import load_data_optimized

    df = load_data_optimized(CFG.ml_dataset_path, drop_null_rows=True)
    df_engineered, _, _ = engineer_features(df, verbose=True)

    if "game_score_per36" in y_variables and "game_score_per36" not in df_engineered.columns:
        df_engineered, _ = create_game_score_per36_feature(df_engineered)

    pipeline = MLPipeline(
        numerical_features=numerical_features,
        nominal_categoricals=nominal_categoricals,
        ordinal_categoricals=ordinal_categoricals,
        y_variables=y_variables,
        importance_threshold=0.001,
        max_features_per_target=30,
        verbose=True,
    )
    results = pipeline.run_complete_pipeline(df_engineered)
    return results


if __name__ == "__main__":
    _results = run_enhanced_pipeline_example()
    print("\nEnhanced pipeline completed. Check output directories for artifacts.")


Overwriting src/heat_data_scientist_2025/ml/ml_pipeline.py
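
The lag audit in `audit_lag_feature_integrity` comes down to one rule: a `*_lag1` value should equal the same player's base value from the immediately preceding season, and rows that follow a skipped season should not be checked at all. The sketch below restates that rule in plain Python on a few made-up rows. The helper name `count_lag_mismatches` and the sample values are hypothetical and are not part of the pipeline.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def count_lag_mismatches(rows: List[Row], lag_key: str, base_key: str) -> Tuple[int, int]:
    """Return (consecutive_pairs, mismatches) for rows keyed by personId and season_start_year."""
    by_player: Dict[float, List[Row]] = {}
    for row in rows:
        by_player.setdefault(row["personId"], []).append(row)

    pairs = 0
    mismatches = 0
    for seasons in by_player.values():
        seasons.sort(key=lambda r: r["season_start_year"])
        for prev, cur in zip(seasons, seasons[1:]):
            # Only a one-year gap makes the previous row a valid lag source
            if cur["season_start_year"] - prev["season_start_year"] != 1:
                continue
            pairs += 1
            if cur[lag_key] != prev[base_key]:
                mismatches += 1
    return pairs, mismatches


# Hypothetical rows: player 1 has three consecutive seasons, player 2 skipped 2021
rows = [
    {"personId": 1, "season_start_year": 2020, "season_pie": 0.12, "season_pie_lag1": 0.10},
    {"personId": 1, "season_start_year": 2021, "season_pie": 0.15, "season_pie_lag1": 0.12},
    {"personId": 1, "season_start_year": 2022, "season_pie": 0.14, "season_pie_lag1": 0.11},
    {"personId": 2, "season_start_year": 2020, "season_pie": 0.09, "season_pie_lag1": 0.08},
    {"personId": 2, "season_start_year": 2022, "season_pie": 0.10, "season_pie_lag1": 0.09},
]

print(count_lag_mismatches(rows, "season_pie_lag1", "season_pie"))  # (2, 1)
```

Player 2 contributes no pair because the 2020 to 2022 gap is two years, which is the same gating the audit applies through its consecutive-season mask.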

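As a worked check of the Game Score weights used in `create_game_score_per36_feature`, the sketch below applies the same formula to a single season total line in plain Python. The function name `game_score_per36` used here as a standalone helper and the stat line in `EXAMPLE` are hypothetical; the numbers are illustrative and do not come from the dataset. This version returns NaN for zero minutes, whereas the pipeline function substitutes 0.0.

```python
import math
from typing import Dict


def game_score_per36(s: Dict[str, float]) -> float:
    """Season Game Score scaled to 36 minutes, using the same weights as the pipeline."""
    if s["total_minutes"] <= 0:
        return math.nan
    total = (
        s["total_points"]
        + 0.4 * s["total_fgm"]
        - 0.7 * s["total_fga"]
        - 0.4 * (s["total_fta"] - s["total_ftm"])
        + 0.7 * s["total_reb_off"]
        + 0.3 * s["total_reb_def"]
        + s["total_steals"]
        + 0.7 * s["total_assists"]
        + 0.7 * s["total_blocks"]
        - 0.4 * s["total_pf"]
        - s["total_tov"]
    )
    return total * 36.0 / s["total_minutes"]


# Hypothetical season totals (not taken from the dataset)
EXAMPLE = {
    "total_points": 2000, "total_fgm": 700, "total_fga": 1500,
    "total_fta": 500, "total_ftm": 400,
    "total_reb_off": 100, "total_reb_def": 500,
    "total_steals": 100, "total_assists": 500, "total_blocks": 50,
    "total_pf": 150, "total_tov": 200, "total_minutes": 2500,
}

# Season Game Score sums to 1635, so per 36 minutes: 1635 * 36 / 2500, about 23.544
print(round(game_score_per36(EXAMPLE), 3))
```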

In [8]:
%%writefile src/heat_data_scientist_2025/ml/leaderboard_compare.py
"""
Leaderboard comparison module for basketball statistics.
"""

from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple, Optional
import re
import numpy as np
import pandas as pd

from src.heat_data_scientist_2025.utils.config import CFG, ML_CONFIG


def _season_str(start_year: int) -> str:
    """Convert year to season format like 2024 -> '2024-25'"""
    return f"{start_year}-{str((start_year + 1) % 100).zfill(2)}"


def _as_float(x):
    """Convert a scalar or array-like to float(s); unparseable values become NaN."""
    # Series and other array-likes are coerced element-wise and returned as a float Series
    if isinstance(x, pd.Series):
        return pd.to_numeric(x, errors="coerce").astype(float)

    if isinstance(x, (list, tuple, np.ndarray)):
        return pd.to_numeric(pd.Series(x), errors="coerce").astype(float)

    # Scalars: None, NaN, and anything float() rejects map to NaN
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    try:
        return float(x)
    except Exception:
        return np.nan


def _find_prediction_column(df: pd.DataFrame, metric: str, verbose: bool = True) -> str:
    """Find the prediction column in dataframe, trying common naming patterns"""
    candidates = [
        metric,                           
        f"{metric}_pred",                 
        f"predicted_{metric}",            
        f"pred_{metric}",                 
    ]
    
    for candidate in candidates:
        if candidate in df.columns:
            if verbose:
                print(f"Found prediction column: '{candidate}' for {metric}")
            return candidate
    
    # Fallback to any column with 'pred' in the name
    pred_cols = [c for c in df.columns if 'pred' in c.lower() and pd.api.types.is_numeric_dtype(df[c])]
    if len(pred_cols) == 1:
        if verbose:
            print(f"Using fallback column: '{pred_cols[0]}' for {metric}")
        return pred_cols[0]
    elif len(pred_cols) > 1:
        raise KeyError(f"Multiple prediction columns found for {metric}: {pred_cols}")
    
    raise KeyError(f"No prediction column found for {metric}. Available: {list(df.columns)}")


def _load_hist_minimal(metric: str, minutes_gate: int = 200, verbose: bool = True) -> pd.DataFrame:
    """
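    Load historical player seasons with the columns needed for ranking.

    Rows are kept when total_minutes >= minutes_gate. The default gate is 200
    minutes; pass minutes_gate=500 to match the 500-minute threshold stated in
    the project brief.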
    """
    need_cols = ["player_name", "season", "games_played", "total_minutes", metric]
    df = pd.read_parquet(CFG.ml_dataset_path)
    
    if verbose:
        print(f"[_load_hist_minimal] Initial data: {len(df)} rows")
    
    # Keep only seasons up to 2024
    start_year = df["season"].astype(str).str.extract(r"^(\d{4})")[0].astype(int)
    df = df.loc[start_year <= 2024].copy()
    
    if verbose:
        print(f"[_load_hist_minimal] After year filter (<=2024): {len(df)} rows")

    # Calculate game_score_per36 if missing
    if metric == "game_score_per36" and "game_score_per36" not in df.columns:
        required_for_gs = {
            "total_points", "total_fgm", "total_fga", "total_fta", "total_ftm",
            "total_reb_off", "total_reb_def", "total_steals", "total_assists",
            "total_blocks", "total_pf", "total_tov", "total_minutes"
        }
        if required_for_gs.issubset(df.columns):
            if verbose:
                print(f"Computing {metric} from component stats")
            game_score_total = (
                df["total_points"] + 0.4 * df["total_fgm"] - 0.7 * df["total_fga"]
                - 0.4 * (df["total_fta"] - df["total_ftm"]) + 0.7 * df["total_reb_off"]
                + 0.3 * df["total_reb_def"] + df["total_steals"] + 0.7 * df["total_assists"]
                + 0.7 * df["total_blocks"] - 0.4 * df["total_pf"] - df["total_tov"]
            )
            df["game_score_per36"] = np.where(
                df["total_minutes"] > 0, 
                game_score_total * 36.0 / df["total_minutes"], 
                np.nan
            )
        else:
            if verbose:
                print(f"Cannot compute {metric} - missing required columns")
            df[metric] = np.nan

    # Make sure numeric columns are actually numeric
    for c in ["games_played", "total_minutes", metric]:
        if c not in df.columns:
            df[c] = np.nan
        df[c] = _as_float(df[c])

    # Apply the minutes gate, then drop rows that have no value for the metric
    df = df.loc[df["total_minutes"] >= minutes_gate].copy()
    result = df[need_cols].dropna(subset=[metric])
    
    if verbose:
        print(f"[_load_hist_minimal] After minutes filter (>={minutes_gate}): {len(df)} rows")
        print(f"[_load_hist_minimal] After dropna on {metric}: {len(result)} rows")
    
    return result


def _load_predictions(metric: str, prediction_year: int, verbose: bool = True) -> pd.DataFrame:
    """Load saved predictions for a metric and return them in the leaderboard schema."""
    pth = CFG.predictions_path(metric, prediction_year)
    if verbose:
        print(f"Loading predictions from {pth}")
    
    if not pth.exists():
        raise FileNotFoundError(f"Predictions file not found: {pth}")
        
    df = pd.read_parquet(pth)
    pred_col = _find_prediction_column(df, metric, verbose=verbose)
    
    if "player_name" not in df.columns:
        raise KeyError("Predictions missing player_name column")

    # If the predictions file lacks these columns, fall back to fixed placeholder
    # values; they only influence tie-breaking in the rankings
    gp_col = _as_float(df["games_played"]) if "games_played" in df.columns else 65.0  # Reasonable default
    tm_col = _as_float(df["total_minutes"]) if "total_minutes" in df.columns else 2000.0  # Reasonable default

    # Build output with standard schema
    out = pd.DataFrame({
        "player_name": df["player_name"].astype(str),
        "season": _season_str(prediction_year),
        metric: _as_float(df[pred_col]),
        "games_played": gp_col,
        "total_minutes": tm_col,
        "source": "pred"
    })
    
    # Remove null predictions
    before_len = len(out)
    out = out.dropna(subset=[metric])
    if verbose and len(out) != before_len:
        print(f"Removed {before_len - len(out)} null predictions")
    
    if verbose:
        print(f"[_load_predictions] Final predictions: {len(out)} rows")

    return out


def _rank_three_buckets(df: pd.DataFrame, metric: str, top_n: int = 10, 
                        middle_n: int = 10, bottom_n: int = 10, verbose: bool = True) -> Dict[str, pd.DataFrame]:
    """Rank rows into top (highest), bottom (lowest), and middle (closest to the median) buckets."""
    
    if verbose:
        print(f"[_rank_three_buckets] Input data: {len(df)} rows")
        print(f"[_rank_three_buckets] Columns: {list(df.columns)}")
    
    use = df.loc[df[metric].notna()].copy()
    
    if verbose:
        print(f"[_rank_three_buckets] After notna filter: {len(use)} rows")
        if len(use) > 0:
            print(f"[_rank_three_buckets] {metric} range: {use[metric].min():.6f} to {use[metric].max():.6f}")
    
    if len(use) == 0:
        if verbose:
            print("[_rank_three_buckets] WARNING: No valid data after filtering!")
        return {"top": pd.DataFrame(), "middle": pd.DataFrame(), "bottom": pd.DataFrame()}

    # Fill missing values for tie-breaking
    for col in ["total_minutes", "games_played"]:
        if col not in use.columns:
            use[col] = 0.0
        else:
            use[col] = _as_float(use[col]).fillna(0.0)

    # Top players (highest values)
    top = (
        use.sort_values([metric, "total_minutes", "games_played", "player_name"],
                        ascending=[False, False, False, True], kind="stable")
           .head(top_n).copy()
    )

    # Bottom players (lowest values)
    bottom = (
        use.sort_values([metric, "total_minutes", "games_played", "player_name"],
                        ascending=[True, False, False, True], kind="stable")
           .head(bottom_n).copy()
    )

    # Middle players (closest to median)
    median_val = use[metric].median(skipna=True)
    use_middle = use.copy()
    use_middle["__dist_to_median"] = (use_middle[metric] - median_val).abs()
    middle = (
        use_middle.sort_values(["__dist_to_median", metric, "total_minutes", "games_played", "player_name"],
                               ascending=[True, False, False, False, True], kind="stable")
                  .head(middle_n)
                  .drop(columns=["__dist_to_median"]).copy()
    )

    if verbose:
        print(f"[_rank_three_buckets] Results - Top: {len(top)}, Middle: {len(middle)}, Bottom: {len(bottom)}")

    return {"top": top, "middle": middle, "bottom": bottom}


def _generate_closest_predictions(combined_df: pd.DataFrame, boards: Dict[str, pd.DataFrame], 
                                  metric: str, prediction_year: int, k: int = 10) -> Dict[str, pd.DataFrame]:
    """Generate lists of predictions that just missed making each leaderboard"""
    season_tag = _season_str(prediction_year)
    predictions_only = combined_df.loc[combined_df["source"] == "pred"].copy()
    
    closest = {}
    
    for bucket_name, board_df in boards.items():
        if board_df.empty:
            closest[f"closest_{bucket_name}"] = pd.DataFrame()
            continue
            
        # Find predictions that didn't make this board
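        board_players = set(zip(board_df["player_name"], board_df["season"]))
        # Build the membership mask with zip rather than a row-wise apply: it is faster,
        # and apply(axis=1) on an empty frame returns a DataFrame, which breaks the mask
        in_board = pd.Series(
            [key in board_players
             for key in zip(predictions_only["player_name"], predictions_only["season"])],
            index=predictions_only.index,
            dtype=bool,
        )
        missed_preds = predictions_only.loc[~in_board].copy()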
        
        if bucket_name == "top":
            cutoff_value = board_df[metric].min() if not board_df.empty else float('-inf')
            missed_preds["gap_to_cutoff"] = cutoff_value - missed_preds[metric] 
            closest_missed = (
                missed_preds.sort_values(["gap_to_cutoff", metric], ascending=[True, False])
                           .head(k).copy()
            )
            
        elif bucket_name == "bottom":
            cutoff_value = board_df[metric].max() if not board_df.empty else float('inf')
            missed_preds["gap_to_cutoff"] = missed_preds[metric] - cutoff_value
            closest_missed = (
                missed_preds.sort_values(["gap_to_cutoff", metric], ascending=[True, True])
                           .head(k).copy()
            )
            
        else:  # middle
            median_val = combined_df[metric].median(skipna=True)
            missed_preds["gap_to_median"] = (missed_preds[metric] - median_val).abs()
            closest_missed = (
                missed_preds.sort_values(["gap_to_median", metric], ascending=[True, False])
                           .head(k).copy()
            )

        closest_missed["Notes"] = f"Closest {season_tag} prediction to {bucket_name}"
        closest[f"closest_{bucket_name}"] = closest_missed.reset_index(drop=True)
    
    return closest


def _build_new_boards(hist_df: pd.DataFrame, preds_df: pd.DataFrame,
                      metric: str, prediction_year: int, verbose: bool = True) -> Tuple[Dict[str, pd.DataFrame], Dict[str, pd.DataFrame]]:
    """Combine historical data with predictions and build the top/middle/bottom leaderboards."""
    
    if verbose:
        print(f"[_build_new_boards] Historical: {len(hist_df)} rows")
        print(f"[_build_new_boards] Predictions: {len(preds_df)} rows")
    
    # Make sure both dataframes have same structure
    hist_df = hist_df.copy()
    hist_df["source"] = "historical"
    
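    preds_df = preds_df.copy()
    if "source" not in preds_df.columns:
        # Downstream filters rely on source == "pred" to identify predictions
        preds_df["source"] = "pred"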
    for col in ["games_played", "total_minutes"]:
        if col not in hist_df.columns:
            hist_df[col] = np.nan
        if col not in preds_df.columns: 
            preds_df[col] = np.nan

    # Combine datasets
    combined = pd.concat([hist_df, preds_df], ignore_index=True, sort=False)
    
    if verbose:
        print(f"[_build_new_boards] Combined: {len(combined)} rows")
        print(f"[_build_new_boards] Predictions in combined: {(combined['source'] == 'pred').sum()}")
    
    # Create rankings
    boards = _rank_three_buckets(combined, metric, verbose=verbose)

    # Add rank numbers and notes for predictions
    season_tag = _season_str(prediction_year)
    for bucket_name, board_df in boards.items():
        if board_df.empty:
            continue
            
        board_df = board_df.copy()
        board_df.insert(0, "Rank", range(1, len(board_df) + 1))
        
        is_prediction = (board_df.get("source", "historical") == "pred") | (board_df["season"] == season_tag)
        board_df["Notes"] = np.where(is_prediction, f"NEW {season_tag} prediction", "")
        
        boards[bucket_name] = board_df
        
        if verbose:
            pred_count = is_prediction.sum()
            print(f"[_build_new_boards] {bucket_name}: {pred_count}/{len(board_df)} are {season_tag} predictions")

    # Find predictions that just missed each board
    closest = _generate_closest_predictions(combined, boards, metric, prediction_year)

    return boards, closest


def build_leaderboards_with_predictions(metrics: Tuple[str, ...] = ("game_score_per36", "season_pie"),
                                        prediction_year: int = 2025,
                                        save: bool = True, 
                                        verbose: bool = True) -> Dict[str, Dict[str, pd.DataFrame]]:
    """Build the combined historical-plus-prediction leaderboards for each metric and optionally save them."""
    if verbose:
        print(f"\nBuilding leaderboards for {metrics} with {prediction_year} predictions")
    
    results = {}
    output_dir = CFG.ml_predictions_dir
    output_dir.mkdir(parents=True, exist_ok=True)

    for metric in metrics:
        if verbose:
            print(f"\nProcessing {metric}")
        
        try:
            # Load data with a lenient 200-minute gate (below the 500-minute cutoff in the task statement)
            hist_df = _load_hist_minimal(metric, minutes_gate=200, verbose=verbose)
            pred_df = _load_predictions(metric, prediction_year, verbose=verbose)
            
            if verbose:
                print(f"Historical data: {len(hist_df):,} player-seasons")
                print(f"Predictions: {len(pred_df):,} players")

            # Build leaderboards
            boards, closest = _build_new_boards(hist_df, pred_df, metric, prediction_year, verbose=verbose)
            results[metric] = {"boards": boards, "closest": closest}

            if save:
                # Save main leaderboards
                for bucket_name, board_df in boards.items():
                    board_path = output_dir / f"{metric}_{bucket_name}_leaderboard_{prediction_year}_with_predictions.csv"
                    board_df.to_csv(board_path, index=False)
                    if verbose:
                        print(f"Saved {bucket_name} leaderboard: {board_path.name} ({len(board_df)} rows)")

                # Save closest miss lists  
                closest_path = output_dir / f"{metric}_closest_misses_{prediction_year}.csv"
                if closest:
                    closest_combined = pd.concat(
                        closest.values(), 
                        keys=list(closest.keys())
                    ).reset_index(level=0).rename(columns={"level_0": "category"})
                    closest_combined.to_csv(closest_path, index=False)
                    if verbose:
                        print(f"Saved closest misses: {closest_path.name}")

        except Exception as e:
            print(f"Error processing {metric}: {str(e)}")
            import traceback
            traceback.print_exc()
            continue

    if verbose:
        print(f"\nCompleted leaderboard generation for {len(results)} metrics")
        
    return results


def create_simple_leaderboards_from_predictions(metrics: Tuple[str, ...] = ("game_score_per36", "season_pie"),
                                                prediction_year: int = 2025,
                                                top_n: int = 50,
                                                verbose: bool = True) -> Dict[str, pd.DataFrame]:
    """Create simple top-N leaderboards directly from the saved prediction files."""
    if verbose:
        print(f"\nCreating simple leaderboards for {prediction_year}")
    
    CFG.ensure_ml_dirs()
    results = {}
    
    for metric in metrics:
        if verbose:
            print(f"\nProcessing {metric}")
        
        try:
            pred_path = CFG.predictions_path(metric, prediction_year)
            if not pred_path.exists():
                print(f"Predictions not found: {pred_path}")
                continue
                
            preds_df = pd.read_parquet(pred_path)
            if verbose:
                print(f"Loaded {len(preds_df):,} predictions")

            pred_col = _find_prediction_column(preds_df, metric, verbose=verbose)
            
            leaderboard_cols = ["player_name", pred_col]
            if any(c not in preds_df.columns for c in leaderboard_cols):
                print(f"Missing required columns for {metric}")
                continue
            
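            # Carry the season column through when the predictions file has one;
            # otherwise fall back to the prediction season label below
            keep_cols = leaderboard_cols + (["season"] if "season" in preds_df.columns else [])
            leaderboard_df = preds_df[keep_cols].copy()
            leaderboard_df = leaderboard_df.dropna(subset=[pred_col])
            
            if "season" not in leaderboard_df.columns:
                leaderboard_df["season"] = _season_str(prediction_year)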
            
            leaderboard_df = leaderboard_df.sort_values(
                [pred_col, "player_name"], 
                ascending=[False, True]
            ).head(top_n).reset_index(drop=True)
            
            leaderboard_df = leaderboard_df.rename(columns={pred_col: metric})
            leaderboard_df.insert(0, "rank", range(1, len(leaderboard_df) + 1))
            leaderboard_df[metric] = leaderboard_df[metric].round(6)
            
            lb_path = CFG.leaderboard_path(metric, prediction_year)
            leaderboard_df.to_csv(lb_path, index=False)
            
            if verbose:
                print(f"Saved: {lb_path.name}")
                print(f"Top 3: {leaderboard_df.head(3)['player_name'].tolist()}")
            
            results[metric] = leaderboard_df
            
        except Exception as e:
            print(f"Error with {metric}: {str(e)}")
            continue
    
    if verbose:
        print(f"\nCreated {len(results)} simple leaderboards")
    
    return results


Overwriting src/heat_data_scientist_2025/ml/leaderboard_compare.py
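
The closest-miss logic in `_generate_closest_predictions` can be sanity-checked on a toy frame. The sketch below is a minimal illustration, not the module's code: `toy_top_misses` is a hypothetical helper, and the player names and metric values are made up. It mirrors only the `top` branch, where the cutoff is the lowest value on the board and predictions are ordered by how far they fall below it.

```python
import pandas as pd


def toy_top_misses(board: pd.DataFrame, preds: pd.DataFrame,
                   metric: str, k: int = 2) -> pd.DataFrame:
    """Hypothetical helper mirroring the 'top' branch of the closest-miss logic."""
    # Identify (player, season) pairs that already sit on the board
    on_board = set(zip(board["player_name"], board["season"]))
    keys = list(zip(preds["player_name"], preds["season"]))
    missed_mask = pd.Series([key not in on_board for key in keys],
                            index=preds.index, dtype=bool)
    missed = preds.loc[missed_mask].copy()

    # The cutoff is the weakest value that still made the board
    cutoff = board[metric].min()
    missed["gap_to_cutoff"] = cutoff - missed[metric]

    # Smallest gap first; ties broken by the higher metric value
    return (missed.sort_values(["gap_to_cutoff", metric], ascending=[True, False])
                  .head(k)
                  .reset_index(drop=True))


# Made-up values, for illustration only
board = pd.DataFrame({
    "player_name": ["A", "B"],
    "season": ["2012-13", "2025-26"],
    "season_pie": [0.17, 0.16],
})
preds = pd.DataFrame({
    "player_name": ["B", "C", "D", "E"],
    "season": ["2025-26"] * 4,
    "season_pie": [0.16, 0.15, 0.12, 0.155],
})

misses = toy_top_misses(board, preds, "season_pie")
print(misses[["player_name", "season_pie", "gap_to_cutoff"]])
```

Player B is excluded because that prediction already made the board; E and C follow as the two smallest gaps below the cutoff of 0.16.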


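The runner in the next cell resolves prediction columns through a three-step fallback (`_detect_prediction_column`): an exact `<metric>_pred` match, then a case-insensitive substring match, then the first `_pred` column. The sketch below restates that order on made-up column names; `resolve_pred_column` is a hypothetical stand-in written for this illustration.

```python
from typing import List


def resolve_pred_column(metric: str, columns: List[str]) -> str:
    """Hypothetical stand-in that follows the same three-step fallback order."""
    pred_cols = [c for c in columns if c.endswith("_pred")]
    if not pred_cols:
        return ""  # no prediction columns at all

    exact = f"{metric}_pred"
    if exact in pred_cols:
        return exact  # step 1: exact match

    for c in pred_cols:
        if metric.lower() in c.lower():
            return c  # step 2: case-insensitive substring match

    return pred_cols[0]  # step 3: first available prediction column


# Made-up column names, for illustration only
cols = ["player_name", "season_PER_pred", "season_pie_pred"]

exact_hit = resolve_pred_column("season_pie", cols)
fuzzy_hit = resolve_pred_column("SEASON_PER", cols)
fallback = resolve_pred_column("season_VORP", cols)
none_hit = resolve_pred_column("season_pie", ["player_name"])

print(exact_hit, fuzzy_hit, fallback, repr(none_hit))
```

One consequence worth keeping in mind: the final step returns a column that may belong to a different metric (here `season_PER_pred` for a `season_VORP` request), so a caller that needs strictness may prefer to treat that case as missing.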
In [9]:
%%writefile src/heat_data_scientist_2025/run_pie_gs36_prediction_pipeline.py
#!/usr/bin/env python3
"""
Multi-Target ML Pipeline Runner
==============================================
"""

import sys
import json
import pandas as pd
from pathlib import Path
from typing import Dict, List

from src.heat_data_scientist_2025.data.load_data_utils import load_data_optimized
from src.heat_data_scientist_2025.utils.config import CFG
from src.heat_data_scientist_2025.data.feature_engineering import engineer_features
from src.heat_data_scientist_2025.ml.ml_pipeline import (
    MLPipeline,
    create_game_score_per36_feature
)

# Helper function to find the correct prediction column for a given metric
def _detect_prediction_column(metric: str, columns: List[str]) -> str:
    """Find the correct prediction column for a given metric."""
    metric_lower = metric.lower()
    pred_cols = [c for c in columns if c.endswith('_pred')]
    if not pred_cols:
        return ''

    exact = f'{metric}_pred'
    if exact in pred_cols:
        return exact

    for c in pred_cols:
        if metric_lower in c.lower():
            return c

    return pred_cols[0]

# Utility to find the appropriate value column from available columns
def _find_value_column(metric: str, columns: List[str]) -> str:
    """Find the appropriate value column for a metric."""
    candidates = [metric, f'{metric}_value', 'value']
    return next((c for c in candidates if c in columns), None)

# Print leaderboard data in organized format
def _print_leaderboard_data(metric: str, df: pd.DataFrame, title: str = "", 
                           top_n: int = 10, show_sections: bool = False) -> None:
    """Print leaderboard data in an organized format."""
    if df is None or df.empty:
        print(f"No data available for {metric}")
        return

    if title:
        print(f"\n{title}")
        print("-" * len(title))

    value_col = _find_value_column(metric, df.columns) or _detect_prediction_column(metric, df.columns)
    
    # Determine columns to display
    display_cols = []
    for col in ['rank', 'Rank', 'player_name']:
        if col in df.columns:
            display_cols.append(col)
    if value_col:
        display_cols.append(value_col)
    for col in ['season', 'prediction_season', 'season_start_year', 'Notes']:
        if col in df.columns:
            display_cols.append(col)

    if not display_cols:
        display_cols = df.columns[:4].tolist()

    if show_sections:
        # Show top, middle, bottom sections side by side
        n = min(top_n, len(df))
        mid_start = max((len(df) // 2) - (n // 2), 0)
        sections = {
            'TOP': df.head(n),
            'MEDIAN': df.iloc[mid_start:mid_start + n],  # iloc clamps at the end of the frame
            'BOTTOM': df.tail(n)
        }
        
        print(f"{'TOP':^30} | {'MEDIAN':^30} | {'BOTTOM':^30}")
        print("-" * 96)  # three 30-character columns plus two 3-character separators
        
        max_rows = max(len(section) for section in sections.values())
        for i in range(max_rows):
            row_parts = []
            for section_name, section_df in sections.items():
                if i < len(section_df):
                    row_data = section_df.iloc[i][display_cols].tolist()
                    row_str = " | ".join(str(x) for x in row_data)
                else:
                    row_str = ""
                row_parts.append(f"{row_str:<30}")
            print(" | ".join(row_parts))
    else:
        # Standard table format
        print(df.head(top_n)[display_cols].to_string(index=False))

# Print comprehensive leaderboard sections for a metric
def _print_comprehensive_sections(metric: str, sections_dict: Dict[str, pd.DataFrame], top_n: int = 10) -> None:
    """Print comprehensive leaderboard sections for a metric."""
    if not sections_dict:
        print("No comprehensive sections available")
        return

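    # Boards built by build_leaderboards_with_predictions carry only top/middle/bottom;
    # 'closest_misses' prints only if the caller adds that key to sections_dict
    for section_name in ['top', 'middle', 'bottom', 'closest_misses']: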
        section_df = sections_dict.get(section_name)
        if section_df is not None and not section_df.empty:
            _print_leaderboard_data(metric, section_df, f"[{section_name.upper()}]", top_n)

# Main pipeline execution function
def main():
    """Execute the ML pipeline with proper error handling."""
    
    from src.heat_data_scientist_2025.utils.config import (
        numerical_features,
        nominal_categoricals,
        ordinal_categoricals,
        y_variables
    )

    print("Multi-Target ML Pipeline")
    print("=" * 30)
    print(f"Targets: {', '.join(y_variables)}")
    print(f"Features: {len(numerical_features) + len(ordinal_categoricals)}")

    # Load and process data
    print("\nLoading and engineering data...")
    try:
        df = load_data_optimized(CFG.ml_dataset_path, drop_null_rows=True)
        df_engineered = engineer_features(df, verbose=True)
        print("Feature engineering completed")

        # Display top performers for verification
        if "season_pie" in df_engineered.columns:
            top_pie = df_engineered[["player_name", "season_pie"]].sort_values("season_pie", ascending=False).head(5)
            print("\nTop 5 season_pie players:")
            for _, row in top_pie.iterrows():
                print(f"   {row['player_name']}: {row['season_pie']:.6f}")

    except Exception as e:
        print(f"Data loading failed: {str(e)}")
        return None

    # Add game_score_per36 feature if needed
    if 'game_score_per36' in y_variables and 'game_score_per36' not in df_engineered.columns:
        print("Computing game_score_per36...")
        try:
            df_engineered, _ = create_game_score_per36_feature(df_engineered)
            print("game_score_per36 feature added")
        except Exception as e:
            print(f"Failed to add game_score_per36: {str(e)}")
            return None

    # Execute ML pipeline
    print("\nRunning ML pipeline...")
    try:
        pipeline = MLPipeline(
            numerical_features=numerical_features,
            nominal_categoricals=nominal_categoricals,
            ordinal_categoricals=ordinal_categoricals,
            y_variables=y_variables,
            importance_threshold=0.001,
            max_features_per_target=30,
            verbose=True
        )

        results = pipeline.run_complete_pipeline(df_engineered)
        print("ML pipeline completed")

    except Exception as e:
        print(f"ML pipeline failed: {str(e)}")
        import traceback
        traceback.print_exc()
        return None

    # Generate simple leaderboards
    print("\nCreating simple leaderboards...")
    try:
        from src.heat_data_scientist_2025.ml.leaderboard_compare import create_simple_leaderboards_from_predictions

        simple_leaderboards = create_simple_leaderboards_from_predictions(
            metrics=y_variables,
            prediction_year=2025,
            top_n=50,
            verbose=True
        )
        print("Simple leaderboards created")

        print("\n" + "=" * 60)
        print("SIMPLE LEADERBOARDS - TOP 10 PREDICTIONS")
        print("=" * 60)
        for metric, df_lb in (simple_leaderboards or {}).items():
            print(f"\n{metric.upper()}")
            _print_leaderboard_data(metric, df_lb, show_sections=True, top_n=10)

    except Exception as e:
        print(f"Simple leaderboard creation failed: {str(e)}")
        simple_leaderboards = {}

    # Generate comprehensive leaderboards
    print("\nCreating comprehensive leaderboards...")
    try:
        from src.heat_data_scientist_2025.ml.leaderboard_compare import build_leaderboards_with_predictions

        comprehensive_leaderboards = build_leaderboards_with_predictions(
            metrics=y_variables,
            prediction_year=2025,
            save=True,
            verbose=True
        )
        print("Comprehensive leaderboards created")

        print("\n" + "=" * 60)
        print("COMPREHENSIVE LEADERBOARDS - TOP 10 PER SECTION")
        print("=" * 60)
        
        for y_var in y_variables:
            print(f"\n{y_var.upper()}")
            metric_data = (comprehensive_leaderboards or {}).get(y_var, {})
            boards_data = metric_data.get("boards", {})
            _print_comprehensive_sections(y_var, boards_data, top_n=10)

    except Exception as e:
        print(f"Comprehensive leaderboard creation failed: {str(e)}")
        comprehensive_leaderboards = {}

    # Save results summary
    try:
        _save_results_summary(results, simple_leaderboards, comprehensive_leaderboards)
        print("Results summary saved")
    except Exception as e:
        print(f"Failed to save summary: {str(e)}")

    # Final summary
    print("\n" + "=" * 40)
    print("PIPELINE COMPLETED")
    print("=" * 40)
    print("ML models trained and predictions generated")
    print("Feature importance analysis completed")
    if simple_leaderboards:
        print("Simple leaderboards created")
    if comprehensive_leaderboards:
        print("Comprehensive leaderboards created")
    print(f"Results saved to: {CFG.ml_predictions_dir}")

    return {
        'ml_results': results,
        'simple_leaderboards': simple_leaderboards,
        'comprehensive_leaderboards': comprehensive_leaderboards
    }

# Save comprehensive results summary to JSON file
def _save_results_summary(ml_results: Dict, simple_lb: Dict, comp_lb: Dict) -> None:
    """Save comprehensive results summary with error handling."""
    try:
        summary_path = CFG.ml_evaluation_dir / "results_summary.json"
        
        summary = {
            'pipeline_status': 'completed',
            'timestamp': pd.Timestamp.now().isoformat(),
            'pipeline_config': {
                'numerical_features_count': len(ml_results.get('config', {}).get('numerical_features', [])),
                'categorical_features_count': len(
                    ml_results.get('config', {}).get('nominal_categoricals', []) + 
                    ml_results.get('config', {}).get('ordinal_categoricals', [])
                ),
                'target_variables': ml_results.get('config', {}).get('y_variables', []),
                'prediction_year': 2025
            },
            'ml_performance': {},
            'leaderboard_status': {
                'simple_leaderboards': len(simple_lb),
                'comprehensive_leaderboards': len(comp_lb)
            }
        }
        
        # Add ML performance metrics
        if 'model_results' in ml_results:
            model_results = ml_results['model_results']
            if 'evaluation_metrics' in model_results:
                summary['ml_performance'] = model_results['evaluation_metrics']
            
            if 'importance_scores' in model_results:
                summary['feature_importance'] = {}
                for target, importance_df in model_results['importance_scores'].items():
                    if not importance_df.empty:
                        summary['feature_importance'][target] = {
                            'top_feature': importance_df.iloc[0]['feature'],
                            'top_importance': float(importance_df.iloc[0]['importance_mean']),
                            'significant_features_count': len(importance_df[importance_df['importance_mean'] > 0.001])
                        }
        
        # Add leaderboard summaries
        if simple_lb:
            summary['top_predictions'] = {}
            for metric, lb_df in simple_lb.items():
                if not lb_df.empty:
                    value_col = _find_value_column(metric, lb_df.columns)
                    if value_col:
                        summary['top_predictions'][metric] = {
                            'winner': lb_df.iloc[0]['player_name'],
                            'value': float(lb_df.iloc[0][value_col]),
                            'total_predictions': len(lb_df)
                        }
        
        # Add comprehensive leaderboard status
        if comp_lb:
            summary['comprehensive_status'] = {}
            for metric, data in comp_lb.items():
                boards = data.get('boards', {})
                summary['comprehensive_status'][metric] = {
                    'top_count': len(boards.get('top', pd.DataFrame())),
                    'middle_count': len(boards.get('middle', pd.DataFrame())),
                    'bottom_count': len(boards.get('bottom', pd.DataFrame())),
                }
        
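        # Make sure the evaluation directory exists before writing
        summary_path.parent.mkdir(parents=True, exist_ok=True)
        with open(summary_path, 'w') as f: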
            json.dump(summary, f, indent=2, default=str)
        
        print(f"Summary saved: {summary_path}")
        
    except Exception as e:
        print(f"Failed to save summary: {str(e)}")

# Diagnostic utility for checking predictions and leaderboard files
def check_pipeline_outputs(verbose: bool = True) -> None:
    """Check prediction files and leaderboard outputs for completeness."""
    print("\nChecking Pipeline Outputs")
    print("=" * 30)

    targets = ["season_pie", "game_score_per36", "season_PER", "season_EWA", "season_VORP"]
    
    for target in targets:
        print(f"\nChecking {target}:")
        
        # Check prediction files
        pred_path = CFG.predictions_path(target, 2025)
        if pred_path.exists():
            print(f"  Predictions found: {pred_path.name}")
            try:
                df = pd.read_parquet(pred_path)
                print(f"  Shape: {df.shape}")
                
                pred_cols = [c for c in df.columns if c.endswith('_pred')]
                if pred_cols:
                    pred_col = _detect_prediction_column(target, pred_cols)
                    pred_values = pd.to_numeric(df[pred_col], errors='coerce')
                    print(f"  Value range: {pred_values.min():.6f} to {pred_values.max():.6f}")
                    if 'player_name' in df.columns:
                        top_idx = pred_values.idxmax()
                        print(f"  Top player: {df.loc[top_idx, 'player_name']}")
                        
            except Exception as e:
                print(f"  Error reading predictions: {str(e)}")
        else:
            print(f"  Missing predictions: {pred_path}")

        # Check simple leaderboards
        lb_path = CFG.leaderboard_path(target, 2025)
        if lb_path.exists():
            print("  Simple leaderboard found")
            try:
                lb_df = pd.read_csv(lb_path)
                if not lb_df.empty and 'player_name' in lb_df.columns:
                    print(f"  Winner: {lb_df.iloc[0]['player_name']}")
            except Exception as e:
                print(f"  Error reading leaderboard: {str(e)}")
        else:
            print("  Missing simple leaderboard")

        # Check comprehensive leaderboard files  
        comp_files = [
            f"{target}_top_leaderboard_2025_with_predictions.csv",
            f"{target}_middle_leaderboard_2025_with_predictions.csv", 
            f"{target}_bottom_leaderboard_2025_with_predictions.csv",
            f"{target}_closest_misses_2025.csv"
        ]
        
        comp_found = 0
        for comp_file in comp_files:
            comp_path = CFG.ml_predictions_dir / comp_file
            if comp_path.exists():
                comp_found += 1
        
        print(f"  Comprehensive files: {comp_found}/{len(comp_files)} found")

if __name__ == "__main__":
    print("Starting Multi-Target ML Pipeline...")
    
    try:
        results = main()
        
        if results is None:
            print("\nPipeline failed - running diagnostics...")
            check_pipeline_outputs()
            sys.exit(1)
        else:
            print("\nPipeline completed successfully")
            
            # Run diagnostics for verification
            print("Running verification diagnostics...")
            check_pipeline_outputs()
            
    except KeyboardInterrupt:
        print("\nPipeline interrupted by user")
        sys.exit(1)
    except Exception as e:
        print(f"\nUnexpected error: {str(e)}")
        import traceback
        traceback.print_exc()
        check_pipeline_outputs()
        sys.exit(1)


Starting Multi-Target ML Pipeline...
Multi-Target ML Pipeline
Targets: season_pie, game_score_per36, season_PER, season_EWA, season_VORP
Features: 84

Loading and engineering data...
Loading data for enhanced comprehensive EDA...
→ Applying null dropping: how='any', all columns
✓ Dropped 0 rows by null criteria (how='any', subset=None); remaining 5,575 rows
✓ Dataset loaded: 5,575 rows × 61 columns in 0.03s
Starting feature engineering...
Parsing seasons...
Adding experience features...
Adding advanced metrics...
Adding usage features...
Adding minutes features...
Adding performance consistency...
Creating composite features...
Building portability index...
Creating lag features...
Dropped 1152 first-season rows with null lags
Feature engineering complete: 5575 → 4423 rows, 61 → 186 columns
Feature engineering completed

Top 5 season_pie players:
   LeBron James: 0.174857
   LeBron James: 0.174495
   Russell Westbrook: 0.171554
   Kevin Durant: 0.169186
   Nikola Jokic: 0.166170

Runni