<a href="https://colab.research.google.com/github/elifncebe/NBA-Rookie-of-the-Year-Predictions-2025-2026-Season/blob/main/NBA_FAS_2026_ROY_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏀 NBA Future Analytics Stars Technical Challenge  
### Predicting the 2025–26 Rookie of the Year (ROY)

This notebook presents my end-to-end workflow for forecasting the NBA's **Rookie of the Year (ROY)** using publicly available historical data from the official NBA Stats API (`nba_api`). The goal is to build a reproducible and data-driven model that identifies which rookies in a target season are most likely to win the award.

I chose ROY because I have loved watching this rookie class, from the NBA Draft through Summer League and into the current season!

### Project Objectives
1. Collect historical rookie performance metrics from the 2010–11 through 2024–25 seasons.  
2. Label each season's Rookie of the Year to create a supervised learning target.  
3. Engineer meaningful features (scoring, usage, efficiency, defensive impact, and team context).  
4. Train a statistical model to estimate the probability that a rookie becomes ROY.  
5. Apply the model to the **2025–26** rookie class and generate a ranked list of candidates.  
6. Export the required **`predictions.csv`** file containing:
   - `player_name`
   - `probability` (value between 0 and 1)

### Modeling Philosophy
The aim is **not** to predict ROY with perfect accuracy; the award is subjective and influenced by narrative and voting dynamics. Instead, the model serves as a **ranking tool**, highlighting which rookies' profiles most resemble prior ROY winners based on measurable on-court impact.

### Reproducibility
This notebook runs cleanly end-to-end using only:
- `nba_api`  
- `pandas`  
- `numpy`  
- `scikit-learn`  
- `tqdm`

All steps, transformations, and modeling decisions are documented in the cells below.

## Setup & Installing Dependencies

In this section, I install the required Python packages and import the core libraries used throughout the analysis. The primary external dependency I chose to use is `nba_api`, which provides access to NBA.com statistics and allows me to pull rookie performance data programmatically. I also configure pandas display options and load standard machine learning utilities.

In [None]:
!pip install nba_api pandas numpy scikit-learn tqdm

import time

import numpy as np
import pandas as pd
from nba_api.stats.endpoints import LeagueDashPlayerStats
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

# Show every column when displaying the wide stats tables.
pd.set_option("display.max_columns", None)
print("Environment ready.")


Collecting nba_api
  Downloading nba_api-1.11.3-py3-none-any.whl.metadata (5.8 kB)
Downloading nba_api-1.11.3-py3-none-any.whl (318 kB)
Installing collected packages: nba_api
Successfully installed nba_api-1.11.3
Environment ready.


## Helper Function: Season String Generator

NBA API endpoints require season identifiers in the format `"YYYY-YY"`.
This utility function creates consistently formatted season strings for
any desired year range (e.g., `"2010-11"`). It's used throughout the pipeline
to pull both historical and target-season data.

In [None]:
def generate_season_strings(start_year, end_year):
    return [f"{yr}-{str(yr+1)[-2:]}" for yr in range(start_year, end_year+1)]
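
As a quick check of the format, the helper can be called on a short range. The snippet below repeats the definition so that it runs on its own; the 1999 case is included only to show how the two-digit suffix behaves at a century boundary.

```python
def generate_season_strings(start_year, end_year):
    # Same logic as the helper above: "YYYY-" plus the last two digits of the next year.
    return [f"{yr}-{str(yr + 1)[-2:]}" for yr in range(start_year, end_year + 1)]


short_range = generate_season_strings(2010, 2012)
century_wrap = generate_season_strings(1999, 1999)
print(short_range)   # ['2010-11', '2011-12', '2012-13']
print(century_wrap)  # ['1999-00']
```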

## Historical ROY Winners (Label Data)

To train a supervised model, I need labels indicating which rookie won Rookie of the Year in each historical season. This dictionary provides the award winner for each season from 2010–11 through 2024–25. These labels are merged with the rookie stats to create the binary target variable `IS_ROY_WINNER`.

In [None]:
ROY_WINNERS = {
    "2010-11": "Blake Griffin",
    "2011-12": "Kyrie Irving",
    "2012-13": "Damian Lillard",
    "2013-14": "Michael Carter-Williams",
    "2014-15": "Andrew Wiggins",
    "2015-16": "Karl-Anthony Towns",
    "2016-17": "Malcolm Brogdon",
    "2017-18": "Ben Simmons",
    "2018-19": "Luka Doncic",
    "2019-20": "Ja Morant",
    "2020-21": "LaMelo Ball",
    "2021-22": "Scottie Barnes",
    "2022-23": "Paolo Banchero",
    "2023-24": "Victor Wembanyama",
    "2024-25": "Stephon Castle"
}

TRAINING_SEASONS = list(ROY_WINNERS.keys())
TRAINING_SEASONS

['2010-11',
 '2011-12',
 '2012-13',
 '2013-14',
 '2014-15',
 '2015-16',
 '2016-17',
 '2017-18',
 '2018-19',
 '2019-20',
 '2020-21',
 '2021-22',
 '2022-23',
 '2023-24',
 '2024-25']

## Function to Fetch Rookie Statistics

This function retrieves per-game rookie performance data for a given season using `LeagueDashPlayerStats` with the filter `player_experience_nullable="Rookie"`. This avoids heavy per-player API loops and ensures efficient and reliable rookie detection directly from the official NBA stats API.

In [None]:
def get_rookie_stats_for_season(season):
    print(f"Fetching rookies for {season}...")
    time.sleep(1)
    df = LeagueDashPlayerStats(
        season=season,
        season_type_all_star="Regular Season",
        per_mode_detailed="PerGame",
        player_experience_nullable="Rookie"
    ).get_data_frames()[0]
    df["SEASON"] = season
    return df

## Building the Training Dataset (2010–11 through 2024–25)

Here I loop through all training seasons, pull rookie data for each season, and attach a binary label indicating whether each player was that season's ROY winner. The resulting dataset contains 1,393 rookie rows across 15 seasons and includes both performance metrics and the ROY outcome.

In [None]:
dfs = []
for season in tqdm(TRAINING_SEASONS):
    df = get_rookie_stats_for_season(season)
    df["IS_ROY_WINNER"] = (df["PLAYER_NAME"] == ROY_WINNERS[season]).astype(int)
    dfs.append(df)

df_train = pd.concat(dfs).reset_index(drop=True)
df_train.shape

Fetching rookies for 2010-11...
Fetching rookies for 2011-12...
Fetching rookies for 2012-13...
Fetching rookies for 2013-14...
Fetching rookies for 2014-15...
Fetching rookies for 2015-16...
Fetching rookies for 2016-17...
Fetching rookies for 2017-18...
Fetching rookies for 2018-19...
Fetching rookies for 2019-20...
Fetching rookies for 2020-21...
Fetching rookies for 2021-22...
Fetching rookies for 2022-23...
Fetching rookies for 2023-24...
Fetching rookies for 2024-25...
100%|██████████| 15/15 [00:29<00:00,  1.97s/it]


(1393, 69)
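
The labels above rely on an exact string match between `PLAYER_NAME` and the name in `ROY_WINNERS`. If the API spells a name with diacritics that the dictionary omits (or the reverse), that season would silently end up with no winner. The sketch below shows one way to make the comparison accent-insensitive. `normalize_name` is a hypothetical helper, and the toy rows are hand-written for illustration; how the API actually spells any given name is an assumption here.

```python
import unicodedata

import pandas as pd


def normalize_name(name):
    """Lowercase a name and strip diacritics, e.g. 'Dončić' -> 'doncic'."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower().strip()


# Hand-written stand-in for one season of API output (not fetched from the API).
toy = pd.DataFrame({"PLAYER_NAME": ["Luka Dončić", "Trae Young", "Deandre Ayton"]})
winner = "Luka Doncic"

# Exact match misses the accented spelling; the normalized match finds it.
exact = (toy["PLAYER_NAME"] == winner).astype(int)
tolerant = (toy["PLAYER_NAME"].map(normalize_name) == normalize_name(winner)).astype(int)
print(exact.sum(), tolerant.sum())
```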

## Sanity Checks on the Assembled Dataset

I verify that:
- Each season contains exactly one labeled ROY winner.
- Feature columns and season tags are correctly populated.

This step ensures dataset integrity before moving to feature selection and modeling.

In [None]:
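# Every season should have exactly one labeled ROY winner; fail loudly otherwise.
winners_per_season = df_train.groupby("SEASON")["IS_ROY_WINNER"].sum()
assert (winners_per_season == 1).all(), winners_per_season[winners_per_season != 1]
df_train.head()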

Unnamed: 0,PLAYER_ID,PLAYER_NAME,NICKNAME,TEAM_ID,TEAM_ABBREVIATION,AGE,GP,W,L,W_PCT,MIN,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,TOV,STL,BLK,BLKA,PF,PFD,PTS,PLUS_MINUS,NBA_FANTASY_PTS,DD2,TD3,WNBA_FANTASY_PTS,GP_RANK,W_RANK,L_RANK,W_PCT_RANK,MIN_RANK,FGM_RANK,FGA_RANK,FG_PCT_RANK,FG3M_RANK,FG3A_RANK,FG3_PCT_RANK,FTM_RANK,FTA_RANK,FT_PCT_RANK,OREB_RANK,DREB_RANK,REB_RANK,AST_RANK,TOV_RANK,STL_RANK,BLK_RANK,BLKA_RANK,PF_RANK,PFD_RANK,PTS_RANK,PLUS_MINUS_RANK,NBA_FANTASY_PTS_RANK,DD2_RANK,TD3_RANK,WNBA_FANTASY_PTS_RANK,TEAM_COUNT,SEASON,IS_ROY_WINNER
0,202329,Al-Farouq Aminu,Al-Farouq,1610612746,LAC,20.0,81,31,50,0.383,17.9,2.0,5.0,0.394,0.6,1.8,0.315,1.1,1.5,0.773,0.9,2.4,3.3,0.7,1.3,0.7,0.3,0.5,1.5,1.1,5.6,-2.3,12.5,0,0,12.3,4,12,4,42,16,21,20,46,9,8,15,15,18,13,18,12,16,29,11,14,29,15,29,23,18,46,19,22,4,17,1,2010-11,0
1,202360,Andy Rautins,Andy,1610612752,NYK,24.0,5,2,3,0.4,4.8,0.6,1.4,0.429,0.2,0.8,0.25,0.2,0.4,0.5,0.0,0.2,0.2,0.6,1.4,0.2,0.0,0.0,0.0,0.2,1.6,-2.2,1.9,0,0,3.0,61,60,61,38,62,58,59,31,20,19,26,57,57,55,63,64,65,37,7,52,58,61,67,61,57,45,62,22,4,59,1,2010-11,0
2,202356,Armon Johnson,Armon,1610612757,POR,22.0,38,20,18,0.526,7.3,1.2,2.7,0.455,0.1,0.3,0.417,0.3,0.6,0.591,0.3,0.7,0.9,1.2,1.0,0.1,0.0,0.2,0.8,0.6,2.9,-0.6,5.3,0,0,5.5,38,27,37,23,50,39,43,22,22,31,4,53,49,46,48,53,56,20,19,56,57,47,55,48,41,19,52,22,4,49,1,2010-11,0
3,202340,Avery Bradley,Avery,1610612738,BOS,20.0,31,21,10,0.677,5.2,0.7,2.2,0.343,0.0,0.2,0.0,0.2,0.4,0.5,0.1,0.4,0.5,0.4,0.5,0.3,0.0,0.2,0.6,0.4,1.7,-1.8,3.3,0,0,3.2,41,25,50,11,58,53,51,56,35,40,35,58,58,55,57,60,61,50,45,42,58,38,58,57,56,38,58,22,4,58,1,2010-11,0
4,202386,Ben Uzoh,Ben,1610612751,NJN,23.0,42,6,36,0.143,10.4,1.5,3.4,0.424,0.1,0.2,0.375,0.8,1.3,0.589,0.7,0.7,1.5,1.6,0.6,0.3,0.2,0.4,0.7,0.9,3.8,-1.9,8.9,0,0,7.9,33,53,18,62,40,33,32,37,29,38,8,30,24,47,28,49,39,14,41,39,38,21,56,31,33,41,35,22,4,35,1,2010-11,0


## Feature Selection & Preprocessing

In this step, I select intuitive features known to correlate with early-career rookie impact (e.g., PTS, REB, AST, STL, BLK, efficiency metrics, and team win percentage).

Because these features span different scales, I standardize them using `StandardScaler` to prepare the data for logistic regression.

In [None]:
FEATURE_COLS = [
    "GP","MIN","PTS","REB","AST","STL","BLK",
    "TOV","FG_PCT","FG3_PCT","FT_PCT","W_PCT"
]
FEATURE_COLS = [c for c in FEATURE_COLS if c in df_train.columns]

X = df_train[FEATURE_COLS].fillna(0)
y = df_train["IS_ROY_WINNER"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
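
Fitting the scaler on every row means that any rows later held out for validation have already influenced the scaling statistics. A common alternative is to put the scaler and the model in a single scikit-learn pipeline, so that the scaler is fit on the training rows only. The sketch below illustrates that pattern on synthetic data (random numbers, not NBA stats); the names `X_demo`, `y_demo`, and `pipe` are placeholders for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the rookie feature matrix (random numbers, not NBA data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training means and variances when scoring validation rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
val_probs = pipe.predict_proba(X_va)[:, 1]
print(val_probs.shape)
```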

## Training the Logistic Regression Model

I split the dataset into training and validation sets (stratified to preserve the extremely imbalanced positive class). A logistic regression model is then trained to predict the probability that a rookie becomes ROY.

Because winners are so rare, I do not expect high recall on the positive class, and overall accuracy is dominated by the non-winners. The goal is instead a ranking model that identifies comparatively stronger rookie profiles.

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, stratify=y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_val)
print("Val Accuracy:", accuracy_score(y_val, preds))
print(classification_report(y_val, preds))

Val Accuracy: 0.989247311827957
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       276
           1       0.50      0.33      0.40         3

    accuracy                           0.99       279
   macro avg       0.75      0.66      0.70       279
weighted avg       0.99      0.99      0.99       279
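
Since the model is meant to rank rookies within a season, one complementary check is to hold out a whole season at a time and see where the actual winner lands in that season's ranking. The sketch below is a minimal illustration of that idea using scikit-learn's `LeaveOneGroupOut` on synthetic data. The function `winner_ranks_by_season` and the toy rule used to pick each "winner" are assumptions for the example, not results from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def winner_ranks_by_season(X, y, seasons):
    """Hold out one season at a time; return the winner's rank in each held-out season."""
    ranks = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=seasons):
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])[:, 1]
        winner_prob = probs[y[test_idx] == 1][0]
        # Rank 1 means no other rookie in that season scored higher.
        ranks[str(seasons[test_idx][0])] = int((probs > winner_prob).sum()) + 1
    return ranks


# Synthetic data: 6 "seasons" of 30 rookies; the winner is the top scorer (toy rule).
rng = np.random.default_rng(1)
seasons = np.repeat([f"S{i}" for i in range(6)], 30)
X_demo = rng.normal(size=(180, 3))
y_demo = np.zeros(180, dtype=int)
for s in np.unique(seasons):
    idx = np.where(seasons == s)[0]
    y_demo[idx[np.argmax(X_demo[idx, 0])]] = 1

ranks = winner_ranks_by_season(X_demo, y_demo, seasons)
print(ranks)
```

With the real data, `X` would be the feature matrix, `y` the `IS_ROY_WINNER` column, and `seasons` the `SEASON` column, all as NumPy arrays.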



## Predicting ROY Probabilities for the Target Season

Using the trained model, I apply the pipeline to rookies from the target season (set to 2025–26). After scaling the new season's features, the model outputs a probability score for each rookie.

As an important note, these probabilities do not sum to 1 across the rookie class. The model scores each rookie independently, so each value is a per-player estimate rather than a share of a single award. Their primary purpose is ranking, not calibrated win percentages.

In [None]:
TARGET_SEASON = "2025-26"

df_future = get_rookie_stats_for_season(TARGET_SEASON)

X_future = df_future[FEATURE_COLS].fillna(0)
X_future_scaled = scaler.transform(X_future)

df_future["probability"] = model.predict_proba(X_future_scaled)[:,1]

predictions = df_future[["PLAYER_NAME","probability"]].rename(
    columns={"PLAYER_NAME":"player_name"}
).sort_values("probability", ascending=False).reset_index(drop=True)

predictions.head(10)

Fetching rookies for 2025-26...


Unnamed: 0,player_name,probability
0,Cooper Flagg,0.075395
1,Kon Knueppel,0.04408
2,VJ Edgecombe,0.043695
3,Jeremiah Fears,0.009865
4,Derik Queen,0.008401
5,Cedric Coward,0.005299
6,Dylan Harper,0.002673
7,Ryan Kalkbrenner,0.001259
8,Ace Bailey,0.000504
9,Egor Dëmin,0.000471
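
If values that sum to 1 are wanted for presentation, one simple option is to divide each score by the total within the season. The sketch below does this for the ten scores shown above. It is an illustrative assumption rather than part of the submitted pipeline: the full rookie class has more players than these ten, and the resulting shares are relative weights, not calibrated win probabilities.

```python
import pandas as pd

# Top-10 scores copied from the table above; the full class has more rookies.
top10 = pd.DataFrame({
    "player_name": [
        "Cooper Flagg", "Kon Knueppel", "VJ Edgecombe", "Jeremiah Fears",
        "Derik Queen", "Cedric Coward", "Dylan Harper", "Ryan Kalkbrenner",
        "Ace Bailey", "Egor Dëmin",
    ],
    "probability": [
        0.075395, 0.04408, 0.043695, 0.009865, 0.008401,
        0.005299, 0.002673, 0.001259, 0.000504, 0.000471,
    ],
})

# Divide by the total so the shares sum to 1; the ordering is unchanged.
top10["share"] = top10["probability"] / top10["probability"].sum()
print(top10[["player_name", "share"]].round(3))
```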


## Exporting the Required predictions.csv File

Finally, I save a two-column CSV file:
- `player_name`
- `probability`

This file represents the ranked ROY predictions and satisfies the submission requirements for the NBA Future Analytics Stars technical assessment.

In [None]:
predictions.to_csv("predictions.csv", index=False)
print("predictions.csv saved!")

predictions.csv saved!
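
Before submitting, it can be worth confirming that the file has exactly the two expected columns and that every probability lies between 0 and 1. The sketch below shows one way to do that. `check_predictions` is a hypothetical helper, and the sample frame uses placeholder names and values, round-tripped through in-memory CSV text so the example does not touch disk.

```python
import io

import pandas as pd


def check_predictions(df):
    """Raise AssertionError if the frame breaks the two-column submission format."""
    assert list(df.columns) == ["player_name", "probability"], "unexpected columns"
    assert df["player_name"].notna().all(), "missing player names"
    assert df["probability"].between(0, 1).all(), "probability outside [0, 1]"
    return True


# Round-trip a tiny placeholder frame through CSV text instead of a real file.
sample = pd.DataFrame({"player_name": ["Player A", "Player B"], "probability": [0.7, 0.1]})
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
ok = check_predictions(reloaded)
print(ok)
```

For the real file, the same check would be `check_predictions(pd.read_csv("predictions.csv"))`.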


## Conclusion

This notebook demonstrates a full, reproducible workflow for forecasting the NBA Rookie of the Year award using historical rookie performance data and a simple, interpretable modeling approach. The model provides a clear ranking of 2025–26 rookies based on statistical similarity to past ROY winners, and the required `predictions.csv` file has been generated accordingly.


I am also currently taking an Artificial Intelligence and Machine Learning class, where we have implemented a lot of the code I used in this notebook!

## Citations & External Resources

This project relies exclusively on publicly available data and widely used Python libraries. The following resources informed either the data acquisition process, the modeling workflow, or general implementation decisions:

### **NBA Data Sources**
- **NBA Stats API Documentation**  
  https://github.com/swar/nba_api  
  (Used to fetch historical and current rookie performance metrics.)

- **NBA Glossary – Definitions for Stats Fields**  
  https://www.nba.com/stats/help/glossary  
  (Used to interpret advanced and base statistical categories.)

### **Python Libraries**
- McKinney, W., et al. *pandas: powerful Python data analysis toolkit.*  
  https://pandas.pydata.org/

- Pedregosa, F., et al. *Scikit-learn: Machine Learning in Python.*  
  https://scikit-learn.org/

### **Machine Learning Methodology References**
- Hosmer, D. W., Lemeshow, S. *Applied Logistic Regression.* Wiley.  
  (General reference for logistic regression modeling.)

- He, H., Garcia, E. A. *Learning from Imbalanced Data.* IEEE TKDE.  
  (Background on challenges with rare-event classification such as ROY winners.)

### **General Web & Community Resources**
- **Stack Overflow**  
  https://stackoverflow.com/  
  (Consulted for isolated debugging questions and syntax clarifications, particularly related to `nba_api` usage and pandas dataframe operations.)

### **Public NBA Statistical Context**
- NBA.com – Historical Rookie of the Year winners & seasons  
  https://www.nba.com/news/history-rookie-of-the-year-winners