**NCAA to NBA Draft Predictive Model**

Predicting NBA Draft outcomes using NCAA basketball statistics and machine learning.

# Step 1: Data Extraction (BigQuery/SQL)

The following query aggregates game-level statistics into season averages for each player:

```sql
SELECT
full_name,
season,

COUNT(starter) AS games_started,
COUNT(game_id) AS games_played,

ROUND(AVG(turnovers), 1) AS avg_turnovers,
ROUND(AVG(steals), 1) AS avg_steals,
ROUND(AVG(two_points_made), 1) AS avg_twos_made,
ROUND(AVG(two_points_pct), 1) AS avg_two_twos_pct,
ROUND(AVG(three_points_made),1) AS avg_threes_made,
ROUND(AVG(three_points_pct), 1) AS avg_threes_pct,
ROUND(AVG(free_throws_made), 1) AS avg_free_throws_made,
ROUND(AVG(free_throws_pct),1) AS avg_free_throws_pct,
ROUND(AVG(minutes_int64), 1) AS avg_minutes,
ROUND(AVG(points), 1) AS avg_points,
ROUND(AVG(rebounds), 1) AS avg_rebounds,
ROUND(AVG(assists), 1) AS avg_assists,
FROM
bigquery-public-data.ncaa_basketball.mbb_players_games_sr
GROUP BY
full_name,
season
HAVING
games_played >= 20
```

Importing Libraries:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


# Step 2: Data Loading and Exploration

Loading both Datasets:

- **NCAA data** (`ncaa_data_v1.csv`): Player statistics exported from BigQuery
- **NBA draft data** (`nba_draft_data.csv`): Draft picks by year from Kaggle

Before merging, I wanted to examine column schema for both datasets and preview for any obvious data consistency issues.

In [2]:
ncaa = pd.read_csv('ncaa_data_v1.csv')
ncaa.head()

Unnamed: 0,full_name,season,games_started,games_played,avg_turnovers,avg_steals,avg_twos_made,avg_two_twos_pct,avg_threes_made,avg_threes_pct,avg_free_throws_made,avg_free_throws_pct,avg_minutes,avg_points,avg_rebounds,avg_assists
0,Justin Johnson,2017,39,39,1.5,0.7,5.0,54.2,1.2,39.8,1.8,35.9,34.1,15.4,9.3,1.2
1,Jalen Hayes,2017,34,34,2.3,1.0,6.9,52.9,0.2,12.8,3.9,75.9,34.8,18.2,7.9,1.9
2,Mike Thomas,2017,31,31,1.6,0.2,4.6,56.1,0.2,11.5,3.8,59.6,23.3,13.6,6.1,1.1
3,Chris Lewis,2017,33,33,1.8,0.5,5.3,61.1,0.0,0.0,2.1,54.3,26.2,12.6,5.7,1.2
4,Hassan Attia,2016,32,32,2.2,0.9,2.9,52.6,0.0,0.0,0.9,24.7,20.6,6.6,6.4,0.5


In [3]:
nba = pd.read_csv('nba_draft_data.csv')
nba.head()

Unnamed: 0.1,Unnamed: 0,Rk,Pk,Tm,Player,College,Yrs,G,TOTMP,TOTPTS,...,WS/48,BPM,VORP,DraftYr,MPG,PPG,RPG,APG,playerurl,DraftYear
0,1,1,1,BRK,Derrick Coleman,Syracuse,15.0,781.0,25903.0,12884.0,...,0.119,1.4,22.3,1990,33.2,16.5,9.3,2.5,https://www.sports-reference.com/cbb/players/d...,1990
1,2,2,2,OKC,Gary Payton,Oregon State,17.0,1335.0,47117.0,21813.0,...,0.148,3.3,62.5,1990,35.3,16.3,3.9,6.7,https://www.sports-reference.com/cbb/players/g...,1990
2,3,3,3,DEN,Mahmoud Abdul-Rauf,LSU,9.0,586.0,15628.0,8553.0,...,0.077,-0.8,4.5,1990,26.7,14.6,1.9,3.5,https://www.sports-reference.com/cbb/players/m...,1990
3,4,4,4,ORL,Dennis Scott,Georgia Tech,10.0,629.0,17983.0,8094.0,...,0.089,0.2,9.9,1990,28.6,12.9,2.8,2.1,https://www.sports-reference.com/cbb/players/d...,1990
4,5,5,5,CHA,Kendall Gill,Illinois,15.0,966.0,29481.0,12914.0,...,0.078,0.1,15.8,1990,30.5,13.4,4.1,3.0,https://www.sports-reference.com/cbb/players/k...,1990


# Step 3: Data Cleaning

### NBA Data
With `Pk` (Draft Pick #) as our target variable, I filtered the NBA data to only include draft pick number, player name, and draft year. Post-draft NBA performance is irrelevant to the model - only their NCAA statistics matter for predicting draft outcomes.

### NCAA Data  
Kept only each player's final college season to avoid duplicates. Many players appear multiple times (freshman, sophomore, junior, senior), but we only want their last season stats when predicting draft eligibility.

In [4]:
nba = nba[['Pk','Player','DraftYear']]
nba = nba.rename(columns={'Player': 'full_name'})
final_seasons = ncaa.groupby('full_name')['season'].max().reset_index()
final_seasons.columns = ['full_name', 'final_season']

ncaa_final = ncaa.merge(final_seasons, on='full_name')
ncaa_final = ncaa_final[ncaa_final['season'] == ncaa_final['final_season']]
ncaa_final = ncaa_final.drop(columns=['final_season'])


# Step 4: Data Merging

### Challenge: Timeline Alignment
When first joining datasets on player name and year, the amount of matches (players drafted) was unrealistically low, indicating an issue with how I was matching on year.

**The issue:** NCAA seasons are labeled by their ending year (e.g., the 2015-16 season is labeled as "2016"), while the NBA draft occurs the following summer (June 2017 for players finishing in spring 2016).

**Solution:** Created a `draft_year` column by adding 1 to the NCAA season year, aligning each player's final college season with their eligible draft year.

Merging: I merged on two keys: `full_name` and `draft_year`. Name alone isn't sufficient since players can appear across multiple seasons, and year alone would match unrelated players.

**Join**: I used a left join to avoid omitting undrafted ncaa players, a class needed for classification.


In [5]:

# Draft year creation with +1 offset
ncaa_final['draft_year'] = ncaa_final['season'] + 1

# Merge the ncaa data with nba draft data
merged = ncaa_final.merge(
    nba,
    left_on=['full_name', 'draft_year'],
    right_on=['full_name', 'DraftYear'],
    how='left'
)
merged = merged.drop(columns=['DraftYear'])


# Step 5: Data Filtering

After merging and examining results, I had over 26,000 undrafted players. This created a severe class imbalance issue that I had to address.

**Solution**: >= 8 PPG : Prospect

I tested multiple PPG thresholds (5, 8, 10, 12) and selected 8 PPG as it balanced maintaining a realistic prospect pool with manageable class imbalance. I also kept all drafted players averaging less than 8 PPG to preserve the positive class.

In [6]:
merged = merged[
    (merged['avg_points'] >= 8.0) |  # Prospects score 8+ PPG
    (merged['Pk'].notna())  # OR actually got drafted
]

#Check results
print(f"\nFinal dataset:")
print(f"Total players: {len(merged)}")
print(f"Drafted: {merged['Pk'].notna().sum()}")
print(f"Undrafted: {merged['Pk'].isna().sum()}")
print(f"Ratio: 1:{merged['Pk'].isna().sum() / merged['Pk'].notna().sum():.1f}")

print("\nSample drafted players:")
print(merged[merged['Pk'].notna()][['full_name', 'season', 'Pk', 'avg_points']].head(10))


Final dataset:
Total players: 3575
Drafted: 214
Undrafted: 3361
Ratio: 1:15.7

Sample drafted players:
              full_name  season    Pk  avg_points
45          Josh Okogie    2017  20.0        16.2
48       Richaun Holmes    2014  37.0        14.7
50   Chandler Hutchison    2017  22.0        20.0
54      Robert Williams    2017  27.0         9.1
76        Adreian Payne    2013  15.0        13.4
102    Cleanthony Early    2013  34.0        16.4
105      Cory Jefferson    2013  60.0        13.7
106        Josh Huestis    2013  29.0        11.2
108      Doug McDermott    2013  11.0        26.7
120     Justise Winslow    2014  10.0        12.6


# Step 6: Target Variable Creation:

Created a binary target variable `drafted`:
- `1` = Player was drafted (has a `Pk` value)
- `0` = Player was undrafted (`Pk` is NaN)

**Distribution:**
- 214 drafted (6%)
- 3,161 undrafted (94%)

This class imbalance reflects the reality of most college prospects not making it to the NBA.

In [7]:
# Target Variable Creation:

merged['drafted'] = (merged['Pk'].notna()).astype(int)

# Step 7: Feature Selection and Train/Test/Split

### Features Selected

Used 15 statistical features:
- **Volume stats:** Points, rebounds, assists, steals, turnovers
- **Efficiency:** Two-point %, three-point %, free-throw %
- **Opportunity:** Games played, games started, minutes

Excluded identifiers (player name, team, season) as these don't represent measurable performance patterns.

In [8]:
# Feature Creation:

feature_cols = ['games_started', 'games_played', 'avg_turnovers', 'avg_steals', 'avg_twos_made',
                'avg_two_twos_pct', 'avg_threes_made','avg_threes_pct','avg_free_throws_made',
                'avg_free_throws_made','avg_free_throws_pct', 'avg_minutes', 'avg_points', 'avg_rebounds',
                'avg_assists']

In [9]:
X = merged[feature_cols]
y = merged['drafted']

# Found no missing values
print(X.isnull().sum())


games_started           0
games_played            0
avg_turnovers           0
avg_steals              0
avg_twos_made           0
avg_two_twos_pct        0
avg_threes_made         0
avg_threes_pct          0
avg_free_throws_made    0
avg_free_throws_made    0
avg_free_throws_pct     0
avg_minutes             0
avg_points              0
avg_rebounds            0
avg_assists             0
dtype: int64


Split data 80/20:
- **Training (80%)**: Model learns patterns
- **Test (20%)**: Evaluates performance on unseen data

Used `stratify=y` to maintain the 6% drafted / 94% undrafted ratio in both sets, to keep samples for imbalanced data.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Ensures both sets have same drafted/undrafted ratio
)

## Step 8: Model Training

### Models Tested

Trained three classification models to compare performance:

1. **Logistic Regression** - Simple baseline model
2. **Random Forest** - Ensemble method that aptures non-linear patterns
3. **XGBoost** - Gradient boosting, typically high performance

### Handling Class Imbalance

Used `class_weight='balanced'` to handle the 1:15 drafted/undrafted imbalance. This tells the model to weight the minority class (drafted) more heavily during training, preventing it from predicting "undrafted" for everyone.

### Evaluation Metrics

- **Accuracy**: Overall correctness
- **Precision**: Of players predicted as drafted, how many actually were?
- **Recall**: Of actual drafted players, how many did we catch?
- **F1-Score**: Balance between precision and recall

In [11]:
lr_model = LogisticRegression(
    class_weight='balanced',  # Handle imbalance
    max_iter=1000,
    random_state=42
)

lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# Evaluate
print("\nAccuracy:", accuracy_score(y_test, y_pred_lr))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Undrafted', 'Drafted']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))


Accuracy: 0.806993006993007

Classification Report:
              precision    recall  f1-score   support

   Undrafted       0.98      0.81      0.89       672
     Drafted       0.20      0.77      0.32        43

    accuracy                           0.81       715
   macro avg       0.59      0.79      0.61       715
weighted avg       0.94      0.81      0.85       715


Confusion Matrix:
[[544 128]
 [ 10  33]]


In [12]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Handle imbalance
    random_state=42,
    max_depth=10
)

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate
print("\nAccuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Undrafted', 'Drafted']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

# Feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance.head(10))


Accuracy: 0.9384615384615385

Classification Report:
              precision    recall  f1-score   support

   Undrafted       0.96      0.97      0.97       672
     Drafted       0.48      0.37      0.42        43

    accuracy                           0.94       715
   macro avg       0.72      0.67      0.69       715
weighted avg       0.93      0.94      0.93       715


Confusion Matrix:
[[655  17]
 [ 27  16]]
                 feature  importance
1           games_played    0.154494
0          games_started    0.150951
12            avg_points    0.113114
13          avg_rebounds    0.078294
4          avg_twos_made    0.064646
10   avg_free_throws_pct    0.056759
8   avg_free_throws_made    0.056058
11           avg_minutes    0.051536
9   avg_free_throws_made    0.049214
7         avg_threes_pct    0.045240


In [13]:
from xgboost import XGBClassifier
xgb_model = XGBClassifier(
    scale_pos_weight=len(y_train[y_train==0]) / len(y_train[y_train==1]),
    max_depth=5,
    learning_rate=0.1,
    n_estimators=100,
    random_state=42,
    enable_categorical=False  # Add this
)

xgb_model.fit(X_train.values, y_train.values)  # Convert to numpy with .values
y_pred_xgb = xgb_model.predict(X_test.values)

print("\nAccuracy:", accuracy_score(y_test, y_pred_xgb))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_xgb, target_names=['Undrafted', 'Drafted']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_xgb))


Accuracy: 0.9272727272727272

Classification Report:
              precision    recall  f1-score   support

   Undrafted       0.97      0.95      0.96       672
     Drafted       0.42      0.53      0.47        43

    accuracy                           0.93       715
   macro avg       0.69      0.74      0.72       715
weighted avg       0.94      0.93      0.93       715


Confusion Matrix:
[[640  32]
 [ 20  23]]


## Step 9: Results and Model Comparison

### Model Performance Summary

| Model | Accuracy | Precision (Drafted) | Recall (Drafted) | F1-Score |
|-------|----------|---------------------|------------------|----------|
| Logistic Regression | 81% | 20% | 77% | 0.32 |
| Random Forest | 94% | 48% | 37% | 0.42 |
| **XGBoost** | **93%** | **42%** | **53%** | **0.47** |

### Model Selection

**Selected XGBoost as the final model** for its balanced performance. Unlike high-stakes scenarios like cancer screening where false negatives are catastrophic, draft prediction allows for a more balanced approach. XGBoost successfully identifies 53% of drafted players while maintaining manageable false positives (32), creating a practical scouting list of 55 players versus the full 3,375 dataset.

### Feature Importance Analysis

**Top 5 Features:**
1. Games Played (15.4%)
2. Games Started (15.1%)
3. Average Points (11.3%)
4. Average Rebounds (7.8%)
5. Two-Pointers Made (6.5%)

**Key Insights:**
- Durability and consistency (games played/started) combined for 30% of importance, outweighing pure scoring ability
- Three-point percentage ranked lowest in importance, possibly reflecting the 2013-2017 timeframe before the modern NBA's emphasis on 3 point shooting
- Raw output matters less than volume of opportunity and proven reliability

### Limitations

The model cannot capture intangible factors critical to NBA success:
- **Physical attributes:** Height, wingspan, athleticism, speed
- **Competition level:** Strength of schedule, opponent quality
- **Intangibles:** Work ethic, nutrition, rest, coachability
- **Development potential:** Age, improvement trajectory, "upside"

### Future Improvements

**Enhanced features:**
- Strength of schedule adjustments
- Physical measurements (combine data)
- Advanced efficiency metrics
- Improvement trends from year-to-year

**Modeling refinements:**
- Hyperparameter tuning
- Ensemble approaches combining multiple models