# NBA 2024-25: Utilizing Roles
## Notebook 06: Role-Based Analysis
This notebook computes the core role output and consistency metrics used in the final dashboard, including PRA Signal, All-Star Output Rate, Output per Role, and Output Consistency.

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [3]:
# Display options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", 160)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: f"{x:.2f}")

___
## Load

In [5]:
# Load game logs data and all-star baselines
game_logs = pd.read_parquet(r"C:\Users\dylan\OneDrive\Documents\Portfolio_Projects\NBA_2024_25_utilizing_roles\03_python_outputs\Merged_Player_Team_GameLogs_2024_25_final.parquet")
baselines = pd.read_parquet(r"C:\Users\dylan\OneDrive\Documents\Portfolio_Projects\NBA_2024_25_utilizing_roles\03_python_outputs\AS_baselines\NBA_Per_Game_2019_2024_baselines.parquet")

In [6]:
# Inspect game logs data
game_logs.sample(5)

Unnamed: 0,Player_Name,Player_ID,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FTA,TOV,REB,AST,PTS,TEAM_ABBREVIATION,TEAM_ID,TEAM_MIN,TEAM_FGA,TEAM_FTA,TEAM_TOV,USG%,PRA,Season,Age,Team,G,Pos,season_PTS,season_REB,season_AST,season_USG%,season_PRA
21422,Reed Sheppard,1642263,22400779,"Feb 12, 2025",HOU vs. PHX,W,11,2,2,0,0,1,1,5,HOU,1610612745,240,97,24,11,7.36,7,2024-25,20,HOU,52,PG,4.4,1.5,1.4,17.8,7.3
2467,Bradley Beal,203078,22400402,"Dec 23, 2024",PHX @ DEN,L,35,9,17,0,2,0,2,23,PHX,1610612756,240,83,15,15,24.91,25,2024-25,31,PHO,53,SG,17.0,3.3,3.7,22.1,24.0
12931,Johnny Juzang,1630548,22401020,"Mar 21, 2025",UTA vs. BOS,L,22,4,8,1,1,2,1,11,UTA,1610612762,240,84,12,17,19.38,14,2024-25,23,UTA,64,SG,8.9,2.9,1.1,17.6,12.9
10250,Jake LaRavia,1631222,22400843,"Feb 26, 2025",SAC @ UTA,W,15,0,3,0,1,0,0,0,SAC,1610612758,240,87,12,16,11.82,0,2024-25,23,2TM,66,PF,6.9,3.9,2.4,14.4,13.2
16684,Larry Nance Jr.,1626204,22400736,"Feb 07, 2025",ATL vs. MIL,W,20,4,8,0,1,1,4,11,ATL,1610612737,240,92,18,11,19.47,16,2024-25,32,ATL,24,PF,8.5,4.3,1.6,15.7,14.4


___
## 1) Filters and Cutoff Logic
#### a) Minutes played (minimum)

In [8]:
# Minutes played (per player) stats
game_logs["MIN"].describe(percentiles=[0.10, 0.20, 0.25, 0.30, 0.35, 0.40])

count   26306.00
mean       22.57
std        10.85
min         0.00
10%         6.00
20%        12.00
25%        15.00
30%        17.00
35%        19.00
40%        20.00
50%        24.00
max        53.00
Name: MIN, dtype: float64

> For a player's game to be included in this study, the player must be on the court for at least 12 minutes (the 20th percentile of all minutes-played values).
>
> This ensures that enough game activity occurs for meaningful USG% and PRA patterns to unfold.

In [10]:
# Filter for only (meaningful) game logs with 12+ minutes played
meaningful_game_logs = game_logs[game_logs["MIN"] >= 12]

#### b) Games Played (minimum)

In [12]:
# Games (per player) stats
games_per_player = meaningful_game_logs.groupby("Player_ID").size().reset_index(name="games_played")
games_per_player["games_played"].describe()

count   533.00
mean     40.32
std      26.08
min       1.00
25%      15.00
50%      42.00
75%      64.00
max      82.00
Name: games_played, dtype: float64

> A full regular season is 82 games. To qualify for this study, a player must have at least 20 qualifying games (games with 12+ minutes played), which is roughly one quarter of the regular season.
>
> With this season-long filter, enough game-to-game activity occurs for consistency metrics to become meaningful.
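
The two cutoffs above are applied in sequence, so a player's game count only includes games that already passed the minutes filter. A minimal, dependency-free sketch of that ordering on made-up data (the toy records and the `MIN_GAMES = 2` stand-in are assumptions for illustration, not values from the dataset):

```python
from collections import Counter

# Toy game logs: (player_id, minutes). Values are made up for illustration.
toy_logs = [
    ("A", 30), ("A", 8), ("A", 25),
    ("B", 14), ("B", 5),
    ("C", 12), ("C", 20), ("C", 33),
]

MIN_MINUTES = 12  # same per-game cutoff as the notebook
MIN_GAMES = 2     # toy stand-in for the notebook's 20-game cutoff

# Step 1: keep only games with enough minutes played.
meaningful = [(pid, mins) for pid, mins in toy_logs if mins >= MIN_MINUTES]

# Step 2: count the surviving games per player.
games_played = Counter(pid for pid, _ in meaningful)

# Step 3: keep only players with enough surviving games.
qualified = sorted(pid for pid, n in games_played.items() if n >= MIN_GAMES)

print(dict(games_played))
print(qualified)
```

Player B appears in two raw games but only one survives the minutes filter, so B drops out at the games-played step.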

In [14]:
# Merge into meaningful game logs
meaningful_game_logs = meaningful_game_logs.merge(
    games_per_player,
    on="Player_ID",
    how="left"
)

# Filter meaningful game logs for only players with 20+ games played
min_games = 20
meaningful_game_logs_filtered = meaningful_game_logs[meaningful_game_logs["games_played"] >= min_games].copy()

___
## 2) Import All-Star Baselines

In [16]:
# USG% baseline
AS_USG_baseline = baselines.loc[
    baselines["Type"] == "All-Star", "USG_baseline"
].values[0]

print(f"All-Star USG% baseline: {AS_USG_baseline}%")

All-Star USG% baseline: 29.3%


In [17]:
# PRA baseline
AS_PRA_baseline = baselines.loc[
    baselines["Type"] == "All-Star", "PRA_baseline"
].values[0]

print(f"All-Star PRA baseline: {AS_PRA_baseline} Points + Rebounds + Assists")

All-Star PRA baseline: 37.8 Points + Rebounds + Assists


___
## 3) Regression Analysis
#### a) Build Regression Dataset
We want to see **how PRA changes** in response to a **change in USG%**. In other words, how *elastic* is PRA?
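
One way to write this question down is as a log-log model. The symbols below ($\alpha$ for the intercept, $\beta$ for the slope, $\varepsilon$ for the error term, and $\mathrm{USG}$ for USG%) are notation assumed here for illustration:

$$
\ln(\mathrm{PRA}) = \alpha + \beta \, \ln(\mathrm{USG}) + \varepsilon
\quad\Longrightarrow\quad
\beta = \frac{\partial \ln(\mathrm{PRA})}{\partial \ln(\mathrm{USG})}
$$

Because both sides are in logs, $\beta$ reads directly as the approximate percent change in PRA associated with a 1% change in USG%.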

In [19]:
# Only need USG% and PRA columns
reg = meaningful_game_logs_filtered[["USG%", "PRA"]].copy()

# Remove rows where log would break (zeroes or negatives)
reg = reg[
    (reg["USG%"] > 0) & (reg["PRA"] > 0)
].copy()

print(f"Regression dataset: {reg.shape[0]:,} rows | {reg.shape[1]} columns")

Regression dataset: 20,141 rows | 2 columns


In [20]:
# Regression dataset preview
reg.sample(10)

Unnamed: 0,USG%,PRA
3967,24.94,10
17507,13.45,11
19946,16.87,26
1052,23.14,10
16604,18.21,10
15321,14.37,11
7414,6.66,9
9016,19.49,22
8124,14.84,14
18444,16.39,18


#### b) Fit the Log-Log Regression Model

In [22]:
# Log-transform both variables (USG% and PRA)
reg["log_USG"] = np.log(reg["USG%"])
reg["log_PRA"] = np.log(reg["PRA"])

In [23]:
# Regression setup
X = sm.add_constant(reg["log_USG"])  # single predictor (log USG%) plus an intercept column
y = reg["log_PRA"]                   # dependent variable (log PRA)

In [24]:
# Fit log-log model
model = sm.OLS(y, X).fit()

alpha = model.params["const"]   # intercept
beta = model.params["log_USG"]  # elasticity (how strongly PRA responds to USG%)

alpha, beta

(0.27541835726159747, 0.8984115299311776)

> Therefore, a **1% relative increase in USG%** (e.g., from 20.0% to 20.2%) is associated with roughly a **0.898% increase in PRA**. The relationship is slightly less than proportional (1:1), but it's close.
>
> ##### The slope coefficient, 0.898, is the elasticity of PRA with respect to USG%.
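
As a quick sanity check on that interpretation, the fitted coefficients can be plugged back into the model by hand. This is a minimal sketch using only the standard library; `predicted_pra` is a hypothetical helper name, and the 20% starting usage is an arbitrary choice:

```python
import math

# Coefficients copied from the fitted model output above.
alpha = 0.27541835726159747  # intercept
beta = 0.8984115299311776    # elasticity

def predicted_pra(usg_pct: float) -> float:
    """Hypothetical helper: PRA implied by the log-log fit, exp(alpha) * usg_pct ** beta."""
    return math.exp(alpha) * usg_pct ** beta

base = predicted_pra(20.0)
bumped = predicted_pra(20.0 * 1.01)  # usage raised by 1% in relative terms (20.0 -> 20.2)

pct_change = (bumped / base - 1) * 100
print(f"{pct_change:.3f}% change in predicted PRA")
```

The intercept cancels in the ratio, so the result depends only on `beta`, which is why the elasticity can be read straight off the slope.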

___
## 4) Calculating PRA Signal

**PRA Signal** is a role-adjusted version of a player's **PRA**.

It rescales a player's raw (or actual) PRA to the **All-Star usage baseline**, which is **29.3%** (from Step 2). For any game where a player's USG% is below 29.3%, their raw PRA is **scaled upward** using the elasticity (from Step 3). Games at or above the baseline keep their raw PRA.

> For example, for a game where a player's USG% is 20% and PRA is 16:
>
> > The usage ratio is 29.3% / 20.0% = 1.465.
> >
> > The elasticity is applied: 1.465^(0.898) ≈ 1.409
> >
> > 16 PRA * 1.409 ≈ 22.5 (PRA Signal)
> >
> > *This will be better explained in the README.*
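
The scaling rule can also be sketched as a small standalone function. This is an illustrative version under the definitions above, not the notebook's own cell; `pra_signal` is a hypothetical helper name, and the inputs are taken from two rows of the results preview shown later in this section:

```python
AS_USG_BASELINE = 29.3     # All-Star USG% baseline from Step 2
BETA = 0.8984115299311776  # elasticity from Step 3

def pra_signal(usg: float, pra: float) -> float:
    """Hypothetical helper: scale raw PRA up to the All-Star usage baseline."""
    if usg >= AS_USG_BASELINE:
        return float(pra)    # at or above the baseline: no scaling
    if usg <= 0 or pra <= 0:
        return float("nan")  # scaling is undefined for non-positive inputs
    return pra * (AS_USG_BASELINE / usg) ** BETA

# Inputs taken from two rows of the results preview in this notebook.
kuzma = round(pra_signal(19.15, 22), 1)       # Kyle Kuzma, Feb 23, 2025
cunningham = round(pra_signal(38.27, 55), 1)  # Cade Cunningham, Jan 04, 2025
print(kuzma, cunningham)
```

These reproduce the `PRA_signal` values of 32.2 and 55.0 in the preview table: the below-baseline game is scaled up, while the above-baseline game is left unchanged.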

In [28]:
# --- Calculate PRA signal for every game ---
def scale_pra(row, AS_USG, beta):
    usg = row["USG%"]
    pra = row["PRA"]

    if pd.isna(usg) or pd.isna(pra):
        return np.nan

    if usg >= AS_USG:
        return pra
    
    if usg > 0 and pra > 0:
        scale = (AS_USG / usg) ** beta
        return pra * scale
    else:
        return np.nan

meaningful_game_logs_filtered["PRA_signal"] = meaningful_game_logs_filtered.apply(
    lambda r: scale_pra(r, AS_USG_baseline, beta),
    axis=1
)

meaningful_game_logs_filtered["PRA_signal"] = meaningful_game_logs_filtered["PRA_signal"].round(1)

In [29]:
# Calculate output per role (i.e., output per usage)
meaningful_game_logs_filtered["OPR"] = np.where(
    meaningful_game_logs_filtered["USG%"] > 0,
    meaningful_game_logs_filtered["PRA"] / meaningful_game_logs_filtered["USG%"],
    np.nan
)

In [30]:
# Preview results
meaningful_game_logs_filtered[["Player_Name", "GAME_DATE", "USG%", "PRA", "PRA_signal", "OPR"]].sample(10)

Unnamed: 0,Player_Name,GAME_DATE,USG%,PRA,PRA_signal,OPR
21275,Zach Edey,"Apr 11, 2025",10.05,25,65.4,2.49
2726,Cade Cunningham,"Jan 04, 2025",38.27,55,55.0,1.44
1114,Anthony Black,"Oct 28, 2024",13.34,14,28.4,1.05
14995,Mike Conley,"Jan 25, 2025",7.1,15,53.6,2.11
11368,Julian Champagnie,"Jan 08, 2025",18.91,13,19.3,0.69
14937,Mikal Bridges,"Mar 12, 2025",29.23,41,41.1,1.4
13417,Kyle Kuzma,"Feb 23, 2025",19.15,22,32.2,1.15
20351,Ty Jerome,"Mar 09, 2025",23.66,11,13.3,0.46
2541,Bub Carrington,"Mar 15, 2025",21.43,20,26.5,0.93
9506,Jarred Vanderbilt,"Feb 10, 2025",10.0,3,7.9,0.3


___
## 5) Flag Games with All-Star Level Performances

In [32]:
# Flag projected all-star-level games
meaningful_game_logs_filtered["is_AS_level"] = (meaningful_game_logs_filtered["PRA_signal"] >= AS_PRA_baseline).astype(int)

In [33]:
# Preview results
meaningful_game_logs_filtered[["Player_Name", "TEAM_ABBREVIATION", "GAME_DATE", "USG%", "PRA", "PRA_signal", "is_AS_level"]].sample(5)

Unnamed: 0,Player_Name,TEAM_ABBREVIATION,GAME_DATE,USG%,PRA,PRA_signal,is_AS_level
14545,Matas Buzelis,CHI,"Dec 02, 2024",20.23,26,36.3,0
18401,Scottie Barnes,TOR,"Jan 29, 2025",32.49,32,32.0,0
18623,Shai Gilgeous-Alexander,OKC,"Dec 05, 2024",38.13,39,39.0,1
16952,Patrick Williams,CHI,"Mar 24, 2025",26.09,12,13.3,0
5145,Dennis Schröder,DET,"Feb 11, 2025",21.6,10,13.1,0


> Any game where the player's `PRA_signal` is greater than or equal to **37.8 (the all-star threshold)** is flagged as an **all-star level** game.

___
## 6) Season-Level Summary

In [36]:
# Create a season-level summary DataFrame
season_summary = (
    meaningful_game_logs_filtered.groupby("Player_ID").agg(
        # Game-level aggregations
        games_played=("GAME_ID", "count"),
        AS_level_games=("is_AS_level", "sum"),
        avg_pra_signal=("PRA_signal", "mean"),
        # Season-level context
        season_USG=("season_USG%", "first"),
        season_PRA=("season_PRA", "first"),
        age=("Age", "first"),
        team=("Team", "first"),
        pos=("Pos", "first"),
        season_PTS=("season_PTS", "first"),
        season_REB=("season_REB", "first"),
        season_AST=("season_AST", "first")
    ).reset_index()
)

In [37]:
# Find the share of games in which each player reached the all-star baseline (i.e., the all-star output rate)
season_summary["AS_output_rate"] = season_summary["AS_level_games"] / season_summary["games_played"]

In [38]:
# Find output per role for each player's entire season
season_summary["season_OPR"] = season_summary["season_PRA"] / season_summary["season_USG"]

In [39]:
# Merge Player_Name column into this new DataFrame
season_summary = season_summary.merge(
    meaningful_game_logs_filtered[["Player_ID", "Player_Name"]].drop_duplicates(),
    on="Player_ID",
    how="left"
)

In [40]:
# Reorder columns so that `Player_Name` column is first
cols = ["Player_Name"] + [c for c in season_summary.columns if c != "Player_Name"]
season_summary = season_summary[cols]

In [41]:
# Usage % (season-level) stats
meaningful_game_logs_filtered["season_USG%"].describe(percentiles=[0.10, 0.20, 0.25, 0.75, 0.80, 0.90])

count   20196.00
mean       19.47
std         5.55
min         7.60
10%        13.00
20%        14.70
25%        15.40
50%        18.50
75%        23.30
80%        24.10
90%        27.70
max        35.90
Name: season_USG%, dtype: float64

In [42]:
# Create season-level usage cohorts
def categorize_usage(usg):
    if usg <= 15.0:
        return "Low Usage"
    elif usg < 23.0:
        return "Medium Usage"
    else:
        return "High Usage"

season_summary["USG_cohort"] = season_summary["season_USG"].apply(categorize_usage)

In [43]:
# Calculate season-level output consistency for each player
season_summary["OC"] = np.sqrt(
    season_summary["AS_output_rate"] * season_summary["season_OPR"]
)

> A player's **output consistency** (OC) is the geometric mean of their **all-star output rate** and their season-level **output per role**.
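
To make the formula concrete, here is a minimal standalone sketch. `output_consistency` is a hypothetical helper name, and the inputs are taken from Nikola Jokić's row in the season summary in this notebook:

```python
import math

def output_consistency(as_output_rate: float, season_opr: float) -> float:
    """Hypothetical helper: sqrt of All-Star output rate times season output per role."""
    return math.sqrt(as_output_rate * season_opr)

# Nikola Jokić's row from the season summary: 67 All-Star-level games out of 70,
# season PRA of 52.5 at a season USG% of 29.5.
rate = 67 / 70
opr = 52.5 / 29.5
jokic_oc = output_consistency(rate, opr)
print(f"{jokic_oc:.3f}")
```

Because the two inputs are multiplied before the square root, a player with no All-Star-level games gets an OC of 0 regardless of their output per role, and a high score requires both components to be strong.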

In [45]:
# Preview results
# season_summary.sort_values("AS_output_rate", ascending=False).head(10)
# season_summary.sort_values("season_OPR", ascending=False).head(10)
season_summary.sort_values("OC", ascending=False).head(10)

Unnamed: 0,Player_Name,Player_ID,games_played,AS_level_games,avg_pra_signal,season_USG,season_PRA,age,team,pos,season_PTS,season_REB,season_AST,AS_output_rate,season_OPR,USG_cohort,OC
59,Nikola Jokić,203999,70,67,57.47,29.5,52.5,29,DEN,C,29.6,12.7,10.2,0.96,1.78,High Usage,1.31
80,Domantas Sabonis,1627734,69,64,53.74,21.6,39.0,28,SAC,C,19.1,13.9,6.0,0.93,1.81,Medium Usage,1.29
38,Rudy Gobert,203497,72,61,53.34,13.0,24.7,32,MIN,C,12.0,10.9,1.8,0.85,1.9,Low Usage,1.27
116,Josh Hart,1628404,77,64,54.44,15.3,29.1,29,NYK,SG,13.6,9.6,5.9,0.83,1.9,Medium Usage,1.26
288,Walker Kessler,1631117,58,45,51.48,13.7,25.0,23,UTA,C,11.1,12.2,1.7,0.78,1.82,Low Usage,1.19
41,Giannis Antetokounmpo,203507,67,63,49.49,35.2,48.8,30,MIL,PF,30.4,11.9,6.5,0.94,1.39,High Usage,1.14
95,Ivica Zubac,1627826,80,61,46.99,19.5,32.1,27,LAC,C,16.8,12.6,2.7,0.76,1.65,Medium Usage,1.12
239,Jalen Johnson,1630552,35,28,45.07,22.5,33.9,23,ATL,SF,18.9,10.0,5.0,0.8,1.51,Medium Usage,1.1
65,Karl-Anthony Towns,1626157,72,55,44.98,27.4,40.3,29,NYK,C,24.4,12.8,3.1,0.76,1.47,High Usage,1.06
279,Jalen Duren,1631105,75,54,44.5,16.4,24.8,21,DET,C,11.8,10.3,2.7,0.72,1.51,Medium Usage,1.04


In [79]:
# OC statistics
print(f"Average OC: {season_summary['OC'].mean():.3f}")
print(f"Minimum OC: {season_summary['OC'].min():.3f}")
print(f"Maximum OC: {season_summary['OC'].max():.3f}")

Average OC: 0.479
Minimum OC: 0.000
Maximum OC: 1.305


___
## Save

In [None]:
# Save to CSV
meaningful_game_logs_filtered.to_csv("NBA_2024_25_game_logs_final.csv", index=False)
season_summary.to_csv("player_season_projections.csv", index=False)