# NBA 2024-25: Utilizing Roles
## Notebook 06: Role-Based Analysis
This notebook computes the core role output and consistency metrics used in the final dashboard, including PRA Signal, All-Star Output Rate, Output per Role, and Output Consistency.

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [3]:
# Display options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", 160)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: f"{x:.2f}")

___
## Load

In [5]:
# Load game logs data and all-star baselines
game_logs = pd.read_parquet(r"C:\Users\dylan\OneDrive\Documents\Portfolio_Projects\NBA_2024_25_utilizing_roles\03_python_outputs\Merged_Player_Team_GameLogs_2024_25_final.parquet")
baselines = pd.read_parquet(r"C:\Users\dylan\OneDrive\Documents\Portfolio_Projects\NBA_2024_25_utilizing_roles\03_python_outputs\AS_baselines\NBA_Per_Game_2019_2024_baselines.parquet")

In [6]:
# Inspect game logs data
game_logs.sample(5)

Unnamed: 0,Player_Name,Player_ID,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,FGA,FTA,TOV,REB,AST,PTS,TEAM_ABBREVIATION,TEAM_ID,TEAM_MIN,TEAM_FGA,TEAM_FTA,TEAM_TOV,USG%,PRA,Season,Age,Team,G,Pos,season_PTS,season_REB,season_AST,season_USG%,season_PRA
19519,Nikola Jokić,203999,22400058,"Dec 03, 2024",DEN vs. GSW,W,40,14,24,9,4,10,6,38,DEN,1610612743,240,79,35,15,35.06,54,2024-25,29,DEN,70,C,29.6,12.7,10.2,29.5,52.5
20696,Patrick Williams,1630172,22400933,"Mar 10, 2025",CHI vs. IND,W,13,1,3,2,3,3,1,2,CHI,1610612741,240,98,14,18,20.79,6,2024-25,23,CHI,63,PF,9.0,3.8,2.0,16.9,14.8
1709,Ausar Thompson,1641709,22401144,"Apr 07, 2025",DET vs. SAC,L,29,7,13,0,0,4,5,15,DET,1610612765,240,89,28,9,19.5,24,2024-25,22,DET,59,SF,10.1,5.1,2.3,19.1,17.5
21104,Quentin Grimes,1629656,22400474,"Jan 03, 2025",DAL vs. CLE,L,33,9,14,7,2,2,6,26,DAL,1610612742,240,100,30,12,22.17,34,2024-25,24,2TM,75,SG,14.6,4.3,3.0,22.8,21.9
6554,Devin Booker,1626164,22400039,"Nov 26, 2024",PHX vs. LAL,W,34,10,17,4,3,2,10,26,PHX,1610612756,240,99,10,9,27.33,38,2024-25,28,PHO,75,SG,25.6,4.1,7.1,29.3,36.8


___
## 1) Filters and Cutoff Logic
#### a) Minutes played (minimum)

In [8]:
# Minutes played per game log (one row per player-game)
game_logs["MIN"].describe(percentiles=[0.10, 0.20, 0.25, 0.30, 0.35, 0.40])

count   26306.00
mean       22.57
std        10.85
min         0.00
10%         6.00
20%        12.00
25%        15.00
30%        17.00
35%        19.00
40%        20.00
50%        24.00
max        53.00
Name: MIN, dtype: float64

> A game is included in this study only if the player was on the court for at least 12 minutes (the 20th percentile of all minutes-played values).
>
> This ensures that enough game activity occurs for meaningful USG% and PRA patterns to unfold.

In [10]:
# Filter for only (meaningful) game logs with 12+ minutes played
meaningful_game_logs = game_logs[game_logs["MIN"] >= 12]

#### b) Games Played (minimum)

In [12]:
# Games (per player) stats
games_per_player = meaningful_game_logs.groupby("Player_ID").size().reset_index(name="games_played")
games_per_player["games_played"].describe()

count   533.00
mean     40.32
std      26.08
min       1.00
25%      15.00
50%      42.00
75%      64.00
max      82.00
Name: games_played, dtype: float64

> A full regular season is typically 82 games. To qualify for this study, a player must have at least 20 games with 12+ minutes played (roughly one quarter of the regular season).
>
> With this season-long filter, enough game-to-game activity occurs for consistency metrics to become meaningful.

In [14]:
# Merge into meaningful game logs
meaningful_game_logs = meaningful_game_logs.merge(
    games_per_player,
    on="Player_ID",
    how="left"
)

# Filter meaningful game logs for only players with 20+ games played
min_games = 20
meaningful_game_logs_filtered = meaningful_game_logs[meaningful_game_logs["games_played"] >= min_games].copy()

___
## 2) Import All-Star Baselines

In [16]:
# USG% baseline
AS_USG_baseline = baselines.loc[
    baselines["Type"] == "All-Star", "USG_baseline"
].values[0]

print(f"All-Star USG% baseline: {AS_USG_baseline}%")

All-Star USG% baseline: 29.3%


In [17]:
# PRA baseline
AS_PRA_baseline = baselines.loc[
    baselines["Type"] == "All-Star", "PRA_baseline"
].values[0]

print(f"All-Star PRA baseline: {AS_PRA_baseline} Points + Rebounds + Assists")

All-Star PRA baseline: 37.8 Points + Rebounds + Assists


___
## 3) Regression Analysis
#### a) Build Regression Dataset
We want to see **how PRA changes** in response to a **change in USG%**. In other words, how *elastic* is PRA?
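One way to write the model fitted in this section, as a sketch that reuses the notebook's `alpha` and `beta` names and adds an error term ε that the notebook does not name explicitly:

```latex
\log(\text{PRA}) = \alpha + \beta \, \log(\text{USG\%}) + \varepsilon,
\qquad
\beta = \frac{\partial \log(\text{PRA})}{\partial \log(\text{USG\%})}
```

Because both sides are in logs, β can be read directly as an elasticity: the approximate percentage change in PRA associated with a 1% change in USG%.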

In [19]:
# Only need USG% and PRA columns
reg = meaningful_game_logs_filtered[["USG%", "PRA"]].copy()

# Remove rows where log would break (zeroes or negatives)
reg = reg[
    (reg["USG%"] > 0) & (reg["PRA"] > 0)
].copy()

print(f"Regression dataset: {reg.shape[0]:,} rows | {reg.shape[1]} columns")

Regression dataset: 20,141 rows | 2 columns


In [20]:
# Regression dataset preview
reg.sample(10)

Unnamed: 0,USG%,PRA
3240,9.09,11
4471,26.69,32
18955,23.63,10
3685,13.84,12
14987,22.86,12
5802,35.74,37
14683,21.34,23
15991,30.25,41
10169,11.54,10
19447,14.26,8


#### b) Fit the Log-Log Regression Model

In [22]:
# Log-transform both variables (USG% and PRA)
reg["log_USG"] = np.log(reg["USG%"])
reg["log_PRA"] = np.log(reg["PRA"])

In [23]:
# Regression setup
X = sm.add_constant(reg["log_USG"])  # independent variable: log(USG%) plus an intercept column
y = reg["log_PRA"]                   # dependent variable: log(PRA)

In [24]:
# Fit log-log model
model = sm.OLS(y, X).fit()

alpha = model.params["const"]   # intercept
beta = model.params["log_USG"]  # elasticity (how strongly PRA responds to USG%)

alpha, beta

(0.27541835726159747, 0.8984115299311776)

> With `beta` ≈ 0.898, a **1% increase in USG%** is associated with a roughly **0.898% increase in PRA**. The relationship is slightly less than proportional (1:1), but it's close.
>
> ##### This is the elasticity.
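
To make the elasticity concrete, here is a minimal sketch of what the fitted value implies for different usage changes. The helper `pra_pct_change` is hypothetical (it is not part of the notebook), and `BETA` is the fitted elasticity rounded to three decimals.

```python
# Elasticity rounded from the notebook's fitted value (0.8984...)
BETA = 0.898


def pra_pct_change(usg_pct_change, beta=BETA):
    """Implied % change in PRA for a given % change in USG% under a log-log fit."""
    return ((1 + usg_pct_change / 100) ** beta - 1) * 100


# Small changes track beta closely; larger changes compound
for change in (1, 10, 50):
    print(f"USG% +{change}% -> PRA {pra_pct_change(change):+.2f}%")
```

The "1% in, 0.898% out" reading is exact only for small changes; for larger jumps in usage the implied PRA change comes from the power relationship rather than simple multiplication.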

___
## 4) Calculating PRA Signal

**PRA Signal** is a role-adjusted version of a player's **PRA**.

It rescales a player's raw (actual) PRA to the **All-Star usage baseline** of **29.3%** (from Step 2). In any game where a player's USG% is below 29.3%, their raw PRA is **scaled upward** using the elasticity (from Step 3); games at or above the baseline keep their raw PRA.
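
Written out, the rescaling described above looks roughly like this, where U_AS = 29.3 is the All-Star USG% baseline and β is the fitted elasticity. The symbols are shorthand introduced here, not names from the notebook.

```latex
\text{PRA Signal} =
\begin{cases}
\text{PRA} \cdot \left( \dfrac{U_{AS}}{\text{USG\%}} \right)^{\beta} & \text{if } 0 < \text{USG\%} < U_{AS} \text{ and } \text{PRA} > 0 \\[6pt]
\text{PRA} & \text{if } \text{USG\%} \ge U_{AS}
\end{cases}
```

Games below the baseline with a non-positive USG% or PRA fall outside both cases and receive no PRA Signal value (`NaN` in the code).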

> For example, for a game where a player's USG% is 20% and PRA is 16:
>
> > The usage ratio is 29.3% / 20.0% = 1.465.
> >
> > The elasticity is applied: 1.465^(0.898) ≈ 1.409
> >
> > 16 PRA * 1.409 ≈ 22.5 (PRA Signal)
> >
> > *This will be better explained in the README.*

In [28]:
# --- Calculate PRA signal for every game ---
def scale_pra(row, AS_USG, beta):
    usg = row["USG%"]
    pra = row["PRA"]

    if pd.isna(usg) or pd.isna(pra):
        return np.nan

    if usg >= AS_USG:
        return pra
    
    if usg > 0 and pra > 0:
        scale = (AS_USG / usg) ** beta
        return pra * scale
    else:
        return np.nan

meaningful_game_logs_filtered["PRA_signal"] = meaningful_game_logs_filtered.apply(
    lambda r: scale_pra(r, AS_USG_baseline, beta),
    axis=1
)

meaningful_game_logs_filtered["PRA_signal"] = meaningful_game_logs_filtered["PRA_signal"].round(1)

In [29]:
# Calculate output per role (i.e., output per usage)
meaningful_game_logs_filtered["OPR"] = np.where(
    meaningful_game_logs_filtered["USG%"] > 0,
    meaningful_game_logs_filtered["PRA"] / meaningful_game_logs_filtered["USG%"],
    np.nan
)

In [30]:
# Preview results
meaningful_game_logs_filtered[["Player_Name", "GAME_DATE", "USG%", "PRA", "PRA_signal", "OPR"]].sample(10)

Unnamed: 0,Player_Name,GAME_DATE,USG%,PRA,PRA_signal,OPR
20774,Victor Wembanyama,"Nov 15, 2024",39.2,47,47.0,1.2
8404,Jake LaRavia,"Nov 10, 2024",22.13,24,30.9,1.08
6197,Duncan Robinson,"Feb 07, 2025",19.69,11,15.7,0.56
1451,Austin Reaves,"Apr 08, 2025",23.32,34,41.7,1.46
17863,Rui Hachimura,"Nov 23, 2024",13.99,15,29.1,1.07
5195,Derrick Jones Jr.,"Dec 23, 2024",21.6,15,19.7,0.69
13371,Kyle Kuzma,"Jan 23, 2025",20.13,17,23.8,0.84
20190,Trey Murphy III,"Dec 02, 2024",14.12,20,38.5,1.42
10025,Jayson Tatum,"Mar 24, 2025",36.52,40,40.0,1.1
19462,Terry Rozier,"Jan 07, 2025",19.36,19,27.6,0.98
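
As a sanity check, a few of the preview rows can be recomputed by hand. The sketch below uses a hypothetical scalar helper, `pra_signal`, that mirrors the branching in `scale_pra` for plain numbers; the baseline and elasticity are hard-coded from the values printed earlier in the notebook, with the elasticity rounded.

```python
AS_USG = 29.3   # All-Star USG% baseline printed in Step 2
BETA = 0.8984   # elasticity from Step 3, rounded to four decimals


def pra_signal(usg, pra, as_usg=AS_USG, beta=BETA):
    """Scalar stand-in for scale_pra: rescale PRA to the All-Star usage baseline."""
    if usg >= as_usg:
        return float(pra)                     # at or above baseline: keep raw PRA
    if usg > 0 and pra > 0:
        return pra * (as_usg / usg) ** beta   # below baseline: scale upward
    return float("nan")                       # no signal for non-positive inputs


# (player, USG%, PRA) taken from the preview output
rows = [
    ("Jake LaRavia", 22.13, 24),
    ("Duncan Robinson", 19.69, 11),
    ("Victor Wembanyama", 39.2, 47),
]

for name, usg, pra in rows:
    signal = round(pra_signal(usg, pra), 1)
    opr = round(pra / usg, 2)
    print(f"{name}: PRA_signal={signal}, OPR={opr}")
```

The printed values should line up with the `PRA_signal` and `OPR` columns in the preview, up to rounding of the elasticity.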


___
## 5) Flag Projected All-Star Game Logs

In [32]:
# Flag projected all-star-level games
meaningful_game_logs_filtered["is_AS_level"] = (meaningful_game_logs_filtered["PRA_signal"] >= AS_PRA_baseline).astype(int)

In [33]:
# Preview results
meaningful_game_logs_filtered[["Player_Name", "TEAM_ABBREVIATION", "GAME_DATE", "USG%", "PRA", "PRA_signal", "is_AS_level"]].sample(5)

Unnamed: 0,Player_Name,TEAM_ABBREVIATION,GAME_DATE,USG%,PRA,PRA_signal,is_AS_level
12916,Keyonte George,UTA,"Nov 12, 2024",17.94,18,28.0,0
20755,Vasilije Micic,CHA,"Nov 29, 2024",18.03,19,29.4,0
124,Aaron Gordon,DEN,"Oct 29, 2024",21.12,31,41.6,1
7499,Immanuel Quickley,TOR,"Apr 01, 2025",26.35,27,29.7,0
8598,Jalen Duren,DET,"Jan 22, 2025",19.35,28,40.7,1


> Any game where the player's `PRA_signal` is greater than or equal to **37.8 (the All-Star PRA baseline from Step 2)** is flagged as an **All-Star-level** game.

___
## 6) Player-Level Season Metrics

In [36]:
# Roll up projected metrics to player-level
player_season_projections = (
    meaningful_game_logs_filtered.groupby("Player_ID").agg(
        # Game-level aggregations
        games_played=("GAME_ID", "count"),
        AS_level_games=("is_AS_level", "sum"),
        avg_pra_signal=("PRA_signal", "mean"),
        avg_opr=("OPR", "mean"),
        # Season-level context
        season_USG=("season_USG%", "first"),
        season_PRA=("season_PRA", "first"),
        age=("Age", "first"),
        team=("Team", "first"),
        pos=("Pos", "first"),
        season_PTS=("season_PTS", "first"),
        season_REB=("season_REB", "first"),
        season_AST=("season_AST", "first")
    ).reset_index()
)

In [37]:
# Share of games in which each player reached the All-Star threshold (the All-Star output rate, as a fraction from 0 to 1)
player_season_projections["AS_output_rate"] = player_season_projections["AS_level_games"] / player_season_projections["games_played"]

In [38]:
# Merge Player_Name column into this new DataFrame
player_season_projections = player_season_projections.merge(
    meaningful_game_logs_filtered[["Player_ID", "Player_Name"]].drop_duplicates(),
    on="Player_ID",
    how="left"
)

In [39]:
# Reorder columns so that `Player_Name` column is first
cols = ["Player_Name"] + [c for c in player_season_projections.columns if c != "Player_Name"]
player_season_projections = player_season_projections[cols]

In [40]:
# Usage % (season-level) stats
meaningful_game_logs_filtered["season_USG%"].describe(percentiles=[0.10, 0.20, 0.25, 0.75, 0.80, 0.90])

count   20196.00
mean       19.47
std         5.55
min         7.60
10%        13.00
20%        14.70
25%        15.40
50%        18.50
75%        23.30
80%        24.10
90%        27.70
max        35.90
Name: season_USG%, dtype: float64

In [41]:
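# Create player-level usage cohorts
def categorize_usage(usg):
    if pd.isna(usg):
        return np.nan  # missing season usage: leave unassigned rather than defaulting to "High Usage"
    if usg <= 15.0:
        return "Low Usage"
    elif usg < 23.0:
        return "Medium Usage"
    else:
        return "High Usage"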

player_season_projections["USG_cohort"] = player_season_projections["season_USG"].apply(categorize_usage)

In [42]:
# Calculate output consistency for each player
player_season_projections["OC"] = np.sqrt(
    player_season_projections["AS_output_rate"] * player_season_projections["avg_opr"]
)

> A player's **output consistency** (the primary metric) is the geometric mean of their **All-Star output rate** and their average **output per role** (the secondary metrics): OC = sqrt(AS_output_rate × avg_opr).
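
For a concrete feel of how the two secondary metrics combine, the sketch below recomputes OC from the values shown for Nikola Jokić in the preview (67 All-Star-level games out of 70, `avg_opr` displayed as 1.84). The function name `output_consistency` is hypothetical, and the input is the rounded `avg_opr`, so the result is only approximate.

```python
import math


def output_consistency(as_output_rate, avg_opr):
    """Square root of (All-Star output rate x average output per role)."""
    return math.sqrt(as_output_rate * avg_opr)


# Nikola Jokić: 67 of 70 qualifying games at All-Star level, avg_opr shown as 1.84
jokic_oc = output_consistency(67 / 70, 1.84)
print(round(jokic_oc, 2))

# A player who never reaches the threshold scores 0 regardless of output per role
print(output_consistency(0.0, 2.0))
```

Because the two inputs are multiplied before the square root, a player needs both a high All-Star output rate and a high output per role to score well; a strong value in one cannot fully offset a weak value in the other.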

In [44]:
# Preview results
player_season_projections.sort_values("OC", ascending=False).head(10)

Unnamed: 0,Player_Name,Player_ID,games_played,AS_level_games,avg_pra_signal,avg_opr,season_USG,season_PRA,age,team,pos,season_PTS,season_REB,season_AST,AS_output_rate,USG_cohort,OC
59,Nikola Jokić,203999,70,67,57.47,1.84,29.5,52.5,29,DEN,C,29.6,12.7,10.2,0.96,High Usage,1.33
80,Domantas Sabonis,1627734,69,64,53.74,1.9,21.6,39.0,28,SAC,C,19.1,13.9,6.0,0.93,Medium Usage,1.33
38,Rudy Gobert,203497,72,61,53.34,2.0,13.0,24.7,32,MIN,C,12.0,10.9,1.8,0.85,Low Usage,1.3
116,Josh Hart,1628404,77,64,54.44,2.0,15.3,29.1,29,NYK,SG,13.6,9.6,5.9,0.83,Medium Usage,1.29
288,Walker Kessler,1631117,58,45,51.48,1.92,13.7,25.0,23,UTA,C,11.1,12.2,1.7,0.78,Low Usage,1.22
41,Giannis Antetokounmpo,203507,67,63,49.49,1.41,35.2,48.8,30,MIL,PF,30.4,11.9,6.5,0.94,High Usage,1.15
95,Ivica Zubac,1627826,80,61,46.99,1.68,19.5,32.1,27,LAC,C,16.8,12.6,2.7,0.76,Medium Usage,1.13
239,Jalen Johnson,1630552,35,28,45.07,1.58,22.5,33.9,23,ATL,SF,18.9,10.0,5.0,0.8,Medium Usage,1.12
279,Jalen Duren,1631105,75,54,44.5,1.62,16.4,24.8,21,DET,C,11.8,10.3,2.7,0.72,Medium Usage,1.08
65,Karl-Anthony Towns,1626157,72,55,44.98,1.49,27.4,40.3,29,NYK,C,24.4,12.8,3.1,0.76,High Usage,1.07


___
## Save

In [46]:
# Save to CSV
meaningful_game_logs_filtered.to_csv("NBA_2024_25_game_logs_final.csv", index=False)
player_season_projections.to_csv("player_season_projections.csv", index=False)