# Correlation Heatmaps

## Trevor Rowland, Scott Campbell :: 2-4-2025

This notebook collects functions for creating correlation heatmaps, which we use to identify relationships between features in our datasets. It uses the `players` dataset, which contains aggregated player stats for all NBA players from 2004 to 2024.

## 1. Importing Packages and Data

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_pickle('/home/arch-db/Documents/github/bint-capstone/data-sources/OneDrive_1_2-5-2025/nba_player_stats_2004_2024.pkl')
df.head()

Unnamed: 0,player_id,player_name,nickname,team_id,team_abbreviation,age,gp,w,l,w_pct,...,blka_rank,pf_rank,pfd_rank,pts_rank,plus_minus_rank,nba_fantasy_pts_rank,dd2_rank,td3_rank,wnba_fantasy_pts_rank,season
0,243,Aaron McKie,Aaron,1610612755,PHI,32.0,68,35,33,0.515,...,146,215,81,332,157,274,223,14,285,2004-05
1,1425,Aaron Williams,Aaron,1610612761,TOR,33.0,42,13,29,0.31,...,105,142,81,376,400,377,223,14,377,2004-05
2,1502,Adonal Foyle,Adonal,1610612744,GSW,30.0,78,31,47,0.397,...,232,386,10,247,256,141,129,14,168,2004-05
3,1559,Adrian Griffin,Adrian,1610612741,CHI,30.0,69,38,31,0.551,...,129,153,81,334,119,308,223,14,313,2004-05
4,1733,Al Harrington,Al,1610612737,ATL,25.0,66,10,56,0.152,...,446,445,81,53,463,65,59,14,64,2004-05


We will also need to import the shot data to grab the position of each player.

In [5]:
shots_df = pd.read_pickle('/home/arch-db/Documents/github/bint-capstone/data-sources/OneDrive_1_2-5-2025/all-shots.pkl')
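# The to_pandas call suggests the pickle holds a Polars DataFrame (an assumption);
# convert it to pandas, keeping Arrow-backed columns
shots = shots_df.to_pandas(use_pyarrow_extension_array=True)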
shots.head()

Unnamed: 0,SEASON_1,SEASON_2,TEAM_ID,TEAM_NAME,PLAYER_ID,PLAYER_NAME,POSITION_GROUP,POSITION,GAME_DATE,GAME_ID,...,BASIC_ZONE,ZONE_NAME,ZONE_ABB,ZONE_RANGE,LOC_X,LOC_Y,SHOT_DISTANCE,QUARTER,MINS_LEFT,SECS_LEFT
0,2009,2008-09,1610612744,Golden State Warriors,201627,Anthony Morrow,G,SG,04-15-2009,20801229,...,Restricted Area,Center,C,Less Than 8 ft.,-0.0,5.25,0,4,0,1
1,2009,2008-09,1610612744,Golden State Warriors,101235,Kelenna Azubuike,F,SF,04-15-2009,20801229,...,Restricted Area,Center,C,Less Than 8 ft.,-0.0,5.25,0,4,0,9
2,2009,2008-09,1610612756,Phoenix Suns,255,Grant Hill,F,SF,04-15-2009,20801229,...,Restricted Area,Center,C,Less Than 8 ft.,-0.0,5.25,0,4,0,25
3,2009,2008-09,1610612739,Cleveland Cavaliers,200789,Daniel Gibson,G,PG,04-15-2009,20801219,...,Restricted Area,Center,C,Less Than 8 ft.,-0.2,5.25,0,5,0,4
4,2009,2008-09,1610612756,Phoenix Suns,255,Grant Hill,F,SF,04-15-2009,20801229,...,Mid-Range,Left Side,L,8-16 ft.,8.7,7.55,8,4,1,3


## 2. Merge Shot and Player Data

To choose a new hypothesis test, we need each player's position. We therefore match each player-season in the `players` dataset to the same player-season in the `shots` dataset, and add the `POSITION` and `POSITION_GROUP` fields to `players`.

In [6]:
# Left-join the position columns from shots onto players by name and season
merged = df.merge(
    shots[['PLAYER_NAME', 'SEASON_2', 'POSITION', 'POSITION_GROUP']],
    left_on=['player_name', 'season'],
    right_on=['PLAYER_NAME', 'SEASON_2'],
    how='left')

# Copy the position columns into players (one column at a time).
# Note: shots has one row per shot, so merged can have more rows than df,
# and this assignment aligns on the row index rather than on the join keys.
df['position'] = merged['POSITION']
df['position_group'] = merged['POSITION_GROUP']

# Check
df.head()

Unnamed: 0,player_id,player_name,nickname,team_id,team_abbreviation,age,gp,w,l,w_pct,...,pfd_rank,pts_rank,plus_minus_rank,nba_fantasy_pts_rank,dd2_rank,td3_rank,wnba_fantasy_pts_rank,season,position,position_group
0,243,Aaron McKie,Aaron,1610612755,PHI,32.0,68,35,33,0.515,...,81,332,157,274,223,14,285,2004-05,SG,G
1,1425,Aaron Williams,Aaron,1610612761,TOR,33.0,42,13,29,0.31,...,81,376,400,377,223,14,377,2004-05,SG,G
2,1502,Adonal Foyle,Adonal,1610612744,GSW,30.0,78,31,47,0.397,...,10,247,256,141,129,14,168,2004-05,SG,G
3,1559,Adrian Griffin,Adrian,1610612741,CHI,30.0,69,38,31,0.551,...,81,334,119,308,223,14,313,2004-05,SG,G
4,1733,Al Harrington,Al,1610612737,ATL,25.0,66,10,56,0.152,...,81,53,463,65,59,14,64,2004-05,SG,G
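
The check above shows `SG` / `G` on all five rows, which suggests the index-aligned assignment is picking up the first rows of a much longer `merged` frame: `shots` has one row per shot, so a left merge repeats each player-season once per shot. Below is a minimal sketch of one way around this, assuming a single position label per player-season is acceptable. The frames are toy stand-ins with hypothetical values, not our data. The idea is to deduplicate the position columns on the join keys first, then merge with `validate="many_to_one"` so pandas raises an error if a key still repeats.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (hypothetical values, not real data).
players = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "season": ["2004-05", "2004-05", "2004-05"],
})
shots = pd.DataFrame({
    "PLAYER_NAME": ["A", "A", "A", "B", "B"],
    "SEASON_2": ["2004-05"] * 5,
    "POSITION": ["SG", "SG", "SG", "C", "C"],
    "POSITION_GROUP": ["G", "G", "G", "C", "C"],
})
keys = ["PLAYER_NAME", "SEASON_2"]

# Merging against the raw shots repeats each player once per shot.
exploded = players.merge(
    shots, left_on=["player_name", "season"], right_on=keys, how="left"
)

# Index-aligned assignment then reads rows 0..2 of the exploded frame,
# which all belong to player A.
naive = players.assign(position=exploded["POSITION"])

# One row per player-season; if a player-season had several labels,
# drop_duplicates would keep the first one encountered.
positions = shots[keys + ["POSITION", "POSITION_GROUP"]].drop_duplicates(subset=keys)

fixed = (
    players.merge(
        positions,
        left_on=["player_name", "season"],
        right_on=keys,
        how="left",
        validate="many_to_one",  # raise if a key repeats on the right side
    )
    .drop(columns=keys)
    .rename(columns={"POSITION": "position", "POSITION_GROUP": "position_group"})
)

print(naive)
print(fixed)
```

Because the deduplicated merge returns exactly one row per row of `players`, the position columns can stay in the merged result instead of being assigned back by index.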


In [7]:
# Pick a player name to check
player_name = "LeBron James"  # replace with any player name

# Check in your main dataframe
print("From main df:")
print(df[df['player_name'] == player_name][['player_name', 'season', 'position', 'position_group']])

# Check in shots dataframe
print("\nFrom shots df:")
print(shots[shots['PLAYER_NAME'] == player_name][['PLAYER_NAME', 'SEASON_2', 'POSITION', 'POSITION_GROUP']])

From main df:
        player_name   season position position_group
267    LeBron James  2004-05        C              C
731    LeBron James  2005-06       SF              F
1182   LeBron James  2006-07       SF              F
1643   LeBron James  2007-08       SF              F
2086   LeBron James  2008-09       PF              F
2536   LeBron James  2009-10       PG              G
2991   LeBron James  2010-11       PG              G
3475   LeBron James  2011-12       PG              G
3944   LeBron James  2012-13       PG              G
4404   LeBron James  2013-14     C-PF              C
4900   LeBron James  2014-15        C              C
5382   LeBron James  2015-16        C              C
5859   LeBron James  2016-17        C              C
6378   LeBron James  2017-18       SG              G
6923   LeBron James  2018-19       PG              G
7450   LeBron James  2019-20       PG              G
7985   LeBron James  2020-21       PF              F
8566   LeBron James  2021-22    
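
Checking one name at a time only goes so far. As a rough sketch (toy frames with hypothetical values, and a helper column name, `position_seen_in_shots`, that is our own invention), the same comparison can be run for every row at once by asking whether each assigned (player, season, position) combination actually occurs in `shots`:

```python
import pandas as pd

# Toy stand-ins (hypothetical values): player B's assigned label is deliberately wrong.
players = pd.DataFrame({
    "player_name": ["A", "B"],
    "season": ["2004-05", "2004-05"],
    "position": ["SF", "SF"],
})
shots = pd.DataFrame({
    "PLAYER_NAME": ["A", "A", "B"],
    "SEASON_2": ["2004-05", "2004-05", "2004-05"],
    "POSITION": ["SF", "SF", "PG"],
})

# Every (player, season, position) combination that occurs in the shot data.
observed = set(zip(shots["PLAYER_NAME"], shots["SEASON_2"], shots["POSITION"]))

# Flag rows whose assigned position never appears for that player-season.
assigned = zip(players["player_name"], players["season"], players["position"])
players["position_seen_in_shots"] = [row in observed for row in assigned]

mismatched = players[~players["position_seen_in_shots"]]
print(mismatched)
```

A non-empty `mismatched` frame would point to rows where the assigned position does not come from that player's own shots.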

In [14]:
print(f"Total rows in df: {len(df)}")
print(f"Number of rows with position data: {df['position'].notna().sum()}")

Total rows in df: 10433
Number of rows with position data: 10433
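
A non-null count only tells us that every row received some value, not that a matching player-season existed in `shots`. A sketch of a key-based coverage count, assuming the shot keys have already been reduced to one row per player-season (toy data, hypothetical values), uses the `indicator` option of `merge`:

```python
import pandas as pd

# Toy stand-ins (hypothetical values): player C has no shot records.
players = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "season": ["2004-05", "2004-05", "2004-05"],
})
shot_keys = pd.DataFrame({
    "PLAYER_NAME": ["A", "B"],
    "SEASON_2": ["2004-05", "2004-05"],
})

# indicator=True adds a _merge column: "both" where the key was found in
# shot_keys, "left_only" where it was not.
coverage = players.merge(
    shot_keys,
    left_on=["player_name", "season"],
    right_on=["PLAYER_NAME", "SEASON_2"],
    how="left",
    indicator=True,
)

n_matched = int((coverage["_merge"] == "both").sum())
n_unmatched = int((coverage["_merge"] == "left_only").sum())
print(f"Player-seasons with shot data: {n_matched}")
print(f"Player-seasons without shot data: {n_unmatched}")
```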

