# NBA Shot Data 1 :: Preprocessing

## Trevor Rowland :: 1-29-2025

This notebook takes the per-season CSV files from Dom Samangy's [NBA_Shots_04_24](<https://github.com/DomSamangy/NBA_Shots_04_24/tree/main>) GitHub repository and combines them into a single DataFrame. The concatenation functions are kept as general as possible so they can be reused with other NBA data, or with data from other sports such as hockey.

## 1. Importing Packages and Data Sources

In [11]:
import os
import polars as pl
import glob
dir = '/Users/dB/Desktop/spring_25/ds4510/data-sources/nba/NBA_Shots_04_24-main/shots' # Change to your directory of csv files

First, we load each CSV file into its own Polars DataFrame and collect the results in a list. We combine them in a later step.

In [12]:
dfs = [pl.read_csv(file) for file in glob.glob(dir+'/*.csv')] # adding '/*.csv' will grab (or glob) all of the CSV files in the directory
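Python's `glob` returns paths in arbitrary order, so the row order of the combined data can differ between machines. The sketch below is one way to collect the files in a stable, sorted order and to fail early when the directory contains no CSV files. The helper name `list_csv_files` and the file names in the demo are made up for illustration and are not part of the original notebook.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory: str) -> list[Path]:
    """Return the CSV files directly inside `directory`, sorted by path."""
    files = sorted(Path(directory).glob("*.csv"))
    if not files:
        # An empty list would otherwise only surface later as a confusing error
        raise FileNotFoundError(f"No CSV files found in {directory!r}")
    return files


# Demo with throwaway files (hypothetical names), so no real data is needed
with tempfile.TemporaryDirectory() as tmp:
    for name in ("shots_2005.csv", "shots_2004.csv", "notes.txt"):
        (Path(tmp) / name).write_text("placeholder\n")
    found = [p.name for p in list_csv_files(tmp)]

print(found)  # ['shots_2004.csv', 'shots_2005.csv']
```

The returned paths could then be passed to `pl.read_csv` one by one, exactly as in the list comprehension above.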

Now we have a list of Polars DataFrames, one per CSV file, which you can inspect with the following code.

In [13]:
for df in dfs:
    print(type(df)) # Make sure the dfs are DataFrames and not LazyFrames
    print(df.head()) # View the first 5 rows of data in each dataset

<class 'polars.dataframe.frame.DataFrame'>
shape: (5, 26)
┌──────────┬──────────┬────────────┬────────────┬───┬────────────┬─────────┬───────────┬───────────┐
│ SEASON_1 ┆ SEASON_2 ┆ TEAM_ID    ┆ TEAM_NAME  ┆ … ┆ SHOT_DISTA ┆ QUARTER ┆ MINS_LEFT ┆ SECS_LEFT │
│ ---      ┆ ---      ┆ ---        ┆ ---        ┆   ┆ NCE        ┆ ---     ┆ ---       ┆ ---       │
│ i64      ┆ str      ┆ i64        ┆ str        ┆   ┆ ---        ┆ i64     ┆ i64       ┆ i64       │
│          ┆          ┆            ┆            ┆   ┆ i64        ┆         ┆           ┆           │
╞══════════╪══════════╪════════════╪════════════╪═══╪════════════╪═════════╪═══════════╪═══════════╡
│ 2009     ┆ 2008-09  ┆ 1610612744 ┆ Golden     ┆ … ┆ 0          ┆ 4       ┆ 0         ┆ 1         │
│          ┆          ┆            ┆ State      ┆   ┆            ┆         ┆           ┆           │
│          ┆          ┆            ┆ Warriors   ┆   ┆            ┆         ┆           ┆           │
│ 2009     ┆ 2008-09  ┆ 161061274

## 2. Checking Compatibility Before Combining

Feature names sometimes change across seasons. We saw this with our NHL Clutch Project, so the following function checks whether every season shares the same schema (column names and data types) before we try to combine them.

In [14]:
def check_features(l: list[pl.DataFrame]) -> bool:
    '''
    This function takes in a list of Polars DataFrames, and checks if the schema matches across DataFrames to ensure successful concatenation

    params:
      - l: A list of Polars DataFrames (for NBA/NHL, these are different seasons we are aiming to concatenate)
    '''
    if not l:
        raise ValueError("The list is empty or None!")

    reference_schema = l[0].schema # store the schema of the first DataFrame as a reference to check against

    matching_dfs = [df for df in l if df.schema == reference_schema]

    return len(l) == len(matching_dfs) # Returns True if every schema matches the reference, False if there are inconsistencies

In [15]:
check_features(dfs)

True

From the output of our `check_features` function, we see that the schemas match, and the DataFrames can be concatenated without issue.
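When the check returns `False`, it helps to know which columns are responsible. The following is a minimal sketch, not part of the original notebook, that compares each schema against the first one and reports missing columns, extra columns, and columns whose data type changed. It does not look at column order. The helper name `describe_schema_mismatches` is hypothetical, and the demo uses plain dictionaries with made-up dtype strings so that it runs without any data files. With Polars, passing `[dict(df.schema) for df in dfs]` should work, assuming the schema behaves like a mapping of column name to dtype.

```python
from collections.abc import Mapping, Sequence


def describe_schema_mismatches(
    schemas: Sequence[Mapping[str, object]],
) -> dict[int, dict[str, list[str]]]:
    """Compare every schema against the first one and report the differences."""
    if not schemas:
        raise ValueError("No schemas to compare")

    reference = schemas[0]
    report = {}
    for i, schema in enumerate(schemas[1:], start=1):
        diff = {
            # Columns in the reference that this schema lacks
            "missing": [c for c in reference if c not in schema],
            # Columns in this schema that the reference lacks
            "extra": [c for c in schema if c not in reference],
            # Shared columns whose dtype differs from the reference
            "dtype_changed": [
                c for c in reference if c in schema and schema[c] != reference[c]
            ],
        }
        if any(diff.values()):
            report[i] = diff  # key is the position of the schema in the input
    return report


# Illustrative schemas only; the dtype strings here are placeholders
season_a = {"TEAM_ID": "i64", "SHOT_MADE": "bool", "LOC_X": "f64"}
season_b = {"TEAM_ID": "i64", "SHOT_MADE": "str", "LOC_Y": "f64"}

report = describe_schema_mismatches([season_a, season_a, season_b])
print(report)
# {2: {'missing': ['LOC_X'], 'extra': ['LOC_Y'], 'dtype_changed': ['SHOT_MADE']}}
```

An empty report means every schema agrees with the first one on column names and dtypes.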

In [16]:
def combine_dfs_given_matching_cols(l:list[pl.DataFrame])->pl.DataFrame:
    if check_features(l):
        return pl.concat(l, how="vertical")
    else:
        raise ValueError("Schema Mismatch! Fix DataFrames before attempting to concatenate!")

In [17]:
combined_df = combine_dfs_given_matching_cols(dfs)
combined_df.head()

SEASON_1,SEASON_2,TEAM_ID,TEAM_NAME,PLAYER_ID,PLAYER_NAME,POSITION_GROUP,POSITION,GAME_DATE,GAME_ID,HOME_TEAM,AWAY_TEAM,EVENT_TYPE,SHOT_MADE,ACTION_TYPE,SHOT_TYPE,BASIC_ZONE,ZONE_NAME,ZONE_ABB,ZONE_RANGE,LOC_X,LOC_Y,SHOT_DISTANCE,QUARTER,MINS_LEFT,SECS_LEFT
i64,str,i64,str,i64,str,str,str,str,i64,str,str,str,bool,str,str,str,str,str,str,f64,f64,i64,i64,i64,i64
2009,"2008-09",1610612744,"Golden State Warriors",201627,"Anthony Morrow","G","SG","04-15-2009",20801229,"PHX","GSW","Made Shot",True,"Driving Layup Shot","2PT Field Goal","Restricted Area","Center","C","Less Than 8 ft.",-0.0,5.25,0,4,0,1
2009,"2008-09",1610612744,"Golden State Warriors",101235,"Kelenna Azubuike","F","SF","04-15-2009",20801229,"PHX","GSW","Missed Shot",False,"Layup Shot","2PT Field Goal","Restricted Area","Center","C","Less Than 8 ft.",-0.0,5.25,0,4,0,9
2009,"2008-09",1610612756,"Phoenix Suns",255,"Grant Hill","F","SF","04-15-2009",20801229,"PHX","GSW","Made Shot",True,"Layup Shot","2PT Field Goal","Restricted Area","Center","C","Less Than 8 ft.",-0.0,5.25,0,4,0,25
2009,"2008-09",1610612739,"Cleveland Cavaliers",200789,"Daniel Gibson","G","PG","04-15-2009",20801219,"CLE","PHI","Made Shot",True,"Driving Layup Shot","2PT Field Goal","Restricted Area","Center","C","Less Than 8 ft.",-0.2,5.25,0,5,0,4
2009,"2008-09",1610612756,"Phoenix Suns",255,"Grant Hill","F","SF","04-15-2009",20801229,"PHX","GSW","Missed Shot",False,"Jump Shot","2PT Field Goal","Mid-Range","Left Side","L","8-16 ft.",8.7,7.55,8,4,1,3


## 3. Writing to a File

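To ensure compatibility with team members using Pandas, we write the combined DataFrame to a CSV file, which Pandas can read directly, and also to a pickle (PKL) file for faster loading. Note that the pickle stores the Polars DataFrame object itself, so loading it requires Polars to be installed.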

In [18]:
import pickle

def write_pl_df_to_csv_and_pkl(df: pl.DataFrame, csv_path: str, pkl_path: str):
    df.write_csv(csv_path)
    print(f'DataFrame saved to CSV at location: {csv_path}')

    with open(pkl_path, "wb") as f: # create the pickle file with intent to write to it
        pickle.dump(df, f)
    print(f'DataFrame saved to PKL at location: {pkl_path}')

In [21]:
csv_path = '/Users/dB/Desktop/spring_25/ds4510/data-sources/nba/all-shots.csv'
pkl_path = '/Users/dB/Desktop/spring_25/ds4510/data-sources/nba/all-shots.pkl'

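# Pandas copy of the data for in-memory use; the files below are written from the Polars DataFrame, not from this copy
converted_df = combined_df.to_pandas(use_pyarrow_extension_array=True)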

write_pl_df_to_csv_and_pkl(combined_df, csv_path, pkl_path)

DataFrame saved to CSV at location: /Users/dB/Desktop/spring_25/ds4510/data-sources/nba/all-shots.csv
DataFrame saved to PKL at location: /Users/dB/Desktop/spring_25/ds4510/data-sources/nba/all-shots.pkl
