<h1><center>NBA Game Attendance Prediction</center></h1>
This project aims to collect various statistical and demographic factors to predict attendance at NBA games. The data contains 2,826 NBA games ranging from 2019-2023, with the 2020-2021 season being omitted due to Covid restrictions. All data was collected through using Python API Clients and data scraping various websites. Using the collected data, 113 features were created ranging from calculated statistics for a team's historical performance to demographic data about the home city from the American Census Survey. Model testing and cross-validation was done using the Sci-kit Learn library.

#### Field Categories:

1) NBA In-Game Statistics
2) NBA Out-of-Game Statistics
3) Date & Weather
4) Social Media Fans
5) American Census Survey Demographic Information

In [1]:
#analytics stack
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import requests
from bs4 import BeautifulSoup
import time
import numpy as np
import datetime 
import geopy.distance
from datetime import datetime

#nba_api package
from nba_api.stats.static import players, teams
from nba_api.stats.endpoints import leaguegamelog, boxscoresummaryv2, teamplayerdashboard

#meteostat weather package
from meteostat import Point, Daily, units

### Load in Data

NBA Data collection located in data_collection.ipynb file

In [2]:
# read in data
df_years = ['2018','2019','2021','2022']
initial_dfs = []
for year in df_years:
    df = pd.read_csv(f'data/attendance_{year}.csv')
    initial_dfs.append(df)

attendance_initial = pd.concat(initial_dfs)

# only keep necessary columns
attendance_initial = attendance_initial[['GAME_DATE_x', 'ATTENDANCE', 'GAME_ID', 'SEASON_ID','TEAM_ID_HOME', 
                                   'TEAM_ABBREVIATION_HOME','TEAM_NAME_HOME','TEAM_ID_AWAY',
                                   'TEAM_ABBREVIATION_AWAY', 'TEAM_NAME_AWAY','WL_HOME','FG3M_HOME','FG3_PCT_HOME',
                                   'REB_HOME','AST_HOME', 'STL_HOME', 'BLK_HOME','PTS_HOME', 'PLUS_MINUS_HOME','FG3M_AWAY',
                                   'FG3_PCT_AWAY', 'REB_AWAY', 'AST_AWAY', 'STL_AWAY','BLK_AWAY', 'PTS_AWAY', 'PLUS_MINUS_AWAY']]

In [3]:
#arena capacity
capacities = pd.read_html('https://en.wikipedia.org/wiki/List_of_National_Basketball_Association_arenas')[0][['Team(s)','Capacity']]
capacities['Team(s)'] = capacities['Team(s)'].replace(['Los Angeles Clippers'], 'LA Clippers')

attendance_initial = attendance_initial.merge(capacities, how='left', left_on='TEAM_NAME_HOME', right_on='Team(s)')
attendance_initial = attendance_initial.drop(columns=['Team(s)'])

# fix attendance numbers if over capacity
fixed_attendances = []
for i in range(len(attendance_initial)):
    if attendance_initial['ATTENDANCE'][i] <= attendance_initial['Capacity'][i]:
        fixed_attendances.append(attendance_initial['ATTENDANCE'][i])
    else:
        fixed_attendances.append(attendance_initial['Capacity'][i])
attendance_initial['ATTENDANCE'] = fixed_attendances


attendance_initial['Attendance'] = attendance_initial['ATTENDANCE']
attendance_initial.reset_index(inplace = True, drop = True)
attendance_initial.head(5)

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,FG3M_AWAY,FG3_PCT_AWAY,REB_AWAY,AST_AWAY,STL_AWAY,BLK_AWAY,PTS_AWAY,PLUS_MINUS_AWAY,Capacity,Attendance
0,2018-10-16,19156.0,21800002,22018,1610612738,BOS,Boston Celtics,1610612760,OKC,Oklahoma City Thunder,...,10,0.27,45,21,12,6,100,-8,19156,19156.0
1,2018-10-16,18064.0,21800001,22018,1610612744,GSW,Golden State Warriors,1610612755,PHI,Philadelphia 76ers,...,5,0.192,47,18,8,5,87,-18,18064,18064.0
2,2018-10-17,17889.0,21800003,22018,1610612739,CLE,Cleveland Cavaliers,1610612751,BKN,Brooklyn Nets,...,5,0.185,39,28,9,5,100,-3,19432,17889.0
3,2018-10-17,17583.0,21800006,22018,1610612758,SAC,Sacramento Kings,1610612749,MIL,Milwaukee Bucks,...,14,0.412,57,26,5,4,113,1,17583,17583.0
4,2018-10-17,18846.0,21800012,22018,1610612753,ORL,Orlando Magic,1610612759,SAS,San Antonio Spurs,...,11,0.44,52,22,3,4,112,4,18846,18846.0


### Previous Attendance and Average Statistics for each Team Over Last 5 Games

Features created: 

1) previous attendance against team
2) average (points, rebounds, assists, 3-pointers, plus/minus) over last 5 games

In [4]:
# exclude 2018 data to use for previous attendance
attendance_previous = attendance_initial[attendance_initial['SEASON_ID'] != 22018].copy()
prev_attendance_list = []
# get previous attendance for each game against opponent
for row in attendance_previous.index:
    try:
        home_team = attendance_previous.loc[row,'TEAM_ABBREVIATION_HOME']
        away_team = attendance_previous.loc[row,'TEAM_ABBREVIATION_AWAY']
        team_df = attendance_initial[(attendance_initial['TEAM_ABBREVIATION_HOME'] == home_team) & (attendance_initial['TEAM_ABBREVIATION_AWAY'] == away_team)]
        team_df = team_df.loc[:row]
        prev_attendance = team_df[:-1].iloc[-1]['Attendance']
        prev_attendance_list.append(prev_attendance)
    except:
        # For missing attendance info
        prev_attendance_list.append(attendance_previous.loc[row,'Attendance'])

# create previous attendance column    
attendance_previous['PREV_ATTENDANCE'] = prev_attendance_list
attendance_previous.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,FG3_PCT_AWAY,REB_AWAY,AST_AWAY,STL_AWAY,BLK_AWAY,PTS_AWAY,PLUS_MINUS_AWAY,Capacity,Attendance,PREV_ATTENDANCE
1230,2019-10-22,16867.0,21900002,22019,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,0.394,41,20,4,7,102,-10,16867,16867.0,12412.0
1231,2019-10-22,19068.0,21900001,22019,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,0.35,57,23,7,3,130,8,19068,19068.0,19068.0
1232,2019-10-23,18203.0,21900010,22019,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,0.268,47,26,6,3,100,-8,18203,18203.0,13110.0
1233,2019-10-23,17732.0,21900006,22019,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,0.333,39,24,16,1,111,-9,18846,17732.0,15898.0
1234,2019-10-23,17923.0,21900009,22019,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,0.302,52,23,9,4,127,1,17923,17923.0,16532.0


In [5]:
# find last 5 games for each team and calculate average for each statistic
categories = ['L5_AVG_PTS_', 'L5_AVG_BLK_','L5_AVG_REB_','L5_AVG_AST_','L5_AVG_FG3_PCT_','L5_AVG_PLUS_MINUS_']
last5_away = []
last5_home = []

stat_list = [last5_home, last5_away]
team_name_cols = ['TEAM_ABBREVIATION_HOME','TEAM_ABBREVIATION_AWAY']
home_away = ['HOME','AWAY']
prev_attendances = []
for row in attendance_previous.index:
    for i in range(0,2):
        # get last 5 game information for both teams
        team = attendance_previous.loc[row, team_name_cols[i]]
        team_df = attendance_initial[(attendance_initial['TEAM_ABBREVIATION_HOME'] == team) | (attendance_initial['TEAM_ABBREVIATION_AWAY'] == team)]
        team_df = team_df.loc[:row][-6:][:-1]
        
        # selected statistics 
        columns = ['PTS_','BLK_','REB_','AST_','FG3_PCT_','PLUS_MINUS_']
        totals = [0]*6
        winstreak = 0
        
        # calculate total for each statistic over last 5 games
        for x in range(len(team_df)):
            if team_df['TEAM_ABBREVIATION_HOME'].iloc[x] == team:
                for y in range(len(columns)):
                    totals[y] += team_df[columns[y]+home_away[i]].iloc[x]

            else:
                for y in range(len(columns)):
                    totals[y] += team_df[columns[y]+home_away[i]].iloc[x]
        
        # calculate average for each statistic for both home/away teams
        averages = [round(num/5,2) for num in totals]
        if i == 0:
            last5_home.append(averages)

        else:
            last5_away.append(averages)

In [6]:
# create home and away column names for each statistic
stat_df_column_names = []
for cat in categories:
    stat_df_column_names.append(cat+home_away[0])   
for cat in categories:
    stat_df_column_names.append(cat+home_away[1])

# combine home and away average statistics with dataframe
stats_l5_df = pd.concat([pd.DataFrame(last5_home),pd.DataFrame(last5_away)], axis = 1)
stats_l5_df.columns = stat_df_column_names
stats_combined = attendance_previous.copy()
stats_combined.reset_index(inplace = True, drop = True)
stats_combined = pd.concat([stats_combined,stats_l5_df], axis = 1)
stats_combined.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,L5_AVG_REB_HOME,L5_AVG_AST_HOME,L5_AVG_FG3_PCT_HOME,L5_AVG_PLUS_MINUS_HOME,L5_AVG_PTS_AWAY,L5_AVG_BLK_AWAY,L5_AVG_REB_AWAY,L5_AVG_AST_AWAY,L5_AVG_FG3_PCT_AWAY,L5_AVG_PLUS_MINUS_AWAY
0,2019-10-22,16867.0,21900002,22019,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,45.0,23.0,0.32,-9.8,115.6,5.2,50.2,26.6,0.38,1.6
1,2019-10-22,19068.0,21900001,22019,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,46.4,26.6,0.37,-5.2,110.4,6.6,45.0,23.0,0.33,-4.6
2,2019-10-23,18203.0,21900010,22019,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,56.0,27.6,0.37,11.4,105.8,5.8,46.0,23.6,0.34,-4.0
3,2019-10-23,17732.0,21900006,22019,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,49.6,29.0,0.39,11.0,99.2,5.8,51.4,21.2,0.26,-10.0
4,2019-10-23,17923.0,21900009,22019,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,39.2,26.0,0.36,-2.4,109.4,3.8,40.2,25.8,0.38,3.8


### Winning %, Winstreak

features created:
1) team win % at time of game
2) winstreak

In [7]:
# get names of all teams
teams = stats_combined['TEAM_ABBREVIATION_HOME'].unique()
home_percentages = []
away_percentages = []

for season in stats_combined['SEASON_ID'].unique():
    
    # create win / loss dictionary for each team
    wl_dict = {}
    for t in teams:
        wl_dict[t] = {'Wins': 0, 'Losses': 0, 'Winstreak': 0}
    df = attendance_initial[attendance_initial['SEASON_ID'] == season]
    df.reset_index(inplace = True, drop = True)
    
    # calculate team win percentage at time of each game
    for i in range(len(df)):
        first_game_home = False
        first_game_away = False
        # check if game is either teams first game of the season
        if (wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Wins'] + wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Losses']) == 0:
            home_percentages.append([df['GAME_ID'].iloc[i],0,0])
            first_game_home = True
        if (wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Wins'] + wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Losses']) == 0:
            away_percentages.append([df['GAME_ID'].iloc[i],0,0])
            first_game_away = True

        
        # calculate win/loss record and winstreak for home team (need to account for first game of the season)
        if first_game_home == False:
            home_wins = wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Wins']
            home_losses = wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Losses']
            record = round(home_wins / (home_wins + home_losses),2)
            home_percentages.append([df['GAME_ID'].iloc[i],record, wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Winstreak']])
        else:
            pass
        
        #calculate win/loss record and winstreak for away team (need to account for first game of the season)
        if first_game_away == False:
            away_wins = wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Wins']
            away_losses = wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Losses']
            record = round(away_wins / (away_wins + away_losses),2)
            away_percentages.append([df['GAME_ID'].iloc[i],record,wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Winstreak']])
        else:
            pass
        
        # update win / loss dictionary and winstreak
        if df.loc[i,'WL_HOME'] == 'W':
            wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Wins'] += 1
            wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Losses'] += 1
            wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Winstreak'] += 1
            wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Winstreak'] = 0
        else:
            wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Losses'] += 1
            wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Wins'] += 1
            wl_dict[df['TEAM_ABBREVIATION_HOME'].iloc[i]]['Winstreak'] = 0
            wl_dict[df['TEAM_ABBREVIATION_AWAY'].iloc[i]]['Winstreak'] += 1

In [8]:
# create side table for records and winstreak
hp = pd.DataFrame(home_percentages, columns = ['GAME_ID_Y','HOME_WIN_PCT','HOME_WINSTREAK'])
ap = pd.DataFrame(away_percentages, columns = ['GAME_ID_X','AWAY_WIN_PCT','AWAY_WINSTREAK'])

# combine with main table
records_df = pd.concat([hp,ap], axis = 1).drop(columns=['GAME_ID_X'], axis = 1)
records_df = records_df[records_df['GAME_ID_Y'].isin(stats_combined['GAME_ID'])]
records_df.reset_index(inplace = True, drop = True)
records_df = pd.concat([stats_combined, records_df], axis = 1).drop(columns=['GAME_ID_Y'], axis = 1)
records_df['SEASON_ID'] = records_df['SEASON_ID'].replace([22019, 22021, 22022],[2020, 2022,2023])
records_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,L5_AVG_PTS_AWAY,L5_AVG_BLK_AWAY,L5_AVG_REB_AWAY,L5_AVG_AST_AWAY,L5_AVG_FG3_PCT_AWAY,L5_AVG_PLUS_MINUS_AWAY,HOME_WIN_PCT,HOME_WINSTREAK,AWAY_WIN_PCT,AWAY_WINSTREAK
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,115.6,5.2,50.2,26.6,0.38,1.6,0.0,0,0.0,0
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,110.4,6.6,45.0,23.0,0.33,-4.6,0.0,0,0.0,0
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,105.8,5.8,46.0,23.6,0.34,-4.0,0.0,0,0.0,0
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,99.2,5.8,51.4,21.2,0.26,-10.0,0.0,0,0.0,0
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,109.4,3.8,40.2,25.8,0.38,3.8,0.0,0,0.0,0


### Basketball Reference Statistics

Features created: 
1) Conference
2) Margin of victory
3) offensive rating
4) defensive rating 
5) net rating
6) all-star players on team

In [9]:
#get data from basketball reference
years = ['2020','2022', '2023']
br_dfs = {}

# create dictionary to store each seasons table
for year in years:
    url = f'https://www.basketball-reference.com/leagues/NBA_{year}_ratings.html'
    data = requests.get(url)
    
    soup = BeautifulSoup(data.text, 'html.parser')
    for tr in soup.findAll('tr', class_='over_header'):
        tr.decompose()
    table = soup.find(id='ratings')
    ratings = pd.read_html(str(table))[0]
    ratings['Season'] = int(year)
    ratings['Team'] = ratings['Team'].replace(['Los Angeles Clippers'],['LA Clippers'])
    br_dfs[f'br_stats_{year}'] = ratings[['Team','Conf','MOV','ORtg','DRtg','NRtg','Season']]
br_dfs.keys()

dict_keys(['br_stats_2020', 'br_stats_2022', 'br_stats_2023'])

In [10]:
# create columns for home team
team_stats_df = pd.concat(br_dfs, ignore_index = True)
team_stats_df = team_stats_df.add_suffix('_HOME')
tstats_combined_df = records_df.merge(team_stats_df, how='left', left_on=['TEAM_NAME_HOME','SEASON_ID'], right_on=['Team_HOME','Season_HOME']).drop(columns=['Team_HOME','Season_HOME'], axis = 1)

# create columns for away team
team_stats_df = pd.concat(br_dfs, ignore_index = True)
team_stats_df = team_stats_df.add_suffix('_AWAY')
tstats_combined_df = tstats_combined_df.merge(team_stats_df, how='left', left_on=['TEAM_NAME_AWAY','SEASON_ID'], right_on=['Team_AWAY','Season_AWAY']).drop(columns=['Team_AWAY','Season_AWAY'], axis = 1)
tstats_combined_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,Conf_HOME,MOV_HOME,ORtg_HOME,DRtg_HOME,NRtg_HOME,Conf_AWAY,MOV_AWAY,ORtg_AWAY,DRtg_AWAY,NRtg_AWAY
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,W,-1.29,111.42,112.62,-1.2,W,5.79,112.78,107.14,5.65
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,W,6.44,114.66,108.25,6.41,E,6.24,112.02,105.85,6.18
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,W,1.99,111.74,109.66,2.09,E,-4.67,111.69,116.16,-4.46
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,E,-1.01,109.1,110.28,-1.18,E,-6.45,107.1,113.52,-6.42
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,E,1.96,110.9,108.77,2.12,W,-4.3,108.91,113.03,-4.12


In [11]:
# prepare table names for each season
all_star_years = [2022, 2021, 2019]
table_ids = [['Team Lebron','Team Durant'],['Team LeBron','Team Durant'],['Team LeBron','Team Giannis']]
all_star_tables = []

# get all star game data from each season
for i in range(3):
    url = f'https://www.basketball-reference.com/allstar/NBA_{all_star_years[i]}.html'
    data = requests.get(url)

    soup = BeautifulSoup(data.text, 'html.parser')
    for tr in soup.findAll('tr', class_='over_header'):
        tr.decompose()
    for tr in soup.findAll('tr', class_='thead'):
        tr.decompose()
    table1 = soup.find(id=table_ids[i][0])
    table2 = soup.find(id=table_ids[i][1])
    team1 = pd.read_html(str(table1))[0]
    team2 = pd.read_html(str(table2))[0]
    
    all_stars = pd.concat([team1, team2])
    all_stars.reset_index(inplace = True, drop = True)
    all_stars = all_stars.drop(all_stars.index[-1])
    all_star_count = pd.DataFrame(all_stars.groupby('Tm').size(), columns=['all_stars'])
    all_star_count.reset_index(inplace = True)
    all_star_count['Season'] = all_star_years[i] + 1
    all_star_tables.append(all_star_count)

In [12]:
# create all star columns for home team
all_star_table = pd.concat(all_star_tables, ignore_index = True)
all_star_table['Tm'] = all_star_table['Tm'].replace(['CHO','PHO','BRK'],['CHA','PHX','BKN'])
all_star_table = all_star_table.add_suffix('_HOME')
all_star_df = tstats_combined_df.merge(all_star_table, how='left', left_on=['TEAM_ABBREVIATION_HOME','SEASON_ID'], right_on=['Tm_HOME','Season_HOME']).drop(columns=['Tm_HOME','Season_HOME'], axis = 1)

# create all star columns for away team
all_star_table = pd.concat(all_star_tables, ignore_index = True)
all_star_table['Tm'] = all_star_table['Tm'].replace(['CHO','PHO','BRK'],['CHA','PHX','BKN'])
all_star_table = all_star_table.add_suffix('_AWAY')
all_star_df = all_star_df.merge(all_star_table, how='left', left_on=['TEAM_ABBREVIATION_AWAY','SEASON_ID'], right_on=['Tm_AWAY','Season_AWAY']).drop(columns=['Tm_AWAY','Season_AWAY'], axis = 1)

# fill rest of values with 0 for no all stars
all_star_df['all_stars_HOME'] = all_star_df['all_stars_HOME'].fillna(0)
all_star_df['all_stars_AWAY'] = all_star_df['all_stars_AWAY'].fillna(0)

all_star_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,ORtg_HOME,DRtg_HOME,NRtg_HOME,Conf_AWAY,MOV_AWAY,ORtg_AWAY,DRtg_AWAY,NRtg_AWAY,all_stars_HOME,all_stars_AWAY
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,111.42,112.62,-1.2,W,5.79,112.78,107.14,5.65,1.0,1.0
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,114.66,108.25,6.41,E,6.24,112.02,105.85,6.18,0.0,2.0
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,111.74,109.66,2.09,E,-4.67,111.69,116.16,-4.46,2.0,1.0
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,109.1,110.28,-1.18,E,-6.45,107.1,113.52,-6.42,1.0,0.0
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,110.9,108.77,2.12,W,-4.3,108.91,113.03,-4.12,0.0,1.0


### Playoffs previous season

In [13]:
# read in dataset 
location_playoffs = pd.read_csv('weather_data.csv')
location_playoffs = location_playoffs.set_index('Team Code')
playoff_columns = location_playoffs.columns[-3:]

# check by season
years = [2020, 2022, 2023]
playoffs_home = []
playoffs_away = []

# check if either team was in the playoffs during the previous season
for x in range(3):
    df = all_star_df[all_star_df['SEASON_ID'] == years[x]]
    df.reset_index(inplace = True, drop = True)
    for i in range(len(df)):
        team_home = df.loc[i,'TEAM_ABBREVIATION_HOME']
        playoffs_home.append(location_playoffs.loc[team_home,playoff_columns[x]])
        
        team_away = df.loc[i,'TEAM_ABBREVIATION_AWAY']
        playoffs_away.append(location_playoffs.loc[team_away,playoff_columns[x]])

# add columns to dataframe
playoffs_df = all_star_df.copy()
playoffs_df['PLAYOFFS_HOME'] = playoffs_home
playoffs_df['PLAYOFFS_AWAY'] = playoffs_away
playoffs_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,NRtg_HOME,Conf_AWAY,MOV_AWAY,ORtg_AWAY,DRtg_AWAY,NRtg_AWAY,all_stars_HOME,all_stars_AWAY,PLAYOFFS_HOME,PLAYOFFS_AWAY
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,-1.2,W,5.79,112.78,107.14,5.65,1.0,1.0,No,No
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,6.41,E,6.24,112.02,105.85,6.18,0.0,2.0,Yes,Yes
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,2.09,E,-4.67,111.69,116.16,-4.46,2.0,1.0,Yes,No
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,-1.18,E,-6.45,107.1,113.52,-6.42,1.0,0.0,Yes,No
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,2.12,W,-4.3,108.91,113.03,-4.12,0.0,1.0,Yes,No


### Date and Weather

features created: 
1) weekday vs weekend
2) playoffs previous season
2) average temperature
3) average precipitation
4) average snow level

In [14]:
# create column for weekday vs weekend
playoffs_df['GAME_DATE_x'] = pd.to_datetime(playoffs_df['GAME_DATE_x'])
playoffs_df['WEEKEND'] = [1 if x.weekday() >= 4 else 0 for x in playoffs_df['GAME_DATE_x']]
playoffs_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,Conf_AWAY,MOV_AWAY,ORtg_AWAY,DRtg_AWAY,NRtg_AWAY,all_stars_HOME,all_stars_AWAY,PLAYOFFS_HOME,PLAYOFFS_AWAY,WEEKEND
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,W,5.79,112.78,107.14,5.65,1.0,1.0,No,No,0
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,E,6.24,112.02,105.85,6.18,0.0,2.0,Yes,Yes,0
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,E,-4.67,111.69,116.16,-4.46,2.0,1.0,Yes,No,0
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,E,-6.45,107.1,113.52,-6.42,1.0,0.0,Yes,No,0
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,W,-4.3,108.91,113.03,-4.12,0.0,1.0,Yes,No,0


In [15]:
# years for weather
years = [2020, 2022, 2023]

# create dataframe for weather in each home city during each season
temperature_dfs = {}
for i in range(3):
    # get start and end date for each season
    df = playoffs_df[playoffs_df['SEASON_ID'] == years[i]]
    start = min(df['GAME_DATE_x']).to_pydatetime()
    end = max(df['GAME_DATE_x']).to_pydatetime()
    
    for team in location_playoffs.index:
        # use meteostat to get weather data in each city
        city = Point(location_playoffs.loc[team,'Lat'],location_playoffs.loc[team,'Long'])
        
        data = Daily(city, start, end)
        data = data.convert(units.imperial)
        data = data.fetch()
        data.reset_index(inplace = True)
        if i == 0:
            temperature_dfs[team] = data
        else:
            temperature_dfs[team] = pd.concat([temperature_dfs[team],data])
    
for team in temperature_dfs.keys():
    temperature_dfs[team] = temperature_dfs[team].iloc[:,0:6]

In [16]:
# find weather information for the day of each game
weather_rows = []
for i in range(len(playoffs_df)):
    team = playoffs_df.loc[i,'TEAM_ABBREVIATION_HOME']
    date = playoffs_df.loc[i,'GAME_DATE_x']
    
    df = temperature_dfs[team]
    weather = df[df['time'] == date].iloc[:,1:]
    weather_rows.append(weather)
    
weather_df = pd.concat(weather_rows)
weather_df.reset_index(inplace = True, drop = True)
weather_df = weather_df.fillna(0)
weather_df.head()

Unnamed: 0,tavg,tmin,tmax,prcp,snow
0,72.0,57.0,79.0,0.0,0.0
1,77.7,64.9,96.1,0.0,0.0
2,57.2,43.0,78.1,0.071,0.0
3,76.5,71.1,82.9,0.0,0.0
4,49.6,36.0,64.9,0.0,0.0


In [17]:
# combine weather data with main table
date_weather = pd.concat([playoffs_df, weather_df], axis = 1)
date_weather.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,all_stars_HOME,all_stars_AWAY,PLAYOFFS_HOME,PLAYOFFS_AWAY,WEEKEND,tavg,tmin,tmax,prcp,snow
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,1.0,1.0,No,No,0,72.0,57.0,79.0,0.0,0.0
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,0.0,2.0,Yes,Yes,0,77.7,64.9,96.1,0.0,0.0
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,2.0,1.0,Yes,No,0,57.2,43.0,78.1,0.071,0.0
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,1.0,0.0,Yes,No,0,76.5,71.1,82.9,0.0,0.0
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,0.0,1.0,Yes,No,0,49.6,36.0,64.9,0.0,0.0


### Best Statistics for Each Team

Getting the highest individual player average for each statistic by team (ex: LAL PTS: 30.1 | The highest average points per game by a player on the LA Lakers is 30.1)

features created: Max (Statistic)

In [18]:
all_seasons = ['2019-20','2021-22','2022-23']

# statistics to get max of
stats_list = ['OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL',
       'BLK', 'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'NBA_FANTASY_PTS',
        'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT','FTM', 'FTA']

# get max of each statistic for each team during the season
i = 0
max_stats_dfs = []
for season in all_seasons:
    season_max_stats = []
    for team in date_weather['TEAM_ID_HOME'].unique():
        
        # use NBA API Client to get player statistic information by season
        df = teamplayerdashboard.TeamPlayerDashboard(season=season,team_id=str(team),per_mode_detailed= 'PerGame').get_data_frames()[1]
        stats_for_team = [team,years[i]]
        
        # get maximum of each statistic
        for stat in stats_list:
            stats_for_team.append(max(df[stat]))
        plyr_20 = len(df[df['PTS'] >= 20])
        stats_for_team.append(plyr_20)
        season_max_stats.append(stats_for_team)
        time.sleep(1)
    
    # create column names for max values
    s_cols = ['MAX_' + x for x in ['Team','Season'] + stats_list]
    s_df = pd.DataFrame(season_max_stats, columns=s_cols + ['PLAYERS_OVER_20PPG'])
    max_stats_dfs.append(s_df)
    i += 1

# create dataframe containing home and away max team statistics
max_stats_df = pd.concat(max_stats_dfs)
max_stats_df.reset_index(inplace = True, drop = True)
max_stats_df.head()

Unnamed: 0,MAX_Team,MAX_Season,MAX_OREB,MAX_DREB,MAX_REB,MAX_AST,MAX_TOV,MAX_STL,MAX_BLK,MAX_BLKA,...,MAX_PTS,MAX_PLUS_MINUS,MAX_NBA_FANTASY_PTS,MAX_FG_PCT,MAX_FG3M,MAX_FG3A,MAX_FG3_PCT,MAX_FTM,MAX_FTA,PLAYERS_OVER_20PPG
0,1610612740,2020,3.2,6.6,9.8,7.0,3.1,1.6,0.9,1.6,...,23.8,2.6,39.2,0.672,3.0,6.6,0.5,5.0,7.4,2
1,1610612746,2020,2.7,6.1,7.5,5.6,2.8,1.8,1.1,1.1,...,27.1,8.8,47.4,0.733,3.3,7.9,1.0,6.2,7.1,2
2,1610612760,2020,3.3,6.0,9.3,6.7,2.6,1.6,1.5,0.9,...,19.0,4.7,36.6,0.684,2.9,7.1,0.405,4.3,5.1,0
3,1610612753,2020,2.3,8.6,10.9,5.1,2.0,1.6,2.3,0.8,...,19.6,1.3,41.5,0.51,2.6,7.3,0.5,2.8,3.6,0
4,1610612754,2020,3.0,9.4,12.4,7.1,2.7,1.2,2.1,1.0,...,19.8,4.0,41.9,0.591,1.9,5.5,0.435,3.5,4.8,0


In [19]:
# create max columns for home team
max_home = max_stats_df.copy()
max_home = max_home.rename(columns={x: x+'_HOME' for x in max_home.columns if x not in ['MAX_Team', 'MAX_Season']})
player_max_df = date_weather.merge(max_home, how='left', left_on=['TEAM_ID_HOME', 'SEASON_ID'], right_on=['MAX_Team','MAX_Season'])
player_max_df = player_max_df.drop(columns=['MAX_Team','MAX_Season'], axis = 1)

# create max columns for away team
max_away = max_stats_df.copy()
max_away = max_away.rename(columns={x: x+'_AWAY' for x in max_away.columns if x not in ['MAX_Team', 'MAX_Season']})
player_max_df = player_max_df.merge(max_away, how='left', left_on=['TEAM_ID_AWAY', 'SEASON_ID'], right_on=['MAX_Team','MAX_Season'])
player_max_df = player_max_df.drop(columns=['MAX_Team','MAX_Season'], axis = 1)

player_max_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,MAX_PTS_AWAY,MAX_PLUS_MINUS_AWAY,MAX_NBA_FANTASY_PTS_AWAY,MAX_FG_PCT_AWAY,MAX_FG3M_AWAY,MAX_FG3A_AWAY,MAX_FG3_PCT_AWAY,MAX_FTM_AWAY,MAX_FTA_AWAY,PLAYERS_OVER_20PPG_AWAY
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,26.1,6.6,51.3,1.0,2.2,6.3,0.6,7.2,8.5,2
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,22.9,6.6,40.0,0.6,2.8,8.0,0.5,5.1,5.9,1
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,30.5,4.9,46.3,0.581,3.7,8.7,0.6,6.8,8.0,1
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,19.6,-0.5,36.2,0.742,2.7,6.1,0.439,4.0,5.5,0
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,26.5,5.0,49.3,0.547,3.5,9.2,0.426,5.1,6.5,4


### Team Popularity Using Social Media Followers

In [20]:
# dataset that contains team twitter + fb followers at beginning of each season
fans = pd.read_csv('data/social_media_fans.csv')
fans.head(3)

Unnamed: 0,Team,Season,Fans
0,HOU,2020,15.46
1,HOU,2022,18.71
2,HOU,2023,19.03


In [21]:
# create social media fans column for home team
fans_df = player_max_df.merge(fans, how='left', left_on=['TEAM_ABBREVIATION_HOME','SEASON_ID'], right_on=['Team','Season'])
fans_df = fans_df.drop(columns=['Team','Season'], axis = 1)
fans_df = fans_df.rename(columns={'Fans': 'FANS_HOME'})

# create social media fans column for away team
fans_df = fans_df.merge(fans, how='left', left_on=['TEAM_ABBREVIATION_AWAY','SEASON_ID'], right_on=['Team','Season'])
fans_df = fans_df.drop(columns=['Team','Season'], axis = 1)
fans_df = fans_df.rename(columns={'Fans': 'FANS_AWAY'})
fans_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,MAX_NBA_FANTASY_PTS_AWAY,MAX_FG_PCT_AWAY,MAX_FG3M_AWAY,MAX_FG3A_AWAY,MAX_FG3_PCT_AWAY,MAX_FTM_AWAY,MAX_FTA_AWAY,PLAYERS_OVER_20PPG_AWAY,FANS_HOME,FANS_AWAY
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,51.3,1.0,2.2,6.3,0.6,7.2,8.5,2,2.73,29.57
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,40.0,0.6,2.8,8.0,0.5,5.1,5.9,1,5.28,4.78
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,46.3,0.581,3.7,8.7,0.6,6.8,8.0,1,9.87,2.55
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,36.2,0.742,2.7,6.1,0.439,4.0,5.5,0,4.25,7.93
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,49.3,0.547,3.5,9.2,0.426,5.1,6.5,4,4.45,2.84


### Vegas Odds for each team

features created:
1) odds to win championship
2) projected number of wins

In [22]:
# get vegas betting data
vegas_dfs = []
for year in years:
    
    url = f'https://www.basketball-reference.com/leagues/NBA_{str(year)}_preseason_odds.html'
    df = pd.read_html(url)[0][['Team','Odds','W-L O/U']]
    df['Season'] = year
    vegas_dfs.append(df)

In [23]:
vegas_df = pd.concat(vegas_dfs)
vegas_df['Team'] = vegas_df['Team'].replace(['Los Angeles Clippers'], 'LA Clippers')

# create vegas columns for home team
vegas_home = vegas_df.add_suffix('_HOME')
vegas_team_df = fans_df.merge(vegas_home, how='left', left_on=['TEAM_NAME_HOME','SEASON_ID'], right_on=['Team_HOME','Season_HOME'])
vegas_team_df = vegas_team_df.drop(columns=['Team_HOME','Season_HOME'], axis = 1)

# create vegas columns for away team
vegas_away = vegas_df.add_suffix('_AWAY')
vegas_team_df = vegas_team_df.merge(vegas_away, how='left', left_on=['TEAM_NAME_AWAY','SEASON_ID'], right_on=['Team_AWAY','Season_AWAY'])
vegas_team_df = vegas_team_df.drop(columns=['Team_AWAY','Season_AWAY'], axis = 1)

vegas_team_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,MAX_FG3_PCT_AWAY,MAX_FTM_AWAY,MAX_FTA_AWAY,PLAYERS_OVER_20PPG_AWAY,FANS_HOME,FANS_AWAY,Odds_HOME,W-L O/U_HOME,Odds_AWAY,W-L O/U_AWAY
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,0.6,7.2,8.5,2,2.73,29.57,12500,37.5,450,50.5
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,0.5,5.1,5.9,1,5.28,4.78,425,53.5,5500,46.5
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,0.6,6.8,8.0,1,9.87,2.55,25000,32.5,125000,27.0
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,0.439,4.0,5.5,0,4.25,7.93,20000,41.5,100000,26.5
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,0.426,5.1,6.5,4,4.45,2.84,5000,46.5,50000,35.5


### Census Data 

Various demographic statistics about the home city. Data collected through scraping the American Census Survey API (code in data_collection file)

features created: total population, total labor force, total unemployed, total high school graduates, total some college or associates, total bachelors, total graduate degree, total women who birthed, total households, total with income, median gross rent, total for poverty level, total under poverty level

In [24]:
census_data = pd.read_csv('data/census_data.csv').iloc[:,1:]
census_data['Team'] = census_data['Team'].replace(['PHO','BRK'], ['PHX','BKN'])

# take average for each city to use for toronto
tor = census_data.groupby('Season').mean().round(2)
tor.reset_index(inplace = True, drop = True)
tor['Season'] = [2020,2022,2023]
tor['Team'] = 'TOR'
census_data = pd.concat([census_data,tor])
census_data.reset_index(inplace = True, drop = True)
census_data = census_data.drop(columns=['state','county'], axis = 1)
census_data.head()

Unnamed: 0,Total Population,Total labor force,Total Unemployed,Total pop high school graduate,Total pop w/ some college or associates,Total pop w/ bachelors,Total pop w/ grad degree,Total women 15 TO 50y who birthed in last 12m,Total number of households in household income data,Total with household income <$10k,...,Total with household income $75-100k,Total with household income $100-125k,Total with household income $125-150k,Total with household income $150-200k,Total with household income $200k+,Median gross rent,Total for poverty level,Total under poverty level,Season,Team
0,1054286.0,586165.0,32148.0,118809.0,150430.0,235791.0,166732.0,288096.0,441958.0,27288.0,...,53696.0,37568.0,27639.0,36709.0,72026.0,1367.0,1021572.0,131419.0,2023,ATL
1,792647.0,470235.0,32537.0,115175.0,99726.0,144811.0,121365.0,238943.0,315192.0,26751.0,...,34132.0,28314.0,23518.0,30535.0,48298.0,1761.0,748596.0,129207.0,2023,BOS
2,2712360.0,1369891.0,97387.0,473782.0,355048.0,443269.0,298611.0,710040.0,985108.0,77313.0,...,108053.0,85540.0,61716.0,81610.0,122381.0,1582.0,2685920.0,504813.0,2023,BKN
3,1100984.0,620982.0,28741.0,123146.0,203525.0,227443.0,121767.0,298729.0,435562.0,18929.0,...,55631.0,42670.0,29876.0,34520.0,50131.0,1276.0,1084575.0,114598.0,2023,CHA
4,5265398.0,2809310.0,203970.0,820019.0,906703.0,869960.0,619971.0,1316498.0,2044658.0,139784.0,...,252760.0,191335.0,138706.0,173065.0,232167.0,1214.0,5179874.0,698319.0,2023,CHI


In [25]:
census_df = vegas_team_df.merge(census_data, how='left', left_on=['TEAM_ABBREVIATION_HOME','SEASON_ID'], right_on=['Team','Season'])
census_df = census_df.drop(columns=['Team','Season'], axis = 1)
census_df.head()

Unnamed: 0,GAME_DATE_x,ATTENDANCE,GAME_ID,SEASON_ID,TEAM_ID_HOME,TEAM_ABBREVIATION_HOME,TEAM_NAME_HOME,TEAM_ID_AWAY,TEAM_ABBREVIATION_AWAY,TEAM_NAME_AWAY,...,Total with household income $50-60k,Total with household income $60-75k,Total with household income $75-100k,Total with household income $100-125k,Total with household income $125-150k,Total with household income $150-200k,Total with household income $200k+,Median gross rent,Total for poverty level,Total under poverty level
0,2019-10-22,16867.0,21900002,2020,1610612740,NOP,New Orleans Pelicans,1610612747,LAL,Los Angeles Lakers,...,10146.0,11740.0,13571.0,9275.0,5891.0,6991.0,9879.0,998.0,377695.0,89340.0
1,2019-10-22,19068.0,21900001,2020,1610612746,LAC,LA Clippers,1610612761,TOR,Toronto Raptors,...,225974.0,300644.0,408135.0,310236.0,213893.0,258815.0,338330.0,1460.0,9928773.0,1480446.0
2,2019-10-23,18203.0,21900010,2020,1610612760,OKC,Oklahoma City Thunder,1610612764,WAS,Washington Wizards,...,24840.0,30412.0,37159.0,23142.0,14607.0,16013.0,17501.0,870.0,771566.0,123116.0
3,2019-10-23,17732.0,21900006,2020,1610612753,ORL,Orlando Magic,1610612752,NYK,New York Knicks,...,38499.0,48146.0,57426.0,39515.0,23631.0,26235.0,30255.0,1215.0,1316191.0,195467.0
4,2019-10-23,17923.0,21900009,2020,1610612754,IND,Indiana Pacers,1610612750,MIN,Minnesota Timberwolves,...,29975.0,36392.0,42149.0,26416.0,15707.0,15932.0,14571.0,889.0,932652.0,165969.0


### Further Pre-Processing

In [26]:
# select only fields
data_preprocessing = census_df.iloc[:,27:]

# change categorical fields to numerical 1 or 0 fields
data_preprocessing['PLAYOFFS_HOME'] = [1 if x == 'Yes' else 0 for x in data_preprocessing['PLAYOFFS_HOME']]
data_preprocessing['PLAYOFFS_AWAY'] = [1 if x == 'Yes' else 0 for x in data_preprocessing['PLAYOFFS_AWAY']]
data_preprocessing['Conf_HOME'] = [1 if x == 'W' else 0 for x in data_preprocessing['Conf_HOME']]
data_preprocessing['Conf_AWAY'] = [1 if x == 'W' else 0 for x in data_preprocessing['Conf_AWAY']]
data_preprocessing.head()

Unnamed: 0,Capacity,Attendance,PREV_ATTENDANCE,L5_AVG_PTS_HOME,L5_AVG_BLK_HOME,L5_AVG_REB_HOME,L5_AVG_AST_HOME,L5_AVG_FG3_PCT_HOME,L5_AVG_PLUS_MINUS_HOME,L5_AVG_PTS_AWAY,...,Total with household income $50-60k,Total with household income $60-75k,Total with household income $75-100k,Total with household income $100-125k,Total with household income $125-150k,Total with household income $150-200k,Total with household income $200k+,Median gross rent,Total for poverty level,Total under poverty level
0,16867,16867.0,12412.0,107.8,3.8,45.0,23.0,0.32,-9.8,115.6,...,10146.0,11740.0,13571.0,9275.0,5891.0,6991.0,9879.0,998.0,377695.0,89340.0
1,19068,19068.0,19068.0,113.0,4.4,46.4,26.6,0.37,-5.2,110.4,...,225974.0,300644.0,408135.0,310236.0,213893.0,258815.0,338330.0,1460.0,9928773.0,1480446.0
2,18203,18203.0,13110.0,116.8,4.2,56.0,27.6,0.37,11.4,105.8,...,24840.0,30412.0,37159.0,23142.0,14607.0,16013.0,17501.0,870.0,771566.0,123116.0
3,18846,17732.0,15898.0,116.2,4.4,49.6,29.0,0.39,11.0,99.2,...,38499.0,48146.0,57426.0,39515.0,23631.0,26235.0,30255.0,1215.0,1316191.0,195467.0
4,17923,17923.0,16532.0,105.2,4.4,39.2,26.0,0.36,-2.4,109.4,...,29975.0,36392.0,42149.0,26416.0,15707.0,15932.0,14571.0,889.0,932652.0,165969.0


### Feature Selection

In [27]:
#modeling stack
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import SequentialFeatureSelector
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
import matplotlib.pyplot as plt
import seaborn as sns

In [28]:
data_preprocessing.isnull().sum()

Capacity                                 0
Attendance                               0
PREV_ATTENDANCE                          0
L5_AVG_PTS_HOME                          0
L5_AVG_BLK_HOME                          0
                                        ..
Total with household income $150-200k    0
Total with household income $200k+       0
Median gross rent                        0
Total for poverty level                  0
Total under poverty level                0
Length: 113, dtype: int64

### Forward Selection

In [29]:
#set target and predictor variables
X = data_preprocessing.drop(columns=['Attendance'], axis = 1)
y = data_preprocessing['Attendance']
X.shape

(2826, 112)

In [30]:
#normalize data
mms = MinMaxScaler()
X_scaled = mms.fit_transform(X)

# forward selection
model = DecisionTreeRegressor(max_depth = 4, min_samples_split = 5)
sfs1 = SequentialFeatureSelector(model, n_features_to_select=20, direction='forward', cv = 5)

In [31]:
# fit data
sfs1 = sfs1.fit(X_scaled, y)

In [32]:
features = sfs1.get_feature_names_out(input_features=X.columns)
features

array(['Capacity', 'L5_AVG_PTS_HOME', 'L5_AVG_FG3_PCT_HOME', 'Conf_HOME',
       'MOV_HOME', 'ORtg_HOME', 'Conf_AWAY', 'all_stars_HOME',
       'all_stars_AWAY', 'PLAYOFFS_HOME', 'PLAYOFFS_AWAY', 'WEEKEND',
       'MAX_OREB_HOME', 'PLAYERS_OVER_20PPG_HOME', 'MAX_REB_AWAY',
       'MAX_FG3M_AWAY', 'PLAYERS_OVER_20PPG_AWAY',
       'Total with household income $20-25k',
       'Total with household income $25-30k',
       'Total with household income $35-40k'], dtype=object)

Looking at the 20 final features from forward selection, we can see that a few categories stand out. We can see that 7 of the 20 features are related to star power player:

1) all_stars_HOME
2) all_stars_AWAY
3) MAX_OREB_HOME
4) PLAYERS_OVER_20PPG_HOME
5) MAX_REB_AWAY
6) MAX_FG3M_AWAY
7) PLAYERS_OVER_20PPG_HOME

Over the past decade, the NBA has started receiving scrutiny for the officiating in games, where fans and analysts complain that too many fouls are given out. Analysts have suspected that the reason the NBA has not acted on these complaints is because more fouls means more points, and more points means more superstars. Looking at these 7 factors, the influence that star power has on attendance becomes more clear as we can see that these features have a higher importance in helping predict game attendance. The recent craze around 3-pointers thanks to Steph Curry also seems to appear, with average 3-point percentage and maximum number of 3-point attempts showing up in the final 20 features. 

What was also particularly interesting was the 3 census features that were selected:

1) Total with household income 20-25k
2) Total with household income 25-30k
3) Total with household income 35-40k

Somewhat surprisingly, the number of low-income households appears to be quite important in determining game attendance. 

### Final Attendance Model

In [34]:
X = data_preprocessing[features]
y = data_preprocessing['Attendance']

#normalize data
mms = MinMaxScaler()
X_scaled = mms.fit_transform(X)


X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, y, test_size = .25)
model = DecisionTreeRegressor(max_depth = 1, min_samples_split = 2, min_samples_leaf = 1, random_state = 5)
model.fit(X_train,Y_train)
model.score(X_test,Y_test)

0.5911863370733647