# Business Understanding: NBA Last 2 Minute Analysis

Over the years, the relationship between referees and players in the National Basketball Association (NBA) has reached an impasse. Players and coaches believe that referees are showing bias and affecting the outcome of games. That sentiment has spread to the fan base, where fans are taking issue with how games are being called. To help protect the integrity of the game, the NBA introduced the Last Two Minute report. The report assesses officiated events in the last two minutes of every game that was at or within three points at any point during the last two minutes of the fourth quarter or overtime. The plays assessed include all calls and all notable non-calls, where non-calls are defined as material plays directly related to the outcome of a possession. The report was the NBA's first step toward total transparency. The main outcome the NBA is concerned with is whether the referees got the call right under the rules.

Our team's analysis will focus on determining whether referee calls during those last two minutes show bias toward particular players or teams. We will augment our base data set with an additional attribute indicating whether a player is an all-star. Are all-star players treated differently? Do officials tend to get more calls right or wrong when all-star players are involved? Answering these questions can help in training referees to recognize their biases.
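As a rough sketch of the comparison described above, the toy example below flags all-star players and computes the share of correct decisions for each group. The player names, the `all_stars` set, and the column names `committing_is_all_star` and `decision_correct` are hypothetical. It also assumes that the `review_decision` codes `CC` and `CNC` mark correct decisions while `IC` and `INC` mark incorrect ones.

```python
import pandas as pd

# Toy rows shaped like the L2M data; the names are made up for illustration.
plays = pd.DataFrame({
    "committing_player": ["Player A", "Player B", "Player A", "Player C"],
    "review_decision": ["CC", "INC", "CNC", "IC"],
})
all_stars = {"Player A"}  # hypothetical all-star lookup

# Flag whether the committing player is an all-star.
plays["committing_is_all_star"] = plays["committing_player"].isin(all_stars)

# Assumption: CC and CNC are correct decisions; IC and INC are incorrect.
plays["decision_correct"] = plays["review_decision"].isin(["CC", "CNC"])

# Share of correct decisions for all-stars versus everyone else.
accuracy_by_status = plays.groupby("committing_is_all_star")["decision_correct"].mean()
print(accuracy_by_status)
```

A difference in these two rates would only be a starting point; it does not by itself show bias, since call types and game situations may differ between the groups.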

In [1]:
# Import Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# allow inline graphics from matplotlib
%matplotlib inline       

import seaborn as sns 

import warnings 
warnings.simplefilter('ignore', DeprecationWarning) 

In [3]:
# read data previously pulled from...and saved to csv
df = pd.read_csv("data/all_games.csv")

df.head()

Unnamed: 0,period,time,seconds_left,call_type,committing_player,disadvantaged_player,review_decision,comment,video,game_id,...,ref_1,ref_2,ref_3,score_away,score_home,original_pdf,box_score_url,disadvantaged_team,committing_team,ref_made_call
0,Q4,0:01:52,112.0,Foul: Shooting,Josh Smith,Kevin Love,CNC,Smith (HOU) does not make contact with Love (C...,http://official.nba.com/last-two-minute-report...,20150301CLEHOU,...,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,L2M-CLE-HOU-3-1-15.pdf,https://www.basketball-reference.com/boxscores...,CLE,HOU,
1,Q4,0:01:43,103.0,Foul: Shooting,J.R. Smith,James Harden,CC,Smith (CLE) makes contact with the body of Har...,http://official.nba.com/last-two-minute-report...,20150301CLEHOU,...,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,L2M-CLE-HOU-3-1-15.pdf,https://www.basketball-reference.com/boxscores...,HOU,CLE,
2,Q4,0:01:32,92.0,Foul: Shooting,Trevor Ariza,LeBron James,CC,Ariza (HOU) makes contact with the shoulder of...,http://official.nba.com/last-two-minute-report...,20150301CLEHOU,...,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,L2M-CLE-HOU-3-1-15.pdf,https://www.basketball-reference.com/boxscores...,CLE,HOU,
3,Q4,0:01:09,69.0,Foul: Loose Ball,Terrence Jones,Tristan Thompson,CC,Jones (HOU) makes contact with the arm of Thom...,http://official.nba.com/last-two-minute-report...,20150301CLEHOU,...,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,L2M-CLE-HOU-3-1-15.pdf,https://www.basketball-reference.com/boxscores...,CLE,HOU,
4,Q4,0:00:53,53.0,Foul: Shooting,Tristan Thompson,Josh Smith,CNC,Smith (HOU) loses the ball as he goes up for t...,http://official.nba.com/last-two-minute-report...,20150301CLEHOU,...,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,L2M-CLE-HOU-3-1-15.pdf,https://www.basketball-reference.com/boxscores...,HOU,CLE,


In [4]:
# drop unnecessary columns
#   video: web link to a video of call or non call, not relevant to analysis
#   original_pdf: pdf file name, not relevant to analysis
#   box_score_url: url, scores are already in separate columns
#   comment: text describing the violation, we will not be performing any NLP in this analysis
#   ref_made_call: names only exist in 0.8% of the data

# columns= already targets the column axis, so axis=1 is not needed
df = df.drop(columns=['video','original_pdf','box_score_url','comment','ref_made_call'])

df.head()
#df['committing_player'].unique().tolist()
#df['disadvantaged_player'].unique().tolist()

Unnamed: 0,period,time,seconds_left,call_type,committing_player,disadvantaged_player,review_decision,game_id,play_id,away,home,date,ref_1,ref_2,ref_3,score_away,score_home,disadvantaged_team,committing_team
0,Q4,0:01:52,112.0,Foul: Shooting,Josh Smith,Kevin Love,CNC,20150301CLEHOU,20150301CLEHOU-0,CLE,HOU,20150301,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,CLE,HOU
1,Q4,0:01:43,103.0,Foul: Shooting,J.R. Smith,James Harden,CC,20150301CLEHOU,20150301CLEHOU-1,CLE,HOU,20150301,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,HOU,CLE
2,Q4,0:01:32,92.0,Foul: Shooting,Trevor Ariza,LeBron James,CC,20150301CLEHOU,20150301CLEHOU-2,CLE,HOU,20150301,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,CLE,HOU
3,Q4,0:01:09,69.0,Foul: Loose Ball,Terrence Jones,Tristan Thompson,CC,20150301CLEHOU,20150301CLEHOU-3,CLE,HOU,20150301,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,CLE,HOU
4,Q4,0:00:53,53.0,Foul: Shooting,Tristan Thompson,Josh Smith,CNC,20150301CLEHOU,20150301CLEHOU-4,CLE,HOU,20150301,Tony Brown,Dan Crawford,Michael Smith,103.0,105.0,HOU,CLE


### NaN Check
The "disadvantaged_player" column has the most missing values in this dataframe. Not all of these entries are fouls, so it makes sense that "committing_team", "disadvantaged_team", "committing_player" or "disadvantaged_player" is empty in some instances. However, there should always be at least 2 referees per contest. Let's dig deeper...

In [5]:
# Missing values sort descending
df_percent = df.isna().sum()/len(df)*100 # Percent of total 'NaN' rows by column
df_percent.sort_values(ascending=False)

disadvantaged_player    16.281411
review_decision          6.416375
committing_team          5.927970
disadvantaged_team       5.927970
committing_player        5.387369
ref_3                    0.503318
ref_2                    0.205056
ref_1                    0.205056
score_home               0.134218
score_away               0.134218
call_type                0.003728
time                     0.000000
seconds_left             0.000000
away                     0.000000
game_id                  0.000000
play_id                  0.000000
home                     0.000000
date                     0.000000
period                   0.000000
dtype: float64

"ref_3" is missing more often (0.50%) than "ref_1" and "ref_2" (0.21% each), so there may have been games with only 2 referees.
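One way to check the two-referee idea is to count how many referee names are recorded for each game. The sketch below uses invented game ids and referee names; only the column names `game_id`, `ref_1`, `ref_2` and `ref_3` come from the dataframe above, and `n_refs` is a hypothetical helper column.

```python
import pandas as pd

# Toy rows; game ids and referee names are invented for illustration.
calls = pd.DataFrame({
    "game_id": ["G1", "G1", "G2", "G3"],
    "ref_1": ["Ref A", "Ref A", "Ref D", None],
    "ref_2": ["Ref B", "Ref B", "Ref E", None],
    "ref_3": ["Ref C", "Ref C", None, None],
})
ref_cols = ["ref_1", "ref_2", "ref_3"]

# Number of referee names recorded on each row.
calls["n_refs"] = calls[ref_cols].notna().sum(axis=1)

# One value per game, then a tally of games by referee count.
refs_per_game = calls.groupby("game_id")["n_refs"].max()
print(refs_per_game.value_counts().sort_index())
```

Games with a count of 2 would support the two-referee explanation, while games with a count of 0 point to missing referee data.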

In [6]:
#df['ref_1'] = np.nan

#df

In [7]:
# Create new dataframe with NaN rows removed
# Only remove NA values that are in the columns listed in the subset. If at least 1 NaN then remove
# Still need to handle ref_1, ref_2, ref_3 columns differently. Delete rows if all 3 ref columns are missing 
df_NoNaN = df.dropna(subset=['disadvantaged_player','review_decision','committing_player',
                            'score_home','score_away','call_type']) 

# comparing OLD (df) vs NEW (df_NoNaN) dataframes 
print("Old data frame length:", len(df), "\nNew data frame length:",  
       len(df_NoNaN), "\nNumber of rows with at least 1 NA value: ", 
       (len(df)-len(df_NoNaN)),"\nFraction of total rows lost to NA values:",
     (len(df)-len(df_NoNaN))/len(df))

Old data frame length: 26822 
New data frame length: 22083 
Number of rows with at least 1 NA value:  4739 
Fraction of total rows lost to NA values: 0.17668331966296325
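The comment in the cell above notes that the referee columns need different handling: a row should be deleted only if all 3 referee columns are missing. A minimal sketch of that rule on toy data, using `dropna` with `how="all"` (the play ids and referee names are invented):

```python
import pandas as pd

# Toy rows; play ids and referee names are invented for illustration.
calls = pd.DataFrame({
    "play_id": ["G1-0", "G2-0", "G3-0"],
    "ref_1": ["Ref A", "Ref D", None],
    "ref_2": ["Ref B", "Ref E", None],
    "ref_3": ["Ref C", None, None],
})

# how="all" drops a row only when every listed column is NaN,
# so the two-referee row (G2-0) is kept.
calls_with_refs = calls.dropna(subset=["ref_1", "ref_2", "ref_3"], how="all")
print(calls_with_refs["play_id"].tolist())
```

The default `how="any"` would also drop the two-referee row, which is why the referee columns were left out of the `subset` list in the cell above.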


### NBA_api Connection Attempts

In [8]:
# Get dataframe of all players and player id's

from nba_api.stats.static import players
# get_players returns a list of dictionaries, each representing a player.
nba_players = players.get_players()

# Convert list of dictionaries to dataframe
df_nba_players = pd.DataFrame(nba_players)

df_nba_players.head()

Unnamed: 0,first_name,full_name,id,is_active,last_name
0,Alaa,Alaa Abdelnaby,76001,False,Abdelnaby
1,Zaid,Zaid Abdul-Aziz,76002,False,Abdul-Aziz
2,Kareem,Kareem Abdul-Jabbar,76003,False,Abdul-Jabbar
3,Mahmoud,Mahmoud Abdul-Rauf,51,False,Abdul-Rauf
4,Tariq,Tariq Abdul-Wahad,1505,False,Abdul-Wahad


In [9]:
from nba_api.stats.static import playerprofile
nbaPlayers = players.get_players()

ImportError: cannot import name 'playerprofile'

In [10]:
# Create list of player id's needed to pass through endpoint 
player_id_list = df_nba_players['id'].unique().tolist()

In [11]:
# Get game specific stats with player_id and game_id

# playergamelog requires a player id in the API request
from nba_api.stats.endpoints import playergamelog 

nba_gamelogs = playergamelog.PlayerGameLog(player_id='203076', season='2018-19',
                                             season_type_all_star='Regular Season')

# Returns a list of dataframes
nba_gamelogs.get_data_frames()

[   SEASON_ID  Player_ID     Game_ID     GAME_DATE      MATCHUP WL  MIN  FGM  \
 0      22018     203076  0021801099  MAR 24, 2019  NOP vs. HOU  L   21    6   
 1      22018     203076  0021801056  MAR 18, 2019    NOP @ DAL  W   21    8   
 2      22018     203076  0021801036  MAR 16, 2019  NOP vs. PHX  L   22    6   
 3      22018     203076  0021801010  MAR 12, 2019  NOP vs. MIL  L   21    9   
 4      22018     203076  0021800995  MAR 10, 2019    NOP @ ATL  L   21    6   
 5      22018     203076  0021800971  MAR 06, 2019  NOP vs. UTA  L   21    6   
 6      22018     203076  0021800956  MAR 04, 2019    NOP @ UTA  W   22    7   
 7      22018     203076  0021800932  MAR 01, 2019    NOP @ PHX  W   21    7   
 8      22018     203076  0021800921  FEB 27, 2019    NOP @ LAL  L   21   10   
 9      22018     203076  0021800906  FEB 25, 2019  NOP vs. PHI  L   21    8   
 10     22018     203076  0021800874  FEB 22, 2019    NOP @ IND  L   20    6   
 11     22018     203076  0021800866  FE
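The `Game_ID` values in this output (for example `0021801099`) do not look like the `game_id` values in the L2M data (for example `20150301CLEHOU`), so the two tables cannot be joined on that column directly. The sketch below builds an L2M-style key from `GAME_DATE` and `MATCHUP`. It assumes the L2M key is the date followed by the away and home team codes, as the sample rows suggest, and that "vs." marks a home game and "@" an away game for the logged player's team. The function name `l2m_game_id` is hypothetical.

```python
import pandas as pd

# Two rows copied from the game log output above.
logs = pd.DataFrame({
    "GAME_DATE": ["MAR 24, 2019", "MAR 18, 2019"],
    "MATCHUP": ["NOP vs. HOU", "NOP @ DAL"],
})

def l2m_game_id(game_date: str, matchup: str) -> str:
    """Build a YYYYMMDD + away + home key from one game log row."""
    # Title-case the month ("MAR" -> "Mar") before parsing the date.
    parsed = pd.to_datetime(game_date.title(), format="%b %d, %Y")
    date_part = parsed.strftime("%Y%m%d")
    if " @ " in matchup:
        away, home = matchup.split(" @ ")      # "NOP @ DAL": NOP is away
    else:
        home, away = matchup.split(" vs. ")    # "NOP vs. HOU": NOP is home
    return f"{date_part}{away}{home}"

logs["game_id"] = [
    l2m_game_id(d, m) for d, m in zip(logs["GAME_DATE"], logs["MATCHUP"])
]
print(logs["game_id"].tolist())
```

Team abbreviations may not match between the two sources for every franchise, so a key built this way should be checked against the L2M `game_id` values before merging.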

### Duplicates 

In [12]:
# Which player_id should we use for duplicate records?
# Check to see which players and how many times duplicated.

print("Duplicate Summary:",
      "\nNumber of duplicate id:", df_nba_players.duplicated(['id']).sum(),
      "\nNumber of duplicate full_name:", df_nba_players.duplicated(['full_name']).sum(),
      "\nNumber of duplicate id AND full_name:", df_nba_players.duplicated(['id','full_name']).sum(),
      "\nNumber of duplicate rows:", df_nba_players.duplicated().sum())


Duplicate Summary: 
Number of duplicate id: 0 
Number of duplicate full_name: 36 
Number of duplicate id AND full_name: 0 
Number of duplicate rows: 0


In [13]:
# Create dataframe keeping only the last record for each full_name

# df_nba_players_deftable_1 = df_nba_players.drop_duplicates(['full_name'], keep='last', inplace=False)

# df_nba_players_deftable_1.info()

In [14]:
df_nba_players['full_name'].unique()

array(['Alaa Abdelnaby', 'Zaid Abdul-Aziz', 'Kareem Abdul-Jabbar', ...,
       'Bill Zopf', 'Ivica Zubac', 'Matt Zunic'], dtype=object)

In [15]:
# Need to join id to original dataframe. 
# Need to use some combination of player_id and game_id to merge individual player stats for that game


#df_nba_players['full_name'].value_counts()
#df_nba_players.groupby('full_name').size()

# Show every player record whose full_name appears more than once
test = df_nba_players.groupby('full_name')

print(test.filter(lambda x: len(x) > 1))

     first_name         full_name       id  is_active  last_name
482         Dee         Dee Brown      244      False      Brown
483         Dee         Dee Brown   200793      False      Brown
887     Charlie     Charlie Davis    76518      False      Davis
888     Charlie     Charlie Davis    76519      False      Davis
902        Mark        Mark Davis      707      False      Davis
903        Mark        Mark Davis    76528      False      Davis
1041        Bob         Bob Duffy    76609      False      Duffy
1042        Bob         Bob Duffy    76610      False      Duffy
1051       Mike     Mike Dunleavy     2399      False   Dunleavy
1052       Mike     Mike Dunleavy    76616      False   Dunleavy
1159    Patrick     Patrick Ewing      121      False      Ewing
1160    Patrick     Patrick Ewing   201607      False      Ewing
1522       Matt       Matt Guokas    76908      False     Guokas
1523       Matt       Matt Guokas    76909      False     Guokas
1671     Cedric  Cedric H
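Not every duplicated name necessarily matters for this analysis; only those that also appear in the L2M data need a decision about which id to use. A sketch of that check follows. The player rows are taken from the outputs above, while `l2m_names` is a made-up list standing in for the `committing_player` column.

```python
import pandas as pd

# Player rows taken from the outputs above.
players_df = pd.DataFrame({
    "full_name": ["Mike Dunleavy", "Mike Dunleavy", "Patrick Ewing",
                  "Patrick Ewing", "Mahmoud Abdul-Rauf"],
    "id": [2399, 76616, 121, 201607, 51],
})

# Made-up stand-in for the names in the L2M committing_player column.
l2m_names = pd.Series(["Mike Dunleavy", "Mahmoud Abdul-Rauf"],
                      name="committing_player")

# Names that map to more than one player id.
ids_per_name = players_df.groupby("full_name")["id"].nunique()
ambiguous = ids_per_name[ids_per_name > 1].index

# Only ambiguous names that actually occur in the L2M data need review.
needs_review = sorted(set(l2m_names) & set(ambiguous))
print(needs_review)
```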

In [16]:
#df['committing_player_v2'] = df['committing_player'].astype(str)

In [17]:
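# Include 'full_name' in the right-hand frame so it can serve as the join key;
# selecting only ['id'] raised KeyError: 'full_name'.
# Note: duplicated full_name values on the right will duplicate matching rows in df.
df = pd.merge(df, df_nba_players[['id', 'full_name']], left_on='committing_player', right_on='full_name', how='left')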
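A name-based join like the one attempted above can be made safer with two `merge` options: `validate` guards against duplicated names in the lookup table, and `indicator` shows which rows found no match. The sketch below uses toy frames. The player names come from the L2M sample rows, the ids are placeholders rather than real player ids, and the lookup is assumed to be already deduplicated on `full_name`.

```python
import pandas as pd

# Names from the L2M sample rows above.
calls = pd.DataFrame({
    "committing_player": ["Josh Smith", "J.R. Smith", "Trevor Ariza"],
})

# Toy lookup table, assumed to be deduplicated on full_name.
lookup = pd.DataFrame({
    "full_name": ["Josh Smith", "Trevor Ariza"],
    "id": [1, 2],  # placeholder ids, not real NBA player ids
})

merged = calls.merge(
    lookup[["id", "full_name"]],
    left_on="committing_player",
    right_on="full_name",
    how="left",
    validate="many_to_one",  # raises MergeError if a full_name repeats in lookup
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Names in the L2M data that found no player id.
unmatched = merged.loc[merged["_merge"] == "left_only", "committing_player"].tolist()
print(unmatched)
```

Unmatched names are often spelling or punctuation differences between sources, so listing them first helps decide whether a cleanup step is needed before the real merge.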