# COGS 108 - Data Checkpoint

# Names

- Jamie Wei
- Alexis Garduno
- James Daza
- Aleksander Archipov

<a id='research_question'></a>
# Research Question

- Does the number of times a player is traded predict that player's performance (e.g. average points scored per game, average minutes played per game, etc.) in NBA games? Also, does the turnover rate of an NBA team (e.g. the number of players traded by a team per season) affect the likelihood that the team will reach the NBA Finals, as evidenced by the last 20 years of NBA games?

# Dataset(s)

- Dataset Name: nba_api (replace with the dataset we generated from nba_api) 
- Link to the dataset: https://github.com/swar/nba_api
- Number of observations: N/A
    - API client that provides access to the NBA's stats API endpoints

- Dataset Name: NBA Stats (replace with boxscore and team performance datasets)
- Link to the dataset: https://www.nba.com/stats/
- Number of observations: 12
    - Player stats from 10 randomly selected teams, covering 2016-2021


# Setup

In [1]:
import pandas as pd
import numpy as np
import time

from nba_api.stats.static import players
from nba_api.stats.endpoints import playercareerstats

### Gathering Teams

In [2]:
#Import the roster of teams from the NBA API
from nba_api.stats.static import teams

nba_teams = teams.get_teams()

We select 10 NBA teams at random.

In [3]:
#Obtain a full list of all abbreviations - will need abbreviations to identify team statistics
nba_teams_df=pd.DataFrame(nba_teams)
team_id=nba_teams_df['id'] #this is the unique team id
team_id_random=np.random.choice(team_id,10,replace=False) #identify ten team ids
nba_teams_rdf=nba_teams_df[nba_teams_df['id'].isin(team_id_random)] #df of 10 randomly selected teams
nba_teams_rdf

Unnamed: 0,id,full_name,abbreviation,nickname,city,state,year_founded
1,1610612738,Boston Celtics,BOS,Celtics,Boston,Massachusetts,1946
4,1610612741,Chicago Bulls,CHI,Bulls,Chicago,Illinois,1966
7,1610612744,Golden State Warriors,GSW,Warriors,Golden State,California,1946
12,1610612749,Milwaukee Bucks,MIL,Bucks,Milwaukee,Wisconsin,1968
13,1610612750,Minnesota Timberwolves,MIN,Timberwolves,Minnesota,Minnesota,1989
16,1610612753,Orlando Magic,ORL,Magic,Orlando,Florida,1989
18,1610612755,Philadelphia 76ers,PHI,76ers,Philadelphia,Pennsylvania,1949
19,1610612756,Phoenix Suns,PHX,Suns,Phoenix,Arizona,1968
21,1610612758,Sacramento Kings,SAC,Kings,Sacramento,California,1948
22,1610612759,San Antonio Spurs,SAS,Spurs,San Antonio,Texas,1976
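
Because `np.random.choice` is called without a seed, the ten teams change every time the notebook is rerun. A minimal sketch of a reproducible draw is shown here, assuming a seeded NumPy generator is acceptable for this project. The `teams_df` table, the `sample_team_ids` helper, and the seed value are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the teams table; the real one comes from teams.get_teams()
teams_df = pd.DataFrame({
    "id": [1610612737 + k for k in range(30)],
    "abbreviation": [f"T{k:02d}" for k in range(30)],
})

def sample_team_ids(team_ids, n=10, seed=108):
    """Draw n distinct team ids; the same seed returns the same draw."""
    rng = np.random.default_rng(seed)
    return rng.choice(np.asarray(team_ids), size=n, replace=False)

first_draw = sample_team_ids(teams_df["id"])
second_draw = sample_team_ids(teams_df["id"])  # identical to first_draw
sampled_teams = teams_df[teams_df["id"].isin(first_draw)]
```

Fixing the seed keeps the downstream row counts stable between runs, which makes the printed shapes easier to compare.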


### Requesting Games
Now we request the games for all ten of the teams we selected.

In [4]:
#Pull all games for all ten teams

#Documentation for this endpoint: 
#https://github.com/swar/nba_api/blob/master/docs/nba_api/stats/endpoints/leaguegamefinder.md
from nba_api.stats.endpoints import leaguegamefinder

# Query for games from the League Game Finder
gamefinder=pd.DataFrame()
for i in team_id_random:
    time.sleep(1) #delay to prevent being blocked from the API
    df = leaguegamefinder.LeagueGameFinder(team_id_nullable=[i]).get_data_frames()[0] #parameter of team ids given
    gamefinder = pd.concat([df,gamefinder])

In [5]:
#Game Statistics

#One row corresponds to one game and one team.
#There will be two rows per game, since there are two teams that played each other.
#Will need to exclude duplicate rows (XXXX will remove duplicate rows)
print(list(set(gamefinder.TEAM_ID))) #confirmed that identified 10 different teams
print(gamefinder.shape) #33,897 rows (one row per team per game), 28 columns

##Game Finder Dataset: This dataset will supply the outcome variable when we examine
##the association between the exposure and the outcome.
gamefinder

[1610612738, 1610612741, 1610612744, 1610612749, 1610612750, 1610612753, 1610612755, 1610612756, 1610612758, 1610612759]
(33897, 28)


Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22021,1610612744,GSW,Golden State Warriors,0022100883,2022-02-16,GSW vs. DEN,L,240,116,...,0.667,8.0,29.0,37.0,26,13.0,5,10,21,-1.0
1,22021,1610612744,GSW,Golden State Warriors,0022100866,2022-02-14,GSW @ LAC,L,240,104,...,0.769,7.0,31.0,38.0,23,7.0,3,14,18,-15.0
2,22021,1610612744,GSW,Golden State Warriors,0022100854,2022-02-12,GSW vs. LAL,W,240,117,...,0.735,10.0,40.0,50.0,24,4.0,3,9,24,2.0
3,22021,1610612744,GSW,Golden State Warriors,0022100837,2022-02-10,GSW vs. NYK,L,240,114,...,0.882,5.0,33.0,38.0,27,4.0,5,7,24,-2.0
4,22021,1610612744,GSW,Golden State Warriors,0022100828,2022-02-09,GSW @ UTA,L,241,85,...,0.765,3.0,32.0,35.0,17,8.0,3,11,19,-26.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3372,21983,1610612758,KCK,Kansas City Kings,0028300055,1983-11-05,KCK vs. HOU,W,240,123,...,0.868,19.0,29.0,48.0,31,11.0,4,20,23,
3373,21983,1610612758,KCK,Kansas City Kings,0028300037,1983-11-03,KCK vs. DEN,L,240,128,...,0.846,7.0,40.0,47.0,34,4.0,13,28,29,
3374,21983,1610612758,KCK,Kansas City Kings,0028300029,1983-11-01,KCK vs. GOS,W,240,116,...,0.811,17.0,39.0,56.0,31,4.0,6,29,32,
3375,21983,1610612758,KCK,Kansas City Kings,0028300021,1983-10-30,KCK vs. SEA,L,240,116,...,0.865,13.0,33.0,46.0,32,7.0,4,19,33,
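
The comments above note that a game can appear twice, once for each team. A small sketch of the two views of this table follows, using a toy frame with made-up rows and only a few of the `gamefinder` columns. Which view is right depends on the analysis, so treat this as one possible approach.

```python
import pandas as pd

# Toy rows (made up): game 0022100001 was played between two sampled teams
games = pd.DataFrame({
    "GAME_ID": ["0022100001", "0022100001", "0022100002", "0022100003"],
    "TEAM_ID": [1610612744, 1610612738, 1610612744, 1610612759],
    "WL": ["W", "L", "L", "W"],
    "PTS": [110, 102, 98, 120],
})

# Team-level view: count how many sampled teams appear in each game
rows_per_game = games.groupby("GAME_ID")["TEAM_ID"].nunique()

# Game-level view: keep a single row per GAME_ID (the first one encountered)
unique_games = games.drop_duplicates(subset="GAME_ID", keep="first")
```

For team performance the team-level rows are probably the ones to keep, since each row holds one team's result. The game-level view only matters when counting games.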


Now we request all the player stats from seasons ranging from 2016-2021.

In [6]:
#get game ids for the last five years

#for now, let's focus on the last five seasons for ease
from nba_api.stats.endpoints import playergamelogs

#generate a parameter dataframe to define timeframe
#Is this timeframe correct? What is the timeframe that the season normally runs from?
season_parameter_df=pd.DataFrame({'Season':['2016-17','2017-18','2018-19','2019-20','2020-21'], 
                    'Date_From':['9/01/2016','9/01/2017','9/01/2018','9/01/2019','9/01/2020'],
                    'Date_To':['8/31/2017','8/31/2018','8/31/2019','8/31/2020','8/31/2021']})

#the request returns no game ids unless the season and date parameters are set
logsdf=pd.DataFrame()
#note: range(1, 5) covers rows 1-4 of season_parameter_df, so row 0 ('2016-17') is not requested
for i in range(1, 5):
    time.sleep(1) #delay to prevent being blocked from the API
    season=season_parameter_df.iloc[i, 0]
    date_from=season_parameter_df.iloc[i, 1]
    date_to=season_parameter_df.iloc[i, 2]

    logs = playergamelogs.PlayerGameLogs(
        season_nullable = season,
        date_from_nullable = date_from,
        date_to_nullable = date_to
    ).player_game_logs.get_data_frame()
    logsdf = pd.concat([logs,logsdf])
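
If all five seasons in the parameter table are wanted, one option is to iterate over the rows directly rather than over a hard-coded index range. This is a sketch under stated assumptions: `fetch_logs` is a hypothetical stand-in for the `PlayerGameLogs` request so that the example runs offline, and `collect_logs` is an illustrative helper, not part of nba_api.

```python
import time
import pandas as pd

season_params = pd.DataFrame({
    "Season": ["2016-17", "2017-18", "2018-19", "2019-20", "2020-21"],
    "Date_From": ["9/01/2016", "9/01/2017", "9/01/2018", "9/01/2019", "9/01/2020"],
    "Date_To": ["8/31/2017", "8/31/2018", "8/31/2019", "8/31/2020", "8/31/2021"],
})

def fetch_logs(season, date_from, date_to):
    """Hypothetical stand-in for the API request; returns one dummy row."""
    return pd.DataFrame({"SEASON_YEAR": [season],
                         "DATE_FROM": [date_from],
                         "DATE_TO": [date_to]})

def collect_logs(params, fetch, pause=0.0):
    """Call fetch once per row of params and stack the results."""
    frames = []
    for row in params.itertuples(index=False):
        time.sleep(pause)  # delay between requests; use about 1 second against the real API
        frames.append(fetch(row.Season, row.Date_From, row.Date_To))
    return pd.concat(frames, ignore_index=True)

all_logs = collect_logs(season_params, fetch_logs)
```

Collecting the frames in a list and concatenating once avoids rebuilding the growing DataFrame on every iteration.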

In [7]:
#unique set of game ids
game_ids = list(set(logsdf['GAME_ID'])) #not sure if we need game_ids
player_ids = pd.DataFrame(list(set(logsdf['PLAYER_ID'])))
print(pd.DataFrame(game_ids).shape) #4599 game ids
print(player_ids.shape) #875 unique players
print(logsdf.shape) #97,655 rows where each row is for each player in each game. 

(4599, 1)
(875, 1)
(97655, 66)


<b>4599 Games <br>
875 Unique Players <br>
97,655 rows of player stats (one row per player per game)</b>

Below are all the columns available for each player-game row in the seasons we selected. The logs have not yet been restricted to the randomly chosen teams.

In [8]:
#preview; we will likely not use all of these columns
logsdf.columns

Index(['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'NICKNAME', 'TEAM_ID',
       'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP',
       'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM',
       'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK',
       'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'NBA_FANTASY_PTS', 'DD2',
       'TD3', 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK',
       'FGM_RANK', 'FGA_RANK', 'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK',
       'FG3_PCT_RANK', 'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK',
       'DREB_RANK', 'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK',
       'BLKA_RANK', 'PF_RANK', 'PFD_RANK', 'PTS_RANK', 'PLUS_MINUS_RANK',
       'NBA_FANTASY_PTS_RANK', 'DD2_RANK', 'TD3_RANK', 'VIDEO_AVAILABLE_FLAG'],
      dtype='object')

In [9]:
#import boxscore, player statistics within each game
#https://en.wikipedia.org/wiki/Box_score_(baseball)#:~:text=A%20box%20score%20is%20a,the%20box%20score%20in%201858.
#from nba_api.stats.endpoints import BoxScoreAdvancedV2

# Takes a LONG time to run
#boxscfinder=pd.DataFrame()
#for i in game_ids:
#        time.sleep(1) #delay to prevent being blocked from the API
#        boxscore = BoxScoreAdvancedV2(game_id=i).player_stats.get_data_frame()
#        boxscfinder = pd.concat([boxscore,boxscfinder])

In [10]:
####NEXT STEPS:

####These tasks need to be completed and we can split them up.
#Task (Insert your name)

#TASK 1 (NAME): Get a final dataset for the Boxscore (verify that this runs)
#TASK 2 (NAME): Only keep the boxscore entries that match the teams identified
#in the team_id_random ARRAY; generate a file

#TASK 3 (Jamie): Run this notebook and restrict the dataset
#named "boxscfinder" to the parameters we need based on what you 
#identified. Note, Jamie, when you restrict the parameter, it would
#be helpful in the code to name the variables. 

#TASK 4 (NAME): Restrict the "gamefinder" dataframe to the "team performance" metrics

#TASK 5 (Alexis): Look at the final boxscore dataset and define the traded variable

#TASK 6 (Alek/other): Estimate the data distributions of the
#parameters that we are examining.
#Depends on Task 3 and 4 being complete. 
#Generate descriptive statistics for the parameters that we list.
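
For Task 4, one possible reading of "team performance" is a per-team, per-season summary of the `gamefinder` rows. The sketch below assumes win percentage and average points are the metrics of interest, which the notebook has not yet decided. The toy rows are made up and the output column names are illustrative.

```python
import pandas as pd

# Toy rows (made up) using a few gamefinder columns
games = pd.DataFrame({
    "SEASON_ID": ["22021", "22021", "22021", "22020"],
    "TEAM_ID": [1610612744, 1610612744, 1610612738, 1610612744],
    "WL": ["W", "L", "W", "W"],
    "PTS": [117, 104, 110, 120],
})

# One row per team per season: games played, share of wins, mean points
team_season = (
    games.assign(WIN=games["WL"].eq("W"))
         .groupby(["SEASON_ID", "TEAM_ID"], as_index=False)
         .agg(GAMES=("WIN", "count"), WIN_PCT=("WIN", "mean"), AVG_PTS=("PTS", "mean"))
)
```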


# Data Cleaning

(Alek and others) Describe. In class, the data cleaning examples included handling missing values, recoding variables (generating new variables), and plotting distributions.<br>
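
As a starting point for the missing-value step, the sketch below counts missing entries per column and drops incomplete rows. It uses three rows whose values mirror the `gamefinder` preview above, where `PLUS_MINUS` is empty for the 1983 games. Whether to drop or keep such rows is left open here.

```python
import numpy as np
import pandas as pd

# Three rows mirroring the gamefinder preview; PLUS_MINUS is missing for the 1983 games
sample = pd.DataFrame({
    "GAME_ID": ["0028300055", "0028300037", "0022100883"],
    "PTS": [123, 128, 116],
    "PLUS_MINUS": [np.nan, np.nan, -1.0],
})

# Number of missing entries in each column
missing_per_column = sample.isna().sum()

# Keep only the rows where PLUS_MINUS is present
complete_rows = sample.dropna(subset=["PLUS_MINUS"])
```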

### Cleaning Player Logs

In [11]:
# Copy the logs DataFrame and keep only the rows for the TEAM_IDs that were selected
logsRandTeamDf = logsdf.copy()
logsRandTeamDf = logsRandTeamDf[logsRandTeamDf['TEAM_ID'].isin(team_id_random)]

# Restrict to the player stats we want
logsRandTeamDf = logsRandTeamDf[['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_NAME', 'MIN', 'FG_PCT', 'FT_PCT', 'PTS', 'AST', 'REB', 'STL', 'BLK', 'PLUS_MINUS']]
logsRandTeamDf

Unnamed: 0,SEASON_YEAR,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,MIN,FG_PCT,FT_PCT,PTS,AST,REB,STL,BLK,PLUS_MINUS
1,2020-21,1629003,Shake Milton,1610612755,Philadelphia 76ers,25.600000,0.462,0.0,15,9,4,1,2,7
2,2020-21,1630188,Jalen Smith,1610612756,Phoenix Suns,40.966667,0.455,0.0,11,2,10,0,1,5
5,2020-21,1629020,Jarred Vanderbilt,1610612750,Minnesota Timberwolves,23.750000,0.500,0.5,7,1,12,3,0,9
7,2020-21,1626192,Pat Connaughton,1610612749,Milwaukee Bucks,25.266667,0.500,1.0,15,0,1,2,0,-11
8,2020-21,203118,Mike Scott,1610612755,Philadelphia 76ers,25.166667,0.250,0.0,5,4,2,0,1,-4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26094,2017-18,203499,Shane Larkin,1610612738,Boston Celtics,4.816667,0.000,0.0,0,0,1,0,0,2
26096,2017-18,1627759,Jaylen Brown,1610612738,Boston Celtics,39.606667,0.478,0.5,25,0,6,2,0,-5
26100,2017-18,1627775,Patrick McCaw,1610612744,Golden State Warriors,18.656667,0.667,0.0,4,1,3,1,1,-9
26102,2017-18,202330,Gordon Hayward,1610612738,Boston Celtics,5.250000,0.500,0.0,2,0,1,0,0,3
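
The research question is framed in per-game averages, while the table above has one row per player per game. A minimal sketch of collapsing the logs to one row per player per season follows, using a toy frame with made-up values. The output column names (`GAMES`, `AVG_MIN`, `AVG_PTS`) are illustrative choices.

```python
import pandas as pd

# Toy player-game rows (made up) with a few of the restricted columns
logs = pd.DataFrame({
    "SEASON_YEAR": ["2020-21", "2020-21", "2020-21", "2017-18"],
    "PLAYER_ID": [1, 1, 2, 1],
    "MIN": [20.0, 30.0, 10.0, 40.0],
    "PTS": [10, 20, 4, 30],
})

# One row per player per season with games played and per-game averages
player_season_avg = (
    logs.groupby(["SEASON_YEAR", "PLAYER_ID"], as_index=False)
        .agg(GAMES=("PTS", "count"), AVG_MIN=("MIN", "mean"), AVG_PTS=("PTS", "mean"))
)
```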


### Cleaning boxstats

In [12]:
# NEED TO RUN BOXSTATS REQUEST BLOCK FIRST

# Restrict boxscfinder to the random teams we selected
# boxscfinder = boxscfinder[boxscfinder['TEAM_ID'].isin(team_id_random)]

# Restrict to the player stats we want
# boxscfinder = boxscfinder[['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'MIN', 'FG_PCT', 'FT_PCT', 'PTS', 'AST', 'REB', 'STL', 'BLK', 'PLUS_MINUS']]
# boxscfinder

### Saving Data <br>
Now we save the data we requested from the NBA API to CSV files.

In [13]:
#Saving the Data to a csv file
logsRandTeamDf.to_csv("logPlayerStats.csv")

# NEED TO RUN BOXSTATS REQUEST BLOCK FIRST
# boxscfinder.to_csv("boxscfinder.csv")
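
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference with an in-memory buffer instead of a file. The two-row frame is a toy example.

```python
import io
import pandas as pd

# Toy frame (made up) with a non-default index
df = pd.DataFrame({"PLAYER_ID": [1, 2], "PTS": [18, 8]}, index=[78, 32277])

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)
```

Passing `index=False` is one option. Another is to keep the index in the file and read it back with `index_col=0`.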

### Manipulating/Plotting CSV File Data

In [14]:
import seaborn as sns

In [15]:
#Still to do: plot the distributions of the data using either seaborn or matplotlib

dataDf = pd.read_csv("logPlayerStats.csv")
dataDf['TIMES_TRADED'] = np.nan
pid_lst = list(player_ids[0])

#TIMES_TRADED is a proxy: the number of distinct sampled teams a player appeared for, minus one
for pid in pid_lst:
    tempDf = dataDf[dataDf['PLAYER_ID'] == pid]
    traded = len(pd.unique(tempDf['TEAM_NAME']))-1
    dataDf.loc[dataDf['PLAYER_ID'] == pid, 'TIMES_TRADED'] = int(traded)

dataDf[dataDf['TIMES_TRADED'] >= 3]

Unnamed: 0.1,Unnamed: 0,SEASON_YEAR,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,MIN,FG_PCT,FT_PCT,PTS,AST,REB,STL,BLK,PLUS_MINUS,TIMES_TRADED
78,253,2020-21,203953,Jabari Parker,1610612738,Boston Celtics,26.900000,0.600,0.833,18,2,7,0,0,8,3.0
110,326,2020-21,203953,Jabari Parker,1610612738,Boston Celtics,17.483333,0.500,0.667,9,2,3,0,0,-10,3.0
624,1775,2020-21,203953,Jabari Parker,1610612738,Boston Celtics,9.583333,0.500,0.000,4,0,2,0,1,-10,3.0
687,2016,2020-21,203953,Jabari Parker,1610612738,Boston Celtics,8.150000,0.500,0.000,2,0,5,0,0,6,3.0
1168,3479,2020-21,203953,Jabari Parker,1610612738,Boston Celtics,4.700000,0.000,0.000,0,0,1,0,0,2,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32277,24640,2017-18,202328,Greg Monroe,1610612749,Milwaukee Bucks,17.208333,0.500,0.000,8,1,6,0,0,-1,3.0
32463,25213,2017-18,202328,Greg Monroe,1610612749,Milwaukee Bucks,20.500000,0.444,0.000,8,3,7,0,0,5,3.0
32506,25369,2017-18,202328,Greg Monroe,1610612749,Milwaukee Bucks,6.733333,0.000,0.000,0,0,0,0,0,-3,3.0
32659,25757,2017-18,202328,Greg Monroe,1610612749,Milwaukee Bucks,18.133333,0.500,0.000,8,0,6,0,0,-15,3.0
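
The per-player loop above can also be written as a single grouped operation. The following is a sketch on a toy frame with made-up player ids. It computes the same proxy (distinct teams per player, minus one), so it inherits the same limitation: team changes through free agency count as trades, and teams outside the sample are not seen.

```python
import pandas as pd

# Toy player-game rows (made up): player 1 appears for three teams, player 2 for two, player 3 for one
logs = pd.DataFrame({
    "PLAYER_ID": [1, 1, 1, 2, 2, 3],
    "TEAM_NAME": ["Chicago Bulls", "Sacramento Kings", "Boston Celtics",
                  "Milwaukee Bucks", "Phoenix Suns", "Boston Celtics"],
})

# Distinct teams per player, broadcast back to every row of that player, minus one
logs["TIMES_TRADED"] = logs.groupby("PLAYER_ID")["TEAM_NAME"].transform("nunique") - 1
```

Unlike the loop, this version produces an integer column rather than a float one.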
