# COGS 108 - Data Checkpoint

# Names

- Jamie Wei
- Alexis Garduno
- James Daza
- Aleksander Archipov

<a id='research_question'></a>
# Research Question

- Does the number of times a player is traded predict that player's performance in an NBA game (e.g., average points scored per game, average minutes played per game)? Also, does an NBA team's roster turnover rate (e.g., the number of players traded within a team per season) affect the likelihood that the team will reach the NBA Finals, based on the last 20 years of NBA games?

# Dataset(s)

- Dataset Name: nba_api (replace with the dataset we generated from nba_api) 
- Link to the dataset: https://github.com/swar/nba_api
- Number of observations: N/A
    - API client that provides access to the NBA.com stats API endpoints

- Dataset Name: NBA Stats (replace with boxscore and team performance datasets)
- Link to the dataset: https://www.nba.com/stats/
- Number of observations: 33,115 team-game rows in `gamefinder` and 97,655 player-game rows in `logsdf` in the run shown below; update once the final box score dataset is generated.
    - These counts refer to the datasets we pulled from the NBA API.


1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.
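
One possible way to combine the team-level and player-level tables is a left join on the keys they share. The sketch below assumes both tables carry `GAME_ID` and `TEAM_ID` columns (as the `gamefinder` and `logsdf` outputs in this notebook suggest) and that `GAME_ID` has the same type in both; the toy values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the player-level and team-level tables (values are invented).
player_logs = pd.DataFrame({
    "GAME_ID": ["001", "001", "002"],
    "TEAM_ID": [10, 20, 10],
    "PLAYER_ID": [1, 2, 1],
    "PTS": [12, 30, 8],
})
team_games = pd.DataFrame({
    "GAME_ID": ["001", "001", "002"],
    "TEAM_ID": [10, 20, 10],
    "WL": ["L", "W", "W"],
})

# Left-join the team outcome onto each player row.
# validate="many_to_one" raises an error if a team-game key is duplicated on the right.
combined = player_logs.merge(
    team_games, on=["GAME_ID", "TEAM_ID"], how="left", validate="many_to_one"
)
print(combined)
```

Using `how="left"` keeps every player row even when no matching team row exists, which makes unmatched games easy to spot as missing values in `WL`.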

# Setup

In [1]:
import pandas as pd
import numpy as np
import time

In [None]:
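#NBA players
from nba_api.stats.static import players
from nba_api.stats.endpoints import playercareerstats

# get_players() returns a list of dictionaries, one per player
nba_players = players.get_players()

# Anthony Davis (TODO: confirm this id against nba_players)
career = playercareerstats.PlayerCareerStats(player_id='1630188')

# Convert the list of dictionaries to a DataFrame so we can preview it and grab the ids
nba_players_df = pd.DataFrame(nba_players)
nba_players_df.head()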

In [None]:
#Import the roster of teams from the NBA API
from nba_api.stats.static import teams

nba_teams = teams.get_teams()

In [4]:
#Obtain a full list of all abbreviations - will need abbreviations to identify team statistics
nba_teams_df=pd.DataFrame(nba_teams)
team_id=nba_teams_df['id'] #this is the unique team id
team_id_random=np.random.choice(team_id,10,replace=False) #identify ten team ids
nba_teams_rdf=nba_teams_df[nba_teams_df['id'].isin(team_id_random)] #df of 10 randomly selected teams
nba_teams_rdf

| | id | full_name | abbreviation | nickname | city | state | year_founded |
|---|---|---|---|---|---|---|---|
| 2 | 1610612739 | Cleveland Cavaliers | CLE | Cavaliers | Cleveland | Ohio | 1970 |
| 5 | 1610612742 | Dallas Mavericks | DAL | Mavericks | Dallas | Texas | 1980 |
| 6 | 1610612743 | Denver Nuggets | DEN | Nuggets | Denver | Colorado | 1976 |
| 11 | 1610612748 | Miami Heat | MIA | Heat | Miami | Florida | 1988 |
| 16 | 1610612753 | Orlando Magic | ORL | Magic | Orlando | Florida | 1989 |
| 17 | 1610612754 | Indiana Pacers | IND | Pacers | Indiana | Indiana | 1976 |
| 19 | 1610612756 | Phoenix Suns | PHX | Suns | Phoenix | Arizona | 1968 |
| 22 | 1610612759 | San Antonio Spurs | SAS | Spurs | San Antonio | Texas | 1976 |
| 26 | 1610612763 | Memphis Grizzlies | MEM | Grizzlies | Memphis | Tennessee | 1995 |
| 28 | 1610612765 | Detroit Pistons | DET | Pistons | Detroit | Michigan | 1948 |


In [5]:
#Pull all games for all ten teams

#Documentation for this endpoint: 
#https://github.com/swar/nba_api/blob/master/docs/nba_api/stats/endpoints/leaguegamefinder.md
from nba_api.stats.endpoints import leaguegamefinder

# Query for games from the League Game Finder
gamefinder=pd.DataFrame()
for i in team_id_random:
    time.sleep(1) #delay to prevent being blocked from the API
    df = leaguegamefinder.LeagueGameFinder(team_id_nullable=[i]).get_data_frames()[0] #parameter of team ids given
    gamefinder = pd.concat([df,gamefinder])

In [6]:
#Game Statistics

#One row corresponds to one game and one team.
#There will be two rows per game, since there are two teams that played each other.
#Will need to exclude duplicate rows (XXXX will remove duplicate rows)
print(list(set(gamefinder.TEAM_ID))) #confirms that 10 different teams were identified
print(gamefinder.shape) #33,115 rows (one row per team per game) in this run

##Game Finder Dataset: This dataset will provide the outcome variable when we
##examine the association between the exposure and the outcome.
gamefinder.head()

[1610612739, 1610612742, 1610612743, 1610612748, 1610612753, 1610612754, 1610612756, 1610612759, 1610612763, 1610612765]
(33115, 28)


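| | SEASON_ID | TEAM_ID | TEAM_ABBREVIATION | TEAM_NAME | GAME_ID | GAME_DATE | MATCHUP | WL | MIN | PTS | ... | FT_PCT | OREB | DREB | REB | AST | STL | BLK | TOV | PF | PLUS_MINUS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22021 | 1610612748 | MIA | Miami Heat | 22100811 | 2022-02-07 | MIA @ WAS | W | 240 | 121 | ... | 0.81 | 5 | 32 | 37 | 29 | 9 | 1 | 18 | 23 | 21.0 |
| 1 | 22021 | 1610612748 | MIA | Miami Heat | 22100797 | 2022-02-05 | MIA @ CHA | W | 239 | 104 | ... | 0.944 | 11 | 38 | 49 | 27 | 14 | 6 | 12 | 17 | 18.0 |
| 2 | 22021 | 1610612748 | MIA | Miami Heat | 22100521 | 2022-02-03 | MIA @ SAS | W | 240 | 112 | ... | 0.889 | 7 | 35 | 42 | 27 | 13 | 3 | 13 | 27 | 17.0 |
| 3 | 22021 | 1610612748 | MIA | Miami Heat | 22100784 | 2022-02-01 | MIA @ TOR | L | 240 | 106 | ... | 0.88 | 10 | 28 | 38 | 26 | 8 | 1 | 16 | 25 | -4.0 |
| 4 | 22021 | 1610612748 | MIA | Miami Heat | 22100762 | 2022-01-31 | MIA @ BOS | L | 240 | 92 | ... | 0.7 | 10 | 28 | 38 | 25 | 6 | 2 | 18 | 22 | -30.0 |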
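
Because only ten teams were sampled, a game appears twice in `gamefinder` only when both opponents are among the sampled teams. The sketch below is one possible way to check how many rows each game has and to keep a single row per game. It uses toy data and a hypothetical helper column `ROWS_PER_GAME`; whether to deduplicate at all depends on whether the analysis is per team or per game.

```python
import pandas as pd

# Toy data: game "A" involves two sampled teams, game "B" involves only one.
games = pd.DataFrame({
    "GAME_ID": ["A", "A", "B"],
    "TEAM_ID": [10, 20, 10],
    "PTS": [101, 99, 110],
})

# Count how many sampled-team rows each game has (1 or 2).
games["ROWS_PER_GAME"] = games.groupby("GAME_ID")["TEAM_ID"].transform("count")

# Keep one row per game (the first occurrence); this drops the opponent's row.
one_per_game = games.drop_duplicates(subset="GAME_ID", keep="first")
print(one_per_game)
```

For a team-level outcome such as wins per season, keeping both rows is probably the better choice, since each row describes a different team's result.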


In [7]:
#get game ids for the last five years

#for now, let's focus on the last five seasons for ease
from nba_api.stats.endpoints import playergamelogs

#generate a parameter dataframe to define timeframe
#Is this timeframe correct? What is the timeframe that the season normally runs from?
season_parameter_df=pd.DataFrame({'Season':['2016-17','2017-18','2018-19','2019-20','2020-21'], 
                    'Date_From':['9/01/2016','9/01/2017','9/01/2018','9/01/2019','9/01/2020'],
                    'Date_To':['8/31/2017','8/31/2018','8/31/2019','8/31/2020','8/31/2021']})

#no game ids are returned unless season_nullable and the date_from/date_to parameters are set
#NOTE: range(1, 5) starts at row 1, so 2016-17 (row 0) is skipped and only four seasons are pulled
logsdf=pd.DataFrame()
for i in range(1, 5):
    season=season_parameter_df.iloc[i]['Season']
    date_from=season_parameter_df.iloc[i]['Date_From']
    date_to=season_parameter_df.iloc[i]['Date_To']
    
    time.sleep(1) #delay to prevent being blocked from the API
    logs = playergamelogs.PlayerGameLogs(
        season_nullable = season,
        date_from_nullable = date_from,
        date_to_nullable = date_to
    ).player_game_logs.get_data_frame()
    logsdf = pd.concat([logs,logsdf])
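
The season labels above are typed by hand, which becomes error-prone if the analysis is extended toward the 20 seasons mentioned in the research question. A minimal sketch of generating them instead is shown below; `season_label` and `season_labels` are hypothetical helper names, and the `'2016-17'` label format is taken from the parameter dataframe above.

```python
from typing import List


def season_label(start_year: int) -> str:
    """Return a season label such as '2016-17' for the year the season starts."""
    return f"{start_year}-{(start_year + 1) % 100:02d}"


def season_labels(first_start_year: int, n_seasons: int) -> List[str]:
    """Return consecutive season labels beginning with first_start_year."""
    return [season_label(first_start_year + k) for k in range(n_seasons)]


labels = season_labels(2016, 5)
print(labels)  # ['2016-17', '2017-18', '2018-19', '2019-20', '2020-21']
```

The modulo keeps the label two digits wide across the century boundary, so a start year of 1999 yields `'1999-00'`.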

In [8]:
#unique set of game ids
game_ids = list(set(logsdf['GAME_ID'])) #not sure if we need game_ids
player_ids = pd.DataFrame(list(set(logsdf['PLAYER_ID'])))
print(pd.DataFrame(game_ids).shape) #4599 game ids
print(player_ids.shape) #875 unique players
print(logsdf.shape) #97,655 rows where each row is for each player in each game. 

(4599, 1)
(875, 1)
(97655, 66)


In [9]:
#preview the columns; we will likely not use all of them
logsdf.columns

Index(['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'NICKNAME', 'TEAM_ID',
       'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP',
       'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM',
       'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK',
       'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'NBA_FANTASY_PTS', 'DD2',
       'TD3', 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK',
       'FGM_RANK', 'FGA_RANK', 'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK',
       'FG3_PCT_RANK', 'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK',
       'DREB_RANK', 'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK',
       'BLKA_RANK', 'PF_RANK', 'PFD_RANK', 'PTS_RANK', 'PLUS_MINUS_RANK',
       'NBA_FANTASY_PTS_RANK', 'DD2_RANK', 'TD3_RANK', 'VIDEO_AVAILABLE_FLAG'],
      dtype='object')

In [None]:
#import boxscore, player statistics within each game
#https://en.wikipedia.org/wiki/Box_score_(baseball)#:~:text=A%20box%20score%20is%20a,the%20box%20score%20in%201858.
from nba_api.stats.endpoints import BoxScoreAdvancedV2

boxscfinder=pd.DataFrame()
for i in game_ids:
    time.sleep(1) #delay to prevent being blocked from the API
    boxscore = BoxScoreAdvancedV2(game_id=i).player_stats.get_data_frame()
    boxscfinder = pd.concat([boxscore,boxscfinder])
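
With several thousand game ids, a single timeout would end the loop above and lose the progress so far. One possible safeguard is a small retry wrapper with an increasing delay. In the sketch below, `call_with_retry` and `flaky_fetch` are hypothetical names, and `flaky_fetch` only simulates a request that fails twice; it is not part of `nba_api`.

```python
import time


def call_with_retry(fetch, *args, retries=3, base_delay=1.0):
    """Call fetch(*args); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fetch(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)


# Simulated request that fails twice before succeeding (stand-in for an API call).
calls = {"n": 0}


def flaky_fetch(game_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return {"GAME_ID": game_id}


result = call_with_retry(flaky_fetch, "0022100811", base_delay=0.01)
print(result, "after", calls["n"], "attempts")
```

In the notebook, the wrapped function would be whatever builds one game's box score dataframe. Collecting the per-game dataframes in a list and calling `pd.concat` once at the end would also avoid re-copying the growing table on every iteration.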

In [None]:
####NEXT STEPS:

####These tasks need to be completed and we can split them up.
#Task (Insert your name)

#TASK 1 (NAME): Get a final dataset for the Boxscore (verify that this runs)
#TASK 2 (NAME): Only keep the boxscore entries that match the teams identified
#in the team_id_random ARRAY; generate a file

#TASK 3 (Jamie): Run this notebook and restrict the dataset
#named "boxscfinder" to the parameters we need based on what you 
#identified. Note, Jamie, when you restrict the parameter, it would
#be helpful in the code to name the variables. 

#TASK 4 (NAME): Restrict the "gamefinder" dataframe to the "team performance" metrics

#TASK 5 (Alexis): Look at the final boxscore dataset and define the traded variable

#TASK 6 (Alek/other): Estimate the distributions of the
#parameters that we are examining.
#Depends on Tasks 3 and 4 being complete.
#Generate descriptive statistics for the parameters that we list.
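
For Task 5, one possible working definition of the traded variable is a change in `TEAM_ID` between a player's consecutive games within a season. The sketch below uses toy data with the column names from `logsdf`; `CHANGED_TEAM` and `N_TEAM_CHANGES` are hypothetical names. Note the assumption: a team change in the game logs can also come from a waiver or free-agent signing, so this is a proxy for trades rather than a confirmed trade record.

```python
import pandas as pd

# Toy player game logs: player 1 switches from team 10 to team 20 mid-season.
logs = pd.DataFrame({
    "SEASON_YEAR": ["2019-20"] * 5,
    "PLAYER_ID": [1, 1, 1, 2, 2],
    "GAME_DATE": ["2019-10-25", "2019-11-02", "2020-02-10",
                  "2019-10-25", "2019-11-02"],
    "TEAM_ID": [10, 10, 20, 30, 30],
})
logs["GAME_DATE"] = pd.to_datetime(logs["GAME_DATE"])
logs = logs.sort_values(["PLAYER_ID", "GAME_DATE"])

# Team the player appeared for in their previous game of the same season.
prev_team = logs.groupby(["PLAYER_ID", "SEASON_YEAR"])["TEAM_ID"].shift()

# Flag games where the team differs from the previous game (first game is never flagged).
logs["CHANGED_TEAM"] = prev_team.notna() & (logs["TEAM_ID"] != prev_team)

# Number of team changes per player per season.
team_changes = (
    logs.groupby(["PLAYER_ID", "SEASON_YEAR"])["CHANGED_TEAM"]
    .sum()
    .rename("N_TEAM_CHANGES")
    .reset_index()
)
print(team_changes)
```

Summing the same flag by the new `TEAM_ID` and season instead would give a rough per-team count of incoming players, which could serve as a starting point for the team turnover measure.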


# Data Cleaning

(Alek and others) Describe. In class, the data cleaning examples were handling missing values, recoding variables (generating new variables), and plotting distributions.
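
The first two cleaning steps, followed by descriptive statistics, might look like the sketch below. It uses toy data with column names borrowed from `logsdf`; the `WIN` variable and the decision to drop rows with missing `MIN` or `PTS` are assumptions for illustration, not choices the team has made.

```python
import numpy as np
import pandas as pd

# Toy player-game rows with some missing values (values are invented).
stats = pd.DataFrame({
    "PLAYER_ID": [1, 2, 3, 4],
    "MIN": [34.5, np.nan, 12.0, 0.0],
    "PTS": [22, 5, np.nan, 0],
    "WL": ["W", "L", "W", "L"],
})

# 1. Handle missing values: count them per column, then drop incomplete rows.
missing_counts = stats.isna().sum()
cleaned = stats.dropna(subset=["MIN", "PTS"]).copy()

# 2. Recode a variable: turn the W/L string into a 0/1 indicator.
cleaned["WIN"] = (cleaned["WL"] == "W").astype(int)

# 3. Descriptive statistics for the numeric performance columns.
summary = cleaned[["MIN", "PTS"]].describe()
print(missing_counts)
print(summary)
```

A histogram of each column, for example with `cleaned["PTS"].hist()`, would cover the distribution step once a plotting library is available.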

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION