# COGS 108 - Data Checkpoint

# Names

- Jamie Wei
- Alexis Garduno
- James Daza
- Aleksander Archipov

<a id='research_question'></a>
# Research Question

- Does the number of times a player is traded predict that player's performance (e.g., average points scored per game, average minutes played per game) in NBA games? Also, does the turnover rate of an NBA team (e.g., the number of players traded by the team per season) affect the likelihood that the team will reach the NBA Finals, based on the last 20 years of NBA games?

# Dataset(s)

- Dataset Name: nba_api (replace with the dataset we generated from nba_api) 
- Link to the dataset: https://github.com/swar/nba_api
- Number of observations: N/A
    - API client that provides access to the NBA's stats API endpoints

- Dataset Name: NBA Stats (replace with boxscore and team performance datasets)
- Link to the dataset: https://www.nba.com/stats/
- Number of observations: 12
    - Player stats for 10 randomly selected teams, 2016-2021


# Setup

In [2]:
import pandas as pd
import numpy as np
import time

from nba_api.stats.static import players
from nba_api.stats.endpoints import playercareerstats

### Gathering Teams

In [6]:
#Import the roster of teams from the NBA API
from nba_api.stats.static import teams

nba_teams = teams.get_teams()

We select 10 NBA teams at random.

In [7]:
#Obtain a full list of all abbreviations - will need abbreviations to identify team statistics
nba_teams_df=pd.DataFrame(nba_teams)
team_id=nba_teams_df['id'] #this is the unique team id
team_id_random=np.random.choice(team_id,10,replace=False) #identify ten team ids
nba_teams_rdf=nba_teams_df[nba_teams_df['id'].isin(team_id_random)] #df of 10 randomly selected teams
nba_teams_rdf

Unnamed: 0,id,full_name,abbreviation,nickname,city,state,year_founded
3,1610612740,New Orleans Pelicans,NOP,Pelicans,New Orleans,Louisiana,2002
6,1610612743,Denver Nuggets,DEN,Nuggets,Denver,Colorado,1976
7,1610612744,Golden State Warriors,GSW,Warriors,Golden State,California,1946
10,1610612747,Los Angeles Lakers,LAL,Lakers,Los Angeles,California,1948
11,1610612748,Miami Heat,MIA,Heat,Miami,Florida,1988
14,1610612751,Brooklyn Nets,BKN,Nets,Brooklyn,New York,1976
24,1610612761,Toronto Raptors,TOR,Raptors,Toronto,Ontario,1995
27,1610612764,Washington Wizards,WAS,Wizards,Washington,District of Columbia,1961
28,1610612765,Detroit Pistons,DET,Pistons,Detroit,Michigan,1948
29,1610612766,Charlotte Hornets,CHA,Hornets,Charlotte,North Carolina,1988
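
Because `np.random.choice` is called without a seed, re-running the notebook will generally select a different set of teams than the ten shown above. The following is a minimal sketch of a reproducible draw, assuming a seeded NumPy generator is acceptable for this project. The toy DataFrame, the helper name `pick_teams`, and the seed value `108` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for nba_teams_df: 30 made-up rows, not real API output
toy_teams_df = pd.DataFrame({
    "id": [1000 + k for k in range(30)],
    "abbreviation": [f"T{k:02d}" for k in range(30)],
})

def pick_teams(teams_df, n_teams=10, seed=108):
    """Hypothetical helper: draw n_teams rows without replacement, reproducibly."""
    rng = np.random.default_rng(seed)  # seeded generator, so repeated calls give the same draw
    picked_ids = rng.choice(teams_df["id"].to_numpy(), size=n_teams, replace=False)
    return teams_df[teams_df["id"].isin(picked_ids)]

first_draw = pick_teams(toy_teams_df)
second_draw = pick_teams(toy_teams_df)  # same seed, same teams
```

With a fixed seed, the downstream row counts stay comparable between runs, which makes the numbers quoted in comments easier to keep in sync with the outputs.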


### Requesting Games
Now we request the games for all ten of the teams we selected.

In [8]:
#Pull all games for all ten teams

#Documentation for this endpoint: 
#https://github.com/swar/nba_api/blob/master/docs/nba_api/stats/endpoints/leaguegamefinder.md
from nba_api.stats.endpoints import leaguegamefinder

# Query for games from the League Game Finder
gamefinder=pd.DataFrame()
for i in team_id_random:
    time.sleep(1) #delay to prevent being blocked from the API
    df = leaguegamefinder.LeagueGameFinder(team_id_nullable=[i]).get_data_frames()[0] #parameter of team ids given
    gamefinder = pd.concat([df,gamefinder])

In [9]:
#Game Statistics

#One row corresponds to one game and one team.
#There will be two rows per game, since there are two teams that played each other.
#Will need to exclude duplicate rows (XXXX will remove duplicate rows)
print(list(set(gamefinder.TEAM_ID))) #confirmed that identified 10 different teams
print(gamefinder.shape) #31,134 rows (one row per team per game, so not a count of unique games)

##Game Finder Dataset: This dataset will be used as the outcome when we look at the association between 
##the exposure and outcome relationship. 
gamefinder

[1610612740, 1610612743, 1610612744, 1610612747, 1610612748, 1610612751, 1610612761, 1610612764, 1610612765, 1610612766]
(31134, 28)


Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,...,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22021,1610612765,DET,Detroit Pistons,0022100875,2022-02-16,DET @ BOS,W,238,112,...,0.769,18.0,29.0,47.0,28,9,1,16,17,1.0
1,22021,1610612765,DET,Detroit Pistons,0022100858,2022-02-14,DET @ WAS,L,238,94,...,0.815,13.0,31.0,44.0,19,7,3,11,19,-9.0
2,22021,1610612765,DET,Detroit Pistons,0022100838,2022-02-11,DET vs. CHA,L,240,119,...,0.611,16.0,29.0,45.0,32,10,6,18,20,-22.0
3,22021,1610612765,DET,Detroit Pistons,0022100831,2022-02-10,DET vs. MEM,L,240,107,...,0.857,10.0,30.0,40.0,27,4,7,9,29,-25.0
4,22021,1610612765,DET,Detroit Pistons,0022100820,2022-02-08,DET @ DAL,L,239,86,...,0.682,13.0,32.0,45.0,20,5,2,15,24,-30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1804,22002,1610612740,NOH,New Orleans Hornets,0020200077,2002-11-08,NOH vs. GSW,W,240,110,...,0.821,17.0,26.0,43.0,20,8,5,13,20,6.0
1805,22002,1610612740,NOH,New Orleans Hornets,0020200063,2002-11-06,NOH vs. SEA,W,241,86,...,0.600,15.0,30.0,45.0,24,8,3,14,20,2.0
1806,22002,1610612740,NOH,New Orleans Hornets,0020200036,2002-11-02,NOH vs. MIA,W,240,100,...,0.818,14.0,27.0,41.0,19,9,5,8,23,4.4
1807,22002,1610612740,NOH,New Orleans Hornets,0020200025,2002-11-01,NOH @ CHI,L,240,79,...,0.714,14.0,25.0,39.0,18,11,5,14,22,-5.0
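
Since each row is one team in one game, a game appears twice in this table only when both opponents are among the ten sampled teams. The sketch below shows one possible way to find those games and to collapse the table to one row per game, assuming `GAME_ID` identifies a game uniquely. The toy rows and values are made up for illustration and are not taken from the API output.

```python
import pandas as pd

# Toy stand-in for gamefinder (made-up values): game "001" involves two sampled teams
toy_games = pd.DataFrame({
    "GAME_ID": ["001", "001", "002", "003"],
    "TEAM_ABBREVIATION": ["AAA", "BBB", "AAA", "CCC"],
    "MATCHUP": ["AAA vs. BBB", "BBB @ AAA", "AAA @ XXX", "CCC vs. YYY"],
    "PTS": [100, 95, 110, 88],
})

# Number of rows that share each GAME_ID, aligned to the original rows
rows_per_game = toy_games.groupby("GAME_ID")["TEAM_ABBREVIATION"].transform("size")

# Games in which both teams belong to the sample (two rows per game)
both_sampled = toy_games[rows_per_game == 2]

# One row per game; keeps the first team's perspective, so team-level stats of the other side are dropped
one_row_per_game = toy_games.drop_duplicates(subset="GAME_ID", keep="first")
```

Whether to deduplicate at all depends on the unit of analysis: for team-level outcomes, keeping one row per team per game may be the more natural choice.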


Now we request the player game logs for seasons in the 2016-2021 range.

In [10]:
#get game ids for the last five years

#for now, let's focus on the last five seasons for ease
from nba_api.stats.endpoints import playergamelogs

#generate a parameter dataframe to define timeframe
#Is this timeframe correct? What is the timeframe that the season normally runs from?
season_parameter_df=pd.DataFrame({'Season':['2016-17','2017-18','2018-19','2019-20','2020-21'], 
                    'Date_From':['9/01/2016','9/01/2017','9/01/2018','9/01/2019','9/01/2020'],
                    'Date_To':['8/31/2017','8/31/2018','8/31/2019','8/31/2020','8/31/2021']})

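#will obtain no game ids, without the season_nullable and date_nullable items selected
logsdf=pd.DataFrame()
#NOTE: range(1, 5) starts at row 1, so row 0 (the 2016-17 season) is skipped;
#the counts reported below therefore cover 2017-18 through 2020-21.
#Use range(len(season_parameter_df)) to include 2016-17 (the counts will change).
for i in range(1, 5):
    time.sleep(1) #delay to prevent being blocked from the API
    season=season_parameter_df.iloc[i]['Season']
    date_from=season_parameter_df.iloc[i]['Date_From']
    date_to=season_parameter_df.iloc[i]['Date_To']
    
    logs = pd.DataFrame(playergamelogs.PlayerGameLogs(
        season_nullable = season,
        date_from_nullable = date_from,
        date_to_nullable = date_to
    ).player_game_logs.get_data_frame())
    logsdf = pd.concat([logs,logsdf])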
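
The request loop above can also be written so that it visits every row of the parameter table and concatenates once at the end. This is only a sketch: `fetch_logs` is a hypothetical stand-in that returns a fake one-row DataFrame so the example runs offline, and it would need to be replaced by the actual `PlayerGameLogs` request (with the `time.sleep(1)` delay) in the notebook.

```python
import pandas as pd

season_parameter_df = pd.DataFrame({
    "Season": ["2016-17", "2017-18", "2018-19", "2019-20", "2020-21"],
    "Date_From": ["9/01/2016", "9/01/2017", "9/01/2018", "9/01/2019", "9/01/2020"],
    "Date_To": ["8/31/2017", "8/31/2018", "8/31/2019", "8/31/2020", "8/31/2021"],
})

def fetch_logs(season, date_from, date_to):
    """Hypothetical stand-in for the API request; returns one fake row per season."""
    return pd.DataFrame({
        "SEASON_YEAR": [season],
        "DATE_FROM": [date_from],
        "DATE_TO": [date_to],
    })

frames = []
# itertuples visits every row, so no season is skipped by an off-by-one range
for row in season_parameter_df.itertuples(index=False):
    frames.append(fetch_logs(row.Season, row.Date_From, row.Date_To))

# Concatenate once instead of growing a DataFrame inside the loop
all_logs = pd.concat(frames, ignore_index=True)
seasons_fetched = all_logs["SEASON_YEAR"].tolist()
```

Collecting the frames in a list avoids repeatedly copying the accumulated DataFrame, and `ignore_index=True` gives the combined table a clean 0..n-1 index.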

In [11]:
#unique set of game ids
game_ids = list(set(logsdf['GAME_ID'])) #not sure if we need game_ids
player_ids = pd.DataFrame(list(set(logsdf['PLAYER_ID'])))
print(pd.DataFrame(game_ids).shape) #4599 game ids
print(player_ids.shape) #875 unique players
print(logsdf.shape) #97,655 rows where each row is for each player in each game. 

(4599, 1)
(875, 1)
(97655, 66)


<b>4599 Games <br>
875 Unique Players <br>
97,655 player-game rows (one row per player per game)</b>

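These are all the columns available for each player-game row in the seasons we requested. At this point the logs still cover every team; they are restricted to the randomly chosen teams in the Data Cleaning section.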

In [13]:
#preview of the columns; we will likely not use all of them
logsdf.columns

Index(['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'NICKNAME', 'TEAM_ID',
       'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP',
       'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM',
       'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'TOV', 'STL', 'BLK',
       'BLKA', 'PF', 'PFD', 'PTS', 'PLUS_MINUS', 'NBA_FANTASY_PTS', 'DD2',
       'TD3', 'GP_RANK', 'W_RANK', 'L_RANK', 'W_PCT_RANK', 'MIN_RANK',
       'FGM_RANK', 'FGA_RANK', 'FG_PCT_RANK', 'FG3M_RANK', 'FG3A_RANK',
       'FG3_PCT_RANK', 'FTM_RANK', 'FTA_RANK', 'FT_PCT_RANK', 'OREB_RANK',
       'DREB_RANK', 'REB_RANK', 'AST_RANK', 'TOV_RANK', 'STL_RANK', 'BLK_RANK',
       'BLKA_RANK', 'PF_RANK', 'PFD_RANK', 'PTS_RANK', 'PLUS_MINUS_RANK',
       'NBA_FANTASY_PTS_RANK', 'DD2_RANK', 'TD3_RANK', 'VIDEO_AVAILABLE_FLAG'],
      dtype='object')

In [None]:
#import boxscore, player statistics within each game
#https://en.wikipedia.org/wiki/Box_score_(baseball)#:~:text=A%20box%20score%20is%20a,the%20box%20score%20in%201858.
#from nba_api.stats.endpoints import BoxScoreAdvancedV2

# Takes a LONG time to run
#boxscfinder=pd.DataFrame()
#for i in game_ids:
#        time.sleep(1) #delay to prevent being blocked from the API
#        boxscore = BoxScoreAdvancedV2(game_id=i).player_stats.get_data_frame()
#        boxscfinder = pd.concat([boxscore,boxscfinder])

In [None]:
####NEXT STEPS:

####These tasks need to be completed and we can split them up.
#Task (Insert your name)

#TASK 1 (NAME): Get a final dataset for the Boxscore (verify that this runs)
#TASK 2 (NAME): Only keep the boxscore entries that match the teams identified
#in the team_id_random ARRAY; generate a file

#TASK 3 (Jamie): Run this notebook and restrict the dataset
#named "boxscfinder" to the parameters we need based on what you 
#identified. Note, Jamie, when you restrict the parameter, it would
#be helpful in the code to name the variables. 

#TASK 4 (NAME): Restrict the "gamefinder" dataframe to the "team performance" metrics

#TASK 5 (Alexis): Look at the final boxscore dataset and define the traded variable

#TASK 6 (Alek/other): Estimate the data distributions of the
#parameters that we are examining.
#Depends on Task 3 and 4 being complete. 
#Generate descriptive statistics for the parameters that we list.


# Data Cleaning

(Alek and others) Describe. In class, data cleaning examples were to handle missing values, recode variables (generate new variables), and plot distribution.<br>

### Cleaning Player Logs

In [13]:
# Copy Logs DataFrame and remove TEAM ID's that were not selected
logsRandTeamDf = logsdf.copy()
logsRandTeamDf = logsRandTeamDf[logsRandTeamDf['TEAM_ID'].isin(team_id_random)]

# Restrict to the player stats we want
logsRandTeamDf = logsRandTeamDf[['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_NAME', 'MIN', 'FG_PCT', 'FT_PCT', 'PTS', 'AST', 'REB', 'STL', 'BLK', 'PLUS_MINUS']]
logsRandTeamDf

Unnamed: 0,SEASON_YEAR,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,MIN,FG_PCT,FT_PCT,PTS,AST,REB,STL,BLK,PLUS_MINUS
5,2020-21,1628455,Mike James,1610612751,Brooklyn Nets,27.800000,0.375,0.0,14,8,4,2,0,-12
6,2020-21,1629639,Tyler Herro,1610612748,Miami Heat,35.183333,0.538,0.5,16,11,6,0,0,21
11,2020-21,1629308,Juan Toscano-Anderson,1610612744,Golden State Warriors,26.830000,0.571,0.0,9,1,4,2,0,10
15,2020-21,1629130,Duncan Robinson,1610612748,Miami Heat,19.083333,0.600,0.0,8,1,2,2,1,9
16,2020-21,1628420,Monte Morris,1610612743,Denver Nuggets,17.816667,0.571,0.0,8,4,0,1,0,-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26097,2017-18,201939,Stephen Curry,1610612744,Golden State Warriors,29.548333,0.444,1.0,22,4,5,1,0,9
26098,2017-18,2561,David West,1610612744,Golden State Warriors,9.416667,0.667,0.0,4,0,1,0,1,0
26099,2017-18,2733,Shaun Livingston,1610612744,Golden State Warriors,18.545000,0.400,1.0,6,5,1,0,0,-8
26103,2017-18,201142,Kevin Durant,1610612744,Golden State Warriors,37.626667,0.467,0.8,20,7,5,0,4,11
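
The research question refers to per-game averages, while the table above has one row per player per game. One way to get from the latter to the former is a grouped aggregation, sketched below on made-up rows. The output column names `GAMES`, `AVG_MIN`, and `AVG_PTS` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for logsRandTeamDf (made-up players and values)
toy_logs = pd.DataFrame({
    "SEASON_YEAR": ["2020-21", "2020-21", "2020-21", "2019-20"],
    "PLAYER_ID": [1, 1, 2, 1],
    "PLAYER_NAME": ["Player A", "Player A", "Player B", "Player A"],
    "MIN": [30.0, 20.0, 10.0, 25.0],
    "PTS": [20, 10, 4, 15],
})

# One row per player per season with games played and per-game averages
season_avgs = (
    toy_logs
    .groupby(["SEASON_YEAR", "PLAYER_ID", "PLAYER_NAME"], as_index=False)
    .agg(
        GAMES=("PTS", "count"),   # number of game rows with a recorded PTS value
        AVG_MIN=("MIN", "mean"),  # average minutes per game
        AVG_PTS=("PTS", "mean"),  # average points per game
    )
)
```

Calling `season_avgs.describe()` on the result would give a first look at the distributions of these averages.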


### Cleaning boxstats

In [1]:
# NEED TO RUN BOXSTATS REQUEST BLOCK FIRST

# Restrict boxscfinder to the random teams we selected
# boxscfinder = boxscfinder[boxscfinder['TEAM_ID'].isin(team_id_random)]

# Restrict to the player stats we want
# boxscfinder = boxscfinder[['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'MIN', 'FG_PCT', 'FT_PCT', 'PTS', 'AST', 'REB', 'STL', 'BLK', 'PLUS_MINUS']]
# boxscfinder

### Saving Data
Now we save the data we requested from the NBA API to CSV files.

In [24]:
#Saving the Data to a csv file
logsRandTeamDf.to_csv("logPlayerStats.csv")

# NEED TO RUN BOXSTATS REQUEST BLOCK FIRST
# boxscfinder.to_csv("boxscfinder.csv")
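
By default, `to_csv` also writes the DataFrame index, and `read_csv` then loads it back as an extra column named `Unnamed: 0`. The sketch below illustrates the difference on a toy DataFrame, using in-memory buffers instead of files so that it runs without touching disk. Passing `index=False` is one option, assuming the original row index does not need to be preserved.

```python
import io
import pandas as pd

# Toy stand-in for logsRandTeamDf with a non-default index (made-up values)
toy_stats = pd.DataFrame({"PLAYER_ID": [1, 2], "PTS": [14, 16]}, index=[5, 6])

# Default behavior: the index is written as an unnamed first column
with_index = io.StringIO()
toy_stats.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy_stats.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```

If the index should be kept, reading with `pd.read_csv(..., index_col=0)` is an alternative that restores it as the index rather than as a data column.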

### Manipulating/Plotting CSV File Data

In [3]:
import seaborn as sns

In [35]:
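#TODO: plot the distributions of the data using either seaborn or matplotlib

dataDf = pd.read_csv("logPlayerStats.csv")
dataDf['TIMES_TRADED'] = np.nan
pid_lst = list(player_ids[0])

#TIMES_TRADED = (number of distinct teams a player appears with in this dataset) - 1.
#This only sees the ten sampled teams and counts any team change, not just trades.
for pid in pid_lst:
    tempDf = dataDf[dataDf['PLAYER_ID'] == pid]
    traded = len(pd.unique(tempDf['TEAM_NAME']))-1
    dataDf.loc[dataDf['PLAYER_ID'] == pid, 'TIMES_TRADED'] = int(traded)

#players who appear with at least four of the sampled teams
dataDf[dataDf['TIMES_TRADED'] >= 3]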
    

Unnamed: 0.1,Unnamed: 0,SEASON_YEAR,PLAYER_ID,PLAYER_NAME,TEAM_ID,TEAM_NAME,MIN,FG_PCT,FT_PCT,PTS,AST,REB,STL,BLK,PLUS_MINUS,TIMES_TRADED
1931,6013,2020-21,202738,Isaiah Thomas,1610612740,New Orleans Pelicans,4.050000,0.500,0.0,2,0,2,0,0,-1,3.0
2202,6886,2020-21,202738,Isaiah Thomas,1610612740,New Orleans Pelicans,19.000000,0.333,1.0,11,3,0,1,0,2,3.0
2284,7118,2020-21,202738,Isaiah Thomas,1610612740,New Orleans Pelicans,25.216667,0.308,0.0,10,2,2,0,0,7,3.0
9837,6694,2019-20,202738,Isaiah Thomas,1610612764,Washington Wizards,24.516667,0.300,1.0,9,4,1,0,0,-10,3.0
9929,6949,2019-20,202738,Isaiah Thomas,1610612764,Washington Wizards,22.600000,0.313,0.0,12,3,2,0,0,3,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26103,7307,2017-18,202738,Isaiah Thomas,1610612747,Los Angeles Lakers,23.153333,0.364,1.0,17,4,2,0,0,-1,3.0
26143,7420,2017-18,202738,Isaiah Thomas,1610612747,Los Angeles Lakers,25.466667,0.417,1.0,17,4,2,2,0,4,3.0
26248,7718,2017-18,202738,Isaiah Thomas,1610612747,Los Angeles Lakers,24.866667,0.200,1.0,7,5,1,0,1,-8,3.0
26323,7899,2017-18,202738,Isaiah Thomas,1610612747,Los Angeles Lakers,4.811667,1.000,1.0,3,2,1,0,0,-4,3.0
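
The per-player loop that fills `TIMES_TRADED` can be expressed as a single grouped operation. The sketch below uses made-up rows and keeps the same definition (distinct team names per player, minus one), so it inherits the same limitations: it counts any change of team rather than trades specifically, and it only sees teams that are present in the data.

```python
import pandas as pd

# Toy stand-in for dataDf (made-up players and teams)
toy_df = pd.DataFrame({
    "PLAYER_ID": [1, 1, 1, 2, 2],
    "TEAM_NAME": ["Team A", "Team B", "Team A", "Team C", "Team C"],
    "PTS": [10, 12, 8, 20, 22],
})

# Distinct teams per player, broadcast back to every row of that player, minus one
toy_df["TIMES_TRADED"] = (
    toy_df.groupby("PLAYER_ID")["TEAM_NAME"].transform("nunique") - 1
)

# Players with at least one team change in the toy data
changed_teams = toy_df[toy_df["TIMES_TRADED"] >= 1]
```

Note that a player who leaves a team and later returns to it (player 1 above) is counted as one change under this definition, not two.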
