# COGS 108 - Final Project (change this to your project's title)

## Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included (but PIDs will be scraped from any groups who include their PIDs).

* [`X`] YES - make available
* [  ] NO - keep private

# Overview

*Fill in your overview here*

# Names

- Alexis Garduno
- James Daza
- Jamie Wei
- Aleksander Archipov

<a id='research_question'></a>
# Research Question

- Does the number of times a player is traded predict that player's performance (e.g. average total points scored/game, average minutes played/game, etc) in an NBA game? Also, does the turn-over rate (e.g. number of players traded within a team/season) of a NBA team affect the likelihood that the team will reach the NBA finals (evidenced by the last 20 years of NBA games)?

<a id='background'></a>

## Background & Prior Work

#### Background

The NBA is a high-tension business, where is each team is constantly fighting to maintain only the highest skilled players.  While the top 25 big hitters tend to remain relatively static, the rest of the 450+ players in the league can come and go at any moment.  This can have a great impact on the team and the players.  Trades in the NBA are quite complex, dealing with variables such as salary cap, the NBA collective bargaining agreement(CBA), as well as the team's personal needs.  Some research has already been done on the effects of trades on NBA teams as well as work on predicting factors in NBA player and team success.   

The NBA is also a very stat-heavy sport. NBA games are decided by whichever team has the most points. However, that doesn’t mean Basketball is just about outsourcing your opponent. There are many other factors that are looked upon. Mainly: assists, rebounds, blocks, steals, and turnovers. Each of these variables could be a factor in increasing the points of a team. Assists count when an NBA player passes the ball to another player and that player manages to score some points. A rebound is when another player attempts a shot, and the player catches the ball as it's falling down. A block is when a player manages to deflect a player's shot from his goal. A steal is when a player manages to gain possession from the other team. Finally, a turnover is when a player loses control of the ball, and the opposing team gains possession. Obviously, it is better to have higher stats in all of the mentioned above except turnovers. In this study, we will be looking at these stats and many others to determine how a player is performing in a season. 

#### Prior Work

In a 2020 article by Michael C. Wright (Reference 1), Wright talks to players about the effect that trades can have on the team's morale.  Players mentioned how close wins and tough losses throughout a season can create strong bonds and chemistry between players.  Teams with good chemistry are generally more successful and thus make it closer to the Finals.  The NBA veterans note that it is something they can get used to.  For instance, early in their career, it can have a big impact on how they play, but as they get older they realize it is just a part of the business and it does not affect them as much.  Even still, players that might play together for 5 or more years can easily be separated at any time.  

Work on predicting NBA results has been done before.  The website fivethirtyeight.com (Reference 2) uses player projections that take into account factors, such as physiological, scoring, tendencies, passing, and defense, among other variables.  This data is compared to past NBA players to predict that player's season.  The site then runs simulations on this data to form predicted season statistics for each team.  While this is a reasonable predictor, it fails to take into consideration any phycological factors that can arise from trading, as mentioned previously.

Another source we looked at (Reference 3) found the most important stats when looking at team performances. Generally, a lot of team performance can’t be calculated into a number, however, this website managed to find a correlation between some stats and the likelihood of winning. They specifically looked at Rebounds, Turnovers, Field goal %, Free throw %, and Fouls. They found that the team with the higher Field goal % is 75% more likely to win the game and that the team with the highest Free throw % is 70% more likely to win. However, just because these stats may be higher for some teams, it doesn't always mean they will win the game.

References (include links):
- 1) https://www.nba.com/news/trade-deadline-when-friends-are-dealt
- 2) https://projects.fivethirtyeight.com/nba-trades-2022/
- 3) https://www.oskeimsportspicks.com/nba-stats/#:~:text=In%20addition%20to%20winning%20three%20of%[…]so%20why%20put%20them%20first%20on%20the%20list%3F

# Hypothesis


We expect that an NBA player who is traded more than an average NBA player at their point of their career (in terms of years in the NBA) will score less points/game, play less time/game, and have worse performance on average. Teams with NBA players who are traded more than an average player at a similar point in their career will be less likely to make it to the NBA finals.

# Dataset(s)

#### Source Name: NBA_API
- Link to the dataset: https://github.com/swar/nba_api
- Number of observations: N/A
- API Client that allows access to various NBA's stats API endpoints

#### Source Name: NBA Stats
- Link to the dataset: https://www.nba.com/stats/
- Player stats from 10 random teams ranging from 2016-2021

We began with two sources, primarily relying on the NBA_API to construct datastreams. We performed some simple quality checks to ensure that the dataset we pulled from the NBA_API was consistent with the data shown on the NBA stats site.

#### Dataset: PlayerStats-Team Ranking 
- ***Description:*** This dataset contains a player's performance for an individual season; each row is uniquely defined for a player's performance for that team in an individual season. Each player's stat will be averaged per game. If players were traded in the middle of the season, there would be two rows per player. This dataset also contains information about how often players were traded in a given season. This dataset was merged to team performance statistics, largely team rankings over an individual season. This dataset has a header of "working dataset."
- ***Observations:*** 2,892 players per season
##### Dataset: Column Meanings
- points per game(PTS)- the amount of points a player scores per games that they play
- assists per game(AST)- the amount of times a player pass the ball to a teammate and they score
- Offensive rebounds per game(OREB) - the amount of times a player obtains the ball off a miss on offensive.
- Defensive rebounds per game(DREB) - the amount of times a player obtains the ball off a miss on offensive.
- steals per game(STL) - the amount of times a player obtains the ball from the opponent. Either by intercepting the pass or hitting the ball out of their opponenets hands.
- blocks per game(BLK)- the amount of times a player manges to stop the ball after their opponent attempts a shot
- field goal percentage(FG_PCT) - the amount of times a player makes the ball vs how many they attempt
- free throw perentage(FT_PCT) - the amount of times a player makes the ball in freethrow vs how many they attempt
- field goals made(FMG)- the amount of goals a player hits in the game
- Three pointers made(FG3_PCT) - the amount of times a player scores a basket outside of the three pointer line
- Number of observations:

***1-2 sentences describing each dataset.***

***If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.***

# Setup

In [2]:
import pandas as pd
import numpy as np
import time
import seaborn as sns
import warnings
warnings.filterwarnings(action='once')

from nba_api.stats.static import players
from nba_api.stats.endpoints import playercareerstats

# Data Cleaning

Describe your data cleaning steps here.

In [2]:
### Gathering Teams

In [None]:
#Import the roster of teams from the NBA API
from nba_api.stats.static import teams

nba_teams = teams.get_teams()

In [None]:
#Obtain a full list of all abbreviations - will need abbreviations to identify team statistics
nba_teams_df=pd.DataFrame(nba_teams)
team_id=nba_teams_df['id'] #this is the unique team id
team_id_random=np.random.choice(team_id,10,replace=False) #identify ten team ids
nba_teams_rdf=nba_teams_df[nba_teams_df['id'].isin(team_id_random)] #df of 10 randomly selected teams
nba_teams_rdf

In [None]:
### Requesting Games
Now we request the games for all of the ten teams we have selected

In [None]:
#Pull all games for all ten teams

#Documentation for this endpoint: 
#https://github.com/swar/nba_api/blob/master/docs/nba_api/stats/endpoints/leaguegamefinder.md
from nba_api.stats.endpoints import leaguegamefinder

# Query for games from the League Game Finder
gamefinder=pd.DataFrame()
for i in team_id_random:
    time.sleep(1) #delay to prevent being blocked from the API
    df = leaguegamefinder.LeagueGameFinder(team_id_nullable=[i]).get_data_frames()[0] #parameter of team ids given
    gamefinder = pd.concat([df,gamefinder])

In [None]:
#Game Statistics

#One row corresponds to one game and one team.
#There will be two rows per game, since there are two teams that played each other.
#Will need to exclude duplicate rows (XXXX will remove duplicate rows)
print(list(set(gamefinder.TEAM_ID))) #confirmed that identified 10 different teams
print(gamefinder.shape) #31,386 games

##Game Finder Dataset: This dataset will be used as the outcome when we look at the association between 
##the exposure and outcome relationship. 
gamefinder

In [None]:
Now we request all the player stats from seasons ranging from 2016-2021

In [None]:
#get game ids for the last five years

#for now, let's focus on the last five seasons for ease
from nba_api.stats.endpoints import playergamelogs

#generate a parameter dataframe to define timeframe
#Is this timeframe correct? What is the timeframe that the season normally runs from?
season_parameter_df=pd.DataFrame({'Season':['2016-17','2017-18','2018-19','2019-20','2020-21'], 
                    'Date_From':['9/01/2016','9/01/2017','9/01/2018','9/01/2019','9/01/2020'],
                    'Date_To':['8/31/2017','8/31/2018','8/31/2019','8/31/2020','8/31/2021']})

#will obtain no game ids, without the season_nullable and date_nullable items selected
logsdf=pd.DataFrame()
for i in list(range(1, 5)):
    time.sleep(1)
    season=season_parameter_df.iloc[i][0]
    date_from=season_parameter_df.iloc[i][1]
    date_to=season_parameter_df.iloc[i][2]
    
    logs = pd.DataFrame(playergamelogs.PlayerGameLogs(
        season_nullable = season,
        date_from_nullable = date_from,                                                     
        date_to_nullable = date_to
    ).player_game_logs.get_data_frame())
    logsdf = pd.concat([logs,logsdf])

In [None]:
#unique set of game ids
game_ids = list(set(logsdf['GAME_ID'])) #not sure if we need game_ids
player_ids = pd.DataFrame(list(set(logsdf['PLAYER_ID'])))
print(pd.DataFrame(game_ids).shape) #4599 game ids
print(player_ids.shape) #875 unique players
print(logsdf.shape) #97,655 rows where each row is for each player in each game. 

In [None]:
<b>4599 Games <br>
875 Unique Players <br>
97,655 stats of players in each game<b>

In [None]:
All the columns of every player that played in the seasons we selected from the random teams chosen

In [None]:
#preview, likely not use these columns
logsdf.columns

In [None]:
# Data Cleaning

In [None]:
### Cleaning Player Logs

In [None]:
# Copy Logs DataFrame and remove TEAM ID's that were not selected
logsDf = logsdf.copy()


# Restrict to the player stats we want
logsDf = logsDf[['SEASON_YEAR', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_NAME', 'MIN', 'FG_PCT', 'FT_PCT', 'PTS', 'AST', 'REB', 'STL', 'BLK', 'PLUS_MINUS']]
logsDf

In [None]:
### Cleaning boxstats

In [None]:
### Saving Data <br>
Now we save the data we gathered we requested from the NBA API into csv files

In [None]:
#Saving the Data to a csv file
logsDf.to_csv("logPlayerStats.csv")

# NEED TO RUN BOXSTATS REQUEST BLOCK FIRST
# boxscfinder.to_csv("boxscfinder.csv")

In [None]:
### Manipulating/Plotting CSV File Data

In [None]:
#Need to plot the distributions data using either seaborn or matplotlib

dataDf = pd.read_csv("logPlayerStats.csv")
#dataDf['TIMES_TRADED'] = np.nan
pid_lst = list(player_ids[0])
pid_lst[0]

season = ['2016-17','2017-18','2018-19','2019-20','2020-21']

playerStats = pd.DataFrame()
for pid in pid_lst:
    time.sleep(1)
    career = playercareerstats.PlayerCareerStats(player_id=pid)
    temp = career.get_data_frames()[0]
    temp = temp[temp['SEASON_ID'].isin(season)]
    playerStats = playerStats.append(temp)

In [None]:
season = ['2016-17','2017-18','2018-19','2019-20','2020-21']
for s in season:
    playerStats["Traded " + s] = 0
    
for pid in pid_lst:
    for s in season:
        tempDf = dataDf.loc[(dataDf['PLAYER_ID'] == pid) & (dataDf['SEASON_YEAR'] == s)]
        tempDf = tempDf.reset_index(drop=True)
        tempDf = tempDf.drop_duplicates(subset=['TEAM_ID'])
        traded = len(tempDf.index)-1
        
        if traded >= 0:
            playerStats.loc[(playerStats['PLAYER_ID'] == pid), ["Traded " + s]] = traded
        else:
            playerStats.loc[(playerStats['PLAYER_ID'] == pid), ["Traded " + s]] = 0

In [None]:
playerStats = playerStats.reset_index(drop=True)
playerStats

In [None]:
playerStats.to_csv("PlayerStats.csv")

In [None]:
playerStats

In [None]:
playerStats[playerStats['PLAYER_ID'] ==1626144 ]

In [None]:
season_stats = pd.DataFrame()
for pid in pid_lst:
    time.sleep(1)
    career = playercareerstats.PlayerCareerStats(player_id=pid)
    temp = career.career_totals_regular_season.get_data_frame()
    season_stats = season_stats.append(temp)  

In [None]:
season_stats

In [None]:
season = ['2016-17','2017-18','2018-19','2019-20','2020-21']
for s in season:
    season_stats["Traded " + s] = 0
    
for pid in pid_lst:
    for s in season:
        tempDf = dataDf.loc[(dataDf['PLAYER_ID'] == pid) & (dataDf['SEASON_YEAR'] == s)]
        tempDf = tempDf.reset_index(drop=True)
        tempDf = tempDf.drop_duplicates(subset=['TEAM_ID'])
        traded = len(tempDf.index)-1
        
        if traded >= 0:
            season_stats.loc[(season_stats['PLAYER_ID'] == pid), ["Traded " + s]] = traded

        else:
            season_stats.loc[(playerStats['PLAYER_ID'] == pid), ["Traded " + s]] = 0

In [None]:
season_stats.to_csv("careerStats_v2.csv")

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [3]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*

# Team Contributions

*Specify who in your group worked on which parts of the project.*