This notebook walks through retrieving three separate datasets:
- All-NBA teams
- All-Stars
- All-Stars 2020

In [1]:
import pandas as pd
import sys 

sys.path.insert(1, '../src/retrieve_data')

pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 800)

## Part 1: All-NBA teams

First, scrape the players who were named to an All-NBA team. The All-NBA selection has consisted of three teams since the 1988-89 season, so data is fetched from the 1988-1989 season through the 2018-2019 season.

In [2]:
import all_nba_team_scraper

In [3]:
all_nba_teams = all_nba_team_scraper.all_nba_team()

In [5]:
all_nba_teams.to_csv('..//data//All_nba_team_csv', index = False) # saving the csv file

In [6]:
all_nba_teams['All nba team'].value_counts()

1st    155
3rd    155
2nd    155
Name: All nba team, dtype: int64
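
The equal counts match what the setup implies: 31 seasons, three teams per season, five players per team. A minimal sketch of that arithmetic, assuming exactly five players per team and no tied selections (assumptions for illustration, not checked against the scraped data):

```python
# Assumption: every All-NBA team has exactly five players and there are no ties.
PLAYERS_PER_TEAM = 5
seasons = range(1989, 2020)  # season-ending years 1989 through 2019

expected_per_team = len(seasons) * PLAYERS_PER_TEAM
expected_total = expected_per_team * 3  # 1st, 2nd and 3rd teams

print(expected_per_team, expected_total)
```

If the scraper ever returns a different count for one of the teams, a season was probably skipped or parsed twice.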

## Part 2: All-star players
Second, scrape the players who were All-Stars. Because the All-NBA data starts with the 1988-89 season, we follow the same range and retrieve the players who were All-Stars from the 1988-1989 season through the 2018-2019 season.

In [7]:
dfs = []
for year in range(1989,2020,1):
    if year == 1999: # No All-Star Game was held in 1999 because of the lockout
        continue
        
    df = all_nba_team_scraper.all_stars(year)
    dfs.append(df)
    
star_players = pd.concat(dfs).reset_index(drop = True)

In [8]:
star_players.head()

Unnamed: 0,Players,Season
0,John Stockton,1989
1,Alex English,1989
2,Karl Malone,1989
3,Dale Ellis,1989
4,Hakeem Olajuwon,1989


In [9]:
star_players['Players'] = star_players['Players'].apply(all_nba_team_scraper.clean_text)

In the 2018-2019 season, Dwyane Wade and Dirk Nowitzki were named as special roster additions for the game, to commemorate their contributions to basketball. From a basketball fan's perspective, they are outliers who would not have been chosen as All-Stars based on voting and performance. Therefore, we remove them from our dataset.

In [10]:
star_players.loc[(star_players['Players'] == 'Dwyane Wade') & (star_players['Season'] == 2019)]

Unnamed: 0,Players,Season
706,Dwyane Wade,2019


In [11]:
star_players.loc[(star_players['Players'] == 'Dirk Nowitzki') & (star_players['Season'] == 2019)]

Unnamed: 0,Players,Season
720,Dirk Nowitzki,2019


In [12]:
star_players = star_players.drop([706,720])

In [13]:
star_players = star_players.reset_index(drop = True)
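
Dropping by hard-coded index labels (706 and 720) only works while the scrape returns rows in the same order. A minimal sketch of the same removal with a boolean mask, using a small toy frame in place of `star_players` (the rows below are illustrative only):

```python
import pandas as pd

# Toy stand-in for star_players; the rows are illustrative only.
toy = pd.DataFrame({
    'Players': ['Dwyane Wade', 'Dirk Nowitzki', 'LeBron James', 'Dwyane Wade'],
    'Season': [2019, 2019, 2019, 2016],
})

# Rows to remove: the two honorary additions, in the 2019 season only.
honorary = toy['Players'].isin(['Dwyane Wade', 'Dirk Nowitzki']) & (toy['Season'] == 2019)

toy_clean = toy[~honorary].reset_index(drop=True)
print(toy_clean)
```

The mask keeps earlier selections of the same players and does not depend on row positions.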

## Part 3: Player stats 

Third, scrape player stats. Again, we retrieve data starting from the 1988-1989 season.

In [14]:
import nba_stats

In [None]:
result = nba_stats.player_stats_seasons(1989,2019)

In [None]:
result

As with the All-NBA team data, there are a few cases where a player changed his name. Fortunately, the retrieved data already accounts for these name changes.

Clean up the player names.

In [None]:
result['Player'] = result['Player'].apply(all_nba_team_scraper.clean_text)

In [None]:
result = result.reset_index(drop = True)

In [None]:
result

Create a dictionary that maps two-digit year abbreviations to the full four-digit year.

In [None]:
# Map two-digit suffixes to full years as strings: '89' -> '1989', '00' -> '2000', ..., '20' -> '2020'
mydict = {'{:02d}'.format(year % 100): str(year) for year in range(1989, 2021)}

In [None]:
all_nba_teams['Season_end'] = all_nba_teams['Season'].str[-2:] # last two characters of the season string (two-digit ending year)

In [None]:
all_nba_teams = all_nba_teams.replace({'Season_end': mydict}) # replace with dictionary values
all_nba_teams['Season_end'] = all_nba_teams['Season_end'].astype(int)
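
If the season strings carry the full starting year, the ending year can also be derived without a lookup table. The sketch below assumes a `'YYYY-YY'` format such as `'1999-00'`; the notebook only confirms that the string ends in a two-digit year, so treat the format as an assumption.

```python
import pandas as pd

# Assumed format: 'YYYY-YY', e.g. '1999-00' (not confirmed by the notebook).
toy = pd.DataFrame({'Season': ['1988-89', '1999-00', '2018-19']})

# A season ends one calendar year after it starts.
toy['Season_end'] = toy['Season'].str[:4].astype(int) + 1
print(toy)
```

This avoids the century rollover handling that the dictionary has to spell out.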

Merge player stats df with all-nba teams df

In [None]:
nba_df = result.merge(all_nba_teams, how= 'left', left_on = ['Player', 'Season'], right_on = ['Players', 'Season_end'])

nba_df = nba_df.rename(columns ={'All nba team': 'NBA_ALL_TEAM'})
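
A left merge silently leaves `NaN` wherever a name or season fails to match, for example because of a spelling or accent difference. One way to spot such rows is to run the merge in the other direction with `indicator=True`. This is a minimal sketch on toy frames; the player names are placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for the stats table and the All-NBA table.
stats = pd.DataFrame({
    'Player': ['Player A', 'Player B', 'Player C'],
    'Season': [2019, 2019, 2019],
})
awards = pd.DataFrame({
    'Players': ['Player A', 'Player D'],
    'Season_end': [2019, 2019],
    'All nba team': ['1st', '2nd'],
})

# A right merge keeps every award row; indicator=True adds a '_merge' column
# that says whether a matching stats row was found.
check = stats.merge(awards, how='right',
                    left_on=['Player', 'Season'],
                    right_on=['Players', 'Season_end'],
                    indicator=True)

unmatched = check.loc[check['_merge'] == 'right_only', 'Players'].tolist()
print(unmatched)
```

An empty list would mean every award row found its stats row.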

In [None]:
nba_df.head(5)

In [None]:
star_players['ALL_STAR'] = 'YES' # every row in this table is an All-Star selection

Merge the results from above with all-stars df

In [None]:
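# Match on 'Player' from the stats table: 'Players' came from the All-NBA merge and is NaN for everyone else
final_df = nba_df.drop(columns = 'Players').merge(star_players, how = 'left', left_on = ['Player', 'Season_x'], right_on = ['Players', 'Season'])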

In [None]:
final_df.head()

## Clean up the column names

In [None]:
final_df = final_df.rename(columns = {'MP_y': 'MP',
                                      'MP_x': 'MP/G'})

In [None]:
cols = [col for col in final_df.columns if col not in ['Players', 'Season_y', 'Season_end', 'Season']]

In [None]:
final_df = final_df[cols]

In [None]:
final_df

In [None]:
final_df['NBA_ALL_TEAM'].value_counts()

In [None]:
# nba_df = test
final_df.to_csv('..//data//nba_df.csv', index = False)

## Part 4: Retrieve test set (this season's stats)

In [None]:
import nba_stats

In [None]:
df_per_game = nba_stats.player_stats(2020, 'per_game')
df_advanced = nba_stats.player_stats(2020, 'advanced')

In [None]:
# Determine the matching column names in these two dataframes
intersect_cols = df_per_game.columns.intersection(df_advanced.columns).tolist()
intersect_cols.remove('MP')

ex = df_per_game.merge(df_advanced, how = 'left', on = intersect_cols)
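
The `'MP'` column is removed from the join keys because it appears in both tables with different meanings. The rename to `'MP/G'` and `'MP'` in the clean-up step suggests minutes per game in one table and total minutes in the other. The sketch below uses toy frames with made-up values to show that the shared column then comes back with `_x` and `_y` suffixes.

```python
import pandas as pd

# Toy stand-ins for the per-game and advanced tables; values are made up.
per_game = pd.DataFrame({'Player': ['Player A'], 'Tm': ['AAA'], 'MP': [34.5], 'PTS': [25.1]})
advanced = pd.DataFrame({'Player': ['Player A'], 'Tm': ['AAA'], 'MP': [2500], 'PER': [24.0]})

shared = per_game.columns.intersection(advanced.columns).tolist()
shared.remove('MP')  # same name, different quantity, so it cannot be a join key

merged = per_game.merge(advanced, how='left', on=shared)
print(merged.columns.tolist())
```

Had `'MP'` stayed in the key list, the two values would never be equal and the left merge would fill every advanced column with `NaN`.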

In [None]:
combined_df = ex.dropna().reset_index(drop = True)

In [None]:
all_star_2020 = all_nba_team_scraper.all_stars(2020)
all_star_2020['ALL_STAR'] = 'YES'

In [None]:
test_set = combined_df.merge(all_star_2020, how = 'left', left_on = 'Player', right_on = 'Players')
test_set['ALL_STAR'] = test_set['ALL_STAR'].fillna('NO')

In [None]:
test_set.to_csv('..//data//nba_19-20.csv',index = False)