## Project Setup
# Data Creation / Data Cleaning
This notebook collects the NBA and NCAA datasets and cleans them.

The `sportsreference` package must be installed (`pip install sportsreference`) to run this code, as it provides access to data from the popular Sports Reference website. The documentation is here: https://sportsreference.readthedocs.io/en/stable/sportsreference.html

The PyPI page for this package is here: https://pypi.org/project/sportsreference/



In [1]:
#  pip install sportsreference

In [2]:
#Dependencies
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sportsreference.ncaab.roster import Player
from sportsreference.nba.roster import Roster

Creating the list of player objects.
We loop over each team, fetch the player objects on its roster, and store them all in a single list. This project collects data only from players on an active roster, both to preserve accuracy and because trends in the modern NBA change rapidly: the mid-range shot, for example, is not as valued today as it was years ago.

In [3]:
teams = ['ATL','BRK','BOS','CHO','CHI','CLE','DAL','DEN','DET','GSW',
         'HOU','IND','LAC','LAL','MEM','MIA','MIL','MIN','NOP','NYK',
         'OKC','ORL','PHI','PHO','POR','SAC','SAS','TOR','UTA','WAS']
player_list = []

for team in teams:
    teamname = Roster(team)
    for players in teamname.players:
        player_list.append(players)
print("Got all player names")

Got all player names
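
Each roster lookup in the loop above triggers a request, so a single failing team would stop the whole run. A minimal sketch of one way to keep going, under stated assumptions: `collect_players` and `FakeRoster` are hypothetical names introduced here for illustration, and the only thing assumed about a roster object is that it exposes a `players` attribute, as in the cell above. In the notebook, the `Roster` class would be passed in place of `FakeRoster`.

```python
def collect_players(team_codes, roster_factory):
    """Gather players from every team, recording teams whose lookup fails.

    Hypothetical helper: roster_factory is any callable that takes a team
    code and returns an object with a `players` attribute.
    """
    players, failed = [], []
    for code in team_codes:
        try:
            roster = roster_factory(code)
        except Exception:
            # Broad on purpose: this sketch does not assume which error
            # types a real lookup might raise.
            failed.append(code)
            continue
        players.extend(roster.players)
    return players, failed


class FakeRoster:
    """Stand-in used only for this demo; NOT the sportsreference class."""

    def __init__(self, code):
        if code == "XXX":
            raise ValueError("unknown team")
        self.players = [f"{code}-player-1", f"{code}-player-2"]


players, failed = collect_players(["ATL", "XXX", "BOS"], FakeRoster)
print(len(players), failed)  # 4 ['XXX']
```

Returning the failed codes alongside the players makes it easy to retry only those teams afterwards.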


We now store all stats in two dictionaries. Each dictionary uses the player name as the key, and the value is the pandas dataframe exposed by the player's `dataframe` attribute. This dataframe compiles many different stats, including some advanced stats.

In [4]:
nba_player_info = {}
ncaa_player_info = {}
for nbaplayer in player_list:
    try:
        # Build the college player ID as "firstname-lastname-1",
        # after stripping apostrophes and periods from the name.
        name = nbaplayer.name.replace("'", "").replace(".", "")
        split_name = name.split()
        firstname = split_name[0].lower()
        lastname = split_name[1].lower()
        nameid = firstname + "-" + lastname + "-1"
        ncaa_player = Player(nameid)
        nba_player_info[nbaplayer.name] = nbaplayer.dataframe
        ncaa_player_info[nbaplayer.name] = ncaa_player.dataframe
    except TypeError:
        # Skip players for whom a lookup raises a TypeError.
        pass
print("Stored Everything in two dictionaries")

Stored Everything in two dictionaries
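
The identifier built in the loop above follows the pattern `firstname-lastname-1`. Here is a minimal sketch of that step as a standalone function, assuming the same pattern. `make_ncaa_id` is a hypothetical helper name, not part of sportsreference, and the default suffix of 1 simply mirrors the cell above; players who share a name may need a different suffix.

```python
def make_ncaa_id(full_name, suffix=1):
    """Build a 'first-last-N' style identifier from a display name.

    Hypothetical helper: mirrors the string handling in the loop above,
    using only the first two words of the name.
    """
    # Drop apostrophes and periods, e.g. "De'Andre" -> "DeAndre".
    cleaned = full_name.replace("'", "").replace(".", "")
    parts = cleaned.split()
    if len(parts) < 2:
        raise ValueError(f"expected a first and last name, got {full_name!r}")
    first, last = parts[0].lower(), parts[1].lower()
    return f"{first}-{last}-{suffix}"


print(make_ncaa_id("De'Andre Hunter"))      # deandre-hunter-1
print(make_ncaa_id("Talen Horton-Tucker"))  # talen-horton-tucker-1
```

Raising a `ValueError` for single-word names makes that case explicit instead of letting it surface as an `IndexError` inside the loop.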


Below is an example of the columns in each player's dataframe. The documentation explains what each column represents.

In [5]:
nba_player_info['Kevin Durant'].columns

Index(['and_ones', 'assist_percentage', 'assists', 'block_percentage',
       'blocking_fouls', 'blocks', 'box_plus_minus', 'center_percentage',
       'defensive_box_plus_minus', 'defensive_rebound_percentage',
       'defensive_rebounds', 'defensive_win_shares', 'dunks',
       'effective_field_goal_percentage', 'field_goal_attempts',
       'field_goal_perc_sixteen_foot_plus_two_pointers',
       'field_goal_perc_ten_to_sixteen_feet',
       'field_goal_perc_three_to_ten_feet',
       'field_goal_perc_zero_to_three_feet', 'field_goal_percentage',
       'field_goals', 'free_throw_attempt_rate', 'free_throw_attempts',
       'free_throw_percentage', 'free_throws', 'games_played', 'games_started',
       'half_court_heaves', 'half_court_heaves_made', 'height',
       'lost_ball_turnovers', 'minutes_played', 'nationality',
       'net_plus_minus', 'offensive_box_plus_minus', 'offensive_fouls',
       'offensive_rebound_percentage', 'offensive_rebounds',
       'offensive_win_shares', '

Here is an example of how to get a stat from the information dictionary made earlier. This returns Kevin Durant's career true shooting percentage.

In [6]:
float(nba_player_info['Kevin Durant']['true_shooting_percentage']['Career'])


0.613
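
The lookup above chains three keys: player name, column, then the `'Career'` row label. A minimal sketch of wrapping that chain in a function that returns `None` when any key is missing; `career_stat` is a hypothetical helper, and the toy dataframe below only imitates the shape used in this notebook (the `'Season A'` row is a placeholder, not real data).

```python
import pandas as pd


def career_stat(player_info, name, stat):
    """Return the 'Career' value of a stat as a float, or None if unavailable."""
    try:
        return float(player_info[name][stat]["Career"])
    except (KeyError, TypeError):
        # KeyError: unknown player, column, or row label.
        # TypeError: the stored value cannot be converted (for example None).
        return None


# Toy stand-in for the dictionary built earlier (assumed shape only).
toy_info = {
    "Kevin Durant": pd.DataFrame(
        {"true_shooting_percentage": [0.600, 0.613]},  # first value is a placeholder
        index=["Season A", "Career"],
    )
}

print(career_stat(toy_info, "Kevin Durant", "true_shooting_percentage"))  # 0.613
print(career_stat(toy_info, "Kevin Durant", "not_a_column"))              # None
```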

The code below builds two pandas dataframes, one for NBA statistics and one for NCAA statistics. The statistics come from the dataframe objects stored earlier, and the purpose is to make stats much easier to access than in the cell above. Counting stats are stored as career totals and averaged per game later.

First, we store information in the NBA dataframe and print the names of the players we have to omit because there is not enough data on them. Often, these players have empty stat lists because they are technically on the roster but haven't played much yet.

In [7]:
nba_required_stats_list = []
nba_names_to_drop = []
for key, value in nba_player_info.items():
    try:
        raw_height = nba_player_info[key]['height']['Career'][-1].split('-')
        career_height = (float(raw_height[0]) * 12 ) + float(raw_height[1])
        career_weight = float(nba_player_info[key]['weight']['Career'])
        career_points = float(nba_player_info[key]['points']['Career'])
        career_games = float(nba_player_info[key]['games_played']['Career'])
        career_assists = float(nba_player_info[key]['assists']['Career'])
        defensive_rebounds = float(nba_player_info[key]['defensive_rebounds']['Career'])
        offensive_rebounds = float(nba_player_info[key]['offensive_rebounds']['Career'])
        career_turnovers = float(nba_player_info[key]['turnovers']['Career'])
        career_blocks = float(nba_player_info[key]['blocks']['Career'])
        career_steals = float(nba_player_info[key]['steals']['Career'])
        career_free_throw_percentage = float(nba_player_info[key]['free_throw_percentage']['Career'])
        career_three_point_percentage = float(nba_player_info[key]['three_point_percentage']['Career'])
        career_PER = float(nba_player_info[key]['player_efficiency_rating']['Career'])
        career_win_shares = float(nba_player_info[key]['win_shares']['Career'])
        off_win_shares = float(nba_player_info[key]['offensive_win_shares']['Career'])
        def_win_shares = float(nba_player_info[key]['defensive_win_shares']['Career'])
        career_field_goal_percentage = float(nba_player_info[key]['field_goal_percentage']['Career'])
        career_usage_percentage = float(nba_player_info[key]['usage_percentage']['Career'])
        vorp = float(nba_player_info[key]['value_over_replacement_player']['Career'][-1])
        boxplusminus = float(nba_player_info[key]['box_plus_minus']['Career'])
        true_shooting_percentage = float(nba_player_info[key]['true_shooting_percentage']['Career'])
        player_dict =  {'Name': key,
                        'Career Height': career_height,
                        'Career Weight': career_weight,
                        'Career Points': career_points,
                        'Career Games': career_games,
                        'Career Assists': career_assists,
                        'Career Def Rebounds':defensive_rebounds,
                        'Career Off Rebounds':offensive_rebounds,
                        'Career Turnovers': career_turnovers,
                        'Career Blocks': career_blocks,
                        'Career Steals': career_steals,
                        'Career Free Throw Percentage': career_free_throw_percentage,
                        'Career Three Point Percentage': career_three_point_percentage,
                        'Career Field Goal Percentage': career_field_goal_percentage,
                        'Career PER': career_PER,
                        'Career Win Shares': career_win_shares,
                        'Offensive Win Shares': off_win_shares,
                        'Defensive Win Shares': def_win_shares,
                        'Career Usage Percentage': career_usage_percentage,
                        'VORP': vorp,
                        'Box Plus Minus': boxplusminus,
                        'True Shooting Per': true_shooting_percentage
                        }
        nba_required_stats_list.append(player_dict)
    except (KeyError, TypeError):
        nba_names_to_drop.append(key)
        print(key)
nba_stats_df = pd.DataFrame(nba_required_stats_list)

Jeremiah Martin
Donta Hall
Robert Williams
Tacko Fall
Daniel Gafford
Matt Mooney
Sir'Dominic Pointer
Dylan Windler
Josh Reaves
Bol Bol
Tyler Cook
Jordan Bone
William Howard
Kostas Antetokounmpo
Talen Horton-Tucker
Devontae Cacok
Jontay Porter
Gabe Vincent
Kyle Alexander
Zylan Cheatham
Kenny Wooten
Jared Harper
Kevin Hervey
Isaiah Roby
Marial Shayok
Moses Brown
DaQuan Jeffries
Kyle Guy
Quinndary Weatherspoon
Juwan Morgan
Miye Oni
Jarrell Brantley
Justin Wright-Foreman


In [8]:
nba_stats_df

Unnamed: 0,Name,Career Height,Career Weight,Career Points,Career Games,Career Assists,Career Def Rebounds,Career Off Rebounds,Career Turnovers,Career Blocks,...,Career Three Point Percentage,Career Field Goal Percentage,Career PER,Career Win Shares,Offensive Win Shares,Defensive Win Shares,Career Usage Percentage,VORP,Box Plus Minus,True Shooting Per
0,De'Andre Hunter,79.0,225.0,778.0,63.0,112.0,242.0,44.0,103.0,18.0,...,0.355,0.410,8.6,0.1,-0.4,0.5,17.5,-1.4,-4.7,0.521
1,Trae Young,73.0,180.0,3327.0,141.0,1213.0,460.0,96.0,597.0,23.0,...,0.344,0.428,20.2,9.2,7.9,1.2,31.4,4.0,1.5,0.567
2,Vince Carter,78.0,220.0,25728.0,1541.0,4714.0,4948.0,1658.0,2590.0,888.0,...,0.371,0.435,18.6,125.3,80.7,44.6,26.3,57.9,3.0,0.536
3,Cam Reddish,80.0,208.0,610.0,58.0,87.0,181.0,35.0,96.0,28.0,...,0.332,0.384,9.0,-0.4,-1.2,0.8,18.9,-0.9,-4.2,0.500
4,Kevin Huerter,79.0,190.0,1411.0,131.0,427.0,378.0,95.0,196.0,52.0,...,0.383,0.416,10.8,3.0,1.7,1.3,16.3,0.0,-2.0,0.535
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,Jerome Robinson,76.0,190.0,313.0,88.0,84.0,135.0,12.0,47.0,17.0,...,0.304,0.365,6.4,0.1,-0.8,1.0,15.2,-0.7,-4.6,0.457
381,Johnathan Williams,81.0,228.0,192.0,35.0,19.0,79.0,63.0,20.0,13.0,...,0.000,0.600,14.9,1.4,1.0,0.4,14.6,-0.1,-2.8,0.606
382,John Wall,76.0,210.0,10879.0,573.0,5282.0,2153.0,330.0,2191.0,396.0,...,0.324,0.433,19.4,44.3,21.5,22.8,27.4,23.6,2.6,0.519
383,Jerian Grant,76.0,205.0,1665.0,273.0,796.0,434.0,90.0,271.0,32.0,...,0.324,0.412,12.6,8.3,4.0,4.3,16.8,0.6,-1.5,0.520


To do the same for the NCAA, we have to jump through a couple more hoops because not every player has complete data available. For example, Tyler Johnson, who is in the NBA, has no NCAA data on height. To handle this, we create a function that returns -999 if no information is available and otherwise returns the corresponding stat as a float. This keeps the code from being cluttered with a ton of if statements. (Height is handled separately in the cell below, where a missing value is stored as 0.)

In [9]:
def stat_checker(stat):
    if(stat is None):
        return -999
    else:
        return float(stat)
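
Heights arrive as `feet-inches` strings and are converted to inches in both the NBA and NCAA cells. A minimal sketch of that conversion as one reusable function follows. `height_to_inches` and `MISSING` are hypothetical names, and returning the -999 sentinel for a missing height is an assumption made here for consistency with `stat_checker`; the NCAA cell itself stores 0 in that case.

```python
MISSING = -999.0  # same sentinel value that stat_checker returns


def height_to_inches(raw_height, missing=MISSING):
    """Convert a 'feet-inches' string such as '6-7' to total inches."""
    if raw_height is None:
        return missing
    feet, inches = raw_height.split("-")
    return float(feet) * 12 + float(inches)


print(height_to_inches("6-7"))  # 79.0
print(height_to_inches(None))   # -999.0
```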


In [10]:
#NO VORP OR PER
ncaa_required_stats_list = []
ncaa_names_to_drop = []
for key, value in ncaa_player_info.items():
    try:
        raw_height = ncaa_player_info[key]['height']['Career'][-1]
        if raw_height is None:
            career_height = 0
        else:
            raw_height = raw_height.split('-')
            career_height = (float(raw_height[0]) * 12 ) + float(raw_height[1])        
        career_weight = stat_checker(ncaa_player_info[key]['weight']['Career'][-1])
        career_points = stat_checker(ncaa_player_info[key]['points']['Career'][-1])
        career_games = stat_checker(ncaa_player_info[key]['games_played']['Career'][-1])
        career_assists = stat_checker(ncaa_player_info[key]['assists']['Career'][-1])
        defensive_rebounds = stat_checker(ncaa_player_info[key]['defensive_rebounds']['Career'][-1])
        offensive_rebounds = stat_checker(ncaa_player_info[key]['offensive_rebounds']['Career'][-1])
        career_turnovers = stat_checker(ncaa_player_info[key]['turnovers']['Career'][-1])
        career_blocks = stat_checker(ncaa_player_info[key]['blocks']['Career'][-1])
        career_steals = stat_checker(ncaa_player_info[key]['steals']['Career'][-1])
        career_free_throw_percentage = stat_checker(ncaa_player_info[key]['free_throw_percentage']['Career'][-1])
        career_three_point_percentage = stat_checker(ncaa_player_info[key]['three_point_percentage']['Career'][-1])
        career_win_shares = stat_checker(ncaa_player_info[key]['win_shares']['Career'][-1])
        off_win_shares = stat_checker(ncaa_player_info[key]['offensive_win_shares']['Career'][-1])
        def_win_shares = stat_checker(ncaa_player_info[key]['defensive_win_shares']['Career'][-1])
        career_field_goal_percentage = stat_checker(ncaa_player_info[key]['field_goal_percentage']['Career'][-1])
        career_usage_percentage = stat_checker(ncaa_player_info[key]['usage_percentage']['Career'][-1])
        boxplusminus = stat_checker(ncaa_player_info[key]['box_plus_minus']['Career'][-1])
        true_shooting_percentage = stat_checker(ncaa_player_info[key]['true_shooting_percentage']['Career'][-1])
        player_dict =  {'Name': key,
                        'Career Height': career_height,
                        'Career Weight': career_weight,
                        'Career Points': career_points,
                        'Career Games': career_games,
                        'Career Assists': career_assists,
                        'Career Def Rebounds':defensive_rebounds,
                        'Career Off Rebounds':offensive_rebounds,
                        'Career Turnovers': career_turnovers,
                        'Career Blocks': career_blocks,
                        'Career Steals': career_steals,
                        'Career Free Throw Percentage': career_free_throw_percentage,
                        'Career Three Point Percentage': career_three_point_percentage,
                        'Career Field Goal Percentage': career_field_goal_percentage,
                        'Career Win Shares': career_win_shares,
                        'Offensive Win Shares': off_win_shares,
                        'Defensive Win Shares': def_win_shares,
                        'Career Usage Percentage': career_usage_percentage,
                        'Box Plus Minus': boxplusminus,
                        'True Shooting Per': true_shooting_percentage
                        }
        ncaa_required_stats_list.append(player_dict)
    except KeyError as err:
        print(key, "Key Error: ", err)
    except TypeError as err:
        print(key, "Type Error: ", err)
    except AttributeError as err:
        print(key, "Attribute Error: ", err)
ncaa_stats_df = pd.DataFrame(ncaa_required_stats_list)

Now we have to restrict the NCAA dataframe to players who also appear in the NBA dataframe. Recall that some players had not played enough NBA minutes to record data, so we could not include them in the NBA dataframe. We therefore make sure the names in the NCAA dataframe line up with those in the NBA dataframe. This also ensures that every point in a plot of x against y represents an actual player, which gives a more accurate picture of how stats carry over from college to the league.

In [11]:

nbanames = list(nba_stats_df["Name"])
ncaa_stats_df = ncaa_stats_df[ncaa_stats_df['Name'].isin(nbanames)]
ncaa_stats_df = ncaa_stats_df.reset_index(drop = True)
ncaa_stats_df

Unnamed: 0,Name,Career Height,Career Weight,Career Points,Career Games,Career Assists,Career Def Rebounds,Career Off Rebounds,Career Turnovers,Career Blocks,Career Steals,Career Free Throw Percentage,Career Three Point Percentage,Career Field Goal Percentage,Career Win Shares,Offensive Win Shares,Defensive Win Shares,Career Usage Percentage,Box Plus Minus,True Shooting Per
0,De'Andre Hunter,79.0,225.0,882.0,71.0,111.0,220.0,90.0,83.0,35.0,41.0,0.773,0.419,0.509,11.0,6.8,4.2,24.7,10.5,0.606
1,Trae Young,74.0,180.0,876.0,32.0,279.0,111.0,14.0,167.0,8.0,54.0,0.861,0.360,0.422,5.7,4.6,1.1,37.1,11.1,0.585
2,Vince Carter,79.0,215.0,1267.0,103.0,197.0,-999.0,-999.0,83.0,80.0,114.0,0.705,0.368,0.547,14.8,9.0,5.8,-999.0,-999.0,0.622
3,Cam Reddish,80.0,218.0,485.0,36.0,70.0,113.0,20.0,96.0,21.0,56.0,0.772,0.333,0.356,3.0,0.8,2.2,25.3,4.5,0.499
4,Kevin Huerter,79.0,190.0,779.0,65.0,196.0,271.0,53.0,126.0,44.0,52.0,0.748,0.394,0.466,7.8,4.6,3.2,19.3,7.7,0.603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,Jerome Robinson,76.0,195.0,947.0,113.0,206.0,109.0,17.0,175.0,68.0,95.0,0.741,0.375,0.440,10.0,5.1,4.8,-999.0,-999.0,0.557
381,Johnathan Williams,74.0,185.0,7.0,14.0,1.0,1.0,1.0,3.0,0.0,1.0,,0.500,0.250,0.0,0.0,0.0,-999.0,-999.0,0.391
382,John Wall,76.0,195.0,616.0,37.0,241.0,129.0,30.0,149.0,19.0,66.0,0.754,0.325,0.461,6.3,3.7,2.6,25.7,-999.0,0.562
383,Jerian Grant,77.0,203.0,1739.0,119.0,690.0,296.0,47.0,266.0,37.0,175.0,0.790,0.345,0.436,18.4,13.9,4.5,23.1,6.9,0.563
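
A minimal sketch of the name alignment described above, on toy frames. The `Career Games` values for Trae Young and John Wall are taken from the tables in this notebook; the Bol Bol entry is left as NaN rather than invented. The `merge` step is an addition not used in the notebook: it is one possible way to pair each player's NBA and NCAA rows for plotting.

```python
import pandas as pd

# Toy frames that imitate the two dataframes built above.
nba = pd.DataFrame({
    "Name": ["Trae Young", "John Wall"],
    "Career Games": [141.0, 573.0],
})
ncaa = pd.DataFrame({
    "Name": ["Trae Young", "Bol Bol", "John Wall"],
    "Career Games": [32.0, float("nan"), 37.0],  # Bol Bol value intentionally missing
})

# Keep only NCAA rows whose player also appears in the NBA frame.
ncaa_aligned = ncaa[ncaa["Name"].isin(set(nba["Name"]))].reset_index(drop=True)

# Optional: one row per player with both sets of stats side by side.
paired = nba.merge(ncaa_aligned, on="Name", suffixes=("_nba", "_ncaa"))
print(paired)
```

Joining on the name keeps each NBA value paired with the matching NCAA value even if the two frames end up in different row orders.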


Finally, we add per-game averages for each counting stat to both dataframes, which makes the data analysis in the next notebook easier and plotting more intuitive. Averages are a fairer basis for comparison than totals, since not all players have played the same number of games.

In [12]:
# Per-game averages: each career total divided by career games played.
# Column-wise arithmetic replaces the row-by-row loop and avoids calling
# float() on a one-element Series.
for df in (nba_stats_df, ncaa_stats_df):
    games = df["Career Games"]
    df["PPG"] = df["Career Points"] / games
    df["APG"] = df["Career Assists"] / games
    df["TPG"] = df["Career Turnovers"] / games
    df["BPG"] = df["Career Blocks"] / games
    df["SPG"] = df["Career Steals"] / games
    # Rebounds per game combines offensive and defensive rebounds.
    df["RPG"] = df["Career Off Rebounds"] / games + df["Career Def Rebounds"] / games
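
One caveat when averaging the NCAA data: missing stats were stored as -999, and dividing a sentinel by games played gives a meaningless number. A minimal sketch of one way to handle this, assuming it is acceptable to treat -999 as missing (NaN) before averaging; the toy rows reuse values from the NCAA table above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Trae Young", "Vince Carter"],
    "Career Games": [32.0, 103.0],
    "Career Off Rebounds": [14.0, -999.0],
    "Career Def Rebounds": [111.0, -999.0],
})

# Turn the sentinel into NaN so it propagates instead of skewing averages.
clean = toy.replace(-999.0, np.nan)
clean["RPG"] = (
    clean["Career Off Rebounds"] + clean["Career Def Rebounds"]
) / clean["Career Games"]

# Trae Young: (14 + 111) / 32 = 3.90625; Vince Carter: NaN
print(clean[["Name", "RPG"]])
```

Plotting and summary functions in pandas generally skip NaN values, so players with missing stats drop out of those calculations instead of appearing as large negative outliers.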

Finally, this saves both dataframes to CSV files so that these cells do not need to be run again. Running them may take some time, since they request roughly 800 player dataframes and objects, so loading the CSV files instead is recommended.

In [13]:
nba_stats_df.to_csv('NBA_Data.csv', index = False)
ncaa_stats_df.to_csv('NCAA_Data.csv', index = False)
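
A minimal sketch of reading a saved file back instead of re-running the collection cells. `load_cached` is a hypothetical helper; the demo writes to a temporary directory so it does not touch `NBA_Data.csv` or `NCAA_Data.csv`, and the single demo row reuses a value from the NBA table above.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_cached(path):
    """Return the saved dataframe if the CSV exists, otherwise None."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path)
    return None


# Round-trip demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo = pd.DataFrame({"Name": ["Trae Young"], "Career Games": [141.0]})
    demo.to_csv(demo_path, index=False)

    reloaded = load_cached(demo_path)
    missing = load_cached(Path(tmp) / "not_there.csv")

print(reloaded)
print(missing)  # None
```

In the notebook, a call such as `load_cached('NBA_Data.csv')` could be made first, running the scraping cells only when it returns `None`.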