# Data analytics in Python: Scraping and learning from NBA player statistics

#### Alex Wurm

#### August 2022 
#TODO: Add link to Github Repo

## Introduction

I am a huge fan of the NBA, and basketball in general. If you're coming to this project from my website (#TODO: insert link to my website) you may have picked up on that. Starting in 2020, I began developing a stronger interest in technology and data as well. As part of the work I am doing to gain the skills I believe will be useful for being a founder in the tech world, I set out to marry my data skills to a topic I find interesting. 

For this project, I (i) use web-scraping tools such as BeautifulSoup to collect disparate data from espn.com's player pages, (ii) use Python's analytics and modeling tools to draw some interesting insights from this data, and (iii) use Python's predictive modeling capabilities to make some bold claims about the future of the NBA.

I've left commentary throughout this notebook, as well as checkpoints at which you can export the dataset to play around with it yourself.

If any trades / retirements have occurred since I posted this project which impact the players used in the examples I've included (e.g., LeBron James, Kevin Durant, Stephen Curry), you may need to edit the (TODO: create .py script for users to run to re-scrape this data) in order to replicate the steps detailed below.

If you enjoy this project, let me know! Check out my website (TODO: link) and substack (TODO: link) to get in touch and see more from me.

## Part 1: Webscraping

### Source: https://www.espn.com/nba/

Credit to Erick Lu, whose similar project from 2020 inspired some of the ideas here (https://erilu.github.io/web-scraping-NBA-statistics/)

In [2]:
# import packages to allow regex handling of url subdirectories

import re
import urllib.request
from time import sleep

In [3]:
# create function to compile list of all team roster urls

def scrape_roster_urls():
    # use regex to find all html on ESPN's NBA teams page that points to the teams' rosters
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    teams_content = f.read().decode('utf-8')
    teams_list = dict(re.findall(r'"/nba/team/roster/_/name/(\w+)/(.+?)"', teams_content))

    # generate array of urls to scrape
    roster_urls = []
    for key in teams_list.keys():
        roster_urls.append('https://www.espn.com/nba/team/roster/_/name/' + key + '/' + teams_list[key])
        teams_list[key] = str(teams_list[key])
    return dict(zip(teams_list.values(),roster_urls))

In [4]:
# build dictionary of current nba rosters
nba_rosters = scrape_roster_urls()

# display the dictionary keys
nba_rosters.keys()

dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])
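
For readers who want to see what the regular expression in `scrape_roster_urls` captures without hitting the network, here is a minimal offline sketch. The HTML fragment is hand-written to imitate the link format and is an assumption, not a copy of ESPN's markup.

```python
import re

# Hypothetical fragment imitating the roster links on the teams page.
sample_html = (
    '<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
    '<a href="/nba/team/roster/_/name/bkn/brooklyn-nets">Roster</a>'
    '<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
)

# Each match is an (abbreviation, team-name) tuple.
pairs = re.findall(r'"/nba/team/roster/_/name/(\w+)/(.+?)"', sample_html)

# dict() keeps one entry per abbreviation, so repeated links collapse.
teams = dict(pairs)

base = 'https://www.espn.com/nba/team/roster/_/name/'
roster_urls = {name: base + abbr + '/' + name for abbr, name in teams.items()}

for name, url in roster_urls.items():
    print(name, '->', url)
```

Because `dict()` keeps a single entry per abbreviation, a link that appears several times on the page still produces only one roster URL.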

#### Upon visiting the roster and player pages on ESPN, I realized that the data which was stored in easily accessible JSON back in 2020 is now spread across identical HTML table row tags, meaning re.findall alone isn't a powerful enough tool to extract everything I need from the site. I've therefore replicated the above steps using Beautiful Soup, an HTML parser that helps with more targeted searches in the later steps of this project.

In [5]:
# import beautifulsoup library to help parse the tables where player information is stored
from bs4 import BeautifulSoup, Tag

# create an instance of the beautifulsoup class to parse the page
f = urllib.request.urlopen('http://www.espn.com/nba/teams')
teams_soup = BeautifulSoup(f.read(), 'html.parser')

# define an iterable helper class to pull list of links using regexes
class my_regex_searcher:
    def __init__(self, regex_string):
        self.__r = re.compile(regex_string)
        self.groups = []

    def __call__(self, what):
        if isinstance(what, Tag):
            what = what.name

        if what:
            g = self.__r.findall(what)
            if g:
                self.groups.append(g)
                return True
        return False

    def __iter__(self):
        yield from self.groups

# create instance of regex_searcher for links to roster pages
roster_searcher = my_regex_searcher(r"/nba/team/roster/_/name/(\w+)/(.+)")

# add all roster page links to a dictionary to unpack the regex searcher object
scraped_roster_details = dict(zip(teams_soup.find_all(href=roster_searcher), roster_searcher))

# extract the components of the keys and values in this intermediate dictionary
# and re-zip them together to create the final cleaned dictionary we'll want to use
teams = []
links = []

for value in scraped_roster_details.values():
    teams.append(value[0][1])

for key in scraped_roster_details.keys():
    links.append('https://www.espn.com' + key.get('href'))    

rosters_library = dict(zip(teams,links))

# display the dictionary keys
rosters_library.keys()


dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])
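
The helper class is easier to follow in isolation. The sketch below uses a simplified stand-in (`RegexCollector`, a hypothetical name) and a plain list of href strings instead of Beautiful Soup tags. It is meant to show why each collected value is later indexed as `value[0][1]`.

```python
import re

class RegexCollector:
    """Callable filter that remembers the regex groups of everything it accepts."""

    def __init__(self, pattern):
        self._pattern = re.compile(pattern)
        self.groups = []

    def __call__(self, text):
        found = self._pattern.findall(text) if text else []
        if found:
            self.groups.append(found)
            return True
        return False

collector = RegexCollector(r"/nba/team/roster/_/name/(\w+)/(.+)")

# Hypothetical href values: a team page, a roster page, and a missing href.
hrefs = [
    "/nba/team/_/name/bos/boston-celtics",
    "/nba/team/roster/_/name/bos/boston-celtics",
    None,
]
kept = [h for h in hrefs if collector(h)]

print(kept)
print(collector.groups)
```

Each accepted href contributes one list of tuples, so `collector.groups[i][0]` is the `(abbreviation, team-name)` pair and `collector.groups[i][0][1]` is the team name.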

#### The next step in this process involves collecting player data from each of the 'roster' pages of the 30 NBA teams. To accomplish this pull, we need to iterate through the entire page and pull the set of values associated with each table row (player) across all columns.

#### You'll notice in the page's html that each row is associated with a numerical index, and that the fixed first column of the table is separate from the rest of the scrollable table columns.

In [6]:
# parse table headers
f = urllib.request.urlopen('https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets')
roster_soup = BeautifulSoup(f.read(), 'html.parser')
table_headers = roster_soup.find_all('th', {'class':'Table__TH'})

# convert bs4 result set into string array for regex matching
header_values = []
for x in table_headers:
    header_values.append(str(x))

# extract list of table headers
column_names = []
for x in header_values:

    # Append conditionally to avoid blank spacer block at the top left of the tables
    matches = re.findall(">([a-zA-Z]+?)<", x)
    if len(matches) > 0:
        column_names.append(matches[0])

column_names

['Name', 'POS', 'Age', 'HT', 'WT', 'College', 'Salary']
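
To make the spacer-cell check concrete, here is a small offline sketch. The three header strings are hypothetical and much simpler than ESPN's real markup; the first one imitates the blank cell at the top left of the table.

```python
import re

# Hypothetical header cells; the first imitates the blank spacer column.
sample_headers = [
    '<th class="Table__TH"></th>',
    '<th class="Table__TH"><span>Name</span></th>',
    '<th class="Table__TH"><span>POS</span></th>',
]

header_pattern = re.compile(r'>([a-zA-Z]+?)<')

sample_column_names = []
for cell in sample_headers:
    match = header_pattern.search(cell)
    if match:  # the spacer cell has no text between tags, so nothing matches
        sample_column_names.append(match.group(1))

print(sample_column_names)
```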

#### Now that we have an array to reference for the sets of data we'll be pulling for each player, we can pull the player data.

#### First we'll start with an example for a single player

In [7]:
# parse the row at index 5 of the table
player_one = roster_soup.find_all('tr', {'data-idx': 5})

# extract all key values from columns

# convert bs4 result set into string array for regex matching
p1_values = []
for x in player_one:
    p1_values.append(str(x))

# match to contents of tags

# note that span is specifically excluded because not all players have a number listed, which makes it difficult to create
# same-length arrays of the column headers and the player information. We won't be using the player numbers for any 
# analysis, so it's alright to exclude them from this scrape.

player_stats = re.findall("<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", p1_values[0])

player_stats

['Kevin Durant', 'PF', '33', '6\' 10"', '240 lbs', 'Texas', '$42,018,900']
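
The effect of the `(?!span)` lookahead is easiest to see side by side. The sketch below runs the same pattern with and without it on a hand-written row; the row is a simplified assumption about the markup, with the jersey number held in a `span` as described in the comment above.

```python
import re

# Hypothetical, simplified roster row; real rows carry more attributes.
sample_row = (
    '<tr data-idx="5">'
    '<td class="Table__TD"><a class="AnchorLink" '
    'href="https://www.espn.com/nba/player/_/id/3202/kevin-durant">Kevin Durant</a>'
    '<span class="pl2">7</span></td>'
    '<td class="Table__TD"><div class="inline">PF</div></td>'
    '<td class="Table__TD"><div class="inline">6\' 10"</div></td>'
    '</tr>'
)

with_lookahead = r"<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)"
without_lookahead = r"<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</"

# The lookahead rejects text that is closed by </span>, i.e. the jersey number.
sample_stats = re.findall(with_lookahead, sample_row)
sample_stats_with_number = re.findall(without_lookahead, sample_row)

print(sample_stats)
print(sample_stats_with_number)
```

Without the lookahead the jersey number slips in as an extra value, which would shift every later value one column to the right when paired with the headers.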

#### We'll repeat this process for every player on every NBA team. One thing to note is that not all teams have the same number of players, so we'll either have to (a) create a function to find the max row number for each team, or (b) handle errors for trying to manipulate non-existent html.

#### I opted to go with the latter, so I added try-except logic to handle index errors on the converted bs4 ResultSets.

#### I also added a final step to scraping each player's information, which is to zip it with the column headers to create a dictionary where the column headers are keys and the player stats are the values.

In [8]:
# parse the row at index 15 of the table
player_one = roster_soup.find_all('tr', {'data-idx': 15})

# extract all key values from columns

# convert bs4 result set into string array for regex matching
p1_values = []
for x in player_one:
    p1_values.append(str(x))

# match to contents of tags
try:
    player_stats = re.findall("<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", p1_values[0])
    player_dict = dict(zip(column_names, player_stats))
except IndexError:
    pass

player_dict

{'Name': 'T.J. Warren',
 'POS': 'SF',
 'Age': '28',
 'HT': '6\' 8"',
 'WT': '220 lbs',
 'College': 'NC State',
 'Salary': '$12,690,000'}
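
As an alternative to catching `IndexError`, the pairing step can check for an empty row explicitly. This is only a sketch: `row_to_dict` is a hypothetical helper, not part of the notebook, and the values are copied from the Kevin Durant example above.

```python
column_names = ['Name', 'POS', 'Age', 'HT', 'WT', 'College', 'Salary']

def row_to_dict(values, headers=column_names):
    """Pair scraped values with headers; return None when the row is missing."""
    if not values:
        return None
    return dict(zip(headers, values))

durant = row_to_dict(
    ['Kevin Durant', 'PF', '33', '6\' 10"', '240 lbs', 'Texas', '$42,018,900']
)
missing = row_to_dict([])

# zip() stops at the shorter input, so a short row silently loses columns.
partial = row_to_dict(['Kevin Durant', 'PF'])

print(durant['Salary'], missing, partial)
```

The last case is worth keeping in mind: `zip` never raises on a length mismatch, so a row with fewer values than headers produces a shorter dictionary rather than an error.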

In [9]:
# create function to take a team roster url and collect all player info

def get_player_info(team_roster_url):
    f = urllib.request.urlopen(team_roster_url)
    team_roster_soup = BeautifulSoup(f.read(), 'html.parser')
    
    # Part 1: Create table headers
    table_headers = team_roster_soup.find_all('th', {'class':'Table__TH'})

    # convert bs4 result set into string array for regex matching
    header_values = []
    for x in table_headers:
        header_values.append(str(x))

    # extract list of table headers
    column_names = []
    for x in header_values:

        # Append conditionally to avoid blank spacer block at the top left of the tables
        matches = re.findall(">([a-zA-Z]+?)<", x)
        if len(matches) > 0:
            column_names.append(matches[0])
    
    # Part 2: Create player dictionaries
    roster_dict = dict()

    # Loop through indexes 0-29, which will cover the largest roster size of any NBA team.
    for i in range(0,30):

        # parse corresponding row of table
        player = team_roster_soup.find_all('tr', {'data-idx': i})

        # extract all key values from columns

        # convert bs4 result set into string array for regex matching
        p_values = []
        for x in player:
            p_values.append(str(x))

        # match to contents of tags
        try:
            player_stats = re.findall("<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", p_values[0])
            player_dict = dict(zip(column_names, player_stats))
            roster_dict[player_dict['Name']] = player_dict
        except IndexError:
            pass

    return roster_dict

#### With this new function, we should be able to loop through each of the team's respective roster pages and get all of their player information.

In [10]:
# create master dictionary of teams and player info
all_players = dict()

for team in rosters_library.keys():
    all_players[team] = get_player_info(rosters_library[team])

In [11]:
# test output from newly created all_players dictionary

kd_info = all_players['brooklyn-nets']['Kevin Durant']
kd_info

{'Name': 'Kevin Durant',
 'POS': 'PF',
 'Age': '33',
 'HT': '6\' 10"',
 'WT': '240 lbs',
 'College': 'Texas',
 'Salary': '$42,018,900'}

In [12]:
lebron_info = all_players['los-angeles-lakers']['LeBron James']
lebron_info

{'Name': 'LeBron James',
 'POS': 'SF',
 'Age': '37',
 'HT': '6\' 9"',
 'WT': '250 lbs',
 'College': '--',
 'Salary': '$41,180,544'}

In [13]:
steph_info = all_players['golden-state-warriors']['Stephen Curry']
steph_info

{'Name': 'Stephen Curry',
 'POS': 'PG',
 'Age': '34',
 'HT': '6\' 2"',
 'WT': '185 lbs',
 'College': 'Davidson',
 'Salary': '$45,780,966'}

#### At this point we have created a dictionary of dictionaries

#### The first level of the dictionary maps the teams (keys) to their full rosters (values)

#### The rosters at the second level are themselves dictionaries, mapping the players' names (keys) to their stats (values)
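
A quick way to get a feel for this two-level structure is to walk it and flatten it into one record per player. The sketch below uses a tiny hand-copied subset of the data (only position and age) rather than the full scrape.

```python
# Small subset of the nested structure, with values copied from the outputs above.
sample_players = {
    'brooklyn-nets': {
        'Kevin Durant': {'POS': 'PF', 'Age': '33'},
        'Seth Curry': {'POS': 'SG', 'Age': '31'},
    },
    'golden-state-warriors': {
        'Stephen Curry': {'POS': 'PG', 'Age': '34'},
    },
}

# Outer loop: team -> roster; inner loop: player name -> stats.
records = [
    {'Team': team, 'Name': name, **stats}
    for team, roster in sample_players.items()
    for name, stats in roster.items()
]

for record in records:
    print(record)
```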

In [14]:
# Display list of NBA teams
all_players.keys()

dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])

In [15]:
# Display list of NBA players on a team
all_players['brooklyn-nets'].keys()

dict_keys(['LaMarcus Aldridge', 'Nic Claxton', 'Seth Curry', 'Goran Dragic', 'David Duke Jr.', 'Kevin Durant', 'Blake Griffin', 'Joe Harris', 'Kyrie Irving', 'Patty Mills', "Royce O'Neale", "Day'Ron Sharpe", 'Ben Simmons', 'Edmond Sumner', 'Cam Thomas', 'T.J. Warren', 'Alondes Williams'])

In [16]:
# Display list of stats for an NBA player
all_players['brooklyn-nets']['Kevin Durant']

{'Name': 'Kevin Durant',
 'POS': 'PF',
 'Age': '33',
 'HT': '6\' 10"',
 'WT': '240 lbs',
 'College': 'Texas',
 'Salary': '$42,018,900'}

In [17]:
# Display key stat (e.g., salary) for an NBA player
all_players['brooklyn-nets']['Kevin Durant']['Salary']

'$42,018,900'

#### Now that we have a basic set of information for all NBA players (we'll add more data soon), we'll want to restructure our dataset to make it more conducive for analysis.

#### Pandas dataframes are a clean way to structure data in tabular form for this purpose.

In [18]:
# import pandas library
import pandas as pd

# example converting a team's roster dictionary into a dataframe
bkn = pd.DataFrame.from_dict(all_players['brooklyn-nets'], orient = 'index')
bkn.head(5)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary
LaMarcus Aldridge,LaMarcus Aldridge,C,37,"6' 11""",250 lbs,Texas,"$1,669,178"
Nic Claxton,Nic Claxton,PF,23,"6' 11""",215 lbs,Georgia,"$1,782,621"
Seth Curry,Seth Curry,SG,31,"6' 2""",185 lbs,Duke,"$8,207,518"
Goran Dragic,Goran Dragic,PG,36,"6' 3""",190 lbs,--,"$460,463"
David Duke Jr.,David Duke Jr.,SF,22,"6' 4""",204 lbs,Providence,--


#### Similarly to how we created the all_players dict, we'll need to create a dataframe for each team and roll them all up into one master dataframe for analysis of all NBA players

In [19]:
# initialize empty pandas dataframe
all_players_df = pd.DataFrame()

# loop through each team, creating a pandas dataframe as described above
# and append the records to the all_players_df object
# adding an extra field 'team' to keep track of data sources

for team in all_players:
    roster_df = pd.DataFrame.from_dict(all_players[team], orient = 'index')
    roster_df['Team'] = team
    all_players_df = pd.concat([all_players_df, roster_df])

In [20]:
# Display first 5 records from all_players_df
all_players_df.head(5)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team
Malcolm Brogdon,Malcolm Brogdon,PG,29,"6' 5""",229 lbs,Virginia,"$21,700,000",boston-celtics
Jaylen Brown,Jaylen Brown,SG,25,"6' 6""",223 lbs,California,"$26,758,928",boston-celtics
JD Davison,JD Davison,G,19,"6' 3""",195 lbs,Alabama,--,boston-celtics
Danilo Gallinari,Danilo Gallinari,F,33,"6' 10""",236 lbs,--,"$20,475,000",boston-celtics
Sam Hauser,Sam Hauser,SF,24,"6' 8""",215 lbs,Virginia,"$313,737",boston-celtics


#### At this point we have a complete dataset of basic player information -- we'll still want to add player performance statistics to this dataset to have something interesting to analyze, but for anyone who wants to play around with this initial dataset, you can export it to a csv file below

In [21]:
# all_players_df.to_csv("Aug_2022_NBA_players_data.csv")

#### For simplicity, we'll take players' career averages and add them to our dataframe

#### To do so, we'll need to see what an individual player's page looks like

#### You'll notice the player's career stats are stored in the 'Stats' card on their page, as well as their most recent regular season and postseason stats. 

#### Not all players will have postseason stats (or even regular season stats, in the case of newly drafted rookies). To make sure we only pull career stats, we'll need to check whether the desired stats exist for a player and, if they do, pull the data only from the corresponding row of the table

In [22]:
# parse individual player's page
f = urllib.request.urlopen('https://www.espn.com/nba/player/_/id/3202/kevin-durant')
kd_soup = BeautifulSoup(f.read(), 'html.parser')

# would return an empty bs4 ResultSet object if the player stats card did not exist
kd_stats = kd_soup.find_all('section', {'class':'Card PlayerStats'})

# convert the bs4 resultSet to a string
try:
    kd_stat_card = str(kd_stats[0])
except IndexError:
    kd_stat_card = []

# search the card for career stats record
try:
    row_number = re.findall("data-idx=\"(\d)\"><td class=\"Table__TD\">Career</td>",kd_stat_card)[0]
except (TypeError, IndexError):  # no stats card, or no 'Career' row in the card
    pass

# pull the list of column headers
try:
    card_headers = re.findall("class=\"Table__TH\".+?>(.+?)</th>", kd_stat_card)
except TypeError:
    pass

# pull the list of career stats
try:
    kd_career_stats = re.findall("data-idx=\"{row_number}\">(.+?)</tr>".format(row_number = row_number), kd_stat_card)
except (TypeError, NameError):
    pass

# collect the cell values from the matched career row
try:
    card_data = []
    for row in kd_career_stats:
        stats = re.findall("<td class=\"Table__TD\">(.+?)</td>", row)
        for y in stats:
            card_data.append(str(y))
except (IndexError, TypeError, NameError):
    pass

try:
    kd_dict = dict(zip(card_headers, card_data))
except (TypeError, NameError):
    pass

try:
    kd_dict
except NameError:
    pass
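
The existence check for the career row can also be written without relying on exceptions. The sketch below is an assumption-laden illustration: `find_career_row` is a hypothetical helper, the two HTML strings are hand-written, and the pattern uses `\d+` instead of `\d` so that a two-digit row index would also match.

```python
import re

CAREER_ROW = re.compile(r'data-idx="(\d+)"><td class="Table__TD">Career</td>')

def find_career_row(stat_card_html):
    """Return the data-idx of the 'Career' row as a string, or None if absent."""
    match = CAREER_ROW.search(stat_card_html)
    return match.group(1) if match else None

# Hypothetical stats card for a veteran: a season row followed by a career row.
veteran_card = (
    '<tr data-idx="0"><td class="Table__TD">Regular Season</td></tr>'
    '<tr data-idx="1"><td class="Table__TD">Career</td></tr>'
)
# A rookie with no stats card is modeled here as an empty string.
rookie_card = ''

print(find_career_row(veteran_card))
print(find_career_row(rookie_card))
```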


#### In order to iterate through all players, (a) we'll need to be able to construct the unique URLs for each of their pages, which will require knowing the IDs associated with each player, and (b) we'll need to define a function that can accomplish the above stats pull given that url information, and append it to the all_players_df object.

#### To start, we can extract these IDs from the anchor links attached to the players' names and photos in the tables on the roster pages we analyzed previously.

#### We'll also just grab the full urls while we're at it, since the name formats in the urls are different than those we've already pulled, which would create issues later.

In [23]:
# create a function to take a team roster url and collect all of the player ids

def get_player_ids(team_roster_url):
    f = urllib.request.urlopen(team_roster_url)
    team_roster_soup = BeautifulSoup(f.read(), 'html.parser')

    # create player id dictionaries
    ids_dict = dict()

    # Loop through indexes 0-29, which will cover the largest roster size of any NBA team.
    for i in range(0,30):

        # parse corresponding row of table
        player_id = team_roster_soup.find_all('tr', {'data-idx': i})

        # extract all ids from anchor links

        # convert bs4 result set into string array for regex matching
        id_values = []
        for x in player_id:
            id_values.append(str(x))

        # match to contents of tags
        try:
            player_name = re.findall("<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", id_values[0])[0]
            player_id = re.findall("href=\"https://www.espn.com/nba/player/_/id/(\d+?)/[\w\-]+?\"", id_values[0])[0]
            player_url = re.findall("href=\"(https://www.espn.com/nba/player/_/id/\d+?/[\w\-]+?)\"", id_values[0])[0]
            ids_dict[player_name] = dict({'id': player_id, 'url': player_url})
        except IndexError:
            pass

    return ids_dict
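
Since the id is part of the url, both can be captured in a single pass with named groups. This is a sketch of that idea on a hand-written anchor tag, not a drop-in replacement for the function above; `PLAYER_LINK` is a name introduced here for illustration.

```python
import re

# One pattern, two named groups: the full url and the numeric id inside it.
PLAYER_LINK = re.compile(
    r'href="(?P<url>https://www\.espn\.com/nba/player/_/id/(?P<id>\d+)/[\w\-]+)"'
)

# Hypothetical anchor tag in the format seen on the roster pages.
sample_cell = (
    '<a class="AnchorLink" '
    'href="https://www.espn.com/nba/player/_/id/3202/kevin-durant">Kevin Durant</a>'
)

match = PLAYER_LINK.search(sample_cell)
player_link = {'id': match.group('id'), 'url': match.group('url')}

print(player_link)
```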

#### With a function to collect all player ids, we can create another dictionary for all players, convert it to a dataframe, and join it with our existing all_players_df object

In [24]:
# create a new dictionary to hold all player ids

all_player_ids = dict()

# populate this new dictionary with the ids of all players across every NBA team

for team in rosters_library.keys():
    all_player_ids[team] = get_player_ids(rosters_library[team])

In [25]:
# display select values in all_player_ids dictionary

kd_id = all_player_ids['brooklyn-nets']['Kevin Durant']
kd_id

{'id': '3202', 'url': 'https://www.espn.com/nba/player/_/id/3202/kevin-durant'}

In [26]:
lebron_id = all_player_ids['los-angeles-lakers']['LeBron James']
lebron_id

{'id': '1966', 'url': 'https://www.espn.com/nba/player/_/id/1966/lebron-james'}

In [27]:
steph_id = all_player_ids['golden-state-warriors']['Stephen Curry']
steph_id

{'id': '3975',
 'url': 'https://www.espn.com/nba/player/_/id/3975/stephen-curry'}

In [28]:
# initialize empty pandas dataframe for ids
all_player_ids_df = pd.DataFrame()

# loop through each team, creating a pandas dataframe
# and append the records to the all_player_ids_df object

for team in all_player_ids:
    roster_ids_df = pd.DataFrame.from_dict(all_player_ids[team], orient = 'index')
    all_player_ids_df = pd.concat([all_player_ids_df, roster_ids_df])

In [29]:
# Display first 5 records from all_player_ids_df
all_player_ids_df.head(5)

Unnamed: 0,id,url
Malcolm Brogdon,2566769,https://www.espn.com/nba/player/_/id/2566769/m...
Jaylen Brown,3917376,https://www.espn.com/nba/player/_/id/3917376/j...
JD Davison,4576085,https://www.espn.com/nba/player/_/id/4576085/j...
Danilo Gallinari,3428,https://www.espn.com/nba/player/_/id/3428/dani...
Sam Hauser,4065804,https://www.espn.com/nba/player/_/id/4065804/s...


In [30]:
# update the all_players_df object with ids

# note: there are no two players currently in the NBA with the exact same first and last name,
# and it is unlikely there will be in the future. If this situation did occur, we would need to use
# the pandas.merge function and specify the name column AND another identifying column (e.g., team)
# rather than simply joining on the indexes, which in this case are also the names of the players
all_players_df = all_players_df.join(all_player_ids_df)
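
For reference, the fallback mentioned in the note could look roughly like this. The two players below are invented, and the sketch assumes a `Team` column has also been added to the ids dataframe, which the notebook does not currently do.

```python
import pandas as pd

# Hypothetical rows: two different players who happen to share a name.
players = pd.DataFrame({
    'Name': ['Sam Example', 'Sam Example'],
    'Team': ['team-a', 'team-b'],
    'Age': [24, 31],
})
ids = pd.DataFrame({
    'Name': ['Sam Example', 'Sam Example'],
    'Team': ['team-a', 'team-b'],
    'id': ['111', '222'],
})

# Merging on name alone pairs every 'Sam Example' with every other one.
name_only = players.merge(ids, on='Name')

# Merging on name AND team keeps one row per player; validate raises if it cannot.
merged = players.merge(ids, on=['Name', 'Team'], how='left', validate='one_to_one')

print(len(name_only), len(merged))
```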

In [31]:
# Display first 5 records from the updated all_players_df
all_players_df.head(5)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team,id,url
Malcolm Brogdon,Malcolm Brogdon,PG,29,"6' 5""",229 lbs,Virginia,"$21,700,000",boston-celtics,2566769,https://www.espn.com/nba/player/_/id/2566769/m...
Jaylen Brown,Jaylen Brown,SG,25,"6' 6""",223 lbs,California,"$26,758,928",boston-celtics,3917376,https://www.espn.com/nba/player/_/id/3917376/j...
JD Davison,JD Davison,G,19,"6' 3""",195 lbs,Alabama,--,boston-celtics,4576085,https://www.espn.com/nba/player/_/id/4576085/j...
Danilo Gallinari,Danilo Gallinari,F,33,"6' 10""",236 lbs,--,"$20,475,000",boston-celtics,3428,https://www.espn.com/nba/player/_/id/3428/dani...
Sam Hauser,Sam Hauser,SF,24,"6' 8""",215 lbs,Virginia,"$313,737",boston-celtics,4065804,https://www.espn.com/nba/player/_/id/4065804/s...


#### Now that we finally have a complete dataframe with unique ids, we can go back and scrape all player pages for their career stats

In [32]:
# create a function that takes a player page url and scrapes a player's stats, adding them to a dictionary

def get_player_stats(player_url):
    # parse individual player's page
    f = urllib.request.urlopen(player_url)
    player_soup = BeautifulSoup(f.read(), 'html.parser')

    # would return an empty bs4 ResultSet object if the player stats card did not exist
    player_stats = player_soup.find_all('section', {'class':'Card PlayerStats'})

    # convert the bs4 resultSet to a string
    try:
        player_stat_card = str(player_stats[0])
    except IndexError:
        player_stat_card = []

    # search the card for career stats record
    try:
        row_number = re.findall("data-idx=\"(\d)\"><td class=\"Table__TD\">Career</td>",player_stat_card)[0]
    except (TypeError, IndexError):  # no stats card, or no 'Career' row in the card
        pass

    # pull the list of column headers
    try:
        card_headers = re.findall("class=\"Table__TH\".+?>(.+?)</th>", player_stat_card)
    except TypeError:
        pass

    # pull the list of career stats
    try:
        player_career_stats = re.findall("data-idx=\"{row_number}\">(.+?)</tr>".format(row_number = row_number), player_stat_card)
    except (TypeError, UnboundLocalError) :
        pass

    # collect the cell values from the matched career row
    try:
        card_data = []
        for row in player_career_stats:
            stats = re.findall("<td class=\"Table__TD\">(.+?)</td>", row)
            for y in stats:
                card_data.append(str(y))
    except (IndexError, TypeError, UnboundLocalError):
        pass

    try:
        player_dict = dict(zip(card_headers, card_data))
    except (TypeError, UnboundLocalError):
        pass

    try:
        return player_dict
    except UnboundLocalError:
        # no stats were found for this player, so return an empty dictionary
        return dict()

In [33]:
# create a function that takes a dataframe with player names as indexes and uses the above stats-
# scraping function to return a dictionary of career avg. stats for every player in that dataframe

def compile_all_stats(players_dataframe):

    career_stats_dict = dict()

    for player, info in players_dataframe.iterrows():
        player_url = info["url"]
        pstats_dict = get_player_stats(player_url)
        career_stats_dict[player] = pstats_dict
    
    return career_stats_dict

In [54]:
# compile player career stats dictionary by scraping every NBA player's page
player_stats_dict = compile_all_stats(all_players_df)

#### If you are running this code locally you'll notice this last step takes considerably longer than the rest of the steps in this project. That difference is because we are scraping each NBA player's web page individually. Because there are ~15 players per team in the NBA, that is more than an order of magnitude greater than the number of pages we need to scrape for any roster page-level data.
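
With several hundred requests in a row, it can be worth pausing between pages and retrying the occasional failure. `sleep` is imported at the top of the notebook but none of the cells shown here call it. The sketch below is one possible shape for a throttled loop; `fetch_all`, its parameters, and the fake fetcher are all hypothetical, and the fetch function is passed in so the example runs without network access.

```python
from time import sleep

def fetch_all(urls, fetch, delay_seconds=1.0, retries=2):
    """Fetch each url in turn, pausing between requests and retrying failures."""
    results = {}
    for url in urls:
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except OSError:  # urllib.error.URLError is a subclass of OSError
                if attempt == retries:
                    results[url] = None  # give up on this page
                else:
                    sleep(delay_seconds)
        sleep(delay_seconds)  # pause before moving on to the next page
    return results

# Stand-in fetcher that fails once for one url, then succeeds.
calls = []

def fake_fetch(url):
    calls.append(url)
    if url == 'flaky' and calls.count(url) == 1:
        raise OSError('temporary failure')
    return 'page for ' + url

pages = fetch_all(['a', 'flaky'], fake_fetch, delay_seconds=0)
print(pages)
```

In the notebook, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`, with a delay chosen to suit how patient you are.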

In [53]:
# display select entries from the stats dictionary

kd_stats = player_stats_dict['Kevin Durant']
kd_stats

{'Stats': 'Career',
 'GP': '939',
 'MIN': '36.8',
 'FG%': '49.6',
 '3P%': '38.4',
 'FT%': '88.4',
 'REB': '7.1',
 'AST': '4.3',
 'BLK': '1.1',
 'STL': '1.1',
 'PF': '1.9',
 'TO': '3.2',
 'PTS': '27.2'}

In [35]:
lebron_stats = player_stats_dict['LeBron James']
lebron_stats

{'Stats': 'Career',
 'GP': '1366',
 'MIN': '38.2',
 'FG%': '50.5',
 '3P%': '34.6',
 'FT%': '73.4',
 'REB': '7.5',
 'AST': '7.4',
 'BLK': '0.8',
 'STL': '1.6',
 'PF': '1.8',
 'TO': '3.5',
 'PTS': '27.1'}

In [36]:
steph_stats = player_stats_dict['Stephen Curry']
steph_stats

{'Stats': 'Career',
 'GP': '826',
 'MIN': '34.3',
 'FG%': '47.3',
 '3P%': '42.8',
 'FT%': '90.8',
 'REB': '4.6',
 'AST': '6.5',
 'BLK': '0.2',
 'STL': '1.7',
 'PF': '2.4',
 'TO': '3.1',
 'PTS': '24.3'}

#### Finally, we'll convert this player-level dictionary to a dataframe and join it to our all_players_df object as we did with the ids and urls.

In [37]:
# create final dataframe to join with existing player-level data

all_player_stats_df = pd.DataFrame.from_dict(player_stats_dict, orient = 'index')

In [38]:
# join the all_players_df and all_player_stats_df objects
all_players_df = all_players_df.join(all_player_stats_df)

In [39]:
# display part of the complete dataset
all_players_df.head(5)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team,id,url,...,FG%,3P%,FT%,REB,AST,BLK,STL,PF,TO,PTS
Malcolm Brogdon,Malcolm Brogdon,PG,29,"6' 5""",229 lbs,Virginia,"$21,700,000",boston-celtics,2566769,https://www.espn.com/nba/player/_/id/2566769/m...,...,46.4,37.6,88.1,4.2,4.8,0.2,0.9,1.9,1.8,15.5
Jaylen Brown,Jaylen Brown,SG,25,"6' 6""",223 lbs,California,"$26,758,928",boston-celtics,3917376,https://www.espn.com/nba/player/_/id/3917376/j...,...,47.3,37.3,71.2,4.9,2.0,0.4,0.9,2.5,1.9,16.5
JD Davison,JD Davison,G,19,"6' 3""",195 lbs,Alabama,--,boston-celtics,4576085,https://www.espn.com/nba/player/_/id/4576085/j...,...,,,,,,,,,,
Danilo Gallinari,Danilo Gallinari,F,33,"6' 10""",236 lbs,--,"$20,475,000",boston-celtics,3428,https://www.espn.com/nba/player/_/id/3428/dani...,...,42.8,38.2,87.7,4.8,1.9,0.4,0.7,1.8,1.2,15.6
Sam Hauser,Sam Hauser,SF,24,"6' 8""",215 lbs,Virginia,"$313,737",boston-celtics,4065804,https://www.espn.com/nba/player/_/id/4065804/s...,...,46.0,43.2,0.0,1.1,0.4,0.1,0.0,0.3,0.1,2.5


## We now have a complete dataset of biographical information and career stats for every player in the NBA!!!

#### We have some cleaning to do, but in case you are interested in playing around with this raw dataset, you can save it as a local csv below.

In [40]:
# all_players_df.to_csv("Aug_2022_NBA_players_full_dataset_raw.csv")

#### To calculate statistics from this dataset, we will need to convert each of the stats and player biometrics above to numerical values rather than strings

In [41]:
# display salaries in default format

all_players_df['Salary'].head(5)

Malcolm Brogdon     $21,700,000
Jaylen Brown        $26,758,928
JD Davison                   --
Danilo Gallinari    $20,475,000
Sam Hauser             $313,737
Name: Salary, dtype: object

In [42]:
# convert empty salaries to 0s
all_players_df['Salary'] = [re.sub(r'--', '$0', x) if isinstance(x, str) else x for x in all_players_df['Salary'].values]


# convert salaries to numerical values using list comprehension
all_players_df['Salary'] = [int(re.sub(r'[^\d]+', '', x)) if isinstance(x, str) else x for x in all_players_df['Salary'].values]


In [43]:
# display salaries in cleaned format

all_players_df['Salary'].head(5)

Malcolm Brogdon     21700000.0
Jaylen Brown        26758928.0
JD Davison                 0.0
Danilo Gallinari    20475000.0
Sam Hauser            313737.0
Name: Salary, dtype: float64
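
One design choice to be aware of: mapping `--` to 0 means players without a listed salary will pull down any average computed later. A possible alternative, sketched below with a hypothetical `parse_salary` helper, is to keep missing salaries as `None` so they can be excluded explicitly.

```python
import re

def parse_salary(text):
    """Return the salary as an int, or None when no digits are present ('--')."""
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None

# Values copied from the salary output above.
raw_salaries = ['$21,700,000', '$26,758,928', '--', '$20,475,000', '$313,737']
parsed_salaries = [parse_salary(s) for s in raw_salaries]

known = [s for s in parsed_salaries if s is not None]
print(parsed_salaries)
print(sum(known) / len(known))
```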

In [101]:
# display ages in default format

all_players_df['Age'].head(5)

Malcolm Brogdon     29
Jaylen Brown        25
JD Davison          19
Danilo Gallinari    33
Sam Hauser          24
Name: Age, dtype: object

In [102]:
# convert empty ages to 0s
all_players_df['Age'] = [re.sub(r'--', '0', x) if isinstance(x, str) else x for x in all_players_df['Age'].values]


# convert ages to numerical values using list comprehension
all_players_df['Age'] = [int(x) if isinstance(x, str) else x for x in all_players_df['Age'].values]


In [105]:
# display ages in cleaned format

all_players_df['Age'].head(5)

Malcolm Brogdon     29
Jaylen Brown        25
JD Davison          19
Danilo Gallinari    33
Sam Hauser          24
Name: Age, dtype: int64

In [104]:
# display heights in default format

all_players_df['HT'].head(5)

Malcolm Brogdon      6' 5"
Jaylen Brown         6' 6"
JD Davison           6' 3"
Danilo Gallinari    6' 10"
Sam Hauser           6' 8"
Name: HT, dtype: object

In [45]:
# define a function that takes a string in ft' in" format and converts to total inches as a numerical value

def convert_height(height):
    height_splits = height.split()
    feet = float(height_splits[0].replace("\'",""))
    inches = float(height_splits[1].replace("\"",""))
    return (12*feet + inches)
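
`convert_height` assumes every value looks like `6' 10"`. If a height were ever listed as `--`, as some colleges and salaries are, the split would fail. The variant below is a sketch of a more defensive version; `height_to_inches` is a hypothetical name and returning `None` for unparseable input is my assumption about what would be convenient.

```python
import re

# feet, an apostrophe, optional whitespace, inches, a double quote
HEIGHT = re.compile(r'''(\d+)'\s*(\d+)"''')

def height_to_inches(text):
    """Convert a ft' in" string to total inches, or return None if it doesn't parse."""
    match = HEIGHT.fullmatch(text.strip())
    if match is None:
        return None
    feet, inches = (int(group) for group in match.groups())
    return 12 * feet + inches

print(height_to_inches('6\' 10"'))
print(height_to_inches('6\' 5"'))
print(height_to_inches('--'))
```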

In [46]:
# convert heights to numerical values using list comprehension

all_players_df['HT'] = [float(convert_height(x)) if isinstance(x, str) else x for x in all_players_df['HT'].values]

In [47]:
# display heights in cleaned format

all_players_df['HT'].head(5)

Malcolm Brogdon     77.0
Jaylen Brown        78.0
JD Davison          75.0
Danilo Gallinari    82.0
Sam Hauser          80.0
Name: HT, dtype: float64

In [48]:
# display weights in default format

all_players_df['WT'].head(5)

Malcolm Brogdon     229 lbs
Jaylen Brown        223 lbs
JD Davison          195 lbs
Danilo Gallinari    236 lbs
Sam Hauser          215 lbs
Name: WT, dtype: object

In [49]:
# next, convert weights to numerical values with list comprehension

all_players_df['WT'] = [float(x.split(" ")[0]) if isinstance(x, str) else x for x in all_players_df['WT'].values]
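
An alternative sketch using pandas string methods, which also copes with missing entries. The `weights` Series is hypothetical sample data, assuming the raw column looks like `229 lbs`.

```python
import pandas as pd

# hypothetical sample mimicking the raw scraped column, including one missing value
weights = pd.Series(['229 lbs', '223 lbs', None],
                    index=['Player A', 'Player B', 'Player C'])

# pull out the leading digits and cast to float; the missing entry stays NaN
weights_clean = weights.str.extract(r'(\d+)', expand=False).astype(float)

print(weights_clean)
```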

In [50]:
# display weights in cleaned format

all_players_df['WT'].head(5)

Malcolm Brogdon     229.0
Jaylen Brown        223.0
JD Davison          195.0
Danilo Gallinari    236.0
Sam Hauser          215.0
Name: WT, dtype: float64

In [106]:
# at this point, we should check the remaining columns to make sure they are the types we expect

all_players_df.dtypes

Name        object
POS         object
Age          int64
HT         float64
WT         float64
College     object
Salary     float64
Team        object
id          object
url         object
Stats       object
GP         float64
MIN        float64
FG%        float64
3P%        float64
FT%        float64
REB        float64
AST        float64
BLK        float64
STL        float64
PF         float64
TO         float64
PTS        float64
dtype: object

#### We realize that all of the career stats we pulled are of type 'object' (i.e., recognized as strings) rather than ints or floats. Luckily we can apply a pretty quick transformation to each of these columns to cast them to the appropriate type.

In [100]:
# define dictionary of desired columns and types
stat_types = {
    'GP': 'float64',
    'MIN': 'float64',
    'FG%': 'float64',
    '3P%': 'float64',
    'FT%': 'float64',
    'REB': 'float64',
    'AST': 'float64',
    'BLK': 'float64',
    'STL': 'float64',
    'PF': 'float64',
    'TO': 'float64',
    'PTS': 'float64'
}

# convert all stats columns to floats using the above dictionary (astype returns a new dataframe, so we reassign it)
all_players_df = all_players_df.astype(dtype=stat_types, copy=False, errors='raise')
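
With `errors='raise'`, the conversion stops at the first value that cannot be parsed. If the stats columns might contain placeholders (the `--` here is an assumption, borrowed from the `Age` column), one alternative is `pd.to_numeric` with `errors='coerce'`, which turns unparseable values into NaN. A minimal sketch on made-up data:

```python
import pandas as pd

# hypothetical sample: stats stored as strings, with '--' for missing values
stats_df = pd.DataFrame({'GP': ['82', '10', '--'],
                         'PTS': ['27.1', '--', '3.4']})

stat_cols = ['GP', 'PTS']

# convert each column to numbers; anything unparseable becomes NaN
stats_df[stat_cols] = stats_df[stat_cols].apply(pd.to_numeric, errors='coerce')

print(stats_df.dtypes)
```

The trade-off is that bad values are silently replaced rather than reported, so it is worth counting the NaNs afterwards.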

#### At this point, all of our metrics of interest should be in numerical form throughout the dataframe.

In [98]:
# re-check field types

all_players_df.dtypes

Name        object
POS         object
Age         object
HT         float64
WT         float64
College     object
Salary     float64
Team        object
id          object
url         object
Stats       object
GP         float64
MIN        float64
FG%        float64
3P%        float64
FT%        float64
REB        float64
AST        float64
BLK        float64
STL        float64
PF         float64
TO         float64
PTS        float64
dtype: object

#### I'll save this file one more time before analysis for anyone interested in playing around with it. Locally, you can also read in the data from this csv to your own pandas dataframe to conduct analysis.

In [51]:
# export cleaned dataset as csv
# all_players_df.to_csv('Aug_2022_NBA_players_full_dataset_cleaned.csv')

# read in the dataset to dataframe from a csv
#your_df = pd.read_csv('Aug_2022_NBA_players_full_dataset_cleaned.csv', index_col=0)
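
The `index_col=0` argument matters because `to_csv` writes the player-name index as the first column. A small round-trip sketch, using an in-memory buffer and made-up rows instead of a file on disk:

```python
import io
import pandas as pd

# hypothetical two-row dataframe indexed by player name
df = pd.DataFrame({'PTS': [27.1, 3.4]}, index=['Player A', 'Player B'])

# write to an in-memory buffer rather than a real file
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```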

## Part 2: Data Analysis


#### There are many different statistics we could compute from this dataset. For this project, I'll aim to avoid any basic queries that you could answer by going to espn.com and using the built-in sorts and filters.

#### First, let's take a quick look at team-wide stats

In [108]:
# find the mean values of each metric for every team
# (numeric_only=True drops the non-numeric columns; recent pandas versions raise an error without it)

team_means_df = all_players_df.groupby(by='Team').mean(numeric_only=True)

team_means_df.head(5)


| Team | Age | HT | WT | Salary | GP | MIN | FG% | 3P% | FT% | REB | AST | BLK | STL | PF | TO | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| atlanta-hawks | 25.65 | 78.0 | 211.35 | 6328549.0 | 326.944444 | 20.838889 | 45.438889 | 29.622222 | 73.166667 | 3.838889 | 2.016667 | 0.411111 | 0.65 | 1.711111 | 1.144444 | 9.577778 |
| boston-celtics | 26.125 | 78.8125 | 220.375 | 10284610.0 | 290.133333 | 20.566667 | 44.006667 | 32.34 | 68.766667 | 3.586667 | 2.02 | 0.533333 | 0.693333 | 1.606667 | 1.02 | 9.38 |
| brooklyn-nets | 28.235294 | 78.058824 | 217.411765 | 10308410.0 | 446.8125 | 25.2125 | 48.775 | 32.73125 | 76.21875 | 4.45625 | 2.65 | 0.46875 | 0.8 | 2.0125 | 1.5125 | 12.78125 |
| charlotte-hornets | 24.526316 | 79.052632 | 211.368421 | 5795680.0 | 259.352941 | 18.570588 | 42.988235 | 26.429412 | 66.164706 | 3.323529 | 1.964706 | 0.376471 | 0.635294 | 1.588235 | 1.076471 | 8.564706 |
| chicago-bulls | 25.666667 | 78.777778 | 219.0 | 7275821.0 | 310.647059 | 21.9 | 48.3 | 34.282353 | 71.4 | 4.511765 | 1.941176 | 0.458824 | 0.717647 | 1.847059 | 1.211765 | 9.394118 |


In [109]:
# find the sum values of each metric for every team
# (numeric_only=True drops the non-numeric columns instead of concatenating their strings)

team_totals_df = all_players_df.groupby(by='Team').sum(numeric_only=True)

team_totals_df

| Team | Age | HT | WT | Salary | GP | MIN | FG% | 3P% | FT% | REB | AST | BLK | STL | PF | TO | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| atlanta-hawks | 513 | 1560.0 | 4227.0 | 126570986.0 | 5885.0 | 375.1 | 817.9 | 533.2 | 1317.0 | 69.1 | 36.3 | 7.4 | 11.7 | 30.8 | 20.6 | 172.4 |
| boston-celtics | 418 | 1261.0 | 3526.0 | 164553816.0 | 4352.0 | 308.5 | 660.1 | 485.1 | 1031.5 | 53.8 | 30.3 | 8.0 | 10.4 | 24.1 | 15.3 | 140.7 |
| brooklyn-nets | 480 | 1327.0 | 3696.0 | 175242957.0 | 7149.0 | 403.4 | 780.4 | 523.7 | 1219.5 | 71.3 | 42.4 | 7.5 | 12.8 | 32.2 | 24.2 | 204.5 |
| charlotte-hornets | 466 | 1502.0 | 4016.0 | 110117912.0 | 4409.0 | 315.7 | 730.8 | 449.3 | 1124.8 | 56.5 | 33.4 | 6.4 | 10.8 | 27.0 | 18.3 | 145.6 |
| chicago-bulls | 462 | 1418.0 | 3942.0 | 130964783.0 | 5281.0 | 372.3 | 821.1 | 582.8 | 1213.8 | 76.7 | 33.0 | 7.8 | 12.2 | 31.4 | 20.6 | 159.7 |
| cleveland-cavaliers | 648 | 1958.0 | 5382.0 | 154225452.0 | 7190.0 | 480.8 | 964.1 | 580.6 | 1556.2 | 89.8 | 53.8 | 8.7 | 15.1 | 37.5 | 28.9 | 202.9 |
| dallas-mavericks | 462 | 1338.0 | 3677.0 | 121510554.0 | 5353.0 | 327.9 | 693.5 | 494.8 | 1239.5 | 57.6 | 32.4 | 6.6 | 9.0 | 27.1 | 18.2 | 148.6 |
| denver-nuggets | 502 | 1479.0 | 4164.0 | 115137949.0 | 6380.0 | 349.3 | 745.8 | 541.7 | 1205.8 | 72.0 | 35.2 | 7.3 | 11.5 | 30.4 | 20.5 | 161.1 |
| detroit-pistons | 564 | 1879.0 | 5117.0 | 97375817.0 | 4916.0 | 425.3 | 873.2 | 639.9 | 1448.2 | 77.6 | 38.9 | 9.5 | 13.9 | 38.6 | 23.9 | 177.7 |
| golden-state-warriors | 496 | 1480.0 | 3976.0 | 181117346.0 | 5963.0 | 349.2 | 739.3 | 533.1 | 1241.8 | 60.8 | 35.7 | 6.8 | 12.0 | 32.7 | 22.3 | 160.7 |


#### I think the 'sum' view is more interesting than the 'mean' view for team-wide stats, so let's stick with this dataframe for a minute

#### Obviously, because these are career stats, they don't perfectly reflect each team's current situation. For example, teams with veterans in their final years in the league may get a big boost to career stats but not see that production on the court in the upcoming season. That said, the purpose of this project is to have a little fun and make some interesting observations, not to definitively rank teams or players based on their stats.

#### Let's take a look at how the top and bottom few teams stack up in some of the core stats in basketball: points, rebounds, and assists

In [110]:
# show top 5 teams by career scoring averages

team_totals_df.sort_values('PTS',ascending=False)['PTS'].head(5)

Team
los-angeles-lakers        231.9
brooklyn-nets             204.5
cleveland-cavaliers       202.9
portland-trail-blazers    178.2
detroit-pistons           177.7
Name: PTS, dtype: float64

In [111]:
# show bottom 5 teams by career scoring averages
team_totals_df.sort_values('PTS',ascending=False)['PTS'].tail(5)

Team
boston-celtics       140.7
new-york-knicks      137.4
memphis-grizzlies    133.8
indiana-pacers       132.2
san-antonio-spurs     92.2
Name: PTS, dtype: float64

In [112]:
# show top 5 teams by career rebounding averages
team_totals_df.sort_values('REB',ascending=False)['REB'].head(5)

Team
cleveland-cavaliers       89.8
los-angeles-lakers        89.0
detroit-pistons           77.6
portland-trail-blazers    77.4
chicago-bulls             76.7
Name: REB, dtype: float64

In [113]:
# show bottom 5 teams by career rebounding averages
team_totals_df.sort_values('REB',ascending=False)['REB'].tail(5)

Team
new-york-knicks      53.9
boston-celtics       53.8
utah-jazz            53.5
memphis-grizzlies    52.8
san-antonio-spurs    42.9
Name: REB, dtype: float64

In [114]:
# show top 5 teams by career assist averages
team_totals_df.sort_values('AST',ascending=False)['AST'].head(5)

Team
cleveland-cavaliers       53.8
los-angeles-lakers        49.0
brooklyn-nets             42.4
portland-trail-blazers    41.7
phoenix-suns              41.3
Name: AST, dtype: float64

In [115]:
# show bottom 5 teams by career assist averages
team_totals_df.sort_values('AST',ascending=False)['AST'].tail(5)

Team
indiana-pacers        29.1
washington-wizards    29.1
utah-jazz             28.8
memphis-grizzlies     26.9
san-antonio-spurs     16.9
Name: AST, dtype: float64

#### Looking at these lists there are a few things that become immediately apparent

1. The differences between the top and bottom teams in each of these stats seem quite dramatic. The top teams are ~2X the bottom teams in most cases. We know the top teams' totals must be inflated by players who are past their prime, as no NBA team scores anywhere close to 200 ppg. With veteran players like Carmelo Anthony, Dwight Howard, and Russell Westbrook on the Los Angeles Lakers, we can see how this total is possible.
2. Speaking of the Los Angeles Lakers - they crack the top 5 in all 3 of these categories -- the only team to do so. Clearly given their performance in the '21-'22 season, cumulative career avg. stats wouldn't be a good way to estimate a team's winning percentage 😂
3. The San Antonio Spurs are last in each of these categories... by a lot.
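
Rather than comparing the three lists by eye, the "top 5 in every category" check can be done programmatically. Here is a minimal sketch using `nlargest` and a set intersection; the `totals` dataframe, its team names, and its numbers are made up for illustration, and `top_n` is set to 2 only because the sample is tiny.

```python
import pandas as pd

# hypothetical team totals (not the scraped values)
totals = pd.DataFrame(
    {'PTS': [200.0, 180.0, 90.0, 150.0],
     'REB': [85.0, 60.0, 40.0, 70.0],
     'AST': [50.0, 45.0, 15.0, 20.0]},
    index=['team-a', 'team-b', 'team-c', 'team-d'],
)

top_n = 2

# one set of top-n team names per stat
top_sets = [set(totals[col].nlargest(top_n).index) for col in ['PTS', 'REB', 'AST']]

# teams that appear in every one of those sets
in_all = set.intersection(*top_sets)

print(in_all)
```

Swapping `nlargest` for `nsmallest` gives the equivalent check for the bottom of each list.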

#### My hunch is that the Spurs have the youngest / least experienced team in the league by a good margin. Let's test that hunch.

In [116]:
# show top 5 teams by cumulative age
team_totals_df.sort_values('Age',ascending=False)['Age'].head(5)

Team
los-angeles-lakers        664
cleveland-cavaliers       648
portland-trail-blazers    619
detroit-pistons           564
oklahoma-city-thunder     531
Name: Age, dtype: int64

In [117]:
# show bottom 5 teams by cumulative age
team_totals_df.sort_values('Age',ascending=False)['Age'].tail(5)

Team
orlando-magic        444
san-antonio-spurs    432
boston-celtics       418
sacramento-kings     405
new-york-knicks      388
Name: Age, dtype: int64

#### I was wrong! There are several teams with a lower cumulative age than the Spurs (an approximate comparison, given the varying number of players on each roster), including the reigning Eastern Conference Champion Boston Celtics.
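
Because rosters differ in size, cumulative age can mislead: a big roster of young players can total more than a small roster of veterans. One way to account for that is to look at the total, the roster size, and the mean side by side. A minimal sketch on made-up data (the `players` dataframe and team names are hypothetical):

```python
import pandas as pd

# hypothetical rosters: team-a has three young players, team-b two veterans
players = pd.DataFrame({
    'Team': ['team-a', 'team-a', 'team-a', 'team-b', 'team-b'],
    'Age': [20, 21, 22, 30, 34],
})

# named aggregation: total age, number of players, and mean age per team
age_summary = players.groupby('Team')['Age'].agg(
    total='sum', roster_size='count', mean_age='mean'
)

print(age_summary.sort_values('mean_age'))
```

In this toy example the two teams have nearly the same cumulative age (63 vs. 64), yet their mean ages are 21 and 32.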

#### It would seem that age alone is not enough of an explanation for the lack of career stats on that Spurs team. We also know there can be more churn amongst younger players in the league (new draft picks, G-League promotions/demotions, etc.). Perhaps these Spurs players just haven't played in the league quite as much?

In [118]:
# show top 5 teams by cumulative games played
team_totals_df.sort_values('GP',ascending=False)['GP'].head(5)

Team
los-angeles-lakers     10344.0
milwaukee-bucks         7248.0
cleveland-cavaliers     7190.0
brooklyn-nets           7149.0
la-clippers             6706.0
Name: GP, dtype: float64

In [119]:
# show bottom 5 teams by cumulative games played
team_totals_df.sort_values('GP',ascending=False)['GP'].tail(5)

Team
new-york-knicks          3460.0
indiana-pacers           3198.0
orlando-magic            2700.0
oklahoma-city-thunder    2679.0
san-antonio-spurs        2473.0
Name: GP, dtype: float64

#### And a quick check of the teams ranked by games played gives us the answer -- the San Antonio Spurs roster has cumulatively played very few games; < 1/4 as many as the Los Angeles Lakers roster!

#### Just for our own understanding, exactly how far below the other teams in the league are the Spurs? We can compute a few quick stats to find out.

In [133]:
# Compute the mean and standard deviation of team career PTS, REB, and AST, along with the Spurs' totals
avg_pts = round(team_totals_df['PTS'].mean(),2)
stdev_pts = round(team_totals_df['PTS'].std(),2)
spurs_pts = round(team_totals_df.loc['san-antonio-spurs']['PTS'],2)
avg_reb = round(team_totals_df['REB'].mean(),2)
stdev_reb = round(team_totals_df['REB'].std(),2)
spurs_reb = round(team_totals_df.loc['san-antonio-spurs']['REB'],2)
avg_ast = round(team_totals_df['AST'].mean(),2)
stdev_ast = round(team_totals_df['AST'].std(),2)
spurs_ast = round(team_totals_df.loc['san-antonio-spurs']['AST'],2)


print('AVG career PTS/game for a team is {pts} with a STDEV of {pts_dev}'.format(pts=avg_pts,pts_dev=stdev_pts))
print('The Spurs total career PTS/game is {pts}, {devs} stdevs from avg.\n'.format(pts=spurs_pts, devs=round((spurs_pts - avg_pts)/stdev_pts,2)))
print('AVG career REB/game for a team is {reb} with a STDEV of {reb_dev}'.format(reb=avg_reb,reb_dev=stdev_reb))
print('The Spurs total career REB/game is {pts}, {devs} stdevs from avg.\n'.format(pts=spurs_reb, devs=round((spurs_reb - avg_reb)/stdev_reb,2)))
print('AVG career AST/game for a team is {ast} with a STDEV of {ast_dev}'.format(ast=avg_ast,ast_dev=stdev_ast))
print('The Spurs total career AST/game is {pts}, {devs} stdevs from avg.\n'.format(pts=spurs_ast, devs=round((spurs_ast - avg_ast)/stdev_ast,2)))


AVG career PTS/game for a team is 158.98 with a STDEV of 25.18
The Spurs total career PTS/game is 92.2, -2.65 stdevs from avg.

AVG career REB/game for a team is 64.7 with a STDEV of 10.54
The Spurs total career REB/game is 42.9, -2.07 stdevs from avg.

AVG career AST/game for a team is 34.2 with a STDEV of 6.87
The Spurs total career AST/game is 16.9, -2.52 stdevs from avg.



#### Assuming a normal distribution, the Spurs' 'team career stats' sit roughly 2 to 2.7 standard deviations below the league average in all three categories. Values that far from the mean are rare, so one could make a case for treating the Spurs as outliers and excluding them from any league-wide analysis of 'team career stats'. Luckily, that's not something we'll dive into here.
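
To put a rough number on how unusual the Spurs' scoring total is, the z-score can be converted into a tail probability under the normal assumption. This sketch plugs in the PTS values printed above and uses the standard library's `statistics.NormalDist`; with only 30 teams the normal assumption is a loose approximation at best.

```python
from statistics import NormalDist

# league mean, standard deviation, and Spurs total for PTS, as printed above
avg_pts, stdev_pts, spurs_pts = 158.98, 25.18, 92.2

# number of standard deviations the Spurs sit from the league mean
z = (spurs_pts - avg_pts) / stdev_pts

# probability of a value at least this far below the mean under a standard normal
tail_prob = NormalDist().cdf(z)

print(round(z, 2), round(tail_prob, 4))
```

Under these assumptions, well under 1% of teams would be expected to land this far below the average.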

### I think that's enough looking at team-wide stats to get an idea of the types of questions we can answer with the data that we've scraped. Now let's turn to individual player stats, and get into some more interesting analyses!