# Scraping ESPN NBA Rosters (https://www.espn.com/nba/teams)

Credit to Erick Lu for some ideas in a similar project back in 2020 (https://erilu.github.io/web-scraping-NBA-statistics/)

In [112]:
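# import packages for regex matching and for fetching pages over HTTP
# (urllib.request must be imported explicitly; a bare "import urllib" does not guarantee the submodule is loaded)

import re
import urllib.request
from time import sleep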

In [113]:
# create function to compile list of all team roster urls

def scrape_roster_urls():
    # use regex to find all links on ESPN's NBA teams page that point to the teams' rosters
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    teams_content = f.read().decode('utf-8')
    # each match is an (abbreviation, team name) pair, e.g. ('bos', 'boston-celtics')
    teams_list = dict(re.findall(r'"/nba/team/roster/_/name/(\w+)/(.+?)"', teams_content))

    # map each team name to its full roster url
    roster_urls = {}
    for abbreviation, team_name in teams_list.items():
        roster_urls[team_name] = 'https://www.espn.com/nba/team/roster/_/name/' + abbreviation + '/' + team_name
    return roster_urls

In [114]:
# build dictionary of current nba rosters
nba_rosters = scrape_roster_urls()

# display the dictionary
nba_rosters

{'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'minnesota-timberwolves': 'https://www.espn

#### When I visited the roster and player pages on ESPN, I found that the data formerly stored in easily accessible JSON is now spread across many identically structured HTML components, so re.findall alone is not a precise enough tool to extract everything I need. I've therefore replicated the steps above using Beautiful Soup, an HTML parser that supports the more targeted searches required in later steps.
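
#### To make the contrast concrete, here is a minimal sketch of parser-based link extraction. It uses Python's built-in html.parser as a stand-in for Beautiful Soup so that it runs without third-party packages. The sample_html snippet and the RosterLinkCollector class are made up for illustration and are not ESPN's actual markup.

```python
import re
from html.parser import HTMLParser

# Hypothetical snippet imitating the anchors on the teams page (not real ESPN markup)
sample_html = """
<div><a href="/nba/team/_/name/bos/boston-celtics">Boston Celtics</a>
<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a></div>
<div><a href="/nba/team/roster/_/name/bkn/brooklyn-nets">Roster</a></div>
"""

ROSTER_HREF = re.compile(r"^/nba/team/roster/_/name/(\w+)/([\w-]+)$")

class RosterLinkCollector(HTMLParser):
    """Collect roster links by inspecting only the href attribute of <a> tags."""

    def __init__(self):
        super().__init__()
        self.rosters = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = ROSTER_HREF.match(href)
        if match:
            abbreviation, team_slug = match.groups()
            self.rosters[team_slug] = "https://www.espn.com" + href

collector = RosterLinkCollector()
collector.feed(sample_html)
print(collector.rosters)
```

#### Because the pattern is applied to one attribute of one tag at a time, it cannot accidentally match across neighboring elements, which is the main risk when running a regex over the whole page source.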

In [115]:
# import beautifulsoup library to help parse the tables where player information is stored
from bs4 import BeautifulSoup, Tag

# create an instance of the beautifulsoup class to parse the page
f = urllib.request.urlopen('http://www.espn.com/nba/teams')
teams_soup = BeautifulSoup(f.read(), 'html.parser')

# define an iterable helper class to pull list of links using regexes
class my_regex_searcher:
    def __init__(self, regex_string):
        self.__r = re.compile(regex_string)
        self.groups = []

    def __call__(self, what):
        if isinstance(what, Tag):
            what = what.name

        if what:
            g = self.__r.findall(what)
            if g:
                self.groups.append(g)
                return True
        return False

    def __iter__(self):
        yield from self.groups

# create instance of regex_searcher for links to roster pages
roster_searcher = my_regex_searcher(r"/nba/team/roster/_/name/(\w+)/(.+)")

# add all roster page links to a dictionary to unpack the regex searcher object
scraped_roster_details = dict(zip(teams_soup.find_all(href=roster_searcher), roster_searcher))

# extract the components of the keys and values in this intermediate dictionary
# and re-zip them together to create the final cleaned dictionary we'll want to use
teams = []
links = []

for value in scraped_roster_details.values():
    teams.append(value[0][1])

for key in scraped_roster_details.keys():
    links.append('https://www.espn.com' + key.get('href'))    

rosters_library = dict(zip(teams,links))

# display the dictionary
rosters_library


{'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'minnesota-timberwolves': 'https://www.espn

#### The next step is to collect player data from the roster page of each of the 30 NBA teams. To do this, we need to walk through each page's roster table and pull the values in every column for each table row (player).

#### You'll notice in the page's HTML that each row carries a numerical index (the data-idx attribute), and that the fixed first column of the table is separate from the rest of the scrollable table columns.

In [116]:
# parse table headers
f = urllib.request.urlopen('https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets')
roster_soup = BeautifulSoup(f.read(), 'html.parser')
table_headers = roster_soup.find_all('th', {'class':'Table__TH'})

# convert bs4 result set into string array for regex matching
header_values = []
for x in table_headers:
    header_values.append(str(x))

# extract list of table headers
column_names = []
for x in header_values:

    # append conditionally to skip the blank spacer cell at the top left of the table
    matches = re.findall(r">([a-zA-Z]+?)<", x)
    if matches:
        column_names.append(matches[0])

column_names

['Name', 'POS', 'Age', 'HT', 'WT', 'College', 'Salary']

#### Now that we have an array to reference for the sets of data we'll be pulling for each player, we can pull the player data.

#### First we'll start with an example for a single player

In [117]:
# parse a single row of the table (row index 5)
player_one = roster_soup.find_all('tr', {'data-idx': 5})

# extract all key values from columns

# convert bs4 result set into string array for regex matching
p1_values = []
for x in player_one:
    p1_values.append(str(x))

# match to contents of tags

# note that span is specifically excluded because not all players have a number listed, which makes it difficult to create
# same-length arrays of the column headers and the player information. We won't be using the player numbers for any 
# analysis, so it's alright to exclude them from this scrape.

player_stats = re.findall(r"<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", p1_values[0])

player_stats

['Kevin Durant', 'PF', '33', '6\' 10"', '240 lbs', 'Texas', '$42,018,900']

#### We'll repeat this process for every player on every NBA team. Note that not all teams have the same number of players, so we'll have to either (a) create a function to find the max row number for each team, or (b) handle the errors raised when we try to manipulate non-existent HTML.

#### I opted for the latter, so I added try-except logic to handle index errors on the converted bs4 ResultSets.

#### I also added a final step to scraping each player's information: zipping it with the column headers to create a dictionary in which the column headers are the keys and the player stats are the values.
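
#### Both options can be sketched on a small example. The HTML fragment below is a hypothetical, heavily simplified stand-in for a roster table, and parse_row is an illustrative helper rather than part of the scraper itself. Option (a) assumes the row indexes are contiguous from 0.

```python
import re

# Hypothetical, simplified roster rows (real ESPN markup is more verbose)
sample_rows = (
    '<tr data-idx="0"><td>Seth Curry</td><td>SG</td></tr>'
    '<tr data-idx="1"><td>Kevin Durant</td><td>PF</td></tr>'
    '<tr data-idx="2"><td>Joe Harris</td><td>SF</td></tr>'
)
sample_columns = ["Name", "POS"]

# Option (a): find the highest row index first, then loop over exactly that range
row_indexes = [int(i) for i in re.findall(r'data-idx="(\d+)"', sample_rows)]
max_row = max(row_indexes) if row_indexes else -1

def parse_row(html, idx, columns):
    """Return a {column: value} dict for one row, or None if the row is missing."""
    row = re.search(r'<tr data-idx="%d">(.*?)</tr>' % idx, html)
    if row is None:
        return None
    cells = re.findall(r"<td>(.*?)</td>", row.group(1))
    return dict(zip(columns, cells))

players_a = [parse_row(sample_rows, i, sample_columns) for i in range(max_row + 1)]

# Option (b): loop past the end of the table and skip rows that do not exist
players_b = []
for i in range(30):
    parsed = parse_row(sample_rows, i, sample_columns)
    if parsed is not None:
        players_b.append(parsed)

print(players_a == players_b)
print(players_a[1])
```

#### Option (b) does a little wasted work on the missing indexes but needs no extra pass over the page, which is why it is convenient here.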

In [118]:
# parse a row near the end of the table (row index 15)
player_one = roster_soup.find_all('tr', {'data-idx': 15})

# extract all key values from columns

# convert bs4 result set into string array for regex matching
p1_values = []
for x in player_one:
    p1_values.append(str(x))

# match to contents of tags
try:
    player_stats = re.findall(r"<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", p1_values[0])
    player_dict = dict(zip(column_names, player_stats))
except IndexError:
    # no row with this index exists on the page
    player_dict = dict()

player_dict

{'Name': 'T.J. Warren',
 'POS': 'SF',
 'Age': '28',
 'HT': '6\' 8"',
 'WT': '220 lbs',
 'College': 'NC State',
 'Salary': '$12,690,000'}

In [119]:
# create function to take a team roster url and collect all player info

def get_player_info(team_roster_url):
    f = urllib.request.urlopen(team_roster_url)
    team_roster_soup = BeautifulSoup(f.read(), 'html.parser')

    # Part 1: Create table headers
    table_headers = team_roster_soup.find_all('th', {'class':'Table__TH'})

    # extract list of table headers from the string form of each header tag
    column_names = []
    for x in table_headers:

        # append conditionally to skip the blank spacer cell at the top left of the table
        matches = re.findall(r">([a-zA-Z]+?)<", str(x))
        if matches:
            column_names.append(matches[0])

    # Part 2: Create player dictionaries
    roster_dict = dict()

    # Loop through indexes 0-29, which covers the largest roster size of any NBA team.
    for i in range(0,30):

        # parse corresponding row of table
        player = team_roster_soup.find_all('tr', {'data-idx': i})

        # convert bs4 result set into string array for regex matching
        p_values = []
        for x in player:
            p_values.append(str(x))

        # match to contents of tags
        try:
            player_stats = re.findall(r"<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", p_values[0])
            player_dict = dict(zip(column_names, player_stats))
            roster_dict[player_dict['Name']] = player_dict
        except (IndexError, KeyError):
            # IndexError: no row with this index; KeyError: the row produced no 'Name' value
            pass

    return roster_dict

#### With this new function, we should be able to loop through each team's roster page and collect all of its player information.

In [120]:
# create master dictionary of teams and player info
all_players = dict()

for team in rosters_library.keys():
    all_players[team] = get_player_info(rosters_library[team])

In [121]:
# test output from newly created all_players dictionary

all_players

{'boston-celtics': {'Malcolm Brogdon': {'Name': 'Malcolm Brogdon',
   'POS': 'PG',
   'Age': '29',
   'HT': '6\' 5"',
   'WT': '229 lbs',
   'College': 'Virginia',
   'Salary': '$21,700,000'},
  'Jaylen Brown': {'Name': 'Jaylen Brown',
   'POS': 'SG',
   'Age': '25',
   'HT': '6\' 6"',
   'WT': '223 lbs',
   'College': 'California',
   'Salary': '$26,758,928'},
  'JD Davison': {'Name': 'JD Davison',
   'POS': 'G',
   'Age': '19',
   'HT': '6\' 3"',
   'WT': '195 lbs',
   'College': 'Alabama',
   'Salary': '--'},
  'Danilo Gallinari': {'Name': 'Danilo Gallinari',
   'POS': 'F',
   'Age': '33',
   'HT': '6\' 10"',
   'WT': '236 lbs',
   'College': '--',
   'Salary': '$20,475,000'},
  'Sam Hauser': {'Name': 'Sam Hauser',
   'POS': 'SF',
   'Age': '24',
   'HT': '6\' 8"',
   'WT': '215 lbs',
   'College': 'Virginia',
   'Salary': '$313,737'},
  'Al Horford': {'Name': 'Al Horford',
   'POS': 'C',
   'Age': '36',
   'HT': '6\' 9"',
   'WT': '240 lbs',
   'College': 'Florida',
   'Salary': '$

#### At this point we have created a dictionary of dictionaries.

#### The first level of the dictionary maps the teams (keys) to their full rosters (values).

#### The second level (each roster) is itself a dictionary, mapping the players' names (keys) to their stats (values).
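
#### As a small illustration of that shape, the sketch below walks both levels of a two-team excerpt and flattens it into one record per player. The sample_players excerpt reuses a few values from the output above; the variable names are only for this example.

```python
# Two-team excerpt using values shown in the scraped output
sample_players = {
    "boston-celtics": {
        "Al Horford": {"Name": "Al Horford", "POS": "C", "Age": "36"},
        "Jaylen Brown": {"Name": "Jaylen Brown", "POS": "SG", "Age": "25"},
    },
    "brooklyn-nets": {
        "Kevin Durant": {"Name": "Kevin Durant", "POS": "PF", "Age": "33"},
    },
}

# Walk both levels and emit one flat record per player, tagging each with its team
flat_records = [
    {"Team": team, **stats}
    for team, roster in sample_players.items()
    for stats in roster.values()
]

for record in flat_records:
    print(record)
```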

In [122]:
# Display list of NBA teams
all_players.keys()

dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])

In [123]:
# Display list of NBA players on a team
all_players['brooklyn-nets'].keys()

dict_keys(['LaMarcus Aldridge', 'Nic Claxton', 'Seth Curry', 'Goran Dragic', 'David Duke Jr.', 'Kevin Durant', 'Blake Griffin', 'Joe Harris', 'Kyrie Irving', 'Patty Mills', "Royce O'Neale", "Day'Ron Sharpe", 'Ben Simmons', 'Edmond Sumner', 'Cam Thomas', 'T.J. Warren', 'Alondes Williams'])

In [124]:
# Display list of stats for an NBA player
all_players['brooklyn-nets']['Kevin Durant']

{'Name': 'Kevin Durant',
 'POS': 'PF',
 'Age': '33',
 'HT': '6\' 10"',
 'WT': '240 lbs',
 'College': 'Texas',
 'Salary': '$42,018,900'}

In [125]:
# Display key stat (e.g., salary) for an NBA player
all_players['brooklyn-nets']['Kevin Durant']['Salary']

'$42,018,900'

#### Now that we have a basic set of information for all NBA players (we'll add more data soon), we'll want to restructure our dataset to make it easier to analyze.

#### Pandas dataframes are a clean way to structure data in tabular form for this purpose.

In [126]:
# import pandas library
import pandas as pd

# example converting a team's roster dictionary into a dataframe
bkn = pd.DataFrame.from_dict(all_players['brooklyn-nets'], orient = 'index')
bkn

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary
LaMarcus Aldridge,LaMarcus Aldridge,C,37,"6' 11""",250 lbs,Texas,"$1,669,178"
Nic Claxton,Nic Claxton,PF,23,"6' 11""",215 lbs,Georgia,"$1,782,621"
Seth Curry,Seth Curry,SG,31,"6' 2""",185 lbs,Duke,"$8,207,518"
Goran Dragic,Goran Dragic,PG,36,"6' 3""",190 lbs,--,"$460,463"
David Duke Jr.,David Duke Jr.,SF,22,"6' 4""",204 lbs,Providence,--
Kevin Durant,Kevin Durant,PF,33,"6' 10""",240 lbs,Texas,"$42,018,900"
Blake Griffin,Blake Griffin,PF,33,"6' 9""",250 lbs,Oklahoma,"$1,669,178"
Joe Harris,Joe Harris,SF,30,"6' 6""",220 lbs,Virginia,"$17,357,143"
Kyrie Irving,Kyrie Irving,PG,30,"6' 2""",195 lbs,Duke,"$35,328,700"
Patty Mills,Patty Mills,PG,33,"6' 0""",180 lbs,Saint Mary's,"$5,890,000"


#### Just as we did when building the all_players dict, we'll create a dataframe for each team and roll them all up into one master dataframe for analysis of all NBA players.

In [127]:
# initialize empty pandas dataframe
all_players_df = pd.DataFrame()

# loop through each team, creating a pandas dataframe as described above
# and append the records to the all_players_df object,
# adding an extra field 'Team' to keep track of data sources

for team in all_players:
    roster_df = pd.DataFrame.from_dict(all_players[team], orient = 'index')
    roster_df['Team'] = team
    all_players_df = pd.concat([all_players_df, roster_df])

In [128]:
# Display first 10 records from all_players_df
all_players_df.head(10)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team
Malcolm Brogdon,Malcolm Brogdon,PG,29,"6' 5""",229 lbs,Virginia,"$21,700,000",boston-celtics
Jaylen Brown,Jaylen Brown,SG,25,"6' 6""",223 lbs,California,"$26,758,928",boston-celtics
JD Davison,JD Davison,G,19,"6' 3""",195 lbs,Alabama,--,boston-celtics
Danilo Gallinari,Danilo Gallinari,F,33,"6' 10""",236 lbs,--,"$20,475,000",boston-celtics
Sam Hauser,Sam Hauser,SF,24,"6' 8""",215 lbs,Virginia,"$313,737",boston-celtics
Al Horford,Al Horford,C,36,"6' 9""",240 lbs,Florida,"$27,000,000",boston-celtics
Mfiondu Kabengele,Mfiondu Kabengele,C,24,"6' 10""",250 lbs,Florida State,"$1,701,593",boston-celtics
Luke Kornet,Luke Kornet,F,27,"7' 2""",250 lbs,Vanderbilt,"$565,986",boston-celtics
Payton Pritchard,Payton Pritchard,PG,24,"6' 1""",195 lbs,Oregon,"$2,137,440",boston-celtics
Matt Ryan,Matt Ryan,F,25,"6' 7""",215 lbs,Chattanooga,--,boston-celtics


In [129]:
# Display last 10 records from all_players_df
all_players_df.tail(10)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team
Jakob Poeltl,Jakob Poeltl,C,26,"7' 1""",245 lbs,Utah,"$8,750,000",san-antonio-spurs
Joshua Primo,Joshua Primo,SG,19,"6' 5""",190 lbs,Alabama,"$3,946,800",san-antonio-spurs
Josh Richardson,Josh Richardson,SG,28,"6' 5""",200 lbs,Tennessee,"$11,615,328",san-antonio-spurs
Isaiah Roby,Isaiah Roby,PF,24,"6' 8""",230 lbs,Nebraska,"$1,782,621",san-antonio-spurs
Jeremy Sochan,Jeremy Sochan,F,19,"6' 8""",225 lbs,Baylor,--,san-antonio-spurs
D.J. Stewart Jr.,D.J. Stewart Jr.,F,23,"6' 6""",205 lbs,Mississippi State,--,san-antonio-spurs
Devin Vassell,Devin Vassell,SG,21,"6' 5""",200 lbs,Florida State,"$4,235,160",san-antonio-spurs
Blake Wesley,Blake Wesley,G,19,"6' 5""",185 lbs,Notre Dame,--,san-antonio-spurs
Joe Wieskamp,Joe Wieskamp,SF,22,"6' 6""",205 lbs,Iowa,"$202,068",san-antonio-spurs
Robert Woodard II,Robert Woodard II,SF,22,"6' 7""",230 lbs,Mississippi State,"$1,517,981",san-antonio-spurs


#### At this point we have a complete dataset of basic player information. We'll still want to add player performance statistics to have something interesting to analyze, but anyone who wants to play around with this initial dataset can export it to a csv file by uncommenting the line below.

In [130]:
# all_players_df.to_csv("Aug_2022_NBA_players_data.csv")
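
#### Every value scraped so far is a string (for example '$42,018,900' or '240 lbs'), so anyone analyzing the exported data will probably want numeric columns first. The helpers below are a hypothetical sketch of that cleanup using only the standard library; their names are not part of the notebook.

```python
import re

def parse_salary(text):
    """Turn a salary string into an integer number of dollars, or None if absent."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None

def parse_height_inches(text):
    """Turn feet-and-inches text into total inches, or None if it does not parse."""
    match = re.fullmatch(r"(\d+)'\s*(\d+)\"", text.strip())
    if match is None:
        return None
    feet, inches = (int(part) for part in match.groups())
    return feet * 12 + inches

def parse_weight_lbs(text):
    """Turn text such as '240 lbs' into an integer, or None if it does not parse."""
    match = re.fullmatch(r"(\d+)\s*lbs", text.strip())
    return int(match.group(1)) if match else None

# Values copied from the Kevin Durant row shown earlier
print(parse_salary("$42,018,900"))       # 42018900
print(parse_height_inches("6' 10\""))    # 82
print(parse_weight_lbs("240 lbs"))       # 240
print(parse_salary("--"))                # None
```

#### With pandas, one way to apply these would be column by column, e.g. all_players_df['Salary'].map(parse_salary).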

#### For simplicity, we'll take players' career averages and add them to our dataframe

#### To do so, we'll need to see what an individual player's page looks like

#### You'll notice the player's career stats are stored in the 'Stats' card on their page, as well as their most recent regular season and postseason stats. 

#### Not all players will have postseason stats (or even regular season stats, in the case of newly drafted rookies). To make sure we only pull career stats, we'll need to check whether the desired stats exist for a player and, if they do, pull the data only from the corresponding row of the table.

In [131]:
# parse individual player's page
f = urllib.request.urlopen('https://www.espn.com/nba/player/_/id/3202/kevin-durant')
kd_soup = BeautifulSoup(f.read(), 'html.parser')

# returns an empty bs4 ResultSet if the player stats card does not exist
kd_stats = kd_soup.find_all('section', {'class':'Card PlayerStats'})

# convert the first matching card to a string (empty string if there is no card)
try:
    kd_stat_card = str(kd_stats[0])
except IndexError:
    kd_stat_card = ''

# search the card for the index of the career stats row
try:
    row_number = re.findall(r"data-idx=\"(\d+)\"><td class=\"Table__TD\">Career</td>", kd_stat_card)[0]
except IndexError:
    row_number = None

# pull the list of column headers
card_headers = re.findall(r"class=\"Table__TH\".+?>(.+?)</th>", kd_stat_card)

# pull the html of the career stats row
kd_career_stats = []
if row_number is not None:
    kd_career_stats = re.findall(r"data-idx=\"{row_number}\">(.+?)</tr>".format(row_number = row_number), kd_stat_card)

# extract the individual cell values from the row
card_data = []
for x in kd_career_stats:
    card_data.extend(re.findall(r"<td class=\"Table__TD\">(.+?)</td>", x))

# zip headers with values; the result is empty when the player has no career row
kd_dict = dict(zip(card_headers, card_data))

kd_dict


{'Stats': 'Career',
 'GP': '939',
 'MIN': '36.8',
 'FG%': '49.6',
 '3P%': '38.4',
 'FT%': '88.4',
 'REB': '7.1',
 'AST': '4.3',
 'BLK': '1.1',
 'STL': '1.1',
 'PF': '1.9',
 'TO': '3.2',
 'PTS': '27.2'}

#### In order to iterate through all players, (a) we'll need to construct the unique URL for each player's page, which requires knowing the ID associated with each player, and (b) we'll need to define a function that performs the stats pull above for a given URL so that the results can be added to the all_players_df object.

#### To start, we can extract these IDs from the anchor links on the players' names and photos in the tables on the roster pages we analyzed previously.

#### We'll also grab the full URLs while we're at it, since the name formats in the URLs differ from the names we've already pulled, and reconstructing them would create issues later.
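
#### As a minimal sketch of the extraction, the snippet below pulls the ID, the URL name, and the full URL out of a single anchor tag. The URL format matches the player pages used in this notebook, but the surrounding tag (including its class attribute) is a simplified assumption.

```python
import re

PLAYER_URL = re.compile(r"https://www\.espn\.com/nba/player/_/id/(\d+)/([\w-]+)")

# Simplified, hypothetical anchor tag wrapping a player's name
sample_anchor = (
    '<a class="AnchorLink" href="https://www.espn.com/nba/player/_/id/3213/al-horford">'
    "Al Horford</a>"
)

match = PLAYER_URL.search(sample_anchor)
player_id, url_slug = match.groups()
player_url = match.group(0)

print(player_id)
print(url_slug)
print(player_url)

# Rebuilding the URL name from the display name is fragile: a naive conversion
# keeps characters such as apostrophes that the URL pattern above does not allow
display_name = "Day'Ron Sharpe"
naive_slug = display_name.lower().replace(" ", "-")
print(naive_slug, re.fullmatch(r"[\w-]+", naive_slug) is not None)
```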

In [132]:
# create a function to take a team roster url and collect all of the player ids

def get_player_ids(team_roster_url):
    f = urllib.request.urlopen(team_roster_url)
    team_roster_soup = BeautifulSoup(f.read(), 'html.parser')

    # create player id dictionaries
    ids_dict = dict()

    # Loop through indexes 0-29, which covers the largest roster size of any NBA team.
    for i in range(0,30):

        # parse corresponding row of table
        player_rows = team_roster_soup.find_all('tr', {'data-idx': i})

        # convert bs4 result set into string array for regex matching
        id_values = []
        for x in player_rows:
            id_values.append(str(x))

        # extract the player's name, id, and url from the row's anchor links
        try:
            player_name = re.findall(r"<.+?\">([a-zA-Z0-9$;,\'\"\s\.\-\&]{1,25}?)</(?!span)", id_values[0])[0]
            player_id = re.findall(r"href=\"https://www\.espn\.com/nba/player/_/id/(\d+?)/[\w\-]+?\"", id_values[0])[0]
            player_url = re.findall(r"href=\"(https://www\.espn\.com/nba/player/_/id/\d+?/[\w\-]+?)\"", id_values[0])[0]
            ids_dict[player_name] = {'id': player_id, 'url': player_url}
        except IndexError:
            # no row with this index, or the row has no player link
            pass

    return ids_dict

#### With a function to collect all player ids, we can create another dictionary for all players, convert it to a dataframe, and join it with our existing all_players_df object

In [133]:
# create a new dictionary to hold all player ids

all_player_ids = dict()

# populate this new dictionary with the ids of all players across every NBA team

for team in rosters_library.keys():
    all_player_ids[team] = get_player_ids(rosters_library[team])

In [134]:
# display all_player_ids dictionary

all_player_ids

{'boston-celtics': {'Malcolm Brogdon': {'id': '2566769',
   'url': 'https://www.espn.com/nba/player/_/id/2566769/malcolm-brogdon'},
  'Jaylen Brown': {'id': '3917376',
   'url': 'https://www.espn.com/nba/player/_/id/3917376/jaylen-brown'},
  'JD Davison': {'id': '4576085',
   'url': 'https://www.espn.com/nba/player/_/id/4576085/jd-davison'},
  'Danilo Gallinari': {'id': '3428',
   'url': 'https://www.espn.com/nba/player/_/id/3428/danilo-gallinari'},
  'Sam Hauser': {'id': '4065804',
   'url': 'https://www.espn.com/nba/player/_/id/4065804/sam-hauser'},
  'Al Horford': {'id': '3213',
   'url': 'https://www.espn.com/nba/player/_/id/3213/al-horford'},
  'Mfiondu Kabengele': {'id': '4065660',
   'url': 'https://www.espn.com/nba/player/_/id/4065660/mfiondu-kabengele'},
  'Luke Kornet': {'id': '3064560',
   'url': 'https://www.espn.com/nba/player/_/id/3064560/luke-kornet'},
  'Payton Pritchard': {'id': '4066354',
   'url': 'https://www.espn.com/nba/player/_/id/4066354/payton-pritchard'},
  'M

In [135]:
# initialize empty pandas dataframe for ids
all_player_ids_df = pd.DataFrame()

# loop through each team, creating a pandas dataframe
# and append the records to the all_player_ids_df object

for team in all_player_ids:
    roster_ids_df = pd.DataFrame.from_dict(all_player_ids[team], orient = 'index')
    all_player_ids_df = pd.concat([all_player_ids_df, roster_ids_df])

In [136]:
# Display first 10 records from all_player_ids_df
all_player_ids_df.head(10)

Unnamed: 0,id,url
Malcolm Brogdon,2566769,https://www.espn.com/nba/player/_/id/2566769/m...
Jaylen Brown,3917376,https://www.espn.com/nba/player/_/id/3917376/j...
JD Davison,4576085,https://www.espn.com/nba/player/_/id/4576085/j...
Danilo Gallinari,3428,https://www.espn.com/nba/player/_/id/3428/dani...
Sam Hauser,4065804,https://www.espn.com/nba/player/_/id/4065804/s...
Al Horford,3213,https://www.espn.com/nba/player/_/id/3213/al-h...
Mfiondu Kabengele,4065660,https://www.espn.com/nba/player/_/id/4065660/m...
Luke Kornet,3064560,https://www.espn.com/nba/player/_/id/3064560/l...
Payton Pritchard,4066354,https://www.espn.com/nba/player/_/id/4066354/p...
Matt Ryan,3908336,https://www.espn.com/nba/player/_/id/3908336/m...


In [137]:
# Display last 10 records from all_player_ids_df
all_player_ids_df.tail(10)

Unnamed: 0,id,url
Jakob Poeltl,3134908,https://www.espn.com/nba/player/_/id/3134908/j...
Joshua Primo,4702176,https://www.espn.com/nba/player/_/id/4702176/j...
Josh Richardson,2581190,https://www.espn.com/nba/player/_/id/2581190/j...
Isaiah Roby,4066392,https://www.espn.com/nba/player/_/id/4066392/i...
Jeremy Sochan,4610139,https://www.espn.com/nba/player/_/id/4610139/j...
D.J. Stewart Jr.,4396960,https://www.espn.com/nba/player/_/id/4396960/d...
Devin Vassell,4395630,https://www.espn.com/nba/player/_/id/4395630/d...
Blake Wesley,4683935,https://www.espn.com/nba/player/_/id/4683935/b...
Joe Wieskamp,4397033,https://www.espn.com/nba/player/_/id/4397033/j...
Robert Woodard II,4396961,https://www.espn.com/nba/player/_/id/4396961/r...


In [138]:
# update the all_players_df object with ids

# note: there are no two players currently in the NBA with the exact same first and last name,
# and it is unlikely there will be in the future. If this situation did occur, we would need to use
# the pandas.merge function and specify the name column AND another identifying column (e.g., team)
# rather than simply joining on the indexes, which in this case are also the names of the players
all_players_df = all_players_df.join(all_player_ids_df)
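
#### If two players ever did share a name, the idea in the note above could be sketched as follows, using plain dictionaries and made-up records. The names, teams, and ids here are invented. The pandas equivalent would be pd.merge with on=['Name', 'Team'], which would also require adding a Team column to all_player_ids_df.

```python
# Hypothetical records: two different players who share a name on different teams
info_rows = [
    {"Name": "Sam Example", "Team": "team-a", "POS": "PG"},
    {"Name": "Sam Example", "Team": "team-b", "POS": "C"},
]
id_rows = [
    {"Name": "Sam Example", "Team": "team-a", "id": "1001"},
    {"Name": "Sam Example", "Team": "team-b", "id": "2002"},
]

# Index the ids by (team, name) so that identical names cannot collide
ids_by_key = {(row["Team"], row["Name"]): row["id"] for row in id_rows}

# Attach the matching id to each info record using the same composite key
merged = [
    {**row, "id": ids_by_key.get((row["Team"], row["Name"]))}
    for row in info_rows
]

for row in merged:
    print(row)
```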

In [139]:
# Display first 10 records from the updated all_players_df
all_players_df.head(10)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team,id,url
Malcolm Brogdon,Malcolm Brogdon,PG,29,"6' 5""",229 lbs,Virginia,"$21,700,000",boston-celtics,2566769,https://www.espn.com/nba/player/_/id/2566769/m...
Jaylen Brown,Jaylen Brown,SG,25,"6' 6""",223 lbs,California,"$26,758,928",boston-celtics,3917376,https://www.espn.com/nba/player/_/id/3917376/j...
JD Davison,JD Davison,G,19,"6' 3""",195 lbs,Alabama,--,boston-celtics,4576085,https://www.espn.com/nba/player/_/id/4576085/j...
Danilo Gallinari,Danilo Gallinari,F,33,"6' 10""",236 lbs,--,"$20,475,000",boston-celtics,3428,https://www.espn.com/nba/player/_/id/3428/dani...
Sam Hauser,Sam Hauser,SF,24,"6' 8""",215 lbs,Virginia,"$313,737",boston-celtics,4065804,https://www.espn.com/nba/player/_/id/4065804/s...
Al Horford,Al Horford,C,36,"6' 9""",240 lbs,Florida,"$27,000,000",boston-celtics,3213,https://www.espn.com/nba/player/_/id/3213/al-h...
Mfiondu Kabengele,Mfiondu Kabengele,C,24,"6' 10""",250 lbs,Florida State,"$1,701,593",boston-celtics,4065660,https://www.espn.com/nba/player/_/id/4065660/m...
Luke Kornet,Luke Kornet,F,27,"7' 2""",250 lbs,Vanderbilt,"$565,986",boston-celtics,3064560,https://www.espn.com/nba/player/_/id/3064560/l...
Payton Pritchard,Payton Pritchard,PG,24,"6' 1""",195 lbs,Oregon,"$2,137,440",boston-celtics,4066354,https://www.espn.com/nba/player/_/id/4066354/p...
Matt Ryan,Matt Ryan,F,25,"6' 7""",215 lbs,Chattanooga,--,boston-celtics,3908336,https://www.espn.com/nba/player/_/id/3908336/m...


In [140]:
# Display last 10 records from the updated all_players_df
all_players_df.tail(10)

Unnamed: 0,Name,POS,Age,HT,WT,College,Salary,Team,id,url
Jakob Poeltl,Jakob Poeltl,C,26,"7' 1""",245 lbs,Utah,"$8,750,000",san-antonio-spurs,3134908,https://www.espn.com/nba/player/_/id/3134908/j...
Joshua Primo,Joshua Primo,SG,19,"6' 5""",190 lbs,Alabama,"$3,946,800",san-antonio-spurs,4702176,https://www.espn.com/nba/player/_/id/4702176/j...
Josh Richardson,Josh Richardson,SG,28,"6' 5""",200 lbs,Tennessee,"$11,615,328",san-antonio-spurs,2581190,https://www.espn.com/nba/player/_/id/2581190/j...
Isaiah Roby,Isaiah Roby,PF,24,"6' 8""",230 lbs,Nebraska,"$1,782,621",san-antonio-spurs,4066392,https://www.espn.com/nba/player/_/id/4066392/i...
Jeremy Sochan,Jeremy Sochan,F,19,"6' 8""",225 lbs,Baylor,--,san-antonio-spurs,4610139,https://www.espn.com/nba/player/_/id/4610139/j...
D.J. Stewart Jr.,D.J. Stewart Jr.,F,23,"6' 6""",205 lbs,Mississippi State,--,san-antonio-spurs,4396960,https://www.espn.com/nba/player/_/id/4396960/d...
Devin Vassell,Devin Vassell,SG,21,"6' 5""",200 lbs,Florida State,"$4,235,160",san-antonio-spurs,4395630,https://www.espn.com/nba/player/_/id/4395630/d...
Blake Wesley,Blake Wesley,G,19,"6' 5""",185 lbs,Notre Dame,--,san-antonio-spurs,4683935,https://www.espn.com/nba/player/_/id/4683935/b...
Joe Wieskamp,Joe Wieskamp,SF,22,"6' 6""",205 lbs,Iowa,"$202,068",san-antonio-spurs,4397033,https://www.espn.com/nba/player/_/id/4397033/j...
Robert Woodard II,Robert Woodard II,SF,22,"6' 7""",230 lbs,Mississippi State,"$1,517,981",san-antonio-spurs,4396961,https://www.espn.com/nba/player/_/id/4396961/r...


#### Now that we finally have a complete dataframe with unique ids, we can go back and scrape all player pages for their career stats

In [155]:
# create a function that takes a player page url and scrapes the player's career stats into a dictionary

def get_player_stats(player_url):
    # parse individual player's page
    f = urllib.request.urlopen(player_url)
    player_soup = BeautifulSoup(f.read(), 'html.parser')

    # returns an empty bs4 ResultSet if the player stats card does not exist
    player_stats = player_soup.find_all('section', {'class':'Card PlayerStats'})

    # players without a stats card (e.g., newly drafted rookies) get an empty dictionary
    if not player_stats:
        return dict()
    player_stat_card = str(player_stats[0])

    # search the card for the index of the career stats row
    row_numbers = re.findall(r"data-idx=\"(\d+)\"><td class=\"Table__TD\">Career</td>", player_stat_card)
    if not row_numbers:
        return dict()
    row_number = row_numbers[0]

    # pull the list of column headers
    card_headers = re.findall(r"class=\"Table__TH\".+?>(.+?)</th>", player_stat_card)

    # pull the html of the career stats row
    player_career_stats = re.findall(r"data-idx=\"{row_number}\">(.+?)</tr>".format(row_number = row_number), player_stat_card)

    # extract the individual cell values from the row
    card_data = []
    for x in player_career_stats:
        card_data.extend(re.findall(r"<td class=\"Table__TD\">(.+?)</td>", x))

    return dict(zip(card_headers, card_data))

In [158]:
# create a function that takes a dataframe with player names as indexes and uses the above stats-
# scraping function to return a dictionary of career avg. stats for every player in the dataframe

def compile_all_stats(players_dataframe):

    career_stats_dict = dict()

    for player, info in players_dataframe.iterrows():
        # info is the row for this player, so the url can be read from it directly
        career_stats_dict[player] = get_player_stats(info["url"])

    return career_stats_dict

In [159]:
player_stats_dict = compile_all_stats(all_players_df)
player_stats_dict

{'Malcolm Brogdon': {'Stats': 'Career',
  'GP': '333',
  'MIN': '30.2',
  'FG%': '46.4',
  '3P%': '37.6',
  'FT%': '88.1',
  'REB': '4.2',
  'AST': '4.8',
  'BLK': '0.2',
  'STL': '0.9',
  'PF': '1.9',
  'TO': '1.8',
  'PTS': '15.5'},
 'Jaylen Brown': {'Stats': 'Career',
  'GP': '403',
  'MIN': '28.7',
  'FG%': '47.3',
  '3P%': '37.3',
  'FT%': '71.2',
  'REB': '4.9',
  'AST': '2.0',
  'BLK': '0.4',
  'STL': '0.9',
  'PF': '2.5',
  'TO': '1.9',
  'PTS': '16.5'},
 'JD Davison': {},
 'Danilo Gallinari': {'Stats': 'Career',
  'GP': '728',
  'MIN': '29.9',
  'FG%': '42.8',
  '3P%': '38.2',
  'FT%': '87.7',
  'REB': '4.8',
  'AST': '1.9',
  'BLK': '0.4',
  'STL': '0.7',
  'PF': '1.8',
  'TO': '1.2',
  'PTS': '15.6'},
 'Sam Hauser': {'Stats': 'Career',
  'GP': '26',
  'MIN': '6.1',
  'FG%': '46.0',
  '3P%': '43.2',
  'FT%': '0.0',
  'REB': '1.1',
  'AST': '0.4',
  'BLK': '0.1',
  'STL': '0.0',
  'PF': '0.3',
  'TO': '0.1',
  'PTS': '2.5'},
 'Al Horford': {'Stats': 'Career',
  'GP': '950',
  

#### If you are running this code locally, you'll notice that this last step takes considerably longer than the others. That is because we are scraping each NBA player's page individually: with ~15 players on each of the 30 teams, this step requests roughly 450 pages, more than an order of magnitude more than the 30 roster pages needed for the roster-level data.
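
#### With that many requests, it may be worth pausing between them and not downloading the same page twice. The notebook imports sleep at the top but never calls it, so the sketch below is one hypothetical way to put it to use. fetch_all and fake_fetch are illustrative names, and the fetcher is passed in as an argument so the example runs offline.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, cache=None):
    """Fetch each URL once, pausing after every network call and reusing cached pages."""
    cache = {} if cache is None else cache
    for url in urls:
        if url in cache:
            continue  # already downloaded, so no request and no pause
        cache[url] = fetch(url)
        time.sleep(delay_seconds)
    return cache

# Stand-in fetcher so the sketch runs offline; swap in a real downloader to use it
calls = []
def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"

pages = fetch_all(["a", "b", "a"], fake_fetch, delay_seconds=0)
pages = fetch_all(["b", "c"], fake_fetch, delay_seconds=0, cache=pages)
print(calls)
```

#### In the scraper itself, the fetch argument could be a small function that wraps urllib.request.urlopen and returns the decoded page.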