Scraping ESPN NBA Rosters (https://www.espn.com/nba/teams)

Credit to Erick Lu for some ideas in a similar project back in 2020 (https://erilu.github.io/web-scraping-NBA-statistics/)

In [2]:
# import packages to allow regex handling of url subdirectories

import re
import urllib.request
from time import sleep

In [3]:
# create function to compile list of all team roster urls

def scrape_roster_urls():
    # fetch ESPN's NBA teams page and decode it to text
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    teams_content = f.read().decode('utf-8')

    # use regex to find every link on the page that points to a team's roster;
    # each match is an (abbreviation, team name) pair, so dict() also drops repeated links
    teams_list = dict(re.findall(r'"/nba/team/roster/_/name/(\w+)/(.+?)"', teams_content))

    # map each team name to its full roster url
    roster_urls = {}
    for abbrev, team in teams_list.items():
        roster_urls[team] = 'https://www.espn.com/nba/team/roster/_/name/' + abbrev + '/' + team
    return roster_urls

In [4]:
# build dictionary of current nba rosters
nba_rosters = scrape_roster_urls()

# display the dictionary
nba_rosters

{'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'minnesota-timberwolves': 'https://www.espn
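
To make the regex step concrete, here is a minimal offline sketch. The sample_html string is a made-up stand-in for the teams page (an assumption, not ESPN's actual markup), and sample_rosters is an illustrative name. It shows how the two capture groups become key/value pairs and how dict() collapses repeated links.

```python
import re

# hypothetical snippet standing in for the teams page (not real ESPN markup)
sample_html = (
    '<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
    '<a href="/nba/team/stats/_/name/bos/boston-celtics">Stats</a>'
    '<a href="/nba/team/roster/_/name/bkn/brooklyn-nets">Roster</a>'
    '<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
)

# each match is a tuple of the two capture groups: (abbreviation, team name)
pattern = r'"/nba/team/roster/_/name/(\w+)/(.+?)"'
pairs = re.findall(pattern, sample_html)

# dict() treats each tuple as a key/value pair, so the repeated link is dropped
abbrev_to_slug = dict(pairs)

# rebuild the full url for each team, keyed by team name
base = 'https://www.espn.com/nba/team/roster/_/name/'
sample_rosters = {slug: base + abbrev + '/' + slug
                  for abbrev, slug in abbrev_to_slug.items()}
```

The stats link is ignored because the pattern only matches paths that contain /roster/.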

Upon visiting the roster and player pages on ESPN, I realized that the data formerly stored in easily accessible JSON is now spread across many identically structured HTML components, so re.findall alone isn't precise enough to extract everything I need from the website. I've therefore replicated the steps above using Beautiful Soup, an HTML parser that allows more targeted searches in the later steps.

In [5]:
# import beautifulsoup library to help parse the tables where player information is stored
from bs4 import BeautifulSoup, Tag

# create an instance of the beautifulsoup class to parse the page
f = urllib.request.urlopen('http://www.espn.com/nba/teams')
teams_soup = BeautifulSoup(f.read(), 'html.parser')

# define an iterable helper class to pull list of links using regexes
class my_regex_searcher:
    def __init__(self, regex_string):
        self.__r = re.compile(regex_string)
        self.groups = []

    def __call__(self, what):
        if isinstance(what, Tag):
            what = what.name

        if what:
            g = self.__r.findall(what)
            if g:
                self.groups.append(g)
                return True
        return False

    def __iter__(self):
        yield from self.groups

# create instance of regex_searcher for links to roster pages
roster_searcher = my_regex_searcher(r"/nba/team/roster/_/name/(\w+)/(.+)")

# add all roster page links to a dictionary to unpack the regex searcher object
scraped_roster_details = dict(zip(teams_soup.find_all(href=roster_searcher), roster_searcher))

# extract the components of the keys and values in this intermediate dictionary
# and re-zip them together to create the final cleaned dictionary we'll want to use
teams = []
links = []

for value in scraped_roster_details.values():
    teams.append(value[0][1])

for key in scraped_roster_details.keys():
    links.append('https://www.espn.com' + key.get('href'))    

rosters_library = dict(zip(teams,links))

# display the dictionary
rosters_library


{'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'minnesota-timberwolves': 'https://www.espn
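
For comparison, Beautiful Soup also accepts a compiled regex directly as an attribute filter, which is one possible way to build the same dictionary without a helper class. The sketch below runs against a made-up HTML snippet; sample_html and sample_library are illustrative names, and the markup is an assumption rather than ESPN's real page.

```python
import re
from bs4 import BeautifulSoup

# hypothetical snippet standing in for the teams page (not real ESPN markup)
sample_html = """
<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>
<a href="/nba/team/stats/_/name/bos/boston-celtics">Stats</a>
<a href="/nba/team/roster/_/name/bkn/brooklyn-nets">Roster</a>
<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>
<a>no link here</a>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# a compiled regex used as an attribute filter keeps only tags whose href matches it
roster_pattern = re.compile(r'/nba/team/roster/_/name/(\w+)/(.+)')

sample_library = {}
for link in soup.find_all('a', href=roster_pattern):
    # re-run the pattern on the href to pull out the team name (second group)
    match = roster_pattern.search(link['href'])
    sample_library[match.group(2)] = 'https://www.espn.com' + link['href']
```

The trade-off is that the pattern runs twice per link, once to filter and once to capture, in exchange for not keeping state inside a callable.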

The next step is to collect player data from the roster page of each of the 30 NBA teams. To do this, we iterate through the roster table and pull the values in every column for each table row (player).

In the page's HTML, each row carries a numerical index (the data-idx attribute), and the fixed first column of the table is separate from the rest of the scrollable table columns.
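
As a rough illustration of that layout, the sketch below groups cells by their row index so that a fixed first column and a scrollable table can be stitched back into whole rows. The markup in sample_html is invented for the example (an assumption about the general shape, not a copy of ESPN's HTML), and rows_by_index and merged_rows are illustrative names.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

# hypothetical markup: one table for the fixed column, one for the scrollable columns
sample_html = """
<table><tbody>
  <tr data-idx="0"><td>Player A</td></tr>
  <tr data-idx="1"><td>Player B</td></tr>
</tbody></table>
<table><tbody>
  <tr data-idx="0"><td>PG</td><td>25</td></tr>
  <tr data-idx="1"><td>C</td><td>30</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# collect the cell text of every indexed row, keyed by its data-idx value
rows_by_index = defaultdict(list)
for tr in soup.find_all('tr', attrs={'data-idx': True}):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    rows_by_index[int(tr['data-idx'])].extend(cells)

# one merged list of values per player, in index order
merged_rows = [rows_by_index[i] for i in sorted(rows_by_index)]
```

Because the grouping is keyed on data-idx, the same loop also works if all of a player's cells turn out to live in a single row.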

In [121]:
# parse table headers
f = urllib.request.urlopen('https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets')
roster_soup = BeautifulSoup(f.read(), 'html.parser')
table_headers = roster_soup.find_all('th', {'class':'Table__TH'})

# convert bs4 result set into string array for regex matching
header_values = [str(x) for x in table_headers]

# extract list of table headers
column_names = []
for header in header_values:
    # append conditionally to skip the blank spacer block at the top left of the tables
    matches = re.findall(r">([a-zA-Z]+?)<", header)
    if matches:
        column_names.append(matches[0])

# insert an extra 'Num' column for player numbers, which have no header of their own,
# so the headers match the length of the player data scraped below

column_names.insert(1,'Num')

column_names

['Name', 'Num', 'POS', 'Age', 'HT', 'WT', 'College', 'Salary']

Now that we have an array to reference for the sets of data we'll be pulling for each player, we can pull the player data.

First, we'll start with an example for a single player.

In [122]:
# parse a single row of the table (the player at data-idx 5)
player_one = roster_soup.find_all('tr', {'data-idx': 5})

# convert bs4 result set into string array for regex matching
p1_values = [str(x) for x in player_one]

# match the text contents of each tag in the row (raw string avoids invalid escape warnings)
player_stats = re.findall(r'<.+?">([a-zA-Z0-9$,\'"\s\.\-]{1,25}?)</', p1_values[0])
player_stats

['Kevin Durant', '7', 'PF', '33', '6\' 10"', '240 lbs', 'Texas', '$42,018,900']
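
One natural follow-up is to pair these values with the column names from the previous cell and convert the numeric fields. The sketch below is only one way this could look: it reuses the two lists printed above as literals, and the height_to_inches helper and the player dictionary are hypothetical names introduced for illustration.

```python
import re

# the two lists shown in the outputs above, repeated here so the sketch runs on its own
column_names = ['Name', 'Num', 'POS', 'Age', 'HT', 'WT', 'College', 'Salary']
player_stats = ['Kevin Durant', '7', 'PF', '33', '6\' 10"', '240 lbs', 'Texas', '$42,018,900']

# pair each column name with the matching scraped value
player = dict(zip(column_names, player_stats))

def height_to_inches(height):
    # hypothetical helper: turn a string like 6' 10" into total inches
    feet, inches = re.match(r"(\d+)'\s*(\d+)\"", height).groups()
    return int(feet) * 12 + int(inches)

# convert the numeric fields from display strings to integers
player['Age'] = int(player['Age'])
player['HT'] = height_to_inches(player['HT'])
player['WT'] = int(player['WT'].split()[0])
player['Salary'] = int(re.sub(r'[$,]', '', player['Salary']))
```

Rows with missing values (for example, a blank salary) would need extra handling before these conversions.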