Scraping ESPN NBA Rosters (https://www.espn.com/nba/teams)

Credit to Erick Lu for some ideas in a similar project back in 2020 (https://erilu.github.io/web-scraping-NBA-statistics/)

In [30]:
# import packages for fetching pages and regex handling of url subdirectories

import re
import urllib.request  # plain "import urllib" does not reliably expose urllib.request
from time import sleep

In [31]:
# create function to compile a dictionary of all team roster urls

def scrape_roster_urls():
    # use regex to find all links on ESPN's NBA teams page that point to the teams' rosters
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    teams_content = f.read().decode('utf-8')
    # each match is an (abbreviation, team) pair, e.g. ('bos', 'boston-celtics')
    teams_list = dict(re.findall(r'"/nba/team/roster/_/name/(\w+)/(.+?)"', teams_content))

    # map each team name to the full url of its roster page
    roster_urls = {}
    for abbreviation, team in teams_list.items():
        roster_urls[team] = 'https://www.espn.com/nba/team/roster/_/name/' + abbreviation + '/' + team
    return roster_urls

In [32]:
# build dictionary of current nba rosters
nba_rosters = scrape_roster_urls()

# display the dictionary
nba_rosters

{'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'minnesota-timberwolves': 'https://www.espn
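
To make the regex step easier to follow, here is a minimal sketch that runs the same pattern against a small made-up HTML fragment instead of the live page. The fragment and the sample_ variable names are assumptions for illustration only; ESPN's real markup may look different.

```python
import re

# Hypothetical fragment standing in for the teams page (not ESPN's actual markup).
sample_html = (
    '<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
    '<a href="/nba/team/stats/_/name/bos/boston-celtics">Stats</a>'
    '<a href="/nba/team/roster/_/name/ny/new-york-knicks">Roster</a>'
)

# Same pattern as in scrape_roster_urls: capture the abbreviation and the team name.
sample_pairs = re.findall(r'"/nba/team/roster/_/name/(\w+)/(.+?)"', sample_html)

# findall returns one (abbreviation, team) tuple per match; rebuild the full urls from them.
base_url = 'https://www.espn.com/nba/team/roster/_/name/'
sample_rosters = {team: base_url + abbreviation + '/' + team
                  for abbreviation, team in sample_pairs}
```

Only the two roster links match; the stats link is skipped because the pattern spells out "roster" literally.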

When I visited the roster and player pages on ESPN, I found that the data formerly stored in easily accessible JSON is now spread across many identically structured HTML components, so re.findall alone isn't a precise enough tool to extract everything I need. I've therefore replicated the steps above using Beautiful Soup, an HTML parser that allows more targeted searches in the later steps.
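
As a rough illustration of what "more targeted" means here, the sketch below parses a made-up fragment in which two links share the same href pattern but only one sits inside the container of interest. The markup, the section id, and the sample_ names are all assumptions, not ESPN's actual page structure.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the nav link and the section link both look like roster links.
sample_html = """
<div class="nav"><a href="/nba/team/roster/_/name/bos/boston-celtics">Celtics</a></div>
<section id="teams">
  <a href="/nba/team/roster/_/name/ny/new-york-knicks">Roster</a>
  <a href="/nba/team/stats/_/name/ny/new-york-knicks">Stats</a>
</section>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Narrow the search to one container first, then filter its links by href.
teams_section = sample_soup.find('section', id='teams')
section_links = teams_section.find_all('a', href=re.compile(r'/nba/team/roster/'))

section_hrefs = [a['href'] for a in section_links]
```

A plain re.findall over the whole string would also pick up the nav link, whereas scoping the search to a parsed element keeps only the link inside the section.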

In [78]:
# import beautifulsoup library to help parse the tables where player information is stored
from bs4 import BeautifulSoup, Tag

# create an instance of the beautifulsoup class to parse the page
f = urllib.request.urlopen('http://www.espn.com/nba/teams')
soup = BeautifulSoup(f.read(), 'html.parser')

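# define a callable, iterable helper class: Beautiful Soup calls it as a filter
# on each href, and it records the regex groups captured from every match
class my_regex_searcher:
    def __init__(self, regex_string):
        self.__r = re.compile(regex_string)
        self.groups = []

    def __call__(self, what):
        # if handed a whole tag rather than a string, match against the tag's name
        if isinstance(what, Tag):
            what = what.name

        if what:
            g = self.__r.findall(what)
            if g:
                # keep the captured groups so they can be iterated over later
                self.groups.append(g)
                return True
        return False

    def __iter__(self):
        yield from self.groups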

# create instance of regex_searcher for links to roster pages
roster_searcher = my_regex_searcher(r"/nba/team/roster/_/name/(\w+)/(.+)")

# pair each matching link tag with the regex groups captured from its href
# (find_all runs the searcher on every href, which fills roster_searcher.groups)
scraped_roster_details = dict(zip(soup.find_all(href=roster_searcher), roster_searcher))

# extract the components of the keys and values in this intermediate dictionary
# and re-zip them together to create the final cleaned dictionary we'll want to use
teams = []
links = []

for value in scraped_roster_details.values():
    # value looks like [('bos', 'boston-celtics')]; keep the team name
    teams.append(value[0][1])

for key in scraped_roster_details.keys():
    links.append('https://www.espn.com' + key.get('href'))

rosters_library = dict(zip(teams, links))

# display the dictionary
rosters_library


{'boston-celtics': 'https://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'https://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'new-york-knicks': 'https://www.espn.com/nba/team/roster/_/name/ny/new-york-knicks',
 'philadelphia-76ers': 'https://www.espn.com/nba/team/roster/_/name/phi/philadelphia-76ers',
 'toronto-raptors': 'https://www.espn.com/nba/team/roster/_/name/tor/toronto-raptors',
 'chicago-bulls': 'https://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'https://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'detroit-pistons': 'https://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'indiana-pacers': 'https://www.espn.com/nba/team/roster/_/name/ind/indiana-pacers',
 'milwaukee-bucks': 'https://www.espn.com/nba/team/roster/_/name/mil/milwaukee-bucks',
 'denver-nuggets': 'https://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'minnesota-timberwolves': 'https://www.espn
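
For comparison, the same dictionary could probably be built without the helper class by passing a compiled pattern straight to find_all and re-running it on each href to read the groups. This is only a sketch under assumptions: roster_links_from_html and ROSTER_HREF are hypothetical names, the sample markup is invented, and the hrefs are assumed to be site-relative, as in the cell above.

```python
import re
from bs4 import BeautifulSoup

# Same pattern as roster_searcher: group 1 is the abbreviation, group 2 the team name.
ROSTER_HREF = re.compile(r'/nba/team/roster/_/name/(\w+)/(.+)')

def roster_links_from_html(html):
    """Map each team name to its absolute roster url for every matching link in html."""
    parsed = BeautifulSoup(html, 'html.parser')
    result = {}
    # A compiled regex passed as href= keeps only tags whose href matches it.
    for link in parsed.find_all('a', href=ROSTER_HREF):
        match = ROSTER_HREF.search(link['href'])
        result[match.group(2)] = 'https://www.espn.com' + link['href']
    return result

# Hypothetical fragment used only to exercise the function.
sample_page = (
    '<a href="/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
    '<a href="/nba/team/schedule/_/name/bos/boston-celtics">Schedule</a>'
    '<a href="/nba/team/roster/_/name/den/denver-nuggets">Roster</a>'
)
sample_library = roster_links_from_html(sample_page)
```

The trade-off is that the pattern runs twice per link (once as a filter, once to read the groups), which should be negligible for a page with a few dozen teams.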