# Web Scraping in Python: Analyzing NBA Player Salaries

Erick Lu

March 22, 2020

This is a tutorial on web scraping using Python’s urllib and regular expression (re) modules. I will use these tools to write a Python script that “scrapes” (obtains) data from the ESPN website and extracts the relevant information (salaries, team names, etc.). Web scraping is a useful technique for extracting data from websites that don’t offer formatted, raw data to the public. By writing a script to automate the process, you avoid spending hours flipping through each page and copy-pasting.

I will collect two statistics on the salaries of all the players in the NBA (as of March 2020):
1. the average salary paid by each team in the NBA.
2. the highest salary (and the corresponding player) on each team in the NBA.

Let’s first plan out how to do this in the easiest possible way, before diving into the code. The first thing we should do is take a look at the website to figure out which web pages we will be scraping information from.

The teams page looks like the following:

This looks very promising. All the team names are listed on this page, so they can easily be extracted from the page source. The plan is to have the Python script obtain a list of all the teams from this URL, build the roster URL for each team (format: http://www.espn.com/nba/team/roster/_/name/ + TEAM ABBREVIATION + / + TEAM NAME), and then loop through the roster URLs, extracting the player salaries from each page source.
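The URL-building step of this plan can be sketched as follows. This is only a minimal illustration, not the final script: `ROSTER_BASE`, `make_roster_url`, and the two hard-coded sample teams are hypothetical names introduced here, and the base URL uses the `www.espn.com` host that appears in the code later in this tutorial.

```python
# Base of every roster URL; the team abbreviation and team name are appended to it.
ROSTER_BASE = 'http://www.espn.com/nba/team/roster/_/name/'

def make_roster_url(abbreviation, team_name):
    """Join the base URL, the team abbreviation and the hyphenated team name."""
    return ROSTER_BASE + abbreviation + '/' + team_name

# Two hard-coded teams stand in for the list that will be scraped later.
sample_teams = {'bos': 'boston-celtics', 'gs': 'golden-state-warriors'}
sample_rosters = {name: make_roster_url(abbr, name)
                  for abbr, name in sample_teams.items()}
print(sample_rosters['boston-celtics'])
```

Keying the result by team name makes it easy to label each statistic with its team once the salaries have been collected.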

Let’s take a look at the page source to see if this is feasible:

Webpages that link to other webpages usually provide their links in some sort of pattern (look below the highlighted segment in the page source above). We’re going to take advantage of this.
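To make the idea of a link pattern concrete, here is a small sketch that applies a regex to a made-up fragment of page source. The fragment in `sample_source` is an assumption written for illustration (the real page source may be laid out differently); the pattern itself is the one used in the code later in this tutorial.

```python
import re

# Hypothetical fragment imitating how team links might appear in the page source.
sample_source = (
    '{"href":"http://www.espn.com/nba/team/_/name/bos/boston-celtics","text":"Boston Celtics"},'
    '{"href":"http://www.espn.com/nba/team/_/name/gs/golden-state-warriors","text":"Golden State Warriors"}'
)

# Group 1 captures the team abbreviation, group 2 the hyphenated team name.
link_pattern = r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",'
sample_links = re.findall(link_pattern, sample_source)
print(sample_links)
```

Because every link shares the same prefix, a single pattern with two capture groups is enough to pull out both pieces of each URL.
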
Let’s take a look at the roster page within each team’s URL. Below is a screenshot of the Boston Celtics’ roster page and page source:

Again, looking at the page source reveals that each player’s name and information are provided in a specific pattern, which we can take advantage of when using regular expressions to extract snippets of data from each page source.
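As a rough illustration of that pattern, the sketch below runs a regex over a single roster row. The row in `sample_row` is the snippet quoted inside the scraping function later in this tutorial, joined onto one line (an assumption about how it appears in the raw source), and the pattern is the same one that function uses.

```python
import re

# One roster row, joined onto a single line for this example.
sample_row = (
    '<a href="http://www.espn.com/nba/player/_/id/3975/stephen-curry">Stephen Curry</a>'
    '</td><td>PG</td><td >29</td><td >6-3</td><td >190</td>'
    '<td>Davidson</td><td>$34,382,550</td>'
)

# Group 1 captures the player's name, group 2 the salary cell at the end of the row.
row_pattern = (r'http://www\.espn\.com/nba/player/_/id/\d*?/.*?">(\w+\s\w+)</a>'
               r'</td><td>\w*?</td><td >\d*?</td><td >.*?</td><td >\d*?</td>'
               r'<td>.*?</td><td>(.*?)</td>')
sample_players = re.findall(row_pattern, sample_row)
print(sample_players)
```

Note that `(\w+\s\w+)` only matches names made of exactly two words, so a pattern like this can miss players whose names contain hyphens, apostrophes or suffixes.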

After examining the general structure of the website, we can begin to write the Python script.
Start by importing the “urllib” and “re” modules:

In [8]:
import re
import urllib.request
from time import sleep

Now, let’s extract all the team names from the teams page (http://www.espn.com/nba/teams). Before wrapping this in a function that returns each team’s roster URL, we can test the regex directly:

In [16]:
f = urllib.request.urlopen('http://www.espn.com/nba/teams')
words = f.read().decode('utf-8')

# Capture (abbreviation, team name) pairs from the team links in the page source.
teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', words))
teams

{'atl': 'atlanta-hawks',
 'bkn': 'brooklyn-nets',
 'bos': 'boston-celtics',
 'cha': 'charlotte-hornets',
 'chi': 'chicago-bulls',
 'cle': 'cleveland-cavaliers',
 'dal': 'dallas-mavericks',
 'den': 'denver-nuggets',
 'det': 'detroit-pistons',
 'gs': 'golden-state-warriors',
 'hou': 'houston-rockets',
 'ind': 'indiana-pacers',
 'lac': 'la-clippers',
 'lal': 'los-angeles-lakers',
 'mem': 'memphis-grizzlies',
 'mia': 'miami-heat',
 'mil': 'milwaukee-bucks',
 'min': 'minnesota-timberwolves',
 'no': 'new-orleans-pelicans',
 'ny': 'new-york-knicks',
 'okc': 'oklahoma-city-thunder',
 'orl': 'orlando-magic',
 'phi': 'philadelphia-76ers',
 'phx': 'phoenix-suns',
 'por': 'portland-trail-blazers',
 'sa': 'san-antonio-spurs',
 'sac': 'sacramento-kings',
 'tor': 'toronto-raptors',
 'utah': 'utah-jazz',
 'wsh': 'washington-wizards'}

In [17]:
# Find the roster URL of every NBA team by applying a regex to the teams page.
def build_team_url():
    # Open the ESPN teams webpage and extract the abbreviation and name of each team.
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    words = f.read().decode('utf-8')
    teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', words))
    # Each roster webpage follows the same pattern: base URL + abbreviation + '/' + team name.
    roster_urls = {}
    for abbreviation, name in teams.items():
        roster_urls[name] = 'http://www.espn.com/nba/team/roster/_/name/' + abbreviation + '/' + name
    # Map each team name to its roster URL.
    return roster_urls


In [18]:
rosters = build_team_url()

In [19]:
rosters

{'atlanta-hawks': 'http://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks',
 'boston-celtics': 'http://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'http://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'charlotte-hornets': 'http://www.espn.com/nba/team/roster/_/name/cha/charlotte-hornets',
 'chicago-bulls': 'http://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'http://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'dallas-mavericks': 'http://www.espn.com/nba/team/roster/_/name/dal/dallas-mavericks',
 'denver-nuggets': 'http://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'detroit-pistons': 'http://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'golden-state-warriors': 'http://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors',
 'houston-rockets': 'http://www.espn.com/nba/team/roster/_/name/hou/houston-rockets',
 'indiana-pacers': 'http://www.espn.com/nba/t

Now, let’s create a function that, for each team, extracts all the player names and salaries from the team’s ESPN roster URL, which has the format http://www.espn.com/nba/team/roster/_/name/ + “TEAM ABBREVIATION” + / + “TEAM NAME”,
e.g., http://www.espn.com/nba/team/roster/_/name/bos/boston-celtics.

In [5]:
# Using the URL of each roster, extract the salary data using regexes and
# compute statistics from what we have extracted.
def calculate_statistic(rosters):
    # Lists of per-team results, filled in one team at a time.
    average_team_salaries = []
    highest_salary_per_team = []
    for team, url in rosters.items(): # Open website for each roster, one by one.
        f = urllib.request.urlopen(url)
        stats = f.read().decode('utf-8')
        # The salaries were embedded in the source code in this format:
        # <a href="http://www.espn.com/nba/player/_/id/3975/stephen-curry">Stephen Curry</a>
        # </td><td>PG</td><td >29</td><td >6-3</td><td >190</td><td>Davidson</td><td>$34,382,550</td>
        # </tr><tr class="evenrow player-46-3202"><td >35</td><td class="sortcell">
        # This is the regex pattern to return players and their corresponding salary.
        player_salaries = dict(re.findall(r'http://www\.espn\.com/nba/player/_/id/\d*?/.*?">(\w+\s\w+)</a></td><td>\w*?</td><td >\d*?</td><td >.*?</td><td >\d*?</td><td>.*?</td><td>(.*?)</td>', stats))
        # Convert the reported salaries from strings to integers; skip players without one.
        salaries = {}
        for player, salary in player_salaries.items():
            if salary != '&nbsp;':
                salaries[player] = int(re.sub(r'[,$]', '', salary))
        sleep(1) # wait a second before opening next url so we don't get blocked
        if not salaries:
            continue # no salaries were found for this team, so there is nothing to average
        # Record the highest-paid player and the average salary for this team.
        top_player = max(salaries, key=salaries.get)
        highest_salary_per_team.append((team, top_player, salaries[top_player]))
        average_team_salaries.append((team, sum(salaries.values()) / len(salaries)))

    # Prints the average salary out, with the team and salary side by side.
    print("\n\nAverage Team Salaries in the NBA\n(Average amount spent on each player)\n\n")
    for team, average in sorted(average_team_salaries, key=lambda pair: pair[1]):
        print(team, round(average, 2))
    # Prints the highest salaries out, with the team and player side by side.
    print("\n\nPlayer with the highest salary per team in the NBA\n\n")
    for team, player, salary in sorted(highest_salary_per_team, key=lambda entry: entry[2]):
        print(team, (player, salary))


Now that we have our functions defined, we can write a few lines of code to execute each function.

In [6]:
rosters = build_team_url()
calculate_statistic(rosters)

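Note: if this cell fails with "AttributeError: module 'urllib' has no attribute 'urlopen'", the code being run is calling urllib.urlopen, which only existed in Python 2. In Python 3, the function is urllib.request.urlopen, as used in the functions above.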

Again, this is achieved by looping through all the team rosters on the ESPN website, then looping through all the players and extracting their salaries. The average salary per team is obtained by adding together the reported salaries and dividing by the number of players who have one. I then pick out the highest salary on each team to see which player is the highest paid.
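The arithmetic described above can be sketched on a few made-up values. The player names and salaries in `sample_salaries` are invented for illustration and are not real data, and `parse_salary` is a hypothetical helper that is not part of the script above.

```python
import re

# Hypothetical scraped values; the names and numbers are made up for illustration.
sample_salaries = {
    'Player A': '$10,000,000',
    'Player B': '$2,500,000',
    'Player C': '&nbsp;',  # placeholder found in the cell when no salary is listed
}

def parse_salary(text):
    """Return the salary as an int, or None when it is not reported."""
    digits = re.sub(r'[,$]', '', text)
    return int(digits) if digits.isdigit() else None

# Keep only the players whose salary could be converted to a number.
reported = {}
for name, text in sample_salaries.items():
    salary = parse_salary(text)
    if salary is not None:
        reported[name] = salary

average_salary = sum(reported.values()) / len(reported)
top_player = max(reported, key=reported.get)
print(average_salary, top_player, reported[top_player])
```

Leaving unreported salaries out of both the sum and the count keeps them from dragging the average toward zero.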

Looking at the data, we can see who is paid the most.

Thanks for reading! I hope my code helped you understand how to perform basic web scraping using Python.