# Web Scraping in Python: Analyzing NBA Player Salaries

Erick Lu

March 22, 2020

This is a tutorial on web scraping using Python’s urllib and regular expression modules. I will use these tools to write a Python script that “scrapes” data from the ESPN website and extracts player information. Web scraping is a useful technique for extracting data from websites that don’t offer formatted, raw data for download. You can write scripts to automate the process of obtaining information from these websites, so that you don’t have to spend hours flipping through each page and copy-pasting.

As an example, I will scrape the salaries of all the players in the NBA (as of March 2020). We can use the data to compute metrics such as:

1. the average salary paid by each team in the NBA.
2. the highest salary (and corresponding player) on each team in the NBA.

Let’s first plan out how to do this in the easiest possible way, before diving into the code. We will first take a look at the website to figure out which web pages we need to scrape information from.

The teams page looks like the following:

![ESPN_teams_webpage.png](images/ESPN_teams_webpage.png)

This looks very promising. All the team names are listed on this page, which means that they can easily be extracted from the page source. Let’s take a look at the page source to see if we can find the URLs for each team’s roster:

![ESPN_teams_source.png](images/ESPN_teams_source.png)

It looks like the URLs for each team’s roster are listed in the page source in the following format: http://espn.go.com/nba/team/roster/_/name/team/team-name, as shown in the highlighted portion of the image above. Since they all follow the same format, we can use regular expressions to pull a list of all the teams out of the page source. Then, we can systematically loop through all of the team rosters and use regular expressions again to extract the player salaries from the page source of each roster.
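
As a minimal sketch of this idea, the snippet below runs `re.findall` over a short, hand-written HTML fragment instead of the live page. The fragment, the pattern, and the variable names are illustrative assumptions; the real ESPN markup may differ.

```python
import re

# Hand-written stand-in for part of the teams page source (an assumption,
# not copied from ESPN).
sample_source = (
    '<a href="http://www.espn.com/nba/team/roster/_/name/bos/boston-celtics">Roster</a>'
    '<a href="http://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors">Roster</a>'
)

# Capture the team abbreviation and the hyphenated team name from each roster link.
roster_link = re.compile(r'/nba/team/roster/_/name/(\w+)/([\w-]+)')
teams = dict(roster_link.findall(sample_source))

print(teams)
```

Because the pattern has two capture groups, `findall` returns a list of (abbreviation, team name) pairs, which `dict` turns directly into a lookup table.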

Let’s take a look at the roster page within each team’s URL. Below is a screenshot of the Golden State Warriors’ roster page and page source:

![GSW_roster_webpage.png](images/GSW_roster_webpage.png)

Again, looking at the page source reveals that each player’s name and information are provided in a specific pattern, which we can take advantage of when using regular expressions to extract snippets of data from each page source.

![GSW_roster_source.png](images/GSW_roster_source.png)
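
To illustrate how such a pattern can be exploited, here is a small sketch that pulls a name and a salary out of one abbreviated table row. The row is modeled on the roster markup; the exact markup and the pattern itself are assumptions and may not match the live page.

```python
import re

# Abbreviated roster row; the real markup contains more columns and
# changes over time.
sample_row = (
    'href="http://www.espn.com/nba/player/_/id/3975/stephen-curry">Stephen Curry</a>'
    '<span class="pl2 n10">30</span></span></td>'
    '<td class="Table__TD"><span class="">Davidson</span></td>'
    '<td class="Table__TD"><span class="">$40,231,758</span></td></tr>'
)

# Group 1: the link text (player name).
# Group 2: the first dollar amount that follows it in the same row.
row_pattern = re.compile(r'/nba/player/[^"]*">([^<]+)</a>.*?(\$[\d,]+)')
matches = row_pattern.findall(sample_row)

print(matches)
```

The lazy `.*?` keeps the match short, so each name is paired with the nearest salary that follows it rather than one from a later row.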

After examining the general structure of the website, we can begin to write the Python script.
Start by importing the “urllib” and “re” modules:

In [2]:
import re
import urllib.request  # urlopen lives in the urllib.request submodule in Python 3
from time import sleep
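
Since every page in this tutorial is fetched the same way, the download step could be wrapped in a small helper. This is only a sketch: the name `fetch_page`, its parameters, and the User-Agent value are hypothetical choices, not part of the original script.

```python
import urllib.request
from time import sleep

def fetch_page(url, delay=1.0, timeout=10):
    """Return the decoded body of `url`, then pause for `delay` seconds."""
    # Some sites reject the default urllib User-Agent; this header value is
    # only an illustrative choice.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf-8")
    sleep(delay)  # space out consecutive requests so we don't get blocked
    return html
```

A call such as `fetch_page('http://www.espn.com/nba/teams')` would then return the page source as a string, ready to be searched with `re`.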

Now, let’s create a function that will extract all the team names from the URL http://espn.go.com/nba/teams and return each team’s roster URL:

In [16]:
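f = urllib.request.urlopen('http://www.espn.com/nba/teams')
words = f.read().decode('utf-8')

# Each team link has the form www.espn.com/nba/team/_/name/<abbreviation>/<team-name>",
# so capture the abbreviation and the team name as a (key, value) pair.
teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', words))
teams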

{'atl': 'atlanta-hawks',
 'bkn': 'brooklyn-nets',
 'bos': 'boston-celtics',
 'cha': 'charlotte-hornets',
 'chi': 'chicago-bulls',
 'cle': 'cleveland-cavaliers',
 'dal': 'dallas-mavericks',
 'den': 'denver-nuggets',
 'det': 'detroit-pistons',
 'gs': 'golden-state-warriors',
 'hou': 'houston-rockets',
 'ind': 'indiana-pacers',
 'lac': 'la-clippers',
 'lal': 'los-angeles-lakers',
 'mem': 'memphis-grizzlies',
 'mia': 'miami-heat',
 'mil': 'milwaukee-bucks',
 'min': 'minnesota-timberwolves',
 'no': 'new-orleans-pelicans',
 'ny': 'new-york-knicks',
 'okc': 'oklahoma-city-thunder',
 'orl': 'orlando-magic',
 'phi': 'philadelphia-76ers',
 'phx': 'phoenix-suns',
 'por': 'portland-trail-blazers',
 'sa': 'san-antonio-spurs',
 'sac': 'sacramento-kings',
 'tor': 'toronto-raptors',
 'utah': 'utah-jazz',
 'wsh': 'washington-wizards'}

In [17]:
# This function finds the URL of each team's roster page in the NBA using a regex.
def build_team_url():
    # Open the ESPN teams webpage and extract each team's abbreviation and name.
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    words = f.read().decode('utf-8')
    teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', words))
    # Each roster webpage follows the same general pattern, so build its URL
    # from the abbreviation and the team name, keyed by team name.
    roster_urls = {}
    for abbreviation, name in teams.items():
        roster_urls[name] = 'http://www.espn.com/nba/team/roster/_/name/' + abbreviation + '/' + name
    return roster_urls


In [18]:
rosters = build_team_url()

In [47]:
rosters

{'atlanta-hawks': 'http://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks',
 'boston-celtics': 'http://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'http://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'charlotte-hornets': 'http://www.espn.com/nba/team/roster/_/name/cha/charlotte-hornets',
 'chicago-bulls': 'http://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'http://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'dallas-mavericks': 'http://www.espn.com/nba/team/roster/_/name/dal/dallas-mavericks',
 'denver-nuggets': 'http://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'detroit-pistons': 'http://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'golden-state-warriors': 'http://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors',
 'houston-rockets': 'http://www.espn.com/nba/team/roster/_/name/hou/houston-rockets',
 'indiana-pacers': 'http://www.espn.com/nba/t

Now, let’s create a function that, for each team, will extract all the player names and salaries from the team’s ESPN roster URL, which has the format http://espn.go.com/nba/team/roster/_/name/ + “TEAM NAME”,
e.g. http://espn.go.com/nba/team/roster/_/name/bos/boston-celtics.

In [3]:
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
f = urllib.request.urlopen(url)
stats = f.read().decode('utf-8')

In [22]:
"""
href="http://www.espn.com/nba/player/_/id/3975/stephen-curry">Stephen Curry</a><span class="pl2 n10">30</span></span></td><td class="Table__TD"><span style="min-width:40px" class="">PG</span></td><td class="Table__TD"><span style="min-width:40px" class="">32</span></td><td class="Table__TD"><span style="min-width:50px" class="">6&#x27; 3&quot;</span></td><td class="Table__TD"><span style="min-width:70px" class="">185 lbs</span></td><td class="Table__TD"><span class="">Davidson</span></td><td class="Table__TD"><span class="">$40,231,758</span></td></tr>
"""

player_info_regex = (r'\{"name"\:"(\w+\s\w+)",'
                     r'"href"\:"http\://www\.espn\.com/nba/player/.*?",(.*?)\}')
player_info_regex


'\\{"name"\\:"(\\w+\\s\\w+)","href"\\:"http\\://www\\.espn\\.com/nba/player/.*?",(.*?)\\}'

In [24]:
player_info = re.findall(player_info_regex, stats)
player_info

[('Ky Bowman',
  '"uid":"s:40~l:46~a:4065635","guid":"d0ef63e951bb5f842b7357521697dc62","id":"4065635","height":"6\' 1\\"","weight":"187 lbs","age":22,"position":"PG","jersey":"12","salary":"$350,189","birthDate":"06/17/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/4065635.png","lastName":"Ky Bowman","experience":0,"college":"Boston College"'),
 ('Marquese Chriss',
  '"uid":"s:40~l:46~a:3907487","guid":"a320ecf1d6481b7518ddc1dc576c27b4","id":"3907487","height":"6\' 9\\"","weight":"240 lbs","age":22,"position":"C","jersey":"32","salary":"$654,469","birthDate":"07/02/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3907487.png","lastName":"Marquese Chriss","experience":3,"college":"Washington","birthPlace":"Sacramento, CA"'),
 ('Stephen Curry',
  '"uid":"s:40~l:46~a:3975","guid":"5dda51f150c966e12026400b73f34fad","id":"3975","height":"6\' 3\\"","weight":"185 lbs","age":32,"position":"PG","jersey":"30","salary":"$40,231,758","birthDate":"03/14/88","h

In [41]:
import json

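# The second capture group holds the remaining "key":value pairs without the
# surrounding braces, so add them back to form a valid JSON object.
teststring = "{"+player_info[3][1]+"}"
teststring

# Parse the JSON text into a Python dictionary.
draymond = json.loads(teststring)

type(draymond)

draymond.keys()

print(draymond["salary"])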


$18,539,130
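
The salary comes back as a string, so it has to be converted to a number before any arithmetic. A minimal sketch of that conversion follows; `parse_salary` is a hypothetical helper name, and treating any non-numeric value as "not reported" is an assumption.

```python
import re

def parse_salary(text):
    """Convert a string such as '$18,539,130' to an int, or return None."""
    # Strip the dollar sign and the thousands separators.
    digits = re.sub(r"[$,]", "", text.strip())
    # Anything that is not purely numeric is treated as "not reported".
    return int(digits) if digits.isdecimal() else None

print(parse_salary("$18,539,130"))
print(parse_salary("--"))
```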


In [28]:
# Using the url of each roster, extract the salary data using regexes and
# perform calculations based on what we have extracted.
# Relies on player_info_regex and the json module from the cells above.
def calculate_statistic(rosters):
    # Create empty lists that will contain the statistics.
    average_team_salaries = []
    highest_salary_per_team = []
    for team, url in rosters.items(): # Open website for each roster, one by one.
        f = urllib.request.urlopen(url)
        stats = f.read().decode('utf-8')
        '''
        The salaries were embedded in the source code in this format:
        {"name":"Stephen Curry","href":"http://www.espn.com/nba/player/_/id/3975/stephen-curry",
        "uid":"s:40~l:46~a:3975","guid":"5dda51f150c966e12026400b73f34fad","id":"3975",
        "height":"6' 3\"","weight":"185 lbs","age":32,"position":"PG","jersey":"30",
        "salary":"$40,231,758","birthDate":"03/14/88",
        "headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3975.png",
        "lastName":"Stephen Curry","experience":10,"college":"Davidson","birthPlace":"Akron, OH"}
        '''
        # Each match is a (player name, remaining JSON fields) pair.
        player_info = re.findall(player_info_regex, stats)
        # Map each player to a salary, changing the salaries from strings to integers.
        player_salaries = {}
        for name, fields in player_info:
            salary = str(json.loads("{" + fields + "}").get("salary", ""))
            digits = re.sub(r"[,$]", "", salary)
            # Assumption: a salary without digits was not reported, so skip it.
            if digits.isdecimal():
                player_salaries[name] = int(digits)
        if player_salaries:
            # Record the highest-paid player and the average salary for this team.
            highest_paid = max(player_salaries, key=player_salaries.get)
            highest_salary_per_team.append((team, highest_paid, player_salaries[highest_paid]))
            average_team_salaries.append((team, sum(player_salaries.values()) / len(player_salaries)))
        sleep(1) # wait a second before opening next url so we don't get blocked

    # Prints the average salary out, with the team and salary side by side.
    print ("\n\nAverage Team Salaries in the NBA\n(Average amount spent on each player)\n\n")
    for team, average in sorted(average_team_salaries, key=lambda pair: pair[1]):
        print (team, round(average, 2))
    # Prints the highest salaries out, with the team, player and salary side by side.
    print ("\n\nPlayer with the highest salary per team in the NBA\n\n")
    for team, player, salary in sorted(highest_salary_per_team, key=lambda entry: entry[2]):
        print (team, (player, salary))


Now that we have our functions defined, we can write a few lines of code to execute them.

In [6]:
rosters = build_team_url()
calculate_statistic(rosters)


Again, this is achieved by looping through all the team rosters on the ESPN website, then looping through all the players and extracting their salaries. The average salary per team is obtained by adding together the salaries and dividing by the number of players on the team with a reported salary. I also use Python’s built-in functions to find the highest-paid player on each team.
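
The two calculations can be sketched on a tiny example. The three salaries below are taken from the Warriors output shown earlier; a real run would use every player on the roster. This sketch uses `max` to pick the top earner, which gives the same answer as sorting and taking the last element.

```python
# Three salaries from the Warriors roster output above (not the full roster).
player_salaries = {
    "Ky Bowman": 350189,
    "Marquese Chriss": 654469,
    "Stephen Curry": 40231758,
}

# Average: total salary divided by the number of players.
average_salary = sum(player_salaries.values()) / len(player_salaries)

# Highest paid: the key whose value is largest.
highest_paid = max(player_salaries, key=player_salaries.get)

print(round(average_salary, 2))
print(highest_paid, player_salaries[highest_paid])
```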

Looking at the data, we can see who is paid the most:

In general, webpages that link to subpages within the same site construct their links in some sort of standardized pattern. The techniques that I've outlined here should be broadly applicable for other websites. I hope what you've learned from this project will help you out on your own web scraping quests.

Thanks for reading! I hope my code helped you understand how to perform basic web scraping using Python.