# Web Scraping in Python: Gathering NBA Player Information

Erick Lu

March 22, 2020

In this project, I use Python’s `urllib` and `re` modules to “scrape” ESPN for information on all the players in the NBA. I then use `pandas` to format the data into a DataFrame, and calculate some interesting statistics.

Web scraping is a useful technique for extracting data from websites that don’t offer formatted, raw data for download. You can write scripts in Python to automate the process of obtaining information from these websites, so that you don’t have to spend hours flipping through each page and copy-pasting.

As an example, I will be scraping data from the rosters for each team in the NBA (as of March 2020). We can use the data to answer questions such as:

1. What is the average salary paid by each team in the NBA, and which player earns the most on each team?
2. Are there any interesting correlates with salary? Which 
3. What 

Let’s first plan out how to do this in the easiest possible way, before diving into the code. We will first take a look at the website to figure out which web pages we need to scrape information from.

The teams page looks like the following:

![ESPN_teams_webpage.png](images/ESPN_teams_webpage.png)

This looks very promising. All the team names are listed on this page, which means that they can easily be extracted from the page source. Let’s take a look at the page source to see if we can find the URLs for each team's roster:

![ESPN_teams_source.png](images/ESPN_teams_source.png)

It looks like the URLs for each team's roster are listed in the page source with the following format: http://espn.go.com/nba/team/roster/_/name/team/team-name, as shown in the highlighted portion of the image above. Given that these all follow the same format, we can use regular expressions to pull out a list of all team roster URLs from the page source.

In order to figure out how to scrape the rosters, let’s take a look at the Golden State Warriors' roster page:

![GSW_roster_webpage.png](images/GSW_roster_webpage.png)

Information for each player is nicely laid out in a table, meaning that the data is likely easily obtainable using regular expressions. Taking a look at the page source reveals that each player’s name and information are all provided in blocks of what appear to be `json`:

![GSW_roster_source.png](images/GSW_roster_source.png)

Given this information, we can systematically loop through all of the team rosters and use regular expressions to extract player information from each roster. We can begin to write the Python script with this plan in mind. Start by importing the `urllib` and `re` modules:

In [1]:
import re
import urllib.request
from time import sleep

Now, let’s create a function that will extract all the team names from the URL http://espn.go.com/nba/teams and return a list of each team’s roster URL.

In [14]:
# This function finds the URL of each NBA team's roster using a regex.
def build_team_url():
    # Open the ESPN teams webpage and extract each team's abbreviation and name.
    f = urllib.request.urlopen('http://www.espn.com/nba/teams')
    teams_source = f.read().decode('utf-8')
    # A raw string keeps the regex escapes intact; groups are (abbreviation, team-name).
    teams = dict(re.findall(r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",', teams_source))
    # Each roster webpage follows the same general pattern, so build the URLs directly.
    roster_urls = {}
    for abbrev, name in teams.items():
        roster_urls[name] = 'http://www.espn.com/nba/team/roster/_/name/' + abbrev + '/' + name
    return roster_urls
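
To see what the regular expression captures without hitting the network, here is a minimal sketch that runs the same pattern on a made-up string. The `sample_source` text is an assumption that imitates the links in the page source; the real page contains much more markup.

```python
import re

# Hypothetical snippet imitating the team links in the ESPN teams page source.
sample_source = (
    '"href":"http://www.espn.com/nba/team/_/name/bos/boston-celtics",'
    '"href":"http://www.espn.com/nba/team/_/name/gs/golden-state-warriors",'
)

# Capture the team abbreviation and the hyphenated team name from each link.
team_regex = r'www\.espn\.com/nba/team/_/name/(\w+)/(.+?)",'
teams = dict(re.findall(team_regex, sample_source))

# Rebuild each roster URL from the captured pieces.
sample_rosters = {
    name: 'http://www.espn.com/nba/team/roster/_/name/' + abbrev + '/' + name
    for abbrev, name in teams.items()
}
print(sample_rosters)
```

The lazy group `(.+?)` stops at the first `",` it meets, which is what keeps each match limited to a single link.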


In [28]:
rosters = build_team_url()
rosters

{'atlanta-hawks': 'http://www.espn.com/nba/team/roster/_/name/atl/atlanta-hawks',
 'boston-celtics': 'http://www.espn.com/nba/team/roster/_/name/bos/boston-celtics',
 'brooklyn-nets': 'http://www.espn.com/nba/team/roster/_/name/bkn/brooklyn-nets',
 'charlotte-hornets': 'http://www.espn.com/nba/team/roster/_/name/cha/charlotte-hornets',
 'chicago-bulls': 'http://www.espn.com/nba/team/roster/_/name/chi/chicago-bulls',
 'cleveland-cavaliers': 'http://www.espn.com/nba/team/roster/_/name/cle/cleveland-cavaliers',
 'dallas-mavericks': 'http://www.espn.com/nba/team/roster/_/name/dal/dallas-mavericks',
 'denver-nuggets': 'http://www.espn.com/nba/team/roster/_/name/den/denver-nuggets',
 'detroit-pistons': 'http://www.espn.com/nba/team/roster/_/name/det/detroit-pistons',
 'golden-state-warriors': 'http://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors',
 'houston-rockets': 'http://www.espn.com/nba/team/roster/_/name/hou/houston-rockets',
 'indiana-pacers': 'http://www.espn.com/nba/t

The function `build_team_url()` returns a dictionary that maps each team name to its roster URL. We can now create a function that will loop through the dictionary, open each URL, and extract player information. Below is an example for a single team, to walk you through what the function does. First, we read in the roster webpage using `urllib.request.urlopen`:

In [6]:
url = "https://www.espn.com/nba/team/roster/_/name/gs/golden-state-warriors"
f = urllib.request.urlopen(url)
roster_source = f.read().decode('utf-8')

Then, we construct the regex that will return information for each of the players on the roster webpage.

In [8]:
# Group 1 is the player's name; group 2 is the rest of the player's fields.
player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'
player_info = re.findall(player_regex, roster_source)
player_info[0:4]

[('Ky Bowman',
  '"uid":"s:40~l:46~a:4065635","guid":"d0ef63e951bb5f842b7357521697dc62","id":"4065635","height":"6\' 1\\"","weight":"187 lbs","age":22,"position":"PG","jersey":"12","salary":"$350,189","birthDate":"06/17/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/4065635.png","lastName":"Ky Bowman","experience":0,"college":"Boston College"'),
 ('Marquese Chriss',
  '"uid":"s:40~l:46~a:3907487","guid":"a320ecf1d6481b7518ddc1dc576c27b4","id":"3907487","height":"6\' 9\\"","weight":"240 lbs","age":22,"position":"C","jersey":"32","salary":"$654,469","birthDate":"07/02/97","headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3907487.png","lastName":"Marquese Chriss","experience":3,"college":"Washington","birthPlace":"Sacramento, CA"'),
 ('Stephen Curry',
  '"uid":"s:40~l:46~a:3975","guid":"5dda51f150c966e12026400b73f34fad","id":"3975","height":"6\' 3\\"","weight":"185 lbs","age":32,"position":"PG","jersey":"30","salary":"$40,231,758","birthDate":"03/14/88","h

As you can see, `player_info` is a list of tuples, in which each player name is paired with a set of information (height, weight, age, etc.) that is organized in `json` format. We can use the `json` module in Python to convert the information into a `dict`:

In [9]:
import json
draymond = json.loads("{"+player_info[3][1]+"}")
draymond

{'age': 30,
 'birthDate': '03/04/90',
 'birthPlace': 'Saginaw, MI',
 'college': 'Michigan State',
 'experience': 7,
 'guid': 'de360720e41625f28a6bb5ff82616cb1',
 'headshot': 'https://a.espncdn.com/i/headshots/nba/players/full/6589.png',
 'height': '6\' 6"',
 'id': '6589',
 'jersey': '23',
 'lastName': 'Draymond Green',
 'position': 'PF',
 'salary': '$18,539,130',
 'uid': 's:40~l:46~a:6589',
 'weight': '230 lbs'}
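
The two steps above (regex, then `json.loads`) can be tried offline on a small made-up string. This is only a sketch: `sample_roster` and the player names in it are hypothetical and merely imitate the shape of the page source.

```python
import json
import re

# Hypothetical roster snippet in the same shape as the ESPN page source.
sample_roster = (
    '{"name":"Example Player","href":"http://www.espn.com/nba/player/_/id/1/example-player",'
    '"id":"1","weight":"200 lbs","age":25,"salary":"$1,000,000"},'
    '{"name":"Another Player","href":"http://www.espn.com/nba/player/_/id/2/another-player",'
    '"id":"2","weight":"220 lbs","age":30}'
)

player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'

players = {}
for name, fields in re.findall(player_regex, sample_roster):
    # Re-wrap the captured fields in braces so they form a valid JSON object.
    players[name] = json.loads('{' + fields + '}')

print(players['Example Player']['salary'])
# .get() returns None when a field such as 'salary' is absent for a player.
print(players['Another Player'].get('salary'))
```

One thing to keep in mind: the name group `(\w+\s\w+)` only matches names made of exactly two runs of word characters, so a hyphenated or three-part name would be skipped by this pattern.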

In the example above, all of the pertinent information for Draymond Green is now stored in a Python dictionary. We can use the snippets of code above to construct a function that opens a roster page and collects the information for each player on it:

In [47]:
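def get_player_info(roster_url):
    # Download the roster page and decode it to text.
    f = urllib.request.urlopen(roster_url)
    roster_source = f.read().decode('utf-8')
    sleep(0.5)  # pause briefly between requests
    # Group 1 is the player's name; group 2 is the rest of the player's fields.
    player_regex = r'\{"name":"(\w+\s\w+)","href":"http://www\.espn\.com/nba/player/.*?",(.*?)\}'
    player_info = re.findall(player_regex, roster_source)
    player_dict = dict()
    for player in player_info:
        # Wrap the captured fields in braces so json.loads can parse them.
        player_dict[player[0]] = json.loads("{"+player[1]+"}")
    return player_dict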
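
Requests occasionally fail when looping over many pages. Below is a minimal sketch of a retry wrapper that could stand in for the direct `urlopen` call. The name `fetch_page` and its `retries`, `delay`, and `opener` parameters are my own assumptions for illustration and are not part of the original script.

```python
import urllib.error
import urllib.request
from time import sleep

def fetch_page(url, retries=3, delay=1.0, opener=urllib.request.urlopen):
    """Return the decoded page source, retrying when a URLError is raised.

    Hypothetical helper: `opener` defaults to urllib.request.urlopen and can
    be swapped out for testing.
    """
    for attempt in range(retries):
        try:
            with opener(url) as f:
                return f.read().decode('utf-8')
        except urllib.error.URLError:
            # Give up after the last attempt; otherwise wait and try again.
            if attempt == retries - 1:
                raise
            sleep(delay)
```

With this in place, `roster_source = fetch_page(roster_url)` would replace the two lines that open and read the page.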

We will now loop through each team in `rosters` and run `get_player_info()`, storing the output in a dictionary called `all_players`:

In [49]:
all_players = dict()
for team in rosters.keys():
    print("Gathering player info for team: " + team)
    all_players[team] = get_player_info(rosters[team])

Gathering player info for team: boston-celtics
Gathering player info for team: brooklyn-nets
Gathering player info for team: new-york-knicks
Gathering player info for team: philadelphia-76ers
Gathering player info for team: toronto-raptors
Gathering player info for team: chicago-bulls
Gathering player info for team: cleveland-cavaliers
Gathering player info for team: detroit-pistons
Gathering player info for team: indiana-pacers
Gathering player info for team: milwaukee-bucks
Gathering player info for team: atlanta-hawks
Gathering player info for team: charlotte-hornets
Gathering player info for team: miami-heat
Gathering player info for team: orlando-magic
Gathering player info for team: washington-wizards
Gathering player info for team: denver-nuggets
Gathering player info for team: minnesota-timberwolves
Gathering player info for team: oklahoma-city-thunder
Gathering player info for team: portland-trail-blazers
Gathering player info for team: utah-jazz
Gathering player info for team

The `all_players` dictionary should be a dictionary of dictionaries of dictionaries. This sounds complicated, but let's walk through what it looks like. The first level of keys should correspond to teams:

In [50]:
all_players.keys()

dict_keys(['boston-celtics', 'brooklyn-nets', 'new-york-knicks', 'philadelphia-76ers', 'toronto-raptors', 'chicago-bulls', 'cleveland-cavaliers', 'detroit-pistons', 'indiana-pacers', 'milwaukee-bucks', 'atlanta-hawks', 'charlotte-hornets', 'miami-heat', 'orlando-magic', 'washington-wizards', 'denver-nuggets', 'minnesota-timberwolves', 'oklahoma-city-thunder', 'portland-trail-blazers', 'utah-jazz', 'golden-state-warriors', 'la-clippers', 'los-angeles-lakers', 'phoenix-suns', 'sacramento-kings', 'dallas-mavericks', 'houston-rockets', 'memphis-grizzlies', 'new-orleans-pelicans', 'san-antonio-spurs'])

Within a team, the keys should correspond to player names. Let's zoom in on the LA Lakers:

In [51]:
all_players["los-angeles-lakers"].keys()

dict_keys(['Kostas Antetokounmpo', 'Avery Bradley', 'Devontae Cacok', 'Alex Caruso', 'Quinn Cook', 'Anthony Davis', 'Jared Dudley', 'Danny Green', 'Dwight Howard', 'LeBron James', 'Kyle Kuzma', 'JaVale McGee', 'Markieff Morris', 'Rajon Rondo', 'Dion Waiters'])

Now we can choose which player to look at. Let's choose LeBron James as an example:

In [53]:
all_players["los-angeles-lakers"]["LeBron James"]

{'age': 35,
 'birthDate': '12/30/84',
 'birthPlace': 'Akron, OH',
 'experience': 16,
 'guid': '1f6592b3ff53d3218dc56038d48c1786',
 'headshot': 'https://a.espncdn.com/i/headshots/nba/players/full/1966.png',
 'height': '6\' 9"',
 'id': '1966',
 'jersey': '23',
 'lastName': 'LeBron James',
 'position': 'SF',
 'salary': '$37,436,858',
 'uid': 's:40~l:46~a:1966',
 'weight': '250 lbs'}

A dictionary with information about LeBron James is returned. We can extract information even more precisely by specifying which field we are interested in. Let's get his salary:

In [54]:
all_players["los-angeles-lakers"]["LeBron James"]["salary"]

'$37,436,858'
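
Note that the salary comes back as a string. As a small sketch of how it could be turned into a number for later calculations, the helper below strips the `$` and `,` characters. The function name `salary_to_int` and the `sample_players` data are hypothetical.

```python
import re

# Hypothetical sample in the same nested shape as all_players.
sample_players = {
    'los-angeles-lakers': {
        'LeBron James': {'salary': '$37,436,858'},
        'Example Rookie': {},  # no salary listed for this made-up player
    }
}

def salary_to_int(salary):
    """Turn a string such as '$37,436,858' into 37436858; return None if missing."""
    if not salary:
        return None
    return int(re.sub(r'[,$]', '', salary))

# Walk the team -> player -> field levels of the nested dictionary.
for team, players in sample_players.items():
    for name, info in players.items():
        print(team, name, salary_to_int(info.get('salary')))
```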

To make data analysis easier, we can re-format this dictionary into a pandas DataFrame. The function `pd.DataFrame.from_dict()` can turn a dictionary of dictionaries into a DataFrame; with `orient = "index"`, the outer keys (player names) become the rows:

In [88]:
import pandas as pd
gsw = pd.DataFrame.from_dict(all_players["golden-state-warriors"], orient = "index")
gsw

Unnamed: 0,uid,guid,id,height,weight,age,position,jersey,salary,birthDate,headshot,lastName,experience,college,birthPlace,hand
Alen Smailagic,s:40~l:46~a:4401415,6ed3f8924bfef2e70329ebd6a104ecae,4401415,"6' 10""",215 lbs,19,PF,6,"$898,310",08/18/00,https://a.espncdn.com/i/headshots/nba/players/...,Alen Smailagic,0,,,
Andrew Wiggins,s:40~l:46~a:3059319,064c19d065276a21ca99fdfb296fe05d,3059319,"6' 7""",197 lbs,25,SF,22,"$27,504,630",02/23/95,https://a.espncdn.com/i/headshots/nba/players/...,Andrew Wiggins,5,Kansas,"Thornhill, ON",
Chasson Randle,s:40~l:46~a:2580898,71b7154a3d81842448b623ee3e65d586,2580898,"6' 2""",185 lbs,27,PG,25,,02/05/93,https://a.espncdn.com/i/headshots/nba/players/...,Chasson Randle,2,Stanford,"Rock Island, IL",
Damion Lee,s:40~l:46~a:2595209,41fafb6d47a66d8f79f94161918541a4,2595209,"6' 5""",210 lbs,27,SG,1,"$842,327",10/21/92,https://a.espncdn.com/i/headshots/nba/players/...,Damion Lee,2,Louisville,,L
Draymond Green,s:40~l:46~a:6589,de360720e41625f28a6bb5ff82616cb1,6589,"6' 6""",230 lbs,30,PF,23,"$18,539,130",03/04/90,https://a.espncdn.com/i/headshots/nba/players/...,Draymond Green,7,Michigan State,"Saginaw, MI",
Eric Paschall,s:40~l:46~a:3133817,b67e5e0fa5cb209355845d165a49407e,3133817,"6' 6""",255 lbs,23,PF,7,"$898,310",11/04/96,https://a.espncdn.com/i/headshots/nba/players/...,Eric Paschall,0,Villanova,"North Tarrytown, NY",
Jordan Poole,s:40~l:46~a:4277956,4b0492b5a52f267fe84098ef6d2e2bdf,4277956,"6' 4""",194 lbs,20,SG,3,"$1,964,760",06/19/99,https://a.espncdn.com/i/headshots/nba/players/...,Jordan Poole,0,Michigan,"Milwaukee, WI",B
Kevon Looney,s:40~l:46~a:3155535,10a8e77b877324c69966f0c4618caad6,3155535,"6' 9""",222 lbs,24,PF,5,"$4,464,226",02/06/96,https://a.espncdn.com/i/headshots/nba/players/...,Kevon Looney,4,UCLA,"Milwaukee, WI",
Klay Thompson,s:40~l:46~a:6475,3411530a7ab7e8dce4f165d59a559520,6475,"6' 6""",215 lbs,30,SG,11,"$32,742,000",02/08/90,https://a.espncdn.com/i/headshots/nba/players/...,Klay Thompson,8,Washington State,"Los Angeles, CA",
Ky Bowman,s:40~l:46~a:4065635,d0ef63e951bb5f842b7357521697dc62,4065635,"6' 1""",187 lbs,22,PG,12,"$350,189",06/17/97,https://a.espncdn.com/i/headshots/nba/players/...,Ky Bowman,0,Boston College,,


In the DataFrame above, each parameter ('age', 'salary', etc.) is a column and each player is a row, which makes the data much easier to read and understand. Pandas also fills in null values where pieces of data are missing. For example, Chasson Randle's salary information is missing from the website, so 'NaN' is automatically placed in the DataFrame.
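
The missing-value behavior is easy to check on a tiny example. This sketch uses two made-up players, one of whom has no `salary` key:

```python
import pandas as pd

# Hypothetical players; the second one has no 'salary' key.
sample_team = {
    'Player One': {'age': 30, 'salary': '$1,000,000'},
    'Player Two': {'age': 27},
}

# orient='index' turns the outer keys (player names) into row labels.
sample_df = pd.DataFrame.from_dict(sample_team, orient='index')
print(sample_df)

# isna() marks the cells that pandas filled in for the missing keys.
print(sample_df['salary'].isna())
```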

DataFrames allow us to quickly make calculations, sort players based on their stats, and compare stats between teams. To make a DataFrame containing data from all the teams, we can loop through each team in `all_players`, construct DataFrames, label them with a `team` column, and aggregate them into a single DataFrame called `all_players_df`.

In [85]:
team_dfs = []

# loop through each team in all_players, create a pandas DataFrame, and label it
for team in all_players.keys():
    team_df = pd.DataFrame.from_dict(all_players[team], orient = "index")
    team_df['team'] = team
    team_dfs.append(team_df)

# combine the per-team DataFrames (DataFrame.append was removed in pandas 2.0)
all_players_df = pd.concat(team_dfs)
all_players_df

Unnamed: 0,age,birthDate,birthPlace,college,experience,guid,hand,headshot,height,id,jersey,lastName,position,salary,team,uid,weight
Brad Wanamaker,30,07/25/89,"Philadelphia, PA",Pittsburgh,1,5aad35bbbb760e3958107639266768ae,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 3""",6507,9,Brad Wanamaker,PG,"$1,445,697",boston-celtics,s:40~l:46~a:6507,210 lbs
Carsen Edwards,22,03/12/98,"Houston, TX",Purdue,0,4b8ebdfd01221567925035c1e0d0c337,,https://a.espncdn.com/i/headshots/nba/players/...,"5' 11""",4066407,4,Carsen Edwards,PG,"$1,228,026",boston-celtics,s:40~l:46~a:4066407,200 lbs
Daniel Theis,27,04/04/92,Germany,,2,ce75206c087f83ace6f9a8e3efbd9671,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 8""",2451037,27,Daniel Theis,C,"$5,000,000",boston-celtics,s:40~l:46~a:2451037,245 lbs
Enes Kanter,27,05/20/92,Switzerland,Kentucky,8,1e039b407b3daa6eeac69432aa6413fd,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 10""",6447,11,Enes Kanter,C,"$4,767,000",boston-celtics,s:40~l:46~a:6447,250 lbs
Gordon Hayward,30,03/23/90,"Indianapolis, IN",Butler,9,56f675cb8f40a5aaee5f5747ec9099c5,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 7""",4249,20,Gordon Hayward,SF,"$32,700,690",boston-celtics,s:40~l:46~a:4249,225 lbs
Grant Williams,21,11/30/98,"Houston, TX",Tennessee,0,3a93561dd9c3f1e8de40fbc7b40f7a5e,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 6""",4066218,12,Grant Williams,PF,"$2,379,840",boston-celtics,s:40~l:46~a:4066218,236 lbs
Javonte Green,26,07/23/93,"Alberta, VA",Radford,0,a4940ed033e0a114e8862f5a094aa3f8,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 4""",2596112,43,Javonte Green,SG,"$898,310",boston-celtics,s:40~l:46~a:2596112,205 lbs
Jaylen Brown,23,10/24/96,"Marietta, GA",California,3,0d5cde01f6d3225fdae544ef3304cda2,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 6""",3917376,7,Jaylen Brown,SG,"$6,534,829",boston-celtics,s:40~l:46~a:3917376,223 lbs
Jayson Tatum,22,03/03/98,,Duke,2,ed3343b02ffaf6b4e4223a1920938c81,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 8""",4065648,0,Jayson Tatum,PF,"$7,830,000",boston-celtics,s:40~l:46~a:4065648,210 lbs
Kemba Walker,29,05/08/90,"Bronx, NY",Connecticut,8,665c55b2776846ac62a04efb4c9bcc80,,https://a.espncdn.com/i/headshots/nba/players/...,"6' 0""",6479,8,Kemba Walker,PG,"$32,742,000",boston-celtics,s:40~l:46~a:6479,184 lbs


I'll export this data to a CSV file, in case you want to read it in and play around with it yourself.

In [87]:
all_players_df.to_csv("NBA_roster_info_all_players_mar2020.csv")
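
If you do read the data back in, keep in mind that salary, weight, and height are all stored as text. Here is a rough sketch of how those columns could be converted to numbers. The rows and the new column names (`salary_usd`, `weight_lbs`, `height_in`) are hypothetical; only the string formats are taken from the tables above.

```python
import pandas as pd

# Hypothetical rows in the same string formats as the scraped columns.
sample_df = pd.DataFrame(
    {
        'salary': ['$37,436,858', None],
        'weight': ['250 lbs', '185 lbs'],
        'height': ['6\' 9"', '6\' 3"'],
    },
    index=['Player A', 'Player B'],
)

# Strip '$' and ',' from salaries; missing values stay NaN.
sample_df['salary_usd'] = pd.to_numeric(
    sample_df['salary'].str.replace(r'[$,]', '', regex=True)
)

# Keep only the number from strings such as '250 lbs'.
sample_df['weight_lbs'] = sample_df['weight'].str.extract(r'(\d+)', expand=False).astype(int)

# Convert feet-and-inches strings into total inches.
feet_inches = sample_df['height'].str.extract(r"(\d+)'\s*(\d+)").astype(int)
sample_df['height_in'] = feet_inches[0] * 12 + feet_inches[1]

print(sample_df[['salary_usd', 'weight_lbs', 'height_in']])
```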

It would be nice to also have data corresponding to the performance of each player, in terms of points per game, field goal percentage, rebounds per game, etc. We can find this information at each player's personal page on ESPN:

![curry_player_webpage.png](images/curry_player_webpage.png)

We'll want to extract the career stats, which can be found in the highlighted section of the source code below:

![curry_player_source.png](images/curry_player_source.png)

In order to extract this information for each player in our DataFrame, we can construct URLs to open using the `id` column. Fortunately, the URL is standardized and very easy to construct. For example, using the `id` value of 3975 for Stephen Curry, the URL to open would be: https://www.espn.com/nba/player/stats/_/id/3975. For each player in the DataFrame, we will open their webpage, extract the career stats, and store the stats in a separate DataFrame. After all the stats are extracted, we will merge the data into `all_players_df`.

In [126]:
url = "https://www.espn.com/nba/player/_/id/4401415"
f = urllib.request.urlopen(url)
player_source = f.read().decode('utf-8')

# Capture everything between the "Career" label and the end of the stats block.
stats_regex = r'\["Career",(.*?)\]\]\}\},"gmlg"'
career_info = re.findall(stats_regex, player_source)
# Strip the quote characters and split the captured text into individual values.
career_info = career_info[0].replace('"', '').split(",")
career_info

# categories for career stats are:
# GP,MIN,FG%,3P%,FT%,REB,AST,BLK,STL,PF,TO,PTS

['14',
 '9.9',
 '50.0',
 '23.1',
 '84.2',
 '1.9',
 '0.9',
 '0.3',
 '0.2',
 '1.0',
 '0.8',
 '4.2']
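
To make these numbers easier to work with, each value can be paired with its stat category and converted to a float. The sketch below assumes the captured text is a comma-separated run of quoted values in the category order listed above; `parse_career_stats` and `sample_raw` are hypothetical names.

```python
# Stat categories, in the order given in the comment above.
CATEGORIES = ["GP", "MIN", "FG%", "3P%", "FT%", "REB", "AST", "BLK", "STL", "PF", "TO", "PTS"]

def parse_career_stats(raw):
    """Map each category to a float, given text such as '"14","9.9",...'."""
    values = raw.replace('"', '').split(',')
    return dict(zip(CATEGORIES, (float(v) for v in values)))

# Assumed shape of the text captured by the regex, before the quotes are stripped.
sample_raw = '"14","9.9","50.0","23.1","84.2","1.9","0.9","0.3","0.2","1.0","0.8","4.2"'
career_stats = parse_career_stats(sample_raw)
print(career_stats["PTS"])
```

Stripping the quote characters before calling `float()` matters, because a value like `'"14"'` cannot be converted directly.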

In [117]:
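career_df = pd.DataFrame(columns = ["GP","MIN","FG","3P","FT","REB","AST","BLK","STL","PF","TO","PTS"])
# Convert the values to floats. Any quote characters must be stripped first,
# otherwise this fails with: ValueError: could not convert string to float: '"14"'
career_series = pd.Series([float(x) for x in career_info], index = career_df.columns)
# Add the player as a new row (DataFrame.append was removed in pandas 2.0).
career_df.loc[len(career_df)] = career_series
career_df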

In [109]:
test = 
test

GP       "14"
MIN     "9.9"
FG     "50.0"
3P     "23.1"
FT     "84.2"
REB     "1.9"
AST     "0.9"
BLK     "0.3"
STL     "0.2"
PF      "1.0"
TO      "0.8"
PTS     "4.2"
dtype: object

Now that we have gathered and organized the roster data from each team, we can use the data to calculate some statistics. Let's start by seeing which team has the heaviest players:
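
As a sketch of how this comparison could be done with pandas, the example below groups a small made-up DataFrame by team and averages the players' weights. The teams, players, and the `weight_lbs` column are hypothetical; on the real data the same steps would be applied to `all_players_df`.

```python
import pandas as pd

# Hypothetical rows standing in for all_players_df.
sample_df = pd.DataFrame(
    {
        'team': ['team-a', 'team-a', 'team-b', 'team-b'],
        'weight': ['250 lbs', '230 lbs', '185 lbs', '215 lbs'],
    },
    index=['P1', 'P2', 'P3', 'P4'],
)

# Drop the ' lbs' suffix so the weights can be treated as integers.
sample_df['weight_lbs'] = sample_df['weight'].str.replace(' lbs', '', regex=False).astype(int)

# Average weight per team, heaviest team first.
mean_weight = sample_df.groupby('team')['weight_lbs'].mean().sort_values(ascending=False)
print(mean_weight)
```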

In [None]:



        '''
        The salaries were embedded in the source code in this format:
        {"name":"Stephen Curry","href":"http://www.espn.com/nba/player/_/id/3975/stephen-curry",
        "uid":"s:40~l:46~a:3975","guid":"5dda51f150c966e12026400b73f34fad","id":"3975",
        "height":"6' 3\"","weight":"185 lbs","age":32,"position":"PG","jersey":"30",
        "salary":"$40,231,758","birthDate":"03/14/88",
        "headshot":"https://a.espncdn.com/i/headshots/nba/players/full/3975.png",
        "lastName":"Stephen Curry","experience":10,"college":"Davidson","birthPlace":"Akron, OH"}
        '''

        for key in player_salaries.keys():
            if (str(player_salaries[key]) == '&nbsp;'):
                player_salaries[key] = 'Not Reported'
            else:
                player_salaries[key] = int(re.sub("([,$])","", player_salaries[key]))
                salaries.append (player_salaries[key])
        # Sort the salaries and append them to the list,
        # Also returns the person with the highest salary
        highest_salary_per_team.append((str(find_key(player_salaries,sorted(salaries)[-1])),sorted(salaries)[-1]))
        average_team_salaries.append(sum(salaries)/len(salaries))
        sleep(1) # wait a second before opening next url so we don't get blocked
    
    

In [None]:
# Create empty lists that will contain the statistics.
    average_team_salaries = []
    highest_salary_per_team = []
    
    # Prints the average salary out, with the team and salary side by side.
    print ("\n\nAverage Team Salaries in the NBA\n(Average amount spent on each player)\n\n")
    team_with_salary = dict(zip(average_team_salaries, rosters.keys()))
    average_team_salaries.sort()
    for key in average_team_salaries:
        print (team_with_salary[key], round(key,2))
    # Prints the highest salaries out, with the team and salary side by side.
    team_with_highest = dict(zip(highest_salary_per_team, rosters.keys() ))
    highest_salary_per_team.sort(key=lambda highest_salary_per_team: highest_salary_per_team[1])
    print ("\n\nPlayer with the highest salary per team in the NBA\n\n")
    for key in highest_salary_per_team:
        print (team_with_highest[key], key)

Now that we have our functions defined, we can write a few lines of code to execute each function.

In [6]:
rosters = build_team_url()
calculate_statistic(rosters)

AttributeError: module 'urllib' has no attribute 'urlopen'

Again, this is achieved by looping through all the team rosters on the ESPN website, then looping through all the players and extracting their salaries. The average salary per team is obtained by simply adding together the salaries and dividing by the number of people per team. I also use a simple sort function in Python to be able to see which player on each team is the highest paid.
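
The same two statistics can also be computed from a DataFrame with `groupby`. This is a minimal sketch on made-up data, not the loop-based script described above; the teams, players, and the `salary_usd` column are hypothetical.

```python
import pandas as pd

# Hypothetical rows standing in for all_players_df; P4 has no reported salary.
sample_df = pd.DataFrame(
    {
        'team': ['team-a', 'team-a', 'team-b', 'team-b'],
        'salary': ['$10,000,000', '$2,000,000', '$5,000,000', None],
    },
    index=['P1', 'P2', 'P3', 'P4'],
)

# Strip '$' and ',' so the salaries become numbers; missing values stay NaN.
sample_df['salary_usd'] = pd.to_numeric(
    sample_df['salary'].str.replace(r'[$,]', '', regex=True)
)

# mean() skips NaN, so players without a reported salary are left out of the average.
average_salary = sample_df.groupby('team')['salary_usd'].mean().sort_values()

# idxmax() returns the index label (the player name) of the largest salary in each team.
highest_paid = sample_df.groupby('team')['salary_usd'].idxmax()

print(average_salary)
print(highest_paid)
```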

Looking at the data, we can see who is paid the most:

In general, webpages that link to subpages within the same site construct their links in some sort of standardized pattern. The techniques that I've outlined here should be broadly applicable for other websites. I hope what you've learned from this project will help you out on your own web scraping quests.

Thanks for reading! I hope my code helped you understand how to perform basic web scraping using Python.