#Web Scraping with Beautiful Soup

This tutorial gives a basic introduction to scraping text data from the web with the Beautiful Soup package.

##Installation

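If Beautiful Soup is not already installed, run the following in a terminal (older Anaconda releases name the package beautiful-soup; newer ones name it beautifulsoup4): <pre><code>conda install beautiful-soup</code></pre>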

##Retrieving HTML

The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis. Webpages are rendered by the browser from HTML and CSS code, but much of the information in the HTML underlying any website is not interesting to us.

We begin by reading in the source code for a given web page and creating a Beautiful Soup object with the BeautifulSoup function.

###urllib2
urllib2 is a Python 2 standard-library module for working with URLs. We will use it to open connections to URLs and retrieve the webpage source.

In [20]:
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://espn.go.com/nba/statistics/player/_/stat/assists/qualified/false').read()
soup = BeautifulSoup(page)
print type(soup)

<class 'bs4.BeautifulSoup'>


To inspect the HTML behind any element of a page, right-click the element in your browser and choose "Inspect element".

The soup object contains all of the HTML in the original document.

In [21]:
print soup.prettify()[0:1000]

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml">
 <head>
  <script src="http://sports-ak.espn.go.com/sports/optimizely.js">
  </script>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <link href="http://a.espncdn.com/prod/assets/icons/E.svg" mask="" rel="icon" sizes="any"/>
  <meta content="#CC0000" name="theme-color"/>
  <script type="text/javascript">
   if(true && navigator && navigator.userAgent.toLowerCase().indexOf("teamstream") >= 0) {
        window.location = 'http://a.m.espn.go.com/mobilecache/general/apps/sc';
    }
  </script>
  <title>
   2014-15 Regular Season NBA Player Stats and League Leaders - Assists - National Basketball Association - ESPN
  </title>
  <meta content="xuj1ODRluWa0frM-BjIr_aSHoUC7HB5C1MgmYAM_GkA" name="google-site-verification"/>
  <meta content="B1FEB7C682C46C8FCDA3130F3D18AC28" name="msvalidate.01"/>
  <meta content="noodp" name="googlebot"/>
  <meta content="index, follow" name="robot

##Parsing HTML

By "parsing HTML", we mean pulling out only the tags and attributes that are relevant to our analysis.  Beautiful Soup provides a set of methods for doing this.

###find method

The find method searches for and returns the first tag matching your search criteria, or None if there is no match.  You can search by tag name, by attributes, or both.  There is also a findAll method (spelled find_all in current Beautiful Soup; both names work and both appear below) that returns a list of every tag matching your query.
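
To see these calls on something smaller than a full web page, here is a minimal sketch on a made-up HTML snippet. The markup, the id, and the class name are illustrative assumptions, not taken from the ESPN page. It shows that find returns the first match, find_all returns every match, and find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only (not from the ESPN page)
html = """
<div id="stats">
  <p class="note">first</p>
  <p class="note">second</p>
</div>
"""

# "html.parser" is the parser that ships with Python's standard library
soup = BeautifulSoup(html, "html.parser")

first_note = soup.find("p")           # the first matching tag only
all_notes = soup.find_all("p")        # every matching tag, in document order
missing = soup.find(id="no-such-id")  # no match, so this is None

first_text = first_note.string
note_texts = [p.string for p in all_notes]
```

Because find can return None, it is worth checking the result before calling further methods on it when scraping a page whose layout may change.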

In [22]:
table_div = soup.find(id='my-players-table')
print table_div.prettify()[0:4000]

<div class="col-main" id="my-players-table">
 <div class="mod-container mod-table">
  <div class="mod-header stathead">
   <h4>
    Assists Per Game Leaders - All Players
   </h4>
  </div>
  <div class="mod-content">
   <table cellpadding="3" cellspacing="1" class="tablehead">
    <tr align="right" class="colhead">
     <td align="left" style="width:20px;">
      RK
     </td>
     <td align="left">
      PLAYER
     </td>
     <td align="left">
      TEAM
     </td>
     <td>
      <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/gamesPlayed/qualified/false" title="Games Played">
       GP
      </a>
     </td>
     <td>
      <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgMinutes/qualified/false" title="Minutes Per Game">
       MPG
      </a>
     </td>
     <td>
      <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/assists/qualified/false" title="Assists">
       AST
      </a>
     </td>
     <td>
      <a href="

Notice that the object returned by the find method is just another inner HTML structure, which we can step through just as we would the original soup object.  We are now at the location in the webpage where the table we want starts, so we can use find again to get to the table itself.

In [23]:
table = table_div.find("table")
print table

<table cellpadding="3" cellspacing="1" class="tablehead">
<tr align="right" class="colhead"><td align="left" style="width:20px;">RK</td><td align="left">PLAYER</td><td align="left">TEAM</td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/gamesPlayed/qualified/false" title="Games Played">GP</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgMinutes/qualified/false" title="Minutes Per Game">MPG</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/assists/qualified/false" title="Assists">AST</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgAssists/qualified/false/order/false" title="Assists Per Game">APG</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/turnovers/qualified/false" title="Turnovers">TO</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgTurnovers/qualified/false" title="Turnovers Per Gam

###Searching by Attributes

Now that we have the table object, we need to step through the rows.  First we'll find the header row so we can get the field names for our data.  Here we're searching for tags under the table tag whose class attribute is "colhead".

In [24]:
table_head = table.find(attrs={"class":'colhead'})
print table_head.prettify()

<tr align="right" class="colhead">
 <td align="left" style="width:20px;">
  RK
 </td>
 <td align="left">
  PLAYER
 </td>
 <td align="left">
  TEAM
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/gamesPlayed/qualified/false" title="Games Played">
   GP
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgMinutes/qualified/false" title="Minutes Per Game">
   MPG
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/assists/qualified/false" title="Assists">
   AST
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgAssists/qualified/false/order/false" title="Assists Per Game">
   APG
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/turnovers/qualified/false" title="Turnovers">
   TO
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgTurnovers/qua

Now we find the individual header cells by searching for the 'td' tags, which mark table data cells.

In [25]:
header_cols = table_head.findAll('td')
print header_cols

[<td align="left" style="width:20px;">RK</td>, <td align="left">PLAYER</td>, <td align="left">TEAM</td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/gamesPlayed/qualified/false" title="Games Played">GP</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgMinutes/qualified/false" title="Minutes Per Game">MPG</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/assists/qualified/false" title="Assists">AST</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgAssists/qualified/false/order/false" title="Assists Per Game">APG</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/turnovers/qualified/false" title="Turnovers">TO</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgTurnovers/qualified/false" title="Turnovers Per Game">TOPG</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/s

Finally, we step through these columns and save their names to a list to be used later.  We'll skip the rank column (RK) because we don't need it in the analysis.  We also split the **PLAYER** column into **PLAYER** and **POSITION**.


In [26]:
# .string returns the text between a tag's opening and closing tags: <tag> value </tag>

In [27]:
cols = []
for header_col in header_cols:
    val = header_col.string
    if val != 'RK':
        cols.append(val)
    if val == 'PLAYER':
        cols.append('POSITION')
print cols

[u'PLAYER', 'POSITION', u'TEAM', u'GP', u'MPG', u'AST', u'APG', u'TO', u'TOPG', u'AP48M', u'AST/TO']


###Stepping Through a Table

The table rows are indicated by the tag 'tr'.  Again we can find them all and iterate through them.  Within each row we iterate through the respective columns.

In [28]:
table_rows = table.findAll('tr') #tr = tag for rows
print table_rows

[<tr align="right" class="colhead"><td align="left" style="width:20px;">RK</td><td align="left">PLAYER</td><td align="left">TEAM</td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/gamesPlayed/qualified/false" title="Games Played">GP</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgMinutes/qualified/false" title="Minutes Per Game">MPG</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/assists/qualified/false" title="Assists">AST</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgAssists/qualified/false/order/false" title="Assists Per Game">APG</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/turnovers/qualified/false" title="Turnovers">TO</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgTurnovers/qualified/false" title="Turnovers Per Game">TOPG</a></td><td><a href="http://espn.go.com/nba/stati

We will save our results in two different ways to demonstrate how to handle both.  The first is a list of dicts where the key is the field name and the value is the field value.  The second is just a list of lists of stats with no field names (we've already stored those in cols).
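
Before looking at the full loop, it may help to see how a single player cell is taken apart. The sketch below assumes the cell looks like a link containing the name followed by a text node such as ", PG". That markup is inferred from the parsing code in this tutorial rather than copied from the page, so treat it as hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed shape of one player cell: <a>name</a> followed by ", POSITION"
cell_html = '<td align="left"><a href="#">Chris Paul</a>, PG</td>'
cell = BeautifulSoup(cell_html, "html.parser").find("td")

name = cell.find("a").string       # text inside the link
trailing = cell.contents[1]        # second child of the cell: the text node ", PG"
position = trailing.split(" ")[1]  # split on the space and keep what follows the comma

# A tag with more than one child has no single string, so .string is None here
whole_cell_string = cell.string
```

This is why the loop reads the name from the link and the position from contents[1] instead of calling .string on the whole cell.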

In [29]:
players_stats_dicts = []  # list of dicts, one per row (field name -> value)
players_stats_array = []  # list of lists (one list of values per row)
for row in table_rows:
    if row.attrs['class'][0] == 'colhead':  # skip header rows, whose first class is 'colhead'
        continue
    player_stats = []
    row_cols = row.find_all('td')
    player_col = row_cols[1]  # index 0 is the rank, which we skip
    player_name = player_col.find('a').string  # the name is the text of the player's link (<a> tag)
    player_position = player_col.contents[1]  # the text that follows the link in the same cell
    player_position = player_position.split(' ')[1]  # split on the space and keep the position, dropping the comma
    player_stats.append(player_name)
    player_stats.append(player_position)
    for i in range(2, len(row_cols)):  # remaining cells: TEAM and the stats
        stat = row_cols[i].string  # cell text as a string
        player_stats.append(stat)
    players_stats_array.append(player_stats)
    player_stats = dict(zip(cols, player_stats))  # pair each field name with its value
    players_stats_dicts.append(player_stats)
print players_stats_dicts[0:5]

[{u'MPG': u'34.8', u'GP': u'82', u'AST': u'838', u'PLAYER': u'Chris Paul', u'TO': u'190', u'AP48M': u'14.1', u'TEAM': u'LAC', u'TOPG': u'2.3', 'POSITION': u'PG', u'APG': u'10.2', u'AST/TO': u'4.41'}, {u'MPG': u'35.9', u'GP': u'79', u'AST': u'792', u'PLAYER': u'John Wall', u'TO': u'304', u'AP48M': u'13.4', u'TEAM': u'WSH', u'TOPG': u'3.8', 'POSITION': u'PG', u'APG': u'10.0', u'AST/TO': u'2.61'}, {u'MPG': u'35.5', u'GP': u'75', u'AST': u'720', u'PLAYER': u'Ty Lawson', u'TO': u'185', u'AP48M': u'13.0', u'TEAM': u'DEN', u'TOPG': u'2.5', 'POSITION': u'PG', u'APG': u'9.6', u'AST/TO': u'3.89'}, {u'MPG': u'31.5', u'GP': u'22', u'AST': u'193', u'PLAYER': u'Ricky Rubio', u'TO': u'64', u'AP48M': u'13.4', u'TEAM': u'MIN', u'TOPG': u'2.9', 'POSITION': u'PG', u'APG': u'8.8', u'AST/TO': u'3.02'}, {u'MPG': u'34.4', u'GP': u'67', u'AST': u'574', u'PLAYER': u'Russell Westbrook', u'TO': u'293', u'AP48M': u'12.0', u'TEAM': u'OKC', u'TOPG': u'4.4', 'POSITION': u'PG', u'APG': u'8.6', u'AST/TO': u'1.96'}]


Here we've used the zip function to pair each field name in cols with the corresponding value in the row, and then passed those pairs to dict to get a dictionary of FIELD --> VALUE for every player in the table.
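
The zip-then-dict step can be tried on its own. This small sketch uses a shortened header list and the first row of values shown above; the variable names are only for illustration.

```python
# A shortened header list and one row of values taken from the output above
cols = ["PLAYER", "POSITION", "TEAM", "APG"]
row = ["Chris Paul", "PG", "LAC", "10.2"]

pairs = list(zip(cols, row))  # [(field, value), ...] matched by position
record = dict(pairs)          # field name -> value

# zip stops at the shorter input, so a short row silently loses fields
short_record = dict(zip(cols, row[:2]))
```

The last line is a reminder that the header list and each row need the same length and order, otherwise values end up under the wrong field names or go missing.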

Beautiful Soup has many other features, including the ability to step up, down, and sideways in the HTML tree and to search for tags, attributes, or values.  For more, take a look at the [Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/).

##Load into Pandas

Let's load our scraped data into Pandas and take a look at it.  The first way is to build the DataFrame directly from the list of dicts we defined.

In [30]:
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(players_stats_dicts)
df.head()

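| | AP48M | APG | AST | AST/TO | GP | MPG | PLAYER | POSITION | TEAM | TO | TOPG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.1 | 10.2 | 838 | 4.41 | 82 | 34.8 | Chris Paul | PG | LAC | 190 | 2.3 |
| 1 | 13.4 | 10.0 | 792 | 2.61 | 79 | 35.9 | John Wall | PG | WSH | 304 | 3.8 |
| 2 | 13.0 | 9.6 | 720 | 3.89 | 75 | 35.5 | Ty Lawson | PG | DEN | 185 | 2.5 |
| 3 | 13.4 | 8.8 | 193 | 3.02 | 22 | 31.5 | Ricky Rubio | PG | MIN | 64 | 2.9 |
| 4 | 12.0 | 8.6 | 574 | 1.96 | 67 | 34.4 | Russell Westbrook | PG | OKC | 293 | 4.4 |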


Here is a second way to do it.  We convert the list of lists into a NumPy array and create a Pandas DataFrame from it, along with the list of column headers we defined earlier.

In [31]:
np_array = np.array(players_stats_array)
df = pd.DataFrame(np_array, columns=cols)
df.head()

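| | PLAYER | POSITION | TEAM | GP | MPG | AST | APG | TO | TOPG | AP48M | AST/TO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chris Paul | PG | LAC | 82 | 34.8 | 838 | 10.2 | 190 | 2.3 | 14.1 | 4.41 |
| 1 | John Wall | PG | WSH | 79 | 35.9 | 792 | 10.0 | 304 | 3.8 | 13.4 | 2.61 |
| 2 | Ty Lawson | PG | DEN | 75 | 35.5 | 720 | 9.6 | 185 | 2.5 | 13.0 | 3.89 |
| 3 | Ricky Rubio | PG | MIN | 22 | 31.5 | 193 | 8.8 | 64 | 2.9 | 13.4 | 3.02 |
| 4 | Russell Westbrook | PG | OKC | 67 | 34.4 | 574 | 8.6 | 293 | 4.4 | 12.0 | 1.96 |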


##Exercise

The goal of this exercise is to combine the scoring and assists statistics for every player in the NBA in 2014-2015.  The end result will have them in a pandas dataframe with the fields from both pages for every player.

The general steps should be as follows:
- Create a function get_cols that retrieves the names of the header columns given a table element (skip the ranks, split the positions)
- Create a function get_data that retrieves the actual table data given a table element (skip the ranks, split the positions).  You can use either the dict approach or the numpy array approach.
- Write a Python loop that goes through the result pages and calls these functions on the appropriate URLs so that you retrieve every player (rather than just the top few).  Each page holds 40 rows, so add 40 to the row offset in the URL for each page.
- Repeat the above on both the scoring and assists URLs to get a pandas dataframe for both of them
- Use the pandas.DataFrame.join() function to join your 2 pandas dataframes together and get a total result
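
The paging step can be sketched without touching the network. The helper below is hypothetical (page_urls and its parameters are not part of the tutorial), and it assumes that appending /count/N to a stats URL starts the listing at row N, which is the pattern used in the solution attempt that follows. How many pages exist is not known here, so the last offset is left as an argument.

```python
def page_urls(base_url, last_start, rows_per_page=40):
    """Build one URL per results page, with row offsets 1, 1 + rows_per_page, ..."""
    return ["%s/count/%d" % (base_url, start)
            for start in range(1, last_start + 1, rows_per_page)]

base = "http://espn.go.com/nba/statistics/player/_/stat/assists/qualified/false"
urls = page_urls(base, last_start=81)  # offsets 1, 41, 81
```

Starting the offsets at 1 keeps the first page in the results; a counter that is incremented before the first request would skip the top 40 players.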

In [46]:
# Define functions: retrieve the column headers and the table data

def get_cols(table):
    # Return the header names of a stats table: skip RK, add POSITION after PLAYER
    table_head = table.find(attrs={"class": 'colhead'})
    header_cols = table_head.findAll('td')

    cols = []
    for header_col in header_cols:
        val = header_col.string
        if val != 'RK':
            cols.append(val)
        if val == 'PLAYER':
            cols.append('POSITION')
    return cols

def get_data(table):
    # Return a DataFrame with one row per player in a stats table
    cols = get_cols(table)  # headers of this table, not a global left over from earlier cells
    players_stats_dicts = []  # list of dicts, one per row (header -> value)
    for row in table.findAll('tr'):
        if 'colhead' in row.attrs.get('class', []):  # skip header rows
            continue
        row_cols = row.find_all('td')
        player_col = row_cols[1]  # index 0 is the rank, which we skip
        player_name = player_col.find('a').string  # the name is the text of the player's link
        player_position = player_col.contents[1].split(' ')[1]  # text after the link, minus the comma
        player_stats = [player_name, player_position]
        for i in range(2, len(row_cols)):  # remaining cells: TEAM and the stats
            player_stats.append(row_cols[i].string)
        players_stats_dicts.append(dict(zip(cols, player_stats)))
    return pd.DataFrame(players_stats_dicts, columns=cols)

In [50]:
# Loop through the scoring pages and stack them into one DataFrame
scoring_url = 'http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false'

scoring_frames = []
for start in range(1, 442, 40):  # row offsets 1, 41, ..., 441 (40 rows per page)
    page = urllib2.urlopen('%s/count/%d' % (scoring_url, start)).read()
    soup = BeautifulSoup(page)
    table_div = soup.find(id='my-players-table')
    table = table_div.find("table")
    scoring_frames.append(get_data(table))  # headers and data for this page
scoring_df = pd.concat(scoring_frames, ignore_index=True)
print scoring_df.head()

In [49]:
# Repeat for the assists pages
assist_url = 'http://espn.go.com/nba/statistics/player/_/stat/assists/qualified/false'

assists_frames = []
for start in range(1, 442, 40):  # row offsets 1, 41, ..., 441 (40 rows per page)
    page = urllib2.urlopen('%s/count/%d' % (assist_url, start)).read()
    soup = BeautifulSoup(page)
    table_div = soup.find(id='my-players-table')
    table = table_div.find("table")
    assists_frames.append(get_data(table))  # headers and data for this page
assists_df = pd.concat(assists_frames, ignore_index=True)
print assists_df.head()
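
The last step of the exercise, joining the two tables, is not shown above. A minimal sketch of one way to do it follows, using two tiny made-up frames. The player names and numbers are placeholders, not scraped values, and dropping the shared columns from one side before joining is just one option; passing lsuffix and rsuffix to join is another.

```python
import pandas as pd

# Toy stand-ins for the scraped tables (placeholder values, not real stats)
assists_df = pd.DataFrame({"PLAYER": ["Player A", "Player B"],
                           "TEAM": ["AAA", "BBB"],
                           "APG": ["10.2", "8.6"]})
scoring_df = pd.DataFrame({"PLAYER": ["Player B", "Player A"],
                           "TEAM": ["BBB", "AAA"],
                           "PTS": ["28.1", "19.1"]})

# Columns present in both tables (other than the key) would clash in a join
shared = [c for c in scoring_df.columns
          if c in assists_df.columns and c != "PLAYER"]

# join matches rows on the index, so index both frames by player first
combined = assists_df.set_index("PLAYER").join(
    scoring_df.set_index("PLAYER").drop(columns=shared),
    how="outer",  # keep players that appear in either table
)
```

Note that the scraped values are strings, so numeric columns would still need converting (for example with pd.to_numeric) before any arithmetic.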