#Web Scraping with Beautiful Soup

This tutorial will provide a very basic introduction to using the Beautiful Soup package to scrape text data from the web. 

##Installation

In a terminal, install Beautiful Soup if necessary by running <pre><code>conda install beautifulsoup4</code></pre>

##Retrieving HTML

The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis. Webpages are rendered by the browser from HTML and CSS code, but much of the information in the HTML underlying any website is not interesting to us.

We begin by reading in the source code for a given web page and creating a Beautiful Soup object with the BeautifulSoup function.

###urllib2
urllib2 is a Python 2 standard-library module for working with URLs. We will use it to open connections to URLs and retrieve the webpage source.

In [2]:
from bs4 import BeautifulSoup
import urllib2
#open a connection to the web page; read() returns the raw page source
page = urllib2.urlopen('http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false').read() 
#make a BeautifulSoup object
soup = BeautifulSoup(page)
print type(soup)

<class 'bs4.BeautifulSoup'>


The soup object contains all of the HTML in the original document.

In [5]:
#soup object represents an HTML tree
print soup.prettify()[0:1000]

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml">
 <head>
  <script src="http://sports-ak.espn.go.com/sports/optimizely.js">
  </script>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <link href="http://a.espncdn.com/prod/assets/icons/E.svg" mask="" rel="icon" sizes="any"/>
  <meta content="#CC0000" name="theme-color"/>
  <script type="text/javascript">
   if(true && navigator && navigator.userAgent.toLowerCase().indexOf("teamstream") >= 0) {
        window.location = 'http://a.m.espn.go.com/mobilecache/general/apps/sc';
    }
  </script>
  <title>
   2014-15 Regular Season NBA Player Stats and League Leaders - Scoring Per Game - National Basketball Association - ESPN
  </title>
  <meta content="xuj1ODRluWa0frM-BjIr_aSHoUC7HB5C1MgmYAM_GkA" name="google-site-verification"/>
  <meta content="B1FEB7C682C46C8FCDA3130F3D18AC28" name="msvalidate.01"/>
  <meta content="noodp" name="googlebot"/>
  <meta content="index, follow" na
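
The retrieval cell above runs on Python 2. On Python 3 the `urllib2` module no longer exists and the same functionality is found in `urllib.request`. The following is a minimal sketch of the retrieval step under that assumption. The helper name `fetch_html`, the timeout, and the User-Agent string are illustrative choices and are not part of the original tutorial.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the decoded source of the page at `url` (hypothetical helper)."""
    # The User-Agent value is an arbitrary placeholder; some sites reject
    # requests that do not send one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial sketch)"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can be passed to `BeautifulSoup(page, "html.parser")`. Naming the parser explicitly avoids depending on whichever parser happens to be installed.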

##Parsing HTML

By "parsing HTML", we mean pulling out only the tags and attributes that are relevant to our analysis.  Beautiful Soup provides a handy set of methods for doing this efficiently.

###find method

The find method searches for the first tag matching your search criteria and returns it, or returns None if nothing matches.  You can search by tag name, by attribute values, or both.  There is also a findAll method (spelled find_all in current versions of Beautiful Soup) that returns a list of every tag matching your query.

In [8]:
#find returns the first matching tag; findAll returns every matching tag

table_div = soup.find(id='my-players-table')
#this pulls out everything enclosed in that div tag, which is where the table lives
#the result is a Tag object that holds only the contents of that div and can be searched like the soup
#prettify turns it into a string; here we print the first 4000 characters
print table_div.prettify()[0:4000]

<div class="col-main" id="my-players-table">
 <div class="mod-container mod-table">
  <div class="mod-header stathead">
   <h4>
    Points Per Game Leaders - All Players
   </h4>
  </div>
  <div class="mod-content">
   <table cellpadding="3" cellspacing="1" class="tablehead">
    <tr align="right" class="colhead">
     <td align="left" style="width:20px;">
      RK
     </td>
     <td align="left">
      PLAYER
     </td>
     <td align="left">
      TEAM
     </td>
     <td>
      <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/gamesPlayed/qualified/false" title="Games Played">
       GP
      </a>
     </td>
     <td>
      <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgMinutes/qualified/false" title="Minutes Per Game">
       MPG
      </a>
     </td>
     <td>
      <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false/order/false" title="Points Per Game">
       PTS

Notice that the object returned by the find method is just another inner HTML structure, which we can step through just as we would the original soup object.  We have reached the place in the webpage where the table we want starts, so now we can use find again to get to the table itself.

In [9]:
#now we search for the first table tag inside the div
table = table_div.find("table")
print table

<table cellpadding="3" cellspacing="1" class="tablehead">
<tr align="right" class="colhead"><td align="left" style="width:20px;">RK</td><td align="left">PLAYER</td><td align="left">TEAM</td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/gamesPlayed/qualified/false" title="Games Played">GP</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgMinutes/qualified/false" title="Minutes Per Game">MPG</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false/order/false" title="Points Per Game">PTS</a></td><td><span title="Field Goals Made-Attempted Per Game">FGM-FGA</span></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/fieldGoalPct/qualified/false" title="Field Goal Percentage">FG%</a></td><td><span title="3-Point Field Goals Made-Attempted Per Game">3PM-3PA</span></td><td><a href="http://espn.go.com/nba/statistics/play
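
The behaviour described in this section can be tried offline. The sketch below uses a small made-up HTML string that only mimics the shape of the stats page (the `oddrow` and `evenrow` class names and the player names are placeholders). It shows `find` returning the first match, `find_all` returning every match, and `find` returning `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the shape of the stats page.
html = """
<div id="my-players-table">
  <table class="tablehead">
    <tr class="colhead"><td>RK</td><td>PLAYER</td></tr>
    <tr class="oddrow"><td>1</td><td>Player A</td></tr>
    <tr class="evenrow"><td>2</td><td>Player B</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match, and the result can be searched again.
table = soup.find(id="my-players-table").find("table")
first_row = table.find("tr")

# find_all (findAll is its older alias) returns every match.
all_rows = table.find_all("tr")

# find returns None when nothing matches, so check before chaining.
missing = soup.find(id="no-such-id")

print(first_row["class"])  # class can hold several values, so this is a list
print(len(all_rows))
print(missing)
```

Because a failed `find` gives back `None`, chaining a second `find` onto it raises an `AttributeError`. Checking the intermediate result first makes a changed page layout easier to diagnose.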

###Searching by Attributes

Now that we have the table object, we need to step through the rows.  First we'll find the header row so we can populate the field names for our data.  Here we're searching for tags under the table tag whose class attribute is "colhead".

In [10]:
#we need to tell the header row apart from the other rows; here it is distinguished by the class "colhead"
table_head = table.find(attrs={"class":'colhead'}) #find the first tag whose class is colhead
print table_head.prettify()

<tr align="right" class="colhead">
 <td align="left" style="width:20px;">
  RK
 </td>
 <td align="left">
  PLAYER
 </td>
 <td align="left">
  TEAM
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/gamesPlayed/qualified/false" title="Games Played">
   GP
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgMinutes/qualified/false" title="Minutes Per Game">
   MPG
  </a>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false/order/false" title="Points Per Game">
   PTS
  </a>
 </td>
 <td>
  <span title="Field Goals Made-Attempted Per Game">
   FGM-FGA
  </span>
 </td>
 <td>
  <a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/fieldGoalPct/qualified/false" title="Field Goal Percentage">
   FG%
  </a>
 </td>
 <td>
  <span title="3-Point Field Goals Made-Attempted Per Game">
   3PM-3PA
  </span>
 </td>
 <t

Now we find the individual header cells by searching for the 'td' tags; td is the HTML tag for table data.

In [11]:
#now we've picked the row with each of the headers, and we need to go through and get each of its children
#findAll returns a list
header_cols = table_head.findAll('td') #td = table data = an HTML tag
print header_cols

[<td align="left" style="width:20px;">RK</td>, <td align="left">PLAYER</td>, <td align="left">TEAM</td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/gamesPlayed/qualified/false" title="Games Played">GP</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgMinutes/qualified/false" title="Minutes Per Game">MPG</a></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false/order/false" title="Points Per Game">PTS</a></td>, <td><span title="Field Goals Made-Attempted Per Game">FGM-FGA</span></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/fieldGoalPct/qualified/false" title="Field Goal Percentage">FG%</a></td>, <td><span title="3-Point Field Goals Made-Attempted Per Game">3PM-3PA</span></td>, <td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/threePointFieldGoalPct/qualified/false" t

Finally, we step through these columns and save them to a list to be used later.  We'll ignore the rank column (RK) because that doesn't give us anything we want later.  We also separate the **PLAYER** column into **PLAYER** and **POSITION**.

In [12]:
cols = [] #empty list that will hold all the header names
for header_col in header_cols: #header_cols is a list of td tags
    val = header_col.string
    if val != 'RK':
        cols.append(val)
    if val == 'PLAYER':
        cols.append('POSITION')
print cols

[u'PLAYER', 'POSITION', u'TEAM', u'GP', u'MPG', u'PTS', u'FGM-FGA', u'FG%', u'3PM-3PA', u'3P%', u'FTM-FTA', u'FT%']
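
As a hedged illustration of the attribute search and the `.string` lookup used above, the sketch below works on a made-up header row. The last cell is deliberately built with two children to show a case the real page may or may not contain: `.string` returns `None` when a tag has more than one child, while `get_text()` always returns the combined text.

```python
from bs4 import BeautifulSoup

# Made-up header row; the last cell has two children on purpose.
html = """
<table>
  <tr class="colhead">
    <td>RK</td>
    <td>PLAYER</td>
    <td><a href="#" title="Games Played">GP</a></td>
    <td><b>FG</b>%</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways to search on the class attribute.
head_a = soup.find(attrs={"class": "colhead"})
head_b = soup.find("tr", class_="colhead")

cells = head_a.find_all("td")

# .string works when a cell holds a single piece of text (even nested in
# one child tag), but is None when the cell has several children.
strings = [cell.string for cell in cells]

# get_text() joins all text inside the cell, so it never returns None.
texts = [cell.get_text(strip=True) for cell in cells]

print(strings)
print(texts)
```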


###Stepping Through a Table

The table rows are indicated by the tag 'tr'.  Again we can find them all and iterate through them.  Within each row we iterate through the respective columns.

In [13]:
table_rows = table.findAll('tr')
print table_rows

[<tr align="right" class="colhead"><td align="left" style="width:20px;">RK</td><td align="left">PLAYER</td><td align="left">TEAM</td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/gamesPlayed/qualified/false" title="Games Played">GP</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgMinutes/qualified/false" title="Minutes Per Game">MPG</a></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false/order/false" title="Points Per Game">PTS</a></td><td><span title="Field Goals Made-Attempted Per Game">FGM-FGA</span></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/fieldGoalPct/qualified/false" title="Field Goal Percentage">FG%</a></td><td><span title="3-Point Field Goals Made-Attempted Per Game">3PM-3PA</span></td><td><a href="http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/threePointFieldGoalPct/qu
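
Each data row keeps the player's name inside an a tag and the position as loose text after it. A minimal sketch of pulling those apart, using one made-up row in the same shape (the class name `oddrow` and the values are placeholders):

```python
from bs4 import BeautifulSoup

# Made-up row in the same shape as a player row: rank, player cell, stats.
html = (
    '<table><tr class="oddrow">'
    '<td>1</td>'
    '<td><a href="#">Player A</a>, PG</td>'
    '<td>OKC</td><td>28.1</td>'
    '</tr></table>'
)
row = BeautifulSoup(html, "html.parser").find("tr")
cells = row.find_all("td")

player_cell = cells[1]
# contents lists the cell's direct children: the <a> tag, then the loose text.
name = player_cell.find("a").string
position = player_cell.contents[1].split(" ")[1]  # ", PG" -> [",", "PG"]

# Everything after the player cell is a plain statistic.
stats = [cell.string for cell in cells[2:]]

print([name, position] + stats)
```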

We will save our results in 2 different ways to demonstrate how we can handle both.  The first will be a list of dicts where the key is the field name and the value is the field value.  The second will just be a list of lists of stats with no field name values (we've already defined them earlier).

In [20]:
players_stats_dicts = [] #one dictionary per player
players_stats_array = [] #one list of stats per player (a list of lists)
for row in table_rows:
    #skip the header row, which is the one whose class is colhead
    if 'colhead' in row.get('class', []):
        continue
    player_stats = []
    row_cols = row.find_all('td')
    #the first column is the rank, which we don't need
    player_col = row_cols[1]
    #a tags hold hyperlinks; the player's name is the text inside the a tag
    player_name = player_col.find('a').string
    #the position sits outside the a tag but still inside the player cell
    player_position = player_col.contents[1]
    #the position text looks like ', [position]', so splitting on the space gives
    #[',', '[position]'] and we keep the second element
    player_position = player_position.split(' ')[1]
    player_stats.append(player_name)
    player_stats.append(player_position)
    #loop through the remaining columns and add each stat
    for i in range(2, len(row_cols)):
        stat = row_cols[i].string
        player_stats.append(stat)
    players_stats_array.append(player_stats)
    #zip pairs each field name in cols with the matching value in player_stats;
    #if the lists differ in length, it stops at the shorter one
    player_stats = dict(zip(cols, player_stats))
    players_stats_dicts.append(player_stats)
print players_stats_dicts[0:5]
#the dictionaries print with their keys out of order because, unlike lists, dicts here are not ordered
#in the second method the order comes from the header list we built earlier

[{u'FGM-FGA': u'9.4-22.0', u'MPG': u'34.4', u'FTM-FTA': u'8.1-9.8', u'FG%': u'.426', u'GP': u'67', u'PLAYER': u'Russell Westbrook', u'FT%': u'.835', u'TEAM': u'OKC', u'3PM-3PA': u'1.3-4.3', 'POSITION': u'PG', u'PTS': u'28.1', u'3P%': u'.299'}, {u'FGM-FGA': u'8.0-18.1', u'MPG': u'36.8', u'FTM-FTA': u'8.8-10.2', u'FG%': u'.440', u'GP': u'81', u'PLAYER': u'James Harden', u'FT%': u'.868', u'TEAM': u'HOU', u'3PM-3PA': u'2.6-6.9', 'POSITION': u'SG', u'PTS': u'27.4', u'3P%': u'.375'}, {u'FGM-FGA': u'8.8-17.3', u'MPG': u'33.8', u'FTM-FTA': u'5.4-6.3', u'FG%': u'.510', u'GP': u'27', u'PLAYER': u'Kevin Durant', u'FT%': u'.854', u'TEAM': u'OKC', u'3PM-3PA': u'2.4-5.9', 'POSITION': u'SF', u'PTS': u'25.4', u'3P%': u'.403'}, {u'FGM-FGA': u'9.0-18.5', u'MPG': u'36.1', u'FTM-FTA': u'5.4-7.7', u'FG%': u'.488', u'GP': u'69', u'PLAYER': u'LeBron James', u'FT%': u'.710', u'TEAM': u'CLE', u'3PM-3PA': u'1.7-4.9', 'POSITION': u'SF', u'PTS': u'25.3', u'3P%': u'.354'}, {u'FGM-FGA': u'9.4-17.6', u'MPG': u'36.1'

Here we've used the zip function to pair each field name with the matching value as a tuple, and then converted those pairs into a dict, giving a dictionary of FIELD --> VALUE for every player in the table.
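
The pairing step can be seen in isolation with plain lists. This is a small sketch with made-up values and shortened field names:

```python
# Field names and one row of values, shortened for illustration.
cols = ["PLAYER", "POSITION", "TEAM", "PTS"]
row = ["Player A", "PG", "OKC", "28.1"]

# zip pairs items by position; dict turns the pairs into a mapping.
pairs = list(zip(cols, row))
record = dict(pairs)

# When the lists differ in length, zip silently stops at the shorter one.
short_record = dict(zip(cols, row[:2]))

print(pairs[0])
print(record["TEAM"])
print(short_record)
```

The silent truncation is worth keeping in mind: if a row has fewer cells than the header, the missing fields are simply absent from the dictionary rather than raising an error.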

Beautiful Soup has many other features, including the ability to step up, down, and sideways in the HTML tree and to search for nearly any tag, attribute, or value.  For more, take a look at the [Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/).
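
As a brief, hedged illustration of moving up, sideways, and down the tree, the sketch below uses a one-row table written by hand. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>RK</td><td>PLAYER</td><td>TEAM</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Locate a cell by its text, then navigate relative to it.
player_cell = soup.find("td", string="PLAYER")

# Up: the enclosing tag.
parent_name = player_cell.parent.name

# Sideways: neighbouring tags at the same level.
previous_text = player_cell.find_previous_sibling("td").string
next_text = player_cell.find_next_sibling("td").string

# Down: the direct children of a tag.
child_texts = [child.string for child in player_cell.parent.children]

print(parent_name, previous_text, next_text, child_texts)
```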

##Load into Pandas

Let's load our scraped data into Pandas and take a look at it.  The first way is to build the dataframe directly from the list of dictionaries we defined.

In [15]:
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(players_stats_dicts)
df.head()

| | 3P% | 3PM-3PA | FG% | FGM-FGA | FT% | FTM-FTA | GP | MPG | PLAYER | POSITION | PTS | TEAM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.299 | 1.3-4.3 | 0.426 | 9.4-22.0 | 0.835 | 8.1-9.8 | 67 | 34.4 | Russell Westbrook | PG | 28.1 | OKC |
| 1 | 0.375 | 2.6-6.9 | 0.44 | 8.0-18.1 | 0.868 | 8.8-10.2 | 81 | 36.8 | James Harden | SG | 27.4 | HOU |
| 2 | 0.403 | 2.4-5.9 | 0.51 | 8.8-17.3 | 0.854 | 5.4-6.3 | 27 | 33.8 | Kevin Durant | SF | 25.4 | OKC |
| 3 | 0.354 | 1.7-4.9 | 0.488 | 9.0-18.5 | 0.71 | 5.4-7.7 | 69 | 36.1 | LeBron James | SF | 25.3 | CLE |
| 4 | 0.083 | 0.0-0.2 | 0.535 | 9.4-17.6 | 0.805 | 5.5-6.8 | 68 | 36.1 | Anthony Davis | PF | 24.4 | NO |


Now here is a second way we can do it.  We convert the 2D stats array into a NumPy array and create a Pandas dataframe from it, passing along the list of column headers we defined earlier.

In [16]:
np_array = np.array(players_stats_array)
df = pd.DataFrame(np_array, columns=cols) #dictate what column names should be when we create dataframe
df.head()
#this is quicker and in proper order

| | PLAYER | POSITION | TEAM | GP | MPG | PTS | FGM-FGA | FG% | 3PM-3PA | 3P% | FTM-FTA | FT% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Russell Westbrook | PG | OKC | 67 | 34.4 | 28.1 | 9.4-22.0 | 0.426 | 1.3-4.3 | 0.299 | 8.1-9.8 | 0.835 |
| 1 | James Harden | SG | HOU | 81 | 36.8 | 27.4 | 8.0-18.1 | 0.44 | 2.6-6.9 | 0.375 | 8.8-10.2 | 0.868 |
| 2 | Kevin Durant | SF | OKC | 27 | 33.8 | 25.4 | 8.8-17.3 | 0.51 | 2.4-5.9 | 0.403 | 5.4-6.3 | 0.854 |
| 3 | LeBron James | SF | CLE | 69 | 36.1 | 25.3 | 9.0-18.5 | 0.488 | 1.7-4.9 | 0.354 | 5.4-7.7 | 0.71 |
| 4 | Anthony Davis | PF | NO | 68 | 36.1 | 24.4 | 9.4-17.6 | 0.535 | 0.0-0.2 | 0.083 | 5.5-6.8 | 0.805 |
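
The two construction routes can be compared on a small hand-written sample. This is a sketch with made-up values. It passes a plain list of lists to `pd.DataFrame`, which also works without the intermediate NumPy array, and it uses `pd.to_numeric` to show one way of turning the scraped strings into numbers.

```python
import pandas as pd

cols = ["PLAYER", "POSITION", "PTS"]
rows = [["Player A", "PG", "28.1"], ["Player B", "SG", "27.4"]]

# Route 1: a list of dicts; column names come from the dict keys.
df_from_dicts = pd.DataFrame([dict(zip(cols, row)) for row in rows])

# Route 2: a list of lists plus an explicit column list, which fixes the order.
df_from_lists = pd.DataFrame(rows, columns=cols)

# Reordering route 1 with the same column list makes the two frames line up.
same = df_from_dicts[cols].values.tolist() == df_from_lists.values.tolist()
print(same)

# Scraped values arrive as strings; convert before doing arithmetic on them.
pts = pd.to_numeric(df_from_lists["PTS"])
print(pts.mean())
```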


##Exercise

The goal of this exercise is to combine the scoring and assists statistics for every player in the NBA in 2014-2015.  The end result will have them in a pandas dataframe with the fields from both pages for every player.

The general steps should be as follows:
- Create a function get_cols that retrieves the names of the header columns given a table element (skip the ranks, split the positions)
- Create a function get_data that retrieves the actual table data given a table element (skip the ranks, split the positions).  You can use either the dict approach or the numpy array approach.
- Write a python loop to loop through the various pages and call these functions on the appropriate urls so that you can retrieve every player (rather than just the top few).
- Repeat the above on both the scoring and assists URLs to get a pandas dataframe for both of them
- Use the pandas.DataFrame.join() function to join your 2 pandas dataframes together and get a total result
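
For the last step, it helps to know that `DataFrame.join` matches against the index of the other dataframe, whereas `pd.merge` matches on columns. The sketch below uses two tiny made-up tables with hypothetical `PTS` and `AST` columns to show both routes producing the same pairing.

```python
import pandas as pd

# Tiny stand-ins for the scoring and assists tables (values are made up).
scoring = pd.DataFrame({"PLAYER": ["Player A", "Player B"], "PTS": ["28.1", "27.4"]})
assists = pd.DataFrame({"PLAYER": ["Player B", "Player A"], "AST": ["7.0", "8.6"]})

# join matches on the index of the other frame, so index both by PLAYER first.
joined = scoring.set_index("PLAYER").join(assists.set_index("PLAYER"), how="inner")

# merge matches on the PLAYER columns directly and gives the same pairing.
merged = pd.merge(scoring, assists, on="PLAYER")

print(joined)
print(merged)
```

If the two real tables share column names other than PLAYER, `join` needs `lsuffix` and `rsuffix` arguments to tell them apart, while `merge` adds `_x` and `_y` suffixes by default.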

In [32]:
#get_cols: given a table element, return the list of header names (skipping the rank, adding POSITION)
def get_cols(tbl):
    #the header row is distinguished from the other rows by its class, "colhead"
    table_head = tbl.find(attrs={"class":'colhead'}) #find the first tag whose class is colhead

    header_cols = table_head.findAll('td') #td = table data = an HTML tag

    cols = [] #empty list that will hold all the header names
    for header_col in header_cols: #header_cols is a list of td tags
        val = header_col.string
        if val != 'RK':
            cols.append(val)
        if val == 'PLAYER':
            cols.append('POSITION')
    return cols

#get_data: given a table element, return the table data as a list of dicts, one per player
#(skipping the rank and splitting the position out of the player cell)
def get_data(tbl):
    table_rows = tbl.findAll('tr')
    cols = get_cols(tbl) #look up the header names once rather than once per row
    player_stats_dicts = []
    for row in table_rows:
        if 'colhead' in row.get('class', []): #skip the header row
            continue
        player_stats = []
        row_cols = row.find_all('td')
        #the first column is the rank, which we don't need
        player_col = row_cols[1]
        #a tags hold hyperlinks; the player's name is the text inside the a tag
        player_name = player_col.find('a').string
        #the position sits outside the a tag but still inside the player cell
        player_position = player_col.contents[1]
        #the position text looks like ', [position]', so splitting on the space gives
        #[',', '[position]'] and we keep the second element
        player_position = player_position.split(' ')[1]
        player_stats.append(player_name)
        player_stats.append(player_position)
        #loop through the remaining columns and add each stat
        for i in range(2, len(row_cols)):
            stat = row_cols[i].string
            player_stats.append(stat)
        #zip pairs each field name with the matching value; if the lists differ in length, it stops at the shorter one
        player_stats = dict(zip(cols, player_stats))
        player_stats_dicts.append(player_stats)
    return player_stats_dicts

#all of the ESPN stats tables share the same format
#a for loop steps through the pages; only the count parameter in the URL changes from page to page
#the assists table is laid out the same way, so the two results can be joined much like tables in SQL
#we loop through the pages for both scoring and assists
#the end result is a combined dataframe showing scoring and assists for each player
#the header is assumed to be the same on every page; only the data differs
#both tables are matched on the player name; duplicate player names are not handled

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib2

df_scoring_total = pd.DataFrame()
df_assists_total = pd.DataFrame()

#python loop to go through pages and retrieve scoring and assist data on every player
for i in range(0, 13):
    if i == 0:
        scoring_URL = 'http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/qualified/false'
        assists_URL = 'http://espn.go.com/nba/statistics/player/_/stat/assists/qualified/false'
    else:
        scoring_URL = 'http://espn.go.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false/count/' + str(1 + (i * 40))
        assists_URL = 'http://espn.go.com/nba/statistics/player/_/stat/assists/sort/avgAssists/qualified/false/count/' + str(1 + (i * 40))
    
    page_scoring = urllib2.urlopen(scoring_URL).read()
    page_assists = urllib2.urlopen(assists_URL).read() 
    #make BeautifulSoup objects
    soup_scoring = BeautifulSoup(page_scoring)
    soup_assists = BeautifulSoup(page_assists)
    
    table_div_scoring = soup_scoring.find(id='my-players-table')
    table_div_assists = soup_assists.find(id='my-players-table')
    
    tbl_scoring = table_div_scoring.find("table")
    tbl_assists = table_div_assists.find("table")
    
    df_scoring = pd.DataFrame.from_dict(get_data(tbl_scoring))
    df_assists = pd.DataFrame.from_dict(get_data(tbl_assists))
    
    #the first page starts the running totals; every later page is concatenated onto them
    if i == 0:
        df_scoring_total = df_scoring
        df_assists_total = df_assists
    else:
        df_scoring_total = pd.concat([df_scoring_total, df_scoring])
        df_assists_total = pd.concat([df_assists_total, df_assists])
        
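#df_combined = df_scoring_total.join(df_assists_total, on = u'PLAYER', how = 'inner', lsuffix = '_left', rsuffix = '_right')
#the line above returned no rows even though both dataframes contained data: join's on argument matches the
#PLAYER column of the left dataframe against the index of the right one, which holds row numbers rather than names
#merge matches the PLAYER columns of both dataframes directly, so we use it instead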

df_combined = pd.merge(df_scoring_total, df_assists_total, on = 'PLAYER')
df_combined.to_csv('test.csv')
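
The page URLs in the loop above differ only in the statistic name, the sort key, and the trailing count offset. As a sketch of one way to factor that out, the hypothetical helper `page_urls` below builds the same list of URLs without fetching anything. The defaults of 13 pages and 40 rows per page are taken from the loop above.

```python
BASE = "http://espn.go.com/nba/statistics/player/_/stat/"


def page_urls(stat, sort_key, pages=13, page_size=40):
    """Return the list of page URLs for one statistic (hypothetical helper)."""
    # The first page has no sort key or count offset in its URL.
    urls = [BASE + stat + "/qualified/false"]
    for i in range(1, pages):
        offset = 1 + i * page_size  # 41, 81, 121, ...
        urls.append(
            BASE + stat + "/sort/" + sort_key + "/qualified/false/count/" + str(offset)
        )
    return urls


scoring_urls = page_urls("scoring-per-game", "avgPoints")
assists_urls = page_urls("assists", "avgAssists")
print(scoring_urls[1])
```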
    