# Web scraping

* Data scientists often need data that is only available on webpages
* Web scraping collects and stores that data systematically so it can be analyzed
* Beautiful Soup is a Python library for pulling data out of HTML and XML files
* Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Process

1. Obtain the webpage URL
2. Save the webpage contents to a file or a Python object
3. Convert the webpage contents to soup
4. Use Beautiful Soup functions to extract data from the soup via HTML tags
5. Write functions and loops to store the extracted data in Python data structures (e.g., a dictionary)
6. Write for loops to extract content from similarly structured webpages
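
The steps above can be sketched end to end on a small inline HTML string, so nothing has to be downloaded. The markup, the `parse_teams` helper, and the built-in `html.parser` backend are assumptions made for illustration; a live page would be fetched first and may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the saved contents of a webpage.
SAMPLE_HTML = """
<ul class="teams">
  <li><a class="name" href="/team/a">Team A</a></li>
  <li><a class="name" href="/team/b">Team B</a></li>
</ul>
"""

def parse_teams(html):
    """Return one dict per <li>, holding the link text and its href."""
    soup = BeautifulSoup(html, "html.parser")   # convert contents to soup
    rows = []
    for li in soup.find_all("li"):              # extract data via HTML tags
        a = li.find("a", class_="name")
        rows.append({"team": a.text, "url": a["href"]})  # store in a dict
    return rows

teams = parse_teams(SAMPLE_HTML)

# Similarly structured pages can reuse the same function in a loop.
pages = [SAMPLE_HTML, SAMPLE_HTML]
all_rows = [row for page in pages for row in parse_teams(page)]
print(teams)
```

Keeping the parsing in one function means the same logic can be applied to each page of a site that shares this layout.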

## Example
* Extract NBA team information from espn.com
* [ESPN NBA teams](http://espn.go.com/nba/teams)

In [1]:
# from IPython.display import HTML
# HTML(url='http://espn.go.com/nba/teams')

In [2]:
import pandas
import requests
from bs4 import BeautifulSoup

Read the URL and store the webpage data

In [3]:
url = 'http://espn.go.com/nba/teams'

In [4]:
r = requests.get(url)
r.status_code

200
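
Displaying `r.status_code` shows the result of the request but does not stop the notebook when the request fails. A minimal sketch of a stricter fetch follows; the `fetch_html` helper, its `session` parameter, and the 10-second timeout are assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising on HTTP error statuses."""
    http = session or requests       # a requests.Session also provides .get
    response = http.get(url, timeout=timeout)
    response.raise_for_status()      # raises requests.HTTPError on 4xx/5xx
    return response.text
```

With this helper, `fetch_html(url)` would return the same text as `r.text` on success and raise an exception otherwise.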

Convert webpage to soup

In [5]:
soup = BeautifulSoup(r.text, 'lxml')

Examine soup
* Identify classes and tags of information of interest

In [6]:
print(soup.prettify()[0:5000])

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml">
 <head>
  <script src="http://cdn.espn.com/sports/optimizely.js">
  </script>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <link href="http://a.espncdn.com/favicon.ico" mask="" rel="icon" sizes="any"/>
  <meta content="#CC0000" name="theme-color"/>
  <script type="text/javascript">
   if(true && navigator && navigator.userAgent.toLowerCase().indexOf("teamstream") >= 0) {
        window.location = 'http://m.espn.com/mobilecache/general/apps/sc';
    }
  </script>
  <script>
   (function(){function r(a){a=a.replace(/[\[]/,"\\[").replace(/[\]]/,"\\]");var c=new RegExp("[\\?&]"+a+"=([^&#]*)"),b=c.exec(location.search);return b==null?"":decodeURIComponent(b[1].replace(/\+/g," "));}var p=navigator.userAgent,o=window.location,l=document.cookie,f=document.referrer,n=(f===""||f.indexOf("www.espn.com")!==-1),d=(n)?"http://m.espn.com/nba/teams?src=desktop":"http://m.espn.com/nba/te

Use Beautiful Soup functions to collect elements by tag and class

In [7]:
tables = soup.find_all('ul', class_='medium-logos')

In [8]:
uls = soup.find_all('ul')
print('Number of objects: ', len(uls))

Number of objects:  19
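
The two calls above differ in scope: `find_all('ul')` returns every list on the page, while adding `class_` keeps only the lists with that class. The sketch below shows the difference on made-up markup, together with `select`, Beautiful Soup's CSS-selector method. The HTML and the `html.parser` backend are assumptions for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul class="nav"><li>Home</li></ul>
<ul class="medium-logos"><li>Boston Celtics</li><li>Brooklyn Nets</li></ul>
<ul class="medium-logos"><li>Chicago Bulls</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_lists = soup.find_all("ul")                          # every <ul>
team_lists = soup.find_all("ul", class_="medium-logos")  # filtered by class
team_items = soup.select("ul.medium-logos li")           # CSS selector

print(len(all_lists), len(team_lists), len(team_items))  # prints: 3 2 3
```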


Display the first object

In [9]:
ul = uls[0]
print(ul.prettify())

<ul class="medium-logos">
 <li class="first">
  <div class="logo-nba-medium">
   <img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/Bos.png?w=48&amp;h=48&amp;transparent=true"/>
   <h5>
    <a class="bi" href="http://www.espn.com/nba/team/_/name/bos/boston-celtics">
     Boston Celtics
    </a>
   </h5>
   <span>
    <a href="/nba/teams/stats?team=bos">
     Stats
    </a>
    |
    <a href="/nba/teams/schedule?team=bos">
     Schedule
    </a>
    |
    <a href="/nba/teams/roster?team=bos">
     Roster
    </a>
    |
    <a href="/nba/teams/depth?team=bos">
     Depth Chart
    </a>
   </span>
  </div>
 </li>
 <li class="alt">
  <div class="logo-nba-medium">
   <img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/BKN.png?w=48&amp;h=48&amp;transparent=true"/>
   <h5>
    <a class="bi" href="http://www.espn.com/nba/team/_/name/bkn/brooklyn-nets">
     Brooklyn Nets
    </a>
   </h5>
   <span>
    <a href="/nba/teams/stats?team=bkn">
     Stats
    </a>
    

Continue using Beautiful Soup functions to extract the relevant information

In [10]:
li = ul.find('li')
li

<li class="first"><div class="logo-nba-medium"><img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/Bos.png?w=48&amp;h=48&amp;transparent=true"/><h5><a class="bi" href="http://www.espn.com/nba/team/_/name/bos/boston-celtics">Boston Celtics</a></h5><span><a href="/nba/teams/stats?team=bos">Stats</a> | <a href="/nba/teams/schedule?team=bos">Schedule</a> | <a href="/nba/teams/roster?team=bos">Roster</a> | <a href="/nba/teams/depth?team=bos">Depth Chart</a></span></div></li>

In [112]:
a = li.h5.a
a

<a class="bi" href="http://www.espn.com/nba/team/_/name/bos/boston-celtics">Boston Celtics</a>

In [113]:
a['href']

'http://www.espn.com/nba/team/_/name/bos/boston-celtics'

In [114]:
a.text

'Boston Celtics'

In [115]:
a0 = li.find('a', class_='bi')
a0.get('href')

'http://www.espn.com/nba/team/_/name/bos/boston-celtics'
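
`a['href']` and `a.get('href')` return the same value when the attribute exists, but they behave differently when it is missing. A short sketch on a made-up tag without an `href` (the markup is an assumption for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="bi">No link here</a>', "html.parser")
a = soup.find("a")

missing = a.get("href")            # None, because the attribute is absent
fallback = a.get("href", "")       # a default value can be supplied

try:
    a["href"]
    raised = False
except KeyError:
    raised = True                  # subscript access raises KeyError

print(missing, repr(fallback), raised)
```

`get` is the safer choice when some tags on a page may lack the attribute.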

In [116]:
a1 = li.find_all('a', text='Stats')
a1

[<a href="/nba/teams/stats?team=bos">Stats</a>]

In [117]:
a2 = li.find_all('a', text='Schedule')
a2

[<a href="/nba/teams/schedule?team=bos">Schedule</a>]

In [118]:
a3 = li.find_all('a', text='Roster')
a3

[<a href="/nba/teams/roster?team=bos">Roster</a>]

In [119]:
a4 = li.find_all('a', text='Depth Chart')

### Save webpage data in Python data structures
* Decide how the extracted data will be organized
* Create empty data structures
* Use loops to fill the data structures

In [120]:
teams = []
prefix_1 = []
prefix_2 = []
teams_urls = []
for table in tables:
    lis = table.find_all('li')
    print('...........................')
#     print(lis[:20])
    print('***************************')


...........................
***************************
...........................
***************************
...........................
***************************
...........................
***************************
...........................
***************************
...........................
***************************


In [121]:
team = []
url = []
stats = []
schedule = []
roster = []
depth = []

for table in tables:
    lis = table.find_all('li')
    for li in lis:
        # team names and base url
        a1 = li.find('a', class_='bi')
        team.append(a1.text) # team names 
        url.append(a1['href']) # base url
        
        # get stats 
        a1 = li.find('a', text='Stats') 
        stats.append(a1['href'])
        
        # get schedule 
        a1 = li.find('a', text='Schedule') 
        schedule.append(a1['href'])

        # get roster
        a1 = li.find('a', text='Roster')
        roster.append(a1['href'])

        # get depth chart
        a1 = li.find('a', text='Depth Chart') 
        depth.append(a1['href'])
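
The loop above assumes that every `<li>` contains all of the links. If one were missing, `li.find` would return `None` and `a1['href']` would raise a `TypeError`. A hedged alternative is sketched below on made-up markup: it stores one dictionary per team and records `None` for a missing link. The `link_href` helper, the sample HTML, and the use of the `string=` argument (the newer name for `text=`) are assumptions, not the notebook's method.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second team has no "Roster" link.
html = """
<ul class="medium-logos">
  <li><h5><a class="bi" href="http://example.com/a">Team A</a></h5>
      <span><a href="/stats?team=a">Stats</a> | <a href="/roster?team=a">Roster</a></span></li>
  <li><h5><a class="bi" href="http://example.com/b">Team B</a></h5>
      <span><a href="/stats?team=b">Stats</a></span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

def link_href(li, label):
    """Return the href of the <a> whose text equals label, or None."""
    a = li.find("a", string=label)
    return a.get("href") if a is not None else None

records = []
for table in soup.find_all("ul", class_="medium-logos"):
    for li in table.find_all("li"):
        name = li.find("a", class_="bi")
        records.append({
            "team": name.text,
            "url": name.get("href"),
            "stats": link_href(li, "Stats"),
            "roster": link_href(li, "Roster"),
        })

print(records[1])
```

A list of dictionaries like `records` can be passed directly to `pandas.DataFrame`, which keeps each team's fields together in one row.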


In [123]:
dic = {'team': team, 'url': url, 'stats': stats, 'schedule': schedule, 'roster': roster, 'depth': depth}
dic

{'depth': ['/nba/teams/depth?team=bos',
  '/nba/teams/depth?team=bkn',
  '/nba/teams/depth?team=nyk',
  '/nba/teams/depth?team=phi',
  '/nba/teams/depth?team=tor',
  '/nba/teams/depth?team=gsw',
  '/nba/teams/depth?team=lac',
  '/nba/teams/depth?team=lal',
  '/nba/teams/depth?team=pho',
  '/nba/teams/depth?team=sac',
  '/nba/teams/depth?team=chi',
  '/nba/teams/depth?team=cle',
  '/nba/teams/depth?team=det',
  '/nba/teams/depth?team=ind',
  '/nba/teams/depth?team=mil',
  '/nba/teams/depth?team=dal',
  '/nba/teams/depth?team=hou',
  '/nba/teams/depth?team=mem',
  '/nba/teams/depth?team=nor',
  '/nba/teams/depth?team=sas',
  '/nba/teams/depth?team=atl',
  '/nba/teams/depth?team=cha',
  '/nba/teams/depth?team=mia',
  '/nba/teams/depth?team=orl',
  '/nba/teams/depth?team=was',
  '/nba/teams/depth?team=den',
  '/nba/teams/depth?team=min',
  '/nba/teams/depth?team=okc',
  '/nba/teams/depth?team=por',
  '/nba/teams/depth?team=uth'],
 'roster': ['/nba/teams/roster?team=bos',
  '/nba/teams/rost

In [124]:
dt = pandas.DataFrame(dic)
dt = dt[['team', 'url', 'stats', 'schedule', 'roster', 'depth']]
dt.head()

| | team | url | stats | schedule | roster | depth |
|---|---|---|---|---|---|---|
| 0 | Boston Celtics | http://www.espn.com/nba/team/_/name/bos/boston... | /nba/teams/stats?team=bos | /nba/teams/schedule?team=bos | /nba/teams/roster?team=bos | /nba/teams/depth?team=bos |
| 1 | Brooklyn Nets | http://www.espn.com/nba/team/_/name/bkn/brookl... | /nba/teams/stats?team=bkn | /nba/teams/schedule?team=bkn | /nba/teams/roster?team=bkn | /nba/teams/depth?team=bkn |
| 2 | New York Knicks | http://www.espn.com/nba/team/_/name/ny/new-yor... | /nba/teams/stats?team=nyk | /nba/teams/schedule?team=nyk | /nba/teams/roster?team=nyk | /nba/teams/depth?team=nyk |
| 3 | Philadelphia 76ers | http://www.espn.com/nba/team/_/name/phi/philad... | /nba/teams/stats?team=phi | /nba/teams/schedule?team=phi | /nba/teams/roster?team=phi | /nba/teams/depth?team=phi |
| 4 | Toronto Raptors | http://www.espn.com/nba/team/_/name/tor/toront... | /nba/teams/stats?team=tor | /nba/teams/schedule?team=tor | /nba/teams/roster?team=tor | /nba/teams/depth?team=tor |
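
The `stats`, `schedule`, `roster`, and `depth` columns hold relative paths, so they cannot be requested directly. One way to turn them into full URLs is `urllib.parse.urljoin` from the standard library. The sketch below assumes the paths resolve against `http://www.espn.com`, the host that appears in the team URLs; that base is an assumption and is not confirmed by the page itself.

```python
from urllib.parse import urljoin

BASE = "http://www.espn.com"  # assumed site root, taken from the team URLs

relative_links = ["/nba/teams/stats?team=bos", "/nba/teams/roster?team=bkn"]
absolute_links = [urljoin(BASE, path) for path in relative_links]

# A URL that is already absolute passes through unchanged.
team_url = urljoin(BASE, "http://www.espn.com/nba/team/_/name/bos/boston-celtics")

print(absolute_links[0])  # http://www.espn.com/nba/teams/stats?team=bos
```

The same function could be applied to a whole column, for example with `dt['stats'].apply(lambda p: urljoin(BASE, p))`.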
