# Web scraping

* Data scientists often need data that is only available on webpages
* Web scraping collects and stores that data systematically so it can be analyzed
* Beautiful Soup is a Python library for pulling data out of HTML and XML files
* Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Process

1. Obtain the webpage URL
2. Save the webpage contents to a file or a Python object
3. Convert the webpage contents to soup
4. Use Beautiful Soup functions to extract data from the soup via HTML tags
5. Write functions and loops to store the extracted data in Python data structures (e.g., a dictionary)
6. Write for loops to extract content from similarly structured webpages
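
The steps above can be sketched end to end on a small inline HTML string, so nothing has to be downloaded. The markup, the `example.com` URLs, and the names `SAMPLE_HTML`, `sample_soup`, and `sample_data` are made up for illustration; the built-in `html.parser` is used here only so the sketch runs without extra parser packages.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for downloaded webpage contents.
SAMPLE_HTML = """
<ul class="medium-logos">
  <li><h5><a href="http://example.com/team/aaa/alpha-team">Alpha Team</a></h5></li>
  <li><h5><a href="http://example.com/team/bbb/beta-team">Beta Team</a></h5></li>
</ul>
"""

# Convert the contents to soup. 'html.parser' ships with Python.
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Extract data via HTML tags and store it in a dictionary of lists.
sample_data = {"teams": [], "url": []}
for link in sample_soup.find_all("a"):
    sample_data["teams"].append(link.get_text(strip=True))
    sample_data["url"].append(link["href"])

print(sample_data)
```

The same pattern applies to a real page once its contents have been fetched; only the tag and class names change.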

## Example
* Extract NBA team information from espn.com
* [ESPN NBA teams](http://espn.go.com/nba/teams)

In [21]:
# from IPython.display import HTML
# HTML(url='http://espn.go.com/nba/teams')

In [22]:
import pandas 
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

Read the URL and store the webpage data

In [25]:
url = 'http://espn.go.com/nba/teams'

In [24]:
r = requests.get(url)
r.status_code

200
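
A status code of 200 means the request succeeded. As a hedged sketch, the check can be wrapped in a small helper that fails loudly instead of silently parsing an error page. The function name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, raising on network or HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(url)` would replace the separate `requests.get(url)` and `r.status_code` steps, and any failure surfaces as a `requests.exceptions.RequestException`.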

Convert webpage to soup

In [10]:
soup = BeautifulSoup(r.text, 'lxml')
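
The second argument to `BeautifulSoup` names the parser. `'lxml'` requires the separate lxml package; if it may not be installed, one possible approach is to fall back to Python's built-in `html.parser`. The helper name `make_soup` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html_text):
    """Parse with lxml when installed, else fall back to the built-in parser."""
    try:
        return BeautifulSoup(html_text, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(html_text, "html.parser")


page = make_soup("<html><head><title>NBA Teams</title></head><body></body></html>")
print(page.title.string)
```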

Examine soup
* Identify classes and tags of information of interest

In [27]:
print(soup.prettify()[0:5000])

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml">
 <head>
  <script src="http://cdn.espn.com/sports/optimizely.js">
  </script>
  <meta charset="utf-8"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <link href="http://a.espncdn.com/favicon.ico" mask="" rel="icon" sizes="any"/>
  <meta content="#CC0000" name="theme-color"/>
  <script type="text/javascript">
   if(true && navigator && navigator.userAgent.toLowerCase().indexOf("teamstream") >= 0) {
        window.location = 'http://m.espn.com/mobilecache/general/apps/sc';
    }
  </script>
  <script>
   (function(){function r(a){a=a.replace(/[\[]/,"\\[").replace(/[\]]/,"\\]");var c=new RegExp("[\\?&]"+a+"=([^&#]*)"),b=c.exec(location.search);return b==null?"":decodeURIComponent(b[1].replace(/\+/g," "));}var p=navigator.userAgent,o=window.location,l=document.cookie,f=document.referrer,n=(f===""||f.indexOf("www.espn.com")!==-1),d=(n)?"http://m.espn.com/nba/teams?src=desktop":"http://m.espn.com/nba/te

Use Beautiful Soup functions to select elements by tag and class

In [28]:
tables = soup.find_all('ul', class_='medium-logos')

In [37]:
uls = soup.find_all('ul')
print('Number of objects: ', len(uls))

Number of objects:  19
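
The two calls above differ only in the `class_` filter: `find_all('ul')` returns every list on the page, while adding `class_='medium-logos'` keeps only the team lists. A minimal sketch on made-up markup (the HTML and variable names are assumptions) shows the difference, along with the equivalent CSS-selector form `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one navigation list and two team lists.
html = """
<ul class="nav"><li>Home</li></ul>
<ul class="medium-logos"><li>Team A</li><li>Team B</li></ul>
<ul class="medium-logos"><li>Team C</li></ul>
"""
demo = BeautifulSoup(html, "html.parser")

all_lists = demo.find_all("ul")                          # every <ul>
team_lists = demo.find_all("ul", class_="medium-logos")  # only the matching class
team_items = demo.select("ul.medium-logos li")           # CSS selector form

print(len(all_lists), len(team_lists), len(team_items))
```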


Display the first object

In [42]:
ul = uls[0]
print(ul.prettify())

<ul class="medium-logos">
 <li class="first">
  <div class="logo-nba-medium">
   <img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/Bos.png?w=48&amp;h=48&amp;transparent=true"/>
   <h5>
    <a class="bi" href="http://www.espn.com/nba/team/_/name/bos/boston-celtics">
     Boston Celtics
    </a>
   </h5>
   <span>
    <a href="/nba/teams/stats?team=bos">
     Stats
    </a>
    |
    <a href="/nba/teams/schedule?team=bos">
     Schedule
    </a>
    |
    <a href="/nba/teams/roster?team=bos">
     Roster
    </a>
    |
    <a href="/nba/teams/depth?team=bos">
     Depth Chart
    </a>
   </span>
  </div>
 </li>
 <li class="alt">
  <div class="logo-nba-medium">
   <img src="http://a.espncdn.com/combiner/i?img=/i/teamlogos/nba/500/BKN.png?w=48&amp;h=48&amp;transparent=true"/>
   <h5>
    <a class="bi" href="http://www.espn.com/nba/team/_/name/bkn/brooklyn-nets">
     Brooklyn Nets
    </a>
   </h5>
   <span>
    <a href="/nba/teams/stats?team=bkn">
     Stats
    </a>
    

Continue to use Beautiful Soup functions to save relevant information

In [43]:
li = ul.find('li')

In [45]:
a = li.h5.a
a

<a class="bi" href="http://www.espn.com/nba/team/_/name/bos/boston-celtics">Boston Celtics</a>

In [46]:
a['href']

'http://www.espn.com/nba/team/_/name/bos/boston-celtics'

In [47]:
a.text

'Boston Celtics'
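
The navigation used above (`li.h5.a`, `a['href']`, `a.text`) can be tried on a single list item. The sketch below uses a hypothetical `<li>` shaped like the ESPN markup, with an `example.com` URL as a placeholder. It also shows `.get()`, which returns `None` for a missing attribute instead of raising, and `urljoin`, which turns a relative link such as the Stats link into an absolute one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical list item shaped like the ESPN markup shown above.
item_html = """
<li>
  <h5><a class="bi" href="http://example.com/team/bos/boston-celtics">
    Boston Celtics
  </a></h5>
  <span><a href="/nba/teams/stats?team=bos">Stats</a></span>
</li>
"""
item = BeautifulSoup(item_html, "html.parser").li

link = item.h5.a                   # first <a> inside the first <h5>
name = link.get_text(strip=True)   # text with surrounding whitespace removed
href = link.get("href")            # like link['href'], but None if missing
missing = link.get("title")        # attribute not present, so None
classes = link["class"]            # class is multi-valued, so a list

# Resolve the relative Stats link against the page it came from.
stats_url = urljoin("http://www.espn.com/nba/teams", item.span.a["href"])

print(name, href, missing, classes, stats_url)
```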

### Save webpage data in Python data structures
* Develop data storage architecture
* Create empty data structures
* Implement control flow processes to save data in data structures

In [58]:
teams = []
prefix_1 = []
prefix_2 = []
teams_urls = []
for table in tables:
    lis = table.find_all('li')
    print('...........................')
#     print(lis[:20])
    print('***************************')


...........................
***************************
...........................
***************************
...........................
***************************
...........................
***************************
...........................
***************************
...........................
***************************


In [59]:
teams = []
teams_urls = []

for table in tables:
    lis = table.find_all('li')
    for li in lis:
        a = li.h5.a
        
        at = a.text
        teams.append(at)
        
        ah = a['href']
        teams_urls.append(ah)
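
The loop above assumes every `<li>` contains an `<h5>` with a link; if one does not, `li.h5` is `None` and `li.h5.a` raises an `AttributeError`. A hedged variant that skips such items and stores one dictionary per team might look like the following. The markup and the names `demo_soup` and `records` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last <li> has no <h5>, so li.h5 is None there.
html = """
<ul class="medium-logos">
  <li><h5><a href="http://example.com/a">Team A</a></h5></li>
  <li><h5><a href="http://example.com/b">Team B</a></h5></li>
  <li><span>Advertisement</span></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

records = []
for table in demo_soup.find_all("ul", class_="medium-logos"):
    for li in table.find_all("li"):
        h5 = li.h5
        if h5 is None or h5.a is None:
            continue  # skip items without a team link
        records.append({"team": h5.a.get_text(strip=True), "url": h5.a["href"]})

print(records)
```

A list of dictionaries keeps each team's name and URL together, which is one alternative to the parallel lists used in this notebook.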

In [60]:
dic = {'url': teams_urls, 'teams': teams}
dic


{'teams': ['Boston Celtics',
  'Brooklyn Nets',
  'New York Knicks',
  'Philadelphia 76ers',
  'Toronto Raptors',
  'Golden State Warriors',
  'LA Clippers',
  'Los Angeles Lakers',
  'Phoenix Suns',
  'Sacramento Kings',
  'Chicago Bulls',
  'Cleveland Cavaliers',
  'Detroit Pistons',
  'Indiana Pacers',
  'Milwaukee Bucks',
  'Dallas Mavericks',
  'Houston Rockets',
  'Memphis Grizzlies',
  'New Orleans Pelicans',
  'San Antonio Spurs',
  'Atlanta Hawks',
  'Charlotte Hornets',
  'Miami Heat',
  'Orlando Magic',
  'Washington Wizards',
  'Denver Nuggets',
  'Minnesota Timberwolves',
  'Oklahoma City Thunder',
  'Portland Trail Blazers',
  'Utah Jazz'],
 'url': ['http://www.espn.com/nba/team/_/name/bos/boston-celtics',
  'http://www.espn.com/nba/team/_/name/bkn/brooklyn-nets',
  'http://www.espn.com/nba/team/_/name/ny/new-york-knicks',
  'http://www.espn.com/nba/team/_/name/phi/philadelphia-76ers',
  'http://www.espn.com/nba/team/_/name/tor/toronto-raptors',
  'http://www.espn.com/nba

In [61]:
dt = pandas.DataFrame(dic)
dt.head()

| | teams | url |
|---|---|---|
| 0 | Boston Celtics | http://www.espn.com/nba/team/_/name/bos/boston... |
| 1 | Brooklyn Nets | http://www.espn.com/nba/team/_/name/bkn/brookl... |
| 2 | New York Knicks | http://www.espn.com/nba/team/_/name/ny/new-yor... |
| 3 | Philadelphia 76ers | http://www.espn.com/nba/team/_/name/phi/philad... |
| 4 | Toronto Raptors | http://www.espn.com/nba/team/_/name/tor/toront... |
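
Once the data is in a DataFrame, it can be written out for later analysis. The sketch below is one possible way to do that, using an in-memory buffer in place of a file so it runs anywhere; the variable names and the two-row sample are assumptions, and passing a filename to `to_csv` would write to disk instead.

```python
import io

import pandas as pd

# Two rows taken from the scraped output above.
data = {
    "teams": ["Boston Celtics", "Brooklyn Nets"],
    "url": [
        "http://www.espn.com/nba/team/_/name/bos/boston-celtics",
        "http://www.espn.com/nba/team/_/name/bkn/brooklyn-nets",
    ],
}

# Passing `columns` fixes the column order explicitly.
df = pd.DataFrame(data, columns=["teams", "url"])

# Write CSV text to an in-memory buffer; index=False omits the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines[0])
```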


Nested for loops

In [62]:
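teams_urls = []
t_url_list = list(dt.url)

# Note: t_url is never used inside the loop body, so the same `tables` are
# re-parsed once per team URL and every link is appended repeatedly.
for t_url in t_url_list:
    for table in tables:
        lis = table.find_all('li')
        for li in lis:
            a = li.h5.a

            at = a.text
            teams.append(at)  # `teams` is not reset in this cell, so it keeps growing

            ah = a['href']
            teams_urls.append(ah)
teams_urls[0:2]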

['http://www.espn.com/nba/team/_/name/bos/boston-celtics',
 'http://www.espn.com/nba/team/_/name/bkn/brooklyn-nets']
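
To extract content from several similarly structured pages, the outer loop would typically fetch and parse each URL in turn. The sketch below is an assumption about how that could be organized, not the notebook's own method: `scrape_titles` is a hypothetical helper that takes the download function as an argument, so it can be exercised here with fake pages instead of live requests.

```python
import time

from bs4 import BeautifulSoup


def scrape_titles(urls, fetch, delay=0.0):
    """Return {url: page title} for each URL, using fetch(url) -> HTML text."""
    titles = {}
    for page_url in urls:
        page_soup = BeautifulSoup(fetch(page_url), "html.parser")
        title_tag = page_soup.title  # None if the page has no <title>
        titles[page_url] = title_tag.get_text(strip=True) if title_tag else None
        time.sleep(delay)  # pause between requests
    return titles


# Hypothetical pages standing in for real HTTP responses.
FAKE_PAGES = {
    "http://example.com/a": "<html><head><title>Team A</title></head></html>",
    "http://example.com/b": "<html><body>No title here</body></html>",
}
result = scrape_titles(FAKE_PAGES, FAKE_PAGES.get)
print(result)
```

For live pages, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`, with a nonzero `delay` to avoid sending requests in rapid succession.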

In [65]:
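teams = []
prefix_1 = []  # team abbreviation, e.g. 'bos'
prefix_2 = []  # team slug, e.g. 'boston-celtics'
teams_urls = []
for table in tables:
    lis = table.find_all('li')
    for li in lis:
        info = li.h5.a
        teams.append(info.text)
        # Use a separate name so the page-level `url` is not overwritten
        team_url = info['href']
        teams_urls.append(team_url)
        # The last two path segments of the URL identify the team
        url_parts = team_url.split('/')
        prefix_1.append(url_parts[-2])
        prefix_2.append(url_parts[-1])


dic = {'url': teams_urls, 'prefix_2': prefix_2, 'prefix_1': prefix_1}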

### Examine stored data

In [67]:
dic.keys()

dict_keys(['url', 'prefix_2', 'prefix_1'])

In [70]:
dic.get('url')[0:2]

['http://www.espn.com/nba/team/_/name/bos/boston-celtics',
 'http://www.espn.com/nba/team/_/name/bkn/brooklyn-nets']

In [71]:
dic.get('prefix_1')[0:2]

['bos', 'bkn']

In [72]:
dic.get('prefix_2')[0:2]


['boston-celtics', 'brooklyn-nets']
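
Splitting on `'/'` works for the URLs shown here, but it would pick up the wrong pieces if a URL had a trailing slash or a query string. A slightly more defensive sketch uses `urllib.parse.urlsplit` to isolate the path first; the function name `team_prefixes` is hypothetical.

```python
from urllib.parse import urlsplit


def team_prefixes(team_url):
    """Return (abbreviation, slug) from a URL ending in .../<abbr>/<slug>."""
    # urlsplit separates the path from any query string or fragment.
    parts = [p for p in urlsplit(team_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"unexpected team URL: {team_url!r}")
    return parts[-2], parts[-1]


print(team_prefixes("http://www.espn.com/nba/team/_/name/bos/boston-celtics"))
```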