## Web Scraping 101: BeautifulSoup

[BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Web scraping means pulling data out of web pages programmatically.

It is practical because web pages present information in a consistent format: they are all written in HTML.

## HTML Refresher

### Overview
* HTML is the basic language used to create a web page. 
* It tells the web browser what text/media to display, where to display it, and how to display it (style).
* HTML is very structured/hierarchical.
* Every page is made up of discrete "elements."

### Tags

* Elements are labeled with "tags."

* For example:

    ```html
    <p>You are beginning to learn HTML.</p>
    ```

### Attributes

* A start tag also often contains "attributes" with info about the element.

* Attributes usually have a name and value.

* Example:

```html
<p class="my_red_sentences">You are beginning to learn HTML.</p>
```

### Structure

A full HTML document has a structure more like this:

```html
<html> 
  <head> </head>
  <body>
     <p class="red">You are beginning to learn HTML.</p>
     <h1> This is a header </h1>
     <a href="https://www.google.com"> Some link </a>
  </body>
</html>
```
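
To make the nesting concrete, here is a minimal sketch that walks a document like the one above with Python's built-in `html.parser` module and records how deeply each tag is nested. The class name `TagOutliner` and the helper `outline_html` are made up for this illustration.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<html>
  <head></head>
  <body>
    <p class="red">You are beginning to learn HTML.</p>
    <h1>This is a header</h1>
    <a href="https://www.google.com">Some link</a>
  </body>
</html>
"""


class TagOutliner(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag name, attributes dict)

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # leaving the element, back up one level


def outline_html(html):
    """Return the (depth, tag, attrs) outline of an HTML string."""
    parser = TagOutliner()
    parser.feed(html)
    return parser.outline


for depth, tag, attrs in outline_html(SAMPLE_HTML):
    print("  " * depth + tag, attrs)
```

The indentation in the printed outline mirrors the hierarchy: `p`, `h1` and `a` all sit inside `body`, which sits inside `html`. This simple depth counter assumes every tag is explicitly closed, which holds for this sample but not for all real pages.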

### Explore in Browser

* Let's explore some live HTML!
* Go to http://boxofficemojo.com/movies/?id=biglebowski.htm in your browser, preferably Chrome.
* Right-click on the page and choose Inspect (Inspect Element), then also try View Page Source.

## HTML to BeautifulSoup

### Request data for The Big Lebowski

Scrape some information about [The Big Lebowski](http://boxofficemojo.com/movies/?id=biglebowski.htm).

In [1]:
from __future__ import print_function, division

In [2]:
# if needed: pip install requests or conda install requests
import requests

requests.__path__

['//anaconda3/lib/python3.7/site-packages/requests']

In [3]:
url = 'https://data.gov.sg/dataset/weekly-number-of-dengue-and-dengue-haemorrhagic-fever-cases'

response = requests.get(url)

### Check the Status

For information on HTTP status codes, see:

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

In [4]:
response.status_code # status code = 200 => OK

200
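
Status codes fall into classes by their first digit. As a rough illustration, here is a small sketch built on Python's standard `http.HTTPStatus`. The helper name `describe_status` is made up for this example and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if code < 200:
        verdict = "informational"
    elif code < 300:
        verdict = "success"
    elif code < 400:
        verdict = "redirect"
    elif code < 500:
        verdict = "client error"
    else:
        verdict = "server error"
    return f"{code} {phrase} ({verdict})"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With a real response you could call `describe_status(response.status_code)`. requests also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses so that a failed request does not go unnoticed.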

### Look at the Text

In [5]:
print(response.text)



<!DOCTYPE html>

<html lang="en_GB">

  
  <head>
    <!--[if lte ie 8]><script type="text/javascript" src="/fanstatic/vendor/:version:2019-10-14T11:24:30/html5.min.js"></script><![endif]-->
<link rel="stylesheet" type="text/css" href="/fanstatic/vendor/:version:2019-10-14T11:24:30/select2/select2.css" />
<link rel="stylesheet" type="text/css" href="/fanstatic/datagovsg-css/:version:2019-10-14T11:42:38.32/datagovsg.min.css" />
<link rel="stylesheet" type="text/css" href="/fanstatic/datagovsg-css/:version:2019-10-14T11:42:38.32/package/read.min.css" />
<link rel="stylesheet" type="text/css" href="/fanstatic/datagovsg-css/:version:2019-10-14T11:42:38.32/package/snippets/resource_item_fields.min.css" />


    <title>
      Weekly Number of Dengue and Dengue Haemorrhagic Fever Cases-Data.gov.sg
      
    </title>

    
    <meta charset="utf-8">
    <meta name="generator" content="ckan ">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" user-scala

### Soupify the Text

In [6]:
page = response.text

lxml is a library for processing XML and HTML in Python. Here we hand the page text to BeautifulSoup and tell it to use lxml as the parser, which turns the raw string into a searchable tree.

In [7]:
# if needed: pip install beautifulsoup4 lxml or conda install beautifulsoup4 lxml
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")

In [8]:
print(soup)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
<title>The Big Lebowski (1998) - Box Office Mojo</title>
<style type="text/css">
table.chart-wide { width: 100%; }
</style>
<meta content="the big lebowski, movie, film, box office, result, records, charts, revenue, opening weekend, gross, worldwide, overseas, foreign, news, reviews, articles, stories, story, analysis, revenue, release date, mpaa rating, genre, running time, length, budget, production budget, distributor, studio, gramercy, theatrical summary, theatrical, showtimes, tickets, show times, theaters, playing, weekend box office results, weekly box office, weekly box office, release summary, similar movies, box office mojo" name="keywords"/>
<meta content="The Big Lebowski summary of box office results, charts and release information and related links." name="description"/>
<link cha

### Prettify the Soup

A webpage can be thought of as a tree of elements: there is the 'body', which contains a few 'divs', and each of those 'divs' can in turn contain more 'divs' and other elements. A Soup object contains this tree. The `prettify()` method turns a Beautiful Soup tree into a nicely formatted Unicode string, with each HTML/XML tag on its own line.

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en_GB">
 <head>
  <!--[if lte ie 8]><script type="text/javascript" src="/fanstatic/vendor/:version:2019-10-14T11:24:30/html5.min.js"></script><![endif]-->
  <link href="/fanstatic/vendor/:version:2019-10-14T11:24:30/select2/select2.css" rel="stylesheet" type="text/css"/>
  <link href="/fanstatic/datagovsg-css/:version:2019-10-14T11:42:38.32/datagovsg.min.css" rel="stylesheet" type="text/css"/>
  <link href="/fanstatic/datagovsg-css/:version:2019-10-14T11:42:38.32/package/read.min.css" rel="stylesheet" type="text/css"/>
  <link href="/fanstatic/datagovsg-css/:version:2019-10-14T11:42:38.32/package/snippets/resource_item_fields.min.css" rel="stylesheet" type="text/css"/>
  <title>
   Weekly Number of Dengue and Dengue Haemorrhagic Fever Cases-Data.gov.sg
  </title>
  <meta charset="utf-8"/>
  <meta content="ckan " name="generator"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport" user-scalable="no"/>
  <meta content="This
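
To see what `prettify()` does without fetching a page, here is a minimal sketch on a one-line HTML string. The snippet is invented for illustration, and it uses Python's built-in `"html.parser"` instead of lxml so that it runs without extra parser dependencies.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document squeezed onto a single line
html = '<div><p class="red">Hello</p><p>World</p></div>'

# "html.parser" ships with Python; the notebook itself uses "lxml"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()  # a plain str with one tag per line, indented by depth
print(pretty)
```

The tree is the same before and after; `prettify()` only changes how it is printed, which makes nesting much easier to read when you are hunting for the element you want.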

## Beautiful Soup - Find & Find_All

### `soup.find()`

* `soup.find()` is the function we will use most often from this package.
* It returns the first matching tag it finds, or `None` if nothing matches.
* It searches the entire tree.
* To search for a type of tag, pass the tag name as a string argument (`'body'`, `'div'`, `'p'`, `'a'`).
* Let's try out some common variations of `soup.find()`.

In [10]:
soup.find('a') # "a" tag is for hyperlink

<a href="/daily/chart/">Daily Box Office (Sun.)</a>

In [11]:
# Equivalently:
soup.a

<a href="/daily/chart/">Daily Box Office (Sun.)</a>

In [12]:
# Prettier:
print(soup.a.prettify())

<a href="/daily/chart/">
 Daily Box Office (Sun.)
</a>



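Here's how you can move to the next element at the same level of the tree (the next sibling). `find_next_sibling()` is the current name for the older `findNextSibling()` alias.

In [13]:
soup.find('a').find_next_sibling()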

<a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a>
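
The same calls can be tried on a small, self-contained snippet. This is a sketch on invented HTML (the `nav` div and its links are assumptions for illustration), parsed with the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

html = """
<div id="nav">
  <a href="/daily/">Daily</a>
  <span>|</span>
  <a href="/weekend/">Weekend</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match in document order
first_link = soup.find("a")
print(first_link["href"], first_link.get_text())

# Passing a tag name skips siblings of other types (here, the <span>)
next_link = first_link.find_next_sibling("a")
print(next_link["href"], next_link.get_text())

# find() returns None when nothing matches, so check before indexing
missing = soup.find("table")
print(missing)
```

Because `find()` can return `None`, chaining something like `soup.find('table')['class']` on a page without a table raises a `TypeError`. Checking the result first keeps a scraper from crashing on pages that are laid out differently.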

### `soup.find_all()`

`soup.find_all()` returns a list of all matches (an empty list if there are none).

In [14]:
len(soup.find_all('a'))

100

In [15]:
for link in soup.find_all('a'): 
    print(link)

<a href="/daily/chart/">Daily Box Office (Sun.)</a>
<a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a>
<a href="/movies/?id=joker2019.htm">#1 Movie: 'Joker (2019)'</a>
<a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a>
<a href="//bs.serving-sys.com/Serving/adServer.bs?cn=brd&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" target="_blank">
<img border="0" height="90" src="//bs.serving-sys.com/Serving/adServer.bs?c=8&amp;cn=display&amp;pli=1073982656&amp;Page=&amp;Pos=668343950" width="728"/>
</a>
<a href="/"><img alt="Box Office Mojo" height="56" src="/img/misc/bom_logo1.png" width="245"/></a>
<a href="http://pro.imdb.com/signup/index.html?rf=mojo_nb_hm&amp;ref_=mojo_nb_hm" target="_blank">
<img alt="Get industry info at IMDbPro" height="20" src="/images/IMDbPro.png"/>
</a>
<a href="http://twitter.com/boxofficemojo" target="_blank">
<img alt="Follow us on Twitter" height="18" src="/images/glyphicons-social-32-twitter@2x.png"/>
</a>
<a href="http://facebook.com/b

In [16]:
[link for link in soup.find_all('a') if 'joelcoen' in str(link)]

[<a href="/people/chart/?view=Director&amp;id=joelcoen.htm">Joel Coen</a>,
 <a href="/people/chart/?view=Writer&amp;id=joelcoen.htm">Joel Coen</a>]
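
Filtering on `str(link)` works, but it matches the text anywhere in the tag. A slightly more targeted sketch filters on the `href` attribute itself. The HTML below is invented for illustration (only the Joel Coen link mirrors the output above) and is parsed with the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/people/chart/?view=Director&amp;id=joelcoen.htm">Joel Coen</a></li>
  <li><a href="/people/chart/?view=Writer&amp;id=ethancoen.htm">Ethan Coen</a></li>
  <li><a name="top">No href here</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")

# .get() returns None for a missing attribute instead of raising KeyError
hrefs = [a.get("href") for a in links]

# href=True keeps only the anchors that actually have an href attribute
with_href = soup.find_all("a", href=True)

# Filter on the attribute value rather than on the whole tag's string form
joel_links = [a.get_text() for a in with_href if "joelcoen" in a["href"]]

print(hrefs)
print(joel_links)
```

Using `link['href']` directly is fine when every anchor has an `href`, but real pages often contain anchors without one, which is why the sketch uses `.get()` or `href=True`.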

## Beautiful Soup - More on Find

### `href` Example

In [18]:
# retrieve the url from an anchor tag
soup.find('a')['href']

'/daily/chart/'

### `id` and `class` examples

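* An attribute like `id` or `class` can be matched by passing it as a keyword argument.
* Because `class` is a reserved word in Python, Beautiful Soup uses `class_` instead.
* Examples: the `top_links` id and the `mp_box_content` class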

In [17]:
soup.find_all(id='top_links')

[<div id="top_links">
 <div style="float: left"><a href="/daily/chart/">Daily Box Office (Sun.)</a> | <a href="/weekend/chart/">Weekend Box Office (Oct. 11–13)</a> | <a href="/movies/?id=joker2019.htm">#1 Movie: 'Joker (2019)'</a> | <a href="http://www.imdb.com/showtimes/?ref_=mojo">Showtimes</a></div>
 <div style="float: right">Updated 10/13/2019 8:31 A.M. Pacific Time</div>
 <div style="clear:both; height: 0px"></div>
 </div>]

In [19]:
for element in soup.find_all(class_='mp_box_content'):
    print(element, '\n')

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$18,034,458</b></td>
<td align="right" width="25%">   <b>38.6%</b></td>
</tr>
<tr>
<td width="40%">+ Foreign:</td>
<td align="right" width="35%"> $28,690,764</td>
<td align="right" width="25%">   61.4%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$46,725,222</b></td>
<td width="25%"> </td>
</tr>
</table>
</div> 

<div class="mp_box_content">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td align="center"><a href="/weekend/chart/?yr=1998&amp;wknd=10&amp;p=.htm">Opening Weekend:</a></td><td> $5,533,844</td></tr>
<tr>
<td align="center" colspan="2"><font size="2">(#6 rank, 1,207 theaters, $4,585 average)</font></td></tr>
<tr>
<td align="right">% of Total Gross:</td><td> 31.7%</td></tr>
<tr><td align="right" colspan="2"><font fac
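
Here is a minimal sketch of the same attribute searches on a small invented snippet (the `highlighted` class is an assumption added to show multi-class matching), parsed with the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

html = """
<div id="top_links"><a href="/daily/">Daily</a></div>
<div class="mp_box_content highlighted">Domestic</div>
<div class="mp_box_content">Opening Weekend</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ids should be unique on a page, so find() is usually enough
top = soup.find(id="top_links")
print(top.a["href"])

# class_ matches any element that has this class among its classes
boxes = soup.find_all(class_="mp_box_content")
print([box.get_text() for box in boxes])

# Equivalent form using an attrs dictionary, restricted to <div> tags
same_boxes = soup.find_all("div", attrs={"class": "mp_box_content"})
print(len(same_boxes))
```

Note that the first box carries two classes and still matches, because `class_` tests each class of an element separately.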

## Beautiful Soup - Chaining Finds

All the fields in mp_box_content can be found by "chaining" a few `find_all` calls.

In [20]:
# 'td' is for a cell in an HTML table
chain = [x.find_all('td') for x in soup.find_all(class_='mp_box_content')]

In [21]:
# for the first mp_box_content find all td's
chain[0]

[<td width="40%"><b>Domestic:</b></td>,
 <td align="right" width="35%"> <b>$18,034,458</b></td>,
 <td align="right" width="25%">   <b>38.6%</b></td>,
 <td width="40%">+ Foreign:</td>,
 <td align="right" width="35%"> $28,690,764</td>,
 <td align="right" width="25%">   61.4%</td>,
 <td colspan="3" width="100%"><hr/></td>,
 <td width="40%">= <b>Worldwide:</b></td>,
 <td align="right" width="35%"> <b>$46,725,222</b></td>,
 <td width="25%"> </td>]

To extract just the value of interest:

In [22]:
# Find the domestic gross. The '\xa0' is a non-breaking space (&nbsp; in HTML)
soup.find(class_='mp_box_content').find_all('td')[1].text

'\xa0$18,034,458'

In [23]:
# The second td holds the domestic gross; slice off the leading non-breaking space
soup.find(class_='mp_box_content').find_all('td')[1].text[1:] 

'$18,034,458'
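
Slicing with `[1:]` assumes there is exactly one stray character in front. As a sketch of two less position-dependent alternatives, the snippet below rebuilds a single table row by hand (the markup is a simplified assumption, not the real page) and cleans the cell with `str.strip()` and with `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# "\xa0" puts a real non-breaking space into the markup, as on the page above
html = (
    "<table><tr>"
    "<td><b>Domestic:</b></td>"
    "<td>\xa0<b>$18,034,458</b></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

cell = soup.find("table").find_all("td")[1]

raw = cell.text                      # still has the leading non-breaking space
sliced = raw[1:]                     # the approach used above
stripped = raw.strip()               # str.strip() also removes non-breaking spaces
cleaned = cell.get_text(strip=True)  # strips each text fragment while extracting

print(repr(raw), repr(sliced), repr(stripped), repr(cleaned))
```

All three cleaned versions agree here, but `strip()` and `get_text(strip=True)` keep working if the page later has zero or several whitespace characters around the value.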

## Let's Practice Web Scraping!

### Items to scrape for each movie:

* Movie Title
* Domestic Total Gross
* Runtime
* MPAA Rating
* Release Date

### Movie Title

In [24]:
soup.find('title')

<title>The Big Lebowski (1998) - Box Office Mojo</title>

In [25]:
soup.find('title').text

'The Big Lebowski (1998) - Box Office Mojo'

In [26]:
title_string = soup.find('title').text
title_string

'The Big Lebowski (1998) - Box Office Mojo'

In [27]:
title_string.split('(')

['The Big Lebowski ', '1998) - Box Office Mojo']

In [30]:
# .strip() removes the white spaces at the beginning and end of the string
title = title_string.split('(')[0].strip() 
title

'The Big Lebowski'

### Domestic Total Gross

Let's try to find the exact text.

In [31]:
print(soup.find(text="Domestic Total Gross"))

None


The `text` argument does an exact-match search, so we have to be careful: the string on the page ends with a colon and a space.

In [32]:
print(soup.find(text="Domestic Total Gross: "))

Domestic Total Gross: 


What if we don't want to be careful? [Regular expressions](https://xkcd.com/208/) to the rescue!

We are going to talk a lot more about regular expressions in the next week or two, but they are a really powerful way to search for patterns in text. Today, we're going to use a very simple case: doing a "contains" search instead of an "exact match".

In [33]:
import re
domestic_total_regex = re.compile('Domestic Total') #results in a pattern object 
domestic_total_regex

re.compile(r'Domestic Total', re.UNICODE)

In [34]:
dtg_string = soup.find(text=domestic_total_regex)
dtg_string

'Domestic Total Gross: '

In [35]:
dtg_string.findNextSibling()

<b>$17,451,873</b>

We found the domestic total gross! Now let's strip it down and convert it to an integer.

In [36]:
dtg = dtg_string.findNextSibling().text
print(dtg, type(dtg))

dtg = dtg.replace('$','').replace(',','')
print(dtg, type(dtg))

domestic_total_gross = int(dtg)
print(domestic_total_gross, type(domestic_total_gross))

$17,451,873 <class 'str'>
17451873 <class 'str'>
17451873 <class 'int'>
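
The exact-match versus regex behaviour can be reproduced offline. This is a sketch on a one-line snippet modelled on the page (the surrounding `font` tag is an assumption), parsed with the built-in `"html.parser"`. It uses `string=`, the name recent Beautiful Soup versions use for the argument this notebook passes as `text=`.

```python
import re

from bs4 import BeautifulSoup

html = '<font size="4">Domestic Total Gross: <b>$17,451,873</b></font>'
soup = BeautifulSoup(html, "html.parser")

# An exact match fails unless the whole string matches, trailing ": " included
exact_miss = soup.find(string="Domestic Total Gross")
exact_hit = soup.find(string="Domestic Total Gross: ")

# A compiled regex behaves like a "contains" search
regex_hit = soup.find(string=re.compile("Domestic Total"))

# The label is a text node; its next <b> sibling holds the value
value_tag = regex_hit.find_next_sibling("b")
gross = int(value_tag.get_text().replace("$", "").replace(",", ""))

print(exact_miss, repr(exact_hit), repr(regex_hit), gross)
```

The regex version is more forgiving of small changes in punctuation or spacing, which is usually what you want when the label text is not under your control.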


### Runtime, MPAA Rating & Release Date

#### Step 1: Create Function to Identify Values

Let's make a function to scrape multiple things, assuming the value will always follow the field name.

In [37]:
def get_movie_value(soup, field_name):
    
    '''Grab a value from boxofficemojo HTML
    
    Takes a string attribute of a movie on the page and returns the string in
    the next sibling object (the value for that attribute) or None if nothing is found.
    '''
    
    obj = soup.find(text=re.compile(field_name))
    
    if not obj: 
        return None
    
    # this works for most of the values
    next_sibling = obj.findNextSibling()
    
    if next_sibling:
        return next_sibling.text 
    else:
        return None

In [38]:
# domestic total gross
dtg = get_movie_value(soup,'Domestic Total')
print(dtg)

$17,451,873


In [39]:
# runtime
runtime = get_movie_value(soup,'Runtime')
print(runtime)

1 hrs. 57 min.


In [40]:
# rating
rating = get_movie_value(soup,'MPAA Rating')
print(rating)

R


In [41]:
release_date = get_movie_value(soup,'Release Date')
print(release_date)

March 6, 1998


#### Step 2: Convert Values to Appropriate Data Types

In [42]:
import dateutil.parser

def money_to_int(moneystring):
    moneystring = moneystring.replace('$', '').replace(',', '')
    return int(moneystring)

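def runtime_to_minutes(runtimestring):
    # expects strings like '1 hrs. 57 min.'
    runtime = runtimestring.split()
    try:
        minutes = int(runtime[0])*60 + int(runtime[2])
        return minutes
    except (IndexError, ValueError):
        # unexpected format: too few pieces or non-numeric values
        return None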

def to_date(datestring):
    date = dateutil.parser.parse(datestring)
    return date
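
As a quick check of these converters, here is a self-contained sketch that redefines them and runs them on the values scraped above, plus one malformed runtime. To stay within the standard library it swaps `dateutil.parser.parse` for `datetime.strptime`, which is an assumption of this sketch: it only accepts the single `'March 6, 1998'` style format and relies on English month names.

```python
from datetime import datetime


def money_to_int(moneystring):
    """'$17,451,873' -> 17451873"""
    return int(moneystring.replace("$", "").replace(",", ""))


def runtime_to_minutes(runtimestring):
    """'1 hrs. 57 min.' -> 117, or None if the format is unexpected."""
    parts = runtimestring.split()
    try:
        return int(parts[0]) * 60 + int(parts[2])
    except (IndexError, ValueError):
        return None


def to_date(datestring):
    """'March 6, 1998' -> datetime(1998, 3, 6); stdlib stand-in for dateutil."""
    return datetime.strptime(datestring, "%B %d, %Y")


print(money_to_int("$17,451,873"))
print(runtime_to_minutes("1 hrs. 57 min."))
print(runtime_to_minutes("N/A"))  # malformed input falls back to None
print(to_date("March 6, 1998"))
```

`dateutil` is the more flexible choice when date formats vary from page to page; `strptime` is stricter, which can be useful when you would rather fail loudly on an unexpected format.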

#### Step 3: Apply the Conversions

In [43]:
# Let's get these again and format them all in one swoop

raw_domestic_total_gross = get_movie_value(soup,'Domestic Total')
domestic_total_gross = money_to_int(raw_domestic_total_gross)

raw_runtime = get_movie_value(soup,'Runtime')
runtime = runtime_to_minutes(raw_runtime)

raw_release_date = get_movie_value(soup,'Release Date')
release_date = to_date(raw_release_date)

print(domestic_total_gross, runtime, release_date)
print(type(domestic_total_gross), type(runtime), type(release_date))

17451873 117 1998-03-06 00:00:00
<class 'int'> <class 'int'> <class 'datetime.datetime'>


#### Step 4: Print It All Out

In [44]:
from pprint import pprint # pretty print

headers = ['movie title', 'domestic total gross',
           'runtime (mins)', 'rating', 'release date']

movie_data = []
movie_dict = dict(zip(headers, [title,
                                domestic_total_gross,
                                runtime,
                                rating, 
                                release_date]))

movie_data.append(movie_dict)
pprint(movie_data)

[{'domestic total gross': 17451873,
  'movie title': 'The Big Lebowski',
  'rating': 'R',
  'release date': datetime.datetime(1998, 3, 6, 0, 0),
  'runtime (mins)': 117}]


## Table Scraping Example

### Step 1: Soupify the Website

Let's take a look at the foreign language page of Box Office Mojo. Let's say we wanted to pull all of the data from the main table on the page.

In [45]:
url = 'http://www.boxofficemojo.com/genres/chart/?id=foreign.htm'

response = requests.get(url)
page = response.text

soup = BeautifulSoup(page,"lxml")

### Step 2: Find the Tables

In [9]:
tables = soup.find_all("table")
tables

[<table class="table-condensed table-striped table-bordered">
 <thead></thead>
 <tbody>
 <tr>
 <th scope="row">Query</th>
 <td><code>https://data.gov.sg/api/action/datastore_search</code></td>
 </tr>
 </tbody>
 </table>, <table class="table table-condensed">
 <col class="field-index"/>
 <col class="field-name"/>
 <col class="field-title"/>
 <col class="field-type"/>
 <col class="field-units"/>
 <col class="field-desc"/>
 <thead>
 <tr>
 <th class="field-index">No.</th>
 <th class="field-name">Name</th>
 <th class="field-title">Title</th>
 <th class="field-type">Type</th>
 <th class="field-units">Unit of Measure</th>
 <th class="field-desc">Description</th>
 </tr>
 </thead>
 <tbody>
 <tr>
 <td class="field-index" width="10%">1</td>
 <td>year</td>
 <td>Year</td>
 <td class="field-type">
             Datetime (Year)
             
               <br/>
               "YYYY"
             
             
           </td>
 <td>-</td>
 <td>
             
             -
           </td>
 </tr>
 <t

### Step 3: Pull Just the Rows

In [10]:
rows = tables[3].find_all('tr') # tr tag is for rows

In [11]:
# let's take a look at one row
rows[0]

<tr>
<th scope="row">Last updated</th>
<td>September 30, 2019</td>
</tr>

In [49]:
# let's take a look at one value in the row
rows[0].find_all('td')[1].find('a')['href']

'/genres/chart/?id=foreign.htm&sort=title&order=ASC&p=.htm'

### Step 4: Pull All Values

In [50]:
rows[1].find_all('td')[1].find('a')['href']

'/movies/?id=crouchingtigerhiddendragon.htm'

In [51]:
rows = rows[1:21] # let's just look at the first 20 rows for now
movies = {}

for row in rows:
    items = row.find_all('td')
    title = items[1].find('a')['href']
    movies[title] = [i.text for i in items[1:]]
    
list(movies.items())[0]

('/movies/?id=crouchingtigerhiddendragon.htm',
 ['Crouching Tiger, Hidden Dragon(Taiwan)',
  'SPC',
  '$128,078,872',
  '2,027',
  '$663,205',
  '16',
  '12/8/00'])
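
The row-by-row loop can be sketched offline on a tiny table. The markup below is a simplified assumption: the first data row reuses values from the output above, and the second row is a made-up placeholder. Pairing each cell with its column header is one possible design; the notebook's loop keeps plain lists instead.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Studio</th><th>Gross</th></tr>
  <tr><td>1</td><td><a href="/movies/?id=crouchingtigerhiddendragon.htm">Crouching Tiger, Hidden Dragon</a></td><td>SPC</td><td>$128,078,872</td></tr>
  <tr><td>2</td><td><a href="/movies/?id=placeholder.htm">Placeholder Film</a></td><td>XYZ</td><td>$1,000</td></tr>
  <tr><td colspan="4">Footer row without a link</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find("table").find_all("tr")
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]

movies = {}
for row in rows[1:]:
    cells = row.find_all("td")
    # Skip rows that do not look like data rows (too short or no title link)
    link = cells[1].find("a") if len(cells) > 1 else None
    if link is None:
        continue
    values = [cell.get_text(strip=True) for cell in cells]
    movies[link["href"]] = dict(zip(header, values))

for href, record in movies.items():
    print(href, record)
```

Guarding against rows without a link matters on real pages, where header, footer and pagination rows often share the same table as the data.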

### Step 5: Pandas Alternative


In [89]:
# you can also use pandas to read tables
import pandas as pd

url = 'https://eshop-prices.com/prices?currency=SGD'

In [62]:
tables = pd.read_html(url)

HTTPError: HTTP Error 403: Forbidden

In [58]:
tables[2]
# how can you fix the header?

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Title (click to view) | Studio | Release Date |
| 1 | Tampopo (2016 re-release) | Jan. | 10/21/16 |
| 2 | Cyrano, My Love | RAtt. | 10/18/19 |
| 3 | By The Grace of God | MBox | 10/18/19 |
| 4 | Housefull 4 | FIP | 10/25/19 |
| 5 | Portrait of a Lady on Fire | Neon | 12/6/19 |
| 6 | The Traitor | SPC | 1/31/20 |
| 7 | Las Pildoras de Mi Novio (My Boyfriend's Meds) | PNT | 2/21/20 |
| 8 | Johanna(Hungary) | Tar. | TBD |
| 9 | Abel | LGF | TBD |


In [55]:
tables[2][0:5]

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Title (click to view) | Studio | Release Date |
| 1 | Tampopo (2016 re-release) | Jan. | 10/21/16 |
| 2 | Cyrano, My Love | RAtt. | 10/18/19 |
| 3 | By The Grace of God | MBox | 10/18/19 |
| 4 | Housefull 4 | FIP | 10/25/19 |
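
One possible answer to the header question is to promote the first row to column names. The sketch below rebuilds a few rows of the table above by hand as a DataFrame (so it runs without any network access) and then fixes the header; the variable names are made up for this example.

```python
import pandas as pd

# Hand-built stand-in for tables[2]: the real header ended up as row 0
df = pd.DataFrame([
    ["Title (click to view)", "Studio", "Release Date"],
    ["Tampopo (2016 re-release)", "Jan.", "10/21/16"],
    ["Cyrano, My Love", "RAtt.", "10/18/19"],
])

# Drop the header row from the data and renumber the remaining rows
fixed = df.iloc[1:].reset_index(drop=True)

# Use the values of the original first row as column names
fixed.columns = df.iloc[0].tolist()

print(fixed)
```

When reading directly from HTML, passing `header=0` to `pd.read_html` asks pandas to treat the first row as the header, which may avoid the manual step, depending on how the table is marked up.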


Conclusion: Beautiful Soup is powerful, but it only parses the HTML it is handed. If a page requires interaction (such as entering a password) or builds its content dynamically instead of serving static HTML, Beautiful Soup alone is not enough. We will explore other tools for that.

One such tool for tomorrow: *Selenium*