<a href="https://colab.research.google.com/github/carlosfmorenog/CMM202/blob/master/CMM202_Topic_8/CMM202_T8_Lec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CMM202 Topic 8: Web Scraping

## Lecture objectives

* Learn how to get data even when people don't want us to get it!

## Web scraping

The extraction of data from a website by programmatic means

Used when...
    * No API
    * No published structured dataset
    * Multiple sources to be combined
    * Scheduled sampling of data at given times

Consider the legal barriers!

* Some sites don't want to share data so easily! (e.g. BBC, City Council, Wikipedia, etc.)

* Permission to reuse data may be granted under copyright terms, a Creative Commons licence or the Open Government Licence
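
A practical first check is the site's `robots.txt` file, which lists the paths that automated clients are asked to avoid. The sketch below uses the standard library's `urllib.robotparser` on a made-up `robots.txt`; the rules and the `example.com` URLs are hypothetical and not taken from any real site. Keep in mind that `robots.txt` is only a convention and does not replace reading the site's licence or terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
sample_robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the request
allowed_public = parser.can_fetch("*", "https://example.com/publication/table-1")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_public, allowed_private)
```

For a real site you would point the parser at the live file with `set_url(...)` followed by `read()` instead of parsing a string.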

To scrape data, the first thing we need is a target website

Let's use [this one](https://www.transport.gov.scot/publication/key-reported-road-casualties-scotland-2018/3-reported-numbers-of-accidents-table-1) as an example

This page contains the number of reported injury road accidents in Scotland, broken down by severity (fatal, serious, slight), from 1970 to 2018

In Python, we can use the `requests` and Beautiful Soup (`bs4`) libraries

In [None]:
# Import the necessary packages
import requests
from bs4 import BeautifulSoup

In [None]:
# Specify the target

url = "https://www.transport.gov.scot/publication/key-reported-road-casualties-scotland-2018/3-reported-numbers-of-accidents-table-1"

In [None]:
# Now, we do the request
r = requests.get(url)
print(r, type(r))

<Response [200]> <class 'requests.models.Response'>
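
`<Response [200]>` means the server answered with HTTP status 200 (OK). A slightly more defensive version of the request is sketched below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the lecture code.

```python
import requests

def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text (hypothetical helper)."""
    # A timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html(url)` would return the page as a string that can be passed to Beautiful Soup.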


In [None]:
# We can see the content of the response

print(r.content)

b'\r\n\r\n<!DOCTYPE html>\r\n\r\n<html lang="en" class="">\r\n\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\r\n    <link rel="stylesheet" href="/content/css/styles.css">\r\n    <script>document.documentElement.className = document.documentElement.className.split(\'no-js\').join(\'\');</script>\r\n    <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">\r\n    <meta name="apple-touch-fullscreen" content="YES">\r\n    \r\n    \r\n<title>3. Reported numbers of Accidents (Table 1)</title>\r\n<meta name="description" content="" />\r\n<meta property="og:title" content="" />\r\n<meta property="og:description" content="" />\r\n\r\n\r\n<meta name="twitter:card" content="summary_large_image"/>\r\n<meta name="twitter:site" content="@transcotland"/>\r\n<meta name="twitter:title" content=""/>\r\n<meta name="twitter:descripti

To work with this response, we need the Beautiful Soup package

We first create a `soup` object and pass the contents of `r` to it

In [None]:
soup = BeautifulSoup(r.content,"html.parser")
print(soup)


<!DOCTYPE html>

<html class="" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<link href="/content/css/styles.css" rel="stylesheet"/>
<script>document.documentElement.className = document.documentElement.className.split('no-js').join('');</script>
<meta content="black-translucent" name="apple-mobile-web-app-status-bar-style"/>
<meta content="YES" name="apple-touch-fullscreen"/>
<title>3. Reported numbers of Accidents (Table 1)</title>
<meta content="" name="description">
<meta content="" property="og:title">
<meta content="" property="og:description">
<meta content="summary_large_image" name="twitter:card"/>
<meta content="@transcotland" name="twitter:site"/>
<meta content="" name="twitter:title"/>
<meta content="" name="twitter:description"/>
<link href="https://www.transport.gov.scot/publication/key-reported-road-casualties-scotland-2018/3-reported-
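
The same idea on a small scale: the sketch below parses a short, hand-written HTML string (made up for illustration) so you can see how tags become reachable through the parsed object. The variable names `soup_demo`, `page_title` and `first_paragraph` are illustrative.

```python
from bs4 import BeautifulSoup

# Small, hand-written HTML document used only for illustration
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

soup_demo = BeautifulSoup(html, "html.parser")

# Tags can be reached by name; get_text() returns the text inside a tag
page_title = soup_demo.title.get_text()
first_paragraph = soup_demo.find("p").get_text()
print(page_title, "|", first_paragraph)
```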

Keep in mind that we are looking for the *tabular data*

* Just like when you **inspect** a website:
    * Right-click, `View Page Source`
    * `View` $\rightarrow$ `Developer` $\rightarrow$ `Inspect Elements`

In [None]:
# Option 1: take the first <table> found on the page
table_v1 = soup.find('table')
print(table_v1)

<table border="1" cellpadding="5" cellspacing="1" id="Table1">
<caption>
          Table 1: Injury Road Accidents by Severity, 1970 –
          2018
        </caption>
<thead>
<tr>
<th> </th>
<th>
<p>Fatal</p>
</th>
<th>
<p>Serious</p>
</th>
<th>
<p>Fatal and Serious</p>
</th>
<th>
<p>Slight</p>
</th>
<th>
<p>All</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<th>
<p>1970</p>
</th>
<td>
<p>758</p>
</td>
<td>
<p>7,860</p>
</td>
<td>
<p>8,618</p>
</td>
<td>
<p>13,515</p>
</td>
<td>
<p>22,133</p>
</td>
</tr>
<tr>
<th>
<p>1975</p>
</th>
<td>
<p>699</p>
</td>
<td>
<p>6,912</p>
</td>
<td>
<p>7,611</p>
</td>
<td>
<p>13,041</p>
</td>
<td>
<p>20,652</p>
</td>
</tr>
<tr>
<th>
<p>1980</p>
</th>
<td>
<p>644</p>
</td>
<td>
<p>7,218</p>
</td>
<td>
<p>7,862</p>
</td>
<td>
<p>13,926</p>
</td>
<td>
<p>21,788</p>
</td>
</tr>
<tr>
<th>
<p>1985</p>
</th>
<td>
<p>550</p>
</td>
<td>
<p>6,507</p>
</td>
<td>
<p>7,057</p>
</td>
<td>
<p>13,587</p>
</td>
<td>
<p>20,644</p>
</td>
</tr>
<tr>
<th>
<p>1990</p>
</th>
<td>
<p>

In [None]:
# Option 2: select the table by its id attribute
table_v2 = soup.find('table', id='Table1')
print(table_v2)

<table border="1" cellpadding="5" cellspacing="1" id="Table1">
<caption>
          Table 1: Injury Road Accidents by Severity, 1970 –
          2018
        </caption>
<thead>
<tr>
<th> </th>
<th>
<p>Fatal</p>
</th>
<th>
<p>Serious</p>
</th>
<th>
<p>Fatal and Serious</p>
</th>
<th>
<p>Slight</p>
</th>
<th>
<p>All</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<th>
<p>1970</p>
</th>
<td>
<p>758</p>
</td>
<td>
<p>7,860</p>
</td>
<td>
<p>8,618</p>
</td>
<td>
<p>13,515</p>
</td>
<td>
<p>22,133</p>
</td>
</tr>
<tr>
<th>
<p>1975</p>
</th>
<td>
<p>699</p>
</td>
<td>
<p>6,912</p>
</td>
<td>
<p>7,611</p>
</td>
<td>
<p>13,041</p>
</td>
<td>
<p>20,652</p>
</td>
</tr>
<tr>
<th>
<p>1980</p>
</th>
<td>
<p>644</p>
</td>
<td>
<p>7,218</p>
</td>
<td>
<p>7,862</p>
</td>
<td>
<p>13,926</p>
</td>
<td>
<p>21,788</p>
</td>
</tr>
<tr>
<th>
<p>1985</p>
</th>
<td>
<p>550</p>
</td>
<td>
<p>6,507</p>
</td>
<td>
<p>7,057</p>
</td>
<td>
<p>13,587</p>
</td>
<td>
<p>20,644</p>
</td>
</tr>
<tr>
<th>
<p>1990</p>
</th>
<td>
<p>
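
Both options print the same table here, because the first `<table>` on this page is also the one with `id="Table1"`. On a page with several tables the two calls can return different elements, as the sketch below shows on a made-up HTML snippet (the id `Intro` and the cell contents are illustrative). `select_one` is Beautiful Soup's CSS-selector route to the same element.

```python
from bs4 import BeautifulSoup

# Hand-written snippet with two tables, for illustration only
html = (
    '<table id="Intro"><tr><td>first</td></tr></table>'
    '<table id="Table1"><tr><td>second</td></tr></table>'
)
soup_demo = BeautifulSoup(html, "html.parser")

first_table = soup_demo.find("table")          # first <table> in the document
by_id = soup_demo.find("table", id="Table1")   # the table with this id
by_css = soup_demo.select_one("table#Table1")  # same table via a CSS selector

print(first_table["id"], by_id["id"], by_css["id"])
```

Selecting by `id` is usually the safer choice, since it keeps working if another table is later added above the one you want.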

In [None]:
# You can print properties of the tables you find, such as the caption
print(table_v1.caption)

<caption>
          Table 1: Injury Road Accidents by Severity, 1970 –
          2018
        </caption>
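
The caption text carries the line breaks and runs of spaces from the page's HTML layout. A minimal sketch for tidying it, using a hand-written copy of the caption markup shown above:

```python
from bs4 import BeautifulSoup

# Hand-written copy of the caption markup, wrapped in a bare table
html = """<table><caption>
          Table 1: Injury Road Accidents by Severity, 1970 –
          2018
        </caption></table>"""
caption = BeautifulSoup(html, "html.parser").find("caption")

# split() with no arguments breaks on any whitespace; join rebuilds with single spaces
clean_caption = " ".join(caption.get_text().split())
print(clean_caption)
```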


In [None]:
# Now we can chain the calls to find the table, its body and all of its rows (tr)
rows = soup.find('table').find('tbody').find_all('tr')
print(rows)

[<tr>
<th> </th>
<th>
<p>Fatal</p>
</th>
<th>
<p>Serious</p>
</th>
<th>
<p>Fatal and Serious</p>
</th>
<th>
<p>Slight</p>
</th>
<th>
<p>All</p>
</th>
</tr>, <tr>
<th>
<p>1970</p>
</th>
<td>
<p>758</p>
</td>
<td>
<p>7,860</p>
</td>
<td>
<p>8,618</p>
</td>
<td>
<p>13,515</p>
</td>
<td>
<p>22,133</p>
</td>
</tr>, <tr>
<th>
<p>1975</p>
</th>
<td>
<p>699</p>
</td>
<td>
<p>6,912</p>
</td>
<td>
<p>7,611</p>
</td>
<td>
<p>13,041</p>
</td>
<td>
<p>20,652</p>
</td>
</tr>, <tr>
<th>
<p>1980</p>
</th>
<td>
<p>644</p>
</td>
<td>
<p>7,218</p>
</td>
<td>
<p>7,862</p>
</td>
<td>
<p>13,926</p>
</td>
<td>
<p>21,788</p>
</td>
</tr>, <tr>
<th>
<p>1985</p>
</th>
<td>
<p>550</p>
</td>
<td>
<p>6,507</p>
</td>
<td>
<p>7,057</p>
</td>
<td>
<p>13,587</p>
</td>
<td>
<p>20,644</p>
</td>
</tr>, <tr>
<th>
<p>1990</p>
</th>
<td>
<p>491</p>
</td>
<td>
<p>5,237</p>
</td>
<td>
<p>5,728</p>
</td>
<td>
<p>14,443</p>
</td>
<td>
<p>20,171</p>
</td>
</tr>, <tr>
<th>
<p>1995</p>
</th>
<td>
<p>361</p>
</td>
<td>
<p>4,071</p>


In [None]:
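# Loop over the rows and pick the one whose header cell (th) matches the year
year = '2020' # if you want 2018, you have to put 2018 prov.
fatalities = None
for row in rows:
    header = row.find('th')
    # Reading the text from the th itself also works for header cells without a <p>
    if header is not None and header.get_text(strip=True) == year:
        data_cells = row.find_all('td')
        fatalities = data_cells[0].get_text(strip=True)
        break
if fatalities is None:
    print('No row found for ' + year)
else:
    print('The number of fatalities in ' + year + ' was ' + fatalities)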

The number of fatalities in 2020 was 264


How would you get the number of `serious` accidents for a given year?

In [None]:
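# Same loop, but reading the second data cell (index 1), which holds the serious accidents
year = '1970' # if you want 2018, you have to put 2018 prov.
serious = None
for row in rows:
    header = row.find('th')
    # Reading the text from the th itself also works for header cells without a <p>
    if header is not None and header.get_text(strip=True) == year:
        data_cells = row.find_all('td')
        serious = data_cells[1].get_text(strip=True)
        break
if serious is None:
    print('No row found for ' + year)
else:
    print('The number of serious accidents in ' + year + ' was ' + serious)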

The number of serious accidents in 1970 was 7,860
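
The two cells above differ only in the index into `data_cells` (0 for fatal, 1 for serious). One way to generalise this is sketched below, assuming the data cells follow the header order shown earlier (Fatal, Serious, Fatal and Serious, Slight, All). The helper name `lookup` and the `COLUMNS` list are hypothetical. The sketch also converts strings such as `'7,860'` into integers so they can be used in calculations.

```python
from bs4 import BeautifulSoup

# Column order taken from the table header shown earlier
COLUMNS = ["Fatal", "Serious", "Fatal and Serious", "Slight", "All"]

def lookup(rows, year, column):
    """Return the value for a year and column as an int, or None if the year is absent."""
    index = COLUMNS.index(column)
    for row in rows:
        header = row.find("th")
        if header is not None and header.get_text(strip=True) == year:
            text = row.find_all("td")[index].get_text(strip=True)
            # Remove the thousands separator before converting
            return int(text.replace(",", ""))
    return None

# Hand-written sample row copied from the 1970 entry above
html = ("<table><tbody><tr><th><p>1970</p></th>"
        "<td><p>758</p></td><td><p>7,860</p></td><td><p>8,618</p></td>"
        "<td><p>13,515</p></td><td><p>22,133</p></td></tr></tbody></table>")
sample_rows = BeautifulSoup(html, "html.parser").find_all("tr")

print(lookup(sample_rows, "1970", "Serious"))
```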


* Other useful attributes for moving around the tree:
    * `table.parent` and `table.parents`: Go up the tree
    * `table.next_sibling(s)` and `table.previous_sibling(s)`: Go sideways
    * `table.contents` or `table.children` or `table.descendants`: Go down
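
A small sketch of these attributes on a hand-written snippet (the markup is made up for illustration). The HTML is written without whitespace between tags on purpose: on real pages, `next_sibling` and `previous_sibling` often return the text node holding a line break rather than the neighbouring tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written HTML with no whitespace between tags, so siblings are tags, not text nodes
html = ('<div id="wrap"><h2>Title</h2>'
        '<table><tr><td>1</td><td>2</td></tr></table>'
        '<p>Note</p></div>')
table = BeautifulSoup(html, "html.parser").find("table")

parent_name = table.parent.name                # up: the enclosing <div>
previous_name = table.previous_sibling.name    # sideways: the <h2> before the table
next_name = table.next_sibling.name            # sideways: the <p> after the table
child_names = [child.name for child in table.children]  # down, one level only
# descendants walks the whole subtree; keep tags and skip the text nodes
descendant_tags = [node.name for node in table.descendants if isinstance(node, Tag)]

print(parent_name, previous_name, next_name, child_names, descendant_tags)
```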

### More advice on this topic

Check out [Selenium](https://www.scrapingbee.com/blog/selenium-python/), a browser-automation package that drives a real browser, so it can also navigate and scrape pages whose content is loaded by JavaScript!

Also, Arturo Regalado (PhD, Univ. of Aberdeen, and recurrent presenter at APUG) prepared the following tutorial on this topic

In [None]:
import warnings
warnings.simplefilter('ignore')
from IPython.display import HTML
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/6iqPwzSOkc4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')

## Lab(s)