<img src="https://www.exegetic.biz/img/exegetic-banner-black.svg" width="35%" align="right">

# Web Scraping: Members of Parliament

Andrew B. Collier (@datawookie | andrew@exegetic.biz)<br>
Data Scientist / Founder<br>
[Exegetic Analytics](https://www.exegetic.biz)

<span style="color: #3498db;">**↯ Notebooks**</span> available from https://bit.ly/2kxOTT9.

## Introduction

In this tutorial we're going to scrape (public) details of our esteemed members of parliament from the website of the [Parliamentary Monitoring Group](https://pmg.org.za/).

![](fig/members-of-parliament.png)

**The Brief**: Our brief is to capture data for all members and store it in a relational database. Why? Well, suppose you were developing an insurance or investment product targeted specifically at politicians, then this would immediately give you a list of prospects with their contact details.

**The Challenge**: There's an index page with links to individual pages for each of the members. We need to systematically scrape all of the member pages.

**The Approach:** These are the steps that we'll take to achieve that goal:

1. Manually scrape the data for a specific member.
2. Write a function to scrape the data for a specific member.
3. Test that function.
4. Run the function across all of the members.
5. Store the results.

## Packages

Load some packages.

In [None]:
# General packages
import re, random, time, sqlite3
import numpy as np
import pandas as pd

![](https://github.com/datawookie/useful-images/raw/master/banner/web-scraping-python.png)

In [None]:
# Scraping packages
from requests import get
from bs4 import BeautifulSoup

There are two components to a scrape:

- retrieving the HTML content of the page (done with the `requests` package) and
- parsing the page and extracting data (done with the `BeautifulSoup` package).

## Setup

Synchronise your watches (or your RNGs).

In [None]:
random.seed(17)

The name of the SQLite database that we'll use to store the data.

In [None]:
SQLITEDB = 'members-of-parliament.sqlite'

Open [this link](https://pmg.org.za/members/) in your browser.

In [None]:
# An index of the members, with a thumbnail linking to their individual profile pages.
URL = 'https://pmg.org.za/members/'
# A page for a specific member.
url = 'https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/'

## Manual Scrape

Grab the HTML for a specific member's page. This uses an HTTP `GET` request, which is functionally equivalent to opening the URL in a browser.

In [None]:
response = get(url)

First check whether the request was successful. The result below is an [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), where 200 indicates success.

In [None]:
response.status_code

Looks good.

Let's take a look at the response headers (essentially metadata).

In [None]:
response.headers

Check that we've received an HTML document.

In [None]:
response.headers['Content-Type'].lower()
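
The `Content-Type` header often carries parameters as well as the media type, for example `text/html; charset=utf-8`, so comparing it directly with `'text/html'` can fail. Here's a minimal sketch of a more tolerant check. The `is_html()` helper is hypothetical (it's not part of `requests`) and the header values are made up for illustration.

```python
def is_html(content_type):
    """Return True if a Content-Type header value describes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(';')[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')

print(is_html('text/html; charset=utf-8'))  # True
print(is_html('application/json'))          # False
```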

Finally we can take a look at the actual content of the response.

In [None]:
response.content

That looks pretty complicated! Perhaps that's one of the reasons for the term ["tag soup"](https://en.wikipedia.org/wiki/Tag_soup). Not to worry! We'll be using simple tools to parse the contents.

In [None]:
html = BeautifulSoup(response.content, 'html.parser')
html

In [None]:
type(html)

Superficially the document looks cleaner. However, the `BeautifulSoup` object puts a slew of additional functionality at our disposal.

Let's wrap all of that up in a function which will reliably return parsed HTML for a page.

In [None]:
from requests.exceptions import RequestException

def read_html(url):
    try:
        # The context manager frees up network resources after leaving the block.
        with get(url, stream=True) as response:
            status_code = response.status_code
            # The Content-Type header might be missing, so fall back to an empty string.
            content_type = response.headers.get('Content-Type', '').lower()
            # Only download and parse the body of a successful HTML response.
            if status_code == 200 and 'html' in content_type:
                return BeautifulSoup(response.content, 'html.parser')
            return None
    except RequestException as e:
        print('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

Give it a test run.

In [None]:
person = read_html(url)
person

### Name

Start by retrieving the person's name. We need to find the appropriate CSS selector. In this case it's easy: it's the only `<h1>` tag on the page.

In [None]:
person.select('h1')

The `select()` method returns *all* tags which match the selector. If we want just the first one then use the `select_one()` method.

In [None]:
person.select_one('h1')

In [None]:
type(person.select_one('h1'))
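
To see the difference between `select()` and `select_one()` without touching the network, here's a small sketch that parses a hypothetical HTML fragment (not taken from the real site). It also shows what `select_one()` gives back when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, for illustration only.
snippet = """
<div>
  <h1>First heading</h1>
  <h1>Second heading</h1>
</div>
"""

soup = BeautifulSoup(snippet, 'html.parser')

all_headings = soup.select('h1')       # List-like result with every match.
first_heading = soup.select_one('h1')  # Only the first match.
missing = soup.select_one('h2')        # None, because nothing matches.

print(len(all_headings))
print(first_heading.text)
print(missing)
```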

If we want the text enclosed by the tag then we access the `text` attribute.

In [None]:
person \
    .select_one('h1') \
    .text

<span style="color: #3498db;">**↯ Exercise**</span> Raw scraped data are often grubby. Remove excess whitespace.

In [None]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [None]:
person.select_one('h1').text \
    .strip()
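
Note that `strip()` only removes whitespace at the start and end of a string. Scraped text sometimes has newlines or repeated spaces in the middle too. Here's a minimal sketch for that case, where `clean_text()` is a hypothetical helper and the sample string is made up.

```python
def clean_text(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    # split() with no arguments splits on any whitespace and discards empty pieces.
    return ' '.join(text.split())

print(clean_text('  Ms   Jane\n   Doe  '))  # Ms Jane Doe
```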

### Affiliation

Next let's get party affiliation. This information is in an `<a>` tag but it's the only tag on the page which has the `party-membership--party` class.

In [None]:
affiliation = person.select_one('.party-membership--party')
affiliation

In [None]:
affiliation.text

Let's take a moment to dig into the tag object.

In [None]:
affiliation.name

In [None]:
affiliation.attrs

This is a dictionary of attributes.

In [None]:
affiliation.attrs['href']

### Email Address

Now let's get the email address. The address is in an `<a>` tag nested inside a `<span>` with class `email-address`. There might be multiple email addresses, so here we use `select()` to capture all of them.

In [None]:
person.select('.email-address a')

Access the text for each tag using a list comprehension.

In [None]:
[a.text for a in person.select('.email-address a')]

<span style="color: #3498db;">**↯ Exercise**</span> Concatenate multiple email addresses with a semicolon separator.

In [None]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [None]:
'; '.join([a.text for a in person.select('.email-address a')])

### Phone Number

Extracting the phone number requires a slightly more sophisticated selector, which matches tags whose `href` attribute starts with `tel:`.

In [None]:
person.select('a[href^="tel:"]')

Again we need to cater for multiple phone numbers.

In [None]:
'; '.join([a.text for a in person.select('[href^="tel:"]')])

This is good progress, but if we want to do this systematically across all members then we'll need to write another function.

## Scraping Function

The function should accept a URL and return a dictionary with the scraped data.

In [None]:
def get_person(url):
    person = read_html(url)
    #
    if person is None:
        return None
    else:
        return {
            'name': person.select_one('h1').text.strip(),
            'party': person.select_one('.party-membership--party').text,
            'phone': '; '.join([a.text for a in person.select('[href^="tel:"]')]),
            'email': '; '.join([a.text for a in person.select('.email-address a')])
        }
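
The function above assumes that every page has both an `<h1>` tag and a party link. If `select_one()` finds nothing then it returns `None`, and accessing `.text` on that raises an `AttributeError`. Here's a minimal sketch of a more defensive lookup. The `text_or_none()` helper and the sample page are hypothetical.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.text.strip() if tag is not None else None

# Hypothetical page with a name but no party affiliation.
page = BeautifulSoup('<h1> Jane Doe </h1>', 'html.parser')

print(text_or_none(page, 'h1'))
print(text_or_none(page, '.party-membership--party'))
```

With a helper like this, a member without a listed party would get `None` in that field rather than stopping the whole scrape.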

Let's run a few quick tests on the following members:

- [Alexandra Lilian Amelia Abrahams](https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/)
- [Rachel Cecilia Adams](https://www.pa.org.za/person/rachel-cecilia-adams/) and
- [Mr Michael Bagraim](https://www.pa.org.za/person/michael-bagraim/).

In [None]:
get_person('https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/')

In [None]:
get_person('https://www.pa.org.za/person/rachel-cecilia-adams/')

In [None]:
get_person('https://www.pa.org.za/person/michael-bagraim/')

Those all look good. I think we're ready to start scraping at scale!

## Scraping All Members

First get the HTML for the index page.

In [None]:
directory = read_html(URL)

Extract all of the URLs for members' pages. These URLs are in `<div>` tags with the `single-mp` class. Within each `<div>` is an `<a>` linking to the member page.

Let's start by looking at a single anchor tag.

In [None]:
directory.select_one('.single-mp a')

Now we need to iterate over all of these tags and extract the `href` attribute from each one.

In [None]:
parliament = [a.attrs['href'] for a in directory.select('.single-mp a')]

In [None]:
# How many links?
#
len(parliament)
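
If a member's block contains more than one anchor (say a thumbnail and a name), then the same URL could appear more than once in the list. Whether that happens on this page is an assumption worth checking against the count above. Here's a minimal sketch for dropping repeats while keeping the original order, using a hypothetical `unique_in_order()` helper and made-up links.

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # Dictionary keys are unique and preserve insertion order.
    return list(dict.fromkeys(items))

# Hypothetical links, for illustration only.
links = [
    'https://example.org/person/a/',
    'https://example.org/person/b/',
    'https://example.org/person/a/',
]

print(unique_in_order(links))
```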

In [None]:
# Take a look at the first few links.
#
parliament[:10]

Keep only URLs which are on <https://www.pa.org.za/>.

In [None]:
# Escape the dots so that they only match a literal '.' (not any character).
pattern_url = re.compile(r'^https://www\.pa\.org\.za/')

parliament = [url for url in parliament if pattern_url.match(url)]
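
An alternative to a regular expression is to parse each URL and compare the host name directly, which sidesteps any concerns about escaping the pattern. Here's a minimal sketch using `urllib.parse` from the standard library. The `on_host()` helper is hypothetical and the third link is a made-up lookalike.

```python
from urllib.parse import urlparse

def on_host(url, host='www.pa.org.za'):
    """Return True if the URL uses HTTPS and points at the given host."""
    parts = urlparse(url)
    return parts.scheme == 'https' and parts.hostname == host

links = [
    'https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/',
    'https://pmg.org.za/members/',
    'https://wwwXpa.org.za/person/someone/',  # Made-up lookalike host.
]

print([url for url in links if on_host(url)])
```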

How many are left?

In [None]:
len(parliament)

Now iterate over a random subset of URLs, scraping each one in turn.

In [None]:
tic = time.time()
#
members = [get_person(url) for url in random.sample(parliament, 20)]
#
toc = time.time()

In [None]:
members[:3]

In [None]:
print('Elapsed time: %.3fs' % (toc - tic))

<span style="color: #3498db;">**↯ Exercise**</span> Make the code above a little more server-friendly by introducing a delay. *Hint:* Use `time.sleep()` to pause and `np.random.poisson()` to sample a random number of seconds.

In [None]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [None]:
def get_person_delay(url, mean):
    time.sleep(np.random.poisson(mean))
    return get_person(url)

tic = time.time()
#
# Sleep (on average) 5 seconds before retrieving URL.
members = [get_person_delay(url, 5) for url in random.sample(parliament, 20)]
#
toc = time.time()

print('Elapsed time: %.3fs' % (toc - tic))
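
A Poisson sample can be zero, in which case the request is sent immediately. If you'd prefer a guaranteed minimum pause, here's a minimal sketch using only the standard library. The `polite_delay()` helper and its parameters are hypothetical, and the `sleep` argument exists only so that the function can be tried out without actually waiting.

```python
import random
import time

def polite_delay(minimum=1.0, jitter=4.0, sleep=time.sleep):
    """Pause for at least `minimum` seconds plus a random extra of up to `jitter` seconds."""
    pause = minimum + random.uniform(0, jitter)
    sleep(pause)
    return pause

# Pass a do-nothing sleep function to try it out without waiting.
pause = polite_delay(sleep=lambda seconds: None)
print('Would have paused for %.2fs' % pause)
```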

Drop records without data.

In [None]:
members = [m for m in members if m is not None]

Now convert to a data frame.

In [None]:
members = pd.DataFrame(members)
members

So there we have the contact details of members of parliament.

Parliament is by no means static. Members come and go. Since we have a script though, we just have to run the script again to update the data.

## Database

To finish off we'll save the data to a [SQLite](https://www.sqlite.org/index.html) database.

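Use a context manager to ensure that the transaction is committed if the write succeeds (or rolled back if it fails). Note that, for `sqlite3`, the context manager doesn't close the connection itself: call `db.close()` once you're finished with it.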

In [None]:
with sqlite3.connect(SQLITEDB) as db:
    members.to_sql('members', db, if_exists='replace')

It'd be good to check on the content of the database. You can download a local copy as follows:

- select File ⟶ Open;
- check the box next to the file you've just created; and
- press the Download button.

You can open the file with something like [DB Browser for SQLite](https://sqlitebrowser.org/).
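
The contents can also be checked from Python. Here's a minimal sketch using only the `sqlite3` module. The `preview_table()` helper is hypothetical. It wraps the connection in `closing()` because a `sqlite3` connection used as a context manager manages the transaction but doesn't close the connection.

```python
import sqlite3
from contextlib import closing

def preview_table(db_path, table='members', limit=5):
    """Return the column names and the first few rows of a table."""
    # closing() ensures that the connection is closed when the block exits.
    with closing(sqlite3.connect(db_path)) as db:
        # The table name is interpolated into the SQL, so only pass trusted names.
        cursor = db.execute('SELECT * FROM "{0}" LIMIT ?'.format(table), (limit,))
        columns = [column[0] for column in cursor.description]
        rows = cursor.fetchall()
    return columns, rows
```

Once the database file exists, something like `columns, rows = preview_table(SQLITEDB)` should give a quick look at what was stored.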

## Resources

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)