<img src="https://www.exegetic.biz/img/exegetic-banner-black.svg" width="35%" align="right">

# Web Scraping: Members of Parliament

Andrew B. Collier (@datawookie | andrew@exegetic.biz)<br>
Data Scientist / Founder<br>
[Exegetic Analytics](https://www.exegetic.biz)

<span style="color: #3498db;">**↯ Notebooks**</span> available from https://bit.ly/2kxOTT9.

## Introduction

In this tutorial we're going to scrape (public) details of our esteemed members of parliament from the website of the [Parliamentary Monitoring Group](https://pmg.org.za/).

![](fig/members-of-parliament.png)

**The Brief**: Our brief is to capture data for all members and store it in a relational database. Why? Well, suppose you were developing an insurance or investment product targeted specifically at politicians, then this would immediately give you a list of prospects with their contact details.

**The Challenge**: There's an index page with links to individual pages for each of the members. We need to systematically scrape all of the member pages.

**The Approach:** These are the steps that we'll take to achieve that goal:

1. Manually scrape the data for a specific member.
2. Write a function to scrape the data for a specific member.
3. Test that function.
4. Run the function across all of the members.
5. Store the results.

## Packages

Load some packages.

In [126]:
# General packages
import re, random, time, sqlite3
import numpy as np
import pandas as pd

![](https://github.com/datawookie/useful-images/raw/master/banner/web-scraping-python.png)

In [127]:
# Scraping packages
from requests import get
from bs4 import BeautifulSoup

There are two components to a scrape:

- retrieving the HTML content of the page (done with the `requests` package) and
- parsing the page and extracting data (done with the `BeautifulSoup` package).

## Setup

Synchronise your watches (or your RNGs).

In [128]:
random.seed(17)

The name of the SQLite database that we'll use to store the data.

In [129]:
SQLITEDB = 'members-of-parliament.sqlite'

Open [this link](https://pmg.org.za/members/) in your browser.

In [130]:
# An index of the members, with a thumbnail linking to their individual profile pages.
URL = 'https://pmg.org.za/members/'
# A page for a specific member.
url = 'https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/'

## Manual Scrape

Grab the HTML for a specific member's page. This uses an HTTP `GET` request, which is functionally equivalent to opening the URL in a browser.

In [131]:
response = get(url)

First check whether the request was successful. The result below is an [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), where 200 indicates success.

In [132]:
response.status_code

200

Looks good.
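As a small aside, the standard library's `http.HTTPStatus` enum can translate a numeric status code into its reason phrase, which is handy when logging failed requests. The `describe_status()` helper below is a hypothetical name used for illustration. It's not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK' (hypothetical helper)."""
    try:
        return '{} {}'.format(code, HTTPStatus(code).phrase)
    except ValueError:
        # The code is not a registered HTTP status.
        return '{} (unknown)'.format(code)

print(describe_status(200))
print(describe_status(404))
```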

Let's take a look at the response headers (essentially metadata).

In [133]:
response.headers

{'Server': 'nginx', 'Date': 'Fri, 13 Sep 2019 11:52:51 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Content-Length': '11172', 'Connection': 'keep-alive', 'Last-Modified': 'Fri, 13 Sep 2019 11:52:51 GMT', 'Content-Encoding': 'gzip', 'Expires': 'Fri, 13 Sep 2019 12:12:51 GMT', 'Vary': 'Cookie,Accept-Encoding', 'Cache-Control': 'max-age=1200', 'x-url': '/person/alexandra-lilian-amelia-abrahams/', 'Accept-Ranges': 'bytes', 'Age': '0'}

Check that we've received an HTML document.

In [134]:
response.headers['Content-Type'].lower()

'text/html; charset=utf-8'
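The header value carries both a media type and (optionally) parameters like the character set. Here's a minimal sketch of how one might test for HTML more strictly than a substring search. The `is_html()` name and the choice to also accept XHTML are assumptions made for this example.

```python
def is_html(content_type):
    """Return True if a Content-Type header value names an HTML media type."""
    if not content_type:
        return False
    # Drop parameters such as '; charset=utf-8' and normalise the case.
    media_type = content_type.split(';')[0].strip().lower()
    return media_type in ('text/html', 'application/xhtml+xml')

print(is_html('text/html; charset=utf-8'))
print(is_html('application/json'))
```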

Finally we can take a look at the actual content of the response.

In [135]:
response.content

b'<!DOCTYPE html>\n<html lang="en" class="no-js">\n    <head>\n\n\n\n        <meta charset="utf-8">\n        <title>\n            Alexandra Lilian Amelia Abrahams\n             :: People\'s Assembly \n        </title>\n\n        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n        <meta name="viewport" content="initial-scale=1">\n        <meta http-equiv="cleartype" content="on">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\n        \n  \n  \n    <link href="//peoplesassembly.disqus.com" rel="dns-prefetch" />\n    <!--[if IE 9]>\n      <link href="http://peoplesassembly.disqus.com/" rel="prefetch" />\n    <![endif]-->\n  \n  <meta name="pombola-person-id" content="14995">\n  \n  <meta name="pa:identifier-elections_2019" content="8607130053089">\n  \n  <meta name="pa:identifier-za.org.pmg.api/member" content="1275">\n  \n\n\n        \n        <meta property="fb:app_id" content="1619725741628189" />\n        \n\n        \n\n        \n\

That looks pretty complicated! It's perhaps one of the reasons for the term ["tag soup"](https://en.wikipedia.org/wiki/Tag_soup). Not to worry! We'll be using simple tools to parse the contents.

In [136]:
html = BeautifulSoup(response.content, 'html.parser')
html

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<title>
            Alexandra Lilian Amelia Abrahams
             :: People's Assembly 
        </title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="initial-scale=1" name="viewport"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<link href="//peoplesassembly.disqus.com" rel="dns-prefetch"/>
<!--[if IE 9]>
      <link href="http://peoplesassembly.disqus.com/" rel="prefetch" />
    <![endif]-->
<meta content="14995" name="pombola-person-id"/>
<meta content="8607130053089" name="pa:identifier-elections_2019"/>
<meta content="1275" name="pa:identifier-za.org.pmg.api/member"/>
<meta content="1619725741628189" property="fb:app_id">
<meta content="Alexandra Lilian Amelia Abrahams" property="og:title">
<meta content="People's Assembly" property="og:site_name">
<meta content="profile" property="og:type">
<meta conte

In [137]:
type(html)

bs4.BeautifulSoup

Superficially the document looks cleaner. However, the `BeautifulSoup` object also puts a slew of additional functionality at our disposal.

Let's wrap all of that up in a function which will reliably return parsed HTML for a page.

In [138]:
from requests.exceptions import RequestException

def read_html(url):
    try:
        # Without stream=True the body is downloaded here, so network errors are caught below.
        response = get(url)
        #
        status_code  = response.status_code
        # Use .get() so that a missing header doesn't raise a KeyError.
        content_type = response.headers.get('Content-Type', '').lower()
    except RequestException as e:
        print('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

    if status_code == 200 and 'html' in content_type:
        return BeautifulSoup(response.content, 'html.parser')
    else:
        return None

Give it a test run.

In [139]:
person = read_html(url)
person

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<title>
            Alexandra Lilian Amelia Abrahams
             :: People's Assembly 
        </title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="initial-scale=1" name="viewport"/>
<meta content="on" http-equiv="cleartype"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<link href="//peoplesassembly.disqus.com" rel="dns-prefetch"/>
<!--[if IE 9]>
      <link href="http://peoplesassembly.disqus.com/" rel="prefetch" />
    <![endif]-->
<meta content="14995" name="pombola-person-id"/>
<meta content="8607130053089" name="pa:identifier-elections_2019"/>
<meta content="1275" name="pa:identifier-za.org.pmg.api/member"/>
<meta content="1619725741628189" property="fb:app_id">
<meta content="Alexandra Lilian Amelia Abrahams" property="og:title">
<meta content="People's Assembly" property="og:site_name">
<meta content="profile" property="og:type">
<meta conte

### Name

Start by retrieving the person's name. We need the appropriate CSS selector, and in this case it's easy: the name is in the only `<h1>` tag on the page.

In [140]:
person.select('h1')

[<h1> Alexandra Lilian Amelia Abrahams</h1>]

The `select()` method returns *all* tags which match the selector. If we want just the first one then use the `select_one()` method.

In [141]:
person.select_one('h1')

<h1> Alexandra Lilian Amelia Abrahams</h1>

In [142]:
type(person.select_one('h1'))

bs4.element.Tag

If we want the text enclosed by the tag then we access the `text` attribute.

In [143]:
person \
    .select_one('h1') \
    .text

' Alexandra Lilian Amelia Abrahams'

<span style="color: #3498db;">**↯ Exercise**</span> Raw scraped data are often grubby. Remove excess whitespace.

In [144]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [145]:
person.select_one('h1').text \
    .strip()

'Alexandra Lilian Amelia Abrahams'
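The `strip()` method only trims the ends of the string. If a page also has stray whitespace *inside* the text (line breaks, double spaces), then something like the sketch below might help. The `clean_text()` name is just an illustration and the input string is made up.

```python
def clean_text(text):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace.
    return ' '.join(text.split())

print(clean_text(' Alexandra  Lilian\n Amelia Abrahams '))
```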

### Affiliation

Next let's get party affiliation. This information is in an `<a>` tag. There are many `<a>` tags on the page, but this is the only one which has the `party-membership--party` class.

In [146]:
affiliation = person.select_one('.party-membership--party')
affiliation

<a class="party-membership party-membership--party" data-identifier-org.mysociety.za="/party/da" data-identifier-wikidata="Q761877" href="/organisation/da/">Democratic Alliance (DA)</a>

In [147]:
affiliation.text

'Democratic Alliance (DA)'

Let's take a moment to dig into the tag object.

In [148]:
affiliation.name

'a'

In [149]:
affiliation.attrs

{'href': '/organisation/da/',
 'data-identifier-org.mysociety.za': '/party/da',
 'data-identifier-wikidata': 'Q761877',
 'class': ['party-membership', 'party-membership--party']}

This is a dictionary of attributes.

In [150]:
affiliation.attrs['href']

'/organisation/da/'
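Note that this `href` is relative to the site root. If you wanted to follow the link then you'd need an absolute URL, and `urljoin()` from the standard library is one way to build it. This is a small sketch that assumes the link is resolved against the member's page URL used above.

```python
from urllib.parse import urljoin

# The page that the link was found on and the relative link itself.
page_url = 'https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/'
href = '/organisation/da/'

# Resolve the relative link against the page URL.
party_url = urljoin(page_url, href)
print(party_url)
```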

### Email Address

Now let's get the email address. The address is in an `<a>` tag nested inside a `<span>` with class `email-address`. There might be multiple email addresses, so here we use `select()` to capture all of them.

In [151]:
person.select('.email-address a')

[<a href="mailto:alexandra.abrahams@gmail.com">alexandra.abrahams@gmail.com</a>,
 <a href="mailto:aabrahams@parliament.gov.za">aabrahams@parliament.gov.za</a>]

Access the text for each tag using a list comprehension.

In [152]:
[a.text for a in person.select('.email-address a')]

['alexandra.abrahams@gmail.com', 'aabrahams@parliament.gov.za']

<span style="color: #3498db;">**↯ Exercise**</span> Concatenate multiple email addresses with a semicolon separator.

In [153]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [154]:
'; '.join([a.text for a in person.select('.email-address a')])

'alexandra.abrahams@gmail.com; aabrahams@parliament.gov.za'

### Phone Number

Extracting the phone number requires a slightly more sophisticated selector.

In [155]:
person.select('a[href^="tel:"]')

[<a href="tel:082 335 7740">082 335 7740</a>]

Again we need to cater for multiple phone numbers.

In [156]:
'; '.join([a.text for a in person.select('[href^="tel:"]')])

'082 335 7740'
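The `[href^="tel:"]` selector matches tags whose `href` attribute *starts with* `tel:`. Here's a self-contained sketch of the same idea on a made-up HTML fragment (the phone number and email address are fictional), which also shows that the prefix match works for `mailto:` links.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, loosely modelled on a member page.
snippet = '''
<div>
  <a href="tel:021 555 0100">021 555 0100</a>
  <a href="mailto:someone@example.com">someone@example.com</a>
  <a href="/organisation/da/">Party</a>
</div>
'''
soup = BeautifulSoup(snippet, 'html.parser')

# Only the link whose href begins with 'tel:' is selected.
phones = [a.text for a in soup.select('a[href^="tel:"]')]
# Same idea for email links, stripping the 'mailto:' prefix from the href.
emails = [a['href'][len('mailto:'):] for a in soup.select('a[href^="mailto:"]')]

print(phones)
print(emails)
```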

This is good progress, but if we want to do this systematically across all members then we'll need to write another function.

## Scraping Function

The function should accept a URL and return a dictionary with the scraped data.

In [157]:
def get_person(url):
    person = read_html(url)
    #
    if person is None:
        return None
    else:
        return {
            'name': person.select_one('h1').text.strip(),
            'party': person.select_one('.party-membership--party').text,
            'phone': '; '.join([a.text for a in person.select('[href^="tel:"]')]),
            'email': '; '.join([a.text for a in person.select('.email-address a')])
        }
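One thing to be aware of: `select_one()` returns `None` when nothing matches, so accessing `.text` on a page without a name or party would raise an `AttributeError`. A possible guard is sketched below. The `text_or_none()` helper is hypothetical and the HTML is made up for the example.

```python
from bs4 import BeautifulSoup

def text_or_none(soup, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    return tag.text.strip() if tag is not None else None

# A made-up page which has a name but no party affiliation.
soup = BeautifulSoup('<h1> Jane Doe</h1>', 'html.parser')

print(text_or_none(soup, 'h1'))
print(text_or_none(soup, '.party-membership--party'))
```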

Let's run a few quick tests on the following members:

- [Alexandra Lilian Amelia Abrahams](https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/)
- [Rachel Cecilia Adams](https://www.pa.org.za/person/rachel-cecilia-adams/) and
- [Mr Michael Bagraim](https://www.pa.org.za/person/michael-bagraim/).

In [158]:
get_person('https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/')

{'name': 'Alexandra Lilian Amelia Abrahams',
 'party': 'Democratic Alliance (DA)',
 'phone': '082 335 7740',
 'email': 'alexandra.abrahams@gmail.com; aabrahams@parliament.gov.za'}

In [159]:
get_person('https://www.pa.org.za/person/rachel-cecilia-adams/')

{'name': 'Rachel Cecilia Adams',
 'party': 'African National Congress (ANC)',
 'phone': '',
 'email': 'radams@parliament.gov.za'}

In [160]:
get_person('https://www.pa.org.za/person/michael-bagraim/')

{'name': 'Mr Michael Bagraim',
 'party': 'Democratic Alliance (DA)',
 'phone': '082 557 7933; 021 403 3474',
 'email': 'bagm@iafrica.com; michael@bagraims.co.za'}

Those all look good. I think we're ready to start scraping at scale!

## Scraping All Members

First get the HTML for the index page.

In [161]:
directory = read_html(URL)

Extract all of the URLs for members' pages. These URLs are in `<div>` tags with the `single-mp` class. Within each `<div>` is an `<a>` linking to the member page.

Let's start by looking at a single anchor tag.

In [162]:
directory.select_one('.single-mp a')

<a class="content-card flex" href="https://www.pa.org.za/person/noxolo-abraham-ntantiso/">
<img alt="Abraham-Ntantiso, Ms PN" class="member-profile-pic" src="http://pmg-assets.s3-website-eu-west-1.amazonaws.com/Abrahams_NP.jpg"/>
<div>
<h4 class="card-title name">Abraham-Ntantiso, Ms PN</h4></div></a>

Now we need to iterate over all of these tags and extract the `href` attribute from each one.

In [163]:
parliament = [a.attrs['href'] for a in directory.select('.single-mp a')]

In [164]:
# How many links?
#
len(parliament)

454

In [165]:
# Take a look at the first few links.
#
parliament[:10]

['https://www.pa.org.za/person/noxolo-abraham-ntantiso/',
 'https://www.pa.org.za/person/alexandra-lilian-amelia-abrahams/',
 'https://www.pa.org.za/person/rachel-cecilia-adams/',
 'https://www.pa.org.za/person/nombuyiselo-gladys-adoons/',
 'https://www.pa.org.za/person/heinrich-giovanni-april/',
 'https://www.pa.org.za/person/laetitia-heloise-arries/',
 'https://www.pa.org.za/person/shaun-nigel-august/',
 'https://www.pa.org.za/person/michael-bagraim/',
 'https://www.pa.org.za/person/kopeng-obed-bapela/',
 'https://www.pa.org.za/person/leonard-jones-basson/']

Keep only URLs which are on <https://www.pa.org.za/>.

In [166]:
# Escape the dots so that they match a literal '.' rather than any character.
pattern_url = re.compile(r'^https://www\.pa\.org\.za/')

parliament = [url for url in parliament if pattern_url.match(url)]

How many are left?

In [167]:
len(parliament)

452
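An alternative to a regular expression is to parse each URL and compare the hostname directly. This is just a sketch: the `on_host()` name is hypothetical and the third link in the list is made up to show a near miss.

```python
from urllib.parse import urlparse

def on_host(url, host='www.pa.org.za'):
    """Return True if the URL's hostname is exactly the given host."""
    return urlparse(url).hostname == host

links = [
    'https://www.pa.org.za/person/michael-bagraim/',
    'https://pmg.org.za/members/',
    'https://wwwxpa.org.za/person/someone/',
]
# Keep only links which point at the expected host.
kept = [link for link in links if on_host(link)]
print(kept)
```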

Now iterate over a random subset of URLs, scraping each one in turn.

In [168]:
tic = time.time()
#
members = [get_person(url) for url in random.sample(parliament, 20)]
#
toc = time.time()

In [169]:
members[:3]

[{'name': 'Nonkosi Queenie Mvana',
  'party': 'African National Congress (ANC)',
  'phone': '082 494 9364',
  'email': 'nmvana@parliament.gov.za'},
 {'name': 'Sibusiso Welcome Mdabe',
  'party': 'African National Congress (ANC)',
  'phone': '',
  'email': 'smdabe@parliament.gov.za'},
 {'name': 'Ms Nkagisang Poppy Koni',
  'party': 'Economic Freedom Fighters (EFF)',
  'phone': '073 220 6647; 065 845 8619; 021 403 8871',
  'email': 'nkagimok111@gmail.com'}]

In [170]:
print('Elapsed time: %.3fs' % (toc - tic))

Elapsed time: 19.320s


<span style="color: #3498db;">**↯ Exercise**</span> Make the code above a little more server-friendly by introducing a delay. *Hint:* Use `time.sleep()` to pause and `np.random.poisson()` to sample a random number of seconds.

In [171]:
# ------------------------------------------------------------------------------
#
# Your code goes here.
#
# ------------------------------------------------------------------------------

In [172]:
def get_person_delay(url, mean):
    time.sleep(np.random.poisson(mean))
    return get_person(url)

tic = time.time()
#
# Sleep (on average) 5 seconds before retrieving URL.
members = [get_person_delay(url, 5) for url in random.sample(parliament, 20)]
#
toc = time.time()

print('Elapsed time: %.3fs' % (toc - tic))

Elapsed time: 118.557s
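A Poisson sample can be zero, in which case there's no pause at all. If you'd prefer a guaranteed minimum gap between requests then a variation like the one below could work. It's a sketch using only the standard library: the `polite_sleep()` name, the one second minimum and the exponential distribution are all assumptions made for this example.

```python
import random
import time

def polite_sleep(mean, minimum=1.0, sleep=time.sleep):
    """Pause for at least `minimum` seconds plus a random extra.

    The extra is exponentially distributed so that the average pause is
    `mean` seconds (or `minimum`, if `mean` is smaller than that).
    """
    extra_mean = max(mean - minimum, 0.0)
    extra = random.expovariate(1.0 / extra_mean) if extra_mean > 0 else 0.0
    delay = minimum + extra
    sleep(delay)
    return delay

# Demonstration with a stand-in for time.sleep() so that nothing actually pauses.
recorded = []
delay = polite_sleep(5, sleep=recorded.append)
print(delay >= 1.0)
```

The `sleep` argument is there only to make the function easy to try out without waiting. In a real scrape you'd leave it at its default.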


Drop records without data.

In [173]:
members = [m for m in members if m is not None]

Now convert to a data frame.

In [174]:
members = pd.DataFrame(members)
members

| | name | party | phone | email |
|---|---|---|---|---|
| 0 | Ms Regina Mina Mpontseng Lesoma | African National Congress (ANC) | 073 454 9545; 021 403 2911 | mlesoma@parliament.gov.za |
| 1 | Ms Winnie Ngwenya | African National Congress (ANC) | | wngwenya@parliament.gov.za |
| 2 | Thembi Portia Msane | Economic Freedom Fighters (EFF) | 060 560 2959; 061 467 8169 | msanethembi@gmail.com |
| 3 | Stephanus Franszouis Du Toit | Freedom Front + (Vryheidsfront Plus, FF+) | | |
| 4 | Ms Lusizo Sharon Makhubela-Mashele | African National Congress (ANC) | 082 699 6732; 079 246 7887; 021 403 3051 | lmakhubela-mashele@parliament.gov.za |
| 5 | Mr Maliyakhe Lymon Shelembe | Democratic Alliance (DA) | 021 403 3036 | maliyakhezn@gmail.com; mshelembe@parliament.go... |
| 6 | Moletsane Simon Moletsane | Economic Freedom Fighters (EFF) | | |
| 7 | Mr Dikgang Mathews Stock | African National Congress (ANC) | 071 645 9014; 021 403 3949 | dstock@parliament.gov.za |
| 8 | Ms Pam Tshwete | African National Congress (ANC) | 076 786 5681; 012 336 6696 | Eartha.Scholtz@dhs.gov.za; jonesn@dws.gov.za |
| 9 | Ms Bongiwe Pricilla Mbinqo-Gigaba | African National Congress (ANC) | 083 709 8441; 021 403 2531 | rnourse@parliament.gov.za |


So there we have the contact details of members of parliament.

Parliament is by no means static. Members come and go. Since we have a script though, we just have to run the script again to update the data.

## Database

To finish off we'll save the data to a [SQLite](https://www.sqlite.org/index.html) database.

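Use context managers to handle the connection. Note that the context manager of a `sqlite3` connection only commits (or rolls back) the transaction; it doesn't close the connection. Wrapping the connection in `closing()` ensures that it's also closed neatly afterwards.

In [175]:
from contextlib import closing

with closing(sqlite3.connect(SQLITEDB)) as db:
    with db:
        members.to_sql('members', db, if_exists='replace')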

It'd be good to check on the content of the database. You can download a local copy as follows:

- select File ⟶ Open;
- check the box next to the file you've just created; and
- press the Download button.

You can open the file with something like [DB Browser for SQLite](https://sqlitebrowser.org/).
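You could also check the content directly from Python using the `sqlite3` package. The sketch below uses an in-memory database with made-up rows so that it runs on its own. To inspect the real file you'd connect to `SQLITEDB` instead and skip the `CREATE` and `INSERT` steps.

```python
import sqlite3
from contextlib import closing

# An in-memory database with made-up rows, standing in for the real file.
with closing(sqlite3.connect(':memory:')) as db:
    db.execute('CREATE TABLE members (name TEXT, party TEXT, phone TEXT, email TEXT)')
    db.executemany(
        'INSERT INTO members VALUES (?, ?, ?, ?)',
        [
            ('Jane Doe', 'Party A', '021 555 0100', 'jane@example.com'),
            ('John Roe', 'Party B', '', 'john@example.com'),
            ('Mary Major', 'Party A', '', ''),
        ],
    )
    # Count the number of members per party.
    rows = db.execute(
        'SELECT party, COUNT(*) FROM members GROUP BY party ORDER BY party'
    ).fetchall()

print(rows)
```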

## Resources

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)