# Scraping political debates

Web scraping is a method that allows us to programmatically access and download the HTML underlying a website. It's the automated equivalent of manually clicking 'View Source' in most web browsers. 

Scraping is useful for grabbing large amounts of information from a site, but should be done ethically and responsibly. The UK Office for National Statistics has compiled a useful guide to ethical scraping (http://bit.ly/2IMCnJd) based on the following principles:

* Minimise burden on website owners
* Honour requests made by website owners to refrain from scraping their website
* Protect all personal data in all statistics and research outputs and seek ethical advice when scraping data that may identify individuals
* Apply scientific principles in the production of statistics and research based on web-scraped data and consider other sources of data
* Abide by all applicable legislation and monitor the evolving legal situation
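
One practical way to honour a site owner's wishes is to read the site's `robots.txt` before scraping. The sketch below uses Python's standard `urllib.robotparser` on a made-up rule set; the rules, the `example.org` URLs and the `allowed` helper are all assumptions for illustration, not Hansard's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only (not copied from any real site)
example_robots_txt = """
User-agent: *
Disallow: /search/
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(example_robots_txt.splitlines())

def allowed(url, user_agent='*'):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(allowed('https://example.org/commons/2018-01-08'))
print(allowed('https://example.org/search/MemberContributions'))
print(parser.crawl_delay('*'))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the real file instead of parsing a string.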

### Visually inspecting the source code

Let's start by using our browser's 'Inspect' function to understand the Hansard website a bit better. 

*Navigate to https://hansard.parliament.uk/commons/ in your web browser and try to figure out:*

* *The full date range for which debate transcripts are available*
* *The URL pointing to debate transcripts for each day*


In [19]:
import requests
import pandas as pd
import os
import time
from bs4 import BeautifulSoup
import json
import unidecode
import re

### Downloading raw HTML

Once we've figured this out, we can define a date range in pandas that spans the entire Hansard archive. To save time, let's limit today's analysis to 2018.

In [2]:
# strftime already returns 'YYYY-MM-DD' strings, so no further splitting is needed
hansard_date_range = pd.date_range(start='2018-01-01', end='2018-12-31').strftime('%Y-%m-%d').tolist()
print(hansard_date_range[0:10])


['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08', '2018-01-09', '2018-01-10']


In [3]:
print(len(hansard_date_range))

365


Then we can grab the raw HTML for every day in our date range. We introduce a time delay of 1 second between each download to be nice to the lovely folks at the Parliamentary Digital Service.
    

In [5]:
base_path = os.path.join(os.getcwd(),'raw_hansard_files')
os.makedirs(base_path, exist_ok=True)  # don't fail if the folder already exists

In [None]:
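for date in hansard_date_range:
    
    # Request the Commons Chamber transcript for this date
    date_url = 'https://hansard.parliament.uk/html/Commons/' + date + '/CommonsChamber'
    hansard_file = requests.get(date_url, allow_redirects=True)
    
    # Save whatever comes back; days without a sitting are filtered out later
    full_path = os.path.join(base_path,date+'.html')
    with open(full_path, 'wb') as f:
        f.write(hansard_file.content)
        
    # Record progress, one date per line
    with open('log.txt', 'a') as f:
        f.write(date + '\n')
        
    # Pause between requests to minimise the burden on the server
    time.sleep(1)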
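
If the download is interrupted, re-running the loop above fetches every date again. The sketch below is one possible variant, not the method used in this notebook, that skips dates already on disk and declines to save error responses. The name `download_day`, its `get` parameter and the 30-second timeout are assumptions made for illustration.

```python
import os
import time

def download_day(date, out_dir, get, delay=1.0):
    """Download one day's Commons transcript unless it is already on disk.

    `get` is any callable with the same interface as requests.get.
    Returns 'cached', 'downloaded' or 'failed'.
    """
    full_path = os.path.join(out_dir, date + '.html')
    if os.path.exists(full_path):
        return 'cached'  # nothing to do, and no request is sent

    date_url = 'https://hansard.parliament.uk/html/Commons/' + date + '/CommonsChamber'
    response = get(date_url, allow_redirects=True, timeout=30)
    time.sleep(delay)  # pause after every request, successful or not

    if response.status_code != 200:
        return 'failed'  # do not write error pages to disk

    with open(full_path, 'wb') as f:
        f.write(response.content)
    return 'downloaded'
```

In this notebook it could be called as `download_day(date, base_path, requests.get)` for each date in `hansard_date_range`. Days without a sitting may come back with a non-200 status, in which case no file is written and the file counts below would differ.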

Let's check how many files we've scraped:

In [None]:
n_scraped_files = len(os.listdir(base_path))
print(n_scraped_files)

### Removing invalid sessions

Not every date will have a valid sitting of Parliament. Inspect the contents of the files for invalid dates with your text editor of choice and then programmatically remove them.

In [None]:
for raw_filepath in os.listdir(base_path):
    
    full_path = os.path.join(base_path, raw_filepath)
    with open(full_path, 'r') as f:
        file_soup = BeautifulSoup(f.read(), 'html.parser')

    # Dates without a sitting contain only this placeholder message
    if str(file_soup) == 'The resource you are looking for has been removed, had its name changed, or is temporarily unavailable.':
        os.remove(full_path)
        print("Removed " + raw_filepath)
    

Let's check how many of the files we originally scraped are actually valid sittings of Parliament:

In [8]:
n_valid_files = len(os.listdir(base_path))
print(n_valid_files)

155


In [10]:
n_valid_files/n_scraped_files


0.4246575342465753

# Parsing and cleaning text

Now that we've downloaded a set of files from Hansard, it's time to wrangle them into a format that's easy to analyse. This will be our first real encounter with the HTML parsing library Beautiful Soup, which takes messy HTML or XML and transforms it into an easily searchable tree.

To get some practice in using Beautiful Soup, try the following brief exercises. Start by reading in one of our downloaded Hansard files.

In [11]:
with open('raw_hansard_files/'+os.listdir(base_path)[0], 'r') as f:
    file_soup = BeautifulSoup(f.read(), 'html.parser')
        
print(file_soup)


<!DOCTYPE html>

<!--[if IE 7]> <html class="no-js ie7" lang="en"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8" lang="en"> <![endif]-->
<!--[if IE 9]> <html class="no-js ie9" lang="en"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width" name="viewport"/>
<!-- Google Tag Manager -->
<script>
    (function (w, d, s, l, i) {
        w[l] = w[l] || []; w[l].push({
            'gtm.start':
            new Date().getTime(), event: 'gtm.js'
        }); var f = d.getElementsByTagName(s)[0],
            j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : ''; j.async = true; j.src =
                'https://www.googletagmanager.com/gtm.js?id=' + i + dl; f.parentNode.insertBefore(j, f);
        })(window, document, 'script', 'dataLayer', 'GTM-NFVRG43');</script>
<!-- End Google Tag Manager -->
<meta content="Hansard (the

BeautifulSoup lets us search this HTML for particular tags. To find all links, for example, we could run:


In [13]:
all_links = file_soup.find_all('a')
print(all_links)

[<a href="http://www.parliament.uk/site-information/privacy/">Find out more</a>, <a class="uk-parliament" href="https://www.parliament.uk" target="_blank">
            UK Parliament
        </a>, <a href="http://www.parliament.uk/business/" target="_blank">Parliamentary Business</a>, <a href="http://www.parliament.uk/mps-lords-and-offices/" target="_blank">MPs, Lords and Offices</a>, <a href="http://www.parliament.uk/about/" target="_blank">About Parliament</a>, <a href="http://www.parliament.uk/get-involved/" target="_blank">Get Involved</a>, <a href="http://www.parliament.uk/visiting/" target="_blank">Visit</a>, <a href="http://www.parliament.uk/education/" target="_blank">Education</a>, <a class="brand hidden-sm hidden-xs" href="/">
<div class="identity-text"><strong>Hansard</strong></div>
</a>, <a class="navbar-brand hidden-md hidden-lg" href="/">
<strong>Hansard</strong>
</a>, <a aria-expanded="false" aria-haspopup="true" class="dropdown-toggle" data-toggle="dropdown" href="#" rol

To find just the first link, we would run:

In [14]:
first_link = file_soup.find('a')
print(first_link)

<a href="http://www.parliament.uk/site-information/privacy/">Find out more</a>


We can extract attributes from this result, such as the href and the text.

In [15]:
print(first_link['href'])
print(first_link.get_text())

http://www.parliament.uk/site-information/privacy/
Find out more


We can also test whether the result has particular attributes.

In [17]:
first_link.has_attr('href')

True

In [18]:
first_link.has_attr('id')

False

We can also narrow down our search according to the attributes of the HTML we're looking for:

In [19]:
file_soup.find_all('a',{'class':'nohighlight'})

[<a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=1583" title="View member's contributions">Lyn Brown (West Ham) (Lab)</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=17" title="View member's contributions">Mr Speaker</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=199" title="View member's contributions">Ms Karen Buck (Westminster North) (Lab)</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=1583" title="View member's contributions">Lyn Brown (West Ham) (Lab)</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=199" title="View member's contributions">Ms Buck</a>,
 <a class="nohighlight" href="/search/MemberContributions?house=Commons&amp;memberId=4471" title="View member's contributions">Rachael Maskell (York Central) (Lab/Co-op)</a>,
 <a class="nohighlight" href="/search/MemberContribution

When combined with list comprehensions, this gives us a powerful way of making sense of our HTML and pulling out relevant sections.

*As an exercise, modify the list comprehension below to exclude 'Hon. Members' from the list of speakers.*

In [22]:
[member_link.get_text() for member_link in file_soup.find_all('h2',{'class':'memberLink'})]

['\nLyn Brown (West Ham) (Lab)\n',
 '\nMr Speaker\n',
 '\nMs Karen Buck (Westminster North) (Lab)\n',
 '\nLyn Brown (West Ham) (Lab)\n',
 '\nMs Buck\n',
 '\nRachael Maskell (York Central) (Lab/Co-op)\n',
 '\nMs Buck\n',
 '\nChuka Umunna (Streatham) (Lab)\n',
 '\nMs Buck\n',
 '\nJim McMahon (Oldham West and Royton) (Lab/Co-op)\n',
 '\nMs Buck\n',
 '\nAndy Slaughter (Hammersmith) (Lab)\n',
 '\nMs Buck\n',
 '\nWill Quince (Colchester) (Con)\n',
 '\nMs Buck\n',
 '\nEddie Hughes (Walsall North) (Con)\n',
 '\nKevin Foster (Torbay) (Con)\n',
 '\nEddie Hughes\n',
 '\nMr Speaker\n',
 '\nFaisal Rashid (Warrington South) (Lab)\n',
 "\nNeil O'Brien (Harborough) (Con)\n",
 '\nMatt Rodda (Reading East) (Lab)\n',
 '\nKevin Foster\n',
 '\nMatt Rodda\n',
 '\nKevin Foster (Torbay) (Con)\n',
 '\nEddie Hughes\n',
 '\nKevin Foster\n',
 '\nAndrew Bowie (West Aberdeenshire and Kincardine) (Con)\n',
 '\nKevin Foster\n',
 '\nMatt Rodda\n',
 '\nKevin Foster\n',
 '\nMr Speaker\n',
 '\nKevin Foster\n',
 '\nChris 
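
The speaker labels above pack a name, a constituency and a party into one string, and later mentions of the same MP often carry the name only. The sketch below shows one way such labels might be split with a regular expression; `LABEL_PATTERN` and `split_speaker_label` are hypothetical names, and the pattern is an assumption based only on the labels visible in this output.

```python
import re

# Name, optionally followed by "(constituency) (party)"
LABEL_PATTERN = re.compile(
    r'^(?P<name>[^()]+?)\s*'
    r'(?:\((?P<constituency>[^()]*)\)\s*\((?P<party>[^()]*)\))?$'
)

def split_speaker_label(label):
    """Split a label such as 'Lyn Brown (West Ham) (Lab)' into its parts.

    Labels that do not fit the pattern are returned whole as the name.
    """
    label = label.strip()
    match = LABEL_PATTERN.match(label)
    if match is None:
        return {'name': label, 'constituency': None, 'party': None}
    return match.groupdict()

print(split_speaker_label('\nLyn Brown (West Ham) (Lab)\n'))
print(split_speaker_label('\nMs Buck\n'))
```

Labels in other shapes, such as ministerial titles, may well exist in the archive, which is why unmatched labels are kept intact rather than discarded.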

We want to represent each spoken contribution as a single observation, or row, in one large pandas DataFrame. The row for each spoken contribution should also contain columns giving:

* Contribution date
* Speaker name
* Speaker party 
* Speaker constituency 
* The title of the debate the speaker was contributing to
* Speaker gender 

*Open some of the downloaded HTML files in your web browser and use the 'Inspect' function to figure out which HTML tags, IDs and classes correspond uniquely to:*

* *Debate titles*
* *Speaker names*
* *Speaker ID numbers*

*Explore the JSON available at http://data.parliament.uk/membersdataplatform/services/mnis/members/query/House=Commons%7CIsEligible=true/ to figure out how we can use Speaker ID numbers to get information on speaker gender, party and constituency*


In [17]:
os.mkdir('clean_hansard_files')

In [None]:
contribution_counter = 0

for raw_filepath in [path for path in sorted(os.listdir(base_path)) if path.endswith('.html')]:
    
    # Create an empty dataframe for this day's debates
    contributions_df = pd.DataFrame(columns=['date','debate_title','speaker_fullname','speaker_firstname','speaker_lastname','speaker_id','speaker_party','speaker_constituency','speaker_gender','contribution_text'])
    
    with open(os.path.join(base_path, raw_filepath), 'r') as f:
        file_soup = BeautifulSoup(f.read(), 'html.parser')
    
    # The filename is the sitting date followed by '.html'
    date = raw_filepath[:-5]
    # Find all content items that contain individual debates
    debates = [content_item for content_item in file_soup.find_all('div',{'class':'content-item'}) if content_item.find('h2',{'class':'child-debate-title'})]

    # Loop through each content item and extract individual contributions
    for debate in debates:
        
        debate_title = debate.find('h2',{'class':'child-debate-title'}).get_text()
        debate_contributions = [contribution for contribution in debate.find_all('div',{'class':'content-item'}) if (contribution.has_attr('id')) and ('contribution' in contribution['id'])]
        
        # Loop through each contribution and extract information about it
        for contribution in debate_contributions:
            
            # The speaker ID is the last query parameter of the member link
            speaker_info = contribution.find('h2',{'class':'memberLink'}).find('a')
            speaker_id = speaker_info['href'].split('=')[-1]
            
            # An ID of 0 does not correspond to an individual member
            if speaker_id != '0':
                
                try:
                    # Look up the speaker's record by ID to get more detailed information
                    contribution_counter += 1
                    speaker_url = 'http://lda.data.parliament.uk/members/' + speaker_id + '.json'
                    speaker_json = json.loads(requests.get(speaker_url).content)['result']['primaryTopic']

                    contributions_df.at[contribution_counter,'speaker_id'] = speaker_id
                    contributions_df.at[contribution_counter,'date'] = date
                    contributions_df.at[contribution_counter,'debate_title'] = debate_title
                    contributions_df.at[contribution_counter,'speaker_firstname'] = unidecode.unidecode(speaker_json['givenName']['_value'])
                    contributions_df.at[contribution_counter,'speaker_lastname'] = unidecode.unidecode(speaker_json['familyName']['_value'])
                    contributions_df.at[contribution_counter,'speaker_fullname'] = unidecode.unidecode(speaker_json['fullName']['_value'])
                    contributions_df.at[contribution_counter,'speaker_gender'] = speaker_json['gender']['_value']
                    contributions_df.at[contribution_counter,'speaker_party'] = speaker_json['party']['_value']
                    contributions_df.at[contribution_counter,'speaker_constituency'] = speaker_json['constituency']['label']['_value']

                    # Get the text of the contribution and strip out whitespace characters
                    contributions_df.at[contribution_counter,'contribution_text'] = unidecode.unidecode(contribution.find('div',{'class':['contribution','content-container']}).get_text().strip())
                
                except Exception:
                    # Give up on this contribution if the lookup or parsing fails;
                    # any fields not yet filled stay empty
                    pass
    
    contributions_df.to_csv(os.path.join('clean_hansard_files', date+'.csv'),encoding='utf-8',index=False)
    print('Parsed: '+raw_filepath)


Parsed: 2018-01-08.html
Parsed: 2018-01-09.html
Parsed: 2018-01-10.html
Parsed: 2018-01-11.html
Parsed: 2018-01-15.html
Parsed: 2018-01-16.html
Parsed: 2018-01-17.html
Parsed: 2018-01-18.html
Parsed: 2018-01-19.html
Parsed: 2018-01-22.html
Parsed: 2018-01-23.html
Parsed: 2018-01-24.html
Parsed: 2018-01-25.html
Parsed: 2018-01-29.html
Parsed: 2018-01-30.html
Parsed: 2018-01-31.html
Parsed: 2018-02-01.html
Parsed: 2018-02-02.html
Parsed: 2018-02-05.html
Parsed: 2018-02-06.html
Parsed: 2018-02-07.html
Parsed: 2018-02-08.html
Parsed: 2018-02-20.html
Parsed: 2018-02-21.html
Parsed: 2018-02-22.html
Parsed: 2018-02-23.html
Parsed: 2018-02-26.html
Parsed: 2018-02-27.html
Parsed: 2018-02-28.html
Parsed: 2018-03-01.html
Parsed: 2018-03-05.html
Parsed: 2018-03-06.html
Parsed: 2018-03-07.html
Parsed: 2018-03-08.html
Parsed: 2018-03-12.html
Parsed: 2018-03-13.html
Parsed: 2018-03-14.html
Parsed: 2018-03-15.html
Parsed: 2018-03-16.html
Parsed: 2018-03-19.html
Parsed: 2018-03-20.html
Parsed: 2018-03-
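
The parsing loop above requests a speaker's record once per contribution, so frequent speakers are looked up many times. One way to reduce the load on the data service is to cache each record after the first request. The sketch below is an illustration under assumptions: `make_speaker_lookup` and `fetch_json` are hypothetical names, and the field paths simply mirror those used in the loop above.

```python
from functools import lru_cache

def make_speaker_lookup(fetch_json):
    """Build a lookup function that requests each speaker's record at most once.

    `fetch_json` is any callable that takes a URL and returns the decoded JSON.
    """
    @lru_cache(maxsize=None)
    def lookup(speaker_id):
        url = 'http://lda.data.parliament.uk/members/' + speaker_id + '.json'
        topic = fetch_json(url)['result']['primaryTopic']
        # The returned dict is shared between calls, so treat it as read-only
        return {
            'speaker_fullname': topic['fullName']['_value'],
            'speaker_gender': topic['gender']['_value'],
            'speaker_party': topic['party']['_value'],
            'speaker_constituency': topic['constituency']['label']['_value'],
        }
    return lookup
```

With `requests` this might be wired up as `lookup = make_speaker_lookup(lambda url: requests.get(url, timeout=30).json())`. Failed lookups raise as before and are not cached, so they are retried the next time that speaker appears.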

# Analysing language 

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

import string

from collections import Counter
import collections

import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer

Now let's explore our dataset. Let's start by reading in all our clean CSV files and aggregating into one dataframe.

In [None]:
clean_filepaths = [os.path.join('clean_hansard_files', path) for path in sorted(os.listdir('clean_hansard_files')) if path.endswith('.csv')]

# DataFrame.append was removed in pandas 2.0, so read every daily CSV and concatenate in one go
all_contributions_df = pd.concat([pd.read_csv(clean_filepath) for clean_filepath in clean_filepaths], ignore_index=True)


We've got over 93,000 contributions to analyse:

In [16]:
all_contributions_df.shape[0]

93373

## What proportion of contributions are made by women?

Things to consider: 

* Is it fair to include female cabinet ministers (including the Prime Minister) in this analysis? 
* How should a 'contribution' be defined? How can we account for contribution length?
* Are women interrupted more than men? How are interruptions marked in Hansard?
* Is the proportion of women's speaking time equal to the proportion of women MPs?
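
As a starting point for the questions above, the sketch below compares each gender's share of contributions with its share of words spoken. The toy rows are invented, the helper name `gender_shares` is hypothetical, and the gender labels 'Female' and 'Male' are an assumption about what the `speaker_gender` column contains.

```python
import pandas as pd

# Toy data standing in for all_contributions_df; the rows are invented for illustration
toy_df = pd.DataFrame({
    'speaker_gender': ['Female', 'Male', 'Male', 'Female', 'Male'],
    'contribution_text': [
        'I beg to move that the Bill be now read a second time.',
        'Will the hon. Lady give way?',
        'Order.',
        'I thank my hon. Friend for that question.',
        'I am grateful to the Minister for her reply and for the time she has given.',
    ],
})

def gender_shares(df):
    """Return each gender's share of contributions and of words spoken."""
    # Approximate contribution length by the number of whitespace-separated words
    word_counts = df['contribution_text'].fillna('').str.split().str.len()
    by_gender = pd.DataFrame({
        'contributions': df.groupby('speaker_gender').size(),
        'words': word_counts.groupby(df['speaker_gender']).sum(),
    })
    # Divide each column by its total to turn counts into proportions
    return by_gender / by_gender.sum()

print(gender_shares(toy_df))
```

Counting words is only one proxy for speaking time; the same function could be applied to `all_contributions_df` after filtering out ministers, depending on how the first question above is answered.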


## Have MPs become more disorderly during 2018? 

Things to consider:

* How should we define disorderliness? By interruptions, by the number of times John Bercow says 'order' or by his proportional speaking time?
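
One of the definitions suggested above, counting how often the chair says 'order', might be sketched as follows. The toy rows and speaker names are invented, `order_calls_per_day` is a hypothetical helper, and how the Speaker's name actually appears in `speaker_fullname` would need to be checked against the real data.

```python
import re
import pandas as pd

# Toy data standing in for all_contributions_df; the rows are invented for illustration
toy_orders_df = pd.DataFrame({
    'date': ['2018-01-08', '2018-01-08', '2018-01-09', '2018-01-09'],
    'speaker_fullname': ['Speaker A', 'Member B', 'Speaker A', 'Speaker A'],
    'contribution_text': [
        'Order. Order! The House must calm down.',
        'On a point of order, Mr Speaker.',
        'Order.',
        'I call the Minister.',
    ],
})

def order_calls_per_day(df, chair_name):
    """Count occurrences of the word 'order' in the chair's contributions, per date."""
    chair = df[df['speaker_fullname'] == chair_name]
    calls = chair['contribution_text'].fillna('').str.count(r'\border\b', flags=re.IGNORECASE)
    return calls.groupby(chair['date']).sum()

print(order_calls_per_day(toy_orders_df, 'Speaker A'))
```

This crude count also picks up phrases such as 'point of order' when the chair says them, so a stricter pattern may be needed for a serious analysis.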
