LSE Data Science Institute | DISCORDIA project

# Scraping House of Commons debates


**LAST UPDATED:** 5 December 2023

**AUTHORS:** [@jonjoncardoso](https://jonjoncardoso.github.io) & [@tz1211](https://github.com/tz1211) & [@Sevnhutsjr](https://github.com/Sevnhutsjr)

**TODO list:**

- [x] Figure out a way to scrape links representing debate sections and debate blocks from a given date
- [x] Automate the process above to collect all the blocks and sections of a given date
- [x] Navigate to a given debate block and scrape its content
- [x] Automate the process above to scrape the content of every debate block of a given date
- [ ] Figure out how to extract the motions and amendments from the speeches
- [ ] Move all of this to a Python script that can be run from the command line
- [ ] Write a condition so that, **before scraping a debate**, the script checks whether it already exists in the database
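
The last item on the list could be approached along the lines below. This is only a sketch: `filter_new_debate_ids` is a hypothetical helper that is not part of `discordia`, and it uses the standard-library `sqlite3` module with an in-memory stand-in for the `debates` table instead of the project's SQLAlchemy engine.

```python
import sqlite3


def filter_new_debate_ids(conn, candidate_ids):
    """Return the candidate IDs that are not yet stored in `debates`.

    Hypothetical helper: assumes a `debates` table keyed by `debate_id`.
    """
    stored = {row[0] for row in conn.execute("SELECT debate_id FROM debates")}
    return [debate_id for debate_id in candidate_ids if debate_id not in stored]


# In-memory stand-in for discordia.db, holding one already-scraped debate
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debates (debate_id VARCHAR(20) PRIMARY KEY, url VARCHAR(200))")
conn.execute(
    "INSERT INTO debates VALUES (?, ?)",
    ("2023-11-30a.1108.0", "https://www.theyworkforyou.com/debates/?id=2023-11-30a.1108.0"),
)

# Only the ID that is not stored yet should remain
to_scrape = filter_new_debate_ids(conn, ["2023-11-30a.1108.0", "2023-11-30a.1103.0"])
print(to_scrape)
```

Loading every stored ID into a set is fine at the scale of a few thousand debates; for a much larger table, a per-ID `SELECT 1 ... WHERE debate_id = ?` lookup might be preferable.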


# ⚙️ Setup

See the main `README.md` for instructions on setting up your conda environment.

**Imports**



In [1]:
import os
import re
import sys
import sqlalchemy

import numpy as np
import pandas as pd

from tqdm.notebook import tqdm
tqdm.pandas()

from datetime import datetime
from selenium import webdriver

from sqlalchemy import create_engine

In [2]:
%load_ext sql
%config SqlMagic.autocommit=True # for engines that do not support autocommit

**Load the `discordia` package**

Run the cell below to load the current alpha-version of our package `discordia`.

⚠️ I am also using the Jupyter magic `autoreload` here, so that any changes we make to the `code/src/python/discordia/**.py` files are reflected in the notebook automatically and interactively. Read more about this [in this Stack Overflow answer](https://stackoverflow.com/a/73623267/843365).

In [3]:
%load_ext autoreload
%autoreload 1

sys.path.insert(0,'../src/python/')

# https://stackoverflow.com/questions/70898150/jupyter-autoreload-workflow/73623267#73623267

%aimport discordia
%aimport discordia.webscraping.twfy
%aimport discordia.webscraping.utils

#from discordia.webscraping.utils import print_HTML, show_HTML
from discordia.webscraping.twfy import build_url, scrape_debate_sections, get_speeches_divisions_and_votes

# 1. Connect to SQLite database

In [4]:
# Create a database engine using SQLAlchemy
engine = create_engine('sqlite:///../../data/discordia.db', echo=False, isolation_level="AUTOCOMMIT")

# Why? Read: https://stackoverflow.com/a/71685414/843365
with engine.connect() as conn:
    pass


In [5]:
%sql sqlite:///../../data/discordia.db --alias discordia

# 2. Create tables

Run the cells in this section to create the tables we will use to store the data we scrape.

Create `debates` table with controlled data types:

In [6]:
%%sql discordia

DROP TABLE IF EXISTS debates;

CREATE TABLE debates (
    "debate_id" VARCHAR(20) NOT NULL,
    "debate_excerpt" TEXT,
    "url" VARCHAR(200) NOT NULL,
    "title" VARCHAR(100),
    "section" VARCHAR(100),
    "section_excerpt" TEXT,
    PRIMARY KEY ("debate_id")
)


Create `mp` table:

In [8]:
%%sql discordia

DROP TABLE IF EXISTS mp;

CREATE TABLE "mp" (
    "mp_id" INTEGER(5) NOT NULL,
    "first_name" VARCHAR(30),
    "last_name" VARCHAR(30),
    "party" VARCHAR(100),
    "constituency" VARCHAR(100),
    "url" VARCHAR(200),
    "term_start" INTEGER(4) ,
    "term_end" INTEGER(4),
    PRIMARY KEY ("mp_id", "term_start")
)

Create `speeches` table:

In [9]:
%%sql discordia

DROP TABLE IF EXISTS "speeches";

CREATE TABLE "speeches" (
    "debate_id" VARCHAR(20) NOT NULL,
    "speech_id" VARCHAR(10) NOT NULL,
    "speaker_id" CHAR(5) NOT NULL,
    "speaker_position" TEXT,
    "speech_html" TEXT,
    "speech_raw_text" TEXT,
    PRIMARY KEY ("debate_id", "speech_id"),
    FOREIGN KEY ("debate_id") REFERENCES "debates" ("debate_id"),
    FOREIGN KEY ("speaker_id") REFERENCES "mp" ("mp_id")
)

Create `house_divisions` table:

In [25]:
%%sql discordia

DROP TABLE IF EXISTS "house_divisions";

CREATE TABLE "house_divisions" (
    "debate_id" VARCHAR(20) NOT NULL,
    "house_division_id" VARCHAR(10) NOT NULL,
    "vote_title" TEXT,
    PRIMARY KEY ("debate_id", "house_division_id"),
    FOREIGN KEY ("debate_id") REFERENCES "debates" ("debate_id")
)

Create `votes` table:

In [11]:
%%sql discordia

DROP TABLE IF EXISTS "votes";

CREATE TABLE "votes" (
    "house_division_id" VARCHAR(10) NOT NULL,
    "mp_id" CHAR(5) NOT NULL,
    "comment" VARCHAR(200),
    "is_teller" BOOLEAN NOT NULL,
    "is_vote_aye" BOOLEAN NOT NULL,
    PRIMARY KEY ("house_division_id", "mp_id"),
    FOREIGN KEY ("house_division_id") REFERENCES "house_divisions" ("house_division_id"),
    FOREIGN KEY ("mp_id") REFERENCES "mp" ("mp_id")
)

## 2.1 Check

In [12]:
%sqlcmd tables

Name
debates
house_divisions
mp
speeches
votes


In [13]:
%sqlcmd columns -t house_divisions

name,type,nullable,default,primary_key
debate_id,VARCHAR(20),False,,1
house_division_id,VARCHAR(10),False,,2
vote_title,TEXT,True,,0
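
One thing to keep in mind about the `FOREIGN KEY` clauses declared above: SQLite only enforces them on connections where `PRAGMA foreign_keys = ON` has been issued. The sketch below illustrates this with the standard-library `sqlite3` module and cut-down versions of `debates` and `house_divisions`; the helper name `orphan_insert_succeeds` and the row values are made up for the demonstration.

```python
import sqlite3


def orphan_insert_succeeds(enforce_foreign_keys):
    """Try to insert a house division whose debate does not exist.

    Returns True if SQLite accepts the row, False if it rejects it.
    """
    conn = sqlite3.connect(":memory:")
    if enforce_foreign_keys:
        # Must be set per connection, before any transaction is open
        conn.execute("PRAGMA foreign_keys = ON")
    conn.execute('CREATE TABLE debates ("debate_id" VARCHAR(20) PRIMARY KEY)')
    conn.execute(
        'CREATE TABLE house_divisions ('
        '"debate_id" VARCHAR(20) NOT NULL, '
        '"house_division_id" VARCHAR(10) NOT NULL, '
        'PRIMARY KEY ("debate_id", "house_division_id"), '
        'FOREIGN KEY ("debate_id") REFERENCES debates ("debate_id"))'
    )
    try:
        # There is no matching row in `debates`, so this is an orphan
        conn.execute("INSERT INTO house_divisions VALUES ('2023-01-01a.1.0', 'g1.0')")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


accepted_without_pragma = orphan_insert_succeeds(enforce_foreign_keys=False)
accepted_with_pragma = orphan_insert_succeeds(enforce_foreign_keys=True)
```

If we ever turn enforcement on for `discordia.db`, it would be worth double-checking that each foreign key points at a column that is unique in its parent table.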


# 2. Add Data to SQLite database

## 2.1. MP data 

In [14]:
folder_path = "../../mp_data/"
file_list = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]

df_mp_list = []
for mp_file in file_list:
    df = pd.read_csv(os.path.join(folder_path, mp_file))
    # The term's start and end years sit at fixed positions in the file name
    df["term_start"] = mp_file[4:8]
    df["term_end"] = mp_file[9:13]
    df_mp_list.append(df)
df_mp = pd.concat(df_mp_list)

# Rename the CSV headers to match the columns of the `mp` table
df_mp = df_mp.rename({"Person ID": "mp_id", "First name": "first_name", "Last name": "last_name", "Party": "party", "Constituency": "constituency", "URI": "url"}, axis=1)
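
The fixed slices above depend on the exact length of the file-name prefix. A slightly more forgiving alternative might be a regular expression, as in the sketch below. `parse_term_years` is a hypothetical helper and `mps_2017_2023.csv` is a made-up file name; the real naming scheme in `mp_data/` may differ.

```python
import re


def parse_term_years(file_name):
    """Return (term_start, term_end) as integers from a file name.

    Assumes the name contains two four-digit years separated by a single
    non-digit character.
    """
    match = re.search(r"(\d{4})\D(\d{4})", file_name)
    if match is None:
        raise ValueError(f"No term years found in {file_name!r}")
    return int(match.group(1)), int(match.group(2))


# Made-up example name, used only to exercise the helper
term_start, term_end = parse_term_years("mps_2017_2023.csv")
```

Returning integers also matches the `INTEGER` type declared for `term_start` and `term_end` in the `mp` table.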

In [15]:
df_mp.to_sql('mp', engine, if_exists='append', index=False)

3244

Check that it worked:

In [6]:
%%sql discordia

SELECT 
    *
FROM mp
ORDER BY term_start DESC
LIMIT 5

mp_id,first_name,last_name,party,constituency,url,term_start,term_end
10001,Diane,Abbott,Independent,Hackney North and Stoke Newington,https://www.theyworkforyou.com/mp/10001/diane_abbott/hackney_north_and_stoke_newington,2017,2023
25034,Debbie,Abrahams,Labour,Oldham East and Saddleworth,https://www.theyworkforyou.com/mp/25034/debbie_abrahams/oldham_east_and_saddleworth,2017,2023
25661,Bim,Afolami,Conservative,Hitchin and Harpenden,https://www.theyworkforyou.com/mp/25661/bim_afolami/hitchin_and_harpenden,2017,2023
11929,Adam,Afriyie,Conservative,Windsor,https://www.theyworkforyou.com/mp/11929/adam_afriyie/windsor,2017,2023
25817,Nickie,Aiken,Conservative,Cities of London and Westminster,https://www.theyworkforyou.com/mp/25817/nickie_aiken/cities_of_london_and_westminster,2017,2023


## 2.2. Debates

For now, let's just scrape the debates that happened in 2023, from January until November.


In [16]:
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 11, 30)

all_urls = [build_url(date_obj) for date_obj in pd.date_range(start_date, end_date)]
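
To make the cell above more concrete, here is a standard-library sketch of the same idea: one URL per calendar day, both endpoints included. `build_url` lives in `discordia.webscraping.twfy` and its implementation is not shown in this notebook, so `build_url_sketch` and the `?d=YYYY-MM-DD` pattern are assumptions on my part.

```python
from datetime import date, timedelta

# Assumed URL pattern for a day's debates; the real build_url may differ
BASE_URL = "https://www.theyworkforyou.com/debates/?d="


def build_url_sketch(day):
    """Hypothetical stand-in for discordia's build_url."""
    return f"{BASE_URL}{day.isoformat()}"


def daily_range(start, end):
    """Yield every calendar day from start to end, both inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


urls = [build_url_sketch(day) for day in daily_range(date(2023, 1, 1), date(2023, 11, 30))]
print(len(urls))  # 334
```

The range covers every calendar day, so it also includes days on which the House did not sit.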

**We need Selenium to scrape the debates**

In [17]:
driver = webdriver.Firefox()

# Always close the browser, even if scraping fails partway through
try:
    df_debates = pd.concat([scrape_debate_sections(driver, url) for url in tqdm(all_urls)])
finally:
    driver.quit()

  0%|          | 0/334 [00:00<?, ?it/s]

In [18]:
df_debates.to_sql('debates', engine, if_exists='append', index=False)

3231

Check that it worked:

In [19]:
%%sql discordia

SELECT 
    *
FROM debates
ORDER BY debate_id DESC
LIMIT 5

debate_id,debate_excerpt,url,title,section,section_excerpt
2023-11-30a.1108.0,"Motion made, and Question proposed, That this House do now adjourn.—(Mark Jenkinson.)",https://www.theyworkforyou.com/debates/?id=2023-11-30a.1108.0,HM Prison Bedford,,
2023-11-30a.1103.0,"Proceedings resumed (Order, this day). Considered in Committee [Dame Rosie Winterton in the Chair]",https://www.theyworkforyou.com/debates/?id=2023-11-30a.1103.0,National Insurance Contributions (Reduction in Rates) Bill,,
2023-11-30a.1102.1,"King’s recommendation signified. Motion made, and Question put forthwith ( Standing Order No. 52(1)(a)), That, for the purposes of any Act resulting from the National Insurance...",https://www.theyworkforyou.com/debates/?id=2023-11-30a.1102.1,National Insurance Contributions (Reduction in Rates) Bill: Money,,
2023-11-30a.1083.0,"Second Reading [Relevant documents: Oral evidence taken before the Treasury Committee on the morning of 28 November 2023, on the Autumn Statement 2023, HC 286; Oral evidence taken before the...",https://www.theyworkforyou.com/debates/?id=2023-11-30a.1083.0,National Insurance Contributions (Reduction in Rates) Bill,,
2023-11-30a.1082.2,(13) Standing Order No. 15(1) (Exempted business) shall apply to proceedings on the Bill. (14) Standing Order No. 82 (Business Committee) shall not apply in relation to any proceedings to which...,https://www.theyworkforyou.com/debates/?id=2023-11-30a.1082.2,Miscellaneous,,


In [20]:
%%sql discordia

SELECT 
    COUNT(*) AS num_debates
FROM debates

num_debates
3231


## 2.3. Speeches, house divisions and votes


In [21]:
# I left this cell here in case we need to re-run the scraping of speeches, etc.
# This is so I don't need to re-run Selenium every time I want to test the SQL queries
debates_sql = """
SELECT 
    *
FROM debates
-- Uncomment the WHERE clause below to filter by date for testing
-- WHERE 
--     debate_id LIKE '2023-11-14%' OR 
--     debate_id LIKE '2023-11-15%'
"""

df_debates = pd.read_sql(debates_sql, con=engine)
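
The commented-out `WHERE` clause works because every `debate_id` starts with the sitting date. If we ever need that date as a proper value, something like the sketch below might do. `parse_debate_id` is a hypothetical helper, and the pattern is inferred only from the IDs visible in this notebook, so I leave the meaning of the letter and the two trailing numbers open.

```python
import re
from datetime import date

# Inferred from IDs such as '2023-11-14b.485.0'; not an official specification
DEBATE_ID_PATTERN = re.compile(r"^(\d{4})-(\d{2})-(\d{2})([a-z])\.(\d+)\.(\d+)$")


def parse_debate_id(debate_id):
    """Split a debate_id into its date, letter suffix and two numeric parts."""
    match = DEBATE_ID_PATTERN.match(debate_id)
    if match is None:
        raise ValueError(f"Unexpected debate_id format: {debate_id!r}")
    year, month, day, suffix, first_number, second_number = match.groups()
    return {
        "date": date(int(year), int(month), int(day)),
        "suffix": suffix,
        "first_number": int(first_number),
        "second_number": int(second_number),
    }


parsed = parse_debate_id("2023-11-14b.498.11")
```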

In [22]:
df_debates.shape

(3231, 6)

In [23]:
df_speeches, df_house_divisions, df_votes = get_speeches_divisions_and_votes(df_debates['url'], tqdm=tqdm)

  0%|          | 0/3231 [00:00<?, ?it/s]

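### 2.3.1 Check that `speeches` was correctly populated

The outputs in sections 2.3.1 to 2.3.3 appear to come from a test run restricted to 14 and 15 November 2023 (see the commented-out `WHERE` clause above), which is why they show fewer rows than the totals saved in 2.3.4.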

In [22]:
df_speeches.head()

Unnamed: 0,debate_id,speech_id,speaker_id,speaker_position,speech_html,speech_raw_text
0,2023-11-14b.485.0,g485.1,10295,"Speaker of the House of Commons, Chair, Speake...","<p pid=""b485.1/1"">\n I wish to inform the Hous...",I wish to inform the House that I have receive...
1,2023-11-14b.485.5,g485.6,26086,Liberal Democrat Spokesperson (Defence),"<p pid=""b485.6/1"" qnum=""900035"">\n What assess...",What assessment he has made of the potential i...
2,2023-11-14b.485.5,g485.7,25851,The Chief Secretary to the Treasury,"<p pid=""b485.7/1"">\n As a result of the Govern...","As a result of the Government’s triple lock, t..."
3,2023-11-14b.485.5,g486.0,26086,Liberal Democrat Spokesperson (Defence),"<p pid=""b486.0/1"">\n Now that\n <a href=""https...",Now that Lord Cameron has returned to the Cabi...
4,2023-11-14b.485.5,g486.1,25851,The Chief Secretary to the Treasury,"<p pid=""b486.1/1"">\n Nice try. The triple lock...",Nice try. The triple lock was a Conservative i...


In [52]:
df_speeches.shape

(821, 6)

In [32]:
df_speeches['debate_id'].unique()

array(['2023-11-14b.485.0', '2023-11-14b.485.5', '2023-11-14b.486.4',
       '2023-11-14b.488.3', '2023-11-14b.489.2', '2023-11-14b.490.7',
       '2023-11-14b.491.2', '2023-11-14b.492.1', '2023-11-14b.492.6',
       '2023-11-14b.493.7', '2023-11-14b.495.2', '2023-11-14b.496.0',
       '2023-11-14b.496.5', '2023-11-14b.497.4', '2023-11-14b.498.1',
       '2023-11-14b.498.6', '2023-11-14b.498.11', '2023-11-14b.499.3',
       '2023-11-14b.500.0', '2023-11-14b.500.5', '2023-11-14b.507.0',
       '2023-11-14b.534.3', '2023-11-14b.619.6', '2023-11-14b.620.0',
       '2023-11-14b.621.0', '2023-11-15b.629.0', '2023-11-15b.629.5',
       '2023-11-15b.630.3', '2023-11-15b.631.1', '2023-11-15b.632.5',
       '2023-11-15b.633.6', '2023-11-15b.634.4', '2023-11-15b.635.5',
       '2023-11-15b.638.4', '2023-11-15b.642.0', '2023-11-15b.643.1',
       '2023-11-15b.649.0', '2023-11-15b.672.0', '2023-11-15b.673.0',
       '2023-11-15b.674.3', '2023-11-15b.764.7'], dtype=object)

**Q:** Which debates in `df_debates` did not contain any speeches?

In [36]:
# TODO: Check those URLs, I feel like those are motion texts that we should scrape and put into the database somehow
df_debates[~df_debates['debate_id'].isin(df_speeches['debate_id'].unique())]

Unnamed: 0,debate_id,debate_excerpt,url,title,section,section_excerpt
21,2023-11-14b.532.4,,https://www.theyworkforyou.com/debates/?id=202...,Bills Presented,,
22,2023-11-14b.534.0,,https://www.theyworkforyou.com/debates/?id=202...,Debate on the Address,,
24,2023-11-14b.618.0,,https://www.theyworkforyou.com/debates/?id=202...,Business Without Debate,,
25,2023-11-14b.618.1,"Motion made, and Question put forthwith ( Stan...",https://www.theyworkforyou.com/debates/?id=202...,Deferred Divisions,,
26,2023-11-14b.618.3,"Motion made, and Question put forthwith ( Stan...",https://www.theyworkforyou.com/debates/?id=202...,Delegated Legislation,,
44,2023-11-15b.674.0,,https://www.theyworkforyou.com/debates/?id=202...,Debate on the Address,,
46,2023-11-15b.764.0,,https://www.theyworkforyou.com/debates/?id=202...,Business without Debate,,
47,2023-11-15b.764.1,"Motion made, and Question put forthwith ( Stan...",https://www.theyworkforyou.com/debates/?id=202...,Delegated Legislation,,
48,2023-11-15b.764.5,"Ordered, That at the sitting on Thursday 16 No...",https://www.theyworkforyou.com/debates/?id=202...,Business of the House (16 November),,
50,2023-11-15b.766.0,"Resolved, That this House do now adjourn.—(Joy...",https://www.theyworkforyou.com/debates/?id=202...,Adjournment,,
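
Once the tables are saved, the same question can be asked in SQL with a `LEFT JOIN` that keeps only the debates without a matching speech. The sketch below runs against an in-memory database with cut-down tables and three toy rows (the IDs are borrowed from the outputs above, the first title is made up); it is meant as an illustration, not as the query we actually ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE debates (debate_id VARCHAR(20) PRIMARY KEY, title VARCHAR(100));
    CREATE TABLE speeches (debate_id VARCHAR(20), speech_id VARCHAR(10));
    INSERT INTO debates VALUES ('2023-11-14b.485.0', 'toy debate with a speech');
    INSERT INTO debates VALUES ('2023-11-14b.532.4', 'Bills Presented');
    INSERT INTO speeches VALUES ('2023-11-14b.485.0', 'g485.1');
    """
)

# Anti-join: rows of `debates` for which the join found no speech
query = """
SELECT d.debate_id, d.title
FROM debates AS d
LEFT JOIN speeches AS s ON s.debate_id = d.debate_id
WHERE s.debate_id IS NULL
"""
debates_without_speeches = conn.execute(query).fetchall()
print(debates_without_speeches)
```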


### 2.3.2 Check that `house_divisions` was correctly populated

In [38]:
# This matches what I see on the website, so I think we're good
df_house_divisions

Unnamed: 0,debate_id,house_division_id,vote_title
0,2023-11-14b.534.3,g614.1,Economic Growth
1,2023-11-15b.674.3,g754.0,"Violence Reduction, Policing and Criminal Justice"
2,2023-11-15b.674.3,g757.0,"Violence Reduction, Policing and Criminal Justice"
3,2023-11-15b.674.3,g761.0,"Violence Reduction, Policing and Criminal Justice"


### 2.3.3 Check that `votes` was correctly populated 

In [39]:
df_votes.shape

(1772, 5)

In [40]:
df_votes.head()

Unnamed: 0,house_division_id,mp_id,comment,is_teller,is_vote_aye
0,g614.1,25034,,False,True
1,g614.1,24958,,False,True
2,g614.1,25888,,False,True
3,g614.1,25702,,False,True
4,g614.1,25813,,False,True


In [41]:
df_votes['house_division_id'].unique()

array(['g614.1', 'g754.0', 'g757.0', 'g761.0'], dtype=object)

In [47]:
# A little summary of the votes

(df_votes
    .assign(vote=lambda x: ['aye' if is_vote_aye else 'no' for is_vote_aye in x['is_vote_aye']])
    .groupby(['house_division_id', 'vote']).size()
    .reset_index(name='counts')
    .pivot(index='house_division_id', columns='vote', values='counts')
)

house_division_id,aye,no
g614.1,229,312
g754.0,185,292
g757.0,127,296
g761.0,26,305


Did we capture tellers correctly?

In [50]:
# The same summary, split into tellers and ordinary voters

(df_votes
    .assign(vote=lambda x: ['aye' if is_vote_aye else 'no' for is_vote_aye in x['is_vote_aye']])
    .assign(teller=lambda x: ['teller' if is_teller else 'voter' for is_teller in x['is_teller']])
    .groupby(['house_division_id', 'vote', 'teller']).size()
    .reset_index(name='counts')
    .pivot(index=['house_division_id'], columns=['vote', 'teller'], values='counts')
)

house_division_id,aye (teller),aye (voter),no (teller),no (voter)
g614.1,2,227,2,310
g754.0,2,183,2,290
g757.0,2,125,2,294
g761.0,2,24,2,303
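
For reference, a similar summary could be computed directly in SQL after the `votes` table is saved. The sketch below uses an in-memory database and four invented rows, and assumes the boolean columns end up stored as 0/1 integers in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes (house_division_id VARCHAR(10), mp_id CHAR(5), "
    "is_teller BOOLEAN NOT NULL, is_vote_aye BOOLEAN NOT NULL)"
)

# Invented rows: (division, MP, is_teller, is_vote_aye)
toy_votes = [
    ("g1.0", "1", 0, 1),
    ("g1.0", "2", 1, 1),
    ("g1.0", "3", 0, 0),
    ("g2.0", "1", 0, 0),
]
conn.executemany("INSERT INTO votes VALUES (?, ?, ?, ?)", toy_votes)

# Summing 0/1 flags counts the rows where the flag is set
summary_sql = """
SELECT
    house_division_id,
    SUM(is_vote_aye) AS ayes,
    SUM(1 - is_vote_aye) AS noes,
    SUM(is_teller) AS tellers
FROM votes
GROUP BY house_division_id
ORDER BY house_division_id
"""
summary = conn.execute(summary_sql).fetchall()
print(summary)
```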


### 2.3.4 Save them all

In [26]:
df_house_divisions.to_sql('house_divisions', engine, if_exists='append', index=False)

59

In [27]:
df_votes.to_sql('votes', engine, if_exists='append', index=False)

24300

In [28]:
df_speeches.to_sql('speeches', engine, if_exists='append', index=False)

44380

### 2.3.5 Check that the data was correctly saved

In [29]:
%%sql discordia

SELECT
    'debates' AS table_name,
    COUNT(*) AS num_rows
FROM debates
UNION
SELECT
    'speeches' AS table_name,
    COUNT(*) AS num_rows
FROM speeches
UNION
SELECT
    'house_divisions' AS table_name,
    COUNT(*) AS num_rows
FROM house_divisions
UNION
SELECT
    'votes' AS table_name,
    COUNT(*) AS num_rows
FROM votes

table_name,num_rows
debates,3231
house_divisions,59
speeches,44380
votes,24300


# 3. How to get the motions?

Starting point: Gavin Abercrombie, the researcher behind the HanDeSeT project, compiled a list of strings that mark the start and end of a motion.

🖇️ LINK: [processTWFYfiles.py](https://github.com/GavinAbercrombie/processHansards/blob/master/processTWFYfiles.py)
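
To give a rough idea of how such marker strings could be used, here is a minimal sketch. The markers below are illustrative guesses and are **not** copied from `processTWFYfiles.py`; `extract_motion` is a hypothetical helper and the example sentence is made up.

```python
# Illustrative markers only; they are NOT taken from processTWFYfiles.py
START_MARKERS = ["I beg to move,", "Motion made, and Question proposed,"]
END_MARKERS = ["Question put", "Question agreed to"]


def extract_motion(text, start_markers=START_MARKERS, end_markers=END_MARKERS):
    """Return the text between the earliest start marker and the next end marker.

    Returns None when no start marker is present. If no end marker follows,
    everything after the start marker is returned.
    """
    starts = [(text.find(marker), marker) for marker in start_markers if marker in text]
    if not starts:
        return None
    start_pos, marker = min(starts)
    begin = start_pos + len(marker)
    # Positions of end markers that occur after the start marker
    ends = [pos for pos in (text.find(m, begin) for m in end_markers) if pos != -1]
    end = min(ends) if ends else len(text)
    return text[begin:end].strip()


# Made-up sentence used only to exercise the helper
example = "I beg to move, That this House do now adjourn. Question put and agreed to."
motion = extract_motion(example)
print(motion)
```

Plain substring matching like this will probably miss variations in wording, so the real marker list would need testing against actual speeches.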


In [1]:
#TODO: This is a WIP