LSE Data Science Institute | DS105A (2023/24) | Week 10

# 🗓️ Week 10: Databases + Data reshaping + Basics of Text Mining

Theme: Cleaning and reshaping data

**LAST UPDATED:** 29 November 2023

**AUTHOR:** Dr [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

-----


⚙️ **Setup**

In [15]:
import re
# My custom Python package
import discordia

import pandas as pd
import tqdm.notebook as tqdm

from datetime import datetime
from selenium import webdriver
from sqlalchemy import create_engine

from discordia.webscraping.twfy import build_url, scrape_debate_sections, get_speeches

In [11]:
%load_ext sql
%config SqlMagic.autocommit=False # for engines that do not support autocommit

# 1. Collect Data

Example of calling a simple function from my package:

In [2]:
date_obj = datetime(2023, 11, 15)

build_url(date_obj)

'https://www.theyworkforyou.com/debates/?d=2023-11-15'
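The output above suggests that `build_url` formats the date as `YYYY-MM-DD` and appends it to the debates URL as the `d` query parameter. Here is a minimal sketch of that idea. It assumes nothing about the real implementation inside `discordia`, and the name `build_url_sketch` is hypothetical:

```python
from datetime import datetime

BASE_URL = "https://www.theyworkforyou.com/debates/"


def build_url_sketch(date_obj: datetime) -> str:
    """Hypothetical stand-in for discordia's build_url (the real code is not shown here)."""
    # The format spec inside the f-string calls strftime on the datetime
    return f"{BASE_URL}?d={date_obj:%Y-%m-%d}"


example_url = build_url_sketch(datetime(2023, 11, 15))
print(example_url)
```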

**Start Selenium:**

In [14]:
driver = webdriver.Firefox()

curr_url = build_url(date_obj)
driver.get(curr_url)

## 1.1 Collect debates


Collecting data from the web using Selenium:

In [15]:
df_debates = scrape_debate_sections(driver, curr_url)
df_debates

| | debate_id | debate_excerpt | url | title | section | section_excerpt |
|---|---|---|---|---|---|---|
| 0 | 2023-11-15b.629.0 | I can now announce the arrangements for the el... | https://www.theyworkforyou.com/debates/?id=202... | Speaker’s Statement | | |
| 1 | 2023-11-15b.629.5 | What steps she is taking with Cabinet colleagu... | https://www.theyworkforyou.com/debates/?id=202... | Online Fraud | Science, Innovation and Technology | The Secretary of State was asked— |
| 2 | 2023-11-15b.630.3 | What recent discussions she has had with (a) O... | https://www.theyworkforyou.com/debates/?id=202... | Telecoms Network Replacement | Science, Innovation and Technology | The Secretary of State was asked— |
| 3 | 2023-11-15b.631.1 | What steps her Department is taking to tackle ... | https://www.theyworkforyou.com/debates/?id=202... | AI-generated Content: Social Media | Science, Innovation and Technology | The Secretary of State was asked— |
| 4 | 2023-11-15b.632.5 | What steps the Government are taking to regula... | https://www.theyworkforyou.com/debates/?id=202... | AI Regulation | Science, Innovation and Technology | The Secretary of State was asked— |
| 5 | 2023-11-15b.633.6 | What steps her Department is taking to improve... | https://www.theyworkforyou.com/debates/?id=202... | Rural Connectivity | Science, Innovation and Technology | The Secretary of State was asked— |
| 6 | 2023-11-15b.634.4 | What steps she is taking with Cabinet colleagu... | https://www.theyworkforyou.com/debates/?id=202... | Net Zero Technologies: University Research | Science, Innovation and Technology | The Secretary of State was asked— |
| 7 | 2023-11-15b.635.5 | If she will make a statement on her department... | https://www.theyworkforyou.com/debates/?id=202... | Topical Questions | Science, Innovation and Technology | The Secretary of State was asked— |
| 8 | 2023-11-15b.638.4 | If he will list his official engagements for W... | https://www.theyworkforyou.com/debates/?id=202... | Engagements | Prime Minister | The Prime Minister was asked— |
| 9 | 2023-11-15b.642.0 | What assessment he has made of recent trends i... | https://www.theyworkforyou.com/debates/?id=202... | West Midlands: Economic Growth | Prime Minister | The Prime Minister was asked— |


I won't need Selenium anymore (listen to the lecture to understand why), so I will just close it.


In [16]:
# Thanks, Selenium. Your time is done.
driver.quit()

## 1.2 Collect speeches

In [17]:
df_speeches = pd.concat([get_speeches(debate_url) for debate_url in df_debates['url']])

In [18]:
df_speeches.head()

| | debate_id | speech_id | speaker_id | speaker_position | speech_html | speech_raw_text |
|---|---|---|---|---|---|---|
| 0 | 2023-11-15b.629.0 | g629.1 | 10295 | Speaker of the House of Commons, Chair, Speake... | `<p pid="b629.1/1">\n I can now announce the ar...` | I can now announce the arrangements for the el... |
| 0 | 2023-11-15b.629.5 | g629.6 | 10580 | Conservative, New Forest West | `<p pid="b629.6/1" qnum="900097">\n What steps ...` | What steps she is taking with Cabinet colleagu... |
| 1 | 2023-11-15b.629.5 | g629.7 | 25847 | Parliamentary Under Secretary of State (Depart... | `<p pid="b629.7/1">\n Tackling fraud is a prior...` | Tackling fraud is a priority for this Governme... |
| 2 | 2023-11-15b.629.5 | g629.8 | 10580 | Conservative, New Forest West | `<p pid="b629.8/1">\n What will companies actua...` | What will companies actually have to do under ... |
| 3 | 2023-11-15b.629.5 | g629.9 | 25847 | Parliamentary Under Secretary of State (Depart... | `<p pid="b629.9/1">\n All companies in scope of...` | All companies in scope of the Act will need to... |
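Notice that the row labels in the output above repeat (0, 0, 1, 2, 3): `pd.concat` keeps the index of each individual DataFrame. That is harmless here, but if unique row labels are needed, `ignore_index=True` renumbers them. A small sketch with toy frames standing in for the per-debate results (the frames below are made up for illustration):

```python
import pandas as pd

# Two toy frames standing in for what get_speeches() returns per debate
df_a = pd.DataFrame({"speech_id": ["g629.1"]})
df_b = pd.DataFrame({"speech_id": ["g629.6", "g629.7"]})

# Default behaviour: each frame keeps its own 0-based index
stacked = pd.concat([df_a, df_b])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```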


# 2. Automated data collection

(For an explanation, watch the lecture.)

In [3]:
start_date = datetime(2023, 11, 1)
end_date = datetime(2023, 11, 29)

all_urls = [build_url(date_obj) for date_obj in pd.date_range(start_date, end_date)]
all_urls

['https://www.theyworkforyou.com/debates/?d=2023-11-01',
 'https://www.theyworkforyou.com/debates/?d=2023-11-02',
 'https://www.theyworkforyou.com/debates/?d=2023-11-03',
 'https://www.theyworkforyou.com/debates/?d=2023-11-04',
 'https://www.theyworkforyou.com/debates/?d=2023-11-05',
 'https://www.theyworkforyou.com/debates/?d=2023-11-06',
 'https://www.theyworkforyou.com/debates/?d=2023-11-07',
 'https://www.theyworkforyou.com/debates/?d=2023-11-08',
 'https://www.theyworkforyou.com/debates/?d=2023-11-09',
 'https://www.theyworkforyou.com/debates/?d=2023-11-10',
 'https://www.theyworkforyou.com/debates/?d=2023-11-11',
 'https://www.theyworkforyou.com/debates/?d=2023-11-12',
 'https://www.theyworkforyou.com/debates/?d=2023-11-13',
 'https://www.theyworkforyou.com/debates/?d=2023-11-14',
 'https://www.theyworkforyou.com/debates/?d=2023-11-15',
 'https://www.theyworkforyou.com/debates/?d=2023-11-16',
 'https://www.theyworkforyou.com/debates/?d=2023-11-17',
 'https://www.theyworkforyou.co
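A quick check of what `pd.date_range` produces for this period: both endpoints are included and weekend dates are not skipped, so the scraper is asked for one page per calendar day. Whether the weekend pages contain any debates is not something this sketch can tell us:

```python
from datetime import datetime

import pandas as pd

dates = pd.date_range(datetime(2023, 11, 1), datetime(2023, 11, 29))

# Both endpoints are part of the range (daily frequency by default)
n_dates = len(dates)
first_day = dates[0].strftime("%Y-%m-%d")
last_day = dates[-1].strftime("%Y-%m-%d")

# dayofweek: Monday=0 ... Sunday=6, so >= 5 means Saturday or Sunday
n_weekend_days = int((dates.dayofweek >= 5).sum())

print(n_dates, first_day, last_day, n_weekend_days)
```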

In [6]:
driver = webdriver.Firefox()

df_debates = pd.concat([scrape_debate_sections(driver, url) for url in tqdm.tqdm(all_urls)])

driver.quit()


In [9]:
driver.quit()

In [7]:
df_debates.shape

(303, 6)

In [24]:
df_debates.tail(20)

| | debate_id | debate_excerpt | url | title | section | section_excerpt |
|---|---|---|---|---|---|---|
| 10 | 2023-11-28a.687.1 | What recent assessment she has made of the eff... | https://www.theyworkforyou.com/debates/?id=202... | Energy Bills Alternative Funds | Energy Security and Net Zero | The Secretary of State was asked— |
| 11 | 2023-11-28a.688.0 | What discussions she has had with National Gri... | https://www.theyworkforyou.com/debates/?id=202... | National Grid Infrastructure: East of England | Energy Security and Net Zero | The Secretary of State was asked— |
| 12 | 2023-11-28a.688.5 | What recent discussions she has had with Cabin... | https://www.theyworkforyou.com/debates/?id=202... | Offshore Wind Sector | Energy Security and Net Zero | The Secretary of State was asked— |
| 13 | 2023-11-28a.689.3 | What steps her Department is taking to support... | https://www.theyworkforyou.com/debates/?id=202... | Energy Bills Support: Shropshire | Energy Security and Net Zero | The Secretary of State was asked— |
| 14 | 2023-11-28a.690.2 | What steps her Department is taking to help en... | https://www.theyworkforyou.com/debates/?id=202... | Energy-intensive Industries | Energy Security and Net Zero | The Secretary of State was asked— |
| 15 | 2023-11-28a.691.5 | If she will make an estimate of the number of ... | https://www.theyworkforyou.com/debates/?id=202... | Household Energy Efficiency | Energy Security and Net Zero | The Secretary of State was asked— |
| 16 | 2023-11-28a.692.3 | Whether she is taking steps to support the dev... | https://www.theyworkforyou.com/debates/?id=202... | Offshore Wind: East of England | Energy Security and Net Zero | The Secretary of State was asked— |
| 17 | 2023-11-28a.693.3 | If she will make a statement on her department... | https://www.theyworkforyou.com/debates/?id=202... | Topical Questions | Energy Security and Net Zero | The Secretary of State was asked— |
| 18 | 2023-11-28a.701.0 | Nominations closed at 12 noon for the election... | https://www.theyworkforyou.com/debates/?id=202... | Speaker’s Statement | | |
| 19 | 2023-11-28a.702.0 | (Urgent Question): To ask the Secretary of Sta... | https://www.theyworkforyou.com/debates/?id=202... | Ukraine | | |


**Get speeches of all debates**

In [8]:
df_speeches = pd.concat([get_speeches(debate_url) for debate_url in tqdm.tqdm(df_debates['url'])])


In [10]:
df_speeches.shape

(4091, 6)

# 3. Databases

(For an explanation, watch the lecture.)

In [12]:
# A little trick to find out how big to set my VARCHAR columns in the database.
df_speeches.apply(lambda x: max([len(xx) for xx in x]), axis=0)

debate_id              18
speech_id               7
speaker_id              5
speaker_position      531
speech_html         53137
speech_raw_text     42564
dtype: int64
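The same measurement can be written with pandas' vectorised string methods instead of a Python list comprehension. A sketch on a toy DataFrame (the two rows are made up for illustration, and it assumes every column holds strings, as is the case for `df_speeches` above):

```python
import pandas as pd

toy = pd.DataFrame({
    "speaker_id": ["10295", "10580"],
    "speech_raw_text": ["I can now announce", "What steps"],
})

# For each column: length of every string, then the largest of those lengths
max_lengths = toy.apply(lambda col: col.str.len().max())
print(max_lengths)
```

The longest value per column gives a lower bound for the `VARCHAR(n)` size; leaving some headroom is a judgement call, since future data may be longer.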

In [13]:
# Get a template schema to start with.
print(pd.io.sql.get_schema(df_speeches, 'speeches'))

CREATE TABLE "speeches" (
"debate_id" TEXT,
  "speech_id" TEXT,
  "speaker_id" TEXT,
  "speaker_position" TEXT,
  "speech_html" TEXT,
  "speech_raw_text" TEXT
)


## 3.1 Create database and tables

Create a database from within Python:

In [19]:
# Create a database engine using SQLAlchemy
engine = create_engine('sqlite:///../data/discordia.db', echo=False)

# Why? Read: https://stackoverflow.com/a/71685414/843365
with engine.connect() as conn:
    pass

In [22]:
%sql sqlite:///../data/discordia.db --alias discordia

**Create debates table with controlled data types**

TABLE `debates`

In [24]:
%%sql discordia

CREATE TABLE debates (
    "debate_id" VARCHAR(20),
    "debate_excerpt" TEXT,
    "url" VARCHAR(100),
    "title" VARCHAR(100),
    "section" VARCHAR(100),
    "section_excerpt" TEXT
)


TABLE `speeches`

In [25]:
%%sql discordia

CREATE TABLE "speeches" (
    "debate_id" VARCHAR(20),
    "speech_id" VARCHAR(10),
    "speaker_id" CHAR(5),
    "speaker_position" TEXT,
    "speech_html" TEXT,
    "speech_raw_text" TEXT
)

**Move data to database**

In [26]:
df_debates.to_sql('debates', engine, if_exists='append', index=False)
df_speeches.to_sql('speeches', engine, if_exists='append', index=False)

4091
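To see why the table is created first and then filled with `if_exists='append'`, here is a self-contained sketch of the same round trip. It is not the notebook's setup: it swaps in an in-memory SQLite database through the standard-library `sqlite3` module (which pandas also accepts as a connection) and uses a made-up three-column table:

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk (assumption for this sketch)
conn = sqlite3.connect(":memory:")

# Create the table first so the column types are under our control
conn.execute("""
    CREATE TABLE speeches (
        "debate_id" VARCHAR(20),
        "speech_id" VARCHAR(10),
        "speaker_id" CHAR(5)
    )
""")

toy_speeches = pd.DataFrame({
    "debate_id": ["2023-11-15b.629.0", "2023-11-15b.629.5"],
    "speech_id": ["g629.1", "g629.6"],
    "speaker_id": ["10295", "10580"],
})

# if_exists='append' keeps the existing table (and its schema) and adds rows
toy_speeches.to_sql("speeches", conn, if_exists="append", index=False)

row_count = pd.read_sql("SELECT COUNT(*) AS num_speeches FROM speeches", conn)

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column
declared_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(speeches)")}

print(row_count)
print(declared_types)
conn.close()
```

Had the table not existed, `to_sql` would have created it with types inferred by pandas (all `TEXT` here, as the template schema earlier shows). Note that SQLite records the declared type but does not enforce the length in `VARCHAR(n)`; other database engines do.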

## 3.2 Read from SQL

In [36]:
query = """
SELECT COUNT(*) as num_speeches FROM speeches
"""

pd.read_sql(query, engine)

| | num_speeches |
|---|---|
| 0 | 4091 |


## 3.3 pandas vs SQL

Question: How many utterances did each speaker make?

In [53]:
(
    df_speeches.groupby('speaker_id')
               .apply(lambda x: pd.Series({'num_utterances': len(x)}))
               .sort_values('num_utterances', ascending=False)
               .reset_index()
               .head(10)
)

| | speaker_id | num_utterances |
|---|---|---|
| 0 | 11115 | 233 |
| 1 | 10295 | 172 |
| 2 | 25376 | 100 |
| 3 | 24938 | 97 |
| 4 | 11859 | 85 |
| 5 | 25428 | 64 |
| 6 | 10648 | 56 |
| 7 | 13864 | 54 |
| 8 | 25806 | 50 |
| 9 | 25227 | 48 |


In [51]:
%%sql discordia

-- How many utterances did each speaker make?
SELECT 
    speaker_id, 
    COUNT(*) AS num_utterances 
FROM 
    speeches 
GROUP BY speaker_id
ORDER BY
    num_utterances DESC
LIMIT 10

| speaker_id | num_utterances |
|---|---|
| 11115 | 233 |
| 10295 | 172 |
| 25376 | 100 |
| 24938 | 97 |
| 11859 | 85 |
| 25428 | 64 |
| 10648 | 56 |
| 13864 | 54 |
| 25806 | 50 |
| 25227 | 48 |
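The two approaches can be checked against each other on a small example. The sketch below uses a made-up list of speaker IDs and an in-memory SQLite database (via the standard-library `sqlite3` module rather than the SQLAlchemy engine used above), and counts rows per group with `groupby(...).size()`, a shorter alternative to the `apply` call:

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame(
    {"speaker_id": ["11115", "10295", "11115", "11115", "10295", "25376"]}
)

# pandas: size() counts the rows in each group
pandas_counts = (
    toy.groupby("speaker_id")
       .size()
       .rename("num_utterances")
       .sort_values(ascending=False)
       .reset_index()
)

# SQL: the same question, asked of an in-memory copy of the toy data
conn = sqlite3.connect(":memory:")
toy.to_sql("speeches", conn, index=False)
sql_counts = pd.read_sql(
    """
    SELECT speaker_id, COUNT(*) AS num_utterances
    FROM speeches
    GROUP BY speaker_id
    ORDER BY num_utterances DESC
    """,
    conn,
)
conn.close()

print(pandas_counts)
print(sql_counts)
```

The toy data has no ties, so both results come back in the same order. With ties, neither pandas nor SQL guarantees how equal counts are ordered unless a second sort key is added.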
