# Hansard Procedural Terms Creation


This notebook scrapes the [online index of Erskine May](https://erskinemay.parliament.uk/browse/indexterms?page=1) to create a list of procedural terms used in the UK Parliament. The index pages are read from HTML files saved under `data/erskine-may-index/`.
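
The cells below parse index pages that are already saved locally. As a rough sketch of how those files could be fetched, the snippet below builds the paginated URL shown in the link above and writes each page to `<page>.html`. The helper names `page_url` and `download_index_pages`, the `last_page` argument, and the injectable `opener` are all hypothetical and not part of the original notebook; the number of index pages is not stated here, so it is left as a parameter.

```python
import os
import urllib.request

# Base of the paginated index URL linked at the top of this notebook
BASE_URL = 'https://erskinemay.parliament.uk/browse/indexterms'


def page_url(page):
    # Build the URL for one page of the index
    return f'{BASE_URL}?page={page}'


def download_index_pages(last_page, directory, opener=urllib.request.urlopen):
    # Hypothetical helper: save pages 1..last_page as '<page>.html'.
    # `opener` is injectable so the function can be exercised offline.
    os.makedirs(directory, exist_ok=True)
    for page in range(1, last_page + 1):
        with opener(page_url(page)) as response:
            html = response.read()
        with open(os.path.join(directory, f'{page}.html'), 'wb') as file:
            file.write(html)


print(page_url(1))
```

Calling `download_index_pages` with the real page count and `'data/erskine-may-index/'` as the directory would produce files named like those listed in the extraction log further down, assuming the site serves the index as plain HTML.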


## Setup


In [14]:
import os
import ssl
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import pandas as pd

DATA_PATH = 'data/'
DIST_PATH = 'dist/'

# Ignore SSL certificate errors
ssl._create_default_https_context = ssl._create_unverified_context

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/felixwallis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/felixwallis/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Extracting parliamentary procedural terms from the online index of Erskine May


In [15]:
def extract_terms(html_file_path, filename):
    with open(html_file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')

    index_terms = soup.find_all('span', class_='text')
    return [(term.get_text(strip=True), filename) for term in index_terms]


def extract_terms_from_files(directory):
    terms = []
    files = os.listdir(directory)
    for filename in files:
        if filename.endswith(".html"):
            print(f'Extracting terms from {filename}...')
            file_path = os.path.join(directory, filename)
            terms.extend(extract_terms(file_path, filename))
    return terms


directory = DATA_PATH + 'erskine-may-index/'
index_terms = extract_terms_from_files(directory)
index_terms_df = pd.DataFrame(index_terms, columns=['term', 'source_file'])

Extracting terms from 23.html...
Extracting terms from 35.html...
Extracting terms from 9.html...
Extracting terms from 19.html...
Extracting terms from 39.html...
Extracting terms from 5.html...
Extracting terms from 15.html...
Extracting terms from 42.html...
Extracting terms from 54.html...
Extracting terms from 43.html...
Extracting terms from 14.html...
Extracting terms from 4.html...
Extracting terms from 38.html...
Extracting terms from 18.html...
Extracting terms from 8.html...
Extracting terms from 34.html...
Extracting terms from 22.html...
Extracting terms from 29.html...
Extracting terms from 3.html...
Extracting terms from 13.html...
Extracting terms from 44.html...
Extracting terms from 52.html...
Extracting terms from 25.html...
Extracting terms from 33.html...
Extracting terms from 48.html...
Extracting terms from 49.html...
Extracting terms from 32.html...
Extracting terms from 24.html...
Extracting terms from 53.html...
Extracting terms from 45.html...
Extracting term
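
To make the selector in `extract_terms` concrete, here is a minimal sketch that runs the same `find_all('span', class_='text')` lookup on an inline HTML string. The sample markup and the helper name `extract_terms_from_html` are assumptions made for illustration; the real Erskine May pages may nest these spans differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one saved index page
SAMPLE_HTML = """
<ul>
  <li><span class="text">Adjournment debates</span></li>
  <li><span class="text">  Allocation of time orders </span></li>
  <li><span class="count">12</span></li>
</ul>
"""


def extract_terms_from_html(html, source_name):
    # Same lookup as extract_terms, but on a string instead of a file
    soup = BeautifulSoup(html, 'html.parser')
    spans = soup.find_all('span', class_='text')
    return [(span.get_text(strip=True), source_name) for span in spans]


sample_terms = extract_terms_from_html(SAMPLE_HTML, 'sample.html')
print(sample_terms)
```

Only spans whose class list includes `text` are kept, and `get_text(strip=True)` trims the surrounding whitespace, so the `count` span is ignored and the second term comes back without padding.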

## Turning the procedural terms into a dictionary of stemmed unique unigrams


### Preprocessing function for the procedural terms


In [16]:
# Use a separate name so the imported `stopwords` corpus is not overwritten;
# otherwise re-running this cell raises an AttributeError
STOP_WORDS = set(stopwords.words('english'))
# Create the stemmer once rather than on every call
STEMMER = PorterStemmer()


def clean_tokenize(text):
    # Text should almost always be a string, but we check
    # just in case
    if not isinstance(text, str):
        text = str(text)
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation, numbers, and symbols
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    # Stem the tokens
    stemmed_tokens = [STEMMER.stem(token) for token in filtered_tokens]
    return stemmed_tokens
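
One detail of the cleaning step is worth seeing in isolation: the pattern `[^a-z\s]` deletes punctuation outright, so hyphenated terms are fused into a single token rather than split. The sketch below compares that behaviour with a variant that substitutes a space instead. Both helper names are hypothetical, and which behaviour is preferable depends on how the dictionary will be matched against Hansard text.

```python
import re


def strip_non_letters(text):
    # Same pattern as clean_tokenize: non-letters are deleted
    return re.sub(r'[^a-z\s]', '', text.lower())


def space_non_letters(text):
    # Alternative (not used in this notebook): non-letters become spaces
    return re.sub(r'[^a-z\s]', ' ', text.lower())


fused = strip_non_letters('Ten-minute rule bills').split()
split_up = space_non_letters('Ten-minute rule bills').split()
print(fused)     # ['tenminute', 'rule', 'bills']
print(split_up)  # ['ten', 'minute', 'rule', 'bills']
```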

### Tokenizing and stemming the procedural terms


In [32]:
index_terms_df['cleaned_term'] = index_terms_df['term'].apply(clean_tokenize)
exploded_index_terms_df = index_terms_df.explode('cleaned_term')
# Terms that reduce to no tokens explode to NaN, so drop them before
# deduplicating
unique_terms = exploded_index_terms_df['cleaned_term'].dropna().unique()
unique_terms_df = pd.DataFrame(
    unique_terms, columns=['term']).sort_values(by='term').reset_index(drop=True)

unique_terms_df.to_csv(
    DIST_PATH + 'hansard_procedural_terms.csv', index=False)
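
The explode-and-deduplicate step can be illustrated on a tiny, made-up frame. This is only a sketch with invented token lists; it shows that `explode` gives each token its own row, that an empty token list becomes a missing value, and that dropping missing values before calling `unique` keeps such rows out of the final list.

```python
import pandas as pd

# Invented token lists standing in for the output of clean_tokenize
toy_df = pd.DataFrame({
    'cleaned_term': [['motion', 'amend'], [], ['amend', 'bill']],
})

# One row per token; the empty list turns into a single NaN row
exploded = toy_df.explode('cleaned_term')
print(exploded['cleaned_term'].isna().sum())  # 1

# Drop the NaN, deduplicate, and sort alphabetically
toy_unique = sorted(exploded['cleaned_term'].dropna().unique())
print(toy_unique)  # ['amend', 'bill', 'motion']
```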

The dictionary is manually cleaned at this point to create the final [`shortened_hansard_procedural_terms.csv` file](https://docs.google.com/spreadsheets/d/1twVZ_ypcBOLroMDxgbC0veFKvHq7BbT9HbW99zUnNU8/edit?usp=sharing).
