# Hansard Procedural Terms Creation


This notebook scrapes the [online index of Erskine May](https://erskinemay.parliament.uk/browse/indexterms?page=1) to create a list of the procedural terms used in the UK Parliament.
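
As a rough illustration of the extraction step, the sketch below pulls the text out of every `<span class="text">` element in a small HTML fragment. The fragment and the two terms in it are made up for the example; the real index pages may be structured differently in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the Erskine May site.
sample_html = """
<ul>
  <li><a href="#"><span class="text"> Adjournment debates </span></a></li>
  <li><a href="#"><span class="text">Closure motions</span></a></li>
  <li><span class="count">12</span></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keep only spans carrying the "text" class and strip surrounding whitespace.
terms = [
    span.get_text(strip=True)
    for span in soup.find_all("span", class_="text")
]
print(terms)  # ['Adjournment debates', 'Closure motions']
```

Spans with other classes (here the hypothetical `count` span) are ignored, so only the term text ends up in the list.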


## Setup


In [35]:
import os
import pandas as pd
from bs4 import BeautifulSoup

DATA_PATH = 'data/'
DIST_PATH = 'dist/'

## Extracting parliamentary procedural terms from the online index of Erskine May
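
The saved index pages are named by page number (`23.html`, `9.html`, and so on), and `os.listdir` returns names in arbitrary order. If the terms should follow the order of the index, one option is to sort the file names numerically before processing them. The sketch below assumes every `.html` file has a purely numeric name; `page_number` and the sample listing are hypothetical and not part of this notebook.

```python
from pathlib import Path


def page_number(filename):
    """Return the integer page number encoded in a name like '23.html'."""
    # Assumption: the stem is purely numeric; int() raises ValueError otherwise.
    return int(Path(filename).stem)


# Hypothetical directory listing in arbitrary order.
listing = ["23.html", "35.html", "9.html", "notes.txt", "19.html"]

# Keep only the HTML pages and order them by page number, not alphabetically.
html_files = sorted(
    (name for name in listing if name.endswith(".html")),
    key=page_number,
)
print(html_files)  # ['9.html', '19.html', '23.html', '35.html']
```

A plain `sorted(listing)` would place `19.html` before `9.html`, which is why the key converts the stem to an integer.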


In [52]:
def extract_terms(html_file_path, filename):
    """
    Extract the procedural terms from one saved page of the Erskine May index.

    Returns a list of (term, filename) tuples so that each term can be traced
    back to the page it came from.
    """
    with open(html_file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')

    index_terms = soup.find_all('span', class_='text')
    return [(term.get_text(strip=True), filename) for term in index_terms]


def extract_terms_from_files(directory):
    """
    Extract terms from every .html file in the given directory.

    Returns a list of (term, filename) tuples. os.listdir returns entries in
    arbitrary order, so the pages are not necessarily processed in page order.
    """
    terms = []
    files = os.listdir(directory)
    for filename in files:
        if filename.endswith(".html"):
            print(f'Extracting terms from {filename}...')
            file_path = os.path.join(directory, filename)
            terms.extend(extract_terms(file_path, filename))
    return terms


directory = os.path.join(DATA_PATH, 'erskine-may-index')
index_terms = extract_terms_from_files(directory)
index_terms_df = pd.DataFrame(index_terms, columns=['term', 'source_file'])

Extracting terms from 23.html...
Extracting terms from 35.html...
Extracting terms from 9.html...
Extracting terms from 19.html...
Extracting terms from 39.html...
Extracting terms from 5.html...
Extracting terms from 15.html...
Extracting terms from 42.html...
Extracting terms from 54.html...
Extracting terms from 43.html...
Extracting terms from 14.html...
Extracting terms from 4.html...
Extracting terms from 38.html...
Extracting terms from 18.html...
Extracting terms from 8.html...
Extracting terms from 34.html...
Extracting terms from 22.html...
Extracting terms from 29.html...
Extracting terms from 3.html...
Extracting terms from 13.html...
Extracting terms from 44.html...
Extracting terms from 52.html...
Extracting terms from 25.html...
Extracting terms from 33.html...
Extracting terms from 48.html...
Extracting terms from 49.html...
Extracting terms from 32.html...
Extracting terms from 24.html...
Extracting terms from 53.html...
Extracting terms from 45.html...
Extracting term