# Speeches by UK Members of Parliament
## Data wrangling

The data we have is scraped from <a href='https://www.ukpol.co.uk'>www.ukpol.co.uk</a> (ukpol) and <a href="http://www.britishpoliticalspeech.org/">www.britishpoliticalspeech.org</a> (bps). The two datasets contain the speech, along with the speaker, the date and either a description (ukpol) or the party affiliation of the speaker (bps).

The aim is to build a model that predicts the party affiliation of a speaker. We therefore want to build a dataset that contains speeches labelled by party. The speeches from bps already have party affiliations, but those from ukpol (which are far more numerous) do not. To fill this gap, we obtain a list of MPs since 1970 from <a href='https://www.wikidata.org/wiki/Wikidata:WikiProject_British_Politicians'>Wikidata</a>, which includes the party each MP represents.

In [None]:
import numpy as np
import pandas as pd
import spacy
import Levenshtein

In [None]:
# Load the tables into dataframes

ukpol_df = pd.read_csv('./Raw_speeches/speeches_ukpol.csv')
ukpol_df = ukpol_df[['Speaker','Description','Speech']]

bps_df = pd.read_csv('./Raw_speeches/speeches_bps.csv')
bps_df = bps_df[['Speaker','Party','Speech']]

MP_df = pd.read_csv('MPs_1970_onwards.csv')
MP_df = MP_df[['itemLabel', 'partyLabel']]
MP_df.columns = ['Name', 'Party']
MP_df = MP_df.groupby(['Name','Party']).first().reset_index()
MP_set = set(MP_df.Name)
MP_dict = dict(zip(MP_df.Name, MP_df.Party))

# Load the large English spaCy model
nlp = spacy.load("en_core_web_lg")

# BPS dataset: tidying the speaker column

The party affiliations are already assigned in this set, so we will just do some tidying in the speaker column (though this is not necessary for our current objective, it might be helpful for other purposes).
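To make the intended tidying concrete, here is a minimal sketch in plain Python. It assumes raw speaker values look like `'Lastname, Firstname (Party)'`, which is inferred from the cleaning steps in this section rather than checked against the data. The helper `tidy_speaker` and the sample strings are made up for illustration, and the whitespace handling is slightly more aggressive than in the cells below.

```python
import re

def tidy_speaker(raw):
    """Turn 'Lastname, Firstname (Party)' into 'Firstname Lastname' (assumed input shape)."""
    # Drop a trailing party in parentheses, e.g. ' (Labour)'
    without_party = re.sub(r'\s\(.+\)', '', raw)
    # 'Lastname, Firstname' -> ['Lastname', 'Firstname'] -> 'Firstname Lastname'
    parts = without_party.split(', ')
    tidy = ' '.join(reversed(parts))
    # Collapse any repeated whitespace left over from the raw string
    return ' '.join(tidy.split())

# Made-up sample values, not rows from the bps dataset
print(tidy_speaker('Smith, John (Labour)'))  # John Smith
print(tidy_speaker('Jane Doe'))              # Jane Doe (already in the target form)
```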

In [None]:
# Remove the party in parentheses from some speakers
bps_df['Speaker'] = bps_df['Speaker'].str.replace(r'\s\(.+\)', '', regex=True)

In [None]:
# Rewrite names given as "lastname, firstname" so that all are of the form: firstname lastname
def reverse_name(split_name):
    if len(split_name) == 1:
        return split_name[0]
    else:
        reversed_name = split_name[::-1]
        return ' '.join(reversed_name)
bps_df['Speaker'] = bps_df['Speaker'].str.split(', ').apply(reverse_name)
# This introduced some extra white spaces
bps_df['Speaker'] = bps_df['Speaker'].str.replace('  ',' ')

# UKPOL party identification

## Finding MPs using Levenshtein distance.

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) needed to convert one string into another.
For each person from Step 1 for whom we did not find a direct match, we now search for the MP whose name has the smallest Levenshtein distance.
We keep only very close matches (a distance of 2 or less).
Then, after a manual inspection, we remove those that are not accurate matches (e.g. 'John Smith' matched to 'Joan Smith').
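As an illustration of the distance itself, the following is a minimal pure-Python sketch using the standard dynamic-programming recurrence. The notebook uses the `Levenshtein` package for the real computation; the functions `levenshtein` and `closest_within` and the sample names here are hypothetical and exist only to show the idea.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when the characters agree
            ))
        previous = current
    return previous[-1]


def closest_within(name, candidates, max_distance=2):
    """Return the nearest candidate if it is a near miss (0 < distance <= max_distance), else None."""
    best = min(candidates, key=lambda c: levenshtein(name, c))
    distance = levenshtein(name, best)
    return best if 0 < distance <= max_distance else None


print(levenshtein('John Smith', 'Joan Smith'))  # 1: a single substitution, yet a different person
print(closest_within('Teresa May', ['Theresa May', 'Tony Blair']))  # Theresa May
```

The first call shows why the manual inspection is needed: a distance of 1 can equally be a spelling variant or a different person.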

In [None]:
# Compute the Levenshtein distance from every ukpol speaker to every known MP
finding_MPs = {}
for speaker in set(ukpol_df['Speaker']):
    if isinstance(speaker, str):  # skip missing (NaN) speakers
        finding_MPs[speaker] = {mp: Levenshtein.distance(speaker, mp) for mp in MP_dict}

# Keep the closest MP when the distance is 1 or 2 (a distance of 0 is already an exact match)
closest_MP = {}
for speaker, distances in finding_MPs.items():
    closest = min(distances, key=distances.get)
    if 0 < distances[closest] < 3:
        closest_MP[speaker] = closest
bad_matches = ['John Eden', 'John Evans','Michael Jay','Peter Wilson','John Inge','John Morris','Joan Walmsley','John Stokes','Roger Taylor',
              'John McFall','Paul Eagland','Chris Whitty','David Moran','Ann Taylor','Julie Smith','John Ware','Jane Hutt','David Moyes',
              'John Hynd','Justin Manners','Roy Hughes','Johann Lamont','John Apter','Carwyn Jones']

for speaker in bad_matches:
    if speaker in closest_MP.keys():
        del closest_MP[speaker]
        
ukpol_df.Speaker = ukpol_df.Speaker.apply(lambda x : closest_MP[x] if x in closest_MP.keys() else x)

Some of those not yet matched to a party can be matched manually. They may not be MPs: for example, Nicola Sturgeon is the leader of the SNP and First Minister of Scotland, but not an MP in Westminster. Others failed to match because of variations in the names that have been used.
We manually assign some of the speakers who have given multiple speeches.

Some speakers were mislabelled as "2017 Labour Party Conference" or similar. We extract the party from this name and assign it to the Party column. We could infer the speakers from the descriptions, but this is more work than it is worth (given our current objective and the relatively small number of speeches involved).
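One way to pull the party out of such a label is a regular expression. The sketch below is an alternative illustration, not the approach taken in the next cell, which lists the labels for each year explicitly. It assumes labels have the shape `<year> <party name> Conference`; `CONFERENCE_PATTERN`, `LABEL_TO_PARTY` and `party_from_conference_label` are hypothetical names.

```python
import re

# Assumed label shape: "<4-digit year> <party name> Conference"
CONFERENCE_PATTERN = re.compile(r'^(\d{4}) (.+) Conference$')

# Hypothetical mapping from the name used in a label to the party name used in this notebook
LABEL_TO_PARTY = {
    'Labour Party': 'Labour Party',
    'Conservative Party': 'Conservative Party',
    'Liberal Democrat Party': 'Liberal Democrats',
}

def party_from_conference_label(speaker):
    """Return the party for a conference-style label, or None for anything else."""
    if not isinstance(speaker, str):  # missing speakers are NaN floats
        return None
    match = CONFERENCE_PATTERN.match(speaker)
    if match is None:
        return None
    # Unknown party names in a label also yield None
    return LABEL_TO_PARTY.get(match.group(2))

print(party_from_conference_label('2017 Labour Party Conference'))  # Labour Party
print(party_from_conference_label('Nicola Sturgeon'))               # None
```

Unlike an explicit per-year lookup, this does not restrict the year range, so it would also accept labels outside 1970 to 2022.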

In [None]:
manual_assignment = {
'Nicola Sturgeon' : 'Scottish National Party',
'John Hutton' : 'Labour Party',
'Steve Barclay' : 'Conservative Party',
'Therese Coffey': 'Conservative Party',
'Ken Clarke' : 'Conservative Party',
'Andrew Adonis' : 'Labour Party',
'Adonis' : 'Labour Party',
'Lord Falconer' : 'Labour Party',
'Barbara Castle' : 'Labour Party',
'Elizabeth Truss' : 'Conservative Party',
'Mr Major' : 'Conservative Party',
'Nusrat Ghani': 'Conservative Party',
'Matthew Hancock' : 'Conservative Party',
'Anthony Eden' : 'Conservative Party',
'Christian Matheson': 'Labour Party',
'Jonathan Hill' : 'Conservative Party',
'Warsi' : 'Conservative Party',
'John Reid' : 'Labour Party',
'Nigel Farage' : 'UK Independence Party',
'Caoimhe Archibald' : 'Sinn Féin',
'Carwyn Jones' : 'Labour Party',
'Clement Attlee' : 'Labour Party',
'Baroness Anelay' : 'Conservative Party',
'Ruth Davidson': 'Conservative Party',
'Clinton Davis' : 'Labour Party',
'Nicholas Ridley' : 'Conservative Party',
'Mark Drakeford' : 'Labour Party',
'Baroness Warsi' : 'Conservative Party',
'Marsha De Cordova' : 'Labour Party',
'Lord Freud' : 'Conservative Party',
'David Lloyd George' : 'Liberal Party',
'Chris Chope': 'Conservative Party',
'Colm Gildernew' : 'Sinn Fein',
'Baroness Verma' : 'Conservative Party',
'Anthony Meyer' : 'Conservative Party', # though he did join the Lib Dems later in his career
'Sir John Major' : 'Conservative Party',
'Jane Hutt' : 'Labour Party',
'Len McCluskey' : 'Labour Party',
'Tariq Ahmad' : 'Conservative Party',
'Lord Adonis' : 'Labour Party',
'Paul Channon' : 'Conservative Party',
'Sayeeda Warsi' :'Conservative Party',
'Baroness Kramer' : 'Liberal Democrats',
'Jim Wallace': 'Liberal Democrats'
    }
MP_dict.update(manual_assignment)

labour_party_conferences = { f'{x} Labour Party Conference' : 'Labour Party' for x in range(1970,2023)}
conservative_party_conferences = { f'{x} Conservative Party Conference' : 'Conservative Party' for x in range(1970,2023)}
libdem_party_conferences = { f'{x} Liberal Democrat Party Conference' : 'Liberal Democrats' for x in range(1970,2023)}

MP_dict.update(labour_party_conferences)
MP_dict.update(conservative_party_conferences)
MP_dict.update(libdem_party_conferences)

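# Look up each speaker's party; np.nan marks speakers we could not match (np.NaN was removed in NumPy 2.0)
ukpol_df['Party'] = ukpol_df['Speaker'].apply(lambda x : MP_dict[x] if x in MP_dict.keys() else np.nan)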

# Merging the datasets

We now combine the two datasets by concatenating their Speaker, Party and Speech columns.

In [None]:
# Ensure the naming of parties is consistent over the two sets
bps_to_ukpol_party_map = {
    'Labour': 'Labour Party',
    'Conservative' : 'Conservative Party',
    'Liberal Democrat': 'Liberal Democrats',
    'Liberal':'Liberal',
    'SDP-Liberal Alliance': 'Liberal Democrats' # strictly speaking this is not true, but the two parties in the alliance later merged to form what became the Liberal Democrats
}
bps_df['Party'] = bps_df['Party'].map(lambda x : bps_to_ukpol_party_map[x])

In [None]:
speeches_df = pd.concat([ukpol_df[['Speaker', 'Party', 'Speech']], bps_df[['Speaker','Party','Speech']]], ignore_index=True)

In [None]:
# Make some of the party names consistent
name_fixes = {
    'Liberal' : 'Liberal Party',
    'Sinn Féin' : 'Sinn Fein'
}
speeches_df['Party'] = speeches_df['Party'].apply(lambda x : name_fixes[x] if x in name_fixes.keys() else x)

# Parties represented

The list below shows how many speeches are in the database from speakers of each party.

In [None]:
speeches_df.dropna(subset=['Party'], inplace=True) # drop rows with no Party
speeches_df.Party.value_counts()

In [None]:
speeches_df.to_csv('Processed_speeches/speeches.csv', index=False)