# Speeches by UK Members of Parliament

Other resources in this series:
* <a href='https://www.kaggle.com/datasets/andrewsale/uk-political-speeches'>The dataset</a>
* <a href='https://www.kaggle.com/code/andrewsale/speech-scraping'>Scraping notebook</a>
* <a href='https://www.kaggle.com/code/andrewsale/speeches-data-wrangling'>Wrangling notebook</a>
* <a href='https://www.kaggle.com/code/andrewsale/speeches-classification-model-trials-v2/notebook'>Model tuning notebook</a>

## Random sampling from speeches

In this notebook we draw random samples from the speeches gathered in the scraping and wrangling notebooks. Most samples are 100 words long, but some are shorter, to improve model performance on shorter inputs.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from functools import partial

# Load and prepare the data

First we load the data, then add a word length feature and inspect its statistics.


In [2]:
speeches_df = pd.read_csv('../input/speeches-data-wrangling/speeches.csv')

In [3]:
speeches_df['Word length'] = speeches_df['Speech'].apply(lambda x : len(x.split(' ')) if isinstance(x,str) else 0)
speeches_df['Word length'].describe()

count    10969.000000
mean      1301.883946
std       1454.084189
min          0.000000
25%        217.000000
50%        809.000000
75%       1864.000000
max      11857.000000
Name: Word length, dtype: float64
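
The word count above splits on single spaces with `split(' ')`, while `random_sample` further down uses `split()` with no argument. The two agree unless a speech contains consecutive spaces or other whitespace. A small illustration on a made-up string (not taken from the dataset):

```python
# Made-up example text; note the double space between "beg" and "to".
text = "I beg  to move"

# split(' ') breaks on every single space and keeps the empty string in between
count_single = len(text.split(' '))
# split() breaks on runs of whitespace and drops empty strings
count_any = len(text.split())

print(count_single)  # 5
print(count_any)     # 4
```

If the two counts differ noticeably on real speeches, the `Word length` feature slightly overstates how many words `random_sample` has to work with.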

Now we select only those speeches by Labour or Conservative speakers, and identify them with labels:

* 0 for Labour
* 1 for Conservative

In [4]:
tory_labour = {'Conservative Party', 'Labour Party', 'Labour Co-operative'}
tory_labour_speeches = speeches_df[speeches_df['Party'].isin(tory_labour)].dropna(subset=['Speech'])
tory_labour_speeches['Label'] = tory_labour_speeches.loc[:,'Party'].apply(lambda x : 1 if x=='Conservative Party' else 0)

## How many speeches should we use in our sample?

We saw that the median speech is around 800 words long. If we use subsamples of 100 words and allow 50 words of overlap between consecutive samples, then we can get 16 samples (including one of only 50 words) from the median speech.

The total length of Labour speeches is almost 4,000,000 words. Taking 100 word subsamples, and allowing for 50 word overlaps, this gives us almost 80,000 samples. This is roughly half the number we get using Conservative speeches.

Thus, taking 150,000 samples in total will give a good variety of speeches.
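
The window arithmetic above can be checked with a short sketch. The helper name `windows` is hypothetical and is not used elsewhere in this notebook; it simply lists the start and end word indices of 100-word windows that begin every 50 words.

```python
def windows(n_words, length=100, stride=50):
    """(start, end) word indices of overlapping windows covering n_words words."""
    return [(start, min(start + length, n_words))
            for start in range(0, n_words, stride)]

# A speech of 800 words, roughly the median length
median_speech = windows(800)
print(len(median_speech))   # 16
print(median_speech[-1])    # (750, 800)
```

The last window is the 50-word one mentioned above. The same reasoning is why the cells below divide the total word count by 50 to estimate the number of available samples.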

In [5]:
tory_labour_speeches[tory_labour_speeches.Label == 0]['Word length'].sum() /50

77897.4

In [6]:
tory_labour_speeches[tory_labour_speeches.Label == 1]['Word length'].sum() /50

157980.38

## Helper functions

* `train_val_test_split`

We make the split on the speech level (before taking samples from each speech). This ensures that there is no leakage between train and test sets.

We upsample the Labour speeches (randomly select more than there are, with replacement) so that Labour and Conservatives are equally represented in the train and validation sets. The test set is left as it is. Ultimately, the number of sample speeches we take is chosen by looking at the number of (unique) Labour speeches.

Note that Conservative speeches tend to be longer than Labour speeches. Since the sampling weights speeches by their word count, we still end up with more Conservative samples than Labour ones, despite the upsampling.
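
To see why word-count weighting tilts the samples towards the party with longer speeches, here is a minimal sketch on made-up numbers. The toy dataframe and the `draw_prob` column are assumptions for illustration only; the column names `Label` and `Word length` match the ones used in this notebook.

```python
import pandas as pd

# Toy data (made up): two speeches per label, but label 1 speeches are longer.
toy = pd.DataFrame({
    "Label":       [0, 0, 1, 1],
    "Word length": [200, 300, 600, 900],
})

# Probability of each row being drawn when sampling with weights='Word length'
toy["draw_prob"] = toy["Word length"] / toy["Word length"].sum()

# Expected share of samples per label
expected_share = toy.groupby("Label")["draw_prob"].sum()
print(expected_share)
```

Both labels have the same number of speeches here, yet label 1 is expected to supply three quarters of the samples because it holds three quarters of the words.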

* `random_sample`

Takes a whole speech and randomly pulls out a run of consecutive words of the given length (allowing for shorter samples to be extracted with a given probability).

In [7]:
def train_val_test_split(data, split=(0.8, 0.1, 0.1), random_state=1):
    '''
    Splits the dataset into three parts by the weights given in the split argument. 
    Note that the output train and validation sets have the same number of Labour and Conservative speeches by upsampling the Labour speeches.
    
    Args:
        data --- the set to split
        split --- the relative sizes of the train, validation and test sets
        random_state --- seed used for both splits and for the upsampling
    
    Returns:
        train, val, test --- three dataframes of whole speeches
    '''
    # Work on a copy so neither the caller's list nor the default is modified
    split = list(split)
    if len(split) > 3:
        raise ValueError('Split should contain at most 3 values.')
    # Pad missing weights with zeros, then normalise so the weights add up to one
    split += [0] * (3 - len(split))
    split_sum = sum(split)
    split = [weight / split_sum for weight in split]
    
    # Hold out the test set first, so that upsampling never touches it
    train_val, test = train_test_split(data, test_size=split[2], random_state=random_state, stratify=data['Party'])
    train_val_0 = train_val[train_val.Label == 0]
    train_val_1 = train_val[train_val.Label == 1]
    # Upsample Labour (label 0), with replacement, to match the number of Conservative speeches
    train_val_0_upsample = resample(train_val_0, replace=True, n_samples=train_val_1.shape[0], random_state=random_state)
    train_val = pd.concat([train_val_0_upsample, train_val_1], ignore_index=True)
    # Split the balanced remainder into train and validation sets
    train, val = train_test_split(train_val, train_size=split[0]/(split[0]+split[1]), random_state=random_state, stratify=train_val['Label'])
    return train, val, test
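
Because the test set is split off before any upsampling, no test speech can also appear in the train or validation sets. The train/validation split, however, happens after upsampling, so a Labour speech that was drawn more than once may land in both. A minimal sketch of how one might check this; the helper name `shared_speeches` and the toy frames are hypothetical, and the speech text is assumed to be a good enough key for identifying a speech.

```python
import pandas as pd

def shared_speeches(a, b, key="Speech"):
    """Number of distinct `key` values that occur in both dataframes."""
    return len(set(a[key]) & set(b[key]))

# Toy frames (made up) standing in for the outputs of a split.
toy_train = pd.DataFrame({"Speech": ["s1", "s2", "s2", "s3"]})
toy_val = pd.DataFrame({"Speech": ["s2", "s4"]})
toy_test = pd.DataFrame({"Speech": ["s5", "s6"]})

print(shared_speeches(toy_train, toy_val))   # 1
print(shared_speeches(toy_train, toy_test))  # 0
```

On the real data the same call could be applied to the three dataframes returned by `train_val_test_split`.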

In [8]:
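def random_sample(speech, length=100, shorten_prob=0.1):
    '''
    Randomly selects a run of consecutive words from the given speech.
    
    Args:
        speech: str  -- the speech to sample from.
        length: int (default 100)  -- the desired maximum length of sample, in number of words.
        shorten_prob: float (default 0.1)  -- the probability that we take a sample that is shorter than the desired length.
    
    Returns:
        str  -- the sampled words, joined by single spaces.
    '''
    split_speech = speech.split()
    # Start early enough to leave at least 25 words; clamp at 0 for speeches shorter than that
    max_start = max(len(split_speech) - 25, 0)
    first_index = int(np.random.uniform(0, max_start))
    # With probability shorten_prob, shrink the sample to between 25 and length-10 words
    random_shortenizer = np.random.uniform() < shorten_prob
    if random_shortenizer:
        length = int(np.random.uniform(25, length-10))
    # Samples that start near the end of the speech are cut off at the last word
    last_index = min(len(split_speech), first_index + length)
    return " ".join(split_speech[first_index:last_index])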
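
`random_sample` draws from NumPy's global random state, so the extracted text changes between runs unless that state is seeded. As a sketch only, here is a variant that takes an explicit generator. The name `random_sample_seeded` and the `rng` argument are hypothetical and are not used elsewhere in this notebook, and the integer draws differ slightly from the uniform draws in `random_sample`.

```python
import numpy as np

def random_sample_seeded(speech, length=100, shorten_prob=0.1, rng=None):
    """Hypothetical variant of random_sample that draws from an explicit generator."""
    rng = np.random.default_rng() if rng is None else rng
    words = speech.split()
    # Latest allowed start leaves 25 words; clamp at 0 for very short speeches
    max_start = max(len(words) - 25, 0)
    first = int(rng.integers(0, max_start + 1))
    # Occasionally shorten the sample (only meaningful when length > 35)
    if rng.random() < shorten_prob and length > 35:
        length = int(rng.integers(25, length - 10))
    return " ".join(words[first:first + length])

# Made-up speech of 300 distinct tokens
speech = " ".join(f"w{i}" for i in range(300))
sample = random_sample_seeded(speech, rng=np.random.default_rng(0))
repeat = random_sample_seeded(speech, rng=np.random.default_rng(0))
print(len(sample.split()), sample == repeat)
```

Passing a generator built from the same seed gives the same sample, which makes it easier to regenerate a dataset exactly.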

# Take samples

In [9]:
def get_all_samples(speeches, full_size=50000, split=(0.8, 0.1, 0.1), sample_length=100, shorten_prob=0.1, random_state=1):
    '''
    Splits the speeches into train, validation and test sets, saves those
    full-speech sets to csv, then draws random samples from each set.
    Returns the list [train, val, test] of sampled dataframes.
    '''
    # pad and normalise the split weights locally, so the set sizes add up to full_size
    split = list(split) + [0] * (3 - len(split))
    split_sum = sum(split)
    split = [weight / split_sum for weight in split]
    # split the set
    train, val, test = train_val_test_split(speeches, split=split, random_state=random_state)
    train.to_csv('full_train.csv')
    val.to_csv('full_val.csv')
    test.to_csv('full_test.csv')
    # define the function that obtains the random subspeech
    get_sample = partial(random_sample, length=sample_length, shorten_prob=shorten_prob)
    # initialise a list to collect the output dataframes
    output = []
    # cycle through the three sets
    for i, df in enumerate([train, val, test]):
        # sample the rows, with replacement and weighted by word length
        sampled_rows = df.sample(int(full_size*split[i]), 
                                 replace=True, 
                                 weights='Word length', 
                                 axis=0, 
                                 ignore_index=True, 
                                 random_state=random_state
                                )
        # take sample from each speech
        sampled_rows['Speech'] = sampled_rows['Speech'].apply(get_sample)
        # append to output
        output.append(sampled_rows)
    return output

In [10]:
train, val, test = get_all_samples(tory_labour_speeches,
                                   full_size = 150000, 
                                   split = [0.8,0.1,0.1],
                                   sample_length = 100,
                                   shorten_prob = 0.1
                                  )

# Summary and save

Now let's see the breakdown of our sets:

In [11]:
train.Label.value_counts()

1    64059
0    55941
Name: Label, dtype: int64

In [12]:
val.Label.value_counts()

1    8066
0    6934
Name: Label, dtype: int64

In [13]:
test.Label.value_counts()

1    9518
0    5482
Name: Label, dtype: int64
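
To compare the three sets on the same scale, the counts can be turned into shares. The sketch below copies the label counts from the outputs above into a small dataframe; the variable names `counts` and `shares` are only for illustration.

```python
import pandas as pd

# Label counts copied from the value_counts outputs above.
counts = pd.DataFrame(
    {"train": [55941, 64059], "val": [6934, 8066], "test": [5482, 9518]},
    index=pd.Index([0, 1], name="Label"),
)

# Share of each label within each set (each column sums to 1)
shares = counts / counts.sum()
print(shares.round(3))
```

The train and validation sets come out at roughly 53% Conservative, while the test set, which was not upsampled, is roughly 63% Conservative.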

In [14]:
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
val.to_csv('val.csv', index=False)