## Creating a sample dataset from the first 10k entries to explore the data

In [1]:
import pandas as pd

# Load just 10k rows to test
df = pd.read_csv("../data/ParlVote+_raw.csv", nrows=10000)

# Keep only relevant columns
df = df[['speech', 'vote', 'party', 'motion_party', 'policy_preference']]

print(df.head())
print(df['vote'].value_counts())

df.to_csv("../data/parlvote_sample_10k.csv", index=False)

                                              speech  vote   party  \
0  The right hon Gentleman has recited a catalogu...     0  labour   
1  I am not sure whether this has occurred to the...     0  labour   
2  Before the right hon Gentleman leaves the subj...     0  labour   
3  I thank the right hon Member for Penrith and T...     0  labour   
4  I thank my right hon Friend for giving way and...     0  labour   

   motion_party  policy_preference  
0  conservative              305.1  
1  conservative              305.1  
2  conservative              305.1  
3  conservative              305.1  
4  conservative              305.1  
vote
1    5085
0    4915
Name: count, dtype: int64


## Checking speech length to estimate the input size

In [2]:
df_sample = pd.read_csv("../data/parlvote_sample_10k.csv")

# Word count per speech, splitting on whitespace
avg_length = df_sample['speech'].str.split().str.len().mean()

print(f"Average speech length: {avg_length:.2f} words")



Average speech length: 763.11 words


The average speech in this sample is 763 words long. Since subword tokenizers usually produce at least one token per word, the average speech exceeds the 512-token input limit that many transformer models have. The literature on this dataset notes the same constraint.
Abercrombie and Batista-Navarro (2022) pad the texts to the maximum input length of 512 tokens.
Cao & Drinkall (2024) limited motions and speeches to a maximum length of 512 tokens, MPNet’s maximum window size.
They also mention that "Future iterations of this work might benefit from exploring alternative strategies to handle longer texts without losing pertinent information."
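The mean alone does not show how many speeches actually exceed the limit. The following is a minimal sketch of a summary that reports the mean, the median, and the share of speeches over a limit. The helper `summarise_lengths` and the toy values are hypothetical and are not part of the original analysis; the lengths can be word counts or token counts.

```python
from statistics import mean, median
from typing import Dict, Sequence


def summarise_lengths(lengths: Sequence[int], limit: int = 512) -> Dict[str, float]:
    """Summarise speech lengths against an input limit (hypothetical helper)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    # Count the speeches that are strictly longer than the limit.
    over = sum(1 for n in lengths if n > limit)
    return {
        "mean": mean(lengths),
        "median": median(lengths),
        "share_over_limit": over / len(lengths),
    }


# Toy lengths for illustration only; these are not values from ParlVote+.
toy_lengths = [100, 600, 800]
summary = summarise_lengths(toy_lengths)
print(summary)
```

In this notebook, the word counts could be obtained with `df_sample['speech'].str.split().str.len().tolist()`. Word counts only approximate token counts, so the share over the limit would be a rough lower bound rather than an exact figure.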

### Options:
1. Truncation: cut each speech off at the 512-token limit. This risks losing information from the end of a speech, but it fits most models and trains faster.
2. Chunking: split each speech into 512-token chunks, run the model on each chunk, then aggregate the outputs (e.g. average the predictions).
3. Long-input models: use a model designed for longer inputs, such as Longformer or BigBird. This could require more setup and compute.
4. Summarisation: run a summarisation model first, then use the summary as the model input.
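Option 2 can be sketched in plain Python as follows. This is an illustration under assumptions, not the method used in this notebook: `chunk_token_ids` and `mean_prediction` are hypothetical helpers, the optional `stride` overlap is an added idea, and the probabilities are toy values. A real pipeline would also need to reserve room in each chunk for the tokenizer's special tokens.

```python
from typing import List, Sequence


def chunk_token_ids(token_ids: Sequence[int], max_len: int = 512,
                    stride: int = 0) -> List[List[int]]:
    """Split token ids into windows of at most max_len, overlapping by stride."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return chunks


def mean_prediction(chunk_probs: Sequence[float]) -> float:
    """Aggregate per-chunk probabilities by simple averaging."""
    if not chunk_probs:
        raise ValueError("chunk_probs must not be empty")
    return sum(chunk_probs) / len(chunk_probs)


# Toy example: a 1200-token speech split into 512-token chunks.
toy_ids = list(range(1200))
chunks = chunk_token_ids(toy_ids, max_len=512)
chunk_sizes = [len(c) for c in chunks]

# Toy per-chunk probabilities standing in for real model outputs.
speech_prob = mean_prediction([0.8, 0.6, 0.4])
print(chunk_sizes, speech_prob)
```

Averaging weights every chunk equally, so a short final chunk counts as much as a full one; weighting by chunk length is one alternative.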


I will start with truncation, as it is the simplest option and has been used in the two academic papers that use the ParlVote+ dataset. I will use the MPNet tokenizer, as MPNet delivered good results in Cao & Drinkall (2024).

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

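def tokenize_speech(text):
    """Tokenize one speech, truncating or padding it to exactly 512 tokens."""
    tokens = tokenizer(
        text,
        max_length=512,
        padding='max_length',   # pad shorter speeches up to 512 tokens
        truncation=True,        # cut longer speeches down to 512 tokens
        return_tensors='pt'     # return PyTorch tensors
    )
    return tokens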

speeches = df_sample['speech'].tolist()

tokenized = [tokenize_speech(speech) for speech in speeches]
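With `padding='max_length'`, the attention mask marks real tokens with 1 and padding with 0, so a mask that sums to `max_length` belongs to a speech that filled the whole window. The sketch below counts such speeches as a rough indicator of how many were truncated. The helper `count_full_windows` and the toy masks are hypothetical; a speech of exactly 512 tokens would also be counted, so the result is an upper bound on the number of truncated speeches.

```python
from typing import Sequence


def count_full_windows(attention_masks: Sequence[Sequence[int]],
                       max_length: int = 512) -> int:
    """Count encodings whose attention mask has no padding positions."""
    return sum(1 for mask in attention_masks if sum(mask) >= max_length)


# Toy masks with a window size of 4, for illustration only.
toy_masks = [
    [1, 1, 1, 1],  # filled the window (truncated or exactly 4 tokens)
    [1, 1, 0, 0],  # shorter than the window, padded
    [1, 1, 1, 1],
]
n_full = count_full_windows(toy_masks, max_length=4)
print(n_full)
```

In this notebook, the masks could be extracted with `[t['attention_mask'][0].tolist() for t in tokenized]` and passed to the helper with the default `max_length=512`.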


In [9]:
print(tokenized[1006])

{'input_ids': tensor([[    0,  2519,  2000, 10193, 10174,  5997,  2012,  2032,  2001,  2000,
          4058,  3318,  2012,  2007,  2112,  4775,  2003,  2000,  5985,  2007,
          2012,  2000,  4108,  3024,  2499,  2295,  3795,  2497,  2017,  2039,
          7369,  2001,  2170,  2002,  2017,  2039,  3036,  2004,  3077,  2000,
          8910,  2002,  2573,  2795,  2012,  2195,  4108,  5538,  2065, 17030,
          2002,  3897,  1033,  2519,  2006,  5997,  2012,  2000,  2235,  1009,
          1059, 10344,  2101,  3302,  2189,  2000,  2204,  2497,  2044,  2511,
          2000,  4108,  2499,  2295,  2053,  2204,  2573, 28130,  1033,  2070,
          3075,  2846,  2044,  2042,  3859,  2018,  1014,  1049,  2576,  2040,
          3378,  2012,  2000, 10193,  2270,  2009, 19376,  2007,  2029,  2003,
          2018,  2177,  1016,  2000,  3025,  2042,  2046,  1041,  7075,  2009,
          2000,  2235,  2017,  2000,  2204,  2931,  1014,  2002,  2000,  4108,
          3164,  2042,  2912,  2017,  

In [10]:
tokenizer.save_pretrained("./MPNet_tokenizer")

('./MPNet_tokenizer/tokenizer_config.json',
 './MPNet_tokenizer/special_tokens_map.json',
 './MPNet_tokenizer/vocab.txt',
 './MPNet_tokenizer/added_tokens.json',
 './MPNet_tokenizer/tokenizer.json')