Replication files for "Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems"


*Note: These replication files do not include the full (rather large) data used in the paper (see below for instructions on obtaining that data). They do include:

  • the scripts we used to generate the full results from the full XML data. These scripts have been trivially modified (only the session_indx dict was changed) to work with the included sample data and generate results for two Parliamentary sessions.

  • accuracy results and predicted values generated from the full data, which can be used to produce the figures.*

These replication files are organized in the following way:

  1. Scripts to generate simulated data and run classifiers on them.
  2. UK Parliament: Scripts to convert the XML data to sparse matrices with topic dummies, then run machine learning algorithms to produce accuracy results and predicted values.
  3. Scripts to generate the figures found in the article from the accuracy results and predicted values.

NB: Because Dataverse does not allow folders, the file paths in the scripts may need to be changed: the scripts were written with the data in a subfolder called "data".
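One way to handle the path issue without editing every script is a small lookup helper. The following is a minimal sketch, not part of the replication scripts: the function name `resolve_data_path` and its parameters are hypothetical, and it assumes only that a file sits either in a "data" subfolder or next to the scripts.

```python
from pathlib import Path


def resolve_data_path(filename, base_dir=".", subfolder="data"):
    """Return the path to `filename`, preferring base_dir/subfolder.

    Hypothetical helper: falls back to base_dir itself when the file is
    not found in the subfolder (e.g. after a flat Dataverse download).
    """
    base = Path(base_dir)
    nested = base / subfolder / filename
    if nested.exists():
        return nested
    return base / filename
```

A script could then call, for example, `resolve_data_path("years_session_index.csv")` instead of hard-coding the "data" prefix.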


Data included

The full XML speech data were provided by Rheault et al. (2016). We provide the scripts we used to convert the data, converted comma-separated value files for selected sessions, and the output of our machine learning classifiers, as follows:

  1. Speech data and covariates for sessions 1944-11 and 2008-12 (comma separated values)

  2. "acc_sims.csv" (from

  3. "preds_sims.csv" (from

  4. An index of years and Parliamentary sessions: "years_session_index.csv"

  5. ML classification accuracies for all sessions "acc_j27_allmembers.csv" (from

  6. "SAG_speaker_prob_estimates_allmembers_j27.csv" (from
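For readers unfamiliar with the quantity stored in the accuracy files, the sketch below shows how a per-session classification accuracy can be computed from true and predicted labels. It is an illustration only: the record layout `(session, true_label, predicted_label)` is an assumption and does not describe the actual column structure of the included csv files.

```python
from collections import defaultdict


def accuracy_by_session(records):
    """Compute the share of correctly classified items per session.

    `records` is an iterable of (session, true_label, predicted_label)
    tuples; this layout is assumed for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for session, truth, predicted in records:
        total[session] += 1
        if truth == predicted:
            correct[session] += 1
    return {session: correct[session] / total[session] for session in total}
```

Higher accuracy in a session means the classifier finds it easier to tell the parties apart from their speech in that session.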


Simulated data

  1. sim_speeches.r (generates the data)
  2. (run ML classifiers)

UK Parliament: Preparing speech data, running ML classifiers

These scripts explain our process but require the complete XML data to generate the full results. Scripts (2) and (3) can, however, be run on the included csv files to generate accuracy statistics and predicted values.

  1. (converts xml to csv). Run from the command line with: python

  2. (generates sparse document term matrices, augmented with topic indicators). Run from the command line with: python

  3. (run ML classifiers) Run from the command line with: python
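The idea behind step (2), a sparse document-term matrix augmented with topic indicators, can be sketched with the standard library alone. This is not the replication code: the function name, the whitespace tokenizer, and the dict-of-counts row format are assumptions chosen to keep the example self-contained, whereas the actual scripts presumably rely on a dedicated sparse matrix library.

```python
from collections import Counter


def build_augmented_dtm(speeches, topics):
    """Build sparse word-count rows with one topic dummy per speech.

    Returns (rows, columns): each row maps column index -> value, and
    `columns` names the word columns followed by the topic columns.
    """
    if len(speeches) != len(topics):
        raise ValueError("speeches and topics must have the same length")

    # Naive lowercase/whitespace tokenization (an illustrative assumption).
    tokenized = [speech.lower().split() for speech in speeches]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    topic_levels = sorted(set(topics))

    columns = ["word:" + w for w in vocab] + ["topic:" + t for t in topic_levels]
    word_col = {w: i for i, w in enumerate(vocab)}
    # Topic dummies are appended after the last word column.
    topic_col = {t: len(vocab) + i for i, t in enumerate(topic_levels)}

    rows = []
    for toks, topic in zip(tokenized, topics):
        row = dict(Counter(word_col[tok] for tok in toks))  # nonzero counts only
        row[topic_col[topic]] = 1
        rows.append(row)
    return rows, columns
```

Storing only the nonzero entries is what makes the representation sparse: a speech uses a small fraction of the vocabulary, so most cells of the full matrix would be zero.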

Generating the figures

NB: These scripts can be run independently, as they refer only to the included data.

  1. Figure 1 (Figure_1.r)
  2. Figure 2 (Figure_2.r)
  3. Figure 3 (Figure_3.r)
  4. Figure 4 (Figure_4.r)
  5. Figure 5 (Figure_5.r)

Additional Files

  1. Example log files in the /log_files/ directory.
  2. (explanation of changes for generating bootstrapped estimates.)
  3. (similar to but using the xgboost classifier)
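As background for the bootstrapped estimates mentioned in item (2), the sketch below shows a generic percentile bootstrap for a classification accuracy. It is not the authors' procedure, which is described in the file listed above; the function name, the default number of resamples, and the input format (one 0/1 flag per classified item) are all assumptions.

```python
import random


def bootstrap_accuracy_interval(correct_flags, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an accuracy.

    `correct_flags` holds 1 (or True) for each correctly classified item
    and 0 (or False) otherwise. Returns (lower, upper).
    """
    flags = list(correct_flags)
    if not flags:
        raise ValueError("correct_flags must not be empty")

    rng = random.Random(seed)  # fixed seed for reproducible resampling
    n = len(flags)
    # Resample items with replacement and recompute the accuracy each time.
    estimates = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))

    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Resampling the classified items, rather than refitting the classifier, captures only the sampling variability of the accuracy estimate for a fixed model; a procedure that refits on each resample would be more expensive but would also reflect estimation uncertainty in the model.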


  • Andrew Peterson - Postdoctoral Researcher - University of Geneva

  • Arthur Spirling - Associate Professor of Politics and Data Science - New York University


This project is open source under the BSD 3-Clause License.



Replication files for "Classification Accuracy as a Substantive Quantity of Interest" Political Analysis Letters






