-
Notifications
You must be signed in to change notification settings - Fork 2
/
bertcnn-classifying-manifestos.Rmd
614 lines (522 loc) · 31.9 KB
/
bertcnn-classifying-manifestos.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
---
title: "Predicting Policy Domains and Preferences with BERT and Convolutional Neural Networks"
description: |
We present a deep learning approach to classifying labeled texts and phrases in party manifestos, using the coding scheme and documents from the Manifesto Project Corpus.
author:
- name: Allison Koh
url: http://allisonkoh.github.io
- name: Daniel Kai Sheng Boey
categories:
- Natural Language Processing with Deep Learning
date: 10-10-2020
bibliography: bibliography.bib
output:
distill::distill_article:
self_contained: true
preview: figures/BERTfig3.png
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
# Load dependencies
library(tidyverse) # for all the fun tidy data things
library(viridis) # pretty colors
library(tibble) # for row-wise tibble creation
library(reticulate) # For rendering Python code
# ggplot defaults
ggplot2::theme_set(theme_minimal())
```
```{r head-img, echo = FALSE, layout = "l-screen-inset"}
knitr::include_graphics("figures/prf1-major.png")
```
{#abstract}
============
Hand-labeled political texts are often required in empirical studies
on party systems, coalition building, agenda setting, and many other
areas in political science research. While hand-labeling remains the
standard procedure for analyzing political texts, it can be slow and
expensive, and subject to human error and disagreement. Recent studies
in the field have leveraged supervised machine learning techniques to
automate the labeling process of electoral programs, debate motions,
and other relevant documents. We build on current approaches to label
shorter texts and phrases in party manifestos using a pre-existing
coding scheme developed by political scientists for classifying texts
by policy domain and policy preference. Using labels and data compiled
by the Manifesto Project, we make use of the state-of-the-art
Bidirectional Encoder Representations from Transformers (BERT) in
conjunction with Convolutional Neural Networks (CNN) and Gated
Recurrent Units (GRU) to seek out the best model architecture for
policy domain and policy preference classification. We find that our
proposed BERT-CNN model outperforms other approaches for the task of
classifying statements from English language party manifestos by major
policy domain.
## Background {#background}
During campaigns, political actors communicate their position on a range
of key issues to signal campaign promises and gain favor with
constituents. Whilst identifying the political positions of political
actors provides no certainty with regards to whether they act upon their
policy preferences, it remains essential to understanding their intended
political actions. Quantitative methods, especially in the field of natural language
processing, have enabled the development of more scalable methods for
predicting policy preferences. These advancements have enabled political
scientists to analyze political texts and estimate their positions over
time [@nanni2016topfish; @zirn2016classifying]. To better understand
the political positions of political actors, many social science
researchers have turned to hand-labeling political documents, such as
parliamentary debate motions and party manifestos. Much of the previous
work on analyzing political texts relies on hand-labeling documents
[@abercrombie2018sentiment; @gilardi2009learning; @krause2011policy; @simmons2004globalization].
Yet, the analysis of political documents in this field stands to benefit
from automating the coding of texts using supervised machine learning.
Most recently, neural networks and deep language representation models
have been employed in state-of-the-art approaches to automatic labeling
of political texts by policy preference.
In this article, we present a deep learning approach to classifying
labeled texts and phrases in party manifestos, using the coding scheme
and documents from Manifesto Project [@volkens2019manifesto]. We use
English-language texts from the Manifesto Project Corpus, which divides
party manifestos into statements---or *quasi-sentences*---that do not
span more than one grammatical sentence. Based on the state-of-the-art
deep learning methods for text classification, we propose using
Bidirectional Encoder Representations from Transformers (BERT) combined
with neural networks to automate the task of labeling political texts.
We compare our models that combine BERT and neural
networks against previous experiments with similar architectures to
establish that our proposed method outperforms other approaches commonly
used in natural language processing research when it comes to choosing
the correct policy domain and policy preference. We identify differences
in performance across policy domains, paving the way for future work on
improving deep learning models for classifying political texts.
## Related Work {#related_work}
Several studies have concentrated on building scaling models that
identify the political position of texts
[@glavavs2017cross; @laver2003extracting; @nanni2019political; @proksch2010position].
Previously, most of the seminal work in this area has overlooked the
task of classifying texts by topic or policy area prior to detecting
policy preferences associated with the topic. Over the past couple of
years, several studies have addressed the gap in *opinion-topic
identification* by classifying text data from political speeches,
manifestos, and other documents by topic before predicting policy
preference. Perhaps most relevant to our research is the paper by @zirn2016classifying, in
which the authors trained and validated an approach to classifying
manifestos from the United States into seven policy domains that
involved binary classifiers predicting whether sentences that are
adjacent to one another belong to the same topic[^1]. Their proposed
approach of optimizing predictions using a Markov Logic framework
yielded an average micro-F1 score of .749. @glavavs2017cross introduced a multi-lingual
classifier for automatically labeling texts by policy domain. For
classification of 20,196 English-language manifestos by policy domain,
their CNN models yielded an average micro-F1 score of .59.
More recently, studies have employed neural networks and deep language
representation models to address the computationally intensive task of
classifying political texts into over thirty categories. To take on this
ambitious task, included contextual information about individual
quasi-sentences, specifically political party and the previous sentence
within a manifesto, into multi-scale convolutional neural networks with
word embeddings. Their best performing model for classifying 86,500
quasi-sentences from the Manifesto Project Corpus into the seven major
policy domains yielded an F1 score of .6532, and their best performing
model for classifying quasi-sentences by policy preference yielded an F1
score of .4273. propose employing a hierarchical sequential deep model
that captures information from within manifestos as well as contextual
information across manifestos to predict the political position of
texts. Their best performing hierarchical modeling approach for
classifying 86,603 English language quasi-sentences yielded an F1 score
of .50.
Abercrombie et al. (2019) [@abercrombie2019policy] used deep language representation models to detect the policy positions
of Members of Parliament in the United Kingdom. Using motions and
manifestos as data sources, the authors employed a variety of methods to
predict the policy and domain labels of texts. They propose utilizing
BERT for this task, with results fine-tuned with party manifestos and
the motions themselves. In addition to a final softmax layer, the
authors added a CNN model and max-pooling layers to the soft-max layer.
they found that the use of BERT demonstrated
state-of-the-art performance on both manifestos and motions via
supervised pipelines with a Macro-F1 score of 0.69 for their best
performing model. Our work builds on some of the methods proposed in
their paper, leveraging neural networks and deep language representation
models for classifying political texts.
## The Manifesto Project Corpus {#data}
The Manifesto Project Corpus[^2] provides
information on policy preferences of political parties from seven
different countries based on a coding scheme of 8 policy domains,
under which 58 policy preference codes are manually coded[^3]. The Manifesto
Project offers data that divides party manifestos into quasi-sentences,
or individual statements which do not span more than one grammatical
sentence. Quasi-sentences are then individually assigned to categories
pertaining to policy domain and preference. The 58 policy preference
codes, one of which is "not categorized", refer to the
position---positive or negative---of a party regarding a particular
policy area. The 58 policy preference codes fall into a macro-level
coding scheme comprising of 8 policy domain categories. In political
science research, the Manifesto Project Corpus is particularly useful
for studying party competition, the responsiveness of political parties
to constituent preferences, and estimating the ideological position of
political elites. While the official classification of manifestos in
this dataset has primarily relied on human coders, the investigation of
automatically detecting policy positions of the text data is valuable
for scaling up the classification of large volumes of political texts
available for analysis.
<aside>
Each of domain comprises of minor categories, also known as [policy preferences](https://manifesto-project.wzb.eu/down/data/2020a/codebooks/codebook_MPDataset_MPDS2020a.pdf). For instance, the policy preferences listed under the "External Relations" domain include:
- Foreign Special Relationships
- Anti-Imperialism
- Military
- Peace
- Internationalism
- European Community/Union
</aside>
Our final subset of all English-language manifestos comprises of 99,681
quasi-sentences. Figures \@ref(fig:desc-pd) and \@ref(fig:desc-country) illustrate the distribution of English-language
manifestos across countries and policy domains. To ensure that the ratio
between policy domains remains consistent across all categories in
running our models, we applied a 70/15/15 split between training,
validation, and test sets separately for all policy domains (major categories) and policy preferences (minor categories).
```{r desc-pd, eval = TRUE, echo = FALSE, out.width = '50%', fig.cap = "Quasi-sentences (QSs) from English language manifestos by policy domain"}
# knitr::include_graphics("figures/desc-major.png")
pd_df <- tribble(
~"Domain", ~"QSs",
"External Relations", 6580,
"Freedom and Democracy", 4700,
"Political System", 10557,
"Economy", 24757,
"Welfare and Quality of Life", 30750,
"Fabric of Society", 11099,
"Social Groups", 9910,
"Not categorized", 1328
) %>% mutate(Domain = as_factor(Domain))
ggplot(pd_df, aes(x = fct_rev(Domain), y = QSs, fill = QSs)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = QSs, hjust = -.1)) +
labs(x = "", y = "# QSs") +
ylim(0, 33000) +
scale_fill_viridis(option = "plasma") +
coord_flip() +
theme(legend.position = "none")
```
```{r desc-country, eval = TRUE, echo = FALSE, out.width = '50%', fig.cap = "Quasi-sentences (QSs) from English language manifestos by country"}
# knitr::include_graphics("figures/desc-country.png")
c_df <- tribble(
~"Country", ~"QSs",
"United States", 10819,
"South Africa", 6423,
"New Zealand", 28561,
"Ireland", 25352,
"Great Britain", 14839,
"Canada", 3047,
"Australia", 10370,
) %>% mutate(Domain = as_factor(Country))
ggplot(c_df, aes(x = fct_rev(Country), y = QSs, fill = QSs)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = QSs, hjust = -.1)) +
labs(x = "", y = "# QSs") +
ylim(0, 33000) +
scale_fill_viridis(option = "plasma") +
coord_flip() +
theme(legend.position = "none")
```
## Experimental Setup {#methodology}
### Bidirectional Encoder Representations from Transformers (BERT)
Bidirectional Encoder Representations from Transformers
(BERT) have proven successful in prior attempts to
classify phrases and short texts [@devlin2018bert].
BERT's key innovation lies in its ability to apply
bidirectional training of transformers to language modeling. This
state-of-the-art deep language representation model uses a "masked
language model", enabling it to overcome restrictions caused by the
unidirectional constraint.
Our experiments use the standard pre-trained BERT
transformers as the embedding layer in our model. Since
BERT is trained on sequences with a maximum lengths of 512
tokens, all quasi-sentences with more than 510 words were trimmed to fit
this requirement. Pre-trained embeddings were frozen and not trained for
the base models. We test two variants of BERT---one
incorporating a bidirectional GRU model, and another incorporating CNNs.
Model specifications and training times for our neural networks and deep
language representation models are shown in Table
[1](#tab:modelspec){reference-type="ref" reference="tab:modelspec"} and
Figure \@ref(fig:tt).
::: {#tab:modelspec}
| Models | Text Representation | Layers | Epochs |
|----------|-----------------------|------------------------------------------------------------------------------------------------------|--------|
| CNN | GloVe Wikipedia w-emb | 2 Convolutional Layers (1 per filter)<br>2 Max Pooling Layers<br>1 Dropout Layer <br>1 Linear Layer | 100 |
| BERT-CNN | Base BERT (uncased) | 2 Convolutional Layers (1 per filter)<br>2 Max Pooling Layers <br>1 Dropout Layer <br>1 Linear Layer | 10 |
| BERT-GRU | Base BERT (uncased) | 1 Bidirectional GRU RNN Layer <br>1 Dropout Layer <br>1 Linear Layer | 10 |
Table: Table 1: Model specifications of neural networks and deep language representation models
:::
```{r tt, out.width = '100%', fig.cap = "Training time for neural networks and deep language representation models for classifying political texts by *major* and *minor* policy domain"}
knitr::include_graphics("figures/training_time.png")
```
### BERT with Gated Recurrent Units (GRU)
First proposed by Cho et al. (2014) [@cho2014learning], Gated Recurrent Units---formerly referred
to as the RNN Encoder-Decoder model---use update gates and reset gates
to solve the vanishing gradient problems often encountered in
applications of recurrent neural networks [@kanai2017preventing]. The
update gate helps the model determine the extent to which past
information is carried on in the model whilst the reset gate determines
the information to be removed from the model [@chung2014empirical].
Hence, it solves the aforementioned problem by not completely removing
the new input, instead keeping relevant information to pass on to
further subsequent computed states. In our analysis, we employ a
multi-layer, bidirectional GRU model from PyTorch[^3] (Table
[1](#tab:modelspec){reference-type="ref" reference="tab:modelspec"}). The
results are subject to a dropout layer prior to classification via a
linear layer.
### BERT with Convolutional Neural Networks (CNN)
We incorporate CNNs with BERT using the same CNN
architecture as our baselines (Table
[1](#tab:modelspec){reference-type="ref" reference="tab:modelspec"}).
The model utilizes the aforementioned BERT base, uncased
tokenizer with convolutional filters of sizes 2 and 3 applied with a
ReLu activation function. We use a 1D-max pooling layer, a dropout layer
($N = 0.5$) to prevent overfitting, and a Cross Entropy Loss function.
We employ the model to classify policy domains ($N = 8$) and policy
preferences ($N = 58$), each of which includes a category for
quasi-sentences that do not fall into this classification scheme.
Hereafter, we refer to these classifications as 'major' and 'minor'
categories, respectively. A graphical representation of our model is
shown in Figure \@ref(fig:bertcnn-fig).
```{r bertcnn-fig, out.width = '50%', fig.cap = "Graphical representation of the base BERT-CNN model to predict major policy domains"}
knitr::include_graphics("figures/BERTfig3.png")
```
### Evaluation
We evaluate the performance of our proposed method against several
baselines:
- **Multinomial Naive Bayes** [@su2011large]: This algorithm, commonly used in text
classification, operates on the *Bag of Words assumption* and the
assumption of *Conditional independence*.
- **Support Vector Machines** [@tong2001support]: We used this
traditional binary classifier to calculate baselines with the `SVC`
package from `scikit-learn`[^4], employing a "one-against-one"
approach for multi-class classification.
- **Convolutional Neural Networks (CNN)**
[@DBLP:journals/corr/Kim14f; @lecun1998gradient]: To run this deep
learning model, originally designed for image classification, we
first made use of pre-trained word vectors trained by GloVe, an
unsupervised learning algorithm for obtaining vector representations
for words [@Pennington_Socher_Manning_2014].
To evaluate model fit, we utilized *accuracy* and *loss* as key metrics
to compare performance of our *CNN* baseline
and our proposed models (BERT-CNN,
BERT-GRU). We calculated the *F1-score* for each model
that we ran. In our results, we present both the Macro-F1 score and
Micro-F1 score[^6].
### Architecture fine tuning
We tested different modifications of the CNN and BERT
models. For the CNN models, we compared the following modifications:
- **Stemming and Lemmatization**: We test whether stemming or
lemmatizing text in the pre-processing steps improves predictions
using quasi-sentences from the Manifesto Project Corpus.
- **Dropout rates**: We decreased the dropout rate from 0.5 to 0.25 to
determine whether fine-tuning dropout rates yield differences in
performance. This is because we initially found that our models were
overfitting.
- **Additional linear layer**: An additional linear layer was added
prior to the final categorization linear layer to establish whether
"deeper" neural networks generate improved predictions.
- **Removal of uncategorized quasi-sentences**: We removed quasi-sentences
that were labeled as "not categorized" to investigate whether predictions improve
with an altered classification scheme of 7 specified policy domains and 57
policy preference codes[^7].
For the BERT models, we compared the following
modifications:
- **Training Embeddings**: For our base BERT models, all
training of embeddings were frozen. Therefore, we enable training of
the embeddings in this modification to establish how training
embeddings contributes to the performance of deep language
representation models with this classification task.
- **Training models based on recurrent runs**: We trialed training the
BERT models sequentially with different learning rates
(LR = 0.001, 0.0005 and 0.0001) of 10 epochs each for a total of 30
epochs in aims to improve the performance of our neural networks and
deep language representation models.
- **Large, cased tokenizer**: The BERT Large cased
tokenizer was used instead of the BERT BASE uncased
tokenizer employed in our base models.
### Results
As shown in Table [2](#tab:modelresults), the BERT-CNN model
performed best for predicting both major and minor categories compared
to the BERT-GRU model and CNN baseline. However, our SVM baseline
outperformed the neural network models for predicting minor categories.
We believe that the shortcomings of our neural networks and deep
language representation models for this text classification task are due
to limitations in specifying the number of epochs in training. We also
observed overfitting in our models. For instance, with our CNN model,
validation loss increased with each additional epoch after a certain
number of epochs. As shown in Figure \@ref(fig:major-overfitting), training accuracy of this model
also increased at the cost of validation accuracy. However, this was not
the case for deep language representation models classifying texts by
minor categories. Overall, our results show that, between the two
BERT models, the BERT-CNN model demonstrates
superior performance against bag-of-words approaches and other models
that utilize neural networks.
::: {#tab:modelresults}
| Category | Model | Test Loss | Test Accuracy | Micro-F1 | Macro-F1 |
|----------|----------|-----------|---------------|----------|----------|
| _Major_
| | MNB | --- | 0.553 | 0.553 | 0.398 |
| | SVM | --- | 0.578 | 0.578 | 0.460 |
| | CNN | 1.177 | 0.589 | 0.589 | 0.466 |
| | BERT-GRU | 1.166 | 0.594 | 0.593 | 0.479 |
| | BERT-CNN | __1.152__ | __0.591__ | __0.591__ | __0.473__
| _Minor_
| | MNB | --- | 0.385 | 0.385 | 0.154 |
| | SVM | --- | __0.463__ | __0.463__ | __0.299__ |
| | CNN | 2.136 | 0.454 | 0.454 | 0.273 |
| | BERT-GRU | 2.216 | 0.432 | 0.432 | 0.239 |
| | BERT-CNN | __2.098__ | 0.448 | 0.448 | 0.260 |
Table: Table 2: Baseline, CNN and BERT models run with base model specifications as detailed in Table [1](#tab:modelspec)
:::
```{r major-overfitting, out.width = '50%', fig.cap = "An illustration of overfitting in our CNN model for classifying manifesto QSs by major policy domain"}
knitr::include_graphics("figures/cnnmaj_acc_loss.png")
```
### CNN and BERT Modifications {#cnn-and-bert-modifications .unnumbered}
Comparing modifications to our CNN models, our results suggest that the
base model outperforms most alternative model specifications. As
outlined in Table
[3](tab:model-modifications), reducing the dropout rate to 0.25 improved
the model on some indicators marginally. As expected, the removal of
uncategorized quasi-sentences yielded improvements in predictions, with
a significantly higher Macro-F1 score compared to other model
specifications. Based on these results, future work should focus on how
model predictions of uncategorized quasi-sentences can be improved,
given their random nature.
::: {#tab:model-modifications}
| Model | Modification | Test Loss | Test Accuracy | Micro-F1 | Macro-F1 | Epochs |
|----------|----------------------------|-----------|---------------|----------|----------|--------|
| _CNN_ | | | | | | |
| | Base model | 1.177 | __0.589__ | __0.589__ | 0.466 | 100 |
| | Lemmatized text | __1.174__ | 0.585 | 0.585 | 0.46 | 100 |
| | Stemmed text | 1.213 | 0.577 | 0.576 | 0.448 | 100 |
| | Dropout = 0.25 | 1.177 | __0.589__ | 0.588 | __0.467__ | 100 |
| | Additional layer | 1.18 | 0.586 | 0.586 | 0.462 | 100 |
| | Removing uncategorized QSs | __1.136__ | __0.596__ | __0.595__ | __0.535__ | 100 |
| _BERT-GRU_ | | | | | | |
| | Base model | __1.152__ | __0.594__ | __0.593__ | __0.479__ | 10 |
| | Training emb | 1.163 | 0.592 | 0.592 | __0.479__ | 10 |
| | Recurrent runs, training | 1.234 | 0.582 | 0.581 | 0.459 | 30 |
| | Large, uncased | 1.172 | 0.592 | 0.591 | 0.469 | 10 |
| _BERT-CNN_ | | | | | | |
| | Base model | 1.166 | __0.591__ | __0.591__ | __0.473__ | 10 |
| | Training emb | 1.167 | 0.587 | 0.587 | 0.458 | 10 |
| | Recurrent runs, training | __1.157__ | 0.589 | 0.589 | 0.468 | 30 |
| | Large, uncased | 1.192 | 0.58 | 0.58 | 0.45 | 10 |
Table: Table 3: Comparing results of modifications to CNN and BERT base models for predicting policy domains
:::
While we observed some improvements with modifications to the CNN model,
we find that our base BERT models performed best compared
to other fine-tuned modifications to model architecture. The results of
our base BERT model and alternative model specifications
are shown in Table
[3](tab:model-modifications). Even though it is possible that our base
BERT model is best for this classification model, our
results could also indicate the presence of over-fitting or the lack of
sufficient training available given the low number of epochs.
## Limitations and Analysis {#discussion}
As shown in Figure \@ref(fig:major-overfitting), we observed overfitting with our
major policy domain classification models. Despite employing changes and
modifications to our models, including varied dropout rates,
architecture fine-tuning and different learning rates, we did not find
any variants of the models employed in analysis that would yield
significant improvements in performance. We posit that potential
improvements on these issues could be resolved by employing transfer
learning and appending our sample of English-language manifestos with
other political documents, such as debate transcripts.
In contrast, as shown in Figure \@ref(fig:minor-overfitting), we observed little over-fitting in
our minor policy domain classification models. Our classifier could
benefit from employing transfer learning and appending our sample of
manifesto quasi-sentences with other political texts, especially for
policy domains with relatively fewer quasi-sentences to train on. It is
also important to note that, compared to the more computationally
intensive neural networks and deep language representation models, our
Multinomial Bayes and SVM baselines did not perform significantly worse.
In fact, for the minor categories, the SVM yielded superior performance
in some metrics compared to that of the neural network models.
Notwithstanding the lack of training of certain models, this may suggest
that increasing the model complexity and consequently the computational
power required may not necessarily lead to increased model performance.
```{r minor-overfitting, out.width = '50%', fig.cap = "Training and validation metrics for the BERT-CNN model for classifying English language manifestos by policy preference"}
knitr::include_graphics("figures/bertcnnmin_acc_loss.png")
```
Substantially lower Macro-F1 scores across all models point to mixed
performance in classification by category. As shown in Figure
\@ref(fig:prf1-major), we observe high variation in the performance
of our classifiers between categories. However, we observe poor
performance in classifying quasi-sentences that do not belong to one of
the major policy domains. For our BERT-CNN model, the easiest categories
to predict were "welfare and quality of life", "economy", and "freedom
and democracy". The superior performance of predicting the first two
categories is not particularly surprising, as a substantial number of
quasi-sentences in our sample of English-language party manifestos are
attributed to these topics. As shown in Figure \@ref(fig:desc-pd),
30,750 quasi-sentences are attributed to the "welfare and quality of
life" category and 24,757 quasi-sentences are attributed to the
"economy" domain.
```{r prf1-major, out.width = '50%', fig.cap = "Average precision, recall, and Macro-F1 scores by major policy domain across all models"}
knitr::include_graphics("figures/prf1-major.png")
```
<!-- <aside> -->
<!-- The numbers on the x-axis correspond to their respective policy domains: -->
<!-- - __0__: Not categorized -->
<!-- - __1__: External Relations -->
<!-- - __2__: Freedom and Democracy -->
<!-- - __3__: Political System -->
<!-- - __4__: Economy -->
<!-- - __5__: Welfare and Quality of Life -->
<!-- - __6__: Fabric of Society -->
<!-- - __7__: Social Groups -->
<!-- </aside> -->
In contrast, the relatively superior performance of predicting the
"freedom and democracy" category is surprising. Out of our total sample
of $n_{\mathrm{sentences}}=99,681$, only 4,700 documents are attributed
to the "freedom and democracy" category. Intuitively, the performance of
our classifier with this underrepresented policy domain could be
attributed to a variety of possible explanations. One is the presence of distinct features such as topic-unique
vocabulary that do not exist in other categories. Future work on
classification of political documents that fall under this category
would benefit from looking into features that might distinguish this
policy domain from others.
## Conclusion {#conclusion}
In this paper, we trained two variants of
BERT---one incorporating a bidirectional GRU model, and
another incorporating CNNs. We demonstrate the superior performance of
deep language representation models combined with neural networks to
classify political domains and preferences in the Manifesto Project. Our
proposed method of incorporating BERT with neural networks
for classifying English language manifestos addresses issues of
reproducibility and scalability in labeling large volumes of political
texts. As far as we know, this is the most comprehensive application of
deep language representation models and neural networks for classifying
statements from political manifestos.
We find that using BERT in conjunction with CNNs yields the best predictions for classifying English
language statements parsed from party manifestos. However, our proposed
BERT-CNN model requires further fine-tuning to be
effective in providing acceptable predictions to improve on less
computationally intensive methods and replace human annotations of
fine-grained policy positions. As expected, our proposed approach and
baselines perform better for classifying major policy domains over minor
categories. We also observe differences in performance between
categories. Among the major policy domains, the categories that
performed best include "welfare and quality of life", "economy", and
"freedom and democracy". The superior performance of the latter category
is surprising because it makes up the smallest proportion of
quasi-sentences in the Manifesto Project Corpus.
There are several avenues for future work on neural networks and deep
language representation models for automatically labeling political
texts. For instance, investigating the features of individual categories
that demonstrate superior performance would shed light on how we can
incorporate additional features of texts to improve model performance.
This area of research would also benefit from better understanding how
we can filter out texts that do not fall into a particular
classification scheme. Knowledge on how these issues could be resolved
to improve model performance would allow for extensions in the
application of deep learning models for classifying political texts.
[^1]: The data used in analysis comprises of statements from six
Democratic and Republican election manifestos from the 2004, 2008
and 2012 elections in the United States.
[^2]: <https://manifesto-project.wzb.eu/>
[^3]: The classification scheme of 8 policy domains and 58 policy preferences each include a category of quasi-sentences that are not considered part of any meaningful category.
[^4]: <https://pytorch.org/>
[^5]: <https://scikit-learn.org/stable/>
[^6]: The micro score calculates metrics globally, whilst the macro score
calculates metrics for each label and reports the unweighted mean.
[^7]: This modification is motivated by the results from our base models, which yield lower Macro-F1 scores due to the difficulty of correctly identifying the quasi-sentences that were "not categorized".