Investigating opinions on public policies in digital media: setting up a supervised machine learning tool for stance classification
This repository contains the data and code to reproduce the results of our paper.
Abstract: Supervised machine learning (SML) provides us with tools to efficiently scrutinize large corpora of communication texts. Yet, setting up such a tool involves plenty of decisions starting with the data needed for training, the selection of an algorithm, and the details of model training. We aim at establishing a firm link between communication research tasks and the corresponding state-of-the-art in natural language processing research by systematically comparing the performance of different automatic text analysis approaches. We do this for a challenging task – stance detection of opinions on policy measures to tackle the COVID-19 pandemic in Germany voiced on Twitter. Our results add evidence that pre-trained language models such as BERT outperform feature-based and other neural network approaches. Yet, the gains one can achieve differ greatly depending on the specific merits of pre-training (i.e., use of different language models). Adding to the robustness of our conclusions, we run a generalizability check with a different use case in terms of language and topic. Additionally, we illustrate how the amount and quality of training data affect model performance pointing to potential compensation effects. Based on our results, we derive important practical implications for setting up such SML tools to study communication texts.
Disclaimer:
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
In order to reproduce our experiments, the corresponding data needs to be downloaded. We cannot publish some of the data because Twitter's Terms of Service only allow the publication of Tweet IDs. The tweets need to be downloaded using these IDs or can be requested from the authors (for research purposes only).
- Covid-19 Stance Data
  - IMPORTANT: the data reading scripts expect the first row in the Covid-19 data files to be the Tweet text.
  - Covid-19 Student Training Data: We provide the files containing the labels, annotator IDs, annotator group IDs, annotation round IDs, and Tweet IDs for the different training data files (`covid19-student-all.tsv`, `covid19-student-alpha06.tsv`, `covid19-student-alpha07.tsv`, `covid19-student-alpha08.tsv`). For more information on the annotation procedure, please refer to our corresponding publication.
  - Covid-19 Expert Test Data: We provide the labels and Tweet IDs for the test data in the file `covid19_expert.tsv`.
- SemEval 2016 Task 6: We use the following files for our experiments: `trainingdata-all-annotations.txt`, `trialdata-all-annotations.txt`, and `testdata-taskA-all-annotations.txt`. Download them and put them in a folder `semeval2016`.
- Dutch Sentiment Analysis: We used the `sentences_ml.csv` file for our experiments.
- Covid-19 Twitter data: We used the corona100d dataset by Rieger and von Nordheim (2021). The dataset can be re-created using their code repository and should be stored in a JSONL data file format.
We provide the weights for the best-performing model (i.e., `GerBert-Twitter`) described in our paper.
This code was developed using Python 3.6 on a Linux machine.
Download the repository to your local machine.
In a terminal, create a virtual environment `venv`, activate it, and install the necessary dependencies:

```shell
python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
To perform hyperparameter optimization across 5 folds for the sentiment classification task (van Atteveldt et al., 2021) using a `bert-base-uncased` BERT model, we run

```shell
python run_hyper_optimization.py --task sentiment --training_input /path/to/sentiment/data --output_dir ./results/hyperparameter_optimization/sentiment --modelname bert-base-uncased --folds 5
```
Similarly, the script can be used to perform hyperparameter optimization for the stance classification task by providing `--task stance` and the corresponding data files.
Once we have identified optimal hyperparameters, we can use them to train a classification model for stance prediction:

```shell
python run_stance.py --training_input /path/to/stance/training/data --test_input /path/to/stance/test/data --hyperparameters /path/to/hyperparameter/json/file --output_dir /path/to/output
```
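The `--hyperparameters` argument expects a JSON file. Its exact keys depend on the search space explored during optimization; purely as a hypothetical illustration, a file of this shape could be written like so:

```python
import json

# Hypothetical hyperparameter values -- the actual keys and ranges
# depend on the search space used in run_hyper_optimization.py.
hyperparameters = {
    "learning_rate": 2e-5,
    "batch_size": 16,
    "num_epochs": 3,
}

with open("hyperparameters.json", "w") as f:
    json.dump(hyperparameters, f, indent=2)
```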
The model and seed can be controlled via the `--modelname` and `--seeds` parameters, respectively.
The script for sentiment analysis works in a similar way. Once we have identified optimal hyperparameters, we can use them to train a classification model for sentiment prediction:

```shell
python run_sentiment.py --training_input /path/to/sentiment/training/data --test_input /path/to/sentiment/test/data --hyperparameters /path/to/hyperparameter/json/file --output_dir /path/to/output
```
The model and seed can be controlled via the `--modelname` and `--seeds` parameters, respectively.
To train a model using both task-related (Covid-19 Stance) and task-distant (SemEval 2016 Task 6) training data, the following script can be used. Here, for convenience, hyperparameter optimization is directly integrated into the training procedure.

```shell
python run_transfer.py --input_dir /path/to/sentiment/training/data --output_dir /path/to/output --trials 5
```
The `--trials` parameter specifies the number of hyperparameter search trials performed before the best-performing set of hyperparameters is selected.
If you are interested in performance in relation to the available (annotated) training data size (i.e., a few-shot scenario), use the following script:

```shell
python run_fewshot.py --training_input /path/to/training/data --test_input /path/to/test/data --hyperparameters /path/to/hyperparameter/json/file --folds 5 --output_dir /path/to/output --size numbers
```
The parameter `--size` defines the scenario: either increasing percentages of the training data (`--size percentage`, i.e. `[.1, .2, .3, ..., .9, 1.0]`) or fixed numbers of data points (`--size number`, i.e. `[200, 400, 600, 800, 1000, 1200]`).
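Drawing such subsets amounts to sampling a fraction or a fixed count of the training examples. A minimal sketch of the idea (the actual sampling logic in `run_fewshot.py` may differ, e.g., in seeding or stratification):

```python
import random

def subsample(examples, size, seed=42):
    """Draw a training subset for a few-shot run.

    `size` is either a float in (0, 1] (a fraction of the data, as with
    --size percentage) or an int (an absolute number of data points, as
    with --size number). Illustrative only; run_fewshot.py may sample
    differently.
    """
    n = round(len(examples) * size) if isinstance(size, float) else size
    rng = random.Random(seed)  # fixed seed keeps subsets reproducible
    return rng.sample(examples, min(n, len(examples)))
```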
We used the corona100d dataset by Rieger and von Nordheim (2021), which can be re-created using the original repository on GitHub. The dataset should be stored in a JSONL data file format. After downloading the dataset, we can apply our preprocessing steps (i.e., replace newlines and symbols, convert emojis to text, etc.) using the following command:
```shell
python extract.py --inputfile /path/to/dataset/file --outputfile /path/to/output/file
```
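The core of such preprocessing can be sketched with the standard library alone. Note that the emoji table below is a tiny hand-written stand-in; `extract.py` may instead rely on a dedicated emoji library and additional normalization rules:

```python
import re

# Tiny illustrative emoji table -- the real script likely covers the
# full emoji range via a library rather than a hand-written mapping.
EMOJI_TO_TEXT = {"😷": ":face_with_medical_mask:", "👍": ":thumbs_up:"}

def preprocess(text):
    """Flatten a tweet to a single line and spell out emojis."""
    text = re.sub(r"\s*\n\s*", " ", text)           # replace newlines
    for emoji, name in EMOJI_TO_TEXT.items():       # convert emojis to text
        text = text.replace(emoji, f" {name} ")
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace
```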
We can use the resulting file to continue masked language modeling with an existing pretrained language model (e.g., `bert-base-german-cased`):

```shell
python run_mlm.py --model_name_or_path bert-base-german-cased --train_file /path/to/output/file --validation_split_percentage 5 --max_seq_length 256 --line_by_line --output_dir /path/to/output/model/weights --do_train --per_device_train_batch_size 32 --learning_rate 0.0001
```
This repository contains different sub-projects:

```
<ROOT>/src
├── evaluation/
├── pretraining/
├── analysis/
├── util/
├── run_fewshot.py
├── run_hyper_optimization.py
├── run_sentiment.py
├── run_stance.py
└── run_transfer.py
```
- `evaluation`: Contains our evaluation framework that we use to evaluate hyperparameter optimization and experiments using multiple random seeds.
- `pretraining`: Provides the code for preprocessing and running masked language modeling using a pretrained language model.
- `analysis`: Various methods for analysis of results and datasets.
- `util`: Provides the software that we used to translate additional training data (i.e., SemEval 2016 Task 6) to German (using DeepL) and other useful utility methods.
If you find this repository helpful, feel free to cite our publication:

```bibtex
@article{viehmannCov19Stance,
  title = {Investigating Opinions on Public Policies in Digital Media: Setting up a Supervised Machine Learning Tool for Stance Classification},
  author = {Viehmann, Christina and Beck, Tilman and Maurer, Marcus and Quiring, Oliver and Gurevych, Iryna},
  journal = {Communication Methods and Measures},
  year = {2022},
  pages = {XX-XX},
  url = {https://doi.org/10.1080/19312458.2022.2151579}
}
```
Contact persons:
Christina Viehmann, https://www.kowi.ifp.uni-mainz.de/
Tilman Beck, https://www.ukp.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.