Investigating opinions on public policies in digital media: setting up a supervised machine learning tool for stance classification
This repository contains the data and code to reproduce the results of our paper.
Abstract: Supervised machine learning (SML) provides us with tools to efficiently scrutinize large corpora of communication texts. Yet, setting up such a tool involves plenty of decisions starting with the data needed for training, the selection of an algorithm, and the details of model training. We aim at establishing a firm link between communication research tasks and the corresponding state-of-the-art in natural language processing research by systematically comparing the performance of different automatic text analysis approaches. We do this for a challenging task – stance detection of opinions on policy measures to tackle the COVID-19 pandemic in Germany voiced on Twitter. Our results add evidence that pre-trained language models such as BERT outperform feature-based and other neural network approaches. Yet, the gains one can achieve differ greatly depending on the specific merits of pre-training (i.e., use of different language models). Adding to the robustness of our conclusions, we run a generalizability check with a different use case in terms of language and topic. Additionally, we illustrate how the amount and quality of training data affect model performance pointing to potential compensation effects. Based on our results, we derive important practical implications for setting up such SML tools to study communication texts.
Disclaimer:
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
In order to reproduce our experiments, the corresponding data needs to be downloaded. We cannot publish some of the data because Twitter's Terms of Service only allow the publication of Tweet IDs. The tweets need to be downloaded using these IDs or can be requested from the authors (for research purposes only).
- Covid-19 Stance Data
  - IMPORTANT: the data reading scripts expect the first row in the Covid-19 data files to be the Tweet text.
  - Covid-19 Student Training Data: We provide the files containing the labels, annotator IDs, annotator group IDs, annotation round IDs, and Tweet IDs for the different training data files (`covid19-student-all.tsv`, `covid19-student-alpha06.tsv`, `covid19-student-alpha07.tsv`, `covid19-student-alpha08.tsv`). For more information on the annotation procedure, please refer to our corresponding publication.
  - Covid-19 Expert Test Data: We provide the labels and Tweet IDs for the test data in the file `covid19_expert.tsv`.
- SemEval 2016 Task 6: We use the following files for our experiments: `trainingdata-all-annotations.txt`, `trialdata-all-annotations.txt`, and `testdata-taskA-all-annotations.txt`. Download them and put them in a folder `semeval2016`.
- Dutch Sentiment Analysis: We used the `sentences_ml.csv` file for our experiments.
- Covid-19 Twitter data: We used the corona100d dataset by Rieger and von Nordheim (2021). The dataset can be re-created using their code repository and should be stored in a JSONL data file format.
We provide the weights for the best-performing model (i.e., `GerBert-Twitter`) described in our paper.
This code was developed using Python 3.6 on a Linux machine.
Download the repository to your local machine.
In a terminal, create a virtual environment `venv`, activate it, and install the necessary dependencies:

```shell
python3.6 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
To perform hyperparameter optimization across 5 folds for the sentiment classification task (van Atteveldt et al., 2021) using a `bert-base-uncased` BERT model, we run

```shell
python run_hyper_optimization.py --task sentiment --training_input /path/to/sentiment/data --output_dir ./results/hyperparameter_optimization/sentiment --modelname bert-base-uncased --folds 5
```
Similarly, the script can be used to perform hyperparameter optimization for the stance classification task by providing `--task stance` and the corresponding data files.
Once we have identified optimal hyperparameters, we can use them to train a classification model for stance prediction:

```shell
python run_stance.py --training_input /path/to/stance/training/data --test_input /path/to/stance/test/data --hyperparameters /path/to/hyperparameter/json/file --output_dir /path/to/output
```
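The `--hyperparameters` argument expects a JSON file. Its exact keys depend on the search space explored during optimization; purely as a hypothetical illustration, a file of this shape could be written like so:

```python
import json

# Hypothetical hyperparameter values -- the actual keys and ranges
# depend on the search space used in run_hyper_optimization.py.
hyperparameters = {
    "learning_rate": 2e-5,
    "batch_size": 16,
    "num_epochs": 3,
}

with open("hyperparameters.json", "w") as f:
    json.dump(hyperparameters, f, indent=2)
```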
The model and seed can be controlled via the `--modelname` and `--seeds` parameters, respectively.
The script for sentiment analysis works in a similar way. Once we have identified optimal hyperparameters, we can use them to train a classification model for sentiment prediction:

```shell
python run_sentiment.py --training_input /path/to/sentiment/training/data --test_input /path/to/sentiment/test/data --hyperparameters /path/to/hyperparameter/json/file --output_dir /path/to/output
```
The model and seed can be controlled via the `--modelname` and `--seeds` parameters, respectively.
To train a model using both task-related (Covid-19 Stance) and task-distant (SemEval 2016 Task 6) training data, the following script can be used. Here, for convenience, hyperparameter optimization is directly integrated into the training procedure.

```shell
python run_transfer.py --input_dir /path/to/sentiment/training/data --output_dir /path/to/output --trials 5
```
The `--trials` parameter specifies the number of hyperparameter search trials performed before the best-performing set of hyperparameters is selected.
If you are interested in performance in relation to the available (annotated) training data size (i.e., a few-shot scenario), use the following script:

```shell
python run_fewshot.py --training_input /path/to/training/data --test_input /path/to/test/data --hyperparameters /path/to/hyperparameter/json/file --folds 5 --output_dir /path/to/output --size numbers
```
The parameter `--size` defines the scenario: either increasing percentages of the training data (`--size percentage`, i.e. `[.1, .2, .3, ..., .9, 1.0]`) or fixed numbers of data points (`--size number`, i.e. `[200, 400, 600, 800, 1000, 1200]`).
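Drawing such subsets amounts to sampling a fraction or a fixed count of the training examples. A minimal sketch of the idea (the actual sampling logic in `run_fewshot.py` may differ, e.g., in seeding or stratification):

```python
import random

def subsample(examples, size, seed=42):
    """Draw a training subset for a few-shot run.

    `size` is either a float in (0, 1] (a fraction of the data, as with
    --size percentage) or an int (an absolute number of data points, as
    with --size number). Illustrative only; run_fewshot.py may sample
    differently.
    """
    n = round(len(examples) * size) if isinstance(size, float) else size
    rng = random.Random(seed)  # fixed seed keeps subsets reproducible
    return rng.sample(examples, min(n, len(examples)))
```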
We used the corona100d dataset by Rieger and von Nordheim (2021), which can be re-created using the original repository on GitHub. The dataset should be stored in a JSONL data file format. After downloading the dataset, we can apply our preprocessing steps (i.e., replace newlines and symbols, convert emojis to text, etc.) using the following command:
```shell
python extract.py --inputfile /path/to/dataset/file --outputfile /path/to/output/file
```
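The core of such preprocessing can be sketched with the standard library alone. Note that the emoji table below is a tiny hand-written stand-in; `extract.py` may instead rely on a dedicated emoji library and additional normalization rules:

```python
import re

# Tiny illustrative emoji table -- the real script likely covers the
# full emoji range via a library rather than a hand-written mapping.
EMOJI_TO_TEXT = {"😷": ":face_with_medical_mask:", "👍": ":thumbs_up:"}

def preprocess(text):
    """Flatten a tweet to a single line and spell out emojis."""
    text = re.sub(r"\s*\n\s*", " ", text)           # replace newlines
    for emoji, name in EMOJI_TO_TEXT.items():       # convert emojis to text
        text = text.replace(emoji, f" {name} ")
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace
```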
We can use the resulting file to continue masked language modeling with an existing pretrained language model (e.g., `bert-base-german-cased`):

```shell
python run_mlm.py --model_name_or_path bert-base-german-cased --train_file /path/to/output/file --validation_split_percentage 5 --max_seq_length 256 --line_by_line --output_dir /path/to/output/model/weights --do_train --per_device_train_batch_size 32 --learning_rate 0.0001
```
This repository contains different sub-projects:

```
<ROOT>/src
├── evaluation/
├── pretraining/
├── analysis/
├── util/
├── run_fewshot.py
├── run_hyper_optimization.py
├── run_sentiment.py
├── run_stance.py
└── run_transfer.py
```
- `evaluation`: Contains our evaluation framework that we use to evaluate hyperparameter optimization and experiments using multiple random seeds.
- `pretraining`: Provides the code for preprocessing and running masked language modeling using a pretrained language model.
- `analysis`: Various methods for analysis of results and datasets.
- `util`: Provides the software that we used to translate additional training data (i.e., SemEval 2016 Task 6) to German (using DeepL) and other useful utility methods.
If you find this repository helpful, feel free to cite our publication:

```bibtex
@article{viehmannCov19Stance,
  title = {Investigating Opinions on Public Policies in Digital Media: Setting up a Supervised Machine Learning Tool for Stance Classification},
  author = {Viehmann, Christina and Beck, Tilman and Maurer, Marcus and Quiring, Oliver and Gurevych, Iryna},
  journal = {Communication Methods and Measures},
  year = {2022},
  pages = {XX-XX},
  url = {https://doi.org/10.1080/19312458.2022.2151579}
}
```
Contact persons:
Christina Viehmann, https://www.kowi.ifp.uni-mainz.de/
Tilman Beck, https://www.ukp.tu-darmstadt.de/
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.