
storytestimony

This repository contains the data and code for the paper "Reports of personal experiences and stories in argumentation: datasets and analysis" [1].

Datasets

The data used to train and test the models is stored in datasets. All information about the datasets, annotation and structure can be found in datasets/readme.md.

Training and testing a model

The simple classifiers (bag of words, random forest) can be found in baselines. To train and test a new model with BERT, use the script bert_classifier.py:

python bert_classifier.py source_folder result_folder temp_folder max_seqlen epochs gpu model_path --test
  • source_folder: loads train.csv and val.csv from the given path
  • result_folder: saves the trained model (metrics.pt and model.pt), results (classification_report.csv) and the predictions (predictions.csv with gold label, predicted label and class probabilities)
  • temp_folder: saves a minimal file (train, val, test) as a tsv with only text and label
  • max_seqlen: the maximum sequence length (default is 512)
  • epochs: the number of epochs to train
  • gpu: the number of the GPU to train on (if training on a GPU)
  • model_path: the name of the pretrained LM to load for fine-tuning
    • a BERT-based transformer model, e.g. bert-base-uncased
  • --test: add this flag to create predictions for the test data
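The source_folder is expected to contain train.csv and val.csv. A minimal sketch of such a folder, assuming the columns are named "text" and "label" (an assumption; see datasets/readme.md for the actual schema):

```python
import csv
import os
import tempfile

# Sketch of the expected input layout: a source_folder containing train.csv
# and val.csv. The column names "text" and "label" are assumptions; check
# datasets/readme.md for the real schema.
source_folder = tempfile.mkdtemp()
rows = [
    {"text": "When I was a student I could not afford the rent.", "label": "story"},
    {"text": "Rent control distorts housing markets.", "label": "no_story"},
]
for split in ("train.csv", "val.csv"):
    with open(os.path.join(source_folder, split), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)

# bert_classifier.py would then be invoked with this folder as source_folder.
```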

The domain-adapted versions of the language model are also available. For these, the underlying language model has been tuned on domain-specific data. Note that if you load these, you still need to fine-tune them on the classification task. To load them, use the following names as model_path, for example:

python bert_classifier.py datasets/splits/cmv_10splits/split0/ datasets/splits/cmv_10splits/split0/ tmp/ 125 tmp/ 5 0 "falkne/bert-europarl-en" --test

Inference

To use a trained model for inference, you only need a dataset to create predictions for. The dataset should have a column-based format; the only required column is the one that stores the textual input. Use the script bert_predict_storytelling.py to create predictions. Several fine-tuned classification models are made available via the hub; alternatively, you can first train your own model and use it for predictions. The following models can be used directly from the hub:

  • falkne/storytelling-europolis-en
  • falkne/storytelling-regulationroom-en
  • falkne/storytelling-change-my-view-en
  • falkne/storytelling-mixed-domains-en
    • This model was trained on all three available datasets. It is the most robust model if you work with diverse domains.
  • falkne/storytelling-LM-europarl-mixed-en
    • This model was trained on all available datasets; the underlying LM was fine-tuned on Europarl. This model achieved the best results in the paper.
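The hub models can also be loaded directly with the transformers pipeline API. A minimal sketch (the label names returned depend on the model's config; downloading the model requires network access on first use):

```python
from transformers import pipeline


def load_storytelling_classifier(model_name="falkne/storytelling-mixed-domains-en"):
    """Build a text-classification pipeline for one of the hub models listed above.

    The first call downloads the model from the Hugging Face hub.
    """
    return pipeline("text-classification", model=model_name)


if __name__ == "__main__":
    clf = load_storytelling_classifier()
    # Returns a list of dicts with "label" and "score" keys.
    print(clf("When I was a child, my family could not afford health insurance."))
```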

For example, to create predictions for CMV data, evaluate the model (possible if you have gold labels), and use the model trained on regulationroom:

python bert_predict_storytelling.py datasets/splits/cmv_10splits/split0/test.tsv "post_text" 512 "falkne/storytelling-regulationroom-en" test_results/ 0 --gold_label_col "label"
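The resulting predictions file stores the gold label, the predicted label and the class probabilities. A sketch of computing accuracy from such a file with the standard library, assuming the columns are named "gold_label" and "predicted_label" (an assumption; adjust to the actual header the script writes):

```python
import csv
import io


def accuracy_from_predictions(csv_text, gold_col="gold_label", pred_col="predicted_label"):
    """Compute accuracy from a predictions.csv-style file.

    The column names are assumptions; adjust them to the header actually
    written by the prediction script.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    correct = sum(1 for r in rows if r[gold_col] == r[pred_col])
    return correct / len(rows)


# Hypothetical two-row predictions file: one correct, one wrong.
sample = (
    "gold_label,predicted_label,prob_story\n"
    "story,story,0.91\n"
    "no_story,story,0.55\n"
)
print(accuracy_from_predictions(sample))  # 0.5
```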

Requirements

The code is based on the following packages (with versions):

  • numpy 1.21.5
  • pandas 1.3.5
  • scikit-learn 1.0.2
  • scipy 1.7.3
  • sklearn 0.0
  • torch 1.10.2+cu113
  • torchaudio 0.10.2+cu113
  • torchtext 0.11.2
  • torchvision 0.11.3+cu113
  • tqdm 4.62.3
  • transformers 4.16.2
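A sketch of installing these with pip (the extra index URL for the +cu113 torch wheels is an assumption based on the version tags; sklearn 0.0 is the deprecated PyPI stub for scikit-learn and is omitted here):

```shell
pip install numpy==1.21.5 pandas==1.3.5 scikit-learn==1.0.2 scipy==1.7.3 \
    torchtext==0.11.2 tqdm==4.62.3 transformers==4.16.2
pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 torchaudio==0.10.2+cu113 \
    --extra-index-url https://download.pytorch.org/whl/cu113
```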

Regression Analysis

The data and the code for the linear regression analysis can be found in regression_analysis.

Reference

[1] Neele Falk and Gabriella Lapesa. 2022. Reports of Personal Experiences and Stories in Argumentation: Datasets and Analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
