

FEVER 2019 Interactive Evidence Detection

This repository contains the code to convert the datasets into topic-relevant collections of sentence-level annotated documents, as well as the code to run simulations of users who annotate one document after another.

The subfolders contain the following content:

data contains the ED-ACL-2014 and ED-EMNLP-2015 datasets and the code to convert them, as well as the code to download and convert the Argument Mining dataset.

EvidenceDetection contains the code to simulate the users and to analyse the quality of the predictions.

Bert contains the code to use Bert based on


This project uses four datasets: ED-ACL-2014, ED-EMNLP-2015, ED-ACL-2018, and Argument Mining. In the form used here, each dataset except ED-ACL-2018 consists of documents with sentence-level annotations. Each sentence is either a piece of evidence or not. The datasets use the following TSV format.

Label	Sentence
Evidence	candidate sentence
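A minimal sketch of reading a file in this format, assuming one tab-separated label/sentence pair per line (the function name is illustrative, not part of the repository):

```python
def read_annotated_document(path):
    """Read a sentence-level annotated document in the Label<TAB>Sentence format."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split only on the first tab, so sentences may contain tabs.
            label, sentence = line.split("\t", 1)
            sentences.append((label, sentence))
    return sentences
```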


ED-ACL-2014

Copyright Wikipedia, Copyright IBM 2014. Released under CC-BY-SA.

The original dataset consisted of a table of topic related pieces of evidence from Wikipedia articles and the associated articles. We converted it into collections of topic related Wikipedia articles with sentence level annotations of evidence. Each sentence in the article is either a piece of evidence or not.
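The conversion can be sketched as follows; matching each article sentence against the evidence table by exact string comparison is an assumption for illustration, not necessarily the repository's actual matching strategy:

```python
def annotate_article(article_sentences, evidence_pieces):
    """Label each sentence of an article as Evidence or NoEvidence,
    depending on whether it occurs in the topic's evidence table."""
    evidence = {e.strip() for e in evidence_pieces}
    return [("Evidence" if s.strip() in evidence else "NoEvidence", s)
            for s in article_sentences]
```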


ED-EMNLP-2015

Copyright Wikipedia, Copyright IBM 2015. Released under CC-BY-SA.

The dataset was processed identically to the ED-ACL-2014 dataset.


ED-ACL-2018

Copyright Wikipedia, Copyright IBM 2015. Released under CC-BY-SA.

This dataset is not included in this repository, but can be downloaded from

Argument Mining

This dataset is based on the Sentential UKP Argument Mining corpus. We extended the source code to complete the dataset by saving the original files with sentence-level annotations. The modified source code is included.

Evidence Detection

This project aims at extracting evidence from documents, based on the data collected in the hypothesis validation user surveys.

Currently, the project focuses on classifying each sentence as evidence or not.

Folder Structure

|- bin (callable scripts)
|- hrl (Scripts to run the experiments on the Lichtenberg high performance computer)
|- scripts (contains a script for post processing to make the analysis easier)
|- evidencedetection (the library code)
|- analysis (scripts to analyse the predictions)


Before running the experiments, please install the evidence detection package through the setup script:

python3 setup.py (install|develop)

Run experiments

The bin folder contains the runnable scripts to train the BiLSTM as well as the pre-trained and fine-tuned models. Pre-training for the evidence detection and argument mining models is done by two separate scripts, whereas both pre-trained models can be further fine-tuned by a single script.

The results of the experiments are the raw predictions, which are saved next to the original test file. The results follow the convention $TEST_FILE_FOLDER/$MODEL_NAME/$SEED/$ITERATION/$TEST_FILENAME.pred. The files use the same data format as the test file, i.e. $LABEL\t$SENTENCE. $SEED is the randomisation seed and $ITERATION the number of files used for training.
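The path convention above can be sketched as a small helper (the function itself is illustrative and not part of the repository):

```python
import os

def prediction_path(test_file, model_name, seed, iteration):
    """Build a prediction file path following the convention
    $TEST_FILE_FOLDER/$MODEL_NAME/$SEED/$ITERATION/$TEST_FILENAME.pred."""
    folder, filename = os.path.split(test_file)
    return os.path.join(folder, model_name, str(seed), str(iteration),
                        filename + ".pred")
```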


The analysis folder contains the code and scripts to read the test files as well as the prediction files and to conduct different evaluations on them. For instance, it is possible to evaluate individual models and plot the change in performance with the script. It is also possible to evaluate the final performance of a model alone. However, this assumes that the final prediction files are stored in the randomisation-seed-specific folder and not in the last iteration's folder. To copy the prediction files from the last iteration, we added a bash script in the scripts folder.
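The copy step performed by that bash script can be approximated as follows; the folder layout and the "highest iteration number wins" heuristic are assumptions based on the path convention described above, not the script's exact behaviour:

```python
import os
import shutil

def copy_last_iteration(seed_folder):
    """Copy prediction files from the highest-numbered iteration folder
    into the seed folder itself, so seed-level evaluation finds them."""
    iterations = [d for d in os.listdir(seed_folder)
                  if d.isdigit() and os.path.isdir(os.path.join(seed_folder, d))]
    if not iterations:
        return
    last = max(iterations, key=int)  # numerically highest iteration
    src = os.path.join(seed_folder, last)
    for name in os.listdir(src):
        if name.endswith(".pred"):
            shutil.copy(os.path.join(src, name), os.path.join(seed_folder, name))
```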


To run the experiments please follow the instructions provided at

Evidence Detection: The script trains the evidence detection model and the script runs the evaluation.

Argument Mining: The script trains the argument mining model and the script runs the evaluation.


@inproceedings{stahlhut2019interactive,
  address = {{Hong Kong, China}},
  title = {Interactive {{Evidence Detection}}: Train State-of-the-Art Model out-of-Domain or Simple Model Interactively?},
  abstract = {Finding evidence is of vital importance in research as well as fact checking and an evidence detection method would be useful in speeding up this process.
However, when addressing a new topic there is no training data and there are two approaches to get started.
One could use large amounts of out-of-domain data to train a state-of-the-art method, or to use the small data that a person creates while working on the topic.
In this paper, we address this problem in two steps.
First, by simulating users who read source documents and label sentences they can use as evidence, thereby creating small amounts of training data for an interactively trained evidence detection model; and second, by comparing such an interactively trained model against a pre-trained model that has been trained on large out-of-domain data.
We found that an interactively trained model not only often out-performs a state-of-the-art model but also requires significantly lower amounts of computational resources.
Therefore, especially when computational resources are scarce, e.g. no GPU available, training a smaller model on the fly is preferable to training a well generalising but resource hungry out-of-domain model.},
  booktitle = {Proceedings of the {{Second Workshop}} on {{Fact Extraction}} and {{VERification}} ({{FEVER}})},
  publisher = {{Association for Computational Linguistics}},
  author = {Stahlhut, Chris},
  month = nov,
  year = {2019},
}