
COVID-Fact

This repository contains data and code for our ACL 2021 paper

              COVID-Fact: Fact Extraction and Verification of Real-World Claims on COVID-19 Pandemic
              Arkadiy Saakyan, Tuhin Chakrabarty, and Smaranda Muresan.

The full dataset is contained in the file COVIDFACT_dataset.jsonl. Each line is a dictionary in the following format:

{"claim": "claim text", "label": "REFUTED or SUPPORTED", "evidence": [list of evidence sentences], "gold_source": "link", "flair": "post's flair"}


Evidence Selection

The scripts used for evidence selection / claim filtration can be found in the folder doc_selection. The folder also contains a script to select evidence for a particular claim.

Note that gold_source in our dataset refers to the source of the claim, not the source of the evidence. Evidence is scraped from the top 5 Google Search results for the claim, including the link to the claim's source.

The files search_results*.json provide the links we used for scraping evidence.

Note that the list of evidence sentences is retrieved by first creating a "corpus" from the top 5 Google search results for the claim, and then retrieving the sentences most similar to the claim using SBERT.
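As a rough illustration of that retrieval step, the snippet below ranks candidate sentences by SBERT cosine similarity to a claim; the checkpoint name, the placeholder sentences, and the top-k value are assumptions, not necessarily the exact configuration used to build the dataset.

    from sentence_transformers import SentenceTransformer, util

    claim = "example claim text"                                  # placeholder claim
    corpus_sentences = ["sentence one from a scraped page",       # placeholder corpus
                        "sentence two from a scraped page"]

    # Any sentence-transformers model works for this sketch; the exact SBERT
    # checkpoint used for COVID-Fact is not assumed here.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    claim_emb = model.encode(claim, convert_to_tensor=True)
    corpus_emb = model.encode(corpus_sentences, convert_to_tensor=True)

    # Keep the sentences most similar to the claim as evidence candidates.
    hits = util.semantic_search(claim_emb, corpus_emb, top_k=5)[0]
    evidence = [corpus_sentences[h["corpus_id"]] for h in hits]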

Scraped websites from the search results (before sentence tokenization and SBERT filtering) are available here: https://drive.google.com/drive/folders/1CZZ8QvUxkQMITofZ2o3_ZGUKtw0Zu3x3?usp=sharing
The format is {claim_1: {source_1: text, source_2: text, ...}, claim_2: ...}
This file contains more claims than the final dataset, since claims for which no evidence could be obtained were filtered out. Other discrepancies with the final dataset are possible because sites may have changed their content.
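A small sketch of reading that file (the local file name is a placeholder, and sentence tokenization with NLTK is an assumption):

    import json
    from nltk.tokenize import sent_tokenize   # may require nltk.download("punkt")

    # Placeholder name for the JSON file downloaded from the Drive link above.
    with open("scraped_sites.json", encoding="utf-8") as f:
        scraped = json.load(f)

    # {claim: {source_url: page_text, ...}, ...} -> per-claim sentence corpus.
    for claim, sources in list(scraped.items())[:1]:
        sentences = [s for text in sources.values() for s in sent_tokenize(text)]
        print(claim, len(sources), "sources,", len(sentences), "sentences")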

Scraped sentences after SBERT similarity filtering against the claim are uploaded in the files siteMAtchEVIDENCE_LINK_v2.csv and siteMAtchEVIDENCE_LINKS_ADDITIONAL_1k.csv.

search_and_select_evidence_for_claim.py is an auxiliary script that can be used to obtain evidence candidates for a given claim. It was not used in the dataset creation process and is provided simply for convenience.

Files with "additional" in the name indicate that they were collected at a later date compared to most of the dataset. In order to run comprehensive evaluations, please simply join the files.

Counter-claim Generation

The scripts used for counter-claim generation can be found in the folder contrastive_gen. get_attn_model.py is used to find the best head and layer of a fine-tuned uncased BERT model. get_top_words.py is used to obtain the salient words for a particular set of claims. Finally, gen_contrast_claims.py is used to replace salient words with outputs generated by a RoBERTa model fine-tuned on CORD-19.
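As a hedged sketch of the word-replacement step, the snippet below masks a salient word and fills it with a masked language model; the claim, the salient word, and the model name are placeholders (gen_contrast_claims.py uses a RoBERTa model fine-tuned on CORD-19 rather than roberta-base).

    from transformers import pipeline

    # Stand-in checkpoint; swap in the CORD-19 fine-tuned RoBERTa for real use.
    fill = pipeline("fill-mask", model="roberta-base")

    claim = "New vaccine shows promising results in early trials"  # illustrative claim
    salient_word = "promising"                                      # e.g. from get_top_words.py

    # Mask the salient word and let the language model propose replacements.
    masked = claim.replace(salient_word, fill.tokenizer.mask_token, 1)
    for cand in fill(masked, top_k=5):
        print(cand["sequence"], round(cand["score"], 3))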

The BERT model fine-tuned on SciFact that we used to obtain the attention weights can be found here: https://drive.google.com/drive/folders/1EH2nk3NLfcNAdyesPGu9u0_7altoTRXS?usp=sharing

Evaluation

The folder evaluation provides the scripts needed to recreate Table 4 and Table 5 of the paper. The eval_data folder provides the dataset already processed into the format required for the experiments. The train/test splits are contained in the folders prefixed with RTE.

Download the fairseq folder from https://drive.google.com/file/d/1WzDrE3DQHLnlM6j_nokFP5D9ZxAB2f66/view?usp=sharing to reproduce the veracity prediction results using gold evidence for COVID-Fact (the best configuration).
The best COVID-Fact result checkpoint is available here:
https://drive.google.com/drive/folders/1KyQAh_EK4Vi9k28RLXE4pD4Pu2P9Oh7z?usp=sharing

Run result.py to compute accuracy and F1 score, and use roberta_train.sh to train a new model.
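As a sketch of running the downloaded checkpoint with fairseq's Python API (directory names, file names, and the example sentence pair are assumptions; see roberta_train.sh and result.py for the exact configuration):

    from fairseq.models.roberta import RobertaModel

    # Placeholder paths: the downloaded checkpoint folder and the binarized
    # RTE-style COVID-Fact data.
    roberta = RobertaModel.from_pretrained(
        "checkpoints/",
        checkpoint_file="checkpoint_best.pt",
        data_name_or_path="RTE-covidfact-bin",
    )
    roberta.eval()

    label_fn = lambda i: roberta.task.label_dictionary.string(
        [i + roberta.task.label_dictionary.nspecial]
    )

    # Veracity prediction for one (evidence, claim) pair.
    tokens = roberta.encode("concatenated evidence sentences", "claim text")
    pred = roberta.predict("sentence_classification_head", tokens).argmax().item()
    print(label_fn(pred))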

We have also uploaded data for RTE-Fever, RTE-scifact, and RTE-covidscifact.

If you want to train models on these datasets, you need to create binarized data files as described in https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md. If you want the trained model checkpoints for FEVER / scifact / covidscifact, email us.

Citation

            @inproceedings{saakyan-etal-2021-covid,
                 title = "{COVID}-Fact: Fact Extraction and Verification of Real-World Claims on {COVID}-19 Pandemic",
                 author = "Saakyan, Arkadiy  and
                   Chakrabarty, Tuhin  and
                   Muresan, Smaranda",
                 booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
                 month = aug,
                 year = "2021",
                 address = "Online",
                 publisher = "Association for Computational Linguistics",
                 url = "https://aclanthology.org/2021.acl-long.165",
                 pages = "2116--2129",
             }

Contact

a.saakyan@columbia.edu
tuhin.chakr@cs.columbia.edu
