UKPLab/conll2019-snopes-crawling


Repository for the CoNLL 2019 paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Link to the paper: A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking

Please use the following citation:

@inproceedings{hanselowski2019snopes,
  title={A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking},
  author={Hanselowski, Andreas and Stab, Christian and Schulz, Claudia and Li, Zile and Gurevych, Iryna},
  booktitle={Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL2019)},
  year={2019}
}

Disclaimer:

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

This is a crawler to generate the Snopes Corpus

The repository contains or constructs the following corpora:

Corpus 1: Contains the links (URLs) to the individual Snopes fact-checking pages. Each page validates one claim. The corpus is already available in the repository.

Corpus 2: Contains the information crawled from the Snopes fact-checking pages: Claim, Verdict, Evidence Text Snippets (ETSs), Resolution, ...

Corpus 3: Contains the information about the links from the Snopes fact-checking pages and the original documents (Origin Docs) to which the links point.

Corpus 4: Contains the merged information from Corpus 2 and Corpus 3: the ETSs are matched with the original documents (Origin Docs) from which the fact-checkers extracted them.

The code for constructing Corpora 2, 3, and 4 has two parts. The first part is a Maven-based crawler that collects the information on Snopes fact-checking pages (such as claim, evidence, rating, and Origin) from the web archives Wayback Machine and Common Crawl to generate Corpus 2 and Corpus 3. In addition, the Amazon Mechanical Turk annotations for the stance of the Evidence Text Snippets (ETSs) and for the Fine-Grained Evidence (FGE) are incorporated into Corpus 2. The second part is a Python program based on the thesis of Arphitha Nagaraja. It builds upon a pre-trained random-forest classifier that matches the ETSs from Corpus 2 with the original documents (Origin Docs) from Corpus 3 to generate Corpus 4.

Generating the Annotated Corpus (Corpus 2 + annotations)

To generate the annotated Corpus 2, which can be used for stance detection, evidence extraction, and claim validation, run the following commands:

sudo apt install maven
chmod +x script.sh
./script.sh mode2

This will generate Corpus2_annotated.csv in the final_corpus directory. It contains the following important information:

Claim: The statement that needs to be verified.

Evidence: A short text snippet from a blockquote on a Snopes fact-checking page. It was extracted from an online article that may be related to the claim.

Origin: The article that provides the resolution of the claim.

Rating: The verdict given to the claim.

Snopes URL: Each URL corresponds to a unique claim. It is the page from which the information in the corpus was extracted.

Commoncrawl URL: The location where Common Crawl stores the snapshot of the page.

Snippets: The sentence-level tokenization of the evidence.

Stance: The stance of the evidence towards the claim. It can be refuted, support or nostance.

Evidence_Sentences: The sentences in the evidence that can refute or support the claim.
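
As a quick sanity check after generation, the distribution of stance labels can be inspected with a few lines of Python. This is a minimal sketch, not part of the repository: it assumes that the CSV header uses the field names listed above (in particular a column named Stance), that the file is comma-separated and UTF-8 encoded, and that it is located at final_corpus/Corpus2_annotated.csv. The function names are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a corpus CSV into a list of dicts keyed by the header row.

    Assumption: comma-separated, UTF-8, first row is the header.
    """
    # Evidence fields can be long, so raise the default field size limit.
    csv.field_size_limit(10**9)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_stances(rows, stance_column="Stance"):
    """Count how often each stance label occurs.

    Assumption: the stance column is named "Stance"; pass another
    name if the generated header differs.
    """
    return Counter(row[stance_column].strip() for row in rows)
```

For example, `count_stances(load_rows("final_corpus/Corpus2_annotated.csv"))` would return a `Counter` keyed by the stance labels found in the file.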

Complete Corpora (Corpus 2, Corpus 3, Corpus 4 = Corpus 2 + Corpus 3)

To generate the complete corpora, run the following commands:

chmod +x script.sh
./script.sh mode1

This command will generate the Corpus2_annotated.csv and Corpus4.csv. The Corpus4.csv contains the following information:

Claim: The statement that needs to be verified.

Evidence: A short text snippet from a blockquote on a Snopes fact-checking page. It was extracted from an online article that may be related to the claim.

Origin Doc: The text extracted from a link in the Origin section of a Snopes fact-checking page.

Match: Whether the evidence snippet was extracted from the given Origin Doc. The value is 'match' or 'no match'.

This corpus can be used for document retrieval: given a claim, retrieve the related documents.
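
To illustrate the document-retrieval use case, the matched Origin Docs can be grouped by claim to obtain the relevant documents for each claim. This is a minimal sketch under assumptions, not the authors' code: it assumes the CSV header uses the field names listed above (Claim, Origin Doc, Match) and that the file is comma-separated and UTF-8 encoded. The function names are hypothetical.

```python
import csv
from collections import defaultdict


def read_corpus(path):
    """Read Corpus4.csv into a list of dicts keyed by the header row.

    Assumption: comma-separated, UTF-8, first row is the header.
    """
    # Origin Docs are full documents, so raise the default field size limit.
    csv.field_size_limit(10**9)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def matched_docs_by_claim(rows, claim_col="Claim",
                          doc_col="Origin Doc", match_col="Match"):
    """Map each claim to the Origin Docs labelled 'match'.

    Assumption: the column names equal the field names in the README.
    Duplicate documents for the same claim are kept only once.
    """
    docs = defaultdict(list)
    for row in rows:
        if row[match_col].strip().lower() != "match":
            continue  # skip 'no match' rows
        claim_docs = docs[row[claim_col]]
        if row[doc_col] not in claim_docs:
            claim_docs.append(row[doc_col])
    return dict(docs)
```

The resulting dictionary can serve as gold relevance data: for each claim, a retrieval system should rank the listed documents above the remaining Origin Docs. Claims without any matched document do not appear in the dictionary.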

Other:

Original corpus

  • You can request the original corpus crawled by us via the data archive website. Please take note of the licence agreement, as the corpus is not publicly available.

Repository for experiments:

  • You can run experiments using this crawled corpus on the basis of this repository

Contacts:

Credit:

This repository was built by Hao Zhang, Arphitha Nagaraja, and Andreas Hanselowski.

License:

  • Apache License Version 2.0
