aic-factcheck/automated-fact-checking

Automated Fact-Checking

Data, models, and code to reproduce our paper "Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language". The paper is currently under review at the NCAA journal.

@article{drchal2023pipeline,
  title={Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language},
  author={Drchal, Jan and Ullrich, Herbert and Mlyn{\'a}{\v{r}}, Tom{\'a}{\v{s}} and Moravec, V{\'a}clav},
  journal={arXiv preprint arXiv:2312.10171},
  year={2023}
}

Code

  • QACG Data Generation -- our fork of the original QACG procedure.
  • ColBERTv2 -- our fork of ColBERTv2. Retrieval for FactSearch is exposed via a REST API.
  • anserini-indexing -- a wrapper for Anserini BM25. Retrieval for FactSearch is exposed via a REST API.
  • FactSearch source is hosted in this repository.

Data to Train QACG Models

The following datasets were created by machine translation using DeepL. See the paper for more details.

  1. SQuAD-cs
  2. QA2D-cs
  3. QA2D-pl
  4. QA2D-sk

QACG Models

  1. Question Generation model trained on a concatenation of Czech, English, Polish, and Slovak SQuAD datasets:
  2. Claim Generation model trained on a concatenation of Czech, English, Polish, and Slovak QA2D datasets:

QACG Generated Data

All QACG-generated datasets were produced from the corresponding Wikipedia snapshots using the QACG models above. QACG-mix combines all four languages while keeping the size of a single-language dataset. QACG-sum is the concatenation of all four individual language datasets and is therefore four times larger.

  1. QACG-cs
  2. QACG-en
  3. QACG-pl
  4. QACG-sk
  5. QACG-mix
  6. QACG-sum
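
The difference between the mix and sum variants can be illustrated with a minimal sketch. It is not the code used to build the released datasets: the function names `build_sum` and `build_mix`, the toy claims, and the equal per-language share in the mix are all assumptions made for illustration. See the paper for the actual procedure.

```python
import random

LANGS = ["cs", "en", "pl", "sk"]

def build_sum(per_lang):
    """Concatenate all per-language datasets (sum-style, about 4x one language)."""
    return [ex for lang in LANGS for ex in per_lang[lang]]

def build_mix(per_lang, seed=0):
    """Sample an equal share from each language (an assumption) so that the
    result is about as large as a single-language dataset."""
    rng = random.Random(seed)
    target = min(len(per_lang[lang]) for lang in LANGS)
    share = target // len(LANGS)
    mixed = []
    for lang in LANGS:
        mixed.extend(rng.sample(per_lang[lang], share))
    rng.shuffle(mixed)
    return mixed

# Hypothetical toy data: 8 placeholder claims per language.
toy = {lang: [f"{lang}-claim-{i}" for i in range(8)] for lang in LANGS}

qacg_sum = build_sum(toy)
qacg_mix = build_mix(toy)
print(len(qacg_sum), len(qacg_mix))  # prints "32 8"
```

With 8 toy claims per language, the sum variant holds 32 items while the mix variant holds 8, the same as any single language.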

ColBERTv2 Evidence Retrieval

colbertv2-QACG-SUM

NLI Veracity Evaluation

nli-QACG-sum

NLI Annotations

Here

Evidence Retrieval Annotations

Here
