Releases: bitextor/bitextor-data

Bitextor test data - WARCs v1.1

14 Oct 10:37

Collection of WARC files that comply with the WARC-1.0 standard and can be used to run regression tests with Bitextor. This release includes three websites that were crawled between January 25 and 28, 2019. The websites are:

25/11/2022: Added the documents.tar.gz file, which contains the documents needed to test dir2warc.
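
As a quick sanity check before running the regression tests, the first record of a downloaded file can be inspected for the `WARC/1.0` version line that every WARC-1.0 record starts with. The following is a minimal sketch using only the Python standard library, not part of Bitextor itself; the function names are hypothetical, and it assumes the file is either uncompressed, XZ-compressed, or gzip-compressed. It only looks at the first record, so it is not a full compliance check.

```python
import gzip
import lzma

# Magic bytes that identify the compression container (standard values).
XZ_MAGIC = b"\xfd7zXZ\x00"
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_compressed(path):
    """Open `path` for binary reading, decompressing xz or gzip if detected.

    Hypothetical helper: detection is based on the leading magic bytes,
    not on the file extension.
    """
    with open(path, "rb") as raw:
        head = raw.read(len(XZ_MAGIC))
    if head.startswith(XZ_MAGIC):
        return lzma.open(path, "rb")
    if head.startswith(GZIP_MAGIC):
        return gzip.open(path, "rb")
    return open(path, "rb")


def first_record_version(path):
    """Return the version line of the first record, e.g. 'WARC/1.0'."""
    with open_maybe_compressed(path) as stream:
        line = stream.readline()
    return line.decode("ascii", errors="replace").strip()


def looks_like_warc_1_0(path):
    """True if the first record declares the WARC/1.0 format version."""
    return first_record_version(path) == "WARC/1.0"
```

For example, `looks_like_warc_1_0("kremlin.warc.xz")` would report whether the first record of that file declares WARC/1.0, assuming the file has been downloaded to the current directory.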

Bitextor test data - WARCs v1.0

28 Jan 10:40

Collection of XZ-compressed files that can be used to run regression tests with Bitextor (run-tests.sh). Tests can be run on three websites crawled between January 25 and 28, 2019. The three websites are:

The kremlin-many-small.tar.xz package is a test built from the content of kremlin.warc.xz, but each WARC contains only one pair of documents (taken from a Bitextor run on kremlin.warc.xz).

Bitextor dictionaries v1.0

24 Jan 12:22

Bitextor document aligner dictionaries: https://github.com/bitextor/bitextor/

en-ar.dic: generated using OpenSubtitles2018.
ca-es.dic: generated (mostly) using https://object.pouta.csc.fi/OPUS-DOGC/v2/moses/ca-es.txt.zip.
en-ru-morpheme.QED.dic and en-ru.QED.dic: generated using the QED corpora.
hu-en.hunalign.dic: taken from the original Hunalign code.
kk-ru.dic: generated using all the data available on OPUS in 2017.

The rest of the dictionaries were trained using JRC-AQUI in 2017.