The converted datasets can be found in the data subdirectory.
- data/escaped contains an MT-oriented dataset with escaped control sequences
- data/un-escaped/standard contains a dataset with texts grouped into manually created buckets
- data/un-escaped/no-buckets contains a purely converted dataset with no buckets at all
- data/un-escaped/bucket-per-year contains a dataset with texts grouped by year of publication
- data/un-escaped/clustered contains a dataset with bucket assignment based on k-means clustering over an n-gram character language model (this turned out to be the best method)
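For illustration, the sketch below shows one way to assign texts to buckets with k-means over character n-gram features, using scikit-learn. It is only an approximation under assumed settings (TF-IDF character n-grams standing in for a full character language model, a hypothetical cluster count, toy input texts); it is not the exact pipeline behind data/un-escaped/clustered.

```python
# Hypothetical sketch of k-means bucketing over character n-gram features.
# Vectorizer settings and cluster count are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "Example sentence from document one.",
    "Another sentence, from a different year.",
    "Yet another text to be bucketed.",
]

# Character n-grams (2-4 chars) are robust for noisy or multilingual text.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
features = vectorizer.fit_transform(texts)

# Assign each text to one of n_clusters buckets.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
buckets = kmeans.fit_predict(features)

for text, bucket in zip(texts, buckets):
    print(bucket, text[:40])
```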
To use these files, unpack them first, for example:
cd data/un-escaped/standard
tar xjf data.tbz
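To unpack all dataset variants in one go, a short Python sketch using the standard-library tarfile module could look like this (directory names are taken from the listing above; run it from the repository root):

```python
# Unpack data.tbz in every dataset variant directory.
import tarfile
from pathlib import Path

variants = [
    "data/escaped",
    "data/un-escaped/standard",
    "data/un-escaped/no-buckets",
    "data/un-escaped/bucket-per-year",
    "data/un-escaped/clustered",
]

for variant in variants:
    archive = Path(variant) / "data.tbz"
    if archive.exists():
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(path=variant)  # extract next to the archive
```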
I used the HuggingFace Transformers library to fine-tune machine translation models. The script that runs the fine-tuning can be found here:
This script is required by the example files below and has to be downloaded manually.
The example files for fine-tuning process are:
- run_mt5_fine-tuning.sh - script to start the fine-tuning process
- run_mt5_prediction.sh - script to run predictions after fine-tuning
To run fine-tuning, issue the following command:
./run_mt5_fine-tuning.sh 0 un-escaped-standard-mt5
where 0 is the ID of the GPU device to use and un-escaped-standard-mt5 is the target output directory.
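For reference, the sketch below approximates in Python what such a wrapper script typically does: pin the chosen GPU and invoke the manually downloaded fine-tuning script. The script name run_translation.py (the Transformers translation example), the data paths, the language pair, and all hyperparameters are assumptions for illustration, not the actual contents of run_mt5_fine-tuning.sh.

```python
# Hypothetical sketch of the wrapper's behavior: restrict training to one
# GPU and call the downloaded fine-tuning script with assumed arguments.
import os
import subprocess
import sys

gpu_id, output_dir = sys.argv[1], sys.argv[2]

# Make only the selected GPU visible to the training process.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)

subprocess.run(
    [
        "python", "run_translation.py",       # assumed name of the downloaded script
        "--model_name_or_path", "google/mt5-base",
        "--do_train",
        "--train_file", "data/un-escaped/standard/train.json",  # hypothetical path
        "--source_lang", "cs",                # hypothetical language pair
        "--target_lang", "en",
        "--output_dir", output_dir,
        "--per_device_train_batch_size", "4",
        "--num_train_epochs", "3",
        "--overwrite_output_dir",
    ],
    env=env,
    check=True,
)
```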
After fine-tuning, one can generate predictions in the same manner:
./run_mt5_prediction.sh 0 un-escaped-standard-mt5
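The fine-tuned model in the target directory can also be loaded directly with the Transformers API instead of the prediction script; the sketch below is a minimal example, assuming the directory contains a standard saved mT5 checkpoint.

```python
# Minimal sketch: load the fine-tuned checkpoint from the target directory
# and translate a single sentence. Generation settings are illustrative.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_dir = "un-escaped-standard-mt5"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = MT5ForConditionalGeneration.from_pretrained(model_dir)

inputs = tokenizer("Example source sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```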