The converted datasets can be found in the data subdirectory.
- data/escaped contains an MT-oriented dataset with escaped control sequences
- data/un-escaped/standard contains a dataset with texts grouped into manually created buckets
- data/un-escaped/no-buckets contains a purely converted dataset with no buckets at all
- data/un-escaped/bucket-per-year contains a dataset with texts grouped by year of publication
- data/un-escaped/clustered contains a dataset with bucket assignment based on k-means clustering over an n-gram character language model (this turned out to be the best method)
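For illustration, the sketch below shows one way to assign texts to buckets with k-means over character n-gram features, using scikit-learn. It is only an approximation under assumed settings (TF-IDF character n-grams standing in for a full character language model, a hypothetical cluster count, toy input texts); it is not the exact pipeline behind data/un-escaped/clustered.

```python
# Hypothetical sketch of k-means bucketing over character n-gram features.
# Vectorizer settings and cluster count are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [
    "Example sentence from document one.",
    "Another sentence, from a different year.",
    "Yet another text to be bucketed.",
]

# Character n-grams (2-4 chars) are robust for noisy or multilingual text.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
features = vectorizer.fit_transform(texts)

# Assign each text to one of n_clusters buckets.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
buckets = kmeans.fit_predict(features)

for text, bucket in zip(texts, buckets):
    print(bucket, text[:40])
```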
To use these files, unpack them first, for example:
cd data/un-escaped/standard
tar xjf data.tbz
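To unpack all dataset variants in one go, a short Python sketch using the standard-library tarfile module could look like this (directory names are taken from the listing above; run it from the repository root):

```python
# Unpack data.tbz in every dataset variant directory.
import tarfile
from pathlib import Path

variants = [
    "data/escaped",
    "data/un-escaped/standard",
    "data/un-escaped/no-buckets",
    "data/un-escaped/bucket-per-year",
    "data/un-escaped/clustered",
]

for variant in variants:
    archive = Path(variant) / "data.tbz"
    if archive.exists():
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(path=variant)  # extract next to the archive
```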
I used the HuggingFace Transformers library to fine-tune machine translation models. The script that runs the fine-tuning can be found here:
This script is required by the example files below and has to be downloaded manually.
The example files for fine-tuning process are:
- run_mt5_fine-tuning.sh - script to start the fine-tuning process
- run_mt5_prediction.sh - script to run predictions after fine-tuning
To run fine-tuning, issue the following command:
./run_mt5_fine-tuning.sh 0 un-escaped-standard-mt5
where 0 is the ID of the GPU device to use and un-escaped-standard-mt5 is the target output directory.
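For reference, the sketch below approximates in Python what such a wrapper script typically does: pin the chosen GPU and invoke the manually downloaded fine-tuning script. The script name run_translation.py (the Transformers translation example), the data paths, the language pair, and all hyperparameters are assumptions for illustration, not the actual contents of run_mt5_fine-tuning.sh.

```python
# Hypothetical sketch of the wrapper's behavior: restrict training to one
# GPU and call the downloaded fine-tuning script with assumed arguments.
import os
import subprocess
import sys

gpu_id, output_dir = sys.argv[1], sys.argv[2]

# Make only the selected GPU visible to the training process.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)

subprocess.run(
    [
        "python", "run_translation.py",       # assumed name of the downloaded script
        "--model_name_or_path", "google/mt5-base",
        "--do_train",
        "--train_file", "data/un-escaped/standard/train.json",  # hypothetical path
        "--source_lang", "cs",                # hypothetical language pair
        "--target_lang", "en",
        "--output_dir", output_dir,
        "--per_device_train_batch_size", "4",
        "--num_train_epochs", "3",
        "--overwrite_output_dir",
    ],
    env=env,
    check=True,
)
```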
After fine-tuning, one can generate predictions in the same manner:
./run_mt5_prediction.sh 0 un-escaped-standard-mt5
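The fine-tuned model in the target directory can also be loaded directly with the Transformers API instead of the prediction script; the sketch below is a minimal example, assuming the directory contains a standard saved mT5 checkpoint.

```python
# Minimal sketch: load the fine-tuned checkpoint from the target directory
# and translate a single sentence. Generation settings are illustrative.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

model_dir = "un-escaped-standard-mt5"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = MT5ForConditionalGeneration.from_pretrained(model_dir)

inputs = tokenizer("Example source sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```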