This repository contains code to fine-tune a HuggingFace mT5 model for Spanish grammatical error correction on the COWS-L2H dataset. A project writeup can be found here.
It is recommended to set up this repository on a GPU machine for training; 8 GB of RAM is the minimum needed to train a small mT5 model. The conda environment can be created using:
>>> conda env create -f environment.yml
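Then activate the environment (the environment name is defined in environment.yml; a placeholder is used here):
>>> conda activate <env_name>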
You can add this project to your PYTHONPATH using:
>>> export PYTHONPATH="<path/to/spanish_gec>":$PYTHONPATH
The data is cleaned and already available in cowsl2h/data.
To clean the source text yourself, you can download the original dataset from the COWS-L2H GitHub and use the following script:
>>> python process_dataset.py <path/to/cowsl2h/csv>
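For a sense of what the cleaning step involves, here is a minimal sketch; the column names, input path, and output file names are illustrative assumptions, not the script's actual interface:

```python
# Minimal sketch of pairing erroneous and corrected sentences from the CSV.
# The "essay" and "corrected" column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("cowsl2h.csv")                # placeholder input path
df = df.dropna(subset=["essay", "corrected"])  # keep rows with both sides

with open("source.txt", "w") as src, open("target.txt", "w") as tgt:
    for erroneous, corrected in zip(df["essay"], df["corrected"]):
        src.write(erroneous.strip() + "\n")    # model input: learner text
        tgt.write(corrected.strip() + "\n")    # model target: corrected text
```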
You can start fine-tuning the model:
>>> export WANDB_API_KEY="<Your WandB API Key>"
>>> python run_train.py
The following files can be used to set various hyperparameters:
cowsl2h.py -- dataset loading
globals.py -- dataset parameters
mt5_finetuner.py -- training loop, loss computation, WandB logging
run_train.py -- training hyperparameters (see the sketch below)
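For orientation, the training these files implement boils down to standard sequence-to-sequence fine-tuning with the HuggingFace transformers API. The sketch below uses mt5-small with made-up hyperparameters and a toy sentence pair; none of these values are the repository's actual settings:

```python
# Minimal mT5 fine-tuning step (illustrative values only).
import torch
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

tokenizer = MT5TokenizerFast.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed learning rate

# One toy (erroneous, corrected) pair standing in for the COWS-L2H data.
inputs = tokenizer("Yo soy estudiando español.", return_tensors="pt")
labels = tokenizer("Yo estoy estudiando español.", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()
optimizer.step()
```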
You can run inference on any of the dataset splits:
>>> python run_predict.py </path/to/ckpt/dir> </path/to/ckpt/file> -d <train, val, test, or all> -b <num_beams>
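For example, to decode the test split with 5 beams (all paths here are placeholders):
>>> python run_predict.py checkpoints/ checkpoints/best.ckpt -d test -b 5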
You can run inference on a sentence with a pretrained model:
>>> python predict.py <path/to/model/dir> <text>
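Under the hood, single-sentence inference looks roughly like the following sketch; the model directory and generation settings are assumptions:

```python
# Minimal sketch of inference with a fine-tuned mT5 checkpoint.
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

model_dir = "path/to/model/dir"  # placeholder, as in the command above
tokenizer = MT5TokenizerFast.from_pretrained(model_dir)
model = MT5ForConditionalGeneration.from_pretrained(model_dir)

text = "Yo soy estudiando español."
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, num_beams=5, max_length=128)  # assumed settings
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```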
You can also play with inference using this Colab notebook.
You can compute the F-0.5 score using ERRANT:
>>> ./eval.sh <path/to/predicted/sentences/>
This will output a results-cs.txt in your predictions directory with several metrics: TPs, FPs, FNs, Precision, Recall, F-0.5.
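eval.sh likely wraps ERRANT's command-line tools. A rough equivalent, assuming plain-text files of original, reference, and predicted sentences (one per line):
>>> errant_parallel -orig original.txt -cor reference.txt -out ref.m2
>>> errant_parallel -orig original.txt -cor predicted.txt -out hyp.m2
>>> errant_compare -hyp hyp.m2 -ref ref.m2
By default, errant_compare reports span-level TP, FP, FN, precision, recall, and F-0.5, matching the metrics listed above.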