This repository contains code to fine-tune a HuggingFace mT5 model for Spanish grammatical error correction on the COWS-L2H dataset. A project writeup can be found here.
It is recommended to set up this repository on a GPU machine for training; 8 GB of RAM is the minimum needed to train a small mT5 model. The conda environment can be created using:
>>> conda env create -f environment.yml
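Then activate the environment (the environment name is defined in environment.yml; a placeholder is used here):
>>> conda activate <env_name>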
You can add this project to your PYTHONPATH using:
>>> export PYTHONPATH="<path/to/spanish_gec>":$PYTHONPATH
The data is cleaned and already available in cowsl2h/data.
To clean the source text yourself, you can download the original dataset from the COWS-L2H GitHub and use the following script:
>>> python process_dataset.py <path/to/cowsl2h/csv>
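For a sense of what the cleaning step involves, here is a minimal sketch; the column names, input path, and output file names are illustrative assumptions, not the script's actual interface:

```python
# Minimal sketch of pairing erroneous and corrected sentences from the CSV.
# The "essay" and "corrected" column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("cowsl2h.csv")                # placeholder input path
df = df.dropna(subset=["essay", "corrected"])  # keep rows with both sides

with open("source.txt", "w") as src, open("target.txt", "w") as tgt:
    for erroneous, corrected in zip(df["essay"], df["corrected"]):
        src.write(erroneous.strip() + "\n")    # model input: learner text
        tgt.write(corrected.strip() + "\n")    # model target: corrected text
```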
You can start fine-tuning the model:
>>> export WANDB_API_KEY="<Your WandB API Key>"
>>> python run_train.py
The following files can be used to set various hyperparameters:
cowsl2h.py -- dataset loading
globals.py -- dataset parameters
mt5_finetuner.py -- training loop, loss computation, WandB logging
run_train.py -- training hyperparameters (see the sketch below)
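For orientation, the training these files implement boils down to standard sequence-to-sequence fine-tuning with the HuggingFace transformers API. The sketch below uses mt5-small with made-up hyperparameters and a toy sentence pair; none of these values are the repository's actual settings:

```python
# Minimal mT5 fine-tuning step (illustrative values only).
import torch
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

tokenizer = MT5TokenizerFast.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed learning rate

# One toy (erroneous, corrected) pair standing in for the COWS-L2H data.
inputs = tokenizer("Yo soy estudiando español.", return_tensors="pt")
labels = tokenizer("Yo estoy estudiando español.", return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()
optimizer.step()
```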
You can run inference on any of the dataset splits:
>>> python run_predict.py </path/to/ckpt/dir> </path/to/ckpt/file> -d <train, val, test, or all> -b <num_beams>
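For example, to decode the test split with 5 beams (all paths here are placeholders):
>>> python run_predict.py checkpoints/ checkpoints/best.ckpt -d test -b 5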
You can run inference on a sentence with a pretrained model:
>>> python predict.py <path/to/model/dir> <text>
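Under the hood, single-sentence inference looks roughly like the following sketch; the model directory and generation settings are assumptions:

```python
# Minimal sketch of inference with a fine-tuned mT5 checkpoint.
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

model_dir = "path/to/model/dir"  # placeholder, as in the command above
tokenizer = MT5TokenizerFast.from_pretrained(model_dir)
model = MT5ForConditionalGeneration.from_pretrained(model_dir)

text = "Yo soy estudiando español."
input_ids = tokenizer(text, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, num_beams=5, max_length=128)  # assumed settings
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```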
You can also play with inference using this Colab notebook.
You can compute the F-0.5 score using ERRANT:
>>> ./eval.sh <path/to/predicted/sentences/>
This will output a results-cs.txt in your predictions directory with several metrics: TPs, FPs, FNs, Precision, Recall, F-0.5.
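eval.sh likely wraps ERRANT's command-line tools. A rough equivalent, assuming plain-text files of original, reference, and predicted sentences (one per line):
>>> errant_parallel -orig original.txt -cor reference.txt -out ref.m2
>>> errant_parallel -orig original.txt -cor predicted.txt -out hyp.m2
>>> errant_compare -hyp hyp.m2 -ref ref.m2
By default, errant_compare reports span-level TP, FP, FN, precision, recall, and F-0.5, matching the metrics listed above.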