
Spanish Grammatical Error Correction (GEC)

This repository contains code to train an mT5 model from HuggingFace on the COWS-L2H dataset. A project writeup can be found here.

Setup

It is recommended to set up this repository on a machine with a GPU for training; 8GB of RAM is the minimum needed to train a small mT5 model. The conda environment can be created using:

>>> conda env create -f environment.yml

You can add this project to your PYTHONPATH using:

>>> export PYTHONPATH="</path/to/spanish_gec/>":$PYTHONPATH

Data Preprocessing

The data is cleaned and already available in cowsl2h/data. To clean the source text yourself, you can download the original dataset from the COWS-L2H GitHub and use the following script:

>>> python process_dataset.py <path/to/cowsl2h/csv>
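
For orientation, the sketch below shows the kind of cleaning such a script performs. It is a minimal sketch, assuming the CSV holds "essay" and "corrected" columns; the actual COWS-L2H schema and the repository's real preprocessing logic may differ.

    # Minimal sketch of COWS-L2H cleaning, NOT the repository's actual script.
    # Assumes "essay" / "corrected" columns; the real CSV schema may differ.
    import csv
    import sys

    def load_pairs(csv_path):
        """Yield (erroneous, corrected) text pairs, skipping incomplete rows."""
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                src = row.get("essay", "").strip()
                tgt = row.get("corrected", "").strip()
                if src and tgt:
                    yield src, tgt

    if __name__ == "__main__":
        for src, tgt in load_pairs(sys.argv[1]):
            print(f"{src}\t{tgt}")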

Training

You can start fine-tuning the model:

>>> export WANDB_API_KEY="<Your WandB API Key>"
>>> python run_train.py
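
For reference, the sketch below shows a bare-bones mT5 fine-tuning setup with the HuggingFace Trainer. It is an illustrative stand-in, not the repository's actual training loop; the model size, hyperparameter values, and the toy sentence pair are all assumptions.

    # Bare-bones mT5 fine-tuning sketch with the HuggingFace Trainer; the
    # repository's own loop (mt5_finetuner) differs. Values are illustrative.
    from datasets import Dataset
    from transformers import (DataCollatorForSeq2Seq, MT5ForConditionalGeneration,
                              MT5Tokenizer, Trainer, TrainingArguments)

    tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
    model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

    def encode(example):
        # Inputs: the learner sentence; labels: its tokenized correction.
        enc = tokenizer(example["source"], truncation=True, max_length=128)
        enc["labels"] = tokenizer(example["target"], truncation=True,
                                  max_length=128)["input_ids"]
        return enc

    # Toy pair for illustration only; the real data lives in cowsl2h/data.
    data = Dataset.from_dict({
        "source": ["Yo soy tener veinte años."],
        "target": ["Yo tengo veinte años."],
    }).map(encode, remove_columns=["source", "target"])

    args = TrainingArguments(output_dir="mt5_gec_ckpt",
                             per_device_train_batch_size=8,
                             learning_rate=5e-4, num_train_epochs=3)
    Trainer(model=model, args=args, train_dataset=data,
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)).train()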

The following files can be used to set various hyperparameters (a hypothetical sketch of such settings follows the list):
cowsl2h.py -- dataset loading
globals.py -- dataset parameters
mt5_finetuner.py -- training loop, loss computation, WandB logging
run_train.py -- training hyperparameters
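
As a concrete illustration of the kind of settings these files centralize:

    # Hypothetical parameters of the kind globals.py / run_train.py expose;
    # names and values are illustrative, not the repository's actual settings.
    DATA_DIR = "cowsl2h/data"   # cleaned dataset location
    MAX_SEQ_LENGTH = 128        # token budget per sentence
    BATCH_SIZE = 8
    LEARNING_RATE = 5e-4
    NUM_EPOCHS = 3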

Prediction

You can run inference on all datasets:

>>> python run_predict.py </path/to/ckpt/dir> </path/to/ckpt/file> -d <train, val, test, or all> -b <num_beams>
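
For example, decoding only the test set with 5 beams might look like this (the checkpoint paths here are illustrative):

>>> python run_predict.py checkpoints/ checkpoints/last.ckpt -d test -b 5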

You can run inference on a sentence with a pretrained model:

>>> python predict.py <path/to/model/dir> <text>
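
Under the hood, single-sentence inference amounts to a beam-search generate call. The sketch below is a minimal stand-in for predict.py, assuming the model directory holds a checkpoint saved with save_pretrained:

    # Minimal single-sentence inference sketch, not the repository's predict.py.
    # Assumes <path/to/model/dir> was saved with save_pretrained.
    import sys
    from transformers import MT5ForConditionalGeneration, MT5Tokenizer

    model_dir, text = sys.argv[1], sys.argv[2]
    tokenizer = MT5Tokenizer.from_pretrained(model_dir)
    model = MT5ForConditionalGeneration.from_pretrained(model_dir)

    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, num_beams=5, max_length=128)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))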

You can also play with inference using this Colab notebook.

Evaluation

You can compute the F-0.5 score using ERRANT:

>>> ./eval.sh <path/to/predicted/sentences/>

This will output a results-cs.txt file in your predictions directory containing several metrics:

true positives (TP), false positives (FP), false negatives (FN), precision, recall, and F-0.5
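
F-0.5 weights precision twice as heavily as recall, which suits GEC: proposing a wrong correction is usually worse than missing one. With illustrative counts (not real results from this project):

    # How the reported metrics relate; counts here are made up for illustration.
    tp, fp, fn = 120, 40, 80
    p = tp / (tp + fp)                               # precision = 0.750
    r = tp / (tp + fn)                               # recall    = 0.600
    f05 = (1 + 0.5**2) * p * r / (0.5**2 * p + r)    # F-0.5     ~ 0.714
    print(f"P={p:.3f} R={r:.3f} F0.5={f05:.3f}")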
