Fine-tuning method adapted from the [Vennify AI tutorial](https://www.vennify.ai/fine-tune-grammar-correction/).

This notebook was run on Google Colab with a GPU runtime, and the fine-tuned model was uploaded to [Hugging Face](https://huggingface.co/audribean/happy-gec/tree/main) for local use.
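For local use, the uploaded checkpoint can presumably be loaded by its Hub id in the same way `t5-base` is loaded further down. The sketch below assumes the repository `audribean/happy-gec` loads as a `T5` model. `load_gec_model` and `correct` are hypothetical helper names, and the `TTSettings` values in the usage comment come from the Vennify tutorial rather than from this notebook.

```python
PREFIX = "grammar: "  # same task prefix used when building the training CSV


def load_gec_model(model_id="audribean/happy-gec"):
    """Load the fine-tuned checkpoint (assumed to be a T5 model on the Hub)."""
    # Imported lazily so the helpers can be defined without happytransformer installed.
    from happytransformer import HappyTextToText
    return HappyTextToText("T5", model_id)


def correct(happy_tt, sentence, settings=None):
    """Return the model's corrected version of a single sentence."""
    text = PREFIX + sentence
    if settings is None:
        # Fall back to the library's default generation settings.
        result = happy_tt.generate_text(text)
    else:
        result = happy_tt.generate_text(text, args=settings)
    return result.text


# Usage (downloads the model on first call):
# from happytransformer import TTSettings
# happy_tt = load_gec_model()
# print(correct(happy_tt, "This sentences has has bads grammar.",
#               TTSettings(num_beams=5, min_length=1)))
```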

In [None]:
!pip install happytransformer



In [3]:
from happytransformer import HappyTextToText

happy_tt = HappyTextToText("T5", "t5-base")



In [4]:
from datasets import load_dataset

In [5]:
train_dataset = load_dataset("jfleg", split='validation[:]')

eval_dataset = load_dataset("jfleg", split='test[:]')

In [6]:
import csv

def generate_csv(csv_path, dataset):
    """Write one (input, target) row per sentence/correction pair."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # Add the task prefix to the input
            input_text = "grammar: " + case["sentence"]
            for correction in case["corrections"]:
                # A few of the cases contain blank strings; skip them
                if input_text and correction:
                    writer.writerow([input_text, correction])



generate_csv("train.csv", train_dataset)
generate_csv("eval.csv", eval_dataset)
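A quick sanity check on the generated files can catch blank targets or a missing prefix before training. The helper below is a minimal sketch: `summarize_csv` is a hypothetical name that is not part of the original notebook, and it relies only on the `input`/`target` header written by `generate_csv`.

```python
import csv


def summarize_csv(csv_path, prefix="grammar: "):
    """Return (row_count, all_rows_valid) for a CSV with input/target columns."""
    rows = 0
    valid = True
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        for row in csv.DictReader(csvfile):
            rows += 1
            # Every input should carry the task prefix and every target be non-empty.
            if not row["input"].startswith(prefix) or not row["target"]:
                valid = False
    return rows, valid
```

Calling `summarize_csv("train.csv")` and `summarize_csv("eval.csv")` after the cell above reports how many sentence/correction pairs were written to each file.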

In [7]:
happy_tt.tokenizer.padding=True
#happy_tt.tokenizer.truncation=False
happy_tt.tokenizer.model_max_length=512

In [8]:
before_result = happy_tt.eval("eval.csv")

Generating eval split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2988 [00:00<?, ? examples/s]



In [9]:
print("Before loss:", before_result.loss)

Before loss: 1.2803919315338135


In [8]:
from happytransformer import TTTrainArgs

args = TTTrainArgs(batch_size=8)
happy_tt.train("train.csv", args=args)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2714 [00:00<?, ? examples/s]

Map:   0%|          | 0/302 [00:00<?, ? examples/s]



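| Step | Training Loss | Validation Loss |
|-----:|--------------:|----------------:|
| 1    | 1.4782        | 1.378161        |
| 34   | 0.7815        | 0.744391        |
| 68   | 0.6787        | 0.646292        |
| 102  | 0.6477        | 0.593389        |
| 136  | 0.7112        | 0.567457        |
| 170  | 0.6693        | 0.563348        |
| 204  | 0.5940        | 0.551230        |
| 238  | 0.5463        | 0.548967        |
| 272  | 0.6347        | 0.540410        |
| 306  | 0.5555        | 0.539020        |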
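In the log above, validation loss falls at every logged step while training loss fluctuates. The sketch below checks that directly from the logged numbers. The values are copied by hand from the log, and `LOG` and `is_strictly_decreasing` are illustrative names, not part of the original notebook.

```python
# (step, training loss, validation loss), copied from the training log above
LOG = [
    (1, 1.4782, 1.378161),
    (34, 0.7815, 0.744391),
    (68, 0.6787, 0.646292),
    (102, 0.6477, 0.593389),
    (136, 0.7112, 0.567457),
    (170, 0.6693, 0.563348),
    (204, 0.5940, 0.551230),
    (238, 0.5463, 0.548967),
    (272, 0.6347, 0.540410),
    (306, 0.5555, 0.539020),
]


def is_strictly_decreasing(values):
    """True if each value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


train_losses = [train for _, train, _ in LOG]
val_losses = [val for _, _, val in LOG]

print("validation strictly decreasing:", is_strictly_decreasing(val_losses))
print("training strictly decreasing:", is_strictly_decreasing(train_losses))

# Step with the lowest validation loss among the logged steps
best_step = min(LOG, key=lambda row: row[2])[0]
print("lowest validation loss at step", best_step)
```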


In [11]:
after_result = happy_tt.eval("eval.csv")

print("After loss: ", after_result.loss)

Map:   0%|          | 0/2988 [00:00<?, ? examples/s]



After loss:  0.47910746932029724
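To put the two evaluation losses side by side, the sketch below computes the relative reduction from the printed values. The perplexity line rests on an assumption that is not confirmed in this notebook: that the reported loss is a mean per-token cross-entropy in nats. The variable names are illustrative.

```python
import math

loss_before = 1.2803919315338135   # "Before loss" printed earlier
loss_after = 0.47910746932029724   # "After loss" printed above

# Fraction of the original eval loss removed by fine-tuning
relative_reduction = (loss_before - loss_after) / loss_before
print(f"Relative loss reduction: {relative_reduction:.1%}")

# Assumption: if the loss is a mean per-token cross-entropy in nats,
# exp(loss) can be read as a perplexity.
ppl_before = math.exp(loss_before)
ppl_after = math.exp(loss_after)
print(f"Perplexity before: {ppl_before:.2f}, after: {ppl_after:.2f}")
```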
