## Muthu Palaniappan M - 21011101079
Big Thanks to
- https://www.vennify.ai/fine-tune-grammar-correction/
- https://happytransformer.com/text-to-text/finetuning/

- T5 (Text-to-Text Transfer Transformer) was created by Google AI and released publicly, so anyone can download and use the pretrained checkpoints.
- T5 is an encoder-decoder model that converts all NLP problems into a text-to-text format. It is trained using teacher forcing, which means that training always needs an input sequence and a corresponding target sequence.
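
As a rough illustration of teacher forcing: during training the decoder is fed the target sequence shifted one position to the right, so at every step it predicts the next target token from the ground-truth previous tokens rather than from its own earlier predictions. The sketch below uses made-up token IDs and a hypothetical helper named `shift_right`; the start ID of 0 and end-of-sequence ID of 1 mirror T5's pad and `</s>` tokens.

```python
def shift_right(target_ids, decoder_start_id=0):
    """Build decoder inputs for teacher forcing.

    Prepend the start token and drop the last target token, so that
    position i of the result is what the decoder sees before predicting
    target_ids[i].
    """
    return [decoder_start_id] + list(target_ids[:-1])


# Made-up token IDs standing in for a tokenized target sentence (1 = end of sequence).
target_ids = [27, 1509, 3, 9, 3202, 1]
decoder_input_ids = shift_right(target_ids)

# At each step the decoder receives the true previous token and must predict the next one.
for step, (seen, expected) in enumerate(zip(decoder_input_ids, target_ids)):
    print(f"step {step}: input token {seen} -> predict {expected}")
```

This is why every training row needs both an input and a target: the target supplies the decoder inputs as well as the labels.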

# Installing Packages

In [1]:
!pip install simpletransformers datasets tqdm pandas
from IPython.display import clear_output
clear_output()

# Importing Packages

In [2]:
import pandas as pd
from datasets import load_dataset
from tqdm import tqdm
from simpletransformers.t5 import T5Model
from sklearn.model_selection import train_test_split
import sklearn
import csv

# Loading Dataset

In [3]:
train_dataset = load_dataset("jfleg", split='validation[:]')
eval_dataset = load_dataset("jfleg", split='test[:]')

Generating validation split: 755 examples
Generating test split: 748 examples

# Data Pre-Processing
- We need to structure both the training and the evaluation data in the same format: a CSV file with two columns, input and target.
- The input column contains grammatically incorrect text, and the target column contains the corrected version of the text from the input column.
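
To make the target layout concrete, here is a minimal sketch that expands a single record into CSV rows in memory. The record is hypothetical but shaped like a JFLEG example (one `sentence`, several `corrections`); the blank second correction is an assumption added to show why empty strings are filtered out.

```python
import csv
import io

# Hypothetical record shaped like a JFLEG example: one sentence, several corrections.
record = {
    "sentence": "For not use car .",
    "corrections": ["Not for use with a car .", ""],
}

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["input", "target"])
for correction in record["corrections"]:
    # Skip blank corrections so every row is a usable (input, target) pair.
    if record["sentence"] and correction:
        writer.writerow([record["sentence"], correction])

# Read the CSV text back to inspect the rows that were written.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows)
```

Each sentence is repeated once per non-empty correction, so the CSV has more rows than the dataset has sentences.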

In [4]:
def generate_csv(csv_path, dataset):
    """Write one (input, target) row per non-empty correction in `dataset`."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["input", "target"])
        for case in dataset:
            # The "grammar" task prefix is added later as a DataFrame column, not here.
            input_text = case["sentence"]
            for correction in case["corrections"]:
                # A few of the cases contain blank strings; skip them.
                if input_text and correction:
                    writer.writerow([input_text, correction])

In [5]:
generate_csv("train.csv", train_dataset)
generate_csv("eval.csv", eval_dataset)

In [6]:
train_data = pd.read_csv("train.csv")
train_data['prefix'] = 'grammar'
train_data.columns = ["input_text","target_text","prefix"]
train_data.head()

| | input_text | target_text | prefix |
|---|---|---|---|
| 0 | So I think we can not live if old people could... | So I think we would not be alive if our ancest... | grammar |
| 1 | So I think we can not live if old people could... | So I think we could not live if older people d... | grammar |
| 2 | So I think we can not live if old people could... | So I think we can not live if old people could... | grammar |
| 3 | So I think we can not live if old people could... | So I think we can not live if old people can n... | grammar |
| 4 | For not use car . | Not for use with a car . | grammar |


In [7]:
train_data.shape

(3016, 3)

In [8]:
eval_data = pd.read_csv("eval.csv")
eval_data['prefix'] = 'grammar'
eval_data.columns = ["input_text","target_text","prefix"]
eval_data.head()

| | input_text | target_text | prefix |
|---|---|---|---|
| 0 | New and new technology has been introduced to ... | New technology has been introduced to society . | grammar |
| 1 | New and new technology has been introduced to ... | New technology has been introduced into the so... | grammar |
| 2 | New and new technology has been introduced to ... | Newer and newer technology has been introduced... | grammar |
| 3 | New and new technology has been introduced to ... | Newer and newer technology has been introduced... | grammar |
| 4 | One possible outcome is that an environmentall... | One possible outcome is that an environmentall... | grammar |


In [9]:
train_data.iloc[22]['input_text']

'They draw the consumers , like me , to purchase this great product with all these amazing ingredients and all that but actually they just sometimes make something up just to increase their sales . '

In [10]:
eval_data.shape

(2988, 3)

# Model Training

In [11]:
args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 256,
    "num_train_epochs": 4,
    "num_beams": None,
    "do_sample": True,
    "top_k": 50,
    "top_p": 0.95,
    "use_multiprocessing": False,
    "save_steps": -1,
    "save_eval_checkpoints": True,
    "evaluate_during_training": False,
    'adam_epsilon': 1e-08,
    'eval_batch_size': 6,
    'fp16': False,  # the simpletransformers option is named "fp16" (no underscore)
    'gradient_accumulation_steps': 16,
    'learning_rate': 0.0003,
    'max_grad_norm': 1.0,
    'n_gpu': 1,
    'seed': 42,
    'train_batch_size': 6,
    'warmup_steps': 0,
    'weight_decay': 0.0
}
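
As a back-of-the-envelope check on these settings, the sketch below works out the effective batch size and the approximate number of optimizer updates. The row count is taken from `train_data.shape` above; the update count is an estimate that assumes one optimizer update per `gradient_accumulation_steps` batches and ignores how the library treats a trailing partial group.

```python
import math

train_rows = 3016                 # from train_data.shape above
train_batch_size = 6              # args["train_batch_size"]
gradient_accumulation_steps = 16  # args["gradient_accumulation_steps"]
num_train_epochs = 4              # args["num_train_epochs"]

# Gradients from 16 batches of 6 examples are accumulated before each update.
effective_batch_size = train_batch_size * gradient_accumulation_steps

# Number of batches the data loader yields per epoch (the last one is partial).
batches_per_epoch = math.ceil(train_rows / train_batch_size)

# Estimate: one optimizer update per full group of accumulated batches.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

With so few updates per epoch, the relatively high learning rate of 0.0003 and `warmup_steps` of 0 carry a lot of weight in how quickly the model adapts.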

In [12]:
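import torch

# T5Model defaults to use_cuda=True and, on a runtime without a GPU, raises:
#   ValueError: 'use_cuda' set to True when cuda is unavailable.
#   Make sure CUDA is available or set `use_cuda=False`.
# Enable CUDA only when a GPU is actually present.
model = T5Model("t5", "t5-small", args=args, use_cuda=torch.cuda.is_available())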

In [None]:
# use_cuda belongs to the T5Model constructor; extra keyword arguments here are treated as metric functions.
model.train_model(train_data, eval_data=eval_data, acc=sklearn.metrics.accuracy_score)

# Inference

In [36]:
from simpletransformers.t5 import T5Model
from pprint import pprint
import os

In [37]:
trained_model_path = '/content/outputs'

In [38]:
args = {
    "overwrite_output_dir": True,
    "max_seq_length": 256,
    "max_length": 50,
    "top_k": 50,
    "top_p": 0.95,
    "num_return_sequences": 3,
}

In [39]:
trained_model = T5Model("t5",trained_model_path,args=args)

In [43]:
prefix = "grammar"
pred = trained_model.predict([f"{prefix}: Here was no promise of morning except that we looked up through the trees we saw how low the forest had swung."])
pprint(pred[0])

['We looked up through trees and saw how low the forest had swung.',
 'Here was no promise of morning except that we looked up through the trees '
 'and saw how low the forest had swung.',
 'Here was no promise of morning except that we looked up through the trees we '
 'saw how low the forest had swung.']


In [44]:
def generate_correct_sent(text):
    """Return the list of candidate corrections generated for `text`."""
    prefix = "grammar"
    pred = trained_model.predict([f"{prefix}: {text}"])
    return pred[0]

In [49]:
generate_correct_sent("I saw an girl in PayPal")

['I saw a girl in PayPal.',
 'I saw a girl in PayPal.',
 'I saw a girl in PayPal.']
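
Because `num_return_sequences` is 3, each prediction is a list of three candidates rather than a single sentence. One simple way to reduce that list to one answer is a majority vote; the sketch below is only an illustration, and `pick_most_common` is a hypothetical helper, not part of simpletransformers.

```python
from collections import Counter


def pick_most_common(candidates):
    """Return the candidate that occurs most often; ties go to the earliest one."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    counts = Counter(candidates)
    best = max(counts.values())
    # Walk the original order so that ties resolve to the first candidate seen.
    for candidate in candidates:
        if counts[candidate] == best:
            return candidate


candidates = [
    'I saw a girl in PayPal.',
    'I saw a girl in PayPal.',
    'I saw a girl in PayPal.',
]
print(pick_most_common(candidates))
```

When the sampled candidates all differ, as in the longer example earlier, this falls back to the first candidate, so it is a convenience rather than a quality measure.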