# Error Correction

220604

- [blog](https://towardsdatascience.com/nlp-building-a-grammatical-error-correction-model-deep-learning-analytics-c914c3a8331b)
- [code](https://github.com/priya-dwivedi/Deep-Learning/tree/master/GrammarCorrector)
- [data](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction)
- [c4 data paper](https://aclanthology.org/2021.bea-1.4.pdf)
- [cleaned data](https://drive.google.com/drive/folders/1kKlGcinD_FhGXC0LztN4Ts605YXzMEVA)

# Prepare

In [1]:
!nvidia-smi

Fri Jun  3 23:10:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!git clone https://github.com/airobotlab/lecture_NLP_advanced.git
!mv lecture_NLP_advanced/code_grammer/2_grammer_error_dataset.csv .
!rm -rf lecture_NLP_advanced

Cloning into 'lecture_NLP_advanced'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 35 (delta 9), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (35/35), done.


In [3]:
## install
!pip install datasets tqdm pandas
!pip install sentencepiece==0.1.90
!pip install transformers==4.16.0
# !pip install torch==1.10
!pip install wandb
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.2.2-py3-none-any.whl (346 kB)
     |████████████████████████████████| 346 kB 4.3 MB/s 
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
     |████████████████████████████████| 140 kB 50.5 MB/s 
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
     |████████████████████████████████| 212 kB 51.4 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 53.5 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.5
  Downloading dill-0.3.4-py2.py3-none-any.whl (86 kB)
     |████████████████████████████████| 86 kB 6.5 MB/s

In [4]:
## load library
import argparse
import glob
import os
import json
import time
import logging
import random
import re
from itertools import chain
from string import punctuation

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pandas as pd
import numpy as np
from tqdm import tqdm

import torch
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset, load_metric
rouge_metric = load_metric("rouge")

from transformers import (T5ForConditionalGeneration, T5Tokenizer,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer,
                          DataCollatorForSeq2Seq, AdamW,
                          get_linear_schedule_with_warmup)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


set_seed(42)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Downloading builder script:   0%|          | 0.00/2.16k [00:00<?, ?B/s]

In [5]:
## load model 
model_name = 't5-base'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

In [6]:
## load dataset
pd.set_option('display.max_colwidth', None)

# DATA_ROOT = 'data'
# data_name = 'c4_200m_550k.csv'  # 550k samples
# data_name = 'c4_200m_1M.csv'  # 1M samples

DATA_ROOT = './'
data_name = '2_grammer_error_dataset.csv'  # ~100k samples
data_path = os.path.join(DATA_ROOT, data_name)
df = pd.read_csv(data_path).dropna(axis=0)
print(df.shape)
df.head()

(99998, 2)


| | correct_sentence | error_sentence |
|---|---|---|
| 0 | Answers Regions Is Nagorno Karabakh region part of Armenia | Answers Regions Is Nagorno Karabakh region part at Armenia |
| 1 | Flaneuring Fun in Maple Creek SK | Flaneurg Fun Maple Creek SK |
| 2 | About Private Investigators Ellesmere Port In Ellesmere Port Cheshire | About PEivate InvestigatoEs EllesmeEe PoEt In EllesmeEe PoEt PoEt CheshiEe |
| 3 | Bake in the oven for 35 mins scattering the flaked almonds after the first 20 | Bake in the oven for 35 mins scattering the the flaked almonds after the 20 |
| 4 | informing you of changes in our web site | formg you by changes our web site |


In [7]:
## make data
# hold out 0.5% (500 rows) for test; train on the remaining 99.5%
def calc_token_len(example):
    return len(tokenizer(example).input_ids)

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.005, shuffle=True)
print(train_df.shape, test_df.shape)

test_df['input_token_len'] = test_df['error_sentence'].apply(calc_token_len)
test_df.head()

(99498, 2) (500, 2)


| | correct_sentence | error_sentence | input_token_len |
|---|---|---|---|
| 33969 | Out of interest is that classed as restriction of trade | Out interest is that classed as restriction trade | 11 |
| 24612 | We must move forward grow evolve adapt if we are to survive | We must move forward grow evolve adapt if we are survive | 13 |
| 42678 | New Lush bubble bar dropping soon | NNw Nush bubblN bar dropping soon | 11 |
| 9732 | Free Online Bible Study Course Easy and basic Bible study course | Free Online Bible Study Course Easy basic Bible study course | 11 |
| 7129 | Re Need help with Chinese | Re Need help Chinese | 5 |
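
The split sizes printed above, `(99498, 2)` and `(500, 2)`, can be checked by hand. This is a rough sketch that assumes the test count is the ceiling of `test_size * n_samples` and the training set gets the remainder, which is my understanding of scikit-learn's behaviour when `train_size` is not given. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Row counts for a fractional test split (hypothetical helper)."""
    n_test = math.ceil(test_size * n_samples)   # round the test share up
    n_train = n_samples - n_test                # the rest is used for training
    return n_train, n_test

# 99,998 rows remain after dropna(); 0.5% of them are held out
n_train, n_test = split_sizes(99998, 0.005)
print(n_train, n_test)
```

With only 500 evaluation rows, the ROUGE scores reported during training are computed on a fairly small sample.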


In [8]:
test_df['input_token_len'].describe()

count    500.000000
mean      13.770000
std        5.428447
min        5.000000
25%       10.000000
50%       13.000000
75%       16.000000
max       39.000000
Name: input_token_len, dtype: float64

In [9]:
# df to dataset
# from datasets import Dataset
from datasets import Dataset as datasets_Dataset
train_dataset = datasets_Dataset.from_pandas(train_df)
test_dataset = datasets_Dataset.from_pandas(test_df)
test_dataset

Dataset({
    features: ['correct_sentence', 'error_sentence', 'input_token_len', '__index_level_0__'],
    num_rows: 500
})

In [10]:
# custom dataset
from torch.utils.data import Dataset
class GrammarDataset(Dataset):
    def __init__(self, dataset, tokenizer, print_text=False):
        self.dataset = dataset
        self.pad_to_max_length = False
        self.tokenizer = tokenizer
        self.print_text = print_text
        self.max_len = 64

    def __len__(self):
        return len(self.dataset)

    def tokenize_data(self, example):
        input_, target_ = example['error_sentence'], example['correct_sentence']

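        # tokenize inputs (use the tokenizer passed to the constructor,
        # not the notebook-level global)
        tokenized_inputs = self.tokenizer(input_,
                                          pad_to_max_length=self.pad_to_max_length,
                                          max_length=self.max_len,
                                          return_attention_mask=True)

        # tokenize targets; their input_ids become the labels
        tokenized_targets = self.tokenizer(target_,
                                           pad_to_max_length=self.pad_to_max_length,
                                           max_length=self.max_len,
                                           return_attention_mask=True)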

        inputs = {
            "input_ids": tokenized_inputs['input_ids'],
            "attention_mask": tokenized_inputs['attention_mask'],
            "labels": tokenized_targets['input_ids']
        }

        return inputs

    def __getitem__(self, index):
        inputs = self.tokenize_data(self.dataset[index])

        if self.print_text:
            for k in inputs.keys():
                print(k, len(inputs[k]))

        return inputs

In [11]:
# check dataset
dataset = GrammarDataset(test_dataset, tokenizer, True)
print(dataset[121])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


input_ids 7
attention_mask 7
labels 6
{'input_ids': [11419, 11419, 19, 82, 793, 11341, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [11419, 19, 82, 793, 11341, 1]}
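
Because `pad_to_max_length` is `False`, every example keeps its own length (7 input ids and 6 label ids above), so a batch has to be padded before it can be stacked. The sketch below is a simplified, pure-Python stand-in for that step, not the library's collator. It assumes pad id `0` for inputs (the `pad_token_id` in the T5 config) and `-100` for labels, the value that `compute_metrics` later swaps back before decoding. `pad_batch` and the second example are made up for illustration.

```python
def pad_batch(examples, pad_token_id=0, label_pad_id=-100):
    """Pad tokenized examples to the longest one in the batch.

    Simplified stand-in for a seq2seq collator: returns plain lists,
    not tensors.
    """
    max_in = max(len(ex["input_ids"]) for ex in examples)
    max_lab = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_in = max_in - len(ex["input_ids"])
        n_lab = max_lab - len(ex["labels"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_in)
        batch["attention_mask"].append(ex["attention_mask"] + [0] * n_in)
        # -100 marks label positions that the loss should skip
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_lab)
    return batch

# first example is the one printed above; the second is invented
examples = [
    {"input_ids": [11419, 11419, 19, 82, 793, 11341, 1],
     "attention_mask": [1] * 7,
     "labels": [11419, 19, 82, 793, 11341, 1]},
    {"input_ids": [19, 82, 1], "attention_mask": [1] * 3, "labels": [19, 82, 1]},
]
batch = pad_batch(examples)
print(batch["input_ids"][1])
print(batch["labels"][1])
```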


# Train Model

In [12]:
## config
# defining training related arguments
batch_size = 10
args = Seq2SeqTrainingArguments(
    output_dir='weights',
    evaluation_strategy='steps',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=True,
    gradient_accumulation_steps=6,
    eval_steps=1000,
    save_steps=1000,
    load_best_model_at_end=True,
    logging_dir='logs',
    report_to='wandb')

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding='longest', return_tensors='pt')
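
With `batch_size = 10` and `gradient_accumulation_steps=6`, each optimizer update sees 60 examples. As a back-of-the-envelope sketch, assuming a single GPU and that a trailing partial accumulation group is dropped each epoch, the number of updates for the 99,498 training rows works out as follows. `total_optimization_steps` is a hypothetical helper, not a `transformers` function.

```python
import math

def total_optimization_steps(n_examples, batch_size, grad_accum, epochs):
    """Estimate optimizer updates for single-device training (hypothetical helper)."""
    batches_per_epoch = math.ceil(n_examples / batch_size)       # last batch may be partial
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)  # one update per grad_accum batches
    return math.ceil(epochs * updates_per_epoch)

effective_batch = 10 * 6   # per-device batch size x accumulation steps
steps = total_optimization_steps(99498, batch_size=10, grad_accum=6, epochs=3)
print(effective_batch, steps)
```

Both numbers agree with the "Total train batch size" and "Total optimization steps" lines in the training log. With `eval_steps=1000` and `save_steps=1000`, that gives evaluations and checkpoints at steps 1000 through 4000.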

In [13]:
# metric
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions,
                                           skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = [
        "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds
    ]
    decoded_labels = [
        "\n".join(nltk.sent_tokenize(label.strip()))
        for label in decoded_labels
    ]

    result = rouge_metric.compute(predictions=decoded_preds,
                                  references=decoded_labels,
                                  use_stemmer=True)
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id)
        for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)
    return {k: round(v, 4) for k, v in result.items()}
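
To get a feel for what the Rouge1 column measures, here is a simplified unigram-overlap F-measure. It is only a sketch: it splits on whitespace and lowercases, with no stemming, so its values will not necessarily match what `rouge_metric` reports. `rouge1_f` is a hypothetical name.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure, a simplified stand-in for ROUGE-1."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("Cat drinked milk", "Cat drank milk")
print(round(score * 100, 4))
```

One wrong word out of three already leaves two thirds of the unigrams matching, so an uncorrected sentence can still score high. ROUGE is therefore a loose proxy for grammatical correctness.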

In [14]:
# trainer
trainer = Seq2SeqTrainer(model=model,
                         args=args,
                         train_dataset=GrammarDataset(train_dataset, tokenizer),
                         eval_dataset=GrammarDataset(test_dataset, tokenizer),
                         tokenizer=tokenizer,
                         data_collator=data_collator,
                         compute_metrics=compute_metrics)

Using amp half precision backend


# Train!!

In [15]:
## train!!
# os.environ["WANDB_DISABLED"] = "true"
trainer.train()
trainer.save_model('weights/t5_gec_model')

***** Running training *****
  Num examples = 99498
  Num Epochs = 3
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 60
  Gradient Accumulation steps = 6
  Total optimization steps = 4974
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1000 | 0.7294 | 0.627394 | 87.7909 | 76.6429 | 87.5161 | 87.5893 | 12.41 |
| 2000 | 0.6522 | 0.587282 | 88.2561 | 77.6951 | 87.985 | 88.0556 | 12.36 |
| 3000 | 0.6285 | 0.570424 | 88.5403 | 78.2334 | 88.3392 | 88.389 | 12.34 |
| 4000 | 0.6025 | 0.562088 | 88.7359 | 78.5243 | 88.5191 | 88.599 | 12.354 |


***** Running Evaluation *****
  Num examples = 500
  Batch size = 10
Saving model checkpoint to weights/checkpoint-1000
Configuration saved in weights/checkpoint-1000/config.json
Model weights saved in weights/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in weights/checkpoint-1000/tokenizer_config.json
Special tokens file saved in weights/checkpoint-1000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 500
  Batch size = 10
Saving model checkpoint to weights/checkpoint-2000
Configuration saved in weights/checkpoint-2000/config.json
Model weights saved in weights/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in weights/checkpoint-2000/tokenizer_config.json
Special tokens file saved in weights/checkpoint-2000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 500
  Batch size = 10
Saving model checkpoint to weights/checkpoint-3000
Configuration saved in weights/checkpoint-3000/config.json
Model weights saved in

# Inference

In [16]:
# prepare inference
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
# model_name = 'deep-learning-analytics/GrammarCorrector'
# model_name = 'weights/t5_gec_model'  # final model written by trainer.save_model above
model_name = 'weights/checkpoint-4000'  # checkpoint saved at step 4000

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
print('load model done')

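def correct_grammar(input_text, num_return_sequences):
    """Return `num_return_sequences` corrected candidates for `input_text`."""
    batch = tokenizer([input_text],
                      truncation=True,
                      padding='max_length',
                      max_length=64,
                      return_tensors="pt").to(device)
    # Beam search: num_return_sequences must not exceed num_beams.
    # temperature should only matter when sampling (do_sample=True),
    # so it likely has no effect on these beam-search results.
    translated = model.generate(**batch,
                                max_length=64,
                                num_beams=4,
                                num_return_sequences=num_return_sequences,
                                temperature=1.5)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text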

Didn't find file weights/checkpoint-4000/added_tokens.json. We won't load it.
Didn't find file weights/checkpoint-4000/tokenizer.json. We won't load it.
loading file weights/checkpoint-4000/spiece.model
loading file None
loading file weights/checkpoint-4000/special_tokens_map.json
loading file weights/checkpoint-4000/tokenizer_config.json
loading file None
loading configuration file weights/checkpoint-4000/config.json
Model config T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "task_specific_params": 

load model done


### Do inference!!

In [17]:
text = 'He are moving here.'
print(correct_grammar(text, num_return_sequences=2))

['He are moving here', 'He are moving here.']


In [18]:
text = 'Cat drinked milk'
print(correct_grammar(text, num_return_sequences=2))

['Cat drinked milk', 'Cat drank milk']
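
In both runs above the top candidate is the unchanged input, and for `'Cat drinked milk'` the corrected sentence only appears as the second candidate. One way to inspect the results is to pick the first candidate that differs from the input and list the word-level edits. This is a sketch using the standard library's `difflib`; `diff_words` and `first_changed` are hypothetical helpers, and the candidate list is copied from the output above rather than produced by the model.

```python
import difflib

def diff_words(source, corrected):
    """List word-level edits between two sentences (hypothetical helper)."""
    src, tgt = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits

def first_changed(source, candidates):
    """Return the first candidate that differs from the input, else None."""
    return next((c for c in candidates if c != source), None)

text = "Cat drinked milk"
candidates = ["Cat drinked milk", "Cat drank milk"]  # output shown above
best = first_changed(text, candidates)
print(best, diff_words(text, best))
```

Whether skipping the unchanged candidate is appropriate depends on the use case, since an input that is already correct should come back unchanged.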


# Done!