In [1]:
import random
%matplotlib inline

Grammar Error Correction with RNNs
======================================================

This notebook shows an encoder-decoder model for Grammar Error Correction on C4 200M dataset.
The following notebook was based on [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/abs/1406.1078) and followed Pytorch's sequence to sequence NMT [tutorials](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) and [pytorch-seq2seq](https://github.com/bentrevett/pytorch-seq2seq)

# Data Sourcing and Processing

C4 200M dataset from Google Research is used in this notebook. You can find more information about the C4 200M dataset on GR's [BEA 2021 paper](https://aclanthology.org/2021.bea-1.4/).
The already [processed dataset](https://huggingface.co/datasets/liweili/c4_200m) was extracted from Huggingface, then was transformed to HDF5 format for better manageability. The conversion process was based on this [notebook](https://github.com/rasbt/deeplearning-models/blob/master/pytorch_ipynb/mechanics/custom-data-loader-csv.ipynb).
The final version of the dataset is uploaded on [Kaggle](https://www.kaggle.com/datasets/dariocioni/c4200m).

A custom class ``Hdf5Dataset`` based on ``torch.utils.data.Dataset`` is developed, which yields a pair of source-target raw sentences.

| source                                             | target                                                  |
|----------------------------------------------------|---------------------------------------------------------|
| Much many brands and sellers still in the market.  | Many brands and sellers still in the market.            |
| She likes playing in park and come here every week | She likes playing in the park and comes here every week |

In [2]:
# Import libraries
import torch
import pandas as pd
import numpy as np
import pathlib as pl

In [3]:
import h5py
from torch.utils.data import Dataset

class Hdf5Dataset(Dataset):
    """Custom Dataset for loading entries from HDF5 databases"""

    def __init__(self, h5_path, transform=None,num_entries = None):

        self.h5f = h5py.File(h5_path, 'r')
        if num_entries:
            self.num_entries = num_entries
        else:
            self.num_entries = self.h5f['labels'].shape[0]
        self.transform = transform

    def __getitem__(self, index):
        if index > self.num_entries:
            raise StopIteration
        input = self.h5f['input'][index].decode('utf-8')
        label = self.h5f['labels'][index].decode('utf-8')
        if self.transform is not None:
            features = self.transform(input)
        return input, label

    def __len__(self):
        return self.num_entries

In [4]:
from typing import Iterable, List
from tqdm import tqdm
import pathlib as pl
from torchtext.data import get_tokenizer

# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, index: int) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
    for data_sample in tqdm(data_iter):
        if data_sample[index] and isinstance(data_sample[index],str):
            yield token_transform(data_sample[index])

SRC_LANGUAGE = 'incorrect'
TGT_LANGUAGE = 'correct'
# MAX_LENGTH = 512
VOCAB_SIZE = 20000
N_SAMPLES = 1000

# # Define special symbols and indices
UNK_IDX,PAD_IDX, BOS_IDX, EOS_IDX = 0,1,2,3
# # Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<pad>','<unk>', '[CLS]', '[SEP]']

# Place-holders
token_transform =get_tokenizer('basic_english')
vocab_transform = None

folder = 'D:\Datasets\c4_200m\data\hdf5'
train_filename = 'C4_200M.hf5-00000-of-00010'
valid_filename = 'C4_200M.hf5-00001-of-00010'
embedding_path = 'D:\Datasets\glove\glove.42B.300d.txt'
checkpoint_folder = 'D:\Datasets\c4_200m\checkpoints'

## Tokenizing and Embedding
Data is then tokenized by the standard tokenizer from ``torchtext`` library, which performs basic normalization and splitting by space. Normalization includes
- lowercasing
- complete some basic text normalization for English words as follows:
    add spaces before and after '\''
    remove '\"',
    add spaces before and after '.'
    replace '<br \/>'with single space
    add spaces before and after ','
    add spaces before and after '('
    add spaces before and after ')'
    add spaces before and after '!'
    add spaces before and after '?'
    replace ';' with single space
    replace ':' with single space
    replace multiple spaces with single space

Possible future enhancements could be:
- A tokenization library like ``spacy``
- Using pretrained embeddings such as ``Word2vec`` or ``GloVe`` Embeddings, which was trained on Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors)

In [5]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<UNK>', '<PAD>', '<BOS>', '<EOS>']

In [6]:
vocab_transform = torch.load('vocab/vocab_20K.pth')

Collation
---------

An iterator over ``Hdf5dataset`` yields a pair of raw strings.
We need to convert these string pairs into the batched tensors that can be processed by our ``Seq2Seq`` network.
Below we define our collate function that convert batch of raw strings into batch tensors that can be fed directly into our model.

In [7]:
from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# def glove_transform(tokens: List[str]):



# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# src and tgt language text transforms to convert raw strings into tensors indices
text_transform = sequential_transforms(token_transform,
                                               vocab_transform,
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tesors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform(src_sample.rstrip("\n")))
        tgt_batch.append(text_transform(tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch

Let's finally see all the three steps of conversion of a sentence to an embedding tensor.

In [8]:
text = 'data mining is awesome!'
tokenized_input = token_transform(text)
print("tokenized input:\n",tokenized_input)

encoded_input = vocab_transform(tokenized_input)
print("encoded input:\n",encoded_input)

print("transformed input:\n",text_transform(text))



# my_embedding_layer = torch.nn.Embedding.from_pretrained(torch.from_numpy(embs_npa).float())
# assert my_embedding_layer.weight.shape == embs_npa.shape
# embedding1 = my_embedding_layer(tensor_transform(encoded_input))
# print(embedding1)

tokenized input:
 ['data', 'mining', 'is', 'awesome', '!']
encoded input:
 [157, 1185, 13, 1480, 32]
transformed input:
 tensor([   2,  157, 1185,   13, 1480,   32,    3])


### Unknown words
In this version, unknown words are all converted to <unk> and converted to the same embedding.

In [9]:
text = 'dataminingisawesome!'
tokenized_input = token_transform(text)
print(tokenized_input)

encoded_input = vocab_transform(tokenized_input)
print(encoded_input)

['dataminingisawesome', '!']
[0, 32]


# RNN Network
This network is a seq2seq network composed by GRU layers.

In [10]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

teacher_forcing_ratio = 0.5
torch.manual_seed(0)

EMB_SIZE = 300
HIDDEN_SIZE = 512
BATCH_SIZE = 16
NUM_ENCODER_LAYERS = 1
NUM_DECODER_LAYERS =1

learning_rate = 0.001

During training, we need a subsequent word mask that will prevent model to look into the future words when making predictions. We will also need masks to hide source and target padding tokens. Below, let's define a function that will take care of both.

In [11]:
def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask


def create_mask(src):
    src_seq_len = src.shape[0]

    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    return src_mask, src_padding_mask

In [12]:
def train(model, iterator, optimizer, criterion, clip):

    model.train()

    epoch_loss = 0

    for src, trg in tqdm(iterator):

        optimizer.zero_grad()

        src = src.to(DEVICE)
        trg = trg.to(DEVICE)

        output = model(src, trg)

        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]

        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)

        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):

    model.eval()

    epoch_loss = 0

    with torch.no_grad():

        for src, trg in tqdm(iterator):

            src = src.to(DEVICE)
            trg = trg.to(DEVICE)

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size]

            loss = criterion(output, trg)

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

To wrap the encoder and decoder, we can define a ``Seq2Seq`` class where the forward passing is performed.
Teacher forcing can be imposed and also be done with a varying ratio between 0 and 1.

In [13]:
from torch import nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert encoder.hid_dim == decoder.hid_dim,"Hidden dimensions of encoder and decoder must be equal!"

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):

        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time

        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        #last hidden state of the encoder is the context
        context = self.encoder(src)

        #context also used as the initial hidden state of the decoder
        hidden = context

        #first input to the decoder is the <sos> tokens
        input = trg[0,:]

        for t in range(1, trg_len):

            #insert input token embedding, previous hidden state and the context state
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, context)

            #place predictions in a tensor holding predictions for each token
            outputs[t] = output

            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio

            #get the highest predicted token from our predictions
            top1 = output.argmax(1)

            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs

Let's now define the parameters of our model and instantiate the same. Below, we also define our loss function which is the cross-entropy loss and the optmizer used for training.

In [14]:
from models.rnn_seq2seq import Encoder, Decoder

# attn = Attention(HIDDEN_SIZE, HIDDEN_SIZE)

encoder1 = Encoder(VOCAB_SIZE,EMB_SIZE,HIDDEN_SIZE,0)
decoder1 = Decoder(VOCAB_SIZE,EMB_SIZE,HIDDEN_SIZE,0.1)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

#optimizer = torch.optim.Adam(encoder1.parameters(), lr = learning_rate , betas=(0.9, 0.98), eps=1e-9)
#decoder_optimizer = torch.optim.Adam(encoder1.parameters(), lr = learning_rate, betas=(0.9, 0.98), eps=1e-9)

model = Seq2Seq(encoder1,decoder1,DEVICE).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters())
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 41,787,040 trainable parameters


Now we have all the ingredients to train our model. Let's do it!




In [15]:
from torch.utils.data import DataLoader
from timeit import default_timer as timer
NUM_EPOCHS = 10
CLIP = 1
RESUME = False

train_iter = Hdf5Dataset(pl.Path(folder)/train_filename,num_entries=N_SAMPLES)
train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
val_iter = Hdf5Dataset(pl.Path(folder)/valid_filename,num_entries=N_SAMPLES)
val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

model.train()
if RESUME:
    checkpoint = torch.load(pl.Path('checkpoints')/"model.pt")
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train(model,train_dataloader,optimizer,loss_fn,0)
    end_time = timer()
    val_loss = evaluate(model,val_dataloader,loss_fn)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': val_loss,
    }, pl.Path('checkpoints')/"model.pt")

100%|██████████| 63/63 [00:29<00:00,  2.13it/s]
100%|██████████| 63/63 [00:08<00:00,  7.75it/s]


Epoch: 1, Train loss: 9.970, Val loss: 9.973, Epoch time = 29.561s


100%|██████████| 63/63 [00:26<00:00,  2.36it/s]
100%|██████████| 63/63 [00:08<00:00,  7.56it/s]


Epoch: 2, Train loss: 9.963, Val loss: 9.973, Epoch time = 26.750s


100%|██████████| 63/63 [00:26<00:00,  2.35it/s]
100%|██████████| 63/63 [00:08<00:00,  7.42it/s]


Epoch: 3, Train loss: 9.968, Val loss: 9.973, Epoch time = 26.802s


100%|██████████| 63/63 [00:26<00:00,  2.33it/s]
100%|██████████| 63/63 [00:08<00:00,  7.41it/s]


Epoch: 4, Train loss: 9.973, Val loss: 9.973, Epoch time = 26.990s


100%|██████████| 63/63 [00:26<00:00,  2.37it/s]
100%|██████████| 63/63 [00:08<00:00,  7.45it/s]


Epoch: 5, Train loss: 9.970, Val loss: 9.973, Epoch time = 26.593s


100%|██████████| 63/63 [00:26<00:00,  2.38it/s]
100%|██████████| 63/63 [00:08<00:00,  7.45it/s]


Epoch: 6, Train loss: 9.970, Val loss: 9.973, Epoch time = 26.525s


100%|██████████| 63/63 [00:26<00:00,  2.37it/s]
100%|██████████| 63/63 [00:08<00:00,  7.51it/s]


Epoch: 7, Train loss: 9.970, Val loss: 9.973, Epoch time = 26.531s


100%|██████████| 63/63 [00:26<00:00,  2.38it/s]
100%|██████████| 63/63 [00:08<00:00,  7.49it/s]


Epoch: 8, Train loss: 9.966, Val loss: 9.973, Epoch time = 26.497s


100%|██████████| 63/63 [00:26<00:00,  2.37it/s]
100%|██████████| 63/63 [00:08<00:00,  7.50it/s]


Epoch: 9, Train loss: 9.970, Val loss: 9.973, Epoch time = 26.552s


100%|██████████| 63/63 [00:26<00:00,  2.38it/s]
100%|██████████| 63/63 [00:08<00:00,  7.50it/s]


Epoch: 10, Train loss: 9.971, Val loss: 9.973, Epoch time = 26.529s


To evaluate the results, we can define a function which gets in input a text (already converted in a tensor with the previously defined collation pipeline) and returns the output sequence, already converted by taking the highest probability token at each time.

In [18]:
import re
# function to generate output sequence using greedy algorithm 
def correct_sentence_vectorized(src_tensor, model, max_len=50):
    assert isinstance(src_tensor, torch.Tensor)

    model.eval()
    src_tensor = src_tensor.unsqueeze(1).to(DEVICE)

    trg_vocab_size = model.decoder.output_dim

    #tensor to store decoder outputs
    outputs = torch.zeros(max_len, 1, trg_vocab_size).to(DEVICE)

    #last hidden state of the encoder is the context
    with torch.no_grad():
        context = model.encoder(src_tensor)

    #context also used as the initial hidden state of the decoder
    hidden = context

    #first input to the decoder is the <sos> tokens
    input = src_tensor[0,:]
    # enc_src = [batch_sz, src_len, hid_dim]
    # Even though some examples might have been completed by producing a <eos> token
    # we still need to feed them through the model because other are not yet finished
    # and all examples act as a batch. Once every single sentence prediction encounters
    # <eos> token, then we can stop predicting.
    for t in range(1, max_len):

        #insert input token embedding, previous hidden state and the context state
        #receive output tensor (predictions) and new hidden state
        output, hidden = model.decoder(input, hidden, context)

        #place predictions in a tensor holding predictions for each token
        outputs[t] = output

        #get the highest predicted token from our predictions
        top1 = output.argmax(1)

        #if teacher forcing, use actual next token as next input
        #if not, use predicted token
        input = top1

    pred_sentence = []

    for i in range(1, len(outputs)):
        topv, topi = outputs[i,:,:].topk(1)
        pred_sentence.append(vocab_transform.vocab.itos_[topi])
        if topi == EOS_IDX:
            break

    return ' '.join(pred_sentence)

In [19]:
checkpoint = torch.load(pl.Path('checkpoints')/"model.pt")
model.load_state_dict(checkpoint['model_state_dict'])

model.eval()

# Pick one in 18M examples
val_iter = Hdf5Dataset(pl.Path(folder)/valid_filename,num_entries=None)

src,trg = random.choice(val_iter)

print("input: \"",src,"\"")
print("target: \"",trg,"\"")

src = text_transform(src)

print(correct_sentence_vectorized(src,model))

input: " Bobby was despentate to escape. "
target: " Bobby was desperate to escape. "
demand analysts edwin acidity mayonnaise driving 122 submission myers destiny furnishing steroids commands breakers whatsapp gibson way newer attracts vanities methods communities meltdown telstra western spice methane pulled hurry hosts clever lam puerto paste ink dropbox thou hashtags equipment bits cambridge chandler alec evenings wolf bsc 2006 binder leaching


References
----------

1. Attention is all you need paper.
   https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
2. The annotated transformer. https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding

