## Fine-tuning a pretrained BERT for spell correction tasks

### Data download and setup

We will use a general spell correction dataset available [here](https://github.com/mhagiwara/xfspell/blob/master/data/fce/fce-split.us.tsv). The dataset consists of general English sentences with induced spelling errors. Download the file and store it on the local notebook instance.
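To make the file layout concrete, here is a minimal sketch of how a tab-separated file of sentence pairs parses into two columns. The two rows are shortened, illustrative sentences rather than lines copied from the dataset, and `sample_tsv` and `sample_df` are names made up for this example.

```python
import io

import pandas as pd

# Two illustrative rows in the same layout as the dataset file:
# misspelled sentence, a tab, then the corrected sentence.
sample_tsv = (
    "I WANT TO THAK YOU\tI WANT TO THANK YOU\n"
    "IT INVOLVES VISITING THE SHOW\tIT INVOLVES VISITING THE SHOW\n"
)

sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,                          # the file has no header row
    names=["input_text", "target_text"],  # column names used in this notebook
)
print(sample_df.shape)
print(sample_df.iloc[0].tolist())
```

Reading the real file works the same way, with the file path in place of the `io.StringIO` buffer.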

In [8]:
import pandas as pd

In [10]:
ScheckDF = pd.read_csv("xfspell/data/fce/fce-split.us.tsv",sep="\t",header=None,names=["input_text","target_text"])

In [53]:
ScheckDF.head().values

array([['I WANT TO THAK YOU FOR PREPARING SUCH A GOOD PROGRAMME FOR US AND ESPECIALLY FOR TAKING US ON THE RIVER TRIP TO GREENWICH.',
        'I WANT TO THANK YOU FOR PREPARING SUCH A GOOD PROGRAMME FOR US AND ESPECIALLY FOR TAKING US ON THE RIVER TRIP TO GREENWICH.'],
       ['I WOULD LIKE TO KNOW IF THERE IS ANY CHANCE OF CHANGING THE PROGRAMME BECAUSE WE HAVE FOUND A VERY INTERESTING ACTIVITY TO DO ON TUESDAY 14 MARCH.',
        'I WOULD LIKE TO KNOW IF THERE IS ANY CHANCE OF CHANGING THE PROGRAMME BECAUSE WE HAVE FOUND A VERY INTERESTING ACTIVITY TO DO ON TUESDAY 14 MARCH.'],
       ['IT INVOLVES VISITING THE LONDON FASHION AND LEISURE SHOW AT THE CENTRAL EXHIBITION HALL.',
        'IT INVOLVES VISITING THE LONDON FASHION AND LEISURE SHOW AT THE CENTRAL EXHIBITION HALL.'],
       ["I THINK IT'S A GREAT OPPORTUNITY TO MAKE GREATER USE OF OUR KNOWLEDGE OF THE ENGLISH LANGUAGE.",
        "I THINK IT'S A GREAT OPPORTUNITY TO MAKE GREATER USE OF OUR KNOWLEDGE OF THE ENGLISH LANGUAGE."],

In [36]:
trainDF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8498 entries, 0 to 8499
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   input_text   8498 non-null   object
 1   target_text  8498 non-null   object
dtypes: object(2)
memory usage: 199.2+ KB


#### Divide the data into train and eval sets and remove null values, if any

In [16]:
trainDF = ScheckDF[:8500]
evalDF = ScheckDF[8500:]

In [32]:
trainDF = trainDF.dropna(how='any',axis=0) 
evalDF = evalDF.dropna(how='any',axis=0) 

In [None]:
print("size of training data {}".format(len(trainDF.index)))
print("size of eval data {}".format(len(evalDF.index)))
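The effect of the positional split followed by `dropna` can be checked on a toy frame. This is a small sketch with made-up data (`toy`, `toy_train` and `toy_eval` are hypothetical names). It shows that dropped rows leave gaps in the index, which would explain why `trainDF.info()` reports 8498 entries for an index running from 0 to 8499.

```python
import pandas as pd

# Toy data with one missing value in each half.
toy = pd.DataFrame({
    "input_text": ["a", "b", None, "d", "e", "f"],
    "target_text": ["a", "b", "c", "d", None, "f"],
})

# Same pattern as above: positional slice, then drop rows with any null.
toy_train = toy[:4].dropna(how="any", axis=0)
toy_eval = toy[4:].dropna(how="any", axis=0)

print(len(toy_train), list(toy_train.index))
print(len(toy_eval), list(toy_eval.index))
```

The original index labels are kept, so the row count and the largest index can disagree after rows are dropped.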

### Implementation details

The idea is to use a Seq2Seq architecture in which both inputs and targets are text sequences. For our dataset, the incorrect sentences are the inputs and the corrected sentences are the targets. We will use an encoder-decoder architecture and fine-tune pretrained BERT/RoBERTa models for the encoder and decoder networks.
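Since several of the sample pairs shown earlier are identical on both sides, it can help to see how many pairs actually contain a correction. The sketch below uses two shortened, illustrative pairs. On the real data the same comparison could be run on `trainDF`.

```python
import pandas as pd

# Illustrative pairs: the first contains a misspelling, the second does not.
pairs = pd.DataFrame({
    "input_text": ["I WANT TO THAK YOU", "IT INVOLVES VISITING THE SHOW"],
    "target_text": ["I WANT TO THANK YOU", "IT INVOLVES VISITING THE SHOW"],
})

# True where the input differs from the target, i.e. the model has to change something.
has_error = pairs["input_text"] != pairs["target_text"]
print(int(has_error.sum()), "of", len(pairs), "pairs contain a correction")
```

Pairs without an error still matter for training, because the model also has to learn to leave correct sentences unchanged.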

We will use a Python library called "simpletransformers", which builds on the "Transformers" library from Hugging Face to provide higher-level APIs for common language tasks. More information can be found [here](https://simpletransformers.ai/).

#### Install the simple transformers library

In [1]:
! pip install simpletransformers

Collecting simpletransformers
  Downloading simpletransformers-0.48.14-py3-none-any.whl (214 kB)
Collecting regex
  Downloading regex-2020.10.15-cp36-cp36m-manylinux2010_x86_64.whl (662 kB)
Collecting tokenizers
  Downloading tokenizers-0.9.2-cp36-cp36m-manylinux1_x86_64.whl (2.9 MB)
Collecting seqeval
  Downloading seqeval-1.2.1.tar.gz (43 kB)
Collecting wandb
  Downloading wandb-0.10.7-py2.py3-none-any.whl (1.7 MB)
Collecting transformers>=3.0.2
  Downloading transformers-3.4.0-py3-none-any.whl (1.3 MB)
Collecting streamlit
  Downloading s

Collecting smmap<4,>=3.0.1
  Using cached smmap-3.0.4-py2.py3-none-any.whl (25 kB)


Building wheels for collected packages: seqeval, subprocess32, promise, sacremoses, blinker
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.1-py3-none-any.whl size=16167 sha256=4dc407db4f5009fa6e6b695d724c104683d3e37945ab8bfddd9e3744a4147cc2
  Stored in directory: /home/ec2-user/.cache/pip/wheels/41/ac/96/e9d5bceb83600d09ba2ca99693befff9b47bc35334943da786
  Building wheel for subprocess32 (setup.py) ... done
  Created wheel for subprocess32: filename=subprocess32-3.5.4-py3-none-any.whl size=6489 sha256=ebeecb9acfb5accbd06935ebf43b9021fb086287bde7f27bb4248cccaaaab9e3
  Stored in directory: /home/ec2-user/.cache/pip/wheels/44/3a/ab/102386d84fe551b6cedb628ed1e74c5f5be76af8b909aeda09
  Building wheel for promise (setup.py) ... done
  Created wheel for promise: filename=promise-2.3-py3-none-any.whl size=21495 sha256=5dede3c08ddd83fba9aa751950b8736a6959907ec2d149797b41b105e32a0b76
  Stored in directory: /h

### Train a model with RoBERTa as encoder and BERT as decoder

This is a default setting from the library. The following rules currently apply to Encoder-Decoder models:

1. The decoder must be a bert model.
2. The encoder can be one of [bert, roberta, distilbert, camembert, electra].
3. The encoder and the decoder must be of the same "size". (E.g. roberta-base encoder and a bert-base-uncased decoder)
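The three rules above can be turned into a quick sanity check before any weights are downloaded. This is only a sketch: `check_pairing` is a hypothetical helper, not part of simpletransformers, and the size comparison is a heuristic that assumes the checkpoint name contains the word `base` or `large`.

```python
ALLOWED_ENCODERS = {"bert", "roberta", "distilbert", "camembert", "electra"}


def check_pairing(encoder_type, encoder_name, decoder_name):
    """Return a list of rule violations (empty if the pairing looks valid)."""
    problems = []
    # Rule 2: the encoder type must be one of the supported families.
    if encoder_type not in ALLOWED_ENCODERS:
        problems.append(f"unsupported encoder type: {encoder_type}")
    # Rule 1 (heuristic): assume BERT checkpoint names start with "bert-".
    if not decoder_name.startswith("bert-"):
        problems.append(f"decoder is not a BERT checkpoint: {decoder_name}")

    # Rule 3 (heuristic): compare the size word found in each checkpoint name.
    def size(name):
        return next((s for s in ("base", "large") if s in name.split("-")), None)

    if size(encoder_name) != size(decoder_name):
        problems.append("encoder and decoder sizes differ")
    return problems


print(check_pairing("roberta", "roberta-base", "bert-base-cased"))
print(check_pairing("roberta", "roberta-large", "bert-base-cased"))
```

The first call uses the pairing trained in this section. The second shows a size mismatch being flagged.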

We will train for 10 epochs with a batch size of 16. Run the code below on a GPU (preferably a P3 instance).
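The configuration used in this section sets both `max_seq_length` and `max_length` to 15. As a rough check of what that limit means for this dataset, the sketch below counts whitespace-separated words in two of the sample sentences shown earlier. The word count is only a proxy: the tokenizer works on subword tokens, so the real token count is presumably at least as large, and longer inputs would be truncated.

```python
# Two of the sample sentences printed by ScheckDF.head() earlier.
sentences = [
    "I WANT TO THAK YOU FOR PREPARING SUCH A GOOD PROGRAMME FOR US AND "
    "ESPECIALLY FOR TAKING US ON THE RIVER TRIP TO GREENWICH.",
    "IT INVOLVES VISITING THE LONDON FASHION AND LEISURE SHOW AT THE "
    "CENTRAL EXHIBITION HALL.",
]

MAX_SEQ_LENGTH = 15  # value used in model_args in this section

# Whitespace word count as a rough lower bound on the token count.
word_counts = [len(s.split()) for s in sentences]
too_long = [n > MAX_SEQ_LENGTH for n in word_counts]
print(word_counts, too_long)
```

If many sentences exceed the limit, raising `max_seq_length` and `max_length` is one option to consider, at the cost of more memory and longer training.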

In [2]:
from simpletransformers.seq2seq import Seq2SeqModel
import logging
import pandas as pd
import sklearn




In [35]:
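# Show simpletransformers progress messages, but keep the transformers library quiet.
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 15,  # maximum input length in tokens
    "train_batch_size": 16,
    "num_train_epochs": 10,
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    "evaluate_generated_text": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing": False,
    "max_length": 15,  # maximum length of the generated output in tokens
    "manual_seed": 4,
}

encoder_type = "roberta"

# RoBERTa encoder paired with a BERT decoder of the same size.
model = Seq2SeqModel(
    encoder_type,
    "roberta-base",
    "bert-base-cased",
    args=model_args,
    use_cuda=True,  # requires a CUDA-capable GPU
)

# trainDF must contain the columns "input_text" and "target_text".
model.train_model(trainDF)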



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.0.c


Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f0673ada0f0>>
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1101, in __del__
    self._shutdown_workers()
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1075, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/multiprocessing/popen_fork.py", line 47, in wait
    if not wait([self.sentinel], timeout):
  File "/home/ec2-user/anaconda3/envs/pytorch_latest_p36/lib/python3.6/multiprocessing/connection.py", line 911,




INFO:simpletransformers.seq2seq.seq2seq_model: Training started



INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-2000






INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-4000






INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/






INFO:simpletransformers.seq2seq.seq2seq_model: Training of roberta-base-bert-base-cased model complete. Saved to outputs/.


(5320, 0.709893412254637)
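The first element of the tuple above is consistent with the total number of training steps. The second is presumably the average training loss, but that reading is an assumption about the library's return value. A short worked calculation, assuming one optimizer step per batch (no gradient accumulation):

```python
import math

num_examples = 8498      # rows in trainDF after dropna
train_batch_size = 16
num_train_epochs = 10

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)

# Zero-based epoch in which a given global step falls,
# e.g. the checkpoints saved at steps 2000 and 4000.
checkpoint_epochs = [(step - 1) // steps_per_epoch for step in (2000, 4000)]
print(checkpoint_epochs)
```

The computed total of 5320 steps matches the first value returned by `train_model`.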

### Evaluate the model on 1000 sentences from the eval set

In [38]:
results = model.eval_model(evalDF[:1000])

INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/



INFO:simpletransformers.seq2seq.seq2seq_utils: Saving features into cached file cache_dir/roberta-base-bert-base-cased_cached_151000









INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.488968826815486}
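The evaluation loss alone says little about how many sentences come out fully corrected. One option is an exact-match score over generated text. The sketch below defines such a metric as a plain function: `exact_match` is a name made up for this example, and the sample sentences are illustrative. As far as I know, `eval_model` accepts extra metric functions as keyword arguments and calls them with `(labels, preds)` when `evaluate_generated_text` is enabled, but treat the commented usage line as an assumption to verify against the library documentation.

```python
def exact_match(labels, preds):
    """Fraction of predictions identical to the reference after trimming whitespace."""
    pairs = list(zip(labels, preds))
    if not pairs:
        return 0.0
    return sum(label.strip() == pred.strip() for label, pred in pairs) / len(pairs)


# Illustrative references and predictions.
labels = ["What is your name", "I want to thank you"]
preds = ["What is your name?....", "I want to thank you"]
print(exact_match(labels, preds))

# Assumed usage with the trained model (not run here):
# results = model.eval_model(evalDF[:1000], exact_match=exact_match)
```

Exact match is strict: a prediction that fixes the misspelling but adds trailing punctuation still counts as a miss.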


### Run predictions on a few unseen sentences

In [62]:
print(model.predict(["Wht is your name"]))

['What is your name?....']
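To see exactly what the model changed, the input and the prediction can be compared word by word. This is a small sketch using the standard library's `difflib`. `changed_words` is a hypothetical helper, and the strings are the input and output from the cell above.

```python
import difflib


def changed_words(source, corrected):
    """Return (original, replacement) pairs for every word span that differs."""
    src, out = source.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=out, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(src[i1:i2]), " ".join(out[j1:j2])))
    return changes


print(changed_words("Wht is your name", "What is your name?...."))
```

Here the diff reports the intended fix (`Wht` to `What`) and also the punctuation the model appended to `name`, which makes such side effects easy to spot.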
