<a href="https://colab.research.google.com/github/arthurflor23/spelling-correction/blob/master/src/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/arthurflor23/spelling-correction/blob/master/doc/image/header.png?raw=true">

# Spelling Correction using TensorFlow 2.x

This tutorial shows how to use the [Spelling Correction](https://github.com/arthurflor23/text-corretion) project in Google Colab.



## 1 Localhost Environment

First, we'll make sure the project is in your Google Drive along with the dataset folders. If your files are already structured in the cloud, skip this step.

### 1.1 Datasets

You can use the following datasets:

a. [BEA2019](https://www.cl.cam.ac.uk/research/nl/bea2019st/)

b. [Bentham](http://www.transcriptorium.eu/~tsdata/)

c. [CoNLL13](https://www.comp.nus.edu.sg/~nlp/conll13st.html)

d. [CoNLL14](https://www.comp.nus.edu.sg/~nlp/conll14st.html)

e. [Google](https://ai.google/research/pubs/pub41880)

f. [IAM](http://www.fki.inf.unibe.ch/databases/iam-handwriting-database)

g. [Rimes](http://www.a2ialab.com/doku.php?id=rimes_database:start)

h. [Saint Gall](https://fki.tic.heia-fr.ch/databases/saint-gall-database)

i. [Washington](https://fki.tic.heia-fr.ch/databases/washington-database)

### 1.2 Raw folder

On localhost, download the project code from GitHub and extract the chosen dataset into the **raw** folder. Don't change the structure of the dataset, since the scripts were written for its **original structure**. Your project directory will look like this:

```
.
├── raw
│   ├── bea2019
│   │   ├── json
│   │   ├── json_to_m2.py
│   │   ├── licence.wi.txt
│   │   ├── license.locness.txt
│   │   ├── m2
│   │   └── readme.txt
│   ├── bentham
│   │   ├── BenthamDatasetR0-GT
│   │   └── BenthamDatasetR0-Images
│   ├── conll13
│   │   ├── m2scorer
│   │   ├── original
│   │   ├── README
│   │   ├── revised
│   │   └── scripts
│   ├── conll14
│   │   ├── alt
│   │   ├── noalt
│   │   ├── README
│   │   └── scripts
│   ├── google
│   │   ├── europarl-v6.cs
│   │   ├── europarl-v6.de
│   │   ├── europarl-v6.en
│   │   ├── europarl-v6.es
│   │   ├── europarl-v6.fr
│   │   ├── news.2007.cs.shuffled
│   │   ├── news.2007.de.shuffled
│   │   ├── news.2007.en.shuffled
│   │   ├── news.2007.es.shuffled
│   │   ├── news.2007.fr.shuffled
│   │   ├── news.2008.cs.shuffled
│   │   ├── news.2008.de.shuffled
│   │   ├── news.2008.en.shuffled
│   │   ├── news.2008.es.shuffled
│   │   ├── news.2008.fr.shuffled
│   │   ├── news.2009.cs.shuffled
│   │   ├── news.2009.de.shuffled
│   │   ├── news.2009.en.shuffled
│   │   ├── news.2009.es.shuffled
│   │   ├── news.2009.fr.shuffled
│   │   ├── news.2010.cs.shuffled
│   │   ├── news.2010.de.shuffled
│   │   ├── news.2010.en.shuffled
│   │   ├── news.2010.es.shuffled
│   │   ├── news.2010.fr.shuffled
│   │   ├── news.2011.cs.shuffled
│   │   ├── news.2011.de.shuffled
│   │   ├── news.2011.en.shuffled
│   │   ├── news.2011.es.shuffled
│   │   ├── news.2011.fr.shuffled
│   │   ├── news-commentary-v6.cs
│   │   ├── news-commentary-v6.de
│   │   ├── news-commentary-v6.en
│   │   ├── news-commentary-v6.es
│   │   └── news-commentary-v6.fr
│   ├── iam
│   │   ├── ascii
│   │   ├── forms
│   │   ├── largeWriterIndependentTextLineRecognitionTask
│   │   ├── lines
│   │   └── xml
│   ├── rimes
│   │   ├── eval_2011
│   │   ├── eval_2011_annotated.xml
│   │   ├── training_2011
│   │   └── training_2011.xml
│   ├── saintgall
│   │   ├── data
│   │   ├── ground_truth
│   │   ├── README.txt
│   │   └── sets
│   └── washington
│       ├── data
│       ├── ground_truth
│       ├── README.txt
│       └── sets
└── src
    ├── data
    │   ├── evaluation.py
    │   ├── generator.py
    │   ├── __init__.py
    │   ├── preproc.py
    │   └── reader.py
    ├── main.py
    ├── tool
    │   ├── __init__.py
    │   ├── seq2seq.py
    │   ├── statistical.py 
    │   └── transformer.py
    └── tutorial.ipynb

```

After that, create a virtual environment and install the dependencies with Python 3 and pip:

> ```python -m venv .venv && source .venv/bin/activate```

> ```pip install -r requirements.txt```

### 1.3 Dataset folders

Now run the *transform* function from **main.py**. To do this, execute the following in the **src** folder:

> ```python main.py --source=<DATASET_NAME> --transform```

Your data will be preprocessed, encoded, and saved in the **data** folder. Your project directory will now look like this:


```
.
├── data
│   ├── bea2019.txt
│   ├── bentham.txt
│   ├── conll13.txt
│   ├── conll14.txt
│   ├── google.txt
│   ├── iam.txt
│   ├── rimes.txt
│   ├── saintgall.txt
│   └── washington.txt
├── raw
│   ├── bea2019
│   │   ├── json
│   │   ├── json_to_m2.py
│   │   ├── licence.wi.txt
│   │   ├── license.locness.txt
│   │   ├── m2
│   │   └── readme.txt
│   ├── bentham
│   │   ├── BenthamDatasetR0-GT
│   │   └── BenthamDatasetR0-Images
│   ├── conll13
│   │   ├── m2scorer
│   │   ├── original
│   │   ├── README
│   │   ├── revised
│   │   └── scripts
│   ├── conll14
│   │   ├── alt
│   │   ├── noalt
│   │   ├── README
│   │   └── scripts
│   ├── google
│   │   ├── europarl-v6.cs
│   │   ├── europarl-v6.de
│   │   ├── europarl-v6.en
│   │   ├── europarl-v6.es
│   │   ├── europarl-v6.fr
│   │   ├── news.2007.cs.shuffled
│   │   ├── news.2007.de.shuffled
│   │   ├── news.2007.en.shuffled
│   │   ├── news.2007.es.shuffled
│   │   ├── news.2007.fr.shuffled
│   │   ├── news.2008.cs.shuffled
│   │   ├── news.2008.de.shuffled
│   │   ├── news.2008.en.shuffled
│   │   ├── news.2008.es.shuffled
│   │   ├── news.2008.fr.shuffled
│   │   ├── news.2009.cs.shuffled
│   │   ├── news.2009.de.shuffled
│   │   ├── news.2009.en.shuffled
│   │   ├── news.2009.es.shuffled
│   │   ├── news.2009.fr.shuffled
│   │   ├── news.2010.cs.shuffled
│   │   ├── news.2010.de.shuffled
│   │   ├── news.2010.en.shuffled
│   │   ├── news.2010.es.shuffled
│   │   ├── news.2010.fr.shuffled
│   │   ├── news.2011.cs.shuffled
│   │   ├── news.2011.de.shuffled
│   │   ├── news.2011.en.shuffled
│   │   ├── news.2011.es.shuffled
│   │   ├── news.2011.fr.shuffled
│   │   ├── news-commentary-v6.cs
│   │   ├── news-commentary-v6.de
│   │   ├── news-commentary-v6.en
│   │   ├── news-commentary-v6.es
│   │   └── news-commentary-v6.fr
│   ├── iam
│   │   ├── ascii
│   │   ├── forms
│   │   ├── largeWriterIndependentTextLineRecognitionTask
│   │   ├── lines
│   │   └── xml
│   ├── rimes
│   │   ├── eval_2011
│   │   ├── eval_2011_annotated.xml
│   │   ├── training_2011
│   │   └── training_2011.xml
│   ├── saintgall
│   │   ├── data
│   │   ├── ground_truth
│   │   ├── README.txt
│   │   └── sets
│   └── washington
│       ├── data
│       ├── ground_truth
│       ├── README.txt
│       └── sets
└── src
    ├── data
    │   ├── evaluation.py
    │   ├── generator.py
    │   ├── __init__.py
    │   ├── preproc.py
    │   └── reader.py
    ├── main.py
    ├── tool
    │   ├── __init__.py
    │   ├── seq2seq.py
    │   ├── statistical.py 
    │   └── transformer.py
    └── tutorial.ipynb

```

Then upload the **data** and **src** folders to the same directory in your Google Drive.
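
Before uploading, it can help to confirm that the transform step produced a file for each dataset you plan to use. The sketch below assumes the `data/<name>.txt` layout shown in the tree above; the helper name `missing_datasets` is hypothetical and not part of the project.

```python
import os
import tempfile


def missing_datasets(data_dir, names):
    """Return the dataset names that have no '<name>.txt' file in data_dir."""
    return [name for name in names
            if not os.path.isfile(os.path.join(data_dir, f"{name}.txt"))]


# Demonstration against a temporary folder standing in for the data folder
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "bea2019.txt"), "w").close()
    print(missing_datasets(tmp, ["bea2019", "iam"]))  # prints ['iam']
```

On a real checkout, you would pass the path of the **data** folder and the dataset names you transformed.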

## 2 Google Drive Environment


### 2.1 TensorFlow 2.x

Make sure the Jupyter notebook is running in GPU mode.

In [None]:
!nvidia-smi

In [None]:
%tensorflow_version 2.x
import tensorflow as tf

device_name = tf.test.gpu_device_name()

if device_name != "/device:GPU:0":
    raise SystemError("GPU device not found")

print("Found GPU at: {}".format(device_name))

### 2.2 Google Drive

Mount your Google Drive partition.

**Note:** *"Colab Notebooks/spelling-correction/src/"* is the directory where you placed the project folders, specifically the **src** folder.

In [None]:
from google.colab import drive

drive.mount("./gdrive", force_remount=True)

%cd "./gdrive/My Drive/Colab Notebooks/spelling-correction/src/"
!ls -l

After mounting, you can see the list of files in the project folder.

## 3 Set Python Classes

### 3.1 Environment

First, let's define our environment variables.

Set the main configuration parameters, such as dataset, method, number of epochs, and batch size. These settings are compatible with both **main.py** and the Jupyter notebook:

* **dataset**:

  * **``bea2019``**, **``bentham``**, **``conll13``**, **``conll14``**, **``google``**, **``iam``**, **``rimes``**, **``saintgall``**, **``washington``**

* **mode**:

  * neural network: **``luong``**, **``bahdanau``**, **``transformer``**

  * statistical (localhost only): **``similarity``**, **``norvig``**, **``symspell``**

* **epochs**: number of epochs

* **batch_size**: number of samples per batch

In [None]:
import os
import datetime
import string

# define parameters
source = "bea2019"
mode = "luong"
epochs = 1000
batch_size = 64

# define paths
data_path = os.path.join("..", "data")
source_path = os.path.join(data_path, f"{source}.txt")

output_path = os.path.join("..", "output", source, mode)
target_path = os.path.join(output_path, "checkpoint_weights.hdf5")
os.makedirs(output_path, exist_ok=True)

# define the maximum number of chars per line and the list of valid chars
max_text_length = 128
charset_base = string.printable[:95]
charset_special = """ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöùúûüý"""

print("output", output_path)
print("target", target_path)
print("charset:", charset_base + charset_special)
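
To make the `string.printable[:95]` slice less opaque, the sketch below spells out what it contains and shows one way a charset and a maximum length could be used to clean a line. The helper `keep_valid` is a hypothetical illustration; the project's own preprocessing in **preproc.py** may behave differently.

```python
import string

charset_base = string.printable[:95]  # same slice as in the cell above

# The slice keeps digits, ASCII letters, punctuation and the space character,
# and drops the remaining whitespace characters (tab, newline, and so on).
assert charset_base == (string.digits + string.ascii_letters +
                        string.punctuation + " ")


def keep_valid(text, charset, max_length):
    """Hypothetical helper: drop chars outside `charset`, then truncate."""
    return "".join(ch for ch in text if ch in charset)[:max_length]


# The tab is not in charset_base, so it is removed
print(keep_valid("spelling\tcorrection", charset_base, 128))
```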

### 3.2 DataGenerator Class

The **DataGenerator()** class is responsible for:

* Loading the dataset partitions (train, valid, test);

* Managing batches for the training, validation, and test processes.

In [None]:
from data.generator import DataGenerator

dtgen = DataGenerator(source=source_path,
                      batch_size=batch_size,
                      charset=(charset_base + charset_special),
                      max_text_length=max_text_length)

print(f"Train sentences: {dtgen.size['train']}")
print(f"Validation sentences: {dtgen.size['valid']}")
print(f"Test sentences: {dtgen.size['test']}")
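
The idea of serving a partition in batches can be sketched in plain Python. This is a simplified illustration, not the project's `DataGenerator`: the function names are hypothetical, and the real class also handles encoding and the train/valid/test partitions.

```python
import math


def steps_per_epoch(n_items, batch_size):
    """Number of batches needed to cover n_items once."""
    return math.ceil(n_items / batch_size)


def batch_forever(items, batch_size):
    """Yield consecutive batches, restarting from the beginning when exhausted."""
    index = 0
    while True:
        if index >= len(items):
            index = 0
        yield items[index:index + batch_size]
        index += batch_size


gen = batch_forever(["a", "b", "c", "d", "e"], batch_size=2)
first_epoch = [next(gen) for _ in range(steps_per_epoch(5, 2))]
print(first_epoch)  # prints [['a', 'b'], ['c', 'd'], ['e']]
```

Because the generator never ends, the caller has to say how many batches make up one epoch, which is the role `dtgen.steps` plays in the training and prediction cells.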

### 3.3 Neural Network Model

In this step, the model is created (or loaded from a checkpoint) and the default callbacks are set up.

In [None]:
from data import preproc as pp, evaluation as ev
from tool.seq2seq import Seq2SeqAttention
from tool.transformer import Transformer

if mode == "transformer":
    # disable one hot encode (seq2seq) to use transformer model
    dtgen.one_hot_process = False
    model = Transformer(dtgen.tokenizer,
                        num_layers=6,
                        units=512,
                        d_model=256,
                        num_heads=8,
                        dropout=0.1,
                        stop_tolerance=20,
                        reduce_tolerance=15)
else:
    model = Seq2SeqAttention(dtgen.tokenizer,
                             mode,
                             units=512,
                             dropout=0.2,
                             stop_tolerance=20,
                             reduce_tolerance=15)

model.compile(learning_rate=0.001)
model.summary(output_path, "summary.txt")

# load checkpoint weights file (HDF5) if it exists, then get the default callbacks list
model.load_checkpoint(target=target_path)

callbacks = model.get_callbacks(logdir=output_path, checkpoint=target_path, verbose=1)
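
The `stop_tolerance` and `reduce_tolerance` arguments are not explained in this tutorial. A plausible reading, assumed here, is that they are patience values: the number of epochs without validation-loss improvement before training stops or the learning rate is reduced. The sketch below replays a list of validation losses under that assumption; the function name and the reduction `factor` of 0.2 are illustrative choices, not values taken from the project.

```python
def simulate_plateau(val_losses, stop_tolerance, reduce_tolerance,
                     lr=0.001, factor=0.2):
    """Replay validation losses; return (epochs_run, final_learning_rate)."""
    best = float("inf")
    since_best = 0    # epochs since the best loss (early stopping counter)
    since_reduce = 0  # epochs since the best loss or the last reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_reduce = loss, 0, 0
            continue
        since_best += 1
        since_reduce += 1
        if since_reduce >= reduce_tolerance:
            lr *= factor  # assumed reduction rule
            since_reduce = 0
        if since_best >= stop_tolerance:
            return epoch, lr
    return len(val_losses), lr


# Made-up losses: improvement stops after the second epoch
epochs_run, final_lr = simulate_plateau([1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95],
                                        stop_tolerance=4, reduce_tolerance=2)
print(epochs_run, final_lr)
```

With the values used in the cell above (20 and 15), the learning rate would be reduced once before training stops, which gives the model a chance to improve at a lower rate first.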

## 4 Training

The training process calls *fit()* with batch generators, so the whole dataset never has to be loaded into memory at once. After training, the information (epochs and minimum loss) is saved.

In [None]:
# to calculate total and average time per epoch
start_time = datetime.datetime.now()

h = model.fit(x=dtgen.next_train_batch(),
              epochs=epochs,
              steps_per_epoch=dtgen.steps['train'],
              validation_data=dtgen.next_valid_batch(),
              validation_steps=dtgen.steps['valid'],
              callbacks=callbacks,
              shuffle=True,
              verbose=1)

total_time = datetime.datetime.now() - start_time

loss = h.history['loss']
accuracy = h.history['accuracy']

val_loss = h.history['val_loss']
val_accuracy = h.history['val_accuracy']

time_epoch = (total_time / len(loss))
total_item = (dtgen.size['train'] + dtgen.size['valid'])
best_epoch_index = val_loss.index(min(val_loss))

t_corpus = "\n".join([
    f"Total train sentences:      {dtgen.size['train']}",
    f"Total validation sentences: {dtgen.size['valid']}",
    f"Batch:                      {dtgen.batch_size}\n",
    f"Total epochs:               {len(accuracy)}",
    f"Total time:                 {total_time}",
    f"Time per epoch:             {time_epoch}",
    f"Time per item:              {time_epoch / total_item}\n",
    f"Best epoch                  {best_epoch_index + 1}",
    f"Training loss:              {loss[best_epoch_index]:.8f}",
    f"Training accuracy:          {accuracy[best_epoch_index]:.8f}\n",
    f"Validation loss:            {val_loss[best_epoch_index]:.8f}",
    f"Validation accuracy:        {val_accuracy[best_epoch_index]:.8f}"
])

with open(os.path.join(output_path, "train.txt"), "w") as lg:
    lg.write(t_corpus)
    print(t_corpus)
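
The report logic above can be tried in isolation. The sketch below uses made-up history values (they are not results from this project) to show how the best epoch and the time per epoch are derived.

```python
import datetime

# Illustrative values only, not results from this project
history = {"loss": [0.9, 0.6, 0.5, 0.45], "val_loss": [1.0, 0.7, 0.75, 0.8]}
total_time = datetime.timedelta(minutes=8)

val_loss = history["val_loss"]
# index() returns the first epoch that reached the lowest validation loss
best_epoch_index = val_loss.index(min(val_loss))
# dividing a timedelta by an int gives another timedelta
time_epoch = total_time / len(val_loss)

print(best_epoch_index + 1, time_epoch)  # prints 2 0:02:00
print(history["loss"][best_epoch_index])  # training loss at the best epoch
```

Note that the reported training loss is the one recorded at the best validation epoch, which is not necessarily the lowest training loss of the run.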

## 5 Predict and Evaluate

Since the goal is to correct text, the metrics (CER, WER and SER) are calculated both before and after the correction.

The prediction step likewise consumes a batch generator through *predict()*:

In [None]:
start_time = datetime.datetime.now()

predicts = model.predict(x=dtgen.next_test_batch(), steps=dtgen.steps['test'], verbose=1)
predicts = [pp.text_standardize(x) for x in predicts]

total_time = datetime.datetime.now() - start_time

# calculate metrics (before and after)
old_metric, new_metric = ev.ocr_metrics(ground_truth=dtgen.dataset['test']['gt'],
                                        data=dtgen.dataset['test']['dt'],
                                        predict=predicts)

# generate report
e_corpus = "\n".join([
    f"Total test sentences: {dtgen.size['test']}\n",
    f"Total time:           {total_time}",
    f"Time per item:        {total_time / dtgen.size['test']}\n",
    f"Metrics (before):",
    f"Character Error Rate: {old_metric[0]:.8f}",
    f"Word Error Rate:      {old_metric[1]:.8f}",
    f"Sequence Error Rate:  {old_metric[2]:.8f}\n",
    f"Metrics (after):",
    f"Character Error Rate: {new_metric[0]:.8f}",
    f"Word Error Rate:      {new_metric[1]:.8f}",
    f"Sequence Error Rate:  {new_metric[2]:.8f}"
])

p_corpus = []
for i in range(dtgen.size['test']):
    p_corpus.append(f"GT {dtgen.dataset['test']['gt'][i]}")
    p_corpus.append(f"DT {dtgen.dataset['test']['dt'][i]}")
    p_corpus.append(f"PD {predicts[i]}\n")

# write report
with open(os.path.join(output_path, "predict.txt"), "w") as lg:
    lg.write("\n".join(p_corpus))
    print("\n".join(p_corpus[:30]))

with open(os.path.join(output_path, "evaluate.txt"), "w") as lg:
    lg.write(e_corpus)
    print(e_corpus)
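
For readers unfamiliar with the three error rates, the sketch below shows one common way to compute them with the Levenshtein (edit) distance. It is an assumption-laden illustration: the function names are hypothetical, and the normalization used by the project's `ev.ocr_metrics` may differ (here each distance is divided by the ground-truth length).

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def error_rates(ground_truth, predictions):
    """Return (CER, WER, SER) averaged over sentence pairs."""
    cer, wer, ser = [], [], []
    for gt, pd in zip(ground_truth, predictions):
        gt_words, pd_words = gt.split(), pd.split()
        cer.append(edit_distance(gt, pd) / max(len(gt), 1))
        wer.append(edit_distance(gt_words, pd_words) / max(len(gt_words), 1))
        ser.append(0.0 if gt == pd else 1.0)
    n = max(len(cer), 1)
    return sum(cer) / n, sum(wer) / n, sum(ser) / n


# "before" compares the noisy text, "after" compares the corrected text
before = error_rates(["the cat sat"], ["teh cat sat"])
after = error_rates(["the cat sat"], ["the cat sat"])
print(before, after)
```

Comparing the two tuples shows whether the correction helped: lower values after correction mean fewer character, word, and sentence errors.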