<a href="https://colab.research.google.com/github/abcvivek/EnglishSpellCheckingSystem/blob/master/EnglishSpellCheckingSystemWithComments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**ENGLISH SPELL CHECKING SYSTEM**

- Select TensorFlow version 1.x to run the code below.

In [0]:
%tensorflow_version 1.1

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1.1`. This will be interpreted as: `1.x`.


TensorFlow 1.x selected.


**Importing Libraries**
- numpy is used for numerical calculations
- tensorflow is used for building neural networks
- os helps with operating-system tasks such as handling file paths
- time is used for measuring elapsed time
- ceil is used for rounding values up to the nearest integer
- train_test_split divides the dataset into a training set and a testing set
- Dense is used for the output layer. It maps its input to an m-dimensional output vector, so it changes the dimensions of the vector. Mathematically, it applies a linear transform (a weight matrix) plus a translation (a bias) to the vector.
- namedtuple, from the collections module, creates tuple subclasses with named fields. Values can be accessed by field name as well as by position, and the tuple can be iterated like any other tuple.

In [0]:
import numpy as np
import tensorflow as tf
import os
import time
from math import ceil
from sklearn.model_selection import train_test_split
from tensorflow.python.layers.core import Dense
from collections import namedtuple
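
As a small illustration of the namedtuple described above, the sketch below uses a hypothetical `Point` type (it is not part of this project) to show access by field name, by position, and by unpacking.

```python
from collections import namedtuple

# Hypothetical record type, used only for this illustration
Point = namedtuple('Point', ['x', 'y'])
p = Point(x=3, y=4)

print(p.x, p.y)      # access by field name
print(p[0], p[1])    # access by position, like a regular tuple
x, y = p             # unpacking works as well
print(p._asdict())   # convert to a dictionary when needed
```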

**Loading of Data**

- The path variable contains the dataset path
- The dataset is already cleaned, so there is no data cleaning step
- file_content holds the full contents of the dataset, returned by the load_file function

In [0]:
path = '/content/drive/My Drive/clean.txt'
     
def load_file(path):
  # Read the whole dataset file into a single string
  with open(path) as file:
    return file.read()
       
file_content = load_file(path)

**Preprocessing of Data**

- vocab_to_int is a Python dictionary whose keys are characters and whose values are the assigned numbers
- The first block of code collects every unique character in the dataset and adds it to the vocab_to_int dictionary
- Numbers are assigned on a first-come, first-served basis
- The second block adds some special tokens to the vocab_to_int dictionary

- **GO** - the same as `<start>` in the picture below - the first token fed to the decoder, along with the thought vector, to start generating the tokens of the answer
- **EOS** - "end of sentence" - the same as `<end>` in the picture below - as soon as the decoder generates this token, the answer is considered complete (usual punctuation marks can't serve this purpose because their meaning can differ)
- **PAD** - the GPU (or CPU at worst) processes training data in batches, and all sequences in a batch must have the same length. If the maximum sequence length is 8, the sentence "My name is guotong1988" is padded to fit this length: `My name is guotong1988 _pad_ _pad_ _pad_ _pad_`

(https://cloud.githubusercontent.com/assets/2272790/18410099/1d0a1c1a-7761-11e6-9fe1-bd2e5622b90a.png)

In [0]:
vocab_to_int = {}
count = 0

for character in file_content:
  if character not in vocab_to_int:
    vocab_to_int[character] = count
    count += 1

codes = ['<PAD>','<EOS>','<GO>']
for code in codes:
  vocab_to_int[code] = count
  count += 1

In [0]:
print("The vocabulary contains {} characters.".format(len(vocab_to_int)))
print(sorted(vocab_to_int.items()))

The vocabulary contains 31 characters.
[('\n', 20), (' ', 4), ('<EOS>', 29), ('<GO>', 30), ('<PAD>', 28), ('a', 6), ('b', 14), ('c', 10), ('d', 18), ('e', 12), ('f', 16), ('g', 24), ('h', 1), ('i', 2), ('j', 21), ('k', 23), ('l', 17), ('m', 22), ('n', 19), ('o', 8), ('p', 15), ('q', 27), ('r', 13), ('s', 3), ('t', 0), ('u', 9), ('v', 11), ('w', 5), ('x', 25), ('y', 7), ('z', 26)]


- Another dictionary converts the integers back to their respective characters
- Keys are the numbers and values are the characters

In [0]:
int_to_vocab = {}
for character, value in vocab_to_int.items():
    int_to_vocab[value] = character

print(int_to_vocab.items())

dict_items([(0, 't'), (1, 'h'), (2, 'i'), (3, 's'), (4, ' '), (5, 'w'), (6, 'a'), (7, 'y'), (8, 'o'), (9, 'u'), (10, 'c'), (11, 'v'), (12, 'e'), (13, 'r'), (14, 'b'), (15, 'p'), (16, 'f'), (17, 'l'), (18, 'd'), (19, 'n'), (20, '\n'), (21, 'j'), (22, 'm'), (23, 'k'), (24, 'g'), (25, 'x'), (26, 'z'), (27, 'q'), (28, '<PAD>'), (29, '<EOS>'), (30, '<GO>')])
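
To make the two dictionaries concrete, here is a minimal sketch on a tiny stand-in string. The names `text`, `toy_vocab_to_int` and `toy_int_to_vocab` are made up for this example; the ids follow the same first-appearance rule as the cells above.

```python
text = "this is it\n"   # tiny stand-in for file_content

# Assign ids in order of first appearance, as in the cells above
toy_vocab_to_int = {}
for character in text:
    if character not in toy_vocab_to_int:
        toy_vocab_to_int[character] = len(toy_vocab_to_int)
for code in ['<PAD>', '<EOS>', '<GO>']:
    toy_vocab_to_int[code] = len(toy_vocab_to_int)

# Reverse mapping: id -> character
toy_int_to_vocab = {value: character for character, value in toy_vocab_to_int.items()}

encoded = [toy_vocab_to_int[c] for c in "this"]
decoded = "".join(toy_int_to_vocab[i] for i in encoded)
print(encoded, decoded)
```

Encoding and then decoding returns the original string, which is the property the rest of the notebook relies on.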


- The file content is split into sentences.
- Each newline in the file marks the end of one sentence.

In [0]:
sentences = []
for sentence in file_content.splitlines():
  sentences.append(sentence)
print(" Dataset contains {} sentences.".format(len(sentences)))

 Dataset contains 1232681 sentences.


- Characters cannot be fed directly to the seq2seq model, so every sentence is converted into a list of integers.
- The conversion uses the vocab_to_int dictionary.

In [0]:
int_sentences = []
for sentence in sentences:
    int_sentence = []
    for character in sentence:
        int_sentence.append(vocab_to_int[character])
    int_sentences.append(int_sentence)

- We limit the training sentences with max_length and min_length
- With this filter, sentences with fewer than 30 or more than 250 characters are discarded.

In [0]:
max_length = 250
min_length = 30

good_sentences = []
for sentence in int_sentences:
    if len(sentence) <= max_length and len(sentence) >= min_length:
        good_sentences.append(sentence)

print("We will use {} to train and test our model.".format(len(good_sentences)))

We will use 1017165 to train and test our model.


- Here the set of sentences is divided into three parts:
- **Training sentences** = used for training the model.
- **Validation sentences** = used during training to check how well the model performs on sentences it has not been trained on.
- **Testing sentences** = once the model is saved and finalized, these sentences can be used to test its accuracy.



In [0]:
training, testing = train_test_split(good_sentences, test_size = 0.10, random_state = 2)
testing, validation = train_test_split(testing, test_size = 0.70, random_state = 2)
print("Number of Training sentences:", len(training))
print("Number of Validation sentences:", len(validation))
print("Number of Testing sentences:", len(testing))

Number of Training sentences: 915448
Number of Validation sentences: 71202
Number of Testing sentences: 30515
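
The counts above can be checked by hand. The sketch below assumes that train_test_split rounds the held-out share up to a whole number of sentences; under that assumption the arithmetic reproduces the three printed counts.

```python
from math import ceil

total = 1017165                       # sentences kept after length filtering

held_out = ceil(total * 0.10)         # first split: test_size = 0.10
n_training = total - held_out

n_validation = ceil(held_out * 0.70)  # second split: test_size = 0.70
n_testing = held_out - n_validation

print(n_training, n_validation, n_testing)
```

Note that the second split sends 70% of the held-out sentences to validation and the remaining 30% to testing.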


- Sort the sentences in each split by length to reduce padding, which allows the model to train faster

In [0]:
# sorted() is stable, so sentences of equal length keep their original order,
# which matches bucketing by every length from min_length to max_length.
training_sorted = sorted(training, key=len)
validation_sorted = sorted(validation, key=len)
testing_sorted = sorted(testing, key=len)

- This method adds noise to correct sentences.
- noise_maker creates mistakes based on the threshold value, the probability (between 0 and 1) that each character is kept unchanged. If the threshold is 1, all sentences are kept as they are. In our case the threshold can be around 0.9 to 0.95.
- noise_maker creates 3 types of mistakes at random: 1. swapping a character with the one that follows it 2. removing a character 3. adding a random lowercase letter

In [0]:
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

def noise_maker(sentence, threshold):

    '''Relocate, remove, or add characters to create spelling mistakes''' 

    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0,1,1)
        # Most characters will be correct since the threshold value is high
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0,1,1)
            # ~33% chance characters will swap locations
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    # The last character has nothing to swap with; `continue` skips
                    # the increment at the end of the loop, so it is drawn again
                    continue
                else:
                    # if any other character, swap order with following character
                    noisy_sentence.append(sentence[i+1])
                    noisy_sentence.append(sentence[i])
                    i += 1

            # ~33% chance an extra lower case letter will be added to the sentence
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(vocab_to_int[random_letter])
                noisy_sentence.append(sentence[i])

            # ~33% chance a character will not be typed
            else:
                pass     
        i += 1
    return noisy_sentence
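
The same three mistake types are easier to follow on plain strings. The sketch below is a hypothetical string version (`add_noise` is not part of the notebook) that uses Python's `random` module with a seeded generator. It differs from noise_maker in one detail: a swap drawn for the last character simply drops that character.

```python
import random
import string

def add_noise(text, threshold, rng):
    """Hypothetical string version of noise_maker, for illustration only."""
    noisy = []
    i = 0
    while i < len(text):
        if rng.random() < threshold:
            noisy.append(text[i])                 # keep the character
        else:
            kind = rng.choice(['swap', 'insert', 'drop'])
            if kind == 'swap' and i < len(text) - 1:
                noisy.extend([text[i + 1], text[i]])
                i += 1                            # the next character is already used
            elif kind == 'insert':
                noisy.extend([rng.choice(string.ascii_lowercase), text[i]])
            # 'drop' (or a swap on the last character) leaves the character out
        i += 1
    return "".join(noisy)

rng = random.Random(0)
print(add_noise("spelling mistakes", 0.9, rng))
print(add_noise("spelling mistakes", 1.0, rng))   # threshold 1.0 keeps the text unchanged
```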

- This is a utility method to verify whether the noise_maker is making the desired mistakes or not 

In [0]:
# Check to ensure noise_maker is making mistakes correctly.

threshold = 0.9
for sentence in training_sorted[:5]:
    print(sentence)
    print(noise_maker(sentence, threshold))
    print()

[6, 4, 5, 6, 13, 19, 2, 19, 24, 4, 5, 6, 3, 4, 24, 2, 11, 12, 19, 4, 14, 7, 4, 2, 3, 6, 6, 10, 4, 22]
[6, 4, 5, 6, 13, 19, 2, 19, 24, 8, 4, 5, 6, 3, 4, 24, 2, 11, 12, 19, 4, 14, 7, 4, 2, 3, 6, 6, 10, 7, 4, 22]

[6, 4, 16, 12, 5, 4, 15, 12, 0, 13, 8, 17, 4, 14, 8, 22, 14, 3, 4, 5, 12, 13, 12, 4, 0, 1, 13, 8, 5, 19]
[6, 4, 16, 5, 12, 4, 15, 12, 0, 8, 17, 4, 14, 8, 22, 14, 3, 4, 5, 12, 13, 12, 4, 0, 1, 8, 13, 5, 19]

[0, 1, 12, 4, 13, 12, 3, 0, 4, 8, 16, 4, 9, 3, 4, 19, 12, 12, 18, 4, 12, 6, 10, 1, 4, 8, 0, 1, 12, 13]
[0, 1, 12, 4, 13, 12, 3, 0, 4, 8, 16, 4, 9, 3, 4, 19, 12, 12, 18, 4, 12, 6, 10, 1, 4, 8, 0, 12, 13]

[3, 8, 22, 12, 4, 3, 0, 9, 22, 14, 17, 12, 4, 8, 11, 12, 13, 4, 6, 4, 16, 6, 17, 3, 12, 4, 1, 8, 15, 12]
[3, 22, 12, 4, 18, 3, 9, 0, 22, 14, 17, 12, 4, 8, 11, 12, 13, 4, 15, 6, 4, 16, 6, 17, 3, 12, 4, 1, 8, 12, 15]

[3, 2, 19, 10, 12, 4, 0, 1, 12, 4, 22, 6, 13, 23, 12, 0, 4, 5, 2, 17, 17, 4, 14, 12, 4, 24, 2, 11, 12, 19]
[3, 2, 19, 10, 12, 4, 0, 1, 12, 4, 22, 6, 13, 23, 0, 4,

- This is also a utility method. It makes a sentence readable by converting the numbers back to their corresponding characters
- The int_to_vocab dictionary is used for the conversion.
- The method also drops every id of 28 or above, which covers the special tokens `<PAD>`, `<EOS>` and `<GO>`. This keeps such tokens from affecting the comparison between a prediction and the correct sentence.

In [0]:
def MakeSentenceReadable(correct):
  correct_sentence = ""
  for i in correct:
    if i < 28:
      correct_sentence += int_to_vocab[i]
  return correct_sentence.strip()

**Hyper Parameters**

- Below is the set of parameters used for tuning the model. In this project we use trial and error to find a combination of values that gives a model with good accuracy.
- Epochs = the number of complete cycles of training the model on the whole training data and validating it.
- Batch Size = training is done in batches; the number of sentences taken from the dataset at a time is the batch size.
- RNN Layers = number of RNN layers in the model.
- RNN Size = number of units in each LSTM cell.
- Embedding Size = length of the vector that represents each character; the embedded characters are the first input to the encoder.
- Learning Rate = the step size the optimizer uses when updating the weights.
- Direction = chooses the direction of the encoder layer: 1 for a unidirectional RNN, 2 for a bidirectional RNN.
- Threshold = probability that a character of the correct sentence is kept unchanged; with 0.9, roughly 10% of the characters get a mistake.
- Keep Probability = value ranges from 0 to 1. It is the probability that each input unit of an LSTM layer is kept during dropout: 1 keeps every unit, and values close to 0 drop almost everything, so nothing useful is learned. Dropout is used against overfitting; common values are 0.5 to 0.8.

In [0]:
epochs = 100
batch_size = 64
num_layers = 4
rnn_size = 512
embedding_size = 128
learning_rate = 0.0005
direction = 2
threshold = 0.9
keep_probability = 0.8 
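
With these settings it is easy to work out how many full batches one epoch contains. The short calculation below uses the training count printed earlier and assumes that only full batches are used (integer division); the leftover sentences are then not part of any batch.

```python
n_training = 915448   # training sentences reported by the split above
batch_size = 64

batches_per_epoch = n_training // batch_size   # full batches only
leftover = n_training % batch_size             # sentences that do not fill a batch

print(batches_per_epoch, leftover)
```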

**Building the Model**

- This method defines the placeholders through which data is fed to the model.
- Every placeholder acts as a pipe into the model.
- tf.name_scope = usually used to group related operations together under a common name.
- tf.placeholder = declares a pipe.
- [None, None] = a 2D array of any size.
- (None,) = a 1D array of any size.
- A placeholder declared without a shape, such as keep_prob, is fed a single value.
- The name attribute is used to identify the placeholders.
- tf.reduce_max takes an array as input and returns the maximum value in it.

In [0]:
def model_inputs():

    with tf.name_scope('inputs'):
        # ARGS: data type, shape of the tensor to be fed, name for operation
        inputs = tf.placeholder(tf.int32, [None, None], name='inputs')

    with tf.name_scope('targets'):
        targets = tf.placeholder(tf.int32, [None, None], name='targets')

    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    inputs_length = tf.placeholder(tf.int32, (None,), name='inputs_length')
    targets_length = tf.placeholder(tf.int32, (None,), name='targets_length')

    # ARGS: input tensor, name for operation
    max_target_length = tf.reduce_max(targets_length, name='max_target_len')

    return inputs, targets, keep_prob, inputs_length, targets_length, max_target_length

- This method removes the last column from the targets array and adds a column of GO tokens at the start, producing the decoder input
- tf.strided_slice removes the last column from the given array.
- tf.fill creates the column of GO tokens and tf.concat attaches it in front.

In [0]:
def process_encoding_input(targets, vocab_to_int, batch_size):

    with tf.name_scope("processing_encoding"):
        ending = tf.strided_slice(targets, [0, 0], [batch_size, -1], [1, 1])
        dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)
    return dec_input
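
The effect of the slice and concat is easiest to see on plain Python lists. The sketch below is an illustration only, not the TensorFlow code; it reuses the token ids from the vocabulary printed earlier on a made-up batch of two padded targets.

```python
GO, EOS, PAD = 30, 29, 28          # ids from the vocabulary printed earlier

targets = [
    [0, 1, 2, EOS],                # "thi" + <EOS>
    [2, 0, EOS, PAD],              # "it" + <EOS> + padding
]

# Drop the last column, then put <GO> in front of every row
decoder_input = [[GO] + row[:-1] for row in targets]
print(decoder_input)
```

Each row keeps its length, and the decoder sees the target shifted one step to the right.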

- tf.name_scope creates a namespace for operations in the default graph.
- tf.variable_scope creates a namespace for both variables and operations in the default graph.
- With direction set to 2 we use a bidirectional RNN, which reads the input in both the forward and the backward direction.
- LSTMs are a special kind of RNN. (The simple version of RNN is rarely used; its more advanced versions, LSTM or GRU, are used instead, because a simple RNN suffers from the vanishing gradient problem.)
- Here we create forward and backward cells, each with rnn_size units
- So one layer consists of a forward and a backward cell, and every additional layer has the same structure. Note that, as written, each loop iteration feeds the same rnn_inputs to its cells, so the layers are not stacked on top of each other and the returned output comes from the last layer built.
- We use a dropout wrapper to randomly drop inputs so that the model does not overfit the given data.
- As we are using a bidirectional RNN, we concatenate the forward and backward outputs.
- tf.nn.bidirectional_dynamic_rnn runs the forward and backward cells over the embedded input.

In [0]:
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob, direction):

    if direction == 1:
        with tf.name_scope("RNN_Encoder_Cell_1D"):
            for layer in range(num_layers):
                with tf.variable_scope('encoder_{0}'.format(layer)):
                    lstm = tf.contrib.rnn.LSTMCell(rnn_size)
                    drop = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)
                    enc_output, enc_state = tf.nn.dynamic_rnn(drop, rnn_inputs, sequence_length, dtype=tf.float32)
            return enc_output, enc_state

    if direction == 2:
        with tf.name_scope("RNN_Encoder_Cell_2D"):
            for layer in range(num_layers):
                with tf.variable_scope('encoder_{0}'.format(layer)):
                    cell_fw = tf.contrib.rnn.LSTMCell(rnn_size)
                    cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, input_keep_prob = keep_prob)

                    cell_bw = tf.contrib.rnn.LSTMCell(rnn_size)
                    cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, input_keep_prob = keep_prob)

                    enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, rnn_inputs, sequence_length, dtype=tf.float32)
      
            # Concat outputs
            enc_output = tf.concat(enc_output, 2)

            # Use only forwarded state
            return enc_output, enc_state[0]

- In the first block we create the LSTM decoding layers with a dropout wrapper
- The number of layers created depends on the value given for num_layers. As written, each iteration overwrites dec_cell, so only the last cell is used by the decoder.
- The output layer is the final prediction layer: a Dense layer that projects the decoder output onto vocab_size scores (logits), one per token.
- In this project we use BahdanauAttention, which helps the decoder focus on the relevant input characters when choosing the right character for the correction.
- This link helps in understanding the attention mechanism better: https://blog.floydhub.com/attention-mechanism/
- In the next line we wrap the decoder cell with the attention mechanism
- initial_state holds the final state of the encoder, which is used as the starting state of the decoder.
- We then call the training layer, used for training, and the inference layer, used for prediction
- For more on training and inference, see https://towardsdatascience.com/seq2seq-model-in-tensorflow-ec0c557e560f

In [0]:
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, inputs_length, targets_length, max_target_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers, direction):

    with tf.name_scope("RNN_Decoder_Cell"):
        for layer in range(num_layers):
            with tf.variable_scope('decoder_{}'.format(layer)):
                lstm = tf.contrib.rnn.LSTMCell(rnn_size)
                dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, input_keep_prob = keep_prob)

    output_layer = Dense(vocab_size, kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))

    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size, enc_output, inputs_length, normalize=False, name='BahdanauAttention')

    with tf.name_scope("Attention_Wrapper"):
        dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell, attn_mech, rnn_size)
        
    initial_state = dec_cell.zero_state(dtype=tf.float32, batch_size=batch_size)
    initial_state = initial_state.clone(cell_state=enc_state)

    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, targets_length, dec_cell, initial_state, output_layer, vocab_size, max_target_length)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings, vocab_to_int['<GO>'], vocab_to_int['<EOS>'], dec_cell, initial_state, output_layer, max_target_length, batch_size)

    return training_logits, inference_logits
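
As a rough picture of what additive (Bahdanau-style) attention computes at one decoder step, here is a minimal pure-Python sketch. It scores each encoder output as v · tanh(W_q q + W_k k), turns the scores into weights with a softmax, and returns the weighted sum. The function name, the toy matrices and the vectors are assumptions made for this example; the TensorFlow implementation has further details that are not shown here.

```python
import math

def additive_attention(query, keys, w_query, w_keys, v):
    """Return attention weights and the context vector for one decoder step."""
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_query = matvec(w_query, query)
    scores = []
    for key in keys:
        projected_key = matvec(w_keys, key)
        hidden = [math.tanh(q + k) for q, k in zip(projected_query, projected_key)]
        scores.append(sum(vi * h for vi, h in zip(v, hidden)))

    # Softmax over the encoder positions
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    weights = [e / sum(exps) for e in exps]

    # Context vector: weighted sum of the encoder outputs
    context = [sum(w * key[d] for w, key in zip(weights, keys))
               for d in range(len(keys[0]))]
    return weights, context

# Toy values, chosen by hand for illustration
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three encoder outputs
query = [0.5, -0.5]                           # current decoder state
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(query, keys, identity, identity, v=[1.0, 1.0])
print(weights, context)
```

The weights always sum to 1, so the context vector is a weighted average of the encoder outputs.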

- TrainingHelper is where we pass the embedded input. As the name indicates, it is only a helper instance. It is handed to BasicDecoder, which does the actual work of building the decoder model.
- BasicDecoder builds the decoder model: it connects the RNN layer(s) on the decoder side with the input prepared by TrainingHelper.
- dynamic_decode runs the decoder step by step until every sequence is finished or maximum_iterations is reached.
- time_major=False means the tensors are laid out batch-major, i.e. shaped [batch, time, ...], rather than [time, batch, ...].
- impute_finished=True zeroes the outputs and copies the state through once a sequence has finished. This slows down decoding, but keeps the final states and outputs correct.
- In Python, a return value assigned to _ is a throwaway variable, i.e. a value we do not intend to use.

In [0]:
def training_decoding_layer(dec_embed_input, targets_length, dec_cell, initial_state, output_layer, vocab_size, max_target_length):

    with tf.name_scope("Training_Decoder"):
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input, sequence_length=targets_length, time_major=False)
        training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, training_helper, initial_state, output_layer)
        training_logits, _, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder, output_time_major=False, impute_finished=True, maximum_iterations=max_target_length)

        return training_logits

- GreedyEmbeddingHelper takes start_tokens, one start token per sentence in the batch, and an end_token. These tokens start and end the prediction.
- The remaining functions are the same as in the training decoding layer.

In [0]:
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer, max_target_length, batch_size):
     
     with tf.name_scope("Inference_Decoder"):
        start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')

        inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings, start_tokens, end_token)
        inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, inference_helper, initial_state, output_layer)
        inference_logits, _, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder, output_time_major=False, impute_finished=True, maximum_iterations=max_target_length)

        return inference_logits
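
The idea behind greedy inference can be sketched without TensorFlow: start from the start token, pick the highest-scoring token at each step, feed it back in, and stop at the end token or at the maximum length. In the sketch below `greedy_decode` and `toy_step` are hypothetical names; `toy_step` stands in for the decoder cell plus output layer.

```python
def greedy_decode(step_fn, start_token, end_token, max_length):
    """Feed each predicted token back in until end_token or max_length."""
    token, state, output = start_token, None, []
    for _ in range(max_length):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)   # argmax
        if token == end_token:
            break
        output.append(token)
    return output

# Hypothetical stand-in for the decoder: always scores token + 1 highest
def toy_step(token, state):
    scores = [0.0] * 5
    scores[(token + 1) % 5] = 1.0
    return scores, state

print(greedy_decode(toy_step, start_token=0, end_token=4, max_length=10))
```

During training, by contrast, the decoder is fed the correct previous token from the target rather than its own prediction.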

- This is the main seq2seq_model function, which connects all of the above functions to build the seq2seq model.
- In the first line, enc_embeddings is a tensor of shape [vocab_size, embedding_size], which is [32, 128] here because vocab_size is passed as len(vocab_to_int)+1. A tensor is like an array, so for simplicity we can think of enc_embeddings as a 2D array with 32 rows and 128 columns. The whole array is filled with uniform random numbers between -1 and 1; these are only starting values, and the embeddings are learned during training.
- enc_embed_input contains the embedded values of all the inputs. For example, the character a is represented by one row of 128 values.
- This input is passed to the encoding layer. The lookup step is commonly known as the embedding layer.
- In the same way, we create the embedded input for the decoder
- Finally, decoding_layer gives us the training output and the inference output.

In [0]:
def seq2seq_model(inputs, targets, keep_prob, inputs_length, targets_length, max_target_length, vocab_size, rnn_size, num_layers, vocab_to_int, batch_size, embedding_size, direction):

    enc_embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1, 1))
    enc_embed_input = tf.nn.embedding_lookup(enc_embeddings, inputs)
    enc_output, enc_state = encoding_layer(rnn_size, inputs_length, num_layers, enc_embed_input, keep_prob, direction)

    dec_embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1, 1))
    dec_input = process_encoding_input(targets, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    training_logits, inference_logits  = decoding_layer(dec_embed_input, dec_embeddings, enc_output, enc_state, vocab_size, inputs_length, targets_length, max_target_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers, direction)

    return training_logits, inference_logits
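
An embedding lookup is just row selection. The sketch below shows this with plain Python lists and toy sizes, much smaller than the ones used in this notebook; all names are made up for the illustration.

```python
import random

vocab_size, embedding_size = 5, 3          # toy sizes for illustration
rng = random.Random(0)

# One row of random values in [-1, 1] per token id
embeddings = [[rng.uniform(-1, 1) for _ in range(embedding_size)]
              for _ in range(vocab_size)]

batch = [[0, 2, 2], [4, 1, 0]]             # 2 sentences, 3 token ids each

# Lookup = pick the row that belongs to each id
embedded = [[embeddings[token] for token in sentence] for sentence in batch]

print(len(embedded), len(embedded[0]), len(embedded[0][0]))   # batch, time, embedding
```

The result has one extra dimension compared with the input: every token id has been replaced by its vector.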

- Building the seq2seq model is like building a machine. To run the machine we need power, and for the seq2seq model that power comes from defining the graph.
- Defining the graph specifies which operations are performed on the model and how they are carried out.
- First, tf.reset_default_graph() clears the default graph, so we start building from an empty graph.
- Next we create the placeholders (pipes) through which all the required data is passed to the model.
- Pass all the necessary inputs to the seq2seq model.
- training_logits holds the outputs of the training decoder; comparing them with the real targets gives the cost, or loss.
- The loss is computed with sequence_loss, a cross-entropy averaged over the sequence. The masks give positions beyond each given target length a weight of 0, so they do not contribute to the loss.
- We use the Adam optimizer for learning. Adam = Adaptive Moment Estimation
- With the Adam optimizer we clip each gradient value to the range [-5, 5] to guard against exploding gradients.
- tf.summary.merge_all() merges all the available summaries so that they can be visualized in TensorBoard.
- All the necessary nodes of the graph are collected in a namedtuple. To understand namedtuple, see https://www.geeksforgeeks.org/namedtuple-in-python/
- For each batch we can then fetch the values declared in the namedtuple.
In [0]:
def build_graph(keep_prob, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction):

    tf.reset_default_graph()

    # Load model inputs
    inputs, targets, keep_prob, inputs_length, targets_length, max_target_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(inputs, [-1]), targets, keep_prob, inputs_length, targets_length, max_target_length, len(vocab_to_int)+1, rnn_size, num_layers, vocab_to_int, batch_size, embedding_size, direction)

    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')

    with tf.name_scope('predictions'):
        predictions = tf.identity(inference_logits.sample_id, name='predictions')
        tf.summary.histogram('predictions', predictions)

    # Create the weights for sequence_loss
    masks = tf.sequence_mask(targets_length, max_target_length, dtype=tf.float32, name='masks')

    with tf.name_scope("cost"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)
        tf.summary.scalar('cost', cost)

    with tf.name_scope("optimze"):
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)


    # Merge summaries
    merged = tf.summary.merge_all()

    # Export the nodes
    export_nodes = ['inputs', 'targets', 'keep_prob', 'cost', 'inputs_length', 'targets_length', 'predictions', 'merged', 'train_op','optimizer']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])

    return graph
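
Two pieces of this graph can be illustrated in plain Python: a masked cross-entropy and clipping by value. The sketch below is a simplified stand-in, not the TensorFlow code: it takes probabilities instead of logits, and the helper names and toy numbers are assumptions made for this example.

```python
import math

def masked_sequence_loss(probabilities, targets, masks):
    """Cross-entropy summed over unmasked positions, divided by the mask total."""
    total, weight_sum = 0.0, 0.0
    for sent_probs, sent_targets, sent_masks in zip(probabilities, targets, masks):
        for step_probs, target, mask in zip(sent_probs, sent_targets, sent_masks):
            total += -math.log(step_probs[target]) * mask
            weight_sum += mask
    return total / weight_sum

def clip_by_value(values, low, high):
    """Limit every value to the range [low, high]."""
    return [min(max(v, low), high) for v in values]

# One sentence, two time steps, vocabulary of 2; the second step is masked out
probabilities = [[[0.5, 0.5], [0.9, 0.1]]]
targets = [[0, 1]]
masks = [[1.0, 0.0]]

loss = masked_sequence_loss(probabilities, targets, masks)
clipped = clip_by_value([-12.0, 0.3, 7.5], -5.0, 5.0)
print(loss, clipped)   # only the first time step contributes to the loss
```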

- If every sentence in a batch has the same length, no padding is needed.
- If the batch has sentences of different lengths, we append PAD tokens to the shorter sentences until they match the longest one.
- Padding is required so that the batch can be stored as a rectangular array.

In [0]:
def pad_sentence_batch(sentence_batch):
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
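
A quick check of the padding logic on a made-up batch. The helper below, `pad_batch`, is a standalone copy for illustration that takes the pad id as an argument instead of reading vocab_to_int.

```python
PAD = 28   # id of <PAD> in the vocabulary printed earlier

def pad_batch(sentence_batch, pad_id):
    longest = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [pad_id] * (longest - len(sentence)) for sentence in sentence_batch]

batch = [[0, 1, 2], [3], [4, 5]]
padded = pad_batch(batch, PAD)
print(padded)
```

Every row now has the length of the longest sentence, and the original lists are left unchanged.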

- This method yields the sentences batch by batch, according to the batch size.
- Each iteration returns the noisy sentences, the correct sentences, and the lengths of both.

In [0]:
def get_batches(sentences, batch_size, threshold):

    for batch_i in range(0, len(sentences)//batch_size):
        start_i = batch_i * batch_size
        sentences_batch = sentences[start_i:start_i + batch_size]
        
        sentences_batch_noisy = []
        for sentence in sentences_batch:
            sentences_batch_noisy.append(noise_maker(sentence, threshold))

        sentences_batch_eos = []
        for sentence in sentences_batch:
            # Build a new list so the <EOS> token is not appended to the
            # original sentence again on every epoch
            sentences_batch_eos.append(sentence + [vocab_to_int['<EOS>']])

        pad_sentences_batch = np.array(pad_sentence_batch(sentences_batch_eos))
        pad_sentences_noisy_batch = np.array(pad_sentence_batch(sentences_batch_noisy))

        # Need the lengths for the _lengths parameters

        pad_sentences_lengths = []
        for sentence in pad_sentences_batch:
            pad_sentences_lengths.append(len(sentence))
        

        pad_sentences_noisy_lengths = []
        for sentence in pad_sentences_noisy_batch:
            pad_sentences_noisy_lengths.append(len(sentence))

        yield pad_sentences_noisy_batch, pad_sentences_batch, pad_sentences_noisy_lengths, pad_sentences_lengths

**Training and Validating the Model**

- A session lets us execute a graph or parts of a graph.
- Only after tf.global_variables_initializer() has been run in a session do the variables hold the initial values they were declared with.
- tf.train.Saver() saves checkpoints. A checkpoint is a snapshot of the variable values at a given time.
- Checkpoints can be saved to the local file system and restored later to continue training, testing, or prediction.
- testing_loss_summary is a list that holds the losses produced during validation. It is used to check whether a new validation loss is lower than the previous ones.
- Logs are written to the logs folder, which TensorBoard can read to display the graph elements visually.
- The batches are produced by the get_batches method.
- For every batch we call sess.run. Its first argument lists the graph nodes whose values we want returned; it can be a single node or a list.
- During training we fetch the merged summaries, the loss (cost), and the training operation, so these parts of the graph are executed.
- The second argument passes all the necessary inputs to the model.
- Looping over enumerate gives two values per iteration: the index and the item.
- batch_i is the zero-based index of the current batch, so batch_i+1 is the ongoing batch number.
- If batch_i is a multiple of display_step, the training progress is displayed to the user.
- In the same way, if batch_i is a multiple of testing_check, validation starts.
- Validation uses a separate set of sentences that the model has not been trained on.
- Validation also happens in batches, and the loss is calculated over them.
- After this we print 20 examples with the input given to the model, the model's prediction, and the original output.
- To get a prediction we use sess.run with the predictions node.
- The model expects its input in batches, but for prediction we pass one sentence at a time, so we repeat the sentence batch_size times and use the first row of the result as the prediction.
- If the model predicts more than 8 of the 20 sentences correctly, it is saved as an accuracy model.
- The model is also saved if its loss is lower than the previous ones.
- If validation checks keep running without improvement, we stop training.
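
The stopping rule in the last bullet can be sketched in a few lines. This is a simplified, hypothetical helper (`should_stop` and the loss values are made up); the training function keeps its own counters for the same purpose.

```python
def should_stop(validation_losses, patience):
    """True once the last `patience` checks all failed to beat the best earlier loss."""
    if len(validation_losses) <= patience:
        return False
    best_before = min(validation_losses[:-patience])
    return all(loss >= best_before for loss in validation_losses[-patience:])

losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.4, 1.3, 1.21]
print(should_stop(losses, patience=5))       # five checks in a row without improvement
print(should_stop(losses[:5], patience=5))   # not enough history yet
```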

In [0]:
def train(model, epochs):   

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        saver = tf.train.Saver()
        #checkpoint = "/content/drive/My Drive/Ml-Models/Model-2-E-2.ckpt"
        #saver.restore(sess, checkpoint)

        testing_loss_summary = []
    
        iteration = 0 # Keep track of which batch iteration is being trained
        display_step = 30 # The progress of the training will be displayed after every 30 batches
        stop_early = 0 
        stop = 5 # If the batch_loss_testing does not decrease in 5 consecutive checks, stop training
        per_epoch = 1 # Number of validation checks per epoch
        testing_check = (len(training_sorted)//batch_size//per_epoch)-1  # After how many batch of training validation testing has to be started.
               
        for epoch_i in range(1, epochs+1): 
            batch_loss = 0
            batch_time = 0
            is_correct = 0

            print()
            print("Training Model: {}".format(epoch_i))

            train_writer = tf.summary.FileWriter('./logs/1/train/{}'.format(epoch_i), sess.graph)
            test_writer = tf.summary.FileWriter('./logs/1/test/{}'.format(epoch_i))


            # Per batch
            for batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(get_batches(training_sorted,batch_size,threshold)):
                start_time = time.time()
                summary, loss, _ = sess.run([model.merged, model.cost, model.train_op],
                                             {model.inputs: input_batch,
                                              model.targets: target_batch,
                                              model.inputs_length: input_length,
                                              model.targets_length: target_length,
                                              model.keep_prob: keep_probability})

                batch_loss += loss
                end_time = time.time()
                batch_time += end_time - start_time

                # Record the progress of training
                train_writer.add_summary(summary, iteration)

                iteration += 1

                # Print info
                if batch_i % display_step == 0 and batch_i > 0:
                    print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                          .format(epoch_i,
                                  epochs, 
                                  batch_i, 
                                  len(training_sorted) // batch_size, 
                                  batch_loss / display_step, 
                                  batch_time))
                    # Reset
                    batch_loss = 0
                    batch_time = 0


                #### Run Validation Testing ####

                if batch_i % testing_check == 0 and batch_i > 0:
                    batch_loss_testing = 0
                    batch_time_testing = 0

                    # Use a separate loop variable so the training batch index (batch_i) is not overwritten
                    for val_batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(get_batches(validation_sorted, batch_size, threshold)):
                        start_time_testing = time.time()
                        summary, loss = sess.run([model.merged, model.cost],
                                                 {model.inputs: input_batch,
                                                  model.targets: target_batch,
                                                  model.inputs_length: input_length,
                                                  model.targets_length: target_length,
                                                  model.keep_prob: 1})  # No dropout during validation

                        batch_loss_testing += loss
                        end_time_testing = time.time()
                        batch_time_testing += end_time_testing - start_time_testing

                        # Record the progress of testing
                        test_writer.add_summary(summary, iteration)

                    n_batches_testing = val_batch_i + 1

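                    # Print results for 20 validation sentences; reset the counter so it
                    # only counts the sentences of this validation check
                    is_correct = 0
                    for i in range(100, 120):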

                        correct = validation_sorted[i]
                        text = noise_maker(validation_sorted[i],threshold)
                        answer_logits = sess.run(model.predictions, {model.inputs: [text]* batch_size,
                                                                 model.inputs_length: [len(text)]* batch_size,
                                                                 model.targets_length: [len(text)+1],
                                                                 model.keep_prob: [1.0]})[0]


                        correct_sentence = MakeSentenceReadable(correct)
                        text_sentence = MakeSentenceReadable(text)
                        answer_logits_sentence = MakeSentenceReadable(answer_logits)
                       
                        if (answer_logits_sentence == correct_sentence):
                            is_correct += 1                   

                        print('  Validation Input: {}'.format(text_sentence))
                        print('  Validation Output: {}'.format(answer_logits_sentence))
                        print('  Correct: {}'.format(correct_sentence))
                        print('  Is Correct: {}'.format(answer_logits_sentence == correct_sentence))
                        print()                    

                    print('Testing Loss: {:>6.3f}, Seconds: {:>4.2f}'.format(batch_loss_testing / n_batches_testing, batch_time_testing))

                    # Track the validation loss so that a new minimum can be detected
                    testing_loss_summary.append(batch_loss_testing)

                    # Save when more than 8 of the 20 printed sentences were predicted correctly
                    if is_correct > 8:
                        print('New Accuracy Record!') 
                        stop_early = 0
                        checkpoint = "/content/drive/My Drive/Colab Notebooks/Model-{}-E-{}.ckpt".format(is_correct,epoch_i)
                        saver.save(sess, checkpoint)  # Reuse the Saver created at the top of train()
                    else:
                        # Otherwise save only if the validation loss is at a new minimum
                        if batch_loss_testing <= min(testing_loss_summary):
                            print('New Loss Record!') 
                            stop_early = 0
                            checkpoint = "/content/drive/My Drive/Colab Notebooks/Model-{}-E-{}.ckpt".format(epoch_i,epoch_i)
                            saver.save(sess, checkpoint)  # Reuse the Saver created at the top of train()
                        else:
                            print("No Improvement.")
                            stop_early += 1
                            if stop_early == stop:
                                break

            if stop_early == stop:
                print("Stopping Training.")
                break


# Train the model with the desired tuning parameters

model = build_graph(keep_probability, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction)
train(model, epochs)

**Testing the Model**

- For testing we use a separate testing set. Every sentence in this set is passed through noise_maker and fed to the model, and the prediction is compared with the original sentence.
- Accuracy is calculated from the number of correctly predicted sentences: No_of_correct_sentence / total_no_of_sentences_used_for_testing.
- After every 1000 sentences, the user is notified of the percentage of testing that has completed.
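
As a small illustration of the accuracy formula above, here is a minimal sketch in plain Python. The helper name sentence_accuracy and the example sentences are made up for this sketch; the notebook itself counts matches inside the test function below.

```python
from math import ceil


def sentence_accuracy(predictions, references):
    """Fraction of predictions that match their reference sentence exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    n_correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return n_correct / len(references)


# Made-up sentences for illustration only
predictions = ["the cat sat", "a dog ran", "it is very hard"]
references = ["the cat sat", "a dog run", "it is very hard"]

accuracy = sentence_accuracy(predictions, references)
percent = ceil(accuracy * 100)  # ceil rounds up to the next whole percent
print("Accuracy %: {}%".format(percent))
print("Exact Accuracy: {}".format(accuracy))
```

Because ceil rounds up, the printed percentage can be slightly higher than the exact ratio, which is why it is useful to print both values.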

In [0]:
def test(model,testing_set):
    # Start session
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()

        print()
        print("Testing LSTM Model")

        testing_check = (len(testing_set)//batch_size//1)-1
        tested = 0
        is_correct = 0
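        # Uncomment the two lines below to test a trained checkpoint;
        # otherwise the freshly initialized (untrained) weights are tested
        #checkpoint = "/content/drive/My Drive/Ml-Models/Model-12-E-2.ckpt"
        #saver.restore(sess, checkpoint)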

        # Per batch
        for batch_i, (input_batch, target_batch, input_length, target_length) in enumerate(get_batches(testing_set,batch_size,threshold)):
            if batch_i % testing_check == 0 and batch_i > 0:  
                
                print_tested_each = 1000

                for i in range(0, len(testing_set)):

                    if (tested > print_tested_each):
                        print_tested_each  += 1000
                        print("Tested {}% of test set".format((ceil(i / len(testing_set) * 100) * 100) / 100.0))

                    text = noise_maker(testing_set[i],threshold)
                    correct = testing_set[i]
                    answer_logits = sess.run(model.predictions, {model.inputs: [text]*batch_size,
                                                             model.inputs_length: [len(text)]*batch_size,
                                                             model.targets_length: [len(text)+1],
                                                             model.keep_prob: [1.0]})[0]

                    correct_sentence = MakeSentenceReadable(correct)
                    text_sentence = MakeSentenceReadable(text)
                    answer_logits_sentence = MakeSentenceReadable(answer_logits)

                    tested += 1
                    if (answer_logits_sentence == correct_sentence):
                        is_correct += 1

                # Report accuracy
                print("Accuracy %: {}%".format((ceil((is_correct / tested) * 100) * 100) / 100.0))
                print("Exact Accuracy: {}".format(is_correct / tested))
                
                return is_correct / tested


model = build_graph(keep_probability, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction) 
test(model, testing_sorted)

**Custom Testing**

- Replace text_sentence with your own input sentence.
- text_sentence is converted to a list of integers, because the model takes integer-encoded input.
- Restore the best available model from the checkpoint.
- Once the model returns the prediction, the output is converted into human-readable text and printed on the console.
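
The encoding and batching steps can be tried without a trained model. The sketch below assumes a toy character vocabulary (lowercase letters and space) and a batch_size of 4; both are assumptions for illustration, since the notebook builds vocab_to_int from the dataset and sets batch_size elsewhere. The ints_to_text helper is also hypothetical.

```python
# Toy vocabulary, assumed for this sketch only
chars = "abcdefghijklmnopqrstuvwxyz "
vocab_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_vocab = {i: ch for ch, i in vocab_to_int.items()}


def text_to_ints(text):
    # Encode the sentence character by character
    return [vocab_to_int[ch] for ch in text]


def ints_to_text(ids):
    # Hypothetical inverse of text_to_ints, used here to check the round trip
    return "".join(int_to_vocab[i] for i in ids)


batch_size = 4  # assumed value for the sketch

text = text_to_ints("vrey hard")

# The model expects a full batch, so the same sentence is repeated
inputs = [text] * batch_size
inputs_length = [len(text)] * batch_size

print(len(inputs), inputs_length)
print(ints_to_text(inputs[0]))
```

Every row of the batch holds the same sentence, which is why only the first row of the model output is kept in the cell below.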

In [0]:
def text_to_ints(text):
    # Encode character by character; a character missing from vocab_to_int raises KeyError
    return [vocab_to_int[char] for char in text]

# Create your own sentence or use one from the dataset

text_sentence = "the first days of her existence in th country were vrey hard for dolly"

text = text_to_ints(text_sentence)

checkpoint = "/content/drive/My Drive/Ml-Models/Model-12-E-2.ckpt"

model = build_graph(keep_probability, rnn_size, num_layers, batch_size, learning_rate, embedding_size, direction) 

with tf.Session() as sess:

    # Load saved model
    saver = tf.train.Saver()
    saver.restore(sess, checkpoint)

    # Repeat the sentence batch_size times to match the model's input shape; [0] keeps the first prediction
    answer_logits = sess.run(model.predictions, {model.inputs: [text]*batch_size, 
                                                 model.inputs_length: [len(text)]*batch_size,
                                                 model.targets_length: [len(text)+1], 
                                                 model.keep_prob: [1.0]})[0]

answer_logits_sentence = MakeSentenceReadable(answer_logits)
print(text_sentence)
print()
print(answer_logits_sentence)
