# Introduction

We tackle the problem of OCR post processing. In OCR, we map the image form of the document into the text domain. This is done first using an CNN+LSTM+CTC model, in our case based on tesseract. Since this output maps only image to text, we need something on top to validate and correct language semantics.

The idea is to build a language model, that takes the OCRed text and corrects it based on language knowledge. The langauge model could be:
- Char level: the aim is to capture the word morphology. In which case it's like a spelling correction system.
- Word level: the aim is to capture the sentence semnatics. But such systems suffer from the OOV problem.
- Fusion: to capture semantics and morphology language rules. The output has to be at char level, to avoid the OOV. However, the input can be char, word or both.

The fusion model target is to learn:

    p(char | char_context, word_context)

In this workbook we use seq2seq vanilla Keras implementation, adapted from the lstm_seq2seq example on Eng-Fra translation task. The adaptation involves:

- Adapt to spelling correction, on char level
- Pre-train on a noisy, medical sentences
- Fine tune a residual, to correct the mistakes of tesseract 
- Limit the input and output sequence lengths
- Enusre teacher forcing auto regressive model in the decoder
- Limit the padding per batch (TODO)
- Learning rate schedule (TODO)


# Imports

In [1]:
from __future__ import print_function
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras import optimizers
from keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler
import numpy as np
import os
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


# Utility functions

In [2]:
# Limit gpu allocation. allow_growth, or gpu_fraction
def gpu_alloc():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))

In [3]:
gpu_alloc()

In [4]:
def calculate_WER_sent(gt, pred):
    '''
    calculate_WER('calculating wer between two sentences', 'calculate wer between two sentences')
    '''
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    d = np.zeros(((len(gt_words) + 1), (len(pred_words) + 1)), dtype=np.uint8)
    # d = d.reshape((len(gt_words)+1, len(pred_words)+1))

    # Initializing error matrix
    for i in range(len(gt_words) + 1):
        for j in range(len(pred_words) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(gt_words) + 1):
        for j in range(1, len(pred_words) + 1):
            if gt_words[i - 1] == pred_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    return d[len(gt_words)][len(pred_words)]

In [5]:
def calculate_WER(gt, pred):
    '''

    :param gt: list of sentences of the ground truth
    :param pred: list of sentences of the predictions
    both lists must have the same length
    :return: accumulated WER
    '''
#    assert len(gt) == len(pred)
    WER = 0
    nb_w = 0
    for i in range(len(gt)):
        #print(gt[i])
        #print(pred[i])
        WER += calculate_WER_sent(gt[i], pred[i])
        nb_w += len(gt[i])

    return WER / nb_w

In [6]:
# Artificial noisy spelling mistakes
def noise_maker(sentence, threshold):
    '''Relocate, remove, or add characters to create spelling mistakes'''
    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',]
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0, 1, 1)
        # Most characters will be correct since the threshold value is high
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0, 1, 1)
            # ~33% chance characters will swap locations
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    # If last character in sentence, it will not be typed
                    continue
                else:
                    # if any other character, swap order with following character
                    noisy_sentence.append(sentence[i + 1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            # ~33% chance an extra lower case letter will be added to the sentence
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(random_letter)
                noisy_sentence.append(sentence[i])
            # ~33% chance a character will not be typed
            else:
                pass
        i += 1

    return ''.join(noisy_sentence)

In [7]:
def load_data_with_gt(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load data from txt file, with each line has: <TXT><TAB><GT>. The  target to the decoder muxt have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    for row in open(file_name):
        if cnt < num_samples :
            print(row)
            sents = row.split("\t")
            input_text = sents[0]
            
            target_text = '\t' + sents[1] + '\n'
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                cnt += 1
                
                input_texts.append(input_text)
                target_texts.append(target_text)
                gt_texts.append(sents[1])
    return input_texts, target_texts, gt_texts

In [8]:
def load_data_with_noise(file_name, num_samples, noise_threshold, max_sent_len, min_sent_len):
    '''Load data from txt file, with each line has: <TXT>. The GT is just a noisy version of TXT. The  target to the decoder muxt have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    while cnt < num_samples :
        for row in open(file_name):
            if cnt < num_samples :
                sents = row.split("\t")
                input_text = noise_maker(sents[1], noise_threshold)
                input_text = input_text[:-1]

                target_text = '\t' + sents[1] + '\n'            
                if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                    cnt += 1
                    input_texts.append(input_text)
                    target_texts.append(target_text)
                    gt_texts.append(target_text[1:-1])
                    
    return input_texts, target_texts, gt_texts

In [9]:
def build_vocab(all_texts):
    '''Build vocab dictionary to victorize chars into ints'''
    vocab_to_int = {}
    count = 0
    
    for sentence in all_texts:
        for char in sentence:
            if char not in vocab_to_int:
                vocab_to_int[char] = count
                count += 1
    # Add special tokens to vocab_to_int
    codes = ['\t','\n']
    for code in codes:
        if code not in vocab_to_int:
            vocab_to_int[code] = count
            count += 1
    '''''Build inverse translation from int to char'''
    int_to_vocab = {}
    for character, value in vocab_to_int.items():
        int_to_vocab[value] = character
        
    return vocab_to_int, int_to_vocab

In [10]:
def vectorize_data(input_texts, target_texts, max_encoder_seq_length, num_encoder_tokens, vocab_to_int):
    '''Prepares the input text and targets into the proper seq2seq numpy arrays'''
    encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
    decoder_input_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype='float32')
    decoder_target_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype='float32')

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, char in enumerate(input_text):
            # c0..cn
            encoder_input_data[i, t, vocab_to_int[char]] = 1.
        for t, char in enumerate(target_text):
            # c0'..cm'
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_input_data[i, t, vocab_to_int[char]] = 1.
            if t > 0:
                # decoder_target_data will be ahead by one timestep
                # and will not include the start character.
                decoder_target_data[i, t - 1, vocab_to_int[char]] = 1.
                
    return encoder_input_data, decoder_input_data, decoder_target_data

In [11]:
def decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, vocab_to_int['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = int_to_vocab[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


# Load data

In [12]:
data_path = '../../dat/'

In [13]:
max_sent_len = 40
min_sent_len = 4

## Results on tesseract correction

In [14]:
num_samples = 10000
tess_correction_data = os.path.join(data_path, 'all_ocr_data.txt')
input_texts_OCR, target_texts_OCR, gt_OCR = load_data_with_gt(tess_correction_data, num_samples, max_sent_len, min_sent_len)

Claim Tji'pe: V'B Accident - Accidental Injury 	Claim Type: VB Accident - Accidental Injury

Who Th 2 Reported Event Happen ed To: EtmloyeerPolicyholders Child 	Who The Reported Event Happened To: Employee/Policyholder’s child

Policg'h old er: Owner Information 	Policyholder/Owner Information

First Name: 	First Name:

Last Name: 	Last Name:

Social Secun'ty Number: 	Social Security Number:

Birth Date: 	Birth Date:

Gender: 	Gender:

Language Preference: 	Language Preference:

Address Line 1: 	Address Line 1:

City: 	City:

Stater‘Pronnce: 	State/Province:

Postal Code: 	Postal Code:

C mmtry', 	Country:

Best Phone Number to be Reached During the Day. 	Best Phone Number to be Reached During the Day:

Email Address: 	Email Address:

Dependent Information 	Dependent Information

Claim Detail 	Claim Detail

Billed Amomit ‘ 	Billed Amounts

( Contract Adjustment	Contract Adjustment

iUCRiNomPar; LAdjustment	URC/Non-Par Adjustment

Aiiowed Amount	Allowed Amount

NotCovered Amount	Not Cov

OPERATIVE REPORT - PAGE 2. of 2 ‘ 	OPERATIVE REPORT - PAGE 2 of 2

TWIN CITIES 	TWIN CITIES

ORTHOPEDICS 	ORTHOPEDICS

Twin Cities Orthopedics-Burnsville 	Twin Cities Orthopedics-Burnsville

Date of Service: 01/21/2018 7:30PM 	Date of Service: 01/21/2018 7:30PM

Provider: David Feivor PA-C 	Provider: David Feivor PA-C

Chief Complaint 	Chief Complaint

Right knee injury 	Right knee injury

DOI 112112018 	DOI 1/21/2018

History of Present Illness - 	History of Present Illness

is a 37 year old woman presenting with a right knee injury that occurred today when she twisted her knee and fell while playing hockey. She states that the pain is mainly at the posterior aspect of her right knee with weight bearing and turning. She also states that it is unstable and has given out on her a few times, and there is swelling present. She denies any clicking or locking of the knee. She has tried ice and Ibuprofen with some relief and denies any history of injury to her right knee. 	is a 37 year old w

In [15]:
input_texts = input_texts_OCR
target_texts = target_texts_OCR

# Results of pre-training on generic data

In [16]:
'''
num_samples = 0
big_data = os.path.join(data_path, 'big.txt')
threshold = 0.9
input_texts_gen, target_texts_gen, gt_gen = load_data_with_noise(file_name=big_data, 
                                                                 num_samples=num_samples, 
                                                                 noise_threshold=threshold, 
                                                                 max_sent_len=max_sent_len, 
                                                                 min_sent_len=min_sent_len)
'''                                                                 

"\nnum_samples = 0\nbig_data = os.path.join(data_path, 'big.txt')\nthreshold = 0.9\ninput_texts_gen, target_texts_gen, gt_gen = load_data_with_noise(file_name=big_data, \n                                                                 num_samples=num_samples, \n                                                                 noise_threshold=threshold, \n                                                                 max_sent_len=max_sent_len, \n                                                                 min_sent_len=min_sent_len)\n"

In [17]:
#input_texts = input_texs_gen
#target_texts = target_texts_gen

# Results on noisy tesseract corrections

In [18]:
num_samples = 0
tess_correction_data = os.path.join(data_path, 'new_trained_data.txt')
threshold = 0.9
input_texts_noisy_OCR, target_texts_noisy_OCR, gt_noisy_OCR = load_data_with_noise(file_name=tess_correction_data, 
                                                                 num_samples=num_samples, 
                                                                 noise_threshold=threshold, 
                                                                 max_sent_len=max_sent_len, 
                                                                 min_sent_len=min_sent_len)

In [19]:
'''
input_texts = input_texts_noisy_OCR
target_texts = target_texts_noisy_OCR
'''

'\ninput_texts = input_texts_noisy_OCR\ntarget_texts = target_texts_noisy_OCR\n'

# Results on merge of tesseract correction + generic data

In [20]:
'''
input_texts = input_texts_OCR + input_texts_gen
target_texts = input_texts_OCR + target_texts_gen
'''

'\ninput_texts = input_texts_OCR + input_texts_gen\ntarget_texts = input_texts_OCR + target_texts_gen\n'

# Results noisy tesseract correction + generic data

In [21]:
'''
input_texts = input_texts_noisy_OCR + input_texts_gen
target_texts = input_texts_noisy_OCR + target_texts_gen
'''

'\ninput_texts = input_texts_noisy_OCR + input_texts_gen\ntarget_texts = input_texts_noisy_OCR + target_texts_gen\n'

# Results noisy tesseract noisy + correction data

In [22]:
input_texts = input_texts_noisy_OCR + input_texts_OCR
target_texts = target_texts_noisy_OCR + target_texts_OCR

# Results of pre-training on generic and fine tuning on tesseract correction

In [23]:
# TODO

In [24]:
# Sample data
print(len(input_texts))
for i in range(10):
    print(input_texts[i], '\n', target_texts[i])

1882
Policg'h old er: Owner Information  
 	Policyholder/Owner Information


First Name:  
 	First Name:


Last Name:  
 	Last Name:


Social Secun'ty Number:  
 	Social Security Number:


Birth Date:  
 	Birth Date:


Gender:  
 	Gender:


Language Preference:  
 	Language Preference:


Address Line 1:  
 	Address Line 1:


City:  
 	City:


Stater‘Pronnce:  
 	State/Province:




## Build vocab

In [25]:
all_texts = target_texts + input_texts
vocab_to_int, int_to_vocab = build_vocab(all_texts)

In [26]:
input_characters = sorted(list(vocab_to_int))
target_characters = sorted(list(vocab_to_int))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

In [27]:
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 1882
Number of unique input tokens: 106
Number of unique output tokens: 106
Max sequence length for inputs: 39
Max sequence length for outputs: 39


# Prepare training data

## Train/test split

In [28]:
# Split the data into training and testing sentences
input_texts, test_input_texts, target_texts, test_target_texts  = train_test_split(input_texts, target_texts, test_size = 0.15, random_state = 42)

## Vectorize data

## Train data

In [29]:
encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)

## Test data

In [30]:
test_encoder_input_data, test_decoder_input_data, test_decoder_target_data = vectorize_data(input_texts=test_input_texts,
                                                                                            target_texts=test_target_texts, 
                                                                                            max_encoder_seq_length=max_encoder_seq_length, 
                                                                                            num_encoder_tokens=num_encoder_tokens, 
                                                                                            vocab_to_int=vocab_to_int)

# Training model

In [31]:
batch_size = 64  # Batch size for training.
epochs = 200  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
lr = 0.01

In [32]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# TODO: Add Embedding for chars
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 106)    0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 106)    0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 371712      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  371712      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
          

# Learning rate decay

In [33]:
model.compile(optimizer=optimizers.Adam(lr=lr), loss='categorical_crossentropy', metrics=['categorical_accuracy'])

In [34]:
#filepath="weights-improvement-{epoch:02d}-{val_categorical_accuracy:.2f}.hdf5"
filepath="best_model.hdf5" # Save only the best model for inference step, as saving the epoch and metric might confuse the inference function which model to use
checkpoint = ModelCheckpoint(filepath, monitor='val_categorical_accuracy', verbose=1, save_best_only=True, mode='max')
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)
callbacks_list = [checkpoint, tbCallBack]
#callbacks_list = [checkpoint, tbCallBack, lrate]



In [35]:
def exp_decay(epoch):
    initial_lrate = 0.1
    k = 0.1
    lrate = initial_lrate * np.exp(-k*epoch)
    return lrate
lrate = LearningRateScheduler(exp_decay)
#lr = 0

In [36]:
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate
lrate = LearningRateScheduler(step_decay)
#lr = 0

In [37]:
#callbacks_list.append(lrate)

In [38]:
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          validation_data = ([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data),
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          #validation_split=0.2,
          shuffle=True)

Train on 1599 samples, validate on 283 samples
Epoch 1/200

Epoch 00001: val_categorical_accuracy improved from -inf to 0.08055, saving model to best_model.hdf5
Epoch 2/200


  '. They will not be included '



Epoch 00002: val_categorical_accuracy improved from 0.08055 to 0.13745, saving model to best_model.hdf5
Epoch 3/200

Epoch 00003: val_categorical_accuracy improved from 0.13745 to 0.20141, saving model to best_model.hdf5
Epoch 4/200

Epoch 00004: val_categorical_accuracy improved from 0.20141 to 0.25242, saving model to best_model.hdf5
Epoch 5/200

Epoch 00005: val_categorical_accuracy improved from 0.25242 to 0.28368, saving model to best_model.hdf5
Epoch 6/200

Epoch 00006: val_categorical_accuracy improved from 0.28368 to 0.31095, saving model to best_model.hdf5
Epoch 7/200

Epoch 00007: val_categorical_accuracy improved from 0.31095 to 0.32962, saving model to best_model.hdf5
Epoch 8/200

Epoch 00008: val_categorical_accuracy improved from 0.32962 to 0.34294, saving model to best_model.hdf5
Epoch 9/200

Epoch 00009: val_categorical_accuracy improved from 0.34294 to 0.34720, saving model to best_model.hdf5
Epoch 10/200

Epoch 00010: val_categorical_accuracy improved from 0.34720 to


Epoch 00046: val_categorical_accuracy did not improve from 0.42213
Epoch 47/200

Epoch 00047: val_categorical_accuracy did not improve from 0.42213
Epoch 48/200

Epoch 00048: val_categorical_accuracy did not improve from 0.42213
Epoch 49/200

Epoch 00049: val_categorical_accuracy did not improve from 0.42213
Epoch 50/200

Epoch 00050: val_categorical_accuracy did not improve from 0.42213
Epoch 51/200

Epoch 00051: val_categorical_accuracy did not improve from 0.42213
Epoch 52/200

Epoch 00052: val_categorical_accuracy did not improve from 0.42213
Epoch 53/200

Epoch 00053: val_categorical_accuracy did not improve from 0.42213
Epoch 54/200

Epoch 00054: val_categorical_accuracy did not improve from 0.42213
Epoch 55/200

Epoch 00055: val_categorical_accuracy did not improve from 0.42213
Epoch 56/200

Epoch 00056: val_categorical_accuracy did not improve from 0.42213
Epoch 57/200

Epoch 00057: val_categorical_accuracy did not improve from 0.42213
Epoch 58/200

Epoch 00058: val_categorica


Epoch 00081: val_categorical_accuracy did not improve from 0.42222
Epoch 82/200

Epoch 00082: val_categorical_accuracy did not improve from 0.42222
Epoch 83/200

Epoch 00083: val_categorical_accuracy did not improve from 0.42222
Epoch 84/200

Epoch 00084: val_categorical_accuracy did not improve from 0.42222
Epoch 85/200

Epoch 00085: val_categorical_accuracy did not improve from 0.42222
Epoch 86/200

Epoch 00086: val_categorical_accuracy did not improve from 0.42222
Epoch 87/200

Epoch 00087: val_categorical_accuracy did not improve from 0.42222
Epoch 88/200

Epoch 00088: val_categorical_accuracy did not improve from 0.42222
Epoch 89/200

Epoch 00089: val_categorical_accuracy did not improve from 0.42222
Epoch 90/200

Epoch 00090: val_categorical_accuracy did not improve from 0.42222
Epoch 91/200

Epoch 00091: val_categorical_accuracy did not improve from 0.42222
Epoch 92/200

Epoch 00092: val_categorical_accuracy did not improve from 0.42222
Epoch 93/200

Epoch 00093: val_categorica


Epoch 00115: val_categorical_accuracy did not improve from 0.42222
Epoch 116/200

Epoch 00116: val_categorical_accuracy did not improve from 0.42222
Epoch 117/200

Epoch 00117: val_categorical_accuracy did not improve from 0.42222
Epoch 118/200

Epoch 00118: val_categorical_accuracy did not improve from 0.42222
Epoch 119/200

Epoch 00119: val_categorical_accuracy did not improve from 0.42222
Epoch 120/200

Epoch 00120: val_categorical_accuracy did not improve from 0.42222
Epoch 121/200

Epoch 00121: val_categorical_accuracy did not improve from 0.42222
Epoch 122/200

Epoch 00122: val_categorical_accuracy did not improve from 0.42222
Epoch 123/200

Epoch 00123: val_categorical_accuracy did not improve from 0.42222
Epoch 124/200

Epoch 00124: val_categorical_accuracy did not improve from 0.42222
Epoch 125/200

Epoch 00125: val_categorical_accuracy improved from 0.42222 to 0.42276, saving model to best_model.hdf5
Epoch 126/200

Epoch 00126: val_categorical_accuracy improved from 0.42276 


Epoch 00148: val_categorical_accuracy did not improve from 0.42493
Epoch 149/200

Epoch 00149: val_categorical_accuracy did not improve from 0.42493
Epoch 150/200

Epoch 00150: val_categorical_accuracy did not improve from 0.42493
Epoch 151/200

Epoch 00151: val_categorical_accuracy did not improve from 0.42493
Epoch 152/200

Epoch 00152: val_categorical_accuracy did not improve from 0.42493
Epoch 153/200

Epoch 00153: val_categorical_accuracy did not improve from 0.42493
Epoch 154/200

Epoch 00154: val_categorical_accuracy did not improve from 0.42493
Epoch 155/200

Epoch 00155: val_categorical_accuracy did not improve from 0.42493
Epoch 156/200

Epoch 00156: val_categorical_accuracy did not improve from 0.42493
Epoch 157/200

Epoch 00157: val_categorical_accuracy did not improve from 0.42493
Epoch 158/200

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [39]:
model.load_weights('best_model.hdf5')

# Inference model

In [40]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [41]:
# Sample output from train data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)

-
Input sentence: Date of Birth _ 
GT sentence: Date of Birth

Decoded sentence: Date of Birth (mm/dd/yy)

-
Input sentence: Emplqvm am In formation 
GT sentence: Employment Information

Decoded sentence: Employer Name:

-
Input sentence: Piedmont Healllleare 
GT sentence: Piedmont Healthcare

Decoded sentence: Piedmont Healthcare

-
Input sentence: ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: ACCIDENT DETAILS

-
Input sentence: Acoldent / Injury — no 
GT sentence: Accident / Injury - no

Decoded sentence: Accident / Injury - no

-
Input sentence: Review of Systems 
GT sentence: Review of Systems

Decoded sentence: Review of Systems

-
Input sentence: Electronic Submis sion 
GT sentence: Electronic Submission

Decoded sentence: Electronic Submission

-
Input sentence: SERVICE LINE # 
GT sentence: SERVICE LINE #

Decoded sentence: SERVICE LINE #

-
Input sentence: 4. Probiotic CAPS; 
GT sentence: 4. Probiotic CAPS;

Decoded sentence: 4. Probiotic CAPS;

-
Inp

-
Input sentence: Electron iL‘ Submission 
GT sentence: Electronic Submission

Decoded sentence: Electronic Submission

-
Input sentence: T‘mfﬂnynn TD:
GT sentence: Employee ID:

Decoded sentence: Employee ID:

-
Input sentence: Reg. Status Verified 
GT sentence: Reg Status Verified

Decoded sentence: Regional Health Inc

-
Input sentence: DISCOVER
GT sentence: DISCOVER

Decoded sentence: DISCOVER

-
Input sentence: Message 
GT sentence: Message

Decoded sentence: Message

-
Input sentence: Last Name: 
GT sentence: Last Name:

Decoded sentence: Last Name:

-
Input sentence: 12/29/17 DEDUCTIBLE 
GT sentence: 12/29/17 DEDUCTIBLE

Decoded sentence: 12/29/17 DEDUCTIBLE

-
Input sentence: Surgical History 
GT sentence: Surgical History

Decoded sentence: Surgery - no

-
Input sentence: Claim Event Information 
GT sentence: Claim Event Information

Decoded sentence: Claim Event Information

-
Input sentence: Admitted to hospital — no 
GT sentence: Admitted to hospital - no

Decoded sentence:

In [42]:
#WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
#print('WER_spell_correction |TRAIN= ', WER_spell_correction)

In [43]:
# Sample output from test data
# Sample output from train data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = test_target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', test_input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)

-
Input sentence: First name — Debra 
GT sentence: First name - Debra

Decoded sentence: First Name:

-
Input sentence: Social Secun'ty Number: 
GT sentence: Social Security Number:

Decoded sentence: Social Security Number:

-
Input sentence: Address Line 1: 
GT sentence: Address Line 1:

Decoded sentence: Address Line 1:

-
Input sentence: Acct #: 
GT sentence: Acct #:

Decoded sentence: Acct ID

-
Input sentence: FACESHEE’T
GT sentence: FACESHEET

Decoded sentence: SPOCIDESIl First Name: David

-
Input sentence: Employer N ﬂl’m: 
GT sentence: Employer Name:

Decoded sentence: Employer Name:

-
Input sentence: Patient 
GT sentence: Patient

Decoded sentence: Patient DOB:

-
Input sentence: Member Ne: 
GT sentence: Member No:

Decoded sentence: Member No:

-
Input sentence: Diagnosis: ICD Coda: 
GT sentence: Diagnosis: ICD Code:

Decoded sentence: Diagnosis

-
Input sentence: 0 Exercises regularly _ 
GT sentence: • Exercises regularly

Decoded sentence: • Family history of Cancer (C80

-
Input sentence: Employer N arm: 
GT sentence: Employer Name:

Decoded sentence: Employer Name:

-
Input sentence: Birth Date: 
GT sentence: Birth Date:

Decoded sentence: Birth Date:

-
Input sentence: Date ﬁrst unable to work (mmreeim 
GT sentence: Date first unable to work (mm/dd/yy)

Decoded sentence: Date of Birth (mm/dd/yy)

-
Input sentence: Electonically Signed Indicator: Yes 
GT sentence: Electronically Signed Indicator: Yes

Decoded sentence: Electronic Submission

-
Input sentence: Employee:
GT sentence: Employee:

Decoded sentence: Employer Name:

-
Input sentence: Submission Date: 03/152018 
GT sentence: Submission Date: 03/15/2018

Decoded sentence: Submission Date: 03/12/2018

-
Input sentence: Signatures 
GT sentence: Signatures

Decoded sentence: Signatures

-
Input sentence: Address Line 2: 
GT sentence: Address Line 2:

Decoded sentence: Address Line 1:

-
Input sentence: Paper Prescription given to patient 
GT sentence: Paper Prescription given to patient

Decoded 

In [44]:
WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

WER_spell_correction |TEST=  0.10154639175257732


# Next steps
- Add attention
- Full attention
- Condition the Encoder on word embeddings of the context (Bi-directional LSTM)
- Condition the Decoder on word embeddings of the context (Bi-directional LSTM) 

# References
- Sequence to Sequence Learning with Neural Networks
    https://arxiv.org/abs/1409.3215
- Learning Phrase Representations using
    RNN Encoder-Decoder for Statistical Machine Translation
    https://arxiv.org/abs/1406.107