# Introduction

We tackle the problem of OCR post-processing. OCR maps the image form of a document into the text domain. This is done first with a CNN+LSTM+CTC model, in our case based on Tesseract. Since this model only maps image to text, we need something on top to validate and correct the language semantics.

The idea is to build a language model that takes the OCRed text and corrects it based on language knowledge. The language model could be:
- Char level: the aim is to capture word morphology, in which case it acts like a spelling correction system.
- Word level: the aim is to capture sentence semantics, but such systems suffer from the out-of-vocabulary (OOV) problem.
- Fusion: the aim is to capture both the semantic and the morphological rules of the language. The output has to be at char level to avoid OOV; the input, however, can be char, word, or both.
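
The OOV trade-off between the word-level and char-level options can be illustrated with a toy vocabulary. This is only a sketch: the sentences and the helper name `coverage` are invented for illustration and are not part of the notebook's data or code.

```python
# Toy illustration of the OOV problem; the sentences below are invented examples.
train_sentences = ["policy number", "holder address"]
test_sentence = "policyholder number"

# Word-level vocabulary: every distinct training word is one token.
word_vocab = {w for s in train_sentences for w in s.split(' ')}
# Char-level vocabulary: every distinct training character is one token.
char_vocab = {c for s in train_sentences for c in s}

def coverage(tokens, vocab):
    """Fraction of tokens that are present in the vocabulary."""
    tokens = list(tokens)
    return sum(t in vocab for t in tokens) / len(tokens)

word_cov = coverage(test_sentence.split(' '), word_vocab)
char_cov = coverage(test_sentence, char_vocab)
print('word-level coverage:', word_cov)
print('char-level coverage:', char_cov)
```

The unseen word "policyholder" falls outside the word vocabulary, while every one of its characters was already seen during training, which is why the output side is kept at char level.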

The target of the fusion model is to learn:

    p(char | char_context, word_context)

In this workbook we use a vanilla Keras seq2seq implementation, adapted from the lstm_seq2seq example for the Eng-Fra translation task. The adaptation involves:

- Adapt to spelling correction at char level
- Pre-train on noisy medical sentences
- Fine-tune a residual to correct the mistakes of Tesseract
- Limit the input and output sequence lengths
- Ensure teacher forcing for the autoregressive decoder
- Limit the padding per batch (TODO)
- Learning rate schedule (TODO)


# Imports

In [1]:
from __future__ import print_function
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras import optimizers
from keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler
import numpy as np
import math
import os
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


# Utility functions

In [2]:
# Limit gpu allocation. allow_growth, or gpu_fraction
def gpu_alloc():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))

In [3]:
gpu_alloc()

In [4]:
def calculate_WER_sent(gt, pred):
    '''
    Word-level edit distance between a ground-truth sentence and a prediction, e.g.
    calculate_WER_sent('calculating wer between two sentences', 'calculate wer between two sentences')
    '''
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    # int32 avoids the overflow that uint8 would hit for distances above 255
    d = np.zeros(((len(gt_words) + 1), (len(pred_words) + 1)), dtype=np.int32)

    # Initializing error matrix
    for i in range(len(gt_words) + 1):
        for j in range(len(pred_words) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(gt_words) + 1):
        for j in range(1, len(pred_words) + 1):
            if gt_words[i - 1] == pred_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    return d[len(gt_words)][len(pred_words)]

In [5]:
def calculate_WER(gt, pred):
    '''

    :param gt: list of sentences of the ground truth
    :param pred: list of sentences of the predictions
    both lists must have the same length
    :return: accumulated WER
    '''
#    assert len(gt) == len(pred)
    WER = 0
    nb_w = 0
    for i in range(len(gt)):
        #print(gt[i])
        #print(pred[i])
        WER += calculate_WER_sent(gt[i], pred[i])
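        # Note: len(gt[i]) is the character count of the sentence, so the
        # accumulated errors are normalized per character, not per word
        nb_w += len(gt[i])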

    return WER / nb_w
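
For a rate normalized by the number of ground-truth words, a minimal sketch could look like the following. The names `word_edit_distance` and `corpus_wer` are hypothetical, and the sketch splits on any whitespace with `.split()`, which is an assumption that differs from the `.split(' ')` used in the cells above.

```python
def word_edit_distance(gt_words, pred_words):
    """Levenshtein distance between two word lists (each edit costs 1)."""
    # prev[j] is the distance between the first i-1 gt words and the first j pred words.
    prev = list(range(len(pred_words) + 1))
    for i in range(1, len(gt_words) + 1):
        curr = [i] + [0] * len(pred_words)
        for j in range(1, len(pred_words) + 1):
            cost = 0 if gt_words[i - 1] == pred_words[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          curr[j - 1] + 1,     # insertion
                          prev[j] + 1)         # deletion
        prev = curr
    return prev[-1]

def corpus_wer(gt_sentences, pred_sentences):
    """Total word errors divided by the total number of ground-truth words."""
    errors = 0
    n_words = 0
    for gt, pred in zip(gt_sentences, pred_sentences):
        gt_words = gt.lower().split()
        pred_words = pred.lower().split()
        errors += word_edit_distance(gt_words, pred_words)
        n_words += len(gt_words)
    return errors / n_words if n_words else 0.0

print(corpus_wer(["Social Security Number:"], ["Social Secun'ty Number:"]))
```

Numbers produced by this variant are not comparable with the values printed later in the notebook, which come from `calculate_WER` as defined above.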

In [6]:
# Artificial noisy spelling mistakes
def noise_maker(sentence, threshold):
    '''Relocate, remove, or add characters to create spelling mistakes'''
    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',]
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0, 1, 1)
        # Most characters will be correct since the threshold value is high
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0, 1, 1)
            # ~33% chance characters will swap locations
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    # If last character in sentence, it will not be typed
                    pass
                else:
                    # if any other character, swap order with following character
                    noisy_sentence.append(sentence[i + 1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            # ~33% chance an extra lower case letter will be added to the sentence
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(random_letter)
                noisy_sentence.append(sentence[i])
            # ~33% chance a character will not be typed
            else:
                pass
        i += 1

    return ''.join(noisy_sentence)
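
When comparing runs it can help to make the noise reproducible. The sketch below is a seeded variant of the same keep/swap/insert/drop process. The name `noisy_copy` and the `seed` parameter are hypothetical, it uses Python's `random.Random` instead of `np.random`, and it treats a swap attempt on the last character as a dropped character.

```python
import random

def noisy_copy(sentence, threshold, seed=None):
    """Seeded sketch of the noise process: keep, swap, insert, or drop characters."""
    rng = random.Random(seed)
    letters = 'abcdefghijklmnopqrstuvwxyz'
    out = []
    i = 0
    while i < len(sentence):
        if rng.random() < threshold:
            out.append(sentence[i])              # keep the character unchanged
        else:
            kind = rng.random()
            if kind > 0.67 and i < len(sentence) - 1:
                out.append(sentence[i + 1])      # swap with the following character
                out.append(sentence[i])
                i += 1
            elif kind < 0.33:
                out.append(rng.choice(letters))  # insert a random lowercase letter
                out.append(sentence[i])
            # otherwise the character is dropped
        i += 1
    return ''.join(out)

print(noisy_copy('Social Security Number:', 0.9, seed=42))
```

With `threshold=1.0` every character is kept, and a lower threshold corrupts a larger share of the characters.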

In [7]:
def load_data_with_gt(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load data from a txt file where each line has: <TXT><TAB><GT>. The target to the decoder must have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    for row in open(file_name, encoding='utf8'):
        if cnt < num_samples :
            #print(row)
            sents = row.split("\t")
            input_text = sents[0]
            
            target_text = '\t' + sents[1] + '\n'
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                cnt += 1
                
                input_texts.append(input_text)
                target_texts.append(target_text)
                gt_texts.append(sents[1])
    return input_texts, target_texts, gt_texts
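
The line format and the length filter can be tried on an in-memory sample. This is a sketch under assumptions rather than the loader itself: `parse_pairs` is a hypothetical helper, it strips the trailing newline of each line before adding the stop marker, and it skips lines without a tab, for which the indexing `sents[1]` above would raise an `IndexError`.

```python
import io

def parse_pairs(lines, min_len, max_len):
    """Split '<OCR text>\\t<ground truth>' lines and keep pairs within the length bounds."""
    pairs = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 2:
            continue  # skip malformed lines without a tab
        ocr, gt = parts[0], parts[1]
        target = '\t' + gt + '\n'  # start and stop markers for the decoder
        if min_len < len(ocr) < max_len and min_len < len(target) < max_len:
            pairs.append((ocr, target))
    return pairs

sample = io.StringIO("C ity.\tCity:\nno tab on this line\nPage 1\tPage 1\n")
print(parse_pairs(sample, min_len=4, max_len=40))
```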

In [8]:
def load_data_with_noise(file_name, num_samples, noise_threshold, max_sent_len, min_sent_len):
    '''Load data from a txt file where each line has: <TXT><TAB><GT>. The input is a noisy version of GT, produced by noise_maker. The target to the decoder must have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    while cnt < num_samples :
        for row in open(file_name, encoding='utf8'):
            if cnt < num_samples :
                sents = row.split("\t")
                input_text = noise_maker(sents[1], noise_threshold)
                input_text = input_text[:-1]

                target_text = '\t' + sents[1] + '\n'            
                if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                    cnt += 1
                    input_texts.append(input_text)
                    target_texts.append(target_text)
                    gt_texts.append(target_text[1:-1])
                    
    return input_texts, target_texts, gt_texts

In [9]:
def build_vocab(all_texts):
    '''Build vocab dictionary to vectorize chars into ints'''
    vocab_to_int = {}
    count = 0
    
    for sentence in all_texts:
        for char in sentence:
            if char not in vocab_to_int:
                vocab_to_int[char] = count
                count += 1
    # Add special tokens to vocab_to_int
    codes = ['\t','\n']
    for code in codes:
        if code not in vocab_to_int:
            vocab_to_int[code] = count
            count += 1
    # Build inverse translation from int to char
    int_to_vocab = {}
    for character, value in vocab_to_int.items():
        int_to_vocab[value] = character
        
    return vocab_to_int, int_to_vocab
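
A quick round trip shows how the two dictionaries are meant to be used together. The sketch below rebuilds a small vocabulary in the same first-seen order; `build_char_vocab` and the two sample strings are stand-ins chosen for illustration.

```python
def build_char_vocab(texts, special=('\t', '\n')):
    """Map each distinct character to an integer id, in first-seen order."""
    char_to_int = {}
    for text in texts:
        for ch in text:
            char_to_int.setdefault(ch, len(char_to_int))
    # Make sure the start and stop markers always have an id.
    for ch in special:
        char_to_int.setdefault(ch, len(char_to_int))
    int_to_char = {i: ch for ch, i in char_to_int.items()}
    return char_to_int, int_to_char

char_to_int, int_to_char = build_char_vocab(['Page 1', 'Phone:'])
encoded = [char_to_int[ch] for ch in '\tPage\n']
decoded = ''.join(int_to_char[i] for i in encoded)
print(encoded)
print(repr(decoded))
```

A character that was not present when the vocabulary was built has no id, so a plain dictionary lookup on it raises a `KeyError`. This matters when vectorizing a file that was not part of `all_texts`.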

In [10]:
def vectorize_data(input_texts, target_texts, max_encoder_seq_length, num_encoder_tokens, vocab_to_int):
    '''Prepares the input text and targets into the proper seq2seq numpy arrays.
    Note: max_decoder_seq_length and num_decoder_tokens are read from the notebook globals.'''
    encoder_input_data = np.zeros(
        (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
        dtype='float32')
    decoder_input_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype='float32')
    decoder_target_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype='float32')

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, char in enumerate(input_text):
            # c0..cn
            encoder_input_data[i, t, vocab_to_int[char]] = 1.
        for t, char in enumerate(target_text):
            # c0'..cm'
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_input_data[i, t, vocab_to_int[char]] = 1.
            if t > 0:
                # decoder_target_data will be ahead by one timestep
                # and will not include the start character.
                decoder_target_data[i, t - 1, vocab_to_int[char]] = 1.
                
    return encoder_input_data, decoder_input_data, decoder_target_data
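
The one-step shift between decoder input and decoder target is what implements teacher forcing. A minimal sketch on plain strings, without the one-hot encoding, makes the alignment visible. `teacher_forcing_pairs` is a hypothetical helper introduced only for this illustration.

```python
def teacher_forcing_pairs(target_text):
    """Return (decoder_input_char, decoder_target_char) for each timestep."""
    # At step t the decoder reads target_text[t] and must predict target_text[t + 1].
    return list(zip(target_text[:-1], target_text[1:]))

pairs = teacher_forcing_pairs('\tNo\n')
for inp, out in pairs:
    print(repr(inp), '->', repr(out))
```

In `vectorize_data` the final `\n` is also written into `decoder_input_data`, but the corresponding target row stays all zeros, like the padded positions.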

In [11]:
def decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, vocab_to_int['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = int_to_vocab[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
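
The control flow of the sampling loop can be checked without a trained model. In the sketch below a fixed lookup table stands in for `decoder_model.predict`; the table, `greedy_decode`, and `next_char_scores` are all hypothetical and only mimic the argmax and stop logic.

```python
def greedy_decode(next_char_scores, start='\t', stop='\n', max_len=40):
    """Greedy autoregressive decoding: feed the best character back in at each step."""
    decoded = ''
    prev = start
    while True:
        scores = next_char_scores(prev)      # dict: candidate char -> score
        best = max(scores, key=scores.get)   # argmax over the candidates
        decoded += best
        # Stop on the stop character or once the output exceeds max_len.
        if best == stop or len(decoded) > max_len:
            break
        prev = best
    return decoded

# Hypothetical stand-in for the decoder: a fixed table of next-character scores.
toy_table = {
    '\t': {'N': 0.9, 'o': 0.1},
    'N': {'o': 0.8, '\n': 0.2},
    'o': {'\n': 0.7, 'N': 0.3},
}
print(repr(greedy_decode(toy_table.__getitem__)))
```

As in `decode_sequence`, the returned string includes the stop character, and a sequence that never emits it is cut off one character after `max_len`.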


# Load data

In [12]:
data_path = '../../dat/'

In [13]:
max_sent_len = 40
min_sent_len = 4

## Results on tesseract correction

In [14]:
num_samples = 1000000
tess_correction_data = os.path.join(data_path, 'all_ocr_data_2.txt')
input_texts_OCR, target_texts_OCR, gt_OCR = load_data_with_gt(tess_correction_data, num_samples, max_sent_len, min_sent_len)

In [15]:
input_texts = input_texts_OCR
target_texts = target_texts_OCR

# Results of pre-training on generic data

In [16]:
'''
num_samples = 0
big_data = os.path.join(data_path, 'big.txt')
threshold = 0.9
input_texts_gen, target_texts_gen, gt_gen = load_data_with_noise(file_name=big_data, 
                                                                 num_samples=num_samples, 
                                                                 noise_threshold=threshold, 
                                                                 max_sent_len=max_sent_len, 
                                                                 min_sent_len=min_sent_len)
'''

In [17]:
#input_texts = input_texts_gen
#target_texts = target_texts_gen

# Results on noisy tesseract corrections

In [18]:
num_samples = 0
tess_correction_data = os.path.join(data_path, 'all_ocr_data_2.txt')
threshold = 0.9
input_texts_noisy_OCR, target_texts_noisy_OCR, gt_noisy_OCR = load_data_with_noise(file_name=tess_correction_data, 
                                                                 num_samples=num_samples, 
                                                                 noise_threshold=threshold, 
                                                                 max_sent_len=max_sent_len, 
                                                                 min_sent_len=min_sent_len)

In [19]:
'''
input_texts = input_texts_noisy_OCR
target_texts = target_texts_noisy_OCR
'''


# Results on merge of tesseract correction + generic data

In [20]:
'''
input_texts = input_texts_OCR + input_texts_gen
target_texts = target_texts_OCR + target_texts_gen
'''

# Results on noisy tesseract correction + generic data

In [21]:
'''
input_texts = input_texts_noisy_OCR + input_texts_gen
target_texts = target_texts_noisy_OCR + target_texts_gen
'''

# Results on noisy tesseract + tesseract correction data

In [22]:
input_texts = input_texts_noisy_OCR + input_texts_OCR
target_texts = target_texts_noisy_OCR + target_texts_OCR

# Results of pre-training on generic and fine tuning on tesseract correction

In [23]:
# TODO

In [24]:
# Sample data
print(len(input_texts))
for i in range(10):
    print(input_texts[i], '\n', target_texts[i])

3154
Pol inyhold elm-Chm er [11 form arlon 
 	Policyholder/Owner Information


First Name: 
 	First Name:


Middle Nameﬂnitial: 
 	Middle Name/Initial:


Last Name: 
 	Last Name:


Social S ecurity Number: 
 	Social Security Number:


Birth Date: 
 	Birth Date:


Gender: 
 	Gender:


Language Preference: 
 	Language Preference:


Address Line 1: 
 	Address Line 1:


StatefPrmince : 
 	State/Province:




## Build vocab

In [25]:
all_texts = target_texts + input_texts
vocab_to_int, int_to_vocab = build_vocab(all_texts)

In [26]:
input_characters = sorted(list(vocab_to_int))
target_characters = sorted(list(vocab_to_int))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

In [27]:
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 3154
Number of unique input tokens: 110
Number of unique output tokens: 110
Max sequence length for inputs: 39
Max sequence length for outputs: 39


# Prepare training data

## Train/test split

In [28]:
# Split the data into training and testing sentences
input_texts, test_input_texts, target_texts, test_target_texts  = train_test_split(input_texts, target_texts, test_size = 0.15, random_state = 42)

## Vectorize data

## Train data

In [29]:
encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)

## Test data

In [30]:
test_encoder_input_data, test_decoder_input_data, test_decoder_target_data = vectorize_data(input_texts=test_input_texts,
                                                                                            target_texts=test_target_texts, 
                                                                                            max_encoder_seq_length=max_encoder_seq_length, 
                                                                                            num_encoder_tokens=num_encoder_tokens, 
                                                                                            vocab_to_int=vocab_to_int)

# Training model

In [31]:
batch_size = 64  # Batch size for training.
epochs = 200  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
lr = 0.01

In [32]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# TODO: Add Embedding for chars
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 110)    0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 110)    0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 375808      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  375808      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
          

# Learning rate decay

In [33]:
model.compile(optimizer=optimizers.Adam(lr=lr), loss='categorical_crossentropy', metrics=['categorical_accuracy'])

In [34]:
#filepath="weights-improvement-{epoch:02d}-{val_categorical_accuracy:.2f}.hdf5"
# Save only the best model under a fixed name, so the inference step always knows which file to load
filepath = "best_model.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_categorical_accuracy', verbose=1, save_best_only=True, mode='max')
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)
callbacks_list = [checkpoint, tbCallBack]
#callbacks_list = [checkpoint, tbCallBack, lrate]



In [35]:
def exp_decay(epoch):
    initial_lrate = 0.1
    k = 0.1
    lrate = initial_lrate * np.exp(-k*epoch)
    return lrate
lrate = LearningRateScheduler(exp_decay)
#lr = 0

In [36]:
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate
lrate = LearningRateScheduler(step_decay)
#lr = 0
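
To see how the two schedules behave before attaching either one to training, the formulas can be evaluated on their own. The sketch below restates them with only the standard `math` module; exposing the constants as keyword arguments is an illustrative choice, not part of the cells above.

```python
import math

def exp_decay(epoch, initial_lrate=0.1, k=0.1):
    """Exponential decay: lr = initial_lrate * exp(-k * epoch)."""
    return initial_lrate * math.exp(-k * epoch)

def step_decay(epoch, initial_lrate=0.1, drop=0.5, epochs_drop=10.0):
    """Step decay: multiply the rate by `drop` every `epochs_drop` epochs."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))

# Print both schedules at a few epochs for comparison.
for epoch in (0, 9, 19, 50):
    print(epoch, round(exp_decay(epoch), 5), round(step_decay(epoch), 5))
```

Both schedules start at 0.1, whereas the optimizer is compiled with `lr = 0.01`. As far as the Keras callback behaves, `LearningRateScheduler` sets the optimizer's rate at the start of each epoch, so enabling it would replace that value. In this notebook the callback is left commented out.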

In [37]:
#callbacks_list.append(lrate)

In [38]:
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          validation_data = ([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data),
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          #validation_split=0.2,
          shuffle=True)

Train on 2680 samples, validate on 474 samples
Epoch 1/200

Epoch 00001: val_categorical_accuracy improved from -inf to 0.10792, saving model to best_model.hdf5
Epoch 2/200



Epoch 00002: val_categorical_accuracy improved from 0.10792 to 0.17705, saving model to best_model.hdf5
Epoch 3/200

Epoch 00003: val_categorical_accuracy improved from 0.17705 to 0.23093, saving model to best_model.hdf5
Epoch 4/200

Epoch 00004: val_categorical_accuracy improved from 0.23093 to 0.26647, saving model to best_model.hdf5
Epoch 5/200

Epoch 00005: val_categorical_accuracy improved from 0.26647 to 0.29438, saving model to best_model.hdf5
Epoch 6/200

Epoch 00006: val_categorical_accuracy improved from 0.29438 to 0.31635, saving model to best_model.hdf5
Epoch 7/200

Epoch 00007: val_categorical_accuracy improved from 0.31635 to 0.33301, saving model to best_model.hdf5
Epoch 8/200

Epoch 00008: val_categorical_accuracy improved from 0.33301 to 0.34945, saving model to best_model.hdf5
Epoch 9/200

Epoch 00009: val_categorical_accuracy improved from 0.34945 to 0.35649, saving model to best_model.hdf5
Epoch 10/200

Epoch 00010: val_categorical_accuracy improved from 0.35649 to


Epoch 00032: val_categorical_accuracy improved from 0.40571 to 0.40674, saving model to best_model.hdf5
Epoch 33/200

Epoch 00033: val_categorical_accuracy improved from 0.40674 to 0.40712, saving model to best_model.hdf5
Epoch 34/200

Epoch 00034: val_categorical_accuracy did not improve from 0.40712
Epoch 35/200

Epoch 00035: val_categorical_accuracy did not improve from 0.40712
Epoch 36/200

Epoch 00036: val_categorical_accuracy did not improve from 0.40712
Epoch 37/200

Epoch 00037: val_categorical_accuracy improved from 0.40712 to 0.40723, saving model to best_model.hdf5
Epoch 38/200

Epoch 00038: val_categorical_accuracy improved from 0.40723 to 0.40739, saving model to best_model.hdf5
Epoch 39/200

Epoch 00039: val_categorical_accuracy improved from 0.40739 to 0.40955, saving model to best_model.hdf5
Epoch 40/200

Epoch 00040: val_categorical_accuracy did not improve from 0.40955
Epoch 41/200

Epoch 00041: val_categorical_accuracy did not improve from 0.40955
Epoch 42/200

Epoc


Epoch 00066: val_categorical_accuracy did not improve from 0.41291
Epoch 67/200

Epoch 00067: val_categorical_accuracy improved from 0.41291 to 0.41496, saving model to best_model.hdf5
Epoch 68/200

Epoch 00068: val_categorical_accuracy improved from 0.41496 to 0.41702, saving model to best_model.hdf5
Epoch 69/200

Epoch 00069: val_categorical_accuracy improved from 0.41702 to 0.41815, saving model to best_model.hdf5
Epoch 70/200

Epoch 00070: val_categorical_accuracy did not improve from 0.41815
Epoch 71/200

Epoch 00071: val_categorical_accuracy improved from 0.41815 to 0.41961, saving model to best_model.hdf5
Epoch 72/200

Epoch 00072: val_categorical_accuracy did not improve from 0.41961
Epoch 73/200

Epoch 00073: val_categorical_accuracy did not improve from 0.41961
Epoch 74/200

Epoch 00074: val_categorical_accuracy did not improve from 0.41961
Epoch 75/200

Epoch 00075: val_categorical_accuracy did not improve from 0.41961
Epoch 76/200

Epoch 00076: val_categorical_accuracy imp


Epoch 00099: val_categorical_accuracy did not improve from 0.42080
Epoch 100/200

Epoch 00100: val_categorical_accuracy did not improve from 0.42080
Epoch 101/200

Epoch 00101: val_categorical_accuracy did not improve from 0.42080
Epoch 102/200

Epoch 00102: val_categorical_accuracy did not improve from 0.42080
Epoch 103/200

Epoch 00103: val_categorical_accuracy did not improve from 0.42080
Epoch 104/200

Epoch 00104: val_categorical_accuracy did not improve from 0.42080
Epoch 105/200

Epoch 00105: val_categorical_accuracy did not improve from 0.42080
Epoch 106/200

Epoch 00106: val_categorical_accuracy did not improve from 0.42080
Epoch 107/200

Epoch 00107: val_categorical_accuracy did not improve from 0.42080
Epoch 108/200

Epoch 00108: val_categorical_accuracy did not improve from 0.42080
Epoch 109/200

Epoch 00109: val_categorical_accuracy did not improve from 0.42080
Epoch 110/200

Epoch 00110: val_categorical_accuracy did not improve from 0.42080
Epoch 111/200

Epoch 00111: va


Epoch 00134: val_categorical_accuracy did not improve from 0.42151
Epoch 135/200

Epoch 00135: val_categorical_accuracy did not improve from 0.42151
Epoch 136/200

Epoch 00136: val_categorical_accuracy did not improve from 0.42151
Epoch 137/200

Epoch 00137: val_categorical_accuracy did not improve from 0.42151
Epoch 138/200

Epoch 00138: val_categorical_accuracy improved from 0.42151 to 0.42167, saving model to best_model.hdf5
Epoch 139/200

Epoch 00139: val_categorical_accuracy improved from 0.42167 to 0.42216, saving model to best_model.hdf5
Epoch 140/200

Epoch 00140: val_categorical_accuracy did not improve from 0.42216
Epoch 141/200

Epoch 00141: val_categorical_accuracy improved from 0.42216 to 0.42254, saving model to best_model.hdf5
Epoch 142/200

Epoch 00142: val_categorical_accuracy did not improve from 0.42254
Epoch 143/200

Epoch 00143: val_categorical_accuracy improved from 0.42254 to 0.42264, saving model to best_model.hdf5
Epoch 144/200

Epoch 00144: val_categorical_ac


Epoch 00167: val_categorical_accuracy did not improve from 0.42319
Epoch 168/200

Epoch 00168: val_categorical_accuracy did not improve from 0.42319
Epoch 169/200

Epoch 00169: val_categorical_accuracy did not improve from 0.42319
Epoch 170/200

Epoch 00170: val_categorical_accuracy did not improve from 0.42319
Epoch 171/200

Epoch 00171: val_categorical_accuracy did not improve from 0.42319
Epoch 172/200

Epoch 00172: val_categorical_accuracy did not improve from 0.42319
Epoch 173/200

Epoch 00173: val_categorical_accuracy did not improve from 0.42319
Epoch 174/200

Epoch 00174: val_categorical_accuracy did not improve from 0.42319
Epoch 175/200

Epoch 00175: val_categorical_accuracy did not improve from 0.42319
Epoch 176/200

Epoch 00176: val_categorical_accuracy did not improve from 0.42319
Epoch 177/200

Epoch 00177: val_categorical_accuracy did not improve from 0.42319
Epoch 178/200

Epoch 00178: val_categorical_accuracy did not improve from 0.42319
Epoch 179/200

Epoch 00179: va

<keras.callbacks.History at 0x7f857dcabc18>

In [39]:
model.load_weights('best_model.hdf5')

# Inference model

In [40]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [41]:
# Sample output from train data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)

-
Input sentence: Accident Work Related: No
GT sentence: Accident Work Related: No

Decoded sentence: Accident Work Related: No

-
Input sentence: Accident Poi' Icy meerI
GT sentence: Accident Policy Number 

Decoded sentence: Accident Work Related: No

-
Input sentence: Phone:
GT sentence: Phone:

Decoded sentence: Phone:

-
Input sentence: Page 1
GT sentence: Page 1

Decoded sentence: Page 1

-
Input sentence: AUTHORIZA ‘ION
GT sentence: AUTHORIZATION NO.

Decoded sentence: AUTHORIZATION NO.

-
Input sentence: un um
GT sentence: Unum

Decoded sentence: Unum

-
Input sentence: C ity.
GT sentence: City:

Decoded sentence: City:

-
Input sentence: Provider Last Name: S ingh Dev
GT sentence: Provider Last Name: Singh Dev

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: CURRENT TOTAL VISIT BALANCE
GT sentence: CURRENT TOTAL VISIT BALANCE

Decoded sentence: CURRENT TOTAL VISIT BALANCE

-
Input sentence: AIQQhQI
GT sentence: Alcohol

Decoded sentence: Alcohol

-
Input sentenc

-
Input sentence: Printed Name
GT sentence: Printed Name

Decoded sentence: Printed Name

-
Input sentence: Eastslde Medical Center "a,“
GT sentence: Eastside Medical Center

Decoded sentence: Eastside Medical Center

-
Input sentence: Mamba: No:
GT sentence: Member No:

Decoded sentence: Member No:

-
Input sentence: Prescriber: Dev, Jasminder Singh
GT sentence: Prescriber: Dev, Jasminder Singh

Decoded sentence: Prescriber: Dev, Jasminder Singh

-
Input sentence: ADDRESS - NUMBER AND STREET:
GT sentence: ADDRESS - NUMBER AND STREET:

Decoded sentence: ADDRESS - NUMBER AND STREET:

-
Input sentence: Procedure/Surgical History:
GT sentence: Procedure/Surgical History:

Decoded sentence: Procedure/Surgical History:

-
Input sentence: Social Secun'ty Number:
GT sentence: Social Security Number:

Decoded sentence: Social Security Number:

-
Input sentence: . Q . ACCIDENT CLAIM FORM
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: ACCIDENT CLAIM FORM

-
Input sentence: Conﬁrmation of Co

In [42]:
#WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
#print('WER_spell_correction |TRAIN= ', WER_spell_correction)

In [43]:
# Sample output from test data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(len(test_input_texts)):
    # Take one sequence (part of the test set)
    # for trying out decoding.

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = test_target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', test_input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)

-
Input sentence: Reason Codel
GT sentence: Reason Code

Decoded sentence: Reason Code

-
Input sentence: [:34 hours/day .
GT sentence: 4 hours/day

Decoded sentence: CURRENT NO

-
Input sentence: Date éigneé
GT sentence: Date Signed

Decoded sentence: Date of Discharge

-
Input sentence: EXPLAIN CODE
GT sentence: EXPLAIN CODE

Decoded sentence: EXPLAIN CODE

-
Input sentence: ANESTHESIOLOGIST:
GT sentence: ANESTHESIOLOGIST:

Decoded sentence: ANESTHESIA: General

-
Input sentence: Patient Na me:
GT sentence: Patient Name:

Decoded sentence: Patient Name:

-
Input sentence: Create Site: Chattanooga
GT sentence: Create Site: Chattanooga

Decoded sentence: Created Name Suefer

-
Input sentence: M2557? 9
GT sentence: M25572 ?

Decoded sentence: M25572 ?

-
Input sentence: Diagnosis Description
GT sentence: Diagnosis Description 

Decoded sentence: Diagnosis Description

-
Input sentence: Visit Date:
GT sentence: Visit Date:

Decoded sentence: Visit Date:

-
Input sentence: Physician autho

-
Input sentence: Page 3 0f 5
GT sentence: Page 3 of 5

Decoded sentence: Page  usence

-
Input sentence: Page 1
GT sentence: Page 1

Decoded sentence: Page 1

-
Input sentence: Assessment:
GT sentence: Assessment:

Decoded sentence: Assessment

-
Input sentence: Address
GT sentence: Address

Decoded sentence: Address

-
Input sentence: Language Preference:
GT sentence: Language Preference:

Decoded sentence: Language Preference:

-
Input sentence: The Beneﬁts Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: Second Employer contact phone #
GT sentence: Second Employer contact phone #

Decoded sentence: See Statement Details on Back

-
Input sentence: Current Problems:
GT sentence: Current Problems:

Decoded sentence: Current Problems:

-
Input sentence: Mini/192168.11
GT sentence: http://192.168.11

Decoded sentence: Middle: No

-
Input sentence: Date First Unable to Work (mmlddlyy) 
GT sentence: Date First Unable to Work (mm/dd/yy)

De

In [44]:
WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

WER_spell_correction |TEST=  0.0808136338648
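
The `calculate_WER` helper used above is defined elsewhere in the notebook and is not shown in this section. As a rough illustration of what such a metric measures, here is a minimal sketch of a corpus-level word error rate built on word-level edit distance. The names `word_edit_distance` and `corpus_wer` are hypothetical, and the notebook's own helper may tokenize or normalize differently.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word edits divided by total reference words (assumed definition)."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        total_errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    return float(total_errors) / total_words if total_words else 0.0


# Two pairs taken from the sample output above.
print(corpus_wer(['Reason Code', 'Page 3 of 5'],
                 ['Reason Code', 'Page  usence']))
```

Under this definition a sentence that is decoded exactly contributes zero errors, so a few badly hallucinated lines can dominate the score.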


In [None]:
WER_OCR = calculate_WER(target_texts_, test_input_texts)
print('WER_OCR |TEST= ', WER_OCR)

# Test on a separate Tesseract correction file

In [45]:
num_samples = 10000
tess_correction_data = os.path.join(data_path, 'new_trained_data.txt')
input_texts_OCR, target_texts_OCR, gt_OCR = load_data_with_gt(tess_correction_data, num_samples, max_sent_len, min_sent_len)

input_texts = input_texts_OCR
target_texts = target_texts_OCR

encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)

# Sample output from the Tesseract correction data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(len(input_texts)):
    # Take one sequence from the Tesseract correction data
    # and try decoding it.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)
    
WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

-
Input sentence: Me dieal Provider Roles: Treating 
GT sentence: Medical Provider Roles: Treating

Decoded sentence: Medical Provider Roles: Primary Care

-
Input sentence: Provider First Name: Christine 
GT sentence: Provider First Name: Christine

Decoded sentence: Provider First Name: Jason

-
Input sentence: Provider Last Name: Nolen, MD 
GT sentence: Provider Last Name: Nolen, MD

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: Address Line 1 : 7 25 American Avenue 
GT sentence: Address Line 1 : 725 American Avenue

Decoded sentence: Address Line 1 (CPT-7320 bymm Mariah Lar
-
Input sentence: City. W’aukesha 
GT sentence: City: Waukesha

Decoded sentence: City: Lewiston

-
Input sentence: StatefProvinee: ‘WI 
GT sentence: State/Province: WI

Decoded sentence: State/Province :

-
Input sentence: Postal Code: 5 31 88 
GT sentence: Postal Code: 53188

Decoded sentence: Postal Codes: 49202

-
Input sentence: Country". US 
GT sentence: Country:  US

Decoded sentence: Cou

-
Input sentence: . . O The Benefits Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: (Not for FMLA Requests) 
GT sentence: (Not for FMLA Requests)

Decoded sentence: (Not for FMLA Requests)

-
Input sentence: 03/14/2018 Date Signed 
GT sentence: 03/14/2018 Date Signed

Decoded sentence: 05/09/1980 m El0.2 Rigitare

-
Input sentence: Printed Name 
GT sentence: Printed Name

Decoded sentence: Printed Name

-
Input sentence: Seeial Security Number 
GT sentence: Social Security Number

Decoded sentence: Social Security Number

-
Input sentence: CL-1116 ( 
GT sentence: CL-1116

Decoded sentence: CL-1116

-
Input sentence: Daytime Phone: 
GT sentence: Daytime Phone:

Decoded sentence: Daytime Phone:

-
Input sentence: Dependent Information 
GT sentence: Dependent Information

Decoded sentence: Dependent Information

-
Input sentence: First Name: 
GT sentence: First Name:

Decoded sentence: First Name:

-
Input sentence: Middle Nameﬂnitial: 


-
Input sentence: History of Present illness 
GT sentence: History of Present Illness

Decoded sentence: History of Present Illness

-
Input sentence: Review of Systems 
GT sentence: Review of Systems

Decoded sentence: Review of Systems

-
Input sentence: General: no constitutional symptoms. 
GT sentence: General: no constitutional symptoms.

Decoded sentence: General: no constitutional symptoms.

-
Input sentence: Skin no skin symptoms. 
GT sentence: Skin no skin symptoms.

Decoded sentence: Skin no skin symptoms.

-
Input sentence: Endocrine: no endocrine symptoms. 
GT sentence: Endocrine: no endocrine symptoms.

Decoded sentence: Endocrine: no endocrine symptoms.

-
Input sentence: Eyes: glasseslcontact. 
GT sentence: Eyes: glasses/contact.

Decoded sentence: Eyes: glasses/contact.

-
Input sentence: Active Problems 
GT sentence: Active Problems

Decoded sentence: Active Problems

-
Input sentence: 1. Knee injury (889.90XA) 
GT sentence: 1. Knee injury (S89.90XA)

Decoded sentence:

-
Input sentence: Physical Exam 
GT sentence: Physical Exam

Decoded sentence: Physical Exam

-
Input sentence: Musculoskeletal - 
GT sentence: Musculoskeletal -

Decoded sentence: Musculoskele arm.

-
Input sentence: Right Knee: 
GT sentence: Right Knee:

Decoded sentence: Right knee:

-
Input sentence: Resultleata 
GT sentence: Results/Data

Decoded sentence: Results/Data

-
Input sentence: Diagnosis 
GT sentence: Diagnosis

Decoded sentence: Diagnosis

-
Input sentence: Allergies 
GT sentence: Allergies

Decoded sentence: Allergies

-
Input sentence: 1._No Known Aliergies 
GT sentence: 1. No Known Allergies

Decoded sentence: 1. ACL teanos Information

-
Input sentence: Physical Exam 
GT sentence: Physical Exam

Decoded sentence: Physical Exam

-
Input sentence: Diagnosis 
GT sentence: Diagnosis

Decoded sentence: Diagnosis

-
Input sentence: Plan 
GT sentence: Plan

Decoded sentence: Plan:

-
Input sentence: DiscussionlSummary 
GT sentence: Discussion/Summary

Decoded sentence: Dis

-
Input sentence: Address line 2 — 
GT sentence: Address line 2 -

Decoded sentence: Address Line 2: 

-
Input sentence: City - 
GT sentence: City -

Decoded sentence: City -

-
Input sentence: State - NC 
GT sentence: State - NC

Decoded sentence: State Plan: No

-
Input sentence: Speciality — PCP 
GT sentence: Speciality - PCP

Decoded sentence: Speciality - PCP

-
Input sentence: Add another doctor — no 
GT sentence: Add another doctor - no

Decoded sentence: Add another doctor - no

-
Input sentence: Physician authorization - mail 
GT sentence: Physician authorization - mail

Decoded sentence: Physician Signature Date Signed

-
Input sentence: Home Email — 
GT sentence: Home Email -

Decoded sentence: Home Email -

-
Input sentence: Register for Claim Self Service — no 
GT sentence: Register for Claim Self Service - no

Decoded sentence: Register for Claim Self Service - no

-
Input sentence: Health insurance provider — bcbs 
GT sentence: Health insurance provider - bcbs

Decoded s

-
Input sentence: Job Title: General Production 
GT sentence: Job Title: General Production

Decoded sentence: Job Title: General Production

-
Input sentence: Work Phone: 
GT sentence: Work Phone:

Decoded sentence: Work waist to: 

-
Input sentence: Primary Phon 
GT sentence: Primary Phone:

Decoded sentence: Primary Phone:

-
Input sentence: Other Phone: 
GT sentence: Other Phone:

Decoded sentence: Other Phone:

-
Input sentence: Claimant Addresses 
GT sentence: Claimant Addresses

Decoded sentence: Claim Event Identifier:

-
Input sentence: Primary Address: 
GT sentence: Primary Address:

Decoded sentence: Primary Address:

-
Input sentence: Address Line 1: E 
GT sentence: Address Line 1:

Decoded sentence: Address Line 1 :

-
Input sentence: Address Line 2: 
GT sentence: Address Line 2:

Decoded sentence: Address Line 2: 

-
Input sentence: City: Lewiston 
GT sentence: City: Lewiston

Decoded sentence: City: Lewiston

-
Input sentence: State: 
GT sentence: State:

Decoded sentenc

-
Input sentence: ‘ Cardholder name: 
GT sentence: Cardholder name:

Decoded sentence: Cardholder name:

-
Input sentence: Transaction identiﬁer: 
GT sentence: Transaction identifier: 

Decoded sentence: Transaction identifier:

-
Input sentence: Patient identiﬁer: 
GT sentence: Patient identifier: 

Decoded sentence: Patient identifier:

-
Input sentence: Subtotal: 
GT sentence: Subtotal:

Decoded sentence: Subtotal:

-
Input sentence: Sales Tax: 
GT sentence: Sales Tax:

Decoded sentence: Sales Tax:

-
Input sentence: Total: 
GT sentence: Total:

Decoded sentence: Total:

-
Input sentence: [customer copy) 
GT sentence: (customer copy) 

Decoded sentence: (customer copy)

-
Input sentence: ORTHOA?LANTA, L.L.C. 
GT sentence: ORTHOATLANTA, L.L.C.

Decoded sentence: ORTHOATLANTA, L.L.C.

-
Input sentence: please send payments to: 
GT sentence: please send payments to: 

Decoded sentence: please send payments to: 

-
Input sentence: ORTHOATLANTA, LLC 
GT sentence: ORTHOATLANTA, L.L.C.

De

-
Input sentence: Superwslng :Eroyider 
GT sentence: Supervising Provider

Decoded sentence: Supervising Provider

-
Input sentence: Ease? For  Copay
GT sentence: Reason For Payment Copay

Decoded sentence: Reason For Payment Copay

-
Input sentence: Method o'f Payment
GT sentence: Method of Payment 

Decoded sentence: Method of Payment

-
Input sentence: Ar'nount
GT sentence: Amount

Decoded sentence: Amount

-
Input sentence: Total Payment Amount 
GT sentence: Total Payment Amount

Decoded sentence: Total Payment Amount

-
Input sentence: unum" 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: . O I ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: ACCIDENT CLAIM FORM

-
Input sentence: The Beneﬁts Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: My Spouse: f"— 
GT sentence: My Spouse: 

Decoded sentence: My Spouse: 

-
Input sentence: (Name) ._ (Telephone Number) 
GT sentence: (Name)  (Telephone Num

-
Input sentence: Business Telephone: (952) 512- 5625 
GT sentence: Business Telephone: (952) 512- 5625

Decoded sentence: Business Telephone Number 

-
Input sentence: Date ofl-‘irst Visit: 01/212018 
GT sentence: Date of First Visit: 01/21/2018

Decoded sentence: Date of First Visit: F/m

-
Input sentence: Date ofNeXt Visit: 03/132018 
GT sentence: Date of Next Visit: 03/13/2018

Decoded sentence: Date of Next Visit (mm/dd/yy)

-
Input sentence: Address Line 1: 1000 W 140th St #102 
GT sentence: Address Line 1: 1000 W 140th St #102

Decoded sentence: Address Line 1 (CPT-7320.1 cALLE STATTHO
-
Input sentence: City. Bm‘nsville 
GT sentence: City: Burnsville

Decoded sentence: City State Zip

-
Input sentence: Policg'h old 91': Owner Information 
GT sentence: Policyholder/Owner Information

Decoded sentence: Policyholder/Owner Information

-
Input sentence: First Name: 
GT sentence: First Name:

Decoded sentence: First Name:

-
Input sentence: Last Name: 
GT sentence: Last Name:

Decode

-
Input sentence: SII l‘g er)’ Information 
GT sentence: Surgery Information

Decoded sentence: Surgery Information

-
Input sentence: Is Surgery Required: Yes
GT sentence: Is Surgery Required: Yes

Decoded sentence: Is Surgery Required: No

-
Input sentence: Surgery Date: 0210252018
GT sentence: Surgery Date: 02/02/2018

Decoded sentence: Surgery Information

-
Input sentence: U n U m‘ 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: O C . ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: ACCIDENT CLAIM FORM

-
Input sentence: The Benefits Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: Facility Name 
GT sentence: Facility Name

Decoded sentence: Facility Name

-
Input sentence: Address 
GT sentence: Address

Decoded sentence: Address

-
Input sentence: City State Zip 
GT sentence: City State Zip

Decoded sentence: City State Zip

-
Input sentence: Date Surge Performed (mmlddlyy): 
GT sentence: Date

-
Input sentence: Office. Crane Campaszte: - Florence
GT sentence: Office: Crane Composites - Florence

Decoded sentence: Office: Crane Composites - Florence

-
Input sentence: Last Earn Change: Ill/2010
GT sentence: Last Earn Change: 1/1/2010

Decoded sentence: Last name - Stokley details

-
Input sentence: Euuoxd Lvddud, SIHIEOIB i2.00.05 ?M
GT sentence: Record Loaded, 3/8/2018 12.00.00 PM

Decoded sentence: Record Loaded, MDICASTINE,

-
Input sentence: Address
GT sentence: Address

Decoded sentence: Address

-
Input sentence: ?r4maly ﬁcﬁxdcacc:
GT sentence: Primary Residence:

Decoded sentence: Primary Residence:

-
Input sentence: Businsss Physica}:
GT sentence: Business Physical:

Decoded sentence: Business Physical:

-
Input sentence: Access
GT sentence: Access

Decoded sentence: Accession type

-
Input sentence: Hone Telephone:
GT sentence: Home Telephone:

Decoded sentence: Home Telephone:

-
Input sentence: SugarVJaor affmqa Email:
GT sentence: Supervisor Office Email:

Decode

-
Input sentence: DATE CIF OPERATION: ﬂ3f16f2018 
GT sentence: DATE OF OPERATION: 03/16/2018

Decoded sentence: DATE OF OPERATION: Moore, J Alan, MD

-
Input sentence: SURGEGN: John J. Larkin, M.D. 
GT sentence: SURGEON: John J. Larkin, M.D.

Decoded sentence: SURGEON: John J. Larkin, M.D.

-
Input sentence: RNESTHESIA: General. 
GT sentence: ANESTHESIA: General.

Decoded sentence: ANESTHESIA: General.

-
Input sentence: SEE". E LIZAE E'E'H
GT sentence: ST. ELIZABETH

Decoded sentence: ST. ELIZABETH

-
Input sentence: Edgewood
GT sentence: EDGEWOOD

Decoded sentence: EDGEWOOD

-
Input sentence: FACESHEE’T
GT sentence: FACESHEET

Decoded sentence: FACESHEET

-
Input sentence: MRN: Doe Sex:
GT sentence: MRN: DOB Sex:

Decoded sentence: MRN: DOB Sex:

-
Input sentence: Patient Demographics .
GT sentence: Patient Demographics

Decoded sentence: Patient Demographics

-
Input sentence: Patient ID
GT sentence: Patient ID

Decoded sentence: Patient ID

-
Input sentence: SSN xxxexxvmocx 
GT sen

-
Input sentence: Employee Coverage; Yes 
GT sentence: Employee Coverage: Yes

Decoded sentence: Employee Coverage: Yes

-
Input sentence: Employﬁr Coverage: YER 
GT sentence: Employer Coverage: Yes

Decoded sentence: Employer Coverage: Yes

-
Input sentence: Policy No.: 
GT sentence: Policy No.:

Decoded sentence: Policy Number:

-
Input sentence: DiVision: 
GT sentence: Division:

Decoded sentence: Division:

-
Input sentence: PEG: 
GT sentence: PEG:

Decoded sentence: Page 2

-
Input sentence: Cholce. 
GT sentence: Choice:

Decoded sentence: Choice:

-
Input sentence: Eff DatE' 
GT sentence: Eff Date:

Decoded sentence: Eff Date:

-
Input sentence: Tarn. Date: 
GT sentence: Term Date:

Decoded sentence: Term Date:

-
Input sentence: Plan Lnxuiugb 
GT sentence: Plan Earnings:

Decoded sentence: Plan Earnings:

-
Input sentence: 1151-13-13:]! Typu: 
GT sentence: Earnings Type:

Decoded sentence: Earnings Type:

-
Input sentence: Earnlngs Mode: 
GT sentence: Earnings Mode:

Decoded sen

-
Input sentence: 15 Surgery Required: 
GT sentence: Is Surgery Required: No

Decoded sentence: Is Surgery Required: No

-
Input sentence: unqu 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: O . O The Benefits Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: (Not for FMLA Requests) 
GT sentence: (Not for FMLA Requests)

Decoded sentence: (Not for FMLA Requests)

-
Input sentence: 03/15/2018 Date Signed 
GT sentence: 03/15/2018 Date Signed

Decoded sentence: 05/09/1980 m Emprorid

-
Input sentence: Printed Name 
GT sentence: Printed Name

Decoded sentence: Printed Name

-
Input sentence: Social Security Number 
GT sentence: Social Security Number

Decoded sentence: Social Security Number:

-
Input sentence: CL-1116 (11114) 
GT sentence: CL-1116 (11/14)

Decoded sentence: CL-1116 (11/14)

-
Input sentence: Encounter Date: 02! 1 2/20 1: 
GT sentence: Encounter Date: 02/12/2018

Decoded sentence: Encounter Details Panel

-
In

-
Input sentence: Patient Name: 
GT sentence: Patient Name:

Decoded sentence: Patient Name:

-
Input sentence: Patient DOB: 
GT sentence: Patient DOB:

Decoded sentence: Patient Demograng

-
Input sentence: Date ofVisit: February II 2018 
GT sentence: Date of Visit: February 11 2018

Decoded sentence: Date of surgery (lest (SC/dec

-
Input sentence: Seen By: Vijay Patel, MD 
GT sentence: Seen By: Vijay Patel, MD

Decoded sentence: Seen By: Vijay Patel, MD

-
Input sentence: 1325 North West Avenue 
GT sentence: 1325 North West Avenue

Decoded sentence: Location : WCINYP - Starr 

-
Input sentence: Jackson: MI 49202—2050 
GT sentence: Jackson, MI 49202-2050

Decoded sentence: Jacob S. Heydemann MD PA

-
Input sentence: Policy Holder: 
GT sentence: Policy Holder:

Decoded sentence: Policy Number:

-
Input sentence: DOB: 
GT sentence: DOB:

Decoded sentence: DOB:

-
Input sentence: Effective Date: 
GT sentence: Effective Date:

Decoded sentence: Effective Date:

-
Input sentence: Sex: 
GT

-
Input sentence: Member Responsibility: 
GT sentence: Member Responsibility: $ 35.00

Decoded sentence: Member Responsibility:

-
Input sentence: Service Details for This Claim 
GT sentence: Service Details for This Claim

Decoded sentence: Services Description MRI

-
Input sentence: SERVICE LINE # 
GT sentence: SERVICE LINE #

Decoded sentence: SERVICE LINE #

-
Input sentence: DATES] OF SERVICE 
GT sentence: DATE(S) OF SERVICE

Decoded sentence: DATE(S) OF SERVICE

-
Input sentence: AUTHORIZATION NO. 
GT sentence: AUTHORIZATION NO.

Decoded sentence: AUTHORIZATION NO.

-
Input sentence: PROCEDURE NOJREVENUE CODE 
GT sentence: PROCEDURE NO./REVENUE CODE

Decoded sentence: PROCEDURE NO./REVENUE CODE

-
Input sentence: PROCEDURE MODJFIER 
GT sentence: PROCEDURE MODIFIER

Decoded sentence: PROCEDURE NO./REVENUE CODE

-
Input sentence: DIAGNOSlS CODE 
GT sentence: DIAGNOSIS CODE

Decoded sentence: DIAGNOSIS CODE

-
Input sentence: EXPLAIN CODE 
GT sentence: EXPLAIN CODE

Decoded sentence

-
Input sentence: State-Proxince: 
GT sentence: State/Province:

Decoded sentence: State/Province:

-
Input sentence: Postal Code: 
GT sentence: Postal Code:

Decoded sentence: Postal Code: 

-
Input sentence: Country. 
GT sentence: Country:

Decoded sentence: Country:

-
Input sentence: Business Telephone: 
GT sentence: Business Telephone:

Decoded sentence: Business Telephone:

-
Input sentence: Date ofFirst Visit: 0211952018 
GT sentence: Date of First Visit: 02/19/2018

Decoded sentence: Date of First Visit: F/C

-
Input sentence: Date ofNeXt Visit; 03/120018 
GT sentence: Date of Next Visit: 03/12/2018

Decoded sentence: Date of Next Visit (mm/dd/yy)

-
Input sentence: Emplqvm am In formation 
GT sentence: Employment Information

Decoded sentence: Employment Information

-
Input sentence: Flrployer Name: 
GT sentence: Employer Name:

Decoded sentence: Employer Name:

-
Input sentence: Electronic Sn bmil. sion 
GT sentence: Electronic Submission

Decoded sentence: Electronic Submis

# Domain transfer from noisy spelling mistakes to OCR corrections

## Pre-train on noisy spelling mistakes

In [46]:

input_texts = input_texts_noisy_OCR
target_texts = target_texts_noisy_OCR

encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)

batch_size = 64  # Batch size for training.
epochs = 2  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
lr = 0.01

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer=optimizers.Adam(lr=lr), loss='categorical_crossentropy', metrics=['categorical_accuracy'])


model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          #validation_data = ([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data),
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          validation_split=0.2,
          shuffle=True)

ValueError: Graph disconnected: cannot obtain value for tensor Tensor("input_3:0", shape=(?, 256), dtype=float32) at layer "input_3". The following previous layers were accessed without issue: ['input_2']

## Fine tune on OCR correction data

In [None]:

input_texts = input_texts_OCR
target_texts = target_texts_OCR

# Keep test data from the corrected OCR, as this is what we care about
input_texts, test_input_texts, target_texts, test_target_texts  = train_test_split(input_texts, target_texts, test_size = 0.15, random_state = 42)

# Vectorize train data
encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)
# Vectorize test data
test_encoder_input_data, test_decoder_input_data, test_decoder_target_data = vectorize_data(input_texts=test_input_texts,
                                                                                            target_texts=test_target_texts, 
                                                                                            max_encoder_seq_length=max_encoder_seq_length, 
                                                                                            num_encoder_tokens=num_encoder_tokens, 
                                                                                            vocab_to_int=vocab_to_int)


batch_size = 64  # Batch size for training.
epochs = 2  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
lr = 0.001  # Reduce the learning rate for fine-tuning

model.compile(optimizer=optimizers.Adam(lr=lr), loss='categorical_crossentropy', metrics=['categorical_accuracy'])


model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          validation_data = ([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data),
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          #validation_split=0.2,
          shuffle=True)
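
The cell above lowers the learning rate by hand between pre-training and fine-tuning, and the introduction lists a learning rate schedule as a TODO. One possible shape for that schedule is step decay. The sketch below is only an assumption about how it could look: the factory name `make_step_decay`, the halving factor and the two-epoch interval are all hypothetical choices, not values from the notebook.

```python
def make_step_decay(initial_lr=0.001, drop=0.5, epochs_per_drop=2):
    """Build a schedule that multiplies the rate by `drop` every `epochs_per_drop` epochs."""
    def schedule(epoch, current_lr=None):
        # `current_lr` is accepted because some Keras versions pass it,
        # but this sketch ignores it and depends on the epoch index only.
        return initial_lr * (drop ** (epoch // epochs_per_drop))
    return schedule


schedule = make_step_decay()
for epoch in range(6):
    print(epoch, schedule(epoch))
```

A function like `schedule` could then be wrapped in the already imported `LearningRateScheduler` callback and appended to `callbacks_list`, assuming that list is a plain Python list of Keras callbacks.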

In [None]:

# Sample output from test data
decoded_sentences = []
target_texts_ =  []

for seq_index in range(len(test_input_texts)):
    # Take one sequence from the test set
    # and try decoding it.

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = test_target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', test_input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)
    
WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

# Next steps
- Add attention
- Full attention
- Condition the Encoder on word embeddings of the context (Bi-directional LSTM)
- Condition the Decoder on word embeddings of the context (Bi-directional LSTM) 
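
To make the first item concrete, here is a minimal NumPy sketch of dot-product attention for a single decoder step. It is not wired into the Keras model above. The function name `dot_product_attention` and the assumed shapes (one decoder state vector, one encoder output per input character) are illustrative assumptions only.

```python
import numpy as np


def dot_product_attention(decoder_state, encoder_outputs):
    """Attend over encoder outputs for one decoder step.

    decoder_state:   array of shape (latent_dim,)
    encoder_outputs: array of shape (timesteps, latent_dim)
    Returns the context vector (latent_dim,) and the weights (timesteps,).
    """
    scores = encoder_outputs.dot(decoder_state)  # one score per input step
    scores = scores - scores.max()               # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()            # softmax over input steps
    context = weights.dot(encoder_outputs)       # weighted sum of encoder outputs
    return context, weights


encoder_outputs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
context, weights = dot_product_attention(np.array([1.0, 0.0]), encoder_outputs)
print(weights, context)
```

In a character-level corrector, the context vector would typically be combined with the decoder state before the output softmax, so each output character can look back at the relevant input characters instead of relying on the final encoder state alone.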

# References
- Sequence to Sequence Learning with Neural Networks
    https://arxiv.org/abs/1409.3215
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
    https://arxiv.org/abs/1406.107