# Introduction

We tackle the problem of OCR post processing. In OCR, we map the image form of the document into the text domain. This is done first using an CNN+LSTM+CTC model, in our case based on tesseract. Since this output maps only image to text, we need something on top to validate and correct language semantics.

The idea is to build a language model, that takes the OCRed text and corrects it based on language knowledge. The langauge model could be:
- Char level: the aim is to capture the word morphology. In which case it's like a spelling correction system.
- Word level: the aim is to capture the sentence semnatics. But such systems suffer from the OOV problem.
- Fusion: to capture semantics and morphology language rules. The output has to be at char level, to avoid the OOV. However, the input can be char, word or both.

The fusion model target is to learn:

    p(char | char_context, word_context)

In this workbook we use seq2seq vanilla Keras implementation, adapted from the lstm_seq2seq example on Eng-Fra translation task. The adaptation involves:

- Adapt to spelling correction, on char level
- Pre-train on a noisy, medical sentences
- Fine tune a residual, to correct the mistakes of tesseract 
- Limit the input and output sequence lengths
- Enusre teacher forcing auto regressive model in the decoder
- Limit the padding per batch (TODO)
- Learning rate schedule (TODO)


# Imports

In [1]:
from __future__ import print_function
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras import optimizers
from keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler
import numpy as np
import os
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


# Utility functions

In [2]:
# Limit gpu allocation. allow_growth, or gpu_fraction
def gpu_alloc():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))

In [3]:
gpu_alloc()

In [4]:
def calculate_WER_sent(gt, pred):
    '''
    calculate_WER('calculating wer between two sentences', 'calculate wer between two sentences')
    '''
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    d = np.zeros(((len(gt_words) + 1), (len(pred_words) + 1)), dtype=np.uint8)
    # d = d.reshape((len(gt_words)+1, len(pred_words)+1))

    # Initializing error matrix
    for i in range(len(gt_words) + 1):
        for j in range(len(pred_words) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(gt_words) + 1):
        for j in range(1, len(pred_words) + 1):
            if gt_words[i - 1] == pred_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    return d[len(gt_words)][len(pred_words)]

In [5]:
def calculate_WER(gt, pred):
    '''

    :param gt: list of sentences of the ground truth
    :param pred: list of sentences of the predictions
    both lists must have the same length
    :return: accumulated WER
    '''
#    assert len(gt) == len(pred)
    WER = 0
    nb_w = 0
    for i in range(len(gt)):
        #print(gt[i])
        #print(pred[i])
        WER += calculate_WER_sent(gt[i], pred[i])
        nb_w += len(gt[i])

    return WER / nb_w

In [6]:
# Artificial noisy spelling mistakes
def noise_maker(sentence, threshold):
    '''Relocate, remove, or add characters to create spelling mistakes'''
    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z',]
    noisy_sentence = []
    i = 0
    while i < len(sentence):
        random = np.random.uniform(0, 1, 1)
        # Most characters will be correct since the threshold value is high
        if random < threshold:
            noisy_sentence.append(sentence[i])
        else:
            new_random = np.random.uniform(0, 1, 1)
            # ~33% chance characters will swap locations
            if new_random > 0.67:
                if i == (len(sentence) - 1):
                    # If last character in sentence, it will not be typed
                    continue
                else:
                    # if any other character, swap order with following character
                    noisy_sentence.append(sentence[i + 1])
                    noisy_sentence.append(sentence[i])
                    i += 1
            # ~33% chance an extra lower case letter will be added to the sentence
            elif new_random < 0.33:
                random_letter = np.random.choice(letters, 1)[0]
                noisy_sentence.append(random_letter)
                noisy_sentence.append(sentence[i])
            # ~33% chance a character will not be typed
            else:
                pass
        i += 1

    return ''.join(noisy_sentence)

In [7]:
def load_data_with_gt(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load data from txt file, with each line has: <TXT><TAB><GT>. The  target to the decoder muxt have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    for row in open(file_name):
        if cnt < num_samples :
            #print(row)
            sents = row.split("\t")
            input_text = sents[0]
            
            target_text = '\t' + sents[1] + '\n'
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                cnt += 1
                
                input_texts.append(input_text)
                target_texts.append(target_text)
                gt_texts.append(sents[1])
    return input_texts, target_texts, gt_texts

In [8]:
def load_data_with_noise(file_name, num_samples, noise_threshold, max_sent_len, min_sent_len):
    '''Load data from txt file, with each line has: <TXT>. The GT is just a noisy version of TXT. The  target to the decoder muxt have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    while cnt < num_samples :
        for row in open(file_name):
            if cnt < num_samples :
                sents = row.split("\t")
                input_text = noise_maker(sents[1], noise_threshold)
                input_text = input_text[:-1]

                target_text = '\t' + sents[1] + '\n'            
                if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                    cnt += 1
                    input_texts.append(input_text)
                    target_texts.append(target_text)
                    gt_texts.append(target_text[1:-1])
                    
    return input_texts, target_texts, gt_texts

In [9]:
def build_vocab(all_texts):
    '''Build vocab dictionary to victorize chars into ints'''
    vocab_to_int = {}
    count = 0
    
    for sentence in all_texts:
        for char in sentence:
            if char not in vocab_to_int:
                vocab_to_int[char] = count
                count += 1
    # Add special tokens to vocab_to_int
    codes = ['\t','\n']
    for code in codes:
        if code not in vocab_to_int:
            vocab_to_int[code] = count
            count += 1
    '''''Build inverse translation from int to char'''
    int_to_vocab = {}
    for character, value in vocab_to_int.items():
        int_to_vocab[value] = character
        
    return vocab_to_int, int_to_vocab

In [10]:
def vectorize_data(input_texts, target_texts, max_encoder_seq_length, num_encoder_tokens, vocab_to_int):
    '''Prepares the input text and targets into the proper seq2seq numpy arrays'''
    encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
    decoder_input_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype='float32')
    decoder_target_data = np.zeros(
        (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
        dtype='float32')

    for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
        for t, char in enumerate(input_text):
            # c0..cn
            encoder_input_data[i, t, vocab_to_int[char]] = 1.
        for t, char in enumerate(target_text):
            # c0'..cm'
            # decoder_target_data is ahead of decoder_input_data by one timestep
            decoder_input_data[i, t, vocab_to_int[char]] = 1.
            if t > 0:
                # decoder_target_data will be ahead by one timestep
                # and will not include the start character.
                decoder_target_data[i, t - 1, vocab_to_int[char]] = 1.
                
    return encoder_input_data, decoder_input_data, decoder_target_data

In [11]:
def decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, vocab_to_int['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = int_to_vocab[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


# Load data

In [12]:
data_path = '../../dat/'

In [13]:
max_sent_len = 100
min_sent_len = 4

## Results on tesseract correction

In [14]:
num_samples = 10000
tess_correction_data = os.path.join(data_path, 'all_ocr_data.txt')
input_texts_OCR, target_texts_OCR, gt_OCR = load_data_with_gt(tess_correction_data, num_samples, max_sent_len, min_sent_len)

In [15]:
input_texts = input_texts_OCR
target_texts = target_texts_OCR

# Results of pre-training on generic data

In [16]:
'''
num_samples = 0
big_data = os.path.join(data_path, 'big.txt')
threshold = 0.9
input_texts_gen, target_texts_gen, gt_gen = load_data_with_noise(file_name=big_data, 
                                                                 num_samples=num_samples, 
                                                                 noise_threshold=threshold, 
                                                                 max_sent_len=max_sent_len, 
                                                                 min_sent_len=min_sent_len)
'''                                                                 

"\nnum_samples = 0\nbig_data = os.path.join(data_path, 'big.txt')\nthreshold = 0.9\ninput_texts_gen, target_texts_gen, gt_gen = load_data_with_noise(file_name=big_data, \n                                                                 num_samples=num_samples, \n                                                                 noise_threshold=threshold, \n                                                                 max_sent_len=max_sent_len, \n                                                                 min_sent_len=min_sent_len)\n"

In [17]:
#input_texts = input_texs_gen
#target_texts = target_texts_gen

# Results on noisy tesseract corrections

In [18]:
num_samples = 10000
tess_correction_data = os.path.join(data_path, 'new_trained_data.txt')
threshold = 0.9
input_texts_noisy_OCR, target_texts_noisy_OCR, gt_noisy_OCR = load_data_with_noise(file_name=tess_correction_data, 
                                                                 num_samples=num_samples, 
                                                                 noise_threshold=threshold, 
                                                                 max_sent_len=max_sent_len, 
                                                                 min_sent_len=min_sent_len)

In [19]:
'''
input_texts = input_texts_noisy_OCR
target_texts = target_texts_noisy_OCR
'''

'\ninput_texts = input_texts_noisy_OCR\ntarget_texts = target_texts_noisy_OCR\n'

# Results on merge of tesseract correction + generic data

In [20]:
'''
input_texts = input_texts_OCR + input_texts_gen
target_texts = input_texts_OCR + target_texts_gen
'''

'\ninput_texts = input_texts_OCR + input_texts_gen\ntarget_texts = input_texts_OCR + target_texts_gen\n'

# Results noisy tesseract correction + generic data

In [21]:
'''
input_texts = input_texts_noisy_OCR + input_texts_gen
target_texts = input_texts_noisy_OCR + target_texts_gen
'''

'\ninput_texts = input_texts_noisy_OCR + input_texts_gen\ntarget_texts = input_texts_noisy_OCR + target_texts_gen\n'

# Results noisy tesseract noisy + correction data

In [22]:
input_texts = input_texts_noisy_OCR + input_texts_OCR
target_texts = target_texts_noisy_OCR + target_texts_OCR

# Results of pre-training on generic and fine tuning on tesseract correction

In [23]:
# TODO

In [24]:
# Sample data
print(len(input_texts))
for i in range(10):
    print(input_texts[i], '\n', target_texts[i])

12616
Medical ProivderRloes: Treating 
 	Medical Provider Roles: Treating


Pvrovdier First NameC hristjine 
 	Provider First Name: Christine


ProviderL ast Naem: Nolenv, MD 
 	Provider Last Name: Nolen, MD


Address Line 1 : 725d Amercan Avensuer 
 	Address Line 1 : 725 American Avenue


City: Waueksha 
 	City: Waukesha


State/Provinnce:zWI 
 	State/Province: WI


Posta Code: 53188 
 	Postal Code: 53188


Country:  US 
 	Country:  US


Businessb Telezphne: (262) 298- 1000 
 	Business Telephone: (262) 928- 1000


Date o First Visit: 12/01h/0217 
 	Date of First Visit: 12/01/2017




## Build vocab

In [25]:
all_texts = target_texts + input_texts
vocab_to_int, int_to_vocab = build_vocab(all_texts)

In [26]:
input_characters = sorted(list(vocab_to_int))
target_characters = sorted(list(vocab_to_int))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

In [27]:
print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of samples: 12616
Number of unique input tokens: 108
Number of unique output tokens: 108
Max sequence length for inputs: 99
Max sequence length for outputs: 99


# Prepare training data

## Train/test split

In [28]:
# Split the data into training and testing sentences
input_texts, test_input_texts, target_texts, test_target_texts  = train_test_split(input_texts, target_texts, test_size = 0.15, random_state = 42)

## Vectorize data

## Train data

In [29]:
encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)

## Test data

In [30]:
test_encoder_input_data, test_decoder_input_data, test_decoder_target_data = vectorize_data(input_texts=test_input_texts,
                                                                                            target_texts=test_target_texts, 
                                                                                            max_encoder_seq_length=max_encoder_seq_length, 
                                                                                            num_encoder_tokens=num_encoder_tokens, 
                                                                                            vocab_to_int=vocab_to_int)

# Training model

In [31]:
batch_size = 64  # Batch size for training.
epochs = 200  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
lr = 0.01

In [32]:
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# TODO: Add Embedding for chars
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 108)    0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 108)    0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   [(None, 256), (None, 373760      input_1[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 256),  373760      input_2[0][0]                    
                                                                 lstm_1[0][1]                     
          

# Learning rate decay

In [33]:
model.compile(optimizer=optimizers.Adam(lr=lr), loss='categorical_crossentropy', metrics=['categorical_accuracy'])

In [34]:
#filepath="weights-improvement-{epoch:02d}-{val_categorical_accuracy:.2f}.hdf5"
filepath="best_model.hdf5" # Save only the best model for inference step, as saving the epoch and metric might confuse the inference function which model to use
checkpoint = ModelCheckpoint(filepath, monitor='val_categorical_accuracy', verbose=1, save_best_only=True, mode='max')
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)
callbacks_list = [checkpoint, tbCallBack]
#callbacks_list = [checkpoint, tbCallBack, lrate]



In [35]:
def exp_decay(epoch):
    initial_lrate = 0.1
    k = 0.1
    lrate = initial_lrate * np.exp(-k*epoch)
    return lrate
lrate = LearningRateScheduler(exp_decay)
#lr = 0

In [36]:
def step_decay(epoch):
    initial_lrate = 0.1
    drop = 0.5
    epochs_drop = 10.0
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate
lrate = LearningRateScheduler(step_decay)
#lr = 0

In [37]:
#callbacks_list.append(lrate)

In [38]:
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          validation_data = ([test_encoder_input_data, test_decoder_input_data], test_decoder_target_data),
          batch_size=batch_size,
          epochs=epochs,
          callbacks=callbacks_list,
          #validation_split=0.2,
          shuffle=True)

Train on 10723 samples, validate on 1893 samples
Epoch 1/200

Epoch 00001: val_categorical_accuracy improved from -inf to 0.15992, saving model to best_model.hdf5
Epoch 2/200


  '. They will not be included '



Epoch 00002: val_categorical_accuracy improved from 0.15992 to 0.22001, saving model to best_model.hdf5
Epoch 3/200

Epoch 00003: val_categorical_accuracy improved from 0.22001 to 0.24048, saving model to best_model.hdf5
Epoch 4/200

Epoch 00004: val_categorical_accuracy improved from 0.24048 to 0.25285, saving model to best_model.hdf5
Epoch 5/200

Epoch 00005: val_categorical_accuracy improved from 0.25285 to 0.25909, saving model to best_model.hdf5
Epoch 6/200

Epoch 00006: val_categorical_accuracy improved from 0.25909 to 0.26471, saving model to best_model.hdf5
Epoch 7/200

Epoch 00007: val_categorical_accuracy improved from 0.26471 to 0.26963, saving model to best_model.hdf5
Epoch 8/200

Epoch 00008: val_categorical_accuracy improved from 0.26963 to 0.27434, saving model to best_model.hdf5
Epoch 9/200

Epoch 00010: val_categorical_accuracy did not improve from 0.27660
Epoch 11/200

Epoch 00011: val_categorical_accuracy improved from 0.27660 to 0.27962, saving model to best_model.

KeyboardInterrupt: 

In [39]:
model.load_weights('best_model.hdf5')

# Inference model

In [40]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [41]:
# Sample output from train data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)

-
Input sentence: Employer:
GT sentence: Employer:

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: Call toll-free Mondya thoruh Friday, 8 a.m. to  8p.mr. Easten Tiemn.
GT sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

Decoded sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

-
Input sentence: Medical Proivdeyrl Information - Physician
GT sentence: Medical Provider Information - Physician

Decoded sentence: Medical Provider Roles: Treating

-
Input sentence: ANESTHESIA: Gneral

GT sentence: ANESTHESIA: General.

Decoded sentence: ANESTHESIA: General

-
Input sentence: Iundicatse additionaul infowrzmation i svaailable
GT sentence: Indicates additional information is available.

Decoded sentence: Insured Coverage Type Coverage Effective Date

-
Input sentence: Status: Svigne

GT sentence: Status: Signed

Decoded sentence: Status: Signed

-
Input sentence: unu “' 
GT sentence: unum

Decoded sentence: 

-
Input sentence: TIER 3 Faimly MOOP Max 
GT sentence: TIER 3 Family MOOP Max 

Decoded sentence: TIER 3 Individual Deductible 01/01/2018 - 12/31/2018

-
Input sentence: ESCRIPTION
GT sentence: DESCRIPTION

Decoded sentence: ESTIMATED BLOOD LOSS: 5ml

-
Input sentence: Clgaim Tpe: VB Accidet - Acicdnetal nIjury
GT sentence: Claim Type: VB Accident - Accidental Injury

Decoded sentence: Claim Type: VB Accident - Accidental Injury

-
Input sentence: 02/02/18 29888 ARvTHRSCOPaICA LLY AIDED ANTEROIR
GT sentence: 02/02/18 29888 ARTHROSCOPIC ALLY AIDED ANTERIOR

Decoded sentence: 02/02/18 29888 ARTHROSCOPY DEBRIDEMENT MEDIAL MEISCECTOMY CHOCKNT

-
Input sentence: iVtals
GT sentence: Vitals

Decoded sentence: Vitals

-
Input sentence: • Lives with fami

GT sentence: • Lives with family

Decoded sentence: • Consumes alcohol

-
Input sentence: Piedmont Healthcare
GT sentence: Piedmont Healthcare

Decoded sentence: Piedmont Healthcare

-
Input sentence: Medicl rPopvider Roles: Treating
GT senten

In [None]:
#WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
#print('WER_spell_correction |TRAIN= ', WER_spell_correction)

In [42]:
# Sample output from test data
# Sample output from train data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = test_target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', test_input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)

-
Input sentence: Stater‘Pronnce: 
GT sentence: State/Province:

Decoded sentence: State/Province: MN

-
Input sentence: Address Line 1: E 
GT sentence: Address Line 1:

Decoded sentence: Address Line 1:

-
Input sentence: Temp Agency and timeframe — n/a 
GT sentence: Temp Agency and timeframe - n/a

Decoded sentence: Temp Agency and utiess in the patient to stop working? Yes No If yes, what is the relationship to th
-
Input sentence: 03/15/2018 Date Signed 
GT sentence: 03/15/2018 Date Signed

Decoded sentence: 03/14/2018 Date Signed

-
Input sentence: 4. 3.1 cm x 0. 9cm Baker's cyst.
GT sentence: 4. 3.1 cm x 0.9 cm Baker's cyst.

Decoded sentence: 4. Probable duration of medical condition started:

-
Input sentence: Gender:
GT sentence: Gender:

Decoded sentence: Gender:

-
Input sentence: The eBenefistj Ceter
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: Pleaesverify reatment for the accnidetn listed above
GT sentence: Please verify treat

-
Input sentence: pPlzease veify tratmnt for the acciednt linsted above
GT sentence: Please verify treatment for the accident listed above.

Decoded sentence: Please verify treatment for the accident listed aboval and upieated frament cannot do)

-
Input sentence: Subtota:
GT sentence: Subtotal:

Decoded sentence: Subtotal:

-
Input sentence: Eent dtes: EE strated she imay have missed sotme ot o nldw. fUnsure of times.
GT sentence: Event dates: EE stated she may have missed some ot on ldw. Unsure of times.

Decoded sentence: Electronically Signed Indicator: Yes

-
Input sentence: The iBenefts Cbenter
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: ComponentR esults
GT sentence: Component Results

Decoded sentence: Complete Site: Chattanooga

-
Input sentence: u. If yes. ultimate the dates of tnebtllty. Including any ﬂme for treatment and recovery: 
GT sentence: c. If yes, estimate the dates of inability, including any time for treatment and re

In [44]:
WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

WER_spell_correction |TEST=  0.14461846051316227


# Test on separate tesseract corrected file

In [45]:
num_samples = 10000
tess_correction_data = os.path.join(data_path, 'new_trained_data.txt')
input_texts_OCR, target_texts_OCR, gt_OCR = load_data_with_gt(tess_correction_data, num_samples, max_sent_len, min_sent_len)

input_texts = input_texts_OCR
target_texts = target_texts_OCR

encoder_input_data, decoder_input_data, decoder_target_data = vectorize_data(input_texts=input_texts,
                                                                             target_texts=target_texts, 
                                                                             max_encoder_seq_length=max_encoder_seq_length, 
                                                                             num_encoder_tokens=num_encoder_tokens, 
                                                                             vocab_to_int=vocab_to_int)

# Sample output from train data
decoded_sentences = []
target_texts_ =  []
for seq_index in range(len(input_texts)):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    target_text = target_texts[seq_index][1:-1]
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    target_texts_.append(target_text)
    
WER_spell_correction = calculate_WER(target_texts_, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

-
Input sentence: Me dieal Provider Roles: Treating 
GT sentence: Medical Provider Roles: Treating

Decoded sentence: Medical Provider Roles: Treating

-
Input sentence: Provider First Name: Christine 
GT sentence: Provider First Name: Christine

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: Provider Last Name: Nolen, MD 
GT sentence: Provider Last Name: Nolen, MD

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: Address Line 1 : 7 25 American Avenue 
GT sentence: Address Line 1 : 725 American Avenue

Decoded sentence: Address Line 1:

-
Input sentence: City. W’aukesha 
GT sentence: City: Waukesha

Decoded sentence: City: Waukesha

-
Input sentence: StatefProvinee: ‘WI 
GT sentence: State/Province: WI

Decoded sentence: State/Province: MN

-
Input sentence: Postal Code: 5 31 88 
GT sentence: Postal Code: 53188

Decoded sentence: Postal Code:

-
Input sentence: Country". US 
GT sentence: Country:  US

Decoded sentence: Country: US

-
Input sentence: Busine

-
Input sentence: STATEMENT DATE 01/03/18 
GT sentence: STATEMENT DATE  01/03/18

Decoded sentence: STATEMENT DATE 

-
Input sentence: —ue DATE _1/13/18 
GT sentence: DUE DATE 01/13/18

Decoded sentence: Date of Next Visit: 03/13/2018

-
Input sentence: snow AMOUNT$ PNDHEHE 
GT sentence: SHOW AMOUNT PAID HERE $

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: PHONE: 414—423—4120 —ue 
GT sentence: PHONE: 414-423-4120

Decoded sentence: PHONE: 414-423-4120

-
Input sentence: PNDHEHE 
GT sentence: ADDRESSEE:

Decoded sentence: OVER 90 DAYS 

-
Input sentence: - MAKE CHECKS PHASE “3' " 
GT sentence: MAKE CHECKS PAYABLE TO:

Decoded sentence: Review of Systems

-
Input sentence: EMERGENCY MEDICAL ASSOCIATES 
GT sentence: EMERGENCY MEDICAL ASSOCIATES

Decoded sentence: EMERGENCY MEDICAL ASSOCIATES

-
Input sentence: 64-00 INDUSTRIAL LOOP 
GT sentence: 6400 INDUSTRIAL LOOP

Decoded sentence: 6400 INDUREDICTED BY INSURED/PATIENT STATEMENT (Continued)

-
Input sentence

-
Input sentence: Group Policy #: 
GT sentence: Group Policy #:

Decoded sentence: Group Policy #:

-
Input sentence: Customer Policy #: 
GT sentence: Customer Policy #:

Decoded sentence: Customer Policy #:

-
Input sentence: EE Name: 
GT sentence: EE Name:

Decoded sentence: EE Name:

-
Input sentence: MONTANO 
GT sentence: MONTANO

Decoded sentence: MONTANO

-
Input sentence: Insucesl Comes: ﬂee Amount Ettectivemte 
GT sentence: Insured Coverage Type Coverage Amount Coverage Effective Date

Decoded sentence: Insured Coverage Type Coverage Effective Date

-
Input sentence: Employee‘lc CI w/Cancer Conditions $20,000.00 January 1, 2015 
GT sentence: Employee*  CI w/Cancer Conditions $20,000.00 January 1, 2015

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: Employee‘kt Wellness Benefit $50.00 January 1{ 2015 
GT sentence: Employee** Wellness Benefit $50.00 January 1, 2015

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: *Children are 

-
Input sentence: ResultsIData 
GT sentence: Results/Data

Decoded sentence: Respiratory: Negative for cough, shortness of breath and wheezing.

-
Input sentence: Right knee MRI IMPRESSION: 
GT sentence: Right knee MRI IMPRESSION:

Decoded sentence: Right knee ACL repair, MCL repair utilizing internal bracing (DOS: 2/2/18)

-
Input sentence: 1. ACL tear. 
GT sentence: 1. ACL tear.

Decoded sentence: 1. Right knee arthroscopic anterior cruciate ligament tear.

-
Input sentence: 2. MCL high-grade tear with few if any Intact ﬁbers present at the femoral attachment. 
GT sentence: 2. MCL high-grade tear with few if any Intact fibers present at the femoral attachment.

Decoded sentence: 2. Medial collateral ligament tear.

-
Input sentence: 5. Patellar apical grade 1-2 chondromalacia. 
GT sentence: 5. Patellar apical grade 1-2 chondromalacia.

Decoded sentence: SUR Physician Honpital

-
Input sentence: Diagnosis 
GT sentence: Diagnosis

Decoded sentence: Diagnosis: ICD Code:

-
Input sentenc

-
Input sentence: History of Present Illness - 
GT sentence: History of Present Illness

Decoded sentence: History provided Lamblate: David

-
Input sentence: Review of Systems 
GT sentence: Review of Systems

Decoded sentence: Review of Systems

-
Input sentence: General: no constitutional symptoms. 
GT sentence: General: no constitutional symptoms.

Decoded sentence: Gender:

-
Input sentence: Cardlovascular: no cardiovascular symptoms. 
GT sentence: Cardiovascular: no cardiovascular symptoms.

Decoded sentence: Cardholder name:

-
Input sentence: Skin no skin symptoms. 
GT sentence: Skin no skin symptoms.

Decoded sentence: Skin: Negative for photoposic mesing xurelogical compartment osteoarthritis.

-
Input sentence: ENT: no ears. nose or throat symptoms. 
GT sentence: ENT: no ears, nose or throat symptoms.

Decoded sentence: ENT -. Hearing intact to the spoken word.

-
Input sentence: Endocrine: no endocrine symptoms. 
GT sentence: Endocrine: no endocrine symptoms.

Decoded senten

-
Input sentence: Current Mods 
GT sentence: Current Meds

Decoded sentence: Current Meds

-
Input sentence: 1. Probiotio CAPS; 
GT sentence: 1. Probiotic CAPS;

Decoded sentence: 1. Right knee arthroscopic anterior cruciate ligament tear.

-
Input sentence: Therapy: 21 Jan2018 to Recorded 
GT sentence: Therapy: 21Jan2018 to Recorded

Decoded sentence: Therapy: 21Jan2018 to Recorded

-
Input sentence: Allergies 
GT sentence: Allergies

Decoded sentence: Allergies

-
Input sentence: 1. No Known Allergies 
GT sentence: 1. No Known Allergies

Decoded sentence: 1. Right knee arthroscopic anterior cruciate ligament tear.

-
Input sentence: Vitals 
GT sentence: Vitals

Decoded sentence: Vitals

-
Input sentence: TCO Vitals Signs Panel 
GT sentence: TCO Vitals Signs Panel

Decoded sentence: TCO Vitals Signs Panel

-
Input sentence: Height: 5 ft 8 in 
GT sentence: Height: 5 ft 8 in

Decoded sentence: Height: 5 ft 8 in

-
Input sentence: Weight: 270 lb 
GT sentence: Weight: 270 lb

Decoded sent

-
Input sentence: 0 Tobacco quit date established (287.891) 
GT sentence: • Tobacco quit date established (287.891)

Decoded sentence: • Consumes alcohol

-
Input sentence: - : ‘10 years 
GT sentence: • : 10 years

Decoded sentence: Review of Systems

-
Input sentence: Current Made 
GT sentence: Current Meds

Decoded sentence: Current Meds

-
Input sentence: 1. Muitl-V'rtamin TABS: 
GT sentence: 1. Multi-Vitamin TABS;

Decoded sentence: 1. Right knee arthroscopic anterior cruciate ligament tear.

-
Input sentence: Therapy: (Recorded:24dan2018) to Recorded 
GT sentence: Therapy: (Recorded:24Jan2018) to Recorded

Decoded sentence: Therapy: 21Jan2018 to Recorded

-
Input sentence: 2. Probiotic CAPS; 
GT sentence: 2. Probiotic CAPS;

Decoded sentence: 2. Medial collar lypatient to stop working? Yes No If yes, what is the relationship to the child

-
Input sentence: Therapy: 21Jan2018 to Recorded 
GT sentence: Therapy: 21Jan2018 to Recorded

Decoded sentence: Therapy: 21Jan2018 to Recorded


-
Input sentence: Address line 1 - 
GT sentence: Address line 1 -

Decoded sentence: Address Line 1:

-
Input sentence: Address line 2 — 
GT sentence: Address line 2 -

Decoded sentence: Address Line 1:

-
Input sentence: City - 
GT sentence: City -

Decoded sentence: City: Waukesha

-
Input sentence: State - NC 
GT sentence: State - NC

Decoded sentence: State/Province: MN

-
Input sentence: Speciality — PCP 
GT sentence: Speciality - PCP

Decoded sentence: Specialty

-
Input sentence: Add another doctor — no 
GT sentence: Add another doctor - no

Decoded sentence: Add doctors details - yes

-
Input sentence: Physician authorization - mail 
GT sentence: Physician authorization - mail

Decoded sentence: Physician Name (Last Name, First Name, MI, Suffix) Please Print

-
Input sentence: Home Email — 
GT sentence: Home Email -

Decoded sentence: Home Telephone:

-
Input sentence: Register for Claim Self Service — no 
GT sentence: Register for Claim Self Service - no

Decoded sentence: Rep

-
Input sentence: Leave start date 
GT sentence: Leave start date

Decoded sentence: Leave type - full

-
Input sentence: Event Dates Comments — 
GT sentence: Event Dates Comments -

Decoded sentence: Event Dates Information

-
Input sentence: Returned to work — no 
GT sentence: Returned to work - no

Decoded sentence: Report Group: 26

-
Input sentence: Estimated return to work date — 
GT sentence: Estimated return to work date -

Decoded sentence: Estimated return to work date (mm/dd/yy):

-
Input sentence: Time missed 7 no 
GT sentence: Time missed - no

Decoded sentence: Time of Accident: 16: 0X60 A)

-
Input sentence: Break in employment — no 
GT sentence: Break in employment - no

Decoded sentence: Breat Status: Signed

-
Input sentence: Served military last 12 mths — no 
GT sentence: Served military last 12 mths - no

Decoded sentence: Service Date: 03/13/2018

-
Input sentence: Hired as temp - no 
GT sentence: Hired as temp - no

Decoded sentence: Hire Date:

-
Input sentence: 

-
Input sentence: This cialmlsfor: \El Self El Spouse El Domestic Partner El DopendentChiid
GT sentence: This claim is for: Self Spouse Domestic Partner Dependent Child

Decoded sentence: This claim is for a child, please state your relationship?

-
Input sentence: 8. information About the Inauroleollcyholdor
GT sentence: B. Information About the Insured/Policyholder

Decoded sentence: B. Information About the Patient Outhotic Procedure wel"

-
Input sentence: Last Name Suffix First Name MI
GT sentence: Last Name Suffix First Name  MI

Decoded sentence: Last Name:

-
Input sentence: Language l-Ireloronce Ll hngllsh El Spanish
GT sentence: Language Preference English Spanish

Decoded sentence: Language Preference:

-
Input sentence: lf itnown. please check all types of coverage you have with Unum.
GT sentence: If known, please check all types of coverage you have with Unum.

Decoded sentence: Is this condition the result of an accidental injury? Yes No

-
Input sentence: l:l Short Term 

-
Input sentence: Physician Name (Last Name. First Name, Ml, Suffix) Please Print . . ‘ ‘ J _
GT sentence: Physician Name (Last Name, First Name, MI, Suffix) Please Print 

Decoded sentence: Physician Name (Last Name, First Name, MI, Suffix) Please Print

-
Input sentence: '9 .1 , r“ i J i —Q‘ ._
GT sentence: Medical Specialty Degree

Decoded sentence: Medical Provider Roles: Treating

-
Input sentence: Address
GT sentence: Address

Decoded sentence: Address Line 1:

-
Input sentence: City ﬂyettavlli9_ GA 30214 State Zip '
GT sentence: City Sate Zip

Decoded sentence: City: Waukesha

-
Input sentence: Areyou retamd tuthis patient? El Yes W.Whatistherelationshlp?
GT sentence: Are you related to this patient? Yes No If yes, what is the relationship?

Decoded sentence: Are you related to this patient? Yes No If yes, what is the relationship to the child

-
Input sentence: Physician Signature I Date
GT sentence: Physician Signature Date

Decoded sentence: Physician Name (Last Name, First N

-
Input sentence: Account number: 
GT sentence: Account number: 

Decoded sentence: Acct #:

-
Input sentence: Transaction reference number: 
GT sentence: Transaction reference number: 

Decoded sentence: Transaction type:

-
Input sentence: Cardholder name: 
GT sentence: Cardholder name: 

Decoded sentence: Cardholder name:

-
Input sentence: Transaction identiﬁer: 
GT sentence: Transaction identifier: 

Decoded sentence: Transaction type:

-
Input sentence: Patient identiﬁer: 
GT sentence: Patient identifier:

Decoded sentence: Patient Name (Last Name, First Name, MI, Suffix) Please Print

-
Input sentence: Subtotal: 
GT sentence: Subtotal:

Decoded sentence: Subtotal:

-
Input sentence: Sales Tax: 
GT sentence: Sales Tax:

Decoded sentence: Saturd yourse servicis index is 31.12 kg/m2.

-
Input sentence: Toto I: 
GT sentence: Total:

Decoded sentence: Total Employee Bi-Weekly Payroll Deduction: $13.52

-
Input sentence: [customer copy) 
GT sentence: (customer copy) 

Decoded sentence

-
Input sentence: 2. 0.3 cm extrusion of the medial meniscal body. 
GT sentence: 2. 0.3 cm extrusion of the medial meniscal body.

Decoded sentence: 2. Medial collateral ligament tear.

-
Input sentence: 3. Mild medial and patellofemoral compartment osteoarthritis. 
GT sentence: 3. Mild medial and patellofemoral compartment osteoarthritis.

Decoded sentence: 3. Approximate date symptoms.

-
Input sentence: 4. 3.1 cm x 0.9 cm Baker's cyst. 
GT sentence: 4. 3.1 cm x 0.9 cm Baker's cyst.

Decoded sentence: 4. Probable duration of medical condition started:

-
Input sentence: Approved By: CROFT STONE M D 2/15/2018 9:41 AM 
GT sentence: Approved By: CROFT STONE MD 2/15/2018 9:41 AM

Decoded sentence: Approval code: 

-
Input sentence: Narrative 
GT sentence: Narrative

Decoded sentence: NarrOCID

-
Input sentence: M Fil left knee without contrast 
GT sentence: MRI left knee without contrast

Decoded sentence: MRN: DOB:

-
Input sentence: INDICATION: Posterior and lateral pain 
GT sentence: 

-
Input sentence: Pﬁnreo Name _ ocia ecuri y um er 
GT sentence: Printed Name Social Security Number

Decoded sentence: Printed by 1237

-
Input sentence: Unum is a registered trademark and marketing brand of Unum Group and its Ensuring subsidiaries. 
GT sentence: Unum is a registered trademark and marketing brand of Unum Group and its insuring subsidiaries.

Decoded sentence: Unum is a registered trademark and marketing brand of Unum Group and its insuring subsidiaries.

-
Input sentence: unum" 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: O C . ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: Op Note by Larkin, John J, MD at 3/16/1018 at 3/1/1 )

-
Input sentence: The Benefits Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: Call toll-free Monday through Friday. 8 am. to 8 pm. Eastern Time. 
GT sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

Decoded sentence: Call

-
Input sentence: UN UM LIFE INSURANCE COMPANY OF AMERICA 2211 CONGRESS ST PORTLAND ME 04122-0002 
GT sentence: UNUM LIFE INSURANCE COMPANY OF AMERICA 2211 CONGRESS ST PORTLAND ME 04122-0002

Decoded sentence: UNUM LIFE INSURANCE COMPANY OF AMERICA 2211 CONGRESS ST PORTLAND ME 04122-0002

-
Input sentence: unum‘t 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: . O . ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: ACCIDENT CLAIM FORM

-
Input sentence: The Beneﬁts Center 
GT sentence: The Benefits Center

Decoded sentence: The Benefits Center

-
Input sentence: Call toll—free Monday through Friday, 8 am. to 8 pm. Eastern Time. 
GT sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

Decoded sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

-
Input sentence: ATTENDING PHYSICIAN STATEMENT (PLEASE PRINT) 
GT sentence: ATTENDING PHYSICIAN STATEMENT (PLEASE PRINT)

Decoded sentence: ATTENDING PHYS

-
Input sentence: SUMMARY FOR ! 
GT sentence: SUMMARY FOR:

Decoded sentence: SUMMARY FOR:

-
Input sentence: Send Inquiries To: 
GT sentence: Send Inquiries To:

Decoded sentence: Send Inquiries To:

-
Input sentence: TWIN CITIES ORTHOPEDICS PA  
GT sentence: TWIN CITIES ORTHOPEDICS PA

Decoded sentence: TWIN CITIES ORTHOPEDICS 3 of 3 3/22/18 3:10:10 PM

-
Input sentence: SEE REVERSE SIDE FOR IMPORTANT BILLING I FORMATION 
GT sentence: SEE REVERSE SIDE FOR IMPORTANT BILLING INFORMATION

Decoded sentence: SERVICE DESCRIPTION

-
Input sentence: unumL 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: C . O The Benefits Center 
GT sentence: The Benefits Center

Decoded sentence: C. Signature of Attending Physician

-
Input sentence: Call toll-free Monday through Friday, 8 am. to 8 pm. Eastern Time. 
GT sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

Decoded sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

-
Input s

-
Input sentence: Are you related to this patierit’? II 'II_ Yes .11 No If yes. what Is the relationship? 
GT sentence: Are you related to this patient? Yes No If yes, what is the relationship?

Decoded sentence: Are you related to this patient? Yes No If yes, what is the relationship to the child

-
Input sentence: Physician Signature i Date 
GT sentence: Physician Signature Date

Decoded sentence: Physician Name (Last Name, First Name, MI, Suffix) Please Print

-
Input sentence: CL-1023 (0611 3) 2 
GT sentence: CL-1023 (06/13) 2

Decoded sentence: CL-1023 (06/13)

-
Input sentence: DATE: 03/07/2018 
GT sentence: DATE: 03/07/2018

Decoded sentence: DATE: 03/07/2018

-
Input sentence: ACCOUNT No . 
GT sentence: ACCOUNT NO.:

Decoded sentence: ACCOUNT# EMA297232

-
Input sentence: SEND PAYMENT TO: 
GT sentence: SEND PAYMENT TO:

Decoded sentence: SEND:

-
Input sentence: mo UN'I' PAID : 
GT sentence: AMOUNT PAID: $

Decoded sentence: ACCIDENT CLAIM FORM

-
Input sentence: TOTAL DUE jzgf

-
Input sentence: Note: Final cost may vary slightly due to rounding differences. 
GT sentence: Note: Final cost may vary slightly due to rounding differences.

Decoded sentence: Note: Final cost may vary slightly due to rounding differences.

-
Input sentence: UNUM LIFE INSURANCE COMPANY OF AMERICA 2211 CONGRESS 5T PORTLAND ME 04122-0002 
GT sentence: UNUM LIFE INSURANCE COMPANY OF AMERICA 2211 CONGRESS ST PORTLAND ME 04122-0002

Decoded sentence: UNUM LIFE INSURANCE COMPANY OF AMERICA 2211 CONGRESS ST PORTLAND ME 04122-0002

-
Input sentence: Employee:
GT sentence: Employee:

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: BLILh Batu:
GT sentence: Birth Date:

Decoded sentence: Business Telephone: (952) 512- 5625

-
Input sentence: TA}: [1}
GT sentence: Tax ID:

Decoded sentence: Tax ID:

-
Input sentence: Employee ID Type; Employcc ID
GT sentence: Employee ID Type: Employee ID

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: T‘mfﬂn

-
Input sentence: PROCEDURES}: Procedure[sl: 
GT sentence: PROCEDURE(S): Procedure(s):

Decoded sentence: PROCEDURE NO.:

-
Input sentence: LEFT KNEE AFITHHOSCCIPY DEBFI'IDEMENT MEDIAL MEISCECTUMY CHDNDRDPLASTY PFC 
GT sentence: LEFT KNEE ARTHROSCOPY DEBRIDEMENT MEDIAL MEISCECTOMY CHONDROPLASTY PFC

Decoded sentence: LEFT KNEE ARTHRISE

-
Input sentence: SURGEONS]: 
GT sentence: SURGEON(S):

Decoded sentence: SURGEON: John J. Larkin, M.D.

-
Input sentence: Surgeonts) and Role: 
GT sentence: Surgeon(s) and Role:

Decoded sentence: Surgery Information

-
Input sentence: " Larkin, John J, MD - F‘rimaryr 
GT sentence: * Larkin, John J, MD - Primary

Decoded sentence: *Children are coverage or date of your back the and treatment dates (mm/dd/yy): From Through

-
Input sentence: ANESTHESIA; General 
GT sentence: ANESTHESIA: General

Decoded sentence: ANESTHESIA: General

-
Input sentence: SPEClMENS: * Ne specimens In log * 
GT sentence: SPECIMENS: * No specimens in log *

Decoded sentence: 

-
Input sentence: Name 
GT sentence: Name

Decoded sentence: Name

-
Input sentence: Relation to 9t Sei‘f
GT sentence: Relation to Pt Self

Decoded sentence: Report Group: 26

-
Input sentence: Service Area SEH 
GT sentence: Service Area SEH

Decoded sentence: Service Date: 03/13/2018

-
Input sentence: Active? Yes 
GT sentence: Active? Yes

Decoded sentence: Active Problems

-
Input sentence: Sei‘f SEH Yes Personali'i-‘amrly 
GT sentence: Acct Type Personal/Family

Decoded sentence: Seevice: Orthopedic

-
Input sentence: Address 
GT sentence: Address

Decoded sentence: Address Line 1:

-
Input sentence: Phone 
GT sentence: Phone

Decoded sentence: Phone

-
Input sentence: Coverage Information (for Hospital Account: t 
GT sentence: Coverage Information (for Hospital Account )

Decoded sentence: Coverage Information (for Hospital Account )

-
Input sentence: PIC) Payori'ﬁlan 
GT sentence: F/O Payor/Plan

Decoded sentence: PRIMARY INSUR:

-
Input sentence: Precert # 
GT sentence: Precert

-
Input sentence: smasmf ion-3 5 /:L.Ira. riﬂes-r up Lager, Pam) maeir‘w‘i“ 0“, phgar‘eﬁi 
GT sentence: What is your treatment plan? Please include all medications.

Decoded sentence: Patient Name (Last Name, First Name, MI, Suffix) Please Print

-
Input sentence: Enznlngs Typu: Kuuxly
GT sentence: Earnings Type: Hourly

Decoded sentence: Encounter Date: 02/12/2018

-
Input sentence: Euxniugs deu: Weakly
GT sentence: Earnings Mode: Weekly

Decoded sentence: Earnings Mode: Weekly

-
Input sentence: Afiler' THI' 0 (N30
GT sentence: After Tax: 0 000

Decoded sentence: Allergies

-
Input sentence: Rtpozt Gxoup: 26
GT sentence: Report Group: 26

Decoded sentence: Report Group: 26

-
Input sentence: Product: _
GT sentence: Product:

Decoded sentence: Product: Long Term Disability

-
Input sentence: I’xod‘ac‘t Typo: Leave Hgmt Svnz
GT sentence: Product Type: Leave Mgmt Svc

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: Funding: Not Applicable 
GT sentence: Funding: Not Applic

-
Input sentence: n. W hourte] per day: days. per weektram: 
GT sentence: c. Reduced schedule: hour(s) per day: days per week from:

Decoded sentence: s. Diding Chated Event Care Visit for Concussion

-
Input sentence: Printed Pmlnter'n Home; T910) 3— MM/(J .0' Type of Proﬁt”:
GT sentence: Printed Provider’s Name: Type of Practice

Decoded sentence: Printed Name 

-
Input sentence: Busineee address 
GT sentence: Business address

Decoded sentence: Business Telephone: (952) 512- 5625

-
Input sentence: Telephone: 
GT sentence: Telephone:

Decoded sentence: Telephone Number

-
Input sentence: Sluneturn of Health Care Provider
GT sentence: Signature of Health Care Provider

Decoded sentence: Sick Pay H-IG CODE

-
Input sentence: Un um" 
GT sentence: Unum

Decoded sentence: Unum is a registered trademark and marketing brand of Unum Group and its insuring subsidiaries.

-
Input sentence: . j . ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: Amount Encluare 13:24 We 

-
Input sentence: - at;
GT sentence: Date

Decoded sentence: LAST PART INSURANCE COMPANY OF AMERICA 2211 CONGRESS ST PORTLAND ME 04122-0002

-
Input sentence: unumu'
GT sentence: unum

Decoded sentence: unum

-
Input sentence: O O O The Benefits Center
GT sentence: The Benefits Center

Decoded sentence: OPERATIVE/PROCEDURE NOTE

-
Input sentence: Call toll-free Monday through Friday, 8 am. to 8 pm. Eastern Time.
GT sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

Decoded sentence: Call toll-free Monday through Friday, 8 a.m. to 8 p.m. Eastern Time.

-
Input sentence: Authorization to Collect and Disclose Information
GT sentence: Authorization to Collect and Disclose Information

Decoded sentence: Authorization to Collect and Disclose Information

-
Input sentence: (Not for FMLA Requests)
GT sentence: (Not for FMLA Requests)

Decoded sentence: (Name)  (Telephone Number)

-
Input sentence: Electronically Signed 03/14/2018
GT sentence: Electronically Signed 

-
Input sentence: Respiratory: Negative for ecugh, shortness of breath and wheezing. 
GT sentence: Respiratory: Negative for cough, shortness of breath and wheezing.

Decoded sentence: Respiratory: Negative for cough, shortness of breath and wheezing.

-
Input sentence: Cardiovascular: Negative for chest pain. 
GT sentence: Cardiovascular: Negative for chest pain.

Decoded sentence: Cardholder name:

-
Input sentence: Gastrointestinal: Negative for abdominal pain, constipation, diarrhea, nausea and vomiting. 
GT sentence: Gastrointestinal: Negative for abdominal pain, constipation, diarrhea, nausea and vomiting.

Decoded sentence: Gastrointestinal: no gastrointestinal symptoms.

-
Input sentence: Genitourinary: Negative for dysuria, flank pain, frequency and urgency. 
GT sentence: Genitourinary: Negative for dysuria, flank pain, frequency and urgency.

Decoded sentence: Gender:

-
Input sentence: Skin: Negative for rash. 
GT sentence: Skin: Negative for rash.

Decoded sentence: Skin: N

-
Input sentence: Best Phone Number to be Reached During the Day: 
GT sentence: Best Phone Number to be Reached During the Day:

Decoded sentence: Best Phone Number to be Reached During the Day

-
Input sentence: Email Address: 
GT sentence: Email Address:

Decoded sentence: Email Address:

-
Input sentence: unum 
GT sentence: unum

Decoded sentence: unum

-
Input sentence: December 1, 2017 
GT sentence: December 1, 2017

Decoded sentence: December 27, 2016

-
Input sentence: Confirmation of Coverage 
GT sentence: Confirmation of Coverage

Decoded sentence: Confirmation of Coverage

-
Input sentence: Employer: 
GT sentence: Employer:

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: Group Policy #: 
GT sentence: Group Policy #:

Decoded sentence: Group Policy #:

-
Input sentence: Customer Policy #: 
GT sentence: Customer Policy #:

Decoded sentence: Customer Policy #:

-
Input sentence: EE Name: 
GT sentence: EE Name:

Decoded sentence: EE Name:

-
Input sente

-
Input sentence: Provider First Name: Todd 
GT sentence: Provider First Name: Todd

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: Provider Last Name: Francis 
GT sentence: Provider Last Name: Francis

Decoded sentence: Provider Last Name: Hakim

-
Input sentence: Address Line 1: 5 700 Monroe ST Suite 203 
GT sentence: Address Line 1: 5 700 Monroe St Suite 203

Decoded sentence: Address Line 1:

-
Input sentence: C ity. Sylvania 
GT sentence: City: Sylvania

Decoded sentence: City: Waukesha

-
Input sentence: State/Produce: OH 
GT sentence: State/Produce: OH

Decoded sentence: State/Province: MN

-
Input sentence: Postal Code: 43560 
GT sentence: Postal Code: 43560

Decoded sentence: Postal Code:

-
Input sentence: Liotmtry". 
GT sentence: Country:

Decoded sentence: Limitations and Need for art treatment dates (mm/dd/yy): From Through

-
Input sentence: Business Telephone: (419) 843-8100 
GT sentence: Business Telephone: (419) 843-8100

Decoded sentence: Business Tele

-
Input sentence: 9 Indicates additional information i- available. 
GT sentence: Indicates additional information is available.

Decoded sentence: Indicates additional information prescribed (excluding over the counter my. legal aus, treatment(s) 
-
Input sentence: Employer Nana: 
GT sentence: Employer Name:

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: Electron [1: S u Innis sion 
GT sentence: Electronic: Submission

Decoded sentence: Electronically Signed Indicator: Yes

-
Input sentence: C lairn Event Identiﬁer: 
GT sentence: Claim Event Identifier:

Decoded sentence: C. Signature of Attending Physician

-
Input sentence: Submission Date: 03/122018 
GT sentence: Submission Date: 03/12/2018

Decoded sentence: Submission Date: 03/14/2018

-
Input sentence: Electromeally Signed Indicator: Yes 
GT sentence: Electronically Signed Indicator: Yes

Decoded sentence: Electronically Signed Indicator: Yes

-
Input sentence: Electronically Signed Date: Monday, 03:"1

-
Input sentence: Date ofFirst Visit: 0211952018 
GT sentence: Date of First Visit: 02/19/2018

Decoded sentence: Date of Next Visit: 03/13/2018

-
Input sentence: Date ofNeXt Visit; 03/120018 
GT sentence: Date of Next Visit: 03/12/2018

Decoded sentence: Date of Next Visit: 03/13/2018

-
Input sentence: Emplqvm am In formation 
GT sentence: Employment Information

Decoded sentence: Employee Off-Job Acc January 1, 2018

-
Input sentence: Flrployer Name: 
GT sentence: Employer Name:

Decoded sentence: First Name:

-
Input sentence: Electronic Sn bmil. sion 
GT sentence: Electronic Submission

Decoded sentence: Electronically Signed Indicator: Yes

-
Input sentence: Claim Event Identiﬁer: 2667305 
GT sentence: Claim Event Identiﬁer: 2667305

Decoded sentence: Claim Type: VB Accident - Accidental Injury

-
Input sentence: Submission Date: 03/122018
GT sentence: Submission Date: 03/12/2018

Decoded sentence: Submission Date: 03/14/2018

-
Input sentence: Claim Tji'pe: V'B Accident - Accid

-
Input sentence: E1ec11‘0111cally Signed Date: 
GT sentence: Electronically Signed Date:
Decoded sentence: Employee Wellness Benefit July 1, 2017

WER_spell_correction |TEST=  0.12063933287004865


# Next steps
- Add attention
- Full attention
- Condition the Encoder on word embeddings of the context (Bi-directional LSTM)
- Condition the Decoder on word embeddings of the context (Bi-directional LSTM) 

# References
- Sequence to Sequence Learning with Neural Networks
    https://arxiv.org/abs/1409.3215
- Learning Phrase Representations using
    RNN Encoder-Decoder for Statistical Machine Translation
    https://arxiv.org/abs/1406.107