# Introduction

We tackle the problem of OCR post processing. In OCR, we map the image form of the document into the text domain. This is done first using an CNN+LSTM+CTC model, in our case based on tesseract. Since this output maps only image to text, we need something on top to validate and correct language semantics.

The idea is to build a language model, that takes the OCRed text and corrects it based on language knowledge. The langauge model could be:
- Char level: the aim is to capture the word morphology. In which case it's like a spelling correction system.
- Word level: the aim is to capture the sentence semnatics. But such systems suffer from the OOV problem.
- Fusion: to capture semantics and morphology language rules. The output has to be at char level, to avoid the OOV. However, the input can be char, word or both.

The fusion model target is to learn:

    p(char | char_context, word_context)

In this workbook we use seq2seq vanilla Keras implementation, adapted from the lstm_seq2seq example on Eng-Fra translation task. The adaptation involves:

- Adapt to spelling correction, on char level
- Pre-train on a noisy, medical sentences
- Fine tune a residual, to correct the mistakes of tesseract 
- Limit the input and output sequence lengths
- Enusre teacher forcing auto regressive model in the decoder
- Limit the padding per batch
- Learning rate schedule
- Bi-directional LSTM Encoder
- Bi-directional GRU Encoder


# Imports

In [1]:
from __future__ import print_function
import tensorflow as tf
import keras.backend as K
from keras.backend.tensorflow_backend import set_session
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Bidirectional, Concatenate, GRU
from keras import optimizers
from keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler
from keras.models import load_model
import numpy as np
import os
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from autocorrect import spell
import re
%matplotlib inline

Using TensorFlow backend.


# Utility functions

In [2]:
# Limit gpu allocation. allow_growth, or gpu_fraction
def gpu_alloc():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))

In [3]:
gpu_alloc()

In [4]:
def calculate_WER_sent(gt, pred):
    '''
    calculate_WER('calculating wer between two sentences', 'calculate wer between two sentences')
    '''
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    d = np.zeros(((len(gt_words) + 1), (len(pred_words) + 1)), dtype=np.uint8)
    # d = d.reshape((len(gt_words)+1, len(pred_words)+1))

    # Initializing error matrix
    for i in range(len(gt_words) + 1):
        for j in range(len(pred_words) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(gt_words) + 1):
        for j in range(1, len(pred_words) + 1):
            if gt_words[i - 1] == pred_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    return d[len(gt_words)][len(pred_words)]

In [5]:
def calculate_WER(gt, pred):
    '''

    :param gt: list of sentences of the ground truth
    :param pred: list of sentences of the predictions
    both lists must have the same length
    :return: accumulated WER
    '''
#    assert len(gt) == len(pred)
    WER = 0
    nb_w = 0
    for i in range(len(gt)):
        #print(gt[i])
        #print(pred[i])
        WER += calculate_WER_sent(gt[i], pred[i])
        nb_w += len(gt[i])

    return WER / nb_w

In [21]:
def load_data_with_gt(file_name, num_samples, max_sent_len, min_sent_len, delimiter='\t', gt_index=1, prediction_index=0):
    '''Load data from txt file, with each line has: <TXT><TAB><GT>. The  target to the decoder muxt have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    for row in open(file_name, encoding='utf8'):
        if cnt < num_samples :
            #print(row)
            sents = row.split(delimiter)
            if (len(sents) < 2):
                continue
            input_text = sents[prediction_index]
            
            target_text = '\t' + sents[gt_index] + '\n'
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                cnt += 1
                
                input_texts.append(input_text)
                target_texts.append(target_text)
                gt_texts.append(sents[gt_index])
    return input_texts, target_texts, gt_texts

In [7]:
def load_data(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load data from txt file, with each line has: <TXT><TAB><GT>. The  target to the decoder muxt have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []   
    
    #for row in open(file_name, encoding='utf8'):
    for row in open(file_name):
        if cnt < num_samples :            
            input_text = row           
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len:
                cnt += 1                
                input_texts.append(input_text)
    return input_texts

In [8]:
def vectorize_data(input_texts, max_encoder_seq_length, num_encoder_tokens, vocab_to_int):
    
    if(len(input_texts) > max_encoder_seq_length):
        input_texts = input_texts[:max_encoder_seq_length]
    
    '''Prepares the input text and targets into the proper seq2seq numpy arrays'''
    encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length),
    dtype='float32')
    
    for i, input_text in enumerate(input_texts):
        for t, char in enumerate(input_text[:max_encoder_seq_length]):
            # c0..cn
            encoder_input_data[i, t] = vocab_to_int[char]
                
    return encoder_input_data

In [9]:
def decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, max_decoder_seq_length, vocab_to_int, int_to_vocab):
    
    #print(max_decoder_seq_length)
    # Encode the input as state vectors.
    encoder_outputs, h, c  = encoder_model.predict(input_seq)
    states_value = [h,c]
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = vocab_to_int['\t']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    #print(input_seq)
    attention_density = []
    i = 0
    special_chars = ['\\', '/', '-', '—' , ':', '[', ']', ',', '.', '"', ';', '%', '~', '(', ')', '{', '}', '$', '#']
    #special_chars = []
    while not stop_condition:
        #print(target_seq)
        output_tokens, attention, h, c  = decoder_model.predict(
            [target_seq, encoder_outputs] + states_value)
        #print(attention.shape)
        attention_density.append(attention[0][0])# attention is max_sent_len x 1 since we have num_time_steps = 1 for the output
        # Sample a token
        #print(output_tokens.shape)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        
        #print(sampled_token_index)
        sampled_char = int_to_vocab[sampled_token_index]
        
        orig_char = int_to_vocab[int(input_seq[:,i][0])]
        
        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
            #print('End', sampled_char, 'Len ', len(decoded_sentence), 'Max len ', max_decoder_seq_length)
            sampled_char = ''
        
        # Copy digits as it, since the spelling corrector is not good at digit corrections
        
        if(orig_char.isdigit() or orig_char in special_chars):
            decoded_sentence += orig_char            
        else:
            if(sampled_char.isdigit() or sampled_char in special_chars):
                decoded_sentence += ''
            else:
                decoded_sentence += sampled_char
        
        #decoded_sentence += sampled_char


        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]
        
        i += 1
        if(i > 48):
            i = 0
    attention_density = np.array(attention_density)
    
    # Word level spell correct
    '''
    corrected_decoded_sentence = ''
    for w in decoded_sentence.split(' '):
        corrected_decoded_sentence += spell(w) + ' '
    decoded_sentence = corrected_decoded_sentence
    '''
    return decoded_sentence, attention_density


In [10]:
def word_spell_correct(decoded_sentence):
    corrected_decoded_sentence = ''
    special_chars = ['\\', '/', '-', '—' , ':', '[', ']', ',', '.', '"', ';', '%', '~', '(', ')', '{', '}', '$', '#']
    for w in decoded_sentence.split(' '):
        if((len(re.findall(r'\d+', w))==0) and not (w in special_chars)):
            corrected_decoded_sentence += spell(w) + ' '
        else:
            corrected_decoded_sentence += w + ' '
    return corrected_decoded_sentence

In [11]:
def clean_up_sentence(sentence, vocab):
    s = ''
    prev_char = ''
    for c in sentence.strip():
        if c not in vocab or (c == ' ' and prev_char == ' '):
            s += ''
        else:
            s += c
        prev_char = c
            
    return s

# Load data

# Load model params

In [12]:
data_path = '../../dat/'

In [13]:
max_sent_lengths = [50, 100]

In [14]:
vocab_file = {}
model_file = {}
encoder_model_file = {}
decoder_model_file = {}
model = {}
encoder_model = {}
decoder_model = {}
vocab = {}
vocab_to_int = {}
int_to_vocab = {}
max_sent_len = {}
min_sent_len = {}
num_decoder_tokens = {}
num_encoder_tokens = {}
max_encoder_seq_length = {}
max_decoder_seq_length = {}

In [15]:

for i in max_sent_lengths:
    vocab_file[i] = 'vocab-{}.npz'.format(i)
    model_file[i] = 'best_model-{}.hdf5'.format(i)
    encoder_model_file[i] = 'encoder_model-{}.hdf5'.format(i)
    decoder_model_file[i] = 'decoder_model-{}.hdf5'.format(i)
    
    vocab = np.load(file=vocab_file[i])
    vocab_to_int[i] = vocab['vocab_to_int'].item()
    int_to_vocab[i] = vocab['int_to_vocab'].item()
    max_sent_len[i] = vocab['max_sent_len']
    min_sent_len[i] = vocab['min_sent_len']
    input_characters = sorted(list(vocab_to_int))
    num_decoder_tokens[i] = num_encoder_tokens[i] = len(input_characters) #int(encoder_model.layers[0].input.shape[2])
    max_encoder_seq_length[i] = max_decoder_seq_length[i] = max_sent_len[i] - 1#max([len(txt) for txt in input_texts])
    
    model[i] = load_model(model_file[i])
    encoder_model[i] = load_model(encoder_model_file[i])
    decoder_model[i] = load_model(decoder_model_file[i])



In [22]:
num_samples = 1000000
#tess_correction_data = os.path.join(data_path, 'test_data.txt')
#input_texts = load_data(tess_correction_data, num_samples, max_sent_len, min_sent_len)

OCR_data = os.path.join(data_path, 'field_class_32.txt')#'new_trained_data.txt')
#input_texts, target_texts, gt_texts = load_data_with_gt(OCR_data, num_samples, max_sent_len, min_sent_len, delimiter='|',gt_index=0, prediction_index=1)
input_texts, target_texts, gt_texts = load_data_with_gt(OCR_data, num_samples, max_sent_len=10000, min_sent_len=0)

In [23]:
# Sample data
print(len(input_texts))
for i in range(10):
    print(input_texts[i], '\n', target_texts[i])

354
claim folder contents 
 	Claim Folder Contents


claimant name 
 	Claimant Name:


claim number 
 	Claim Number:


unauthorized access is strictly prohibited. 
 	Unauthorized access is strictly prohibited.


print date 3/13/2018 
 	Print Date: 3/13/2018


accountability act (hipaa) privacy rule. 
 	Accountability Act (HIPAA) Privacy Rule.


(not for fmla requests} 
 	(Not For FMLA Requests)


| authorize the following persons health care professionals, hospitals, clinics, laboratories, pharmacies and all other medical or medically related providers, facilities or services, rehabilitation professionals, vocational evaluators, health plans, insurance companies, third party administrators, insurance producers, insurance 
 	I authorize the following persons: health care professionals, hospitals, clinics, laboratories, pharmacies and all other medical or medically related providers, facilities or services, rehabilitation professionals, vocational evaluators, health plans, insurance comp

In [18]:
# Spell correct before inference
'''
input_texts_ = []
for sent in input_texts:
    sent_ = ''
    for word in sent.split(' '):
        sent_ += spell(word) + ' '
    input_texts_.append(sent_)
input_texts = input_texts_
input_texts_ = []
# Sample data
print(len(input_texts))
for i in range(10):
    print(input_texts[i], '\n', target_texts[i])
'''

"\ninput_texts_ = []\nfor sent in input_texts:\n    sent_ = ''\n    for word in sent.split(' '):\n        sent_ += spell(word) + ' '\n    input_texts_.append(sent_)\ninput_texts = input_texts_\ninput_texts_ = []\n# Sample data\nprint(len(input_texts))\nfor i in range(10):\n    print(input_texts[i], '\n', target_texts[i])\n"

In [24]:
decoded_sentences = []
corrected_sentences = []

#for seq_index in range(len(input_texts)):
results = open('RESULTS.md', 'w')
results.write('|OCR sentence|GT sentence|Char decoded sentence|Word decoded sentence|Sentence length (chars)|\n')
results.write('---------------|-----------|----------------|----------------|----------------|\n')
     

for i, input_text in enumerate(input_texts):
    #print(input_text)
    # Find the input length range to choose the proper model to use
    len_range = max_sent_lengths[-1] # Take the longest range
    for length in max_sent_lengths:
        if(len(input_text) < length):
            len_range = length
            break
    #print(len_range)
    
    input_text = clean_up_sentence(input_text, vocab_to_int[len_range])
    encoder_input_data = vectorize_data(input_texts=[input_text], max_encoder_seq_length=max_encoder_seq_length[len_range], num_encoder_tokens=num_encoder_tokens[len_range], vocab_to_int=vocab_to_int[len_range])
    
    

    target_text = gt_texts[i]
    
    input_seq = encoder_input_data
    #print(input_seq.shape)
    #print(max_decoder_seq_length[len_range])
    #print(max_decoder_seq_length)
    decoded_sentence,_  = decode_sequence(input_seq, encoder_model[len_range], decoder_model[len_range], num_decoder_tokens[len_range],  max_decoder_seq_length[len_range], vocab_to_int[len_range], int_to_vocab[len_range])
    corrected_sentence = word_spell_correct(input_text)
    print('-Lenght = ', len_range)
    print('Input sentence:', input_text)
    print('GT sentence:', target_text.strip())
    print('Char Decoded sentence:', decoded_sentence)   
    print('Word Decoded sentence:', corrected_sentence) 
    results.write(' | ' + input_text + ' | ' + target_text.strip() + ' | ' + decoded_sentence + ' | ' + corrected_sentence + ' | ' + str(len_range) + ' | \n')
    decoded_sentences.append(decoded_sentence)
    corrected_sentences.append(corrected_sentence)
results.close()    

    

-Lenght =  50
Input sentence: claim folder contents
GT sentence: Claim Folder Contents
Char Decoded sentence: Claim Folder Contents
Word Decoded sentence: claim folder contents 
-Lenght =  50
Input sentence: claimant name
GT sentence: Claimant Name:
Char Decoded sentence: Claimant Name
Word Decoded sentence: claimant name 
-Lenght =  50
Input sentence: claim number
GT sentence: Claim Number:
Char Decoded sentence: Claim Number
Word Decoded sentence: claim number 
-Lenght =  50
Input sentence: unauthorized access is strictly prohibited.
GT sentence: Unauthorized access is strictly prohibited.
Char Decoded sentence: Unauthorized access is strictly prohibited.
Word Decoded sentence: unauthorized access is strictly prohibited 
-Lenght =  50
Input sentence: print date 3/13/2018
GT sentence: Print Date: 3/13/2018
Char Decoded sentence: Print Date 3/13/2018
Word Decoded sentence: print date 3/13/2018 
-Lenght =  50
Input sentence: accountability act (hipaa) privacy rule.
GT sentence: Accounta

-Lenght =  100
Input sentence: accident description my son was hit by a vehicle while crossing the street. the driver was given a ticket
GT sentence: Accident Description: My son was hit by a vehicle while crossing the street. The driver was given a ticket
Char Decoded sentence: Accident description myy on was bit by a verive while crave while cracticie was give a sident a dive
Word Decoded sentence: accident description my son was hit by a vehicle while crossing the street the driver was given a ticket 
-Lenght =  50
Input sentence: accident work related
GT sentence: Accident Work Related:
Char Decoded sentence: Accident Work Related
Word Decoded sentence: accident work related 
-Lenght =  50
Input sentence: time of accident
GT sentence: Time of Accident:
Char Decoded sentence: Time of Accident
Word Decoded sentence: time of accident 
-Lenght =  50
Input sentence: accident date
GT sentence: Accident Date:
Char Decoded sentence: Accident Date
Word Decoded sentence: accident date 
-Leng

-Lenght =  50
Input sentence: employee on & off-job acc january 1, 2017
GT sentence: Employee On & Off-Job Acc January 1, 2017
Char Decoded sentence: Employee & On & O-fJob Acc Januar1,2017
Word Decoded sentence: employee on a off-job acc january 1, 2017 
-Lenght =  50
Input sentence: spouse on & off-job acc january 1, 2017
GT sentence: Spouse On & Off-Job Acc January 1, 2017
Char Decoded sentence: Spouse On & Off-Job Acc January 1, 2017
Word Decoded sentence: spouse on a off-job acc january 1, 2017 
-Lenght =  50
Input sentence: child on & off-job acc january 1, 2017
GT sentence: Child On & Off-Job Acc January 1, 2017
Char Decoded sentence: Child On & Off-Job Acc January 1, 2017
Word Decoded sentence: child on a off-job acc january 1, 2017 
-Lenght =  50
Input sentence: total monthly premium $48.15
GT sentence: Total Monthly Premium $48.15
Char Decoded sentence: Total Monthly Premium $48.15
Word Decoded sentence: total monthly premium $48.15 
-Lenght =  50
Input sentence: total employ

-Lenght =  100
Input sentence: (red nacional de prevencion del suicidio)- 1-888-628-9454 co
GT sentence: (Red Nacional de Prevencion del Suicidio)- 1-888-628-9454
Char Decoded sentence: (Red nacional de Prevencion del Suicidio)- 1-888-(
Word Decoded sentence: red national de prevention del suicidio)- 1-888-628-9454 co 
-Lenght =  100
Input sentence: collier county menta! health resources project help crisis & sexual assault hotline- (239)262-7227
GT sentence: Collier County Mental Health Resources: Project HELP Crisis & Sexual Assault Hotline- (239)262-7227
Char Decoded sentence: Coller county mental health resorues provide Procedure care provide s action  
Word Decoded sentence: collier county mental health resources project help crisis a sexual assault hotlines (239)262-7227 
-Lenght =  100
Input sentence: * smoking cessation if you smoke, it is imperative that you stop and one way to do this is to enroll in a smoking cessation program or contact your physician for alternative method

-Lenght =  100
Input sentence: **as always, you are the most important factor in your recovery. please follow the instructions below carefully. take your medicines as prescribed. most important, see & doctor again as discussed. if you have any changes or concerns that we have not discussed, call or visit your doctor right away. if you can't reach your doctor, return to the emergency department.
GT sentence: **AS ALWAYS, YOU ARE THE MOST IMPORTANT FACTOR IN YOUR RECOVERY. Please follow the instructions below carefully. Take your medicines as prescribed. Most important, see a doctor again as discussed. If you have any changes or concerns that we have not discussed, CALL OR VISIT YOUR DOCTOR RIGHT AWAY. If you can't reach your doctor, return to the Emergency Department.
Char Decoded sentence: * *ase your,care the most imporict factor instructions facto, instructions belasery Lele follow the 
Word Decoded sentence: was always you are the most important factor in your recovery please follow

-Lenght =  100
Input sentence: physicain does have the right to be paid for the care provided.
GT sentence: physicain does have the right to be paid for the care provided.
Char Decoded sentence: physicain does have the right to be paid for the care provided
Word Decoded sentence: physicain does have the right to be paid for the care provided 
-Lenght =  50
Input sentence: estimado paciente
GT sentence: Estimado Paciente:
Char Decoded sentence: Estimado Paciente
Word Decoded sentence: estimate patient 
-Lenght =  50
Input sentence: name 10f9
GT sentence: Name: 1 of 9
Char Decoded sentence: Name10 9f 
Word Decoded sentence: name 10f9 
-Lenght =  50
Input sentence: mrn
GT sentence: MRN:
Char Decoded sentence: nem
Word Decoded sentence: man 
-Lenght =  50
Input sentence: name
GT sentence: Name:
Char Decoded sentence: EName
Word Decoded sentence: name 
-Lenght =  50
Input sentence: patient education materials
GT sentence: Patient Education Materials
Char Decoded sentence: Patient education 

-Lenght =  100
Input sentence: * when the swelling has gone away, start using warm compresses. this is a clean cloth that's damp with warm water. apply this to the area for 10 minutes, several times a day.
GT sentence: * When the swelling has gone away, start using warm compresses. This is a clean cloth that's damp with warm water. Apply this to the area for 10 minutes, several times a day.
Char Decoded sentence: * Weng the selling as gone away ,art using warm compressestes whis wark clarm co,m warg a
Word Decoded sentence: a when the swelling has gone away start using warm compresses this is a clean cloth thats damp with warm water apply this to the area for 10 minutes several times a day 
-Lenght =  100
Input sentence: * if your child was given a wrap, follow instructions for how to use it and when to remove it.
GT sentence: * If your child was given a wrap, follow instructions for how to use it and when to remove it.
Char Decoded sentence: * If your child was given a wrap, follow in

-Lenght =  50
Input sentence: + nausea or vomiting
GT sentence: * Nausea or vomiting
Char Decoded sentence: * Nausea or vomiting
Word Decoded sentence: a nausea or vomiting 
-Lenght =  50
Input sentence: ¢ dizziness
GT sentence: * Dizziness
Char Decoded sentence: * Dizziness
Word Decoded sentence: a dizziness 
-Lenght =  50
Input sentence: » sensitivity to light or noise
GT sentence: * Sensitivity to light or noise
Char Decoded sentence: * sensitivity to light or noise
Word Decoded sentence: a sensitivity to light or noise 
-Lenght =  50
Input sentence: » unusual sleepiness or grogginess
GT sentence: * Unusual sleepiness or grogginess
Char Decoded sentence: * Unusual sleepiness or grogginess
Word Decoded sentence: a unusual sleepiness or grogginess 
-Lenght =  50
Input sentence: » trouble falling asleep
GT sentence: * Trouble falling asleep
Char Decoded sentence: * Trouble falling asleep
Word Decoded sentence: a trouble falling asleep 
-Lenght =  50
Input sentence: » personality change

-Lenght =  100
Input sentence: e your child is 3 months old or younger and has a fever of 100.4°f (38°c) or higher. (get medical care right away. fever in a young baby can be a sign of a dangerous infection.)
GT sentence: * Your child is 3 months old or younger and has a fever of 100.4°F (38°C) or higher. (Get medical care right away. Fever in a young baby can be a sign of a dangerous infection.)
Char Decoded sentence: If your child is3 monthh scripter and has and has and has and at3a feverner and has a 
Word Decoded sentence: e your child is 3 months old or younger and has a fever of 100.4°f (38°c) or higher get medical care right away fever in a young baby can be a sign of a dangerous infection 
-Lenght =  100
Input sentence: * your child is younger than 2 years of age and has a fever of 100.4°f (38°c) that lasts formore than 1 day.
GT sentence: * Your child is younger than 2 years of age and has a fever of 100.4°F (38°C) that lasts for more than 1 day.
Char Decoded sentence: * Your 

-Lenght =  50
Input sentence: 480 bedford rd, bldg 600, 2nd floor
GT sentence: 480 BEDFORD RD, BLDG 600, 2ND FLOOR
Char Decoded sentence: 480 BExtred F,breal 600,2nol or
Word Decoded sentence: 480 Bedford rd bldg 600, 2nd floor 
-Lenght =  50
Input sentence: chappaqua, ny 10514-1702
GT sentence: CHAPPAQUA, NY 10514-1702
Char Decoded sentence: ChappaNy ,ame10514-1702
Word Decoded sentence: Chappaqua ny 10514-1702 
-Lenght =  50
Input sentence: to provide insurance formation online
GT sentence: TO PROVIDE INSURANCE INSURANCE INFORMATION ONLINE PLEASE VISIT
Char Decoded sentence: * Provider insurance formation on noile
Word Decoded sentence: to provide insurance formation online 
-Lenght =  50
Input sentence: please visit
GT sentence: PLEASE VISIT
Char Decoded sentence: PLEASE VISIT
Word Decoded sentence: please visit 
-Lenght =  50
Input sentence: https//icollier. payambulance. com
GT sentence: HTTPS://COLLIER.PAYAMBULANCE.COM
Char Decoded sentence: HTTPS//COLLIER.PAYAMBULANCE.OM
Word De

-Lenght =  50
Input sentence: nch healthcare system
GT sentence: NCH Health System
Char Decoded sentence: NCH HEher ICLATE HESpe
Word Decoded sentence: inch healthcare system 
-Lenght =  50
Input sentence: patient accounting departmentq
GT sentence: Patient Accounting Department
Char Decoded sentence: Patient Accounting astematment
Word Decoded sentence: patient accounting departments 
-Lenght =  50
Input sentence: nch » 1311-8
GT sentence: NCH - 1311-8
Char Decoded sentence: NAch H1311-8
Word Decoded sentence: inch a 1311-8 
-Lenght =  50
Input sentence: southwest florida emergency
GT sentence: SOUTHWEST FLORIDA EMERGENCY
Char Decoded sentence: Sorth Hest Formida emergency
Word Decoded sentence: southwest florida emergency 
-Lenght =  50
Input sentence: management inc
GT sentence: MANAGEMENT, INC
Char Decoded sentence: Amagenent Inc
Word Decoded sentence: management inc 
-Lenght =  50
Input sentence: cincinnati, oh 45263-6553
GT sentence: CINCINNATI, OH 45263-6553
Char Decoded sentenc

-Lenght =  100
Input sentence: policy which states benefits are excluded "to any person" who is entitled to no-fault benefits from the owner or insurer of a motor vehicle which is not an insured motor vehicle under this insurance. you will need to report a personal injury claim to your resident relative’s primary auto insurance carrier.
GT sentence: policy which states benefits are excluded "to any person" who is entitled to no-fault benefits from the owner or insurer of a motor vehicle which is not an insured motor vehicle under this insurance. You will need to report a personal injury claim to your resident relative’s primary auto insurance carrier.
Char Decoded sentence: Policy which states benefits are exclude t" nailt in any personthin to no from
Word Decoded sentence: policy which states benefits are excluded to any person who is entitled to no-fault benefits from the owner or insurer of a motor vehicle which is not an insured motor vehicle under this insurance you will need to r

In [None]:
input_texts = ['Unum Life  Insurance Company of America 2211',               
               'Congress Street Portland, Maine 04122',
               'APPLICATION FOR GROUP CRITICAL LLNESS INSURANCE',
               'I Evidence of Insurability',
               '',
               'Application Type: @ New Enrollee Change to',
               'Existing Coverage  Reinstatement  Internal',
               'Replacement  Late Applicant  Rehire SECTION 1:',
               'Employee(Applicant) Information  Always',
               'Complete Employee Name(First, Middle, Last)',
               'Social Security Number Nikolas J Jones',
               '123 - 456 - 7890 Home Address(Street/ PO Box)',
               'Gender 1634 Stewert St  F  M City Date of Birth',
               '(mm / dd / yyyy) Seattle 06 / 15 / 1991 State Zip',
               'Code Home Phone # Washington 98101 854-555-1212',
               'Are you Actively at Work? Employee ID / Payroll #',
               ' Yes  No55624 a.Are you a U.S.Citizen or',
               'Canadian Citizen working in the U.S.? b.Are you',
               'legally authorized to work in  Yes  No(If No',
               'reply to part b) the U.S.?  Yes  No Employer',
               'Name Group Number Date of Hire(mm/ dd / yyyy)',
               'Facebook 11 - 555566 11 / 30 / 2016 Occupation',
               'Eligibility Class Software Engineer 7 Scheduled',
               'Number of Work Hours per Week Work Phone # 35',
               '854-555-6622 SECTION 2: Spouse Information ',
               'Complete Only if applying for Spouse coverage Name',
               '(First, Middle, Last) Social Security Number',
               'Gender Date of Birth(mm / dd / yyyy) Does the',
               '1019 - 07 - AZ 1',
              'if claint is for a child, please state your relationship 10 the child',
              'date of accident 3d _ time of accident ram. 0 p.m.',
              'have you slopped working? (of yes [1 no if yes, what was the last day that you worked? (mm/ddryy)_| —3 | —{% cnslamegs bil =']
               
for input_text in input_texts:
    len_range = max_sent_lengths[-1] # Take the longest range
    for length in max_sent_lengths:
        if(len(input_text) < length):
            len_range = length
            break
    #print(len_range)
    pre_corrected_sentence = word_spell_correct(input_text)
    input_text = clean_up_sentence(input_text, vocab_to_int[len_range])
    encoder_input_data = vectorize_data(input_texts=[input_text], max_encoder_seq_length=max_encoder_seq_length[len_range], num_encoder_tokens=num_encoder_tokens[len_range], vocab_to_int=vocab_to_int[len_range])



    target_text = gt_texts[i]

    input_seq = encoder_input_data
    #print(input_seq.shape)
    #print(max_decoder_seq_length[len_range])
    #print(max_decoder_seq_length)

    decoded_sentence,_  = decode_sequence(input_seq, encoder_model[len_range], decoder_model[len_range], num_decoder_tokens[len_range],  max_decoder_seq_length[len_range], vocab_to_int[len_range], int_to_vocab[len_range])
    corrected_sentence = word_spell_correct(input_text)
    #print('-Lenght = ', len_range)
    print('Input sentence:', input_text)
    #print('Spell Decoded sentence:', pre_corrected_sentence) 
    print('Char Decoded sentence:', decoded_sentence)   
    print('Word Decoded sentence:', corrected_sentence) 
    print('\n')



In [25]:
WER_spell_correction = calculate_WER(gt_texts, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

WER_spell_correction |TEST=  0.110294492931


In [26]:
WER_spell_word_correction = calculate_WER(gt_texts, corrected_sentences)
print('WER_spell_word_correction |TEST= ', WER_spell_word_correction)

WER_spell_word_correction |TEST=  0.0625733680396


In [27]:
WER_OCR = calculate_WER(gt_texts, input_texts)
print('WER_OCR |TEST= ', WER_OCR)

WER_OCR |TEST=  0.031950186291
