# Introduction

We tackle the problem of OCR post-processing. OCR maps the image form of a document into the text domain. This is done first using a CNN+LSTM+CTC model, in our case based on Tesseract. Since this model only maps image to text, we need something on top to validate and correct the language semantics.

The idea is to build a language model that takes the OCRed text and corrects it based on language knowledge. The language model could be:
- Char level: the aim is to capture word morphology, in which case it acts like a spelling correction system.
- Word level: the aim is to capture sentence semantics, but such systems suffer from the out-of-vocabulary (OOV) problem.
- Fusion: the aim is to capture both semantic and morphological language rules. The output has to be at char level to avoid the OOV problem. However, the input can be char, word, or both.

The target of the fusion model is to learn:

    p(char | char_context, word_context)
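
One way to read this target, assuming the decoder is autoregressive (as in the seq2seq model used here), is as a factorisation of the probability of the corrected character sequence. The symbols $y_t$, $c$ and $w$ are notation introduced only for this illustration:

$$
p(y_1, \dots, y_T \mid c, w) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, c, w)
$$

where $y_t$ is the $t$-th output character, $c$ is the char context and $w$ is the word context of the OCRed input.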

In this workbook we use a vanilla Keras seq2seq implementation, adapted from the lstm_seq2seq example for the English-French translation task. The adaptation involves:

- Adapt to spelling correction, at char level
- Pre-train on noisy medical sentences
- Fine-tune a residual, to correct the mistakes of Tesseract
- Limit the input and output sequence lengths
- Ensure a teacher-forced, autoregressive model in the decoder
- Limit the padding per batch
- Learning rate schedule
- Bi-directional LSTM Encoder
- Bi-directional GRU Encoder
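
The teacher forcing item in the list above can be illustrated with a small sketch. It assumes the decoder target is the decoder input shifted by one step, as in the Keras lstm_seq2seq example. The helper `teacher_forcing_pair` and the toy vocabulary are hypothetical and are not taken from this notebook's training code.

```python
import numpy as np

def teacher_forcing_pair(target_ids):
    """Split one target sequence into decoder input and decoder target.

    The decoder input keeps the start token and drops the last token;
    the decoder target is the same sequence shifted left by one step.
    """
    target_ids = np.asarray(target_ids)
    decoder_input = target_ids[:-1]
    decoder_target = target_ids[1:]
    return decoder_input, decoder_target

# Toy vocabulary (assumption): '\t' starts a sequence, '\n' ends it.
toy_vocab = {'\t': 1, '\n': 2, 'a': 3, 'b': 4}
ids = [toy_vocab[ch] for ch in '\tab\n']
dec_in, dec_out = teacher_forcing_pair(ids)
print(dec_in, dec_out)
```

At each step the decoder sees the true previous character and is trained to predict the next one; at inference time the previous prediction is fed back instead.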


# Imports

In [1]:
from __future__ import print_function
import tensorflow as tf
import keras.backend as K
from keras.backend.tensorflow_backend import set_session
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Bidirectional, Concatenate, GRU
from keras import optimizers
from keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler
from keras.models import load_model
import numpy as np
import os
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


# Utility functions

In [2]:
# Limit gpu allocation. allow_growth, or gpu_fraction
def gpu_alloc():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))

In [3]:
gpu_alloc()

In [4]:
def calculate_WER_sent(gt, pred):
    '''
    Word-level edit distance between a ground-truth sentence and a prediction.
    Example: calculate_WER_sent('calculating wer between two sentences', 'calculate wer between two sentences')
    '''
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    # Use a wide integer type: uint8 would overflow for distances above 255.
    d = np.zeros((len(gt_words) + 1, len(pred_words) + 1), dtype=np.int32)

    # Initializing error matrix
    for i in range(len(gt_words) + 1):
        for j in range(len(pred_words) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(gt_words) + 1):
        for j in range(1, len(pred_words) + 1):
            if gt_words[i - 1] == pred_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    return d[len(gt_words)][len(pred_words)]

In [5]:
def calculate_WER(gt, pred):
    '''
    :param gt: list of ground-truth sentences
    :param pred: list of predicted sentences
    both lists must have the same length
    :return: accumulated word-level edit distance divided by the accumulated length of the ground truth
    '''
    # assert len(gt) == len(pred)
    WER = 0
    nb_w = 0
    for i in range(len(gt)):
        WER += calculate_WER_sent(gt[i], pred[i])
        # Note: len(gt[i]) is the number of characters in the sentence, not the number of words.
        nb_w += len(gt[i])

    return WER / nb_w
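
`calculate_WER` divides the accumulated edit distance by `len(gt[i])`, which is a character count. For comparison, here is a minimal sketch of the conventional definition, in which the distance is divided by the number of ground-truth words. The names `word_edit_distance` and `corpus_wer` are hypothetical. The sketch splits on any whitespace, which differs slightly from the `split(' ')` used above.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words (row-by-row DP)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp in enumerate(hyp_words, start=1):
            cost = 0 if ref == hyp else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            curr[j - 1] + 1,     # insertion
                            prev[j] + 1))        # deletion
        prev = curr
    return prev[-1]

def corpus_wer(gt_sentences, pred_sentences):
    """Total word errors divided by the total number of ground-truth words."""
    errors = 0
    n_words = 0
    for gt, pred in zip(gt_sentences, pred_sentences):
        gt_words = gt.lower().split()
        errors += word_edit_distance(gt_words, pred.lower().split())
        n_words += len(gt_words)
    return errors / n_words if n_words else 0.0

print(corpus_wer(['calculating wer between two sentences'],
                 ['calculate wer between two sentences']))
```

Numbers produced this way are not directly comparable with those reported by `calculate_WER`, since the denominators differ.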

In [6]:
def load_data_with_gt(file_name, num_samples, max_sent_len, min_sent_len, delimiter='\t', gt_index=1, prediction_index=0):
    '''Load data from a text file in which each line has the form <TXT><delimiter><GT> (tab-separated by default). The target fed to the decoder must have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0
    input_texts = []
    gt_texts = []
    target_texts = []
    with open(file_name, encoding='utf8') as f:
        for row in f:
            if cnt >= num_samples:
                break
            sents = row.split(delimiter)
            input_text = sents[prediction_index]
            # Wrap the ground truth in the start (\t) and stop (\n) triggers.
            target_text = '\t' + sents[gt_index] + '\n'
            if min_sent_len < len(input_text) < max_sent_len and min_sent_len < len(target_text) < max_sent_len:
                cnt += 1
                input_texts.append(input_text)
                target_texts.append(target_text)
                gt_texts.append(sents[gt_index])
    return input_texts, target_texts, gt_texts
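
Because each row is split without stripping the line ending, the last field of a row keeps the file's trailing newline. The sketch below shows how a single row could be parsed with the line ending removed first. `split_pair` is a hypothetical helper and not part of the loader above; it is meant only to make the start and stop triggers explicit.

```python
def split_pair(row, delimiter='\t', gt_index=1, prediction_index=0):
    """Parse one '<TXT><delimiter><GT>' row into (input, target, gt)."""
    # Drop the line ending so it does not leak into the last field.
    fields = row.rstrip('\r\n').split(delimiter)
    input_text = fields[prediction_index]
    gt_text = fields[gt_index]
    # '\t' is the start trigger and '\n' the stop trigger for the decoder.
    target_text = '\t' + gt_text + '\n'
    return input_text, target_text, gt_text

print(split_pair('Me dieal Provider\tMedical Provider\n'))
```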

In [7]:
def load_data(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load data from a text file in which each line is one input sentence (no ground truth).'''
    cnt = 0
    input_texts = []

    with open(file_name) as f:
        for row in f:
            if cnt >= num_samples:
                break
            input_text = row
            if min_sent_len < len(input_text) < max_sent_len:
                cnt += 1
                input_texts.append(input_text)
    return input_texts

In [8]:
def vectorize_data(input_texts, max_encoder_seq_length, num_encoder_tokens, vocab_to_int):
    '''Convert the input texts into a zero-padded array of character indices for the seq2seq encoder'''
    encoder_input_data = np.zeros(
        (len(input_texts), max_encoder_seq_length),
        dtype='float32')

    for i, input_text in enumerate(input_texts):
        for t, char in enumerate(input_text):
            # c0..cn
            encoder_input_data[i, t] = vocab_to_int[char]

    return encoder_input_data
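
To make the output of `vectorize_data` concrete, here is a small self-contained sketch with a toy vocabulary. The vocabulary, the texts and the helper `toy_vectorize` are all hypothetical. Reserving index 0 for padding is an assumption of this sketch, since the contents of the real `vocab_to_int` are not shown here.

```python
import numpy as np

def toy_vectorize(texts, max_len, vocab_to_int):
    """Map each text to a row of character indices, zero-padded on the right."""
    data = np.zeros((len(texts), max_len), dtype='float32')
    for i, text in enumerate(texts):
        for t, char in enumerate(text):
            data[i, t] = vocab_to_int[char]
    return data

# Indices start at 1 so that 0 can stand for padding (assumption).
toy_vocab_to_int = {c: i + 1 for i, c in enumerate(sorted(set('abc ')))}
encoded = toy_vectorize(['ab', 'c a'], max_len=4, vocab_to_int=toy_vocab_to_int)
print(encoded)
```

Each row has the same length, so shorter sentences end in zeros. A character missing from the vocabulary would raise a `KeyError`.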

In [9]:
def decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab):
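    '''Greedily decode one input sequence (batch of size 1) and return the decoded sentence and the attention weights per step.
    Note: relies on the notebook-level globals vocab_to_int and max_decoder_seq_length.'''
    # Encode the input as state vectors.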
    encoder_outputs, h, c  = encoder_model.predict(input_seq)
    states_value = [h,c]
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = vocab_to_int['\t']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    #print(input_seq)
    attention_density = []
    i = 0
    special_chars = ['\\', '/', '-', '—' , ':', '[', ']', ',', '.', '"', ';', '%', '~', '(', ')', '{', '}', '$']
    while not stop_condition:
        #print(target_seq)
        output_tokens, attention, h, c  = decoder_model.predict(
            [target_seq, encoder_outputs] + states_value)
        #print(attention.shape)
        # attention is max_sent_len x 1, since the output has num_time_steps = 1
        attention_density.append(attention[0][0])
        # Sample a token
        #print(output_tokens.shape)
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        
        #print(sampled_token_index)
        sampled_char = int_to_vocab[sampled_token_index]
        orig_char = int_to_vocab[int(input_seq[:,i][0])]
        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
            sampled_char = ''

        # Copy digits and special characters from the input as is, since the spelling corrector is not good at correcting them
        if(orig_char.isdigit() or orig_char in special_chars):
            decoded_sentence += orig_char            
        else:
            if(sampled_char.isdigit() or sampled_char in special_chars):
                decoded_sentence += ''
            else:
                decoded_sentence += sampled_char
        


        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]
        
        # Advance the pointer into the input sequence, wrapping back to 0 once it exceeds 48 (hard-coded)
        i += 1
        if(i > 48):
            i = 0
    attention_density = np.array(attention_density)
    return decoded_sentence, attention_density
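
The core of `decode_sequence` is a greedy loop: start from the start character, take the most likely next character at each step, and stop at the stop character or at a length limit. The sketch below isolates that loop with a scripted stand-in for the decoder. `toy_step`, `greedy_decode` and the toy vocabulary are hypothetical. Attention and the rule that copies digits and special characters from the input are left out on purpose.

```python
import numpy as np

START, STOP = '\t', '\n'
toy_int_to_vocab = {0: START, 1: STOP, 2: 'h', 3: 'i'}
toy_vocab_to_int = {c: i for i, c in toy_int_to_vocab.items()}

def toy_step(prev_token, state):
    """Stand-in for decoder_model.predict: emits 'h', 'i', then STOP."""
    script = ['h', 'i', STOP]
    probs = np.zeros(len(toy_int_to_vocab))
    probs[toy_vocab_to_int[script[min(state, len(script) - 1)]]] = 1.0
    return probs, state + 1

def greedy_decode(step_fn, max_len):
    """Feed back the argmax token until STOP or max_len characters."""
    token = toy_vocab_to_int[START]
    state = 0
    decoded = ''
    while True:
        probs, state = step_fn(token, state)
        token = int(np.argmax(probs))
        char = toy_int_to_vocab[token]
        if char == STOP or len(decoded) >= max_len:
            break
        decoded += char
    return decoded

print(greedy_decode(toy_step, max_len=10))
```

Greedy decoding keeps only the single best character per step. Beam search would keep several candidates, at a higher cost.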


# Load data

# Load model params

In [10]:
data_path = '../../dat/'

In [11]:
vocab_file = 'vocab.npz'
model_file = 'best_model.hdf5'
encoder_model_file = 'encoder_model.hdf5'
decoder_model_file = 'decoder_model.hdf5'

In [12]:
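# The vocabularies are stored as pickled dicts, which NumPy 1.16.3 and later only load with allow_pickle=True.
vocab = np.load(file=vocab_file, allow_pickle=True)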
vocab_to_int = vocab['vocab_to_int'].item()
int_to_vocab = vocab['int_to_vocab'].item()
max_sent_len = vocab['max_sent_len']
min_sent_len = vocab['min_sent_len']
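
The `.item()` calls above suggest that the dictionaries were saved as 0-d object arrays inside the `.npz` archive. The sketch below shows one way such a file could be written and read back. The file name, the toy dictionaries and the length limits (50 and 4) are arbitrary assumptions, not the values used for `vocab.npz`.

```python
import os
import tempfile
import numpy as np

toy_vocab_to_int = {'\t': 0, '\n': 1, 'a': 2}
toy_int_to_vocab = {i: c for c, i in toy_vocab_to_int.items()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_vocab.npz')
    # Dicts are stored as pickled 0-d object arrays.
    np.savez(path,
             vocab_to_int=toy_vocab_to_int,
             int_to_vocab=toy_int_to_vocab,
             max_sent_len=50,
             min_sent_len=4)
    # allow_pickle=True is needed to read the object arrays back.
    with np.load(path, allow_pickle=True) as vocab:
        loaded_vocab_to_int = vocab['vocab_to_int'].item()
        loaded_int_to_vocab = vocab['int_to_vocab'].item()
        loaded_max_sent_len = int(vocab['max_sent_len'])

print(loaded_vocab_to_int, loaded_max_sent_len)
```

Converting the stored lengths with `int(...)` gives plain Python integers instead of 0-d arrays, which is convenient when they are later used as array shapes.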



In [13]:
input_characters = sorted(list(vocab_to_int))
num_decoder_tokens = num_encoder_tokens = len(input_characters) #int(encoder_model.layers[0].input.shape[2])
max_encoder_seq_length = max_decoder_seq_length = max_sent_len - 1#max([len(txt) for txt in input_texts])


In [14]:
num_samples = 1000000
#tess_correction_data = os.path.join(data_path, 'test_data.txt')
#input_texts = load_data(tess_correction_data, num_samples, max_sent_len, min_sent_len)

OCR_data = os.path.join(data_path, 'new_trained_data.txt')
#input_texts, target_texts, gt_texts = load_data_with_gt(OCR_data, num_samples, max_sent_len, min_sent_len, delimiter='|',gt_index=0, prediction_index=1)
input_texts, target_texts, gt_texts = load_data_with_gt(OCR_data, num_samples, max_sent_len, min_sent_len)

In [15]:
# Sample data
print(len(input_texts))
for i in range(10):
    print(input_texts[i], '\n', target_texts[i])

1451
Me dieal Provider Roles: Treating  
 	Medical Provider Roles: Treating


Provider First Name: Christine  
 	Provider First Name: Christine


Provider Last Name: Nolen, MD  
 	Provider Last Name: Nolen, MD


Address Line 1 : 7 25 American Avenue  
 	Address Line 1 : 725 American Avenue


City. W’aukesha  
 	City: Waukesha


StatefProvinee: ‘WI  
 	State/Province: WI


Postal Code: 5 31 88  
 	Postal Code: 53188


Country". US  
 	Country:  US


Business Telephone: (2 62) 92 8- 1000  
 	Business Telephone: (262) 928- 1000


Date ot‘Pirst Visit: 1 2/01f20 17  
 	Date of First Visit: 12/01/2017




In [16]:
#model.load_weights(model_file)

model = load_model(model_file)

In [17]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 115)    13225       input_1[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, None, 512),  761856      embedding_1[0][0]                
__________________________________________________________________________________________________
embedding_

In [18]:
encoder_model = load_model(encoder_model_file)
decoder_model = load_model(decoder_model_file)



In [19]:
encoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 115)    13225       input_1[0][0]                    
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, None, 512),  761856      embedding_1[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________

In [20]:
decoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 115)    13225       input_2[0][0]                    
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 512)          0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, 512)          0                                            
__________________________________________________________________________________________________
lstm_2 (LS

In [21]:


encoder_input_data = vectorize_data(input_texts=input_texts, max_encoder_seq_length=max_encoder_seq_length, num_encoder_tokens=num_encoder_tokens, vocab_to_int=vocab_to_int)

# Sample output from train data
decoded_sentences = []

for seq_index in range(len(input_texts)):
    # Take one sequence (part of the training set)
    # for trying out decoding.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    target_text = gt_texts[seq_index]
    decoded_sentence,_  = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    
    print('-')
    print('Input sentence:', input_texts[seq_index])
    print('GT sentence:', target_text)
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    


-
Input sentence: Me dieal Provider Roles: Treating 
GT sentence: Medical Provider Roles: Treating

Decoded sentence: Medical Provider Roles:Treating
-
Input sentence: Provider First Name: Christine 
GT sentence: Provider First Name: Christine

Decoded sentence: Provider First Name: Christine 
-
Input sentence: Provider Last Name: Nolen, MD 
GT sentence: Provider Last Name: Nolen, MD

Decoded sentence: Provider Last Name: Nolen, MD 
-
Input sentence: Address Line 1 : 7 25 American Avenue 
GT sentence: Address Line 1 : 725 American Avenue

Decoded sentence: Address Line 1 : 7a25 nuerical Avenue
-
Input sentence: City. W’aukesha 
GT sentence: City: Waukesha

Decoded sentence: City. Wauksha e 
-
Input sentence: StatefProvinee: ‘WI 
GT sentence: State/Province: WI

Decoded sentence: StateProvince: WI L 
-
Input sentence: Postal Code: 5 31 88 
GT sentence: Postal Code: 53188

Decoded sentence: Postal Code: 5 31 88 
-
Input sentence: Country". US 
GT sentence: Country:  US

Decoded sentence:

-
Input sentence: 1 NOLEN, CHRISTINE, M. D. 
GT sentence: 1 NOLEN, CHRISTINE, M. D.

Decoded sentence: 1 NOLEN, CHRISTINE, M. D. D L 
-
Input sentence: COMMENTS 
GT sentence: COMMENTS

Decoded sentence: COMMENTS
-
Input sentence: PRIMARY INSUR: UMR FISERV WI 
GT sentence: PRIMARY INSUR: UMR FISERV WI

Decoded sentence: PRIMARY INSUR: UMR FISER WI N 
-
Input sentence: SECONDARY INSUR: 
GT sentence: SECONDARY INSUR:

Decoded sentence: SECONDARY INSUR:
-
Input sentence: EMERGENCY MEDICAL ASSOCIATES 
GT sentence: EMERGENCY MEDICAL ASSOCIATES

Decoded sentence: EMERGENCY MEDICAL ASSOCIATES
-
Input sentence: PHONE: 
GT sentence: PHONE:

Decoded sentence: PHONE:
-
Input sentence: Web user notes: 
GT sentence: Web user notes:

Decoded sentence: Web user notes:
-
Input sentence: medical statements 
GT sentence: medical statements

Decoded sentence: Emedical statements 
-
Input sentence: unum‘D 
GT sentence: unum

Decoded sentence: unum
-
Input sentence: . . O The Benefits Center 
GT sentence: T

-
Input sentence: 1. ACL tear. 
GT sentence: 1. ACL tear.

Decoded sentence: 1. ACL tear. 
-
Input sentence: 5. Patellar apical grade 1-2 chondromalacia. 
GT sentence: 5. Patellar apical grade 1-2 chondromalacia.

Decoded sentence: 5. Patellar apical grade 1-2 chondromalacia.
-
Input sentence: Diagnosis 
GT sentence: Diagnosis

Decoded sentence: Diagnosis 
-
Input sentence: Right knee ACL rupture and high grade MCL tear 
GT sentence: Right knee ACL rupture and high grade MCL tear

Decoded sentence: Right knee ACL rupturg Med tor Lear high grade
-
Input sentence: Plan 
GT sentence: Plan

Decoded sentence: Plan 
-
Input sentence: Health Maintenance 
GT sentence: Health Maintenance

Decoded sentence: Health Maintenance 
-
Input sentence: DiscussionlSumrnary 
GT sentence: Discussion/Sumrnary

Decoded sentence: DiscussionSumrnary 
-
Input sentence: Scribe - Statements 
GT sentence: Scribe - Statements

Decoded sentence: Scribe - Statements 
-
Input sentence: Signatures 
GT sentence: Signatu

-
Input sentence: MINNESOTA VALLEY SURGERY CENTER 
GT sentence: MINNESOTA VALLEY SURGERY CENTER

Decoded sentence: IMENT ITALE VALLEY SURGERY CENTER
-
Input sentence: OPERATIVE REPORT 
GT sentence: OPERATIVE REPORT

Decoded sentence: OPERATIVE REPORT
-
Input sentence: MR #: 
GT sentence: MR #:

Decoded sentence: MR #: 
-
Input sentence: SURGEON: JASON HOLM, M.D. 
GT sentence: SURGEON: JASON HOLM, M.D. 

Decoded sentence: SURGEON: JASON HOLM, M.D. 
-
Input sentence: DATE: 02/02/2018 
GT sentence: DATE: 02/02/2018

Decoded sentence: DATE: 02/02/2018
-
Input sentence: 05/09/1980 
GT sentence: 05/09/1980

Decoded sentence: 05/09/1980 ?
-
Input sentence: PREOPERATIVE DIAGNOSES: 
GT sentence: PREOPERATIVE DIAGNOSES:

Decoded sentence: PREOPERATIVE DIAGNOSES:S
-
Input sentence: 1. Right knee anterior cruciate ligament tear. 
GT sentence: 1. Right knee anterior cruciate ligament tear.

Decoded sentence: 1. Right knee anterior cruciate ligament tear.
-
Input sentence: 2. Me dial collateral liga

-
Input sentence: Electronically signed by : David Felvor, PA~C:
GT sentence: Electronically signed by : David Feivor, PA~C:

Decoded sentence: Electronically Signed by:David Fellor,PA~r:k
-
Input sentence: oTé'iLNog'aiiiés 
GT sentence: TWIN CITIES ORTHOPEDICS

Decoded sentence: Total Nong Disiine 
-
Input sentence: MINNESOTA VALLEY SURGERY CENTER 
GT sentence: MINNESOTA VALLEY SURGERY CENTER

Decoded sentence: IMENT ITALE VALLEY SURGERY CENTER
-
Input sentence: OPERATIVE REPORT ' 
GT sentence: OPERATIVE REPORT

Decoded sentence: OPERATIVE REPORT #
-
Input sentence: SURGEON: JASON HOLM, MD. 
GT sentence: SURGEON: JASON HOLM, M.D. 

Decoded sentence: SURGEON: JASON HOLM, M. 
-
Input sentence: Jason Holm, M.D. 
GT sentence: Jason Holm, M.D.

Decoded sentence: Jason Holm, M.D. 
-
Input sentence: OPERATIVE REPORT - PAGE 2. of 2 ‘ 
GT sentence: OPERATIVE REPORT - PAGE 2 of 2

Decoded sentence: OPERATIVE REPORT - PAGE 2.of 2 PAGE  
-
Input sentence: I Family history of Cancer (080.1) 
GT se

-
Input sentence: Address line 1 - 
GT sentence: Address line 1 -

Decoded sentence: Address line 1 -                                  
-
Input sentence: Address line 2 — 
GT sentence: Address line 2 -

Decoded sentence: Address line 2 —                                  
-
Input sentence: City - 
GT sentence: City -

Decoded sentence: City - #
-
Input sentence: State - NC 
GT sentence: State - NC

Decoded sentence: State - NC
-
Input sentence: Speciality — PCP 
GT sentence: Speciality - PCP

Decoded sentence: Speciality — P C P C  P  P C  P C  P C  P C  P  P 
-
Input sentence: Add another doctor — no 
GT sentence: Add another doctor - no

Decoded sentence: Add another doc  n—  no
-
Input sentence: Physician authorization - mail 
GT sentence: Physician authorization - mail

Decoded sentence: Physician authorization - mail 
-
Input sentence: Home Email — 
GT sentence: Home Email -

Decoded sentence: Home Email —  & 
-
Input sentence: Register for Claim Self Service — no 
GT sentence: Reg

-
Input sentence: Work and live same state — yes 
GT sentence: Work and live same state - yes

Decoded sentence: Work and live seme state — yes
-
Input sentence: Work from home — no 
GT sentence: Work from home - no

Decoded sentence: Work fro  ome —no f 
-
Input sentence: Does schedule vary? - yes 
GT sentence: Does schedule vary? - yes

Decoded sentence: Dowes cemedule vary?- yes
-
Input sentence: How does it vary — hours and days vary 
GT sentence: How does it vary - hours and days vary

Decoded sentence: How does it vary — hours and days vary
-
Input sentence: Verified hrs worked per week avg — 100.00 
GT sentence: Verified hrs worked per week avg - 100.00

Decoded sentence: Verified hrs worked per week ave — 100.00 
-
Input sentence: Employment Details Comments - 
GT sentence: Employment Details Comments -

Decoded sentence: Employment Details Comments - 
-
Input sentence: Employer aware of absence? — yes 
GT sentence: Employer aware of absence? - yes

Decoded sentence: Employer a

-
Input sentence: Procedure Description
GT sentence: Procedure Description

Decoded sentence: Procedure Description
-
Input sentence: It yes, please provide the following:
GT sentence: If yes, please provide the following:

Decoded sentence: Insure, please provide the followin:
-
Input sentence: DIEQHDSIEI “m—
GT sentence: Diagnosis:

Decoded sentence: DIE HESICIAL —I
-
Input sentence: Treatment Dates: . ‘ l I)
GT sentence: Treatment Dates:

Decoded sentence: Treatment Dates: .ates )
-
Input sentence: Did you advise the patient to stop working?
GT sentence: Did you advise the patient to stop working?

Decoded sentence: Did you advise the patient to stop working?
-
Input sentence: lfyee. as of what date”? (mrna'ddiyy)
GT sentence: If yes, as of what date”? (mm/dd/yy)

Decoded sentence: Clect.es as of what date?( mmddyy)
-
Input sentence: ExPeoled Dei‘ ery _ate (mmlddiyy)
GT sentence: Expected Delivery Date (mm/dd/yy) 

Decoded sentence: ExPectied Devi Dev mm(ddyy
-
Input sentence: Actua

-
Input sentence: Transaction identiﬁer: 
GT sentence: Transaction identifier: 

Decoded sentence: Transaction identifie: 
-
Input sentence: Patient identiﬁer: 
GT sentence: Patient identifier:

Decoded sentence: Patient identifie: 
-
Input sentence: Subtotal: 
GT sentence: Subtotal:

Decoded sentence: Subtotal:
-
Input sentence: Sales Tax: 
GT sentence: Sales Tax:

Decoded sentence: Sales Tax: 
-
Input sentence: Toto I: 
GT sentence: Total:

Decoded sentence: Toton :n
-
Input sentence: [customer copy) 
GT sentence: (customer copy) 

Decoded sentence: [customer copy) 
-
Input sentence: ORTHOATLANTA, L.L.C. 
GT sentence: ORTHOATLANTA, L.L.C.

Decoded sentence: ORTHOATLANTA, L.L.C.
-
Input sentence: piease send pa ments to: 
GT sentence: please send payments to: 

Decoded sentence: ppease send payments to: 
-
Input sentence: ORTHDATLA A. Lu: 
GT sentence: ORTHOATLANTA, L.L.C.

Decoded sentence: ORTHOATLATL. Lu: AL Lu 
-
Input sentence: billm -hone: 
GT sentence: billing phone:

Decoded s

-
Input sentence: Piedmont Healllleare 
GT sentence: Piedmont Healthcare

Decoded sentence: Predmont Health Name are 
-
Input sentence: PO Box I000 
GT sentence: PO Box 1000

Decoded sentence: PO Box I000
-
Input sentence: Piscataway, NJ 03855-1000 
GT sentence: Piscataway, NJ 08855-1000

Decoded sentence: Piscataway, NJ 03855-1000 
-
Input sentence: Electronic Service Requested 1—377-601-3835 
GT sentence: Electronic Service Requested

Decoded sentence: Electronic Service Requested 1—377-601-383
-
Input sentence: MyHeaIthBéO" Iggy"; 
GT sentence: MyHealth360° PHC % Me

Decoded sentence: MyHeasthPh"D R g"; 
-
Input sentence: p—IEDMONT HEALTHéARé“ 
GT sentence: PIEDMONT HEALTHCARE

Decoded sentence: A—EDIONT HEALTH RAD F
-
Input sentence: Group No: 
GT sentence: Group No:

Decoded sentence: Group No: 
-
Input sentence: Date: 
GT sentence: Date:

Decoded sentence: Date: 
-
Input sentence: Explanation of Benefits 
GT sentence: Explanation of Benefits 

Decoded sentence: Expalanation of Be

-
Input sentence: Insured Coverme TIE W Coverage 
GT sentence: Insured Coverage Type Coverage Effective Date

Decoded sentence: Insured Coverage Type Coverage Effective Date
-
Input sentence: Employee Off—Job Acc January 1, 2018 
GT sentence: Employee Off-Job Acc January 1, 2018

Decoded sentence: Employee Off—Job Acc January 1, 2018
-
Input sentence: Spouse Off-Job Acc January 1, 2018 
GT sentence: Spouse Off-Job Acc January 1, 2018

Decoded sentence: Spouse Off-Job Acc January 1, 2018
-
Input sentence: Total Monthly Premium: 
GT sentence: Total Monthly Premium:

Decoded sentence: Total Monthly Premium:
-
Input sentence: Total Employee Bi—Weekly Payroll Deduction: 
GT sentence: Total Employee Bi-Weekly Payroll Deduction:

Decoded sentence: Total Employee Bi—Weekly Payroll Deduction:
-
Input sentence: unum‘t 
GT sentence: unum

Decoded sentence: unum
-
Input sentence: . O . ACCIDENT CLAIM FORM 
GT sentence: ACCIDENT CLAIM FORM

Decoded sentence: .CCI.ENT CLAIM FORM  OR M FOR 
-
Input s

-
Input sentence: Surgical rocedure CPT Code: 
GT sentence: Surgical Procedure CPT Code:

Decoded sentence: Surgical Produre CPT Code:
-
Input sentence: C. Signature of Attending Physician 
GT sentence: C. Signature of Attending Physician

Decoded sentence: C. Signature of Attending Physician
-
Input sentence: Medical Specialty Degree 
GT sentence: Medical Specialty Degree

Decoded sentence: Medical Specialty Degree
-
Input sentence: Address 
GT sentence: Address

Decoded sentence: Address
-
Input sentence: City 2 I Slate Zip 3/] 
GT sentence: City State Zip

Decoded sentence: City 2tate Zip Zip 3/]d
-
Input sentence: Physician Signature i Date 
GT sentence: Physician Signature Date

Decoded sentence: Physician Signature Date mmddyy 
-
Input sentence: CL-1023 (0611 3) 2 
GT sentence: CL-1023 (06/13) 2

Decoded sentence: CL-1023(06113) 2 
-
Input sentence: DATE: 03/07/2018 
GT sentence: DATE: 03/07/2018

Decoded sentence: DATE: 03/07/2018
-
Input sentence: ACCOUNT No . 
GT sentence: ACC

Input sentence: PruducL: Long Team Disability
GT sentence: Product: Long Term Disability

Decoded sentence: Product: Long Telability
-
Input sentence: Pruduct Type: Flex
GT sentence: Product Type: Flex

Decoded sentence: Product Type: Flex
-
Input sentence: funding: Fully lnsurcd
GT sentence: Funding: Fully Insured

Decoded sentence: Uunding: Fully lansured
-
Input sentence: State Plan: No
GT sentence: State Plan: No

Decoded sentence: State Plan: No
-
Input sentence: Employee Coverage: Yes
GT sentence: Employee Coverage: Yes

Decoded sentence: Employee Coverage: Yes
-
Input sentence: Emplayc: Covcxagc: Yes
GT sentence: Employer Coverage: Yes

Decoded sentence: Employe: Covesage: Yes
-
Input sentence: Policy N0,
GT sentence: Policy No.

Decoded sentence: Policy N0,
-
Input sentence: DiviSLon:
GT sentence: Division:

Decoded sentence: Division:
-
Input sentence: Chainn.
GT sentence: Choice.

Decoded sentence: Chain .od
-
Input sentence: Eff Date:
GT sentence: Eff Date:

Decoded sentence

-
Input sentence: SSN xxxexxvmocx 
GT sentence: SSN xxx-xx-xxxx

Decoded sentence: SSN xxxxx Incox
-
Input sentence: EEFU'I Date
GT sentence: Birth Date

Decoded sentence: EEFURI Date
-
Input sentence: Address 
GT sentence: Address

Decoded sentence: Address
-
Input sentence: Fho ne 
GT sentence: Phone

Decoded sentence: Fho ne #
-
Input sentence: Email 
GT sentence: Email

Decoded sentence: Email 
-
Input sentence: Empioyer 
GT sentence: Employer

Decoded sentence: Employer No
-
Input sentence: Reg. Status Verified 
GT sentence: Reg Status Verified

Decoded sentence: Reg. Status Verified 
-
Input sentence: Date Last Verified 
GT sentence: Date Last Verified 03/14/18

Decoded sentence: Date Last Verified 
-
Input sentence: Next Renew Date 
GT sentence: Next Renew Date 05/13/18

Decoded sentence: Next Renew Date 
-
Input sentence: Admission Intormation 
GT sentence: Admission Information

Decoded sentence: Admission Intormation
-
Input sentence: Attending Provider 
GT sentence: Attendin

-
Input sentence: Afiler' THI' 0 (N30
GT sentence: After Tax: 0 000

Decoded sentence: After THI TON0
-
Input sentence: Rtpozt Gxoup: 26
GT sentence: Report Group: 26

Decoded sentence: Report Group: 26
-
Input sentence: Product: _
GT sentence: Product:

Decoded sentence: Product:
-
Input sentence: I’xod‘ac‘t Typo: Leave Hgmt Svnz
GT sentence: Product Type: Leave Mgmt Svc

Decoded sentence: Insured Type L:ave Homat Sing
-
Input sentence: Funding: Not Applicable 
GT sentence: Funding: Not Applicable

Decoded sentence: Funding: Not Applicable
-
Input sentence: State Plan; No 
GT sentence: State Plan: No

Decoded sentence: State Plan; No 
-
Input sentence: Employee Coverage; Yes 
GT sentence: Employee Coverage: Yes

Decoded sentence: Employee Coverage;Type Coverage Effecs on 
-
Input sentence: Employﬁr Coverage: YER 
GT sentence: Employer Coverage: Yes

Decoded sentence: Employer Coverage: Yes
-
Input sentence: Policy No.: 
GT sentence: Policy No.:

Decoded sentence: Policy No.: 
-
Input 

-
Input sentence: Dependent Inform mion 
GT sentence: Dependent Information

Decoded sentence: Dependent Information 
-
Input sentence: First Name: 
GT sentence: First Name:

Decoded sentence: First Name: 
-
Input sentence: Middle Name/initial: 
GT sentence: Middle Name/Initial:

Decoded sentence: Middle Name/initial:
-
Input sentence: Last Name: 
GT sentence: Last Name:

Decoded sentence: Last Name:
-
Input sentence: Social Security Number: 
GT sentence: Social Security Number:

Decoded sentence: Social Security Number: 
-
Input sentence: Bil‘lh Date: 
GT sentence: Birth Date:

Decoded sentence: Billh Date:
-
Input sentence: Gender: 
GT sentence: Gender:

Decoded sentence: Gender: 
-
Input sentence: Claim Event Information 
GT sentence: Claim Event Information

Decoded sentence: Claim Event Information
-
Input sentence: Accident Work Related: No 
GT sentence: Accident Work Related: No

Decoded sentence: Accident Work Related: No
-
Input sentence: Time ofAccident: 1:30 pm 
GT sentence:

-
Input sentence: Electronic Submis siorl 
GT sentence: Electronic Submission

Decoded sentence: Electronic Submission is
-
Input sentence: C lairn Event Identiﬁer: 26774 53 
GT sentence: Clairn Event Identifier: 2677453

Decoded sentence: Clairn Event Identifier: 26774 53
-
Input sentence: Electronically Signed Indicator: Yes 
GT sentence: Electronically Signed Indicator: Yes

Decoded sentence: Electronically Signed Indicator: Yes
-
Input sentence: Electronically Signed Date: Thursday 
GT sentence: Electronically Signed Date: Thursday

Decoded sentence: Electronically Signed Date: Thersday
-
Input sentence: Claim Tji'pe: V'B Accident - Accidental Injury 
GT sentence: Claim Type: VB Accident - Accidental Injury

Decoded sentence: Claim Type :B Accident  A-cidental Injury 
-
Input sentence: Policg'h old er: Owner Information 
GT sentence: Policyholder/Owner Information

Decoded sentence: Policyholder O:ner Information 
-
Input sentence: First Name: 
GT sentence: First Name:

Decoded sen

-
Input sentence: Hospital N anle: Medlixpress 
GT sentence: Hospital Name: MedExpress

Decoded sentence: Hospital Name :edlipress s 
-
Input sentence: Address Line 1: 1 325 North West Ave. 
GT sentence: Address Line 1: 1325 North West Ave.

Decoded sentence: Address Line 1: 1325AMe    Avert.Ave
-
Input sentence: City .Tackson 
GT sentence: City: Jackson

Decoded sentence: City .tack on 
-
Input sentence: State/Province: MI 
GT sentence: State/Province: MI

Decoded sentence: State/Province: MI    M    M 
-
Input sentence: Postal Codes 4 92 02 
GT sentence: Postal Codes: 49202

Decoded sentence: Postal Codes 4e9202   
-
Input sentence: Comm-1 US 
GT sentence: Country: US

Decoded sentence: Coum-1 Da s
-
Input sentence: Medical Provider Roles: Primary Care 
GT sentence: Medical Provider Roles: Primary Care

Decoded sentence: Medical Provider Roles: Primary Care 
-
Input sentence: Provider First Name: Todd 
GT sentence: Provider First Name: Todd

Decoded sentence: Provider First Name: Tod

-
Input sentence: DlAGNOSiS CODE 
GT sentence: DIAGNOSIS CODE

Decoded sentence: DiAGNOS Sign C D E 
-
Input sentence: E X PLAI N CODE 
GT sentence: EXPLAIN CODE

Decoded sentence: Ex PLAN COD E D 
-
Input sentence: AMOUNT BILL ED 
GT sentence: AMOUNT BILLED

Decoded sentence: AMOUNT BILL D
-
Input sentence: ALLOWED AMOUNT 
GT sentence: ALLOWED AMOUNT

Decoded sentence: ALLOWED AMOUNT 
-
Input sentence: PARAMOU NT PAID 
GT sentence: PARAMOUNT PAID

Decoded sentence: PARAMOUNT PAID NAMS
-
Input sentence: 9 Indicates additional information i- available. 
GT sentence: Indicates additional information is available.

Decoded sentence: 9ndicates additional information if -aciation i.f9
-
Input sentence: Employer Nana: 
GT sentence: Employer Name:

Decoded sentence: Employer Name:
-
Input sentence: Electron [1: S u Innis sion 
GT sentence: Electronic: Submission

Decoded sentence: Electroni[1: S Sus Inno non
-
Input sentence: C lairn Event Identiﬁer: 
GT sentence: Claim Event Identifier:

Dec

-
Input sentence: hﬁddle Name/initial: 
GT sentence: Middle Name/Initial:

Decoded sentence: Middle Name/inintia:
-
Input sentence: Last Name: 
GT sentence: Last Name:

Decoded sentence: Last Name:
-
Input sentence: Social Secun'ty Number: 
GT sentence: Social Security Number:

Decoded sentence: Social Security Number: 
-
Input sentence: Birth Date: 
GT sentence: Birth Date:

Decoded sentence: Birth Date: 
-
Input sentence: Gender: 
GT sentence: Gender:

Decoded sentence: Gender: 
-
Input sentence: Language Preference: 
GT sentence: Language Preference:

Decoded sentence: Language Preference:
-
Input sentence: Address Line 1: 
GT sentence: Address Line 1:

Decoded sentence: Address Line 1: 
-
Input sentence: City. 
GT sentence: City:

Decoded sentence: City. 
-
Input sentence: State/PrOVirice: 
GT sentence: State/Province:

Decoded sentence: Statu/PrOVice :o
-
Input sentence: Postal Code 
GT sentence: Postal Code:

Decoded sentence: Postal Code 
-
Input sentence: Country. 
GT sentence:

In [22]:
WER_spell_correction = calculate_WER(gt_texts, decoded_sentences)
print('WER_spell_correction |TEST= ', WER_spell_correction)

WER_spell_correction |TEST=  0.117486338798


In [23]:
WER_OCR = calculate_WER(gt_texts, input_texts)
print('WER_OCR |TEST= ', WER_OCR)

WER_OCR |TEST=  0.117237953304
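
The two scores printed above can be compared directly. The sketch below computes the relative change between them; `relative_wer_change` is a hypothetical helper and the two values are copied from the outputs above.

```python
def relative_wer_change(wer_before, wer_after):
    """Fraction by which the WER dropped (positive) or rose (negative)."""
    return (wer_before - wer_after) / wer_before

wer_ocr = 0.117237953304        # WER_OCR |TEST
wer_corrected = 0.117486338798  # WER_spell_correction |TEST
change = relative_wer_change(wer_ocr, wer_corrected)
print('relative WER change: {:.2%}'.format(change))
```

With these numbers the change is slightly negative (roughly -0.2%), so under this metric the corrected text scores marginally worse than the raw OCR output on this data.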
