# Introduction

We tackle the problem of OCR post-processing. OCR maps the image form of a document into the text domain. This is done first with a CNN+LSTM+CTC model, in our case based on Tesseract. Since this model only maps image to text, we need something on top to validate and correct the language semantics.

The idea is to build a language model that takes the OCRed text and corrects it based on language knowledge. The language model could be:
- Char level: the aim is to capture word morphology, in which case it acts like a spelling correction system.
- Word level: the aim is to capture sentence semantics, but such systems suffer from the out-of-vocabulary (OOV) problem.
- Fusion: the aim is to capture both semantic and morphological language rules. The output has to be at char level to avoid the OOV problem; the input, however, can be char, word, or both.

The target of the fusion model is to learn:

    p(char | char_context, word_context)
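
One way to read this target, offered as a sketch rather than the exact formulation used here: the decoder emits the corrected text one character at a time, and each character is conditioned on the characters already emitted plus the char-level and word-level views of the OCR input. The symbols $y_t$, $x^{\text{char}}$ and $x^{\text{word}}$ are notation assumed for this illustration only.

```latex
p\left(y_1, \dots, y_T \mid x^{\text{char}}, x^{\text{word}}\right)
  = \prod_{t=1}^{T} p\left(y_t \mid y_{<t},\, x^{\text{char}},\, x^{\text{word}}\right)
```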

In this workbook we use the vanilla Keras seq2seq implementation, adapted from the lstm_seq2seq example for the Eng-Fra translation task. The adaptation involves:

- Adapt to spelling correction at char level
- Pre-train on noisy medical sentences
- Fine-tune a residual to correct the mistakes of Tesseract
- Limit the input and output sequence lengths
- Ensure a teacher-forcing, auto-regressive model in the decoder
- Limit the padding per batch
- Learning rate schedule
- Bi-directional LSTM Encoder
- Bi-directional GRU Encoder
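
As an illustration of the "limit the padding per batch" item, the sketch below groups texts of similar length and pads each batch only to its own longest text, instead of padding everything to one global maximum. The helper name `make_batches` and the space used as `pad_char` are assumptions made for this example; the notebook's own batching code is not shown in this section.

```python
def make_batches(texts, batch_size, pad_char=' '):
    """Group texts of similar length and pad each batch to its own longest text.

    Hypothetical helper for illustration; not taken from the notebook.
    """
    # Sorting by length puts similar-length texts in the same batch,
    # so little padding is needed inside each batch.
    ordered = sorted(texts, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        width = max(len(text) for text in batch)
        batches.append([text.ljust(width, pad_char) for text in batch])
    return batches


texts = ['City:', 'Plan', 'Medical Provider Roles: Treating', 'Gender:']
for batch in make_batches(texts, batch_size=2):
    print([len(text) for text in batch], batch)
```

The two short texts end up in a batch of width 5, so only the batch containing the long sentence pays for its length.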


# Imports

In [1]:
from __future__ import print_function
import tensorflow as tf
import keras.backend as K
from keras.backend.tensorflow_backend import set_session
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Bidirectional, Concatenate, GRU
from keras import optimizers
from keras.callbacks import ModelCheckpoint, TensorBoard, LearningRateScheduler
from keras.models import load_model
import numpy as np
import os
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

Using TensorFlow backend.


# Utility functions

In [2]:
# Limit gpu allocation. allow_growth, or gpu_fraction
def gpu_alloc():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    set_session(tf.Session(config=config))

In [3]:
gpu_alloc()

In [4]:
def calculate_WER_sent(gt, pred):
    '''
    Word-level edit distance between a ground-truth and a predicted sentence, e.g.
    calculate_WER_sent('calculating wer between two sentences', 'calculate wer between two sentences')
    '''
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    # Use a wide integer type: uint8 would overflow once a distance exceeds 255
    d = np.zeros(((len(gt_words) + 1), (len(pred_words) + 1)), dtype=np.int32)

    # Initializing error matrix
    for i in range(len(gt_words) + 1):
        for j in range(len(pred_words) + 1):
            if i == 0:
                d[0][j] = j
            elif j == 0:
                d[i][0] = i

    # computation
    for i in range(1, len(gt_words) + 1):
        for j in range(1, len(pred_words) + 1):
            if gt_words[i - 1] == pred_words[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                substitution = d[i - 1][j - 1] + 1
                insertion = d[i][j - 1] + 1
                deletion = d[i - 1][j] + 1
                d[i][j] = min(substitution, insertion, deletion)
    return d[len(gt_words)][len(pred_words)]

In [5]:
def calculate_WER(gt, pred):
    '''
    :param gt: list of ground-truth sentences
    :param pred: list of predicted sentences
    both lists must have the same length
    :return: total word-level edit distance divided by the total number of ground-truth words
    '''
    assert len(gt) == len(pred)
    WER = 0
    nb_w = 0
    for i in range(len(gt)):
        WER += calculate_WER_sent(gt[i], pred[i])
        # Count words, not characters, so that the result is a word error rate
        nb_w += len(gt[i].split(' '))

    return WER / nb_w
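
To make the metric concrete, here is a small dependency-free sketch of the same idea: a word-level Levenshtein distance per sentence pair, summed over the corpus and divided by the number of ground-truth words. The names `word_edit_distance` and `corpus_wer` and the example strings are illustrative and not part of the notebook.

```python
def word_edit_distance(gt, pred):
    """Levenshtein distance between two sentences, counted in words."""
    gt_words = gt.lower().split(' ')
    pred_words = pred.lower().split(' ')
    # previous[j]: distance between the ground-truth words seen so far
    # (excluding the current one) and the first j predicted words.
    previous = list(range(len(pred_words) + 1))
    for i, gt_word in enumerate(gt_words, start=1):
        current = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            if gt_word == pred_word:
                current.append(previous[j - 1])
            else:
                current.append(1 + min(previous[j - 1],  # substitution
                                       current[j - 1],   # insertion
                                       previous[j]))     # deletion
        previous = current
    return previous[-1]


def corpus_wer(gts, preds):
    """Total word edits divided by the total number of ground-truth words."""
    edits = sum(word_edit_distance(g, p) for g, p in zip(gts, preds))
    words = sum(len(g.split(' ')) for g in gts)
    return edits / words


gts = ['Medical Provider Roles: Treating', 'Provider First Name: Christine']
preds = ['Medical Provider Roles: Treating', 'Provider First Name: Reast Notes']
print(corpus_wer(gts, preds))  # 2 edits over 8 ground-truth words
```

The first pair matches exactly; the second needs one substitution and one insertion, giving 2 / 8 = 0.25.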

In [6]:
def load_data_with_gt(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load data from a txt file where each line has the form: <TXT><TAB><GT>. The decoder target must have \t as the start trigger and \n as the stop trigger.'''
    cnt = 0  
    input_texts = []
    gt_texts = []
    target_texts = []
    for row in open(file_name, encoding='utf8'):
        if cnt < num_samples :
            #print(row)
            sents = row.split("\t")
            input_text = sents[0]
            
            target_text = '\t' + sents[1] + '\n'
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len and len(target_text) > min_sent_len and len(target_text) < max_sent_len:
                cnt += 1
                
                input_texts.append(input_text)
                target_texts.append(target_text)
                gt_texts.append(sents[1])
    return input_texts, target_texts, gt_texts
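
The line format and the start/stop triggers can be seen on a single row. The sketch below uses a hypothetical helper, `split_pair`, and a made-up row; unlike the loader above, it strips the line's own trailing newline before appending the stop trigger, which is a choice made for this example only.

```python
def split_pair(row):
    """Split one '<TXT><TAB><GT>' line into (input_text, target_text, gt_text).

    Hypothetical helper; it strips the line's own trailing newline
    before adding the stop trigger.
    """
    input_text, gt_text = row.rstrip('\n').split('\t')
    # '\t' marks the start of decoding and '\n' marks the end.
    target_text = '\t' + gt_text + '\n'
    return input_text, target_text, gt_text


row = 'Me dieal Provider Roles: Treating\tMedical Provider Roles: Treating\n'
input_text, target_text, gt_text = split_pair(row)
print(repr(input_text))
print(repr(target_text))
```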

In [7]:
def load_data(file_name, num_samples, max_sent_len, min_sent_len):
    '''Load up to num_samples lines from a txt file, one input text per line (no ground truth). Only lines whose length lies strictly between min_sent_len and max_sent_len are kept; each kept line retains its trailing newline.'''
    cnt = 0  
    input_texts = []   
    
    #for row in open(file_name, encoding='utf8'):
    for row in open(file_name):
        if cnt < num_samples :            
            input_text = row           
            if len(input_text) > min_sent_len and len(input_text) < max_sent_len:
                cnt += 1                
                input_texts.append(input_text)
    return input_texts

In [8]:
def vectorize_data(input_texts, max_encoder_seq_length, num_encoder_tokens, vocab_to_int):
    '''One-hot encodes the input texts into a float32 array of shape (len(input_texts), max_encoder_seq_length, num_encoder_tokens)'''
    encoder_input_data = np.zeros(
        (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
        dtype='float32')

    for i, input_text in enumerate(input_texts):
        for t, char in enumerate(input_text):
            # c0..cn
            encoder_input_data[i, t, vocab_to_int[char]] = 1.
                
    return encoder_input_data
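
Because `vectorize_data` indexes `vocab_to_int[char]` directly, a character that is missing from the vocabulary raises a `KeyError`. One possible guard, sketched here under the assumption that dropping such characters is acceptable, is to filter the text first. The helper `drop_unknown_chars` and the toy vocabulary are hypothetical; the real vocabulary is loaded from `vocab.npz`.

```python
def drop_unknown_chars(text, vocab_to_int):
    """Remove characters that have no entry in vocab_to_int.

    Hypothetical pre-processing step; dropping is only one option
    (mapping to a placeholder character is another).
    """
    return ''.join(char for char in text if char in vocab_to_int)


# Toy vocabulary for illustration; the real one is loaded from vocab.npz.
toy_vocab_to_int = {char: i for i, char in enumerate(sorted(set('\t\n City:Waukesh')))}
print(drop_unknown_chars('City. W’aukesha', toy_vocab_to_int))
```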

In [9]:
def decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab):
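    # NOTE: besides its arguments, this function reads the notebook-level
    # globals vocab_to_int and max_decoder_seq_length.
    # Encode the input as state vectors.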
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, vocab_to_int['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = int_to_vocab[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
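
The control flow of the loop above can be isolated from Keras. In the sketch below, `next_char` is a hypothetical stand-in for one decoder step and `toy_step` is a toy "decoder" that simply spells out a fixed string; neither is part of the notebook. The exit rule mirrors `decode_sequence`: stop on the stop character or once the output exceeds the length limit.

```python
def greedy_decode(next_char, max_len, start='\t', stop='\n'):
    """Generic greedy loop: feed the last emitted character back in until stop.

    next_char is a hypothetical stand-in for one decoder step; it takes the
    previous character and a state, and returns (predicted_char, new_state).
    """
    decoded = ''
    prev, state = start, None
    while True:
        prev, state = next_char(prev, state)
        decoded += prev
        # Same exit rule as decode_sequence: stop character or length limit.
        if prev == stop or len(decoded) > max_len:
            return decoded


def toy_step(prev_char, state):
    """Toy 'decoder' that spells out a fixed string, one character per call."""
    text = 'Plan\n'
    position = 0 if state is None else state
    return text[position], position + 1


print(repr(greedy_decode(toy_step, max_len=40)))
```

Note that because the length check uses `>`, the loop can emit up to `max_len + 1` characters before it stops.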


# Load data

# Load model params

In [10]:
data_path = '../../dat/'

In [11]:
vocab_file = 'vocab.npz'
model_file = 'best_model.hdf5'
encoder_model_file = 'encoder_model.hdf5'
decoder_model_file = 'decoder_model.hdf5'

In [12]:
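# The vocab dictionaries are stored as pickled objects; recent NumPy versions need allow_pickle=True to load them
vocab = np.load(file=vocab_file, allow_pickle=True)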
vocab_to_int = vocab['vocab_to_int'].item()
int_to_vocab = vocab['int_to_vocab'].item()
max_sent_len = vocab['max_sent_len']
min_sent_len = vocab['min_sent_len']



In [13]:
input_characters = sorted(list(vocab_to_int))
num_decoder_tokens = num_encoder_tokens = len(input_characters) #int(encoder_model.layers[0].input.shape[2])
max_encoder_seq_length = max_decoder_seq_length = max_sent_len - 1#max([len(txt) for txt in input_texts])


In [14]:
num_samples = 1000000
tess_correction_data = os.path.join(data_path, 'test_data.txt')
input_texts = load_data(tess_correction_data, num_samples, max_sent_len, min_sent_len)

In [15]:
#model.load_weights(model_file)

model = load_model(model_file)

In [16]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 114)    0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 512), (None, 759808      input_1[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, None, 114)    0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________

In [17]:
encoder_model = load_model(encoder_model_file)
decoder_model = load_model(decoder_model_file)



In [18]:
encoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 114)    0                                            
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 512), (None, 759808      input_1[0][0]                    
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 512)          0           bidirectional_1[0][1]            
                                                                 bidirectional_1[0][3]            
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 512)          0           bidirectional_1[0][2]            
          

In [19]:
decoder_model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, None, 114)    0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 512)          0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 512)          0                                            
__________________________________________________________________________________________________
lstm_2 (LSTM)                   [(None, None, 512),  1284096     input_2[0][0]                    
                                                                 input_3[0][0]                    
          

In [None]:


encoder_input_data = vectorize_data(input_texts=input_texts, max_encoder_seq_length=max_encoder_seq_length, num_encoder_tokens=num_encoder_tokens, vocab_to_int=vocab_to_int)

# Sample output from the test data
decoded_sentences = []

for seq_index in range(len(input_texts)):
    # Take one sequence at a time (batch of size 1)
    # for trying out decoding.

    input_seq = encoder_input_data[seq_index: seq_index + 1]
    
    decoded_sentence = decode_sequence(input_seq, encoder_model, decoder_model, num_decoder_tokens, int_to_vocab)
    
    print('-')
    print('Input sentence:', input_texts[seq_index])
    
    print('Decoded sentence:', decoded_sentence)   
    decoded_sentences.append(decoded_sentence)
    


-
Input sentence: Me dieal Provider Roles: Treating 

Decoded sentence: Medical Provider Roles: Treating

-
Input sentence: Provider First Name: Christine 

Decoded sentence: Provider First Name: Reast Notes

-
Input sentence: Provider Last Name: Nolen, MD 

Decoded sentence: Provider Last Name: Nolenas: 687.9218mk

-
Input sentence: Address Line 1 : 7 25 American Avenue 

Decoded sentence: Address Line 1 : 140 - 104 405-32/2m/d9:

-
Input sentence: City. W’aukesha 

Decoded sentence: City: Status - yes

-
Input sentence: StatefProvinee: ‘WI 

Decoded sentence: State/Province Plon:

-
Input sentence: Postal Code: 5 31 88 

Decoded sentence: Postal Code: 202/2r/2218

-
Input sentence: Country". US 

Decoded sentence: Country: Ye?

-
Input sentence: Business Telephone: (2 62) 92 8- 1000 

Decoded sentence: Business Telephone: 02/17/2018

-
Input sentence: Date ot‘Pirst Visit: 1 2/01f20 17 

Decoded sentence: Date of First Visit: ETI WIORASTENT

-
Input sentence: Medical Protitler Informa

-
Input sentence: Social Security Number: 

Decoded sentence: Social Security Number: 

-
Input sentence: Birth Date: 

Decoded sentence: Birth Date:

-
Input sentence: Gender: 

Decoded sentence: Gender: 

-
Input sentence: Claim Event Information 

Decoded sentence: Claim Event Information

-
Input sentence: Accident Work Related: No

Decoded sentence: Accident Work Related: No

-
Input sentence: Time ofAccident: 8:00 PM

Decoded sentence: Time Lock Farical Pain:

-
Input sentence: Accident Date: 

Decoded sentence: Accident Date: 6

-
Input sentence: 5 n rg erji' Inform ari on 

Decoded sentence: Surgery Information (2m/darn.

-
Input sentence: 15 Surgery Required: No 

Decoded sentence: Is Surgery Required: No

-
Input sentence: Medical Provider Information - Physician 

Decoded sentence: Medical Provider Information - Physician

-
Input sentence: Medical Provider Specialty: ElVIS 

Decoded sentence: Medical Provider Role Specialty: Yes

-
Input sentence: UﬂUﬁT 

Decoded sentence: 

-
Input sentence: 1. Knee injury (889.90XA) 

Decoded sentence: 1. Knee in you Anjusy

-
Input sentence: Past Medical History 

Decoded sentence: Pastice Merstralthrsted:

-
Input sentence: 0 No signiﬁcant past medical history 

Decoded sentence: • No significant past medical symptoms.

-
Input sentence: Surgical History 

Decoded sentence: Surgical History:

-
Input sentence: 0 History of Ankle Surgery 

Decoded sentence: • History of Antedure Selfient

-
Input sentence: Family History 

Decoded sentence: Family History:

-
Input sentence: Jillitﬁ'géiés 1 of 3 3/22/18 3:09:54 PM 

Decoded sentence: TWIN TITIES OUTH (I66 1 ries ar ormed.

-
Input sentence: TWIN CITIES 

Decoded sentence: TWIN CoTIENT

-
Input sentence: ORTHOPEDICS 

Decoded sentence: ORTHOPEDICS

-
Input sentence: Twin Cities Orthopedics-Burnsville 

Decoded sentence: Twin Cities Orthopedics-Burnsville

-
Input sentence: Date of Service: 01/21/2018 7:30PM 

Decoded sentence: Date of Service ER Incoust an: Senk

-
Input

-
Input sentence: Allergies 

Decoded sentence: Allergies:

-
Input sentence: 1._No Known Aliergies 

Decoded sentence: 1. No Known Allergies

-
Input sentence: Physical Exam 

Decoded sentence: Physicial Exam

-
Input sentence: Diagnosis 

Decoded sentence: Diagnosis: 

-
Input sentence: Plan 

Decoded sentence: Plan 

-
Input sentence: DiscussionlSummary 

Decoded sentence: Discussion/Summary

-
Input sentence: Signatures 

Decoded sentence: Signature Ded

-
Input sentence: Electronicaliy signed by : Jamie Birkelo, PA; 

Decoded sentence: Electronically signed by Larkin, J Plamination:

-
Input sentence: altiltliigélés 

Decoded sentence: TWITling to Count

-
Input sentence: Plan 

Decoded sentence: Plan 

-
Input sentence: Knee injury 

Decoded sentence: Knen in encript

-
Input sentence: Last Updated ByzKim, Daniel;Ordered; 

Decoded sentence: Lant Adjust Physicial By Please s/L

-
Input sentence: For:Knee injury; Ordered By:Felvor. David; 

Decoded sentence: For:Kned Ondiclen Brav

-
Input sentence: Register for Claim Self Service — no 

Decoded sentence: Retister for tare Reason Specified T

-
Input sentence: Health insurance through employer - yes 

Decoded sentence: Health Dether empleastherment - norrent

-
Input sentence: Health insurance provider — bcbs 

Decoded sentence: Healthaingurgerenched marlished (mm/ddry

-
Input sentence: Fax paperwork - yes 

Decoded sentence: Fax paperwory evgros.

-
Input sentence: Attention of - Tellie 

Decoded sentence: Attertion of - The (mm/dd/yy)

-
Input sentence: Fax number 

Decoded sentence: Fax number s

-
Input sentence: Refax paperwork — yes 

Decoded sentence: Refax paperworn rulating

-
Input sentence: Notes # 

Decoded sentence: Notes: #:

-
Input sentence: Event dates: unknown rtw 

Decoded sentence: Evented Work-on Larter Sing

-
Input sentence: Final Details: EE does not have email. 

Decoded sentence: Privical Pllness Health and Humphen.

-
Input sentence: Submission Method: — phone 

Decoded sentence: Submi

-
Input sentence: Temporary Address: 

Decoded sentence: Teppreded: FOLA, M.D.

-
Input sentence: Address Line 1: 

Decoded sentence: Address Line 1: 

-
Input sentence: Address Line 2: 

Decoded sentence: Address Lane 2: 2

-
Input sentence: City: 

Decoded sentence: City: No

-
Input sentence: State: 

Decoded sentence: State: :

-
Input sentence: Country: 

Decoded sentence: Country: :

-
Input sentence: ZIP: 

Decoded sentence: Pat:

-
Input sentence: Effective From Date: 

Decoded sentence: EAdEcttent/Ount Name:

-
Input sentence: Effective To Date:  

Decoded sentence: Effective Type: Devinal

-
Input sentence: Created By: Hughes, Brittany 

Decoded sentence: Created at: 4 on aut art firnt.

-
Input sentence: Created Date: 

Decoded sentence: Created Date:

-
Input sentence: Create Site: Chattanooga 

Decoded sentence: Create Submision - Soumaly

-
Input sentence: Completed By: Hughes, Brittany 

Decoded sentence: Comperted By: Hughasyn nommaly

-
Input sentence: Completed Date: 

-
Input sentence: TIER 2 Family MOO? Max 

Decoded sentence: TIER 2 Family MOOP MaIV 2712

-
Input sentence: TIER 2 100111.109I  

Decoded sentence: TIER 2 Individual Deduction

-
Input sentence: TIER 2 lndlvidual MOOP Max 

Decoded sentence: TIER 2 Individual By/ic -.0492

-
Input sentence: TIER 3 Fan1Ily 

Decoded sentence: TIER 3 F7m Tied

-
Input sentence: TIER 3 Family MOOP Max

Decoded sentence: TIER 2 Family MOOP Max

-
Input sentence: TIER 9 1110111101101- 

Decoded sentence: TIER 3 Individual Deduction

-
Input sentence: TIER 3 11101111111101 MOOP Max

Decoded sentence: TIEd y Enflive Date: 01/22/2018

-
Input sentence: URTHDATLANTA LLC FAYETI'EVI LLE 

Decoded sentence: ORTHOATLANTA LLC FAYET EY/L. TH

-
Input sentence: Merchant I0: 

Decoded sentence: Medical: Ex/2

-
Input sentence: Transaction typo: 

Decoded sentence: Transaction ty: :

-
Input sentence: Approval code: 

Decoded sentence: Approval code:

-
Input sentence: Dateltime: 

Decoded sentence: Date/In: ID:

-
Inp

-
Input sentence: Page I 012 (continued on back) 

Decoded sentence: Page 1 of urencession on lequired

-
Input sentence: Provld-or: GEORGE STONE M0 

Decoded sentence: Provider: David Off, C TH: FPET, PA-C

-
Input sentence: En'lployee: 

Decoded sentence: Employee: 49612

-
Input sentence: PaiTiEMWWTM“

Decoded sentence: Patient Iodist:

-
Input sentence: Claim No: ' _" 

Decoded sentence: Claim No:

-
Input sentence: Provider Me:  

Decoded sentence: Provider None:

-
Input sentence: Member Ne: 

Decoded sentence: MeIMON: No.e

-
Input sentence: Pat Acct No: 

Decoded sentence: Pal Acctune: No

-
Input sentence: Service Date of Charged I MRI 

Decoded sentence: Services Sprrient Resurad’s Signt

-
Input sentence: Description Service Ameung , Porgy} . 02:15:18 

Decoded sentence: Date of Service Dumin ERe Inlerure Senfirent

-
Input sentence: Claim Totals _ 

Decoded sentence: Claim Tothant

-
Input sentence: 1 Processed at the Tier 1 Contracted Hale 

Decoded sentence: 1/urgerss at 

-
Input sentence: CL-i 023 (06/13) 1 

Decoded sentence: CL-1023 (06/13) 2

-
Input sentence: ﬂit TWIN (:5 raga; OR'E'HOPEDICS 

Decoded sentence: TWIN CITIOSION STATTIONT

-
Input sentence: TWIN CITIES ORTHOPEDICS PA 

Decoded sentence: TWIN CoTESSENTEV ELOPHONE

-
Input sentence: Temp—Return Service Requested 

Decoded sentence: Temp-Return Service Aroes No

-
Input sentence: Guarantor Account Information 

Decoded sentence: Gurrint Toped Entortation of State

-
Input sentence: Summary of Charges 

Decoded sentence: Summary of Charged:

-
Input sentence: Amount Insurance Amount Patient 

Decoded sentence: Amount Information Date: (mm/dd/yy)

-
Input sentence: Date Description Charged Paid Adjusted Balanc- 

Decoded sentence: Date Disce satient Date: (mm/dd/yy)

-
Input sentence: 01 {221201 8 MRI, LWR EX? JOINT W/O 

Decoded sentence: SUR2 2 RB18 2V knee Ja12016 Range:

-
Input sentence: SUMMARY FOR ! 

Decoded sentence: SUM: NATY/AUTH

-
Input sentence: Send Inquiries To: 

Decoded s