## **Fixing Model**

*Task:*
* Transform Image to Text  
* Transform Text Data to Words  
* Encode Words  
* Train Model
* Evaluate
* Decode and Transform to Text Data

#### **I. Import & Load Data**

In [1]:
# Data Manipulation
import numpy as np
import pandas as pd

# Data Processing 
import string
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Model Building
import tensorflow as tf
from tensorflow.keras.layers import Dense, SimpleRNN, Bidirectional, Masking, Embedding
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import Sequential

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Getting Data
import os
from PIL import Image
import pathlib

# Save model
import pickle as pkl

In [2]:
# Get train data path
project_directory = pathlib.Path.cwd().parent
data_folder_directory = str(project_directory) + "/data/fixing_model/" 

In [3]:
# Load data
all_data = [pd.read_csv(data_folder_directory + f) for f in os.listdir(data_folder_directory)]
data = pd.concat(all_data)
data.head()

Unnamed: 0,Correct Word,Incorrect Word
0,Surrender,Surrende#
1,Surrender,Surre?der
2,Surrender,Surr?nder
3,Surrender,S&rrender
4,Surrender,Surren&er


In [4]:
data.shape

(2621, 2)

In [5]:
# Split input and output (avoid shadowing the built-in `input`)
X = data["Incorrect Word"]
Y = data["Correct Word"]

# Train, validation, and test data (60% / 20% / 20%)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_tr, X_val, Y_tr, Y_val = train_test_split(X_train, Y_train, test_size=0.25, random_state=42)

X_tr.shape, X_test.shape, X_val.shape

((1572,), (525,), (524,))
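
The two calls above give roughly a 60/20/20 split, because 25% of the remaining 80% is 20% of the full set. A minimal sketch of that arithmetic, assuming scikit-learn rounds a fractional `test_size` up to a whole number of samples (the helper name `split_sizes` is illustrative, not part of the notebook):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 2621                                   # rows reported by data.shape
n_train, n_test = split_sizes(n_total, 0.2)      # first split: hold out the test set
n_tr, n_val = split_sizes(n_train, 0.25)         # second split: hold out validation data

print(n_tr, n_test, n_val)  # 1572 525 524
```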

#### **II. Encoder**

In [6]:
# Key list
key_list = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-.")

In [32]:
# Define word max length (length of the longest word in a Series of strings)
def get_word_max_length(data:pd.Series)->int:
    return data.apply(len).max()

# Encoder
class CharactersEncoder():

    def __init__(self):
        # Index 0 is reserved for unknown characters, index 1 for the padding space
        self.__unknown_word = "<UNKNOWN>"
        self.__empty = " "
        self.empty_vector = None
        self.vocab = {self.__unknown_word: 0,
                      self.__empty: 1}
        self.__reverse_vocab = {0: self.__unknown_word,
                                1: self.__empty}
        self.vocab_size = 2
        self.word_max_length = None

    def __index_to_array__(self, indices:np.array)->np.array:
        vectorized_word = np.zeros((indices.shape[0], self.vocab_size))
        for i in range(indices.shape[0]):
            vectorized_word[i, indices[i]] = 1
        return vectorized_word
    
    def __array_to_index__(self, arr:np.array)->np.array:
        return np.argmax(arr, axis=1)

    def fit(self, key_list:list, word_max_length:int)->None:
        self.word_max_length = word_max_length
        self.vocab_size += len(key_list)
        index_list = range(2, self.vocab_size)
        self.vocab.update(dict(zip(key_list, index_list)))
        self.__reverse_vocab.update(dict(zip(index_list, key_list)))
        self.empty_vector = np.zeros((self.vocab_size))
        self.empty_vector[1] = 1

    def transform(self, words_data:pd.Series)->list:
        words_data_encoded = []
        for w in range(words_data.shape[0]):
            # Truncate to the maximum length and map unseen characters to <UNKNOWN>
            word = list(words_data.iloc[w])[:self.word_max_length]
            word = [char if char in self.vocab else self.__unknown_word for char in word]
            # Characters -> indices -> one-hot matrix of shape (len(word), vocab_size)
            words_data_as_index = np.array([self.vocab[char] for char in word])
            words_data_encoded.append(self.__index_to_array__(words_data_as_index))
        return words_data_encoded

    def reverse_transform(self, words_vector_data:list, make_unknown:bool=True)->list:
        word_data_decoded = []
        for vector in words_vector_data:
            words_data_as_index = self.__array_to_index__(vector)
            if make_unknown and np.any(words_data_as_index==0):
                word_data_decoded.append(self.__unknown_word)
                continue
            word_data_decoded.append(''.join([self.__reverse_vocab[i] for i in words_data_as_index]))
        return word_data_decoded
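
To make the encoding concrete, here is a minimal standalone sketch of the same idea: index 0 is reserved for unknown characters, index 1 for the padding space, and each character becomes a one-hot row. The names `toy_vocab`, `encode_word` and `decode_word` and the three-letter key list are illustrative assumptions, not part of the class above.

```python
import numpy as np

# Hypothetical tiny vocabulary: 0 = unknown, 1 = padding space, then the keys
toy_keys = list("abc")
toy_vocab = {"<UNKNOWN>": 0, " ": 1}
toy_vocab.update({ch: i for i, ch in enumerate(toy_keys, start=2)})
toy_reverse = {i: ch for ch, i in toy_vocab.items()}

def encode_word(word):
    """Map each character to its index (0 if unseen) and one-hot encode it."""
    indices = [toy_vocab.get(ch, 0) for ch in word]
    return np.eye(len(toy_vocab), dtype=int)[indices]

def decode_word(onehot):
    """Take the argmax of every row and map the indices back to characters."""
    return "".join(toy_reverse[int(i)] for i in np.argmax(onehot, axis=1))

encoded = encode_word("cab")
print(encoded.shape)                           # (3, 5)
print(decode_word(encoded))                    # cab
print(np.argmax(encode_word("c?b"), axis=1))   # [4 0 3]
```

The last line shows what happens to a corrupted character such as `?`: it falls back to index 0, so the model sees "unknown" at that position.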

#### **III. Preprocess Data**

In [33]:
# Max length
word_max_length = get_word_max_length(Y_tr)
word_max_length

14

In [34]:
# Encoder
char_encoder = CharactersEncoder()
char_encoder.fit(key_list=key_list, word_max_length=word_max_length)

# Save encoder
encoder_save_path = str(project_directory) + "/models/char_encoder.pkl"
with open(encoder_save_path, 'wb') as file:
    pkl.dump(char_encoder, file)

In [35]:
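# Preprocess function
def preprocess(data:pd.Series, sequence_length:int=None)->np.array:
    # One-hot encode every word, then pad each one at the end to `sequence_length`
    # time steps with the "empty" vector (longest word in `data` if None)
    encoded_data = char_encoder.transform(data)
    data_padded = pad_sequences(encoded_data, maxlen=sequence_length, padding='post', dtype='int8', value=char_encoder.empty_vector)
    return data_padded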

In [36]:
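# Save function
# Note: pickle stores a function by reference (module and name), not its code,
# so `preprocess` and `char_encoder` must be defined again wherever this file is loaded
preprocess_save_path = str(project_directory) + "/models/preprocess_fixing_model.pkl"
with open(preprocess_save_path, 'wb') as file:
    pkl.dump(preprocess, file)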
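
One caveat when pickling a function: `pickle` stores functions by reference (module and qualified name) rather than by value, so the file holds no code. The sketch below illustrates this with a standard-library function; `json.dumps` is only a stand-in for `preprocess`.

```python
import json
import pickle

# Pickle a module-level function; json.dumps stands in for any such function
payload = pickle.dumps(json.dumps)

# The payload names the module and the function instead of storing code
print(b"json" in payload, b"dumps" in payload)   # True True

# Loading resolves the name again, so it returns the very same object
restored = pickle.loads(payload)
print(restored is json.dumps)                    # True
```

In practice this suggests that the loading process must already define the function under the same module and name; saving the fitted encoder and re-declaring the function next to it is usually enough.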

In [37]:
# Preprocess input data
X_tr_processed = preprocess(X_tr, word_max_length)
X_tr_processed.shape

(1572, 14, 56)
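
The shape `(1572, 14, 56)` reads as (words, time steps, vocabulary size): every word is padded to 14 steps, and each step is a 56-wide one-hot row. A minimal NumPy sketch of post-padding with an "empty" one-hot vector, using a toy vocabulary of 4 symbols and a hypothetical helper `pad_post` (not the Keras function itself):

```python
import numpy as np

def pad_post(sequences, max_len, empty_vector):
    """Pad each (length, vocab) one-hot matrix at the end up to max_len rows."""
    # Start from a batch filled entirely with the padding vector
    batch = np.tile(empty_vector, (len(sequences), max_len, 1)).astype("int8")
    for i, seq in enumerate(sequences):
        seq = seq[:max_len]                 # guard against over-long words
        batch[i, :len(seq)] = seq
    return batch

vocab_size = 4
empty = np.zeros(vocab_size)
empty[1] = 1                                # index 1 marks the padding symbol

words = [np.eye(vocab_size)[[2, 3]],        # a 2-character word
         np.eye(vocab_size)[[3, 2, 2]]]     # a 3-character word
padded = pad_post(words, max_len=5, empty_vector=empty)

print(padded.shape)                  # (2, 5, 4)
print(np.argmax(padded, axis=2))     # padded steps decode to index 1
```

Fixing `max_len` explicitly, rather than letting it follow the longest word in each batch, keeps the train, validation, and test arrays compatible with the model's input shape.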

In [38]:
# Preprocess output data
Y_tr_processed = preprocess(Y_tr, word_max_length)
Y_tr_processed.shape

(1572, 14, 56)

In [39]:
# Preprocess validate data
X_val_processed = preprocess(X_val, word_max_length)
Y_val_processed = preprocess(Y_val, word_max_length)

# Preprocess test data
X_test_processed = preprocess(X_test, word_max_length)
Y_test_processed = preprocess(Y_test, word_max_length)

#### **IV. Model Building**

a, Model 1 - Bidirectional Simple RNN

In [40]:
# Model
model1 = Sequential([
    Bidirectional(SimpleRNN(64, return_sequences=True),input_shape=(word_max_length, char_encoder.vocab_size)),
    Dense(char_encoder.vocab_size, activation='softmax')
])

# Compile model
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional (Bidirectiona  (None, 14, 128)          15488     
 l)                                                              
                                                                 
 dense (Dense)               (None, 14, 56)            7224      
                                                                 
Total params: 22,712
Trainable params: 22,712
Non-trainable params: 0
_________________________________________________________________
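
The parameter counts in the summary can be checked by hand. A SimpleRNN with `units` cells and `input_dim` input features has `units * (input_dim + units + 1)` weights (input kernel, recurrent kernel, bias); the bidirectional wrapper doubles that, and the Dense layer adds `inputs * outputs + outputs`. A short sketch of the arithmetic (helper names are illustrative):

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias of one SimpleRNN layer."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weight matrix + bias of one Dense layer."""
    return input_dim * units + units

vocab_size = 56      # 54 keys + unknown + padding
rnn_units = 64

bidirectional = 2 * simple_rnn_params(vocab_size, rnn_units)   # forward + backward copy
dense = dense_params(2 * rnn_units, vocab_size)                # input is the concatenated 128 features

print(bidirectional, dense, bidirectional + dense)  # 15488 7224 22712
```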


In [41]:
model1.fit(X_tr_processed, Y_tr_processed, epochs=100, batch_size=8)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x72536f028910>

In [42]:
# Evaluate
Y_val_pred = model1.predict(X_val_processed)



In [43]:
Y_val_onehot = np.eye(char_encoder.vocab_size)[np.argmax(Y_val_pred, axis=2)]
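
The line above turns softmax probabilities into hard one-hot predictions: `np.argmax` picks the most likely class per time step, and indexing the identity matrix with those indices yields one-hot rows. A small sketch with made-up probabilities (the array `probs` is illustrative data, not model output):

```python
import numpy as np

# Made-up probabilities for 1 word, 2 time steps, 3 classes
probs = np.array([[[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]]])

indices = np.argmax(probs, axis=2)          # most likely class per step -> shape (1, 2)
onehot = np.eye(probs.shape[2])[indices]    # row lookup in the identity matrix -> shape (1, 2, 3)

print(indices)        # [[1 0]]
print(onehot.shape)   # (1, 2, 3)
```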

In [44]:
# Words comparison
comparison_tabel = pd.DataFrame({
    "Incorrect": X_val,
    "Predict": char_encoder.reverse_transform(Y_val_onehot),
    "True": Y_val
})

comparison_tabel.head(20)

Unnamed: 0,Incorrect,Predict,True
350,%ailway,Railway,Railway
631,Signatur&,Signature,Signature
228,Liab?lity,Liability,Liability
788,S?pplement,Supplement,Supplement
149,Dlstination,Destination,Destination
323,Packyges,Packages,Packages
369,Marbs,Marks,Marks
52,Shixper,Shipper,Shipper
84,S=ipper,Shipper,Shipper
610,wriginal,Origieal,Original


In [45]:
# Show incorrect predictions
def incorrect_predictions(data:pd.DataFrame):
    return data[data["Predict"].str.strip() != data["True"]]
incorrect_predictions(comparison_tabel)

Unnamed: 0,Incorrect,Predict,True
610,wriginal,Origieal,Original
371,nEdorsement,Eddorsement,Endorsement
583,Juris!iction,Jurisiiction,Jurisdiction
499,japy,Capy,Copy
483,?uarantor,muarantor,Guarantor
...,...,...,...
665,Porvision,Porvision,Provision
376,Regis?ry,Regisury,Registry
59,facgo,Hacgo,Cargo
479,Quarantnie,Quaranaiie,Quarantine


In [46]:
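def model_report(y_true:np.array, y_pred:np.array):
    # Total number of wrong keys.
    # Note: a wrong character differs from the truth in two one-hot positions
    # (the true index and the predicted index), so this sum counts each wrong character twice
    tt_wrong_key = np.abs(y_true-y_pred).sum()
    print("Total wrong keys: ", tt_wrong_key)

    # Character incorrect rate (y_true.sum() is the number of characters, padding included)
    Cincorrect_rate = round((tt_wrong_key/y_true.sum())*100, 2)
    print(f"Character incorrect rate: {Cincorrect_rate}%")

    # Word incorrect rate: share of words with at least one wrong character
    Wincorrect_rate = round(np.where(y_true == y_pred, 0, 1).any(axis=2).any(axis=1).sum()/y_true.shape[0]*100, 2)
    print(f"Word incorrect rate: {Wincorrect_rate}%")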

In [47]:
model_report(Y_val_processed, Y_val_onehot)

Total wrong keys:  256.0
Character incorrect rate: 3.49%
Word incorrect rate: 18.7%
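
Because every wrong character differs from the truth in two one-hot positions, an absolute-difference sum over one-hot arrays is twice the number of wrong characters. A hedged alternative that counts characters directly via `argmax` is sketched below; `char_and_word_errors` is a hypothetical helper and the arrays are toy data, so the numbers say nothing about the models in this notebook.

```python
import numpy as np

def char_and_word_errors(y_true, y_pred):
    """Count wrong characters and wrong words from one-hot arrays of shape (n, T, V)."""
    true_idx = np.argmax(y_true, axis=2)
    pred_idx = np.argmax(y_pred, axis=2)
    wrong = true_idx != pred_idx                    # (n, T) boolean mask
    return int(wrong.sum()), int(wrong.any(axis=1).sum())

vocab = 4
y_true = np.eye(vocab)[np.array([[2, 3, 1], [3, 3, 1]])]   # two 3-step words
y_pred = np.eye(vocab)[np.array([[2, 2, 1], [3, 3, 1]])]   # one wrong character in the first word

wrong_chars, wrong_words = char_and_word_errors(y_true, y_pred)
print(wrong_chars, wrong_words)                     # 1 1
print(np.abs(y_true - y_pred).sum())                # 2.0, twice the wrong characters
```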


b, Model 2 - Model 1 + Embedding Layer

In [48]:
# Determine embedding dimension
embedding_dim = 8

# Model
model2 = Sequential([
    Embedding(input_dim=char_encoder.vocab_size, output_dim=embedding_dim, input_length=word_max_length),
    Bidirectional(SimpleRNN(64, return_sequences=True)),
    Dense(char_encoder.vocab_size, activation='softmax')
])

# Compile model
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 14, 8)             448       
                                                                 
 bidirectional_1 (Bidirectio  (None, 14, 128)          9344      
 nal)                                                            
                                                                 
 dense_1 (Dense)             (None, 14, 56)            7224      
                                                                 
Total params: 17,016
Trainable params: 17,016
Non-trainable params: 0
_________________________________________________________________


In [49]:
# Tokenize input
X_tr_processed_tkn = np.argmax(X_tr_processed, axis=2)

# Fit model
model2.fit(X_tr_processed_tkn, Y_tr_processed, epochs=100, batch_size=8)

Epoch 1/100
...
Epoch 78

<keras.callbacks.History at 0x72536575e200>

In [50]:
# Evaluate
Y_val_pred = model2.predict(np.argmax(X_val_processed, axis=2))
Y_val_onehot = np.eye(char_encoder.vocab_size)[np.argmax(Y_val_pred, axis=2)]

# Words comparison
comparison_tabel = pd.DataFrame({
    "Incorrect": X_val,
    "Predict": char_encoder.reverse_transform(Y_val_onehot),
    "True": Y_val
})

comparison_tabel.head(20)



Unnamed: 0,Incorrect,Predict,True
350,%ailway,Railway,Railway
631,Signatur&,Signature,Signature
228,Liab?lity,Liability,Liability
788,S?pplement,Supplement,Supplement
149,Dlstination,Ddstination,Destination
323,Packyges,Packages,Packages
369,Marbs,Marks,Marks
52,Shixper,Shipper,Shipper
84,S=ipper,Shipper,Shipper
610,wriginal,Original,Original


In [51]:
# Incorrect predictions
incorrect_predictions(comparison_tabel)

Unnamed: 0,Incorrect,Predict,True
149,Dlstination,Ddstination,Destination
371,nEdorsement,EnAorsement,Endorsement
499,japy,Lapk,Copy
483,?uarantor,Quarantor,Guarantor
745,Retetnion,Reteanion,Retention
...,...,...,...
566,Cintainer,Cintainer,Container
376,Regis?ry,Regisery,Registry
59,facgo,Fargo,Cargo
479,Quarantnie,Quarantice,Quarantine


In [52]:
# Model report
model_report(Y_val_processed, Y_val_onehot)

Total wrong keys:  414.0
Character incorrect rate: 5.64%
Word incorrect rate: 27.48%


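Model 2 performs worse than Model 1 on the validation set (character incorrect rate 5.64% vs 3.49%, word incorrect rate 27.48% vs 18.7%). Since the vocabulary has only 56 symbols, an embedding layer is probably not essential.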
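
One way to see why an embedding layer adds little here: multiplying a one-hot vector by a weight matrix selects a single row of that matrix, which is exactly what an embedding lookup does. The input kernel of the RNN in Model 1 can therefore already play the role of a learned embedding. A minimal NumPy sketch with random weights (the matrix `W` and the token ids are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 56, 8
W = rng.normal(size=(vocab_size, embedding_dim))    # stand-in for learned weights

token_ids = np.array([5, 0, 17])                    # three character indices
onehot = np.eye(vocab_size)[token_ids]              # shape (3, 56)

via_matmul = onehot @ W                             # dense-layer view
via_lookup = W[token_ids]                           # embedding-lookup view

print(np.allclose(via_matmul, via_lookup))          # True
```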

c, Model 3 - Model 1 with more and wider recurrent layers

In [85]:
# Model
model3 = Sequential([
    Bidirectional(SimpleRNN(256, return_sequences=True),input_shape=(word_max_length, char_encoder.vocab_size)),
    Bidirectional(SimpleRNN(128, return_sequences=True)),
    Bidirectional(SimpleRNN(64, return_sequences=True)),
    Bidirectional(SimpleRNN(128, return_sequences=True)),
    Bidirectional(SimpleRNN(256, return_sequences=True)),
    Bidirectional(SimpleRNN(256, return_sequences=True)),
    Dense(char_encoder.vocab_size, activation='softmax')
])

# Compile model
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model3.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_42 (Bidirecti  (None, 14, 512)          160256    
 onal)                                                           
                                                                 
 bidirectional_43 (Bidirecti  (None, 14, 256)          164096    
 onal)                                                           
                                                                 
 bidirectional_44 (Bidirecti  (None, 14, 128)          41088     
 onal)                                                           
                                                                 
 bidirectional_45 (Bidirecti  (None, 14, 256)          65792     
 onal)                                                           
                                                                 
 bidirectional_46 (Bidirecti  (None, 14, 512)         

In [None]:
# Fit model
model3.fit(X_tr_processed, Y_tr_processed, epochs=50, batch_size=8)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7252edafd6f0>

In [87]:
# Evaluate
Y_val_pred = model3.predict(X_val_processed)
Y_val_onehot = np.eye(char_encoder.vocab_size)[np.argmax(Y_val_pred, axis=2)]

# Words comparison
comparison_tabel = pd.DataFrame({
    "Incorrect": X_val,
    "Predict": char_encoder.reverse_transform(Y_val_onehot),
    "True": Y_val
})

comparison_tabel.head(20)



Unnamed: 0,Incorrect,Predict,True
350,%ailway,Railway,Railway
631,Signatur&,Signature,Signature
228,Liab?lity,Liability,Liability
788,S?pplement,Supplement,Supplement
149,Dlstination,Destination,Destination
323,Packyges,Packages,Packages
369,Marbs,Marks,Marks
52,Shixper,Shipper,Shipper
84,S=ipper,Shipper,Shipper
610,wriginal,Original,Original


In [88]:
incorrect_predictions(comparison_tabel)

Unnamed: 0,Incorrect,Predict,True
499,japy,Capy,Copy
483,?uarantor,Quarantor,Guarantor
473,zopt,Copt,Copy
640,hocamgnt,Docament,Document
188,m#eihht,Sreight,Freight
851,Rescissoin,Rescissinn,Rescission
456,&anitation,Nanitation,Sanitation
823,Vaildity,Valldity,Validity
471,Insepction,Inseection,Inspection
616,ufots,Gross,Goods


In [89]:
model_report(y_pred=Y_val_onehot, y_true=Y_val_processed)

Total wrong keys:  68.0
Character incorrect rate: 0.93%
Word incorrect rate: 4.96%


In [91]:
# Second pass: feed the first-pass predictions back into model3 and evaluate again
Y_val_pred2 = model3.predict(Y_val_onehot)
Y_val_onehot2 = np.eye(char_encoder.vocab_size)[np.argmax(Y_val_pred2, axis=2)]

# Words comparison
comparison_tabel2 = pd.DataFrame({
    "Incorrect": char_encoder.reverse_transform(Y_val_onehot),
    "Predict": char_encoder.reverse_transform(Y_val_onehot2),
    "True": Y_val
})

comparison_tabel2.head(20)



Unnamed: 0,Incorrect,Predict,True
350,Railway,Railway,Railway
631,Signature,Signature,Signature
228,Liability,Liability,Liability
788,Supplement,Supplement,Supplement
149,Destination,Destination,Destination
323,Packages,Packages,Packages
369,Marks,Marks,Marks
52,Shipper,Shipper,Shipper
84,Shipper,Shipper,Shipper
610,Original,Original,Original


In [92]:
incorrect_predictions(comparison_tabel2)

Unnamed: 0,Incorrect,Predict,True
483,Quarantor,Quarantor,Guarantor
188,Sreight,Sreight,Freight
456,Nanitation,Nanitation,Sanitation
823,Valldity,Valldity,Validity
471,Inseection,Inseection,Inspection
616,Gross,Gross,Goods
834,Suspention,Suspention,Suspension
801,Resssuance,Resssuance,Reissuance
685,Froce,Froce,Force
86,Lipt,Lipt,Port


In [93]:
model_report(y_true=Y_val_processed, y_pred=Y_val_onehot2)

Total wrong keys:  50.0
Character incorrect rate: 0.68%
Word incorrect rate: 3.05%


Model's problems:
* Words with letters in the wrong positions, especially two adjacent letters that were swapped (for example "Froce" for "Force").  
=> This may be a Bidirectional issue, because two adjacent letters contribute to each other's scores. If the word is short and neither RNN direction produces an output with enough information, the letters are unlikely to be corrected properly.  
=> Normally, the letter reached from the direction that has seen more letters is more likely to be fixed. Once it is fixed, a second prediction pass is likely to generate the correct letters for both.  
* Long words like "Guarantor" may suffer from memory loss, because the far end of the word contributes less than the letters close to the one being predicted. So "arant" can be mistaken for part of "Quarantine".  

Choose the model whose loss is closest to 0. Overfitting is acceptable here because misspelled words have a finite set of cases. The exception is a word with several corrupted characters that fits more than one target: "&&porter" could be "exporter" or "importer". Resolving that needs context, which is hard to obtain across the variety of BL formats, so consider improving the OCR model instead.
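
The second-pass idea used in the evaluation can be generalised to "re-predict until nothing changes". A minimal sketch, assuming a `predict_fn` that maps a one-hot batch to per-class scores in the way `model3.predict` does; `predict_with_passes` and the stand-in `toy_predict` are hypothetical and only illustrate the control flow:

```python
import numpy as np

def to_onehot(probs):
    """Convert (n, T, V) scores to hard one-hot predictions."""
    return np.eye(probs.shape[-1])[np.argmax(probs, axis=-1)]

def predict_with_passes(predict_fn, x_onehot, max_passes=2):
    """Feed predictions back into predict_fn until they stop changing or max_passes is hit."""
    current = x_onehot
    for _ in range(max_passes):
        refined = to_onehot(predict_fn(current))
        if np.array_equal(refined, current):
            break                                   # fixed point: another pass changes nothing
        current = refined
    return current

def toy_predict(x_onehot):
    """Stand-in for model.predict: moves every class index one step toward 0."""
    idx = np.maximum(np.argmax(x_onehot, axis=-1) - 1, 0)
    return np.eye(x_onehot.shape[-1])[idx]

x = np.eye(4)[np.array([[3, 1, 0]])]                # one word, three steps, four classes
two_passes = predict_with_passes(toy_predict, x, max_passes=2)
many_passes = predict_with_passes(toy_predict, x, max_passes=10)
print(np.argmax(two_passes, axis=-1))               # [[1 0 0]]
print(np.argmax(many_passes, axis=-1))              # [[0 0 0]]
```

Whether extra passes help a real model is an empirical question; the cap on `max_passes` keeps the loop bounded if predictions oscillate instead of converging.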

#### **V. Test model**

In [102]:
# Predict: two passes through model3, as in the validation step
first_pass = np.eye(char_encoder.vocab_size)[np.argmax(model3.predict(X_test_processed), axis=2)]
Y_test_pred = np.eye(char_encoder.vocab_size)[np.argmax(model3.predict(first_pass), axis=2)]



In [None]:
# Comparison table (test data, not the validation arrays)
comparison_tabel_test = pd.DataFrame({
    "Incorrect": X_test,
    "Predict": char_encoder.reverse_transform(Y_test_pred),
    "True": Y_test
})

comparison_tabel_test.head(20)

In [None]:
# Final pipeline
class fixing_model():
    def __init__(self, data):
