# Typo correction

The goal of this project is to correct typos in a text without relying on a dictionary, using a hidden Markov model (HMM).

# I - Loading the data

The data come from the *Unabomber Manifesto* and were artificially corrupted by randomly modifying 10% or 20% of the letters in the corpus.
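
As an illustration of this kind of corruption, here is a minimal sketch of a substitution-noise generator. It is not the code that produced the dataset: the function name `add_substitution_noise` and its parameters are hypothetical, and it assumes that each letter is replaced independently by a different lowercase letter. It returns `(observed, true)` pairs, the same layout as the samples returned by `load_db`.

```python
import random
import string


def add_substitution_noise(word, error_rate=0.10, rng=None):
    """Return a list of (observed, true) letter pairs for one word.

    Hypothetical helper: each letter is replaced, with probability
    `error_rate`, by a different lowercase letter chosen uniformly.
    """
    rng = rng or random.Random()
    pairs = []
    for true_letter in word:
        if rng.random() < error_rate:
            # Exclude the true letter so the substitution is a real typo
            candidates = [c for c in string.ascii_lowercase if c != true_letter]
            observed = rng.choice(candidates)
        else:
            observed = true_letter
        pairs.append((observed, true_letter))
    return pairs


print(add_substitution_noise("account", error_rate=0.10, rng=random.Random(0)))
```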

In [1]:
import numpy as np

from HMM import *
from toolbox import *

In [2]:
# Import and separate datasets
ERROR_RATE = 10  # 10% or 20%
train_set, test_set = load_db(error_rate=ERROR_RATE)
X_train = [[token[0] for token in word] for word in train_set]
y_train = [[token[1] for token in word] for word in train_set]
X_test = [[token[0] for token in word] for word in test_set]
y_test = [[token[1] for token in word] for word in test_set]

# Get states and observations sets
states, observations = get_observations_states(X_train, y_train)
print("{} states :\n{}".format(len(states), states))
print("{} observations :\n{}".format(len(observations), observations))

# Example from dataset
print("\nSample example :\n{}".format(train_set[3]))

26 states :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26 observations :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Sample example :
[('a', 'a'), ('c', 'c'), ('v', 'c'), ('o', 'o'), ('u', 'u'), ('n', 'n'), ('t', 't')]


# II - First-order forward HMM

Attempt at correcting typos with a first-order HMM using the Viterbi (forward) algorithm.
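
For reference, Viterbi decoding for a first-order HMM can be sketched as follows. This is a generic log-space version and not the implementation in the `HMM` module, whose internals are not shown here. The names `log_pi`, `log_A` and `log_B` are assumptions standing for the log initial, transition and emission probabilities.

```python
import numpy as np


def viterbi(obs_idx, log_pi, log_A, log_B):
    """Most likely state path for a first-order HMM (log-space Viterbi).

    obs_idx    : sequence of observation indices
    log_pi[i]  : log P(s_0 = i)
    log_A[i, j]: log P(s_t = j | s_{t-1} = i)
    log_B[i, k]: log P(o_t = k | s_t = i)
    """
    T, n = len(obs_idx), log_pi.shape[0]
    delta = np.empty((T, n))            # best log-score of a path ending in each state
    back = np.zeros((T, n), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs_idx[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs_idx[t]]
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy 2-state model with "sticky" transitions: an isolated odd observation gets corrected
pi = np.log(np.array([0.5, 0.5]))
A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
B = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
print(viterbi([0, 1, 0], pi, A, B))  # → [0, 0, 0]
```

Working in log space avoids numerical underflow on long sequences, at the cost of requiring strictly positive probabilities (or `-inf` entries handled with care).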

In [3]:
# Initialize and train HMM
hmm1 = HMM(states, observations)
hmm1.fit(X_train, y_train)

1st order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.


In [4]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = X_test[SAMPLE]
states_sequence = y_test[SAMPLE]

predicted_states_sequence = hmm1.predict([observation_sequence])

print("Observation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}".format("".join(predicted_states_sequence[0])))

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inderiorigy


In [5]:
y_test_pred = hmm1.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM1")

HMM1 score on test set
 * accuracy on full words : 75.15%
 * accuracy on letters    : 93.20%
   > typos corrected      : 310 (4.23%)
   > typos not corrected  : 435 (5.94%)
   > typos added          : 63 (0.86%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)
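
The three typo counters above can be read as follows, assuming the natural definitions (the code of `display_correction_stats` is not shown, so this is an interpretation): a typo is *corrected* when the observed letter was wrong and the prediction is right, *not corrected* when the observed letter was wrong and the prediction is still wrong, and *added* when the observed letter was right but the prediction is wrong. The reported figures are consistent with this reading: 5.94% + 0.86% = 6.80% of letters remain wrong, which matches the 93.20% letter accuracy. A minimal sketch, with the hypothetical name `correction_stats`:

```python
def correction_stats(X, y_true, y_pred):
    """Count corrected / not corrected / added typos over a list of words.

    Hypothetical re-implementation of the statistics, under the
    definitions given in the text above.
    """
    n_words = n_letters = words_ok = 0
    corrected = not_corrected = added = 0
    for obs_word, true_word, pred_word in zip(X, y_true, y_pred):
        n_words += 1
        words_ok += list(pred_word) == list(true_word)
        for obs, true, pred in zip(obs_word, true_word, pred_word):
            n_letters += 1
            if obs != true and pred == true:
                corrected += 1
            elif obs != true:
                not_corrected += 1
            elif pred != true:
                added += 1
    wrong = not_corrected + added
    return {
        "word_accuracy": words_ok / max(n_words, 1),
        "letter_accuracy": 1 - wrong / max(n_letters, 1),
        "corrected": corrected,
        "not_corrected": not_corrected,
        "added": added,
    }


# The sample decoded above: one typo fixed, one left, one introduced
stats = correction_stats(
    [list("inferikrigy")], [list("inferiority")], [list("inderiorigy")]
)
print(stats)
```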



# III - Second-order HMM

## Second-order forward HMM

In [6]:
# Initialize and train HMM
hmm2 = HMM2(states, observations)
hmm2.fit(X_train, y_train, smoothing=False)

2nd order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.


In [7]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = X_test[SAMPLE]
states_sequence = y_test[SAMPLE]

predicted_states_sequence = hmm2.predict([observation_sequence])

print("Observation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}".format("".join(predicted_states_sequence[0])))

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inferiority


In [8]:
y_test_pred = hmm2.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM2")

HMM2 score on test set
 * accuracy on full words : 83.41%
 * accuracy on letters    : 95.57%
   > typos corrected      : 513 (7.01%)
   > typos not corrected  : 232 (3.17%)
   > typos added          : 92 (1.26%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


## Second-order forward HMM with smoothing
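
The smoothing scheme applied by `HMM2.fit(..., smoothing=True)` is not shown in this notebook. As one possible illustration, the sketch below estimates second-order transition probabilities P(s_t | s_{t-2}, s_{t-1}) from trigram counts, with optional add-one (Laplace) smoothing. The function name `second_order_transitions` and the choice of add-one smoothing are assumptions, not the module's actual behavior.

```python
from collections import Counter


def second_order_transitions(state_sequences, states, smoothing=False):
    """Return a function prob(a, b, c) estimating P(s_t = c | s_{t-2} = a, s_{t-1} = b).

    Hypothetical sketch: relative trigram frequencies, with optional
    add-one (Laplace) smoothing over the set of states.
    """
    trigram = Counter()
    context = Counter()  # number of times the pair (a, b) is followed by a state
    for seq in state_sequences:
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            trigram[(a, b, c)] += 1
            context[(a, b)] += 1
    alpha = 1 if smoothing else 0

    def prob(a, b, c):
        denom = context[(a, b)] + alpha * len(states)
        if denom == 0:
            return 0.0  # unseen context and no smoothing
        return (trigram[(a, b, c)] + alpha) / denom

    return prob


words = [list("aaa"), list("aaa")]
raw = second_order_transitions(words, ["a", "b"], smoothing=False)
smooth = second_order_transitions(words, ["a", "b"], smoothing=True)
print(raw("a", "a", "b"), smooth("a", "a", "b"))
```

Without smoothing, a trigram never seen in training gets probability zero, so the decoder can never output it. Smoothing gives it a small non-zero mass, which makes the model less sharp about the trigrams it did see.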

In [9]:
# Initialize and train HMM
hmm2s = HMM2(states, observations)
hmm2s.fit(X_train, y_train, smoothing=True)

2nd order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.


In [10]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = X_test[SAMPLE]
states_sequence = y_test[SAMPLE]

predicted_states_sequence = hmm2s.predict([observation_sequence])

print("Observation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}".format("".join(predicted_states_sequence[0])))

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inferiorigy


In [11]:
y_test_pred = hmm2s.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM2_smooth")

HMM2_smooth score on test set
 * accuracy on full words : 79.61%
 * accuracy on letters    : 94.66%
   > typos corrected      : 406 (5.55%)
   > typos not corrected  : 339 (4.63%)
   > typos added          : 52 (0.71%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


## Testing HOHMM

In [None]:
from SimpleHOHMM import HiddenMarkovModelBuilder, HiddenMarkovModel

order = 2
hohmm_builder = HiddenMarkovModelBuilder()
hohmm_builder.set_single_states(states)
hohmm_builder.set_all_obs(observations)
hohmm_builder.add_batch_training_examples(X_train, y_train)
hohmm = hohmm_builder.build(highest_order=order)

y_test_pred_hohmm = []
for obs_seq in X_test:
    y_test_pred_hohmm.append(hohmm.decode(obs_seq))

display_correction_stats(X_test, y_test, y_test_pred_hohmm, name="HOHMM{}".format(order))

# IV - Character insertion

In [12]:
# Insert noisy observations (linked to no state) in the database
# Empty state will be denoted '_'
INSERTION_PROB = 0.05
(X_ni_train, y_ni_train), (X_ni_test, y_ni_test) = noisy_insertion(X_train, y_train, X_test, y_test, thresh_proba=INSERTION_PROB)

SAMPLE = 3
print("States sequence sample       : {}".format(y_ni_train[SAMPLE]))
print("Observations sequence sample : {}\n".format(X_ni_train[SAMPLE]))

states_ni, observations_ni = get_observations_states(X_ni_train, y_ni_train)
print("{} states :\n{}".format(len(states_ni), states_ni))
print("{} observations :\n{}".format(len(observations_ni), observations_ni))

States sequence sample       : ['a', '_', 'c', 'c', 'o', 'u', 'n', 't']
Observations sequence sample : ['a', 'a', 'c', 'v', 'o', 'u', 'n', 't']

27 states :
['_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26 observations :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
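
To make the construction concrete, here is a minimal sketch of how such insertions could be generated for one word. It is not the code of `noisy_insertion` from `toolbox`: the name `insert_noise` is hypothetical, and it assumes that a random letter is inserted after each position with probability `insertion_prob` and paired with the empty state `'_'`, which matches the sample above.

```python
import random
import string


def insert_noise(obs_seq, state_seq, insertion_prob=0.05, rng=None, empty_state="_"):
    """Insert spurious observations, each linked to the empty state.

    Hypothetical sketch: after every (observation, state) pair, a random
    lowercase letter is inserted with probability `insertion_prob`.
    """
    rng = rng or random.Random()
    new_obs, new_states = [], []
    for obs, state in zip(obs_seq, state_seq):
        new_obs.append(obs)
        new_states.append(state)
        if rng.random() < insertion_prob:
            new_obs.append(rng.choice(string.ascii_lowercase))
            new_states.append(empty_state)  # this observation maps to no real letter
    return new_obs, new_states


obs, sts = insert_noise(list("acvount"), list("account"), 0.3, random.Random(1))
print(obs)
print(sts)
```

Dropping every position whose predicted state is `'_'` recovers a word of the original length, which is how a decoded sequence can be turned back into corrected text.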


In [13]:
# HMM first order
hmm = HMM(states_ni, observations_ni)
hmm.fit(X_ni_train, y_ni_train)
y_ni_test_pred = hmm.predict(X_ni_test)
display_correction_stats(X_ni_test, y_ni_test, y_ni_test_pred, name="\nHMM1")

1st order HMM created with: 
 * 27 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.

HMM1 score on test set
 * accuracy on full words : 61.89%
 * accuracy on letters    : 88.67%
   > typos corrected      : 327 (4.27%)
   > typos not corrected  : 748 (9.78%)
   > typos added          : 119 (1.56%)

Dummy score on test set
 * accuracy on full words : 52.10%
 * accuracy on letters    : 85.95%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 1075 (14.05%)
   > typos added          : 0 (0.00%)


In [14]:
# HMM second order
hmm2 = HMM2(states_ni, observations_ni)
hmm2.fit(X_ni_train, y_ni_train)
y_ni_test_pred = hmm2.predict(X_ni_test)
display_correction_stats(X_ni_test, y_ni_test, y_ni_test_pred, name="\nHMM2")

2nd order HMM created with: 
 * 27 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.

HMM2 score on test set
 * accuracy on full words : 71.55%
 * accuracy on letters    : 90.34%
   > typos corrected      : 544 (7.11%)
   > typos not corrected  : 531 (6.94%)
   > typos added          : 208 (2.72%)

Dummy score on test set
 * accuracy on full words : 52.10%
 * accuracy on letters    : 85.95%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 1075 (14.05%)
   > typos added          : 0 (0.00%)


In [15]:
# HMM second order with smoothing
hmm2 = HMM2(states_ni, observations_ni)
hmm2.fit(X_ni_train, y_ni_train, smoothing=True)
y_ni_test_pred = hmm2.predict(X_ni_test)
display_correction_stats(X_ni_test, y_ni_test, y_ni_test_pred, name="\nHMM2_smooth")

2nd order HMM created with: 
 * 27 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.

HMM2_smooth score on test set
 * accuracy on full words : 65.82%
 * accuracy on letters    : 90.17%
   > typos corrected      : 406 (5.31%)
   > typos not corrected  : 669 (8.75%)
   > typos added          : 83 (1.08%)

Dummy score on test set
 * accuracy on full words : 52.10%
 * accuracy on letters    : 85.95%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 1075 (14.05%)
   > typos added          : 0 (0.00%)


# V - Character omission

In [16]:
def cross_states(states):
    """ Take a list of states and return it extended with all cross states
    (each cross state is the concatenation of two states from states)
    Ex: If states=['a', 'b'], the result is ['a', 'b', 'aa', 'ab', 'ba', 'bb']

    :param states: list of string, the single states
    :return: list of string, the single states followed by all cross states
    """

    # Every ordered pair (s1, s2), concatenated into a single two-letter state
    pairs = [s1 + s2 for s1 in states for s2 in states]

    return states + pairs

In [17]:
# Remove some observations from the database
# The state of each removed observation is merged into the previous state
DELETION_PROB = 0.05
(X_no_train, y_no_train), (X_no_test, y_no_test) = noisy_omission(X_train, y_train, X_test, y_test, thresh_proba=DELETION_PROB)

states_no = cross_states(states)
observations_no = observations

print("{} states :\n{}".format(len(states_no), states_no))
print("{} observations :\n{}".format(len(observations_no), observations_no))

702 states :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay', 'az', 'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bk', 'bl', 'bm', 'bn', 'bo', 'bp', 'bq', 'br', 'bs', 'bt', 'bu', 'bv', 'bw', 'bx', 'by', 'bz', 'ca', 'cb', 'cc', 'cd', 'ce', 'cf', 'cg', 'ch', 'ci', 'cj', 'ck', 'cl', 'cm', 'cn', 'co', 'cp', 'cq', 'cr', 'cs', 'ct', 'cu', 'cv', 'cw', 'cx', 'cy', 'cz', 'da', 'db', 'dc', 'dd', 'de', 'df', 'dg', 'dh', 'di', 'dj', 'dk', 'dl', 'dm', 'dn', 'do', 'dp', 'dq', 'dr', 'ds', 'dt', 'du', 'dv', 'dw', 'dx', 'dy', 'dz', 'ea', 'eb', 'ec', 'ed', 'ee', 'ef', 'eg', 'eh', 'ei', 'ej', 'ek', 'el', 'em', 'en', 'eo', 'ep', 'eq', 'er', 'es', 'et', 'eu', 'ev', 'ew', 'ex', 'ey', 'ez', 'fa', 'fb', 'fc', 'fd', 'fe', 'ff', 'fg', 'fh', 'fi', 'fj', 'fk', 'fl', 'fm'
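
The 702 states are the 26 single letters plus the 26 × 26 = 676 two-letter cross states. A minimal sketch of how omissions could be generated for one word follows. It is not the code of `noisy_omission` from `toolbox`: the name `omit_noise` is hypothetical, and it assumes that the first letter is never dropped and that a state is merged at most once, so that every state stays within the one- or two-letter set built by `cross_states`.

```python
import random


def omit_noise(obs_seq, state_seq, deletion_prob=0.05, rng=None):
    """Drop some observations and fold their state into the previous state.

    Hypothetical sketch: an observation can only be dropped when the
    previous state is still a single letter, so merged states have
    at most two letters.
    """
    rng = rng or random.Random()
    new_obs, new_states = [], []
    for obs, state in zip(obs_seq, state_seq):
        can_merge = bool(new_states) and len(new_states[-1]) == 1
        if can_merge and rng.random() < deletion_prob:
            new_states[-1] += state  # observation dropped, state kept in the pair
        else:
            new_obs.append(obs)
            new_states.append(state)
    return new_obs, new_states


obs, sts = omit_noise(list("account"), list("account"), 0.3, random.Random(2))
print(obs)
print(sts)
```

Concatenating the decoded states gives back the full word, including the letters whose observations were dropped.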

In [18]:
# HMM first order
hmm = HMM(states_no, observations_no)
hmm.fit(X_no_train, y_no_train)
y_no_test_pred = hmm.predict(X_no_test)
display_correction_stats(X_no_test, y_no_test, y_no_test_pred)

1st order HMM created with: 
 * 702 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.
HMM score on test set
 * accuracy on full words : 62.03%
 * accuracy on letters    : 88.60%
   > typos corrected      : 286 (4.12%)
   > typos not corrected  : 717 (10.33%)
   > typos added          : 74 (1.07%)

Dummy score on test set
 * accuracy on full words : 51.90%
 * accuracy on letters    : 85.54%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 1003 (14.46%)
   > typos added          : 0 (0.00%)


In [19]:
# HMM second order
hmm2 = HMM2(states_no, observations_no)
hmm2.fit(X_no_train, y_no_train)
y_no_test_pred = hmm2.predict(X_no_test)
display_correction_stats(X_no_test, y_no_test, y_no_test_pred)

2nd order HMM created with: 
 * 702 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.


In [None]:
# HMM second order with smoothing
hmm2s = HMM2(states_no, observations_no)
hmm2s.fit(X_no_train, y_no_train, smoothing=True)
y_no_test_pred = hmm2s.predict(X_no_test)
display_correction_stats(X_no_test, y_no_test, y_no_test_pred)