# Typo correction

The goal of this project is to correct typos in a text without using a dictionary, by means of a hidden Markov model (HMM).

## Loading the data

The data comes from the *Unabomber Manifesto* and has been artificially corrupted by randomly modifying 10% or 20% of the letters in the corpus. Each word is stored as a list of `(observed letter, true letter)` pairs.
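
To make the corruption step concrete, here is a minimal sketch of how such a noisy corpus could be produced. The noise model actually used for this dataset is not described here, so the uniform substitution below, the function name `add_typos`, and its parameters are all assumptions made for illustration.

```python
import random
import string

def add_typos(word, error_rate, rng):
    """Return a list of (observed, true) letter pairs for `word`.

    Each letter is replaced with probability `error_rate`.
    Assumption: the substitute is drawn uniformly among the other
    lowercase letters (the real corpus may use another noise model).
    """
    pairs = []
    for true_letter in word:
        if rng.random() < error_rate:
            choices = [c for c in string.ascii_lowercase if c != true_letter]
            observed = rng.choice(choices)
        else:
            observed = true_letter
        pairs.append((observed, true_letter))
    return pairs

rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_word = add_typos("account", error_rate=0.10, rng=rng)
print(noisy_word)
```

The pair order `(observed, true)` mirrors the way the notebook builds `X_train` from `token[0]` and `y_train` from `token[1]`.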

In [1]:
import numpy as np

from HMM import *
from toolbox import *

In [2]:
# Import and separate datasets
ERROR_RATE = 10  # 10% or 20%
train_set, test_set = load_db(error_rate=ERROR_RATE)
X_train = [[token[0] for token in word] for word in train_set]
y_train = [[token[1] for token in word] for word in train_set]
X_test = [[token[0] for token in word] for word in test_set]
y_test = [[token[1] for token in word] for word in test_set]

# Get states and observations sets
states, observations = get_observations_states(X_train, y_train)
print("{} states :\n{}".format(len(states), states))
print("{} observations :\n{}".format(len(observations), observations))

# Example from dataset
print("\nSample example :\n{}".format(train_set[3]))

26 states :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26 observations :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Sample example :
[('a', 'a'), ('c', 'c'), ('v', 'c'), ('o', 'o'), ('u', 'u'), ('n', 'n'), ('t', 't')]


## First-order forward HMM

A first attempt at correcting typos with a first-order HMM, decoded with the Viterbi algorithm (forward).
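
For readers unfamiliar with Viterbi decoding, the sketch below shows the idea on a generic first-order HMM. It is not the code of the `HMM` class used in this notebook: the function `viterbi`, its argument names, and the toy probabilities are hypothetical, and states and observations are assumed to be already mapped to integer indices.

```python
import numpy as np

def viterbi(obs_idx, log_pi, log_A, log_B):
    """Most likely hidden state path for one observation sequence.

    obs_idx : list of observation indices
    log_pi  : (S,) log initial state probabilities
    log_A   : (S, S) log transitions, log_A[i, j] = log P(state j | state i)
    log_B   : (S, O) log emissions, log_B[i, o] = log P(obs o | state i)
    """
    T, S = len(obs_idx), len(log_pi)
    delta = np.empty((T, S))               # best log score ending in each state
    backptr = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs_idx[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, go to j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs_idx[t]]
    # Follow the back-pointers from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy model with 2 states and 3 observation symbols (made-up numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
path = viterbi([0, 1, 2], np.log(pi), np.log(A), np.log(B))
print(path)  # [0, 0, 1]
```

Working in log space avoids numerical underflow on long words. For typo correction, the states would be the 26 true letters and the observations the 26 typed letters.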

In [3]:
# Initialize and train HMM
hmm = HMM(states, observations)
hmm.fit(X_train, y_train)

1st order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.
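
The three training steps logged above can be pictured as simple counting over the labelled words. The sketch below is an assumption about how such estimates are typically obtained, not the actual `fit` method: the function `estimate_hmm` and the additive smoothing parameter `alpha` are hypothetical.

```python
from collections import Counter

def estimate_hmm(X, y, states, observations, alpha=1.0):
    """Estimate initial, transition and emission probabilities by counting.

    Assumption: additive (Laplace) smoothing with alpha > 0, so that
    unseen events keep a small non-zero probability.
    """
    init, trans, emit = Counter(), Counter(), Counter()
    for obs_seq, state_seq in zip(X, y):
        if not state_seq:
            continue  # skip empty words
        init[state_seq[0]] += 1
        for prev, cur in zip(state_seq, state_seq[1:]):
            trans[prev, cur] += 1
        for state, obs in zip(state_seq, obs_seq):
            emit[state, obs] += 1

    def normalise(counts, rows, cols):
        # Turn the counts of each row into a smoothed distribution over cols.
        table = {}
        for r in rows:
            total = sum(counts[r, c] for c in cols) + alpha * len(cols)
            table[r] = {c: (counts[r, c] + alpha) / total for c in cols}
        return table

    n_seq = sum(init.values())
    pi = {s: (init[s] + alpha) / (n_seq + alpha * len(states)) for s in states}
    A = normalise(trans, states, states)
    B = normalise(emit, states, observations)
    return pi, A, B

# Tiny usage example on a two-letter alphabet.
pi, A, B = estimate_hmm(
    X=[["a", "b"], ["a", "a"]],   # observed letters
    y=[["a", "b"], ["a", "b"]],   # true letters
    states=["a", "b"],
    observations=["a", "b"],
)
print(pi, A["a"], B["b"])
```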


In [4]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = X_test[SAMPLE]
states_sequence = y_test[SAMPLE]

predicted_states_sequence = hmm.predict([observation_sequence])

print("Observation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}".format("".join(predicted_states_sequence[0])))

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inderiorigy


In [5]:
# HMM predicted letters
y_test_pred = hmm.predict(X_test)

# Compute error statistics
hmm_char_res, hmm_word_res = compute_corrections_stats(X_test, y_test, y_test_pred)
dummy_char_res, dummy_word_res = compute_corrections_stats(X_test, y_test, X_test)

# display main results
print("HMM score on test set")
print(" * accuracy on full words : {:.2f}%".format(hmm_word_res['accuracy'] * 100))
print(" * accuracy on letters    : {:.2f}%".format(hmm_char_res['accuracy'] * 100))
print("   > typos corrected      : {} ({:.2f}%)".format(hmm_char_res['typo_correction'], hmm_char_res['typo_correction'] / hmm_char_res['n_tokens'] * 100))
print("   > typos not corrected  : {} ({:.2f}%)".format(hmm_char_res['typo_nocorrection'], hmm_char_res['typo_nocorrection'] / hmm_char_res['n_tokens'] * 100))
print("   > typos added          : {} ({:.2f}%)".format(hmm_char_res['notypo_correction'], hmm_char_res['notypo_correction'] / hmm_char_res['n_tokens'] * 100))

print("\nDummy score on test set")
print(" * accuracy on full words : {:.2f}%".format(dummy_word_res['accuracy'] * 100))
print(" * accuracy on letters    : {:.2f}%".format(dummy_char_res['accuracy'] * 100))
print("   > typos corrected      : {} ({:.2f}%)".format(dummy_char_res['typo_correction'], dummy_char_res['typo_correction'] / dummy_char_res['n_tokens'] * 100))
print("   > typos not corrected  : {} ({:.2f}%)".format(dummy_char_res['typo_nocorrection'], dummy_char_res['typo_nocorrection'] / dummy_char_res['n_tokens'] * 100))
print("   > typos added          : {} ({:.2f}%)".format(dummy_char_res['notypo_correction'], dummy_char_res['notypo_correction'] / dummy_char_res['n_tokens'] * 100))

HMM score on test set
 * accuracy on full words : 75.15%
 * accuracy on letters    : 93.20%
   > typos corrected      : 310 (4.23%)
   > typos not corrected  : 435 (5.94%)
   > typos added          : 63 (0.86%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)
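
To clarify what the three letter-level counters mean, here is a hypothetical reimplementation of the counting logic. It reuses the key names printed above but is only a sketch: the real `compute_corrections_stats` from `toolbox` may differ, for instance in how it handles word-level accuracy.

```python
def correction_stats(X, y_true, y_pred):
    """Count letter-level outcomes of a correction.

    X, y_true, y_pred are lists of words, each word being a list of
    observed, true and predicted letters respectively.
    """
    stats = {"n_tokens": 0, "typo_correction": 0,
             "typo_nocorrection": 0, "notypo_correction": 0}
    for obs_word, true_word, pred_word in zip(X, y_true, y_pred):
        for obs, true, pred in zip(obs_word, true_word, pred_word):
            stats["n_tokens"] += 1
            if obs != true and pred == true:
                stats["typo_correction"] += 1    # typo fixed
            elif obs != true:
                stats["typo_nocorrection"] += 1  # typo still wrong
            elif pred != true:
                stats["notypo_correction"] += 1  # correct letter broken
    errors = stats["typo_nocorrection"] + stats["notypo_correction"]
    stats["accuracy"] = 1 - errors / stats["n_tokens"]
    return stats

# The sample word decoded earlier by the first-order HMM.
res = correction_stats([list("inferikrigy")],
                       [list("inferiority")],
                       [list("inderiorigy")])
print(res)
```

On this word the model fixes one typo (`k` to `o`), leaves one (`g`), and introduces one (`f` to `d`). The same bookkeeping explains the totals above: letter accuracy is one minus the share of uncorrected and added typos, and the dummy baseline, which returns the input unchanged, can neither fix nor add any.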



## Second-order forward HMM
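
A second-order HMM conditions each hidden letter on the two previous ones instead of one, which lets the model exploit letter trigrams. As a rough illustration, the sketch below estimates such transition probabilities by counting. The function `second_order_transitions` and the smoothing parameter `alpha` are assumptions, and nothing here is taken from the `HMM2` class.

```python
from collections import Counter

def second_order_transitions(y, states, alpha=1.0):
    """Estimate P(s_t | s_{t-2}, s_{t-1}) by counting letter trigrams.

    Assumption: additive smoothing with alpha > 0 for unseen trigrams.
    """
    counts = Counter()
    for seq in y:
        # Slide a window of three consecutive true letters over the word.
        for a, b, c in zip(seq, seq[1:], seq[2:]):
            counts[a, b, c] += 1
    probs = {}
    for a in states:
        for b in states:
            total = sum(counts[a, b, c] for c in states) + alpha * len(states)
            probs[a, b] = {c: (counts[a, b, c] + alpha) / total for c in states}
    return probs

# Tiny usage example on a two-letter alphabet.
probs = second_order_transitions(
    [list("aab"), list("aab"), list("aaa")], states=["a", "b"]
)
print(probs["a", "a"])
```

With 26 letters this table has 26 x 26 contexts of 26 probabilities each, so it needs noticeably more training data than the first-order transition matrix.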

In [6]:
# Initialize and train HMM
hmm2 = HMM2(states, observations)
hmm2.fit(X_train, y_train)

2nd order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.


In [7]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = X_test[SAMPLE]
states_sequence = y_test[SAMPLE]

predicted_states_sequence = hmm2.predict([observation_sequence])

print("Observation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}".format("".join(predicted_states_sequence[0])))

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inferiority


In [8]:
# HMM predicted letters
y_test_pred = hmm2.predict(X_test)

# Compute error statistics
hmm_char_res, hmm_word_res = compute_corrections_stats(X_test, y_test, y_test_pred)
dummy_char_res, dummy_word_res = compute_corrections_stats(X_test, y_test, X_test)

# display main results
print("HMM score on test set")
print(" * accuracy on full words : {:.2f}%".format(hmm_word_res['accuracy'] * 100))
print(" * accuracy on letters    : {:.2f}%".format(hmm_char_res['accuracy'] * 100))
print("   > typos corrected      : {} ({:.2f}%)".format(hmm_char_res['typo_correction'], hmm_char_res['typo_correction'] / hmm_char_res['n_tokens'] * 100))
print("   > typos not corrected  : {} ({:.2f}%)".format(hmm_char_res['typo_nocorrection'], hmm_char_res['typo_nocorrection'] / hmm_char_res['n_tokens'] * 100))
print("   > typos added          : {} ({:.2f}%)".format(hmm_char_res['notypo_correction'], hmm_char_res['notypo_correction'] / hmm_char_res['n_tokens'] * 100))

print("\nDummy score on test set")
print(" * accuracy on full words : {:.2f}%".format(dummy_word_res['accuracy'] * 100))
print(" * accuracy on letters    : {:.2f}%".format(dummy_char_res['accuracy'] * 100))
print("   > typos corrected      : {} ({:.2f}%)".format(dummy_char_res['typo_correction'], dummy_char_res['typo_correction'] / dummy_char_res['n_tokens'] * 100))
print("   > typos not corrected  : {} ({:.2f}%)".format(dummy_char_res['typo_nocorrection'], dummy_char_res['typo_nocorrection'] / dummy_char_res['n_tokens'] * 100))
print("   > typos added          : {} ({:.2f}%)".format(dummy_char_res['notypo_correction'], dummy_char_res['notypo_correction'] / dummy_char_res['n_tokens'] * 100))

HMM score on test set
 * accuracy on full words : 83.41%
 * accuracy on letters    : 95.57%
   > typos corrected      : 513 (7.01%)
   > typos not corrected  : 232 (3.17%)
   > typos added          : 92 (1.26%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)
