# Typo correction

The goal of this project is to correct typos in a text without relying on a dictionary, using a hidden Markov model (HMM).

## Loading the data

In [8]:
import numpy as np

from HMM import *
from toolbox import *

In [9]:
# Import data and separate datasets
ERROR_RATE = 10  # 10% or 20%
train_set, test_set = load_db(error_rate=ERROR_RATE)

# Get states and observations sets
states, observations = get_states_observations(train_set)
print("{} states :\n{}".format(len(states), states))
print("{} observations :\n{}".format(len(observations), observations))

# Example from dataset
print("\nSample example :\n{}".format(train_set[1]))

26 states :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26 observations :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Sample example :
[('y', 't'), ('j', 'h'), ('w', 'e'), ('i', 'i'), ('r', 'r')]


## First-order forward HMM

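A first attempt at correcting typos with a first-order HMM, decoded with the Viterbi algorithm (forward). The hidden states are the intended letters and the observations are the letters actually typed.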
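To make the decoding step concrete, here is a minimal sketch of first-order Viterbi decoding in log space. It is not the project's `HMM.predict` implementation: the function name `viterbi`, the helper `logs`, the dict-of-dicts parameter layout, and the toy probabilities are all assumptions made for illustration, and every probability is assumed to be nonzero so that `math.log` is defined.

```python
import math


def logs(table):
    """Convert a dict of probabilities into log-probabilities (hypothetical helper)."""
    return {key: math.log(value) for key, value in table.items()}


def viterbi(obs_seq, states, log_init, log_trans, log_emit):
    """Return the most likely state sequence for obs_seq under a first-order HMM.

    log_init[s], log_trans[prev][s] and log_emit[s][o] hold log-probabilities
    (an illustrative layout, not the one used by the project's HMM class).
    """
    if not obs_seq:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_init[s] + log_emit[s][obs_seq[0]] for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at step t
    for obs in obs_seq[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Backtrack from the most likely final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy two-letter alphabet with "sticky" transitions (made-up numbers)
toy_states = ["a", "b"]
toy_init = logs({"a": 0.6, "b": 0.4})
toy_trans = {"a": logs({"a": 0.9, "b": 0.1}), "b": logs({"a": 0.1, "b": 0.9})}
toy_emit = {"a": logs({"a": 0.9, "b": 0.1}), "b": logs({"a": 0.2, "b": 0.8})}

corrected = viterbi(list("aba"), toy_states, toy_init, toy_trans, toy_emit)
print(corrected)
```

With these toy numbers the isolated `b` in `aba` is decoded as `a`, because the transition probabilities make a single switch and switch back less likely than one emission error.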

In [10]:
# Initialize and train HMM
hmm = HMM(states, observations)
hmm.fit(train_set)

HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.
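The three quantities reported in the training log (initial, transition, and observation probabilities) can be estimated from labelled pairs by relative-frequency counting. The sketch below shows that idea only; whether `HMM.fit` works exactly this way is an assumption, and the names `estimate_hmm` and `normalize` are hypothetical. Each word is taken to be a list of `(observed, true)` letter pairs, as in the dataset sample.

```python
from collections import Counter, defaultdict


def normalize(counter):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def estimate_hmm(words):
    """Estimate HMM parameters from words made of (observed, true) letter pairs."""
    init = Counter()              # counts of the true first letter of each word
    trans = defaultdict(Counter)  # trans[prev][cur]: true-letter bigram counts
    emit = defaultdict(Counter)   # emit[state][observed]: typing counts
    for word in words:
        if not word:
            continue
        init[word[0][1]] += 1
        for observed, state in word:
            emit[state][observed] += 1
        for (_, prev), (_, cur) in zip(word, word[1:]):
            trans[prev][cur] += 1
    return (
        normalize(init),
        {state: normalize(counts) for state, counts in trans.items()},
        {state: normalize(counts) for state, counts in emit.items()},
    )


# Two toy words: "the" typed with errors, then typed correctly
toy_words = [
    [("y", "t"), ("j", "h"), ("w", "e")],
    [("t", "t"), ("h", "h"), ("e", "e")],
]
init_p, trans_p, emit_p = estimate_hmm(toy_words)
print(emit_p["t"])
```

Plain counting assigns probability zero to any event never seen in training, so a practical implementation would likely add some smoothing.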


In [11]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = [token[0] for token in test_set[SAMPLE]]
states_sequence = [token[1] for token in test_set[SAMPLE]]

predicted_states_sequence = hmm.predict(observation_sequence)

print("Observation sequence      : {}".format(observation_sequence))
print("Real states sequence      : {}".format(states_sequence))
print("Predicted states sequence : {}".format(predicted_states_sequence))

Observation sequence      : ['r', 'k', 'r']
Real states sequence      : ['f', 'o', 'r']
Predicted states sequence : ['r', 'o', 'r']


In [12]:
# Evaluate letter-level and whole-word accuracy on the test set
score_letters, score_words = hmm.score(test_set)
print("HMM score on test set :")
print(" * {:.2f}% accuracy on letters".format(score_letters * 100))
print(" * {:.2f}% accuracy on full words".format(score_words * 100))

HMM score on test set :
 * 86.57% accuracy on letters
 * 57.32% accuracy on full words
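The gap between the two figures follows from how they are counted: a word is only correct if every one of its letters is. A minimal sketch of both metrics, assuming they are plain per-letter and per-word match rates (the function `accuracy` is hypothetical and not the project's `HMM.score`):

```python
def accuracy(true_words, predicted_words):
    """Return (letter accuracy, word accuracy) for paired lists of letter sequences."""
    letters_ok = letters_total = words_ok = 0
    for truth, pred in zip(true_words, predicted_words):
        matches = sum(t == p for t, p in zip(truth, pred))
        letters_ok += matches
        letters_total += len(truth)
        # A word counts only if every letter matches
        words_ok += int(len(pred) == len(truth) and matches == len(truth))
    return letters_ok / letters_total, words_ok / len(true_words)


# "for" decoded as "ror" (one wrong letter), plus one fully correct word
truth = [["f", "o", "r"], ["t", "h", "e"]]
predicted = [["r", "o", "r"], ["t", "h", "e"]]
letter_acc, word_acc = accuracy(truth, predicted)
print(letter_acc, word_acc)
```

In this toy case five of six letters are right, yet only one of two words is, which mirrors the pattern seen on the test set.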



## Second-order forward HMM