# Typo correction

The goal of this project is to correct typing errors in a text without relying on a dictionary, using a hidden Markov model (HMM).

# I - Loading the data

The data come from the *Unabomber Manifesto* and were artificially corrupted by randomly modifying 10% or 20% of the letters in the corpus (substitution errors only).
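For illustration, the corruption step could look like the sketch below. The function name `add_substitution_noise` and its parameters are hypothetical: the corpus returned by `load_db` is already noised, and its actual noising procedure is not shown here. The sketch only shows how each letter can be replaced by a different random letter with a fixed probability, while the true letter is kept as the state.

```python
import random
import string


def add_substitution_noise(word, error_rate=0.10, rng=None):
    """Return a list of (observation, state) pairs for `word`.

    Each letter is replaced by a different random lowercase letter with
    probability `error_rate`; the true letter is kept as the state.
    """
    rng = rng or random.Random()
    pairs = []
    for letter in word:
        observed = letter
        if rng.random() < error_rate:
            # Draw a replacement that differs from the true letter
            candidates = [c for c in string.ascii_lowercase if c != letter]
            observed = rng.choice(candidates)
        pairs.append((observed, letter))
    return pairs


print(add_substitution_noise("account", error_rate=0.10, rng=random.Random(0)))
```

The output has the same `(observation, state)` layout as the dataset sample printed below.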

In [1]:
import numpy as np

from HMM import *
from toolbox import *

In [2]:
# Import and separate datasets
ERROR_RATE = 10  # 10% or 20%
train_set, test_set = load_db(error_rate=ERROR_RATE)
X_train = [[token[0] for token in word] for word in train_set]
y_train = [[token[1] for token in word] for word in train_set]
X_test = [[token[0] for token in word] for word in test_set]
y_test = [[token[1] for token in word] for word in test_set]

# Get states and observations sets
states, observations = get_observations_states(X_train, y_train)
print("{} states :\n{}".format(len(states), states))
print("{} observations :\n{}".format(len(observations), observations))

# Example from dataset
print("\nSample example (observation, état) :\n{}".format(train_set[3]))

26 states :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26 observations :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Sample example (observation, état) :
[('a', 'a'), ('c', 'c'), ('v', 'c'), ('o', 'o'), ('u', 'u'), ('n', 'n'), ('t', 't')]


# II - First-order HMM

A first attempt at typo correction with a first-order HMM decoded by the Viterbi algorithm.
Only the previous time step is taken into account when computing the transition. The model therefore uses the following probabilities:
- $P(X_t | Y_t)$: probability of observing $X_t$ given the state $Y_t$ at time $t$ (emission)
- $P(Y_t | Y_{t-1})$: probability of the state $Y_t$ at time $t$ given the previous state (transition)
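These two probabilities, together with the initial state distribution, are all that Viterbi decoding needs. The sketch below is a minimal log-space version written for this note; it is not the code of `HMM.predict`, and the argument names (`log_init`, `log_trans`, `log_emit`) are assumptions.

```python
import numpy as np


def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely state path for a first-order HMM, computed in log space.

    obs       : sequence of observation indices, length T
    log_init  : (S,)   log P(Y_0 = s)
    log_trans : (S, S) log P(Y_t = j | Y_{t-1} = i) stored at [i, j]
    log_emit  : (S, O) log P(X_t = o | Y_t = s) stored at [s, o]
    """
    T, S = len(obs), len(log_init)
    delta = np.empty((T, S))               # best log score of a path ending in each state
    backptr = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        # scores[i, j]: best path ending in i at t-1, followed by the move i -> j
        scores = delta[t - 1][:, None] + log_trans
        backptr[t] = np.argmax(scores, axis=0)
        delta[t] = scores[backptr[t], np.arange(S)] + log_emit[:, obs[t]]
    # Backtrack from the best final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


# Toy model with 2 states and 2 observations: states tend to persist,
# and each state mostly emits the observation with the same index.
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1], [0.1, 0.9]])
log_emit = np.log([[0.8, 0.2], [0.2, 0.8]])
print(viterbi([0, 0, 1, 0, 0], log_init, log_trans, log_emit))
```

In the toy call, the isolated observation `1` plays the role of a typo: the transition model makes it cheaper to explain it as a noisy emission than as two state changes.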

To avoid zero probabilities, all matrices (emission, transition and initial state) are initialized with a uniform probability ($1/n_{observations}$, $1/n_{states}$ and $1/n_{states}$ respectively), then refined with the unigram and bigram counts following maximum likelihood estimation.
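A minimal sketch of this estimation is given below. It assumes that the uniform value acts as a pseudo-count added to every cell before the observed counts are accumulated and the rows normalised; whether `HMM.fit` proceeds exactly this way is an assumption, and `fit_counts` is a hypothetical name.

```python
import numpy as np


def fit_counts(X, y, states, observations):
    """Estimate log initial, transition and emission probabilities from labelled words.

    Every table starts from a uniform value instead of zero, receives one
    count per observed event, and is finally normalised.
    """
    s_idx = {s: i for i, s in enumerate(states)}
    o_idx = {o: i for i, o in enumerate(observations)}
    n_s, n_o = len(states), len(observations)

    init = np.full(n_s, 1.0 / n_s)
    trans = np.full((n_s, n_s), 1.0 / n_s)
    emit = np.full((n_s, n_o), 1.0 / n_o)

    for obs_seq, state_seq in zip(X, y):
        if not state_seq:
            continue  # skip empty words
        ids = [s_idx[s] for s in state_seq]
        init[ids[0]] += 1                      # unigram count of the first state
        for prev, cur in zip(ids, ids[1:]):
            trans[prev, cur] += 1              # bigram count of consecutive states
        for s, o in zip(ids, obs_seq):
            emit[s, o_idx[o]] += 1             # (state, observation) count

    init /= init.sum()
    trans /= trans.sum(axis=1, keepdims=True)
    emit /= emit.sum(axis=1, keepdims=True)
    return np.log(init), np.log(trans), np.log(emit)


toy_X = [list("ab"), list("aa")]
toy_y = [list("ab"), list("aa")]
log_init, log_trans, log_emit = fit_counts(toy_X, toy_y, ["a", "b"], ["a", "b"])
print(np.exp(log_trans))
```

Because no cell is ever exactly zero, every logarithm is finite, including for transitions that never appear in the training corpus.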

In [3]:
# Initialize and train HMM
hmm1 = HMM(states, observations)
hmm1.fit(X_train, y_train)

1st order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.


In [4]:
# Try HMM prediction on one sample
SAMPLE = 11
observation_sequence = X_test[SAMPLE]
states_sequence = y_test[SAMPLE]

predicted_states_sequence = hmm1.predict([observation_sequence])

print("Observation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}".format("".join(predicted_states_sequence[0])))

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inderiorigy


In [5]:
y_test_pred = hmm1.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM1")

HMM1 score on test set
 * accuracy on full words : 75.15%
 * accuracy on letters    : 93.20%
   > typos corrected      : 310 (4.23%)
   > typos not corrected  : 435 (5.94%)
   > typos added          : 63 (0.86%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)



# III - Second-order HMM

## Second-order HMM

The two previous time steps are now taken into account when computing the transition. The model therefore uses the following probabilities:
- $P(X_t | Y_t)$: probability of observing $X_t$ given the state $Y_t$ at time $t$ (same as in the first-order model)
- $P(Y_t | Y_{t-1}, Y_{t-2})$: probability of the state $Y_t$ at time $t$ given the two previous states

Because the transition matrix has one more dimension, the risk that some trigrams were never observed in the training corpus is even higher. Several kinds of *smoothing* can therefore be tested to extrapolate these missing data and to measure how much this treatment matters.
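To get a feel for this sparsity, the sketch below counts state trigrams in a 26 x 26 x 26 tensor and reports the fraction of cells that were never observed. The helper `trigram_counts` is hypothetical and the two-word corpus is only a toy; running it on `y_train` would give the figure for the real corpus.

```python
import string

import numpy as np


def trigram_counts(y, states):
    """Count every triple of consecutive states in the labelled sequences."""
    s_idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = np.zeros((n, n, n), dtype=np.int64)
    for seq in y:
        ids = [s_idx[s] for s in seq]
        for a, b, c in zip(ids, ids[1:], ids[2:]):
            counts[a, b, c] += 1
    return counts


toy_corpus = [list("account"), list("inferiority")]
counts = trigram_counts(toy_corpus, list(string.ascii_lowercase))
unseen = np.mean(counts == 0)
print("{} cells, {:.2%} never observed".format(counts.size, unseen))
```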

## No smoothing

All matrices are initialized to 0. Wherever counts are missing, the corresponding entries of the transition matrix stay at zero, and there are many of them (which causes some nice numerical errors because of the logarithms).
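The snippet below reproduces the two numerical problems on a tiny count matrix: a context that was never seen gives a 0/0 division (NaN), and a transition that was never seen gives log(0) (minus infinity). It also shows that `np.argmax` over an all-NaN row returns index 0, which may explain why a decoder fed with such values keeps predicting the first state; that link to the all-'a' prediction in the output below is a hypothesis, not something verified in the `HMM2` code.

```python
import numpy as np

counts = np.array([[3.0, 1.0, 0.0],    # context seen 4 times, last transition never seen
                   [0.0, 0.0, 0.0]])   # context never seen at all

# Silence the RuntimeWarnings that numpy would otherwise print here
with np.errstate(divide="ignore", invalid="ignore"):
    proba = counts / counts.sum(axis=1, keepdims=True)  # second row: 0/0 -> nan
    log_proba = np.log(proba)                            # log(0) -> -inf

print(log_proba)
print(np.argmax(log_proba[1]))  # index of the first NaN, i.e. the first state
```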

In [6]:
# Initialize and train HMM
hmm2 = HMM2(states, observations)
hmm2.fit(X_train, y_train, smoothing=None)

# Try HMM prediction on one sample
predicted_states_sequence = hmm2.predict([observation_sequence])
print("\nObservation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}\n".format("".join(predicted_states_sequence[0])))

# Run prediction on all test set
y_test_pred = hmm2.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM2_no_smoothing")

2nd order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states...

  self.initial_state_logproba = np.log(self.initial_state_logproba)
  self.transition2_logproba /= np.atleast_3d(np.sum(self.transition2_logproba, axis=2))
  self.transition1_logproba = np.log(self.transition1_logproba)
  self.transition2_logproba = np.log(self.transition2_logproba)
  self.observation_logproba = np.log(self.observation_logproba).astype(self.fp_precision)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


 Done.
Training observations probabilities given states... Done.

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : aaaaaaaaaaa

HMM2_no_smoothing score on test set
 * accuracy on full words : 20.65%
 * accuracy on letters    : 15.16%
   > typos corrected      : 71 (0.97%)
   > typos not corrected  : 674 (9.21%)
   > typos added          : 5536 (75.63%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


## Non-zero initialization of the matrices

To avoid zero probabilities, all matrices (emission, transition and initial state) are initialized with a uniform probability ($1/n_{observations}$, $1/n_{states}$ and $1/n_{states}$ respectively), then refined with the unigram, bigram and trigram counts following maximum likelihood estimation. The resulting probability distributions are then normalized.

This is the default behaviour of our HMMs.

In [7]:
# Initialize and train HMM
hmm2 = HMM2(states, observations)
hmm2.fit(X_train, y_train, smoothing='epsilon')

# Try HMM prediction on one sample
predicted_states_sequence = hmm2.predict([observation_sequence])
print("\nObservation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}\n".format("".join(predicted_states_sequence[0])))

# Run prediction on all test set
y_test_pred = hmm2.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM2_epsilon_smoothing")

2nd order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inferiority

HMM2_epsilon_smoothing score on test set
 * accuracy on full words : 83.41%
 * accuracy on letters    : 95.57%
   > typos corrected      : 513 (7.01%)
   > typos not corrected  : 232 (3.17%)
   > typos added          : 92 (1.26%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


## A more elaborate extrapolation model

Since many trigrams may never have been encountered in the training corpus, various smoothing or extrapolation techniques can be introduced. Here we test the technique introduced in [this article](http://www.aclweb.org/anthology/P99-1023).
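As we read the article, the trigram probability is interpolated with the bigram and unigram estimates, using weights that grow with the logarithm of the corresponding counts. The sketch below follows that reading; the function names, the handling of empty contexts and the choice of the natural logarithm are assumptions, and the exact form used by `HMM2.fit(smoothing='weighted')` is not shown here.

```python
import math


def count_weight(n):
    """Interpolation weight derived from a count: 0.5 for n = 0, tending to 1."""
    return (math.log(n + 1) + 1) / (math.log(n + 1) + 2)


def weighted_trigram_proba(n1, n2, n3, c0, c1, c2):
    """Smoothed P(t_k | t_i, t_j) mixing trigram, bigram and unigram estimates.

    n1: occurrences of t_k            c0: total number of states seen
    n2: occurrences of (t_j, t_k)     c1: occurrences of t_j
    n3: occurrences of (t_i, t_j, t_k) c2: occurrences of (t_i, t_j)
    """
    k2, k3 = count_weight(n2), count_weight(n3)
    # Empty contexts contribute nothing instead of dividing by zero (assumption)
    p3 = n3 / c2 if c2 else 0.0
    p2 = n2 / c1 if c1 else 0.0
    p1 = n1 / c0 if c0 else 0.0
    return k3 * p3 + (1 - k3) * k2 * p2 + (1 - k3) * (1 - k2) * p1


# Unseen trigram and bigram: the estimate falls back on the unigram frequency
print(weighted_trigram_proba(n1=5, n2=0, n3=0, c0=100, c1=10, c2=0))
```

Note that the weights depend on the counts of each individual trigram, so the values obtained for a given context do not necessarily sum to one unless they are renormalised afterwards.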

In [8]:
# Initialize and train HMM
hmm2 = HMM2(states, observations)
hmm2.fit(X_train, y_train, smoothing='weighted')

# Try HMM prediction on one sample
predicted_states_sequence = hmm2.predict([observation_sequence])
print("\nObservation sequence      : {}".format("".join(observation_sequence)))
print("Real states sequence      : {}".format("".join(states_sequence)))
print("Predicted states sequence : {}\n".format("".join(predicted_states_sequence[0])))

# Run prediction on all test set
y_test_pred = hmm2.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="HMM2_weighted_smoothing")

2nd order HMM created with: 
 * 26 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.

Observation sequence      : inferikrigy
Real states sequence      : inferiority
Predicted states sequence : inferiorigy

HMM2_weighted_smoothing score on test set
 * accuracy on full words : 79.61%
 * accuracy on letters    : 94.66%
   > typos corrected      : 406 (5.55%)
   > typos not corrected  : 339 (4.63%)
   > typos added          : 52 (0.71%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


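**Conclusion:** In the end this is less effective than the default initialization (79.61% of full words correct versus 83.41%), but it introduces fewer new errors (52 typos added versus 92), which is a more pleasant behaviour for the user.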

# IV - Character insertion

## Adapting the model

To simulate character insertions, an extra character is added with a given probability (one whose key is close to the previous one on the keyboard). Since this extra observation does not correspond to any state (no actual letter is associated with the insertion), it is modelled by a new state '\_'. The problem now has 27 states (26 letters + '\_') and 26 observations.

Example:
- **state sequence       :** ['a', 'c', 'c', 'o', '\_', 'u', 'n', 't']
- **observation sequence :** ['a', 'c', 'v', 'o', 'p', 'u', 'n', 't']
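The transformation illustrated by this example can be sketched as follows. This is not the `noisy_insertion` function of the toolbox, whose code is not shown: the name `add_insertions`, the partial `NEIGHBOURS` map (a few QWERTY keys) and the fallback that repeats the previous key are all assumptions made for illustration.

```python
import random

# Hypothetical, partial keyboard neighbourhood (QWERTY layout), illustration only
NEIGHBOURS = {"a": "qsz", "c": "xvd", "n": "bmh", "o": "ipl", "t": "ryg", "u": "yij"}


def add_insertions(obs_seq, state_seq, proba=0.10, rng=None):
    """Insert spurious observations, each labelled with the empty state '_'."""
    rng = rng or random.Random()
    new_obs, new_states = [], []
    for obs, state in zip(obs_seq, state_seq):
        new_obs.append(obs)
        new_states.append(state)
        if rng.random() < proba:
            # Insert a key close to the one just typed; repeat it if unknown
            new_obs.append(rng.choice(NEIGHBOURS.get(obs, obs)))
            new_states.append("_")
    return new_obs, new_states


obs, states = add_insertions(list("acvount"), list("account"), 0.3, random.Random(20))
print(states)
print(obs)
```

Removing every position labelled '\_' gives back the original sequences, which is what the corrector is expected to learn.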

In [9]:
# Add noisy insertions (linked to no state) to the noisy substituted or clean database
# Empty state will be denoted '_'
np.random.seed(20)
INSERTION_PROB = 0.10
(X_si_train, y_si_train), (X_si_test, y_si_test) = noisy_insertion(X_train, y_train, X_test, y_test, thresh_proba=INSERTION_PROB)
(X_i_train, y_i_train), (X_i_test, y_i_test) = noisy_insertion(y_train, y_train, y_test, y_test, thresh_proba=INSERTION_PROB)

SAMPLE = 3
print("States sequence sample       : {}".format(y_si_train[SAMPLE]))
print("Observations sequence sample : {}\n".format(X_si_train[SAMPLE]))

states_si, observations_si = get_observations_states(X_si_train, y_si_train)
print("{} states :\n{}".format(len(states_si), states_si))
print("{} observations :\n{}".format(len(observations_si), observations_si))

States sequence sample       : ['a', 'c', 'c', 'o', 'u', '_', 'n', 't']
Observations sequence sample : ['a', 'c', 'v', 'o', 'u', 'u', 'n', 't']

27 states :
['_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26 observations :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


## Correcting substitution + insertion errors

10% substitution errors + 10% insertion errors

In [10]:
# HMM first order
hmm = HMM(states_si, observations_si, verbose=False)
hmm.fit(X_si_train, y_si_train)
y_si_test_pred = hmm.predict(X_si_test)
display_correction_stats(X_si_test, y_si_test, y_si_test_pred, name="HMM1", dummy=False)

# HMM second order
hmm2 = HMM2(states_si, observations_si, verbose=False)
hmm2.fit(X_si_train, y_si_train)
y_si_test_pred = hmm2.predict(X_si_test)
display_correction_stats(X_si_test, y_si_test, y_si_test_pred, name="\nHMM2", dummy=False)

# HMM second order with smoothing
hmm2 = HMM2(states_si, observations_si, verbose=False)
hmm2.fit(X_si_train, y_si_train, smoothing='weighted')
y_si_test_pred = hmm2.predict(X_si_test)
display_correction_stats(X_si_test, y_si_test, y_si_test_pred, name="\nHMM2_smooth")

HMM1 score on test set
 * accuracy on full words : 49.83%
 * accuracy on letters    : 83.64%
   > typos corrected      : 391 (4.85%)
   > typos not corrected  : 1091 (13.54%)
   > typos added          : 227 (2.82%)

HMM2 score on test set
 * accuracy on full words : 60.56%
 * accuracy on letters    : 85.90%
   > typos corrected      : 677 (8.40%)
   > typos not corrected  : 805 (9.99%)
   > typos added          : 331 (4.11%)

HMM2_smooth score on test set
 * accuracy on full words : 55.16%
 * accuracy on letters    : 85.61%
   > typos corrected      : 499 (6.19%)
   > typos not corrected  : 983 (12.20%)
   > typos added          : 176 (2.18%)

Dummy score on test set
 * accuracy on full words : 41.77%
 * accuracy on letters    : 81.61%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 1482 (18.39%)
   > typos added          : 0 (0.00%)


## Correcting insertion errors only

10% insertion errors

In [11]:
# HMM first order
hmm = HMM(states_si, observations_si, verbose=False)
hmm.fit(X_i_train, y_i_train)
y_i_test_pred = hmm.predict(X_i_test)
display_correction_stats(X_i_test, y_i_test, y_i_test_pred, name="HMM1", dummy=False)

# HMM second order
hmm2 = HMM2(states_si, observations_si, verbose=False)
hmm2.fit(X_i_train, y_i_train)
y_i_test_pred = hmm2.predict(X_i_test)
display_correction_stats(X_i_test, y_i_test, y_i_test_pred, name="\nHMM2", dummy=False)

# HMM second order with smoothing
hmm2 = HMM2(states_si, observations_si, verbose=False)
hmm2.fit(X_i_train, y_i_train, smoothing='weighted')
y_i_test_pred = hmm2.predict(X_i_test)
display_correction_stats(X_i_test, y_i_test, y_i_test_pred, name="\nHMM2_smooth")

HMM1 score on test set
 * accuracy on full words : 66.02%
 * accuracy on letters    : 90.77%
   > typos corrected      : 189 (2.34%)
   > typos not corrected  : 565 (7.00%)
   > typos added          : 180 (2.23%)

HMM2 score on test set
 * accuracy on full words : 73.48%
 * accuracy on letters    : 91.37%
   > typos corrected      : 342 (4.24%)
   > typos not corrected  : 412 (5.10%)
   > typos added          : 285 (3.53%)

HMM2_smooth score on test set
 * accuracy on full words : 66.82%
 * accuracy on letters    : 91.16%
   > typos corrected      : 158 (1.96%)
   > typos not corrected  : 596 (7.38%)
   > typos added          : 118 (1.46%)

Dummy score on test set
 * accuracy on full words : 60.89%
 * accuracy on letters    : 90.66%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 754 (9.34%)
   > typos added          : 0 (0.00%)


**Conclusion:**

Rather disappointing results. The models barely outperform the *Dummy* baseline when correcting insertions only: letter accuracy ranges from 90.77% to 91.37%, against 90.66% for the baseline. This is probably due to the small amount of training data available for this type of error.

# V - Character omission

For the omission (deletion) of characters, the situation is reversed: an observation is missing. We cannot simply add a '\_' observation, since that would tell the model directly that an omission occurred. The problem would then be simpler, but the model would no longer have to detect the omission by itself. Instead, we modify the corpus by removing some observations and by allowing an observation to be matched with a "double" state made of two letters. This leads to a much more complex problem with 702 states (26 single states + 676 double states) and 26 observations.

Example:
- **state sequence       :** ['a', 'c', 'c', 'o', 'un', 't']
- **observation sequence :** ['a', 'c', 'v', 'o', 'u', 't']

Note: many rows of the transition matrix will be empty, since each of these errors is rarely observed.
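The corpus modification described above can be sketched as follows, using the convention of the example (the state whose observation was dropped is appended to the previous state). The name `add_omissions` is hypothetical, and the toolbox's `noisy_omission` is not shown here, so it may differ, for instance in which neighbour absorbs the orphan state.

```python
import random


def add_omissions(obs_seq, state_seq, proba=0.10, rng=None):
    """Drop observations at random and merge the orphan state into the previous one.

    A state absorbs at most one neighbour, so states have one or two letters.
    The first position is never dropped because it has no previous state.
    """
    rng = rng or random.Random()
    new_obs, new_states = [], []
    for obs, state in zip(obs_seq, state_seq):
        can_merge = bool(new_states) and len(new_states[-1]) == 1
        if can_merge and rng.random() < proba:
            new_states[-1] += state  # e.g. 'u' + 'n' -> 'un', observation dropped
        else:
            new_obs.append(obs)
            new_states.append(state)
    return new_obs, new_states


obs, states = add_omissions(list("acvount"), list("account"), 0.3, random.Random(5))
print(states)
print(obs)
```

Concatenating the states always gives back the original word, while the observation sequence becomes shorter by one for each double state.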

In [12]:
def cross_states(states):
    """ Return the single states followed by every concatenation of two states.
    Ex: If states=['a', 'b'], the result is ['a', 'b', 'aa', 'ab', 'ba', 'bb']

    :param states: list of string, the single states
    :return: list of string, single states followed by double states
    """

    # Every ordered pair s1 + s2, i.e. len(states) ** 2 double states
    double_states = [s1 + s2 for s1 in states for s2 in states]

    return states + double_states

In [13]:
# Remove some observations in the database
# To do that, we add the state without observation to the previous state
np.random.seed(5)
DELETION_PROB = 0.10
(X_so_train, y_so_train), (X_so_test, y_so_test) = noisy_omission(X_train, y_train, X_test, y_test, thresh_proba=DELETION_PROB)
(X_o_train, y_o_train), (X_o_test, y_o_test) = noisy_omission(y_train, y_train, y_test, y_test, thresh_proba=DELETION_PROB)

states_so = cross_states(states)
observations_so = observations

SAMPLE = 3
print("States sequence sample       : {}".format(y_so_train[SAMPLE]))
print("Observations sequence sample : {}\n".format(X_so_train[SAMPLE]))

print("{} states :\n{}".format(len(states_so), states_so))
print("{} observations :\n{}".format(len(observations_so), observations_so))

States sequence sample       : ['a', 'c', 'c', 'ou', 'n', 't']
Observations sequence sample : ['a', 'c', 'v', 'u', 'n', 't']

702 states :
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay', 'az', 'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bk', 'bl', 'bm', 'bn', 'bo', 'bp', 'bq', 'br', 'bs', 'bt', 'bu', 'bv', 'bw', 'bx', 'by', 'bz', 'ca', 'cb', 'cc', 'cd', 'ce', 'cf', 'cg', 'ch', 'ci', 'cj', 'ck', 'cl', 'cm', 'cn', 'co', 'cp', 'cq', 'cr', 'cs', 'ct', 'cu', 'cv', 'cw', 'cx', 'cy', 'cz', 'da', 'db', 'dc', 'dd', 'de', 'df', 'dg', 'dh', 'di', 'dj', 'dk', 'dl', 'dm', 'dn', 'do', 'dp', 'dq', 'dr', 'ds', 'dt', 'du', 'dv', 'dw', 'dx', 'dy', 'dz', 'ea', 'eb', 'ec', 'ed', 'ee', 'ef', 'eg', 'eh', 'ei', 'ej', 'ek', 'el', 'em', 'en', 'eo', 'ep', 'eq', 'er'

In [14]:
# HMM first order (omissions only)
hmm = HMM(states_so, observations_so, verbose=False)
hmm.fit(X_o_train, y_o_train)
y_o_test_pred = hmm.predict(X_o_test)
display_correction_stats(X_o_test, y_o_test, y_o_test_pred)

HMM score on test set
 * accuracy on full words : 71.95%
 * accuracy on letters    : 92.72%
   > typos corrected      : 54 (0.81%)
   > typos not corrected  : 459 (6.85%)
   > typos added          : 29 (0.43%)

Dummy score on test set
 * accuracy on full words : 70.55%
 * accuracy on letters    : 92.34%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 513 (7.66%)
   > typos added          : 0 (0.00%)


**/!\ Warning /!\**

For the second-order model, the matrices involved can become very large. Here the transition matrix has size 702 x 702 x 702 which, with 64-bit floats (the default type of a numpy array), amounts to more than 2.5 GB. To avoid memory errors and limit computation, the floating-point precision is automatically adjusted according to the size of the problem, which here yields matrices of "only" 600 MB. To compensate for this loss of precision, using log-probabilities is strongly recommended (and done here): it makes better use of the range of representable values and thus limits representation errors.
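The orders of magnitude can be checked with a quick back-of-the-envelope computation. The snippet only multiplies the number of cells by the item size of each numpy float type; which precision `HMM2` actually selects is not shown here.

```python
import numpy as np

n_states = 702  # 26 single states + 26 * 26 double states

for dtype in (np.float64, np.float32, np.float16):
    # One cell per (state at t-2, state at t-1, state at t) triple
    n_bytes = n_states ** 3 * np.dtype(dtype).itemsize
    print("{:<8}: {:.2f} GB".format(np.dtype(dtype).name, n_bytes / 1e9))
```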

Despite the use of numpy *broadcasting*, the Viterbi algorithm remains very slow, taking more than ten seconds per sequence.

In [16]:
# HMM second order (omissions only)
hmm2 = HMM2(states_so, observations_so)
hmm2.fit(X_o_train, y_o_train)

# Try HMM prediction on one sample
SAMPLE = 11
print("\nObservation sequence      : {}".format("".join(X_o_test[SAMPLE])))
print("Real states sequence      : {}".format("".join(y_o_test[SAMPLE])))
print("Predicted states sequence : {}\n".format("".join(hmm2.predict([X_o_test[SAMPLE]])[0])))

# # Run all predictions (BE CAREFUL, IT MAY BE VEEEERY LONG)
# y_so_test_pred = hmm2.predict(X_so_test)
# display_correction_stats(X_so_test, y_so_test, y_so_test_pred)

2nd order HMM created with: 
 * 702 states
 * 26 observations
Training initial states probabilities... Done.
Training transitions probabilities given states... Done.
Training observations probabilities given states... Done.

Observation sequence      : inerority
Real states sequence      : inferiority
Predicted states sequence : inerority



# VI - Unsupervised learning

Unsupervised learning with the Baum-Welch algorithm (a form of Expectation-Maximization).
Implemented for the first-order HMM, but not for the second order (which would in any case be far too slow to compute in our setting).
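For reference, one Baum-Welch iteration for a first-order HMM with discrete observations can be sketched as below. This is a compact textbook-style version with scaled forward and backward passes, not the code behind `HMM.fit`; the function names are assumptions, and the sketch assumes strictly positive parameters and at least one sequence longer than one symbol.

```python
import numpy as np


def forward_backward(obs, init, trans, emit):
    """Scaled forward-backward pass: state posteriors, expected transitions, log-likelihood."""
    T, S = len(obs), len(init)
    alpha, beta, scale = np.empty((T, S)), np.ones((T, S)), np.empty(T)
    alpha[0] = init * emit[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1]) / scale[t + 1]
    gamma = alpha * beta  # gamma[t, i] = P(Y_t = i | X)
    xi = np.zeros((S, S))  # expected number of i -> j transitions
    for t in range(T - 1):
        future = emit[:, obs[t + 1]] * beta[t + 1] / scale[t + 1]
        xi += alpha[t][:, None] * trans * future[None, :]
    return gamma, xi, np.log(scale).sum()


def em_step(sequences, init, trans, emit):
    """One EM update; also returns the log-likelihood under the input parameters."""
    new_init, new_trans, new_emit = np.zeros_like(init), np.zeros_like(trans), np.zeros_like(emit)
    loglik = 0.0
    for obs in sequences:
        gamma, xi, ll = forward_backward(obs, init, trans, emit)
        new_init += gamma[0]
        new_trans += xi
        for t, o in enumerate(obs):
            new_emit[:, o] += gamma[t]
        loglik += ll
    new_init /= new_init.sum()
    new_trans /= new_trans.sum(axis=1, keepdims=True)
    new_emit /= new_emit.sum(axis=1, keepdims=True)
    return new_init, new_trans, new_emit, loglik


rng = np.random.default_rng(0)


def random_stochastic(shape):
    """Random positive table whose last axis sums to one."""
    m = rng.random(shape) + 0.5
    return m / m.sum(axis=-1, keepdims=True)


sequences = [[0, 1, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1]]
init, trans, emit = random_stochastic(2), random_stochastic((2, 2)), random_stochastic((2, 2))
logliks = []
for _ in range(5):
    init, trans, emit, ll = em_step(sequences, init, trans, emit)
    logliks.append(ll)
print(logliks)
```

Each update can only increase the likelihood of the observations (or leave it unchanged), which gives a simple sanity check for an implementation; it says nothing, however, about whether the learned states match the true letters.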

## Without prior initialization

The matrices are initialized randomly around the values $1/n_{observations}$ and $1/n_{states}$.

In [17]:
# Initialize 1st order HMM
hmm1 = HMM(states, observations)

# Train HMM using EM algorithm
hmm1.fit(X_train, max_iter=10, tol=0.001)

# Run predictions
y_test_pred = hmm1.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="\nHMM1")

1st order HMM created with: 
 * 26 states
 * 26 observations
Unsupervised training by EM algorithm...
EM LOOP: n_iter=1, delta=0.0432
EM LOOP: n_iter=2, delta=0.0681
EM LOOP: n_iter=3, delta=0.0698
EM LOOP: n_iter=4, delta=0.0803
EM LOOP: n_iter=5, delta=0.0631
EM LOOP: n_iter=6, delta=0.1077
EM LOOP: n_iter=7, delta=0.1807
EM LOOP: n_iter=8, delta=0.1064
EM LOOP: n_iter=9, delta=0.0866
EM LOOP: n_iter=10, delta=0.1752

HMM1 score on test set
 * accuracy on full words : 0.00%
 * accuracy on letters    : 2.62%
   > typos corrected      : 33 (0.45%)
   > typos not corrected  : 712 (9.73%)
   > typos added          : 6416 (87.65%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


To check our (more than disappointing) results, we compared them with those obtained with the *hmmlearn* library (originally part of *scikit-learn*):

In [18]:
# convert to int and reshape to 1D data
X_train_id = [hmm1._convert_observations_sequence_to_index(sample) for sample in X_train]
X_train_1D = np.atleast_2d(np.concatenate(X_train_id)).T
X_train_lengths = [len(sample) for sample in X_train_id]

X_test_id = [hmm1._convert_observations_sequence_to_index(sample) for sample in X_test]
y_test_id = [hmm1._convert_observations_sequence_to_index(sample) for sample in y_test]
X_test_1D = np.atleast_2d(np.concatenate(X_test_id)).T
X_test_lengths = [len(sample) for sample in X_test_id]

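# Only MultinomialHMM is used below. In recent hmmlearn releases, the
# categorical-emission model is exposed as CategoricalHMM (check your version).
from hmmlearn.hmm import MultinomialHMM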

hmm_sk = MultinomialHMM(26, verbose=True, n_iter=5, init_params='ste')
hmm_sk.fit(X_train_1D, X_train_lengths)
y_test_pred_sk = [hmm_sk.predict(np.atleast_2d(sample).T) for sample in X_test_id]
y_test_pred = [hmm1._convert_states_sequence_to_string(sample) for sample in y_test_pred_sk]
display_correction_stats(X_test, y_test, y_test_pred, name="hmmlearn")

         1     -469784.9709             +nan
         2     -425576.9829      +44207.9880
         3     -425379.4479        +197.5350
         4     -425115.3658        +264.0821
         5     -424733.3449        +382.0209


hmmlearn score on test set
 * accuracy on full words : 0.00%
 * accuracy on letters    : 1.42%
   > typos corrected      : 31 (0.42%)
   > typos not corrected  : 714 (9.75%)
   > typos added          : 6502 (88.83%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


**Conclusion:** the training does not converge towards the right values, whichever library is used. There are probably too many state/observation pairs with very different probabilities (and hence far from the initialization). In addition, Baum-Welch only identifies the hidden states up to a permutation, so without labels nothing ties state $i$ to letter $i$.

## With an approximate prior initialization of the emission matrix

To "guide" the training a little, we tried initializing the emission matrix with values close to the true ones.

In [19]:
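# emission prior: mostly diagonal (the observed letter is usually the true one)
emission0 = np.eye(26) + 0.2/26
# normalize each row (one row per state); keepdims keeps the division row-wise
emission0 /= np.sum(emission0, axis=1, keepdims=True)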

# our code
hmm1 = HMM(states, observations, verbose=True)
hmm1.observation_logproba = np.log(emission0)
hmm1.fit(X_train, max_iter=10, tol=0.001)
y_test_pred = hmm1.predict(X_test)
display_correction_stats(X_test, y_test, y_test_pred, name="\nHMM1", dummy=False)

# hmmlearn
hmm_sk = MultinomialHMM(26, verbose=True, n_iter=5, init_params='st')
hmm_sk.emissionprob_ = emission0
hmm_sk.fit(X_train_1D, X_train_lengths)
y_test_pred_sk = [hmm_sk.predict(np.atleast_2d(sample).T) for sample in X_test_id]
y_test_pred = [hmm1._convert_states_sequence_to_string(sample) for sample in y_test_pred_sk]
display_correction_stats(X_test, y_test, y_test_pred, name="hmmlearn")

1st order HMM created with: 
 * 26 states
 * 26 observations
Unsupervised training by EM algorithm...
EM LOOP: n_iter=1, delta=0.0552
EM LOOP: n_iter=2, delta=0.0956
EM LOOP: n_iter=3, delta=0.1389
EM LOOP: n_iter=4, delta=0.1801
EM LOOP: n_iter=5, delta=0.2794
EM LOOP: n_iter=6, delta=0.1494
EM LOOP: n_iter=7, delta=0.1326
EM LOOP: n_iter=8, delta=0.2847
EM LOOP: n_iter=9, delta=0.4004
EM LOOP: n_iter=10, delta=0.7407

HMM1 score on test set
 * accuracy on full words : 62.56%
 * accuracy on letters    : 89.47%
   > typos corrected      : 11 (0.15%)
   > typos not corrected  : 734 (10.03%)
   > typos added          : 37 (0.51%)


         1     -466455.1652             +nan
         2     -388599.9028      +77855.2624
         3     -379193.9853       +9405.9175
         4     -377019.4268       +2174.5585
         5     -376223.0538        +796.3730


hmmlearn score on test set
 * accuracy on full words : 63.62%
 * accuracy on letters    : 89.89%
   > typos corrected      : 48 (0.66%)
   > typos not corrected  : 697 (9.52%)
   > typos added          : 43 (0.59%)

Dummy score on test set
 * accuracy on full words : 62.89%
 * accuracy on letters    : 89.82%
   > typos corrected      : 0 (0.00%)
   > typos not corrected  : 745 (10.18%)
   > typos added          : 0 (0.00%)


**Conclusion:** Better, but similar to the *Dummy* baseline, hence useless. Looking at the estimated matrices, the initial emission matrix has been refined; the transition matrix, however, is not very reliable.