# TP3 - Automatic Segmentation of Mails
This lab builds an email segmentation tool that separates the header of an email from its
body. The task is solved by learning an HMM (A, B, π) with two states: *state 1* for
the header and *state 2* for the body. The model assumes that every mail contains
a header, so decoding always begins in state 1.

Since each mail contains exactly one header and one body, the transition from state 1 to state 2
occurs exactly once per mail.

### Q1 : Give the value of the π vector of the initial probabilities

$$\pi^T = 
\begin{pmatrix} 
1 & 0 
\end{pmatrix}
$$

A mail always starts in state 1 (the header), so the initial probability of state 1 is 1 and that of state 2 is 0.

### Q2 : What is the probability of moving from state 1 to state 2 ? What is the probability of remaining in state 2 ? Which probability is lower/higher ? Try to explain why

Given the matrix:

$$ A = \begin{pmatrix} 
0.999218078035812 & 0.000781921964187974 \\
0 & 1
\end{pmatrix}$$

The probability of moving from state 1 to state 2 is `P(2|1) = 0.000781921964187974`; the probability of remaining in state 2 is `P(2|2) = 1`.

P(2|2) is the higher of the two. Once a character belonging to the *body* of the mail has been observed, all following observations belong to the same state, since no *header* appears after the *body* of a mail. P(2|1) is small because a header spans many characters, while the transition to the body happens only once per mail.
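
The size of P(2|1) can be read as a statement about header length. In an HMM, the time spent in a state with self-transition probability a11 follows a geometric distribution with mean 1 / (1 - a11). The short check below applies that property to the matrix A given above; that the value reflects the average header length of the training mails is an assumption, since the lab text does not say how A was estimated.

```python
import numpy as np

A = np.array([[0.999218078035812, 0.000781921964187974],
              [0.0, 1.0]])

# Each row of a transition matrix is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)

# Expected number of characters spent in state 1 (header) before
# moving to state 2: mean of a geometric distribution, 1 / P(2|1).
expected_header_length = 1.0 / A[0, 1]
print(round(expected_header_length))  # about 1279 characters
```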

### Q3 : What is the size of the corresponding matrix ?

Each character is encoded as a byte value between 0 and 255, so there are `N = 256` possible observations and the emission matrix has size `256x2`. Row *c* corresponds to character *c*, and each column holds the discrete probability distribution of the characters given the state *s*.
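
To make the layout of this matrix concrete, here is a minimal sketch that builds a `256x2` matrix from two short labelled byte strings. In the lab the matrix is provided in a file, so the helper `estimate_emissions`, the add-one smoothing and the toy strings are all assumptions made for illustration only.

```python
import numpy as np

def estimate_emissions(header_bytes, body_bytes, n_symbols=256):
    """Hypothetical helper: estimate B[c, s] from labelled header/body bytes."""
    # Start from ones (add-one smoothing) so that no character gets probability 0.
    B = np.ones((n_symbols, 2))
    for s, data in enumerate((header_bytes, body_bytes)):
        codes = np.frombuffer(data, dtype=np.uint8)
        B[:, s] += np.bincount(codes, minlength=n_symbols)
    # Normalise each column so it is a distribution over the 256 characters.
    return B / B.sum(axis=0)

B_toy = estimate_emissions(b"From: alice@example.org\nSubject: hi\n",
                           b"Hello Bob,\nsee you tomorrow.\n")
print(B_toy.shape)        # (256, 2)
print(B_toy.sum(axis=0))  # each column sums to 1 (up to rounding)
```

Indexing `B_toy[c, s]` then gives the probability of character code `c` in state `s`, which is the access pattern used by the decoder.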

In [36]:
# Implementation of the Viterbi algorithm
import numpy as np

def ViterbiAlgorithm(O, S, A, B, pi, Y):
    """
    Find the best segmentation of the observed text and return
    the most probable state sequence.

    :param O: character codes (range(0, 256)); kept for reference, not used
    :param S: possible states (header = 0, body = 1, i.e. states 1 and 2 in the text)
    :param A: state transition probability matrix (2x2)
    :param B: emission matrix, B[c, s] = probability of character c in state s (256x2, contained in P.text)
    :param pi: initial probability vector
    :param Y: observed character codes of the mail (mail.dat)
    :return X: most likely hidden state sequence
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    tiny = np.finfo(np.double).tiny  # smallest positive normal double
    T = len(Y)
    T1 = np.zeros((len(S), T))             # T1[s, t]: score of the best path ending in s at time t
    T2 = np.zeros((len(S), T), dtype=int)  # T2[s, t]: predecessor of s on that path
    for s in S:
        T1[s, 0] = pi[s] * B[Y[0], s]
    for t in range(1, T):
        for s in S:
            # score of reaching s at time t from each possible previous state
            result = T1[:, t-1] * A[:, s] * B[Y[t], s]
            # keep the scores strictly positive: the raw products become
            # extremely small on long mails
            result += tiny
            T1[s, t] = result.max()
            T2[s, t] = result.argmax()
    # backtrack from the most probable final state
    Z = [0] * T
    X = [0] * T
    last = T - 1
    Z[last] = int(np.argmax(T1[:, last]))
    X[last] = S[Z[last]]
    for i in range(last, 0, -1):
        Z[i-1] = int(T2[Z[i], i])
        X[i-1] = S[Z[i-1]]
    return X
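
The scores in `T1` are products of thousands of probabilities, which is why the implementation above guards against zeros. A common alternative is to run the same recursion on log-probabilities, where products become sums. The following is a minimal sketch under that assumption: `viterbi_log` is a hypothetical name, not part of the lab material, and its output on the lab mails has not been checked against the counts reported in this notebook.

```python
import numpy as np

def viterbi_log(A, B, pi, Y):
    """Log-domain Viterbi sketch. B[c, s] is the probability of symbol c in state s."""
    A, B, pi = (np.asarray(m, dtype=float) for m in (A, B, pi))
    Y = np.asarray(Y, dtype=int)
    # log(0) = -inf is intended here: it marks impossible transitions.
    with np.errstate(divide="ignore"):
        log_A, log_B, log_pi = np.log(A), np.log(B), np.log(pi)
    n_states, T = A.shape[0], len(Y)
    delta = np.empty((n_states, T))           # best log-score ending in each state
    psi = np.zeros((n_states, T), dtype=int)  # best predecessor of each state
    delta[:, 0] = log_pi + log_B[Y[0]]
    for t in range(1, T):
        # scores[i, j] = best log-score in i at t-1 plus log-probability of i -> j
        scores = delta[:, t-1][:, None] + log_A
        psi[:, t] = scores.argmax(axis=0)
        delta[:, t] = scores.max(axis=0) + log_B[Y[t]]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta[:, -1].argmax()
    for t in range(T - 1, 0, -1):
        path[t-1] = psi[path[t], t]
    return path

# Toy model with two symbols: symbol 0 is typical of state 0, symbol 1 of state 1.
A_toy = [[0.9, 0.1], [0.0, 1.0]]
B_toy2 = [[0.9, 0.1], [0.1, 0.9]]
toy_path = viterbi_log(A_toy, B_toy2, [1, 0], [0, 0, 0, 1, 1, 1])
print(toy_path.tolist())  # [0, 0, 0, 1, 1, 1]
```

Working with sums of logarithms keeps the scores in a range where differences between the two states remain representable, whatever the length of the mail.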

In [49]:
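# Viterbi - test
O = range(0, 256)   # character codes
S = [0, 1]          # 0 = header, 1 = body
pi = [1, 0]         # decoding always starts in the header
A = [[0.999218078035812, 0.000781921964187974], [0, 1]]
B = np.loadtxt('P.text', dtype=float)        # emission probabilities (256x2)
Y = np.loadtxt('dat/mail11.dat', dtype=int)  # observed character codes

X = ViterbiAlgorithm(O, S, A, B, pi, Y)
np.bincount(X)  # number of characters labelled header (0) and body (1)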

array([ 161, 3314])

In [50]:
Y = np.loadtxt('dat/mail30.dat',dtype=int)

X = ViterbiAlgorithm(O, S, A, B, pi, Y)
np.bincount(X)

array([ 180, 4980])
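
Because the model never returns from the body to the header, a decoded sequence is a run of 0s followed by a run of 1s, and the position of the first 1 is the cut between header and body. The sketch below turns a state sequence into the two text segments. The helper `split_mail` and the Latin-1 decoding of the character codes are assumptions for illustration; the lab does not specify the encoding of the `.dat` files.

```python
import numpy as np

def split_mail(Y, X):
    """Hypothetical helper: split character codes Y at the first body label in X."""
    Y = np.asarray(Y, dtype=int)
    X = np.asarray(X, dtype=int)
    body_positions = np.flatnonzero(X == 1)
    # If no character is labelled body, the whole mail is treated as header.
    cut = int(body_positions[0]) if body_positions.size else len(X)
    # Latin-1 maps every code in 0..255 to one character (assumed encoding).
    text = bytes(Y.tolist()).decode("latin-1")
    return text[:cut], text[cut:]

demo_Y = list(b"From: a\n\nHello")
demo_X = [0] * 9 + [1] * 5
demo_header, demo_body = split_mail(demo_Y, demo_X)
print(repr(demo_header), repr(demo_body))
```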