<h1>HMM and Text Segmentation</h1>

In [2]:
import numpy as np

## Task: automatic segmentation of mails, problem statement

This lab builds an email segmentation tool that separates the email header from its body. The task is performed by learning an HMM $(A, B, \pi)$ with two states: state $1$ for the header and state $2$ for the body. The model assumes that every mail contains a header, so decoding necessarily begins in state $1$.

### Question 1 
Give the value of the $\pi$ vector of the initial probabilities.

As stated in the problem, every mail contains a header. Decoding therefore starts in state $1$ with probability $1$, and the vector of initial probabilities is:

$$\boxed{\pi_{0} = \left(\begin{array}{cc} 1 & 0 \end{array}\right)}$$

<hr />

Since each mail contains exactly one header and one body, each mail makes the transition from $1$ to $2$ exactly once. The transition matrix $A$, with $A(i, j) = P(j | i)$, estimated on a small labeled corpus, therefore has the following form:

$$ A = \left(\begin{array}{cc} 
0.999218078035812 & 0.000781921964187974 \\
0 & 1 \\
\end{array}\right) $$

### Question 2
What is the probability to move from state 1 to state 2? What is the probability to remain in state 2? What is the lower/higher probability? Try to explain why.

The probability of moving from state $1$ to state $2$ is $A(1, 2) = 0.000781921964187974$, the lowest nonzero value in $A$. It is low because the transition happens only once per mail, whereas the model stays in state $1$ for every other character of the header. The number of characters spent in state $1$ follows a geometric distribution with mean $\frac{1}{A(1, 2)} \approx 1279$, which suggests an average header length of roughly $1279$ characters in the labeled corpus.

The probability of remaining in state $2$ is $1$, the highest value in $A$: state $2$ is absorbing, because once the body has started the mail never returns to the header.
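
The relation between the exit probability and the header length can be checked numerically. The snippet below is a minimal sketch that only reuses the values of $A$ given above; the variable names are chosen for illustration.

```python
import numpy as np

A = np.array([[0.999218078035812, 0.000781921964187974],
              [0.0, 1.0]])

# Probability of leaving state 1 (header) at any given character.
p_leave = A[0, 1]

# The number of characters spent in state 1 is geometric with mean 1 / p_leave.
expected_header_length = 1.0 / p_leave

print(round(expected_header_length))  # 1279
```

Each row of $A$ sums to $1$, so the self-transition probability $A(1, 1)$ carries the same information as $A(1, 2)$.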

<hr />

A mail is represented by a sequence of characters. Let $N$ be the number of different characters. Each part of the mail is characterized by a discrete probability distribution on the characters $P(c | s)$, with $s = 1$ or $s = 2$.

### Question 3
What is the size of B?

The matrix $B$ contains the probability of observing each symbol in each state.

Thus, the size of $B$ is $(N, 2)$.
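
To make the shape concrete, here is a hedged sketch of how such a matrix could be estimated by counting characters in labeled text. The helper `estimate_emissions`, the choice $N = 256$ and the toy labels are assumptions made for illustration; they are not taken from the lab material.

```python
import numpy as np

def estimate_emissions(chars, labels, n_symbols=256, n_states=2):
    """Count-based estimate of B[c, s] = P(c | s) (hypothetical helper)."""
    counts = np.zeros((n_symbols, n_states))
    # Accumulate one count per (character, state) pair.
    np.add.at(counts, (chars, labels), 1)
    # Normalise each column so that it sums to 1 over the characters.
    totals = counts.sum(axis=0, keepdims=True)
    return counts / np.maximum(totals, 1)

# Toy labeled text: 9 header characters followed by 2 body characters.
chars = np.array([ord(c) for c in "To: a@b\n\nHi"])
labels = np.array([0] * 9 + [1] * 2)

B = estimate_emissions(chars, labels)
print(B.shape)  # (256, 2)
```

Each column of `B` is a probability distribution over the characters, one column per state.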

<hr />

## To implement

* Implement the Viterbi algorithm. Concretely, this amounts to writing a function that takes a vector of observations and the parameters of the model as arguments, and returns a vector of states representing the most probable sequence.
* Test it on some mails that are given in the <tt>dat</tt> directory (especially <tt>mail11.txt</tt> to <tt>mail30.txt</tt>).

In [136]:
# initialization

X = np.loadtxt('./dat/' + 'mail1.dat')
Pi0 = np.array([1, 0])
A = np.array([[0.999218078035812, 0.000781921964187974],
              [0                , 1]])
P = np.loadtxt('./PerlScriptAndModel/' + 'P.text')

In [137]:
def viterbi(X, Pi0 = Pi0, A = A, P = P):
    """
        Viterbi Algorithm Implementation (log domain)

        Keyword arguments:
            - X: sequence of observations (integer character codes), length T
            - Pi0: vector of the initial probabilities, shape (N,)
            - A: transition matrix, shape (N, N), with A[i, j] = P(j | i)
            - P: emission probability matrix, shape (number of symbols, N)
        Returns:
            - logl: log-likelihood of the best path ending in each state at each time, shape (T, N)
            - path: most probable sequence of states, 0-based (0 = header, 1 = body), shape (T,)
    """

    # add the smallest positive double to avoid log(0)
    realmin = np.finfo(np.double).tiny
    A = np.log(A + realmin)
    Pi0 = np.log(Pi0 + realmin)
    P = np.log(P + realmin)
    X = np.asarray(X, dtype = int) # observations are used as row indices of P
    T = X.shape[0] # number of observations
    N = Pi0.shape[0] # number of states of the model

    # initialization
    logl = np.zeros((T, N))
    bcktr = np.zeros((T-1, N), dtype = int)

    logl[0, :] = Pi0 + P[X[0], :]

    # recursion
    for t in range(T-1):
        # scores[i, j]: best log-likelihood of being in state i at time t, then moving to j
        scores = A + logl[t, :].reshape(-1, 1)
        logl[t+1, :] = P[X[t+1], :] + np.max(scores, axis = 0)
        bcktr[t, :] = np.argmax(scores, axis = 0)

    # termination
    path = np.zeros(T, dtype = int)
    path[-1] = np.argmax(logl[-1, :])

    # backtrack
    for t in range(T-1, 0, -1):
        path[t-1] = bcktr[t-1, path[t]]

    return logl, path

In [138]:
logl, path = viterbi(X.astype(int), Pi0, A, P)
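
A Viterbi implementation can be sanity-checked on a toy model by comparing its output with an exhaustive search over all state sequences. The sketch below is self-contained: `viterbi_log` is a compact re-implementation written for this check, and the toy parameters (3 symbols, 2 states) are made up for illustration.

```python
import itertools
import numpy as np

def viterbi_log(obs, log_pi, log_A, log_B):
    """Compact log-domain Viterbi; log_B has shape (n_symbols, n_states)."""
    T, N = len(obs), len(log_pi)
    score = np.empty((T, N))
    back = np.zeros((T, N), dtype=int)
    score[0] = log_pi + log_B[obs[0]]
    for t in range(1, T):
        # cand[i, j]: best path ending in state i at t - 1, followed by i -> j
        cand = score[t - 1][:, None] + log_A
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_B[obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def brute_force(obs, log_pi, log_A, log_B):
    """Score every state sequence and keep the best one (toy sizes only)."""
    def score(seq):
        s = log_pi[seq[0]] + log_B[obs[0], seq[0]]
        for t in range(1, len(obs)):
            s += log_A[seq[t - 1], seq[t]] + log_B[obs[t], seq[t]]
        return s
    return max(itertools.product(range(len(log_pi)), repeat=len(obs)), key=score)

tiny = np.finfo(float).tiny  # avoids log(0), as in the notebook
log_pi = np.log(np.array([1.0, 0.0]) + tiny)
log_A = np.log(np.array([[0.7, 0.3], [0.0, 1.0]]) + tiny)
log_B = np.log(np.array([[0.8, 0.1], [0.1, 0.1], [0.1, 0.8]]))
obs = [0, 0, 1, 2, 2]

path = viterbi_log(obs, log_pi, log_A, log_B)
reference = brute_force(obs, log_pi, log_A, log_B)
print(path, reference)
```

The exhaustive search costs $N^T$ evaluations, so it is only usable for very short sequences, but it gives an independent reference for the dynamic programming recursion.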

In [144]:
def visualize_segmentation(mail_name, path, mail_dir = './dat/'):
    """
        Implementation of the visualization. Adds a '================ cut here' line to separate the header from the body.
        
        Keyword arguments:
            - mail_name: name of the file that contains the mail, e.g. 'mail11.txt'
            - path: sequence of states obtained with the Viterbi algorithm
            - mail_dir: folder in which to find the mails, default is './dat/'
    """
    
    # index of the first character decoded as body (state 1 in the code);
    # relies on the assumption that the transition is only made once
    sep_idx = np.argmax(path)
    
    # read the mail
    with open(mail_dir + mail_name, 'r') as f:
        mail = f.read()
    
    # write a copy with the line of separation added
    with open(mail_dir + 'visu_' + mail_name, 'w') as visu_mail:
        visu_mail.write(mail[:sep_idx] +
                        '\n================ cut here\n' +
                        mail[sep_idx:])
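
One caveat of locating the cut with `np.argmax(path)`: when the decoded path never leaves the header (all zeros), `np.argmax` returns `0`, which would place the separator at the very beginning of the mail. A possible variant, sketched here with a hypothetical helper `find_cut`, makes that case explicit.

```python
import numpy as np

def find_cut(path):
    """Index of the first body character, or None if the path never leaves the header."""
    path = np.asarray(path)
    body = np.flatnonzero(path == 1)
    return int(body[0]) if body.size else None

print(find_cut([0, 0, 0, 1, 1]))  # 3
print(find_cut([0, 0, 0]))        # None
```

The caller can then decide what to do with a mail for which no body was detected, for instance skipping the separator.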

## Visualizing segmentation

In [160]:
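# read the list of .dat files, one name per line
with open("dat/mail.lst", "r") as f:
    mails = [line.strip() for line in f if line.strip()]

# make the separation for all the mails
for mail in mails:
    mail_dat = np.loadtxt('./dat/' + mail).astype(int)
    # 'mailXX.dat' -> 'mailXX.txt'
    visualize_segmentation(mail[:-4] + '.txt', viterbi(mail_dat)[1])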

### Question 4
Print the track and present and discuss the results obtained on <tt>mail11.txt</tt> to <tt>mail30.txt</tt>

The algorithm gives good results overall. When the separation is not perfect, the predicted cut is usually close to the true one (two lines above it in <tt>mail20.txt</tt>, printed below), although it sometimes falls inside a word. The results are therefore satisfactory.
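
This qualitative assessment could be made quantitative by measuring the distance between the predicted cut and a reference position. The sketch below assumes that the true end of the header is the first empty line of the mail (the usual convention for internet messages) and that lines end with `\n`. The helper `cut_error` and the toy mail are hypothetical.

```python
def cut_error(mail_text, predicted_cut):
    """Signed distance, in characters, between the predicted cut and the first blank line."""
    true_cut = mail_text.find('\n\n')
    if true_cut == -1:
        return None  # no blank line: the reference position is undefined
    # +1 so that the header keeps its final newline
    return predicted_cut - (true_cut + 1)

toy_mail = "From: a@example.org\nSubject: hi\n\nHello,\nsee you soon.\n"
print(cut_error(toy_mail, 32))  # 0: the cut falls exactly on the blank line
print(cut_error(toy_mail, 30))  # -2: the cut is two characters too early
```

A negative value means that the cut was placed inside the header, a positive value that it was placed inside the body.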

In [161]:
# separation for mail11
visu_mail = open('./dat/' + 'visu_mail11.txt', 'r').read()
print(visu_mail)

From spamassassin-devel-admin@lists.sourceforge.net  Thu Aug 22 15:25:29 2002
Return-Path: <spamassassin-devel-admin@example.sourceforge.net>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id AE2D043F9B
	for <zzzz@localhost>; Thu, 22 Aug 2002 10:25:29 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 15:25:29 +0100 (IST)
Received: from usw-sf-list2.sourceforge.net (usw-sf-fw2.sourceforge.net
    [216.136.171.252]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id
    g7MENlZ09984 for <zzzz@spamassassin.taint.org>; Thu, 22 Aug 2002 15:23:47 +0100
Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13]
    helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net with
    esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17hsof-00042r-00; Thu,
    22 Aug 2002 07:20:05 -0700
Received: from vivi.upti

In [164]:
# separation for mail20
visu_mail = open('./dat/' + 'visu_mail20.txt', 'r').read()
print(visu_mail)

From ilug-admin@linux.ie  Thu Aug 22 17:19:25 2002
Return-Path: <ilug-admin@linux.ie>
Delivered-To: zzzz@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id CD34B47C67
	for <zzzz@localhost>; Thu, 22 Aug 2002 12:19:21 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for zzzz@localhost (single-drop); Thu, 22 Aug 2002 17:19:21 +0100 (IST)
Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g7MGHJZ14177 for
    <zzzz-ilug@spamassassin.taint.org>; Thu, 22 Aug 2002 17:17:19 +0100
Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org
    (8.9.3/8.9.3) with ESMTP id RAA09581; Thu, 22 Aug 2002 17:16:28 +0100
    claimed to be lugh
Received: from redpie.com (redpie.com [216.122.135.208] (may be forged))
    by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id RAA09518 for
    <ilug@linux.ie>; Thu, 

## Further questions

### Question 5

How would you model the problem if you had to segment the mails in more than two parts (for example: header, body, signature)? Draw a diagram of the corresponding Hidden Markov model and give an example of A matrix that would be suitable in this case.

To segment the mails into more than two parts, the Hidden Markov model needs as many states as there are parts. Here, we would have three states: $1$ for the header, $2$ for the body and $3$ for the signature.

The matrix $A$ would then be a $3 \times 3$ upper triangular matrix, because a mail can only move forward through its parts: each of the transitions $1 \rightarrow 2$, $1 \rightarrow 3$ and $2 \rightarrow 3$ is made at most once and is never reversed. For instance, we could have:

$$ A = \left( \begin{array}{ccc}
        0.92 & 0.079 & 0.001 \\
        0    & 0.95  & 0.05 \\
        0    & 0     & 1 \\ \end{array} \right) $$
        
Here is the diagram that represents the model:

![HMM with three states](HMM_three_states.PNG)
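
The structural constraints on this matrix are easy to verify in code. The sketch below reuses the example values given above; the name `A3` is chosen for illustration.

```python
import numpy as np

A3 = np.array([[0.92, 0.079, 0.001],
               [0.0,  0.95,  0.05],
               [0.0,  0.0,   1.0]])

# Each row is a probability distribution over the next state.
rows_ok = np.allclose(A3.sum(axis=1), 1.0)

# No backward transition: every entry below the diagonal is zero.
forward_only = np.allclose(A3, np.triu(A3))

# Mean number of characters spent in the header and in the body.
expected_lengths = 1.0 / (1.0 - np.diag(A3)[:-1])

print(rows_ok, forward_only, expected_lengths)
```

With these example values the expected lengths are about $12.5$ and $20$ characters, which shows that the numbers are only illustrative: a matrix estimated on real mails would presumably have diagonal entries much closer to $1$, as in the two-state model.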

### Question 6

How would you model the problem of separating the included portions of a mail, knowing that they always start with the character ">"? Draw a diagram of the corresponding Hidden Markov model.

In the same way, we would have four states: $1$ for the header, $2$ for the body, $3$ for an included mail and $4$ for the signature. A mail can contain several included portions, so transitions must be allowed both from body to included mail and from included mail to body. In contrast, the transitions header $\rightarrow$ {body, included mail} and {body, included mail} $\rightarrow$ signature can each be made only once.

![HMM with four states](HMM_four_states.PNG)
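
The allowed transitions described above can be encoded as a boolean mask from which a valid transition matrix is built. This is a minimal sketch: the weights are made up for illustration and are not estimated from any corpus.

```python
import numpy as np

states = ["header", "body", "included", "signature"]

# allowed[i, j] is True when the transition i -> j is permitted.
allowed = np.array([[1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 1, 1, 1],
                    [0, 0, 0, 1]], dtype=bool)

# Illustrative (made-up) weights, zeroed where forbidden, then row-normalised.
weights = np.array([[0.998, 0.001, 0.001, 0.0],
                    [0.0,   0.97,  0.02,  0.01],
                    [0.0,   0.02,  0.97,  0.01],
                    [0.0,   0.0,   0.0,   1.0]])
A4 = np.where(allowed, weights, 0.0)
A4 /= A4.sum(axis=1, keepdims=True)

print(A4)
```

Unlike the three-state case, this matrix is not upper triangular, because the body and included-mail states can alternate.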

## BONUS: Unsupervised learning

### Question 7

Present the pseudo code of the algorithm and discuss the results.

### Question 8

Present the pseudo code of the algorithm and discuss the results.