# Sequence Models
## Day 2 of the summer school

### Summary 

- We will train the models on pairs $(x,y)$, where $x$ and $y$ are sequences.
- Once the model is trained, given an input sequence $x$, the model will predict
  a target sequence $y$.
  
To do so, we will implement:

- one inference algorithm for Hidden Markov Models.
    - We will use it to find the most likely hidden state sequence given an observation sequence.
    


### Notation

#### Set of Words $\Sigma$ and set of states $\Lambda$
This notebook will use the following notation.

- $\Sigma := \{w_1,\ldots,w_J\}$ is the set of words (or vocabulary).
- $\Lambda:= \{c_1,\ldots, c_K\}$ is the set of labels.

A sentence is an element of the Kleene closure of $\Sigma$, denoted by $\Sigma^*$.
The Kleene closure of $\Sigma$ is defined as the set containing all possible sentences of arbitrary length that can be created using the words in $\Sigma$. More formally,

$$
\Sigma^* := \{\varepsilon\} \cup \Sigma \cup \Sigma^2 \cup \ldots
$$
where $\varepsilon$ is the empty sentence (a sequence containing no words) and $\Sigma^n$ is the set of sentences made of exactly $n$ words. In this setting, inputs are observation sequences, $x = x_1 x_2 \ldots x_N$, where each $x_i \in \Sigma$.

Given such an $x$, we seek the corresponding state sequence, $y = y_1 y_2 \ldots y_N$, 
where each $y_i \in \Lambda$. We also consider two special states: the ${\tt start}$ symbol,
which starts the sequence, and the ${\tt stop}$ symbol, which ends the sequence. 
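The definition of $\Sigma^*$ can be made concrete by enumerating its first few layers. The sketch below is only an illustration: the two-word vocabulary and the helper names `sentences_of_length` and `kleene_closure_up_to` are hypothetical, and the enumeration is truncated at a maximum length because $\Sigma^*$ itself is infinite.

```python
from itertools import product

# Hypothetical two-word vocabulary, used only for illustration.
sigma = ['a', 'b']


def sentences_of_length(vocabulary, n):
    """Return Sigma^n: every sentence made of exactly n words."""
    return [tuple(words) for words in product(vocabulary, repeat=n)]


def kleene_closure_up_to(vocabulary, max_length):
    """Return the union of Sigma^0, Sigma^1, ..., Sigma^max_length."""
    closure = []
    for n in range(max_length + 1):
        closure.extend(sentences_of_length(vocabulary, n))
    return closure


sentences = kleene_closure_up_to(sigma, 2)
# Sigma^0 contributes the single empty sentence (), Sigma^1 two
# sentences and Sigma^2 four sentences.
print(len(sentences))
```

With $J$ words there are $J^n$ sentences of length $n$, so the truncated set grows quickly with the maximum length.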



### Example 2.1
Consider a person who is only interested in four activities:

- walking in the park (${\tt walk}$),
- shopping (${\tt shop}$),
- cleaning the apartment (${\tt clean}$),
- playing tennis (${\tt tennis}$).

Also, consider that the choice of what the person does on a given day is determined exclusively by the weather on that day, which can be either ${\tt rainy}$ or ${\tt sunny}$. 

Now, supposing that we observe what the person did on a sequence of days, the question is: 
can we use that information to predict the weather on each of those days? 

To tackle this problem, we assume that the weather behaves as a first-order discrete Markov chain: the weather on a given day depends only on the weather on the previous day. Since the activity on a given day depends only on the weather on that day, the entire system can be described as a hidden Markov model (HMM).
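Written out, the Markov assumption says that the weather sequence satisfies

$$
P(y_i \mid y_1, \ldots, y_{i-1}) = P(y_i \mid y_{i-1}).
$$

Combining this with the assumption that each activity $x_i$ depends only on the weather $y_i$ of the same day, the joint probability of an observation sequence and a state sequence would factorize as follows. This is the standard HMM factorization, stated here with the ${\tt start}$ and ${\tt stop}$ symbols from the notation section; the exact parameter names used by the toolkit are not assumed.

$$
P(x, y) = P(y_1 \mid {\tt start}) \left( \prod_{i=2}^{N} P(y_i \mid y_{i-1}) \right) P({\tt stop} \mid y_N) \prod_{i=1}^{N} P(x_i \mid y_i)
$$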

In this example:

$$
\begin{array}{l}
\Sigma := \{ {\tt walk},{\tt shop},{\tt clean},{\tt tennis}\}\\
\Lambda := \{ {\tt rainy},{\tt sunny} \}
\end{array}
$$


Let us assume that we are given access to three different sequences of days, containing both the activities performed by the person and the weather on those days.

Each training sequence is written as a list of pairs $x_i / y_i$, where $x_i$ is a word in our vocabulary (${\tt walk},{\tt shop},{\tt clean},{\tt tennis}$) and $y_i$ is a state (${\tt rainy},{\tt sunny}$). The whole training set is:

- (${\tt walk/rainy, walk/sunny, shop/sunny, clean/sunny}$)
- (${\tt walk/rainy, walk/rainy, shop/rainy, clean/sunny}$)
- (${\tt walk/sunny, shop/sunny, shop/sunny, clean/sunny}$)

We will use this information  to train our model.
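One natural way to summarize the training set is to count how often each state emits each word and how often each state follows another, including the ${\tt start}$ and ${\tt stop}$ symbols. The sketch below does this with plain Python containers. It is an assumption about how the data might be summarized, not the toolkit's training code, and the names `train`, `emission_counts` and `transition_counts` are hypothetical.

```python
from collections import Counter

# The three training sequences of Example 2.1 as (activity, weather) pairs.
train = [
    [('walk', 'rainy'), ('walk', 'sunny'), ('shop', 'sunny'), ('clean', 'sunny')],
    [('walk', 'rainy'), ('walk', 'rainy'), ('shop', 'rainy'), ('clean', 'sunny')],
    [('walk', 'sunny'), ('shop', 'sunny'), ('shop', 'sunny'), ('clean', 'sunny')],
]

emission_counts = Counter()    # (state, word) -> number of occurrences
transition_counts = Counter()  # (previous state, state) -> number of occurrences

for sequence in train:
    previous = 'start'
    for word, state in sequence:
        emission_counts[(state, word)] += 1
        transition_counts[(previous, state)] += 1
        previous = state
    # Close the sequence with the special stop symbol.
    transition_counts[(previous, 'stop')] += 1

print(emission_counts[('rainy', 'walk')])      # walk observed on rainy days
print(transition_counts[('sunny', 'sunny')])   # sunny day followed by sunny day
```

Note that ${\tt tennis}$ never appears in the training set, so its count is zero for both states, although it does appear in the second test sequence.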

Now assume we are asked to predict the weather conditions on two different
sequences of days. During these two sequences, we observed the person performing the following activities: 

- $({\tt walk, walk, shop, clean})$
- $({\tt clean, walk, tennis, walk})$


The following image represents the first training sequence, which starts with the ${\tt start}$ symbol and ends with the ${\tt stop}$ symbol.

<img src="../images_for_notebooks/day_2/hmm_new.png">


In [2]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [4]:
import sys
# Append the toolkit path so that we can import the lxmls toolkit
sys.path.append('../../lxmls-toolkit')


In [5]:
import lxmls
import lxmls.readers.simple_sequence as ssr
import scipy
import numpy as np

## Exercise 2.1, Getting in touch with the provided classes

The objective of this exercise is to get familiar with the classes used to store the sequences; you will need them for the next exercise.

We will use:

- class ``Sequence`` in the ``lxmls/sequences/sequence.py`` file
- class ``LabelDictionary`` in the ``lxmls/sequences/label_dictionary.py`` file
- class ``SequenceList`` in the ``lxmls/sequences/sequence_list.py`` file
- class ``_SequenceIterator`` in the ``lxmls/sequences/sequence_list.py`` file



In [6]:
# We could put the code of the classes here with no need to import anything from lxmls-toolkit
from lxmls.sequences.label_dictionary import LabelDictionary
from lxmls.sequences.sequence import Sequence
from lxmls.sequences.sequence_list import SequenceList

The following class holds the training and test data from Example 2.1.


In [7]:
class SimpleSequence:

    def __init__(self):
        # Observation set.
        self.x_dict = LabelDictionary(['walk', 'shop', 'clean', 'tennis'])
        
        # State set.
        self.y_dict = LabelDictionary(['rainy', 'sunny'])
        
        # Generate training sequences.
        train_sequences = SequenceList(self.x_dict, self.y_dict)
        train_sequences.add_sequence(['walk', 'walk', 'shop', 'clean'], ['rainy', 'sunny', 'sunny', 'sunny'])
        train_sequences.add_sequence(['walk', 'walk', 'shop', 'clean'], ['rainy', 'rainy', 'rainy', 'sunny'])
        train_sequences.add_sequence(['walk', 'shop', 'shop', 'clean'], ['sunny', 'sunny', 'sunny', 'sunny'])

        # Generate test sequences.
        test_sequences = SequenceList(self.x_dict, self.y_dict)
        test_sequences.add_sequence(['walk', 'walk', 'shop', 'clean'], ['rainy', 'sunny', 'sunny', 'sunny'])
        test_sequences.add_sequence(['clean', 'walk', 'tennis', 'walk'], ['sunny', 'sunny', 'sunny', 'sunny'])

        self.train = train_sequences
        self.test = test_sequences

Notice that ``x_dict`` and ``y_dict`` are ``LabelDictionary`` objects.

**``LabelDictionary`` objects are instantiated with a list of strings.**

Notice that the data in ``train_sequences`` and ``test_sequences`` are stored as ``SequenceList`` objects.

**``SequenceList`` objects are instantiated with:**

- ``x_dict``, containing all possible words $\Sigma$
- ``y_dict``, containing all possible states $\Lambda$
- ``seq_list``, a list containing the data (if nothing is passed, it starts as an empty list)


**``SequenceList`` objects have a method ``add_sequence``, which receives two lists of strings as input.**

- ``SequenceList.add_sequence`` appends the given sequence with labels $x,y$ as a ``Sequence`` object.

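**``Sequence`` objects are instantiated with:**

- ``x``, the list of observations
- ``y``, the list of states
- ``nr``, the position of the sequence within its ``SequenceList`` (the first training sequence has ``nr`` equal to 0)
- ``sequence_list``, the ``SequenceList`` that the sequence belongs to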
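To see how these pieces fit together, here is a minimal toy re-creation of the containers described above. It is a sketch under stated assumptions, not the lxmls implementation: the class names `ToySequence` and `ToySequenceList` are hypothetical, a plain `dict` stands in for ``LabelDictionary``, and labels are assumed to receive integer ids in the order in which they are listed.

```python
class ToySequence(object):
    """Hypothetical stand-in for a stored sequence."""

    def __init__(self, sequence_list, x, y, nr):
        self.sequence_list = sequence_list  # the list this sequence belongs to
        self.x = x                          # observation ids
        self.y = y                          # state ids
        self.nr = nr                        # position within the list


class ToySequenceList(object):
    """Hypothetical stand-in for a list of sequences sharing two dictionaries."""

    def __init__(self, x_dict, y_dict):
        self.x_dict = x_dict
        self.y_dict = y_dict
        self.seq_list = []

    def add_sequence(self, x, y):
        # Convert the strings to integer ids before storing them.
        x_ids = [self.x_dict[word] for word in x]
        y_ids = [self.y_dict[state] for state in y]
        nr = len(self.seq_list)
        self.seq_list.append(ToySequence(self, x_ids, y_ids, nr))


# Plain dicts standing in for the label dictionaries (assumed id order).
x_dict = {'walk': 0, 'shop': 1, 'clean': 2, 'tennis': 3}
y_dict = {'rainy': 0, 'sunny': 1}

toy = ToySequenceList(x_dict, y_dict)
toy.add_sequence(['walk', 'walk', 'shop', 'clean'],
                 ['rainy', 'sunny', 'sunny', 'sunny'])

first = toy.seq_list[0]
print(first.x)
print(first.y)
print(first.nr)
```

Storing integer ids rather than strings keeps each sequence compact and lets the ids be used directly as array indices.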
      
 

Now we will load the data from Example 2.1 and look at the training and test sets.

In [8]:
simple = ssr.SimpleSequence()
for sequence in simple.train.seq_list:
    print(sequence)

walk/rainy walk/sunny shop/sunny clean/sunny 
walk/rainy walk/rainy shop/rainy clean/sunny 
walk/sunny shop/sunny shop/sunny clean/sunny 


In [9]:
for sequence in simple.test.seq_list:
    print(sequence)

walk/rainy walk/sunny shop/sunny clean/sunny 
clean/sunny walk/sunny tennis/sunny walk/sunny 


In [10]:
type(simple.train.seq_list[0])

lxmls.sequences.sequence.Sequence

In [11]:
simple.train.seq_list[0].__dict__

{'nr': 0,
 'sequence_list': [walk/rainy walk/sunny shop/sunny clean/sunny , walk/rainy walk/rainy shop/rainy clean/sunny , walk/sunny shop/sunny shop/sunny clean/sunny ],
 'x': [0, 0, 1, 2],
 'y': [0, 1, 1, 1]}

In [12]:
type(simple.train.seq_list[0].sequence_list)

lxmls.sequences.sequence_list.SequenceList

In [13]:
for sequence in simple.train.seq_list:
    print(sequence.x)

[0, 0, 1, 2]
[0, 0, 1, 2]
[0, 1, 1, 2]
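The integers printed above are positions in the label lists used to build ``x_dict`` and ``y_dict``. As a quick check, the sketch below decodes the first training sequence by hand using plain Python lists; it does not rely on any toolkit method, and the variable names are hypothetical.

```python
# Label lists in the same order as they were passed to the dictionaries.
x_labels = ['walk', 'shop', 'clean', 'tennis']
y_labels = ['rainy', 'sunny']

# Ids of the first training sequence, as shown in the outputs above.
x_ids = [0, 0, 1, 2]
y_ids = [0, 1, 1, 1]

# Map every id back to its string and pair observation with state.
decoded = ['%s/%s' % (x_labels[i], y_labels[j]) for i, j in zip(x_ids, y_ids)]
print(' '.join(decoded))
```

The decoded string matches the first line printed for the training set, `walk/rainy walk/sunny shop/sunny clean/sunny`.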
