# Day 3: Learning Structured Predictors

In Day 2, we focused on generative sequence classifiers - HMMs. Today's focus is on discriminative classifiers. Recall that:

* **generative classifiers** try to model the probability distribution of the data, $P(X, Y)$;

* **discriminative classifiers** only model the conditional probability of each class given the observed data, $P(Y\,|\,X)$.

In Day 1, we implemented discriminative models for classification tasks. Today, we extend this concept to the classification of _sequential_ data.

## Summary

You will be using two discriminative classifiers to do part-of-speech tagging:
* Conditional Random Fields (CRF) and
* Structured Perceptron.

Your tasks for this lab session are:

* to train a CRF model using two different sets of features (exercises 3.1 and 3.2); 
* to implement the structured perceptron algorithm (exercise 3.3); 
* to compare the performance of the Structured Perceptron with that of CRFs (exercise 3.4).


**Today's coding therefore centers on implementing the ```.perceptron_update``` method inside the ```StructuredPerceptron``` class**. The **```class CRFBatch```** and the **```class CRFOnline```** are already implemented and will be used in Exercises 3.1 and 3.2.


In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import sys
# Append the toolkit path so that the lxmls toolkit can be imported
sys.path.append('../../lxmls-toolkit')

In [5]:
import lxmls
import lxmls.sequences.crf_online as crfo
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.id_feature as idfc
import lxmls.sequences.extended_feature as exfc
from lxmls.readers import pos_corpus

## Introduction

Discriminative sequence models aim to solve the following:

$$\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ P(Y=y\,|\,X=x)=\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ \boldsymbol{w}\cdot\boldsymbol{f}(x, y)$$

where $\boldsymbol{w}$ is the model's weight vector, and $\boldsymbol{f}(x, y)$ is a feature vector. Notice that now both $y$ and $x$ are $N$-dimensional vectors, whereas in Day 1, these variables were just scalar numbers.

In Day 2, sequences were scored using log-probabilities. Today's models still score sequences; the only difference is that the scores are now computed as the dot product of the weight vector with a feature vector:


| score | Hidden Markov Models (Day 2) | Discriminative Models (Today) |
| ------------------------------- | ---------------- | ---------------- |
| $\textrm{score}_\textrm{emiss}$ | $\log P(x_i\,|\,y_i) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}(i, x, y_i)$ |
| $\textrm{score}_\textrm{init}$ | $\log P(y_1\,|\,\mathrm{start}) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init}(x, y_1)$ |
| $\textrm{score}_\textrm{trans}$ | $\log P(y_{i+1}\,|\,y_i) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}(i, x, y_i, y_{i+1})$ |
| $\textrm{score}_\textrm{final}$ | $\log P(\mathrm{stop}\,|\,y_N) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}(x, y_N)$ |

Notice that the scores computed using the feature vector depend on at most two consecutive values of the output variable, $y$, but may depend on the whole observed input, $x$. We can now rewrite the above expression as

$$
\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ 
\sum_{i=1}^N \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}(i, x, y_i) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init}(x, y_1) + 
\sum_{i=1}^{N-1}\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}(i, x, y_i, y_{i+1}) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}(x, y_N) = 
\\
\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ 
\sum_{i=1}^N \textrm{score}_\textrm{emiss}(i, x, y_i) + 
\textrm{score}_\textrm{init}(x, y_1) + 
\sum_{i=1}^{N-1}\textrm{score}_\textrm{trans}(i, x, y_i, y_{i+1}) +
\textrm{score}_\textrm{final}(x, y_N)
$$

Notice that the feature vectors depend only locally on the output variable. The features depend on

- a single $y_i$ in the case of emission, initial and final scores;
- a pair $y_i, y_{i+1}$ in the case of transition scores.
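The decomposition above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sequence_score` and the toy numbers are made up for this example, and the array layout `transition_scores[i-1, current, previous]` is an assumed convention.

```python
import numpy as np

def sequence_score(y, initial_scores, transition_scores, final_scores, emission_scores):
    """Sum the local scores of a label sequence y (a list of state ids)."""
    score = initial_scores[y[0]] + final_scores[y[-1]]
    for i, state in enumerate(y):
        # Node score: depends on a single y_i.
        score += emission_scores[i, state]
        if i > 0:
            # Edge score: depends on the pair (y_{i-1}, y_i).
            # Assumed layout: transition_scores[i-1, current, previous].
            score += transition_scores[i - 1, state, y[i - 1]]
    return score

# Toy example: N = 3 positions, 2 states, hand-picked scores.
initial_scores = np.array([1.0, 0.0])
final_scores = np.array([0.0, 0.5])
emission_scores = np.array([[2.0, 0.0],
                            [0.0, 1.0],
                            [0.0, 3.0]])
transition_scores = np.zeros((2, 2, 2))
transition_scores[:, 1, 0] = 0.25  # moving from state 0 to state 1

toy_score = sequence_score([0, 1, 1], initial_scores, transition_scores,
                           final_scores, emission_scores)
print(toy_score)  # 1.0 + 0.5 + (2.0 + 1.0 + 3.0) + 0.25 = 7.75
```

Because every term touches at most two neighbouring labels, the total score can be maximized position by position instead of enumerating all label sequences.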

### Features

Today we will use two types of simple features. 

- **Features that mimic the features used by the HMM**
    - This will allow us to directly compare the performance of a generative vs a discriminative approach
    

- **Features that are implicit in the HMM**, which are simple indicators of the initial, transition, final and emission events.
    - Given a position $i$, states $c_k$ and $c_l$, and a word $w_j$, the set of features that mimic the HMM are:


| Conditions to be met       |    Name             |
| ----------------           | ----------------    |
| $y_i=c_k  \,\, \& \,\, i=1$        | Initial features    |
| $y_i=c_k   \,\, \& \,\, y_{i-1}=c_l$  | Transition features |
| $y_i=c_k \,\, \& \,\,  i=N$        | Final features      |
| $x_i=w_j \,\, \& \,\,  y_i=c_k$    | Emission features   |

When we used a generative model we were forced to make some independence assumptions. However, since we now take a discriminative approach, where we model $P(Y\,|\,X)$ rather than $P(X,Y)$, we are no longer tied to some of these assumptions. In particular:

- We may use “overlapping” features, e.g., features that fire simultaneously for many instances. For example, we can use a feature for a word, such as a feature which fires for the word “brilliantly”, and another for prefixes and suffixes of that word, such as one which fires if the last two letters of the word are “ly”. This would lead to an awkward model if we wanted to insist on a generative approach.


- We may use features that depend arbitrarily on the entire input sequence $x$. On the other hand, we still need to resort to “local” features with respect to the outputs (e.g. looking only at consecutive state pairs), otherwise decoding algorithms will become more expensive.


#### Typical features used for POS tagging with discriminative models

The following table shows some typical POS tagging features. Let $P_{set}$ and $S_{set}$ be two sets of prefixes and suffixes, respectively (chosen by the user).


| Conditions to be met for some of the most typical POS features     |    Name      |
| ----------------                                | ----------------    |
| $y_i=c_k ,\,\,  i=1$                      | Initial features    |
| $y_i=c_k ,\,\,  y_{i-1}=c_l$                | Transition features |
| $y_i=c_k ,\,\, i=N$                     | Final features      |
| $x_i=w_j ,\,\,  y_i=c_k$                 | Basic emission features|
| $x_i=w_j ,\,\,  w_j \text{ is uppercased} ,\,\,  y_i=c_k$                 | Upper case features|
| $x_i=w_j ,\,\,  w_j \text{ contains digit} ,\,\,  y_i=c_k$                 | Digit features|
| $x_i=w_j ,\,\,  w_j \text{ contains hyphen} ,\,\,  y_i=c_k$                 | Hyphen features|
| $x_i=w_j ,\,\,  w_j[0:\ell] \in P_{set} \text{ for } \ell \in \{1,2,3\}  ,\,\,  y_i=c_k$                 | Prefix features|
| $x_i=w_j ,\,\,  w_j[-\ell:] \in S_{set} \text{ for } \ell \in \{1,2,3\}  ,\,\,  y_i=c_k$                 | Suffix features|
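To make the idea of overlapping features concrete, here is a minimal sketch of an emission feature extractor in the spirit of the table above. It is not the toolkit's `ExtendedFeatures` class: the function name and the feature names other than the `id:` prefix are hypothetical, and "uppercased" is assumed to mean that the word starts with an uppercase letter.

```python
def emission_feature_names(word, tag, prefixes=(), suffixes=()):
    """Return the names of all emission features that fire for (word, tag)."""
    names = ["id:%s::%s" % (word, tag)]          # basic emission feature
    if word[:1].isupper():                       # assumed meaning of "uppercased"
        names.append("uppercased::%s" % tag)
    if any(ch.isdigit() for ch in word):
        names.append("number::%s" % tag)
    if "-" in word:
        names.append("hyphen::%s" % tag)
    for length in (1, 2, 3):
        if len(word) > length:
            if word[:length] in prefixes:
                names.append("prefix:%s::%s" % (word[:length], tag))
            if word[-length:] in suffixes:
                names.append("suffix:%s::%s" % (word[-length:], tag))
    return names

# Several features fire for the same (word, tag) pair.
adverb_features = emission_feature_names("brilliantly", "adv", suffixes={"ly", "ing"})
number_features = emission_feature_names("Mid-1990s", "num")
print(adverb_features)
print(number_features)
```

For "brilliantly" both the word identity feature and the "ly" suffix feature fire at once, which is exactly the kind of overlap a discriminative model can handle without extra independence assumptions.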








We can have more complex features that look at arbitrary parts of the input sequence. We do not use them in this exercise, purely for performance reasons (fewer features and smaller caches). State-of-the-art sequence classifiers can easily reach over one million features!

Our features fall into two groups:

- **node features**: $f_{\text{emiss}}, f_{\text{init}}, f_{\text{final}}$. Node features depend only on a single position in the state sequence (or node in the trellis).
    
    
- **edge features**: $f_{\text{trans}}$. Edge features depend on two consecutive positions in the state sequence (an edge in the trellis)


    
| Score definitions: scalar product between features and weights |
| ------------------------------- |
| $\textrm{score}_\textrm{emiss}\,(i,x,y_i) =\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}\,(i, x, y_i)$ |
| $\textrm{score}_\textrm{init}\,(x,y_1)=\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init} \,(x, y_1)$ |
| $\textrm{score}_\textrm{trans}\,(i,x,y_i,y_{i+1}) = \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}\,(i, x, y_i, y_{i+1})$ |
| $\textrm{score}_\textrm{final}\,(x,y_N) = \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}\,(x, y_N)$ |



### Discriminative Sequential Classifiers

Given a weight vector $w$, the conditional probability $P_{w}(Y=y\,|\,X=x)$ is then defined as follows: 


$$
P_{w}(Y=y \,|\, X=x) = \frac{1}{Z(w,x)}\exp \big( \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{init}}\,(x,y_1) +  \sum_{i=1}^{N-1} \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{trans}}\,(i,x,y_i,y_{i+1}) + \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{final}}\,(x,y_N) + \sum_{i=1}^{N} \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{emiss}}\,(i,x,y_i) \big)
$$

where the normalizing factor $Z(w,x)$ is called the **partition function**:

$$
Z(w,x) = \sum_{y\in \Lambda^N} \exp \big( \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{init}}\,(x,y_1) + 
\sum_{i=1}^{N-1} \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{trans}}\,(i,x,y_i,y_{i+1}) + \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{final}}\,(x,y_N) + \sum_{i=1}^{N} \boldsymbol{w} \cdot \boldsymbol{f}_{\textrm{emiss}}\,(i,x,y_i) \big)
$$
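For a very short sequence the partition function can be computed by brute force, which makes its role as a normalizer easy to check. The sketch below assumes randomly drawn local scores and the layout `trans[i-1, current, previous]`; all names are made up for this example. It is only feasible for tiny $N$, since the sum has $|\Lambda|^N$ terms.

```python
import itertools
import numpy as np

def total_score(y, init, trans, final, emiss):
    """Unnormalized log-score of the label sequence y."""
    score = init[y[0]] + final[y[-1]]
    score += sum(emiss[i, s] for i, s in enumerate(y))
    # Assumed layout: trans[i-1, current, previous].
    score += sum(trans[i - 1, y[i], y[i - 1]] for i in range(1, len(y)))
    return score

rng = np.random.RandomState(0)
N, K = 4, 3                          # sequence length and number of states
init, final = rng.randn(K), rng.randn(K)
emiss, trans = rng.randn(N, K), rng.randn(N - 1, K, K)

# Enumerate every label sequence in Lambda^N.
all_sequences = list(itertools.product(range(K), repeat=N))
Z = sum(np.exp(total_score(y, init, trans, final, emiss)) for y in all_sequences)
probs = dict((y, np.exp(total_score(y, init, trans, final, emiss)) / Z)
             for y in all_sequences)

print(len(all_sequences))            # K ** N sequences
print(sum(probs.values()))           # the probabilities sum to one (up to rounding)
```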

#### Training a discriminative sequential classifier



For training, the important problem is that of obtaining a weight vector $w$ that leads to an accurate 
classifier. We will discuss two possible strategies:

-  Maximizing conditional log-likelihood from a set of labeled data $\{(x^m,y^m)\}_{m=1}^M$, yielding **conditional random fields**. This corresponds to the following optimization problem:
$$
\hat{w} = \arg\max_{w} \sum_{m=1}^M \log P_{w}(Y=y^m \vert X=x^m).
$$
To avoid overfitting, it is common to regularize with the Euclidean norm function, 
which is equivalent to considering a zero-mean Gaussian prior on the weight vector.
The problem becomes:
$$
\hat{w} = \arg\max_{w} \sum_{m=1}^M \log P_{w}(Y=y^m | X=x^m) - \frac{\lambda}{2} \|w\|^2.
$$
This is precisely the structured variant of the maximum entropy 
method discussed on Day 1. Unlike HMMs, this problem does not have a closed-form solution 
and has to be solved with numerical optimization. 


- Alternatively, running the **structured perceptron** algorithm 
to obtain a weight vector $w$ that accurately classifies the training data. 
We will see that this simple strategy achieves results which are competitive 
with conditional log-likelihood maximization.






### Decoding

One important thing to notice is that the decoding process - the process by which we pick the most likely label $y_i$ for the observation $x_i$ - stays the same. This means *we do not need to develop new decoders,* only new functions to compute the scores. Because of this, we will keep using the Viterbi and Forward-Backward algorithms developed on Day 2.


For decoding, there are three important problems that need to be solved:

1. Given $X=x$, compute the most likely output sequence $\hat{y}$ (the one which maximizes $P_{w}(Y=y|X=x)$). 
2. Compute the posterior marginals $P_{w}(Y_i=y_i|X=x)$ at each position $i$.
3. Evaluate the partition function $Z(w,x)$. 

Interestingly, all these problems can be solved by using the very same algorithms that were already implemented for HMMs: the Viterbi algorithm (for 1) and the forward-backward algorithm (for 2 and 3). All that changes is the way the scores are computed. 
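As a reminder of how problems 1 and 3 are solved from the four score arrays alone, here is a compact log-space sketch of Viterbi and of the forward recursion. It is not the toolkit's decoder: the function names are hypothetical and the layout `trans[i-1, current, previous]` is an assumption.

```python
import numpy as np

def viterbi(init, trans, final, emiss):
    """Return the highest-scoring label sequence and its score."""
    N, K = emiss.shape
    best = np.zeros((N, K))
    backptr = np.zeros((N, K), dtype=int)
    best[0] = init + emiss[0]
    for i in range(1, N):
        # cand[cur, prev]: best score ending in prev, then moving to cur.
        cand = trans[i - 1] + best[i - 1][np.newaxis, :]
        backptr[i] = cand.argmax(axis=1)
        best[i] = cand.max(axis=1) + emiss[i]
    last = best[-1] + final
    path = [int(last.argmax())]
    for i in range(N - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    path.reverse()
    return path, float(last.max())

def log_partition(init, trans, final, emiss):
    """Forward recursion in log space: returns log Z(w, x)."""
    N, K = emiss.shape
    alpha = init + emiss[0]
    for i in range(1, N):
        # Same recursion as Viterbi, with max replaced by log-sum-exp.
        alpha = np.logaddexp.reduce(trans[i - 1] + alpha[np.newaxis, :], axis=1) + emiss[i]
    return float(np.logaddexp.reduce(alpha + final))

rng = np.random.RandomState(1)
N, K = 5, 3
init, final = rng.randn(K), rng.randn(K)
emiss, trans = rng.randn(N, K), rng.randn(N - 1, K, K)

best_path, best_score = viterbi(init, trans, final, emiss)
log_Z = log_partition(init, trans, final, emiss)
print(best_path)
print(log_Z - best_score)  # minus the log-probability of the best sequence
```

Nothing in these two functions knows whether the scores came from HMM log-probabilities or from $\boldsymbol{w}\cdot\boldsymbol{f}$, which is why the Day 2 decoders can be reused unchanged.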



### Training the classifier

Today we will cover two different approaches to training sequential discriminative models. Given a training set with $M$ observation-label pairs, $\{(x_m, y_m)\}_{m=1}^M$ (note that $x_m$, $y_m$ are $N$-dimensional vectors, as $m$ indexes the training sample):

* **Conditional Random Fields** maximize the log-likelihood over $w$ on the observed training set.
* **Structured Perceptron** iteratively updates $w$ in order to correctly classify the training set.
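The perceptron idea can be illustrated on sparse feature counts: whenever the predicted tag sequence differs from the gold one, the weights of the gold features go up and those of the predicted features go down. The sketch below is a generic illustration, not the toolkit's `.perceptron_update` method; the function names, the dict-based weights and the `learning_rate` parameter are assumptions made for this example.

```python
from collections import Counter

def feature_counts(words, tags):
    """Count the HMM-like features that fire for a tagged sentence."""
    counts = Counter()
    counts["init_tag:%s" % tags[0]] += 1
    for i, (word, tag) in enumerate(zip(words, tags)):
        counts["id:%s::%s" % (word, tag)] += 1
        if i > 0:
            counts["prev_tag:%s::%s" % (tags[i - 1], tag)] += 1
    counts["final_prev_tag:%s" % tags[-1]] += 1
    return counts

def perceptron_step(weights, words, gold_tags, predicted_tags, learning_rate=1.0):
    """Apply w += lr * (f(x, y_gold) - f(x, y_pred)); return the number of tag mistakes."""
    gold = feature_counts(words, gold_tags)
    pred = feature_counts(words, predicted_tags)
    for name in set(gold) | set(pred):
        delta = gold[name] - pred[name]
        if delta != 0:  # features shared by both sequences cancel out
            weights[name] = weights.get(name, 0.0) + learning_rate * delta
    return sum(g != p for g, p in zip(gold_tags, predicted_tags))

weights = {}
mistakes = perceptron_step(weights,
                           ["Ms.", "Haag", "plays"],
                           ["noun", "noun", "verb"],   # gold tags
                           ["noun", "verb", "verb"])   # a hypothetical wrong prediction
print(mistakes)
print(sorted(weights.items()))
```

Only the features on which the two sequences disagree are touched, so a correct prediction leaves the weights unchanged.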

## Conditional Random Fields

CRFs are the generalization of the Maximum Entropy classifier for sequences. The general concept is the same, with a couple of differences to be discussed below. They are trained by solving the following optimization problem:

$$
\hat{\boldsymbol{w}} = \underset{\boldsymbol{w}}{\textrm{argmax}}\ \sum_{m=1}^M\log P_\boldsymbol{w}(Y=y_m\,|\,X=x_m)
$$

where 

$$
P_w(Y=y\,|\,X=x) =
\frac{1}{Z(w, x)}\ \exp \big(\sum_{i=1}^N \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}\,(i, x, y_i) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init}\,(x, y_1) + 
\sum_{i=1}^{N-1}\boldsymbol{w} \cdot \boldsymbol{f}_\textrm{trans}\,(i, x, y_i, y_{i+1}) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}\,(x, y_N) \big)\\
Z(w, x) = \sum_{y\,\in\,\Lambda^N} \exp \big(\sum_{i=1}^N \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}\,(i, x, y_i) +  \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init}\,(x, y_1) + 
\sum_{i=1}^{N-1}\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}\,(i, x, y_i, y_{i+1}) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}\,(x, y_N) \big)
$$

As before, the partition function $Z(w,x)$ ensures the sum of probabilities over all possible labels $y\in\Lambda^N$ is equal to 1.

To avoid overfitting, it is common to add the Euclidean norm function as a regularization term. This is equivalent
to considering a zero-mean Gaussian prior on the weight vector. The optimization problem becomes:

$$
\hat{\boldsymbol{w}} = \underset{\boldsymbol{w}}{\textrm{arg max}}\ \sum_{m=1}^M\log P_\boldsymbol{w}(Y=y_m\,|\,X=x_m) - \frac{\lambda}{2}||\boldsymbol{w}||^2
$$

which is precisely the structured variant of the maximum entropy method discussed on Day 1. Unlike with HMMs, the above problem has to be solved numerically.

### Differences with respect to ME algorithm

* A CRF does not compute the posterior $P(Y=y\,|\,X=x)$ for every possible $y\in\Lambda^N$, as there are exponentially many possible $y$'s. Instead, it decomposes the model into parts (nodes and edges) and computes the posterior marginals for those parts, that is, $P(Y_i=y_i\,|\,X=x)$ and $P(Y_i=y_i, Y_{i+1}=y_{i+1}\,|\,X=x)$. The crucial point is that these quantities can be computed using the forward-backward algorithm.

* Instead of updating the features for all possible outputs $y' \in \Lambda^N$, we again exploit the decomposition into parts above and update only the “local features” at the nodes and edges, each weighted by its posterior marginal.

### Pseudo Code

Below is pseudo code to optimize a CRF with the stochastic gradient descent (SGD) algorithm. Our toolkit also includes an implementation of a quasi-Newton method, L-BFGS, which converges faster. For the purpose of this exercise, however, we will stick with SGD.

<img src="../images_for_notebooks/day_3/CRF_pseudocode.png">
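The expected feature vector that appears in CRF training can be motivated by differentiating the conditional log-likelihood. The short derivation below assumes the shorthand $\boldsymbol{f}(x,y)$ for the sum of all local (initial, transition, final and emission) feature vectors of the pair $(x,y)$ and ignores the regularization term; the pseudocode in the image may use a different notation.

$$
\begin{aligned}
\log P_\boldsymbol{w}(y\,|\,x) &= \boldsymbol{w}\cdot\boldsymbol{f}(x,y) - \log Z(\boldsymbol{w},x) \\
\nabla_\boldsymbol{w} \log Z(\boldsymbol{w},x) &= \frac{1}{Z(\boldsymbol{w},x)}\sum_{y'\in\Lambda^N} \exp\big(\boldsymbol{w}\cdot\boldsymbol{f}(x,y')\big)\,\boldsymbol{f}(x,y') = \sum_{y'\in\Lambda^N} P_\boldsymbol{w}(y'\,|\,x)\,\boldsymbol{f}(x,y') \\
\nabla_\boldsymbol{w} \log P_\boldsymbol{w}(y\,|\,x) &= \boldsymbol{f}(x,y) - \mathbb{E}_{Y\sim P_\boldsymbol{w}(\cdot\,|\,x)}\big[\boldsymbol{f}(x,Y)\big]
\end{aligned}
$$

The gradient is therefore the observed feature vector minus the feature vector expected under the current model. Since $\boldsymbol{f}(x,y)$ is a sum of node and edge features, the expectation only requires the node and edge posteriors, which the forward-backward algorithm provides.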






### HMM vs CRF

HMMs are factored linear models, and any HMM can be written as a CRF.

Given that an HMM is a type of CRF, the two differ in the following ways:

- HMM features are tied to the generative process.
- CRF features are very flexible. They can look at the whole input $x$ paired with any labeled bigram.
- In practice, for prediction tasks, good discriminative features can improve accuracy a lot.

##### Parameter estimation

- HMMs focus on explaining the data, both x and y.
- CRFs focus  on the mapping from x to y
- A priori it is hard to say which paradigm is better.
- A similar dilemma arises between Naive Bayes and Maximum Entropy.

## Exercises


Objectives:


* train a CRF using different feature sets for part-of-speech tagging;

* evaluate the model on the training, development and test sets.


Files used:

* class CRFOnline in lxmls/sequences/crf_online.py file

* class PostagCorpus in lxmls/readers/pos_corpus.py file

* class IDFeatures in lxmls/sequences/id_feature.py file

* class ExtendedFeatures in lxmls/sequences/extended_feature.py file


# CRF and Structured Perceptron are discriminative classifiers


Classes that implement CRF and Structured Perceptron inherit from  ```lxmls.sequences.discriminative_sequence_classifier```

More concretely, the classes are

- **```class StructuredPerceptron```**, which can be found in ```lxmls/sequences/structured_perceptron.py```
- **```class CRFBatch```**, which can be found in ```lxmls/sequences/crf_batch.py```
- **```class CRFOnline```**, which can be found in ```lxmls/sequences/crf_online.py```
    



    
## Code for ```lxmls.sequences.sequence_classifier ```


#### What is important to notice in  SequenceClassifier class

The code provided below in **(*)** defines an abstract class for a sequence classifier.

It can be noticed that a ```SequenceClassifier``` has the methods

- ```.train_supervised```: trains the algorithm in a supervised way
- ```.compute_scores```: Computes the scores of a given sequence

**Neither of these methods is implemented.**

```python        
    # Code from SequenceClassifier
    def train_supervised(self, sequence_list):
        """ Train a classifier in a supervised setting."""
        raise NotImplementedError

    def compute_scores(self, sequence):
        """ Compute emission and transition scores for the decoder."""
        raise NotImplementedError
```       


**Today's exercise is related to the ```.train_supervised``` method of the structured perceptron.**

The CRF code is already fully implemented, so there is nothing to implement there.
The structured perceptron already has its ```.train_supervised``` method implemented. However, ```.train_supervised``` inside the ```StructuredPerceptron``` class calls another method that you will have to implement: ```.perceptron_update```.



#### (*) Code for the SequenceClassifier class

```python        
import sequence_classification_decoder as scd

class SequenceClassifier:
    """ Implements an abstract sequence classifier."""

    def __init__(self, observation_labels, state_labels):
        """Initialize a sequence classifier. observation_labels and
        state_labels are the sets of observations and states, respectively.
        They must be LabelDictionary objects."""

        self.decoder = scd.SequenceClassificationDecoder()
        self.observation_labels = observation_labels
        self.state_labels = state_labels
        self.trained = False

    def get_num_states(self):
        """ Return the number of states."""
        return len(self.state_labels)

    def get_num_observations(self):
        """ Return the number of observations (e.g. word types)."""
        return len(self.observation_labels)

    def train_supervised(self, sequence_list):
        """ Train a classifier in a supervised setting."""
        raise NotImplementedError

    def compute_scores(self, sequence):
        """ Compute emission and transition scores for the decoder."""
        raise NotImplementedError

     .
     .
     .
```


## Code for ```DiscriminativeSequenceClassifier```


The code for DiscriminativeSequenceClassifier class is as follows:




```python
import numpy as np
import lxmls.sequences.sequence_classifier as sc

class DiscriminativeSequenceClassifier(sc.SequenceClassifier):

    def __init__(self, observation_labels, state_labels, feature_mapper):
        sc.SequenceClassifier.__init__(self, observation_labels, state_labels)

        # Set feature mapper and initialize parameters.
        self.feature_mapper = feature_mapper
        self.parameters = np.zeros(self.feature_mapper.get_num_features())

    def compute_scores(self, sequence):
        num_states = self.get_num_states()
        length = len(sequence.x)
        emission_scores = np.zeros([length, num_states])
        initial_scores = np.zeros(num_states)
        transition_scores = np.zeros([length-1, num_states, num_states])
        final_scores = np.zeros(num_states)

        # Initial position.
        for tag_id in xrange(num_states):
            initial_features = self.feature_mapper.get_initial_features(sequence, tag_id)
            score = 0.0
            for feat_id in initial_features:
                score += self.parameters[feat_id]
            initial_scores[tag_id] = score

        # Intermediate position.
        for pos in xrange(length):
            for tag_id in xrange(num_states):
                emission_features = self.feature_mapper.get_emission_features(sequence, pos, tag_id)
                score = 0.0
                for feat_id in emission_features:
                    score += self.parameters[feat_id]
                emission_scores[pos, tag_id] = score
            if pos > 0:
                for tag_id in xrange(num_states):
                    for prev_tag_id in xrange(num_states):
                        transition_features = self.feature_mapper.get_transition_features(
                            sequence, pos, tag_id, prev_tag_id)
                        score = 0.0
                        for feat_id in transition_features:
                            score += self.parameters[feat_id]
                        transition_scores[pos-1, tag_id, prev_tag_id] = score

        # Final position.
        for prev_tag_id in xrange(num_states):
            final_features = self.feature_mapper.get_final_features(sequence, prev_tag_id)
            score = 0.0
            for feat_id in final_features:
                score += self.parameters[feat_id]
            final_scores[prev_tag_id] = score

        return initial_scores, transition_scores, final_scores, emission_scores
```


# About Feature Generation

Given a dataset, the features are built in two steps:

- An instance of ```IDFeatures``` (we will call it ```feature_mapper```) must be instantiated.
- ```feature_mapper.build_features()``` must be executed.



### Loading the data inside train-02-21.conll

In [6]:
corpus = lxmls.readers.pos_corpus.PostagCorpus()
data_path = "../../lxmls-toolkit/data/"

train_seq = corpus.read_sequence_list_conll(data_path + "/train-02-21.conll", 
                                            max_sent_len=10, max_nr_sent=1000)

In [7]:
print "There are", len(train_seq), "samples in train_seq"

There are 1000 samples in train_seq


In [8]:
train_seq[0]

Ms./noun Haag/noun plays/verb Elianti/noun ./. 

## Inspecting the IDFeatures class

**An IDFeatures object will be referred to as a ```feature_mapper```.**


We will assume feature_mapper has been instantiated with

    feature_mapper = lxmls.sequences.id_feature.IDFeatures(train_seq)



#### About feature_mappers
A ```feature_mapper``` contains the following attributes:

- the dataset in ```.dataset```
    - if we instantiate the feature mapper with a dataset X then ```feature_mapper.dataset``` will be a copy of X


- a dictionary of features in ```.feature_dict```
    - this dictionary defaults to ```{}```. 
    - To fill it, the feature mapper must call the ```.build_features()``` method.
    
    
- a list of features in ```.feature_list```
    - this list defaults to ```[]```. 
    - To fill it, the feature mapper must call the ```.build_features()``` method.

A ```feature_mapper``` also provides a method to generate features, ```.build_features```:

- it creates the features using the ```.dataset```.
- it also fills ```.feature_dict``` and ```.feature_list```.



In [9]:
feature_mapper = lxmls.sequences.id_feature.IDFeatures(train_seq)

In [10]:
len(feature_mapper.feature_list)

0

In [11]:
feature_mapper.__dict__.keys()

['feature_list',
 'final_state_feature_cache',
 'node_feature_cache',
 'add_features',
 'dataset',
 'initial_state_feature_cache',
 'feature_dict',
 'edge_feature_cache']

In [12]:
feature_mapper.dataset[0:2]

[Ms./noun Haag/noun plays/verb Elianti/noun ./. ,
 The/det new/adj rate/noun will/verb be/verb payable/adj Feb./noun 15/num ./. ]

In [13]:
# feature_dict is still empty because build_features() has not been called yet.
# Once built, its length is the dimension d of f(x,y) (2683 for this dataset).
len(feature_mapper.feature_dict)

0

In [14]:
feature_mapper.feature_dict

{}

##  Building features using ```.build_features()```

Now we will call ```feature_mapper.build_features()``` to get the features for each training sample

In [15]:
feature_mapper = lxmls.sequences.id_feature.IDFeatures(train_seq)
feature_mapper.build_features()

In [16]:
print "there are", len(feature_mapper.feature_list), "samples in the features build from train_seq"

there are 1000 samples in the features build from train_seq


In [17]:
len(feature_mapper.feature_list)

1000

The ```feature_mapper``` object also keeps the dataset it was built from:

In [18]:
len(feature_mapper.dataset)

1000

#### Examining initial, transition, final and emission features

In [19]:
feature_mapper.dataset[0]

Ms./noun Haag/noun plays/verb Elianti/noun ./. 

In [20]:
feature_mapper.feature_list[0]

[[[0]], [[3], [5], [7], [9]], [[10]], [[1], [2], [4], [6], [8]]]

In [21]:
print "\nInitial features:",     feature_mapper.feature_list[0][0]
print "\nTransition features:",  feature_mapper.feature_list[0][1]
print "\nFinal features:",       feature_mapper.feature_list[0][2]
print "\nEmission features:",    feature_mapper.feature_list[0][3]


Initial features: [[0]]

Transition features: [[3], [5], [7], [9]]

Final features: [[10]]

Emission features: [[1], [2], [4], [6], [8]]


### Codification of the features

All features are stored in ``feature_mapper.feature_dict``, which maps each feature name to its position in the feature vector. If it represents our feature vector, why is it not a vector? Good point! To keep the algorithm fast, the code is written using dicts: accessing only a few positions of a dict and computing subtractions on them is much faster than subtracting two huge weight vectors.

Features are identified by the prefixes **init_tag:**, **prev_tag:**, **final_prev_tag:** and **id:**

- **init_tag:** when they are Initial features
    - Example: **``init_tag:noun``** is an initial feature that describes that the first word is a noun
    
    
- **prev_tag:** when they are transition features
    - Example: **``prev_tag:noun::noun``** is a transition feature that describes that the previous word was
      a noun and the current word is a noun.
    - Example: **``prev_tag:noun::.``** is a transition feature that describes that the previous word was
      a noun and the current word is a `.` (this is usually found as the last transition feature, since most sentences end with a period)
      


- **final_prev_tag:** when they are final features
    - Example: **``final_prev_tag:.``** is a final feature stating that the last "word" in the sentence was a dot.


- **id:** when they are emission features
    - Example: **``id:plays::verb``** is an emission feature, describing that the current word is plays and the current hidden state is a verb.
    - Example: **``id:Feb.::noun``** is an emission feature, describing that the current word is "Feb." and the current hidden state is a noun.
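The naming scheme and the id assignment can be reproduced with a small stand-alone sketch. This is not the `IDFeatures` implementation: the two helper functions are hypothetical, and the order in which ids are assigned (initial feature first, then emission and transition features position by position, then the final feature) is inferred from the ids printed in this notebook.

```python
def add_feature(feature_dict, name):
    """Return the id of `name`, assigning the next free id if it is new."""
    if name not in feature_dict:
        feature_dict[name] = len(feature_dict)
    return feature_dict[name]

def build_sentence_features(words, tags, feature_dict):
    """Return [initial, transition, final, emission] feature ids for one sentence."""
    initial = [[add_feature(feature_dict, "init_tag:%s" % tags[0])]]
    transition, emission = [], []
    for pos, (word, tag) in enumerate(zip(words, tags)):
        emission.append([add_feature(feature_dict, "id:%s::%s" % (word, tag))])
        if pos > 0:
            name = "prev_tag:%s::%s" % (tags[pos - 1], tag)
            transition.append([add_feature(feature_dict, name)])
    final = [[add_feature(feature_dict, "final_prev_tag:%s" % tags[-1])]]
    return [initial, transition, final, emission]

toy_feature_dict = {}
first = build_sentence_features(
    ["Ms.", "Haag", "plays", "Elianti", "."],
    ["noun", "noun", "verb", "noun", "."],
    toy_feature_dict)
second = build_sentence_features(
    ["The", "new", "rate", "will", "be", "payable", "Feb.", "15", "."],
    ["det", "adj", "noun", "verb", "verb", "adj", "noun", "num", "."],
    toy_feature_dict)
print(first)
print(second)
```

Features already seen in an earlier sentence, such as ``prev_tag:noun::verb`` or ``final_prev_tag:.``, keep their original id, which is why the second sentence reuses ids 5, 8 and 10.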




In [22]:
inv_feature_dict = {feat_id: feat_name for feat_name, feat_id in feature_mapper.feature_dict.iteritems()}

In [23]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][0]]

['init_tag:noun']

In [24]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][1]]

['prev_tag:noun::noun',
 'prev_tag:noun::verb',
 'prev_tag:verb::noun',
 'prev_tag:noun::.']

In [25]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][2]]

['final_prev_tag:.']

In [26]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[0][3]]

[u'id:Ms.::noun',
 u'id:Haag::noun',
 u'id:plays::verb',
 u'id:Elianti::noun',
 u'id:.::.']

In [27]:
print "\nInitial features:",     feature_mapper.feature_list[1][0]
print "\nTransition features:",  feature_mapper.feature_list[1][1]
print "\nFinal features:",       feature_mapper.feature_list[1][2]
print "\nEmission features:",    feature_mapper.feature_list[1][3]


Initial features: [[11]]

Transition features: [[14], [16], [5], [19], [21], [16], [24], [25]]

Final features: [[10]]

Emission features: [[12], [13], [15], [17], [18], [20], [22], [23], [8]]


In [28]:
feature_mapper.feature_list[1]

[[[11]],
 [[14], [16], [5], [19], [21], [16], [24], [25]],
 [[10]],
 [[12], [13], [15], [17], [18], [20], [22], [23], [8]]]

In [29]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[1][0]]

['init_tag:det']

In [30]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[1][1]]

['prev_tag:det::adj',
 'prev_tag:adj::noun',
 'prev_tag:noun::verb',
 'prev_tag:verb::verb',
 'prev_tag:verb::adj',
 'prev_tag:adj::noun',
 'prev_tag:noun::num',
 'prev_tag:num::.']

In [31]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[1][2]]

['final_prev_tag:.']

In [32]:
[inv_feature_dict[x[0]] for x in feature_mapper.feature_list[1][3]]

[u'id:The::det',
 u'id:new::adj',
 u'id:rate::noun',
 u'id:will::verb',
 u'id:be::verb',
 u'id:payable::adj',
 u'id:Feb.::noun',
 u'id:15::num',
 u'id:.::.']

In [33]:
len(feature_mapper.feature_dict)

2683

In [34]:
feature_mapper.feature_dict["id:Each::det"]

946

In [35]:
inv_feature_dict [11]

'init_tag:det'

In [36]:
pos = [inv_feature_dict[x]=="id:Each::det"  for x in range(len(inv_feature_dict))]

In [37]:
for x in range(len(inv_feature_dict)):
    if inv_feature_dict[x]=="id:Each::det":
        print x, inv_feature_dict[x]

946 id:Each::det


In [38]:
inv_feature_dict

{0: 'init_tag:noun',
 1: u'id:Ms.::noun',
 2: u'id:Haag::noun',
 3: 'prev_tag:noun::noun',
 4: u'id:plays::verb',
 5: 'prev_tag:noun::verb',
 6: u'id:Elianti::noun',
 7: 'prev_tag:verb::noun',
 8: u'id:.::.',
 9: 'prev_tag:noun::.',
 10: 'final_prev_tag:.',
 11: 'init_tag:det',
 12: u'id:The::det',
 13: u'id:new::adj',
 14: 'prev_tag:det::adj',
 15: u'id:rate::noun',
 16: 'prev_tag:adj::noun',
 17: u'id:will::verb',
 18: u'id:be::verb',
 19: 'prev_tag:verb::verb',
 20: u'id:payable::adj',
 21: 'prev_tag:verb::adj',
 22: u'id:Feb.::noun',
 23: u'id:15::num',
 24: 'prev_tag:noun::num',
 25: 'prev_tag:num::.',
 26: u'id:A::det',
 27: u'id:record::noun',
 28: 'prev_tag:det::noun',
 29: u'id:date::noun',
 30: u'id:has::verb',
 31: u"id:n't::adv",
 32: 'prev_tag:verb::adv',
 33: u'id:been::verb',
 34: 'prev_tag:adv::verb',
 35: u'id:set::verb',
 36: 'prev_tag:verb::.',
 37: 'init_tag:adv',
 38: u'id:Not::adv',
 39: u'id:all::det',
 40: 'prev_tag:adv::det',
 41: u'id:those::det',
 42: 'prev_t



## Exercise 3.1 - Basic feature set

_Start by training the model. You will receive feedback when each epoch is finished. Note that running the 20 epochs might take a while._

In [39]:
import lxmls.sequences
import lxmls.sequences.crf_online as crfo
import lxmls.readers.pos_corpus as pcc
import lxmls.sequences.id_feature as idfc
import lxmls.sequences.extended_feature as exfc

In [40]:
data_path = "../../lxmls-toolkit/data/"

In [41]:
# Load the corpus
corpus = pcc.PostagCorpus()

# Load the training, test and development sequences
train_seq = corpus.read_sequence_list_conll(data_path + "/train-02-21.conll", 
                                            max_sent_len=10, max_nr_sent=1000)
test_seq = corpus.read_sequence_list_conll(data_path + "/test-23.conll",
                                           max_sent_len=10, max_nr_sent=1000)
dev_seq = corpus.read_sequence_list_conll(data_path + "/dev-22.conll", 
                                          max_sent_len=10, max_nr_sent=1000)


In [42]:
train_seq[0].x

[42, 40, 43, 44, 41]

In [43]:
train_seq[0].y

[0, 0, 6, 0, 4]

In [44]:
# Build features
feature_mapper = idfc.IDFeatures(train_seq)
feature_mapper.build_features()

# Train the model
# You will receive feedback when each epoch is finished.
# Note that running the 20 epochs might take a while.
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper)
crf_online.num_epochs = 20
crf_online.train_supervised(train_seq)



Epoch: 0 Objective value: -5.779018
Epoch: 1 Objective value: -3.192724
Epoch: 2 Objective value: -2.717537
Epoch: 3 Objective value: -2.436614
Epoch: 4 Objective value: -2.240491
Epoch: 5 Objective value: -2.091833
Epoch: 6 Objective value: -1.973353
Epoch: 7 Objective value: -1.875643
Epoch: 8 Objective value: -1.793034
Epoch: 9 Objective value: -1.721857
Epoch: 10 Objective value: -1.659605
Epoch: 11 Objective value: -1.604499
Epoch: 12 Objective value: -1.555229
Epoch: 13 Objective value: -1.510806
Epoch: 14 Objective value: -1.470468
Epoch: 15 Objective value: -1.433612
Epoch: 16 Objective value: -1.399759
Epoch: 17 Objective value: -1.368518
Epoch: 18 Objective value: -1.339566
Epoch: 19 Objective value: -1.312636


#### The previous cell execution should give the following results

    Epoch: 0 Objective value: -5.779018
    Epoch: 1 Objective value: -3.192724
    Epoch: 2 Objective value: -2.717537
    Epoch: 3 Objective value: -2.436614
    Epoch: 4 Objective value: -2.240491
    Epoch: 5 Objective value: -2.091833
    Epoch: 6 Objective value: -1.973353
    Epoch: 7 Objective value: -1.875643
    Epoch: 8 Objective value: -1.793034
    Epoch: 9 Objective value: -1.721857
    Epoch: 10 Objective value: -1.659605
    Epoch: 11 Objective value: -1.604499
    Epoch: 12 Objective value: -1.555229
    Epoch: 13 Objective value: -1.510806
    Epoch: 14 Objective value: -1.470468
    Epoch: 15 Objective value: -1.433612
    Epoch: 16 Objective value: -1.399759
    Epoch: 17 Objective value: -1.368518
    Epoch: 18 Objective value: -1.339566
    Epoch: 19 Objective value: -1.312636


After training is done, evaluate the learned model on the training, development and test sets.


In [45]:
# Make predictions for the various sequences using the trained model.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)

# Evaluate and print accuracies
eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)
print("CRF -  Accuracy Train: %.3f Dev: %.3f Test: %.3f" % (eval_train, eval_dev, eval_test))

CRF -  Accuracy Train: 0.949 Dev: 0.846 Test: 0.858


**Your output should be similar to this:**

    CRF -  Accuracy Train: 0.949 Dev: 0.846 Test: 0.858

Compare with the results achieved with the HMM model (0.837 on the test set). Even when using a similar feature set, a CRF yields better results than the HMM from the previous lecture.
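One way to put the gap in perspective is to look at the relative reduction in error rate instead of the raw accuracy difference. The snippet below is a minimal sketch that uses only the two test accuracies quoted above; the helper name `relative_error_reduction` is introduced here for illustration and is not part of the toolkit.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Fraction of the baseline's errors that the new model removes."""
    err_baseline = 1.0 - acc_baseline
    err_new = 1.0 - acc_new
    return (err_baseline - err_new) / err_baseline


hmm_test_acc = 0.837  # HMM test accuracy quoted in the text
crf_test_acc = 0.858  # CRF test accuracy reported above

reduction = relative_error_reduction(hmm_test_acc, crf_test_acc)
print("Relative error reduction: %.1f%%" % (100 * reduction))
```

With these numbers, the CRF removes roughly 13% of the errors made by the HMM on the test set.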

Perform some error analysis to identify the main errors the tagger is making, and compare them with the errors made by the HMM model.

**Hint:** use the methods developed in the previous lecture to help you with the error analysis.
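If you do not have those methods at hand, a simple starting point is to count which (gold tag, predicted tag) pairs are confused most often. The sketch below is self-contained and runs on toy tag-ID lists; `confusion_counts` is a hypothetical helper, not a toolkit function.

```python
from collections import Counter


def confusion_counts(gold_tags, pred_tags):
    """Count (gold, predicted) tag pairs over all mis-tagged tokens."""
    errors = Counter()
    for gold_seq, pred_seq in zip(gold_tags, pred_tags):
        for g, p in zip(gold_seq, pred_seq):
            if g != p:
                errors[(g, p)] += 1
    return errors


# Toy tag-ID sequences standing in for real gold and predicted tags.
gold = [[0, 0, 6, 0, 4], [1, 2, 0]]
pred = [[0, 6, 6, 0, 4], [1, 0, 0]]

errors = confusion_counts(gold, pred)
for (g, p), n in errors.most_common():
    print("gold=%d predicted=%d count=%d" % (g, p, n))
```

To apply this to the notebook's data, you could pass `[s.y for s in test_seq]` and `[s.y for s in pred_test]`, assuming the predicted sequences expose their tags through `.y` in the same way as the gold sequences do.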

# Exercise 3.2 - Extended feature set

**Exercise 3.2 Repeat the previous exercise using the extended feature set. Compare the results.**

_Train the model again, this time using the extended feature set._

In [46]:
import lxmls.sequences.extended_feature as exfc
import lxmls.sequences.crf_online as crfo

In [47]:
# Build features
feature_mapper_ext = exfc.ExtendedFeatures(train_seq)
feature_mapper_ext.build_features()

In [48]:
print("The standard feature_mapper has %d  features" % len(feature_mapper.feature_dict))
print("The extended feature_mapper has %d  features" % len(feature_mapper_ext.feature_dict))

The standard feature_mapper has 2683  features
The extended feature_mapper has 7261  features


In [49]:
len(feature_mapper_ext.feature_list)

1000

In [50]:
# Train the model
# You will receive feedback when each epoch is finished.
# Note that running the 20 epochs might take a while.
crf_online = crfo.CRFOnline(corpus.word_dict, corpus.tag_dict, feature_mapper_ext)

In [51]:
crf_online.num_epochs = 20
crf_online.train_supervised(train_seq)

Epoch: 0 Objective value: -7.141596
Epoch: 1 Objective value: -1.807511
Epoch: 2 Objective value: -1.218877
Epoch: 3 Objective value: -0.955739
Epoch: 4 Objective value: -0.807821
Epoch: 5 Objective value: -0.712858
Epoch: 6 Objective value: -0.647382
Epoch: 7 Objective value: -0.599442
Epoch: 8 Objective value: -0.562584
Epoch: 9 Objective value: -0.533411
Epoch: 10 Objective value: -0.509885
Epoch: 11 Objective value: -0.490548
Epoch: 12 Objective value: -0.474318
Epoch: 13 Objective value: -0.460438
Epoch: 14 Objective value: -0.448389
Epoch: 15 Objective value: -0.437800
Epoch: 16 Objective value: -0.428402
Epoch: 17 Objective value: -0.419990
Epoch: 18 Objective value: -0.412406
Epoch: 19 Objective value: -0.405524


#### The previous cell should give the following results

    Epoch: 0 Objective value: -7.141596
    Epoch: 1 Objective value: -1.807511
    Epoch: 2 Objective value: -1.218877
    Epoch: 3 Objective value: -0.955739
    Epoch: 4 Objective value: -0.807821
    Epoch: 5 Objective value: -0.712858
    Epoch: 6 Objective value: -0.647382
    Epoch: 7 Objective value: -0.599442
    Epoch: 8 Objective value: -0.562584
    Epoch: 9 Objective value: -0.533411
    Epoch: 10 Objective value: -0.509885
    Epoch: 11 Objective value: -0.490548
    Epoch: 12 Objective value: -0.474318
    Epoch: 13 Objective value: -0.460438
    Epoch: 14 Objective value: -0.448389
    Epoch: 15 Objective value: -0.437800
    Epoch: 16 Objective value: -0.428402
    Epoch: 17 Objective value: -0.419990
    Epoch: 18 Objective value: -0.412406
    Epoch: 19 Objective value: -0.405524
            


And compute its accuracy.

In [52]:
# Make predictions for the various sequences using the trained model.
pred_train = crf_online.viterbi_decode_corpus(train_seq)
pred_dev = crf_online.viterbi_decode_corpus(dev_seq)
pred_test = crf_online.viterbi_decode_corpus(test_seq)

# Evaluate and print accuracies
eval_train = crf_online.evaluate_corpus(train_seq, pred_train)
eval_dev = crf_online.evaluate_corpus(dev_seq, pred_dev)
eval_test = crf_online.evaluate_corpus(test_seq, pred_test)
print("CRF_ext -  Accuracy Train: %.3f Dev: %.3f Test: %.3f" % (eval_train, eval_dev, eval_test))

CRF_ext -  Accuracy Train: 0.984 Dev: 0.899 Test: 0.894


#### The output of the previous cell should be similar to this:

    CRF_ext -  Accuracy Train: 0.984 Dev: 0.899 Test: 0.894

Compare the errors obtained with the two different feature sets. 

- Do some error analysis: which errors were corrected by using more features?

- Can you think of other features to use to solve the errors you found?





**The main lesson from this exercise is that, if you are not satisfied with the accuracy of your algorithm, you can perform some error analysis to find out which errors it is making. You can then add features that target those specific errors; this is known as feature engineering.**
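To make the idea concrete, here is a minimal sketch of the kind of orthographic features one might add after noticing, for instance, that capitalized words or words with certain suffixes are often mis-tagged. The function `word_shape_features` and the feature names are hypothetical; this is not how `ExtendedFeatures` is implemented.

```python
def word_shape_features(word):
    """Return the names of illustrative orthographic features firing for `word`."""
    feats = []
    if word[:1].isupper():
        feats.append("capitalized")
    if any(ch.isdigit() for ch in word):
        feats.append("has_digit")
    if "-" in word:
        feats.append("has_hyphen")
    # Short suffixes (up to three characters) often hint at the part of speech.
    for k in range(1, 4):
        if len(word) > k:
            feats.append("suffix:" + word[-k:])
    return feats


print(word_shape_features("Running"))
print(word_shape_features("mid-1990s"))
```

In a real feature mapper, each of these names would be combined with a candidate tag and mapped to an index in the weight vector.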



#### End Ex 3.2   ----------------------------------------------------------------------------





### About adding features 

Adding engineered features can lead to two problems:
* More features make training and decoding more expensive. For example, if you add features that depend on both the current word and the previous word, the number of possible new features grows with the square of the vocabulary size. The Penn Treebank has around 40000 different words, which gives up to 1.6 billion possible word pairs, although many of these pairs never occur in practice. Features that depend on three words (previous, current, and next) are even more numerous.

* If features are very specific, such as the (previous word, current word, next word) feature just mentioned, they may occur very rarely in the training set, which leads to overfitting. Some of these problems (though not all) can be mitigated with techniques such as smoothing, which you have already learned about.
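Both problems can be illustrated with a short calculation. The sketch below first computes the number of possible word pairs for a 40000-word vocabulary, then counts how many pairs actually occur, and how many occur only once, in a tiny made-up corpus. The corpus and variable names are assumptions chosen for illustration.

```python
from collections import Counter

vocab_size = 40000  # approximate Penn Treebank vocabulary size from the text
print("Possible word pairs: %d" % (vocab_size ** 2))

# Toy corpus, made up for illustration only.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

# Count every (previous word, current word) pair that occurs.
pair_counts = Counter()
for sent in sentences:
    for prev, cur in zip(sent, sent[1:]):
        pair_counts[(prev, cur)] += 1

vocab = set(w for sent in sentences for w in sent)
possible = len(vocab) ** 2
observed = len(pair_counts)
singletons = sum(1 for c in pair_counts.values() if c == 1)

print("Toy corpus: %d possible pairs, %d observed, %d seen only once"
      % (possible, observed, singletons))
```

Even in this toy corpus, most observed pairs appear a single time, so any weight learned for them rests on one example; this is the sparsity that makes very specific features prone to overfitting.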