In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:

import scipy
import numpy as np

import os,sys,inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir) 
import skseq

In [3]:
from skseq.sequences import sequence
from skseq.sequences.sequence import Sequence

## Setting a Sequence Object

- Sequence objects are defined in ``skseq/sequences/sequence.py``. 
    - A sequence in a supervised learning problem consists of a set of words and the tags associated with those words.
    - For example ``w_1/t_1 w_2/t_2 w_3/t_3`` is a sequence of length 3 with words ``w_i`` and tags ``t_i``.


In order to instantiate a Sequence we essentially need a list of words and a list of tags of the same size. For efficiency, we will not store strings for the words and tags; we will store integer values that represent them.

- **.x** attribute: list of words (integer words)

- **.y** attribute: list of tags (integer tags)

Then we need to keep a mapping from integers to words and from integers to tags.
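
The integer encoding can be sketched with plain dictionaries. This is only a minimal illustration, not the `skseq` implementation: `build_vocab` and `invert` are hypothetical helper names introduced here.

```python
def build_vocab(tokens):
    """Give each distinct token the next free integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def invert(vocab):
    """Build the reverse mapping, from integer id back to token."""
    return {idx: tok for tok, idx in vocab.items()}


words = ["my", "sequence", "is", "cool", "is"]
word_to_id = build_vocab(words)   # hypothetical stand-in for a word dictionary
id_to_word = invert(word_to_id)

encoded = [word_to_id[w] for w in words]
decoded = [id_to_word[i] for i in encoded]
print(encoded)
print(decoded)
```

Storing `encoded` instead of `words` keeps the sequences compact, and the inverse dictionary recovers the strings whenever they need to be printed.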



In [4]:
seq = Sequence(x=["my","sequence","is","cool"], y=[0,2,1,1])

In [5]:
seq

my/0 sequence/2 is/1 cool/1 

In [6]:
seq.x

['my', 'sequence', 'is', 'cool']

In [7]:
seq.y

[0, 2, 1, 1]

# Building a vocabulary and a SequenceList

Given a training set with words and tags we want to build a SequenceList object defined in ``skseq/sequences/sequence_list.py``.

A SequenceList is a class that is initialized using
- a dictionary for the words
- a dictionary for the tags
- an empty sequence list where the Sequences read from the data will be stored.


    class SequenceList(object):

        def __init__(self, x_dict, y_dict):
            self.x_dict = x_dict
            self.y_dict = y_dict
            self.seq_list = []


Let us create three sequence lists: train, test and validation.

We will use the CoNLL dataset and the class ``PostagCorpus``.
The class has a method ``.read_sequence_list_conll`` that returns the **SequenceList** object we want.



    def read_sequence_list_conll(self, train_file,
                                 mapping_file=("%s/en-ptb.map"
                                               % dirname(__file__)),
                                 max_sent_len=100000,
                                 max_nr_sent=100000):

        # Build mapping of postags:
        mapping = {}
        if mapping_file is not None:
            for line in open(mapping_file):
                coarse, fine = line.strip().split("\t")
                mapping[coarse.lower()] = fine.lower()

        instance_list = self.read_conll_instances(train_file,
                                                  max_sent_len,
                                                  max_nr_sent,
                                                  mapping)

        seq_list = SequenceList(self.word_dict, self.tag_dict)

        for sent_x, sent_y in instance_list:
            seq_list.add_sequence(sent_x, sent_y,  self.word_dict, self.tag_dict)

        return seq_list
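
To make the reading step more concrete, here is a hedged sketch of how tab-separated CoNLL lines could be turned into (word, tag) pairs. It is not the code of `read_conll_instances`: the function name `read_conll_sentences`, the column positions (word in column 1, fine-grained tag in column 4) and the tiny `toy_map` are assumptions made for illustration.

```python
def read_conll_sentences(text, tag_map, word_col=1, tag_col=4):
    """Split CoNLL text into sentences of (word, mapped_tag) pairs.

    Sentences are separated by blank lines and columns by tabs.
    The column positions are assumptions, not taken from skseq.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        fine_tag = cols[tag_col].lower()
        current.append((cols[word_col], tag_map[fine_tag]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical three-entry mapping from fine-grained to coarse tags.
toy_map = {"nnp": "noun", "vbz": "verb", ".": "."}
sample = (
    "1\tMs.\t_\tNN\tNNP\t_\t2\tNMOD\t_\t_\n"
    "2\tHaag\t_\tNN\tNNP\t_\t3\tSUB\t_\t_\n"
    "3\tplays\t_\tVB\tVBZ\t_\t0\tROOT\t_\t_\n"
    "4\tElianti\t_\tNN\tNNP\t_\t3\tOBJ\t_\t_\n"
    "5\t.\t_\t.\t.\t_\t3\tP\t_\t_\n"
    "\n"
)
sentences = read_conll_sentences(sample, toy_map)
print(sentences[0])
```

Each resulting sentence could then be split into a word list and a tag list, which is the shape `add_sequence` receives in the method above.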

In [8]:
import skseq.readers.pos_corpus
corpus = skseq.readers.pos_corpus.PostagCorpus()

In [9]:
ls

01_string_basics.ipynb               04_Dense_word_representations.ipynb
02_text_features.ipynb               05_word2vec_in_documents.ipynb
03_hmm.ipynb                         06_structured_perceptron.ipynb


In [10]:
data_path = "/data/conll"

data_path = parentdir + data_path

train_seq = corpus.read_sequence_list_conll(data_path + "/train-02-21.conll", 
                                            max_sent_len=100, max_nr_sent=5000)

test_seq = corpus.read_sequence_list_conll(data_path + "/test-23.conll",
                                           max_sent_len=100, max_nr_sent=1000)

dev_seq = corpus.read_sequence_list_conll(data_path + "/dev-22.conll", 
                                          max_sent_len=100, max_nr_sent=1000)

In [11]:
 corpus.tag_dict

{'adp': 0,
 'det': 1,
 'noun': 2,
 'num': 3,
 '.': 4,
 'prt': 5,
 'verb': 6,
 'conj': 7,
 'adv': 8,
 'pron': 9,
 'adj': 10,
 'x': 11}

In [12]:
print(type(train_seq))
print(type(train_seq[1]))

<class 'skseq.sequences.sequence_list.SequenceList'>
<class 'skseq.sequences.sequence.Sequence'>


In [13]:
len(train_seq)

5000

In [14]:
# first sentence
train_seq[0].__dict__.keys()

dict_keys(['x', 'y'])

In [15]:
train_seq[1]

42/2 40/2 43/6 44/2 41/4 

In [16]:
train_seq.__dict__.keys()

dict_keys(['x_dict', 'y_dict', 'seq_list'])

In [17]:
# Set of possible tags Lambda
# train_seq.y_dict is a dictionary of mappings from tag to integer id 
train_seq.y_dict

{'adp': 0,
 'det': 1,
 'noun': 2,
 'num': 3,
 '.': 4,
 'prt': 5,
 'verb': 6,
 'conj': 7,
 'adv': 8,
 'pron': 9,
 'adj': 10,
 'x': 11}

In [18]:
# number of possible words
len(train_seq.x_dict)

16937

In [19]:
train_seq[1].x

[42, 40, 43, 44, 41]

In [20]:
train_seq[1].y

[2, 2, 6, 2, 4]

In [21]:
# Mapping from word to integer
c = 0
print("First 5 word:id pairs in the dictionary\n")
for i in train_seq.x_dict:
    print(i,":", train_seq.x_dict[i])
    c += 1
    if c >= 5:
        break

First 5 word:id pairs in the dictionary

In : 0
an : 1
Oct. : 2
19 : 3
review : 4


### Using our corpus ``sequencelist`` to map integers to words

Sequences can use ``SequenceList`` objects to map word ids and tag ids to words and tags.

All ``Sequence`` objects have the **``.to_words``** method, which prints the words and tags given a **``SequenceList``** object.

In [22]:
train_seq.__dict__.keys()

dict_keys(['x_dict', 'y_dict', 'seq_list'])

In [23]:
sequence = train_seq[1]

In [24]:
sequence.to_words(sequence_list=train_seq)

'Ms./noun Haag/noun plays/verb Elianti/noun ./. '

## Introduction to (linear) discriminative sequence models

Discriminative sequence models aim to solve the following:

$$\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ P(Y=y\,|\,X=x)=\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ \boldsymbol{w}\cdot\boldsymbol{f}(x, y)$$

where $\boldsymbol{w}$ is the model's weight vector, and $\boldsymbol{f}(x, y)$ is a feature vector. Notice that now both $y$ and $x$ are $N$-dimensional vectors, whereas in Day 1, these variables were just scalar numbers.

In Day 2, sequences were scored using the log-probability. In today's models we are still scoring the sequences; the only difference is that the scores are now computed as the dot product of the weight vector with the feature vector:


| score | Hidden Markov Models| Discriminative Models  |
| ------------------------------- | ---------------- | ---------------- |
| $\textrm{score}_\textrm{emiss}$ | $\log P(x_i\,|\,y_i) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}(i, x, y_i)$ |
| $\textrm{score}_\textrm{init}$ | $\log P(y_1\,|\,\mathrm{start}) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init}(x, y_1)$ |
| $\textrm{score}_\textrm{trans}$ | $\log P(y_{i+1}\,|\,y_i) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}(i, x, y_i, y_{i+1})$ |
| $\textrm{score}_\textrm{final}$ | $\log P(\mathrm{stop}\,|\,y_N) $ | $\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}(x, y_N)$ |

Notice that the scores computed using the feature vector depend on at most two consecutive values of the output variable, $y$, but may depend on the whole observed input, $x$. We can now rewrite the above expression as

$$
\begin{aligned}
&\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ 
\sum_{i=1}^N \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}(i, x, y_i) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init}(x, y_1) + 
\sum_{i=1}^{N-1}\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}(i, x, y_i, y_{i+1}) + 
\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}(x, y_N)
\\
=\ &\underset{y\,\in\,\Lambda^N}{\textrm{arg max}}\ 
\sum_{i=1}^N \textrm{score}_\textrm{emiss}(i, x, y_i) + 
\textrm{score}_\textrm{init}(x, y_1) + 
\sum_{i=1}^{N-1}\textrm{score}_\textrm{trans}(i, x, y_i, y_{i+1}) +
\textrm{score}_\textrm{final}(x, y_N)
\end{aligned}
$$

Notice that the feature vectors depend only locally on the output variable. The features depend on

- a single $y_i$ in the case of emission scores, initial scores and final scores,
- or a pair $y_i, y_{i+1}$ in the case of transition scores.
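
The decomposition into local scores can be sketched in a few lines. The sketch assumes binary features stored as lists of active feature indices, so that $\boldsymbol{w}\cdot\boldsymbol{f}$ reduces to summing the weights at those indices. The helper names `dot` and `sequence_score` and the toy weights are made up for this example.

```python
def dot(weights, active_features):
    """w . f for a binary feature vector given as a list of active indices."""
    return sum(weights[k] for k in active_features)


def sequence_score(weights, init_feats, emiss_feats, trans_feats, final_feats):
    """Sum the four kinds of local scores over a whole tagged sequence.

    emiss_feats holds one list of active indices per position (N lists);
    trans_feats holds one list per pair of consecutive positions (N - 1 lists).
    """
    score = dot(weights, init_feats) + dot(weights, final_feats)
    score += sum(dot(weights, feats) for feats in emiss_feats)
    score += sum(dot(weights, feats) for feats in trans_feats)
    return score


# Toy weight vector with six features; the values are invented.
w = [0.5, 1.0, -2.0, 3.0, 0.0, 1.5]
total = sequence_score(
    w,
    init_feats=[0],
    emiss_feats=[[1], [3], [1]],
    trans_feats=[[2], [5]],
    final_feats=[4],
)
print(total)  # 0.5 + (1.0 + 3.0 + 1.0) + (-2.0 + 1.5) + 0.0 = 5.0
```

Because every term looks at one position or one pair of consecutive positions, the maximization over all tag sequences can be carried out with dynamic programming rather than by enumerating $\Lambda^N$.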

### Features

Today we will use two types of simple features. 

- **Features that mimic the features used by the HMM**
    - This will allow us to directly compare the performance of a generative vs. a discriminative approach.
    

- **Features that are implicit in the HMM**, which are simple indicators of the initial, transition, final and emission events.
    - Given a position $i$, a word $w_j$ and states $c_k$, $c_l$, the set of features that mimic the HMM are:


| Conditions to be met       |    Name             |
| ----------------           | ----------------    |
| $y_i=c_k  \,\, \& \,\, i=1$        | Initial features    |
| $y_i=c_k   \,\, \& \,\, y_{i-1}=c_l$  | Transition features |
| $y_i=c_k \,\, \& \,\,  i=N$        | Final features      |
| $x_i=w_j \,\, \& \,\,  y_i=c_k$    | Emission features   |

When we used a generative model we were forced to make some independence assumptions. However, since we are now using a discriminative approach, where we model $P(Y\,|\,X)$ rather than $P(X,Y)$, we are no longer tied to some of these assumptions. In particular:

- We may use “overlapping” features, e.g., features that fire simultaneously for many instances. For example, we can use a feature for a word, such as a feature which fires for the word “brilliantly”, and another for prefixes and suffixes of that word, such as one which fires if the last two letters of the word are “ly”. This would lead to an awkward model if we wanted to insist on a generative approach.


- We may use features that depend arbitrarily on the entire input sequence $x$. On the other hand, we still need to resort to “local” features with respect to the outputs (e.g. looking only at consecutive state pairs), otherwise decoding algorithms will become more expensive.


#### Typical features used for POS tagging with discriminative models

The following table shows some typical POS tagging features. Let us consider $P_{set}$ and $S_{set}$ to be two sets of prefixes and suffixes respectively (set by the user), and let $l$ denote the length of a prefix or suffix.


| Conditions to be met for some of the most typical POS features     |    Name      |
| ----------------                                | ----------------    |
| $y_i=c_k , \,\,  i=1$                      | Initial features    |
| $y_i=c_k ,\,\,  y_{i-1}=c_l$                | Transition features |
| $y_i=c_k ,\,\, i=N$                     | Final features      |
| $x_i=w_j ,\,\,  y_i=c_k$                 | Basic emission features|
| $x_i=w_j ,\,\,  w_j \text{ is uppercased } ,\,\,  y_i=c_k$                 | Upper case features|
| $x_i=w_j ,\,\,  w_j \text{ contains digit} ,\,\,  y_i=c_k$                 | Digit features|
| $x_i=w_j ,\,\,  w_j \text{ contains hyphen} ,\,\,  y_i=c_k$                 | Hyphen features|
| $x_i=w_j ,\,\,  w_j[0:l] \in P_{set} \text{ for some } l \in \{1,2,3\}  ,\,\,  y_i=c_k$                 | Prefix features|
| $x_i=w_j ,\,\,  w_j[-l:] \in S_{set} \text{ for some } l \in \{1,2,3\}  ,\,\,  y_i=c_k$                 | Suffix features|
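
The conditions in the table can be turned into feature-name templates. The sketch below is one possible reading, not the `skseq` code: the function `emission_feature_names` is hypothetical, "uppercased" is interpreted as "starts with a capital letter", and the `digit::` name is an assumption modelled on the `id:` and `hyphen::` naming that appears elsewhere in this notebook.

```python
def emission_feature_names(word, tag, prefixes=(), suffixes=(), max_len=3):
    """Return the names of the emission-type features that fire for (word, tag)."""
    names = ["id:%s::%s" % (word, tag)]          # basic emission feature
    if word[:1].isupper():                        # assumed meaning of "uppercased"
        names.append("uppercased::%s" % tag)
    if any(ch.isdigit() for ch in word):
        names.append("digit::%s" % tag)
    if "-" in word:
        names.append("hyphen::%s" % tag)
    # Prefixes and suffixes of length 1..max_len that belong to the user sets.
    for n in range(1, max_len + 1):
        if len(word) > n:
            if word[:n] in prefixes:
                names.append("prefix:%s::%s" % (word[:n], tag))
            if word[-n:] in suffixes:
                names.append("suffix:%s::%s" % (word[-n:], tag))
    return names


names_a = emission_feature_names("Rolls-Royce", "noun", prefixes={"Ro"}, suffixes={"ce"})
names_b = emission_feature_names("1,200", "num")
print(names_a)
print(names_b)
```

Several of these features fire for the same word at once, which is exactly the kind of overlap that a discriminative model tolerates.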








We can have more complex features that look at arbitrary parts of the input sequence. We are not going to use them in this exercise, purely for performance reasons (to have fewer features and smaller caches). State-of-the-art sequence classifiers can easily reach over one million features!

Our features are divided into two groups:

- **node features**: $f_{\text{emiss}}, f_{\text{init}}, f_{\text{final}}$. Node features depend only on a single position in the state sequence (or node in the trellis).
    
    
- **edge features**: $f_{\text{trans}}$. Edge features depend on two consecutive positions in the state sequence (an edge in the trellis)


    
| Score definitions: scalar product between features and weights |
| ------------------------------- |
| $\textrm{score}_\textrm{emiss}\,(i,x,y_i) =\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{emiss}\,(i, x, y_i)$ |
| $\textrm{score}_\textrm{init}\,(x,y_1)=\boldsymbol{w}\cdot\boldsymbol{f}_\textrm{init} \,(x, y_1)$ |
| $\textrm{score}_\textrm{trans}\,(i,x,y_i,y_{i+1}) = \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{trans}\,(i, x, y_i, y_{i+1})$ |
| $\textrm{score}_\textrm{final}\,(x,y_N) = \boldsymbol{w}\cdot\boldsymbol{f}_\textrm{final}\,(x, y_N)$ |





### Decoding

One important thing to notice is that the decoding process (the process by which we pick the most likely label sequence $y$ for the observation sequence $x$) stays the same as in a standard HMM. This means **we do not need to develop new decoders, as long as we have existing decoders for HMMs**; we only need new functions to compute the scores. Because of this, we will keep using the Viterbi and forward-backward algorithms.


For decoding,   there are three important problems that need to be solved:

1. Given $X=x$, compute the most likely output sequence $\hat{y}$ (the one which maximizes $P_{w}(Y=y|X=x)$). 
2. Compute the posterior marginals $P_{w}(Y_i=y_i|X=x)$ at each position $i$.
3. Evaluate the partition function $Z(w,x)$. 

Interestingly, all these problems can be solved by using the very same algorithms that were already implemented for HMMs: the Viterbi algorithm (for 1) and the forward-backward algorithm (for 2 and 3). All that changes is the way the scores are computed. 
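
As a reminder of problem 1, here is a compact Viterbi sketch that works directly on additive scores. It is a generic illustration, not the decoder shipped with `skseq`; the function name `viterbi`, the argument layout and the toy numbers are assumptions made for this example.

```python
def viterbi(init_scores, trans_scores, final_scores, emiss_scores):
    """Return (best_path, best_score) for additive local scores.

    init_scores[s]         score of starting in state s
    trans_scores[i][p][s]  score of going from state p at position i to state s at i + 1
    final_scores[s]        score of stopping in state s
    emiss_scores[i][s]     score of state s at position i
    """
    n = len(emiss_scores)
    num_states = len(init_scores)

    # best[i][s]: score of the best partial path that ends in state s at position i.
    best = [[0.0] * num_states for _ in range(n)]
    back = [[0] * num_states for _ in range(n)]

    for s in range(num_states):
        best[0][s] = init_scores[s] + emiss_scores[0][s]

    for i in range(1, n):
        for s in range(num_states):
            candidates = [best[i - 1][p] + trans_scores[i - 1][p][s]
                          for p in range(num_states)]
            prev = max(range(num_states), key=candidates.__getitem__)
            back[i][s] = prev
            best[i][s] = candidates[prev] + emiss_scores[i][s]

    last_scores = [best[n - 1][s] + final_scores[s] for s in range(num_states)]
    last = max(range(num_states), key=last_scores.__getitem__)

    # Follow the back-pointers from the last position to the first.
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, last_scores[last]


T = [[0.0, 1.0], [1.0, 0.0]]  # toy transition scores, reused at every position
path, score = viterbi(
    init_scores=[1.0, 0.0],
    trans_scores=[T, T],
    final_scores=[0.0, 0.0],
    emiss_scores=[[0.0, 0.0], [0.0, 2.0], [1.0, 0.0]],
)
print(path, score)
```

In this sketch ties are broken in favour of the lowest state index, so a model whose scores are all zero would label every position with state 0.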



### Training the classifier

Today we will cover one approach to training sequential discriminative models. Given a training set with $M$ observation-label pairs, $\{(x_m, y_m)\}_{m=1}^M$ (note that each $x_m$ and $y_m$ is an $N$-dimensional vector, where $N$ is the length of that sequence, and $m$ indexes the training examples):

* **Structured Perceptron** iteratively updates $w$ in order to correctly classify the training set.
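
The text does not spell out the update, so the following is a hedged sketch of the standard structured perceptron step: after decoding a training sequence, the weights of the features active in the gold tagging are increased and those active in the predicted tagging are decreased. The function `perceptron_update` is hypothetical, and the sketch leaves out refinements such as weight averaging.

```python
from collections import Counter


def perceptron_update(weights, gold_features, predicted_features, learning_rate=1.0):
    """Apply one structured perceptron update in place.

    gold_features and predicted_features are flat lists of active feature ids
    for the gold and the predicted tag sequence of the same input.
    """
    delta = Counter(gold_features)
    delta.subtract(Counter(predicted_features))
    # Features shared by both taggings cancel out (their difference is zero).
    for feat_id, diff in delta.items():
        weights[feat_id] += learning_rate * diff
    return weights


w = [0.0] * 6
gold = [0, 2, 3, 5]   # toy feature ids fired by the gold tags
pred = [0, 1, 3, 4]   # toy feature ids fired by the predicted tags
perceptron_update(w, gold, pred)
print(w)
```

When the prediction equals the gold sequence the two feature lists coincide and the weights stay unchanged, so only mistakes move the model.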

# Training a POS tagger in English

## Loading the data: the conll dataset for part of speech tagging



#### Example of two sequences in the dataset

    1	Ms.	_	NN	NNP	_	2	NMOD	_	_
    2	Haag	_	NN	NNP	_	3	SUB	_	_
    3	plays	_	VB	VBZ	_	0	ROOT	_	_
    4	Elianti	_	NN	NNP	_	3	OBJ	_	_
    5	.	_	.	.	_	3	P	_	_

    1	Rolls-Royce	_	NN	NNP	_	4	NMOD	_	_
    2	Motor	_	NN	NNP	_	4	NMOD	_	_
    3	Cars	_	NN	NNPS	_	4	NMOD	_	_
    4	Inc.	_	NN	NNP	_	5	SUB	_	_
    5	said	_	VB	VBD	_	0	ROOT	_	_
    6	it	_	PR	PRP	_	7	SUB	_	_
    7	expects	_	VB	VBZ	_	5	VMOD	_	_
    8	its	_	.	PRP$	_	10	NMOD	_	_
    9	U.S.	_	NN	NNP	_	10	NMOD	_	_
    10	sales	_	NN	NNS	_	12	SUB	_	_
    11	to	_	TO	TO	_	12	VMOD	_	_
    12	remain	_	VB	VB	_	7	VMOD	_	_
    13	steady	_	JJ	JJ	_	12	PRD	_	_
    14	at	_	IN	IN	_	12	VMOD	_	_
    15	about	_	IN	IN	_	17	NMOD	_	_
    16	1,200	_	CD	CD	_	15	AMOD	_	_
    17	cars	_	NN	NNS	_	14	PMOD	_	_
    18	in	_	IN	IN	_	12	VMOD	_	_
    19	1990	_	CD	CD	_	18	PMOD	_	_
    20	.	_	.	.	_	5	P	_	_
    

#### corpus.read_sequence_list_conll

This method reads the data. Each sentence in the dataset becomes a **Sequence**, which is appended to the **SequenceList** (``skseq.sequences.sequence_list.SequenceList``).



In [25]:
# The three SequenceLists share the corpus dictionaries, so the sizes are identical.
# This is completely unrealistic: words in train and test will usually be different.
len(train_seq.x_dict),len(test_seq.x_dict), len(dev_seq.x_dict)

(16937, 16937, 16937)

In [26]:
len(train_seq.y_dict),len(test_seq.y_dict), len(dev_seq.y_dict)

(12, 12, 12)

# Feature Mapper (or feature template)

Given a dataset, first we will build a 

-  **``SequenceList``** with the dataset.

In order to build the features from the instantiated **``SequenceList``**:

- An instance of **``IDFeatures``** (we will call it feature_mapper) must be instantiated
- **``feature_mapper.build_features()``** must be executed



## Generating features from data



### Instantiating a feature_mapper
We will assume feature_mapper has been instantiated with

    feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)


**The IDFeatures object will be referred to as a  ```feature_mapper```.**


#### About feature_mappers
A ```feature_mapper``` contains the following attributes:

- the dataset in ```.dataset```
    - if we instantiate the feature mapper with a dataset X then ```feature_mapper.dataset``` will be a copy of X


- a dictionary of features in ```.feature_dict```
    - this dictionary defaults to ```{}```. 
    - In order to build the features, the ```.build_features()``` method of the feature mapper must be called.
    
    
- a list of features in ```.feature_list```
    - this list defaults to ```[]```. 
    - In order to build the list of features, the ```.build_features()``` method of the feature mapper must be called.

A ```feature_mapper``` also provides

- a method to generate features, ```.build_features```
    - this method creates features using the ```.dataset```.
    - this method also fills ```.feature_dict``` and ```.feature_list```
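
To illustrate what building the features could look like, here is a toy sketch that fills a feature dictionary and a feature list from tagged sentences. It is not the `IDFeatures` implementation: `build_toy_features` is a hypothetical function, and the feature-name patterns are copied from the names shown later in this notebook.

```python
def build_toy_features(tagged_sentences):
    """Assign an integer id to every feature name seen in the data.

    Returns (feature_dict, feature_list); feature_list holds, per sentence,
    [initial, transition, final, emission] lists of id lists.
    """
    feature_dict = {}

    def fid(name):
        # Reuse the id of a known feature, otherwise assign the next free one.
        return feature_dict.setdefault(name, len(feature_dict))

    feature_list = []
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        initial = [[fid("init_tag:%s" % tags[0])]]
        emission, transition = [], []
        for pos, (word, tag) in enumerate(sentence):
            emission.append([fid("id:%s::%s" % (word, tag))])
            if pos > 0:
                transition.append([fid("prev_tag:%s::%s" % (tags[pos - 1], tag))])
        final = [[fid("final_prev_tag:%s" % tags[-1])]]
        feature_list.append([initial, transition, final, emission])
    return feature_dict, feature_list


data = [[("Ms.", "noun"), ("Haag", "noun"), ("plays", "verb"),
         ("Elianti", "noun"), (".", ".")]]
toy_feature_dict, toy_feature_list = build_toy_features(data)
print(len(toy_feature_dict))
print(toy_feature_list[0])
```

A sentence of five words yields one initial feature, four transition features, one final feature and five emission features, which is the nested shape of the entries in `feature_list`.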




In [27]:
feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)
#from skseq.sequences import extended_feature
#feature_mapper = skseq.sequences.extended_feature.ExtendedFeatures(train_seq)

In [28]:
feature_mapper.feature_dict

{}

In [29]:
feature_mapper.feature_list

[]

### Calling ```feature_mapper.build_features()```



In [30]:
# get features
feature_mapper.build_features()

In [31]:
import pprint
pprint.pprint(list(feature_mapper.__dict__.keys()))

['feature_dict',
 'feature_list',
 'add_features',
 'dataset',
 'node_feature_cache',
 'initial_state_feature_cache',
 'final_state_feature_cache',
 'edge_feature_cache']


In [32]:
len(feature_mapper.feature_dict)

15377

In [33]:
len(feature_mapper.feature_list)

5000

In [34]:
list(feature_mapper.feature_dict)[0:5]

['init_tag:adp',
 'id:In::adp',
 'id:an::det',
 'prev_tag:adp::det',
 'id:Oct.::noun']

In [35]:
list(feature_mapper.feature_dict)[0:10]

['init_tag:adp',
 'id:In::adp',
 'id:an::det',
 'prev_tag:adp::det',
 'id:Oct.::noun',
 'prev_tag:det::noun',
 'id:19::num',
 'prev_tag:noun::num',
 'id:review::noun',
 'prev_tag:num::noun']

In [36]:
set([x.split(":")[0] for x in feature_mapper.feature_dict.keys()])

{'final_prev_tag', 'id', 'init_tag', 'prev_tag'}

### Retrieving features from a `feature_mapper`

We can get the features of a given sequence using **``feature_mapper.get_sequence_features``**. The features of the training sequences are also stored in ``feature_mapper.feature_list``:

In [37]:
feature_mapper.feature_list[1]

[[[69]], [[28], [36], [34], [18]], [[68]], [[70], [66], [71], [72], [67]]]

## Inspecting a feature mapper

In [38]:
len(feature_mapper.feature_dict),len(feature_mapper.feature_list)

(15377, 5000)

In [39]:
# for any sequence m, this will have length 4, corresponding to 
# initial, transition, final, and emission features
m = 1
len(feature_mapper.feature_list[m])

4

In [40]:
id_seq = 1

In [41]:
feature_mapper.dataset[id_seq]

42/2 40/2 43/6 44/2 41/4 

In [42]:
# length of the sequence
len(feature_mapper.dataset[id_seq])

5

In [43]:
# number of feature types
len(feature_mapper.feature_list[id_seq])

4


## Understanding features in a feature mapper


### What are all those numbers?

Notice that ```feature_mapper.feature_list[id_seq]``` is a list containing 4 lists.

This length is the same regardless of ```id_seq```.

### Codification of the features

All features are saved in **``feature_mapper.feature_dict``**.

- **If it is our feature vector, why is it not a vector? Good point!**
    - In order to make the algorithm fast, the code is written using dicts: accessing only a few positions of a dict and computing subtractions on them is much faster than subtracting two huge weight vectors.

Some features are identified by their prefix: **init_tag:**, **prev_tag:**,  **final_prev_tag:**, **id:**

- **init_tag:** when they are Initial features
    - Example: **``init_tag:noun``** is an initial feature that describes that the first word is a noun
    
    
- **prev_tag:** when they are transition features
    - Example: **``prev_tag:noun::noun``** is a transition feature that describes that the previous word was
      a noun and the current word is a noun.
    - Example: **``prev_tag:noun::.``** is a transition feature that describes that the previous word was
      a noun and the current word is a `.` (this is usually found as the last transition feature, since most sentences end with a period)
      


- **final_prev_tag:** when they are final features
    - Example: **``final_prev_tag:.``** is a final feature stating that the last "word" in the sentence was a dot.


- **id:** when they are emission features
    - Example: **``id:plays::verb``** is an emission feature, describing that the current word is plays and the current hidden state is a verb.
    - Example: **``id:Feb.::noun``** is an emission feature, describing that the current word is "Feb." and the current hidden state is a noun.
    
Other features are identified by starting with **uppercased**, **suffix**, **prefix**, etc.

- **uppercased:** when the current word is uppercased.
    - Example: **``uppercased::noun``** is a feature stating that the current word is uppercased and the current tag is a noun.

- **prefix:** when the current word starts with a certain prefix.
    - Example: **``prefix:Eli::noun``**

- **suffix:** when the current word ends with a certain suffix.
    - Example: **``suffix:ing::verb``**




In [44]:
print ("Initial features:",     feature_mapper.feature_list[id_seq][0])
print ("Transition features:",  feature_mapper.feature_list[id_seq][1])
print ("Final features:",       feature_mapper.feature_list[id_seq][2])
print ("Emission features:",    feature_mapper.feature_list[id_seq][3])

Initial features: [[69]]
Transition features: [[28], [36], [34], [18]]
Final features: [[68]]
Emission features: [[70], [66], [71], [72], [67]]


In [45]:
# feature_dict maps feature name -> feature id; invert it to map id -> name
inv_feature_dict = {feat_id: feat_name for feat_name, feat_id in feature_mapper.feature_dict.items()}

### Features

Features are coded as integers; using ``inv_feature_dict`` we can see their interpretation.

Each feature has been assigned a unique string that hints at what it means.

In [46]:
id_seq = 6

In [47]:
train_seq[id_seq].to_words(train_seq)

'The/det new/adj rate/noun will/verb be/verb payable/adj Feb./noun 15/num ./. '

In [48]:
feature_mapper.feature_list[id_seq]

[[[98]],
 [[144], [105], [36], [148], [89], [105], [7], [97]],
 [[68]],
 [[14], [143], [145], [146], [147], [149], [150], [151], [67]]]

In [49]:
feature_type = ["Initial features", "Transition features", "Final features", "Emission features"]

for feat,feat_ids in enumerate(feature_mapper.get_sequence_features(train_seq[id_seq])):
    print(feature_type[feat])
    for id_list in feat_ids:
        print ("\t",id_list)
        for k,id_val in enumerate(id_list):
            print ("\t\t", inv_feature_dict[id_val] )
    print("\n")

Initial features
	 [98]
		 init_tag:det


Transition features
	 [144]
		 prev_tag:det::adj
	 [105]
		 prev_tag:adj::noun
	 [36]
		 prev_tag:noun::verb
	 [148]
		 prev_tag:verb::verb
	 [89]
		 prev_tag:verb::adj
	 [105]
		 prev_tag:adj::noun
	 [7]
		 prev_tag:noun::num
	 [97]
		 prev_tag:num::.


Final features
	 [68]
		 final_prev_tag:.


Emission features
	 [14]
		 id:The::det
	 [143]
		 id:new::adj
	 [145]
		 id:rate::noun
	 [146]
		 id:will::verb
	 [147]
		 id:be::verb
	 [149]
		 id:payable::adj
	 [150]
		 id:Feb.::noun
	 [151]
		 id:15::num
	 [67]
		 id:.::.




### Given an input sequence, how to compute features that are activated

We have stored all features in  **``feature_mapper.feature_dict``**.

Now how can we know when a particular word and tag at a particular position fire any of the created features?

In [50]:
# Looking at some features and the position they have assigned
c = 0
print("First 5 features in the dictionary\n")
for i in feature_mapper.feature_dict:
    print(i, ":", feature_mapper.feature_dict[i])
    c += 1
    if c >= 5:
        break

First 5 features in the dictionary

init_tag:adp : 0
id:In::adp : 1
id:an::det : 2
prev_tag:adp::det : 3
id:Oct.::noun : 4


In [51]:
sequence

42/2 40/2 43/6 44/2 41/4 

### Activated Features

Inside the feature mapper there is the **``get_initial_features``** method:

    def get_initial_features(self, sequence, y):
        if y not in self.initial_state_feature_cache:
            edge_idx = []
            edge_idx = self.add_initial_features(sequence, y, edge_idx)
            self.initial_state_feature_cache[y] = edge_idx
        return self.initial_state_feature_cache[y]

In [52]:
sequence

42/2 40/2 43/6 44/2 41/4 

In [53]:
feature_mapper.get_initial_features(sequence, 0)

[0]

In [54]:
feature_mapper.get_transition_features(sequence, pos=1, y=1, y_prev=1)

[373]

In [55]:
feature_mapper.get_transition_features(sequence, pos=1, y=1, y_prev=2)

[141]

In [56]:
feature_mapper.get_transition_features(sequence, pos=1, y=1, y_prev=3)

[1695]

In [57]:
inv_feature_dict[1695]

'prev_tag:num::det'

# Training a Structured perceptron

In order to train a structured perceptron we need to construct a feature mapper that will translate Sequence objects to numerical features. Then the structured perceptron can be instantiated using

- The corpus dictionary of words
- The corpus dictionary of tags
- The feature mapper

In [58]:
feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)
feature_mapper.build_features()

#import skseq.sequences.extended_feature as exfc
#feature_mapper = exfc.ExtendedFeatures(train_seq)

In [59]:
corpus.sequence_list

[]

In [60]:
corpus.tag_dict

{'adp': 0,
 'det': 1,
 'noun': 2,
 'num': 3,
 '.': 4,
 'prt': 5,
 'verb': 6,
 'conj': 7,
 'adv': 8,
 'pron': 9,
 'adj': 10,
 'x': 11}

In [61]:
import skseq.sequences.structured_perceptron as spc

sp = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)
sp.num_epochs = 5

In [62]:
sp.state_labels

{'adp': 0,
 'det': 1,
 'noun': 2,
 'num': 3,
 '.': 4,
 'prt': 5,
 'verb': 6,
 'conj': 7,
 'adv': 8,
 'pron': 9,
 'adj': 10,
 'x': 11}

In [63]:
sp.get_num_states(), sp.get_num_observations()

(12, 16937)

In [64]:
feature_mapper.get_num_features()

15377

#### About the weights of the perceptron

The perceptron starts with all weights set to 0

In [65]:
sp.parameters

array([0., 0., 0., ..., 0., 0., 0.])

In [66]:
len(sp.parameters)

15377

In [67]:
sp.parameters.sum()

0.0

### Predictions made by the structured Perceptron

We can use the method **``.viterbi_decode``** from the structured perceptron to generate a sequence of predictions for a given sequence.

In [68]:
seq = train_seq[3]

In [69]:
seq

7/1 61/2 62/2 63/2 64/10 65/2 66/6 67/3 59/2 21/0 19/1 53/2 

In [70]:
sp.get_num_states()

12

In [71]:
sp.viterbi_decode(seq)

(7/0 61/0 62/0 63/0 64/0 65/0 66/0 67/0 59/0 21/0 19/0 53/0 , 0.0)

# Training a structured perceptron


The cells below rebuild the feature mapper and the structured perceptron from the same three ingredients as before (the corpus dictionary of words, the corpus dictionary of tags and the feature mapper), which gives a fresh, untrained model for the evaluation and training that follow.

In [72]:
feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)
feature_mapper.build_features()

In [73]:
sp = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)
sp.num_epochs = 5

In [74]:
sp.get_num_states(), sp.get_num_observations()

(12, 16937)

### Accuracy before training

We can use the method

- **``viterbi_decode_corpus``** to generate a list of sequences containing the predictions



In [75]:
def evaluate_corpus(sequences, sequences_predictions):
    """Evaluate classification accuracy at corpus level, comparing with
    gold standard."""
    total = 0.0
    correct = 0.0
    for i, sequence in enumerate(sequences):
        pred = sequences_predictions[i]
        for j, y_hat in enumerate(pred.y):
            if sequence.y[j] == y_hat:
                correct += 1
            total += 1
    return correct / total

In [76]:
# Make predictions for the various sequences using the untrained model (all weights are still zero).
pred_train = sp.viterbi_decode_corpus(train_seq)
pred_dev = sp.viterbi_decode_corpus(dev_seq)
pred_test = sp.viterbi_decode_corpus(test_seq)

In [77]:
# Evaluate and print accuracies
eval_train = evaluate_corpus(train_seq.seq_list, pred_train)
eval_dev = evaluate_corpus(dev_seq.seq_list, pred_dev)
eval_test = evaluate_corpus(test_seq.seq_list, pred_test)
print("SP -  Accuracy Train: %.3f Dev: %.3f Test: %.3f"%(eval_train,eval_dev, eval_test))

SP -  Accuracy Train: 0.103 Dev: 0.100 Test: 0.105


## About the structured perceptron


The structured perceptron has the method **fit**, which allows us to train the weights of the model.

- **fit**: receives a **SequenceList** object and the number of epochs

Let us recall that a **SequenceList** is a list of **Sequence** objects.

- Each **Sequence** has the **.x** and **.y** attributes, which are the words and tags of the **Sequence** respectively.

    - For example, one **Sequence** in our training data, train_seq[1], is 
        - Ms./noun Haag/noun plays/verb Elianti/noun ./. 

In [78]:
sp

<skseq.sequences.structured_perceptron.StructuredPerceptron at 0x1169dd400>

In [79]:
%%time
num_epochs = 5
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.822854
Epoch: 1 Accuracy: 0.904985
Epoch: 2 Accuracy: 0.925024
Epoch: 3 Accuracy: 0.937884
Epoch: 4 Accuracy: 0.943772
CPU times: user 1min 45s, sys: 366 ms, total: 1min 45s
Wall time: 1min 46s


## Saving model weights

In [80]:
len(sp.parameters)

15377

In [81]:
sp.parameters

array([ 0.8,  7.2, 11.4, ...,  0.6,  0. ,  0.2])

In [82]:
sp.save_model("perceptron_5_iter")

In [83]:
sp2 = spc.StructuredPerceptron(corpus.word_dict, corpus.tag_dict, feature_mapper)

In [84]:
sp2.parameters

array([0., 0., 0., ..., 0., 0., 0.])

In [85]:
sp2.load_model(dir="perceptron_5_iter")

In [86]:
sp2.parameters

array([ 0.8,  7.2, 11.4, ...,  0.6,  0. ,  0.2])

# Evaluating model quality

In [87]:
# Make predictions for the various sequences using the trained model.
pred_train = sp.viterbi_decode_corpus(train_seq)
pred_dev = sp.viterbi_decode_corpus(dev_seq)
pred_test = sp.viterbi_decode_corpus(test_seq)

In [88]:
# Evaluate and print accuracies
eval_train = evaluate_corpus(train_seq.seq_list, pred_train)
eval_dev = evaluate_corpus(dev_seq.seq_list, pred_dev)
eval_test = evaluate_corpus(test_seq.seq_list, pred_test)
print("SP -  Accuracy Train: %.3f Dev: %.3f Test: %.3f"%(eval_train,eval_dev, eval_test))

SP -  Accuracy Train: 0.927 Dev: 0.893 Test: 0.902


## Test the structured perceptron

In order to make a tag prediction for a given sequence we can use the **``.viterbi_decode``** method

### About  **``.viterbi_decode``**

- Computes the scores given the observation sequence
- Runs the Viterbi algorithm and thereby obtains
    - the predicted sequence of states **``best_states``**
    - the total score
- Creates a new sequence named **``predicted_sequence``**
    - copies the sequence of words from the input
    - assigns the tags from  **``best_states``** to the created sequence
    
Returns **``predicted_sequence``** as well as the **``total_score``**
    
      

In [89]:
sequence = train_seq[2]

In [90]:
sequence

45/2 46/2 47/2 48/2 49/6 50/9 51/6 52/9 53/2 54/2 38/5 55/6 56/10 10/0 57/0 58/3 59/2 21/0 60/3 41/4 

In [91]:
sequence.to_words(train_seq)

'Rolls-Royce/noun Motor/noun Cars/noun Inc./noun said/verb it/pron expects/verb its/pron U.S./noun sales/noun to/prt remain/verb steady/adj at/adp about/adp 1,200/num cars/noun in/adp 1990/num ./. '

In [92]:
aux = sp.viterbi_decode(sequence)
print(aux)

(45/2 46/2 47/2 48/2 49/6 50/9 51/6 52/9 53/2 54/2 38/5 55/6 56/10 10/0 57/8 58/3 59/2 21/0 60/3 41/4 , 199.20000000000002)


In [93]:
# sequence containing the original words and the tag prediction
aux[0].to_words(train_seq)

'Rolls-Royce/noun Motor/noun Cars/noun Inc./noun said/verb it/pron expects/verb its/pron U.S./noun sales/noun to/prt remain/verb steady/adj at/adp about/adv 1,200/num cars/noun in/adp 1990/num ./. '

## Predict a given unseen string sequence

Let us assume we have the phrase "David had been asked to write a challenging program for Maria ."

We have to:

- convert our string to a ``skseq.sequences.sequence.Sequence`` object.
- use the ``sp.viterbi_decode`` method with the previously built ``Sequence`` object



In [94]:
p = "David had been asked to write a challenging program for Maria ."

In [95]:
word_ids  = [train_seq.x_dict[w] for w in p.split()]
word_ids

[1613, 271, 106, 2552, 38, 394, 92, 9404, 4140, 78, 4036, 41]

In [96]:
seq = skseq.sequences.sequence.Sequence(x=word_ids, y=[0 for w in word_ids])
seq

1613/0 271/0 106/0 2552/0 38/0 394/0 92/0 9404/0 4140/0 78/0 4036/0 41/0 

In [97]:
sp.viterbi_decode(seq)[0].to_words(train_seq)

'David/noun had/verb been/verb asked/verb to/prt write/verb a/det challenging/noun program/noun for/adp Maria/noun ./. '

In [98]:
# The following features correspond to the activated features for the given 
# list of tags and list of words
feature_mapper.get_sequence_features(seq)

([[0]],
 [[92], [92], [92], [92], [92], [92], [92], [92], [92], [92], [92]],
 [[2995]],
 [[], [], [], [], [], [], [], [], [], [121], [], []])

### Unseen words in the corpus
An obvious question that might arise is **what happens when a word is not in x_dict?**

In [99]:
print("there are", len(train_seq.x_dict), "words in the dictionary")

there are 16937 words in the dictionary


In [100]:
p = "David had been asked to write a challenging program for Angel ."

In [102]:
# this should fail because "Angel" was never seen
word_ids  = [train_seq.x_dict[w] for w in p.split()]

KeyError: 'Angel'
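
One way to avoid the exception while inspecting which words are unknown is to look words up with `dict.get`. This is a minimal sketch on a toy dictionary: `encode_words`, `toy_x_dict` and the placeholder id `-1` are hypothetical, and nothing in this notebook establishes that `skseq` itself accepts such a placeholder.

```python
def encode_words(words, word_to_id, unknown_id=-1):
    """Map words to ids, using unknown_id for words missing from the dictionary."""
    return [word_to_id.get(w, unknown_id) for w in words]


# Toy dictionary standing in for train_seq.x_dict.
toy_x_dict = {"David": 0, "had": 1, "for": 2, ".": 3}
sentence = "David had for Angel .".split()

ids = encode_words(sentence, toy_x_dict)
unseen = [w for w, i in zip(sentence, ids) if i == -1]
print(ids)
print(unseen)
```

The cells that follow take a different route and pass the raw strings to the `Sequence` constructor.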

In [103]:
new_seq = skseq.sequences.sequence.Sequence(x=p.split(), y=[int(0) for w in p.split()])

In [104]:
new_seq

David/0 had/0 been/0 asked/0 to/0 write/0 a/0 challenging/0 program/0 for/0 Angel/0 ./0 

In [105]:
feature_mapper.get_sequence_features(new_seq)

([[0]],
 [[92], [92], [92], [92], [92], [92], [92], [92], [92], [92], [92]],
 [[2995]],
 [[], [], [], [], [], [], [], [], [], [121], [], []])

In [106]:
sp.viterbi_decode(new_seq)[0].to_words(train_seq,
                                       only_tag_translation=True)

'David/noun had/verb been/verb asked/verb to/prt write/verb a/det challenging/noun program/noun for/adp Angel/noun ./. '

In [107]:
p = "Sara had been asked to write a challenging program for Angel ."
new_seq = skseq.sequences.sequence.Sequence(x=p.split(), y=[int(0) for w in p.split()])

In [108]:
sp.viterbi_decode(new_seq)

(Sara/6 had/6 been/6 asked/6 to/5 write/6 a/1 challenging/2 program/2 for/0 Angel/2 ./4 ,
 131.4)

In [109]:
sp.viterbi_decode(new_seq)[0].to_words(train_seq,
                                       only_tag_translation=True)

'Sara/verb had/verb been/verb asked/verb to/prt write/verb a/det challenging/noun program/noun for/adp Angel/noun ./. '

# Exercises: Adding new features

Inside ``skseq/sequences/extended_feature.py`` you will find

```
class ExtendedFeatures(IDFeatures):
```

Extend the function `add_emission_features(self, sequence, pos, y, features)` by adding new features.

One possible feature is already implemented: the `hyphen::tag` feature, which fires when a word contains a "-".
