# Type-constraint POS tagging in Keras

In this exercise you will learn how to implement a sequence prediction model in Keras and extend the tagger to use type-constraint decoding.


## A simple POS tagger based on Keras

POS tagging is the task of assigning a part-of-speech (word class) to every input token. 
For example:

`POS/NOUN tagging/NOUN is/VERB cool/ADJ`

There exist different POS tagsets. They differ in type and granularity of the POS tag inventory. 

In this exercise we will use the Google universal tagset [(Petrov et al., 2012)](http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf), which consists of 12 POS classes: ``NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), ‘.’ (punctuation marks) and X (a catch-all for other categories such as abbreviations or foreign words).``
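
As a small illustration, the tag inventory can be written down as a Python list and mapped to integer indices, which is the form a neural tagger works with. This is only a sketch: the names `UNIVERSAL_TAGS` and `tag2index` are made up for this example, and reserving index 0 for padding is an assumption, not necessarily what the provided `SequenceTagger` does.

```python
# The 12 universal POS tags (Petrov et al., 2012).
UNIVERSAL_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
                  "ADP", "NUM", "CONJ", "PRT", ".", "X"]

# Hypothetical mapping from tag to integer index.
# Assumption: index 0 is kept free for padding, so real tags start at 1.
tag2index = {tag: i for i, tag in enumerate(UNIVERSAL_TAGS, start=1)}

# The example sentence from above as (word, tag) pairs.
sentence = [("POS", "NOUN"), ("tagging", "NOUN"), ("is", "VERB"), ("cool", "ADJ")]
tag_ids = [tag2index[tag] for _, tag in sentence]
print(tag_ids)
```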


### An RNN-based POS tagger

We will build a simple POS tagger based on a Recurrent Neural Network (RNN), which you learned about yesterday. RNNs come in many different flavors (illustration by [Karpathy](http://karpathy.github.io/assets/rnn/diags.jpeg)):

<img src="http://karpathy.github.io/assets/rnn/diags.jpeg">

For our tagging scenario, we want to have a model that outputs a tag for every input token:


<img src="pics/karpathy-rnn.png" width=150>


### Keras

[Keras](https://keras.io/) is a neural network library in Python that supports both `theano` and `tensorflow`.

Keras offers two ways to define a model:

* the Sequential model
* the functional API

The simplest kind of model in Keras is the `Sequential` model, a linear stack of layers. If you want to implement a more complex model, use the `functional API`. More details: [tiny intro Keras](https://github.com/bplank/ltp-notebooks/blob/master/mini-intro-Keras.ipynb), and [30 seconds into Keras](https://keras.io/).

Let's start with the Sequential model. For the POS tagger, we have prepared a basic skeleton for you. Let's import this class.

In [96]:
from lib.tagger import SequenceTagger

This is a basic abstract class that already implements some functionality, such as input/output handling (`read_data`), and provides basic `fit` and `predict` functions to train a model and test it on new data. However, the class itself is a bare skeleton: it does not yet know what the model looks like.

So our first task is to specify the model structure. We create our own subclass `BasicSequenceTagger` of the basic `SequenceTagger` class and specify the **model** by implementing our own `build_graph` function:

In [97]:
from keras.models import Model, Sequential
from keras.layers import Dense, Input, Activation
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed
from keras.layers.embeddings import Embedding

class BasicSequenceTagger(SequenceTagger):
    
    def __init__(self):
        self.in_dim=64
        self.h_dim=100

    def build_graph(self):
        ### version1: traditional 'keras'-style code
        self.model = Sequential()
        self.model.add(Embedding(self.get_num_features(), self.in_dim,
                                 input_length=self.max_sentence_len, mask_zero=True))
        self.model.add(LSTM(self.h_dim, return_sequences=True))
        self.model.add(TimeDistributed(Dense(self.get_num_tags())))
        self.model.add(Activation('softmax'))

        self.model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])

Take your time to go through the code. Discuss with your neighbor: 
* How do the parts connect to the illustration above?
* What does `return_sequences=True` mean? Why do we need it?

##### Training the tagger

Let's train the tagger. We provide a sample dataset (from the Universal Dependencies project, converted to the 12 POS tags).


In [100]:
# create tagger and read data
tagger = BasicSequenceTagger()
tagger.read_data("data/en-ud-train-5000-12.conll", "data/en-ud-test-200-12.conll")

data/en-ud-train-5000-12.conll


5000 sentences 82500 tokens
11998 features
max_sentence_len: 104


In [99]:
print("Build model")
tagger.build_graph()
print("train..")
batch_size=50
epochs=6
tagger.fit(batch_size, epochs)

Build model
train..
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


In [85]:
# get predictions
tagger.predict(tagger.test_X)

correct 5174.0 total: 5715.0 accuracy: 0.9053368328958881
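
The reported accuracy is the number of correctly tagged tokens divided by the total number of tokens. A minimal sketch of that computation follows, assuming gold and predicted tags are given as one list per sentence with padding already removed. The helper `token_accuracy` is hypothetical and not part of `SequenceTagger`.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for g, p in zip(gold, predicted):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0

# Toy data: 5 of the 6 tokens are tagged correctly.
gold = [["NOUN", "NOUN", "VERB", "ADJ"], ["DET", "NOUN"]]
pred = [["NOUN", "VERB", "VERB", "ADJ"], ["DET", "NOUN"]]
print(token_accuracy(gold, pred))

# The figure printed by the tagger above follows the same formula:
print(5174 / 5715)
```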


Test the tagger with different numbers of epochs. What do you observe?

A nice way to visualize the model is the `model.summary()` function in Keras. This will give you more information on the model input/outputs and number of parameters. Try to add it to the tagger.

### Type constraints

The idea behind type constraints is to exploit additional information to guide your model. 

Say, someone told you that every sentence starts with a noun. You could restrict your model to only output nouns at sentence-initial positions. This is obviously too restrictive and not true. But this is the basic idea behind type constraints. 

Note that this technique is called **type** constraints. The name already alludes to the idea of using a dictionary as additional, fortuitous data. A dictionary lists the possible word classes for a given word type. More formally, a word type $w \in \mathcal{V}$ is mapped to a set of admissible tags $\mathcal{Y}(w) \subseteq \mathcal{Y}$. For word types that are not in the dictionary, we allow all possible tags, i.e., $\mathcal{Y}$.
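
In code, this mapping is a lookup with a fallback. The sketch below assumes the dictionary is a plain Python dict from word type to a set of tags. The function `admissible_tags` and the toy entries are hypothetical and only illustrate the definition.

```python
ALL_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", ".", "X"}

# Toy dictionary (illustrative entries, not taken from a real resource).
word2tags = {"book": {"NOUN", "VERB"}, "the": {"DET"}}

def admissible_tags(word, word2tags, all_tags=ALL_TAGS):
    """Return Y(w): the dictionary entry if there is one, else the full tagset."""
    tags = word2tags.get(word)
    return set(tags) if tags else set(all_tags)

print(sorted(admissible_tags("book", word2tags)))
print(len(admissible_tags("zyzzyva", word2tags)))
```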

Using these constraints, we restrict the tagger to output only allowed tags for words that are in the dictionary.
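
Written as a formula, and assuming the tagger chooses each tag independently by a plain argmax over its softmax output, the constrained prediction for token $w_i$ would be as follows. Here $\mathbf{w}$ denotes the whole input sentence and $p(y \mid \mathbf{w}, i)$ the predicted probability of tag $y$ at position $i$; both are notation introduced for this sketch.

$$\hat{y}_i = \underset{y \in \mathcal{Y}(w_i)}{\arg\max} \; p(y \mid \mathbf{w}, i)$$

For a word outside the dictionary, $\mathcal{Y}(w_i) = \mathcal{Y}$, so this reduces to the ordinary unconstrained argmax.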

Here is an example (from [Täckström et al., (2013)](http://soda.swedish-ict.se/5472/1/paper1.pdf)):

<img src="pics/typeconstr.png">

For instance, the first word in the example sentence is not in the dictionary, hence all tags are allowed. The remaining words are in the dictionary so only their admissible subset $\mathcal{Y}(w)$ is shown.


### Exercise: 

Add type-constraint inference to the tagger. 

In [109]:
from collections import defaultdict
from keras.models import Model, Sequential
from keras.layers import Dense, Input, Activation
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed
from keras.layers.embeddings import Embedding

class BasicSequenceTaggerTypeConstraint(SequenceTagger):
    
    def __init__(self, wiktionaryfile):
        self.w2i = {} # mapping word to word indices
        self.t2i = {} # mapping tag to tag indices
        self.in_dim=64
        self.h_dim=100
        self.word2tags = self.read_wiktionary(wiktionaryfile)
        
    def read_wiktionary(self, wiktionaryfile):
        pass #TODO

    def build_graph(self):
        ### version1: traditional 'keras'-style code
        self.model = Sequential()
        self.model.add(Embedding(self.get_num_features(), self.in_dim,
                                 input_length=self.max_sentence_len, mask_zero=True))
        self.model.add(LSTM(self.h_dim, return_sequences=True))
        self.model.add(TimeDistributed(Dense(self.get_num_tags())))
        self.model.add(Activation('softmax'))

        self.model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
    
    def predict(self, input, output=None, batch_size=32):
        # This is the standard predict method (copy from SequenceTagger)
        # probs is a tensor (num_instance, max_sentence_len, number_classes)
        # note that it is padded (pre-padded with zeros)
        probs = self.model.predict(input, batch_size=batch_size)
        predictions = probs.argmax(axis=-1)
        return self.evaluate(predictions, output)

**Step 1**: Read in Wiktionary

Add the necessary code to the `read_wiktionary` function. The dictionary is a plain text file in the format `word<tab>tag`, with a single tag per line. 
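
One possible way to parse this format is sketched below. It is written as a standalone function over an iterable of lines so that it can be tried without a file. `parse_dictionary_lines` is a hypothetical helper, and silently skipping malformed lines is a choice made here, not something the exercise prescribes.

```python
from collections import defaultdict

def parse_dictionary_lines(lines):
    """Collect the set of tags listed for each word in `word<tab>tag` lines."""
    word2tags = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip lines that do not match word<tab>tag
        word, tag = parts
        word2tags[word].add(tag)
    return word2tags

# Made-up sample lines in the expected format.
sample = ["enough\tADJ\n", "enough\tADV\n", "book\tNOUN\n", "\n"]
entries = parse_dictionary_lines(sample)
print(sorted(entries["enough"]))
```

Since an open file handle is also an iterable of lines, the same function could be called on the opened dictionary file inside `read_wiktionary`.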

Once you have your code in place you can examine `word2tags`:

In [None]:
tagger = BasicSequenceTaggerTypeConstraint("data/small.dic")
tagger.word2tags['enough']

**Step 2**: Add type-constraint inference
  
Now that you have the dictionary in place, how can you implement these restrictions? 
Which parts of the tagger do you need to modify?

We need to instruct the tagger to:

* allow only tags that are licensed by the dictionary,
* otherwise put no restrictions on the possible output.

Add the necessary code to the tagger above. Then train and test the model with type constraints.
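
A minimal sketch of the decoding step for a single sentence is given below, under several assumptions: `probs` holds one row of tag probabilities per position, the rows are pre-padded so that the last `len(words)` rows belong to the words, and `t2i` maps tags to column indices. The function names and the toy numbers are hypothetical. The code is independent of Keras and only illustrates the restricted argmax.

```python
def constrained_argmax(prob_row, allowed_indices):
    """Index of the highest-probability tag among the allowed ones."""
    return max(allowed_indices, key=lambda i: prob_row[i])

def decode_sentence(words, probs, word2tags, t2i):
    """Greedy type-constrained decoding for one pre-padded sentence."""
    all_indices = sorted(t2i.values())
    rows = probs[len(probs) - len(words):]  # drop the padding rows at the front
    predictions = []
    for word, row in zip(words, rows):
        # Keep only dictionary tags that the tagger actually knows.
        allowed = sorted(t2i[t] for t in word2tags.get(word, ()) if t in t2i)
        # Unknown word (or no usable tag): fall back to the full tagset.
        predictions.append(constrained_argmax(row, allowed or all_indices))
    return predictions

# Toy setup with three tags and a two-entry dictionary.
t2i = {"NOUN": 0, "VERB": 1, "DET": 2}
word2tags = {"the": {"DET"}, "book": {"NOUN", "VERB"}}
probs = [
    [0.0, 0.0, 0.0],   # padding row
    [0.5, 0.1, 0.4],   # "the": the unconstrained argmax would be NOUN
    [0.3, 0.6, 0.1],   # "book"
    [0.2, 0.7, 0.1],   # "flies": not in the dictionary
]
print(decode_sentence(["the", "book", "flies"], probs, word2tags, t2i))
```

In this toy run the dictionary overrides the model for "the" (forcing `DET`), while "flies" is decoded without restrictions because it has no entry.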

In [None]:
print("read data")
tagger.read_data("data/en-ud-train-5000-12.conll", "data/en-ud-test-200-12.conll")
print("Build model")
tagger.build_graph()
print("train..")
batch_size=50
epochs=6
tagger.fit(batch_size, epochs)

In [None]:
# get type-constraint predictions
tagger.predict(tagger.test_X)

#### Wiktionary

A popular community-created dictionary is [Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Main_Page). 
It is frequently used in NLP, both for type-constraint decoding and for evaluation, cf. Li et al., (2012), Täckström et al., (2013).

You can download the dictionaries derived from Wiktionary by Li et al. [here](https://code.google.com/archive/p/wikily-supervised-pos-tagger/).


##### Extra exercises:

* Another way to make the system more robust is to use *dropout* (Hinton et al., 2012). Check the Keras documentation. Add dropout to your tagger.

* For more advanced models the functional API is more appropriate. Translate the code above into functional API code (i.e., every layer is now a function, and you can give names to layers). Here is what the first two lines of code would look like (you now need to define an explicit `Input` layer):

```python
input_words = Input(batch_shape=(args.batch_size, self.max_sentence_len),
                    dtype='int32', name='word_input')
word_embeddings = Embedding(self.get_num_features(), args.in_dim,
                            input_length=self.max_sentence_len,
                            mask_zero=True)(input_words)
```

### References:

* [Li et al., (2012)](http://www.seas.upenn.edu/~taskar/pubs/wikipos_emnlp12.pdf)
* [Täckström et al., (2013)](http://soda.swedish-ict.se/5472/1/paper1.pdf)