After this notebook you should know:
    
* classification: `keras` vs `scikit-learn/sklearn`
* recap: representing text as features (traditional sparse vs deep learning dense)
* how to build a simple structured prediction model (a sequence prediction model where every input gets a label) in `keras`

# Classification and Structured Prediction with Keras

In NLP we typically deal with the following **prediction problems** - Given $x$, predict $y$:


| Given $x$ | predict $y$  | Type of prediction problem | 
|------|------|------|
|   a book review  | positive, negative | **classification** (binary) |
|   a tweet  | language | **multi-class classification** (several choices) |
|   a sentence  | its syntactic parse tree | **structured prediction** (millions of choices) |



Depending on **what $y$ looks like**, we distinguish the following types of prediction problems:

| Example task | Traditional classifier  | Type of prediction problem | 
|------|------|------|
|   sentiment | Logistic regression, SVM | **classification** (binary) |
|   language identification  | Logistic regression, SVM  | **multi-class classification** (several choices) |
| personality (big five) | linear regression | **regression** problem (continuous/numeric output) | 
|   POS sequence  | HMM, structured perceptron, (window-based classifier) | **structured prediction** (millions of choices) |
|   NER  | [CRF (e.g. crfsuite)](http://www.chokkan.org/software/crfsuite/), [structured perceptron (my example impl.)](https://github.com/bplank/sp)  | **structured prediction** (millions of choices) |



So far **we mostly focused on non-structured classification/regression problems**, predicting a single output from some input, like:
* the animacy of a word, 
* the sentiment of a tweet, or
* the personality value of a person from their tweets, e.g. (Liu et al., 2016)

## ML 101: What we need

1. Data
  * what your data looks like, the input $X$ and output (labels) $Y$ 
2. Features
  * how to represent your data (the actual features): how $X$ is decomposed into its parts by the vectorizer/featurizer $\phi$. **How do we do that for text data?**
3. Model/Algorithm
  * the machine learning algorithm used 
4. Evaluation
  * how to measure how good your model is 

## ML in a non-NLP world (no vectorizer/featurizer $\phi$)

In a world where we can easily observe all features (like in the case with the IRIS dataset) we do not need to do feature extraction/vectorization, as the features are already given directly as input:

In [16]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
iris_X = iris.data  # recall: what does a single instance look like?
iris_y = iris.target

In [17]:
print(iris_X[0])
print(iris['feature_names'])

[ 5.1  3.5  1.4  0.2]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


### Train a scikit-learn classifier

Training a classifier in `scikit-learn` is a matter of a few lines, once the data is in the right format:

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

np.random.seed(113)
indices = np.random.permutation(len(iris['data']))
# split in 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.2)

# output statistics
print("#inst train: %s" % (len(X_train)))
print("#inst test: %s" % (len(X_test)))
# learn a logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred= clf.predict(X_test)
print("Pred:", y_pred)
print("Gold:", y_test)
print(classification_report(y_test, y_pred, target_names=iris['target_names']))

#inst train: 120
#inst test: 30
Pred: [0 1 0 2 1 2 0 0 1 2 0 2 0 0 0 0 2 0 1 2 1 0 0 0 1 1 0 1 0 0]
Gold: [0 1 0 2 1 2 0 0 1 1 0 2 0 0 0 0 2 0 1 2 1 0 0 0 1 1 0 1 0 0]
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        16
 versicolor       1.00      0.89      0.94         9
  virginica       0.83      1.00      0.91         5

avg / total       0.97      0.97      0.97        30



### Now let's do something very similar with Keras



The difference with `keras` is a bit of **data munging**. 

`Scikit-learn` and `Keras` require the data to be encoded in different ways. 


#### Step 1: Label encoding

`sklearn` accepts labels/classes as strings/numbers: 

In [19]:
print(y_train) # sklearn

[2 1 1 1 2 1 2 1 2 1 0 0 0 2 1 0 2 1 2 0 1 0 2 0 2 0 1 2 0 2 0 0 0 1 1 2 1
 1 2 0 1 0 2 2 0 0 2 2 0 2 1 2 2 0 0 2 0 0 1 1 2 2 0 0 1 1 1 2 2 0 2 1 2 2
 2 0 2 1 2 2 0 1 1 1 1 1 2 0 0 1 2 2 0 0 1 0 1 2 1 2 1 2 1 0 2 2 0 2 2 1 2
 2 2 0 1 1 1 1 1 1]


`Keras` wants classes/labels as **one-hot-encodings**! That is, each label is no longer just a number, but will be represented as a vector, where the index of the label that is 'on' gets a 1.

We can use the `keras` helper function `np_utils.to_categorical` to get one-hot-encodings for our labels:
    
    

In [20]:
from keras.utils import np_utils

In [21]:
num_classes = len(np.unique(y_train)) # how many labels we have
y_train_one_hot = np_utils.to_categorical(y_train, num_classes) #important to give the num_classes!

In [22]:
y_train_one_hot[:10] # print first 10

array([[ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])

In [23]:
# do the same with the test data
y_test_one_hot = np_utils.to_categorical(y_test, num_classes)

Do you see how the labels were mapped? Great. Now we have the labels in the proper format for `Keras`.
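To make the mapping concrete, here is a minimal NumPy sketch of what one-hot encoding does. It is not the `keras` implementation: the helper name `one_hot` and the toy labels are made up for illustration, and the sketch assumes the labels are integers from 0 to `num_classes - 1`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer labels into one-hot row vectors."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes))
    # set a 1 in the column given by each label
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

example_labels = [2, 1, 1, 0]
encoded = one_hot(example_labels, num_classes=3)
print(encoded)

# going back from one-hot vectors to labels is an argmax over each row
decoded = encoded.argmax(axis=1)
print(decoded)
```

The last two lines show the inverse direction, which becomes relevant when turning model outputs back into labels.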

#### Step 2: Build the model

While deciding which model to use is easy in `sklearn` (typically just loading the class you want to use, like `LogisticRegression`), in `Keras` you decide how to build your model. 

Let's build a simple feedforward neural network that takes the 4 features as input, has one hidden layer of, say, 32 neurons, and three output nodes, one for each class.

In [24]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation


model = Sequential()
model.add(Dense(32, input_shape=(4,)))
model.add(Activation('sigmoid'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

#### Step 3: Train and evaluate the model

Once we have built the model we have an object that is similar to an `sklearn` classifier. It has `fit` and `predict` functions (or `evaluate`). However, before calling the training function `fit` we need to `compile` the model:



In [25]:
model.compile(loss='categorical_crossentropy', optimizer='adam')


In [26]:
model.fit(X_train, y_train_one_hot)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1133bd6a0>

Now we have trained the model. Another part where `keras` deviates from `sklearn` is in evaluation. The `evaluate` function returns, by default, the *loss* (and not the labels/accuracy):

In [27]:
model.evaluate(X_test, y_test_one_hot)



1.0258275270462036

In fact, pay attention! The `evaluate` function outputs only the loss if you do not specify a metric when you compile the model. Let's add accuracy (you will also see it in the training log):

In [29]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation


model = Sequential()
model.add(Dense(32, input_shape=(4,)))
model.add(Activation('sigmoid'))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [30]:
model.fit(X_train, y_train_one_hot)
model.evaluate(X_test, y_test_one_hot)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[1.0258275270462036, 0.60000002384185791]

So we got an accuracy of 60%.

To get the predictions back, we need a bit of code to transform the softmax probabilities back into labels (the output below stems from a different training run, so its accuracy differs from the 60% above):

In [15]:
probs = model.predict(X_test)
print(probs[:3])
y_predicted = [seq.argmax() for seq in probs]
print(y_predicted)
print(accuracy_score(y_test, y_predicted))
print(classification_report(y_test, y_predicted))
print(y_test)

[[ 0.7071358   0.21368665  0.07917757]
 [ 0.18763381  0.46548307  0.34688312]
 [ 0.68814999  0.2216766   0.09017339]]
[0, 1, 0, 2, 1, 2, 0, 0, 1, 2, 0, 2, 0, 0, 0, 0, 2, 0, 2, 2, 2, 0, 0, 0, 1, 2, 0, 2, 0, 0]
0.833333333333
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        16
          1       1.00      0.44      0.62         9
          2       0.50      1.00      0.67         5

avg / total       0.92      0.83      0.83        30

[0 1 0 2 1 2 0 0 1 1 0 2 0 0 0 0 2 0 1 2 1 0 0 0 1 1 0 1 0 0]
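The list comprehension above can also be written as a single vectorized `argmax`. The sketch below uses the three probability rows printed above, copied by hand into a stand-in array (the variable names are ours, not part of the notebook):

```python
import numpy as np

# the first three softmax outputs printed above, copied by hand
example_probs = np.array([
    [0.7071358, 0.21368665, 0.07917757],
    [0.18763381, 0.46548307, 0.34688312],
    [0.68814999, 0.2216766, 0.09017339],
])

# every row is a probability distribution over the three classes
row_sums = example_probs.sum(axis=1)
print(row_sums)

# the predicted label is the index of the highest probability in each row
example_predictions = example_probs.argmax(axis=1)
print(example_predictions)
```

Note how close the two largest probabilities are in the second row: the model is far from confident there, which is information that the hard labels alone hide.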


Points that need attention:

* how to get predictions
* how to train your model: what was the batch size used, and what happens if you change it? (Recall: in theory a neural network is a universal function approximator (Schaeffer and Zimmermann, 2006) and hence much more expressive than a simple linear model; however, one needs to get many details right.)
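As a hint for the batch-size question: `fit` in `keras` uses a default `batch_size` of 32, and the number of weight updates per epoch follows directly from it. A small back-of-the-envelope sketch (the function name is hypothetical), using the 120 training instances from above:

```python
import math

def updates_per_epoch(num_instances, batch_size):
    """Number of mini-batches (= weight updates) in one pass over the data."""
    return math.ceil(num_instances / batch_size)

num_train = 120  # size of the IRIS training split above
for batch_size in (32, 8, 1):
    n_updates = updates_per_epoch(num_train, batch_size)
    print("batch_size=%d -> %d updates per epoch" % (batch_size, n_updates))
```

With only a handful of updates per epoch, the network has had little chance to move away from its random initialization, which is one plausible reason for the modest accuracy above.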

In [None]:
## your code here:
# Train the model with a smaller batch_size

For this tiny example a neural network is almost overkill. Nevertheless, it shows how to represent the input for `keras`, what the possible pitfalls are, etc.

## It's an NLP's world... Text everywhere!

The IRIS example is simplistic in the sense that the features are directly defined. When we work from text we need to think of how to decompose the text to represent it as features.

Remember this picture from Lecture 1 on [Text Analytics](02_Text_analytics_Evaluation.ipynb)?

<img src="pics/learning.png">

When we use a simple BOW (word unigram) representation, we get **n-hot** encodings of our features.

Recall:
<img src="pics/bow2.png">

So, the common pipeline of extracting features for an NLP model in a **traditional** model is:

* extract a set of core linguistic features $f_1, \ldots, f_n$ (e.g. just the words)
* define a vector whose length is the total number of features with a 1 at position k if the k-th feature is active; this feature vector represents the **instance** $\mathbf{x}$  (**sparse representation**, n-hot encoding)
* use $\mathbf{x}$ as representation for an instance, train the model

This is a sparse feature representation, as you already know by now.
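The pipeline above can be sketched in a few lines of plain Python/NumPy. This is a toy illustration with two made-up sentences and hypothetical names (`toy_sentences`, `n_hot`); it stores dense vectors for readability, whereas `sklearn` keeps them in a sparse matrix.

```python
import numpy as np

toy_sentences = ["a good film", "a bad bad film"]

# step 1: the features are simply the word types seen in the data
vocab = sorted({word for s in toy_sentences for word in s.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def n_hot(sentence):
    """Binary bag-of-words vector: 1 if the word occurs, 0 otherwise."""
    vec = np.zeros(len(vocab))
    for word in sentence.split():
        vec[word_to_index[word]] = 1.0
    return vec

print(vocab)
for s in toy_sentences:
    print(s, "->", n_hot(s))
```

Because the encoding is binary, the repeated word in the second sentence still only contributes a single 1, and word order is lost entirely.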

#### Sentiment classifier

In [31]:
import numpy as np
import random
positive_sentences = [l.strip() for l in open("exercise/rt-polaritydata/rt-polarity.pos").readlines()]
negative_sentences = [l.strip() for l in open("exercise/rt-polaritydata/rt-polarity.neg").readlines()]

positive_labels = [1 for sentence in positive_sentences]
negative_labels = [0 for sentence in negative_sentences]

sentences = np.concatenate([positive_sentences,negative_sentences], axis=0)
labels = np.concatenate([positive_labels,negative_labels],axis=0)

## make sure we have a label for every data instance
assert(len(sentences)==len(labels))
data={}
np.random.seed(113) #seed
data['target']= np.random.permutation(labels)
np.random.seed(113) # use same seed!
data['data'] = np.random.permutation(sentences)


In [32]:
X_rest, X_test, y_rest, y_test = train_test_split(data['data'], data['target'], test_size=0.2)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=0.2)
del X_rest, y_rest

In [33]:
print("#train instances: {} #dev: {} #test: {}".format(len(X_train),len(X_dev),len(X_test)))

#train instances: 6823 #dev: 1706 #test: 2133


In [34]:
## look at some instances
print(X_train[:5])
print(y_train[:5])

[ 'more likely to have you scratching your head than hiding under your seat .'
 'kaufman creates an eerie sense of not only being there at the time of these events but the very night matthew was killed .'
 'captures the raw comic energy of one of our most flamboyant female comics .'
 "it's a nicely detailed world of pawns , bishops and kings , of wagers in dingy backrooms or pristine forests ."
 'de oliveira creates an emotionally rich , poetically plump and visually fulsome , but never showy , film whose bittersweet themes are reinforced and brilliantly personified by michel piccoli .']
[0 1 1 1 1]


In [35]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(binary=True,analyzer='word')

## transform data to sklearn representation 
# (n-hot encoding but internally stored as sparse matrix)
X_train_vec = vectorizer.fit_transform(X_train)
X_dev_vec = vectorizer.transform(X_dev)
X_test_vec = vectorizer.transform(X_test)

In [36]:
classifier = LogisticRegression()
classifier.fit(X_train_vec, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [37]:
y_pred = classifier.predict(X_dev_vec)

In [38]:
accuracy_score(y_dev, y_pred)

0.7790152403282532

We get an accuracy of 77.9%. How good is this? We need to compare it to a baseline. 

In [39]:
from collections import Counter
most_freq_class = Counter(y_dev).most_common()[0][0]
y_maj = [most_freq_class for _ in y_dev]
accuracy_score(y_dev, y_maj)

0.50996483001172332

### Now let's do something very similar with Keras, however, this time we have text as input!


1. Represent labels as 1-hot encoding.

Note, this is actually only necessary if you have more than 2 labels (multi-class classification). Why? Because a two-class problem can be represented by a single output node with a `sigmoid` activation. In the multi-class case you'll have a `softmax` output with k classes (nodes), one for each class.
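The reason a single `sigmoid` node suffices can be checked numerically: a softmax over two scores, one of which is fixed to 0, yields the same probability as a sigmoid over the other score. A minimal NumPy sketch (the functions are written out by hand here, they are not `keras` calls):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    exps = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exps / exps.sum()

z = 1.5  # an arbitrary score for the positive class
p_sigmoid = sigmoid(z)
# two output nodes: score z for the positive class, score 0 for the negative class
p_softmax = softmax(np.array([z, 0.0]))[0]
print(p_sigmoid, p_softmax)
```

Both values agree, so the second output node would add parameters without adding expressive power.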

In [40]:
num_classes = len(np.unique(y_train)) # how many labels we have
# do the 'to_categorical' only if you have > 2 num_classes 
#y_train_one_hot = np_utils.to_categorical(y_train, num_classes)
y_train_one_hot = y_train
y_test_one_hot = y_test
y_dev_one_hot = y_dev

In [26]:
y_train_one_hot

array([0, 1, 1, ..., 1, 1, 0])


| Example task |  Type of prediction problem |  Output and loss | 
|------|------|------|
|   sentiment | **classification** (binary) | 1 dense (numeric label encoding), sigmoid, binary crossentropy | 
|   language identification  | **multi-class classification** (several choices) | num_classes dense (1-hot encoded labels), softmax, categorical crossentropy | 

2. Build model

In [None]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation


model = Sequential()
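model.add(Dense(32, input_shape=(???,)))  # ??? = size of the input representation (see below)
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('sigmoid'))  # a single output node with sigmoid for binary classification
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])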


But **how do we represent the input**? (What is the input shape?)

Well, it depends on how we decide to represent the input (text) in a neural network.

We have several possibilities:

1. Use a traditional sparse n-hot encoding (like in the example above) - a traditional sparse representation
2. Use a deep learning representation, i.e., embed the features into an embedding space. If we do use embeddings, we still need to decide how to **combine** the embedding representations, e.g.:
    * using a simple average (what's called a CBOW model, we'll see why)
    * using a model that combines the output into a fixed-length vector (like an RNN)

### Option 1: Traditional sparse n-hot encoding

In [41]:
# first map all words to indices, then create n-hot vector
from collections import defaultdict

def convert_to_n_hot(X, vocab_size):
    out = []
    for instance in X:
        n_hot = np.zeros(vocab_size)
        for w_idx in instance:
            n_hot[w_idx] = 1
        out.append(n_hot)
    return np.array(out)

w2i = defaultdict(lambda: len(w2i))
UNK = w2i["<unk>"]
X_train_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_train]
w2i = defaultdict(lambda: UNK, w2i) # freeze
X_dev_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_dev]
X_test_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_test]


In [42]:
X_train[-1]

'an extremely unpleasant film .'

In [45]:
X_train_num[-1] # represented as word indices

[16, 1859, 1079, 72, 13]

In [46]:
y_train[-1]

0

In [47]:
y_train_one_hot[-1] #same here as just 2 classes

0

In [48]:
### Note: we use n-hot encodings *ONLY* when we do not embed the inputs (use no embedding layer!)
vocab_size = len(w2i)
X_train_nhot = convert_to_n_hot(X_train_num, vocab_size)
X_dev_nhot = convert_to_n_hot(X_dev_num, vocab_size)
X_test_nhot = convert_to_n_hot(X_test_num, vocab_size)

In [49]:
print(vocab_size)

21426


In [50]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation


model = Sequential()
model.add(Dense(32, input_shape=(vocab_size,)))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [51]:
model.fit(X_train_nhot, y_train_one_hot)
loss, accuracy = model.evaluate(X_dev_nhot, y_dev_one_hot)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

In [52]:
print("Accuracy: ", accuracy *100)

Accuracy:  77.1395076202


Good. We have a model now that is very similar to the logistic regression model before. Now how can we move this to a deep learning model? 

### Option 2: Use a dense deep learning representation

We change our model and use an embedding layer as input. This means that each word gets embedded into a dense vector space. A data instance is thus represented as a list of lists (one vector per word), so the entire training dataset becomes 3-dimensional!

Still, there is one peculiarity: even though our model might be able to handle variable-size input, `keras` compiles the graph upfront, so we need to give it inputs of the same length. How do we handle inputs of different **lengths**? We need to "pad" them (add dummy values), so that they all have the same length. 

One way to do so is to pad all sentences to the max sentence length.

We need to make sure we use a dedicated symbol for padding! Typically it is 0. (Pay attention: we used 0 for OOVs before.) From now on let us dedicate index 1 to OOVs and index 0 to padding, as shown next.

In [68]:
w2i = defaultdict(lambda: len(w2i))
PAD = w2i["<pad>"] # index 0 is padding
UNK = w2i["<unk>"] # index 1 is for UNK

# convert words to indices, taking care of UNKs
X_train_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_train]
w2i = defaultdict(lambda: UNK, w2i) # freeze - cute trick!
X_dev_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_dev]
X_test_num = [[w2i[word] for word in sentence.split(" ")] for sentence in X_test]
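The `defaultdict` "freeze" trick used above is easier to follow on a toy vocabulary. The sketch below uses hypothetical names (`toy_w2i`, `TOY_PAD`, `TOY_UNK`) and two made-up sentences. It also shows a caveat worth knowing: looking up a missing key in a `defaultdict` stores that key, so the dictionary keeps growing even after the freeze.

```python
from collections import defaultdict

toy_w2i = defaultdict(lambda: len(toy_w2i))
TOY_PAD = toy_w2i["<pad>"]  # gets index 0
TOY_UNK = toy_w2i["<unk>"]  # gets index 1

# "training": every new word is assigned the next free index
train_ids = [toy_w2i[w] for w in "a good film".split()]

# freeze: from now on unseen words map to the UNK index
toy_w2i = defaultdict(lambda: TOY_UNK, toy_w2i)
size_after_training = len(toy_w2i)
dev_ids = [toy_w2i[w] for w in "a terrible film".split()]

print(train_ids)  # [2, 3, 4]
print(dev_ids)    # [2, 1, 4]
# caveat: the unseen word was stored as a new key (with the UNK index)
print(size_after_training, len(toy_w2i))  # 5 6
```

One consequence is that taking `len(...)` of the mapping after converting dev and test data also counts their unseen words. This does no harm for the embedding layer, but the resulting vocabulary size is larger than the number of distinct indices actually in use.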

In [69]:
X_train

array([ 'more likely to have you scratching your head than hiding under your seat .',
       'kaufman creates an eerie sense of not only being there at the time of these events but the very night matthew was killed .',
       'captures the raw comic energy of one of our most flamboyant female comics .',
       ...,
       "none of his actors stand out , but that's less of a problem here than it would be in another film : characterization matters less than atmosphere .",
       'the difference between cho and most comics is that her confidence in her material is merited .',
       'an extremely unpleasant film .'], 
      dtype='<U267')

In [70]:
max_sentence_length=max([len(s.split(" ")) for s in X_train] 
                        + [len(s.split(" ")) for s in X_dev] 
                        + [len(s.split(" ")) for s in X_test] )

In [71]:
print(max_sentence_length)

59


`Keras` provides a helper function to do the padding:

In [72]:
from keras.preprocessing import sequence
# pad X
X_train_pad = sequence.pad_sequences(X_train_num, maxlen=max_sentence_length, value=PAD)
X_dev_pad = sequence.pad_sequences(X_dev_num, maxlen=max_sentence_length, value=PAD)
X_test_pad = sequence.pad_sequences(X_test_num, maxlen=max_sentence_length,value=PAD)


In [73]:
print(X_train_pad.shape)

(6823, 59)
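To see what the padding step does to an individual sentence, here is a small NumPy re-implementation sketch. The helper `pad_left` is hypothetical and only mimics the default behaviour of `pad_sequences`, which pads (and truncates) at the front of a sequence; it is not the library code.

```python
import numpy as np

def pad_left(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad (or left-truncate) index lists to maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype=int)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        kept = seq[-maxlen:]             # keep at most the last maxlen items
        padded[row, -len(kept):] = kept  # right-align, padding goes in front
    return padded

toy_sequences = [[16, 1859, 1079, 72, 13], [5, 7]]
toy_padded = pad_left(toy_sequences, maxlen=6)
print(toy_padded)
```

Every row now has the same length, and the real word indices sit at the end of each row.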


In [74]:
vocabulary_size = len(w2i)
embeds_size=64

Now we have our input data $X$ in the proper shape. Once the word indices are embedded, our dataset is a tensor of: number of training instances (D), maximum sentence length (Max len) and dimensionality of the embedding space. Illustrated as (where gray blocks are the 0-paddings):

<img src="http://dirko.github.io/images/2016-04-02-Bidirectional-LSTMs-with-Keras/input_block.png" width=500>

##### Option 2 a: Let's use a simple average embedding (CBOW model)

In [75]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, GlobalAveragePooling1D


model = Sequential()
model.add(Embedding(vocabulary_size, embeds_size, input_length=max_sentence_length))
model.add(GlobalAveragePooling1D())
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [77]:
model.fit(X_train_pad, y_train_one_hot)
loss, accuracy = model.evaluate(X_dev_pad, y_dev_one_hot)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

In [78]:
print("Accuracy: ", accuracy *100)

Accuracy:  79.0152403283


Great! This very simple model is actually surprisingly effective. It is called a CBOW ("continuous bag-of-words") model and simply takes the average embedding. Here is a great illustration of this model by G. Neubig:
<img src="pics/cbow-gn.png">

**Q:** Why is it called CBOW? How is this related to a BOW model?
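As a hint for the question above, the core of the model (embedding lookup followed by an average) can be sketched in NumPy. The embedding matrix here is random and tiny, padding is ignored, and all names are hypothetical. The point is only to show what the averaging does to word order.

```python
import numpy as np

rng = np.random.RandomState(113)
toy_vocab_size, toy_embed_size = 10, 4
# randomly initialised embedding matrix: one row per word index
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embed_size))

def cbow_vector(word_indices):
    """Look up the embedding of every word and average them."""
    vectors = embedding_matrix[word_indices]  # shape: (sentence length, embed size)
    return vectors.mean(axis=0)               # shape: (embed size,)

sentence = [2, 5, 7]
shuffled = [7, 2, 5]
print(cbow_vector(sentence))
# word order does not matter for the average
same = np.allclose(cbow_vector(sentence), cbow_vector(shuffled))
print(same)
```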

#### Small detour: number of parameters of this model

In [80]:
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 59, 64)            1371328   
_________________________________________________________________
global_average_pooling1d_3 ( (None, 64)                0         
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 65        
_________________________________________________________________
activation_13 (Activation)   (None, 1)                 0         
Total params: 1,371,393
Trainable params: 1,371,393
Non-trainable params: 0
_________________________________________________________________
None


Note that the output of the embedding layer is now 3-dimensional: (number of instances, max length, embedding size). 

The number of parameters in the embedding layer is: vocab size times embedding size. 

In [81]:
print(vocabulary_size)
print(vocabulary_size* 64)

21427
1371328
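The remaining numbers in the summary can be verified with the same kind of arithmetic. A short sketch, with the values copied from the output above into hypothetical variables:

```python
toy_vocabulary_size = 21427  # value printed above
toy_embeds_size = 64

# embedding layer: one vector of size embeds_size per vocabulary entry
embedding_params = toy_vocabulary_size * toy_embeds_size
# dense output layer: one weight per embedding dimension plus one bias
dense_params = toy_embeds_size * 1 + 1

print(embedding_params)                 # 1371328
print(dense_params)                     # 65
print(embedding_params + dense_params)  # 1371393
```

Almost all parameters of this model sit in the embedding layer; the pooling and activation layers have none.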


Make sure you get familiar with the remaining information in the `model.summary`.

This very simple CBOW model was able to outperform the traditional logistic regression classifier / n-hot classifier.

##### Option 2 b: Let's use a recurrent neural network (RNN)

Can we do better with an RNN?

<img src="pics/many2one.png">

We have all pieces in place. Instead of averaging the embedding vectors, we run a simple RNN on it and use the last state to predict the label. 
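What "run a simple RNN and use the last state" means can be sketched in NumPy. This is a hand-written Elman-style recurrence with random, untrained weights and hypothetical names, using `tanh` as the activation (the default of `SimpleRNN`). It is meant as an illustration, not as the `keras` implementation.

```python
import numpy as np

rng = np.random.RandomState(113)
embed_size, hidden_size, sentence_length = 64, 32, 5

# randomly initialised parameters (in a real model these are learned)
W_in = rng.normal(scale=0.1, size=(embed_size, hidden_size))    # input -> hidden
W_rec = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
bias = np.zeros(hidden_size)

def last_rnn_state(embedded_sentence):
    """Run a simple RNN over the sentence and return the final state."""
    state = np.zeros(hidden_size)
    for x_t in embedded_sentence:  # one embedding vector per time step
        state = np.tanh(x_t @ W_in + state @ W_rec + bias)
    return state

embedded = rng.normal(size=(sentence_length, embed_size))  # stand-in for embeddings
final_state = last_rnn_state(embedded)
print(final_state.shape)
```

The final state is a fixed-length vector regardless of the sentence length, which is what the dense output layer needs.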

In [47]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, SimpleRNN

from keras import regularizers
model = Sequential()
model.add(Embedding(vocabulary_size, embeds_size, input_length=max_sentence_length))
model.add(SimpleRNN(32))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [48]:
model.fit(X_train_pad, y_train_one_hot)
loss, accuracy = model.evaluate(X_dev_pad, y_dev_one_hot)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

In [49]:
print("Accuracy: ", accuracy *100)

Accuracy:  73.5638921524


Not yet. But there are many options to explore here (deeper network, add regularization, use LSTM or GRU, bidirectional model...). 

## Structured Prediction

So far we concentrated on simple classification (multi-class or binary classification) problems (many inputs to one output). However, in many cases in NLP we are actually dealing with output spaces which are more complex (either entire parse trees or sequences of tags), where we have millions of choices. 

A Part-of-Speech tagger is an example of a structured prediction problem (a sequence prediction), that assigns a sequence of tags (POS tags) to a sequence of input tokens. Hence, instead of just predicting a label at the very end we want a model that predicts a tag **for every input token** (many to many).

<img src="pics/many2many-tagging.png">

An example is POS tagging: 
* every red box is a word
* every blue box is a POS tag

How can we implement such a model in `keras`?

#### From classification to sequence prediction

There are several things we need to take care of:
* our labels are no longer just single labels for an instance, but 2d: lists of labels (a single instance is now a matrix, the entire training dataset a tensor)
* we need to get a prediction for every input (time step)
* we need to skip time steps with no input (just padding)
* we need to make sure that we also pad the $y$s (to have equal inputs, they will then be skipped)
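The label side of these points can be illustrated on a toy example. The sketch below uses two made-up tag sequences and hypothetical names, and it assumes padding goes in front of the sequence. It builds the padded, one-hot encoded label tensor directly in NumPy.

```python
import numpy as np

toy_tags = [["DET", "NOUN", "VERB"], ["NOUN", "VERB"]]
tag_names = sorted({tag for sent in toy_tags for tag in sent})
tag_to_index = {tag: i for i, tag in enumerate(tag_names)}
max_len = 4

# one matrix per sentence: max_len rows (time steps), one column per tag
y_toy = np.zeros((len(toy_tags), max_len, len(tag_names)))
for s, sent in enumerate(toy_tags):
    offset = max_len - len(sent)  # padding goes in front
    for t, tag in enumerate(sent):
        y_toy[s, offset + t, tag_to_index[tag]] = 1.0

print(y_toy.shape)  # (2, 4, 3)
print(y_toy[1])     # the first two rows are all-zero padding
```

Padded time steps end up as all-zero rows, so they do not correspond to any tag.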



#### Example: POS tagging

A tiny example of a POS tagger (too little data to make it really work).

In [82]:
w2i = defaultdict(lambda: len(w2i))
PAD = w2i["<pad>"] # index 0 is padding
UNK = w2i["<unk>"] # index 1 is for UNK
t2i = defaultdict(lambda: len(t2i))
TPAD = w2i["<pad>"]  # value 0, used to pad the label vectors; note it is not added to t2i, so tag indices start at 0

X_train = [['From', 'the', 'AP', 'comes', 'this', 'story', ':'],['President', 'Bush', 'on', 'Tuesday', 'nominated', 'two', 'individuals', 'to', 'replace', 'retiring', 'jurists', 'on', 'federal', 'courts', 'in', 'the', 'Washington', 'area', '.']]
y_train = [['ADP', 'DET', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT'],['PROPN', 'PROPN', 'ADP', 'PROPN', 'VERB', 'NUM', 'NOUN', 'PART', 'VERB', 'VERB', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'PROPN', 'NOUN', 'PUNCT']]
X_dev = [['Bush', 'nominated', 'Jennifer', 'M.', 'Anderson', 'for', 'a', '15', '-', 'year', 'term', 'as', 'associate', 'judge', 'of', 'the', 'Superior', 'Court', 'of', 'the', 'District', 'of', 'Columbia', ',', 'replacing', 'Steffen', 'W.', 'Graae', '.']]
y_dev = [['PROPN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'ADP', 'DET', 'NUM', 'PUNCT', 'NOUN', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'PROPN', 'PROPN', 'ADP', 'DET', 'PROPN', 'ADP', 'PROPN', 'PUNCT', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT']]

# convert words to indices, taking care of UNKs
X_train_num = [[w2i[word] for word in sentence] for sentence in X_train]
w2i = defaultdict(lambda: UNK, w2i) # freeze
X_dev_num = [[w2i[word] for word in sentence] for sentence in X_dev]

# same for labels/tags
y_train_num = [[t2i[tag] for tag in sentence] for sentence in y_train]
t2i = defaultdict(lambda: UNK, t2i) # freeze
y_dev_num = [[t2i[tag] for tag in sentence] for sentence in y_dev]


In [83]:
np.unique([y for sent in y_train for y in sent ])

array(['ADJ', 'ADP', 'DET', 'NOUN', 'NUM', 'PART', 'PROPN', 'PUNCT', 'VERB'], 
      dtype='<U5')

In [84]:
num_classes = len(np.unique([y for sent in y_train for y in sent]))

Now we need to convert our labels into one-hot-encodings. Notice that we actually need a label **for every input** (a one-hot encoding for every input token).

In [85]:
num_labels = len(np.unique([y for sent in y_train for y in sent ]))
y_train_1hot = [np_utils.to_categorical([t2i[tag] for tag in instance_labels], num_classes=num_labels) for instance_labels in y_train]
y_dev_1hot  = [np_utils.to_categorical([t2i[tag] for tag in instance_labels], num_classes=num_labels) for instance_labels in y_dev]

# now a single instance is a 2d object (matrix)
print(y_train_1hot[0])

[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.]]


In [86]:
print(y_train[0])

['ADP', 'DET', 'PROPN', 'VERB', 'DET', 'NOUN', 'PUNCT']


Still, our instances and their labels are of different lengths, so we need to pad them as well (we simply reuse `max_sentence_length` from the sentiment data).


In [87]:
X_train_pad = sequence.pad_sequences(X_train_num, maxlen=max_sentence_length, value=PAD)
X_dev_pad = sequence.pad_sequences(X_dev_num, maxlen=max_sentence_length, value=PAD)

In [88]:
y_train_1hot_pad = sequence.pad_sequences(y_train_1hot, maxlen=max_sentence_length, value=TPAD)
y_dev_1hot_pad = sequence.pad_sequences(y_dev_1hot, maxlen=max_sentence_length, value=TPAD)

In [89]:
print(y_train_1hot_pad.shape)

(2, 59, 9)


So we have now labels that are 3 dimensional! (num examples x max len x num labels)

In [90]:
np.random.seed(113) #set seed before any keras import
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM
from keras.layers.wrappers import TimeDistributed


model = Sequential()
model.add(Embedding(vocabulary_size, embeds_size, input_length=max_sentence_length,mask_zero=True))
model.add(LSTM(32, return_sequences=True)) # return_sequences=TRUE! 
model.add(TimeDistributed(Dense(num_labels))) # TimeDistributed 
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


Note:
* `return_sequences=True` (to get a prediction for every time step)
* similarly, add a `TimeDistributed` wrapper around the dense output
* now use softmax with categorical_crossentropy (as we have more than 2 labels)
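Conceptually, these notes mean that the same dense layer and softmax are applied independently at every time step. A NumPy sketch with random stand-in values for the LSTM outputs and hypothetical names (this is not how `keras` implements the wrapper internally):

```python
import numpy as np

rng = np.random.RandomState(113)
num_steps, hidden_size, num_tags = 6, 32, 9

hidden_states = rng.normal(size=(num_steps, hidden_size))  # stand-in for the LSTM outputs
W_out = rng.normal(scale=0.1, size=(hidden_size, num_tags))
b_out = np.zeros(num_tags)

def softmax_rows(scores):
    """Softmax applied independently to every row (time step)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# the same weights are applied at every time step
tag_probs = softmax_rows(hidden_states @ W_out + b_out)
print(tag_probs.shape)        # (6, 9)
print(tag_probs.sum(axis=1))  # every row sums to 1
```

Each row is a distribution over the tags for one token, so taking the `argmax` per row yields one predicted tag per time step.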

In [91]:
model.fit(X_train_pad, y_train_1hot_pad)
loss, accuracy = model.evaluate(X_dev_pad, y_dev_1hot_pad)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [92]:
print("Accuracy: ", accuracy *100)

Accuracy:  17.2413796186


This model is not very good, since we trained it on a tiny amount of data! Nevertheless we have a first tagger!

However, please make sure you check the actual **OUTPUT** of your system, do not just look at the accuracy scores in Keras. In fact, the  **accuracy** of a **structured prediction** problem in Keras is actually pretty **misleading** as `Keras` does not disregard zero-padded input in the accuracy calculation.

It is important that you get the actual predictions back and evaluate accordingly. This will also allow you to **inspect** your model's predictions. 

In [61]:
X_dev_pad #need to ignore padding tokens!

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10, 13,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  3,  1,  1,  1,  3,  1,
         1,  1,  1,  1,  1,  1,  1, 25]], dtype=int32)
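One way to evaluate while ignoring the padding positions (index 0 in the array above) is to compute token accuracy only where the input holds a real word. The function below is a hypothetical sketch on toy data, not a `keras` utility. With the tagger above, one would presumably pass in `X_dev_pad`, `y_dev_1hot_pad` and the output of `model.predict(X_dev_pad)`.

```python
import numpy as np

def masked_accuracy(x_padded, y_gold_1hot, y_pred_probs, pad_index=0):
    """Token accuracy computed only over positions that are not padding."""
    gold = y_gold_1hot.argmax(axis=-1)   # (instances, max_len)
    pred = y_pred_probs.argmax(axis=-1)  # (instances, max_len)
    real_tokens = x_padded != pad_index  # True where there is an actual word
    return (gold == pred)[real_tokens].mean()

# toy data: one sentence, padded in front to length 4, three possible tags
x_toy = np.array([[0, 0, 10, 13]])
y_toy_gold = np.array([[[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 1]]])
y_toy_pred = np.array([[[0.5, 0.3, 0.2], [0.4, 0.4, 0.2],
                        [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]])

# naive accuracy also counts the padded positions
naive = (y_toy_gold.argmax(axis=-1) == y_toy_pred.argmax(axis=-1)).mean()
print(naive)                                           # 0.75
print(masked_accuracy(x_toy, y_toy_gold, y_toy_pred))  # 0.5
```

In this toy case the padded positions happen to count as correct, which inflates the naive score. On real data the distortion can go in either direction.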