## Spoken Language Understanding / Argument Tagging

This homework is an adaptation of this Theano tutorial: http://deeplearning.net/tutorial/rnnslu.html

In this homework, you will train a Keras model for the Spoken Language Understanding task, which consists of assigning a label to each word in a given sentence. It is a sequence labelling task.

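An old and small benchmark for this task is the ATIS (Airline Travel Information System) dataset collected by DARPA. Here is an example sentence (or utterance) in the Inside-Outside-Beginning (IOB) representation, where a B- label marks the first word of a slot, an I- label marks a continuation of the same slot, and O marks words outside any slot.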


|Input (words)|show|flights|from|Boston|to|New|York|today|
|---|---|---|---|---|---|---|---|---|
|Output (labels)|O|O|O|B-dept|O|B-arr|I-arr|B-date|
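
To make the IOB scheme concrete, here is a minimal sketch that groups the tagged words from the table above into slot spans. The helper name `iob_to_spans` is hypothetical and not part of the homework code.

```python
def iob_to_spans(words, tags):
    """Group words into (slot, phrase) pairs according to their IOB tags."""
    spans = []
    current_slot, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span; close the previous one first.
            if current_slot is not None:
                spans.append((current_slot, " ".join(current_words)))
            current_slot, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            # An I- tag continues the open span of the same slot.
            current_words.append(word)
        else:
            # "O" (or an inconsistent I- tag) ends any open span.
            if current_slot is not None:
                spans.append((current_slot, " ".join(current_words)))
            current_slot, current_words = None, []
    if current_slot is not None:
        spans.append((current_slot, " ".join(current_words)))
    return spans


words = ["show", "flights", "from", "Boston", "to", "New", "York", "today"]
tags = ["O", "O", "O", "B-dept", "O", "B-arr", "I-arr", "B-date"]
spans = iob_to_spans(words, tags)
print(spans)
```

Note how "New York" becomes a single arrival slot because "York" carries the I-arr label.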

The ATIS official split contains 4,978/893 sentences for a total of 56,590/9,198 words (average sentence length is 15) in the train/test set. The number of classes (different slots) is 128 including the O label (NULL).

Unseen words in the test set are handled by replacing every word that occurs only once in the training set with `<UNK>` and using this token to represent unseen words in the test set. Sequences of digits are converted to repetitions of the string DIGIT, e.g. 1984 is converted to DIGITDIGITDIGITDIGIT.
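
The preprocessing described above could look roughly like the following sketch. It is only an illustration on toy data: the function names are hypothetical, and how the original ATIS preprocessing treats tokens that mix letters and digits is an assumption here.

```python
import re
from collections import Counter


def normalize_digits(token):
    """Replace every digit character with the string DIGIT."""
    return re.sub(r"\d", "DIGIT", token)


def build_vocab(train_sentences, unk="<UNK>"):
    """Keep words seen more than once in training, plus the unk token."""
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c > 1} | {unk}


def apply_vocab(sentence, vocab, unk="<UNK>"):
    """Map every out-of-vocabulary word to the unk token."""
    return [w if w in vocab else unk for w in sentence]


toy_train = [["flights", "to", "boston"],
             ["flights", "to", "denver", "in", "1984"]]
toy_train = [[normalize_digits(w) for w in sent] for sent in toy_train]
vocab = build_vocab(toy_train)

train_example = apply_vocab(toy_train[1], vocab)
test_example = apply_vocab(["flights", "to", "paris"], vocab)
print(train_example)
print(test_example)
```

Because singleton training words are also mapped to `<UNK>`, the model sees this token during training and can learn a useful representation for it.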
    
**There are 10 points in total for this homework. Send the completed notebook to beroth@cis.uni-muenchen.de. The deadline is Tuesday, December 12, 23:59. You can work in teams of 2 or 3.**

First, you need to download the data from the course homepage (atis.json). Then you can load it into the notebook:

In [None]:
import numpy as np
import math
from keras.utils import to_categorical
import json
import keras
from keras import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense, Bidirectional
np.random.seed(1)

with open("atis.json", "r") as f:
    data = json.load(f)

train_dev_sents = data["train_sents"] # list of lists
train_dev_labels = data["train_labels"] # list of lists
num_train = math.floor(0.8 * len(train_dev_sents))
train_sents = train_dev_sents[:num_train]
train_labels = train_dev_labels[:num_train]
dev_sents = train_dev_sents[num_train:]
dev_labels = train_dev_labels[num_train:]
test_sents = data["test_sents"]
test_labels = data["test_labels"]
word_to_id = data["vocab"]
label_to_id = data["label_dict"]

**TODO: How many sentences are in the train, dev and test set? (0.5 p.)**

In [None]:
# TODO: Your answer here.

Next, let's define some constants that we'll use later:

In [None]:
UNK_TOKEN = "<UNK>"
PAD_TOKEN = "<PAD>"
VOCAB_SIZE = len(word_to_id)
NUM_LABELS = len(label_to_id)
EMBEDDING_SIZE = 50
HIDDEN_SIZE = 50
MAX_LENGTH = 20

**TODO: What are the token ids of the `<UNK>` and `<PAD>` tokens, and what is the label id of "O"? (0.5 p.)**

In [None]:
# TODO: Your answer here.

**TODO: Print the string representation of the words (not ids) in the first sentence. (1 p.)**

In [None]:
# TODO: Your code here.
# You may find it helpful to create dictionaries from ids to strings:
id_to_word = dict() # TODO
id_to_label = dict() # TODO

Now, let's bring the data into the format needed by Keras.

**TODO: Create a numpy matrix of size num_training_sentences x MAX_LENGTH. Do the same for the dev and test sets. (2 p.)**

**Hint:** Use the Keras function `pad_sequences` to trim/pad each sentence to exactly the desired length: https://keras.io/preprocessing/sequence/
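
To illustrate what trimming and padding to a fixed length means, here is a plain numpy sketch on toy ids. The helper `pad_or_trim` is hypothetical and pads and cuts at the end of each sequence; note that Keras `pad_sequences` pads and truncates at the beginning by default (`padding="pre"`, `truncating="pre"`), so its output differs unless you pass `"post"` for both.

```python
import numpy as np


def pad_or_trim(sequences, length, padding_value):
    """Return an int array of shape (len(sequences), length).

    Shorter sequences are filled with padding_value at the end,
    longer sequences are cut off at the end.
    """
    out = np.full((len(sequences), length), padding_value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:length]
        out[row, :len(trimmed)] = trimmed
    return out


toy = [[5, 6, 7], [1, 2, 3, 4, 8, 9]]
padded = pad_or_trim(toy, length=4, padding_value=0)
print(padded)
print(padded.shape)
```

Whichever side you choose, use the same setting for sentences and labels so that each word stays aligned with its label.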

In [None]:
def do_padding(sequences, length, padding_value):
    pass # TODO
train_sents_padded = None # do_padding(...) # TODO
dev_sents_padded = None #do_padding(...) # TODO
test_sents_padded = None #do_padding(...) # TODO

Let's do the same for the labels.

**TODO: Create numpy matrices to encode the labels. In addition to padding, you need to transform the label ids into "one-hot encodings", vectors that are 1 for the specified label, and 0 otherwise. The resulting matrices should have shape num_sentences x MAX_LENGTH x NUM_LABELS. (1 p.)**

**Hint:** Use the Keras function `to_categorical`: https://keras.io/utils/#to_categorical
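
As a small illustration of the one-hot encoding itself, the numpy sketch below turns a toy matrix of label ids into a three-dimensional array by indexing into an identity matrix. The toy sizes are made up; `to_categorical` with `num_classes` set should produce an array of the same shape.

```python
import numpy as np

num_labels = 3  # toy value, not the real NUM_LABELS
label_ids = np.array([[0, 2, 1, 0],
                      [1, 1, 0, 0]])  # shape: (num_sentences, max_length)

# Row i of the identity matrix is the one-hot vector for label i,
# so indexing with the id matrix appends a num_labels axis.
one_hot = np.eye(num_labels, dtype="float32")[label_ids]

print(one_hot.shape)
print(one_hot[0, 1])
```

Taking `np.argmax(one_hot, axis=-1)` recovers the original label ids, which is a convenient sanity check.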

In [None]:
train_labels_padded = None # TODO
dev_labels_padded = None # TODO
test_labels_padded = None # TODO

We are ready to define the LSTM model. It consists of the following components:
* An embedding layer that learns word vectors that are passed on to the LSTM as an input.
* The LSTM layer. We use a bidirectional LSTM to consider information from left and right.
* A final layer that predicts a label for each position in the sentence from the LSTM hidden states.
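
The numpy sketch below walks through the tensor shapes produced by the three components listed above. It is not a Keras model and contains no real LSTM: the recurrent layer is replaced by a random stand-in whose only purpose is to show that concatenating forward and backward states gives 2 * hidden_size features per position (the default merge mode of the Keras `Bidirectional` wrapper). All sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_length = 2, 5
vocab_size, embedding_size, hidden_size, num_labels = 10, 4, 3, 6

token_ids = rng.integers(0, vocab_size, size=(batch, max_length))

# Embedding layer: a lookup table with one row per word id.
embedding_table = rng.normal(size=(vocab_size, embedding_size))
embedded = embedding_table[token_ids]  # (batch, max_length, embedding_size)

# Stand-in for the bidirectional LSTM output (random values, NOT an LSTM):
# forward and backward hidden states are concatenated per position.
lstm_states = rng.normal(size=(batch, max_length, 2 * hidden_size))

# Final layer: the same dense weights are applied at every position,
# followed by a softmax over the labels.
w = rng.normal(size=(2 * hidden_size, num_labels))
b = np.zeros(num_labels)
logits = lstm_states @ w + b
logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(embedded.shape, lstm_states.shape, probs.shape)
```

The output has one probability distribution per word position, which matches the shape of the one-hot label matrices.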

**TODO: Create the embedding layer for the vocabulary. It learns lookup vectors (size: EMBEDDING_SIZE) for all words in the vocabulary. We store the embedding layer in an extra variable so that we can inspect it later. (1 p.)**


In [None]:
model = Sequential()
embedding_layer = None # TODO
model.add(embedding_layer)

**TODO: Add a bidirectional LSTM layer with hidden units of size HIDDEN_SIZE. The LSTM should return the sequence of hidden states. (1 p.)**

In [None]:
model.add(None) # TODO

**TODO: Output a probability distribution over the possible labels at each time step, i.e. apply a Dense layer with softmax activation at every time step. (1 p.)**

**Hint:** Use TimeDistributed: https://keras.io/layers/wrappers/

In [None]:
model.add(None) # TODO

We compile the model with the 'adam' optimizer and 'categorical_crossentropy' as the loss (this corresponds to negative log-likelihood). We also monitor the 'accuracy' as the metric of interest.
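
To see why categorical cross-entropy corresponds to negative log-likelihood, the following numpy sketch computes it by hand for two toy positions. With one-hot targets, the sum over labels picks out the log-probability of the correct label. The function is an illustrative re-implementation, not the Keras code, and the clipping constant is an assumption.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability that the model assigns to the true label."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    per_position = -np.sum(y_true * np.log(y_pred), axis=-1)
    return float(np.mean(per_position))


# One sentence, two positions, three labels.
y_true = np.array([[[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]]])
y_pred = np.array([[[0.5, 0.25, 0.25],
                    [0.25, 0.5, 0.25]]])

loss = categorical_crossentropy(y_true, y_pred)
print(loss)  # equals -log(0.5), since both true labels get probability 0.5
```

Keep in mind that padded positions also count towards the loss and the accuracy unless they are masked out.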


In [None]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_sents_padded, train_labels_padded, batch_size=8, epochs=10,
          validation_data=(dev_sents_padded, dev_labels_padded))

**TODO: Now, predict the label sequence for the first sentence in the dev data. Print both the sentence and the predicted labels (words/labels, not ids). (1 p.)**

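**Hint:** You can use `model.predict_classes(...)` to obtain the predicted label ids. Which shape does `predict_classes(...)` return? (In recent Keras/TensorFlow releases, where `predict_classes` has been removed, `np.argmax(model.predict(...), axis=-1)` gives the same ids.)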

In [None]:
# TODO: Your code here.

The model is not only good for labelling sequences, it also learns word vectors. Let's inspect them.

**TODO: For each of the query_words, print the 10 words with the highest cosine similarity. (1 p.)**

**Hint:** You can obtain the learned embedding matrix using:
`learned_embeddings = embedding_layer.get_weights()[0]`
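
As a sketch of how a nearest-neighbour lookup by cosine similarity works, the block below ranks a toy vocabulary against a query word. It repeats the `cosine_similarities` helper from this notebook so that it is self-contained; the toy words, the two-dimensional vectors, and the helper `nearest` are made up for illustration.

```python
import numpy as np


def cosine_similarities(m, v):
    """Cosine similarity between each row of the matrix m and the vector v."""
    norm_m = np.sqrt((m ** 2).sum(1))
    norm_v = np.sqrt(v.dot(v))
    return m.dot(v) / (norm_m * norm_v)


toy_words = ["boston", "denver", "monday", "tuesday"]
toy_embeddings = np.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [0.1, 0.9]])


def nearest(query, k):
    """Return the k (word, similarity) pairs most similar to the query word."""
    query_vector = toy_embeddings[toy_words.index(query)]
    sims = cosine_similarities(toy_embeddings, query_vector)
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(toy_words[i], float(sims[i])) for i in order]


neighbours = nearest("boston", k=2)
print(neighbours)
```

The query word itself always ranks first with similarity 1, so you may want to skip it when printing neighbours.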

In [None]:
def cosine_similarities(m, v):
    """ Computes cosine for each row in a numpy matrix m with a vector v."""
    norm_m = np.sqrt((m**2).sum(1))
    norm_v = np.sqrt(v.dot(v))
    return m.dot(v) / (norm_m * norm_v)

query_words = ["boston", "saturday", "DIGIT", "today", "nonstop", "am"]

# TODO: Your code here.