Keras Tutorial - Natural Language Understanding

http://chsasank.github.io/spoken-language-understanding.html

Natural language understanding (NLU) is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension. NLU is considered an AI-hard problem.
The process of disassembling and parsing input is more complex than the reverse process of assembling output in natural language generation because of the occurrence of unknown and unexpected features in the input and the need to determine the appropriate syntactic and semantic schemes to apply to it, factors which are pre-determined when outputting language.[dubious – discuss]
There is considerable commercial interest in the field because of its application to news-gathering, text categorization, voice-activation, archiving, and large-scale content-analysis.

Problem and Dataset

NLU: extract meaning of speech -- still an unsolved problem

We breat this problem into a solvable practical problem of understanding the speaker in a limited context.

In particular, we want to identify the intent of a speaker asking information about flights

Dataset: Airline Travel Information System (ATIS)
- Collected by DARPA in early 90s
- ATIS consists of spoken queries on flight related information
- eg, I want to go from Boston to Atlanta on Monday
- Understanding this is then reduced to identifying arguments like Destination and Departure Day
- This task is called `slot-filling`

Example sentence:

|Words|Show|flights|from|Boston|to|New|York|today|
|Labels|O|O|O|B-dept|O|B-arr|I-arr|B-date|

ATIS contains 5M sentences for total of 56,590 / 9,198 words (avg sentence length is 15) in the train / test set.

Number of classes (different slots) is 128 including the O label (NULL)

Unseen words in the test set are encoded `<UNK>` and each digit is replaced with string `DIGIT`, ie 20 is convert to `DIGITDIGIT`

Our approach is to use:
- Word embeddings
- Recurrent neural networks


Word Embeddings

Word embeddings map words to a vector in a high-dimensional space.  

If learned the right way, these word embeddings can learn semantic and syntactic information of the words, ie, similar words are close to each other in this space and dissimilar words are far apart.

These can be learned either using large amounts of text like Wikipedia or specifically for a given problem.  We will take the second approach.

Eg, nearest neighbords in the word embedding space for some of the words

sunday	delta	california	boston	august	time	car
wednesday	continental	colorado	nashville	september	schedule	rental
saturday	united	florida	toronto	july	times	limousine
friday	american	ohio	chicago	june	schedules	rentals
monday	eastern	georgia	phoenix	december	dinnertime	cars
tuesday	northwest	pennsylvania	cleveland	november	ord	taxi
thursday	us	north	atlanta	april	f28	train
wednesdays	nationair	tennessee	milwaukee	october	limo	limo
saturdays	lufthansa	minnesota	columbus	january	departure	ap
sundays	midwest	michigan	minneapolis	may	sfo	later

Recurrent Neural Networks (RNN)

Convolutional layers can be a great way to pool local information, but they don't really capture the sequentiality of the data.

RNNs help us tackle sequential information like natural language.

If we're going to predict properties of the current word, we better remember the words before it too.  

An RNN has such an internal state / memory which stores the summary of the sequence it has seen so far.

This allows the the RNN to solve complicated word tagging problems like part of speech (POS) tagging or slot filling as in our case.

Diagram:

![RNN][rnn-diagram]
[rnn-diagram]: http://chsasank.github.io/assets/images/slu/rnn.gif "RNN"

- x1, x2, ..., xt-1, xt, xt+1.. is the input to RNN
- s_t is the hidden state of RNN, calculated based on state at step t-1
- s_t = f(Ux_t + s_t-1) / f is nonlinearity like tanh or ReLU
- o_t is the output at step t.  Computed as o_t = F(Vs_t)
- U, V, W are the learnable parameters of RNN

For our problem, we will pass word embedding sequence as the input to the RNN


Putting it all together

Since we are using IOB representation for labels, it’s not trivial to calculate the scores of our model. We therefore use the conlleval perl script (http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt) to compute the F1 Scores (https://en.wikipedia.org/wiki/F1_score). I’ve adapted the code from here(https://github.com/mesnilgr/is13) for the data preprocessing and score calculation. Complete code is available at GitHub (https://github.com/chsasank/ATIS.keras)

$ git clone https://github.com/chsasank/ATIS.keras.git
$ cd ATIS.keras

I recommend using jupyter notebook to run and experiment with the snippets from the tutorial.


In [1]:
# loading data

# load data using data.load.atisfull() -- will download data first time it is run
# words and labels are encoded as indices to a vocabulary
# this vocabulary is stored in w2idx and labels2idx

import numpy as np
import data.load

train_set, valid_set, dicts = data.load.atisfull()
w2idx, labels2idx = dicts['words2idx'], dicts['labels2idx']

train_x, _, train_label = train_set
val_x, _, val_label = valid_set

# create index to word / label dicts
idx2w = {w2idx[k]:k for k in w2idx}
idx2la = {labels2idx[k]:k for k in labels2idx}

# for conlleval script
words_train = [ list(map(lambda x: idx2w[x], w)) for w in train_x]
labels_train = [ list(map(lambda x: idx2la[x], y)) for y in train_label]
words_val = [ list(map(lambda x: idx2w[x], w)) for w in val_x]
labels_val = [ list(map(lambda x: idx2la[x], y)) for y in val_label]

n_classes = len(idx2la)
n_vocab = len(idx2w)


In [2]:
print("Example sentence : {}".format(words_train[0]))
print("Encoded form: {}".format(train_x[0]))
print()
print("It's label : {}".format(labels_train[0]))
print("Encoded form: {}".format(train_label[0]))


Example sentence : ['i', 'want', 'to', 'fly', 'from', 'boston', 'at', 'DIGITDIGITDIGIT', 'am', 'and', 'arrive', 'in', 'denver', 'at', 'DIGITDIGITDIGITDIGIT', 'in', 'the', 'morning']
Encoded form: [232 542 502 196 208  77  62  10  35  40  58 234 137  62  11 234 481 321]

It's label : ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'O', 'B-depart_time.time', 'I-depart_time.time', 'O', 'O', 'O', 'B-toloc.city_name', 'O', 'B-arrive_time.time', 'O', 'O', 'B-arrive_time.period_of_day']
Encoded form: [126 126 126 126 126  48 126  35  99 126 126 126  78 126  14 126 126  12]


Keras Model

Keras has inbuilt `Embedding` layer for word embeddings. It expects integer indices.

`SimpleRNN` is the RNN layer described above.  We will have to use `TimeDistrubuted` to pass the output of the RNN $o_t$ at each time step $t$ to a fully connected layer.  Otherwise, output at the final step will be passed on to the next layer.

```
keras.layers.embeddings.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)

Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

This layer can only be used as the first layer in a model.

model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch, input_length).
  # the largest integer (i.e. word index) in the input should be no larger than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)
```

```
keras.layers.recurrent.SimpleRNN(units, activation='tanh', use_bias=True, kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal', bias_initializer='zeros', kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
```

In [3]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import SimpleRNN
from keras.layers.core import Dense, Dropout
from keras.layers.wrappers import TimeDistributed
from keras.layers import Convolution1D

model = Sequential()
model.add(Embedding(n_vocab, 100))
model.add(Dropout(0.25))
model.add(SimpleRNN(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile('rmsprop', 'categorical_crossentropy')

Using TensorFlow backend.


Training

We will pass each sentence as a batch to the model.

We cannot use `model.fit()` as it expects all the sentences to be of the same size.

We will therefore use `model.train_on_batch()`


In [None]:
import progressbar
n_epochs = 30

for i in range(n_epochs):
    bar = progressbar.ProgressBar(max_value=len(train_x))
    for n_batch, sent in bar(enumerate(train_x)):
        label = train_label[n_batch]
        # Make labels one hot
        label = np.eye(n_classes)[label][np.newaxis,:]
        # View each sentence as a batch
        sent = sent[np.newaxis,:]
        
        if sent.shape[1] > 1: # ignore 1 word sentences
            model.train_on_batch(sent, label)

100% (4978 of 4978) |#####################| Elapsed Time: 0:05:25 Time: 0:05:25
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:11 Time: 0:05:11
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:12 Time: 0:05:12
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:10 Time: 0:05:10
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:10 Time: 0:05:10
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:11 Time: 0:05:11
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:09 Time: 0:05:09
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:10 Time: 0:05:10
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:11 Time: 0:05:11
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:10 Time: 0:05:10
100% (4978 of 4978) |#####################| Elapsed Time: 0:05:10 Time: 0:05:10
 45% (2281 of 4978) |##########            | Elapsed Time: 0:02:22 ETA: 0:02:49

Evaluation

To measure the accuracy of the model, we use `model.predict_on_batch()` and `metrics.accuracy.conlleval()`

In [None]:
from metrics.accuracy import conlleval

labels_pred_val = []

bar = progressbar.ProgressBar(max_value=len(val_x))
for n_batch, sent in bar(enumerate(val_x))
    label = val_label[n_batch]
    label = np.eye(n_classes)[label][np.newaxis,:]
    sent = sent[np.newaxis,:]
    
    pred = model.predict_on_batch(sent)
    pred = np.argmx(pred,-1)[0]
    labels_pred_val.append(pred)

labels_pred_val = [list(map(lambda x: idx2la[x], y)) \
                                   for y in labels_pred_val]
con_dict = conlleval(labels_pred_val, labels_val,
                    words_val, 'measure.txt')

print('Precision = {}, Recall = {}, F1 = {}'.format(
            con_dict['r'], con_dict['p'], con_dict['f1']))

With this model, I get 92.36 F1 Score.

`Precision = 92.07, Recall = 92.66, F1 = 92.36`

Note that for the sake of brevity, I’ve not showed logging part of the code. Loggging losses and accuracies is an important part of coding up an model. An improved model (described in the next section) with logging is at main.py. You can run it as :

`python main.py`

Improvements

One drawback with our current model is that there is no lookahead, ie, the output $o_t$ depends only on the current and previous words but not on the words next to it.  One can imagine that clues about the properties of the current word is also held about the next word.

Lookahead can easily be implemented by having a convolutional layer before RNN and after word embeddings:

```
model = Sequential()
model.add(Embedding(n_vocab,100))
model.add(Convolution1D(128, 5, border_mode='same', activation='relu'))
model.add(Dropout(0.25))
model.add(GRU(100,return_sequences=True))
model.add(TimeDistributed(Dense(n_classes, activation='softmax')))
model.compile('rmsprop', 'categorical_crossentropy')
```
With this improved model, I get 94.90 F1 Score.



Conclusion

In this tutorial, we have learnt about word embeddings and RNNs. We have applied these to a NLP problem: ATIS. We also have made an improvement to our model.

To improve the model further, we could try using word embeddings learnt on a large corpus like Wikipedia. Also, there are variants of RNNs like LSTM or GRU which can be experimented with.
