## LSTMs for classification

In this notebook, we use LSTMs to predict the label (e.g. sentiment) of a sequence.

We are going to use `keras` to build an LSTM network with the `keras.layers.LSTM` layer. First, let's install the libraries `tensorflow` and `keras`. This may take a few seconds.

In [None]:
!pip install tensorflow

In [None]:
!pip install keras

In [None]:
# With Anaconda, install keras via: conda install -c conda-forge keras
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

The IMDB dataset is documented at: https://keras.io/api/datasets/imdb/#getwordindex-function

In [None]:
max_features = 2000 # use top max_features most common words to build a vocabulary

Loading data (and reducing its size):

In [None]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = x_train[:1000]
x_test = x_test[:1000]
y_train = y_train[:1000]
y_test = y_test[:1000]
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Just to give you an idea of what the sequences look like (each number represents a different word):

In [None]:
print("X-vector: "+str(x_train[0]))
print("Label: "+str(y_train[0]))

For the curious, the next cell shows how to retrieve the dictionary that maps word indices back to words.
For more details, see https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset

In [None]:
INDEX_FROM=3   # word index offset, by default

word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2  # words that are not in the vocabulary
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}
# use `idx` rather than `id` to avoid shadowing the built-in function
print(' '.join(id_to_word[idx] for idx in x_train[0]))
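
To see why the offset is needed, here is a minimal sketch with a made-up three-word index. The names `toy_word_index` and `toy_review` are hypothetical and not taken from the dataset. The sketch assumes the default `imdb.load_data` conventions: raw word ranks are shifted by `INDEX_FROM`, and the ids 0, 1 and 2 are reserved for padding, start-of-sequence and out-of-vocabulary words.

```python
INDEX_FROM = 3  # same offset as in the cell above

# Hypothetical word index in the style of imdb.get_word_index(): ranks start at 1
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by the offset, then add the reserved tokens
toy_word_to_id = {word: rank + INDEX_FROM for word, rank in toy_word_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}


def decode(encoded):
    """Map a list of integer ids back to a space-separated string."""
    return " ".join(toy_id_to_word.get(i, "<UNK>") for i in encoded)


# Hypothetical encoded review: <START> the movie <UNK> great
toy_review = [1, 4, 5, 2, 6]
decoded = decode(toy_review)
print(decoded)
```

Without the shift, id 4 would be looked up as rank 4 instead of rank 1, and every decoded word would be wrong.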

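Since sequences (in this case reviews) can have different lengths, we need to bring them all to the same length `maxlen`. With its default settings, `pad_sequences` adds zeros to the beginning of sequences shorter than `maxlen` and truncates longer sequences by dropping words from the beginning, so the network can process every sequence step by step over the same number of time steps: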

In [None]:
# make sure sequences have same length
maxlen = 80  # pad or truncate every review to this many words (by default the last maxlen words are kept)

print('Transform sequences')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

In [None]:
print("X-vector: "+str(x_train[0]))
print("Label: "+str(y_train[0]))
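
As a rough illustration of what the padding step does, the following pure-Python sketch mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) on two made-up sequences. The helper `pad_or_truncate` is a hypothetical name introduced here; it is not part of Keras and is not how Keras implements the function.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    if len(seq) >= maxlen:
        # too long: drop items from the beginning
        return seq[len(seq) - maxlen:]
    # too short: add padding values at the beginning
    return [value] * (maxlen - len(seq)) + seq


short_seq = [5, 6, 7]              # made-up word ids
long_seq = [1, 2, 3, 4, 5, 6, 7]   # made-up word ids

padded_short = pad_or_truncate(short_seq, maxlen=5)
padded_long = pad_or_truncate(long_seq, maxlen=5)

print(padded_short)  # [0, 0, 5, 6, 7]
print(padded_long)   # [3, 4, 5, 6, 7]
```

Both results have length 5, which is what allows the sequences to be stacked into a single array of shape `(number of sequences, maxlen)`.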

Note:

When working directly with text, we need an embedding layer. It represents each word as a dense vector, i.e. a projection of the word into a continuous vector space, and these vectors are learned together with the rest of the network.
See https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ for more details.
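
Conceptually, an embedding layer is a lookup table with one row per word id. The sketch below illustrates that idea in plain Python with a hypothetical vocabulary of 6 ids and vectors of size 4. The table here holds random numbers, whereas in a real model the rows are trained; this is only an illustration, not Keras's implementation.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

vocab_size = 6     # hypothetical tiny vocabulary
embedding_dim = 4  # hypothetical vector size (the model below uses 128)

# One row of small random numbers per word id; in a real model these are learned
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(sequence_ids):
    """Replace each integer id by its row of the embedding table."""
    return [embedding_table[i] for i in sequence_ids]


toy_sequence = [0, 0, 4, 5, 2]  # a made-up padded sequence of word ids
embedded = embed(toy_sequence)
print(len(embedded), "time steps,", len(embedded[0]), "values per step")
```

A sequence of 5 ids becomes 5 vectors of size 4, and identical ids (here the two padding zeros) map to the same vector.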

In [None]:
print('Build model...')
model = Sequential()
no_dim = 128

# First we create an embedding of dimensionality no_dim (128) for each word.
# The embedding size and the number of LSTM units are independent;
# here both are set to no_dim for simplicity.
model.add(Embedding(max_features, no_dim))

# dropout = fraction of units dropped for the linear transformation of the inputs
# recurrent_dropout = fraction of units dropped for the linear transformation of the recurrent state
model.add(LSTM(no_dim, dropout=0.2, recurrent_dropout=0.2))

# dimensionality of the output space = 1: the labels are binary (0 or 1),
# so a single sigmoid unit outputs the probability of label 1
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy','mae'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          validation_data=(x_test, y_test))
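
To make the output layer and the loss more concrete, here is a small worked example in plain Python. It assumes the standard definitions of the sigmoid function and of binary cross-entropy for a single example; the input value `logit = 2.0` is made up, and the helper functions are illustrative rather than the ones Keras uses internally.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p, eps=1e-7):
    """Loss for one example with true label y_true (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


logit = 2.0            # made-up pre-activation value of the output unit
prob = sigmoid(logit)  # predicted probability of label 1

loss_if_positive = binary_crossentropy(1, prob)  # true label is 1
loss_if_negative = binary_crossentropy(0, prob)  # true label is 0

print(round(prob, 3), round(loss_if_positive, 3), round(loss_if_negative, 3))
```

The loss is small when the predicted probability agrees with the true label and large when it does not, which is what training minimizes.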

Evaluation happens as follows:

In [None]:
evaluation = model.evaluate(x_test, y_test, return_dict=True)
print(evaluation)
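
The returned dictionary contains the loss together with the two metrics requested in `model.compile`. As a sketch of what those metrics measure, the example below computes accuracy and mean absolute error (MAE) by hand on four made-up predictions. It assumes that a predicted probability above 0.5 counts as label 1; the values in `y_true_toy` and `y_prob_toy` are hypothetical.

```python
y_true_toy = [1, 0, 1, 1]          # made-up true labels
y_prob_toy = [0.9, 0.4, 0.3, 0.6]  # made-up predicted probabilities

# Accuracy: threshold the probabilities, then count the matches
y_pred_toy = [1 if p > 0.5 else 0 for p in y_prob_toy]
accuracy = sum(int(t == p) for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

# MAE: average distance between the true label and the predicted probability
mae = sum(abs(t - p) for t, p in zip(y_true_toy, y_prob_toy)) / len(y_true_toy)

print("accuracy:", accuracy)
print("mae:", round(mae, 3))
```

Accuracy only looks at which side of the threshold a prediction falls on, while MAE also reflects how confident the prediction was.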

For more details, see the `keras.layers.LSTM` documentation:
https://keras.io/api/layers/recurrent_layers/lstm/

Or watch this YouTube tutorial on IMDB classification using `LSTM`:
https://www.youtube.com/watch?v=95F26zyK-c4