# Build LSTM with Keras

In this notebook, I practised how to implement LSTM with Keras. I basically followed [this tutorial](https://adventuresinmachinelearning.com/keras-lstm-tutorial/), and its [corresponding code](https://github.com/adventuresinML/adventures-in-ml-code/blob/master/lstm_tutorial.py). It's easier for me to start with a high level API like Keras, and get farmiliar with the model, and then use more basic tensorflow blocks to build more customized models.

In [1]:
import numpy as np
import tensorflow as tf

## Fetch and read data

Same data used for embedding

In [2]:
import sys
import os
import urllib

In [3]:
import zipfile

In [4]:
url = 'http://mattmahoney.net/dc/'

In [5]:
def fetch_data(filename, expected_bytes = None):
    '''Download a file if not found'''
    data_dir = os.path.join(os.getcwd(), 'data')
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    local_filename = os.path.join(data_dir, filename)
    if not os.path.exists(local_filename):
        local_filename, _ = urllib.request.urlretrieve(url+filename, local_filename)
    
    filesize = os.stat(local_filename).st_size
    if expected_bytes and filesize == expected_bytes:
        print('Found and verified', filename)
    else:
        print('Downloaded file', filename, 'with size of', filesize)
        if expected_bytes:
            raise Exception('Fail to verify'+local_filename)
    return data_dir, local_filename

In [6]:
datapath, filename = fetch_data('text8.zip', 31344016)

Found and verified text8.zip


In [7]:
def read_data(filename):
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

In [8]:
words = read_data(filename)

In [9]:
len(words)

17005207

## Split them into training, validation and test sets

In [10]:
training_data = words[:(8*len(words))//10]
valid_data = words[(8*len(words))//10:(9*len(words))//10]
test_data = words[(9*len(words))//10:]

### Build dictionary from training dataset

Limit the size to top 10000 words only

In [11]:
from collections import Counter

In [12]:
vocabulary_size = 10000

In [13]:
def build_dicts(words, num_words):
    counts = [('UNK', -1)]
    counts.extend(Counter(words).most_common(num_words-1))
    word2ind = {}
    for i, item in enumerate(counts):
        word2ind[item[0]] = i
    ind2word = dict(zip(word2ind.values(), word2ind.keys()))
    return word2ind, ind2word

In [14]:
word2ind, ind2word = build_dicts(training_data, vocabulary_size)

Convert words to indexes

In [15]:
def words_to_indexes(words, word2ind):
    data = []
    for word in words:
        data.append(word2ind.get(word, word2ind['UNK']))
    return data

In [16]:
training_data = words_to_indexes(training_data, word2ind)
valid_data = words_to_indexes(valid_data, word2ind)
test_data = words_to_indexes(test_data, word2ind)

In [17]:
print(words[:5])

['anarchism', 'originated', 'as', 'a', 'term']


In [18]:
print([ind2word[item] for item in training_data[:5]])

['anarchism', 'originated', 'as', 'a', 'term']


## Mini-batch generator

In [19]:
from keras.utils import to_categorical

Using TensorFlow backend.


In [47]:
class BatchGenerator(object):
    def __init__(self, data, batch_size, num_lstm, vocabulary_size, skip_step=1):
        self.data = data
        self.batch_size = batch_size
        self.num_lstm = num_lstm
        self.vocabulary_size = vocabulary_size
        self.skip_step = skip_step
        self.current_index = 0
        
    def generate(self):
        x = np.ndarray(shape=(self.batch_size, self.num_lstm), dtype=np.int32)
        y = np.ndarray(shape=(self.batch_size, self.num_lstm, vocabulary_size),
                      dtype=np.int32)
        while True:
            if self.current_index+self.num_lstm >= len(self.data):
                self.current_index = 0    
            for i in range(self.batch_size):
                x[i, :] = self.data[self.current_index: self.current_index+self.num_lstm]
                _y = self.data[self.current_index+1: self.current_index+self.num_lstm+1]
                y[i, :, :] = to_categorical(_y, num_classes=self.vocabulary_size)
                self.current_index += self.skip_step
            yield x, y

In [48]:
num_lstm = 20
batch_size = 32
train_data_generator = BatchGenerator(training_data, batch_size, num_lstm, vocabulary_size,
                                     skip_step=num_lstm)
valid_data_generator = BatchGenerator(valid_data, batch_size, num_lstm, vocabulary_size,
                                     skip_step=num_lstm)

## Build LSTM cells with Keras

In [22]:
from keras import Sequential
from keras.layers import LSTM

In [23]:
from keras.layers import Embedding

In [24]:
from keras.layers import Dropout

In [25]:
from keras.layers import TimeDistributed
from keras.layers import Dense

In [26]:
from keras.layers import Activation

In [27]:
from keras.optimizers import Adam

Some hyperparameters

In [28]:
hidden_size = 500
use_dropout = True

In [29]:
model = Sequential()
model.add(Embedding(vocabulary_size, hidden_size, input_length=num_lstm))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(LSTM(hidden_size, return_sequences=True))
if use_dropout:
    model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(vocabulary_size)))
model.add(Activation('softmax'))

In [30]:
model.compile(loss='categorical_crossentropy', optimizer=Adam(),
              metrics=['categorical_accuracy'])

In [31]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 500)           5000000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 20, 500)           2002000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 20, 500)           2002000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 20, 500)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 20, 10000)         5010000   
_________________________________________________________________
activation_1 (Activation)    (None, 20, 10000)         0         
Total params: 14,014,000
Trainable params: 14,014,000
Non-trainable params: 0
________________________________________________________________

## Train model

In [32]:
from keras.callbacks import ModelCheckpoint

In [33]:
checkpointer = ModelCheckpoint(filepath=datapath+'/model-{epoch:02d}.hdf5', verbose=1)

In [34]:
num_epochs = 1

In [49]:
# training
model.fit_generator(train_data_generator.generate(), 
                   len(training_data)//(batch_size*num_lstm),
                   num_epochs,
                   validation_data=valid_data_generator.generate(),
                   validation_steps=len(valid_data)//(batch_size*num_lstm),
                   callbacks=[checkpointer])
model.save(datapath+'/final_model.hdf5')

Epoch 1/1

Epoch 00001: saving model to /Users/yuzhang/ML/RNN/LSTM_Keras/data/model-01.hdf5


## Prediction

In [72]:
temp = next(valid_data_generator.generate())

In [74]:
prediction = model.predict(temp[0])

pred_indexes = np.argmax(prediction, axis=2)

Prediction

In [75]:
for item in pred_indexes[:5]:
    print(' '.join(ind2word[_] for  _ in item))

UNK UNK UNK UNK UNK UNK the to as UNK was UNK in the UNK gulf of after the UNK
for UNK UNK was UNK in in in the UNK of the UNK UNK UNK UNK UNK UNK the one
nine zero zero zero zero est UNK of been the UNK UNK minister of UNK UNK UNK UNK UNK UNK
UNK the UNK to in UNK of the to be for the UNK to one one one UNK been UNK
UNK UNK UNK of the UNK zero years were from in UNK council of UNK UNK council was the of


True output

In [76]:
true_indexes = np.argmax(temp[1], axis=2)
for item in true_indexes[:5]:
    print(' '.join(ind2word[_] for _ in item))

UNK al UNK al UNK from power such a vote is unusual in the arab countries shortly after the vote
UNK UNK he was UNK only briefly after the death of UNK UNK al ahmed al UNK on january one
five two zero zero six the cabinet has recommended the current prime minister UNK al ahmad al UNK al UNK
to be elected UNK the parliament is expected to vote on his appointment in late january UNK has been the
de facto ruler since the two previous UNK fell ill the national assembly the UNK national assembly or UNK al


Given only 1 epoch of training, the model has limited ability to predict the correct next word! More training might give better results. Also attention could also be introduced into the model to yield better predictions.