## Dinosaur names with LSTM

In this notebook we will build a character-level LSTM that can be used to generate dinosaur names. This exercise was inspired by an exercise from Andrew Ng's course on sequence models, and I used the same dataset to train the model.

After we train the model using 1536 dinosaur names, we will sample new dinosaur names one letter at a time using our LSTM model. This notebook consists of the following steps:

<ul>
<li>Implement load_dinosaur_data() function to load dinosaur names from the file</li>
<li>Implement preprocess_data() function to encode dinosaur name characters into tensors for training</li>
<li>Implement build_model() function to define LSTM model architecture</li>
<li>Fit the model using Adam optimizer and cross_entropy loss</li>
<li>Implement sample_name_from_model() function to sample new names one letter at a time</li>
</ul>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import keras
import tensorflow as tf
from keras.optimizers import Adam
from keras.layers import LSTM, Dense, Input, Lambda, Dropout
from keras.utils import to_categorical

%matplotlib inline
np.random.seed(1)

Using TensorFlow backend.


Let's first define some key parameters. The longest dinosaur name is 27 characters, so let's set the model's sequence length to 50 characters in case a generated sequence is a bit longer.

In [2]:
length = 50 # define length of the string to be fed to LSTM

Load the data with dinosaur names.

In [3]:
def load_dinosaur_data(path):
    """
    Loads and returns dinosaur names data
    Arguments:
    path -- a string with a path to the file with dinosaur names
    
    Returns:
    names -- a list of dinosaur names
    chars -- a list with vocabulary of letters
    char_to_idx -- a dictionary mapping characters to their integer encodings
    idx_to_char -- a dictionary mapping character encodings to characters
    """
    
    # Read the file once; each name keeps its trailing '\n',
    # which later serves as the end-of-name character
    with open(path, 'r') as file:
        names = [name.lower() for name in file.readlines()]

    # Vocabulary: every distinct character in the data, including '\n'
    chars = sorted(set("".join(names)))

    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for i, c in enumerate(chars)}

    return names, chars, char_to_idx, idx_to_char

file_path = "./data/dinos.txt"
dinosaur_names, chars, char_to_idx, idx_to_char = load_dinosaur_data(file_path)
vocab_size = len(char_to_idx)
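To make the vocabulary construction concrete, here is a minimal sketch on a three-name toy string rather than the real file. The `toy_*` names are made up for illustration, and the one-name-per-line layout is an assumption about `dinos.txt`. It shows that sorting puts the newline character at index 0, ahead of all letters.

```python
# Toy stand-in for the file contents (assumed layout: one name per line)
toy_text = "Aachenosaurus\nAardonyx\nAbelisaurus\n".lower()

# Same recipe as load_dinosaur_data: sorted set of all characters
toy_chars = sorted(set(toy_text))
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx_to_char = {i: c for i, c in enumerate(toy_chars)}

# '\n' has a smaller code point than any letter, so it sorts first
print(repr(toy_chars[0]), toy_char_to_idx['\n'])
print(len(toy_chars), "distinct characters in the toy vocabulary")
```

On the full dataset the same recipe should give 26 letters plus the newline, which matches the last dimension of 27 in the tensor shapes printed later in this notebook.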

Use the dinosaur names to fill out the data tensor X and the label tensor y. For each name, X holds the one hot encodings of its characters and y holds the one hot encodings of the next (t+1) characters; all positions past the end of the name stay 0 in both tensors. Examples are then shuffled and the data is split into train and test sets (90 / 10).
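The shift between inputs and labels is easiest to see on a single name. The sketch below is illustrative only: it uses a toy vocabulary built from one name and a hypothetical window of 15 steps instead of the real `char_to_idx` and `length`.

```python
import numpy as np

name = "abelisaurus\n"          # names keep their trailing newline
toy_vocab = sorted(set(name))   # toy vocabulary from this one name only
toy_idx = {c: i for i, c in enumerate(toy_vocab)}
toy_length = 15                 # hypothetical window, longer than the name

x_row = np.zeros((toy_length, len(toy_vocab)))
y_row = np.zeros((toy_length, len(toy_vocab)))

# The input at step t is character t, the label is character t+1
for t, (c_prev, c_next) in enumerate(zip(name[:-1], name[1:])):
    x_row[t, toy_idx[c_prev]] = 1
    y_row[t, toy_idx[c_next]] = 1
    print(repr(c_prev), "->", repr(c_next))

# Steps past the end of the name stay all-zero in both arrays
print(x_row[len(name) - 1:].sum(), y_row[len(name) - 1:].sum())
```

The last label of every name is the newline, which is what lets the trained model signal the end of a name.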

In [4]:
def preprocess_data(names, length, vocab_size):
    """
    Preprocesses names for model training
    Arguments:
    names -- a list of strings representing distinct names
    length -- an integer representing the length of the string to be fed to the model
    vocab_size -- an integer representing the number of unique characters in the list of names
    
    Returns:
    X -- one hot encodings of characters in names, of shape (len(names), length, vocab_size)
    y -- one hot encodings of "next" characters for the sequence model, of shape (len(names), length, vocab_size)
    """
    
    # Use the names argument rather than the global dinosaur_names
    X = np.zeros((len(names), length, vocab_size))
    y = np.zeros((len(names), length, vocab_size))

    for i, name in enumerate(names):
        cur_seq = []
        cur_labels = []
        for j in range(min(len(name)-1, length)): 
            c_prev = name[j]
            c = name[j+1]
            cur_seq.append(char_to_idx[c_prev])
            cur_labels.append(char_to_idx[c])
        cur_seq = np.array(cur_seq)
        cur_seq = to_categorical(cur_seq, num_classes=vocab_size)
        cur_labels = np.array(cur_labels)
        cur_labels = to_categorical(cur_labels, num_classes=vocab_size)

        X[i, 0:min(len(name)-1, length), :] = cur_seq
        y[i, 0:min(len(name)-1, length), :] = cur_labels
        
    return X, y

# Preprocess data
X, y = preprocess_data(dinosaur_names, length, vocab_size)
m = X.shape[0]
train_m = int(0.9*m)

# Shuffle examples
shuffle_inds = np.arange(X.shape[0])
np.random.shuffle(shuffle_inds)
X = X[shuffle_inds, :, :]
y = y[shuffle_inds, :, :]

# Split in train and test
X_train = X[0:train_m, :, :]
X_test = X[train_m:m, :, :]

y_train = y[0:train_m, :, :]
y_test = y[train_m:m, :, :]

print("X_train.shape: " + str(X_train.shape))
print("X_test.shape: " + str(X_test.shape))

X_train.shape: (1382, 50, 27)
X_test.shape: (154, 50, 27)


We are now ready to specify our LSTM model. It is a simple model with a single LSTM layer followed by a Dense layer with a softmax output of size vocab_size. I also experimented with other architectures, including more LSTM layers, wider LSTM layers, more Dense layers, Dropout layers and BatchNormalization layers. Those models performed about the same but took longer to train.

In [5]:
def build_model(n1, vocab_size):
    """
    Builds character-level LSTM model using Keras

    Arguments:
    n1 -- number of units in LSTM layer
    vocab_size -- size of the vocabulary
    
    Returns:
    model -- LSTM model to be trained
    """
    
    model = keras.Sequential()    
    model.add(LSTM(n1, activation='relu', # kernel_initializer='he_normal',
                   input_shape=(length, vocab_size), return_sequences=True))
    model.add(Dense(vocab_size, activation = 'softmax'))
    
    return model

# Initialize a model, compile and print summary
model = build_model(200, vocab_size)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 50, 200)           182400    
_________________________________________________________________
dense_1 (Dense)              (None, 50, 27)            5427      
Total params: 187,827
Trainable params: 187,827
Non-trainable params: 0
_________________________________________________________________
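The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM layout: four weight blocks (input gate, forget gate, cell candidate, output gate), each with an input kernel, a recurrent kernel and a bias. The helper names are made up for this check.

```python
def lstm_param_count(units, input_dim):
    # 4 blocks, each with (input_dim + units) * units weights plus units biases
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(units, input_dim):
    # One weight per (input, output) pair plus one bias per output
    return input_dim * units + units

n1, vocab = 200, 27
lstm_params = lstm_param_count(n1, vocab)
dense_params = dense_param_count(vocab, n1)
print(lstm_params, dense_params, lstm_params + dense_params)
# prints: 182400 5427 187827
```

These numbers agree with the `lstm_1`, `dense_1` and total rows of the summary above.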


Let's train our character-level LSTM model for 20 epochs. We'll use X_test, y_test as validation data to track out-of-sample accuracy and loss.

In [6]:
model.fit(X_train, y_train, epochs=20, 
          validation_data=(X_test, y_test))

Train on 1382 samples, validate on 154 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x149ba2d90>
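One thing to keep in mind when reading the reported loss: the padded time steps have all-zero label rows, so their cross-entropy term is zero, but they still count when the loss is averaged over all 50 steps. The NumPy sketch below illustrates this with a hand-written version of the per-step categorical cross-entropy formula. It is a simplified stand-in with made-up numbers, not the Keras implementation.

```python
import numpy as np

def crossentropy_per_step(y_true, y_pred, eps=1e-7):
    # -sum_k y_true[k] * log(y_pred[k]) for every time step
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Two real steps followed by two padded (all-zero) steps, vocabulary of 3
y_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=float)
y_pred = np.array([[0.2, 0.7, 0.1], [0.1, 0.3, 0.6],
                   [0.5, 0.25, 0.25], [0.5, 0.25, 0.25]])

step_losses = crossentropy_per_step(y_true, y_pred)
real_steps = y_true.sum(axis=-1) > 0

print(step_losses)                     # padded steps contribute zero
print(step_losses.mean())              # mean over all steps
print(step_losses[real_steps].mean())  # mean over real steps only
```

Averaging over all steps gives a smaller number than averaging over real characters only, so the loss values are probably best compared between runs that use the same `length`.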

Finally, let's sample from our model one letter at a time. We start by picking a random letter from the alphabet, then write each newly generated letter into the input tensor x and feed x back to the model to sample the next letter. Sampling stops once the newline character is drawn. At the end we report 10 dinosaur names sampled from the model.
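The sampling loop can be tried out without a trained model. In the sketch below, `toy_next_char_probs` is a hypothetical stand-in for `model.predict`: it returns a made-up distribution in which the probability of the newline grows with the length of the name. Only the loop structure mirrors what this notebook does.

```python
import numpy as np

toy_vocab = ['\n', 'a', 'r', 's', 'u']   # newline at index 0, as in the real vocabulary
rng = np.random.default_rng(0)

def toy_next_char_probs(prefix):
    # Stand-in for the model: end-of-name probability grows with the prefix length
    p_end = min(0.1 * len(prefix), 0.9)
    rest = (1.0 - p_end) / (len(toy_vocab) - 1)
    return np.array([p_end] + [rest] * (len(toy_vocab) - 1))

def toy_sample_name(max_len=20):
    # Random first letter, skipping index 0 (the newline)
    name = toy_vocab[rng.integers(1, len(toy_vocab))]
    while len(name) < max_len:
        probs = toy_next_char_probs(name)
        idx = rng.choice(len(toy_vocab), p=probs)
        if toy_vocab[idx] == '\n':       # newline ends the name
            break
        name += toy_vocab[idx]
    return name

for _ in range(3):
    print(toy_sample_name())
```

Drawing from the predicted distribution with `choice`, instead of always taking the most likely character, is what makes each sampled name different.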

In [7]:
def sample_name_from_model(model, idx_to_char, char_to_idx, vocab_size, length):
    """
    Samples one letter at a time from the model to produce a new name

    Arguments:
    model -- trained model that is used for sampling
    idx_to_char -- a dictionary mapping character encodings to characters
    char_to_idx -- a dictionary mapping characters to their integer encodings
    vocab_size -- size of the vocabulary of letters
    length -- size of the string to be fed into the model
    
    Returns:
    res -- a string representing a sampled new name
    """
    
    first_char_ind = np.random.randint(1, vocab_size)
    first_char = idx_to_char[first_char_ind]
    res = "" + first_char

    x = np.zeros((1, length, vocab_size))  # use length instead of a hard-coded 50
    x[0, 0, char_to_idx[first_char]] = 1

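    # Stop one step early so that writing to x[0, t+1] stays within bounds
    for t in range(length - 1):
        pred = model.predict(x)[0, t, :]
        char_ind = np.random.choice(vocab_size, p=pred)
        char = idx_to_char[char_ind]

        # The newline character marks the end of the name
        if char == '\n':
            break

        # Feed the sampled character back in as the next input
        x[0, t+1, char_ind] = 1
        res = res + char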

    return res

# Now we can generate 10 new dinosaur names
for i in range(10):
    new_name = sample_name_from_model(model, idx_to_char, char_to_idx, vocab_size, length)
    print(new_name)

ijinymangosaurus
dabanrtiops
ifonentosaurus
kacbelomceus
fginnsaurus
ontrocoraurus
ldaxosaurus
uavita
pabrorosaurus
dpaphadon
