# Sequence Generation

In this exercise, you will design an RNN that generates baby names! The network learns to predict the next letter of a name given the preceding letters, so it is a character-level RNN rather than a word-level RNN.

This idea comes from this excellent blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

In [1]:
%matplotlib inline

import numpy as np
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding
from keras.layers import LSTM, SimpleRNN, GRU

Using TensorFlow backend.


## Training Data

The training data we will use comes from this corpus:
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/

Take a look at the training data in `data/names.txt`, which includes both boy and girl names. Below we load the file and convert it to all lower-case for simplicity.

Note that we also add a special "end" character (in this case a period) to allow the model to learn to predict the end of a name.

In [2]:
with open('../data/names.txt') as f:
    names = f.readlines()
    names = [name.lower().strip() + '.' for name in names]

print('Loaded %d names' % len(names))

Loaded 7939 names


In [3]:
names[:10]

['aamir.',
 'aaron.',
 'abbey.',
 'abbie.',
 'abbot.',
 'abbott.',
 'abby.',
 'abdel.',
 'abdul.',
 'abdulkarim.']

We need to count all of the characters in our "vocabulary" and build a dictionary that translates between the character and its assigned index (and vice versa).

In [4]:
chars = set()
for name in names:
    chars.update(name)
vocab_size = len(chars)
print('Vocabulary size:', vocab_size)

char_inds = dict((c, i) for i, c in enumerate(chars))
inds_char = dict((i, c) for i, c in enumerate(chars))

Vocabulary size: 28


In [5]:
char_inds

{'k': 0,
 'h': 1,
 'z': 2,
 'n': 3,
 'i': 4,
 'm': 5,
 'a': 6,
 'q': 7,
 '.': 8,
 '-': 9,
 'v': 10,
 'e': 11,
 'f': 12,
 'x': 13,
 'g': 14,
 's': 15,
 'j': 16,
 'd': 17,
 'b': 18,
 'r': 19,
 'c': 20,
 'p': 21,
 'w': 22,
 'o': 23,
 't': 24,
 'y': 25,
 'u': 26,
 'l': 27}

In [8]:
inds_char

{0: 'k',
 1: 'h',
 2: 'z',
 3: 'n',
 4: 'i',
 5: 'm',
 6: 'a',
 7: 'q',
 8: '.',
 9: '-',
 10: 'v',
 11: 'e',
 12: 'f',
 13: 'x',
 14: 'g',
 15: 's',
 16: 'j',
 17: 'd',
 18: 'b',
 19: 'r',
 20: 'c',
 21: 'p',
 22: 'w',
 23: 'o',
 24: 't',
 25: 'y',
 26: 'u',
 27: 'l'}
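
Because `chars` is a Python set of strings, its iteration order (and therefore the index assigned to each character) can differ from one interpreter run to the next. That matters if a trained model is saved and reloaded later. Below is a minimal sketch of a reproducible alternative, assuming it is acceptable to sort the vocabulary first. The names `toy_names`, `chars_sorted`, `char_inds_sorted`, and `inds_char_sorted` are illustrative and not part of the notebook.

```python
# Toy stand-in for the notebook's `names` list (assumption for illustration)
toy_names = ['aamir.', 'aaron.', 'abbey.']

# Sorting the character set fixes the index order across runs
chars_sorted = sorted(set(''.join(toy_names)))
char_inds_sorted = {c: i for i, c in enumerate(chars_sorted)}
inds_char_sorted = {i: c for c, i in char_inds_sorted.items()}

# Round trip: encode a name to indices, then decode it back
encoded = [char_inds_sorted[c] for c in 'aaron.']
decoded = ''.join(inds_char_sorted[i] for i in encoded)
print(chars_sorted)
print(encoded, decoded)
```

With a sorted vocabulary the end marker `.` sorts before the letters, so it takes index 0 in this sketch.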

#### Exercise 1 - translate chars to indexes

Most of the work of preparing the data is taken care of, but it is important to know the steps because they will be needed anytime you want to train an RNN. Use the dictionary created above to translate each example in `names` to its number format in `int_names`.

In [11]:
# Translate names to their number format in int_names
int_names = [[char_inds[x] for x in name] for name in names]
# TODO: review compound list comprehension like this

The `create_matrix_from_sequences` function creates training data by cutting each name into input sequences of length `maxlen`, each paired with a training label: the character that follows the sequence. Names with `maxlen` characters or fewer (counting the end marker) produce no segments. Make sure you understand this procedure because it is what will actually go into the network!

In [13]:
def create_matrix_from_sequences(int_names, maxlen, step=1):
    name_parts = []
    next_chars = []
    for name in int_names:
        for i in range(0, len(name) - maxlen, step):
            name_parts.append(name[i: i + maxlen])
            next_chars.append(name[i + maxlen])

    return name_parts, next_chars

maxlen = 3
name_parts, next_chars = create_matrix_from_sequences(int_names, maxlen)
print('Created %d name segments' % len(name_parts))
# Created fixed-size inputs

Created 32016 name segments
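
To make the windowing concrete, here is a small worked example that applies the same sliding-window logic to raw characters instead of indices. It is only a sketch: `make_windows` is a hypothetical stand-in for `create_matrix_from_sequences`, and `'al.'` is a made-up short name included to show that it yields no segments.

```python
def make_windows(seqs, maxlen, step=1):
    """Same sliding-window logic as create_matrix_from_sequences."""
    parts, nexts = [], []
    for seq in seqs:
        for i in range(0, len(seq) - maxlen, step):
            parts.append(seq[i: i + maxlen])   # input: maxlen characters
            nexts.append(seq[i + maxlen])      # label: the character that follows
    return parts, nexts

parts, nexts = make_windows(['aaron.', 'al.'], maxlen=3)
for part, nxt in zip(parts, nexts):
    print(part, '->', nxt)
# aar -> o
# aro -> n
# ron -> .
```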


In [14]:
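# Every segment already has length maxlen, so no padding is added here;
# pad_sequences just converts the list of segments into an integer array
X_train = sequence.pad_sequences(name_parts, maxlen=maxlen)
# One-hot encode the next-character labels
y_train = np_utils.to_categorical(next_chars, vocab_size)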

In [15]:
X_train.shape

(32016, 3)

In [16]:
X_train[:5]

array([[ 6,  6,  5],
       [ 6,  5,  4],
       [ 5,  4, 19],
       [ 6,  6, 19],
       [ 6, 19, 23]], dtype=int32)

In [17]:
y_train[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
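
A quick sanity check is to decode a few rows back into characters. The sketch below copies the index-to-character mapping and the first five rows from the outputs above (the mapping is specific to the run shown), rebuilds the one-hot labels with `np.eye`, and recovers each label with `np.argmax`. The names `X_head` and `y_head` are illustrative.

```python
import numpy as np

# Index-to-character mapping copied from the inds_char output above
inds_char = dict(enumerate('khznimaq.-vefxgsjdbrcpwotyul'))

# First five rows of X_train, and the positions of the 1s in y_train[:5]
X_head = np.array([[6, 6, 5], [6, 5, 4], [5, 4, 19], [6, 6, 19], [6, 19, 23]])
y_head = np.eye(len(inds_char))[[4, 19, 8, 23, 3]]  # one-hot rows

for x_row, y_row in zip(X_head, y_head):
    context = ''.join(inds_char[int(i)] for i in x_row)
    target = inds_char[int(np.argmax(y_row))]
    print(context, '->', target)
# aam -> i
# ami -> r
# mir -> .
# aar -> o
# aro -> n
```

The decoded pairs line up with the first two names, `aamir.` and `aaron.`.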

#### Exercise 2 - design a model

Design your model below. Like before, you will need to set up the embedding layer, the recurrent layer, a dense connection and a softmax to predict the next character.

Fit the model by running at least 10 epochs. Later you will generate names with the model. Getting around 30% accuracy will usually result in decent generations. What is the accuracy you would expect for random guessing?
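
For the random-guessing question, one way to get a baseline number is sketched below. It assumes a guesser that picks each of the 28 characters with equal probability; a guesser that always picks the most common next character would do at least as well. The simulated labels are arbitrary and only serve to check the arithmetic.

```python
import numpy as np

vocab_size = 28  # from the "Vocabulary size" output above
print('Uniform-guess accuracy: %.3f' % (1.0 / vocab_size))

# Simulated check: guess uniformly against arbitrary labels
rng = np.random.default_rng(0)
labels = rng.integers(0, vocab_size, size=100_000)
guesses = rng.integers(0, vocab_size, size=100_000)
simulated = float(np.mean(labels == guesses))
print('Simulated accuracy: %.3f' % simulated)
```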

In [18]:
# What layers will this model need?
# Embedding: maps each character index (vocab_size possible values) to a dense vector of whatever size we choose
# Or possibly feed the characters in directly as a one-hot tensor(?)
# Then, LSTM
# Then, an output Dense layer with vocab_size outputs
# Then, Activation('softmax'): turns the outputs into a probability distribution over the next character
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
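
The layer list in the comments above could be assembled roughly as follows. This is only a sketch under stated assumptions, not the official solution: the embedding size (32), the number of LSTM units (64), and the `adam` optimizer are guesses worth experimenting with.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

vocab_size = 28      # from the vocabulary output above
embedding_dim = 32   # assumed value, worth experimenting with
hidden_units = 64    # assumed value, worth experimenting with

model = Sequential([
    Embedding(vocab_size, embedding_dim),  # character index -> dense vector
    LSTM(hidden_units),                    # summarizes the input characters
    Dense(vocab_size),                     # one score per possible next character
    Activation('softmax'),                 # scores -> probabilities that sum to 1
])

# Categorical cross-entropy matches the one-hot labels in y_train
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```

Swapping `LSTM` for `SimpleRNN` or `GRU` (both imported at the top of the notebook) is a small change if you want to compare recurrent layers.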

In [20]:
# TODO: See solution for rest
# More dropout details in day 5
# Will it ever get to 99% accuracy?
    # No. Many names share the same first 3 letters,
    # so after 3 letters there are several possible next characters.
    # The model can only learn which one is most likely.
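
The point about shared prefixes can be checked directly. The sketch below uses five of the names shown in `names[:10]` as a toy dataset, counts which characters follow each 3-letter context, and computes the best accuracy any predictor could reach on this toy set if it sees only those 3 letters. The variable names are illustrative.

```python
from collections import Counter, defaultdict

toy_names = ['abbey.', 'abbie.', 'abbot.', 'abbott.', 'abby.']  # from names[:10]
maxlen = 3

# Count which characters follow each 3-letter context
next_counts = defaultdict(Counter)
for name in toy_names:
    for i in range(len(name) - maxlen):
        next_counts[name[i:i + maxlen]][name[i + maxlen]] += 1

# Best possible strategy: always predict the most frequent follower
total = sum(sum(c.values()) for c in next_counts.values())
best = sum(max(c.values()) for c in next_counts.values())
print(dict(next_counts['abb']))
print('Accuracy ceiling on this toy set: %.2f' % (best / total))
```

The context `abb` alone is followed by four different characters here, so no model limited to 3 letters of input can predict every segment correctly.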

In [21]:
model.fit(X_train, y_train, batch_size=32, epochs=10, verbose=1)

NameError: name 'model' is not defined

## Sampling from the model

We can sample the model by feeding in a few letters and using the model's prediction for the next letter. Then we feed the model's prediction back in to get the next letter, etc.

The `sample` function is a helper to allow you to adjust the diversity of the samples. You can read more [here](https://en.wikipedia.org/wiki/Softmax_function#Reinforcement_learning).

Read the `gen_name` function to understand how the model is sampled.

In [None]:
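def sample(p, diversity=1.0):
    # Work in float64 so the renormalized probabilities sum to 1 more precisely
    p1 = np.asarray(p).astype('float64')
    # Dividing log-probabilities by diversity sharpens (< 1) or flattens (> 1) the distribution
    p1 = np.log(p1) / diversity
    # Renormalize so the rescaled values sum to 1 again
    e_p1 = np.exp(p1)
    s = np.sum(e_p1)
    p1 = e_p1 / s
    # Draw a single sample and return the index of the chosen character
    return np.argmax(np.random.multinomial(1, p1, 1))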


def gen_name(seed, length=1, diversity=1.0, maxlen=3):
    """
    seed - the start of the name to sample
    length - the number of letters to sample; if None then samples
        are generated until the model generates a '.' character
    diversity - a knob to increase or decrease the randomness of the
        samples; higher = more random, lower = closer to the model's
        prediction
    maxlen - the size of the model's input
    """
    
    # Prepare input array
    x = np.zeros((1, maxlen), dtype=int)

    # Generate samples
    out = seed
    while length is None or len(out) < len(seed) + length:

        # Add the last chars so far for the next input
        for i, c in enumerate(out[-maxlen:]):
            x[0, i] = char_inds[c]
        
        # Get softmax for next character
        preds = model.predict(x, verbose=0)[0]
        
        # Sample the network output with diversity
        c = sample(preds, diversity)
        
        # If the model generated an end token, stop when length is None;
        # otherwise discard it and sample again
        if c == char_inds['.']:
            if length is None:
                return out
            else:
                continue

        # Build up output
        out += inds_char[c]
        
    return out
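
To see what the `diversity` knob does before any random draw happens, the sketch below isolates the deterministic rescaling step of `sample` and applies it to a made-up three-character distribution. Dividing the log-probabilities by `diversity` is the same as raising each probability to the power `1 / diversity` and renormalizing. The helper name `rescale` and the example probabilities are assumptions for illustration.

```python
import numpy as np

def rescale(p, diversity):
    """Deterministic part of `sample`: sharpen or flatten a distribution."""
    logits = np.log(np.asarray(p, dtype='float64')) / diversity
    e = np.exp(logits)
    return e / np.sum(e)

p = [0.7, 0.2, 0.1]  # made-up softmax output over three characters
for diversity in (0.5, 1.0, 2.0):
    print(diversity, np.round(rescale(p, diversity), 3))
```

A diversity below 1 pushes more probability onto the most likely character, a diversity of 1 leaves the distribution unchanged, and a diversity above 1 moves it toward uniform.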

#### Exercise 3 - sample the model

Use the `gen_name` function above to sample some names from your model.

1. Try generating a few characters by setting the `length` argument.
2. Try different diversities. Start with 1.0 and vary it up and down.
3. Try using `length=None`, allowing the model to choose when to end a name.
4. What happens when `length=None` and the diversity is high? How do the samples change from beginning to end in this case? Why do you think this is?
5. With `length=None` and a "good" diversity, can you tell if the model has learned a repertoire of "endings"? What are some of them?
6. Find some good names. What are your favorites? :D

In [23]:
# Not great names, but pronounceable; not necessarily English names.
# An untrained model produces unpronounceable names.
# Why is this useful? Given enough data, we might be able to produce actually useful things.

#### Exercise 4 - retrain

Now that you have seen some samples, go back up and redefine your model to "erase" it. Don't train it again yet. You can sample again to compare the quality of the samples before the model is trained.

Experiment with the hidden layer size, the maxlen, the number of epochs, etc. Do you observe any differences in the sample behavior?

Not all changes will make an observable impact, but do experiments to see what you can discover.