# RNN Children's Book Author

Train an RNN to make sentences for a children's book like the example in [this YouTube video](https://www.youtube.com/watch?v=WCUNPb-5EYI). Useful as a high-level "make an RNN work" project, but haven't gone into the details of how it works yet.

Expanded on the example to show it more examples of sentences in the format "<person 1> saw <person 2> ." and see if it could 1) keep the appropriate grammar (e.g. avoid "Jane." or "Jane saw Spot saw Doug saw Jane.") and 2) make new (correct) sentences that it hasn't seen before (e.g. "Jane saw Luke.", which isn't in the training set).

Most of the code was adapted from this [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/text_generation), which originally generated Shakespeare-like text on a character-by-character basis. Adapted it to use words as tokens instead of characters.

In [1]:
from itertools import permutations

import numpy as np
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf

## Dataset

Create a dataset for training, starting with a string of sentences of the form "Jane saw Spot ." The period is separated by a space so that it is treated as a separate word/token the RNN can choose to add.


### Initial text string

Start with a string of many example sentences. Names are split into groups, and sentences only use names within a single group. This way we can see if the RNN creates new sentences by combining names from the different groups (e.g. the training data will not contain "Spot saw Leia ." but the output might generate that sentence).

In [2]:
names1 = ['Doug', 'Jane', 'Spot', 'Kaylee', 'Mal', 'Link', 'Zelda', 'Mario', 'Luigi']
names2 = ['Leia', 'Luke', 'Han', 'Harry', 'Hermione', 'Ron']
names3 = ['Frodo', 'Sam', 'Merry', 'Pippin']

text_array = []
for name_pair in permutations(names1, 2):
    text_array.append(' saw '.join(name_pair))
for name_pair in permutations(names2, 2):
    text_array.append(' saw '.join(name_pair))
for name_pair in permutations(names3, 2):
    text_array.append(' saw '.join(name_pair))
data_text = ' . '.join(text_array) + ' .' # Need that last period

print(data_text)

Doug saw Jane . Doug saw Spot . Doug saw Kaylee . Doug saw Mal . Doug saw Link . Doug saw Zelda . Doug saw Mario . Doug saw Luigi . Jane saw Doug . Jane saw Spot . Jane saw Kaylee . Jane saw Mal . Jane saw Link . Jane saw Zelda . Jane saw Mario . Jane saw Luigi . Spot saw Doug . Spot saw Jane . Spot saw Kaylee . Spot saw Mal . Spot saw Link . Spot saw Zelda . Spot saw Mario . Spot saw Luigi . Kaylee saw Doug . Kaylee saw Jane . Kaylee saw Spot . Kaylee saw Mal . Kaylee saw Link . Kaylee saw Zelda . Kaylee saw Mario . Kaylee saw Luigi . Mal saw Doug . Mal saw Jane . Mal saw Spot . Mal saw Kaylee . Mal saw Link . Mal saw Zelda . Mal saw Mario . Mal saw Luigi . Link saw Doug . Link saw Jane . Link saw Spot . Link saw Kaylee . Link saw Mal . Link saw Zelda . Link saw Mario . Link saw Luigi . Zelda saw Doug . Zelda saw Jane . Zelda saw Spot . Zelda saw Kaylee . Zelda saw Mal . Zelda saw Link . Zelda saw Mario . Zelda saw Luigi . Mario saw Doug . Mario saw Jane . Mario saw Spot . Mario saw K
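
To check later whether a generated sentence is actually new, it helps to have the training sentences in a set. The sketch below rebuilds them from the same three name lists as the cell above; `training_sentences` and `is_novel` are names made up for this illustration and are not used elsewhere in the notebook.

```python
from itertools import permutations

# Same three name groups as in the cell above
groups = [
    ['Doug', 'Jane', 'Spot', 'Kaylee', 'Mal', 'Link', 'Zelda', 'Mario', 'Luigi'],
    ['Leia', 'Luke', 'Han', 'Harry', 'Hermione', 'Ron'],
    ['Frodo', 'Sam', 'Merry', 'Pippin'],
]

# Every ordered pair of distinct names within one group becomes one sentence
training_sentences = {
    f'{a} saw {b} .' for group in groups for a, b in permutations(group, 2)
}

def is_novel(sentence):
    """Return True if the sentence never appears in the training data."""
    return sentence not in training_sentences

print(len(training_sentences))      # 72 + 30 + 12 = 114
print(is_novel('Spot saw Leia .'))  # True: the names come from different groups
print(is_novel('Doug saw Jane .'))  # False: this sentence is in the training data
```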

### Encode the data

The dataset is made by splitting the string into a list of individual words. Scikit-Learn's `LabelEncoder` class is used to convert each word into a number, so it can be used as input into the RNN. The encoded words are stored in `dataset_enc`.

In [3]:
dataset = np.array(data_text.split())
encoder = LabelEncoder()
dataset_enc = encoder.fit_transform(dataset) # Map each word to an integer ID (1-D array, same length as dataset)


print(f'Vocabulary: {encoder.classes_}')
print(f'Original data: {dataset[:8]}\nEncoded data: {dataset_enc[:8]}')

Vocabulary: ['.' 'Doug' 'Frodo' 'Han' 'Harry' 'Hermione' 'Jane' 'Kaylee' 'Leia' 'Link'
 'Luigi' 'Luke' 'Mal' 'Mario' 'Merry' 'Pippin' 'Ron' 'Sam' 'Spot' 'Zelda'
 'saw']
Original data: ['Doug' 'saw' 'Jane' '.' 'Doug' 'saw' 'Spot' '.']
Encoded data: [ 1 20  6  0  1 20 18  0]
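
`LabelEncoder` assigns IDs by sorting the unique labels, which is why '.' comes first and 'saw' (lowercase) comes last in the vocabulary above. A rough plain-Python equivalent on two sentences, as a sketch only (the IDs differ from the notebook's because this toy vocabulary is smaller; `word_to_id` is a name invented here):

```python
words = 'Doug saw Jane . Doug saw Spot .'.split()

# Sorted unique words form the vocabulary; a word's position is its ID
vocabulary = sorted(set(words))
word_to_id = {word: i for i, word in enumerate(vocabulary)}

encoded = [word_to_id[w] for w in words]   # like encoder.transform
decoded = [vocabulary[i] for i in encoded] # like encoder.inverse_transform

print(vocabulary)  # ['.', 'Doug', 'Jane', 'Spot', 'saw']
print(encoded)     # [1, 4, 2, 0, 1, 4, 3, 0]
```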


### TensorFlow Dataset Format

Convert our encoded data (a list of numbers representing word tokens) into a format usable by the TensorFlow RNN. We want our input to be a series of numbers representing words (e.g. [1, 20, 18] -> "Doug saw Spot"), and the output should be the series shifted one word into the future (e.g. [20, 18, 0] -> "saw Spot .").

This converts our list of encoded numbers into a TensorFlow dataset, then formats it by:

- Grabbing batches 1 longer than the input length (e.g. [1, 20, 18, 0, 1] for an input length of 4)
- Mapping the batches into input and target (e.g. input: [1, 20, 18, 0] target: [20, 18, 0, 1])
- Shuffling the input/output pairs so it doesn't always see input in the same order during training
    + This doesn't shuffle the words in the input/target, only the order it sees the input/target pairs
- Setting the batch size used for training (the number of input/target pairs to give to the model at each training step)
    + Prediction needs a batch size of 1, and the batch size is fixed when the model is created (a stateful RNN is built with a fixed `batch_input_shape`). So if you use a larger batch size for training, for prediction you'd have to make a new model with a batch size of 1 and load the weights of the trained model into it.
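
The chunking and input/target split described above can be sketched in plain Python, without TensorFlow. `make_pairs` is a hypothetical helper written for this illustration; the token IDs are the encoding of "Doug saw Jane . Doug saw Spot . Doug saw Kaylee" under the vocabulary printed earlier.

```python
def make_pairs(tokens, seq_length):
    """Cut tokens into non-overlapping chunks of seq_length + 1 and
    split each chunk into (input, target), dropping any remainder."""
    chunk_size = seq_length + 1
    pairs = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        pairs.append((chunk[:-1], chunk[1:]))  # target is input shifted by one word
    return pairs

# Doug saw Jane . Doug saw Spot . Doug saw Kaylee
tokens = [1, 20, 6, 0, 1, 20, 18, 0, 1, 20, 7]

for inputs, targets in make_pairs(tokens, seq_length=4):
    print(inputs, targets)
# [1, 20, 6, 0] [20, 6, 0, 1]
# [20, 18, 0, 1] [18, 0, 1, 20]
```

Because each sentence is 4 tokens long and each chunk is 5, the chunks drift relative to sentence boundaries, which is why inputs such as "saw Spot . Doug" show up in the batches printed below.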

In [4]:
# Input and target are sets of 4 words, with target shifted one word into the future
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

seq_length = 4 # Length of input and target strings
batch_size = 1 # Use 1 so we don't have to rebuild model for generating data after training
buffer_size = 4
dataset_tf = (tf.data.Dataset.from_tensor_slices(dataset_enc) # Make tf dataset from encoded dataset
              .batch(seq_length+1, drop_remainder=True) # Take a batch of 5 words at a time, dropping any remainder
              .map(split_input_target) # From each 5-word batch, return input (words 1-4) and target (words 2-5)
              .shuffle(buffer_size) # tf reads in buffer_size elements into memory and shuffles those elements
              .batch(batch_size, drop_remainder=True)) # This is the batch size used for training

for batch_num, (input_text, target_text) in enumerate(dataset_tf.take(3)):
    print(f'Batch {batch_num}')
    for batch_input, batch_target in zip(input_text, target_text):
        print(f'Input: {batch_input}')
        print(f'Target: {batch_target}')
        print('Input Translated: ' + ' '.join(encoder.inverse_transform(batch_input)))
        print('Target Translated: ' + ' '.join(encoder.inverse_transform(batch_target)))
        print(' ')

Batch 0
Input: [ 0  1 20  9]
Target: [ 1 20  9  0]
Input Translated: . Doug saw Link
Target Translated: Doug saw Link .
 
Batch 1
Input: [20 18  0  1]
Target: [18  0  1 20]
Input Translated: saw Spot . Doug
Target Translated: Spot . Doug saw
 
Batch 2
Input: [ 1 20 19  0]
Target: [20 19  0  1]
Input Translated: Doug saw Zelda .
Target Translated: saw Zelda . Doug
 


## Create the RNN Model

**Create a model with 3 layers:**

* Embedding: converts the integer ID of each word into a dense, trainable vector usable by the RNN (not a one-hot vector)
   + `embedding_dim` is the length of each word vector, so the layer output has shape (batch, sequence length, `embedding_dim`). It doesn't have to match the vocabulary size
* LSTM: the recurrent layer
   + `rnn_units` is passed to the `units` parameter and sets the size of the LSTM's hidden state, which is also the size of its output at each timestep
   + `return_sequences` parameter determines "whether to return the last output in the output sequence, or the full sequence."
       - Setting it to True returns an output for every word in the input sequence instead of only the last one. It needs to be True here because the target contains one next-word label for each input position
   + `stateful` parameter: "If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch."
       + The state is the LSTM's hidden state and cell state, its running memory of the words seen so far. With `stateful=True` that memory carries over from one batch to the next until `reset_states()` is called
* Dense: the output layer, producing one score (logit) for each word in the vocabulary at each timestep
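
To make the embedding step concrete: an embedding layer behaves like a lookup table with one row of weights per word ID. The plain-Python sketch below uses random numbers in place of learned weights and a toy vocabulary of 5 words; `embed` and `one_hot_times_table` are names invented for this illustration. It also shows that the lookup gives the same result as multiplying a one-hot vector by the table, which is how the two ideas are related.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3

# One row per word ID (random here, learned during training in a real model)
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(word_ids):
    """Look up the vector for each word ID."""
    return [embedding_table[i] for i in word_ids]

def one_hot_times_table(word_id):
    """Same result computed the long way: one-hot vector times the table."""
    hot = [1.0 if i == word_id else 0.0 for i in range(vocab_size)]
    return [sum(hot[row] * embedding_table[row][col] for row in range(vocab_size))
            for col in range(embedding_dim)]

vectors = embed([1, 4, 2])
print(len(vectors), len(vectors[0]))  # 3 3  (sequence length, embedding_dim)
```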
   
**Create a loss function and compile**

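- Use the sparse categorical crossentropy loss ("sparse" because the targets are integer word IDs rather than one-hot vectors), but have to create our own function so we can set the `from_logits` parameter to True
    + Logits are the raw, unnormalized scores output by the final Dense layer; a softmax turns them into probabilities that sum to 1
    + The Dense layer has no activation, so the model outputs logits, and `from_logits=True` tells the loss to apply the softmax itself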
- Use Adam optimizer
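
As a rough illustration of what the loss computes for a single predicted word, here is a plain-Python sketch of softmax followed by crossentropy against an integer label. It is not TensorFlow's implementation, and the function names are made up for this example.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_crossentropy_from_logits(label, logits):
    """Negative log-probability assigned to the correct word ID."""
    return -math.log(softmax(logits)[label])

# An untrained model that scores all 21 words equally has loss ln(21)
uniform_logits = [0.0] * 21
print(round(sparse_crossentropy_from_logits(0, uniform_logits), 4))  # 3.0445

# Raising the score of the correct word lowers the loss
confident_logits = [5.0] + [0.0] * 20
print(sparse_crossentropy_from_logits(0, confident_logits) < 3.0445)  # True
```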

In [5]:
# Create RNN

vocab_size = len(encoder.classes_)

# Embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

model = tf.keras.Sequential([
    # Embedding layer maps words to vectors
    tf.keras.layers.Embedding(vocab_size, 
                              embedding_dim, 
                              batch_input_shape=[batch_size, None]),
    
    # Recurrent layer
    tf.keras.layers.LSTM(units=rnn_units, 
                         return_sequences=True, 
                         stateful=True),
    
    # Output layer
    tf.keras.layers.Dense(vocab_size)
])

# Use sparse categorical crossentropy as loss function; use custom function so from_logits is True
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

# Compile with loss function and adam optimizer
model.compile(optimizer='adam', loss=loss)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (1, None, 256)            5376      
_________________________________________________________________
lstm (LSTM)                  (1, None, 1024)           5246976   
_________________________________________________________________
dense (Dense)                (1, None, 21)             21525     
Total params: 5,273,877
Trainable params: 5,273,877
Non-trainable params: 0
_________________________________________________________________
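
The parameter counts in the summary can be reproduced by hand, which is a quick way to see what `embedding_dim` and `rnn_units` control. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights and a bias.

```python
vocab_size, embedding_dim, rnn_units = 21, 256, 1024

# Embedding: one vector of length embedding_dim per word
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (embedding_dim + rnn_units + 1) * rnn_units

# Dense: weights from every LSTM unit to every word, plus one bias per word
dense_params = rnn_units * vocab_size + vocab_size

print(embedding_params)                               # 5376
print(lstm_params)                                    # 5246976
print(dense_params)                                   # 21525
print(embedding_params + lstm_params + dense_params)  # 5273877
```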


## Train the Model

- Set the number of epochs and train the model on the dataset

In [6]:
epochs = 10
history = model.fit(dataset_tf, epochs=epochs)

Train for 91 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Generate Text

Start by seeding it with the sentence "Jane saw Doug ." and having it generate new text word by word. 

We choose the new word by sampling from the output predictions, rather than simply taking the highest probability word. Always taking the most likely word makes generation deterministic, which can easily get the model stuck repeating the same loop of words.

The input for the next step is the last 3 words from the previous input, followed by the newest prediction.
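
The effect of the sampling temperature used in the next cell can be sketched in plain Python: the logits are divided by the temperature before being turned into probabilities, so a low temperature sharpens the distribution and a high one flattens it. The function names and the three example logits are invented for this illustration.

```python
import math
import random

def probabilities(logits, temperature=1.0):
    """Softmax of the logits after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one word ID at random, weighted by the scaled probabilities."""
    weights = probabilities(logits, temperature)
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a 3-word vocabulary
for temperature in (0.5, 1.0, 2.0):
    top = probabilities(logits, temperature)[0]
    print(f'temperature={temperature}: P(most likely word)={top:.2f}')

print(sample_with_temperature(logits, temperature=1.0, rng=random.Random(0)))
```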

In [7]:
input_text = encoder.transform('Jane saw Doug .'.split())
input_text = tf.expand_dims(input_text, 0) # Add a batch dimension: shape (4,) -> (1, 4)

generated_text = []

# Temperature divides the logits before sampling:
# low values give more predictable results, high values more surprising
temperature = 1.0

model.reset_states() # Clear the LSTM state left over from training so generation starts fresh

for i in range(100):
    predictions = model(input_text)
    
    # Remove batch dimension
    predictions = tf.squeeze(predictions, 0)
    
    # Sample from the output predictions instead of taking the argmax,
    # since argmax can get the model stuck in a loop
    predictions = predictions / temperature
    # One sample is drawn per timestep; [-1, 0] keeps the one for the last timestep
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
    
    # Add predicted value to the input and drop the oldest value
    input_text = np.append(np.array(input_text)[0, 1:], [predicted_id])
    input_text = tf.expand_dims(input_text, 0)
    
    generated_text.extend(encoder.inverse_transform([predicted_id]))
    
print(" ".join(generated_text))
    
    

Link saw Mario saw Spot . Spot saw Spot . Luigi saw Ron . Link saw Kaylee . Luigi saw Luigi . Link saw Luigi . Doug saw Jane . Jane saw Luigi . Luigi saw Link . Mario saw Mal . . Luigi saw Mal . Luigi saw Han . Mal saw Han . . Hermione saw Merry . Merry saw Harry . Hermione saw Luke . Frodo . Pippin saw Harry . Ron saw Han . Hermione saw Pippin . Hermione saw Luke . Kaylee saw Han . Hermione saw Hermione . Ron saw Harry saw Ron .


## Results

It sort of works! It generates words, usually into sensible sentences. It will occasionally cross words between groups (e.g. "Hermione saw Merry ." or "Kaylee saw Han ."). It will also sometimes do weird things, like use two periods in a row, repeat a name ("Luigi saw Luigi ."), or chain sentences together ("Ron saw Harry saw Ron .").
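
Instead of reading the output by eye, the generated text could be sorted into these categories automatically. The following is a minimal sketch, assuming the same three name groups as the training data; `split_sentences` and `classify` are hypothetical helpers, and the example text is the start of the output above.

```python
groups = [
    ['Doug', 'Jane', 'Spot', 'Kaylee', 'Mal', 'Link', 'Zelda', 'Mario', 'Luigi'],
    ['Leia', 'Luke', 'Han', 'Harry', 'Hermione', 'Ron'],
    ['Frodo', 'Sam', 'Merry', 'Pippin'],
]
group_of = {name: i for i, group in enumerate(groups) for name in group}

def split_sentences(text):
    """Split on the '.' token; trailing words without a period are dropped."""
    sentences, current = [], []
    for word in text.split():
        if word == '.':
            sentences.append(' '.join(current))
            current = []
        else:
            current.append(word)
    return sentences

def classify(sentence):
    words = sentence.split()
    if (len(words) != 3 or words[1] != 'saw'
            or words[0] not in group_of or words[2] not in group_of):
        return 'malformed'
    if words[0] == words[2]:
        return 'same name'
    if group_of[words[0]] != group_of[words[2]]:
        return 'novel (cross-group)'
    return 'seen in training'

generated = 'Link saw Mario saw Spot . Spot saw Spot . Luigi saw Ron . Link saw Kaylee .'
for sentence in split_sentences(generated):
    print(f'{sentence!r}: {classify(sentence)}')
# 'Link saw Mario saw Spot': malformed
# 'Spot saw Spot': same name
# 'Luigi saw Ron': novel (cross-group)
# 'Link saw Kaylee': seen in training
```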