# Generating Star Trek Titles with an RNN

- Trains on Star Trek episode titles
- Outputs "fake" titles.
- Uses the "charRNN" idea.

Much of this example is borrowed from François Chollet's [Keras examples](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py).

This notebook was developed by Charles Martin and Kai Olav Ellefsen at the University of Oslo, Department of Informatics.

## Setup Environment

- Import Keras
- Open up the Star Trek corpus
- Translate the text into a format that the RNN can accept as input.
- Give each character an index and create dictionaries to translate between characters and indices.
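
As a small illustration of the last step, the sketch below builds both dictionaries for a toy string and checks that encoding followed by decoding gives back the original text. The string `toy_text` and the `toy_` names are made up for this example and are not part of the notebook's data.

```python
# Toy stand-in for the corpus (hypothetical; the notebook uses the Star Trek titles).
toy_text = "to boldly go"

# Sorted list of the distinct characters, so the indices are reproducible.
toy_chars = sorted(set(toy_text))

# Character -> index, and the inverse index -> character.
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integers, then decode it again.
encoded = [toy_char_indices[c] for c in toy_text]
decoded = "".join(toy_indices_char[i] for i in encoded)

print("Characters:", toy_chars)
print("Encoded:", encoded)
print("Decoded:", decoded)
```

The round trip only works because the two dictionaries are exact inverses of each other, which is why both are built from the same `enumerate` call order.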

In [None]:
## Much borrowed from https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py

import keras
from keras import layers
import numpy as np
import random
import sys

from urllib.request import urlopen

# Jupyter Notebook Only
# text = open("../datasets/startrekepisodes.txt").read().lower()
# print('corpus length:', len(text))

# Colab
ST_EPISODES_URL = "https://raw.githubusercontent.com/cpmpercussion/creative-prediction/master/datasets/startrekepisodes.txt"
text = urlopen(ST_EPISODES_URL).read().decode("utf-8").lower()

chars = sorted(list(set(text)))
vocabulary_size = len(chars)
print('total chars:', vocabulary_size)
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))


# How long is a title?
titles = text.split('\n')
lengths = np.array([len(n) for n in titles])
print("Max:", np.max(lengths))
print("Mean:", np.mean(lengths))
print("Median:", np.median(lengths))
print("Min:", np.min(lengths))
print()

# hence choose 30 as sequence length to train on.
print("Character Dictionary: ", char_indices, "\n")
print("Inverse Character Dictionary: ", indices_char)

## Setup Training Data

- Cut up the corpus into semi-redundant sequences of 30 characters.
- Change indices into "one-hot" vector encodings.

<img src="figures/slicing_text.png" style="width: 300px;"/>
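
To make the slicing concrete, here is a minimal sketch on a short stand-in string with a window of 5 characters instead of 30. The helper `make_windows` and the toy text are illustrative assumptions; the notebook itself does the same thing inline.

```python
def make_windows(text, maxlen, step):
    """Return (inputs, targets): each input is maxlen characters long,
    and each target is the single character that follows it."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets

# Short stand-in for the corpus (hypothetical), with newlines separating titles.
toy_text = "the cage\ncharlie x\n"
windows, targets = make_windows(toy_text, maxlen=5, step=3)

for x, y in zip(windows, targets):
    print(repr(x), "->", repr(y))
```

Because `step` is smaller than `maxlen`, consecutive windows overlap, which is what "semi-redundant" refers to. Note that windows can span the newline between two titles, so the model also learns where titles end.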

In [None]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 30
step = 3

sentences = [] #The training data
next_chars = [] #The training labels

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
    
print('Number of sequences (Xs):', len(sentences))
print('Number of next_chars (ys):', len(next_chars))

print("\nHere's the first example:")
print("X:",sentences[0])
print("y:",next_chars[0])

### Onehot encoding:

* `a -> [True, False, False, ..., False]`
* `b -> [False, True, False, ..., False]`
* ...

Each training sample becomes a 2D tensor:

* `"This is the text" -> X = [[0, 0, ..., 1, 0, ..., 0], ..., [0, 0, ..., 1, 0, ... 0]]`

Each target (next letter) becomes a 1D one-hot tensor:

* `a -> y = [1, 0, 0, ..., 0]`
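
The encoding can be sketched in plain Python for a three-letter vocabulary. The helper `one_hot` and the toy vocabulary are assumptions made for this illustration; the notebook uses NumPy arrays for the real data.

```python
def one_hot(index, size):
    """Return a list of `size` booleans that is True only at `index`."""
    return [i == index for i in range(size)]

# Toy vocabulary (hypothetical): 'a' -> 0, 'b' -> 1, 'c' -> 2.
toy_chars = sorted(set("abc"))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# One training input: a 2D structure with one one-hot row per character.
x = [one_hot(toy_char_indices[c], len(toy_chars)) for c in "cab"]

# One target: a single one-hot vector for the next character.
y = one_hot(toy_char_indices["a"], len(toy_chars))

print("x:", x)
print("y:", y)
```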

In [None]:
#X shape: 3D tensor. First dimension is the sentences, second is each letter in each sentence, third is the onehot
#vector representing that letter.
X = np.zeros((len(sentences), maxlen, vocabulary_size), dtype=bool)
y = np.zeros((len(sentences), vocabulary_size), dtype=bool)
    
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
print("Done preparing training corpus, shapes of sets are:")
print("X shape: " + str(X.shape))
print("y shape: " + str(y.shape))
print("Vocabulary of characters:", vocabulary_size)

In [None]:
# Look at some data:
print(X[0][0])

## Model

- Model has one hidden layer of 128 LSTM cells.
- Output layer uses the "softmax" activation function to output a probability distribution over next letters.

This model is designed for "one-by-one" prediction, i.e., it predicts the very next letter in a sequence of text.

- For the sentence "My cat is named Simon"
   - x: "My cat is named Simo"
   - y: "n"
   
The RNN is structured as follows:

<img src="figures/n-in-1-out.png" style="width: 400px;"/>
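
To get a feel for the size of this model, the parameter counts can be worked out by hand. The sketch below assumes the standard LSTM formulation with one bias vector per gate (the Keras default) and a made-up vocabulary size of 50; the real vocabulary size depends on the corpus.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters in one LSTM layer with bias.

    There are four weight sets (input, forget, and output gates plus the
    candidate state), each with input weights, recurrent weights, and a bias.
    """
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """Trainable parameters in a fully connected layer with bias."""
    return input_dim * units + units

toy_vocabulary_size = 50  # hypothetical; the real value depends on the corpus
layer_size = 128

lstm_params = lstm_param_count(toy_vocabulary_size, layer_size)
dense_params = dense_param_count(layer_size, toy_vocabulary_size)

print("LSTM parameters:", lstm_params)
print("Dense parameters:", dense_params)
print("Total:", lstm_params + dense_params)
```

The sequence length does not appear anywhere in these formulas, because the same weights are applied at every time step.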

In [7]:
layer_size = 128
# build the model: a single LSTM layer.
model_train = keras.Sequential()
model_train.add(keras.Input(shape=(maxlen, len(chars))))
model_train.add(layers.LSTM(layer_size))
# Project back to vocabulary. One output node for each letter.
# Dense indicates a fully connected layer.
# Softmax activation ensures the combined values of all outputs form a probability distribution:
# They sum to 1, with each individual value between 0 and 1.
model_train.add(layers.Dense(len(chars), activation='softmax'))

In [None]:
# Categorical crossentropy minimizes the distance between the probability distributions
# output by the network and the true distribution of the targets.
# The optimizer specifies HOW the gradient of the loss will be used to update parameters.
# Different optimizers have different tricks to avoid local optima, etc.
# RMSProp is adaptive, adjusting the rate of learning to how fast we're currently learning.
# Choose one by experimenting, or selecting one documented to work well for this problem by other researchers.
model_train.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.RMSprop(learning_rate=0.01))
model_train.summary()

# An LSTM is more complicated than the basic RNN we introduced. It has more free parameters per cell,
# so the summary shows more parameters than one might expect. We use LSTMs because they are better at
# learning long-term structure.

## Sampling from the Model

- The model doesn't output _letters_, but a probability distribution over the possible next letters.
- Could just take the letter with the highest probability.
- Better to do a random sampling from the distribution.
- Also have opportunity to "reweight" the distribution, to make more "creative" choices.

<img src="figures/reweighting.png" style="width: 600px;"/>
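
The effect of reweighting can be seen on a tiny distribution. This is a minimal sketch using only the standard library; the three probabilities are made up, and `reweight` is a hypothetical helper that mirrors the log, divide, softmax steps without doing any sampling.

```python
import math

def reweight(probabilities, diversity):
    """Rescale a distribution: take logs, divide by diversity, then apply softmax."""
    logits = [math.log(p) / diversity for p in probabilities]
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_distribution = [0.5, 0.3, 0.2]  # made-up probabilities for three letters

for diversity in (0.5, 1.0, 2.0):
    rescaled = reweight(toy_distribution, diversity)
    print(diversity, [round(p, 3) for p in rescaled])
```

A diversity below 1 sharpens the distribution towards the most likely letter, a diversity of 1 leaves it unchanged, and a diversity above 1 flattens it so that unlikely letters are chosen more often.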

- Here's the code for the sampling function:

In [9]:
#Higher diversity -> more randomness in the generation.
def sample(probability_distribution, diversity=1.0):
    # helper function to sample an index from a probability distribution
    probability_distribution = np.asarray(probability_distribution).astype('float64')
    # Reweight the distribution
    probability_distribution = np.log(probability_distribution) / diversity
    # Here's the Softmax operation
    exp_preds = np.exp(probability_distribution)
    probability_distribution = exp_preds / np.sum(exp_preds)
    #Draws 1 element at random according to the new scaled probability-distribution.
    probabilities = np.random.multinomial(n=1, pvals = probability_distribution) 
    return np.argmax(probabilities)

## Method for printing some example text after every epoch

In [10]:
def generate_text_segment(length, diversity, generating_model = model_train, input_sequence_length = maxlen):
    start_index = random.randint(0, len(text) - input_sequence_length - 1)

    # We need a seed to start the text generation. Since during training the ANN always experiences
    # sentences of size 30, we seed it with a sentence of length 30 to get it into a sensible state.
    generated = ''
    sentence = text[start_index: start_index + input_sequence_length]
    generated += sentence
    
    for i in range(length):
        x_pred = np.zeros((1, input_sequence_length, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.
        

        predictions_distribution = generating_model(x_pred)[0]
        next_index = sample(predictions_distribution, diversity)
        next_char = indices_char[next_index]

        generated += next_char
        #Stepping one symbol forward in the sentence
        sentence = sentence[1:] + next_char

    return generated

def generate_sample_text(epoch, logs):
    # Invoked at the end of every epoch; prints generated text every fifth epoch.
    if epoch % 5 == 0:
        generated = generate_text_segment(200, 1.0, model_train, input_sequence_length = maxlen)
        # The first maxlen characters are the seed; the rest was generated by the model.
        print("\nSeed:\n", generated[:maxlen], "\n")
        print("\nGenerated text:\n", generated[maxlen:], "\n\n")

print_callback = keras.callbacks.LambdaCallback(on_epoch_end=generate_sample_text)

## Training

- Training in Keras is done by calling `model_train.fit(X,y)`, where `X`, `y` are the data corpus we prepared earlier.
- There are two important parameters for training:
    - Batch size: How many examples are used to make one weight update in the model.
    - Number of epochs: How many times to iterate through the whole dataset (randomised batches each time).
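
As a rough sketch of how these two parameters interact, the number of weight updates can be estimated from the dataset size. The number of examples below is a made-up figure; the real one is `len(sentences)` from the data preparation step.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of weight updates in one pass over the data.

    The last batch may be smaller than batch_size, hence the ceiling.
    """
    return math.ceil(num_examples / batch_size)

num_examples = 5000  # hypothetical; use len(sentences) for the real corpus
batch_size = 128
epochs = 51

per_epoch = updates_per_epoch(num_examples, batch_size)
total_updates = per_epoch * epochs

print("Updates per epoch:", per_epoch)
print("Total updates:", total_updates)
```

A smaller batch size gives more, noisier updates per epoch; a larger one gives fewer, smoother updates.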


In [None]:
history = model_train.fit(X, y, batch_size=128, epochs=51, callbacks=[print_callback])

In [None]:
# Save model if necessary
model_train.save("keras-startrek-LSTM-model.keras")

## Plotting training loss

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

history_dict = history.history
loss_values = history_dict['loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'b-', label='Training loss')
plt.title('Training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

## Make a Decoder model

During training, we presented sequences of 30 characters, along with the correct next character.
When using the trained model, it may be more useful to feed in 1 character at a time and see the next
predicted one. That will also convince us that the network is actually _using_ its internal state.

- Needs input length of 1.
- Needs batch size of 1
- Needs LSTM to be stateful
- Check that the number of parameters is the same as in `model_train`.

<img src="figures/1-in-1-out.png" style="width: 600px;"/>
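
Why carrying the state matters can be shown with a toy recurrent cell. The sketch below uses a scalar tanh RNN rather than an LSTM, with arbitrary weights chosen for illustration, and compares processing a sequence in one call with feeding it one input at a time.

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, b=0.1):
    """One step of a scalar tanh RNN (the weights are arbitrary illustrative values)."""
    return math.tanh(w_x * x + w_h * h + b)

def run_sequence(xs, h=0.0):
    """Process a whole sequence in one call, starting from state h."""
    for x in xs:
        h = rnn_step(x, h)
    return h

inputs = [1.0, 0.0, 1.0, 1.0]

# Training style: the whole sequence in a single call.
h_full = run_sequence(inputs)

# Decoder style: one input per call, carrying the state between calls.
h_stateful = 0.0
for x in inputs:
    h_stateful = run_sequence([x], h=h_stateful)

# Without carrying the state, each call starts from zero and forgets the past.
h_stateless = run_sequence([inputs[-1]])

print("Full sequence:     ", h_full)
print("One at a time:     ", h_stateful)
print("State not carried: ", h_stateless)
```

The first two results match because the same weights are applied at every step, regardless of how the sequence is split across calls. The third differs because the history has been thrown away.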

In [None]:
# Load model if necessary.
# model_train = keras.models.load_model("keras-startrek-LSTM-model.keras")

In [None]:
# Build a decoding model (input length 1, batch size 1, stateful)
layer_size = 128

model_dec = keras.Sequential()
# 1 letter in, 1 letter out.
# Stateful=True keeps the state from the end of one batch to the start of the next
# In other words, the network "remembers" its state from one input to the next. This is essential when
# the network looks at 1 input at a time.
model_dec.add(keras.Input(shape=(1, len(chars)), batch_size=1))
model_dec.add(layers.LSTM(layer_size, stateful=True))

# project back to vocabulary
model_dec.add(layers.Dense(vocabulary_size, activation='softmax'))
model_dec.summary()

# set weights from training model
# Note that we can reuse these weights, since the sizes of the trained and decoder network are the same.
# The trained network took in 30 characters, but remember that all these 30 used the same input weights.
# That is one of the advantages of RNNs: They are independent of sequence lengths.
model_dec.set_weights(model_train.get_weights())

## Test the Model

- Start from one random character of the corpus, then generate 1000 characters.

In [None]:
# Sample 1000 characters from the decoding model, seeded with one random character from the corpus.
generated = generate_text_segment(1000, diversity=1.0, generating_model = model_dec, input_sequence_length = 1)
sys.stdout.write(generated)
print()