# Speech Generator

What better use of Machine Learning than to generate synthetic Trump speeches...

In [1]:
import logging
import os
import re
from datetime import datetime
from typing import Tuple

import numpy as np
import tensorflow as tf

from utils import (build_model, generate_text, get_checkpoint_callback,
                   get_tensorboard_callback, load_text, loss_func,
                   split_input_target)

%load_ext tensorboard

In [2]:
logger = logging.getLogger(__name__)

## Constants
A brief explanation of some of the constants:
- ```CKPT_DIR```: directory where training checkpoints are saved;
- ```SEQ_LENGTH```: length, in characters, of each input sequence;
- ```BUFFER_SIZE```: size of the buffer used to shuffle the dataset. ```tf.data``` is designed to work with possibly infinite sequences, so it does not try to shuffle the entire dataset in memory; instead, it maintains a buffer and shuffles elements within it;
- ```VOCAB_SIZE```: number of unique characters in the vocabulary (computed from the data further down).
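
To make the role of ```BUFFER_SIZE``` concrete, here is a minimal plain-Python sketch of buffer-based shuffling. It only illustrates the idea and is not how ```tf.data``` is implemented internally; the function name ```buffered_shuffle``` and its ```seed``` argument are made up for this example.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before emitting anything
            continue
        # Buffer is full: emit a random element and reuse its slot for the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(shuffled)
```

With a buffer of 3, an element can only move a few positions ahead of where it started, so a buffer much smaller than the dataset gives only a local shuffle. This is why the buffer is usually chosen to be large relative to the number of sequences.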

In [3]:
# CONSTANTS/config
CKPT_DIR = 'training_checkpoints'
DATA_FILE = 'data/2016_campaign_speeches.txt'
EPOCHS = 30
SEQ_LENGTH = 100
BATCH_SIZE = 64
BUFFER_SIZE = 10000
EMBEDDING_DIM = 256
RNN_UNITS = 1024

## Import & clean data

First let's import the training data (which can be found [here](https://github.com/ryanmcdermott/trump-speeches/blob/master/speeches.txt)). This data comprises the transcripts of Trump's 2016 campaign speeches. 

In [4]:
text = load_text(DATA_FILE)

In [5]:
vocab = sorted(set(text))
VOCAB_SIZE = len(vocab)
logger.info(f'Length of text: {len(text)} characters')
logger.info(f'{len(vocab)} unique characters')

2019-12-31 11:40:20,960 - __main__ - INFO - Length of text: 892783 characters
2019-12-31 11:40:20,962 - __main__ - INFO - 92 unique characters


Now we can encode the text numerically, mapping each unique character to an integer index...

In [6]:
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[char] for char in text])

...which looks something like the following.

In [7]:
num_chars = 17
logger.info(f'{repr(text[:num_chars])} ---- maps to ints ---- > {text_as_int[:num_chars]}')

2019-12-31 11:40:24,176 - __main__ - INFO - 'Thank you so much' ---- maps to ints ---- > [48 65 58 71 68  1 82 72 78  1 76 72  1 70 78 60 65]
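
The mapping is reversible, which is what later lets us turn predicted indices back into text. A small self-contained sketch of the round trip is shown here; it builds its vocabulary from the sample string only, so the indices differ from the ones logged above.

```python
sample = "Thank you so much"

# Build the vocabulary and both lookup tables from the sample alone.
vocab = sorted(set(sample))
char2idx = {char: idx for idx, char in enumerate(vocab)}
idx2char = vocab  # a list works as an index -> character lookup

encoded = [char2idx[char] for char in sample]
decoded = "".join(idx2char[idx] for idx in encoded)

print(encoded)
print(decoded)
```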


## Prediction

Given a character, or a sequence of characters, what is the most probable next character? This is the task we're training the model to perform. The input to the model is a sequence of characters, and the model is trained to predict the following character at each time step.

Because RNNs maintain an internal state that depends on previously seen elements, the question becomes: given all the characters seen so far, what is the next character?

## Create training examples and targets

Next, divide the text into example sequences. Each input sequence will contain ```SEQ_LENGTH``` characters from the text.

For each input sequence, the corresponding target contains the same length of text, shifted one character to the right.

So we break the text into chunks of ```SEQ_LENGTH + 1``` characters. For example, say ```SEQ_LENGTH``` is 4 and our text is ```Hello```. The input sequence would be ```Hell```, and the target sequence ```ello```.

To do this, first use the ```tf.data.Dataset.from_tensor_slices``` function to convert the text vector into a stream of character indices.

In [8]:
examples_per_epoch = len(text)//(SEQ_LENGTH+1)

# Create training examples/targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

The ```batch``` method lets us easily convert these individual characters to sequences of the desired size.

In [9]:
sequences = char_dataset.batch(SEQ_LENGTH+1, drop_remainder=True)
dataset = sequences.map(split_input_target)

We used ```tf.data``` to split the text into manageable sequences. But before feeding this data into the model, we need to shuffle the data and pack it into batches.

In [10]:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

## Build the model
Use ```tf.keras.Sequential``` to define the model. For this simple example, three layers are used:
- ```tf.keras.layers.Embedding```: The input layer. A trainable lookup table that maps the index of each character to a vector with ```EMBEDDING_DIM``` dimensions;
- ```tf.keras.layers.GRU```: A type of RNN with ```RNN_UNITS``` units (you can also use an LSTM layer here);
- ```tf.keras.layers.Dense```: The output layer, with ```VOCAB_SIZE``` outputs.

For each character, the model looks up the embedding, runs the GRU one time step with the embedding as input, and applies the dense layer to produce logits: one unnormalized score per vocabulary character for the next character.

In [11]:
model = build_model(
    VOCAB_SIZE,
    EMBEDDING_DIM,
    RNN_UNITS,
    BATCH_SIZE
)

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           23552     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 92)            94300     
Total params: 4,056,156
Trainable params: 4,056,156
Non-trainable params: 0
_________________________________________________________________
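
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below assumes the GRU keeps two bias vectors per gate (the ```reset_after=True``` variant); that assumption is consistent with the figures printed above, but ```build_model``` itself is not shown here.

```python
vocab_size, embedding_dim, rnn_units = 92, 256, 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with an input kernel, a recurrent kernel and two bias vectors.
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: a weight matrix plus one bias per output.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```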


## Train the model
At this point the problem can be treated as a standard classification problem: given the previous RNN state and the input at this time step, predict the class of the next character.

## Attach an optimizer, and a loss function
The standard ```tf.keras.losses.sparse_categorical_crossentropy``` loss function works in this case because it is applied across the last dimension of the predictions.

Because our model returns logits, we need to set the ```from_logits``` flag.

In [14]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=loss_func
)

In [15]:
# callbacks
checkpoint_callback = get_checkpoint_callback(CKPT_DIR)
tensorboard_callback = get_tensorboard_callback()

Now let's fit the model!

In [16]:
history = model.fit(
    dataset,
    epochs=EPOCHS,
    callbacks=[checkpoint_callback, tensorboard_callback]
)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## Generate text
### Restore the latest checkpoint
To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

In [26]:
tf.train.latest_checkpoint(CKPT_DIR)

'training_checkpoints/ckpt_30'

In [27]:
model = build_model(
    VOCAB_SIZE, 
    EMBEDDING_DIM, 
    RNN_UNITS, 
    batch_size=1
)

model.load_weights(tf.train.latest_checkpoint(CKPT_DIR))

model.build(tf.TensorShape([1, None]))

In [28]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (1, None, 256)            23552     
_________________________________________________________________
gru_4 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_4 (Dense)              (1, None, 92)             94300     
Total params: 4,056,156
Trainable params: 4,056,156
Non-trainable params: 0
_________________________________________________________________


## The prediction loop
The following code block generates the text:
- Choose a start string, initialize the RNN state, and set the number of characters to generate.
- Get the prediction distribution of the next character using the start string and the RNN state.
- Use a categorical distribution to sample the index of the predicted character, and use this character as the next input to the model.
- Feed the RNN state returned by the model back into the model, so that it has more context than a single character. After each prediction, the updated RNN state is fed back in again; this is how the model accumulates context from the previously generated characters.
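
The steps above can be sketched without TensorFlow. In the snippet below, ```toy_logits``` is a hypothetical stand-in for the trained model: it receives the text generated so far (instead of carrying an RNN state) and returns one logit per vocabulary entry. The real ```generate_text``` in ```utils``` is not shown and may differ.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities."""
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_logits, start_string, vocab, num_generate, seed=None):
    """Repeatedly sample the next character and append it to the text."""
    rng = random.Random(seed)
    text = start_string
    for _ in range(num_generate):
        probs = softmax(next_logits(text))
        # Sample from the categorical distribution rather than taking the argmax.
        next_char = rng.choices(vocab, weights=probs, k=1)[0]
        text += next_char  # the sampled character becomes part of the next input
    return text


toy_vocab = ["a", "b", " "]


def toy_logits(text):
    """Stand-in model: discourage repeating the previous character."""
    last = text[-1]
    return [0.0 if char == last else 2.0 for char in toy_vocab]


sample = generate(toy_logits, "a", toy_vocab, num_generate=20, seed=1)
print(repr(sample))
```

Sampling instead of always picking the most likely character keeps the generated text from collapsing into a repetitive loop.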

Looking at the generated text, you'll see the model knows when to capitalize, breaks the text into paragraphs, and imitates a Trump-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [33]:
sample_text = generate_text(model, "I have the best mind ", char2idx, idx2char)
print(sample_text)

I have the best mind and some people are going to be a much bigger party. Our party a fortune. I think we’re going to have some more coming in. And we’re going to start making Apple computers in this country. What the hell did we get? We sell them beef.
And then he took one who was a proven history of terrorism against the United States and make it great again. It’s not going to happen. Not going to happen.
I said what the hell does that help us? When does it help us? Where they have a baby, walks across the border. They’re destroying our country with a great financial statement. But we’re not going to let Carrier make air conditioners, fired basic party and somebody said, "Well, there’s a lot of stuff from John Deere because I really don’t know what they’re doing. He used the wrong kind of thing. That’s what we we have. They won’t understand it. They don’t want to be the next day the heads of this country. He said, "No, no, it’s true. It’s true.
Right?
Everybody said he’s a fantastic 

The easiest thing you can do to improve the results is to train the model for longer.

You can also experiment with a different start string, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.
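
To show what the temperature does, here is a small sketch that divides the logits by a temperature before applying softmax. How ```generate_text``` exposes this parameter is not shown in this notebook, so the function below is only an illustration with made-up names.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)     # sharper, more predictable
neutral = softmax_with_temperature(logits, temperature=1.0)  # plain softmax
hot = softmax_with_temperature(logits, temperature=2.0)      # flatter, more surprising

for name, probs in [("cold", cold), ("neutral", neutral), ("hot", hot)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring character, while higher temperatures spread it out and make the sampled text more random.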

# Embeddings in TensorBoard

In [36]:
# model.load_weights(tf.train.latest_checkpoint(CKPT_DIR))

# References

- [Text generation with an RNN; TensorFlow](https://www.tensorflow.org/tutorials/text/text_generation)