# Text Generation pt 2

In this second part of Text Generation we move from statistical methods to neural networks. If you're up for it, it can still be fun to take last session's code and try out improvements such as back-off.  

Remember how it worked? We grabbed n-grams of either characters or words and saved statistics on how often each word (or words) followed another. Basically:  
"`Hello you poor old bastard. Hello you damn soul. Hello there.`"  
makes `you` the most probable follow-up to `Hello`, while `there` is less likely.
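
As a reminder of the idea, here is a minimal sketch of the bigram statistics for the sentence above. It is not last session's code; the names `words` and `followers` are made up for this illustration, and punctuation is simply stripped.

```python
from collections import Counter, defaultdict

text = "Hello you poor old bastard. Hello you damn soul. Hello there."
words = text.replace(".", "").split()

# Count how often each word follows another (bigram statistics).
followers = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    followers[current][nxt] += 1

# Follow-ups to "Hello", most frequent first.
print(followers["Hello"].most_common())
```

Sampling the next word in proportion to these counts is what gave us generated text last time.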


## Neural Text Generation
For this session we're moving on to Neural Networks, more exactly a Sequence-to-Sequence structure. I know some of you have already actually generated text through this technique, remember? Earlier we translated English to Swedish.

We'll use TensorFlow 2.0 for this workshop, mainly because I have a tiny bit more experience with TF than with PyTorch.

### Imports

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
# Eager execution is on by default in TF 2.x, so no tf.enable_eager_execution() call is needed (it only exists in TF 1.x).
import numpy as np
import os
import time

### Download the dataset

In [0]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

### And a little data understanding
As often said, it's important to understand & know your data.

In [0]:
# Read the whole blob to include newlines.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# As we're gonna create a char2char network the length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [0]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [0]:
# The unique characters (classes) in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

65 unique characters


### Process text
#### Vectorizing
We can't process strings directly, so we need to convert them into a numerical representation. Create two lookup tables, one for each mapping (`char2idx`, `idx2char`).

In [0]:
# Creating a mapping from unique characters to indices
char2idx = # TODO
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

Now each unique character is mapped to an integer in the range 0..len(vocab)-1.
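
If the two lookup tables feel abstract, this is a small sketch of the same idea on a toy string. The names prefixed with `sample_` are made up here, and a plain list stands in for the NumPy array used in the workshop code.

```python
sample = "hello"  # toy text, not the Shakespeare corpus
sample_vocab = sorted(set(sample))  # unique characters in sorted order

# char -> index via a dict, index -> char via a plain list
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)

encoded = [sample_char2idx[ch] for ch in sample]
decoded = "".join(sample_idx2char[i] for i in encoded)
print(encoded, decoded)
```

Encoding and then decoding gives back the original string, which is a handy sanity check for your own mapping.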

In [0]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '$' :   3,
  '&' :   4,
  "'" :   5,
  ',' :   6,
  '-' :   7,
  '.' :   8,
  '3' :   9,
  ':' :  10,
  ';' :  11,
  '?' :  12,
  'A' :  13,
  'B' :  14,
  'C' :  15,
  'D' :  16,
  'E' :  17,
  'F' :  18,
  'G' :  19,
  ...
}


And vectorizing a sentence:

In [0]:
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'First Citizen' ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]


### The prediction task
As in the last session, we want to predict the next character given the previous character(s).  
To visualize this: we have a sequence of characters and predict the next one, one time-step (that is, one character) at a time.

Recurrent Neural Networks (RNNs) are very well suited for this task. RNNs maintain an internal state that depends on the previously seen elements, so each prediction is based on all characters processed so far.  

Improvements for future: 

1.   Bidirectional RNNs (why?)
2.   Attention (why?)
3.   Transformers

#### Create training & target examples
The text is one long, practically "infinite" sequence, so we need to chop it up. Next up is to divide it into sequences of length `seq_length`.  

For each input sequence the target will contain `seq_length` characters as well but shifted one character to the right.

Example:  
`seq_length=3` and `text="Hurray"` give `input="Hur"` and `target="urr"`.

To do this we can use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

So break the text into chunks of seq_length+1. For example, say seq_length is 4 and our text is "Hello". The input sequence would be "Hell", and the target sequence "ello".
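
The chunking and shifting can be tried out on plain strings first. This is only a sketch of the idea without `tf.data`; the helper name `make_examples` is made up for this illustration.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length + 1 and split each into (input, target)."""
    chunk = seq_length + 1
    examples = []
    # Step by the chunk size and drop a trailing chunk that is too short.
    for start in range(0, len(text) - chunk + 1, chunk):
        piece = text[start:start + chunk]
        # Input is everything but the last character, target everything but the first.
        examples.append((piece[:-1], piece[1:]))
    return examples

print(make_examples("Hello", 4))
print(make_examples("Hurray", 3))
```

Note how the leftover characters of "Hurray" that don't fill a whole chunk are dropped, which is the same thing `drop_remainder` does further down.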

In [0]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
 
# use tf.data.Dataset.from_tensor_slices
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices
char_dataset = # TODO

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

F
i
r
s
t


TensorFlow provides another useful method, `batch`, which lets us convert these individual characters to sequences of the desired size.

In [0]:
# Create sequences with batch. Datasets have this method; don't forget the length (seq_length+1) and apply drop_remainder.
# https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch
sequences = # TODO

# Visualization
for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))


'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


Let's create the targets.  
For every sequence, duplicate it and shift it by one character to form the input and the target.

In [0]:
def split_input_target(chunk):
    input_text = # TODO
    target_text = # TODO
    return input_text, target_text

dataset = sequences.map(split_input_target)

for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

As mentioned each character in the sequence is a time-step. 

In [0]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

#### Create training batches
First, what is a batch again? Anyone recall?

After each batch has been predicted we update the weights through backpropagation. That means we basically learn nothing in between: if we stopped in the middle of a batch, the samples seen so far in that batch wouldn't help future predictions. 

Why don't we just train on all data at the same time?  
That's because of memory, and speed.

Why don't we just set batch-size to 1?  
The smaller the batch, the less accurate the estimate of the gradient is (we need statistics, right?). See this image where green is mini-batch:
![alt text](https://i.stack.imgur.com/lU3sx.png) (StackOverflow)
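
To make the "we need statistics" point concrete, here is a small simulation. It assumes, purely for illustration, that every sample produces a noisy gradient around a true value of 1.0; the numbers and the helper `batch_estimates` are made up and no real network is involved.

```python
import random
import statistics

random.seed(0)
# Pretend each training sample yields a noisy gradient around the true value 1.0.
per_sample_grads = [random.gauss(1.0, 2.0) for _ in range(6400)]

def batch_estimates(grads, batch_size):
    """Mean gradient of each consecutive batch."""
    return [statistics.mean(grads[i:i + batch_size])
            for i in range(0, len(grads), batch_size)]

for batch_size in (1, 64):
    estimates = batch_estimates(per_sample_grads, batch_size)
    # Print batch size, number of updates and the spread of the gradient estimates.
    print(batch_size, len(estimates), round(statistics.stdev(estimates), 3))
```

The spread of the estimates shrinks as the batch grows, at the price of fewer weight updates per epoch.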

In [0]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset
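
The buffer comment above can be illustrated in plain Python. This is a sketch of the idea only, not TensorFlow's implementation; the function `buffered_shuffle` and its `seed` parameter are made up for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in shuffled order while holding at most buffer_size of them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer, nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(buffered_shuffle(range(10), buffer_size=3))
print(shuffled)
```

With a small buffer an element can only move a limited distance from its original position, which is why the buffer should be large compared to the structure in the data.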

#### Build the model
![alt text](https://media2.giphy.com/media/S5fMUcgKUs04E/giphy.webp?cid=790b7611db0c8b5845b9b3ce4659fcf0a15f0ae4753e2a81&rid=giphy.webp)

To simplify our life we'll use Keras in TensorFlow 2.0. Keras, as mentioned, is an abstraction that helps us focus on the big picture. Pure TensorFlow code lets us dive into the details and optimize the last bits. As we're focusing on getting things done we won't do that, but for companies it could be the deciding factor. Examples are improving scores with a custom loss function or transforming data manually between layers. All of this is harder in Keras.



##### Keras
We'll use `tf.keras.Sequential` to define our model.  
As earlier, we'll start off by embedding our text using an embedding layer (`tf.keras.layers.Embedding`), with the vocabulary size as the input dimension and the embedding dimension as the size of the vector space.  
Next up we'll put a GRU layer (LSTM works too), `tf.keras.layers.GRU`.  
Finally we'll have an output layer (i.e. the classifier) in the form of a Dense layer (I've never seen anything else used tbh) - `tf.keras.layers.Dense`. 

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [0]:
# TODO fill out the layers.

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense()
  ])
  return model

In [0]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)


Run-through for each character (timestep):


1.   Model looks up embedding (which is trained during training)
2.   Model runs the GRU one timestep with embedding as input
3.   Apply dense layer to classify next character

![alt text](https://www.tensorflow.org/tutorials/text/images/text_generation_training.png)
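
The three steps can be walked through by hand with tiny random weights. Note the assumptions: this sketch swaps the GRU for a plain tanh recurrent cell to keep it short, uses toy sizes, and all names (`step`, `w_in`, `w_rec`, `w_out`) are made up. It is not what Keras does internally.

```python
import math
import random

rng = random.Random(0)
vocab_size, embedding_dim, units = 4, 3, 2  # toy sizes, far smaller than the workshop's

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

embedding = rand_matrix(vocab_size, embedding_dim)  # one row per character
w_in = rand_matrix(units, embedding_dim)            # input -> state
w_rec = rand_matrix(units, units)                   # state -> state
w_out = rand_matrix(vocab_size, units)              # state -> logits

def step(char_index, state):
    x = embedding[char_index]                        # 1. embedding lookup
    pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, state))]
    new_state = [math.tanh(v) for v in pre]          # 2. one recurrent step (tanh cell, not a GRU)
    logits = matvec(w_out, new_state)                # 3. dense layer: one logit per character
    return logits, new_state

state = [0.0] * units
for char_index in [1, 0, 2]:
    logits, state = step(char_index, state)
print(len(logits), len(state))
```

The state carried from one call of `step` to the next is what lets the prediction depend on more than the current character.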


#### Trial
Let's test this!

In [0]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

In [0]:
# Notice that the output above has sequence length 100, but the model itself accepts any length (None).

model.summary()

To get actual predictions from the model we need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.

**Note:** it is important _not_ to use the argmax but rather to sample from the distribution, as the model otherwise easily gets stuck in a loop (I believe we noticed this once during the statistical approach too).
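
A small pure-Python sketch of the difference, assuming three made-up logits. The helpers `softmax` and `sample_index` are written for this illustration; in the notebook itself `tf.random.categorical` does the sampling.

```python
import math
import random

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index with probability given by softmax(logits)."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

logits = [2.0, 1.5, 0.1]  # made-up scores for three characters
rng = random.Random(0)

greedy = [max(range(len(logits)), key=logits.__getitem__) for _ in range(10)]
sampled = [sample_index(logits, rng) for _ in range(10)]
print(greedy)   # argmax picks the same index every time
print(sampled)  # sampling follows the probabilities
```

Because argmax is deterministic, the same context always produces the same continuation, which is how loops appear.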

Let's go through this step by step now.

In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

`sampled_indices` now contains a prediction for each timestep in the sequence.

In [0]:
sampled_indices

Decoding.

In [0]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

#### Training

At this point the problem can be treated as a standard classification problem. Given the previous RNN state, and the input this time step, predict the class of the next character.

##### Attach an optimizer, and a loss function
The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because our model returns logits, we need to set the `from_logits` flag.

P.S.  
If you feel confused by `sparse_categorical_crossentropy`, why `sparse`? It is because our labels are integer class indices.  
`categorical_crossentropy` is applied when the labels are one-hot encoded.
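
The two variants compute the same number, only the label format differs. Here is a hand-rolled sketch for a single prediction; the function names are made up and this is not the Keras implementation.

```python
import math

def log_softmax(logits):
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def sparse_cross_entropy(label, logits):
    """Label is an integer class index."""
    return -log_softmax(logits)[label]

def one_hot_cross_entropy(one_hot, logits):
    """Label is a one-hot vector with the same length as logits."""
    return -sum(t * lp for t, lp in zip(one_hot, log_softmax(logits)))

logits = [2.0, 0.5, -1.0]  # made-up scores for three classes
print(sparse_cross_entropy(1, logits))
print(one_hot_cross_entropy([0, 1, 0], logits))
```

With 65 classes and about a million characters, keeping the labels as integers also saves a lot of memory compared to one-hot vectors.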

In [0]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

My personal go-to optimizer is `Adam`, but feel free to try out `RMSprop` or any other. Let's compile the model.

![alt text](https://miro.medium.com/max/1240/1*SjtKOauOXFVjWRR7iCtHiA.gif)

In [0]:
model.compile # TODO add compile options

##### Checkpoints
Checkpoints, checkpoints and even more checkpoints. I love checkpoints, and you should too.  
We'll use a `tf.keras.callbacks.ModelCheckpoint` to make sure that checkpoints are saved during training.

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

##### Execution
To keep training time reasonable, use 10 epochs to train the model. **DON'T** forget to turn on GPU, we're talking RNNs here.

In [0]:
EPOCHS = 10

In [0]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

#### Text Generation
##### Restore the last check point
To keep this prediction step simple, use a batch size of 1.

Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built.

To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.

In [0]:
tf.train.latest_checkpoint(checkpoint_dir)

In [0]:
model = # TODO build model

model.load_weights # TODO load weights

model.build(tf.TensorShape([1, None]))

In [0]:
model.summary()

##### Prediction loop
The following code block generates the text:

*   It starts by choosing a start string, initializing the RNN state and setting the number of characters to generate.
*   Get the prediction distribution of the next character using the start string and the RNN state.
*   Then, use a categorical distribution to calculate the index of the predicted character. Use this predicted character as our next input to the model.
*   The RNN state returned by the model is fed back into the model so that it now has more context, rather than only one character. After predicting the next character, the modified RNN state is again fed back into the model. The model is not learning at this point; it simply conditions on more context from the previously predicted characters.

![alt text](https://www.tensorflow.org/tutorials/text/images/text_generation_sampling.png)

Looking at the generated text, you'll see the model knows when to capitalize, makes paragraphs and imitates a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.
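
The feedback loop described above can be tried without a trained network. In this sketch a hypothetical `toy_model` table stands in for the RNN; unlike the real model it only looks at the last character and carries no state, so it shows the sample-and-feed-back mechanics and nothing more.

```python
import random

# Hypothetical stand-in for the trained network: next-character weights per character.
toy_model = {
    "a": {"b": 3, "a": 1},
    "b": {"a": 2, "c": 2},
    "c": {"a": 4},
}

def generate(start_string, num_generate, seed=0):
    rng = random.Random(seed)
    current = start_string[-1]  # the toy model only looks at the last character
    generated = []
    for _ in range(num_generate):
        options = toy_model[current]
        # Sample the next character instead of taking the most likely one.
        current = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        generated.append(current)  # the sampled character becomes the next input
    return start_string + "".join(generated)

print(generate("ab", 10))
```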

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = # TODO

  # Converting our start string to numbers (vectorizing)
  input_eval = # TODO
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append # TODO revert idx to char

  return (start_string + ''.join(text_generated))


In [0]:
print(generate_text(model, start_string=u"ROMEO: "))


The easiest thing you can do to improve the results is to train for longer (try EPOCHS=30).

You can also experiment with a different start string, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.
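
To see what the temperature actually does, here is a sketch that divides made-up logits by the temperature before the softmax, the same scaling `generate_text` applies before sampling. The helper name `softmax_with_temperature` is an assumption for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A low temperature concentrates the probability on the top character, while a high temperature flattens the distribution and makes rare characters more likely.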


## If time permits, let's do the advanced version
https://www.tensorflow.org/tutorials/text/text_generation#advanced_customized_training