# Text Generation

This Google Colaboratory notebook is paired with the tutorial on text generation. 

In this lab, you will:

* create training data from a raw text corpus
* build and train an LSTM network using Keras
* implement a greedy sampler
* experiment with other model architectures



## Data Loading

In [0]:
# Run setup code for this notebook to download the corpus

import urllib.request
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.callbacks import Callback
from random import randint

"""Other training sets
Shakespeare's sonnets:
https://gist.githubusercontent.com/mohitd/387c2b98e15a5d292c9385da56c607f6/raw/2042cb4e6489671a62cc805b34c5095369733017/sonnets.txt

Linux kernel code in C:
https://gist.githubusercontent.com/mohitd/08cf6d5832a5502a68b5be5ead14d4aa/raw/a483495db3d746c3480a9fa8fb760f7dd9bd1b41/kernel.c
"""

# download corpus
url = 'https://gist.githubusercontent.com/mohitd/387c2b98e15a5d292c9385da56c607f6/raw/2042cb4e6489671a62cc805b34c5095369733017/sonnets.txt'
response = urllib.request.urlopen(url)
corpus = response.read().decode('utf-8')
print('Downloaded corpus!')

# get some statistics on the corpus
# sort the characters so the character-to-index mapping is the same on every run
chars = sorted(set(corpus))
data_size, vocab_size = len(corpus), len(chars)
print('{} chars in corpus'.format(data_size))
print('{} unique chars'.format(vocab_size))

# construct dictionaries to create "embeddings" for the characters
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
print('Created embedding dictionaries!')

Downloaded corpus!
Created embedding dictionaries!


## Create Training Data

Since we are using a character-based model, the goal for our network is to classify the next character given the previous 50 characters. So we need to create training data where the input data is 50 characters and the output is the next character.

Split the corpus into overlapping blocks (simply shift over by 1 character each time) of 50 characters into `sentences`. Then extract the next character after those 50 into `next_chars`. The `sentences` variable should be a list of strings, and `next_chars` should be a list of single characters.

(Hint: use string slicing! e.g., `string[x:x+10]` extracts 10 characters starting at position `x`)
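
As a rough illustration of the sliding-window idea (not the solution for the full corpus), here is a minimal sketch on a toy string. The names `toy_corpus`, `window`, `toy_sentences`, and `toy_next_chars` are hypothetical and chosen only for this example; the window is 4 characters instead of 50 so the output is easy to read.

```python
# Toy corpus and window length (hypothetical values, chosen for readability)
toy_corpus = "hello world"
window = 4

toy_sentences = []
toy_next_chars = []

# Stop early enough that every window still has a character following it
for i in range(len(toy_corpus) - window):
    toy_sentences.append(toy_corpus[i:i + window])   # block of `window` characters
    toy_next_chars.append(toy_corpus[i + window])    # the character right after it

print(toy_sentences[:3])   # ['hell', 'ello', 'llo ']
print(toy_next_chars[:3])  # ['o', ' ', 'w']
```

Note that a string of length 11 with a window of 4 yields 11 - 4 = 7 pairs, because the last few positions have no "next" character to predict.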

In [0]:
sentence_length = 50
sentences = []
next_chars = []

"""
YOUR CODE HERE
"""
assert(all(len(sentence) == sentence_length for sentence in sentences))
print('{} sentences, each {} characters long'.format(len(sentences), sentence_length))

550171 sentences, each 50 characters long


Now that we have string representations of the characters, we need to convert these string/character representations into numerical representations using our character "embeddings". These "embeddings" are really just one-hot vectors for each character.

Iterate through each of the sentences to populate the `X` and `y` matrices. Remember that `X[i]` refers to a particular sentence, i.e., a sequence of character vectors. And `X[i,j]` refers to a particular character vector in a particular sentence. Use the `char_to_idx` dictionary to convert from characters to their index in the character vector.
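
The indexing can be easier to see on a tiny example. The sketch below one-hot encodes two 3-character "sentences" over a 3-character vocabulary; all of the `toy_*` names and the `X_toy`/`y_toy` arrays are hypothetical stand-ins for the notebook's real variables.

```python
import numpy as np

# Hypothetical toy data: two "sentences" and the character that follows each
toy_sentences = ["abc", "bca"]
toy_next_chars = ["a", "b"]

toy_chars = sorted(set("".join(toy_sentences)))            # ['a', 'b', 'c']
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}

# Shapes: (num_sentences, sentence_length, vocab_size) and (num_sentences, vocab_size)
X_toy = np.zeros((2, 3, 3), dtype=bool)
y_toy = np.zeros((2, 3), dtype=bool)

for i, sentence in enumerate(toy_sentences):
    for j, ch in enumerate(sentence):
        # X_toy[i, j] is the one-hot vector for character j of sentence i
        X_toy[i, j, toy_char_to_idx[ch]] = True
    # y_toy[i] is the one-hot vector for the character following sentence i
    y_toy[i, toy_char_to_idx[toy_next_chars[i]]] = True

print(X_toy[0].astype(int))  # "abc" becomes the 3x3 identity matrix
print(y_toy.astype(int))
```

Every row of every `X_toy[i]` contains exactly one `True`, which is a quick sanity check worth running on the real `X` and `y` as well.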

In [0]:
num_sentences = len(sentences)
# plain `bool` is accepted by every NumPy version; the `np.bool` alias was removed in NumPy 1.24
X = np.zeros((num_sentences, sentence_length, vocab_size), dtype=bool)
y = np.zeros((num_sentences, vocab_size), dtype=bool)

"""
YOUR CODE HERE
"""
print(X.shape)
print(y.shape)
print('Created vectorized input and output')

(550171, 50, 99)
(550171, 99)
Created vectorized input and output


## Building the Model

Now that we have our training data, we need to create our RNN model. For this tutorial, we will be using [Keras](https://keras.io/), an API for working with neural networks in Python.

Our base model will be a sequential model, which is a linear stack of layers. Our model will have two layers, the first of which will be our RNN. For this model, rather than the simple RNN, we will use a Long Short-Term Memory (LSTM) layer. Read the tutorial for more information on LSTMs. By default, a Keras LSTM layer returns only its output at the final time step, which is customary for sequence classification. The second layer will be a fully-connected layer that performs the classification: it produces a probability distribution over all of the unique characters.

Steps:
- Add a 256-unit LSTM layer (make sure to use the correct input shape!)
- Add a dense/fully connected layer to perform classification (make sure to use the correct number of neurons and the correct activation function!)
- Compile the model using categorical cross-entropy as the loss function and the adam optimizer with the default parameters

For reference:

https://keras.io/getting-started/sequential-model-guide/

https://keras.io/layers/recurrent/

https://keras.io/layers/core/
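
To make the "probability distribution" and "categorical cross-entropy" ideas concrete, here is a minimal NumPy sketch that is independent of Keras. It assumes the classification layer uses a softmax activation, which is the usual way to turn raw scores into probabilities; the helper names `softmax` and `categorical_cross_entropy` and the example numbers are hypothetical and only for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def categorical_cross_entropy(one_hot_target, probs):
    """Loss for a single example: negative log-probability of the true class."""
    return -np.sum(one_hot_target * np.log(probs))

# Made-up scores for a 3-character vocabulary
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

# One-hot target: the true next character is index 0
target = np.array([1.0, 0.0, 0.0])
loss = categorical_cross_entropy(target, probs)

print(probs, probs.sum())
print(loss)
```

The loss shrinks toward 0 as the probability assigned to the correct character approaches 1, which is what training tries to achieve across all sentences.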

In [0]:
model = Sequential()
"""
YOUR CODE HERE
"""

## Sampling from an RNN

Once our model is built and trained – we'll get to training in a second – we need to be able to use it to generate text. That is, after all, why we came here in the first place. We will do this by sampling from the model with an initial seed.

Remember that our model works by taking a "sentence" of the previous 50 characters and, based on that input, predicts the next character. When we start predicting, however, we don't have any characters yet. So, to get the prediction started, we take a "seed sentence" from our corpus and use that for our first prediction. Each character we predict will be concatenated onto the seed, and then, once we've predicted at least 50 characters, our future predictions will be based on our actual generated text rather than text from the original corpus.

Steps:
1. Randomly select a "seed sentence" (sequence of 50 characters) from the corpus (or you may supply your own!)
2. Generate a one-hot character embedding from the seed sentence
3. Have the model predict the next character based on the preceding 50 characters. Note that the prediction will not be a single character – it will be an array of probabilities for each possible next character. Choose the one with the highest probability, i.e., the argmax!
4. Add the highest probability character to the end of our generated text, then append its one-hot character embedding to the end of the running seed (dropping the oldest character so the seed stays 50 characters long)
5. Repeat steps 3 & 4 until you've built up a string of predicted characters equal to the sample length 

Hints
* The seed should be of size `(1, 50, vocab_size)` where 1 represents the batch size.
* Use `model.predict(...)` with the running seed as input and use `idx_to_char` to convert the predicted index back to a character
* You'll need to encode that character as a one-hot character (the size should be `(1, 1, vocab_size)`)
* Use `np.concatenate` in combination with numpy array slicing to attach the model's prediction to the seed while retaining the `(1, 50, vocab_size)` size.
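
The array bookkeeping in the hints can be sketched without a trained network. In the example below, `fake_predict` is a hypothetical stand-in for `model.predict` that simply favors the "next" letter in a 3-character vocabulary, and the window is 4 characters instead of 50. It is only meant to show the argmax, one-hot, and `np.concatenate` steps, not to replace your `sample_from_model`.

```python
import numpy as np

# Hypothetical toy vocabulary and window size
toy_chars = ['a', 'b', 'c']
toy_vocab = len(toy_chars)
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx_to_char = {i: c for i, c in enumerate(toy_chars)}
window = 4

def fake_predict(batch):
    """Stand-in for model.predict: favors the letter after the last one (a -> b -> c -> a)."""
    last_idx = int(np.argmax(batch[0, -1]))
    probs = np.full((1, toy_vocab), 0.1)
    probs[0, (last_idx + 1) % toy_vocab] = 0.8
    return probs

# One-hot encode the seed; shape is (1, window, vocab) where 1 is the batch size
seed_text = "abca"
seed = np.zeros((1, window, toy_vocab), dtype=bool)
for j, ch in enumerate(seed_text):
    seed[0, j, toy_char_to_idx[ch]] = True

generated = ''
for _ in range(5):
    probs = fake_predict(seed)
    next_idx = int(np.argmax(probs[0]))        # greedy choice: highest probability
    generated += toy_idx_to_char[next_idx]

    # One-hot encode the chosen character with shape (1, 1, vocab)
    next_vec = np.zeros((1, 1, toy_vocab), dtype=bool)
    next_vec[0, 0, next_idx] = True

    # Drop the oldest character and append the new one, keeping the window size fixed
    seed = np.concatenate([seed[:, 1:, :], next_vec], axis=1)

print(generated)   # bcabc
print(seed.shape)  # (1, 4, 3)
```

Slicing with `seed[:, 1:, :]` before concatenating is what keeps the input at a constant length no matter how many characters are generated.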




In [0]:
def sample_from_model(model, sample_length=100):
  generated_text = ''
  """
  YOUR CODE HERE
  """
  return generated_text

Next, create a subclass of Keras's `Callback` class called `SamplerCallback`. A `Callback` class is used to provide some sort of feedback during model training. These callbacks have hooks that we can use to run code during training, e.g., `on_epoch_{begin/end}`, `on_batch_{begin/end}`, and `on_train_{begin/end}`. In this case, we can create one that will give us sample text after each training epoch. If our model is being trained properly, at the end of each training epoch, the sampled text from the model should look more and more like real text, up to a point, of course! We can't train a perfect model, after all.

To do this, simply define a method in your Callback class named `on_epoch_end` that takes three parameters: `self`, `epoch`, and `logs`. Keras calls it automatically at the end of each epoch. This method should sample text from the model using our previously-defined `sample_from_model` function and then print it. To access the current model being trained, use `self.model` rather than `model`, i.e., our global model. (This aids re-usability if our model were named something other than `model`.) Then we just need to create an instance of this class and pass it in the `callbacks` list when we train our model.
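
If the hook mechanism feels abstract, the following sketch mimics it in plain Python. Nothing here is Keras: `ToyCallback`, `RecordingCallback`, and `toy_fit` are hypothetical stand-ins, and the "loss" values are made up. The point is only to show how a training loop calls `on_epoch_end` on every callback it was given.

```python
class ToyCallback:
    """Hypothetical stand-in for a callback base class with an overridable hook."""
    def on_epoch_end(self, epoch, logs=None):
        pass

class RecordingCallback(ToyCallback):
    """Overrides the hook to record what it receives after each epoch."""
    def __init__(self):
        self.seen = []

    def on_epoch_end(self, epoch, logs=None):
        self.seen.append((epoch, logs['loss']))

def toy_fit(epochs, callbacks):
    """Stand-in training loop: no learning happens, it only fires the hooks."""
    for epoch in range(epochs):
        logs = {'loss': 1.0 / (epoch + 1)}  # made-up number, not a real loss
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs)

recorder = RecordingCallback()
toy_fit(3, [recorder])
print(recorder.seen)
```

Your `SamplerCallback` follows the same pattern, except that the hook body samples text from `self.model` and prints it.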

In [0]:
class SamplerCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
      pass
      """
      YOUR CODE HERE
      """

sampler_callback = SamplerCallback()

## Training the RNN

Finally, we need to train our RNN model by fitting it to our training data, which, in case you've forgotten, consists of a sequence of "sentences" (`X`) and a sequence of next characters (`y`). We will train our model to learn weights such that, given the "sentence" input, it produces the correct next character as output. Our model will train for a certain number of epochs – each epoch is one full pass over all the training data – using the specified optimizer to adjust the weights in the model to optimize the specified loss function when applied to the training data.

To fit our model to the training data, we will need to specify the following parameters:
- Training input (a numpy array)
- Training output (a numpy array)
- Number of training epochs (try 30)
- A batch_size indicating the number of training samples taken in for each update of the model weights (try 256)
- A list of callbacks (see previous section)

For reference, see: https://keras.io/models/model/
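
To get a feel for what these settings mean, here is a quick back-of-the-envelope calculation. It uses the 550171 sentences reported earlier for the sonnets corpus and assumes the final, smaller batch of each epoch is still used for an update; the variable names are hypothetical.

```python
import math

num_samples = 550171   # number of training sentences reported above
batch_size = 256
epochs = 30

# One weight update per batch; the last batch of an epoch may be smaller
updates_per_epoch = math.ceil(num_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 2150
print(total_updates)      # 64500
```

A larger batch size means fewer (but individually more expensive) updates per epoch, which is one of the tradeoffs to keep in mind when experimenting.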

In [0]:
"""
YOUR CODE HERE
"""

## Sample from the Trained Model

Now that our model is trained, we can see how well it did. Sample 1000 characters from the model using the `sample_from_model` function and print it. How does it look? Does it look like Shakespeare? Does it look like English? Is it pronounceable?

Hopefully the answer to all these questions is yes. Even this simple character-based model should be able to produce mostly actual English words, and Shakespearean-sounding ones at that. The grammar might be a little questionable, but it shouldn't be terrible either. Given that this is a character-based model with no inherent knowledge of the concept of "words," that's pretty amazing!

In [0]:
"""
YOUR CODE HERE
"""

If you've gotten this far, you can:

1. Try training your model with a different input dataset. We know that LSTMs can write Shakespeare, but what about C code? Use the other dataset and re-run all of the sections to see how well character-level models fare against code. **What if, instead of a character-level model, we tried to use a word-level model for C code? What do you expect the quality of this to be compared to a character-level model on C code?**

2. Play around with the model parameters to see how they affect your output. What happens if you increase/decrease the number of hidden units in your LSTM? Or use a different optimizer? In the beginning, we organized our training data so that our model could predict the next character based on the previous 50 characters. What if it was 100 characters? What if it was 10? Do you think the sampled output would get better or worse? What are the tradeoffs to using a larger sequence?

3. Recent literature seems to favor GRUs and 1D convolutional networks over LSTMs because they produce similar quality results but are more efficient to train. Try these models. Do they perform better than the LSTM?


In [0]:
"""
SANDBOX
"""