In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

## Generating Shakespearean Text using a Character RNN

We'll start exploring NLP by creating an RNN that takes in a sequence of characters and tries to predict the next character in the sequence. Check this [cool article](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) that contains examples (including Algebraic Geometry)!

We'll train our model on Shakespeare's text

#### Creating the training set

In [2]:
shakespeare_url = 'https://homl.info/shakespeare'
filepath = keras.utils.get_file('shakespeare.txt', shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [3]:
shakespeare_text[:15]

'First Citizen:\n'

To process the text, we'll encode each character as an integer by using Keras' ```Tokenizer``` class

In [4]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

Now we can give the tokenizer a string and get back its encoding

In [5]:
tokenizer.texts_to_sequences([shakespeare_text[:15]])

[[20, 6, 9, 8, 3, 1, 19, 6, 3, 6, 36, 2, 10, 24, 11]]
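To make the mapping concrete, here is a minimal plain-Python sketch of roughly what a character-level tokenizer does: lowercase the text, then give each distinct character an integer ID starting at 1, with more frequent characters getting smaller IDs. The names `build_char_index`, `encode` and `decode` are made up for this illustration and are not part of Keras, and tie-breaking between equally frequent characters may differ from the real `Tokenizer`.

```python
from collections import Counter

def build_char_index(text):
    """Map each distinct (lowercased) character to an ID, starting at 1."""
    counts = Counter(text.lower())
    # most_common() sorts by descending frequency
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Turn a string into a list of character IDs."""
    return [char_index[ch] for ch in text.lower()]

def decode(ids, char_index):
    """Turn a list of character IDs back into a (lowercased) string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)

sample = "First Citizen:\n"
char_index = build_char_index(sample)
ids = encode(sample, char_index)
print(ids)
print(repr(decode(ids, char_index)))
```

Because the text is lowercased before encoding, decoding returns the lowercased version of the input rather than the original string.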

In [6]:
max_id = len(tokenizer.word_index)  # number of distinct characters
max_id

39

In [7]:
# total number of characters
dataset_size = tokenizer.document_count
dataset_size

1115394

Encode the text (subtracting 1 to get ranges from 0 to 38)

In [8]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

In [9]:
encoded

array([19,  5,  8, ..., 20, 26, 10])

#### Splitting a sequential Dataset

Now we need to split the data into training, validation and test sets. But how do we do that with characters? We can't just shuffle all the characters in the text.

With time series, we generally split across time: for example, we might take the years 2000 to 2012 for training, 2013 to 2015 for validation and 2016 to 2018 for testing.

In other cases we can split along other dimensions, such as industry type or country.

Splitting across time is safe, but it implicitly assumes that patterns learned in the past will still exist in the future, i.e. that the time series is *stationary*. To check that the time series is sufficiently stationary, we can plot the model's error on the validation set across time: if the model performs better on the first part of the validation set than on the last part, then the time series might not be stationary enough, and training the model on a shorter time span might be better.

For the Shakespeare example we'll take the first 90% of the text for the training set, keeping the rest for validation and testing.

In [10]:
train_size = dataset_size *90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
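The cell above only builds the training set. As a rough sketch of how the remaining 10% could be split in order (not something this notebook does), the slicing below uses a toy list in place of `encoded`; the 5%/5% proportions and the variable names are assumptions for illustration.

```python
# Toy stand-in for the encoded text: 100 consecutive "characters".
encoded_toy = list(range(100))
n = len(encoded_toy)

train_end = n * 90 // 100   # first 90% for training
valid_end = n * 95 // 100   # next 5% for validation (assumed proportion)

train_part = encoded_toy[:train_end]
valid_part = encoded_toy[train_end:valid_end]
test_part = encoded_toy[valid_end:]   # last 5% for testing

# The three parts are contiguous and never overlap, so later text
# never appears in the training set.
print(len(train_part), len(valid_part), len(test_part))
```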

#### Chopping data into windows

We'll use ```dataset.window``` to convert the one long instance we have into many smaller windows of text. Every instance in training will be a fairly short substring of the full text and the RNN will be unrolled over the length of the substrings. This is called **truncated backpropagation through time**

In [11]:
n_steps = 100 # tune-able parameter
window_length = n_steps + 1 # target will be the input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

Using ```shift=1``` we create overlapping windows: the first window will contain characters 0 to 100, the second 1 to 101 and so on...

```drop_remainder``` ensures all windows are 101 characters long and we don't have to do any padding
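The windowing behaviour described above can be imitated in plain Python on a short string. This is only a sketch of the idea with a hypothetical helper `sliding_windows`; it is not how `tf.data` implements `window()`.

```python
def sliding_windows(seq, size, shift=1, drop_remainder=True):
    """Plain-Python imitation of cutting a sequence into windows."""
    windows = []
    for start in range(0, len(seq), shift):
        window = seq[start:start + size]
        if drop_remainder and len(window) < size:
            break  # every later window would be too short as well
        windows.append(window)
    return windows

text = "To be or not"
for w in sliding_windows(text, size=5, shift=1):
    print(repr(w))
```

With `drop_remainder=False` the helper would also return the shorter windows at the end of the text, which is why padding would then be needed.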

```window()``` creates a dataset containing windows, each of which is also a dataset. It is a *nested dataset*. This is useful when we want to transform each window by calling dataset methods but it cannot be used directly for training. The model expects tensors, not datasets. 

The ```flat_map``` method converts a nested dataset into a *flat dataset* (i.e. without nesting). It also takes a function as an argument, which allows you to transform each nested dataset _before_ flattening.

For example, if we pass ```lambda ds: ds.batch(2)``` to ```flat_map```, the nested dataset $\{\{1, 2\}, \{3,4, 5,6\}\}$ becomes the flat dataset $\{[1,2], [3,4], [5,6]\}$.
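The same example can be reproduced with ordinary lists. The helpers `batch` and `flat_map` below are hypothetical stand-ins written for this illustration, not the `tf.data` methods themselves.

```python
def batch(items, size):
    """Group items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def flat_map(nested, fn):
    """Apply fn to each inner collection, then concatenate the results."""
    flat = []
    for inner in nested:
        flat.extend(fn(inner))
    return flat

nested = [[1, 2], [3, 4, 5, 6]]
result = flat_map(nested, lambda ds: batch(ds, 2))
print(result)  # [[1, 2], [3, 4], [5, 6]]
```

Batching each inner collection by its full length instead would yield exactly one list per inner collection, which is the trick used to turn each window into a single tensor.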

In [12]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

By passing ```window_length``` to ```batch```, we get a single tensor for each window (all windows have that length). That is, the dataset now contains windows of 101 characters each. Now we can shuffle these windows, batch them, and separate inputs from targets. See Figure 16-1 for a pictorial representation of the process

In [13]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
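The input/target split is easier to see on a single short window. The sketch below uses a plain string of 10 characters instead of a batch of integer tensors, so `window[:-1]` and `window[1:]` play the role of `windows[:, :-1]` and `windows[:, 1:]`.

```python
window = "First Citi"   # one toy window of 10 characters
inputs, targets = window[:-1], window[1:]

# Each target is simply the character that follows the matching input.
for x, y in zip(inputs, targets):
    print(repr(x), "->", repr(y))
```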

In [14]:
for X, y in dataset.take(1):
    print(X)
    print(y)

tf.Tensor(
[[11 22 26 ... 11 14  7]
 [ 1 27  7 ... 29  0 16]
 [10  8  3 ... 10 14  4]
 ...
 [11 12  0 ...  9 29 10]
 [ 1  0 16 ... 13  8  2]
 [18  3 13 ... 16  3 13]], shape=(32, 100), dtype=int64)
tf.Tensor(
[[22 26  0 ... 14  7  0]
 [27  7  0 ...  0 16  6]
 [ 8  3 14 ... 14  4  8]
 ...
 [12  0 21 ... 29 10 10]
 [ 0 16  6 ...  8  2 17]
 [ 3 13  9 ...  3 13 11]], shape=(32, 100), dtype=int64)


We'll one-hot encode the inputs, since there are fairly few distinct characters (only 39)

In [15]:
dataset = dataset.map(lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
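For readers unfamiliar with one-hot encoding, here is a small plain-Python sketch of the idea on a toy vocabulary of 4 characters. The helper `one_hot` is written for this illustration only; in the notebook the work is done by `tf.one_hot`.

```python
def one_hot(ids, depth):
    """Return one row per ID, with a single 1.0 at that ID's position."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

encoded_ids = [2, 0, 3]   # toy character IDs in the range 0..3
for row in one_hot(encoded_ids, depth=4):
    print(row)
```

This is also why the IDs were shifted to start at 0 earlier: with `depth=max_id`, valid positions run from 0 to `max_id - 1`.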

#### Building and Training the Char-RNN model

We'll create a model with two GRU layers of 128 units each, followed by a time-distributed Dense layer of 39 units (```max_id```), since we want to output a probability for each possible character at each time step

**Note:** Running the code below takes a good while...

In [16]:
# model = keras.models.Sequential([
#     keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
#                      dropout=0.2, recurrent_dropout=0.2),
#     keras.layers.GRU(128, return_sequences=True,
#                      dropout=0.2, recurrent_dropout=0.2),
#     keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
# ])
# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# history = model.fit(dataset, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [19]:
model.save('saved_models/16_nlp_with_rnns/shakespeare.h5')

In [97]:
model = keras.models.load_model('saved_models/16_nlp_with_rnns/shakespeare.h5')

#### Using the Char-RNN model

To pass some inputs to our model, we need to apply the same preprocessing we did for training it

In [20]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

And a test:

In [98]:
X_new = preprocess(['How are yo'])
y_pred = np.argmax(model.predict(X_new), axis=-1) + 1 # Account for the 1 we removed during preprocessing
y_pred

array([[ 2, 12,  1,  6, 10,  2,  1, 16,  4, 14]])

The model generates one prediction for each character in the input. Since we're only interested in the prediction for the last character, we simply pick the last prediction

In [48]:
tokenizer.sequences_to_texts(y_pred)[0][-1] # 1st sentence, last char

'u'

#### Generating new text

Now that we have a model, we could start using it by feeding in some text, generating a prediction, appending it to the text, feeding the extended text back in, getting a new prediction, etc...

While this approach does work, it often leads to the same words being predicted over and over again. Instead, we can sample from the distribution of possible characters using ```tf.random.categorical()```, which generates more interesting text.

```categorical``` samples random class indices, given the class log probabilities (logits). We can also introduce a notion of *temperature*, which controls how far the sampling strays from the model's most likely predictions. We divide the logits by the temperature: a temperature close to 0 favours high-probability characters, a temperature of 1 samples from the model's distribution unchanged, and a very high temperature gives all characters nearly equal probability.
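To see the effect of the temperature on its own, the sketch below rescales a made-up distribution over three characters in plain Python. The helper `rescale` and the numbers in `probas` are illustrative assumptions; it applies a softmax to the rescaled logits so that the resulting probabilities can be inspected.

```python
import math

def rescale(probas, temperature):
    """Divide log-probabilities by the temperature, then renormalize."""
    logits = [math.log(p) / temperature for p in probas]
    top = max(logits)   # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas = [0.7, 0.2, 0.1]   # toy next-character distribution
for t in (0.2, 1.0, 5.0):
    print(t, [round(p, 3) for p in rescale(probas, t)])
```

A low temperature pushes almost all the mass onto the most likely character, a temperature of 1 returns the original distribution, and a high temperature flattens it.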

We'll define the ```next_char``` function to do this for us

In [94]:
def next_char(text, model, temperature=1):
    X_new = preprocess([text])
    y_proba = model.predict(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

And a helper function that will generate the text and fill it in

In [95]:
def complete_text(text, model, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, model, temperature)
    return text

In [101]:
print(complete_text('t', model, temperature=0.5))

the raiment servant,
that she is a rest daughter wh


In [103]:
print(complete_text('test', model, temperature=0.5))

test this gate,
and what you be gone, 'tis not whom th


In [104]:
print(complete_text('adam', model, temperature=0.8))

adam, if you, understand me! i am all begot.

bianca:



In [105]:
print(complete_text('u', model, temperature=1))

urst:
and she's better with mind.

petruchio:
sir, 


In [106]:
print(complete_text('k', model, temperature=2))

k exock, read up grew i
schreadt, upon, knavoliut u


To get better text we can add more layers, train for longer, add regularization. The limitation of this model is that it is incapable of learning patterns longer than n_steps=100 characters. Stateful RNNs can help with that.

#### Stateful RNN

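The RNNs discussed so far are *stateless*: at each training iteration the model starts with a hidden state full of zeros, updates it at each time step, and discards it after the last time step. If we instead preserve the final state after processing one training batch and use it as the initial state for the next batch, the RNN can learn long-term patterns despite only backpropagating through short sequences. This is called a *stateful RNN*.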

It has a catch though: training a stateful RNN only makes sense if each input sequence in a batch starts exactly where the corresponding sequence in the previous batch left off.
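The difference between discarding and carrying over the state can be illustrated with a toy recurrence. The function `run_rnn` below is a made-up one-number "RNN" (not a GRU) used only to show the bookkeeping: when consecutive chunks are fed in order and the state is carried over, the final state matches what we would get from processing the whole sequence at once.

```python
def run_rnn(inputs, state=0.0):
    """Toy recurrence h = 0.5 * h + x; returns the final state."""
    for x in inputs:
        state = 0.5 * state + x
    return state

sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [sequence[:3], sequence[3:]]   # two consecutive short sequences

# Stateless: every chunk starts again from a zero state.
stateless_final = [run_rnn(chunk) for chunk in chunks][-1]

# Stateful: the final state of one chunk seeds the next one.
state = 0.0
for chunk in chunks:
    state = run_rnn(chunk, state)

print(stateless_final, state, run_rnn(sequence))
```

If the chunks were shuffled, the carried-over state would no longer correspond to the preceding text, which is the catch mentioned above.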

Instead of using shuffled data, we will use sequential, non-overlapping sequences. To do this we'll create the dataset using ```shift=n_steps``` and naturally we won't ```shuffle``` the data. 

Batching is where things get a bit harder. If we were to use ```batch(32)```, the first batch would contain windows 1 through 32 and the next batch windows 33 to 64, so the first windows of consecutive batches (windows 1 and 33) would not be consecutive in the text. One approach is to use 'batches' containing a single window

In [71]:
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_length))
dataset = dataset.batch(1)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)

Another approach would be to divide the corpus into 32 texts of equal length, create one dataset of consecutive input sequences for each of them, and then use ```tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))``` to create proper batches, where the nth input sequence in a batch starts off exactly where the nth input sequence ended in the previous batch

In [78]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size)
datasets = []
for encoded_part in encoded_parts:
    dataset = tf.data.Dataset.from_tensor_slices(encoded_part)
    dataset = dataset.window(window_length, shift=n_steps, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    datasets.append(dataset)
dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows))
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
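A toy version of this layout in plain Python may help in checking the alignment. The sketch below uses a "corpus" of 20 integers, 2 parts instead of 32, and a hypothetical helper `windows`; it only imitates the window/zip logic and is not the `tf.data` pipeline itself.

```python
def windows(seq, size, shift):
    """Full-length windows only, like drop_remainder=True."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, shift)]

n_steps = 3
window_length = n_steps + 1
batch_size = 2

corpus = list(range(20))
part_len = len(corpus) // batch_size
parts = [corpus[i * part_len:(i + 1) * part_len] for i in range(batch_size)]

# One list of consecutive windows per part, then zip them into batches.
per_part = [windows(p, window_length, shift=n_steps) for p in parts]
batches = [list(rows) for rows in zip(*per_part)]

for b in batches:
    print(b)
```

Row `i` of each batch begins with the element that ended row `i` of the previous batch, so once the last element is dropped to form the inputs, each input sequence continues exactly where the previous one stopped.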

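Now to create the RNN we must set ```stateful=True``` on each recurrent layer, as well as specify ```batch_input_shape``` for the first layer (a stateful RNN needs to know the batch size, since it keeps a state for each input sequence in the batch). We leave the second dimension unspecified since the inputs can have any length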

In [75]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2,
                     batch_input_shape=[batch_size, None, max_id]),
    keras.layers.GRU(128, return_sequences=True, stateful=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

We also need to add a little callback to ensure we reset the states at the start of each epoch, before we go back to the beginning of the text.

In [76]:
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()

In [81]:
# model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# history = model.fit(dataset, epochs=50, callbacks=[ResetStatesCallback()])

Epoch 1/50
...
Epoch 50/50


In [82]:
model.save('saved_models/16_nlp_with_rnns/stateful_shake.h5')

In [107]:
model = keras.models.load_model('saved_models/16_nlp_with_rnns/stateful_shake.h5')

Now that this model is trained, it can only make predictions for batches of the same size as the one used during training. To avoid this restriction, we can create an identical stateless model and copy the stateful model's weights to it

In [108]:
stateless_model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

In [109]:
stateless_model.set_weights(model.get_weights())

And then use it for some predictions!

In [113]:
print(complete_text('u', stateless_model))

umber:
too this some his brother clarence.

duke vi


## Sentiment Analysis