Alan Turing's famous Turing test helps evaluate whether a machine's intelligence matches a human's. Turing called this test the imitation game: a machine has to try to fool a human into thinking that it, too, is human.

Recurrent Neural Networks (RNNs) are a common approach to language tasks, and they can be applied in several ways:

- Character RNN, used to predict the next character in a sentence, first with a Stateless RNN and then with a Stateful RNN.
- Sentiment Analysis, which extracts the feeling expressed in a sentence.
- Neural Machine Translation (NMT), capable of translating between languages.

We will also look at how we can boost RNN performance by using an Encoder-Decoder architecture and Attention Mechanisms, which allow the network to focus on a select part of the inputs at each time step.

Finally, we will look at the Transformer, a very successful NLP architecture, before discussing GPT-2 and BERT.

In [1]:
import sys
assert sys.version_info >= (3, 5) # this notebook requires Python 3.5 or later

import numpy as np
import tensorflow as tf
assert tf.__version__ >= "2.0" # this notebook requires TensorFlow 2.x
from tensorflow import keras
import matplotlib.pyplot as plt

# Shakespeare Dataset

Below is an example of how we would work with text data: how to convert it using a tokenizer, and how to split it, because we cannot shuffle sequential text as we do with tabular data.

In [2]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [3]:
print(shakespeare_text[60:250])



All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [4]:
"".join(sorted(set(shakespeare_text.lower()))) # list of characters within dataset

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

## Tokenize Text data

In [5]:
# convert all characters into a unique character ID
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

In [6]:
tokenizer.texts_to_sequences('Romeo')

[[9], [4], [15], [2], [4]]

In [7]:
"".join(tokenizer.sequences_to_texts([[9], [4], [15], [2], [4]]))

'romeo'

In [8]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count

Note, the tokenizer assigns character IDs from 1 to 39, so when we convert the entire text to IDs we subtract 1 to get IDs from 0 to 38.


In [9]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
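
As a toy illustration of why we subtract 1, the sketch below builds a frequency-ranked character index in plain Python. It is not the Keras implementation: `build_char_index` and `encode` are hypothetical helpers, and the only behaviour carried over is that IDs start at 1 and more frequent characters get smaller IDs.

```python
from collections import Counter

def build_char_index(text):
    """Map each character to a 1-based ID, most frequent character first."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Convert text to 0-based IDs by subtracting 1 from every 1-based ID."""
    return [char_index[ch] - 1 for ch in text.lower()]

char_index = build_char_index("hello")
print(char_index)                   # 'l' is the most frequent character, so it gets ID 1
print(encode("hello", char_index))  # IDs now run from 0 to len(char_index) - 1
```

Zero-based IDs are convenient because they can be passed directly to one-hot encoding with a depth equal to the number of distinct characters.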

Before we talk about how we can split the text data into training, validation and test sets, let's first talk about how we can split time series data.

##### Splitting Time series data

The safest way is to split the data across time. For example, take the years 2000 to 2016 as the training set, 2017 to 2019 as the validation set, and 2020 to 2021 as the test set. Ensure there is no overlap between the sets.
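
A minimal sketch of such a chronological split, assuming the data is a list of `(year, value)` records. The helper `split_by_year` and the made-up records are hypothetical and only serve to illustrate the year boundaries mentioned above.

```python
def split_by_year(records, train_end, valid_end):
    """Split (year, value) records chronologically into three disjoint sets."""
    train = [r for r in records if r[0] <= train_end]
    valid = [r for r in records if train_end < r[0] <= valid_end]
    test = [r for r in records if r[0] > valid_end]
    return train, valid, test

# Made-up yearly observations, one per year from 2000 to 2021
records = [(year, float(year - 2000)) for year in range(2000, 2022)]
train, valid, test = split_by_year(records, train_end=2016, valid_end=2019)
print(len(train), len(valid), len(test))
```

Because every record falls into exactly one of the three year ranges, the sets cannot overlap.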

There are two potential problems with this approach: correlation between time series, and the assumption that your data is stationary.
- **Correlation** between time series can lead to an optimistically biased generalization error, because the training and test sets both contain time series that are correlated with each other. In these scenarios we should avoid having correlated time series across the training and test sets.

- Assuming that your data is a **Stationary** time series (i.e. the mean, variance and autocorrelation do not change over time). This assumption works well for most time series data, but some time series have patterns that disappear over time. In these scenarios we would benefit from training the model on short time spans. You can plot the model's error on the validation set over time, and if you observe increasing errors towards the end of the data then you know the data is not stationary enough.

For example, if you have financial data for many companies, some companies are well correlated because of the sectors that they are in. Traders exploit these correlations once they notice them, and the patterns may soon disappear as a result. The correlation, together with the non-stationary nature of the data, prevents us from obtaining a generalizable model.

Ultimately, how you split time series data depends on the task at hand. 




## Splitting Sequential Text data

Splitting text data is pretty simple: the sets must not overlap, and it helps to leave a gap between them so that no paragraph ends up in two sets.

In [10]:
train_size = dataset_size * 90 // 100 # take the first 90% of the data (multiply by 90, then floor-divide by 100)
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

The `dataset` is now a single sequence of over one million characters. Recall how RNNs work from the previous notebook. If we trained the neural network on it directly, it would be equivalent to training a deep neural network with over a million layers, on only one (very long) instance!

Instead, we need to convert this dataset into smaller windows of text. The window length is the maximum pattern length the RNN can learn. The RNN will be unrolled only over the length of these substrings; this is called **Truncated Backpropagation Through Time (TBPTT)**. Read [this](https://www.quora.com/Whats-the-key-difference-between-backprop-and-truncated-backprop-through-time) Quora answer to understand the difference between full backpropagation through time and the truncated variant.


In [11]:
n_steps = 100
window_length = n_steps + 1
dataset = dataset.window(size=window_length, shift=1, drop_remainder=True)

The `shift` argument sets the offset between the starts of consecutive windows, here 1 character. For example, the first window covers characters 0 to 100, the next covers 1 to 101, and so on. Setting `drop_remainder=True` ensures that every window has exactly `size` elements. Otherwise, the last windows would shrink from 100 characters down to 1.
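
The effect of `shift` and `drop_remainder` can be sketched on a tiny sequence with a plain-Python emulation. The helper `sliding_windows` is hypothetical and only mimics the windowing behaviour described above on a list; it is not the `tf.data` implementation.

```python
def sliding_windows(seq, size, shift=1, drop_remainder=True):
    """Return the windows of `seq` that start at 0, shift, 2 * shift, ..."""
    windows = []
    for start in range(0, len(seq), shift):
        window = seq[start:start + size]
        if drop_remainder and len(window) < size:
            break  # every later window would be shorter still
        windows.append(window)
    return windows

sequence = list(range(6))
print(sliding_windows(sequence, size=3))                        # full windows only
print(sliding_windows(sequence, size=3, drop_remainder=False))  # includes shorter trailing windows
```

With `drop_remainder=False`, the trailing windows get shorter and shorter, which is why the notebook drops them: the model expects windows of equal length.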

In [12]:
dataset # datasets within a dataset

<WindowDataset shapes: DatasetSpec(TensorSpec(shape=(), dtype=tf.int64, name=None), TensorShape([])), types: DatasetSpec(TensorSpec(shape=(), dtype=tf.int64, name=None), TensorShape([]))>

In [13]:
# we now need to flatten it, as the model only accepts tensors
dataset = dataset.flat_map(lambda window: window.batch(window_length))

# the flat_map function flattens the nested dataset
# the lambda function turns each window into a single tensor of window_length elements

# for example, if 
# example = {{1, 2}, {3, 4, 5, 6}, {7, 8, 9, 10}}
# then example.flat_map(lambda eg: eg.batch(2)), would become
# {[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]}

Now that the dataset is in the right shape we can shuffle these windows, since gradient descent works best when the instances in the training set are independent and identically distributed.

In [14]:
batch_size = 32
dataset = dataset.shuffle(10000, seed=42).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:])) # X, y

In [15]:
z = [1, 2, 3, 4, 5]
(z[: -1], z[1: ]) # the target is the input shifted one step ahead, i.e. the next character at each position

([1, 2, 3, 4], [2, 3, 4, 5])

In [16]:
# one hot encode the dataset as there are not many unique characters ~ 39
dataset = dataset.map(
    lambda X_batch, y_batch: (tf.one_hot(X_batch, depth=max_id), y_batch))

# calling prefetch allows later elements to be prepared while the current element is being processed
dataset = dataset.prefetch(1)

In [17]:
for X_batch, y_batch in dataset.take(1):
  print(X_batch.shape, y_batch.shape)

(32, 100, 39) (32, 100)


## Build Model - Char RNN

We can train a model on all of Shakespeare's work and then use it to predict the next character in a sentence. This can be used to produce novel text and is pretty fun to read about.

Read this blog by Andrej Karpathy: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

In [None]:
model = keras.models.Sequential([
  # recurrent_dropout=0.2 is left out because it prevents GPU support
  keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                   dropout=0.2),
  keras.layers.GRU(128, return_sequences=True, dropout=0.2),
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(dataset, epochs=10)

Epoch 1/10
   1122/Unknown - 221s 193ms/step - loss: 2.1430

## Make Predictions

In [None]:
def preprocess(texts):
  """
  Function that preprocesses text data and returns one hot encoded data. 
  """
  X = np.array(tokenizer.texts_to_sequences(texts)) - 1 # subtract 1 after converting to an array
  return tf.one_hot(X, max_id)

X_new = preprocess(['This is an exampl'])
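# predict_classes() is not available in newer TensorFlow versions, so take the argmax instead
Y_pred = np.argmax(model.predict(X_new), axis=-1)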
print(tokenizer.sequences_to_texts(Y_pred + 1)[0][-1]) # print first sentence last character

Although it is amusing and satisfying to have predicted the next character, always picking the most likely character does not work well in practice, because the model would repeat the same words over and over again.

Instead, we can pick the next letter randomly, which will generate more diverse and interesting text. We can use the `tf.random.categorical` function, which samples from logits; we divide the logits by a hyperparameter called the temperature. Low temperatures favour high probability characters, while high temperatures give all characters a nearly equal probability.
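
The effect of the temperature can be sketched in plain Python, without the trained model. This is only an illustration of the rescaling idea, not the `tf.random.categorical` implementation: `apply_temperature` and `sample_id` are hypothetical helpers, the probabilities are made up, and they are assumed to be strictly positive (as softmax outputs are).

```python
import math
import random

def apply_temperature(probas, temperature):
    """Rescale log-probabilities by 1 / temperature and renormalise them."""
    logits = [math.log(p) / temperature for p in probas]
    largest = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - largest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(probas, temperature=1.0, rng=random):
    """Draw one class ID at random from the temperature-rescaled distribution."""
    rescaled = apply_temperature(probas, temperature)
    return rng.choices(range(len(rescaled)), weights=rescaled, k=1)[0]

probas = [0.7, 0.2, 0.1]  # made-up next-character probabilities
for temperature in (0.1, 1.0, 10.0):
    print(temperature, [round(p, 3) for p in apply_temperature(probas, temperature)])
print(sample_id(probas, temperature=1.0, rng=random.Random(42)))
```

A temperature of 1 leaves the distribution unchanged, a temperature close to 0 approaches always picking the most likely character, and a large temperature approaches uniform sampling.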

The model is fine for short patterns, but if we want it to learn patterns that span many time steps, we can use Stateful RNNs.

# Stateful RNN

So far we have trained Stateless RNNs: at each training iteration the model starts with a hidden state full of zeros, updates it at each time step, and discards it after the last time step.

![Stateless vs Stateful](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-27947-9_24/MediaObjects/480892_1_En_24_Fig3_HTML.png)

Stateful RNNs reuse the final state of one batch as the initial state of the next batch instead of reinitializing it. This can allow the model to learn long term patterns.

One thing we need to change about the input dataset when using Stateful RNNs is how the batches are built: each input sequence in a batch must start exactly where the corresponding sequence in the previous batch left off. There should be no overlap like we saw with the `window()` method earlier. Stateful RNNs require sequential and non-overlapping input sequences.

Unfortunately, this is not easy to do and requires a lot of code.

In [None]:
batch_size = 32
encoded_parts = np.array_split(encoded[:train_size], batch_size) # split into 32 parts
# len(encoded_parts) = 32 

datasets = []
n_steps = 100
window_length = n_steps + 1

for encoded_part in encoded_parts:
  dataset = tf.data.Dataset.from_tensor_slices(encoded_part) # dataset object
  dataset = dataset.window(window_length, shift=n_steps,
                           drop_remainder=True) # consecutive windows share only one character
  dataset = dataset.flat_map(lambda window: (window.batch(window_length))) # flatten windows
  datasets.append(dataset)


dataset = tf.data.Dataset.zip(tuple(datasets)).map(lambda *windows: tf.stack(windows)) # stack the i-th window of every part into one batch
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
dataset = dataset.prefetch(1)
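
To see why the pipeline above lines up consecutive batches, here is a toy plain-Python version on a sequence of 24 integers. The helper `stateful_batches` is hypothetical, works on lists rather than `tf.data` datasets, and assumes the sequence length is divisible by the batch size.

```python
def stateful_batches(seq, batch_size, n_steps):
    """Build (X, Y) batches in which row i of each batch continues row i of the previous one."""
    part_len = len(seq) // batch_size  # assumes len(seq) is divisible by batch_size
    parts = [seq[i * part_len:(i + 1) * part_len] for i in range(batch_size)]
    window_length = n_steps + 1
    windows_per_part = []
    for part in parts:
        # full windows only, each starting n_steps after the previous one
        starts = range(0, len(part) - window_length + 1, n_steps)
        windows_per_part.append([part[s:s + window_length] for s in starts])
    batches = []
    for windows in zip(*windows_per_part):  # the k-th window of every part
        X = [w[:-1] for w in windows]       # inputs
        Y = [w[1:] for w in windows]        # targets, shifted one step ahead
        batches.append((X, Y))
    return batches

for X, Y in stateful_batches(list(range(24)), batch_size=2, n_steps=3):
    print(X, Y)
```

In this toy run, the first row of each input batch picks up exactly where the first row of the previous input batch ended, which is what lets a stateful layer carry its state from one batch to the next.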

Make sure you specify `stateful=True` and the `batch_input_shape`, so that TensorFlow can preserve a state for each input sequence in the batch.

In [None]:
model = keras.models.Sequential([
  keras.layers.GRU(128, return_sequences = True, stateful = True,
                   dropout = 0.2, # recurrent_dropout=0.2,
                   batch_input_shape = [batch_size, None, max_id]),
  keras.layers.GRU(128, return_sequences = True, stateful = True,
                   dropout = 0.2),
  keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation='softmax'))
])

In [None]:
class ResetStatesCallback(keras.callbacks.Callback):
  """
  Callback used in Stateful RNN model, to reset states at the end of each
  epoch.
  """
  def on_epoch_begin(self, epoch, logs):
    self.model.reset_states()

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(dataset, epochs=10, callbacks=[ResetStatesCallback()])

# Sentiment Analysis


# Bidirectional RNNs

# Attention Mechanisms