<a href="https://colab.research.google.com/github/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemyst_recurrent_neural_networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Recurrent Neural Networks
https://nlpdemystified.org<br>
https://github.com/nitinpunjabi/nlp-demystified

**IMPORTANT**<br>
Enable **GPU acceleration** by going to *Runtime > Change Runtime Type*. Keep in mind that, on the free tier, you're not guaranteed GPU access depending on usage history and current load.
<br><br>
Also, if you're running this for free in the cloud rather than using a paid tier or a local Jupyter server on your machine, then the notebook will *time out* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

# Part-of-Speech Tagging with a Bidirectional LSTM

It's difficult to find free sequence labelling datasets because they're so labour-intensive to create.
<br><br>
Fortunately, the **Natural Language Toolkit (NLTK)** includes enough free corpora for our purposes. NLTK also provides them in a convenient uniform format.<br>
https://www.nltk.org/index.html<br>
https://www.nltk.org/nltk_data/<br>
<br>
We'll use the Treebank, Brown, and CONLL-2000 datasets. 

In [1]:
import nltk

nltk.download('treebank')
nltk.download('brown')
nltk.download('conll2000')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.


True

In their original form, not all corpora use the same tagsets, so we'll also download the *universal_tagset* from NLTK so that all datasets share the same set.<br>
https://universaldependencies.org/u/pos/

In [2]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

In [3]:
from nltk.corpus import treebank, brown, conll2000

In [4]:
# Load all PoS-tagged sentences (downloaded above) and place them in one list.
tagged_sentences = treebank.tagged_sents(tagset='universal') +\
                   brown.tagged_sents(tagset='universal') +\
                   conll2000.tagged_sents(tagset='universal')

Each tagged sentence is a list of (token, tag) tuples.

In [5]:
print(tagged_sentences[0])
print(len(tagged_sentences))

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
72202


We need to separate the tokens from the tags. Each sequence of tokens (sentence) will be an input to our model, and each corresponding sequence of tags will be a label for that sentence.

In [6]:
sentences, sentence_tags = [], []

for s in tagged_sentences:
  sentence, tags = zip(*s)
  sentences.append(list(sentence))
  sentence_tags.append(list(tags))

In [7]:
print(sentences[0])
print(sentence_tags[0])

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.']


In [8]:
print(len(sentences), len(sentence_tags))

72202 72202


Create train/validation/test splits. This time, we don't have a separate test set so we'll call *train_test_split* twice.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [9]:
from sklearn.model_selection import train_test_split

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

x_train, x_test, y_train, y_test = train_test_split(sentences, sentence_tags, test_size=1 - train_ratio)

x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

In [10]:
print(len(x_train), len(y_train))
print(len(x_val), len(y_val))
print(len(x_test), len(y_test))

54151 54151
10830 10830
7221 7221
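
The arithmetic behind the two *train_test_split* calls can be checked by hand. The sketch below uses only the three ratios from the cell above; the names `holdout`, `second_test_size`, `val_fraction`, and `test_fraction` are introduced here purely for illustration.

```python
# Ratios from the split cell above.
train_ratio, validation_ratio, test_ratio = 0.75, 0.15, 0.10

# First call: hold out everything that is not training data.
holdout = 1 - train_ratio

# Second call: the share of the held-out portion that becomes the test set.
second_test_size = test_ratio / (test_ratio + validation_ratio)

# Fractions of the FULL dataset that end up in each split.
val_fraction = holdout * (1 - second_test_size)
test_fraction = holdout * second_test_size

print(round(second_test_size, 2), round(val_fraction, 2), round(test_fraction, 2))
```

The second call splits the held-out quarter 60/40, which is why roughly 15% of the 72202 sentences land in validation and 10% in test.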


If you watched the demo on **Static Word Embeddings**, the next few steps should look familiar.<br>
https://github.com/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemyst_static_word_embeddings.ipynb

First, we need to create a tokenizer for the sentences and fit it to create a vocabulary.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer

In [11]:
from tensorflow import keras

In [12]:
sentence_tokenizer = keras.preprocessing.text.Tokenizer(oov_token='<OOV>')

In [13]:
sentence_tokenizer.fit_on_texts(x_train)

In [14]:
print(len(sentence_tokenizer.word_index))
sentence_tokenizer.get_config()

52234


{'char_level': False,
 'document_count': 54151,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'index_docs': '{"6946": 12, "60": 1977, "698": 148, "14": 7559, "1145": 97, "2": 33350, "1918": 58, "27": 4288, "16590": 3, "489": 198, "329": 297, "1970": 55, "317": 281, "594": 173, "6232": 14, "354": 282, "86": 1288, "102": 1003, "8": 17357, "7850": 10, "19": 5985, "978": 114, "1336": 83, "9": 16328, "602": 176, "4": 47858, "443": 229, "1074": 100, "12277": 5, "430": 226, "5": 22323, "987": 113, "20646": 2, "1054": 101, "36": 1855, "620": 172, "651": 159, "375": 260, "6": 19726, "2640": 41, "33": 3571, "249": 381, "7380": 9, "16591": 3, "2513": 44, "9939": 6, "409": 242, "833": 134, "276": 338, "1752": 64, "166": 574, "1238": 90, "46": 2466, "8414": 9, "355": 274, "1576": 71, "24": 4686, "160": 580, "3": 29339, "1055": 101, "5201": 18, "120": 778, "364": 275, "20647": 2, "721": 143, "11": 8445, "654": 161, "99": 1073, "12": 8106, "272": 339, "7": 19215, "2308": 48, "2809": 39, "457":

We also need to create another tokenizer for the tags since our labels are also sequences.

In [15]:
tag_tokenizer = keras.preprocessing.text.Tokenizer()
tag_tokenizer.fit_on_texts(y_train)

In [16]:
print(len(tag_tokenizer.word_index))
tag_tokenizer.get_config()

12


{'char_level': False,
 'document_count': 54151,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'index_docs': '{"1": 51123, "6": 36419, "5": 44701, "4": 43853, "7": 29514, "3": 53366, "2": 50856, "9": 24461, "12": 2678, "10": 21810, "8": 26930, "11": 12015}',
 'index_word': '{"1": "noun", "2": "verb", "3": ".", "4": "adp", "5": "det", "6": "adj", "7": "adv", "8": "pron", "9": "conj", "10": "prt", "11": "num", "12": "x"}',
 'lower': True,
 'num_words': None,
 'oov_token': None,
 'split': ' ',
 'word_counts': '{"det": 127003, "adj": 80871, "noun": 287403, "verb": 174813, "adp": 136453, "adv": 51025, ".": 143408, "conj": 35198, "x": 6093, "prt": 31354, "pron": 44502, "num": 21422}',
 'word_docs': '{"noun": 51123, "adj": 36419, "det": 44701, "adp": 43853, "adv": 29514, ".": 53366, "verb": 50856, "conj": 24461, "x": 2678, "prt": 21810, "pron": 26930, "num": 12015}',
 'word_index': '{"noun": 1, "verb": 2, ".": 3, "adp": 4, "det": 5, "adj": 6, "adv": 7, "pron": 8, "conj": 9, "prt": 10, "

The **universal tagset** is a reduced tag list so items such as *proper nouns* are missing.

In [17]:
tag_tokenizer.word_index

{'.': 3,
 'adj': 6,
 'adp': 4,
 'adv': 7,
 'conj': 9,
 'det': 5,
 'noun': 1,
 'num': 11,
 'pron': 8,
 'prt': 10,
 'verb': 2,
 'x': 12}

Next, we need to vectorize our sentences and corresponding tags.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences

In [18]:
x_train_seqs = sentence_tokenizer.texts_to_sequences(x_train)

In [19]:
print(x_train_seqs[0])

[2, 16590, 698, 1336, 489, 443, 6232, 9, 1970, 354, 9, 8, 602, 7850, 1918, 19, 2, 6946, 1145, 102, 317, 329, 60, 27, 594, 14, 86, 978, 4]


In [20]:
sentence_tokenizer.sequences_to_texts([x_train_seqs[0]])

['the twenty-second soviet communist party congress opens in moscow today in a situation contrasting sharply with the script prepared many months ago when this meeting was first announced .']

In [21]:
y_train_seqs = tag_tokenizer.texts_to_sequences(y_train)

In [22]:
tag_tokenizer.sequences_to_texts([y_train_seqs[0]])

['det adj noun noun noun noun verb adp noun noun adp det noun verb adv adp det noun verb adj noun adv adv det noun verb adv verb .']

In [23]:
# Vectorize the validation sentences and tags.
x_val_seqs = sentence_tokenizer.texts_to_sequences(x_val)
y_val_seqs = tag_tokenizer.texts_to_sequences(y_val)

As we covered in the slides, **Recurrent Neural Networks** are capable of handling variable length sequences.<br><br>
Despite that, it's still best to pad (or truncate) sequences to a uniform length for one or both of these reasons:<br>
1. Performance. The longer a sequence, the higher the computation cost. One may want to truncate all sequences to the same length if that's feasible.
2. When processing datasets in batches, each sequence in a batch usually has to be of uniform length.<br>

For simplicity, in this demo, we'll make *every* sequence be as long as the longest sequence. A more optimized solution would be to make each sequence as long as the longest sequence in each *batch* to avoid unnecessary processing.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
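
To make the difference between global and per-batch padding concrete, here is a plain-Python sketch. `pad_post` is a hypothetical helper written for this illustration (it only pads and never truncates); it is not the Keras *pad_sequences* function, and the toy sequences are made up.

```python
def pad_post(seqs, maxlen=None, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (illustrative helper, no truncation)."""
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    return [list(s) + [value] * (maxlen - len(s)) for s in seqs]

# Toy integer sequences of different lengths (made up for this example).
seqs = [[5, 3], [7, 1, 4, 2, 9, 6], [8], [2, 2, 2]]

# Global padding: every sequence is as long as the longest one in the dataset.
global_padded = pad_post(seqs)

# Per-batch padding: each batch is only as long as its own longest member.
batch_size = 2
per_batch = [pad_post(seqs[i:i + batch_size]) for i in range(0, len(seqs), batch_size)]

print(global_padded)
print(per_batch)
```

In the second batch, the sequences are padded to length 3 instead of 6, which is the saving the per-batch approach is after.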

In [24]:
MAX_LENGTH = len(max(x_train_seqs, key=len))
print(MAX_LENGTH)

271


In [25]:
x_train_padded = keras.preprocessing.sequence.pad_sequences(x_train_seqs, padding='post', maxlen=MAX_LENGTH)

In [26]:
print(x_train_padded[0])

[    2 16590   698  1336   489   443  6232     9  1970   354     9     8
   602  7850  1918    19     2  6946  1145   102   317   329    60    27
   594    14    86   978     4     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

In [27]:
y_train_padded = keras.preprocessing.sequence.pad_sequences(y_train_seqs, padding='post', maxlen=MAX_LENGTH)

In [28]:
# Pad the validation sentences and tags.
x_val_padded = keras.preprocessing.sequence.pad_sequences(x_val_seqs, padding='post', maxlen=MAX_LENGTH)
y_val_padded = keras.preprocessing.sequence.pad_sequences(y_val_seqs, padding='post', maxlen=MAX_LENGTH)

PoS tagging is a multiclass classification task done at each timestep, so we need to convert every tag for every sentence into a one-hot encoding.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical<br>

In [29]:
y_train_categoricals = keras.utils.to_categorical(y_train_padded)

A sequence of tags for a single sentence is now a sequence of one-hot encodings.

In [30]:
print(y_train_categoricals[0])

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


In [31]:
# One-hot encoding for a single tag.
print(y_train_categoricals[0][0])

[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]


In [32]:
# Retrieving the corresponding tag.
import numpy as np
print(tag_tokenizer.index_word[np.argmax(y_train_categoricals[0][0])])

det
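
For intuition, the round trip from tag to one-hot vector and back can be sketched without Keras. The tag indices below are copied from the *word_index* output above; `one_hot` and `argmax` are small illustrative helpers, not library functions.

```python
# Tag indices copied from tag_tokenizer.word_index above; index 0 is reserved for padding.
word_index = {'noun': 1, 'verb': 2, '.': 3, 'adp': 4, 'det': 5, 'adj': 6,
              'adv': 7, 'pron': 8, 'conj': 9, 'prt': 10, 'num': 11, 'x': 12}
index_word = {i: w for w, i in word_index.items()}
num_classes = len(word_index) + 1  # +1 for the padding index 0

def one_hot(index, depth):
    """Return a list of zeros of length `depth` with a single 1.0 at `index`."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec

def argmax(vec):
    """Return the position of the largest value."""
    return max(range(len(vec)), key=vec.__getitem__)

encoded = one_hot(word_index['det'], num_classes)
print(encoded)
print(index_word[argmax(encoded)])
```

The vector has 13 entries rather than 12 because position 0 stands for padding.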


In [33]:
# Turn the validation set tags into one-hot encodings as well.
y_val_categoricals = keras.utils.to_categorical(y_val_padded)

In [34]:
# For the embedding layer.
num_tokens = len(sentence_tokenizer.word_index) + 1
embedding_dim = 128

# For the output layer.
num_classes = len(tag_tokenizer.word_index) + 1

At this point, we're ready to build our model. We'll train word embeddings concurrently with our model (though you can use pretrained word vectors as well).<br><br>
There are several new things here:<br>
1. The embedding layer has a *mask_zero* parameter. We added padding in order to make every sequence the same length, but we don't want the model to make PoS predictions on padding. Setting *mask_zero* to True makes the layers following the embedding layer ignore padding values.<br>
https://www.tensorflow.org/guide/keras/masking_and_padding<br>
https://stackoverflow.com/questions/47485216/how-does-mask-zero-in-keras-embedding-layer-work<br><br>
2. We're using a **bidirectional LSTM**. The *Bidirectional* layer is a wrapper to which we pass an *LSTM* layer. The first parameter to the *LSTM* layer is the number of units in the cell. The second parameter, *return_sequences*, controls whether the RNN returns an output for each timestep or only the last output. Since we're doing PoS-tagging, we want an output for each timestep.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional<br>
https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM<br><br>
3. A *TimeDistributed* layer wraps the *Dense* output layer. This way, the *Dense* layer with its *softmax* activation function gets applied to **every** sequential output to produce a PoS tag prediction.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/layers/TimeDistributed

In [35]:
# Import layers from the same Keras that ships with TensorFlow (imported above).
from tensorflow.keras import layers

In [36]:
model = keras.Sequential()

model.add(layers.Embedding(input_dim=num_tokens, 
                           output_dim=embedding_dim, 
                           input_length=MAX_LENGTH,
                           mask_zero=True))
model.add(layers.Bidirectional(layers.LSTM(128, return_sequences=True)))
model.add(layers.TimeDistributed(layers.Dense(num_classes, activation='softmax')))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


In [37]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 271, 128)          6686080   
                                                                 
 bidirectional (Bidirectiona  (None, 271, 256)         263168    
 l)                                                              
                                                                 
 time_distributed (TimeDistr  (None, 271, 13)          3341      
 ibuted)                                                         
                                                                 
Total params: 6,952,589
Trainable params: 6,952,589
Non-trainable params: 0
_________________________________________________________________
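
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on how the layers fit together. The sketch below assumes the standard LSTM layout (four weight blocks: the input, forget, and output gates plus the candidate cell state, each with input weights, recurrent weights, and a bias); the helper names are introduced here for illustration.

```python
def embedding_params(vocab_size, dim):
    # One `dim`-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four weight blocks, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

num_tokens = 52234 + 1   # vocabulary size from the sentence tokenizer, +1 for padding
embedding_dim = 128
lstm_units = 128
num_classes = 12 + 1     # 12 tags, +1 for padding

emb = embedding_params(num_tokens, embedding_dim)
bilstm = 2 * lstm_params(embedding_dim, lstm_units)  # one LSTM per direction
out = dense_params(2 * lstm_units, num_classes)      # forward and backward outputs are concatenated

print(emb, bilstm, out, emb + bilstm + out)
```

The doubled LSTM count and the 256-wide input to the Dense layer both come from the *Bidirectional* wrapper running one LSTM in each direction and concatenating their outputs.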


In [38]:
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(x_train_padded, y_train_categoricals, epochs=20, batch_size=256, validation_data=(x_val_padded, y_val_categoricals), callbacks=[es_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20


In [39]:
# Preprocess the test data and test the model.
x_test_seqs = sentence_tokenizer.texts_to_sequences(x_test)
x_test_padded = keras.preprocessing.sequence.pad_sequences(x_test_seqs, padding='post', maxlen=MAX_LENGTH)

y_test_seqs = tag_tokenizer.texts_to_sequences(y_test)
y_test_padded = keras.preprocessing.sequence.pad_sequences(y_test_seqs, padding='post', maxlen=MAX_LENGTH)
y_test_categoricals = keras.utils.to_categorical(y_test_padded)

In [40]:
model.evaluate(x_test_padded, y_test_categoricals)



[0.00796465016901493, 0.9696719646453857]

We can now use our model to tag sentences.

In [41]:
samples = [
    "Brown refused to testify.",
    "Brown sofas are on sale.",
]

The function below takes a list of strings, tokenizes and pads them, then has the model tag them.

In [42]:
def tag_sentences(sentences):
  sentences_seqs = sentence_tokenizer.texts_to_sequences(sentences)
  sentences_padded = keras.preprocessing.sequence.pad_sequences(sentences_seqs, maxlen=MAX_LENGTH, padding='post')

  # The model returns a list of PROBABILITY DISTRIBUTIONS (due to the softmax)
  # for EACH sentence. There is one probability distribution over the PoS tags
  # for each timestep (token position), including the padded positions.
  tag_preds = model.predict(sentences_padded)

  sentence_tags = []

  # For EACH LIST of probability distributions...
  for i, preds in enumerate(tag_preds):

    # Extract the most probable tag from EACH probability distribution.
    tags_seq = [np.argmax(p) for p in preds[:len(sentences_seqs[i])]]

    # Convert the sentence and tag sequences back to their token counterparts.
    words = [sentence_tokenizer.index_word[w] for w in sentences_seqs[i]]
    tags = [tag_tokenizer.index_word[t] for t in tags_seq]
    sentence_tags.append(list(zip(words, tags)))

  return sentence_tags


In [43]:
tagged_sample_sentences = tag_sentences(samples)

In [44]:
print(tagged_sample_sentences[0])

[('brown', 'noun'), ('refused', 'verb'), ('to', 'prt'), ('testify', 'verb')]


In [45]:
print(tagged_sample_sentences[1])

[('brown', 'adj'), ('sofas', 'noun'), ('are', 'verb'), ('on', 'adp'), ('sale', 'noun')]


# Language Modelling With Stacked LSTMs

We'll build a language model trained on the *Art of War* by Sun Tzu.

In [48]:
import requests
art_of_war = requests.get('https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/datasets/art_of_war.txt')\
                     .text

The language model we'll build will be **character**-based (as opposed to word-based). That is, given a sequence of one or more characters, the model will be asked to predict the next character.<br><br>
Character-level models have a few advantages:
- Smaller prediction space. There are only a few dozen distinct characters in a typical English text compared to the tens of thousands of words in a typical corpus.
- More resilience to out-of-vocabulary (OOV) conditions, and a better ability to learn the lower-level mechanics of language (including punctuation).<br><br>

On the other hand, character-level models need to learn a sequence of characters to "make sense" of a word (e.g. the sequence of "c", "a", "t" to identify "cat" as a pattern) which can be inefficient.<br><br>
RNNs can process any kind of sequence so what's shown here can easily be applied at the word level.

We'll initialize a Keras **Tokenizer** and set the *char_level* parameter to True so that our corpus gets tokenized into characters rather than words.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
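
As a rough picture of what character-level tokenization produces, here is a plain-Python sketch. `build_char_index` is a hypothetical helper written for this example; it assumes indices start at 1 and are handed out from most to least frequent character, and it makes no attempt to reproduce the rest of the Keras **Tokenizer** behaviour.

```python
from collections import Counter

def build_char_index(text, lower=True):
    """Map each character to an integer, most frequent first, starting at 1 (0 is left unused)."""
    if lower:
        text = text.lower()
    counts = Counter(text)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

text = "The art of war"
char_index = build_char_index(text)
index_char = {i: ch for ch, i in char_index.items()}

# Encode the text as integers, then decode it again.
encoded = [char_index[ch] for ch in text.lower()]
decoded = ''.join(index_char[i] for i in encoded)

print(char_index)
print(encoded)
print(decoded)
```

Even for this short string, the space character gets index 1 because it occurs most often, and the whole vocabulary has only nine entries.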

In [49]:
from tensorflow import keras
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)

In [50]:
tokenizer.fit_on_texts([art_of_war])

The tokenizer's internal dictionary now maps characters rather than words...

In [51]:
tokenizer.get_config()

{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': True,
 'oov_token': None,
 'document_count': 1,
 'word_counts': '{"1": 179, ".": 896, " ": 9794, "s": 3081, "u": 1467, "n": 3565, "t": 4398, "z": 20, "\\u016d": 13, "a": 3475, "i": 3573, "d": 1681, ":": 48, "h": 2558, "e": 5837, "r": 2776, "o": 3548, "f": 1238, "w": 981, "v": 478, "l": 1722, "m": 1201, "p": 769, "c": 1390, "\\n": 1443, "2": 127, ",": 634, "y": 1055, "b": 708, "j": 23, "q": 55, "g": 1007, "3": 87, "k": 345, "\\u2019": 57, "4": 66, "(": 59, ")": 59, ";": 168, "5": 58, "6": 51, "_": 62, "7": 39, "8": 36, "9": 34, "0": 38, "x": 49, "\\u2014": 16, "?": 8, "!": 8, "-": 57, "\\u201c": 3, "\\u201d": 3, "\\u0153": 7, "\\u00fc": 3, "\\u2018": 1}',
 'word_docs': '{"7": 1, "f": 1, "u": 1, "\\n": 1, "1": 1, "a": 1, "i": 1, "p": 1, "v": 1, "?": 1, "t": 1, "j": 1, "z": 1, "y": 1, "2": 1, "3": 1, "s": 1, "e": 1, "w": 1, "g": 1, ")": 1, "9": 1, "r": 1, "5": 1, "l": 1, "

...and the resulting possibility space is much smaller.

In [52]:
len(tokenizer.word_index)

56

As before, we'll turn the book's characters into a sequence of integers.

In [53]:
seq = tokenizer.texts_to_sequences([art_of_war])[0]

In [54]:
len(seq)

61054

In [55]:
tokenizer.sequences_to_texts([seq[:10]])

['1 .   s u n   t z ŭ']

To segment the vectorized corpus into training examples, we'll use the **TensorFlow Data** API which makes it easy to build preprocessing pipelines by chaining operations together.<br>
https://www.tensorflow.org/guide/data<br>
https://www.tensorflow.org/api_docs/python/tf/data<br>

The first step is to convert the vectorized corpus into a stream of character indices using *from_tensor_slices*.<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices

In [56]:
import tensorflow as tf
slices = tf.data.Dataset.from_tensor_slices(seq)

This returns a TensorFlow **Dataset** object, an abstraction that represents a sequence of elements.<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [57]:
type(slices)

tensorflow.python.data.ops.dataset_ops.TensorSliceDataset

Like Python **generators**, TF **Datasets** function like iterators, so we need to do things like convert them into a list or iterate over them in order to print them.<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take

In [58]:
# Create a Dataset from the first ten slices and convert to a list to output to console.
list(slices.take(10))

[<tf.Tensor: shape=(), dtype=int32, numpy=27>,
 <tf.Tensor: shape=(), dtype=int32, numpy=21>,
 <tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=int32, numpy=8>,
 <tf.Tensor: shape=(), dtype=int32, numpy=13>,
 <tf.Tensor: shape=(), dtype=int32, numpy=5>,
 <tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=int32, numpy=3>,
 <tf.Tensor: shape=(), dtype=int32, numpy=47>,
 <tf.Tensor: shape=(), dtype=int32, numpy=49>]

In [97]:
# The first ten elements from the original vectorized corpus.
seq[:10]

[27, 21, 1, 8, 13, 5, 1, 3, 47, 49]

The next step is to create the training examples. To do that, we'll use the *window* method. Calling *window* on a dataset results in a sequence of datasets, each containing N elements. So calling window(100) on a dataset of 1000 will result in a sequence of 10 datasets, each containing 100 elements.<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#window

Here, we're creating windows of *input_timesteps* + 1. The *input_timesteps* represents our training example length. The *+1* is there to help us create the target/label for each training example. This will be clarified further below.<br><br>
In addition, we're setting *shift* to 1. This means the result will contain overlapping windows shifted by 1. For example, if the input is [1, 2, 3, 4, ...], the first window will contain [1, 2, 3, ...], the second window will contain [2, 3, 4, ...], and so on. This is to create more training examples.<br><br>
Finally, we're setting *drop_remainder* to True which ensures ALL windows contain exactly N elements, i.e. any trailing window that would contain fewer than N elements is dropped.
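
The effect of the window size, *shift*, and *drop_remainder* is easy to see on a tiny list. `sliding_windows` below is a plain-Python stand-in written for illustration; it is not the TensorFlow implementation and returns ordinary lists rather than datasets.

```python
def sliding_windows(seq, size, shift=1, drop_remainder=True):
    """Plain-Python stand-in for windowing a sequence (illustrative helper)."""
    windows = []
    for start in range(0, len(seq), shift):
        window = seq[start:start + size]
        # Once a window comes up short, every later one will too.
        if len(window) < size and drop_remainder:
            break
        windows.append(window)
    return windows

data = [1, 2, 3, 4, 5, 6]

# Overlapping windows, each shifted by one element.
print(sliding_windows(data, size=3, shift=1))

# Non-overlapping windows (shift equal to the window size).
print(sliding_windows(data, size=3, shift=3))

# Keeping the short trailing windows instead of dropping them.
print(sliding_windows(data, size=4, shift=1, drop_remainder=False))
```

With a shift of 1 and short windows dropped, a sequence of length L yields L - N + 1 windows of size N, which is where the extra training examples come from.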

In [60]:
input_timesteps = 100
window_size = input_timesteps + 1
windows = slices.window(window_size, shift=1, drop_remainder=True)

Iterating through a subset of windows, we can see they're all the same length and that each subsequent window is shifted over by 1.

In [61]:
for w in windows.take(3):
  arr = list(w.as_numpy_iterator())
  print(len(arr), arr)

101 [27, 21, 1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12]
101 [21, 1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12, 2]
101 [1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12, 2

2022-02-11 12:26:02.415134: W tensorflow/core/framework/dataset.cc:768] Input of Window will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.


The *window* method returns a nested dataset of datasets (i.e. each window is a dataset)...

In [62]:
for w in windows.take(2):
  print(w)

<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>


...but we need tensors for our model. So we can use the window's *batch* method to convert each window object back to a tensor, then *flat_map* to flatten the results to a single (i.e. non-nested) dataset of tensors.<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#flat_map

In [63]:
dataset = windows.flat_map(lambda window: window.batch(window_size))

We now have a single dataset of tensors, where each tensor is *window_size* long and shifted by 1.

In [64]:
for d in dataset.take(2):
  print(d)

tf.Tensor(
[27 21  1  8 13  5  1  3 47 49  1  8  7  4 12 41  1  3 10  2  1  7  9  3
  1  6 16  1 20  7  9  1  4  8  1  6 16  1 25  4  3  7 11  1  4 17 22  6
  9  3  7  5 15  2  1  3  6  1  3 10  2  1  8  3  7  3  2 21 14 14 29 21
  1  4  3  1  4  8  1  7  1 17  7  3  3  2  9  1  6 16  1 11  4 16  2  1
  7  5 12  1 12], shape=(101,), dtype=int32)
tf.Tensor(
[21  1  8 13  5  1  3 47 49  1  8  7  4 12 41  1  3 10  2  1  7  9  3  1
  6 16  1 20  7  9  1  4  8  1  6 16  1 25  4  3  7 11  1  4 17 22  6  9
  3  7  5 15  2  1  3  6  1  3 10  2  1  8  3  7  3  2 21 14 14 29 21  1
  4  3  1  4  8  1  7  1 17  7  3  3  2  9  1  6 16  1 11  4 16  2  1  7
  5 12  1 12  2], shape=(101,), dtype=int32)


The next step is to create batches from our dataset. To do this, we'll first shuffle the dataset so that consecutive, heavily overlapping windows don't all land in the same batch (which also gives the model optimizer a better chance of "bouncing out of a local minimum"), then create batches.<br>
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle<br>


In [65]:
batch_size = 32

In [66]:
# The expected number of batches is (len(seq) - input_timesteps) / batch_size, rounded up.
batches = dataset.shuffle(10000).batch(batch_size)

In [67]:
for b in batches.take(2):
  print(b)

tf.Tensor(
[[ 9  4  7 ...  8 21 14]
 [14 14 30 ...  7  4  5]
 [ 2 40 10 ...  5 15  6]
 ...
 [ 7  9 17 ...  5  1  7]
 [ 9  6 16 ...  1 10  4]
 [ 7  9 17 ...  4 17 22]], shape=(32, 101), dtype=int32)
tf.Tensor(
[[ 2 12  1 ... 17  2  5]
 [ 1  9  7 ...  4  5 19]
 [ 1  2  5 ...  8  2  8]
 ...
 [ 9 18  1 ... 20  7  9]
 [18 24  1 ...  3  1 10]
 [ 4  8  1 ... 23 11  2]], shape=(32, 101), dtype=int32)
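
The expected number of batches can be worked out from figures already shown in this notebook: the corpus length (61054), *input_timesteps* (100), and the batch size (32). The sketch assumes the final, smaller batch is kept rather than dropped; the variable names are chosen here for illustration.

```python
import math

corpus_length = 61054          # len(seq) from earlier in the notebook
input_timesteps = 100
window_size = input_timesteps + 1
batch_size = 32

# With shift=1 and short windows dropped, there is one window per valid start position.
num_windows = corpus_length - window_size + 1

# Assumption: the last, partially filled batch is kept.
num_batches = math.ceil(num_windows / batch_size)
last_batch_size = num_windows - (num_batches - 1) * batch_size

print(num_windows, num_batches, last_batch_size)
```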


We can now separate each example into an input (x) and a corresponding label (y).<br><br>
In the slides, we talked about **Teacher Forcing** where:<br>
1. At each timestep during training, the output is compared to a label.
2. At the next timestep, rather than feeding the model its previous output, we feed it the next character of the input sequence (i.e. what the model should have output).
<br><br>

This is why each window is of size *input_timesteps + 1*. Each window is now going to be separated into TWO sequences. The first sequence will be the training input and will be of length *input_timesteps* (i.e. everything but the LAST character). The second sequence will be the label and will consist of all the window elements shifted by 1 (i.e. everything but the FIRST character).
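
On a single short window, the split looks like this. The example uses a made-up seven-character window of raw characters instead of integer indices so the alignment is easier to read.

```python
# A toy window of input_timesteps + 1 characters (here, 6 + 1).
window = list("sun tzu")

x = window[:-1]  # everything but the LAST character: the model's input
y = window[1:]   # everything but the FIRST character: the labels

# At each timestep, the label is simply the character that follows the input.
for step, (inp, target) in enumerate(zip(x, y)):
    print(step, repr(inp), '->', repr(target))
```

Every position in the input has a label, so a single window of 101 characters provides 100 next-character predictions to learn from.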

In [68]:
xy_batches = batches.map(lambda batch: (batch[:, :-1], batch[:, 1:]))

Each batch now consists of a set of inputs and a set of labels, with the labels shifted over by 1.

In [69]:
for b in xy_batches.take(1):
  print(b)

(<tf.Tensor: shape=(32, 100), dtype=int32, numpy=
array([[ 9, 26,  1, ...,  9, 26,  1],
       [ 2,  1, 19, ...,  4,  5,  1],
       [ 1, 19,  2, ...,  5,  1, 20],
       ...,
       [13, 22, 14, ...,  6, 11,  1],
       [22,  3,  4, ...,  4,  6,  5],
       [ 3,  4,  6, ...,  1,  6, 16]], dtype=int32)>, <tf.Tensor: shape=(32, 100), dtype=int32, numpy=
array([[26,  1,  6, ..., 26,  1,  4],
       [ 1, 19,  2, ...,  5,  1, 15],
       [19,  2,  5, ...,  1, 20, 10],
       ...,
       [22, 14,  6, ..., 11,  1, 10],
       [ 3,  4,  5, ...,  6,  5,  8],
       [ 4,  6,  5, ...,  6, 16,  1]], dtype=int32)>)


In [70]:
# For greater clarity.
for b in xy_batches.take(1):
  print("x1: ", b[0][0].numpy())
  print("\n")
  print("y1: ", b[1][0].numpy())

x1:  [ 2  1  3  6  1 23  2  1  4 17 22  6 25  2  9  4  8 10  2 12 21 14 14 27
 27 21  1  6  5  1  3 10  2  1  6  3 10  2  9  1 10  7  5 12 24  1  3 10
  2  1 22  9  6 40  4 17  4  3 18  1  6 16  1  7  5  1  7  9 17 18  1 15
  7 13  8  2  8  1 22  9  4 15  2  8  1  3  6  1 19  6  1 13 22 28 14  7
  5 12  1 10]


y1:  [ 1  3  6  1 23  2  1  4 17 22  6 25  2  9  4  8 10  2 12 21 14 14 27 27
 21  1  6  5  1  3 10  2  1  6  3 10  2  9  1 10  7  5 12 24  1  3 10  2
  1 22  9  6 40  4 17  4  3 18  1  6 16  1  7  5  1  7  9 17 18  1 15  7
 13  8  2  8  1 22  9  4 15  2  8  1  3  6  1 19  6  1 13 22 28 14  7  5
 12  1 10  4]


In [71]:
num_tokens = len(tokenizer.word_index) + 1

The last step before we can build our model is to one-hot encode the inputs. We're doing this because:
1. Our input is categorical, so the raw integer indices can't be fed to the LSTM as if they were meaningful quantities.
2. We're not using embeddings for the input. We could, but since this is a character model with just a few dozen possible choices, we can get away with one-hot encoding.

Note that despite our labels ALSO being categorical, we are NOT one-hot encoding them. This is because we'll be using a loss function that can help us skip that step (more below).
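
To see why integer labels are enough, compare the two forms of cross-entropy on a single prediction. The functions below are simplified, single-example versions written for illustration (no batching, no clipping); they are not the Keras implementations, and the probabilities are made up.

```python
import math

def categorical_crossentropy(one_hot_label, probs):
    """Cross-entropy against a one-hot label vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def sparse_categorical_crossentropy(label_index, probs):
    """Cross-entropy against an integer label: just pick out the true class."""
    return -math.log(probs[label_index])

# A made-up softmax output over three classes, where class 1 is correct.
probs = [0.1, 0.7, 0.2]
label_index = 1
one_hot_label = [0.0, 1.0, 0.0]

dense_loss = categorical_crossentropy(one_hot_label, probs)
sparse_loss = sparse_categorical_crossentropy(label_index, probs)

print(dense_loss, sparse_loss)
```

Since every term but one in the one-hot sum is multiplied by zero, both forms give the same loss; the sparse form simply skips building the one-hot vectors.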

In [72]:
xy_batches = xy_batches.map(lambda inputs, labels: (tf.one_hot(inputs, depth=num_tokens), labels))

In [73]:
for b in xy_batches.take(1):
  print("x1: ", b[0][0].numpy())
  print("\n")
  print("y1: ", b[1][0].numpy())

x1:  [[0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]]


y1:  [ 8 22  9  4  5 19  1 13 22  1  3  6 14  3  7 26  2  1  7 12 25  7  5  3
  7 19  2  1  6 16  1 18  6 13  9  1  2 40  3  9  2 17  4  3 18 21  1  3
 10  2  5  1  5  6  1 17  7  5 24  1 10  6 20  2 25  2  9  1 20  4  8  2
 24  1 20  4 11 11  1 23  2 14  7 23 11  2  1  3  6  1  7 25  2  9  3  1
  3 10  2  1]


The last step is to add some **prefetching**. This is an optimization step. This way, while the model trains on the current batch of data, the pipeline reads and prepares the next batch.<br>
https://www.tensorflow.org/guide/data_performance#prefetching

In [74]:
# Prefetch from the final pipeline of (input, label) batches, not the raw windows.
dataset = xy_batches.prefetch(tf.data.AUTOTUNE)

We can now build our model. There are three new things here:
1. We're stacking two LSTMs. The sequential output of the first LSTM will become the sequential input of the second LSTM.<br><br>
2. We're adding some *recurrent_dropout*. This drops connections between the recurrent units (i.e. the dropout is applied horizontally across time). You can still use regular *dropout* as well which will be applied to the inputs/outputs. Refer to this paper for more information:<br>
https://arxiv.org/abs/1512.05287<br><br>
3. We're using **sparse_categorical_crossentropy**. This allows us to provide labels as integers rather than one-hot encodings.<br>
https://keras.io/api/losses/probabilistic_losses/#sparsecategoricalcrossentropy-class
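
As a rough check on point 3, the NumPy sketch below compares the two ways of computing cross-entropy: indexing the predicted probability with an integer label (the sparse form) versus multiplying by a one-hot label (the dense form). The probabilities and labels are made-up values, and this mirrors the idea rather than Keras's internal implementation.

```python
import numpy as np

# Hypothetical predicted distributions over a 4-character vocabulary at 3 time steps.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])
labels = np.array([0, 1, 3])  # integer labels, as used in this notebook

# Sparse form: pick out the probability assigned to the correct index.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels])

# Dense form: the same quantity computed against one-hot labels.
one_hot_labels = np.eye(probs.shape[1])[labels]
dense_loss = -(one_hot_labels * np.log(probs)).sum(axis=1)

print(np.allclose(sparse_loss, dense_loss))  # True
```

Both forms give the same loss, so the sparse version simply saves the memory and the extra step of one-hot encoding the labels.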

In [75]:
from keras import layers

In [76]:
model = keras.models.Sequential()

model.add(layers.LSTM(128, return_sequences=True, input_shape=[None, num_tokens], recurrent_dropout=0.2))
model.add(layers.LSTM(128, return_sequences=True, recurrent_dropout=0.2))
model.add(layers.TimeDistributed(layers.Dense(num_tokens, activation='softmax')))

model.compile(loss="sparse_categorical_crossentropy", optimizer='adam')


In [77]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, None, 128)         95232     
                                                                 
 lstm_2 (LSTM)               (None, None, 128)         131584    
                                                                 
 time_distributed_1 (TimeDis  (None, None, 57)         7353      
 tributed)                                                       
                                                                 
Total params: 234,169
Trainable params: 234,169
Non-trainable params: 0
_________________________________________________________________
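
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with an input kernel, a recurrent kernel, and a bias); the helper function names are hypothetical, and the vocabulary size of 57 is read off the final layer's output shape above.

```python
def lstm_param_count(input_dim, units):
    # Four gates (input, forget, cell candidate, output), each with an
    # input kernel, a recurrent kernel, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output unit.
    return input_dim * units + units

num_tokens = 57  # vocabulary size, taken from the summary's last output shape

lstm_1 = lstm_param_count(num_tokens, 128)  # first LSTM reads one-hot vectors
lstm_2 = lstm_param_count(128, 128)         # second LSTM reads the first one's output
dense = dense_param_count(128, num_tokens)

print(lstm_1, lstm_2, dense, lstm_1 + lstm_2 + dense)
# 95232 131584 7353 234169
```

Note that the second LSTM's input dimension is 128 (the first LSTM's output), not *num_tokens*, which is why it has more parameters than the first.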


Because this model takes a few hours to train, we're using **model checkpoints** to save the weights after every epoch. This way, if something goes wrong with our system during training, we can reload the last set of weights from the checkpoint, and resume training from there.<br>
https://keras.io/api/callbacks/model_checkpoint/

In [78]:
from keras.callbacks import ModelCheckpoint

In [79]:
# Saving this to a folder on my local machine (possible because I'm running this notebook on a local Jupyter backend).
filepath = "./ArtofWarLM/training1/cp.ckpt"

# Create a callback that saves the model's weights at the end of every epoch.
cp_callback = ModelCheckpoint(filepath=filepath,
                              save_weights_only=True,
                              verbose=1)

When calling the model's *fit* method, we:<br>
1. simply pass in the batches as is. Because each element of the dataset is already an (inputs, labels) pair, there's no need for separate x and y arguments.
2. pass the model checkpoint callback to save the weights after every epoch.

Note that the call to *fit* below is commented out. Because this model takes a few hours to train, I trained it ahead of time and saved it. If you want to train it yourself, feel free to uncomment and execute it.

**Note**: Because of the random weight initialization, your trained model's output will likely differ from mine.

In [80]:
# history = model.fit(xy_batches, epochs=50, callbacks=[cp_callback])

Once training is complete, we can call the model's *save* method to save the full model (architecture, weights, and optimizer state) so it can be reloaded later without retraining. Again, this is commented out because I already trained the model.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/Model#save<br>
https://www.tensorflow.org/guide/keras/save_and_serialize

In [81]:
# model.save('art_of_war_char_level_lm')

Download and unzip the previously trained model...

In [107]:
!wget https://github.com/nitinpunjabi/nlp-demystified/raw/main/art_of_war_char_level_lm.zip
!unzip -o art_of_war_char_level_lm.zip

--2022-02-11 15:52:27--  https://github.com/nitinpunjabi/nlp-demystified/raw/main/art_of_war_char_level_lm.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/art_of_war_char_level_lm.zip [following]
--2022-02-11 15:52:27--  https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/art_of_war_char_level_lm.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2691531 (2.6M) [application/zip]
Saving to: ‘art_of_war_char_level_lm.zip.7’


2022-02-11 15:52:28 (2.54 MB/s) - ‘art_of_war_char_level_lm.zip.7’ saved [2691531/2691531]

Archive:  art_of_war_char_

...and load it.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/models/load_model

In [108]:
model = keras.models.load_model('art_of_war_char_level_lm')

In [109]:
import numpy as np

Now that we have a trained model, let's generate some text.<br><br>
The function below takes some seed text and uses it to generate a given number of characters. For each new character, the input is the most recent *input_timesteps* characters of the text generated so far (seed included).<br><br>
There's also a *temperature* parameter. The next character is picked from a probability distribution. By dividing the log of this distribution by *temperature*, we can influence the randomness of the output.<br><br>
When the temperature is low (< 1), the probability distribution sharpens and the model will be more strict in recreating the original text. As we raise the temperature, the distribution flattens and there's a higher chance the model picks something unexpected, resulting in greater surprise in the output. In practice, a high enough temperature will result in nonsense.
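
The effect of temperature can be seen on a toy distribution. This is a minimal NumPy sketch, assuming a made-up next-character distribution over four characters; *apply_temperature* is a hypothetical helper, not part of the notebook. It spells out the softmax that *tf.random.categorical* applies internally to the logits it receives.

```python
import numpy as np

def apply_temperature(probs, temperature):
    # Divide the log-probabilities by the temperature, then renormalize with a softmax.
    logits = np.log(probs) / temperature
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exp / exp.sum()

# Hypothetical next-character distribution over a 4-character vocabulary.
probs = np.array([0.6, 0.25, 0.1, 0.05])

for t in (0.2, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

At a temperature of 1 the distribution is unchanged. Below 1, most of the probability mass moves to the most likely character; above 1, the mass spreads out towards the less likely characters.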

In [110]:
def generate_text(model, tokenizer, seed_text, num_chars=200, temperature=1):

  text = seed_text

  for _ in range(num_chars):
    
    # Create an input sequence from (at most) the last input_timesteps characters.
    # Named x rather than input to avoid shadowing Python's built-in input().
    x = np.array(tokenizer.texts_to_sequences([text[-input_timesteps:]]))
    x = tf.one_hot(x, num_tokens)

    # Create probability distribution for next character adjusted by temperature.
    preds = model.predict(x)[0, -1:, :] # <-- We want only the last character predicted.
    preds = tf.math.log(preds) / temperature

    # Choose next character and add to running text.
    next_char = tf.random.categorical(preds, num_samples=1)
    next_char = tokenizer.sequences_to_texts(next_char.numpy())[0]

    text += next_char
  
  return text


In [113]:
print(generate_text(model, tokenizer, "Banana peels on the battlefield can", num_chars=300, temperature=0.2))

Banana peels on the battlefield can ever be brought back to life.

22. hence the enlightened ruler lays his plans well ahead;
the good general cultivates his resources.

17. move not unless you see an advantage; use not your troops unless
there is something to be gained; fight not unless the position is
critical.

18. no ruler should


In [114]:
print(generate_text(model, tokenizer, "It's time to release the Kraken when", num_chars=300, temperature=0.5))

It's time to release the Kraken when to his orders are
not clear and distinct; when there are no fixed duties assigned to
officers and men alike will put forth
their uttermost strength.

24. soldiers when in desperate straits, and it will come off in safety.

59. for it is precisely when a force has fallen into harm’s way that is
capa


In [115]:
print(generate_text(model, tokenizer, "Crush your enemies, see them driven before you, and", num_chars=300, temperature=1))

Crush your enemies, see them driven before you, and we must make a forward move; if not, stay where
you are.

20. anger may in time change to gladness; vexation must will be saved.

6. when the enemy’s men were his men are in a condition to attack, but are
unaware that the nature of
the ground makes fighting impracticable, we have
strength of their 


In [116]:
print(generate_text(model, tokenizer, "What is best in life?", num_chars=300, temperature=2))

What is best in life?

7. in converted spy that is called tem be obtained inductively from exroper, yvoors a
advance. violent.

20. the enlightened ruler and the time of
the enemy; you marshes any awe lays emostrivigngs; if their fort, but the
officers are angry and hoard you maknour will be distred aboits.

16. hen the


A few observations about the preceding outputs:
1. Despite being a character-level model, the model managed to "learn" spelling, cadence, punctuation, spacing, grammar, and even numbered bullet points just from trying to predict the next character.
2. It's pretty cool how the model manages to take our initial seed text and complete a sentence with it before moving on.
3. We can see the output getting increasingly nonsensical as the temperature rises. What temperature to use ultimately depends on the nature of your corpus and your goals with the language model.

# Further Exploration
1. Everything we learned here can be applied at the word-level. Try creating a word-level language model with a different corpus (maybe download something from https://www.gutenberg.org/) and try using word embeddings.<br><br>
2. We didn't evaluate our language model using perplexity. Find out online how to do it.
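
As a starting point for the second item, perplexity is the exponential of the average negative log-probability the model assigns to the characters that actually occur. The NumPy sketch below shows just that calculation on made-up values; the *perplexity* helper and its inputs are hypothetical, and in practice the probabilities would come from running the model on held-out text.

```python
import numpy as np

def perplexity(probs, labels):
    # probs:  (num_steps, vocab_size) predicted distributions, one per time step
    # labels: (num_steps,) integer IDs of the characters that actually came next
    true_char_probs = probs[np.arange(len(labels)), labels]
    return float(np.exp(-np.mean(np.log(true_char_probs))))

# Hypothetical sanity check: a model that is uniformly unsure over 4 characters
# should have a perplexity of about 4.
uniform = np.full((3, 4), 0.25)
print(perplexity(uniform, np.array([0, 2, 1])))
```

Lower is better: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 characters.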