<h1 align='center'>Applications of LSTM – Generating Text</h1>

In [1]:
import os
from six.moves.urllib.request import urlretrieve
import tensorflow as tf

tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Get Data

[Downloading Stories](https://www.cs.cmu.edu/~spok/grimmtmp/). The dataset consists of 209 stories, which are translations of folk stories collected by the Grimm brothers.

In [2]:
url = 'https://www.cs.cmu.edu/~spok/grimmtmp/'
dir_name = 'data'

def download_data(url, filename, download_dir):
    """Download a file into download_dir if it is not already present."""
      
    # Create the directory if it doesn't exist
    os.makedirs(download_dir, exist_ok=True)
    
    # Download the file only if it isn't already on disk
    if not os.path.exists(os.path.join(download_dir,filename)):
        filepath, _ = urlretrieve(url + filename, os.path.join(download_dir,filename))
    else:
        filepath = os.path.join(download_dir, filename)
        
    return filepath

# Number of files and their names to download
num_files = 209
filenames = [format(i, '03d')+'.txt' for i in range(1, num_files+1)]

# Download each file
for fn in filenames:
    download_data(url, fn, dir_name)
    
# Check if all files are downloaded
for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name,filenames[i]))
    assert file_exists

print(f"{len(filenames)} files found.") 

209 files found.


## Train-Validation-Test Split

In [3]:
from sklearn.model_selection import train_test_split

random_state = 54321

# Sort the listing so the split is reproducible (os.listdir order is arbitrary)
filenames = [os.path.join(dir_name, file) for file in sorted(os.listdir(dir_name))]

# First separate train and valid+test data
train_filenames, test_and_valid_filenames = train_test_split(filenames, test_size=0.2, 
                                                             random_state=random_state)

# Separate valid+test data to validation and test data
valid_filenames, test_filenames = train_test_split(test_and_valid_filenames, test_size=0.5,
                                                   random_state=random_state)

# Print out the sizes and some sample filenames
for subset_id, subset in zip(('train', 'valid', 'test'), (train_filenames, valid_filenames, test_filenames)):
    print(f"Got {len(subset)} files in the {subset_id} dataset (e.g. {subset[:3]})")

Got 167 files in the train dataset (e.g. ['data\\117.txt', 'data\\133.txt', 'data\\069.txt'])
Got 21 files in the valid dataset (e.g. ['data\\023.txt', 'data\\078.txt', 'data\\176.txt'])
Got 21 files in the test dataset (e.g. ['data\\129.txt', 'data\\207.txt', 'data\\170.txt'])


## Analyzing the vocabulary size

- We will process the text by breaking it into character-level bigrams (n-grams where n=2) and build a vocabulary from the unique bigrams.

- Character-level bigrams let us model language with a much smaller vocabulary than whole words would require, which leads to faster model training.

For example, "`The king was hunting in the forest`" breaks down into the following sequence of bigrams: `['th', 'e ', 'ki', 'ng', ' w', 'as', …]`
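The split described above can be sketched in plain Python. The helper name `to_bigrams` is hypothetical and not part of this notebook; it simply lowercases the text and takes non-overlapping two-character slices.

```python
def to_bigrams(text, n=2):
    """Split text into non-overlapping character n-grams (lowercased)."""
    text = text.lower()
    # Step through the string n characters at a time
    return [text[i:i + n] for i in range(0, len(text), n)]


bigrams = to_bigrams("The king was hunting in the forest")
print(bigrams[:6])  # ['th', 'e ', 'ki', 'ng', ' w', 'as']
```

If the text has an odd number of characters, the last element is a single character rather than a full bigram.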

In [6]:
bigram_set = set()

# Go through each file in the training set
for file_name in train_filenames:
    document = [] # This will hold all the text
    with open(file_name, 'r') as f:
        for row in f:
            # Convert text to lower case to reduce input dimensionality
            document.append(row.lower())
        
        # Stitch all the lines together
        document = " ".join(document)
        
        # Add the document's non-overlapping bigrams to the set
        bigram_set.update([document[i:i+2] for i in range(0, len(document), 2)])
        
        
n_vocab = len(bigram_set)
print(f"Found {n_vocab} unique bigrams")

Found 705 unique bigrams


## Defining the `tf.data` pipeline

We will now define a fully fledged data pipeline that is capable of reading the files from the disk and transforming the content into a format or structure that can be used to train the model.

<div align='center'>
    <b>Process for generating data to train the language model</b>
</div>

<div align='center'>
    <img src='images/data_pipe.png'/>
</div>

For example, assume an `ngram_width` of 2, a batch size of 1, and a `window_size` of 5. This function would take the string `the king was hunting in the forest` and output:

```
Batch 1: ["th", "e ", "ki", "ng", " w"] -> ["e ", "ki", "ng", " w", "as"]
Batch 2: ["as", " h", "un", "ti", "ng"] -> [" h", "un", "ti", "ng", " i"]
...
```
The left list in each batch is the input sequence, and the right list is the target sequence.

Note how the right list is simply the left one shifted forward by one n-gram. Also note that there is no overlap between the inputs of the two records in this example. The actual function, however, maintains a small overlap between consecutive records.

In [33]:
def generate_tf_dataset(filenames, ngram_width, window_size, batch_size, shuffle=False):
    """ 
    Generate batched data from the specified list of files
    
    Args:
        • filenames – A list of filenames containing the text to be used for the model
        • ngram_width – Width of the n-grams to be extracted
        • window_size – Length of the sequence of n-grams to be used to generate a single data
                        point for the model
        • batch_size – Size of the batch
        • shuffle – (defaults to False) Whether to shuffle the data or not
    """
    
    # Read the data 
    documents = []
    for file in filenames:
        doc = tf.io.read_file(file)
        doc = tf.strings.ngrams( # Create ngram from string
                    tf.strings.bytes_split( # Split text into characters
                         tf.strings.regex_replace( # Replace new lines with space
                              tf.strings.lower(
                                  doc
                              ), "\n", " "
                         )
                    ),
                    ngram_width, separator=''
                )
        documents.append(doc.numpy().tolist())
    
    # documents is a list of lists of strings, where each inner list holds
    # the n-grams of one story. From that we generate a ragged tensor
    documents = tf.ragged.constant(documents)
    
    # Create a dataset where each row in the ragged tensor would be a sample
    doc_dataset = tf.data.Dataset.from_tensor_slices(documents)
    
    # tf.strings.ngrams generates all the n-grams with overlap
    # (e.g. abcd -> ab, bc, cd). We do not need the overlap for our data,
    # so we keep only every nth n-gram in the sequence:
    doc_dataset = doc_dataset.map(lambda x: x[::ngram_width])
    
    # Here we are using a window function to generate windows from text
    # For a text sequence with window_size 3 and shift 1 you get
    # e.g. ab, cd, ef, gh, ij, ... -> [ab, cd, ef], [cd, ef, gh], [ef, gh, ij], ...
    # each of these windows is a single training sequence for our model
    doc_dataset = doc_dataset.flat_map(
                        lambda x: tf.data.Dataset.from_tensor_slices(
                                    x
                        ).window(
                            size=window_size+1, shift=int(window_size*0.75)
                        ).flat_map(
                            lambda window: window.batch(window_size+1, drop_remainder=True)
                        )
                  )
    
    # From each windowed sequence we generate input and target tuple
    # e.g. [ab, cd, ef] -> ([ab, cd], [cd, ef])
    doc_dataset = doc_dataset.map(lambda x: (x[:-1], x[1:]))
    
    # Shuffle the data if required
    doc_dataset = doc_dataset.shuffle(buffer_size=batch_size*10) if shuffle else doc_dataset
    
    # Batch the Data
    doc_dataset = doc_dataset.batch(batch_size=batch_size)
    
    return doc_dataset

- A RaggedTensor is a special type of tensor whose dimensions can hold arbitrarily sized inputs. Here it is created with `tf.ragged.constant()`.

- The stories vary a lot in length, so it is almost impossible that they would all contain the same number of n-grams. We therefore end up with arbitrarily long sequences of n-grams representing our stories.

- A RaggedTensor lets us store these arbitrarily sized sequences.

* **

Explanation of above code
```
doc_dataset = doc_dataset.flat_map(
                        lambda x: tf.data.Dataset.from_tensor_slices(
                                    x
                        ).window(
                            size=window_size+1, shift=int(window_size*0.75)
                        ).flat_map(
                            lambda window: window.batch(window_size+1, drop_remainder=True)
                        )
                  )
```
After the overlapping n-grams are removed, the dataset is transformed using `.flat_map` and `.window` to generate sliding windows of size `window_size+1` from each n-gram sequence. The windows are created with a shift of 75% of `window_size` (rounded down), so two consecutive sequences overlap by roughly 25%. Each window is then batched using `.batch` with `drop_remainder=True`, which discards trailing windows shorter than `window_size+1` so that every sequence has the same length.
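To make the windowing arithmetic concrete, here is a plain-Python sketch that imitates the same steps on a short token list. The function name `make_windows` and the toy tokens are hypothetical; this is not how `tf.data` is implemented, only an illustration of which windows survive and how inputs and targets are derived from them.

```python
def make_windows(tokens, window_size, shift):
    """Return (input, target) pairs from fixed-size sliding windows."""
    pairs = []
    for start in range(0, len(tokens), shift):
        window = tokens[start:start + window_size + 1]
        # Mirror drop_remainder=True: skip windows that are too short
        if len(window) < window_size + 1:
            continue
        # Input is all but the last token, target is all but the first
        pairs.append((window[:-1], window[1:]))
    return pairs


tokens = ["ab", "cd", "ef", "gh", "ij", "kl", "mn", "op", "qr", "st"]
window_size = 4
shift = int(window_size * 0.75)  # 3, using the same 75% rule as the pipeline

for inputs, targets in make_windows(tokens, window_size, shift):
    print(inputs, "->", targets)
```

With these toy values the two surviving inputs share one of their four tokens (`gh`), which is the 25% overlap described above.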

* **

In [34]:
ngram_length = 2
batch_size = 128
window_size = 128

train_ds = generate_tf_dataset(train_filenames, ngram_length, 
                               window_size, batch_size, shuffle=True)

valid_ds = generate_tf_dataset(valid_filenames, ngram_length, window_size, batch_size)
test_ds = generate_tf_dataset(test_filenames, ngram_length, window_size, batch_size)

### Generate a few samples from the dataset function

In [35]:
ds = generate_tf_dataset(train_filenames, ngram_width=2, window_size=10, batch_size=1)

for record in ds.take(5):
    print(record[0].numpy(), '->', record[1].numpy())

[[b'th' b'er' b'e ' b'wa' b's ' b'on' b'ce' b' u' b'po' b'n ']] -> [[b'er' b'e ' b'wa' b's ' b'on' b'ce' b' u' b'po' b'n ' b'a ']]
[[b' u' b'po' b'n ' b'a ' b'ti' b'me' b' a' b' s' b'he' b'ph']] -> [[b'po' b'n ' b'a ' b'ti' b'me' b' a' b' s' b'he' b'ph' b'er']]
[[b' s' b'he' b'ph' b'er' b'd ' b'bo' b'y ' b'wh' b'os' b'e ']] -> [[b'he' b'ph' b'er' b'd ' b'bo' b'y ' b'wh' b'os' b'e ' b'fa']]
[[b'wh' b'os' b'e ' b'fa' b'me' b' s' b'pr' b'ea' b'd ' b'fa']] -> [[b'os' b'e ' b'fa' b'me' b' s' b'pr' b'ea' b'd ' b'fa' b'r ']]
[[b'ea' b'd ' b'fa' b'r ' b'an' b'd ' b'wi' b'de' b' b' b'ec']] -> [[b'd ' b'fa' b'r ' b'an' b'd ' b'wi' b'de' b' b' b'ec' b'au']]


## Implementing the Language Model

In [38]:
import tensorflow.keras.backend as K

K.clear_session()

### Defining the `TextVectorization` layer

Define a `TextVectorization` layer to convert the sequences of n-grams to sequences of integer IDs:

In [39]:
from tensorflow.keras import layers
from tensorflow.keras import models


# The vectorization layer that will convert string bigrams to IDs
text_vectorizer = layers.TextVectorization(max_tokens=n_vocab, standardize=None,
                                           split=None, input_shape=(window_size,))

# Fit the layer on the training data
text_vectorizer.adapt(train_ds)

In [40]:
text_vectorizer.get_vocabulary()[:10]

['', '[UNK]', 'e ', 'he', ' t', 'th', 'd ', ' a', ', ', ' h']

**Convert the targets from string ngrams to ngram IDs:**

Remember that our data pipelines output sequences of n-gram strings as both inputs and targets. We need to convert the target sequences to sequences of n-gram IDs so that a loss can be computed.

In [42]:
train_ds = train_ds.map(lambda x, y: (x, text_vectorizer(y)))
valid_ds = valid_ds.map(lambda x, y: (x, text_vectorizer(y)))
test_ds = test_ds.map(lambda x, y: (x, text_vectorizer(y)))
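As a rough illustration of the string-to-ID mapping, the following plain-Python sketch builds a frequency-ordered vocabulary with the same two reserved entries seen in the `get_vocabulary()` output above (`''` at index 0 and `'[UNK]'` at index 1). The helper names `build_vocab` and `to_ids` are hypothetical, and this is a simplification rather than the Keras layer's actual implementation.

```python
from collections import Counter


def build_vocab(sequences, max_tokens):
    """Build a frequency-ordered vocabulary with two reserved entries."""
    counts = Counter(tok for seq in sequences for tok in seq)
    # Index 0 is reserved for padding and index 1 for unknown tokens
    vocab = ["", "[UNK]"]
    vocab += [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return vocab


def to_ids(sequence, vocab):
    """Map each token to its index, falling back to the [UNK] index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in sequence]


train = [["th", "e ", "ki", "ng"], ["th", "e ", "fo", "re"], ["th", "at"]]
vocab = build_vocab(train, max_tokens=6)
print(vocab)
print(to_ids(["th", "e ", "zz"], vocab))  # [2, 3, 1]
```

The unseen bigram `zz` maps to index 1, which is how out-of-vocabulary tokens are represented in this sketch.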

### Defining the LSTM model

Our model will have:
- The previously trained TextVectorization layer
- An embedding layer randomly initialized and jointly trained with the model
- Two LSTM layers with 512 and 256 nodes, respectively
- A fully-connected hidden layer with 1024 nodes and `ReLU` activation, followed by dropout with a rate of 0.5
- The final prediction layer with `n_vocab` nodes and `softmax` activation

In [43]:
K.clear_session()

lm_model = models.Sequential([text_vectorizer,
                              layers.Embedding(input_dim=n_vocab+2, output_dim=96),
                              layers.LSTM(512, return_state=False, return_sequences=True),
                              layers.LSTM(256, return_state=False, return_sequences=True),
                              layers.Dense(1024, activation='relu'),
                              layers.Dropout(0.5),
                              layers.Dense(n_vocab, activation='softmax')
                             ])

`K.clear_session()` clears the current TensorFlow session (e.g. the layers and variables defined so far and their states). Without it, re-running the cell multiple times in a notebook would create an unnecessary number of layers and variables.

Parameters of the LSTM layer in more detail:
- `return_state` – Setting this to `False` means that the layer returns only its output, whereas setting it to `True` makes it return the state vectors along with the output. For example, for an LSTM layer, setting `return_state=True` means you’ll get three outputs: the layer output, the hidden state, and the cell state. Note that when `return_sequences=False`, the layer output and the hidden state are identical.

- `return_sequences` – Setting this to `True` causes the layer to output the full output sequence, as opposed to just the last output. For example, setting this to `False` gives a `[b, n]`-sized output, where `b` is the batch size and `n` is the number of nodes in the layer. Setting it to `True` gives a `[b, t, n]`-sized output, where `t` is the number of time steps.
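The shapes involved can be checked with a small NumPy sketch of an LSTM forward pass. This is an illustrative toy, not the Keras implementation: the function `lstm_forward`, the gate ordering, and the random weights are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(x, W, U, b):
    """Run a single LSTM layer over x with shape [b, t, d]."""
    batch, steps, _ = x.shape
    n = U.shape[0]
    h = np.zeros((batch, n))
    c = np.zeros((batch, n))
    outputs = []
    for t in range(steps):
        # One affine map produces all four gate pre-activations
        z = x[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(z, 4, axis=-1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # Stack the per-step outputs into [b, t, n]
    return np.stack(outputs, axis=1), h, c


rng = np.random.default_rng(0)
b_size, t_steps, d_in, n_units = 2, 5, 3, 4
x = rng.normal(size=(b_size, t_steps, d_in))
W = rng.normal(size=(d_in, 4 * n_units))
U = rng.normal(size=(n_units, 4 * n_units))
bias = np.zeros(4 * n_units)

seq_out, h_last, c_last = lstm_forward(x, W, U, bias)
print(seq_out.shape)            # (2, 5, 4): like return_sequences=True
print(seq_out[:, -1, :].shape)  # (2, 4): like return_sequences=False
print(np.allclose(seq_out[:, -1, :], h_last))  # True
```

The last check shows why the output at the final time step and the final hidden state carry the same values.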


In [44]:
lm_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 128)              0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 128, 96)           67872     
                                                                 
 lstm (LSTM)                 (None, 128, 512)          1247232   
                                                                 
 lstm_1 (LSTM)               (None, 128, 256)          787456    
                                                                 
 dense (Dense)               (None, 128, 1024)         263168    
                                                                 
 dropout (Dropout)           (None, 128, 1024)         0         
                                                        

### Defining metrics and compiling the model

- [Evaluation Metrics for Language Modeling - Chip Huyen](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/)

- [A Gentle Introduction to Information Entropy - mlmastery](https://machinelearningmastery.com/what-is-information-entropy/)

Accuracy is used as a general-purpose evaluation metric across different ML tasks. However, accuracy might not be well suited to this task, mainly because it relies on the model choosing exactly the word/bigram that appears in the dataset at a given time step. Languages are complex, and there can be many valid choices for the next word/bigram given a text. Therefore, NLP practitioners rely on a metric known as **perplexity**:

- **Perplexity** *measures how "perplexed" or “surprised” the model was to see a `t+1` bigram given `1:t` bigrams.*

- **Perplexity is simply two raised to the power of the entropy.** 

- Entropy is a measure of the uncertainty or randomness of an event. The more uncertain the outcome of the event, the higher the entropy. The formula for entropy is:

$$H(X) = - \sum_{x \in X}p(x) \log(p(x))$$

- In machine learning, to optimize ML models, *we measure the difference between the predicted probability distribution and the target probability distribution for a given sample*. For that, we use **cross-entropy**, an extension of entropy to two distributions, where $C$ is the number of classes:

$$\text{Categorical Crossentropy}(\hat{y}_i, y_i) = - \sum_{c=1}^{C}y_{i,c}\log(\hat{y}_{i,c})$$

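Finally, we define perplexity as:

$$\text{Perplexity} = 2^{H(X)}$$

Here the entropy is measured in bits, i.e. with a base-2 logarithm. With natural logarithms, as in the implementation below, the equivalent form is $e^{H(X)}$.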
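A small NumPy sketch may help make the definition concrete. The probabilities and target IDs below are made up for illustration. The example computes the average cross-entropy of the true tokens and shows that exponentiating in base 2 or base $e$ gives the same perplexity, as long as the logarithm uses the matching base.

```python
import numpy as np

# Hypothetical predicted distributions over a 4-token vocabulary,
# one row per time step
probs = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.80, 0.10],
])
targets = np.array([0, 3, 2])  # true token ID at each step

# Probability the model assigned to each true token
p_true = probs[np.arange(len(targets)), targets]

ce_nats = -np.mean(np.log(p_true))   # cross-entropy with natural log
ce_bits = -np.mean(np.log2(p_true))  # cross-entropy with log base 2

# Both routes give the same perplexity
print(np.exp(ce_nats), 2.0 ** ce_bits)

# A model that is uniformly unsure over 4 tokens has perplexity 4
uniform = np.full(4, 0.25)
ppl_uniform = 2.0 ** (-np.log2(uniform[0]))
print(ppl_uniform)  # 4.0
```

The uniform case gives a useful intuition: a perplexity of $k$ roughly means the model is as uncertain as if it were choosing uniformly among $k$ options.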

In [45]:
# Inspired by https://gist.github.com/Gregorgeous/dbad1ec22efc250c76354d949a13cec3
class PerplexityMetric(tf.keras.metrics.Mean):
    
    def __init__(self, name='perplexity', **kwargs):
        super().__init__(name=name, **kwargs)
        self.cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False, 
                                                                           reduction='none')

    def _calculate_perplexity(self, real, pred):
        # Per-token cross-entropy (reduction='none'). Unlike the masked loss in
        # https://www.tensorflow.org/beta/tutorials/text/transformer#loss_and_metrics
        # no padding is zeroed out here.
        loss_ = self.cross_entropy(real, pred)
      
        # Average the loss over the last axis (time steps), then exponentiate
        # to get one perplexity value per sequence
        step1 = K.mean(loss_, axis=-1)
        perplexity = K.exp(step1)

        return perplexity 

    def update_state(self, y_true, y_pred, sample_weight=None):            
        perplexity = self._calculate_perplexity(y_true, y_pred)
        # Pass the per-sequence perplexities to the parent Mean metric,
        # which maintains their running average across batches
        super().update_state(perplexity)

In [46]:
lm_model.compile(loss='sparse_categorical_crossentropy', 
                 optimizer='adam', 
                 metrics=['accuracy', PerplexityMetric()])

lm_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVec  (None, 128)              0         
 torization)                                                     
                                                                 
 embedding (Embedding)       (None, 128, 96)           67872     
                                                                 
 lstm (LSTM)                 (None, 128, 512)          1247232   
                                                                 
 lstm_1 (LSTM)               (None, 128, 256)          787456    
                                                                 
 dense (Dense)               (None, 128, 1024)         263168    
                                                                 
 dropout (Dropout)           (None, 128, 1024)         0         
                                                        

### Training the model

In [47]:
lstm_history = lm_model.fit(train_ds, 
                            validation_data=valid_ds, 
                            epochs=50, workers=10)

Epoch 1/50
...
Epoch 50/50


In [48]:
lm_model.evaluate(test_ds)



[2.497385025024414, 0.3919745683670044, 12.561511039733887]
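The three values are the loss, the accuracy, and the perplexity, in the order the metrics were passed to `compile`. Note that the reported perplexity (about 12.56) is not exactly the exponential of the reported loss. A plausible explanation, given how `PerplexityMetric` is written, is that it exponentiates each sequence's mean loss and then averages, whereas the loss is averaged before any exponentiation. The sketch below illustrates the effect with made-up per-sequence losses.

```python
import math
import numpy as np

test_loss = 2.497385025024414  # loss value reported by evaluate() above
print(math.exp(test_loss))     # about 12.15, below the reported 12.56

# Hypothetical per-sequence mean losses (not taken from the notebook)
seq_losses = np.array([2.0, 2.5, 3.0])

exp_of_mean = np.exp(seq_losses.mean())  # exponentiate the averaged loss
mean_of_exp = np.exp(seq_losses).mean()  # average per-sequence perplexities

print(exp_of_mean, mean_of_exp)
```

Because the exponential is convex, the average of per-sequence perplexities is never smaller than the exponential of the average loss (Jensen's inequality), which is consistent with the numbers above.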

## Defining the inference model

Inferring from the trained model

- During training, we trained our model and evaluated it on sequences of bigrams. This works because the full text is available during training and evaluation. However, when we need to generate new text, no such text is available. Therefore, we have to adjust our trained model so that it can generate text from scratch.

- We do this by defining a recursive model that takes the model's output at the current time step as its input at the next time step. This way we can keep predicting words/bigrams for an arbitrary number of steps. We provide the initial seed as a random word/bigram picked from the corpus (or even a sequence of bigrams). 

<div align='center'>
    <img src="images/infer_model_arch.png"/>
</div>

Our inference model is going to be more sophisticated than the training model, as we need to design an iterative process that generates text using previous predictions as inputs.
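The iterative idea can be sketched independently of any particular model. In the following example, `toy_predict_next` is a hypothetical stand-in for a trained language model, and `generate` is an assumed helper name. The notebook's actual inference model may differ, for instance by carrying LSTM states between steps instead of re-reading the full history.

```python
import numpy as np


def generate(predict_next, seed, n_steps, rng):
    """Autoregressively extend `seed` by sampling one token per step."""
    tokens = list(seed)
    for _ in range(n_steps):
        # Distribution over the vocabulary given everything generated so far
        probs = predict_next(tokens)
        next_id = rng.choice(len(probs), p=probs)
        # The sampled token becomes part of the next step's input
        tokens.append(int(next_id))
    return tokens


def toy_predict_next(tokens, vocab_size=5):
    """Stand-in for a trained model: favours the ID after the last one."""
    probs = np.full(vocab_size, 0.1 / (vocab_size - 1))
    probs[(tokens[-1] + 1) % vocab_size] = 0.9
    return probs


rng = np.random.default_rng(42)
generated = generate(toy_predict_next, seed=[0], n_steps=10, rng=rng)
print(generated)
```

Sampling from the predicted distribution, rather than always taking the most likely token, is one common design choice here; a greedy `argmax` would make the output deterministic.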