In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import numpy as np
import os
import re
import string
import random


# WhackGPT

We can make a transformer based model to generate chatGPT-ish text responses. Ours will be far more stupid, but hey, it's a taking computer. Transformer based models are the current state of the art for natural language processing, and most of the models that you've heard of, like ChatGPT, are transformer based models.

## Transformer Architecture

What the heck is a transformer, what does it do, and why is it so cool? A transformer model is a type of neural network that was creating in 2017 at Google. The core idea behind transformers is the idea of attention, which is deailed a little bit below. The diagramed strucutre of a transformer model can be a little intimidating, but we can make sense of the critical parts without too much issue. We will mostly sidestep this large diagram, and focus on a basic transformer model that is a little easier to understand.

![Transformer](images/transformer.png "Transformer")

A transformer model contains a few key parts, each of with is dealt with in more detail below.
<ul>
<li> Embedding - the embedding layer generates embeddings (vector representations) for each token. The embeddings are created for both the token itself and its position in the sequence. </li>
<li> Attention layers - the attention layers are the core of the transformer model. They are responsible for creating a representation of the input sequence that is used to generate the output sequence. </li>
</ul>

The attention part is the star of the show, it is a method to be able to focus the attention of the model on the critical portions of the input sequence and generate contextually informed predictions for the output. As well, transformers do all of this in a way that is more parallelizable than LSTM based models that were the state of the art before transformers, only a few years ago. 

This image comes from Google, like transformers, and is a great visual representation of how attention works. This example is from a translation model, but a more simple version applies to what we are about to do. The key thing to note is that the image shows:
<ul>
<li> Several encoding layers, each of which generates a representation of each word in the input sequence by looking at all of the others. </li>
<li> Several decoding layers, which generate a representation of the output sequence by looking at all of the words in the input sequence, as well as the current sequence being generated. </li>
</ul>

This highlights the key idea of attention, which is that the model can look at any part of the input sequence to understand the current word in the input, as well as generate the current word of the output. This differs from the LSTM models that are tied to understanding the data only as a sequence, and not as a whole. The transformer model learns to look at, or pay attention to, the important parts of the input, irrespective of their position in the sequence. This is important in language, as we can have words that impact the meaning of other words at any place in the sequence. Think of an example sentence, "he isn't the largest, fastest, strongest, or tallest, but the walk-on scrapper Rudy is the heart of the team". In this sentence "he" applies to Rudy, as does "scrappy" and "heart of the team", which is clear in English, but maybe not so easy for a machine. The transformer model is able to look at the sentence, and work to learn the relationships between the words, and the context in which they are used - so if we were generating a similar sentence, the model knows that after "scrapper" we need a person, a "he", more specifically.  

<b>Important Note:</b> these generative text models seem smart, and in some senses they are, but in the most critical sense, they really aren't, and can be confidently and totally wrong. The model understands what is written and how to construct blocks of text; the model does not understand the underlying meaning. If you ask ChatGPT what it's like to bite an apple, you'll probably get descriptors like "sweet", "tart", and "juicy", which are all accurate. The model doesn't know what it is like to bite an apple, it just knows that if it needs to supply descriptors of an object "apple", sweet, juicy, and tart are ones that commonly come up. If it was trained on text written by some apple-hating lunatic, who wrote about how mushy, bitter, and gross apples are, that's what the model will confidently state an apple is like. Even when models learn to do chemistry or math, they are only learning to replicate what they have seen - they don't understand the underlying concepts. With subject like those, there tends to be more strictly defined rules than with the English language, which is pretty loose, so it isn't surprising that these models are quick to learn those subject areas, even if it seems difficult and almost impossibly fast for us as humans - a model can learn a linear regression, from examples only, very quickly. 

### Embedding

The embedding here has two parts:
<ul>
<li> Token embedding: This maps each token to a vector representation in N-dimensional space. This is what we are used to for embedding. The original transformer paper used a 512-dimensional embedding, so each token was represented by a vector of 512 values that position it on a 512D grid. 
<li> Positional embedding: This maps each token's position in the <i>sequence</i>. The position embedding can be thought of as an extension of the concept of just tracking which word of a sentence each token is, 1,2,3...
</ul>

#### Token Embedding

Token embedding is something that we are used to from when we used word2vec to generate embeddings for classification models. We are tranlating each token into an N-dimensional representation in space. The big difference here is that our embedding space is being learned by the model during training, so we should expect that the model will be shifting each token around in space as it learns more about what that word means, or more accurately, how it is used in our training data. 

![Embedding](images/embedding.png "Embedding")

#### Positional Embedding

The positional embedding is needed and most clearly seen if we compare this to an LSTM. In an LSTM, the position of a token is always known as we process the data sequentially. In the transformer model, the data is taken in parallel, so we don't have the sequence data built in. This has the benefit of allowing the model to process more of its work in parallel than an LSTM, but it also means that the model needs to be told where each token is in the sequence. What is the positional embedding? It follows the same concept as the token embedding, we are representing something with a vector of values. In the positional embedding, the math is a little involved, but it uses sine and cosine functions to represent the position of a token. 

![Positional Embedding](images/positional_emb.png "Positional Embedding")

Where:
<ul>
<li> <b>k:</b> position of the token. 
<li> <b>d:</b> dimension of the embedding.
<li> <b>i:</b> used for mapping to both sine and cosine functions.
</ul>

This positional embedding uses the trig functions to introduce some additional capability to our embedding values. First, this helps if we encounter longer sentences later on - if we embedded the position with a simple word count number, that would be an issue for us. Second, the trig functions allow us to embed the position in a way that is not deterministic. This means that the model can learn where tokens occur in relation to each other without being told explicitly. This is useful if you think of sentences such as:
<ul>
<li> I do not like the story of the movie, but I do like the cast.
<li> I do like the story of the movie, but I do not like the cast.
</ul>

These two sentences use the same words, but the meaning is opposite. The positional embedding helps capture the relationship between the words based on where the occur, and connect words that occur in certain "areas" to those in other "areas" of a sentence. This is really useful if you think of something like an adjective, that adjective modifies some noun, and understanding English requires that we are able to identify which noun it belongs to. Positional embedding with sine/cosine help with that, the position is recorded not only in a way that tells us where a word sits in an absolute sense, but it tells us where that word sits relative to the other words it is with. This is one reason transformers are so useful for tasks like language, their ability to contextualize the relationships in parts of text surpasses that of other models that we have today; when generating text, this gives us the most natural sounding text, as the "next word" prediction is based on a more comprehensive understanding of the sentence. 

Notably, the positional embedding uses the word embedding dimension, d, as the dimension of the positional embedding. This is because the positional embedding is added to the token embedding, so the two need to be the same dimension. This means that the embedding matrix generated can be quite large for each token. This also means that the input to any future modelling is going to contain those two vectors, likely represented in a high dimension - what is the token, and where is it in the sequence.

In [2]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

## Transformer Construction

We can now create a function to construct the core piece of our model, the transformer. The transformer layer has a few parts, the critical one being the attention layer. 

<b>Note:</b> the declarations of the layers are slightly different in a functional model. Each layer is a function that takes an input tensor and returns an output tensor. The layers are then called in the call method of the model.

### Pay Attention

The core piece of the transformer architecture is the attention mechanism. Attention serves as a way to focus, or pay attention, to certain parts of the input. The original paper that outlined the transformer architecture, "Attention is All You Need" from 2017 outlined the concept of attention at a high level - the idea is that we can use attention to focus on the most important parts of the input sequence, and use that to generate the output sequence. This is what humans do when reading something, we understand that certain parts of text add context allowing us to understand other parts. Attention is a way to teach a model to do the same thing. In RNN models, the data is always processed as a sequence - this has the advantage of allowing us to always know the order, but it slows computation and prevents the model from excelling at building relationships between parts of the data that are far from each other in the sequence. The transformer architecture replaces sequential processing of a sequence of values with parallel processing of the entire sequence at once, and build connections between the current part we are working on (e.g. the next word to generate) and all other parts of the sequence, focusing our attention on the most important parts.

![Attention](images/attention.png "Attention")

The result of that can be illustrated in the diagram above, with some additional labels. Here we have an illustration of a sentence, focusing on the last word, with colored and shaded lines indicating how much "attention" that word (token) needs to pay to the other tokens. Phrased differently, the stronger the line, the more that token on the left influences what the last token on the right should be. With training and lots of data, the model can eventually learn which other parts of a sequence the current token should "pay attention to", and use that to generate the next token in the sequence. Given enough training, this goes beyond words, and the model will begin to understand the relationships between types of words and parts of sentences. As a noticable example, words such "it", "in", or "is" can be used in many different contexts, and often refer to different nouns or verbs in a setence. The models can begin to decipher that certain types of words come after "is", and others come after "in". In this example, the model is paying strong attention to "in" and "European", since the model has learned enough that "in" indicates that the next word is some variety of place, and "European" indicates that said place is likely somewhere in Europe (the self-attention can be ignored). As models learn, see more language, and see sequences that are structured differently, the model will begin to understand more about the relationships between words and parts of sentences, and use that to generate sentences that not only use the correct words, but are structured in a way that mirrors the training data that it has seen. This sometimes gives the appearance that the model is thinking up and generating these answers on the fly, based on what it knows, but the language models don't really "know" anything, in an epistemological sense. Generative language models are just excellent at subtle pattern recognition in language, and reproducing an output that follows what it expects. The massive capacity and volume of training data is what gives it the ability to draw on patterns seen is "the internet" worth of text, thus allowing it to generate text that can avoid the robotic feel that small models have. 

#### Attention Implementation

The attention mechanism contains three key matrices that we'll ultimately use to calculate things:
<ul>
<li> Query
<li> Key
<li> Value
</ul>

The query, key, values are commonly described as analagous to doing a Google search. For example, when you search for videos on Youtube, the search engine will map your <b>query</b> (text in the search bar) against a set of <b>keys</b> (video title, description, etc.) associated with candidate videos in their database, then present you the best matched <b>values</b> (videos).  

Using the query, key, and value objects involves a multistep process. 
<ul>
<li> First, the query, key, and value all get a copy of the embedding (position and token) matrix fed in, which is then multipled by a set of weights that belong to a linear layer (no activation) for that Q/K/V input. 
<li> The value matrix is set aside for the moment. 
<li> The results of the query and key matricies are then mutipled by each other, which generates attention scores. 
<li> The result is then passed through a softmax function to normalize the weights and generates the actual attention mask. 
<li> The normalized weights are then multiplied by the value matrix, which gives us the final output.
</ul>

To ultimately create the layer, we have several of these heads, similar to filters in a CNN. 

![Multi-Head Attention](images/multi_head_att.png "Multi-Head Attention")

Take and example of a sentence being, "“Anthony Hopkins admired Michael Bay as a great director", the product of the query and key matricies would look something like this:

![Attention Mechanism](images/key_value.png "Attention Mechanism")

These attention scores are measures of how important each word in the input sequence is to each other word. We normally see each word being really important to itself, then as the similarity decreases, the importance decreases. In this example, "Hopkins" and "Anthony" have a high score of attention with respect to each other, which makes sense! We would likely want to produce those two words in sequence. Given large amounts of data, the model can become very good at identifying what is important and what is not, and in particular, understanding context. Because the attention is based on the positon and token embeddings, and we have multiple heads (see below) each honing in on some other aspect of the text, the model can learn relationships between parts of speech that are challenging for other types of models, such as a sentence that has a lage independent clause in the middle of it or figures of speach that have little impact on the meaning of a sentence. Importantly, each token in a sentence is taken as the input, so we generate such a matrix for each "query" token.

![Attention Sequences](images/attention_seq.png "Attention Sequences")

#### Attention Masking

Once we get the attention mask, we combine it with the value matrix to get the final output from our attention layer. The easiest way to think of applying an attention mask is with an example from computer vision. The "thing" that we are trying to do with computer vision, say image recognition, is to capture information from the "important part" of the image. We don't want to focus, normally, on background stuff. The attention mask serves to act basically as a filter, that blocks out the less important and lets through the more important. So we can think of the end result as the input + mask = useful output. This image is a little blurry, but it shows the idea. If we have a model being trained to identify objects, we might end up with a mask that looks like this. Note the final result and the original (which has had the color space changed). The desired result is the bottom left, where the objects we want to identify are the focus. Applying the mask to the original serves to do that - remove the less important stuff, emphasize the more important stuff. With language, we get the same thing. We want to focus on the important parts of a sentence and ignore the less important parts - that measure of importance is what we are learning during training. 

![Attention Mask](images/attention_mask.png "Attention Mask")

<b>Note:</b> we also have a causal mask, which is used to prevent the model from "cheating" by looking ahead in the input sequence. This effectively stops the model from just looking up the answer, which would let it sidestep learning. 

#### Multi-Head Attention

The layer that we are adding is called a multi-head attention layer, implying that we have multiple attention filters at once. This part works similarly to how the convolutional filters work in a CNN. Each filter in a CNN learns to identify some useful feature in that context - edges, colors, etc... Here, each attention head learns to focus on a different aspect of the input, language in our case. As our model is trained, each attention head will learn to focus on different aspects of the input. Recall that the weights for the filter are normally random initially, so the training process will cause each one to find its own thing to focus on as we shrink the loss. 

### Attention Magic

This is a very brief and high level overview of attention and its application to our neural networks. There is a lot more to it, it is a very interesting topic, and based on what we know now (2023), transformer based models will likely be exceedingly common over the near future. If you want to learn more, I recommend the following resources:
<ul>
<li> https://data-science-blog.com/blog/2021/04/07/multi-head-attention-mechanism/
<li> https://www.youtube.com/watch?v=6D4EWKJgNn0
<li> https://data-science-blog.com/blog/2021/04/22/positional-encoding-residual-connections-padding-masks-all-the-details-of-transformer-model/
</ul>

The ability of the transformer models to, without external direction, learn what is important and what is not is what makes them both so powerful and so flexible. The examples of the GPT models accurately performing tasks that it wasn't trained on are good examples of this flexibility. If we have training data to supply the transformer model, it can very accurately learn to extract what matters from what doesn't, irrespective of the specific task that it is working on, which makes learning that task much easier.

In [3]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Mask the upper half of the dot product matrix in self attention.
    This prevents flow of information from future tokens to current token.
    1's in the lower triangle, counting from the lower right corner.
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)

### Create Model

We can now create the model, and we will use the portions that we constructed above. The basic parts are:
<ul>
<li> Token and positional embedding - create representations of each sequence. 
<li> Transformer layers - the core of the model.
<li> Output layer - dense layers to convert the output of the transformer layers to the output of the model.
</ul>

The basic structure of different varieties of neural networks is also seen here, we again have a dense neural network to generate predictions from inputs, and that network can be fed by either:
<ul>
<li> Our actual data, for normal regression or classification.
<li> The output of convolutional layers, for image processing. 
<li> The output of recurrent layers, for sequential data.
<li> The output of transformer layers, for quickly expanding types of tasks. 
</ul>

No matter the specific implementation, the basic structure, and ability to learn, is the same in all neural networks. The ability to learn relationships that are complex, obscure, and impossible for a human to describe makes neural networks extremely powerful. If we can generate some architecture that is good at extracting features from some specific type of data, we can combine that with a regular neural network to make all kinds of predictions or generate new data. Our "predictor" dense model, and the "extractor" early layers can then both learn epoch by epoch, together, to be as accurate as possible. As the capacity of processors increases and the experience of researchers grows, we can expect to see more and more expansion in what neural networks can do. In particular, the increased ability to parallelize the processing of sequential data with the transformer architecture is massively helpful - we saw in the LSTM models the depth of the sequences of calculations meant that growing models to be very powerful requires lots of processing, in a way that is extremely hard to parallelize, limiting the growth. Transformers can do more in parallel, and it is much easier to add another processor than it is to develop a processor that is twice as fast; these models will likely grow to more efficiently process data accross large networks of worker machines, generating larger and more powerful models.

In [4]:
HIGH_RESOURCE = False

vocab_size = 10000  # Only consider the top X words
maxlen = 40  # Max sequence size
embed_dim = 196  # Embedding size for each token
num_heads = 2  # Number of attention heads
feed_forward_dim = 196  # Hidden layer size in feed forward network inside transformer

batch_size = 512
EPOCHS = 10

LOAD_WEIGHTS = True

if HIGH_RESOURCE:
    vocab_size = 20000  # Only consider the top 20k words
    embed_dim = 512  # Embedding size for each token
    num_heads = 8  # Number of attention heads
    feed_forward_dim = 512  # Hidden layer size in feed forward network inside transformer
    batch_size = 1024
    EPOCHS = 100

def create_model():
    inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
    x = embedding_layer(inputs)
    transformer_block = TransformerBlock(embed_dim, num_heads, feed_forward_dim)
    x = transformer_block(x)
    outputs = layers.Dense(vocab_size)(x)
    model = keras.Model(inputs=inputs, outputs=[outputs, x])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(
        "adam", loss=[loss_fn, None], metrics=["accuracy", None]
    )  # No loss and optimization based on word embeddings from transformer block
    return model


#### Download and Load Weights

### Get Data and Prepare for Training

This example uses some movie reviews for source data. The dataset comes already split into positive and negative labels, for classification, and into training and testing sets. We don't need any of these divisions, we just need all the text for training, so the data preparation steps here are:
<ul>
<li> Download the data.
<li> Loop through all the files and generate a list of all the file names. 
<li> Crate a dataset from all the files. 
<li> Clean the data by removing the html tags and punctuation.
<li> Tokenize the data by splitting the text into words and creating a vocabulary.
<li> Create training ready data by creating sequences of X = "up to the current word" and Y = "the next word".
<li> Set the dataset to be shuffled, batched, and prefetched.
</ul>

<b>Note:</b> there are a few odd [UNK] tokens, this is a placeholder for words that are not in the vocabulary. Were this a production model, we'd want to come up with some more sophisticated way of handling this, but for this example, we'll just leave it as is. When dealing with natural text, it is common to have things like this for unknown data, or other special tokens for the beginning or end of a sentence (e.g. [BOS] or [EOS]).  

In [5]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  5168k      0  0:00:15  0:00:15 --:--:-- 7275k


In [6]:
# The dataset contains each review in a separate text file
# The text files are present in four different folders
# Create a list all files
filenames = []
directories = [
    "aclImdb/train/pos",
    "aclImdb/train/neg",
    "aclImdb/test/pos",
    "aclImdb/test/neg",
]
for dir in directories:
    for f in os.listdir(dir):
        filenames.append(os.path.join(dir, f))

print(f"{len(filenames)} files")

# Create a dataset from text files
random.shuffle(filenames)
text_ds = tf.data.TextLineDataset(filenames)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)


def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    lowercased = tf.strings.lower(input_string)
    stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"([{string.punctuation}])", r" \1")


# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices


def prepare_lm_inputs_labels(text):
    """
    Shift word sequences by 1 position so that the target for position (i) is
    word at position (i+1). The model will use all words up till position (i)
    to predict the next word.
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

text_ds = text_ds.map(prepare_lm_inputs_labels)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)

50000 files
Metal device set to: Apple M2

systemMemory: 24.00 GB
maxCacheSize: 8.00 GB



2023-05-31 19:10:53.552262: W tensorflow/tsl/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz


We can look at one example of the data below. 

In [7]:
tmp = text_ds.as_numpy_iterator()
x_tmp, y_tmp = next(tmp)
print(x_tmp.shape, y_tmp.shape)
samp_x = x_tmp[0]
samp_y = y_tmp[0]
print("Tokens:", samp_x, "\n\n", samp_y)
word = ""
for x_ in samp_x:
    word += vocab[x_] + " "
print("Sentence:", word, "\n\nNext word:", vocab[samp_y[-1]])

(512, 40) (512, 40)
Tokens: [ 670 3908    9   66  526  329  172    3   12  213  987   14   12   32
  516    2  115   14   29  650    1   83    1   27    4   53   69   20
  748   76    2 1713    7  324    5 2165  707    6  323    1] 

 [3908    9   66  526  329  172    3   12  213  987   14   12   32  516
    2  115   14   29  650    1   83    1   27    4   53   69   20  748
   76    2 1713    7  324    5 2165  707    6  323    1    1]
Sentence: robert altman is my favorite american director . i must admit that i have enjoyed the films that are usually [UNK] : [UNK] " , if only for giving me the pleasure of seeing a grown -up and beautiful [UNK]  

Next word: [UNK]


### Callback

To get our text out, we can use a callback that will be called at the end of each epoch. We can still get things from "predict" after the fact, but this will give us some step by step evidence of our program's smarts. We will make two instances of this callback, each with different seeds. 

In [8]:
class TextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model.
    1. Feed some starting prompt to the model
    2. Predict probabilities for the next token
    3. Sample the next token and add it to the next input

    Arguments:
        max_tokens: Integer, the number of tokens to be generated after prompt.
        start_tokens: List of integers, the token indices for the starting prompt.
        index_to_word: List of strings, obtained from the TextVectorization layer.
        top_k: Integer, sample from the `top_k` token predictions.
        print_every: Integer, print after this many epochs.
    """

    def __init__(self, max_tokens, start_tokens, index_to_word, top_k=10, print_every=1, log_dir="logs"):
        self.max_tokens = max_tokens
        self.start_tokens = start_tokens
        self.index_to_word = index_to_word
        self.print_every = print_every
        self.k = top_k
        self.log_dir = log_dir

    def sample_from(self, logits):
        logits, indices = tf.math.top_k(logits, k=self.k, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def detokenize(self, number):
        return self.index_to_word[number]

    def on_epoch_end(self, epoch, logs):
        start_tokens = [_ for _ in self.start_tokens]
        if (epoch + 1) % self.print_every != 0:
            return
        num_tokens_generated = 0
        tokens_generated = []
        while num_tokens_generated <= self.max_tokens:
            pad_len = maxlen - len(start_tokens)
            sample_index = len(start_tokens) - 1
            if pad_len < 0:
                x = start_tokens[:maxlen]
                sample_index = maxlen - 1
            elif pad_len > 0:
                x = start_tokens + [0] * pad_len
            else:
                x = start_tokens
            x = np.array([x])
            y, _ = self.model.predict(x)
            sample_token = self.sample_from(y[0][sample_index])
            tokens_generated.append(sample_token)
            start_tokens.append(sample_token)
            num_tokens_generated = len(tokens_generated)
        txt = " ".join(
            [self.detokenize(_) for _ in self.start_tokens + tokens_generated]
        )
        print(f"generated text:\n{txt}\n")

### Train, Run, Predict

Now that the model is created, we can fit it to the training data then test out the abilities. Our prediction is an incremental process, we start with a seed, then we predict the next word, then we add that word to the seed, and predict the next word, and so on. At each step, the model looks at the input to this point, calculates the attention, finds the most suitable (highest score) word from the vocabulary, generates it, calculates the loss (N-dimensional "closeness" from embedding), and moves one more step forward. This training process can take a very long time, loss kept slowly improving for me for 100+ epochs without fully flattening. With larger models and datasets, this pattern will likely still occur, only at a far larger number of epochs. This is a very tangible example of the need for fast machines to turn around quick training cycles, look at the quality of the sample lines from epoch 1 to epoch 20+, the difference is often striking. Here is one example that I took from in the middle of training, at epoch 38:
<ul>
<li> <b>Epoch 1:</b> this movie is [UNK] to of . to of . , to the . a the a the , , . . the . the [UNK] the , the , [UNK] the of a the the . . and . is [UNK] to [UNK]
<li> <b>Epoch 38:</b> this movie is not just a bad movie . the plot is outrageous and unbelievable . sure the characters are unbelievable and the ending will be a better surprise , but the ending was just sad .       
</ul>
Not a bad improvement. Like most models, we normally see a really quick improvement in the first few epochs, to get to "tolerable", we then usually get a really slow improvement as the model gets slightly better each epoch, for a long time. Here I am only looking at the loss, which isn't really a number that has contextual meaning on its own, but it gives us a metric of how well the model is doing in comparison to itself. The accuracy metric isn't really useful here as we want to generate "a" correct word, not specifically "the" correct word, so generating "vehicle" instead of "automobile" is still a good answer, as we'd expect them to be quite close in the N-dimensional embedding. There are other metrics that are commonly used to evaluate text generation, such as BLEU and Rouge, that aim to generate some form of an "accuracy" score for the text that is created for a model. Any accuracy measure when we are creating something for human consumption is only a guideline - our model can optimize for loss, but loss is only a proxy for "the model thinks of a good next word" - there isn't any way to directly calculate how close to "real speach" our generated data is. We won't get into those other metrics here, the loss and a subjective evaluation of the text is good enough for us. As large language models and generative models become more and more prevelant, there are constantly new metrics being developed to provide a benchmark for the quality of those models. All of these metrics are limited in their usefulness, as there really isn't a single RMSE-like value that makes sense for the quality of a generated block of text - the scores will broadly correlate with "model quality", but those scores won't be as meaningful as a "real" accuracy measure like RMSE, accuracy, or F1. These measures are more of a rough indication than a strict measurement. 

<b>Note:</b> Trying to train this on my laptop on CPU took forever, I didn't get to the point where the first epoch gave me a time estimate. Using the accelerated M2 Mac is faster, and using Google Colab or a GPU it is much, much faster. If you are running this it is a very good idea to use some variety of GPU for practicality. 

In [9]:
# Tokenize starting prompt
word_to_index = {}
for index, word in enumerate(vocab):
    word_to_index[word] = index

start_prompt1 = "this movie is"
start_tokens1 = [word_to_index.get(_, 1) for _ in start_prompt1.split()]
start_prompt2 = "Skiing fast makes me"
start_tokens2 = [word_to_index.get(_, 1) for _ in start_prompt2.split()]
num_tokens_generated = 40

log_dir = "logs"
log_1 = str(log_dir + "/1")
log_2 = str(log_dir + "/2")
weight_path = "weights"
text_gen_callback1 = TextGenerator(num_tokens_generated, start_tokens1, vocab, log_dir=log_1)
text_gen_callback2 = TextGenerator(num_tokens_generated, start_tokens2, vocab, log_dir=log_2)
checkpoint_callback = keras.callbacks.ModelCheckpoint(weight_path, save_weights_only=True, monitor="loss", save_best_only=True)

In [None]:
model = create_model()

if LOAD_WEIGHTS:
    weights_url = "https://jrssbcrsefilesnait.blob.core.windows.net/3950data1/lstm_gen_weights(1).keras"
    old_weights = keras.utils.get_file('cust_transform_weights.keras', weights_url)
    model.load_weights(old_weights)
    
model.summary()

In [10]:

model.fit(text_ds, verbose=0, epochs=EPOCHS, callbacks=[text_gen_callback1, text_gen_callback2, checkpoint_callback])

Epoch 1/10
generated text:
this movie is a good time to make [UNK] of all , [UNK] [UNK] [UNK] [UNK] [UNK] . you have a [UNK] . a great . it was an [UNK] to me . it was a few to me to the make make see

generated text:
[UNK] fast makes me [UNK] [UNK] , and the film of this . [UNK] , a good [UNK] of the [UNK] , and the [UNK] . i have been a good . but i can see the acting is the acting most film movie movie



FailedPreconditionError: checkpoint.tmp3627edaa72324296af83ee31915aa24c; Is a directory

### Predictions

We can now create some predicted text with our trained model. Below are just some helper functions to make predictions of a certain length. The quality of the text is highly, highly variable - the dataset used was only one source, and wasn't massive. The model is also pretty small. I ran this once for 100 epochs of training, and the results I got for generated text were:
<ul>
<li> this movie is not really a bad movie . there are only two stars (the heroine ) almost all vampire tales being [UNK] i psycho if your life . it wasn 't that part of this movie . i wanted to see more like that , but i just have to say i really liked it that movie and even the music was . in my opinion the movie itself and it really gets a look like something that you might can just really can just a bad movie . there are only two stars (the heroine ) almost all vampire tales being [UNK] i psycho if your life . it wasn 't that part of this movie . i wanted to see more like that , but i just have to say i really liked it that movie and even the music was . in my opinion the movie itself and it really gets a look like something that you might can just really can just 
<li> fast makes me forget . check . when you think carl brashear [UNK] gooding jr . and his navy master chief diver bill finds himself inside his hands , and even by his face running his navy , deep and surface story of carl brashear 's brother , played by cuba gooding jr . he died in his obsession with his cat competition . . the physical comedy gives him great example of his first powerful karate and kung fu and and master . forget . check . when you think carl brashear [UNK] gooding jr . and his navy master chief diver bill finds himself inside his hands , and even by his face running his navy , deep and surface story of carl brashear 's brother , played by cuba gooding jr . he died in his obsession with his cat competition . . the physical comedy gives him great example of his first powerful karate and kung fu and and master . 
<li> are going to make this country great movie on which you think the best film ever ? i am from a big city photographer . but then i saw "one dark night [UNK] summer " , and i left the store , the week later when i hit rock channels . i see 15 [UNK] only two films that have noted a very strange mixture of suspense and interaction between nature and noble and i found the kind of movies dimension . . . . . . . movie on which you think the best film ever ? i am from a big city photographer . but then i saw "one dark night [UNK] summer " , and i left the store , the week later when i hit rock channels . i see 15 [UNK] only two films that have noted a very strange mixture of suspense and interaction between nature and noble and i found the kind of movies dimension . . . . . . . 
<li> my dogs at that point i can 't say i am impressed with this movie for the first 10 years but i don 't remember any things i liked were in this animated movie , but there were something good in it . the animation is narration from the music , the songs were played : "you sons ' " and family how a kid died until after 9 years ! " . it was not that bad it was done bad done used that point i can 't say i am impressed with this movie for the first 10 years but i don 't remember any things i liked were in this animated movie , but there were something good in it . the animation is narration from the music , the songs were played : "you sons ' " and family how a kid died until after 9 years ! " . it was not that bad it was done bad done used
</ul>

Not amazing, but not terrible either. We have results that are more or less sentences, with some parts where we really go off the rails. For a small model, short training, and tiny dataset, I'd say we are doing reasonably well. Below, the sample_k value is somewhat similar to our Temperature parameter. This controls the amount of randomness in the prediction by controling the number of values that we are choosing from when generating a word. This k represents the number of most-likely words we are choosing from, so if it is 5, our generated word will come from the top 5 most likely words. If it is 1, we will only choose from the most likely word. This is called top_k sampling, it is a pretty simple method to add randomness to a generative model. As with any model, as we move farther out, we get more and more weird results - that really requires a larger model to capture well. There are several different ways that this selection process can be done, with each method applying a different algorithm to add randomness to the final generated token, this link is a good overview of some of the different choices at a high level: https://towardsdatascience.com/decoding-strategies-that-you-need-to-know-for-response-generation-ba95ee0faadc 

In [None]:
# Save Weights
weight_save_path = "custom_transformer_weights.keras"
model.save_weights(weight_save_path)

In [None]:
# Add some simple logic to, at least partially, de-UNK it
unknowns = [word_to_index.get(_, 1) for _ in ["<unk>", "<UNK>", "<Unk>", "<uNk>", "<unK>", "<UnK>", "<uNK>", "<UNk>"]]

def indToSentence(ind, dict):
    word = ""
    for n_ in ind:
        word += dict[n_] + " "
    return word

def sentenceToInd(sentence, dict):
    indicies = []
    words = sentence.split()
    for word in words:
        if isinstance(word, str):
            index = dict.get(word)
            if index is not None:
                indicies.append(index)
    return indicies

def sample_from(logits, sample_k = 2):
    logits, indices = tf.math.top_k(logits, k=sample_k, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    choice = np.random.choice(sample_k, p=preds)
    #if choice in unknowns:
    #    choice = np.random.choice(sample_k, p=preds)
    return choice

def generateText(model, index_to_word, word_to_index, startPrompt, length=80, sample_k = 12):
    start_tokens = sentenceToInd(startPrompt, word_to_index)
    num_tokens_generated = 0
    tokens_generated = []
    while num_tokens_generated <= length:
        pad_len = maxlen - len(start_tokens)
        sample_index = len(start_tokens) - 1
        if pad_len < 0:
            x = start_tokens[:maxlen]
            sample_index = maxlen - 1
        elif pad_len > 0:
            x = start_tokens + [0] * pad_len
        else:
            x = start_tokens
        x = np.array([x])
        y, _ = model.predict(x)
        #sample_token = np.argmax(y[0][sample_index])
        logits = y[0][sample_index]
        sample_token = sample_from(logits, sample_k)
        tokens_generated.append(sample_token)
        start_tokens.append(sample_token)
        num_tokens_generated = len(tokens_generated)
    txt = indToSentence(start_tokens + tokens_generated, index_to_word)
    return txt

In [None]:
k_size = 15

t1 = generateText(model, vocab, word_to_index, "this movie is not really", sample_k = k_size)
t2 = generateText(model, vocab, word_to_index, "Skiing fast makes me", sample_k = k_size)
t3 = generateText(model, vocab, word_to_index, "We are going to make this country great", sample_k = k_size)
t4 = generateText(model, vocab, word_to_index, "Where my dogs at", sample_k = k_size)
t5 = generateText(model, vocab, word_to_index, "How many licks would it take to", sample_k = k_size)

In [None]:
print(t1)
print(t2)
print(t3)
print(t4)

#### Alternate Generation Results

<b>These came from a different execution, where I set the parameters to be resource intensive (layers, heads, etc...)</b>

One note from this set of generated data is at the end of the first listing, Christopher Walken. Our model is only based on one-grams, so we don't have two word ngrams in the data. Despite this, the generated results are still able to pair the first and last name, and place it in a spot that is contextually supposed to be an actor. Great work, model! As well, we have something like "the acting and story is good , although it is quite campy", a phrase that definitely could be separated into two simple sentences, but is combined to make a more complex and more natural sounding phrase. We also get, "so you get naked" and "the actors are kinda stupid" as generated phrases, so we are confident that the text we are generating is inline with our source data - internet movie reviews.

We can also highlight the repeated spaces and periods. These are tokens that we need, we can't strip them out like we previously would for classification based on meaning. Our results would be inarguably better if we ran the results through some type of grammar/spell checker here. The model will eventually get much better at this if we supplied more data and epochs - punctuation and spaces occur all over the place in actual language, in a nearly unlimited number of variations, so picking up on patterns of what-goes-where requires more data for our naive model to learn. Think of someone adding spaces in an online text box, it might be due to lining up some words on the screen, or cutting/pasting, general laziness, or bad grammar. In any case, we may have any number of spaces, almost anywhere, with no real relation to the underlying meaning of the text. If this was a specific problem we are trying to solve in our model, we could preprocess the data. If we are making a large model to be used in many different applications, we may not want to do this, as it could impact other applications - programming in python is whitespace sensitive, a.k.a. the meaning of the code depends on the number of cases. In general model applications, our model will outlearn things like this, given enough training.

<ul>
<li> this movie is not really good . it is well done in the acting and story is good , although it is quite campy . and the story is good and the special effects are also excellent . i think the cast is really good , that of course christopher walken . the plot is simple but aside from laughing out loud moments .                       good . it is well done in the acting and story is good , although it is quite campy . and the story is good and the special effects are also excellent . i think the cast is really good , that of course christopher walken . the plot is simple but aside from laughing out loud moments .                       
<li> fast makes me think mtv movies and nowadays they usually are funny trash movies like the mindless shows that are somewhat good movies like scream ,but in turn off the whole thing into a mindless sort of thing , is the idea why someone gets upset by some people ) . . . this was a bad movie . . the actors are kinda stupid .                   think mtv movies and nowadays they usually are funny trash movies like the mindless shows that are somewhat good movies like scream ,but in turn off the whole thing into a mindless sort of thing , is the idea why someone gets upset by some people ) . . . this was a bad movie . . the actors are kinda stupid .                   
<li> are going to make this country great movie . . .no scenery (some japanese are not known , but it just tells us that he gets to know what happened to them , so you get naked , but then this time with it and a more serious movie . . . but that 's not worth hearing about this telling of a friend was mine that they decided to buy it anyway . i 'm not going to say how , that the what why , the movie . . .no scenery (some japanese are not known , but it just tells us that he gets to know what happened to them , so you get naked , but then this time with it and a more serious movie . . . but that 's not worth hearing about this telling of a friend was mine that they decided to buy it anyway . i 'm not going to say how , that the what why , the 
<li> my dogs at this [UNK] nasty (very short ) and mario [UNK] crop of other films . (and this is like many of us , talk about dinosaurs heaven ) starts with dinosaurs . mountains of hot [UNK] on a tour of performance which sets the island in dinosaurs and some are back to their craft . they both give us footage of animals to hire some real storyteller . often incoherent films such as don 't get me wrong , , , , this [UNK] nasty (very short ) and mario [UNK] crop of other films . (and this is like many of us , talk about dinosaurs heaven ) starts with dinosaurs . mountains of hot [UNK] on a tour of performance which sets the island in dinosaurs and some are back to their craft . they both give us footage of animals to hire some real storyteller . often incoherent films such as don 't get me wrong , , , , 
</ul>


### Challenge Exercise

Try to adapt this, or any transformer based text-generation model, to a new set of training data. For the most part, this example could be adapted to new data pretty easily, or work from scratch if you feel comfortable. There are lots of examples online if you Google something like, "keras transformer generate text". A recommendation is to try to use source training data that is simple, or written for children. The reason for this isn't really based on the model, but you're likely limited by Colab's free resource limits, and simple text requires a smaller model and less training - the experience is likely just a bit easier. <b>Above all else, generative text models are pretty popular currently, so being able to truthfully say that you made a SpongebobGPT or something from scratch is likely a good resume point and portfolio piece.</b> If nothing else, it'll impress any love interests you find over the summer. 

### Transformer Wrap-Up and Use

Transformers are new, so we don't have all that much experience using them and applying them to a variety of problems. To this point, it looks very likely that transformer based models will be the leaders in sequential data, and potentially much more. The key ability of the transformer to generate "attention" between any two values in a sequence, in parallel, is a massive advantage over LSTM models. This advantage lies both in the logic of the model and the ability to more efficiently parallelize the processing, which is hard with sequential data. The ability of transformer models to adapt to many types of problems, with little to no explicit direction, opens the door to many innovations in the near future. Some of the most exciting work is being done in the area of "few shot learning", where we can take a model that has been trained on a large amount of data, and then use that model to learn a new task with very little data. As transformer models progress, we'll likely see them edging out other architectures such as RNNs in many applications. 

The current massively impressive application of transformers is in larger versions of what we are doing, large language models. These models are, at their core, very similar to what we have created, the major difference being:
<ul>
<li> Far more training text allows the model to use the ability to learn from context to generate a far more accurate understanding of the structure of language. 
<li> A far larger model allows the model to "hold" a large vocabulary and a large number of relationships between words.
<li> Many large models utilize some form of human feedback to improve the model, generally by having a human vote for the best sentence generated by the model, or give a thumbs up/down to what a model produces. This adds in some human supervised learning, which can be important in helping the model become more natural sounding. 
</ul>

At the time of writing, the current state-of-the-art in language models is GPT4, which is a much larger evolution from the globally impactful ChatGPT. As computers get better, we should see models that are larger and smarter, especially as the transformer architecture allows us to scale up models by parallelizing the processing. Costs just to train some of these models are counted in the millions of dollars, either from renting cloud resources or purchasing GPUs and paying for electricity, so this is one of relatively few computing problems where computing speed and efficiency are critical limitations. Researchers are working on an assortment of ways to make these training times more efficient, from reducing the precision of calculations inside the neural networks (saving time for each +-*/ operation), to creating more efficient optimization algorithms for gradient descent, to building hardware that is inherently more efficient at training these models. One of the unique things about predictive modelling is that, generally, faster hardware not only means models are trained faster, it means that models can be trained <i>better</i>. Even without any improvements to the code, a model that can try more hyperparameter combinations in a grid search, process more training data, or train for more epochs, will be able to find a better solution. In this sense, faster GPUs and smarter algorithms are two sides of the same performance/accuracy coin; more speed = more attempts = more training = better model.

Due to these factors, as well as the increasing number of people with knowledge of neural networks and machine learning, we can expect to see the rate of progress in this area to continue to accelerate, potentially to a shocking extent. Simply adding more data, using newer hardware, and refining existing code will naturally lead to all of these large models getting better and better as time progresses. With many very smart people working on the implementation of the transformers in code, we can also expect to see both general refinements and the potential for some big jumps in ability as we've hinted at. Large language models have really only been a thing that the general public has known about since ChatGPT, so the increased exposure should lead to more brains thinking about it, and more innovation. Since this specific type of model has only existed for ~5 years, there's probably a lot of runway left for incremental improvements - I recently saw a paper on reducing or compressing the number of associations that are saved in the model, as many words in a sentence may not have "attention" needed between them; reducing that means fewer weights, less storage, less RAM, and faster processing - assuming you can keep the data you need to make the model! There are also improvements in a model's ability to train itself or generate training data. We looked at augmenting image data, a relatively simple way to add training data to an image model. The ability to create adversarial networks, which generate data then evaluate it as fake or real, also has the potential to speed the development of large neural networks. If we can create feedback that is "good enough" for training, we can let models just train indefinately and keep improving; we commonly see this type of approach on models that play games like chess, they play against themselves until they are amazing. A relatively small improvement in any of the underlying constraints, such as the ability to perform the weight update calculations slightly faster, will probably lead to outsized improvements in model performance down the line. 

This quick and accellerating development of the abilities of predictive models is extremely exciting, but also somewhat worrying. There is generally a low understanding of AI tools in the population, and the felt impact of these tools will be large. I'd predict that developments in neural networks feeds several unpredicatable changes in society:
<ul>
<li> Discharging of responsibility - we'll likely see more use of models making critical decisions with little to no human interaction. Take a self-driving car as an example, if the model that controls the car decides to drive it into a bunch of 5 year olds, who is responsible? The driver is currently, but that doesn't seem correct. I doubt your car insurance is excited to pay for Tesla's mistakes.
<li> Automation of bias - as an extension to the point above, models can make decisions that are biased, and implicitly excuse that bias in the process. We'll probably see this in things like hiring decisions and loan approvals, where the people making or using the model may not explicitly intend to be biased, but the old training data is. We can see this in the film Coded Bias, where facial recognition models work far better on white men, as that is the demographic that was used to train the model.
<li> Automation of jobs - as models become more and more capable, we'll see more and more jobs that are automated. This is already happening, but it will accelerate as models become more and more capable. This will lead to a lot of people being displaced from their jobs, often unpredicably and en masse. Once a self-driving semi-truck is "good enough" to replace a human driver, anyone using a human driver is at a massive disadvantage in the market. Potentially more concerning is knowledge and creative work - once AI is a little bit better, it'll be able to be tasked to do things like generate articles on today's news and format them into a newspaper, or generate illustrations to pair with the text of a children's book. This ability to be able to create content for essentially $0 may generate massive shifts in the very concept of creation. 
<li> Ownership of creation - as an extension of above, artists create art and they hold the license to that art. If a model can look at all art that has already been created and generate new art from that, who owns what is generated? Large players like Google, Amazon, and Apple have the ability to process much more data than any individual; can these companies simply scan every museum, art gallery, song, book, cartoon, anime, and movie in existence, generate new content, copyright that content, and basically choke out the ability of any individual to create anything that succeeds in the market? Disney has scores of original characters (including Marvel and Star Wars) along with a massive library of content and a ruthlessly agressive legal division. There is a realistic scenario in the near future where Disney can simply automate the production of new content, requring humans only to do a little polishing on the final product - no actors, voice actors, writers, animators, etc. needed. With enough processing time and data, a model could even generate new content dynamically, based on what people are looking for, watching, or seeing in the real world, then copyright this content for Disney - content could even be created that is tailored to the interests of one specific viewer. New content coming from humans would be a novelty, and it would be very difficult to compete with movies that can be generated by a single prompt and a few hours of processing time.
<li> Fake news - finally, as an extension of the last point, generative AI has the ability to seriously change the idea of truth. Right now, a video of someone doing something is generally assumed to be true. We can edit or create fake videos, but it takes time and effort, so it is unlikely that someone will create a high quality but fake video outside of something like a movie. As tools get better and processors get cheaper, this balance changes. We will probably see deepfake videos offered as evidence in court, used to sway elections, or to craft fully fake naratives. Based on recent events, conservative parties accross the globe will likely spawn a cottage industry of realistic looking, but totally fake, videos of their opponents doing all kinds of embarrassing things.
</ul>