# GPT from scratch

We have a secret for you. It's something you probably already know but everyone is too afraid to say:


Nobody likes writing ticket descriptions 🙄


In this challenge, you will be building your own GPT-style model that generates realistic-sounding tickets so you won't have to spend any more time writing all those long descriptions out for your hard-working TAs (you can just generate one here and paste it across!) 😮‍💨

<img src="https://wagon-public-datasets.s3.amazonaws.com/data-science-images/06-DL/smarter_not_harder.jpg" width = "300px">

This notebook will cover both theory and practice. Really take the time to understand __why__ we're doing what we're doing rather than just implementing the challenges. 

👉👉👉 This is the biggest challenge of the day and for good reason - we'll be solidifying all of the core concepts from the lecture, so don't stress if it takes you most of the day. 👈 👈 👈 

We're going to go through each step of this slowly and methodically, but the broad strokes look like this:

### 1️⃣ Data Preprocessing 📊

Read in our tickets data, clean it and split it into training, testing, and validation tensorflow datasets. 🧹🔀

### 2️⃣ Vocabulary Creation 📚

To start, we'll be using a simplistic TextVectorization layer from TensorFlow with a custom text standardization function to turn our text into tokens. We're also going to make sure we can translate between our tokens and our original text nice and smoothly. 🗂️

### 3️⃣ Model Creation and Training 🔨

Define the architecture of the text generation model using a Transformer-based approach. This is where a lot of the heavy lifting will get done 🏗️🧠

### 4️⃣ Text Generation 📝🔮

Finally, we'll define a callback function to generate sample text at the end of each epoch. Let your model's creativity shine! ✨🌟


### 🎉5️⃣ 🎉  Freestyle 
By now you'll be a pro at NLP, so in this section you'll have the opportunity to take the code you've written, tidy it up, and then start playing around with your own datasets! Get ready to explore and have some fun!

## Enough preamble! Let's get cracking!

You'll need this library a little later so for now, run the cell below to install the `keras_nlp` library:

In [None]:
!pip install keras-nlp

In [None]:
import tensorflow as tf

## 1️⃣ Data Preprocessing 📊

Run the cell below to download this ```tickets.txt``` file from this [link](https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/tickets.txt) and put it into a ```data/``` folder. Then load the txt file into a variable - the variable will just be one long string:

In [None]:
!mkdir -p data
!curl https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/tickets.txt > data/tickets.txt
with open("data/tickets.txt", "r") as f:
    text = f.read()

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  123k  100  123k    0     0   640k      0 --:--:-- --:--:-- --:--:--  656k


Print the first 1000 characters of the string.

In [None]:
# $CHALLENGIFY_BEGIN
text[:1000]
# $CHALLENGIFY_END

"I'm developing a sentiment analysis model for customer reviews, but I'm struggling with handling domain-specific language or sarcasm. What are some techniques like domain adaptation, transfer learning, or using pre-trained language models such as BERT or GPT-3 that can help me improve sentiment analysis performance on such challenging data? --- I'm facing challenges in detecting and handling outliers in my numerical data. How can I use techniques like the interquartile range (IQR), Z-score, or robust statistical methods to identify outliers and decide whether to remove them or treat them differently in my analysis or modeling pipeline? --- I'm working on a collaborative filtering-based recommendation system, and I need guidance on handling cold start problems when dealing with new users or items with limited interaction data. How can I leverage techniques like content-based filtering, popularity-based recommendations, or hybrid approaches to address the cold start issue? --- I'm encou

Each ticket is broken up by some dashes. Split the text on the dashes. 

In [None]:
# $CHALLENGIFY_BEGIN
tickets = text.split(" --- ")
# $CHALLENGIFY_END

Our first task is to add " EOS " to the end of each of the strings we've created. This will let our model know that it's hit the end of the sentence.

In [None]:
# $CHALLENGIFY_BEGIN
tickets = [sentence + " EOS " for sentence in tickets]

tickets[0]
# $CHALLENGIFY_END

"I'm developing a sentiment analysis model for customer reviews, but I'm struggling with handling domain-specific language or sarcasm. What are some techniques like domain adaptation, transfer learning, or using pre-trained language models such as BERT or GPT-3 that can help me improve sentiment analysis performance on such challenging data? EOS "

Check how many tickets you have and how long the longest ticket description (__in words - not characters__)  is. Save that maximum length as a variable `max_len`.

In [None]:
# $CHALLENGIFY_BEGIN
max_len = max([len(ticket.split()) for ticket in tickets])
# $CHALLENGIFY_END

In [None]:
print(f"There are {len(tickets)} tickets in the dataset")
print(f"The longest ticket description is {max_len} words (including the 'EOS' word)")

There are 372 tickets in the dataset
The longest ticket description is 56 words (including the 'EOS' word)


Next up we need to convert our nicely prepared collections of sentences into tokens. To do that we are going to use a simple [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer. 

We'll talk more about advanced tokenizing techniques later but - for now - let's instantiate a tokenizer layer called `vectorize_layer` that "standardizes" all of our sentences to lower case (we won't worry about punctuation for the time being), outputs integers and has a maximum sentence length equal to our ```max_len```. Check the docs or docstrings for guidance on how to achieve these steps.

In [None]:
from tensorflow.keras.layers import TextVectorization

# $CHALLENGIFY_BEGIN

vectorize_layer = TextVectorization(
    standardize="lower",
    output_mode="int",
    output_sequence_length=max_len,

)
# $CHALLENGIFY_END

Once we've instantiated the layer, we need to call its ```adapt()``` method on our sentences (i.e. pass our ```tickets``` variable into `vectorize_layer.adapt()`). When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a vocabulary from them. 

Then we can investigate our vocabulary using the [```get_vocabulary()```](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#get_vocabulary) method.

Assign the list produced to a variable called `vocab` and take a look at the first 10 in the list.

In [None]:
# $CHALLENGIFY_BEGIN
vectorize_layer.adapt(tickets)
vocab = vectorize_layer.get_vocabulary()

vocab[:10]
# $CHALLENGIFY_END

['', '[UNK]', 'or', 'like', 'and', "i'm", 'eos', 'can', 'techniques', 'to']
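
To build some intuition for what `adapt()` is doing, here is a rough pure-Python sketch of frequency-based vocabulary building. It is not TensorFlow's implementation: the helper name `build_vocab` and the toy sentences are made up for illustration, and the order of equally frequent words may differ from the real layer.

```python
from collections import Counter

def build_vocab(sentences):
    """Hypothetical helper: lower-case, split on whitespace, rank words by frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Index 0 is reserved for padding ('') and index 1 for unknown words ('[UNK]')
    return ["", "[UNK]"] + [word for word, _ in counts.most_common()]

toy_sentences = ["I like models EOS", "I like data EOS", "I tune models EOS"]
toy_vocab = build_vocab(toy_sentences)
print(toy_vocab)
```

The two reserved entries at the front are why your real `vocab` starts with `''` and `'[UNK]'` before any actual words appear.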

In [None]:
vocab_size = "### YOUR CODE HERE"
# $DELETE_BEGIN
vocab_size = len(vocab)
# $DELETE_END

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('vocab',
    vocab_size = vocab_size
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/markbotterill/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/markbotterill/code/lewagon_dev/data-solutions/06-Deep-Learning/05-Transformers/03-GPT-from-scratch/tests
plugins: dash-2.11.1, asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_vocab.py::TestVocab::test_vocab PASSED                              [100%]



💯 You can commit your code:

git add tests/vocab.pickle

git commit -m 'Completed vocab step'

git push origin master



Execute the following cell to create a dictionary that will allow you to translate your tokens back to words:
    


In [None]:
index_lookup = dict(zip(range(len(vocab)), vocab))

Try calling the vectorizer layer on an example sentence of your choosing (i.e. running `vectorize_layer("Try me")`). Ensure you get a tensor out on the other side, filled with integers!

In [None]:
# $CHALLENGIFY_BEGIN
sentence_as_tokens = vectorize_layer("Working with deep learning models is the best")
# $CHALLENGIFY_END

Then ensure you can translate the tokens back into words with your dictionary.
<details >
<summary>Click for hint 👇</summary>
<br>
The vectorizer outputs a tensor which you'll need to convert back to a list of numbers if you want to loop through them. Try using <code>.numpy().tolist()</code> on your tensor.
</details>

In [None]:
# $CHALLENGIFY_BEGIN
translated_back = [index_lookup[token] for token in sentence_as_tokens.numpy().tolist()]
translated_back
print(translated_back)
# $CHALLENGIFY_END

['working', 'with', 'deep', 'learning', 'models', 'is', 'the', 'best', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


You may get some `[UNK]` tokens if you try putting in a word that isn't included in our rather small vocabulary, but don't worry!
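
If you'd like to see padding and unknown words side by side, here is a small pure-Python sketch of the round trip. The helpers `encode` and `decode` and the six-word `toy_vocab` are hypothetical stand-ins for the vectorizer and `index_lookup`; they only mimic the behaviour on a tiny scale.

```python
def encode(sentence, vocab, max_len):
    """Hypothetical helper: map words to ids (1 for unknown words), then pad with 0."""
    word_to_id = {word: i for i, word in enumerate(vocab)}
    ids = [word_to_id.get(word, 1) for word in sentence.lower().split()]
    ids = ids[:max_len]                      # truncate sentences that are too long
    return ids + [0] * (max_len - len(ids))  # pad short sentences with zeros

def decode(ids, vocab):
    """Hypothetical helper: translate ids back into words."""
    return [vocab[i] for i in ids]

toy_vocab = ["", "[UNK]", "deep", "learning", "is", "fun"]
ids = encode("Deep learning is MAGIC", toy_vocab, max_len=6)
print(ids)
print(decode(ids, toy_vocab))
```

The word "magic" is not in the toy vocabulary, so it comes back as `[UNK]`, and the two padding slots come back as empty strings.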

Now that you feel comfortable with the vectorizer, let's loop through all of the sentences and tokenize each one of them! Save this list as a variable ```all_tokenized``` 

In [None]:
# $CHALLENGIFY_BEGIN
all_tokenized = [vectorize_layer(sentence) for sentence in tickets]

print(len(all_tokenized))

all_tokenized[0]
# $CHALLENGIFY_END

372


<tf.Tensor: shape=(56,), dtype=int64, numpy=
array([  5,  44,  12,  77,  62,  20,  11, 105, 477, 288,   5, 356,  29,
        32, 751,  72,   2, 613,  41, 141, 196,   8,   3, 528, 804,  88,
        93,   2,  18,  90,  72,  24, 131, 327, 440,   2, 714, 249,   7,
        47,  46,  31,  77,  62,  92,  16, 131, 784,  63,   6,   0,   0,
         0,   0,   0,   0])>

You should now have a ```list``` of 372 tensors, with each one being a 1-dimensional tensor of length 56.

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('tokenization',
    list_length = len(all_tokenized),
    contains_tensor = tf.is_tensor(all_tokenized[0]),
    tensor_shape = all_tokenized[0].shape[0]
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/markbotterill/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/markbotterill/code/lewagon_dev/data-solutions/06-Deep-Learning/05-Transformers/03-GPT-from-scratch/tests
plugins: dash-2.11.1, asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 3 items

test_tokenization.py::TestTokenization::test_len PASSED                  [ 33%]
test_tokenization.py::TestTokenization::test_seq_length PASSED           [ 66%]
test_tokenization.py::TestTokenization::test_type PASSED                 [100%]



💯 You can commit your code:

git add tests/tokenization.pickle

git commit -m 'Completed tokenization step'

git push origin master



Now that we have all of our sentences as sequences of tokens, let's think hard about exactly what our X and y are going to be. 372 sentences does not seem like a lot of information to train our model on - it isn't and this is partly so that we can have faster training times for demonstration purposes. But we can also split our sentences down into smaller X, y pairs. 

Our model is supposed to predict the next word in a sentence, only knowing the words that it has up until that point. Below you can see the most obvious next-word training example.

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/gpt_scratch_1.png">

But there are many other sets of Xs and ys that we can get out of our sentence:

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/quick_brown_examples.png">

From a sentence of 8 words, we have created 7 (i.e. n - 1) X-y pairs! Is there a way we can think about implementing this more quickly? Let's consider replicating our sentence a total of 7 times.

<img src = https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/quick_brown_examples_X.png>

We're interested in just the parts underlined in red - what would be great would be if we could take a mask and ignore anything that wasn't desired. 

<img src = https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/quick_brown_examples_masking.png>

This is exactly what `tf.linalg.band_part()` (you saw this in the warm-up exercise!) is going to do for us!

Finally let's think about our ys - we could take one from each row of our tensors, but there's a more efficient way:

<img src = https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/quick_brown_examples_y.png>

So our ys will just be our sentence `sequence[1:]`
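
Before trying this in TensorFlow, it may help to see the same idea on a tiny scale. The sketch below uses NumPy rather than TensorFlow, with `np.tril` standing in for `tf.linalg.band_part(t, -1, 0)`; the 8-token `sequence` and the names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np

sequence = np.arange(1, 9)   # stand-in for an 8-token sentence: [1, 2, ..., 8]
n = len(sequence)

# Repeat the sentence n - 1 times: shape (7, 8)
tiled = np.tile(sequence[np.newaxis, :], (n - 1, 1))

# Keep the lower triangle and zero out everything else,
# so row i only "sees" the first i + 1 tokens
X_demo = np.tril(tiled)

# The target for row i is simply the next token in the sentence
y_demo = sequence[1:]

print(X_demo)
print(y_demo)
```

Each row of `X_demo` is a progressively longer prefix of the sentence, padded with zeros, and the matching entry of `y_demo` is the token that comes next.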

Take this dummy tensor and perform the following operations on it:
1) Create a tensor of shape 55x56 that is just repeats of the original sequence with ```tf.tile()```. First, you'll need to use `tf.expand_dims` here so that you can `tile` in two dimensions.

In [None]:
dummy_tensor = tf.range(0,56)

In [None]:
# Your code here

2) Mask out the upper triangle of the tensor with ```tf.linalg.band_part()```.


In [None]:
# Your code here

3) Create our corresponding ys from it.

In [None]:
# Your code here

When those steps work, fill out the function that will `return X, y` (where X is a tensor of shape (55, 56) and y is of shape (55,)).

In [None]:
def X_y_creator(sequence_tensor):
    # $CHALLENGIFY_BEGIN
    tiled_sequence = tf.tile(tf.expand_dims(sequence_tensor, 0), [max_len - 1, 1])
    X_s = tf.linalg.band_part(tiled_sequence, -1, 0)
    y_s = sequence_tensor[1:]
    # $CHALLENGIFY_END
    return X_s, y_s

You should be able to run the cell below without error: 

In [None]:
X, y = X_y_creator(all_tokenized[0])

In [None]:
from nbresult import ChallengeResult


result = ChallengeResult('xy_creater',
    list_length = len(X_y_creator(all_tokenized[0])),
    X_shape = X_y_creator(all_tokenized[0])[0].shape,
    y_shape = X_y_creator(all_tokenized[0])[1].shape
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/markbotterill/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/markbotterill/code/lewagon_dev/data-solutions/06-Deep-Learning/05-Transformers/03-GPT-from-scratch/tests
plugins: dash-2.11.1, asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 3 items

test_xy_creater.py::TestXyCreater::test_X_shape PASSED                   [ 33%]
test_xy_creater.py::TestXyCreater::test_list_length PASSED               [ 66%]
test_xy_creater.py::TestXyCreater::test_type PASSED                      [100%]



💯 You can commit your code:

git add tests/xy_creater.pickle

git commit -m 'Completed xy_creater step'

git push origin master



Now apply this function to each element in our ```all_tokenized``` list.

In [None]:
Xs = []
ys = []
for sequence in all_tokenized:
    # $CHALLENGIFY_BEGIN
    X, y = X_y_creator(sequence)
    Xs.append(X)
    ys.append(y)
    # $CHALLENGIFY_END


Now we just need to tidy up a little. We've been doing things quite inefficiently with all of our ```for``` loops, but that's so we can understand every step of the process and apply things methodically. We have a list of tensors for our `X`s and a list of tensors for our `y`s, so let's use `tf.concat` to create our 20460 training examples (372 tickets × 55 examples each - that got big quickly!). Call this new variable `X` (shape `(20460, 56)`), then do the same concat process for your ys and save the result in a variable `y` (shape `(20460,)`).

In [None]:
# $CHALLENGIFY_BEGIN
X = tf.concat(Xs, axis = 0)
y = tf.concat(ys, axis=0)
# $CHALLENGIFY_END

There's one thing we haven't considered yet! A lot of our y values are just 0s because there is so much padding. Let's quickly create a boolean mask to figure out where our `ys` are __not__ zero and then only keep those examples from both our X and y.

In [None]:
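# $CHALLENGIFY_BEGIN
# Boolean mask: True wherever the target is a real token rather than padding (0)
non_padding = y != 0

X = tf.boolean_mask(X, non_padding)
y = tf.boolean_mask(y, non_padding)
# $CHALLENGIFY_END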

Phew! We've just dropped almost 5000 useless training examples. Check your X and y against the tests below to make sure you've ended up with the right shapes and values for your X and y.

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('final_shapes',
    X_shape = X.shape,
    X_value = X[500][7],
    y_shape = y.shape,
    y_value = y[356],
    zeroes = tf.math.reduce_sum(tf.cast(y==0, "int32"))
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/markbotterill/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/markbotterill/code/lewagon_dev/data-solutions/06-Deep-Learning/05-Transformers/03-GPT-from-scratch/tests
plugins: dash-2.11.1, asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 5 items

test_final_shapes.py::TestFinalShapes::test_X_shape PASSED               [ 20%]
test_final_shapes.py::TestFinalShapes::test_sample_X_values PASSED       [ 40%]
test_final_shapes.py::TestFinalShapes::test_y_shape PASSED               [ 60%]
test_final_shapes.py::TestFinalShapes::test_y_value PASSED               [ 80%]
test_final_shapes.py::TestFinalShapes::test_zeroes PASSED                [100%]


systemMemory: 16.00 GB
maxCacheSize: 5.33 GB



💯 You can commit your code:

git add tests/final_s

All sorted! That was a lot of work - and you'll only have to do this once - but it's crucial you understand what is going into our model and what is being predicted from the input.

## Modelling

Now we get to the tricky part - building our model! As you will recall from the lecture - GPT style models are what we call "decoder-only". GPT decoder-only models work by leveraging the Transformer architecture, specifically the decoder component, to generate output sequences based on input sequences. Let's take a look at this diagram that shows the architecture of GPT-2:

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/GPT2.png" width="400px">


Focus your attention on the left side of the diagram and you can easily visualize the journey that our words (tokens) take along their path.

### Step 1. Positional Encoding and Word Embeddings: 

Our input to the model is the lovely tokens we've just prepared. We now need to do two things to our input tokens:  

1) Give them a regular token embedding (as we did yesterday in our NLP tasks). As a reminder, a token embedding means taking a token and embedding its meaning across however many ```embedding_dimensions``` we choose. 


2) Give them a positional encoding, which gives the model clues about where each word is in the sentence. This positional encoding is simply added to the token embeddings. 

This means the model understands __both__ the relative positions of the tokens in the sequence __as well as__ what each word "means". Fortunately, we can use a `TokenAndPositionEmbedding()` layer from the `keras_nlp` library to do both of these at once! See the diagram below for a reminder on this step from the lecture: 

<img src =https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/positional_encoding_sketch.png width=300px>
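
To make the "simply added" part concrete, here is a NumPy sketch of a token table and a position table being looked up and summed. It is only an illustration of the idea, not how `keras_nlp` implements the layer: the tables here are random rather than learned, and the names `token_table` and `position_table` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, embed_dim = 1150, 56, 512

token_table = rng.normal(size=(vocab_size, embed_dim))   # one row per word in the vocabulary
position_table = rng.normal(size=(max_len, embed_dim))   # one row per position in the sequence

tokens = np.array([5, 44, 12] + [0] * (max_len - 3))     # a padded toy sequence of token ids

# Look up a vector for each token, look up a vector for each position, and add them
embedded = token_table[tokens] + position_table[np.arange(max_len)]

print(embedded.shape)  # (56, 512)

# Number of values stored in the two tables
n_params = token_table.size + position_table.size
print(n_params)
```

With these sizes the two tables hold `(1150 + 56) * 512` values between them, which is a useful number to keep in mind when you look at a model summary.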

### Step 2. The Transformer Block: 

Our embedded vectors now enter the Transformer block which is the __heart of the GPT architecture__. The attention mechanism here allows the model to attend to different positions in the input sequence when making predictions. The model learns the importance of each input token (now represented with its embeddings) by calculating attention weights, which reflect the token's relevance to other tokens in the sequence. 

This happens in a few steps - first we project our embedded vectors into Query, Key, and Value vectors. This can get a little more complex for multi-headed attention as you can see [here](https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853) but once we have these matrices, we perform scaled dot-product attention by simply following this formula!

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/key-query-value.png">

Once we've done that and have updated our embeddings, we pass them through the final layers of the Transformer Block: a small feed-forward network, plus some Dropout and LayerNormalization. 

The upshot of all of this is that we end up with updated vectors on the other side of our Transformer Block - each vector will still be 512 long, but they'll contain more information about their importance with respect to the task at hand (in our case predicting the next word).
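
As a framework-free illustration of scaled dot-product attention, here is a NumPy sketch. It assumes the scale is the square root of the last dimension of the query matrix, and the function names and random inputs are made up for the example, so treat it as a sketch of the idea rather than the exact code this notebook uses.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)  # how strongly each query matches each key
    weights = softmax(scores, axis=-1)              # each row becomes a set of weights summing to 1
    return weights @ v, weights                     # weighted average of the value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))   # 3 tokens, 4 features each
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))

out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)
print(weights.sum(axis=-1))
```

Notice that the output has the same shape as the input: attention does not change the size of the vectors, only what information they carry.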

### Step 3. Making a prediction: 

We then need to think very carefully about what our output is going to be. 

Remember - our `X` is all of the sentence up to a point and our `y` is the next word. What does this mean for prediction? Well essentially we have a massive classification problem in front of us. We need to pick the next word correctly, so how many choices do we have? 

Answer: as many words as we have in our vocabulary! Our output will be a Dense layer with as many neurons as we have words in our vocabulary. It will have a "softmax" activation function - which just means that it will essentially be predicting probabilities across all of our neurons that add to one. We want the next word to have a value as close to 1 as possible.

View the process below:

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/prediction_png_flow.png">
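
Here is a small NumPy sketch of that final step for a single training example. The five-word `toy_vocab` and the `logits` values are invented for illustration; in the real model the scores come out of the Dense layer.

```python
import numpy as np

toy_vocab = ["", "[UNK]", "model", "data", "eos"]
logits = np.array([0.1, 0.2, 2.5, 0.3, 0.4])   # made-up raw scores, one per word

# Softmax: turn the raw scores into probabilities that add up to one
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

# The prediction is the word with the highest probability
predicted_id = int(np.argmax(probs))
print(toy_vocab[predicted_id])

# Sparse categorical cross-entropy for this example:
# the negative log of the probability given to the true next word
true_id = 2
loss = -np.log(probs[true_id])
print(loss)
```

The closer the probability of the true word gets to 1, the closer this loss gets to 0.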

What does this look like for us in terms of code? Well, here is our model-definition function with a few holes in it:
    

In [None]:
import keras_nlp  # needed for keras_nlp.metrics.Perplexity() below
from keras_nlp.layers import TokenAndPositionEmbedding
from tensorflow.keras import layers

def create_model(max_sequence_length, vocab_size, embedding_dimension):
    
    # 1. First up we define a layer that just grabs the inputs to our model
    # We use the standard Input() layer and we know that each X going into
    # our model is going to be max_sequence_length long (56 for our tickets)!
    inputs = layers.Input(shape=(max_sequence_length,), dtype=tf.int32)

    # 2. Next we give our tokens Positional and Regular Embeddings which is done 
    # by a nicely built layer that takes these arguments! 
    x = TokenAndPositionEmbedding(vocab_size, 
                                  max_sequence_length, 
                                  embedding_dimension, 
                                  mask_zero = True)(inputs)
    
    # 3. This part we are going to define in a moment - don't worry 
    # about it for now - we'll come back to it!
    x = TransformerBlock(num_heads=4, 
                         embed_dim=embedding_dimension, 
                         ff_dim=embedding_dimension)(x)

    
    # 4. This is just a regular Dropout layer that you've 
    # seen earlier in the week that helps our model avoid overfitting
    x = layers.Dropout(0.4)(x)
    
    
    # 5. At this point in the model we'll have tensors with 
    # shape (batch_size, sequence_length, embedding_dimension) but 
    # we want to squish it down so we use GlobalAveragePooling1d. All 
    # this does is average elements across our sequence and 
    # squish them into a (batch_size, embedding_dimension) tensor. 
    x = layers.GlobalAveragePooling1D()(x)
    
    
    # 6. Finally we just need to have our "classification" layer 
    # which needs to be as large as our vocabulary size
    outputs = layers.Dense(vocab_size, activation='softmax')(x)
    
    
    # Now we just just stick our model together with the Functional API
    # and compile with "adam" for our optimizer and 
    # "sparse_categorical_crossentropy" to compute the loss 
    # between our predicted labels and our actual labels. 
    # We'll talk about perplexity a little later.
    
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer="adam", 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

Using TensorFlow backend


In [None]:
create_model(50, 50, 50)

NameError: name 'TransformerBlock' is not defined

If you run the cell above you will get an error! Why? Well, because we haven't defined all of our layers yet: we still need to implement our `TransformerBlock` - this is the part where all the magic happens!

To code our Transformer Block - we'll need to code our attention mechanism first. Here's a reminder on what that looks like.

<img src = "https://wagon-public-datasets.s3.amazonaws.com/data-science-images/lectures/Transformers/key-query-value.png">

All this layer is doing is projecting our embedded inputs into three matrices - queries, keys, and values - and using the interaction between all three to give us better embeddings for our words. Let's break it down step by step:

1. We multiply the Q matrix with a transposed version of the K matrix.
2. We divide this by the square root of our model dimension.
3. We take the softmax along the final dimension of the matrix to get the scaled scores.
4. We multiply these softmax scores by the V matrix.

In [None]:
def coded_attention(query, key, value):
        # Step 1: Matrix multiply the query with the transpose of the key 
        # $CHALLENGIFY_BEGIN

        score = tf.matmul(query, key, transpose_b=True)
        
        # $CHALLENGIFY_END
        
        # Step 2: Divide this matrix by the square root of the hidden dimension
        # In our case this dimension will be 512 (with the square root being 22.6). 
        # You will have to use tf.cast(22.6, tf.float32) so that the two matrices can interact 
        
        # $CHALLENGIFY_BEGIN

        divider = tf.cast(22.6, tf.float32)
        scaled_score = score / divider
        # $CHALLENGIFY_END
        
        # Step 3:
        # Compute the softmax_scores - use tf.nn.softmax(scaled_score, axis = ?) 
        # Think about what dimension we should be using our softmax along -
        # it'll need to be our last dimension!
        # $CHALLENGIFY_BEGIN

        softmax_scores = tf.nn.softmax(scaled_score, axis=-1)
        # $CHALLENGIFY_END
        
        # Step 4: 
        # Matrix multiply the weights matrix with the value matrix
        # and set this to be your "output"
        # $CHALLENGIFY_BEGIN

        output = tf.matmul(softmax_scores, value)
        # $CHALLENGIFY_END

        
        # Return *both* the output and the softmax_scores as a tuple
        return output, softmax_scores

Run the cell below to test your function with some dummy tensors:
    

In [None]:
# Dummy tensor for query
example_query = tf.constant([[0.1, 0.2],
                     [0.3, 0.4]])

# Dummy tensor for key
example_key = tf.constant([[0.5, 0.6],
                   [0.7, 0.8]])

# Dummy tensor for value
example_value = tf.constant([[0.9, 1.0],
                     [1.1, 1.2]])
# Test your function
coded_attention(example_query, example_key, example_value)

In [None]:
from nbresult import ChallengeResult


example_query = tf.constant([[0.1, 0.2],
                     [0.3, 0.4]])


example_key = tf.constant([[0.5, 0.6],
                   [0.7, 0.8]])


example_value = tf.constant([[0.9, 1.0],
                     [1.1, 1.2]])
output = coded_attention(example_query, example_key, example_value)

result = ChallengeResult('attention',
    len_output = len(output),
    output_shape = output[0].shape,
    output_value = output[0][-1]
)

result.write()
print(result.check())

Now that's run, we can fold our hand-coded attention into our larger `MultiHeadAttention` layer and the - even larger - `TransformerBlock`. 

__If you want to go through and understand each step of the block below, do so later__, but for now you can run the cell below to define the architecture then move on down - we're almost there!

In [None]:
from tensorflow.keras import layers, Model, Sequential
import keras_nlp

class MultiHeadAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.embed_dim = embed_dim
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        # Your coded attention function goes here
        return coded_attention(query, key, value)

    def separate_heads(self, x, batch_size):
        # -1 lets TensorFlow infer the sequence length (56 for our tickets)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # Here, we "project" into long query, key, value vectors
        query = self.query_dense(inputs)
        key = self.key_dense(inputs)
        value = self.value_dense(inputs)
        batch_size = tf.shape(query)[0]

        # We rearrange the projections for each head
        query = self.separate_heads(query, batch_size)
        key = self.separate_heads(key, batch_size)
        value = self.separate_heads(value, batch_size)

        # We perform attention on our QKV for our heads
        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        output = self.combine_heads(concat_attention)
        return output

class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = Sequential(
            [layers.Dense(ff_dim, activation="relu"), 
             layers.Dense(embed_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        attention_output = self.attention(inputs)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
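
If the reshaping inside `separate_heads` feels opaque, the NumPy sketch below walks through the same split-and-merge on a random array. It is an illustration only, using the sizes from this notebook (56 tokens, 512 dimensions, 4 heads); the variable names are made up.

```python
import numpy as np

batch_size, seq_len, embed_dim, num_heads = 2, 56, 512, 4
projection_dim = embed_dim // num_heads   # 128 features per head

x = np.random.default_rng(0).normal(size=(batch_size, seq_len, embed_dim))

# Split: (batch, seq, embed) -> (batch, seq, heads, projection) -> (batch, heads, seq, projection)
heads = x.reshape(batch_size, seq_len, num_heads, projection_dim).transpose(0, 2, 1, 3)
print(heads.shape)

# Merge: undo the transpose, then flatten the heads back into embed_dim
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, embed_dim)
print(merged.shape)
```

Splitting and merging are exact inverses, so no information is lost: each head simply gets its own 128-wide slice of every token's vector to run attention on.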


All we need to do now is put it all together. The function below will stitch everything together for us! It's really only 5 layers (although - as we've just seen - one of them is quite complicated!)

In [None]:
def create_model(max_sequence_length, vocab_size, embedding_dimension):
       
    inputs = layers.Input(shape=(max_sequence_length,), dtype=tf.int32)
    x = TokenAndPositionEmbedding(vocab_size, 
                                  max_sequence_length, 
                                  embedding_dimension, 
                                  mask_zero = True)(inputs)
    x = TransformerBlock(num_heads=4, 
                         embed_dim=embedding_dimension, 
                         ff_dim=embedding_dimension)(x)
    x = layers.Dropout(0.4)(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(vocab_size, activation='softmax')(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer="adam", 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

We just need to instantiate our model with:
- `max_sequence_length` = Our longest sequence length
- `vocab_size` = The number of unique words we had in our vocabulary
- `embedding_dimension` = 512 (512 should work well for a dataset of this size)

In [None]:
# $CHALLENGIFY_BEGIN
model = create_model(56, 1150, 512)
# $CHALLENGIFY_END

Take a look at your model summary. 

In [None]:
# $CHALLENGIFY_BEGIN
model.summary()
# $CHALLENGIFY_END

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 56)]              0         
                                                                 
 token_and_position_embeddi  (None, 56, 512)           617472    
 ng_1 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 transformer_block (Transfo  (None, 56, 512)           1577984   
 rmerBlock)                                                      
                                                                 
 dropout_2 (Dropout)         (None, 56, 512)           0         
                                                                 
 tf.math.reduce_mean (TFOpL  (None, 512)               0         
 ambda)                                                      
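
It is worth checking where the parameter counts in the summary come from. The sketch below rebuilds them by hand, assuming the `MultiHeadAttention` layer defined earlier holds four `Dense` projections with biases (query, key, value and `combine_heads`) and that each `Dense` layer has `n_in * n_out + n_out` parameters. The helper `dense_params` is made up for illustration. The final `Dense` layer is cut off in the summary above, so its count here is derived rather than copied from the output.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

vocab_size, max_sequence_length, embed_dim = 1150, 56, 512
ff_dim = embed_dim  # create_model passes ff_dim=embedding_dimension

# Token embedding table + position embedding table
embedding_params = vocab_size * embed_dim + max_sequence_length * embed_dim

# Assumed: query, key, value and combine_heads projections, each Dense with bias
attention_params = 4 * dense_params(embed_dim, embed_dim)
# The two Dense layers of the feed-forward network
ffn_params = dense_params(embed_dim, ff_dim) + dense_params(ff_dim, embed_dim)
# Each LayerNormalization learns a scale and an offset per feature
layernorm_params = 2 * (2 * embed_dim)
block_params = attention_params + ffn_params + layernorm_params

# Final Dense layer projecting onto the vocabulary
output_params = dense_params(embed_dim, vocab_size)

print(embedding_params)  # 617472
print(block_params)      # 1577984
print(embedding_params + block_params + output_params)
```

The first two numbers match the `Param #` column above, which is a useful sanity check that the layers are wired up as intended.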

In [None]:
from nbresult import ChallengeResult

model = create_model(56, 1150, 512)

result = ChallengeResult('model',
    params = model.count_params()
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/markbotterill/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/markbotterill/code/lewagon_dev/data-solutions/06-Deep-Learning/05-Transformers/03-GPT-from-scratch/tests
plugins: dash-2.11.1, asyncio-0.19.0, typeguard-2.13.3, anyio-3.6.2
asyncio: mode=strict
collecting ... collected 1 item

test_model.py::TestModel::test_params PASSED                             [100%]



💯 You can commit your code:

git add tests/model.pickle

git commit -m 'Completed model step'

git push origin master



All that is left to do now is fit your model with our `X` and `y`. A batch size of 32 and 25 epochs should work well. Now you can just sip on some tea, read the explanation of perplexity below, and wait while our model works its magic! 

In [None]:
# Before fitting, we need to expand the dims of y to make it work
# with the perplexity measure when using the latest version of tf.
# This changes the shape from (n_samples, ) to (n_samples, 1).
y = tf.expand_dims(y, axis=1)

In [None]:
# $CHALLENGIFY_BEGIN
model.fit(X, y, batch_size=32, epochs = 25)
# $CHALLENGIFY_END

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.src.callbacks.History at 0x2d5c23bb0>

<details> 
    <summary> Click HERE for a Perplexity Explanation</summary>
<br>
Perplexity is a metric commonly used in the context of Language Models to evaluate the quality and performance of the model in predicting the next word or token in a sequence of words. It measures how well the language model assigns probabilities to a given sequence of words.

Mathematically, perplexity is calculated using the concept of cross-entropy. The perplexity score is the exponential of the average cross-entropy per word in a given dataset. The formula for perplexity is as follows:

$$
\text{Perplexity} = \exp\left(-\frac{{\sum_{{i=1}}^{{N}} \log(p(w_i))}}{{N}}\right)
$$

where $N$ is the total number of words in the dataset and $p(w_i)$ is the probability the language model assigns to the $i$-th word in the sequence.

In other words, we take the logarithm of the model's predicted probability for each true word and sum them. Dividing this sum by the total number of words $N$, negating it, and then taking the exponential gives the perplexity score.

A lower perplexity value indicates that the language model is more confident and accurate in its predictions, as it assigns higher probabilities to the true words in the dataset. Conversely, a higher perplexity score suggests that the model is more uncertain and less accurate in its predictions.

Perplexity is often used to compare different language models or to track the progress of a model during training. Lower perplexity values are generally desired, indicating better language modeling performance.</details>
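
To make the perplexity formula concrete, here is a small sketch that computes it directly from a list of probabilities the model assigned to the true words. The probabilities and the helper `perplexity` are made up for illustration; Keras computes the same quantity from the model's outputs during training. A model that guesses uniformly over a vocabulary of 1150 words gets a perplexity of about 1150, while a confident and correct model gets a value close to 1.

```python
import math

def perplexity(true_word_probs):
    """Exponential of the average negative log-probability of the true words."""
    n = len(true_word_probs)
    return math.exp(-sum(math.log(p) for p in true_word_probs) / n)

# A model that knows nothing: every word in the vocabulary is equally likely
vocab_size = 1150
uniform_guess = [1 / vocab_size] * 10
print(perplexity(uniform_guess))

# A model that assigns high probability to the true words (made-up values)
confident = [0.9, 0.8, 0.95, 0.7]
print(perplexity(confident))
```

One way to read the number: a perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly between $k$ words.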


### Using our model

You'll remember that GPT-style models work simply by taking an input sequence, predicting the next word, adding it to our existing input sequence and then predicting again and again!

We now want to write a generate function that takes as its input the starter_string. 

Then - in a for loop of ``range(max_len - len(starter_string.split()))``:

1. We use our ```vectorize_layer``` to convert it to a ```token_tensor```
2. Use ```model.predict()``` to get a tensor of size vocab_size out of our model (you'll need to pass it ```tf.expand_dims(token_tensor, 0)``` so that the input has the batch dimension our model expects!)
3. Use ```tf.argmax()``` to find the index of the most probable word, i.e. the largest number in the output (this is effectively a "greedy" algorithm as we just take the most likely word at any given step).
4. Use our ```index_lookup``` dictionary from earlier to convert that index from a number back into a word
5. Add that word to our input string and repeat, unless one of the two catches below applies.

Two catches: 

1) If we predict the word "eos" that means our model has predicted the end of the sentence so we ```return``` our ```starter_string``` in its current form. 

2) If the length of our string (when split on whitespace) is 55 then we also ```return``` our ```starter_string``` in its current form. 

N.B. Make sure you add whitespace when you add the word to your sequence!
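
Before wiring this up to the real model, it can help to see the loop on its own. The sketch below runs the same greedy procedure against a toy stand-in: `toy_vocab`, `toy_scores`, `toy_predict` and `toy_generate` are all made up for illustration, and the toy "model" only looks at the last word, whereas the real model looks at the whole sequence.

```python
# Hypothetical stand-in for the trained model: maps the last word to
# scores over a tiny vocabulary (one score per entry of toy_vocab)
toy_vocab = ["eos", "i", "need", "help", "with", "pandas"]
toy_scores = {
    "need":   [0.05, 0.0, 0.0, 0.8, 0.1, 0.05],
    "help":   [0.1, 0.0, 0.0, 0.0, 0.85, 0.05],
    "with":   [0.05, 0.0, 0.0, 0.05, 0.0, 0.9],
    "pandas": [0.9, 0.0, 0.0, 0.05, 0.05, 0.0],
}

def toy_predict(text):
    """Return the scores for the word that follows `text`."""
    last_word = text.split()[-1].lower()
    return toy_scores[last_word]

def toy_generate(starter_string, max_words=10):
    for _ in range(max_words - len(starter_string.split())):
        scores = toy_predict(starter_string)
        best_index = scores.index(max(scores))  # greedy: take the arg-max
        word = toy_vocab[best_index]
        if word == "eos":                        # end of sentence predicted
            return starter_string
        starter_string += f" {word}"
    return starter_string

print(toy_generate("I need"))  # I need help with pandas
```

The real `generate` function has the same shape; only the prediction step changes, from a dictionary lookup to `vectorize_layer` followed by `model.predict`.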

In [None]:
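def generate(starter_string):
    # $CHALLENGIFY_BEGIN

    # Predict at most enough words to fill the model's input window
    for _ in range(max_len - len(starter_string.split())):
        # Text -> token ids, plus a batch dimension for the model
        tokens = vectorize_layer(starter_string)
        token_expanded = tf.expand_dims(tokens, 0)
        pred = model.predict(token_expanded)
        # Greedy decoding: take the index of the most probable word
        index_pred = pred.argmax()
        word = index_lookup[index_pred]
        print(word)
        # Stop at the end-of-sentence token or once we reach 55 words
        if word == "eos":
            return starter_string
        if len(starter_string.split()) == 55:
            return starter_string
        starter_string += f" {word} "

    # Return what we have if the loop finishes without hitting a stop condition
    return starter_string
    # $CHALLENGIFY_END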


In [None]:
generate("I need")

to
detect
anomalies
or
outliers
in
my
irregular
time
series
data
with
prediction
or
credit
data.
what
are
some
techniques
like
domain
adaptation,
transfer
learning,
or
using
pre-trained
language
models
such
as
bert
or
gpt-3
that
can
help
me
improve
sentiment
analysis
performance
on
such
challenging
data?
eos


'I need to  detect  anomalies  or  outliers  in  my  irregular  time  series  data  with  prediction  or  credit  data.  what  are  some  techniques  like  domain  adaptation,  transfer  learning,  or  using  pre-trained  language  models  such  as  bert  or  gpt-3  that  can  help  me  improve  sentiment  analysis  performance  on  such  challenging  data? '
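
The doubled spaces in the returned string come from appending each word with a space on both sides (`f" {word} "`). If you want tidier output, one option is to normalise the whitespace afterwards. This is a small sketch on a shortened copy of the string above; the variable names are made up for illustration.

```python
# Shortened copy of the generated string, with its doubled spaces
generated = "I need to  detect  anomalies  or  outliers  in  my  irregular  time  series  data "

# split() with no arguments splits on any run of whitespace,
# so re-joining with single spaces normalises the string
cleaned = " ".join(generated.split())
print(cleaned)  # I need to detect anomalies or outliers in my irregular time series data
```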

Congratulations!!! You've just built your own GPT model from first principles 💪

Naturally there are __much__ more efficient ways to do what we've just done - you will almost never end up writing your own Transformer block, attention mechanism or even full GPT model from scratch. HuggingFace abstracts so much of the difficult coding away from us, which is why fine-tuning existing models is much more effective. 

The dataset we've been working with is very, very small and our tokenization has been very simplistic too: each word is currently assigned its own unique token and punctuation is kept in as well. If you'd like to try working with some messier data, you can tidy up your code into Python files and then repeat your steps on the real data found in these 10,000 StackOverflow answers [here](https://wagon-public-datasets.s3.amazonaws.com/answers.csv) to build a more complex generator.