<a href="https://colab.research.google.com/github/adammoss/MLiS2/blob/master/transformer_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

As an example application of a transformer, let's build a GPT-like language model that predicts the probability of a sentence of $\tau$ tokens,

$
P \left( \boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(\tau)} \right) = \prod_{t = 1}^{\tau} P \left( \boldsymbol{x}^{(t)} \mid \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t-1)} \right)
$

where $\boldsymbol{x}^{(t)}$ is a vector representing a token.
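
To make the factorization concrete, the sketch below multiplies a few made-up conditional probabilities for a three-token sentence and checks that summing their logarithms gives the same joint probability. The numbers are illustrative assumptions, not outputs of any model in this notebook.

```python
import math

# Hypothetical conditional probabilities P(x_t | x_1, ..., x_{t-1}) for a 3-token sentence.
conditionals = [0.2, 0.5, 0.1]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(conditionals)

# In practice log-probabilities are summed instead, which avoids numerical underflow.
log_joint = sum(math.log(p) for p in conditionals)

print(joint)                # joint probability of the sentence
print(math.exp(log_joint))  # same value, recovered from the log-domain sum
```

Training with cross-entropy on next-token targets minimizes the negative of this log-probability, averaged over positions.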

This notebook starts with a simple bigram model and progresses through self-attention and multi-head attention to a full GPT model. The models are trained on the 'tiny_shakespeare' dataset from TensorFlow Datasets using a custom training loop.

Key highlights include:
- Implementation of Bigram Models for understanding basic NLP concepts.
- Exploration of Constant Attention Scores to illustrate the concept of attention.
- Detailed implementation of Self Attention and Multi-head Attention mechanisms, foundational to Transformer models.
- Building a GPT-like model from scratch and sampling text from it.
- A custom training loop with early stopping, fed by `tf.data` input pipelines.




In [1]:
!pip install tiktoken
!pip install keras_nlp


In [2]:
import itertools
import operator
import numpy as np
import sys
from datetime import datetime
import os
import requests
from tqdm.notebook import trange, tqdm
import matplotlib.pyplot as plt
import time

In [3]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model
from tensorflow_probability import distributions as tfd
import tensorflow_datasets as tfds

import tiktoken

import keras_nlp

TensorFlow version: 2.15.0
Using TensorFlow backend


In [12]:
batch_size = 64
context_size = 256
tokenization = 'gpt2'
#docs = ['_chat.txt']  # alternative: a local text file
docs = ['tiny_shakespeare']
#docs = ['scientific_papers/arxiv']

In [13]:
train_text = ''
test_text = ''
for doc in docs:
  if doc == 'tiny_shakespeare':
    d = tfds.load(name=doc)['train']
    train_text += next(iter(d))['text'].numpy().decode("utf-8")
    d = tfds.load(name=doc)['test']
    test_text += next(iter(d))['text'].numpy().decode("utf-8")
  elif doc == 'scientific_papers/arxiv':
    d = tfds.load(name=doc)
  else:
    if not os.path.isfile(doc):
      from google.colab import files
      uploaded = files.upload()
    sentences = []
    with open(doc, 'r') as f:
      for x in f.readlines():
        if 'omitted' not in x:
          if len(x.split(']')) > 1:
            sentences.append(x.split(']')[1])
          else:
            sentences.append(x)
    text = ''.join(sentences)
    train_text += text[:int(0.8*len(text))]
    test_text += text[int(0.8*len(text)):]

In [14]:
print(train_text[:200])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


In [15]:
if tokenization == 'char':

  print(f'Length of train text: {len(train_text)} tokens')
  print(f'Length of test text: {len(test_text)} tokens')

  # The unique characters in the file
  vocab = sorted(set(train_text + ' ' + test_text))

  ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

  chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

  def text_from_ids(ids):
    return tf.strings.reduce_join(chars_from_ids(ids), axis=-1).numpy().decode('utf-8')

  ids_train_ds = tf.data.Dataset.from_tensor_slices(ids_from_chars(tf.strings.unicode_split(train_text, 'UTF-8')))
  ids_test_ds = tf.data.Dataset.from_tensor_slices(ids_from_chars(tf.strings.unicode_split(test_text, 'UTF-8')))

  # Length of the vocabulary in StringLookup Layer
  vocab_size = len(ids_from_chars.get_vocabulary())

elif 'gpt' in tokenization:

  enc = tiktoken.encoding_for_model(tokenization)
  vocab_size = enc.n_vocab

  def ids_from_chars(chars):
    return enc.encode_ordinary(chars)

  def text_from_ids(ids):
    return enc.decode(ids)

  train_tokens = enc.encode_ordinary(train_text)
  test_tokens = enc.encode_ordinary(test_text)

  print(f'Length of train text: {len(train_tokens)} tokens')
  print(f'Length of test text: {len(test_tokens)} tokens')

  ids_train_ds = tf.data.Dataset.from_tensor_slices(train_tokens)
  ids_test_ds = tf.data.Dataset.from_tensor_slices(test_tokens)

train_sequences = ids_train_ds.batch(context_size + 1, drop_remainder=True)
test_sequences = ids_test_ds.batch(context_size + 1, drop_remainder=True)

def split_input_target(sequence):
  input_text = sequence[:-1]
  target_text = sequence[1:]
  return input_text, target_text

train_ds = train_sequences.map(split_input_target)
test_ds = test_sequences.map(split_input_target)

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

train_ds = (
    train_ds
    .shuffle(BUFFER_SIZE)
    .batch(batch_size, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

test_ds = (
    test_ds
    .batch(batch_size, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

print(f'Vocab size: {vocab_size}')
# Loss of a model that predicts uniformly over the vocabulary: log(vocab_size)
print(f'Random loss: {np.log(vocab_size)}')


Length of train text: 301966 tokens
Length of test text: 17995 tokens
Vocab size: 50257
Random loss: 10.82490511970208
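
The "random loss" printed above is the cross-entropy of a model that assigns equal probability to every token, $-\log(1/V) = \log V$ for a vocabulary of size $V$. As a quick check, assuming the vocabulary size of 50257 reported above:

```python
import math

vocab_size = 50257  # value printed above for the GPT-2 tokenizer

# A uniform model gives every token probability 1 / vocab_size.
uniform_prob = 1.0 / vocab_size

# Cross-entropy for a single target token is -log(p_target).
uniform_loss = -math.log(uniform_prob)

print(uniform_loss)
```

A trained model should reach a test loss well below this baseline.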


In [16]:
x, labels = next(iter(train_ds))

In [17]:
print(x.numpy())
print(labels.numpy())

[[ 9706    11  8492 ...   198  1797  6242]
 [   39 11262 20754 ...   198 13482  1394]
 [  198    56 18213 ...    42 12599  1503]
 ...
 [13020   286   340 ... 16111   338  5474]
 [   64 25638    11 ...    11  3595  2933]
 [  314   351   345 ...  7604    13   198]]
[[   11  8492 22027 ...  1797  6242 23304]
 [11262 20754    25 ... 13482  1394   262]
 [   56 18213    11 ... 12599  1503 28893]
 ...
 [  286   340   262 ...   338  5474    11]
 [25638    11  1497 ...  3595  2933    26]
 [  351   345   612 ...    13   198   198]]
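
Each row of `labels` is the corresponding row of `x` shifted left by one position, so the model is asked to predict token $t+1$ at position $t$. A minimal sketch of the same split on a short, made-up list of token ids:

```python
# Hypothetical chunk of context_size + 1 = 5 token ids.
sequence = [10, 11, 12, 13, 14]

# Same slicing as split_input_target: drop the last token for the input,
# drop the first token for the target.
input_ids = sequence[:-1]
target_ids = sequence[1:]

for position, (inp, tgt) in enumerate(zip(input_ids, target_ids)):
    print(f"position {position}: input {inp} -> target {tgt}")
```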


In [46]:
def train(model, train_ds, test_ds, epochs=50, patience=3):

  loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
  optimizer = tf.keras.optimizers.Adam()

  train_loss = tf.keras.metrics.Mean(name='train_loss')
  train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

  test_loss = tf.keras.metrics.Mean(name='test_loss')
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='test_accuracy')

  @tf.function
  def train_step(x, targets):
    with tf.GradientTape() as tape:
      # training=True is only needed if there are layers with different
      # behavior during training versus inference (e.g. Dropout).
      predictions = model(x, training=True)
      B, T, C = predictions.shape
      predictions = tf.reshape(predictions, (B*T, C))
      targets = tf.reshape(targets, (B*T, 1))
      loss = loss_object(targets, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(targets, predictions)

  @tf.function
  def test_step(x, targets):
    # training=False is only needed if there are layers with different
    # behavior during training versus inference (e.g. Dropout).
    predictions = model(x, training=False)
    B, T, C = predictions.shape
    predictions = tf.reshape(predictions, (B*T, C))
    targets = tf.reshape(targets, (B*T, 1))
    t_loss = loss_object(targets, predictions)

    test_loss(t_loss)
    test_accuracy(targets, predictions)

  best_loss = np.inf
  best_epoch = 0

  for epoch in range(epochs):
    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()

    for x, targets in train_ds:
      train_step(x, targets)

    for x, targets in test_ds:
      test_step(x, targets)

    if test_loss.result().numpy() < best_loss:
      best_loss = test_loss.result().numpy()
      best_epoch = epoch

    print(
      f'Epoch {epoch + 1}, '
      f'Loss: {train_loss.result()}, '
      f'Accuracy: {train_accuracy.result() * 100}, '
      f'Test Loss: {test_loss.result()}, '
      f'Test Accuracy: {test_accuracy.result() * 100}'
    )

    if epoch > best_epoch + patience:
      break
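
The early-stopping rule at the end of the loop can be traced on a made-up sequence of test losses. This sketch reuses the same condition, `epoch > best_epoch + patience`; the helper name `stopping_epoch` and the loss values are illustrative assumptions.

```python
def stopping_epoch(test_losses, patience=3):
    """Return the 0-based epoch at which the loop above would break, or None."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(test_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        if epoch > best_epoch + patience:
            return epoch
    return None

# The best loss occurs at epoch 2, followed by four epochs without improvement.
losses = [6.8, 6.6, 6.5, 6.55, 6.52, 6.51, 6.53, 6.54]
print(stopping_epoch(losses))
```

Because the comparison is strict, training continues for `patience + 1` epochs without improvement before it stops.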

In [47]:
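def generate(model, num_tokens, start_token="A"):
  # Encode the prompt into token ids
  sentence = ids_from_chars(start_token)
  if not isinstance(sentence, list):
    sentence = [sentence]
  for i in trange(num_tokens):
    # Keep only the most recent context_size tokens as model input
    x = np.array([sentence[-context_size:]]).astype(np.int32)
    predictions = model(x, training=False)  # (1, T, vocab_size)
    # float64 keeps the probabilities summing to 1 within np.random.multinomial's tolerance
    predictions = tf.dtypes.cast(predictions, tf.float64)
    probs = tf.nn.softmax(predictions[:, -1, :])
    # Sample the next token from the distribution at the last position
    samp = np.random.multinomial(1, probs[0].numpy())
    sentence.append(samp.argmax(0))
  return text_from_ids(sentence)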
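
The sampling step inside `generate` turns the logits at the last position into a probability distribution and draws one token id from it. A minimal NumPy sketch of that single step, using made-up logits over a four-token vocabulary (the values and the seeded generator are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logits for a vocabulary of 4 tokens.
logits = np.array([2.0, 0.5, -1.0, 0.0])

# Numerically stable softmax: subtract the maximum before exponentiating.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Draw one token id according to these probabilities.
next_token = rng.choice(len(probs), p=probs)

print(probs)
print(next_token)
```

Sampling, rather than always taking the most likely token, is what makes repeated calls to `generate` produce different text.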

# Bigram Model


In [48]:
class Bigram(Model):
  def __init__(self):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, vocab_size)

  def call(self, x):
    x = self.embedding(x)
    return x

# Create an instance of the model
bigram_model = Bigram()

In [49]:
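# A (vocab_size x vocab_size) embedding table is only practical for small vocabularies (e.g. 'char')
if vocab_size < 10000: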
  text = generate(bigram_model, 100)
  print(text)
  train(bigram_model, train_ds, test_ds)
  text = generate(bigram_model, 100)
  print(text)

# Constant Attention Scores

In [50]:
tril = tf.linalg.band_part(tf.ones((4, 4)), -1, 0)
weights = tf.nn.softmax(tf.where(tril == 0, -1e9, 0), axis=-1) # (T, T)
print(weights)

tf.Tensor(
[[1.         0.         0.         0.        ]
 [0.5        0.5        0.         0.        ]
 [0.33333334 0.33333334 0.33333334 0.        ]
 [0.25       0.25       0.25       0.25      ]], shape=(4, 4), dtype=float32)
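
With constant scores, the masked softmax gives each position uniform weights over itself and all earlier positions, so multiplying by these weights computes a running mean over time. A small NumPy sketch of this equivalence, using made-up features:

```python
import numpy as np

T, C = 4, 2
x = np.arange(T * C, dtype=np.float64).reshape(T, C)  # hypothetical token features

# Lower-triangular mask: position t may attend to positions 0..t.
tril = np.tril(np.ones((T, T)))
scores = np.where(tril == 0, -1e9, 0.0)

# Row-wise softmax of constant scores gives uniform weights over the past.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

averaged = weights @ x  # (T, T) @ (T, C) -> (T, C)

# The same result as a cumulative mean over the time axis.
running_mean = np.cumsum(x, axis=0) / np.arange(1, T + 1)[:, None]
print(np.allclose(averaged, running_mean))
```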


In [51]:
class ConstantAttention(Model):
  def __init__(self):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, vocab_size)

  def call(self, x):
    x = self.embedding(x) # (B, T, vocab_size)
    B, T, C = x.shape
    tril = tf.linalg.band_part(tf.ones((T, T)), -1, 0)
    weights = tf.nn.softmax(tf.where(tril == 0, -1e9, 0), axis=-1) # (T, T)
    x = tf.linalg.matmul(weights, x)
    return x

# Create an instance of the model
constant_model = ConstantAttention()

In [52]:
if vocab_size < 10000:
  train(constant_model, train_ds, test_ds)
  text = generate(constant_model, 100)
  print(text)

# Self Attention


In [53]:
class DotProductAttention(tf.keras.layers.Layer):
    def __init__(self, head_size, dropout=0, causal=True, **kwargs):
      super().__init__(**kwargs)
      self.head_size = head_size
      self.dropout = dropout
      self.causal = causal
      self.key = tf.keras.layers.Dense(head_size, activation=None, use_bias=False, kernel_initializer=tf.random_normal_initializer(stddev=0.02))
      self.query = tf.keras.layers.Dense(head_size, activation=None, use_bias=False, kernel_initializer=tf.random_normal_initializer(stddev=0.02))
      self.value = tf.keras.layers.Dense(head_size, activation=None, use_bias=False, kernel_initializer=tf.random_normal_initializer(stddev=0.02))
      # Build the dropout layer once here rather than on every forward pass
      self.dropout_layer = tf.keras.layers.Dropout(dropout)

    def call(self, x):
      k = self.key(x)   # (B,T,head_size)
      q = self.query(x) # (B,T,head_size)
      v = self.value(x) # (B,T,head_size)
      scores = tf.linalg.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(self.head_size, tf.float32)) # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
      if self.causal:
        B, T, C = x.shape
        # Strictly upper-triangular mask: 1 where a query would look at a future key
        mask = tf.linalg.band_part(tf.ones((T, T)), 0, -1) - tf.linalg.band_part(tf.ones((T, T)), 0, 0)
        scores += -1e9 * mask
      weights = tf.nn.softmax(scores, axis=-1) # normalize over the key axis
      if self.dropout > 0:
        weights = self.dropout_layer(weights)
      return tf.linalg.matmul(weights, v) # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
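
The causal mask is what lets the model be trained on all positions of a sequence at once: the output at position $t$ must not depend on tokens after $t$. The NumPy sketch below re-implements a single unbatched head without dropout and checks this property by perturbing the last token. The function name `causal_attention` and the random weights are assumptions for illustration, not part of the layer above.

```python
import numpy as np

def causal_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a causal mask (no dropout)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # each (T, head_size)
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T)
    future = np.triu(np.ones_like(scores), k=1)    # 1 strictly above the diagonal
    scores = scores - 1e9 * future                 # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                             # (T, head_size)

rng = np.random.default_rng(0)
T, C, head_size = 5, 8, 4
x = rng.normal(size=(T, C))
w_q, w_k, w_v = (rng.normal(size=(C, head_size)) for _ in range(3))

out = causal_attention(x, w_q, w_k, w_v)

# Perturb only the last token: outputs at earlier positions should not change.
x_perturbed = x.copy()
x_perturbed[-1] += 1.0
out_perturbed = causal_attention(x_perturbed, w_q, w_k, w_v)
print(np.allclose(out[:-1], out_perturbed[:-1]))
```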

In [54]:
class AttentionV1(Model):
  def __init__(self, embed_size, dropout=0):
    super().__init__()
    self.position_embedding = tf.keras.layers.Embedding(context_size, embed_size,
                                                        embeddings_initializer=tf.random_normal_initializer(stddev=0.02))
    self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size,
                                               embeddings_initializer=tf.random_normal_initializer(stddev=0.02))
    self.attention = DotProductAttention(embed_size, dropout=dropout)
    self.top = tf.keras.layers.Dense(vocab_size, kernel_initializer=tf.random_normal_initializer(stddev=0.02))

  def call(self, x):
    B, T = x.shape
    x = self.position_embedding(tf.range(T)) + self.embedding(x) # (B, T, embed_size)
    x = self.attention(x)
    x = self.top(x)
    return x

# Create an instance of the model
attention_v1 = AttentionV1(384, dropout=0.1)

In [55]:
train(attention_v1, train_ds, test_ds)

Epoch 1, Loss: 10.614594459533691, Accuracy: 11.096529960632324, Test Loss: 9.75546646118164, Test Accuracy: 12.8662109375
Epoch 2, Loss: 7.568630695343018, Accuracy: 11.781820297241211, Test Loss: 6.9852070808410645, Test Accuracy: 12.8662109375
Epoch 3, Loss: 6.492565631866455, Accuracy: 11.771646499633789, Test Loss: 6.583081245422363, Test Accuracy: 12.8662109375
Epoch 4, Loss: 6.3737473487854, Accuracy: 11.75672721862793, Test Loss: 6.561119079589844, Test Accuracy: 12.8662109375
Epoch 5, Loss: 6.353347301483154, Accuracy: 11.76283073425293, Test Loss: 6.554330348968506, Test Accuracy: 12.8662109375
Epoch 6, Loss: 6.348663806915283, Accuracy: 11.77537727355957, Test Loss: 6.551943778991699, Test Accuracy: 12.8662109375
Epoch 7, Loss: 6.348412990570068, Accuracy: 11.761136054992676, Test Loss: 6.552857398986816, Test Accuracy: 12.8662109375
Epoch 8, Loss: 6.345945835113525, Accuracy: 11.766561508178711, Test Loss: 6.547239303588867, Test Accuracy: 12.8662109375
Epoch 9, Loss: 6.340

In [56]:
text = generate(attention_v1, 100)


In [57]:
print(text)

Aabases dry me runENS Lord ' marriage Tower think's

 that
.
 fromL!., theerFor receiving
 Hastings with they hath mount;TERTo current mine immediately
 blessThereb thee
 neighbour hast sight
 thisLookAnd thisThe prove'dES
 them goWhy heavy honour heUC stay
OWhat thee
.,
 wither pleasure;'
 Rivers? giving willingly your rivalThat areed
; in:
.
 which
;That



# Multi-head Attention

In [58]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads, head_size, embed_size, dropout=0, **kwargs):
      super().__init__(**kwargs)
      self.heads = [DotProductAttention(head_size, dropout=dropout) for _ in range(num_heads)]
      self.proj = tf.keras.layers.Dense(embed_size, kernel_initializer=tf.random_normal_initializer(stddev=0.02))

    def call(self, x, mask=None):
      x = tf.concat([h(x) for h in self.heads], axis=-1)
      x = self.proj(x)
      return x
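
The shapes involved in combining heads can be followed in a small NumPy sketch. Random arrays stand in for the per-head attention outputs, and the sizes are made up for illustration; only the concatenate-then-project pattern mirrors the layer above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, embed_size, num_heads = 4, 12, 3
head_size = embed_size // num_heads

# Stand-ins for the per-head attention outputs, each of shape (T, head_size).
head_outputs = [rng.normal(size=(T, head_size)) for _ in range(num_heads)]

# Concatenate along the feature axis: (T, num_heads * head_size).
concatenated = np.concatenate(head_outputs, axis=-1)

# Final linear projection back to embed_size (hypothetical weights and zero bias).
w_proj = rng.normal(scale=0.02, size=(num_heads * head_size, embed_size))
b_proj = np.zeros(embed_size)
projected = concatenated @ w_proj + b_proj

print(concatenated.shape, projected.shape)
```

Choosing `head_size = embed_size // num_heads` keeps the concatenated width equal to `embed_size`, so the cost is similar to one full-width head.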

In [59]:
class AttentionV2(Model):
  def __init__(self, num_heads, embed_size, dropout=0):
    super().__init__()
    self.position_embedding = tf.keras.layers.Embedding(context_size, embed_size,
                                                        embeddings_initializer=tf.random_normal_initializer(stddev=0.02))
    self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size,
                                               embeddings_initializer=tf.random_normal_initializer(stddev=0.02))
    head_size = embed_size // num_heads
    self.attention = MultiHeadAttention(num_heads, head_size, embed_size, dropout=dropout)
    self.top = tf.keras.layers.Dense(vocab_size, kernel_initializer=tf.random_normal_initializer(stddev=0.02))

  def call(self, x):
    B, T = x.shape
    x = self.position_embedding(tf.range(T)) + self.embedding(x) # (B, T, embed_size)
    # Causal masking is applied inside each DotProductAttention head
    x = self.attention(x)
    x = self.top(x)
    return x

# Create an instance of the model
attention_v2 = AttentionV2(6, 384, dropout=0.1)

In [60]:
train(attention_v2, train_ds, test_ds)

Epoch 1, Loss: 10.091297149658203, Accuracy: 11.096869468688965, Test Loss: 6.823359966278076, Test Accuracy: 12.8662109375
Epoch 2, Loss: 6.730208873748779, Accuracy: 7.262166500091553, Test Loss: 6.743961334228516, Test Accuracy: 5.419921875
Epoch 3, Loss: 6.424514293670654, Accuracy: 9.41908073425293, Test Loss: 6.578952789306641, Test Accuracy: 12.8662109375
Epoch 4, Loss: 6.359495639801025, Accuracy: 11.773343086242676, Test Loss: 6.554649353027344, Test Accuracy: 12.8662109375
Epoch 5, Loss: 6.3473639488220215, Accuracy: 11.77842903137207, Test Loss: 6.5458221435546875, Test Accuracy: 12.8662109375
Epoch 6, Loss: 6.347042083740234, Accuracy: 11.767239570617676, Test Loss: 6.5592451095581055, Test Accuracy: 12.8662109375
Epoch 7, Loss: 6.342369079589844, Accuracy: 11.7584228515625, Test Loss: 6.5506696701049805, Test Accuracy: 12.8662109375
Epoch 8, Loss: 6.329826354980469, Accuracy: 11.767916679382324, Test Loss: 6.509790897369385, Test Accuracy: 12.8662109375
Epoch 9, Loss: 6.23

In [61]:
text = generate(attention_v2, 100)


In [62]:
print(text)

AING sparks
!, set.
COMAD weATEndowing friends. in newsUCK lov the for haste' ear you
TES market youThe live a hold one,AN quite unt feel IThings together
 sham bean go IHe blood and::
 death
 all I to pr near
 marks would in
 I a unsett smooth, such.I, like?
 law him shall's begging the lib would timeThThird, art makeam, alas encounterNo from


# GPT Model

In [63]:
class FeedForward(tf.keras.layers.Layer):
    def __init__(self, embed_size, dropout=0, **kwargs):
      super().__init__(**kwargs)
      self.dropout = dropout
      self.ff1 = tf.keras.layers.Dense(4 * embed_size, activation='relu', kernel_initializer=tf.random_normal_initializer(stddev=0.02))
      self.ff2 = tf.keras.layers.Dense(embed_size, kernel_initializer=tf.random_normal_initializer(stddev=0.02))
      # Build the dropout layer once here rather than on every forward pass
      self.dropout_layer = tf.keras.layers.Dropout(dropout)

    def call(self, x):
      x = self.ff1(x) # expand to 4 * embed_size with ReLU
      x = self.ff2(x) # project back to embed_size
      if self.dropout > 0:
        x = self.dropout_layer(x)
      return x

In [64]:
class Block(tf.keras.layers.Layer):
    def __init__(self, num_heads, embed_size, dropout=0, **kwargs):
      super().__init__(**kwargs)
      head_size = embed_size // num_heads
      self.attention = MultiHeadAttention(num_heads, head_size, embed_size, dropout=dropout)
      self.ff = FeedForward(embed_size, dropout=dropout)
      self.ln1 = tf.keras.layers.LayerNormalization()
      self.ln2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
      x = x + self.attention(self.ln1(x))
      x = x + self.ff(self.ln2(x))
      return x
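
`Block.call` applies the same pattern twice: normalize, apply a sublayer, and add the result back to the input. A minimal NumPy sketch of that pattern follows, assuming unit scale and zero offset in the normalization (as at initialization) and an epsilon of 1e-3. A random matrix stands in for the attention or feed-forward sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalize over the last axis (scale 1 and offset 0, as at initialization)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_residual(x, sublayer):
    """x + sublayer(layer_norm(x)): the pattern used twice in Block.call."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
T, C = 4, 8
x = rng.normal(size=(T, C))
w = rng.normal(scale=0.02, size=(C, C))  # hypothetical stand-in for a sublayer

y = pre_ln_residual(x, lambda h: h @ w)
print(y.shape)
```

Because the input is added back unchanged, each block only has to learn a correction to its input, which tends to make deep stacks easier to train.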

In [65]:
class GPT(Model):
  def __init__(self, num_heads, embed_size, num_layers, dropout=0):
    super().__init__()
    self.position_embedding = tf.keras.layers.Embedding(context_size, embed_size,
                                                        embeddings_initializer=tf.random_normal_initializer(stddev=0.02))
    self.embedding = tf.keras.layers.Embedding(vocab_size, embed_size,
                                               embeddings_initializer=tf.random_normal_initializer(stddev=0.02))
    self.blocks = tf.keras.Sequential([Block(num_heads, embed_size, dropout=dropout) for _ in range(num_layers)])
    self.top = tf.keras.layers.Dense(vocab_size, kernel_initializer=tf.random_normal_initializer(stddev=0.02))
    self.ln = tf.keras.layers.LayerNormalization()

  def call(self, x):
    B, T = x.shape
    x = self.position_embedding(tf.range(T)) + self.embedding(x) # (B, T, embed_size)
    x = self.blocks(x)
    x = self.ln(x) #(B, T, embed_size)
    x = self.top(x) #(B, T, vocab_size)
    return x

# Create an instance of the model
gpt_nano = GPT(6, 384, 6, dropout=0.2)

In [None]:
train(gpt_nano, train_ds, test_ds)

Epoch 1, Loss: 7.700555801391602, Accuracy: 11.250814437866211, Test Loss: 6.519326686859131, Test Accuracy: 12.8662109375
Epoch 2, Loss: 6.244474411010742, Accuracy: 14.95530891418457, Test Loss: 6.281233787536621, Test Accuracy: 16.131591796875
Epoch 3, Loss: 6.021170616149902, Accuracy: 15.551079750061035, Test Loss: 6.168852806091309, Test Accuracy: 16.0888671875
Epoch 4, Loss: 5.91624116897583, Accuracy: 15.58363151550293, Test Loss: 6.134458541870117, Test Accuracy: 16.094970703125
Epoch 5, Loss: 5.786737442016602, Accuracy: 15.686713218688965, Test Loss: 6.025511741638184, Test Accuracy: 16.650390625
Epoch 6, Loss: 5.596808910369873, Accuracy: 17.22276496887207, Test Loss: 5.884128570556641, Test Accuracy: 18.145751953125
Epoch 7, Loss: 5.39631462097168, Accuracy: 18.26137924194336, Test Loss: 5.802062511444092, Test Accuracy: 18.255615234375
Epoch 8, Loss: 5.24469518661499, Accuracy: 18.950397491455078, Test Loss: 5.662423610687256, Test Accuracy: 19.158935546875
Epoch 9, Loss:

In [71]:
text = generate(gpt_nano, 100)


In [72]:
print(text)

Aaid seem like acknowledge, on as a breast,
To venom of this live,Iron with glorious Eden,
Being good of triumph supp debregnine like foremostadoes in the crown
settled than injustice seated rivers: you did bow no made his majesty,
With audience of
Stillascels of a base times
Repodes, the spring of grave and a largear
HERMakes himself carriageed wake in heart
Or then, their pleasures of light
Thy, like


In [69]:
gpt_nano.summary()

Model: "gpt_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_14 (Embedding)    multiple                  98304     
                                                                 
 embedding_15 (Embedding)    multiple                  19298688  
                                                                 
 sequential_1 (Sequential)   (64, 256, 384)            10639872  
                                                                 
 dense_301 (Dense)           multiple                  19348945  
                                                                 
 layer_normalization_25 (La  multiple                  768       
 yerNormalization)                                               
                                                                 
Total params: 49386577 (188.39 MB)
Trainable params: 49386577 (188.39 MB)
Non-trainable params: 0 (0.00 Byte)
_________________
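
The parameter counts in the summary can be reproduced by hand from the layer definitions. The sketch below assumes the settings used for `gpt_nano` (6 heads, embedding size 384, 6 layers, context size 256, GPT-2 vocabulary of 50257):

```python
vocab_size, context_size = 50257, 256
embed_size, num_heads, num_layers = 384, 6, 6
head_size = embed_size // num_heads

position_embedding = context_size * embed_size
token_embedding = vocab_size * embed_size

# One Block: bias-free key, query and value projections per head, the output
# projection, two feed-forward layers, and two LayerNormalization layers.
attention_heads = num_heads * 3 * embed_size * head_size
projection = num_heads * head_size * embed_size + embed_size
feed_forward = (embed_size * 4 * embed_size + 4 * embed_size) \
             + (4 * embed_size * embed_size + embed_size)
layer_norms = 2 * (2 * embed_size)  # gamma and beta for each of the two norms
block = attention_heads + projection + feed_forward + layer_norms

output_head = embed_size * vocab_size + vocab_size
final_layer_norm = 2 * embed_size

total = (position_embedding + token_embedding + num_layers * block
         + output_head + final_layer_norm)
print(total)
```

Roughly 78% of the parameters sit in the token embedding and the output layer, a consequence of pairing a large vocabulary with a small transformer stack.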

In [70]:
# GPT-small
#gpt3_small = GPT(12, 768, 12, dropout=0.2)