# 🚀 Transformer

## Table of contents
0. [Parameters](#parameters)
1. [Load the Data](#load)
2. [Tokenize the Data](#tokenize)
3. [Create the Training Set](#create)
5. [Create the Causal Attention Mask function](#causal)
6. [Create a Transformer Block Layer](#transformer)
7. [Create the Token and Position Embedding](#embedder)
8. [Build the Transformer Model](#transformer_decoder)
9. [Train the Transformer](#train)

In [81]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [82]:
import numpy as np
import matplotlib.pyplot as plt
import os
import json
from pprint import pprint
import random
import re
import string
from IPython.display import display, HTML

import tensorflow as tf
import tensorflow.keras as keras

## 0. Parameters <a name="parameters"></a>

In [83]:
VOCAB_SIZE = 10000
MAX_LEN = 80
EMBEDDING_DIM = 256
KEY_DIM = 256
N_HEADS = 2
FEED_FORWARD_DIM = 256
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = True
BATCH_SIZE = 32
EPOCHS = 5

## 1. Load the data <a name="load"></a>

In [84]:
# Load the full dataset
with open('/app/data/wine-reviews/winemag-data-130k-v2.json') as json_data:
    wine_data = json.load(json_data)
    

In [85]:
wine_data[10]

{'points': '87',
 'title': 'Kirkland Signature 2011 Mountain Cuvée Cabernet Sauvignon (Napa Valley)',
 'description': 'Soft, supple plum envelopes an oaky structure in this Cabernet, supported by 15% Merlot. Coffee and chocolate complete the picture, finishing strong at the end, resulting in a value-priced wine of attractive flavor and immediate accessibility.',
 'taster_name': 'Virginie Boone',
 'taster_twitter_handle': '@vboone',
 'price': 19,
 'designation': 'Mountain Cuvée',
 'variety': 'Cabernet Sauvignon',
 'region_1': 'Napa Valley',
 'region_2': 'Napa',
 'province': 'California',
 'country': 'US',
 'winery': 'Kirkland Signature'}

In [86]:
# Filter the dataset
filtered_data = [
    "wine review : "
    + x["country"]
    + " : "
    + x["province"]
    + " : "
    + x["variety"]
    + " : "
    + x["description"]
    for x in wine_data
    if x["country"] is not None
    and x["province"] is not None
    and x["variety"] is not None
    and x["description"] is not None
]

In [87]:
# Count the wine reviews
n_wines = len(filtered_data)
print(f'{n_wines} wine reviews loaded')

129907 wine reviews loaded


In [88]:
example = filtered_data[25]
print(example)

wine review : US : California : Pinot Noir : Oak and earth intermingle around robust aromas of wet forest floor in this vineyard-designated Pinot that hails from a high-elevation site. Small in production, it offers intense, full-bodied raspberry and blackberry steeped in smoky spice and smooth texture.


## 2. Tokenize the data <a name="tokenize"></a>

In [89]:
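# Pad punctuation marks and newlines with spaces so each is treated as a separate 'word'
def pad_punctuation(s):
    # re.escape keeps characters such as ']' and '\' literal inside the character class
    s = re.sub(f"([{re.escape(string.punctuation)}\n])", r" \1 ", s)
    # Collapse the runs of spaces introduced by the padding
    s = re.sub(" +", " ", s)
    return s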

text_data = [pad_punctuation(x) for x in filtered_data]
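
As a quick check of the padding step, the sketch below applies the same two substitutions to a short string. `pad_punctuation_demo` and the sample string are hypothetical and only illustrate the behaviour; they are not part of the notebook's pipeline.

```python
import re
import string


def pad_punctuation_demo(s):
    # Hypothetical stand-in for pad_punctuation, defined here so the example runs on its own
    s = re.sub(f"([{re.escape(string.punctuation)}])", r" \1 ", s)
    return re.sub(" +", " ", s)


# Hyphens, commas and full stops each become separate space-delimited tokens
print(repr(pad_punctuation_demo("full-bodied, dry.")))
# prints 'full - bodied , dry . '
```

Note the trailing space after the final full stop, which also appears in the padded review shown in the next cell.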

In [90]:
# Display an example of a padded wine review
example_data = text_data[25]
example_data

'wine review : US : California : Pinot Noir : Oak and earth intermingle around robust aromas of wet forest floor in this vineyard - designated Pinot that hails from a high - elevation site . Small in production , it offers intense , full - bodied raspberry and blackberry steeped in smoky spice and smooth texture . '

In [91]:
# Convert to a TensorFlow Dataset
text_ds = tf.data.Dataset.from_tensor_slices(text_data).batch(BATCH_SIZE).shuffle(1000)

In [92]:
# Create a vectorisation layer
vectorize_layer = keras.layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [93]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [94]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f'{i}: {word}')

0: 
1: [UNK]
2: :
3: ,
4: .
5: and
6: the
7: wine
8: a
9: of


In [95]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())

[   7   10    2   20    2   29    2   43   62    2   55    5  243 4145
  453  634   26    9  497  499  667   17   12  142   14 2214   43   25
 2484   32    8  223   14 2213  948    4  594   17  987    3   15   75
  237    3   64   14   82   97    5   74 2633   17  198   49    5  125
   77    4    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0]
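
To read a tokenised sequence back as text, each id can be looked up in the vocabulary list. The sketch below assumes only the first ten vocabulary entries printed earlier, where index 0 is the padding token and index 1 is `[UNK]`; the `decode` helper and the id sequence are hypothetical.

```python
# First ten vocabulary entries, copied from the token:word listing above
vocab_head = ["", "[UNK]", ":", ",", ".", "and", "the", "wine", "a", "of"]


def decode(token_ids, index_to_word):
    """Map ids back to words, dropping the padding id 0 (hypothetical helper)."""
    return " ".join(index_to_word[i] for i in token_ids if i != 0)


# Hypothetical id sequence restricted to the ten entries listed above
token_ids = [6, 7, 2, 8, 7, 9, 6, 7, 4, 0, 0, 0]
print(decode(token_ids, vocab_head))
# prints: the wine : a wine of the wine .
```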


## 3. Create the Training Set <a name="create"></a>

In [96]:
# Create the training set: each wine review paired with the same sequence shifted by one token
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

train_ds = text_ds.map(prepare_inputs)
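
The shift can be illustrated on a plain Python list. This is a minimal sketch with a hypothetical seven-token sequence; in the notebook the same slicing is applied to batches of `MAX_LEN + 1` tokens, so both `x` and `y` end up with length `MAX_LEN`.

```python
# Hypothetical tokenised review: five real tokens followed by padding (0)
tokens = [7, 10, 2, 20, 4, 0, 0]

x = tokens[:-1]  # model input: every token except the last
y = tokens[1:]   # target: every token except the first

# At each position the target is simply the next input token
for position, (seen, target) in enumerate(zip(x, y)):
    print(f"position {position}: input {seen} -> target {target}")
```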

In [97]:
example_input_output = train_ds.take(1).get_single_element()

In [98]:
# Example input
example_input_output[0][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([   7,   10,    2,   20,    2,   29,    2,  231,    2,    8,   89,
        323, 2483,    9,   58, 1377,  785,    3,   12,   13,   64,   82,
          3,   57,   17,  261,    5, 1283,    4,   15,   83,   38,   16,
          9,  383,    3,  414,    3,  333,    5,  295,    3,  728,  143,
        102,   34,    3,    5,  512,   11,  182,    9,  357,    4,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])>

In [99]:
# Example output (shifted by one token)
example_input_output[1][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([  10,    2,   20,    2,   29,    2,  231,    2,    8,   89,  323,
       2483,    9,   58, 1377,  785,    3,   12,   13,   64,   82,    3,
         57,   17,  261,    5, 1283,    4,   15,   83,   38,   16,    9,
        383,    3,  414,    3,  333,    5,  295,    3,  728,  143,  102,
         34,    3,    5,  512,   11,  182,    9,  357,    4,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])>

## 5. Create the causal attention mask function <a name="causal"></a>

In [100]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)

# Transposed for display: column i marks the positions that token i may attend to
np.transpose(causal_attention_mask(1, 10, 10, dtype=tf.int32)[0])

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int32)
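
When `n_dest` equals `n_src`, the condition `i >= j - n_src + n_dest` reduces to `i >= j`, so the mask is lower triangular before it is transposed. The sketch below rebuilds that pattern with plain lists; `causal_mask_rows` is a hypothetical helper, not the function used by the model.

```python
def causal_mask_rows(n):
    """Row i holds 1 where query position i may attend to key position j (j <= i)."""
    return [[1 if i >= j else 0 for j in range(n)] for i in range(n)]


mask = causal_mask_rows(4)
for row in mask:
    print(row)

# Transposing gives the upper-triangular view displayed above
transposed = [list(col) for col in zip(*mask)]
```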

## 6. Create a Transformer Block layer <a name="transformer"></a>

In [101]:
class TransformerBlock(keras.layers.Layer):
    def __init__(self, num_heads, key_dim, embed_dim, ff_dim, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.embed_dim = embed_dim
        self.ff_dim = ff_dim
        self.dropout_rate = dropout_rate
        self.attn = keras.layers.MultiHeadAttention(num_heads, key_dim, output_shape = embed_dim)
        self.dropout_1 = keras.layers.Dropout(self.dropout_rate)
        self.ln_1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.ffn_1 = keras.layers.Dense(self.ff_dim, activation="relu")
        self.ffn_2 = keras.layers.Dense(self.embed_dim)
        self.dropout_2 = keras.layers.Dropout(self.dropout_rate)
        self.ln_2 = keras.layers.LayerNormalization(epsilon=1e-6)
        
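    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        # Mask future positions so each token attends only to itself and earlier tokens
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        # Self-attention: the same tensor is used as query, key and value
        attention_output, attention_scores = self.attn(
            inputs,
            inputs,
            attention_mask=causal_mask,
            return_attention_scores=True,
        )
        attention_output = self.dropout_1(attention_output)
        # Residual connection followed by layer normalization
        out1 = self.ln_1(inputs + attention_output)
        # Position-wise feed-forward network
        ffn_1 = self.ffn_1(out1)
        ffn_2 = self.ffn_2(ffn_1)
        ffn_output = self.dropout_2(ffn_2)
        # Second residual connection; the attention scores are returned for inspection
        return (self.ln_2(out1 + ffn_output), attention_scores)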
    
    def get_config(self):
        config = super().get_config()
        config.update({
            "key_dim": self.key_dim,
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "ff_dim": self.ff_dim,
            "dropout_rate": self.dropout_rate
        })
        return config
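
To make the masked attention step concrete, here is a minimal single-head sketch in plain Python. It is a simplification under stated assumptions: it omits the learned query, key, value and output projections, the multiple heads, and the dropout that `keras.layers.MultiHeadAttention` applies, and `causal_self_attention` is a hypothetical name.

```python
import math


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask (illustrative only)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Position i may only attend to positions 0..i
        scores = [
            sum(qa * ka for qa, ka in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        # Masked (future) positions receive zero weight
        weights = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        all_weights.append(weights)
        # Weighted sum of the value vectors
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
    return outputs, all_weights


# Hypothetical 3-token sequence with 2-dimensional embeddings
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and is zero to the right of the diagonal, which is the pattern the causal mask enforces inside the block.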

## 7. Create the Token and Position Embedding <a name="embedder"></a>

In [102]:
class TokenAndPositionEmbedding(keras.layers.Layer):
    def __init__(self, max_len, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.max_len = max_len
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.token_emb = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = keras.layers.Embedding(input_dim=max_len, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
    
    def get_config(self):
        config = super().get_config()
        config.update({
            "max_len": self.max_len,
            "vocab_size": self.vocab_size,
            "embed_dim": self.embed_dim,
        })
        return config

## 8. Build the Transformer model <a name="transformer_decoder"></a>

In [103]:
inputs = keras.layers.Input(shape=(None,), dtype=tf.int32)
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x, attention_scores = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
outputs = keras.layers.Dense(VOCAB_SIZE, activation = 'softmax')(x)
model = keras.Model(inputs=inputs, outputs=[outputs, attention_scores])
model.compile("adam", loss=[keras.losses.SparseCategoricalCrossentropy(), None])

In [104]:
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddin  (None, None, 256)        2580480   
 g_1 (TokenAndPositionEmbedd                                     
 ing)                                                            
                                                                 
 transformer_block_1 (Transf  ((None, None, 256),      658688    
 ormerBlock)                  (None, 2, None, None))             
                                                                 
 dense_5 (Dense)             (None, None, 10000)       2570000   
                                                                 
Total params: 5,809,168
Trainable params: 5,809,168
Non-trainable params: 0
_________________________________________________
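
The parameter counts in the summary can be reproduced by hand from the values in the Parameters cell. The sketch below is a plain arithmetic check, assuming the value projection has the same per-head size as the key projection (the Keras default when `value_dim` is not set); `dense_params` is a hypothetical helper.

```python
# Values copied from the Parameters cell of this notebook
VOCAB_SIZE, MAX_LEN = 10000, 80
EMBEDDING_DIM, KEY_DIM, N_HEADS, FEED_FORWARD_DIM = 256, 256, 2, 256


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (hypothetical helper)."""
    return n_in * n_out + n_out


# Token table plus position table (embeddings have no bias)
embedding = VOCAB_SIZE * EMBEDDING_DIM + MAX_LEN * EMBEDDING_DIM

# Query, key and value projections each map EMBEDDING_DIM -> N_HEADS * KEY_DIM;
# the output projection maps N_HEADS * KEY_DIM -> EMBEDDING_DIM
attention = 3 * dense_params(EMBEDDING_DIM, N_HEADS * KEY_DIM) + dense_params(
    N_HEADS * KEY_DIM, EMBEDDING_DIM
)
layer_norms = 2 * (2 * EMBEDDING_DIM)  # gamma and beta for each of the two layers
feed_forward = dense_params(EMBEDDING_DIM, FEED_FORWARD_DIM) + dense_params(
    FEED_FORWARD_DIM, EMBEDDING_DIM
)
transformer_block = attention + layer_norms + feed_forward

output_dense = dense_params(EMBEDDING_DIM, VOCAB_SIZE)
total = embedding + transformer_block + output_dense

print(embedding, transformer_block, output_dense, total)
# prints: 2580480 658688 2570000 5809168
```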

In [105]:
if LOAD_MODEL:
    # model.load_weights('./models/model')
    model = keras.models.load_model('./models/model', compile=True)

## 9. Train the Transformer <a name="train"></a>

In [106]:
# Create a TextGenerator callback
class TextGenerator(keras.callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        self.index_to_word = index_to_word
        self.word_to_index = {word: index for index, word in enumerate(index_to_word)}

    def sample_from(self, probs, temperature):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs 
    
    def generate(self, start_prompt, max_tokens, temperature):
        start_tokens = [self.word_to_index.get(x, 1) for x in start_prompt.split()]
        sample_token = None
        info = []
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y, att = self.model.predict(x, verbose = 0)
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({'prompt': start_prompt , 'word_probs': probs, 'atts': att[0,:,-1,:]})
            start_tokens.append(sample_token)
            start_prompt = start_prompt + ' ' + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info
        
    def on_epoch_end(self, epoch, logs=None):
        self.generate("wine review", max_tokens = 100, temperature = 1.0)
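
The effect of `temperature` in `sample_from` can be seen on a small distribution. This is a minimal sketch in plain Python with a hypothetical three-word distribution; `apply_temperature` mirrors only the rescaling step, not the random draw.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to the power 1 / temperature, then renormalise."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical next-token distribution over three candidates
probs = [0.5, 0.3, 0.2]

for temperature in (1.0, 0.5, 2.0):
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature below 1 concentrates probability on the most likely token, so the output is more deterministic; a temperature above 1 flattens the distribution and makes sampling more varied.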
        

In [107]:
# Create a model save checkpoint
model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath="./checkpoint/checkpoint.ckpt",
    save_weights_only=True,
    save_freq="epoch",
    verbose=0,
)

tensorboard_callback = keras.callbacks.TensorBoard(log_dir="./logs")

# Create the text generation callback from the vocabulary
text_generator = TextGenerator(vocab)

In [108]:
model.fit(
    train_ds, 
    epochs=EPOCHS, 
    # steps_per_epoch = 3,
    callbacks = [model_checkpoint_callback, tensorboard_callback, text_generator]
)

Epoch 1/5

KeyboardInterrupt: 

In [118]:
# Save the final model
model.save("./models/model")

2022-07-21 12:08:08.335399: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: ./models/model/assets




## 10. Generate text using the Transformer <a name="generate"></a>

In [109]:
def print_probs(info, vocab, top_k=5):
    for step in info:
        # Average the attention over heads, then scale so the strongest token is fully opaque
        mean_atts = np.mean(step['atts'], axis=0)
        max_att = np.max(mean_atts)
        highlighted_text = []
        for word, att_score in zip(step['prompt'].split(), mean_atts):
            highlighted_text.append(
                '<span style="background-color:rgba(135,206,250,'
                + str(att_score / max_att)
                + ');">' + word + '</span>'
            )
        highlighted_text = ' '.join(highlighted_text)
        display(HTML(highlighted_text))

        # Print the top_k most likely next words with their probabilities
        word_probs = step['word_probs']
        top_indices = np.argsort(word_probs)[::-1][:top_k]
        for idx in top_indices:
            print(f'{vocab[idx]}:   \t{np.round(100 * word_probs[idx], 2)}%')
        print('--------\n')
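
The highlight opacity used for each prompt word comes from averaging the attention weights over heads and dividing by the largest average. The sketch below reproduces that calculation with plain lists; the attention values and the three-word prompt are hypothetical.

```python
# Hypothetical attention weights for the last position: 2 heads x 3 prompt tokens
atts = [
    [0.1, 0.3, 0.6],
    [0.2, 0.2, 0.6],
]

# Average over heads, giving one value per prompt token
mean_atts = [sum(head[t] for head in atts) / len(atts) for t in range(len(atts[0]))]

# Scale so the most-attended token has opacity 1.0
max_att = max(mean_atts)
alphas = [a / max_att for a in mean_atts]

for word, alpha in zip(["wine", "review", ":"], alphas):
    print(f"{word}: alpha = {alpha:.2f}")
```

Because the scaling is relative to the maximum within each step, the opacities show which prompt words mattered most for that prediction, not absolute attention values.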

In [110]:
info = text_generator.generate("wine review : us", max_tokens = 80, temperature = 1.0)


generated text:
wine review : us : california : chardonnay : buttery and creamy on the finish , this is a study whose crisp , green apple and both speak of [UNK] . pretty pineapple provides a kick to preserved its powerful creaminess , it offers delicious apple and pear aromas , finishing in tangy acids and just - dry seasoning the fruit flavors . 



In [111]:
info = text_generator.generate("wine review : italy", max_tokens = 80, temperature = 0.5)


generated text:
wine review : italy : tuscany : sangiovese : aromas of red berry , leather and a whiff of leather lead the nose . the palate offers ripe strawberry , red cherry , clove , cinnamon , clove and a hint of baking spice alongside solid tannins . it lacks fruit richness and complexity . 



In [112]:
info = text_generator.generate("wine review : germany", max_tokens = 80, temperature = 0.5)
print_probs(info, vocab)


generated text:
wine review : germany : rheingau : riesling : this is a ripe , full - bodied riesling with a touch of residual sugar . it ' s a slightly sweet , fresh , fruity style , with a zesty , minerally finish that lingers long on the finish . 



::   	100.0%
grosso:   	0.0%
zealand:   	0.0%
-:   	0.0%
africa:   	0.0%
--------



pfalz:   	51.53%
mosel:   	41.21%
rheingau:   	4.27%
rheinhessen:   	2.16%
franken:   	0.44%
--------



::   	100.0%
valley:   	0.0%
grosso:   	0.0%
-:   	0.0%
river:   	0.0%
--------



riesling:   	99.74%
pinot:   	0.14%
gewürztraminer:   	0.06%
cabernet:   	0.01%
grüner:   	0.01%
--------



::   	100.0%
grosso:   	0.0%
-:   	0.0%
blanc:   	0.0%
ottonel:   	0.0%
--------



a:   	26.57%
this:   	16.63%
whiffs:   	9.85%
fresh:   	9.14%
while:   	5.04%
--------



is:   	50.62%
dry:   	12.26%
wine:   	11.15%
intensely:   	3.13%
off:   	3.0%
--------



a:   	95.47%
an:   	3.62%
the:   	0.2%
one:   	0.16%
surprisingly:   	0.07%
--------



bit:   	8.36%
fresh:   	7.22%
stunning:   	7.15%
big:   	6.85%
gorgeously:   	6.59%
--------



,:   	98.73%
and:   	0.7%
wine:   	0.29%
yet:   	0.2%
but:   	0.05%
--------



sunny:   	36.96%
fruity:   	19.77%
full:   	12.19%
juicy:   	6.0%
round:   	3.8%
--------



-:   	99.87%
bodied:   	0.08%
of:   	0.04%
and:   	0.0%
,:   	0.0%
--------



bodied:   	99.74%
flavored:   	0.13%
fruited:   	0.08%
throttle:   	0.04%
force:   	0.0%
--------



riesling:   	46.56%
,:   	27.78%
wine:   	16.88%
and:   	4.58%
yet:   	1.33%
--------



with:   	62.96%
that:   	19.15%
,:   	12.7%
.:   	3.6%
from:   	1.33%
--------



a:   	66.7%
notes:   	7.77%
hints:   	7.62%
aromas:   	6.85%
swathes:   	1.78%
--------



touch:   	58.21%
hint:   	16.75%
nose:   	11.43%
whiff:   	3.34%
crush:   	0.64%
--------



of:   	100.0%
that:   	0.0%
to:   	0.0%
on:   	0.0%
reminiscent:   	0.0%
--------



petrol:   	42.37%
honey:   	33.73%
sweet:   	3.49%
minerality:   	3.46%
residual:   	3.46%
--------



sugar:   	99.99%
sweetness:   	0.01%
carbon:   	0.0%
sweet:   	0.0%
[UNK]:   	0.0%
--------



.:   	69.63%
,:   	27.85%
in:   	1.07%
and:   	0.26%
to:   	0.26%
--------



it:   	94.24%
the:   	2.21%
off:   	1.06%
there:   	0.42%
while:   	0.24%
--------



':   	99.98%
has:   	0.01%
is:   	0.01%
finishes:   	0.01%
balances:   	0.0%
--------



s:   	100.0%
ll:   	0.0%
11:   	0.0%
t:   	0.0%
[UNK]:   	0.0%
--------



a:   	53.5%
full:   	3.84%
rich:   	3.29%
intensely:   	3.15%
lusciously:   	3.04%
--------



bit:   	58.58%
refreshingly:   	8.02%
rich:   	4.55%
touch:   	3.14%
tad:   	2.9%
--------



sweet:   	94.23%
oily:   	1.25%
viscous:   	1.09%
bitter:   	0.88%
honeyed:   	0.66%
--------



wine:   	51.09%
,:   	40.57%
riesling:   	6.22%
style:   	0.67%
-:   	0.58%
--------



honeyed:   	28.64%
sunny:   	19.73%
unctuous:   	7.47%
fruity:   	6.49%
sweet:   	4.59%
--------



,:   	32.57%
wine:   	31.07%
and:   	19.55%
-:   	8.81%
riesling:   	2.42%
--------



fruity:   	56.42%
sunny:   	14.27%
juicy:   	5.63%
crisp:   	4.16%
floral:   	2.94%
--------



wine:   	76.16%
palate:   	8.65%
and:   	7.04%
style:   	5.37%
finish:   	0.77%
--------



with:   	45.8%
of:   	29.15%
,:   	18.45%
that:   	4.58%
and:   	1.54%
--------



with:   	87.78%
it:   	9.13%
but:   	2.38%
and:   	0.22%
the:   	0.16%
--------



a:   	83.73%
hints:   	5.05%
lime:   	1.27%
flavors:   	1.02%
lemon:   	0.87%
--------



touch:   	61.72%
hint:   	15.19%
honeyed:   	3.68%
lingering:   	2.79%
delicate:   	2.43%
--------



,:   	72.54%
finish:   	9.37%
lime:   	9.02%
streak:   	1.68%
mouthfeel:   	1.52%
--------



sunny:   	14.49%
zesty:   	11.89%
steely:   	11.77%
minerally:   	8.03%
mineral:   	7.28%
--------



finish:   	81.6%
tone:   	8.24%
,:   	3.68%
mouthfeel:   	1.02%
edge:   	0.89%
--------



.:   	96.96%
that:   	2.61%
and:   	0.18%
,:   	0.12%
marked:   	0.05%
--------



':   	64.65%
lingers:   	26.49%
is:   	4.46%
should:   	1.07%
adds:   	0.37%
--------



long:   	43.77%
on:   	40.79%
.:   	3.67%
through:   	3.4%
nicely:   	3.03%
--------



.:   	58.71%
on:   	33.17%
,:   	2.79%
and:   	2.38%
into:   	1.71%
--------



the:   	99.71%
a:   	0.27%
.:   	0.01%
an:   	0.0%
its:   	0.0%
--------



finish:   	93.32%
palate:   	6.24%
long:   	0.26%
nose:   	0.1%
sip:   	0.02%
--------



.:   	99.96%
,:   	0.03%
of:   	0.01%
that:   	0.0%
with:   	0.0%
--------



:   	94.62%
drink:   	5.2%
it:   	0.08%
the:   	0.03%
this:   	0.02%
--------

