In [1]:
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')

# Table of Contents

## 1. Multi-Head Attention

The following is the architecture of the MultiHeadAttention layer; we can use it as a reference when implementing the multi-head layer:

![multihead_attention](../images/multi-head_attention.png)

In plain English, this is what the multi-head layer involves:

1. Configure the number of heads you need (a hyperparameter).
2. The inputs to the MH layer are three sequences of word vectors (Query, Key, Value), and it outputs context-aware vectors.
3. The inputs are passed to each attention head, which has three learnable Dense layers (one each for the query, key, and value).
4. Finally, the outputs of all heads are concatenated and passed through a final linear projection to produce the output.

### Keras has an implementation of the multi-head layer:
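
The four steps above can be sketched in plain NumPy. This is a minimal illustration rather than the Keras implementation: biases are omitted, the weights are random, and the names `multi_head_attention`, `params`, `Wq`, `Wk`, `Wv`, and `Wo` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key, value, params, num_heads):
    """query, key, value: arrays of shape (seq_len, d_model)."""
    seq_len = query.shape[0]
    head_dim = params["Wq"].shape[1] // num_heads

    def project_and_split(x, W):
        # (seq, d_model) @ (d_model, heads * head_dim) -> (heads, seq, head_dim)
        return (x @ W).reshape(x.shape[0], num_heads, head_dim).transpose(1, 0, 2)

    q = project_and_split(query, params["Wq"])
    k = project_and_split(key, params["Wk"])
    v = project_and_split(value, params["Wv"])

    # Scaled dot-product attention, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq_q, seq_k)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                                 # (heads, seq_q, head_dim)

    # Concatenate the heads and project back to d_model.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, num_heads * head_dim)
    return concat @ params["Wo"], weights

rng = np.random.default_rng(0)
d_model, num_heads, head_dim, seq_len = 16, 2, 8, 5
params = {
    "Wq": rng.normal(size=(d_model, num_heads * head_dim)),
    "Wk": rng.normal(size=(d_model, num_heads * head_dim)),
    "Wv": rng.normal(size=(d_model, num_heads * head_dim)),
    "Wo": rng.normal(size=(num_heads * head_dim, d_model)),
}
x = rng.normal(size=(seq_len, d_model))

# Self-attention: the same sequence plays the role of query, key and value.
out, attn = multi_head_attention(x, x, x, params, num_heads)
print(out.shape)   # (5, 16)
print(attn.shape)  # (2, 5, 5)
```

Each head produces its own `(seq_len, seq_len)` attention map, and the output has the same shape as the input sequence.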


In [2]:
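num_heads = 2
embedding_vector_dim = 256
# Sequences of 8 tokens, each a 256-dimensional embedding
inputs = tf.keras.Input(shape=[8, 256])
# key_dim is the size of each head's query/key projection
mha_layer = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_vector_dim)
# Self-attention: query, value and key are all the same sequence
outputs = mha_layer(inputs, inputs, inputs)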




In [3]:
print(outputs.shape)

(None, 8, 256)


## 2. Transformer Encoder

The following is the architecture of the encoder from the original "Attention Is All You Need" paper:

![transformer_encoder](../images/transformer_encoder.png)

In brief, this is what the Transformer Encoder layer involves:

1. It begins with multi-head attention (as described above).
2. The original word vectors are added to the multi-head attention output through a residual connection.
3. The sum then goes through a normalization layer, NL1.
4. Next comes a dense projection block (two Dense layers, with a configurable width);
   the output of this block has the same dimension as the input vectors.
5. A second residual connection adds the output of NL1 to the output of the dense projection.
6. Finally, the sum goes through one more normalization layer, NL2.

### A quick note on why we use residual connections and normalization:

1. Why a residual connection?
 - It mitigates the vanishing gradient problem.
 - It acts as an information shortcut around destructive or noisy blocks (such as blocks that contain relu activations or dropout layers).
 - It lets gradient information propagate noiselessly through a deep network.

2. Why use a normalization layer?
 - It helps gradients flow better during backprop.
 - The normalization we use here is the LayerNormalization layer, which normalizes each sequence independently of the other sequences in the batch (with the Keras defaults, over the feature axis of each token).

 Note: BatchNorm doesn't work that well with sequence data.
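
To make the difference concrete, here is a small NumPy sketch. It is a simplification that leaves out the learnable scale and offset, and `layer_norm` and `batch_norm` are illustrative helper names, not Keras functions. It shows that layer normalization of one sequence is unaffected by the rest of the batch, while batch-style normalization is not.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    # Normalize over the last (feature) axis: each token uses only its own statistics.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-3):
    # Normalize each feature with statistics pooled over the batch and time axes.
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(4, 6, 8))  # (batch, seq_len, features)
normed = layer_norm(batch)

# Residual connection followed by normalization ("add & norm").
sublayer_out = rng.normal(size=batch.shape)  # stand-in for an attention output
add_and_norm = layer_norm(batch + sublayer_out)

# Perturb every sequence except the first one.
perturbed = batch.copy()
perturbed[1:] += 100.0

same = np.allclose(layer_norm(perturbed)[0], normed[0])
changed = not np.allclose(batch_norm(perturbed)[0], batch_norm(batch)[0])
print(same, changed)  # True True
```

The first sequence keeps exactly the same layer-normalized values, whereas its batch-normalized values shift because the pooled statistics changed.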


## 3. The Code for the Transformer Encoder Layer

In [4]:
import tensorflow as tf

In [13]:
class TransformerEncoder(tf.keras.layers.Layer):
    
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        
        self.embed_dim = embed_dim # Dimension of the embedding vectors
        self.dense_dim = dense_dim # Number of neurons in the inner Dense layer
        self.num_heads = num_heads # Number of heads in the multi-head attention layer
        self.attention = tf.keras.layers.MultiHeadAttention( # The multi-head attention block
                         num_heads = num_heads,
                         key_dim   = embed_dim )
        self.dnse_proj = tf.keras.Sequential(
                         [tf.keras.layers.Dense(dense_dim, activation='relu'),
                          tf.keras.layers.Dense(embed_dim)
                         ]
                         )
        self.layrnorm1 = tf.keras.layers.LayerNormalization()
        self.layrnorm2 = tf.keras.layers.LayerNormalization()
    
    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        
        attention_output = self.attention(inputs, inputs,
                                          attention_mask=mask)
        #Input to projection layer
        proj_input = self.layrnorm1(inputs + attention_output)
        
        #Dense block computation
        proj_output = self.dnse_proj(proj_input)
        
        #Finally, add the Dense projection output to the input that was passed to it, then normalize
        return self.layrnorm2(proj_input + proj_output)
    
    def get_config(self):
        
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "dense_dim": self.dense_dim,
            "num_heads": self.num_heads,
        })
        return config
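
The line `mask = mask[:, tf.newaxis, :]` in `call` reshapes a `(batch, seq_len)` padding mask to `(batch, 1, seq_len)` so that it broadcasts over the query axis of the attention scores. The NumPy sketch below illustrates that broadcasting with made-up token ids and dummy scores; it is an illustration of the idea, not the Keras internals.

```python
import numpy as np

# Hypothetical padded batch: 0 is the padding id.
token_ids = np.array([[5, 8, 2, 0, 0],
                      [7, 3, 0, 0, 0]])

padding_mask = token_ids != 0                    # (batch, seq_len)
attention_mask = padding_mask[:, np.newaxis, :]  # (batch, 1, seq_len)

scores = np.zeros((2, 5, 5))                     # (batch, query_len, key_len), dummy values
# The size-1 axis broadcasts, so every query position ignores the padded keys.
masked = np.where(attention_mask, scores, -1e9)

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(attention_mask.shape)  # (2, 1, 5)
print(weights[0, 0])         # the three real tokens share the weight; padded keys get 0
```

Keras only passes a mask into `call` when an upstream layer produces one, for example an `Embedding` created with `mask_zero=True`; otherwise `mask` is `None` and the branch is skipped.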

## Now we are going to build a text classifier using the Transformer Encoder Block

In [6]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv('../data/uhack_review_train.csv')

In [8]:
text = df['Review']
labels = df['Polarity']

In [9]:
## splitting data
train_size = int(0.9 * len(df))
train_data = df[:train_size]
test_data = df[train_size:]


train_sentences = train_data['Review'].values
test_sentences = test_data['Review'].values

train_labels = np.array(train_data['Polarity'].values)
test_labels = np.array(test_data['Polarity'].values)


## HYPER-PARAM:-

NUM_WORDS = 1000
TRUNCATE = 'post'  # 'pre'
PADDING = 'post'   # 'pre'
MAX_LEN = 100
EVD = 16

## 1. Fit Tokenizer

bbc_tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=NUM_WORDS,
                                                      oov_token='<OOV>')
bbc_tokenizer.fit_on_texts(train_sentences)

## 2. Convert text to sequence

train_seq = bbc_tokenizer.texts_to_sequences(train_sentences)
test_seq = bbc_tokenizer.texts_to_sequences(test_sentences)

## 3. Convert the sequence to padded sequences

train_padded = tf.keras.preprocessing.sequence.pad_sequences(train_seq,
                                                             truncating=TRUNCATE,
                                                             padding=PADDING,
                                                             maxlen=MAX_LEN)
test_padded = tf.keras.preprocessing.sequence.pad_sequences(test_seq,
                                                            truncating=TRUNCATE,
                                                            padding=PADDING,
                                                            maxlen=MAX_LEN)
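
To see what the three preprocessing steps do to a sentence, here is a simplified pure-Python emulation. It is only a sketch: the real Keras `Tokenizer` also strips punctuation, and the helpers `fit_word_index`, `to_sequence`, and `pad_post` are hypothetical names used for illustration.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    # Index 0 is reserved for padding and index 1 for the OOV token;
    # the remaining words are ranked by descending frequency.
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    oov = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, oov)
        # Unknown words and words ranked at or beyond num_words map to OOV.
        ids.append(idx if idx < num_words else oov)
    return ids

def pad_post(seq, maxlen):
    seq = seq[:maxlen]                      # truncating='post' keeps the start
    return seq + [0] * (maxlen - len(seq))  # padding='post' appends zeros

train = ["good food good service", "bad food"]
word_index = fit_word_index(train)

seq = to_sequence("good pizza", word_index, num_words=1000)
padded = pad_post(seq, 5)
print(padded)  # [2, 1, 0, 0, 0]
```

"good" is the most frequent training word, so it gets index 2; "pizza" never appeared in training, so it maps to the OOV index 1; the rest of the row is zero padding.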

In [14]:
## Classification

## Hyper-params for transformer

num_heads = 2
dense_dim = 32


inputs = tf.keras.Input(shape=[None], dtype='int64')
embedd = tf.keras.layers.Embedding(NUM_WORDS,EVD)(inputs)
transf = TransformerEncoder(EVD, dense_dim, num_heads)(embedd)
glmaxp = tf.keras.layers.GlobalMaxPool1D()(transf)
droput = tf.keras.layers.Dropout(0.5)(glmaxp)
output = tf.keras.layers.Dense(1, activation='sigmoid')(droput)

tmodel = tf.keras.models.Model(inputs, output)

tmodel.compile(optimizer='adam',
               loss='binary_crossentropy',
               metrics=['accuracy'])
tmodel.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_2 (Embedding)      (None, None, 16)          16000     
_________________________________________________________________
transformer_encoder_2 (Trans (None, None, 16)          3296      
_________________________________________________________________
global_max_pooling1d (Global (None, 16)                0         
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 17        
Total params: 19,313
Trainable params: 19,313
Non-trainable params: 0
_________________________________________________________
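
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the Keras defaults (biases enabled everywhere, and a value projection the same size as the key projection); `dense_params` and `mha_params` are helper names introduced for this example.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # kernel + bias

def mha_params(d_model, num_heads, key_dim):
    # Query, key and value projections, plus the output projection back to d_model.
    qkv = 3 * dense_params(d_model, num_heads * key_dim)
    out = dense_params(num_heads * key_dim, d_model)
    return qkv + out

EVD, dense_dim, num_heads, NUM_WORDS = 16, 32, 2, 1000

embedding = NUM_WORDS * EVD
attention = mha_params(EVD, num_heads, key_dim=EVD)
projection = dense_params(EVD, dense_dim) + dense_params(dense_dim, EVD)
layer_norms = 2 * (2 * EVD)  # a scale and an offset vector for each of the two norms
encoder = attention + projection + layer_norms
classifier = dense_params(EVD, 1)

print(embedding, encoder, classifier, embedding + encoder + classifier)
# 16000 3296 17 19313
```

Most of the encoder's 3,296 parameters (2,160 of them) sit in the attention block, because `key_dim` is the size of each head and the two heads together project to 32 dimensions.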

In [16]:
MC = tf.keras.callbacks.ModelCheckpoint(
    '../model/first_transformer.h5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)

ES = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    verbose=1,
    restore_best_weights=True
)

TB = tf.keras.callbacks.TensorBoard('../tboard/')

tmodel.fit(train_padded,
           train_labels,
           epochs=10,
           validation_data=(test_padded, test_labels),
           callbacks=[ES, MC, TB]
           )


Epoch 1/10
  5/173 [..............................] - ETA: 7s - loss: 1.0123 - accuracy: 0.5437
Epoch 00001: val_loss improved from inf to 0.35500, saving model to ../model/first_transformer.h5
Epoch 2/10

Epoch 00002: val_loss improved from 0.35500 to 0.30816, saving model to ../model/first_transformer.h5
Epoch 3/10

Epoch 00003: val_loss improved from 0.30816 to 0.28899, saving model to ../model/first_transformer.h5
Epoch 4/10

Epoch 00004: val_loss improved from 0.28899 to 0.28093, saving model to ../model/first_transformer.h5
Epoch 5/10

Epoch 00005: val_loss improved from 0.28093 to 0.27290, saving model to ../model/first_transformer.h5
Epoch 6/10

Epoch 00006: val_loss improved from 0.27290 to 0.26919, saving model to ../model/first_transformer.h5
Epoch 7/10

Epoch 00007: val_loss improved from 0.26919 to 0.26844, saving model to ../model/first_transformer.h5
Epoch 8/10

Epoch 00008: val_loss improved from 0.26844 to 0.26077, saving model to ../model/first_transformer.h5
Epoch 9/10

Epoch 00009: val_loss improved from 0.26077 to 0.25654, saving model to ../model/first_tran

<keras.callbacks.History at 0x168fd7760>
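
The validation loss improved in every logged epoch, so the `EarlyStopping` callback with `patience=5` never fired. The sketch below is a simplified pure-Python emulation of the patience logic (not the Keras source; `early_stopping_epoch` is a hypothetical helper). It is applied to the nine logged values and to an invented run in which the loss stalls.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), 1-indexed.

    stop_epoch is None when training runs through every epoch.
    """
    best, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_loss for epochs 1-9, copied from the training log.
logged = [0.35500, 0.30816, 0.28899, 0.28093, 0.27290,
          0.26919, 0.26844, 0.26077, 0.25654]
print(early_stopping_epoch(logged, patience=5))   # (None, 9)

# Hypothetical run in which validation loss stops improving after epoch 3.
stalled = [0.40, 0.35, 0.30, 0.31, 0.32, 0.33, 0.31, 0.30, 0.34, 0.35]
print(early_stopping_epoch(stalled, patience=5))  # (8, 3)
```

In the stalled run, training would stop at epoch 8, and with `restore_best_weights=True` the model would be rolled back to the weights from epoch 3.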