# Classification using Transformers
In this JupyterLab notebook, we will learn how to use the Transformer for a classification task. <br>
Source: see this [repo](https://github.com/brightmart/text_classification) to learn more about different models (including this Transformer) for classification tasks.

- TODO#1: add tensorboard to visualize training loss, computational graph.
- TODO#2: (at home) change from the toy dataset to the real IMDB dataset.

In [8]:
%load_ext autoreload
%autoreload 2
import tensorflow as tf
import os, sys



In [9]:
# Include *.py files from other folders
module_path = os.path.abspath(os.path.join('../../'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [10]:
import numpy as np
import random
import copy
from pythonlibs.embeddings.transformers.a2_base_model import BaseClass
from pythonlibs.embeddings.transformers.a2_encoder import Encoder

input_x: Tensor("input_x:0", shape=(128, 50), dtype=int32)
get_mask==>result: Tensor("mul:0", shape=(50, 50), dtype=float32)
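
The `get_mask==>result` line printed by the import is a (50, 50) look-ahead mask; the same construction appears later in this notebook as `Transformer.get_mask`. Below is a minimal NumPy sketch of what such a mask does, assuming it is added to the attention scores before the softmax. The helper names `causal_mask` and `softmax` are illustrative and not part of `pythonlibs`.

```python
import numpy as np

def causal_mask(sequence_length):
    """0 on and below the diagonal, -1e9 above it (same rule as get_mask)."""
    lower_triangle = np.tril(np.ones((sequence_length, sequence_length)))
    return -1e9 * (1.0 - lower_triangle)

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Assumption for illustration: all raw attention scores are equal (zero).
scores = np.zeros((4, 4))
weights = softmax(scores + causal_mask(4))
print(np.round(weights, 2))
```

With equal scores, row `i` spreads its weight evenly over positions `0..i` and gives zero weight to every later position, which is exactly the "cannot attend to subsequent positions" behavior.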


In [11]:
# Showing graph on Jupyter Notebook
# source: https://stackoverflow.com/questions/38189119/simple-way-to-visualize-a-tensorflow-graph-in-jupyter
from IPython.display import clear_output, Image, display, HTML

def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add() 
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
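                # tensor_content is a bytes field, so encode the placeholder (required on Python 3)
                tensor.tensor_content = tf.compat.as_bytes("<stripped %d bytes>" % size)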
    return strip_def

def show_graph(graph_def, max_const_size=32):
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(str(strip_def)), id='graph'+str(np.random.rand()))

    iframe = """
        <iframe seamless style="width:1200px;height:620px;border:0" srcdoc="{}"></iframe>
    """.format(code.replace('"', '&quot;'))
    display(HTML(iframe))

In [12]:
logs_path = '../../../my_data/tf_transformers_logs'

In [13]:
# Helper: Creating random dataset
def get_unique_labels(length=5):
    """Return the integers 2 .. length+1 in random order."""
    x = list(range(2, 2 + length))
    random.shuffle(x)
    return x

# length defaults to 5; the previous default of None made range(2, 2+length) fail.
def get_unique_labels_batch(batch_size, length=5):
    x=[]
    for i in range(batch_size):
        labels=get_unique_labels(length=length)
        x.append(labels)
    return x
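
The training loop later in this notebook turns one shuffled list into a training example: the input is the list followed by the token 9, and the label is the larger of the list's last two entries (the loop reverses the list and takes the max of its first two elements). The sketch below restates that rule compactly; `make_toy_example` is a hypothetical helper name, not a function used by the notebook.

```python
import random

import numpy as np

def make_toy_example(length=5):
    """Build one (input_x, input_y_label) pair for the toy task."""
    tokens = list(range(2, 2 + length))
    random.shuffle(tokens)
    # Input: the shuffled tokens followed by the constant token 9, shape (1, length + 1).
    input_x = np.array([tokens + [9]], dtype=np.int32)
    # Label: the larger of the last two shuffled tokens, shape (1,).
    input_y_label = np.array([max(tokens[-1], tokens[-2])], dtype=np.int32)
    return input_x, input_y_label

input_x, input_y_label = make_toy_example()
print("input_x:", input_x, "label:", input_y_label)
```

Because every label is one of the input tokens (2 to 6), the task only needs a handful of the `num_classes` outputs configured in the training cell.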

In [14]:
# Use this when you need to reset the TF graph.
# tf.reset_default_graph()
"""
Transformer_classification: the Transformer was originally a sequence-to-sequence model built solely on attention mechanisms, which makes it fast and accurate. Here we use it for text classification, and only the encoder is used.
For more detail, see the paper "Attention Is All You Need".
1. position embedding for encoder input and decoder input
2. encoder with multi-head attention, position-wise feed forward
3. decoder with multi-head attention for decoder input, position-wise feed forward, multi-head attention between encoder and decoder.
Encoder:
6 layers. Each layer has two sub-layers.
The first is a multi-head self-attention mechanism;
the second is a position-wise fully connected feed-forward network.
For each sub-layer, use LayerNorm(x+Sublayer(x)). All dimensions are 512.
Decoder:
1. The decoder is composed of a stack of N= 6 identical layers.
2. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head
attention over the output of the encoder stack.
3. Similar to the encoder, we employ residual connections
around each of the sub-layers, followed by layer normalization. We also modify the self-attention
sub-layer in the decoder stack to prevent positions from attending to subsequent positions.  This
masking, combined with fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions less than i.
"""
epochs = 100 # was 15000
checkepoch = 10# was 1500

class Transformer(BaseClass):
    def __init__(self, num_classes, learning_rate, batch_size, decay_steps, decay_rate, sequence_length,
                 vocab_size, embed_size,d_model,d_k,d_v,h,num_layer,is_training,
                 initializer=tf.random_normal_initializer(stddev=0.1),clip_gradients=5.0,l2_lambda=0.0001,use_residual_conn=False):
        """init all hyperparameter here"""
        super(Transformer, self).__init__(d_model, d_k, d_v, sequence_length, h, batch_size, num_layer=num_layer) #init some fields by using parent class.

        self.num_classes = num_classes
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_size = d_model
        self.learning_rate = tf.Variable(learning_rate, trainable=False, name="learning_rate")
        self.learning_rate_decay_half_op = tf.assign(self.learning_rate, self.learning_rate * 0.5)
        self.initializer = initializer
        self.clip_gradients=clip_gradients
        self.l2_lambda=l2_lambda

        self.is_training=is_training #self.is_training=tf.placeholder(tf.bool,name="is_training") #tf.bool #is_training
        self.input_x = tf.placeholder(tf.int32, [self.batch_size, self.sequence_length], name="input_x")                 #x  batch_size
        self.input_y_label = tf.placeholder(tf.int32, [self.batch_size], name="input_y_label")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

        self.global_step = tf.Variable(0, trainable=False, name="Global_Step")
        self.epoch_step = tf.Variable(0, trainable=False, name="Epoch_Step")
        self.epoch_increment = tf.assign(self.epoch_step, tf.add(self.epoch_step, tf.constant(1)))
        self.decay_steps, self.decay_rate = decay_steps, decay_rate
        self.use_residual_conn=use_residual_conn

        self.instantiate_weights()
        self.logits = self.inference() #logits shape:[batch_size,self.num_classes]

        self.predictions = tf.argmax(self.logits, axis=1, name="predictions")
        correct_prediction = tf.equal(tf.cast(self.predictions, tf.int32),self.input_y_label)
        self.accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="Accuracy")  # shape=()
        if not self.is_training:  # inference only: no need to build the loss or the training op
            return
        self.loss_val = self.loss()
        self.train_op = self.train()

    def inference(self):
        """Building blocks:
        encoder: 6 layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism; the second is a position-wise fully connected feed-forward network.
               For each sub-layer, use LayerNorm(x+Sublayer(x)). All dimensions are 512.
        decoder: 6 layers. Each layer has three sub-layers; the second sub-layer performs multi-head attention over the output of the encoder stack.
               For each sub-layer, use LayerNorm(x+Sublayer(x)).
        """
        # 1.embedding for encoder input & decoder input
        # 1.1 position embedding for encoder input
        input_x_embeded = tf.nn.embedding_lookup(self.Embedding,self.input_x)  #[None,sequence_length, embed_size]
        input_x_embeded=tf.multiply(input_x_embeded,tf.sqrt(tf.cast(self.d_model,dtype=tf.float32)))
        input_mask=tf.get_variable("input_mask",[self.sequence_length,1],initializer=self.initializer)
        input_x_embeded=tf.add(input_x_embeded,input_mask) #[None,sequence_length,embed_size].position embedding.

        # 2. encoder
        encoder_class=Encoder(self.d_model,self.d_k,self.d_v,self.sequence_length,self.h,self.batch_size,self.num_layer,input_x_embeded,input_x_embeded,dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn)
        Q_encoded,K_encoded = encoder_class.encoder_fn() #K_v_encoder

        Q_encoded=tf.reshape(Q_encoded,shape=(self.batch_size,-1)) #[batch_size,sequence_length*d_model]
        with tf.variable_scope("output"):
            logits = tf.matmul(Q_encoded, self.W_projection) + self.b_projection #logits shape:[batch_size,self.num_classes]
        print("logits:",logits)
        return logits

    def loss(self, l2_lambda=None):
        """Mean softmax cross-entropy plus L2 regularization on the trainable weights."""
        if l2_lambda is None:
            l2_lambda = self.l2_lambda  # use the value passed to the constructor
        with tf.name_scope("loss"):
            # input: `logits`:[batch_size, num_classes], and `labels`:[batch_size]
            # output: A 1-D `Tensor` of length `batch_size` of the same type as `logits` with the softmax cross entropy loss.
            losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y_label, logits=self.logits)
            loss = tf.reduce_mean(losses)  # scalar, shape=()
            # Skip variables whose name contains 'bias' or 'alpha'.
            l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if ('bias' not in v.name) and ('alpha' not in v.name)]) * l2_lambda
            loss = loss + l2_losses
        return loss

    #def loss_seq2seq(self):
    #    with tf.variable_scope("loss"):
    #        losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=self.input_y_label, logits=self.logits);#losses:[batch_size,self.decoder_sent_length]
    #        loss_batch=tf.reduce_sum(losses,axis=1)/self.decoder_sent_length #loss_batch:[batch_size]
    #        loss=tf.reduce_mean(loss_batch)
    #        l2_losses = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * self.l2_lambda
    #        loss = loss + l2_losses
    #        return loss

    def train(self):
        """Based on the loss, update the parameters with Adam, using an exponentially decayed learning rate and gradient clipping."""
        learning_rate = tf.train.exponential_decay(self.learning_rate, self.global_step, self.decay_steps,self.decay_rate, staircase=True)
        self.learning_rate_=learning_rate
        #noise_std_dev = tf.constant(0.3) / (tf.sqrt(tf.cast(tf.constant(1) + self.global_step, tf.float32))) #gradient_noise_scale=noise_std_dev
        train_op = tf.contrib.layers.optimize_loss(self.loss_val, global_step=self.global_step,
                                                   learning_rate=learning_rate, optimizer="Adam",clip_gradients=self.clip_gradients)
        return train_op

    def instantiate_weights(self):
        """define all weights here"""
        with tf.variable_scope("embedding_projection"):  # embedding matrix
            self.Embedding = tf.get_variable("Embedding", shape=[self.vocab_size, self.embed_size],initializer=self.initializer)  # [vocab_size,embed_size] tf.random_uniform([self.vocab_size, self.embed_size],-1.0,1.0)
            self.Embedding_label = tf.get_variable("Embedding_label", shape=[self.num_classes, self.embed_size],dtype=tf.float32) #,initializer=self.initializer
            self.W_projection = tf.get_variable("W_projection", shape=[self.sequence_length*self.d_model, self.num_classes],initializer=self.initializer)  # [sequence_length*d_model,num_classes]
            self.b_projection = tf.get_variable("b_projection", shape=[self.num_classes])

    def get_mask(self,sequence_length):
        lower_triangle = tf.matrix_band_part(tf.ones([sequence_length, sequence_length]), -1, 0)
        result = -1e9 * (1.0 - lower_triangle)
        print("get_mask==>result:", result)
        return result
    
    
# Toy task: learn to predict the larger of two numbers at fixed positions in the input array.
def test_training():

    # Below is a functional test. To use this for text classification, first transform each sentence into vocabulary indices, then feed the data to the graph.
    num_classes = 9+2 #additional two classes:one is for _GO, another is for _END
    learning_rate = 0.0001 #/10.0
    batch_size = 1 # was 1
    decay_steps = 1000
    decay_rate = 0.9
    sequence_length = 6#5 TODO
    vocab_size = 300
    is_training = True #True
    dropout_keep_prob = 0.9  # 0.5 #num_sentences
    #decoder_sent_length=6
    l2_lambda=0.0001#0.0001
    d_model=512 #512
    d_k=64
    d_v=64
    h=8
    num_layer=1
    embed_size = d_model

    model = Transformer(num_classes, learning_rate, batch_size, decay_steps, decay_rate, sequence_length,
                        vocab_size, embed_size,d_model,d_k,d_v,h,num_layer,is_training,l2_lambda=l2_lambda)
    saver = tf.train.Saver()
    
    acc_arr = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        
        # writer.add_graph(sess.graph)
        writer = tf.summary.FileWriter(logs_path, sess.graph)
        
        ckpt_dir = 'checkpoint_transformer/sequence_reverse/'
        
        if not os.path.exists(ckpt_dir):
            os.makedirs(ckpt_dir)
        
        if os.path.exists(ckpt_dir+"checkpoint"):
            saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))
            
        for i in range(epochs):
            label_list= get_unique_labels()
            input_x = np.array([label_list+[9]],dtype=np.int32)
            label_list_original=copy.deepcopy(label_list)
            label_list.reverse()
            input_y_label=np.array([np.max([label_list[0],label_list[1]])],dtype=np.int32)

            loss, acc, predict, W_projection_value, _ = sess.run([model.loss_val, model.accuracy, model.predictions, model.W_projection, model.train_op],
                                                     feed_dict={model.input_x:input_x, model.input_y_label: input_y_label,
                                                                model.dropout_keep_prob: dropout_keep_prob}) #model.dropout_keep_prob: dropout_keep_prob
            print(i,"loss:", loss, "acc:", acc, "label_list_original as input x:",label_list_original,";input_y_label:", input_y_label, "prediction:", predict)
            acc_arr.append(acc)
            if i%checkepoch==0:
                save_path = ckpt_dir + "model.ckpt"
                saver.save(sess, save_path, global_step=i)
        writer.close()  # flush the graph events to disk so TensorBoard can read them

#test_training()
#test_predict()
#test_training_batch()
#test_training()
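
The docstring at the top of the cell above describes each encoder layer as two sub-layers, self-attention and a position-wise feed-forward network, each wrapped as `LayerNorm(x + Sublayer(x))`. Here is a minimal NumPy sketch of that wiring. It is a simplification under stated assumptions and not the implementation in `a2_encoder`: a single attention head, no learned query/key/value projections, no dropout, and a layer norm without learned gain or bias. All function names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and (approximately) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (no learned projections)."""
    d_k = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d_k)   # [batch, seq, seq]
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x                                # [batch, seq, d_model]

def feed_forward(x, w1, w2):
    """Position-wise feed-forward network: ReLU(x W1) W2."""
    return np.maximum(x @ w1, 0.0) @ w2

def encoder_layer(x, w1, w2):
    """Two sub-layers, each applied as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + self_attention(x))
    return layer_norm(x + feed_forward(x, w1, w2))

rng = np.random.default_rng(0)
batch_size, sequence_length, d_model, d_ff = 1, 6, 8, 32  # small sizes for the demo
x = rng.normal(size=(batch_size, sequence_length, d_model))
w1 = rng.normal(scale=0.1, size=(d_model, d_ff))
w2 = rng.normal(scale=0.1, size=(d_ff, d_model))
out = encoder_layer(x, w1, w2)
print(out.shape)  # (1, 6, 8)
```

The output keeps the input shape `[batch_size, sequence_length, d_model]`, which is why `inference()` can flatten it to `[batch_size, sequence_length*d_model]` before the final projection.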

In [15]:
def test_predict():
    # Below is a functional test. To use this for text classification, first transform each sentence into vocabulary indices, then feed the data to the graph.
    num_classes = 9+2 #additional two classes:one is for _GO, another is for _END
    learning_rate = 0.001
    batch_size = 1
    decay_steps = 1000
    decay_rate = 0.9
    sequence_length = 6 #5
    vocab_size = 300
    is_training = False #True
    dropout_keep_prob = 1  # 0.5 #num_sentences
    l2_lambda=0.0001
    d_model=512 #512
    d_k=64
    d_v=64
    h=8
    num_layer=1#6
    embed_size = d_model
    model = Transformer(num_classes, learning_rate, batch_size, decay_steps, decay_rate, sequence_length,
                                    vocab_size, embed_size,d_model,d_k,d_v,h,num_layer,is_training,l2_lambda=l2_lambda)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        ckpt_dir = 'checkpoint_transformer/sequence_reverse/'
        saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))
        print("=================restored.")
        for i in range(checkepoch):
            label_list=get_unique_labels()
            input_x = np.array([label_list+[9]],dtype=np.int32)
            label_list_original=copy.deepcopy(label_list)
            label_list.reverse()
            input_y_label=np.array([np.max([label_list[0],label_list[1]])],dtype=np.int32)

            predict, W_projection_value = sess.run([ model.predictions, model.W_projection], #model.loss_val,--->loss, model.train_op
                                feed_dict={model.input_x:input_x,
                                           model.dropout_keep_prob: dropout_keep_prob})
            print(i, "label_list_original as input x:",label_list_original, "prediction:", predict,";label:",input_y_label) #"acc:", acc, "loss:", loss ";input_y_label:", input_y_label

In [16]:
# tf.reset_default_graph()
test_training()

encoder_fn.started.
MultiHeadAttention.self.dropout_rate: Tensor("base_mode_sub_layer_multi_head_attention_encoder0/sub:0", dtype=float32)
self.sequence_length: 6
LayerNormResidualConnection.use_residual_conn: False
Instructions for updating:
keep_dims is deprecated, use keepdims instead
output_conv1: Tensor("sub_layer_postion_wise_feed_forwardencoder0/transpose:0", shape=(1, 6, 2048, 1), dtype=float32)
LayerNormResidualConnection.use_residual_conn: True
encoder_fn. 0 .Q: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32) ;K_s: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32)
encoder_fn.ended.Q: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32) ;K_s: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32) ;time spent: 0.21437788009643555
logits: Tensor("output/add:0", shape=(1, 11), dtype=float32)
0 loss: 9.355
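
A note on the learning rate during this run: `train()` wraps it in `tf.train.exponential_decay(..., staircase=True)`, which computes `learning_rate * decay_rate ** floor(global_step / decay_steps)`. The plain-Python sketch below reproduces that formula with the values set in `test_training()`; the function name `staircase_decay` is made up for illustration.

```python
def staircase_decay(initial_lr, global_step, decay_steps, decay_rate):
    """learning_rate * decay_rate ** floor(global_step / decay_steps)."""
    return initial_lr * decay_rate ** (global_step // decay_steps)

# Values from test_training(): learning_rate=0.0001, decay_steps=1000, decay_rate=0.9
for step in (0, 99, 1000, 2500):
    print(step, staircase_decay(0.0001, step, 1000, 0.9))
```

In a fresh run with `epochs = 100`, the global step stays below `decay_steps = 1000`, so the rate remains at its initial value for the whole run; the first drop would only happen at step 1000.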

In [17]:
show_graph(tf.get_default_graph().as_graph_def())

In [147]:
tf.reset_default_graph()
test_predict()

encoder_fn.started.
MultiHeadAttention.self.dropout_rate: Tensor("base_mode_sub_layer_multi_head_attention_encoder0/sub:0", dtype=float32)
self.sequence_length: 6
LayerNormResidualConnection.use_residual_conn: False
output_conv1: Tensor("sub_layer_postion_wise_feed_forwardencoder0/transpose:0", shape=(1, 6, 2048, 1), dtype=float32)
LayerNormResidualConnection.use_residual_conn: True
encoder_fn. 0 .Q: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32) ;K_s: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32)
encoder_fn.ended.Q: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32) ;K_s: Tensor("layer_normalization0encoder_postion_wise_ff/add_1:0", shape=(1, 6, 512), dtype=float32) ;time spent: 0.3507051467895508
logits: Tensor("output/add:0", shape=(1, 11), dtype=float32)
INFO:tensorflow:Restoring parameters from checkpoint_transformer/sequence_reverse/model

### TODO#1: add tensorboard to visualize training loss, computational graph.
Suggested steps:
- 1. Define a log path and create a `writer` to save TensorFlow's events.
- 2. Visualize the current computational graph.
- 3. Create `summary` ops to record the loss and the weights.
- 4. Add each evaluated summary to the `writer`.
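
The four steps can be sketched on a tiny stand-in model, as one possible starting point rather than the finished solution. Assumptions: the import goes through `tensorflow.compat.v1` so the sketch also runs on a TensorFlow 2 install (on an older 1.x install, use `import tensorflow as tf` and drop the `disable_v2_behavior()` call); the linear model, the names `x`, `y`, `w`, and the temporary log directory are all hypothetical.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF >= 1.14 or TF 2.x

tf.disable_v2_behavior()
tf.reset_default_graph()

# Hypothetical stand-in for the Transformer: one weight matrix and a scalar loss.
x = tf.placeholder(tf.float32, [None, 4], name="x")
y = tf.placeholder(tf.float32, [None, 1], name="y")
w = tf.get_variable("w", shape=[4, 1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y), name="loss")
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Step 3: summary ops for the loss (scalar) and the weights (histogram).
tf.summary.scalar("loss", loss)
tf.summary.histogram("w", w)
merged = tf.summary.merge_all()

# Step 1: a log directory (temporary here; the notebook defines logs_path).
demo_logs_path = tempfile.mkdtemp(prefix="tf_logs_")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Step 2: passing sess.graph stores the computational graph for TensorBoard.
    writer = tf.summary.FileWriter(demo_logs_path, sess.graph)
    for step in range(20):
        feed = {x: [[1.0, 2.0, 3.0, 4.0]], y: [[1.0]]}
        summary, _ = sess.run([merged, train_op], feed_dict=feed)
        # Step 4: add the evaluated summary to the writer.
        writer.add_summary(summary, global_step=step)
    writer.close()

event_files = os.listdir(demo_logs_path)
print(event_files)
```

For the notebook's model, the analogous summaries would be built from `model.loss_val` and, for example, `model.W_projection`, and evaluated in the same `sess.run` call as `model.train_op`. Running `tensorboard --logdir` on the log directory then shows the loss curve and the graph.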

### TODO#2: (at home) change from the toy dataset to the real IMDB dataset.
- This is how you get the IMDB dataset:
```
import keras
from keras.datasets import imdb
(X_train, y_train), (X_test, y_test) = imdb.load_data('./imdb.npz', 
                                                     num_words=5000, # keep only the 5000 most frequent words
                                                     skip_top=0,     # do not skip the most frequent words
                                                     maxlen=None,    # no maximum review length
                                                     start_char=1,   # index that marks the start of a review
                                                     oov_char=2,     # index that replaces out-of-vocabulary words
                                                     index_from=3)   # actual words are indexed from here on
```
- Check the dataset:
```
print("X.train: %s, y.train: %s, X.test: %s, y.test: %s" %(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
```

- Exploring the dataset:

```
# Get the word-to-index mapping for the whole vocabulary
vocab_idx = imdb.get_word_index()
# Let's see what one encoded document looks like
print(X_train[1])
```
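
The IMDB reviews have different lengths, while the model's `input_x` placeholder expects a fixed `[batch_size, sequence_length]` shape, so each review has to be padded or truncated first. Below is a minimal NumPy sketch of one way to do that. Assumptions: index 0 is used for padding (it is unused by the data when `start_char=1`, `oov_char=2`, `index_from=3`), reviews are truncated at the end, and `pad_or_truncate` and the two short example reviews are made up for illustration.

```python
import numpy as np

def pad_or_truncate(sequences, sequence_length, pad_id=0):
    """Return an int32 array [len(sequences), sequence_length], right-padded with pad_id."""
    batch = np.full((len(sequences), sequence_length), pad_id, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:sequence_length]       # keep the first sequence_length tokens
        batch[row, :len(trimmed)] = trimmed   # the rest of the row stays pad_id
    return batch

# Hypothetical encoded reviews (lists of word indices), one short and one long.
reviews = [[1, 14, 22, 16], [1, 194, 1153, 194, 2, 78, 228, 5]]
input_x = pad_or_truncate(reviews, sequence_length=6)
print(input_x)
# [[   1   14   22   16    0    0]
#  [   1  194 1153  194    2   78]]
```

With `num_words=5000` every index is below 5000, so the model would need a `vocab_size` of at least 5000, and `num_classes = 2` since the labels are 0 and 1.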

# Conclusions
What you should know after this notebook:
- 1. The architecture of the Transformer model.
- 2. How to train and test a classifier on a toy dataset.
- 3. In addition, you will get the chance to work more with this model inside BERT.