# Model Training and Evaluation
This is the final notebook for training and evaluating the BERT-like architecture, built on top of TensorFlow examples and tutorials. It uses the training data in the vectorized_samples directory, which holds 3.1 GB of vectorized MalDroid analysis in 5,139 .npy files. The samples are broken up into categories as follows:
* Adware: 812 (~15.8%)
* Banking: 1438 (~28%)
* SMS: 1442 (~28.06%)
* Riskware: 1447 (~28.16%)
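
The shares above follow directly from the per-class counts. A minimal sketch that recomputes them; the `class_counts` dict is only a convenient name for this illustration and is not defined elsewhere in the notebook:

```python
# Per-class sample counts as listed above.
class_counts = {"adware": 812, "banking": 1438, "sms": 1442, "riskware": 1447}

total = sum(class_counts.values())  # number of samples across all classes

# Share of each class as a percentage of the whole dataset.
class_shares = {name: 100 * count / total for name, count in class_counts.items()}

for name, share in class_shares.items():
    print(f"{name}: {class_counts[name]} ({share:.2f}%)")
```
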
## Objectives
1. Set up input pipeline to read directly from notebook filesystem
2. Implement pipeline optimizations outlined in the TensorFlow docs
3. Define the BERT from TensorFlow docs, adding head classifier layer(s)
4. Write the training loop 
5. Implement logging of associated metrics and save checkpoints
6. Train and evaluate (be sure to properly initialize weights)
## Improvements (if not included)
* Better weight initialization
* Support for TPUs
* Support for mixed precision
* Attention output and plotting
* Hyperparameter tuning
* Larger-sized model
* Trimmed vocab size
* Larger sample size
* Profile the input pipeline to improve performance

In [30]:
# !pip install tensorflow
import tensorflow as tf
from tensorflow.keras import mixed_precision
# confirm tensorflow is using GPU:
print(tf.__version__)
print("Eager execution: {}".format(tf.executing_eagerly()))
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

import numpy as np
import time
import statistics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from glob import glob
import random
random.seed(42)
import os

2.4.1
Eager execution: True
Num GPUs Available:  1


In [2]:
# setup mixed precision for GPU
# mixed_precision.set_global_policy('mixed_float16')

# setup mixed precision for TPU
# mixed_precision.set_global_policy('mixed_bfloat16')
# see https://www.tensorflow.org/guide/mixed_precision#summary for mixed precision guidelines

## Input pipeline
The pipeline needs to meet the following criteria:
* Avoid loading the whole dataset into memory
* Apply padding to the samples (max length: 2783755; 1283945 after trimming to the 5039 shortest samples); as configured, each batch is padded to the length of its longest sample
* Implement a fix for unbalanced data
* Batch the samples
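
To make the padding criterion concrete, here is a pure-Python sketch of padding each batch to the length of its longest sample. The helper `pad_batch` is hypothetical and only imitates the idea; the notebook itself relies on `padded_batch` from `tf.data`. The pad value of -1 mirrors the `padding_values` setting used later in this notebook.

```python
def pad_batch(sequences, pad_value=-1):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three toy "samples" of different lengths.
batch = [[5, 9, 2], [7], [3, 1]]
padded = pad_batch(batch)
print(padded)
```

Because the target length is chosen per batch, a batch of short samples stays short instead of being padded to the global maximum.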

In [3]:
# get max length of samples

# max_sample_len = 0
# for mal_class in os.listdir('vectorized_samples'):
#     parent_path = 'vectorized_samples/' + mal_class + '/'
#     for sample_path in os.listdir(parent_path):
#         sample_len = np.load(parent_path + sample_path).size
#         if sample_len > max_sample_len:
#             max_sample_len = sample_len

In [4]:
# get list of sample lengths for analysis

# len_list = []
# for mal_class in os.listdir('vectorized_samples'):
#     parent_path = 'vectorized_samples/' + mal_class + '/'
#     for sample_path in os.listdir(parent_path):
#         len_list.append(np.load(parent_path + sample_path).size)

In [5]:
# plot = sns.boxplot(x=len_list)

A boxplot of the sample lengths reveals a large number of outliers, so limiting the maximum length in the dataset would benefit model performance. With 5,139 samples, we could drop the 100 or so longest. The result of this process, shown below, is a maximum sample length of 1283945, a significant decrease from 2.78 million. This is a viable solution if training cost proves unmanageable.
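
One way to see why the maximum length matters for training cost is to estimate the size of the attention-weight tensor, which grows with the square of the sequence length. The sketch below is a rough estimate under stated assumptions: full self-attention over the padded sequence, float32 values, and the batch size (32) and head count (12) used later in this notebook. The function name `attention_weights_bytes` is made up for this illustration.

```python
def attention_weights_bytes(seq_len, batch_size=32, num_heads=12, bytes_per_value=4):
    """Size of one layer's (batch, heads, seq_len, seq_len) attention-weight tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

full = attention_weights_bytes(2_783_755)     # untrimmed maximum length
trimmed = attention_weights_bytes(1_283_945)  # maximum after dropping the 100 longest

print(f"trimmed / full = {trimmed / full:.3f}")
```

Doubling the sequence length quadruples this estimate, so trimming the longest outliers reduces the worst-case cost by much more than the reduction in length alone would suggest.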

In [6]:
# len_list.sort(reverse=True)
# len_list = len_list[100:]
# max_sample_len_trimmed = len_list[0]
# print(max_sample_len_trimmed)

In [7]:
sample_path_list = glob('vectorized_samples/*/*.npy')

# shuffling the samples now so the runtime does not have to deal with maintaining a large buffer
random.shuffle(sample_path_list)

In [8]:
# converts labels to categorical ints as follows:
# adware: 0
# banking: 1
# sms: 2
# riskware: 3

def process_path(file_paths):
    for file_path in file_paths:
        label = tf.strings.split(file_path, os.path.sep)[-2]
        if label == 'adware':
            label = 0
        elif label == 'banking':
            label = 1
        elif label == 'sms':
            label = 2
        elif label == 'riskware':
            label = 3
        sample = np.load(file_path)
        yield sample, label
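
The label mapping can also be written as a dictionary lookup on the parent directory name. This is a sketch of an alternative, not the code the pipeline above uses; `LABEL_TO_INT`, `label_from_path`, and the file name `example.npy` are hypothetical. It assumes that string arguments handed to a generator by `from_generator` arrive as bytes, which is why the helper decodes them first.

```python
import os

# Same label scheme as the comment above: adware 0, banking 1, sms 2, riskware 3.
LABEL_TO_INT = {"adware": 0, "banking": 1, "sms": 2, "riskware": 3}

def label_from_path(file_path):
    """Return the integer label encoded in the parent directory name."""
    if isinstance(file_path, bytes):  # assumption: paths may arrive as bytes
        file_path = file_path.decode("utf-8")
    class_name = os.path.basename(os.path.dirname(file_path))
    return LABEL_TO_INT[class_name]

print(label_from_path(os.path.join("vectorized_samples", "sms", "example.npy")))
```

An unknown directory name raises a `KeyError` here, whereas the if/elif chain would silently pass the unmapped label through.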

In [9]:
# This generates a dataset that does not account for class imbalance

# data = tf.data.Dataset.from_generator(process_path, args=[sample_path_list], output_types=(tf.int32, tf.int32), output_shapes=((None,), ()))

### Class Imbalance
Analysis of the vectorized samples shows that Adware is significantly underrepresented, making up around 16% of the samples compared to around 28% for each other class. To account for this, we will oversample Adware. More advanced techniques such as class weighting could be more effective, but they are not implemented here due to time constraints and potential training overhead.
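
For reference, the class weighting alternative mentioned above could start from inverse-frequency weights. The sketch below uses the common "balanced" heuristic, `total / (num_classes * count)`; this is one possible choice and an assumption of this example, not something the notebook implements.

```python
class_counts = {"adware": 812, "banking": 1438, "sms": 1442, "riskware": 1447}

total = sum(class_counts.values())
num_classes = len(class_counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights.
class_weights = {name: total / (num_classes * count)
                 for name, count in class_counts.items()}

for name, weight in class_weights.items():
    print(f"{name}: {weight:.3f}")
```

With these weights every class contributes the same total weight to the loss, since `count * weight` equals `total / num_classes` for each class.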

In [10]:
# not the most efficient but it is useable and is only run once

adware_paths = []
banking_paths = []
sms_paths = []
riskware_paths = []

for path in sample_path_list:
    label = tf.strings.split(path, os.path.sep)[-2]
    if label == 'adware':
        adware_paths.append(path)
    elif label == 'banking':
        banking_paths.append(path)
    elif label == 'sms':
        sms_paths.append(path)
    elif label == 'riskware':
        riskware_paths.append(path)

# cleaning up sample_path_list
sample_path_list = []

In [11]:
# if memory becomes an issue, consider dropping the precision of the ints to 8; alternatively, if more values are needed, increase it to 32. As it stands, these precisions are appropriate for their value ranges
adware_data = tf.data.Dataset.from_generator(process_path, args=[adware_paths], output_types=(tf.int16, tf.int8), output_shapes=((None,), ()))

banking_data = tf.data.Dataset.from_generator(process_path, args=[banking_paths], output_types=(tf.int16, tf.int8), output_shapes=((None,), ()))

sms_data = tf.data.Dataset.from_generator(process_path, args=[sms_paths], output_types=(tf.int16, tf.int8), output_shapes=((None,), ()))

riskware_data = tf.data.Dataset.from_generator(process_path, args=[riskware_paths], output_types=(tf.int16, tf.int8), output_shapes=((None,), ()))

In [12]:
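# don't panic, seed is for reproducible results
# note: each class dataset is finite, so sample_from_datasets interleaves the existing samples without duplicating any; for Adware samples to be drawn more than once, its dataset would need to repeat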
oversamp_data = tf.data.experimental.sample_from_datasets([adware_data, banking_data, sms_data, riskware_data], weights=[0.25,0.25,0.25,0.25], seed=42)

In [13]:
BUFFER_SIZE = 250
BATCH_SIZE = 32 # Recommended by BERT paper, alt is 16
DATASET_SIZE = 5139
train_size = int(0.7 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)

def make_batches(dataset):
    return (
        dataset
        .cache() # comment out if too memory intensive
        .shuffle(BUFFER_SIZE)
        .repeat()
        .padded_batch(BATCH_SIZE, padding_values=-1, drop_remainder=True) # constant padding size if necessary is 'padded_shapes=((2783760,), ())'; shape of padded tensor is selected as it is the max len of sample rounded to a multiple of 8 for max GPU performance
        .prefetch(tf.data.AUTOTUNE))

# shuffle returns a new dataset, so the result must be reassigned to take effect
oversamp_data = oversamp_data.shuffle(BUFFER_SIZE, seed=42, reshuffle_each_iteration=False)
train_dataset = oversamp_data.take(train_size)
test_dataset = oversamp_data.skip(train_size)
val_dataset = test_dataset.skip(val_size)
test_dataset = test_dataset.take(test_size)

train_dataset = make_batches(train_dataset)
test_dataset = make_batches(test_dataset)
val_dataset = make_batches(val_dataset)
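
The take/skip arithmetic above is easy to check by hand. The sketch below recomputes the split sizes under the assumption that the sampled stream yields exactly `DATASET_SIZE` elements before batching; `remaining`, `actual_test`, and `actual_val` are names introduced for this illustration only.

```python
DATASET_SIZE = 5139

train_size = int(0.7 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)

# take/skip semantics: train takes the first train_size elements, test takes
# the next test_size, and validation is what remains after skip(val_size).
remaining = DATASET_SIZE - train_size
actual_test = min(test_size, remaining)
actual_val = remaining - val_size

print(train_size, actual_test, actual_val)
```

Because `int()` truncates, the three nominal sizes do not add up to `DATASET_SIZE`; the leftover elements end up in the validation split.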

## BERT Model
Adapted from tutorials provided by TensorFlow (https://www.tensorflow.org/tutorials/text/transformer) on the transformer model.

### Positional Encoding

In [15]:
def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

In [16]:
def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)
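
For reference, the two functions above implement the sinusoidal positional encoding from the original Transformer paper. Written out, with $pos$ the position and $i$ indexing pairs of embedding columns:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

In `get_angles`, the expression `2 * (i//2)` maps each column index to the exponent numerator $2i$ of its pair, so an even column and the odd column after it share the same angle rate.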

### Masking

In [17]:
def create_padding_mask(seq):
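  # marks positions equal to 0 as padding; note that make_batches pads with -1,
  # so the two values need to agree for padded positions to be masked
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)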

  # add extra dimensions to add the padding
  # to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

### Scaled Dot-Product Attention

In [18]:
def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead)
  but it must be broadcastable for addition.

  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

  Returns:
    output, attention_weights
  """

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights
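
In equation form, the function above computes the following, where $d_k$ is the key depth and $M$ is the optional mask holding 1 at positions to ignore and 0 elsewhere:

```latex
\mathrm{Attention}(Q, K, V) =
  \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M \cdot \left(-10^{9}\right)\right) V
```

Adding $-10^{9}$ to a logit drives its softmax weight to effectively zero, which is how masked positions are excluded from the weighted sum over $V$.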

### Multi-Head Attention

In [19]:
class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]

    q = self.wq(q)  # (batch_size, seq_len, d_model)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)

    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

    return output, attention_weights

### Point-Wise Feed-Forward Network

In [20]:
def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])

### Encoder

In [21]:
class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model), this is where attn weights would be output
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

    return out2

In [22]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding,
                                            self.d_model)

    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

### Model Declaration

In [23]:
class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, pe_input, num_classes, rate=0.1):
    # pe_input is the max_positional_encoding
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                             input_vocab_size, pe_input, rate)

    self.flatten = tf.keras.layers.Flatten()

    self.final_layer = tf.keras.layers.Dense(num_classes, activation='softmax') # 4 is the number of classes

  def call(self, inp, tar, training, enc_padding_mask,
           look_ahead_mask, dec_padding_mask):

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    flatten_output = self.flatten(enc_output)  # (batch_size, inp_seq_len*d_model)

    final_output = self.final_layer(flatten_output)  # (batch_size, num_classes)

    return final_output

### Hyperparameters
Due to time and resource constraints, we will be using the values specified by the BERT paper (BERT_base). Better performance could be achieved by thoroughly tuning these values for this use case. Additionally, if training time becomes an issue, the model could be reduced to the dimensions of BERT_small or BERT_medium. Details about these dimensions can be found in the BERT GitHub readme (https://github.com/google-research/bert).

In [24]:
num_layers = 12
d_model = 768
dff = 3072
num_heads = 12
dropout_rate = 0.1
num_classes = 4
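
To get a feel for what these values imply, the sketch below derives the per-head depth and a rough trainable-parameter count for the encoder stack as defined in this notebook. It counts only the dense layers and layer norms inside each `EncoderLayer`; the embedding table and the classifier head are excluded because the vocabulary size and sequence length are not fixed here. The helper `dense_params` is introduced for this illustration.

```python
num_layers = 12
d_model = 768
dff = 3072
num_heads = 12

assert d_model % num_heads == 0
depth = d_model // num_heads  # per-head dimension

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

attention = 4 * dense_params(d_model, d_model)  # wq, wk, wv, and the output dense
feed_forward = dense_params(d_model, dff) + dense_params(dff, d_model)
layer_norms = 2 * (2 * d_model)                 # gamma and beta for two layer norms

per_layer = attention + feed_forward + layer_norms
encoder_stack = num_layers * per_layer  # excludes embeddings and classifier head

print(depth, per_layer, encoder_stack)
```
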

### Optimizer
Uses Adam with the custom learning rate schedule defined in the original Transformer paper (https://arxiv.org/abs/1706.03762).

In [25]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

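  def __call__(self, step):
    # depending on the TensorFlow version, step may arrive as an integer,
    # which rsqrt does not accept, so cast it to float first
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)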
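
The schedule is easier to reason about outside of TensorFlow. Below is a plain-Python sketch of the same formula; `transformer_lr` is a hypothetical helper, with defaults taken from the `d_model` used in this notebook and the `warmup_steps` default above. The rate rises linearly during warmup, peaks at `step == warmup_steps`, and then decays with the inverse square root of the step.

```python
def transformer_lr(step, d_model=768, warmup_steps=4000):
    """Learning rate at a given (1-based) optimizer step."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```
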

### Loss
Some notes:
* This assumes the labels are single integers indicating the class (e.g. 2) and the predictions are arrays with one entry per class. If the labels are one-hot arrays instead, change to the non-sparse CategoricalCrossentropy
* This also assumes that the output is a probability distribution (e.g. [0.1, 0.3, 0.6]), not an array of unnormalized logits (e.g. [-1.2, 0.4, 2.0]). If this is not the case, pass from_logits=True
* The reduction used by the TF transformer tutorial is 'none'; the default is AUTO (the docs indicate this almost always resolves to SUM_OVER_BATCH_SIZE, i.e. the mean over the batch). With 'none', the loss is returned separately for each input in the batch, and the gradient updates are computed with respect to each one and then combined (https://datascience.stackexchange.com/questions/55151/differences-between-gradient-calculated-by-different-reduction-methods-in-pytorc). However, the TF transformer loss function calls reduce_sum on the masked loss and divides by the sum of the mask, which averages the loss in much the same way, so AUTO is used here
* The TF transformer tutorial includes a padding mask when calculating loss and metrics. I do not believe this is necessary in this use case, since the tutorial assumes the desired output is a sequence of tokens, whereas here each sample yields a single class label
#### If there are any issues with loss, these may be of some assistance

In [29]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy()

In [None]:
def loss_function(real, pred):
    return loss_object(real, pred)
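
To make the assumptions in the notes above concrete, here is a plain-Python sketch of what sparse categorical cross-entropy computes when the labels are integer class indices and the predictions are probability distributions. It is an illustration only (the Keras loss additionally clips probabilities for numerical stability), and the function and sample values are made up for this example.

```python
import math

def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(p[label]) over the batch; each row of probs must sum to 1."""
    losses = [-math.log(row[label]) for label, row in zip(labels, probs)]
    return sum(losses) / len(losses)

labels = [2, 0]              # integer class indices, not one-hot vectors
probs = [[0.1, 0.3, 0.6],    # probabilities, not logits
         [0.7, 0.2, 0.1]]
loss = sparse_categorical_crossentropy(labels, probs)
print(loss)
```
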

## Training, Checkpointing, and Evaluation Loop

In [None]:
EPOCHS = 157
# This will train on approx 5,024 samples (with a batch size of 32)

In [None]:
# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.

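train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int16), # tokens: (batch_size, seq_len)
    tf.TensorSpec(shape=(None,), dtype=tf.int8), # labels: (batch_size,)
]


# NOTE: transformer, optimizer, train_loss, train_accuracy, and accuracy_function
# are not defined in the cells above and are assumed to be created before this runs
@tf.function(input_signature=train_step_signature)
def train_step(inp, label):
  enc_padding_mask = create_padding_mask(inp)

  with tf.GradientTape() as tape:
    # the classifier only uses the encoder, so the decoder-side arguments
    # (tar, look_ahead_mask, dec_padding_mask) are passed as None
    predictions = transformer(inp, None,
                                 True,
                                 enc_padding_mask,
                                 None,
                                 None)
    loss = loss_function(label, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

  train_loss(loss)
  train_accuracy(accuracy_function(label, predictions))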

Portions of this page are reproduced from and/or modifications based on work created and shared by Google (https://developers.google.com/readme/policies) and used according to terms described in the Creative Commons 4.0 Attribution License (https://creativecommons.org/licenses/by/4.0/).