# Model Training and Evaluation
This is the final notebook for the training and evaluation of the BERT-like architecture, built on top of TensorFlow examples and tutorials. It will use the training data from the vectorized_samples directory containing 3.1 GB of 5000 .npy fizes of vectorized MalDroid analysis. The samples are broken up into categories as follows:
* Adware: 812 (~15.8%)
* Banking: 1438 (~28%)
* SMS: 1442 (~28.06%)
* Riskware: 1447 (~28.16%)
## Objectives
1. Set up input pipeline to read directly from notebook filesystem
2. Implement pipeline optimizations outlined in the TensorFlow docs
3. Define the BERT from TensorFlow docs, adding head classifier layer(s)
4. Write the training loop 
5. Implement logging of associated metrics and save checkpoints
6. Train and evaluate (be sure to properly initalize weights)

In [1]:
# !pip install tensorflow
import tensorflow as tf
from tensorflow.keras import mixed_precision
# confirm tensorflow is using GPU:
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

import numpy as np
import time
import statistics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from glob import glob
import random
random.seed(42)

2.4.1
Num GPUs Available:  1


In [2]:
# setup mixed precision for GPU
mixed_precision.set_global_policy('mixed_float16')

# setup mixed precision for TPU
# mixed_precision.set_global_policy('mixed_bfloat16')
# see https://www.tensorflow.org/guide/mixed_precision#summary for mixed precision guidelines

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPU will likely run quickly with dtype policy mixed_float16 as it has compute capability of at least 7.0. Your GPU: NVIDIA GeForce GTX 1660 Ti, compute capability 7.5


## Input pipeline
The pipeline needs to meet the following criteria:
* Avoid loading the whole dataset into memory
* Apply padding to the samples (max_len: 2783755, trimmed to 5039: 1283945)
* Implement a fix for unbalanced data
* Batch the samples

In [3]:
# get max length of samples

# max_sample_len = 0
# for mal_class in os.listdir('vectorized_samples'):
#     parent_path = 'vectorized_samples/' + mal_class + '/'
#     for sample_path in os.listdir(parent_path):
#         len_list.append(np.load(parent_path + sample_path).size)
#         if sample_len > max_sample_len:
#             max_sample_len = sample_len

In [4]:
# get list of sample lengths for analysis

# len_list = []
# for mal_class in os.listdir('vectorized_samples'):
#     parent_path = 'vectorized_samples/' + mal_class + '/'
#     for sample_path in os.listdir(parent_path):
#         len_list.append(np.load(parent_path + sample_path).size)

In [5]:
# plot = sns.boxplot(x=len_list)

Boxplot of sample lengths reveals a large amount of outliers, it would be benificial to model performance to limit the max length of the dataset. We have 5,139 samples therefore we could cut the 100 or so greatest lengths. The results of this process shown below yield a max sample length of 1283945, a significant decrease from 2.78 million. This is a viable solution if training cost proves unmanageable.

In [6]:
# len_list.sort(reverse=True)
# len_list = len_list[100:]
# max_sample_len_trimmed = len_list[0]
# print(max_sample_len_trimmed)

In [7]:
sample_path_list = glob('vectorized_samples/*/*.npy')

# shuffling the samples now so the runtime does not have to deal with maintaining a large buffer
random.shuffle(sample_path_list)

In [17]:
# converts labels to categorical ints as follows:
# adware: 0
# banking: 1
# sms: 2
# riskware: 3

def process_path(file_paths):
    for file_path in file_paths:
        label = tf.strings.split(path, os.path.sep)[-2]
        if label == 'adware':
            label = 0
        elif label == 'banking':
            label = 1
        elif label == 'sms':
            label = 2
        elif label == 'riskware':
            label = 3
        sample = np.load(file_path)
        yield sample, label

In [9]:
# This generates a dataset that does not account for class imbalance

# data = tf.data.Dataset.from_generator(process_path, args=[sample_path_list], output_types=(tf.int32, tf.int32), output_shapes=((None,), ()))

### Class Imbalance
Analysis of vectorized samples shows that Adware is signifigantly underrepresented, making up around 18% compared to around 28% from each other class. To account for this, we will oversample Adware. Some more advanced techniques like class weighting could be more effective but cannot be implemented due to time constraints and potential training overhead.

In [10]:
# not the most efficient but it is useable and is only run once

adware_paths = []
banking_paths = []
sms_paths = []
riskware_paths = []

for path in sample_path_list:
    label = tf.strings.split(path, os.path.sep)[-2]
    if label == 'adware':
        adware_paths.append(path)
    elif label == 'banking':
        banking_paths.append(path)
    elif label == 'sms':
        sms_paths.append(path)
    elif label == 'riskware':
        riskware_paths.append(path)

# cleaning up sample_path_list

sample_path_list = []

In [18]:
adware_data = tf.data.Dataset.from_generator(process_path, args=[adware_paths], output_types=(tf.int32, tf.int32), output_shapes=((None,), ()))

banking_data = tf.data.Dataset.from_generator(process_path, args=[banking_paths], output_types=(tf.int32, tf.int32), output_shapes=((None,), ()))

sms_data = tf.data.Dataset.from_generator(process_path, args=[sms_paths], output_types=(tf.int32, tf.int32), output_shapes=((None,), ()))

riskware_data = tf.data.Dataset.from_generator(process_path, args=[riskware_paths], output_types=(tf.int32, tf.int32), output_shapes=((None,), ()))

In [19]:
# don't panic, seed is for reproducible results
oversamp_data = tf.data.experimental.sample_from_datasets([adware_data, banking_data, sms_data, riskware_data], weights=[0.25,0.25,0.25,0.25], seed=42)

In [23]:
BUFFER_SIZE = 250
BATCH_SIZE = 32 # Reccomended by BERT paper, alt is 16
DATASET_SIZE = 5139
PAD_SIZE = 
train_size = int(0.7 * DATASET_SIZE)
test_size = int(0.15 * DATASET_SIZE)
val_size = int(0.15 * DATASET_SIZE)

def make_batches(dataset):
    return (
        dataset
        .cache() # comment out if too memory intensive
        .shuffle(BUFFER_SIZE)
        .repeat()
        .padded_batch(BATCH_SIZE, padding_values=-1, drop_remainder=True)
        .prefetch(tf.data.AUTOTUNE))

oversamp_data.shuffle(BUFFER_SIZE, seed=42, reshuffle_each_iteration=False)
train_dataset = oversamp_data.take(train_size)
test_dataset = oversamp_data.skip(train_size)
val_dataset = test_dataset.skip(val_size)
test_dataset = test_dataset.take(test_size)

train_dataset = make_batches(train_dataset)
test_dataset = make_batches(test_dataset)
val_dataset = make_batches(val_dataset)

In [24]:
for sample, label in train_dataset.take(20):
    print(sample)
    print(label)

tf.Tensor(
[[ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 ...
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]], shape=(32, 1096410), dtype=int32)
tf.Tensor([1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1], shape=(32,), dtype=int32)
tf.Tensor(
[[ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 ...
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]], shape=(32, 1285807), dtype=int32)
tf.Tensor([1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1], shape=(32,), dtype=int32)
tf.Tensor(
[[ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 ...
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]], shape=(32, 1238131), dtype=int32)
tf.Tensor([1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1], shape=(32,), dtype=int32)
tf.Tensor(
[[ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]
 [ 1  3 10 ... -1 -1 -1]

Portions of this page are reproduced from and/or modifications based on work created and shared by Google (https://developers.google.com/readme/policies) and used according to terms described in the Creative Commons 4.0 Attribution License (https://creativecommons.org/licenses/by/4.0/).