# Two-Tier Malware Detection for Raw Executables
[ Block-Based Implementation ]

## Pre-requisites:

* Refer to the Jupyter notebook [Data_Preprocessing_and_Partitioning.ipynb](Data_Preprocessing_and_Partitioning.ipynb) for the complete set of one-time prerequisite steps to be carried out for data preprocessing and partitioning.

## Basic Implementation Overview:

In this implementation, the second tier of our framework leverages the following information:
1. Qualified sections identified by Activation Trend Identification (ATI) mechanism.

Unlike the top-activation-block and section-ID implementations, this basic implementation uses the entire byte data of
a qualified section as training input for Tier-2. During Tier-2 training, the byte data corresponding to
unqualified sections is turned off, i.e., the byte values are set to the padding character (0), before the sample is
fed into the Tier-2 model. As a result, the Tier-2 input data is noisy, and Tier-2 training takes as long as Tier-1 training for each qualification-criterion cut-off.
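
The masking step described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `mask_unqualified_sections`, the `section_bounds` layout (section name mapped to byte offsets), and the toy data are all assumptions made for the example. It also assumes that bytes lying outside every listed section are treated like unqualified bytes.

```python
import numpy as np

PAD_VALUE = 0  # padding character named in the text above


def mask_unqualified_sections(byte_data, section_bounds, qualified):
    """Return a copy of byte_data where only qualified sections keep their bytes.

    byte_data      : 1-D uint8 array with the raw bytes of one executable
    section_bounds : dict of section name -> (start, end) offsets, end exclusive
                     (hypothetical layout chosen for this sketch)
    qualified      : set of section names selected by ATI
    """
    # Start from an all-padding buffer, then copy the qualified sections back in.
    masked = np.full_like(byte_data, PAD_VALUE)
    for name, (start, end) in section_bounds.items():
        if name in qualified:
            masked[start:end] = byte_data[start:end]
    return masked


# Toy 12-byte "executable" with three 4-byte sections.
sample = np.arange(1, 13, dtype=np.uint8)
bounds = {".text": (0, 4), ".data": (4, 8), ".rsrc": (8, 12)}
masked = mask_unqualified_sections(sample, bounds, qualified={".text", ".rsrc"})
print(masked.tolist())
```

Because the masked sample keeps its original length, the Tier-2 model sees an input of the same size as Tier-1, which is consistent with the training-time remark above.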

## High-level Sequence of Two-Tier Training-Validation-Testing Process:

```text
TRAINING & VALIDATION:
----------------------
Load Training & Validation Partitions  -> Train & Evaluate Tier-1 model
Load Validation Partitions             -> Find Tier-1 model threshold (THD1) + Find & Store B1_Val set into partitions
Load Training Partitions again         -> Find & store B1_Train set using THD1
Load B1_Train partitions               -> Perform ATI over B1_Train + Find Qualified sections to Train Tier-2
Load B1_Train & B1_Val Partitions      -> Find & store top activation blocks into partitions using Qualified sections
Load B1 Train & Val Block partitions   -> Train & Evaluate Tier-2 model
Load B1_Validation partitions          -> Find Tier-2 model threshold (THD2)

TESTING:
--------
Load Tier-1 Test Partitions            -> Predict using Tier-1 model & store B1_Test set into partitions
Load B1_Test partitions                -> Find & store Top activation blocks into B1_Test_Block partitions
Load B1_Test_Block partitions          -> Predict using Tier-2 model & reconcile Tier-1 and Tier-2 results
```
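
The thresholding and reconciliation steps in the sequence above can be sketched roughly as below. Several things here are assumptions, since this document does not define them: that B1 denotes the samples Tier-1 scores below THD1, that thresholds are chosen to meet a target false-positive rate on benign validation samples, and that a Tier-2 verdict replaces the Tier-1 verdict for B1 samples. The function names and toy scores are hypothetical.

```python
import numpy as np


def threshold_for_target_fpr(benign_scores, target_fpr):
    """Smallest threshold whose false-positive rate on benign scores is <= target_fpr.

    A sample is called malware when its score is >= the threshold.
    """
    benign_scores = np.asarray(benign_scores, dtype=float)
    # Candidates: every observed score, plus one value above the maximum
    # so that a false-positive rate of zero is always reachable.
    candidates = np.append(np.unique(benign_scores),
                           np.nextafter(benign_scores.max(), np.inf))
    for thd in candidates:  # ascending order, so the first hit is the smallest
        if np.mean(benign_scores >= thd) <= target_fpr:
            return float(thd)


def two_tier_predict(tier1_scores, tier2_score_fn, thd1, thd2):
    """Combine both tiers; returns (final malware verdicts, indices of the B1 set)."""
    tier1_scores = np.asarray(tier1_scores, dtype=float)
    final = tier1_scores >= thd1          # Tier-1 malware verdicts
    b1_idx = np.flatnonzero(~final)       # assumed B1: samples Tier-1 did not flag
    if b1_idx.size:
        tier2_scores = np.asarray(tier2_score_fn(b1_idx), dtype=float)
        final[b1_idx] = tier2_scores >= thd2  # Tier-2 decides for B1 samples
    return final, b1_idx


# Toy data: Tier-1 scores of benign validation samples, then four test samples.
benign_val_scores = [0.05, 0.10, 0.20, 0.40, 0.90]
thd1 = threshold_for_target_fpr(benign_val_scores, target_fpr=0.25)

tier1_test_scores = [0.95, 0.30, 0.10, 0.85]
tier2_lookup = {1: 0.7, 2: 0.2, 3: 0.6}  # stand-in for a real Tier-2 model
final, b1 = two_tier_predict(
    tier1_test_scores,
    lambda idx: [tier2_lookup[int(i)] for i in idx],
    thd1,
    thd2=0.5,
)
print(thd1, b1.tolist(), final.tolist())
```

In the real pipeline the Tier-2 scores would come from the model trained on the B1 partitions; the lookup table only keeps the sketch self-contained.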

## Sample Run:
The sample run below uses 20% of the DS1 dataset (approx. 40k samples). The output shown is for a single cross-validation fold, with training allowed to run for up to 50 epochs and an early-stopping patience of 5. Note that these sample results are not indicative of actual results.

When running new experiments, start with a small early-stopping value (param: EARLY_STOPPING_PATIENCE), such as 0 or 1,
to check the base time consumption, since each unit increment of this parameter may increase training time.
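
To see why a larger patience can lengthen training, here is a small sketch of one common early-stopping convention: stop once the validation loss has failed to improve for `patience` consecutive epochs. The function `epochs_run` and the loss values are hypothetical, and the rule actually applied by `main.py` may differ in detail.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs are trained before early stopping triggers."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop after this epoch
    return len(val_losses)  # ran to the epoch limit


# Validation loss improves for two epochs, then plateaus.
losses = [0.90, 0.70, 0.72, 0.71, 0.73, 0.74, 0.75, 0.76, 0.77]
for patience in (0, 1, 5):
    print(patience, epochs_run(losses, patience))
```

Under this convention, a patience of 0 or 1 stops at the first non-improving epoch, while a patience of 5 trains four additional epochs on the same loss curve.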

In [None]:
!python main.py 2  # Flag ONLY_TIER1_TRAINING is set to True 

Detected Platform: linux
Using TensorFlow backend.
2020-07-05 06:14:18.550090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-07-05 06:14:18.560155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-07-05 06:14:30.598968: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-05 06:14:30.645863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2020-07-05 06:14:30.646016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-05 06:14:30.646125: I tensorflow/stream_executor/platform/default/dso_l

## Tier-1 Results:


In [None]:
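from IPython.display import Image, display
import os

img_path = '../out/imgs/'
# Walk the output directory and show every Tier-1 plot in name order.
for root, _, imgs in os.walk(img_path):
    for img in sorted(imgs):
        if 'Tier1' in img:
            print(os.path.splitext(img)[0])  # file name without its extension
            # Join with root so images in subdirectories resolve correctly.
            display(Image(filename=os.path.join(root, img), width=600, height=400))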

## Tier-2 Results:

In [None]:
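from IPython.display import Image, display
import os

img_path = '../out/imgs/'
# Walk the output directory and show every Tier-2 plot in name order.
for root, _, imgs in os.walk(img_path):
    for img in sorted(imgs):
        if 'Tier2' in img:
            print(os.path.splitext(img)[0])  # file name without its extension
            # Join with root so images in subdirectories resolve correctly.
            display(Image(filename=os.path.join(root, img), width=600, height=400))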

## AUC Plots:

In [None]:
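from IPython.display import Image, display
import os

img_path = '../out/imgs/'
# Walk the output directory and show every plot with 'auc' in its name, in name order.
for root, _, imgs in os.walk(img_path):
    for img in sorted(imgs):
        if 'auc' in img:
            print(os.path.splitext(img)[0])  # file name without its extension
            # Join with root so images in subdirectories resolve correctly.
            display(Image(filename=os.path.join(root, img), width=600, height=400))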