# Lab Objectives

This lab shows how to detect malware with decision trees and introduces several techniques for tackling class imbalance. You will learn how to:

- Train a decision tree classifier for malware detection.
- Use upsampling and downsampling to tackle the challenge of class imbalance.

Please make sure this notebook and the following dataset files are located in the same folder:

- dataset "MalwareArtifacts.csv" for part 1
- datasets "training_data.npz", "training_labels.npy", "test_data.npz", "test_labels.npy" for part 2

Please try out the following cells and run the Python code in your notebook.

***
***This is not an assignment and you do not need to submit it***

# 1. Detecting malware with decision trees

In this part, we will use the "AddressOfEntryPoint" and "DllCharacteristics" fields that are stored in "MalwareArtifacts.csv" as potentially distinctive features for detecting malware. 

In [None]:
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score

# load the dataset
malware_dataset = pd.read_csv('MalwareArtifacts.csv', delimiter=',')
# Extract the feature columns "AddressOfEntryPoint" and "DllCharacteristics"
samples = malware_dataset.iloc[:, [0, 4]].values
targets = malware_dataset.iloc[:, 8].values

# split the dataset into training and testing set
from sklearn.model_selection import train_test_split
training_samples, testing_samples, training_targets, testing_targets = train_test_split(
         samples, targets, test_size=0.2, random_state=0)

# train a decision tree classifier then make predictions.
from sklearn import tree
tree_classifier = tree.DecisionTreeClassifier()
tree_classifier.fit(training_samples, training_targets)
predictions = tree_classifier.predict(testing_samples)

accuracy = 100.0 * accuracy_score(testing_targets, predictions)
print("Decision Tree accuracy: " + str(accuracy))

As the results show, predictions based only on the AddressOfEntryPoint and DllCharacteristics fields are quite effective, with an accuracy higher than 96%.
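The cell above selects the feature columns by position (iloc[:, [0, 4]]), which silently picks the wrong fields if the column order of the CSV ever changes. A minimal sketch of selecting by name instead is shown below. It runs on a tiny synthetic DataFrame, not on MalwareArtifacts.csv; the values, the extra "MajorLinkerVersion" column, and the "legitimate" label column name are assumptions made for illustration only.

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset. All values, the extra column,
# and the "legitimate" label name are hypothetical.
toy = pd.DataFrame({
    "AddressOfEntryPoint": [4096, 8192, 12288, 16384],
    "MajorLinkerVersion": [9, 10, 9, 14],
    "DllCharacteristics": [0, 320, 33088, 0],
    "legitimate": [0, 1, 1, 0],
})

# Select the two feature columns and the label column by name.
feature_names = ["AddressOfEntryPoint", "DllCharacteristics"]
samples_by_name = toy[feature_names].values
targets_by_name = toy["legitimate"].values

# Sanity check: look up the positions of the named columns and confirm that
# positional selection returns the same values.
positions = [toy.columns.get_loc(name) for name in feature_names]
assert (toy.iloc[:, positions].values == samples_by_name).all()

print(positions)
print(samples_by_name.shape)
```

Running the same get_loc check on the real dataset is a quick way to confirm that positions 0 and 4 really correspond to the two intended fields.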

# 2. Tackling class imbalance

In part 2, we will look at some commonly used techniques to tackle the challenge of class imbalance. Often in applying machine learning to cybersecurity, we are faced with highly imbalanced datasets. For instance, it may be much easier to access a large collection of benign samples than it is to collect malicious samples. Conversely, you may be working at an enterprise that, for legal reasons, is prohibited from saving benign samples. In either case, your dataset will be highly skewed toward one class. As a consequence, naive machine learning aimed at maximizing accuracy will result in a classifier that predicts almost all samples as coming from the overrepresented class. In the following steps, we will demonstrate several methods for dealing with imbalanced data.
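To make the accuracy problem concrete, here is a small sketch using made-up labels (950 benign, 50 malicious; these numbers are an assumption, not taken from the lab dataset). A "classifier" that always predicts the overrepresented class reaches high accuracy without detecting a single malicious sample.

```python
import collections

# Hypothetical label distribution: 950 benign (0) and 50 malicious (1).
y_imbalanced = [0] * 950 + [1] * 50

# A "classifier" that ignores its input and always predicts the majority class.
majority_label = collections.Counter(y_imbalanced).most_common(1)[0][0]
always_majority = [majority_label] * len(y_imbalanced)

# Fraction of predictions that match the true label.
majority_accuracy = sum(
    t == p for t, p in zip(y_imbalanced, always_majority)
) / len(y_imbalanced)

# Number of malicious samples that were actually flagged as malicious.
detected_malicious = sum(
    1 for t, p in zip(y_imbalanced, always_majority) if t == 1 and p == 1
)

print(collections.Counter(always_majority))  # every prediction is class 0
print(majority_accuracy)                     # 0.95
print(detected_malicious)                    # 0
```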

1. Begin by loading the training and testing data, and importing some libraries we will be using to score performance:

In [None]:
from sklearn.metrics import balanced_accuracy_score
import scipy.sparse
import collections

X_train = scipy.sparse.load_npz("training_data.npz")
y_train = np.load("training_labels.npy")
X_test = scipy.sparse.load_npz("test_data.npz")
y_test = np.load("test_labels.npy")

2. Train and test a simple Decision Tree classifier:

(balanced_accuracy_score computes the balanced accuracy, a metric used in classification problems with imbalanced datasets. It is defined as the average of the recall obtained on each class.)

In [None]:
dt = tree.DecisionTreeClassifier(random_state=1)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
# Count the number of instances of predicted label in dt_pred 
print(collections.Counter(dt_pred)) 
# print the balanced accuracy score
print(balanced_accuracy_score(y_test, dt_pred))

Next, we test several techniques to improve performance.

3. **Upsampling the minor class**: We extract all training samples from class 0 and class 1:

In [None]:
from sklearn.utils import resample

X_train_np = X_train.toarray()
class_0_indices = [i for i, x in enumerate(y_train == 0) if x]
class_1_indices = [i for i, x in enumerate(y_train == 1) if x]
size_class_0 = sum(y_train == 0)
X_train_class_0 = X_train_np[class_0_indices, :]
y_train_class_0 = [0] * size_class_0
X_train_class_1 = X_train_np[class_1_indices, :]

4. We upsample the elements of class 1 with replacement until the number of samples of class 1 and class 0 are equal:

In [None]:
X_train_class_1_resampled = resample(
    X_train_class_1, replace=True, n_samples=size_class_0
)
y_train_class_1_resampled = [1] * size_class_0

5. We combine the newly upsampled samples into a single training set:

In [None]:
X_train_resampled = np.concatenate([X_train_class_0, X_train_class_1_resampled])
y_train_resampled = y_train_class_0 + y_train_class_1_resampled

6. We train and test a classifier on our upsampled training set:

In [None]:
from scipy import sparse
X_train_resampled = sparse.csr_matrix(X_train_resampled)
dt_resampled = tree.DecisionTreeClassifier(random_state=1)
dt_resampled.fit(X_train_resampled, y_train_resampled)
dt_resampled_pred = dt_resampled.predict(X_test)
print(collections.Counter(dt_resampled_pred))
print(balanced_accuracy_score(y_test, dt_resampled_pred))

7. **Downsampling the major class**: We perform steps similar to the previous upsampling, except this time we downsample the major class until it is the same size as the minor class:

In [None]:
X_train_np = X_train.toarray()
class_0_indices = [i for i, x in enumerate(y_train == 0) if x]
class_1_indices = [i for i, x in enumerate(y_train == 1) if x]
size_class_1 = sum(y_train == 1)
X_train_class_1 = X_train_np[class_1_indices, :]
y_train_class_1 = [1] * size_class_1
X_train_class_0 = X_train_np[class_0_indices, :]
X_train_class_0_downsampled = resample(
    X_train_class_0, replace=False, n_samples=size_class_1
)
y_train_class_0_downsampled = [0] * size_class_1

8. We create a new training set from the downsampled data:

In [None]:
X_train_downsampled = np.concatenate([X_train_class_1, X_train_class_0_downsampled])
y_train_downsampled = y_train_class_1 + y_train_class_0_downsampled

9. We train a classifier on this dataset:

In [None]:
X_train_downsampled = sparse.csr_matrix(X_train_downsampled)
dt_downsampled = tree.DecisionTreeClassifier(random_state=1)
dt_downsampled.fit(X_train_downsampled, y_train_downsampled)
dt_downsampled_pred = dt_downsampled.predict(X_test)
print(collections.Counter(dt_downsampled_pred))
print(balanced_accuracy_score(y_test, dt_downsampled_pred))

To summarize part 2, we start by loading a predefined dataset (step 1). Next, we train a basic Decision Tree model on the data (step 2). To measure performance, we use the balanced accuracy score, a measure often used in classification problems with imbalanced datasets. By definition, balanced accuracy is the average of the recall obtained on each class. The best value is 1, whereas the worst value is 0.
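As an illustration of this definition, the following sketch computes balanced accuracy by hand in plain Python. The helper name balanced_accuracy and the toy labels are assumptions introduced here; the function is meant to mirror the definition above, not to replace balanced_accuracy_score from scikit-learn.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recall, following the definition above."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of the samples whose true label is cls
        idx = [i for i, label in enumerate(y_true) if label == cls]
        # Recall for cls: fraction of those samples predicted as cls
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical labels: 8 samples of class 0 and 2 samples of class 1.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

plain_accuracy = sum(
    t == p for t, p in zip(y_true_toy, y_pred_toy)
) / len(y_true_toy)
toy_balanced = balanced_accuracy(y_true_toy, y_pred_toy)

print(plain_accuracy)  # 8 of 10 correct: 0.8
print(toy_balanced)    # (7/8 + 1/2) / 2 = 0.6875
```

The gap between the two numbers comes from the minority class: missing one of only two class 1 samples halves that class's recall, while it barely moves plain accuracy.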

In the following steps, we employ different techniques to tackle the class imbalance. In steps 3 to 6, we use upsampling: randomly duplicating observations from the minority class in order to reinforce the minority class's signal. There are several methods for doing so, but the most common is to resample with replacement, as we have done. In steps 7 to 9, we downsample the major class. This simply means that we do not use all of the samples we have, only enough of them to balance the classes.
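Both techniques can also be expressed as a choice of row indices, which avoids building separate per-class arrays. The sketch below is one possible way to do this with NumPy only; the helper name balanced_indices, its mode and seed parameters, and the toy labels are assumptions introduced for illustration, and the sketch assumes binary labels 0 and 1.

```python
import numpy as np

def balanced_indices(y, mode="up", seed=0):
    """Return row indices that give both classes the same number of samples.

    mode="up" resamples the smaller class with replacement;
    mode="down" subsamples the larger class without replacement.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    idx_0 = np.flatnonzero(y == 0)
    idx_1 = np.flatnonzero(y == 1)
    # Order the two index arrays by class size (smaller class first).
    small, large = sorted([idx_0, idx_1], key=len)
    if mode == "up":
        small = rng.choice(small, size=len(large), replace=True)
    elif mode == "down":
        large = rng.choice(large, size=len(small), replace=False)
    else:
        raise ValueError("mode must be 'up' or 'down'")
    return np.concatenate([large, small])

# Hypothetical labels: 90 samples of class 0 and 10 samples of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
up_idx = balanced_indices(y_toy, mode="up")
down_idx = balanced_indices(y_toy, mode="down")

print(np.bincount(y_toy[up_idx]))    # [90 90]
print(np.bincount(y_toy[down_idx]))  # [10 10]
```

The returned indices would be applied to the features and labels together, for example X_train[up_idx] and y_train[up_idx]. If X_train is stored in CSR format, SciPy supports this kind of row indexing directly, so the toarray() conversion may not be needed.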

References:

- Hands On AI for Cybersecurity - Detecting malwares with decision trees
- Machine Learning for Cybersecurity Cookbook -  Machine Learning-Based Malware Detection

***
***end***