# Section 5: Models

In this section we present the most used machine learning models in problems involving security, including their theory, algorithms and codes of how to use them. Among the models, there are classifiers, detectors and clustering techniques, all of them explained here.

## Classifiers

Classifiers aim to classify a given input sample into a previously known class by them during the training. The training is the step that the classifier learns the patterns of each class with the data presented to it (together with their labels), adapting its parameters to the problem. This type of problem is known as supervised learning [Bishop 2006]. After the training, the model can be used to classify any unknown data, allowing it to be effectively used. Here we present the following classifiers: K-Nearest Neighbors, based in neighborhood, Random Forest, an ensemble based in decision trees, Support Vector Machines, based in the construction of a optimal hyperplane, and Multi-Layer Perceptron, a neural network type used a lot in deep learning.

### K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a machine learning model based in distance, which classification of a new instance is based on the distance of the $k$ training samples most closer to a given testing sample. Thus, a new unknown sample will be classified as being from the most prevalent classes between these $k$ samples, as shown in figure below [Michie et al. 1994], where a new instance will be classified as red when $k=3$, green, when $k=5$ and unknown when $k=6$ (the result is a draw, that is why even numbers are not recommended in binary classification - when there are just two classes).

<img src="imgs/knn1.png" align="center">

Generally, the distance used by KNN is the Euclidean distance. Given an instance $x$, described by $(a_{1}(x),a_{2}(x),...,a_{n}(x))$, where $a_{i}(x)$ is the $i$-th feature, the distance between two instances $x_{i}$ and $x_{j}$ is defined by the equation $d(x_{i},x_{j}) = \sqrt{\sum\limits_{r=1}^{n}(a_{r}(x_{i})-a_{r}(x_{j}))^{2})}$.

### Random Forest

TODO

### Support Vector Machine (SVM)

TODO

### Multi-Layer Perceptron (MLP)

TODO

## Detectors

TODO

### Drift Detection Method (DDM)

TODO

### Early Drift Detection Method (EDDM)

TODO

### ADaptive WINdowing (ADWIN)

TODO

## Clustering

TODO

### K-Means

TODO

### Density-Based Spatial Clustering of Applications with Noise (DBScan)

TODO

## Libraries

TODO

In [1]:
import pandas as pd
# dataset location
data_path = "./datasets/brazilian-malware.csv"
# read CSV dataset
data = pd.read_csv(data_path, keep_default_na=False)

TODO

In [2]:
# numerical attributes
NUMERICAL_ATTRIBUTES = ['BaseOfCode', 'BaseOfData', 'Characteristics', 'DllCharacteristics', 
                      'Entropy', 'FileAlignment', 'ImageBase', 'Machine', 'Magic',
                      'NumberOfRvaAndSizes', 'NumberOfSections', 'NumberOfSymbols', 'PE_TYPE',
                      'PointerToSymbolTable', 'Size', 'SizeOfCode', 'SizeOfHeaders',
                      'SizeOfImage', 'SizeOfInitializedData', 'SizeOfOptionalHeader',
                      'SizeOfUninitializedData']

# textual attributes
TEXTUAL_ATTRIBUTES = ['Identify', 'ImportedDlls', 'ImportedSymbols']

# label used to classify
LABEL = 'Label'

# attributes that are not used
UNUSED_ATTRIBUTES = ['FirstSeenDate', 'SHA1', 'TimeDateStamp']

TODO

In [3]:
label = data[LABEL].values
# remove unused attributes and label
for a in UNUSED_ATTRIBUTES:
    del data[a]
del data[LABEL]

TODO

In [4]:
# split data in half
def split_data(data):
    # get mid of data
    mid = int((len(data) + 1)/2)
    # split data into train and test
    train_data = data[:mid]
    test_data = data[mid:]
    # return train and test data
    return(train_data, test_data)

In [5]:
# data, _ = split_data(data)
# label, _ = split_data(label)
train_data, test_data = split_data(data)
train_label, test_label = split_data(label)

TODO

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# extract features from textual attributes
def textual_feature_extraction(train_data, test_data, extractor=TfidfVectorizer(max_features=100)):
    vectorizer = extractor
    # train vectorizer
    vectorizer.fit(train_data)
    # transform train and test data to features
    train_features = vectorizer.transform(train_data)
    test_features = vectorizer.transform(test_data)
    # return train and test features
    return(train_features, test_features)

Obtain numerical attributes:

In [7]:
train_features = train_data[NUMERICAL_ATTRIBUTES].values
test_features = test_data[NUMERICAL_ATTRIBUTES].values

In [8]:
train_features.shape, test_features.shape

((25091, 21), (25090, 21))

Obtain textual attributes and append to features array:

In [9]:
import numpy as np
# extract features from each textual attribute
for a in TEXTUAL_ATTRIBUTES:
    # extract features from current attribute
    train_texts, test_texts = textual_feature_extraction(train_data[a], test_data[a])
    train_features = np.concatenate((train_features, train_texts.toarray()), axis=1)
    test_features = np.concatenate((test_features, test_texts.toarray()), axis=1)

In [10]:
train_features.shape, test_features.shape

((25091, 321), (25090, 321))

Normalization:

In [11]:
from sklearn.preprocessing import MinMaxScaler
def normalization(train_features, test_features, scaler=MinMaxScaler()):
    # train minmax
    scaler.fit(train_features)
    # transform features
    train_features_norm = scaler.transform(train_features)
    test_features_norm = scaler.transform(test_features)
    # return normalized train and test features
    return(train_features_norm, test_features_norm)

In [12]:
train_features_norm, test_features_norm = normalization(train_features, test_features)

### Scikit-Learn

TODO

#### KNN

TODO

In [13]:
from sklearn.neighbors import KNeighborsClassifier
# initialize classifier
clf = KNeighborsClassifier(n_neighbors=3)
# train classifier
clf.fit(train_features_norm, train_label)
# predict test classes
test_pred = clf.predict(test_features_norm)
# print test pred and real labels shape
print(test_pred.shape, test_label.shape)

(12545,) (12545,)


#### Random Forest

TODO

In [14]:
from sklearn.ensemble import RandomForestClassifier
# initialize classifier
clf = RandomForestClassifier(n_estimators=10)
# train classifier
clf.fit(train_features_norm, train_label)
# predict test classes
test_pred = clf.predict(test_features_norm)
# print test pred and real labels shape
print(test_pred.shape, test_label.shape)

(12545,) (12545,)


#### SVM

TODO

In [15]:
from sklearn.svm import SVC
# initialize classifier
clf = SVC(kernel="linear")
# train classifier
clf.fit(train_features_norm, train_label)
# predict test classes
test_pred = clf.predict(test_features_norm)
# print test pred and real labels shape
print(test_pred.shape, test_label.shape)

(12545,) (12545,)


#### K-Means

TODO

In [16]:
from sklearn.cluster import KMeans
# initialize kmeans
clustering = KMeans(n_clusters=2)
# fit kmeans (note that we do not need to use labels here)
# and predict train classes using the clusters created
train_pred = clustering.fit_predict(train_features_norm)
# print train pred and real labels shape
print(train_pred.shape, train_label.shape)

(12546,) (12546,)


#### DBScan

TODO

### Scikit-Multiflow

TODO

#### KNN

TODO

In [43]:
from skmultiflow.lazy.knn_adwin import KNNAdwin
from skmultiflow.data import DataStream
# initialize classifier
clf = KNNAdwin(n_neighbors=3) # disable_weighted_vote=False produces a bug
# fit classifier
clf.partial_fit(train_features_norm, train_label)

KNNAdwin(leaf_size=30, max_window_size=9223372036854775807, n_neighbors=3,
         nominal_attributes=None)

In [None]:
# create a stream with test features
stream = DataStream(test_features_norm, test_label)
# prepare stream for use
stream.prepare_for_use()
# create prediction array
test_pred = []
# iterate over stream
while stream.has_more_samples():
    # get next sample features and label from stream
    sample_features, sample_label = stream.next_sample(1)
    # predict sample
    sample_pred = clf.predict(sample_features)
    # add predicted labels to test_pred
    for l in sample_pred:
        test_pred.append(l)
    # update model with new sample
    clf.partial_fit(sample_features, sample_label)
# turn test_pred into numpy array
test_pred = np.array(test_pred)
print(test_pred.shape, test_label.shape)

#### Random Forest

TODO

In [31]:
from skmultiflow.meta import AdaptiveRandomForest
from skmultiflow.data import DataStream
# initialize classifier
clf = AdaptiveRandomForest(n_estimators=10, disable_weighted_vote=True) # disable_weighted_vote=False produces a bug
# fit classifier
clf.partial_fit(train_features_norm, train_label)

AdaptiveRandomForest(binary_split=False, disable_weighted_vote=True,
                     drift_detection_method=ADWIN(delta=0.001), grace_period=50,
                     lambda_value=6, leaf_prediction='nba',
                     max_byte_size=33554432, max_features=18,
                     memory_estimate_period=2000000, n_estimators=10,
                     nb_threshold=0, no_preprune=False, nominal_attributes=None,
                     performance_metric='acc', random_state=None,
                     remove_poor_atts=False, split_confidence=0.01,
                     split_criterion='info_gain', stop_mem_management=False,
                     tie_threshold=0.05,

In [32]:
# create a stream with test features
stream = DataStream(test_features_norm, test_label)
# prepare stream for use
stream.prepare_for_use()
# create prediction array
test_pred = []
# iterate over stream
while stream.has_more_samples():
    # get next sample features and label from stream
    sample_features, sample_label = stream.next_sample(1)
    # predict sample
    sample_pred = clf.predict(sample_features)
    # add predicted labels to test_pred
    for l in sample_pred:
        test_pred.append(l)
    # update model with new sample
    clf.partial_fit(sample_features, sample_label)
# turn test_pred into numpy array
test_pred = np.array(test_pred)
print(test_pred.shape, test_label.shape)

(12545,) (12545,)


#### Drift Detectors

TODO

In [31]:
from skmultiflow.trees import HoeffdingTree
# initialize classifier
clf = HoeffdingTree()
# fit classifier
clf.partial_fit(train_features_norm, train_label)
# initialize classifier 2
clf2 = HoeffdingTree()

In [32]:
from skmultiflow.drift_detection import EDDM, DDM
# initialize drift detector
drift = DDM()

In [33]:
from skmultiflow.data import DataStream
from skmultiflow.core import clone
# create a stream with test features
stream = DataStream(test_features_norm, test_label)
# prepare stream for use
stream.prepare_for_use()
# create prediction array
test_pred = []
# counter
count = 0
# drift points
drifts = []
# iterate over stream
while stream.has_more_samples():
    # get next sample features and label from stream
    sample_features, sample_label = stream.next_sample(1)
    # increase counter
    count += 1
    # predict sample
    sample_pred = clf.predict(sample_features)
    # add predicted labels to test_pred
    for l in sample_pred:
        test_pred.append(l)
    # add element to drift detector
    for e in sample_label == sample_pred:
        drift.add_element(e)
    # detect if warning or drift
    if drift.detected_warning_zone():
        # update classifier 2
        clf2.partial_fit(sample_features, sample_label)
    if drift.detected_change():
        # save drift point to array
        drifts.append(count)
        # change classifiers
        clf = clone(clf2)
        # initialize classifier 2 again
        clf2 = HoeffdingTree()
        # reset drift detector
        drift.reset()
    # update model with new sample
    clf.partial_fit(sample_features, sample_label)
# turn test_pred into numpy array
test_pred = np.array(test_pred)

In [35]:
# print drift points
print("Detected drifts in points {}.".format((drifts)))
# print shape
print(test_pred.shape, test_label.shape)

Detected drifts in points [19936, 20938, 21126, 21183, 24425, 24493, 24536, 24729].
(25090,) (25090,)


### Keras

TODO

#### Multi-Layer Perceptron

TODO

In [42]:
import keras
from keras.models import Sequential
from keras.layers import Dense
# converts labels to a categorical one-hot-vector
train_label_onehot = keras.utils.to_categorical(train_label, num_classes=2)
test_label_onehot = keras.utils.to_categorical(test_label, num_classes=2)
# initialize sequential network
model = Sequential()
# add fully-connected hidden layer with 200 units
model.add(Dense(200, activation='relu', input_dim=train_features_norm.shape[1]))
# add fully-connected hidden layer with 100 units
model.add(Dense(100, activation='relu'))
# output layer
model.add(Dense(2, activation='softmax'))
# compile model
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
# fit model with data
model.fit(train_features_norm, train_label_onehot, validation_split=0.33, epochs=10, batch_size=128)
# predict classes
test_pred = model.predict_classes(test_features_norm)
print(test_pred.shape, test_label.shape)

Train on 8405 samples, validate on 4141 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
(12545,) (12545,)


## References

[Bishop 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.

[Michie et al. 1994] Michie, D., Spiegelhalter, D. J., Taylor, C. C., and Campbell, J., editors (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, Upper Saddle River, NJ, USA.

---

[**<< Previous Section**](04_features.ipynb) | [**Next Section**](06_evaluation.ipynb)