# Section 6: Evaluation

Knowing how to evaluate whether a model solves the problem is as important as creating a good representation of the problem and a good model, especially in the security area. It is common to see works in the literature that try to reach an accuracy near 100% in malware classification problems, i.e., they evaluate only whether the samples, regardless of their classes and data distribution, were correctly classified, which may not be the best way to evaluate the problem [Ceschin et al. 2018]. For example, consider a model evaluated using ten samples: eight of them are benign and two are malign. Suppose this model has an accuracy of 80%. Is 80% a good accuracy? If the model correctly classifies only the eight benign samples, it is not capable of identifying any malign sample and yet it has an accuracy of 80%, giving the false impression that the model works well. In binary problems in general, a machine learning model must be robust enough to identify the patterns of every class, given that we do not want a model that classifies everything as belonging to a single class. In this section we present the most common metrics used in security problems, as well as methods to validate the solutions.
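
The ten-sample example above can be checked with a few lines of plain Python. This is a minimal sketch: the label encoding (0 for benign, 1 for malign) and the variable names are assumptions made for illustration and are not tied to the dataset used later.

```python
# Toy version of the ten-sample example: 8 benign (0) and 2 malign (1).
# The 0/1 encoding is an assumption made for this illustration.
true_labels = [0] * 8 + [1] * 2
# A model that labels every sample as benign.
predictions = [0] * 10

# Accuracy: fraction of samples whose prediction matches the real label.
correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)

# Fraction of malign samples that were actually detected.
malign_total = sum(t == 1 for t in true_labels)
malign_found = sum(t == 1 and p == 1 for t, p in zip(true_labels, predictions))
detection_rate = malign_found / malign_total

print(accuracy)        # 0.8
print(detection_rate)  # 0.0
```

The accuracy looks acceptable, while the detection rate shows that no malign sample was found.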

Before starting, we are going to extract features from the Brazilian Malware dataset again, as in the previous section. First, we read the CSV using pandas:

In [1]:
import pandas as pd
# dataset location
data_path = "./datasets/brazilian-malware.csv"
# read CSV dataset
data = pd.read_csv(data_path, keep_default_na=False)

Then, we list all the features of the dataset, categorizing them into numerical (real and integer numbers) and textual (libraries, system calls, compilers, etc). We also get the label and select the columns to be ignored. 

In [2]:
# numerical attributes
NUMERICAL_ATTRIBUTES = ['BaseOfCode', 'BaseOfData', 'Characteristics', 'DllCharacteristics', 
                      'Entropy', 'FileAlignment', 'ImageBase', 'Machine', 'Magic',
                      'NumberOfRvaAndSizes', 'NumberOfSections', 'NumberOfSymbols', 'PE_TYPE',
                      'PointerToSymbolTable', 'Size', 'SizeOfCode', 'SizeOfHeaders',
                      'SizeOfImage', 'SizeOfInitializedData', 'SizeOfOptionalHeader',
                      'SizeOfUninitializedData']

# textual attributes
TEXTUAL_ATTRIBUTES = ['Identify', 'ImportedDlls', 'ImportedSymbols']

# label used to classify
LABEL = 'Label'

# attributes that are not used
UNUSED_ATTRIBUTES = ['FirstSeenDate', 'SHA1', 'TimeDateStamp']

Then, we get the labels and remove unused attributes:

In [3]:
label = data[LABEL].values
# remove unused attributes and label
for a in UNUSED_ATTRIBUTES:
    del data[a]
del data[LABEL]

After that, we split the dataset in half (the first half is the training set and the second half, the testing set):

In [4]:
# split data in half
def split_data(data):
    # get mid of data
    mid = int((len(data) + 1)/2)
    # split data into train and test
    train_data = data[:mid]
    test_data = data[mid:]
    # return train and test data
    return(train_data, test_data)

In [5]:
# data, _ = split_data(data)
# label, _ = split_data(label)
train_data, test_data = split_data(data)
train_label, test_label = split_data(label)

Here we create a function to extract features, given a training set, a testing set and a feature extractor:

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# extract features from textual attributes
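def textual_feature_extraction(train_data, test_data, extractor=None):
    # build a fresh vectorizer per call unless one is given (avoids sharing a default instance)
    vectorizer = extractor if extractor is not None else TfidfVectorizer(max_features=100)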
    # train vectorizer
    vectorizer.fit(train_data)
    # transform train and test data to features
    train_features = vectorizer.transform(train_data)
    test_features = vectorizer.transform(test_data)
    # return train and test features
    return(train_features, test_features)

Obtain numerical attributes:

In [7]:
train_features = train_data[NUMERICAL_ATTRIBUTES].values
test_features = test_data[NUMERICAL_ATTRIBUTES].values

In [8]:
train_features.shape, test_features.shape

((25091, 21), (25090, 21))

Obtain textual attributes and append to features array already initialized:

In [9]:
import numpy as np
# extract features from each textual attribute
for a in TEXTUAL_ATTRIBUTES:
    # extract features from current attribute
    train_texts, test_texts = textual_feature_extraction(train_data[a], test_data[a])
    train_features = np.concatenate((train_features, train_texts.toarray()), axis=1)
    test_features = np.concatenate((test_features, test_texts.toarray()), axis=1)

In [10]:
train_features.shape, test_features.shape

((25091, 321), (25090, 321))

Then, we create a normalization function, which normalizes both the training and testing sets, given a scaler:

In [11]:
from sklearn.preprocessing import MinMaxScaler
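def normalization(train_features, test_features, scaler=None):
    # build a fresh scaler per call unless one is given (avoids sharing a default instance)
    if scaler is None:
        scaler = MinMaxScaler()
    # train minmax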
    scaler.fit(train_features)
    # transform features
    train_features_norm = scaler.transform(train_features)
    test_features_norm = scaler.transform(test_features)
    # return normalized train and test features
    return(train_features_norm, test_features_norm)

In [12]:
train_features_norm, test_features_norm = normalization(train_features, test_features)
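
To make the role of the scaler concrete, the sketch below applies min-max scaling by hand to a single toy feature. It assumes the usual min-max formula $x' = \frac{x - min}{max - min}$, with the minimum and maximum taken from the training values only; the numbers and the helper name `min_max` are illustrative.

```python
# Toy values for one feature; the numbers are made up for illustration.
train_values = [2.0, 4.0, 6.0, 10.0]
test_values = [4.0, 12.0]

# "Fit": learn the range from the training set only.
lo, hi = min(train_values), max(train_values)

def min_max(value, lo, hi):
    # Rescale so that the training minimum maps to 0 and the maximum to 1.
    return (value - lo) / (hi - lo)

# "Transform": apply the same training range to both sets.
train_scaled = [min_max(v, lo, hi) for v in train_values]
test_scaled = [min_max(v, lo, hi) for v in test_values]

print(train_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.25]
```

Because the range comes from the training set, a test value outside that range (12.0 here) falls outside the interval [0, 1] after scaling.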

Finally, we train a Random Forest classifier and predict the test labels:

In [13]:
from sklearn.ensemble import RandomForestClassifier
# initialize classifier
clf = RandomForestClassifier(n_estimators=10, random_state=0)
# train classifier
clf.fit(train_features_norm, train_label)
# predict test classes
test_pred = clf.predict(test_features_norm)
# print test pred and real labels shape
print(test_pred.shape, test_label.shape)

(25090,) (25090,)


From now on we have the real labels (*test_label*) and the predicted ones (*test_pred*) of the testing set.

## Metrics

To correctly evaluate a model, we need to choose the right metrics, given that each one presents a different perspective of the problem (and may still give false impressions about it, as in the previous example). In this course we present accuracy, confusion matrix, recall, precision and F1-Score.

### Accuracy

Accuracy measures the percentage of samples that the model correctly classified from a certain set (usually, the testing set), i.e., the number of correctly classified samples divided by the total number of samples. Scikit-Learn implements this metric through the function [*accuracy_score*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), which receives the real labels of the set and the predictions made by the classifier, as shown in the code below. This metric is not recommended for problems that use imbalanced datasets, given that it can be high even when the classifier favors the majority class and misses all the samples from the minority classes [Gron 2017].

In [14]:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_label, test_pred))

0.8836986847349542


### Confusion Matrix

The confusion matrix is a way to better visualize the results generated by a classifier. Given a confusion matrix $C$, $C_{i,j}$ is equal to the number of samples that belong to class $i$ and were predicted by the classifier as being from class $j$. In binary classification we can extract the following information from the confusion matrix [Pedregosa et al. 2011]:

* **True Negatives (TN):** $C_{0,0}$.
* **False Negatives (FN):** $C_{1,0}$.
* **True Positives (TP):** $C_{1,1}$.
* **False Positives (FP):** $C_{0,1}$.

The code below computes a confusion matrix using [the function *confusion_matrix* from Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), which receives the real labels of the set and the predictions made by the classifier. It is possible to observe that the number of True Negatives (TN) is $6987$ ($C_{0,0}$), False Negatives (FN) is $2872$ ($C_{1,0}$), True Positives (TP) is $15185$ ($C_{1,1}$) and False Positives (FP) is only $46$ ($C_{0,1}$). From this information, two new measures can be extracted (recall and precision), as well as a metric that combines both (F1-Score).

In [15]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_label, test_pred))

[[ 6987    46]
 [ 2872 15185]]
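
As a quick check, the four cells of the matrix printed above can be unpacked and used to recompute the accuracy reported earlier. This is a small sketch in plain Python; the variable names are chosen for illustration.

```python
# Confusion matrix printed above: rows are real classes, columns are predicted classes.
matrix = [[6987, 46],
          [2872, 15185]]

# Unpack the four cells following the C[i][j] convention (i = real, j = predicted).
tn, fp = matrix[0]
fn, tp = matrix[1]

total = tn + fp + fn + tp
# Accuracy counts the diagonal (correct predictions) over all samples.
accuracy = (tn + tp) / total

print(total)               # 25090
print(round(accuracy, 4))  # 0.8837
```

The total matches the size of the testing set, and the accuracy matches the value returned by *accuracy_score*.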


### Recall

Recall, also called sensitivity or True Positive Rate (TPR), is the proportion of positive instances that are correctly classified by the model, i.e., the ability of the classifier to find all the positive samples [Gron 2017, Pedregosa et al. 2011]. It uses the number of true positives (TP) and false negatives (FN) and is given by the equation $Recall = \frac{TP}{TP+FN}$. Scikit-Learn implements this metric through the function [*recall_score*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html), which receives the real labels of the set and the predictions made by the classifier, as shown in the code below. In our example, we obtained a recall of around 84%, indicating that the model labels some malware as goodware (around 16%).

In [16]:
from sklearn.metrics import recall_score
print(recall_score(test_label, test_pred))

0.8409481087666832
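
Plugging the counts from the confusion matrix of this section ($TP = 15185$, $FN = 2872$) into the recall equation reproduces the value printed above:

$$Recall = \frac{TP}{TP+FN} = \frac{15185}{15185 + 2872} = \frac{15185}{18057} \approx 0.8409$$

The complement, $1 - Recall \approx 0.1591$, is the share of malware samples that the model labeled as goodware.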


### Precision

Precision measures the accuracy of the positive predictions of the classifier, i.e., the ability of the classifier not to label as positive a sample that is negative [Gron 2017, Pedregosa et al. 2011]. It uses the number of true positives (TP) and the number of false positives (FP), as shown by the equation $Precision = \frac{TP}{TP+FP}$. Scikit-Learn also implements this metric through the function [*precision_score*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html), which receives the real labels of the set and the predictions made by the classifier, as shown in the code below. In our example, we obtained a precision of around 99%, indicating that few goodware samples are labeled as malware (less than 1% of the positive predictions).

In [17]:
from sklearn.metrics import precision_score
print(precision_score(test_label, test_pred))

0.9969798437397414
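
The same check works for precision, using $TP = 15185$ and $FP = 46$ from the confusion matrix of this section:

$$Precision = \frac{TP}{TP+FP} = \frac{15185}{15185 + 46} = \frac{15185}{15231} \approx 0.9970$$

The complement, $1 - Precision \approx 0.0030$, is the share of samples labeled as malware that are actually goodware.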


### F1-Score

F1-Score combines recall and precision into a single metric, which is the harmonic mean of both, given by the equation $\textit{F1Score} = \frac{2*(Precision*Recall)}{Precision + Recall}$. Scikit-Learn implements this metric through the function [*f1_score*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), which receives the real labels of the set and the predictions made by the classifier, as shown in the code below. In our problem, we obtained an F1-Score of around 91%.

In [18]:
from sklearn.metrics import f1_score
print(f1_score(test_label, test_pred))

0.9123407834655132
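
The sketch below recomputes the F1-Score from the precision and recall values reported in this section and compares it with a plain arithmetic mean. The second case uses hypothetical values (perfect precision, very low recall) chosen only to show how the harmonic mean reacts when the two measures are far apart.

```python
# Precision and recall values reported earlier in this section.
precision = 0.9969798437397414
recall = 0.8409481087666832

# Harmonic mean of the two values (the F1-Score).
f1 = 2 * (precision * recall) / (precision + recall)
# Arithmetic mean, shown only for comparison.
arithmetic = (precision + recall) / 2

print(round(f1, 4))          # 0.9123
print(round(arithmetic, 4))  # 0.919

# Hypothetical extreme case: perfect precision but very low recall.
f1_extreme = 2 * (1.0 * 0.1) / (1.0 + 0.1)
print(round(f1_extreme, 4))  # 0.1818
```

In the extreme case the arithmetic mean would be 0.55, while the harmonic mean stays close to the lower of the two values.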


Although F1-Score is an interesting summary, it is always recommended to report recall and precision individually, since in some problems one can be more important than the other and F1-Score may not show this need [Gron 2017]. For a malware detector, for example, it may be preferable to miss a malware sample than to block benign software, which calls for high precision.

### Using Thresholds

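By default, *predict* assigns each sample to the class with the highest predicted probability, which in a binary problem roughly corresponds to a threshold of 0.5 on the probability of the positive class. Using *predict_proba*, we can choose a different threshold and trade recall for precision. In the code below, a sample is labeled as malware only when its predicted probability is greater than 0.7: compared to the default predictions, precision rises slightly (it stays above 99%), while recall drops from around 84% to around 73%. Then, we compute recall and precision for several thresholds and plot both curves.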

In [19]:
threshold = 0.7
test_pred_proba = clf.predict_proba(test_features_norm)
test_pred = (test_pred_proba[:,1] > threshold).astype('int')
print("Accuracy:",accuracy_score(test_label, test_pred))
print("Recall:", recall_score(test_label, test_pred))
print("Precision:", precision_score(test_label, test_pred))
print("F1Score:", f1_score(test_label, test_pred))

Accuracy: 0.8076524511757672
Recall: 0.7345073932546935
Precision: 0.9975930801053028
F1Score: 0.8460704261291146
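
The effect of the threshold can also be seen on a handful of hand-made scores. The sketch below is a toy illustration, not the Random Forest used above: the probabilities, the labels and the helper name `precision_recall_at` are all assumptions made for this example.

```python
# Made-up probabilities of the positive class and the real labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.65, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    1,    0,    1,    1,    0,    0]

def precision_recall_at(threshold, scores, labels):
    # A sample is predicted positive when its score reaches the threshold.
    # Assumes at least one positive prediction and one positive label.
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, labels))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

for threshold in (0.5, 0.7):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(threshold, precision, recall)
# 0.5 0.8 0.8
# 0.7 1.0 0.6
```

Raising the threshold removes the only false positive (precision goes up) but also discards one true positive (recall goes down).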


In [20]:
recall = []
precision = []
thresholds = []
# probabilities do not depend on the threshold, so compute them once
test_pred_proba = clf.predict_proba(test_features_norm)
for threshold in np.arange(0,1.01,0.025):
    test_pred = (test_pred_proba[:,1] >= threshold).astype('int')
    thresholds.append(threshold)
    recall.append(recall_score(test_label, test_pred))
    precision.append(precision_score(test_label, test_pred))

In [21]:
import matplotlib.pyplot as plt
plt.plot(thresholds, recall, label="Recall", color="goldenrod")
plt.plot(thresholds, precision, label="Precision", color="green")
plt.xticks(np.arange(0,1.01,0.1))
plt.yticks(np.arange(0.5,1.01,0.05))
plt.xlabel("Threshold")
plt.ylabel("Recall and Precision")
plt.title("Recall and Precision vs. Threshold")
# add the legend before saving so that it appears in the saved files
plt.legend()
plt.savefig("rp_graph.png", dpi=300)
plt.savefig("rp_graph.pdf")
plt.show()

<Figure size 640x480 with 1 Axes>

## Validation

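A single split into training and testing sets produces only one estimate of how well a model performs. Validation methods evaluate the model on several different splits of the data, which gives a better idea of how much the results vary.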

### K-Fold Cross Validation

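In k-fold cross validation, the data is divided into $k$ folds; the model is trained $k$ times, each time using $k-1$ folds for training and the remaining fold for testing, which produces one score per fold. Scikit-Learn implements it through the function [*cross_val_score*](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), which receives the classifier, the features, the labels, the number of folds (*cv*) and the metric (*scoring*). In the code below we use ten folds on the training set: the mean accuracy is around 93%, but the first fold reaches only around 58%, which shows how much the result can depend on the split.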

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# initialize classifier
clf = RandomForestClassifier(n_estimators=10)
# get results
results = cross_val_score(clf, train_features_norm, train_label, cv=10, scoring="accuracy")
# print accuracy per fold
print(results)
# print mean of accuracy
print(np.mean(results))

[0.57609562 0.92430279 0.99561753 0.99402152 0.99481865 0.98326026
 0.9920287  0.9039458  0.98963317 0.98405104]
0.9337775077032392
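
The idea behind k-fold cross validation can be sketched without any library: the samples are divided into $k$ folds, and each fold is used once for testing while the remaining ones are used for training. The helper `k_fold_indices` below is hypothetical and simplified (contiguous folds, no shuffling or stratification); it is not how Scikit-Learn implements *cross_val_score*.

```python
def k_fold_indices(n_samples, k):
    # Yield (train_indices, test_indices) for each of the k folds.
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Example with 10 samples and 5 folds: print the test and train indices of each fold.
for train_idx, test_idx in k_fold_indices(10, 5):
    print(test_idx, train_idx)
```

Every sample appears in exactly one test fold, so each one is used for testing once and for training $k-1$ times.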


## References

[Ceschin et al. 2018] Ceschin, F., Pinage, F., Castilho, M., Menotti, D., Oliveira, L. S., and Gregio, A. (2018). The need for speed: An analysis of Brazilian malware classifiers. IEEE Security & Privacy, 16(6):31–41.

[Pedregosa et al. 2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

---

[**<< Previous Section**](05_models.ipynb) | **Next Section >>**