In [1]:
! wget "https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/raw/master/Chapter02/Resources/test_data.npz"
! wget "https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/raw/master/Chapter02/Resources/training_data.npz"
! wget "https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/raw/master/Chapter02/Resources/test_labels.npy"
! wget "https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/raw/master/Chapter02/Resources/training_labels.npy"

--2021-05-03 09:50:11--  https://github.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/raw/master/Chapter02/Resources/test_data.npz
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/master/Chapter02/Resources/test_data.npz [following]
--2021-05-03 09:50:11--  https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-for-Cybersecurity-Cookbook/master/Chapter02/Resources/test_data.npz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38887 (38K) [application/octet-stream]
Saving to: ‘test_data.npz’


2021-05-03 09:50

In [3]:
import numpy as np
from scipy import sparse

# Load the pre-featurized sparse feature matrices and their label vectors.
X_train = sparse.load_npz("training_data.npz")
y_train = np.load("training_labels.npy")
X_test = sparse.load_npz("test_data.npz")
y_test = np.load("test_labels.npy")

# Maximum acceptable false positive rate (1%).
desired_FPR = 0.01

In [4]:
from sklearn.metrics import confusion_matrix

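def FPR(y_true, y_pred):
    """False positive rate: fraction of class-0 samples predicted as class 1."""
    # labels=[0, 1] keeps the matrix 2x2 even if one class is absent.
    CM = confusion_matrix(y_true, y_pred, labels=[0, 1])
    TN = CM[0][0]
    FP = CM[0][1]
    return FP / (FP + TN)

def TPR(y_true, y_pred):
    """True positive rate: fraction of class-1 samples predicted as class 1."""
    CM = confusion_matrix(y_true, y_pred, labels=[0, 1])
    TP = CM[1][1]
    FN = CM[1][0]
    return TP / (TP + FN)

def perform_thresholding(vector, threshold):
    """Predict 0 where the class-0 probability is >= threshold, otherwise 1."""
    return [0 if x >= threshold else 1 for x in vector]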

In [7]:
from xgboost import XGBClassifier

clf = XGBClassifier()
clf.fit(X_train, y_train)
clf_pred_prob = clf.predict_proba(X_train)



In [8]:
print("Probabilities look like so: ")
print(clf_pred_prob[0:5])
print()

Probabilities look like so: 
[[9.9696845e-01 3.0315337e-03]
 [9.9934214e-01 6.5786147e-04]
 [9.9936205e-01 6.3797331e-04]
 [9.9046874e-01 9.5312512e-03]
 [9.1151476e-01 8.8485263e-02]]



In [10]:
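M = 1000  # number of candidate thresholds, evenly spaced in [0, 1)
print("fitting threshold")
# Scan from the highest threshold downward. Lowering the threshold on the
# class-0 probability flags fewer samples as class 1, so the FPR falls.
for t in reversed(range(M)):
    scaled_threshold = t / M
    thresholded_prediction = perform_thresholding(clf_pred_prob[:, 0], scaled_threshold)
    # Compute each rate once and reuse it for both the log line and the check.
    fpr = FPR(y_train, thresholded_prediction)
    tpr = TPR(y_train, thresholded_prediction)
    print(t, fpr, tpr)
    if fpr <= desired_FPR:
        print()
        print("Selected threshold: ")
        print(scaled_threshold)
        break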

fitting threshold
999 0.4636363636363636 1.0
998 0.32727272727272727 1.0
997 0.2636363636363636 1.0
996 0.19545454545454546 1.0
995 0.16818181818181818 1.0
994 0.16363636363636364 1.0
993 0.15 1.0
992 0.1409090909090909 1.0
991 0.12727272727272726 1.0
990 0.12272727272727273 1.0
989 0.11363636363636363 1.0
988 0.10909090909090909 1.0
987 0.10909090909090909 1.0
986 0.10454545454545454 1.0
985 0.10454545454545454 1.0
984 0.1 1.0
983 0.1 1.0
982 0.1 1.0
981 0.1 1.0
980 0.1 1.0
979 0.09545454545454546 1.0
978 0.09545454545454546 1.0
977 0.09545454545454546 1.0
976 0.07727272727272727 1.0
975 0.07727272727272727 1.0
974 0.07727272727272727 1.0
973 0.07727272727272727 1.0
972 0.07727272727272727 1.0
971 0.07727272727272727 1.0
970 0.07727272727272727 1.0
969 0.07727272727272727 1.0
968 0.07727272727272727 1.0
967 0.07727272727272727 1.0
966 0.07727272727272727 1.0
965 0.07727272727272727 1.0
964 0.07727272727272727 1.0
963 0.07727272727272727 1.0
962 0.07727272727272727 1.0
961 0.0772727272

We begin this recipe by loading a previously featurized dataset and specifying a desired
FPR constraint of 1% (step 1). The value to use in practice depends heavily on the
situation and on the type of file being considered. Several considerations apply. If the
file type is extremely common but rarely malicious, such as PDF, the desired FPR will have
to be set very low, for example, 0.01%.
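A back-of-the-envelope calculation may help show why a common file type pushes the FPR
so low. The sketch below is illustrative only: the file volume and the helper name
`expected_false_alarms` are assumptions, not part of the recipe.

```python
def expected_false_alarms(n_benign_files, fpr):
    """Expected number of benign files that get flagged as malicious."""
    return n_benign_files * fpr

# Hypothetical volume: one million benign PDFs scanned.
n_benign = 1_000_000

for fpr in (0.01, 0.0001):  # 1% versus 0.01%
    alarms = expected_false_alarms(n_benign, fpr)
    print(f"FPR={fpr:.2%}: ~{alarms:,.0f} false alarms")
```

Under these assumed numbers, a 1% FPR yields about 10,000 false alarms, while 0.01%
yields about 100.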
If the system is backed by additional systems that double-check its verdicts without
human effort, then a higher FPR might not be detrimental. Finally, a customer may have a
preference, which will suggest a value.
We define a pair of convenience functions for FPR and TPR in step 2; these functions are
handy and reusable. Another convenience function takes a threshold value and uses it to
threshold a numerical vector (step 3).
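To make the two rates concrete, here is a small worked example on made-up data, with
FPR = FP / (FP + TN) and TPR = TP / (TP + FN). It uses plain Python lists instead of
scikit-learn so that it runs on its own. The helper name `fpr_tpr` and the six toy
scores are assumptions for illustration; the thresholding rule mirrors
`perform_thresholding`.

```python
def fpr_tpr(y_true, y_pred):
    """Return (FPR, TPR) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

# Toy class-0 probabilities and true labels (made up for this example).
p_class0 = [0.99, 0.95, 0.40, 0.80, 0.10, 0.30]
y_true = [0, 0, 0, 0, 1, 1]

# Same rule as perform_thresholding: predict 0 if P(class 0) >= threshold.
threshold = 0.5
y_pred = [0 if p >= threshold else 1 for p in p_class0]

fpr, tpr = fpr_tpr(y_true, y_pred)
print(y_pred)    # [0, 0, 1, 0, 1, 1]
print(fpr, tpr)  # 0.25 1.0
```

One of the four class-0 samples falls below the threshold and is flagged, giving an FPR
of 0.25, while both class-1 samples are caught, giving a TPR of 1.0.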
In step 4, we train a model on the training data and compute prediction probabilities on
the training set as well. You can see what these look like in step 5: each row holds the
probabilities of class 0 and class 1. When a large dataset is available, using a separate
validation set to determine the threshold will reduce the likelihood of overfitting.
Finally, we compute the threshold to be used in future classification so that the FPR
constraint is satisfied on the data used to fit it (step 6).
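The validation-set idea can be sketched as follows. This is a minimal illustration, not
the recipe's code: the scores are synthetic stand-ins for `clf.predict_proba(X)[:, 0]`,
drawn from arbitrary Beta distributions, and the names `select_threshold`,
`apply_threshold`, and `synthetic_scores` are hypothetical. The threshold is chosen on
one split and then checked on a second, held-out split.

```python
import random

def fpr_tpr(y_true, y_pred):
    """Return (FPR, TPR) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

def apply_threshold(p_class0, threshold):
    """Predict 0 when P(class 0) >= threshold, otherwise 1."""
    return [0 if p >= threshold else 1 for p in p_class0]

def select_threshold(p_class0, y_true, desired_fpr, steps=1000):
    """Scan thresholds from high to low; return the first meeting desired_fpr."""
    for t in reversed(range(steps)):
        threshold = t / steps
        fpr, _ = fpr_tpr(y_true, apply_threshold(p_class0, threshold))
        if fpr <= desired_fpr:
            return threshold
    return 0.0  # a threshold of 0 predicts class 0 everywhere, so FPR is 0

def synthetic_scores(n_neg, n_pos, rng):
    """Made-up class-0 probabilities: high for class 0, low for class 1."""
    scores = [rng.betavariate(8, 2) for _ in range(n_neg)]
    scores += [rng.betavariate(2, 8) for _ in range(n_pos)]
    labels = [0] * n_neg + [1] * n_pos
    return scores, labels

rng = random.Random(0)
p_val, y_val = synthetic_scores(500, 500, rng)    # validation split
p_hold, y_hold = synthetic_scores(500, 500, rng)  # held-out split

desired_fpr = 0.01
threshold = select_threshold(p_val, y_val, desired_fpr)
val_fpr, val_tpr = fpr_tpr(y_val, apply_threshold(p_val, threshold))
hold_fpr, hold_tpr = fpr_tpr(y_hold, apply_threshold(p_hold, threshold))

print("threshold:", threshold)
print("validation FPR/TPR:", val_fpr, val_tpr)
print("held-out FPR/TPR:", hold_fpr, hold_tpr)
```

The constraint holds by construction on the split used for selection. On the held-out
split the FPR can land slightly above or below the target, which is why it is worth
measuring on data that played no part in choosing the threshold.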
