Clustering is a family of unsupervised machine learning algorithms that group data points
by similarity. For example, a cluster might consist of points that lie close together in
n-dimensional Euclidean space. Clustering is useful in cybersecurity for distinguishing
between normal and anomalous network activity, and for helping to classify malware into
families.
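
To make "close together in Euclidean space" concrete, here is a minimal sketch that assigns points to the nearest of two centers. The points and centers are hypothetical values chosen for illustration; they are not taken from the PE header dataset.

```python
import numpy as np

# Hypothetical 2-D points: two tight groups that sit far apart.
points = np.array([
    [1.0, 1.0],
    [1.2, 0.8],
    [0.9, 1.1],
    [8.0, 8.0],
    [8.1, 7.9],
])

# Hypothetical group centers, picked by hand for this illustration.
centers = np.array([
    [1.0, 1.0],
    [8.0, 8.0],
])

# Pairwise Euclidean distances, shape (n_points, n_centers).
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point joins the group whose center is nearest.
assignments = distances.argmin(axis=1)
print(assignments)  # [0 0 0 1 1]
```

K-means, used below, repeats this assignment step while also moving the centers, rather than fixing them by hand.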

In [1]:
import pandas as pd
import plotly.express as px

In [2]:
df = pd.read_csv("../input/file-pe-headers/file_pe.csv", sep=",")
df.head()

Unnamed: 0,Name,e_magic,e_cblp,e_cp,e_crlc,e_cparhdr,e_minalloc,e_maxalloc,e_ss,e_sp,...,SectionMaxChar,SectionMainChar,DirectoryEntryImport,DirectoryEntryImportSize,DirectoryEntryExport,ImageDirectoryEntryExport,ImageDirectoryEntryImport,ImageDirectoryEntryResource,ImageDirectoryEntryException,ImageDirectoryEntrySecurity
0,VirusShare_a878ba26000edaac5c98eff4432723b3,23117,144,3,0,4,0,65535,0,184,...,3758096608,0,7,152,0,0,54440,77824,73728,0
1,VirusShare_ef9130570fddc174b312b2047f5f4cf0,23117,144,3,0,4,0,65535,0,184,...,3791650880,0,16,311,0,0,262276,294912,0,346112
2,VirusShare_ef84cdeba22be72a69b198213dada81a,23117,144,3,0,4,0,65535,0,184,...,3221225536,0,6,176,0,0,36864,40960,0,0
3,VirusShare_6bf3608e60ebc16cbcff6ed5467d469e,23117,144,3,0,4,0,65535,0,184,...,3224371328,0,8,155,0,0,356352,1003520,0,14109472
4,VirusShare_2cc94d952b2efb13c7d6bbe0dd59d3fb,23117,144,3,0,4,0,65535,0,184,...,3227516992,0,2,43,0,0,61440,73728,0,90624


In [3]:
df.columns

Index(['Name', 'e_magic', 'e_cblp', 'e_cp', 'e_crlc', 'e_cparhdr',
       'e_minalloc', 'e_maxalloc', 'e_ss', 'e_sp', 'e_csum', 'e_ip', 'e_cs',
       'e_lfarlc', 'e_ovno', 'e_oemid', 'e_oeminfo', 'e_lfanew', 'Machine',
       'NumberOfSections', 'TimeDateStamp', 'PointerToSymbolTable',
       'NumberOfSymbols', 'SizeOfOptionalHeader', 'Characteristics', 'Magic',
       'MajorLinkerVersion', 'MinorLinkerVersion', 'SizeOfCode',
       'SizeOfInitializedData', 'SizeOfUninitializedData',
       'AddressOfEntryPoint', 'BaseOfCode', 'ImageBase', 'SectionAlignment',
       'FileAlignment', 'MajorOperatingSystemVersion',
       'MinorOperatingSystemVersion', 'MajorImageVersion', 'MinorImageVersion',
       'MajorSubsystemVersion', 'MinorSubsystemVersion', 'SizeOfHeaders',
       'CheckSum', 'SizeOfImage', 'Subsystem', 'DllCharacteristics',
       'SizeOfStackReserve', 'SizeOfStackCommit', 'SizeOfHeapReserve',
       'SizeOfHeapCommit', 'LoaderFlags', 'NumberOfRvaAndSizes', 'Malware',
       '

Let's plot the dataset.

In [4]:
fig = px.scatter_3d(
    df,
    x = "SuspiciousImportFunctions",
    y = "SectionsLength",
    z = "SuspiciousNameSection",
    color = "Malware"
)

fig.show()

Extract the features and target labels.

In [6]:
y = df["Malware"]
X = df.drop(["Name", "Malware"], axis=1).to_numpy()

Import scikit-learn's clustering module and fit a K-means model with two clusters to the data.

In [7]:
from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=len(set(y)))
estimator.fit(X)

KMeans(n_clusters=2)
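
K-means relies on Euclidean distance, so columns with very large numeric ranges can dominate columns holding small counts. The notebook fits the model on raw values, and the source does not report whether scaling changes the result on this dataset. The sketch below only illustrates the idea on a hypothetical matrix; `X_demo` and `standardize` are names introduced here, and scikit-learn's `StandardScaler` performs the same kind of rescaling.

```python
import numpy as np

# Hypothetical feature matrix: column 0 is a small count, column 1 is a
# large address-like value, loosely mimicking the mixed scales of PE fields.
X_demo = np.array([
    [2.0, 4_194_304.0],
    [3.0, 4_194_304.0],
    [40.0, 4_198_400.0],
    [42.0, 4_198_400.0],
])

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mean) / std

X_scaled = standardize(X_demo)

# Before scaling, the spread of column 1 dwarfs that of column 0.
print(X_demo.std(axis=0))
# After scaling, both columns contribute on the same footing.
print(X_scaled.std(axis=0))
```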

Predict the cluster of each sample using the fitted model.

In [8]:
y_pred = estimator.predict(X)
df["pred"] = y_pred
df["pred"] = df["pred"].astype("category")

Let's plot the clusters.

In [10]:
fig = px.scatter_3d(
    df,
    x = "SuspiciousImportFunctions",
    y = "SectionsLength",
    z = "SuspiciousNameSection",
    color = "pred"
)
fig.show()

We start by importing our dataset of PE header information from a collection of samples
(step 1). This dataset consists of two classes of PE files: malware and benign. We then use
plotly to create an interactive 3D scatter plot, colored by class (step 1). We proceed to
prepare our dataset for machine learning. Specifically, in step 2, we set X as the features
and y as the classes of the dataset. Because there are two classes, we aim to cluster the data
into two groups that match the sample classification. We use the K-means algorithm
(step 3), about which you can find more information at
https://en.wikipedia.org/wiki/K-means_clustering. With the model fitted, we assign each
sample to a cluster (step 4). Note that we predict on the same data the model was fitted on;
there is no separate testing set in this part. Plotting the clusters in step 5 and comparing
them with the class-colored plot from step 1, we see that clustering has captured much of the
underlying structure of the data. Keep in mind that the cluster ids 0 and 1 are arbitrary and
need not coincide with the values of the Malware column.
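
The visual comparison can be backed by a number. The sketch below measures how often cluster ids agree with class labels, trying both possible mappings because cluster ids are arbitrary. `two_cluster_agreement` is a hypothetical helper written for this illustration, and the demo arrays are made up; it assumes both labels and clusters are coded as 0/1.

```python
import numpy as np

def two_cluster_agreement(labels, clusters):
    """Fraction of samples whose cluster matches their label, taking the
    better of the two possible cluster-to-label mappings."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    direct = np.mean(labels == clusters)        # cluster 0 -> label 0
    flipped = np.mean(labels == 1 - clusters)   # cluster 0 -> label 1
    return max(direct, flipped)

# Hypothetical labels and cluster ids, not taken from the dataset.
labels_demo = [0, 0, 0, 1, 1, 1]
clusters_demo = [1, 1, 0, 0, 0, 0]
print(two_cluster_agreement(labels_demo, clusters_demo))
```

With the notebook's variables, the call would be `two_cluster_agreement(y, y_pred)`, assuming the Malware column holds 0/1 values.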

## Let's train an XGBoost classifier

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Create an instance of an XGBoost model and train it on the training set.

In [16]:
from xgboost import XGBClassifier

XGB_model_instance = XGBClassifier()

XGB_model_instance.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [17]:
from sklearn.metrics import accuracy_score

y_test_pred = XGB_model_instance.predict(X_test)
accuracy = accuracy_score(y_test, y_test_pred)
print("Accuracy: %.2f%%" % (accuracy * 100))

Accuracy: 99.41%


We begin by reading in our data (step 1). We then create a train-test split, holding out
30% of the samples for testing (step 2). We proceed to instantiate an XGBoost classifier
with default parameters and fit it to our training set (step 3). Finally, in step 4, we use
our XGBoost classifier to predict on the testing set and report the accuracy of its
predictions, 99.41% in this run.
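
Accuracy alone does not show how the errors split between missed malware and false alarms. The sketch below derives confusion counts, precision, and recall from true and predicted labels. `binary_report` is a hypothetical helper, the demo arrays are made up rather than notebook results, and the code assumes that 1 marks malware and 0 marks benign files.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Return confusion counts, precision, and recall for 0/1 labels,
    treating 1 as the positive (malware) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # malware caught
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # benign flagged
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # malware missed
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # benign passed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}

# Hypothetical labels and predictions, not results from the notebook.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 1]
print(binary_report(y_true_demo, y_pred_demo))
```

With the notebook's variables, the call would be `binary_report(y_test, y_test_pred)`. scikit-learn's `confusion_matrix`, `precision_score`, and `recall_score` in `sklearn.metrics` cover the same ground.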