This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based on domain knowledge. The inputs are still included to illustrate our machine learning process.

In [None]:
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Read in the dataset and explore its columns and data types.

In [None]:
df = pd.read_json(r"C:\Users\hanbrolo\Documents\2.05-9-to-10-DCE_RPC.json")

In [None]:
df.dtypes

Create a new dataset aggregated by origin host. For each origin host, this gives the number of unique response hosts it tried to connect to, the total number of connections, and the response host it contacted most often. The same summary is computed for the operation column. All of these may be useful features for clustering.

In [None]:
df_unique_connections = df.groupby(['id.orig_h'])['id.resp_h'].describe().sort_values(['unique'])

In [None]:
df_unique_operations = df.groupby(['id.orig_h'])['operation'].describe().sort_values(['unique'], ascending=True)

In [None]:
df_unique_connections.head(10)
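Because the original data is not included, the following is a minimal sketch of what the aggregation above produces, using a small made-up DataFrame. The addresses and the names toy and toy_summary are hypothetical; only the column names id.orig_h and id.resp_h come from the notebook.

```python
import pandas as pd

# Hypothetical toy records; the real dataset is not included in this notebook.
toy = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.2"],
    "id.resp_h": ["10.0.1.5", "10.0.1.5", "10.0.1.6", "10.0.1.5", "10.0.1.7"],
})

# describe() on a string column yields count, unique, top, and freq per group:
#   count  = total connections made by the origin host
#   unique = number of distinct response hosts
#   top    = most frequently contacted response host
#   freq   = number of connections to that host
toy_summary = toy.groupby("id.orig_h")["id.resp_h"].describe()
print(toy_summary)
```

The resulting columns typically come back with object dtype, which is why a numeric conversion is needed before any distance calculation.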

Choose the features to use for clustering: the number of unique response hosts and the total number of connections. Convert them to numeric types for the distance calculations.

In [None]:
cols = ['unique','count']

In [None]:
df_unique_connections['unique'] = pd.to_numeric(df_unique_connections['unique'])
df_unique_connections['count'] = pd.to_numeric(df_unique_connections['count'])

Normalize the features with a z-score so that both are on the same scale for the distance calculations.

In [None]:
df_tr_std = stats.zscore(df_unique_connections[cols])
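As a rough illustration of what the z-score step does, the sketch below standardizes a made-up feature matrix by hand and compares the result with scipy.stats.zscore. The array X and its values are hypothetical stand-ins for the unique and count columns.

```python
import numpy as np
from scipy import stats

# Hypothetical feature matrix: columns stand in for (unique, count) of four hosts.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 1000.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# scipy.stats.zscore defaults to axis=0 and ddof=0, so the two should agree.
scaled = stats.zscore(X)

print(np.allclose(manual, scaled))
```

After scaling, each column has mean 0 and standard deviation 1, so a feature with a large raw range (such as count) does not dominate the Euclidean distances that k-means relies on.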

Train the k-means model with 5 clusters (we tried several values and settled on this one). Assign each row to a cluster, then summarize each cluster to learn what kind of traffic it represents.

In [None]:
kmeans = KMeans(n_clusters=5, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
df_unique_connections['clusters'] = labels
# Build a new list instead of mutating cols, so re-running this cell is safe
summary_cols = cols + ['clusters']
df_unique_connections[summary_cols].groupby(['clusters']).agg(['mean','count'])
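The notebook does not record how the different cluster counts were compared. One common approach, shown here only as a sketch on synthetic data, is to fit k-means for a range of k and look for the "elbow" where inertia (within-cluster sum of squares) stops dropping sharply. The blobs, the range of k, and the names X and inertias are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical standardized features: three well-separated synthetic blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

# Fit k-means for each candidate k and record the inertia.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Inertia keeps shrinking as k grows; the useful k is where the drop levels off.
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this synthetic data the drop flattens after k=3, matching the three blobs. On real traffic the elbow is usually less distinct, so the choice still benefits from inspecting the per-cluster statistics.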

Visualize the clusters in a scatter plot to highlight which ones are anomalous.

In [None]:
label_color_map = {
    0:'b',
    1:'g',
    2:'r',
    3:'c',
    4:'m',
}
label_colors = [label_color_map[i] for i in df_unique_connections['clusters']]
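plt.scatter(df_unique_connections['unique'], df_unique_connections['count'], c=label_colors)
# Label the axes so the plot can be read without the surrounding code
plt.xlabel('unique response hosts')
plt.ylabel('total connections')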

Select an anomalous cluster (this one shows many total connections across just a few response hosts) and view the records in it for further analysis.

In [None]:
df_unique_connections[df_unique_connections.clusters == 1]
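K-means label numbers are arbitrary and can change with the data or the random seed, so a hard-coded label may point at a different cluster on another run. As a hedged alternative, the sketch below picks the cluster by the property described above: many total connections spread over few response hosts. The DataFrame hosts, its values, and the ratio heuristic are hypothetical and not part of the original analysis.

```python
import pandas as pd

# Hypothetical per-host summary with cluster labels already assigned.
hosts = pd.DataFrame(
    {
        "unique": [1, 2, 1, 40, 35, 2],
        "count": [5, 8, 6, 60, 50, 900],
        "clusters": [0, 0, 0, 1, 1, 2],
    },
    index=["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5", "10.0.0.6"],
)

# Mean connections per distinct response host, computed per cluster.
means = hosts.groupby("clusters")[["unique", "count"]].mean()
ratio = means["count"] / means["unique"]

# Select the cluster with the highest ratio instead of hard-coding a label.
target = ratio.idxmax()
suspects = hosts[hosts["clusters"] == target]
print(suspects)
```

The ratio is only one possible heuristic; the scatter plot remains the better guide for deciding which clusters deserve a closer look.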

In [None]:
df.head()