This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based on domain knowledge. The cell inputs are still included to show our machine learning process.

In [None]:
import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Read in the data set and convert fields to their appropriate data types.

In [None]:
df = pd.read_json(r"C:\Users\hanbrolo\Documents\2.05-Kerberos.json")

In [None]:
df.renewable = df.renewable.astype(bool)

In [None]:
df['id.resp_p'] = df['id.resp_p'].astype(str)

In [None]:
df['@timestamp'] = pd.to_datetime(df['@timestamp'])

In [None]:
df.dtypes
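One caveat with the boolean conversion above: if a field arrives from the JSON as the strings "true" and "false" rather than as real booleans, astype(bool) marks every non-empty string as True. Whether that applies to this dataset is an assumption, since the data is not included. A minimal sketch on a toy series (the name raw is hypothetical):

```python
import pandas as pd

# Toy stand-in for a log field stored as text (assumption: the real field may differ).
raw = pd.Series(["true", "false", "true"], dtype=object)

# astype(bool) calls bool() on each string, and any non-empty string is truthy.
naive = raw.astype(bool)

# An explicit mapping preserves the intended meaning.
parsed = raw.map({"true": True, "false": False})

print(naive.tolist())
print(parsed.tolist())
```

If df.dtypes shows the column as object before conversion, the explicit mapping is the safer choice.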

Create a new dataset that aggregates by origin host. For each origin host, this gives the number of unique response hosts it tried to connect to, the total number of connections, and the response host it tried to connect to most often. All of these may be useful features for clustering.

In [None]:
df_unique_connections = df.groupby(['id.orig_h'])['id.resp_h'].describe().sort_values(['unique'])
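To make the shape of this aggregation concrete, here is a small sketch on made-up addresses (the toy frame is hypothetical, not the original data). For a text column, describe() returns count, unique, top, and freq per group.

```python
import pandas as pd

# Hypothetical connection log: one row per connection.
toy = pd.DataFrame({
    "id.orig_h": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "id.resp_h": ["10.0.0.9", "10.0.0.9", "10.0.0.8", "10.0.0.9"],
})

# count  = total connections from the origin host
# unique = number of distinct response hosts
# top    = most frequent response host
# freq   = number of connections to that top host
summary = toy.groupby("id.orig_h")["id.resp_h"].describe().sort_values("unique")
print(summary)
```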

Select the features we want to cluster by.

In [None]:
cols = ['unique','count']

In [None]:
df_unique_connections['unique'] = pd.to_numeric(df_unique_connections['unique'])
df_unique_connections['count'] = pd.to_numeric(df_unique_connections['count'])

Normalize the selected features with a z-score so that each one has zero mean and unit variance.

In [None]:
df_tr_std = stats.zscore(df_unique_connections[cols])
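For reference, the z-score subtracts each column's mean and divides by its population standard deviation (ddof=0, which is the default for stats.zscore). The sketch below checks that on toy numbers; the values are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up feature values on very different scales.
toy = pd.DataFrame({
    "unique": [1.0, 2.0, 3.0, 10.0],
    "count": [5.0, 5.0, 50.0, 500.0],
})

# Manual z-score: (x - mean) / population standard deviation.
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Same computation via SciPy, column by column (axis=0 is the default).
scaled = stats.zscore(toy.to_numpy())

print(np.allclose(manual.to_numpy(), scaled))
```

Without this step, the feature with the larger range would dominate the Euclidean distances that k-means relies on.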

Train the k-means model with 6 clusters (we tried various numbers and settled on this). Assign each row to a cluster, then summarize each cluster's statistics to learn what kind of traffic it represents.

In [None]:
kmeans = KMeans(n_clusters=6, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
df_unique_connections['clusters'] = labels
cols.extend(['clusters'])
df_unique_connections[cols].groupby(['clusters']).agg(['mean','count'])
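The notebook does not record how the different cluster counts were compared. One common way to do it, shown here only as a sketch on synthetic data, is to fit k-means for a range of k and look at the inertia and silhouette score for each. The blob data and the names X, scores, and best_k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs standing in for standardized features.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia: within-cluster sum of squares (lower is tighter).
    # Silhouette: separation vs. cohesion, in [-1, 1] (higher is better).
    scores[k] = (model.inertia_, silhouette_score(X, model.labels_))

# Pick the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])
print(best_k)
```

On real traffic data the scores rarely point to a single clear answer, so inspecting the per-cluster statistics, as the cell above does, remains a useful check.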

Visualize the clusters in a scatter plot to highlight which ones are anomalous.

In [None]:
label_color_map = {
    0:'b',
    1:'g',
    2:'r',
    3:'c',
    4:'m',
    5:'y'
}
label_colors = [label_color_map[i] for i in df_unique_connections['clusters']]
plt.scatter(df_unique_connections['unique'], df_unique_connections['count'], c=label_colors)


Select an anomalous cluster (this one shows records with a very high number of auth attempts against a single host) and view the records in it for further analysis.

In [None]:
df_unique_connections[df_unique_connections.clusters == 1]

In [None]:
df[df['id.orig_h'] == '10.25.25.2']

<b>Failures only</b>

This is another attempt to generate meaningful clusters, this time filtered to only the auth attempts that failed. This type of analysis is helpful for finding brute-force attacks.

In [None]:
df_failure_by_host = df[df.success == 'false'].groupby(['id.orig_h'])['id.resp_h'].describe().sort_values(['unique'])

In [None]:
df_failure_by_host

Feature Selection

In [None]:
cols = ['unique','count', 'freq']

Train model and assign clusters

In [None]:
df_failure_by_host['unique'] = pd.to_numeric(df_failure_by_host['unique'])
df_failure_by_host['count'] = pd.to_numeric(df_failure_by_host['count'])
df_failure_by_host['freq'] = pd.to_numeric(df_failure_by_host['freq'])
df_tr_std = stats.zscore(df_failure_by_host[cols])
kmeans = KMeans(n_clusters=6, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
df_failure_by_host['clusters'] = labels
cols.extend(['clusters'])
df_failure_by_host[cols].groupby(['clusters']).agg(['median','count'])

Visualize clusters

In [None]:
label_color_map = {
    0:'b',
    1:'g',
    2:'r',
    3:'c',
    4:'m',
    5:'y'
}
label_colors = [label_color_map[i] for i in df_failure_by_host['clusters']]
plt.scatter(df_failure_by_host['unique'], df_failure_by_host['count'], c=label_colors)

Explore some anomalous clusters

In [None]:
df_failure_by_host[df_failure_by_host['clusters']==1]

In [None]:
df_failure_by_host[df_failure_by_host['clusters']==5]

<b>Another Attempt - Failures only with additional engineered feature</b>

Here is one more iteration of the same process. This time we engineered one additional feature: for each origin host, the total number of connections in the full dataset (from any host, successful or not) to that host's most frequent destination, the "top" column.

In [None]:
#df_conn_count[df_conn_count['id.resp_h'] == 132.177.152.168]
df_failure_by_host['top_host_connections'] = [df[df['id.resp_h'] == i].shape[0] for i in df_failure_by_host['top']]
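The list comprehension above rescans the full frame once per origin host. A vectorized sketch of the same count, using value_counts and map, is shown below on toy data. It also computes a second, hypothetical variant (top_host_sources) that counts distinct origin hosts rather than connections. The toy frames and that column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical connection log and per-origin summary.
df_toy = pd.DataFrame({
    "id.orig_h": ["h1", "h1", "h2", "h3", "h1"],
    "id.resp_h": ["a", "a", "b", "a", "c"],
})
by_host_toy = pd.DataFrame({"top": ["a", "b", "z"]}, index=["h1", "h2", "h3"])

# Connections per response host, computed once.
resp_counts = df_toy["id.resp_h"].value_counts()
by_host_toy["top_host_connections"] = (
    by_host_toy["top"].map(resp_counts).fillna(0).astype(int)
)

# Variant: distinct origin hosts per response host.
resp_sources = df_toy.groupby("id.resp_h")["id.orig_h"].nunique()
by_host_toy["top_host_sources"] = (
    by_host_toy["top"].map(resp_sources).fillna(0).astype(int)
)
print(by_host_toy)
```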

In [None]:
df_failure_by_host

Feature selection, model training, assigning clusters, visualizing clusters.

In [None]:
cols = ['unique','count','top_host_connections']

In [None]:
# Cluster the failure-only aggregation, now including the engineered feature
df_tr_std = stats.zscore(df_failure_by_host[cols])
kmeans = KMeans(n_clusters=6, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
df_failure_by_host['clusters'] = labels
cols.extend(['clusters'])
df_failure_by_host[cols].groupby(['clusters']).agg(['mean','count'])

In [None]:
label_color_map = {
    0:'b',
    1:'g',
    2:'r',
    3:'c',
    4:'m',
    5:'y'
}
label_colors = [label_color_map[i] for i in df_failure_by_host['clusters']]

In [None]:
fig = plt.figure()
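The cell above only creates the figure; the plotting code is not included in this version of the notebook. Given the Axes3D import at the top, it presumably drew a 3D scatter of three features. The following is a sketch of what such a plot could look like, using random points rather than the real data. The axis labels and the points array are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for three features and six cluster labels.
points = rng.random((20, 3))
clusters = rng.integers(0, 6, size=20)

label_color_map = {0: 'b', 1: 'g', 2: 'r', 3: 'c', 4: 'm', 5: 'y'}
label_colors = [label_color_map[int(i)] for i in clusters]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3D axes from mpl_toolkits.mplot3d
ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=label_colors)
ax.set_xlabel("unique")
ax.set_ylabel("count")
ax.set_zlabel("top_host_connections")
```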

We attempted one more type of aggregation (by session), but found that each connection is its own session for this protocol, so grouping by session adds no information.

In [None]:
df.groupby(['@id'])['ts'].count()
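A quick way to confirm that every session holds exactly one connection is to check that all group sizes equal 1. This is a sketch on toy data; the column values are made up.

```python
import pandas as pd

# Hypothetical log where each row has its own session id.
toy = pd.DataFrame({"@id": ["s1", "s2", "s3"], "ts": [1, 2, 3]})

per_session = toy.groupby("@id")["ts"].count()

# True when no session id repeats, i.e. one connection per session.
all_single = bool((per_session == 1).all())
print(all_single)
```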