This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based on domain knowledge. The inputs are still included to show our machine learning process.

In [None]:
import pandas as pd
import numpy as np
from kmodes.kmodes import KModes

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0

Read in the dataset, limited to 1 in 20 rows so it's a manageable size for training the model.

In [None]:
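# Keep every 20th line of the file (line 0 is the header) and skip the rest.
# read_csv accepts a callable for skiprows, which avoids building a list of ~95 million indices.
def skip(i):
    return i % 20 != 0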

In [None]:
df = pd.read_csv(r"C:\Users\hanbrolo\Documents\dce_rpc_3.26.csv",skiprows=skip)
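
To illustrate how the 1-in-20 sampling behaves, here is a minimal sketch on a small in-memory CSV. The `id` and `value` columns are made up for the example and are not taken from the real dataset. It assumes `skiprows` is given a callable, which pandas evaluates against each line index; line 0 is the header, so it is kept along with every 20th data row.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: a header plus 100 data rows.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(1, 101))

# Line 0 is the header; keeping every 20th line retains it plus rows 20, 40, ...
sample = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % 20 != 0)
sampled_ids = sample["id"].tolist()
```

Because the header sits on line 0, any sampling rule of the form `i % n != 0` keeps the column names without special handling.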

In [None]:
df.shape

In [None]:
df.dtypes

Based on domain knowledge, we've grouped IPs into different departments or services they represent. We use this engineered feature in our model training.

In [None]:
#The logic for this feature has been removed from the notebook due to its sensitive nature.

Another engineered feature: categorical buckets for day of week and time of day.

In [None]:
df['ts'] = pd.to_datetime(df['ts'])
# weekday_name was removed in pandas 1.0; day_name() is its replacement
df['day_of_week'] = df['ts'].dt.day_name()
hours = {
    0: "late_night",
    1: "late_night",
    2: "early_morning",
    3: "early_morning",
    4: "early_morning",
    5: "early_morning",
    6: "morning",
    7: "morning",
    8: "morning",
    9: "morning",
    10: "afternoon",
    11: "afternoon",
    12: "afternoon",
    13: "afternoon",
    14: "evening",
    15: "evening",
    16: "evening",
    17: "evening",
    18: "night",
    19: "night",
    20: "night",
    21: "night",
    22: "late_night",
    23: "late_night"
}
df['time_of_day_bin'] = df['ts'].dt.hour.map(hours)
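
As a quick check of what the two engineered columns look like, the sketch below applies the same idea to two made-up timestamps. The `demo` frame, the `buckets` dictionary, and `hour_to_bin` are hypothetical names introduced only for this example; the bucket boundaries restate the hour mapping above in a more compact form.

```python
import pandas as pd

# Made-up timestamps for illustration only (not from the original dataset).
demo = pd.DataFrame(
    {"ts": pd.to_datetime(["2024-01-01 03:15:00", "2024-01-06 23:40:00"])}
)

# Compact restatement of the hour buckets used in the notebook.
buckets = {
    "early_morning": range(2, 6),
    "morning": range(6, 10),
    "afternoon": range(10, 14),
    "evening": range(14, 18),
    "night": range(18, 22),
    "late_night": [22, 23, 0, 1],
}
hour_to_bin = {hour: name for name, hrs in buckets.items() for hour in hrs}

# Derive the categorical day and time-of-day features from the timestamp.
demo["day_of_week"] = demo["ts"].dt.day_name()
demo["time_of_day_bin"] = demo["ts"].dt.hour.map(hour_to_bin)
```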

In [None]:
df.dtypes

In [None]:
df.groupby('operation')['ts'].count()

Feature Selection

In [None]:
cols = ['endpoint','named_pipe','operation','orig_ip_group','resp_ip_group','time_of_day_bin']

Handle missing values

In [None]:
df[cols] = df[cols].fillna("missing")

Train the model with 15 clusters. This number provided enough distinct examples of what "normal" behavior could look like that our anomalies were meaningful.

In [None]:
km = KModes(n_clusters=15, init='Huang', n_init=4, verbose=2)

In [None]:
clusters = km.fit_predict(df[cols])
df['cluster'] = clusters

These centroids represent the most common values for each of the chosen features. We'll identify the records that are the most dissimilar to these centroids in order to find anomalies.

In [None]:
km.cluster_centroids_
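
To make the idea of a centroid as "the most common value per feature" concrete, here is a minimal pure-Python sketch. The records and the `column_modes` helper are hypothetical and only illustrate the concept; they are not how the `kmodes` library computes its centroids internally.

```python
from collections import Counter

# Hypothetical records already assigned to one cluster: (operation, time_of_day_bin).
cluster_records = [
    ("op_a", "morning"),
    ("op_a", "afternoon"),
    ("op_b", "morning"),
    ("op_a", "morning"),
]

def column_modes(records):
    """Return the most frequent value in each column position."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

centroid = column_modes(cluster_records)
```

Note that the resulting centroid need not match any single record exactly; it is assembled feature by feature.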

Define a distance measure and calculate the distance between each record and its nearest cluster centroid.

In [None]:
def dissim_distance(a, b):
    """Simple matching dissimilarity: the number of positions where a and b differ."""
    distance = 0
    for ai, bi in zip(a, b):
        if ai != bi:
            distance += 1
    return distance

In [None]:
def get_min_dist(row):
    cluster_index = row['cluster']
    return dissim_distance(row[cols],km.cluster_centroids_[cluster_index])

In [None]:
df['min_dist'] = df.apply(get_min_dist, axis=1)
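
Row-wise `apply` can be slow on large frames, so a vectorized version of the same mismatch count may be worth considering. The sketch below uses small made-up arrays. In the notebook, `records`, `centroids`, and `labels` would presumably correspond to `df[cols].to_numpy()`, `km.cluster_centroids_`, and `df['cluster'].to_numpy()`, assuming the feature values and the centroids are stored as comparable types (for example, all strings).

```python
import numpy as np

# Hypothetical stand-ins: 3 records with 3 categorical features, and 2 centroids.
records = np.array([
    ["a", "x", "m"],
    ["b", "y", "n"],
    ["a", "y", "m"],
])
centroids = np.array([
    ["a", "x", "m"],
    ["b", "y", "m"],
])
labels = np.array([0, 1, 0])  # cluster assigned to each record

# Index the centroids by label so each record lines up with its own centroid,
# then count the mismatching features per row.
distances = (records != centroids[labels]).sum(axis=1)
```

The result is one mismatch count per record, the same quantity that `dissim_distance` returns for a single pair.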

In [None]:
df.to_pickle("dce_rpc_clusters_assigned")

Show the records with the highest distance from their cluster centroids. These are anomalies and should be further investigated.

In [None]:
df[df['min_dist'] > 5]

In [None]:
df_anomalies = df[df['min_dist'] > 5]
df_anomalies.to_csv(path_or_buf="dce_rpc_anomalies_15_clusters_ip_group.csv")
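
Since the model uses six features, the mismatch distance ranges from 0 to 6, so `min_dist > 5` keeps only records that differ from their centroid on every feature. Before settling on a threshold, it can help to look at how many records fall at each distance. The sketch below does this on made-up values; in the notebook, `df['min_dist'].value_counts().sort_index()` should give a comparable view.

```python
from collections import Counter

# Hypothetical min_dist values for ten records scored on six features.
min_dists = [0, 1, 0, 2, 6, 1, 0, 3, 6, 1]
n_features = 6

counts = Counter(min_dists)
# How many records sit at each distance, from 0 (exact match) to n_features.
histogram = {d: counts.get(d, 0) for d in range(n_features + 1)}

threshold = 5
flagged = [d for d in min_dists if d > threshold]
```

A lower threshold widens the net at the cost of more records to review.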