This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based off of domain knowledge. The inputs are still included for the purpose of understanding our machine learning process.

In [None]:
import numpy as np
from kmodes.kmodes import KModes
import pandas as pd

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in recent pandas

Limit the dataframe to a size we can handle by reading only 1 in every 20 rows. Dr. Keith says this should still be representative of the population.

In [None]:
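# Keep file row 0 (the header) and every 20th row after it.
# A callable avoids building a list of tens of millions of row indices in memory.
def skip(i):
    return i % 20 != 0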

In [None]:
df = None
df = pd.read_csv(r"C:\Users\hanbrolo\Documents\kerberos_2.25_to_3.4.csv", skiprows=skip)
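
`read_csv` accepts either a list of row indices or a callable for `skiprows`. The following is a minimal sketch of the same 1-in-20 sampling on a small in-memory CSV. The column names and values are made up for illustration and are not from the original dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real log file: a header line plus 100 data rows.
csv_text = "ts,service\n" + "\n".join(
    f"2019-02-25 00:00:{i % 60:02d},svc{i}" for i in range(100)
)

# pandas calls the function with each file row index (0 is the header line)
# and skips the row when it returns True, so rows 0, 20, 40, ... are kept.
sample = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % 20 != 0)
print(len(sample))
```

Because row 0 is a multiple of 20, the header survives the filter and the column names are still read correctly.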

Based on domain knowledge, we've grouped IPs into different departments or services they represent. We use this engineered feature in our model training.

In [None]:
#The logic for this feature has been removed from the notebook due to its sensitive nature.

Another engineered feature: categorical buckets for day of week and time of day.

In [None]:
df.ts = pd.to_datetime(df.ts)
df['day_of_week'] = df.ts.dt.day_name()  # weekday_name was removed in pandas 1.0
hours = {
    0: "late_night",
    1: "late_night",
    2: "early_morning",
    3: "early_morning",
    4: "early_morning",
    5: "early_morning",
    6: "morning",
    7: "morning",
    8: "morning",
    9: "morning",
    10: "afternoon",
    11: "afternoon",
    12: "afternoon",
    13: "afternoon",
    14: "evening",
    15: "evening",
    16: "evening",
    17: "evening",
    18: "night",
    19: "night",
    20: "night",
    21: "night",
    22: "late_night",
    23: "late_night"
}
df['time_of_day_bin'] = df.ts.dt.hour.map(hours)
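
As a quick check of the bucketing, here is a small sketch that builds the same hour-to-bucket mapping from ranges and applies it to three made-up timestamps. The timestamps and the `bins` and `features` names are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical timestamps for illustration only.
ts = pd.Series(pd.to_datetime([
    "2019-02-25 01:15:00",
    "2019-02-27 09:30:00",
    "2019-03-02 23:05:00",
]))

# Same hour-to-bucket scheme as the dictionary above, written as ranges.
bins = {
    "early_morning": range(2, 6),
    "morning": range(6, 10),
    "afternoon": range(10, 14),
    "evening": range(14, 18),
    "night": range(18, 22),
    "late_night": [22, 23, 0, 1],
}
hours = {h: label for label, hs in bins.items() for h in hs}

features = pd.DataFrame({
    "day_of_week": ts.dt.day_name(),
    "time_of_day_bin": ts.dt.hour.map(hours),
})
print(features)
```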

Feature selection

In [None]:
cols = ['error_msg','request_type', 'service','day_of_week','time_of_day_bin','orig_ip_group','resp_ip_group']

Handling missing values as their own category

In [None]:
df['success'] = ["yes" if i else "no" for i in df["success"]]

In [None]:
df['error_msg'] = df['error_msg'].fillna("SUCCESS")

In [None]:
df[cols] = df[cols].fillna("missing")
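
A toy illustration of the two fills above, using made-up values. It assumes, as the cells above do, that a missing `error_msg` means the request succeeded, while a missing value in any other column simply becomes its own `"missing"` category.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the values are placeholders, not real log entries.
toy = pd.DataFrame({
    "error_msg": [np.nan, "KDC_ERR_PREAUTH_REQUIRED", np.nan],
    "service": ["krbtgt/EXAMPLE", np.nan, "host/example"],
})

# Order matters: fill error_msg first so its gaps become "SUCCESS",
# then label every remaining gap as the category "missing".
toy["error_msg"] = toy["error_msg"].fillna("SUCCESS")
toy = toy.fillna("missing")
print(toy)
```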

Train the model with 15 clusters. This number provided enough distinct examples of what "normal" behavior could look like that our anomalies were meaningful.

In [None]:
km = KModes(n_clusters=15, init='Huang', n_init=4, verbose=2)

In [None]:
clusters = km.fit_predict(df[cols])
clusters

In [None]:
df['cluster'] = clusters

These centroids represent the most common values for each of the chosen features. We'll identify the records that are the most dissimilar to these centroids in order to find anomalies.

In [None]:
km.cluster_centroids_
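
To make "most common values" concrete, the sketch below computes the column-wise mode of each cluster on a tiny made-up frame. This shows the idea behind a k-modes centroid. It is not the `kmodes` library's own code, and the library may break ties differently.

```python
import pandas as pd

# Hypothetical rows that have already been assigned to clusters.
toy = pd.DataFrame({
    "request_type": ["AS", "AS", "TGS", "TGS", "TGS"],
    "time_of_day_bin": ["morning", "morning", "night", "night", "morning"],
    "cluster": [0, 0, 1, 1, 1],
})

# Most frequent value per column within each cluster.
# Series.mode() returns sorted modes, so ties resolve to the first one.
modes = toy.groupby("cluster").agg(lambda s: s.mode().iloc[0])
print(modes)
```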

In [None]:
df_centroids = pd.DataFrame(km.cluster_centroids_)

In [None]:
df_centroids.to_pickle("kerb_kmodes_centroids_15_clusters_ip_group")

This function calculates the distance between a row and the closest centroid. For records of categorical data, distance is simply increased by one for each column in the row that doesn't match the value of the corresponding column in the assigned centroid.

In [None]:
def dissim_distance(a,b):
    distance = 0
    for ai, bi in zip(a,b):
        if ai != bi:
            distance += 1
    return distance

In [None]:
def get_min_dist(row):
    cluster_index = row['cluster']
    return dissim_distance(row[cols],km.cluster_centroids_[cluster_index])

In [None]:
df['min_dist'] = df.apply(get_min_dist, axis=1)
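
Row-wise `apply` can be slow on large frames. As a possible alternative, the same mismatch count can be computed with NumPy in one step by indexing the centroid array with each row's cluster label. The arrays below are toy data; in the notebook they would correspond to `df[cols].to_numpy()`, `km.cluster_centroids_`, and `df['cluster'].to_numpy()`.

```python
import numpy as np

# Toy categorical rows, centroids, and cluster assignments.
X = np.array([["AS", "morning"], ["TGS", "night"], ["AS", "night"]])
centroids = np.array([["AS", "morning"], ["TGS", "night"]])
labels = np.array([0, 1, 0])

# centroids[labels] lines up each row with its assigned centroid;
# counting the mismatching columns gives the matching dissimilarity.
min_dist = (X != centroids[labels]).sum(axis=1)
print(min_dist)
```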

Find anomalies. We define these as the rows farthest from the centroid of their assigned cluster. For this dataset, we found that a distance of 5 or greater (out of the 7 selected features) was anomalous enough to review:

In [None]:
df_anomalies = df[df.min_dist >= 5]
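
One way to sanity-check a cutoff like 5 is to look at what share of rows sits at or above each distance. The sketch below uses made-up distances, so the numbers are purely illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical distances for ten rows.
min_dist = pd.Series([0, 0, 1, 0, 2, 1, 0, 5, 0, 6], name="min_dist")

# Fraction of rows whose distance is at least d, for each observed d.
share_at_or_above = {
    int(d): float((min_dist >= d).mean()) for d in sorted(min_dist.unique())
}
print(share_at_or_above)
```

A cutoff that leaves only a small share of rows keeps the manual review manageable.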

In [None]:
df_anomalies.to_pickle("kmodes_kerb_9_cols_15_clusters_anomalies_max_distances")

In [None]:
df_anomalies.to_csv(path_or_buf="kerb_anomalies_15_clusters_ip_group.csv")

Here are the anomalies! We looked through these by hand to determine if any of them were of concern.

In [None]:
df_anomalies

------------The remainder of this notebook represents other work that I tried but didn't end up using. It may be useful for showing the process, but this is the end of the code that helped determine the anomalies--------------------

In [None]:
# If you need the distance between ALL rows and ALL clusters, use this. It is very slow.
K_NUM_CLUSTERS = 15

for i in range(K_NUM_CLUSTERS):
    df['dist_from_' + str(i)] = df[cols].apply(lambda row: dissim_distance(row,km.cluster_centroids_[i]),axis=1)
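
If the all-rows-to-all-clusters distances are ever needed, NumPy broadcasting is one possible way to avoid the per-row `apply`. This is a sketch on toy arrays, not something that was run on the original data. It builds an intermediate array of shape (rows, clusters, columns), so a large frame would need to be processed in chunks.

```python
import numpy as np

# Toy categorical rows and centroids.
X = np.array([["AS", "morning"], ["TGS", "night"], ["AS", "night"]])
centroids = np.array([["AS", "morning"], ["TGS", "night"]])

# (n_rows, 1, n_cols) against (1, n_clusters, n_cols) broadcasts to
# (n_rows, n_clusters, n_cols); summing mismatches over the last axis
# yields one distance per (row, cluster) pair.
all_dist = (X[:, None, :] != centroids[None, :, :]).sum(axis=2)

# Distance to the nearest centroid for each row.
nearest = all_dist.min(axis=1)
print(all_dist)
```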

In [None]:
# Don't use this; the apply-based loop above is faster and does the same thing.
dist_cols = []

K_NUM_CLUSTERS = 15

for i in range(K_NUM_CLUSTERS):
    dist_cols.append([])

for row in df[cols].iterrows():
    col_count = 0
    for c in km.cluster_centroids_:
        dist_cols[col_count].append(dissim_distance(row[1], c))
        col_count += 1
        
col_count = 0    
for col in dist_cols:
    df['dist_from_' + str(col_count)] = col
    col_count += 1

In [None]:
df.to_pickle("kmodes_kerb_9_features_15_clusters_distances")

In [None]:
df.head(20)

In [None]:
dist_cols = ['dist_from_' + str(i) for i in range(K_NUM_CLUSTERS)]

In [None]:
df[cols + dist_cols]

In [None]:
df[dist_cols].describe()

In [None]:
df['min_dist'] = df[dist_cols].agg('min', axis="columns")

In [None]:
df[cols+dist_cols+['min_dist']]