This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based on domain knowledge. The inputs are still included to show our machine learning process.

In [None]:
import numpy as np
from kmodes.kmodes import KModes
import pandas as pd

Read in the dataset, keeping 1 in every 20 rows so it is a manageable size for training the model.

In [None]:
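# Skip every file line whose index is not a multiple of 20.
# Line 0 (the header) is kept. Using a callable avoids building a
# 100-million-element list in memory.
def skip(i):
    return i % 20 != 0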

In [None]:
df = pd.read_csv(r"C:\Users\hanbrolo\Documents\ntlm_2.25_to_3.4.csv", skiprows=skip)
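
As a small check of the 1-in-20 sampling, the sketch below applies the same rule to an in-memory CSV by passing a callable to `skiprows`. The column names and values are made up for illustration; only the `i % 20 != 0` rule comes from this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV: one header line followed by 100 data rows.
csv_text = "ts,status\n" + "\n".join(f"t{i},ok" for i in range(100))

# File line 0 is the header, so it is kept along with lines 20, 40, 60, ...
sample = pd.read_csv(io.StringIO(csv_text), skiprows=lambda i: i % 20 != 0)
print(sample)
```

Because the header occupies line 0, the rows that survive are the 20th, 40th, and so on, rather than the first data row.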

Also read in the full dataset to assign clusters to all rows based on the model that we train.

In [None]:
full_df = pd.read_csv(r"C:\Users\hanbrolo\Documents\ntlm_2.25_to_3.4.csv")

In [None]:
df.dtypes

Based on domain knowledge, we've grouped IPs into different departments or services they represent. We use this engineered feature in our model training.

In [None]:
#The logic for this feature has been removed from the notebook due to its sensitive nature.

Another engineered feature: categorical buckets for day of week and time of day.

In [None]:
df['ts'] = pd.to_datetime(df['ts'])
# dt.weekday_name was removed in pandas 1.0; dt.day_name() is its replacement.
df['day_of_week'] = df['ts'].dt.day_name()
hours = {
    0: "late_night",
    1: "late_night",
    2: "early_morning",
    3: "early_morning",
    4: "early_morning",
    5: "early_morning",
    6: "morning",
    7: "morning",
    8: "morning",
    9: "morning",
    10: "afternoon",
    11: "afternoon",
    12: "afternoon",
    13: "afternoon",
    14: "evening",
    15: "evening",
    16: "evening",
    17: "evening",
    18: "night",
    19: "night",
    20: "night",
    21: "night",
    22: "late_night",
    23: "late_night"
}
df['time_of_day_bin'] = df['ts'].dt.hour.map(hours)
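
To see what these two features look like, here is a minimal sketch on two made-up timestamps. It rebuilds the same bucket boundaries as the `hours` dict from hour ranges and uses `dt.day_name()`, which is available in current pandas. The timestamps and the `hour_to_bin` name are illustrative assumptions.

```python
import pandas as pd

# Same bucket boundaries as the `hours` dict, built from hour ranges.
buckets = {
    "early_morning": range(2, 6),
    "morning": range(6, 10),
    "afternoon": range(10, 14),
    "evening": range(14, 18),
    "night": range(18, 22),
    "late_night": [22, 23, 0, 1],
}
hour_to_bin = {h: name for name, hrs in buckets.items() for h in hrs}

# Hypothetical timestamps, for illustration only.
ts = pd.Series(pd.to_datetime(["2019-03-04 09:15", "2019-03-02 23:30"]))
toy = pd.DataFrame({
    "day_of_week": ts.dt.day_name(),
    "time_of_day_bin": ts.dt.hour.map(hour_to_bin),
})
print(toy)
```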

Feature selection

In [None]:
cols = ['domainname','status','id.resp_p','orig_ip_first','orig_ip_middle','resp_ip_first','resp_ip_middle','day_of_week','time_of_day_bin']

In [None]:
df['id.resp_p'] = df['id.resp_p'].astype(str)

Handle missing fields as their own category

In [None]:
df[cols] = df[cols].fillna("missing")

Train the model with 10 clusters. This number provided enough distinct examples of what "normal" behavior could look like that our anomalies were meaningful.

In [None]:
km = KModes(n_clusters=10, init='Huang', n_init=4, verbose=2)
clusters = km.fit_predict(df[cols])

Assign each record to its nearest cluster

In [None]:
df['clusters'] = clusters

View the cluster centroids to learn more about them and what our most common traffic looks like. 
These centroids represent the most common values for each of the chosen features. We'll identify the records that are the most dissimilar to these centroids in order to find anomalies.

In [None]:
km.cluster_centroids_
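
As a rough illustration of what a k-modes centroid represents, the sketch below computes the most common value of each feature within each cluster using plain pandas. The toy data and the two-column subset are made up, and `kmodes` may break ties differently, so this approximates the idea rather than reproducing `km.cluster_centroids_`.

```python
import pandas as pd

# Hypothetical records that have already been assigned to clusters.
toy = pd.DataFrame({
    "status": ["SUCCESS", "SUCCESS", "FAIL", "FAIL", "FAIL"],
    "time_of_day_bin": ["morning", "morning", "night", "night", "morning"],
    "clusters": [0, 0, 1, 1, 1],
})
feature_cols = ["status", "time_of_day_bin"]

# Per-cluster, per-feature mode: the most frequent category in each column.
modes = toy.groupby("clusters")[feature_cols].agg(lambda s: s.mode().iloc[0])
print(modes)
```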

Define a distance measure then calculate the distance between each record and its cluster centroid.

In [None]:
def dissim_distance(a,b):
    distance = 0
    for ai, bi in zip(a,b):
        if ai != bi:
            distance += 1
    return distance
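
For a sense of scale, here is the same mismatch count applied to one made-up record and centroid. The field values are hypothetical and cover only three of the nine features.

```python
def dissim_distance(a, b):
    """Count the positions where two equal-length records differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

# Hypothetical values for domainname, status, and id.resp_p.
record = ["CORP", "SUCCESS", "445"]
centroid = ["CORP", "FAILURE", "135"]

distance = dissim_distance(record, centroid)
print(distance)
```

The distance is always an integer between 0 (identical to the centroid) and the number of features compared.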

In [None]:
def get_min_dist(row):
    cluster_index = row['clusters']
    return dissim_distance(row[cols],km.cluster_centroids_[cluster_index])

In [None]:
df['min_dist'] = df.apply(get_min_dist, axis=1)
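
Row-wise `apply` can be slow on large frames. One possible alternative, sketched here on toy data, is to index the centroid array by each row's cluster label and compare whole arrays at once with NumPy. The `centroids` array stands in for `km.cluster_centroids_`, and all values are made up.

```python
import numpy as np
import pandas as pd

feature_cols = ["status", "time_of_day_bin"]

# Stand-in for km.cluster_centroids_: one row of modes per cluster.
centroids = np.array([["SUCCESS", "morning"], ["FAIL", "night"]], dtype=object)

toy = pd.DataFrame({
    "status": ["SUCCESS", "FAIL", "SUCCESS"],
    "time_of_day_bin": ["morning", "morning", "night"],
    "clusters": [0, 1, 1],
})

# Pick each row's own centroid, then count mismatching features per row.
own_centroid = centroids[toy["clusters"].to_numpy()]
toy["min_dist"] = (toy[feature_cols].to_numpy(dtype=object) != own_centroid).sum(axis=1)
print(toy)
```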

Explore the records with the greatest distance from their cluster centroid (the anomalies)

In [None]:
#df_anomalies = 
df[df['min_dist'] > 6]
#df_anomalies.to_pickle("kmodes_ntlm_anomalies_9_cols_10_clusters")
#df_anomalies.to_csv("kmodes_ntlm_anomalies_9_cols_10_clusters.csv")
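
With nine features, `min_dist` is an integer from 0 to 9, so the `> 6` cutoff keeps records that disagree with their centroid on at least 7 of the 9 features. One way to sanity-check a cutoff like this is to tabulate how many records fall at each distance. The sketch below uses made-up distances in place of `df['min_dist']`.

```python
import pandas as pd

# Hypothetical distances, standing in for df['min_dist'].
min_dist = pd.Series([0, 1, 1, 2, 2, 2, 3, 7, 8])

# Number of records at each distance, ordered by distance.
distribution = min_dist.value_counts().sort_index()
print(distribution)

threshold = 6  # same cutoff as the cell above
n_anomalies = int((min_dist > threshold).sum())
share = n_anomalies / len(min_dist)
print(n_anomalies, round(share, 3))
```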