This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based on domain knowledge. The inputs are still included so that readers can follow our machine learning process.

In [None]:
import numpy as np
from kmodes.kmodes import KModes
import pandas as pd
from scipy import stats
from kmodes.kprototypes import KPrototypes

In [None]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# None means "no limit"; passing -1 was deprecated in pandas 1.0.
pd.set_option('display.max_colwidth', None)

Limit the dataframe to a size we can handle by reading only 1 in every 100 rows (the code below skips every row whose index is not a multiple of 100). Dr. Keith says this should still be representative of the population.

In [None]:
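# Keep the header (row 0) and every 100th row after it. A callable avoids
# building a list of roughly 800 million row indices in memory.
def skip(i):
    return i % 100 != 0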

In [None]:
df = pd.read_csv(r"C:\Users\hanbrolo\Documents\smb_files_3.18_to_3.25.csv",skiprows=skip)
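
To make the sampling rule concrete, the sketch below applies the same modulo test to a small in-memory CSV using only the standard library. The file contents, the `step` value, and the `keep_every_nth` helper are made up for illustration. Position 0 is the header line, so it always survives the filter.

```python
import csv
import io

def keep_every_nth(lines, step):
    """Yield lines whose 0-based position is a multiple of `step`.

    Mirrors a skip rule of the form `i % step != 0`: position 0 (the
    header) is always kept, followed by positions step, 2*step, ...
    """
    for i, line in enumerate(lines):
        if i % step == 0:
            yield line

# Hypothetical toy file: a header plus 20 data rows (C1 .. C20).
raw = "uid,size\n" + "".join(f"C{n},{n * 10}\n" for n in range(1, 21))

sampled = list(csv.DictReader(keep_every_nth(io.StringIO(raw), step=5)))
sampled_uids = [row["uid"] for row in sampled]
print(sampled_uids)
```

Because the rule is purely positional, the sample is only representative if the file has no periodic ordering that lines up with the step size.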

Convert fields to the appropriate data types

In [None]:
df['ts'] = pd.to_datetime(df['ts'])

Create a new data frame that aggregates by session. We want to discover users who at a certain time accessed/downloaded files in an anomalous fashion (not just individual records, but their whole session).

In [None]:
df_agg = df.groupby(['uid','id.orig_h','id.resp_h'])[['ts']].min()

In [None]:
df_agg.reset_index(level=['id.orig_h', 'id.resp_h'],inplace=True)

Engineer features, including the total file size and the number of records in each session

In [None]:
df_agg['size'] = df.groupby(['uid','id.orig_h','id.resp_h'])['size'].sum()

In [None]:
df_agg['count'] = df.groupby(['uid','id.orig_h','id.resp_h'])['size'].count()
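
As a dependency-free illustration of what each aggregated session row holds, the sketch below groups a handful of records by the same three keys and tracks the earliest timestamp, the summed size, and the record count. The records and their values are invented; the real data is not included in this notebook.

```python
from collections import defaultdict

# Hypothetical records: (uid, id.orig_h, id.resp_h, ts, size)
records = [
    ("C1", "10.0.0.5", "10.0.1.9", 100.0, 2048),
    ("C1", "10.0.0.5", "10.0.1.9", 105.0, 512),
    ("C2", "10.0.0.7", "10.0.1.9", 101.0, 4096),
    ("C1", "10.0.0.5", "10.0.1.9", 103.0, 1024),
]

# One entry per (uid, origin host, responding host) session.
sessions = defaultdict(lambda: {"ts": None, "size": 0, "count": 0})
for uid, orig_h, resp_h, ts, size in records:
    s = sessions[(uid, orig_h, resp_h)]
    s["ts"] = ts if s["ts"] is None else min(s["ts"], ts)  # session start
    s["size"] += size                                       # total bytes
    s["count"] += 1                                         # number of records

for key, summary in sessions.items():
    print(key, summary)
```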

Normalize numeric columns

In [None]:
df_agg['size_normalized'] = stats.zscore(df_agg['size'])

In [None]:
df_agg['count_normalized'] = stats.zscore(df_agg['count'])
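
For reference, `stats.zscore` with its default settings subtracts the mean and divides by the population standard deviation (`ddof=0`). The sketch below reproduces that calculation with the standard library on made-up values; the `zscore` helper here is an illustrative stand-in, not the SciPy function.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Return (v - mean) / population_std for each value."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation (ddof=0)
    return [(v - mu) / sigma for v in values]

# Hypothetical session sizes: mean is 5, population std is 2.
sizes = [2, 4, 4, 4, 5, 5, 7, 9]
z = zscore(sizes)
print(z)
```

A value with a z-score of 2 sits two standard deviations above the mean, which puts size and count on a comparable scale before clustering.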

Based on domain knowledge, we've grouped IPs into different departments or services they represent. We use this engineered feature in our model training.

In [None]:
# The logic for this feature has been removed from the notebook due to its sensitive nature.

Another engineered feature: categorical buckets for day of week and time of day.

In [None]:
df_agg['ts'] = pd.to_datetime(df_agg['ts'])
# Series.dt.weekday_name was removed in pandas 1.0; day_name() is its replacement.
df_agg['day_of_week'] = df_agg['ts'].dt.day_name()
hours = {
    0: "late_night",
    1: "late_night",
    2: "early_morning",
    3: "early_morning",
    4: "early_morning",
    5: "early_morning",
    6: "morning",
    7: "morning",
    8: "morning",
    9: "morning",
    10: "afternoon",
    11: "afternoon",
    12: "afternoon",
    13: "afternoon",
    14: "evening",
    15: "evening",
    16: "evening",
    17: "evening",
    18: "night",
    19: "night",
    20: "night",
    21: "night",
    22: "late_night",
    23: "late_night"
}
df_agg['time_of_day_bin'] = df_agg['ts'].dt.hour.map(hours)
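
The same hour-to-bucket mapping can be written more compactly by listing the hours under each label and inverting the result. This is only an alternative sketch; the names `buckets` and `hour_to_bucket` are not part of the original notebook, and the bucket boundaries are copied from the dictionary above.

```python
# Hours covered by each label, matching the explicit dictionary above.
buckets = {
    "late_night": [22, 23, 0, 1],
    "early_morning": range(2, 6),
    "morning": range(6, 10),
    "afternoon": range(10, 14),
    "evening": range(14, 18),
    "night": range(18, 22),
}

# Invert to a flat hour -> label lookup.
hour_to_bucket = {h: label for label, hrs in buckets.items() for h in hrs}

# Every hour of the day should be covered exactly once.
assert sorted(hour_to_bucket) == list(range(24))
print(hour_to_bucket[3], hour_to_bucket[15])
```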

In [None]:
df_agg['size'].describe()

Handle missing data

In [None]:
df_agg['size'] = df_agg['size'].fillna(0)
df_agg = df_agg.fillna("missing")

In [None]:
df_agg['count'].describe()

Feature Selection

In [None]:
cols = ['size_normalized','count_normalized','orig_ip_group','resp_ip_group','day_of_week','time_of_day_bin']

In [None]:
df_agg[cols]

Train k-prototypes (k-prototypes considers both categorical and numeric features for its distance measure). 

In [None]:
kproto = KPrototypes(n_clusters=5, init='Cao', verbose=2)

In [None]:
clusters = kproto.fit_predict(df_agg[cols].values, categorical=[2,3,4,5])
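
In the usual k-prototypes formulation, the distance between a record and a prototype is the squared Euclidean distance over the numeric features plus a weight, gamma, times the number of mismatching categorical features. The sketch below illustrates that measure in plain Python. It is not the kmodes implementation, and the function name, the `gamma` value, and the sample feature values (including the group labels) are all assumptions.

```python
def mixed_distance(num_a, num_b, cat_a, cat_b, gamma):
    """Squared Euclidean distance on numeric parts plus
    gamma * (number of categorical mismatches)."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical = sum(x != y for x, y in zip(cat_a, cat_b))
    return numeric + gamma * categorical

# Hypothetical record and prototype:
# numeric = [size_normalized, count_normalized]
# categorical = [orig_ip_group, resp_ip_group, day_of_week, time_of_day_bin]
record_num = [2.0, -0.5]
record_cat = ["eng", "file_server", "Saturday", "late_night"]
proto_num = [0.0, 0.5]
proto_cat = ["eng", "file_server", "Tuesday", "morning"]

d = mixed_distance(record_num, proto_num, record_cat, proto_cat, gamma=0.5)
print(d)  # numeric part 5.0 + 0.5 * 2 mismatches
```

A larger gamma makes categorical mismatches count for more relative to the numeric features, so its value influences which records end up grouped together.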

In [None]:
clusters

In [None]:
kproto.cluster_centroids_

Assign each record to a cluster

In [None]:
df_agg['cluster'] = clusters

In [None]:
df_agg.groupby('cluster')['ts'].count()

Investigate a cluster that looks anomalous (this one shows a very large total download size by session)

In [None]:
df_agg[df_agg['cluster']==2]

Calculate each record's distance from the categorical centroid of its assigned cluster

In [None]:
categ_cols = ['orig_ip_group','resp_ip_group','day_of_week','time_of_day_bin']

In [None]:
def dissim_distance(a, b):
    """Simple matching dissimilarity: the number of positions where a and b differ."""
    distance = 0
    for ai, bi in zip(a, b):
        if ai != bi:
            distance += 1
    return distance

In [None]:
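def get_min_dist(row):
    # Distance from this record to the categorical centroid of its assigned cluster.
    # Assumes cluster_centroids_ is a [numeric, categorical] pair, as indexed here.
    cluster_index = row['cluster']
    return dissim_distance(row[categ_cols], kproto.cluster_centroids_[1][cluster_index])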

In [None]:
df_agg['min_categ_dist'] = df_agg.apply(get_min_dist, axis=1)
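
As a small worked example of this categorical distance, the sketch below compares three made-up sessions against a made-up centroid and flags those that differ in more than two of the four categorical features, the same threshold used in the final filter of this notebook. The centroid, the group labels, and the session names are hypothetical.

```python
def dissim_distance(a, b):
    """Count the positions where the two sequences differ."""
    return sum(ai != bi for ai, bi in zip(a, b))

# Hypothetical categorical centroid:
# [orig_ip_group, resp_ip_group, day_of_week, time_of_day_bin]
centroid = ["finance", "file_server", "Tuesday", "morning"]

# Hypothetical sessions assigned to that cluster.
examples = {
    "typical": ["finance", "file_server", "Tuesday", "morning"],
    "odd_hours": ["finance", "file_server", "Sunday", "late_night"],
    "unusual": ["guest", "backup", "Sunday", "late_night"],
}

distances = {name: dissim_distance(vals, centroid) for name, vals in examples.items()}
flagged = [name for name, dist in distances.items() if dist > 2]
print(distances, flagged)
```

With four categorical features the distance ranges from 0 to 4, so a threshold of 2 flags only records that disagree with their centroid on most features.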

In [None]:
df_agg.to_pickle("smb_files_aggregate_k_prototypes")

Show records that are anomalous across both numeric and categorical features. Investigate these.

In [None]:
df_agg[(df_agg['min_categ_dist'] > 2) & (df_agg['cluster']==2)]