This notebook has been modified to remove sensitive data. It excludes the original dataset, the output of each cell, and some feature engineering based on domain knowledge. The inputs are retained to illustrate our machine learning process.

In [None]:
import pandas as pd
import math
from collections import Counter
from matplotlib import pyplot as plt

The functions below assign each row the Shannon entropy of its hostname, the length of the hostname, and a "length bin" for the hostname. These are used to determine a statistical threshold for hostname entropy that accounts for the higher entropy a longer name can have.

In [None]:
def hostname_entropy(row):
    """Shannon entropy (bits per character) of the row's hostname."""
    # Non-string values (e.g. NaN or numbers) are coerced to str first
    s = str(row['hostname'])
    p, lns = Counter(s), float(len(s))

    return -sum(count/lns * math.log(count/lns, 2) for count in p.values())
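
As a quick illustration of why longer names can carry more entropy, the sketch below applies the same formula to a few made-up strings. The helper name `shannon_entropy` and the sample strings are assumptions for illustration and are not part of the original notebook. A string of length n can have an entropy of at most log2(n), which it reaches only when every character is distinct.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy, in bits per character, of the string s."""
    counts, n = Counter(s), float(len(s))
    return -sum(c / n * math.log(c / n, 2) for c in counts.values())

# Hypothetical sample strings, not taken from the original dataset.
samples = ["aaaa", "abab", "abcd", "abcdefgh"]
for name in samples:
    # log2(len) is the largest entropy a string of that length can reach.
    print(name, shannon_entropy(name), math.log(len(name), 2))
```

A repeated character contributes nothing, so "aaaa" scores zero, while "abcd" and "abcdefgh" both reach the ceiling for their length. This is the effect the length bins are meant to control for.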

In [None]:
def hostname_len(row):
    """Length of the row's hostname; non-string values are coerced to str."""
    return len(str(row['hostname']))

In [None]:
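def hostname_len_bin(row):
    """Label the 5-character-wide length bin of the row's hostname, e.g. "5 - 9"."""
    l = row.hostname_len
    # Optional cap (disabled): lump every length of 15 or more into one bin
    #     if l/5 >= 3:
    #         return "15 +"
    beg = math.floor(l/5) * 5
    return str(beg) + " - " + str(beg + 4)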
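
To make the bin labels concrete, here is a minimal standalone sketch of the same arithmetic. The function name `length_bin` and its `width` parameter are hypothetical and only mirror the logic above for a plain integer length.

```python
import math

def length_bin(length, width=5):
    """Label the fixed-width bin containing length, e.g. '5 - 9'."""
    beg = math.floor(length / width) * width
    return str(beg) + " - " + str(beg + width - 1)

# A few hypothetical lengths and the bin each one falls into.
for n in (3, 5, 9, 10, 17):
    print(n, "->", length_bin(n))
```

Because the labels are strings, they sort lexicographically ("10 - 14" comes before "5 - 9"), which is worth keeping in mind when reading a grouped summary.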


In [None]:
# df = pd.read_json(r"C:\Users\hanbrolo\Documents\2.05-NTLM.json")
df = pd.read_csv(r"C:\Users\hanbrolo\Documents\ntlm_2.25_to_3.4.csv")

In [None]:
df.dtypes
# Do any required dtype conversions here

Consider each hostname only once by building a DataFrame of the unique hostnames.

In [None]:
unique_array = df.hostname.unique()
df_hostnames = pd.DataFrame(data=unique_array)

In [None]:
df_hostnames.columns = ['hostname']

In [None]:
df_hostnames

Apply the functions to assign the hostname entropy, hostname length, and hostname length bin columns.

In [None]:
df_hostnames['hostname_entropy'] = df_hostnames.apply(hostname_entropy, axis=1)

In [None]:
df_hostnames['hostname_len'] = df_hostnames.apply(hostname_len, axis=1)

In [None]:
df_hostnames['hostname_len_bin'] = df_hostnames.apply(hostname_len_bin, axis=1)
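
Row-wise `apply` is easy to read but can be slow on large frames. As one possible alternative, the sketch below derives the same three columns from the hostname column directly. The `entropy` helper, the `hosts` frame, and its sample values are assumptions for illustration and do not come from the original data.

```python
import math
from collections import Counter

import pandas as pd

def entropy(s):
    """Shannon entropy, in bits per character, of the string s."""
    counts, n = Counter(s), float(len(s))
    return -sum(c / n * math.log(c / n, 2) for c in counts.values())

# Hypothetical hostnames; the integer stands in for a non-string value.
hosts = pd.DataFrame({"hostname": ["WORKSTATION-01", "dc1", 12345]})

names = hosts["hostname"].astype(str)   # coerce to str once, up front
hosts["hostname_entropy"] = names.map(entropy)
hosts["hostname_len"] = names.str.len()
beg = hosts["hostname_len"] // 5 * 5    # lower edge of each 5-wide bin
hosts["hostname_len_bin"] = beg.astype(str) + " - " + (beg + 4).astype(str)
print(hosts)
```

Coercing to `str` in one place plays the same role as the `TypeError` handling in the row-wise functions.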

In [None]:
df_hostnames

Describe and view the overall distribution of entropy across unique hostnames, then summarize it for each length bin.

In [None]:
# df_hostnames holds the hostname_entropy column (one row per unique hostname)
entropy_by_hostname = df_hostnames.groupby('hostname')['hostname_entropy'].mean()
entropy_by_hostname.describe()

In [None]:
hostname_entropy_threshold = entropy_by_hostname.mean() + entropy_by_hostname.std() * 3
hostname_entropy_threshold

In [None]:
plt.hist(entropy_by_hostname, 20)
plt.axvline(x=hostname_entropy_threshold, color='r')
plt.show()

In [None]:
plt.boxplot(entropy_by_hostname)
plt.show()

In [None]:
df_entropy_by_len = df_hostnames.groupby(['hostname_len_bin'])['hostname_entropy'].describe()

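Assign a threshold to each length bin using the three-standard-deviation rule: the bin's mean plus three times its standard deviation. A bin whose standard deviation is undefined falls back to its mean.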

In [None]:
df_entropy_by_len['threshold'] = [ i if math.isnan(j) else i + 3*j for i,j in zip(df_entropy_by_len['mean'], df_entropy_by_len['std'])]
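
The `math.isnan` check matters because pandas reports the sample standard deviation, which is undefined (NaN) for a bin that contains a single hostname. The sketch below reproduces that case on a toy frame and computes the same threshold with `fillna` instead of a list comprehension. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical per-hostname values; the real dataset is not included.
toy = pd.DataFrame({
    "hostname_len_bin": ["0 - 4", "0 - 4", "0 - 4", "5 - 9"],
    "hostname_entropy": [1.0, 1.5, 2.0, 2.5],
})
stats = toy.groupby("hostname_len_bin")["hostname_entropy"].describe()

# A bin with a single hostname has no sample standard deviation (NaN),
# so its threshold falls back to the mean alone.
stats["threshold"] = stats["mean"] + 3 * stats["std"].fillna(0)
print(stats[["count", "mean", "std", "threshold"]])
```

Here the "5 - 9" bin has one member, so its threshold equals its mean, matching the behavior of the list comprehension above.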

View the thresholds. These were used as an initial baseline for our hostname entropy Bro script.

In [None]:
df_entropy_by_len
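
For readers who want to try the thresholds in Python rather than in Bro, the sketch below shows one way a per-bin threshold table could be used to flag hostnames. It is not the Bro script itself. The threshold values, the `hosts` frame, and the `flagged` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-bin thresholds, standing in for the 'threshold' column.
thresholds = pd.Series({"0 - 4": 1.8, "5 - 9": 2.8})

# Hypothetical hostnames with their length bin and entropy (rounded).
hosts = pd.DataFrame({
    "hostname": ["dc1", "x7q2", "web-01", "k3v9zq1x"],
    "hostname_len_bin": ["0 - 4", "0 - 4", "5 - 9", "5 - 9"],
    "hostname_entropy": [1.58, 2.0, 2.58, 3.0],
})

# Look up each hostname's bin threshold, then flag entropies above it.
hosts["threshold"] = hosts["hostname_len_bin"].map(thresholds)
hosts["flagged"] = hosts["hostname_entropy"] > hosts["threshold"]
print(hosts[hosts["flagged"]])
```

Comparing each hostname against the threshold for its own bin, rather than a single global value, keeps long but ordinary names from being flagged merely for their length.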

In [None]:
# hostname_len lives on df_hostnames (one row per unique hostname)
plt.hist(df_hostnames.hostname_len)
plt.show()

In [None]:
# Overall mean entropy across (length bin, hostname) pairs
df_hostnames.groupby(['hostname_len_bin', 'hostname'])['hostname_entropy'].mean().mean()