# KMeans on Cybersecurity Data

Using the SSH log from http://www.secrepo.com/, we examine how the KMeans algorithm performs when applied naively to a network security dataset. After converting the ssh.log file to CSV, we read in the data set and use the get_dummies function from pandas to encode the categorical columns into a numeric form that sklearn can process.

In [None]:
import pandas as pd
df_ssh = pd.read_csv("ssh.csv", header = None)
df_transformed = pd.get_dummies(df_ssh)
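
To make the encoding step concrete, here is a minimal sketch of what get_dummies does to a small headerless table. The rows below are hypothetical stand-ins and do not reflect the actual columns or values of ssh.csv.

```python
import pandas as pd

# Hypothetical toy rows standing in for a headerless log export;
# the real ssh.csv has different columns and values.
toy = pd.DataFrame([
    ["10.0.0.1", "success", 22],
    ["10.0.0.2", "failure", 22],
    ["10.0.0.1", "failure", 2222],
])

encoded = pd.get_dummies(toy)

# Each string column is expanded into one indicator column per distinct
# value, while the numeric column is passed through unchanged.
print(encoded.columns.tolist())
print(encoded.shape)
```

Because every distinct value gets its own column, high-cardinality fields such as addresses or timestamps can make the encoded table very wide, which is likely why the transformed SSH data ends up with so many columns.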

# Training the Model

Next comes the training step. Before fitting any models, we project the data into two dimensions with t-SNE to get a visual sense of its structure. Since our data set is quite large (over 7000 rows and columns when transformed into dummy values), it takes a while to train the 9 models, one for each value of K from 2 to 10. We will use the silhouette score to evaluate each clustering and determine which number of clusters fits the data best.
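
As a rough illustration of how the silhouette score behaves, the sketch below scores two labelings of the same synthetic points: one that matches the true groups and one that ignores them. The data and label arrays are made up for this example and are unrelated to the SSH log.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated synthetic groups of 2-D points (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

good_labels = np.array([0] * 50 + [1] * 50)  # matches the true groups
bad_labels = np.tile([0, 1], 50)             # alternates, ignoring structure

good = silhouette_score(X, good_labels, metric="euclidean")
bad = silhouette_score(X, bad_labels, metric="euclidean")
print(f"good: {good:.3f}, bad: {bad:.3f}")
```

Scores range from -1 to 1. Values near 1 indicate tight, well-separated clusters, and values near 0 indicate overlapping ones, so a higher score for a given K suggests a better fit.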

In [None]:
from sklearn.manifold import TSNE
TSNE_output = TSNE().fit_transform(df_transformed.values)
TSNE_output

In [None]:
# gridplot is no longer importable from bokeh.io in recent Bokeh releases;
# the unused imports (gridplot, push_notebook, HoverTool) are dropped.
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource
output_notebook()

# Keep only the two t-SNE coordinates for plotting; the one-hot columns are not needed here.
df_display = pd.DataFrame({"x": TSNE_output[:, 0], "y": TSNE_output[:, 1]})

# height/width replace the plot_height/plot_width arguments removed in Bokeh 3.
p = figure(height = 800, width = 800, tools = ["hover"], title = "TSNE of Dataset 2 with Hover")
p.left[0].formatter.use_scientific = False
p.circle("x", "y", color = "red", source = ColumnDataSource(df_display))

show(p)

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np

def calcClusters():
    # Cluster the encoded data for K = 2..10 and track the best silhouette score.
    X = df_transformed.values
    nrange = np.arange(2, 11, 1)
    output = []
    best = -1  # silhouette scores lie in [-1, 1]
    index = 0
    for n in nrange:
        print("Training on " + str(n) + " Clusters...")
        model = KMeans(n_clusters = n).fit(X)
        # Score on the same array that was used for fitting.
        score = silhouette_score(X, model.labels_, metric = 'euclidean')
        output.append((n, model.labels_, score))
        if score > best:
            best = score
            index = len(output) - 1
    return output, best, index

output, best, index = calcClusters()
print("Training Complete!")
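
Once training finishes, the best clustering can be pulled out of the returned list by position. The sketch below assumes the (n_clusters, labels, score) tuple layout that calcClusters builds; the `output` list here is a hypothetical placeholder with made-up labels and scores, not results from the SSH data.

```python
import numpy as np

# Hypothetical stand-in for the list returned by calcClusters():
# each entry is (n_clusters, labels, silhouette score). Values are made up.
output = [
    (2, np.array([0, 0, 1, 1, 1, 0]), 0.41),
    (3, np.array([0, 0, 1, 1, 2, 2]), 0.63),
    (4, np.array([0, 3, 1, 1, 2, 2]), 0.52),
]
# Position of the entry with the highest score, mirroring the returned index.
index = max(range(len(output)), key=lambda i: output[i][2])

best_k, best_labels, best_score = output[index]
# Number of rows assigned to each cluster in the best clustering.
cluster_sizes = np.bincount(best_labels)
print(best_k, best_score, cluster_sizes)
```

Checking the cluster sizes alongside the score is a useful sanity check, since a high silhouette score can also come from one dominant cluster plus a few tiny ones.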