### Introduction

In this notebook (a follow-up to the notebook in this repository titled "Penetration Testing as a Service Automation Ideas"), I explore possibilities for using the unsupervised K-means clustering algorithm in scikit-learn. It is considered unsupervised because I use data that "isn't labeled". I put that in quotation marks because I use the same dataset as the logistic regression notebook (https://github.com/beechamb/exploring_threat_data/blob/main/Penetration%20Testing%20as%20a%20Service%20Automation%20Ideas.ipynb), but without the dependent column. I imagine that this type of model would be used when a company has not experienced any threats, attacks, or breaches in the past and does not know where its weak points are.

First, I preprocess the data (including one-hot encoding of categorical variables) using the Python module here: https://github.com/beechamb/exploring_threat_data/blob/main/data_utils.py.
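To make the one-hot encoding step concrete, here is a minimal sketch using `pd.get_dummies` on a toy frame. It is not the code in `data_utils.py`, whose internals are not shown here; the toy values and the variable names `toy` and `encoded` are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "device_type": ["thermostat", "smart", "sensor", "sensor"],
    "cpu_usage": [66.74, 19.92, 10.08, 68.60],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["device_type"], dtype=int)
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, which gives a distance-based algorithm numeric inputs to work with.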

*NOTE: because the data is theoretically unlabeled, we do not need to split it into training and testing sets; the testing data would come from continual model monitoring.

Then, I train a K-means model with 2 clusters.

In [100]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

#Import defined modules for data processing
from data_utils import data_processing as dp

In [101]:
#set project path
project_folder = 'C:/Users/beech/Documents/Jobs/Cobalt/'

In [102]:
#read in data
df = pd.read_csv(project_folder +'data/smart_system_anomaly_dataset.csv').fillna(0)
df.head()

Unnamed: 0,timestamp,device_id,device_type,cpu_usage,memory_usage,network_in_kb,network_out_kb,packet_rate,avg_response_time_ms,service_access_count,failed_auth_attempts,is_encrypted,geo_location_variation,label
0,2025-06-20 12:51:55.452400,thermostat_38,thermostat,66.74,75.54,77,849,372,419.92,2,7,0,18.19,Normal
1,2025-06-20 12:51:56.452400,smart_light_37,smart,19.92,16.7,526,1492,635,32.69,3,3,1,2.98,Normal
2,2025-06-20 12:51:57.452400,sensor_1,sensor,10.08,48.62,577,923,220,418.91,4,10,0,9.66,Normal
3,2025-06-20 12:51:58.452400,sensor_9,sensor,68.6,22.46,400,240,769,66.81,5,6,1,15.0,Anomaly_DoS
4,2025-06-20 12:51:59.452400,smart_light_31,smart,50.62,48.15,104,1176,605,340.41,7,1,1,11.21,Normal


In [103]:
#binary target: 0 = Normal, 1 = any anomaly (used only for evaluation, not for training)
df['threat_cat'] = df['label'].apply(lambda x: 0 if x == 'Normal' else 1)

In [104]:
df3 = dp.process_data(df, ['label','timestamp'],['device_id','device_type'])

In [105]:
cols = df3.columns
df3[cols] = df3[cols].apply(pd.to_numeric, errors='coerce', axis=1)

### Train

In [106]:
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(df3.drop(columns=['threat_cat']))
kmeans.labels_

array([1, 1, 1, ..., 0, 0, 0])
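K-means groups points by Euclidean distance, so features with large ranges (such as the kilobyte counts) can dominate features with small ranges (such as the usage percentages). The sketch below shows one common way to guard against that by standardizing the features before clustering. It uses synthetic data, and it assumes that `dp.process_data` does not already scale the features, which I have not verified; `X`, `pipeline`, and `scaled_labels` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: two features on very different scales
X = np.column_stack([
    rng.normal(50, 20, size=200),    # e.g. a percentage-like feature
    rng.normal(800, 400, size=200),  # e.g. a kilobyte-count feature
])

# Standardize each feature to zero mean and unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, random_state=0, n_init=10),
)
scaled_labels = pipeline.fit_predict(X)
```

Whether scaling changes the outcome on the real dataset is an open question; this is only a check worth running before drawing conclusions from the clusters.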

### Investigate

For now, I am just going to investigate how well the K-means cluster assignments match the known labels in the training set.

In [107]:
labels = pd.Series(kmeans.labels_)
labels

0       1
1       1
2       1
3       0
4       1
       ..
9995    1
9996    0
9997    0
9998    0
9999    0
Length: 10000, dtype: int32

In [108]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [109]:
accuracy_score(df3['threat_cat'], labels)

0.4953

In [111]:
precision_score(df3['threat_cat'], labels)

0.19680097185665113

In [112]:
recall_score(df3['threat_cat'], labels)

0.47368421052631576
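One caveat when scoring clusters against labels: the cluster IDs 0 and 1 are arbitrary, so cluster 1 does not necessarily correspond to "threat". The sketch below, on toy arrays that are assumptions for illustration, scores both possible mappings and also computes the adjusted Rand index, which does not depend on how the clusters are numbered.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Toy example: the grouping is perfect, but the cluster IDs are swapped
y_true = np.array([0, 0, 1, 1, 0, 1])
cluster_ids = np.array([1, 1, 0, 0, 1, 0])

raw_acc = accuracy_score(y_true, cluster_ids)          # IDs taken at face value
flipped_acc = accuracy_score(y_true, 1 - cluster_ids)  # IDs swapped
best_acc = max(raw_acc, flipped_acc)

# Permutation-invariant agreement between the two partitions
ari = adjusted_rand_score(y_true, cluster_ids)

print(raw_acc, flipped_acc, best_acc, ari)
```

For two clusters, the flipped accuracy is simply one minus the raw accuracy. Applied to the 0.4953 reported above, the swapped mapping gives 0.5047, so the clusters sit near chance under either mapping.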

### Results

So it seems that unsupervised learning, at least K-means with 2 clusters, would probably not be the way to go: an accuracy of about 0.50 on a two-class problem is what random assignment would achieve, and precision (about 0.20) and recall (about 0.47) are low as well. I think this test indicates that logistic regression or another form of supervised learning would be the better approach for threat detection. I think this is feasible, as it is my understanding that Cobalt's core has vast knowledge of the aspects of a company that indicate a threat, which puts us in a good position to label observations and put our best foot forward when training predictive models.