A tech firm has identified three suspects in a hacking attack. It is certain that the first two were involved, but it is unsure about the third. The goal is to determine whether the third suspect took part in the hack.

The hackers are assumed to split the attacks evenly. With 100 total attacks, two hackers would account for 50 attacks each, and three hackers for about 33 each. This assumption is the key to solving the problem.
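To make the equal-split assumption concrete, here is a minimal sketch in plain Python. The helper name `expected_share` is hypothetical and not part of the original notebook; it only restates the arithmetic from the paragraph above.

```python
def expected_share(total_attacks, n_hackers):
    """Return (attacks per hacker, leftover attacks) under an even split.

    Hypothetical helper for illustration only.
    """
    return divmod(total_attacks, n_hackers)


# The 100-attack example from the text, for two and three hackers.
for n_hackers in (2, 3):
    per_hacker, leftover = expected_share(100, n_hackers)
    print(f"{n_hackers} hackers: {per_hacker} each, {leftover} left over")
```

With three hackers the split cannot be exact, so the cluster sizes would only be expected to match approximately.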

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("HackClustering").getOrCreate()

In [4]:
data = spark.read.csv('../../SparkData/hack_data.csv', header=True,
                     inferSchema=True)

In [5]:
data.head()

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)

In [12]:
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [6]:
from pyspark.ml.clustering import KMeans

In [7]:
from pyspark.ml.feature import VectorAssembler

In [8]:
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [19]:
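# Use the numeric columns as features. Location is a string column,
# which VectorAssembler does not accept, so it is left out.
feat_cols = ['Session_Connection_Time',
             'Bytes Transferred',
             'Kali_Trace_Used',
             'Servers_Corrupted',
             'Pages_Corrupted',
             'WPM_Typing_Speed']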

In [20]:
assembler = VectorAssembler(inputCols=feat_cols,
                           outputCol='features')

In [22]:
final_data = assembler.transform(data)

In [23]:
final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)



In [24]:
from pyspark.ml.feature import StandardScaler

In [25]:
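# Scale the features so that columns with large ranges do not dominate
# the distance computation (by default: unit standard deviation, no centering).
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')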

In [26]:
scaler_model = scaler.fit(final_data)

In [27]:
cluster_final_data = scaler_model.transform(final_data)

In [28]:
cluster_final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



In [29]:
# One estimator per hypothesis: two hackers (k=2) or three hackers (k=3).
km2 = KMeans(featuresCol='scaledFeatures', k=2)
km3 = KMeans(featuresCol='scaledFeatures', k=3)

In [31]:
model_km2 = km2.fit(cluster_final_data)
model_km3 = km3.fit(cluster_final_data)

In [35]:
model_km2.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



In [36]:
model_km3.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   83|
|         2|   84|
|         0|  167|
+----------+-----+



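With k=2 the clusters are perfectly balanced (167 and 167), whereas with k=3 they are uneven (167, 84, and 83). Under the assumption that each hacker takes on an equal workload, this indicates that only two hackers were involved.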
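The comparison of cluster sizes can also be written as a small check. This is a sketch in plain Python rather than a step from the original notebook: the `is_balanced` helper and its tolerance of 1 are assumptions chosen for illustration, and the counts are copied from the two outputs above.

```python
def is_balanced(counts, tolerance=1):
    """Return True if the largest and smallest cluster differ by at most `tolerance`.

    Hypothetical helper; the tolerance value is an assumption.
    """
    return max(counts) - min(counts) <= tolerance


# Cluster sizes copied from the groupBy('prediction').count() outputs.
counts_k2 = [167, 167]
counts_k3 = [83, 84, 167]

print("k=2 balanced:", is_balanced(counts_k2))
print("k=3 balanced:", is_balanced(counts_k3))
```

Only the k=2 clustering passes the check, which is consistent with the two-hacker conclusion.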