# K-Means Clustering

Scenario:

A large technology firm needs your help: they've been hacked!
Luckily, their forensic engineers have grabbed valuable data about the hacks,
including information like session time, locations, WPM typing speed, etc.
The forensic engineer relates to you what she has figured out so far: she has been able to
grab metadata for each session that the hackers used to connect to their servers.
The technology firm has 3 potential hackers who perpetrated the attack.
They're certain of the first two hackers, but they aren't sure whether the third hacker was involved.
They have requested your help! Can you help figure out whether the third suspect had anything to do
with the attacks, or was it just two hackers? It's probably not possible to know for sure,
but maybe what you've just learned about clustering can help!
One last key fact: the forensic engineer knows that the hackers trade off attacks,
meaning they should each have roughly the same number of attacks. For example, if there were 100 total attacks,
then in a 2-hacker situation each should have about 50 hacks, and in a 3-hacker situation each would have
about 33 hacks. The engineer believes this is the key element to solving this,
but doesn't know how to split this unlabeled data into groups of hackers.

Data description:

In [1]:
# Basic imports
from pyspark.sql import SparkSession

In [2]:
# Creation of the spark session
spark = SparkSession.builder.appName('hackers').getOrCreate()

In [3]:
# Reading the data
dataset = spark.read.csv('hack_data.csv', header=True, inferSchema=True)

In [4]:
# Look at the data
dataset.show(3)

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
only showing top 3 rows



# Data preparation

In [5]:
# Imports 
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [6]:
# Looking at the column names
dataset.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [7]:
# Instantiating the VectorAssembler
# Here I choose which variables to use as features for the model
# (Location is a string column and is left out)
assembler = VectorAssembler(inputCols=['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                                       'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed'],
                            outputCol='features')

In [8]:
# Transforming the dataset through the assembler object
final_data = assembler.transform(dataset)
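
As a rough illustration of what the assembler produces for each row, here is a plain-Python sketch. The `assemble` helper is hypothetical and is not part of PySpark; the row values are the first row printed by `dataset.show(3)` above.

```python
# Column order mirrors the inputCols passed to the VectorAssembler above.
feature_columns = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                   'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']

# First row of the dataset as printed by dataset.show(3).
row = {
    'Session_Connection_Time': 8.0,
    'Bytes Transferred': 391.09,
    'Kali_Trace_Used': 1,
    'Servers_Corrupted': 2.96,
    'Pages_Corrupted': 7.0,
    'Location': 'Slovenia',
    'WPM_Typing_Speed': 72.37,
}

def assemble(row, columns):
    """Hypothetical stand-in: pack the selected columns into one numeric list."""
    return [float(row[name]) for name in columns]

features = assemble(row, feature_columns)
print(features)  # Location is absent because it is not in feature_columns
```

The real `VectorAssembler` stores the result as a Spark vector in the `features` column; the list here only mimics its contents.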

We are going to use the K-means clustering model. K-means assigns points by distance, so features measured on larger scales would dominate the result; to let the model perform well we need to normalize the feature values first. Link: https://en.wikipedia.org/wiki/Curse_of_dimensionality. In this case we use the StandardScaler module.

In [9]:
# Import 
from pyspark.ml.feature import StandardScaler

In [10]:
# Instantiating the StandardScaler.
# It takes the "features" column as input and outputs the values scaled to unit standard deviation
# (withMean=False, so the values are not centered)
scaler = StandardScaler(inputCol='features', outputCol='scalaredFeatures', withStd=True, withMean=False)

In [11]:
# Fitting of the data
scaler_model = scaler.fit(final_data)

In [12]:
# Transforming of the data
finalData = scaler_model.transform(final_data)

At this point our dataset is normalized and ready to use with the k-means clustering model.
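
To make the scaling step concrete, here is a minimal plain-Python sketch of what `withStd=True, withMean=False` amounts to for a single feature: each value is divided by the column's standard deviation and nothing is subtracted. It assumes the corrected (n - 1) sample standard deviation, which is what the Spark documentation describes. The `scale_to_unit_std` helper is hypothetical, and the three values are the `Session_Connection_Time` entries shown by `dataset.show(3)`, so the numbers will differ from those Spark computes on the full dataset.

```python
import statistics

def scale_to_unit_std(values):
    """Divide by the sample standard deviation without centering (withMean=False)."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / std for v in values]

# Session_Connection_Time values from the three rows shown by dataset.show(3).
session_times = [8.0, 20.0, 31.0]
scaled = scale_to_unit_std(session_times)
print(scaled)
print(statistics.stdev(scaled))  # close to 1.0 by construction
```

Because the mean is not removed, the scaled values stay positive and keep their order; only their spread changes.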

# Model creation

In [13]:
# Import
from pyspark.ml.clustering import KMeans

In [14]:
# Instantiating the KMeans estimator.
# Here I make sure to specify the column on which the clusters must be calculated.
# The aim of the job is to find out whether there are 2 or 3 groups of hackers, so I calculate both 2 and 3 clusters.

k_means_2 = KMeans(featuresCol='scalaredFeatures', k=2) 
k_means_3 = KMeans(featuresCol='scalaredFeatures', k=3) 

In [15]:
# Fitting of the normalized data
kmeansModel_2 = k_means_2.fit(finalData)
kmeansModel_3 = k_means_3.fit(finalData)
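
For intuition about what `fit` does, here is a minimal sketch of Lloyd's algorithm, the classic k-means iteration, in plain Python. It is not Spark's implementation: Spark runs distributed and by default uses a different initialization, while this toy version simply seeds the centroids with the first `k` points (an assumption made for determinism). The `kmeans` function and the sample points are made up for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Toy Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

# Made-up points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

The two steps alternate until the assignments stop changing; a fixed iteration count is used here only to keep the sketch short.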

Now it helps to recall the advice from the scenario: the forensic engineer knows that the hackers trade off attacks, meaning they should each have roughly the same number of attacks. For example, if there were 100 total attacks, then in a 2-hacker situation each should have about 50 hacks, and in a 3-hacker situation each would have about 33 hacks. The engineer believes this is the key element to solving this, but doesn't know how to split this unlabeled data into groups of hackers.

In [16]:
# So, first I check the counts for the three clusters
kmeansModel_3.transform(finalData).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         2|   84|
|         0|   83|
+----------+-----+



With three clusters we don't get the even split we would expect (about 33% of the attacks in each cluster): one cluster holds half of the sessions. Let's try with two clusters.

In [17]:
kmeansModel_2.transform(finalData).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



Great! Nice result! With two clusters the attacks are split exactly 50% and 50%.
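
The comparison against an even split can also be put into a number. The sketch below uses the cluster sizes printed above and a hypothetical helper, `max_relative_deviation`, that reports how far the worst cluster is from the even share; it is one possible balance measure, not something the scenario prescribes.

```python
def max_relative_deviation(counts):
    """Largest relative gap between a cluster size and the even share."""
    expected = sum(counts) / len(counts)
    return max(abs(c - expected) / expected for c in counts)

counts_k3 = [167, 84, 83]  # cluster sizes printed for K=3
counts_k2 = [167, 167]     # cluster sizes printed for K=2

print(max_relative_deviation(counts_k3))
print(max_relative_deviation(counts_k2))
```

For K=3 the largest cluster is about 50% above the even share of roughly 111 sessions, while for K=2 the deviation is zero.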

In [18]:
#kmeansModel_2.transform(finalData).show()

At this point we can evaluate our models with the Within Set Sum of Squared Errors (WSSSE).

In [19]:
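# Reminder: as K grows, the Within Set Sum of Squared Errors goes down
# Note: computeCost was deprecated in Spark 2.4 and removed in Spark 3.0;
# on newer versions the training cost is available as model.summary.trainingCost
wssse_2 = kmeansModel_2.computeCost(finalData)
wssse_3 = kmeansModel_3.computeCost(finalData)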
print("Within Set Sum of Squared Errors K=2 = " + str(wssse_2))
print("Within Set Sum of Squared Errors K=3 = " + str(wssse_3))

Within Set Sum of Squared Errors K=2 = 601.7707512676716
Within Set Sum of Squared Errors K=3 = 434.1492898715845
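
As a sketch of what this metric measures, the plain-Python function below sums the squared distance from each point to its nearest centroid. The `wssse` helper, the points and the centroids are all made up for illustration and are unrelated to the hack data.

```python
import math

def wssse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Made-up points: two pairs, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_centroid = [(5.0, 1.0)]
two_centroids = [(0.0, 1.0), (10.0, 1.0)]

print(wssse(points, one_centroid))
print(wssse(points, two_centroids))
```

Adding a centroid can only bring points closer to their nearest center, which is why the lower value for K=3 above does not by itself favor three clusters.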


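So we have accomplished our task: the evenly split two-cluster result suggests that the attacks were carried out by two groups and not three. The lower WSSSE for K=3 does not contradict this, since WSSSE always goes down as K grows. We can therefore confirm to the forensic engineers what they suspected. With our model they can assign every attack to one of the two hackers!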

Finally, we can print out the centroids of the clusters.

In [25]:
# Here I print the coordinates of the centroids
centroid_2 = kmeansModel_2.clusterCenters()
centroid_3 = kmeansModel_3.clusterCenters()
print('CENTROIDS WITH K=2')
for cent in centroid_2:
  print(cent)
print('\n')
print('CENTROIDS WITH K=3')
for cent in centroid_3:
  print(cent)

CENTROIDS WITH K=2
[ 2.99991988  2.92319035  1.05261534  3.20390443  4.51321315  3.28474   ]
[ 1.26023837  1.31829808  0.99280765  1.36491885  2.5625043   5.26676612]


CENTROIDS WITH K=3
[ 3.05623261  2.95754486  1.99757683  3.2079628   4.49941976  3.26738378]
[ 1.26023837  1.31829808  0.99280765  1.36491885  2.5625043   5.26676612]
[ 2.93719177  2.88492202  0.          3.19938371  4.52857793  3.30407351]
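
One way to read the K=3 centroids printed above is to ask where the first and third centroids differ, since they look nearly identical. The sketch below copies those two printed centroids and the feature order passed to the assembler, then finds the coordinate with the largest gap; the variable names are made up for illustration.

```python
feature_names = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                 'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']

# First and third centroids from the K=3 output, in scaled units.
k3_cluster_0 = [3.05623261, 2.95754486, 1.99757683, 3.2079628, 4.49941976, 3.26738378]
k3_cluster_2 = [2.93719177, 2.88492202, 0.0, 3.19938371, 4.52857793, 3.30407351]

# Absolute gap per feature between the two centroids.
gaps = {name: abs(a - b)
        for name, a, b in zip(feature_names, k3_cluster_0, k3_cluster_2)}
most_different = max(gaps, key=gaps.get)
print(most_different)
print(gaps)
```

The two centroids differ almost only in `Kali_Trace_Used`, which hints that the third cluster comes from splitting one group on that column rather than from a distinct third behavior.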


# Thanks for your attention!