# K-means notebook

A system was hacked, and metadata for each session the hackers used to connect to the servers was recovered. The data has the following features:

    * Session_Connection_Time: How long the session lasted, in minutes
    * Bytes Transferred: Number of MB transferred during the session
    * Kali_Trace_Used: Indicates whether the hacker was using Kali Linux
    * Servers_Corrupted: Number of servers corrupted during the attack
    * Pages_Corrupted: Number of pages illegally accessed
    * Location: Location the attack came from (probably useless because the hackers used VPNs)
    * WPM_Typing_Speed: The hacker's estimated typing speed, based on session logs

There are 3 potential hackers: 2 are confirmed and 1 is not yet confirmed.

The forensic engineer knows that the hackers trade off attacks, so each hacker should have carried out roughly the same number of attacks. For example, with 100 total attacks, each hacker would account for about 50 attacks in a two-hacker scenario and about 33 in a three-hacker scenario.
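
The even-split reasoning can be written down as a small helper. This is only an illustrative sketch; the function name `expected_attacks_per_hacker` is hypothetical and not part of the notebook.

```python
def expected_attacks_per_hacker(total_attacks, n_hackers):
    """Attacks each hacker would account for if the work were split evenly."""
    return total_attacks / n_hackers


# The 100-attack example from the text, for two and three hackers
for n_hackers in (2, 3):
    share = expected_attacks_per_hacker(100, n_hackers)
    print(f"{n_hackers} hackers -> about {share:.0f} attacks each")
```

Cluster sizes that stay close to this expected share support the corresponding number of hackers; sizes that deviate strongly argue against it.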


### GOAL: How many hackers carried out the attacks?

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
!java -version
!pip install pyspark

In [10]:
from google.colab import files
files.upload()

Saving hacking.csv to hacking (2).csv


{'hacking.csv': b"Session_Connection_Time,Bytes Transferred,Kali_Trace_Used,Servers_Corrupted,Pages_Corrupted,Location,WPM_Typing_Speed\n8.0,391.09,1,2.96,7.0,Slovenia,72.37\n20.0,720.99,0,3.04,9.0,British Virgin Islands,69.08\n31.0,356.32,1,3.71,8.0,Tokelau,70.58\n2.0,228.08,1,2.48,8.0,Bolivia,70.8\n20.0,408.5,0,3.57,8.0,Iraq,71.28\n1.0,390.69,1,2.79,9.0,Marshall Islands,71.57\n18.0,342.97,1,5.1,7.0,Georgia,72.32\n22.0,101.61,1,3.03,7.0,Timor-Leste,72.03\n15.0,275.53,1,3.53,8.0,Palestinian Territory,70.17\n12.0,424.83,1,2.53,8.0,Bangladesh,69.99\n15.0,249.09,1,3.39,9.0,Northern Mariana Islands,70.77\n32.0,242.48,0,4.24,8.0,Zimbabwe,67.93\n23.0,514.54,0,3.18,8.0,Isle of Man,68.56\n9.0,284.77,0,3.12,9.0,Sao Tome and Principe,70.82\n27.0,779.25,1,2.37,8.0,Greece,72.73\n12.0,307.31,1,3.22,7.0,Solomon Islands,67.95\n21.0,355.94,1,2.0,7.0,Guinea-Bissau,72.0\n10.0,372.65,0,3.33,7.0,Burkina Faso,69.19\n20.0,347.23,1,2.33,7.0,Mongolia,70.41\n22.0,456.57,0,1.52,8.0,Nigeria,69.35\n25.0,582.03,0,

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('find_hacker').getOrCreate()

In [0]:
from pyspark.ml.clustering import KMeans


dataset = spark.read.csv("hacking.csv",header=True,inferSchema=True)


In [18]:
dataset.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

In [19]:
dataset.head()

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)

In [0]:
# VectorAssembler packs the selected columns into a single vector column
from pyspark.ml.feature import VectorAssembler

In [0]:
feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used', 'Servers_Corrupted', 'Pages_Corrupted','WPM_Typing_Speed']



In [0]:
vec_assembler = VectorAssembler(inputCols = feat_cols, outputCol='features')

In [0]:
final_data = vec_assembler.transform(dataset)

In [0]:
from pyspark.ml.feature import StandardScaler

In [0]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [0]:
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)

In [0]:
# Normalize each feature to have unit standard deviation.
cluster_final_data = scalerModel.transform(final_data)
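
To make the effect of `withStd=True, withMean=False` concrete, here is a minimal plain-Python sketch of the same idea on one column: each value is divided by the column's sample standard deviation, and nothing is subtracted. The helper `scale_to_unit_std` is hypothetical, it uses the corrected (n - 1) standard deviation, and the sample values are the first five `Session_Connection_Time` entries from the uploaded CSV.

```python
from statistics import stdev


def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation (no centering)."""
    s = stdev(column)  # corrected (n - 1) sample standard deviation
    return [x / s for x in column]


# First five Session_Connection_Time values from hacking.csv
session_minutes = [8.0, 20.0, 31.0, 2.0, 20.0]
scaled = scale_to_unit_std(session_minutes)

print(scaled)
print(stdev(scaled))  # approximately 1.0
```

Scaling matters here because k-means relies on Euclidean distance: without it, a wide-ranging column such as `Bytes Transferred` would dominate the distance over a 0/1 column such as `Kali_Trace_Used`.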

In [0]:
kmeans3 = KMeans(featuresCol='scaledFeatures',k=3)
kmeans2 = KMeans(featuresCol='scaledFeatures',k=2)

In [0]:
model_k3 = kmeans3.fit(cluster_final_data)
model_k2 = kmeans2.fit(cluster_final_data)

In [0]:
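# computeCost() was removed in Spark 3.0; summary.trainingCost reports the same
# within-set sum of squared errors, evaluated on the training data.
wssse_k3 = model_k3.summary.trainingCost
wssse_k2 = model_k2.summary.trainingCost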

In [32]:
print("With K=3")
print("Within Set Sum of Squared Errors = " + str(wssse_k3))
print('--'*30)
print("With K=2")
print("Within Set Sum of Squared Errors = " + str(wssse_k2))

With K=3
Within Set Sum of Squared Errors = 434.1492898715845
------------------------------------------------------------
With K=2
Within Set Sum of Squared Errors = 601.7707512676716
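
The quantity printed above is the sum, over all points, of the squared distance from each point to its nearest cluster center. A minimal plain-Python sketch on made-up toy points (not the hacking data) shows the calculation; the names `squared_distance` and `wssse` are hypothetical helpers.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Toy data: two tight groups far apart on the x-axis
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(wssse(toy_points, one_center))   # 104.0
print(wssse(toy_points, two_centers))  # 4.0
```

Adding centers generally lowers this cost, so a lower value for K=3 than for K=2 does not by itself show that three clusters describe the data better.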


In [33]:
for k in range(2,9):
    kmeans = KMeans(featuresCol='scaledFeatures',k=k)
    model = kmeans.fit(cluster_final_data)
    # computeCost() was removed in Spark 3.0; trainingCost is the equivalent value
    wssse = model.summary.trainingCost
    print("With K={}".format(k))
    print("Within Set Sum of Squared Errors = " + str(wssse))
    print('--'*30)

With K=2
Within Set Sum of Squared Errors = 601.7707512676716
------------------------------------------------------------
With K=3
Within Set Sum of Squared Errors = 434.1492898715845
------------------------------------------------------------
With K=4
Within Set Sum of Squared Errors = 414.215590355726
------------------------------------------------------------
With K=5
Within Set Sum of Squared Errors = 250.8996394957951
------------------------------------------------------------
With K=6
Within Set Sum of Squared Errors = 230.80943597630556
------------------------------------------------------------
With K=7
Within Set Sum of Squared Errors = 205.8368396522182
------------------------------------------------------------
With K=8
Within Set Sum of Squared Errors = 210.80890917413782
------------------------------------------------------------
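
To read the loop output as an elbow check, it helps to look at how much the cost changes with each extra cluster. The sketch below only re-uses the values printed above; the dictionary name `wssse_by_k` is an illustrative choice.

```python
# WSSSE values copied from the loop output above
wssse_by_k = {
    2: 601.7707512676716,
    3: 434.1492898715845,
    4: 414.215590355726,
    5: 250.8996394957951,
    6: 230.80943597630556,
    7: 205.8368396522182,
    8: 210.80890917413782,
}

ks = sorted(wssse_by_k)
for prev_k, k in zip(ks, ks[1:]):
    change = wssse_by_k[k] - wssse_by_k[prev_k]
    print(f"K={prev_k} -> K={k}: change in WSSSE = {change:+.2f}")
```

The changes are uneven, and the cost even rises slightly from K=7 to K=8. That can happen because k-means only converges to a local optimum, so a larger K is not guaranteed to end with a lower cost. With no single clean elbow, the cost alone does not settle the question of two versus three hackers.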


In [34]:
model_k3.transform(cluster_final_data).groupBy('prediction').count().show()


+----------+-----+
|prediction|count|
+----------+-----+
|         1|   84|
|         2|   83|
|         0|  167|
+----------+-----+



In [35]:
model_k2.transform(cluster_final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+
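
The cluster sizes can be compared against the even-split expectation stated at the top of the notebook. This is a small sketch using the counts printed above; `imbalance_ratio` is a hypothetical helper, and the max/min ratio is just one simple way to measure balance.

```python
def imbalance_ratio(cluster_counts):
    """Largest cluster size divided by the smallest; 1.0 means a perfectly even split."""
    return max(cluster_counts) / min(cluster_counts)


# Counts copied from the groupBy('prediction').count() outputs above
counts_k3 = [167, 84, 83]
counts_k2 = [167, 167]

for label, counts in (("K=3", counts_k3), ("K=2", counts_k2)):
    expected = sum(counts) / len(counts)
    print(f"{label}: counts={counts}, expected per hacker={expected:.1f}, "
          f"max/min ratio={imbalance_ratio(counts):.2f}")
```

With K=2 the 334 sessions split exactly 167 to 167, while with K=3 one cluster is about twice the size of the other two. Under the assumption that the hackers trade off attacks evenly, this is consistent with two hackers rather than three.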

