# Analysing Hacking Attacks using K-Means

### Aim

To determine how many hackers were involved in a hacking case.

### Dataset

[Hacking Dataset](https://www.dropbox.com/s/g5r2dh46abx1vdr/hack_data.csv?dl=0)

### Information about the dataset

1. **'Session_Connection_Time':** How long the session lasted, in minutes
2. **'Bytes Transferred':** Number of MB transferred during the session
3. **'Kali_Trace_Used':** Indicates whether the hacker was using Kali Linux
4. **'Servers_Corrupted':** Number of servers corrupted during the attack
5. **'Pages_Corrupted':** Number of pages illegally accessed
6. **'Location':** Location the attack came from
7. **'WPM_Typing_Speed':** Estimated typing speed in words per minute, based on session logs

**Note:** The hackers in this case trade off attacks, so each should have carried out roughly the same number of attacks. For example, with 100 total attacks, two hackers would each account for about 50 attacks, and three hackers for about 33 each.
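
The arithmetic behind this note can be sketched in a few lines of plain Python. The function name `expected_attacks_per_hacker` is a hypothetical helper introduced here for illustration; it is not part of the dataset or of PySpark.

```python
def expected_attacks_per_hacker(total_attacks: int, n_hackers: int) -> float:
    """Attacks per hacker if the attacks were split perfectly evenly."""
    if n_hackers <= 0:
        raise ValueError("n_hackers must be positive")
    return total_attacks / n_hackers


# The example from the note: 100 attacks shared by 2 or 3 hackers.
for n in (2, 3):
    share = expected_attacks_per_hacker(100, n)
    print(f"{n} hackers: about {share:.1f} attacks each")
```

If a clustering produces groups whose sizes are far from this even share, that clustering is a poor match for the trade-off assumption.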

## Imports

In [1]:
from pyspark.sql import SparkSession

from pyspark.ml.feature import StandardScaler, VectorAssembler

from pyspark.ml.clustering import KMeans

In [2]:
spark = SparkSession.builder.appName('hack-clustering').getOrCreate()

In [3]:
data = spark.read.csv('input/hack_data.csv', header=True, inferSchema=True)

In [4]:
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [5]:
data.show(5)

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|
|                    2.0|           228.08|              1|             2.48|            8.0|             Bolivia|            70.8|
|                   20.0|            408.5|              0|             3.57

## Data preparation

In [6]:
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

We will not use the `Location` attribute: we assume the hackers used VPNs during the attacks, so the recorded location is unreliable.

In [7]:
cols = ['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed']

In [8]:
# Vector Assembling
assembler = VectorAssembler(inputCols=cols, outputCol='features')
assembled_data = assembler.transform(data)

In [9]:
# Feature scaling
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures')
scaler_model = scaler.fit(assembled_data)
scaled_data = scaler_model.transform(assembled_data)

scaled_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)
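
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation without centering it. The plain-Python sketch below illustrates that idea on a single column. The helper name `scale_by_std` is hypothetical, and the input is only the first five `Session_Connection_Time` values shown earlier, so the results will not match the notebook output, which is scaled using the full dataset.

```python
from statistics import stdev


def scale_by_std(values):
    """Divide each value by the sample standard deviation (no mean-centering)."""
    sd = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("cannot scale a constant column")
    return [v / sd for v in values]


# First five Session_Connection_Time values from data.show(5).
session_minutes = [8.0, 20.0, 31.0, 2.0, 20.0]
scaled = scale_by_std(session_minutes)

for raw, s in zip(session_minutes, scaled):
    print(f"{raw:5.1f} -> {s:.4f}")
```

Because nothing is subtracted, scaled values stay proportional to the raw ones. Scaling matters for K-means because the algorithm uses Euclidean distance, so an unscaled feature with a large range such as `Bytes Transferred` would otherwise dominate the clustering.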



In [10]:
scaled_data.select('scaledFeatures').show()

+--------------------+
|      scaledFeatures|
+--------------------+
|[0.56785108466505...|
|[1.41962771166263...|
|[2.20042295307707...|
|[0.14196277116626...|
|[1.41962771166263...|
|[0.07098138558313...|
|[1.27766494049636...|
|[1.56159048282889...|
|[1.06472078374697...|
|[0.85177662699757...|
|[1.06472078374697...|
|[2.27140433866020...|
|[1.63257186841202...|
|[0.63883247024818...|
|[1.91649741074455...|
|[0.85177662699757...|
|[1.49060909724576...|
|[0.70981385583131...|
|[1.41962771166263...|
|[1.56159048282889...|
+--------------------+
only showing top 20 rows



## Building our models

In [11]:
# Creating models
k_means_2 = KMeans(featuresCol='scaledFeatures', k=2)
k_means_3 = KMeans(featuresCol='scaledFeatures', k=3)

In [12]:
# Fitting models to dataset
model_k2 = k_means_2.fit(scaled_data)
model_k3 = k_means_3.fit(scaled_data)

In [16]:
# Evaluate both clustering models by their Within Set Sum of Squared Errors (WSSSE).
# KMeansModel.computeCost() was deprecated in Spark 2.4 and removed in Spark 3.0;
# summary.trainingCost gives the same quantity for the training data.
wssse_2 = model_k2.summary.trainingCost
print("K = 2: Within Set Sum of Squared Errors = {0:.4f}".format(wssse_2))

wssse_3 = model_k3.summary.trainingCost
print("K = 3: Within Set Sum of Squared Errors = {0:.4f}".format(wssse_3))

K = 2: Within Set Sum of Squared Errors = 601.7708
K = 3: Within Set Sum of Squared Errors = 434.1493
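
To make the metric concrete, here is a minimal plain-Python sketch of WSSSE: the sum, over all points, of the squared Euclidean distance to the nearest cluster centre. The function `wssse` and the toy points and centres are assumptions made up for illustration; they are not taken from the dataset.

```python
def wssse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest centre."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total


# Toy data: two tight pairs of points, far apart along the first axis.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
one_center = [(2.0, 0.5)]
two_centers = [(0.0, 0.5), (4.0, 0.5)]

print("k = 1:", wssse(points, one_center))   # 17.0
print("k = 2:", wssse(points, two_centers))  # 1.0
```

WSSSE generally falls as more clusters are added, so the lower value for K = 3 does not by itself indicate three hackers. This is why the cluster sizes are examined next.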


## Was a third hacker involved?

In [17]:
# K-means with 3 clusters
model_k3_data = model_k3.transform(scaled_data)
model_k3_data.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   83|
|         2|   84|
|         0|  167|
+----------+-----+



In [19]:
# K-means with 2 clusters
model_k2_data = model_k2.transform(scaled_data)
model_k2_data.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



## Conclusion

From our analysis, the attacks are evenly distributed across the clusters when k = 2 (167 and 167), but not when k = 3 (83, 84 and 167).

Given that the hackers in this case trade off attacks, **it is far more likely that only two hackers were involved.**
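
The balance argument can also be expressed as a number. The sketch below uses the cluster counts reported above and measures how far the largest or smallest cluster strays from a perfectly even split. The helper `max_relative_deviation` is a hypothetical name introduced here, and this measure is only one possible way to quantify balance.

```python
def max_relative_deviation(counts):
    """Largest absolute deviation from an even split, relative to the even share."""
    expected = sum(counts) / len(counts)
    return max(abs(c - expected) for c in counts) / expected


# Cluster sizes taken from the groupBy('prediction').count() outputs above.
k2_counts = [167, 167]
k3_counts = [83, 84, 167]

print("k = 2:", round(max_relative_deviation(k2_counts), 3))
print("k = 3:", round(max_relative_deviation(k3_counts), 3))
```

A value of 0 means a perfectly even split. With k = 3 the largest cluster holds about 50% more attacks than an even three-way share would allow, which is hard to reconcile with hackers who take turns.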