# Clustering Consulting Project
<hr>

A company was hacked and has gathered data to figure out how many hackers actually got into its system. The technology firm has 3 potential hackers who may have perpetrated the attack. They are certain of the first two hackers but are not sure whether the third hacker was involved. One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning each should have roughly the same number of attacks. For example, if there were 100 total attacks, then in a 2-hacker situation each should have about 50 hacks.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('k-means').getOrCreate()

In [2]:
from pyspark.ml.clustering import KMeans

In [3]:
data = spark.read.csv('./hack_data.csv', header=True, inferSchema=True)

### Data Description
<hr>

- **Session_Connection_Time:** How long the session lasted in minutes
- **Bytes Transferred:** Number of MB transferred during the session
- **Kali_Trace_Used:** Indicates if the hacker was using Kali Linux
- **Servers_Corrupted:** Number of servers corrupted during the attack
- **Pages_Corrupted:** Number of pages illegally accessed
- **Location:** Location attack came from (Probably useless because the hackers used VPNs)
- **WPM_Typing_Speed:** Their estimated typing speed based on session logs

In [4]:
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



I think `Location` is useless since the hackers used VPNs to perform the attacks. I'm going to exclude it.

In [7]:
features_columns = data.select('Session_Connection_Time',
                               'Bytes Transferred',
                               'Kali_Trace_Used',
                               'Servers_Corrupted',
                               'Pages_Corrupted',
                               'WPM_Typing_Speed')

In [50]:
features_columns.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed']

In [23]:
from pyspark.ml.feature import VectorAssembler

In [24]:
assembler = VectorAssembler(inputCols=features_columns.columns, outputCol='features')

In [27]:
final_data = assembler.transform(data)

In [28]:
final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)



The commented-out cells below scale the data, in case we need to do so. For now I'm fitting the model without scaling the data.

In [51]:
# from pyspark.ml.feature import StandardScaler

In [52]:
# scaler = StandardScaler(inputCol='features', outputCol='scaled_features')

In [53]:
# final_pred = scaler.fit(final_data).transform(final_data)

In [54]:
# final_pred.head()
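
To illustrate what the commented-out scaling step would do, here is a minimal plain-Python sketch. It assumes the scaler's default behavior of dividing each feature by its sample standard deviation without centering. The three rows are made-up values for (`Bytes Transferred`, `Kali_Trace_Used`), not rows from `hack_data.csv`, and `scale_to_unit_std` and `euclidean` are hypothetical helpers.

```python
import math


def scale_to_unit_std(rows):
    """Divide every column by its sample standard deviation (no centering)."""
    n = len(rows)
    stds = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        stds.append(math.sqrt(var))
    # A constant column (std == 0) is mapped to 0.0 to avoid dividing by zero.
    return [[x / s if s > 0 else 0.0 for x, s in zip(row, stds)] for row in rows]


def euclidean(a, b):
    """Euclidean distance, the distance K-Means uses by default."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up example rows: (Bytes Transferred, Kali_Trace_Used)
rows = [[400.0, 0.0], [410.0, 1.0], [900.0, 0.0]]
scaled = scale_to_unit_std(rows)

# Unscaled, the 0/1 flag barely affects distances; scaled, it counts
# about as much as the large-valued byte column.
print(euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2]))
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

Because K-Means relies on distances, features with large numeric ranges can dominate the clustering when the data is left unscaled, so it may be worth comparing the cluster counts with and without scaling.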

In [55]:
kmeans2 = KMeans(featuresCol='features', k=2)
kmeans3 = KMeans(featuresCol='features', k=3)

In [56]:
model2 = kmeans2.fit(final_data)
model3 = kmeans3.fit(final_data)

In [57]:
model2.transform(final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  171|
|         0|  163|
+----------+-----+



In [58]:
model3.transform(final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  144|
|         2|   99|
|         0|   91|
+----------+-----+



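Based on the forensic hint and the K-Means clustering results, only two hackers were involved in the attack: with `k=2` the clusters are nearly even (171 vs. 163 attacks), while with `k=3` they are clearly uneven (144, 99, and 91).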
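
The balance comparison behind this conclusion can be made explicit with a few lines of plain Python. This is only a sketch: the counts are copied from the two `groupBy('prediction').count()` outputs above, and `balance_ratio` is a hypothetical helper, not part of PySpark.

```python
def balance_ratio(counts):
    """Smallest cluster size divided by the largest (1.0 means perfectly even)."""
    return min(counts) / max(counts)


# Cluster sizes copied from the outputs above.
counts_k2 = [171, 163]
counts_k3 = [144, 99, 91]

for k, counts in ((2, counts_k2), (3, counts_k3)):
    expected = sum(counts) / k  # attacks per hacker if they split evenly
    print(f"k={k}: counts={counts}, expected per cluster={expected:.1f}, "
          f"balance ratio={balance_ratio(counts):.2f}")
```

A ratio close to 1.0 matches the forensic engineer's hint that the hackers trade off attacks evenly. The `k=2` split is much closer to that than the `k=3` split.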