# Hackers Group Prediction

A large technology firm has been hacked! Luckily, its forensic engineers captured valuable data about the hacks, including session time, location, and typing speed in words per minute (WPM). The forensic engineer explains what she has worked out so far: she was able to capture metadata for each session the hackers used to connect to the firm's servers. These are the features of the data:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location the attack came from (probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.


The technology firm has 3 potential hackers who may have perpetrated the attack. They're certain of the first two hackers, but they aren't sure whether the third hacker was involved. We need to figure out whether the third suspect had anything to do with the attacks, or whether it was just two hackers.

**One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning they should each have roughly the same number of attacks. For example, if there were 100 total attacks, then in a 2-hacker situation each should have about 50 hacks, and in a 3-hacker situation each would have about 33 hacks. The engineer believes this is the key to solving the problem, but doesn't know how to separate this unlabeled data into groups of hackers.**

In [32]:
# Initialize pyspark
import findspark
findspark.init()
import pyspark

In [33]:
# Initialize and create a Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('hackers').getOrCreate()

In [34]:
# Using Spark to read in the hack data
data = spark.read.csv('hack_data.csv', header=True, inferSchema=True)

In [35]:
# Printing the first row of the dataframe
data.head()

Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37)

In [36]:
# Printing the schema of the dataframe
data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [37]:
data.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                    1.0|              10.0|                 0|              1.0|               6.0|Afghanistan|              40.0|
|    max|           

### Formatting the data

In [38]:
# Import statements to setup ML
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

In [39]:
data.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [40]:
# Assembling all the numeric features into a single vector column "features" (Location is left out)

assembler = VectorAssembler(inputCols=['Session_Connection_Time','Bytes Transferred','Kali_Trace_Used',
                        'Servers_Corrupted','Pages_Corrupted','WPM_Typing_Speed'], outputCol='features')

In [41]:
output = assembler.transform(data)

In [42]:
output.select('features').show(3, truncate=False)

+--------------------------------+
|features                        |
+--------------------------------+
|[8.0,391.09,1.0,2.96,7.0,72.37] |
|[20.0,720.99,0.0,3.04,9.0,69.08]|
|[31.0,356.32,1.0,3.71,8.0,70.58]|
+--------------------------------+
only showing top 3 rows



### Scaling the Data

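It is a good idea to scale the data before clustering: k-means relies on distances between points, so a feature with a large range (such as 'Bytes Transferred') would otherwise dominate features with small ranges (such as 'Kali_Trace_Used').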

In [43]:
from pyspark.ml.feature import StandardScaler

In [44]:
scaler = StandardScaler(inputCol='features', outputCol='scaled_features')

In [45]:
# Compute summary statistics by fitting the StandardScaler
scaled_model = scaler.fit(output)

In [46]:
# Normalize each feature to have unit standard deviation.
final_data = scaled_model.transform(output)

In [47]:
final_data.select('features','scaled_features').show(3)

+--------------------+--------------------+
|            features|     scaled_features|
+--------------------+--------------------+
|[8.0,391.09,1.0,2...|[0.56785108466505...|
|[20.0,720.99,0.0,...|[1.41962771166263...|
|[31.0,356.32,1.0,...|[2.20042295307707...|
+--------------------+--------------------+
only showing top 3 rows



In [48]:
final_data.select('scaled_features').show(3, truncate=False)

+------------------------------------------------------------------------------------------------------------------+
|scaled_features                                                                                                   |
+------------------------------------------------------------------------------------------------------------------+
|[0.5678510846650524,1.3658432518957642,1.9975768336483841,1.2858903881191532,2.2849485348398866,5.396290958577967]|
|[1.419627711662631,2.517986463945197,0.0,1.320644182392644,2.9377909733655687,5.150971112595909]                  |
|[2.2004229530770782,1.2444124562517545,1.9975768336483841,1.611707209433128,2.6113697541027276,5.262819066691072] |
+------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows
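As a sanity check on the scaled values: with its default settings, PySpark's `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The sketch below reproduces that step in plain Python for three features of the first row, using the standard deviations copied from the `describe()` output and the raw values from `data.head()`. The dictionary names are only illustrative.

```python
# Standard deviations copied from the describe() output earlier in the notebook
stddev = {
    "Session_Connection_Time": 14.088200614636158,
    "Bytes Transferred": 286.33593163576757,
    "Kali_Trace_Used": 0.5006065264451406,
}

# Raw values of the first row, copied from data.head()
first_row = {
    "Session_Connection_Time": 8.0,
    "Bytes Transferred": 391.09,
    "Kali_Trace_Used": 1,
}

# Default StandardScaler behaviour: divide by the standard deviation, no centering
scaled = {name: first_row[name] / stddev[name] for name in first_row}

for name, value in scaled.items():
    print(f"{name}: {value:.4f}")
```

The printed values should agree with the first three entries of the first `scaled_features` row shown above.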



**Now we need to determine whether the attacks fall into 2 groups or 3 groups. So let's fit k-means with k=2 and k=3 and then count the number of attacks assigned to each group.**

In [49]:
#Creating a K-means model for k=2 and k=3
from pyspark.ml.clustering import KMeans

In [50]:
kmeans2 = KMeans(featuresCol='scaled_features',k=2)
kmeans3 = KMeans(featuresCol='scaled_features',k=3)

In [51]:
k2_model = kmeans2.fit(final_data)
k3_model = kmeans3.fit(final_data)

Evaluating each clustering by computing its Within Set Sum of Squared Errors (WSSSE).
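To make the metric concrete, here is a minimal plain-Python sketch of the WSSSE: for every point, take the squared Euclidean distance to its nearest cluster center, then sum over all points. The function name `wssse` and the toy points are hypothetical and are not part of the hack data.

```python
def wssse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        )
    return total


# Hypothetical toy data: two tight groups on a line
points = [(0.0,), (1.0,), (10.0,), (11.0,)]

print(wssse(points, [(0.5,), (10.5,)]))  # one center per group
print(wssse(points, [(5.5,)]))           # a single center for everything
```

Adding a center can only keep this value the same or lower it, which is why a lower WSSSE alone does not show that a larger K is the better choice.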

In [52]:
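# Note: computeCost was deprecated in Spark 2.4 and removed in Spark 3.0;
# on newer versions, model.summary.trainingCost reports the training cost instead.
wssse_k2 = k2_model.computeCost(final_data)
wssse_k3 = k3_model.computeCost(final_data)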

In [53]:
print("With K=2, WSSSE = ",wssse_k2)
print("With K=3, WSSSE = ",wssse_k3)

With K=2, WSSSE =  601.7707512676716
With K=3, WSSSE =  434.75507308487647


Not much to be gained from the WSSSE; after all, we would expect the WSSSE to decrease as K increases. We could, however, continue the analysis by looking at the drop from K=3 to K=4 to check whether the clustering favors even or odd numbers. This won't be conclusive, but it's worth a look:

In [54]:
for k in range(2,9):
    kmeans = KMeans(featuresCol='scaled_features', k=k)
    k_model = kmeans.fit(final_data)
    wssse = k_model.computeCost(final_data)
    print("With K={}, WSSE={}".format(k,wssse))
    print('-'*60)

With K=2, WSSE=601.7707512676716
------------------------------------------------------------
With K=3, WSSE=434.75507308487647
------------------------------------------------------------
With K=4, WSSE=420.67698738775454
------------------------------------------------------------
With K=5, WSSE=252.57883236760853
------------------------------------------------------------
With K=6, WSSE=241.5430583569194
------------------------------------------------------------
With K=7, WSSE=236.05973891712603
------------------------------------------------------------
With K=8, WSSE=211.49663674713975
------------------------------------------------------------
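One way to read these numbers is to look at how much the WSSSE falls each time K grows by one. The sketch below does that arithmetic on the values copied from the loop output; the variable names are illustrative, and a rerun with different random initialization could give different numbers.

```python
# WSSSE values copied from the loop output above
wssse_by_k = {
    2: 601.7707512676716,
    3: 434.75507308487647,
    4: 420.67698738775454,
    5: 252.57883236760853,
    6: 241.5430583569194,
    7: 236.05973891712603,
    8: 211.49663674713975,
}

# Drop in WSSSE when moving from K-1 to K
ks = sorted(wssse_by_k)
drops = {k: wssse_by_k[prev] - wssse_by_k[k] for prev, k in zip(ks, ks[1:])}

for k, drop in drops.items():
    print(f"K={k - 1} -> K={k}: WSSSE drops by {drop:.2f}")
```

The drops do not shrink steadily: the steps to K=3 and K=5 are large while the step to K=4 is small, so there is no clean elbow that points to either 2 or 3 hackers.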


Nothing definitive can be said from the above! The last key fact that the engineer mentioned was that the attacks should be split evenly between the hackers. Let's check this using the prediction column that results from transform:

In [55]:
k2_model.transform(final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



In [56]:
k3_model.transform(final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   79|
|         2|   88|
|         0|  167|
+----------+-----+



From the above results, the attacks are split evenly (167 each) when there are 2 clusters, whereas with 3 clusters the attacks are unevenly distributed (167, 79, and 88).
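The comparison can also be put in numbers by measuring how far each cluster's size is from a perfectly even split. This is only a rough sketch: the helper name `max_deviation_from_even` is hypothetical, and the counts are copied from the two `groupBy` outputs above.

```python
def max_deviation_from_even(counts):
    """Largest gap between a cluster's size and the even-split size."""
    even_share = sum(counts) / len(counts)
    return max(abs(count - even_share) for count in counts)


k2_counts = [167, 167]     # cluster sizes from the K=2 output
k3_counts = [167, 79, 88]  # cluster sizes from the K=3 output

print("K=2:", max_deviation_from_even(k2_counts))
print("K=3:", max_deviation_from_even(k3_counts))
```

With K=2 the deviation is zero, while with K=3 the largest cluster is more than 55 attacks away from the even share of roughly 111.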

### So from this we can conclude that 2 hackers, not 3, were involved in the attacks!

In [None]:
#Closing spark session
spark.stop()