# Hackers Group Prediction

A large technology firm has been hacked! Luckily, their forensic engineers have grabbed valuable data about the hacks, including information like session time, locations, WPM typing speed, etc. The forensic engineer relates what she has been able to figure out so far: she has been able to grab metadata for each session that the hackers used to connect to the firm's servers. These are the features of the data:

- `Session_Connection_Time`: How long the session lasted in minutes
- `Bytes Transferred`: Number of MB transferred during session
- `Kali_Trace_Used`: Indicates if the hacker was using Kali Linux
- `Servers_Corrupted`: Number of servers corrupted during the attack
- `Pages_Corrupted`: Number of pages illegally accessed
- `Location`: Location attack came from (Probably useless because the hackers used VPNs)
- `WPM_Typing_Speed`: Their estimated typing speed based on session logs.

The technology firm has 3 potential hackers that perpetrated the attack. They are certain of the first two hackers, but they aren't sure whether the third hacker was involved. We need to figure out whether the third suspect had anything to do with the attacks, or whether it was just two hackers.

**One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning they should each have roughly the same number of attacks. For example, if there were 100 total attacks, then in a 2-hacker situation each should have about 50 hacks, and in a 3-hacker situation each would have about 33 hacks. The engineer believes this is the key element to solving this, but doesn't know how to separate this unlabeled data into groups of hackers.**

### Initializing and creating a spark session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("hackers").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577731860363)
SparkSession available as 'spark'


2019-12-31 00:21:14 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@399e183d


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Using Spark to read in the hackers data

In [3]:
val data = spark.read.options(Map(("header","true"),("inferSchema","true"))).csv("hack_data.csv")

data: org.apache.spark.sql.DataFrame = [Session_Connection_Time: double, Bytes Transferred: double ... 5 more fields]


### Printing the first row of the dataframe

In [4]:
val colnames = data.columns
val firstRow = data.head(1)(0)

colnames: Array[String] = Array(Session_Connection_Time, Bytes Transferred, Kali_Trace_Used, Servers_Corrupted, Pages_Corrupted, Location, WPM_Typing_Speed)
firstRow: org.apache.spark.sql.Row = [8.0,391.09,1,2.96,7.0,Slovenia,72.37]


In [5]:
for (i <- Range(0,colnames.size)){
    println(s"Column Name: ${colnames(i)}")
    println(s"Column Data: ${firstRow(i)}")
    println()
}

Column Name: Session_Connection_Time
Column Data: 8.0

Column Name: Bytes Transferred
Column Data: 391.09

Column Name: Kali_Trace_Used
Column Data: 1

Column Name: Servers_Corrupted
Column Data: 2.96

Column Name: Pages_Corrupted
Column Data: 7.0

Column Name: Location
Column Data: Slovenia

Column Name: WPM_Typing_Speed
Column Data: 72.37



### Count

In [6]:
data.count

res2: Long = 334


### Show

In [7]:
data.show(3)

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                    8.0|           391.09|              1|             2.96|            7.0|            Slovenia|           72.37|
|                   20.0|           720.99|              0|             3.04|            9.0|British Virgin Is...|           69.08|
|                   31.0|           356.32|              1|             3.71|            8.0|             Tokelau|           70.58|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
only showing top 3 rows



### Schema

In [8]:
data.printSchema

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



### Formatting the data

In [9]:
// Drop the Location column: it is a string and carries little signal (the hackers probably used VPNs)
val feature_data = data.drop("Location")

feature_data: org.apache.spark.sql.DataFrame = [Session_Connection_Time: double, Bytes Transferred: double ... 4 more fields]


In [10]:
feature_data.columns

res5: Array[String] = Array(Session_Connection_Time, Bytes Transferred, Kali_Trace_Used, Servers_Corrupted, Pages_Corrupted, WPM_Typing_Speed)


### Import statements to setup ML for KMeans Algorithm

In [11]:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


### Assembling all the input features into a single vector column "features"

In [12]:
val assembler = new VectorAssembler().setInputCols(feature_data.columns).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_432350d9a598


In [13]:
val output = assembler.transform(feature_data)

output: org.apache.spark.sql.DataFrame = [Session_Connection_Time: double, Bytes Transferred: double ... 5 more fields]


In [14]:
output.show(3)

+-----------------------+-----------------+---------------+-----------------+---------------+----------------+--------------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|WPM_Typing_Speed|            features|
+-----------------------+-----------------+---------------+-----------------+---------------+----------------+--------------------+
|                    8.0|           391.09|              1|             2.96|            7.0|           72.37|[8.0,391.09,1.0,2...|
|                   20.0|           720.99|              0|             3.04|            9.0|           69.08|[20.0,720.99,0.0,...|
|                   31.0|           356.32|              1|             3.71|            8.0|           70.58|[31.0,356.32,1.0,...|
+-----------------------+-----------------+---------------+-----------------+---------------+----------------+--------------------+
only showing top 3 rows



In [15]:
output.select("features").show(3,false)

+--------------------------------+
|features                        |
+--------------------------------+
|[8.0,391.09,1.0,2.96,7.0,72.37] |
|[20.0,720.99,0.0,3.04,9.0,69.08]|
|[31.0,356.32,1.0,3.71,8.0,70.58]|
+--------------------------------+
only showing top 3 rows
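
For intuition, the assembling step can be pictured in plain Scala without Spark. This is only a sketch: the `Session` case class and the `assemble` function are hypothetical names introduced here, not part of Spark's API. The sample values are the first row shown above.

```scala
// Hypothetical record type mirroring the six numeric columns kept after dropping Location.
case class Session(
  connectionTime: Double,
  bytesTransferred: Double,
  kaliTraceUsed: Int,
  serversCorrupted: Double,
  pagesCorrupted: Double,
  wpmTypingSpeed: Double
)

// Pack the columns, in order, into one numeric vector.
// The integer flag is widened to a Double, as in the "features" column above (1 becomes 1.0).
def assemble(s: Session): Vector[Double] =
  Vector(
    s.connectionTime,
    s.bytesTransferred,
    s.kaliTraceUsed.toDouble,
    s.serversCorrupted,
    s.pagesCorrupted,
    s.wpmTypingSpeed
  )

// First row of the dataset, taken from the output above.
val firstSession = Session(8.0, 391.09, 1, 2.96, 7.0, 72.37)
val firstFeatures = assemble(firstSession)
println(firstFeatures.mkString("[", ",", "]"))
```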



### Scaling the Data

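It is a good idea to scale the data: k-means relies on Euclidean distances, so without scaling a feature with a large numeric range (such as `Bytes Transferred`) would dominate the clustering.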

In [16]:
import org.apache.spark.ml.feature.StandardScaler

import org.apache.spark.ml.feature.StandardScaler


In [17]:
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")

scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_7a0f176c8a35


Compute summary statistics by fitting the StandardScaler

In [18]:
val scaled_model = scaler.fit(output)

scaled_model: org.apache.spark.ml.feature.StandardScalerModel = stdScal_7a0f176c8a35


Normalize each feature to have unit standard deviation.

In [19]:
val final_data = scaled_model.transform(output)

final_data: org.apache.spark.sql.DataFrame = [Session_Connection_Time: double, Bytes Transferred: double ... 6 more fields]


In [20]:
final_data.select("features","scaledFeatures").show(5)

+--------------------+--------------------+
|            features|      scaledFeatures|
+--------------------+--------------------+
|[8.0,391.09,1.0,2...|[0.56785108466505...|
|[20.0,720.99,0.0,...|[1.41962771166263...|
|[31.0,356.32,1.0,...|[2.20042295307707...|
|[2.0,228.08,1.0,2...|[0.14196277116626...|
|[20.0,408.5,0.0,3...|[1.41962771166263...|
+--------------------+--------------------+
only showing top 5 rows
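
With its default settings (`withStd = true`, `withMean = false`), `StandardScaler` divides each feature by its corrected sample standard deviation and does not subtract the mean. The output above is consistent with this: 8.0 maps to about 0.568 and 20.0 to about 1.420, the same ratio. A minimal plain-Scala sketch of that operation on a single column follows. The function names are hypothetical, and the toy column holds only the first five session times shown above, so its results will not match the `scaledFeatures` values, which use the standard deviation of all 334 rows.

```scala
// Corrected sample standard deviation (divides by n - 1).
def sampleStd(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1))
}

// Divide every value by the column's standard deviation; the mean is NOT subtracted.
def scaleToUnitStd(xs: Seq[Double]): Seq[Double] = {
  val sd = sampleStd(xs)
  xs.map(_ / sd)
}

// Toy column: the first five Session_Connection_Time values from the output above.
val sessionMinutes = Seq(8.0, 20.0, 31.0, 2.0, 20.0)
val scaledMinutes = scaleToUnitStd(sessionMinutes)

// After scaling, the column has unit standard deviation (up to rounding error).
println(sampleStd(scaledMinutes))
```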



**Now we need to identify whether the attacks fall into 2 groups or 3 groups. So let's build k-means models with k=2 and k=3 and then count the number of attacks assigned to each cluster.**

### Creating a K-means model for k=2 and k=3

In [21]:
val kmeans2 = new KMeans().setFeaturesCol("scaledFeatures").setK(2)
val kmeans3 = new KMeans().setFeaturesCol("scaledFeatures").setK(3)

kmeans2: org.apache.spark.ml.clustering.KMeans = kmeans_ae39c74fa0d8
kmeans3: org.apache.spark.ml.clustering.KMeans = kmeans_f674b9fdf0a0


In [22]:
val k2_model = kmeans2.fit(final_data)
val k3_model = kmeans3.fit(final_data)

2019-12-31 00:33:41 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-31 00:33:41 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


k2_model: org.apache.spark.ml.clustering.KMeansModel = kmeans_ae39c74fa0d8
k3_model: org.apache.spark.ml.clustering.KMeansModel = kmeans_f674b9fdf0a0
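
For intuition about what fitting does, here is a minimal plain-Scala sketch of the basic k-means loop (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. This is not Spark's implementation, which also chooses its starting centers differently. All function names, the toy points, and the fixed starting centers are assumptions made for illustration.

```scala
// Squared Euclidean distance between two points of equal dimension.
def sqDist(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to point p.
def nearest(p: Vector[Double], centers: Seq[Vector[Double]]): Int =
  centers.indices.minBy(i => sqDist(p, centers(i)))

// Component-wise mean of a non-empty group of points.
def centroid(group: Seq[Vector[Double]]): Vector[Double] =
  group.transpose.map(column => column.sum / column.size).toVector

// Repeat the assign-then-update step a fixed number of times.
// A center that receives no points is left where it is.
def lloyd(
  points: Seq[Vector[Double]],
  init: Seq[Vector[Double]],
  iterations: Int
): Seq[Vector[Double]] =
  (1 to iterations).foldLeft(init) { (centers, _) =>
    val groups = points.groupBy(p => nearest(p, centers))
    centers.indices.map(i => groups.get(i).map(centroid).getOrElse(centers(i)))
  }

// Toy data: two well-separated pairs of points (hypothetical values).
val toyPoints = Seq(
  Vector(0.0, 0.0), Vector(0.0, 1.0),
  Vector(10.0, 10.0), Vector(10.0, 11.0)
)
val startCenters = Seq(Vector(0.0, 0.0), Vector(10.0, 10.0))

val finalCenters = lloyd(toyPoints, startCenters, iterations = 5)
// Number of points assigned to each final center.
val clusterSizes = toyPoints.groupBy(p => nearest(p, finalCenters)).map { case (i, g) => i -> g.size }
println(finalCenters)
println(clusterSizes)
```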


### Evaluating clustering by computing Within Set Sum of Squared Errors

In [23]:
val wssse_k2 = k2_model.computeCost(final_data)
val wssse_k3 = k3_model.computeCost(final_data)

wssse_k2: Double = 601.7707512676716
wssse_k3: Double = 434.1492898715845


In [24]:
println(s"With K=2, WSSSE = $wssse_k2")
println(s"With K=3, WSSSE = $wssse_k3")

With K=2, WSSSE = 601.7707512676716
With K=3, WSSSE = 434.1492898715845
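
WSSSE is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster center. A minimal plain-Scala sketch of that definition follows. The function names and the toy points and centers are assumptions for illustration, not values from the hackers dataset.

```scala
// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Within Set Sum of Squared Errors: each point contributes its squared
// distance to the closest center.
def wssse(points: Seq[Vector[Double]], centers: Seq[Vector[Double]]): Double =
  points.map(p => centers.map(c => squaredDistance(p, c)).min).sum

// Toy data (hypothetical): three points, scored against one and then two centers.
val samplePoints = Seq(Vector(0.0, 0.0), Vector(0.0, 2.0), Vector(10.0, 0.0))
val oneCenter = Seq(Vector(0.0, 1.0))
val twoCenters = Seq(Vector(0.0, 1.0), Vector(10.0, 0.0))

println(wssse(samplePoints, oneCenter))
println(wssse(samplePoints, twoCenters))
```

Adding a center while keeping the existing ones can only leave each point's nearest distance unchanged or shorten it, which is why a lower WSSSE at a higher K is not, by itself, evidence for that K.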


Not much can be learned from the WSSSE of just two models; after all, we would expect the WSSSE to decrease as K increases. We can, however, continue the analysis over a range of K values and look at the size of each drop, for example from K=3 to K=4, to check whether the clustering favors even or odd numbers. This won't be conclusive, but it's worth a look:

In [25]:
for (k <- Range(2,9)){
    val kmeans = new KMeans().setFeaturesCol("scaledFeatures").setK(k)
    val k_model = kmeans.fit(final_data)
    val wssse = k_model.computeCost(final_data)
    println(s"With K=${k}, WSSSE=${wssse}")
    println("-"*60)
}

With K=2, WSSSE=601.7707512676716
------------------------------------------------------------
With K=3, WSSSE=434.1492898715845
------------------------------------------------------------
With K=4, WSSSE=267.1336116887891
------------------------------------------------------------
With K=5, WSSSE=246.62403145571247
------------------------------------------------------------
With K=6, WSSSE=238.8727655088174
------------------------------------------------------------
With K=7, WSSSE=205.4925600341295
------------------------------------------------------------
With K=8, WSSSE=201.49221244678472
------------------------------------------------------------
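
To make the drops easier to compare, the differences between successive values can be computed directly. The sketch below copies the WSSSE values from the output above; the variable names are introduced here for illustration.

```scala
// (K, WSSSE) pairs copied from the output above.
val wssseByK = Seq(
  2 -> 601.7707512676716,
  3 -> 434.1492898715845,
  4 -> 267.1336116887891,
  5 -> 246.62403145571247,
  6 -> 238.8727655088174,
  7 -> 205.4925600341295,
  8 -> 201.49221244678472
)

// For each K from 3 upward, how much WSSSE fell relative to K - 1.
val drops = wssseByK.zip(wssseByK.tail).map {
  case ((_, previous), (k, current)) => (k, previous - current)
}

drops.foreach { case (k, drop) =>
  println(f"K=${k - 1} -> K=$k: WSSSE drops by $drop%.2f")
}
```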


Nothing definitive can be said from the above! The last key fact the engineer mentioned was that the attacks should be evenly split between the hackers. Let's check by counting the rows per cluster in the `prediction` column that `transform` produces:

In [26]:
k2_model.transform(final_data).groupBy("prediction").count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+



In [27]:
k3_model.transform(final_data).groupBy("prediction").count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         2|   83|
|         0|   84|
+----------+-----+



From the results above, k=2 splits the 334 attacks into two clusters of exactly 167 each, whereas k=3 produces an uneven split of 167, 83, and 84 attacks.
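
The evenness check can be made explicit by comparing each cluster's size with the even share the engineer expects. This is a minimal sketch: the function name is hypothetical, and the counts are copied from the two outputs above.

```scala
// Largest absolute gap between any cluster's size and a perfectly even share.
def maxDeviationFromEven(counts: Seq[Int]): Double = {
  val expected = counts.sum.toDouble / counts.size
  counts.map(c => math.abs(c - expected)).max
}

// Cluster sizes copied from the groupBy("prediction") outputs above.
val k2Counts = Seq(167, 167)
val k3Counts = Seq(167, 83, 84)

println(s"K=2: max deviation from an even split = ${maxDeviationFromEven(k2Counts)}")
println(s"K=3: max deviation from an even split = ${maxDeviationFromEven(k3Counts)}")
```

With three hackers trading off attacks, each cluster would be expected to hold about 111 of the 334 sessions, which is far from the sizes observed for k=3.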

## So from this we can conclude that only 2 hackers were involved, and the third suspect most likely was not!

### Closing spark session

In [28]:
spark.stop()

## Thank You!