# Consulting project: Hack Data
### A company in San Francisco has recently been hacked and needs your help finding out about the hackers! The technology firm has 3 potential hackers who may have perpetrated the attack. They are certain of the first two hackers, but they aren't sure whether the third hacker was involved.

### Can you help figure out whether the third suspect had anything to do with the attacks, or whether it was just two hackers? Hint: Each hacker should have roughly the same number of attacks.

## Set up Spark and load data

In [2]:
appname = "K-Means Clustering - Hack Data"

# Look into https://spark.apache.org/downloads.html for the latest version
spark_mirror = "https://mirrors.sonic.net/apache/spark"
spark_version = "3.3.1"
hadoop_version = "3"

# Install Java 8 (Spark 3.3 also runs on Java 11 and 17; Java 8 is used here)
! apt-get update
! apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Download and extract Spark binary distribution
! rm -rf spark-{spark_version}-bin-hadoop{hadoop_version}.tgz spark-{spark_version}-bin-hadoop{hadoop_version}
! wget -q {spark_mirror}/spark-{spark_version}/spark-{spark_version}-bin-hadoop{hadoop_version}.tgz
! tar xzf spark-{spark_version}-bin-hadoop{hadoop_version}.tgz

# The only 2 environment variables needed to set up Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/spark-{spark_version}-bin-hadoop{hadoop_version}"

# Set up the Spark environment based on the environment variable SPARK_HOME 
! pip install -q findspark
import findspark
findspark.init()

# Get the Spark session object (basic entry point for every operation)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(appname).master("local[*]").getOrCreate()


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = "/content/drive/MyDrive/kaggle"

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

! rm -f hack_data.csv
api.dataset_download_file("soheiltehranipour/sample-hack-data", "Hack_Data.csv")

True

## Preliminary analysis and data preprocessing

In [5]:
# Load dataset
df = spark.read.format('csv').options(inferSchema=True, header=True).load('hack_data.csv')
df.show(5)

+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|Session_Connection_Time|Bytes Transferred|Kali_Trace_Used|Servers_Corrupted|Pages_Corrupted|            Location|WPM_Typing_Speed|
+-----------------------+-----------------+---------------+-----------------+---------------+--------------------+----------------+
|                      8|           391.09|              1|             2.96|              7|            Slovenia|           72.37|
|                     20|           720.99|              0|             3.04|              9|British Virgin Is...|           69.08|
|                     31|           356.32|              1|             3.71|              8|             Tokelau|           70.58|
|                      2|           228.08|              1|             2.48|              8|             Bolivia|            70.8|
|                     20|            408.5|              0|             3.57

In [6]:
df.printSchema()

root
 |-- Session_Connection_Time: integer (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: integer (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [7]:
# Statistical description of features
df.describe().show()

+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|summary|Session_Connection_Time| Bytes Transferred|   Kali_Trace_Used|Servers_Corrupted|   Pages_Corrupted|   Location|  WPM_Typing_Speed|
+-------+-----------------------+------------------+------------------+-----------------+------------------+-----------+------------------+
|  count|                    334|               334|               334|              334|               334|        334|               334|
|   mean|     30.008982035928145| 607.2452694610777|0.5119760479041916|5.258502994011977|10.838323353293413|       null|57.342395209580864|
| stddev|     14.088200614636158|286.33593163576757|0.5006065264451406| 2.30190693339697|  3.06352633036022|       null| 13.41106336843464|
|    min|                      1|              10.0|                 0|              1.0|                 6|Afghanistan|              40.0|
|    max|           

In [8]:
# Count null values within the data columns
print({col: df.filter(df[col].isNull()).count() for col in df.columns})

{'Session_Connection_Time': 0, 'Bytes Transferred': 0, 'Kali_Trace_Used': 0, 'Servers_Corrupted': 0, 'Pages_Corrupted': 0, 'Location': 0, 'WPM_Typing_Speed': 0}


We observe that the data contains one string feature, Location. However, this variable is probably uninformative because the hackers used VPNs, so it will be discarded. Before building the model, we also need to transform the categorical feature Kali_Trace_Used using a one-hot encoder.

In [9]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(dropLast = False, handleInvalid = "error", inputCol = "Kali_Trace_Used",
                        outputCol = "Kali_Trace_Used_OHE")
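
To make the effect of dropLast = False concrete, here is a minimal pure-Python sketch of one-hot encoding for a binary index column. The helper `one_hot` is hypothetical and only mimics the dense form of the result; Spark itself stores the encoded value as a sparse vector.

```python
def one_hot(index, num_categories, drop_last=False):
    """Return a dense one-hot list for a non-negative category index.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:  # with drop_last=True the last category maps to all zeros
        vector[index] = 1.0
    return vector


# Kali_Trace_Used takes the values 0 and 1, so there are two categories.
no_trace = one_hot(0, num_categories=2)
trace = one_hot(1, num_categories=2)
trace_dropped = one_hot(1, num_categories=2, drop_last=True)
print(no_trace, trace, trace_dropped)
```

With both categories kept, the binary flag occupies two slots in the feature vector instead of one.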


Next, we assemble the feature set into a single vector.

In [10]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used_OHE',
                                       'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed'],
                            outputCol='features', handleInvalid="skip")
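
As a rough illustration of what the assembler produces for one row, the sketch below flattens scalar and vector-valued columns into a single list. The function `assemble` is a hypothetical stand-in, not a PySpark API; the row values are the first row shown by df.show(5), with Kali_Trace_Used = 1 already one-hot encoded.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat list.

    Hypothetical stand-in for the per-row result of VectorAssembler.
    """
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features


input_cols = ["Session_Connection_Time", "Bytes Transferred", "Kali_Trace_Used_OHE",
              "Servers_Corrupted", "Pages_Corrupted", "WPM_Typing_Speed"]

# First row of the dataset; the one-hot vector [0.0, 1.0] encodes Kali_Trace_Used = 1.
row = {
    "Session_Connection_Time": 8,
    "Bytes Transferred": 391.09,
    "Kali_Trace_Used_OHE": [0.0, 1.0],
    "Servers_Corrupted": 2.96,
    "Pages_Corrupted": 7,
    "WPM_Typing_Speed": 72.37,
}

features = assemble(row, input_cols)
print(features)
print(len(features))
```

Six input columns yield seven entries because the one-hot column contributes two of them.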


Finally, before building the model we need to scale the data, since the features are on very different scales. For this, I will use a StandardScaler, and the transformers will be chained in a single Pipeline to simplify preprocessing.


In [49]:
from pyspark.ml.feature import StandardScaler
from pyspark.ml.pipeline import Pipeline

scaler = StandardScaler(inputCol='features', outputCol='scaled_features')

pipeline = Pipeline(stages = [encoder, assembler, scaler]).fit(df)

scaled_df = pipeline.transform(df)
scaled_df.select(["features", "scaled_features"]).show(5)

+--------------------+--------------------+
|            features|     scaled_features|
+--------------------+--------------------+
|[8.0,391.09,0.0,1...|[0.56785108466505...|
|[20.0,720.99,1.0,...|[1.41962771166263...|
|[31.0,356.32,0.0,...|[2.20042295307707...|
|[2.0,228.08,0.0,1...|[0.14196277116626...|
|[20.0,408.5,1.0,0...|[1.41962771166263...|
+--------------------+--------------------+
only showing top 5 rows
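
With its default settings (withStd=True, withMean=False), Spark's StandardScaler divides each feature by its sample standard deviation without centering it. The sketch below reproduces that arithmetic in plain Python; `scale_to_unit_std` is a hypothetical helper and the toy column is not taken from the dataset.

```python
import statistics


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation, without centering."""
    std = statistics.stdev(values)  # corrected (n - 1) sample standard deviation
    return [v / std for v in values]


# Toy column, not taken from the dataset.
toy = [2.0, 4.0, 6.0]
toy_scaled = scale_to_unit_std(toy)
print(toy_scaled)

# Check against the notebook: Session_Connection_Time = 8 in the first row, and
# df.describe() reports a stddev of 14.088200614636158 for that column.
first_scaled = 8 / 14.088200614636158
print(round(first_scaled, 6))
```

The second value agrees with the leading digits of the first scaled_features entry shown above (0.56785108466505...), which suggests the scaled vectors are not mean-centered.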



## Building the K-Means clustering model

We first build the K-means model.

In [76]:
from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol='scaled_features', distanceMeasure = "euclidean",
                initMode = "k-means||", initSteps = 50, maxIter = 100)

print(kmeans.explainParams())

distanceMeasure: the distance measure. Supported options: 'euclidean' and 'cosine'. (default: euclidean, current: euclidean)
featuresCol: features column name. (default: features, current: scaled_features)
initMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||, current: k-means||)
initSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2, current: 50)
k: The number of clusters to create. Must be > 1. (default: 2)
maxIter: max number of iterations (>= 0). (default: 20, current: 100)
predictionCol: prediction column name. (default: prediction)
seed: random seed. (default: -5706602770492230126)
tol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)
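
For readers who want a reminder of what maxIter bounds, here is a minimal sketch of plain Lloyd iterations in pure Python. It is not Spark's implementation: the initial centers are passed in by hand rather than chosen with k-means||, there is no tolerance parameter, and the function name `lloyd_kmeans` and the toy points are assumptions made for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Run plain Lloyd iterations and return (centers, labels)."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for j, center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, labels


# Toy data with two obvious groups (not taken from the dataset).
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centers, labels = lloyd_kmeans(points, initial_centers=[points[0], points[2]])
print(centers, labels)
```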


To estimate the optimal number of clusters, I will run a cross-validation over different numbers of clusters. Each model will be evaluated by its Silhouette score, using 5-fold cross-validation.

In [89]:
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Create an evaluator for our model
evaluator = ClusteringEvaluator(metricName='silhouette', distanceMeasure='squaredEuclidean',
                                featuresCol = "scaled_features", predictionCol = "prediction")

# Create a grid of parameters for different numbers of clusters (K)
params = ParamGridBuilder().addGrid(kmeans.k, range(2,11)).build()

# Perform Cross-Validation to find the best model
crossval = CrossValidator(estimator=kmeans,
                          estimatorParamMaps=params,
                          evaluator=evaluator,
                          numFolds=5)

# Train the models with our scaled data
best_km = crossval.fit(scaled_df)
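
The Silhouette score used by the evaluator can be sketched in a few lines of plain Python. This follows the textbook definition with squared Euclidean distance, s = (b - a) / max(a, b), and is not Spark's implementation; the function name `silhouette_score` and the toy points are assumptions made for illustration.

```python
def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette coefficient with squared Euclidean distance.

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not change the sum)
        a = sum(squared_euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(squared_euclidean(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups (not taken from the dataset).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.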

We now inspect the average Silhouette score obtained for each value of K during cross-validation and report the score of the best model.

In [219]:
import numpy as np

param_maps = best_km.getEstimatorParamMaps()

# Print the average Silhouette score for each K
for param_map, score in zip(param_maps, best_km.avgMetrics):
    param, value = next(iter(param_map.items()))
    print(param, "=", value)
    print("Silhouette score", round(score, 3))
    print()

# Report the best K value
best_idx = int(np.argmax(best_km.avgMetrics))
for param, value in param_maps[best_idx].items():
    print("|--------------------------|\n" +
          f"  Best parameter\n  {param}: {value}",
          f"\n\n  Silhouette score\n  {best_km.avgMetrics[best_idx]}")

KMeans_914085005846__k = 2
Silhouette score 0.624

KMeans_914085005846__k = 3
Silhouette score 0.759

KMeans_914085005846__k = 4
Silhouette score 0.677

KMeans_914085005846__k = 5
Silhouette score 0.709

KMeans_914085005846__k = 6
Silhouette score 0.586

KMeans_914085005846__k = 7
Silhouette score 0.547

KMeans_914085005846__k = 8
Silhouette score 0.434

KMeans_914085005846__k = 9
Silhouette score 0.394

KMeans_914085005846__k = 10
Silhouette score 0.341

|--------------------------|
  Best parameter
  KMeans_914085005846__k: 3 

  Silhouette score
  0.7586589413351843


We can now use the best model (the one with the optimal combination of parameters) to predict the labels for our data.

In [228]:
print("Best model =>",best_km.bestModel)
print()

# Get predictions using the best model (K = 3)
pred = best_km.transform(scaled_df)
pred.select(["features", "scaled_features", "prediction"]).show(10)

# Number of clusters in the best model
best_km.bestModel.getK()

Best model => KMeansModel: uid=KMeans_914085005846, k=3, distanceMeasure=euclidean, numFeatures=7

+--------------------+--------------------+----------+
|            features|     scaled_features|prediction|
+--------------------+--------------------+----------+
|[8.0,391.09,0.0,1...|[0.56785108466505...|         1|
|[20.0,720.99,1.0,...|[1.41962771166263...|         1|
|[31.0,356.32,0.0,...|[2.20042295307707...|         1|
|[2.0,228.08,0.0,1...|[0.14196277116626...|         1|
|[20.0,408.5,1.0,0...|[1.41962771166263...|         1|
|[1.0,390.69,0.0,1...|[0.07098138558313...|         1|
|[18.0,342.97,0.0,...|[1.27766494049636...|         1|
|[22.0,101.61,0.0,...|[1.56159048282889...|         1|
|[15.0,275.53,0.0,...|[1.06472078374697...|         1|
|[12.0,424.83,0.0,...|[0.85177662699757...|         1|
+--------------------+--------------------+----------+
only showing top 10 rows



3

Finally, I will compare the total number of observations within each cluster.

In [233]:
print("Observations in K = 3 clusters")
{"cluster "+str(cluster): pred.filter(pred.prediction == cluster).count() for cluster in range(best_km.bestModel.getK())}

Observations in K = 3 clusters


{'cluster 0': 88, 'cluster 1': 167, 'cluster 2': 79}

In [235]:
# Fit a second model with K = 2 for comparison
kmeans2 = KMeans(featuresCol='scaled_features', distanceMeasure = "euclidean",
                initMode = "k-means||", initSteps = 50, maxIter = 100,
                k = 2).fit(scaled_df)

pred2 = kmeans2.transform(scaled_df)

print("Observations in K = 2 clusters")
{"cluster "+str(cluster): pred2.filter(pred2.prediction == cluster).count() for cluster in range(kmeans2.getK())}

Observations in K = 2 clusters


{'cluster 0': 167, 'cluster 1': 167}

As we can observe, even though the model with K = 3 clusters has the highest Silhouette score, the observations are unevenly distributed among its three clusters (88, 167 and 79). In contrast, when we cluster the data with K = 2, both clusters contain exactly 167 observations.

In view of this, and considering that each hacker should have roughly the same number of attacks, we can conclude that two hackers probably performed the attack.
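
The size comparison can be quantified with a short sketch that computes each cluster's share of the observations and the ratio between the largest and smallest cluster. The counts are copied from the two outputs above; the helper `cluster_shares` is a hypothetical name used only for this illustration.

```python
def cluster_shares(counts):
    """Return each cluster's share of all observations."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Cluster sizes copied from the K = 3 and K = 2 outputs.
k3_counts = {"cluster 0": 88, "cluster 1": 167, "cluster 2": 79}
k2_counts = {"cluster 0": 167, "cluster 1": 167}

for label, counts in (("K = 3", k3_counts), ("K = 2", k2_counts)):
    shares = {name: round(s, 3) for name, s in cluster_shares(counts).items()}
    imbalance = max(counts.values()) / min(counts.values())
    print(label, shares, "max/min =", round(imbalance, 2))
```

With these counts, the largest K = 3 cluster holds half of the sessions and is roughly 2.1 times the size of the smallest, whereas the K = 2 split is exactly even.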