<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Find-the-Hacker-Project" data-toc-modified-id="Find-the-Hacker-Project-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Find the Hacker Project</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Get-vectorized-data" data-toc-modified-id="Get-vectorized-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get vectorized data</a></span></li><li><span><a href="#Feature-scaling" data-toc-modified-id="Feature-scaling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature scaling</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modelling</a></span></li><li><span><a href="#Result" data-toc-modified-id="Result-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Result</a></span></li></ul></div>

In [1]:
import numpy as np
import pandas as pd
import pyspark
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf # @udf("integer") def myfunc(x,y): return x - y
from pyspark.sql import functions as F # stddev format_number date_format, dayofyear, when
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

print([(x.__name__,x.__version__) for x in [np, pd, pyspark]])

spark = pyspark.sql.SparkSession.builder.appName('bhishan').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc) # spark_df = sqlContext.createDataFrame(pandas_df)
sc.setLogLevel("INFO")

[('numpy', '1.17.1'), ('pandas', '0.25.1'), ('pyspark', '2.4.4')]


In [2]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler

# Find the Hacker Project

> 'Session_Connection_Time': How long the session lasted

> 'Bytes Transferred': Number of MB transferred during the session

> 'Kali_Trace_Used': Indicates whether the hacker was using Kali Linux

> 'Servers_Corrupted': Number of servers corrupted during the attack

> 'Pages_Corrupted': Number of pages illegally accessed

> 'Location': Location the attack came from (probably useless because the hackers used VPNs)

> 'WPM_Typing_Speed': Their estimated typing speed based on session logs


- The technology firm has 3 potential hackers who may have perpetrated the attack.
- They are certain of the first two hackers, but they aren't sure whether the third hacker was involved.
- They have requested your help!

One last key fact: the forensic engineer knows that the hackers trade off attacks, meaning each hacker should have roughly the same number of attacks.

For example, if there were 100 total attacks, then in a two-hacker situation each should have about 50 hacks, and in a three-hacker situation each would have about 33 hacks.
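
The arithmetic in the example above can be written out as a short sketch. The function name `expected_attacks_per_hacker` is made up for illustration and is not part of the notebook.

```python
def expected_attacks_per_hacker(total_attacks: int, n_hackers: int) -> int:
    """Even share of attacks per hacker, rounded down to a whole attack."""
    if n_hackers <= 0:
        raise ValueError("n_hackers must be positive")
    return total_attacks // n_hackers


# The 100-attack example from the text, for two and three hackers.
for n_hackers in (2, 3):
    share = expected_attacks_per_hacker(100, n_hackers)
    print(f"{n_hackers} hackers -> about {share} attacks each")
```

The same check is what the clustering step relies on later: the value of k whose cluster sizes come closest to an even share is the more plausible number of hackers.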

# Load the data

In [3]:
!ls ../data/

College.csv              cruise_ship_info.csv   new_customers.csv                  seeds_dataset.csv
ContainsNull.csv         customer_churn.csv     people.json                        seeds_dataset.txt
Ecommerce-Customers.csv  dog_food.csv           sales_info.csv                     titanic.csv
Ecommerce_Customers.csv  fake_customers.csv     sample_kmeans_data.txt             walmart_stock.csv
Meal_Info.csv            hack_data.csv          sample_libsvm_data.txt
appl_stock.csv           movielens_ratings.csv  sample_linear_regression_data.txt


In [8]:
df = spark.read.csv('../data/hack_data.csv',header=True,inferSchema=True)
print(df.count())
print(len(df.columns))
df.printSchema()

df.cache()

pd.DataFrame(df.take(5),columns=df.columns)

334
7
root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



Unnamed: 0,Session_Connection_Time,Bytes Transferred,Kali_Trace_Used,Servers_Corrupted,Pages_Corrupted,Location,WPM_Typing_Speed
0,8.0,391.09,1,2.96,7.0,Slovenia,72.37
1,20.0,720.99,0,3.04,9.0,British Virgin Islands,69.08
2,31.0,356.32,1,3.71,8.0,Tokelau,70.58
3,2.0,228.08,1,2.48,8.0,Bolivia,70.8
4,20.0,408.5,0,3.57,8.0,Iraq,71.28


# Get vectorized data

In [9]:
# drop location
df = df.drop('Location')
df.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



In [10]:
assembler = VectorAssembler(inputCols=df.columns, outputCol='features')

In [11]:
final_data = assembler.transform(df)

# Feature scaling

In [12]:
scaler = StandardScaler(inputCol='features',outputCol='scaledFeatures')
final_data = scaler.fit(final_data).transform(final_data)
final_data.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)
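
To make the scaling step concrete, here is a plain-Python sketch of what `StandardScaler` does under its default settings (`withStd=True`, `withMean=False`): each feature is divided by its sample standard deviation and is not centered. The sketch uses only the five preview rows shown earlier, so the numbers differ from what Spark computes on all 334 rows, and the helper name `scale_to_unit_std` is an assumption made for illustration.

```python
from statistics import stdev

# First five values of two columns, copied from the data preview.
session_time = [8.0, 20.0, 31.0, 2.0, 20.0]
bytes_transferred = [391.09, 720.99, 356.32, 228.08, 408.5]


def scale_to_unit_std(values):
    """Divide each value by the column's sample standard deviation (no centering)."""
    sd = stdev(values)  # corrected (n - 1) sample standard deviation
    if sd == 0:
        # A constant column carries no information; this sketch maps it to zeros.
        return [0.0 for _ in values]
    return [v / sd for v in values]


scaled_time = scale_to_unit_std(session_time)
scaled_bytes = scale_to_unit_std(bytes_transferred)

# Before scaling the two columns have very different spreads;
# afterwards both have a standard deviation of 1.
print(round(stdev(session_time), 2), round(stdev(bytes_transferred), 2))
print(round(stdev(scaled_time), 2), round(stdev(scaled_bytes), 2))
```

Scaling matters here because k-means uses Euclidean distance: without it, a wide-ranging column such as `Bytes Transferred` would dominate the distance computation.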



# Modelling

In [13]:
from pyspark.ml.clustering import KMeans
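
For intuition about what a k-means fit does, the following is a minimal plain-Python sketch of Lloyd's algorithm (alternating assignment and update steps). It is not Spark's implementation: the initialisation simply takes the first k points, the toy data are hypothetical, and the function name `kmeans` is chosen only for this example.

```python
import math


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with a naive 'first k points' initialisation."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = kmeans(points, k=2)

# Cluster sizes, analogous to groupBy('prediction').count() on the Spark side.
counts = {c: labels.count(c) for c in sorted(set(labels))}
print(counts)
```

Because the outcome of k-means depends on the starting centroids, repeated runs can produce different cluster labels and sometimes different splits; fixing the initialisation, as this sketch does, keeps the result reproducible.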

In [15]:
kmeans = KMeans(featuresCol='scaledFeatures',k=2)
model = kmeans.fit(final_data)

preds = model.transform(final_data).select('prediction')
preds.groupBy('prediction').count().show(5)

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
+----------+
only showing top 20 rows



In [19]:
kmeans = KMeans(featuresCol='scaledFeatures',k=3)
model = kmeans.fit(final_data)

preds = model.transform(final_data).select('prediction')
preds.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         2|   88|
|         0|   79|
+----------+-----+
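
One way to quantify how far these cluster sizes are from the even split the forensic engineer expects is sketched below. The helper `max_relative_deviation` and the 10% tolerance are assumptions made for illustration; the notebook itself only compares the counts by eye.

```python
def max_relative_deviation(cluster_counts):
    """Largest deviation of any cluster size from an even share, as a fraction of that share."""
    total = sum(cluster_counts)
    even_share = total / len(cluster_counts)
    return max(abs(n - even_share) for n in cluster_counts) / even_share


# Cluster sizes from the k=3 output above.
k3_counts = [167, 88, 79]
deviation = max_relative_deviation(k3_counts)
even_share = sum(k3_counts) / len(k3_counts)
print(f"even share: {even_share:.1f}, max deviation: {deviation:.0%}")

# Hypothetical threshold: call a split balanced if no cluster is more than 10% off.
TOLERANCE = 0.10
print("balanced" if deviation <= TOLERANCE else "unbalanced")
```

With three clusters an even share would be about 111 sessions each, so a cluster of 167 is well outside any reasonable tolerance.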



# Result

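There were most likely two hackers. The hackers trade off attacks, so each should account for roughly the same number of sessions. With k=3 the split is uneven (167, 88, and 79), whereas with k=2 the 334 sessions divide into two clusters of equal size.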