## SUMMARY OF STEPS in k-means

STEPS
* Load Data
* Prepare Data
    * null handling (na handling),
    * category indexing (StringIndexer) and binary vectorization (OneHotEncoder)
    * vector assembly of the input columns into 'features', the combined vector column (VectorAssembler, .transform)
    * scaling of the combined 'features' column to 'scaledFeatures' using StandardScaler (.fit, then .transform)
        * Scaling is very useful when the features differ widely in range, because k-means is distance-based and large-valued features would otherwise dominate.
        * For example, if one column is expressed in thousands of miles and another in millimeters, their values differ by orders of magnitude, and scaling puts them on a comparable footing.
* Create a KMeans estimator with featuresCol='scaledFeatures' and the desired k (the number of suspected clusters) in the constructor.
* Train it on the vectorized, scaled input dataframe (.fit); the result is our final model, kmodel.
* kmodel.clusterCenters() gives a list of feature arrays, each array representing the centroid of one cluster
    * Each array is a single point in n-dimensional space and is called a centroid.
    * The number of arrays equals the value of k used, i.e. the number of clusters.
    * The number of elements in each array equals the number of features.
* Run kmodel.transform(sdfScaledVectorFeatures) to get the result dataframe with predictions, where we can check the 'prediction' column. The same column is also available from the model itself (kmodel.summary.predictions.select('prediction')).
* We can evaluate the model in one of the following ways.
    1. kmodel.computeCost(sdfScaledVectorFeatures) (deprecated since Spark 2.4.0)
    2. ClusteringEvaluator(predictionCol='prediction', featuresCol='scaledFeatures').evaluate(kmodel.summary.predictions)
* SOURCE: notes_dataframe_ml.ipynb
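
To make the fit / clusterCenters / transform steps above more concrete, here is a minimal pure-Python sketch of the textbook k-means loop (Lloyd's algorithm). It is not Spark's implementation: the function names are made up for illustration, and starting from the first k points is an assumption chosen only to keep the run reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sketch(points, k, iterations=10):
    """Textbook Lloyd's algorithm; returns (centers, assignments)."""
    # Assumption for reproducibility: start from the first k points.
    centers = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


# Hypothetical toy data: two well-separated groups, two features each.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
centers, assignments = kmeans_sketch(points, k=2)

print(len(centers), len(centers[0]))  # 2 2  (k centroids, one value per feature)
print(assignments)                    # [0, 1, 0, 1, 0, 1]
```

Here `centers` plays the role of clusterCenters() (k arrays, one element per feature) and `assignments` plays the role of the 'prediction' column.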

###### We need to scale the data before passing it to the KMeans clustering algorithm
* REF: https://datascience.stackexchange.com/questions/22795/do-clustering-algorithms-need-feature-scaling-in-the-pre-processing-stage  (This provides justification for scaling)
* https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering
* https://stackoverflow.com/questions/15777201/why-vector-normalization-can-improve-the-accuracy-of-clustering-and-classificati
* Scaling puts all features on a comparable scale, so each one carries similar weight in the distance calculation; there is little disadvantage to doing it here.

###### We need to scale the final data after vectorization into 'features', but before training the model.
* The model needs to be trained on scaled data.
* We scale the combined 'features' column and train the KMeans model with the scaled column passed as the featuresCol parameter.
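
As a rough illustration of what this scaling does: with withMean=False and withStd=True (the settings used in these notes), each feature is divided by its standard deviation and is not centered. The sketch below does the same in plain Python using the sample standard deviation. The helper name and the data are hypothetical, and it assumes no column is constant (zero standard deviation).

```python
from statistics import stdev


def scale_to_unit_std(rows):
    """Divide every column by its sample standard deviation (no centering)."""
    columns = list(zip(*rows))
    stds = [stdev(col) for col in columns]  # assumes no constant column
    return [[value / s for value, s in zip(row, stds)] for row in rows]


# Hypothetical data: distance in thousands of miles vs. length in millimeters.
rows = [
    (1.2, 3_400_000.0),
    (2.5, 1_200_000.0),
    (0.8, 5_900_000.0),
    (3.1, 2_700_000.0),
]
scaled = scale_to_unit_std(rows)

# After scaling, both columns have unit standard deviation, so neither one
# dominates a Euclidean distance just because of its units.
for column in zip(*scaled):
    print(round(stdev(column), 6))
```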

In [3]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/MyTrials/tools/')
from chinmay_tools import *

## REFER:  <u>Clustering_Consulting_Project.ipynb</u>

### Clustering Consulting Project 

A large technology firm needs your help, they've been hacked! Luckily their forensic engineers have grabbed valuable data about the hacks, including information like session time, locations, wpm typing speed, etc. The forensic engineer relates to you what she has been able to figure out so far: she has been able to grab metadata of each session that the hackers used to connect to their servers. These are the features of the data:

* 'Session_Connection_Time': How long the session lasted in minutes
* 'Bytes Transferred': Number of MB transferred during session
* 'Kali_Trace_Used': Indicates if the hacker was using Kali Linux
* 'Servers_Corrupted': Number of servers corrupted during the attack
* 'Pages_Corrupted': Number of pages illegally accessed
* 'Location': Location attack came from (Probably useless because the hackers used VPNs)
* 'WPM_Typing_Speed': Their estimated typing speed based on session logs.


The technology firm has 3 potential hackers that perpetrated the attack. They are certain of the first two hackers, but they aren't very sure whether the third hacker was involved or not. They have requested your help! Can you help figure out whether or not the third suspect had anything to do with the attacks, or was it just two hackers? It's probably not possible to know for sure, but maybe what you've just learned about Clustering can help!

**One last key fact, the forensic engineer knows that the hackers trade off attacks. Meaning they should each have roughly the same amount of attacks. For example if there were 100 total attacks, then in a 2 hacker situation each should have about 50 hacks, in a three hacker situation each would have about 33 hacks. The engineer believes this is the key element to solving this, but doesn't know how to distinguish this unlabeled data into groups of hackers.**

#### PROJECT: Finding out about the hackers who have hacked a software company
Problem:
* The company's forensic engineers suspect 3 hackers and are sure about 2 of them.
* We need to apply a clustering algorithm to find out whether the 3rd hacker is involved or not.
* It is also known that the hackers trade off attacks, i.e. each of them carried out roughly the same number of attacks.

Resolution:
* We first load the data and vectorize the numeric fields (ignoring "Location", as the hackers used VPNs).
* We then cluster the data twice, first with 2 clusters (k=2) and then with 3 clusters (k=3), since we are not sure whether two or three hackers were involved.
* According to the forensic engineer, the hackers traded off attacks, meaning each hacker carried out roughly the same number of attacks.
* We then check for which value of k the attacks are evenly distributed among the clusters; that value is our answer for the number of attackers.
    * We count the attacks with groupBy('prediction').count() on the clustering result.
    * The k for which these counts come out even is the number of attackers.
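
For intuition, counting rows per cluster (what groupBy('prediction').count() does on the result dataframe) amounts to tallying the predicted labels. A tiny plain-Python sketch follows; the label list is hypothetical and is not taken from the hack data.

```python
from collections import Counter

# Hypothetical 'prediction' column: one cluster id per attack.
predictions = [0, 1, 1, 0, 2, 1, 0, 2, 2]

# Tally how many attacks fall into each cluster.
counts = Counter(predictions)
print(sorted(counts.items()))  # [(0, 3), (1, 3), (2, 3)]
```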

In [48]:
from pyspark.sql import SparkSession

In [49]:
spark3 = SparkSession.builder.appName('cluster_project').getOrCreate()

In [52]:
sdf_hack = spark3.read.csv('Clustering/hack_data.csv', inferSchema=True, header=True)

In [55]:
sdf_hack.head(3)

[Row(Session_Connection_Time=8.0, Bytes Transferred=391.09, Kali_Trace_Used=1, Servers_Corrupted=2.96, Pages_Corrupted=7.0, Location='Slovenia', WPM_Typing_Speed=72.37),
 Row(Session_Connection_Time=20.0, Bytes Transferred=720.99, Kali_Trace_Used=0, Servers_Corrupted=3.04, Pages_Corrupted=9.0, Location='British Virgin Islands', WPM_Typing_Speed=69.08),
 Row(Session_Connection_Time=31.0, Bytes Transferred=356.32, Kali_Trace_Used=1, Servers_Corrupted=3.71, Pages_Corrupted=8.0, Location='Tokelau', WPM_Typing_Speed=70.58)]

In [56]:
printHighlighted('We will exclude the "Location" field from the features, as the hackers were using VPNs.')

We will exclude the "Location" field from the features, as the hackers were using VPNs.


In [59]:
from pyspark.ml.feature import VectorAssembler

In [62]:
sdf_hack.printSchema()
sdf_hack.columns

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

In [63]:
# Feature columns: all the numeric columns (the string column "Location" is left out)
feat_cols = ['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'WPM_Typing_Speed']

In [64]:
from pyspark.ml.feature import VectorAssembler

In [66]:
assembler_hack = VectorAssembler(inputCols=feat_cols, outputCol='features')

In [67]:
sdf_hack_vec = assembler_hack.transform(sdf_hack)

In [68]:
sdf_hack_vec.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)



### Scale the features

In [69]:
from pyspark.ml.feature import StandardScaler

In [71]:
scaler3 = StandardScaler(withMean=False, withStd=True, inputCol='features', outputCol='scaledFeatures')

In [72]:
sdf_hack_vec_scaled = scaler3.fit(sdf_hack_vec).transform(sdf_hack_vec)

In [73]:
sdf_hack_vec_scaled.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaledFeatures: vector (nullable = true)



In [57]:
from pyspark.ml.clustering import KMeans

###### Now the trickiest part 
* The given fact is that the hackers traded off the attacks, i.e. each hacker carried out roughly the same number of attacks.
* So we will build one clustering with k=2 and another with k=3.
* The value of k for which the number of hacks per cluster is the same will be the actual number of hackers.

In [79]:
###### KMeans model for 3 hackers

kmeans3 = KMeans(featuresCol='scaledFeatures', predictionCol='prediction', k=3)
kmodel3 = kmeans3.fit(sdf_hack_vec_scaled)

###### KMeans model for 2 hackers

kmeans2 = KMeans(featuresCol='scaledFeatures', predictionCol='prediction', k=2)
kmodel2 = kmeans2.fit(sdf_hack_vec_scaled)

In [82]:
sdf_hack_vec_scaled

DataFrame[Session_Connection_Time: double, Bytes Transferred: double, Kali_Trace_Used: int, Servers_Corrupted: double, Pages_Corrupted: double, Location: string, WPM_Typing_Speed: double, features: vector, scaledFeatures: vector]

In [83]:
sdf_predictions_hack2 = kmodel2.transform(sdf_hack_vec_scaled)
sdf_predictions_hack3 = kmodel3.transform(sdf_hack_vec_scaled)

In [87]:
sdf_predictions_hack2.groupBy('prediction').count().show()
sdf_predictions_hack3.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|  167|
|         0|  167|
+----------+-----+

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   79|
|         2|   88|
|         0|  167|
+----------+-----+



* When we compared the number of hacks per cluster for the 2-cluster and 3-cluster scenarios, the counts were equal only in the 2-cluster scenario (167 and 167, versus 167, 79 and 88).
* So we conclude that there were 2 hackers involved.
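
The comparison can be made explicit with a small helper. This is only a sketch: `imbalance` is a made-up name and the measure (spread between the largest and smallest cluster as a share of all rows) is just one simple choice. The counts are copied from the two tables above.

```python
def imbalance(counts):
    """Spread between the largest and smallest cluster, as a share of all rows."""
    sizes = list(counts.values())
    return (max(sizes) - min(sizes)) / sum(sizes)


# Counts copied from the groupBy('prediction').count() outputs above.
counts_k2 = {0: 167, 1: 167}
counts_k3 = {0: 167, 1: 79, 2: 88}

print(imbalance(counts_k2))            # 0.0
print(round(imbalance(counts_k3), 2))  # 0.26
```

A value of 0 means perfectly even clusters, which matches the "hackers trade off attacks" hint only for k=2.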

In [88]:
printHighlighted('Here the trick was to understand the question properly')

Here the trick was to understand the question properly


## Refer: <u>"Clustering Code Along.ipynb"</u>
###### Requirement: Clustering of wheat seed kernel data into 3 varieties of wheat: Kama, Rosa and Canadian
DataSource: https://archive.ics.uci.edu/ml/datasets/seeds.

In [4]:
from pyspark.sql import SparkSession

In [5]:
spark2 = SparkSession.builder.appName('cluster_wheat').getOrCreate()

In [6]:
sdf_seeds = spark2.read.csv('Clustering/seeds_dataset.csv', inferSchema=True, header=True)

In [7]:
sdf_seeds.head(1)

[Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)]

In [8]:
sdf_seeds.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)



###### Format or vectorize the data to get 'features' column

In [9]:
from pyspark.ml.feature import VectorAssembler

In [10]:
assembler_seeds = VectorAssembler(inputCols=sdf_seeds.columns, outputCol='features')

In [11]:
sdf_seeds_vec = assembler_seeds.transform(sdf_seeds)

* The dataframe passed to KMeans.fit() is expected to have a features column (featuresCol; the default column name is 'features'). Additional fields are ignored by the algorithm.

In [12]:
sdf_seeds_vec.printSchema()

root
 |-- area: double (nullable = true)
 |-- perimeter: double (nullable = true)
 |-- compactness: double (nullable = true)
 |-- length_of_kernel: double (nullable = true)
 |-- width_of_kernel: double (nullable = true)
 |-- asymmetry_coefficient: double (nullable = true)
 |-- length_of_groove: double (nullable = true)
 |-- features: vector (nullable = true)



Background:
* There is no label, so it is an unsupervised model.
* There are 3 varieties of seeds, so we can split the data into 3 clusters, one for each seed type.

In [13]:
from pyspark.ml.feature import StandardScaler

* Create a StandardScaler object that will scale the 'features' column into 'scaledFeatures'

In [14]:
scaler = StandardScaler(withMean=False, withStd=True, inputCol='features', outputCol='scaledFeatures')

In [15]:
sdf_seeds_vec_scaled = scaler.fit(sdf_seeds_vec).transform(sdf_seeds_vec)

In [16]:
sdf_seeds_vec_scaled

DataFrame[area: double, perimeter: double, compactness: double, length_of_kernel: double, width_of_kernel: double, asymmetry_coefficient: double, length_of_groove: double, features: vector, scaledFeatures: vector]

* Fit the KMeans model to the scaled data

In [17]:
from pyspark.ml.clustering import KMeans

In [18]:
kmeans = KMeans(featuresCol='scaledFeatures', k=3)

In [19]:
%%time
model_wheat = kmeans.fit(sdf_seeds_vec_scaled)

Wall time: 4.75 s


In [20]:
model_wheat.clusterCenters()

[array([ 4.87257659, 10.88120146, 37.27692543, 12.3410157 ,  8.55443412,
         1.81649011, 10.32998598]),
 array([ 6.31670546, 12.37109759, 37.39491396, 13.91155062,  9.748067  ,
         2.39849968, 12.2661748 ]),
 array([ 4.06105916, 10.13979506, 35.80536984, 11.82133095,  7.50395937,
         3.27184732, 10.42126018])]

In [21]:
printHighlighted('Within Set Sum of Squared Errors')
# computeCost is deprecated since Spark 2.4.0
model_wheat.computeCost(sdf_seeds_vec_scaled)

Within Set Sum of Squared Errors


429.07559671506715

In [22]:
model_wheat_predictions = model_wheat.summary.predictions

In [23]:
model_wheat_predictions

DataFrame[area: double, perimeter: double, compactness: double, length_of_kernel: double, width_of_kernel: double, asymmetry_coefficient: double, length_of_groove: double, features: vector, scaledFeatures: vector, prediction: int]

In [24]:
from pyspark.ml.evaluation import ClusteringEvaluator

In [25]:
eval = ClusteringEvaluator(predictionCol='prediction', featuresCol='scaledFeatures')

In [26]:
eval.evaluate(model_wheat.summary.predictions)

0.5968576173138038
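
To get a feel for what this number means: as far as I know, ClusteringEvaluator defaults to the silhouette metric with squared Euclidean distance, where values close to 1 indicate tight, well-separated clusters and negative values indicate points sitting in the wrong cluster. Below is a brute-force plain-Python sketch of the standard silhouette definition. The helper names and toy data are hypothetical, at least two clusters are assumed, and Spark's internal computation may differ in detail.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(points, labels):
    """Mean silhouette coefficient using squared Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point itself contributes 0 to the sum).
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the closest other cluster.
        b = min(
            sum(squared_distance(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # sensible clustering
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the groups
print(round(good, 3), bad < 0)  # 0.99 True
```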

In [27]:
results = model_wheat.transform(sdf_seeds_vec_scaled)

* Basically, the model_wheat.summary.predictions dataframe and the dataframe returned from model_wheat.transform(sdf_seeds_vec_scaled) are identical.

In [28]:
results.head(2)

[Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22, features=DenseVector([15.26, 14.84, 0.871, 5.763, 3.312, 2.221, 5.22]), scaledFeatures=DenseVector([5.2445, 11.3633, 36.8608, 13.0072, 8.7685, 1.4772, 10.621]), prediction=0),
 Row(area=14.88, perimeter=14.57, compactness=0.8811, length_of_kernel=5.553999999999999, width_of_kernel=3.333, asymmetry_coefficient=1.018, length_of_groove=4.956, features=DenseVector([14.88, 14.57, 0.8811, 5.554, 3.333, 1.018, 4.956]), scaledFeatures=DenseVector([5.1139, 11.1566, 37.2883, 12.5354, 8.8241, 0.6771, 10.0838]), prediction=0)]

In [29]:
eval.evaluate(results)

0.5968576173138038

In [30]:
results.select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         1|
|         1|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         2|
+----------+
only showing top 20 rows



#### Write the spark dataframe contents into a file

In [55]:
printHighlighted('Write the spark dataframe contents to a csv file')
# toPandas() collects the whole dataframe to the driver, so this suits small results only
model_wheat.summary.predictions.toPandas().to_csv('1.csv', index=False)
results.toPandas().to_csv('2.csv', index=False)

Write the spark dataframe contents to a csv file


## Ref: <u>Clustering_Code_Example.ipynb</u>

#### OBJECTIVE: K-Means Clustering
* Clustering is unsupervised learning: there are no labels, hence no train_test_split.
* Instantiate a KMeans estimator from pyspark.ml.clustering and set the number of clusters (e.g. .setK(2) for two clusters) and the random seed (.setSeed(1)); supplying the same seed gives the same sequence of random numbers, so results are reproducible. Both can also be passed as parameters to the constructor itself.
* Load the data, vectorize it if not already done, and train (.fit()) the KMeans model on the loaded and vectorized data.
* Evaluate the clustering by computing the Within Set Sum of Squared Errors [ kmodel<b>.computeCost(input_dataset)</b> ]. This returns the sum of squared distances of the points to their nearest center, which we need to minimize.
* Print the <b>clusterCenters()</b>, i.e. the centroids; there will be as many of them as the value supplied for k.
* computeCost() of the KMeans model is now deprecated, and we can use <b>ClusteringEvaluator</b> instead. To use ClusteringEvaluator, pass the kmeans_trained_model.summary.predictions dataframe to its evaluate() method.
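
The Within Set Sum of Squared Errors described above is simple enough to compute by hand. The sketch below is plain Python with hypothetical data and a made-up function name; it illustrates the definition (sum of squared distances of points to their nearest center) and is not what Spark runs internally.

```python
def wssse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # Squared Euclidean distance to every center; keep the smallest.
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers
        )
    return total


# Hypothetical toy data: every point lies exactly 1 unit from its center.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(wssse(points, centers))  # 4.0
```

Lower is better for a fixed k, but the value always shrinks as k grows, so it is mainly useful for comparing runs with the same k.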

In [None]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/MyTrials/tools/')
from chinmay_tools import *

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark1 = SparkSession.builder.appName('cluster1').getOrCreate()

In [None]:
dataset = spark1.read.format('libsvm').load('Clustering/sample_kmeans_data.txt')

In [None]:
dataset.count()

In [None]:
dataset.printSchema()

###### Here the problem is a clustering (grouping) problem, so the 'label' field is unnecessary

In [None]:
from pyspark.ml.clustering import KMeans

In [None]:
kmeans = KMeans(k=2, seed=1)

In [None]:
kmodel = kmeans.fit(dataset)

In [None]:
kmodel.clusterCenters()

In [None]:
ssse  = kmodel.computeCost?

In [None]:
ssse  = kmodel.computeCost

In [None]:
# Return sum of squared distances of points to their nearest center

# Evaluate clustering by computing Within Set Sum of Squared Errors.
# computeCost is deprecated since 2.4.0, hence I will also use ClusteringEvaluator


ssse  = kmodel.computeCost(dataset)

In [None]:
ssse

In [None]:
from pyspark.ml.evaluation import ClusteringEvaluator

In [None]:
kmodel.summary.predictions.show()

In [None]:
eval = ClusteringEvaluator(predictionCol='prediction', featuresCol='features')

In [None]:
eval.explainParams()

In [None]:
eval.evaluate(kmodel.summary.predictions)