-- Notepad to myself --

# Machine Learning with DataFrames

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [2]:
data_path = 'data/'

In [3]:
df2_path = data_path + "utilization.json"
df2 = spark.read.json(df2_path)

In [None]:
#df2_csv_path = data_path + "utilization.csv"
#df2 = spark.read.csv(df2_csv_path, header=True, inferSchema=True)

In [4]:
df2.show(5, truncate=False)

+---------------+-------------------+-----------+---------+-------------+
|cpu_utilization|event_datetime     |free_memory|server_id|session_count|
+---------------+-------------------+-----------+---------+-------------+
|0.57           |03/05/2019 08:06:14|0.51       |100      |47           |
|0.47           |03/05/2019 08:11:14|0.62       |100      |43           |
|0.56           |03/05/2019 08:16:14|0.57       |100      |62           |
|0.57           |03/05/2019 08:21:14|0.56       |100      |50           |
|0.35           |03/05/2019 08:26:14|0.46       |100      |43           |
+---------------+-------------------+-----------+---------+-------------+
only showing top 5 rows



### ML - Clustering

**Clustering** is a common technique in exploratory data analysis. The idea is to check whether the data falls into natural groupings. As an example, let's take the utilization data and see whether it can be divided into three groups of rows that logically belong together.

In [6]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

A vector is essentially an array: a single data structure that holds all the values from one row that the machine learning algorithm will look at. The algorithms in the PySpark ML package expect the input features to be in a *single vector* column, much as scikit-learn expects each sample to be one row of a feature array.
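
To make the idea concrete, here is a minimal plain-Python sketch of what assembling one row into a feature vector amounts to. It does not show how `VectorAssembler` is implemented; `assemble` is a hypothetical helper, and the row values are copied from the first row of the `show` output above.

```python
# Plain-Python illustration only; `assemble` is a hypothetical helper,
# not part of PySpark.
def assemble(row, input_cols):
    """Collect the named columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]

# First row of the utilization data, written as a dict.
row = {
    "cpu_utilization": 0.57,
    "event_datetime": "03/05/2019 08:06:14",
    "free_memory": 0.51,
    "server_id": 100,
    "session_count": 47,
}

features = assemble(row, ["cpu_utilization", "free_memory", "session_count"])
print(features)  # [0.57, 0.51, 47.0]
```

Columns that are not listed (here `event_datetime` and `server_id`) are simply left out of the vector.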

In [7]:
vectorAssembler = VectorAssembler(inputCols=["cpu_utilization", "free_memory", "session_count"], outputCol="features")

In [10]:
df_vcluster = vectorAssembler.transform(df2)

In [11]:
df_vcluster.show(10, truncate=False)

+---------------+-------------------+-----------+---------+-------------+----------------+
|cpu_utilization|event_datetime     |free_memory|server_id|session_count|features        |
+---------------+-------------------+-----------+---------+-------------+----------------+
|0.57           |03/05/2019 08:06:14|0.51       |100      |47           |[0.57,0.51,47.0]|
|0.47           |03/05/2019 08:11:14|0.62       |100      |43           |[0.47,0.62,43.0]|
|0.56           |03/05/2019 08:16:14|0.57       |100      |62           |[0.56,0.57,62.0]|
|0.57           |03/05/2019 08:21:14|0.56       |100      |50           |[0.57,0.56,50.0]|
|0.35           |03/05/2019 08:26:14|0.46       |100      |43           |[0.35,0.46,43.0]|
|0.41           |03/05/2019 08:31:14|0.58       |100      |48           |[0.41,0.58,48.0]|
|0.57           |03/05/2019 08:36:14|0.35       |100      |58           |[0.57,0.35,58.0]|
|0.41           |03/05/2019 08:41:14|0.4        |100      |58           |[0.41,0.4,58.0] |

#### K-means

Here we take the `features` column of the `df_vcluster` dataframe (the column `KMeans` reads by default), fit the k-means estimator configured in the next cell to it, and keep the result in a machine learning model called `kmodel`.

In [12]:
kmeans = KMeans().setK(3)  # number of clusters
kmeans = kmeans.setSeed(1)  # fixed seed so the result is reproducible

In [13]:
kmodel = kmeans.fit(df_vcluster)
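
As a rough picture of what fitting does, the sketch below runs the textbook k-means loop (Lloyd's algorithm) in plain Python: assign each point to its nearest center, move each center to the mean of its points, and repeat until nothing changes. This is a simplified illustration, not Spark's implementation, which is distributed and by default picks its starting centers with `k-means||`. The function names, the toy points, and the starting centers are all made up for the example.

```python
import math

def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def lloyd_kmeans(points, centers, max_iter=20):
    """Alternate assignment and update steps until the centers stop moving."""
    for _ in range(max_iter):
        # Assignment step: group the points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Made-up 2-D toy points, not the utilization data.
toy_points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0], [20.0, 1.0], [21.0, 1.0]]
toy_centers = lloyd_kmeans(toy_points, centers=[[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
print(toy_centers)  # [[1.0, 1.5], [8.5, 8.0], [20.5, 1.0]]
```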

The key output of a k-means model is its set of cluster centers.

In [14]:
kmodel.clusterCenters()

[array([ 0.71174897,  0.28808911, 86.87510507]),
 array([ 0.61918113,  0.38080285, 68.75004716]),
 array([ 0.51439668,  0.48445202, 50.49452021])]

We get three centers, each specified by three values, one per dimension (CPU utilization, free memory, and session count). Together, the three values of a center give the location of one of the three clusters. Every row in the utilization dataframe falls into exactly one of these clusters: the one whose center is closest to the row's feature vector, which we can determine by measuring the distance from that vector to each of the centers.
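
The distance check can be sketched in plain Python. The centers are copied from the `clusterCenters()` output above and the rows are the first three feature vectors shown earlier; `closest_center` is a hypothetical helper, and Euclidean distance is assumed.

```python
import math

# Centers copied from the clusterCenters() output.
centers = [
    [0.71174897, 0.28808911, 86.87510507],
    [0.61918113, 0.38080285, 68.75004716],
    [0.51439668, 0.48445202, 50.49452021],
]

# Feature vectors of the first three rows of df_vcluster.
rows = [
    [0.57, 0.51, 47.0],
    [0.47, 0.62, 43.0],
    [0.56, 0.57, 62.0],
]

def closest_center(features):
    """Return the index of the center with the smallest Euclidean distance."""
    distances = [math.dist(features, c) for c in centers]
    return distances.index(min(distances))

assignments = [closest_center(r) for r in rows]
print(assignments)  # [2, 2, 1]
```

In Spark itself, `kmodel.transform(df_vcluster)` does this for every row and adds the cluster index as a `prediction` column. Because `session_count` is on a much larger scale than the other two features, it dominates these distances.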