# 7. KMeans Clustering with Scikit-Learn and MLlib

Implement the K-Means Algorithm using Scikit-Learn and MLlib!

In [1]:
%matplotlib inline
from sklearn import datasets
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

The `datasets` module provides access to several public datasets. Its loader functions, such as `load_iris()`, return a scikit-learn `Bunch` object: <http://scikit-learn.org/stable/datasets/index.html>

In [2]:
iris = datasets.load_iris()

Convert the scikit-learn `Bunch` to a pandas DataFrame

In [3]:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df['target_name'] = iris['target_names'][iris_df['target']]
iris_df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,target_name
0,5.1,3.5,1.4,0.2,0,setosa
1,4.9,3.0,1.4,0.2,0,setosa
2,4.7,3.2,1.3,0.2,0,setosa
3,4.6,3.1,1.5,0.2,0,setosa
4,5.0,3.6,1.4,0.2,0,setosa


## 7.1 Cluster the data using the KMeans implementation of scikit-learn!

* Resource: <http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html>
* Measure the runtime for training the model!
* Experiment with different numbers of clusters! What do you observe?
* Plot the results!
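
One possible way to approach the runtime and cluster-count experiments is to fit the model for several values of `n_clusters` and record the training time together with the inertia (the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`). The sketch below is only an illustration: it assumes the raw Iris features, and the range of `k` from 1 to 6 is an arbitrary choice.

```python
import time

from sklearn import datasets
from sklearn.cluster import KMeans

# Assumption: raw (unstandardized) Iris features; the range of k is illustrative.
X = datasets.load_iris().data

results = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    start = time.perf_counter()
    model.fit(X)
    elapsed = time.perf_counter() - start
    # inertia_ is the sum of squared distances of samples to their closest center
    results.append((k, model.inertia_, elapsed))

for k, inertia, elapsed in results:
    print(f"k={k}: inertia={inertia:.2f}, fit time={elapsed * 1000:.1f} ms")
```

Because the inertia keeps shrinking as `k` grows, it is usually read by looking for the point where the decrease flattens out rather than by picking the minimum.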

In [5]:
# Standardize the four feature columns to zero mean and unit variance (z-score)
iris_df.iloc[:,:4] = (iris_df.iloc[:,:4] - iris_df.iloc[:,:4].mean())/iris_df.iloc[:,:4].std()

In [6]:
iris_df

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,target_name
0,-0.897674,1.015602,-1.335752,-1.311052,0,setosa
1,-1.139200,-0.131539,-1.335752,-1.311052,0,setosa
2,-1.380727,0.327318,-1.392399,-1.311052,0,setosa
3,-1.501490,0.097889,-1.279104,-1.311052,0,setosa
4,-1.018437,1.245030,-1.335752,-1.311052,0,setosa
...,...,...,...,...,...,...
145,1.034539,-0.131539,0.816859,1.443994,2,virginica
146,0.551486,-1.278680,0.703564,0.919223,2,virginica
147,0.793012,-0.131539,0.816859,1.050416,2,virginica
148,0.430722,0.786174,0.930154,1.443994,2,virginica


In [13]:
kmeans = KMeans(n_clusters=3, random_state=0)

In [14]:
%%time 
kmeans.fit(iris_df.iloc[:,:4])

CPU times: user 35 ms, sys: 7.72 ms, total: 42.7 ms
Wall time: 39.9 ms


KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [15]:
kmeans.predict(iris_df.iloc[:,:4])

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,
       0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)

In [None]:
# Plot the cluster assignments on the first two (standardized) features.
# Cluster indices are arbitrary and do not correspond to the species IDs.
X = iris_df.iloc[:, :4]
labels = kmeans.labels_

plt.figure(1, figsize=(4, 3))

plt.xlabel('Sepal length (standardized)')
plt.ylabel('Sepal width (standardized)')

plt.scatter(X[labels == 0].iloc[:, 0], X[labels == 0].iloc[:, 1], s=10, c='r', marker=".", label='cluster 0')
plt.scatter(X[labels == 1].iloc[:, 0], X[labels == 1].iloc[:, 1], s=10, c='g', marker="^", label='cluster 1')
plt.scatter(X[labels == 2].iloc[:, 0], X[labels == 2].iloc[:, 1], s=10, c='b', marker="s", label='cluster 2')
plt.legend(loc='upper right')
plt.show()

## 7.2 MLlib Clustering

* MLlib KMeans example:
    * <https://spark.apache.org/docs/latest/ml-clustering.html>
    * <https://spark.apache.org/docs/latest/api/python/>
    * <https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.ClusteringEvaluator>
* Run KMeans on the provided Iris dataset!
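* Validate the quality of the model! Use the `ClusteringEvaluator` of Spark MLlib, which reports the Silhouette measure. The sum of squared errors can be computed separately from the distance of each point to its assigned cluster center.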

In [4]:
# Initialize PySpark
import os, sys
import pyspark
import pyspark.sql
from pyspark.sql import Row, SparkSession

APP_NAME = "PySpark Lecture"
SPARK_MASTER = "local[1]"

# Keep Spark's scratch files in a local tmp directory next to the notebook
conf = pyspark.SparkConf().setAppName(APP_NAME).set("spark.local.dir", os.path.join(os.getcwd(), "tmp"))
sc = pyspark.SparkContext(master=SPARK_MASTER, conf=conf)
# getOrCreate() reuses the SparkContext created above
spark = SparkSession.builder.appName(APP_NAME).getOrCreate()

print("PySpark initiated...")

PySpark initiated...


#### Model Evaluation

* <https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.ClusteringEvaluator>

* `ClusteringEvaluator` evaluates clustering results and expects two input columns: `prediction` and `features`. Its metric is the Silhouette measure, computed using the squared Euclidean distance.
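
To make the metric concrete, the following NumPy sketch computes the mean Silhouette coefficient with the squared Euclidean distance on a small in-memory array. It illustrates the definition only and is not how Spark computes it; the function name `silhouette_squared_euclidean` is hypothetical, and the code assumes at least two clusters.

```python
import numpy as np

def silhouette_squared_euclidean(X, labels):
    """Mean silhouette coefficient using the squared Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = (diff ** 2).sum(axis=2)

    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
print(silhouette_squared_euclidean(X, labels))
```

For each point, `a` measures how close it is to its own cluster and `b` how close it is to the nearest other cluster, so the score approaches 1 when `a` is much smaller than `b`.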

## 7.3 Manual KMeans Clustering

Implement a KMeans model using Spark MapReduce operations (do not use the MLlib implementation)!
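
As a starting point, the structure of one KMeans iteration in map/reduce form can be sketched in plain Python without Spark. The function names `closest` and `kmeans_mapreduce` are hypothetical, and the fixed number of iterations is a simplifying assumption in place of a convergence check. On an RDD, the map step would correspond to `rdd.map(...)` and the per-cluster aggregation to `reduceByKey(...)`.

```python
from functools import reduce

def closest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))

def kmeans_mapreduce(points, centers, iterations=10):
    for _ in range(iterations):
        # Map: emit (cluster_index, (point, 1)) for every point
        mapped = map(lambda p: (closest(p, centers), (p, 1)), points)

        # Reduce by key: sum the points and the counts per cluster
        def combine(acc, pair):
            key, (point, count) = pair
            if key in acc:
                total, n = acc[key]
                acc[key] = ([t + x for t, x in zip(total, point)], n + count)
            else:
                acc[key] = (list(point), count)
            return acc

        sums = reduce(combine, mapped, {})

        # New center = mean of the assigned points; an empty cluster keeps its center
        centers = [
            [t / sums[i][1] for t in sums[i][0]] if i in sums else list(centers[i])
            for i in range(len(centers))
        ]
    return centers

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(kmeans_mapreduce(points, centers=[(0, 0), (10, 0)]))
```

Emitting `(point, 1)` pairs lets the reduce step accumulate sums and counts in a single pass, so the new centers follow from one division per cluster.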