Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1 align='center'>KMeans examples Assignment 2 v1</h1>

Assignment 2 asks you to do a cluster analysis using kmeans clustering.  kmeans is a partitioning method.  It starts with a number of clusters specified a priori, and then assigns observations to clusters so as to optimize a criterion, such as maximizing the _homogeneity_ of the clusters.  Clustering is an unsupervised learning method, meaning that there is no target variable to be predicted.  The task is to discover whether the data have structure, in terms of similarities or differences between data points.
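
As a rough illustration of the criterion, here is a minimal sketch on synthetic data (the `make_blobs` data and the names `X_demo` and `km_demo` are assumptions for illustration, not part of the assignment).  It checks by hand that each point is assigned to its nearest centroid, and that the within-cluster sum of squares matches what _scikit-learn_ reports as `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption; not the Ames data): 300 points, 3 blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# The number of clusters is fixed in advance
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Distance from every point to every fitted centroid
dists = np.linalg.norm(
    X_demo[:, None, :] - km_demo.cluster_centers_[None, :, :], axis=2)

# Each point belongs to the cluster with the nearest centroid
nearest = dists.argmin(axis=1)

# The criterion being minimized: within-cluster sum of squared distances
wcss = (dists[np.arange(len(X_demo)), nearest] ** 2).sum()
print(f'Manual WCSS: {wcss:.2f}   inertia_: {km_demo.inertia_:.2f}')
```

A smaller within-cluster sum of squares means more homogeneous clusters, but it always shrinks as the number of clusters grows, so it cannot be used on its own to choose that number.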

In the following we'll train a kmeans clustering solution using two different methods provided in the _scikit-learn_ library.  In doing your assignment you need only use one of them.   We'll apply a metric that 

In [1]:
import numpy as np
import pandas as pd
import os
from sklearn.cluster import KMeans, MiniBatchKMeans
%matplotlib inline
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

## Example Data

Here we'll use a selection of the numeric data from the Ames dataset. 

Assuming that the pickle file is in the current working directory:

In [2]:
AmesSelDF=pd.read_pickle('amesSelDF.pickle')
AmesSelDF.columns

Index(['Lot_Frontage', 'Lot_Area', 'Mas_Vnr_Area', 'Bsmt_Unf_SF',
       'Total_Bsmt_SF', 'First_Flr_SF', 'Second_Flr_SF', 'Gr_Liv_Area',
       'Bedroom_AbvGr', 'Kitchen_AbvGr', 'TotRms_AbvGrd', 'Fireplaces',
       'Garage_Area', 'Wood_Deck_SF', 'Open_Porch_SF', 'Sale_Price'],
      dtype='object')

These variables seem "more or less" metric:

In [3]:
AmesSelDF.describe()

,Lot_Frontage,Lot_Area,Mas_Vnr_Area,Bsmt_Unf_SF,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Gr_Liv_Area,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,Fireplaces,Garage_Area,Wood_Deck_SF,Open_Porch_SF,Sale_Price
count,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0
mean,57.647782,10147.921843,101.096928,559.071672,1051.255631,1159.557679,335.455973,1499.690444,2.854266,1.044369,6.443003,0.599317,472.658362,93.751877,47.533447,180796.060068
std,33.499441,7880.017759,178.634545,439.540571,440.968018,391.890885,428.395715,505.508887,0.827731,0.214076,1.572964,0.647921,215.187196,126.361562,67.4834,79886.692357
min,0.0,1300.0,0.0,0.0,0.0,334.0,0.0,334.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,12789.0
25%,43.0,7440.25,0.0,219.0,793.0,876.25,0.0,1126.0,2.0,1.0,5.0,0.0,320.0,0.0,0.0,129500.0
50%,63.0,9436.5,0.0,465.5,990.0,1084.0,0.0,1442.0,3.0,1.0,6.0,1.0,480.0,0.0,27.0,160000.0
75%,78.0,11555.25,162.75,801.75,1301.5,1384.0,703.75,1742.75,3.0,1.0,7.0,1.0,576.0,168.0,70.0,213500.0
max,313.0,215245.0,1600.0,2336.0,6110.0,5095.0,2065.0,5642.0,8.0,3.0,15.0,4.0,1488.0,1424.0,742.0,755000.0


_Sale_Price_ is the target variable you'll be training ensembles to predict.  We'll set that aside here and cluster on the rest of the columns. Then we'll 

In [4]:
AmesSelDF.loc[:,~(AmesSelDF.columns.isin(['Sale_Price']))]

,Lot_Frontage,Lot_Area,Mas_Vnr_Area,Bsmt_Unf_SF,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Gr_Liv_Area,Bedroom_AbvGr,Kitchen_AbvGr,TotRms_AbvGrd,Fireplaces,Garage_Area,Wood_Deck_SF,Open_Porch_SF
0,141,31770,112,441,1080,1656,0,1656,3,1,7,2,528,210,62
1,80,11622,0,270,882,896,0,896,2,1,5,0,730,140,0
2,81,14267,108,406,1329,1329,0,1329,3,1,6,0,312,393,36
3,93,11160,0,1045,2110,2110,0,2110,3,1,8,2,522,0,0
4,74,13830,0,137,928,928,701,1629,3,1,6,1,482,212,34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,37,7937,0,184,1003,1003,0,1003,3,1,6,0,588,120,0
2926,0,8885,0,239,864,902,0,902,2,1,5,0,484,164,0
2927,62,10441,0,575,912,970,0,970,3,1,6,0,0,80,32
2928,77,10010,0,195,1389,1389,0,1389,2,1,6,1,418,240,38


In [5]:
AmesClusDF=AmesSelDF.loc[:,~(AmesSelDF.columns.isin(['Sale_Price']))].astype('float32')
AmesClusDF.dtypes
X=AmesClusDF.to_numpy(copy=True)     # get the np array out of the DataFrame

## Example: 4 Cluster Solution: KMeans

In [None]:
km4=KMeans(n_clusters=4,random_state=33).fit(X)

In [None]:
label=km4.predict(X)     # predicted cluster membership labels
pd.Series(label).value_counts()

## Clustering Metrics

Be sure to check the documentation to understand what these metrics are measuring.
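
As a rough orientation, here is a minimal sketch of how the three scores behave on synthetic data whose structure is known (the four-blob data and the names `X_demo` and `scores` are assumptions for illustration).  It compares a 2 cluster and a 4 cluster solution on data built from four well-separated groups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic stand-in data (an assumption): four well-separated blobs
X_demo, _ = make_blobs(n_samples=400,
                       centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                       cluster_std=1.0, random_state=0)

scores = {}
for k in (2, 4):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = {
        'CH': calinski_harabasz_score(X_demo, labels_k),   # higher is better
        'DB': davies_bouldin_score(X_demo, labels_k),      # lower is better
        'Silhouette': silhouette_score(X_demo, labels_k),  # in [-1, 1], higher is better
    }
    print(k, scores[k])
```

Note the directions: Calinski-Harabasz and silhouette reward larger values, while Davies-Bouldin rewards smaller ones.  On real data the three scores need not agree on a single "best" solution.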

In [None]:
print(f'CH Score: {calinski_harabasz_score(X,label)}')
print(f'DB Score: {davies_bouldin_score(X,label)}')
print(f'Silhouette Score: {silhouette_score(X,label)}')

## Silhouette Plot

In [None]:
skplt.metrics.plot_silhouette(X, label)
plt.show();

## Grid Search, Rescaling, Data Reduction via PCA

You could do a simple grid search over the number of clusters to find the number of clusters with the "best" metric or score.  You could also try _preprocessing_ your data before clustering.  You could standardize your variables, or you could use PCA to define a reduced space into which they are projected.
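
Here is one way these ideas might fit together, as a minimal sketch on synthetic data.  The data, the choice of 3 principal components, the range of cluster counts, and the use of the silhouette score as the selection criterion are all assumptions for illustration; other choices are just as defensible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 5 blobs in 10 dimensions
X_demo, _ = make_blobs(n_samples=500, centers=5, n_features=10, random_state=0)
# Put one column on a much larger scale, as Lot_Area is relative to Fireplaces
X_demo[:, 0] *= 1000.0

# Standardize, then project onto the leading principal components
prep = make_pipeline(StandardScaler(), PCA(n_components=3, random_state=0))
X_red = prep.fit_transform(X_demo)

# Simple grid search over the number of clusters, scored by silhouette
# (silhouette needs at least 2 clusters, so the grid starts at 2)
results = {}
for k in range(2, 9):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_red)
    results[k] = silhouette_score(X_red, labels_k)

best_k = max(results, key=results.get)
print(results)
print(f'Best k by silhouette: {best_k}')
```

Standardizing matters for kmeans because it relies on Euclidean distance: without it, the variables with the largest scales dominate the solution.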

## Example: 8 Cluster Solution: MiniBatchKMeans

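This is very similar to the above, except that _MiniBatchKMeans_ updates the cluster centers using small random batches of the data (here 10 observations at a time) rather than the full dataset on each iteration.  It is usually faster on large datasets, at the cost of a somewhat less precise solution.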

In [None]:
kmMB8 = MiniBatchKMeans(n_clusters=8, random_state=11,batch_size=10).fit(X)
labelMB8=kmMB8.predict(X)
pd.Series(labelMB8).value_counts()

In [None]:
print(f'CH Score: {calinski_harabasz_score(X,labelMB8)}')
print(f'DB Score: {davies_bouldin_score(X,labelMB8)}')
print(f'Silhouette Score: {silhouette_score(X,labelMB8)}')

In [None]:
skplt.metrics.plot_silhouette(X, labelMB8,figsize=(7,7))
plt.show();

## Graphical Method for Selecting Number of Clusters

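Last but not least, here's an example of using the elbow curve method in _scikit-plot_.  It fits the clusterer once for each number of clusters in `cluster_ranges` and plots the resulting sum of squared errors, so that you can look for the "elbow" where adding more clusters stops helping much.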

In [None]:
kmMB=MiniBatchKMeans(random_state=88)
skplt.cluster.plot_elbow_curve(kmMB,X,n_jobs=-1,cluster_ranges=range(1,20),
                            figsize=(7,7))
plt.show();
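
If _scikit-plot_ isn't available, a similar elbow curve can be sketched with just _scikit-learn_ and _matplotlib_.  This is a minimal sketch on synthetic data, not a reimplementation of `plot_elbow_curve`; the data, the range of cluster counts, and the names `ks` and `inertias` are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption): 4 blobs in 5 dimensions
X_demo, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)

# Fit one model per candidate number of clusters and record its inertia
ks = list(range(1, 11))
inertias = []
for k in ks:
    mbk = MiniBatchKMeans(n_clusters=k, n_init=3, random_state=88).fit(X_demo)
    inertias.append(mbk.inertia_)   # within-cluster sum of squares

# Plot inertia against the number of clusters and look for the "elbow"
fig, ax = plt.subplots(figsize=(7, 7))
ax.plot(ks, inertias, marker='o')
ax.set_xlabel('Number of clusters')
ax.set_ylabel('Within-cluster sum of squares (inertia_)')
ax.set_title('Elbow curve')
plt.show()
```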