# Recitation Exercises

## Exercise 4

a) Plot the probability of obtaining one point from each cluster in a sample of size K for values of K between 2 and 100.

![plot for exercise 4](p1.png)
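
The curve in part (a) can be reproduced numerically. The sketch below assumes equal-size clusters and independent draws, so that a sample of size K hits every cluster exactly once with probability $K!/K^K$; the function name is illustrative and not taken from the original notebook.

```python
from math import factorial

def prob_one_per_cluster(k: int) -> float:
    """P(a sample of size k contains exactly one point from each of k clusters)."""
    # k! favourable cluster orderings out of k**k equally likely sequences.
    return factorial(k) / k ** k

# Values that could be passed to a plotting call for K = 2..100.
ks = list(range(2, 101))
probs = [prob_one_per_cluster(k) for k in ks]
print(probs[0], probs[8])  # K = 2 and K = 10
```

The probability drops off very quickly, which is why a log-scaled y-axis is usually the more readable choice for this plot.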

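b) For K clusters, K = 10, 100, & 1000, find the probability that a sample of size 2K contains at least one point from each cluster.

Assuming equal-sized clusters and independent draws, each draw lands in a given cluster with probability $\frac{1}{K}$. Applying inclusion-exclusion over the set of missed clusters:

p = $\sum_{j=0}^{K} (-1)^j \binom{K}{j} \left(1 - \frac{j}{K}\right)^{2K}$

For K = 10,

p $\approx$ 0.21

For K = 100,

a sample of size 200 is expected to miss about $K(1 - \frac{1}{K})^{2K} \approx K e^{-2} \approx 13.5$ clusters, so the chance of missing none is vanishingly small: p $\approx$ 0

For K = 1000, 

the expected number of missed clusters is about 135, so again p $\approx$ 0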
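
One way to check part (b) numerically is an exact inclusion-exclusion count. This is a sketch under the assumption of equal-size clusters and independent draws (sampling with replacement); the function name is hypothetical. It counts the cluster sequences of length `sample_size` that miss no cluster and divides by all $K^{\text{sample size}}$ sequences.

```python
from math import comb

def prob_all_clusters_hit(k: int, sample_size: int) -> float:
    """P(a sample of `sample_size` draws hits each of k equal-size clusters)."""
    # Inclusion-exclusion: alternately subtract/add sequences avoiding j clusters.
    hit_all = sum(
        (-1) ** j * comb(k, j) * (k - j) ** sample_size for j in range(k + 1)
    )
    # Exact integer counts, converted to float only at the very end.
    return hit_all / k ** sample_size

for k in (10, 100, 1000):
    print(k, prob_all_clusters_hit(k, 2 * k))
```

Using Python integers keeps the alternating sum exact, which matters because the individual terms are many orders of magnitude larger than the final result.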

## Exercise 7 

Given the data set:
 - there are m points & K clusters
 - half the points & clusters are in "more dense" regions
 - half the points & clusters are in "less dense" regions
 - the 2 regions are well-separated from each other
 
 Which of the following should occur in order to minimize the squared error when finding K clusters:

    The correct answer is the option that allocates more centroids to the less dense region, not the dense one. Points in the less dense region lie farther from their nearest centroid, so each of them contributes more to the squared error; giving that region more than its even share of centroids lowers the total SSE.

## Exercise 11

Total SSE is the sum of the SSE for each separate attribute. 
- What does it mean if the SSE for one variable is low for all clusters?

    If the SSE of one attribute is low for all clusters, then the variable is essentially constant and contributes little to dividing the data into groups.
    
- Low for just one cluster?

    Then the attribute takes tightly grouped values within that cluster, so it helps define that cluster.
    
- High for all clusters?

    If it is high for every cluster, the attribute is probably noise with respect to this clustering.
    
- High for one cluster?

    The attribute is at odds with the low-SSE attributes that define that cluster, so it does not help define it.
    
- How could you use the per variable SSE info to improve your clustering?

    It can help in deciding which attributes to eliminate: attributes whose SSE is low for all clusters or high for all clusters have little power to distinguish clusters. Attributes with high SSE everywhere are the most harmful, since they add noise to the total SSE, especially when their scale is large relative to the other attributes. Dropping them also saves computation time.
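
The per-attribute breakdown discussed in this exercise can be sketched as follows. The data and the function name are hypothetical, chosen so that attribute 0 separates the clusters while attribute 1 is constant; this is an illustration, not part of the original solution.

```python
def per_attribute_sse(points, labels):
    """Return {cluster: [SSE of attribute 0, SSE of attribute 1, ...]}."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    result = {}
    for label, members in clusters.items():
        n_attrs = len(members[0])
        # The centroid is the attribute-wise mean of the cluster members.
        centroid = [sum(m[a] for m in members) / len(members) for a in range(n_attrs)]
        result[label] = [
            sum((m[a] - centroid[a]) ** 2 for m in members) for a in range(n_attrs)
        ]
    return result

# Hypothetical toy data: attribute 1 is constant, so its SSE is 0 everywhere.
points = [(1.0, 5.0), (2.0, 5.0), (10.0, 5.0), (12.0, 5.0)]
labels = [0, 0, 1, 1]
print(per_attribute_sse(points, labels))
```

Summing each list gives the cluster's SSE, and summing over clusters gives the total SSE, so the table is a straight decomposition of the usual objective.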

## Exercise 16

a) Single Link

![single link](p3a.png)

b) Complete Link

![complete link](p3b.png)

## Exercise 17

Given set of 1-dimensional points: {6, 12, 18, 24, 30, 42, 48}

a) For each of the following sets of initial centroids, create 2 clusters by assigning each point to the nearest centroid & then calculate the total squared error for each of the 2 clusters. 

    i. {18, 45}
    
    Reasoning: each point goes to the centroid with the smallest absolute difference,
        e.g. |30 - 18| = 12 < |45 - 30| = 15, so 30 joins the first cluster. Each error is the sum of squared differences from the cluster mean.
        
    Cluster 1 : {6, 12, 18, 24, 30}, Error = 360
    Cluster 2: {42, 48}, Error = 18
    
    Thus, total error = 378
    
    ii. {15, 40}
    
    Cluster 1 : {6, 12, 18, 24}, Error = 180
    Cluster 2 : {30, 42, 48}, Error = 168
    
    Thus, total error = 348
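
The assignments and errors in part (a) can be checked with a short script. It is a minimal sketch of a single K-means assignment step in one dimension, assuming no cluster ends up empty; the helper name is illustrative.

```python
def assign_and_score(points, centroids):
    """Assign 1-D points to the nearest centroid; return clusters, means, SSEs."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Assumes every centroid attracts at least one point.
    means = [sum(c) / len(c) for c in clusters]
    sse = [sum((p - m) ** 2 for p in c) for c, m in zip(clusters, means)]
    return clusters, means, sse

points = [6, 12, 18, 24, 30, 42, 48]
for init in ([18, 45], [15, 40]):
    clusters, means, sse = assign_and_score(points, init)
    print(init, clusters, means, sse, sum(sse))
```

In both cases the recomputed means equal the initial centroids, so another K-means iteration would leave the assignment unchanged.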
    
b) Do both sets of centroids represent stable solutions? 
    
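    Yes, both are stable solutions. The means of the resulting clusters are 18 & 45 in (i) and 15 & 40 in (ii), which equal the initial centroids, so recomputing the centroids and reassigning the points changes nothing.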
    
c) What are the 2 clusters produced by single link?

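        Adjacent points are 6 apart everywhere except between 30 and 42, where the gap is 12. Single link merges the closest pairs first, so the last two clusters standing are those separated by the largest gap: {6, 12, 18, 24, 30} & {42, 48}. For comparison, the single-link distance between the clusters in (i) is 42 - 30 = 12, while in (ii) it is only 30 - 24 = 6.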
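
For one-dimensional data, cutting a single-link hierarchy into k clusters amounts to splitting the sorted points at the k - 1 largest gaps between neighbors. The sketch below uses that shortcut instead of building the full dendrogram; the function name is hypothetical.

```python
def single_link_1d(points, n_clusters):
    """Single-link clusters of 1-D points, found by cutting the largest gaps."""
    pts = sorted(points)
    # Rank the positions between neighbors by gap size, largest first.
    by_gap = sorted(
        range(len(pts) - 1), key=lambda i: pts[i + 1] - pts[i], reverse=True
    )
    cuts = sorted(by_gap[: n_clusters - 1])
    clusters, start = [], 0
    for cut in cuts:
        clusters.append(pts[start : cut + 1])
        start = cut + 1
    clusters.append(pts[start:])
    return clusters

print(single_link_1d([6, 12, 18, 24, 30, 42, 48], 2))
```

With tied gaps the choice of cut is arbitrary, which mirrors the tie-breaking ambiguity of single link itself.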
    
d) Which technique, K-means or single link, seems to produce the "most natural" clustering in this situation?

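    Single link (MIN) gives the more natural clustering here: it splits the points at the one large gap (30 to 42), whereas the lowest-error K-means solution (ii) cuts through the evenly spaced run between 24 and 30.
    
e) What definition(s) of clustering does this natural clustering correspond to?
    
    MIN produces contiguous clusters, so this matches the contiguity-based definition. A density-based definition would also fit, since the split falls at the sparsest part of the data.
    
f) What well-known characteristic of the K-means alg. explains the previous behavior?

    The K-means alg. is weak at finding clusters of different sizes, particularly when they are not well separated. The objective of minimizing squared error leads it to break up the larger cluster, so the lower-error solution (ii), with total error 348 versus 378, is the unnatural one in this case.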

## Exercise 21

Compute the entropy and purity for the confusion matrix...
    
Cluster #1:

Entropy = $-\left[\frac{1}{693}\log_2\frac{1}{693} + \frac{1}{693}\log_2\frac{1}{693} + \frac{0}{693}\log_2\frac{0}{693} + \frac{11}{693}\log_2\frac{11}{693} + \frac{4}{693}\log_2\frac{4}{693} + \frac{676}{693}\log_2\frac{676}{693}\right] \approx 0.20$ (taking $0 \log_2 0 = 0$), Purity = $\frac{676}{693} \approx 0.98$

Cluster #2 :

Entropy = 1.84, Purity = 0.53

Cluster #3:

Entropy = 1.7, Purity = 0.49

Total:

Entropy = 1.44, Purity = 0.61
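
The Cluster #1 figures can be reproduced with a few lines of Python. This is a minimal sketch that uses only the class counts written out above for that cluster; the function name is illustrative, and empty classes are skipped so that $0 \log_2 0$ counts as 0.

```python
from math import log2

def entropy_and_purity(counts):
    """Entropy (base 2) and purity of one cluster from its per-class counts."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken to be 0.
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * log2(p) for p in probs)
    purity = max(counts) / total
    return entropy, purity

cluster1 = [1, 1, 0, 11, 4, 676]  # class counts for Cluster #1
e, p = entropy_and_purity(cluster1)
print(round(e, 2), round(p, 2))
```

The totals are the size-weighted averages of the per-cluster values, with each cluster weighted by its share of all points.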

## Exercise 22

Given 2 sets of 100 points that fall within the unit square. One set of points is arranged so that the points are uniformly spaced. The other set of points is generated from a uniform distribution over the unit square. 

a) Is there a difference between the 2 sets of points?

    Yes. The random points will have regions of lower & higher density, while the uniformly spaced points have the same density everywhere. 
    
b) If so, which set of points will typically have a smaller SSE for K=10 clusters?

    The randomly generated set will typically have the smaller SSE, since centroids can settle on its denser clumps.
    
c) What will be the behavior of DBSCAN on the uniform data set? The random data set?

    Depending on the Eps & MinPts thresholds, DBSCAN will either merge all the points in the uniform data set into one cluster or label them all as noise (points on the edge of the square may behave differently). In the random data set, DBSCAN can often find clusters because the density varies between regions. 

# Practicum Problems

## Problem 1 - Auto-Mpg Dataset

In [163]:
#All necessary imports
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.impute import SimpleImputer
#from sklearn.datasets.samples_generator import make_blobs
#from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

In [164]:
#Load the auto-mpg sample dataset
autompg_ds = pd.read_csv('auto-mpg.csv', na_values=["?"])
autompg_ds = autompg_ds.drop(columns=['name', 'cylinders', 'model_year', 'origin'])

In [165]:
#Check to make sure the dataset works
autompg_ds.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration
count,398.0,398.0,392.0,398.0,398.0
mean,23.514573,193.425879,104.469388,2970.424623,15.56809
std,7.815984,104.269838,38.49116,846.841774,2.757689
min,9.0,68.0,46.0,1613.0,8.0
25%,17.5,104.25,75.0,2223.75,13.825
50%,23.0,148.5,93.5,2803.5,15.5
75%,29.0,262.0,126.0,3608.0,17.175
max,46.6,455.0,230.0,5140.0,24.8


In [166]:
#Impute any missing values with the column mean (only horsepower has gaps: count 392 of 398)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
autompg_ds[autompg_ds.columns] = imp_mean.fit_transform(autompg_ds)

In [167]:
#Check the dataset again
autompg_ds.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration
count,398.0,398.0,398.0,398.0,398.0
mean,23.514573,193.425879,104.469388,2970.424623,15.56809
std,7.815984,104.269838,38.199187,846.841774,2.757689
min,9.0,68.0,46.0,1613.0,8.0
25%,17.5,104.25,76.0,2223.75,13.825
50%,23.0,148.5,95.0,2803.5,15.5
75%,29.0,262.0,125.0,3608.0,17.175
max,46.6,455.0,230.0,5140.0,24.8


In [168]:
#Perform hierarchical clustering with average linkage & Euclidean distance (the default affinity).
#n_clusters=3 cuts the tree into the 3 target clusters.
#Note: scikit-learn 1.2 renamed the affinity parameter to metric; affinity was removed in 1.4.

clustering = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average').fit(autompg_ds)

In [169]:
labels = clustering.labels_
print(labels)

[2 2 2 2 2 1 1 1 1 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 2 2 0
 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 2 1 1 1 1 0 2 1
 1 1 0 0 0 0 0 0 0 0 0 1 2 2 1 2 1 1 1 1 1 1 2 0 0 0 0 0 0 1 1 1 1 0 0 0 0
 0 0 0 0 1 1 0 0 0 0 2 0 0 2 0 0 0 2 0 0 0 0 2 2 2 1 1 1 1 1 0 0 0 0 0 0 0
 0 0 0 0 0 2 2 0 1 1 1 1 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 1 2 1 0 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 0 0 2 1 1 2 2 0 0 0 0 0 2
 1 1 1 2 2 2 2 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 0 0 0 2 0 2
 0 2 2 2 2 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2 2 2 2 2 1 1 2 2 0 0 0
 0 2 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 2 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [170]:
#To find the mean and variance of each cluster, add the labels (0 to 2) to the dataframe.
#describe() reports the standard deviation; square it to get the variance.
autompg_ds['cluster'] = clustering.labels_
autompg_ds.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,cluster
count,398.0,398.0,398.0,398.0,398.0,398.0
mean,23.514573,193.425879,104.469388,2970.424623,15.56809,0.502513
std,7.815984,104.269838,38.199187,846.841774,2.757689,0.77019
min,9.0,68.0,46.0,1613.0,8.0,0.0
25%,17.5,104.25,76.0,2223.75,13.825,0.0
50%,23.0,148.5,95.0,2803.5,15.5,0.0
75%,29.0,262.0,125.0,3608.0,17.175,1.0
max,46.6,455.0,230.0,5140.0,24.8,2.0


In [171]:
#This is the description for the first cluster
c0 = autompg_ds.loc[autompg_ds['cluster'] == 0]
c0.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,cluster
count,266.0,266.0,266.0,266.0,266.0,266.0
mean,27.365414,131.934211,84.300061,2459.511278,16.29812,0.0
std,6.478913,53.179727,19.213107,427.354771,2.391296,0.0
min,13.0,68.0,46.0,1613.0,10.0,0.0
25%,22.075,97.0,70.0,2124.25,14.55,0.0
50%,26.9,119.0,85.0,2395.0,16.0,0.0
75%,32.0,149.75,95.0,2805.25,17.6,0.0
max,46.6,455.0,225.0,3302.0,24.8,0.0


In [172]:
#This is the description for the second cluster
c1 = autompg_ds.loc[autompg_ds['cluster'] == 1]
c1.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,cluster
count,64.0,64.0,64.0,64.0,64.0,64.0
mean,13.889062,358.09375,167.046875,4398.59375,13.025,1.0
std,1.832781,46.240818,27.504937,272.602899,1.895106,0.0
min,9.0,260.0,110.0,4042.0,8.5,1.0
25%,13.0,318.0,149.75,4183.75,12.0,1.0
50%,14.0,350.0,156.5,4357.0,13.0,1.0
75%,15.125,400.0,180.0,4530.25,14.0,1.0
max,17.5,455.0,230.0,5140.0,19.0,1.0


In [173]:
#This is the description for the third cluster
c2 = autompg_ds.loc[autompg_ds['cluster'] == 2]
c2.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,cluster
count,68.0,68.0,68.0,68.0,68.0,68.0
mean,17.510294,278.985294,124.470588,3624.838235,15.105882,2.0
std,2.971513,53.688847,26.70372,194.359999,3.249151,0.0
min,11.0,163.0,72.0,3329.0,8.0,2.0
25%,15.0,231.0,105.0,3435.25,12.725,2.0
50%,17.55,260.0,120.0,3616.5,15.45,2.0
75%,19.125,318.0,150.0,3782.0,17.25,2.0
max,26.6,400.0,190.0,3988.0,22.2,2.0


In [174]:
#To see what changes if the feature 'origin' is used as a class label, create another dataset that keeps it
autompg_ds2 = pd.read_csv('auto-mpg.csv', na_values=["?"])
autompg_ds2 = autompg_ds2.drop(columns=['name','cylinders', 'model_year'])
autompg_ds2.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin
count,398.0,398.0,392.0,398.0,398.0,398.0
mean,23.514573,193.425879,104.469388,2970.424623,15.56809,1.572864
std,7.815984,104.269838,38.49116,846.841774,2.757689,0.802055
min,9.0,68.0,46.0,1613.0,8.0,1.0
25%,17.5,104.25,75.0,2223.75,13.825,1.0
50%,23.0,148.5,93.5,2803.5,15.5,1.0
75%,29.0,262.0,126.0,3608.0,17.175,2.0
max,46.6,455.0,230.0,5140.0,24.8,3.0


In [175]:
#Impute any missing values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
autompg_ds2[autompg_ds2.columns] = imp_mean.fit_transform(autompg_ds2)

In [176]:
#This is the description for the cars with origin == 1
o1 = autompg_ds2.loc[autompg_ds2['origin'] == 1]
o1.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin
count,249.0,249.0,249.0,249.0,249.0,249.0
mean,20.083534,245.901606,118.814769,3361.931727,15.033735,1.0
std,6.402892,98.501839,39.617323,794.792506,2.751112,0.0
min,9.0,85.0,52.0,1800.0,8.0,1.0
25%,15.0,151.0,88.0,2720.0,13.0,1.0
50%,18.5,250.0,105.0,3365.0,15.0,1.0
75%,24.0,318.0,150.0,4054.0,16.9,1.0
max,39.0,455.0,230.0,5140.0,22.2,1.0


In [177]:
#This is the description for the cars with origin == 2
o2 = autompg_ds2.loc[autompg_ds2['origin'] == 2]
o2.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin
count,70.0,70.0,70.0,70.0,70.0,70.0
mean,27.891429,109.142857,81.241983,2423.3,16.787143,2.0
std,6.72393,22.582079,20.264743,490.043191,3.045687,0.0
min,16.2,68.0,46.0,1825.0,12.2,2.0
25%,24.0,92.25,70.0,2067.25,14.5,2.0
50%,26.5,104.5,77.5,2240.0,15.7,2.0
75%,30.65,121.0,90.75,2769.75,18.9,2.0
max,44.3,183.0,133.0,3820.0,24.8,2.0


In [178]:
#This is the description for the cars with origin == 3
o3 = autompg_ds2.loc[autompg_ds2['origin'] == 3]
o3.describe()

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin
count,79.0,79.0,79.0,79.0,79.0,79.0
mean,30.450633,102.708861,79.835443,2221.227848,16.172152,3.0
std,6.090048,23.140126,17.819199,320.497248,1.954937,0.0
min,18.0,70.0,52.0,1613.0,11.4,3.0
25%,25.7,86.0,67.0,1985.0,14.6,3.0
50%,31.6,97.0,75.0,2155.0,16.4,3.0
75%,34.05,119.0,95.0,2412.5,17.55,3.0
max,46.6,168.0,132.0,2930.0,21.0,3.0


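To answer the final question, I do not see a clear one-to-one relationship between the cluster assignment and the class labels. The clusters are non-overlapping weight bands (1613 to 3302, 3329 to 3988, and 4042 to 5140), which suggests that the unscaled weight attribute dominates the Euclidean distance, whereas the three origin groups overlap heavily in weight (1800 to 5140, 1825 to 3820, and 1613 to 2930). The heaviest cluster can only contain origin 1 cars, but the largest cluster (266 cars) is bigger than any single origin group, so it must mix origins.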
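
A contingency table makes this kind of comparison concrete. The sketch below counts (cluster, class) pairs with the standard library; the labels are hypothetical toy values, NOT the auto-mpg results. In the notebook itself, something like `pd.crosstab(autompg_ds['cluster'], autompg_ds2['origin'])` should give the same kind of table, assuming the two dataframes keep the same row order.

```python
from collections import Counter

def contingency(cluster_labels, class_labels):
    """Count how many points fall into each (cluster, class) pair."""
    return Counter(zip(cluster_labels, class_labels))

# Hypothetical toy labels for illustration only.
clusters = [0, 0, 0, 1, 1, 2]
origins = [1, 2, 3, 1, 1, 1]
table = contingency(clusters, origins)
for c in sorted(set(clusters)):
    # One row per cluster, one column per origin value.
    print(c, [table[(c, o)] for o in sorted(set(origins))])
```

A clustering that tracked the class labels closely would show one dominant entry per row and per column.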

## Problem 2 - Boston Dataset

In [101]:
#Imports
#Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.cluster import KMeans
from sklearn.datasets import load_boston
from sklearn import preprocessing
from sklearn import metrics

In [104]:
#Load the dataset into a dataframe
boston_ds = load_boston()
boston_df = pd.DataFrame(boston_ds.data, columns=boston_ds.feature_names)
boston_df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97


In [126]:
#Scale the dataset
df_scaled = pd.DataFrame(data=preprocessing.scale(boston_ds.data), columns=boston_ds.feature_names)
df_scaled.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,-8.787437000000001e-17,-6.343191e-16,-2.682911e-15,4.701992e-16,2.490322e-15,-1.14523e-14,-1.407855e-15,9.210902e-16,5.441409e-16,-8.868619e-16,-9.205636e-15,8.163101e-15,-3.370163e-16
std,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099
min,-0.4197819,-0.4877224,-1.557842,-0.2725986,-1.465882,-3.880249,-2.335437,-1.267069,-0.9828429,-1.31399,-2.707379,-3.907193,-1.531127
25%,-0.4109696,-0.4877224,-0.8676906,-0.2725986,-0.9130288,-0.5686303,-0.837448,-0.8056878,-0.6379618,-0.767576,-0.4880391,0.2050715,-0.79942
50%,-0.3906665,-0.4877224,-0.2110985,-0.2725986,-0.1442174,-0.1084655,0.3173816,-0.2793234,-0.5230014,-0.4646726,0.274859,0.3811865,-0.1812536
75%,0.00739656,0.04877224,1.015999,-0.2725986,0.598679,0.4827678,0.9067981,0.6623709,1.661245,1.530926,0.8065758,0.433651,0.6030188
max,9.933931,3.804234,2.422565,3.668398,2.732346,3.555044,1.117494,3.960518,1.661245,1.798194,1.638828,0.4410519,3.548771


In [112]:
#Perform K-Means with the scaled data & number of clusters = 2
clust_model = KMeans(n_clusters=2, init='k-means++')
clust_labels = clust_model.fit_predict(df_scaled)
print(clust_labels)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [115]:
#Provide the Silhouette
silhouette_avg = metrics.silhouette_score(df_scaled, clust_labels)
print(silhouette_avg)

0.36011768587358606


In [116]:
#Perform K-Means with the scaled data & number of clusters = 3
clust_model = KMeans(n_clusters=3, init='k-means++')
clust_labels = clust_model.fit_predict(df_scaled)
print(clust_labels)

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 2 2 2
 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 2 2 2 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
 2 2 2 2 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [117]:
#Provide the Silhouette
silhouette_avg = metrics.silhouette_score(df_scaled, clust_labels)
print(silhouette_avg)

0.2574894522739469


In [118]:
#Perform K-Means with the scaled data & number of clusters = 4
clust_model = KMeans(n_clusters=4, init='k-means++')
clust_labels = clust_model.fit_predict(df_scaled)
print(clust_labels)

[3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 1 1 1 1 1 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 3 1 3 3 3 3
 3 3 3 3 3 3 1 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 3 0 3 3 3 3 0 3 0 3 0 0 0 0 2 0 0 0 0 0
 0 0 0 0 2 0 2 2 0 3 3 3 2 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 2 2 2 3 3 3 2 3 2 2 2 2
 2 3 3 3 3 3 3 3 3 3 3 3 2 3 2 3 1 1 1 1 1 1 3 3 1 3 1 1 1 1 1 1 1 1 1 3 3
 3 3 3 3 3 3 3 3 3 3 2 3 1 3 2 2 1 2 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 3 3 3
 3 3 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1
 3 3 3 3 3 3 3 3 1 3 1 1 3 3 1 1 1 1 1 1 1 1 1 2 2 2 0 0 0 0 2 2 0 0 0 0 2
 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [119]:
#Provide the Silhouette
silhouette_avg = metrics.silhouette_score(df_scaled, clust_labels)
print(silhouette_avg)

0.2809804562187518


In [122]:
#Perform K-Means with the scaled data & number of clusters = 5
clust_model = KMeans(n_clusters=5, init='k-means++')
clust_labels = clust_model.fit_predict(df_scaled)
print(clust_labels)

[1 1 1 1 1 1 1 1 4 1 1 1 1 1 4 1 1 4 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1
 1 1 3 3 1 1 1 1 1 1 1 4 1 1 1 1 1 3 3 3 3 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 4 4 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4 4 4 4 4
 4 4 4 4 2 4 2 2 4 4 4 4 2 4 2 2 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 4 2 2 2 2 2 1 4 1 2 4 2 2 2 2
 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 3 3 1 1
 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 2 2 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 1 1 1
 1 1 3 3 3 1 3 3 1 1 1 1 1 4 4 1 4 1 1 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3
 1 1 1 1 1 1 1 1 3 1 3 3 1 1 3 3 3 3 3 3 3 3 3 2 2 2 0 0 0 0 2 2 0 0 0 0 2
 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 4 4 0 4 4 

In [123]:
#Provide the Silhouette
silhouette_avg = metrics.silhouette_score(df_scaled, clust_labels)
print(silhouette_avg)

0.24783915229552517


In [124]:
#Perform K-Means with the scaled data & number of clusters = 6
clust_model = KMeans(n_clusters=6, init='k-means++')
clust_labels = clust_model.fit_predict(df_scaled)
print(clust_labels)

[4 4 4 4 4 4 4 4 5 4 4 4 4 4 5 4 4 5 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4
 4 4 1 1 4 4 4 4 4 4 4 5 4 4 4 4 4 1 1 1 1 4 4 4 4 4 4 4 1 1 4 4 4 4 4 4 4
 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 4 5 5 5 5 5 5 5 5 5
 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 2 5 5 5 5 5
 5 5 5 5 2 5 2 2 5 5 5 5 2 5 2 2 5 5 5 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4
 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 5 2 2 2 2 2 4 5 4 2 5 2 2 2 2
 2 4 4 4 4 4 4 4 4 4 4 4 2 4 2 4 1 4 4 4 4 1 4 4 4 4 4 4 4 4 1 1 1 1 1 4 4
 4 4 4 4 4 4 4 4 4 4 2 4 4 4 2 2 4 2 2 4 4 4 4 2 1 1 1 1 1 1 1 1 1 1 4 4 4
 4 4 1 1 1 4 1 1 4 4 4 4 4 5 5 4 5 5 5 5 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1
 4 4 4 4 4 4 4 4 1 4 1 1 4 4 1 1 1 1 1 1 1 1 1 2 2 2 0 0 0 0 2 2 0 0 3 0 2
 2 0 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 0
 0 0 3 3 3 3 3 3 3 3 3 3 3 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0
 0 3 0 0 0 0 3 0 0 0 3 3 3 3 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [125]:
#Provide the Silhouette
silhouette_avg = metrics.silhouette_score(df_scaled, clust_labels)
print(silhouette_avg)

0.2617743548302116


Since the silhouette score is highest for 2 clusters (0.360, versus 0.257, 0.281, 0.248, and 0.262 for 3 to 6 clusters), k = 2 is the optimal choice among the values tried. A high silhouette value indicates that points are well matched to their own cluster & poorly matched to neighboring clusters.
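
To make the score less of a black box, here is a minimal pure-Python sketch of the silhouette computation for one-dimensional points. It assumes at least two clusters and uses absolute difference as the distance; the function name is hypothetical and this is not the scikit-learn implementation. For each point, `a` is the mean distance to its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`.

```python
def mean_silhouette(points, labels):
    """Average silhouette of 1-D points; labels is a list of cluster ids."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [abs(p - q) for j, (q, lq) in enumerate(zip(points, labels))
               if lq == lab and j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(own) / len(own)
        b = min(
            sum(abs(p - q) for q, lq in zip(points, labels) if lq == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

points = [0, 1, 10, 11]
print(mean_silhouette(points, [0, 0, 1, 1]))  # tight, well-separated clusters
print(mean_silhouette(points, [0, 1, 0, 1]))  # clusters that interleave
```

The first labeling scores close to 1 and the second scores below 0, which is the contrast the silhouette is designed to expose.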

In [131]:
#Redo the clustering with n_clusters = 2 to calculate the mean of all values
clust_model = KMeans(n_clusters=2, init='k-means++')
clust_labels = clust_model.fit_predict(df_scaled)
print(clust_labels)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 

In [132]:
#Add the labels to the df_scaled
df_scaled['cluster'] = clust_labels
df_scaled.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,cluster
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,-8.787437000000001e-17,-6.343191e-16,-2.682911e-15,4.701992e-16,2.490322e-15,-1.14523e-14,-1.407855e-15,9.210902e-16,5.441409e-16,-8.868619e-16,-9.205636e-15,8.163101e-15,-3.370163e-16,0.349802
std,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,1.00099,0.477379
min,-0.4197819,-0.4877224,-1.557842,-0.2725986,-1.465882,-3.880249,-2.335437,-1.267069,-0.9828429,-1.31399,-2.707379,-3.907193,-1.531127,0.0
25%,-0.4109696,-0.4877224,-0.8676906,-0.2725986,-0.9130288,-0.5686303,-0.837448,-0.8056878,-0.6379618,-0.767576,-0.4880391,0.2050715,-0.79942,0.0
50%,-0.3906665,-0.4877224,-0.2110985,-0.2725986,-0.1442174,-0.1084655,0.3173816,-0.2793234,-0.5230014,-0.4646726,0.274859,0.3811865,-0.1812536,0.0
75%,0.00739656,0.04877224,1.015999,-0.2725986,0.598679,0.4827678,0.9067981,0.6623709,1.661245,1.530926,0.8065758,0.433651,0.6030188,1.0
max,9.933931,3.804234,2.422565,3.668398,2.732346,3.555044,1.117494,3.960518,1.661245,1.798194,1.638828,0.4410519,3.548771,1.0


In [134]:
#Show the mean values for all features in the first cluster
c0 = df_scaled.loc[df_scaled['cluster'] == 0]
c0.mean()

CRIM      -0.390124
ZN         0.262392
INDUS     -0.620368
CHAS       0.002912
NOX       -0.584675
RM         0.243315
AGE       -0.435108
DIS        0.457222
RAD       -0.583801
TAX       -0.631460
PTRATIO   -0.285808
B          0.326451
LSTAT     -0.446421
cluster    0.000000
dtype: float64

In [135]:
#Show the mean values for all features in the second cluster
c1 = df_scaled.loc[df_scaled['cluster'] == 1]
c1.mean()

CRIM       0.725146
ZN        -0.487722
INDUS      1.153113
CHAS      -0.005412
NOX        1.086769
RM        -0.452263
AGE        0.808760
DIS       -0.849865
RAD        1.085145
TAX        1.173731
PTRATIO    0.531248
B         -0.606793
LSTAT      0.829787
cluster    1.000000
dtype: float64

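To answer the final question for this problem, the mean of the points in a cluster is by definition that cluster's centroid, so the per-cluster feature means above are the centroid coordinates in the scaled feature space (up to the convergence tolerance of K-means) and should match the rows of `clust_model.cluster_centers_`. Cluster 1 sits above the overall mean on CRIM, INDUS, NOX, AGE, RAD, TAX, PTRATIO, & LSTAT, while cluster 0 sits below it on those same features.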

## Problem 3 - Wine Dataset

In [138]:
#Imports
from sklearn.datasets import load_wine

In [139]:
#Load the dataset
wine_ds = load_wine()
wine_scaled = pd.DataFrame(data=preprocessing.scale(wine_ds.data), columns=wine_ds.feature_names)
wine_scaled.describe()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,7.841418e-15,2.444986e-16,-4.059175e-15,-7.110417e-17,-2.4948830000000002e-17,-1.955365e-16,9.443133e-16,-4.178929e-16,-1.54059e-15,-4.129032e-16,1.398382e-15,2.126888e-15,-6.985673e-17
std,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821,1.002821
min,-2.434235,-1.432983,-3.679162,-2.671018,-2.088255,-2.107246,-1.695971,-1.868234,-2.069034,-1.634288,-2.094732,-1.895054,-1.493188
25%,-0.7882448,-0.6587486,-0.5721225,-0.6891372,-0.8244151,-0.8854682,-0.8275393,-0.7401412,-0.5972835,-0.7951025,-0.7675624,-0.9522483,-0.7846378
50%,0.06099988,-0.423112,-0.02382132,0.001518295,-0.1222817,0.09595986,0.1061497,-0.1760948,-0.06289785,-0.1592246,0.03312687,0.2377348,-0.2337204
75%,0.8361286,0.6697929,0.6981085,0.6020883,0.5096384,0.8089974,0.8490851,0.6095413,0.6291754,0.493956,0.7131644,0.7885875,0.7582494
max,2.259772,3.109192,3.156325,3.154511,4.371372,2.539515,3.062832,2.402403,3.485073,3.435432,3.301694,1.960915,2.971473


In [140]:
clust_model = KMeans(n_clusters=3, init='k-means++')
clust_labels = clust_model.fit_predict(wine_scaled)
print(clust_labels)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]


In [160]:
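#Obtain class labels
#Note: qcut with q=2 bins the three wine classes into two groups (classes 0 & 1 share a bin, as the printed labels show); wine_ds.target itself holds the three-class labels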
target_values = wine_ds.target
target_labels = pd.qcut(x=target_values, q=2, labels=[1,2])
class_labels = target_labels.astype('int32') - 1
print(class_labels)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


In [161]:
#Get the homogeneity & completeness
h_score = metrics.homogeneity_score(class_labels, clust_labels)
c_score = metrics.completeness_score(class_labels, clust_labels)

print(h_score, c_score)

0.8900387051166616 0.4745258086227089


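To answer the final question, homogeneity means each cluster contains only observations from a single class. Completeness means all members of the same class are within the same cluster. Both scores range from 0 to 1, with the higher # being the best outcome. Here the homogeneity of 0.89 says the clusters are mostly pure, while the completeness of 0.47 is low because the class labels used take only two values but there are three clusters, so at least one class has to be split across clusters.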
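
Both scores can be written in terms of conditional entropy: homogeneity is $1 - H(C|K)/H(C)$ and completeness is $1 - H(K|C)/H(K)$, where $C$ is the class labeling and $K$ the cluster labeling. The sketch below is a minimal pure-Python version of those formulas with hypothetical helper names and toy labels; it is meant as an illustration, not as a replacement for the `metrics` functions used above.

```python
from collections import Counter
from math import log

def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def _conditional_entropy(target, given):
    """H(target | given), computed from joint and marginal counts."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum(c / n * log(c / given_counts[g]) for (_, g), c in joint.items())

def homogeneity_completeness(classes, clusters):
    h_c, h_k = _entropy(classes), _entropy(clusters)
    # A labeling with a single value has zero entropy; score it as perfect.
    homogeneity = 1.0 if h_c == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_k
    return homogeneity, completeness

# Toy example: every cluster is pure, but class 0 is split over two clusters.
h, c = homogeneity_completeness([0, 0, 1, 1], [0, 1, 2, 2])
print(h, c)
```

The toy case mirrors the situation above: pure clusters give a homogeneity of 1, while splitting one class across two clusters pulls completeness down.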