# Advanced k-means

## Let us prepare the 'normal' data

To get started with clustering, consider data drawn from Gaussian ('normal') distributions with low variance. We can synthesize such a dataset with scikit-learn's `make_blobs` function, which needs the centers of the Gaussian blobs, their standard deviation, and the number of samples. Here we work in two dimensions with 2000 samples. The density of a univariate normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$$P(x) = \frac{1}{{\sigma \sqrt {2\pi } }}e^{{\frac{ - \left( {x - \mu } \right)^2 }{2\sigma^2 } }}$$

We can draw samples from such distributions using the `make_blobs` function.
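As a quick numerical check of the density formula, the sketch below evaluates it with NumPy and confirms that it integrates to roughly 1. The helper name `gaussian_pdf` and the values `mu = 1.0`, `sigma = 0.3` are illustrative assumptions, chosen to resemble one coordinate of a low-variance blob.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a univariate normal distribution (hypothetical helper)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Assumed illustrative parameters, not taken from the dataset itself.
mu, sigma = 1.0, 0.3

# Evaluate the density on a fine grid spanning 8 standard deviations each side.
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
dx = x[1] - x[0]

# A simple Riemann sum approximates the integral of the density.
area = gaussian_pdf(x, mu, sigma).sum() * dx

# The density peaks at x = mu with height 1 / (sigma * sqrt(2 * pi)).
peak = gaussian_pdf(mu, mu, sigma)

print("area under the curve:", area)
print("peak height:", peak)
```

A smaller `sigma` gives a taller, narrower peak, which is why low-variance blobs appear as tight, well-separated clouds.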


## k-means 

* The basic algorithm, and a good first test of whether the clusters are linearly separable.
* Given $d$ observations, $\{\mathbf{x}_1,\dots,\mathbf{x}_d\}$, k-means assigns each observation to one of $k$ clusters, $\mathbf{C} = \{C_1,\dots,C_k\}$, so as to minimize the within-cluster sum of squares:

$$\underset{\mathbf{C}}{\textrm{argmin}} \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \left|\left| \mathbf{x} - \boldsymbol{\mu}_i \right|\right|^2$$

where $\boldsymbol{\mu}_i$ is the mean of the observations in $C_i$.
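To make the objective concrete, the sketch below evaluates the within-cluster sum of squares by hand for a fitted model and compares it with scikit-learn's `inertia_` attribute. The names `X_demo`, `kmeans_demo`, and `manual_wcss` are illustrative, and `n_init=10` and `random_state=0` are assumed settings chosen only for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small illustrative dataset with three well-separated blobs.
X_demo, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                       cluster_std=0.3, random_state=0)

kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each observation.
assigned_centroids = kmeans_demo.cluster_centers_[kmeans_demo.labels_]

# Sum of squared Euclidean distances from each point to its centroid.
manual_wcss = ((X_demo - assigned_centroids) ** 2).sum()

print("manual objective:", manual_wcss)
print("sklearn inertia_:", kmeans_demo.inertia_)
```

The two numbers should agree up to floating-point error, since `inertia_` is documented as the sum of squared distances of samples to their closest cluster center.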

Let us create 5 blobs:

```python
from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1], [-2, 2], [0, 2]]
X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=0.3,
                  random_state=0)
```
The working of the algorithm is explained in the figure below:

<img src="../images/k-means.png" style="width: 700px;">

<br/>
## Exercise:
- Visualize the blobs in the `X` dataset with a seaborn pair plot, `sns.pairplot()`, and assign the result to the variable `g`.

In [1]:
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from sklearn import mixture 
from sklearn.mixture import GaussianMixture

import numpy as np
import seaborn as sns
import pandas as pd

plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Create clusters using make_blobs
centers = [[1, 1], [-1, -1], [1, -1],[-2,2],[0,2]]
X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=0.3,
                            random_state=0)

# Transform the data into a dataframe
blob_df = pd.DataFrame({'X_0':X[:,0], 'X_1':X[:,1], 'y':y})

# Visualize the pair plot and assign it to the variable g


<p>Hint: use <code>g = sns.pairplot(x_vars=&lt;1st dimension column&gt;, y_vars=&lt;2nd dimension column&gt;, hue="y", data=blob_df)</code></p>

In [None]:
g = sns.pairplot(x_vars="X_0", y_vars="X_1", hue="y", data = blob_df)
g.fig.set_size_inches(14, 6)

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    g_ = sns.pairplot(x_vars="X_0", y_vars="X_1", hue="y", data = blob_df)
    
    import numpy as np
    
    if np.all(g.data.y == g_.data.y):
      ref_assert_var = True
      out = g
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var


<br/><br/><br/>
## ML Engineer's favorite - The k-means

We shall now apply k-means to this dataset and observe its performance.

k-means is a basic unsupervised learning algorithm. It offers a simple way to partition a given dataset into a number of clusters, k, that is fixed a priori.

The k-means algorithm is as follows:

 1. Select the desired number of clusters k
 2. Select k initial observations as seeds
 3. Calculate the average cluster values (cluster centroids) over each variable (for the initial iteration, these are simply the initial seed observations)
 4. Assign each of the other training observations to the cluster with the nearest centroid
 5. Recalculate the cluster centroids (averages) based on the assignments from step 4
 6. Iterate between steps 4 and 5, and stop once the error has been reduced below a threshold.
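The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names `assign` and `lloyd_kmeans`, the random choice of seeds, and the stopping rule (centroids moving less than an assumed tolerance `tol`) are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs

def assign(X, centroids):
    """Return the nearest-centroid label and squared distance for each row of X."""
    sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq.argmin(axis=1), sq.min(axis=1)

def lloyd_kmeans(X, k, max_iter=100, tol=1e-8, seed=0):
    """Hypothetical helper implementing the iteration described above."""
    rng = np.random.default_rng(seed)
    # Steps 1-3: pick k distinct observations as seeds; they are the first centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    history = []  # objective value after each assignment step
    for _ in range(max_iter):
        # Step 4: assign every observation to the cluster with the nearest centroid.
        labels, sq_dist = assign(X, centroids)
        history.append(float(sq_dist.sum()))
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                new_centroids[j] = members.mean(axis=0)
        # Step 6 (assumed stopping rule): stop when the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment so the labels are consistent with the returned centroids.
    labels, _ = assign(X, centroids)
    return centroids, labels, history

X_demo, _ = make_blobs(n_samples=2000,
                       centers=[[1, 1], [-1, -1], [1, -1], [-2, 2], [0, 2]],
                       cluster_std=0.3, random_state=0)
centroids, labels_demo, history = lloyd_kmeans(X_demo, k=5)
print(centroids.round(2))
print("objective per iteration:", np.round(history, 1))
```

The objective never increases from one iteration to the next, but the result depends on the seeds: an unlucky draw can converge to a poor local minimum, which is one reason library implementations typically run several initializations.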

<br/>
## Exercise:

 * Add the k-means labels to `blob_df` as a column named `k_means_cluster` and generate a pair plot colored by that column.

In [2]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

centroid = kmeans.cluster_centers_
labels = kmeans.labels_

#Visualize the k-means clusters


In [None]:
blob_df['k_means_cluster'] = labels
g = sns.pairplot(x_vars="X_0", y_vars="X_1", hue="k_means_cluster", data = blob_df)
g.fig.set_size_inches(14, 6)
sns.despine()
plt.show()

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    blob_df_ = pd.DataFrame({'X_0':X[:,0], 'X_1':X[:,1], 'y':labels})
    
    import numpy as np
    
    if np.all(blob_df['k_means_cluster'] == labels):
      ref_assert_var = True
      out = g
    else:
      ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var


<br/><br/><br/>
## Silhouette Scores

The Silhouette Coefficient is a metric that can be used to estimate the optimum number of clusters. For each sample, it is computed from the mean intra-cluster distance and the mean nearest-cluster distance. The higher the score, the better the clustering. Typically, the silhouette score rises with the number of clusters, peaks at the optimum, and then falls. Its values lie between -1.0 and 1.0.
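For a single sample, the standard definition is $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other members of its own cluster and $b$ is the mean distance to the members of the nearest other cluster. The sketch below computes this by hand for one sample and compares it with scikit-learn's `silhouette_samples`. The dataset, the variable names, and the choice of sample `i = 0` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Small illustrative dataset with three well-separated blobs.
X_demo, _ = make_blobs(n_samples=300, centers=[[1, 1], [-1, -1], [1, -1]],
                       cluster_std=0.3, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

i = 0  # index of the sample to inspect (arbitrary choice)
dists = np.linalg.norm(X_demo - X_demo[i], axis=1)
own = labels_demo == labels_demo[i]

# a: mean distance to the other members of the same cluster.
# The distance from the sample to itself is 0, so only the divisor changes.
a = dists[own].sum() / (own.sum() - 1)

# b: smallest mean distance to the members of any other cluster.
b = min(dists[labels_demo == c].mean()
        for c in np.unique(labels_demo) if c != labels_demo[i])

s_manual = (b - a) / max(a, b)
s_sklearn = silhouette_samples(X_demo, labels_demo)[i]
print("manual:", s_manual, "sklearn:", s_sklearn)
```

`silhouette_score` is the mean of these per-sample values over the whole dataset.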

```python
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_clusters = []
sil_coeffecients = []

# Fit k-means for each candidate k and record its silhouette score
for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label)
    print("For n_clusters={}, Silhouette Coefficient = {}".format(n_cluster, sil_coeff))
    sil_coeffecients.append(sil_coeff)
    k_clusters.append(n_cluster)

plt.plot(k_clusters, sil_coeffecients)
plt.ylabel('Silhouette Coefficient')
plt.xlabel('No. of Clusters')
plt.show()
```

<img src='https://s3.amazonaws.com/rfjh/media/silhoutte_scores.png' style='float: left;'/>


<br style="clear: both;"/>
## Exercise
* From the above plot retrieve the optimum number of clusters and assign it to the variable k_best. 
* Print it out.


In [None]:
from sklearn.metrics import silhouette_score

k_clusters = []
sil_coeffecients = []

for n_cluster in range(2,11):
    kmeans = KMeans(n_clusters = n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label)
    print("For n_clusters={}, Silhouette Coefficient = {}".format(n_cluster, sil_coeff))
    sil_coeffecients.append(sil_coeff)
    k_clusters.append(n_cluster)
    
plt.plot(k_clusters, sil_coeffecients)
plt.ylabel('Silhouette Coefficient')
plt.xlabel('No. of Clusters')
plt.show()



In [None]:
sil_best = max(sil_coeffecients)
k_best_index = sil_coeffecients.index(sil_best)
k_best = k_clusters[k_best_index]
print("Optimum Number of Clusters:", k_best)

In [None]:
ref_tmp_var = False


try:
    ref_assert_var = False
    
    
    ref_assert_var = False
    if k_best == 5:
        ref_assert_var = True
    else:
        ref_assert_var = False
    
except Exception:
    print('Please follow the instructions given and use the same variables provided in the instructions.')
else:
    if ref_assert_var:
        ref_tmp_var = True
    else:
        print('Please follow the instructions given and use the same variables provided in the instructions.')


assert ref_tmp_var