In this notebook, we will learn more about **clustering algorithms**, focusing on the **K-Means algorithm**. We will cover:
- Training the K-Means algorithm
- Evaluating the quality of the generated clusters
- Using the **Elbow Method** and the **Silhouette Method** to select the best number of clusters

# K-Means Clustering
For this task, we will start with a dummy dataset that is easy to visualize and manipulate. <br>
Let's start by importing the needed libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline

In the first example, we will generate a dummy dataset of 2000 instances that fall neatly into K clusters.

In [None]:
n_samples = 2000
random_state = 48
X, y = make_blobs(n_samples=n_samples, random_state=random_state)
print("number of features:",X.______)
print("number of instances:",X._____)

Let's go ahead and visualize the data:

In [None]:
plt.scatter(X[:,0],X[:,1])
plt.title("Visualizing the Dummy Dataset")
plt.show()

It is very clear that the data can be grouped into 3 clusters in this case. <br> Unfortunately, not all datasets can be visualized this easily in 2 dimensions. Later on, we will introduce an alternative way to handle datasets with more than 2 features.
<br>
Next, let's try to use K-Means to detect these clusters: 

In [None]:
kmeans = KMeans(n_clusters=___, random_state=random_state)
y_pred=kmeans._________(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.title("Clustered Dataset")
plt.show()

Neat! It seems that K-means was easily able to find the 3 clusters! Let's evaluate the performance of K-means:

In [None]:
print("within-cluster sum-of-squares:",kmeans.__________)
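To make this number less opaque, the within-cluster sum of squares can be recomputed by hand: it is the sum, over all points, of the squared Euclidean distance to the centroid of the assigned cluster. The sketch below is only an illustration on a freshly generated copy of the blobs. The names `X_demo`, `km_demo`, and `manual_wcss` are introduced here and are not part of the exercise, and `n_init=10` is an assumed setting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative copy of the dummy dataset (same generator settings as above).
X_demo, _ = make_blobs(n_samples=2000, random_state=48)

# Assumed settings for this sketch: 3 clusters, 10 restarts.
km_demo = KMeans(n_clusters=3, n_init=10, random_state=48).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
manual_wcss = np.sum((X_demo - assigned_centroids) ** 2)
print("manual within-cluster sum-of-squares:", manual_wcss)
```

The result should agree, up to floating-point error, with the value reported by the fitted estimator.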

As you might have noticed, inertia (the within-cluster sum of squares) is not a very meaningful metric on its own. Its value depends on the scale of the data and on the number of points, so it does not tell us how good the clusters are. However, it is really helpful when we need to search for **the optimal number of clusters**, as we will see soon.<br>
Next, let's try the **silhouette score**, which measures how similar a point is to its own cluster compared to other clusters.

In [None]:
from sklearn.metrics import silhouette_score
silhouette_score(_,___, metric = _______)

That is really good! The silhouette score ranges from -1 to 1, with 1 being the best, so this shows that points inside each cluster are similar to each other while being dissimilar to points in other clusters.
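For readers who want to see where the score comes from, here is a rough sketch that computes it by hand. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to the members of any other cluster, and the point's silhouette is `(b - a) / max(a, b)`. The sketch uses a smaller copy of the blobs (300 points, an arbitrary choice to keep the distance matrix small) and assumes every cluster has at least two members. All variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Smaller illustrative dataset so the full distance matrix stays small.
X_demo, _ = make_blobs(n_samples=300, random_state=48)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=48).fit(X_demo).labels_

D = pairwise_distances(X_demo)  # Euclidean by default
clusters = np.unique(labels_demo)
scores = np.empty(len(X_demo))

for i in range(len(X_demo)):
    own = labels_demo == labels_demo[i]
    # a: mean distance to the other members of the point's own cluster.
    # D[i, i] is 0, so we only need to exclude the point from the count.
    # Assumption: no singleton clusters (otherwise this divides by zero).
    a = D[i, own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster.
    b = min(D[i, labels_demo == c].mean() for c in clusters if c != labels_demo[i])
    scores[i] = (b - a) / max(a, b)

manual_silhouette = scores.mean()
library_silhouette = silhouette_score(X_demo, labels_demo, metric="euclidean")
print(manual_silhouette, library_silhouette)
```

The two printed values should match up to floating-point error.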

Next, let's investigate a real dataset and try to get some insights about the structure of the data. We will be using the Iris dataset, which is usually used as a classification dataset for flower species based on their sepal and petal measurements.

In [None]:
from sklearn import datasets
iris=datasets.load_iris()
print(iris.DESCR)

In [None]:
import pandas as pd
iris_df=pd.DataFrame(iris.data,columns=iris.feature_names)
iris_df["species"]=iris.target
iris_df.describe()

In [None]:
print("Target classes:",np.unique(iris.target))

As you can see, we now have 4 features, which makes visualizing the data a bit harder. There are many approaches we could use, such as **dimensionality reduction** algorithms or **Andrews plots**. <br>
However, in this case we will use a different approach to tackle this clustering problem. We will be using the **Elbow Method**:


In [None]:
ine = []
kmax = 10

for k in range(2, kmax+1):
    # Fix the seed so the curve is reproducible between runs
    kmeans = KMeans(n_clusters=k, random_state=random_state).___(iris.data)
    ine.append(kmeans.________)

In [None]:
plt.plot(np.arange(2, kmax + 1), ine, marker="o")
plt.show()
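Reading the elbow off a plot is somewhat subjective. As a rough cross-check, one possible heuristic (an assumption of this sketch, not a standard part of the Elbow Method) is to look for the value of k where the curve bends most sharply, using a discrete second difference. The helper `within_cluster_ss` and all other names below are hypothetical. The first and last k values cannot be selected, because the second difference needs a neighbor on each side.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X_iris = datasets.load_iris().data
k_values = np.arange(2, 11)


def within_cluster_ss(X, model):
    """Sum of squared distances from each point to its assigned centroid."""
    return np.sum((X - model.cluster_centers_[model.labels_]) ** 2)


# One within-cluster sum of squares per candidate k (assumed: 10 restarts, fixed seed).
wcss = np.array([
    within_cluster_ss(X_iris, KMeans(n_clusters=k, n_init=10, random_state=48).fit(X_iris))
    for k in k_values
])

# Discrete second difference: large where the curve bends most sharply.
curvature = wcss[:-2] - 2 * wcss[1:-1] + wcss[2:]
elbow_k = k_values[1:-1][np.argmax(curvature)]
print("elbow candidate:", elbow_k)
```

This is only a sanity check on the visual reading. On curves without a pronounced bend, the heuristic can be unstable.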

Using the Elbow Method, the best K value seems to be either 3 or 4, which is close to the number of target classes already defined (3). <br>
Next, let's try to use the **Silhouette Method**: 

In [None]:
sil = []
kmax = 10

for k in range(2, kmax+1):
    # Fix the seed so the scores are reproducible between runs
    kmeans = KMeans(n_clusters=k, random_state=random_state).fit(iris.data)
    labels = kmeans.labels_
    sil.append(silhouette_score(iris.data, labels, metric='euclidean'))

In [None]:
plt.plot(np.arange(2, kmax + 1), sil, marker="o")
plt.show()
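The peak of the silhouette curve can also be read off programmatically rather than by eye. The following is a small self-contained sketch. The dictionary `sil_by_k` and the other names are illustrative, and `n_init=10` with a fixed seed is an assumed setting.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_iris = datasets.load_iris().data

# Map each candidate k to the mean silhouette score of its clustering.
sil_by_k = {}
for k in range(2, 11):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=48).fit(X_iris).labels_
    sil_by_k[k] = silhouette_score(X_iris, labels_k, metric="euclidean")

# The k whose clustering has the highest mean silhouette score.
best_k = max(sil_by_k, key=sil_by_k.get)
print("k with highest silhouette score:", best_k)
```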

That is very intriguing! We already know that there are three classes in this dataset, so why does the silhouette score favor two clusters? <br>
To understand what is happening here, we need to go back and investigate the data structure:

In [None]:
# Sepal measurements, colored by species
scatter=plt.scatter(iris_df["sepal length (cm)"],iris_df["sepal width (cm)"],c=iris_df["species"])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend(handles=scatter.legend_elements()[0],labels=("Iris-Setosa","Iris-Versicolour","Iris-Virginica"))
plt.show()

# Petal measurements, colored by species
scatter=plt.scatter(iris_df["petal length (cm)"],iris_df["petal width (cm)"],c=iris_df["species"])
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.legend(handles=scatter.legend_elements()[0],labels=("Iris-Setosa","Iris-Versicolour","Iris-Virginica"))
plt.show()

We notice two important points:
- Visualizing the data shows that the dataset, at least in parts, does not form compact, well-separated clusters: Iris-Setosa stands apart, while Iris-Versicolour and Iris-Virginica overlap.
- The dataset has 4 features. K-Means relies on the Euclidean distance, which becomes less informative as the number of dimensions grows. With only 4 features this effect is likely modest, but it can still make it harder for the clustering process to reach the best solution.
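One way to probe the first observation numerically is to cross-tabulate a two-cluster K-Means solution against the known species. This is only a sketch: the names `iris_demo`, `labels_2`, and `overlap_table` are introduced here for illustration, and the species labels are used purely for inspection, not for fitting.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris_demo = datasets.load_iris()

# Two-cluster solution, matching the k favored by the silhouette curve.
labels_2 = KMeans(n_clusters=2, n_init=10, random_state=48).fit(iris_demo.data).labels_

# Rows: true species; columns: cluster assigned by K-Means.
species = pd.Series(iris_demo.target_names[iris_demo.target], name="species")
clusters = pd.Series(labels_2, name="cluster")
overlap_table = pd.crosstab(species, clusters)
print(overlap_table)
```

If one species occupies a column almost by itself while the other two share the remaining column, that is consistent with the silhouette score preferring two clusters over three.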

## Further Steps
Now that you know how to use K-Means, you can try the following:
- Use different approaches to clustering, such as Hierarchical Clustering
- Evaluate the cluster quality with other metrics
- Experiment with different datasets
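
As a possible starting point for the first two suggestions, the sketch below runs agglomerative (hierarchical) clustering on Iris and scores it with two other metrics available in scikit-learn. The choices of Ward linkage and 3 clusters are assumptions made for illustration, as are the variable names.

```python
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score

iris_demo = datasets.load_iris()

# Hierarchical clustering with Ward linkage (assumed settings for this sketch).
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(iris_demo.data)
agg_labels = agg.labels_

# Adjusted Rand index: agreement with the known species (needs ground truth).
ari = adjusted_rand_score(iris_demo.target, agg_labels)
# Davies-Bouldin index: needs no ground truth; lower values mean better separation.
dbi = davies_bouldin_score(iris_demo.data, agg_labels)

print("adjusted Rand index:", ari)
print("Davies-Bouldin index:", dbi)
```

The adjusted Rand index is only usable here because Iris ships with labels. On a truly unlabeled dataset, internal metrics such as the Davies-Bouldin index or the silhouette score are the available options.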