
In this notebook we create a synthetic dataset and cluster it with the KMeans algorithm.<br>
We then compare two ways of choosing the number of clusters and check which one better recovers the number of clusters we generated:<br>
- the elbow method
- the silhouette score
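
For reference, the two quantities behind these methods can be written as below. These are the standard textbook definitions, not something specific to this notebook. The symbols $\mu_j$ (center of cluster $j$), $a(i)$ (mean distance from point $i$ to the other points in its own cluster) and $b(i)$ (mean distance from point $i$ to the points of the nearest other cluster) are introduced here for illustration.

```latex
% Distortion (what KMeans reports as inertia_) for k clusters:
\mathrm{distortion}(k) = \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert^2

% Silhouette coefficient of a single point i, and the score averaged over all n points:
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\mathrm{silhouette} = \frac{1}{n} \sum_{i=1}^{n} s(i)
```

The distortion always shrinks as $k$ grows, so the elbow method looks for the point where the decrease levels off. The silhouette score lies between -1 and 1, and higher values indicate tighter, better separated clusters.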

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs, make_moons

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Select the number of clusters and the number of samples for our synthetic dataset

In [None]:
num_clusters = 5
num_samples = 1500

**Create and plot the dataset**

In [None]:
# create dataset: num_samples points with 6 features, grouped around num_clusters centers
X, y = make_blobs(
   n_samples=num_samples, n_features=6,
   centers=num_clusters, cluster_std=0.5,
   shuffle=True, random_state=0
)

In [None]:
# optional: uncomment to try a non-convex two-moons dataset instead of the blobs
#X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

In [None]:
# plot the first two features only (the blobs dataset has six)
plt.scatter(
   X[:, 0], X[:, 1],
   c='blue', marker='o',
   edgecolor='black', s=50
)
plt.show()

**Use the Elbow method to determine the correct number of clusters**

In [None]:
# calculate the distortion (inertia) for a range of cluster counts
max_of_clusters = 11  # exclusive upper bound: k runs from 1 to 10

distortions = []
for i in range(1, max_of_clusters):
    km = KMeans(
        n_clusters=i, init='random',
        n_init=10, max_iter=300,
        tol=1e-04, random_state=0
    )
    km.fit(X)
    distortions.append(km.inertia_)

# plot distortion against the number of clusters; the elbow is where the curve flattens
plt.plot(range(1, max_of_clusters), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
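
Reading the elbow off the plot is a judgment call. As a rough cross-check, the sketch below picks the k at which the curve bends most sharply, using the largest second difference of the distortion values. This is only one simple heuristic, not part of scikit-learn and not the method used above. The function name `find_elbow` and the example values are hypothetical, made up for illustration.

```python
def find_elbow(ks, distortions):
    """Return the k at which the distortion curve bends most sharply.

    Heuristic: the second difference d[i-1] - 2*d[i] + d[i+1] is large
    where a steep drop is followed by a nearly flat stretch.
    """
    if len(ks) != len(distortions) or len(ks) < 3:
        raise ValueError("need at least three (k, distortion) pairs")
    best_k, best_bend = None, float("-inf")
    # the first and last k have no neighbor on one side, so skip them
    for i in range(1, len(ks) - 1):
        bend = distortions[i - 1] - 2 * distortions[i] + distortions[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical distortion values (NOT results from this notebook)
example_ks = list(range(1, 9))
example_distortions = [5000.0, 4000.0, 3000.0, 2000.0, 400.0, 370.0, 345.0, 325.0]
elbow_k = find_elbow(example_ks, example_distortions)
print("Estimated elbow at k =", elbow_k)
```

With the notebook's own values, the call would be `find_elbow(list(range(1, max_of_clusters)), distortions)`. The heuristic can be misled when the first few drops are much larger than the later ones, so treat its answer as a hint and compare it with the plot.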

**Use the Silhouette score to determine the correct number of clusters**

In [None]:
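# compute the silhouette score for each candidate k (the score needs at least 2 clusters)
score_list = []
for i in range(2, max_of_clusters):
    kmeans = KMeans(n_clusters=i, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    score_list.append(score)
    print("Num of clusters=", i, ":", score)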
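
To make it clearer what the printed numbers mean, here is a minimal pure-Python sketch of the mean silhouette coefficient with Euclidean distance, applied to a tiny toy dataset. It follows the standard definition and is meant for illustration only; for real data `silhouette_score` is the right tool. The function name `silhouette_by_hand` and the four toy points are hypothetical.

```python
import math


def silhouette_by_hand(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # group point indices by cluster label
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a point alone in its cluster contributes 0
        a = mean_dist(i, own)  # cohesion: distance to its own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two tight pairs of points that lie far apart
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_by_hand(toy_points, [0, 0, 1, 1])  # labels match the pairs
bad_score = silhouette_by_hand(toy_points, [0, 1, 0, 1])   # labels split each pair
print("good labelling:", good_score)
print("bad labelling:", bad_score)
```

The labelling that matches the two pairs scores close to 1, while the labelling that splits each pair scores below 0. In the loop above, the k with the highest score is the silhouette method's recommendation.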