# Clustering Challenge

Clustering is an *unsupervised* machine learning technique in which you train a model to group similar entities into clusters based on their features.

In this exercise, you must separate a dataset consisting of three numeric features (**A**, **B**, and **C**) into clusters. Run the cell below to load the data.

In [1]:
import pandas as pd

data = pd.read_csv('data/clusters.csv')
data.head()

Unnamed: 0,A,B,C
0,-0.087492,0.398,0.014275
1,-1.071705,-0.546473,0.072424
2,2.747075,2.012649,3.083964
3,3.217913,2.213772,4.260312
4,-0.607273,0.793914,-0.516091


Your challenge is to identify the number of discrete clusters present in the data, and create a clustering model that separates the data into that number of clusters. You should also visualize the clusters to evaluate the level of separation achieved by your model.

Add markdown and code cells as required to create your solution.

> **Note**: There is no single "correct" solution. A sample solution is provided in [04 - Clustering Solution.ipynb](04%20-%20Clustering%20Solution.ipynb).

### Identifying the number of clusters

In [2]:
# Your code to create a clustering solution
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import plotly.express as px
from IPython.display import display

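# Normalize the three numeric features so they're on the same scale.
# Select A, B, and C explicitly so the leftover index column ("Unnamed: 0") is not treated as a feature.
scaled_features = MinMaxScaler().fit_transform(data[['A', 'B', 'C']])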

# Fit PCA, keeping all principal components
pca = PCA().fit(scaled_features)
components = pca.transform(scaled_features)

labels = {
    str(i): f"PC {i+1} ({var:.1%})"
    for i, var in enumerate(pca.explained_variance_ratio_)
}

fig = px.scatter_matrix(components, labels=labels, dimensions=range(3))
fig.update_traces(diagonal_visible=False)
display(fig)
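
The axis labels above show how much variance each principal component explains. A minimal sketch of reading the same information as a running total follows. It uses synthetic blobs from `make_blobs` as a hypothetical stand-in for the **A**, **B**, and **C** columns, so the printed percentages are not those of `data/clusters.csv`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Assumption: synthetic three-feature data standing in for the real dataset
X, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

# Keep every component so the ratios account for all of the variance
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for i, total in enumerate(cumulative, start=1):
    print(f"First {i} component(s): {total:.1%} of the variance")
```

If the first two components already cover most of the variance, a two-dimensional scatter plot of them is likely a fair summary of the three-dimensional data.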



In [3]:
# importing the libraries
import numpy as np
from sklearn.cluster import KMeans

# Create 10 models with 1 to 10 clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, n_init="auto")
    # Fit the data points
    kmeans.fit(scaled_features)
    # Get the WCSS (inertia) value
    wcss.append(kmeans.inertia_)

# Plot the WCSS values onto a line graph; the "elbow" suggests a suitable number of clusters
fig = px.line(
    x=list(range(1, 11)),
    y=wcss,
    markers=True,
    labels={'x': 'Number of clusters', 'y': 'WCSS'},
    title='WCSS by Clusters',
)
display(fig)
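
The elbow in a WCSS curve can be hard to read by eye, so a second opinion may help. One common option, which is not part of the original notebook, is the mean silhouette score: it is computed for each candidate number of clusters and the highest value is preferred. The sketch below assumes synthetic data with four well-separated blobs (the `centers` list is hypothetical) in place of the scaled features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Assumption: four well-separated synthetic blobs stand in for the real features
centers = [[0, 0, 0], [5, 5, 5], [0, 5, 0], [5, 0, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

# The silhouette score needs at least two clusters, so start at k=2
silhouette_by_k = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    silhouette_by_k[k] = silhouette_score(X_scaled, labels)

# Pick the k with the highest mean silhouette score
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(f"Best k by silhouette score: {best_k}")
```

On the real data, the same loop would run on `scaled_features`; whether it agrees with the elbow plot is something to check rather than assume.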


### Clustering
#### KMeans

In [4]:
from sklearn.cluster import KMeans

# Create a model based on 4 centroids
model = KMeans(n_clusters=4, n_init=100)
# Fit to the data and predict the cluster assignments for each data point
km_clusters = model.fit_predict(scaled_features)
# View the cluster assignments
px.scatter_3d(data, x='A', y='B', z='C', color=km_clusters)
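
Alongside the plot, it can help to look at the clusters numerically. The following sketch counts the points per cluster and maps the K-Means centroids back to the original feature units with `MinMaxScaler.inverse_transform`. It again assumes synthetic blobs (hypothetical `centers`) instead of the challenge data, and it keeps a reference to the fitted scaler, which the notebook cell above does not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Assumption: synthetic data standing in for the A, B, C columns
centers = [[0, 0, 0], [5, 5, 5], [0, 5, 0], [5, 0, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

# Keep the fitted scaler so centroids can be converted back later
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Number of points assigned to each cluster label
cluster_sizes = np.bincount(labels)
# Centroids expressed in the original (unscaled) feature units
centroids_original = scaler.inverse_transform(model.cluster_centers_)

print(cluster_sizes)
print(centroids_original.round(2))
```

Very uneven cluster sizes, or centroids that sit close together, can be a hint that the chosen number of clusters deserves a second look.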

In [5]:
fig = px.scatter_matrix(components, labels=labels, color=km_clusters, dimensions=range(3))
fig.update_traces(diagonal_visible=False)
display(fig)





#### Agglomerative clustering

In [6]:
from sklearn.cluster import AgglomerativeClustering

# Create an agglomerative (hierarchical) model that produces 4 clusters
model = AgglomerativeClustering(n_clusters=4)
# Fit to the data and predict the cluster assignments for each data point
agg_clusters = model.fit_predict(scaled_features)
# View the cluster assignments
px.scatter_3d(data, x='A', y='B', z='C', color=agg_clusters)

In [7]:
fig = px.scatter_matrix(components, labels=labels, color=agg_clusters, dimensions=range(3))
fig.update_traces(diagonal_visible=False)
display(fig)
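
To judge how closely two clustering methods agree, the label arrays can be compared directly. Cluster numbers are arbitrary, so a plain equality check is misleading. A permutation-invariant measure such as the adjusted Rand index is one option, which is an addition here rather than part of the original solution. The sketch below assumes synthetic blobs (hypothetical `centers`) rather than the challenge data.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import MinMaxScaler

# Assumption: synthetic data standing in for the scaled features
centers = [[0, 0, 0], [5, 5, 5], [0, 5, 0], [5, 0, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
agg_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_scaled)

# 1.0 means identical groupings (up to renumbering); values near 0 mean chance-level agreement
ari = adjusted_rand_score(km_labels, agg_labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score close to 1.0 suggests that both methods find essentially the same groups; a lower score is a prompt to inspect where the assignments differ.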



