<a href="https://colab.research.google.com/github/d-tomas/transform4europe/blob/main/notebooks/unsupervised_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Learning

In this *notebook* we will use the k-means algorithm for clustering.

## Initial setup

In [None]:
# Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans  # K-means algorithm
from sklearn.metrics import silhouette_score  # Silhouette coefficient to assess cluster quality
from sklearn.preprocessing import MinMaxScaler  # Rescale each feature to the [0, 1] range

# Download the dataset to apply the clustering algorithm
!wget https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/iris.csv

## Dataset

The iris dataset is perhaps the best-known dataset in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Predicted attribute: class of iris plant.

Attribute Information:

* `sepal_length`: sepal length in cm
* `sepal_width`: sepal width in cm
* `petal_length`: petal length in cm
* `petal_width`: petal width in cm
* `species`: class of iris plant (Iris setosa, Iris versicolor or Iris virginica)

**Note**: we are using a labelled dataset so that the quality of the clusters is easy to check, but clustering is an unsupervised task and the labels are not used to build the clusters.

In [None]:
# Loading data from file into a Pandas DataFrame

data = pd.read_csv('iris.csv')
data

In [None]:
# Show information about the DataFrame

data.info()

In [None]:
# Frequency distribution of species

data['species'].value_counts()

In [None]:
# Distribution of each class for each feature

sns.kdeplot(data=data, x='sepal_length', hue='species')
plt.show()
sns.kdeplot(data=data, x='sepal_width', hue='species')
plt.show()
sns.kdeplot(data=data, x='petal_length', hue='species')
plt.show()
sns.kdeplot(data=data, x='petal_width', hue='species')
plt.show()

## K-means clustering

In [None]:
# Compute the WCSS for 1 to 10 clusters to find the optimum number of clusters for k-means clustering

wcss = []
X = data.drop('species', axis=1).values  # Keep the values for all the features but the class

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ -> Sum of squared distances of samples to their closest cluster center
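
To make the meaning of `inertia_` concrete, the sketch below recomputes the within-cluster sum of squares by hand with NumPy and compares it with the value reported by scikit-learn. The toy array `points` is made up for illustration and is not part of the iris data; `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Centroid assigned to each sample (one row per sample)
assigned_centers = model.cluster_centers_[model.labels_]

# WCSS: sum of squared Euclidean distances from each sample to its centroid
manual_wcss = np.sum((points - assigned_centers) ** 2)

print(manual_wcss, model.inertia_)  # The two values should agree
```

The same computation applies to the iris features in `X`: each entry of `wcss` is this sum for a given number of clusters.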

In [None]:
# Using the elbow method to determine the optimal number of clusters for k-means clustering
# The optimum number of clusters is where the elbow occurs
# This is the point after which the within-cluster sum of squares (WCSS) no longer decreases significantly as more clusters are added

sns.lineplot(x=range(1, 11), y=wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # Within cluster sum of squares
plt.show()
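
The elbow is read off the plot by eye, so a second criterion can help. The notebook imports `silhouette_score` without using it; the sketch below shows one way it could complement the elbow method. It assumes scikit-learn's bundled copy of iris (`load_iris`) as a stand-in for `iris.csv` so that the block runs on its own, and the names `X_demo`, `silhouette` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_demo = load_iris().data

silhouette = {}
for k in range(2, 11):  # The silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    silhouette[k] = silhouette_score(X_demo, labels)  # Ranges from -1 to 1, higher is better

# Number of clusters with the highest average silhouette score
best_k = max(silhouette, key=silhouette.get)
print(best_k, round(silhouette[best_k], 3))
```

The silhouette score and the elbow method do not always agree on the number of clusters, so it is reasonable to treat both as guidance rather than as a definitive answer.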

In [None]:
# Applying k-means clustering
# The elbow method suggests that 3 is a good number of clusters

kmeans = KMeans(n_clusters=3, random_state=0)
y = kmeans.fit_predict(X)  # Assign a cluster to each sample
y  # Show the labels assigned to each sample
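
Because the dataset is labelled, the cluster assignments can be checked against the known species. The sketch below builds a contingency table with `pd.crosstab`. It assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for `iris.csv`, so the species names may be spelled differently from those in the CSV file; `species`, `clusters` and `table` are illustrative names.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
iris = load_iris()
X_demo = iris.data
species = iris.target_names[iris.target]  # Species name of each sample

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Rows: true species; columns: cluster assigned by k-means
table = pd.crosstab(pd.Series(species, name='species'),
                    pd.Series(clusters, name='cluster'))
print(table)
```

The cluster numbers are arbitrary, so a good clustering shows up as one dominant cell per row rather than as matching indices. With the notebook's own variables, the equivalent call would be `pd.crosstab(data['species'], y)`.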

In [None]:
# Visualisation of the clusters using petal length (column 2 of X) and petal width (column 3 of X)

sns.scatterplot(x=X[y == 0, 2], y=X[y == 0, 3], label='Cluster 0')
sns.scatterplot(x=X[y == 1, 2], y=X[y == 1, 3], label='Cluster 1')
sns.scatterplot(x=X[y == 2, 2], y=X[y == 2, 3], label='Cluster 2')

# Plotting the centroids of the clusters
sns.scatterplot(x=kmeans.cluster_centers_[:, 2], y=kmeans.cluster_centers_[:, 3], s=100, label='Centroids')
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.legend(bbox_to_anchor=(1.01, 1.01), loc=2)
plt.show()
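
K-means relies on Euclidean distances, so features with a wider range weigh more in the result. The notebook imports `MinMaxScaler` without using it; the sketch below shows how the features could be rescaled to the [0, 1] range before clustering. It assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for `iris.csv`, and `X_scaled` and `labels_scaled` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_demo = load_iris().data

# Rescale every feature independently to the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(X_demo)

labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # Per-feature minimum and maximum after scaling
print(np.bincount(labels_scaled))                  # Number of samples in each cluster
```

Whether scaling helps depends on the data: all four iris features are measured in centimetres, so the effect here may be small.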

## References

* [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris)