# 2D Gaussian clusters

## Exercise 1.1

To load the dataset, the `numpy.loadtxt` function can be used. This time, the dataset file contains a header in the first line. We will skip it using the function itself.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
X_gauss = ##
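
A minimal sketch of skipping the header with `numpy.loadtxt`. The comma delimiter and the in-memory sample data are assumptions made for illustration; the real file path and delimiter depend on the dataset file you are working with.

```python
import io
import numpy as np

# Hypothetical stand-in for the dataset file: a header line, then one "x,y" row per point
sample_file = io.StringIO("x,y\n1.0,2.0\n3.5,4.5\n5.0,6.0\n")

# skiprows=1 drops the header line; delimiter="," is an assumption about the file format
X_sample = np.loadtxt(sample_file, delimiter=",", skiprows=1)
print(X_sample.shape)  # (3, 2)
```

The same call should work with a file path in place of the `StringIO` object.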

## Exercise 1.2

Let's now use `matplotlib` to explore our 2D data. Which type of plot is best suited to showing 2D points?

In [None]:
def plot_2d(X):
    """Display a 2D plot

    :param X: input data points, array
    :return: fig, ax, objects
    """
    return ##

_, _ = plot_2d(X_gauss) # the two underscores let you discard the returned values
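
One possible answer is a scatter plot. The sketch below is not the reference solution: the function name `plot_2d_sketch`, the marker size, and the random demo data are all assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_2d_sketch(X):
    """Hypothetical scatter-plot helper: one marker per row of X."""
    fig, ax = plt.subplots(figsize=(6, 6), dpi=90)
    # First column on the x-axis, second column on the y-axis
    ax.scatter(X[:, 0], X[:, 1], s=5)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    return fig, ax

# Demo on synthetic points (not the exercise dataset)
rng = np.random.default_rng(0)
fig, ax = plot_2d_sketch(rng.normal(size=(200, 2)))
```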

## Exercise 1.3

Let's now implement our own version of the K-means algorithm. We will use a custom Python class. Its basic building blocks will be:

- a `fit_predict` method exposed to the users of the class
- a stateful structure: we want to save the final cluster labels and centroids
- some internal plotting functions
- a `dump_to_file` method to save cluster labels to a CSV file (see Exercise 1.4)

In [None]:
class KMeans:
    def __init__(self, n_clusters, max_iter=100):
        self.n_clusters = n_clusters 
        self.max_iter = max_iter 
        self.centroids = None 
        self.labels = None 

    def fit_predict(self, X):
        """Run the K-means clustering on X.
        :param X: input data points, array, shape = (N,C).
        :return: labels : array, shape = N.
        """ 
        pass


In [None]:
np.random.seed(0)
kmeans_model = KMeans(10)
l_gauss = kmeans_model.fit_predict(X_gauss)
l_gauss

In [None]:
kmeans_model.dump_to_file('gauss_labels.csv')
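
The sketch below shows one way the skeleton could be filled in, using the standard Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is not the reference solution. The class name `KMeansSketch`, the random initialization from data points, the stopping rule, and the CSV layout (`Id,ClusterId`) are assumptions.

```python
import numpy as np

class KMeansSketch:
    """Hypothetical minimal K-means (Lloyd's algorithm)."""

    def __init__(self, n_clusters, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.centroids = None
        self.labels = None

    def fit_predict(self, X):
        self.labels = None  # forget labels from any previous run
        # Assumption: initial centroids are distinct points sampled from X
        idx = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        self.centroids = X[idx].astype(float)
        for _ in range(self.max_iter):
            # Squared Euclidean distances, shape (N, K)
            dists = ((X[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Stop as soon as the assignments no longer change
            if self.labels is not None and np.array_equal(labels, self.labels):
                break
            self.labels = labels
            for k in range(self.n_clusters):
                members = X[labels == k]
                if len(members) > 0:  # an empty cluster keeps its old centroid
                    self.centroids[k] = members.mean(axis=0)
        return self.labels

    def dump_to_file(self, filename):
        # Assumed layout: a header line, then one "point index,label" row per point
        with open(filename, "w") as fp:
            fp.write("Id,ClusterId\n")
            for i, label in enumerate(self.labels):
                fp.write(f"{i},{label}\n")

# Demo on two well-separated toy groups (not the exercise dataset)
np.random.seed(0)
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]])
demo_model = KMeansSketch(2)
demo_labels = demo_model.fit_predict(X_demo)
```

Computing all distances in one broadcasted array is simple but uses memory proportional to N × K × C; for larger datasets a loop over centroids may be preferable.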

## Exercise 1.4

Let's now analyze the Chameleon dataset, just as before.

In [None]:
X_cham = ##

In [None]:
_, _ = plot_2d(X_cham)

## Exercise 1.5

It is time to enhance our plotting toolkit. We define a new function that highlights the final centroids. Specifically, we can use a different marker style and color for them.

In [None]:
def plot_centroids(X, c, title=None):
    ###
    return fig, ax
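
A possible sketch of such a function, assuming `c` is an array of centroid coordinates with shape (K, 2). The name `plot_centroids_sketch`, the colors, and the star marker are illustrative choices, not the reference solution.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_centroids_sketch(X, c, title=None):
    """Hypothetical helper: data points in gray, centroids as red stars."""
    fig, ax = plt.subplots(figsize=(6, 6), dpi=90)
    ax.scatter(X[:, 0], X[:, 1], s=5, color="lightgray", label="points")
    # Larger size, different marker and color make the centroids stand out
    ax.scatter(c[:, 0], c[:, 1], s=200, marker="*", color="red", label="centroids")
    if title:
        ax.set_title(title)
    ax.legend()
    return fig, ax

# Demo with synthetic points and two made-up centroids
rng = np.random.default_rng(0)
fig, ax = plot_centroids_sketch(rng.normal(size=(100, 2)),
                                np.array([[0.0, 0.0], [1.0, 1.0]]),
                                title="Toy example")
```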

## Exercise 1.7
We want to plot the positions of the centroids over the data. One option is to include the plot in the `fit_predict` method (for instance, enabled by a flag). It would also be interesting to plot the intermediate positions of the centroids.

Let's test it with Gaussian clusters.

In [None]:
np.random.seed(6)
k_gauss = KMeans(15)
_ = k_gauss.fit_predict(X_gauss, plot_centroids=True)
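
One way to show intermediate positions is to have `fit_predict` store a copy of the centroids at every iteration and draw each centroid's path afterwards. The sketch below covers only the drawing step and assumes a hypothetical `history` array of shape (n_steps, K, 2); how the history is collected inside `fit_predict` is left to your implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_centroid_paths(X, history):
    """Hypothetical helper: draw the trajectory of each centroid.

    history: array of shape (n_steps, K, 2), one centroid snapshot per iteration.
    """
    fig, ax = plt.subplots(figsize=(6, 6), dpi=90)
    ax.scatter(X[:, 0], X[:, 1], s=5, color="lightgray")
    for k in range(history.shape[1]):
        # One line per centroid, connecting its successive positions
        ax.plot(history[:, k, 0], history[:, k, 1], marker="o", markersize=3)
    # Final positions on top, with a distinct marker
    ax.scatter(history[-1, :, 0], history[-1, :, 1],
               s=200, marker="*", color="red", zorder=3)
    return fig, ax

# Demo with a made-up two-step history for two centroids
rng = np.random.default_rng(0)
toy_history = np.array([[[0.0, 0.0], [5.0, 5.0]],
                        [[1.0, 1.0], [4.0, 4.0]]])
fig, ax = plot_centroid_paths(rng.normal(size=(50, 2)), toy_history)
```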

## Exercise 2.1
So far, we have talked about "good" and "bad" clustering results in terms of the ability of K-means to adapt to different data distributions. Our discussion was mainly based on the final centroid positions. This visual approach may not be feasible with many more points and dimensions.

Whenever numerical data is processed, several distance-based metrics can be used to assess the quality of our clustering. Let's now focus on the Silhouette measure.
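
For reference, the standard definition of the silhouette of a point $i$, where $a(i)$ is the mean distance from $i$ to the other points of its own cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values close to 1 indicate a point well inside its cluster, values near 0 a point on the border between two clusters, and negative values a point that may have been assigned to the wrong cluster.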

In [None]:
def silhouette_samples(X, labels):
    """Evaluate the silhouette for each point and return them as an array.
    
    :param X: input data points, array, shape = (N,C).
    :param labels: the list of cluster labels, shape = N.
    :return: silhouette : array, shape = N
    """
    silhouette = np.zeros(X.shape[0])
    for idx in range(X.shape[0]):
        ###
        
    return silhouette

def silhouette_score(X, labels):
    """Evaluate the silhouette for each point and return the mean.
    
    :param X: input data points, array, shape = (N,C).
    :param labels: the list of cluster labels, shape = N.
    :return: silhouette : float
    """
    return np.mean(silhouette_samples(X, labels))
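
A possible way to fill in the loop, offered as a sketch rather than the reference solution. For each point it computes the mean distance to the other points of its own cluster (`a`) and the smallest mean distance to the points of another cluster (`b`). Assumptions: Euclidean distance, at least two clusters, and a silhouette of 0 for points in singleton clusters. The name `silhouette_samples_sketch` is hypothetical.

```python
import numpy as np

def silhouette_samples_sketch(X, labels):
    """Hypothetical per-point silhouette; assumes at least two clusters."""
    labels = np.asarray(labels)
    unique = np.unique(labels)
    silhouette = np.zeros(X.shape[0])
    for idx in range(X.shape[0]):
        # Euclidean distance from point idx to every point, shape (N,)
        d = np.sqrt(((X - X[idx]) ** 2).sum(axis=1))
        own = labels == labels[idx]
        n_own = own.sum()
        if n_own == 1:
            continue  # assumed convention: singleton cluster gets 0
        # The point's distance to itself is 0, so divide by n_own - 1
        a = d[own].sum() / (n_own - 1)
        # Smallest mean distance to the points of any other cluster
        b = min(d[labels == other].mean() for other in unique if other != labels[idx])
        denom = max(a, b)
        silhouette[idx] = (b - a) / denom if denom > 0 else 0.0
    return silhouette

# Demo: two tight, well-separated pairs of points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
s_toy = silhouette_samples_sketch(X_toy, [0, 0, 1, 1])
```

Each iteration computes N distances, so the full evaluation is quadratic in the number of points; this can be slow on large datasets.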

In [None]:
np.random.seed(6)
k_gauss = KMeans(15)
k_cham = KMeans(6)

l_gauss = k_gauss.fit_predict(X_gauss)
l_cham = k_cham.fit_predict(X_cham)
sil_gauss = silhouette_score(X_gauss, l_gauss)
sil_cham = silhouette_score(X_cham, l_cham)

print('Gaussian clusters, average silhouette:', sil_gauss)
print('Chameleon clusters, average silhouette:', sil_cham)

## Exercise 2.2
Let's now analyze how the Silhouette values are distributed across our points. To do so, we can sort them in ascending order and plot them along the x-axis.

In [None]:
def plot_silhouette(silhouette, title=None):
    fig, ax = plt.subplots(figsize=(6,6), dpi=90)
    ###
    
    if title:
        ax.set_title(title)
    return fig, ax

In [None]:
_, _ = plot_silhouette(silhouette_samples(X_gauss, l_gauss), "Silhouette for Gaussian points")
_, _ = plot_silhouette(silhouette_samples(X_cham, l_cham), "Silhouette for Chameleon points")
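
A minimal sketch of the sorted plot, assuming the silhouette values arrive as a one-dimensional array. The function name `plot_sorted_silhouette` and the axis labels are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_sorted_silhouette(silhouette, title=None):
    """Hypothetical helper: silhouette values sorted in ascending order."""
    fig, ax = plt.subplots(figsize=(6, 6), dpi=90)
    # x is the rank of the point after sorting, y is its silhouette
    ax.plot(np.sort(silhouette))
    ax.set_xlabel("Points (sorted by silhouette)")
    ax.set_ylabel("Silhouette")
    if title:
        ax.set_title(title)
    return fig, ax

# Demo with made-up values
fig, ax = plot_sorted_silhouette(np.array([0.8, -0.1, 0.5, 0.9]), "Toy example")
```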


## Exercise 2.3
Until now, we have analyzed results obtained with K set to the correct number of clusters.

Let's analyze how our average silhouette varies with different K values for the Gaussian clusters.


In [None]:
###
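
The sweep over K can be sketched as a small loop. To keep the example independent of any particular implementation, the hypothetical helper `silhouette_by_k` receives the clustering and scoring functions as arguments.

```python
def silhouette_by_k(X, k_values, cluster_fn, score_fn):
    """Hypothetical helper: average silhouette for each candidate K.

    cluster_fn(X, k) must return the labels for k clusters;
    score_fn(X, labels) must return a single float.
    """
    scores = {}
    for k in k_values:
        labels = cluster_fn(X, k)
        scores[k] = score_fn(X, labels)
    return scores
```

With the notebook's own objects, a call could look like `silhouette_by_k(X_gauss, range(2, 21), lambda X, k: KMeans(k).fit_predict(X), silhouette_score)`. The range of K values is an arbitrary choice; it starts at 2 because the silhouette needs at least two clusters. The resulting dictionary can then be plotted with K on the x-axis and the average silhouette on the y-axis.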