# Compression d'image avec k-means

Dans ce projet, nous allons implémenter l'algorythm de k-means appliqué à la compression d'image.

## Plan
* Implementation de k-means
* Compression d'image avec k-means

In [22]:
import pandas as pd
import numpy as np

## Qu'est-ce qu K-means
C'est une methode pour grouper ensemble les points de données similaires de manière automatique.

On a un jeu de données d'entrainement $\{x^{(1)}, ..., x^{(m)}\}$, et on veut grouper les données en plusieurs clusters.

K-means est une procédure itérative qui :
* Commence par assigner aléatoirement des centroids
* Affine ces centroids en:
    - Assignant de manière répétitive les exemples à leur centroide le plus proche, 
    - Et en calculant de nouveau les centroids

## Chercher les centroids les plus proches

In [23]:
def find_closest_centroids(X, centroids):
    
    # K : number of clusters
    K = centroids.shape[0]
    
    
    idx = np.zeros(X.shape[0], dtype=int)
    for row_index in range(K):
        distances = []
        for centroid_index in range(centroids.shape[0]):
            # Implementation of the formula to compute distance between 2 points
            norm_ij = np.linalg.norm( X[row_index] - centroids[centroid_index])
            distances.append(norm_ij)
        idx[row_index] = np.argmin(distances)
    return idx

## Calculer les centroids

In [32]:
def compute_centroids(X, idx, K):
    m, n = X.shape
    centroids = np.zeros((K, n))
    
    ### START CODE HERE ###
    example_per_cluster = np.zeros(K)
    for i in range(m):
        cluster_idx = idx[i]
        centroids[cluster_idx] += X[i]
        example_per_cluster[cluster_idx] += 1
        
    for i in range(K):
        print("==> " , example_per_cluster[i])
        
    for i in range(K):
        
        centroids[i] /= max(1.0,example_per_cluster[i])
    
    return centroids
    

## Lancer K-means

In [33]:
def run_kmeans(X, initial_centroids, max_iter=10):
    # m : Nb of rows
    # n : Nb of features
    m, n = X.shape
    nbClusters = initial_centroids.shape[0]
    centroids = initial_centroids
    previous_centroids = centroids
    idx = np.zeros(m)
    
    for i in range(max_iter):
        idx = find_closest_centroids(X, centroids)
        centroids = compute_centroids(X, idx, nbClusters)
    return centroids, idx
    

In [34]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
X.shape


(150, 4)

In [35]:
initial_centroids = np.array([
    [1.0 ,1.0, 1.0, 1.0],
    [2.0 ,2.0, 2.0, 2.0],
    [3.0 ,3.0, 3.0, 3.0],
    [4.0, 4.0, 4.0, 4.0]])
max_iters = 10
centroids, idx = run_kmeans(X, initial_centroids, max_iters)

==>  146.0
==>  3.0
==>  1.0
==>  0.0
==>  146.0
==>  0.0
==>  0.0
==>  4.0
==>  146.0
==>  4.0
==>  0.0
==>  0.0
==>  146.0
==>  0.0
==>  4.0
==>  0.0
==>  146.0
==>  4.0
==>  0.0
==>  0.0
==>  146.0
==>  0.0
==>  4.0
==>  0.0
==>  146.0
==>  4.0
==>  0.0
==>  0.0
==>  146.0
==>  0.0
==>  4.0
==>  0.0
==>  146.0
==>  4.0
==>  0.0
==>  0.0
==>  146.0
==>  0.0
==>  4.0
==>  0.0


  centroids[i] /= example_per_cluster[i]
