# K-means Clustering Algorithm

This notebook shows an implementation of the K-means clustering algorithm, an unsupervised learning algorithm used to find meaningful groups and segments within a dataset.

## Dataset

The dataset that we will be working with describes the clients of a wholesale distributor.

- The Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicassen attributes are each client's annual spend on different product categories;
- The Channel category refers to the marketing channel that the customers came from - Hotel/Restaurant/Cafe, or Retail;
- The Region category refers to the region of Portugal that the clients came from - Lisbon, Oporto, or Other Region.

In [130]:
import pandas as pd

data = pd.read_csv("data/wholesale-customer-data.csv")

# Relabel the categorical columns
data.Region = data.Region.astype("category").cat.rename_categories(["Lisbon", "Oporto", "Other"])
data.Channel = data.Channel.astype("category").cat.rename_categories(["Hotel/Restaurant/Cafe", "Retail"])

data.head(5)

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,Retail,Other,12669,9656,7561,214,2674,1338
1,Retail,Other,7057,9810,9568,1762,3293,1776
2,Retail,Other,6353,8808,7684,2405,3516,7844
3,Hotel/Restaurant/Cafe,Other,13265,1196,4221,6404,507,1788
4,Retail,Other,22615,5410,7198,3915,1777,5185


## Data Preprocessing

The dataset is already in good shape. Before clustering, we need to:
    
- encode the two categorical columns; and
- scale the numerical columns.

#### Encoding Categorical Variables

In [131]:
# encode the Region column
data = pd.concat(
    [
        data,
        pd.get_dummies(data.Region, drop_first=True, prefix="Region")
    ],
    axis=1
).drop(["Region"], axis=1)

# encode the Channel column
data = pd.concat(
    [
        data,
        pd.get_dummies(data.Channel, drop_first=True, prefix="Channel")
    ],
    axis=1
).drop(["Channel"], axis=1)

data.head(5)

Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Region_Oporto,Region_Other,Channel_Retail
0,12669,9656,7561,214,2674,1338,0,1,1
1,7057,9810,9568,1762,3293,1776,0,1,1
2,6353,8808,7684,2405,3516,7844,0,1,1
3,13265,1196,4221,6404,507,1788,0,1,0
4,22615,5410,7198,3915,1777,5185,0,1,1


#### Feature Scaling

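We will standardise the numerical columns to zero mean and unit variance. K-means relies on Euclidean distance, so without scaling the columns with the largest spend values would dominate the clustering.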
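As a rough illustration of what standardising does, the sketch below applies $z = (x - \mu) / \sigma$ to a handful of values in plain Python. The helper name `standardise` is hypothetical, and we assume the population standard deviation (dividing by $N$), which to our knowledge is what `preprocessing.scale` uses. The inputs are the first five `Fresh` values from the table above, so the results will not match the notebook output, which is scaled over the full column.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by N)
    return [(v - mean) / std for v in values]


# First five values of the Fresh column (illustrative subset only)
fresh = [12669, 7057, 6353, 13265, 22615]
scaled = standardise(fresh)

for raw, z in zip(fresh, scaled):
    print(f"{raw:>6} -> {z:+.3f}")
```

After scaling, the values have a mean of zero and a standard deviation of one, so every column contributes on a comparable scale when distances are computed.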

In [132]:
from sklearn import preprocessing

numeric_columns = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
data[numeric_columns] = preprocessing.scale(data[numeric_columns])

data.head(5)

Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Region_Oporto,Region_Other,Channel_Retail
0,0.052933,0.523568,-0.041115,-0.589367,-0.043569,-0.066339,0,1,1
1,-0.391302,0.544458,0.170318,-0.270136,0.086407,0.089151,0,1,1
2,-0.447029,0.408538,-0.028157,-0.137536,0.133232,2.243293,0,1,1
3,0.100111,-0.62402,-0.392977,0.687144,-0.498588,0.093411,0,1,0
4,0.840239,-0.052396,-0.079356,0.173859,-0.231918,1.299347,0,1,1


## The Algorithm

The algorithm works by keeping track of $k$ "centroids", which represent the centres of the $k$ clusters that we are trying to sort the data into.

1. We start by randomly selecting $k$ points in the dataset to be the initial centroids;
2. Then, for each data point in the set, we calculate the distance (Euclidean distance in this case) from the data point to each centroid;
3. We assign each point to the cluster of the centroid that it is closest to;
4. We then move each centroid to the mean of the points assigned to its cluster;
5. We repeat steps 2-4 until the cluster assignments no longer change.
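To make the steps concrete, here is a minimal sketch of the same loop in plain Python on six toy 2-D points. It is separate from the pandas implementation in this notebook: the names `assign`, `update`, and `kmeans_sketch` are hypothetical, and step 1 is made deterministic by hand-picking two data points as the initial centroids rather than sampling them at random.

```python
from math import dist


def assign(points, centroids):
    """Steps 2-3: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Step 4: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def kmeans_sketch(points, centroids):
    """Step 5: repeat steps 2-4 until the assignments stop changing."""
    labels = assign(points, centroids)
    while True:
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            return labels, centroids
        labels = new_labels


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
# Hand-picked initial centroids (an assumption made for reproducibility)
labels, centroids = kmeans_sketch(points, centroids=[points[0], points[1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even though both initial centroids start in the lower-left group, the second one is pulled towards the upper-right points after the first update, and the assignments settle after two updates.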

In [181]:
import numpy as np
from random import random

def kmeans_clustering(data, k=3):
    """Perform the K-means Clustering algorithm.

    This function performs the K-means Clustering algorithm with `k` clusters on the given dataset.
    
    Args:
        data (pandas dataframe): the data to perform the algorithm on.
        k (int): the number of clusters to separate the data into.
    
    Returns:
        data (pandas dataframe): the input data with an added `clusters` column holding each row's cluster index.

    """
    
    centroids = select_initial_centroids(data, k)
    data["clusters"] = None
    clusters = get_clusters(data, centroids)
    count = 0
    
    while has_cluster_variance_changed(data, clusters):
        count += 1
        data.clusters = clusters
        centroids = iterate_centroids(data)
        clusters = get_clusters(data, centroids)
        
    print(f"The algorithm performed {count} iterations.")
    return data

def select_initial_centroids(data, k):
    """Select the initial centroids for the start of the algorithm.
    
    Args:
        data (pandas dataframe): the data to perform the algorithm on.
        k (int): the number of clusters to separate the data into.
    
    Returns:
        initial_centroids (list): a list of numpy arrays representing the initial centroid vectors.
    
    """
    # Sample k distinct rows so that no two initial centroids are the same data point
    return list(data.sample(n=k).values)

def closest_centroid(centroids):
    
    def closest_to(row):
        """Curried function to get the closest centroid for a given data point.
        
        Given a list of centroids, and then a data point, this curried function returns
        the index of the centroid that the data point is closest to.
        
        Args:
            centroids (list): a list of centroids (numpy vectors).
            row (pandas series): one row of data from the dataset.
        
        Returns:
            closest_index (int): the index of the closest centroid.
        
        
        """
        
        p = row.values
        distances = [np.sum((c-p) ** 2) for c in centroids]
        return np.argmin(distances)
        
    return closest_to

def get_clusters(data, centroids):
    """Get a column showing the nearest cluster for each data point."""
    return data.drop("clusters", axis=1).apply(closest_centroid(centroids), axis=1)

def has_cluster_variance_changed(data, clusters):
    """Determine whether the clustering has changed between iterations."""
    return not (data.clusters == clusters).all()

def iterate_centroids(data):
    """Calculate the next iteration of the centroid from the current clustering.
    
    Take data that has already been clustered and determine the average points for each cluster.
    These will then become the centroids in the next iteration of the algorithm.
    
    Args:
        data (pandas dataframe): the dataset that we are operating on.
    
    Returns:
        centroids (list): a list of the next `k` centroids.
    
    """
    
    # groupby(...).mean() averages every column except the grouping column
    cluster_means = data.groupby("clusters").mean()
    return [np.array(row[1:]) for row in cluster_means.itertuples()]

def average_dataset(dataframe):
    """Return the column average of the dataset without the clusters column."""
    return dataframe.drop("clusters", axis=1).mean()

In [186]:
kmeans_clustering(data.copy(), k=2)

The algorithm performed 14 iterations.


Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen,Region_Oporto,Region_Other,Channel_Retail,clusters
0,0.052933,0.523568,-0.041115,-0.589367,-0.043569,-0.066339,0,1,1,1
1,-0.391302,0.544458,0.170318,-0.270136,0.086407,0.089151,0,1,1,1
2,-0.447029,0.408538,-0.028157,-0.137536,0.133232,2.243293,0,1,1,1
3,0.100111,-0.624020,-0.392977,0.687144,-0.498588,0.093411,0,1,0,1
4,0.840239,-0.052396,-0.079356,0.173859,-0.231918,1.299347,0,1,1,1
...,...,...,...,...,...,...,...,...,...,...
435,1.401312,0.848446,0.850760,2.075222,-0.566831,0.241091,0,1,0,1
436,2.155293,-0.592142,-0.757165,0.296561,-0.585519,0.291501,0,1,0,1
437,0.200326,1.314671,2.348386,-0.543380,2.511218,0.121456,0,1,1,0
438,-0.135384,-0.517536,-0.602514,-0.419441,-0.569770,0.213046,0,1,0,1
