### LSE Data Analytics Online Career Accelerator 

# DA301: Advanced Analytics for Organisational Impact

## Demonstration video: *k*-means clustering


#### *This is a possible solution to the demonstration video.*

In this video, we will create, fit and interpret a *k*-means clustering model based on a real-life scenario.

Imagine we’re working as a data analyst at a fruit breeding research station. As one of our quarterly goals, we decide to investigate the possibility of utilising k-means clustering to understand data collected by all the researchers.

To get there, let’s first set two broad questions for us to answer:
- Can we use k-means clustering to draw useful conclusions and predictions? For example, can the fruit type be deduced from sepal length and sepal width based on k-means clustering?
- How can we improve the accuracy of the model?

# 

## 1. Prepare your workstation

Our first step is always to prepare our workstation, starting with importing the necessary libraries and the data set.
The data set from the research station, `fruit.csv`, contains data on 1,500 fruit blossoms from three different fruit types: apricots, peaches, and plums. The sepal length and width of the blossoms were measured on mature fruit trees between the ages of 7 and 10 years.

Pollen from fruit flowers is harvested for pollination when the blossoms are in the balloon phase. Although the petals of apricots, plums and peaches differ in colour, the difference is not always distinct enough for colour-blind researchers. Therefore, the sepal width and length were measured to determine whether the fruit type of the harvested blossoms can be identified from these two measurements.

In [None]:
# Import libraries.
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

# Load the data.
df = pd.read_csv('fruit.csv')

# View the DataFrame.
df.info()

# 

## 2. Prepare the data

In [None]:
# Drop unnecessary columns.
df_fruit = df.drop(columns=['tree_age', 'location', 'colour_blossom'])

# Display a summary of the numeric variables and column names.
print(df_fruit.columns)
df_fruit.describe()

# 

## 3. Visualise the data

### Scatterplot

In [None]:
# Import Seaborn and Matplotlib.
from matplotlib import pyplot as plt
import seaborn as sns

# Create a scatterplot with Seaborn.
sns.scatterplot(x='sepal_length',
                y='sepal_width',
                data=df_fruit,
                hue='fruit_type')

### Pairplot

In [None]:
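# Select the two features used for clustering (copy so that columns can be added later).
x = df_fruit[['sepal_length', 'sepal_width']].copy()

# Create a pairplot with Seaborn.
sns.pairplot(df_fruit,
             vars=['sepal_length', 'sepal_width'],
             hue='fruit_type',
             diag_kind='kde')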

# 

## 4. Improve the accuracy

### The elbow method
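The elbow method is used to estimate a suitable number of clusters for *k*-means clustering: the within-cluster sum of squared distances (inertia) is plotted against the number of clusters, and the point where the curve bends (the "elbow") is taken as a candidate for *k*. However, the elbow is not always clear, especially when the data does not form well-separated clusters.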
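
To make the quantity on the elbow chart concrete, here is a minimal sketch of what `inertia_` measures. It uses synthetic data from `make_blobs` as a stand-in (an assumption for illustration, not the fruit data), and the names `points`, `model` and `manual_inertia` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (NOT the fruit data): three blobs in two dimensions.
points, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(points)

# Look up the centre of the cluster each point was assigned to.
assigned_centres = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each point to its assigned centre.
manual_inertia = np.sum((points - assigned_centres) ** 2)

# The two values should agree up to floating-point error.
print(manual_inertia, model.inertia_)
```

Adding clusters can only lower this sum or leave it roughly unchanged, which is why the chart is read for the bend rather than for the minimum.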

In [None]:
# Import the KMeans class.
from sklearn.cluster import KMeans 

# Elbow chart to help us decide on the optimal number of clusters.
ss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    max_iter=300,
                    n_init=10,
                    random_state=0)
    kmeans.fit(x)
    ss.append(kmeans.inertia_)

# Plot the elbow method.
plt.plot(range(1, 11),
         ss,
         marker='o')

# Insert labels and title.
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("SS distance")

plt.show()

### The silhouette method

The silhouette method computes a silhouette coefficient for each point, which measures how similar the point is to its own cluster compared with the other clusters. The coefficient ranges from -1 to 1, and higher values indicate a better fit. Averaged over all points, it gives a succinct summary of how well each observation has been assigned to a cluster.
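
As an illustration of the coefficient, the sketch below works it out by hand for a single point and compares the result with scikit-learn's `silhouette_samples`. The data is synthetic (`make_blobs`), and the variable names are hypothetical; this is a sketch under those assumptions, not part of the original solution.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in data (NOT the fruit data).
points, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Work out the coefficient for one point by hand.
idx = 0
distances = np.linalg.norm(points - points[idx], axis=1)
own = labels == labels[idx]

# a: mean distance to the other points in the same cluster (the point itself is excluded).
a = distances[own].sum() / (own.sum() - 1)

# b: smallest mean distance to the points of any other cluster.
b = min(distances[labels == c].mean() for c in np.unique(labels) if c != labels[idx])

manual_coefficient = (b - a) / max(a, b)

# Compare with scikit-learn's per-point values and the overall score.
per_point = silhouette_samples(points, labels, metric='euclidean')
print(manual_coefficient, per_point[idx])
print(per_point.mean(), silhouette_score(points, labels, metric='euclidean'))
```

The overall `silhouette_score` is the mean of the per-point coefficients, which is the value plotted for each *k* in the silhouette chart.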

In [None]:
# Import the silhouette_score function from sklearn.
from sklearn.metrics import silhouette_score

# Find the range of clusters to be used using silhouette method.
sil = []
kmax = 10

for k in range(2, kmax+1):
    # Fix n_init and random_state so that the scores are reproducible.
    kmeans_s = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    labels = kmeans_s.labels_
    sil.append(silhouette_score(x,
                                labels,
                                metric='euclidean'))

# Plot the silhouette method.
plt.plot(range(2, kmax+1),
         sil,
         marker='o')

# Insert labels and title.
plt.title("The Silhouette Method")
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette score")

plt.show()

# 

## Selecting the number of clusters

### As we have three fruit types, let's set `k=3` (three clusters).

## 5a. Evaluate and fit the model

In [None]:
# Use three clusters.
kmeans = KMeans(n_clusters=3,
                max_iter=15000,
                init='k-means++',
                random_state=0).fit(x)

clusters = kmeans.labels_

x['K-Means Predicted'] = clusters

# Plot the predicted.
sns.pairplot(x,
             hue='K-Means Predicted',
             diag_kind='kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts()

## 6a. Visualise the clusters

In [None]:
# View the K-Means predicted.
print(x.head())

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc={'figure.figsize': (12, 8)})

# Create a scatterplot.
sns.scatterplot(x='sepal_length',
                y='sepal_width',
                data=x,
                hue='K-Means Predicted',
                palette=['red', 'green', 'blue'])
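
To address the first question (whether the fruit type can be deduced from the clusters), the predicted clusters can be cross-tabulated against the known fruit types. The sketch below uses a few hypothetical toy rows, not values from `fruit.csv`, purely to show the shape of the comparison.

```python
import pandas as pd

# Hypothetical toy rows (NOT taken from fruit.csv): known labels next to cluster ids.
toy = pd.DataFrame({
    'fruit_type': ['apricot', 'apricot', 'peach', 'peach', 'plum', 'plum'],
    'K-Means Predicted': [0, 0, 1, 2, 2, 2],
})

# Rows are the known fruit types; columns are the cluster ids.
table = pd.crosstab(toy['fruit_type'], toy['K-Means Predicted'])
print(table)

# The cluster that holds most observations of each fruit type.
dominant = table.idxmax(axis=1)
print(dominant)
```

With the notebook's own DataFrames, the equivalent call would presumably be `pd.crosstab(df_fruit['fruit_type'], x['K-Means Predicted'])`. Cluster ids are arbitrary, so the table is read by looking at which cluster dominates each row rather than by expecting the ids to match the fruit types in any particular order.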

# 

### Let's set `k=4` and compare it with three clusters.

## 5b. Evaluate and fit the model

In [None]:
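# Use four clusters. Fit on the two sepal features only, so that the
# 'K-Means Predicted' column added for k=3 is not used as a feature.
kmeans = KMeans(n_clusters=4,
                max_iter=15000,
                init='k-means++',
                random_state=0).fit(x[['sepal_length', 'sepal_width']])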

clusters = kmeans.labels_

x['K-Means Predicted'] = clusters

# Plot the predicted.
sns.pairplot(x, 
             hue='K-Means Predicted',
             diag_kind='kde')

In [None]:
# Check the number of observations per predicted class.
x['K-Means Predicted'].value_counts()

## 6b. Visualise the clusters

In [None]:
# View the K-Means predicted.
print(x.head())

In [None]:
# Visualising the clusters.
# Set plot size.
sns.set(rc={'figure.figsize': (12, 8)})

# Create a scatterplot.
sns.scatterplot(x='sepal_length',
                y='sepal_width',
                data=x,
                hue='K-Means Predicted',
                palette=['red', 'green', 'blue', 'black'])

# 

## 7. Conclusion(s)

>Although there are only three fruit types (apricot, plum, and peach), it seems that `k=4` (four clusters) might give the best grouping. The three fruit types are closely related (same genus, but different species), which may explain why `Cluster 0` is the largest group for both `k=3` and `k=4`. The number of observations per predicted cluster suggests a better distribution for `k=4` than for `k=3`.
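
One way to back up a comparison like this with numbers is to place the cluster sizes and the silhouette score for each *k* side by side. The sketch below does this on synthetic data (`make_blobs`), so its output says nothing about the fruit data; the `summary` dictionary and the other names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (NOT the fruit data).
points, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

summary = {}
for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    summary[k] = {
        # Number of observations assigned to each cluster.
        'sizes': np.bincount(labels),
        # Mean silhouette coefficient over all points.
        'silhouette': silhouette_score(points, labels),
    }
    print(k, summary[k]['sizes'], round(summary[k]['silhouette'], 3))
```

In the notebook, the same loop could be run on the two sepal columns of `x` to check whether the more even cluster sizes for `k=4` are also accompanied by a higher silhouette score.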