# Cluster analysis

## 1. Overview

Today we learned about clustering for unsupervised machine learning and the k-means algorithm in particular. For a user-specified number $k$ of clusters, k-means assigns each data point to its nearest centroid and then moves each centroid to the mean of the points assigned to it. These two steps are repeated until the cluster assignments stop changing (or change only marginally).
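To make the assign/update loop concrete, here is a minimal NumPy sketch of it on made-up toy data. The function name `kmeans_sketch`, its parameters and the toy blobs are our own illustrative choices, not part of scikit-learn; later in the notebook we use the library's ```KMeans``` instead, which adds better initialisation and several restarts.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means (assign/update loop) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance of every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster simply keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have (almost) stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Hypothetical toy data: two blobs in 2D
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_toy, centroids_toy = kmeans_sketch(X_toy, k=2)
print(centroids_toy)
```

Note that the result depends on the random initialisation, which is one reason why library implementations run the algorithm several times and keep the best run.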


In this notebook we will use k-means clustering to analyse the Breast Cancer Wisconsin (Diagnostic) Data Set - check out the description here: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data.

The data is in fact labelled - the column ```"diagnosis"``` takes values 'B' (benign) and 'M' (malignant). Here, we will ignore the outcome labels at first, perform clustering and then check if the assigned clusters make any sense, i.e. coincide with the observed outcomes. 

## 2. Import and data prep
### 2.1. Importing Dependencies

As usual, we start by importing the libraries we will use later on. Throughout the notebook, if any function is unclear, look it up in the library's documentation to familiarize yourself with its inputs and outputs. 

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, accuracy_score, silhouette_score, adjusted_rand_score

### 2.2. Load data

**TASK:** Load the file ```breast-cancer-wisconsin.csv``` located in the same directory as this notebook using ```read_csv()``` from pandas.

In [None]:
data = ...

### 2.3. Data summary

As usual, we start by having a peek at the data. 

**TASK:** Use ```head()```, ```describe()``` and ```info()``` on the dataframe to get a first idea.

**TASK:** Check how many data points are classed as benign/malignant using ```groupby()``` and ```size()```, just as we did in the kNN notebook. 

**TASK:** Check for missing values (NaN = not a number) using ```isna()``` on the dataframe, just as we did in the regression notebook. 

**TASK:** We don't need the first and last column ('id', 'Unnamed: 32') - drop them. 

In [None]:
# Cleaning and modifying the data


**TASK:** Assign the number of data points and columns to variables ```ndat, nvar``` using ```shape```.

In [None]:
ndat, nvar = ...

Finally, we map the diagnostic values to binary values. 

In [None]:
# Mapping Benign to 0 and Malignant to 1 
data['diagnosis'] = data['diagnosis'].map({'M':1,'B':0})

## 3. Exploratory Data Analysis

**TASK:** In the past two sessions, we've used a variety of plotting tools. Think about which ones you found most useful and try them on this dataset. Keep in mind, though, that this dataset has many more columns than those we used in the previous two sessions, which makes it more cumbersome to look at all variables at once. 

**Reflect:** Do you think the data look well separable, i.e. would a clustering method be able to find clusters that roughly coincide with the diagnostic outcomes?

**TASK:** Compute and plot the correlation matrix, just as we did in the regression notebook. 

In [None]:
corr_matrix = ...

fig = ...
fig.show()

## 4. Data preparation
### 4.1. Data scaling

Data scaling is important because k-means assigns clusters based on Euclidean distance. A variable measured on a large numeric scale would otherwise dominate the distance, so for all variables to carry the same weight they need to be on the same scale. 
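As a small illustration of why this matters, the sketch below compares pairwise Euclidean distances before and after standardising three made-up points (the numbers are hypothetical, chosen so that one column is on a scale of about 1000 and the other on a scale of about 0.1). The standardisation is written out by hand as zero mean and unit variance per column.

```python
import numpy as np

# Hypothetical toy data: column 0 is on a scale of ~1000, column 1 on a scale of ~0.1
X_raw = np.array([
    [1000.0, 0.10],
    [1010.0, 0.90],
    [1200.0, 0.12],
])

def pairwise_dist(A):
    """Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Standardise each column to zero mean and unit variance
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

d_raw = pairwise_dist(X_raw)
d_std = pairwise_dist(X_std)
print(np.round(d_raw, 2))
print(np.round(d_std, 2))
```

On the raw data, the first point looks far closer to the second than to the third, purely because of column 0. After scaling, the large difference in column 1 counts just as much, and the two distances come out roughly equal.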

We scale the data using ```scale()``` from the sklearn ```preprocessing``` module we imported at the beginning. We don't need to rescale the ```"diagnosis"``` column.

In [None]:
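# Scaling the dataset (all feature columns, i.e. everything except 'diagnosis')
features = data.drop(columns=['diagnosis'])
datas = pd.DataFrame(preprocessing.scale(features), columns=features.columns, index=data.index)
# Add the unscaled diagnosis column back in
datas['diagnosis'] = data['diagnosis']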

**TASK:** Assign the scaled data to a variable ```X```, which we will use for the cluster analysis. Don't forget to drop diagnosis, which should not be considered in the clustering. 

In [None]:
# Creating the high dimensional feature space X
data_drop = ...
X = ...
# you might want to check out X now

### 4.2. Dimensionality reduction for later visualisation

We have 30 variables, and it is hard to visualize results in 30D. Here we use dimensionality reduction to project the data down to 2D, so that we can visualize the clustering results later on. We will learn more about dimensionality reduction in the next lecture. Just for fun, we will use two different techniques to see how the final visualization can differ. 

#### 4.2.1. Principal component analysis

Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data.
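To get a feeling for what "preserving the maximum amount of information" means, here is a small sketch on synthetic data (the toy dataset and all variable names are our own assumptions). The data are 5-dimensional but were generated from only two underlying factors, so two principal components should capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 points in 5D whose variation lives mostly in 2 directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))      # two underlying factors
mixing = rng.normal(size=(2, 5))        # how the factors map into 5D
X_toy = latent @ mixing + 0.05 * rng.normal(size=(200, 5))  # plus a little noise

pca_toy = PCA(n_components=2)
X_2d = pca_toy.fit_transform(X_toy)     # projected coordinates, shape (200, 2)

# Fraction of the total variance captured by each retained component
print(pca_toy.explained_variance_ratio_)
print(pca_toy.explained_variance_ratio_.sum())
```

On real data the first two components usually capture less of the variance than in this constructed example, so it is worth checking ```explained_variance_ratio_``` before trusting a 2D plot.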

In [None]:
from sklearn.decomposition import PCA

pca=PCA()
#Fit PCA to the dataset (only variables, excluding class)
pca.fit(data_drop)

# datapca is the data projected onto the principal components
datapca = pca.transform(data_drop)
classes = data.diagnosis

#### 4.2.2. t-SNE

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map. 

In [None]:
from sklearn.manifold import TSNE

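# tsne is the model; datatsne is the data projected onto the t-SNE dimensions
# Note: scikit-learn 1.5 renamed n_iter to max_iter; on recent versions pass max_iter=4000 instead
tsne = TSNE(verbose=1, perplexity=40, n_iter=4000)
datatsne = tsne.fit_transform(X)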

## 5. Cluster analysis

### 5.1. Perform k-means

Finally, we can use k-means to cluster our dataset. 

**TASK:** Create a k-means model object called ```km``` with $k=2$ clusters using ```KMeans()``` (set options to: ```n_clusters=2, n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=0, copy_x=True```), then fit it and obtain the cluster labels ```kY``` using ```fit_predict()``` on the data matrix ```X```.

In [None]:
from sklearn.cluster import KMeans

km = KMeans(...)
kY = ...

### 5.2. Cluster visualisation

Earlier in **§4** we computed the principal components and t-SNE projections. Use them now to visualize your clustering results in 2D. Which 2D projection do you find more useful?

**TASK:** Make two scatter plots using the first two principal components (```datapca```) computed in **§4.2.1** as axes. In the first scatter plot, colour the data points by their cluster labels ```kY```; in the second plot, for comparison, use the actual classes. 

**TASK:** Now make the same scatter plots, using the t-SNE projections (```datatsne```).

**Reflect:** From the above plots, do you think the clustering did a good job?

### 5.3. Evaluation

Since we know the actual outcomes, we can compare them to the assigned clusters. 
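One subtlety in this comparison is that cluster indices are arbitrary: k-means may call the malignant group 0 and the benign group 1, or the other way round. The sketch below uses made-up labels (`y_true` and `y_clust` are hypothetical) to show how this affects accuracy, and that the adjusted Rand index, which we imported as ```adjusted_rand_score```, is unaffected by the naming.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Hypothetical labels: the grouping is perfect, but the cluster names are swapped
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_clust = 1 - y_true

acc_raw = accuracy_score(y_true, y_clust)          # every index is swapped
acc_flipped = accuracy_score(y_true, 1 - y_clust)  # after swapping 0 and 1
acc_best = max(acc_raw, acc_flipped)               # for two clusters, try both namings

# The adjusted Rand index compares groupings, not names
ari_raw = adjusted_rand_score(y_true, y_clust)
ari_flipped = adjusted_rand_score(y_true, 1 - y_clust)

print(acc_raw, acc_flipped, acc_best)
print(ari_raw, ari_flipped)
```

So if the accuracy you compute is far below 50%, the clusters are most likely fine and only the indices need swapping.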

**TASK:** Compute the confusion matrix (like we did for kNN) and plot a heatmap of it (like we did in the regression notebook). 

In [None]:
# The clustering doesn't know which class is which, so we might need to invert cluster indices (swap 0 and 1)
# kY = [int(x) for x in kY==0]

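conf_mat = ...  # named conf_mat so that it does not shadow matplotlib.cm (imported as cm)
sns.heatmap(conf_mat, annot=True, fmt='g')  # fmt='g' shows counts as plain numbers, not 3.6e+02

accuracy = accuracy_score(data.diagnosis, kY)
print('\n Clusters coincide with diagnosis with ' + str(round(accuracy*100, 2)) + '% accuracy.')
conf_mat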

## 6. Finding a good $k$

Above, we used $k=2$, because we knew that there are two different diagnostic outcomes. In principle, however, when faced with a clustering problem, we don't *a priori* know the number of clusters. In the lecture, we learned that we can use an elbow plot to find the best $k$. 
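The quantity on the vertical axis of an elbow plot is the total within-cluster variation: the sum of squared distances from each point to the centroid of its cluster. As a sanity check on toy data (the three blobs and all variable names below are our own assumptions), the sketch computes this sum by hand and compares it with what a fitted scikit-learn model reports.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three blobs in 2D
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(40, 2)),
    rng.normal(loc=(4, 0), scale=0.5, size=(40, 2)),
    rng.normal(loc=(2, 4), scale=0.5, size=(40, 2)),
])

km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Sum of squared distances from each point to the centroid of its assigned cluster
assigned_centroids = km_toy.cluster_centers_[km_toy.labels_]
wcss_manual = ((X_toy - assigned_centroids) ** 2).sum()

print(wcss_manual, km_toy.inertia_)
```

The two numbers should agree up to floating-point error. Because adding a centroid can only bring points closer to their nearest centroid, this quantity keeps shrinking as $k$ grows, which is why we look for an elbow rather than a minimum.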

**TASK:** Within a ```for``` loop, fit a k-means model for $k=1,\dots,10$ and compute the total within-cluster variation (**tip:** it can be accessed via the ```inertia_``` attribute of the fitted model). Store the within-cluster variation in a list and plot it against $k$ to get the elbow plot. What does the plot say about the optimal value of $k$?

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, ...)
    kmeans.fit(X)
    wcss.append(...)

plt.figure()
...
plt.show()