# Chapter 6 - Unsupervised Learning: Clustering

Paul E. Anderson

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path
home = str(Path.home())  # All other paths are relative to this directory; change it if your files live elsewhere.

In many applications, observations need to be divided into groups of similar items based on their observed features. This often happens at the start of a data project, where it is exploratory and helps identify structure in the data. At other times, clustering (partitioning) is the main objective. For example, a retailer may want to divide potential customers into groups so that a marketing campaign can target those most likely to respond positively.

The general problem of grouping observations based on observed features is known as _clustering_. Where classification focuses on a matrix $X$ and a vector $y$ of labels, clustering ignores $y$ and focuses solely on $X$.

This is why clustering is called _unsupervised learning_: the supervision is contained in the $y$ vector, and we remove it from direct analysis.

Here is an analogy from childhood. Two children are playing with blocks of different colors and shapes. One is accompanied by an adult who provides feedback about the color and shape of each block. This is supervised learning. The other child is observed, but no feedback is provided. Both children play with the blocks, but the second child is learning in what machine learning practitioners would call an unsupervised manner.

<img src="https://github.com/dlsun/pods/blob/master/07-Unsupervised-Learning/shape_sorter.jpg?raw=1">

#### K-means algorithm
The $k$-means algorithm finds clusters in data. The idea is simple: each cluster has a "center" point called the **centroid**, and each observation belongs to the cluster of its nearest centroid. The challenge is finding those centroids. The algorithm starts with a random guess for the centroids and iteratively improves them.

The steps are as follows:

1. Initialize $k$ centroids at random.
2. Assign each point to the cluster of its nearest centroid.
3. Recompute each centroid as the mean of the points assigned to its cluster, since after reassignment a centroid may no longer sit at the center of its cluster.
4. Repeat steps 2 and 3 until no points change clusters.
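
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans_sketch`, the `seed` and `max_iter` parameters, the choice to initialize from randomly sampled data points, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, seed=0, max_iter=100):
    """Hypothetical helper: run k-means on an (n, d) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct observations as the initial centroids (one common choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data (made up for illustration): two well-separated groups in 2-D.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(labels)
```

The `max_iter` cap is only a safety net; on data like this the loop should stop after a handful of passes because the assignments stop changing.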

#### Stop and think: Will this algorithm always converge?

#### Your solution here

#### Stop and think: How do you pick the starting locations?

#### Your solution here

#### Stop and think: What is the best way to pick them?

#### Your solution here

#### Stop and think: How do you measure a good solution in clustering?

#### Your solution here

### K-means from scratch

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv(
    f"{home}/csc-466-student/data/breast_cancer_three_gene.csv",index_col=0
)
df.head()

Unnamed: 0,ESR1,AURKA,ERBB2,Subtype
0,0.804501,0.264356,6.941677,LumA
1,0.163597,0.589052,6.551394,Basal
2,0.569347,0.189531,7.05653,LumA
3,0.847584,0.264849,7.028625,LumB
4,0.442474,0.52604,8.783604,LumB


In [24]:
X = df[['ESR1','AURKA']]#,'ERBB2']]
# Stop and think: What happens when I put in the third variable? Did your code work? What about the plots?
X.head()


Unnamed: 0,ESR1,AURKA
0,0.804501,0.264356
1,0.163597,0.589052
2,0.569347,0.189531
3,0.847584,0.264849
4,0.442474,0.52604


In [7]:
# Stop and think: Implement a function that calculates the distance between two vectors using Euclidean distance
def distance(x,c):
    d = None
    return d

distance(X.loc[0],X.loc[1])

0.7184598565200455
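
For reference, here is one possible way to write the Euclidean distance, $\sqrt{\sum_i (x_i - c_i)^2}$, with NumPy. The name `euclidean_distance` is chosen here so that it does not clash with the `distance` stub above, and the two input rows are copied from the table as displayed, so the result only matches the printed value up to rounding.

```python
import numpy as np

def euclidean_distance(x, c):
    """Euclidean (straight-line) distance between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(np.sqrt(np.sum((x - c) ** 2)))

# The first two rows of the data (ESR1, AURKA), rounded as displayed.
row0 = [0.804501, 0.264356]
row1 = [0.163597, 0.589052]
print(euclidean_distance(row0, row1))  # close to the 0.71846 shown above
```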

In [13]:
# Stop and think: How would you find k random means for starting the clustering?
k = 6
means = None

means

Unnamed: 0,ESR1,AURKA
0,0.339851,0.065913
1,0.787042,0.322032
2,0.652941,0.356578
3,0.563117,0.489433
4,0.629883,0.277193
5,0.673965,0.530154
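
The notebook does not say how these starting means were produced. One common choice, sketched below under that caveat, is to use $k$ distinct observations drawn at random from the data. The helper name `init_means_from_data`, the `seed` parameter, and the toy array are assumptions for illustration.

```python
import numpy as np

def init_means_from_data(X, k, seed=None):
    """Hypothetical helper: choose k distinct rows of X as starting means."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=k, replace=False)  # no row is picked twice
    return np.asarray(X, dtype=float)[idx]

# Toy data (made up): 10 points with two features, values in [0, 1).
X_toy = np.random.default_rng(42).random((10, 2))
means_toy = init_means_from_data(X_toy, k=3, seed=0)
print(means_toy.shape)  # (3, 2)
```

Sampling from the data guarantees that every starting mean lies inside the observed range. With a DataFrame, `X.sample(n=k)` would do a similar job.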


In [14]:
import altair as alt

alt.Chart(X).mark_circle(size=60).encode(
    x='ESR1',
    y='AURKA') + \
alt.Chart(means).mark_circle(color='black',size=200).encode(
    x='ESR1',
    y='AURKA')

In [15]:
# Stop and think: How would you assign each datapoint to a mean?
# Stop and think: How would you compute the distortion?
clusters = []
distortion = 0
# Your solution here
Xc = X.copy()
Xc['cluster']=clusters
Xc.head()

Unnamed: 0,ESR1,AURKA,cluster
0,0.804501,0.264356,1
1,0.163597,0.589052,3
2,0.569347,0.189531,4
3,0.847584,0.264849,1
4,0.442474,0.52604,3


In [16]:
distortion

360.85203065284895
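
A possible sketch of the assignment step and the distortion follows. The notebook does not define distortion, so the code assumes a common definition: the sum of squared Euclidean distances from each point to its assigned mean, $\sum_i \lVert x_i - \mu_{c(i)} \rVert^2$. Other texts use the average instead of the sum, so the number it produces may not be comparable with the value printed above. The helper name and toy data are made up.

```python
import numpy as np

def assign_and_score(X, means):
    """Hypothetical helper: nearest-mean labels plus the distortion."""
    X = np.asarray(X, dtype=float)
    means = np.asarray(means, dtype=float)
    # dists[i, j] = Euclidean distance from point i to mean j.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    clusters = dists.argmin(axis=1)  # index of the closest mean
    # Assumed definition: sum of squared distances to the assigned mean.
    distortion = float(np.sum(dists[np.arange(len(X)), clusters] ** 2))
    return clusters, distortion

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
means_toy = np.array([[0.0, 1.0], [10.0, 1.0]])
clusters, distortion = assign_and_score(X_toy, means_toy)
print(clusters, distortion)  # [0 0 1 1] 4.0
```

Each toy point sits exactly one unit from its nearest mean, which is why the distortion comes out to 4.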

In [15]:
# Stop and think: How would you recompute the mean?
means = None
# Your solution here
means
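
One way to recompute the means, sketched here with NumPy, is to average the points in each cluster. Keeping the previous mean when a cluster is empty is a design choice made for this example, not something the notebook prescribes; the helper name and toy data are likewise assumptions.

```python
import numpy as np

def recompute_means(X, clusters, old_means):
    """Hypothetical helper: move each mean to the average of its points."""
    X = np.asarray(X, dtype=float)
    clusters = np.asarray(clusters)
    new_means = np.array(old_means, dtype=float)  # copy, so old_means is untouched
    for j in range(len(new_means)):
        members = X[clusters == j]
        if len(members) > 0:  # an empty cluster keeps its previous mean
            new_means[j] = members.mean(axis=0)
    return new_means

X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
clusters_toy = np.array([0, 0, 1, 1])
old = np.array([[1.0, 1.0], [9.0, 1.0], [50.0, 50.0]])  # third mean has no points
print(recompute_means(X_toy, clusters_toy, old))
```

With a DataFrame that carries a `cluster` column, such as `Xc`, `Xc.groupby('cluster').mean()` computes the same averages, although it silently drops clusters that have no points.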

In [7]:
# Stop and think: Now put it all together and iterate
# Your solution here

In [17]:
# Stop and think: Now try it all again with different initial means. Did the distortion go up or down?
# Your interpretation here
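
To see why the starting means matter, the following sketch runs $k$-means twice on a made-up data set, the four corners of a wide rectangle, from two hand-picked starting configurations. The helper `kmeans_from` and the sum-of-squared-distances definition of distortion are assumptions for this example.

```python
import numpy as np

def kmeans_from(X, means, max_iter=100):
    """Hypothetical helper: run k-means from the given starting means."""
    means = np.array(means, dtype=float)
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(len(means)):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    # Assumed definition of distortion: sum of squared distances to the assigned mean.
    distortion = float(np.sum((X - means[labels]) ** 2))
    return labels, distortion

# Toy data (made up): the corners of a wide rectangle.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

_, d_left_right = kmeans_from(X_toy, [[0.0, 0.5], [4.0, 0.5]])
_, d_top_bottom = kmeans_from(X_toy, [[2.0, 0.0], [2.0, 1.0]])
print(d_left_right, d_top_bottom)  # 1.0 16.0
```

Both runs stop because no point changes cluster, yet the second is stuck in a much worse solution. This is one reason to try several initializations and keep the run with the lowest distortion.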

In [18]:
# Stop and think: How would you visualize the clustering?
# Your solution here

In [19]:
# Stop and think: Can you overlay the actual subtypes on this in any manner?
# Your solution here
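
Besides a plot, one simple way to compare clusters with the known subtypes is a contingency table. The sketch below assumes the cluster labels live in a column named `cluster` (as in `Xc`) and the subtypes in a column named `Subtype` (as in `df`); the small DataFrame is a made-up stand-in, not the real data.

```python
import pandas as pd

# Toy stand-in for the notebook's data (values made up for illustration).
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "Subtype": ["LumA", "LumA", "Basal", "Basal", "LumB", "LumB"],
})

# Rows are clusters, columns are subtypes, cells are counts.
table = pd.crosstab(toy["cluster"], toy["Subtype"])
print(table)
```

A cluster whose row is dominated by one column lines up well with that subtype. In a chart, a similar comparison could encode the cluster as color and the subtype as shape.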