<a href="https://colab.research.google.com/github/danielbauer1979/CAS_PredMod/blob/main/pa_pynb_sess9_Unsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Learning: Clustering and PCA Analysis
Dani Bauer, 2022

As mentioned at the beginning of the lecture, there are (at least) two basic learning setups.  In **supervised** machine learning one observes a response $Y$ along with $p$ different features $X=(X_1,X_2,\ldots,X_p)$, where we typically postulate the relationship $Y = f(X)+\varepsilon$ with $\varepsilon$ independent of $X$ and with mean zero. Here quality is usually assessed by the (test/out-of-sample) error that compares predictions and realizations for a separate dataset.  In **unsupervised** learning, we only observe $p$ features $X_1,X_2,\ldots,X_p$, and we would like to learn about their relationship -- without focusing on a supervising outcome.  Of course, the difficulty is how to assess quality in this case -- so unsupervised learning techniques differ considerably from one another, and which one to pick will depend on the nature of the problem.

In this tutorial, we will take a closer look at two algorithms: **Principal Component Analysis (PCA)** and **Clustering**.  There are a variety of other techniques, including anomaly detection, self-organizing maps, association analysis, etc.

As usual, we start by importing the relevant packages:

In [2]:
import numpy as np 
import matplotlib.pyplot as plt  
import matplotlib.lines as mlines
import pandas as pd   
import seaborn as sns
from random import sample

from sklearn.preprocessing import MinMaxScaler, StandardScaler # For rescaling metrics to fit 0 to 1 range
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from scipy.spatial.distance import euclidean

# Principal Component Analysis

## Background

The key idea behind *Principal Component Analysis* (PCA) is to use "meaningful" linear combinations of the data:
$$
Z_m = \sum_{j=1}^p \phi_{jm} X_j \text{ such that }\sum_{j=1}^p \phi^2_{jm} = 1,
$$
where the idea is to choose:

- $\phi_{j1},\,j=1,\ldots,p$ such that the variance of $Z_1$ is maximal (i.e., so that it captures most of the variation in the $X$s)

- $\phi_{j2},\,j=1,\ldots,p$ such that the variance of $Z_2$ is maximal out of all the linear combinations that are uncorrelated with $Z_1$.

- $\phi_{j3},\,j=1,\ldots,p$ such that the variance of $Z_3$ is maximal out of all the linear combinations that are uncorrelated with $Z_1$ and $Z_2$. $\ldots$

Hence, one way to interpret the principal components is as the linear combinations that best reflect the variation in the data.

Importantly, the scale of variables matters, so it is a good idea to center and scale your variables.  Also, the components are only determined up to their sign (plus/minus).  To discern how important the different components are, one typically considers the *Percent of Variance Explained* (PVE), which for centered variables is:
$$
\text{PVE}_m = \frac{\sum_{i=1}^n \left(\sum_{j=1}^p \phi_{jm}x_{ij}\right)^2}{\sum_{i=1}^n \sum_{j=1}^p x_{ij}^2},
$$
and the resulting plot that depicts PVE by principal component is referred to as a *scree plot*.
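As a minimal sketch of the PVE formula, the following computes it directly from the definition and compares the result with the `explained_variance_ratio_` attribute of scikit-learn's `PCA`. The data are simulated purely for illustration: the seed, the sample size, the covariance matrix `cov_demo`, and all variable names are assumptions made for this example. The variables are centered and scaled first, since the formula presumes centered data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical illustration data: three correlated normal variables
rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
cov_demo = np.array([[1.0, 0.6, 0.2],
                     [0.6, 1.0, 0.4],
                     [0.2, 0.4, 1.0]])
X_demo = rng.multivariate_normal(np.zeros(3), cov_demo, size=2000)

# Center and scale, as recommended in the text
X_std = StandardScaler().fit_transform(X_demo)

pca_pve = PCA().fit(X_std)
phi = pca_pve.components_.T  # column m holds the loadings phi_{jm}

# Numerator: sum over i of (sum_j phi_jm * x_ij)^2, one value per component m
scores_pve = X_std @ phi
pve_manual = (scores_pve ** 2).sum(axis=0) / (X_std ** 2).sum()

print(pve_manual)
print(pca_pve.explained_variance_ratio_)
```

Plotting `pve_manual` against the component index would give the scree plot for this toy dataset.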

## Simulated Example

Let's consider a basic example, where we simulate heights and weights of a fictional population according to some arbitrary parameters.  More precisely, we assume that weight and height jointly follow a Normal distribution with a mean weight (in kg) of 70 and a mean height (in cm) of 170, and variances of 25 and 150, respectively.  We assume the correlation parameter is 50%.


In [3]:
mu = (70,170)
cov = np.array([[25, np.sqrt(25)*np.sqrt(150)*0.5], [np.sqrt(25)*np.sqrt(150)*0.5, 150]])
X_raw = np.random.multivariate_normal(mu, cov, size=5000)


Let's do a quick check:

In [None]:
X_raw.mean(axis=0)

In [None]:
np.cov(np.transpose(X_raw))

So the *empirical* moments look similar to our theoretical moments. Let's take a peek:


In [None]:
plt.figure(figsize = (6,4))
plt.scatter(X_raw[:,0], X_raw[:,1], s = 1, color = 'black')
plt.xlabel('weight')
plt.ylabel('height')
plt.show()

Principal components are closely related to the eigen-analysis (eigenvalues and eigenvectors) of the correlation (or covariance) matrix.  Let's illustrate (if you never had a linear algebra class, feel free to skip this part).  First, let's calculate the (empirical) correlation matrix, i.e., the covariance matrix of the scaled data -- remember, in practical settings we usually don't know the underlying parameters:


In [None]:
R = np.corrcoef(np.transpose(X_raw))
R

Let's calculate the eigenvalues and the eigenvectors of `R`.  As a reminder, the eigenvectors decompose a matrix (a.k.a. a linear mapping in finite dimensions) in orthogonal directions, whereas the eigenvalues provide the "length" (importance) of the directions.  In particular, we can represent a symmetric matrix in diagonalized form by relying on eigenvectors and eigenvalues.

In [None]:
Dec = np.linalg.eig(R)
Dec

In [None]:
V = Dec[1] # Matrix of eigenvectors (one per column)
D = np.diag(Dec[0]) # Diagonal matrix of eigenvalues
np.dot(np.dot(V,D),np.transpose(V)) # Calculates V * D * V', which recovers R

To get intuition, let's plot the Eigenvectors:

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_raw)  # center and scale both variables
fig, ax = plt.subplots(figsize = (6,4))
ax.scatter(X_scaled[:,0], X_scaled[:,1], s = 1, color = 'black')
# Eigenvectors of a 2x2 correlation matrix are (1,-1)/sqrt(2) and (1,1)/sqrt(2);
# draw them from the origin in data coordinates
line1 = mlines.Line2D([0, 0.7071], [0, -.7071], color='red')
line2 = mlines.Line2D([0, .7071], [0, .7071], color='yellow')
ax.add_line(line1)
ax.add_line(line2)
ax.set_xlabel('scaled weight')
ax.set_ylabel('scaled height')
plt.show()

It is exactly this notion of "importance" that motivates principal components.  Indeed, the loadings of the principal components are just the eigenvectors, *ordered* by the size of their eigenvalues...
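To make the link between PCA and the eigen-analysis concrete, here is a small sketch that fits scikit-learn's `PCA` to standardized data and compares its loadings with the eigenvectors of the correlation matrix. It simulates its own sample with the same parameters as above; the seed and the names `X_chk`, `R_chk`, and `pca_chk` are assumptions for this illustration. Since components are only determined up to sign, the comparison allows for a sign flip.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)  # arbitrary seed
off_diag = 0.5 * np.sqrt(25) * np.sqrt(150)
X_chk = rng.multivariate_normal((70, 170),
                                [[25, off_diag], [off_diag, 150]],
                                size=5000)

# Eigen-decomposition of the empirical correlation matrix
R_chk = np.corrcoef(X_chk, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R_chk)  # eigh is meant for symmetric matrices
order = np.argsort(eigvals)[::-1]         # sort from largest to smallest eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# PCA on the centered and scaled data
pca_chk = PCA().fit(StandardScaler().fit_transform(X_chk))

# Each row of components_ should match an eigenvector up to sign
same_direction = [
    np.allclose(eigvecs[:, m], pca_chk.components_[m])
    or np.allclose(eigvecs[:, m], -pca_chk.components_[m])
    for m in range(2)
]
print(same_direction)

# Eigenvalue shares should match the proportion of variance explained
print(eigvals / eigvals.sum())
print(pca_chk.explained_variance_ratio_)
```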

# Clustering

## Background

*Clustering* refers to techniques for finding subgroups in a given dataset. The typical approach to determine clusters $C_1,\ldots,C_K$ is to minimize:
$$
\sum_{k=1}^K W(C_k),
$$
where $W$ is a measure of *variation* within a cluster.  For instance, **k-means clustering** uses the squared Euclidean distance to measure variation:
$$
W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2.
$$
The algorithm is implemented via a greedy, iterative procedure that works with the centers of the clusters (referred to as *centroids*).  The number of clusters $K$ must be chosen beforehand.  Another approach is *hierarchical clustering*, where one starts with a larger number of clusters and then *fuses* clusters that are similar (e.g., with regard to the distance between their centroids).
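The centroid-based iteration can be sketched in a few lines of NumPy. This is a simplified illustration of the standard alternating scheme (often called Lloyd's algorithm), not the scikit-learn implementation: the function name `simple_kmeans`, the initialization from randomly chosen data points, the iteration cap, and the toy data are all assumptions made for this example.

```python
import numpy as np

def simple_kmeans(X, K, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest centroid
    and recomputing each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at K distinct, randomly chosen observations
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every centroid: shape (n, K)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster became empty
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change the centroids
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of 50 points each
rng = np.random.default_rng(42)
group_a = rng.normal(loc=(0, 0), scale=1.0, size=(50, 2))
group_b = rng.normal(loc=(6, -6), scale=1.0, size=(50, 2))
X_toy = np.vstack([group_a, group_b])

labels_toy, centroids_toy = simple_kmeans(X_toy, K=2)
print(centroids_toy)
```

Each step of the loop cannot increase the total within-cluster variation, which is why the procedure settles down, though possibly in a local rather than global optimum; this is one reason library implementations typically rerun from several starting points.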

## Simulated Example

Let's consider a very basic simulated example -- let's simulate normal random variables with different means:

In [23]:
X_raw2 = np.random.multivariate_normal((0,0), np.array([[1, 0], [0, 1]]), size=100)
# Shift the first 50 observations so that they form a second group centered at (3, -4)
X_raw2[0:50,0] += 3
X_raw2[0:50,1] -= 4

Let's plot:

In [None]:
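plt.figure(figsize = (6,4))
# First 50 rows (red) versus last 50 rows (black); slices end at 50 and 100 so no row is skipped
plt.scatter(X_raw2[0:50,0], X_raw2[0:50,1], color='red', label='rows 0-49')
plt.scatter(X_raw2[50:100,0], X_raw2[50:100,1], color='black', label='rows 50-99')
plt.xlabel('X0')
plt.ylabel('X1')
plt.legend()
plt.show()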

And let's run k-means clustering:


In [None]:
kmeans = KMeans(n_clusters = 2, init = 'k-means++', max_iter = 1000, random_state = 123)
kmeans.fit(X_raw2)
centroids = kmeans.cluster_centers_
centroids

In [None]:
label = kmeans.labels_  # cluster assignments from the fit above (no need to refit)
label

So the algorithm was able to identify how we set up the data!
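One way to make this check explicit is to compare the fitted labels with the known group membership. Because cluster labels are arbitrary (cluster 0 may correspond to either group), the sketch below takes the better of the two possible matchings. It regenerates its own toy data, so the seed, the helper `agreement`, and the names `X_toy2` and `truth` are assumptions for this illustration. For comparison it also runs `AgglomerativeClustering` (hierarchical clustering) with its default settings.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy data: 100 standard normal points, the first 50 shifted by (3, -4)
rng = np.random.default_rng(123)  # arbitrary seed
X_toy2 = rng.normal(size=(100, 2))
X_toy2[:50, 0] += 3
X_toy2[:50, 1] -= 4
truth = np.repeat([0, 1], 50)  # known group membership

km_labels = KMeans(n_clusters=2, n_init=10, random_state=123).fit_predict(X_toy2)
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X_toy2)

def agreement(labels, truth):
    """Share of points matching the true groups, allowing the two
    cluster labels to be swapped."""
    acc = np.mean(labels == truth)
    return max(acc, 1 - acc)

print(agreement(km_labels, truth))
print(agreement(hc_labels, truth))
```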

# Case Study: County Health Rankings 2013

We analyze [County Health Rankings](https://www.countyhealthrankings.org) in the US in 2013, based on a dataset from the University of Wisconsin Population Health Institute.

## Data Preparation

Let's load the data:

In [None]:
!git clone https://github.com/danielbauer1979/CAS_PredMod.git

In [45]:
health = pd.read_csv('CAS_PredMod/pa_data_countyHealthRR.csv')

In [None]:
health.info()

Unfortunately, we have a bunch of missing data. Let's drop the identifier columns and the columns where we have lots of missing values, and then drop the remaining rows that contain missing values:

In [46]:
health = health.drop(columns=['FIPS','State','County','Perc.Fair.Poor.Health','Perc.Smokers','Perc.Excessive.Drinking','MV.Mortality.Rate','Pr.Care.Physician.Ratio','Perc.No.Soc.Emo.Support'])

In [47]:
health = health.dropna()

In [None]:
scaler = MinMaxScaler()
scaler.fit(health)
health_sc = scaler.transform(health)
health_sc

## K-Means Clustering

In [None]:
wcss = []
for i in range(2, 12):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=1000, n_init=10, random_state=0)
    kmeans.fit(health_sc)
    wcss.append(kmeans.inertia_)
plt.plot(range(2, 12), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In [None]:
kmeans = KMeans(n_clusters=6, init='k-means++', max_iter=1000, n_init=10, random_state=0)
kmeans.fit(health_sc)
kmeans.cluster_centers_

## PCA

In [None]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(health_sc)
principalComponents
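To read the two components, it can help to look at the share of variance they capture and at where the k-means clusters fall in the component space. The sketch below illustrates this on a synthetic stand-in array rather than on the county data, so every name in it (`stand_in`, `pca_hl`, `scores_hl`, `labels_hl`) and the random data itself are assumptions for illustration only; with the actual data one would use `health_sc` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 300 observations of 8 correlated features (NOT the county data)
rng = np.random.default_rng(0)
stand_in = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))
stand_in_sc = MinMaxScaler().fit_transform(stand_in)

# Two principal components and their share of the total variance
pca_hl = PCA(n_components=2)
scores_hl = pca_hl.fit_transform(stand_in_sc)
print(pca_hl.explained_variance_ratio_)

# Cluster in the original (scaled) feature space, then summarize in PC space
labels_hl = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(stand_in_sc)
for k in range(6):
    # Average position of cluster k along the first two components
    print(k, scores_hl[labels_hl == k].mean(axis=0))
```

A scatter plot of `scores_hl[:, 0]` against `scores_hl[:, 1]`, colored by `labels_hl`, gives a two-dimensional view of how the clusters separate.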