# Illustrating Principal Component Analysis (PCA) with Penguins

In [None]:
# Execute this cell to import libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [None]:
# Build a toy 2-D dataset: x2 follows x1 closely, plus Gaussian noise (standard deviation 2)
x1 = np.linspace(1, 100, 100)
x2 = x1 + np.random.normal(0, 2, 100)

In [None]:
plt.plot(x1,x2,'ko')

In [None]:
# Stack the two features as columns, giving an array of shape (100, 2)
xall = np.array([x1, x2]).T

In [None]:
xall.shape

In [None]:
xall[10]

In [None]:
# initialize the PCA object; the toy data has only two features, so n_components=2 keeps every component
pca = PCA(n_components=2)

In [None]:
# re-express the centered data in principal-component coordinates (nothing is dropped here)
xall_reduced = pca.fit_transform(xall)

In [None]:
plt.plot(xall_reduced[:,0],xall_reduced[:,1],'ko')
plt.xlim(-100,100)
plt.ylim(-100,100)
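What `fit_transform` returns can be reproduced by hand: subtract the per-feature mean, then project onto the rows of `pca.components_`. The following is a minimal sketch of that check. It rebuilds the toy data with a seeded generator (an assumption, so the result is repeatable) rather than reusing `xall`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Seeded stand-in for the toy dataset above (the seed is an assumption for repeatability)
rng = np.random.default_rng(0)
x1 = np.linspace(1, 100, 100)
x2 = x1 + rng.normal(0, 2, 100)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Manual version: center each feature, then project onto each principal direction
X_manual = (X - pca.mean_) @ pca.components_.T

# The two results agree up to floating-point error
print(np.allclose(X_reduced, X_manual))
```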

In [None]:
pca.explained_variance_ratio_

In [None]:
pca.components_

In [None]:
pca.components_[0]

In [None]:
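# Draw the two principal directions (scaled by 100 for visibility) over the original data
fig, ax = plt.subplots(figsize=(5, 5))
ax.plot([0, 100 * pca.components_[0][0]], [0, 100 * pca.components_[0][1]], 'r')  # first PC
ax.plot([0, 100 * pca.components_[1][0]], [0, 100 * pca.components_[1][1]], 'b')  # second PC
ax.set_xlim(-110, 110)
ax.set_ylim(-110, 110)
ax.scatter(xall[:, 0], xall[:, 1], s=2)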
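The directions stored in `pca.components_` can be cross-checked against the eigenvectors of the covariance matrix of the data, and `pca.explained_variance_` against its eigenvalues. This is a minimal sketch of that comparison on a seeded copy of the toy data (the seed is an assumption). Eigenvectors are only defined up to sign, so the comparison uses the absolute value of the dot product.

```python
import numpy as np
from sklearn.decomposition import PCA

# Seeded stand-in for the toy dataset above
rng = np.random.default_rng(0)
x1 = np.linspace(1, 100, 100)
x2 = x1 + rng.normal(0, 2, 100)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)

# Covariance matrix of the features (rowvar=False: each column is a variable)
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse to match PCA's descending order
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# Eigenvalues should match the variance explained by each component
variances_match = np.allclose(eigvals, pca.explained_variance_)

# Each eigenvector should be parallel to the matching component (|dot product| close to 1)
alignment = [abs(eigvecs[:, k] @ pca.components_[k]) for k in range(2)]

print(variances_match)
print(alignment)
```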

As an applied example, I will use Seaborn (`sns`) to import the toy Penguins dataset and I will drop all rows that have NaNs:

In [None]:
p = sns.load_dataset('penguins')
p = p.dropna()

In [None]:
# View the first 5 rows
p.head()

In [None]:
# Each species has measurements that fall into different regions of the feature space
sns.pairplot(data=p,
             hue='species')

For clustering, we'll assume we DO NOT know what the species are, and we will only retain the numerical data.

In [None]:
p_data = p.select_dtypes(include='number')

`p_data` has 333 data samples and 4 columns for the numerical features, as can be seen with the following:

In [None]:
p_data.shape

In [None]:
p_data.head()

This is a small dataset, but let's assume that we are EXTREMELY low on RAM and want to keep fewer features before doing our ML.

We can use PCA (Principal Component Analysis) to identify the directions in our feature space along which the data vary the most. Transforming our coordinates onto those directions lets us keep only the first few of them and so drop down into a lower-dimensional space.

* Here's the documentation for Scikit-Learn's PCA
  * https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
* As an input parameter, we can specify `n_components`
  * If this is assigned an integer value, it will be the number of principal components to retain
  * If this is assigned a float value between 0 and 1, it will be the fraction of the total variance to retain, and PCA keeps the smallest number of components that reaches it
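The two ways of setting `n_components` can be compared side by side. This is a minimal sketch on hypothetical data (four independent Gaussian features with standard deviations 10, 8, 6 and 1, chosen only for illustration), not on the penguins.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 4 features with standard deviations 10, 8, 6 and 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([10.0, 8.0, 6.0, 1.0])

# Integer: keep exactly this many components
pca_int = PCA(n_components=2).fit(X)

# Float between 0 and 1: keep as many components as needed to reach this fraction of variance
pca_frac = PCA(n_components=0.95).fit(X)

print('Integer setting keeps: ', pca_int.n_components_)
print('Fractional setting keeps: ', pca_frac.n_components_)
print('Variance retained by the fractional fit: ', pca_frac.explained_variance_ratio_.sum())
```

With these standard deviations the first two components carry roughly 80% of the variance, so the fractional setting has to keep a third one to pass 95%.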

In [None]:
# initialize the PCA object to only retain the first two principal components
pca = PCA(n_components=2)

In [None]:
# obtain the dimensionally reduced version of our dataset
p_data_reduced = pca.fit_transform(p_data)

* `p_data_reduced` is now a NumPy array rather than a pandas DataFrame
* The above code retains only the first two principal components, so now we'll have reduced the dimensionality of our feature space from 4 to 2:

In [None]:
p_data_reduced.shape

More precisely, the two columns now hold the coordinates of each sample along the first two principal components, and the other two principal components have been dropped. The fraction of the variance in our data that is explained by each retained component is:

In [None]:
pca.explained_variance_ratio_

This means that ~99.99% of the variance lies along the direction specified by the first PC, and only ~0.0078% along the second PC.

It may seem too fortuitous that nearly all the variance lies along one PC, and indeed it is: one feature variable (here `body_mass_g`, recorded in grams) has a much larger range than the others, so its variance overwhelms theirs.
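A quick way to diagnose this is to look at the per-feature variances and at the loadings of the first component. The sketch below uses hypothetical data with made-up column names (`mass_g`, `length_mm`) and made-up scales, not the penguins, so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical measurements on very different scales (names and values are illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'mass_g': rng.normal(4200, 800, 300),    # spread of hundreds of grams
    'length_mm': rng.normal(44, 5, 300),     # spread of a few millimetres
})

# Per-feature variance: the large-scale column dwarfs the other
print(df.var())

pca = PCA(n_components=2).fit(df)

# Almost all the variance ends up on the first component...
print('Explained variance ratio: ', pca.explained_variance_ratio_)

# ...and that component points almost exactly along the large-scale feature
print('Absolute loadings of PC1: ', np.abs(pca.components_[0]))
```

The same check on the notebook's data would be `p_data.var()`.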

We could instead specify the amount of variance we want to retain. To keep just enough principal components to explain a given fraction of the variance, we could write:

In [None]:
pca = PCA(n_components=0.95)
p_data_reduced = pca.fit_transform(p_data)

We would now have kept the smallest number of components that retains at least 95% of the variance. That number can be found with:

In [None]:
pca.n_components_

And the dataset would be reduced in dimensionality to:

In [None]:
p_data_reduced.shape
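The choice made by a fractional `n_components` can be reproduced by hand from the cumulative sum of `explained_variance_ratio_`. This is a minimal sketch on hypothetical data (four independent Gaussian features with illustrative standard deviations), not on the penguins.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 4 features with standard deviations 10, 8, 6 and 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([10.0, 8.0, 6.0, 1.0])

# Fit with every component kept, then accumulate the explained-variance fractions
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print('Cumulative explained variance: ', cumulative)

# Smallest number of components whose cumulative fraction reaches the threshold
threshold = 0.95
n_needed = int(np.argmax(cumulative >= threshold)) + 1

# Compare with what PCA selects when given the same threshold directly
pca_frac = PCA(n_components=threshold).fit(X)
print('By hand: ', n_needed, ' From PCA: ', pca_frac.n_components_)
```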

If nearly all the variance is explained by just one principal component, this is usually a red flag: it can mean that one variable has much larger values than the others, so that its variance dominates, or it can mean that some variables are highly correlated.

I will do one more processing step to put the variable values on a common scale with standard scaling:

In [None]:
s = StandardScaler()
p_data_scaled = s.fit_transform(p_data)

# and use the scaled data when reducing the dimensionality
pca = PCA(n_components=0.95)
p_data_reduced = pca.fit_transform(p_data_scaled)

print('Number of components: ', pca.n_components_)
print('Shape of reduced p_data_scaled: ', p_data_reduced.shape)
print('Variance explained by each PC: ', pca.explained_variance_ratio_)

Now let's do the clustering.  To do K-Means Clustering, I will:
* use Scikit-Learn's `KMeans` (which is already imported)
  * https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
* specify the number of clusters to be 3
* specify `n_init=10`, which means the algorithm is run 10 times with different initial centroid seeds, and the run with the lowest inertia is kept as the final result

In [None]:
kmeans = KMeans(n_clusters=3, n_init=10)

# Here the fit is done on the dimensionally-reduced data, not the whole dataset
kmeans.fit(p_data_reduced)
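The effect of `n_init` can be seen by comparing several single-initialization runs against one fit with `n_init=10`. This is a minimal sketch on hypothetical blob data from `make_blobs`, with `init='random'` chosen (an assumption, not the notebook's setting) to make differences between starts more likely to show up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three Gaussian blobs in 2-D
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)

# Ten separate fits, each with a single random initialization
single_run_inertias = [
    KMeans(n_clusters=3, n_init=1, init='random', random_state=seed).fit(X).inertia_
    for seed in range(10)
]

# One fit that tries ten initializations internally and keeps the lowest-inertia run
best_of_ten = KMeans(n_clusters=3, n_init=10, init='random', random_state=0).fit(X).inertia_

print('Single runs, min and max inertia: ', min(single_run_inertias), max(single_run_inertias))
print('Inertia with n_init=10: ', best_of_ten)
```

Inertia is the sum of squared distances from each sample to its assigned centroid, so lower is better for a fixed number of clusters.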

In [None]:
# add a new column to our original dataframe with the predicted cluster values
p['cluster'] = kmeans.predict(p_data_reduced)

# plot the pairplot again with color representing the predicted cluster value (not the species value!)
sns.pairplot(data=p,
             hue='cluster')

The above colors represent the clusters identified by K-Means Clustering, and we have done the clustering using only 3 principal components rather than the 4 original features.  You can compare against the previous plot showing colors for species to see how well they line up.
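Beyond comparing the plots by eye, the agreement can be tabulated. The sketch below uses a small hypothetical table standing in for the `species` and `cluster` columns. It builds a contingency table with `pd.crosstab` and summarizes the match with scikit-learn's `adjusted_rand_score`, which ignores how the cluster numbers happen to be assigned.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in for the 'species' and 'cluster' columns (values are made up)
toy = pd.DataFrame({
    'species': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'cluster': [2, 2, 2, 0, 0, 1, 1, 1, 1],
})

# Rows are true labels, columns are cluster ids; a clean match has one nonzero cell per row
table = pd.crosstab(toy['species'], toy['cluster'])
print(table)

# 1.0 means identical groupings; values near 0 mean agreement no better than chance
ari = adjusted_rand_score(toy['species'], toy['cluster'])
print('Adjusted Rand index: ', ari)
```

On the notebook's data the same two calls would take `p['species']` and `p['cluster']`.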