# Intro to Principal Components Analysis
stough 202-

A brief look at principal components.

- [The Iris dataset example we're using here](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)
- [Khan Academy](https://www.youtube.com/playlist?list=PLbPhAbAhvjUzeLkPVnv0kc3_9rAfXpGtS) videos on the linear algebra background.
- [Kind of a neat walkthrough](https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/) of this same Iris dataset.

In [1]:
%matplotlib widget
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA

# import some data to play with
iris = datasets.load_iris()

In [2]:
X = iris.data
y = iris.target
dim1 = 1
dim2 = 3

x_min, x_max = X[:, dim1].min() - .5, X[:, dim1].max() + .5
y_min, y_max = X[:, dim2].min() - .5, X[:, dim2].max() + .5

f, ax = plt.subplots(1,1, figsize=(5,5))

# Plot the training points
ax.scatter(X[:, dim1], X[:, dim2], c=y, cmap=plt.cm.Set1,
            edgecolor='k')
plt.xlabel(iris.feature_names[dim1])
plt.ylabel(iris.feature_names[dim2])

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(());


In [3]:
# Mean of each of the 4 features; PCA subtracts this before projecting.
iris.data.mean(axis=0)

array([5.84333333, 3.05733333, 3.758     , 1.19933333])

In [4]:
# To get a better understanding of how the dimensions interact,
# plot the data projected onto the first three principal components.
fig = plt.figure(figsize=(5,5))
# Axes3D(fig) no longer attaches itself to the figure in recent Matplotlib.
ax = fig.add_subplot(projection='3d')

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.xaxis.set_ticklabels([])  # w_xaxis/w_yaxis/w_zaxis were removed from Matplotlib
ax.set_ylabel("2nd eigenvector")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.zaxis.set_ticklabels([])

plt.show()



## Let's explore the PCA itself.

In [5]:
# The first couple of principal components already account for most
# of the variance in the data.
print('The first two PCA dims account for '\
      f'{sum(np.var(X_reduced[:,:2], axis=0)) / sum(np.var(X, axis=0)):.3f} '
      'of the total variance.')

The first two PCA dims account for 0.978 of the total variance.
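The fraction above can be cross-checked against what the fitted model already stores. This is a minimal sketch, assuming the same Iris data and a 3-component fit as in the cells above; the names `frac_from_var` and `frac_from_ratio` are just illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

# Fraction computed directly from the variances of the projected data.
frac_from_var = np.var(X_reduced[:, :2], axis=0).sum() / np.var(X, axis=0).sum()

# Same fraction from the per-component ratios stored on the fitted model.
frac_from_ratio = pca.explained_variance_ratio_[:2].sum()

print(f"{frac_from_var:.3f} {frac_from_ratio:.3f}")
```

The two numbers should agree because the sample-size normalization cancels when one variance is divided by another.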


## The linear combinations required

Each principal component is constructed as a linear combination of the original 4 components.

See [PCA reference](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [6]:
# What linear combinations does the PCA discover?
pca.components_ # Each row is one of the 3 principal directions, a unit vector in the 4D feature space

array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
       [ 0.65658877,  0.73016143, -0.17337266, -0.07548102],
       [-0.58202985,  0.59791083,  0.07623608,  0.54583143]])
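To make "linear combination" concrete, here is a small sketch that rebuilds the projected data by hand: center the data with the fitted mean, then take the dot product with each row of `pca.components_`. It assumes the default `whiten=False`; `X_manual` and `X_sklearn` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA(n_components=3).fit(X)

# Center each feature, then weight the 4 features by each component's row.
X_manual = (X - pca.mean_) @ pca.components_.T

# What scikit-learn computes for the same data.
X_sklearn = pca.transform(X)

print(X_manual.shape)
print(np.allclose(X_manual, X_sklearn))
```

So the first principal coordinate of a flower is roughly 0.36 times its centered sepal length, minus 0.08 times its centered sepal width, and so on along the first row.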

In [7]:
# Notice that they're orthogonal, just like we said before
# with block transform encoding: the dot product is zero
# up to floating-point error.
np.dot(pca.components_[0,:], pca.components_[1,:])

2.914335439641036e-16

In [8]:
# Each component also has unit length (its squared entries sum to 1).
np.sum(pca.components_**2, axis=1)

array([1., 1., 1.])
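Both checks (orthogonal pairs, unit lengths) can be done for every pair at once. This is a sketch under the same assumptions as the cells above: if the rows of `pca.components_` are orthonormal, multiplying the matrix by its own transpose should give the 3x3 identity. The name `gram` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA(n_components=3).fit(X)

# Entry (i, j) is the dot product of component i with component j.
gram = pca.components_ @ pca.components_.T

# Diagonal entries are squared lengths; off-diagonal entries are the pairwise dot products.
print(np.allclose(gram, np.eye(3)))
```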

In [9]:
# Fraction of the total variance captured by each of the 3 components.
pca.explained_variance_ratio_

array([0.92461872, 0.05306648, 0.01710261])
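The plot labels above call the components "eigenvectors". As a hedged illustration of why, the sketch below eigen-decomposes the 4x4 covariance matrix of the Iris features with NumPy and compares the result to the fitted PCA. scikit-learn computes PCA through its own solvers, so this is an equivalent view of the result, not a description of its internals. Eigenvector signs are arbitrary, so directions are compared up to sign.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA(n_components=3).fit(X)

# Covariance of the 4 features (columns are variables).
cov = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue's share of the total variance.
ratio = eigvals / eigvals.sum()

print(np.allclose(ratio[:3], pca.explained_variance_ratio_))
# Eigenvectors are the columns of eigvecs; compare to the component rows up to sign.
print(np.allclose(np.abs(eigvecs[:, :3].T), np.abs(pca.components_)))
```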