<a href="https://colab.research.google.com/github/bhulston/My-Personal-Notes/blob/main/Py_Illustration_of_PCA_Concepts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Principal Component Analysis


*   PCA section: https://docs.google.com/document/d/16a-moWcQAn06rQs0PWrduLf2Vb1TRtl8PEwyjD7bD8A/edit?usp=sharing
*   Eigenvalues & Eigenvectors: https://docs.google.com/document/d/1-2ooepmE7NS6E6guaVHtagsasgY4enBKoCQjFyFopAA/edit






# PCA in Python with scikit-learn

In [None]:
!pip install matplotlib
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler




In [None]:
iris = datasets.load_iris()
X = iris.data
y = iris.target
target = iris.target_names
feature = iris.feature_names
desc = iris.DESCR

print(X.shape)  #(150, 4): 150 rows, 4 columns
print(target)   #3 target classes: setosa, versicolor, virginica
print(feature)  #sepal length, sepal width, petal length, petal width
print(desc)     #full description of the iris dataset




In [None]:
#help(datasets)

#datasets like iris have
  #data (data matrix)
  #target (Classification target)
  #feature_names (Names of dataset columns)
  #target_names (Names of target classes)
  #frame (dataframe adds on target as a column)
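
The attributes listed above can be inspected directly. The following is a minimal sketch, assuming a scikit-learn version that supports the `as_frame=True` option of `load_iris` (the variable name `iris_bunch` is only illustrative):

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to also build pandas objects,
# which is what populates the .frame attribute
iris_bunch = datasets.load_iris(as_frame=True)

frame = iris_bunch.frame  # the 4 feature columns plus a 'target' column

print(frame.shape)                    # rows x columns of the combined dataframe
print(list(frame.columns))            # feature names followed by 'target'
print(list(iris_bunch.target_names))  # names of the target classes
```

Without `as_frame=True`, `data` and `target` come back as NumPy arrays and `frame` is `None`.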

**Step 1: Standardizing Values**



*   The data has to be standardized before we run PCA. PCA looks for the directions of greatest variance, so a feature measured on a larger scale would dominate the components simply because of its units. Once every feature is on the same scale, no feature is favored. Feature scaling matters for gradient descent as well.
*   To standardize, we put each feature on a scale where the mean is 0 and the standard deviation is 1. Each value becomes its distance from the feature mean, measured in standard deviations:

$$
\begin{align}
\text{Z-scoring values: } & value_z = \frac{value_i - \text{mean}(feature)}{\text{std}(feature)}
\end{align}
$$
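
To make the formula concrete, here is a small sketch on made-up numbers (the `toy` array is purely illustrative and not part of the iris data). Both columns end up with mean 0 and standard deviation 1 even though they start on very different scales:

```python
import numpy as np

# Made-up data: column 0 is on a scale of units, column 1 on a scale of thousands
toy = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 2000.0],
                [4.0, 6000.0]])

mean = toy.mean(axis=0)  # one mean per column
std = toy.std(axis=0)    # one population std (ddof=0) per column

# Broadcasting applies the z-score formula to every value in each column
toy_z = (toy - mean) / std

print(toy_z.mean(axis=0))  # approximately 0 for each column
print(toy_z.std(axis=0))   # approximately 1 for each column
```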



In [None]:
#The X variable from earlier has the data that we need to standardize
#X.copy() gives an independent array; without it, iris_x would just be another name for X,
#and any in-place change to iris_x would also overwrite the raw data
iris_x = X.copy()

#compute the column means and stds once, from the raw data, then z-score every row at the same time
#(recomputing them inside a row-by-row loop would mix raw rows with rows that are already scaled)
iris_x = (iris_x - iris_x.mean(axis=0)) / iris_x.std(axis=0)

iris_x[:5] # we see that values have been standardized


array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])

We could also have used the `StandardScaler` we imported:

In [None]:
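scaler = StandardScaler()
#fit() returns the fitted scaler, not the data, so its result must not be assigned back to X
#fit_transform learns each column's mean and std, then applies the z-score in one call
X_scaled = scaler.fit_transform(X)
X_scaled[:5]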
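
One way to check that a manual z-score and `StandardScaler` agree is to compute both from the same raw data and compare them. This is a sketch under the assumption that the manual version uses the population standard deviation (NumPy's default, `ddof=0`), which is what `StandardScaler` uses; the names `X_raw`, `manual` and `scaled` are illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X_raw = datasets.load_iris().data

# Manual z-score: column means and stds are computed once, from the raw data
manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Library version: fit_transform returns the scaled array
scaled = StandardScaler().fit_transform(X_raw)

# Compare with a tolerance, since the two may differ by floating-point rounding
same = np.allclose(manual, scaled)
print(same)
```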

In [None]:
pca_iris = PCA(n_components=3) # number of components to keep
iris_x_new = pca_iris.fit_transform(iris_x) # fit the PCA on iris_x, then project iris_x into the PCA space
  #fit learns the principal axes from iris_x; transform then performs the dimensionality reduction (4 columns -> 3)
#this is the call that actually runs the PCA on the dataset.
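
To connect this call to the eigenvalue and eigenvector notes linked at the top, the sketch below rebuilds the same projection by hand from the covariance matrix. It is an illustration rather than a description of scikit-learn's internals (which may use a different solver), and it assumes the data is standardized with `StandardScaler`. Eigenvectors are only defined up to sign, so the comparison uses absolute values:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(datasets.load_iris().data)

# Covariance matrix of the 4 standardized features (4x4)
cov = np.cov(X_std.T)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # re-sort so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top 3 eigenvectors (X_std is already centered)
projected = X_std @ eigvecs[:, :3]

# Compare against scikit-learn, ignoring the sign of each component
pca = PCA(n_components=3).fit(X_std)
sk_projected = pca.transform(X_std)
matches = np.allclose(np.abs(projected), np.abs(sk_projected))
variances_match = np.allclose(eigvals[:3], pca.explained_variance_)
print(matches, variances_match)
```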

In [None]:
iris_x_new[:5]

In [None]:
fig, axes = plt.subplots(1,2)

axes[0].scatter(iris_x[:,0], iris_x[:,1], c=y)

axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Before PCA')

axes[1].scatter(iris_x_new[:,0], iris_x_new[:,1], c=y)

axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].set_title('After PCA')
plt.show()

In [None]:
pca_iris.explained_variance_ratio_
#each entry is the share of the total variance captured by one component, largest first
#so we can see that most of the explained variance comes from the first component
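
A common follow-up is to accumulate these ratios to decide how many components to keep. The sketch below assumes standardized iris data and a 95% threshold; the threshold is an arbitrary illustrative choice, not a recommendation from this notebook:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(datasets.load_iris().data)

# With no n_components given, PCA keeps every component (4 for iris)
pca_full = PCA().fit(X_std)

# Running total of the variance explained by the first 1, 2, 3, 4 components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # illustrative cut-off
# Index of the first running total that reaches the threshold, plus 1 to get a count
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_keep)
```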

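Looking at the covariance matrix of the projected data, the values on its diagonal are the variances of the three components, and they are equal to the explained variance reported by the PCA (`pca_iris.explained_variance_`).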

In [None]:
np.cov(iris_x_new.T) #np.cov treats each row as one variable by default, so .T turns the 3 component columns into 3 rows
  #entry [i, j] is the covariance between component i and component j
  #there are 3 components, hence a 3x3 matrix is returned
  #the diagonal values (think: [1,1], [2,2], [3,3]) are the variances of each component


In [None]:
pca_iris.explained_variance_ 
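
The match between the covariance diagonal and `explained_variance_` can be checked numerically instead of by eye. This sketch assumes standardized iris data and uses the illustrative names `Z` and `cov_Z`; it also checks that the off-diagonal entries are essentially zero, meaning the components are uncorrelated:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(datasets.load_iris().data)

pca = PCA(n_components=3)
Z = pca.fit_transform(X_std)  # data projected onto the 3 components

cov_Z = np.cov(Z.T)  # 3x3 covariance matrix of the components

# Diagonal: variance of each component, which should equal explained_variance_
diag_matches = np.allclose(np.diag(cov_Z), pca.explained_variance_)

# Off-diagonal: covariance between different components, which should be ~0
off_diag = cov_Z - np.diag(np.diag(cov_Z))
uncorrelated = np.allclose(off_diag, 0.0, atol=1e-10)

print(diag_matches, uncorrelated)
```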

In [None]:
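print(abs( pca_iris.components_ ))
#each row is one principal component and each column is one original feature
  #a larger absolute value means that feature contributes more to that component
  #component 1 has most value in features 1, 3, 4
  #component 2 has most value in feature 2
  #component 3 has most value in features 1, 2, 4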
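
Reading feature positions off a bare array is error-prone, so one option is to label the rows and columns with pandas. This is a sketch assuming standardized iris data; the names `weights` and `top_feature` and the `PC1`..`PC3` labels are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=3).fit(X_std)

# Rows are components, columns are the original features
weights = pd.DataFrame(np.abs(pca.components_),
                       index=["PC1", "PC2", "PC3"],
                       columns=iris.feature_names)

# For each component, the feature with the largest absolute weight
top_feature = weights.idxmax(axis=1)

print(weights.round(3))
print(top_feature)
```

Each row of `components_` is a unit-length vector, so the weights within a row can be compared with each other directly.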
