# Section 33: Principal Component Analysis 

- 05/25/21
- onl01-dtsc-ft-022221

## Learning Objectives


- Gain an intuitive understanding of PCA and eigenvalue decomposition.
- Understand how Principal Component Analysis reduces dimensionality.


- **ACTIVITY: PCA with NHANES**
    - Compress all 1800+ features of the [National Health and Nutrition Examination Survey (NHANES)](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey) down to <10 features.
    - Use PC features to find groups of people in 3D space.
    - Tomorrow: use clustering algorithms to statistically identify groups of people. 
- **ACTIVITY: Follow-Up Feature Selection for Predicting Parkinson's Disease**

## Resources

- Videos:
    - [PCA YouTube Playlist - With statquest and ThreeBlueOneBrown Videos](https://www.youtube.com/playlist?list=PLFknVelSJiSzgzNCV-Wvvk5R8PY2UNype) 
    
- Readings:
    - [In-Depth Article About the Curse of Dimensionality](https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/)
    - [Article: Gentle Introduction to Eigenvalues and Eigenvectors for Machine Learning](https://machinelearningmastery.com/introduction-to-eigendecomposition-eigenvalues-and-eigenvectors/)


## Questions



# Principal Component Analysis 

## PCA Overview

#### Type of Learning
- Unsupervised

#### Assumptions
- Correlation among features

#### Advantages
- Captures most of the variance in a smaller number of features

#### Disadvantages
- The number of principal components to keep (enough to explain most of the variance) must be chosen by the user

#### Requirements 

- Features must be scaled (StandardScaler)
- Sensitive to missing data.
- Sensitive to outliers.
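
To see why scaling is listed as a requirement, here is a minimal sketch on made-up data (the two features and their scales are assumptions chosen for illustration). Without scaling, the feature with the largest raw variance takes over PC1 almost entirely.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(321)

# Two independent toy features on very different scales
# (hypothetical data, e.g. height in metres vs. income in dollars).
small = rng.normal(loc=1.7, scale=0.1, size=500)
large = rng.normal(loc=50_000, scale=15_000, size=500)
X = np.column_stack([small, large])

# Without scaling, the large-scale feature dominates PC1.
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardizing, both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", ratio_raw.round(4))
print("scaled:", ratio_scaled.round(4))
```
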

#### Example Use
- Reducing feature space/dimensionality
- Preprocessing
- Creating a few, informative variables from tons of data

## What is the "curse of dimensionality"?

### Reading: [In-Depth Article About the Curse of Dimensionality](https://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/)

<!-- 
<img src="https://raw.githubusercontent.com/learn-co-students/dsc-curse-of-dimensionality-online-ds-pt-100719/master/images/sparsity.png">

 -->



<img src="https://www.visiondummy.com/wp-content/uploads/2014/04/1Dproblem.png">

<img src="https://www.visiondummy.com/wp-content/uploads/2014/04/overfitting.png">

<img src="https://www.visiondummy.com/wp-content/uploads/2014/04/3Dproblem.png">

<img src="https://www.visiondummy.com/wp-content/uploads/2014/04/3Dproblem_separated.png">

> ...

<img src="https://www.visiondummy.com/wp-content/uploads/2014/04/sparseness.png">

## How does PCA solve this?

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-pca-in-scikitlearn-online-ds-sp-000/master/images/inhouse_pca.png">

### Steps for Performing PCA

The theory behind PCA rests on foundational concepts of linear algebra: PCA re-encodes a dataset in an alternative basis (a new set of axes). Here are the steps:

1. Recenter each feature of the dataset by subtracting that feature's mean from the feature vector
2. Calculate the covariance matrix for your centered dataset
3. Calculate the eigenvectors of the covariance matrix
4. Project the dataset into the new feature space: multiply the mean-centered features by the eigenvectors
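
The four steps above can be sketched directly in NumPy. This is a minimal illustration on random toy data (the data and variable names are assumptions), not how scikit-learn implements PCA internally.

```python
import numpy as np

rng = np.random.default_rng(321)

# Toy data: 200 samples, 3 features that share one underlying signal
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# 1. Recenter each feature by subtracting its mean
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the centered data (features are columns)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvectors of the covariance matrix
#    (eigh is for symmetric matrices and returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the centered data onto the eigenvectors
X_pca = X_centered @ eigvecs

print("variance per PC:", X_pca.var(axis=0, ddof=1).round(3))
print("eigenvalues:    ", eigvals.round(3))
```

The variance of each projected column matches the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".
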

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-unsupervised-learning-online-ds-pt-100719/master/images/pca.gif">

### Definitions/Vocabulary


>- **"Decomposition"**: breaking a matrix down into multiple matrices/vectors that can be combined again to produce the original matrix. 
    - There are many methods of decomposition, besides eigendecomposition. 
    - With time series we will discuss seasonal decomposition: breaking down a time series into seasonal components. 


>- **"Eigendecomposition"** will break down a matrix into 2 matrices: eigenvectors and eigenvalues.
    - "**Eigenvectors** are unit vectors, which means that their length or magnitude is equal to 1.0."*
    - "**Eigenvalues** are coefficients applied to eigenvectors that give the vectors their length or magnitude."*
  
_`*` = from: [Article: Gentle Introduction to Eigenvalues and Eigenvectors for Machine Learning](https://machinelearningmastery.com/introduction-to-eigendecomposition-eigenvalues-and-eigenvectors/)_
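
As a small sketch of these definitions, the snippet below decomposes a hand-picked 2x2 matrix (an assumption for illustration) with NumPy. It checks that the returned eigenvectors have length 1, that each pair satisfies $A\vec{x} = \lambda\vec{x}$, and that the pieces recombine into the original matrix.

```python
import numpy as np

# A small symmetric matrix chosen for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigvecs holds one eigenvector per COLUMN
eigvals, eigvecs = np.linalg.eig(A)

# NumPy returns eigenvectors normalized to unit length
lengths = np.linalg.norm(eigvecs, axis=0)

# Each eigenvector/eigenvalue pair satisfies A @ v == lambda * v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# "Decomposition": the pieces can be recombined into the original matrix
A_rebuilt = eigvecs @ np.diag(eigvals) @ np.linalg.inv(eigvecs)

print("eigenvalues:", eigvals)
print("lengths:    ", lengths)
```
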




>- From Central Lecturer Notebook (updated since video recorded):
    - "Eigenvectors are related to eigenvalues by the following property: $\vec{x}$ is an eigenvector of the matrix $A$ if $A\vec{x} = \lambda\vec{x}$, for some eigenvalue $\lambda$."
    


- "**Principal Components**":
    - The magnitude of the eigenvalue indicates how much variance that eigenvector captures/explains. 
    - The eigenvector that explains the most variance in the data is called the "First Principal Component" or "PC 1".
    - The eigenvector that explains the second-most variance after PC1 is PC2 or the second principal component. 
    
- By selecting the top X principal components, we can capture most of the variance in the data with the fewest features. 
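
One way to pick how many components to keep is to look at the cumulative explained variance. The sketch below uses synthetic data (10 features driven by 2 hidden factors) and a 95% cutoff; both are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(321)

# Synthetic data: 10 observed features built from 2 hidden factors plus noise
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components so we can inspect every eigenvalue
pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_      # sorted: PC1 first
cumulative = np.cumsum(ratios)

# Smallest number of PCs whose cumulative variance reaches the cutoff
cutoff = 0.95  # assumption: a common but arbitrary choice
n_keep = int(np.argmax(cumulative >= cutoff) + 1)

print("explained variance ratio:", ratios.round(3))
print("components needed for 95%:", n_keep)
```
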

### Example Use of PCA from My Neuroscience Research Days

<img src="https://raw.githubusercontent.com/jirvingphd/fsds_070620_FT_cohort_notes/master/images/Offline20Sorter.png">

# ACTIVITY: USING PCA TO COLLAPSE 1800+ HEALTH FEATURES TO <10

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

from sklearn.preprocessing import StandardScaler,LabelEncoder,OneHotEncoder,MinMaxScaler
from sklearn.impute import SimpleImputer,MissingIndicator
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


from ipywidgets import interact
import plotly.express as px
import plotly.io as pio
pio.templates.default='plotly_dark'

np.random.seed(321)

from sklearn.decomposition import PCA

pd.set_option('display.max_columns',0)
pd.set_option('display.max_info_rows',200)
# NOTE: matplotlib >= 3.6 renamed this style to 'seaborn-v0_8-notebook'
plt.style.use('seaborn-notebook')

## Data - NHANES (2013-2014)

<img src="./images/nhanes.jpg">




>The [National Health and Nutrition Examination Survey (NHANES)](https://www.cdc.gov/nchs/nhanes/about_nhanes.htm) is a program of continuous studies designed to assess the health and nutritional status of adults and children in the United States. The survey examines a nationally representative sample of about 5,000 persons located across the country each year. The survey is unique in that it combines interviews and physical examinations. The NHANES interview includes demographic, socioeconomic, dietary, and health-related questions. The examination component consists of medical, dental, and physiological measurements, as well as laboratory tests administered by highly trained medical personnel.

>NHANES is a major program of the National Center for Health Statistics (NCHS). NCHS is part of the Centers for Disease Control and Prevention (CDC) and has the responsibility for producing vital and health statistics for the Nation.

- *The Above was Borrowed (with Permission) from [Kristin's Phase 3 Project](https://github.com/kcoop610/phase-3-project)*


#### LINKS:
- [NHANES Dataset - Kaggle](https://www.kaggle.com/cdc/national-health-and-nutrition-examination-survey)

- [Complete variable list](https://wwwn.cdc.gov/Nchs/Nhanes/Search/variablelist.aspx?Component=Demographics&CycleBeginYear=2013)

In [None]:
import os, sys,glob
folder = 'national-health-and-nutrition-examination-survey/'
os.listdir(folder)

In [None]:
## Use glob to get list of csvs


In [None]:
## Load in all CSVs combined (one liner if you can...)
# Here, it's files[1:] because of an invalid start byte


In [None]:
# So... what does this data look like?


## Task: Compress 1,800+ features down to 6 using PCA. 

### First: need to explore and define our column groups


In [None]:
# Some columns are mostly null data - let's explore


In [None]:
# Create a list of mostly null columns
high_null_cols = None


In [None]:
# Now a list of the rest of columns, which should all be numeric
num_cols = None

In [None]:
# Get a list of categorical columns (that aren't mostly null)
cat_cols = None

In [None]:
# Explore those categorical columns' null values
import missingno


In [None]:
## Check for null values in cat cols


In [None]:
# Any of them have too many uniques to OHE?


In [None]:
## Verify we got all cols
len([*num_cols, *cat_cols, *high_null_cols]) == len(df.columns)
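
A possible way to build the three column groups is sketched below on a tiny made-up DataFrame. The `toy` frame, the 50% null threshold, and the `toy_*` names are all assumptions for illustration, not values from the NHANES data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined survey DataFrame
toy = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 29],
    "bmi": [22.1, 30.4, 27.8, np.nan, 24.9],
    "rare_lab": [np.nan, np.nan, np.nan, 1.2, np.nan],
    "region": ["N", "S", None, "S", "N"],
})

# Fraction of missing values per column
null_frac = toy.isna().mean()

# Columns that are mostly null (threshold is an assumption)
threshold = 0.5
toy_high_null_cols = list(null_frac[null_frac > threshold].index)

# Split the remaining columns by dtype
remaining = toy.drop(columns=toy_high_null_cols)
toy_num_cols = list(remaining.select_dtypes("number").columns)
toy_cat_cols = list(remaining.select_dtypes(exclude="number").columns)

# Same sanity check as above: every column lands in exactly one group
all_cols = [*toy_num_cols, *toy_cat_cols, *toy_high_null_cols]
print(len(all_cols) == len(toy.columns))
```
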

### Pipelines

In [None]:
from sklearn import set_config
set_config(display='diagram')

In [None]:
# Let's discuss - what steps am I doing? Why?
num_transformer = Pipeline(steps=[
    ('num_imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

ohe_transformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))])

high_nulls_transformer = Pipeline(steps=[
    ('null_indicator', MissingIndicator())])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_cols),
        ('cat_ohe', ohe_transformer, cat_cols), 
        ('cat_null', high_nulls_transformer, high_null_cols)])

preprocessor

In [None]:
# Apply preprocessing to entire df and preview data


In [None]:
# Add PCA


In [None]:
# Make sure to grab the step, for explained variance later


In [None]:
# Fit transform with PCA


In [None]:
# Let's name these components


In [None]:
# Add the column names and check out the PC data as a dataframe


In [None]:
# Check how much variance is explained by all of our PCs
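
The cells above can be filled in many ways. As one hedged sketch, the snippet below chains a simplified numeric-only preprocessing step with PCA on random toy data, grabs the fitted PCA step, names the components, and sums the explained variance. The `toy` data and the `pca_pipe` name are assumptions; in the activity the `preprocessor` ColumnTransformer would take the place of the imputer and scaler.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(321)

# Hypothetical numeric data with a few missing values
toy = pd.DataFrame(rng.normal(size=(100, 20)),
                   columns=[f"feat_{i}" for i in range(20)])
toy.iloc[::7, 3] = np.nan

# Preprocessing followed by PCA in a single pipeline
pca_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=6)),
])
pcs = pca_pipe.fit_transform(toy)

# Grab the fitted PCA step so its attributes are available later
pca_step = pca_pipe.named_steps["pca"]

# Name the components and wrap the result in a DataFrame
pc_names = [f"PC{i + 1}" for i in range(pca_step.n_components_)]
df_pcs = pd.DataFrame(pcs, columns=pc_names, index=toy.index)

# Share of the total variance captured by the kept components
total_explained = pca_step.explained_variance_ratio_.sum()
print(df_pcs.head(3))
print(f"total variance explained: {total_explained:.2%}")
```
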


In [None]:
"""
How much of the total variance do these contain?


"""


### So what did we capture?

In [None]:
### Plot PC1 vs PC2


In [None]:
## Turn it into a quick function
def scatterplot_2D():
    pass

In [None]:
## plot pc2 vs pc3 with function


### Make an Interactive Function for Exploring

In [None]:
# Make interactive function to show any comparison
from ipywidgets import interact


## We are only visualizing a small portion of our PC data, let's add another dimension

### Make an interactive plotly scatter3d

In [None]:
def plot_3D_PC():
    pass

### What would we do with this data?

>- Notice how some datapoints seem to form clusters in 3-dimensional space. 
    - Next class we will use K-Means clustering to identify groups of people in our PC data.
    - We will then try to explain those clusters using machine learning models.

___

# Revisiting Parkinson's Disease: Modeling with PCA

>- We previously discussed using feature selection methods to reduce the dimensionality of our Detecting Parkinson's via Speech Statistics dataset.
    - Let's test using PCA to reduce dimensionality 

## Preprocessing Parkinson's Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Preprocessing tools
from sklearn.model_selection import train_test_split,cross_val_predict,cross_validate
from sklearn.preprocessing import MinMaxScaler,StandardScaler,OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE,SMOTENC
from sklearn import metrics

## Models & Utils
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from time import time

In [None]:
%load_ext autoreload
%autoreload 2
import project_functions as pf

In [None]:
# ## Changing Pandas Options to see full columns in previews and info
n=800
pd.set_option('display.max_columns',n)
pd.set_option("display.max_info_rows", n)
pd.set_option('display.max_info_columns',n)
pd.set_option('display.float_format',lambda x: f"{x:.2f}")

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/master/Phase_3/phase_3_project/feature_selection/pd_speech_features.csv',
                 skiprows=1)
df.head(3)

### Train Test Split & Pipelines

In [None]:
## Specifying the target column and the columns to drop from df
target_col = 'class'
drop_cols = ['id']

## making gender a str so its caught by pipeline
df['gender'] = df['gender'].astype(str)

y = df[target_col].copy()
X = df.drop(columns=[target_col,*drop_cols]).copy()


## Train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=321)

display(y_train.value_counts(),X_train.head())

In [None]:
## saving lists of numeric vs categorical features
num_cols = list(X_train.select_dtypes('number').columns)
cat_cols = list(X_train.select_dtypes('object').columns)


## create pipelines and column transformer
num_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='median')),
    ('scale',StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='constant',fill_value='MISSING')),
    ## NOTE: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
    ('encoder',OneHotEncoder(sparse=False,drop='if_binary'))])

print('# of num_cols:',len(num_cols))
print('# of cat_cols:',len(cat_cols))

## COMBINE BOTH PIPELINES INTO ONE WITH COLUMN TRANSFORMER
preprocessor=ColumnTransformer(transformers=[
    ('num',num_transformer,num_cols),
    ('cat',cat_transformer,cat_cols)])


## Fit preprocessing pipeline on training data and pull out the feature names and X_cols
preprocessor.fit(X_train)

## Use the encoder's .get_feature_names_out
## (scikit-learn >= 1.0; older versions used .get_feature_names)
cat_features = list(preprocessor.named_transformers_['cat'].named_steps['encoder']\
                            .get_feature_names_out(cat_cols))
X_cols = num_cols+cat_features

## Transform X_train, X_test and remake dfs
X_train_df = pd.DataFrame(preprocessor.transform(X_train),
                          index=X_train.index, columns=X_cols)
X_test_df = pd.DataFrame(preprocessor.transform(X_test),
                          index=X_test.index, columns=X_cols)

## Preview the transformed training data
X_train_df

### Resample Data

In [None]:
## Boolean mask for SMOTENC: False for each numeric column, True for each categorical (encoded) column
smote_feats = [False]*len(num_cols) +[True]*len(cat_features)
## resample training data
smote = SMOTENC(smote_feats)
X_train_sm,y_train_sm = smote.fit_resample(X_train_df,y_train)
y_train_sm.value_counts()

### Saving `train_test_list` &  `train_test_list_sm`

In [None]:
## saving train_test_list and train_test_list_sm
train_test_list = None
train_test_list_sm = None

## Baseline Model with Original Features

In [None]:
# Baseline RF - original features


In [None]:
# Baseline RF  original features - smote


## Apply PCA to reduce dimensionality

In [None]:
## Setting which version of data is used for PCA
USE_RESAMPLED = False

if USE_RESAMPLED:
    X_tr,y_tr,X_te,y_te = train_test_list_sm
    print('Using resampled data for PCA')
else:
    X_tr,y_tr,X_te,y_te = train_test_list
    print('Using imbalanced data for PCA')


In [None]:
## collapse 753 columns down to 10 with PCA
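
A minimal sketch of this step on synthetic data is shown below (the `make_classification` data and the `*_toy` names are assumptions standing in for the Parkinson's features). The key point is that PCA is fit on the training data only and then used to transform both splits, so nothing from the test set leaks into the components.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=50,
                                   n_informative=8, random_state=321)
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, random_state=321)

# Fit the scaler and PCA on the TRAINING data only
scaler = StandardScaler().fit(X_tr_toy)
pca = PCA(n_components=10).fit(scaler.transform(X_tr_toy))

# Apply the same fitted transformations to both splits
X_tr_pca = pca.transform(scaler.transform(X_tr_toy))
X_te_pca = pca.transform(scaler.transform(X_te_toy))

print(X_tr_toy.shape, "->", X_tr_pca.shape)
print(X_te_toy.shape, "->", X_te_pca.shape)
```

In the notebook itself the data in `train_test_list` has already been scaled by the preprocessing pipeline, so the scaling step here would not need to be repeated.
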


### Saving `train_test_list_pca`

In [None]:
## save train_test_list_pca
train_test_list_pca = None

## Modeling with PC Features

In [None]:
## Fit and evaluate rf model with PC features

## Finding Ideal # of Components

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
## Make a pca_grid_pipe for testing n_components for our RF
pca_grid_pipe = None

In [None]:
## Let's try up to ~half the total number of features
X_train_df.shape[1]//2

In [None]:
## Make n_components from 3 to half of total features (by 3s)
n_components_list = list(range(3,X_train_df.shape[1]//2,3))
print(len(n_components_list))
n_components_list[-1]

In [None]:
## define params grid for search (let's tune for f1_macro)
params = {}


In [None]:
## Check the best params


In [None]:
## Get the best pca pipeline from the gridsearch and evaluate
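
One way the grid search over `n_components` might look is sketched below on synthetic data. The `toy_*` names, the three candidate values, and the small forest are assumptions chosen to keep the example fast. The parameter key `pca__n_components` follows scikit-learn's `<step name>__<parameter>` convention for pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the preprocessed training data
X_toy, y_toy = make_classification(n_samples=200, n_features=30,
                                   n_informative=6, random_state=321)

# PCA followed by a random forest, so n_components is tuned
# together with the model it feeds
toy_pipe = Pipeline(steps=[
    ("pca", PCA()),
    ("rf", RandomForestClassifier(n_estimators=25, random_state=321)),
])

# Candidate numbers of components (illustrative values)
toy_params = {"pca__n_components": [3, 6, 9]}

toy_grid = GridSearchCV(toy_pipe, toy_params, scoring="f1_macro", cv=3)
toy_grid.fit(X_toy, y_toy)

# Best setting and the refit pipeline that uses it
best_n_toy = toy_grid.best_params_["pca__n_components"]
best_pipe_toy = toy_grid.best_estimator_
print("best n_components:", best_n_toy)
```
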


## Making Final Dataset & Model with Best `n_components` 

In [None]:
## save best n_components 
best_n = None
best_n

In [None]:
## Make a final X_train_pca and X_test_pca


In [None]:
## save train_test_list_pca
train_test_list_pca_final = None

In [None]:
## Fit a random forest with the PC data



### Identify Important Features

In [None]:
## Get Built-in Importances



In [None]:
## Calculate permutation importance
from sklearn.inspection import permutation_importance


In [None]:
## get list of features in order
perm_important_features = None
# perm_important_features[:5]
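
As a hedged sketch of the two importance cells above, the snippet below fits a random forest on PC features from synthetic data and ranks the PCs by both the built-in importances and permutation importance. The data, the choice of 5 components, and all variable names are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=5, random_state=321)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=321)

# Reduce to 5 PCs (fit on training data only) and label the columns
pca = PCA(n_components=5).fit(X_tr)
cols = [f"PC{i + 1}" for i in range(5)]
X_tr_pca = pd.DataFrame(pca.transform(X_tr), columns=cols)
X_te_pca = pd.DataFrame(pca.transform(X_te), columns=cols)

rf = RandomForestClassifier(n_estimators=50, random_state=321)
rf.fit(X_tr_pca, y_tr)

# Built-in (impurity-based) importances, computed from the training fit
builtin = pd.Series(rf.feature_importances_, index=cols)
builtin = builtin.sort_values(ascending=False)

# Permutation importance, measured on held-out data
result = permutation_importance(rf, X_te_pca, y_te,
                                n_repeats=5, random_state=321)
perm = pd.Series(result.importances_mean, index=cols)
perm = perm.sort_values(ascending=False)

# PCs ordered from most to least important by permutation importance
perm_order = list(perm.index)
print(builtin.round(3))
print(perm_order)
```
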

## Visualizing Parkinson's PCs

In [None]:
## combine the X and y data as df_pca for visual


In [None]:
## Select sorting of PCs for visualization


# features = list(X_train_pca.columns)
# features = perm_important_features
# features = rf_important_features

In [None]:
@interact(x=features, y=features,z=features)
def plot_3D_PC(x = features[0],
               y = features[1],
               z = features[2]):
    ### Plot the three selected PCs in 3D, colored by class
    df_pca['class'] = df_pca['class'].astype(str)
    pfig = px.scatter_3d(df_pca,x=x,y=y,z=z,color='class')
    pfig.update_traces(marker={'size':2})
    pfig.show(config = dict({'scrollZoom': False}))

### Summary

- PCA is a dimensionality reduction technique that we can apply to our modeling process. 
    - PCA makes it difficult to interpret/understand which features are important.
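
Some interpretability can be recovered by inspecting the loadings in `pca.components_`, which show how much each original feature contributes to each PC. The sketch below uses scikit-learn's bundled iris data purely as a small stand-in; it is not part of this lesson's datasets.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small bundled dataset used only as a stand-in
X = load_iris(as_frame=True).data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Rows = principal components, columns = original features
loadings = pd.DataFrame(pca.components_,
                        columns=X.columns,
                        index=["PC1", "PC2"])

# Original features ranked by the size of their contribution to PC1
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False)

print(loadings.round(2))
print(top_pc1.round(2))
```

Each row of `loadings` is a unit-length eigenvector, so large absolute values mark the original features that drive that component.
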