Dimension Reduction: Principal Components and Partial Least Squares Regression

Objectives:

    Explain how principal components regression and partial least squares regression work.

    Show Python code to perform Principal Components Regression and Partial Least Squares Regression.

Overview:

Principal Components Regression (PCR) and Partial Least Squares Regression (PLS) are two further alternatives to ordinary least squares fitting that can produce a model with better fit and higher accuracy.  Both are dimension reduction methods, but PCR takes an unsupervised approach, while PLS is a supervised alternative.

Principal Components Regression: 

The approach reduces the predictors to a smaller number of dimensions using principal components analysis (PCA).  These principal components are then used as regressors when fitting a new OLS model.

Since a relatively small number of principal components often explains a large percentage of the variability in the data, a few components may be sufficient to capture the relationship between the target variable and the original, larger set of regressor variables.
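As a minimal sketch of this idea, the snippet below builds synthetic data (an assumption made for illustration, not the Boston data) in which ten predictors are driven by two latent factors, and then reports the cumulative share of variance captured by the leading principal components. The variable names and the noise level are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 10 observed predictors generated from 2 latent factors plus noise
n_samples, n_latent, n_features = 500, 2, 10
latent = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# Standardize the predictors, then fit PCA keeping all components
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Cumulative share of total variance explained by the first k components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative explained variance, first 4 components:", np.round(cum_var[:4], 3))
```

With data built this way, the first two components should account for most of the variance, which is the situation in which PCR is most attractive.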

One drawback of PCR is that it relies on an unsupervised approach to feature reduction: Principal Components Analysis.  PCA sets out to find the linear combinations that best describe the original regressors. Since these linear combinations are found without using the target variable, we can't be certain that the principal components we created are the best ones for predicting the target variable.  It is entirely possible that a different set of linear combinations would perform better. The solution to this problem is Partial Least Squares Regression (more about that later).

Still, PCR is a good way to overcome multicollinearity problems in OLS models, because the principal components are uncorrelated with one another.  Further, since PCA is a dimension reduction approach, PCR may be a good way of attacking problems with high-dimensional covariates.
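A short sketch of the multicollinearity point, using synthetic data in which one predictor is nearly a copy of another (an assumption for illustration): the original predictors are highly correlated, while the principal component scores have no off-diagonal covariance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 300

x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1, so strongly collinear
x3 = rng.normal(size=n)
X_std = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))

# Correlation matrix of the original (standardized) predictors
corr_X = np.corrcoef(X_std, rowvar=False)

# Principal component scores and their covariance matrix
scores = PCA().fit_transform(X_std)
cov_scores = np.cov(scores, rowvar=False)

# Largest absolute off-diagonal covariance between component scores
off_diag = cov_scores - np.diag(np.diag(cov_scores))
max_off_diag = np.abs(off_diag).max()

print("corr(x1, x2):", round(corr_X[0, 1], 4))
print("max |off-diagonal covariance| of PC scores:", max_off_diag)
```

Because the scores are mutually uncorrelated, an OLS fit on them does not suffer from the unstable coefficients that collinear regressors produce.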

 PCR follows three steps: 

1.     Find principal components from the data matrix of original regressors. 

2.     Regress the outcome variable on the selected principal components, which are linear combinations of the original regressors.  This regression is fit with OLS.

3.     Transform the fitted coefficients back to the scale of the original regressors using the PCA loadings, which gives the PCR estimate of the regression coefficients.
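The three steps above can be sketched as follows. This is a minimal illustration on synthetic data, assuming standardized predictors and a hypothetical choice of k = 3 components; it is not the analysis of the Boston data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p, k = 200, 6, 3   # k is an illustrative choice, not a recommendation

# Synthetic predictors and outcome
X = rng.normal(size=(n, p))
true_beta = np.array([1.5, -2.0, 0.5, 0.0, 0.0, 1.0])
y = X @ true_beta + rng.normal(scale=0.5, size=n)

X_std = StandardScaler().fit_transform(X)

# Step 1: find principal components of the regressor matrix
pca = PCA(n_components=k)
Z = pca.fit_transform(X_std)          # scores, shape (n, k)

# Step 2: regress the outcome on the selected components with OLS
ols = LinearRegression().fit(Z, y)    # ols.coef_ has shape (k,)

# Step 3: map the component coefficients back to the standardized regressors
# pca.components_ has shape (k, p), so its transpose holds the loadings
beta_pcr = pca.components_.T @ ols.coef_

# Predictions agree whether computed from the scores or from beta_pcr
pred_from_scores = ols.predict(Z)
pred_from_beta = X_std @ beta_pcr + ols.intercept_
print("PCR coefficients on standardized regressors:", np.round(beta_pcr, 3))
```

Note that `beta_pcr` is expressed on the standardized scale; dividing each entry by the corresponding predictor's standard deviation would give coefficients on the original units.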

 PCR vs. Shrinkage:

Sometimes PCR outperforms shrinkage models (in terms of model error), while at other times shrinkage models are better.  If relatively few principal components are needed to explain the variance in the data and those components also relate to the target, then PCR tends to do well compared with shrinkage methods such as ridge, lasso or elastic net models.  If more principal components are required, then shrinkage methods tend to perform better.  In practice, the comparison is best settled by cross-validation on the data at hand.
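One way to make this comparison is with cross-validated mean squared error. The sketch below uses synthetic data with a two-factor structure and hypothetical settings (two components for PCR, alpha=1.0 for ridge); which method comes out ahead depends on the data and on these tuning choices, so no particular outcome should be read into it.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 300, 20

# Synthetic data: 20 predictors driven by 2 latent factors that also drive y
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.3 * rng.normal(size=(n, p))
y = latent[:, 0] - latent[:, 1] + 0.5 * rng.normal(size=n)

# Each model is a pipeline, so scaling and PCA are refit inside every CV fold
models = {
    "PCR (2 components)": make_pipeline(
        StandardScaler(), PCA(n_components=2), LinearRegression()
    ),
    "Ridge (alpha=1.0)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()   # flip the sign to get a positive MSE
    print(f"{name}: 5-fold CV MSE = {cv_mse[name]:.3f}")
```

Wrapping the preprocessing in a pipeline keeps the held-out fold out of the scaling and PCA fits, which makes the two error estimates comparable.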

Principal Components Regression and the Boston Housing Data:

First, we need to load all required libraries, most of them from the sklearn package.

In [4]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import sklearn
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cross_decomposition import PLSRegression, PLSSVD

In [6]:
# Load the data and split it into features and target
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
boston = load_boston()

boston_features_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
boston_target_df = pd.DataFrame(data=boston.target, columns=['MEDV'])

# Transform data
# Add one to every numeric value so that no zeros remain in the target or the
# features (the Box-Cox transform used next requires strictly positive values)

numeric_cols = [col for col in boston_features_df if boston_features_df[col].dtype.kind != 'O']
numeric_cols2 = [col for col in boston_target_df if boston_target_df[col].dtype.kind != 'O']

boston_features_df[numeric_cols] += 1
boston_target_df[numeric_cols2] += 1
boston_features_df.head()
boston_target_df.head()


   MEDV
0  25.0
1  22.6
2  35.7
3  34.4
4  37.2


In [7]:
# Box-Cox transform the features (all values are positive after the +1 shift)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer

# One standardized Box-Cox transformer per feature column
feature_cols = list(boston_features_df.columns)
column_trans = ColumnTransformer(
    [(col + '_bc', PowerTransformer(method='box-cox', standardize=True), [col])
     for col in feature_cols])

transformed_boxcox = column_trans.fit_transform(boston_features_df)
new_cols = [col + '_bc' for col in feature_cols]

boston_features_bc = pd.DataFrame(transformed_boxcox, columns=new_cols)
boston_features_bc.head()

# Box-Cox transform the target
column_trans = ColumnTransformer(
    [('MEDV_bc', PowerTransformer(method='box-cox', standardize=True), ['MEDV']),
    ])

transformed_boxcox = column_trans.fit_transform(boston_target_df)
new_cols2 = ['MEDV_bc']

boston_target_bc = pd.DataFrame(transformed_boxcox, columns=new_cols2)
boston_target_bc.head()