Creative Commons CC BY 4.0 Lynd Bacon & Associates, Ltd. Not warranted to be suitable for any particular purpose. (You're on your own!)

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1 align='center'>Assignment 2 Rescaling and PCA Examples v1</h1>

Assignment 2 includes objectives involving the rescaling of features, and extracting principal components from features.  What follows are some examples using a subset of the Assignment 2 data.

In [1]:
import numpy as np
import pandas as pd
import os
import joblib
import pickle
import pickleshare
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import math

## Loading Some Example Data

These are a subset of the numerical features used in the assignment.

In [2]:
os.listdir()   #what's in the current working directory (cwd)

['.ipynb_checkpoints',
 '2-Assignment-2-Guide-v1.ipynb',
 'ames-data-info.zip',
 'amesDF.pickle',
 'amesNumDFclean.pickle',
 'amesSelDF.pickle',
 'BenPrescott_Assignment2.ipynb',
 'data-input-select-ex-assignment-2-v1.ipynb',
 'DataDocumentation.txt',
 'decock.pdf',
 'kmeans-assignment-2-ex-v2.ipynb',
 'NAME.docx',
 'rescaling-PCA-ex-assignment-2-v1.ipynb',
 'RF-example-v1.ipynb',
 'saved_notebook.db']

In [3]:
amesDF=pd.read_pickle('amesSelDF.pickle')  # assumes the file is in the cwd
amesDF.dtypes

Lot_Frontage     int64
Lot_Area         int64
Mas_Vnr_Area     int64
Bsmt_Unf_SF      int64
Total_Bsmt_SF    int64
First_Flr_SF     int64
Second_Flr_SF    int64
Gr_Liv_Area      int64
Bedroom_AbvGr    int64
Kitchen_AbvGr    int64
TotRms_AbvGrd    int64
Fireplaces       int64
Garage_Area      int64
Wood_Deck_SF     int64
Open_Porch_SF    int64
Sale_Price       int64
dtype: object

We can cast all these measures to a floating-point type (here, "float32"):

In [4]:
amesDF2=amesDF.astype('float32')
amesDF2.dtypes

Lot_Frontage     float32
Lot_Area         float32
Mas_Vnr_Area     float32
Bsmt_Unf_SF      float32
Total_Bsmt_SF    float32
First_Flr_SF     float32
Second_Flr_SF    float32
Gr_Liv_Area      float32
Bedroom_AbvGr    float32
Kitchen_AbvGr    float32
TotRms_AbvGrd    float32
Fireplaces       float32
Garage_Area      float32
Wood_Deck_SF     float32
Open_Porch_SF    float32
Sale_Price       float32
dtype: object

## Splitting for training and test data

In Assignment 2 you'll be using data to train and validate two different kinds of ensemble learners: a RandomForest (RF) regression model and an AdaBoost regression model.  RF can be validated using "out of bag" (OOB) data points.  AdaBoost doesn't have this characteristic.  You'll be training both kinds of ensemble learners using different versions of your features.  You'll be rescaling your feature data using one of two methods, either "minmax" or standardization rescaling.  Minmax rescaling maps each feature's training values onto the range 0 to 1.  Standardization gives each feature a mean of 0 and a standard deviation of 1 in the training data.  Either way, the features end up on comparable scales.  It's the opinion of many that doing this sort of rescaling before training many ML models will improve a model's performance.

You'll want to use only the training data to "learn" the rescaling transformation, and then apply the learned transformation to both the training data and the test data.  The learning consists of computing the quantities that the rescaling uses.  The minimum and maximum data values are used in minmax rescaling.  The mean and the standard deviation of the data values are used in standardization rescaling.
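
As a rough illustration of what "learning" the rescaling means, the sketch below computes both sets of quantities by hand with NumPy.  The arrays `train` and `test` are synthetic stand-ins invented for this example, not the Ames data, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for training and test feature matrices (NOT the Ames data)
train = rng.normal(loc=[50.0, 1000.0], scale=[10.0, 300.0], size=(200, 2))
test = rng.normal(loc=[50.0, 1000.0], scale=[10.0, 300.0], size=(50, 2))

# minmax: learn each column's min and max from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)
train_mm = (train - col_min) / (col_max - col_min)
test_mm = (test - col_min) / (col_max - col_min)   # can fall outside [0, 1]

# standardization: learn each column's mean and SD from the training data only
col_mean = train.mean(axis=0)
col_sd = train.std(axis=0)        # population SD (ddof=0)
train_z = (train - col_mean) / col_sd
test_z = (test - col_mean) / col_sd

print(train_mm.min(axis=0), train_mm.max(axis=0))
print(train_z.mean(axis=0).round(6), train_z.std(axis=0).round(6))
```

The rescaled training values span exactly 0 to 1 under minmax, while test values can land a little outside that range because the minimum and maximum came from the training data.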

In [5]:
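from sklearn.model_selection import train_test_split
# features: every column except the target, Sale_Price
# (note: this uses the int64 amesDF, not the float32 amesDF2)
X=amesDF.loc[:,~(amesDF.columns.isin(['Sale_Price']))].to_numpy(copy=True)
# target: Sale_Price
y=amesDF.Sale_Price.to_numpy(copy=True)
X.shape
y.shape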

(2930,)

In [6]:
trainX,testX,trainy, testy = train_test_split(X,y,test_size=0.15,
                                              random_state=33)
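
The held-out test set from a split like this can be used to validate either kind of model.  For RF, the OOB idea mentioned earlier might be sketched as follows.  This is a minimal example on synthetic data from `make_regression`, not the assignment data; the names `X_demo`, `y_demo`, and `rf`, and the hyperparameter values, are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (NOT the Ames data)
X_demo, y_demo = make_regression(n_samples=500, n_features=8,
                                 noise=10.0, random_state=33)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.15, random_state=33)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not include that point
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=33)
rf.fit(X_tr, y_tr)

print(f'OOB R^2:           {rf.oob_score_:.3f}')
print(f'held-out test R^2: {rf.score(X_te, y_te):.3f}')
```

The OOB score is computed from the training data alone, so it gives a validation estimate without touching the test set.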

## Rescaling

There are several different ways to rescale features.  You are asked to use either minmax or standardization rescaling.  Here we'll apply the former.  You can use either one when doing the assignment.

_scikit-learn_ includes many different ways of rescaling and transforming data values.  They are summarized at [scikit-learn preprocessing](https://scikit-learn.org/stable/modules/classes.html?highlight=preprocessing#module-sklearn.preprocessing).

In [7]:
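from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(trainX)                 # learn each feature's min and max from the training data only
trainXS=scaler.transform(trainX)   # apply the learned rescaling to the training data
testXS=scaler.transform(testX)     # apply the same learned rescaling to the test data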
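
If you choose standardization instead, the same fit-then-transform pattern applies.  What follows is only a sketch on made-up data; `demo_train`, `demo_test`, and `std_scaler` are hypothetical names, and you would substitute your own training and test arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(33)
# Made-up feature matrices standing in for training and test data
demo_train = rng.uniform(0, 2000, size=(300, 3))
demo_test = rng.uniform(0, 2000, size=(60, 3))

std_scaler = StandardScaler()
std_scaler.fit(demo_train)                      # learns mean_ and scale_ per feature
demo_train_z = std_scaler.transform(demo_train)
demo_test_z = std_scaler.transform(demo_test)   # uses the training mean and SD

print(demo_train_z.mean(axis=0).round(3), demo_train_z.std(axis=0).round(3))
print(demo_test_z.mean(axis=0).round(3), demo_test_z.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1.  The test columns will usually be close to, but not exactly at, those values.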

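Examine the summary statistics that follow to be sure they make sense to you.  Because the scaler was fit on the training data only, rescaled test values can fall outside the 0 to 1 range.  Note that we didn't rescale y, the target variable we're going to predict with some ensemble models.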

In [8]:
# check the stats; easiest using Pandas
pd.DataFrame(trainXS).describe()
pd.DataFrame(testXS).describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 | 440.0 |
| mean | 0.189893 | 0.054373 | 0.079664 | 0.248216 | 0.178356 | 0.175322 | 0.193005 | 0.225935 | 0.3625 | 0.343182 | 0.34965 | 0.157955 | 0.337104 | 0.065267 | 0.06884 |
| std | 0.106842 | 0.067295 | 0.143081 | 0.188616 | 0.06809 | 0.081179 | 0.240741 | 0.101725 | 0.103264 | 0.060822 | 0.122377 | 0.173055 | 0.156356 | 0.087935 | 0.091116 |
| min | 0.0 | 0.001077 | 0.0 | 0.0 | 0.0 | 0.021844 | 0.0 | 0.019593 | 0.0 | 0.0 | 0.076923 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.140575 | 0.036609 | 0.0 | 0.105415 | 0.13617 | 0.115522 | 0.0 | 0.152176 | 0.25 | 0.333333 | 0.230769 | 0.0 | 0.236953 | 0.0 | 0.0 |
| 50% | 0.202875 | 0.048641 | 0.0 | 0.197132 | 0.163666 | 0.156585 | 0.0 | 0.21298 | 0.375 | 0.333333 | 0.307692 | 0.25 | 0.3378 | 0.0 | 0.043127 |
| 75% | 0.249201 | 0.061485 | 0.125181 | 0.355522 | 0.217594 | 0.224375 | 0.39196 | 0.273267 | 0.375 | 0.333333 | 0.384615 | 0.25 | 0.407793 | 0.117978 | 0.107817 |
| max | 0.58147 | 1.309654 | 1.161103 | 0.907962 | 0.513584 | 0.588952 | 1.103098 | 0.818011 | 0.75 | 0.666667 | 0.769231 | 0.75 | 1.049365 | 0.511236 | 0.67655 |


## PCA

Here we follow a process similar to what we did above when we rescaled the training and test data.  We "train" our PCA using our training data, and we apply it to our training data and to our test/validation data.

In [9]:
from sklearn.decomposition import PCA

As an example, here we'll identify the components that account for 90% of the total variation in the training data.  We'll extract these for the training data, and for the test data.

In [10]:
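# a float between 0 and 1 keeps just enough components to explain that share of the variance
pca90=PCA(n_components=0.90,svd_solver='full')
pca90.fit(trainX)                     # learn the components from the (unscaled) training data
trainXPCA=pca90.transform(trainX)     # component scores for the training data
testXPCA=pca90.transform(testX)       # component scores for the test data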

In [11]:
trainXPCA.shape

(2490, 1)
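
Only one component was needed to reach 90% here.  Note that `pca90` was fit on the unscaled `trainX`, and PCA is sensitive to the scale of the features: a feature with a much larger variance than the others can dominate the first component.  The sketch below illustrates that effect on synthetic data (three made-up, independent features; all names are hypothetical), comparing PCA on raw values with PCA on minmax-rescaled values.  It is not a claim about which Ames feature is responsible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(33)
# Three independent synthetic features; the first has a far larger spread
big = rng.normal(10000.0, 8000.0, size=500)
small_a = rng.normal(70.0, 20.0, size=500)
small_b = rng.normal(1.0, 0.5, size=500)
demo = np.column_stack([big, small_a, small_b])

# PCA on the raw values: the large-variance feature dominates
pca_raw = PCA(n_components=0.90, svd_solver='full').fit(demo)

# PCA after minmax rescaling: the features contribute on comparable scales
demo_scaled = MinMaxScaler().fit_transform(demo)
pca_scaled = PCA(n_components=0.90, svd_solver='full').fit(demo_scaled)

print(f'components kept, raw data:      {pca_raw.n_components_}')
print(f'components kept, rescaled data: {pca_scaled.n_components_}')
```

If you want the components to reflect all of your features rather than the ones with the largest raw variances, one option is to fit the PCA on your rescaled training data instead.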

Here are the proportions of total variance the extracted components account for.  There will be one proportion for each component extracted.

In [None]:
print(f'prop. of variance explained: {pca90.explained_variance_ratio_}')


The _scikit-plot_ package provides a graphical way of describing the variance accounted for by the components.  See https://scikit-plot.readthedocs.io/en/stable/decomposition.html
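
If you'd rather not install another package, a similar picture can be drawn with _matplotlib_ alone.  The following is a sketch on synthetic data, not the scikit-plot API; `demo`, `pca_all`, `cum_var`, and `n_for_90` are hypothetical names used only for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(33)
# Six synthetic features with decreasing spread (NOT the Ames data)
demo = rng.normal(size=(400, 6)) * np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

pca_all = PCA().fit(demo)     # keep every component
cum_var = np.cumsum(pca_all.explained_variance_ratio_)
# smallest number of components whose cumulative proportion reaches 0.90
n_for_90 = int(np.searchsorted(cum_var, 0.90) + 1)

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cum_var) + 1), cum_var, marker='o')
ax.axhline(0.90, linestyle='--', color='gray')
ax.set_xlabel('number of components')
ax.set_ylabel('cumulative proportion of variance')
ax.set_title('Cumulative explained variance (synthetic data)')
print(f'components needed for 90%: {n_for_90}')
```

In a notebook with `%matplotlib inline` the figure renders below the cell.  The dashed line marks the 90% threshold used earlier.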