In [7]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn import model_selection  # replaces sklearn.cross_validation, removed in scikit-learn 0.20
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression, PLSSVD
from sklearn.metrics import mean_squared_error

In [29]:
ames_train = pd.read_csv("ames_train.csv")
ames_test = pd.read_csv("ames_test.csv")
ames_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 82 columns):
Unnamed: 0         1500 non-null int64
PID                1500 non-null int64
area               1500 non-null int64
price              1500 non-null int64
MS.SubClass        1500 non-null int64
MS.Zoning          1500 non-null object
Lot.Frontage       1218 non-null float64
Lot.Area           1500 non-null int64
Street             1500 non-null object
Alley              105 non-null object
Lot.Shape          1500 non-null object
Land.Contour       1500 non-null object
Utilities          1500 non-null object
Lot.Config         1500 non-null object
Land.Slope         1500 non-null object
Neighborhood       1500 non-null object
Condition.1        1500 non-null object
Condition.2        1500 non-null object
Bldg.Type          1500 non-null object
House.Style        1500 non-null object
Overall.Qual       1500 non-null int64
Overall.Cond       1500 non-null int64
Year.Built         15

In [31]:
y = ames_train.price
#x_predictors = pd.concat([ames_train.drop(["price", "PID"], axis=1)], axis = 1).astype("float64")
x_predictors = ames_train.drop(["price", "PID"], axis=1)
x_predictors.head(n=5)

Unnamed: 0.1,Unnamed: 0,area,MS.SubClass,MS.Zoning,Lot.Frontage,Lot.Area,Street,Alley,Lot.Shape,Land.Contour,...,Screen.Porch,Pool.Area,Pool.QC,Fence,Misc.Feature,Misc.Val,Mo.Sold,Yr.Sold,Sale.Type,TotalSq
0,1,1654,70,RL,50.0,4960,Pave,,Reg,Lvl,...,0,0,,,,0,5,2009,WD,1654
1,2,2340,70,RL,66.0,9042,Pave,,Reg,Lvl,...,0,0,,GdPrv,Shed,2500,5,2010,WD,2340
2,3,1466,20,RL,134.0,17755,Pave,,Reg,Lvl,...,100,0,,,,0,11,2006,WD,1466
3,4,1224,120,RM,,4500,Pave,,Reg,Lvl,...,0,0,,,,0,6,2009,WD,1224
4,5,1575,120,RL,50.0,8012,Pave,,Reg,Lvl,...,0,0,,,,0,7,2007,WD,1575


In [32]:
pca = PCA()
X_reduced = pca.fit_transform(scale(x_predictors))

ValueError: could not convert string to float: 'WD '
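The error arises because `scale` expects a purely numeric matrix, while `x_predictors` still contains text columns such as `Sale.Type`. One possible fix, sketched below on a small made-up frame rather than the Ames data, is to one-hot encode the text columns and fill numeric gaps before scaling. The helper name `to_numeric_matrix`, the toy values, and the choice of median imputation are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale


def to_numeric_matrix(df):
    """Hypothetical helper: make a mixed-type frame safe to pass to `scale`."""
    cleaned = df.copy()
    # Treat every non-numeric column as categorical text.
    text_cols = list(cleaned.select_dtypes(exclude="number").columns)
    for col in text_cols:
        # Remove stray whitespace such as the trailing space in 'WD '.
        cleaned[col] = cleaned[col].str.strip()
    # One indicator column per category level; a missing category becomes all zeros.
    encoded = pd.get_dummies(cleaned, columns=text_cols, dtype=float)
    # Assumption: fill missing numeric values with the column median.
    return encoded.fillna(encoded.median())


# Toy data that mimics a few Ames columns (values are illustrative only).
toy = pd.DataFrame({
    "area": [1654, 2340, 1466, 1224],
    "Lot.Frontage": [50.0, 66.0, 134.0, np.nan],
    "Sale.Type": ["WD ", "WD ", "COD", "WD "],
})

X_numeric = to_numeric_matrix(toy)
X_reduced = PCA().fit_transform(scale(X_numeric))
```

On the real data the same call would be `to_numeric_matrix(x_predictors)`. Whether median imputation and all-zero encoding of missing categories suit columns like `Alley` or `Lot.Frontage` is a modelling decision that should be checked against the data dictionary.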

### Description of algorithm

Partial least squares (PLS) is a regression method that overcomes the limitations of ordinary linear regression discussed above (e.g., many collinear predictors, more predictors than samples) by relating the observed predictors to the response variables through latent variables. Essentially, the model assumes that the data are generated by an underlying process driven by a small number of latent variables.

First, two sets of latent variables are extracted from the data: $T$ (the x-scores) from the predictors, and $U$ (the y-scores) from the response variables. These latent vectors are chosen to maximize the covariance between the x-scores and the y-scores.

For classic linear regression, we try to solve the equation $Y = X\beta + \epsilon$, where the ordinary least squares estimate of $\beta$ is $\hat{\beta} = (X^T X)^{-1} X^T Y$. This estimate is obtained by minimizing the sum of squared residuals. However, when the predictors are highly collinear, or when there are more predictors than observations, the matrix $X^T X$ is singular or nearly singular, so the estimate is undefined or unstable. As an alternative that avoids this issue, we implement the PLS algorithm through the following steps:

1) Start with vector $u$. If there is only one response variable, then $ u = y $, otherwise it is one of the columns of $Y$.

2) Calculate the weights for the predictors ($X$) :
$$ w = \frac{X^Tu}{u^Tu} $$

3) Determine $t$ ($X$ scores):
$$ t = Xw $$

4) Now perform similar calculations for $Y$. Calculate the weights for the response variable:
$$ c = \frac{Y^Tt}{t^Tt} $$

5) Determine $u$ ($Y$ scores):
$$ u = \frac{Yc}{c^Tc} $$

6) If there is more than one response variable, then we test whether the $t$ values have converged. If the relative change in $t$ from one iteration to the next, $ \frac{||t_{old} - t_{new}||}{||t_{new}||} $, is not smaller than a threshold value, then we repeat steps 2-5 until convergence is reached.

7) Deflate variables for next iteration.
$$ p = \frac{X^Tt}{t^Tt} $$
$$ X = X - tp^T $$
$$ Y = Y - tc^T $$

8) Repeat steps 1-7 on the deflated $X$ and $Y$ to extract further components, stopping once an additional component is no longer found to be predictive of $Y$.
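
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `nipals_pls` and its parameters are hypothetical, the data are mean-centered first, the response weights are computed from $Y$ ($c = Y^Tt / t^Tt$), and $w$ is rescaled to unit length on each pass (a common addition that keeps the scale of $t$ stable across iterations). The number of components is fixed in advance instead of applying the stopping rule in step 8, and the final coefficient formula $B = W(P^TW)^{-1}C^T$ is the standard way to map the components back to the original predictors.

```python
import numpy as np


def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Hypothetical NIPALS-style PLS sketch; returns coefficients for centered data."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]                        # treat a single response as one column
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xk, Yk = X - x_mean, Y - y_mean           # centered working copies
    W, P, C = [], [], []
    for _ in range(n_components):
        u = Yk[:, [0]]                        # step 1: start from a column of Y
        t_old = None
        for _ in range(max_iter):
            w = Xk.T @ u / (u.T @ u)          # step 2: X weights
            w /= np.linalg.norm(w)            # assumption: unit-length weights
            t = Xk @ w                        # step 3: X scores
            c = Yk.T @ t / (t.T @ t)          # step 4: Y weights
            u = Yk @ c / (c.T @ c)            # step 5: Y scores
            # step 6: stop when the relative change in t is below tol
            if t_old is not None and np.linalg.norm(t_old - t) <= tol * np.linalg.norm(t):
                break
            t_old = t
        p = Xk.T @ t / (t.T @ t)              # step 7: X loadings
        Xk = Xk - t @ p.T                     # deflate X
        Yk = Yk - t @ c.T                     # deflate Y
        W.append(w)
        P.append(p)
        C.append(c)
    W, P, C = np.hstack(W), np.hstack(P), np.hstack(C)
    B = W @ np.linalg.inv(P.T @ W) @ C.T      # coefficients in terms of centered X
    return B, x_mean, y_mean


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

B, x_mean, y_mean = nipals_pls(X_demo, y_demo, n_components=3)
y_hat = (X_demo - x_mean) @ B + y_mean        # fitted values on the original scale
```

With as many components as predictors, and a full-rank $X$, the fit should coincide with ordinary least squares, which makes a convenient sanity check. Using fewer components is what gives PLS its regularizing effect.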