

## Kaggle Housing Prices Prediction using Regression

#  David Muruli
---

In [268]:
# Install pyspark in google colab
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [269]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [270]:
import pyspark
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

from pyspark.ml.regression import LinearRegression

from pyspark.ml.regression import GBTRegressor
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import IsotonicRegression

from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.ml.feature import StandardScaler

from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, PCA
from pyspark.ml import Pipeline
from pyspark.sql.functions import col
%matplotlib inline

In [271]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark Final Project") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [272]:
df = spark.read.csv(
    "drive/MyDrive/spark_class_data/final/train.csv",
    header=True,
    inferSchema=True,
)

In [273]:
df.show(5,True)
df.printSchema()

+---+----------+--------+-----------+-------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+-----------+-----------+---------+------------+---------+--------+-----------+-----------+----------+----------+---------+---------+----------+--------+--------+------------+------------+----------+------------+----------+---------+-----------+-------+---------+----------+----------+--------+--------+------------+---------+------------+------------+--------+--------+------------+------------+-----------+------------+----------+----------+-----------+----------+-----------+------------+----------+----------+----------+----------+----------+----------+-----------+-------------+---------+-----------+--------+------+-----+-----------+-------+------+------+--------+-------------+---------+
| Id|MSSubClass|MSZoning|LotFrontage|LotArea|Street|Alley|LotShape|LandContour|Utilities|LotConfig|LandSlope|Neighborhood|Condition1|Condition

In [274]:
df.describe().show()

+-------+-----------------+------------------+--------+-----------------+------------------+------+-----+--------+-----------+---------+---------+---------+------------+----------+----------+--------+----------+------------------+------------------+------------------+------------------+---------+--------+-----------+-----------+----------+------------------+---------+---------+----------+--------+--------+------------+------------+-----------------+------------+-----------------+-----------------+------------------+-------+---------+----------+----------+-----------------+------------------+-----------------+-----------------+-------------------+--------------------+------------------+-------------------+------------------+-------------------+-----------+------------------+----------+------------------+-----------+----------+------------------+------------+------------------+-----------------+----------+----------+----------+------------------+-----------------+------------------+-----

In [275]:
print("No. of columns(features) in dataset", len(df.columns))

No. of columns(features) in dataset 81


In [276]:
print("No. Neighbourhoods: ", df.groupBy('Neighborhood').count().count())
df.groupBy('Neighborhood').avg('SalePrice').show(25)

No. Neighbourhoods:  25
+------------+------------------+
|Neighborhood|    avg(SalePrice)|
+------------+------------------+
|     Veenker|238772.72727272726|
|     BrkSide|124834.05172413793|
|     NPkVill|142694.44444444444|
|     NridgHt| 316270.6233766234|
|     NoRidge|335295.31707317074|
|      NWAmes| 189050.0684931507|
|     OldTown|128225.30088495575|
|     Gilbert|192854.50632911394|
|     Somerst|225379.83720930232|
|     Crawfor|210624.72549019608|
|       NAmes|         145847.08|
|      IDOTRR|100123.78378378379|
|     Edwards|          128219.7|
|      Sawyer|136793.13513513515|
|     StoneBr|          310499.0|
|     CollgCr|197965.77333333335|
|       SWISU|         142591.36|
|     MeadowV|  98576.4705882353|
|      Timber|242247.44736842104|
|     Blmngtn|194870.88235294117|
|     Mitchel| 156270.1224489796|
|     SawyerW| 186555.7966101695|
|     Blueste|          137500.0|
|      BrDale|         104493.75|
|     ClearCr|212565.42857142858|
+------------+------------------+

In [277]:
neighborhoodMaxPrice = df.groupBy('Neighborhood').max('SalePrice')
neighborhoodMaxPrice.show(25)

+------------+--------------+
|Neighborhood|max(SalePrice)|
+------------+--------------+
|     Veenker|        385000|
|     BrkSide|        223500|
|     NPkVill|        155000|
|     NridgHt|        611657|
|     NoRidge|        755000|
|      NWAmes|        299800|
|     OldTown|        475000|
|     Gilbert|        377500|
|     Somerst|        423000|
|     Crawfor|        392500|
|       NAmes|        345000|
|      IDOTRR|        169500|
|     Edwards|        320000|
|      Sawyer|        190000|
|     StoneBr|        556581|
|     CollgCr|        424870|
|       SWISU|        200000|
|     MeadowV|        151400|
|      Timber|        378500|
|     Blmngtn|        264561|
|     Mitchel|        271000|
|     SawyerW|        320000|
|     Blueste|        151000|
|      BrDale|        125000|
|     ClearCr|        328000|
+------------+--------------+


We need to find which features are correlated with the target feature, the sale price.

In [278]:
catCols = [x for (x, dataType) in df.dtypes if dataType == "string"]


In [279]:
len(catCols)

46

In [280]:
print(catCols)

['MSZoning', 'LotFrontage', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
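
The list above contains `LotFrontage`, `MasVnrArea` and `GarageYrBlt`, which look numeric by name but were inferred as strings. A likely cause (an assumption, not verified in this notebook) is that missing values are stored as the literal text `NA`, which prevents numeric inference. The sketch below uses plain Python and made-up sample values to show how such columns could be spotted; `looks_numeric` and `na_tokens` are hypothetical names, not Spark APIs.

```python
def looks_numeric(values, na_tokens=("NA", "")):
    """Return True if every non-missing value parses as a number.

    `looks_numeric` and `na_tokens` are illustrative names, not Spark APIs.
    """
    present = [v for v in values if v not in na_tokens]
    if not present:
        return False  # nothing but missing values: cannot tell
    for v in present:
        try:
            float(v)
        except ValueError:
            return False
    return True


# Made-up sample values in the style of the columns listed above.
sample = {
    "LotFrontage": ["65", "80", "NA", "60"],
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "GarageYrBlt": ["2003", "NA", "2001", "1998"],
}
numeric_as_string = [name for name, vals in sample.items() if looks_numeric(vals)]
print(numeric_as_string)
```

In Spark, the CSV reader's `nullValue` option can be set to `NA` so that such columns are inferred as numeric. Doing so would change the column counts and the results reported in the rest of this notebook.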


In [281]:
numCols = [x for (x, dataType) in df.dtypes if dataType == "int"]


In [282]:
len(numCols)

35

In [283]:
print(numCols)

['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice']


Find which of the numeric features are correlated with the target feature.

In [284]:
selectNumCols = {}
for colName in numCols:
  # Pearson correlation between SalePrice and each integer column.
  # SalePrice is still in numCols here, so it shows up with correlation 1.0.
  colCorrelation = df.stat.corr('SalePrice', colName)
  if abs(colCorrelation) > 0.5:
    selectNumCols[colName] = colCorrelation
    print( "Correlation to Sale Price for ", colName, colCorrelation)

Correlation to Sale Price for  OverallQual 0.7909816005838053
Correlation to Sale Price for  YearBuilt 0.522897332879497
Correlation to Sale Price for  YearRemodAdd 0.5071009671113869
Correlation to Sale Price for  TotalBsmtSF 0.6135805515591942
Correlation to Sale Price for  1stFlrSF 0.6058521846919153
Correlation to Sale Price for  GrLivArea 0.7086244776126517
Correlation to Sale Price for  FullBath 0.5606637627484453
Correlation to Sale Price for  TotRmsAbvGrd 0.5337231555820284
Correlation to Sale Price for  GarageCars 0.6404091972583519
Correlation to Sale Price for  GarageArea 0.6234314389183622
Correlation to Sale Price for  SalePrice 1.0
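
The filter above keeps columns whose absolute Pearson correlation with `SalePrice` exceeds 0.5, and the target itself passes the filter with a correlation of 1.0. As a minimal sketch of the same idea in plain Python, the hypothetical helpers `pearson` and `select_by_correlation` below compute the coefficient by hand and skip the target column. The toy values are made up purely to exercise the code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(columns, target, threshold=0.5):
    """Keep columns whose |correlation| with `target` exceeds `threshold`.

    The target column itself is skipped, since its correlation is always 1.
    """
    selected = {}
    for name, values in columns.items():
        if name == target:
            continue
        r = pearson(values, columns[target])
        if abs(r) > threshold:
            selected[name] = r
    return selected


# Made-up toy data, only to exercise the functions.
toy = {
    "OverallQual": [5, 6, 7, 8, 9],
    "MoSold": [3, 11, 6, 6, 3],
    "SalePrice": [120000, 150000, 200000, 260000, 330000],
}
selected = select_by_correlation(toy, target="SalePrice")
print(sorted(selected))
```

Skipping the target means the count of selected columns reflects predictors only.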


In [285]:
print("values: " ,selectNumCols)


values:  {'OverallQual': 0.7909816005838053, 'YearBuilt': 0.522897332879497, 'YearRemodAdd': 0.5071009671113869, 'TotalBsmtSF': 0.6135805515591942, '1stFlrSF': 0.6058521846919153, 'GrLivArea': 0.7086244776126517, 'FullBath': 0.5606637627484453, 'TotRmsAbvGrd': 0.5337231555820284, 'GarageCars': 0.6404091972583519, 'GarageArea': 0.6234314389183622, 'SalePrice': 1.0}


List of numeric features to be selected. 

In [286]:
print("No of original numeric columns: ", len(numCols))
print("Final No. of numeric columns(features) that are above the threshold: ", len(selectNumCols))
dct = {k:[v] for k,v in selectNumCols.items()} 
df_numCorr = pd.DataFrame(dct)
df_numCorr.style

No of original numeric columns:  35
Final No. of numeric columns(features) that are above the threshold:  11


| | OverallQual | YearBuilt | YearRemodAdd | TotalBsmtSF | 1stFlrSF | GrLivArea | FullBath | TotRmsAbvGrd | GarageCars | GarageArea | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.790982 | 0.522897 | 0.507101 | 0.613581 | 0.605852 | 0.708624 | 0.560664 | 0.533723 | 0.640409 | 0.623431 | 1.0 |


### **Numeric columns with absolute correlation greater than 0.5: 11 (including SalePrice itself)**

#### Build Baseline regression housing dataset

# Build StringIndexer

In [287]:
stringindexer_stages = [ StringIndexer(inputCol=c, outputCol="idx_"+ c) for c in catCols ]

In [288]:
onehotencoder_stages = [OneHotEncoder(inputCol='idx_' +c, outputCol='onehot_' + c) for c in catCols]
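
To make the two stages concrete: with default settings, `StringIndexer` gives index 0 to the most frequent label, and `OneHotEncoder` drops the last category, so n labels become a vector of length n-1. The plain-Python sketch below mimics that behaviour on made-up values. `index_labels` and `one_hot` are hypothetical helpers, not Spark functions, and ties between equally frequent labels are assumed to be broken alphabetically.

```python
from collections import Counter


def index_labels(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the final label maps to all zeros."""
    size = num_labels - 1 if drop_last else num_labels
    return [1.0 if i == index else 0.0 for i in range(size)]


zoning = ["RL", "RM", "RL", "FV", "RL", "RM"]  # made-up sample values
mapping = index_labels(zoning)
encoded = [one_hot(mapping[z], len(mapping)) for z in zoning]
print(mapping)
print(encoded[:4])
```

Because every distinct string gets its own slot, string-typed columns with many distinct values inflate the assembled vector considerably.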

In [289]:
encoded_category_columns = ['onehot_' + c for c in catCols]

# Drop the target and the row identifier from the numeric predictors.
numCols.remove('SalePrice')
numCols.remove('Id')

# Encoded categorical columns first, then the remaining numeric columns.
base_line_columns = encoded_category_columns + numCols

vectorassembler_stage = VectorAssembler(inputCols=base_line_columns, outputCol="features")

In [290]:
all_stages = stringindexer_stages + onehotencoder_stages + [vectorassembler_stage]
basePipeline = Pipeline(stages=all_stages)

# Transform base line data

In [291]:
final_columns =['features', 'SalePrice']
base_line_tf = basePipeline.fit(df).transform(df).select(final_columns)
base_line_tf.show(5)

+--------------------+---------+
|            features|SalePrice|
+--------------------+---------+
|(792,[0,10,114,11...|   208500|
|(792,[0,7,114,115...|   181500|
|(792,[0,17,114,11...|   223500|
|(792,[0,5,114,115...|   140000|
|(792,[0,46,114,11...|   250000|
+--------------------+---------+


In [292]:
scaler = StandardScaler(inputCol="features",outputCol="scaledFeatures",withStd=True, withMean=True)
scalerModel=scaler.fit(base_line_tf)
base_line_scaled_tf=scalerModel.transform(base_line_tf)
#base_line_tf=scalerModel.transform(base_line_tf)
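
With `withMean=True` and `withStd=True`, `StandardScaler` centres each feature on its mean and scales it to unit standard deviation. The sketch below shows that computation for a single column in plain Python, assuming the sample (n - 1) standard deviation. `standardize` is a hypothetical helper and the values are made up.

```python
import statistics


def standardize(column):
    """Centre a column on its mean and divide by its sample standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)  # sample (n - 1) standard deviation
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]


lot_area = [8450.0, 9600.0, 11250.0, 9550.0, 14260.0]  # made-up sample values
scaled = standardize(lot_area)
print([round(z, 3) for z in scaled])
```

Note that the scaler in this notebook is fitted on the full dataset before the train/test split, so the test rows contribute to the mean and standard deviation. Fitting it on the training split only would avoid that.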

# Split Data into train and test after encoding and transformation

In [293]:
splits = base_line_scaled_tf.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]


## Run Regressions, Start with Linear Regression, Gradient Booster, Random Forest Regressor

In [294]:
base_line_columns =['Regression Algo', 'R2', 'RSME']
base_line_eval_df =pd.DataFrame(columns=base_line_columns)

# Linear Regression

In [295]:
lr = LinearRegression(featuresCol = 'scaledFeatures', labelCol='SalePrice', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [1171.8609608724353,-822.0618088657393,1260.0079594370784,-268.23668345608945,-385.9530566202519,-488.07579954794124,677.5334215600258,614.1175015260485,-796.1781099144316,1420.2295104708285,-56.30289209431225,396.64147552368655,-334.91635968613855,-513.12266926461,1156.6482437980264,-450.0428207463614,36.59545720760404,-486.09540965704184,-443.8455893597602,93.74266501317148,-711.9281510812442,543.3393455660303,19.352547631584795,1425.1764607119442,-257.99707860075625,31.46305564197704,-176.64822972513127,219.43900773141615,-649.433182683776,-1515.5196224263373,-644.2261356801963,-118.7021472072831,50.50804540623128,-713.2482843415552,-31.455331958063674,-2.97372957202975,-289.0643832998443,-153.44804926662866,-328.1989099841921,364.5398147976626,245.33113350499264,-812.2469251906615,-666.8408037233305,-599.187360536589,66.20914057563814,-639.2728072356222,961.6206959813254,438.07701257589696,-128.03584302367247,-304.4596818604172,-1064.7984383096257,2067.6264547377473,-

In [296]:
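# Note: lr_model.summary reports metrics on the training data (train_df),
# whereas the models below are evaluated on test_df.
trainingSummary = lr_model.summary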

eval_results = {'Regression Algo':'Linear Regression','R2':trainingSummary.r2, 'RSME':trainingSummary.rootMeanSquaredError}

base_line_eval_df.loc[len(base_line_eval_df.index)] = eval_results
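
The cell above takes R2 and RMSE from `lr_model.summary`, which describes the fit on `train_df`. The later models are scored on `test_df`, so the figures are not measured on the same data. As a reminder of what the two metrics compute, here is a minimal plain-Python sketch on made-up numbers; `rmse` and `r2` are illustrative helpers, not the Spark evaluator.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up prices and predictions for a tiny "test set".
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(rmse(actual, predicted), r2(actual, predicted))
```

In the notebook itself, a like-for-like figure could be obtained by applying the same `RegressionEvaluator` pattern used for the other models to `lr_model.transform(test_df)`.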


# Gradient Booster

In [297]:
gbt = GBTRegressor(featuresCol="features",labelCol='SalePrice', maxIter=10)
gbt_model = gbt.fit(train_df)

In [298]:
gbt_predictions = gbt_model.transform(test_df)

gbt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="r2")
gbtR2 = gbt_evaluator.evaluate(gbt_predictions)
print("R Squared (R2) on test data = %g" % gbtR2)

R Squared (R2) on test data = 0.688573


In [299]:
gbt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="rmse")

In [300]:
gbtRMSE = gbt_evaluator.evaluate(gbt_predictions)
print("RMSE on test data = %g" % gbtRMSE)

RMSE on test data = 43482.9


In [301]:
eval_results = {'Regression Algo':'Gradient Booster','R2':gbtR2, 'RSME':gbtRMSE}

base_line_eval_df.loc[len(base_line_eval_df.index)] = eval_results

print(base_line_eval_df)

     Regression Algo        R2          RSME
0  Linear Regression  0.973773  12972.232217
1   Gradient Booster  0.688573  43482.899508


# Random Forest Regression

In [302]:
rf = RandomForestRegressor(featuresCol="features",labelCol='SalePrice', maxDepth=15)
rf_model = rf.fit(train_df)

In [303]:
rf_predictions = rf_model.transform(test_df)
rf_predictions.select("prediction","SalePrice","features").show(5)

+------------------+---------+--------------------+
|        prediction|SalePrice|            features|
+------------------+---------+--------------------+
|140760.03408549193|   144000|(792,[0,4,114,115...|
|          92343.75|    87500|(792,[0,4,114,115...|
| 277620.9428571429|   228500|(792,[0,4,114,115...|
|         353146.85|   313000|(792,[0,4,114,115...|
|136542.64621836497|   139000|(792,[0,4,114,115...|
+------------------+---------+--------------------+


In [304]:
rf_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="r2")
rf_R2 = rf_evaluator.evaluate(rf_predictions)
print("R Squared (R2) on test data = %g" % rf_R2)

R Squared (R2) on test data = 0.854129


In [305]:
rf_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="rmse")
rf_RMSE = rf_evaluator.evaluate(rf_predictions)
print("RMSE on test data = %g" % rf_RMSE)

RMSE on test data = 29759.4


In [306]:
eval_results = {'Regression Algo':'Random Forest','R2':rf_R2, 'RSME':rf_RMSE}

base_line_eval_df.loc[len(base_line_eval_df.index)] = eval_results

print(base_line_eval_df)

     Regression Algo        R2          RSME
0  Linear Regression  0.973773  12972.232217
1   Gradient Booster  0.688573  43482.899508
2      Random Forest  0.854129  29759.427365


# Decision Tree Regression

In [307]:
dt = DecisionTreeRegressor(featuresCol="features",labelCol='SalePrice', maxDepth=10)
dt_model = dt.fit(train_df)
dt_predictions = dt_model.transform(test_df)

In [308]:
dt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="r2")
dt_R2 = dt_evaluator.evaluate(dt_predictions)
print("R Squared (R2) on test data = %g" % dt_R2)

R Squared (R2) on test data = 0.679391


In [309]:
dt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="rmse")
dt_RMSE = dt_evaluator.evaluate(dt_predictions)
print("Decision Tree Regressor RMSE = %g" % dt_RMSE )

Decision Tree Regressor RMSE = 44119.3


In [310]:
eval_results = {'Regression Algo':'Decision Tree','R2':dt_R2, 'RSME':dt_RMSE}

base_line_eval_df.loc[len(base_line_eval_df.index)] = eval_results

print(base_line_eval_df)

     Regression Algo        R2          RSME
0  Linear Regression  0.973773  12972.232217
1   Gradient Booster  0.688573  43482.899508
2      Random Forest  0.854129  29759.427365
3      Decision Tree  0.679391  44119.280875


# Isotonic Regression

In [311]:
iso = IsotonicRegression(featuresCol="features", labelCol="SalePrice") 
iso_model=iso.fit(train_df)
iso_predictions = iso_model.transform(test_df)

In [312]:
iso_evaluator = RegressionEvaluator(
    labelCol="SalePrice", predictionCol="prediction", metricName="rmse")
iso_rmse = iso_evaluator.evaluate(iso_predictions)
print("Isotonic Regression  (RMSE) = %g" % iso_rmse)

Isotonic Regression  (RMSE) = 74326.1


In [313]:
iso_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="SalePrice",metricName="r2")
iso_R2 = iso_evaluator.evaluate(iso_predictions)
print("Isotonic Regression R Squared (R2) = %g" % iso_R2)

Isotonic Regression R Squared (R2) = 0.0900843
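
Isotonic regression fits a non-decreasing step function to a single feature. When `featuresCol` is a vector, Spark's `IsotonicRegression` uses only the entry selected by its `featureIndex` parameter (default 0), which is one plausible reason (not verified here) for the low R2 above. The sketch below is a plain-Python pool-adjacent-violators fit on made-up values, assuming the inputs are already sorted by the feature. `isotonic_fit` is a hypothetical helper.

```python
def isotonic_fit(ys):
    """Least-squares non-decreasing fit to ys (assumed already sorted by the feature)."""
    blocks = []  # each block holds [sum of values, number of values]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


prices = [1.0, 3.0, 2.0, 4.0]  # made-up labels, ordered by a single feature
print(isotonic_fit(prices))
```

Each block of violating neighbours is replaced by its mean, so the fitted values never decrease along the feature.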


In [314]:
eval_results = {'Regression Algo':'Isotonic ','R2':iso_R2, 'RSME':iso_rmse}

base_line_eval_df.loc[len(base_line_eval_df.index)] = eval_results

print(base_line_eval_df)

     Regression Algo        R2          RSME
0  Linear Regression  0.973773  12972.232217
1   Gradient Booster  0.688573  43482.899508
2      Random Forest  0.854129  29759.427365
3      Decision Tree  0.679391  44119.280875
4          Isotonic   0.090084  74326.060955


# Summary of Baseline (Encoded Categorical Data) Regressions

In [315]:
base_line_eval_df.style

| | Regression Algo | R2 | RSME |
|---|---|---|---|
| 0 | Linear Regression | 0.973773 | 12972.232217 |
| 1 | Gradient Booster | 0.688573 | 43482.899508 |
| 2 | Random Forest | 0.854129 | 29759.427365 |
| 3 | Decision Tree | 0.679391 | 44119.280875 |
| 4 | Isotonic | 0.090084 | 74326.060955 |


### Summary

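From the summary table, Linear Regression and Random Forest appear to be the best models for predicting the sale price. The Linear Regression figures are not directly comparable with the rest, however: they come from `lr_model.summary`, which reports metrics on the training data, while the other models are scored on the held-out test split. The near-perfect R2 is also hard to reconcile with the correlation analysis, where a threshold of 0.5 on the absolute correlation with the sale price reduces the numeric columns from 35 to 11 (including `SalePrice` itself). Together this suggests that the Linear Regression model is overfitting the data and that the Random Forest regression model is the more trustworthy one.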

In [316]:
def findTopPrincipalComponentNumber(data: pyspark.sql.dataframe.DataFrame,
                                    featureCol: str = 'features',
                                    threshold: float = 0.80) -> tuple:
    """Return (k, covered): the smallest number of principal components whose
    cumulative explained variance reaches `threshold`, and that cumulative share."""
    # Collect the feature vectors to the driver as a dense matrix (rows x features).
    df_detect_np = np.array([row[0].toArray() for row in data.select(featureCol).collect()])
    cov_mat = np.cov(df_detect_np.T)  # covariance matrix of the features
    # The covariance matrix is symmetric, so eigvalsh returns real eigenvalues
    # (np.linalg.eig can return complex values with tiny imaginary parts).
    eigen_vals = np.linalg.eigvalsh(cov_mat)  # one eigenvalue per feature column
    tot = sum(eigen_vals)
    # Share of the total variance carried by each eigenvalue, largest first.
    var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
    # Accumulate the shares until the threshold is crossed; the number of
    # shares needed is the k to pass to PCA().
    cumulativePercept = 0
    cumulativePerceptList = []
    for i in var_exp:
        cumulativePercept += i
        cumulativePerceptList.append(cumulativePercept)
        if cumulativePercept >= threshold:
            break
    return (len(cumulativePerceptList), float(cumulativePerceptList[-1]))
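
The selection step of the function above can be isolated from Spark and NumPy. The sketch below takes a list of eigenvalues (made up here) and returns the smallest number of components whose cumulative share of the total reaches the threshold. `components_for_threshold` is a hypothetical helper written for illustration.

```python
from itertools import accumulate


def components_for_threshold(eigenvalues, threshold=0.80):
    """Return (k, covered): the smallest k whose top-k eigenvalues reach `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, running in enumerate(accumulate(ordered), start=1):
        if running / total >= threshold:
            return k, running / total
    return len(ordered), 1.0


# Made-up eigenvalues of a covariance matrix.
k, covered = components_for_threshold([0.5, 4.0, 1.0, 2.5, 2.0], threshold=0.80)
print(k, round(covered, 2))
```

Sorting first matters: the cumulative share must be built from the largest eigenvalues down, otherwise k would depend on the order in which the eigenvalues happen to be returned.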

### Modelling with Dimension Reduction

In [317]:
#scaler = StandardScaler(inputCol="features",outputCol="scaledFeatures",withStd=True, withMean=True)
#scalerModel=scaler.fit(base_line_tf)

In [318]:
#base_line_scaled_tf=scalerModel.transform(base_line_tf)

In [319]:
#df_pca=base_line_scaled_tf.select('scaledFeatures','SalePrice')

In [320]:
df_pca = base_line_scaled_tf

In [321]:
bestK, coverPercentage=findTopPrincipalComponentNumber(data=df_pca, featureCol='scaledFeatures', threshold=0.80)
pca = PCA(k=bestK, inputCol="scaledFeatures", outputCol="pcaFeatures")
# The features live in a single vector column, so count the vector's entries, not the DataFrame columns.
numFeatures = len(base_line_tf.first()['features'])
print(f"{bestK} principal components cover {round(coverPercentage*100,2)}% of the variance of the original dataset with {numFeatures} feature columns")

386 principal components cover 80.01% of the variance of the original dataset with 792 feature columns


In [322]:
df_pca_reduced = pca.fit(df_pca).transform(df_pca)

print("Type: ", type(df_pca_reduced))

Type:  <class 'pyspark.sql.dataframe.DataFrame'>


In [323]:
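# Note: these are DataFrame column counts, not feature dimensions
# (the feature vector goes from 792 entries to bestK principal components).
print("Dimensions: " ,len(df_pca_reduced.columns))
print("OG Dimensions: " ,len(df.columns))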

Dimensions:  4
OG Dimensions:  81


In [324]:
df_pca_reduced.printSchema()



root
 |-- features: vector (nullable = true)
 |-- SalePrice: integer (nullable = true)
 |-- scaledFeatures: vector (nullable = true)
 |-- pcaFeatures: vector (nullable = true)


In [325]:
splits = df_pca_reduced.randomSplit([0.7, 0.3])
train_df_pca = splits[0]
test_df_pca = splits[1]

# Utilities

In [344]:
columns = ["Regression Algo", "R2", "RSME"]

pca_results_df = pd.DataFrame(columns=columns)


def runRegression(regressor, regressor_label):
  """Fit `regressor` on train_df_pca, score it on test_df_pca, and append R2 and RMSE to pca_results_df."""
  predictions = regressor.fit(train_df_pca).transform(test_df_pca)

  r2_evaluator = RegressionEvaluator(predictionCol="prediction",
                                     labelCol="SalePrice", metricName="r2")
  r2_value = r2_evaluator.evaluate(predictions)

  rmse_evaluator = RegressionEvaluator(predictionCol="prediction",
                                       labelCol="SalePrice", metricName="rmse")
  rmse_value = rmse_evaluator.evaluate(predictions)

  # The 'RSME' key is kept to match the column name used in the results tables.
  eval_results = {'Regression Algo': regressor_label, 'R2': r2_value, 'RSME': rmse_value}

  pca_results_df.loc[len(pca_results_df.index)] = eval_results

  

Run Regressions Similar to the Non-Reduced Data

# Linear Regression

In [345]:


lr = LinearRegression(featuresCol = 'pcaFeatures', labelCol='SalePrice', maxIter=10, regParam=0.3, elasticNetParam=0.8)
runRegression(lr, 'Linear Regression')

# Gradient Boost Regression

In [346]:
gbt = GBTRegressor(featuresCol="pcaFeatures",labelCol='SalePrice', maxIter=10)
runRegression(gbt, 'Gradeient Booster Regression')

### Decision Tree Regression

In [347]:
dt = DecisionTreeRegressor(featuresCol="pcaFeatures",labelCol='SalePrice', maxDepth=10)
runRegression(dt, 'Decision Tree Regression')

### Random Forest Regression

In [348]:
rf = RandomForestRegressor(featuresCol="pcaFeatures",labelCol='SalePrice', maxDepth=15)
runRegression(rf, 'Random Forest Regression')

### Isotonic Regression

In [349]:
iso = IsotonicRegression(featuresCol="pcaFeatures", labelCol="SalePrice") 
runRegression(iso, 'Isotonic Regression')

#### PCA Reduced Result

Before Dimension Reduction

In [350]:
base_line_eval_df.style

| | Regression Algo | R2 | RSME |
|---|---|---|---|
| 0 | Linear Regression | 0.973773 | 12972.232217 |
| 1 | Gradient Booster | 0.688573 | 43482.899508 |
| 2 | Random Forest | 0.854129 | 29759.427365 |
| 3 | Decision Tree | 0.679391 | 44119.280875 |
| 4 | Isotonic | 0.090084 | 74326.060955 |


After Dimension Reduction

In [351]:
pca_results_df.style

| | Regression Algo | R2 | RSME |
|---|---|---|---|
| 0 | Linear Regression | 0.727192 | 42452.458251 |
| 1 | Gradeient Booster Regression | 0.750434 | 40603.782122 |
| 2 | Decision Tree Regression | 0.617234 | 50285.257145 |
| 3 | Random Forest Regression | 0.76969 | 39005.916616 |
| 4 | Isotonic Regression | -0.000496 | 81298.3082 |


#### Conclusion

### Results

Comparing the results before and after dimension reduction suggests that collinearity and/or overfitting from the additional features reduced the reliability of the baseline models. The clearest sign is Linear Regression, whose R2 falls from 0.97 (measured on the training data) to 0.73 on the test split after PCA. Once the dimensions are reduced, the Linear Regression, Gradient Booster, and Random Forest models converge to R2 values between 0.73 and 0.77, with the Random Forest Regressor as the best: it has both the highest R2 (0.770, slightly above the Gradient Booster's 0.750) and the lowest RMSE (about 39,006), suggesting the best fit with the smallest errors. The two experiments use different random train/test splits, so small differences between the tables should be read with caution.

The column counts printed earlier (81 and 4) are DataFrame columns, not feature dimensions: the encoded feature vector has 792 entries, and 386 principal components are needed to cover 80% of its variance. That so many components are needed suggests the dataset is noisy and that better features should be added. The features mostly describe the physical house, but common sense says a purchase decision involves more than the physical: emotional and behavioural factors also guide the buying decision, and including them would likely improve the predictive power of the model.