## Project 3: Housing Price Regressions

## Introduction

Project 3 focuses on regression models. It uses the dataset from a Kaggle competition on predicting house prices from many characteristics, such as year built, roof material, and type of foundation.

This dataset can be found <a href="https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview">here.</a>

I will run three experiments in this project, each fitting a linear regression to a different set of features.

## What is Regression?


Before I show my experiments, I'm going to explain what regression is.

Regression, in essence, analyzes the relationships between variables so that they can be used to predict continuous outcomes. It can relate a single independent variable to a dependent variable, or multiple independent variables to a dependent variable.


From there, different regression algorithms exist to explore these relationships and predict values. For example, a common one (which will be used in all three experiments) is linear regression, which tries to establish a line of best fit that describes the relationship between variables.

These regression algorithms can be evaluated looking at a few different metrics.

R squared / coefficient of determination is a common way of evaluating a regression model. It measures the share of the variance in the target that the model explains: 1 is a perfect fit, 0 is no better than always predicting the mean, and it can go negative on held-out data when the model does worse than that.

Mean Absolute Error (MAE) is the mean of the absolute distances between the actual values and the predicted values. In essence, it's the average of how "off" the predictions are, or the "error." The lower, the better.

Mean Squared Error (MSE) is similar, but instead of taking the absolute value of each error, we square it before averaging, so large errors count for more. Again, lower is better.

Root Mean Squared Error (RMSE) is the square root of Mean Squared Error, which puts the error back in the same units as the target (dollars, in this project). Lower is better.
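To make the four metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The four actual and predicted values are invented for illustration and have nothing to do with the housing data.

```python
import numpy as np

# Toy values, invented for illustration only (not from the housing data).
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 280.0, 260.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean of the absolute errors
mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # back in the units of the target

# R squared: 1 minus (squared error of the model / squared error of
# always predicting the mean of y_true).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R squared:", r2)
```

Note that MAE and RMSE are in the same units as the target, while MSE is in squared units, which is why its value looks so much larger.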

## Pre-Processing

Before jumping straight into an experiment, I'm going to look at the data and see what needs adjusting so that my models can run without any major hitches.

In [1]:
import pandas as pd
import numpy as np
dataset = pd.read_csv("train.csv")
dataset

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [2]:
#check for nulls
null = dataset.isna().sum()[dataset.isna().sum() > 0]
null

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

Some of these features are null for almost every row in this dataset, so I'll drop those features and keep the rest. I'll drop features that have over 100 nulls, as there are only 1460 rows in the first place.

In [3]:
#count nulls once, then drop every column with more than 100 of them
null_counts = dataset.isna().sum()
null_to_drop = null_counts[null_counts > 100]
dataset.drop(columns=null_to_drop.index, inplace=True)
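The same column-dropping rule can be sketched on a tiny frame where the effect is easy to see. The column names and the cutoff of 2 are hypothetical stand-ins for the real columns and the 100-null cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are made up for illustration.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "few_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

null_counts = toy.isna().sum()
threshold = 2  # stands in for the 100-null cutoff used on the real data

# Keep only the columns whose null count is at or below the threshold.
to_drop = null_counts[null_counts > threshold].index
trimmed = toy.drop(columns=to_drop)

print(list(trimmed.columns))
```

Only "mostly_missing" exceeds the cutoff, so it is the only column removed; "few_missing" keeps its single null for a later step to handle.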

In [4]:
#check to make sure it worked
null = dataset.isna().sum()[dataset.isna().sum() > 0]
null

MasVnrType       8
MasVnrArea       8
BsmtQual        37
BsmtCond        37
BsmtExposure    38
BsmtFinType1    37
BsmtFinType2    38
Electrical       1
GarageType      81
GarageYrBlt     81
GarageFinish    81
GarageQual      81
GarageCond      81
dtype: int64

From here, I'm going to drop the rows that still have null values to clean up the data. We will lose some rows, but we should still have plenty left for our regression models.

In [5]:
dataset = dataset.dropna(axis=0)
null = dataset.isna().sum()[dataset.isna().sum() > 0]
null

Series([], dtype: int64)

Now, there are no nulls in the dataset.

Regression needs numeric inputs, so I'm going to check the data type of each feature to make sure it will work properly with the regression models. I'll also see if I need to standardize any of the data before continuing.

In [6]:
dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1338 entries, 0 to 1459
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1338 non-null   int64  
 1   MSSubClass     1338 non-null   int64  
 2   MSZoning       1338 non-null   object 
 3   LotArea        1338 non-null   int64  
 4   Street         1338 non-null   object 
 5   LotShape       1338 non-null   object 
 6   LandContour    1338 non-null   object 
 7   Utilities      1338 non-null   object 
 8   LotConfig      1338 non-null   object 
 9   LandSlope      1338 non-null   object 
 10  Neighborhood   1338 non-null   object 
 11  Condition1     1338 non-null   object 
 12  Condition2     1338 non-null   object 
 13  BldgType       1338 non-null   object 
 14  HouseStyle     1338 non-null   object 
 15  OverallQual    1338 non-null   int64  
 16  OverallCond    1338 non-null   int64  
 17  YearBuilt      1338 non-null   int64  
 18  YearRemo

I'm seeing there are 38 features that are objects, so let's check them out.

In [7]:
objectcols = dataset.select_dtypes(include='object')
objectcols.head(10)

Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
1,RL,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,...,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
2,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
3,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,...,SBrkr,Gd,Typ,Detchd,Unf,TA,TA,Y,WD,Abnorml
4,RL,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
5,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,...,SBrkr,TA,Typ,Attchd,Unf,TA,TA,Y,WD,Normal
6,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
7,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,...,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
8,RM,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,...,FuseF,TA,Min1,Detchd,Unf,Fa,TA,Y,WD,Abnorml
9,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,...,SBrkr,TA,Typ,Attchd,RFn,Gd,TA,Y,WD,Normal


I would like to include as many features as I can in the experiments, so I'm going to change these categorical features into numerical ones using Label Encoding, which assigns each distinct category in a column its own integer. For example, a column with 3 different categories becomes a numerical column containing 0, 1 or 2. The numbers only identify the category, although a linear model will still treat them as ordered quantities. Scikit-learn has a module for this already.
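As a minimal sketch of what the encoder does to a single column, here is `LabelEncoder` applied to a short, made-up list of the quality codes that columns such as ExterQual and KitchenQual use (Ex, Gd, TA, Fa).

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of quality codes, for illustration only.
quality = ["Gd", "TA", "Ex", "Fa", "TA", "Gd"]

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

# classes_ holds the labels in sorted order; each label's position
# in that list is the integer it was assigned.
print(list(encoder.classes_))
print(list(codes))
```

Because the labels are sorted alphabetically, the integers follow the order Ex, Fa, Gd, TA rather than the quality ranking (Ex, Gd, TA, Fa). That is worth keeping in mind when reading correlations for encoded columns later on.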

In [8]:
from sklearn import preprocessing

objectcols = objectcols.apply(preprocessing.LabelEncoder().fit_transform)
objectcols.head(10)

Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,3,1,3,3,0,4,0,5,2,2,...,4,2,6,1,1,4,4,2,8,4
1,3,1,3,3,0,2,0,24,1,2,...,4,3,6,1,1,4,4,2,8,4
2,3,1,0,3,0,4,0,5,2,2,...,4,2,6,1,1,4,4,2,8,4
3,3,1,0,3,0,0,0,6,2,2,...,4,2,6,5,2,4,4,2,8,0
4,3,1,0,3,0,2,0,15,2,2,...,4,2,6,1,1,4,4,2,8,4
5,3,1,0,3,0,4,0,11,2,2,...,4,3,6,1,2,4,4,2,8,4
6,3,1,3,3,0,4,0,21,2,2,...,4,2,6,1,1,4,4,2,8,4
7,3,1,0,3,0,0,0,14,4,2,...,4,3,6,1,1,4,4,2,8,4
8,4,1,3,3,0,4,0,17,0,2,...,1,3,2,5,2,1,4,2,8,0
9,3,1,3,3,0,0,0,3,0,0,...,4,3,6,1,1,2,4,2,8,4


Now we can see that the object data is numerical.

In [9]:
objectcols.dtypes

MSZoning         int64
Street           int64
LotShape         int64
LandContour      int64
Utilities        int64
LotConfig        int64
LandSlope        int64
Neighborhood     int64
Condition1       int64
Condition2       int64
BldgType         int64
HouseStyle       int64
RoofStyle        int64
RoofMatl         int64
Exterior1st      int64
Exterior2nd      int64
MasVnrType       int64
ExterQual        int64
ExterCond        int64
Foundation       int64
BsmtQual         int64
BsmtCond         int64
BsmtExposure     int64
BsmtFinType1     int64
BsmtFinType2     int64
Heating          int64
HeatingQC        int64
CentralAir       int64
Electrical       int64
KitchenQual      int64
Functional       int64
GarageType       int64
GarageFinish     int64
GarageQual       int64
GarageCond       int64
PavedDrive       int64
SaleType         int64
SaleCondition    int64
dtype: object

So, let's add this to our main dataframe by removing the object columns and then concatenating these new numeric features. Let's verify the size/shape of the dataframe first so we can compare at the end.

In [10]:
dataset.shape

(1338, 75)

So after merging, we should still expect 1338 rows and 75 columns.

In [11]:
dataset_nums = dataset.select_dtypes(exclude=['object'])
dataset_objs = dataset.select_dtypes(include='object')

In [12]:
#now to concat back together using objectcols instead of dataset_objs
dataset_clean = pd.concat([dataset_nums, objectcols], axis=1)

In [13]:
dataset_clean.head(10)

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,1,60,8450,7,5,2003,2003,196.0,706,0,...,4,2,6,1,1,4,4,2,8,4
1,2,20,9600,6,8,1976,1976,0.0,978,0,...,4,3,6,1,1,4,4,2,8,4
2,3,60,11250,7,5,2001,2002,162.0,486,0,...,4,2,6,1,1,4,4,2,8,4
3,4,70,9550,7,5,1915,1970,0.0,216,0,...,4,2,6,5,2,4,4,2,8,0
4,5,60,14260,8,5,2000,2000,350.0,655,0,...,4,2,6,1,1,4,4,2,8,4
5,6,50,14115,5,5,1993,1995,0.0,732,0,...,4,3,6,1,2,4,4,2,8,4
6,7,20,10084,8,5,2004,2005,186.0,1369,0,...,4,2,6,1,1,4,4,2,8,4
7,8,60,10382,7,6,1973,1973,240.0,859,32,...,4,3,6,1,1,4,4,2,8,4
8,9,50,6120,7,5,1931,1950,0.0,0,0,...,1,3,2,5,2,1,4,2,8,0
9,10,190,7420,5,6,1939,1950,0.0,851,0,...,4,3,6,1,1,2,4,2,8,4


In [14]:
dataset_clean.shape

(1338, 75)

The shape is what we started with, so no columns got messed up. Let's reconfirm our datatypes as well.

In [15]:
dataset_clean.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1338 entries, 0 to 1459
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1338 non-null   int64  
 1   MSSubClass     1338 non-null   int64  
 2   LotArea        1338 non-null   int64  
 3   OverallQual    1338 non-null   int64  
 4   OverallCond    1338 non-null   int64  
 5   YearBuilt      1338 non-null   int64  
 6   YearRemodAdd   1338 non-null   int64  
 7   MasVnrArea     1338 non-null   float64
 8   BsmtFinSF1     1338 non-null   int64  
 9   BsmtFinSF2     1338 non-null   int64  
 10  BsmtUnfSF      1338 non-null   int64  
 11  TotalBsmtSF    1338 non-null   int64  
 12  1stFlrSF       1338 non-null   int64  
 13  2ndFlrSF       1338 non-null   int64  
 14  LowQualFinSF   1338 non-null   int64  
 15  GrLivArea      1338 non-null   int64  
 16  BsmtFullBath   1338 non-null   int64  
 17  BsmtHalfBath   1338 non-null   int64  
 18  FullBath

All numerical values, so we're almost good to go. I am going to drop the "Id" column because it exists purely for identification, has no meaningful relationship to price, and could throw off the models.

In [16]:
dataset_clean = dataset_clean.drop(columns='Id')

In [17]:
dataset_clean.head(5)

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,60,8450,7,5,2003,2003,196.0,706,0,150,...,4,2,6,1,1,4,4,2,8,4
1,20,9600,6,8,1976,1976,0.0,978,0,284,...,4,3,6,1,1,4,4,2,8,4
2,60,11250,7,5,2001,2002,162.0,486,0,434,...,4,2,6,1,1,4,4,2,8,4
3,70,9550,7,5,1915,1970,0.0,216,0,540,...,4,2,6,5,2,4,4,2,8,0
4,60,14260,8,5,2000,2000,350.0,655,0,490,...,4,2,6,1,1,4,4,2,8,4


In [18]:
dataset_clean.shape

(1338, 74)

Ok, that marks the end of general pre-processing, and now the experiments can begin.

## Experiment 1

For experiment 1, I'm going to use all of the remaining features in a basic linear regression from scikit-learn. This excludes the columns that had over 100 null values, and the categorical data is encoded so it can be included.

This process will include what we're already familiar with - splitting the features from the target (which is SalePrice).

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = dataset_clean.drop(columns = ['SalePrice'])
y = dataset_clean['SalePrice']

In [20]:
X

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,60,8450,7,5,2003,2003,196.0,706,0,150,...,4,2,6,1,1,4,4,2,8,4
1,20,9600,6,8,1976,1976,0.0,978,0,284,...,4,3,6,1,1,4,4,2,8,4
2,60,11250,7,5,2001,2002,162.0,486,0,434,...,4,2,6,1,1,4,4,2,8,4
3,70,9550,7,5,1915,1970,0.0,216,0,540,...,4,2,6,5,2,4,4,2,8,0
4,60,14260,8,5,2000,2000,350.0,655,0,490,...,4,2,6,1,1,4,4,2,8,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,7917,6,5,1999,2000,0.0,0,0,953,...,4,3,6,1,1,4,4,2,8,4
1456,20,13175,6,6,1978,1988,119.0,790,163,589,...,4,3,2,1,2,4,4,2,8,4
1457,70,9042,7,9,1941,2006,0.0,275,0,877,...,4,2,6,1,1,4,4,2,8,4
1458,20,9717,5,6,1950,1996,0.0,49,1029,0,...,0,2,6,1,2,4,4,2,8,4


In [21]:
y

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1338, dtype: int64

In [22]:
#using a 20% split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

lr = LinearRegression()

lr = lr.fit(X_train, y_train)

In [23]:
lr.score(X_test, y_test)

0.8492855694112781

Just looking at the score alone, the model gets an R squared of about .85 on the test set, which is a good score. I'm honestly surprised at how well that did. Let's look at a few more metrics for this model.

In [24]:
from sklearn import metrics

y_pred = lr.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

In [25]:
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Square Error:', rmse)

Mean Absolute Error: 21972.598974447705
Mean Squared Error: 1250472201.2793148
Root Mean Square Error: 35362.01636331439


So to sum up our evaluations, our score or Coefficient of Determination was about .85, with our error scores above. The only thing I'll comment on right now with these three numbers above is that the closer to 0, the better the score. I'll be comparing my other experiments against these metrics to see how well they predict sales prices.
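One caveat when comparing experiments: `train_test_split` shuffles the rows randomly unless `random_state` is set, so each call here produces a different test set. The sketch below shows one way to hold the split fixed while changing only the features. It runs on synthetic data, and the seed values and feature-set names are arbitrary assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing features; not the real data.
X_all = rng.normal(size=(200, 5))
weights = np.array([3.0, 2.0, 1.0, 0.5, 0.1])
y_all = X_all @ weights + rng.normal(scale=0.5, size=200)

# Two hypothetical feature sets, given as column indices.
feature_sets = {"all five": [0, 1, 2, 3, 4], "first two": [0, 1]}

scores = {}
test_targets = {}
for name, cols in feature_sets.items():
    # Same random_state -> the same rows land in the test set each time,
    # so any change in score comes from the features, not the split.
    X_train, X_test, y_train, y_test = train_test_split(
        X_all[:, cols], y_all, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    test_targets[name] = y_test

print(scores)
```

With the split held fixed, both models are scored on exactly the same held-out targets, which makes their R squared and error values directly comparable.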

## Experiment 2

For my second experiment, I'm going to cut out some features to try to home in on the ones that correlate the most with the dependent variable, the price.

So to figure out the features that correlate the most (negatively or positively), I'm going to find and sort the correlation values for features and price.

In [26]:
#shoving the correlation values into one series
correlation = dataset_clean.corrwith(dataset_clean['SalePrice'])
#using absolute value here to equally include values that are negatively correlated with sales price
#I don't want to just use features that are positively correlated
correlation.abs().sort_values(ascending=False).head(11) 

SalePrice      1.000000
OverallQual    0.783546
GrLivArea      0.711706
ExterQual      0.648023
GarageCars     0.640154
KitchenQual    0.616867
BsmtQual       0.611833
GarageArea     0.607535
1stFlrSF       0.604714
TotalBsmtSF    0.602042
FullBath       0.569313
dtype: float64

So, I'm going to take the ten features most strongly correlated (positively or negatively) with sales price and run them through a linear regression again to see how that changes the results.

The features that have survived this round of cleansing are:
<ul>
    <li>OverallQual: Overall material and finish quality</li>
    <li>GrLivArea: Above grade (ground) living area square feet</li>
    <li>ExterQual: Exterior material quality</li>
    <li>GarageCars: Size of garage in car capacity</li>
    <li>KitchenQual: Kitchen quality</li>
    <li>BsmtQual: Height of the basement</li>
    <li>GarageArea: Size of garage in square feet</li>
    <li>1stFlrSF: First floor square feet</li>
    <li>TotalBsmtSF: Total square feet of basement area</li>
    <li>FullBath: Full bathrooms above grade</li>
</ul>

In [27]:
dataset_exp_2 = dataset_clean.filter(['OverallQual', 'GrLivArea', 'ExterQual', 'GarageCars', 'KitchenQual', 
                                      'BsmtQual', 'GarageArea', '1stFlrSF', 'TotalBsmtSF', 'FullBath', 'SalePrice'])

In [28]:
dataset_exp_2

Unnamed: 0,OverallQual,GrLivArea,ExterQual,GarageCars,KitchenQual,BsmtQual,GarageArea,1stFlrSF,TotalBsmtSF,FullBath,SalePrice
0,7,1710,2,2,2,2,548,856,856,2,208500
1,6,1262,3,2,3,2,460,1262,1262,2,181500
2,7,1786,2,2,2,2,608,920,920,2,223500
3,7,1717,3,3,2,3,642,961,756,1,140000
4,8,2198,2,3,2,2,836,1145,1145,2,250000
...,...,...,...,...,...,...,...,...,...,...,...
1455,6,1647,3,2,3,2,460,953,953,2,175000
1456,6,2073,3,2,3,2,500,2073,1542,2,210000
1457,7,2340,0,1,2,3,252,1188,1152,2,266500
1458,5,1078,3,1,2,3,240,1078,1078,1,142125


In [29]:
X = dataset_exp_2.drop(columns = ['SalePrice'])
y = dataset_exp_2['SalePrice']

In [30]:
X

Unnamed: 0,OverallQual,GrLivArea,ExterQual,GarageCars,KitchenQual,BsmtQual,GarageArea,1stFlrSF,TotalBsmtSF,FullBath
0,7,1710,2,2,2,2,548,856,856,2
1,6,1262,3,2,3,2,460,1262,1262,2
2,7,1786,2,2,2,2,608,920,920,2
3,7,1717,3,3,2,3,642,961,756,1
4,8,2198,2,3,2,2,836,1145,1145,2
...,...,...,...,...,...,...,...,...,...,...
1455,6,1647,3,2,3,2,460,953,953,2
1456,6,2073,3,2,3,2,500,2073,1542,2
1457,7,2340,0,1,2,3,252,1188,1152,2
1458,5,1078,3,1,2,3,240,1078,1078,1


In [31]:
y

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1338, dtype: int64

In [32]:
#running same experiment as before, just with 10 most correlated features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

lr2 = LinearRegression()

lr2 = lr2.fit(X_train, y_train)

In [33]:
lr2.score(X_test, y_test)

0.804299671472742

In [34]:
y_pred = lr2.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

In [35]:
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Square Error:', rmse)

Mean Absolute Error: 22115.152219543117
Mean Squared Error: 883770754.9249071
Root Mean Square Error: 29728.282071537655


Alright, let's look at the results and evaluate this experiment.

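The Coefficient of Determination / Score for this experiment was about .80, roughly .045 lower than in our first experiment.

The MAE is higher than in experiment 1, but only by a tad.

The MSE and RMSE are actually lower than in experiment 1, which is surprising to me. One likely explanation is that each experiment calls train_test_split without a fixed random_state, so each model is scored on a different random test set. R squared is measured relative to the spread of prices in that test set, and MSE / RMSE are sensitive to a few expensive outliers landing in it, so the metrics can move in different directions from one split to the next.

So what does this mean for experiment 2? Going by the score, cutting out a lot of the features seems to have made a noticeable difference and lowered the accuracy of the model's predictions, although the mixed error metrics mean the gap may be partly down to the split. Even if the other features were not the most important in the dataset, they still appear to affect the sales price.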
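It can look contradictory for R squared to fall while RMSE also falls. One way this can happen is when the two models are scored on different test sets: R squared compares the model's error to the spread of the test targets, so the same size of error gives a lower score on a test set whose prices are bunched together. A toy sketch with invented numbers:

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R squared, RMSE) for one set of predictions."""
    errors = y_true - y_pred
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot, rmse

# Identical prediction errors applied to two invented test sets.
offsets = np.array([10.0, -10.0, 10.0, -10.0])
wide = np.array([100.0, 200.0, 300.0, 400.0])    # targets spread out
narrow = np.array([230.0, 250.0, 270.0, 250.0])  # targets bunched up

r2_wide, rmse_wide = r2_and_rmse(wide, wide + offsets)
r2_narrow, rmse_narrow = r2_and_rmse(narrow, narrow + offsets)

print("wide:  ", r2_wide, rmse_wide)
print("narrow:", r2_narrow, rmse_narrow)
```

Both cases have an RMSE of 10, yet R squared is about 0.99 for the spread-out targets and 0.5 for the bunched-up ones. Scoring every experiment on the same test rows removes this effect from the comparison.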

## Experiment 3

Moving on to the last experiment, I'm going to go back to my correlations between the features and sales price and use the 10 features with the highest positive correlation instead of the 10 with the highest absolute correlation. What will happen to the score and evaluation of the model when we only include positively correlated features? I'm assuming the score and other metrics will show that it predicts worse, but there's only one way to find out for sure, and to find out by how much.

In [36]:
#pulling up correlations again
correlation = dataset_clean.corrwith(dataset_clean['SalePrice'])
#no absolute value this time
correlation.sort_values(ascending=False).head(11) 

SalePrice       1.000000
OverallQual     0.783546
GrLivArea       0.711706
GarageCars      0.640154
GarageArea      0.607535
1stFlrSF        0.604714
TotalBsmtSF     0.602042
FullBath        0.569313
TotRmsAbvGrd    0.551821
YearBuilt       0.504297
YearRemodAdd    0.501435
dtype: float64

This time, the features that will be included in the model are:
<ul>
    <li>OverallQual: Overall material and finish quality</li>
    <li>GrLivArea: Above grade (ground) living area square feet</li>
    <li>GarageCars: Size of garage in car capacity</li>
    <li>GarageArea: Size of garage in square feet</li>
    <li>1stFlrSF: First floor square feet</li>
    <li>TotalBsmtSF: Total square feet of basement area</li>
    <li>FullBath: Full bathrooms above grade</li>
    <li>TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)</li>
    <li>YearBuilt: Original construction date</li>
    <li>YearRemodAdd: Remodel date</li>
</ul>

So this time, 3 new features are added (TotRmsAbvGrd, YearBuilt, YearRemodAdd) in place of ExterQual, KitchenQual, and BsmtQual. Let's see if this changes things significantly.
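The difference between the two selection rules (ranking by absolute correlation versus ranking by signed correlation) can be sketched on synthetic data. The column names other than SalePrice are hypothetical, and the feature lists are built programmatically here rather than typed out by hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100
price = rng.normal(size=n)

# Synthetic frame: one positively correlated feature, one negatively
# correlated feature, and one unrelated feature. Not the real data.
toy = pd.DataFrame({
    "strong_pos": price + rng.normal(scale=0.1, size=n),
    "strong_neg": -price + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
    "SalePrice": price,
})

# Correlation of every feature with the target, target itself removed.
corr = toy.corrwith(toy["SalePrice"]).drop("SalePrice")

# Rule used in experiment 2: rank by absolute correlation.
top_abs = corr.abs().sort_values(ascending=False).head(2).index.tolist()

# Rule used in experiment 3: rank by signed correlation.
top_pos = corr.sort_values(ascending=False).head(2).index.tolist()

print("absolute:", top_abs)
print("signed:  ", top_pos)
```

Ranking by signed correlation pushes the strongly negative feature to the bottom and lets the unrelated "noise" column in ahead of it, which is the kind of swap that happened between experiments 2 and 3.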

In [37]:
dataset_exp_3 = dataset_clean.filter(['OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', '1stFlrSF', 
                                      'TotalBsmtSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt', 'YearRemodAdd', 
                                      'SalePrice'])
dataset_exp_3

Unnamed: 0,OverallQual,GrLivArea,GarageCars,GarageArea,1stFlrSF,TotalBsmtSF,FullBath,TotRmsAbvGrd,YearBuilt,YearRemodAdd,SalePrice
0,7,1710,2,548,856,856,2,8,2003,2003,208500
1,6,1262,2,460,1262,1262,2,6,1976,1976,181500
2,7,1786,2,608,920,920,2,6,2001,2002,223500
3,7,1717,3,642,961,756,1,7,1915,1970,140000
4,8,2198,3,836,1145,1145,2,9,2000,2000,250000
...,...,...,...,...,...,...,...,...,...,...,...
1455,6,1647,2,460,953,953,2,7,1999,2000,175000
1456,6,2073,2,500,2073,1542,2,7,1978,1988,210000
1457,7,2340,1,252,1188,1152,2,9,1941,2006,266500
1458,5,1078,1,240,1078,1078,1,5,1950,1996,142125


In [38]:
X = dataset_exp_3.drop(columns = ['SalePrice'])
y = dataset_exp_3['SalePrice']

In [39]:
#10 most positively correlated features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

lr3 = LinearRegression()

lr3 = lr3.fit(X_train, y_train)

In [40]:
lr3.score(X_test, y_test)

0.5436689348656747

In [41]:
y_pred = lr3.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

In [42]:
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Square Error:', rmse)

Mean Absolute Error: 22397.91421904186
Mean Squared Error: 2380861166.577862
Root Mean Square Error: 48794.068969269836


So, to look at the results of experiment 3, it actually performed worse than I would've thought. Only 3 features changed, yet the score dropped by roughly .26. This suggests it's important to include features that correlate negatively with the dependent variable as well.

To sum up the metrics: our Coefficient of Determination was about .54, and our MAE was higher than in experiment 1, yet not much higher than in experiment 2. I was curious as to why MAE stayed relatively close across all experiments while MSE / RMSE did not. MSE squares each error before averaging, so a handful of large misses can inflate MSE / RMSE while barely moving MAE; that pattern points to a few badly predicted houses in this test split rather than uniformly worse predictions. MSE / RMSE were significantly higher than in both experiments 1 and 2 (the MSE is almost double that of experiment 1 and about 2.7 times that of experiment 2). It is worth noting that the MSE / RMSE were actually higher in experiment 1 than in experiment 2.

These scores being higher and farther away from 0 in experiment 3 indicate that, on this test split, the model is not as accurate as the other two and that the linear regression makes larger errors.
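Why can MAE stay almost flat while RMSE jumps? Squaring gives large errors much more weight, so a single big miss moves RMSE far more than MAE. A toy sketch with two invented sets of errors that share the same MAE:

```python
import numpy as np

# Two invented sets of prediction errors with the same total size.
error_sets = {
    "even": np.array([10.0, 10.0, 10.0, 10.0]),
    "one big miss": np.array([1.0, 1.0, 1.0, 37.0]),
}

results = {}
for name, errs in error_sets.items():
    mae = np.mean(np.abs(errs))
    rmse = np.sqrt(np.mean(errs ** 2))
    results[name] = (mae, rmse)
    print(name, "MAE:", mae, "RMSE:", rmse)
```

Both sets have an MAE of 10, but the RMSE rises from 10 to roughly 18.5 once the error is concentrated in one prediction. A large gap between RMSE and MAE is therefore a hint that a few predictions are far off.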

## Conclusion and Wrap-Up

To wrap up the project, I'll reflect on the three experiments I ran.

The first experiment included almost every feature in this dataset, as each variable has some correlation with and sway on the dependent variable, and I wanted to include every piece of data I could to build a model that accurately predicts sales prices. This model had the highest score of the three, and that doesn't come as much of a surprise to me.

My second experiment cut out less significant features and homed in on the 10 features most correlated (whether positively or negatively) with the sales price. I wanted to see if an accurate model could be made from only the top 10 features. After seeing the results, it actually wasn't too far off the first experiment, which shows that while the remaining features do have a sway on the final sales price, it isn't as significant as that of the top 10.

Lastly, experiment three took experiment two's idea of simplifying the dataset to 10 features, but instead of ranking features by the absolute value of their correlation with the sales price, I only kept the most positively correlated features. I wanted to see if this would have a significant effect on the model's accuracy, and it did. Even with just 3 features changed, the prediction power dropped, and the result would not have made an effective model for predicting home sales prices.

Overall, I think my experiments show that even the little features do matter in the end for predicting sales price. Some features are certainly more important than others, but if there's a feature on a home, in the end, it'll likely affect the sales price one way or another. My last experiment shows that it's important to also include negatively correlated features in a dataset, even if, as in this case, you think the final price would decrease because of them. Including as many characteristics of a home as possible will help predict its price accurately.

## Sources

<ul>
    <li>https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview</li>
    <li>https://pbpython.com/categorical-encoding.html</li>
    <li>https://www.investopedia.com/terms/r/regression.asp</li>
    <li>https://medium.com/acing-ai/how-to-evaluate-regression-models-d183b4f5853d</li>
</ul>