<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2 - Ames Housing Data and Kaggle Challenge

## Problem Statement

There are many variables that determine how much a home can fetch.
Using the Ames (IA) dataset (train, test), we want to find out which variables matter for home sale prices and produce accurate sale price predictions. 
This model will help provide an outside view of the market, reducing information asymmetry between potential home buyers, home sellers and real estate agents. The success of the model will be judged by how accurately it predicts home sale prices.

### Contents

* Background
* Data sets used
* Data Import
* Cleaning Train Dataset
* Exploratory Data Analysis
* Train and Test Model
* Conclusion


### Background 

The data was originally collected for tax assessment purposes, but it lends itself directly to the prediction of home selling prices. The type of information contained in the data is similar to what a typical home buyer would want to know before making a purchase, like size, neighborhood, exterior, basement, sale price, etc.

### Data sets used 

* train.csv
* test.csv

### Data Import

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.experimental import enable_iterative_imputer 
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, LassoCV, Ridge, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score,  GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, mean_squared_error, mean_absolute_error
from sklearn.utils import resample
from scipy import stats

In [None]:
train_df = pd.read_csv('./datasets/train.csv')
test_df = pd.read_csv('./datasets/test.csv')

# backup df (.copy() so that later in-place edits do not change the backups)
old_train_df = train_df.copy()
old_test_df = test_df.copy()

In [None]:
print(train_df.shape)
print(test_df.shape)

### Cleaning Train Dataset

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_df.info()

###### Rename Columns

In [None]:
train_df.columns

In [None]:
# edit column names -> lower case -> replace space with _

train_df.columns= train_df.columns.str.lower()
train_df.columns = train_df.columns.str.replace(' ','_')

test_df.columns = test_df.columns.str.lower()
test_df.columns = test_df.columns.str.replace(' ','_')

In [None]:
# convert year to age train_df

train_df['house_age'] = train_df['yr_sold'] - train_df['year_built']
train_df['reno_age'] = train_df['yr_sold'] - train_df['year_remod/add']
train_df['garage_age'] = train_df['yr_sold'] - train_df['garage_yr_blt']

train_df.drop(columns=['year_built', 'year_remod/add', 'garage_yr_blt'],inplace = True)

In [None]:
#convert year to age test_df

test_df['house_age'] = test_df['yr_sold'] - test_df['year_built']
test_df['reno_age'] = test_df['yr_sold'] - test_df['year_remod/add']
test_df['garage_age'] = test_df['yr_sold'] - test_df['garage_yr_blt']

test_df.drop(columns=['year_built', 'year_remod/add', 'garage_yr_blt'],inplace = True)

### Exploratory Data Analysis

###### Ordinal Features

From the data dictionary, we can see that some features are ordinal in nature. We convert these features to ordinal numbers to better represent the scale of each feature.

In [None]:
ordinal_features = ['exter_qual', 'exter_cond', 'bsmt_qual', 'bsmt_cond', 'kitchen_qual', 'garage_qual',
           'garage_cond', 'heating_qc']

In [None]:
# convert to ordinal features (missing values -> 0)

quality_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# assign the result back instead of calling inplace methods on a column,
# which newer pandas versions warn about

for df in (train_df, test_df):
    for x in ordinal_features:
        df[x] = df[x].fillna(0).replace(quality_map)
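
As an alternative to replacing each code by hand, the same encoding can be sketched with an ordered categorical. The helper name `quality_to_ordinal` and the toy series are hypothetical and not part of the notebook; the ordering of the codes is taken from the cell above. Note one difference: any value outside the five known codes becomes 0 here, whereas `replace` would leave it untouched.

```python
import pandas as pd

# Quality codes from worst to best, as used in the cell above.
QUALITY_ORDER = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

def quality_to_ordinal(series, order=QUALITY_ORDER):
    """Map quality codes to 1..5; missing or unknown values become 0."""
    cat = pd.Categorical(series, categories=order, ordered=True)
    # Categorical codes start at 0 and use -1 for missing, so shift by one.
    return pd.Series(cat.codes + 1, index=series.index, name=series.name)

toy = pd.Series(['Ex', 'TA', None, 'Po'], name='kitchen_qual')
encoded = quality_to_ordinal(toy)
print(encoded.tolist())  # [5, 3, 0, 1]
```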

###### Exploring Pearson's Correlation for each feature.

In [None]:
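# restrict to numeric columns; recent pandas versions raise an error on string columns
corr = train_df.select_dtypes(include=np.number).corr()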
corr['saleprice'].sort_values(ascending = False)

There is a large number of features in the dataset, and some features may predict the target variable better than others. To reduce noise, we drop features that do not have a strong positive or negative Pearson's correlation with the target variable.
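
The list of weakly correlated features can also be derived programmatically rather than read off the sorted output. This is a minimal sketch on toy data; the helper `weakly_correlated` and the 0.2 cutoff used as its default are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def weakly_correlated(df, target, threshold=0.2):
    """Return numeric features whose |Pearson r| with the target is below threshold."""
    numeric = df.select_dtypes(include=np.number)
    corr_with_target = numeric.corr()[target].drop(target)
    return corr_with_target[corr_with_target.abs() < threshold].index.tolist()

# Toy data: 'area' rises with price, 'mo_sold' is unrelated to it.
toy = pd.DataFrame({
    'saleprice': [100, 150, 200, 250, 300, 350],
    'area':      [10, 14, 21, 24, 31, 36],
    'mo_sold':   [3, 9, 1, 9, 3, 5],
})
weak = weakly_correlated(toy, 'saleprice')
print(weak)
```

The returned list could then be passed to `drop(columns=...)` on both the train and test frames so the two stay in step.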

In [None]:
corr['saleprice'].sort_values(ascending = False).index

In [None]:
# drop features whose absolute correlation with saleprice is below about 0.2

train_df.drop(columns = ['bsmt_unf_sf', 'bedroom_abvgr', 'screen_porch', '3ssn_porch',
       'exter_cond', 'mo_sold', 'pool_area', 'bsmtfin_sf_2', 'misc_val',
       'yr_sold', 'low_qual_fin_sf', 'bsmt_half_bath',  'ms_subclass',
       'overall_cond', 'kitchen_abvgr', 'enclosed_porch'], inplace = True)

In [None]:
train_df.shape

###### Checking for Multicollinearity 

Due to the large number of features in the dataset, it is likely that some features are correlated with each other. We check the dataset for such features.

In [None]:
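# compute the numeric correlation matrix once and keep only strong correlations
num_corr = train_df.select_dtypes(include=np.number).corr()
strong_corr = num_corr[abs(num_corr) >= 0.8]

# hide the upper triangle so each pair is shown once
mask = np.zeros_like(strong_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(12, 8))
sns.heatmap(strong_corr, annot=True, cmap="YlGnBu", mask=mask)
plt.title('Correlation Heatmap')
plt.show()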

There are some features that show a strong correlation with each other.
- 1st_flr_sf with total_bsmt_sf
- gr_liv_area with totrms_abvgrd
- garage_cars with garage_area
- garage_cond with garage_qual
- garage_age with house_age

From each pair, we drop the feature with the lower Pearson's correlation with saleprice.
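
The pairwise check can be sketched in code as well. The helper `correlated_pairs` below is hypothetical: it lists feature pairs with an absolute correlation of at least 0.8 and, for each pair, names the feature that tracks the target less closely. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, target, threshold=0.8):
    """Return (feature_a, feature_b, weaker_feature) for strongly correlated pairs."""
    corr = df.select_dtypes(include=np.number).corr()
    features = [c for c in corr.columns if c != target]
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if abs(corr.loc[a, b]) >= threshold:
                # The drop candidate is the one less correlated with the target.
                weaker = a if abs(corr.loc[a, target]) < abs(corr.loc[b, target]) else b
                pairs.append((a, b, weaker))
    return pairs

toy = pd.DataFrame({
    'saleprice':   [100, 150, 200, 250, 300, 350],
    'garage_area': [10, 14, 21, 24, 31, 36],
    'garage_cars': [1, 1, 2, 2, 3, 3],
    'mo_sold':     [3, 9, 1, 9, 3, 5],
})
pairs = correlated_pairs(toy, 'saleprice')
print(pairs)
```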

In [None]:
# total_bsmt_sf      0.628925
# 1st_flr_sf         0.618486 drop

# gr_liv_area        0.697038
# totrms_abvgrd      0.504014 drop

# garage_area        0.650270
# garage_cars        0.648220 drop

# garage_qual        0.285281
# garage_cond        0.265021 drop

# garage_age        -0.533962 drop
# house_age         -0.571881

train_df.drop(columns = ['1st_flr_sf', 
                         'totrms_abvgrd', 
                         'garage_cars', 
                         'garage_cond', 
                         'garage_age'], 
              inplace = True)

test_df.drop(columns = ['1st_flr_sf', 
                         'totrms_abvgrd', 
                         'garage_cars', 
                         'garage_cond', 
                         'garage_age'], 
              inplace = True)

###### Impute NaN values for train dataset numerical features

In [None]:
pd.set_option("display.max_rows", 101)
train_nan = train_df.isnull().sum().sort_values(ascending = False)
train_nan

In [None]:
train_to_remove_col = []
train_to_remove_val = []
for col, val in train_nan.items():  # Series.iteritems() was removed in pandas 2.0
    if val > 0:
        train_to_remove_col.append(col)
        train_to_remove_val.append(val)
        

In [None]:
y = train_to_remove_col
width = train_to_remove_val

fig, ax = plt.subplots(figsize=(5,5))

right_side = ax.spines["right"]
right_side.set_visible(False)
top_side = ax.spines["top"]
top_side.set_visible(False)

plt.barh(y, width,color='purple')
plt.title("Train df numerical NaNs", fontdict = {'fontsize' : 15})
for index, value in enumerate(width):
    plt.text(value + 100, index,
             str(value))
  
plt.show()

In [None]:
test_nan = test_df.isnull().sum().sort_values(ascending = False)
test_nan

In [None]:
test_to_remove_col = []
test_to_remove_val = []
for col, val in test_nan.items():  # Series.iteritems() was removed in pandas 2.0
    if val > 0:
        test_to_remove_col.append(col)
        test_to_remove_val.append(val)

In [None]:
y = test_to_remove_col
width = test_to_remove_val

fig, ax = plt.subplots(figsize=(5,5))

right_side = ax.spines["right"]
right_side.set_visible(False)
top_side = ax.spines["top"]
top_side.set_visible(False)

plt.barh(y, width,color='coral')
plt.title("Test df numerical NaNs", fontdict = {'fontsize' : 15})
for index, value in enumerate(width):
    plt.text(value + 50, index,
             str(value))
  
plt.show()

The features pool_qc, misc_feature, alley, fence and fireplace_qu are missing in more than 50% of rows in both the train and test datasets. We drop these columns because, with so few recorded values, they are unlikely to provide useful information for our model.
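
The 50% rule can be checked directly from the share of missing values per column. This is a small sketch on made-up data; the helper name `mostly_missing` is an assumption.

```python
import pandas as pd

def mostly_missing(df, max_missing=0.5):
    """Return columns whose share of missing values exceeds max_missing."""
    missing_share = df.isnull().mean()  # fraction of NaN per column
    return missing_share[missing_share > max_missing].index.tolist()

toy = pd.DataFrame({
    'pool_qc':  [None, None, 'Gd', None],
    'lot_area': [8450, 9600, 11250, 9550],
    'alley':    [None, 'Pave', None, None],
})
sparse_cols = mostly_missing(toy)
print(sparse_cols)
```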

In [None]:
# dropping columns with large missing data

train_df.drop(columns = ['pool_qc', 'misc_feature', 'alley', 'fence', 'fireplace_qu'], inplace = True)
test_df.drop(columns = ['pool_qc', 'misc_feature', 'alley', 'fence', 'fireplace_qu'], inplace = True)

In [None]:
train_df_num = train_df.select_dtypes(include=np.number).columns.tolist()

In [None]:
train_df_num

In [None]:
train_df[train_df_num].isnull().sum().sort_values(ascending=False)

Lot Frontage


The share of NaN values in lot frontage is relatively high, and the values seem to be missing completely at random. <br>

According to this [website](https://www.gimme-shelter.com/frontage-50043/), all houses have a lot frontage. It is the width of the lot.<br>

We opt to impute with the mean, as the mean and the median are close to each other in both datasets.<br>
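
The notebook fills each dataset with its own mean. A common alternative, sketched here on toy data, is to compute the fill value on the training data only and reuse it for the test data, so that nothing about the test set feeds into preprocessing. The helper `impute_with_train_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_train_mean(train, test, column):
    """Fill NaNs in `column` of both frames using the mean of the training column."""
    fill_value = train[column].mean()  # NaNs are skipped by default
    train_out = train.assign(**{column: train[column].fillna(fill_value)})
    test_out = test.assign(**{column: test[column].fillna(fill_value)})
    return train_out, test_out, fill_value

toy_train = pd.DataFrame({'lot_frontage': [60.0, 80.0, np.nan, 70.0]})
toy_test = pd.DataFrame({'lot_frontage': [np.nan, 50.0]})
train_filled, test_filled, fill_value = impute_with_train_mean(toy_train, toy_test, 'lot_frontage')
print(fill_value, test_filled['lot_frontage'].tolist())
```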

In [None]:
print('NaNs: ', train_df['lot_frontage'].isnull().sum())
print('Total: ', train_df.shape[0])
print('Percentage: ', (train_df['lot_frontage'].isnull().sum()/train_df.shape[0])*100)

In [None]:
print(train_df['lot_frontage'].mean())
print(train_df['lot_frontage'].median())

In [None]:
print(test_df['lot_frontage'].mean())
print(test_df['lot_frontage'].median())

In [None]:
# As the median and the mean of lot_frontage are about the same, fill lot_frontage NaNs with the mean

train_df['lot_frontage'] = train_df['lot_frontage'].fillna(train_df['lot_frontage'].mean())

# imputing test_df does not affect kaggle score

test_df['lot_frontage'] = test_df['lot_frontage'].fillna(test_df['lot_frontage'].mean())

Replace remaining numerical feature NaNs with 0, as they are only a small percentage of the data.

In [None]:
# replace NaN with 0

train_num_list = train_df.select_dtypes(exclude='object').columns.tolist()

train_df[train_num_list] = train_df[train_num_list].fillna(0)

In [None]:
train_df[train_num_list].isnull().sum().sort_values(ascending=False)

##### Impute NaN values for train dataset categorical features

In [None]:
train_cat_list = train_df.select_dtypes(include='object').columns.tolist()

train_df[train_cat_list].isnull().sum().sort_values(ascending=False)

For the categorical features, a NaN means that the feature simply does not exist in the home. We replace all NaNs with 'None'.

In [None]:
train_df[train_cat_list] = train_df[train_cat_list].fillna('None')

In [None]:
train_df[train_cat_list].isnull().sum().sort_values(ascending=False)

###### Plotting each numerical feature against target variable to check for any irregularities or outliers.

In [None]:
for feature in train_df.select_dtypes(exclude='object').columns.tolist():
    plt.figure(figsize= ( 10, 5 ))
    sns.scatterplot(data=train_df, x=feature, y='saleprice')
    plt.show()

In [None]:
# remove outliers

train_df = train_df[(train_df['lot_frontage'] <=300)]
train_df = train_df[(train_df['mas_vnr_area'] <=1400)]
train_df = train_df[(train_df['bsmtfin_sf_1'] <=3000)]
train_df = train_df[(train_df['total_bsmt_sf'] <=4000)]
train_df = train_df[(train_df['gr_liv_area'] <=4000)]
train_df = train_df[(train_df['wood_deck_sf'] <=1000)]

###### Plotting histograms to see the distribution of each numerical feature.

In [None]:
for feature in train_df.select_dtypes(exclude='object').columns.tolist():
    plt.figure(figsize= ( 10, 5 ))
    sns.histplot(data=train_df, x=feature, bins = 50, kde = True)
    plt.show()

The histograms show whether each feature is normally distributed. Some features, including our target variable, are right-skewed.
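
Skewness can also be quantified rather than judged by eye. The sketch below uses made-up prices (not values from the Ames data) to show how a log transform pulls in a long right tail, which is the motivation for modelling the log of the sale price later on.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one very expensive home creating a right tail.
prices = pd.Series([90_000, 120_000, 135_000, 150_000, 180_000, 250_000, 600_000],
                   name='saleprice')

raw_skew = prices.skew()          # sample skewness of the raw prices
log_skew = np.log(prices).skew()  # skewness after the log transform

print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```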

###### Plotting boxplots to see if there is any relationship between each categorical feature and our target variable.

In [None]:
for feature in train_df.select_dtypes(include='object').columns.tolist():
    sns.boxplot(data=train_df, x=feature, y='saleprice')
    plt.show()

The boxplots show how each categorical feature relates to saleprice. We can see that some features, like ms_zoning, neighborhood, condition_1, condition_2, exterior_1st, exterior_2nd, garage_type and sale_type, have some relation with saleprice.

###### Matching features between Train and Test dataset

Matching the test dataframe columns with the train dataframe columns. This removes from the test dataset the features that were dropped from the train dataset.

In [None]:
test_df.shape

In [None]:
test_df.isnull().sum().sort_values(ascending = False)

In [None]:
temp_list = train_df.columns.tolist()

In [None]:
temp_list.remove('saleprice')

In [None]:
test_df = test_df[temp_list]

In [None]:
test_df.shape

##### Impute NaN values for test dataset numerical features

In [None]:
test_df_num = test_df.select_dtypes(exclude='object').columns.tolist()

In [None]:
test_df[test_df_num].isnull().sum().sort_values(ascending=False)

In [None]:
test_df[test_df_num] = test_df[test_df_num].fillna(0)

Replace missing value in mas_vnr_area with 0 as it is only 1 datapoint.

##### Impute NaN values for test dataset categorical features

In [None]:
test_df.isnull().sum().sort_values(ascending = False)

In [None]:
test_df_cat = test_df.select_dtypes(include='object').columns.tolist()

In [None]:
test_df[test_df_cat].isnull().sum().sort_values(ascending = False)

In [None]:
test_df['electrical'].unique()

In [None]:
test_df['electrical'] = test_df['electrical'].fillna('SBrkr')

Impute test_df electrical with mode 'SBrkr'.

In [None]:
test_df.isnull().sum().sort_values(ascending = False)
test_df_cat = test_df.select_dtypes(include='object').columns.tolist()

In [None]:
test_df[test_df_cat] = test_df[test_df_cat].fillna('None')

In [None]:
test_df.shape

Replace NaNs with 'None'

### Get Dummies

Dummify categorical features for both train and test datasets.

In [None]:
train_df_cat = train_df.select_dtypes(include='object').columns.tolist()

In [None]:
train_df = pd.get_dummies(data=train_df,columns=train_df_cat)

In [None]:
test_df_cat = test_df.select_dtypes(include='object').columns.tolist()

In [None]:
test_df = pd.get_dummies(data=test_df, columns=test_df_cat)

In [None]:
train_df.shape

In [None]:
test_df.shape

There are different columns in dummified train and test df. Checking which columns each df does not have.

In [None]:
# declare list variable
train_df_missing_col = []
test_df_missing_col = []

# append column missing in train_df

for col in test_df.columns:
    if col not in train_df.columns:
        train_df_missing_col.append(col)
        
# append column missing in test_df

for col in train_df.columns:
    if col != 'saleprice':
        if col not in test_df.columns:
            test_df_missing_col.append(col)



In [None]:
train_df_missing_col # columns in test not in train

In [None]:
test_df_missing_col # columns in train not in test

In [None]:
test_df[train_df_missing_col].sum()

In [None]:
train_df[test_df_missing_col].sum()

Opting to remove missing columns from each dataset.<br>
If columns were added, only a small percentage of meaningful data would be added (i.e. 1/878 to 6/2043 $\approx$ 0.1% to 0.3%) and a large percentage of noise would be included.<br>

Removing columns in the train dataset that do not exist in the test dataset.

In [None]:
for col in train_df.columns:
    if col != 'saleprice':
        if col not in test_df.columns:
            print(col)
            train_df.drop(columns=col, inplace = True)

Removing columns in the test dataset that do not exist in the train dataset.

In [None]:
for col in test_df.columns:
    if col != 'saleprice':
        if col not in train_df.columns:
            print(col)
            test_df.drop(columns=col, inplace = True)
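
The two loops above can also be expressed as a single selection of the shared columns, which additionally guarantees that both frames end up with the same column order before scaling. This is a sketch on toy frames; `train_dummies`, `test_dummies` and the column names are made up.

```python
import pandas as pd

train_dummies = pd.DataFrame({'lot_area': [1, 2], 'zone_A': [1, 0],
                              'zone_B': [0, 1], 'saleprice': [100, 200]})
test_dummies = pd.DataFrame({'zone_C': [0], 'zone_A': [1], 'lot_area': [3]})

# Keep only the features present in both frames, in the train frame's order.
shared = [c for c in train_dummies.columns
          if c != 'saleprice' and c in test_dummies.columns]

train_aligned = train_dummies[shared + ['saleprice']]
test_aligned = test_dummies[shared]

print(shared)  # ['lot_area', 'zone_A']
```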

Only the saleprice column will be log-transformed, as transforming the other features did not noticeably improve the model.
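
Because the models are fit on the log of saleprice, their predictions are on the log scale and need to be exponentiated before an RMSE in dollars is computed. The numbers below are invented to illustrate the difference, and `rmse` is a small helper written for this sketch.

```python
import math
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_true = np.array([100_000.0, 200_000.0, 300_000.0])
# Pretend these are model outputs on the log scale.
log_pred = np.log([110_000.0, 190_000.0, 300_000.0])

rmse_dollars = rmse(y_true, np.exp(log_pred))  # back-transformed: comparable to prices
rmse_mixed = rmse(y_true, log_pred)            # mixes scales: not a meaningful error

print(round(rmse_dollars, 2), round(rmse_mixed, 2))
```

The mixed-scale figure is roughly the size of the prices themselves, since a log prediction of about 12 is negligible next to a price in the hundreds of thousands.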

# Train and score model

In [None]:
X = train_df.drop(columns = 'saleprice')
y = train_df['saleprice']

In [None]:
print(X.shape)
y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 147)

In [None]:
ss = StandardScaler()

In [None]:
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

### Baseline Model 

In [None]:
# baseline model

# predict the mean sale price of the training split for every test row
y_baseline_pred = [np.mean(y_train) for i in range(len(y_test))]
print(y_baseline_pred[:5])
print(len(y_baseline_pred))

In [None]:
# baseline rmse

baseline_rmse = mean_squared_error(y_test, y_baseline_pred, squared = False)
baseline_rmse

### Linear Regression

The Linear Regression model fits a best-fit line to predict the sale price. It takes into account all features.

In [None]:
lr = LinearRegression()

In [None]:
lr.fit(X_train_sc, np.log(y_train))

In [None]:
lr_train_r2 = lr.score(X_train_sc, np.log(y_train))
lr_test_r2 = lr.score(X_test_sc, np.log(y_test))
lr_train_cv_r2 = cross_val_score(lr, X_train_sc, np.log(y_train), cv = 5).mean()
lr_test_cv_r2 = cross_val_score(lr, X_test_sc,  np.log(y_test), cv = 5).mean()

In [None]:
# the model was fit on log(saleprice), so convert predictions back to dollars before computing RMSE
# (note: np.exp can overflow if the unregularized model produces extreme log-price predictions)

lr_y_pred = np.exp(lr.predict(X_test_sc))
lr_rmse = mean_squared_error(y_test, lr_y_pred, squared = False)
lr_rmse_train = mean_squared_error(y_train, np.exp(lr.predict(X_train_sc)),squared = False)

### Linear Regression with Ridge

In [None]:
r_alphas = np.logspace(-1, 10, 300)

In [None]:
ridgecv = RidgeCV(r_alphas)

In [None]:
ridgecv.fit(X_train_sc, np.log(y_train))
ridgecv_r2_train = (ridgecv.score(X_train_sc, np.log(y_train)))
ridgecv_r2_test = (ridgecv.score(X_test_sc, np.log(y_test)))

In [None]:
r_y_pred = np.exp(ridgecv.predict(X_test_sc))

In [None]:
r_rmse = mean_squared_error(y_test,r_y_pred,squared=False)

In [None]:
r_rmse_train = mean_squared_error(y_train, np.exp(ridgecv.predict(X_train_sc)),squared = False)

### Linear Regression with Lasso

In [None]:
lassocv = LassoCV(n_alphas = 500)

In [None]:
lassocv.fit(X_train_sc, np.log(y_train))
lasso_train_r2 = lassocv.score(X_train_sc, np.log(y_train))
lasso_test_r2 = lassocv.score(X_test_sc, np.log(y_test))

In [None]:
l_y_pred = np.exp(lassocv.predict(X_test_sc))

In [None]:
l_rmse = mean_squared_error(y_test,l_y_pred,squared=False)

In [None]:
l_rmse_train = mean_squared_error(y_train, np.exp(lassocv.predict(X_train_sc)), squared = False)

In [None]:
def lasso_model():
    t = PrettyTable(['Model', 'R2 Scores', 'RMSE'])

    print(t)

In [None]:
from prettytable import PrettyTable
def model_scores():
    # one row per model and split: R2 on log(saleprice), RMSE in dollars
    t = PrettyTable(['Model', 'R2 Scores', 'RMSE'])
    t.add_row(['Baseline', 'na', baseline_rmse])
    t.add_row(['-------', '-------', '-------'])
    t.add_row(['Linear Regression Train', lr_train_r2, lr_rmse_train])
    t.add_row(['Linear Regression Test', lr_test_r2, lr_rmse])
    t.add_row(['-------', '-------', '-------'])
    t.add_row(['Ridge CV Train', ridgecv_r2_train, r_rmse_train])
    t.add_row(['Ridge CV Test', ridgecv_r2_test, r_rmse])
    t.add_row(['-------', '-------', '-------'])
    t.add_row(['Lasso Train CV', lasso_train_r2, l_rmse_train])
    t.add_row(['Lasso Test CV', lasso_test_r2, l_rmse])
    t.add_row(['-------', '-------', '-------'])
    print(t)

In [None]:
model_scores()

Looking at the R2 scores for all 3 models, Linear Regression did the worst, with a very large difference between the train and test scores, which suggests the model is overfit. This could be due to the large number of features created during preprocessing. We will use this as our baseline model. Comparing the lasso and ridge R2 scores, both models score very high, meaning that a high percentage of the variance in our target can be explained by our features.

### Coefficients

In [None]:
lasso_coefs = pd.DataFrame(list(zip(np.exp(lassocv.coef_),X.columns)))
lasso_coefs.rename(columns = { 0:
                       'coef',
                       1:
                       'variable'}, inplace = True)
lasso_coefs.sort_values(by='coef', ascending = False, inplace = True)
lasso_coefs['coef'] = lasso_coefs['coef'] - 1

In [None]:
ridge_coefs = pd.DataFrame(list(zip(np.exp(ridgecv.coef_),X.columns)))
ridge_coefs.rename(columns = { 0:
                       'coef',
                       1:
                       'variable'}, inplace = True)
ridge_coefs.sort_values(by='coef', ascending = False, inplace = True)
ridge_coefs['coef'] = ridge_coefs['coef'] - 1

In [None]:
linear_coefs = pd.DataFrame(list(zip(np.exp(lr.coef_),X.columns)))
linear_coefs.rename(columns = { 0:
                       'coef',
                       1:
                       'variable'}, inplace = True)
linear_coefs.sort_values(by='coef', ascending = False, inplace = True)
linear_coefs['coef'] = linear_coefs['coef'] - 1

In [None]:
# recap corr from above

corr['saleprice'].sort_values(ascending = False).head(10)

In [None]:
corr['saleprice'].sort_values(ascending = False).tail(10)

In [None]:
linear_coefs.head()

In [None]:
linear_coefs.tail()

In [None]:
lasso_coefs.head(10)

In [None]:
lasso_coefs.tail(10)

In [None]:
ridge_coefs.head(10)

In [None]:
ridge_coefs.tail(10)

Each coefficient describes how the predicted sale price changes when a feature increases: a positive coefficient raises the prediction and a negative coefficient lowers it. Because the features were standardized and the target was log-transformed, exp(coef) - 1 is the proportional change in the predicted price for an increase of one standard deviation in the feature. <br>
<br>
For Linear Regression with Ridge and Lasso, the top 5 positive coefficients are similar, mainly consisting of
- gr_liv_area ( above grade living area square feet )
- overall_qual ( overall material and finish quality )
- functional_typ ( home functionality rating: house with typical functionality )
- functional_min1 ( home functionality rating: house with minor deductions 1 )
- functional_min2 ( home functionality rating: house with minor deductions 2 )

The top 5 negative coefficients are similar,
- reno_age ( number of years since last renovation )
- house_age ( age of the house )
- neighborhood_MeadowV ( Physical locations within Ames city limits: Meadow Village )
- ms_zoning_c (  Identifies the general zoning classification of the sale: Commercial )
- paved_drive_N ( Paved driveway: Dirt or gravel )
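
Since the models were fit on `np.log(y)` with features scaled by `StandardScaler`, the `np.exp(coef) - 1` values computed in the cells above can be read as proportional price changes per one standard deviation of a feature, holding the other features fixed. The sketch below works through that arithmetic; the coefficient values are hypothetical and not taken from the fitted models.

```python
import math

def pct_change_per_sd(coef):
    """Percent change in predicted price for a +1 standard deviation move in a feature."""
    return (math.exp(coef) - 1) * 100

# Hypothetical log-scale coefficients for illustration only.
positive_effect = pct_change_per_sd(0.10)   # about +10.5%
negative_effect = pct_change_per_sd(-0.05)  # about -4.9%

print(round(positive_effect, 2), round(negative_effect, 2))
```

For small coefficients the percentage is close to the coefficient itself times 100, which is why the ranking by coefficient and the ranking by percentage effect agree.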


### Distribution of Residuals 

In [None]:
l_residuals = y_test - l_y_pred
lasso_joint = sns.jointplot(y=l_residuals,x=l_y_pred)
lasso_joint.set_axis_labels('Lasso CV Predicted Values', 'Lasso CV Residuals', fontsize=16);

In [None]:
r_residuals = y_test - r_y_pred
ridge_joint = sns.jointplot(y=r_residuals,x=r_y_pred)
ridge_joint.set_axis_labels('Ridge CV Predicted Values', 'Ridge CV Residuals', fontsize=16);

The residuals for both lasso and ridge appear to be approximately normally distributed, which is consistent with the linear regression assumption of normally distributed errors.

### Kaggle Submission Score

###### Linear Regression

In [None]:
X_sc = ss.fit_transform(X)
test_sc = ss.transform(test_df)

linear = LinearRegression()

linear.fit(X_sc, np.log(y))

chosen_y_pred = np.exp(linear.predict(test_sc))

submission_linear = pd.DataFrame()
submission_linear['Id'] = test_df['id']
submission_linear['SalePrice'] = chosen_y_pred
submission_linear.to_csv('./datasets/linear regression submission.csv', index = False)

##### Linear Regression with Lasso Submission to Kaggle

In [None]:
lasso_chosen = lassocv.alpha_

X_sc = ss.fit_transform(X)
test_sc = ss.transform(test_df)

lasso = Lasso(alpha = lasso_chosen)

lasso.fit(X_sc, np.log(y))

print('R2 score:')
print(lasso.score(X_sc, np.log(y)))

chosen_l_y_pred = np.exp(lasso.predict(test_sc))

submission_lasso = pd.DataFrame()
submission_lasso['Id'] = test_df['id']
submission_lasso['SalePrice'] = chosen_l_y_pred
submission_lasso.to_csv('./datasets/lasso submission.csv', index = False)

##### Linear Regression with Ridge Submission to Kaggle

In [None]:
ridge_chosen = ridgecv.alpha_

X_sc = ss.fit_transform(X)
test_sc = ss.transform(test_df)

ridge = Ridge(alpha = ridge_chosen)

ridge.fit(X_sc, np.log(y))

print('R2 score:')
print(ridge.score(X_sc, np.log(y)))

chosen_r_y_pred = np.exp(ridge.predict(test_sc))

submission_ridge = pd.DataFrame()
submission_ridge['Id'] = test_df['id']
submission_ridge['SalePrice'] = chosen_r_y_pred
submission_ridge.to_csv('./datasets/ridge submission.csv', index = False)

### Lasso Kaggle score: <br>
Private: 19,874<br>
Public: 21,076<br>
It has a spread of 1202.<br><br>
Ridge Kaggle score:<br>
Private Score: 20,137<br>
Public Score: 21,464<br>
It has a spread of 1326.

# Conclusion

###### Model Choice : Linear Regression with Lasso

Looking at the overall scoring for Linear Regression, it is not a good model for our situation as it is overfit.<br>
<br>
Compared with the linear regression train and test scores, the training scores for lasso and ridge were a little lower. This is expected, as regularization introduces some bias in exchange for lower variance, which shows up as better test set R2 scores for lasso and ridge.<br>
<br>
Between Lasso and Ridge, both models gave good metrics, with similar RMSE and coefficients. However, the best model for this case is Linear Regression with Lasso, as it predicts the saleprice with the highest accuracy among the 3 models and has the smallest spread between its public and private Kaggle scores.<br><br>

###### Findings

Our model is able to predict the sale price of a house, given its features, with an RMSE of about $20,000 ( based on the Kaggle private score ). <br><br>
Features that affect the sale price positively are:<br>
- gr_liv_area ( above grade living area square feet )
- overall_qual ( overall material and finish quality )
- functional_typ ( home functionality rating: house with typical functionality )
- functional_min1 ( home functionality rating: house with minor deductions 1 )
- functional_min2 ( home functionality rating: house with minor deductions 2 )

Features that affect the sale price negatively are:<br>
- reno_age ( number of years since last renovation )
- house_age ( age of the house )
- neighborhood_MeadowV ( Physical locations within Ames city limits: Meadow Village )
- ms_zoning_c (  Identifies the general zoning classification of the sale: Commercial )
- paved_drive_N ( Paved driveway: Dirt or gravel )


With this in mind, home buyers, sellers and real estate agents are able to get an outside view of the market rate of a house. They can get a rough feel for the price of a house by observing the features that positively and negatively affect the price.<br>
They can also use the model to get a clearer picture of the sale price. In this way, they are less likely to overpay, underprice or otherwise misjudge the value of a house.<br>
Buyers and real estate agents are able to pick out underpriced houses, and sellers know when they are offered a good price.