In this project, we practice building a linear regression model. We'll work with housing data for the city of Ames, Iowa, United States, from 2006 to 2010. You can read more about why the data was collected [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627). You can also read about the different columns in the data [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

## Initial Data Exploration

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

In [41]:
df = pd.read_csv('AmesHousing.tsv', delimiter= '\t')
df.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style   

Write a simple initial `train_and_test` function that uses linear regression to predict house sale prices.

In [66]:
def train_and_test(df):
    # Use the first 1460 rows for training and the rest for testing
    train = df[:1460]
    test = df[1460:]
    # Select only columns with a numeric data type
    train = train.select_dtypes(include=['integer', 'float'])
    test = test.select_dtypes(include=['integer', 'float'])
    features = train.columns.drop('SalePrice')
    lr = LinearRegression()
    # Fit the model on the training set
    lr.fit(train[features], train['SalePrice'])
    predictions = lr.predict(test[features])
    # scikit-learn's argument order is (y_true, y_pred)
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    return rmse

In [67]:
# Test our initial function
rmse = train_and_test(df[['Gr Liv Area', 'SalePrice']])

rmse

57088.25161263909
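
As a quick illustration of the metric returned by `train_and_test`, the sketch below computes the root mean squared error by hand. The three prices are made-up numbers, not rows from the Ames data.

```python
import numpy as np

# Hypothetical actual and predicted sale prices (not taken from the dataset)
actual = np.array([200000.0, 150000.0, 300000.0])
predicted = np.array([210000.0, 140000.0, 280000.0])

# Per-house errors: 10000, -10000, -20000
errors = predicted - actual

# Mean of the squared errors: (1e8 + 1e8 + 4e8) / 3 = 2e8
mse = np.mean(errors ** 2)

# The square root brings the metric back to dollars
rmse = np.sqrt(mse)
print(rmse)
```

Because the errors are squared before averaging, one large miss weighs more than several small ones, and the result is in the same unit as `SalePrice`.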

## Feature Engineering/ Data Cleaning

First, we drop any columns with more than 5% missing values.

In [68]:
# find the number of missing values by column
n_missing = df.isnull().sum()

In [69]:
# Filter columns containing more than 5% missing values
drop_missing_cols = n_missing[(n_missing > len(df)/20)]
# Drop those columns from the data frame.
trans_df = df.drop(drop_missing_cols.index, axis=1)
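
The same thresholding can be tried on a small frame to see which side of the 5% cutoff a column lands on. This is a minimal sketch with made-up columns `a`, `b`, and `c`; it mirrors the strict `>` comparison used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 40 rows
toy = pd.DataFrame({
    "a": np.arange(40, dtype=float),  # no missing values
    "b": np.arange(40, dtype=float),  # will get 2 missing values (exactly 5%)
    "c": np.arange(40, dtype=float),  # will get 3 missing values (7.5%)
})
toy.loc[:1, "b"] = np.nan  # label slicing is inclusive: rows 0 and 1
toy.loc[:2, "c"] = np.nan  # rows 0, 1 and 2

# Count missing values per column and compare with 5% of the row count
n_missing = toy.isnull().sum()
to_drop = n_missing[n_missing > len(toy) / 20].index

kept = toy.drop(columns=to_drop)
print(list(kept.columns))
```

With a strict comparison, a column sitting exactly at 5% (here `b`) is kept and only `c` is dropped.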

Next, we drop any text columns.

In [70]:
# Select only the text columns
text_cols = trans_df.select_dtypes(include=['object'])
# Drop those columns from the data frame, passing their names explicitly
trans_df = trans_df.drop(text_cols.columns, axis=1)

For numerical columns with missing values, we fill each missing value with the most common value (the mode) of that column.

In [71]:
# Select only the numerical columns
numb_cols = trans_df.select_dtypes(include=['int', 'float'])
# Count the missing values in each numerical column
numb_missing = numb_cols.isnull().sum()
# Keep only the columns that have missing values
numb_missing = numb_missing[numb_missing > 0]
# Find the mode of each of those columns
mode_dict = trans_df[numb_missing.index].mode().to_dict('records')[0]
# Fill the missing values with the mode
trans_df = trans_df.fillna(mode_dict)
# Confirm that no column has missing values left
trans_df.isnull().sum().value_counts()

0    37
dtype: int64
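
To make the mode-filling step easier to follow, here is a minimal sketch on a made-up frame. The column names and values are hypothetical, and it uses `.iloc[0]` on the result of `mode()` instead of `to_dict('records')[0]`; both pick the first mode of each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in two of the columns
toy = pd.DataFrame({
    "garage_cars": [2.0, 2.0, 1.0, np.nan, 2.0],
    "bsmt_sf": [0.0, 0.0, 500.0, 0.0, np.nan],
    "lot_area": [8000, 9000, 7000, 9500, 8800],
})

# Columns that contain at least one missing value
cols_with_na = toy.columns[toy.isnull().any()]

# mode() returns a DataFrame; its first row holds the first mode per column
modes = toy[cols_with_na].mode().iloc[0]

# fillna with a Series fills each column with the value under its own label
filled = toy.fillna(modes)
print(filled)
```

Note that `mode()` ignores missing values by default, so the replacement is always an observed value.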

Create new, more informative features.

In [72]:
trans_df['Years Before Sale'] = trans_df['Yr Sold'] - trans_df['Year Built']
trans_df['Years Since Remod'] = trans_df['Yr Sold'] - trans_df['Year Remod/Add']

In [73]:
# Check 'Years Before Sale' for invalid (negative) values
trans_df[trans_df['Years Before Sale'] <0]

Unnamed: 0,Order,PID,MS SubClass,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice,Years Before Sale,Years Since Remod
2180,2181,908154195,20,39290,10,5,2008,2009,1224.0,4010.0,...,0,0,0,0,17000,10,2007,183850,-1,-2


In [74]:
# Check 'Years Since Remod' for invalid (negative) values
trans_df[trans_df['Years Since Remod'] <0]

Unnamed: 0,Order,PID,MS SubClass,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice,Years Before Sale,Years Since Remod
1702,1703,528120010,60,16659,8,5,2007,2008,0.0,0.0,...,0,0,0,0,0,6,2007,260116,0,-1
2180,2181,908154195,20,39290,10,5,2008,2009,1224.0,4010.0,...,0,0,0,0,17000,10,2007,183850,-1,-2
2181,2182,908154205,60,40094,10,5,2007,2008,762.0,2260.0,...,0,0,0,0,0,10,2007,184750,0,-1


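Rows 1702, 2180, and 2181 have a negative value in at least one of the new features, which cannot be right (a house cannot be sold before it is built or remodeled), so we drop them.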

In [75]:
# Drop rows 1702, 2180, 2181
trans_df = trans_df.drop([1702, 2180, 2181], axis = 0)
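
Hard-coded row labels work for this snapshot of the data. As an alternative, the invalid rows can be selected with a boolean mask, which keeps working if the row labels change. A minimal sketch on a made-up three-row frame (the values echo the pattern seen above but are only illustrative):

```python
import pandas as pd

# Hypothetical toy rows: one valid house and two with impossible dates
toy = pd.DataFrame({
    "Year Built": [1960, 2008, 2007],
    "Year Remod/Add": [1960, 2009, 2008],
    "Yr Sold": [2010, 2007, 2007],
})
toy["Years Before Sale"] = toy["Yr Sold"] - toy["Year Built"]
toy["Years Since Remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# A row is invalid if either engineered feature is negative
invalid = (toy["Years Before Sale"] < 0) | (toy["Years Since Remod"] < 0)

# Keep only the valid rows
cleaned = toy[~invalid]
print(cleaned)
```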

Drop columns that aren't useful for ML and columns that leak information about the final sale.

In [76]:
## Drop columns that aren't useful for ML
trans_df = trans_df.drop(["PID", "Order"], axis=1)

## Drop columns that leak info about the final sale
trans_df = trans_df.drop(["Mo Sold", "Yr Sold"], axis=1)

Retrain the model on the cleaned dataset, again using only `Gr Liv Area` as the feature.

In [77]:
rmse = train_and_test(trans_df[['Gr Liv Area', 'SalePrice']])

rmse

55275.367312413066

## Feature Selection

In [78]:
trans_df.head()

Unnamed: 0,MS SubClass,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,SalePrice,Years Before Sale,Years Since Remod
0,20,31770,6,5,1960,1960,112.0,639.0,0.0,441.0,...,210,62,0,0,0,0,0,215000,50,50
1,20,11622,5,6,1961,1961,0.0,468.0,144.0,270.0,...,140,0,0,0,120,0,0,105000,49,49
2,20,14267,6,6,1958,1958,108.0,923.0,0.0,406.0,...,393,36,0,0,0,0,12500,172000,52,52
3,20,11160,7,5,1968,1968,0.0,1065.0,0.0,1045.0,...,0,0,0,0,0,0,0,244000,42,42
4,60,13830,5,5,1997,1998,0.0,791.0,0.0,137.0,...,212,34,0,0,0,0,0,189900,13,12


In [79]:
# Calculate the absolute correlation of each column with SalePrice (our target)
corr = trans_df.corr()['SalePrice'].abs().sort_values()
corr

BsmtFin SF 2         0.006127
Misc Val             0.019273
3Ssn Porch           0.032268
Bsmt Half Bath       0.035875
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182751
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276258
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Year Remod/Add       0.533007
Years Since Remod    0.534985
Full Bath            0.546118
Year Built           0.558490
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qu

In [80]:
# Keep only the features whose absolute correlation with SalePrice exceeds 0.3
features = corr[corr > 0.3]
# Restrict the data frame to those features (SalePrice itself is included)
trans_df = trans_df[features.index]
trans_df.head()

Unnamed: 0,Open Porch SF,Wood Deck SF,BsmtFin SF 1,Fireplaces,TotRms AbvGrd,Mas Vnr Area,Year Remod/Add,Years Since Remod,Full Bath,Year Built,Years Before Sale,1st Flr SF,Garage Area,Total Bsmt SF,Garage Cars,Gr Liv Area,Overall Qual,SalePrice
0,62,210,639.0,2,7,112.0,1960,50,1,1960,50,1656,528.0,1080.0,2.0,1656,6,215000
1,0,140,468.0,0,5,0.0,1961,49,1,1961,49,896,730.0,882.0,1.0,896,5,105000
2,36,393,923.0,0,6,108.0,1958,52,1,1958,52,1329,312.0,1329.0,1.0,1329,6,172000
3,0,0,1065.0,2,8,0.0,1968,42,2,1968,42,2110,522.0,2110.0,2.0,2110,7,244000
4,34,212,791.0,1,6,0.0,1998,12,2,1997,13,928,482.0,928.0,2.0,1629,5,189900
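
The correlation filter can be checked on a tiny deterministic example. In this sketch the columns `size` and `flag` are hypothetical: `size` drives the price exactly, while `flag` just alternates between 1 and -1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: price is a linear function of size only
size = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "size": size,
    "flag": [1, -1] * 5,              # alternating, nearly unrelated to price
    "SalePrice": 1000 * size + 50000,
})

# Absolute correlation of every feature with the target
corr = toy.corr()["SalePrice"].abs().drop("SalePrice")

# Keep the features above the 0.3 threshold
selected = corr[corr > 0.3].index.tolist()
print(corr)
print(selected)
```

Here `flag` has an absolute correlation of roughly 0.17 with the price, so only `size` passes the cutoff. Pearson correlation only captures linear relationships, so a threshold like this is a rough first filter.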


In [81]:
# Convert 'Overall Qual' to a categorical data type and create dummy columns
trans_df['Overall Qual'] = trans_df['Overall Qual'].astype('category')
dummy = pd.get_dummies(trans_df['Overall Qual'])
clean_df = pd.concat([trans_df, dummy], axis =1)
clean_df.head()

Unnamed: 0,Open Porch SF,Wood Deck SF,BsmtFin SF 1,Fireplaces,TotRms AbvGrd,Mas Vnr Area,Year Remod/Add,Years Since Remod,Full Bath,Year Built,...,1,2,3,4,5,6,7,8,9,10
0,62,210,639.0,2,7,112.0,1960,50,1,1960,...,0,0,0,0,0,1,0,0,0,0
1,0,140,468.0,0,5,0.0,1961,49,1,1961,...,0,0,0,0,1,0,0,0,0,0
2,36,393,923.0,0,6,108.0,1958,52,1,1958,...,0,0,0,0,0,1,0,0,0,0
3,0,0,1065.0,2,8,0.0,1968,42,2,1968,...,0,0,0,0,0,0,1,0,0,0
4,34,212,791.0,1,6,0.0,1998,12,2,1997,...,0,0,0,0,1,0,0,0,0,0
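
The cell above keeps the original `Overall Qual` column next to its dummy columns and names the dummies with the integers 1 to 10. One possible variant, sketched here on a made-up four-row frame, gives the dummies a string prefix and drops the original column. The prefix `Qual` is an arbitrary choice, and the note about column names reflects my understanding that recent scikit-learn versions expect all feature names to be strings.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the notebook
toy = pd.DataFrame({
    "Overall Qual": [6, 5, 6, 7],
    "Gr Liv Area": [1656, 896, 1329, 2110],
    "SalePrice": [215000, 105000, 172000, 244000],
})

# One 0/1 column per quality level, named Qual_5, Qual_6, ...
dummies = pd.get_dummies(toy["Overall Qual"], prefix="Qual", dtype=int)

# Replace the original column with its dummy columns
encoded = pd.concat([toy.drop(columns="Overall Qual"), dummies], axis=1)
print(encoded)
```

Dropping the original column avoids feeding the same information to the model twice, once as a number and once as indicators.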


Rewrite the initial `train_and_test` function to integrate k-fold cross-validation.

In [82]:
def train_and_test(df, k=0):
    lr = LinearRegression()
    features = df.columns.drop('SalePrice')
    if k == 0:
        # Holdout validation: shuffle the rows, then split at row 1460
        shuffled_index = np.random.permutation(df.index)
        rand_df = df.reindex(shuffled_index)
        train = rand_df[:1460]
        test = rand_df[1460:]
        # Fit on the training set and score on the test set
        lr.fit(train[features], train['SalePrice'])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test['SalePrice'], predictions)
        rmse = np.sqrt(mse)
        return rmse
    else:
        # k-fold cross-validation: compute one RMSE per fold, then average
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test["SalePrice"], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        print(rmse_values)
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

In [83]:
rmse = train_and_test(clean_df,k=5)
rmse

[25223.590553752725, 24608.61706281656, 28604.89586155913, 29065.619310292805, 40356.44457511208]


29571.833472706658
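
Because `KFold` is created with `shuffle=True` and no seed, the fold RMSE values change from run to run. The sketch below shows one way to get repeatable folds, using scikit-learn's `cross_val_score` with a fixed `random_state`. The data is synthetic (a single made-up `Gr Liv Area` feature with noise), so the numbers it prints say nothing about the Ames results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the Ames dataset)
rng = np.random.default_rng(1)
X = pd.DataFrame({"Gr Liv Area": rng.uniform(500, 3000, 100)})
y = 100 * X["Gr Liv Area"] + rng.normal(0, 10000, 100)

# A fixed random_state makes the shuffled folds identical across runs
kf = KFold(n_splits=5, shuffle=True, random_state=1)

# The scorer returns negated RMSE values, so flip the sign
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf,
    scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -scores
print(rmse_per_fold, rmse_per_fold.mean())
```

scikit-learn negates error metrics in its scorers so that a higher score is always better, which is why the sign is flipped before averaging.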