Predicting House Sale Prices
===

In this project, we will work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. 

The dataset can be found here: https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627

Information about the columns can be found here: https://s3.amazonaws.com/dq-content/307/data_description.txt

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

Introduction
---

In [2]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         29

In [4]:
def transform_feautres(df):
    return df

def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=["integer", "float"])
    numeric_test = test.select_dtypes(include=["integer", "float"])
    features = numeric_train.columns.drop("SalePrice")
    
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    test_predictions = lr.predict(test[features])
    test_mse = mean_squared_error(test["SalePrice"],test_predictions)
    test_rmse = np.sqrt(test_mse)
    return test_rmse

In [5]:
transfrom_df = transform_feautres(df)
filtered_df = select_features(transfrom_df)
rmse = train_and_test(filtered_df)

In [6]:
rmse

57088.25161263909
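
To make the metric concrete, here is a minimal sketch of what `np.sqrt(mean_squared_error(...))` computes. The three prices are made up for illustration and are not taken from the Ames data, and the names `actual`, `predicted`, and `toy_rmse` are hypothetical.

```python
import numpy as np

# Made-up sale prices, for illustration only (not from the Ames data).
actual = np.array([200000.0, 150000.0, 320000.0])
predicted = np.array([210000.0, 140000.0, 300000.0])

# RMSE: square each error, average the squares, then take the square root.
errors = predicted - actual
toy_rmse = np.sqrt(np.mean(errors ** 2))
print(toy_rmse)
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones. The result is in the same unit as the target, so the value above can be read as a typical error in dollars.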

Feature Engineering
---

We will now update our transform function to remove features that have more than 5% missing values, dive deeper into potential categorical features, and transform text and numerical columns.

Before modifying the function, we will do the cleaning outside of it so that we can check each step for bugs. This makes debugging much easier and faster. Once the cleaning works, we will move the code into the function.

First, we will remove any columns that have more than 5% missing values.

In [7]:
missing_num = df.isnull().sum()

In [8]:
missing_cols = missing_num[(missing_num > len(df)*0.05)].sort_values()

In [9]:
df.drop(missing_cols.index, axis=1, inplace=True)
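
The 5% rule can be checked on a small frame where the answer is easy to verify by hand. This is only a sketch: the frame `toy` and its column names are hypothetical, and it uses `drop(columns=...)` without `inplace=True` so the original frame is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 40 rows, three numeric columns.
toy = pd.DataFrame({
    "a": np.arange(40, dtype=float),
    "b": np.arange(40, dtype=float),
    "c": np.arange(40, dtype=float),
})
toy.loc[:0, "b"] = np.nan   # 1 of 40 values missing (2.5%)
toy.loc[:3, "c"] = np.nan   # 4 of 40 values missing (10%)

# Same logic as above: count missing values, keep columns at or below 5%.
missing_counts = toy.isnull().sum()
too_sparse = missing_counts[missing_counts > len(toy) * 0.05].index
cleaned = toy.drop(columns=too_sparse)
print(list(cleaned.columns))
```

Only column "c" crosses the threshold here. Column "b" still has a missing value, which is why the later steps handle the remaining gaps separately.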

Next, we will drop any text columns that have one or more missing values, because we can't fill missing text values with a mean.

In [10]:
text_null_count = df.select_dtypes(include=["object"]).isnull().sum()

In [11]:
missing_text = text_null_count[text_null_count > 0].index

In [12]:
df.drop(missing_text, axis=1, inplace=True)

We still have missing values in the numerical columns. We will fill those missing values with the most common value for each column.

In [13]:
missing_num_cols = df.select_dtypes(include=["integer","float"]).isnull().sum()

In [14]:
missing_num_cols = missing_num_cols[(missing_num_cols > 0)]

In [15]:
df[missing_num_cols.index].mode()

   Mas Vnr Area  BsmtFin SF 1  BsmtFin SF 2  Bsmt Unf SF  Total Bsmt SF  Bsmt Full Bath  Bsmt Half Bath  Garage Cars  Garage Area
0           0.0           0.0           0.0          0.0            0.0             0.0             0.0          2.0          0.0


We will now convert this output into a dictionary that maps each column name to its most common value, so that we can fill in the missing values column by column.

In [16]:
missing_dict = df[missing_num_cols.index].mode().to_dict(orient='records')[0]

In [17]:
df.fillna(missing_dict, inplace=True)

In [18]:
df.isnull().sum().any()

False

We got rid of all the missing values in our data. We can now move on.
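
The mode-to-dictionary step is easier to follow on a tiny frame. This is a sketch with made-up values; the names `toy`, `modes`, and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "garage_cars": [2.0, 2.0, 1.0, np.nan],
    "bsmt_sf": [0.0, 0.0, 850.0, np.nan],
})

# mode() returns a DataFrame; to_dict(orient="records") turns each row into
# a {column: value} dictionary, and [0] picks the first row of modes.
modes = toy.mode().to_dict(orient="records")[0]

# fillna accepts that dictionary and fills each column with its own value.
filled = toy.fillna(modes)
print(modes)
```

One caveat: when two values tie for most common, `mode()` returns more than one row, and taking `[0]` keeps only the first of them.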

We can now create new features. The "Year Remod/Add" and "Yr Sold" columns are numerical, but a raw calendar year is not very informative to a linear regression model on its own. What matters more is how much time passed between the remodel and the sale, so we can subtract "Year Remod/Add" from "Yr Sold" to get a column that we can use as a feature.

In [19]:
df["yr_since_remod"] = df["Yr Sold"] - df["Year Remod/Add"]

We can do the same with "Year Built" column as well.

In [20]:
df["yr_since_built"] = df["Yr Sold"] - df["Year Built"]

In [21]:
df[df["yr_since_remod"]<0]

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,...,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,yr_since_remod,yr_since_built
1702,1703,528120010,60,RL,16659,Pave,IR1,Lvl,AllPub,Corner,...,0,0,0,6,2007,New,Partial,260116,-1,0
2180,2181,908154195,20,RL,39290,Pave,IR1,Bnk,AllPub,Inside,...,0,0,17000,10,2007,New,Partial,183850,-2,-1
2181,2182,908154205,60,RL,40094,Pave,IR1,Bnk,AllPub,Inside,...,0,0,0,10,2007,New,Partial,184750,-1,0


In [22]:
df[df["yr_since_built"]<0]

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,...,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice,yr_since_remod,yr_since_built
2180,2181,908154195,20,RL,39290,Pave,IR1,Bnk,AllPub,Inside,...,0,0,17000,10,2007,New,Partial,183850,-2,-1


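Some rows have negative values, which means the recorded sale year is earlier than the recorded build or remodel year. We can't use these rows in our model, so we are going to drop them. The only row with a negative "yr_since_built" also has a negative "yr_since_remod", so filtering on "yr_since_remod" removes all of them.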

In [23]:
df = df.drop(df[df["yr_since_remod"]<0].index, axis=0)
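
The same feature construction and row filter can be sketched on three made-up rows. The values are illustrative only, and the explicit check on both derived columns is an assumption added here; the code above filters on "yr_since_remod" alone.

```python
import pandas as pd

# Hypothetical rows: one ordinary sale and two with inconsistent years.
toy = pd.DataFrame({
    "Yr Sold": [2010, 2007, 2007],
    "Year Built": [1960, 2007, 2008],
    "Year Remod/Add": [1960, 2008, 2009],
})
toy["yr_since_built"] = toy["Yr Sold"] - toy["Year Built"]
toy["yr_since_remod"] = toy["Yr Sold"] - toy["Year Remod/Add"]

# Keep only rows where both elapsed-time features are non-negative.
valid = (toy["yr_since_built"] >= 0) & (toy["yr_since_remod"] >= 0)
toy_clean = toy[valid]
print(toy_clean)
```

Checking both columns costs nothing and guards against data where a build year is inconsistent but the remodel year is not.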

We also don't need the original columns.

In [24]:
df.drop(["Year Built", "Year Remod/Add"], axis=1, inplace=True)

We still have columns that are useless for machine learning or that leak information about the final sale. We will drop these columns.

In [25]:
df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1, inplace=True)

We can now update our transform function that we wrote in the introduction part.

In [26]:
def transform_feautres(df):
    missing_num = df.isnull().sum()
    missing_cols = missing_num[(missing_num > len(df)*0.05)].sort_values()
    df.drop(missing_cols.index, axis=1, inplace=True)
    
    text_null_count = df.select_dtypes(include=["object"]).isnull().sum()
    missing_text = text_null_count[text_null_count > 0].index
    df.drop(missing_text, axis=1, inplace=True)
    
    missing_num_cols = df.select_dtypes(include=["integer","float"]).isnull().sum()
    missing_num_cols = missing_num_cols[(missing_num_cols > 0)]
    missing_dict = df[missing_num_cols.index].mode().to_dict(orient='records')[0]
    df.fillna(missing_dict, inplace=True)
    
    df["yr_since_remod"] = df["Yr Sold"] - df["Year Remod/Add"]
    df["yr_since_built"] = df["Yr Sold"] - df["Year Built"]
    df = df.drop(df[df["yr_since_remod"]<0].index, axis=0)
    df.drop(["Year Built", "Year Remod/Add"], axis=1, inplace=True)
    df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1, inplace=True)
    return df

def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=["integer", "float"])
    numeric_test = test.select_dtypes(include=["integer", "float"])
    features = numeric_train.columns.drop("SalePrice")
    
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    test_predictions = lr.predict(test[features])
    test_mse = mean_squared_error(test["SalePrice"],test_predictions)
    test_rmse = np.sqrt(test_mse)
    return test_rmse

In [30]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')
transform_df = transform_feautres(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

In [31]:
rmse

55275.36731241307

We can see that our RMSE value decreased after cleaning the dataset.

Feature Selection
---

Let's see which features correlate strongly with our target column "SalePrice".

In [32]:
corrdf = transform_df.select_dtypes(include=["integer","float"]).corr()

In [33]:
corrdf["SalePrice"].abs().sort_values(ascending=False)

SalePrice          1.000000
Overall Qual       0.801206
Gr Liv Area        0.717596
Garage Cars        0.648361
Total Bsmt SF      0.644012
Garage Area        0.641425
1st Flr SF         0.635185
yr_since_built     0.558979
Full Bath          0.546118
yr_since_remod     0.534985
Mas Vnr Area       0.506983
TotRms AbvGrd      0.498574
Fireplaces         0.474831
BsmtFin SF 1       0.439284
Wood Deck SF       0.328183
Open Porch SF      0.316262
Half Bath          0.284871
Bsmt Full Bath     0.276258
2nd Flr SF         0.269601
Lot Area           0.267520
Bsmt Unf SF        0.182751
Bedroom AbvGr      0.143916
Enclosed Porch     0.128685
Kitchen AbvGr      0.119760
Screen Porch       0.112280
Overall Cond       0.101540
MS SubClass        0.085128
Pool Area          0.068438
Low Qual Fin SF    0.037629
Bsmt Half Bath     0.035875
3Ssn Porch         0.032268
Misc Val           0.019273
BsmtFin SF 2       0.006127
Name: SalePrice, dtype: float64

We will drop the columns whose absolute correlation with the "SalePrice" column is below 0.4.

In [34]:
transform_df.drop(corrdf[corrdf["SalePrice"].abs() < 0.4].index, axis=1, inplace=True)
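
The reason for calling `.abs()` before applying the cutoff shows up clearly on a small example. This is a sketch with made-up numbers; the columns "area", "age", and "noise" are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one feature rises with price, one falls, one is unrelated.
toy = pd.DataFrame({
    "SalePrice": [100, 200, 300, 400, 500],
    "area": [10, 22, 29, 41, 50],     # strong positive correlation
    "age": [50, 38, 31, 18, 10],      # strong negative correlation
    "noise": [3, 1, 2, 1, 3],         # no linear relationship
})

# Absolute correlation with the target, then drop everything below 0.4.
corr = toy.corr()["SalePrice"].abs()
weak = corr[corr < 0.4].index
strong_only = toy.drop(columns=weak)
print(list(strong_only.columns))
```

Without `.abs()`, a strongly negative feature such as "age" would fall below the cutoff and be dropped, even though it carries as much linear signal as "area". The same applies to "yr_since_built" and "yr_since_remod" in the real data, since those are age-like features too.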

Next, we will select the columns that need to be converted to the categorical data type.

In [35]:
transform_df.head()

Unnamed: 0,MS Zoning,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,...,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Cars,Garage Area,Paved Drive,SalePrice,yr_since_remod,yr_since_built
0,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,...,TA,7,Typ,2,2.0,528.0,P,215000,50,50
1,RH,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,...,TA,5,Typ,0,1.0,730.0,Y,105000,49,49
2,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,...,Gd,6,Typ,0,1.0,312.0,Y,172000,52,52
3,RL,Pave,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,...,Ex,8,Typ,2,2.0,522.0,Y,244000,42,42
4,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,...,TA,6,Typ,1,2.0,482.0,Y,189900,12,13


Let's make a list of nominal columns.

In [36]:
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

In [37]:
transform_cat_cols = []
for col in nominal_features:
    if col in transform_df.columns:
        transform_cat_cols.append(col)

In [38]:
transform_cat_cols

['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Foundation',
 'Heating',
 'Central Air']

In [39]:
transform_df[transform_cat_cols].apply(lambda x: len(x.value_counts())).sort_values()

Street           2
Central Air      2
Land Contour     4
Lot Config       5
Bldg Type        5
Roof Style       6
Foundation       6
Heating          6
MS Zoning        7
Condition 2      8
House Style      8
Roof Matl        8
Condition 1      9
Exterior 1st    16
Exterior 2nd    17
Neighborhood    28
dtype: int64

We should drop the columns that have more than 10 unique values, because each unique value becomes its own dummy column when we encode the data.

In [43]:
drop_noms = transform_df[transform_cat_cols].apply(lambda x: len(x.value_counts())).sort_values()

In [48]:
transform_df.drop(drop_noms[drop_noms > 10].index, axis=1, inplace=True)
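
A small sketch shows how the number of unique values drives the number of dummy columns. The rows are made up, and the cutoff `max_levels = 3` is a hypothetical value chosen to suit a four-row frame; the analysis above uses 10.

```python
import pandas as pd

# Hypothetical toy frame: one low-cardinality and one high-cardinality column.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Neighborhood": ["NAmes", "Gilbert", "Somerst", "OldTown"],
})

max_levels = 3  # hypothetical cutoff for this tiny example
unique_counts = toy.nunique()
high_cardinality = unique_counts[unique_counts > max_levels].index
trimmed = toy.drop(columns=high_cardinality)

# Each distinct value becomes one dummy column.
dummies_before = pd.get_dummies(toy).shape[1]
dummies_after = pd.get_dummies(trimmed).shape[1]
print(dummies_before, dummies_after)
```

Every extra level adds a column that is mostly zeros, and with few rows per level a linear model can overfit to them. A cutoff like this trades some information for a smaller feature set.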

Now, we can convert the text columns into categorical data.

In [52]:
text_df = transform_df.select_dtypes(include=["object"])
for col in text_df:
    transform_df[col] = transform_df[col].astype('category')

In [53]:
dummy_cols = pd.get_dummies(transform_df.select_dtypes(include=['category']))

In [55]:
transform_df = pd.concat([transform_df, dummy_cols], axis=1)

In [58]:
transform_df.drop(text_df.columns, axis=1, inplace=True)
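
The four encoding cells above can be followed end to end on a single column. This is a sketch with made-up rows; the names `toy`, `dummies`, and `encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one text column and the target.
toy = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "SalePrice": [215000, 105000, 172000],
})

# 1. Convert the text column to the category dtype.
toy["Central Air"] = toy["Central Air"].astype("category")

# 2. Build one indicator column per category.
dummies = pd.get_dummies(toy.select_dtypes(include=["category"]))

# 3. Attach the indicators and drop the original text column.
encoded = pd.concat([toy, dummies], axis=1).drop(columns=["Central Air"])
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns may come back as `uint8` or as `bool`. If a later step selects features with `select_dtypes(include=["integer", "float"])`, boolean indicators would be left out, so passing `dtype=int` to `pd.get_dummies` is one way to keep them in, assuming that is the desired behaviour.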

We can now add this code to the select_features function.

In [61]:
def transform_feautres(df):
    missing_num = df.isnull().sum()
    missing_cols = missing_num[(missing_num > len(df)*0.05)].sort_values()
    df.drop(missing_cols.index, axis=1, inplace=True)
    
    text_null_count = df.select_dtypes(include=["object"]).isnull().sum()
    missing_text = text_null_count[text_null_count > 0].index
    df.drop(missing_text, axis=1, inplace=True)
    
    missing_num_cols = df.select_dtypes(include=["integer","float"]).isnull().sum()
    missing_num_cols = missing_num_cols[(missing_num_cols > 0)]
    missing_dict = df[missing_num_cols.index].mode().to_dict(orient='records')[0]
    df.fillna(missing_dict, inplace=True)
    
    df["yr_since_remod"] = df["Yr Sold"] - df["Year Remod/Add"]
    df["yr_since_built"] = df["Yr Sold"] - df["Year Built"]
    df = df.drop(df[df["yr_since_remod"]<0].index, axis=0)
    df.drop(["Year Built", "Year Remod/Add"], axis=1, inplace=True)
    df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1, inplace=True)
    return df

def select_features(df):
    corrdf = df.select_dtypes(include=["integer","float"]).corr()
    df.drop(corrdf[corrdf["SalePrice"].abs() < 0.4].index, axis=1, inplace=True)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)
    drop_noms = df[transform_cat_cols].apply(lambda x: len(x.value_counts())).sort_values()
    df.drop(drop_noms[drop_noms > 10].index, axis=1, inplace=True)
    
    text_df = df.select_dtypes(include=["object"])
    for col in text_df:
        df[col] = df[col].astype('category')
    dummy_cols = pd.get_dummies(df.select_dtypes(include=['category']))
    df = pd.concat([df, dummy_cols], axis=1)
    df.drop(text_df.columns, axis=1, inplace=True)
    return df

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=["integer", "float"])
    numeric_test = test.select_dtypes(include=["integer", "float"])
    features = numeric_train.columns.drop("SalePrice")
    
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    test_predictions = lr.predict(test[features])
    test_mse = mean_squared_error(test["SalePrice"],test_predictions)
    test_rmse = np.sqrt(test_mse)
    return test_rmse

In [62]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')
transform_df = transform_feautres(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

In [63]:
rmse

33367.287183402805

Train and Test
---

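For the final function, we will add a k parameter that controls the type of validation. With k=0 we use the holdout split from before, with k=1 we use a single split of the shuffled rows, and with any other value we run k-fold cross validation with k folds.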

In [75]:
def transform_feautres(df):
    missing_num = df.isnull().sum()
    missing_cols = missing_num[(missing_num > len(df)*0.05)].sort_values()
    df.drop(missing_cols.index, axis=1, inplace=True)
    
    text_null_count = df.select_dtypes(include=["object"]).isnull().sum()
    missing_text = text_null_count[text_null_count > 0].index
    df.drop(missing_text, axis=1, inplace=True)
    
    missing_num_cols = df.select_dtypes(include=["integer","float"]).isnull().sum()
    missing_num_cols = missing_num_cols[(missing_num_cols > 0)]
    missing_dict = df[missing_num_cols.index].mode().to_dict(orient='records')[0]
    df.fillna(missing_dict, inplace=True)
    
    df["yr_since_remod"] = df["Yr Sold"] - df["Year Remod/Add"]
    df["yr_since_built"] = df["Yr Sold"] - df["Year Built"]
    df = df.drop(df[df["yr_since_remod"]<0].index, axis=0)
    df.drop(["Year Built", "Year Remod/Add"], axis=1, inplace=True)
    df.drop(["PID", "Order", "Mo Sold", "Sale Condition", "Sale Type", "Yr Sold"], axis=1, inplace=True)
    return df

def select_features(df):
    corrdf = df.select_dtypes(include=["integer","float"]).corr()
    df.drop(corrdf[corrdf["SalePrice"].abs() < 0.4].index, axis=1, inplace=True)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)
    drop_noms = df[transform_cat_cols].apply(lambda x: len(x.value_counts())).sort_values()
    df.drop(drop_noms[drop_noms > 10].index, axis=1, inplace=True)
    
    text_df = df.select_dtypes(include=["object"])
    for col in text_df:
        df[col] = df[col].astype('category')
    dummy_cols = pd.get_dummies(df.select_dtypes(include=['category']))
    df = pd.concat([df, dummy_cols], axis=1)
    df.drop(text_df.columns, axis=1, inplace=True)
    return df

def train_and_test(df, k=0):
    numeric_train = df.select_dtypes(include=["integer", "float"])
    features = numeric_train.columns.drop("SalePrice")
    
    if k==0:
        train = df[:1460]
        test = df[1460:]
    
        lr = LinearRegression()
        lr.fit(train[features], train["SalePrice"])
        test_predictions = lr.predict(test[features])
        test_mse = mean_squared_error(test["SalePrice"],test_predictions)
        test_rmse = np.sqrt(test_mse)
        return test_rmse
    
    if k==1:
        # Shuffle the rows, then split the shuffled frame (not the original df).
        shuffled = df.sample(frac=1)
        train = shuffled[:1460]
        test = shuffled[1460:]
        
        lr = LinearRegression()
        lr.fit(train[features], train["SalePrice"])
        test_predictions = lr.predict(test[features])
        train_predictions = lr.predict(train[features])
        test_mse = mean_squared_error(test["SalePrice"],test_predictions)
        train_mse = mean_squared_error(train["SalePrice"],train_predictions)
        train_rmse = np.sqrt(train_mse)
        test_rmse = np.sqrt(test_mse)
        print(test_rmse)
        print(train_rmse)
        avg_rmse = (train_rmse+test_rmse)/2
        return avg_rmse
    
    else:
        kf = KFold(n_splits=k, shuffle=True, random_state=1)
        rmses = []
        
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
        
            lr = LinearRegression()
            lr.fit(train[features], train["SalePrice"])
            test_predictions = lr.predict(test[features])
            test_mse = mean_squared_error(test["SalePrice"],test_predictions)
            test_rmse = np.sqrt(test_mse)
            rmses.append(test_rmse)
        print(rmses)
        avg_rmse = np.mean(rmses)
        return avg_rmse

In [78]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')
transform_df = transform_feautres(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df, k=5)

[39086.22885496491, 24011.593774701, 27527.5630122074, 25057.618801018743, 28479.566805355185]


In [79]:
rmse

28832.51424964945
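
To see what the k-fold branch does in isolation, here is a minimal sketch on synthetic data where the price is an exact linear function of a single feature. The arrays `area` and `price` are made up for illustration, so the resulting errors say nothing about the Ames results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (hypothetical): 20 rows, price = 100 * area + 5000.
area = np.arange(1000, 2000, 50, dtype=float).reshape(-1, 1)
price = 100.0 * area.ravel() + 5000.0

kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_rmses = []
fold_sizes = []
for train_idx, test_idx in kf.split(area):
    # Fit on four folds, evaluate on the held-out fold.
    model = LinearRegression().fit(area[train_idx], price[train_idx])
    preds = model.predict(area[test_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(price[test_idx], preds)))
    fold_sizes.append(len(test_idx))

avg_rmse = np.mean(fold_rmses)
print(fold_sizes, avg_rmse)
```

Each row lands in the test set exactly once, so the average is less sensitive to one lucky or unlucky split than a single holdout. In the real output above, the per-fold values range from about 24,000 to about 39,000, which is the kind of spread a single split would hide.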