# Predicting House Sale Prices

In this project, we'll explore ways to build and improve a linear regression model by working with housing data for the city of Ames, Iowa from 2006 to 2010. Information on the dataset can be found [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627), and the column descriptions can be found [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

## Building the Pipeline

We'll start by importing our libraries, reading in our data, and then setting up a pipeline of functions that will help us quickly iterate over different models.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn import linear_model

pd.options.display.max_columns = 999

In [2]:
df = pd.read_csv('AmesHousing.tsv', delimiter='\t')

In [3]:
# Returns training Data Frame
def transform_features(df):
    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    features = numeric_train.columns.drop('SalePrice')
    
    lr = linear_model.LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    
    return rmse


transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

print("RMSE: ", round(rmse, 2))

RMSE:  57088.25
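To make the error metric concrete, here is a minimal sketch of how RMSE is computed from a set of predictions. The `rmse` helper and the price values are hypothetical, chosen only for illustration; the pipeline itself relies on `mean_squared_error` and `np.sqrt`.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical sale prices and predictions, for illustration only
actual_prices = [200000, 150000, 300000]
predicted_prices = [210000, 140000, 330000]

example_rmse = rmse(actual_prices, predicted_prices)
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, the single 30,000 miss dominates the result, and the RMSE is expressed in the same units as `SalePrice`.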


## Feature Engineering

Next, we'll begin removing features that have too many missing values, look for potential categorical features, and transform the text and numerical columns.

We'll start by updating our `transform_features()` function so that any column with more than 5% missing values is dropped. We'll also remove columns that leak information about the sale. We'll need to read the data documentation to get a better idea of what transformations are necessary and for which columns.

1) For **all columns**, drop any with more than 5% missing values:

In [4]:
# Create series with missing values data
val_missing = df.isnull().sum()

# Filter the columns containing > 5% missing values
missing_val_cols = val_missing[(val_missing > len(df)/20)].sort_values()

# Drop missing value columns
df = df.drop(missing_val_cols.index, axis=1)
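The same thresholding idea can be sketched on a small frame. The `toy` data, the `drop_sparse_columns` helper, and its `max_missing_frac` parameter are hypothetical names introduced for this example; a 30% cutoff is used here only so that the tiny frame shows both outcomes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' is 50% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [np.nan, 1.0, 2.0, 3.0],
})

def drop_sparse_columns(frame, max_missing_frac=0.05):
    """Drop columns whose share of missing values exceeds max_missing_frac."""
    # isnull().mean() gives the fraction of missing values per column
    missing_frac = frame.isnull().mean()
    sparse_cols = missing_frac[missing_frac > max_missing_frac].index
    return frame.drop(columns=sparse_cols)

kept = drop_sparse_columns(toy, max_missing_frac=0.3)
print(list(kept.columns))
```

Working with a fraction rather than a raw count keeps the cutoff independent of the number of rows.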

2) For **text columns**, drop any with 1 or more missing values:

In [5]:
# Create series with missing text values data
text_missing = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)

# Filter the text columns containing any missing values
missing_text_cols = text_missing[text_missing > 0]

# Drop missing value columns
df = df.drop(missing_text_cols.index, axis=1)

3) For **numerical columns**, fill in missing values with the most common value in that column:

In [6]:
# Compute missing value counts
num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
fixable_num_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
fixable_num_cols

BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Garage Cars        1
Garage Area        1
Bsmt Full Bath     2
Bsmt Half Bath     2
Mas Vnr Area      23
dtype: int64

In [7]:
# Compute the most common value for each of the columns in `fixable_num_cols`
replacement_vals = df[fixable_num_cols.index].mode().to_dict(orient='records')[0]
replacement_vals

{'BsmtFin SF 1': 0.0,
 'BsmtFin SF 2': 0.0,
 'Bsmt Unf SF': 0.0,
 'Total Bsmt SF': 0.0,
 'Garage Cars': 2.0,
 'Garage Area': 0.0,
 'Bsmt Full Bath': 0.0,
 'Bsmt Half Bath': 0.0,
 'Mas Vnr Area': 0.0}

In [8]:
# Replace missing values
df = df.fillna(replacement_vals)

# Verify no missing values left
df.isnull().sum().value_counts()

0    64
dtype: int64
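As a small illustration of the mode-based fill used above, the sketch below applies it to a toy frame. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'garage_cars': [2.0, 2.0, 1.0, np.nan],
    'bsmt_sf': [0.0, 0.0, np.nan, 500.0],
})

# mode() can return several rows when values tie;
# row 0 holds the first mode of each column
replacement_vals = toy.mode().iloc[0].to_dict()

# fillna() accepts a dict mapping column name -> fill value
filled = toy.fillna(replacement_vals)
print(replacement_vals)
print(filled)
```

Filling with the mode suits columns dominated by a single value, such as the many zero-valued basement columns, though a median or mean may be a better fit for more evenly spread data.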

Next, let's look at what new features we can create to better capture some of the information in the dataset.

In [9]:
years_sold = df['Yr Sold'] - df['Year Built']
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [10]:
years_remodel = df['Yr Sold'] - df['Year Remod/Add']
years_remodel[years_remodel < 0]

1702   -1
2180   -2
2181   -1
dtype: int64

In [11]:
# Make new columns
df['Years Before Sale'] = years_sold
df['Years Since Remodel'] = years_remodel

# Drop the rows with a negative value for either new feature
df = df.drop([1702, 2180, 2181], axis=0)

# Remove original year columns
df = df.drop(['Year Built', 'Year Remod/Add'], axis = 1)
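The row labels dropped above are hard-coded from the output of the two previous cells. One possible alternative, sketched below on hypothetical toy data, is to filter on the computed values so the step does not depend on specific row labels. The `add_age_features` helper is an assumption made for this example, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy frame; row 1 was "sold" before it was built
toy = pd.DataFrame({
    'Yr Sold': [2008, 2007, 2009],
    'Year Built': [1990, 2008, 2000],
    'Year Remod/Add': [2000, 2009, 2005],
})

def add_age_features(frame):
    """Add the two age features and keep only rows where both are non-negative."""
    frame = frame.copy()
    frame['Years Before Sale'] = frame['Yr Sold'] - frame['Year Built']
    frame['Years Since Remodel'] = frame['Yr Sold'] - frame['Year Remod/Add']
    valid = (frame['Years Before Sale'] >= 0) & (frame['Years Since Remodel'] >= 0)
    return frame[valid]

aged = add_age_features(toy)
print(aged[['Years Before Sale', 'Years Since Remodel']])
```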

Next, we'll drop the columns that aren't useful in machine learning models and columns that leak data about the final sale.

In [12]:
# Drop columns that are not useful
df = df.drop(['PID', 'Order'], axis=1)

# Drop columns that leak info
df = df.drop(['Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis=1)

Now let's update the `transform_features()` function.

In [13]:
def transform_features(df):
    
    val_missing = df.isnull().sum()
    missing_val_cols = val_missing[(val_missing > len(df)/20)].sort_values()
    df = df.drop(missing_val_cols.index, axis=1)
    
    text_missing = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)
    missing_text_cols = text_missing[text_missing > 0]
    df = df.drop(missing_text_cols.index, axis=1)
    
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_num_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_vals = df[fixable_num_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_vals)
    
    years_sold = df['Yr Sold'] - df['Year Built']
    years_remodel = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remodel'] = years_remodel
    df = df.drop([1702, 2180, 2181], axis=0)
    
    df = df.drop(['Year Built', 'Year Remod/Add', 'PID', 'Order', 'Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis = 1)
    
    return df

def select_features(df):
    return df[['Gr Liv Area', 'SalePrice']]

def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    features = numeric_train.columns.drop('SalePrice')
    
    lr = linear_model.LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test['SalePrice'], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv('AmesHousing.tsv', delimiter='\t')
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

print("RMSE: ", round(rmse, 2))

RMSE:  55275.37


## Feature Selection

Now that we've cleaned and transformed our data, we can move on to selecting our features for our model.

We'll start by looking at how numerical features in our training set correlate with `SalePrice`.

In [14]:
numerical_df = transform_df.select_dtypes(include=['int', 'float'])
numerical_df.head()

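| | MS SubClass | Lot Area | Overall Qual | Overall Cond | Mas Vnr Area | BsmtFin SF 1 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | TotRms AbvGrd | Fireplaces | Garage Cars | Garage Area | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | Years Before Sale | Years Since Remodel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20 | 31770 | 6 | 5 | 112.0 | 639.0 | 0.0 | 441.0 | 1080.0 | 1656 | 0 | 0 | 1656 | 1.0 | 0.0 | 1 | 0 | 3 | 1 | 7 | 2 | 2.0 | 528.0 | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 215000 | 50 | 50 |
| 1 | 20 | 11622 | 5 | 6 | 0.0 | 468.0 | 144.0 | 270.0 | 882.0 | 896 | 0 | 0 | 896 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | 5 | 0 | 1.0 | 730.0 | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 105000 | 49 | 49 |
| 2 | 20 | 14267 | 6 | 6 | 108.0 | 923.0 | 0.0 | 406.0 | 1329.0 | 1329 | 0 | 0 | 1329 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | 6 | 0 | 1.0 | 312.0 | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 172000 | 52 | 52 |
| 3 | 20 | 11160 | 7 | 5 | 0.0 | 1065.0 | 0.0 | 1045.0 | 2110.0 | 2110 | 0 | 0 | 2110 | 1.0 | 0.0 | 2 | 1 | 3 | 1 | 8 | 2 | 2.0 | 522.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 244000 | 42 | 42 |
| 4 | 60 | 13830 | 5 | 5 | 0.0 | 791.0 | 0.0 | 137.0 | 928.0 | 928 | 701 | 0 | 1629 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 6 | 1 | 2.0 | 482.0 | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 189900 | 13 | 12 |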


In [15]:
corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values(ascending=False)
corr_coeffs

SalePrice              1.000000
Overall Qual           0.801206
Gr Liv Area            0.717596
Garage Cars            0.648361
Total Bsmt SF          0.644012
Garage Area            0.641425
1st Flr SF             0.635185
Years Before Sale      0.558979
Full Bath              0.546118
Years Since Remodel    0.534985
Mas Vnr Area           0.506983
TotRms AbvGrd          0.498574
Fireplaces             0.474831
BsmtFin SF 1           0.439284
Wood Deck SF           0.328183
Open Porch SF          0.316262
Half Bath              0.284871
Bsmt Full Bath         0.276258
2nd Flr SF             0.269601
Lot Area               0.267520
Bsmt Unf SF            0.182751
Bedroom AbvGr          0.143916
Enclosed Porch         0.128685
Kitchen AbvGr          0.119760
Screen Porch           0.112280
Overall Cond           0.101540
MS SubClass            0.085128
Pool Area              0.068438
Low Qual Fin SF        0.037629
Bsmt Half Bath         0.035875
3Ssn Porch             0.032268
Misc Val

In [16]:
# Keep only the columns with a correlation coefficient > 0.4
corr_coeffs[corr_coeffs > 0.4]

SalePrice              1.000000
Overall Qual           0.801206
Gr Liv Area            0.717596
Garage Cars            0.648361
Total Bsmt SF          0.644012
Garage Area            0.641425
1st Flr SF             0.635185
Years Before Sale      0.558979
Full Bath              0.546118
Years Since Remodel    0.534985
Mas Vnr Area           0.506983
TotRms AbvGrd          0.498574
Fireplaces             0.474831
BsmtFin SF 1           0.439284
Name: SalePrice, dtype: float64

In [17]:
# Drop the columns with a correlation coefficient < 0.4
transform_df = transform_df.drop(corr_coeffs[corr_coeffs < 0.4].index, axis=1)
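The correlation filter can be checked on a small frame where the answer is easy to see. The `toy` data below is hypothetical: `area` rises with the price, while `noise` is unrelated to it.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    'area': [1000, 1500, 2000, 2500, 3000],
    'noise': [3, 1, 2, 1, 3],
    'SalePrice': [100000, 160000, 195000, 255000, 300000],
})

# Absolute correlation of every column with the target
corr = toy.corr()['SalePrice'].abs()

# Columns below the cutoff are dropped; SalePrice correlates 1.0 with itself and stays
weak = corr[corr < 0.4].index
selected = toy.drop(columns=weak)
print(list(selected.columns))
```

Taking the absolute value matters because a strongly negative correlation, such as the one we would expect between a house's age and its price, is just as informative as a strongly positive one.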

Next, let's decide which categorical columns we should keep. Some columns may be numerical but need to be encoded as categorical instead, and some columns with too many unique values might be best left out, otherwise we'd have to add hundreds of dummy variable columns to our dataframe.

In [18]:
# List of column names that are supposed to be categorical
nominal_features = ['PID', 'MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 
                    'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 
                    'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type', 
                    'Misc Feature', 'Sale Type', 'Sale Condition']

In [19]:
# Looking at unique values in the categorical columns
categorical_cols = []
for col in nominal_features:
    if col in transform_df.columns:
        categorical_cols.append(col)
        
unique_count = transform_df[categorical_cols].apply(lambda col: len(col.value_counts())).sort_values()

# Drop categorical columns with more than 10 unique values
drop_unique_cols = unique_count[unique_count > 10].index
transform_df = transform_df.drop(drop_unique_cols, axis=1)

In [20]:
# Select remaining text columns and convert to categorical columns
text_cols = transform_df.select_dtypes(include=['object'])

for col in text_cols:
    transform_df[col] = transform_df[col].astype('category')
    
# Create dummy columns and add them to the dataframe
transform_df = pd.concat([transform_df,
                          pd.get_dummies(transform_df.select_dtypes(include=['category']))
                         ], axis=1).drop(text_cols, axis=1)
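As a minimal sketch of what the dummy encoding produces, the example below encodes a single column on a hypothetical toy frame. Passing `dtype=int` is an assumption added here: depending on the pandas version, `get_dummies` may return boolean columns by default, and boolean columns are not picked up by `select_dtypes(include=['integer', 'float'])`.

```python
import pandas as pd

# Hypothetical toy frame with one text column
toy = pd.DataFrame({
    'Central Air': ['Y', 'N', 'Y'],
    'SalePrice': [200000, 120000, 180000],
})

toy['Central Air'] = toy['Central Air'].astype('category')

# One 0/1 column per category, named '<column>_<category>'
dummies = pd.get_dummies(toy[['Central Air']], dtype=int)

# Replace the original text column with its dummy columns
encoded = pd.concat([toy.drop(columns=['Central Air']), dummies], axis=1)
print(encoded)
```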

## Train & Test

Finally, let's update the `select_features()` function. We'll also add a parameter named `k` to our `train_and_test()` function that controls the type of cross validation that occurs.

In [21]:
def transform_features(df):
    
    val_missing = df.isnull().sum()
    missing_val_cols = val_missing[(val_missing > len(df)/20)].sort_values()
    df = df.drop(missing_val_cols.index, axis=1)
    
    text_missing = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)
    missing_text_cols = text_missing[text_missing > 0]
    df = df.drop(missing_text_cols.index, axis=1)
    
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_num_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    replacement_vals = df[fixable_num_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_vals)
    
    years_sold = df['Yr Sold'] - df['Year Built']
    years_remodel = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remodel'] = years_remodel
    df = df.drop([1702, 2180, 2181], axis=0)
    
    df = df.drop(['Year Built', 'Year Remod/Add', 'PID', 'Order', 'Mo Sold', 'Sale Condition', 'Sale Type', 'Yr Sold'], axis = 1)
    
    return df

def select_features(df, coeff_threshold=0.4, unique_threshold=10):
    numerical_df = df.select_dtypes(include=['int', 'float'])
    corr_coeffs = numerical_df.corr()['SalePrice'].abs().sort_values(ascending=False)
    df = df.drop(corr_coeffs[corr_coeffs < coeff_threshold].index, axis=1)
    
    nominal_features = ['PID', 'MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 
                    'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 
                    'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air', 'Garage Type', 
                    'Misc Feature', 'Sale Type', 'Sale Condition']
    
    categorical_cols = []
    for col in nominal_features:
        if col in df.columns:
            categorical_cols.append(col)
        
    unique_count = df[categorical_cols].apply(lambda col: len(col.value_counts())).sort_values()
    drop_unique_cols = unique_count[unique_count > unique_threshold].index
    df = df.drop(drop_unique_cols, axis=1)
    
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        df[col] = df[col].astype('category')
        
    df = pd.concat([df,
                    pd.get_dummies(df.select_dtypes(include=['category']))
                    ], axis=1).drop(text_cols, axis=1)
    
    return df

def train_and_test(df, k=0):
    numeric_df = df.select_dtypes(include=['integer', 'float'])
    features = numeric_df.columns.drop('SalePrice')
    lr = linear_model.LinearRegression()
    
    if k == 0:
        train = df[:1460]
        test = df[1460:]
        
        lr.fit(train[features], train['SalePrice'])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test['SalePrice'], predictions)
        rmse = np.sqrt(mse)
    
        return rmse
    
    if k == 1:
        # Randomize rows in df, then split the shuffled rows into two folds
        shuffled_df = df.sample(frac=1)
        train = shuffled_df[:1460]
        test = shuffled_df[1460:]
        
        # Train on the first fold, test on the second
        lr.fit(train[features], train['SalePrice'])
        predictions_one = lr.predict(test[features])
        mse_one = mean_squared_error(test['SalePrice'], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        # Swap the folds: train on the second, test on the first
        lr.fit(test[features], test['SalePrice'])
        predictions_two = lr.predict(train[features])
        mse_two = mean_squared_error(train['SalePrice'], predictions_two)
        rmse_two = np.sqrt(mse_two)
        
        avg_rmse = np.mean([rmse_one, rmse_two])
        
        print('RMSE 1: ', round(rmse_one, 2))
        print('RMSE 2: ', round(rmse_two, 2))
        
        return avg_rmse
    
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values = []
        
        for train_index, test_index in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train['SalePrice'])
            predictions = lr.predict(test[features])
            mse = mean_squared_error(test['SalePrice'], predictions)
            rmse = np.sqrt(mse)
            rmse_values.append(rmse)
        
        print('RMSE Values: ', rmse_values)
        
        avg_rmse = np.mean(rmse_values)
        return avg_rmse

df = pd.read_csv('AmesHousing.tsv', delimiter='\t')
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df, k=4)

print('RMSE: ', round(rmse, 2))

RMSE Values:  [25937.807461762204, 38941.9349743024, 27151.223356899758, 24948.372629900896]
RMSE:  29244.83
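Because `KFold` is created with `shuffle=True` and no seed, the RMSE values above will differ from run to run. The sketch below shows one way to make the folds reproducible, using scikit-learn's `cross_val_score` in place of the manual loop. The synthetic data and the `random_state` values are assumptions made for this example, not part of the original pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a single feature and a noisy price target
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 100 * X[:, 0] + rng.normal(0, 10000, size=100)

# A fixed random_state makes the shuffled folds identical on every run
kf = KFold(n_splits=4, shuffle=True, random_state=1)

# scikit-learn scorers are "higher is better", so MSE is returned negated
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring='neg_mean_squared_error', cv=kf)

rmse_values = np.sqrt(-neg_mse)
avg_rmse = rmse_values.mean()
print('RMSE Values: ', rmse_values)
print('RMSE: ', round(avg_rmse, 2))
```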


## Conclusion & Next Steps

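In this project we built a linear regression pipeline to predict house sale prices. Along the way, the reported RMSE fell from 57088.25 for the single-feature baseline to an average of 29244.83 under 4-fold cross validation, although the two figures come from different validation schemes and are not strictly comparable. If we'd like to continue improving this model, some potential next steps would be to: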

* Continue the feature engineering process, then rerun the model with different parameter values such as `coeff_threshold`, `unique_threshold`, and `k`.
* Look at the Kaggle kernels page for this dataset to see what approaches others have taken.
* Work on improving our feature selection, and find better ways to handle our categorical columns.

The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **Linear Regression For Machine Learning** course.