Running Brainstorm and To-Do List:

1) Each team member is going to take a shot at cleaning up the data.  Primarily concerned with converting categorical variables to multiple dummy variables, and making sure there are no missing values.

2) Check to ensure no extreme outliers.  (Maybe do some quick .describe() calls and write up the results.)

3) Two baselines:
        * Linear regression (Or Lasso Regression)
        * Some super simple rule of thumb (e.g. average price of all homes in neighborhood)
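
The rule-of-thumb baseline above could be sketched roughly as follows. This is only a minimal illustration: the function name `neighborhood_mean_baseline`, the fallback to the overall mean for unseen neighborhoods, and the toy prices are all assumptions, not something the team has settled on.

```python
import pandas as pd

def neighborhood_mean_baseline(train, test, group_col="Neighborhood", target="SalePrice"):
    """Predict each home's price as the mean training price of its neighborhood."""
    group_means = train.groupby(group_col)[target].mean()
    overall_mean = train[target].mean()
    # Assumption: neighborhoods never seen in training fall back to the overall mean
    return test[group_col].map(group_means).fillna(overall_mean)

# Toy data (hypothetical values, for illustration only)
toy_train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "SalePrice": [100000, 200000, 300000],
})
toy_test = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})
toy_preds = neighborhood_mean_baseline(toy_train, toy_test)
```

Any model we build later should beat this by a clear margin to justify its complexity.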

In [199]:
import numpy as np
import pandas as pd
import sklearn
from sklearn import preprocessing

import scipy


from datetime import date

pd.set_option('display.max_columns', 300)

test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

In [200]:
def convert_categorical_to_dummy(df, columns=None, drop_first=True):
    """
    Convert categorical variables to dummy variables (k-1 per variable
    when drop_first is True).  If no columns are given, pandas converts
    every column with object or category dtype.
    """
    if columns:
        # Only the listed columns are converted
        new_df = pd.get_dummies(df, drop_first=drop_first, columns=columns)
    else:
        new_df = pd.get_dummies(df, drop_first=drop_first)

    return new_df
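
To make the k-1 encoding concrete, here is a small sketch on toy data (the values are made up; only the column names come from the data set). It also illustrates one thing worth keeping in mind: by default `pd.get_dummies` encodes a missing value as all zeros, which under `drop_first=True` looks the same as the dropped baseline category. Passing `dummy_na=True` is one possible way to keep missing values distinguishable.

```python
import pandas as pd

toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "LotArea": [8450, 9600, 11250]})

# k dummy columns: one per category
full = pd.get_dummies(toy, columns=["Street"])
# k-1 dummy columns: the first category (Grvl) becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["Street"], drop_first=True)

# With dummy_na=True, missing values get their own indicator column
alley = pd.DataFrame({"Alley": ["Grvl", None, "Pave"]})
with_na = pd.get_dummies(alley, columns=["Alley"], dummy_na=True)
```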
        

In [201]:
def recession_indicator(row):
    """Take in sale date and return whether it was during a recession
    as determined by official statistics; only includes the recession
    which occurred during the years in question, but can be expanded as necessary
    https://en.wikipedia.org/wiki/List_of_recessions_in_the_United_States"""

    # Only year and month are recorded, so date each sale to the first of the month
    period = date(int(row['YrSold']), int(row['MoSold']), 1)
    # Recession window: December 2007 through June 2009, inclusive
    if date(2007, 12, 1) <= period <= date(2009, 6, 1):
        return 'recession'
    else:
        return ''
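
A vectorized alternative to the row-wise function above, as a rough sketch. The name `recession_flag` is hypothetical, and unlike `recession_indicator` it returns a boolean Series rather than string labels, so it is not a drop-in replacement.

```python
import pandas as pd

def recession_flag(df, year_col="YrSold", month_col="MoSold"):
    """Return True where the sale month falls in Dec 2007 through Jun 2009 (inclusive)."""
    # Encode year and month as a single integer, e.g. 2008-06 -> 200806
    year_month = df[year_col] * 100 + df[month_col]
    return year_month.between(200712, 200906)

# Toy sale dates (hypothetical), chosen to hit both edges of the window
toy_sales = pd.DataFrame({
    "YrSold": [2007, 2007, 2008, 2009, 2009],
    "MoSold": [11, 12, 6, 6, 7],
})
toy_flags = recession_flag(toy_sales)
```

This avoids calling a Python function once per row, which may matter if the data set grows.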

In [202]:
# Convert year-based features to age; the most recent year is 2010, so we will treat that as year 0
def convert_year_to_age(df, columns):
    """Helper function to convert year features into age assuming 2010 as year 0.
    Note: modifies df in place and also returns it."""
    for col in columns:
        df[col] = 2010 - df[col]
    return df

In [203]:
train_df['recession'] = train_df.apply(recession_indicator, axis=1)
test_df['recession'] = test_df.apply(recession_indicator, axis=1)

In [204]:
# List of columns which we believe it makes sense to convert to boolean
categorical_columns = ['MSSubClass','MSZoning','Street','Alley','LotShape','LandContour','Utilities','LotConfig',
                       'LandSlope','Neighborhood','Condition1','Condition2','BldgType','HouseStyle','RoofStyle',
                       'RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation',
                       'BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating','HeatingQC',
                       'CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish',
                       'GarageQual','GarageCond','PavedDrive','PoolQC','Fence','MiscFeature','MoSold','SaleType',
                       'SaleCondition','recession']

new_train = convert_categorical_to_dummy(train_df,categorical_columns)
new_test = convert_categorical_to_dummy(test_df,categorical_columns)

In [205]:
# Get columns that are present in the training set but missing from the test set
missing_cols = set(new_train.columns) - set(new_test.columns)

# Add a missing column in test set with default value equal to 0
for col in missing_cols:
    new_test[col] = 0
    
# Ensure matching order and columns 
new_test = new_test[new_train.columns]
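
The add-missing-columns-then-reorder steps above can also be expressed with a single `reindex` call. The sketch below uses toy frames (hypothetical values) and shows that, in either approach, dummy columns that exist only in the test set are silently dropped, so it may be worth listing them before aligning.

```python
import pandas as pd

train_toy = pd.DataFrame({"LotArea": [1, 2], "Street_Pave": [1, 0], "Alley_Pave": [0, 1]})
test_toy = pd.DataFrame({"Street_Pave": [1], "LotArea": [3], "RoofMatl_Roll": [1]})

# Columns only in test; these disappear after alignment
dropped = sorted(set(test_toy.columns) - set(train_toy.columns))

# Match the training columns and order; columns missing from test are filled with 0
aligned_test = test_toy.reindex(columns=train_toy.columns, fill_value=0)
```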

In [206]:
# Check to make sure that our missing columns are only derived from the test data lacking specific instances
# of a categorical variable.  Test data set will not have SalePrice (that's target variable Y) so we are alright
for col in missing_cols:
    if col.split("_")[0] not in categorical_columns:
        print("Error, %s was not derived from a missing category" %col)

Error, SalePrice was not derived from a missing category


In [207]:
"""Note: YrSold is different from the other year variables as for the first 3 converting year to age 
from 2010 is a logical conversion assuming linear depreciation (being built 2 years ago should have twice the 
depreciation effect of being built 1 year ago, etc.).  However, for year sold we are only covering a 5-year period during which 
there was a significant shock to global financial markets.  As such, we will not convert year sold to age,
and opt instead to convert it to a categorical variable (see above)"""

year_cols = ['YearBuilt','YearRemodAdd','GarageYrBlt']
new_train = convert_year_to_age(new_train,year_cols)
new_test = convert_year_to_age(new_test,year_cols)



In [208]:
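# Use SKLearn scaling 
def min_max_scaling(df, columns):
    """Scale every column of df to [0, 1] and return the scaled frame together
    with the fitted scaler, so the same scaling can be reused on other data."""
    min_max_scaler = preprocessing.MinMaxScaler()
    scaled_df = min_max_scaler.fit_transform(df)
    scaled_df = pd.DataFrame(scaled_df, columns=columns)
    return scaled_df, min_max_scaler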

# Opt for min-max scaling as opposed to mean normalization as we have so many boolean variables.  Min-max
# scaling preserves the 0/1 nature of these features.  We may revisit using alternative scaling methods in future
# iterations of preprocessing

scaled_train, min_max_scaler_train = min_max_scaling(new_train, new_train.columns)
scaled_train['Id'] = scaled_train.index + 1

scaled_test = pd.DataFrame(min_max_scaler_train.transform(new_test),columns=new_test.columns)
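
A small sketch of what the scaling step does, on toy arrays (hypothetical values). It illustrates the two points behind the choice above: 0/1 columns pass through unchanged, and because the scaler is fit on the training data only, test values outside the training range can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column 0 is a 0/1 dummy, column 1 is a continuous feature
train_X = np.array([[0.0, 1000.0], [1.0, 2000.0], [0.0, 3000.0]])
test_X = np.array([[1.0, 4000.0]])

# Fit on train only: x' = (x - train_min) / (train_max - train_min)
scaler = MinMaxScaler().fit(train_X)
scaled_train_X = scaler.transform(train_X)
scaled_test_X = scaler.transform(test_X)
```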



Things we found interesting when evaluating data:

1) Placeholder

Modeling to do list:

    1) Do all of the below both pre and post PCA

    2) Do all of the below both pre and post k-means++ clustering

Models:

    1) Neural net (Feed Forward)

    2) XGBoost (or RF)

    3) GMM

    4) SVM

    5) KNN (Restrict what features we use)
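
One possible way to organize the "pre and post PCA" comparison is to wrap each model in a scikit-learn pipeline and score both variants with the same cross-validation. The sketch below is only an illustration under several assumptions: it uses synthetic data from `make_regression` instead of the housing data, Lasso as a stand-in model, and arbitrary choices for `alpha`, `n_components`, and the number of folds.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the preprocessed housing features (assumption)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Same model with and without a PCA step in front of it
candidates = {
    "lasso": make_pipeline(MinMaxScaler(), Lasso(alpha=1.0)),
    "pca+lasso": make_pipeline(MinMaxScaler(), PCA(n_components=10), Lasso(alpha=1.0)),
}

# Mean negative RMSE over 5 folds; values closer to 0 are better
scores = {
    name: cross_val_score(model, X, y, cv=5,
                          scoring="neg_root_mean_squared_error").mean()
    for name, model in candidates.items()
}
```

Putting the scaler and PCA inside the pipeline means they are refit on each training fold, which keeps information from the held-out fold out of the preprocessing.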



In [39]:
# PCA + XGBoost = Best | Clay & Rohini prediction
# Lasso Regression     | Mark predicts this will be hard to beat