# House prices regression

**This tutorial requires scikit-learn>=0.20.0**.

We will be using the data from [this Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). To begin with, you should download the data and put it in the `data/` directory of this repository.

In [1]:
import pandas as pd

cat_cols = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape',
            'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood',
            'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
            'OverallCond', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
            'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
            'BsmtCond', 'BsmtExposure', 'Heating', 'HeatingQC', 'CentralAir',
            'Electrical', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
            'KitchenQual', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
            'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType',
            'SaleCondition', 'BsmtFinType1', 'BsmtFinType2', 'Functional',
            'FireplaceQu']

dtypes = {
    col: 'category'
    for col in cat_cols
}

train = pd.read_csv('data/train.csv.gz', dtype=dtypes)
test = pd.read_csv('data/test.csv.gz', dtype=dtypes)

train.head()

The goal of the competition is to make predictions that have the following format.

In [18]:
sample_sub = pd.read_csv('data/sample_submission.csv.gz')
sample_sub.head()

|   | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 169277.052498 |
| 1 | 1462 | 187758.393989 |
| 2 | 1463 | 183583.68357 |
| 3 | 1464 | 179317.477511 |
| 4 | 1465 | 150730.079977 |


## Feature extraction

A common trick is to merge the training set with the test set so that every transformation is applied to both at once. This keeps the feature extraction code terse.

In [22]:
train['is_train'] = True
test['is_train'] = False

df = pd.concat((train, test), sort=False)

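# pd.concat can change column dtypes (for example, categorical columns whose
# categories differ between train and test are no longer categorical), so we cast them back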
df = df.astype(test.dtypes)
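
The dtype change caused by `pd.concat` can be reproduced on a toy example. The sketch below uses two hypothetical frames (the `zone` column and its values are made up for illustration) and also shows a pitfall worth checking for: casting to the categorical dtype of one frame discards the categories that only exist in the other.

```python
import pandas as pd

# Hypothetical toy frames: 'RM' only appears in `a`, 'FV' only appears in `b`
a = pd.DataFrame({'zone': pd.Categorical(['RL', 'RM'])})
b = pd.DataFrame({'zone': pd.Categorical(['RL', 'FV'])})

merged = pd.concat((a, b), sort=False)

# The categories differ, so the merged column is no longer categorical
is_categorical = isinstance(merged['zone'].dtype, pd.api.types.CategoricalDtype)
print(is_categorical)

# Casting to the dtype of `b` keeps only the categories of `b`:
# 'RM' is not one of them, so it becomes NaN
narrow = merged['zone'].astype(b['zone'].dtype)
print(narrow.isnull().sum())

# Casting to a plain 'category' infers the categories from all the values
wide = merged['zone'].astype('category')
print(wide.cat.categories.tolist())
```

If the second behaviour matters for your data, counting the missing values before and after the cast is a cheap way to detect it.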

Another good idea is to add assertions that check the data is what we expect. Here, the only rows without a `SalePrice` should be the test rows.

In [20]:
assert df['SalePrice'].isnull().sum() == len(test)

Let's add a feature that is `True` when both `GarageQual` and `PavedDrive` are filled in.

In [21]:
df['garage_and_paved_driveway'] = df['GarageQual'].notnull() & df['PavedDrive'].notnull()
df['garage_and_paved_driveway'].value_counts()

True     2757
False     162
Name: garage_and_paved_driveway, dtype: int64

We'll now handle the missing values. For the categorical variables we'll add a new value called `'missing'`. As for the numerical variables we will impute them with their mean.

In [16]:
from sklearn import preprocessing

cat_cols = df.select_dtypes(include='category').columns
num_cols = df.select_dtypes(exclude='category').columns.drop(['SalePrice', 'Id', 'is_train'])

In [None]:
for col in cat_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].cat.add_categories('missing').fillna('missing')
    df[col] = preprocessing.LabelEncoder().fit_transform(df[col])

In [6]:
for col in num_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mean())
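
To see what the two imputation steps do, here is a minimal sketch on a toy frame. The values are made up; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numerical column
toy = pd.DataFrame({
    'Alley': pd.Categorical(['Grvl', None, 'Pave', None]),
    'LotFrontage': [60.0, None, 80.0, 70.0],
})

# Categorical: 'missing' has to be declared as a category before it can be used
toy['Alley'] = toy['Alley'].cat.add_categories('missing').fillna('missing')

# Numerical: replace the missing values with the mean of the observed ones
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].mean())

print(toy)
```

Calling `fillna('missing')` without `add_categories` first would fail, because a categorical column only accepts values that belong to its categories.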

## Machine learning

A good practice is to first split the data into `X_train`, `y_train`, and `X_test`.

In [72]:
import numpy as np

to_drop = ['Id', 'is_train']

X_train = df.query('is_train == True').drop(columns=to_drop + ['SalePrice'])
y_train = df.query('is_train == True')['SalePrice']
X_test = df.query('is_train == False').drop(columns=to_drop + ['SalePrice'])

X_train.head()

|   | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | missing | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | missing | missing | missing | 0 | 2 | 2008 | WD | Normal |
| 1 | 20 | RL | 80.0 | 9600 | Pave | missing | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | missing | missing | missing | 0 | 5 | 2007 | WD | Normal |
| 2 | 60 | RL | 68.0 | 11250 | Pave | missing | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | missing | missing | missing | 0 | 9 | 2008 | WD | Normal |
| 3 | 70 | RL | 60.0 | 9550 | Pave | missing | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | missing | missing | missing | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 60 | RL | 84.0 | 14260 | Pave | missing | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | missing | missing | missing | 0 | 12 | 2008 | WD | Normal |


The metric used for the competition is the RMSE between the log of the true prices and the log of the predicted prices. If we train on the log of the price, minimizing the RMSE directly targets the competition metric, so we convert the target to log scale.

In [73]:
y_train = np.log(y_train)
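
As a rough illustration of why the log scale fits this metric, the sketch below implements the metric as a small helper. The name `rmsle` and the prices are assumptions made for the example; they do not come from the competition code.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """RMSE between the log of the true prices and the log of the predictions."""
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))


# A 10% over-estimate costs the same on a cheap and on an expensive house
cheap = rmsle(np.array([100000.0]), np.array([110000.0]))
expensive = rmsle(np.array([1000000.0]), np.array([1100000.0]))
print(cheap, expensive)

# Made-up prices and log-scale predictions, as a model trained on log prices would output
prices = np.array([200000.0, 150000.0, 300000.0])
log_preds = np.log(np.array([210000.0, 140000.0, 330000.0]))

# The plain RMSE in log space matches the metric computed on the converted predictions
rmse_log_space = np.sqrt(np.mean((np.log(prices) - log_preds) ** 2))
print(np.isclose(rmse_log_space, rmsle(prices, np.exp(log_preds))))
```

The metric therefore penalizes relative errors rather than absolute ones, which is what a model fitted on log prices optimizes.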

We can write some tests to make sure our data is okay.

In [74]:
assert all(X_train.columns == X_test.columns)
assert X_train.isnull().sum().sum() == 0
assert X_test.isnull().sum().sum() == 0
assert len(X_train) == len(y_train)

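Let's build a ridge regression pipeline: the features are one-hot encoded, rescaled, and reduced with a truncated SVD before being fed to the regressor.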

In [75]:
from sklearn import decomposition
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing


model = pipeline.Pipeline([
    # On scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`
    ('one_hot', preprocessing.OneHotEncoder(sparse=False, handle_unknown='ignore')),
    ('rescale', preprocessing.StandardScaler()),
    ('pca', decomposition.TruncatedSVD(n_components=30)),
    ('ridge', linear_model.Ridge())
])
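
The `handle_unknown='ignore'` option matters because the test set can contain values that were never seen during training. The following sketch, on made-up values, shows what the encoder does in that case. It leaves the sparsity argument at its default and converts the result with `toarray()` so that it runs on both old and recent scikit-learn versions.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical training values for a single categorical feature
seen = np.array([['RL'], ['RM']])

encoder = preprocessing.OneHotEncoder(handle_unknown='ignore')
encoder.fit(seen)

# 'FV' was not seen during fit: it is encoded as a row of zeros instead of raising an error
encoded = encoder.transform(np.array([['RL'], ['FV']])).toarray()
print(encoder.categories_[0].tolist())
print(encoded)
```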

To evaluate our model we can use a cross-validation scheme.

In [85]:
from sklearn import metrics
from sklearn import model_selection


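def neg_rmse(y_true, y_pred):
    # Scorers follow a "greater is better" convention, so we negate the RMSE
    return -metrics.mean_squared_error(y_true, y_pred) ** 0.5

scoring = metrics.make_scorer(neg_rmse)
# random_state only has an effect when shuffle=True, so we leave it out
cv = model_selection.KFold(n_splits=5)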

scores = model_selection.cross_val_score(
    estimator=model,
    X=X_train,
    y=y_train,
    scoring=scoring,
    cv=cv
)

print('Model RMSE: {:.5f} ± {:.5f}'.format(-scores.mean(), scores.std()))

Model RMSE: 0.20395 ± 0.01373
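
Recent versions of scikit-learn (0.22 and later, which is newer than the minimum version required above) ship a built-in `'neg_root_mean_squared_error'` scorer. The sketch below, on synthetic data from `make_regression` rather than the house prices, suggests that it gives the same per-fold scores as a hand-written scorer.

```python
import numpy as np
from sklearn import datasets, linear_model, metrics, model_selection

# Synthetic regression data, used only for illustration
X, y = datasets.make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)


def neg_rmse(y_true, y_pred):
    # Negated so that a higher score is better
    return -metrics.mean_squared_error(y_true, y_pred) ** 0.5


# shuffle=True is what makes random_state meaningful
toy_cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)
toy_model = linear_model.Ridge()

custom = model_selection.cross_val_score(
    toy_model, X, y, scoring=metrics.make_scorer(neg_rmse), cv=toy_cv
)
builtin = model_selection.cross_val_score(
    toy_model, X, y, scoring='neg_root_mean_squared_error', cv=toy_cv
)

print(np.allclose(custom, builtin))
```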


We can perform a grid search to find better hyperparameters for our model.

In [86]:
param_grid = {
    'ridge__alpha': [0.01, 0.1, 1],
    'pca__n_components': [10, 25, 50]
}

grid = model_selection.GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring=scoring,
    cv=cv,
    return_train_score=True
)

grid = grid.fit(X_train, y_train)
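
The keys of `param_grid` follow the `<step name>__<parameter>` convention, which is how the grid search reaches inside a pipeline. Here is a minimal sketch on synthetic data. The pipeline, its step names, and the grid values are assumptions chosen for the example.

```python
from sklearn import datasets, linear_model, model_selection, pipeline, preprocessing

# Synthetic regression data, used only for illustration
X, y = datasets.make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

toy_pipe = pipeline.Pipeline([
    ('rescale', preprocessing.StandardScaler()),
    ('ridge', linear_model.Ridge())
])

toy_grid = model_selection.GridSearchCV(
    estimator=toy_pipe,
    # '<step name>__<parameter>': 3 x 2 = 6 combinations are evaluated
    param_grid={'ridge__alpha': [0.1, 1, 10], 'rescale__with_mean': [True, False]},
    cv=3
)
toy_grid.fit(X, y)

print(len(toy_grid.cv_results_['params']))
print(toy_grid.best_params_)

# The best combination is the one set on the refitted pipeline
best_alpha = toy_grid.best_estimator_.named_steps['ridge'].alpha
print(best_alpha == toy_grid.best_params_['ridge__alpha'])
```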

Let's put the results in a `DataFrame` to make them easier to read.

In [87]:
results = pd.concat(
    (
        pd.DataFrame.from_dict(grid.cv_results_['params']),
        pd.DataFrame({
            'mean_train_score': -grid.cv_results_['mean_train_score'],
            'std_train_score': grid.cv_results_['std_train_score'],
            'mean_test_score': -grid.cv_results_['mean_test_score'],
            'std_test_score': grid.cv_results_['std_test_score']
        })
    ),
    axis='columns'
)

results.sort_values('mean_test_score')

|   | pca__n_components | ridge__alpha | mean_train_score | std_train_score | mean_test_score | std_test_score |
|---|---|---|---|---|---|---|
| 6 | 50 | 0.01 | 0.160938 | 0.00224 | 0.200887 | 0.015303 |
| 8 | 50 | 1.0 | 0.162396 | 0.001859 | 0.20165 | 0.014464 |
| 7 | 50 | 0.1 | 0.165046 | 0.003395 | 0.202599 | 0.013305 |
| 5 | 25 | 1.0 | 0.174153 | 0.004159 | 0.205411 | 0.014961 |
| 3 | 25 | 0.01 | 0.174533 | 0.006926 | 0.206529 | 0.01178 |
| 4 | 25 | 0.1 | 0.176388 | 0.004411 | 0.206676 | 0.012628 |
| 2 | 10 | 1.0 | 0.184714 | 0.006262 | 0.210935 | 0.013697 |
| 1 | 10 | 0.1 | 0.187193 | 0.008907 | 0.21135 | 0.013977 |
| 0 | 10 | 0.01 | 0.187659 | 0.005026 | 0.211409 | 0.016108 |


`grid` now has an attribute called `best_estimator_`, which is the pipeline with the best parameters refitted on the whole training set. We can use it to make our final predictions.

In [88]:
grid.best_estimator_

Pipeline(memory=None,
     steps=[('one_hot', OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=False)), ('rescale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', TruncatedSVD(algorithm='randomized', n_components=50, n_iter=5,
       random_state=None, tol=0.0)), ('ridge', Ridge(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

## Submitting

In [91]:
sub = sample_sub.copy()

# We predict the log of the price
sub['SalePrice'] = grid.best_estimator_.predict(X_test)

# We convert the prices back to their normal scale
sub['SalePrice'] = np.exp(sub['SalePrice'])

# We save the submission; the file name includes the best validation score
sub.to_csv('submission_{:.5f}.csv'.format(-grid.best_score_), index=False)

Let's take a look at the submission.

In [93]:
!head submission_0.20089.csv

Id,SalePrice
1461,116441.6130193994
1462,152499.3642918255
1463,196781.66003689083
1464,206544.46221658096
1465,187892.8484093237
1466,191979.25949717793
1467,174668.03238978193
1468,186394.0289619949
1469,179740.4014589204
