# MI-ADM: second home assignment

  * **Deadline**: 17/04/2018, -2 points for a late submission, hard deadline is the first day of the exam period.
  * **What to submit**: Just this notebook with you code and texts, not the dataset! Please run "Kernel>Restart & Clear Output" before submitting.
  * **How to submit**: Preferred way is to start a repository (for both home assignments) on https://gitlab.fit.cvut.cz and add me as a reporter (not just guest, my username is kloudkar), however, you can send this jupyter notebook by email (*do not send the dataset!!*).

Generally speaking, the goal of this assignment is to use decision trees for the regression problem and experiment with discretisation of continuous (numeric) variables.

What you HAVE TO do:
  * Experiment with the scikit implementation and show how the missing values are treated.
  * Study the data from `house-prices.csv` (see also `house-prices_description.txt`).
  * Try to replace continuous features with some discrete ones (indicator variables, dummy variables, binning, ...) and then run decision tree algorithm on the resulting discrete features to predict the Sales Price.
  * Try to find best possible way of discretisation, tune some reasonably selected hyperparameters (using cross-validation or just a validation set) and measure the results using *Root mean squared logarithmic error (RMSLE)*.

If you do all this properly with some reasonable reults you will be given 4 points out of possible 6.

To get more do some extra work to (try to) improve the result, here are some suggestions:
  * Use also random forest algorithm or gradient boosted trees.
  * Use [XGboost](http://xgboost.readthedocs.io/en/latest/) implementation.
  * Sign up at kaggle.com and apply your resulting model in [the respective competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

In [3]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import mean_squared_error
import sklearn.metrics as metrics
from sklearn.preprocessing import Imputer

In [4]:
import pandas as pd
import math
from sklearn.preprocessing import Imputer
data = pd.read_csv('house-prices.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

* Shows how many features have missing values and count of them.

In [6]:
null_list = data[list(data.loc[:, (data.isnull().sum() > 0)])].isnull().sum()
null_list

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

* Count of all rows in dataframe is 1460. For some features the count of missing value more than 50% and it isn't useful. In next part show how to select features by some threshold (400) and they will be deleted.

In [7]:
drop_list =  list(data.loc[:, (data.isnull().sum() > 700)])
drop_list

['Alley', 'PoolQC', 'Fence', 'MiscFeature']

In [8]:
data_drop_null = data.drop(drop_list, axis=1)
null_list = list(data_drop_null.loc[:, (data_drop_null.isnull().sum() > 0)])

* To work with missing values in sklearn could be used 'Preprocessing data'.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer
* Imputer doesn't work with categorical data and in the first, the data should be transformed.

In [9]:
cat_columns = data_drop_null.select_dtypes(['object']).columns
for column in cat_columns:
    data_drop_null[column] = data_drop_null[column].astype('category')
data_drop_null[cat_columns] = data_drop_null[cat_columns].apply(lambda x: x.cat.codes)

In [10]:
data_drop_null = data_drop_null.replace(-1, np.nan)
y = data_drop_null['SalePrice'].apply(math.log) 
data_drop_null = data_drop_null.drop(['SalePrice', 'Id'], axis = 1)

In [11]:
# Imputer(missing_values=’NaN’, strategy=’mean’, axis=0, verbose=0, copy=True)
# I use strategy - most frequent
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

* Fit the imputer on data without missing values and after use transform to imput all missing values

In [12]:
data_drop_all = data_drop_null.dropna()

In [13]:
imp.fit(data_drop_all)

Imputer(axis=0, copy=True, missing_values='NaN', strategy='most_frequent',
    verbose=0)

In [14]:
X = imp.transform(data_drop_null)

In [15]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=100)

In [16]:
dtr = DecisionTreeRegressor(max_depth= 10)
dtr.fit(Xtrain,ytrain)
print('Root mean squared logarithmic error test:', np.sqrt(mean_squared_error(dtr.predict(Xtest), ytest)))
print('Root mean squared logarithmic error train:', np.sqrt(mean_squared_error(dtr.predict(Xtrain), ytrain)))

Root mean squared logarithmic error test: 0.20315990359232175
Root mean squared logarithmic error train: 0.04728823333917736


* use in imputer another metric to show how change the result of regression

In [17]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(data_drop_all)
X = imp.transform(data_drop_null)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=100)

print('Root mean squared logarithmic error test:', np.sqrt(mean_squared_error(dtr.predict(Xtest), ytest)))
print('Root mean squared logarithmic error train:', np.sqrt(mean_squared_error(dtr.predict(Xtrain), ytrain)))

Root mean squared logarithmic error test: 0.20345752306488832
Root mean squared logarithmic error train: 0.04990008746237524


In next part, I also delete features, where are the count of missing values more than 50%.
For missing values in numerical attributes was used fillna function, which replaces nan for a mean. For categorical attributes was implemented a function, which finds the most frequent val and replaces nan for it.

In [18]:
data = pd.read_csv('house-prices.csv')
drop_list =  list(data.loc[:, (data.isnull().sum() > 700)])
data_drop_null = data.drop(drop_list, axis=1)

In [19]:
# numerical attributes
data_drop_null = data_drop_null.fillna(data_drop_null.mean(), inplace=True)

In [20]:
cat_list = list(data_drop_null[list(data_drop_null.loc[:, data_drop_null.isnull().sum()>0])].isnull())
data_drop_null[list(data_drop_null.loc[:, data_drop_null.isnull().sum()>0])].isnull().sum()

MasVnrType        8
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Electrical        1
FireplaceQu     690
GarageType       81
GarageFinish     81
GarageQual       81
GarageCond       81
dtype: int64

In [21]:
for f in cat_list:
    most_expected = data_drop_null[f].value_counts().keys()[0]
    data_drop_null[f] = data_drop_null.replace(np.nan, most_expected)

In [22]:
# check null values
data_drop_null[list(data_drop_null.loc[:, data_drop_null.isnull().sum()>0])].isnull().sum()

Series([], dtype: float64)

* for categorical attributes, which have maximum 3 different value will be used dummy variables.

In [23]:
dummy_index = data_drop_null.columns[data_drop_null.nunique() < 4]
dummy = pd.get_dummies(data_drop_null[dummy_index])
data_drop_dummy = data_drop_null.drop(dummy_index, axis = 1).merge(dummy, left_index = True, right_index = True)

In [24]:
data_drop_dummy

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,LotShape,LandContour,LotConfig,Neighborhood,Condition1,...,Utilities_AllPub,Utilities_NoSeWa,LandSlope_Gtl,LandSlope_Mod,LandSlope_Sev,CentralAir_N,CentralAir_Y,PavedDrive_N,PavedDrive_P,PavedDrive_Y
0,1,60,RL,65.000000,8450,Reg,Lvl,Inside,CollgCr,Norm,...,1,0,1,0,0,0,1,0,0,1
1,2,20,RL,80.000000,9600,Reg,Lvl,FR2,Veenker,Feedr,...,1,0,1,0,0,0,1,0,0,1
2,3,60,RL,68.000000,11250,IR1,Lvl,Inside,CollgCr,Norm,...,1,0,1,0,0,0,1,0,0,1
3,4,70,RL,60.000000,9550,IR1,Lvl,Corner,Crawfor,Norm,...,1,0,1,0,0,0,1,0,0,1
4,5,60,RL,84.000000,14260,IR1,Lvl,FR2,NoRidge,Norm,...,1,0,1,0,0,0,1,0,0,1
5,6,50,RL,85.000000,14115,IR1,Lvl,Inside,Mitchel,Norm,...,1,0,1,0,0,0,1,0,0,1
6,7,20,RL,75.000000,10084,Reg,Lvl,Inside,Somerst,Norm,...,1,0,1,0,0,0,1,0,0,1
7,8,60,RL,70.049958,10382,IR1,Lvl,Corner,NWAmes,PosN,...,1,0,1,0,0,0,1,0,0,1
8,9,50,RM,51.000000,6120,Reg,Lvl,Inside,OldTown,Artery,...,1,0,1,0,0,0,1,0,0,1
9,10,190,RL,50.000000,7420,Reg,Lvl,Corner,BrkSide,Artery,...,1,0,1,0,0,0,1,0,0,1


* for attributes which have more than 50 variables applied binning. Allocated 5 uniform intervals and the continuous value is replaced by the number of the interval.

In [25]:
bin_cells = data_drop_dummy.loc[:, data_drop_dummy.nunique()> 50]
bin_list = list(bin_cells.drop('SalePrice', axis =1))

In [26]:
for col in bin_list:
    data_drop_dummy[col] = pd.cut(data_drop_dummy[col], 5, labels=[1, 2, 3, 4, 5])

In [27]:
# changing object columns on categorical
cat_columns = data_drop_dummy.select_dtypes(['object']).columns
for column in cat_columns:
    data_drop_dummy[column] = data_drop_dummy[column].astype('category')
data_drop_dummy[cat_columns] = data_drop_dummy[cat_columns].apply(lambda x: x.cat.codes)

In [28]:
data_drop_dummy.SalePrice = data_drop_dummy.SalePrice.apply(math.log) 

In [29]:
# drop Id
data_drop_dummy = data_drop_dummy.drop('Id', axis =1)

In [30]:
# split the set for train and test
X, Xtest, y, ytest = train_test_split(data_drop_dummy.drop(['SalePrice'], axis = 1), data_drop_dummy.SalePrice,\
                                      test_size=0.25, random_state=100)

In [31]:
# implementation of decision tree
dtr = DecisionTreeRegressor(max_depth= 30)
dtr.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=30, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [32]:
print('Root mean squared logarithmic error test:', np.sqrt(mean_squared_error(dtr.predict(Xtest), ytest)))
print('Root mean squared logarithmic error train:', np.sqrt(mean_squared_error(dtr.predict(X), y)))

Root mean squared logarithmic error test: 0.2028932580372886
Root mean squared logarithmic error train: 1.8523473448755646e-05


* in next part implemented GridSearchCV for search the best params - max depth and max features. 

In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

In [34]:
tree_params = {'max_depth': range(10,35), 'max_features': range(2,10)}

In [35]:
# I was trying to make own scorer for cross-validation but it didn't work (or too long)
#def rmsle(predicted, actual, size):
#    return np.sqrt(mean_squared_error(predicted - actual))

In [36]:
# neg_mean_squared_log_error return the negated value of the metric
# the best value for neg_mean_squared_log_error is 0.0
tree_grid = GridSearchCV(dtr, tree_params, cv=3, n_jobs=-1, verbose=True, scoring='neg_mean_squared_log_error')

In [37]:
X_cross = data_drop_dummy.drop(['SalePrice'], axis = 1)
y_cross = data_drop_dummy['SalePrice']
tree_grid.fit(X_cross, y_cross)

Fitting 3 folds for each of 200 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   38.7s
[Parallel(n_jobs=-1)]: Done 220 tasks      | elapsed:   41.0s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:   45.9s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=30, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': range(10, 35), 'max_features': range(2, 10)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_log_error', verbose=True)

* showing the best params

In [38]:
tree_grid.best_params_

{'max_depth': 26, 'max_features': 9}

In [39]:
tree_grid.best_score_

-0.00033241237592177844

In [40]:
dtr = tree_grid.best_estimator_
dtr.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=26, max_features=9,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [41]:
from sklearn.metrics import mean_squared_log_error
print('Mean squared logarithmic error regression loss for test set:', mean_squared_log_error(ytest, dtr.predict(Xtest)))
print('Mean squared logarithmic error regression loss for train set:', mean_squared_log_error(y, dtr.predict(X)))

Mean squared logarithmic error regression loss for test set: 0.0003266514979200021
Mean squared logarithmic error regression loss for train set: 2.629385607383713e-12


* I think that I have some mistake with evaluating because the result is unreal. I will be grateful if you tell me which metric is better to use.