# First submission, my personal benchmark

This is one of my first Kaggle submissions. To be honest, I'm keen to get a first submission in very fast. I see this first submission as my personal benchmark from which I can take things further.

In all the literature I've read so far, the recommended way to tackle a machine learning problem follows a different order: in-depth data exploration/analysis and literature review first, followed by data manipulation, model building and finally model optimization.

However, in this blog I deliberately aim for a quick first submission to set my individual benchmark. I worked through the following challenges:
- **Relevant imports and data loading.** Load the data and try to build a model (RandomForestRegressor) on it. This failed, as the RandomForestRegressor couldn't process text.
- **Straightforward data conversion**: from categorical data (in text form) into numerical indicator columns. Again, try to build the model on the train set and apply it to the test set for a prediction. This first failed because of missing values, which I then filled with 0. It failed a second time because the train and test data ended up with different numbers of features.
- **Other conversion ideas**. After some research on the internet: concat train/test data, apply pd.get_dummies again, split the data back and train the model.
- **Benchmark model and prepare submission**. Apply the trained model to the test data to make a first prediction. Don't forget to fill all missing values with 0 first.
- **Next steps**. Various ideas that came up while working through the individual steps for improving the initial benchmark model further.

## Relevant imports and data loading

In [61]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor  # as I think random forest might be a good and easy starting point
from sklearn.model_selection import train_test_split  # split train data into a train and valid part for a first validation
from sklearn.metrics import mean_squared_error  # evaluation criterion is the root mean squared error
from math import sqrt  # to calculate the root of the mean squared error. alternatively **0.5 could be used

In [62]:
# load data which is stored in the /data folder of the project
train_data = pd.read_csv('../data/train.csv', sep=',', header=0)
test_data = pd.read_csv('../data/test.csv', sep=',', header=0)

In [63]:
# quickly check the shape of the two data sets.
train_data.shape

(1460, 81)

In [64]:
test_data.shape

(1459, 80)

In [65]:
# store the SalePrice (target variable) and drop it from the rest of the train data
target_variable = train_data["SalePrice"]
train_features = train_data.drop(["SalePrice"], axis=1)
train_features.shape

(1460, 80)

Now the train and the test data have 80 features (attributes / columns) each.

In [66]:
# have a look at the first 10 entries of the train features
train_features.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,10,2009,WD,Normal
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,8,2007,WD,Normal
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,...,0,0,,,Shed,350,11,2009,WD,Normal
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2008,WD,Abnorml
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,1,2008,WD,Normal


A lot of the categorical features are available in the form of text. 
E.g. the categorical feature SaleType has values such as 'WD' and 'New' etc.
Let's have a look at the individual data types of the train features. 

In [67]:
# have a look at the data types of the first 30 attributes
train_features.dtypes[:30]

Id                int64
MSSubClass        int64
MSZoning         object
LotFrontage     float64
LotArea           int64
Street           object
Alley            object
LotShape         object
LandContour      object
Utilities        object
LotConfig        object
LandSlope        object
Neighborhood     object
Condition1       object
Condition2       object
BldgType         object
HouseStyle       object
OverallQual       int64
OverallCond       int64
YearBuilt         int64
YearRemodAdd      int64
RoofStyle        object
RoofMatl         object
Exterior1st      object
Exterior2nd      object
MasVnrType       object
MasVnrArea      float64
ExterQual        object
ExterCond        object
Foundation       object
dtype: object

Without having looked at all the individual features yet: the RandomForestRegressor will be able to deal with all the numerical columns (e.g. int64, float64), but not with the text columns (dtype object). Fitting it raises a ValueError: could not convert string to float: 'Normal'

In [None]:
rf_reg = RandomForestRegressor()
rf_reg.fit(train_features, target_variable)  # returns a ValueError: could not convert string to float: 'Normal'
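
A quick way to see which columns trigger this error is to separate the non-numeric columns from the numeric ones. The following is only a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy frame standing in for train_features (values are made up for illustration)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RM", "RL"],
    "SaleCondition": ["Normal", "Normal", "Abnorml"],
})

# Everything that is not numeric is what the model cannot digest as-is
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print("text columns:   ", text_columns)
print("numeric columns:", numeric_columns)
```

Applied to train_features, the same two calls would list every column that needs converting before the model can be fitted.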

So let's convert the train_features into a format the RandomForestRegressor can deal with. 

## Straightforward data conversions

From my research on the internet, I found the following options to tackle the conversion of the train_features:

- pd.get_dummies()
- OneHotEncoder
- LabelEncoder

### pd.get_dummies( )

In [74]:
# convert all features of dtype object (or categorical) into dummy/indicator variables
# dummy_na=True adds a separate indicator column for missing values, so we don't have to deal with the categorical ones explicitly
# drop_first=True drops the first level of each categorical feature (the original text columns are replaced in any case)
dummies_train = pd.get_dummies(train_features, dummy_na=True, drop_first=True)
# have a look at the first 10 entries in the converted train data
dummies_train.head(10)

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_New,SaleType_Oth,SaleType_WD,SaleType_nan,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_nan
0,1,60,65.0,8450,7,5,2003,2003,196.0,706,...,0,0,1,0,0,0,0,1,0,0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978,...,0,0,1,0,0,0,0,1,0,0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486,...,0,0,1,0,0,0,0,1,0,0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216,...,0,0,1,0,0,0,0,0,0,0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655,...,0,0,1,0,0,0,0,1,0,0
5,6,50,85.0,14115,5,5,1993,1995,0.0,732,...,0,0,1,0,0,0,0,1,0,0
6,7,20,75.0,10084,8,5,2004,2005,186.0,1369,...,0,0,1,0,0,0,0,1,0,0
7,8,60,,10382,7,6,1973,1973,240.0,859,...,0,0,1,0,0,0,0,1,0,0
8,9,50,51.0,6120,7,5,1931,1950,0.0,0,...,0,0,1,0,0,0,0,0,0,0
9,10,190,50.0,7420,5,6,1939,1950,0.0,851,...,0,0,1,0,0,0,0,1,0,0
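
To make the effect of the two options concrete, here is a small sketch on a made-up single-column frame (the frame `toy` and its values are assumptions for illustration). It compares the columns produced by a plain pd.get_dummies call, by dummy_na=True, and by dummy_na=True together with drop_first=True:

```python
import numpy as np
import pandas as pd

# One categorical column with a missing value (made-up data)
toy = pd.DataFrame({"SaleType": ["WD", "New", np.nan, "WD"]})

plain = pd.get_dummies(toy)                                    # one column per level
with_na = pd.get_dummies(toy, dummy_na=True)                   # plus an indicator for missing values
reduced = pd.get_dummies(toy, dummy_na=True, drop_first=True)  # first level ('New') is dropped

print(list(plain.columns))
print(list(with_na.columns))
print(list(reduced.columns))
```

Note that drop_first removes the first level of each feature, not the original column. A row with all remaining indicators at 0 then stands for the dropped level.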


This already looks very promising. The text information is converted into numerical values. So, next trial:

In [None]:
# stubborn: train a model as above
rf_reg = RandomForestRegressor()
rf_reg.fit(dummies_train, target_variable)  

# raises ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [76]:
# as an answer to this, fill all the NaN values with 0.
dummies_train = dummies_train.fillna(0)

# and again: try to train a model as above
rf_reg = RandomForestRegressor()
rf_reg.fit(dummies_train, target_variable)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

YES. Model trained. Ready to make predictions on the test data and submit it to Kaggle, but....

In [None]:
# convert the loaded test data with pd.get_dummies, as I did with the train data.
dummies_test = pd.get_dummies(test_data, dummy_na=True, drop_first=True)
# fill all NaN values with 0, as with the train data.
dummies_test = dummies_test.fillna(0)

# and make a prediction that I can submit
prediction = rf_reg.predict(dummies_test)  

# returns ValueError: Number of features of the model must match the input. Model n_features is 289 and input n_features is 271 

The model expects 289 features, but the converted test data has only 271. What went wrong?

In [78]:
# have a look at the first 10 entries in the converted test data
dummies_test.head(10)

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1461,20,80.0,11622,5,6,1961,1961,0.0,468.0,...,0,0,0,1,0,0,0,0,1,0
1,1462,20,81.0,14267,6,6,1958,1958,108.0,923.0,...,0,0,0,1,0,0,0,0,1,0
2,1463,60,74.0,13830,5,5,1997,1998,0.0,791.0,...,0,0,0,1,0,0,0,0,1,0
3,1464,60,78.0,9978,6,6,1998,1998,20.0,602.0,...,0,0,0,1,0,0,0,0,1,0
4,1465,120,43.0,5005,8,5,1992,1992,0.0,263.0,...,0,0,0,1,0,0,0,0,1,0
5,1466,60,75.0,10000,6,5,1993,1994,0.0,0.0,...,0,0,0,1,0,0,0,0,1,0
6,1467,20,0.0,7980,6,7,1992,2007,0.0,935.0,...,0,0,0,1,0,0,0,0,1,0
7,1468,60,63.0,8402,6,5,1998,1998,0.0,0.0,...,0,0,0,1,0,0,0,0,1,0
8,1469,20,85.0,10176,7,5,1990,1990,0.0,637.0,...,0,0,0,1,0,0,0,0,1,0
9,1470,20,70.0,8400,4,5,1970,1970,0.0,804.0,...,0,0,0,1,0,0,0,0,1,0


Hm... the conversion of the test data resulted in only 271 features, whereas the conversion of the train_features resulted in 289. In other words, I trained a model on 289 features but wanted to predict on only 271. This led to the ValueError above.

It seems that certain categorical features have different levels in the train data than in the test data. For example, a feature like MSZoning might have the levels RL and RM in the train data, but a level such as FV in the test data. pd.get_dummies creates one column per level it sees, so the two conversions end up with different columns.
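
One way to confirm this suspicion is to compare the column names of the two converted frames. The sketch below uses two tiny made-up frames (the column `Zone` and its levels are hypothetical) and plain set differences:

```python
import pandas as pd

# Made-up train/test frames whose categorical levels only partly overlap
train_toy = pd.DataFrame({"Zone": ["RL", "RM", "RH"]})
test_toy = pd.DataFrame({"Zone": ["RL", "RM", "FV"]})

train_dummies = pd.get_dummies(train_toy)
test_dummies = pd.get_dummies(test_toy)

# Columns that exist on one side only
only_in_train = sorted(set(train_dummies.columns) - set(test_dummies.columns))
only_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))

print("only in train:", only_in_train)
print("only in test: ", only_in_test)
```

Running the same comparison on dummies_train and dummies_test would list exactly the columns behind the 289 vs. 271 mismatch.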

I was stuck. How can I convert the train data so that it also has the levels from the test data?
Different ideas came to mind.

### OneHotEncoder 
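In the scikit-learn version I used, OneHotEncoder only works with categorical **integer** features and therefore cannot be applied directly to the train and test data. Recall that the categorical features are of data type 'object'. (From scikit-learn 0.20 on, OneHotEncoder also accepts text categories.)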

### LabelEncoder
LabelEncoder maps the values of a single array-like of shape **(n_samples, )** to integer codes between 0 and n_levels - 1. So, we would have to call the LabelEncoder for each categorical feature individually. This seems inefficient and would run into the same problem as the get_dummies version, as an encoder fitted on the train data still doesn't know the levels that occur only in the test data.
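
For readers on a more recent scikit-learn: assuming version 0.20 or later, OneHotEncoder can be fitted on text columns directly, and handle_unknown="ignore" tells it to encode levels it has not seen during fit as all zeros. This is only a sketch on made-up data (the frames `train_toy` and `test_toy` are assumptions), not what was used for this benchmark:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 'FV' occurs in the test frame only
train_toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]})
test_toy = pd.DataFrame({"MSZoning": ["RL", "FV"]})

# Learn the levels from the train data only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_toy)

# Unknown levels in the test data become rows of zeros instead of an error
encoded_test = encoder.transform(test_toy).toarray()

print(encoder.categories_[0])
print(encoded_test)
```

The encoded test data always has as many columns as levels seen in the train data, so the feature counts of train and test match by construction.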

## Other conversion ideas?

As pd.get_dummies(), OneHotEncoder and LabelEncoder didn't bring the required success, I did some research on the internet. One approach I found is described right below. Another one is mentioned in the next steps section at the end of this notebook.

### concat train and test, pd.get_dummies, split the set back
This idea wasn't very appealing to me at first, as I don't want to touch the test data until the 'final' model has been built. However, I didn't find any other approach to get to know all levels in the train and test data without having a look at both. Therefore, I tried to minimize the influence of the test data on my model by using the test data 'only' to get a complete set of levels for the categorical features and filling up missing values with 0.

The approach can be summarized as: 
- concat train and test
- get_dummies()
- split the set back.

It is inspired by various blogs/kernels from Stack Overflow, Kaggle and FastML.

In [81]:
# concat test and train data. List all train records first, attach the test data second
all_data = pd.concat((train_features, test_data), axis=0)

# convert categorical variables into dummy/indicator variables.
# dummy_na=True: an additional column is created for missing values
# drop_first=True: the first level of each categorical feature is dropped
all_dummies = pd.get_dummies(all_data, dummy_na=True, drop_first=True)

# remaining NaN values will be filled with a 0. Still a radical approach. See next steps at the bottom of this notebook
all_dummies = all_dummies.fillna(0)

# split test and train sets again
dummies_train = all_dummies.iloc[:train_features.shape[0],:]
dummies_test = all_dummies.iloc[train_features.shape[0]:,:]

print('shape all dummies: ', all_dummies.shape)
print('shape dummy train: ', dummies_train.shape)
print('shape dummy test: ', dummies_test.shape)  # Kaggle is expecting 1459 predictions. 


shape all dummies:  (2919, 289)
shape dummy train:  (1460, 289)
shape dummy test:  (1459, 289)


## Benchmark model and submission

In [82]:
X_train, X_valid, y_train, y_valid = train_test_split(dummies_train, target_variable, test_size=0.2, random_state=0)

# train a simple random forest regressor on the X_train part of the train data
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)

# make a prediction for the X_valid part of the train data, based on the trained model
y_pred = rf_reg.predict(X_valid)

# evaluate the performance of the trained model on the X_valid part of the train data
# root mean squared error (rmse) and score (R^2) are related to each other. See respective literature.
# As a hint: the smaller the rmse, the better. The closer R^2 is to 1, the better.
rmse = sqrt(mean_squared_error(y_valid, y_pred))
print('Root mean squared error: ', rmse)

score = rf_reg.score(X_valid, y_valid)
print('Score: ', score)

Root mean squared error:  33852.20455307229
Score:  0.834057826469
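
The RMSE above is measured in the units of SalePrice, which makes it hard to compare with a leaderboard score such as the 0.16142 reported further down. Assuming, as the competition's evaluation page describes, that submissions are scored by the RMSE between the logarithm of the predicted and the logarithm of the observed price, a local estimate on the same scale could be sketched like this (the helper name `rmse_log` is hypothetical):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices for illustration
perfect = rmse_log([100_000, 200_000], [100_000, 200_000])
off_by_ten_percent = rmse_log([100_000, 200_000], [110_000, 220_000])

print(perfect)
print(off_by_ten_percent)
```

In the notebook this would be called as rmse_log(y_valid, y_pred). On this scale an error counts relative to the price, so being 10% off on a cheap house weighs the same as being 10% off on an expensive one.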


In [49]:
# get the test Ids and make the prediction
test_ids = dummies_test['Id']
predictions = rf_reg.predict(dummies_test)

# prepare submission as outlined in the submission_sample from Kaggle
submission = pd.DataFrame({"Id": test_ids,"SalePrice": predictions})
print(submission.head(20))

      Id  SalePrice
0   1461   130875.0
1   1462   157750.0
2   1463   195282.5
3   1464   178058.5
4   1465   189824.0
5   1466   188750.0
6   1467   168720.0
7   1468   177978.5
8   1469   198450.0
9   1470   128055.0
10  1471   202872.8
11  1472    93820.0
12  1473    92380.0
13  1474   157240.0
14  1475   136700.0
15  1476   351060.5
16  1477   243114.9
17  1478   297898.8
18  1479   232145.6
19  1480   475340.7


In [83]:
# write submission to csv and submit to Kaggle. 
# the default separator is ',' (comma), so we don't have to define it explicitly. Although we could, with sep=','
# index is set to False, as we don't need the row numbers (0, 1, 2 etc.) but the Ids instead
submission.to_csv("../data/submission.csv", index=False)

This was my first - I admit very simplified - prediction. However, this is now a benchmark I want to improve step by step. While working through the steps above, a lot of ideas came up that I want to try out next. Sometimes it helps to start very, very simple so that questions can arise :-)

It scored 0.16142, which put it at rank 1728 (out of about 6000).

## Next steps

Analyze and explore the individual features, especially the numerical ones.
- Find correlations?
- Look at their distributions. Transformation (e.g. log) required?
- What about standardization/normalization?
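
As a first impression of what a log transformation does to a skewed distribution, here is a minimal sketch on made-up values (the series `prices` is an assumption, not taken from the data). It compares the skewness before and after np.log1p:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for a numerical feature or the target
prices = pd.Series([50_000, 100_000, 200_000, 400_000, 800_000], dtype=float)

raw_skew = prices.skew()             # clearly positive: long tail to the right
log_skew = np.log1p(prices).skew()   # close to 0: roughly symmetric on the log scale

print("skew before log:", raw_skew)
print("skew after log: ", log_skew)
```

Whether such a transformation actually helps a tree-based model is something that would have to be checked on the validation data.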

Handling of missing values. Filling up with 0 has to be analyzed. Maybe there are better ways: missing values could be filled with the mean or median of the respective feature, or backfill/forwardfill might be an option, etc.
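
A possible starting point for this analysis, sketched on a made-up frame (the values in `toy` are assumptions for illustration): count the missing values per column, then compare filling with 0 against filling with the column median.

```python
import numpy as np
import pandas as pd

# Made-up numerical features with gaps
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "MasVnrArea": [196.0, 0.0, 162.0, np.nan],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()

# Two ways to fill the gaps: a constant 0 or the median of the respective column
filled_zero = toy.fillna(0)
filled_median = toy.fillna(toy.median(numeric_only=True))

print(missing_per_column)
print(filled_median)
```

To keep the test data out of the picture, the medians would ideally be computed on the train data only and then reused for the test data.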

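Take out 'irrelevant' features. E.g. drop the Id column from the dummies_train and dummies_test data (drop returns a new DataFrame, so the result has to be assigned, and the test Ids have to be saved for the submission first):
dummies_train = dummies_train.drop(['Id'], axis=1)
dummies_test = dummies_test.drop(['Id'], axis=1)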

Analyze feature importance: individual but also dependencies.
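
A fitted random forest exposes a feature_importances_ attribute, which can serve as a first look at individual importance. The sketch below uses synthetic data in which only one of two features drives the target (the names `signal` and `noise` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only, 'noise' is irrelevant
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=200),
    "noise": rng.uniform(0, 10, size=200),
})
y = 3.0 * X["signal"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Importances sum to 1; sort so the most important feature comes first
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the real data, the same two lines after fitting rf_reg would rank all 289 dummy columns. These importances say nothing about dependencies between features, so that part needs a separate analysis.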

Test various algorithms on a very simple basis.
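
One simple way to compare algorithms is cross-validation with the same scoring for every candidate. The following sketch runs on synthetic data from make_regression; the choice of candidates and all parameters are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the real data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    # scikit-learn returns the negated MSE, so flip the sign before taking the root
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    results[name] = float(np.sqrt(-neg_mse).mean())
    print(name, "mean RMSE:", round(results[name], 2))
```

On the real data, X and y would be dummies_train and target_variable, and further candidates can be added to the dictionary.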

Search for and try out another approach to get all different levels of categorical data. Instead of concat train/test, pd.get_dummies(), split back, find another option.
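
One candidate for such an option, sketched here on made-up frames (the column `Zone` and all values are hypothetical): convert train and test separately and then align the test columns to the train columns with reindex. Levels that occur only in the test data are dropped, and levels missing from the test data are added as columns of 0.

```python
import pandas as pd

# Made-up train/test frames with partly different levels
train_toy = pd.DataFrame({"LotArea": [8450, 9600, 11250], "Zone": ["RL", "RM", "RH"]})
test_toy = pd.DataFrame({"LotArea": [11622, 14267], "Zone": ["RL", "FV"]})

train_dummies = pd.get_dummies(train_toy)
test_dummies = pd.get_dummies(test_toy)

# Give the test data exactly the columns (and column order) of the train data
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

This way the levels are learned from the train data only. The price is that a test-only level such as FV is encoded as all zeros for that feature.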