# Elementary Benchmark and Feature Selection 

Even though I have started to learn about machine learning, I always struggle with the question 'Where do I begin with a new machine learning challenge?'. A lot of good literature is available, and blog posts can also give advice on what has to be done in principle. However, the risk remains that as soon as I have to do it myself, my mind goes blank. This notebook shows the very first steps I took to approach Kaggle's House Prices competition in the Getting Started section.

Three things I did when starting this Getting Started competition:<br/>
1) Load the data and have a first glimpse at it<br/>
2) Build a very trivial (elementary) benchmark, which will serve as a starting point for my house price prediction model<br/>
3) Try one possible approach to feature selection to improve the trivial benchmark from above<br/>

This notebook is inspired by ... and ...

## Load data and have a first glimpse at it
Set up a new project with the following structure:<br>
<b>HousePrices</b> <br>
<b>|- data</b> *contains the train and test data* <br>
<b>|- output</b> *contains generated output files, such as the submission file or an overview.xlsx as explained below etc.* <br>
<b>|- src </b> *contains source files e.g. this notebook or .py files etc.*
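
The folder layout above can also be created from code. The following is only a minimal sketch: the helper name `create_project` is made up for illustration, and the demo runs in a temporary directory so that nothing is written into a real project.

```python
from pathlib import Path
import tempfile


def create_project(root):
    """Create the data/output/src folders below `root` and return their paths."""
    root = Path(root)
    folders = {name: root / name for name in ("data", "output", "src")}
    for folder in folders.values():
        # exist_ok=True makes the call safe to repeat on an existing project
        folder.mkdir(parents=True, exist_ok=True)
    return folders


# Demo in a temporary directory (hypothetical location, removed afterwards)
with tempfile.TemporaryDirectory() as tmp:
    project_folders = create_project(Path(tmp) / "HousePrices")
    created = sorted(p.name for p in (Path(tmp) / "HousePrices").iterdir())
    print(created)
```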

In [2]:
# relevant imports
import pandas as pd

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

In [4]:
# load the train and test data
train_data = pd.read_csv('../data/train.csv', sep=',', header=0)
test_data = pd.read_csv('../data/test.csv', sep=',', header=0)

# have a look at train_data shape. It consists of 1460 observations (rows) and 81 features (columns)
print(train_data.shape)

(1460, 81)


In [5]:
# create a CSV file with all the attributes in train_data, including their data types.
# add empty columns for additional information I want to capture for each attribute, e.g. my expectation of the attribute's relevance

attributes = list(train_data.columns.values)
attr_types = list(train_data.dtypes)

overview = pd.DataFrame({"AttributeNames": attributes, "DataType": attr_types, "VarType": "", "Expectation": "",
                         "Conclusion": "", "Comments": ""})
    
overview.to_csv("../output/overview.csv", columns=["AttributeNames", "DataType", "VarType", "Expectation", 
                                                   "Conclusion", "Comments"], header=True, index=False)

Next I read through each attribute: what does it mean, do I expect it to be relevant for the SalePrice, is it a categorical or a quantitative attribute, and so on.

As an example: from personal experience and some research on the internet, I came to the conclusion that the area in square meters will be of high relevance, whereas the shape of the roof may be of low relevance, and so on.

After a first iteration the file still had a lot of gaps in it, but I had a first impression of which attributes are in there and what they mean.
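
Part of the overview could be pre-filled automatically, for instance a first guess at whether an attribute is categorical or quantitative. The sketch below is an assumption on my side, not what was done for the original file: the helper `guess_var_type` is hypothetical and the small dataframe holds made-up toy values that only borrow column names from the competition data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy stand-in for train_data (made-up values)
toy_overview_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
    "SalePrice": [208500, 181500, 223500],
})


def guess_var_type(series):
    """First guess only: numeric dtype -> quantitative, everything else -> categorical."""
    return "quantitative" if is_numeric_dtype(series) else "categorical"


overview_sketch = pd.DataFrame({
    "AttributeNames": toy_overview_data.columns,
    "DataType": [str(t) for t in toy_overview_data.dtypes],
    "VarType": [guess_var_type(toy_overview_data[c]) for c in toy_overview_data.columns],
})
print(overview_sketch)
```

The guess has to be checked by hand: attributes such as MSSubClass are stored as numbers but are really category codes, so they would be labelled quantitative here.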

As a next step I started to do some descriptive (multivariate) analysis to gain additional insights.
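
One simple form such an analysis could take is ranking the numeric attributes by their correlation with SalePrice. This is just a sketch on made-up toy values (the column names are borrowed from the competition data), not the analysis that was actually carried out.

```python
import pandas as pd

# Toy stand-in for the numeric part of train_data (made-up values)
toy_numeric = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "YrSold": [2008, 2007, 2008, 2007],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Pearson correlation of every numeric attribute with the target, strongest first
corr_with_price = (
    toy_numeric.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_price)
```

On the real data the same call would need `train_data.select_dtypes("number")` first, since string columns cannot be correlated directly.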

## Missing Data

- Analyse the missing data: how much is actually missing in each feature?
- Decide whether to delete or complete (impute) the feature, or to delete the observations that contain the missing values.
- Delete or impute, as decided above.
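
For the "impute" option in the list above, a very basic strategy is to fill numeric gaps with the median and categorical gaps with the most frequent value. The sketch below illustrates this on made-up toy values. The helper `impute_simple` is hypothetical, and this notebook itself goes the deletion route for its first benchmark.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (made-up values, column names borrowed from the competition data)
toy_with_gaps = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MasVnrType": ["BrkFace", "Stone", np.nan, "Stone"],
})


def impute_simple(df):
    """Fill numeric columns with their median, all other columns with their mode."""
    filled = df.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled


imputed = impute_simple(toy_with_gaps)
print(imputed)
```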

In [7]:
# analyze missing data

total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['TotalMissing', 'Percent'])

print(missing_data.head(20))

              TotalMissing   Percent
PoolQC                1453  0.995205
MiscFeature           1406  0.963014
Alley                 1369  0.937671
Fence                 1179  0.807534
FireplaceQu            690  0.472603
LotFrontage            259  0.177397
GarageCond              81  0.055479
GarageType              81  0.055479
GarageYrBlt             81  0.055479
GarageFinish            81  0.055479
GarageQual              81  0.055479
BsmtExposure            38  0.026027
BsmtFinType2            38  0.026027
BsmtFinType1            37  0.025342
BsmtCond                37  0.025342
BsmtQual                37  0.025342
MasVnrArea               8  0.005479
MasVnrType               8  0.005479
Electrical               1  0.000685
Utilities                0  0.000000


As a first step, I decided to delete all the features (columns) that have more than 5% missing values (e.g. PoolQC, MiscFeature). In addition, I delete all the observations (rows) that still contain missing values in the remaining features. I store the changed dataframe under the new name benchmark_data, as this radically changes the original dataframe. I decided to go with this mixed column/row deletion approach to make it less radical: instead of deleting 8 additional features, I only remove the few observations that put these features on the list (e.g. BsmtExposure, BsmtFinType2).

Again, this is very radical and may not really improve a future model.
However, I start with this approach to be able to submit a high-level benchmark before going into more detail. I will rethink this step later on and follow another strategy then.

In [11]:
# delete features (columns) with more than 5% missing values
benchmark_data = train_data.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'LotFrontage',
                                 'GarageCond', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual'], axis=1)

# delete observations (rows) with a missing value in any of the remaining affected features
benchmark_data = benchmark_data.dropna(subset=['BsmtExposure', 'BsmtFinType2', 'BsmtFinType1', 'BsmtCond',
                                               'BsmtQual', 'MasVnrArea', 'MasVnrType', 'Electrical'])

# how is the dataframe shaped now?
print(benchmark_data.shape)

(1412, 70)


compared with (1460, 81) in the original train_data set. Let's proceed with this benchmark_data and come back to this step later on.
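
The rule used here (drop columns with more than 5% missing values, then drop the rows that are still incomplete) can also be written without hard-coding column names. The following is a sketch under that assumption. The function name and its `max_missing_share` parameter are made up for illustration, and the toy dataframe holds invented values.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_incomplete(df, max_missing_share=0.05):
    """Drop columns whose share of missing values exceeds the threshold,
    then drop every row that still contains a missing value."""
    missing_share = df.isnull().mean()
    sparse_columns = missing_share[missing_share > max_missing_share].index
    return df.drop(columns=sparse_columns).dropna()


# Toy data: PoolQC is almost empty, BsmtQual has a single gap (2.5%)
toy_missing = pd.DataFrame({
    "PoolQC": [np.nan] * 39 + ["Gd"],
    "BsmtQual": ["TA"] * 39 + [np.nan],
    "LotArea": list(range(40)),
})

cleaned = drop_sparse_then_incomplete(toy_missing)
print(cleaned.shape)
```

In the toy example the PoolQC column is removed entirely, while BsmtQual is kept and only its one incomplete row is dropped.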

## Elementary benchmark including first submission

In [13]:
# convert categorical features - which are all strings in the current dataframe - into dummy (one-hot) columns
benchmark_data = pd.get_dummies(benchmark_data, drop_first=True)

# split the benchmark_data into a train and a valid set.
target_variable = benchmark_data['SalePrice']
features = benchmark_data.drop(['Id', 'SalePrice'], axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(features, target_variable, test_size=0.2, random_state=0)

# train a simple random forest regressor on the X_train part of the benchmark_data
rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)

# make a prediction for the X_valid part of the benchmark_data, based on the trained model
predictions = rf_reg.predict(X_valid)

# evaluate performance of your trained model on the X_valid part of the benchmark_data
loss = mean_squared_error(y_valid, predictions)
print('Loss: ', loss)

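
For the submission itself, the test data has to end up with exactly the same dummy columns as the training data. One way to get there, sketched here on made-up toy values, is to encode train and test together and split them again afterwards. Encoding them separately with `drop_first=True` can silently produce different columns when a category is missing from one of the two sets. The variable names and the commented-out output path are assumptions. The `Id`/`SalePrice` column pair is the format the competition expects.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the cleaned train and test data (made-up values)
toy_train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 11250, 9550],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000],
})
toy_test = pd.DataFrame({
    "Id": [5, 6],
    "LotArea": [11622, 14267],
    "Street": ["Pave", "Pave"],
})

# Encode train and test in one go so both share the same dummy columns
combined = pd.concat(
    [toy_train.drop(columns=["SalePrice"]), toy_test], keys=["train", "test"]
)
encoded = pd.get_dummies(combined.drop(columns=["Id"]), drop_first=True)
X_train_all = encoded.loc["train"]
X_test_all = encoded.loc["test"]
y_train_all = toy_train["SalePrice"]

# Fit on all training rows and predict the test rows
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train_all, y_train_all)

submission = pd.DataFrame({
    "Id": toy_test["Id"],
    "SalePrice": model.predict(X_test_all),
})
print(submission)

# Hypothetical output location, following the project structure of this notebook:
# submission.to_csv("../output/submission.csv", index=False)
```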

## Feature selection