# Housing Prices Competition for Kaggle Learn Users
Apply what you learned in the Machine Learning course on Kaggle Learn alongside others in the course.

## Introduction

**Competition Description**

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

**Practice Skills**
- Creative feature engineering 
- Advanced regression techniques like random forest and gradient boosting


Refer to the [Housing Prices Competition](https://www.kaggle.com/c/home-data-for-ml-course) page for more information.

In [44]:
# So module imports can work and reuse code
import sys; sys.path.insert(0, '../../')

In [14]:
# local modules
from lib.custom_transformers import CustomColumnTransformer

In [15]:
#imports
import pandas as pd
import numpy as np 
import yellowbrick
from ydata_profiling import ProfileReport


import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


In [37]:
# Read the data
home_data_full = pd.read_csv('../../input/home-data-for-ml-course/train.csv', index_col='Id')
home_data_test = pd.read_csv('../../input/home-data-for-ml-course/test.csv', index_col='Id')

## Exploratory Data Analysis (EDA)

### Quick Peek

In [38]:
print(f"The shape of the data is: {home_data_full.shape}")
print(f"Numerical variables: {home_data_full.describe().transpose().shape[0]}")
print(f"Categorical variables: {home_data_full.describe(include=['O']).transpose().shape[0]}")

The shape of the data is: (1460, 80)
Numerical variables: 37
Categorical variables: 43


In [39]:
home_data_full.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MSSubClass | 1460.0 | 56.89726 | 42.300571 | 20.0 | 20.0 | 50.0 | 70.0 | 190.0 |
| LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.0 | 69.0 | 80.0 | 313.0 |
| LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.5 | 9478.5 | 11601.5 | 215245.0 |
| OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.0 | 6.0 | 7.0 | 10.0 |
| OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.0 | 5.0 | 6.0 | 9.0 |
| YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.0 | 1973.0 | 2000.0 | 2010.0 |
| YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.0 | 1994.0 | 2004.0 | 2010.0 |
| MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.0 | 0.0 | 166.0 | 1600.0 |
| BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.0 | 383.5 | 712.25 | 5644.0 |
| BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.0 | 0.0 | 0.0 | 1474.0 |


In [19]:
home_data_full.describe(include=['O']).transpose()

| | count | unique | top | freq |
|---|---|---|---|---|
| MSZoning | 1460 | 5 | RL | 1151 |
| Street | 1460 | 2 | Pave | 1454 |
| Alley | 91 | 2 | Grvl | 50 |
| LotShape | 1460 | 4 | Reg | 925 |
| LandContour | 1460 | 4 | Lvl | 1311 |
| Utilities | 1460 | 2 | AllPub | 1459 |
| LotConfig | 1460 | 5 | Inside | 1052 |
| LandSlope | 1460 | 3 | Gtl | 1382 |
| Neighborhood | 1460 | 25 | NAmes | 225 |
| Condition1 | 1460 | 9 | Norm | 1260 |


From the `describe` output we can notice the following facts:
- there are several columns with a high amount of missing values (for example, `Alley` has only 91 non-null values out of 1460)
- there are 43 categorical variables, and we need to keep the ones with low cardinality
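
Both observations can be checked in one pass. The following is a minimal sketch on a small made-up frame; the `toy` data and the `summary` name are illustrative assumptions and are not part of the notebook. The same expression could presumably be applied to `home_data_full`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for home_data_full (values are made up for illustration)
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Per-column share of missing values and number of distinct non-null values
summary = pd.DataFrame({
    "missing_pct": toy.isna().mean() * 100,
    "n_unique": toy.nunique(),
    "dtype": toy.dtypes.astype(str),
}).sort_values("missing_pct", ascending=False)

print(summary)
```

Sorting by `missing_pct` puts the columns that need an imputation decision at the top.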

### Pandas Report

In [58]:
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in home_data_full.columns if home_data_full[cname].nunique() < 10 and 
                        home_data_full[cname].dtype == "object"]

high_cardinality_cols = [cname for cname in home_data_full.columns if home_data_full[cname].nunique() >= 10 and 
                        home_data_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in home_data_full.columns if home_data_full[cname].dtype in ['int64', 'float64']]

# Print column names
print(f"Low cardinality columns ({len(low_cardinality_cols)}): {low_cardinality_cols} ")
print(f"High cardinality columns ({len(high_cardinality_cols)}): {high_cardinality_cols} ")
print(f"Numerical columns ({len(numerical_cols)}): {numerical_cols} ")

Low cardinality columns (40): ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'] 
High cardinality columns (3): ['Neighborhood', 'Exterior1st', 'Exterior2nd'] 
Numerical columns (37): ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Firep

In [21]:
profile = ProfileReport(home_data_full[low_cardinality_cols + numerical_cols], title="Profiling Report")
profile.to_file("./reports/home_data_report.html")
# profile.to_html()
# profile


### Missing values

From the [missing value matrix](./reports/home_data_report.html#missing) we can see that some variables have a high percentage of missing values. The columns with more than 10% missing values are:

In [41]:
cols = low_cardinality_cols + numerical_cols
high_missing_cols = [
  cname for cname in home_data_full[cols] 
  if (1 - home_data_full[cname].count()/home_data_full.shape[0]) > 0.10 ]

high_missing_cols

['Alley',
 'MasVnrType',
 'FireplaceQu',
 'PoolQC',
 'Fence',
 'MiscFeature',
 'LotFrontage']

Let's take a look at the [data_description](../../input/home-data-for-ml-course/data_description.txt) file to see whether any of the missing values carry meaning:

```bash
Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access

MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone

FireplaceQu: Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace

PoolQC: Pool quality
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool

Fence: Fence quality
		
       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence

MiscFeature: Miscellaneous feature not covered in other categories
		
       Elev	Elevator
       Gar2	2nd Garage (if not described in garage section)
       Othr	Other
       Shed	Shed (over 100 SF)
       TenC	Tennis Court
       NA	None

LotFrontage: Linear feet of street connected to property
```

In [42]:
for column in home_data_full[low_cardinality_cols]:
  if column in high_missing_cols:
    print(f"{column}: {home_data_full[column].unique()}")

Alley: [nan 'Grvl' 'Pave']
MasVnrType: ['BrkFace' nan 'Stone' 'BrkCmn']
FireplaceQu: [nan 'TA' 'Gd' 'Fa' 'Ex' 'Po']
PoolQC: [nan 'Ex' 'Fa' 'Gd']
Fence: [nan 'MnPrv' 'GdWo' 'GdPrv' 'MnWw']
MiscFeature: [nan 'Shed' 'Gar2' 'Othr' 'TenC']


- For some categorical columns, `[Alley, FireplaceQu, PoolQC, Fence, MiscFeature]`, the [data_description](../../input/home-data-for-ml-course/data_description.txt) explicitly says that NA means the specific feature does not exist. So we could create a value in the category to cover this, since a pool or a fireplace could actually influence the price of a property.

- Regarding `MasVnrType`, it might be related to `MasVnrArea`, so let's take a deeper look into this variable.

- Finally, for `LotFrontage` there is no other variable that could be used to fill the gaps directly, and `LotArea` already describes part of the same information, so the column will be dropped.
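
The NaN values in those categorical columns come from how the file is parsed: `pd.read_csv` treats the string `NA` as missing by default. The following is a minimal sketch on a made-up three-row CSV (the `raw` text and the `absent_means_na` list are illustrative assumptions) showing the effect and one way to restore an explicit category.

```python
import io
import pandas as pd

# Tiny made-up CSV; "NA" is how the raw file marks "feature not present"
raw = io.StringIO(
    "Id,Alley,PoolQC,LotArea\n"
    "1,NA,NA,8450\n"
    "2,Grvl,NA,9600\n"
    "3,Pave,Ex,11250\n"
)
df = pd.read_csv(raw, index_col="Id")

# By default pandas parses the string "NA" as a missing value (NaN)
print(df.isna().sum())

# Turn NaN back into an explicit "NA" category for the columns where
# the data description says NA means "feature not present"
absent_means_na = ["Alley", "PoolQC"]
df[absent_means_na] = df[absent_means_na].fillna("NA")

print(df)
```

Filling only the listed columns keeps genuinely missing values in other columns visible for a separate decision.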

In [28]:
home_data_full.loc[ home_data_full['MasVnrArea'].isna(),  ['MasVnrType', 'MasVnrArea']]

| Id | MasVnrType | MasVnrArea |
|---|---|---|
| 235 | NaN | NaN |
| 530 | NaN | NaN |
| 651 | NaN | NaN |
| 937 | NaN | NaN |
| 974 | NaN | NaN |
| 978 | NaN | NaN |
| 1244 | NaN | NaN |
| 1279 | NaN | NaN |


In [54]:
# plt.switch_backend('Qt5Agg')
# Count rows with a missing MasVnrType for each MasVnrArea value
mas_vnr_df = home_data_full.loc[ 
                   home_data_full['MasVnrType'].isna(),  ['MasVnrType', 'MasVnrArea']
                  ]
mas_vnr_df.groupby('MasVnrArea').apply(lambda x: x.isna().sum()).drop('MasVnrArea', axis=1)
# mas_vnr_df.groupby('MasVnrArea').apply(lambda x: x.isna().sum()).drop('MasVnrArea', axis=1).plot.bar()
# plt.show()

| MasVnrArea | MasVnrType |
|---|---|
| 0.0 | 859 |
| 1.0 | 2 |
| 288.0 | 1 |
| 312.0 | 1 |
| 344.0 | 1 |


Here we notice that NaN does not always mean that the masonry veneer does not exist: if it did, the area would always be 0, yet a few rows with a missing `MasVnrType` have a non-zero `MasVnrArea`.

What can be done is to drop the column `MasVnrType` and keep `MasVnrArea`, as it tells us whether or not the property has a _masonry veneer_. If `MasVnrArea` has missing values we can impute them with the most frequent value, which is 0.
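
As a minimal sketch of that imputation, assuming scikit-learn's `SimpleImputer` is used (the `toy` values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up sample: most houses have no masonry veneer (area 0)
toy = pd.DataFrame({"MasVnrArea": [0.0, 0.0, 196.0, np.nan, 0.0, 162.0, np.nan]})

# strategy="most_frequent" replaces NaN with the mode of the column
imputer = SimpleImputer(strategy="most_frequent")
toy["MasVnrArea"] = imputer.fit_transform(toy[["MasVnrArea"]]).ravel()

print(imputer.statistics_)         # value learned during fit
print(toy["MasVnrArea"].tolist())  # no NaN left
```

Because the mode here is 0, this gives the same result as filling with a constant 0; the two strategies would differ only if the training data had a different most frequent value.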

Let's take a look at the rest of the columns with missing values to define the imputation strategy.

In [59]:
drop_cols = high_cardinality_cols + ['LotFrontage', 'MasVnrType']
drop_cols

['Neighborhood', 'Exterior1st', 'Exterior2nd', 'LotFrontage', 'MasVnrType']

In [63]:
low_cardinality_cols = [ c for c in low_cardinality_cols if c not in drop_cols]
numerical_cols = [ c for c in numerical_cols if c not in drop_cols]

print(f"Low cardinality cols ({len(low_cardinality_cols)}): {low_cardinality_cols}")
print(f"Num cols ({len(numerical_cols)}): {numerical_cols}")

Low cardinality cols (39): ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
Num cols (36): ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPor

In [64]:
num_cols_w_missing = [
  cname for cname in home_data_full[numerical_cols] 
  if home_data_full[cname].count() < home_data_full.shape[0] 
  ]
num_cols_w_missing

['MasVnrArea', 'GarageYrBlt']

In [66]:
cat_cols_w_missing = [
  cname for cname in home_data_full[low_cardinality_cols] 
  if home_data_full[cname].count() < home_data_full.shape[0] and cname not in high_missing_cols
  ]
cat_cols_w_missing

['BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond']

For the numerical variables, `MasVnrArea` NaN values will be set to `0`, since it is most likely that no masonry veneer is present. For `GarageYrBlt`, we can explore how this column relates to the rest of the features that describe the garage.

In [72]:
garage_cols = [
  'GarageType',
  'GarageFinish',
  'GarageQual',
  'GarageCond',
  'GarageYrBlt'
]
garage_df = home_data_full[garage_cols]
garage_df.loc[garage_df.isna().any(axis=1)]

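| Id | GarageType | GarageFinish | GarageQual | GarageCond | GarageYrBlt |
|---|---|---|---|---|---|
| 40 | NaN | NaN | NaN | NaN | NaN |
| 49 | NaN | NaN | NaN | NaN | NaN |
| 79 | NaN | NaN | NaN | NaN | NaN |
| 89 | NaN | NaN | NaN | NaN | NaN |
| 90 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 1350 | NaN | NaN | NaN | NaN | NaN |
| 1408 | NaN | NaN | NaN | NaN | NaN |
| 1450 | NaN | NaN | NaN | NaN | NaN |
| 1451 | NaN | NaN | NaN | NaN | NaN |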


In [73]:
print(garage_df.loc[garage_df.isna().any(axis=1)].describe())
print(garage_df.loc[garage_df.isna().any(axis=1)].describe(include='O'))

       GarageYrBlt
count          0.0
mean           NaN
std            NaN
min            NaN
25%            NaN
50%            NaN
75%            NaN
max            NaN
       GarageType GarageFinish GarageQual GarageCond
count           0            0          0          0
unique          0            0          0          0
top           NaN          NaN        NaN        NaN
freq          NaN          NaN        NaN        NaN


From the [data_description](../../input/home-data-for-ml-course/data_description.txt), we can see again that NaN values correspond to properties that do not have a garage, and this is consistent across the rest of the "Garage features":

```bash
GarageType: Garage location
		
       2Types	More than one type of garage
       Attchd	Attached to home
       Basment	Basement Garage
       BuiltIn	Built-In (Garage part of house - typically has room above garage)
       CarPort	Car Port
       Detchd	Detached from home
       NA	No Garage
```
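So what we could do is impute constants: set `GarageYrBlt` to 0 and, for the categorical _"Garage features"_, create an `NA` category for the missing values. (Strictly speaking this is constant-value imputation rather than hot-deck imputation, which borrows values from similar records.)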
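
A minimal sketch of this fill on made-up rows, assuming plain pandas is enough for the job (the `toy` frame and the `garage_fill` name are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up rows: the second house has no garage, so every garage field is NaN
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageFinish": ["RFn", np.nan, "Unf"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0],
})

# One fill value per column: "NA" for categorical fields, 0 for the year
garage_fill = {"GarageType": "NA", "GarageFinish": "NA", "GarageYrBlt": 0}
toy = toy.fillna(value=garage_fill)

print(toy)
```

One caveat worth keeping in mind: a year of 0 is a sentinel far outside the real range, which tree-based models can usually isolate with a split but which may distort models that treat the year as a continuous quantity.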

In [67]:
bsmt_cols = [
  'BsmtQual',
  'BsmtCond',
  'BsmtExposure',
  'BsmtFinType1',
  'BsmtFinType2',
]
bsmt_df = home_data_full[bsmt_cols]
bsmt_df.loc[bsmt_df.isna().any(axis=1)]

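| Id | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 |
|---|---|---|---|---|---|
| 18 | NaN | NaN | NaN | NaN | NaN |
| 40 | NaN | NaN | NaN | NaN | NaN |
| 91 | NaN | NaN | NaN | NaN | NaN |
| 103 | NaN | NaN | NaN | NaN | NaN |
| 157 | NaN | NaN | NaN | NaN | NaN |
| 183 | NaN | NaN | NaN | NaN | NaN |
| 260 | NaN | NaN | NaN | NaN | NaN |
| 333 | Gd | TA | No | GLQ | NaN |
| 343 | NaN | NaN | NaN | NaN | NaN |
| 363 | NaN | NaN | NaN | NaN | NaN |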


## Data Preparation

From the previous section, we concluded that the data transformation should follow these steps:

1. Drop columns
   1. Drop high-cardinality columns `['Neighborhood', 'Exterior1st', 'Exterior2nd']`
   1. Drop columns with a high percentage of missing values that cannot be imputed or calculated in any way `['MasVnrType', 'LotFrontage']`
1. Missing values
   1. Fill NaN with a new category `NA` for columns where the data description explicitly says NaN means absence of the feature: `['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']`
   1. For _"Garage features"_, impute constant values: `{'GarageType': 'NA','GarageFinish': 'NA', 'GarageQual': 'NA','GarageCond': 'NA','GarageYrBlt': 0 }`
   1. For `MasVnrArea`, fill NaN with `0`
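
The steps above can be sketched with scikit-learn building blocks. This is only an illustration on a made-up frame, assuming `SimpleImputer(strategy="constant")` is an acceptable way to express the constant fills; the names `toy`, `cat_cols`, `num_cols`, and `sketch_preprocessor` are hypothetical and do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows covering one column of each kind discussed above
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, 2005.0],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    "Neighborhood": ["CollgCr", "Veenker", "NAmes", "NAmes"],
})

cat_cols = ["Alley", "GarageType"]
num_cols = ["GarageYrBlt", "MasVnrArea"]

sketch_preprocessor = ColumnTransformer(
    transformers=[
        # Numerical columns: NaN becomes 0
        ("num", SimpleImputer(strategy="constant", fill_value=0), num_cols),
        # Categorical columns: NaN becomes its own "NA" category, then one-hot
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="constant", fill_value="NA")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ],
    remainder="drop",  # columns not listed (e.g. Neighborhood) are dropped
)

transformed = sketch_preprocessor.fit_transform(toy)
print(transformed.shape)
```

With `remainder="drop"`, the column-dropping step does not need separate code: any column left out of `cat_cols` and `num_cols` is simply not passed through.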

**Note:** `Neighborhood` could be an important factor for the price, so if the predictions are poor we could convert it to ZIP codes and then apply PCA to reduce dimensionality.

In [32]:
# Preprocessing for numerical data
numerical_transformer = Pipeline([
  ('custom_mas_vnr', CustomColumnTransformer(column='MasVnrArea', value=0)),
  ('custom_garage_yr', CustomColumnTransformer(column='GarageYrBlt', value=0))
])

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
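
`CustomColumnTransformer` comes from a local module that is not shown here, so its actual implementation is unknown. As a hypothetical stand-in, a transformer that fills NaN in one column with a fixed value could look roughly like the sketch below; the class name `ConstantFillTransformer` and the `toy` frame are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ConstantFillTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: fill NaN in one column with a fixed value."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def fit(self, X, y=None):
        # Nothing to learn; the fill value is given up front
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's frame
        X[self.column] = X[self.column].fillna(self.value)
        return X


toy = pd.DataFrame({"MasVnrArea": [196.0, np.nan], "GarageYrBlt": [np.nan, 1998.0]})
filled = ConstantFillTransformer(column="MasVnrArea", value=0).fit_transform(toy)
print(filled)
```

Returning a DataFrame from `transform` lets two such steps be chained in a `Pipeline`, since the second step can still address its column by name.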

In [33]:
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, low_cardinality_cols)
    ])

## Feature Engineering

In [34]:
# Start from the frames loaded at the top of the notebook
# Remove rows with missing target, separate target from predictors
X_full = home_data_full.dropna(axis=0, subset=['SalePrice'])
y = X_full.SalePrice
X_full = X_full.drop(['SalePrice'], axis=1)

# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = home_data_test.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

## Algorithm selection

## Model training

## Model Evaluation