## Imports

In [141]:
# Use this cell to regroup all your imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from scipy import stats
from tempfile import mkdtemp
from shutil import rmtree

from xgboost import XGBRegressor

from sklearn import set_config
set_config(display = 'diagram')

# Sklearn preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.ensemble import AdaBoostRegressor, VotingRegressor, GradientBoostingRegressor, StackingRegressor, RandomForestRegressor
from sklearn.feature_selection import SelectPercentile, mutual_info_regression, VarianceThreshold, SelectFromModel
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV, cross_validate, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# 🏆 Le Wagon Kaggle Batch Challenge

**Welcome to your first Kaggle competition!**

<img src='https://wagon-public-datasets.s3.amazonaws.com/data-science-images/ML/kaggle-batch-challenge.png' width=600>

Your objective is to **submit an answer (online)** to the open competition [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) 🏠

Fortunately, you have already come across the housing dataset earlier in the bootcamp! You will be semi-guided toward a **baseline model**, and only after creating a baseline will you be free to improve and refine it. We will approach the problem using **pipelines** (the best practice)!

A few words on Kaggle:
- Kaggle will rank your submission amongst all participants!
- Everyone is removed from the public leaderboard after 2 months
- You can make up to 10 submissions per day

🧹 Today is the perfect day to practice keeping your long notebook **tidy** 🧹
- Collapse all headings from the command palette (`Cmd + Shift + P`)
- Stay "idempotent" (`Restart & Run All` should never crash)
- Name and delete variables carefully

## Kaggle Setup

👉 Create an account on Kaggle if you want to participate in the competition

👉 Join the [House Prices Challenge](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) 

👉 Write down your Kaggle `username` in the [results spreadsheet here](https://docs.google.com/spreadsheets/d/1ZEBKwa_k1Ytb0WCOh-Nopq3eaezwBNu1SAqKXEXRguc/edit#gid=0); if you can't find your batch, reach out to your teacher!

**The whole batch will compete as a group against the team of TAs**

## Loading Data

In the challenge instructions, you should have already executed the steps to download everything you need from Kaggle into your current notebook folder:

- `train.csv` is your `(1460, 81)` training set containing `X` and `y`
- `test.csv` is your `(1459, 80)` testing set without the associated target `y` 😈
- `sample_submission.csv` describes the format required to submit your answer

ℹ️ You'll find a detailed description of the dataset [here](https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/07-Ensemble-Methods/kaggle_houses_data_description.txt). Refer to it throughout the challenge!

Your goal is to predict the `y_pred` missing from your test set and submit it to discover your `test_score` and ranking

❓ Load the training dataset into a DataFrame called `data`, and create your `X` and `y`. Inspect their shapes.

**Hint:** if you check the CSV file, you will notice a column called `Id`. When reading the CSV file into a DF, make sure to set `index_col="Id"` so that you don't get two ID columns 😉

In [142]:
data = pd.read_csv("data/train.csv", index_col="Id")
display(data.describe())
display(data.head())

print("\nShape", data.shape)
print("\n Dtypes")
display(data.dtypes.value_counts())
print("\nDuplicates", data.duplicated().sum())
print("\nNulls")
display(data.isnull().sum().sort_values(ascending=False).head(20))

print("Columns:", data.columns)


,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000



Shape (1460, 80)

 Dtypes


object     43
int64      34
float64     3
dtype: int64


Duplicates 0

Nulls


PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageYrBlt       81
GarageCond        81
GarageType        81
GarageFinish      81
GarageQual        81
BsmtExposure      38
BsmtFinType2      38
BsmtCond          37
BsmtQual          37
BsmtFinType1      37
MasVnrArea         8
MasVnrType         8
Electrical         1
MSSubClass         0
dtype: int64

Columns: Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDr

In [143]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]

print(X.shape, y.shape)

(1460, 79) (1460,)


# 🐣 1. BASELINE

## 1.1 Initial feature overview

79 features are too many to deal with one by one for a first baseline pipeline! Let's treat them solely based on their `dtype`:

❓ How many numerical features vs. categorical features do we have? 

In [144]:
print(data.dtypes.value_counts())

X_cat = X.select_dtypes("object")
X_num = X.select_dtypes("number")

print(X_cat.shape)
print(X_num.shape)

object     43
int64      34
float64     3
dtype: int64
(1460, 43)
(1460, 36)


❓ Create a Series called `feat_categorical_nunique` containing the number of **unique values** for each categorical feature in our training set. How many unique categories are there in total?

In [145]:
feat_categorical_nunique = X_cat.nunique()
feat_categorical_nunique

MSZoning          5
Street            2
Alley             2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        8
BldgType          5
HouseStyle        8
RoofStyle         6
RoofMatl          8
Exterior1st      15
Exterior2nd      16
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          4
BsmtCond          4
BsmtExposure      4
BsmtFinType1      6
BsmtFinType2      6
Heating           6
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
FireplaceQu       5
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
PoolQC            3
Fence             4
MiscFeature       4
SaleType          9
SaleCondition     6
dtype: int64
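To answer the second half of the question, the per-feature counts can simply be summed (in this notebook that would be `feat_categorical_nunique.sum()`, which should come to 252 for the counts listed above). Here is a minimal, self-contained sketch on a toy DataFrame; the column values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the categorical part of the training set (hypothetical values)
toy_cat = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2", "Reg"],
    "Alley": [None, "Grvl", None, "Pave"],
})

# nunique() ignores missing values by default
toy_nunique = toy_cat.nunique()

# Total number of distinct categories across all categorical features
total_categories = int(toy_nunique.sum())
print(toy_nunique.to_dict(), total_categories)
```

This total is also a rough estimate of how many columns a full one-hot encoding would produce.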

🤔 If we were to `OneHotEncode` all categorical features, our feature matrix `X_preproc` would become pretty big and sparse, with almost 300 (highly correlated) features for only 1400 observations. Ideally, we should aim at feeding our model with a maximum of ~50 features (📚 read this [rule of thumb](https://datascience.stackexchange.com/a/11480/98300))

We know 2 main strategies to reduce the number of categorical features post-preprocessing:
1. **[Remove](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection)** features that bring too little explanation to our model; this may require statistical analysis of feature importance
2. **[Ordinally encode](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)** (instead of one-hot encode) categorical features into integers; this, however, creates a notion of "order" (1 > 2 > 3 > ...) that can be detrimental if not handled properly!

❓ Plot the **histogram** of the number of unique values per categorical feature. Do you see some quick wins?

In [146]:
#for col in X_cat.columns:
#    X_cat[col].value_counts().plot(kind="bar", figsize=(3,3))
#    plt.show()
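The commented-out loop above draws one bar chart per feature rather than the histogram the question asks for. A possible sketch of that histogram follows; `toy_nunique` holds made-up counts standing in for `feat_categorical_nunique`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical unique-value counts; in the notebook, use feat_categorical_nunique instead
toy_nunique = pd.Series(
    {"Street": 2, "CentralAir": 2, "LotShape": 4, "HouseStyle": 8, "Neighborhood": 25}
)

fig, ax = plt.subplots(figsize=(6, 3))

# One bin per possible number of unique values
counts, bin_edges, _ = ax.hist(toy_nunique, bins=range(1, toy_nunique.max() + 2))
ax.set_xlabel("Number of unique values")
ax.set_ylabel("Number of categorical features")
```

Features sitting in the left-hand bins are cheap to one-hot encode; the long right tail is where dropping or ordinal encoding pays off.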

💡 As a starting point, what about simply **removing** all features that have **7 unique values or more**, and one-hot encoding the rest? Let's keep ordinal encoding and statistical feature selection for the next iteration of our pipeline.

❓ Store the names of the features to be OHE'd in a list called `feat_categorical_small` below. How many features will be OHE'd?

In [147]:
# Keep only the categorical features with fewer than 7 unique values
feat_categorical_small = list(feat_categorical_nunique[feat_categorical_nunique < 7].index)

print(len(feat_categorical_small))
print(feat_categorical_small)

34
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'BldgType', 'RoofStyle', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleCondition']



🧪 Test your code below (and clear the cell once it passed)

In [149]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'features_overview',
    n=len(feat_categorical_small)
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/07-Ensemble-Methods/data-houses-kaggle-competition/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_features_overview.py::TestFeaturesOverview::test_feat_categorical_small PASSED [100%]



💯 You can commit your code:

git add tests/features_overview.pickle

git commit -m 'Completed features_overview step'

git push origin master



## 1.2 Baseline Pipe

### a) Preprocessing

❓ Let's code the basic preprocessing pipeline described below. Save it under `preproc_baseline`.

For categorical features:
- Simple-Impute with the most frequent values
- One-Hot Encode features that have fewer than 7 unique values to start with
- Drop all other features


As for numerical features:
- Simple-Impute with strategy `mean`
- Min-Max Scale


<details>
    <summary>ℹ️ Click here for a pro tip</summary>

If you are confident, you can try Sklearn's shorter-syntax `make_pipeline` or `make_column_transformer` instead of the longer syntax of `Pipeline` or `ColumnTransformer`; also useful if you want to avoid giving names manually to every step.
</details>

In [150]:
# Numerical preprocessing
num_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaling', MinMaxScaler())
]).set_output(transform="pandas")

num_preprocessor

In [151]:
# Test pipeline
num_preprocessor.fit(X_num)
test_np = num_preprocessor.transform(X_num)
test_np.head(1)

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
1,0.235294,0.150685,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,0.0,...,0.38646,0.0,0.111517,0.0,0.0,0.0,0.0,0.0,0.090909,0.5


In [152]:
test_np.isnull().sum().sort_values(ascending=True).head(5)

MSSubClass      0
BedroomAbvGr    0
KitchenAbvGr    0
TotRmsAbvGrd    0
Fireplaces      0
dtype: int64

In [153]:
# Categorical preprocessing
cat_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoding', OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
]).set_output(transform="pandas")

cat_preprocessor

In [154]:
# Test pipeline
cat_preprocessor.fit(X_cat[feat_categorical_small])
test_cp = cat_preprocessor.transform(X_cat[feat_categorical_small])
test_cp.head(1)

Id,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Grvl,Street_Pave,Alley_Grvl,Alley_Pave,LotShape_IR1,...,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [155]:
# Final pipe
preproc_baseline = ColumnTransformer([
    ('num_preprocessor', num_preprocessor, make_column_selector(dtype_include="number")),
    ('cat_preprocessor', cat_preprocessor, feat_categorical_small)
]).set_output(transform="pandas")

preproc_baseline

In [156]:
preproc_baseline.fit(X)
test_pp = preproc_baseline.transform(X)
test_pp.head(1)

Id,num_preprocessor__MSSubClass,num_preprocessor__LotFrontage,num_preprocessor__LotArea,num_preprocessor__OverallQual,num_preprocessor__OverallCond,num_preprocessor__YearBuilt,num_preprocessor__YearRemodAdd,num_preprocessor__MasVnrArea,num_preprocessor__BsmtFinSF1,num_preprocessor__BsmtFinSF2,...,cat_preprocessor__MiscFeature_Gar2,cat_preprocessor__MiscFeature_Othr,cat_preprocessor__MiscFeature_Shed,cat_preprocessor__MiscFeature_TenC,cat_preprocessor__SaleCondition_Abnorml,cat_preprocessor__SaleCondition_AdjLand,cat_preprocessor__SaleCondition_Alloca,cat_preprocessor__SaleCondition_Family,cat_preprocessor__SaleCondition_Normal,cat_preprocessor__SaleCondition_Partial
1,0.235294,0.150685,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


❓ Look at the **shape** of your preprocessed DataFrame and save it to `shape_preproc_baseline`

In [157]:
shape_preproc_baseline = test_pp.shape
shape_preproc_baseline

(1460, 183)

🧪 Test your code below

In [158]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'preproc_baseline',
    shape=shape_preproc_baseline
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/07-Ensemble-Methods/data-houses-kaggle-competition/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 1 item

test_preproc_baseline.py::TestPreprocBaseline::test_shape PASSED [100%]



💯 You can commit your code:

git add tests/preproc_baseline.pickle

git commit -m 'Completed preproc_baseline step'

git push origin master



### b) Add Estimator

❓ Add a simple Decision Tree model to your `preproc_baseline` and store the result in a variable called `pipe_baseline`.

In [159]:
pipe_baseline = Pipeline([
    ("preprocessing", preproc_baseline),
    ("tree_regressor", DecisionTreeRegressor())
]).set_output(transform="pandas")

### c) Cross-Validate

❓ Read the Kaggle [contest evaluation rules](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview/evaluation). Which performance metric do you need? Is it readily available in Sklearn?

Sadly, it isn't! We will need to create our custom `sklearn.metrics.scorer` object to pass to any cross-validation or Grid Search. The process is described below:


1. Create a scorer called `rmsle` using [`make_scorer`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) that can be passed as a value for the `scoring` `kwarg` like so:  
    ```python
    cross_val_score(pipe_baseline, X, y, cv=5, scoring=rmsle)
    ```
2.  Create its negative counterpart, `rmsle_neg`, which is best when _maximized_; this will come in handy later as `GridSearchCV` always tries to _maximize_ a score 😉
    ```python
    GridSearchCV(pipe_baseline, param_grid=..., cv=5, scoring=rmsle_neg)
    ```

RMSLE formula

$$\text{RMSLE}(y, \hat{y}) = \sqrt{\frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \left(\log_e (1 + y_i) - \log_e (1 + \hat{y}_i) \right)^2}.$$
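The two scorers described above could be built roughly as follows. This is a sketch, not the official solution: `rmsle_metric` is a helper name introduced here, and the toy data and `DummyRegressor` only serve to make the snippet runnable on its own. Recent scikit-learn releases also accept the built-in scoring string `"neg_root_mean_squared_log_error"`, which is what the cross-validation cell in this notebook relies on.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import cross_val_score


def rmsle_metric(y_true, y_pred):
    """Square root of scikit-learn's mean squared log error (uses log(1 + y))."""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))


# Raw metric: lower is better
rmsle = make_scorer(rmsle_metric)

# Sign-flipped version: higher is better, as expected by GridSearchCV
rmsle_neg = make_scorer(rmsle_metric, greater_is_better=False)

# Toy regression problem with positive targets (hypothetical values)
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 3))
toy_y = rng.uniform(100_000, 300_000, size=50)

toy_model = DummyRegressor(strategy="mean")
scores = cross_val_score(toy_model, toy_X, toy_y, cv=5, scoring=rmsle)
scores_neg = cross_val_score(toy_model, toy_X, toy_y, cv=5, scoring=rmsle_neg)
print(scores.mean(), scores_neg.mean())
```

Note that the metric is undefined for negative predictions, so a model that can predict below zero will make `mean_squared_log_error` raise an error.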

❓ 5-fold cross-validate your `pipe_baseline` using this metric to get a first glance at your baseline performance.

Store your mean score as `score_baseline`

In [160]:
cv_results = cross_validate(estimator=pipe_baseline, X=X, y=y, scoring="neg_root_mean_squared_log_error")
cv_results

{'fit_time': array([0.18668103, 0.15293741, 0.20106864, 0.1669786 , 0.17171717]),
 'score_time': array([0.05793381, 0.06991482, 0.05110192, 0.05475569, 0.05951786]),
 'test_score': array([-0.207896  , -0.22774953, -0.18620275, -0.19113674, -0.23370456])}

In [161]:
score_baseline = cv_results["test_score"].mean()

In [162]:
score_baseline

-0.2093379170123018

### d) Predict Baseline

❓ Predict `y_pred_baseline` from the Kaggle `test.csv` dataset you stored in the `data` folder.

In [163]:
test_data = pd.read_csv("data/test.csv", index_col="Id")
test_data.head(1)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,120,0,,MnPrv,,0,6,2010,WD,Normal


In [254]:
test_id = test_data.reset_index()["Id"]
test_id[:5]

0    1461
1    1462
2    1463
3    1464
4    1465
Name: Id, dtype: int64

In [165]:
test_X = preproc_baseline.transform(test_data)
pipe_baseline.fit(X, y)

In [166]:
y_pred_baseline = pipe_baseline.predict(test_data)

In [167]:
y_pred_baseline = pd.DataFrame(y_pred_baseline).rename(columns = {0: "SalePrice"})

In [168]:
y_pred_baseline

,SalePrice
0,127500.0
1,157900.0
2,173000.0
3,169000.0
4,229456.0
...,...
1454,82500.0
1455,83500.0
1456,139000.0
1457,108000.0


In [169]:
y_pred_baseline_w_index = pd.concat([test_id, y_pred_baseline], axis=1)[["Id", "SalePrice"]]
y_pred_baseline_w_index.to_csv("data/submission_baseline.csv", index=False)

In [255]:
y_pred_baseline_w_index

,Id,SalePrice
0,1461,127500.0
1,1462,157900.0
2,1463,173000.0
3,1464,169000.0
4,1465,229456.0
...,...,...
1454,2915,82500.0
1455,2916,83500.0
1456,2917,139000.0
1457,2918,108000.0


❓ Finally, store your ready-to-submit CSV as `submission_baseline.csv` in the `data` folder. **Carefully read** and understand Kaggle's required format and test it below (you don't need to submit this baseline to Kaggle for now).
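For reference, the two-column layout (`Id`, `SalePrice`) can also be assembled directly from the index of the test DataFrame. The snippet below is a sketch on made-up values; `toy_test` and `toy_pred` are hypothetical stand-ins for `test_data` and the model's predictions.

```python
import numpy as np
import pandas as pd

# Toy test set indexed by Id, plus toy predictions (hypothetical values)
toy_test = pd.DataFrame(
    {"LotArea": [11622, 14267, 13830]},
    index=pd.Index([1461, 1462, 1463], name="Id"),
)
toy_pred = np.array([127500.0, 157900.0, 173000.0])

# One row per test house: its Id and the predicted price
toy_submission = pd.DataFrame({"Id": toy_test.index, "SalePrice": toy_pred})

# index=False keeps the positional index out of the file
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

Building the frame from the index avoids a separate `concat` step and keeps each `Id` aligned with its prediction by construction.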

🧪 Test your code

In [171]:
from nbresult import ChallengeResult

tmp = pd.read_csv("data/submission_baseline.csv")

result = ChallengeResult(
    'submission_baseline',
    score_baseline = score_baseline,
    submission_shape = tmp.shape,
    submission_columns = list(tmp.columns),
    submission_dtypes = str(list(tmp.dtypes)),
)

result.write()
print(result.check())


platform linux -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /home/branchedelac/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/branchedelac/code/branchedelac/05-ML/07-Ensemble-Methods/data-houses-kaggle-competition/tests
plugins: anyio-3.6.2, asyncio-0.19.0, typeguard-2.13.3
asyncio: mode=strict
collecting ... collected 4 items

test_submission_baseline.py::TestSubmissionBaseline::test_score_baseline PASSED [ 25%]
test_submission_baseline.py::TestSubmissionBaseline::test_submission_columns PASSED [ 50%]
test_submission_baseline.py::TestSubmissionBaseline::test_submission_dtypes PASSED [ 75%]
test_submission_baseline.py::TestSubmissionBaseline::test_submission_shape PASSED [100%]



💯 You can commit your code:

git add tests/submission_baseline.pickle

git commit -m 'Completed submission_baseline step'

git push origin master



# 🏋️‍♀️ 2. ITERATIONS 

🎉 🎉 Congratulations on having fully pipelined a baseline model! You will now see how much easier it is to iterate and improve performance 🚀

Now, your goal is to improve your prediction and submit it to Kaggle **at least 30 minutes before the Recap ⏳**

We have some suggestions for improvements below: **pick your battles** and **incrementally** improve your pipeline as you see fit!

**Estimators**

- Tree-based ensembles (a must-try today); probably the best suited for problems with many categorical features
- Stacking!
- XGBoost!
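As a starting point for the stacking suggestion, here is a minimal sketch that stacks two tree-based ensembles under a `Ridge` meta-model. The data comes from `make_regression` and the hyperparameters are arbitrary; in the notebook, the stack would replace the final step of the pipeline instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the preprocessed housing data
toy_X, toy_y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

toy_stack = StackingRegressor(
    estimators=[
        ("forest", RandomForestRegressor(n_estimators=30, random_state=0)),
        ("boosting", GradientBoostingRegressor(n_estimators=30, random_state=0)),
    ],
    # The meta-model learns how to combine the base models' out-of-fold predictions
    final_estimator=Ridge(),
    cv=3,
)

toy_scores = cross_val_score(toy_stack, toy_X, toy_y, cv=3, scoring="r2")
print(toy_scores.mean())
```

Stacking multiplies training time by roughly the number of base models times the inner `cv`, so small settings are a sensible way to check the plumbing first.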

**Preprocessing** (once your first ensemble model works)

- **Ordinal Encoding** of categorical features with a hidden notion of order in their values (e.g. "bad", "average", "good")
- **Statistical Feature Selection** to remove useless features (avoids overfitting and reduces training time)
- Predict `log(SalePrice)` instead?
- 🤷
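One way to try the `log(SalePrice)` idea without touching `y` by hand is scikit-learn's `TransformedTargetRegressor`, which transforms the target before fitting and inverts the transformation at prediction time. The sketch below assumes `log1p`/`expm1` as the transform pair (matching the `log(1 + y)` in the RMSLE) and uses a synthetic target and a `Ridge` regressor purely for illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 1, size=(80, 2))

# Hypothetical positive, right-skewed target (roughly price-like magnitudes)
toy_y = np.exp(11 + 1.5 * toy_X[:, 0] + rng.normal(0, 0.1, size=80))

log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,          # applied to y before fitting
    inverse_func=np.expm1,  # applied to predictions, back to the original scale
)
log_model.fit(toy_X, toy_y)
toy_pred = log_model.predict(toy_X)
print(toy_pred[:3])
```

Because predictions come back on the original scale, the same scorer and submission code can be reused unchanged.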

## 2.1 Preprocessing Iteration ♲ 
**⚠️ Come back here only after you have iterated on your estimators in section 2.2 ⚠️**

⏩ Collapse me if I'm not in use!

### a) Ordinal Encoding (~1h)

❓ Look at the following feature. Couldn't it be encoded numerically in a wise manner?
```
ExterQual: Evaluates the quality of the material on the exterior 
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
```

💡 Luckily, the `OrdinalEncoder` and its `categories` argument allow us to do just that! Check it out below and make sure you understand how it works 👇

In [172]:
# Define specific order for features
# Note: if you change this order, it will change the output for .transform()
feature_A_sorted_values = ['bad', 'average', 'good']
feature_B_sorted_values = ['dirty', 'clean', 'new']

encoder = OrdinalEncoder(
    categories=[
        feature_A_sorted_values,
        feature_B_sorted_values
    ],
    handle_unknown="use_encoded_value",
    unknown_value=-1
)

# Just some random training data
XX = [
    ['good', 'dirty'],
    ['bad', 'new'],
    ['average', 'clean'],
]

encoder.fit(XX)

encoder.transform([
        ['bad', "dirty"],
        ["average", "clean"],
        ['good', 'new'],
        ['bad', 'oops never seen this label before']
])

array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 0., -1.]])

❓ **Your turn**: split your categorical preprocessor into

- `preproc_ordinal` to ordinally encode **some features** (of your choice)
- `preproc_nominal` to one-hot encode the other ones


<details>
    <summary>Hints</summary>

- You won't be able to avoid hard-coding names and ordered values of features! Be tidy!
- It's a good practice to sort your features alphabetically to avoid bad surprises
</details>

### Numerical pre-processing

In [173]:
# Numerical preprocessing
num_preprocessor = Pipeline([
    ('imputer', KNNImputer()),
    ('scaling', MinMaxScaler())
]).set_output(transform="pandas")

num_preprocessor

In [174]:
# Test pipeline
num_preprocessor.fit(X_num)
test_np = num_preprocessor.transform(X_num)
test_np.head(1)

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
1,0.235294,0.150685,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,0.0,...,0.38646,0.0,0.111517,0.0,0.0,0.0,0.0,0.0,0.090909,0.5


### Categorical pre-processing

#### Identify ordinal and nominal categories

In [175]:
feat_ordinal_dict = {
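    # 'NoData' is a placeholder for the lowest rank (code 0).
    # Note: NaN values are not mapped to it automatically; in the pipeline below they are
    # encoded as NaN and then replaced by the most frequent code.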
    "BsmtCond": ['NoData', 'Po', 'Fa', 'TA', 'Gd'],
    "BsmtExposure": ['NoData', 'No', 'Mn', 'Av', 'Gd'],
    "BsmtFinType1": ['NoData', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    "BsmtFinType2": ['NoData', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    "BsmtQual": ['NoData', 'Fa', 'TA', 'Gd', 'Ex'],
    "Electrical": ['NoData', 'Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],
    "ExterCond": ['NoData', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    "ExterQual": ['NoData', 'Fa', 'TA', 'Gd', 'Ex'],
    "Fence": ['NoData', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv'],
    "FireplaceQu": ['NoData', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    "Functional": ['NoData', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
    "GarageCond": ['NoData', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    "GarageFinish": ['NoData', 'Unf', 'RFn', 'Fin'],
    "GarageQual": ['NoData', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    "HeatingQC": ['NoData', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    "KitchenQual": ['NoData', 'Fa', 'TA', 'Gd', 'Ex'],
    "LandContour": ['NoData', 'Low', 'Bnk', 'HLS', 'Lvl'],
    "LandSlope": ['NoData', 'Sev', 'Mod', 'Gtl'],
    "LotShape": ['NoData', 'IR3', 'IR2', 'IR1', 'Reg'],
    "PavedDrive": ['NoData', 'N', 'P', 'Y'],
    "PoolQC": ['NoData', 'Fa', 'Gd', 'Ex']
}

In [176]:
feat_ordinal_list = list(feat_ordinal_dict.values())
print(len(feat_ordinal_list))

X_ordinal = list(feat_ordinal_dict.keys())
print(len(X_ordinal))

21
21


In [177]:
X_cat_columns = list(X_cat.columns)
print(len(X_cat_columns))
X_nominal = [n for n in X_cat_columns if n not in X_ordinal]
print(len(X_nominal))

43
22


#### Categorical preprocessing pipelines

In [178]:
# Categorical preprocessing - Ordinal
cat_ordinal_preprocessor = Pipeline([
    ('ordinal_encoding', OrdinalEncoder(categories=feat_ordinal_list, handle_unknown="use_encoded_value", unknown_value=np.nan)),
    ('simple_imputer', SimpleImputer(strategy='most_frequent'))
]).set_output(transform="pandas")

cat_ordinal_preprocessor
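In the pipeline above, missing values never reach the `'NoData'` rank: they are encoded as `NaN` and then imputed with the most frequent code. If the intent is for a missing value to mean "no such feature" (code 0), one possible alternative, shown here only as a sketch on a toy column, is to fill in the `'NoData'` label before encoding. The step names and the `unknown_value=-1` choice are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy column with one missing value (hypothetical rows)
toy_pool = pd.DataFrame({"PoolQC": pd.Series(["Ex", np.nan, "Fa", "Gd"], dtype=object)})

toy_ordinal = Pipeline([
    # Replace NaN with the explicit 'NoData' label first...
    ("fill_missing", SimpleImputer(strategy="constant", fill_value="NoData")),
    # ...so that the encoder can map it to the lowest rank
    ("ordinal_encoding", OrdinalEncoder(
        categories=[["NoData", "Fa", "Gd", "Ex"]],
        handle_unknown="use_encoded_value",
        unknown_value=-1,  # labels never listed in `categories` get -1
    )),
])

toy_codes = toy_ordinal.fit_transform(toy_pool)
print(toy_codes.ravel())  # [3. 0. 1. 2.]
```

Whether "missing" should rank lowest or be treated as typical depends on the feature, so this choice is worth checking against the data description.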

In [179]:
# Test pipeline
cat_ordinal_preprocessor.fit(X_cat[X_ordinal])
test_cop = cat_ordinal_preprocessor.transform(X_cat[X_ordinal])
test_cop.head(1)

Id,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,BsmtQual,Electrical,ExterCond,ExterQual,Fence,FireplaceQu,...,GarageCond,GarageFinish,GarageQual,HeatingQC,KitchenQual,LandContour,LandSlope,LotShape,PavedDrive,PoolQC
1,3.0,1.0,6.0,1.0,3.0,5.0,3.0,3.0,3.0,4.0,...,3.0,2.0,3.0,5.0,3.0,4.0,3.0,4.0,3.0,2.0


In [180]:
# Categorical preprocessing - Nominal
cat_nominal_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe_encoding', OneHotEncoder(sparse_output=False, handle_unknown="ignore"))
]).set_output(transform="pandas")

cat_nominal_preprocessor

In [181]:
# Test pipeline
cat_nominal_preprocessor.fit(X_cat[X_nominal])
test_cop = cat_nominal_preprocessor.transform(X_cat[X_nominal])
test_cop.head(1)

Id,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Grvl,Street_Pave,Alley_Grvl,Alley_Pave,Utilities_AllPub,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [189]:
# Final pipe
column_transformations = ColumnTransformer([
    ('num_preprocessor', num_preprocessor, X.select_dtypes("number").columns),
    ('cat_nominal_preprocessor', cat_nominal_preprocessor, X_nominal),
    ('cat_ordinal_preprocessor', cat_ordinal_preprocessor, X_ordinal)
]).set_output(transform="pandas")

column_transformations

In [190]:
# Test pipeline
column_transformations.fit(X)
test_pre = column_transformations.transform(X)
test_pre.head(1)

Id,num_preprocessor__MSSubClass,num_preprocessor__LotFrontage,num_preprocessor__LotArea,num_preprocessor__OverallQual,num_preprocessor__OverallCond,num_preprocessor__YearBuilt,num_preprocessor__YearRemodAdd,num_preprocessor__MasVnrArea,num_preprocessor__BsmtFinSF1,num_preprocessor__BsmtFinSF2,...,cat_ordinal_preprocessor__GarageCond,cat_ordinal_preprocessor__GarageFinish,cat_ordinal_preprocessor__GarageQual,cat_ordinal_preprocessor__HeatingQC,cat_ordinal_preprocessor__KitchenQual,cat_ordinal_preprocessor__LandContour,cat_ordinal_preprocessor__LandSlope,cat_ordinal_preprocessor__LotShape,cat_ordinal_preprocessor__PavedDrive,cat_ordinal_preprocessor__PoolQC
1,0.235294,0.150685,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,0.0,...,3.0,2.0,3.0,5.0,3.0,4.0,3.0,4.0,3.0,2.0


### Automated feature selection

In [334]:
select_percentile = SelectPercentile(mutual_info_regression, percentile=80)

select_percentile

In [335]:
preprocessing = Pipeline([
    ("column_transformations", column_transformations),
    ("feature_selection", select_percentile)
]).set_output(transform="pandas")

preprocessing

In [294]:
# Test pipeline
preprocessing.fit(X, y)
test_pre = preprocessing.transform(X)
test_pre.head(1)

Id,num_preprocessor__MSSubClass,num_preprocessor__LotFrontage,num_preprocessor__LotArea,num_preprocessor__OverallQual,num_preprocessor__OverallCond,num_preprocessor__YearBuilt,num_preprocessor__YearRemodAdd,num_preprocessor__MasVnrArea,num_preprocessor__BsmtFinSF1,num_preprocessor__BsmtUnfSF,...,cat_ordinal_preprocessor__Fence,cat_ordinal_preprocessor__FireplaceQu,cat_ordinal_preprocessor__Functional,cat_ordinal_preprocessor__GarageCond,cat_ordinal_preprocessor__GarageFinish,cat_ordinal_preprocessor__HeatingQC,cat_ordinal_preprocessor__KitchenQual,cat_ordinal_preprocessor__LandContour,cat_ordinal_preprocessor__LotShape,cat_ordinal_preprocessor__PavedDrive
1,0.235294,0.150685,0.03342,0.666667,0.5,0.949275,0.883333,0.1225,0.125089,0.064212,...,3.0,4.0,7.0,3.0,2.0,5.0,3.0,4.0,4.0,3.0


In [204]:
test_pre.shape

(1460, 108)

In [205]:
test_pre.columns

Index(['num_preprocessor__MSSubClass', 'num_preprocessor__LotFrontage',
       'num_preprocessor__LotArea', 'num_preprocessor__OverallQual',
       'num_preprocessor__OverallCond', 'num_preprocessor__YearBuilt',
       'num_preprocessor__YearRemodAdd', 'num_preprocessor__MasVnrArea',
       'num_preprocessor__BsmtFinSF1', 'num_preprocessor__BsmtUnfSF',
       ...
       'cat_ordinal_preprocessor__Fence',
       'cat_ordinal_preprocessor__FireplaceQu',
       'cat_ordinal_preprocessor__GarageCond',
       'cat_ordinal_preprocessor__GarageFinish',
       'cat_ordinal_preprocessor__HeatingQC',
       'cat_ordinal_preprocessor__KitchenQual',
       'cat_ordinal_preprocessor__LandContour',
       'cat_ordinal_preprocessor__LandSlope',
       'cat_ordinal_preprocessor__LotShape',
       'cat_ordinal_preprocessor__PavedDrive'],
      dtype='object', length=108)

## Target engineering

In [220]:
y.head(1)

Id
1    208500
Name: SalePrice, dtype: int64

In [221]:
y_log = np.log(y)
y_log.head(1)

Id
1    12.247694
Name: SalePrice, dtype: float64

## Test out models

In [None]:
# Candidate models to try (a comma was missing, which silently merged two names into one string)
["AdaBoostRegressor", "VotingRegressor", "GradientBoostingRegressor", "RandomForestRegressor"]

In [None]:
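# ElasticNet is not part of the imports cell, so import it here
from sklearn.linear_model import ElasticNet

# Instantiate model
model = ElasticNet()

# Hyperparameter search space: l1_ratio is sampled uniformly in [0, 1], alpha from a fixed list
grid = {'l1_ratio': stats.uniform(0, 1), 'alpha': [0.001, 0.01, 0.1, 1]}

# Instantiate Randomized Search
search = RandomizedSearchCV(
    model,
    grid,
    scoring='r2',
    n_iter=100,  # number of draws
    cv=5, n_jobs=-1
)

# Fit data to Randomized Search
# X_train and y_train are assumed to come from an earlier split of preprocessed (numeric) data
search.fit(X_train, y_train)
search.best_estimator_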

In [231]:
models = {"ada_boost": AdaBoostRegressor(), "KNN": KNeighborsRegressor(), "gradient_boosting": GradientBoostingRegressor(),"random_forest": RandomForestRegressor()}

rmsle = "neg_root_mean_squared_log_error"
rmse = "neg_root_mean_squared_error"

for name, model in models.items():
    print(f"Cross-validating {name}...")
    pipe_model_testing = Pipeline([
        ("preprocessing", preprocessing),
        (name, model)
    ])
    # cross_validate clones and refits the pipeline on every fold, so no prior fit is needed
    cv_results = cross_validate(estimator=pipe_model_testing, X=X, y=y, scoring=rmsle)
    print("Mean fit time:", cv_results["fit_time"].mean())
    print("Mean score:", cv_results["test_score"].mean(), "\n")

Cross-validating ada_boost...
Mean fit time: 3.4899540901184083
Mean score: -0.1988364377112207 

Cross-validating KNN...
Mean fit time: 3.0062727451324465
Mean score: -0.21362174528308034 

Cross-validating gradient_boosting...
Mean fit time: 4.090334558486939
Mean score: -0.13090811432242003 

Cross-validating random_forest...
Mean fit time: 6.232566499710083
Mean score: -0.14442992807406668 



### With log(y)

In [230]:
models = {"ada_boost": AdaBoostRegressor(), "KNN": KNeighborsRegressor(), "gradient_boosting": GradientBoostingRegressor(),"random_forest": RandomForestRegressor()}

rmsle = "neg_root_mean_squared_log_error"
rmse = "neg_root_mean_squared_error"

for name, model in models.items():
    print(f"Cross-validating {name}...")
    pipe_model_testing = Pipeline([
        ("preprocessing", preprocessing),
        (name, model)
    ])
    # Caution: y_log is already log-transformed, so the RMSLE scorer takes the log of a log here.
    # These scores are NOT comparable with the RMSLE on raw y; the RMSE on y_log (next cell) is.
    cv_results = cross_validate(estimator=pipe_model_testing, X=X, y=y_log, scoring=rmsle)
    print("Mean fit time:", cv_results["fit_time"].mean())
    print("Mean score:", cv_results["test_score"].mean(), "\n")

Cross-validating ada_boost...
Mean fit time: 3.685636806488037
Mean score: -0.013234892133466195 

Cross-validating KNN...
Mean fit time: 2.8564298152923584
Mean score: -0.016685559947393357 

Cross-validating gradient_boosting...
Mean fit time: 3.9596177101135255
Mean score: -0.009825013599438524 

Cross-validating random_forest...
Mean fit time: 5.9207014560699465
Mean score: -0.010972436891987385 



In [232]:
models = {"ada_boost": AdaBoostRegressor(), "KNN": KNeighborsRegressor(), "gradient_boosting": GradientBoostingRegressor(),"random_forest": RandomForestRegressor()}

rmsle = "neg_root_mean_squared_log_error"
rmse = "neg_root_mean_squared_error"

for name, model in models.items():
    print(f"Cross-validating {name}...")
    pipe_model_testing = Pipeline([
        ("preprocessing", preprocessing),
        (name, model)
    ])
    # RMSE on log(y) matches the RMSLE on y (up to log vs. log1p), so compare these with the raw-y scores
    cv_results = cross_validate(estimator=pipe_model_testing, X=X, y=y_log, scoring=rmse)
    print("Mean fit time:", cv_results["fit_time"].mean())
    print("Mean score:", cv_results["test_score"].mean(), "\n")

Cross-validating ada_boost...
Mean fit time: 3.6507599353790283
Mean score: -0.17050709612173603 

Cross-validating KNN...
Mean fit time: 2.9185513496398925
Mean score: -0.21918591176617125 

Cross-validating gradient_boosting...
Mean fit time: 4.0324296951293945
Mean score: -0.1267668477423042 

Cross-validating random_forest...
Mean fit time: 6.615621137619018
Mean score: -0.1410292427741769 



### Predict

In [271]:
test_data = pd.read_csv("data/test.csv")

In [336]:
# Predict with the best model! :) 

gradient_boosting = GradientBoostingRegressor()

pipe_best_model = Pipeline([
    ("preprocessing", preprocessing),
    ("gradient_boosting",  gradient_boosting)
    ])


In [337]:
pipe_best_model.fit(X, y_log)

In [338]:
cv_results = cross_validate(estimator=pipe_best_model, X=X, y=y_log, scoring=rmse)
print("Mean fit time:", cv_results["fit_time"].mean())
print("Mean score:", cv_results["test_score"].mean(), "\n")

Mean fit time: 3.1422780036926268
Mean score: -0.12631504558536694 



In [339]:
predictions = pipe_best_model.predict(test_data)
predictions[:5]

array([11.69915229, 11.96946242, 12.11868251, 12.13077429, 12.20934563])

In [313]:
exp_predictions = np.exp(predictions)

In [314]:
exp_predictions_df = pd.DataFrame(exp_predictions).rename(columns = {0: "SalePrice"})
exp_predictions_df.head(1)

Unnamed: 0,SalePrice
0,120786.805522


In [315]:
# test.csv was read without index_col, so "Id" is a regular column
test_id = test_data["Id"]
test_id[:5]

0    1461
1    1462
2    1463
3    1464
4    1465
Name: Id, dtype: int64

In [316]:
best_prediction_w_id = pd.concat([test_id, exp_predictions_df], axis=1)[["Id", "SalePrice"]]
best_prediction_w_id

Unnamed: 0,Id,SalePrice
0,1461,120786.805522
1,1462,158483.449043
2,1463,182336.766296
3,1464,185167.805721
4,1465,199624.234377
...,...,...
1454,2915,79633.687177
1455,2916,82723.061616
1456,2917,161863.967436
1457,2918,118613.305664


In [317]:
best_prediction_w_id.to_csv("data/submission_improved_more_features.csv", index=False)

# More

### b) Statistical Feature Selection (~30min)

Our goal is to remove the least interesting features to limit overfitting and shorten training time.  

🔥 We will make use of Sklearn's [feature selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection) transformers directly in your pipeline!

❗️ We recommend you try **only Option 1 today**, to start with. Options 2 and 3 will be corrected in the Recap!

#### Option 1 (Recommended) - <font color=green>Univariate</font> Feature Selection
*based on their mutual information with target `y`*

- Feel free to add a `SelectPercentile` filter at the end of your `preproc` pipeline.
- This will filter out features that, taken individually, least explain your target!
- The statistical test we recommend passing to SelectPercentile is the `mutual_info_regression`

<details>
    <summary markdown='span'>🤔 What is mutual information? Click here!</summary>

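- [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information) measures the **statistical** dependence between two random variables: it is the distance between their joint distribution and the product of their marginal distributions
- Correlation only measures the **linear** relationship between two random variables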
- Mutual Information is more general and measures the reduction of uncertainty in Y after observing X.
- On the other hand, if you already know you are working with variables that are smooth (like continuous numerical variables), sometimes correlation may tell you more about them, for instance if their relationship is monotonic.

See [this animation](https://twitter.com/ari_seff/status/1409296508634152964)
</details>
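
As a rough illustration of the difference between the two measures, here is a minimal sketch on synthetic data (the variable names and the quadratic relationship are assumptions made up for this example, not part of the challenge). The target depends strongly on `x`, but not linearly, so Pearson's correlation stays near zero while mutual information does not.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: y depends on x through a symmetric, non-monotonic curve
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y_quadratic = x ** 2 + rng.normal(0, 0.01, size=2000)

# Pearson only captures the linear part of the relationship
pearson = np.corrcoef(x, y_quadratic)[0, 1]

# Mutual information (k-nearest-neighbours estimate) captures any dependence
mi = mutual_info_regression(x.reshape(-1, 1), y_quadratic, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

This is why `mutual_info_regression` is a reasonable default score for `SelectPercentile`: it does not assume the feature relates linearly to the target.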

#### Option 2 - <font color=green>Multivariate</font> Feature Selection

*based on their combined relationship with target `y`*

🤔 We want to remove features that do not help predict our target even when combined with all the others.

1️⃣ To do so, remember that we can use the [`permutation_importance`](https://scikit-learn.org/stable/modules/permutation_importance.html) metric in combination with an estimator! It fits the pipe once, then shuffles each feature in turn and measures how much the performance score *decreases*. The features whose shuffling hurts the score the most are our most important features, which we don't want to remove.

`scikit-learn` also offers a related selector that fits directly into your `preproc` pipeline: the [`SequentialFeatureSelector`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html) transformer. It does not use permutation importance itself; instead, it greedily adds features one at a time (or removes them, with `direction='backward'`) according to the cross-validated score of an estimator.

When you have many features, however, this process can take extremely long to train.
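
To make the permutation idea concrete, here is a minimal sketch on synthetic data. The three columns (strong, weak, pure noise) and the choice of a small random forest are assumptions for illustration; in the challenge you would pass your own fitted pipeline instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: column 0 is strongly informative, column 1 weakly, column 2 is noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(0, 0.1, size=500)

# The model is fitted once; permutation_importance only shuffles columns and re-scores
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)

# Feature indices sorted from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
print(result.importances_mean.round(3))
```

For an honest estimate, the importances are usually computed on held-out data rather than on the training set used here.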

2️⃣ Alternatively, a faster way would be to make use of models that already output some measure of feature importance when being fitted. For instance, trees with impurity-based `feature_importances_`, or Lasso regressions whose L1 penalty shrinks some `coef_` to exactly zero. `scikit-learn` already has the [`SelectFromModel`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html) transformer to do just that.
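
A minimal sketch of `SelectFromModel` with a Lasso, on synthetic data where only columns 0 and 3 drive the target (the data and `alpha=0.1` are assumptions chosen for the example):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 3] + rng.normal(0, 0.1, size=300)

# The L1 penalty sets useless coefficients to zero; SelectFromModel drops those columns
selector = SelectFromModel(Lasso(alpha=0.1))
X_selected = selector.fit_transform(X_demo, y_demo)

print(selector.get_support())  # boolean mask of the kept columns
print(X_selected.shape)
```

Because it is a regular transformer, the selector can sit as the last step of the preprocessing pipeline, in the same position as `SelectPercentile`.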

In [None]:
# YOUR CODE HERE

#### Option 3 - <font color=green>Unsupervised</font> Selection?
*filter based only on the properties of `X`*

❓ A quick win is to remove features with the lowest variance. Think about it: a feature that only has one value is useless (and has a variance of 0).

Feel free to add a [`VarianceThreshold`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) to the end of your pipeline!

In [None]:
# YOUR CODE HERE

❓ Additionally, we can check for correlation between our **numerical features** only

- Use [Pearson's correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) combined with a heatmap to visually check whether any **numerical** features almost entirely correlate with others
- Use the variance inflation factor (`variance_inflation_factor` from `statsmodels`) to check which features have the highest multicollinearity
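
The two checks above can be sketched as follows on synthetic data. `TotalArea` is a hypothetical column built as a near copy of `GrLivArea`, and the numbers are random, not taken from the dataset. To stay dependency-free, this sketch computes the VIF from the diagonal of the inverse correlation matrix, which matches the textbook definition $1 / (1 - R_j^2)$; the `statsmodels` helper should give similar values when a constant column is included.

```python
import numpy as np
import pandas as pd

# Synthetic numerical features; TotalArea is a hypothetical near-duplicate of GrLivArea
rng = np.random.default_rng(0)
area = rng.normal(1500, 300, size=500)
X_num = pd.DataFrame({
    "GrLivArea": area,
    "TotalArea": area + rng.normal(0, 20, size=500),
    "YearBuilt": rng.normal(1970, 25, size=500),
})

# Pearson correlation matrix (this is what a heatmap would display)
corr = X_num.corr(method="pearson")

# VIF of feature j is the j-th diagonal entry of the inverse correlation matrix
vif = pd.Series(np.diag(np.linalg.inv(corr.values)), index=corr.columns)

print(corr.round(2))
print(vif.round(1))
```

Passing `corr` to `sns.heatmap(corr, annot=True)` gives the visual check; a VIF far above 10 is a common rule-of-thumb sign that a feature is almost redundant.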

In [None]:
# YOUR CODE HERE

❓ For **ordinal features**, we can use [Spearman's rank correlation](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) instead to check whether some **ordinally encoded** features are almost entirely "ordered" similarly to others. Feel free to plot a heatmap again.
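
Here is a minimal sketch of why Spearman suits ordinal encodings. The encoded values below are invented for the example: both columns rank the houses in the same order, but the spacing between levels differs.

```python
import pandas as pd

# Made-up ordinal encodings: same ordering of houses, different spacing between levels
X_ord = pd.DataFrame({
    "ExterQual": [1, 2, 3, 4, 5, 3, 2, 4],
    "KitchenQual": [1, 2, 4, 8, 16, 4, 2, 8],
})

# Spearman compares ranks, so it only cares about the ordering
spearman = X_ord.corr(method="spearman").loc["ExterQual", "KitchenQual"]

# Pearson compares raw values, so uneven spacing lowers it
pearson = X_ord.corr(method="pearson").loc["ExterQual", "KitchenQual"]

print(f"Spearman: {spearman:.3f}")
print(f"Pearson:  {pearson:.3f}")
```

Since the integer codes of an `OrdinalEncoder` are arbitrary apart from their order, the rank-based measure is the safer one to threshold on.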

In [None]:
# YOUR CODE HERE

❓ Now, feel free to create a "filter" in your pipeline that removes any feature you want beyond a given (Spearman + Pearson) correlation threshold; you'll need a custom transformer class.
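
One possible shape for such a transformer is sketched below. The class name `CorrelationFilter`, its parameters and the rule "drop the later column of each highly correlated pair" are assumptions made for this example, not the official solution.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical filter: drop one column of each pair correlated above `threshold`."""

    def __init__(self, threshold=0.9, method="pearson"):
        self.threshold = threshold
        self.method = method  # "pearson" or "spearman"

    def fit(self, X, y=None):
        corr = pd.DataFrame(X).corr(method=self.method).abs()
        # Keep the strict upper triangle so each pair is inspected only once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        # A column is dropped if it correlates too strongly with any earlier column
        self.columns_to_drop_ = [
            col for col in upper.columns if (upper[col] > self.threshold).any()
        ]
        return self

    def transform(self, X):
        return pd.DataFrame(X).drop(columns=self.columns_to_drop_)


# Toy data: "b" is almost a multiple of "a", "c" is unrelated
X_demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [5, 1, 4, 2, 6, 3],
})
corr_filter = CorrelationFilter(threshold=0.9)
X_filtered = corr_filter.fit_transform(X_demo)
print(corr_filter.columns_to_drop_, list(X_filtered.columns))
```

The columns to drop are learned in `fit` only, so the same columns are removed from the test set even if its correlations differ.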

In [None]:
# YOUR CODE HERE

### c) Treat Cyclical Features

❓ We have some time-based features. Why not **transform them** into cyclical features?
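
A minimal sketch of the sine/cosine encoding for `MoSold`, using the same formula as the `X_test` cell at the end of this notebook (the four-row DataFrame is made up for the example). Each month becomes a point on the unit circle, so December and January end up neighbours.

```python
import numpy as np
import pandas as pd

months_in_a_year = 12
demo = pd.DataFrame({"MoSold": [1, 7, 12]})

# Map each month to an angle, then to a point on the unit circle
angle = 2 * np.pi * (demo["MoSold"] - 1) / months_in_a_year
demo["sin_MoSold"] = np.sin(angle)
demo["cos_MoSold"] = np.cos(angle)

# Compare distances between months in the encoded space
points = demo.set_index("MoSold")[["sin_MoSold", "cos_MoSold"]]
dist_jan_dec = np.linalg.norm(points.loc[1] - points.loc[12])
dist_jan_jul = np.linalg.norm(points.loc[1] - points.loc[7])

print(f"January to December: {dist_jan_dec:.3f}")
print(f"January to July:     {dist_jan_jul:.3f}")
```

With the raw integers, January (1) and December (12) would look as far apart as possible, which is the problem the encoding removes.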

In [None]:
# YOUR CODE HERE

### d) Target Engineering (~15min)

❓ We are asked to minimize the RMS**L**E. Why don't we transform our target to directly predict its `log`?
- Check out the histogram of the target `y`
- Normally distributed variables should be easier to predict with linear or parametric models
- Create `y_log` and your new performance metrics
- Don't forget to take the exponent of your predictions at the end!
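
The steps above can be sketched with `TransformedTargetRegressor`, which applies the log before fitting and the exponent at prediction time. The synthetic price-like target and the linear model are assumptions for illustration. The sketch also checks that the RMSE computed on logs matches scikit-learn's RMSLE, which uses `log(1 + y)` and is therefore almost identical for prices of this magnitude.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

# Synthetic, right-skewed "price-like" target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = np.exp(12 + 0.3 * X_demo[:, 0] + rng.normal(0, 0.05, size=200))

# The regressor is trained on log(y); predictions are mapped back with exp automatically
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
model.fit(X_demo, y_demo)
y_pred = model.predict(X_demo)  # already on the original price scale

# RMSE on the log scale versus scikit-learn's RMSLE on the original scale
rmse_on_logs = np.sqrt(np.mean((np.log(y_demo) - np.log(y_pred)) ** 2))
rmsle = np.sqrt(mean_squared_log_error(y_demo, y_pred))

print(f"RMSE on logs: {rmse_on_logs:.4f}")
print(f"RMSLE:        {rmsle:.4f}")
```

If you transform the target by hand instead, score with the RMSE on `y_log`; applying an RMSLE scorer to `y_log` would take the log twice.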

### 2.2 Model Iteration ♻

#### a) Final Version of the Preproc Pipeline
❓ We advise you to start with a fresh definition below so you can quickly update it as needed, and then try many model types to find the best one possible (you can try GridSearch or go model by model).

In [None]:
# YOUR CODE HERE

# 🏅FINAL SUBMISSION (submit at least 30 min before Recap)

Discover your real test score by submitting to Kaggle! 

👉 Write down your test score on the [result spreadsheet here](https://docs.google.com/spreadsheets/d/1ZEBKwa_k1Ytb0WCOh-Nopq3eaezwBNu1SAqKXEXRguc/edit#gid=0) (pick the correct batch!)

In [None]:
# YOUR CODE HERE

In [None]:
# Adding columns to X_test according to what we did to X
X_test['sin_MoSold'] = np.sin(2 * np.pi * (X_test.MoSold - 1) / months_in_a_year)
X_test['cos_MoSold'] = np.cos(2 * np.pi * (X_test.MoSold - 1) / months_in_a_year)

X_test.drop(columns=['MoSold'], inplace=True)