# House Prices - Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting

### Start here if...

You have some experience with R or Python and machine learning basics. This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition. 

### Competition Description

![](https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png)

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### Practice Skills

*   Creative feature engineering 
*   Advanced regression techniques like random forest and gradient boosting

Acknowledgments
---------------

The [Ames Housing dataset](http://www.amstat.org/publications/jse/v19n3/decock.pdf) was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

Photo by [Tom Thain](https://unsplash.com/@tthfilms) on Unsplash.

Dataset Description
-------------------

File descriptions
-----------------

*   **train.csv** - the training set
*   **test.csv** - the test set
*   **data\_description.txt** - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
*   **sample\_submission.csv** - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

Data fields
-----------

Here's a brief version of what you'll find in the data description file.

*   **SalePrice** - the property's sale price in dollars. This is the target variable that you're trying to predict.
*   **MSSubClass**: The building class
*   **MSZoning**: The general zoning classification
*   **LotFrontage**: Linear feet of street connected to property
*   **LotArea**: Lot size in square feet
*   **Street**: Type of road access
*   **Alley**: Type of alley access
*   **LotShape**: General shape of property
*   **LandContour**: Flatness of the property
*   **Utilities**: Type of utilities available
*   **LotConfig**: Lot configuration
*   **LandSlope**: Slope of property
*   **Neighborhood**: Physical locations within Ames city limits
*   **Condition1**: Proximity to main road or railroad
*   **Condition2**: Proximity to main road or railroad (if a second is present)
*   **BldgType**: Type of dwelling
*   **HouseStyle**: Style of dwelling
*   **OverallQual**: Overall material and finish quality
*   **OverallCond**: Overall condition rating
*   **YearBuilt**: Original construction date
*   **YearRemodAdd**: Remodel date
*   **RoofStyle**: Type of roof
*   **RoofMatl**: Roof material
*   **Exterior1st**: Exterior covering on house
*   **Exterior2nd**: Exterior covering on house (if more than one material)
*   **MasVnrType**: Masonry veneer type
*   **MasVnrArea**: Masonry veneer area in square feet
*   **ExterQual**: Exterior material quality
*   **ExterCond**: Present condition of the material on the exterior
*   **Foundation**: Type of foundation
*   **BsmtQual**: Height of the basement
*   **BsmtCond**: General condition of the basement
*   **BsmtExposure**: Walkout or garden level basement walls
*   **BsmtFinType1**: Quality of basement finished area
*   **BsmtFinSF1**: Type 1 finished square feet
*   **BsmtFinType2**: Quality of second finished area (if present)
*   **BsmtFinSF2**: Type 2 finished square feet
*   **BsmtUnfSF**: Unfinished square feet of basement area
*   **TotalBsmtSF**: Total square feet of basement area
*   **Heating**: Type of heating
*   **HeatingQC**: Heating quality and condition
*   **CentralAir**: Central air conditioning
*   **Electrical**: Electrical system
*   **1stFlrSF**: First Floor square feet
*   **2ndFlrSF**: Second floor square feet
*   **LowQualFinSF**: Low quality finished square feet (all floors)
*   **GrLivArea**: Above grade (ground) living area square feet
*   **BsmtFullBath**: Basement full bathrooms
*   **BsmtHalfBath**: Basement half bathrooms
*   **FullBath**: Full bathrooms above grade
*   **HalfBath**: Half baths above grade
*   **Bedroom**: Number of bedrooms above basement level
*   **Kitchen**: Number of kitchens
*   **KitchenQual**: Kitchen quality
*   **TotRmsAbvGrd**: Total rooms above grade (does not include bathrooms)
*   **Functional**: Home functionality rating
*   **Fireplaces**: Number of fireplaces
*   **FireplaceQu**: Fireplace quality
*   **GarageType**: Garage location
*   **GarageYrBlt**: Year garage was built
*   **GarageFinish**: Interior finish of the garage
*   **GarageCars**: Size of garage in car capacity
*   **GarageArea**: Size of garage in square feet
*   **GarageQual**: Garage quality
*   **GarageCond**: Garage condition
*   **PavedDrive**: Paved driveway
*   **WoodDeckSF**: Wood deck area in square feet
*   **OpenPorchSF**: Open porch area in square feet
*   **EnclosedPorch**: Enclosed porch area in square feet
*   **3SsnPorch**: Three season porch area in square feet
*   **ScreenPorch**: Screen porch area in square feet
*   **PoolArea**: Pool area in square feet
*   **PoolQC**: Pool quality
*   **Fence**: Fence quality
*   **MiscFeature**: Miscellaneous feature not covered in other categories
*   **MiscVal**: $Value of miscellaneous feature
*   **MoSold**: Month Sold
*   **YrSold**: Year Sold
*   **SaleType**: Type of sale
*   **SaleCondition**: Condition of sale

Link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Help: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from catboost import (
    CatBoostRegressor,
    EFeaturesSelectionAlgorithm,
    EShapCalcType,
    Pool,
    sum_models,
    to_regressor,
)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from tqdm.notebook import tqdm

In [2]:
%load_ext nb_black

<IPython.core.display.Javascript object>

In [3]:
scaler = StandardScaler()

<IPython.core.display.Javascript object>

In [4]:
sample_submission_df = pd.read_csv(
    "../../data/house-prices-advanced-regression-techniques/sample_submission.csv"
).set_index("Id")
sample_submission_df

Unnamed: 0_level_0,SalePrice
Id,Unnamed: 1_level_1
1461,169277.052498
1462,187758.393989
1463,183583.683570
1464,179317.477511
1465,150730.079977
...,...
2915,167081.220949
2916,164788.778231
2917,219222.423400
2918,184924.279659


<IPython.core.display.Javascript object>

In [5]:
df = pd.read_csv(
    "../../data/house-prices-advanced-regression-techniques/prepared.csv"
).set_index("Id")
df

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,MoSold,YrSold,SaleType,SaleCondition,SalePrice,isTest,BsmtSFExt,2ndFlrExt,PoolExt,TotalLivBsmtSF
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5,3,65.0,8450,1,2,3,3,0,4,...,2,2,8,4,208500.0,0,1,1,0,2566.0
2,0,3,80.0,9600,1,2,3,3,0,2,...,5,1,8,4,181500.0,0,1,0,0,2524.0
3,5,3,68.0,11250,1,2,0,3,0,4,...,9,2,8,4,223500.0,0,1,1,0,2706.0
4,6,3,60.0,9550,1,2,0,3,0,0,...,2,0,8,0,140000.0,0,1,1,0,2473.0
5,5,3,84.0,14260,1,2,0,3,0,2,...,12,2,8,4,250000.0,0,1,1,0,3343.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2915,13,4,21.0,1936,1,2,3,3,0,4,...,6,0,8,4,,1,1,1,0,1638.0
2916,13,4,21.0,1894,1,2,3,3,0,4,...,4,0,8,0,,1,1,1,0,1638.0
2917,0,3,160.0,20000,1,2,3,3,0,4,...,9,0,8,0,,1,1,0,0,2448.0
2918,9,3,62.0,10441,1,2,3,3,0,4,...,7,0,8,4,,1,1,0,0,1882.0


<IPython.core.display.Javascript object>

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 1 to 2919
Data columns (total 85 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   MSSubClass      2919 non-null   int64  
 1   MSZoning        2919 non-null   int64  
 2   LotFrontage     2919 non-null   float64
 3   LotArea         2919 non-null   int64  
 4   Street          2919 non-null   int64  
 5   Alley           2919 non-null   int64  
 6   LotShape        2919 non-null   int64  
 7   LandContour     2919 non-null   int64  
 8   Utilities       2919 non-null   int64  
 9   LotConfig       2919 non-null   int64  
 10  LandSlope       2919 non-null   int64  
 11  Neighborhood    2919 non-null   int64  
 12  Condition1      2919 non-null   int64  
 13  Condition2      2919 non-null   int64  
 14  BldgType        2919 non-null   int64  
 15  HouseStyle      2919 non-null   int64  
 16  OverallQual     2919 non-null   int64  
 17  OverallCond     2919 non-null   i

<IPython.core.display.Javascript object>

# Prepare

In [7]:
num_columns = df.drop(["isTest", "SalePrice"], axis=1).columns
num_columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

<IPython.core.display.Javascript object>

In [8]:
df[num_columns] = scaler.fit_transform(df[num_columns])
df

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,MoSold,YrSold,SaleType,SaleCondition,SalePrice,isTest,BsmtSFExt,2ndFlrExt,PoolExt,TotalLivBsmtSF
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.168840,-0.042157,-0.202068,-0.217879,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-1.552184,0.157646,0.319527,0.204967,208500.0,0,0.166784,1.154701,-0.066884,0.023003
2,-0.982129,-0.042157,0.501870,-0.072044,0.064249,0.257283,0.746669,0.316671,-0.030863,-0.658174,...,-0.446925,-0.602962,0.319527,0.204967,181500.0,0,0.166784,-0.866025,-0.066884,-0.029172
3,0.168840,-0.042157,-0.061280,0.137197,0.064249,0.257283,-1.381776,0.316671,-0.030863,0.588555,...,1.026753,0.157646,0.319527,0.204967,223500.0,0,0.166784,1.154701,-0.066884,0.196920
4,0.399034,-0.042157,-0.436714,-0.078385,0.064249,0.257283,-1.381776,0.316671,-0.030863,-1.904903,...,-1.552184,-1.363569,0.319527,-3.505415,140000.0,0,0.166784,1.154701,-0.066884,-0.092527
5,0.168840,-0.042157,0.689587,0.518903,0.064249,0.257283,-1.381776,0.316671,-0.030863,-0.658174,...,2.132012,0.157646,0.319527,0.204967,250000.0,0,0.166784,1.154701,-0.066884,0.988241
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2915,2.010392,1.477041,-2.266952,-1.043937,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-0.078505,-1.363569,0.319527,0.204967,,1,0.166784,1.154701,-0.066884,-1.129816
2916,2.010392,1.477041,-2.266952,-1.049263,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-0.815344,-1.363569,0.319527,-3.505415,,1,0.166784,1.154701,-0.066884,-1.129816
2917,-0.982129,-0.042157,4.256207,1.246808,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,1.026753,-1.363569,0.319527,-3.505415,,1,0.166784,-0.866025,-0.066884,-0.123584
2918,1.089616,-0.042157,-0.342855,0.034605,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,0.289914,-1.363569,0.319527,0.204967,,1,0.166784,-0.866025,-0.066884,-0.826704


<IPython.core.display.Javascript object>

In [9]:
# RMSLE - https://www.kaggle.com/code/carlolepelaars/understanding-the-metric-rmsle/notebook
y_train = np.log1p(df[df["isTest"] == 0]["SalePrice"])

X_train = df[df["isTest"] == 0][num_columns]
X_test = df[df["isTest"] == 1][num_columns]

X_train.shape, y_train.shape, X_test.shape

((1460, 83), (1460,), (1459, 83))

<IPython.core.display.Javascript object>

In [10]:
X_train, X_true, y_train, y_true = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_train.shape, X_true.shape, y_train.shape, y_true.shape

((1314, 83), (146, 83), (1314,), (146,))

<IPython.core.display.Javascript object>

In [11]:
X_train

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,BsmtSFExt,2ndFlrExt,PoolExt,TotalLivBsmtSF
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
908,-0.061354,-0.042157,0.783446,0.168900,0.064249,0.257283,-1.381776,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,-0.078505,-1.363569,0.319527,0.204967,0.166784,1.154701,-0.066884,0.654073
783,-0.982129,-0.042157,-0.108209,0.775699,0.064249,0.257283,-0.672294,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,-0.078505,0.918253,0.319527,0.204967,0.166784,-0.866025,-0.066884,0.367110
953,1.089616,-0.042157,-0.436714,-0.376395,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,-0.815344,0.918253,0.319527,0.204967,0.166784,-0.866025,-0.066884,-1.256527
621,-0.751935,-0.042157,-1.140652,-0.243495,0.064249,-4.466046,0.746669,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,1.026753,0.157646,0.319527,0.204967,0.166784,-0.866025,-0.066884,-1.018013
670,-0.751935,-0.042157,0.501870,0.181581,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,0.289914,-1.363569,0.319527,0.204967,0.166784,-0.866025,-0.066884,-0.829189
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1096,-0.982129,-0.042157,0.408012,-0.107932,0.064249,0.257283,-1.381776,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,-1.183764,-0.602962,0.319527,0.204967,0.166784,-0.866025,-0.066884,0.100024
1131,-0.061354,-0.042157,-0.202068,-0.299800,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,2.132012,0.918253,0.319527,0.204967,0.166784,1.154701,-0.066884,0.690098
1295,-0.982129,-0.042157,-0.436714,-0.253133,0.064249,0.257283,0.746669,0.316671,-0.030863,0.588555,...,-0.181028,-0.089592,-0.815344,-1.363569,0.319527,0.204967,0.166784,-0.866025,-0.066884,-1.018013
861,-0.061354,-0.042157,-0.671360,-0.320343,0.064249,0.257283,0.746669,0.316671,-0.030863,-1.904903,...,-0.181028,-0.089592,-0.078505,-0.602962,0.319527,0.204967,0.166784,1.154701,-0.066884,-0.260232


<IPython.core.display.Javascript object>

# Train

## Hyperparameter tuning

In [12]:
model = CatBoostRegressor(logging_level="Silent")

# https://catboost.ai/en/docs/concepts/parameter-tuning
# https://docs.aws.amazon.com/sagemaker/latest/dg/catboost-tuning.html
tuned_params = {
    "learning_rate": np.concatenate(
        (
            np.arange(1, 10, 1) / 10,
            np.arange(1, 10, 1) / 100,
            np.arange(1, 10, 1) / 1000,
        )
    ),
    "depth": np.arange(1, 11),
    "l2_leaf_reg": np.concatenate((np.arange(1, 10), np.arange(10, 110, 10))),
    "random_strength": np.arange(1, 11),
    "iterations": np.arange(100, 1100, 100),
}

grid_search_result = model.randomized_search(
    tuned_params, Pool(X_train, y_train), cv=3, verbose=False, plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<IPython.core.display.Javascript object>

In [13]:
best_model_params = grid_search_result["params"]
best_model_params

{'random_strength': 1,
 'depth': 3,
 'l2_leaf_reg': 6,
 'iterations': 800,
 'learning_rate': 0.09}

<IPython.core.display.Javascript object>

## Feature selections

In [14]:
X_sf_train, X_sf_val, y_sf_train, y_sf_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_sf_train.shape, X_sf_val.shape, y_sf_train.shape, y_sf_val.shape

((1182, 83), (132, 83), (1182,), (132,))

<IPython.core.display.Javascript object>

In [15]:
sf_model = CatBoostRegressor(**best_model_params, logging_level="Silent")
sf_model.fit(
    Pool(X_sf_train, y_sf_train),
    eval_set=Pool(X_sf_val, y_sf_val),
    verbose=False,
)

<catboost.core.CatBoostRegressor at 0x7fafb438caf0>

<IPython.core.display.Javascript object>

In [16]:
importance_df = (
    pd.DataFrame(
        {
            "Column": X_train.columns,
            "Score": sf_model.get_feature_importance(),
        }
    )
    .sort_values(by="Score", ascending=False)
    .set_index("Column")
)
importance_df

Unnamed: 0_level_0,Score
Column,Unnamed: 1_level_1
TotalLivBsmtSF,26.527573
OverallQual,21.739055
GrLivArea,5.129957
GarageCars,3.791661
Fireplaces,3.753480
...,...
ExterQual,0.000000
ExterCond,0.000000
BsmtFinType2,0.000000
BsmtHalfBath,0.000000


<IPython.core.display.Javascript object>

In [17]:
X_sf = X_train[importance_df.index]
y_sf = y_train

X_sf.shape, y_sf.shape

((1314, 83), (1314,))

<IPython.core.display.Javascript object>

In [18]:
num_list = np.arange(10, X_sf.shape[1], 3)
num_list

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
       61, 64, 67, 70, 73, 76, 79, 82])

<IPython.core.display.Javascript object>

In [19]:
model.best_score_

{'learn': {'RMSE': 0.06328158506600852}}

<IPython.core.display.Javascript object>

In [20]:
def select_features_loop(X, y, num_features=10):
    X = X.iloc[:, :num_features]

    X_sub_train, X_sub_val, y_sub_train, y_sub_val = train_test_split(
        X, y, test_size=0.1, shuffle=False, random_state=42
    )

    model = CatBoostRegressor(**best_model_params, logging_level="Silent")
    model.fit(
        Pool(X_sub_train, y_sub_train),
        eval_set=Pool(X_sub_val, y_sub_val),
        verbose=False,
    )

    score = mean_squared_error(y_true, model.predict(X_true), squared=False)
    # score = model.best_score_["validation"]["RMSE"]

    return [num_features, score]


loss_list = []
for num_features in tqdm(num_list):
    loss_values = select_features_loop(X_sf, y_sf, num_features)
    loss_list.append(loss_values)

num_features_df = (
    pd.DataFrame(loss_list, columns=["num_features", "score"])
    .set_index("num_features")
    .sort_values(by="score")
)
num_features_df.head(10)

  0%|          | 0/25 [00:00<?, ?it/s]

Unnamed: 0_level_0,score
num_features,Unnamed: 1_level_1
31,0.103163
28,0.10458
37,0.104714
25,0.106384
70,0.108365
22,0.10952
49,0.109703
52,0.109904
64,0.110051
34,0.110483


<IPython.core.display.Javascript object>

In [21]:
# X_train = X_train[X_train.columns[: num_features_df.index[0]]]
# X_train.shape

<IPython.core.display.Javascript object>

## KFold

In [22]:
kf = KFold(n_splits=5)

<IPython.core.display.Javascript object>

In [23]:
ensemble = []

for train_index, val_index in kf.split(X_train):
    X_sub_train, X_sub_valid = X_train.iloc[train_index], X_train.iloc[val_index]
    y_sub_train, y_sub_valid = y_train.iloc[train_index], y_train.iloc[val_index]

    train_pool = Pool(X_sub_train, y_sub_train)
    valid_pool = Pool(X_sub_valid, y_sub_valid)

    model = CatBoostRegressor(**best_model_params, logging_level="Silent")
    model.fit(train_pool, eval_set=valid_pool, verbose=False)

    ensemble.append(model)
    print(model.best_score_)

{'learn': {'RMSE': 0.05748547476006802}, 'validation': {'RMSE': 0.11718502167285287}}
{'learn': {'RMSE': 0.057552164786365774}, 'validation': {'RMSE': 0.1456234144557095}}
{'learn': {'RMSE': 0.05624911044848918}, 'validation': {'RMSE': 0.1443633666154224}}
{'learn': {'RMSE': 0.05785965091942618}, 'validation': {'RMSE': 0.10734540528524772}}
{'learn': {'RMSE': 0.057446421395864936}, 'validation': {'RMSE': 0.1109771435398126}}


<IPython.core.display.Javascript object>

In [24]:
models_avrg = to_regressor(
    sum_models(ensemble, weights=[1.0 / len(ensemble)] * len(ensemble))
)
models_avrg

<catboost.core.CatBoostRegressor at 0x7fafb42673d0>

<IPython.core.display.Javascript object>

# Validation

In [25]:
y_train

Id
908     12.429220
783     12.139404
953     11.804856
621     11.112463
670     11.831386
          ...    
1096    12.080696
1131    11.813037
1295    11.652696
861     12.154521
1127    12.066816
Name: SalePrice, Length: 1314, dtype: float64

<IPython.core.display.Javascript object>

In [26]:
y_pred = models_avrg.predict(X_true)
mean_squared_error(y_pred, y_true, squared=False)

0.10890748208456814

<IPython.core.display.Javascript object>

# Submission

In [27]:
submission = pd.DataFrame(
    {"Id": X_test.index, "SalePrice": np.exp(models_avrg.predict(X_test))}
).set_index("Id")
submission

Unnamed: 0_level_0,SalePrice
Id,Unnamed: 1_level_1
1461,123216.619126
1462,160641.511760
1463,188682.702150
1464,196853.726502
1465,190235.392462
...,...
2915,78912.438462
2916,82007.397294
2917,161159.003586
2918,116505.470645


<IPython.core.display.Javascript object>

In [28]:
submission.to_csv(
    "../../data/house-prices-advanced-regression-techniques/submission.csv"
)

<IPython.core.display.Javascript object>