<div style=" border-bottom: 8px solid #e3f56c; overflow: hidden; border-radius: 10px; height: 95%; width: 100%; display: flex;">
  <div style="height: 100%; width: 100%; background-color: #3800BB; float: left; text-align: center; display: flex; justify-content: left; align-items: center; font-size: 40px; ">
    <b><span style="color: #FFFFFF; padding: 20px 20px;">Blending</span></b>
  </div>
</div>



<div class="alert" style="background-color: #FFFFFF; border-left: 8px solid #B12111; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

<div class="alert alert-danger">

### **Contents** 
</div>

<hr>

<div class="alert">

  <p><font size="3" face="Arial" font-size="large">
  <ul type="square">

  <li> Train CatBoost;  </li>
  <li> Train LightGBM;  </li>
  <li> Train XGBoost;  </li>
  <li> Blending and some of its key concepts;  </li>
  <li> Conclusions and Summary.  </li>
  
  </ul>
  </font></p>

</div>

</div>

<div class="alert alert-warning">

### **Intro** 
</div>

<div class="alert" style="background-color:#E8F8F5; border-left: 8px solid #1ABC9C; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

<div class="alert alert-info">

**Key Concepts**
</div>

* The core idea of blending is to combine the predictions of several different ML models into a single prediction, taking the best from each algorithm.
* Such an ensemble usually generalizes better than any of its members, because the errors of the individual models partially cancel out when averaged.
* Predictions also become more stable, which lowers the risk of a drop on the private leaderboard.
* Blending works especially well when combining models of **different nature**, for example neural networks, KNN, and decision trees: they learn different patterns and complement each other.


</div>
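The effect described above can be illustrated with a minimal sketch. It assumes three hypothetical models whose errors are independent and uses synthetic data rather than the models trained in this notebook; the helper names `mse` and `blend` are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic ground truth and three hypothetical models with independent noise.
y_true = rng.normal(loc=50.0, scale=15.0, size=2000)
preds = [y_true + rng.normal(scale=10.0, size=y_true.size) for _ in range(3)]


def mse(y, p):
    """Mean squared error between targets y and predictions p."""
    return float(np.mean((y - p) ** 2))


def blend(predictions, weights=None):
    """Weighted average of prediction arrays; equal weights by default."""
    stacked = np.vstack(predictions)
    if weights is None:
        weights = np.full(len(predictions), 1.0 / len(predictions))
    weights = np.asarray(weights, dtype=float)
    return weights @ stacked


single_mse = [mse(y_true, p) for p in preds]
blend_mse = mse(y_true, blend(preds))

print("single models:", [round(m, 2) for m in single_mse])
print("blend        :", round(blend_mse, 2))
```

With independent errors the blended MSE should come out well below that of each single model. Real models are correlated, so the gain in practice is usually smaller.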

In [2]:
# Models for blending
import lightgbm as lgbm
import xgboost as xgb
import catboost as cb

In [3]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings("ignore")

from classes import Paths
paths = Paths()

In [9]:
path = paths.quickstart_train
df = pd.read_csv(path)

cat_cols = ["model", "car_type", "fuel_type"]
for col in cat_cols:
    print(np.unique(df[col]))
    print(np.arange(df[col].nunique()))
    df[col] = df[col].replace(np.unique(df[col]), np.arange(df[col].nunique()))
    df[col] = df[col].astype("category")

df.head(15)

['Audi A3' 'Audi A4' 'Audi Q3' 'BMW 320i' 'Fiat 500' 'Hyundai Solaris'
 'Kia Rio' 'Kia Rio X' 'Kia Rio X-line' 'Kia Sportage' 'MINI CooperSE'
 'Mercedes-Benz E200' 'Mercedes-Benz GLC' 'Mini Cooper' 'Nissan Qashqai'
 'Renault Kaptur' 'Renault Sandero' 'Skoda Rapid' 'Smart Coupe'
 'Smart ForFour' 'Smart ForTwo' 'Tesla Model 3' 'VW Polo' 'VW Polo VI'
 'VW Tiguan' 'Volkswagen ID.4 ']
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25]
['business' 'economy' 'premium' 'standart']
[0 1 2 3]
['electro' 'petrol']
[0 1]


Unnamed: 0,car_id,model,car_type,fuel_type,car_rating,year_to_start,riders,year_to_work,target_reg,target_class,mean_rating,distance_sum,rating_min,speed_max,user_ride_quality_median,deviation_normal_count,user_uniq
0,y13744087j,8,1,1,3.78,2015,76163,2021,109.99,another_bug,4.737759,12141310.0,0.1,180.855726,0.023174,174,170
1,O41613818T,23,1,1,3.9,2015,78218,2021,34.48,electro_bug,4.480517,18039090.0,0.0,187.862734,12.306011,174,174
2,d-2109686j,16,3,1,6.3,2012,23340,2017,34.93,gear_stick,4.768391,15883660.0,0.1,102.382857,2.513319,174,173
3,u29695600e,12,0,1,4.04,2011,1263,2020,32.22,engine_fuel,3.88092,16518830.0,0.1,172.793237,-5.029476,174,170
4,N-8915870N,16,3,1,4.7,2012,26428,2017,27.51,engine_fuel,4.181149,13983170.0,0.1,203.462289,-14.260456,174,171
5,b12101843B,17,1,1,2.36,2013,42176,2018,48.99,engine_ignition,4.351782,10855890.0,0.1,180.886289,-18.221832,174,173
6,Q-9368117S,14,3,1,5.32,2012,24611,2014,54.72,engine_overheat,4.392126,8343280.0,0.1,174.984786,12.321364,174,167
7,O-2124190y,21,2,0,3.9,2017,116872,2019,50.4,gear_stick,4.712356,9793288.0,0.1,95.890736,-8.939366,174,139
8,h16895544p,9,3,1,3.5,2014,56384,2017,33.59,gear_stick,4.507759,16444050.0,0.32,101.798615,-1.16469,174,170
9,K77009462l,19,1,1,4.56,2013,41309,2018,39.04,gear_stick,4.376839,6975742.0,0.1,125.254983,3.769684,174,173
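The label encoding used above maps the sorted unique values of each column to `0..n-1`. As a small sketch on a toy column (not the real dataset), the same codes can be obtained either from an explicit mapping or from the pandas `category` dtype:

```python
import pandas as pd

# Toy column with the same fuel types as in the dataset above.
s = pd.Series(["petrol", "electro", "petrol", "electro", "petrol"])

# Variant 1: explicit mapping from sorted unique values to 0..n-1.
uniques = sorted(s.unique())
mapping = {value: code for code, value in enumerate(uniques)}
codes_map = s.map(mapping)

# Variant 2: let pandas build the codes via the category dtype,
# which orders string categories lexically.
codes_cat = s.astype("category").cat.codes

print(mapping)
print(codes_cat.tolist())
```

Both variants should assign `electro` to 0 and `petrol` to 1, matching the printed arrays in the cell output.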


<div class="alert" style="background-color:rgb(255, 255, 255); border-left: 8px solid #D4AC0D; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

### **Split the dataset into training and validation sets**
</div>

In [None]:
cols2drop = ["car_id", "target_reg", "target_class"]

X_train, X_val, y_train, y_val = train_test_split(
    df.drop(cols2drop, axis=1),
    df["target_reg"],
    test_size=0.25,
    stratify=df["target_class"],
    random_state=42,
)
print(X_train.shape, X_val.shape)

(1752, 14) (585, 14)
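The split above stratifies by `target_class` even though the regression target is `target_reg`, so both parts keep the same mix of breakdown classes. A minimal sketch on a synthetic frame (the column values and class labels here are invented for the illustration) shows what `stratify` does:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic frame: a numeric target plus an imbalanced class label.
toy = pd.DataFrame({
    "feature": rng.normal(size=400),
    "target_reg": rng.normal(loc=45.0, scale=18.0, size=400),
    "target_class": np.repeat(["a", "b", "c"], [200, 120, 80]),
})

train, val = train_test_split(
    toy, test_size=0.25, stratify=toy["target_class"], random_state=42
)

# Class shares should be (almost) identical in both parts.
train_share = train["target_class"].value_counts(normalize=True).sort_index()
val_share = val["target_class"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "val": val_share}))
```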


<div class="alert alert-warning">

### **Train three models for blending** 
</div>

<div class="alert" style="background-color:rgb(255, 255, 255); border-left: 8px solid #B12111; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

##### **Train CatBoost**
</div>

In [11]:
params_cat = {
    "n_estimators": 1500,
    "learning_rate": 0.03,
    "depth": 3,
    "use_best_model": True,
    "cat_features": cat_cols,
    "text_features": [],
    # 'train_dir' : '/path/to/catboost/model',
    "border_count": 64,
    "l2_leaf_reg": 1,
    "bagging_temperature": 2,
    "rsm": 0.5,
    "loss_function": "RMSE",  # Regression loss optimized during training
    # 'auto_class_weights' : 'Balanced', # Classification only, not defined for regression
    "random_state": 42,
    "custom_metric": ["MAE", "MAPE"],
}

cat_model = cb.CatBoostRegressor(**params_cat)

In [12]:
cat_model.fit(
    X_train,
    y_train,
    verbose=100,
    eval_set=(X_val, y_val),
    early_stopping_rounds=150,
)

0:	learn: 17.4391776	test: 17.9234161	best: 17.9234161 (0)	total: 55ms	remaining: 1m 22s
100:	learn: 12.0171853	test: 12.3281023	best: 12.3281023 (100)	total: 105ms	remaining: 1.46s
200:	learn: 11.4189213	test: 11.7777692	best: 11.7777692 (200)	total: 141ms	remaining: 914ms
300:	learn: 11.1134309	test: 11.6124163	best: 11.6124163 (300)	total: 175ms	remaining: 698ms
400:	learn: 10.8590320	test: 11.5378271	best: 11.5365214 (398)	total: 209ms	remaining: 573ms
500:	learn: 10.6685339	test: 11.5151383	best: 11.5129698 (494)	total: 244ms	remaining: 486ms
600:	learn: 10.5119473	test: 11.5108646	best: 11.4979901 (561)	total: 279ms	remaining: 418ms
700:	learn: 10.3486559	test: 11.5041313	best: 11.4913259 (636)	total: 314ms	remaining: 358ms
Stopped by overfitting detector  (150 iterations wait)

bestTest = 11.49132595
bestIteration = 636

Shrink model to first 637 iterations.


<catboost.core.CatBoostRegressor at 0x31a87e090>
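The log above ends with "Stopped by overfitting detector (150 iterations wait)". The rule behind it can be sketched in plain Python. This is a simplified illustration of patience-based early stopping, not CatBoost's actual implementation; the function name and the loss values are hypothetical.

```python
def early_stopping(val_scores, patience):
    """Return (best_iteration, stop_iteration) for a sequence of validation losses.

    Training stops once `patience` iterations have passed without improving
    on the best loss seen so far; lower is better.
    """
    best_iter, best_score = 0, float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(val_scores) - 1


# Hypothetical loss curve: improves until iteration 4, then degrades.
losses = [17.9, 12.3, 11.8, 11.6, 11.5, 11.52, 11.51, 11.53, 11.6, 11.7]
best, stopped = early_stopping(losses, patience=3)
print(best, stopped)
```

Under this rule, a best iteration of 636 with `early_stopping_rounds=150` would halt training roughly 150 iterations later. Because `use_best_model` is enabled, the model is then shrunk to the first 637 trees, as the log reports.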

In [17]:
# sklearn metrics expect (y_true, y_pred); MSE is symmetric, so the value is unchanged
print('MSE (catboost) : ', round(mean_squared_error(y_val, cat_model.predict(X_val)), 3))

MSE (catboost) :  132.051


In [None]:
# Compare with a baseline that always predicts the mean of y_val
round(mean_squared_error(y_val, np.full(len(y_val), y_val.mean())), 3)

323.939
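The numbers reported so far can be cross-checked with a few lines of arithmetic. The sketch below only reuses values copied from the outputs above. Since the baseline predicts the mean of `y_val`, its MSE equals the variance of `y_val`, and the ratio of the two errors gives the validation R².

```python
# Values copied from the outputs above.
best_rmse = 11.49132595   # bestTest reported by CatBoost
model_mse = 132.051       # MSE (catboost) on the validation set
baseline_mse = 323.939    # MSE of predicting the validation mean

# MSE is the square of RMSE, so the two CatBoost numbers should agree.
print(round(best_rmse ** 2, 3))

# Share of the baseline error removed by the model (validation R^2).
r2 = 1 - model_mse / baseline_mse
print(round(r2, 3))
```

The squared RMSE matches the reported MSE, and the model removes a bit under 60% of the baseline error.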

In [22]:
submit = pd.DataFrame({"target": cat_model.predict(X_val).reshape(-1)})
submit.to_csv("../tmp_data/catboost_preds.csv", index=False)
submit.head()

Unnamed: 0,target
0,32.987894
1,47.608527
2,35.089056
3,61.962611
4,71.456127
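Saving predictions in a one-column file makes it easy to blend them later. The following sketch assumes that other models write files in the same format. The names `lgbm_preds.csv` and `xgb_preds.csv` are hypothetical, and all values are synthetic and written to a temporary directory.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical prediction files; only catboost_preds.csv exists in the notebook.
names = ["catboost_preds.csv", "lgbm_preds.csv", "xgb_preds.csv"]

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    rng = np.random.default_rng(1)
    base = rng.normal(loc=45.0, scale=15.0, size=5)
    # Write toy predictions in the same one-column format as above.
    for name in names:
        noisy = base + rng.normal(scale=2.0, size=base.size)
        pd.DataFrame({"target": noisy}).to_csv(tmp_dir / name, index=False)

    # Read every file back and average the predictions row by row.
    frames = [pd.read_csv(tmp_dir / name)["target"] for name in names]
    blended = pd.concat(frames, axis=1).mean(axis=1)

print(blended.head())
```

This only works if every file holds predictions for the same rows in the same order, which is why the same validation split should be used for all models.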


<div class="alert alert-success">

### **Conclusions and Summary** 
</div>

<div class="alert" style="background-color:rgb(255, 255, 255); border-left: 8px solid #5ad197; padding: 14px; border-radius: 8px; font-size: 14px; color: #000000;">

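* CatBoost with early stopping reached its best validation RMSE of 11.49 at iteration 636, which corresponds to an MSE of 132.051.
* This is well below the MSE of 323.939 obtained by the mean-value baseline.
* The validation predictions were saved to `../tmp_data/catboost_preds.csv`.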


</div>