
## Loading the data and importing libraries

In [1]:
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.metrics import r2_score

In [2]:
RANDOM_STATE = 42

In [3]:
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target

In [4]:
X.head()

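|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |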


## Comparing models with default hyperparameters

In [5]:
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor()

cross_val_score(gbm, X, y, cv=3, scoring='r2').mean()

np.float64(0.6800381653609042)

In [6]:
!pip install xgboost -q

In [7]:
from xgboost import XGBRegressor

xgb = XGBRegressor()

cross_val_score(xgb, X, y, cv=3, scoring='r2').mean()

np.float64(0.65844446361156)

In [8]:
!pip install catboost -q

In [9]:
from catboost import CatBoostRegressor

cb = CatBoostRegressor(verbose=0)

cross_val_score(cb, X, y, cv=3, scoring='r2').mean()

np.float64(0.7142210654701769)

In [10]:
!pip install lightgbm -q

In [11]:
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor()

cross_val_score(lgbm, X, y, cv=3, scoring='r2').mean()

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001984 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 13760, number of used features: 8
[LightGBM] [Info] Start training from score 2.117384

np.float64(0.7016238052098068)

## Hyperparameter tuning

Let's split the data into a training part and a test part. On the training part we tune each model's hyperparameters with cross-validation, and then we check the quality on the test part.

In [12]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=RANDOM_STATE)

params = {'max_depth' : [2, 5, 8, 11]}

In [13]:
%%time

gs_xgb = GridSearchCV(xgb, params, cv=3, scoring='r2', verbose=2)

gs_xgb.fit(Xtrain, ytrain)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] END ........................................max_depth=2; total time=   0.4s
[CV] END ........................................max_depth=2; total time=   0.3s
[CV] END ........................................max_depth=2; total time=   0.1s
[CV] END ........................................max_depth=5; total time=   3.1s
[CV] END ........................................max_depth=5; total time=   3.2s
[CV] END ........................................max_depth=5; total time=   2.7s
[CV] END ........................................max_depth=8; total time=   6.1s
[CV] END ........................................max_depth=8; total time=   1.2s
[CV] END ........................................max_depth=8; total time=   0.8s
[CV] END .......................................max_depth=11; total time=   2.4s
[CV] END .......................................max_depth=11; total time=   2.4s
[CV] END .......................................m

In [14]:
pred_xgb = gs_xgb.best_estimator_.predict(Xtest)

r2_score(ytest, pred_xgb)

0.8385315938584312

In [15]:
%%time

gs_cb = GridSearchCV(cb, params, cv=3, scoring='r2', verbose=2)

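# Caution: this fits on the full X, y rather than Xtrain, ytrain, so the rows of
# Xtest take part in tuning and the hold-out score below is optimistically biased.
gs_cb.fit(X, y)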

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] END ........................................max_depth=2; total time=   2.7s
[CV] END ........................................max_depth=2; total time=   2.2s
[CV] END ........................................max_depth=2; total time=   1.6s
[CV] END ........................................max_depth=5; total time=   2.9s
[CV] END ........................................max_depth=5; total time=   2.9s
[CV] END ........................................max_depth=5; total time=   4.6s
[CV] END ........................................max_depth=8; total time=   8.9s
[CV] END ........................................max_depth=8; total time=  10.4s
[CV] END ........................................max_depth=8; total time=  11.5s
[CV] END .......................................max_depth=11; total time=  56.2s
[CV] END .......................................max_depth=11; total time=  56.5s
[CV] END .......................................m

In [16]:
pred_cb = gs_cb.best_estimator_.predict(Xtest)

r2_score(ytest, pred_cb)

0.8911533719179447

In [17]:
%%time

gs_lgbm = GridSearchCV(lgbm, params, cv=3, scoring='r2', verbose=2)

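# Caution: this fits on the full X, y rather than Xtrain, ytrain, so the rows of
# Xtest take part in tuning and the hold-out score below is optimistically biased.
gs_lgbm.fit(X, y)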

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001093 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 13760, number of used features: 8
[LightGBM] [Info] Start training from score 2.117384
[CV] END ........................................max_depth=2; total time=   0.1s
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000343 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 13760, number of used features: 8
[LightGBM] [Info] Start training from score 2.079973
[CV] END ........................................max_depth=2; total time=   0.1s
[LightGBM] [Info] Auto-choosing 

In [18]:
pred_lgbm = gs_lgbm.best_estimator_.predict(Xtest)

r2_score(ytest, pred_lgbm)

0.876891981387784

We can see that even on a small dataset, with only one hyperparameter being tuned, we have to wait a while for the results. What if the dataset is larger? And what if there are many hyperparameters, all of which have to be tuned jointly to reach the best result?
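To get a feel for how quickly the cost grows, here is a minimal sketch that counts the fits an exhaustive grid search would need. Only the `max_depth` values come from this notebook; the other hyperparameters and their values are hypothetical, chosen purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grid used in this notebook: a single hyperparameter.
single_param_grid = {"max_depth": [2, 5, 8, 11]}

# Hypothetical joint grid: only max_depth is taken from the notebook,
# the remaining entries are illustrative assumptions.
joint_grid = {
    "max_depth": [2, 5, 8, 11],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [100, 300, 1000],
    "subsample": [0.6, 0.8, 1.0],
}

cv = 3  # number of folds, as in the cells above

# ParameterGrid enumerates every combination, so its length is the
# product of the list sizes.
n_single = len(ParameterGrid(single_param_grid))
n_joint = len(ParameterGrid(joint_grid))

# Each candidate is fitted once per fold during cross-validation.
fits_single = n_single * cv
fits_joint = n_joint * cv

print(f"one hyperparameter:   {n_single} candidates, {fits_single} CV fits")
print(f"four hyperparameters: {n_joint} candidates, {fits_joint} CV fits")
```

The candidate count is the product of the list lengths, so every added hyperparameter multiplies the total running time.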

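At the same time, tuning appears to pay off: the hold-out R² values above (0.84 to 0.89) are well above the default-parameter cross-validation scores (0.66 to 0.71), although the two sets of numbers come from different splits and are not strictly comparable.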
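One way to check such a claim on equal footing is to score the default model and the tuned model on the same hold-out split. The sketch below does this on a small synthetic dataset (`make_friedman1`) so that it runs quickly. The dataset, the reduced `n_estimators`, and the `max_depth` grid are assumptions made for illustration and are not the notebook's setup.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): small enough to run in seconds.
X_demo, y_demo = make_friedman1(n_samples=400, noise=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Baseline: default max_depth, fitted on the training part only.
baseline = GradientBoostingRegressor(n_estimators=50, random_state=42)
baseline.fit(X_tr, y_tr)
r2_default = r2_score(y_te, baseline.predict(X_te))

# Tuned model: the grid search sees only the training part,
# so the test part stays untouched until the final scoring.
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    {"max_depth": [2, 3, 5]},
    cv=3,
    scoring="r2",
)
search.fit(X_tr, y_tr)
r2_tuned = r2_score(y_te, search.predict(X_te))

print("best params:", search.best_params_)
print(f"default R2 on test: {r2_default:.3f}")
print(f"tuned R2 on test:   {r2_tuned:.3f}")
```

Because both scores come from the same test rows, any difference reflects the hyperparameters and not the split. The tuned score is not guaranteed to be higher on every dataset.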

So what can we do to avoid waiting forever while the hyperparameters are being searched? You will find out in the next lesson :)