### Homework
Now we solve a regression problem: predicting real-estate prices.

Use the dataset https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (train.csv)

The dataset is small, so use 10-fold cross-validation to assess model quality

Build a random forest and output the feature importances

Train a stack of at least 3 models, including at least 1 linear and 1 non-linear model

To validate the second-level model, use a separate hold-out dataset, as in class

Show that model ensembles actually improve quality (compare stacking with the other models on the hold-out set)

Deliverable: a Jupyter notebook with code, comments, and plots

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import r2_score

In [2]:
data = pd.read_csv('train_house.csv')

In [3]:
data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [5]:
y = data['SalePrice']
X = data.drop(['SalePrice', 'Id'], axis=1)  # drop the target and the Id column, which carries no predictive information
X = X.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], axis=1)  # drop features that are missing for most rows

### Handling missing values

In [6]:
# Fill the gaps in some of the categorical features
X_regr = X.copy()  # work on a copy so that X itself stays unchanged
X_regr['MasVnrType'] = X_regr['MasVnrType'].fillna('None')
# 'NB' is a placeholder category for rows where the value is missing
for col in ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
            'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
    X_regr[col] = X_regr[col].fillna('NB')
X_regr['Electrical'] = X_regr['Electrical'].fillna('SBrkr')
# build the list of categorical features
cat_feat = list(X_regr.dtypes[X_regr.dtypes == object].index)

In [7]:
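# Replace remaining missing values of categorical variables with the mode (for the linear model)
for category in cat_feat:
    # mode() must be called; it returns a Series, so take its first (most frequent) value
    X_regr[category] = X_regr[category].fillna(X_regr[category].mode()[0])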
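
A minimal sketch of mode imputation on a toy frame, to show why one element has to be taken from the result of `mode()`: the method returns a Series because several values can tie for most frequent. The frame `toy` and its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the categorical part of X_regr (values are made up).
toy = pd.DataFrame({
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
    "Street": ["Pave", np.nan, "Pave", "Grvl"],
})

# mode() ignores NaN and returns a Series (there may be ties),
# so take its first element before passing it to fillna.
for col in toy.columns:
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

print(toy)
```

Assigning the result back to the column avoids relying on `inplace=True` on a column selection, which newer pandas versions warn about.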

In [8]:
# select the numeric features
num_feat = [f for f in X_regr if f not in cat_feat]

In [9]:
# Replace missing values of numeric variables with the column mean
for category in num_feat:
    X_regr[category] = X_regr[category].fillna(X_regr[category].mean())
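
The mean above is computed over the whole dataset, before any train/hold-out split. A possible alternative, sketched here under the assumption that the split is done first, is to learn the mean on the training part only with `SimpleImputer` and reuse it for the hold-out part. The arrays are made-up placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric column with gaps, already split into "train" and "hold-out" parts.
train_part = np.array([[60.0], [80.0], [np.nan], [70.0]])
holdout_part = np.array([[np.nan], [90.0]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_part)   # mean is learned from the train part only
holdout_filled = imputer.transform(holdout_part)   # the same mean is reused, nothing is refit

print(imputer.statistics_, holdout_filled.ravel())
```

This keeps information from the hold-out rows out of the preprocessing step; whether the difference matters for this dataset is not something the notebook measures.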

In [10]:
cat_nunique = X_regr[cat_feat].nunique()
print(cat_nunique)

MSZoning          5
Street            2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        8
BldgType          5
HouseStyle        8
RoofStyle         6
RoofMatl          8
Exterior1st      15
Exterior2nd      16
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          5
BsmtCond          5
BsmtExposure      5
BsmtFinType1      7
BsmtFinType2      7
Heating           6
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
FireplaceQu       6
GarageType        7
GarageFinish      4
GarageQual        6
GarageCond        6
PavedDrive        3
SaleType          9
SaleCondition     6
dtype: int64


In [11]:
# To avoid multiplying the number of features when building dummies,
# keep only the categorical features with fewer than 10 unique values
X_regr = X_regr.drop(['Neighborhood','Exterior1st','Exterior2nd'],axis=1)

In [12]:
# Find the categorical features
cat_feat = list(X_regr.dtypes[X_regr.dtypes == object].index)

In [13]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Encode every categorical column as integer codes (the encoder is refit for each column)
for i in cat_feat:
    X_regr[i] = le.fit_transform(X_regr[i].values)
    print(i + ' has been label encoded')

MSZoning has been label encoded
Street has been label encoded
LotShape has been label encoded
LandContour has been label encoded
Utilities has been label encoded
LotConfig has been label encoded
LandSlope has been label encoded
Condition1 has been label encoded
Condition2 has been label encoded
BldgType has been label encoded
HouseStyle has been label encoded
RoofStyle has been label encoded
RoofMatl has been label encoded
MasVnrType has been label encoded
ExterQual has been label encoded
ExterCond has been label encoded
Foundation has been label encoded
BsmtQual has been label encoded
BsmtCond has been label encoded
BsmtExposure has been label encoded
BsmtFinType1 has been label encoded
BsmtFinType2 has been label encoded
Heating has been label encoded
HeatingQC has been label encoded
CentralAir has been label encoded
Electrical has been label encoded
KitchenQual has been label encoded
Functional has been label encoded
FireplaceQu has been label encoded
GarageType has been label encod
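
One thing worth keeping in mind about this step: `LabelEncoder` numbers the categories in sorted (alphabetical) order, which does not necessarily match their meaning. The sketch below illustrates this on the quality grades used in the dataset (Ex, Gd, TA, Fa from best to worst); the explicit `quality_order` mapping is a hypothetical alternative, not something the notebook uses.

```python
from sklearn.preprocessing import LabelEncoder

# Quality grades as they appear in the dataset: Ex > Gd > TA > Fa (best to worst).
grades = ["Ex", "Gd", "TA", "Fa", "Gd"]

le = LabelEncoder()
codes = le.fit_transform(grades)
# classes_ is sorted alphabetically, so the codes do not follow the quality order.
print(dict(zip(le.classes_, range(len(le.classes_)))))

# Hypothetical explicit mapping that preserves the quality order.
quality_order = {"Fa": 0, "TA": 1, "Gd": 2, "Ex": 3}
ordered_codes = [quality_order[g] for g in grades]
print(ordered_codes)
```

Tree-based models are largely insensitive to the numbering, but a linear model treats the codes as magnitudes, so an order-preserving mapping or one-hot encoding may suit it better.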

In [14]:
# for some features (the nominal categorical variables) build dummies
features = ['MSZoning','Street','LotShape','LandContour','Utilities','GarageQual','LandSlope','SaleType','SaleCondition']
X_dummy = pd.get_dummies(X_regr[features], columns=features)
X_dummy.head()

Unnamed: 0,MSZoning_0,MSZoning_1,MSZoning_2,MSZoning_3,MSZoning_4,Street_0,Street_1,LotShape_0,LotShape_1,LotShape_2,...,SaleType_5,SaleType_6,SaleType_7,SaleType_8,SaleCondition_0,SaleCondition_1,SaleCondition_2,SaleCondition_3,SaleCondition_4,SaleCondition_5
0,0,0,0,1,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
1,0,0,0,1,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
2,0,0,0,1,0,0,1,1,0,0,...,0,0,0,1,0,0,0,0,1,0
3,0,0,0,1,0,0,1,1,0,0,...,0,0,0,1,1,0,0,0,0,0
4,0,0,0,1,0,0,1,1,0,0,...,0,0,0,1,0,0,0,0,1,0


In [15]:
# Concatenate the final feature dataset
X_final = pd.concat([X_regr.drop(features, axis =1),
                     X_dummy], axis=1)
X_final.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,LotConfig,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,...,SaleType_5,SaleType_6,SaleType_7,SaleType_8,SaleCondition_0,SaleCondition_1,SaleCondition_2,SaleCondition_3,SaleCondition_4,SaleCondition_5
0,60,65.0,8450,4,2,2,0,5,7,5,...,0,0,0,1,0,0,0,0,1,0
1,20,80.0,9600,2,1,2,0,2,6,8,...,0,0,0,1,0,0,0,0,1,0
2,60,68.0,11250,4,2,2,0,5,7,5,...,0,0,0,1,0,0,0,0,1,0
3,70,60.0,9550,0,2,2,0,5,7,5,...,0,0,0,1,1,0,0,0,0,0
4,60,84.0,14260,2,2,2,0,5,8,5,...,0,0,0,1,0,0,0,0,1,0


In [16]:
# split the data into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_validation, y_train, y_validation = train_test_split(X_final, y, test_size=0.25)
# X_validation, y_validation - hold_out dataset

In [17]:
# split the training set further into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.25)
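
To make the resulting proportions concrete, here is a small sketch that repeats the two consecutive 25% splits on a placeholder array with the same number of rows as train.csv (1460). The `random_state` values are an assumption added for reproducibility; the notebook does not set them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1460)  # same row count as train.csv; the values are placeholders

# First split: 25% goes to the hold-out (validation) set.
rest, holdout = train_test_split(rows, test_size=0.25, random_state=0)
# Second split: 25% of what is left goes to the test set.
train, test = train_test_split(rest, test_size=0.25, random_state=0)

print(len(train), len(test), len(holdout))
```

The models are therefore trained on roughly 56% of the rows, with about 19% in the test set and 25% in the hold-out set.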

### Random forest

In [18]:
from sklearn.model_selection import KFold

In [20]:
# 10 folds; random_state=3 is omitted because it has no effect without shuffle=True (recent scikit-learn raises an error for that combination)
kfold = KFold(n_splits=10)
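
A short sketch of how `KFold` behaves with and without shuffling, on 20 placeholder rows. Without `shuffle=True` the folds are consecutive blocks of rows and `random_state` plays no role; with shuffling, the seed makes the fold assignment reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(20)  # placeholder data: 20 rows

ordered = KFold(n_splits=10)                                 # consecutive blocks
shuffled = KFold(n_splits=10, shuffle=True, random_state=3)  # seeded shuffle

# Validation indices of the first fold for each splitter.
first_ordered = [val.tolist() for _, val in ordered.split(samples)][0]
first_shuffled = [val.tolist() for _, val in shuffled.split(samples)][0]
print(first_ordered, first_shuffled)
```

If the rows of the dataset are sorted in some meaningful way, the shuffled variant is usually the safer choice.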

In [21]:
from sklearn.ensemble import RandomForestRegressor

clf_rf = RandomForestRegressor(n_estimators=10, max_depth=5, min_samples_leaf=5, max_features=0.5, n_jobs=-1)
clf_rf.fit(X_train, y_train)

RandomForestRegressor(max_depth=5, max_features=0.5, min_samples_leaf=5,
                      n_estimators=10, n_jobs=-1)

In [23]:
# R2 on each fold and the mean value (training set)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf_rf, X_train, y_train, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2 train - ', scores)
print('Mean score R2 train- ', mean)

Scores R2 train -  [0.57567513 0.63375328 0.84385495 0.66800213 0.88022268 0.74875878
 0.87321933 0.70521612 0.85087336 0.82524239]
Mean score R2 train-  0.7604818151509456


In [24]:
# R2 on each fold and the mean value (test set)
scores = cross_val_score(clf_rf, X_test, y_test, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2 test - ', scores)
print('Mean score R2 test - ', mean)

Scores R2 test -  [0.7391083  0.82818954 0.74960992 0.79306201 0.81271605 0.88302948
 0.86391975 0.74788552 0.78370813 0.91295215]
Mean score R2 test -  0.8114180846685685


In [25]:
# Random forest                Mean score R2 train: 0.7604818151509456   Mean score R2 test: 0.8114180846685685

### Feature importances

In [26]:
# estimate feature importance with the RF
imp = pd.Series(clf_rf.feature_importances_)
imp_values = imp.sort_values(ascending=False)

In [27]:
# print the 10 most important features
imp_10 = imp_values[:10]
for f in imp_10.index:
    print(X_train.columns[f], '-', imp_10[f] )

OverallQual - 0.24057675834231868
GrLivArea - 0.18037627916964122
GarageCars - 0.1750909391250959
ExterQual - 0.09508482351552387
GarageArea - 0.06005246126359202
1stFlrSF - 0.04784273646348563
TotalBsmtSF - 0.03278771827822343
FullBath - 0.021606145836576204
YearBuilt - 0.019442665619760522
GarageYrBlt - 0.01751960497141824
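
Since the assignment asks for plots, the importances could also be shown as a bar chart. The sketch below runs on synthetic data from `make_regression` with hypothetical column names (`feat_0` ... `feat_7`); in the notebook the same steps would apply to `clf_rf` and `X_train`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for X_train / y_train; the column names are hypothetical.
X_arr, y_demo = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Index the importances by column name so no positional lookup is needed.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top = importances.sort_values(ascending=False).head(5)

# Horizontal bars, largest at the top.
ax = top.sort_values().plot.barh()
ax.set_xlabel("Impurity-based importance")
ax.set_title("Top features (synthetic data)")
plt.tight_layout()
```

Indexing the Series by column name also removes the need to map positions back to `X_train.columns` when printing.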


In [28]:
# Try to find the optimal model parameters with GridSearchCV
from sklearn.model_selection import GridSearchCV

In [29]:
# Define the parameter grid
param_grid =  {'n_estimators': [10,20,30,40],
               'max_features': ['sqrt'],
               'max_depth': [None,1,5,10,20],
               'min_samples_split': [2,4,6],
               'min_samples_leaf': [1,3,5]}
rf = RandomForestRegressor()
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = kfold, n_jobs = -1)

In [30]:
# Check which parameters are best
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 20}
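
As a rough guide to the cost of this search, the number of candidate parameter combinations can be counted with `ParameterGrid`. The sketch copies the grid from the cell above; the factor of 10 is the number of folds.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
param_grid = {'n_estimators': [10, 20, 30, 40],
              'max_features': ['sqrt'],
              'max_depth': [None, 1, 5, 10, 20],
              'min_samples_split': [2, 4, 6],
              'min_samples_leaf': [1, 3, 5]}

n_candidates = len(ParameterGrid(param_grid))  # product of the list lengths
n_fits = n_candidates * 10                     # one fit per candidate per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set.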

In [31]:
best_grid = grid_search.best_estimator_
best_grid.fit(X_train,y_train)

RandomForestRegressor(max_features='sqrt', n_estimators=20)

In [33]:
# R2 on each fold and the mean value (training set)
scores = cross_val_score(best_grid, X_train, y_train, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 train - ', mean)

Scores R2-  [0.48064503 0.69172353 0.8580631  0.72349152 0.88743641 0.78472059
 0.86442438 0.80774547 0.86501115 0.84406011]
Mean score R2 train -  0.7807321273513038


In [34]:
# R2 on each fold and the mean value (test set)
scores = cross_val_score(best_grid, X_test, y_test, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 test - ', mean)

Scores R2-  [0.87701368 0.83697959 0.73236881 0.83693331 0.82865466 0.89447257
 0.88213775 0.75109818 0.8868639  0.89823246]
Mean score R2 test -  0.842475491135063


In [35]:
# Random forest                            Mean score R2 train: 0.7604818151509456   Mean score R2 test: 0.8114180846685685
# Random forest best estimat               Mean score R2 train: 0.7807321273513038   Mean score R2 test: 0.842475491135063

In [36]:
not_important = imp_values[lambda x: x == 0.000000]  # entries with zero feature "importance"
not_important_index = not_important.index.tolist()  # positional indices of the unimportant features

In [37]:
# Keep only the "important" features in X_train and X_test
X_train_imp = X_train.drop(X_train.columns[not_important_index], axis=1)


In [38]:
X_test_imp = X_test.drop(X_test.columns[not_important_index], axis=1)
X_val_imp = X_validation.drop(X_validation.columns[not_important_index], axis=1)
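
The same selection can be written with column names instead of positions, which avoids indexing `columns` with a list of integers. This is a sketch on made-up importances and a two-row toy frame; the numbers are illustrative only.

```python
import pandas as pd

# Made-up importances indexed by column name (values are illustrative only).
importances = pd.Series({"OverallQual": 0.6, "GrLivArea": 0.4,
                         "Street_0": 0.0, "Utilities_1": 0.0})
frame = pd.DataFrame([[7, 1710, 0, 0], [6, 1262, 0, 1]], columns=importances.index)

# Names of the features whose importance is exactly zero.
zero_cols = importances[importances == 0].index
frame_imp = frame.drop(columns=zero_cols)
print(list(frame_imp.columns))
```

Because the names are taken from the importance Series, the same `zero_cols` can be dropped from the train, test and hold-out frames without worrying about column order.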

In [39]:
# try to find the best model parameters for the X_train_imp set
grid_search_imp = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = kfold, n_jobs = -1) # number of folds = 10
grid_search_imp.fit(X_train_imp, y_train)
grid_search_imp.best_params_

{'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'n_estimators': 40}

In [41]:
# grid_best_imp is not defined in the cells shown; presumably it is the tuned forest
grid_best_imp = grid_search_imp.best_estimator_
# R2 on each fold and the mean value (training set)
scores = cross_val_score(grid_best_imp, X_train, y_train, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 train - ', mean)
# R2 on each fold and the mean value (test set)
scores = cross_val_score(grid_best_imp, X_test, y_test, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 test - ', mean)

Scores R2-  [0.54724087 0.65628147 0.89392798 0.66003002 0.8826581  0.81566946
 0.87216382 0.76951314 0.87360622 0.84642709]
Mean score R2 train -  0.7817518154858635
Scores R2-  [0.84832277 0.84585056 0.69393113 0.78038888 0.85904947 0.890258
 0.89323863 0.73772593 0.89323078 0.93215309]
Mean score R2 test -  0.8374149250109941


In [42]:
# Random forest                            Mean score R2 train: 0.7604818151509456   Mean score R2 test: 0.8114180846685685
# Random forest best estimat               Mean score R2 train: 0.7807321273513038   Mean score R2 test: 0.842475491135063
# Random forest best estimat imp feat      Mean score R2 train: 0.7817518154858635   Mean score R2 test: 0.8374149250109941

### Stacking

In [43]:
from sklearn.ensemble import StackingRegressor

In [44]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

In [45]:
# use gradient boosting

gbm_param = dict(
    loss=['squared_error', 'huber'],  # 'squared_error' was called 'ls' before scikit-learn 1.0
    n_estimators=[10,20,30,40],
    min_samples_split=[2,4,6],
    max_depth=[None,1,5,10,20],
    )

gbm = GradientBoostingRegressor()
grid_search_gbr = GridSearchCV(estimator = gbm, param_grid = gbm_param, 
                          cv = kfold, n_jobs = -1) # number of folds = 10

In [46]:
# Check which parameters are best
grid_search_gbr.fit(X_train_imp, y_train)
grid_search_gbr.best_params_

{'loss': 'huber', 'max_depth': 5, 'min_samples_split': 6, 'n_estimators': 40}

In [49]:
# grid_best_gbr is not defined in the cells shown; presumably it is the tuned boosting model
grid_best_gbr = grid_search_gbr.best_estimator_
# R2 on each fold and the mean value (training set)
scores = cross_val_score(grid_best_gbr, X_train, y_train, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 train - ', mean)
# R2 on each fold and the mean value (test set)
scores = cross_val_score(grid_best_gbr, X_test, y_test, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 test - ', mean)

Scores R2-  [0.76657684 0.77682629 0.90499292 0.65388335 0.86590477 0.79655908
 0.88888737 0.76679595 0.87962682 0.87423818]
Mean score R2 train -  0.8174291564557186
Scores R2-  [0.83636334 0.86810713 0.62507205 0.82510042 0.7898817  0.91767591
 0.76966074 0.71349659 0.84640824 0.92714348]
Mean score R2 test -  0.8118909589734681


In [50]:
# Random forest                            Mean score R2 train: 0.7604818151509456   Mean score R2 test: 0.8114180846685685
# Random forest best estimat               Mean score R2 train: 0.7807321273513038   Mean score R2 test: 0.842475491135063
# Random forest best estimat imp feat      Mean score R2 train: 0.7817518154858635   Mean score R2 test: 0.8374149250109941
# GradientBoostingRegressor                Mean score R2 train: 0.8174291564557186   Mean score R2 test: 0.8118909589734681

In [51]:
# lr = LinearRegression().fit(X_train, y_train)
# plain linear regression makes little sense here because there are many features, so regularization is needed
# we use lasso, assuming that some features have no effect (many had importance = 0 in the feature-importance estimate)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1, tol = 0.1).fit(X_train, y_train)
# find the optimal value of alpha
alpha = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
param_grid_lasso = dict(alpha=alpha)
grid_search_lasso = GridSearchCV(estimator=lasso, param_grid=param_grid_lasso, scoring='r2', cv = kfold) # number of folds = 10

In [52]:
# Check which parameters are best
grid_search_lasso = grid_search_lasso.fit(X_train, y_train)
grid_search_lasso.best_params_

{'alpha': 1000}
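
The selected `alpha` is the largest value in the grid, which suggests the optimum may lie beyond it. One possible follow-up, sketched here on synthetic data, is to standardize the features in a pipeline and search a wider logarithmic grid. The scaling step, the grid bounds and the shuffled folds are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=0)

# Scaling puts all coefficients on a comparable footing for the L1 penalty.
pipe = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
alphas = np.logspace(-3, 5, 9)  # 0.001 ... 100000, two decades past 1000
search = GridSearchCV(pipe, {"lasso__alpha": alphas}, scoring="r2",
                      cv=KFold(n_splits=10, shuffle=True, random_state=3))
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["lasso__alpha"]
hit_edge = best_alpha in (alphas[0], alphas[-1])  # True would mean: widen the grid again
print(best_alpha, hit_edge)
```

On the real data the best value depends on the target scale (prices are in the hundreds of thousands), so the grid bounds would need to be chosen accordingly.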

In [55]:
# grid_best_lasso is not defined in the cells shown; presumably it is the tuned lasso
grid_best_lasso = grid_search_lasso.best_estimator_
# R2 on each fold and the mean value (training set)
scores = cross_val_score(grid_best_lasso, X_train, y_train, cv=kfold, n_jobs= -1, scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 train - ', mean)
# R2 on each fold and the mean value (test set)
scores = cross_val_score(grid_best_lasso, X_test, y_test, cv=kfold, n_jobs= -1,scoring='r2')
mean = np.mean(scores)
print('Scores R2- ', scores)
print('Mean score R2 test - ', mean)

Scores R2-  [0.43528138 0.03983304 0.85878913 0.79062695 0.85095379 0.83283364
 0.82088162 0.79184816 0.74295258 0.88163047]
Mean score R2 train -  0.7045630765596063
Scores R2-  [0.77597493 0.93786726 0.88737856 0.93119916 0.90680872 0.83875645
 0.84734822 0.73824834 0.83335784 0.89267067]
Mean score R2 test -  0.8589610145489038


In [56]:
# Random forest                            Mean score R2 train: 0.7604818151509456   Mean score R2 test: 0.8114180846685685
# Random forest best estimat               Mean score R2 train: 0.7807321273513038   Mean score R2 test: 0.842475491135063
# Random forest best estimat imp feat      Mean score R2 train: 0.7817518154858635   Mean score R2 test: 0.8374149250109941
# GradientBoostingRegressor                Mean score R2 train: 0.8174291564557186   Mean score R2 test: 0.8118909589734681
# Lasso (L1-regularization)                Mean score R2 train: 0.7045630765596063   Mean score R2 test: 0.8589610145489038

### Stacking with the best model parameters

In [57]:
estimators = [
    ('grid_search_imp',grid_search_imp.best_estimator_),
    ('grid_search_gbr',grid_search_gbr.best_estimator_),
    ('grid_search_lasso',grid_search_lasso.best_estimator_),
]

stacking_model = StackingRegressor(
    estimators=estimators, cv = kfold) # number of folds = 10

In [58]:
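# Note: StackingRegressor.fit clones the base estimators and refits the clones on the data
# passed to it, so these manual fits do not carry over into the fitted stack
stacking_model.named_estimators['grid_search_imp'].fit(X_train_imp,y_train)
stacking_model.named_estimators['grid_search_gbr'].fit(X_train_imp,y_train)
stacking_model.named_estimators['grid_search_lasso'].fit(X_train,y_train)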

Lasso(alpha=1000, tol=0.1)

In [59]:
stacking_model.named_estimators['grid_search_imp']

RandomForestRegressor(max_features='sqrt', min_samples_split=6, n_estimators=40)

In [60]:
stacking_model.fit(X_train, y_train)

StackingRegressor(cv=KFold(n_splits=10, random_state=3, shuffle=False),
                  estimators=[('grid_search_imp',
                               RandomForestRegressor(max_features='sqrt',
                                                     min_samples_split=6,
                                                     n_estimators=40)),
                              ('grid_search_gbr',
                               GradientBoostingRegressor(loss='huber',
                                                         max_depth=5,
                                                         min_samples_split=6,
                                                         n_estimators=40)),
                              ('grid_search_lasso',
                               Lasso(alpha=1000, tol=0.1))])
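
To show what happens inside the stack, here is a hand-written sketch of the two levels on synthetic data. It uses two base models and `RidgeCV` as the second-level model (the default `final_estimator` of `StackingRegressor`); the data, model settings and shuffled folds are placeholders, not the notebook's tuned models.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, RidgeCV
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=3)

base_models = [RandomForestRegressor(n_estimators=20, random_state=0), Lasso(alpha=1.0)]

# Level 1: out-of-fold predictions, so each row is predicted by a model that never saw it.
meta_features = np.column_stack(
    [cross_val_predict(model, X_demo, y_demo, cv=cv) for model in base_models]
)

# Level 2: the final estimator learns how to weight the base predictions.
meta_model = RidgeCV().fit(meta_features, y_demo)
print(meta_features.shape, meta_model.coef_)
```

`StackingRegressor` does the same bookkeeping internally and additionally refits each base model on the full training data for use at prediction time.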

### Comparing R2 on the hold-out (validation) set

In [62]:
scores = cross_val_score(grid_best_imp, X_val_imp, y_validation, cv=kfold, n_jobs= -1,scoring='r2')
r2_val_rf_imp = np.mean(scores)
print('R2 val random forest:', r2_val_rf_imp)

R2 val random forest: 0.8585196981988418


In [63]:
scores = cross_val_score(grid_best_gbr, X_val_imp, y_validation, cv=kfold, n_jobs= -1,scoring='r2')
r2_val_gbr = np.mean(scores)
print('R2 val gradient boosting:', r2_val_gbr)

R2 val gradient boosting: 0.8524832494598981


In [64]:
scores = cross_val_score(grid_best_lasso, X_validation, y_validation, cv=kfold, n_jobs= -1,scoring='r2')
r2_val_lasso = np.mean(scores)
print('R2 val lasso:', r2_val_lasso)

R2 val lasso: 0.8603483063024175


In [65]:
r2_val_ensemble = np.mean(cross_val_score(stacking_model, X_validation, y_validation, cv=kfold, n_jobs= -1,scoring='r2'))
print('R2 val ensemble:', r2_val_ensemble)

R2 val ensemble: 0.8849652027255029


In [66]:
# The ensemble reaches a higher R2 (0.885) than each of the models it is built from (0.852 to 0.860)
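
Note that `cross_val_score` refits a clone of each model on folds of the data it is given, so the numbers above come from models trained on parts of the hold-out set itself. A more direct hold-out comparison fits each model once on the training data and scores its predictions on the untouched hold-out rows. The sketch below shows that pattern on synthetic data; the models and their settings are placeholders, not the tuned ones from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training and hold-out data.
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

models = {
    "random_forest": RandomForestRegressor(n_estimators=30, random_state=0),
    "lasso": Lasso(alpha=1.0),
}
# The stack is built from the two base models above (it clones them internally).
models["stacking"] = StackingRegressor(estimators=list(models.items()), cv=5)

# Fit on the training part only, then score once on the hold-out part.
holdout_r2 = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    holdout_r2[name] = r2_score(y_hold, model.predict(X_hold))
print(holdout_r2)
```

In the notebook this would correspond to calling `predict` on `X_validation` (or `X_val_imp`) with the already fitted models and passing the result to `r2_score`, which is imported in the first cell.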