# Homework for the lecture "Improving Model Quality"
Take the Boston house-prices dataset (sklearn.datasets.load_boston) and do the same for a regression task (try different algorithms, tune the parameters, and report the final quality).

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np

In [2]:
data = load_boston()
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.DataFrame(data['target'], columns=['MEDV'])
df = pd.concat([X,y], axis=1)
df.head()

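| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |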


In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

In [4]:
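def get_score(X, y, random_seed=42, model=None):
    # Default to plain linear regression when no model is passed
    if model is None:
        model = LinearRegression()
    # Split and fit for any model, not only the default one
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)  # R^2 on the held-out 30%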

### Best result with the linear regression model

In [9]:
get_score(X[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM',  'DIS', 'RAD', 'TAX',
       'PTRATIO', 'LSTAT']], y)

0.7190673146383935
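For regressors, `model.score` returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$, so the value above is a share of explained variance on the held-out split rather than an accuracy. A minimal sketch of that formula follows; the helper name `r2_manual` and the toy values are illustrative assumptions and are not taken from the Boston data.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical, for illustration only)
y_true = [3.0, 5.0, 7.0, 9.0]
print(r2_manual(y_true, y_true))       # perfect prediction -> 1.0
print(r2_manual(y_true, [6.0] * 4))    # always predicting the mean -> 0.0
```

A model that does worse than predicting the mean gets a negative score, which is why $R^2$ is not bounded below by zero.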

In [5]:
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

### Let's look at how the RandomForestRegressor score depends on tree depth

In [19]:
for depth in range(1, 20):
    acc = []
    for n_trials in range(20):
        X_train_small, X_test_small, y_train_small, y_test_small = train_test_split(X_train, 
                                                                                y_train, 
                                                                                test_size=0.1)
        rfr = RandomForestRegressor(max_depth=depth)
        rfr.fit(X_train_small, np.ravel(y_train_small))
        acc_ = rfr.score(X_test_small, y_test_small)
        acc.append(acc_)
    print("max_depth: %d,\t score: %.3f"%(depth, np.array(acc).mean()))

max_depth: 1,	 score: 0.558
max_depth: 2,	 score: 0.758
max_depth: 3,	 score: 0.778
max_depth: 4,	 score: 0.815
max_depth: 5,	 score: 0.835
max_depth: 6,	 score: 0.796
max_depth: 7,	 score: 0.863
max_depth: 8,	 score: 0.870
max_depth: 9,	 score: 0.843
max_depth: 10,	 score: 0.829
max_depth: 11,	 score: 0.847
max_depth: 12,	 score: 0.864
max_depth: 13,	 score: 0.873
max_depth: 14,	 score: 0.888
max_depth: 15,	 score: 0.856
max_depth: 16,	 score: 0.855
max_depth: 17,	 score: 0.858
max_depth: 18,	 score: 0.854
max_depth: 19,	 score: 0.884


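The mean score ($R^2$) of RandomForestRegressor is highest at a tree depth of 14 (0.888), although for depths 7 through 19 it fluctuates between 0.829 and 0.888 with no clear trend.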
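To judge whether a difference between two depths is larger than the split-to-split noise, it can help to report the spread of the scores next to their mean and to reuse the same random splits for every depth. A minimal sketch, assuming synthetic data from `make_regression` as a stand-in for the Boston training set and arbitrary fixed seeds; the sizes and depths are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the training data (assumption, not the Boston set)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# The same 5 random 90/10 splits are reused for every depth
splitter = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)

summary = {}
for depth in (2, 5, 10):
    model = RandomForestRegressor(n_estimators=30, max_depth=depth, random_state=0)
    scores = cross_val_score(model, X_demo, y_demo, cv=splitter)
    summary[depth] = (scores.mean(), scores.std())
    print("max_depth: %d,\t score: %.3f +/- %.3f" % (depth, scores.mean(), scores.std()))
```

If the standard deviation is comparable to the gap between two depths, the ranking of those depths is probably not reliable.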

In [6]:
rfr_new = RandomForestRegressor(max_depth=14)
rfr_new.fit(X_train, np.ravel(y_train))
rfr_new.score(X_test, y_test)

0.8219336413056106

### Run cross_val_score for RandomForestRegressor

In [7]:
for depth in range(1,15):
    rfr = RandomForestRegressor(max_depth=depth)
    print("max_depth: %d,\t mean cv_score: %.3f"%(depth, cross_val_score(rfr, X_train, np.ravel(y_train), cv=20).mean()))

max_depth: 1,	 mean cv_score: 0.492
max_depth: 2,	 mean cv_score: 0.675
max_depth: 3,	 mean cv_score: 0.747
max_depth: 4,	 mean cv_score: 0.773
max_depth: 5,	 mean cv_score: 0.790
max_depth: 6,	 mean cv_score: 0.803
max_depth: 7,	 mean cv_score: 0.809
max_depth: 8,	 mean cv_score: 0.823
max_depth: 9,	 mean cv_score: 0.802
max_depth: 10,	 mean cv_score: 0.822
max_depth: 11,	 mean cv_score: 0.813
max_depth: 12,	 mean cv_score: 0.819
max_depth: 13,	 mean cv_score: 0.821
max_depth: 14,	 mean cv_score: 0.825


The mean cross_val_score is also highest at a tree depth of 14 (0.825), although depths 8 (0.823) and 10 (0.822) are nearly as good and the loop stops at depth 14.
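With an integer `cv` and a regressor, `cross_val_score` splits the rows into contiguous, unshuffled folds. The sketch below reproduces only the fold sizes to show how small each validation fold becomes with `cv=20`. The helper `kfold_indices` is hypothetical, and the row count of 354 is an assumption (70% of the 506 rows in the Boston data).

```python
def kfold_indices(n_samples, n_splits):
    """Contiguous, unshuffled folds; the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 354 training rows is an assumption, see the note above
folds = kfold_indices(354, 20)
sizes = [len(f) for f in folds]
print(sizes)
```

With only 17 or 18 rows per validation fold, each per-fold $R^2$ is noisy, so differences of a few thousandths between depths should be read with caution.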

### Tuning hyperparameters with GridSearchCV

In [8]:
from sklearn.model_selection import GridSearchCV

In [9]:
param_grid = {'n_estimators': [50, 100, 150],
    'max_depth': [10, 13, 16],
    'min_samples_split': [2, 3],
    'min_samples_leaf': [1, 2],
    }

In [10]:
np.random.seed(seed=12)

rfr = RandomForestRegressor()
grid = GridSearchCV(rfr, param_grid, cv=10)

In [11]:
# fit the grid on the full dataset (X, y), not only on the training split
grid.fit(X, np.ravel(y))

GridSearchCV(cv=10, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [10, 13, 16], 'min_samples_leaf': [1, 2],
                         'min_samples_split': [2, 3],
                         'n_estimators': [50, 100, 150]})

In [12]:
# best parameters among those offered
grid.best_params_

{'max_depth': 10,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 50}

In [13]:
test_scores = grid.cv_results_['mean_test_score']
print(test_scores)

[0.43330687 0.48716196 0.46912332 0.50360172 0.49587173 0.48056839
 0.47567166 0.497207   0.48961411 0.44669087 0.4646487  0.47944506
 0.48684982 0.47521203 0.49578686 0.46361063 0.4785519  0.47375589
 0.47892894 0.48897027 0.49130329 0.44632187 0.48111725 0.46329634
 0.48662413 0.49209554 0.46841098 0.47695261 0.4968034  0.49082129
 0.49960612 0.46391659 0.48600705 0.471902   0.4809589  0.47734543]
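The 36 values printed above correspond to the 3 × 3 × 2 × 2 combinations in `param_grid`. The sketch below enumerates them with `itertools.product`. The ordering used here (keys sorted alphabetically, last key varying fastest) is an assumption about how the candidates are laid out, though it is consistent with the highest score (0.50360172, position 3 counting from zero) and the reported `best_params_`.

```python
from itertools import product

param_grid = {'n_estimators': [50, 100, 150],
              'max_depth': [10, 13, 16],
              'min_samples_split': [2, 3],
              'min_samples_leaf': [1, 2]}

# Assumed ordering: keys sorted alphabetically, last key varying fastest
keys = sorted(param_grid)
candidates = [dict(zip(keys, values))
              for values in product(*(param_grid[k] for k in keys))]

print(len(candidates))        # 36 parameter combinations
print(len(candidates) * 10)   # 360 model fits with cv=10, before the final refit
print(candidates[3])          # candidate at the position of the highest score
```

The cost of a grid search grows multiplicatively with every added parameter value, which is the usual motivation for sampling the grid instead of exhausting it.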


In [14]:
rfr_new = RandomForestRegressor(max_depth=10, min_samples_leaf=1, min_samples_split=3, n_estimators=50)
rfr_new.fit(X_train, np.ravel(y_train))
rfr_new.score(X_test, y_test)

0.8236690165766594

In [15]:
grid.cv_results_

{'mean_fit_time': array([0.16270313, 0.33609188, 0.5027889 , 0.16329851, 0.3160054 ,
        0.49469404, 0.1732928 , 0.34268854, 0.4741071 , 0.15220644,
        0.32939608, 0.50128992, 0.18218718, 0.37416935, 0.52527463,
        0.1896821 , 0.37756796, 0.54756148, 0.15880156, 0.31530557,
        0.4811033 , 0.15550358, 0.30750964, 0.46271396, 0.18528521,
        0.36517427, 0.54016621, 0.16939411, 0.34008961, 0.52217684,
        0.16209939, 0.31140647, 0.50278778, 0.1635988 , 0.31020877,
        0.46591139]),
 'std_fit_time': array([0.00281755, 0.01510399, 0.0331897 , 0.01080216, 0.00339865,
        0.03173895, 0.01163849, 0.03023094, 0.02248979, 0.00279271,
        0.02376426, 0.03815035, 0.01133305, 0.02914423, 0.01207725,
        0.04569201, 0.05052581, 0.03476103, 0.00484356, 0.01252383,
        0.02729257, 0.00162414, 0.00354893, 0.00592939, 0.00997615,
        0.01567615, 0.01204443, 0.00215484, 0.00888278, 0.02403595,
        0.00898135, 0.00489983, 0.07481902, 0.0077928 , 0.003

## RandomizedSearchCV

In [17]:
from sklearn.model_selection import RandomizedSearchCV

In [18]:
rand_cv = RandomizedSearchCV(rfr, param_grid, cv=10)

In [19]:
rand_cv.fit(X_train, np.ravel(y_train))

RandomizedSearchCV(cv=10, estimator=RandomForestRegressor(),
                   param_distributions={'max_depth': [10, 13, 16],
                                        'min_samples_leaf': [1, 2],
                                        'min_samples_split': [2, 3],
                                        'n_estimators': [50, 100, 150]})

In [20]:
rand_cv.cv_results_

{'mean_fit_time': array([0.25324385, 0.37436817, 0.13611546, 0.40574808, 0.12412286,
        0.12232382, 0.24694741, 0.36737304, 0.3853622 , 0.13301783]),
 'std_fit_time': array([0.00885068, 0.02060308, 0.0052655 , 0.00244775, 0.00381297,
        0.00286886, 0.01136249, 0.01419149, 0.05258789, 0.00537177]),
 'mean_score_time': array([0.01139255, 0.01638987, 0.00699646, 0.01539152, 0.0066962 ,
        0.00709605, 0.01119399, 0.01509037, 0.01529033, 0.0070955 ]),
 'std_score_time': array([0.00091562, 0.0019064 , 0.00044702, 0.00048916, 0.00045777,
        0.00029946, 0.00074804, 0.0003006 , 0.00078047, 0.00029996]),
 'param_n_estimators': masked_array(data=[100, 150, 50, 150, 50, 50, 100, 150, 150, 50],
              mask=[False, False, False, False, False, False, False, False,
                    False, False],
        fill_value='?',
             dtype=object),
 'param_min_samples_split': masked_array(data=[3, 2, 3, 2, 3, 3, 2, 3, 3, 3],
              mask=[False, False, False, False, 

In [21]:
rand_cv.best_params_

{'n_estimators': 150,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_depth': 13}

In [22]:
rand_cv.best_estimator_

RandomForestRegressor(max_depth=13, n_estimators=150)

In [23]:
rand_cv.best_estimator_.score(X_test, y_test)

0.8179564897061667
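With its default `n_iter=10`, RandomizedSearchCV evaluates only 10 of the 36 combinations, which matches the 10-element arrays in `cv_results_` above. When every parameter is given as a list, the combinations are drawn without replacement. The sketch below imitates that idea with the standard library; it is not scikit-learn's own sampler, and the seed is an arbitrary assumption.

```python
import random
from itertools import product

param_grid = {'n_estimators': [50, 100, 150],
              'max_depth': [10, 13, 16],
              'min_samples_split': [2, 3],
              'min_samples_leaf': [1, 2]}

keys = sorted(param_grid)
all_candidates = [dict(zip(keys, values))
                  for values in product(*(param_grid[k] for k in keys))]

rng = random.Random(0)                      # fixed seed (assumption) for reproducibility
sampled = rng.sample(all_candidates, 10)    # 10 distinct combinations, like n_iter=10

for params in sampled:
    print(params)
print("fraction of the grid evaluated: %.2f" % (len(sampled) / len(all_candidates)))
```

Because a different subset is drawn on every run unless `random_state` is set, the best parameters found this way can differ from those of the full grid search.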

## OOB Score

The OOB (out-of-bag) score is an estimate in which, for each $x_i$, only those trees that did not see $x_i$ as a training example are used for the prediction.
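Each tree in the forest is trained on a bootstrap sample drawn with replacement, so some rows are left out of every tree. The sketch below simulates only that sampling step to show how many rows are out-of-bag per tree; the sizes and the seed are illustrative assumptions, and no model is trained.

```python
import random

rng = random.Random(0)             # fixed seed (assumption)
n_samples, n_trees = 354, 100      # illustrative sizes, not taken from the notebook

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: n_samples draws with replacement
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    # Rows this tree never saw during training
    oob = set(range(n_samples)) - in_bag
    oob_fractions.append(len(oob) / n_samples)

mean_oob = sum(oob_fractions) / n_trees
print("mean OOB fraction per tree: %.3f" % mean_oob)
```

The fraction is expected to be close to $1/e \approx 0.37$, so every row is typically out-of-bag for roughly a third of the trees, which is what lets the forest be validated without a separate hold-out set.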

In [30]:
rfr = RandomForestRegressor(oob_score=True, max_depth=14)
rfr.fit(X_train, np.ravel(y_train))
print(rfr.oob_score_)
print(rfr.score(X_test, y_test))

0.8701254926733086
0.8273626067283906
