# Decision Tree

For this homework, take the Boston house-prices dataset (sklearn.datasets.load_boston) and do the same for a regression task: try different algorithms, tune their parameters, and report the final quality.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html

https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from matplotlib import pyplot as plt

In [15]:
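# load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

boston = load_boston()
# Skip the first line of the bundled CSV so that the column names are read as the header
data = pd.read_csv(boston['filename'], skiprows=1)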

In [16]:
print(boston['DESCR'])

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [17]:
data

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0,396.90,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.90,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0


In [18]:
X = boston.data
y = boston.target

X.shape, y.shape

((506, 13), (506,))

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# No random_state is set, so the split (and therefore the score) changes on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
reg.score(X_test, y_test)

0.6161354828664029
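
For a regressor, `score` returns the coefficient of determination R² rather than an accuracy. As a minimal sketch of what that number means, it can be computed by hand from a set of predictions. The helper name `r2_by_hand` and the toy values are illustrative assumptions, not part of the notebook.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper, for illustration only)."""
    mean = sum(y_true) / len(y_true)
    # Squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_by_hand(y_true, y_true)                 # 1.0: every prediction is exact
baseline = r2_by_hand(y_true, [6.0] * 4)             # 0.0: no better than the mean
reversed_ = r2_by_hand(y_true, [9.0, 7.0, 5.0, 3.0]) # -3.0: worse than the mean

print(perfect, baseline, reversed_)
```

A score of 1 is a perfect fit, 0 matches a constant prediction of the mean, and negative values mean the model does worse than that constant.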

In [78]:
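# A new random split, so scores are not directly comparable with the previous cell
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Criterion names as of the scikit-learn version used here;
# newer releases call 'mse' and 'mae' 'squared_error' and 'absolute_error'
criteria = ['mse', 'friedman_mse', 'mae']

for criterion in criteria:
    # None means no depth limit, followed by depths 1..19
    for max_depth in [None] + list(range(1, 20)):
        reg = DecisionTreeRegressor(criterion=criterion, max_depth=max_depth)
        reg.fit(X_train, y_train)

        score = reg.score(X_test, y_test)  # R^2 on the held-out part
        print('criterion: {}, max_depth: {}. SCORE: {}'.format(criterion, max_depth, score))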

criterion: mse, max_depth: None. SCORE: 0.8426415508742308
criterion: mse, max_depth: 1. SCORE: 0.40764310361554723
criterion: mse, max_depth: 2. SCORE: 0.669868297445114
criterion: mse, max_depth: 3. SCORE: 0.8277278897441037
criterion: mse, max_depth: 4. SCORE: 0.8512739398706871
criterion: mse, max_depth: 5. SCORE: 0.8481938526851687
criterion: mse, max_depth: 6. SCORE: 0.8690876907011398
criterion: mse, max_depth: 7. SCORE: 0.8580629155796035
criterion: mse, max_depth: 8. SCORE: 0.855363347461428
criterion: mse, max_depth: 9. SCORE: 0.8457179923497664
criterion: mse, max_depth: 10. SCORE: 0.8284936241137797
criterion: mse, max_depth: 11. SCORE: 0.8528544068622782
criterion: mse, max_depth: 12. SCORE: 0.8440955594553676
criterion: mse, max_depth: 13. SCORE: 0.8348331042907089
criterion: mse, max_depth: 14. SCORE: 0.8455165396196375
criterion: mse, max_depth: 15. SCORE: 0.8524756214010655
criterion: mse, max_depth: 16. SCORE: 0.8542814696002403
criterion: mse, max_depth: 17. SCORE: 0
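
The loop above chooses `max_depth` by looking at the test score, so the best value it reports is an optimistic estimate. One common alternative is to tune with cross-validation on the training part only and use the test part once at the end. The following is a minimal sketch under stated assumptions: the data comes from `make_regression` as a synthetic stand-in with the same shape as the Boston data, and the grid values and `random_state` seeds are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 506 rows and 13 features
X, y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Search only over max_depth; None means the tree is grown without a depth limit
param_grid = {"max_depth": [None, 2, 3, 4, 5, 6, 8, 10]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid, cv=cv, scoring="r2")
search.fit(X_train, y_train)  # cross-validation happens inside the training part

print("best max_depth:", search.best_params_["max_depth"])
print("mean CV R^2:", search.best_score_)
# The test part is used once, after the choice has been made
print("test R^2:", search.score(X_test, y_test))
```

Other parameters, such as `min_samples_leaf` or the split criterion, could be added to `param_grid` in the same way.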

In [79]:
from sklearn.model_selection import cross_val_score

reg = DecisionTreeRegressor()
scores = cross_val_score(reg, X, y, cv=10)
print(scores)
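# For a regressor these scores are R^2 values; the "Accuracy" label is kept to match the recorded output
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))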

[ 0.56527908  0.55479704 -1.58830635  0.50946038  0.76635322  0.21421668
  0.24819822  0.37364072 -2.43854234  0.10856761]
Accuracy: -0.07 (+/- 2.02)
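
The strongly negative folds deserve a closer look. With an integer `cv`, `cross_val_score` on a regressor splits the rows into consecutive blocks without shuffling, so if the rows are ordered in some meaningful way, each test block can look unlike the rest of the data. Whether that explains the numbers above is an assumption. The sketch below only demonstrates the effect on synthetic data that is deliberately sorted by target, and shows how passing a shuffled `KFold` changes the result.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a noisy linear target, with rows sorted by target value
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)
order = np.argsort(y)
X, y = X[order], y[order]

reg = DecisionTreeRegressor(random_state=0)

# cv=10: ten consecutive, unshuffled folds
plain = cross_val_score(reg, X, y, cv=10)

# Same number of folds, but rows are shuffled before splitting
shuffled_cv = KFold(n_splits=10, shuffle=True, random_state=0)
shuffled = cross_val_score(reg, X, y, cv=shuffled_cv)

print("unshuffled mean R^2: %0.2f" % plain.mean())
print("shuffled mean R^2:   %0.2f" % shuffled.mean())
```

Running the same comparison on the real `X` and `y` would show whether row order is behind the negative folds in this notebook.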


Conclusions:
1. Increasing `max_depth` up to about 3 to 5 improves the model's score; raising it further has no noticeable effect. In the visible `mse` results, the test R² stays between roughly 0.83 and 0.87 for depths 3 to 16.
2. The model's quality also depends only weakly on the criterion used (MSE, mean squared error; MAE, mean absolute error; friedman_mse).
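
The first conclusion rests on a single random split. One way to examine it more systematically is a validation curve, which compares training and validation R² for each depth across several folds. This is a minimal sketch, again assuming synthetic `make_regression` data as a stand-in for the Boston data; the depth range and seeds are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 506 rows and 13 features
X, y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)

depths = list(range(1, 16))
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Both result arrays have shape (number of depths, number of folds)
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=cv, scoring="r2")

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, val_mean):
    print("max_depth=%2d  train R^2=%.3f  validation R^2=%.3f" % (depth, tr, va))
```

If the training score keeps rising with depth while the validation score levels off, the extra depth is fitting noise, which would be consistent with deeper trees not improving the test score.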