___
# Tree Regression
___

## Import Libraries

In [1]:
# Import DS environment
import sys; import os; sys.path.append(os.path.expanduser('~/Google Drive/my/projects/python/'))
from ds_setup import *
import datetime
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

%load_ext autoreload
%autoreload

## Data

This example uses scikit-learn's `diabetes` dataset.

In [2]:
from sklearn import datasets
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

data1 = pd.DataFrame(data= np.c_[diabetes['data'], diabetes['target']],
                     columns= diabetes['feature_names'] + ['target'])

# Keep only the target and the BMI and age features
data1 = T(data1).select("target", "bmi", "age")
data1.head()

Unnamed: 0,target,bmi,age
0,151.0,0.061696,0.038076
1,75.0,-0.051474,-0.001882
2,141.0,0.044451,0.085299
3,206.0,-0.011595,-0.089063
4,135.0,-0.036385,0.005383


Note that `bmi` is more strongly correlated with the target than `age` is.
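The correlation claim can be checked directly. This is a minimal sketch that assumes plain pandas instead of the `T` helper from `ds_setup`; the names `df` and `corr` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn import datasets

# Load the same dataset and build a DataFrame with features and target
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes["data"], columns=diabetes["feature_names"])
df["target"] = diabetes["target"]

# Pearson correlation of each selected feature with the target
corr = df[["bmi", "age", "target"]].corr()["target"].drop("target")
print(corr.sort_values(ascending=False))
```

A larger absolute value means a stronger linear relationship with the target.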

## Train Test Split

Now let's split the data into a training set and a test set. We will train our model on the training set and then use the test set to evaluate it.

In [3]:
X = data1[['bmi', "age"]]
y = data1['target']

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
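To see what the split produces, the sketch below repeats it in a self-contained form and prints the resulting shapes. It assumes the data is loaded with `as_frame=True` rather than through the notebook's helper; the split arguments are the same as above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load features and target as pandas objects and keep the two features used here
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]

# 40% of the rows go to the test set; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

print("train:", x_train.shape, "test:", x_test.shape)
print("test fraction:", len(x_test) / len(X))
```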

## Grid Search

In [6]:
# Model 1
from sklearn.ensemble import GradientBoostingRegressor
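# Hyperparameter grid: 4 * 4 * 3 * 4 * 4 = 768 candidates.
# Note: a float max_features is a fraction of the features, so 2.0 is likely
# out of range and candidates using it are expected to fail to fit.
param_grid1 = {'n_estimators': [1, 3, 10, 15],
               'learning_rate': [0.1, 0.05, 0.2, 0.01],
               'max_depth': [6, 4, 10],
               'min_samples_leaf': [3, 5, 9, 12],
               'max_features': [1.0, 0.3, 0.1, 2.0]}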
classifier = t().grid_search(GradientBoostingRegressor(), x_train, y_train.values.ravel(), param_grid1)

Fitting 10 folds for each of 768 candidates, totalling 7680 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done 832 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 4080 tasks      | elapsed:   10.0s



Best Estimator
---------------------------
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.2, loss='ls', max_depth=4,
                          max_features=0.3, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=12, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=10,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)


Best Estimator Parameters
---------------------------
{'learning_rate': 0.2, 'max_depth': 4, 'max_features': 0.3, 'min_samples_leaf': 12, 'n_estimators': 10}



[Parallel(n_jobs=-1)]: Done 7648 tasks      | elapsed:   15.4s
[Parallel(n_jobs=-1)]: Done 7680 out of 7680 | elapsed:   15.4s finished
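The `grid_search` helper comes from the private `ds_setup` module, so its internals are not shown here. Judging by the log output, it probably wraps scikit-learn's `GridSearchCV` with 10-fold cross-validation. The sketch below is an assumption about an equivalent call, not the helper's actual implementation; the grid is deliberately smaller so that it runs quickly, and `random_state=0` is added for reproducibility.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101
)

# Reduced illustrative grid: 2 * 2 * 2 = 8 candidates
param_grid = {
    "n_estimators": [3, 10],
    "learning_rate": [0.1, 0.2],
    "max_depth": [4, 6],
}

# cv=10 mirrors the "10 folds" in the log; pass n_jobs=-1 to use all cores
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(x_train, y_train)

print("Best parameters:", search.best_params_)
print("Best estimator:", search.best_estimator_)
```

After fitting, `search.best_estimator_` is already refit on the whole training set, so it can be used for prediction without rebuilding the model by hand.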


In [7]:
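# Model 2: random forest. X has only two columns, so max_features values
# above 2 are likely out of range and those candidates are expected to fail to fit.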
from sklearn.ensemble import RandomForestRegressor
param_grid2 = [
    {'n_estimators': [1, 3, 10, 30, 40], 'max_features': [1, 2, 4, 6, 8, 10]},
    {'bootstrap': [False], 'n_estimators': [1, 3, 10, 15], 'max_features': [1, 2, 3, 4, 7]},
]
classifier = t().grid_search(RandomForestRegressor(), x_train, y_train.values.ravel(), param_grid2, scoring='neg_mean_squared_error')

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 10 folds for each of 50 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.2s



Best Estimator
---------------------------
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=1, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=30, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)


Best Estimator Parameters
---------------------------
{'max_features': 1, 'n_estimators': 30}



[Parallel(n_jobs=-1)]: Done 466 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    1.9s finished


### Running the Best Grid Search Estimator

In [13]:
# Rebuild the best estimator found for Model 1. Only the tuned hyperparameters
# are passed; all other arguments in the printed estimator were defaults.
# Note: despite its name, rf_reg is a gradient boosting model, not a random forest.
rf_reg = GradientBoostingRegressor(learning_rate=0.2, max_depth=4,
                                   max_features=0.3, min_samples_leaf=12,
                                   n_estimators=10)

model = rf_reg.fit(x_train, y_train.values.ravel())

### Run Validation

Let's validate the model with cross-validation on the full dataset. The helper reports one score per fold (10 folds here), along with their mean and standard deviation.

In [14]:
t().cross_validation_score(rf_reg, X, y.values.ravel())

Scores: [64.69358345 58.84199201 66.98050736 70.08492884 61.95801858 72.85564692
 65.35187267 54.13902596 71.89730974 56.42232332]

Mean: 64.32252088533538
Standard deviation: 6.110670296341487
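The `cross_validation_score` helper is also part of `ds_setup`. Since the scores are on the same scale as the RMSE reported later, it plausibly computes a per-fold RMSE. The sketch below is one way to do that with scikit-learn's `cross_val_score`; the scoring choice, `cv=10`, and `random_state=0` are assumptions, so the numbers will not match the output above exactly.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]

reg = GradientBoostingRegressor(
    learning_rate=0.2, max_depth=4, max_features=0.3,
    min_samples_leaf=12, n_estimators=10, random_state=0,
)

# scikit-learn maximizes scores, so MSE is returned negated; flip the sign
# and take the square root to get an RMSE per fold
neg_mse = cross_val_score(reg, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-neg_mse)

print("Scores:", rmse_scores)
print("Mean:", rmse_scores.mean())
print("Standard deviation:", rmse_scores.std())
```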


### Eval

In [15]:
predictions = model.predict(x_test)

In [16]:
T().regression_score(y_test, predictions)

MAE: 57.928787303694975
MSE: 4695.404981118185
RMSE: 68.52302518948055
r2: 0.27827739610982494
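The four metrics printed by `regression_score` can be reproduced with `sklearn.metrics`. This is a minimal sketch on a small hand-made example; the arrays `y_true` and `y_pred` are illustrative and stand in for `y_test` and `predictions`, and the helper's actual implementation may differ.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Tiny illustrative example; in the notebook these would be y_test and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("r2:", r2)
```

RMSE is usually the easiest of these to interpret because it is expressed in the same units as the target.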


An R² of 0.278 is low: the model explains only about 28% of the variance in the target.
This notebook only serves to illustrate how to run the code, not as an example of a good model.

### Interpretation

In [17]:
# One of the benefits of tree-based models is that we can see how important each feature is.
# Note: these values are feature importances, not regression coefficients.
coeff_df = pd.DataFrame(model.feature_importances_ , X.columns, columns=['Coefficient'])
T(coeff_df).sort("Coefficient", ascending=False)

Unnamed: 0,Coefficient
bmi,0.819976
age,0.180024
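The same table can be built without the `T` helper. The sketch below assumes plain pandas and refits the model with `random_state=0`, so the exact values may differ from those above. It also shows that the importances are normalized to sum to 1, which is why they read as shares of the total.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X = X[["bmi", "age"]]

reg = GradientBoostingRegressor(
    learning_rate=0.2, max_depth=4, max_features=0.3,
    min_samples_leaf=12, n_estimators=10, random_state=0,
).fit(X, y)

# Impurity-based importances, indexed by feature name and sorted descending
importances = pd.Series(reg.feature_importances_, index=X.columns, name="importance")
importances = importances.sort_values(ascending=False)

print(importances)
print("Sum:", importances.sum())
```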
