**COURSE**: *Machine Learning* in Geosciences<br />
**Instructor**: Edier Aristizábal (evaristizabalg@unal.edu.co) <br />
**Credits**: The content of this notebook is taken from several sources. Every effort has been made to trace copyright holders of the materials used in this notebook. The author apologizes for any unintentional omissions and would be pleased to add an acknowledgment in future editions.


# 20: Ensemble methods

The two most popular ensemble methods are:

**Bagging**. Trains and combines multiple models, usually of the same type, in parallel and independently, each on a different sample of the training set.

**Boosting**. Trains and combines multiple models, usually of the same type, sequentially, so that each individual model learns from the errors of the previous one.

## *Bagging*
*Bagging* (bootstrap aggregating) combines multiple models, each trained on a random sample of the training set drawn with replacement. The final result is the average (regression) or the mode (classification) of the predictions of all sub-models. The best-known *bagging* models are: (i) *Bagged Decision Trees*, (ii) *Random Forest*.
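To make the resampling and voting steps concrete, here is a minimal hand-written sketch of bagging with decision trees on iris. It is an illustration only, not how `BaggingClassifier` is implemented; the ensemble size (`n_models = 25`) and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
n_models = 25                   # illustrative ensemble size (assumption)
all_preds = []

for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

all_preds = np.array(all_preds)  # shape: (n_models, n_test_samples)

# Majority vote (mode): count the votes per class for every test sample
n_classes = len(np.unique(y))
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=n_classes)
bagged_pred = votes.argmax(axis=0)

print("Bagged accuracy:", (bagged_pred == y_test).mean())
```

For regression the vote would simply be replaced by the mean of the sub-model predictions.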

### *Decision tree & Bagged Decision Tree*

In [10]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")
from sklearn.model_selection import cross_val_score

In [2]:
from sklearn.datasets import load_iris
iris=load_iris()
X=iris['data']
y=iris['target']

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

In [5]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion='entropy', min_samples_split=20, min_samples_leaf=10)

In [6]:
# Fit on the full dataset; X_test is part of X, so the test score below is optimistic
dtc.fit(X,y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=10, min_samples_split=20,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [7]:
print(f'Decision tree has {dtc.tree_.node_count} nodes with maximum depth {dtc.tree_.max_depth}.')

Decision tree has 11 nodes with maximum depth 4.


In [None]:
from sklearn.tree import export_graphviz

# Export to .dot
export_graphviz(dtc,'tree.dot',rounded=True,max_depth=5,feature_names=iris.feature_names,class_names=iris.target_names,filled=True)

# Convert to png
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=400']);
# If the tree.dot file is not converted, go to http://www.webgraphviz.com/ and load the dot file there to visualize it.

from IPython.display import Image
Image(filename='tree.png')

In [9]:
y_pred = dtc.predict(X_test)
print(dtc.score(X_test,y_test))
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [10]:
importances = list(dtc.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(iris.feature_names, importances)]
feature_importances

[('sepal length (cm)', 0.01),
 ('sepal width (cm)', 0.01),
 ('petal length (cm)', 0.66),
 ('petal width (cm)', 0.32)]

The Decision Tree model is now run with *bagging*. Note that the base algorithm is passed as an argument to *BaggingClassifier*.

In [11]:
from sklearn.ensemble import BaggingClassifier
bc = BaggingClassifier(base_estimator=dtc, n_estimators=100)
bc.fit(X,y)

BaggingClassifier(base_estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                        class_weight=None,
                                                        criterion='entropy',
                                                        max_depth=None,
                                                        max_features=None,
                                                        max_leaf_nodes=None,
                                                        min_impurity_decrease=0.0,
                                                        min_impurity_split=None,
                                                        min_samples_leaf=10,
                                                        min_samples_split=20,
                                                        min_weight_fraction_leaf=0.0,
                                                        presort='deprecated',
                                                        random_state=N

In [12]:
y_pred = bc.predict(X_test)
print(bc.score(X_test,y_test))
print(classification_report(y_test,y_pred))

0.9666666666666667
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30



In [13]:
from sklearn.model_selection import KFold
# random_state removed: it has no effect without shuffle=True and raises an error in recent scikit-learn releases
kfold = KFold(n_splits=5)
results = cross_val_score(bc, X, y, cv=kfold)
print(results.mean())

0.9333333333333333
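One caveat about the cross-validation above: the iris rows are ordered by class, so an unshuffled `KFold` produces test folds that contain only one or two of the three classes. The sketch below, which is an illustration and not part of the original analysis, compares the classes present in each test fold for plain, shuffled, and stratified splitting; the helper name `classes_per_test_fold` is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)

def classes_per_test_fold(cv):
    """Return the set of class labels present in each test fold."""
    return [set(y[test_idx].tolist()) for _, test_idx in cv.split(X, y)]

plain = classes_per_test_fold(KFold(n_splits=5))
shuffled = classes_per_test_fold(KFold(n_splits=5, shuffle=True, random_state=1))
stratified = classes_per_test_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
)

print("KFold, no shuffle:", plain)
print("KFold, shuffled  :", shuffled)
print("StratifiedKFold  :", stratified)
```

Passing `shuffle=True` (or using `StratifiedKFold`) usually gives a fairer estimate on class-sorted data, although the mean score will then differ from the one reported above.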


This means that any method can be *bagged*. The *bagged* SVC model is shown next.

In [51]:
from sklearn.svm import SVC
svc=SVC()
bc_svc = BaggingClassifier(base_estimator=svc, n_estimators=100, random_state=1)
bc_svc.fit(X,y)

BaggingClassifier(base_estimator=SVC(C=1.0, cache_size=200, class_weight=None,
                                     coef0=0.0, decision_function_shape='ovr',
                                     degree=3, gamma='auto_deprecated',
                                     kernel='rbf', max_iter=-1,
                                     probability=False, random_state=None,
                                     shrinking=True, tol=0.001, verbose=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=100, n_jobs=None,
                  oob_score=False, random_state=1, verbose=0, warm_start=False)

In [52]:
y_pred = bc_svc.predict(X_test)
print(bc_svc.score(X_test,y_test))
print(classification_report(y_test,y_pred))

0.9666666666666667
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30
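The same pattern works for other base learners. The following sketch bags a k-nearest-neighbours classifier and, unlike the cells above, fits on the training split only. It assumes a helper named `make_bagging` (hypothetical, not part of scikit-learn) to cope with the fact that the `base_estimator` argument was renamed to `estimator` in newer scikit-learn releases.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def make_bagging(base, **kwargs):
    """Hypothetical helper: build a BaggingClassifier on old and new scikit-learn."""
    try:
        return BaggingClassifier(estimator=base, **kwargs)       # newer releases
    except TypeError:
        return BaggingClassifier(base_estimator=base, **kwargs)  # older releases

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# n_neighbors and n_estimators are illustrative values, not tuned
bc_knn = make_bagging(KNeighborsClassifier(n_neighbors=5),
                      n_estimators=50, random_state=1)
bc_knn.fit(X_train, y_train)  # fit on the training split only
print(bc_knn.score(X_test, y_test))
```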



### *Random Forests*

In [16]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, max_features=3)

In [17]:
rf.fit(X, y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=3,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [18]:
y_pred = rf.predict(X_test)
print(rf.score(X_test,y_test))
print(classification_report(y_test,y_pred))

1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         6

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



In [19]:
results = cross_val_score(rf, X, y, cv=kfold)
print(results.mean())

0.9066666666666666


In [45]:
importances = list(rf.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(iris.feature_names, importances)]
feature_importances

[('sepal length (cm)', 0.01),
 ('sepal width (cm)', 0.01),
 ('petal length (cm)', 0.46),
 ('petal width (cm)', 0.51)]

## Regression

In [2]:
# Note: load_boston was removed in scikit-learn 1.2; this cell requires an older release
from sklearn.datasets import load_boston
X,y=load_boston(return_X_y=True)

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)

In [53]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
print(rf.get_params())

{'bootstrap': True, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 'warn', 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


The *pprint* function prints this information in an organized way:

In [54]:
from pprint import pprint
pprint(rf.get_params())

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [58]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 100, num = 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [59]:
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 100, scoring='neg_mean_absolute_error', 
                              cv = 3, verbose=2, random_state=42, n_jobs=-1)

# Fit the random search model
rf_random.fit(X, y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   48.7s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  2.8min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  5.7min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators='warn',
                                                   n_jobs=None, oob_score=False,
                                                   random_sta...


In [60]:
rf_random.best_params_

{'n_estimators': 1800,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 10,
 'bootstrap': False}

In [61]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 15],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [3, 5, 7],
    'n_estimators': [1000, 1500, 1800, 2000]
}

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                           scoring = 'neg_mean_absolute_error', cv = 3, 
                           n_jobs = -1, verbose = 2)

In [62]:
# Fit the grid search to the data
grid_search.fit(X,y)

Fitting 3 folds for each of 108 candidates, totalling 324 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 324 out of 324 | elapsed:  9.9min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'bootstrap': [True], 'max_depth':

In [63]:
grid_search.best_params_

{'bootstrap': True,
 'max_depth': 10,
 'min_samples_leaf': 5,
 'min_samples_split': 7,
 'n_estimators': 1000}

In [144]:
rf_best1 = RandomForestRegressor(n_estimators=1800,min_samples_split=5,min_samples_leaf=4,max_features='sqrt',max_depth=10,bootstrap=False)
rf_best1.fit(X_train, y_train);
y_pred = rf_best1.predict(X_test)
print(rf_best1.score(X_test,y_test))

0.9447210967705905


In [67]:
rf_best2 = RandomForestRegressor(bootstrap=True, max_depth=10, min_samples_leaf=5, min_samples_split=7, n_estimators=1000)
rf_best2.fit(X_train, y_train);
y_pred = rf_best2.predict(X_test)
print(rf_best2.score(X_test,y_test))

0.9406467397060365
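Note that `score` on a regressor returns the coefficient of determination R², whereas both searches above were ranked by `neg_mean_absolute_error`. The sketch below computes both metrics side by side. It uses synthetic data from `make_regression` (an assumption made only so the example runs on any scikit-learn release), so its numbers are not comparable to the results above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 samples, 8 features, noisy target
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.2, random_state=1)

reg = RandomForestRegressor(n_estimators=100, random_state=1).fit(Xtr, ytr)
pred = reg.predict(Xte)

r2 = r2_score(yte, pred)              # same quantity that reg.score(Xte, yte) returns
mae = mean_absolute_error(yte, pred)  # the metric optimized by the searches above

print(f"R2 : {r2:.3f}")
print(f"MAE: {mae:.3f}")
```

Comparing tuned models on the metric that was optimized, here the MAE, keeps the model selection and the final evaluation consistent.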


## Boosting Algorithms
Boosting algorithms build a sequence of models, each of which learns from the errors of the previous ones. Predictions are then made by the weighted combination of the models. The two *boosting* algorithms covered here are: (i) AdaBoost, and (ii) Stochastic Gradient Boosting.
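The idea of learning from previous errors can be sketched in a few lines for regression with squared loss: each new shallow tree is fit to the residuals of the current ensemble and added with a small weight. This is a simplified illustration of gradient boosting, not AdaBoost's sample reweighting and not scikit-learn's implementation; the synthetic data, `learning_rate`, and `n_rounds` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption): noisy sine curve
rng = np.random.default_rng(0)
X_demo = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1  # weight given to each new tree (illustrative)
n_rounds = 50        # number of sequential models (illustrative)

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    residual = y_demo - prediction  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)      # the new model learns those errors
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE, first and last round:", mse_history[0], mse_history[-1])
```

The training error shrinks as rounds are added, which is also why boosting can overfit if the number of rounds or the learning rate is too large.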

### AdaBoost

In [32]:
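from sklearn.ensemble import AdaBoostClassifier
X, y = load_iris(return_X_y=True)  # restore iris: X and y were overwritten by the regression data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
adb = AdaBoostClassifier(n_estimators=30, random_state=1)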

In [33]:
adb.fit(X_train,y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=30, random_state=1)

In [34]:
y_pred = adb.predict(X_test)
print(adb.score(X_test,y_test))
print(classification_report(y_test,y_pred))

0.9666666666666667
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30



In [46]:
importances = list(adb.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(iris.feature_names, importances)]
feature_importances

[('sepal length (cm)', 0.0),
 ('sepal width (cm)', 0.0),
 ('petal length (cm)', 0.5),
 ('petal width (cm)', 0.5)]

In [8]:
results = cross_val_score(adb, X, y, cv=kfold)
print(results.mean())

0.9400000000000001


### Stochastic Gradient Boosting

In [35]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, random_state=1)

In [36]:
gbc.fit(X_train,y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=1, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [37]:
y_pred = gbc.predict(X_test)
print(gbc.score(X_test,y_test))
print(classification_report(y_test,y_pred))

0.9666666666666667
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.92      0.96        13
           2       0.86      1.00      0.92         6

    accuracy                           0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30



In [None]:
results = cross_val_score(gbc, X, y, cv=kfold)
print(results.mean())

In [None]:
importances = list(gbc.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(iris.feature_names, importances)]
feature_importances

## XGBoost

In [1]:
import xgboost as xgb 
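# iris has three classes, so a multiclass objective is used (the fitted model below reports 'multi:softprob')
model = xgb.XGBClassifier(objective='multi:softprob', n_estimators= 7, seed=44)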

In [4]:
model.fit(X_train, y_train) 

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=7, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=44, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, seed=44, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)

In [8]:
y_pred = model.predict(X_test)
accuracy = float(np.sum(y_pred == y_test)) / y_test.shape[0]
accuracy

0.058823529411764705

**Note**: this accuracy equals 6/102, which is not attainable on the 30-sample iris test set. It is consistent with the 102-sample test split of the regression data, so the value most likely comes from a run in which `X_train` and `y_train` still held the regression targets. It should not be read as a classification accuracy on iris.