### Data Mining and Machine Learning
### Ensembles of classifiers: Bagging, Boosting, Gradient Boosting
#### Datasets:  Diabetes and Landsat
#### Edgar Acuna
#### November 2021

In [35]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn import metrics
from sklearn import model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init(ip="localhost", port=54323)
#h2o.no_progress()

Checking whether there is an H2O instance running at http://localhost:54323 . connected.


H2O_cluster_uptime: 29 mins 24 secs
H2O_cluster_timezone: America/Halifax
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.34.0.3
H2O_cluster_version_age: 1 month and 24 days
H2O_cluster_name: H2O_from_python_eacun_cd0dzs
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 67.0 Mb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8


### Bootstrap Samples

In [36]:
#This is the original training sample L
x=[5,3,12,13,21,31,8,9,15,17,24,32] 

In [37]:
#This is a bootstrap sample (sampling with replacement)
boot1=np.random.choice(x,12)
print(boot1)

[ 9  5 21 21 17  8 13 24  8 17 32 21]


In [38]:
np.unique(boot1)

array([ 5,  8,  9, 13, 17, 21, 24, 32])

In [39]:
#Another bootstrap sample
boot2=np.random.choice(x,12)
print(boot2)

[ 5  5 31 17  8 17 31 15 15  5 17 31]


In [40]:
np.unique(boot2)

array([ 5,  8, 15, 17, 31])

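Note: On average, approximately 37% of the instances of the training sample L do NOT appear in a given bootstrap sample, because each instance is left out with probability (1 - 1/n)^n, which tends to 1/e ≈ 0.368. In the above examples, 33.33% (4 of 12) and 58.33% (7 of 12) of the instances do not appear in the first and second bootstrap samples, respectively.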
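
The fraction of L that is missing from a bootstrap sample can be checked directly from the two samples printed above. The sketch below uses only the standard library. The helper `missing_fraction`, the seed and the number of repetitions are illustrative choices, not part of the original notebook.

```python
import math
import random

# Original training sample L and the two bootstrap samples printed above
x = [5, 3, 12, 13, 21, 31, 8, 9, 15, 17, 24, 32]
boot1 = [9, 5, 21, 21, 17, 8, 13, 24, 8, 17, 32, 21]
boot2 = [5, 5, 31, 17, 8, 17, 31, 15, 15, 5, 17, 31]

def missing_fraction(sample, original):
    """Fraction of the original instances that never appear in the sample."""
    return 1 - len(set(sample)) / len(original)

frac1 = missing_fraction(boot1, x)  # 4 of the 12 instances are missing
frac2 = missing_fraction(boot2, x)  # 7 of the 12 instances are missing

# Probability that a given instance is left out of a bootstrap sample of size n
n = len(x)
p_out = (1 - 1 / n) ** n
limit = math.exp(-1)

# Monte Carlo check with a hypothetical seed and number of repetitions
rng = random.Random(0)
reps = 20000
avg_missing = sum(
    missing_fraction(rng.choices(x, k=n), x) for _ in range(reps)
) / reps

print(f"missing in boot1: {frac1:.4f}, missing in boot2: {frac2:.4f}")
print(f"(1 - 1/n)^n for n={n}: {p_out:.4f}, limit 1/e: {limit:.4f}")
print(f"simulated average missing fraction: {avg_missing:.4f}")
```

A single bootstrap sample can deviate noticeably from the average, as the two samples above do. The 37% figure describes the expected fraction for large n.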

### I. Bagging for Diabetes using trees and scikit-learn

In [41]:
url= "http://academic.uprm.edu/eacuna/diabetes.dat"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_table(url, names=names,header=None)
#The response variable must be binary (0,1), so the class labels (1,2) are shifted by one
y=data['class']-1
X=data.iloc[:,0:8]
modeltree = tree.DecisionTreeClassifier()
bagging = BaggingClassifier(modeltree,n_estimators=100)

In [42]:
# Accuracy rate by resubstitution
bagging.fit(X, y)
predictions = bagging.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       268

    accuracy                           1.00       768
   macro avg       1.00      1.00      1.00       768
weighted avg       1.00      1.00      1.00       768



In [43]:
#Estimating the accuracy by cross validation
kfold = model_selection.KFold(n_splits=10, shuffle= True,random_state=99)
results = model_selection.cross_val_score(bagging, X, y, cv=kfold)
print(results.mean())

0.7566643882433356
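
For readers who want to see what `cross_val_score` does with a `KFold` splitter, the following is a minimal sketch of the equivalent manual loop. It assumes a synthetic dataset from `make_classification` as a stand-in for the Diabetes data, and a single decision tree with a fixed `random_state` so that both computations are deterministic. The names `X_demo`, `y_demo` and `fold_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

model = DecisionTreeClassifier(random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=99)

# Manual loop: fit a fresh copy on 9 folds, score it on the held-out fold
fold_scores = []
for train_idx, test_idx in kfold.split(X_demo):
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

manual_mean = float(np.mean(fold_scores))
library_mean = float(cross_val_score(model, X_demo, y_demo, cv=kfold).mean())
print(manual_mean, library_mean)
```

Because every instance is scored by a model that never saw it during fitting, this estimate is far less optimistic than the resubstitution accuracy.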


#### Out-of-Bag accuracy

In [44]:
bagging1 = BaggingClassifier(modeltree,n_estimators=50, oob_score=True)
bagging1.fit(X, y)
bagging1.oob_score_

0.75
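
The out-of-bag score can also be recovered by hand from the fitted ensemble. The sketch below is an illustration on synthetic data (an assumption, since the original cell uses the Diabetes data). It relies on the `oob_decision_function_` attribute that `BaggingClassifier` exposes when `oob_score=True`. The names `X_demo`, `y_demo`, `bag` and the seeds are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        oob_score=True, random_state=1)
bag.fit(X_demo, y_demo)

# Each row of oob_decision_function_ averages the class probabilities given by
# the trees whose bootstrap sample did not contain that instance
oob_pred = bag.classes_[np.argmax(bag.oob_decision_function_, axis=1)]
manual_oob = float(np.mean(oob_pred == y_demo))
print(manual_oob, bag.oob_score_)
```

The out-of-bag estimate needs no extra model fits, which makes it a cheap alternative to cross-validation for bagged ensembles.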

### II. AdaBoosting for Diabetes using scikit-learn

In [45]:
adaboost = AdaBoostClassifier(modeltree,n_estimators=100,learning_rate=1)
adaboost.fit(X, y)
predictions = adaboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       268

    accuracy                           1.00       768
   macro avg       1.00      1.00      1.00       768
weighted avg       1.00      1.00      1.00       768



In [46]:
#Estimating the accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10,shuffle= True, random_state=999)
results = model_selection.cross_val_score(adaboost, X, y, cv=kfold)
print(results.mean())

0.6875768967874232
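
One plausible reason for the low cross-validated accuracy is that the base learner here is a fully grown tree. Such a tree usually classifies its training data perfectly, so the weighted error in the first boosting round is zero and scikit-learn stops boosting after one estimator. The sketch below illustrates this on synthetic data. The data, the seeds and the comparison with a depth-1 stump are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=2)

# Fully grown tree as base learner (as in the cell above) versus a depth-1 stump
ada_full = AdaBoostClassifier(DecisionTreeClassifier(random_state=0),
                              n_estimators=100, random_state=0)
ada_stump = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, random_state=0),
                               n_estimators=100, random_state=0)
ada_full.fit(X_demo, y_demo)
ada_stump.fit(X_demo, y_demo)

# Number of estimators actually kept by each ensemble
n_full = len(ada_full.estimators_)
n_stump = len(ada_stump.estimators_)
print(n_full, n_stump)
```

If the same happens on the Diabetes data, the "ensemble" behaves like a single unpruned tree. Limiting the depth of the base tree (for example with `max_depth`) would let boosting run for more rounds.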


### III. Gradient Boosting for Diabetes using scikit-learn

In [47]:
gboost = GradientBoostingClassifier(n_estimators=100)
#X_train, X_train_lr, y_train, y_train_lr = train_test_split(X,y,test_size=0.5)
gboost.fit(X, y)
predictions = gboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.90      0.96      0.93       500
           1       0.91      0.81      0.86       268

    accuracy                           0.91       768
   macro avg       0.91      0.88      0.89       768
weighted avg       0.91      0.91      0.90       768



In [48]:
#Estimating the accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10, shuffle= True, random_state=999)
results = model_selection.cross_val_score(gboost, X, y, cv=kfold)
print(results.mean())

0.7591592617908407


### IV. Gradient Boosting for Diabetes using H2O

In [49]:
diabetes = h2o.import_file("https://academic.uprm.edu/eacuna/diabetes.dat")
myx=['C1','C2','C3','C4','C5','C6','C7','C8']
diabetes['C9']=diabetes['C9'].asfactor()
myy="C9"
gbm1 = H2OGradientBoostingEstimator(model_id="gbm_covType_v1",ntrees = 100, max_depth=4,nfolds=10, sample_rate = 1,col_sample_rate = 1,seed=20000)
gbm1.train(myx, myy, training_frame=diabetes)
y_pred=gbm1.predict(diabetes)
print((y_pred['predict']==diabetes['C9']).mean())

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
[0.9388020833333334]


In [50]:
#Accuracy by resubstitution
gbm1.model_performance(diabetes)


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.06189488461506235
RMSE: 0.24878682564609877
LogLoss: 0.2278218904572922
Mean Per-Class Error: 0.06844776119402984
AUC: 0.9836082089552238
AUCPR: 0.973810489785994
Gini: 0.9672164179104477

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4582730505624667: 


| Actual / Predicted | 1 | 2 | Error | Rate |
|---|---|---|---|---|
| 1 | 482.0 | 18.0 | 0.036 | (18.0/500.0) |
| 2 | 28.0 | 240.0 | 0.1045 | (28.0/268.0) |
| Total | 510.0 | 258.0 | 0.0599 | (46.0/768.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.458273 | 0.912548 | 171.0 |
| max f2 | 0.333183 | 0.92914 | 206.0 |
| max f0point5 | 0.538282 | 0.936441 | 148.0 |
| max accuracy | 0.458273 | 0.940104 | 171.0 |
| max precision | 0.985922 | 1.0 | 0.0 |
| max recall | 0.104561 | 1.0 | 309.0 |
| max specificity | 0.985922 | 1.0 | 0.0 |
| max absolute_mcc | 0.458273 | 0.86739 | 171.0 |
| max min_per_class_accuracy | 0.412628 | 0.929104 | 185.0 |
| max mean_per_class_accuracy | 0.412628 | 0.931552 | 185.0 |



Gains/Lift Table: Avg response rate: 34.90 %, avg score: 34.90 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010417,0.974314,2.865672,2.865672,1.0,0.979208,1.0,0.979208,0.029851,0.029851,186.567164,186.567164,0.029851
1,2,0.020833,0.964821,2.865672,2.865672,1.0,0.969928,1.0,0.974568,0.029851,0.059701,186.567164,186.567164,0.059701
2,3,0.03125,0.955925,2.865672,2.865672,1.0,0.960038,1.0,0.969725,0.029851,0.089552,186.567164,186.567164,0.089552
3,4,0.040365,0.952782,2.865672,2.865672,1.0,0.954008,1.0,0.966176,0.026119,0.115672,186.567164,186.567164,0.115672
4,5,0.050781,0.941174,2.865672,2.865672,1.0,0.946141,1.0,0.962066,0.029851,0.145522,186.567164,186.567164,0.145522
5,6,0.10026,0.897921,2.865672,2.865672,1.0,0.919839,1.0,0.941227,0.141791,0.287313,186.567164,186.567164,0.287313
6,7,0.151042,0.835725,2.865672,2.865672,1.0,0.86911,1.0,0.916981,0.145522,0.432836,186.567164,186.567164,0.432836
7,8,0.200521,0.752943,2.865672,2.865672,1.0,0.789735,1.0,0.885582,0.141791,0.574627,186.567164,186.567164,0.574627
8,9,0.300781,0.532979,2.56794,2.766428,0.896104,0.648036,0.965368,0.8064,0.257463,0.83209,156.793952,176.64276,0.81609
9,10,0.39974,0.345884,1.206599,2.380281,0.421053,0.436178,0.830619,0.714749,0.119403,0.951493,20.659859,138.028101,0.847493







In [51]:
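#Showing the confusion matrix to estimate the out-of-bag and cross-validation accuracy
#Note: without parentheses, the notebook displays the bound method, whose repr prints the full model summary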
gbm1.confusion_matrix

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_covType_v1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
0,,100.0,100.0,19371.0,4.0,4.0,4.0,5.0,16.0,10.74




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.06189488480582112
RMSE: 0.24878682602947674
LogLoss: 0.2278218932658851
Mean Per-Class Error: 0.06844776119402984
AUC: 0.9836082089552238
AUCPR: 0.973810489785994
Gini: 0.9672164179104477

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4582730550290878: 


| Actual / Predicted | 1 | 2 | Error | Rate |
|---|---|---|---|---|
| 1 | 482.0 | 18.0 | 0.036 | (18.0/500.0) |
| 2 | 28.0 | 240.0 | 0.1045 | (28.0/268.0) |
| Total | 510.0 | 258.0 | 0.0599 | (46.0/768.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.458273 | 0.912548 | 171.0 |
| max f2 | 0.333183 | 0.92914 | 206.0 |
| max f0point5 | 0.538282 | 0.936441 | 148.0 |
| max accuracy | 0.458273 | 0.940104 | 171.0 |
| max precision | 0.985922 | 1.0 | 0.0 |
| max recall | 0.104561 | 1.0 | 309.0 |
| max specificity | 0.985922 | 1.0 | 0.0 |
| max absolute_mcc | 0.458273 | 0.86739 | 171.0 |
| max min_per_class_accuracy | 0.412628 | 0.929104 | 185.0 |
| max mean_per_class_accuracy | 0.412628 | 0.931552 | 185.0 |



Gains/Lift Table: Avg response rate: 34.90 %, avg score: 34.90 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010417,0.974314,2.865672,2.865672,1.0,0.979208,1.0,0.979208,0.029851,0.029851,186.567164,186.567164,0.029851
1,2,0.020833,0.964821,2.865672,2.865672,1.0,0.969928,1.0,0.974568,0.029851,0.059701,186.567164,186.567164,0.059701
2,3,0.03125,0.955925,2.865672,2.865672,1.0,0.960038,1.0,0.969725,0.029851,0.089552,186.567164,186.567164,0.089552
3,4,0.040365,0.952782,2.865672,2.865672,1.0,0.954008,1.0,0.966176,0.026119,0.115672,186.567164,186.567164,0.115672
4,5,0.050781,0.941174,2.865672,2.865672,1.0,0.946141,1.0,0.962066,0.029851,0.145522,186.567164,186.567164,0.145522
5,6,0.10026,0.897921,2.865672,2.865672,1.0,0.919839,1.0,0.941227,0.141791,0.287313,186.567164,186.567164,0.287313
6,7,0.151042,0.835725,2.865672,2.865672,1.0,0.86911,1.0,0.916981,0.145522,0.432836,186.567164,186.567164,0.432836
7,8,0.200521,0.752943,2.865672,2.865672,1.0,0.789735,1.0,0.885582,0.141791,0.574627,186.567164,186.567164,0.574627
8,9,0.300781,0.532979,2.56794,2.766428,0.896104,0.648036,0.965368,0.8064,0.257463,0.83209,156.793952,176.64276,0.81609
9,10,0.39974,0.345884,1.206599,2.380281,0.421053,0.436178,0.830619,0.714749,0.119403,0.951493,20.659859,138.028101,0.847493




ModelMetricsBinomial: gbm
** Reported on cross-validation data. **

MSE: 0.17092323346297955
RMSE: 0.4134286316439387
LogLoss: 0.5301717811615181
Mean Per-Class Error: 0.25068656716417914
AUC: 0.8136343283582089
AUCPR: 0.6733566314887582
Gini: 0.6272686567164178

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.20111566420593152: 


| Actual / Predicted | 1 | 2 | Error | Rate |
|---|---|---|---|---|
| 1 | 337.0 | 163.0 | 0.326 | (163.0/500.0) |
| 2 | 47.0 | 221.0 | 0.1754 | (47.0/268.0) |
| Total | 384.0 | 384.0 | 0.2734 | (210.0/768.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.201116 | 0.677914 | 250.0 |
| max f2 | 0.082691 | 0.784866 | 321.0 |
| max f0point5 | 0.58949 | 0.664498 | 129.0 |
| max accuracy | 0.58949 | 0.760417 | 129.0 |
| max precision | 0.987617 | 1.0 | 0.0 |
| max recall | 0.00701 | 1.0 | 395.0 |
| max specificity | 0.987617 | 1.0 | 0.0 |
| max absolute_mcc | 0.201116 | 0.475332 | 250.0 |
| max min_per_class_accuracy | 0.313324 | 0.738 | 212.0 |
| max mean_per_class_accuracy | 0.201116 | 0.749313 | 250.0 |



Gains/Lift Table: Avg response rate: 34.90 %, avg score: 34.20 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010417,0.974809,2.149254,2.149254,0.75,0.981673,0.75,0.981673,0.022388,0.022388,114.925373,114.925373,0.018388
1,2,0.020833,0.9617,2.865672,2.507463,1.0,0.970081,0.875,0.975877,0.029851,0.052239,186.567164,150.746269,0.048239
2,3,0.03125,0.949709,1.432836,2.149254,0.5,0.956059,0.75,0.969271,0.014925,0.067164,43.283582,114.925373,0.055164
3,4,0.040365,0.939627,1.637527,2.033702,0.571429,0.94492,0.709677,0.963773,0.014925,0.08209,63.752665,103.370246,0.06409
4,5,0.050781,0.93009,2.149254,2.057405,0.75,0.934818,0.717949,0.957833,0.022388,0.104478,114.925373,105.740528,0.082478
5,6,0.10026,0.875401,2.413197,2.232991,0.842105,0.905597,0.779221,0.932054,0.119403,0.223881,141.319717,123.299089,0.189881
6,7,0.151042,0.79798,2.057405,2.173958,0.717949,0.834062,0.758621,0.899109,0.104478,0.328358,105.740528,117.39578,0.272358
7,8,0.200521,0.702292,1.88531,2.102733,0.657895,0.748364,0.733766,0.861912,0.093284,0.421642,88.531029,110.273309,0.339642
8,9,0.300781,0.531594,1.60031,1.935259,0.558442,0.606871,0.675325,0.776898,0.160448,0.58209,60.031014,93.525877,0.43209
9,10,0.39974,0.357655,1.282011,1.773543,0.447368,0.446075,0.618893,0.695001,0.126866,0.708955,28.2011,77.354271,0.474955




Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid,cv_9_valid,cv_10_valid
0,accuracy,0.752715,0.06075,0.705128,0.785714,0.780488,0.746835,0.74026,0.835443,0.694444,0.820513,0.6375,0.780822
1,auc,0.814498,0.045645,0.769444,0.848,0.827586,0.813605,0.80922,0.86907,0.751736,0.855978,0.743819,0.856522
2,err,0.247285,0.06075,0.294872,0.214286,0.219512,0.253165,0.25974,0.164557,0.305556,0.179487,0.3625,0.219178
3,err_count,19.0,4.876246,23.0,15.0,18.0,20.0,20.0,13.0,22.0,14.0,29.0,16.0
4,f0point5,0.643605,0.065844,0.61828,0.68323,0.680473,0.657895,0.65534,0.619469,0.5625,0.77381,0.538793,0.646259
5,f1,0.70033,0.051058,0.666667,0.745763,0.71875,0.714286,0.72973,0.682927,0.62069,0.787879,0.632911,0.703704
6,f2,0.770505,0.040804,0.72327,0.820896,0.761589,0.78125,0.823171,0.76087,0.692308,0.802469,0.766871,0.772358
7,lift_top_group,2.453903,1.419914,2.6,2.8,2.827586,2.633333,0.0,4.647059,3.0,0.0,2.857143,3.173913
8,logloss,0.529822,0.082563,0.604204,0.487304,0.495635,0.566602,0.557965,0.370058,0.597556,0.514954,0.650412,0.45353
9,max_per_class_error,0.293154,0.096402,0.333333,0.266667,0.226415,0.306122,0.361702,0.176471,0.333333,0.1875,0.5,0.24



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
0,,2021-12-01 22:27:02,3.154 sec,0.0,0.476641,0.646799,0.5,0.348958,1.0,0.651042
1,,2021-12-01 22:27:02,3.154 sec,1.0,0.458367,0.609445,0.870556,0.769786,2.626866,0.21875
2,,2021-12-01 22:27:02,3.154 sec,2.0,0.443037,0.579248,0.875873,0.787032,2.703464,0.213542
3,,2021-12-01 22:27:02,3.169 sec,3.0,0.430069,0.55421,0.882642,0.791588,2.703464,0.208333
4,,2021-12-01 22:27:02,3.169 sec,4.0,0.419352,0.533754,0.883937,0.798746,2.800543,0.205729
5,,2021-12-01 22:27:02,3.169 sec,5.0,0.409897,0.515609,0.886638,0.805237,2.865672,0.217448
6,,2021-12-01 22:27:02,3.185 sec,6.0,0.400978,0.498875,0.894974,0.817641,2.865672,0.1875
7,,2021-12-01 22:27:02,3.185 sec,7.0,0.393181,0.484229,0.900034,0.826908,2.865672,0.184896
8,,2021-12-01 22:27:02,3.185 sec,8.0,0.38614,0.470769,0.903716,0.835206,2.865672,0.1875
9,,2021-12-01 22:27:02,3.201 sec,9.0,0.37984,0.458879,0.906813,0.844634,2.865672,0.171875



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| C2 | 253.061905 | 1.0 | 0.401102 |
| C6 | 133.097443 | 0.525948 | 0.210959 |
| C8 | 78.79007 | 0.311347 | 0.124882 |
| C7 | 65.523659 | 0.258923 | 0.103855 |
| C1 | 35.47662 | 0.140189 | 0.05623 |
| C5 | 30.243166 | 0.119509 | 0.047935 |
| C3 | 22.35951 | 0.088356 | 0.03544 |
| C4 | 12.364817 | 0.048861 | 0.019598 |


<bound method H2OBinomialModel.confusion_matrix of >
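
The error rates that H2O reports can be reproduced from the confusion matrix counts. The short sketch below does this for the cross-validation confusion matrix of the Diabetes model, with the counts copied from the output above. The variable names are illustrative.

```python
# Cross-validation confusion matrix for the Diabetes GBM, copied from the
# output above (rows: actual class, columns: predicted class); H2O reports
# it at the threshold that maximizes F1
cv_matrix = [[337, 163],
             [47, 221]]

n_classes = len(cv_matrix)
total = sum(sum(row) for row in cv_matrix)
correct = sum(cv_matrix[i][i] for i in range(n_classes))

accuracy = correct / total
error_rate = 1 - accuracy

# Error of each class: misclassified instances over the size of that class
per_class_error = [1 - cv_matrix[i][i] / sum(cv_matrix[i]) for i in range(n_classes)]
mean_per_class_error = sum(per_class_error) / n_classes

print(f"accuracy: {accuracy:.4f}, error rate: {error_rate:.4f}")
print(f"per-class error: {[round(e, 4) for e in per_class_error]}")
print(f"mean per-class error: {mean_per_class_error:.6f}")
```

The accuracy computed this way corresponds to the max-F1 threshold (0.201116). It is lower than the "max accuracy" value of 0.760417, which H2O reports at a different threshold (0.58949).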

### V. Bagging using Decision Trees for Landsat (scikit-learn)

In [52]:
url='http://academic.uprm.edu/eacuna/landsat.txt'
data = pd.read_table(url, header=None,delim_whitespace=True)
y=data.iloc[:,36]-1
names=['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13',
            'C14','C15','C16','C17','C18','C19','C20','C21','C22','C23','C24','C25','C26','C27',
           'C28','C29', 'C30','C31','C32','C33','C34','C35','C36','C37']
X=data.iloc[:,0:36]
modeltree = tree.DecisionTreeClassifier()
bagging = BaggingClassifier(modeltree,n_estimators=100, max_features=1.0)
# Accuracy rate by resubstitution
bagging.fit(X, y)
predictions = bagging.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1072
           1       1.00      1.00      1.00       479
           2       1.00      1.00      1.00       961
           3       1.00      1.00      1.00       415
           4       1.00      1.00      1.00       470
           5       1.00      1.00      1.00      1038

    accuracy                           1.00      4435
   macro avg       1.00      1.00      1.00      4435
weighted avg       1.00      1.00      1.00      4435



In [53]:
#Accuracy by cross-validation (note: this evaluates the single tree modeltree, not the bagging ensemble)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=99)
results = model_selection.cross_val_score(modeltree, X, y, cv=kfold)
print(results.mean())

0.8561431069896083


In [54]:
#Estimating the accuracy by the holdout method
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
modeltree = tree.DecisionTreeClassifier()
bagging = BaggingClassifier(modeltree,n_estimators=100, max_features=1.0)
#Fit on the training split and score on the held-out 30%
bagging.fit(X_train, y_train)
predictions = bagging.predict(X_test)
bagging.score(X_test, y_test)

0.8970698722764838

In [55]:
#accuracy by out-of-bag
bagging1 = BaggingClassifier(modeltree,n_estimators=50, oob_score=True)
bagging1.fit(X, y)
bagging1.oob_score_

0.9016910935738445

### VI. AdaBoosting for Landsat

In [56]:
adaboost = AdaBoostClassifier(modeltree,n_estimators=100,learning_rate=1)
adaboost.fit(X, y)
predictions = adaboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1072
           1       1.00      1.00      1.00       479
           2       1.00      1.00      1.00       961
           3       1.00      1.00      1.00       415
           4       1.00      1.00      1.00       470
           5       1.00      1.00      1.00      1038

    accuracy                           1.00      4435
   macro avg       1.00      1.00      1.00      4435
weighted avg       1.00      1.00      1.00      4435



In [57]:
#accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10, shuffle=True,random_state=999)
results = model_selection.cross_val_score(adaboost, X, y, cv=kfold)
print(results.mean())

0.8435121916498891


In [58]:
#Estimating the accuracy by the holdout method
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
modeltree = tree.DecisionTreeClassifier()
adaboost = AdaBoostClassifier(modeltree,n_estimators=100,learning_rate=1)
#Fit on the training split and score on the held-out 30%
adaboost.fit(X_train, y_train)
predictions = adaboost.predict(X_test)
adaboost.score(X_test, y_test)

0.8324567993989481

### Gradient Boosting for Landsat

In [59]:
gboost = GradientBoostingClassifier(n_estimators=100)
#X_train, X_train_lr, y_train, y_train_lr = train_test_split(X,y,test_size=0.5)
gboost.fit(X, y)
predictions = gboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1072
           1       1.00      1.00      1.00       479
           2       0.96      0.99      0.98       961
           3       0.96      0.87      0.92       415
           4       0.99      0.99      0.99       470
           5       0.98      0.98      0.98      1038

    accuracy                           0.98      4435
   macro avg       0.98      0.97      0.98      4435
weighted avg       0.98      0.98      0.98      4435



In [60]:
#Estimating the accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=999)
results = model_selection.cross_val_score(gboost, X, y, cv=kfold)
print(results.mean())

0.9016919854391636


In [61]:
#Estimating the accuracy by the holdout method
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
gboost = GradientBoostingClassifier(n_estimators=100)
#Fit on the training split and score on the held-out 30%
gboost.fit(X_train, y_train)
predictions = gboost.predict(X_test)
gboost.score(X_test, y_test)

0.9053343350864012

### VII. Gradient Boosting for Landsat using H2O

In [62]:
#Reading the data
datos= h2o.import_file("http://academic.uprm.edu/eacuna/landsat.txt")
myx=['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13',
            'C14','C15','C16','C17','C18','C19','C20','C21','C22','C23','C24','C25','C26','C27',
           'C28','C29', 'C30','C31','C32','C33','C34','C35','C36']
datos['C37']=datos['C37'].asfactor()
myy="C37"
gbm2 = H2OGradientBoostingEstimator(model_id="gbm_covType_v1",ntrees = 100, max_depth=4,nfolds=10, sample_rate = 1,col_sample_rate = 1,seed=20000)
gbm2.train(myx, myy, training_frame=datos)
y_pred=gbm2.predict(datos)
print((y_pred['predict']==datos['C37']).mean())

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
gbm prediction progress: |███████████████████████████████████████████████████████| (done) 100%
[0.9950394588500564]


In [63]:
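#Showing the confusion matrix to estimate the out-of-bag and cross-validation accuracy
#Note: without parentheses, the notebook displays the bound method, whose repr prints the full model summary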
gbm2.confusion_matrix

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_covType_v1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
0,,100.0,600.0,135536.0,4.0,4.0,4.0,7.0,16.0,13.368333




ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.007986146376090133
RMSE: 0.08936524143138726
LogLoss: 0.04483049751656768
Mean Per-Class Error: 0.007389093948046522
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


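| Actual / Predicted | 1 | 2 | 3 | 4 | 5 | 6 | Error | Rate |
|---|---|---|---|---|---|---|---|---|
| 1 | 1072.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 / 1,072 |
| 2 | 0.0 | 479.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 / 479 |
| 3 | 0.0 | 0.0 | 961.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 / 961 |
| 4 | 0.0 | 0.0 | 7.0 | 399.0 | 0.0 | 9.0 | 0.038554 | 16 / 415 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 470.0 | 0.0 | 0.0 | 0 / 470 |
| 6 | 0.0 | 0.0 | 4.0 | 2.0 | 0.0 | 1032.0 | 0.00578 | 6 / 1,038 |
| Total | 1072.0 | 479.0 | 972.0 | 401.0 | 470.0 | 1041.0 | 0.004961 | 22 / 4,435 |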



Top-6 Hit Ratios: 


| k | hit_ratio |
|---|---|
| 1 | 0.995039 |
| 2 | 0.999549 |
| 3 | 1.0 |
| 4 | 1.0 |
| 5 | 1.0 |
| 6 | 1.0 |



ModelMetricsMultinomial: gbm
** Reported on cross-validation data. **

MSE: 0.0720374623883169
RMSE: 0.26839795526105803
LogLoss: 0.24269901109645906
Mean Per-Class Error: 0.11790647672150929
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


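| Actual / Predicted | 1 | 2 | 3 | 4 | 5 | 6 | Error | Rate |
|---|---|---|---|---|---|---|---|---|
| 1 | 1046.0 | 2.0 | 12.0 | 2.0 | 9.0 | 1.0 | 0.024254 | 26 / 1,072 |
| 2 | 0.0 | 461.0 | 2.0 | 5.0 | 9.0 | 2.0 | 0.037578 | 18 / 479 |
| 3 | 5.0 | 1.0 | 917.0 | 27.0 | 0.0 | 11.0 | 0.045786 | 44 / 961 |
| 4 | 2.0 | 7.0 | 73.0 | 260.0 | 2.0 | 71.0 | 0.373494 | 155 / 415 |
| 5 | 24.0 | 5.0 | 1.0 | 5.0 | 408.0 | 27.0 | 0.131915 | 62 / 470 |
| 6 | 0.0 | 1.0 | 20.0 | 56.0 | 21.0 | 940.0 | 0.094412 | 98 / 1,038 |
| Total | 1077.0 | 477.0 | 1025.0 | 355.0 | 449.0 | 1052.0 | 0.090868 | 403 / 4,435 |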



Top-6 Hit Ratios: 


| k | hit_ratio |
|---|---|
| 1 | 0.909132 |
| 2 | 0.98354 |
| 3 | 0.99752 |
| 4 | 0.999098 |
| 5 | 0.999549 |
| 6 | 1.0 |



Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid,cv_9_valid,cv_10_valid
0,accuracy,0.909175,0.007608,0.89726,0.913486,0.921241,0.908686,0.908072,0.917721,0.907621,0.909287,0.89755,0.910828
1,auc,,0.0,,,,,,,,,,
2,err,0.090825,0.007608,0.10274,0.086514,0.078759,0.091314,0.091928,0.082278,0.092379,0.090713,0.10245,0.089172
3,err_count,40.3,4.164666,45.0,34.0,33.0,41.0,41.0,39.0,40.0,42.0,46.0,42.0
4,logloss,0.242322,0.035508,0.263483,0.215872,0.209001,0.294329,0.207829,0.214769,0.241951,0.292245,0.273043,0.210698
5,max_per_class_error,0.366196,0.077845,0.431373,0.235294,0.352941,0.351351,0.275,0.365854,0.404255,0.317073,0.439024,0.489796
6,mean_per_class_accuracy,0.883043,0.011099,0.874371,0.901173,0.886739,0.880726,0.889741,0.887826,0.881428,0.889357,0.859273,0.879797
7,mean_per_class_error,0.116957,0.011099,0.125629,0.098827,0.113261,0.119274,0.110259,0.112174,0.118572,0.110643,0.140727,0.120203
8,mse,0.071944,0.007837,0.080324,0.066498,0.062177,0.082635,0.066317,0.066082,0.068791,0.080103,0.080228,0.066282
9,pr_auc,,0.0,,,,,,,,,,



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc
0,,2021-12-01 22:29:52,21.595 sec,0.0,0.833333,1.791759,0.806313,,
1,,2021-12-01 22:29:52,21.640 sec,1.0,0.761586,1.441517,0.127847,,
2,,2021-12-01 22:29:52,21.668 sec,2.0,0.698646,1.215178,0.114994,,
3,,2021-12-01 22:29:52,21.695 sec,3.0,0.642396,1.048866,0.111387,,
4,,2021-12-01 22:29:52,21.725 sec,4.0,0.592087,0.919148,0.105975,,
5,,2021-12-01 22:29:52,21.784 sec,5.0,0.547864,0.816253,0.106426,,
6,,2021-12-01 22:29:52,21.816 sec,6.0,0.508222,0.729874,0.104622,,
7,,2021-12-01 22:29:52,21.855 sec,7.0,0.473356,0.657732,0.101466,,
8,,2021-12-01 22:29:52,21.885 sec,8.0,0.442903,0.59694,0.094701,,
9,,2021-12-01 22:29:52,21.914 sec,9.0,0.415971,0.544306,0.095378,,



See the whole table with table.as_data_frame()

Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| C17 | 2636.68042 | 1.0 | 0.190932 |
| C22 | 1897.120728 | 0.719511 | 0.137378 |
| C20 | 1485.08374 | 0.56324 | 0.107541 |
| C18 | 1168.854004 | 0.443305 | 0.084641 |
| C34 | 1040.686523 | 0.394696 | 0.07536 |
| C16 | 600.934082 | 0.227913 | 0.043516 |
| C10 | 574.891174 | 0.218036 | 0.04163 |
| C24 | 369.264313 | 0.140049 | 0.02674 |
| C30 | 342.390656 | 0.129857 | 0.024794 |
| C33 | 289.765717 | 0.109898 | 0.020983 |



See the whole table with table.as_data_frame()


<bound method H2OMultinomialModel.confusion_matrix of >