### Data Mining and Machine Learning
### Ensembles of classifiers: Bagging, AdaBoost, Gradient Boosting
#### Datasets: Diabetes and Landsat
#### Modules: scikit-learn and H2O
#### Edgar Acuna
#### April 2021

In [55]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init(ip="localhost", port=54323)
#h2o.no_progress()

Checking whether there is an H2O instance running at http://localhost:54323 . connected.


H2O_cluster_uptime: 4 hours 55 mins
H2O_cluster_timezone: America/La_Paz
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.32.1.1
H2O_cluster_version_age: 17 days
H2O_cluster_name: H2O_from_python_eacun_0ahvl7
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 7.933 Gb
H2O_cluster_total_cores: 12
H2O_cluster_allowed_cores: 12


### Bootstrap Samples

In [56]:
#This is the original training sample L
x=[5,3,12,13,21,31,8,9,15,17,24,32] 

In [57]:
#This is a bootstrap sample (a sample drawn with replacement)
boot1=np.random.choice(x,12)
print(boot1)

[13  5 17 12 12  8 15  5  9 15  8 24]


In [58]:
np.unique(boot1)

array([ 5,  8,  9, 12, 13, 15, 17, 24])

In [59]:
#Another bootstrap sample
boot2=np.random.choice(x,12)
print(boot2)

[ 8 21 17 21  8  8 24  9 31 13  8  8]


In [60]:
np.unique(boot2)

array([ 8,  9, 13, 17, 21, 24, 31])

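Note: On average, approximately 37% of the instances of the training sample L do not appear in a given bootstrap sample. Each instance is missed with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. In the examples above, 33.33% (4 of 12) and 41.67% (5 of 12) of the instances do not appear in the first and second bootstrap samples, respectively.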
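The 37% figure can be checked by simulation. The following is a minimal sketch, not part of the original notebook: the helper `left_out_fraction`, the sample size `n = 1000`, the 200 replicates, and the seed are all assumptions chosen for illustration. For the 12-instance sample above, the same formula gives (11/12)^12 ≈ 0.352.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility (assumption)

def left_out_fraction(n, rng):
    """Fraction of n training instances absent from one bootstrap sample."""
    boot = rng.integers(0, n, size=n)        # n indices drawn with replacement
    return 1 - np.unique(boot).size / n

n = 1000
fractions = [left_out_fraction(n, rng) for _ in range(200)]
mean_left_out = float(np.mean(fractions))    # simulated average
expected = (1 - 1 / n) ** n                  # exact expectation, close to exp(-1)
print(mean_left_out, expected, np.exp(-1))
```

The simulated average should land close to both the exact expectation and 1/e.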

### I. Bagging for Diabetes using trees and scikit learn

In [61]:
url= "http://academic.uprm.edu/eacuna/diabetes.dat"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_table(url, names=names,header=None)
#The response variable must be binary (0,1)
y=data['class']-1
X=data.iloc[:,0:8]
modeltree = tree.DecisionTreeClassifier()
bagging = BaggingClassifier(modeltree,n_estimators=100)

In [62]:
# Accuracy rate by resubstitution
bagging.fit(X, y)
predictions = bagging.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       268

    accuracy                           1.00       768
   macro avg       1.00      1.00      1.00       768
weighted avg       1.00      1.00      1.00       768



In [63]:
#Estimating the accuracy by cross validation
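#random_state is omitted: it has no effect unless shuffle=True, and recent scikit-learn versions raise an error for that combination
kfold = model_selection.KFold(n_splits=10)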
results = model_selection.cross_val_score(bagging, X, y, cv=kfold)
print(results.mean())

0.7538277511961723
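The gap between the resubstitution accuracy (1.00) and the cross-validated accuracy (about 0.75) reflects how optimistic resubstitution is for unpruned trees. The sketch below reproduces that pattern on synthetic data; the data from `make_classification`, the names `X_demo`, `y_demo`, `bag_demo`, and the seeds are assumptions, not the Diabetes dataset. It also shows one way to get reproducible folds, by setting `shuffle=True` together with `random_state`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise (assumption, not the Diabetes data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=4, flip_y=0.1, random_state=0)

bag_demo = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)

# Resubstitution: score on the same data used for fitting (optimistic)
resub_acc = bag_demo.fit(X_demo, y_demo).score(X_demo, y_demo)

# Cross-validation: shuffle=True is required for random_state to take effect
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_acc = cross_val_score(bag_demo, X_demo, y_demo, cv=cv).mean()

print(resub_acc, cv_acc)
```

Stratified folds keep the class proportions roughly equal across folds, which can matter for an unbalanced response such as the 500/268 split above.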


#### Out-of-Bag accuracy

In [64]:
bagging1 = BaggingClassifier(modeltree,n_estimators=50, oob_score=True)
bagging1.fit(X, y)
bagging1.oob_score_

0.7591145833333334
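The out-of-bag estimate uses, for each instance, only the trees whose bootstrap sample did not contain it. A minimal sketch of that computation follows, using synthetic data (an assumption) and the fitted attributes `estimators_` and `estimators_samples_` of `BaggingClassifier`; the names `X_demo`, `y_demo`, `bag`, and `votes` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the Diabetes data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        oob_score=True, random_state=1).fit(X_demo, y_demo)

n = X_demo.shape[0]
votes = np.zeros((n, bag.n_classes_))        # accumulated class probabilities
for est, idx in zip(bag.estimators_, bag.estimators_samples_):
    oob = np.ones(n, dtype=bool)
    oob[idx] = False                         # True for rows this tree never saw
    votes[oob] += est.predict_proba(X_demo[oob])

# With 50 trees, every row is almost surely out-of-bag for at least one tree
manual_oob = np.mean(bag.classes_[votes.argmax(axis=1)] == y_demo)
print(manual_oob, bag.oob_score_)
```

Because no instance is scored by a tree that was trained on it, the OOB accuracy behaves like a validation estimate and needs no separate cross-validation loop.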

### II. AdaBoost for Diabetes using scikit-learn

In [65]:
adaboost = AdaBoostClassifier(modeltree,n_estimators=100,learning_rate=1)
adaboost.fit(X, y)
predictions = adaboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       268

    accuracy                           1.00       768
   macro avg       1.00      1.00      1.00       768
weighted avg       1.00      1.00      1.00       768



In [66]:
#Estimating the accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10)
results = model_selection.cross_val_score(adaboost, X, y, cv=kfold)
print(results.mean())

0.6873889268626111
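AdaBoost reweights the training instances after each round so that the next classifier concentrates on the instances that were misclassified. The sketch below works through a single round of the textbook binary (discrete) AdaBoost update on a hypothetical toy case; the weights and the misclassification pattern are made up for illustration, and this is not necessarily the exact variant that `AdaBoostClassifier` implements.

```python
import math

# Hypothetical toy round: 10 instances with uniform weights,
# and a weak classifier that misclassifies the first 3
weights = [0.1] * 10
misclassified = [True] * 3 + [False] * 7

# Weighted error of the weak classifier
err = sum(w for w, m in zip(weights, misclassified) if m)

# Weight of this classifier's vote in the final ensemble
alpha = 0.5 * math.log((1 - err) / err)

# Up-weight misclassified instances, down-weight the rest, then renormalize
new_weights = [w * math.exp(alpha if m else -alpha)
               for w, m in zip(weights, misclassified)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

mis_mass = sum(w for w, m in zip(new_weights, misclassified) if m)
print(alpha, mis_mass)
```

After the update the misclassified instances carry half of the total weight, so the next classifier cannot do well by repeating the same mistakes. Note that the update is undefined when `err` is 0, which is what a fully grown tree produces on its training data; this may be why AdaBoost with unpruned trees scores lower than bagging here, and shallow trees such as `DecisionTreeClassifier(max_depth=1)` are the more usual weak learner.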


### III. Gradient Boosting for Diabetes using scikit-learn

In [67]:
gboost = GradientBoostingClassifier(n_estimators=100)
#X_train, X_train_lr, y_train, y_train_lr = train_test_split(X,y,test_size=0.5)
gboost.fit(X, y)
predictions = gboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.90      0.96      0.93       500
           1       0.91      0.81      0.86       268

    accuracy                           0.91       768
   macro avg       0.91      0.88      0.89       768
weighted avg       0.91      0.91      0.90       768



In [68]:
#Estimating the accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10)
results = model_selection.cross_val_score(gboost, X, y, cv=kfold)
print(results.mean())

0.7656015037593986
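Gradient boosting builds the ensemble sequentially: each new tree is fitted to the negative gradient of the loss at the current predictions. The sketch below illustrates the idea for squared-error loss, where the negative gradient is simply the residual. It is a simplified regression example on synthetic data, not what `GradientBoostingClassifier` does internally for classification (which works with a log-loss gradient); the data, `learning_rate`, tree depth, and variable names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))                # synthetic inputs (assumption)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
pred = np.full_like(y_demo, y_demo.mean())                # initial constant model
mse_history = [np.mean((y_demo - pred) ** 2)]

for _ in range(100):
    residual = y_demo - pred                              # negative gradient of squared error
    tree_m = DecisionTreeRegressor(max_depth=2).fit(X_demo, residual)
    pred += learning_rate * tree_m.predict(X_demo)        # shrunken additive update
    mse_history.append(np.mean((y_demo - pred) ** 2))

print(mse_history[0], mse_history[-1])
```

The training error decreases at every step, which is why the resubstitution accuracy above keeps improving with more trees while the cross-validated accuracy is the figure to trust.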


### IV. Gradient Boosting for Diabetes using H2O

In [72]:
diabetes = h2o.import_file("https://academic.uprm.edu/eacuna/diabetes.dat")
myx=['C1','C2','C3','C4','C5','C6','C7','C8']
diabetes['C9']=diabetes['C9'].asfactor()
myy="C9"
gbm1 = H2OGradientBoostingEstimator(model_id="gbm_covType_v1",ntrees = 100,nfolds=10, sample_rate = 1,col_sample_rate = 1,seed=20000)
gbm1.train(myx, myy, training_frame=diabetes)
y_pred=gbm1.predict(diabetes)
print((y_pred['predict']==diabetes['C9']).mean())

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
[0.9830729166666666]


In [73]:
#Accuracy by resubstitution
gbm1.model_performance(diabetes)


ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.038151927643653376
RMSE: 0.19532518435586685
LogLoss: 0.16373584821172893
Mean Per-Class Error: 0.01819402985074625
AUC: 0.9978358208955224
AUCPR: 0.9964328500995646
Gini: 0.9956716417910447

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4138068815720585: 


| Actual / Predicted | 1 | 2 | Error | Rate |
|---|---|---|---|---|
| 1 | 493.0 | 7.0 | 0.014 | (7.0/500.0) |
| 2 | 6.0 | 262.0 | 0.0224 | (6.0/268.0) |
| Total | 499.0 | 269.0 | 0.0169 | (13.0/768.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.413807,0.975791,185.0
1,max f2,0.373994,0.98003,194.0
2,max f0point5,0.51438,0.983542,168.0
3,max accuracy,0.413807,0.983073,185.0
4,max precision,0.990656,1.0,0.0
5,max recall,0.246444,1.0,240.0
6,max specificity,0.990656,1.0,0.0
7,max absolute_mcc,0.413807,0.962782,185.0
8,max min_per_class_accuracy,0.403983,0.981343,188.0
9,max mean_per_class_accuracy,0.413807,0.981806,185.0



Gains/Lift Table: Avg response rate: 34.90 %, avg score: 34.90 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010417,0.981457,2.865672,2.865672,1.0,0.98576,1.0,0.98576,0.029851,0.029851,186.567164,186.567164,0.029851
1,2,0.020833,0.976561,2.865672,2.865672,1.0,0.978925,1.0,0.982342,0.029851,0.059701,186.567164,186.567164,0.059701
2,3,0.03125,0.973534,2.865672,2.865672,1.0,0.974528,1.0,0.979737,0.029851,0.089552,186.567164,186.567164,0.089552
3,4,0.040365,0.966971,2.865672,2.865672,1.0,0.970113,1.0,0.977564,0.026119,0.115672,186.567164,186.567164,0.115672
4,5,0.050781,0.962355,2.865672,2.865672,1.0,0.964952,1.0,0.974977,0.029851,0.145522,186.567164,186.567164,0.145522
5,6,0.10026,0.935159,2.865672,2.865672,1.0,0.947053,1.0,0.961196,0.141791,0.287313,186.567164,186.567164,0.287313
6,7,0.151042,0.883616,2.865672,2.865672,1.0,0.910996,1.0,0.944319,0.145522,0.432836,186.567164,186.567164,0.432836
7,8,0.200521,0.813021,2.865672,2.865672,1.0,0.84664,1.0,0.920216,0.141791,0.574627,186.567164,186.567164,0.574627
8,9,0.300781,0.585143,2.865672,2.865672,1.0,0.725613,1.0,0.855348,0.287313,0.86194,186.567164,186.567164,0.86194
9,10,0.39974,0.300241,1.282011,2.473625,0.447368,0.427741,0.863192,0.749491,0.126866,0.988806,28.2011,147.362536,0.904806







In [75]:
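#Displaying the model summary, with confusion matrices on the training data and by cross-validation
#(confusion_matrix is referenced without parentheses, so the notebook shows the bound method, whose repr prints the whole model)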
gbm1.confusion_matrix

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_covType_v1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
0,,100.0,100.0,25677.0,5.0,5.0,5.0,6.0,27.0,15.72




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.03815192825610679
RMSE: 0.19532518592364578
LogLoss: 0.16373584910531283
Mean Per-Class Error: 0.01819402985074625
AUC: 0.9978358208955224
AUCPR: 0.9964328500995646
Gini: 0.9956716417910447

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.413806826430252: 


| Actual / Predicted | 1 | 2 | Error | Rate |
|---|---|---|---|---|
| 1 | 493.0 | 7.0 | 0.014 | (7.0/500.0) |
| 2 | 6.0 | 262.0 | 0.0224 | (6.0/268.0) |
| Total | 499.0 | 269.0 | 0.0169 | (13.0/768.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.413807,0.975791,185.0
1,max f2,0.373994,0.98003,194.0
2,max f0point5,0.51438,0.983542,168.0
3,max accuracy,0.413807,0.983073,185.0
4,max precision,0.990656,1.0,0.0
5,max recall,0.246444,1.0,240.0
6,max specificity,0.990656,1.0,0.0
7,max absolute_mcc,0.413807,0.962782,185.0
8,max min_per_class_accuracy,0.403983,0.981343,188.0
9,max mean_per_class_accuracy,0.413807,0.981806,185.0



Gains/Lift Table: Avg response rate: 34.90 %, avg score: 34.90 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010417,0.981457,2.865672,2.865672,1.0,0.98576,1.0,0.98576,0.029851,0.029851,186.567164,186.567164,0.029851
1,2,0.020833,0.976561,2.865672,2.865672,1.0,0.978925,1.0,0.982342,0.029851,0.059701,186.567164,186.567164,0.059701
2,3,0.03125,0.973534,2.865672,2.865672,1.0,0.974527,1.0,0.979737,0.029851,0.089552,186.567164,186.567164,0.089552
3,4,0.040365,0.966971,2.865672,2.865672,1.0,0.970113,1.0,0.977564,0.026119,0.115672,186.567164,186.567164,0.115672
4,5,0.050781,0.962355,2.865672,2.865672,1.0,0.964952,1.0,0.974977,0.029851,0.145522,186.567164,186.567164,0.145522
5,6,0.10026,0.935159,2.865672,2.865672,1.0,0.947053,1.0,0.961196,0.141791,0.287313,186.567164,186.567164,0.287313
6,7,0.151042,0.883616,2.865672,2.865672,1.0,0.910996,1.0,0.944319,0.145522,0.432836,186.567164,186.567164,0.432836
7,8,0.200521,0.813021,2.865672,2.865672,1.0,0.84664,1.0,0.920216,0.141791,0.574627,186.567164,186.567164,0.574627
8,9,0.300781,0.585143,2.865672,2.865672,1.0,0.725613,1.0,0.855348,0.287313,0.86194,186.567164,186.567164,0.86194
9,10,0.39974,0.300241,1.282011,2.473625,0.447368,0.427741,0.863192,0.749491,0.126866,0.988806,28.2011,147.362536,0.904806




ModelMetricsBinomial: gbm
** Reported on cross-validation data. **

MSE: 0.17882862151878806
RMSE: 0.42288133266767414
LogLoss: 0.562892421727282
Mean Per-Class Error: 0.26613432835820894
AUC: 0.8024141791044777
AUCPR: 0.6737745428946297
Gini: 0.6048283582089553

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.17930364402076748: 


| Actual / Predicted | 1 | 2 | Error | Rate |
|---|---|---|---|---|
| 1 | 324.0 | 176.0 | 0.352 | (176.0/500.0) |
| 2 | 49.0 | 219.0 | 0.1828 | (49.0/268.0) |
| Total | 373.0 | 395.0 | 0.293 | (225.0/768.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


Unnamed: 0,metric,threshold,value,idx
0,max f1,0.179304,0.660633,271.0
1,max f2,0.047439,0.774863,350.0
2,max f0point5,0.642473,0.654528,123.0
3,max accuracy,0.642473,0.753906,123.0
4,max precision,0.988878,1.0,0.0
5,max recall,0.005525,1.0,396.0
6,max specificity,0.988878,1.0,0.0
7,max absolute_mcc,0.267892,0.448197,238.0
8,max min_per_class_accuracy,0.299332,0.728,229.0
9,max mean_per_class_accuracy,0.267892,0.733866,238.0



Gains/Lift Table: Avg response rate: 34.90 %, avg score: 33.85 %


Unnamed: 0,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
0,1,0.010417,0.981041,2.507463,2.507463,0.875,0.98513,0.875,0.98513,0.026119,0.026119,150.746269,150.746269,0.024119
1,2,0.020833,0.972879,2.149254,2.328358,0.75,0.976867,0.8125,0.980998,0.022388,0.048507,114.925373,132.835821,0.042507
2,3,0.03125,0.9683,1.791045,2.149254,0.625,0.970065,0.75,0.977354,0.018657,0.067164,79.104478,114.925373,0.055164
3,4,0.040365,0.958626,2.45629,2.218584,0.857143,0.963953,0.774194,0.974328,0.022388,0.089552,145.628998,121.85845,0.075552
4,5,0.050781,0.951165,2.149254,2.204363,0.75,0.956164,0.769231,0.970602,0.022388,0.11194,114.925373,120.43628,0.09394
5,6,0.10026,0.901611,2.413197,2.307424,0.842105,0.925846,0.805195,0.948514,0.119403,0.231343,141.319717,130.742392,0.201343
6,7,0.151042,0.818624,2.057405,2.223366,0.717949,0.859037,0.775862,0.918432,0.104478,0.335821,105.740528,122.336593,0.283821
7,8,0.200521,0.69621,1.659073,2.084125,0.578947,0.760701,0.727273,0.879511,0.08209,0.41791,65.907306,108.412483,0.33391
8,9,0.300781,0.540269,1.451444,1.873231,0.506494,0.62087,0.65368,0.793297,0.145522,0.563433,45.144408,87.323125,0.403433
9,10,0.39974,0.337128,1.319717,1.736205,0.460526,0.441022,0.605863,0.706089,0.130597,0.69403,31.97172,73.620497,0.45203




Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid,cv_9_valid,cv_10_valid
0,accuracy,0.75315523,0.07636527,0.6666667,0.8,0.80487806,0.73417723,0.72727275,0.84810126,0.75,0.8333333,0.6,0.7671233
1,auc,0.8066292,0.053694718,0.7548611,0.8417778,0.82888746,0.77619046,0.8028369,0.86907023,0.7638889,0.8620924,0.71016484,0.8565217
2,aucpr,0.6908814,0.067576386,0.68544525,0.7501705,0.755243,0.6348142,0.6727421,0.6841638,0.6972894,0.79968643,0.55812305,0.6711359
3,err,0.24684474,0.07636527,0.33333334,0.2,0.19512194,0.2658228,0.27272728,0.15189873,0.25,0.16666667,0.4,0.23287672
4,err_count,19.0,6.2360954,26.0,14.0,16.0,21.0,21.0,12.0,18.0,13.0,32.0,17.0
5,f0point5,0.6511915,0.077198565,0.5855856,0.7092199,0.7241379,0.6451613,0.64220184,0.64705884,0.625,0.79268295,0.51229507,0.62857145
6,f1,0.6986443,0.055406548,0.6666667,0.7407407,0.7241379,0.6956522,0.72727275,0.64705884,0.65384614,0.8,0.6097561,0.72131145
7,f2,0.76053435,0.06315655,0.77380955,0.7751938,0.7241379,0.754717,0.83832335,0.64705884,0.6854839,0.8074534,0.75301206,0.84615386
8,lift_top_group,2.6976535,1.1318568,2.6,2.8,2.8275862,2.6333334,0.0,4.647059,3.0,2.4375,2.857143,3.173913
9,logloss,0.5619636,0.09710537,0.6368383,0.52165824,0.5125177,0.6571069,0.5796877,0.3964031,0.580854,0.53038883,0.7321649,0.47201654



See the whole table with table.as_data_frame()

Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error
0,,2021-04-11 20:55:08,1.048 sec,0.0,0.476641,0.646799,0.5,0.348958,1.0,0.651042
1,,2021-04-11 20:55:08,1.052 sec,1.0,0.454994,0.602865,0.897358,0.831509,2.865672,0.1875
2,,2021-04-11 20:55:08,1.055 sec,2.0,0.437018,0.568046,0.900888,0.837872,2.865672,0.19401
3,,2021-04-11 20:55:08,1.058 sec,3.0,0.422197,0.539896,0.905642,0.841896,2.8047,0.169271
4,,2021-04-11 20:55:08,1.061 sec,4.0,0.409006,0.515212,0.908175,0.850686,2.865672,0.178385
5,,2021-04-11 20:55:08,1.064 sec,5.0,0.398051,0.494855,0.914672,0.857599,2.865672,0.173177
6,,2021-04-11 20:55:08,1.067 sec,6.0,0.388058,0.476226,0.918437,0.863485,2.865672,0.15625
7,,2021-04-11 20:55:08,1.070 sec,7.0,0.37875,0.459186,0.924407,0.871864,2.865672,0.14974
8,,2021-04-11 20:55:08,1.073 sec,8.0,0.370194,0.443595,0.929037,0.881162,2.865672,0.147135
9,,2021-04-11 20:55:08,1.076 sec,9.0,0.362754,0.429826,0.931743,0.886762,2.865672,0.139323



See the whole table with table.as_data_frame()

Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,C2,257.64859,1.0,0.363092
1,C6,136.503769,0.529806,0.192368
2,C8,94.454041,0.3666,0.13311
3,C7,82.871147,0.321644,0.116786
4,C1,44.622375,0.173191,0.062884
5,C3,38.467346,0.149302,0.05421
6,C5,33.343941,0.129416,0.04699
7,C4,21.684818,0.084164,0.030559


<bound method H2OBinomialModel.confusion_matrix of >

### V. Bagging  using Decision Trees for Landsat (scikit-learn)

In [46]:
url='http://academic.uprm.edu/eacuna/landsat.txt'
data = pd.read_csv(url, header=None, sep=r'\s+')
y=data.iloc[:,36]-1
names=['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13',
            'C14','C15','C16','C17','C18','C19','C20','C21','C22','C23','C24','C25','C26','C27',
           'C28','C29', 'C30','C31','C32','C33','C34','C35','C36','C37']
X=data.iloc[:,0:36]
modeltree = tree.DecisionTreeClassifier()
bagging = BaggingClassifier(modeltree,n_estimators=100, max_features=1.0)
# Accuracy rate by resubstitution
bagging.fit(X, y)
predictions = bagging.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1072
           1       1.00      1.00      1.00       479
           2       1.00      1.00      1.00       961
           3       1.00      1.00      1.00       415
           4       1.00      1.00      1.00       470
           5       1.00      1.00      1.00      1038

    accuracy                           1.00      4435
   macro avg       1.00      1.00      1.00      4435
weighted avg       1.00      1.00      1.00      4435



In [47]:
#Accuracy by cross-validation of a single decision tree (modeltree); pass bagging instead to evaluate the ensemble
kfold = model_selection.KFold(n_splits=10)
results = model_selection.cross_val_score(modeltree, X, y, cv=kfold)
print(results.mean())

0.8033931222418808


In [48]:
#accuracy by out-of-bag
bagging1 = BaggingClassifier(modeltree,n_estimators=50, oob_score=True)
bagging1.fit(X, y)
bagging1.oob_score_

0.9012401352874859

### VI. AdaBoosting for Landsat

In [49]:
adaboost = AdaBoostClassifier(modeltree,n_estimators=100,learning_rate=1)
adaboost.fit(X, y)
predictions = adaboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1072
           1       1.00      1.00      1.00       479
           2       1.00      1.00      1.00       961
           3       1.00      1.00      1.00       415
           4       1.00      1.00      1.00       470
           5       1.00      1.00      1.00      1038

    accuracy                           1.00      4435
   macro avg       1.00      1.00      1.00      4435
weighted avg       1.00      1.00      1.00      4435



In [50]:
#accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10)
results = model_selection.cross_val_score(adaboost, X, y, cv=kfold)
print(results.mean())

0.8024922213409796


In [51]:
gboost = GradientBoostingClassifier(n_estimators=100)
#X_train, X_train_lr, y_train, y_train_lr = train_test_split(X,y,test_size=0.5)
gboost.fit(X, y)
predictions = gboost.predict(X)
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1072
           1       1.00      1.00      1.00       479
           2       0.96      0.99      0.98       961
           3       0.96      0.87      0.92       415
           4       0.99      0.99      0.99       470
           5       0.98      0.98      0.98      1038

    accuracy                           0.98      4435
   macro avg       0.98      0.97      0.98      4435
weighted avg       0.98      0.98      0.98      4435



In [52]:
#Estimating the accuracy by cross-validation
kfold = model_selection.KFold(n_splits=10)
results = model_selection.cross_val_score(gboost, X, y, cv=kfold)
print(results.mean())

0.866313322351697


### VII. Gradient Boosting for Landsat using H2O

In [53]:
#Reading the data
datos= h2o.import_file("http://academic.uprm.edu/eacuna/landsat.txt")
myx=['C1','C2','C3','C4','C5','C6','C7','C8','C9','C10','C11','C12','C13',
            'C14','C15','C16','C17','C18','C19','C20','C21','C22','C23','C24','C25','C26','C27',
           'C28','C29', 'C30','C31','C32','C33','C34','C35','C36']
datos['C37']=datos['C37'].asfactor()
myy="C37"
gbm2 = H2OGradientBoostingEstimator(model_id="gbm_covType_v1",ntrees = 100, max_depth=4,nfolds=10, sample_rate = 1,col_sample_rate = 1,seed=20000)
gbm2.train(myx, myy, training_frame=datos)
y_pred=gbm2.predict(datos)
print((y_pred['predict']==datos['C37']).mean())

Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%
[0.9950394588500564]


In [54]:
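#Displaying the model summary, with confusion matrices on the training data and by cross-validation
#(confusion_matrix is referenced without parentheses, so the notebook shows the bound method, whose repr prints the whole model)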
gbm2.confusion_matrix

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_covType_v1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
0,,100.0,600.0,135557.0,4.0,4.0,4.0,7.0,16.0,13.368333




ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.007986146376090133
RMSE: 0.08936524143138726
LogLoss: 0.04483049751656768
Mean Per-Class Error: 0.007389093948046522
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


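| Actual / Predicted | 1 | 2 | 3 | 4 | 5 | 6 | Error | Rate |
|---|---|---|---|---|---|---|---|---|
| 1 | 1072.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 / 1,072 |
| 2 | 0.0 | 479.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 / 479 |
| 3 | 0.0 | 0.0 | 961.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 / 961 |
| 4 | 0.0 | 0.0 | 7.0 | 399.0 | 0.0 | 9.0 | 0.038554 | 16 / 415 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 470.0 | 0.0 | 0.0 | 0 / 470 |
| 6 | 0.0 | 0.0 | 4.0 | 2.0 | 0.0 | 1032.0 | 0.00578 | 6 / 1,038 |
| Total | 1072.0 | 479.0 | 972.0 | 401.0 | 470.0 | 1041.0 | 0.004961 | 22 / 4,435 |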



Top-6 Hit Ratios: 


Unnamed: 0,k,hit_ratio
0,1,0.995039
1,2,0.999549
2,3,1.0
3,4,1.0
4,5,1.0
5,6,1.0



ModelMetricsMultinomial: gbm
** Reported on cross-validation data. **

MSE: 0.0720374623883169
RMSE: 0.26839795526105803
LogLoss: 0.24269901109645906
Mean Per-Class Error: 0.11790647672150929
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


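| Actual / Predicted | 1 | 2 | 3 | 4 | 5 | 6 | Error | Rate |
|---|---|---|---|---|---|---|---|---|
| 1 | 1046.0 | 2.0 | 12.0 | 2.0 | 9.0 | 1.0 | 0.024254 | 26 / 1,072 |
| 2 | 0.0 | 461.0 | 2.0 | 5.0 | 9.0 | 2.0 | 0.037578 | 18 / 479 |
| 3 | 5.0 | 1.0 | 917.0 | 27.0 | 0.0 | 11.0 | 0.045786 | 44 / 961 |
| 4 | 2.0 | 7.0 | 73.0 | 260.0 | 2.0 | 71.0 | 0.373494 | 155 / 415 |
| 5 | 24.0 | 5.0 | 1.0 | 5.0 | 408.0 | 27.0 | 0.131915 | 62 / 470 |
| 6 | 0.0 | 1.0 | 20.0 | 56.0 | 21.0 | 940.0 | 0.094412 | 98 / 1,038 |
| Total | 1077.0 | 477.0 | 1025.0 | 355.0 | 449.0 | 1052.0 | 0.090868 | 403 / 4,435 |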



Top-6 Hit Ratios: 


Unnamed: 0,k,hit_ratio
0,1,0.909132
1,2,0.98354
2,3,0.99752
3,4,0.999098
4,5,0.999549
5,6,1.0



Cross-Validation Metrics Summary: 


Unnamed: 0,Unnamed: 1,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid,cv_9_valid,cv_10_valid
0,accuracy,0.90917534,0.0076079755,0.89726025,0.913486,0.92124104,0.908686,0.90807176,0.9177215,0.90762126,0.9092873,0.8975501,0.91082805
1,auc,,0.0,,,,,,,,,,
2,aucpr,,0.0,,,,,,,,,,
3,err,0.09082468,0.0076079755,0.10273973,0.086513996,0.07875895,0.09131403,0.09192825,0.08227848,0.09237875,0.09071274,0.10244989,0.089171976
4,err_count,40.3,4.164666,45.0,34.0,33.0,41.0,41.0,39.0,40.0,42.0,46.0,42.0
5,logloss,0.24232204,0.035508305,0.26348302,0.21587239,0.20900099,0.29432887,0.20782925,0.21476933,0.2419506,0.29224464,0.27304333,0.21069802
6,max_per_class_error,0.36619616,0.07784504,0.43137255,0.23529412,0.3529412,0.35135135,0.275,0.36585367,0.40425533,0.31707317,0.4390244,0.48979592
7,mean_per_class_accuracy,0.88304317,0.011099036,0.87437147,0.9011727,0.886739,0.88072604,0.8897406,0.8878265,0.8814279,0.8893574,0.8592726,0.8797972
8,mean_per_class_error,0.11695686,0.011099036,0.12562853,0.09882732,0.113260955,0.11927393,0.110259384,0.11217353,0.11857213,0.1106426,0.14072742,0.12020277
9,mse,0.07194362,0.007837235,0.080324434,0.066498086,0.06217672,0.08263468,0.06631684,0.06608171,0.06879122,0.08010282,0.08022784,0.0662818



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc
0,,2021-04-11 19:29:02,8.810 sec,0.0,0.833333,1.791759,0.806313,,
1,,2021-04-11 19:29:02,8.810 sec,1.0,0.761586,1.441517,0.127847,,
2,,2021-04-11 19:29:02,8.828 sec,2.0,0.698646,1.215178,0.114994,,
3,,2021-04-11 19:29:02,8.844 sec,3.0,0.642396,1.048866,0.111387,,
4,,2021-04-11 19:29:02,8.859 sec,4.0,0.592087,0.919148,0.105975,,
5,,2021-04-11 19:29:02,8.875 sec,5.0,0.547864,0.816253,0.106426,,
6,,2021-04-11 19:29:02,8.875 sec,6.0,0.508222,0.729874,0.104622,,
7,,2021-04-11 19:29:02,8.890 sec,7.0,0.473356,0.657732,0.101466,,
8,,2021-04-11 19:29:02,8.906 sec,8.0,0.442903,0.59694,0.094701,,
9,,2021-04-11 19:29:02,8.928 sec,9.0,0.415971,0.544306,0.095378,,



See the whole table with table.as_data_frame()

Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,C17,2636.68042,1.0,0.190932
1,C22,1897.120728,0.719511,0.137378
2,C20,1485.08374,0.56324,0.107541
3,C18,1168.854004,0.443305,0.084641
4,C34,1040.686523,0.394696,0.07536
5,C16,600.934082,0.227913,0.043516
6,C10,574.891174,0.218036,0.04163
7,C24,369.264313,0.140049,0.02674
8,C30,342.390656,0.129857,0.024794
9,C33,289.765717,0.109898,0.020983



See the whole table with table.as_data_frame()


<bound method H2OMultinomialModel.confusion_matrix of >