
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)

 **Please make sure to install the following libraries in order for the entire notebook to work.**

H2O AutoML automates the selection of training models: it trains several candidate models, compares them on common metrics, and ranks them in a leaderboard. It is used here to get an initial idea of which supervised models suit our case (fraud detection). The data used in this exercise has been encoded, preprocessed, and rebalanced with SMOTE (Synthetic Minority Over-sampling Technique), a technique that corrects the unequal class distribution of the dataset; the rate chosen to adjust the imbalance is **0.05**. Inside the folder you will find AutoML runs for different scaling methods and SMOTE rates.
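
To make the 0.05 rate concrete: assuming it plays the role of a minority-to-majority ratio (this is how imbalanced-learn's `SMOTE` interprets a float `sampling_strategy`; the exact call used during preprocessing is not shown in this notebook), the number of synthetic minority rows follows from simple arithmetic. The helper name and the class counts below are hypothetical and only illustrate the calculation.

```python
def synthetic_rows_needed(n_majority, n_minority, ratio):
    """Rows to synthesize so that minority / majority is approximately `ratio`."""
    target_minority = round(n_majority * ratio)
    # If the minority class is already large enough, nothing is added.
    return max(0, target_minority - n_minority)

# Hypothetical class counts, for illustration only (not the real dataset).
n_majority, n_minority = 100_000, 1_000
extra = synthetic_rows_needed(n_majority, n_minority, ratio=0.05)
print(extra)  # 4000
```

With these made-up counts the minority class would grow from 1,000 to 5,000 rows, which is 5% of the majority class.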


# **1. Libraries Installation**

In [None]:
! pip install requests
! pip install tabulate
! pip install "colorama>=0.3.8"
! pip install future

Collecting colorama>=0.3.8
  Downloading colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Installing collected packages: colorama
Successfully installed colorama-0.4.4


In [None]:
!pip install gdown



In [None]:
! pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o


Looking in links: http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html
Collecting h2o
  Downloading h2o-3.32.1.7.tar.gz (170.0 MB)
     |████████████████████████████████| 170.0 MB 11 kB/s 
Building wheels for collected packages: h2o
  Building wheel for h2o (setup.py) ... done
  Created wheel for h2o: filename=h2o-3.32.1.7-py2.py3-none-any.whl size=170040343 sha256=cf63cab7ecfbf34a3e261a573ee90682b6431f0f2e900af21de974ab06b464e8
  Stored in directory: /root/.cache/pip/wheels/6e/60/8f/172971bebc94f839b69460f46c9a5fc9e7e88457453bb149d7
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.32.1.7


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import math
from pandas_profiling import ProfileReport
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import Normalizer, StandardScaler



In [None]:
from google.colab import files

In [None]:
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from datetime import datetime
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import cross_validate, GridSearchCV   # Additional sklearn functions
#from sklearn.grid_search import GridSearchCV   # Performing grid search
from sklearn.metrics import accuracy_score, roc_auc_score
from matplotlib.pylab import rcParams

# **2. h2o initialization**

In [None]:
import h2o
from h2o.automl import H2OAutoML


In [None]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.11" 2021-04-20; OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04); OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.7/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpy6zvgoyx
  JVM stdout: /tmp/tmpy6zvgoyx/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpy6zvgoyx/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.32.1.7
H2O_cluster_version_age:,5 days
H2O_cluster_name:,H2O_from_python_unknownUser_pi4wqk
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.172 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


# **3. Dataset initialization**

Because of its size, the dataset cannot be downloaded from GitHub; as an alternative, it is downloaded from Google Drive through a link. ***Remember that this dataset no longer represents the raw data, because it has already been preprocessed.***

In [None]:
#train_fill
!gdown --id 1HS9g0Fk2Vx-t_gO_OB72AOh3gxnYsBYl

Downloading...
From: https://drive.google.com/uc?id=1HS9g0Fk2Vx-t_gO_OB72AOh3gxnYsBYl
To: /content/X_trainss_005.csv
656MB [00:03, 175MB/s]


In [None]:
X_trainss_005 = pd.read_csv('/content/X_trainss_005.csv')

In [None]:
def timer(start_time=None):
    """Return the current time if called without an argument; otherwise print the time elapsed since start_time."""
    if not start_time:
        return datetime.now()
    # Break the elapsed seconds down into hours, minutes and seconds
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

## **AutoML for the dataset scaled with StandardScaler and rebalanced with SMOTE 0.05, using classification metrics**

In [None]:
hftrainss_005 = h2o.H2OFrame(X_trainss_005)
#hftest =h2o.H2OFrame(test)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [None]:
trainss_005, testss_005 = hftrainss_005.split_frame(ratios=[.7]) # try different split proportions (70/30, ...)
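
As far as the H2O documentation describes it, `split_frame` assigns rows probabilistically, so `ratios=[.7]` yields roughly, not exactly, 70% of the rows for training, and the partition changes between runs unless a `seed` is passed. The pure-Python sketch below imitates that behavior on row indices; `probabilistic_split` is a hypothetical helper, not part of H2O.

```python
import random

def probabilistic_split(n_rows, ratio, seed=None):
    """Assign each row index to train with probability `ratio`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < ratio else test).append(i)
    return train, test

train_idx, test_idx = probabilistic_split(10_000, ratio=0.7, seed=1)
print(len(train_idx), len(test_idx))  # close to 7000 / 3000, rarely exact

# The same seed reproduces the same partition.
assert probabilistic_split(10_000, 0.7, seed=1)[0] == train_idx
```

With H2O itself, passing `seed` to `split_frame` (for example `hftrainss_005.split_frame(ratios=[.7], seed=1)`) should make the split reproducible in the same way.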

In [None]:
xss_005 = trainss_005.columns
yss_005 = "isFraud"
xss_005.remove(yss_005)

In [None]:
trainss_005[yss_005] = trainss_005[yss_005].asfactor()


In [None]:
# Run AutoML for up to 15 base models, limited to 5 hours of total runtime
aml = H2OAutoML(max_runtime_secs = 3600*5, max_models=15, seed=1) # adjust the time budget as needed; try working with 10 models
# check whether more models are needed
aml.train(x=xss_005, y=yss_005, training_frame=trainss_005)

# View the AutoML Leaderboard
lb = aml.leaderboard
lb.head(rows=lb.nrows)  # Print all rows instead of default (10 rows)

AutoML progress: |████████████████████████████████████████████████████████| 100%


model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
StackedEnsemble_AllModels_AutoML_20210908_151402,0.971251,0.0511948,0.881846,0.116172,0.108641,0.0118029
StackedEnsemble_BestOfFamily_AutoML_20210908_151402,0.970456,0.0522655,0.87748,0.118303,0.109783,0.0120523
XGBoost_grid__1_AutoML_20210908_151402_model_1,0.968008,0.0577282,0.872667,0.124493,0.114057,0.013009
XGBoost_grid__1_AutoML_20210908_151402_model_2,0.966334,0.0579871,0.865522,0.119741,0.11559,0.0133612
XGBoost_3_AutoML_20210908_151402,0.962602,0.060232,0.852959,0.123918,0.118617,0.01407
XGBoost_1_AutoML_20210908_151402,0.961,0.0621524,0.846352,0.139668,0.120607,0.014546
XGBoost_2_AutoML_20210908_151402,0.956453,0.0659372,0.831369,0.152327,0.12465,0.0155377
GBM_grid__1_AutoML_20210908_151402_model_1,0.95625,0.0653361,0.834416,0.140332,0.123425,0.0152336
XRT_1_AutoML_20210908_151402,0.941054,0.091377,0.748418,0.191987,0.149226,0.0222683
DRF_1_AutoML_20210908_151402,0.938314,0.0926814,0.74253,0.184182,0.150086,0.0225257




In [None]:
lb = h2o.automl.get_leaderboard(aml, extra_columns = 'ALL')
lb

model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse,training_time_ms,predict_time_per_row_ms,algo
StackedEnsemble_AllModels_AutoML_20210908_151402,0.971251,0.0511948,0.881846,0.116172,0.108641,0.0118029,16258,0.183697,StackedEnsemble
StackedEnsemble_BestOfFamily_AutoML_20210908_151402,0.970456,0.0522655,0.87748,0.118303,0.109783,0.0120523,13244,0.156336,StackedEnsemble
XGBoost_grid__1_AutoML_20210908_151402_model_1,0.968008,0.0577282,0.872667,0.124493,0.114057,0.013009,170093,0.017665,XGBoost
XGBoost_grid__1_AutoML_20210908_151402_model_2,0.966334,0.0579871,0.865522,0.119741,0.11559,0.0133612,177478,0.019305,XGBoost
XGBoost_3_AutoML_20210908_151402,0.962602,0.060232,0.852959,0.123918,0.118617,0.01407,74157,0.01497,XGBoost
XGBoost_1_AutoML_20210908_151402,0.961,0.0621524,0.846352,0.139668,0.120607,0.014546,77037,0.019998,XGBoost
XGBoost_2_AutoML_20210908_151402,0.956453,0.0659372,0.831369,0.152327,0.12465,0.0155377,88883,0.014755,XGBoost
GBM_grid__1_AutoML_20210908_151402_model_1,0.95625,0.0653361,0.834416,0.140332,0.123425,0.0152336,569496,0.104741,GBM
XRT_1_AutoML_20210908_151402,0.941054,0.091377,0.748418,0.191987,0.149226,0.0222683,88140,0.02686,DRF
DRF_1_AutoML_20210908_151402,0.938314,0.0926814,0.74253,0.184182,0.150086,0.0225257,87876,0.026177,DRF
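
The extra columns make it possible to weigh accuracy against scoring cost. The sketch below copies three rows from the leaderboard above (AUC and `predict_time_per_row_ms`) and compares the top stacked ensemble with the best single XGBoost model. The shortened labels and the comparison itself are illustrative and not part of the original analysis.

```python
# (label, auc, predict_time_per_row_ms) copied from the leaderboard above.
rows = [
    ("StackedEnsemble_AllModels", 0.971251, 0.183697),
    ("StackedEnsemble_BestOfFamily", 0.970456, 0.156336),
    ("XGBoost_grid__1_model_1", 0.968008, 0.017665),
]

best = max(rows, key=lambda r: r[1])      # highest AUC
fastest = min(rows, key=lambda r: r[2])   # lowest per-row prediction time

auc_gap = best[1] - fastest[1]
slowdown = best[2] / fastest[2]
print(f"{best[0]} gains {auc_gap:.4f} AUC over {fastest[0]} "
      f"at about {slowdown:.1f}x the per-row prediction time")
```

On these figures the ensemble buys roughly 0.003 AUC at about ten times the per-row prediction time, which may or may not be worth it depending on latency requirements.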




# **XGB**

In [None]:
Y = X_trainss_005['isFraud'].values
X = X_trainss_005.drop(['isFraud', 'TransactionID'], axis=1)

In [None]:
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }
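
For context on how much of this search space gets explored: the grid above defines 405 combinations, while the random search further down samples only `param_comb = 5` of them and fits each on 3 folds. A minimal sketch of that arithmetic, repeating the grid so the block runs on its own:

```python
import math

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
}
folds, param_comb = 3, 5  # values used by the random search in this notebook

grid_size = math.prod(len(values) for values in params.values())
total_fits = folds * param_comb
coverage = param_comb / grid_size

print(grid_size)          # 405
print(total_fits)         # 15
print(f"{coverage:.1%}")  # 1.2%
```

Raising `param_comb` widens the coverage at a roughly proportional cost in training time.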

In [None]:
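# Note: this rebinds the name `xgb`, shadowing the `xgboost` module imported above.
# `silent` is deprecated in newer XGBoost releases in favor of `verbosity`.
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    silent=True, nthread=1)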

In [None]:
folds = 3
param_comb = 5

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring='roc_auc', n_jobs=4, cv=skf.split(X,Y), verbose=3, random_state=1001 )

# Here we go
start_time = timer(None) # timing starts from this point for "start_time" variable
random_search.fit(X, Y)
timer(start_time) # timing ends here for "start_time" variable

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed: 73.0min finished



 Time taken: 1 hours 25 minutes and 34.18 seconds.
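
The next cell reports `best_score_ * 2 - 1` as a normalized Gini score. This relies on the identity Gini = 2*AUC - 1, which applies here because the search was scored with `roc_auc`. The toy example below (made-up labels and scores, hypothetical helper names) computes AUC as the fraction of correctly ordered positive/negative pairs and then converts it.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def normalized_gini(auc):
    """Rescale AUC from [0.5, 1] (random to perfect) onto [0, 1]."""
    return 2 * auc - 1

# Toy data, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc, normalized_gini(auc))  # 0.75 0.5
```

A random classifier (AUC 0.5) therefore maps to a Gini of 0 and a perfect one (AUC 1.0) to a Gini of 1.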


In [None]:
print('\n All results:')
print(random_search.cv_results_)
print('\n Best estimator:')
print(random_search.best_estimator_)
print('\n Best normalized gini score for %d-fold search with %d parameter combinations:' % (folds, param_comb))
print(random_search.best_score_ * 2 - 1)
print('\n Best hyperparameters:')
print(random_search.best_params_)
results = pd.DataFrame(random_search.cv_results_)
results.to_csv('xgb-random-grid-search-results-01.csv', index=False)


 All results:
{'mean_fit_time': array([ 844.33210603, 1454.31477515, 1309.1265204 ,  876.15348371,
       1030.99452996]), 'std_fit_time': array([  2.55608505,   7.32876431,  16.54380758,  15.82917313,
       103.53052162]), 'mean_score_time': array([1.51649737, 3.16825008, 3.09268316, 3.21075368, 1.52770281]), 'std_score_time': array([0.01548288, 0.12367843, 0.02473287, 0.37132552, 0.46977241]), 'param_subsample': masked_array(data=[1.0, 0.6, 0.8, 1.0, 0.8],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_min_child_weight': masked_array(data=[5, 1, 5, 5, 1],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_max_depth': masked_array(data=[3, 5, 5, 5, 4],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_gamma': masked_array(data=[5, 1.5, 1, 5, 1],
             mask=[False, False, False, False, False]