
# Basic Overview
The objective is to explore the h2o package in Python and see how we can use it to build random forest models.

Comments, criticism, and appreciation are all welcome. Do not be shy: send me an email at babinu@gmail.com!

Source of data : https://www.kaggle.com/c/titanic/data

### A note about randomForest and h2o

This is a binary classification problem, and hence random forest appears to be a natural first choice.

However, we cannot yet use categorical predictors correctly in scikit learn (https://github.com/scikit-learn/scikit-learn/pull/4899), and hence we use another package named h2o instead (http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/intro.html).

A more detailed description can be seen here : https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")

In [3]:
len(train_data)

891

In [4]:
train_data['Predictions'] = -1

In [5]:
# Using h2o
import h2o
h2o.init(nthreads = -1, max_mem_size = 8)

Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O cluster uptime:,3 days 23 hours 18 mins
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.18.0.2
H2O cluster version age:,3 months and 20 days !!!
H2O cluster name:,H2O_from_python_babs4JESUS_c6btew
H2O cluster total nodes:,1
H2O cluster free memory:,6.770 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [6]:
def get_h2o_frame_with_rel_factors_test(clean_training_data_v3):
    clean_h2o_data = h2o.H2OFrame(clean_training_data_v3)    
    clean_h2o_data['Sex'] = clean_h2o_data['Sex'].asfactor()
    clean_h2o_data['Pclass'] = clean_h2o_data['Pclass'].asfactor()
    return clean_h2o_data

In [7]:
def get_h2o_frame_with_rel_factors(clean_training_data_v3):
    
    clean_h2o_data = get_h2o_frame_with_rel_factors_test(clean_training_data_v3)
    clean_h2o_data['Survived'] = clean_h2o_data['Survived'].asfactor()
    return clean_h2o_data

In [8]:
train_data_h2o = get_h2o_frame_with_rel_factors(train_data)
test_data_h2o = get_h2o_frame_with_rel_factors_test(test_data)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [9]:
# Split into 75% training/validation and 25% test. The seed is fixed to ensure reproducibility.
splits = train_data_h2o.split_frame(ratios=[0.75], seed=1)  

train_validation = splits[0]
test = splits[1]

In [10]:
y_column = 'Survived'
x_columns = 'Sex'

In [11]:
# Import H2O RF:
from h2o.estimators.random_forest import H2ORandomForestEstimator


In [12]:
# Use 10-fold cross-validation, a common default.
rf_fit_cross_val = H2ORandomForestEstimator(model_id='rf_fit_cross_val', seed=1, nfolds=10)
rf_fit_cross_val.train(x=x_columns, y=y_column, 
                       training_frame=train_validation)


drf Model Build progress: |███████████████████████████████████████████████| 100%


### Getting useful metrics.
Now we will see how to extract useful metrics from this cross-validated model object.

In [13]:
# Cross validated auc(). Details can be found here : 
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/cross-validation.html
print("Cross validated auc () is ", format(rf_fit_cross_val.auc(xval=True), '0.4f'))

# AUC() directly obtained from training data
print("AUC ()  obtained directly from training data is ", format(rf_fit_cross_val.auc(), '0.4f'))


Cross validated auc () is  0.7286
AUC ()  obtained directly from training data is  0.7646


### Dissecting AUC

What exactly is this AUC, and how do we interpret this result?

A detailed explanation can be found in https://stats.stackexchange.com/questions/145566/how-to-calculate-area-under-the-curve-auc-or-the-c-statistic-by-hand.

For now, I will explain how we got the computed value by doing a parallel computation by hand.
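
As a small aside on interpretation: the c-statistic view in the linked answer treats AUC as the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative case, with ties counting as half. The sketch below illustrates that idea on made-up labels and scores. `auc_pairwise` and the toy data are hypothetical and are not part of h2o or the Titanic data.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented purely for illustration.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.7, 0.7, 0.2, 0.4, 0.1]
toy_auc = auc_pairwise(toy_labels, toy_scores)
print("Pairwise AUC on toy data:", format(toy_auc, '0.4f'))
```

This quadratic loop is only meant to make the definition concrete. For real data a library routine is the better choice.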

In [14]:
rf_fit_cross_val

Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  rf_fit_cross_val


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.1631173476185785
RMSE: 0.4038778870136102
LogLoss: 0.5065569105131861
Mean Per-Class Error: 0.2316714975845411
AUC: 0.7645845410628019
Gini: 0.5291690821256039
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.7393161333524264: 


,0.0,1.0,Error,Rate
0,353.0,61.0,0.1473,(61.0/414.0)
1,79.0,171.0,0.316,(79.0/250.0)
Total,432.0,232.0,0.2108,(140.0/664.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.7393161,0.7148760,11.0
max f2,0.1790696,0.7512019,27.0
max f0point5,0.7393161,0.7293423,11.0
max accuracy,0.7393161,0.7921687,11.0
max precision,0.7393162,0.7582418,1.0
max recall,0.1790696,1.0,27.0
max specificity,0.7393162,0.9589372,0.0
max absolute_mcc,0.7393161,0.5523967,11.0
max min_per_class_accuracy,0.1790696,0.72,14.0


Gains/Lift Table: Avg response rate: 37.65 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0978916,0.7393162,1.9613538,1.9613538,0.7384615,0.7384615,0.192,0.192,96.1353846,96.1353846
,2,0.1370482,0.7393162,2.1452308,2.0138901,0.8076923,0.7582418,0.084,0.276,114.5230769,101.3890110
,3,0.1686747,0.7393162,1.7706667,1.9682857,0.6666667,0.7410714,0.056,0.332,77.0666667,96.8285714
,4,0.2153614,0.7393162,1.9705806,1.9687832,0.7419355,0.7412587,0.092,0.424,97.0580645,96.8783217
,5,0.3072289,0.7393162,1.9158033,1.9529412,0.7213115,0.7352941,0.176,0.6,91.5803279,95.2941176
,6,0.4277108,0.1790696,0.996,1.6833803,0.375,0.6338028,0.12,0.72,-0.4000000,68.3380282
,7,0.5376506,0.1790696,0.3638356,1.4135574,0.1369863,0.5322129,0.04,0.76,-63.6164384,41.3557423
,8,0.5993976,0.1790696,0.6478049,1.3346734,0.2439024,0.5025126,0.04,0.8,-35.2195122,33.4673367
,9,0.7018072,0.1790696,0.5077647,1.2140086,0.1911765,0.4570815,0.052,0.852,-49.2235294,21.4008584




ModelMetricsBinomial: drf
** Reported on cross-validation data. **

MSE: 0.1643192354001771
RMSE: 0.4053630908212748
LogLoss: 0.510189193374565
Mean Per-Class Error: 0.22767149758454108
AUC: 0.7285507246376812
Gini: 0.45710144927536245
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.7156862616539001: 


,0.0,1.0,Error,Rate
0,353.0,61.0,0.1473,(61.0/414.0)
1,77.0,173.0,0.308,(77.0/250.0)
Total,430.0,234.0,0.2078,(138.0/664.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.7156863,0.7148760,8.0
max f2,0.1666667,0.7512019,18.0
max f0point5,0.7156863,0.7293423,8.0
max accuracy,0.7156863,0.7921687,8.0
max precision,0.7156863,0.7393162,8.0
max recall,0.1666667,1.0,18.0
max specificity,0.75,0.9830918,0.0
max absolute_mcc,0.7156863,0.5523967,8.0
max min_per_class_accuracy,0.1883289,0.716,9.0


Gains/Lift Table: Avg response rate: 37.65 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0271084,0.75,1.6231111,1.6231111,0.6111111,0.6111111,0.044,0.044,62.3111111,62.3111111
,2,0.0557229,0.7488372,1.6774737,1.6510270,0.6315789,0.6216216,0.048,0.092,67.7473684,65.1027027
,3,0.1265060,0.7414634,1.8648511,1.7706667,0.7021277,0.6666667,0.132,0.224,86.4851064,77.0666667
,4,0.2048193,0.7403846,1.9409231,1.8357647,0.7307692,0.6911765,0.152,0.376,94.0923077,83.5764706
,5,0.3072289,0.7348837,2.0310588,1.9008627,0.7647059,0.7156863,0.208,0.584,103.1058824,90.0862745
,6,0.4322289,0.1883289,1.056,1.6565296,0.3975904,0.6236934,0.132,0.716,5.6000000,65.6529617
,7,0.5602410,0.1855670,0.3124706,1.3494194,0.1176471,0.5080645,0.04,0.756,-68.7529412,34.9419355
,8,0.6204819,0.1846154,0.3320000,1.2506408,0.125,0.4708738,0.02,0.776,-66.8,25.0640777
,9,0.7018072,0.1835107,0.3934815,1.1513133,0.1481481,0.4334764,0.032,0.808,-60.6518519,15.1313305



Cross-Validation Metrics Summary: 


,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid,cv_6_valid,cv_7_valid,cv_8_valid,cv_9_valid,cv_10_valid
accuracy,0.7895145,0.0258059,0.835443,0.8072289,0.8032787,0.779661,0.8235294,0.7118644,0.7543859,0.8260869,0.7605634,0.7931035
auc,0.7713597,0.0266816,0.8151852,0.7879949,0.7733957,0.7407895,0.828125,0.7132184,0.7375,0.8180556,0.7389163,0.7604167
err,0.2104855,0.0258059,0.1645570,0.1927711,0.1967213,0.2203390,0.1764706,0.2881356,0.2456140,0.1739131,0.2394366,0.2068966
err_count,13.8,1.4071248,13.0,16.0,12.0,13.0,12.0,17.0,14.0,12.0,17.0,12.0
f0point5,0.719099,0.0451879,0.7364341,0.7241379,0.6451613,0.6593407,0.8653846,0.7307692,0.7425743,0.7421875,0.72,0.625
f1,0.7061765,0.0368760,0.7450980,0.7241379,0.6666667,0.6486486,0.8181818,0.6909091,0.6818182,0.76,0.6792453,0.6470588
f2,0.6959623,0.0388520,0.7539682,0.7241379,0.6896552,0.6382979,0.7758621,0.6551724,0.6302521,0.7786886,0.6428571,0.6707317
lift_top_group,1.9865305,0.1793898,2.3092308,2.0725327,2.266254,2.0701754,1.7,1.4946667,1.8,2.1009614,1.8362069,2.2152777
logloss,0.514018,0.0407193,0.4427268,0.4818485,0.4780133,0.5166941,0.5014075,0.6392374,0.5762265,0.4548141,0.5585339,0.4906777


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_lift,training_classification_error
,2018-06-26 13:37:00,0.340 sec,0.0,,,,,
,2018-06-26 13:37:00,0.341 sec,1.0,0.4182649,0.5355520,0.7601504,2.0405854,0.2236842
,2018-06-26 13:37:00,0.342 sec,2.0,0.4074420,0.5145612,0.7747845,2.0574648,0.2091837
,2018-06-26 13:37:00,0.343 sec,3.0,0.4070305,0.5128467,0.7711684,1.9843678,0.2111801
,2018-06-26 13:37:00,0.343 sec,4.0,0.4079117,0.5143937,0.7694028,1.9718788,0.2127273
---,---,---,---,---,---,---,---,---
,2018-06-26 13:37:00,0.406 sec,46.0,0.4038779,0.5065569,0.7751981,1.9584646,0.2078313
,2018-06-26 13:37:00,0.408 sec,47.0,0.4038779,0.5065569,0.7754879,1.9994607,0.2078313
,2018-06-26 13:37:00,0.410 sec,48.0,0.4038779,0.5065569,0.7685266,2.0090256,0.2078313



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
Sex,2378.1787109,1.0,1.0




NOTE : An issue with the confusion matrix has been logged at https://github.com/h2oai/h2o-tutorials/issues/71

We now compute the AUC by hand, as the area under the ROC curve.

The ROC curve plots sensitivity against (1 - specificity), and here it is built from three data points.

Data point #1 -> Corresponding to sensitivity = 0 and specificity = 1 (or (1 - specificity) = 0)

Data point #2 -> Corresponding to sensitivity = 1 and specificity = 0 (or (1 - specificity) = 1)

Data point #3 -> Corresponding to the case where we predict 1 (survived) and check how many actually survived.

In this scenario => Of the 250 passengers who actually survived, we predict 173 as survived => Sensitivity = 173/250

Also, of the 414 passengers who did not survive, we predict 353 correctly => Specificity = 353/414

=> Our relevant data point = (1 - 353/414, 173/250)

=> AUC = Area under the ROC curve => Area of the triangle plus the adjacent trapezoid, which works out to 0.5 * ( 353/414 + 173/250 ) = 0.77232850241

NOTE : The expression for the area simplifies nicely to 0.5*(sensitivity + specificity).
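
The hand computation can be checked with a few lines of plain Python. This is a minimal sketch that applies the trapezoid rule to the three ROC points. The counts are taken from the confusion matrix figures quoted in the text (173 of 250 survivors and 353 of 414 non-survivors predicted correctly), and the variable names are my own.

```python
# Counts quoted in the text: rows are actual classes, columns are predictions.
tp, fn = 173, 77   # actual survivors: predicted 1 / predicted 0
tn, fp = 353, 61   # actual non-survivors: predicted 0 / predicted 1

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# ROC points as (1 - specificity, sensitivity), ordered left to right.
roc_points = [(0.0, 0.0), (1.0 - specificity, sensitivity), (1.0, 1.0)]

# Trapezoid rule: sum of width * average height over consecutive points.
auc_by_hand = sum(
    (x1 - x0) * (y0 + y1) / 2.0
    for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:])
)

print("AUC by trapezoids        :", format(auc_by_hand, '0.10f'))
print("0.5 * (sens + spec)      :", format(0.5 * (sensitivity + specificity), '0.10f'))
```

Both printed values agree, which confirms that with a single operating point the area reduces to the average of sensitivity and specificity.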

Let us compare this with the AUC value reported directly by the model.

In [15]:
perf = rf_fit_cross_val.model_performance()  #train=True is the default, so it's not needed
perf.plot() 
perf.show()

<matplotlib.figure.Figure at 0x10dbf1358>


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.1631173476185785
RMSE: 0.4038778870136102
LogLoss: 0.5065569105131861
Mean Per-Class Error: 0.2316714975845411
AUC: 0.7645845410628019
Gini: 0.5291690821256039
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.7393161333524264: 


,0.0,1.0,Error,Rate
0,353.0,61.0,0.1473,(61.0/414.0)
1,79.0,171.0,0.316,(79.0/250.0)
Total,432.0,232.0,0.2108,(140.0/664.0)


Maximum Metrics: Maximum metrics at their respective thresholds



metric,threshold,value,idx
max f1,0.7393161,0.7148760,11.0
max f2,0.1790696,0.7512019,27.0
max f0point5,0.7393161,0.7293423,11.0
max accuracy,0.7393161,0.7921687,11.0
max precision,0.7393162,0.7582418,1.0
max recall,0.1790696,1.0,27.0
max specificity,0.7393162,0.9589372,0.0
max absolute_mcc,0.7393161,0.5523967,11.0
max min_per_class_accuracy,0.1790696,0.72,14.0


Gains/Lift Table: Avg response rate: 37.65 %



,group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,cumulative_response_rate,capture_rate,cumulative_capture_rate,gain,cumulative_gain
,1,0.0978916,0.7393162,1.9613538,1.9613538,0.7384615,0.7384615,0.192,0.192,96.1353846,96.1353846
,2,0.1370482,0.7393162,2.1452308,2.0138901,0.8076923,0.7582418,0.084,0.276,114.5230769,101.3890110
,3,0.1686747,0.7393162,1.7706667,1.9682857,0.6666667,0.7410714,0.056,0.332,77.0666667,96.8285714
,4,0.2153614,0.7393162,1.9705806,1.9687832,0.7419355,0.7412587,0.092,0.424,97.0580645,96.8783217
,5,0.3072289,0.7393162,1.9158033,1.9529412,0.7213115,0.7352941,0.176,0.6,91.5803279,95.2941176
,6,0.4277108,0.1790696,0.996,1.6833803,0.375,0.6338028,0.12,0.72,-0.4000000,68.3380282
,7,0.5376506,0.1790696,0.3638356,1.4135574,0.1369863,0.5322129,0.04,0.76,-63.6164384,41.3557423
,8,0.5993976,0.1790696,0.6478049,1.3346734,0.2439024,0.5025126,0.04,0.8,-35.2195122,33.4673367
,9,0.7018072,0.1790696,0.5077647,1.2140086,0.1911765,0.4570815,0.052,0.852,-49.2235294,21.4008584





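How do we explain the mismatch between the AUC numbers?

I am not very sure. From the following link (http://gim.unmc.edu/dxtests/roc3.htm) and from the plot obtained from the summary, it looks like h2o computes AUC using a parametric method, whereas we approximate it using trapezoids. Another possible contributor is that h2o reports DRF training metrics on out-of-bag samples, so the value stored in the model need not equal the AUC of predictions made on the full training frame.

Moreover, when we compute the metric by passing the training frame to the model performance function, or by using the relevant function in scikit learn, the values match our hand computation exactly.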

In [16]:
# Using model performance function.
print("AUC ()  obtained directly from training data is ", 
      format(rf_fit_cross_val.model_performance(train_validation).auc(), '0.10f')) 

AUC ()  obtained directly from training data is  0.7723285024


In [17]:
# Using relevant routine from scikit learn.
survived_set = rf_fit_cross_val.predict(train_validation)['predict']
survived_set['Survived'] = train_validation['Survived']
survived_set_df = survived_set.as_data_frame()
from sklearn.metrics import roc_curve, auc, roc_auc_score

fp_rate, tp_rate, thresholds = roc_curve(survived_set_df['Survived'], survived_set_df['predict'])
print(auc(fp_rate, tp_rate))
print(roc_auc_score(survived_set_df['Survived'], survived_set_df['predict']))

drf prediction progress: |████████████████████████████████████████████████| 100%
0.7723285024154589
0.7723285024154589


### What are the takeaways from this ?

1. Though we understand the basic computation of AUC, we are not able to replicate exactly how the h2o model computes it internally. This will make things tricky later, when we build more complex models and need to compute AUC by hand and compare it with earlier values.

2. We are able to compute AUC by hand using the trapezoid approximation, and the result matches the values generated by scikit learn. We can opt to use this method of computing AUC rather than the one used implicitly by the model.

3. We can ignore AUC and instead use a fit score, i.e. the fraction of observations we predict correctly, and use that to refine our model.

Practically, I believe we can go for both #2 and #3. Let us see how it goes!
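
To make the fit score in #3 concrete, here is a minimal sketch in plain Python. `fit_score` is a hypothetical helper name and the labels are toy values, not Titanic results. It simply returns the fraction of predictions that equal the actual labels.

```python
def fit_score(actual, predicted):
    """Fraction of observations whose predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot score an empty set of observations")
    # Cast to int so that labels stored as strings or floats compare cleanly.
    matches = sum(1 for a, p in zip(actual, predicted) if int(a) == int(p))
    return matches / len(actual)


# Toy labels, invented purely for illustration: 4 of 5 predictions are right.
toy_actual = [0, 1, 1, 0, 1]
toy_predicted = [0, 1, 0, 0, 1]
toy_score = fit_score(toy_actual, toy_predicted)
print("Fit score on toy data is ", toy_score)
```

Unlike AUC, this score depends on the threshold used to turn probabilities into 0/1 labels, so it is best compared across models that use the same thresholding rule.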

In [18]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Predictions
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,-1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,-1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,-1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,-1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,-1


In [19]:
# Function for splitting input dataframe into training/validation and test sets.
def get_train_validation_test_data(input_df, ratio=0.75):
    train_len = int(ratio * len(input_df))
    train_validation = input_df[:train_len]
    test = input_df[train_len:]
    return (train_validation, test)


In [20]:
(train_validation, test) = get_train_validation_test_data(train_data)
test_h2o = get_h2o_frame_with_rel_factors_test(test)
train_validation_h2o = get_h2o_frame_with_rel_factors(train_validation)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [21]:
def get_predictions_given_model_and_data(rf_fit, train_validation, train_validation_h2o):
    # Predict with the h2o model, then collect the results in a pandas DataFrame.
    predict_out = rf_fit.predict(train_validation_h2o) 
    if 'Survived' in train_validation.columns:
        survived_set = pd.DataFrame(columns=['Survived', 'Predictions'])
        survived_set['Survived'] = train_validation['Survived']
    else:
        survived_set = pd.DataFrame(columns=['Predictions'])
    survived_set['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()
    
    return survived_set

In [22]:
# The bread and butter routine for cross validation.
def get_cross_val_output(input_df, x_columns=['Sex'], y_column = 'Survived', nfolds=10):
    partition_indices = np.array_split(np.arange(len(input_df)), nfolds)
    
    cross_validated_data = pd.DataFrame(columns=input_df.columns)
    for i in range(nfolds):
        cross_validated_set = input_df[partition_indices[i][0]:partition_indices[i][-1] + 1].copy()
        rel_training_data = pd.DataFrame(columns=input_df.columns)
        for j in range(nfolds):
            if j != i:
                training_set = input_df[partition_indices[j][0]:partition_indices[j][-1] + 1]
                rel_training_data = pd.concat([rel_training_data, training_set])
        h2o_train_data = get_h2o_frame_with_rel_factors(rel_training_data)

        rf_fit = H2ORandomForestEstimator(model_id='rf_fit', seed=1)
        rf_fit.train(x=x_columns, y=y_column, training_frame=h2o_train_data)

        predict_out = rf_fit.predict(
            get_h2o_frame_with_rel_factors_test(cross_validated_set))                

        cross_validated_set['Predictions'] = predict_out.as_data_frame()['predict'].values.tolist()
        cross_validated_data = pd.concat([cross_validated_data, cross_validated_set])

    return cross_validated_data
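
The routine above relies on `np.array_split` to cut the row indices into contiguous folds. The sketch below reproduces that partitioning in plain Python, to show what the folds look like for the 668 training/validation rows (`int(0.75 * 891)`). `contiguous_folds` is a hypothetical helper written for illustration only.

```python
def contiguous_folds(n_rows, nfolds):
    """Return (start, stop) row ranges for contiguous folds.

    Mirrors the sizes produced by np.array_split: the first
    n_rows % nfolds folds get one extra row.
    """
    base, extra = divmod(n_rows, nfolds)
    bounds = []
    start = 0
    for i in range(nfolds):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


folds = contiguous_folds(668, 10)
fold_sizes = [stop - start for start, stop in folds]
print(fold_sizes)

# Each fold is held out once; the remaining rows form the training set.
held_out = folds[0]
training_ranges = [f for f in folds if f != held_out]
print("held out:", held_out, "training ranges:", len(training_ranges))
```

Because the folds are contiguous blocks, this scheme assumes the rows are not sorted in a way that correlates with survival. Shuffling first would be the safer choice if that were in doubt.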
    

In [23]:
cross_validated_output = get_cross_val_output(train_validation, x_columns=['Sex'], y_column = 'Survived', nfolds=10)

Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf Model Build progress: |███████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
drf prediction progress: |████████████████████████████████████████████████| 100%
Parse progress: |███████████

In [24]:
assert(cross_validated_output[cross_validated_output['Predictions'] == -1].empty)

In [25]:
print("Fit score obtained through cross validation is ", 
      np.sum((cross_validated_output['Survived'] == cross_validated_output['Predictions'])/len(
          cross_validated_output['Survived'])))

Fit score obtained through cross validation is  0.7859281437125747


In [26]:
print(roc_auc_score(cross_validated_output['Survived'].apply(lambda x : int(x)), 
                    cross_validated_output['Predictions'].apply(lambda x : int(x))))

0.768947963800905


### Out of sample testing.
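Let us see how this model performs out of sample. Intuitively, one would not expect the results to differ much from those of cross validation. Note, however, that rf_fit_cross_val was trained on the random split from In [9], while test_h2o comes from the sequential split in In [20], so some of these test rows were probably seen during training.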

In [27]:

rf_fit_cross_val.model_performance(test_h2o).auc()

0.7593409444732745

In [28]:
# Only the last expression in a cell is displayed, so the value shown is the cross-validated AUC.
rf_fit_cross_val.model_performance(train_validation_h2o).auc()
rf_fit_cross_val.auc(xval=True)

0.7285507246376812

Let us see how things look using our own performance metrics.

In [29]:
predictions_set = get_predictions_given_model_and_data(rf_fit_cross_val, test, test_h2o)

drf prediction progress: |████████████████████████████████████████████████| 100%


In [30]:
print("Computed AUC is ", 
      roc_auc_score(predictions_set['Survived'].apply(lambda x : int(x)), 
                    predictions_set['Predictions'].apply(lambda x : int(x))))

Computed AUC is  0.7593409444732746


In [31]:
print("Computed fit score is ",
      np.sum((predictions_set['Survived'] == predictions_set['Predictions'])/len(
          predictions_set['Survived'])))

Computed fit score is  0.7892376681614349


### Train the model on entire training data

In [32]:
model1_full_train_data = H2ORandomForestEstimator(model_id='model1_full_train_data', seed=1)
model1_full_train_data.train(x=x_columns, y=y_column, 
                             training_frame=train_data_h2o)


drf Model Build progress: |███████████████████████████████████████████████| 100%


###  Making predictions
Let us make predictions on the test data here !

In [33]:
predict_out = get_predictions_given_model_and_data(model1_full_train_data, test_data, test_data_h2o)

drf prediction progress: |████████████████████████████████████████████████| 100%


In [34]:
test_data['Survived'] = predict_out['Predictions']

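A sanity check on the test data: since the model uses only Sex, no male passenger should be predicted as a survivor, so the result below should be empty.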

In [35]:
test_data[(test_data['Survived'] ==1) &(test_data['Sex'] == 'male')]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived


In [36]:
test_data[['PassengerId', 'Survived']].sort_values(by=['PassengerId']).to_csv("output_only_gender.csv", index=False)
# Kaggle score : 0.76555 ???