# Final assignment: Part 1

Part 1: The only type of model you should use for this part is the H2OGeneralizedLinearEstimator model. You should:

1. Deal appropriately with missings (for all numeric variables, -99 means missing).
2. Deal with numerics, i.e. for at least some of them try linear splines (or another method of your choice to deal with non-linear effects).
3. Deal with hccvs (e.g. using the feature encoding library that we looked at in lecture). You do not need to deal with low cardinality categorical features since H2O will one-hot them for you.
4. Try out some interactions.
5. Try out some other features (e.g. division of numerics).

Presumably you will train various models, submit your predictions on Kaggle and note the public leaderboard score.

- Choose your best model, and for it, create a function which carries out any data preparation and fitting:
• The name of your function must be fn_logistic
• You must save your function in a (plain text) file with exactly the following name
      en_<studentnumber>.py
For example, if your student number is 123456789 then your function must be stored as en_123456789.py
• The only input for your function should be the df_train and df_test datasets created in 01a_ReadData. You may choose the smaller or larger train data, as you wish, but the score in your return statement should be consistent. Any data manipulation should be done by code in your function.
• The only output from your function should be three items, in this order:
  - Your trained H2OGeneralizedLinearEstimator object
  - The test data, i.e. the data that you feed your object when you make predictions.
  - Your Kaggle public leaderboard score for this model, hardcoded as a number to 3 d.p.
• Your function should be totally self contained. If it requires any import, e.g. of numpy or pandas or from sklearn, you should do those imports in your function.
• There should be no code at all in your .py file before the def statement and no code after the end of the return statement.
• The code in your function should be tidy (especially if you download it from an ipynb) and it should be well commented.
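
The requirements above can be read as the following skeleton. It is only a sketch of the required shape, not a working solution: the preparation steps are left as comments, the column names `target`, `id` and `unique_id` are assumed from this notebook, `h2o_df_test` is a hypothetical local name, and the score is a placeholder that must be replaced with your own leaderboard score.

```python
def fn_logistic(df_train, df_test):
    # All imports live inside the function so that the file is self contained
    import h2o
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator

    # --- data preparation goes here (missings, splines, hccvs, interactions, ratios) ---
    # Anything fitted on df_train must be applied unchanged to df_test.

    # Assumption: 'target' is the dependent variable, 'id'/'unique_id' are not features
    features = [c for c in df_train.columns if c not in ('target', 'id', 'unique_id')]

    # --- fitting ---
    h2o.init()
    h2o_df_train = h2o.H2OFrame(df_train[features + ['target']])
    h2o_df_train['target'] = h2o_df_train['target'].asfactor()
    h2o_df_test = h2o.H2OFrame(df_test[features])
    model = H2OGeneralizedLinearEstimator(family='binomial', link='logit')
    model.train(x=features, y='target', training_frame=h2o_df_train)

    # Placeholder only: replace with your own public leaderboard score to 3 d.p.
    score = 0.000
    return model, h2o_df_test, score
```

Note that nothing precedes the `def` line and nothing follows the `return` line, as the assignment requires.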


**Import packages**

In [88]:
### Import packages
import os
import numpy as np
import pandas as pd
import pickle
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

#category encoders
from category_encoders import LeaveOneOutEncoder

#needed for fn_computeRatiosOfNumerics()
from itertools import permutations

In [89]:
#### Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData = "../PData/"
dirPOutput = "../POutput/"

/home/jovyan/Projects/final_assignment/PCode


In [90]:
### Functions
def fn_MAE(actuals, predictions):
    return np.round(np.mean(np.abs(predictions - actuals)))

def fn_RMSE(actuals, predictions):
    return np.round(np.sqrt(np.mean((predictions - actuals)**2)))

def fn_tosplines(x):
    # x is a pandas Series; its name is used to label the spline columns
    # (previously this relied on a global variable called `var`)
    var = x.name
    x = x.values
    # hack: remove zeros to avoid issues where lots of values are zero
    x_nonzero = x[x != 0]
    ptiles = np.percentile(x_nonzero, [10, 20, 40, 60, 80, 90])
    ptiles = np.unique(ptiles)
    print(var, ptiles)
    df_ptiles = pd.DataFrame({var: x})
    # one basis column per knot: max(0, x - knot)
    for idx, ptile in enumerate(ptiles):
        df_ptiles[var + '_' + str(idx)] = np.maximum(0, x - ptile)
    return df_ptiles

def fn_computeRatiosOfNumerics(df, variables):
    ## Process:
    # 1. Gets passed the most important numeric variables
    # 2. Computes all pairwise ratios between them, i.e.
    #    gets all permutations of length 2, and divides term 1 by term 2
    # 3. Returns a dataframe containing the engineered variables, with appropriately named columns

    #     variables = ['A','B','C'] #debugging
    df_2 = pd.DataFrame(index=df.index)
    for numer, denom in permutations(variables, 2):
        # pandas division never raises ZeroDivisionError:
        # it gives +/-inf when only the denominator is 0, and NaN when both terms are 0
        df_2['ratio_{}.{}'.format(numer, denom)] = df[numer] / df[denom]

    # In both of those cases we want the engineered value to be 0,
    # therefore replace all inf and nan values with zeroes
    df_2.replace([np.inf, -np.inf, np.nan], 0, inplace=True)
    return df_2

def fn_createInteractions(df, factors):
    ## takes as input a pandas dataframe, and a LIST of column names on which to create interactions
    #create an h2o frame
    h2o_df_temp = h2o.H2OFrame(df[factors], destination_frame='df_interactions_temp')

    #use H2OFrame.interaction(factors, pairwise, max_factors, min_occurrence, destination_frame=None)
    h2o_df_temp = h2o_df_temp.interaction(factors, pairwise=True, max_factors=100, min_occurrence=1)

    return h2o_df_temp.as_data_frame(use_pandas=True)

In [91]:
#### Load data via pickle
f_name = dirPData + '01_df_250k.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

df_train = dict_['df_train']
df_test  = dict_['df_test']

del f_name, dict_

In [92]:
f_name = dirPData + '01_vars.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

vars_ind_numeric     = dict_['vars_ind_numeric']
vars_ind_hccv        = dict_['vars_ind_hccv']
vars_ind_categorical = dict_['vars_ind_categorical']
vars_notToUse        = dict_['vars_notToUse']
var_dep              = dict_['var_dep']

del f_name, dict_

In [93]:
### Weirdly, 'id' doesn't appear in the index. I believe 'unique_id' should be used instead, but both are kept in vars_notToUse, as it is not clear how the data will be passed to the function.
#add 'unique_id' to vars_notToUse so that it is never used as a feature
if 'unique_id' not in vars_notToUse: #make sure we don't add it if already there
    vars_notToUse.extend(['unique_id']) 
# vars_ind_numeric
vars_notToUse


['id', 'unique_id']

In [94]:
### Set index for train, val, design, test data
#### Create folds to separate train data into train, val, design, test
rng = np.random.RandomState(2020)
fold = rng.randint(0, 10, df_train.shape[0])
df_train['fold'] = fold

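#get indices for each subset
#train = folds 0-6, val = folds 7-8, design = train + val (folds 0-8); fold 9 is held out
#(train previously used range(8), which put fold 7 in both train and val)
idx_train  = df_train['fold'].isin(range(7))
idx_val    = df_train['fold'].isin([7, 8])
idx_design = df_train['fold'].isin(range(9))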

#drop fold column
df_train.drop(columns='fold', inplace=True)
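
A quick way to sanity-check a fold-based split is to confirm that the masks do not overlap. The sketch below uses a small synthetic frame (`df_demo`, not the assignment data) and assumes the intended layout is folds 0-6 for train, 7-8 for validation and fold 9 held out.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2020)
df_demo = pd.DataFrame({'x': np.arange(1000)})           # synthetic stand-in for df_train
fold = pd.Series(rng.randint(0, 10, df_demo.shape[0]))   # fold labels 0..9

idx_train_demo  = fold.isin(range(7))   # folds 0-6
idx_val_demo    = fold.isin([7, 8])     # folds 7-8
idx_design_demo = fold.isin(range(9))   # folds 0-8

# train and val must be disjoint, and together they must equal design
overlap = int((idx_train_demo & idx_val_demo).sum())
design_matches = bool(((idx_train_demo | idx_val_demo) == idx_design_demo).all())
print(overlap, design_matches)
```

If `overlap` is not zero, some rows are being used both to fit and to validate, and the validation score will look better than it should.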

**Start and connect the H2O JVM**
- load previous models

In [95]:
h2o.init(port=54321)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime: 2 hours 0 mins
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.24.0.3
H2O cluster version age: 1 year, 2 months and 7 days !!!
H2O cluster name: H2O_from_python_jovyan_r5blp1
H2O cluster total nodes: 1
H2O cluster free memory: 3.026 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4


*Models are taking very long to run so have pre-loaded them below.*
- uncomment the below code to load the models but note that they must be in the PData directory 

In [96]:
# ### LOAD THE MODELS

# # GLM basic, no interactions, no mean imputation for missing level values in test
# # model name: GLM_model_basic
# path_glm_basic = dirPData + 'GLM_model_basic'

# # GLM basic, no interactions, WITH mean imputation for missing level values in test
# # model name: GLM_model_basic_meanImpute
# path_glm_basic_meanImpute = dirPData + 'GLM_model_basic_meanImpute'

# # GLM numerical divisions, no interactions, WITH mean imputation for missing level values in test
# # model name: GLM_model_numeric_meanImpute
# path_glm_numeric_meanImpute = dirPData + 'GLM_model_numeric_meanImpute'

# # GLM numerical divisions, with interactions, WITH mean imputation for missing level values in test
# # model name: GLM_model_numeric_interactions_meanImpute

# glm_basic = h2o.load_model(path = path_glm_basic) 
# glm_basic_meanImpute = h2o.load_model(path = path_glm_basic_meanImpute)
# glm_numeric_meanImpute = h2o.load_model(path = path_glm_numeric_meanImpute)

**Deal with missings**

In [97]:
#### IDENTIFY NULLS, MISSINGS
print(df_train.shape)
#collapse axis = 0 i.e. sum missing values,
#store as series
## Check for nulls
srs_null = df_train.isnull().sum(axis=0) 
print(srs_null[srs_null>0]) #show which features have null values

## Check for missing numerics which have been replaced with -99 (a placeholder; the value is really missing)
#get the fraction of missing values for each feature
srs_missing = (df_train == -99).mean(axis=0)
print(srs_missing[srs_missing!=0])  #show which numerics have 'missing' placeholder values, and their fraction of missing values

#get list of variables which have at least x% missing values
#arbitrarily setting threshold to 50% but could tune this parameter if time permits
many_missings = srs_missing[srs_missing >= .5].index.tolist()

## DO NOT USE VARIABLES WITH x% OR MORE MISSINGS
#add vars from many_missings to vars_notToUse (they are removed from the list of numeric variables below)
vars_notToUse.extend(many_missings)
#turn into a set and back into a list - deals with the issue of duplicates when running the code multiple times
vars_notToUse = list(set(vars_notToUse)) 

#remove variables in many_missings from var_ind_numeric
vars_ind_numeric = [var for var in vars_ind_numeric if var not in vars_notToUse]
# print([var for var in vars_ind_numeric if var in vars_notToUse])  #double check they've been removed: printed list should be empty

(250000, 97)
Series([], dtype: int64)
a04    0.411936
a05    0.095152
a06    0.095152
a07    0.095152
a08    0.007408
a09    0.008360
a11    0.095592
a14    0.097284
a15    0.046692
b01    0.131224
b05    0.135056
b06    0.791212
c01    0.035580
c03    0.024604
d01    0.222176
d02    0.231476
d03    0.228544
e02    0.000008
e12    0.004368
f11    0.000004
f13    0.000004
dtype: float64


In [98]:
### MEAN-IMPUTE MISSINGS
# list of variables to impute
vars_toImpute = [var for var in srs_missing[srs_missing>0].index.tolist() if var not in many_missings]

#replace the -99 placeholder with NaN in both train and test, so that fillna can be used
df_train[vars_toImpute] = df_train[vars_toImpute].replace(-99, np.nan)
df_test[vars_toImpute] = df_test[vars_toImpute].replace(-99, np.nan)

#compute the mean of each column on train data only, and fill the NaNs in both train and test with it
#(previously only train was imputed, so test still contained -99 values)
train_means = df_train[vars_toImpute].mean()
df_train[vars_toImpute] = df_train[vars_toImpute].fillna(train_means)
df_test[vars_toImpute] = df_test[vars_toImpute].fillna(train_means)
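
Because -99 is only a placeholder, the same replacement has to reach the test data too, using statistics learned on train. A minimal sketch on toy frames (the column name `a` and the values are made up for illustration):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'a': [1.0, -99.0, 3.0, 5.0]})
toy_test  = pd.DataFrame({'a': [-99.0, 2.0]})

# turn the placeholder into a real missing value
toy_train = toy_train.replace(-99, np.nan)
toy_test  = toy_test.replace(-99, np.nan)

# learn the fill value on train only, then apply it to both frames
train_means = toy_train.mean()
toy_train = toy_train.fillna(train_means)
toy_test  = toy_test.fillna(train_means)
```

Here the train mean of `a` is taken over the observed values 1, 3 and 5 only, so the placeholder never contaminates the fill value.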

In [99]:
df_train.shape

(250000, 97)

**Prepare basis functions**

In [100]:
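### Spline numeric variables with cardinality higher than 8
# define variables to spline
vars_ind_tospline = df_train[vars_ind_numeric].columns[(df_train[vars_ind_numeric].nunique() > 8)].tolist()
#Find the percentiles (knots) on train data only, then apply the same knots to both train and test data, even if the test data distribution is very different.
#update df_train, df_test
for var in vars_ind_tospline:
    df_ptiles = fn_tosplines(df_train[var])
    #build the test basis from the TEST values, reusing the knots found on train
    #(previously the train spline columns themselves were concatenated onto df_test)
    x_train = df_train[var].values
    x_test = df_test[var].values
    ptiles = np.unique(np.percentile(x_train[x_train != 0], [10, 20, 40, 60, 80, 90]))
    df_ptiles_test = pd.DataFrame({var: x_test}, index=df_test.index)
    for idx, ptile in enumerate(ptiles):
        df_ptiles_test[var + '_' + str(idx)] = np.maximum(0, x_test - ptile)
    df_train.drop(columns=[var], inplace=True)
    df_test.drop(columns=[var], inplace=True)
    vars_ind_numeric.remove(var)
    df_train = pd.concat([df_train, df_ptiles], axis=1, sort=False)
    df_test = pd.concat([df_test, df_ptiles_test], axis=1, sort=False)
    vars_ind_numeric.extend(df_ptiles.columns.tolist())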


a04 [ 18.          36.          81.08778636  84.         180.        ]
a15 [1. 2. 3. 4. 5. 6.]
b01 [ 9. 26. 46. 54. 77. 87.]
b05 [ 9. 10. 13. 14. 17. 19.]
c01 [11. 21. 54. 88.]
c03 [1. 2. 4. 5.]
d01 [1.       3.       4.810065 5.       9.      ]
d02 [1.         1.35343073 2.        ]
d03 [1.         1.80588394 2.        ]
e02 [11. 20. 41. 60. 80. 90.]
e04 [12. 18. 39. 60. 81. 89.]
e05 [ 9. 19. 39. 59. 79. 89.]
e06 [20. 61. 76. 89.]
e08 [16. 40. 61. 81. 90.]
e09 [18. 43. 62. 79. 90.]
e12 [11. 20. 41. 59. 79. 89.]
e15 [13. 22. 36. 59. 83. 90.]
f01 [13. 41. 59. 75.]
f02 [10. 19. 40. 58. 79. 90.]
f06 [ 9. 19. 60.]
f11 [ 4.  6.  9. 11. 13. 15.]
f13 [ 4.  6.  8.  9. 10. 11.]


In [101]:
df_train.shape

(250000, 208)

In [102]:
df_test.shape

(296690, 207)

In [17]:
# for convenience store dependent variable as y
y = df_train[var_dep].values.ravel()

**HCCV**
- note that any modifications made to train data must also be made to test data (engineered columns etc)

In [18]:
### GET HCCV VARS

## If want to use some cardinality threshold other than 30, can edit threshold below:
th_card = 30
srs_card = df_train[vars_ind_categorical].nunique()
print(srs_card.min())
print(srs_card.max())
print(srs_card[srs_card>th_card])
vars_ind_hccv = srs_card[srs_card>th_card].index.values.tolist()  #stores names of categorical variables with cardinality higher than threshold

1
21244
e17       82
e18      684
e19    21244
f10     1704
dtype: int64


In [19]:
### HCCV ENCODING USING category_encoders

enc = LeaveOneOutEncoder(cols=vars_ind_hccv, sigma=0.3)
enc.fit(df_train[idx_design], y[idx_design])
df_train = enc.transform(df_train)  #encode hccvs in train data
# df_train[vars_ind_hccv].head()

df_test['target'] = np.nan  #add NaN target column to test dataset in order for it to have same shape as df_train
df_test = enc.transform(df_test)  #encode hccvs in test data
df_test.drop(columns='target', inplace=True)  #drop target column from df_test 
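
To make the idea behind leave-one-out encoding concrete, here is a plain pandas sketch on a toy frame. It illustrates the principle only and is not the internals of `category_encoders`; the column names `cat` and `y` are made up.

```python
import pandas as pd

toy = pd.DataFrame({'cat': ['a', 'a', 'a', 'b', 'b'],
                    'y':   [1, 0, 1, 1, 0]})

grp = toy.groupby('cat')['y']
level_sum = grp.transform('sum')
level_cnt = grp.transform('count')

# training rows: mean of y over the OTHER rows that share the same level
toy['cat_loo'] = (level_sum - toy['y']) / (level_cnt - 1)

# unseen data has no y, so it gets the plain per-level mean learned on train
level_mean = toy.groupby('cat')['y'].mean()
new_rows = pd.Series(['a', 'b'])
new_encoded = new_rows.map(level_mean)
```

Excluding a row's own target from its encoding reduces target leakage on the training data. A level that occurs only once would divide by zero here, so a real encoder needs a fallback for that case.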

In [20]:
# df_train[vars_ind_hccv]  #see newly added hccv columns in train data

**Try out some interactions**
- same applies here, whatever interactions are in train data must also be in test data
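
As a reference point for what a pairwise categorical interaction is, the sketch below builds one in plain pandas on a toy frame. It is a stand-in for illustration, not how H2O implements `interaction`; `pairwise_interactions` and the column names are hypothetical.

```python
from itertools import combinations
import pandas as pd

toy = pd.DataFrame({'f1': ['A', 'B', 'A'],
                    'f2': ['x', 'x', 'y'],
                    'f3': ['p', 'q', 'q']})

def pairwise_interactions(df, factors):
    # one new column per unordered pair; its level is the two levels joined by '_'
    out = pd.DataFrame(index=df.index)
    for left, right in combinations(factors, 2):
        out[left + '_' + right] = df[left].astype(str) + '_' + df[right].astype(str)
    return out

toy_inter = pairwise_interactions(toy, ['f1', 'f2', 'f3'])
```

Five factors give ten pairwise columns, and each column can have up to the product of the two cardinalities as levels, which is why the H2O call caps the number of levels with `max_factors`.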

In [81]:
### NOTE: The below interactions are created based on the largest
### coefficients in a previously-run model. The code below identifies
### those coefficients by loading the model and manipulating the data.
### However, as assignment requires only input to be train and test
### datasets, the most important variables have been hardcoded in.


# ##Inspect coefficients from basic model with no interactions
# # Plot standardised coefficients
# glm_basic.std_coef_plot(num_of_features=10)

# # Get list of 5 most important variables via varimp()
# # note that glm_basic.varimp() contains some onehots created by H2o on the fly when building the model, and thus some aren't actually present in the train/test frames
# # therefore can't refer to them before running a model, and we need to refer to the original variables before h2o onehots them
# # we extract these by:
# # - getting only the name of the variable and not its values i.e. var[0] for var in glm_basic.varimp()
# # - splitting on onehot delimiter '.' and keeping only first part of result. This is name of original variable

# # Get list of FIVE most important categorical variables
# vars_mostImp_cat=[]
# for var in glm_basic.varimp():
#     orig_var = var[0].split('.')[0]
#     if orig_var in vars_ind_categorical and orig_var not in vars_mostImp_cat:  #check if categorical
#         #add to list of important categorical vars only if not already in list
#         vars_mostImp_cat.append(orig_var)
#     if len(vars_mostImp_cat)>= 5:
#         break
# vars_mostImp_cat


vars_mostImp_cat=['f09', 'f03', 'f07', 'f27', 'e11']  #comment this line if uncommenting the above block

#Get dataframe of all pairwise interactions between the five most important categorical variables
df_train_interactions = fn_createInteractions(df_train, vars_mostImp_cat)
df_test_interactions = fn_createInteractions(df_test, vars_mostImp_cat)

#append new columns to df_train and df_test
df_train[df_train_interactions.columns.values] = df_train_interactions
df_test[df_test_interactions.columns.values] = df_test_interactions

# include the new interaction variables in vars_ind_categorical
vars_ind_categorical.extend(df_train_interactions.columns.tolist())

Parse progress: |█████████████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%


**Try out some other features e.g. division of numerics**
- must also add engineered columns to test data

In [83]:
### DEFINE THREE MOST IMPORTANT NUMERICAL VARS

### NOTE: The ratios below are created based on the largest
### coefficients in a previously-run model. The commented code below identifies
### those coefficients by loading the model and manipulating the data.
### However, as the assignment requires the only input to be the train and test
### datasets, the most important variables have been hardcoded in.


# # plot largest standardised coefficients
# # glm_basic.std_coef_plot(num_of_features=10)
# # Get list of THREE most important variables
# vars_mostImp_numeric=[]
# for var in glm_basic.varimp():
#     orig_var = var[0].split('.')[0]
#     if orig_var in vars_ind_numeric and orig_var not in vars_mostImp_numeric:  #check if numeric
#         #add to list of important numeric vars
#         vars_mostImp_numeric.append(orig_var)
#     if len(vars_mostImp_numeric)>= 3:
#         break

vars_mostImp_numeric=['f11', 'f11_0', 'f11_1']  #comment this line if uncommenting the above block
### COMPUTE RATIO COLUMNS FOR BOTH DATASETS
df_temp_train = fn_computeRatiosOfNumerics(df_train, vars_mostImp_numeric)
df_temp_test = fn_computeRatiosOfNumerics(df_test, vars_mostImp_numeric)

#append new columns to df_train and df_test
df_train[df_temp_train.columns.values] = df_temp_train
df_test[df_temp_test.columns.values] = df_temp_test

# include new numeric variables in vars_ind_numeric
# (checked column by column, so re-running the cell does not add duplicates)
vars_ind_numeric.extend([col for col in df_temp_train.columns.tolist() if col not in vars_ind_numeric])
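
The ratio features depend on how pandas treats division by zero, which is easy to check on a toy frame (values made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'num': [1.0, 0.0, -2.0, 6.0],
                    'den': [0.0, 0.0, 0.0, 3.0]})

# no exception is raised: x/0 gives +/-inf and 0/0 gives NaN
raw_ratio = toy['num'] / toy['den']

# map all three special values to 0, as done for the engineered ratios
clean_ratio = raw_ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
```

So a `try`/`except ZeroDivisionError` around a pandas division never triggers; the special values have to be handled after the division.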

In [86]:
vars_ind_numeric

['b01_1',
 'f24',
 'c03_3',
 'd01',
 'c03',
 'e09_0',
 'c01_0',
 'b05_0',
 'b01_5',
 'ratio_f11_1.f11_0',
 'a04_3',
 'f11_5',
 'f02',
 'e04_3',
 'b01_0',
 'f13_2',
 'e09_2',
 'e04',
 'a04_0',
 'b01',
 'e05_3',
 'e09',
 'ratio_f11_0.f11',
 'f11_0',
 'f11_3',
 'e06_3',
 'b05_1',
 'f21',
 'f06_0',
 'f06_2',
 'f13_4',
 'e09_3',
 'f01_1',
 'e05_4',
 'f25',
 'c01',
 'e02_2',
 'a15',
 'e15',
 'f01',
 'e12_5',
 'e04_0',
 'e08_2',
 'e09_4',
 'e06_1',
 'a04',
 'e08_0',
 'f17',
 'e02_4',
 'f13_5',
 'e07',
 'a05',
 'd03_0',
 'f16',
 'd03',
 'f01_3',
 'f06',
 'f11_4',
 'e23',
 'f02_4',
 'd02_2',
 'e04_1',
 'a15_3',
 'e15_1',
 'e04_5',
 'ratio_f11.f11_0',
 'f23',
 'f26',
 'e02_0',
 'e06_0',
 'd01_0',
 'd01_3',
 'e05',
 'a15_2',
 'e06',
 'e15_3',
 'a06',
 'e05_2',
 'f28',
 'd03_1',
 'b05',
 'd02_1',
 'c03_1',
 'b01_3',
 'e12_1',
 'f01_2',
 'e08_4',
 'e08_3',
 'a15_5',
 'e12_0',
 'd01_4',
 'a15_1',
 'd02_0',
 'f19',
 'b01_4',
 'e05_0',
 'f18',
 'e15_2',
 'e09_1',
 'a11',
 'f31',
 'f02_3',
 'e04_4',
 ... (output truncated)

**Load data to h2o JVM**

In [63]:
# h2o.init(port=54321)
# h2o.connect(port=54321)

In [64]:
# # remove all data loaded in JVM
# for key in h2o.ls()['key']:
#     h2o.remove(key)

In [65]:
# Is df_train already in the JVM?
# h2o.ls()

In [66]:
# If it is, then just create a handle:
# h2o_df_train = h2o.get_frame('df_train')
# h2o_df_test = h2o.get_frame('df_test')


In [67]:
# Create H2OFrames in H2O cluster
h2o_df_train = h2o.H2OFrame(df_train[vars_ind_numeric + vars_ind_categorical + var_dep],
                           destination_frame='df_train')
h2o_df_test = h2o.H2OFrame(df_test[vars_ind_numeric + vars_ind_categorical],
                           destination_frame='df_test')

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


Change target to enum type as we are building a classification model

In [68]:
h2o_df_train[var_dep].types

{'target': 'int'}

In [69]:
#### Set target type to enum
h2o_df_train[var_dep] = h2o_df_train[var_dep].asfactor()
h2o_df_train[var_dep].types

{'target': 'enum'}

In [None]:
# var_dep

**Define the features to be used**

In [71]:
features = vars_ind_numeric + vars_ind_categorical

**lambda_search for alpha and lambda given a logit link**
- According to the H2O documentation, the logit link must be used as we are estimating a binomial classification model.
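
For orientation, `alpha` and `lambda` enter through the elastic-net penalty, and the logit link maps the linear predictor to a probability. The sketch below writes both out in numpy using the standard elastic-net form; the function names and the demo coefficients are made up, and this is not H2O's solver.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    # lam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm)
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))
    l2_sq = np.sum(beta ** 2)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)

def inverse_logit(eta):
    # inverse of the logit link: linear predictor -> probability
    return 1.0 / (1.0 + np.exp(-eta))

beta_demo = [1.0, -2.0, 0.0]
pen_lasso = elastic_net_penalty(beta_demo, lam=0.1, alpha=1.0)   # pure L1
pen_ridge = elastic_net_penalty(beta_demo, lam=0.1, alpha=0.0)   # pure L2
```

With `alpha=0.99`, as used in this notebook, the penalty is almost entirely L1, so many coefficients are driven to exactly zero as `lambda` grows.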

In [72]:
idx_h2o_train  = h2o.H2OFrame(idx_train.astype('int').values)
idx_h2o_val    = h2o.H2OFrame(idx_val.astype('int').values)
idx_h2o_design = h2o.H2OFrame(idx_design.astype('int').values)

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [74]:
%%time
# We set family to binomial as we are running a classification GLM model (with only two classes).
# missing_values_handling -> MeanImputation: deals with new samples having categorical levels not seen in training. Replaces the unseen value with the most frequent level present in training.
# keep_cross_validation_* -> set to False as we are only concerned with the best final model. Saves some memory in the H2O cluster.
model=H2OGeneralizedLinearEstimator(  alpha=0.99
                                        , family='binomial'
                                        , link='logit'
                                        , lambda_search=True
                                        , lambda_min_ratio=1e-7
                                        , nlambdas=100
                                        , early_stopping=True
                                        , nfolds=10
                                        , seed=2020
                                        , keep_cross_validation_models=False
                                        , keep_cross_validation_predictions=False
                                        , keep_cross_validation_fold_assignment=False
                                        , missing_values_handling='mean_imputation'
                                   )
model.train(x=features, 
            y='target',
            training_frame=h2o_df_train[idx_h2o_design, :])

### NOTE: models were taking very long to run (>1hr) so I changed 
### the number of lambdas to search to 100, and enabled early stopping,
### which seems to save a decent amount of time without negatively 
### affecting performance.

glm Model Build progress: |███████████████████████████████████████████████| 100%
CPU times: user 7.14 s, sys: 346 ms, total: 7.48 s
Wall time: 32min 11s


In [75]:
### Save the model
glm_curr = model
glm_curr_path = h2o.save_model(model=glm_curr, path=dirPData, force=True)

In [76]:
glm_curr_path

'/home/jovyan/Projects/final_assignment/PData/GLM_model_python_1594764900026_1'

In [77]:
### PREDICT
temp_preds = model.predict(h2o_df_test)

glm prediction progress: |████████████████████████████████████████████████| 100%




In [78]:
# column 'p1' (the third column of the prediction frame) holds the predicted probability of the positive class
df_test['Predicted'] = np.round(temp_preds['p1'].as_data_frame()['p1'].values, 5)
df_preds = df_test[['unique_id', 'Predicted']].copy()
df_preds.to_csv(dirPOutput + 'part1_preds_num_interactions_meanImpute_250k.csv', index=False)
# h2o_df_train[idx_h2o_val]

In [None]:
### Inspect coefficients
# (glm_bst was never defined in this notebook; the fitted estimator is `model`)
#plot
model.std_coef_plot(num_of_features=10)
#get list of 5 most important variables
vars_mostImp = [var[0] for var in model.varimp()[0:5]]
#note that vars_mostImp is made up of onehots created by H2O and thus some aren't actually present in the train/test frames
#we need to refer to the original variables before H2O onehots them
vars_mostImp = [var.split('.')[0] for var in vars_mostImp] #split on the onehot delimiter and keep only the first part of the variable name (i.e. the original variable name)

In [None]:
# df_train['f09.F']
# [var.split('.')[0] for var in vars_mostImp]

In [None]:
# (glm_bst was never defined in this notebook; the fitted estimator is `model`)
bst_pred_train = model.predict(h2o_df_train[idx_h2o_train, :])
bst_pred_val   = model.predict(h2o_df_train[idx_h2o_val, :])

# keep only 'p1', the predicted probability of the positive class,
# so that the predictions have the same length as the actuals
bst_pred_train = bst_pred_train['p1'].as_data_frame().values.ravel()
bst_pred_val   = bst_pred_val['p1'].as_data_frame().values.ravel()

# fn_MAE rounds to the nearest integer, which is not informative for probabilities,
# so the mean absolute error is computed directly here.
# Note that the model was fitted on the design rows (train + val), so both errors are in-sample.
print('train error', np.mean(np.abs(bst_pred_train - y[idx_train])))
print('val error',   np.mean(np.abs(bst_pred_val - y[idx_val])))

h2o.show_progress()

In [None]:
bst_pred_train

In [None]:
# len(set(df_test[vars_ind_numeric+vars_ind_categorical].columns.values))

In [None]:
# len(df_test[~df_test[['unique_id','b06']]].columns)
# vars_notToUse
# temp = ['unique_id','b06']
# df_test[temp]
# set(features) == set(vars_ind_numeric+vars_ind_categorical) #only order differs
# len(features)

**Create Predictions**

When H2O makes predictions, we get a warning telling us that some features contain levels in the test data that were not present in the training dataset. There is not a lot we can do about this. You should make sure you understand how H2O makes predictions in such a case.
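
The model in this notebook was fitted with mean imputation for missing values, which, as noted in the comments on the model cell, replaces an unseen categorical level with the most frequent level seen in training. The pandas sketch below mimics that rule on toy data; it is an illustration of the described behaviour, not H2O's implementation.

```python
import pandas as pd

train_levels = pd.Series(['A', 'B', 'A', 'C', 'A'])
test_levels  = pd.Series(['B', 'Z', 'A'])      # 'Z' never appears in train

most_frequent = train_levels.mode().iloc[0]
seen = set(train_levels.unique())

# replace any level not seen in training by the most frequent training level
test_mapped = test_levels.where(test_levels.isin(seen), most_frequent)
```

The practical consequence is that rows with new levels are scored as if they belonged to the most common category, so a feature with many unseen test levels contributes little real information for those rows.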

In [None]:
# h2o_df_test = h2o.H2OFrame(df_test[vars_ind_numeric + vars_ind_categorical],
#                            destination_frame='df_test')

In [None]:
# (glm_bst was never defined in this notebook; the fitted estimator is `model`)
preds = model.predict(h2o_df_test)

# There is no need to round your predictions


In [None]:
preds

In [None]:
# column 'p1' (the third column of the prediction frame) holds the predicted probability of the positive class
df_test['Predicted'] = np.round(preds['p1'].as_data_frame()['p1'].values, 5)
df_preds = df_test[['unique_id', 'Predicted']].copy()
df_preds.to_csv(dirPOutput + 'part1_preds_250k.csv', index=False)

In [None]:
df_preds

Now you can submit part1_preds_250k.csv (the file written above) on Kaggle. You should get an AUROC of around 0.75.
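
Since the leaderboard is scored on AUROC, it can help to see what the number measures: the probability that a randomly chosen positive is scored above a randomly chosen negative. The sketch below computes it by brute force on a made-up example; `auroc` is a hypothetical helper and, because it compares every positive with every negative, it is only suitable for small samples.

```python
import numpy as np

def auroc(y_true, scores):
    # fraction of (positive, negative) pairs ranked correctly, ties counted as one half
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = auroc(y_demo, p_demo)   # 3 of the 4 pairs are ranked correctly
```

Only the ordering of the predictions matters for this metric, which is why rounding the predicted probabilities is unnecessary and can only hurt by creating ties.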

**Note**

If you shut down your H2O JVM in this session, then any other Python notebooks that are open will also lose the JVM, since they all connect to the same JVM!

In [None]:
h2o.cluster().shutdown()