# Ames housing: h2o interaction example

Contents
 - 1. Start_. Packages, directories, functions
 - 2. Data preparation
 - 3. Linear regression without interactions
 - 4. Interactions:
    - linear regression, testing whether one or two interactions improve performance
 - 5. Test error of the final model

### 1. Start_.

**Packages**

In [None]:
import os
import numpy as np
import pandas as pd
import pickle

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

import matplotlib.pyplot as plt
%matplotlib inline

**Directories and paths**

In [None]:
# Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData =   "../PData/"

**Functions**

In [3]:
def fn_MAE(actuals, predictions):
    return np.round(np.mean(np.abs(predictions - actuals)))
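
A small worked example, using made-up numbers rather than the Ames data, shows what `fn_MAE` returns. Because of the `np.round` call, the result is rounded to the nearest whole number, so the errors reported later in this notebook are whole units of saleprice.

```python
import numpy as np

def fn_MAE(actuals, predictions):
    # same definition as in the cell above
    return np.round(np.mean(np.abs(predictions - actuals)))

# made-up values, for illustration only
toy_actuals     = np.array([100.0, 200.0, 300.0])
toy_predictions = np.array([110.0, 180.0, 300.0])

# absolute errors are 10, 20 and 0, so the mean absolute error is 10
toy_mae = fn_MAE(toy_actuals, toy_predictions)
print(toy_mae)
```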

**Load data**

In [4]:
# load df_all (use the non-one-hot version)
#df_all = pd.read_hdf(dirPData + '02_df_all.h5', 'df_all')
f_name = dirPData + '02_df.pickle'

with (open(f_name, "rb")) as f:
    dict_ = pickle.load(f)

df_all = dict_['df_all']

del f_name, dict_

In [5]:
# load the variables information
f_name = dirPData + '02_vars.pickle'
with open(f_name, "rb") as f:
    dict_ = pickle.load(f)
    
var_dep = dict_['var_dep']
vars_ind_numeric     = dict_['vars_ind_numeric']
vars_ind_categorical = dict_['vars_ind_categorical']
vars_ind_onehot      = dict_['vars_ind_onehot']

del dict_

In [6]:
idx_train  = df_all['fold'].isin(range(6))
idx_val    = df_all['fold'].isin([6, 7])
idx_design = df_all['fold'].isin(range(8))
idx_test   = df_all['fold'].isin([8, 9])

print("number of train examples",    np.sum(idx_train == 1))
print("number of val examples",      np.sum(idx_val == 1))
print("number of design examples",   np.sum(idx_design == 1))

print("number of test examples",  np.sum(idx_test == 1))


number of train examples 1601
number of val examples 525
number of design examples 2126
number of test examples 539


### 2. Data preparation

**Drop some variables**

Three or four students in the class achieved remarkably good results on the week 3 assignment.  The code below is based on the submission of Alexander Rostovtsev, who dropped the following variables after exploratory data analysis suggested that they were unlikely to be of much use.  It turns out that this helps avoid overfitting.

Exactly why the LASSO did not drop them is something I would like to investigate further - in particular, whether the "relaxed LASSO" or some other form of initial variable selection would work.

Alexander also experimented with different spline points to achieve an even better result, but I have not included those results.

In [7]:
vars_toDrop = ['x3ssn_porch','enclosed_porch','screen_porch',
               'pool_area','misc_val','half_bath','kitchen_abvgr',
               'fireplaces','bsmtfin_sf_2', 'low_qual_fin_sf']

vars_ind_categorical = list(set(vars_ind_categorical) - set(vars_toDrop))
vars_ind_numeric     = list(set(vars_ind_numeric) - set(vars_toDrop))

**Standardise before preparing basis functions**

In [8]:
# standardise features (as in standard normal distribution mean 0, sd 1)
for var in vars_ind_numeric:
    # work on a float copy so the in-place arithmetic also works for integer columns
    x = df_all[var].values.astype(float)
    x -= np.mean(x, axis=0)
    x /= np.sqrt(np.mean(x ** 2, axis=0))
    df_all[var] = x
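
As a quick sanity check of the standardisation step, the sketch below applies the same arithmetic to a toy frame (made-up values and hypothetical column names, not the Ames data) and confirms that each column ends up with mean 0 and standard deviation 1. Note that the divisor is the population standard deviation (`ddof=0`), not the sample one.

```python
import numpy as np
import pandas as pd

# toy frame with made-up values; column names are hypothetical
toy = pd.DataFrame({"area":  [50.0, 100.0, 150.0, 200.0],
                    "rooms": [2.0, 3.0, 3.0, 4.0]})

for col in ["area", "rooms"]:
    x = toy[col].to_numpy(dtype=float)   # float copy of the column
    x = x - x.mean()                     # centre on zero
    x = x / np.sqrt(np.mean(x ** 2))     # divide by the population sd (ddof=0)
    toy[col] = x

toy_means = toy.mean()
toy_sds   = toy.std(ddof=0)
print(toy_means.round(6).tolist(), toy_sds.round(6).tolist())
```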

**Prepare basis functions**

In [10]:
# do this only for truly continuous variables
# (this is not necessarily "right" - but just quicker to code ...)
# using >8 made sklearn crash - but h2o is fine with it
vars_ind_tospline = df_all[vars_ind_numeric].columns[(df_all[vars_ind_numeric].nunique() > 8)].tolist()

In [11]:
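def fn_tosplines(x):
    # x is a pandas Series; take the column name from it rather than
    # relying on a global loop variable called var
    var = x.name
    x = x.values
    # hack: remove zeros to avoid issues where lots of values are zero
    x_nonzero = x[x != 0]
    # knots at selected percentiles of the non-zero values
    ptiles = np.percentile(x_nonzero, [5, 10, 30, 50, 70, 90, 95])
    df_ptiles = pd.DataFrame({var: x})
    # one hinge function max(0, x - knot) per knot
    for idx, ptile in enumerate(ptiles):
        df_ptiles[var + '_' + str(idx)] = np.maximum(0, x - ptile)
    return df_ptiles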
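
To see what these basis functions look like, here is a minimal sketch on made-up values. The helper `toy_hinge_basis` and its hand-picked knots are hypothetical; the notebook's function instead places its knots at percentiles of the non-zero values. Each extra column is zero below its knot and rises linearly above it, so a linear model fitted on these columns can change slope at each knot.

```python
import numpy as np
import pandas as pd

def toy_hinge_basis(x, name, knots):
    # one column for the raw variable plus one hinge column per knot
    out = pd.DataFrame({name: x})
    for idx, knot in enumerate(knots):
        out[name + '_' + str(idx)] = np.maximum(0, x - knot)
    return out

x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up values
toy_basis = toy_hinge_basis(x_toy, 'x', knots=[1.0, 3.0])
print(toy_basis)
```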

Now update df_all with splines / basis functions

In [12]:
for var in vars_ind_tospline:
    df_ptiles = fn_tosplines(df_all[var])
    df_all.drop(columns=[var], inplace=True)
    vars_ind_numeric.remove(var)
    df_all = pd.concat([df_all, df_ptiles], axis=1, sort=False)
    vars_ind_numeric.extend(df_ptiles.columns.tolist())

In [13]:
vars_ind = vars_ind_categorical + vars_ind_numeric

In [14]:
# for convenience store dependent variable as y
y = df_all[var_dep].values.ravel()

**start h2o**

In [15]:
h2o.init(port=54321)
#h2o.connect()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_212"; OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03); OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpkdkzhol4
  JVM stdout: /tmp/tmpkdkzhol4/h2o_jovyan_started_from_python.out
  JVM stderr: /tmp/tmpkdkzhol4/h2o_jovyan_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,03 secs
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.3
H2O cluster version age:,"1 year, 2 months and 7 days !!!"
H2O cluster name:,H2O_from_python_jovyan_tyjfu2
H2O cluster total nodes:,1
H2O cluster free memory:,3.257 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


**Load data into h2o**

In [16]:
h2o_df_all = h2o.H2OFrame(df_all[vars_ind + var_dep + ['fold']],
                          destination_frame = 'df_all')

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [17]:
idx_h2o_train  = h2o.H2OFrame(idx_train.astype('int').values,  
                              destination_frame = 'idx_h2o_train')
idx_h2o_val    = h2o.H2OFrame(idx_val.astype('int').values  ,  
                              destination_frame = 'idx_h2o_val')
idx_h2o_design = h2o.H2OFrame(idx_design.astype('int').values, 
                              destination_frame = 'idx_h2o_design')
idx_h2o_test   = h2o.H2OFrame(idx_test.astype('int').values,   
                              destination_frame = 'idx_h2o_test')

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


In [18]:
# Define upfront the h2o frames - needed for stacked ensemble
h2o_df_design = h2o_df_all[idx_h2o_design, :]
h2o_df_train  = h2o_df_all[idx_h2o_train, :]
h2o_df_val    = h2o_df_all[idx_h2o_val, :]

### 3.  Linear regression without interactions

**Note on lambda search**

I had all kinds of trouble with this, particularly at low values of alpha, and not for the first time.  It seems that the range of lambdas searched is too narrow, and that the early stopping option will sometimes stop at the very beginning if there is no initial improvement.  I therefore set lambda_min_ratio to something much smaller and turn early stopping off.
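
As a rough picture of why lambda_min_ratio matters: the smallest penalty tried is lambda_max multiplied by lambda_min_ratio, so a smaller ratio lets the search reach much weaker regularisation. The sketch below assumes a log-spaced grid, as is usual for this kind of path algorithm (h2o's exact spacing may differ), and uses a hypothetical lambda_max of 1000. The helper `toy_lambda_path` is not part of h2o.

```python
import numpy as np

lambda_max = 1000.0   # hypothetical value, for illustration only
nlambdas = 200

def toy_lambda_path(lambda_max, lambda_min_ratio, nlambdas):
    # log-spaced grid from lambda_max down to lambda_max * lambda_min_ratio
    return np.geomspace(lambda_max, lambda_max * lambda_min_ratio, nlambdas)

path_wide   = toy_lambda_path(lambda_max, 1e-8, nlambdas)   # ratio used in this notebook
path_narrow = toy_lambda_path(lambda_max, 1e-4, nlambdas)   # a larger ratio, for comparison

# smallest penalty reached by each grid
print(path_wide[-1], path_narrow[-1])
```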

In [19]:
model=H2OGeneralizedLinearEstimator(alpha=0.20, 
                                    lambda_search=True,
                                    lambda_min_ratio=1e-8,
                                    nlambdas=200,
                                    nfolds=20,
                                    early_stopping=False,
                                    family='gaussian',
                                    link='identity',
                                    # we already standardised above
                                    standardize=False,
                                    seed=2020)
    
model.train(x=vars_ind, 
            y='saleprice',
            training_frame=h2o_df_all[idx_h2o_train, :])

# Predict the model on train and val
model_pred_train = model.predict(h2o_df_all[idx_h2o_train, :])
model_pred_val   = model.predict(h2o_df_all[idx_h2o_val, :])

model_pred_train = model_pred_train.as_data_frame().values.ravel()
model_pred_val   = model_pred_val.as_data_frame().values.ravel()

# Calculate train and val MAE
mae_train = fn_MAE(y[idx_train], model_pred_train)
mae_val   = fn_MAE(y[idx_val], model_pred_val)

glm Model Build progress: |███████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%


In [20]:
mae_diff = mae_val - mae_train
print(mae_train, mae_val, mae_diff)
#12018 13652 1634

12249.0 13620.0 1371.0


### 4. Interactions

**Choose variables to interact**

We need to choose some variables to test for interactions.  How can we do this?  In week 6 we will discover a way to do this automatically, but that is not required for Part 1 of the assignment.  All we need to do is try a few interactions.

If we knew what the variables meant (which you do not for this assignment) we could use our own judgement - for example, in Ames the increase in price with gr_liv_area may well depend on the neighborhood (I think this is quite likely).


So for the assignment - some ideas:
 - Simply try a few variables at random
 - Look at the features with the largest coefficients in the model, as these are important.  If they do interact with other variables, the interaction could be a useful addition to the model.  Of course, just because they have large (standardised) coefficients does not mean they will interact.  But I have seen people take the 5 or 10 features with the largest coefficients and then try all pairwise combinations of them (10 pairs for 5 features, 45 for 10) as interactions.  This could be done in a for loop, using train-val.
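
The second idea can be sketched as follows. The shortlist of feature names is hypothetical; in practice you would take it from the largest standardised coefficients of your own model. The code only enumerates the candidate pairs and does not fit any models.

```python
from itertools import combinations

# hypothetical shortlist of feature names, for illustration only
top_features = ['neighborhood', 'gr_liv_area', 'bsmtfin_sf_1',
                'overall_qual', 'year_built']

# every unordered pair of distinct features
candidate_pairs = list(combinations(top_features, 2))
print(len(candidate_pairs))   # 5 features give 10 pairs

# the same count for a shortlist of 10 features
n_pairs_10 = len(list(combinations(range(10), 2)))
print(n_pairs_10)             # 45 pairs
```

Each pair could then be tried in turn, for example by passing `list(pair)` as the interactions argument, fitting on train, and recording the validation MAE.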
 
Part 1 of the assignment is not meant to be hard.  If you just try a few interactions and leave some comments in your function on your results, that will be sufficient - i.e.
 - does a model fitted with them have a lower validation error?
 - did you include them in your final model?

See the example below, where I test the ['neighborhood', 'gr_liv_area'] interaction.

If you want to see the largest few coefficients, you can get them with the following (if your model is named "model"):
     > model.std_coef_plot(num_of_features=10)
     
     
Below we use the interactions argument.  This creates all possible two-way interactions of whatever features you include.  If you want to define specific interaction pairs, you can use the interaction_pairs argument instead (the column names in this illustration are not from the Ames data):

 > interaction_pairs = [("CRSDepTime", "UniqueCarrier"),
                     ("CRSDepTime", "Origin"),
                     ("UniqueCarrier", "Origin")]


In [None]:
#bsmtfin_sf_1 
#neighborhood
#gr_liv_area
#model.varimp()

model=H2OGeneralizedLinearEstimator(alpha=0.20, 
                                    lambda_search=True,
                                    lambda_min_ratio=1e-8,
                                    nlambdas=200,
                                    nfolds=20,
                                    early_stopping=False,
                                    family='gaussian',
                                    link='identity',
                                    interactions=['neighborhood', 'gr_liv_area'],
                                    # we already standardised above
                                    standardize=False,
                                    seed=2020)
    
model.train(x=vars_ind, 
            y='saleprice',
            training_frame=h2o_df_all[idx_h2o_train, :])

# Predict the model on train and val
model_pred_train = model.predict(h2o_df_all[idx_h2o_train, :])
model_pred_val   = model.predict(h2o_df_all[idx_h2o_val, :])

model_pred_train = model_pred_train.as_data_frame().values.ravel()
model_pred_val   = model_pred_val.as_data_frame().values.ravel()

# Calculate train and val MAE
mae_train = fn_MAE(y[idx_train], model_pred_train)
mae_val   = fn_MAE(y[idx_val], model_pred_val)


In [None]:
mae_diff = mae_val - mae_train
print(mae_train, mae_val, mae_diff)

# I could not get this result to be replicable
#12115 13415 1300
#12411 13912 1501
#11716 13403 1687

**Can we see the interactions?**

Given that we fitted the interaction - can we see what h2o did?

Luckily, one of the interaction terms created appears as the largest standardised coefficient.

In the plot below you can see that in the neighbourhood Stonebridge, saleprice has a different relationship with living area than in other areas.  If we were looking at this properly, we would now plot two saleprice vs living area charts, one for Stonebridge and one for all other areas, and check whether the relationship appears different.
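
A numerical version of that check can be sketched as follows: fit a straight line separately inside and outside the group and compare the slopes. The data here is synthetic and built to have a steeper slope inside the group, so it only illustrates the method and says nothing about the Ames data. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data, for illustration only
area = rng.uniform(-2, 2, size=400)    # stand-in for standardised living area
in_group = rng.random(400) < 0.2       # stand-in for "is in the neighbourhood"
# slope is 20 outside the group and 20 + 30 = 50 inside it, plus noise
price = 100 + 20 * area + 30 * area * in_group + rng.normal(0, 5, size=400)

# fit a straight line separately in each group and compare the slopes
slope_in,  _ = np.polyfit(area[in_group],  price[in_group],  1)
slope_out, _ = np.polyfit(area[~in_group], price[~in_group], 1)
print(round(slope_in, 1), round(slope_out, 1))
```

With real data, the two fitted lines (or the two scatter plots) would be compared in the same way, using the rows for the neighbourhood of interest and the rows for all other areas.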

In [None]:
model.std_coef_plot(10)

**Conclusion**

Starting point: 
 - validation error:  13,652  overfitting: 1,634

Interaction ['neighborhood', 'gr_liv_area']:
 - validation error:  13,403 overfitting: 1,687

Adding this interaction to the model does not consistently improve performance on validation data.  In the run shown here it does.  But different runs gave different answers and I could not consistently get this improvement.

### 5. Test error of the final model

Please note - many of you used the test data repeatedly in the week 3 assignment.  I can't blame you given what I was asking for.  However, this means we have in effect fitted our model to our test data and, once again, we have no idea how it will generalise to data (on houses) it has never seen.

I do try to fit models on train-val and only occasionally look at test.

As previously discussed, this repeated use of the test data is not possible on Kaggle: even if you submit repeatedly, you are not seeing all the test data, since the public leaderboard reports on only 50% of it.  Last year (on a much smaller dataset) one team was in second place right up to the last day, but when the final scores were released they dropped far down - they had overfitted to the public leaderboard.  You will probably not be able to overfit in this way this year, because the data is much bigger.

**design-test performance**

In [None]:
model=H2OGeneralizedLinearEstimator(alpha=0.20, 
                                    lambda_search=True,
                                    lambda_min_ratio=1e-7,
                                    nlambdas=150,
                                    nfolds=20,
                                    early_stopping=False,
                                    family='gaussian',
                                    link='identity',
                                    # we already standardised above
                                    standardize=False,
                                    seed=2020)
    
model.train(x=vars_ind, 
            y='saleprice',
            training_frame=h2o_df_all[idx_h2o_design, :])

# Predict the model on design and test
model_pred_design = model.predict(h2o_df_all[idx_h2o_design, :])
model_pred_test   = model.predict(h2o_df_all[idx_h2o_test, :])

model_pred_design = model_pred_design.as_data_frame().values.ravel()
model_pred_test   = model_pred_test.as_data_frame().values.ravel()

# Calculate design and test MAE
mae_design = fn_MAE(y[idx_design], model_pred_design)
mae_test   = fn_MAE(y[idx_test], model_pred_test)

In [None]:
mae_diff = mae_test - mae_design
print(mae_design, mae_test, mae_diff)

# 12853 13312 459

In [None]:
h2o.cluster().shutdown()