# Week 3 - GLMs

Using H2O for exploratory data analysis

In [1]:
import h2o

In [2]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-0ubuntu1~19.04.1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
  Starting server from /home/megan/Projects/h2oclass/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpuq4dvcp3
  JVM stdout: /tmp/tmpuq4dvcp3/h2o_megan_started_from_python.out
  JVM stderr: /tmp/tmpuq4dvcp3/h2o_megan_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,01 secs
H2O cluster timezone:,America/Chicago
H2O data parsing timezone:,UTC
H2O cluster version:,3.28.0.3
H2O cluster version age:,9 days
H2O cluster name:,H2O_from_python_megan_1baid0
H2O cluster total nodes:,1
H2O cluster free memory:,1.520 Gb
H2O cluster total cores:,3
H2O cluster allowed cores:,3


In [3]:
# Import the dataset.
# Note: these steps are more involved than in the course because the link provided there does not work.
url = 'https://data.princeton.edu/wws509/datasets/smoking.dat'
data = h2o.import_file(
    url, 
    destination_frame='data', 
    col_names=['idx', 'age', 'smoke', 'pop', 'dead']
)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [4]:
data

idx,age,smoke,pop,dead
,smoke,pop,,
1.0,40-44,no,656.0,18.0
2.0,45-59,no,359.0,22.0
3.0,50-54,no,249.0,19.0
4.0,55-59,no,632.0,55.0
5.0,60-64,no,1067.0,117.0
6.0,65-69,no,897.0,170.0
7.0,70-74,no,668.0,179.0
8.0,75-79,no,361.0,120.0
9.0,80+,no,274.0,120.0




In [5]:
# drop the first row, which holds the file's own header text parsed as data
smoking = data.drop([0], axis=0)
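
The stray first row appears because the parser treated the file's own header line as data. A minimal sketch of the cause, in plain Python with no H2O required: it assumes the file starts with a four-name header followed by five-field rows (a row index plus four values), which is what the output above suggests. The sample text holds only the first three rows shown above, and parse_rows is a hypothetical helper, not part of H2O.

```python
# Assumed layout: a 4-name header, then rows that start with a row index.
sample = """\
age smoke pop dead
1 40-44 no 656 18
2 45-59 no 359 22
3 50-54 no 249 19
"""

def parse_rows(text):
    """Split whitespace-delimited text into dicts, skipping the header line."""
    lines = text.strip().splitlines()
    # The header has no name for the index column, so add one.
    header = ["idx"] + lines[0].split()
    rows = []
    for line in lines[1:]:
        idx, age, smoke, pop, dead = line.split()
        values = (int(idx), age, smoke, int(pop), int(dead))
        rows.append(dict(zip(header, values)))
    return rows

rows = parse_rows(sample)
print(rows[0])
```

Handling the header explicitly like this avoids the mismatch between four names and five fields that produced the bad row.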

In [6]:
# create ratio column: deaths per 1,000 people, floored to an integer by //
smoking['ratio'] = 1000 * smoking['dead'] // smoking['pop']
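
To make the arithmetic concrete, the same expression can be checked by hand in plain Python on the first few rows shown above. This is only a sketch of the calculation, not of how H2O evaluates it. Because * and // share precedence and group left to right, the multiplication by 1000 happens before the floor division.

```python
# (pop, dead) pairs copied from the first four data rows of the frame above
rows = [(656, 18), (359, 22), (249, 19), (632, 55)]

# Same expression as the notebook cell: multiply first, then floor-divide
ratios = [1000 * dead // pop for pop, dead in rows]
print(ratios)  # [27, 61, 76, 87]
```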

In [7]:
smoking

idx,age,smoke,pop,dead,ratio
1,40-44,no,656,18,27
2,45-59,no,359,22,61
3,50-54,no,249,19,76
4,55-59,no,632,55,87
5,60-64,no,1067,117,109
6,65-69,no,897,170,189
7,70-74,no,668,179,267
8,75-79,no,361,120,332
9,80+,no,274,120,437
10,40-44,cigarPipeOnly,145,2,13




In [8]:
smoking.summary()

,idx,age,smoke,pop,dead,ratio
type,int,enum,enum,int,int,int
mins,1.0,,,98.0,2.0,13.0
mean,18.5,,,1558.9444444444443,253.61111111111114,204.27777777777777
maxs,36.0,,,6052.0,1001.0,557.0
sigma,10.535653752852738,,,1562.232174887577,262.5974951221821,161.18624739476488
zeros,0,,,0,0,0
missing,0,0,0,0,0,0
0,1.0,40-44,no,656.0,18.0,27.0
1,2.0,45-59,no,359.0,22.0,61.0
2,3.0,50-54,no,249.0,19.0,76.0


In [9]:
# total population across all groups
smoking[:,'pop'].sum()

56122.0

In [10]:
# import the model
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

In [11]:
# define the predictors (x) and the response (y)
x = ['age', 'smoke']
y = 'ratio'

In [12]:
model1 = H2OGeneralizedLinearEstimator(
    family='poisson',
    model_id='smoking_p'
)
model1.train(x, y, smoking)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [13]:
model1.model_performance()


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 2317.278863564693
RMSE: 48.138122767352414
MAE: 42.01284703316798
RMSLE: 0.5194841361204997
R^2: 0.9082604115075811
Mean Residual Deviance: 16.5654755342314
Null degrees of freedom: 35
Residual degrees of freedom: 24
Null deviance: 4452.040944755189
Residual deviance: 596.3571192323304
AIC: 864.114733053809




In [14]:
model1.coef()

{'Intercept': 5.105058353854452,
 'age.40-44': -0.747627689463485,
 'age.45-59': -0.5203727628229302,
 'age.50-54': -0.3812251465963783,
 'age.55-59': -0.10696663321593292,
 'age.60-64': 0.0,
 'age.65-69': 0.12115281990217852,
 'age.70-74': 0.41814318047271115,
 'age.75-79': 0.71445004975903,
 'age.80+': 0.9760296419439634,
 'age.smoke': 0.0,
 'smoke.cigarPipeOnly': -0.054287949649570066,
 'smoke.cigarretteOnly': 0.15479220408862204,
 'smoke.cigarrettePlus': 0.0,
 'smoke.no': -0.05701552993507164,
 'smoke.pop': 0.0}

In [15]:
model2 = H2OGeneralizedLinearEstimator(
    family='poisson',
    model_id='smoking_p2'
)
model2.train('smoke', y, smoking)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [16]:
model2.model_performance()


ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 24342.611860135847
RMSE: 156.0211904201985
MAE: 132.6987657369525
RMSLE: 1.0144378007997055
R^2: 0.03629156162749714
Mean Residual Deviance: 119.2662951879298
Null degrees of freedom: 35
Residual degrees of freedom: 31
Null deviance: 4452.040944755189
Residual deviance: 4293.586626765473
AIC: 4547.344240586952
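
One way to compare the two fits is the share of the null deviance each model removes. The sketch below just redoes that arithmetic in plain Python from the figures printed by model_performance(), rounded to three decimals. deviance_explained is a hypothetical helper name, not an H2O function.

```python
def deviance_explained(null_deviance, residual_deviance):
    """Fraction of the null deviance removed by the fitted model."""
    return 1.0 - residual_deviance / null_deviance

# The null deviance is identical for both models (same response, same rows)
null_dev = 4452.041

model1_share = deviance_explained(null_dev, 596.357)   # age + smoke
model2_share = deviance_explained(null_dev, 4293.587)  # smoke only

print(f"model1: {model1_share:.3f}, model2: {model2_share:.3f}")
```

On these numbers, age plus smoking removes roughly 87% of the null deviance, while smoking alone removes under 4%.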




In [17]:
model2.coef()

{'Intercept': 5.311708901581492,
 'smoke.cigarPipeOnly': -0.10892325437396622,
 'smoke.cigarretteOnly': 0.18911297651822245,
 'smoke.cigarrettePlus': 0.03181156005906516,
 'smoke.no': -0.11190624028792584,
 'smoke.pop': 0.0}

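What can we see from this exploration with modeling?

- Smoking cigarettes only has the largest positive smoking coefficient in both models (about 0.19 in model2, about 0.15 in model1 with age included), so it is the category most strongly associated with a higher death ratio.
- Not smoking has the most negative smoking coefficient, but it only slightly beats out cigar and pipe smoking only (-0.112 vs. -0.109 in model2).
- Smoking status alone explains very little: model2 has an R^2 of about 0.04, against about 0.91 for model1, which also uses age.
- When we included age, the oldest groups had the largest positive coefficients (about 0.98 for 80+ and 0.71 for 75-79).
- The four youngest groups (40-44 through 55-59) had negative coefficients.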
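
Because a Poisson GLM with a log link models the log of the expected response, exponentiating a coefficient turns it into a multiplicative factor. The sketch below assumes model1 used the log link (the usual default for the Poisson family) and reuses coefficients copied from model1.coef() above, rounded to four decimals. predicted_ratio is a hypothetical helper, not an H2O function.

```python
import math

# Coefficients copied from model1.coef(), rounded to four decimals
intercept = 5.1051
age_coef = {"40-44": -0.7476, "80+": 0.9760}
smoke_coef = {"no": -0.0570, "cigarretteOnly": 0.1548}  # spelling as in the data

def predicted_ratio(age, smoke):
    """Expected deaths per 1,000 under a log link: exp(sum of the terms)."""
    return math.exp(intercept + age_coef[age] + smoke_coef[smoke])

# Multiplicative effect of the 80+ level relative to a level whose
# coefficient is 0 (here 60-64)
print(round(math.exp(age_coef["80+"]), 2))

# Fitted value for non-smokers aged 80+ (the observed ratio was 437)
print(round(predicted_ratio("80+", "no"), 1))
```

Reading coefficients this way makes the bullets above easier to compare: a coefficient near 1 corresponds to a death ratio between two and three times higher, while the smoking coefficients correspond to changes of under 20%.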