## Introduction to h2o

Source
 - Based on the h2o Python booklet
 - https://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/PythonBooklet.pdf
 - It is recommended that you read the booklet carefully up to the Machine Learning section.  You are more than welcome to read the whole booklet!
 - It is also recommended that you listen to at least the first part of Amy Wang's tutorial at: https://www.youtube.com/watch?v=g7drhm_SdbQ

In [2]:
import pandas as pd
import h2o

In [3]:
# At the time of writing (May 2020), the latest h2o version is 3.30.0.3
print(h2o.__version__)

3.24.0.3


### Start h2o

In [4]:
h2o.connect(ip='127.0.0.1', 
            port=54321, 
            https=False)

Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,2 hours 52 mins
H2O cluster timezone:,Etc/UTC
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.3
H2O cluster version age:,"1 year, 2 months and 5 days !!!"
H2O cluster name:,H2O_from_python_jovyan_zlzful
H2O cluster total nodes:,1
H2O cluster free memory:,843 Mb
H2O cluster total cores:,1
H2O cluster allowed cores:,1


<H2OConnection to http://127.0.0.1:54321, no session>

In [None]:
h2o.init(ip='127.0.0.1', 
         port=54321, 
         https=False)

In [None]:
#h2o.cluster().shutdown()

In [5]:
dct_all = {'col1': (1, 2, 3),
           'col2': ('a', 'b', 'c'),
           'col3': (0.1, 0.2, 0.3)}

In [6]:
dct_all['col1']

(1, 2, 3)

In [7]:
df_all = pd.DataFrame(dct_all)

In [8]:
df_all

Unnamed: 0,col1,col2,col3
0,1,a,0.1
1,2,b,0.2
2,3,c,0.3


**h2o JVM**

h2o runs on a Java Virtual Machine.  The Python h2o module allows us to send information and requests to the machine and to ask the machine for information.

We can send the information in a Pandas DataFrame to the h2o JVM with h2o.H2OFrame(df_all)


Send the data again

In [9]:
h2o.H2OFrame(df_all)

Parse progress: |█████████████████████████████████████████████████████████| 100%


col1,col2,col3
1,a,0.1
2,b,0.2
3,c,0.3




Where is the data?

In [10]:
h2o.ls()

Unnamed: 0,key
0,Key_Frame__upload_b1ef7090cbb190f56859324be5e3...
1,df_all


In [11]:
type(h2o.ls())

pandas.core.frame.DataFrame

In [12]:
h2o.ls().iloc[0][0]

'Key_Frame__upload_b1ef7090cbb190f56859324be5e34557.hex'

In [13]:
h2o.get_frame('type the Key_Frame...hex here')

**Delete all of our frames from the h2o JVM**

In [14]:
df_h2o_ls = h2o.ls()
df_h2o_ls

Unnamed: 0,key
0,Key_Frame__upload_b1ef7090cbb190f56859324be5e3...
1,df_all


In [15]:
for key in df_h2o_ls['key']:
    print(key)

Key_Frame__upload_b1ef7090cbb190f56859324be5e34557.hex
df_all


In [16]:
for key in df_h2o_ls['key'].values:
    print(key)

Key_Frame__upload_b1ef7090cbb190f56859324be5e34557.hex
df_all


In [17]:
for key in df_h2o_ls['key']:
    h2o.remove(key)

In [18]:
h2o.ls()

Unnamed: 0,key
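
As a shortcut for the loop above, the h2o module also provides h2o.remove_all(). The sketch below wraps it in a hypothetical helper, remove_everything (a name chosen here for illustration), and assumes a cluster is already connected. Note that remove_all() clears every object on the cluster, models included, not just frames.

```python
def remove_everything():
    """Clear the H2O cluster and return the key listing afterwards.

    Hypothetical helper for illustration; assumes h2o.connect() or
    h2o.init() has already been called.
    """
    import h2o  # imported lazily so that defining the helper needs no cluster

    h2o.remove_all()  # removes all keys (frames and models) from the cluster
    return h2o.ls()   # listing of whatever keys remain
```

Calling remove_everything() should leave h2o.ls() empty, as in the output above.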


**Send the data to the h2o JVM and define the key**

Use h2o.H2OFrame(name of data frame, destination_frame="name in h2o")
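
The cell that performs this upload is not shown in the notebook. A minimal sketch of what it could look like follows, assuming a cluster is already connected; send_with_key is a hypothetical helper name used only for illustration.

```python
def send_with_key(data, key):
    """Upload `data` to the H2O cluster and store it under `key`.

    Hypothetical helper for illustration; `data` can be anything that
    h2o.H2OFrame accepts, such as a pandas DataFrame or a dict of columns.
    """
    import h2o  # imported lazily so that defining the helper needs no cluster

    # destination_frame sets the key under which the frame appears in h2o.ls()
    return h2o.H2OFrame(data, destination_frame=key)
```

With this in place, send_with_key(df_all, 'df_all') would make the frame retrievable with h2o.get_frame('df_all').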

In [19]:
h2o.ls()

Unnamed: 0,key


In [20]:
# Now it is easier to get the frame
h2o.get_frame('df_all')

In [21]:
h2o.remove('df_all')

**Send the data to the h2o JVM and create a "handle" for it in Python**

As above, but now: h2o_df_all = h2o.H2OFrame(...)

In [22]:
h2o_df_all = h2o.H2OFrame(df_all, 'df_all')

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [23]:
type(h2o_df_all)

h2o.frame.H2OFrame

This "handle" points to the frame stored in the h2o JVM.

We can call methods on it, and Python knows to ask the JVM to do the work.

For example, we can send a request to the JVM to tell us the shape of the data:

In [24]:
h2o_df_all.shape

(3, 3)

In [25]:
h2o.remove('df_all')

### Methods for h2o frames

In [26]:
import numpy as np

In [27]:
mat = np.random.randn(100,4)

In [28]:
type(mat)

numpy.ndarray

In [29]:
mat[0:10, 0:4]

array([[-2.47115784, -0.43411595, -0.1624479 , -0.53610705],
       [-1.14722448, -1.4654785 , -0.74584884,  1.10236682],
       [-0.27772572,  0.5380224 ,  1.38077058,  0.3007472 ],
       [-2.08251855,  1.24636186,  0.47877362,  0.50794017],
       [-0.59824763, -0.04598924,  0.05614474,  0.9703106 ],
       [-0.29291966, -1.12989933, -0.74861899,  0.49434264],
       [ 0.91722388, -0.86644566,  0.13031941, -0.41613006],
       [ 0.82275804,  0.07861058,  0.52654079, -1.207546  ],
       [ 0.34395933,  0.06076843, -0.14675539,  0.48268036],
       [-0.9575299 ,  1.05183761, -0.00989106,  0.06464874]])

In [30]:
df_all = pd.DataFrame(mat, columns=list('ABCD'))

In [31]:
df_all.head()

Unnamed: 0,A,B,C,D
0,-2.471158,-0.434116,-0.162448,-0.536107
1,-1.147224,-1.465479,-0.745849,1.102367
2,-0.277726,0.538022,1.380771,0.300747
3,-2.082519,1.246362,0.478774,0.50794
4,-0.598248,-0.045989,0.056145,0.970311


Now send the data to the h2o JVM

As before with h2o_df_all = h2o.H2OFrame(...)

In [32]:
h2o_df_all =  h2o.H2OFrame(df_all, 'df_all')

Parse progress: |█████████████████████████████████████████████████████████| 100%


**head and tail methods**

Similar to Pandas - use h2o_df_all.head()

In [33]:
# default in Pandas is 5 rows; the h2o default is 10
# df_all.head()  #pandas
h2o_df_all.head()  #h2o

A,B,C,D
-2.47116,-0.434116,-0.162448,-0.536107
-1.14722,-1.46548,-0.745849,1.10237
-0.277726,0.538022,1.38077,0.300747
-2.08252,1.24636,0.478774,0.50794
-0.598248,-0.0459892,0.0561447,0.970311
-0.29292,-1.1299,-0.748619,0.494343
0.917224,-0.866446,0.130319,-0.41613
0.822758,0.0786106,0.526541,-1.20755
0.343959,0.0607684,-0.146755,0.48268
-0.95753,1.05184,-0.00989106,0.0646487




In [34]:
# and tail is the same....
h2o_df_all.tail()

A,B,C,D
-0.182559,0.0387505,0.799841,0.684986
-0.975321,-1.55732,-1.10503,1.34218
-0.932271,-0.575013,-0.0527701,-1.61499
-0.223381,-0.0752052,0.517266,1.10151
-2.74959,-0.681963,0.0439335,0.465204
-0.567208,1.3909,-0.484036,1.66231
0.911352,0.8837,1.06185,1.42315
-0.737166,-0.0816223,1.11206,0.0101389
-0.139895,-0.233609,1.35299,-0.147354
0.839878,-0.352736,0.0298493,1.86887




**columns attribute**

Similar to Pandas

In [35]:
df_all.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [36]:
# and .columns with h2o
h2o_df_all.columns

['A', 'B', 'C', 'D']

And you can also use .types (similar to pandas .dtypes)

In [37]:
# or .types
h2o_df_all.types

{'A': 'real', 'B': 'real', 'C': 'real', 'D': 'real'}

**describe method**

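Similar to Pandas, although the h2o output is laid out differently: it also reports the column type, the number of zeros and missing values, and the first few rows.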

In [38]:
df_all.describe()

Unnamed: 0,A,B,C,D
count,100.0,100.0,100.0,100.0
mean,-0.024857,-0.075239,0.036512,0.027847
std,1.10337,0.987522,0.952652,1.113161
min,-2.749595,-2.209788,-2.168616,-2.741339
25%,-0.764599,-0.834056,-0.646961,-0.701228
50%,-0.021983,-0.078414,0.056015,0.061279
75%,0.827038,0.559029,0.642095,0.717835
max,2.313557,2.858188,2.070478,3.627819


In [39]:
# or describe
h2o_df_all.describe()

Rows:100
Cols:4




Unnamed: 0,A,B,C,D
type,real,real,real,real
mins,-2.749594586825697,-2.209787572434603,-2.168616335995749,-2.7413388983822204
mean,-0.024856977018178288,-0.07523868923721971,0.0365119340925782,0.027847417856134975
maxs,2.3135572376443547,2.858188163105298,2.070477782846559,3.627818655174457
sigma,1.103370243142427,0.9875219832858204,0.9526523537223163,1.1131614088407233
zeros,0,0,0,0
missing,0,0,0,0
0,-2.4711578394490346,-0.4341159526646765,-0.16244790375187562,-0.5361070487054688
1,-1.147224479808595,-1.465478500757383,-0.7458488384669256,1.102366821966816
2,-0.277725722584112,0.5380223997440394,1.3807705799204026,0.30074719692638524


**select columns**

By location (integer)


In [40]:
# try using 0 (zero) as a slice
h2o_df_all[0]

A
-2.47116
-1.14722
-0.277726
-2.08252
-0.598248
-0.29292
0.917224
0.822758
0.343959
-0.95753




or by column name

In [41]:
# the column name also works, try 'A'
h2o_df_all['A']

A
-2.47116
-1.14722
-0.277726
-2.08252
-0.598248
-0.29292
0.917224
0.822758
0.343959
-0.95753




In Pandas you have to be a little more explicit...

In [42]:
# df_all[0]  # won't work: raises a KeyError because there is no column labelled 0

In [43]:
# rather you have to give it the location as an integer.... eg .iloc[:, 0]
df_all.iloc[:,0]

0    -2.471158
1    -1.147224
2    -0.277726
3    -2.082519
4    -0.598248
5    -0.292920
6     0.917224
7     0.822758
8     0.343959
9    -0.957530
10   -0.340331
11    1.260976
12    1.571904
13    1.693347
14    0.054149
15    0.396382
16   -0.658620
17    1.683607
18    1.780733
19   -0.952059
20    0.726750
21    0.870807
22   -1.156521
23   -0.772091
24    0.045874
25   -1.875366
26   -0.113604
27    0.109923
28    0.405979
29   -1.233644
        ...   
70    1.232271
71    0.798431
72    1.056531
73   -1.420481
74    1.508727
75    0.782547
76   -0.699397
77   -0.700294
78   -0.058067
79    0.256852
80    1.315372
81    1.081311
82    0.692046
83    0.490750
84   -0.637963
85   -0.454736
86   -1.632188
87   -0.595101
88   -0.731324
89    0.369718
90   -0.182559
91   -0.975321
92   -0.932271
93   -0.223381
94   -2.749595
95   -0.567208
96    0.911352
97   -0.737166
98   -0.139895
99    0.839878
Name: A, Length: 100, dtype: float64

Selecting multiple columns works in the same way: pass a list of column names, e.g.

In [44]:
#slice with ['A', 'B']
h2o_df_all[['A', 'B']]

A,B
-2.47116,-0.434116
-1.14722,-1.46548
-0.277726,0.538022
-2.08252,1.24636
-0.598248,-0.0459892
-0.29292,-1.1299
0.917224,-0.866446
0.822758,0.0786106
0.343959,0.0607684
-0.95753,1.05184




**Summing rows and columns**

In [45]:
df_all.sum(axis=0)

A   -2.485698
B   -7.523869
C    3.651193
D    2.784742
dtype: float64

In [46]:
df_all.sum(axis=1)[0:10]

0   -3.603829
1   -2.256185
2    1.941814
3    0.150557
4    0.382218
5   -1.677095
6   -0.235032
7    0.220363
8    0.740653
9    0.149065
dtype: float64

In [47]:
df_all.apply(sum, axis=0)

A   -2.485698
B   -7.523869
C    3.651193
D    2.784742
dtype: float64

With an anonymous function

In [48]:
df_all.apply(lambda x: sum(x), axis=0)

A   -2.485698
B   -7.523869
C    3.651193
D    2.784742
dtype: float64

Now in h2o

In [49]:
h2o_df_all.apply(lambda thing: thing.sum(), axis=0)

A,B,C,D
-2.4857,-7.52387,3.65119,2.78474




In [50]:
h2o_df_all.apply(lambda thing: thing.mean(), axis=0)

A,B,C,D
-0.024857,-0.0752387,0.0365119,0.0278474




In [51]:
h2o_df_all.apply(lambda thing: thing.mean(), axis=1)

C1
-0.900957
-0.564046
0.485454
0.0376393
0.0955546
-0.419274
-0.0587581
0.0550908
0.185163
0.0372663
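
The axis argument is easy to mix up, so here is a small pure-Python sketch (not h2o code) on a made-up 2x3 table: axis=0 collapses the rows and gives one value per column, while axis=1 collapses the columns and gives one value per row.

```python
# A made-up table with 2 rows and 3 columns (illustrative data only)
rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


# axis=0: one result per column (zip(*rows) iterates over the columns)
col_means = [mean(col) for col in zip(*rows)]

# axis=1: one result per row
row_means = [mean(row) for row in rows]

print(col_means)  # [2.5, 3.5, 4.5]
print(row_means)  # [2.0, 5.0]
```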




**Missings**

In [52]:
dct_missings = {'A': [1, 2, 3, np.nan],
                'B': [1, 2, 3, None],
                'C': ['a', 'b', 'c', 'NA'],
                'D': ['this', 'is', 'string', None]}


In [53]:
df_missings = pd.DataFrame(dct_missings)
df_missings

Unnamed: 0,A,B,C,D
0,1.0,1.0,a,this
1,2.0,2.0,b,is
2,3.0,3.0,c,string
3,NaN,NaN,NA,None


In [54]:
df_missings.loc[3, 'C']

'NA'

In [55]:
df_missings.loc[3, 'C'] == 'NA'

True

In [56]:
df_missings.dtypes

A    float64
B    float64
C     object
D     object
dtype: object

In [57]:
h2o_df_missings = h2o.H2OFrame.from_python(dct_missings,
                                           column_types=['numeric', 'numeric', 'enum', 'string'], 
                                           destination_frame="df_missings")
                                
h2o_df_missings

Parse progress: |█████████████████████████████████████████████████████████| 100%


A,B,C,D
1.0,1.0,a,this
2.0,2.0,b,is
3.0,3.0,c,string
,,,




In [58]:
h2o_df_missings[3, 2] == 'NA'

True

Note that the None value in the string column gets stored as an empty string i.e. ''

In [59]:
h2o_df_missings[3, 3]

''

In [60]:
h2o_df_missings[3, 3] == ''

True

isna() will find missing numbers, but:
 - 'NA' is not missing - it is a string with the letters N and A in it...
 - The empty string we created with None does not show up with isna()

In [61]:
h2o_df_missings.isna()

isNA(A),isNA(B),isNA(C),isNA(D)
0,0,0,0
0,0,0,0
0,0,0,0
1,1,0,0
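
The same distinction can be reproduced in plain Python. This is only an analogy under the assumption that "missing" means a float NaN; it is not how h2o implements isna().

```python
import math


def is_missing(value):
    """Return True only for a float NaN, the numeric missing-value marker."""
    return isinstance(value, float) and math.isnan(value)


numbers = [1.0, 2.0, 3.0, float("nan")]
labels = ["a", "b", "c", "NA"]          # 'NA' is an ordinary two-letter string
strings = ["this", "is", "string", ""]  # '' is an ordinary empty string

print([is_missing(v) for v in numbers])  # [False, False, False, True]
print([is_missing(v) for v in labels])   # [False, False, False, False]
print([is_missing(v) for v in strings])  # [False, False, False, False]
```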




It seems that when we apply operations to columns, the default is to exclude missings....

In [62]:
h2o_df_missings['A'].mean()

[2.0]

In [63]:
# which is the same as ...
h2o_df_missings['A'].mean(na_rm=True)

[2.0]

In [64]:
# but...
h2o_df_missings['A'].mean(na_rm=False)

[nan]
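
To see why na_rm=False returns nan, here is a plain-Python sketch that mimics the behaviour shown above. column_mean is a hypothetical function written for illustration, not part of h2o.

```python
import math


def column_mean(values, na_rm=True):
    """Mean of a list of floats, optionally dropping NaN values first."""
    if na_rm:
        values = [v for v in values if not math.isnan(v)]
    # Any NaN left in the list propagates through sum() into the result
    return sum(values) / len(values)


column_a = [1.0, 2.0, 3.0, float("nan")]

print(column_mean(column_a))               # 2.0
print(column_mean(column_a, na_rm=False))  # nan
```

A single missing value is enough to turn the whole result into nan, which is why dropping missings first is a convenient default.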

Other h2o dataframe methods:
- .hist
- .countmatches
- .sub / .gsub
- .strsplit
- .rbind and .cbind
- .merge
- .group_by

h2o dataframes can also handle date and time data

**Categorical data**

 - Known as "enum" data type 
 - This is the same as a "factor" in R
 - It seems that if we do not explicitly store a column as a string, it will be stored as "enum"

In [65]:
df_enum = h2o.H2OFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [66]:
df_enum

A,B
foo,one
bar,one
foo,two
bar,three
foo,two
bar,two
foo,one
bar,three




In [67]:
df_enum.types

{'A': 'enum', 'B': 'enum'}

Just as in R, we can get the levels of our factors...

In [68]:
df_enum['A'].levels()

[['bar', 'foo']]

In [69]:
df_enum['B'].levels()

[['one', 'three', 'two']]
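
In both outputs the levels appear in alphabetical order. Assuming that ordering, a plain-Python equivalent (not h2o code) is the sorted set of distinct values:

```python
column_a = ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar']
column_b = ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']

# Distinct values, sorted alphabetically
levels_a = sorted(set(column_a))
levels_b = sorted(set(column_b))

print(levels_a)  # ['bar', 'foo']
print(levels_b)  # ['one', 'three', 'two']
```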

Creating interactions

We will talk later about interactions in our models, but note the following ....

In [70]:
df_enum.interaction(['A', 'B'], pairwise=False, max_factors=100, min_occurrence=1)

Interactions progress: |██████████████████████████████████████████████████| 100%


A_B
foo_one
bar_one
foo_two
bar_three
foo_two
bar_two
foo_one
bar_three




In [71]:
df_enum.interaction(['A', 'B'], pairwise=False, max_factors=100, min_occurrence=2)

Interactions progress: |██████████████████████████████████████████████████| 100%


A_B
foo_one
other
foo_two
bar_three
foo_two
other
foo_one
bar_three
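
The effect of min_occurrence can be reproduced with a short plain-Python sketch. This mimics only the behaviour visible in the two outputs above (it ignores max_factors and pairwise) and is not h2o's implementation; interact is a hypothetical function name.

```python
from collections import Counter


def interact(col_a, col_b, min_occurrence=1, sep="_", other="other"):
    """Combine two categorical columns; rare combinations become `other`."""
    combined = [f"{a}{sep}{b}" for a, b in zip(col_a, col_b)]
    counts = Counter(combined)
    # Keep a combined label only if it occurs at least min_occurrence times
    return [label if counts[label] >= min_occurrence else other
            for label in combined]


col_a = ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar']
col_b = ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']

print(interact(col_a, col_b, min_occurrence=2))
```

Here bar_one and bar_two each occur only once, so with min_occurrence=2 they are replaced by other, matching the second output above.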




- df_all.anyfactor()
- .isfactor() and .asfactor()
- df_all[colname].levels()
- df_all.interaction(['col1','col2'], args)

In [73]:
h2o.cluster().shutdown()

H2O session _sid_8ddd closed.



