## Introduction to h2o

Source
 - Based on the h2o Python booklet
 - https://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/PythonBooklet.pdf
 - You are encouraged to read the booklet carefully up to the Machine Learning section. You are more than welcome to read the whole booklet!
 - You are also encouraged to listen to at least the first part of Amy Wang's tutorial at: https://www.youtube.com/watch?v=g7drhm_SdbQ

In [None]:
import pandas as pd
import h2o

In [None]:
# At the time of writing (May 2020), the latest h2o version is 3.30.0.3
print(h2o.__version__)

### Start h2o

In [None]:
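# Connect to an h2o cluster that is already running at this address
h2o.connect(ip='127.0.0.1',
            port=54321,
            https=False)

In [None]:
# Or use init, which connects to a running cluster and starts a local one if none is found
h2o.init(ip='127.0.0.1',
         port=54321,
         https=False)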

In [None]:
#h2o.cluster().shutdown()

In [None]:
dct_all = {'col1': (1, 2, 3),
           'col2': ('a', 'b', 'c'),
           'col3': (0.1, 0.2, 0.3)}

In [None]:
dct_all['col1']

In [None]:
df_all = pd.DataFrame(dct_all)

In [None]:
df_all

**h2o JVM**

h2o runs on a Java Virtual Machine (JVM).  The Python h2o module lets us send data and requests to that JVM and read the results back.

We can send the contents of a Pandas DataFrame to the h2o JVM with h2o.H2OFrame(df_all)


Send the data again

In [None]:
h2o.H2OFrame(df_all)

Where is the data?

In [None]:
h2o.ls()

In [None]:
type(h2o.ls())

In [None]:
# first row, first column: the key of the first frame
h2o.ls().iloc[0, 0]

In [None]:
h2o.get_frame('type the Key_Frame...hex here')

**Delete all of our frames from the h2o JVM**

In [None]:
df_h2o_ls = h2o.ls()
df_h2o_ls

In [None]:
for key in df_h2o_ls['key']:
    print(key)

In [None]:
for key in df_h2o_ls['key'].values:
    print(key)
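
The two loops above print the same keys, because iterating over a Pandas column and over its .values array both yield the stored values one at a time. A minimal sketch, using a hypothetical stand-in for the DataFrame returned by h2o.ls() (the key names are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by h2o.ls();
# the key names below are invented for this example.
df_keys = pd.DataFrame({'key': ['frame_a.hex', 'frame_b.hex']})

# Iterating over the column (a Series) yields its values
keys_from_series = [key for key in df_keys['key']]

# Iterating over the underlying NumPy array yields the same values
keys_from_values = [key for key in df_keys['key'].values]

# tolist() gives a plain Python list directly
keys_from_tolist = df_keys['key'].tolist()

print(keys_from_series == keys_from_values == keys_from_tolist)
```
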

In [None]:
for key in df_h2o_ls['key']:
    h2o.remove(key)

In [None]:
h2o.ls()

**Send the data to the h2o JVM and define the key**

Use h2o.H2OFrame(name of data frame, destination_frame="name in h2o")
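
One way to wrap this call is sketched below. The helper name send_to_h2o and the key 'df_demo' are assumptions made for illustration; the sketch also assumes that h2o is installed and that a cluster has already been started with h2o.init(). The h2o import sits inside the function so that the block can be defined without a running cluster.

```python
import pandas as pd


def send_to_h2o(df, key):
    """Upload a Pandas DataFrame to the h2o JVM under an explicit key.

    Hypothetical helper: assumes h2o is installed and h2o.init() has been called.
    """
    import h2o  # imported lazily so the definition itself does not need a cluster
    return h2o.H2OFrame(df, destination_frame=key)


df_demo = pd.DataFrame({'col1': [1, 2, 3],
                        'col2': ['a', 'b', 'c']})

# With a cluster running, the frame could then be sent and fetched by its key:
#   h2o_df_demo = send_to_h2o(df_demo, 'df_demo')
#   h2o.get_frame('df_demo')
```

Choosing the key ourselves means we no longer have to look it up with h2o.ls() before calling h2o.get_frame or h2o.remove.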

In [None]:
h2o.ls()

In [None]:
# Now it is easier to get the frame
h2o.get_frame('df_all')

In [None]:
h2o.remove('df_all')

**Send the data to the h2o JVM and create a "handle" for it in Python**

As above, but now: h2o_df_all = h2o.H2OFrame(...)

In [None]:
h2o_df_all = h2o.H2OFrame(df_all, destination_frame="df_all")

In [None]:
type(h2o_df_all)

This "handle" points to the frame stored in the h2o JVM.

We can call methods on it, and the Python module forwards each request to the JVM.

For example, we can ask the JVM for the shape of the data

In [None]:
h2o_df_all.shape

In [None]:
h2o.remove('df_all')

### Methods for h2o frames

In [None]:
import numpy as np

In [None]:
mat = np.random.randn(100,4)

In [None]:
type(mat)

In [None]:
mat[0:10, 0:4]

In [None]:
df_all = pd.DataFrame(mat, columns=list('ABCD'))

In [None]:
df_all.head()

Now send the data to the h2o JVM

As before with h2o_df_all = h2o.H2OFrame(...)

In [None]:
h2o_df_all = h2o.H2OFrame(df_all)

**head and tail methods**

Similar to Pandas - use h2o_df_all.head()

In [None]:
# default in Pandas is 5 obs


In [None]:
# and tail is the same....


**columns attribute**

Similar to Pandas

In [None]:
df_all.columns

In [None]:
# and .columns with h2o


And you can also use .types (similar to pandas .dtypes)

In [None]:
# or .types


**describe method**

Similar to Pandas, although the h2o summary is laid out differently.

In [None]:
df_all.describe()

In [None]:
# or describe


**select columns**

By location (integer)


In [None]:
# try using 0 (zero) as a slice


or by column name

In [None]:
# the column name also works, try 'A'


In Pandas you have to be a little bit more explicit...

In [None]:
# raises KeyError: 0 is not a column label in df_all
df_all[0]

In [None]:
# rather you have to give it the location as an integer.... eg .iloc[:, 0]
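
The Pandas side of this comparison can be sketched on a small stand-in frame (df_demo is a made-up example, not the df_all used above). Square brackets look up column labels, so an integer fails when the labels are letters, while .iloc selects by position:

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same column labels as df_all
df_demo = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

# Square brackets look up labels, so 0 is not found among 'A'..'D'
try:
    df_demo[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional selection needs .iloc
first_by_position = df_demo.iloc[:, 0]

# Label-based selection works with the column name
first_by_label = df_demo['A']

# A list of labels selects several columns at once
two_columns = df_demo[['A', 'B']]

print(label_lookup_failed, first_by_position.equals(first_by_label))
```
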


Selecting multiple rows or columns works as you would expect, e.g.

In [None]:
#slice with ['A', 'B']


**Summing rows and columns**

In [None]:
df_all.sum(axis=0)

In [None]:
df_all.sum(axis=1)[0:10]

In [None]:
df_all.apply(sum, axis=0)

With an anonymous function

In [None]:
df_all.apply(lambda x: sum(x), axis=0)
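
As a quick check on the Pandas cells above, the sketch below confirms that axis=0 gives one sum per column, axis=1 gives one sum per row, and the three ways of summing columns agree. The frame df_demo and the fixed seed are assumptions made so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for df_all: 100 rows, 4 columns
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.standard_normal((100, 4)), columns=list('ABCD'))

col_sums = df_demo.sum(axis=0)   # one value per column
row_sums = df_demo.sum(axis=1)   # one value per row

# apply passes each column to the function when axis=0
col_sums_apply = df_demo.apply(sum, axis=0)
col_sums_lambda = df_demo.apply(lambda x: sum(x), axis=0)

print(col_sums.shape, row_sums.shape)
print(np.allclose(col_sums, col_sums_apply),
      np.allclose(col_sums, col_sums_lambda))
```
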

Now in h2o

In [None]:
h2o_df_all.apply(lambda thing: thing.sum(), axis=0)

In [None]:
h2o_df_all.apply(lambda thing: thing.mean(), axis=0)

In [None]:
h2o_df_all.apply(lambda thing: thing.mean(), axis=1)

**Missings**

In [None]:
dct_missings = {'A': [1, 2, 3, np.nan],
                'B': [1, 2, 3, None],
                'C': ['a', 'b', 'c', 'NA'],
                'D': ['this', 'is', 'string', None]}


In [None]:
df_missings = pd.DataFrame(dct_missings)
df_missings

In [None]:
df_missings.loc[3, 'C']

In [None]:
df_missings.loc[3, 'C'] == 'NA'

In [None]:
df_missings.dtypes
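
Before moving to h2o, it may help to see which entries Pandas itself treats as missing. The sketch below rebuilds the same data under the assumed name df_demo and inspects the last row with isna():

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, 3, np.nan],
                        'B': [1, 2, 3, None],
                        'C': ['a', 'b', 'c', 'NA'],
                        'D': ['this', 'is', 'string', None]})

# isna() flags np.nan and None, but 'NA' is an ordinary string
missing_last_row = df_demo.isna().iloc[3]
print(missing_last_row.to_dict())
```

In Pandas, np.nan and None are both flagged, while the string 'NA' in column C is not.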

In [None]:
h2o_df_missings = h2o.H2OFrame.from_python(dct_missings,
                                           column_types=['numeric', 'numeric', 'enum', 'string'], 
                                           destination_frame="df_missings")
                                
h2o_df_missings

In [None]:
h2o_df_missings[3, 2] == 'NA'

Note that the None value in the string column gets stored as an empty string i.e. ''

In [None]:
h2o_df_missings[3, 3]

In [None]:
h2o_df_missings[3, 3] == ''

isna() will find missing numbers.  But 
 - 'NA' is not missing - it is a string with the letters N and A in it...
 - The empty string we created with None does not show up with isna()

In [None]:
h2o_df_missings.isna()

It seems that when we perform operations on columns, the default is to exclude missing values.

In [None]:
h2o_df_missings['A'].mean()

In [None]:
# which is the same as ...
h2o_df_missings['A'].mean(na_rm=True)

In [None]:
# but...
h2o_df_missings['A'].mean(na_rm=False)
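
For comparison, Pandas behaves in a similar way through its skipna argument. This is a sketch of the Pandas side only; it does not call h2o, and the Series name col_a is an assumption for illustration.

```python
import numpy as np
import pandas as pd

col_a = pd.Series([1, 2, 3, np.nan])

# By default the missing value is ignored
mean_default = col_a.mean()

# Same result when skipping is requested explicitly
mean_skip = col_a.mean(skipna=True)

# With skipna=False the missing value propagates to the result
mean_keep = col_a.mean(skipna=False)

print(mean_default, mean_skip, mean_keep)
```
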

Other h2o dataframe methods:
- .hist
- .countmatches
- .sub / .gsub
- .strsplit
- .rbind and .cbind
- .merge
- .group_by

h2o frames can also handle date and time data

**Categorical data**

 - Known as the "enum" data type
 - This is the same as a "factor" in R
 - It seems that if we do not explicitly store a column as a string, h2o will store it as "enum"

In [None]:
df_enum = h2o.H2OFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

In [None]:
df_enum

In [None]:
df_enum.types

Just as in R, we can get the levels of our factors...

In [None]:
df_enum['A'].levels()

In [None]:
df_enum['B'].levels()

Creating interactions

We will talk later about interactions in our models, but note the following ....

In [None]:
df_enum.interaction(['A', 'B'], pairwise=False, max_factors=100, min_occurrence=1)

In [None]:
df_enum.interaction(['A', 'B'], pairwise=False, max_factors=100, min_occurrence=2)
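
To see why min_occurrence matters, it can help to count how often each combination of A and B occurs. The sketch below does this in Pandas only; it is an illustration of the idea, and h2o's own level names and its handling of rare combinations may differ. The names df_demo, combined and frequent are assumptions made for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Combine the two categorical columns into one label per row
combined = df_demo['A'] + '_' + df_demo['B']

# Count how many rows fall into each combination
counts = combined.value_counts()

# Combinations that occur at least twice
frequent = sorted(counts[counts >= 2].index)

print(counts.to_dict())
print(frequent)
```

Two of the five combinations occur only once, so a threshold of 2 leaves three combinations in this count.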

Other h2o methods for categorical data:
- .anyfactor()
- .isfactor() and .asfactor()
- .levels()
- .interaction(['col1', 'col2'], args)

In [None]:
h2o.cluster().shutdown()