# Automated Machine Learning with Interactive User Inputs

**Purpose:** This notebook is designed to interactively guide the user through an end-to-end process for deploying an automated machine learning workflow utilizing h2o.ai's autoML function.  The user is simply required to select a dataset and choose a variable they would like to predict before running the automation.  The user can choose to run the automation with default parameters or override those parameters following the input prompts embedded in this notebook.  This workflow is designed for all modelers - new and experienced - who are looking to leverage automated machine learning methods in their work.

**Data Mining Problems Covered:**
- Utilize this notebook to solve any "Supervised Learning" problem with a Categorical or Continuous Target Variable

**Edit Date:** 6/4/2019
    
<br/><br/>
- **Resources**:
    * http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
    * https://ipywidgets.readthedocs.io/en/stable/examples/Widget%20List.html
    * https://towardsdatascience.com/a-very-simple-demo-of-interactive-controls-on-jupyter-notebook-4429cf46aabd
    
<br/><br/>
- **Example Datasets**
    - **[Sample csv] = artificial_attrition_data.csv:** This is an artificial dataset containing employment records for a fictional company.
    - **[Sample xlsx] = Artificial Employment Data.xlsx:** This is an artificial dataset containing employment records for a fictional company. The file has (3) separate tabs to choose from when uploading into the workflow.
    - **[Sample GET REQUEST]** - https://gist.github.com/bradblanch/53857ddb24ad83287b900aa8968ff416: This is an artificial dataset containing employment records for a fictional company. The file is available through my gist account and can be imported directly from the web via the url import function contained in this workflow.  Any valid url containing data in .csv format can be uploaded directly within this tool.
    
## Table of Contents

**1.0** **- Ingest Data**
    * 1.1 - Set Your Working Directory
    * 1.2 - Upload Your Data (for Modeling)
    * 1.3 - Select a Data Frame (for Modeling)
     
**2.0** **- Train Models**
    * 2.1 - Select Your Target Variable
    * 2.2 - Initiate H2O
    * 2.3 - Configure Models
    * 2.4 - Automatically Train Models
    * 2.5 - Evaluate Models & Select Top Performer
   
**3.0** **- Make Predictions & Export Results**
    * 3.1 - Upload your Data (for Scoring) 
    * 3.2 - Select a Data Frame (for Scoring)  
    * 3.3 - Predict the Target Value
    * 3.4 - Export Dataframes for Offline Analysis

## Dependencies

This script was executed using the following version of Python:
* **Python 3.6.2 :: Anaconda, Inc.**

Use this link to install Python on your machine:
* https://www.anaconda.com/distribution/#download-section

**About Python Versions:**
If you are running a newer version of Python and this notebook fails to execute properly, you can downgrade your version in the terminal by running the following commands:
* conda search python [lists the Python versions available to install]
* conda install python=3.6.2 [switches the active version to 3.6.2, if it appears in the list above]

**About Python Packages:**
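The third-party packages used in this notebook can be installed on your machine using the "pip install [package_name]" command in your terminal.  The standard-library modules in the list below (os, datetime, re, io, warnings) ship with Python and do not need to be installed.  Be sure you've installed each of the remaining packages before attempting to execute the notebook.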

Current package requirements include:
* os - https://docs.python.org/3/library/os.html
* Pandas - https://pandas.pydata.org/
* Datetime - https://docs.python.org/3/library/datetime.html
* re - https://docs.python.org/3/library/re.html
* Numpy - http://www.numpy.org/
* ipywidgets - https://ipywidgets.readthedocs.io/en/stable/user_install.html
* ipython - https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/extension_api.html
* h2o - http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html#install-in-python
* scikit-learn - https://scikit-learn.org/stable/install.html
* requests - https://2.python-requests.org/en/master/user/install/
* io - https://docs.python.org/3/library/io.html
* warnings - https://docs.python.org/3/library/warnings.html

The current template uses the following versions:
* os== module 'os' from '/anaconda3/lib/python3.6/os.py'
* pandas==0.24.1
* datetime== module 'datetime' from '/anaconda3/lib/python3.6/datetime.py'
* re== module 're' from '/anaconda3/lib/python3.6/re.py'
* numpy==1.16.1
* ipywidgets==7.4.2
* ipython==6.2.1
* h2o==3.24.0.4
* scikit-learn==0.19.1
* requests==2.18.4
* io== module 'io' from '/anaconda3/lib/python3.6/io.py'
* warnings== module 'warnings' from '/anaconda3/lib/python3.6/warnings.py'

## Before you begin, ensure you've installed the required Python packages

* See the list above and make note of the specific versions that were used in this notebook
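
As a quick check before running the import cell, the sketch below reports which modules can be imported and, where available, their version. `report_versions` is a hypothetical helper written for this illustration, not part of the notebook. It assumes that packages expose a `__version__` attribute, which many do but not all. Note that some packages import under a different name than they install under (for example, scikit-learn imports as `sklearn` and ipython as `IPython`).

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string} for each requested module."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = 'not installed'
            continue
        # Some standard-library modules, such as os, carry no __version__
        report[name] = getattr(module, '__version__', 'no __version__ attribute')
    return report


for name, version in report_versions(['os', 'pandas', 'h2o']).items():
    print('{}: {}'.format(name, version))
```

Any module reported as "not installed" should be installed before continuing.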

In [1]:
############################################
###### Import required Python packages #####
############################################

import os
import pandas as pd
import re
import datetime as dt
import numpy as np
from ipywidgets import interact, interactive, IntSlider, Layout
import ipywidgets as widgets
from IPython.display import display
import h2o
from h2o.automl import H2OAutoML
from sklearn.model_selection import train_test_split
import io
import requests
import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'h2o'

## Note: Code Cells are Hidden by Default for Ease-of-Use

This notebook incorporates interactive "widgets", which require large blocks of code to enable specific user interactions.  Executing the cell below will hide all "Code" cells while keeping all outputs visible to the user.  Refer to the link below for the source or simply "run" the block below to see the impact on the rest of the notebook.

* https://stackoverflow.com/questions/27934885/how-to-hide-code-from-cells-in-ipython-notebook-visualized-with-nbviewer

#### Disclaimer:
* As the "output text" notes, simply click the "here" hyperlink in the text to toggle on/off this feature

In [None]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

## 1.0 - Data Ingestion

The series of code blocks below will walk you through the process of mapping to your working directory and uploading your dataset.

## 1.1 - Set Your Working Directory

Your "working directory" is a folder location on your computer that will store files either read-in or written-out by this script.  This code by default will return your current, active directory.  You can change this directory by typing in a specific path into the text box provided.

## AN IMPORTANT NOTE ABOUT INTERACTIVE WIDGETS

This notebook uses interactive widgets to help you make selections and inputs more conveniently.  As you work through this notebook, be sure to follow the steps below to ensure your selections are incorporated in the cells that follow:

#### 1. Run the cell containing the interactive widget(s) to bring them into view
#### 2. Apply your selections and/or inputs to the widgets that appear
#### 3. DO NOT rerun the cell as it will erase your selections and inputs
#### 4. To proceed, simply click on the next cell in the notebook, and Run it

<br/>

In [2]:
set_working_directory = widgets.Text(
    value=os.getcwd(),
    placeholder='/Users/bblanchard006/Desktop/digital_lab/autoML',
    description='Directory:',
    disabled=False,
    layout=Layout(width='50%')
)

display(set_working_directory)

Text(value='C:\\Users\\Brandon\\Documents\\GitHub\\CaseStudy3-Quantifying', description='Directory:', layout=L…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

After executing the cell above, you can leave the default directory or overwrite the text string that appears with your desired folder directory. **DO NOT execute the cell again after making your update.** The input above will be fed into the following code cell, where it will either successfully map to the new directory or notify you of an error.

In [3]:
try:
    os.chdir(set_working_directory.value)
    print('Changed directory to {}'.format(set_working_directory.value))
except Exception as e:
    print('Failed to change directory')
    print(e)

Changed directory to C:\Users\Brandon\Documents\GitHub\CaseStudy3-Quantifying


## 1.2 - Upload Your Data (Excel and CSV files)

The function in the code cell below will find, ingest, and format both xlsx and csv files.  This is the dataset with "known" values which will be used to train your models.

In [4]:
########################################
##### Data Ingestion Functions
########################################

def compile_raw_data(filename, tab_names, subfolder, delimiter_char = ',', skip_rows = 0, file_ext = 'xlsx'):
    
    # Inputs: 
    ## filename = 'sample.csv' | 'sample.xlsx' - the filename in the directory (including the extension) 
    ## tab_names = None | ['Sheet1','Sheet2'] - None for csv; [comma separated list of tab names] for xlsx
    ## subfolder = 'source_data' - string containing the name of a folder in the working directory
    ## delimiter_char = ',' | ';' - None for xlsx
    ## rows to skip = default 0 - Not used for csv; trims the user-defined number of rows from an xlsx
    ## file extension = csv | xlsx
    
    # Description: reads in the workbook; standardizes header names; 
    # Outputs: returns a dictionary of dataframes

    master_data = {}
    if subfolder:
        file_path = subfolder+'/{}'.format(filename)
    else:
        file_path = filename

    if file_ext == 'csv':
        # Strip only a literal trailing '.csv' (an unescaped '.' matches any character)
        tab_names = [re.sub(r'\.csv$', '', filename)]

    for tab in tab_names:
        try:
            if file_ext == 'xlsx':
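                # Pass the sheet and the rows to skip by keyword so they are not mistaken for other arguments
                dframe = pd.read_excel(file_path, sheet_name=tab, skiprows=skip_rows)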
            elif file_ext == 'csv' and delimiter_char == ',':
                dframe = pd.read_csv(file_path, header=0, delimiter=',')
            else:
                dframe = pd.read_csv(file_path, header=0, delimiter=';')
                
            sanitizer = {
                        '$':'USD',
                        '(':' ',
                        ')':' ',
                        '/':' ',
                        '-':' ',
                        ',':' ',
                        '.':' '
            }
                        
            for key, value in sanitizer.items():
                dframe.rename(columns=lambda x: x.replace(key, value), inplace=True)
                
            dframe.rename(columns=lambda x: x.strip(), inplace=True)
            dframe.rename(columns=lambda x: re.sub(' +','_', x), inplace=True)
            
            dframe.columns = map(str.lower, dframe.columns)
            
            master_data.update({tab:dframe})
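        except Exception as e:
            # Report why the load failed instead of silently discarding the error
            print('Failed to load {}: {}'.format(tab, e))
            master_data.update({tab:'Failed'})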
    
    return master_data
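
The header clean-up performed inside `compile_raw_data` can be illustrated in isolation. The sketch below applies the same substitutions to a single column name; `sanitize_column` and the sample names are hypothetical and exist only for this example.

```python
import re

# Same character substitutions as the `sanitizer` dictionary in compile_raw_data
SANITIZER = {'$': 'USD', '(': ' ', ')': ' ', '/': ' ', '-': ' ', ',': ' ', '.': ' '}


def sanitize_column(name):
    """Standardize one header: swap symbols, trim, join words with '_', lowercase."""
    for old, new in SANITIZER.items():
        name = name.replace(old, new)
    name = name.strip()
    name = re.sub(' +', '_', name)  # collapse runs of spaces into one underscore
    return name.lower()


print(sanitize_column(' Salary ($) / Year '))  # salary_usd_year
print(sanitize_column('Dept.-Code'))           # dept_code
```

Knowing the cleaned names in advance helps when you later pick variables from the selection widgets, since those lists show the standardized headers rather than the originals.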

The code blocks below enable conditional filtering to support multiple file types. Further instructions are provided below:

**Uploading csv files**

To upload a csv file, complete these steps:
1. Type in your filename along with the extension (ex. sample.csv)
2. Check the 'csv' radio-button
3. Is your file in the main directory or a sub-folder in the directory:
    * Select the "no" radio-button if your file is in your main directory
    * Select the "yes" radio-button to expose a text-box where you can type-in the name of your sub-folder
    
**Uploading xlsx files**

To upload an xlsx file, complete these steps:
1. Type in your filename along with the extension (ex. sample.xlsx)
2. Check the 'xlsx' radio-button
3. Type in the tab-names you'd like to ingest (comma-separated; Sheet1,Sheet2,Sheet3)
4. If the data in your file has leading rows, select how many rows to skip before ingesting the data (ex. if your data starts on Row 2 in the Excel-file, set the Skip Rows value to 1)
5. Is your file in the main directory or a sub-folder in the directory:
    * Select the "no" radio-button if your file is in your main directory
    * Select the "yes" radio-button to expose a text-box where you can type-in the name of your sub-folder
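
To make the "Skip Rows" and "Delimiter" settings concrete, here is a minimal sketch using an in-memory, semicolon-delimited file with one leading title row. The data is invented for illustration, and `pd.read_csv` stands in for the Excel reader; the `skiprows` argument has the same meaning in `pd.read_excel`.

```python
import io

import pandas as pd

# One title row precedes the real header, so one row must be skipped
raw = "Report title row\nemployee_id;salary\n1;50000\n2;62000\n"

frame = pd.read_csv(io.StringIO(raw), skiprows=1, delimiter=';')

print(frame.columns.tolist())
print(frame.shape)
```

With `skiprows=1`, the second line of the file becomes the header, which matches the "data starts on Row 2" case described in the steps above.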

In [5]:
upload_type = widgets.RadioButtons(
    options=['local', 'url'],
    description='File Location:',
    disabled=False
)

upload_url = widgets.Text(
#    value='https://gist.githubusercontent.com/bradblanch/53857ddb24ad83287b900aa8968ff416/raw/790b0a27392f69a5efcf2c1ac7d0bdad5e38428e/artificial_attrition_data.csv',
    value='https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv',
    placeholder='http://',
    description='URL:',
    disabled=False,
    layout=Layout(width='80%')
)
upload_filename = widgets.Text(
    value='artificial_attrition_data.csv',
#     value='Artificial Employment Data.xlsx'
    placeholder='Sample File.csv',
    description='File Name:',
    disabled=False,
    layout=Layout(width='50%')
)

file_type = widgets.RadioButtons(
    options=['csv', 'xlsx'],
    description='File Type:',
    disabled=False
)

tab_names = widgets.Text(
    value='Sheet1, Sheet2, Sheet3, etc',
    placeholder='ALL EMPLOYEES, PAST EMPLOYEES',
    description='Tab(s):',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder_name = widgets.Text(
    value='source_data',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

skip_rows = widgets.IntSlider(
    value=0,
    min=0,
    max=10,
    step=1,
    description='Skip Rows:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

delimiter = widgets.RadioButtons(
    options=[',',';'],
    value=',',
    description='Delimiter:',
    disabled=False
)

def text_field(x):
    if(x=='xlsx'):
        display(tab_names)
        # tab_names.value is read directly in a later cell, so no submit callback is registered here
        display(skip_rows)
    else:
        display(delimiter)
        print('Tab Names: Not needed for csv files')

def sub_folder(y):
    if(y=='yes'):
        display(subfolder_name)
        # subfolder_name.value is read directly in a later cell, so no submit callback is registered here
    else:
        print('Using {} folder'.format(os.getcwd()))

def file_location(z):
    if(z=='local'):
        display(upload_filename)
        i = widgets.interactive(text_field, x=file_type)
        display(i)
        p = widgets.interactive(sub_folder, y=subfolder)
        display(p)
    else:
        display(upload_url)
    
q = widgets.interactive(file_location, z=upload_type)

display(q)

interactive(children=(RadioButtons(description='File Location:', options=('local', 'url'), value='local'), Out…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

The following code cell will attempt to ingest the data you've selected in the widgets above:

**Note About xlsx Files** - Depending on the number of tabs and the size of the data on each tab, ingesting an xlsx file can take several minutes to execute.  If possible, it may be more efficient to break your Excel file into separate csv files which take only a fraction of a second to ingest.
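
For URL uploads, the next cell wraps the downloaded bytes in `io.BytesIO` so that pandas can read them like a file. The sketch below shows that step on a hard-coded byte string instead of a live download; `csv_bytes_to_frame` and the sample payload are hypothetical. In real use you may also want to call `raise_for_status()` on the response first, so that a failed download is reported before parsing begins.

```python
import io

import pandas as pd


def csv_bytes_to_frame(content):
    """Parse raw CSV bytes (such as a response's `.content`) into a DataFrame."""
    return pd.read_csv(io.BytesIO(content))


# Stand-in for the bytes returned by a GET request
payload = b"name,age\nAda,36\nGrace,45\n"
frame = csv_bytes_to_frame(payload)
print(frame)
```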

In [6]:
master_data = {}

if upload_type.value == 'url':
    url_response = requests.request("GET", upload_url.value)
    master_data['url_data'] = pd.read_csv(io.BytesIO(url_response.content))
else:
    if file_type.value == 'csv':
        tabs = None
        skiprows = 0
    else:
        tabs = [x.strip() for x in tab_names.value.split(',')]
        skiprows = skip_rows.value

    # Use a separate name so the `subfolder` widget is not overwritten when this cell is rerun
    if subfolder.value == 'yes':
        subfolder_path = subfolder_name.value
    else:
        subfolder_path = None
    master_data = compile_raw_data(upload_filename.value, tabs, subfolder_path, delimiter_char = delimiter.value, skip_rows = skiprows, file_ext = file_type.value)


**Note:** If you see an AttributeError: 'NoneType' object has no attribute 'value' message above, simply rerun the last two code cells to reset the input parameters.

The following code cell will print out the attributes associated with the files you've uploaded and alert you of any errors:

In [7]:
for key, value in master_data.items():
    try:
        print('{} table was ingested with {} rows and {} columns'.format(key,value.shape[0],value.shape[1]))
    except AttributeError:
        print('{} table failed to load'.format(key))

artificial_attrition_data table failed to load


## 1.3 - Select a Data Frame

The following menus will allow you to select the dataset you would like to use in your modeling and the variables you would like included in the subsequent processes.  You can preview a sample of the data as well as increase or decrease the number of records returned by using the integer input widget (which has a default range; minimum rows = 1, maximum rows = 50).

Select an available frame from the list below:

In [8]:
dict_keys = widgets.Select(
    options=master_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

Select(description='Tables:', layout=Layout(width='50%'), options=('artificial_attrition_data',), value='artif…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

After selecting a frame above, select the variables you would like included in your workflow from the list below:

**NOTE:** To select multiple values from the picklist, hold down the Command key (macOS) or the Ctrl key (Windows/Linux) while clicking, or hold the Shift key to select a range of variables.  You can scroll down if your mouse is within the widget window.

In [9]:
review_variables = widgets.SelectMultiple(
    options=master_data[dict_keys.value].columns.tolist(),
    description='Variables:',
    disabled=False
)

display(review_variables)

AttributeError: 'str' object has no attribute 'columns'

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
Input the number of rows you'd like to sample:

In [10]:
review_var_list = list(review_variables.value)
    
master_data['custom_table'] = master_data[dict_keys.value][review_var_list]

head_number = widgets.BoundedIntText(
    value=5,
    min=1,
    max=50,
    step=1,
    description='Rows:',
    disabled=False
)

def sample_view(head_number):
    sample = master_data['custom_table'].head(head_number)
    print(sample)

out = widgets.interactive_output(sample_view, {'head_number':head_number})

widgets.VBox([widgets.VBox([head_number]), out])

NameError: name 'review_variables' is not defined

## 2.0 - Train Models

The following steps walk you through selecting a target variable, starting an H2O instance, configuring and automatically training models, and evaluating the results to identify the top performer.

## 2.1 - Select Your Target Variable

Your "Target" variable represents the thing you are attempting to predict. It should be either "categorical" (ex. text, labels) or "continuous" (ex. numeric values) in nature. The target and its type determine which algorithms are used and which evaluation metrics are useful in assessing each model's performance.
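
If you are unsure which type applies, a rough rule of thumb can help. The sketch below is only a heuristic and not part of the workflow: `guess_target_type` and its `max_levels` threshold are assumptions made for this example. It treats non-numeric columns, and numeric columns with only a few distinct values, as categorical.

```python
import pandas as pd


def guess_target_type(series, max_levels=10):
    """Heuristic guess: 'Categorical' or 'Continuous' (threshold is an assumption)."""
    if not pd.api.types.is_numeric_dtype(series):
        return 'Categorical'
    # Numeric codes with few distinct values (ex. 0/1 flags) usually act as labels
    if series.nunique(dropna=True) <= max_levels:
        return 'Categorical'
    return 'Continuous'


print(guess_target_type(pd.Series(['yes', 'no', 'yes'])))
print(guess_target_type(pd.Series([0, 1, 1, 0])))
print(guess_target_type(pd.Series([x * 1.5 for x in range(50)])))
```

The final decision remains yours: a numeric rating on a 1 to 5 scale, for instance, could reasonably be modeled either way.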

Select your Target variable and note whether or not it is a categorical or continuous data type:

In [11]:
target = widgets.Select(
    options=master_data['custom_table'].columns.tolist(),
    description='Target',
    disabled=False
)

target_type = widgets.Select(
    options=['Continuous','Categorical'],
    description='Type',
    disabled=False,
)

display(target)
display(target_type)

KeyError: 'custom_table'

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

## 2.2 - Initiate H2O

The code cell below will terminate any existing h2o instances and create a new instance

In [12]:
try:
    h2o.cluster().shutdown()
    h2o.init()
except:
    h2o.init()

NameError: name 'h2o' is not defined

## IMPORTANT: If you are rerunning this workflow and have not "restarted your kernel" you will need to run the "cell" above up to three times to clear the instances.

**DO NOT PROCEED UNTIL** the above cell contains the following text (which will be visible just above a summary table):

Connecting to H2O server at http://127.0.0.1:54321 ... successful.

Load your dataset into h2o by running the command below

In [13]:
df = master_data['custom_table'].dropna(subset=[target.value])
df = h2o.H2OFrame(df)

KeyError: 'custom_table'

Once the "Parse progress:" reaches 100% above, confirm that your dataset has been loaded correctly by reviewing the table below

In [14]:
df.describe()

NameError: name 'df' is not defined

If your "Target" variable is categorical, the code below will convert it to a factor before modeling

In [15]:
if target_type.value == 'Categorical':
    df[target.value] = df[target.value].asfactor()

NameError: name 'target_type' is not defined

## 2.3 - Configure Models

The parameters below can be left in their default settings or modified to meet your specific requirements.

In [16]:
sample_type = widgets.RadioButtons(
    options=['Train | Test', 'Cross Validation', 'Use Data Field'],
    description='Sampling:',
    disabled=False
)

train_perc = widgets.FloatSlider(
    value=.70,
    min=.50,
    max=.80,
    step=.01,
    description='Train %:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='.2f'
)

nfolds = widgets.IntSlider(
    value=3,
    min=3,
    max=10,
    step=1,
    description='CV Folds:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

project_name = widgets.Text(
    value='My Project Name',
#     value='Crash_Qtr02_2018.xlsx'
    placeholder='My Project Name',
    description='Project:',
    disabled=False,
    layout=Layout(width='50%')
)

include_algos = widgets.SelectMultiple(
    options=['DRF','GLM','XGBoost','GBM','DeepLearning','StackedEnsemble'],
    description='Include:',
    disabled=False,
)

specify_algos = widgets.RadioButtons(
    options=["Use defaults" , 'Select from list'],
    description='Algorithms:',
    disabled=False
)

run_time_mins = widgets.IntSlider(
    value=5,
    min=1,
    max=60,
    step=1,
    description='Run (min):',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

def algo_select(x):
    if(x=='Select from list'):
        display(include_algos)
    else:
        print('Using all available algorithms')

def sampling(x):
    if(x=='Train | Test'):
        display(train_perc)
    elif (x=='Cross Validation'):
        display(nfolds)
        print('Train %: Not needed for Cross Validation')
    else:
        display(sample_fields)
        print('Sampling field to be selected in the next cell')

        
sample_fields = widgets.Select(
    options=master_data['custom_table'].columns.tolist(),
    description='Sample on:',
    disabled=False
)

def sample_labels(x):
    train_label = widgets.Select(
        options=master_data['custom_table'][x].unique().tolist(),
        description='Train Label:',
        disabled=False
    )

    display(train_label)

f = widgets.interactive(sampling, x=sample_type)
g = widgets.interactive(algo_select, x=specify_algos)

display(project_name,f,g,run_time_mins)
print('The Run (min) value will determine how long the process builds models')

KeyError: 'custom_table'

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

If you have already included a field to separate your data into a training and testing set, please select which label in the field should be used as your training data.  All other labels will be assigned to the testing set.
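
The same rule, where one label marks the training rows and everything else becomes test data, can be sketched in plain pandas. This is an analogue for illustration only, since the workflow itself applies the split to an H2OFrame. `split_on_field`, the column names, and the sample records are all hypothetical.

```python
import pandas as pd


def split_on_field(frame, field, train_label):
    """Rows matching train_label form the training set; all other rows form the test set."""
    is_train = frame[field] == train_label
    # The sampling field is dropped so it is not used as a predictor
    train = frame[is_train].drop(columns=field)
    test = frame[~is_train].drop(columns=field)
    return train, test


records = pd.DataFrame({
    'tenure': [1, 4, 7, 2],
    'left': ['no', 'yes', 'no', 'no'],
    'partition': ['train', 'train', 'holdout', 'validate'],
})

train, test = split_on_field(records, 'partition', 'train')
print(len(train), len(test))
```

Both the "holdout" and "validate" rows end up in the test set here, because only the selected label is treated as training data.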

In [17]:
def sample_by_field(Method):
    if(Method=='Use Data Field'):
        display(sample_field_label_value)
    else:
        print('Using h2o.ai sampling procedure')
       
sample_field_label_value = widgets.Select(
    options=master_data['custom_table'][sample_fields.value].unique().tolist(),
    description='Train on:',
    disabled=False
)

print("If you have already created a sampling field in your dataset, please select the label that designates your training data")
b = widgets.interactive(sample_by_field, Method=sample_type.value)

display(b)


KeyError: 'custom_table'

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
The code below will compile the settings you have selected above into the H2OAutoML function

In [18]:
if sample_type.value == 'Train | Test':
    cv_folds = 0
    splits = df.split_frame(ratios = [train_perc.value], seed = 1)
    train = splits[0]
    test = splits[1]
elif sample_type.value == 'Use Data Field':
    cv_folds = 0
    train = df[df[sample_fields.value] == sample_field_label_value.value]
    train = train.drop(sample_fields.value)
    test = df[df[sample_fields.value] != sample_field_label_value.value]
    test = test.drop(sample_fields.value)
else:
    cv_folds = nfolds.value

min_to_secs = run_time_mins.value*60    

algo_list = [x for x in include_algos.value]

if specify_algos.value == 'Use defaults':
    aml = H2OAutoML(max_runtime_secs = min_to_secs, seed = 1, project_name = project_name.value, nfolds=cv_folds)
else:
    aml = H2OAutoML(max_runtime_secs = min_to_secs, seed = 1, project_name = project_name.value, include_algos = algo_list, nfolds=cv_folds)


NameError: name 'df' is not defined
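
The mapping from widget selections to `H2OAutoML` arguments in the cell above can be sketched as a plain function that needs no running H2O instance. `build_automl_kwargs` is a hypothetical helper; the keyword names it emits are the ones already used in the cell above. Unlike that cell, this sketch omits `include_algos` when the list of algorithms is empty, which is an assumption made here.

```python
def build_automl_kwargs(sample_type, run_minutes, project_name, cv_folds=3, algos=None):
    """Translate the configuration choices into keyword arguments for H2OAutoML."""
    kwargs = {
        'max_runtime_secs': run_minutes * 60,  # the slider is in minutes
        'seed': 1,
        'project_name': project_name,
        # Folds are only used when cross validation was selected
        'nfolds': cv_folds if sample_type == 'Cross Validation' else 0,
    }
    if algos:
        kwargs['include_algos'] = list(algos)
    return kwargs


print(build_automl_kwargs('Train | Test', 5, 'attrition'))
print(build_automl_kwargs('Cross Validation', 10, 'attrition', cv_folds=5, algos=('GBM', 'GLM')))
```

The resulting dictionary could then be unpacked as `H2OAutoML(**kwargs)`, which keeps the two branches of the original cell in one place.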

## 2.4 - Automatically Train Models

The code below will automatically generate as many models as possible within the time you have permitted.  If you have selected "Cross Validation" as your "Sampling" method above, note that the entire dataset will be used in modeling by leveraging the "CV Folds" (Cross Validation Folds) you have entered.

**NOTE:** This process will run for the number of minutes you have selected above and may provide different results if run iteratively based on the time it takes for specific algorithms to train.

In [19]:
if sample_type.value == 'Cross Validation':
    aml.train(y = target.value, training_frame = df)
else:
    aml.train(y = target.value, training_frame = train, validation_frame = test, leaderboard_frame = test)

NameError: name 'aml' is not defined

Once the cell above shows 100%, executing the code cell below will show the "leaderboard", which ranks the trained models by the default evaluation metric for your problem type.

**Ranking:** The ranking defaults to AUC for binary classification, mean_per_class_error for multinomial classification, and deviance for regression.

In [20]:
aml.leaderboard

NameError: name 'aml' is not defined

The following code returns a summary of the top performing model (per the leaderboard above) and how it performed on either the full dataset (if Cross Validation was selected) or the test dataset (if a training set was used during modeling)

In [21]:
if sample_type.value == 'Cross Validation':
    eval_results = aml.leader.model_performance()
else:
    eval_results = aml.leader.model_performance(test)
    
print(eval_results)

NameError: name 'aml' is not defined

If you would like to investigate the parameters that were used by the model or would like to review any other method associated with the model, please select from the list below.

**NOTE:** Not all methods may be relevant to all algorithms. If you would like to print a plot of your important variables, select the "varimp_plot" if available in the list.  You can also print the standardized coefficients using the "std_coef_plot" method if listed.

In [22]:
all_methods = dir(aml.leader)
keep_methods = [x for x in all_methods if not x.startswith('_')]

methods = widgets.Select(
    options=keep_methods,
    description='Method:',
    disabled=False
)

display(methods)

NameError: name 'aml' is not defined

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
After making your selection above, the results will display as appropriate in the cell below:

In [23]:
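# Look the selection up with getattr rather than eval: call it if possible, otherwise print the attribute itself
selected = getattr(aml.leader, methods.value)
try:
    print('{}: {}'.format(methods.value, selected()))
except Exception:
    print('{}: {}'.format(methods.value, selected))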

NameError: name 'methods' is not defined
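
Because the list above mixes callable methods with plain attributes, it can help to separate the two before choosing one. The sketch below does this for any Python object; `public_members` and `DemoModel` are hypothetical stand-ins used so the example runs without a trained model.

```python
def public_members(obj):
    """Split an object's public names into callable methods and plain attributes."""
    methods, attributes = [], []
    for name in dir(obj):
        if name.startswith('_'):
            continue
        try:
            member = getattr(obj, name)
        except Exception:
            continue  # some properties raise when accessed; skip those
        (methods if callable(member) else attributes).append(name)
    return methods, attributes


class DemoModel:
    """Stand-in for a trained model, for illustration only."""
    algo = 'gbm'

    def varimp(self):
        return [('tenure', 0.6), ('salary', 0.4)]


methods_found, attributes_found = public_members(DemoModel())
print(methods_found)      # ['varimp']
print(attributes_found)   # ['algo']
```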

### Score your existing dataset with the top model

The code below will apply predicted-values to your full dataset should you wish to evaluate the results against the historical results.

**Note:** This is different than if you want to apply your model against new data (which we will cover in the section to follow).

In [24]:
prediction_frame = aml.leader.predict(df)
full_data_with_predictions = df.cbind(prediction_frame)
full_data_with_predictions

NameError: name 'aml' is not defined

## 3.0 - Make Predictions & Export Results

The following process will walk you through uploading another dataset to score against your top model.

**Note:** The new dataset must contain the same fields that were used to train your models in the prior steps.  Its structure does not otherwise have to match the training data (ex. there is no need to align the column order or remove extra fields).
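
A quick way to confirm this requirement before scoring is to compare column names. The sketch below is a minimal check, assuming you have both lists of column names at hand; `missing_fields` and the sample names are hypothetical.

```python
def missing_fields(training_columns, scoring_columns):
    """Return the training fields that are absent from the scoring dataset."""
    available = set(scoring_columns)
    return [col for col in training_columns if col not in available]


trained_on = ['age', 'tenure', 'salary']
to_score = ['salary', 'employee_id', 'age']  # different order, one extra field

print(missing_fields(trained_on, to_score))  # ['tenure']
```

An empty list means every training field is present; column order and extra fields such as `employee_id` do not affect the result.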

## 3.1 - Upload Your Data (Excel and CSV files)

Follow the same process you used in the previous steps to upload the dataset you would like to apply against your trained model.  This is the dataset with "unknown" values which your trained models will attempt to predict.

In [25]:
upload_type = widgets.RadioButtons(
    options=['local', 'url'],
    description='File Location:',
    disabled=False
)

upload_url = widgets.Text(
#    value='https://gist.githubusercontent.com/bradblanch/4d9591ed64ad0e894668319529f51504/raw/1e7adf768d09724727a4e78bbef6968ed2fa695c/artificial_employee_records.csv',
    value='https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv',
    placeholder='http://',
    description='URL:',
    disabled=False,
    layout=Layout(width='80%')
)
upload_filename = widgets.Text(
    value='artificial_employee_records.csv',
#     value='artificial_employee_records.xlsx'
    placeholder='Sample File.csv',
    description='File Name:',
    disabled=False,
    layout=Layout(width='50%')
)

file_type = widgets.RadioButtons(
    options=['csv', 'xlsx'],
    description='File Type:',
    disabled=False
)

tab_names = widgets.Text(
    value='Sheet1, Sheet2, Sheet3, etc',
    placeholder='ALL EMPLOYEES, PAST EMPLOYEES',
    description='Tab(s):',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder_name = widgets.Text(
    value='source_data',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

subfolder = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

skip_rows = widgets.IntSlider(
    value=0,
    min=0,
    max=10,
    step=1,
    description='Skip Rows:',
    disabled=False,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    readout_format='d'
)

delimiter = widgets.RadioButtons(
    options=[',',';'],
    value=',',
    description='Delimiter:',
    disabled=False
)

def text_field(x):
    if(x=='xlsx'):
        display(tab_names)
        # tab_names.value is read directly in a later cell, so no submit callback is registered here
        display(skip_rows)
    else:
        display(delimiter)
        print('Tab Names: Not needed for csv files')

def sub_folder(y):
    if(y=='yes'):
        display(subfolder_name)
        subfolder_name.on_submit(subfolder_name)
    else:
        print('Using {} folder'.format(os.getcwd()))

def file_location(z):
    if(z=='local'):
        display(upload_filename)
        i = widgets.interactive(text_field, x=file_type)
        display(i)
        p = widgets.interactive(sub_folder, y=subfolder)
        display(p)
    else:
        display(upload_url)

q = widgets.interactive(file_location, z=upload_type)

display(q)

interactive(children=(RadioButtons(description='File Location:', options=('local', 'url'), value='local'), Out…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
The following code cell will attempt to ingest the data you've selected in the widgets above:

**Note About xlsx Files** - Depending on the number of tabs and the size of the data on each tab, ingesting an xlsx file can take several minutes to execute.  If possible, it may be more efficient to break your Excel file into separate csv files which take only a fraction of a second to ingest.
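
If you do convert to csv and are unsure whether to pick `,` or `;` in the Delimiter widget above, the standard library can make a guess from a small sample of the file. The following is only a sketch: `guess_delimiter` is a hypothetical helper that is not part of this notebook, and the sample text is made up for illustration.

```python
import csv

def guess_delimiter(sample_text, candidates=",;"):
    """Return the candidate character that most plausibly separates the columns."""
    # csv.Sniffer inspects the sample and raises csv.Error if it cannot decide.
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Hypothetical sample; in practice, read the first few lines of your own file instead.
sample = "name;age;city\nAnn;34;Oslo\nBob;41;Rome\nCid;29;Lima\n"
print(guess_delimiter(sample))
```

The guess is a heuristic, so it is worth opening the file and confirming the result before relying on it.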

In [26]:
new_data = {}

if upload_type.value == 'url':
    url_response = requests.get(upload_url.value)
    url_response.raise_for_status()  # stop early if the download failed (e.g. a 404 response)
    new_data['url_data'] = pd.read_csv(io.BytesIO(url_response.content))
else:
    if file_type.value == 'csv':
        tabs = None
        skiprows = 0
    else:
        tabs = [x.strip() for x in tab_names.value.split(',')]
        skiprows = skip_rows.value

    # Store the path under a separate name so the `subfolder` widget is not overwritten
    if subfolder.value == 'yes':
        subfolder_path = subfolder_name.value
    else:
        subfolder_path = None
    new_data = compile_raw_data(upload_filename.value, tabs, subfolder_path, delimiter_char=delimiter.value, skip_rows=skiprows, file_ext=file_type.value)


**Note:** If you see an AttributeError: 'NoneType' object has no attribute 'value' message above, simply rerun the last two code cells to reset the input parameters.
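
An error of this kind generally means that a name which originally referred to a widget has since been rebound to a plain value such as `None`, so a later run can no longer read `.value` from it. Rerunning the widget cell recreates the widget and restores the name. A minimal sketch of the mechanism, using `SimpleNamespace` as a stand-in for a widget (the names here are illustrative only):

```python
from types import SimpleNamespace

# Stand-in for an ipywidgets control: any object exposing a .value attribute.
subfolder = SimpleNamespace(value='no')

def read_choice(widget):
    """Return the widget's current selection."""
    return widget.value

first_run = read_choice(subfolder)   # works: the name still refers to the widget

# Rebinding the same name to a plain value discards the widget reference.
subfolder = None

try:
    read_choice(subfolder)           # second run: None has no .value attribute
except AttributeError as err:
    message = str(err)

print(first_run)
print(message)
```

Keeping widget names and derived values under different variable names avoids the problem altogether.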

The following code cell will print out the attributes associated with the files you've uploaded and alert you of any errors:

In [27]:
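for key, value in new_data.items():
    try:
        print('{} table was ingested with {} rows and {} columns'.format(key, value.shape[0], value.shape[1]))
    except Exception:
        # Anything that is not a two-dimensional table (e.g. None) is reported as a failed load
        print('{} table failed to load'.format(key))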

artificial_employee_records table was ingested with 900 rows and 19 columns


## 3.2 - Select a Data Frame to be Scored

The following menus will allow you to select the dataset you would like to score against your trained model.  This dataset should contain the fields you used to train the models in prior steps, but it does not have to have the same structure (e.g. there is no need to remove unused columns or align column locations).
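
Before scoring, it can help to confirm that every field used in training is present in the new dataset. The sketch below is not part of the original workflow: `missing_columns` is a hypothetical helper and the column names are invented for illustration.

```python
def missing_columns(required, available):
    """Return the required column names that are absent, in their original order."""
    available_set = set(available)
    return [col for col in required if col not in available_set]

# Hypothetical column names, for illustration only.
training_columns = ['Age', 'Department', 'Salary']
scoring_columns = ['Salary', 'EmployeeID', 'Age', 'HireDate']

print(missing_columns(training_columns, scoring_columns))  # ['Department']
```

Extra columns and a different column order are ignored, which matches the requirement described above. With a pandas dataframe `df`, `df.columns` can be passed as `available`.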

Select an available frame from the list below:

In [28]:
dict_keys = widgets.Select(
    options=new_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

Select(description='Tables:', layout=Layout(width='50%'), options=('artificial_employee_records',), value='art…

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed

In [29]:
score_data = h2o.H2OFrame(new_data[dict_keys.value])


## 3.3 - Predict the Target Value

The code below will attempt to apply your trained model against this new dataset and predict the target value.  The result will be a new dataset with your original data combined with the predicted results.

In [30]:
prediction_frame = aml.leader.predict(score_data)
full_data_with_predictions = score_data.cbind(prediction_frame)
full_data_with_predictions


In [31]:
new_data['scored_data'] = full_data_with_predictions.as_data_frame(use_pandas=True)


## 3.4 - Export Dataframes for Offline Analysis

The following code block will allow you to select and export dataframes to a local directory.  Use the inputs below to write the files to your current directory and to apply a timestamp to the filenames to prevent the risk of overwriting prior files saved to that folder.

In [32]:
def dict_to_csv(dict_name, dframe, subfolder, timestamp=False):

    # Inputs: dict_name = a dictionary of dataframes; dframe = the key of the dataframe to export;
    #         subfolder = an existing subfolder name (or None to use the current directory);
    #         timestamp = True adds an ISO-formatted suffix to the filename
    # Description: Writes one dataframe from the dictionary to CSV (in your directory)

    # Colons are stripped from the timestamp because they are not valid in Windows filenames
    suffix = '_' + dt.datetime.now().isoformat().replace(':', '') + '.csv' if timestamp else '.csv'
    file_name = dframe + suffix
    file_path = os.path.join(subfolder, file_name) if subfolder else file_name

    try:
        dict_name[dframe].to_csv(file_path, sep=',', encoding='utf-8', index=False)
        # Report the shape of the frame that was written, taken from the dictionary passed in
        print('Successfully wrote {} with {} rows and {} columns to the directory'.format(file_name, dict_name[dframe].shape[0], dict_name[dframe].shape[1]))
    except Exception as e:
        print('Writing the data to the directory failed: {}'.format(e))
        

Select one or more available dataframes, then choose whether to save the files to the current working directory or to a subfolder within it.  Lastly, select Timestamp = 'yes' if you would like a timestamp added to your exported filenames, which prevents overwriting prior files saved to the folder.
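
To illustrate how a timestamp keeps exported filenames unique, here is a small standalone sketch. `timestamped_name` is a hypothetical helper written for this example; it follows the same idea of appending an ISO-formatted timestamp with the colons removed.

```python
import datetime as dt

def timestamped_name(base_name, now=None):
    """Build '<base_name>_<ISO timestamp without colons>.csv'."""
    now = now or dt.datetime.now()
    # Colons are removed because they are not allowed in Windows filenames.
    stamp = now.isoformat().replace(':', '')
    return '{}_{}.csv'.format(base_name, stamp)

# A fixed moment is used here so the result is reproducible.
fixed = dt.datetime(2019, 6, 4, 12, 30, 15)
print(timestamped_name('scored_data', fixed))  # scored_data_2019-06-04T123015.csv
```

When called without a fixed time, the suffix also includes microseconds, so two exports made moments apart still receive different names.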

## The scored_data table in the widget below contains both the raw data you uploaded along with the predicted values resulting from your model

In [33]:
dict_keys = widgets.SelectMultiple(
    options=new_data.keys(),
    description='Tables:',
    disabled=False,
    layout=Layout(width='50%')
)

display(dict_keys)

subfolder_option = widgets.RadioButtons(
    options=['no','yes'],
    value='no',
    description='Subfolder:',
    disabled=False
)

timestamp_option = widgets.RadioButtons(
    options=['no','yes'],
    value='yes',
    description='Timestamp:',
    disabled=False
)

subfolder_text = widgets.Text(
    value='output',
    placeholder='Subfolder name',
    description='Subfolder:',
    disabled=False,
    layout=Layout(width='50%')
)

def sub_folder_edit(y):
    # The subfolder name is read from subfolder_text in the export cell, so no submit handler is registered
    if y == 'yes':
        display(subfolder_text)
        print('Your file(s) will be written to the subfolder in {}'.format(os.getcwd()+os.sep+subfolder_text.value))
    else:
        print('Using {} folder'.format(os.getcwd()))
        
y = widgets.interactive(sub_folder_edit, y=subfolder_option)

display(y, timestamp_option)

SelectMultiple(description='Tables:', layout=Layout(width='50%'), options=('artificial_employee_records',), va…

interactive(children=(RadioButtons(description='Subfolder:', options=('no', 'yes'), value='no'), Output()), _d…

RadioButtons(description='Timestamp:', index=1, options=('no', 'yes'), value='yes')

## Reminder: Do not rerun the cell above after applying your inputs

### Click on and Run this cell to proceed
Execute the code cell below to export the csv files to your chosen directory.

**NOTE:** If you have chosen to write your files to a subfolder, make sure that folder already exists in your working directory.  The function below will **not** create the subfolder for you.
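
If the subfolder does not exist yet, it can be created beforehand with the standard library. This is a sketch only: `ensure_subfolder` is a hypothetical helper, and a temporary directory stands in for your working directory.

```python
import os
import tempfile

def ensure_subfolder(parent_dir, subfolder_name):
    """Create parent_dir/subfolder_name if needed and return its path."""
    path = os.path.join(parent_dir, subfolder_name)
    # exist_ok=True makes the call safe to repeat when the folder is already there.
    os.makedirs(path, exist_ok=True)
    return path

# A temporary directory is used here so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as parent:
    created = ensure_subfolder(parent, 'output')
    print(os.path.isdir(created))
    again = ensure_subfolder(parent, 'output')
    print(created == again)
```

In the notebook, the equivalent call would pass `os.getcwd()` and the subfolder name entered in the widget above.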

In [34]:
# Store the target folder under its own name so the `subfolder` widget from the upload step is not overwritten
if subfolder_option.value == 'yes':
    export_subfolder = subfolder_text.value
else:
    export_subfolder = None

# Collect the selected table names and convert the timestamp choice to a boolean
dframe_list = list(dict_keys.value)
timestamp_boolean = timestamp_option.value == 'yes'

for df in dframe_list:
    dict_to_csv(new_data, df, export_subfolder, timestamp=timestamp_boolean)

## Extract & Save Your Model

Executing the cell below will save a copy of your leading h2o model and then shut down the h2o cluster:

In [35]:
model_path = h2o.save_model(model=aml.leader, force=True)
print('The model has been successfully saved in the following directory: {}'.format(model_path))
h2o.cluster().shutdown()


#### If you need any support,  please feel free to contact me at bradley.a.blanchard@pwc.com