## License 

<span style="color:gray"> Copyright 2019 David Whiting and the H2O.ai team

<span style="color:gray"> Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

<span style="color:gray">     http://www.apache.org/licenses/LICENSE-2.0

<span style="color:gray"> Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

<span style="color:gray"> **DISCLAIMER:** This notebook is not legal compliance advice. </span>

<hr style="background-color: gray;height: 2.0px;"/>

<img src="./img/h2o_banner2.png">

# Introduction to H2O-3: Lesson 2

This is the second in a series of instructional Jupyter notebooks on H2O-3. These notebooks are built to be run on the H2O.ai Aquarium training platform [http://aquarium.h2o.ai](http://aquarium.h2o.ai) under the `Coursework` lab. There is an accompanying instructional video with additional commentary found <span style="color:red">**_here_** _(link to be added)_.</span>

<div style="margin-left: 3em;">

### Intended Audience

The target audience for this training notebook is data scientists, machine learning engineers, and other experienced modelers. Technically advanced analysts may also find this training understandable.

### Prerequisites
Successful completion of <span style="color:blue">Introduction to H2O-3: Lesson 1</span>

In addition, a working knowledge of Python and previous experience building statistical or machine learning models is assumed.

### Learning Outcomes

By the end of this notebook, you will be able to ...

<ul style="list-style: none;">
    <li><input type="checkbox" disabled ><span style="color:black"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled ><span style="color:black">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled ><span style="color:black">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</div>

<hr style="background-color: black;height: 2.0px;"/>

# About H2O-3

H2O is an open source, in-memory, distributed, fast, and scalable machine learning and predictive analytics platform that allows you to build machine learning models on big data and provides easy productionalization of those models in an enterprise environment.

# Step 1. Start the H2O-3 cluster

<div class="alert alert-block alert-info"><span style="color:black">
    
The `os` commands below check whether this notebook is being run on the Aquarium platform. We use the `h2o.init` command to connect to the H2O-3 cluster, starting it if it is not already up. (The parameters used in `h2o.init` will depend on your specific environment.)
</span></div>

In [None]:
import os
import h2o

startup = '/home/h2o/bin/aquarium_startup'
if os.path.exists(startup):
    os.system(startup)
    local_url = 'http://localhost:54321/h2o'
    aquarium = True
else:
    local_url = 'http://localhost:54321'
    aquarium = False

h2o.init(url=local_url)

Note: The method you use for starting and stopping an H2O-3 cluster will depend on how H2O is installed and configured on your system. Regardless of how H2O is installed, if you start a cluster, you will need to ensure that it is shut down when you are done.
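
One way to make sure the cluster is shut down when you are done is to wrap the session in a context manager. The sketch below is an illustration only: `h2o_session` is a hypothetical helper name, not part of the H2O API, and it assumes you want the cluster stopped even when an error occurs. It uses only `h2o.init` and `h2o.cluster().shutdown()`, both of which appear in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def h2o_session(url=None):
    """Connect to (or start) H2O-3 and shut the cluster down on exit.

    Hypothetical helper for illustration; not part of the h2o package.
    """
    import h2o  # imported lazily so the helper can be defined anywhere

    h2o.init(url=url)
    try:
        yield h2o
    finally:
        # Runs on normal exit and on exceptions alike
        h2o.cluster().shutdown()


# Example usage (commented out so that no cluster is started here):
# with h2o_session(url="http://localhost:54321") as h2o:
#     frame = h2o.import_file("some_file.csv")  # hypothetical file name
```

This notebook keeps the explicit start and stop cells instead, because the cluster needs to stay up across many cells.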

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>


# Step 2. Import data

On Aquarium, the data set we use below is a local copy of https://s3-us-west-2.amazonaws.com/h2o-tutorials/data/topics/lending/lending_club/LoanStats3a.csv. Elsewhere, it is read directly from that URL.

In [None]:
if aquarium:
    input_csv = "/home/h2o/data/lending_club/LoanStats3a.csv"
else:
    input_csv = "https://s3-us-west-2.amazonaws.com/h2o-tutorials/data/topics/lending/lending_club/LoanStats3a.csv"

<div class="alert alert-block alert-info"><span style="color:black">
Besides delimited files (CSV and gzipped CSV), H2O-3 currently supports the following file types:

- ORC
- SVMLight
- ARFF
- XLS (BIFF 8 only)
- XLSX (BIFF 8 only)
- Avro version 1.8.0 (without multifile parsing or column type modification)
- Parquet
</span></div>

The loans data set is loaded directly into the H2O-3 cluster using the `h2o.import_file` command shown below:

In [None]:
loans = h2o.import_file(input_csv,
                        col_types = {"int_rate":"string", 
                                     "revol_util":"string", 
                                     "emp_length":"string", 
                                     "verification_status":"string"})

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>

## Details: How H2O file import works

The `h2o.import_file` command gives the H2O-3 cluster the location of the data file. The actual data never go into or through Python.

**Step 1**: The user passes the file location to the H2O server via a Python function call.

<img src="./img/h2o_read_step1.png" style="height:250px">

**Step 2**: H2O initiates the distributed data ingest from the filesystem.

<img src="./img/h2o_read_step2.png" style="height:320px">

**Step 3**: H2O then performs a parallel upload of the data directly into the H2O-3 cluster's memory. The H2O Frame object is a proxy for the big data in H2O. 

<img src="./img/h2o_read_step3.png" style="height:320px">

This data upload happens in parallel and is extremely efficient.

## Inspect the Data with H2O Flow

Now is a good time to connect to H2O Flow. Although H2O Flow can be used for everything from loading data to building models to creating production code, we use it here for data investigation and H2O system monitoring.

<div class="alert alert-block alert-info"><span style="color:black">

**Note**: The address reported above, `http://localhost`, is local to your particular cloud instance. To open H2O Flow in your own browser in Aquarium, copy the browser URL and replace
`http://{your_URL}/jupyter/` with `http://{your_URL}/h2o/`.

More generally, Jupyter notebooks are found on port 8888 (`http://{your_URL}:8888`) by default. H2O Flow can be accessed by replacing 8888 with 54321: `http://{your_URL}:54321`.
</span></div>
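
The URL rewriting described in the note can be sketched in plain Python. The function name `flow_url_from_jupyter` and the example host are hypothetical; the sketch assumes the Aquarium layout (`/jupyter/` and `/h2o/` paths) and the default ports 8888 and 54321 mentioned above.

```python
from urllib.parse import urlsplit


def flow_url_from_jupyter(jupyter_url, aquarium=True):
    """Derive the H2O Flow URL from the Jupyter URL in the browser.

    Hypothetical helper: assumes the path and port conventions
    described in the note above.
    """
    parts = urlsplit(jupyter_url)
    if aquarium:
        # Aquarium: same host, /jupyter/... becomes /h2o/
        return f"{parts.scheme}://{parts.netloc}/h2o/"
    # Default setup: same host, Flow listens on port 54321
    return f"{parts.scheme}://{parts.hostname}:54321"


print(flow_url_from_jupyter("http://example.com/jupyter/notebooks/lesson2.ipynb"))
print(flow_url_from_jupyter("http://example.com:8888/tree", aquarium=False))
```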

Use the `List All Frames` command from the `Data` menu, or select `getFrames` from the Assistance options, to list the frames.

<img src="./img/flow_list_frames.png" style="width:600px">

The `LoanStats3a.hex` file is the in-memory H2OFrame.

<img src="./img/loans_hex.png" style="width:600px">

Clicking on the `LoanStats3a.hex` link will display column summaries for the data. You can also click `View Data` to visually inspect the data values themselves.

**Within Python** we can get similar summary information with the following commands:

In [None]:
loans.dim

In [None]:
loans.head()

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>


# Step 3.  Clean data (data munging)

## Part 1. Defining the problem and creating the response variable

The total number of loans in our data set is

In [None]:
num_unfiltered_loans = loans.dim[0]
num_unfiltered_loans

Because we are interested in loan default, we need to look at the `loan_status` column.

In [None]:
loans["loan_status"].table().head(20)

Like many real data sources, `loan_status` is messy and contains multiple, somewhat overlapping, categories. Before modeling, we will need to clean this up by (a) removing loans that are still ongoing, and (b) simplifying the response column.

### (a) Filter Loans

To build a valid model, we have to remove loans that are still in process. These loans have a `loan_status` such as "Current" or "In Grace Period":

In [None]:
ongoing_status = ["Current",
                  "In Grace Period",
                  "Late (16-30 days)"
                 ]

<div class="alert alert-block alert-success"><span style="color:black">
 
**YOUR TURN**: Use the empty cell below to complete the `ongoing_status` object.
</span></div>

In [None]:
# ongoing status


<div class="alert alert-block alert-success"><span style="color:black">
You can compare your answer to what we got below:
    </span></div>

In [None]:
ongoing_status = ["Current",
                  "In Grace Period",
                  "Late (16-30 days)",
                  "Late (31-120 days)",
                  "Does not meet the credit policy.  Status:Current",
                  "Does not meet the credit policy.  Status:In Grace Period"
                 ]

Now we can use the following code to filter out loans that are ongoing:

In [None]:
loans = loans[~loans["loan_status"].isin(ongoing_status)]

After filtering out these loans, we have

In [None]:
num_filtered_loans = loans.dim[0]
num_filtered_loans

loans whose final state is known, which means we filtered out

In [None]:
num_loans_filtered_out = num_unfiltered_loans - num_filtered_loans
num_loans_filtered_out

loans. These loans are now summarized by `loan_status` as

In [None]:
loans["loan_status"].table().head(20)

### (b) Create Response Column

Let's name our response column `bad_loan`, which will equal one if the loan was not completely paid off.

In [None]:
fully_paid = ["Fully Paid",
              "Does not meet the credit policy.  Status:Fully Paid"
             ]
loans["bad_loan"] = ~(loans["loan_status"].isin(fully_paid))

Next, make the `bad_loan` column a factor so that we can build a classification model:

In [None]:
loans["bad_loan"] = loans["bad_loan"].asfactor()

The percentage of bad loans is given by

In [None]:
bad_loan_dist = loans["bad_loan"].table()
bad_loan_dist["Percentage"] = (100 * bad_loan_dist["Count"] / loans.nrow).round()
bad_loan_dist
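
The filtering and labeling logic above can be checked on a handful of values in plain Python. This is only a sketch of the logic, not how H2O executes it: the sample statuses are made up for illustration ("Charged Off" is an assumed example of a finished, unpaid loan), and only a subset of the ongoing statuses is listed.

```python
from collections import Counter

# Subsets of the status lists used above
ONGOING = {"Current", "In Grace Period", "Late (16-30 days)", "Late (31-120 days)"}
FULLY_PAID = {"Fully Paid", "Does not meet the credit policy.  Status:Fully Paid"}


def label_loans(statuses):
    """Drop ongoing loans, then flag every loan not fully paid as 1."""
    finished = [s for s in statuses if s not in ONGOING]
    return [0 if s in FULLY_PAID else 1 for s in finished]


def percentages(labels):
    """Rounded percentage of each label value."""
    counts = Counter(labels)
    return {k: round(100 * c / len(labels)) for k, c in sorted(counts.items())}


# Made-up sample: three paid loans, one unpaid, one still ongoing
sample = ["Fully Paid", "Fully Paid", "Fully Paid", "Charged Off", "Current"]
labels = label_loans(sample)
print(labels)
print(percentages(labels))
```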

## Part 2. Convert strings to numeric

Consider the columns `int_rate`, `revol_util`, and `emp_length`:

In [None]:
loans[["int_rate", "revol_util", "emp_length"]].head()

Both `int_rate` and `revol_util` are inherently numeric but entered as percentages. Since they include a "%" sign, they are read in as strings. The solution for both of these columns is simple: strip the "%" sign and convert the strings to numeric.

The `emp_length` column is only slightly more complex. Besides removing the "year" or "years" term, we have to deal with `< 1` and `10+`, which aren't directly numeric. If we define `< 1` as 0 and `10+` as 10, then `emp_length` can also be cast as numeric.

We demonstrate the steps for converting these string variables into numeric values below.

### Convert `int_rate`

In [None]:
loans["int_rate"] = loans["int_rate"].gsub(pattern = "%", replacement = "") # strip %
loans["int_rate"] = loans["int_rate"].trim() # trim whitespace
loans["int_rate"] = loans["int_rate"].asnumeric() # change to numeric 

### Convert `revol_util`

<div class="alert alert-block alert-success"><span style="color:black">
 
**YOUR TURN**: Use the empty cell below to convert the `revol_util` variable to numeric.
</span></div>

In [None]:
# revol_util


<div class="alert alert-block alert-success"><span style="color:black">
Check your answer below:
    </span></div>

In [None]:
loans["revol_util"] = loans["revol_util"].gsub(pattern="%", replacement="") # strip %
loans["revol_util"] = loans["revol_util"].trim() # trim whitespace
loans["revol_util"] = loans["revol_util"].asnumeric() # change to numeric 

### Convert `emp_length`

<div class="alert alert-block alert-success"><span style="color:black">
 
**YOUR TURN**: Use the empty cell below to convert the `emp_length` variable to numeric.
</span></div>

In [None]:
# emp_length


<div class="alert alert-block alert-success"><span style="color:black">
Check your answer below:
    </span></div>

In [None]:
# Use gsub to remove " year" and " years"; also translate n/a to "" 
loans["emp_length"] = loans["emp_length"].gsub(pattern="([ ]*+[a-zA-Z].*)|(n/a)", replacement="") 
loans["emp_length"] = loans["emp_length"].trim() # trim whitespace

loans["emp_length"] = loans["emp_length"].gsub(pattern="< 1", replacement="0") # convert "< 1" to 0
loans["emp_length"] = loans["emp_length"].gsub(pattern="10\\+", replacement="10") # convert "10+" to 10
loans["emp_length"] = loans["emp_length"].asnumeric()

These steps result in

In [None]:
loans[["int_rate", "revol_util", "emp_length"]].head()
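
To sanity-check the conversion rules on individual values, the same logic can be written in plain Python. This is a sketch for testing the rules only; the function names are hypothetical and the regular expression is a simplified stand-in for the `gsub` pattern used above, not a copy of it.

```python
import re


def percent_to_float(text):
    """Strip the % sign and whitespace, then parse as a number."""
    cleaned = text.replace("%", "").strip()
    return float(cleaned) if cleaned else None


def emp_length_to_years(text):
    """Map strings such as '< 1 year' or '10+ years' to integers."""
    cleaned = re.sub(r"\s*years?$", "", text.strip())
    if cleaned in ("", "n/a"):
        return None  # missing value
    if cleaned == "< 1":
        return 0     # "< 1" is defined as 0
    if cleaned == "10+":
        return 10    # "10+" is defined as 10
    return int(cleaned)


for raw in ["< 1 year", "1 year", "3 years", "10+ years", "n/a"]:
    print(repr(raw), "->", emp_length_to_years(raw))
```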

### Plotting interest rate distributions

Now that we have converted interest rate to numeric, we can use the `hist` function to compare the interest rate distributions for good and bad loans.

In [None]:
%matplotlib inline

print("Bad Loans")
loans[loans["bad_loan"] == "1", "int_rate"].hist()

print("Good Loans")
loans[loans["bad_loan"] == "0", "int_rate"].hist()

As expected, the bad loan distribution contains proportionately more high interest rate loans than the distribution for good loans. Likewise, the good loan distribution contains a higher proportion of low interest rate loans than that for bad loans. It would not surprise us if interest rate were a strong predictor of loan performance.

<div class="alert alert-block alert-info"><span style="color:black">
Financial institutions typically set a borrower's interest rate based on factors like estimated risk and customer demand. If the underwriting rules are any good at all, we would expect interest rate to be one of the best predictors of default. 
    </span></div>
    

## Part 3. Clean up messy categorical columns

Much like the `loan_status` column, the `verification_status` column needs cleaning:

In [None]:
loans["verification_status"].head()

Because there are multiple values that mean verified ("VERIFIED - income" and "VERIFIED - income source"), we should replace them simply with "verified".

<div class="alert alert-block alert-success"><span style="color:black">
 
**YOUR TURN**: Use the empty cell to clean up the `verification_status` variable
</span></div>

In [None]:
# verification_status


<div class="alert alert-block alert-success"><span style="color:black">
Check your answer below:
    </span></div>

In [None]:
loans["verification_status"] = loans["verification_status"].sub(pattern="VERIFIED - income source", 
                                                                replacement="verified")
loans["verification_status"] = loans["verification_status"].sub(pattern="VERIFIED - income", 
                                                                replacement="verified")
loans["verification_status"] = loans["verification_status"].asfactor()

We can confirm that status is cleaned up using

In [None]:
loans["verification_status"].table()
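
Note that the order of the two substitutions matters: the longer pattern has to be replaced first, because the shorter one is a prefix of it. The plain-Python sketch below illustrates this with ordinary string replacement; `clean_status` is a hypothetical name, and "not verified" is an assumed example of a value that should pass through unchanged.

```python
def clean_status(value):
    """Collapse both 'VERIFIED - ...' variants to 'verified'."""
    # Longest pattern first, so no " source" suffix is left behind
    for pattern in ("VERIFIED - income source", "VERIFIED - income"):
        value = value.replace(pattern, "verified", 1)
    return value


for raw in ["VERIFIED - income source", "VERIFIED - income", "not verified"]:
    print(repr(raw), "->", repr(clean_status(raw)))

# Replacing the shorter pattern first leaves a stray suffix:
print("VERIFIED - income source".replace("VERIFIED - income", "verified"))
```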

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>

# Step 4.  Feature engineering

Now that we have cleaned our data, we can extract information from our current columns to create new features. This process is referred to as _feature engineering_. The general idea is to express information found in our data in a manner that is most understandable to the algorithms we employ, with the goal of improving the performance of our supervised learning models.

Feature engineering can be considered the "secret sauce" in building a superior predictive model: it is often (although not always) more important than the choice of machine learning algorithm. A very good summary of feature engineering recipes can be found in the online [Driverless AI Documentation](http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/transformations.html). 

We will do some basic feature engineering using the date fields in our data. The new columns we will create are: 
* `credit_length`: the number of years someone has had a credit history
* `issue_d_year` and `issue_d_month`: the year and month from the loan issue date

### Credit Length

We create the `credit_length` feature by subtracting the year of a customer's earliest credit line from the year they were issued the loan.

In [None]:
loans["credit_length"] = loans["issue_d"].year() - loans["earliest_cr_line"].year()
loans["credit_length"].head()

### Issue Date Expansion

We next extract the year and month from the issue date.  We may find that the month or the year when the loan was issued will impact the probability of a bad loan. Additionally, since months are cyclical we will treat `issue_d_month` as a factor.

In [None]:
loans["issue_d_year"] = loans["issue_d"].year()
loans["issue_d_month"] = loans["issue_d"].month().asfactor()

loans[["issue_d_year", "issue_d_month"]].head()
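
For a single loan, the date features above reduce to simple arithmetic on the date parts. The sketch below uses Python's `datetime.date` with made-up dates; `date_features` is a hypothetical helper, and representing the month as a string is just one way to mimic treating it as a categorical value.

```python
from datetime import date


def date_features(issue_d, earliest_cr_line):
    """Compute the three engineered date features for one loan."""
    return {
        "credit_length": issue_d.year - earliest_cr_line.year,
        "issue_d_year": issue_d.year,
        # Stored as a label rather than a number, like a factor
        "issue_d_month": str(issue_d.month),
    }


# Made-up example: loan issued Dec 2011, first credit line Apr 1999
print(date_features(date(2011, 12, 1), date(1999, 4, 1)))
```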

<div class="alert alert-block alert-info"><span style="color:black">
There are a multitude of other options for feature engineering. In the date field alone we could have created day of the week, weekday vs. weekend, etc. In a later lesson in this series, we will look at creating features using natural language processing of the loan description field.
</span></div>

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>

# Step 5. Model training

Now that we have cleaned our data and added new columns, we train a model to predict bad loans. First, split the loans data into training and test sets.

In [None]:
train, test = loans.split_frame(seed=25, ratios=[0.75])
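
Keep in mind that `split_frame` splits probabilistically, so the resulting frames will be close to, but not exactly, 75% and 25% of the rows. The idea can be illustrated in plain Python as below. This is a conceptual sketch only, not H2O's implementation, and `probabilistic_split` is a hypothetical name.

```python
import random


def probabilistic_split(rows, ratio=0.75, seed=25):
    """Assign each row to train with probability `ratio`, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < ratio else test).append(row)
    return train, test


rows = list(range(1000))
train_rows, test_rows = probabilistic_split(rows)
# Sizes are near 750 / 250, but usually not exactly
print(len(train_rows), len(test_rows))
```

Checking `train.nrow` and `test.nrow` after the split is a quick way to see the actual sizes.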

Next create a list of predictors as a subset of the columns of the `loans` H2O Frame. We do this by listing the columns we will exclude from the predictors.

In [None]:
cols_to_remove = ["initial_list_status",
                  "out_prncp",
                  "out_prncp_inv",
                  "total_pymnt",
                  "total_pymnt_inv",
                  "total_rec_prncp", 
                  "total_rec_int",
                  "total_rec_late_fee",
                  "recoveries",
                  "issue_d",
                  "collection_recovery_fee",
                  "last_pymnt_d", 
                  "last_pymnt_amnt",
                  "next_pymnt_d",
                  "last_credit_pull_d",
                  "collections_12_mths_ex_med" , 
                  "mths_since_last_major_derog",
                  "policy_code",
                  "loan_status",
                  "funded_amnt",
                  "funded_amnt_inv",
                  "mths_since_last_delinq",
                  "mths_since_last_record",
                  "id",
                  "member_id",
                  "desc",
                  "zip_code"]

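# Keep the frame's column order and leave the response out of the predictors
excluded = set(cols_to_remove) | {"bad_loan"}
predictors = [col for col in loans.col_names if col not in excluded]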

In [None]:
predictors

Now create an XGBoost model for predicting loan default. 

<div class="alert alert-block alert-info"><span style="color:black">
This model is being run with almost all of the model-tuning values at their defaults. Later we may want to optimize the hyperparameters using a grid search or AutoML.
    
</span></div>

In [None]:
from h2o.estimators import H2OXGBoostEstimator

param = {
      "ntrees" : 20
    , "nfolds" : 5
    , "seed": 25
}
xgboost_model = H2OXGBoostEstimator(**param)
xgboost_model.train(x = predictors,
                    y = "bad_loan",
                    training_frame=train,
                    validation_frame=test)

# Step 6.  Examine model accuracy

The plot below shows the performance of the model as more trees are built.  This graph can help us see at what point our model begins overfitting.  Our test data error rate stops improving at around 8-10 trees.

In [None]:
%matplotlib inline
xgboost_model.plot()

The ROC curves for the training, cross-validation, and testing data are shown below.  The area under the ROC curve is much higher for the training data than for the test data, indicating that the model is beginning to memorize the training data.

In [None]:
print("Training Data")
xgboost_model.model_performance(train = True).plot()
print("X-Val")
xgboost_model.model_performance(xval=True).plot()
print("Testing Data")
xgboost_model.model_performance(valid = True).plot()
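
As a reminder of what the area under the ROC curve (AUC) measures: it equals the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan. The plain-Python sketch below computes it that way on made-up labels and scores; it is meant to illustrate the metric, not to reproduce H2O's computation.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = bad loan, scores = predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

A large gap between this value on the training data and on held-out data is the overfitting signal discussed above.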

# Step 7. Interpret model

The variable importance plot shows us which variables are most important to predicting `bad_loan`.  We can use partial dependency plots to learn more about how these variables affect the prediction.

In [None]:
xgboost_model.varimp_plot(20)

As suspected, interest rate appears to be the most important feature in predicting loan default. The partial dependency plot of the `int_rate` predictor shows us that as the interest rate increases, the likelihood of the loan defaulting also increases.

In [None]:
pdp = xgboost_model.partial_plot(cols=["int_rate"], data=train)
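
Conceptually, a partial dependence curve is built by forcing one feature to a fixed value for every row, averaging the model's predictions, and repeating this over a grid of values. The sketch below shows that procedure in plain Python. The `toy_predict` function and the sample rows are invented for illustration and have nothing to do with the trained XGBoost model.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is set to each grid value."""
    curve = []
    for value in grid:
        # Overwrite the feature in every row; leave other columns as-is
        modified = [dict(row, **{feature: value}) for row in rows]
        preds = [predict(row) for row in modified]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_predict(row):
    """Invented stand-in model: risk grows with rate and longer term."""
    return min(1.0, 0.02 * row["int_rate"] + 0.1 * (row["term"] == 60))


rows = [{"int_rate": 7.5, "term": 36}, {"int_rate": 15.0, "term": 60}]
for value, mean_pred in partial_dependence(toy_predict, rows, "int_rate", [5, 10, 20]):
    print(value, round(mean_pred, 3))
```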

## Examine model accuracy and interpretability in H2O Flow

Use the `Models` directory to list all models, or use the `getModels` directive. Your results should look something like the following:

<img src="./img/list_models.png" style="height:300px">

Note that this contains the XGBoost model plus the five XGBoost folds from our cross-validation. Select the overall model and make sure you can find, at the very least, the following:

- Scoring history plots
- AUC metrics and ROC plots
- Variable importances
- Confusion matrices
- XGBoost parameters

<div class="alert alert-block alert-info"><span style="color:black">
H2O Flow is a very convenient tool for interactive model investigation.
</span></div>

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>


# Assignment

<div class="alert alert-block alert-success"><span style="color:black">
    
## YOUR TURN: Build and investigate a new model

Building a model including `int_rate`, as we did above, is perhaps a questionable choice. Build an XGBoost model without interest rate and include the following:

- A scoring history model plot
- ROC curves for training, cross-validation, and testing data
- An updated variable importance plot
- Partial dependence plots for the top two variables 

Insert as many cells below as needed to complete.
</span></div>

In [None]:
# new model


# Step 8. Save and reuse model

The model can either be embedded into a self-contained Java MOJO package
or it can be saved and later loaded directly into an H2O-3 cluster. For production
use, we recommend using MOJO as it is optimized for speed. See the [guide](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html) for further information.

### Downloading MOJO

Creating and downloading a MOJO is a simple matter of using the `download_mojo` method:

In [None]:
xgboost_model.download_mojo()

### Save and reuse the model 

We can also save the model to disk for later use.

In [None]:
model_path = h2o.save_model(model=xgboost_model, force=True)
print(model_path)

After the H2O cluster shuts down, all unsaved data and models are lost. At some future date, we can load the model for batch scoring in the H2O cluster.

In [None]:
loaded_model = h2o.load_model(path=model_path)

Using that model, we can also score new data with the predict function:

In [None]:
bad_loan_hat = loaded_model.predict(test)
bad_loan_hat.head(15)

<div class="alert alert-block alert-info"><span style="color:black">
It is a good idea to save any work on your H2O cluster before shutting it down. 
</span></div>

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled><span style="color:black">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>

# Step 9. Stop the H2O-3 server

In [None]:
h2o.cluster().shutdown()

Once your work is completed, shutting down the H2O cluster frees up the resources reserved by H2O.

<div class="alert alert-block alert-warning"><span style="color:black">

## Learning Outcomes

<ul style="list-style: none;">
    <li><input type="checkbox" disabled checked><span style="color:gray"> 
    Start the H2O-3 server
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Load data directly into the H2O-3 cluster
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Inspect data using H2O Flow
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Perform basic data munging tasks with H2O
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Engineer new data features
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Train and evaluate an XGBoost ML model (in H2O and H2O Flow)
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Create and save a MOJO for model production
    </span></li><li><input type="checkbox" disabled checked><span style="color:gray">
    Stop the H2O-3 server
    </span></li>
</ul>

</span>
</div>

# CONGRATULATIONS! You have completed Lesson 2.