## Introduction

**Prediction example:**  
___
In this example we will show how to:
- Set up the required environment for accessing the ecosystem prediction server.
- Upload data to ecosystem prediction server.
- Load data into feature store and parse to frame.
- Build and test a prediction model for prism scores.

## Setup

**Setting up import path:**  
___
Add the path to the ecosystem notebook wrappers. It must point to the wrapper repository so that the packages required to access the prediction server from Python can be imported.
- **notebook_path:** Path to notebook repository. 

In [None]:
notebook_path = "/path to ecosystem notebook repository"

In [None]:
# ---- Uneditable ----
import sys
sys.path.append(notebook_path)
# ---- Uneditable ----
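
The path step can be made slightly more defensive. The sketch below is one possible approach and not part of the ecosystem wrappers: `add_notebook_path` is a hypothetical helper that fails early if the directory does not exist and avoids adding the same path twice.

```python
import os
import sys


def add_notebook_path(path):
    """Append path to sys.path once, failing early if it is not a directory.

    Hypothetical helper for illustration; not part of the ecosystem wrappers.
    """
    full_path = os.path.abspath(path)
    if not os.path.isdir(full_path):
        raise FileNotFoundError("Notebook repository not found: " + full_path)
    if full_path not in sys.path:
        sys.path.append(full_path)
    return full_path
```

Calling `add_notebook_path(notebook_path)` would then take the place of the bare `sys.path.append` call.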

**Import required packages:**  
___
Import and load all packages required for the following use case.

In [None]:
# ---- Uneditable ----
import pymongo
from bson.son import SON
import pprint
import pandas as pd
import json
import numpy
import operator
import datetime
import time
import os
import matplotlib.pyplot as plt

from prediction import jwt_access
from prediction import notebook_functions
from prediction.apis import functions
from prediction.apis import data_munging_engine
from prediction.apis import worker_h2o
from prediction.apis import prediction_engine
from prediction.apis import worker_file_service

%matplotlib inline
# ---- Uneditable ----

**Setup prediction server access:**  
___
Create an access token for the prediction server.
- **url:** URL of the prediction server to access.
- **username:** Username for the prediction server.
- **password:** Password for the prediction server.

In [None]:
url = "http://demo.ecosystem.ai:3001/api"
username = "user@ecosystem.ai"
password = "cd486be3-9955-4364-8ccc-a9ab3ffbc168"

In [None]:
# ---- Uneditable ----
auth = jwt_access.Authenticate(url, username, password)
# ---- Uneditable ----
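
Credentials written directly into a notebook are easy to leak when the notebook is shared. As a hedged alternative, the sketch below reads them from environment variables. The variable names `ECOSYSTEM_URL`, `ECOSYSTEM_USERNAME` and `ECOSYSTEM_PASSWORD` are assumptions chosen for this example; the prediction server does not define them.

```python
import os


def load_credentials(env=None):
    """Read server credentials from a mapping (default: os.environ).

    The variable names are assumptions made for this example.
    """
    env = os.environ if env is None else env
    names = ("ECOSYSTEM_URL", "ECOSYSTEM_USERNAME", "ECOSYSTEM_PASSWORD")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

The returned tuple can be unpacked into `url, username, password` before the `jwt_access.Authenticate` call.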

## Upload Data

**List uploaded files:**  
___
List all files already uploaded.

In [None]:
# ---- Uneditable ----
files = worker_file_service.get_files(auth, path="./", user=username)
files = files["item"]
for file in files:
    file_name = file["name"]
    fn_parts = file_name.split(".")
    if len(fn_parts) > 1 and fn_parts[-1] != "log":
        print(file_name)
# ---- Uneditable ----
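
The filtering in the cell above can be factored into a small function so that it is easy to reuse for a later listing. This is a sketch that assumes the response shape used above (a dict with an `item` list whose entries each have a `name`); `visible_file_names` is a hypothetical helper, not part of `worker_file_service`.

```python
def visible_file_names(response):
    """Return names that have an extension other than .log."""
    names = []
    for entry in response.get("item", []):
        name = entry["name"]
        parts = name.split(".")
        if len(parts) > 1 and parts[-1] != "log":
            names.append(name)
    return names


# Hand-written sample response; real data comes from get_files.
sample = {"item": [{"name": "data.csv"}, {"name": "server.log"}, {"name": "README"}]}
print(visible_file_names(sample))
```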

**List uploadable files:**  
___
List all files in path ready for upload to prediction server.

In [None]:
# ---- Uneditable ----
path = "../example_data/"
upload_files = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
print(upload_files)
# ---- Uneditable ----

**Upload file:**  
___
Select the file to upload to the prediction server.
- **file_name:** Name of the file to upload to the prediction server. See the list of uploadable files above.

In [None]:
file_name = "multi_personality_tiny.csv"

In [None]:
# ---- Uneditable ----
worker_file_service.upload_file(auth, path + file_name, "/data/")
# ---- Uneditable ----

**List uploaded files:**  
___
List all files on the prediction server again and compare with the earlier list to confirm that the file was uploaded correctly.

In [None]:
# ---- Uneditable ----
files = worker_file_service.get_files(auth, path="./", user=username)
files = files["item"]
for file in files:
    file_name = file["name"]
    fn_parts = file_name.split(".")
    if len(fn_parts) > 1 and fn_parts[-1] != "log":
        print(file_name)
# ---- Uneditable ----
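
Comparing two printed listings by eye is error-prone. A minimal sketch of doing the comparison in code, assuming the file names from each listing have been collected into plain Python lists (`newly_uploaded` is a hypothetical helper):

```python
def newly_uploaded(before, after):
    """Return names present in the second listing but not in the first, sorted."""
    return sorted(set(after) - set(before))


before = ["a.csv"]
after = ["a.csv", "multi_personality_tiny.csv"]
print(newly_uploaded(before, after))
```

If the file was already on the server before the upload, the difference is empty, so in that case a simple membership check on the second listing is the better test.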

## Build Model

**Train Model:**
___
Set training parameters for model and train.
- **predict_id:** Id for the prediction (for logging). 
- **description:** Description of model (for logging).
- **model_id:** Id for the model (for logging).
- **model_type:** Type of model to build (for logging). 
- **frame_name:** Name of frame used (for logging).
- **frame_name_desc:** Description of frame used (for logging).
- **model_purpose:** Purpose of model (for logging).
- **version:** Model version (for logging).

The following parameters depend on what is selected in the algo parameter.

- **algo:** Algorithm to use to train model. (Available algorithms: "H20-AUTOML", "PYTORCH")
- **transformer:** PyTorch transformer to use. (Available transformers: "bert-base-uncased")
- **model_name:** Output name of model being built.
- **device:** Hardware on which to build model. (Available devices: "cpu")
- **data_file_path:** Path to input data file.
- **data_file_type:** Type of input data file. (Available types: "csv")
- **model_path:** Path to output model.
- **training_column:** Column in dataset containing training text.
- **response_column:** Column in dataset containing predictor response.
- **epochs:** Number of epochs for which to train the model.
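- **learning_rate:** Learning rate (step size) used by the optimizer during training.
- **epsilon:** Small constant passed to the optimizer, presumably for numerical stability as in Adam-style optimizers.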
- **seed:** Random seed with which to run training of model.
- **model_checkpoint:** If set to True, the model is saved as a checkpoint of the base transformer; if set to False, the whole model is saved.
- **train_test_split:** Fraction of the data to hold out for validation (for example, 0.15 holds out 15%).
- **do_lower_case:** If True, convert input text to lowercase.
- **batch_size:** Number of rows in data to process concurrently.
- **add_special_tokens:** If set to True, special tokens will be added to tokenized data.
- **padding:** If True, inputs shorter than max_length are filled up to max_length with padding tokens.
- **max_length:** Maximum length the tokenizer allows.
- **truncation:** If set to True, inputs longer than max_length are truncated to max_length. If set to False, the model will not train when an input is longer than max_length.
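
To make the interaction between padding, truncation and max_length concrete, here is a toy sketch on a plain list of tokens. `fit_to_length` and the `[PAD]` token string are illustrative assumptions; the tokenizer used by the prediction server may differ in detail.

```python
def fit_to_length(tokens, max_length, padding=True, truncation=True, pad_token="[PAD]"):
    """Toy illustration of padding and truncation on a list of tokens.

    Mimics the behaviour described above; it is not the server's tokenizer.
    """
    if len(tokens) > max_length:
        if not truncation:
            raise ValueError("Input longer than max_length and truncation is off")
        tokens = tokens[:max_length]
    if padding and len(tokens) < max_length:
        tokens = tokens + [pad_token] * (max_length - len(tokens))
    return tokens


print(fit_to_length(["i", "like", "tea"], 5))
print(fit_to_length(["a", "b", "c", "d", "e", "f"], 5))
```

The first call pads a short input up to five tokens; the second cuts a long input down to five.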


In [None]:
version = "1010"
# NOTE: featurestore_name and hexframename are not defined earlier in this notebook.
# The two values below are placeholders (assumptions); set them to match your environment.
featurestore_name = "multi_personality_tiny"
hexframename = featurestore_name + ".hex"
model_id = featurestore_name + version
model_purpose = "Prediction of personality based on text data."
description = "Automated features store generated for " + featurestore_name
model_params = {
    "predict_id": featurestore_name,
    "description": description,
    "model_id": model_id,
    "model_type": "PYTORCH",
    "frame_name": hexframename,
    "frame_name_desc": description,
    "model_purpose": model_purpose,
    "version": version,
    "model_parms": {
        "algo": "PYTORCH",
        "transformer": "bert-base-uncased",
        "transformer_configs": {
            "model_name": "personality",
            "device": "cpu",
            "data_file_path": "data/multi_personality_tiny.csv",
            "data_file_type": "csv",
            "model_path": "modeling/personality.model",
            "training_column": "text",
            "response_column": "response",
            "epochs": 1,
            "learning_rate": 0.00001,
            "epsilon": 1e-8,
            "seed": 17,
            # Python booleans; json.dumps turns these into JSON true/false.
            "model_checkpoint": False,
            "train_test_split": 0.15,
            "do_lower_case": True,
            "batch_size": 10,
            "add_special_tokens": True,
            "padding": True,
            "max_length": 512,
            "truncation": True
        }
    }
}
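
The training call passes these parameters through `json.dumps`, so boolean settings have to be written as the Python values `True` and `False`, which are serialized as JSON `true` and `false`. A small standard-library sketch of that round trip, using a trimmed copy of the configuration for illustration only:

```python
import json

# Trimmed copy of the configuration, for illustration only.
transformer_configs = {
    "epochs": 1,
    "learning_rate": 0.00001,
    "model_checkpoint": False,
    "do_lower_case": True,
}

payload = json.dumps(transformer_configs)
print(payload)

# Loading the string back yields the same Python values.
assert json.loads(payload) == transformer_configs
```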

In [None]:
# ---- Uneditable ----
worker_h2o.train_model(auth, model_id, "pytorch", json.dumps(model_params["model_parms"]))
# ---- Uneditable ----

**View Model:**
___
View autoML model to see which generated models are performing the best.

In [None]:
# ---- Uneditable ----
model_data = worker_h2o.get_train_model(auth, model_id, "AUTOML")
notebook_functions.RenderJSON(model_data)
# ---- Uneditable ----

In [None]:
# Metric the leaderboard is sorted by, and the name of each model on it.
sort_metric = model_data["leaderboard"]["sort_metric"]
model_names = [model["name"] for model in model_data["leaderboard"]["models"]]

# Metric values used to build the comparison table, one per model.
model_metrics = model_data["leaderboard"]["sort_metrics"]

df = pd.DataFrame(
    {
        "model_names": model_names,
        "model_metrics": model_metrics
    }
)
# Sort so the highest metric value comes first, then plot as a bar chart.
df.sort_values("model_metrics", inplace=True, ascending=False)
ax = df.plot(y="model_metrics", x="model_names", kind="bar", align="center", alpha=0.5, legend=None)
plt.xticks(rotation=90)
ax.set_title("Performance of Models. Sorted Using Metric: {}".format(sort_metric))
ax.yaxis.grid(True)
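
Sorting in descending order puts the best model first only when a higher metric value is better (for example AUC). For error metrics such as log loss or RMSE, lower is better. The sketch below picks the best entry with that in mind. The set of lower-is-better metric names is an assumption made for this example and should be checked against the metrics the server actually reports.

```python
# Assumed names of metrics where a lower value is better.
LOWER_IS_BETTER = {"logloss", "rmse", "mse", "mae", "mean_per_class_error"}


def pick_best(model_names, model_metrics, sort_metric):
    """Return the name of the best model for the given sort metric."""
    pairs = list(zip(model_names, model_metrics))
    if sort_metric.lower() in LOWER_IS_BETTER:
        return min(pairs, key=lambda pair: pair[1])[0]
    return max(pairs, key=lambda pair: pair[1])[0]


print(pick_best(["gbm_1", "glm_1"], [0.91, 0.87], "auc"))
print(pick_best(["gbm_1", "glm_1"], [0.35, 0.42], "logloss"))
```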

**Save Model:**
___
Save model for prediction.
- **best_model:** Name of the model to save. By default this is the first row of the sorted table above.

In [None]:
best_model = df.iloc[0]["model_names"]

In [None]:
h2o_name = best_model
zip_name = h2o_name + ".zip"
# worker_h2o.download_model_mojo(auth, h2o_name)
high_level_mojo = worker_h2o.get_train_model(auth, h2o_name, "single")
model_to_save = high_level_mojo["models"][0]
model_to_save["model_identity"] = h2o_name
model_to_save["userid"] = "user"
model_to_save["timestamp"] = "time_stamp"
prediction_engine.save_model(auth, model_to_save)
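
The `userid` and `timestamp` values above are placeholders. One way to fill them with real values is sketched below. `add_save_metadata` is a hypothetical helper, and the ISO 8601 timestamp format is an assumption, since the format the prediction server expects is not specified here.

```python
import datetime


def add_save_metadata(model, model_identity, userid, now=None):
    """Return a copy of the model dict with identity, user and timestamp set."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    updated = dict(model)
    updated["model_identity"] = model_identity
    updated["userid"] = userid
    # Assumed format: ISO 8601 string in UTC.
    updated["timestamp"] = now.isoformat()
    return updated
```

Because the function returns a copy, the dictionary fetched from the server stays unchanged until the result is passed to `prediction_engine.save_model`.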

**View Model Stats:**
___
View stats of saved model.

In [None]:
prediction_engine.get_user_model(auth, h2o_name) 
stats = worker_h2o.get_model_stats(auth, h2o_name, "ecosystem", "variable_importances")
notebook_functions.RenderJSON(stats)