## Train Model

### Configure Spark for Your Notebook
* This example uses the local Spark Master: `--master local[1]`
* In production, you would use the PipelineIO Spark Master: `--master spark://apachespark-master-2-1-0:7077`

In [None]:
import os

master = '--master local[1]'
#master = '--master spark://apachespark-master-2-1-0:7077'
conf = '--conf spark.cores.max=1 --conf spark.executor.memory=512m'
packages = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1'
jars = '--jars /root/lib/jpmml-sparkml-package-1.0-SNAPSHOT.jar'
py_files = '--py-files /root/lib/jpmml.py'

# Join the option groups with single spaces; 'pyspark-shell' must come last.
os.environ['PYSPARK_SUBMIT_ARGS'] = ' '.join(
    [master, conf, packages, jars, py_files, 'pyspark-shell'])

print(os.environ['PYSPARK_SUBMIT_ARGS'])
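
The argument string built above can also be assembled from structured values, which makes it easier to switch masters or add packages. This is a minimal sketch: `build_submit_args` is a hypothetical helper, not part of PySpark, and it assumes only what the cell above shows (the flags used and the trailing `pyspark-shell` token).

```python
import os

def build_submit_args(master, conf=None, packages=None, jars=None, py_files=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string from its parts (hypothetical helper)."""
    parts = ['--master', master]
    # Each Spark property becomes its own "--conf key=value" pair.
    for key, value in (conf or {}).items():
        parts += ['--conf', '%s=%s' % (key, value)]
    # The remaining options each take one comma-separated list.
    for flag, values in (('--packages', packages),
                         ('--jars', jars),
                         ('--py-files', py_files)):
        if values:
            parts += [flag, ','.join(values)]
    # Trailing token, as in the cell above.
    parts.append('pyspark-shell')
    return ' '.join(parts)

submit_args = build_submit_args(
    'local[1]',
    conf={'spark.cores.max': 1, 'spark.executor.memory': '512m'},
    packages=['com.amazonaws:aws-java-sdk:1.7.4',
              'org.apache.hadoop:hadoop-aws:2.7.1'],
    jars=['/root/lib/jpmml-sparkml-package-1.0-SNAPSHOT.jar'],
    py_files=['/root/lib/jpmml.py'],
)
os.environ['PYSPARK_SUBMIT_ARGS'] = submit_args
print(submit_args)
```

Switching to the production master would then only require changing the first argument.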

### Import Spark Libraries

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import DecisionTreeClassifier

### Create Spark Session
This may take a minute or two.  Please be patient.

In [None]:
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

### Read Data from Public S3 Bucket
* AWS credentials are not needed.
* Spark infers the schema from the data (`inferSchema`).
* The first row of the data is a header (`header`).
* `bzip2` is a splittable compression format, which makes it a good choice when the input is compressed. The path below points to a plain `.csv` file.

In [None]:
df = spark_session.read.format("csv") \
  .option("inferSchema", "true").option("header", "true") \
  .load("s3a://datapalooza/R/census.csv")

df.head()

In [None]:
print(df.count())
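
To make the `header` and `inferSchema` options concrete, the sketch below mimics them on a tiny in-memory CSV using only the standard library. It illustrates the idea and is not how Spark implements inference; the sample rows and the `infer_type` helper are made up for this example.

```python
import csv
import io

# Made-up sample rows; the column names echo the census data.
SAMPLE = "age,workclass,hours_per_week\n39,State-gov,40\n50,Private,13.5\n"

def infer_type(values):
    """Pick the narrowest of int, float, str that fits every value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster.__name__
        except ValueError:
            continue
    return 'str'

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)        # header=true: the first row holds column names
rows = list(reader)
columns = list(zip(*rows))   # transpose rows into columns
schema = {name: infer_type(col) for name, col in zip(header, columns)}
print(schema)
```

Inference needs to look at the data before the real read, which is why passing an explicit schema is often preferred for large inputs.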

## Create and Train Pipeline

In [None]:
formula = RFormula(formula = "income ~ .")
classifier = DecisionTreeClassifier()

pipeline = Pipeline(stages = [formula, classifier])

pipeline_model = pipeline.fit(df)

print(pipeline_model)
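
In the formula `income ~ .`, the left-hand side names the label column and `.` stands for every other column. The sketch below spells out that expansion in plain Python. `expand_formula` is a hypothetical helper that handles only the `label ~ .` and `label ~ a + b` forms, and the column list is a small subset chosen for illustration.

```python
def expand_formula(formula, columns):
    """Split an R-style formula into (label, feature columns). Illustrative only."""
    label, _, rhs = (part.strip() for part in formula.partition('~'))
    if label not in columns:
        raise ValueError('unknown label column: %s' % label)
    if rhs == '.':
        # "." means: every column except the label.
        features = [c for c in columns if c != label]
    else:
        features = [term.strip() for term in rhs.split('+')]
    return label, features

columns = ['age', 'workclass', 'education', 'income']
label, features = expand_formula('income ~ .', columns)
print(label, features)
```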

## Step 4:  Export the Pipeline Model

In [None]:
from jpmml import toPMMLBytes

model = toPMMLBytes(spark_session, df, pipeline_model)

# Write the PMML bytes to the file name that the deploy step reads.
with open('census.model', 'wb') as fh:
    fh.write(model)
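
Before deploying, it can be useful to sanity-check the exported bytes. The sketch below assumes that `toPMMLBytes` returns a PMML XML document, as its name suggests; `looks_like_pmml` is a hypothetical helper, and the sample document is a hand-written stub rather than real exporter output.

```python
import xml.etree.ElementTree as ET

def looks_like_pmml(payload):
    """Return True if payload (bytes) parses as XML whose root element is PMML."""
    try:
        root = ET.fromstring(payload)
    except ET.ParseError:
        return False
    # Strip the namespace, e.g. '{http://www.dmg.org/PMML-4_3}PMML' -> 'PMML'.
    return root.tag.rsplit('}', 1)[-1] == 'PMML'

# Hand-written stub standing in for the exported model bytes.
sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3"><Header/></PMML>')

print(looks_like_pmml(sample))
print(looks_like_pmml(b'not xml'))
```

In the notebook, the same check could be applied to `model` before it is written to disk.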

### Deployment Option 2 of 2:  REST API

In [None]:
import requests

deploy_url = 'http://prediction-pmml.demo.pipeline.io/api/v1/model/deploy/pmml/default/census-rest/v0'

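# Upload the exported model; the context manager closes the file afterwards.
with open('census.model', 'rb') as fh:
    response = requests.post(deploy_url, files={'file': fh})

# Report success only if the server returned a non-error status.
if response.ok:
    print("Success! %s" % response.text)
else:
    print("Deploy failed (%s): %s" % (response.status_code, response.text))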
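
The deploy URL above appears to follow a fixed path layout. Reading its segments as action, model type, namespace, model name, and version is an assumption based on the URL itself, not on PipelineIO documentation. `model_url` is a hypothetical helper that makes those segments explicit.

```python
BASE_URL = 'http://prediction-pmml.demo.pipeline.io/api/v1/model'

def model_url(action, model_type, namespace, name, version, base=BASE_URL):
    """Join the path segments of a model-server URL (segment meanings are assumed)."""
    return '/'.join([base, action, model_type, namespace, name, version])

deploy_url = model_url('deploy', 'pmml', 'default', 'census-rest', 'v0')
print(deploy_url)
```

Building URLs this way keeps the model name and version in one place, so the deploy and predict steps are less likely to drift apart.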

## Step 6:  Predict With Deployed Pipeline Model

### Setup Prediction Inputs

In [None]:
import json

data = {"age":39,
        "workclass":"State-gov",
        "education":"Bachelors",
        "education_num":13,
        "marital_status":"Never-married",
        "occupation":"Adm-clerical",
        "relationship":"Not-in-family",
        "race":"White",
        "sex":"Male",
        "capital_gain":2174,
        "capital_loss":0,
        "hours_per_week":40,
        "native_country":"United-States"}

json_data = json.dumps(data)

with open('census-predict-inputs.json', 'wt') as fh:
    fh.write(json_data)
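
A quick round trip confirms that the inputs file contains what the prediction step expects. This sketch uses a shortened copy of the inputs and a temporary directory so that it can run on its own; in the notebook you would read `census-predict-inputs.json` directly.

```python
import json
import os
import tempfile

# Shortened copy of the prediction inputs, for illustration.
data = {"age": 39, "workclass": "State-gov", "hours_per_week": 40}

path = os.path.join(tempfile.mkdtemp(), 'census-predict-inputs.json')
with open(path, 'wt') as fh:
    json.dump(data, fh)

with open(path, 'rt') as fh:
    loaded = json.load(fh)

# JSON keeps the number/string distinction used in the inputs.
print(loaded == data)
print(sorted(loaded))
```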

### Predict with CLI

In [None]:
%%bash

pio predict --model-version=v0 \
            --model-input-filename=census-predict-inputs.json

### Predict with REST

In [None]:
import json

# Note: you may need to run this cell twice.
#       The first request triggers a fallback (known bug).
#predict_url = 'http://prediction-pmml.demo.pipeline.io/api/v1/model/predict/pmml/default/census-cli/v0'

# Target the model deployed through the REST API above (census-rest).
predict_url = 'http://prediction-pmml.demo.pipeline.io/api/v1/model/predict/pmml/default/census-rest/v0'

headers = {'content-type': 'application/json'}

response = requests.post(predict_url, 
                         data=json_data, 
                         headers=headers)

print(response.text)
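
Since the first request may return a fallback, rerunning by hand can be replaced with a small retry loop. The sketch below is generic and makes no claim about the server's response format: `call_with_retry` is a hypothetical helper, and how a fallback is recognised is left to a caller-supplied predicate. The stand-in responses are placeholders, not real server output.

```python
def call_with_retry(send, is_fallback, attempts=2):
    """Call send() up to `attempts` times; return the first non-fallback result.

    If every attempt is a fallback, the last result is returned.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    result = None
    for _ in range(attempts):
        result = send()
        if not is_fallback(result):
            break
    return result

# Stand-in for the HTTP call: the first response is a fallback placeholder.
responses = iter(['FALLBACK', 'REAL RESPONSE'])
result = call_with_retry(lambda: next(responses),
                         lambda text: text == 'FALLBACK')
print(result)
```

In the notebook, `send` would wrap `requests.post(predict_url, data=json_data, headers=headers).text`.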

## Step 7:  Monitor Model Servers through Dashboards

### Fallbacks and Circuit Breaker [Dashboard](http://hystrix.demo.pipeline.io/hystrix-dashboard/monitor/monitor.html?streams=%5B%7B%22name%22%3A%22Model%20Servers%22%2C%22stream%22%3A%22http%3A%2F%2Fturbine.demo.pipeline.io%2Fturbine.stream%22%2C%22auth%22%3A%22%22%2C%22delay%22%3A%22%22%7D%5D)

In [None]:
%%html

<iframe width=800 height=600 src="http://hystrix.demo.pipeline.io/hystrix-dashboard/monitor/monitor.html?streams=%5B%7B%22name%22%3A%22Model%20Servers%22%2C%22stream%22%3A%22http%3A%2F%2Fturbine.demo.pipeline.io%2Fturbine.stream%22%2C%22auth%22%3A%22%22%2C%22delay%22%3A%22%22%7D%5D"></iframe>
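
The long `streams` query parameter in the dashboard URL is URL-encoded JSON. Decoding it with the standard library shows which stream the dashboard monitors; the sketch only unpacks the URL already used above and contacts no server.

```python
import json
from urllib.parse import urlparse, parse_qs

dashboard_url = (
    'http://hystrix.demo.pipeline.io/hystrix-dashboard/monitor/monitor.html'
    '?streams=%5B%7B%22name%22%3A%22Model%20Servers%22%2C%22stream%22%3A%22'
    'http%3A%2F%2Fturbine.demo.pipeline.io%2Fturbine.stream%22%2C%22auth%22'
    '%3A%22%22%2C%22delay%22%3A%22%22%7D%5D'
)

# parse_qs percent-decodes the value; the result is a JSON list of streams.
query = parse_qs(urlparse(dashboard_url).query)
streams = json.loads(query['streams'][0])

for stream in streams:
    print(stream['name'], '->', stream['stream'])
```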

### Grafana Prediction Metrics [Dashboard](http://grafana.demo.pipeline.io)

In [None]:
%%html

<iframe width=800 height=600 src="http://grafana.demo.pipeline.io"></iframe>