## Step 0: Set Up Spark

### Configure Spark for Your Notebook
* This example uses the local Spark Master `--master local[1]`
* In production, you would use the PipelineIO Spark Master `--master spark://apachespark-master-2-1-0:7077`

In [262]:
import os

master = '--master local[1]'
#master = '--master spark://apachespark-master-2-1-0:7077'
conf = '--conf spark.cores.max=1 --conf spark.executor.memory=512m'
packages = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1'
jars = '--jars /root/lib/jpmml-sparkml-package-1.0-SNAPSHOT.jar'
py_files = '--py-files /root/lib/jpmml.py'

os.environ['PYSPARK_SUBMIT_ARGS'] = master \
  + ' ' + conf \
  + ' ' + packages \
  + ' ' + jars \
  + ' ' + py_files \
  + ' ' + 'pyspark-shell'

print(os.environ['PYSPARK_SUBMIT_ARGS'])

--master local[1] --conf spark.cores.max=1 --conf spark.executor.memory=512m --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 --jars /root/lib/jpmml-sparkml-package-1.0-SNAPSHOT.jar --py-files /root/lib/jpmml.py pyspark-shell
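
The string concatenation above can also be written as a small helper that joins the individual flags. This is a minimal sketch: `build_submit_args` is a hypothetical name, not a PySpark API, and it only mirrors the flag order used in the cell above, ending with `pyspark-shell`.

```python
def build_submit_args(master, conf=None, packages=None, jars=None, py_files=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string (hypothetical helper, not part of PySpark)."""
    parts = ['--master', master]
    # Each Spark property becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        parts += ['--conf', '%s=%s' % (key, value)]
    # Multi-valued flags take a single comma-separated argument.
    if packages:
        parts += ['--packages', ','.join(packages)]
    if jars:
        parts += ['--jars', ','.join(jars)]
    if py_files:
        parts += ['--py-files', ','.join(py_files)]
    # The cell above ends the string with 'pyspark-shell'; keep it last.
    parts.append('pyspark-shell')
    return ' '.join(parts)

submit_args = build_submit_args(
    master='local[1]',
    conf={'spark.cores.max': 1, 'spark.executor.memory': '512m'},
    packages=['com.amazonaws:aws-java-sdk:1.7.4',
              'org.apache.hadoop:hadoop-aws:2.7.1'],
    jars=['/root/lib/jpmml-sparkml-package-1.0-SNAPSHOT.jar'],
    py_files=['/root/lib/jpmml.py'],
)
print(submit_args)
```

Assigning `submit_args` to `os.environ['PYSPARK_SUBMIT_ARGS']` before the Spark session is created would have the same effect as the concatenated version.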


### Import Spark Libraries

In [263]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import DecisionTreeClassifier

### Create Spark Session
This may take a minute or two.  Please be patient.

In [264]:
from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()

## Step 1: Load Training Data into Spark Cluster

### Read Data from Public S3 Bucket
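* AWS credentials are not needed.
* Spark infers the schema (`inferSchema`), which costs one extra pass over the data.
* The data has a header row (`header`).
* `bzip2` is preferred for compressed input because it is a splittable compression format. (The path below has a plain `.csv` extension.)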

In [265]:
data = sparkSession.read.format("csv") \
  .option("inferSchema", "true").option("header", "true") \
  .load("s3a://datapalooza/R/census.csv")

data.head()

Row(age=39, workclass='State-gov', education='Bachelors', education_num=13, marital_status='Never-married', occupation='Adm-clerical', relationship='Not-in-family', race='White', sex='Male', capital_gain=2174, capital_loss=0, hours_per_week=40, native_country='United-States', income='<=50K')
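
To make the `header` and `inferSchema` options concrete, here is a conceptual sketch in plain Python. It is not Spark's implementation: `infer_type` is a hypothetical helper, and the single sample row is a subset of the `data.head()` output above. The idea is that the first line supplies column names and each column gets the narrowest type that parses every value.

```python
import csv
import io

def infer_type(values):
    """Return 'int', 'double', or 'string' for a column of raw text values."""
    for name, cast in (('int', int), ('double', float)):
        try:
            for value in values:
                cast(value)
            return name  # every value parsed with this type
        except ValueError:
            continue  # fall through to the next, wider type
    return 'string'

# A subset of the columns from the first row shown above.
sample = io.StringIO(
    "age,workclass,capital_gain,income\n"
    "39,State-gov,2174,<=50K\n"
)

reader = csv.reader(sample)
header = next(reader)          # header=true: first line holds column names
rows = list(reader)
columns = list(zip(*rows))     # transpose rows into columns
schema = {name: infer_type(col) for name, col in zip(header, columns)}
print(schema)
```

On the real DataFrame, `data.printSchema()` shows the types Spark actually inferred.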

In [266]:
print(data.count())

198576


## Step 2: Feature Engineering
The dataset is already feature-engineered, so no extra steps are needed here.

## Step 3: Train the Pipeline

In [267]:
formula = RFormula(formula = "income ~ .")
classifier = DecisionTreeClassifier()

pipeline = Pipeline(stages = [formula, classifier])

pipelineModel = pipeline.fit(data)

print(pipelineModel)

PipelineModel_4eaf9a759b5c91517475
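
In the formula `income ~ .`, the left side names the label column and `.` stands for every other column. The sketch below illustrates only that expansion in plain Python. `expand_formula` is a hypothetical helper, not how `RFormula` is implemented, and it ignores operators other than `+`. The column list is taken from the row printed in Step 1.

```python
def expand_formula(formula, columns):
    """Split 'label ~ terms' and expand '.' to every column except the label."""
    label, terms = [side.strip() for side in formula.split('~')]
    if terms == '.':
        features = [c for c in columns if c != label]
    else:
        features = [t.strip() for t in terms.split('+')]
    return label, features

columns = ['age', 'workclass', 'education', 'education_num', 'marital_status',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain',
           'capital_loss', 'hours_per_week', 'native_country', 'income']

label, features = expand_formula('income ~ .', columns)
print(label)
print(features)
```

`RFormula` itself also encodes string feature columns and indexes a string label, which is why the raw census columns can be passed straight to the classifier.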


## Step 4:  Export the Pipeline Model

In [268]:
from jpmml import toPMMLBytes

# Pass the session, training DataFrame, and fitted pipeline created in the cells above.
model = toPMMLBytes(sparkSession, data, pipelineModel)

with open('census.model', 'wb') as fh:
    fh.write(model)

## Step 5:  Deploy the Pipeline Model

### Deployment Option 1 of 2:  Use PipelineIO Command Line!

In [259]:
%%bash

pip install -q pio-cli==0.37

### Configure CLI for Model Deployment

In [260]:
%%bash

pio init-model --model-server-url=http://prediction-jvm.demo.pipeline.io \
    --model-type=pmml --model-namespace=default --model-name=census-cli


("Merging dict '{'model_output_mime_type': 'application/json', "
 "'model_namespace': 'default', 'model_type': 'pmml', 'model_name': "
 "'census-cli', 'model_input_mime_type': 'application/json', "
 "'model_server_url': 'http://prediction-jvm.demo.pipeline.io'}' with existing "
 "config '/root/.pio/config'.")
---
model_input_mime_type: application/json
model_name: census-cli
model_namespace: default
model_output_mime_type: application/json
model_server_url: http://prediction-jvm.demo.pipeline.io
model_type: pmml
pio_api_version: v1
pio_git_home: https://github.com/fluxcapacitor/pipeline/
pio_git_version: v1.2.0



{'model_input_mime_type': 'application/json',
 'model_name': 'census-cli',
 'model_namespace': 'default',
 'model_output_mime_type': 'application/json',
 'model_server_url': 'http://prediction-jvm.demo.pipeline.io',
 'model_type': 'pmml',
 'pio_api_version': 'v1',
 'pio_git_home': 'https://github.com/fluxcapacitor/pipeline/',
 'pio_git_version': 'v1.2.0'}



In [261]:
%%bash

pio deploy --model-version=v0 census.model

model_version: v0
model_path: /root/volumes/source.ml/jupyterhub.ml/notebooks/spark/census.model
request_timeout: 600

Deploying model '/root/volumes/source.ml/jupyterhub.ml/notebooks/spark/census.model' to 'http://prediction-jvm.demo.pipeline.io/api/v1/model/deploy/pmml/default/census-cli/v0'.


Success!

Predict with 'pio predict' or POST to 'http://prediction-jvm.demo.pipeline.io/api/v1/model/deploy/pmml/default/census-cli/v0'



### Deployment Option 2 of 2:  REST API

In [249]:
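import requests

deploy_url = 'http://prediction-jvm.demo.pipeline.io/api/v1/model/deploy/pmml/default/census-rest/v0'

# Upload the exported model; the with-block closes the file handle afterwards.
with open('census.model', 'rb') as fh:
    response = requests.post(deploy_url, files={'file': fh})

# Raise on an HTTP error status instead of printing "Success!" unconditionally.
response.raise_for_status()

print("Success! %s" % response.text)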

Success! 
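
The deploy and predict URLs in this notebook share one layout: server, `/api/v1/model/`, action, model type, namespace, model name, and version. The sketch below builds them from parts so the two stay in sync. `model_url` is a hypothetical helper, and the layout is inferred from the URLs printed in this notebook rather than from PipelineIO documentation.

```python
def model_url(server_url, action, model_type, namespace, name, version):
    """Build a model-server URL following the layout seen in this notebook."""
    return '%s/api/v1/model/%s/%s/%s/%s/%s' % (
        server_url.rstrip('/'), action, model_type, namespace, name, version)

server = 'http://prediction-jvm.demo.pipeline.io'

deploy_url = model_url(server, 'deploy', 'pmml', 'default', 'census-rest', 'v0')
predict_url = model_url(server, 'predict', 'pmml', 'default', 'census-rest', 'v0')

print(deploy_url)
print(predict_url)
```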


## Step 6:  Predict With Deployed Pipeline Model

### Set Up Prediction Inputs

In [250]:
import json

data = {"age":39,
        "workclass":"State-gov",
        "education":"Bachelors",
        "education_num":13,
        "marital_status":"Never-married",
        "occupation":"Adm-clerical",
        "relationship":"Not-in-family",
        "race":"White",
        "sex":"Male",
        "capital_gain":2174,
        "capital_loss":0,
        "hours_per_week":40,
        "native_country":"United-States"}

json_data = json.dumps(data)

with open('census-predict-inputs.json', 'wt') as fh:
    fh.write(json_data)
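
Before sending a request, it can help to check locally that the payload names the same feature columns the model was trained on, that is, every training column except the `income` label. This is a sketch of such a check. `check_payload` and `FEATURE_COLUMNS` are hypothetical names, and how the model server itself treats missing or extra fields is not covered here.

```python
# Training columns from Step 1, minus the 'income' label.
FEATURE_COLUMNS = (
    'age', 'workclass', 'education', 'education_num', 'marital_status',
    'occupation', 'relationship', 'race', 'sex', 'capital_gain',
    'capital_loss', 'hours_per_week', 'native_country',
)

def check_payload(payload, expected=FEATURE_COLUMNS):
    """Return (missing, unexpected) field names, each sorted alphabetically."""
    keys = set(payload)
    missing = sorted(set(expected) - keys)
    unexpected = sorted(keys - set(expected))
    return missing, unexpected

# Deliberately incomplete payload to show what the check reports.
partial = {"age": 39, "workclass": "State-gov", "income": "<=50K"}
missing, unexpected = check_payload(partial)
print("missing:", missing)
print("unexpected:", unexpected)
```

Applied to the full input dictionary defined above, both lists would be empty.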

### Predict with CLI

In [251]:
%%bash

pio predict --model-version=v0 \
            --model-input-filename=census-predict-inputs.json

model_version: v0
model_input_filename: census-predict-inputs.json
request_timeout: 30

Predicting file 'census-predict-inputs.json' with model 'pmml/default/census-cli/v0' at 'http://prediction-jvm.demo.pipeline.io/api/v1/model/predict/pmml/default/census-cli/v0'...

('{"timestamp":1494020447799,"status":500,"error":"Internal Server '
 'Error","exception":"java.lang.NoClassDefFoundError","message":"io/pipeline/prediction/jvm/PMMLEvaluationCommand","path":"/api/v1/model/predict/pmml/default/census-cli/v0"}')



### Predict with REST

In [236]:
import json

# Note:  You may need to run this twice.
#        A fallback will trigger the first time. (Bug)
#predict_url = 'http://prediction-jvm.demo.pipeline.io/api/v1/model/predict/pmml/default/census-cli/v0'

predict_url = 'http://prediction-jvm.demo.pipeline.io/api/v1/model/predict/pmml/default/pmml_census/v0'

headers = {'content-type': 'application/json'}

response = requests.post(predict_url, 
                         data=json_data, 
                         headers=headers)

print(response.text)

{"results":[[{'income': 'NodeScoreDistribution{result=<=50K, probability_entries=[<=50K=0.9564524694636218, >50K=0.04354753053637812], entityId=7, confidence_entries=[]}'}]]}
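
The prediction comes back as a `NodeScoreDistribution{...}` string rather than as structured fields. The sketch below pulls the predicted label and the class probabilities out of that string with regular expressions. `parse_score` is a hypothetical helper that assumes the exact layout printed above. It works on the inner string directly because the printed body mixes single and double quotes.

```python
import re

def parse_score(text):
    """Extract the predicted label and class probabilities from the score string."""
    result = re.search(r'result=([^,]+),', text).group(1)
    entries = re.search(r'probability_entries=\[([^\]]*)\]', text).group(1)
    probabilities = {}
    for entry in entries.split(', '):
        # Labels such as '<=50K' contain '=', so split on the last '=' only.
        label, _, value = entry.rpartition('=')
        probabilities[label] = float(value)
    return result, probabilities

# The score string copied from the response above.
raw = ('NodeScoreDistribution{result=<=50K, probability_entries='
       '[<=50K=0.9564524694636218, >50K=0.04354753053637812], '
       'entityId=7, confidence_entries=[]}')

label, probabilities = parse_score(raw)
print(label)
print(probabilities)
```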


## Step 7:  Monitor Model Servers through Dashboards

### Fallbacks and Circuit Breaker [Dashboard](http://hystrix.demo.pipeline.io/hystrix-dashboard/monitor/monitor.html?streams=%5B%7B%22name%22%3A%22Model%20Servers%22%2C%22stream%22%3A%22http%3A%2F%2Fturbine.demo.pipeline.io%2Fturbine.stream%22%2C%22auth%22%3A%22%22%2C%22delay%22%3A%22%22%7D%5D)

In [206]:
%%html

<iframe width=800 height=600 src="http://hystrix.demo.pipeline.io/hystrix-dashboard/monitor/monitor.html?streams=%5B%7B%22name%22%3A%22Model%20Servers%22%2C%22stream%22%3A%22http%3A%2F%2Fturbine.demo.pipeline.io%2Fturbine.stream%22%2C%22auth%22%3A%22%22%2C%22delay%22%3A%22%22%7D%5D"></iframe>

### Grafana Prediction Metrics [Dashboard](http://grafana.demo.pipeline.io)

In [207]:
%%html

<iframe width=800 height=600 src="http://grafana.demo.pipeline.io"></iframe>