# Predict chat reason using IBM Watson Machine Learning

This notebook introduces commands for getting data, basic data cleaning and exploration, pipeline creation, model training, model persistence to the Watson Machine Learning repository, model deployment, and scoring.

Some familiarity with Python is helpful. This notebook uses Python 2 and Apache® Spark 2.0.


## Learning goals

The learning goals of this notebook are:

-  Load a CSV file into an Apache® Spark DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create an Apache® Spark machine learning pipeline.
-  Train and evaluate a model.
-  Persist a pipeline and model in the Watson Machine Learning repository.
-  Deploy a model for online scoring using the Watson Machine Learning API.
-  Score sample scoring data using the Watson Machine Learning API.



## Contents

This notebook contains the following parts:

1.	[Setup](#setup)
2.	[Load and explore data](#load)
3.	[Create an Apache Spark machine learning model](#model)
4.	[Persist model in IBM Watson Machine Learning](#save)
5.	[Predict locally and visualize](#predict)
6.	[Deploy and create online scoring endpoint](#deploy)


<a id="setup"></a>
## 1. Setup

Before you use the sample code in this notebook, you must perform the following setup tasks:

-  Create a [Watson Machine Learning Service](https://console.ng.bluemix.net/catalog/services/ibm-watson-machine-learning/) instance (a free plan is offered). 
-  Upload **cox.csv** data as a data asset in IBM Data Science Experience.
-  Make sure that you are using a Spark 2.0 kernel.


<a id="load"></a>
## 2.  Load and explore data

IBM Data Science Experience (DSX) makes it easy to load your files with a few clicks!

**Action**: Import the data and add `.option('inferSchema', 'true')`.

In [None]:
import ibmos2spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# @hidden_cell
# The following code is used to access your data and contains your credentials.
# You might want to remove those credentials before you share your notebook.

properties_3120d808e7fd4d15a725a507f90ae16c = {
    'jdbcurl': 'jdbc:db2://',
    'user': 'bluadmin',
    'password': ''
}

data_df_1 = spark.read.jdbc(properties_3120d808e7fd4d15a725a507f90ae16c['jdbcurl'], table='CHANGE_SCHEMA_NAME.TRAINING', properties=properties_3120d808e7fd4d15a725a507f90ae16c)
data_df_1.head()



Explore the loaded data by using the following Apache® Spark DataFrame methods:
-  `printSchema()` to print the schema
-  `count()` to count all the records
-  `show(5)` to print the top five records

In [None]:
df = data_df_1

# Print the column names and types, then the total number of rows.
df.printSchema()
print("# of records: " + str(df.count()))

We can see that there are 1165 rows and 7 fields. We will use the six `KEY` fields as features to predict the `CLASSIFICATION` field (the label).

In [None]:
df.show(5)

The cell above displays the top five rows.

<a id="model"></a>
## 3. Create an Apache Spark machine learning model

In this section we will prepare data, create an Apache Spark machine learning pipeline, and train a model.


### 3.1:  Prepare data

In this subsection we will split our data into training and test datasets.

In [None]:
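# Split the data 90/10 with a fixed seed so that the split is reproducible.
split_data = df.randomSplit([0.9, 0.1], 24)

# Train only on the first split so that the test rows stay unseen during training.
training_data = split_data[0]
test_data = split_data[1]

print("Training records: " + str(training_data.count()))
print("Test records: " + str(test_data.count()))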

As you can see, our data has been split into two datasets:

-  The training dataset, which is the larger group, is used for training.
-  The test dataset is used for model evaluation and, later in this notebook, for local prediction.
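To make the idea of a weighted, seeded random split concrete, here is a minimal plain-Python sketch. It is not Spark's implementation: the helper `random_split` is hypothetical and only illustrates that each row is assigned to a split independently, so the resulting sizes are approximate rather than exact, and that a fixed seed makes the split repeatable.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform random number."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative, normalized boundaries, e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.9, 0.1], seed=24)
print(len(train), len(test))
```

Running the function twice with the same seed returns the same two lists, which is why the notebook passes a seed to `randomSplit`.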

### 3.2:  Create pipeline and train a model

In this section we create an Apache Spark machine learning pipeline and then train the model.

First we need to import several packages that will be used in the next few steps.

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

Next we need to convert all the string fields to numeric values.

In [None]:
categoricalColumns = ["KEY1", "KEY2", "KEY3", "KEY4", "KEY5", "KEY6"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Map each distinct string value to a numeric index.
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Skip rows whose value was not seen when the indexer was fitted.
    stringIndexer.setHandleInvalid("skip")
    # Turn each index into a sparse 0/1 vector.
    encoder = OneHotEncoder(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "classVec")
    stages += [stringIndexer, encoder]
# Fit the label indexer on the full data so that it knows every label value.
stringIndexer_label = StringIndexer(inputCol="CLASSIFICATION", outputCol="label").fit(df)
stages += [stringIndexer_label]
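The following plain-Python sketch illustrates what the two stages above do conceptually for a single column. It is an illustration, not the Spark code path: `fit_string_index` and `one_hot` are hypothetical helpers, the sample values are made up, and the alphabetical tie-break is an assumption of the sketch. It mirrors two behaviors: the most frequent value receives index 0, and the one-hot vector drops the last category, so that category is encoded as all zeros.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Hypothetical sample values for one categorical column.
key1 = ["billing", "outage", "billing", "upgrade", "billing", "outage"]
index = fit_string_index(key1)
encoded = [one_hot(index[v], len(index)) for v in key1]
print(index)
print(encoded[0], encoded[3])
```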



Create a feature vector by combining all features together.

In [None]:
# Build a real list of column names (a bare map() is lazy on Python 3).
assemblerInputs = [c + "classVec" for c in categoricalColumns]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
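Conceptually, assembling features means concatenating the per-column vectors of a row into one long vector. The sketch below shows that idea in plain Python; `assemble` is a hypothetical helper and the input vectors are made-up examples, not output from this dataset.

```python
def assemble(vectors):
    """Concatenate several per-column vectors into one feature vector."""
    features = []
    for vector in vectors:
        features.extend(vector)
    return features

# Three hypothetical one-hot vectors for a single row.
row_vectors = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0, 0.0]]
features = assemble(row_vectors)
print(features)
```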

Next we define a Naive Bayes estimator.

In [None]:
# Naive Bayes classifier; the variable keeps the name rf, but it is not a random forest.
rf = NaiveBayes(labelCol="label", featuresCol="features", smoothing=0.1)
stages += [rf]
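To show what the `smoothing` parameter does, here is a minimal plain-Python sketch of a multinomial Naive Bayes classifier with additive smoothing. It is a simplified illustration under the assumption of dense 0/1 feature lists, not the Spark implementation; `fit_naive_bayes` and `predict` are hypothetical helpers and the data is made up.

```python
import math

def fit_naive_bayes(features, labels, smoothing=0.1):
    """Return {class: (log_prior, log_likelihood_per_feature)}."""
    classes = sorted(set(labels))
    num_features = len(features[0])
    model = {}
    for c in classes:
        rows = [f for f, y in zip(features, labels) if y == c]
        # Smoothed log prior of the class.
        log_prior = math.log((len(rows) + smoothing) /
                             (len(labels) + smoothing * len(classes)))
        # Per-feature totals within the class; smoothing keeps zero counts finite.
        sums = [sum(column) for column in zip(*rows)]
        total = sum(sums) + smoothing * num_features
        log_theta = [math.log((s + smoothing) / total) for s in sums]
        model[c] = (log_prior, log_theta)
    return model

def predict(model, x):
    """Pick the class with the highest log posterior score."""
    def score(c):
        log_prior, log_theta = model[c]
        return log_prior + sum(xi * t for xi, t in zip(x, log_theta))
    return max(model, key=score)

# Hypothetical one-hot features and indexed labels.
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
y = [0, 0, 1, 1]
nb_model = fit_naive_bayes(X, y, smoothing=0.1)
print(predict(nb_model, [1, 0, 0]), predict(nb_model, [0, 1, 0]))
```

Without smoothing, a feature that never occurs with a class would give a zero probability and an undefined logarithm; the small constant avoids that.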

Next we convert the indexed labels back to the original labels.

In [None]:
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=stringIndexer_label.labels)
stages += [labelConverter]

Now we will put all the steps into a pipeline. 

In [None]:
pipeline_rf = Pipeline(stages=stages)

Now we will create a model using our pipeline and the training_data dataset.

In [None]:
model_rf = pipeline_rf.fit(training_data)

Now we will check our model accuracy using our test_data dataset.

In [None]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluatorRF.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
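The accuracy metric used above is the fraction of rows whose predicted label equals the true label. A minimal plain-Python sketch, with a hypothetical `accuracy` helper and made-up label values:

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))

# Hypothetical indexed labels and predictions.
true_labels = [0, 1, 1, 2, 0]
predicted = [0, 1, 2, 2, 0]
acc = accuracy(true_labels, predicted)
print("Accuracy = %g" % acc)
print("Test Error = %g" % (1.0 - acc))
```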

At this point we would tune the model for the desired accuracy; for this example we will move on.

<a id="save"></a>
## 4. Persist model in IBM Watson Machine Learning

In this section you will learn how to store your pipeline and model in the Watson Machine Learning repository by using Python client libraries.

First, you must import client libraries.

**Note**: Apache Spark 2.0 or higher is required.

In [None]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

Authenticate to the Watson Machine Learning service on Bluemix.

**Action**: Use your Watson Machine Learning service instance credentials below.



In [None]:
username = ''
password = ''
service_path = 'https://ibm-watson-ml.mybluemix.net'
instance_id = ''

**Tip**: service_path, username, and password can be found on the **Service Credentials** tab of the service instance created in Bluemix. If you cannot see the **instance_id** field in **Service Credentials**, generate new credentials by pressing the **New credential (+)** button.

In [None]:
ml_repository_client = MLRepositoryClient(service_path)
ml_repository_client.authorize(username, password)

Create the model artifact (abstraction layer).

In [None]:
model_artifact = MLRepositoryArtifact(model_rf, training_data=training_data, name="NLU Demo")

**Tip**: The MLRepositoryArtifact constructor expects a trained model object, training data, and a model name. (This model name is displayed by the Watson Machine Learning service.)

### 4.1: Save pipeline and model

In [None]:
saved_model = ml_repository_client.models.save(model_artifact)

Get saved model metadata from Watson Machine Learning using the meta.available_props() method.

In [None]:
saved_model.meta.available_props()

**Tip**: **modelVersionHref** is the unique ID of our model in Watson Machine Learning.

In [None]:
print(saved_model.meta.prop("modelVersionHref"))

### 4.2: Load model

Now that we have saved the model, we will load it and verify its name.

In [None]:
loadedModelArtifact = ml_repository_client.models.get(saved_model.uid)

In [None]:
print(str(loadedModelArtifact.name))

<a id="predict"></a>
## 5. Predict locally and visualize

In this section we will score the test data using the loaded model.

### 5.1: Make local prediction using loaded model and test data

In [None]:
predictions = loadedModelArtifact.model_instance().transform(test_data)

In [None]:
predictions.show(3)


In [None]:
predictions.select("predictedLabel").groupBy("predictedLabel").count().show()

<a id="deploy"></a>
## 6. Deploy and create online scoring endpoint

In this section you will learn how to create an online scoring endpoint and score a new data record by using the Watson Machine Learning REST API. 
For more information about REST APIs, see the [Swagger Documentation](http://watson-ml-api.mybluemix.net/).

To work with the Watson Machine Learning REST API you must generate an access token. To do that you can use the following sample code:

In [None]:
import urllib3, requests, json

headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(username, password))
url = '{}/v3/identity/token'.format(service_path)
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
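The token request above authenticates with HTTP Basic authentication. As a rough illustration of what that header contains, the sketch below builds an equivalent header with only the standard library. The helper `basic_auth_header` and the example credentials are hypothetical; in the notebook itself `urllib3.util.make_headers` does this work.

```python
import base64

def basic_auth_header(user, secret):
    """Build an HTTP Basic authorization header from a user name and password."""
    raw = "{}:{}".format(user, secret).encode("latin-1")
    token = base64.b64encode(raw).decode("ascii")
    return {"authorization": "Basic " + token}

# Placeholder credentials, for illustration only.
header_example = basic_auth_header("example-user", "example-password")
print(header_example)
```

The encoded value is only base64, not encryption, so the credentials should be sent over HTTPS as they are in this notebook.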

Now that we have the token we can create an online scoring endpoint.

First we will check the model for existing deployments and get the deployments url, then we will create the online deployment.

In [None]:
published_model_details = (service_path + "/v3/wml_instances/" + instance_id
                           + "/published_models/" + loadedModelArtifact.uid)
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}

response_get_model_details = requests.get(published_model_details, headers=header)

# Parse the response once and read the deployment details from it.
deployments = json.loads(response_get_model_details.text).get('entity').get('deployments')
print('Existing deployment count: ' + str(deployments.get('count')))
deployments_endpoint = deployments.get('url')
print(deployments_endpoint)

Now create the online deployment. The following cell prints the scoring endpoint and the list of labels; use both to update the Node-RED flow.

In [None]:
payload_online_endpoint = {"name": "COX Prediction Deployment", "description": "NLU prediction endpoint", "type": "online"}
response_online = requests.post(deployments_endpoint, json=payload_online_endpoint, headers=header)

scoring_endpoint = json.loads(response_online.text).get('entity').get('scoring_url')
print(scoring_endpoint)

# Print the labels as a JSON-style list, in label-index order.
print('["%s"]' % '", "'.join(map(str, stringIndexer_label.labels)))
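Once the scoring endpoint exists, a record can be scored by posting a JSON payload to it. The sketch below only builds such a payload and does not call the service. The `fields`/`values` payload shape is an assumption about the scoring API, `build_scoring_payload` is a hypothetical helper, and the field values are placeholders rather than real data.

```python
import json

def build_scoring_payload(fields, rows):
    """Build a scoring payload; every row must have one value per field."""
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("each row needs exactly one value per field")
    return {"fields": list(fields), "values": [list(row) for row in rows]}

fields = ["KEY1", "KEY2", "KEY3", "KEY4", "KEY5", "KEY6"]
# Placeholder feature values, for illustration only.
rows = [["a", "b", "c", "d", "e", "f"]]
payload_scoring = build_scoring_payload(fields, rows)
print(json.dumps(payload_scoring))

# Assumed usage with the names defined in this notebook (not executed here):
# response_scoring = requests.post(scoring_endpoint, json=payload_scoring, headers=header)
# print(response_scoring.text)
```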