This notebook walks through getting data from the **IBM Watson Studio Community**, creating a predictive model, and scoring new data. It introduces commands for getting data, basic data cleaning and exploration, pipeline creation, model training, model persistence to the Watson Machine Learning repository, model deployment, and scoring.

Some familiarity with Python is helpful. This notebook uses Python 3 and Apache Spark 2.3.

You will use a publicly available data set, GoSales Transactions for Naive Bayes Model, which details anonymous outdoor equipment purchases. Use the details of this data set to predict clients' interests in terms of product line, such as golf accessories, camping equipment and so on.

## Learning goals
You will learn how to:

- Load a CSV file into an Apache Spark DataFrame.
- Explore data.
- Prepare data for training and evaluation.
- Create an Apache Spark machine learning pipeline.
- Train and evaluate a model.
- Store a pipeline and model in the Watson Machine Learning (WML) repository.
- Deploy a model for online scoring using the WML API.
- Score sample scoring data using the WML API.
- Explore and visualize the prediction result using the plotly package.

## 1. Set up

Install **PySpark** by using Anaconda: `conda install -c conda-forge pyspark`

Install **wget**: `conda install -c menpo wget`
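
Before continuing, it can help to confirm that both packages are importable from the kernel this notebook runs in. The following is a minimal sketch that uses only the standard library; the helper name `check_packages` is hypothetical and not part of either package.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# 'pyspark' and 'wget' are the import names of the two packages installed above.
status = check_packages(["pyspark", "wget"])
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If either package is reported as missing, re-run the matching `conda install` command in the same environment as the notebook kernel.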

## 2. Load data

In this section you will load the data as an **Apache Spark DataFrame** and perform a basic exploration.

Use *wget* to download the data to the notebook's file system (GPFS), then use the `spark.read` method to load it into a Spark DataFrame.

In [4]:
import wget

link_to_data = 'https://apsportal.ibm.com/exchange-api/v1/entries/8044492073eb964f46597b4be06ff5ea/data?accessKey=9561295fa407698694b1e254d0099600'
filename = wget.download(link_to_data)

print(filename)

GoSales_Tx_NaiveBayes.csv
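
Re-running the download cell fetches the file again each time. One way to avoid that is to skip the download when the file is already present. The sketch below uses only the standard library instead of the `wget` package; the helper name `download_if_missing` and the placeholder URL are assumptions made for illustration.

```python
import os
import tempfile
import urllib.request


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists; return the path."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename


# Demonstration without network access: the file already exists, so nothing is fetched.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "GoSales_Tx_NaiveBayes.csv")
    with open(existing, "w") as f:
        f.write("PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION\n")
    # The URL is a placeholder; it is never contacted because the file is present.
    result = download_if_missing("https://example.invalid/data.csv", existing)
    print(result == existing)
```

In this notebook the call would look like `download_if_missing(link_to_data, 'GoSales_Tx_NaiveBayes.csv')`, with the file name taken from the output above.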


In [6]:
# Import only what is needed rather than using a wildcard import.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the CSV file, treating the first line as a header and inferring column types.
df_data = spark.read\
  .format('csv')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load(filename)

In [7]:
df_data.printSchema()

root
 |-- PRODUCT_LINE: string (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- PROFESSION: string (nullable = true)
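
The schema above comes from the `inferSchema` option: Spark makes an extra pass over the data and picks a type for each column, which is why `AGE` is an integer and the other columns are strings. The sketch below illustrates the idea in plain Python. It is a deliberately simplified stand-in, not Spark's implementation (Spark considers more types and handles nulls), and the helper name `infer_type` is hypothetical.

```python
def infer_type(values):
    """Return 'integer', 'double', or 'string' for a list of raw CSV strings."""
    for type_name, cast in (("integer", int), ("double", float)):
        try:
            for v in values:
                cast(v)  # raises ValueError if the value does not fit this type
            return type_name
        except ValueError:
            continue
    return "string"


# Raw values as they would appear in the CSV file for the first three rows.
columns = {
    "GENDER": ["M", "F", "F"],
    "AGE": ["27", "39", "39"],
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)
```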



In [8]:
df_data.show()

+--------------------+------+---+--------------+------------+
|        PRODUCT_LINE|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
+--------------------+------+---+--------------+------------+
|Personal Accessories|     M| 27|        Single|Professional|
|Personal Accessories|     F| 39|       Married|       Other|
|Mountaineering Eq...|     F| 39|       Married|       Other|
|Personal Accessories|     F| 56|   Unspecified| Hospitality|
|      Golf Equipment|     M| 45|       Married|     Retired|
|      Golf Equipment|     M| 45|       Married|     Retired|
|   Camping Equipment|     F| 39|       Married|       Other|
|   Camping Equipment|     F| 49|       Married|       Other|
|  Outdoor Protection|     F| 49|       Married|       Other|
|      Golf Equipment|     M| 47|       Married|     Retired|
|      Golf Equipment|     M| 47|       Married|     Retired|
|Mountaineering Eq...|     M| 21|        Single|      Retail|
|Personal Accessories|     F| 66|       Married|       Other|
|   Camp

In [9]:
print("Number of records: " + str(df_data.count()))

Number of records: 60252
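
Because `PRODUCT_LINE` is the value the model will predict, it is also worth looking at how the records are distributed across product lines. In Spark this is typically done with `df_data.groupBy('PRODUCT_LINE').count()`. The sketch below shows the same idea in plain Python on the 13 complete rows displayed above, so it is only an illustration on a tiny sample and says nothing about the full data set.

```python
from collections import Counter

# PRODUCT_LINE values of the 13 complete rows shown by df_data.show() above.
# "Mountaineering Eq..." is kept exactly as truncated in that output.
product_lines = [
    "Personal Accessories", "Personal Accessories", "Mountaineering Eq...",
    "Personal Accessories", "Golf Equipment", "Golf Equipment",
    "Camping Equipment", "Camping Equipment", "Outdoor Protection",
    "Golf Equipment", "Golf Equipment", "Mountaineering Eq...",
    "Personal Accessories",
]

# Count how often each product line occurs and print its share of the sample.
counts = Counter(product_lines)
for line, n in counts.most_common():
    print(f"{line:<22}{n:>3}  ({n / len(product_lines):.0%})")
```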


## 3. Create an Apache Spark machine learning model

In this section you will:

* 3.1 Prepare data
* 3.2 Create an Apache Spark machine learning pipeline
* 3.3 Train a model

### 3.1 Prepare data
In this subsection you will split your data into train, test, and predict data sets.

In [10]:
# Split the data roughly 80% / 18% / 2%, using a fixed seed (24).
train_data, test_data, predict_data = df_data.randomSplit([0.8, 0.18, 0.02], seed=24)

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))
print("Number of prediction records : " + str(predict_data.count()))

Number of training records: 48176
Number of testing records : 10860
Number of prediction records : 1216


As you can see, your data has been successfully split into three data sets:

* The train data set, which is the largest group, is used for training.
* The test data set is used to evaluate the model and to test its assumptions.
* The predict data set is used for prediction.
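
Note that `randomSplit` assigns rows to the splits at random, so the resulting sizes only approximate the requested weights. The short check below compares the counts printed above with the sizes the weights would give exactly; it is plain arithmetic on numbers already shown in this notebook.

```python
total = 60252  # number of records reported by df_data.count()
weights = {"train": 0.8, "test": 0.18, "predict": 0.02}
actual = {"train": 48176, "test": 10860, "predict": 1216}

# Every record ends up in exactly one of the three splits.
assert sum(actual.values()) == total

shares = {}
for name, weight in weights.items():
    expected = weight * total
    shares[name] = actual[name] / total
    print(f"{name:<8} expected ~{expected:8.1f}  actual {actual[name]:6d}  share {shares[name]:.4f}")
```

Each actual share lies within a small fraction of a percentage point of its weight, which is what you would expect from a random split of this size.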

### 3.2 Create the pipeline
In this section you will create an Apache Spark machine learning pipeline and then train the model.

First, import the Apache Spark machine learning packages needed in the subsequent steps.

In [11]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
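
Two of the imported transformers, `StringIndexer` and `IndexToString`, work as a pair: the first maps string labels to numeric indices (by default, the most frequent label gets index 0), and the second maps indices back to the original labels. The sketch below mimics that behaviour in plain Python on the `MARITAL_STATUS` values of the 13 complete rows shown earlier. It is an illustration only, not Spark's implementation, and the helper name `fit_string_index` is hypothetical.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: index for index, label in enumerate(ordered)}


# MARITAL_STATUS values of the 13 complete rows shown by df_data.show().
marital_status = [
    "Single", "Married", "Married", "Unspecified", "Married", "Married",
    "Married", "Married", "Married", "Married", "Married", "Single", "Married",
]

label_to_index = fit_string_index(marital_status)               # StringIndexer-like step
index_to_label = {i: label for label, i in label_to_index.items()}

encoded = [label_to_index[v] for v in marital_status]           # labels -> indices
decoded = [index_to_label[i] for i in encoded]                  # IndexToString-like step

print(label_to_index)
print(encoded)
```

Decoding the encoded values returns the original labels, which is how a pipeline can turn numeric predictions back into readable product line names.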