# Customer Churn Demo: Remote Spark Livy

In this notebook you will build a predictive model with Spark machine learning (SparkML) and deploy it as a REST endpoint in Machine Learning (ML), using HDP and IBM DSX. This notebook walks you through the following steps:

- Fetching data from HDFS
- Feature engineering
- Data Visualization
- Building a binary classifier model with the SparkML API
- Saving the model in the ML repository
- Deploying the model online (via UI)
- Testing the model (via UI)
- Testing the model (via REST API)

## Use Case

The analytics use case implemented in this notebook is telco churn. While it's a simple use case, it implements all steps from the CRISP-DM methodology, which is the recommended best practice for implementing predictive analytics.
![CRISP-DM](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/crisp_dm.png)

The analytics process starts with defining the business problem and identifying the data that can be used to solve it. For telco churn, we use demographic and historical transaction data. We also know which customers have churned, which is the critical information for building predictive models. In the next step, we use visual APIs for data understanding and complete some data preparation tasks. In a typical analytics project, data preparation will include more steps (for example, formatting data or deriving new variables).

Once the data is ready, we can build a predictive model. In our example we are using the SparkML Random Forest classification model. Classification is a statistical technique that assigns a "class" to each customer record (for our use case, "churn" or "no churn"). Classification models use historical data to come up with the logic to predict the "class"; this process is called model training. After the model is created, it's usually evaluated using another data set.
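
The train-then-evaluate idea can be sketched in plain Python, independent of Spark. This is a minimal illustration only: it swaps the Random Forest for a majority-class baseline, the function names are hypothetical, and the records are toy data, not the demo data set.

```python
import random
from collections import Counter

def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_baseline(train):
    """'Train' by memorizing the most frequent label in the training data."""
    labels = Counter(label for _, label in train)
    return labels.most_common(1)[0][0]

def accuracy(predicted_label, test):
    """Fraction of test records whose true label equals the predicted label."""
    return sum(1 for _, label in test if label == predicted_label) / len(test)

# Hypothetical toy records (customer_id, label); not taken from the demo data.
records = [(i, "churn" if i % 4 == 0 else "no churn") for i in range(100)]
train, test = split_records(records)
model = train_majority_baseline(train)
print(model, accuracy(model, test))
```

A real classifier should beat this baseline on the held-out records; if it does not, the model has learned little beyond the class distribution.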

Finally, if the model's accuracy meets the expectations, it can be deployed for scoring. Scoring is the process of applying the model to a new set of data. For example, when we receive new transactional data, we can score the customer for the risk of churn.  

We also developed a sample Python Flask application to illustrate deployment: http://predictcustomerchurn.mybluemix.net/. This application implements the REST client call to the model.

# I. Set up remote Spark Session

In [2]:
# spark magics
%load_ext sparkmagic.magics

In [3]:
# create spark session
# http://172.26.192.223:8999
%manage_spark

Added endpoint http://172.26.192.223:8999
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1509554242062_0005,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.
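
For orientation, the session that `%manage_spark` starts corresponds to a Livy REST call: a `POST` to `/sessions` with the session kind in the JSON body. The sketch below only builds that request and does not send it; the helper name `build_livy_session_request` is hypothetical, and it assumes the endpoint needs no authentication.

```python
import json
import urllib.request

def build_livy_session_request(endpoint, kind="pyspark"):
    """Build (but do not send) the HTTP request that asks Livy for a new session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/sessions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Endpoint taken from the cell above; nothing is sent over the network here.
request = build_livy_session_request("http://172.26.192.223:8999")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In the notebook itself, sparkmagic issues this call and tracks the session state for you, so the sketch is useful mainly for debugging the endpoint by hand.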


# II. Data Ingestion

In [4]:
%%spark
from pyspark.sql import SQLContext
from pyspark import SparkContext

## Step 1: Data validation over webHDFS

In [5]:
# Customer data
!curl -i -L "http://edwdemo0.field.hortonworks.com:50070/webhdfs/v1/user/dsx_datasets/customer.csv?op=OPEN" | tail -n 5

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  272k  100  272k    0     0  2791k      0 --:--:-- --:--:-- --:--:-- 2791k
3821,"F","S",0.000000,78851.300000,"N",48.373333,0.370000,0.000000,28.660000,0.000000,"CC","FreeLocal","Standard",29.040000,4.000000
3822,"F","S",1.000000,17540.700000,"Y",62.786667,22.170000,0.570000,13.450000,0.000000,"Auto","Budget","Standard",36.200000,1.000000
3823,"F","M",0.000000,83891.900000,"Y",61.020000,28.920000,0.000000,45.470000,0.000000,"CH","Budget","Standard",74.400000,4.000000
3824,"F","M",2.000000,28220.800000,"N",38.766667,26.490000,0.000000,12.460000,0.000000,"CC","FreeLocal","Standard",38.950000,4.000000
3825,"F","S",0.000000,28589.100000,"N",15.600000,13.190000,0.000000,87.090000,0.000000,"CC","FreeLocal","Standard",100.280000,3.000000
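
The URL passed to `curl` follows the WebHDFS pattern `http://<host>:<port>/webhdfs/v1<path>?op=OPEN`. A small helper can assemble it for any file; `webhdfs_open_url` is a hypothetical name, and the default port simply mirrors the one used in this demo.

```python
from urllib.parse import quote, urlencode

def webhdfs_open_url(host, path, port=50070):
    """Return the WebHDFS URL that reads the file at an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("path must be an absolute HDFS path")
    query = urlencode({"op": "OPEN"})
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), query)

# Host and path taken from the curl command in this step.
url = webhdfs_open_url("edwdemo0.field.hortonworks.com", "/user/dsx_datasets/customer.csv")
print(url)
```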


In [6]:
# Churn data
!curl -i -L "http://edwdemo0.field.hortonworks.com:50070/webhdfs/v1/user/dsx_datasets/churn.csv?op=OPEN" | tail -n 5

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 20079  100 20079    0     0   263k      0 --:--:-- --:--:-- --:--:--  263k
3821,"T"
3822,"T"
3823,"T"
3824,"T"
3825,"T"
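
A quick sanity check on the two tails is to parse the last row of each file and compare the customer IDs. The sketch below uses Python's `csv` module on the final rows copied from the outputs above; `parse_rows` is a hypothetical helper, and matching last IDs only suggest, not prove, that the files line up.

```python
import csv
import io

def parse_rows(text):
    """Parse CSV text into a list of rows, each a list of string fields."""
    return [row for row in csv.reader(io.StringIO(text)) if row]

# Last customer row and last churn row, copied from the curl outputs above.
customer_tail = (
    '3825,"F","S",0.000000,28589.100000,"N",15.600000,13.190000,'
    '0.000000,87.090000,0.000000,"CC","FreeLocal","Standard",100.280000,3.000000\n'
)
churn_tail = '3825,"T"\n'

customer_rows = parse_rows(customer_tail)
churn_rows = parse_rows(churn_tail)

# Compare the ID column (first field) of the final row in each file.
same_last_id = customer_rows[-1][0] == churn_rows[-1][0]
print(len(customer_rows[0]), len(churn_rows[0]), same_last_id)
```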


## Step 2: Data Load
**Note:** After you run the cell below, you can check your YARN UI to verify that a job has been triggered.

In [7]:
%%spark

# Customer Information
# Add asset from file system
customer = SQLContext(sc).read.csv('hdfs://edwdemo0.field.hortonworks.com:8020/user/dsx_datasets/customer.csv', header=True, inferSchema=True)
#customer = sqlContext(sc).load(source="hdfs://edwdemo0.field.hortonworks.com:8020/user/dsx_datasets", header="true", path = "customer.csv")
#customer = SQLContext(sc).read.format("csv").options(header="true").options(inferSchema="true").load("hdfs://edwdemo0.field.hortonworks.com:8020/user/dsx_datasets/customer.csv")
customer.show(5)


# Churn information
# Add asset from file system
customer_churn = SQLContext(sc).read.csv('hdfs://edwdemo0.field.hortonworks.com:8020/user/dsx_datasets/churn.csv', header=True, inferSchema=True)
#customer_churn = sqlContext(sc).load(source="hdfs://edwdemo0.field.hortonworks.com:8020/user/dsx_datasets", header="true", path = "churn.csv")
#customer_churn = SQLContext(sc).read.format("csv").options(header="true").options(inferSchema="true").load("hdfs://edwdemo0.field.hortonworks.com:8020/user/dsx_datasets/churn.csv")
customer_churn.show(5)

+---+------+------+--------+----------+---------+---------+------------+-------------+------+-------+---------+-------------+--------------------+------+--------+
| ID|Gender|Status|Children|Est Income|Car Owner|      Age|LongDistance|International| Local|Dropped|Paymethod|LocalBilltype|LongDistanceBilltype| Usage|RatePlan|
+---+------+------+--------+----------+---------+---------+------------+-------------+------+-------+---------+-------------+--------------------+------+--------+
|  1|     F|     S|     1.0|   38000.0|        N|24.393333|       23.56|          0.0|206.08|    0.0|       CC|       Budget|      Intnl_discount|229.64|     3.0|
|  6|     M|     M|     2.0|   29616.0|        N|49.426667|       29.78|          0.0|  45.5|    0.0|       CH|    FreeLocal|            Standard| 75.29|     2.0|
|  8|     M|     M|     0.0|   19732.8|        N|50.673333|       24.81|          0.0| 22.44|    0.0|       CC|    FreeLocal|            Standard| 47.25|     3.0|
| 11|     M|     S|   