## Binary Classification Example
### http://bit.ly/2jWiCQO

In this notebook, we will build a binary classification application using the MLlib Pipelines API (https://spark.apache.org/docs/latest/ml-pipeline.html).
The Pipelines API provides a higher-level API built on top of DataFrames for constructing ML pipelines.

#### You can read more about the Pipelines API in the programming guide - http://spark.apache.org/docs/latest/ml-guide.html

Binary Classification is the task of predicting a binary label. 
E.g., is an email spam or not spam? Should I show this ad to this user or not? 
Will it rain tomorrow or not?
This notebook demonstrates algorithms for making these types of predictions.

## Dataset Review
The Adult dataset we are going to use is publicly available at the UCI Machine Learning Repository. 
This data derives from census data, and consists of information about 48842 individuals and their annual income. 
We will use this information to predict if an individual earns >50K a year or <=50K a year.
The dataset is rather clean, and consists of both numeric and categorical variables.

## Attribute Information:

- age: continuous
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
- fnlwgt: continuous
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc...
- education-num: continuous
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent...
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners...
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
- sex: Female, Male
- capital-gain: continuous
- capital-loss: continuous
- hours-per-week: continuous
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany...
- Target/Label: <=50K, >50K

In [1]:
import pixiedust

Pixiedust database opened successfully


In [3]:
from pyspark.sql.types import *

schema = StructType([
        StructField("age", DoubleType(), True),
        StructField("workclass", StringType(), True),
        StructField("fnlwgt", DoubleType(), True),
        StructField("education", StringType(), True),
        StructField("education_num", DoubleType(), True),
        StructField("marital_status", StringType(), True),
        StructField("occupation", StringType(), True),
        StructField("relationship", StringType(), True),
        StructField("race", StringType(), True),
        StructField("sex", StringType(), True),
        StructField("capital_gain", DoubleType(), True),
        StructField("capital_loss", DoubleType(), True),
        StructField("hours_per_week", DoubleType(), True),
        StructField("native_country", StringType(), True),
        StructField("income", StringType(), True)        
])

In [4]:
from pyspark.sql import SparkSession

# @hidden_cell
# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def set_hadoop_config_with_credentials_19099026f8df40b6aec4353c7e897e95(name):
    """This function sets the Hadoop configuration so it is possible to
    access data from Bluemix Object Storage using Spark"""

    prefix = 'fs.swift.service.' + name
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/v3/auth/tokens')
    hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')
    hconf.set(prefix + '.tenant', 'cc29768790ec45439a43668592b02f84')
    hconf.set(prefix + '.username', 'a55ccc8b825944fa90f0188f8e5a2ffc')
    hconf.set(prefix + '.password', 'Q#i79zYI{qV?d74u')
    hconf.setInt(prefix + '.http.port', 8080)
    hconf.set(prefix + '.region', 'dallas')
    hconf.setBoolean(prefix + '.public', False)

# you can choose any name
name = 'keystone'
set_hadoop_config_with_credentials_19099026f8df40b6aec4353c7e897e95(name)

spark = SparkSession.builder.getOrCreate()

# Please read the documentation of PySpark to learn more about the possibilities to load data files.
# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# The SparkSession object is already initialized for you.
# The following variable contains the path to your file on your Object Storage.
path_1 = "swift://Databricks." + name + "/adult.data"

In [5]:
dataset = (spark.read
  .schema(schema)
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
  .option('header', 'true')
  .load(path_1))
  
# dataset.take(5)
display(dataset)


In [6]:
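dataset.printSchema()

root
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)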

In [7]:
cols = dataset.columns

### Preprocess Data
Since we are going to try algorithms like Logistic Regression, we will have to convert the categorical variables in the dataset into numeric variables. 
There are 2 ways we can do this.

### Category Indexing
This assigns each category a numeric index from {0, 1, 2, ..., numCategories-1}.
This introduces an implicit ordering among your categories, and is more suitable for ordinal variables (e.g., Poor: 0, Average: 1, Good: 2).
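
As a rough illustration of category indexing, the sketch below assigns indices by descending frequency, which is how StringIndexer orders labels by default. The helper name `index_categories`, the alphabetical tie-breaking, and the sample values are assumptions made for this example; it is plain Python, not the Spark implementation.

```python
from collections import Counter


def index_categories(values):
    """Map each distinct category to an index, most frequent first.

    Ties are broken alphabetically here (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {category: float(i) for i, category in enumerate(ordered)}


# Hypothetical sample of the "workclass" column
workclass = ["Private", "Private", "Self-emp-inc", "Private", "State-gov", "State-gov"]
mapping = index_categories(workclass)
indexed = [mapping[w] for w in workclass]
print(mapping)
print(indexed)
```

The most frequent value receives index 0.0, so the numeric order reflects frequency rather than any meaning in the categories themselves.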

### One-Hot Encoding (http://spark.apache.org/docs/latest/ml-features.html#onehotencoder)
This converts categories into binary vectors with at most one nonzero value (e.g., Blue: [1, 0], Green: [0, 1], Red: [0, 0]).
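
The Blue/Green/Red example can be reproduced with a small plain-Python sketch. It assumes the categories have already been indexed (Blue: 0, Green: 1, Red: 2) and that the last category is dropped, which is why Red maps to all zeros. The function name `one_hot` is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots, so the
    last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


color_index = {"Blue": 0, "Green": 1, "Red": 2}
encoded = {color: one_hot(i, len(color_index)) for color, i in color_index.items()}
print(encoded)
```

Dropping the last slot loses no information, because a row that is neither Blue nor Green must be Red.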

In this dataset, we have ordinal variables like education (Preschool - Doctorate), and also nominal variables like relationship (Wife, Husband, Own-child, etc). 
For simplicity’s sake, we will use One-Hot Encoding to convert all categorical variables into binary vectors. 
It is possible here to improve prediction accuracy by converting each categorical column with an appropriate method.

Here, we will use a combination of StringIndexer (http://spark.apache.org/docs/latest/ml-features.html#stringindexer) and OneHotEncoder to convert the categorical variables. 
The OneHotEncoder will return a SparseVector.

Since we will have more than one stage of feature transformations, we use a Pipeline to tie the stages together.
This simplifies our code.

In [8]:
###One-Hot Encoding
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
  # Category Indexing with StringIndexer
  stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
  # Use OneHotEncoder to convert categorical variables into binary SparseVectors
  encoder = OneHotEncoder(inputCol=categoricalCol+"Index", outputCol=categoricalCol+"classVec")
  # Add stages.  These are not run here, but will run all at once later on.
  stages += [stringIndexer, encoder]

The above code indexes each categorical column using the StringIndexer, and then converts the indexed categories into one-hot encoded variables.
The resulting output has the binary vectors appended to the end of each row.

We use the StringIndexer again here to encode our labels to label indices.

In [9]:
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol = "income", outputCol = "label")
stages += [label_stringIdx]

Next, we will use the VectorAssembler (http://spark.apache.org/docs/latest/ml-features.html#vectorassembler) to combine all the feature columns into a single vector column. 
This will include both the numeric columns and the one-hot encoded binary vector columns in our dataset.
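
To make the assembled output easier to read, here is a minimal plain-Python sketch that concatenates several per-column vectors and stores the result in the sparse (size, indices, values) form that appears in the features column of this notebook's output. The function `assemble_sparse` and the toy inputs are assumptions for illustration only.

```python
def assemble_sparse(parts):
    """Concatenate dense lists and return (size, indices, values) of the nonzeros."""
    dense = [x for part in parts for x in part]
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


# Toy row: two one-hot encoded columns followed by two numeric columns
sex_vec = [1.0]             # hypothetical one-slot encoding
race_vec = [0.0, 1.0, 0.0]  # hypothetical three-slot encoding
numeric = [50.0, 0.0]       # e.g. age and capital_gain
size, indices, values = assemble_sparse([sex_vec, race_vec, numeric])
print((size, indices, values))
```

Only the nonzero positions are stored, which keeps rows compact when most one-hot slots are zero.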

In [10]:
# Transform all features into a vector using VectorAssembler
numericCols = ["age", "fnlwgt", "education_num", "capital_gain", "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

We finally run our stages as a Pipeline. 
This puts the data through all of the feature transformations we described in a single call.

In [None]:
# Create a Pipeline. https://spark.apache.org/docs/latest/ml-pipeline.html
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)

# Keep relevant columns
selectedcols = ["label", "features"] + cols
dataset = dataset.select(selectedcols)
display(dataset)
#dataset.take(5)

label,features,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0.0,"(100,[1,10,23,31,43,48,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,83311.0,13.0,13.0])",50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K
0.0,"(100,[0,8,25,38,44,48,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,38.0,215646.0,9.0,40.0])",38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K
0.0,"(100,[0,13,23,38,43,49,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,53.0,234721.0,7.0,40.0])",53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K
0.0,"(100,[0,10,23,29,47,49,62,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,338409.0,13.0,40.0])",28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
0.0,"(100,[0,11,23,31,47,48,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,37.0,284582.0,14.0,40.0])",37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K
0.0,"(100,[0,18,28,34,44,49,64,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,49.0,160187.0,5.0,16.0])",49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K
1.0,"(100,[1,8,23,31,43,48,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,52.0,209642.0,9.0,45.0])",52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K
1.0,"(100,[0,11,24,29,44,48,53,94,95,96,97,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,31.0,45781.0,14.0,14084.0,50.0])",31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K
1.0,"(100,[0,10,23,31,43,48,52,53,94,95,96,97,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,42.0,159449.0,13.0,5178.0,40.0])",42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K
1.0,"(100,[0,9,23,31,43,49,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,37.0,280464.0,10.0,80.0])",37.0,Private,280464.0,Some-college,10.0,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0.0,0.0,80.0,United-States,>50K


In [12]:
# Randomly split data into training and test sets. Set a seed for reproducibility.
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())

22837
9723
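
The counts above are close to, but not exactly, a 70/30 split because each row is assigned at random. The sketch below mimics that idea in plain Python with an independent random draw per row and a fixed seed. `random_split` is a hypothetical helper, and it will not reproduce Spark's actual row assignment.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows, 0.7, seed=100)
print(len(train))
print(len(test))
```

Calling the function again with the same seed returns the same partition, which is the point of fixing the seed.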


### Fit and Evaluate Models
We are now ready to try out some of the Binary Classification algorithms available in the Pipelines API.

Out of these algorithms, the below are also capable of supporting multiclass classification with the Python API:

- Decision Tree Classifier (http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier)
- Random Forest Classifier (http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier)

These are the general steps we will take to build our models:

- Create initial model using the training set
- Tune parameters with a ParamGrid and 5-fold Cross Validation
- Evaluate the best model obtained from the Cross Validation using the test set

We will be using the BinaryClassificationEvaluator to evaluate our models. The default metric used here is areaUnderROC.

### Logistic Regression
You can read more about Logistic Regression from the Programming Guide here http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression. 
In the Pipelines API, we are now able to perform Elastic-Net Regularization with Logistic Regression, as well as other linear methods.
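
As a hedged illustration of what Elastic-Net Regularization adds to the loss, the sketch below computes the penalty term in plain Python, following the form given in the Spark linear methods guide: regParam * (alpha * L1 + (1 - alpha) / 2 * L2 squared), where alpha is elasticNetParam. The function name and sample weights are assumptions for this example.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: reg_param * (a * L1 + (1 - a) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    a = elastic_net_param
    return reg_param * (a * l1 + (1.0 - a) / 2.0 * l2_squared)


weights = [0.5, -1.0, 0.0]
l2_only = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=0.0)
l1_only = elastic_net_penalty(weights, reg_param=0.5, elastic_net_param=1.0)
print(l2_only)
print(l1_only)
```

Setting elasticNetParam between 0 and 1 blends the two penalties, and regParam scales the overall strength.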

Note: As of Spark 2.0.0, the Python API does not yet support multiclass classification for Logistic Regression, but support is expected in a future release.

In [13]:
from pyspark.ml.classification import LogisticRegression

# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="features", maxIter=10)

# Train model with Training Data
lrModel = lr.fit(trainingData)

In [14]:
# Make predictions on test data using the transform() method.
# LogisticRegressionModel.transform() will only use the 'features' column.
predictions = lrModel.transform(testData)

In [15]:
predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [None]:
# View model's predictions and probabilities of each prediction class
# You can select any columns in the above schema to view as well. For example's sake we will choose age & occupation
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)
#selected.take(5)

label,prediction,probability,age,occupation
0.0,0.0,"[0.663426946702,0.336573053298]",26.0,Prof-specialty
0.0,0.0,"[0.627803801305,0.372196198695]",30.0,Prof-specialty
0.0,0.0,"[0.627803801305,0.372196198695]",31.0,Prof-specialty
0.0,0.0,"[0.627803801305,0.372196198695]",32.0,Prof-specialty
0.0,0.0,"[0.577568322911,0.422431677089]",39.0,Prof-specialty
0.0,0.0,"[0.577568322911,0.422431677089]",47.0,Prof-specialty
0.0,0.0,"[0.596380070964,0.403619929036]",50.0,Prof-specialty
0.0,0.0,"[0.596380070964,0.403619929036]",51.0,Prof-specialty
0.0,0.0,"[0.607867531817,0.392132468183]",60.0,Prof-specialty
0.0,0.0,"[0.607867531817,0.392132468183]",61.0,Prof-specialty
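
Each probability vector in the table above holds two entries, one per class, and the prediction follows from comparing the second entry with a threshold. The sketch below assumes the default threshold of 0.5 for binary logistic regression. The helper name `predict_from_probability` is hypothetical.

```python
def predict_from_probability(probability, threshold=0.5):
    """Return 1.0 when P(label = 1) exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0


# Probability vectors copied from rows of the table above
rows = [
    [0.663426946702, 0.336573053298],
    [0.627803801305, 0.372196198695],
    [0.577568322911, 0.422431677089],
]
predictions = [predict_from_probability(p) for p in rows]
print(predictions)
```

Lowering the threshold, for example to 0.4, would flip the third row to class 1.0, trading more positive predictions for more false positives.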


We can use the BinaryClassificationEvaluator to evaluate our model.
The evaluator expects two input columns: rawPrediction and label.

In [17]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(predictions)

0.9032867661805299

Note that the default metric for the BinaryClassificationEvaluator is areaUnderROC.

In [18]:
evaluator.getMetricName()

'areaUnderROC'

The evaluator currently accepts 2 kinds of metrics - areaUnderROC and areaUnderPR. We can set it to areaUnderPR by using evaluator.setMetricName("areaUnderPR").
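
For intuition about areaUnderROC, the sketch below computes it in plain Python as the fraction of (positive, negative) pairs in which the positive example receives the higher score. This pairwise formulation is a swapped-in technique chosen for readability, not necessarily how the evaluator computes the metric internally. The function name and toy data are assumptions.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(n^2) and meant only for
    small illustrative inputs.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1.0]
    negatives = [s for l, s in zip(labels, scores) if l == 0.0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0.0, 0.0, 1.0, 1.0]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.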

Now we will try tuning the model with the ParamGridBuilder and the CrossValidator.

If you are unsure what params are available for tuning, you can use explainParams() to print a list of all params and their definitions.

In [19]:
print(lr.explainParams())

elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
featuresCol: features column name. (default: features, current: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label, current: label)
maxIter: max number of iterations (>= 0). (default: 100, current: 10)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
regParam: regularization parameter (>= 0). (default: 0.0)
standardization: whether to standardize the training features before fitting the model. (default: True)
thresho

As we indicate 3 values for regParam, 3 values for maxIter, and 3 values for elasticNetParam, this grid will have 3 x 3 x 3 = 27 parameter settings for CrossValidator to choose from. We will create a 5-fold cross validator.

In [20]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.5, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .addGrid(lr.maxIter, [1, 5, 10])
             .build())
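
The grid built above is the Cartesian product of the listed values, so three parameters with three values each give 27 combinations. A plain-Python sketch of that enumeration, using ordinary dictionaries as stand-ins for Spark's param maps (an assumption of this example), looks like this:

```python
from itertools import product

grid = {
    "regParam": [0.01, 0.5, 2.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [1, 5, 10],
}
names = sorted(grid)
# One dictionary per combination of parameter values
param_maps = [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]
print(len(param_maps))
print(param_maps[0])
```

Adding one more value to any parameter multiplies the number of settings, so grids grow quickly.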

In [21]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
# This will likely take a fair amount of time because of the number of models that we are creating and testing.
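
To show why cross-validation is slow, here is a minimal plain-Python sketch of 5-fold splitting: each fold is held out once for validation while the rest is used for training, and this is repeated for every grid point. The strided fold assignment and the name `k_fold_indices` are assumptions of this sketch. Spark assigns rows to folds randomly.

```python
def k_fold_indices(num_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(num_rows))
    for fold in range(num_folds):
        validation = indices[fold::num_folds]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


folds = list(k_fold_indices(num_rows=10, num_folds=5))
for train, validation in folds:
    print((validation, len(train)))

# Model fits needed for a 27-point grid with 5 folds
total_fits = 27 * 5
print(total_fits)
```

Every row is used for validation exactly once, and the number of fits is the grid size multiplied by the number of folds.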

In [22]:
# Use test set here so we can measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [23]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.9008296135002725

We can also access the model’s feature weights and intercept easily.

In [24]:
print('Model Intercept: ', cvModel.bestModel.intercept)

Model Intercept:  -1.38678540211


In [25]:
# weights = cvModel.bestModel.weights
# On Spark 2.x, the weights are available as coefficients
weights = cvModel.bestModel.coefficients
weights = [(float(w),) for w in weights]  # convert each numpy value to a float in a one-element tuple
weightsDF = spark.createDataFrame(weights, ["Feature Weight"])
display(weightsDF)
#weightsDF.take(5)

Feature Weight
-0.301328824794
-0.650958361656
-0.411390271109
-0.527199911881
-0.500089260447
-0.07460528375
0.216620937681
-2.50965912618
-0.560537834434
-0.238684509585


In [None]:
# View best model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)
#selected.take(5)


### Decision Trees

You can read more about Decision Trees in the Spark MLlib Programming Guide here (http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier)

The Decision Trees algorithm is popular because it handles categorical data and works out of the box with multiclass classification tasks.

In [27]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3)

# Train model with Training Data
dtModel = dt.fit(trainingData)

We can extract the number of nodes in our decision tree as well as the tree depth of our model.

In [28]:
print("numNodes = ", dtModel.numNodes)
print("depth = ", dtModel.depth)

numNodes =  15
depth =  3
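
The reported 15 nodes at depth 3 is exactly the size of a complete binary tree of that depth, which fits with the trees here using binary splits. A small sketch of the relationship follows. The function name `max_nodes` is hypothetical.

```python
def max_nodes(depth):
    """Node count of a complete binary tree with the given depth."""
    return 2 ** (depth + 1) - 1


print(max_nodes(3))
print(max_nodes(10))
```

A deeper tree can hold far more nodes than this upper bound suggests it will actually use, since branches stop splitting when no useful split remains.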


In [29]:
# Make predictions on test data using the Transformer.transform() method.
predictions = dtModel.transform(testData)

In [30]:
predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [None]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)
#selected.take(5)


We will evaluate our Decision Tree model with BinaryClassificationEvaluator.

In [32]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

0.615621264647738

Entropy and Gini impurity are the supported measures of impurity for Decision Trees. The default is Gini. Changing this value is simple: dt.setImpurity("entropy").

In [33]:
dt.getImpurity()

'gini'
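
For reference, the two impurity measures can be written in a few lines of plain Python. This is a sketch over class counts at a single node. The function names are hypothetical, and the base-2 logarithm for entropy is an assumption of this example.

```python
import math


def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = float(sum(class_counts))
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def entropy(class_counts):
    """Entropy in bits, skipping empty classes."""
    total = float(sum(class_counts))
    return -sum((c / total) * math.log(c / total, 2) for c in class_counts if c > 0)


mixed = [50, 50]   # evenly mixed node
pure = [100, 0]    # node containing a single class
print(gini(mixed))
print(entropy(mixed))
print(gini(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so a split is judged by how much it reduces them.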

Now we will try tuning the model with the ParamGridBuilder and the CrossValidator.

As we indicate 4 values for maxDepth and 3 values for maxBins, this grid will have 4 x 3 = 12 parameter settings for CrossValidator to choose from. We will create a 5-fold CrossValidator.

In [34]:
# Create ParamGrid for Cross Validation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6,10])
             .addGrid(dt.maxBins, [20,40,80])
             .build())

In [35]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
cvModel = cv.fit(trainingData)
# Takes ~5 minutes

In [36]:
print("numNodes = ", cvModel.bestModel.numNodes)
print("depth = ", cvModel.bestModel.depth)

numNodes =  491
depth =  10


In [37]:
# Use test set here so we can measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [38]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.7896884508494975

In [None]:
# View Best model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)
#selected.take(5)


### Random Forest

Random Forests use an ensemble of trees to improve model accuracy.

You can read more about Random Forests in the programming guide here (http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier)
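
As a simplified picture of how an ensemble combines its trees, the sketch below averages per-tree class probabilities and predicts the class with the highest average (soft voting). The tree outputs are made-up values and `forest_probability` is a hypothetical helper. It illustrates the idea rather than Spark's exact procedure.

```python
def forest_probability(tree_probabilities):
    """Average per-tree class probabilities (soft voting)."""
    num_trees = len(tree_probabilities)
    num_classes = len(tree_probabilities[0])
    return [sum(tree[c] for tree in tree_probabilities) / num_trees
            for c in range(num_classes)]


# Hypothetical outputs of three trees for one row: [P(<=50K), P(>50K)]
trees = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
probability = forest_probability(trees)
prediction = float(probability.index(max(probability)))
print(probability)
print(prediction)
```

Averaging smooths out the errors of individual trees, which is why an ensemble is usually more stable than any single tree.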

In [40]:
from pyspark.ml.classification import RandomForestClassifier

# Create an initial RandomForest model.
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Train model with Training Data
rfModel = rf.fit(trainingData)

In [41]:
# Make predictions on test data using the Transformer.transform() method.
predictions = rfModel.transform(testData)

In [42]:
predictions.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: double (nullable = true)
 |-- education: string (nullable = true)
 |-- education_num: double (nullable = true)
 |-- marital_status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital_gain: double (nullable = true)
 |-- capital_loss: double (nullable = true)
 |-- hours_per_week: double (nullable = true)
 |-- native_country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [None]:
# View model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)


We will evaluate our Random Forest model with BinaryClassificationEvaluator.

In [44]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate model
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(predictions)

0.8890153568461264

Now we will try tuning the model with the ParamGridBuilder and the CrossValidator.

As we indicate 3 values for maxDepth, 2 values for maxBins, and 2 values for numTrees, this grid will have 3 x 2 x 2 = 12 parameter settings for CrossValidator to choose from. We will create a 5-fold CrossValidator.

In [45]:
# Create ParamGrid for Cross Validation
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 4, 6])
             .addGrid(rf.maxBins, [20, 60])
             .addGrid(rf.numTrees, [5, 20])
             .build())

In [46]:
# Create 5-fold CrossValidator
cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
cvModel = cv.fit(trainingData)

In [47]:
# Use test set here so we can measure the accuracy of our model on new data
predictions = cvModel.transform(testData)

In [48]:
# cvModel uses the best model found from the Cross Validation
# Evaluate best model
evaluator.evaluate(predictions)

0.8980864395610777

In [49]:
# View Best model's predictions and probabilities of each prediction class
selected = predictions.select("label", "prediction", "probability", "age", "occupation")
display(selected)


# Make Predictions

Because Random Forest gives us the best areaUnderROC value, we will use its bestModel for deployment and use it to generate predictions on new data. In this example, we will simulate this by generating predictions on the entire dataset.

In [50]:
bestModel = cvModel.bestModel

In [51]:
# Generate predictions for entire dataset
finalPredictions = bestModel.transform(dataset)

In [52]:
# Evaluate best model on the entire dataset (this includes the training rows, so the score is optimistic)
evaluator.evaluate(finalPredictions)

0.9032539827497698

In this example, we will also look into predictions grouped by age and occupation.

In [53]:
finalPredictions.createOrReplaceTempView("finalPredictions")

In [54]:
#https://spark.apache.org/docs/2.0.0-preview/sql-programming-guide.html#sql

In [55]:
display(spark.sql("select * FROM finalPredictions"))

label,features,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,rawPrediction,probability,prediction
0.0,"(100,[1,10,23,31,43,48,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,50.0,83311.0,13.0,13.0])",50.0,Self-emp-not-inc,83311.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K,"[7.34840503579,12.6515949642]","[0.367420251789,0.632579748211]",1.0
0.0,"(100,[0,8,25,38,44,48,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,38.0,215646.0,9.0,40.0])",38.0,Private,215646.0,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,United-States,<=50K,"[18.3017203792,1.69827962082]","[0.915086018959,0.0849139810408]",0.0
0.0,"(100,[0,13,23,38,43,49,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,53.0,234721.0,7.0,40.0])",53.0,Private,234721.0,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,United-States,<=50K,"[14.2981444413,5.70185555868]","[0.714907222066,0.285092777934]",0.0
0.0,"(100,[0,10,23,29,47,49,62,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,28.0,338409.0,13.0,40.0])",28.0,Private,338409.0,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,"[11.6968601566,8.30313984343]","[0.584843007829,0.415156992171]",0.0
0.0,"(100,[0,11,23,31,47,48,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,37.0,284582.0,14.0,40.0])",37.0,Private,284582.0,Masters,14.0,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.0,0.0,40.0,United-States,<=50K,"[9.67632051629,10.3236794837]","[0.483816025815,0.516183974185]",1.0
0.0,"(100,[0,18,28,34,44,49,64,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,49.0,160187.0,5.0,16.0])",49.0,Private,160187.0,9th,5.0,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.0,0.0,16.0,Jamaica,<=50K,"[18.5072313584,1.49276864161]","[0.925361567919,0.0746384320806]",0.0
1.0,"(100,[1,8,23,31,43,48,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,52.0,209642.0,9.0,45.0])",52.0,Self-emp-not-inc,209642.0,HS-grad,9.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,45.0,United-States,>50K,"[11.526060488,8.47393951199]","[0.576303024401,0.423696975599]",0.0
1.0,"(100,[0,11,24,29,44,48,53,94,95,96,97,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,31.0,45781.0,14.0,14084.0,50.0])",31.0,Private,45781.0,Masters,14.0,Never-married,Prof-specialty,Not-in-family,White,Female,14084.0,0.0,50.0,United-States,>50K,"[10.7832418759,9.21675812409]","[0.539162093795,0.460837906205]",0.0
1.0,"(100,[0,10,23,31,43,48,52,53,94,95,96,97,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,42.0,159449.0,13.0,5178.0,40.0])",42.0,Private,159449.0,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178.0,0.0,40.0,United-States,>50K,"[4.46063987767,15.5393601223]","[0.223031993883,0.776968006117]",1.0
1.0,"(100,[0,9,23,31,43,49,52,53,94,95,96,99],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,37.0,280464.0,10.0,80.0])",37.0,Private,280464.0,Some-college,10.0,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0.0,0.0,80.0,United-States,>50K,"[11.2875551546,8.71244484537]","[0.564377757731,0.435622242269]",0.0


#### In an operational environment, analysts may use a similar machine learning pipeline to obtain predictions on new data, organize them into a table, and use that table for analysis or lead targeting.

In [56]:
# http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/SparkSession.html

In [57]:
display(spark.sql("SELECT occupation, prediction, count(*) as count FROM finalPredictions GROUP BY occupation, prediction ORDER BY occupation"))

occupation,prediction,count
?,0.0,1710
?,1.0,133
Adm-clerical,1.0,199
Adm-clerical,0.0,3570
Armed-Forces,1.0,1
Armed-Forces,0.0,8
Craft-repair,0.0,3822
Craft-repair,1.0,277
Exec-managerial,1.0,1435
Exec-managerial,0.0,2631
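
For lead targeting, the raw counts are easier to compare as a share of each occupation predicted to earn >50K (prediction 1.0). The sketch below does that arithmetic in plain Python on the rows shown above. Only the displayed rows are used, so the ranking covers just these five occupations, and the variable names are illustrative.

```python
from collections import defaultdict

# (occupation, prediction, count) rows copied from the displayed output
rows = [
    ("?", 0.0, 1710), ("?", 1.0, 133),
    ("Adm-clerical", 1.0, 199), ("Adm-clerical", 0.0, 3570),
    ("Armed-Forces", 1.0, 1), ("Armed-Forces", 0.0, 8),
    ("Craft-repair", 0.0, 3822), ("Craft-repair", 1.0, 277),
    ("Exec-managerial", 1.0, 1435), ("Exec-managerial", 0.0, 2631),
]

totals = defaultdict(int)
positives = defaultdict(int)
for occupation, prediction, count in rows:
    totals[occupation] += count
    if prediction == 1.0:
        positives[occupation] += count

# Share of each occupation predicted to earn >50K
positive_share = {occ: positives[occ] / totals[occ] for occ in totals}
top_occupation = max(positive_share, key=positive_share.get)

for occ, share in sorted(positive_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{occ}: {share:.3f}")
```

Shares based on very small groups, such as the nine Armed-Forces rows, should be read with caution.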


In [58]:
display(spark.sql("SELECT age, prediction, count(*) AS count FROM finalPredictions GROUP BY age, prediction ORDER BY age"))

age,prediction,count
17.0,0.0,395
18.0,0.0,550
19.0,0.0,712
20.0,0.0,753
21.0,0.0,720
22.0,0.0,764
22.0,1.0,1
23.0,0.0,874
23.0,1.0,3
24.0,0.0,792


In [59]:
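# Same query through the legacy sqlContext entry point, kept in Spark 2.0 for backward compatibility; spark.sql is preferred
display(sqlContext.sql("SELECT age, prediction, count(*) AS count FROM finalPredictions GROUP BY age, prediction ORDER BY age"))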

age,prediction,count
17.0,0.0,395
18.0,0.0,550
19.0,0.0,712
20.0,0.0,753
21.0,0.0,720
22.0,0.0,764
22.0,1.0,1
23.0,0.0,874
23.0,1.0,3
24.0,0.0,792


In [60]:
#https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-Catalog.html

In [61]:
spark.sql("SHOW TABLES").show()

+----------------+-----------+
|       tableName|isTemporary|
+----------------+-----------+
|finalpredictions|       true|
+----------------+-----------+



In [62]:
spark.catalog.listTables()

[Table(name=u'finalpredictions', database=None, description=None, tableType=u'TEMPORARY', isTemporary=True)]