
# Spark Supervised Machine Learning

Keep the main [Spark ML documentation](https://spark.apache.org/docs/latest/ml-pipeline.html) open as you go through this tutorial. MLlib is Spark’s machine learning (ML) library. **Spark ML** is not an official name, but we will use it to refer to the MLlib DataFrame-based API that embraces ML pipelines. Before we get into Spark ML through a couple of examples, we first review Spark DataFrames.

In [1]:
import re
import os
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn')
%matplotlib inline

DATA_DIR = os.path.join("..","data")

## Spark SQL and DataFrames

What is Spark SQL?
- Spark SQL takes basic RDDs and **puts a schema on them**.

What is a DataFrame?
- DataFrames are the primary abstraction in Spark SQL.
- Think of DataFrames as **RDDs with a schema**.

What are **schemas**?
- Schemas are metadata about your data.
- Schema = Table Names + Column Names + Column Types

What are the pros of schemas?
- Schemas enable using **column names** instead of column positions
- Schemas enable **queries** using SQL and DataFrame syntax
- Schemas make your data more **structured**.
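
As a rough illustration of what a schema buys you, the following plain-Python sketch (not Spark code) attaches column names and types to one raw CSV line. The `SCHEMA` list, the `Row` tuple, and the `apply_schema` helper are hypothetical names made up for this example; the column names mirror the small example file used later in this tutorial.

```python
from collections import namedtuple

# Hypothetical schema: column names plus the Python type of each column.
SCHEMA = [
    ("user_id", int),
    ("date", str),
    ("num_streams", int),
    ("country", str),
    ("invoice_item", int),
]

Row = namedtuple("Row", [name for name, _ in SCHEMA])


def apply_schema(raw_fields):
    """Cast each raw string field to its declared type and give it a name."""
    return Row(*(cast(value) for (_, cast), value in zip(SCHEMA, raw_fields)))


raw = "111,10/13/2019,4,US,1".split(",")
row = apply_schema(raw)

# Without a schema: access by position, and the value is still a string.
print(raw[2])
# With a schema: access by column name, and the value has the declared type.
print(row.num_streams + 1)
```

Spark DataFrames go further than this sketch, since the schema also lets Spark plan and optimize queries, but the name-plus-type idea is the same.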

See the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-programming-guide.html) as a main point of reference for Spark SQL, DataFrames and Datasets.

## Creating DataFrames

You can create a DataFrame from an existing RDD, whatever source that RDD was built from, as long as you add a schema.

To build a schema, you use the data types provided in the [`pyspark.sql.types`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types) module.

<center>
<table style="width:50%">
  <tr>
      <th>Types</th>
      <th>Python Equivalent</th>
    </tr>
  <tr>
      <td>StringType</td>
      <td>string</td>
  </tr>
  <tr>
      <td>IntegerType</td>
      <td>integer</td>
  </tr>
  <tr>
      <td>FloatType</td>
      <td>float</td>
  </tr>
  <tr>
      <td>ArrayType</td>
      <td>array or list</td>
  </tr>
    <tr>
      <td>MapType</td>
      <td>dict</td>
   </tr>       
</table>
</center>

First, we initialize the Spark environment:

In [3]:
import pyspark as ps

spark = ps.sql.SparkSession.builder \
            .master("local[4]") \
            .appName("spark-ml-examples") \
            .getOrCreate()

sc = spark.sparkContext

The `local[4]` master runs Spark locally, on the driver machine, with 4 worker threads (ideally one per core). Let's start with a very small file to demonstrate the different ways to create Spark DataFrames.

In [4]:
def casting_function(args):
    user_id, date, num_streams, country, invoice_item = args
    return((int(user_id), date, int(num_streams), country, int(invoice_item)))

rdd_aavail = sc.textFile(os.path.join(DATA_DIR, 'example-data.csv'))\
                         .map(lambda rowstr : rowstr.split(","))\
                         .filter(lambda row: not row[0].startswith('#'))\
                         .map(casting_function)

rdd_aavail.collect()

[(111, '10/13/2019', 4, 'US', 1),
 (122, '10/15/2019', 5, 'SG', 1),
 (102, '10/16/2019', 11, 'US', 1),
 (144, '10/25/2019', 14, 'US', 2),
 (121, '10/26/2019', 7, 'SG', 1),
 (155, '10/27/2019', 9, 'US', 3)]

You can create a Spark DataFrame using a schema that you define yourself, or the schema can be inferred. To define your own:

In [5]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField('user_id', IntegerType(), True),
    StructField('date', StringType(), True),
    StructField('num_streams', IntegerType(), True),
    StructField('country', StringType(), True),
    StructField('invoice_items', IntegerType(), True) ])
    
# feed that into a DataFrame
df = spark.createDataFrame(rdd_aavail, schema)

# show the result
df.show()

# print the schema
df.printSchema()  

+-------+----------+-----------+-------+-------------+
|user_id|      date|num_streams|country|invoice_items|
+-------+----------+-----------+-------+-------------+
|    111|10/13/2019|          4|     US|            1|
|    122|10/15/2019|          5|     SG|            1|
|    102|10/16/2019|         11|     US|            1|
|    144|10/25/2019|         14|     US|            2|
|    121|10/26/2019|          7|     SG|            1|
|    155|10/27/2019|          9|     US|            3|
+-------+----------+-----------+-------+-------------+

root
 |-- user_id: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- num_streams: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- invoice_items: integer (nullable = true)



You may also read the data directly from a file and **infer** the schema:

In [6]:
# read CSV
df = spark.read.csv(os.path.join(DATA_DIR, 'example-data.csv'),
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?


# some functions are still valid
print("line count: {}".format(df.count()))

# show the table in a nice format
df.show()

# prints the schema
df.printSchema()

line count: 6
+-----+----------+-----------+-------+------------+
|#user|      date|num_streams|country|invoice_item|
+-----+----------+-----------+-------+------------+
|  111|10/13/2019|          4|     US|           1|
|  122|10/15/2019|          5|     SG|           1|
|  102|10/16/2019|         11|     US|           1|
|  144|10/25/2019|         14|     US|           2|
|  121|10/26/2019|          7|     SG|           1|
|  155|10/27/2019|          9|     US|           3|
+-----+----------+-----------+-------+------------+

root
 |-- #user: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- num_streams: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- invoice_item: integer (nullable = true)



You can turn the DataFrame into a pandas DataFrame, but be careful: this action loads all of the data into the driver's memory.

In [7]:
df.toPandas()

Unnamed: 0,#user,date,num_streams,country,invoice_item
0,111,10/13/2019,4,US,1
1,122,10/15/2019,5,SG,1
2,102,10/16/2019,11,US,1
3,144,10/25/2019,14,US,2
4,121,10/26/2019,7,SG,1
5,155,10/27/2019,9,US,3


Here are some common operations that you might perform on a DataFrame:

In [8]:
# prints the schema
print("--- printSchema()")
df.printSchema()

# prints the table itself
print("--- show()")
df.show()

# show the statistics of all numerical columns
print("--- describe()")
df.describe().show()

# show the statistics of one specific column
print("--- describe(num_streams)")
df.describe("num_streams").show()

--- printSchema()
root
 |-- #user: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- num_streams: integer (nullable = true)
 |-- country: string (nullable = true)
 |-- invoice_item: integer (nullable = true)

--- show()
+-----+----------+-----------+-------+------------+
|#user|      date|num_streams|country|invoice_item|
+-----+----------+-----------+-------+------------+
|  111|10/13/2019|          4|     US|           1|
|  122|10/15/2019|          5|     SG|           1|
|  102|10/16/2019|         11|     US|           1|
|  144|10/25/2019|         14|     US|           2|
|  121|10/26/2019|          7|     SG|           1|
|  155|10/27/2019|          9|     US|           3|
+-----+----------+-----------+-------+------------+

--- describe()
+-------+------------------+----------+-----------------+-------+------------------+
|summary|             #user|      date|      num_streams|country|      invoice_item|
+-------+------------------+----------+-----------------+

## Transformations on DataFrames

- They are still **lazy**: Spark doesn't apply the transformation right away, it only adds a step to the **DAG**.
- They return a new DataFrame rather than modifying the original, because DataFrames are also **immutable**.
- They can be **wide** or **narrow** (whether they shuffle partitions or not).
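
The laziness and immutability described above can be pictured with a small plain-Python sketch. This is not how Spark is implemented: `LazyTable` is a hypothetical toy class that records transformations in a list (standing in for the DAG) and runs them only when the `collect()` action is called.

```python
class LazyTable:
    """Toy stand-in for a DataFrame: immutable, with lazily recorded steps."""

    def __init__(self, rows, plan=()):
        self._rows = tuple(rows)   # source data, never modified
        self._plan = tuple(plan)   # recorded transformations (a linear "DAG")

    def filter(self, predicate):
        # Transformation: return a NEW table with one more recorded step.
        return LazyTable(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        # Transformation: also recorded, not executed.
        return LazyTable(self._rows, self._plan + (("map", func),))

    def collect(self):
        # Action: only now are the recorded steps actually executed.
        rows = iter(self._rows)
        for kind, func in self._plan:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)


calls = []


def double(value):
    calls.append(value)  # record that the function really ran
    return value * 2


base = LazyTable([4, 5, 11, 14, 7, 9])
big = base.filter(lambda v: v > 6).map(double)

print(calls)           # still empty: nothing has been computed yet
print(big.collect())   # the action triggers the computation
print(base.collect())  # the original table is unchanged
```

The sketch leaves out partitioning entirely, so it says nothing about wide versus narrow transformations.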


Let's read in the AAVAIL dataset that we have been working with to demonstrate the transformations.

In [9]:
# read CSV
df_aavail = spark.read.csv(os.path.join(DATA_DIR, 'aavail-target.csv'),
                           header=True,       
                           quote='"',         
                           sep=",",          
                           inferSchema=True)
df_aavail.describe().show()

+-------+-----------------+------------------+-------------+------------------+--------------+----------------+-----------------+
|summary|      customer_id|     is_subscriber|      country|               age| customer_name| subscriber_type|      num_streams|
+-------+-----------------+------------------+-------------+------------------+--------------+----------------+-----------------+
|  count|             1000|              1000|         1000|              1000|          1000|            1000|             1000|
|   mean|            500.5|             0.711|         null|            25.325|          null|            null|           17.695|
| stddev|288.8194360957494|0.4535247343692345|         null|12.184655959067568|          null|            null|4.798020007877829|
|    min|                1|                 0|    singapore|               -50|Aaliyah Duarte|    aavail_basic|                1|
|    max|             1000|                 1|united_states|                50|   Zoie Cor

## Remove one or more columns

In [17]:
columns_to_drop = ['customer_id', 'customer_name']
df_aavail = df_aavail.drop(*columns_to_drop)
df_aavail.describe().show()
df_aavail.groupBy("subscriber_type").count().show()

+-------+------------------+-------------+------------------+----------------+-----------------+
|summary|     is_subscriber|      country|               age| subscriber_type|      num_streams|
+-------+------------------+-------------+------------------+----------------+-----------------+
|  count|              1000|         1000|              1000|            1000|             1000|
|   mean|             0.711|         null|            25.325|            null|           17.695|
| stddev|0.4535247343692345|         null|12.184655959067568|            null|4.798020007877829|
|    min|                 0|    singapore|               -50|    aavail_basic|                1|
|    max|                 1|united_states|                50|aavail_unlimited|               29|
+-------+------------------+-------------+------------------+----------------+-----------------+

+----------------+-----+
| subscriber_type|count|
+----------------+-----+
|  aavail_premium|  331|
|aavail_unlimited|  302|
|

## Transformations on a feature matrix

The following example demonstrates how to encode categorical features and scale continuous ones:

In [32]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import DecisionTreeClassifier

from pyspark.ml import Pipeline

## scale the continuous features
va = VectorAssembler(inputCols=["age", "num_streams"], outputCol="cont_features")
ss = StandardScaler(inputCol="cont_features", outputCol="cont_scaled")



## categorical variable transformation
cat_cols = ["country", "subscriber_type"]
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index") for column in cat_cols]
encoders = [OneHotEncoder(inputCol=column+"_index", outputCol=column+"_oh") for column in cat_cols]



## assemble the features for input into the ML model
assembler = VectorAssembler(inputCols=["cont_scaled", "country_oh", "subscriber_type_oh"], outputCol="features")

## set up a model (note: `gbt` holds a decision tree here, not a gradient-boosted tree)
gbt = DecisionTreeClassifier(labelCol="is_subscriber", featuresCol="features")
paramMap = {gbt.maxDepth: 5}

In [33]:
df_aavail

DataFrame[is_subscriber: int, country: string, age: int, subscriber_type: string, num_streams: int]
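
To make the three feature transformations less of a black box, here is a plain-Python sketch of roughly what `StringIndexer`, `OneHotEncoder`, and `StandardScaler` compute, assuming the default settings of these stages (frequency-based indexing, dropping the last one-hot category, and scaling to unit standard deviation without centering). The helper names and the toy data are hypothetical, and the sketch works on Python lists rather than Spark vectors.

```python
from collections import Counter
from statistics import stdev


def string_index(values):
    """Assign indices by descending frequency (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: index for index, value in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 vector, optionally dropping the last category."""
    size = n_categories - 1 if drop_last else n_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


def scale_by_std(values):
    """Divide by the sample standard deviation without centering on the mean."""
    spread = stdev(values)
    return [v / spread for v in values]


countries = ["US", "SG", "US", "US", "SG", "US"]
num_streams = [4, 5, 11, 14, 7, 9]

index_of = string_index(countries)
encoded = [one_hot(index_of[c], len(index_of)) for c in countries]
scaled = scale_by_std(num_streams)

print(index_of)
print(encoded)
print(scaled)
```

Because this kind of scaling only divides by the standard deviation, a negative input such as the minimum `age` of -50 reported by `describe()` stays negative after scaling, which matters for estimators that require nonnegative features.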

## Set up the pipeline and train the model

In [34]:
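## run the whole pipeline; the assembler must come before the classifier,
## which reads the "features" column that the assembler produces
pipe = Pipeline(stages=indexers+encoders+[va, ss, assembler, gbt])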
model = pipe.fit(df_aavail, paramMap)
result = model.transform(df_aavail)
result.select("features", "is_subscriber", "rawPrediction", "probability", "prediction").show()

+--------------------+-------------+-------------+--------------------+----------+
|            features|is_subscriber|rawPrediction|         probability|prediction|
+--------------------+-------------+-------------+--------------------+----------+
|[1.72347910934425...|            1| [79.0,504.0]|[0.13550600343053...|       1.0|
|(5,[0,1],[2.46211...|            0| [133.0,47.0]|[0.73888888888888...|       0.0|
|[1.72347910934425...|            0| [79.0,504.0]|[0.13550600343053...|       1.0|
|[1.64140867556596...|            1| [79.0,504.0]|[0.13550600343053...|       1.0|
|[1.72347910934425...|            1| [133.0,47.0]|[0.73888888888888...|       0.0|
|[1.72347910934425...|            1| [79.0,504.0]|[0.13550600343053...|       1.0|
|[3.93938082135830...|            0|    [5.0,5.0]|           [0.5,0.5]|       0.0|
|[3.85731038758001...|            1| [79.0,504.0]|[0.13550600343053...|       1.0|
|[1.72347910934425...|            0|  [11.0,34.0]|[0.24444444444444...|       1.0|
|[2.

In [39]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

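evaluator = BinaryClassificationEvaluator(labelCol="is_subscriber", metricName="areaUnderROC",
                                         rawPredictionCol='rawPrediction')
# the metric is the area under the ROC curve (not accuracy), and `result` holds
# predictions on the training data, so the value printed is 1 - AUC on the training set
auc = evaluator.evaluate(result)
print("Test Error = %g " % (1.0 - auc))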

Test Error = 0.290458 
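
For intuition about the `areaUnderROC` metric: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The plain-Python sketch below computes it by brute force over all positive/negative pairs. The function name `area_under_roc` and the toy labels and scores are made up for illustration; Spark's evaluator arrives at the value from the ROC curve rather than pair by pair.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores for five examples.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5]

auc = area_under_roc(labels, scores)
print("AUC = %g, 1 - AUC = %g" % (auc, 1.0 - auc))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so `1 - AUC` is a ranking error rather than a misclassification rate.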


Now we repeat the same procedure with a train-test split, cross-validation, and a grid search:

In [21]:
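from pyspark.ml.classification import NaiveBayes
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

train, test = df_aavail.randomSplit([0.8, 0.2], seed=42)

# NOTE: NaiveBayes requires nonnegative feature values, but the scaled features
# contain negative values (`age` has a minimum of -50), so fitting raises the
# error shown in the traceback below
# (the variable is still called `gbt`, although the estimator is NaiveBayes)
gbt = NaiveBayes(labelCol="is_subscriber", featuresCol="features")
paramGrid = ParamGridBuilder() \
    .addGrid(gbt.smoothing, [1.0, 0.8]) \
    .build()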

pipe = Pipeline(stages=indexers+encoders+[va, ss, assembler])
pipeline_model = pipe.fit(train)
prepped_train = pipeline_model.transform(train)
prepped_test = pipeline_model.transform(test)

crossval = CrossValidator(estimator=gbt,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(labelCol="is_subscriber"),
                          numFolds=3)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(prepped_train)
print("model trained")

Py4JJavaError: An error occurred while calling o2437.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 86.0 failed 1 times, most recent failure: Lost task 0.0 in stage 86.0 (TID 465, 442f5c66ca15, executor driver): org.apache.spark.SparkException: Failed to execute user defined function(NaiveBayes$$Lambda$3431/0x0000000840e96040: (struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
	at org.apache.spark.ml.stat.SummaryBuilderImpl$MetricsAggregate.update(Summarizer.scala:338)
	at org.apache.spark.ml.stat.SummaryBuilderImpl$MetricsAggregate.update(Summarizer.scala:308)
	at org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:553)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2(AggregationIterator.scala:196)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted(AggregationIterator.scala:196)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7(AggregationIterator.scala:213)
	at org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted(AggregationIterator.scala:207)
	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:159)
	at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.<init>(ObjectAggregationIterator.scala:78)
	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2(ObjectHashAggregateExec.scala:129)
	at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec.$anonfun$doExecute$2$adapted(ObjectHashAggregateExec.scala:107)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:859)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:859)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalArgumentException: requirement failed: Naive Bayes requires nonnegative feature values but found (5,[0,1],[-3.991431737551195,4.416049903661859]).
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.ml.classification.NaiveBayes$.requireNonnegativeValues(NaiveBayes.scala:357)
	at org.apache.spark.ml.classification.NaiveBayes.$anonfun$trainDiscreteImpl$1(NaiveBayes.scala:176)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
	... 29 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2164)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
	at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3627)
	at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2940)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
	at org.apache.spark.sql.Dataset.collect(Dataset.scala:2940)
	at org.apache.spark.ml.classification.NaiveBayes.trainDiscreteImpl(NaiveBayes.scala:192)
	at org.apache.spark.ml.classification.NaiveBayes.$anonfun$trainWithLabelCheck$1(NaiveBayes.scala:159)
	at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
	at org.apache.spark.ml.classification.NaiveBayes.trainWithLabelCheck(NaiveBayes.scala:143)
	at org.apache.spark.ml.classification.NaiveBayes.train(NaiveBayes.scala:132)
	at org.apache.spark.ml.classification.NaiveBayes.train(NaiveBayes.scala:94)
	at org.apache.spark.ml.Predictor.fit(Predictor.scala:150)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)


In [26]:
prediction = cvModel.transform(prepped_test)
result = prediction.select("features", "is_subscriber", "rawPrediction", "probability", "prediction")
result.show()

+--------------------+-------------+--------------------+--------------------+----------+
|            features|is_subscriber|       rawPrediction|         probability|prediction|
+--------------------+-------------+--------------------+--------------------+----------+
|[-3.7470583658643...|            0|[43.1212677243314...|[0.43121267724331...|       1.0|
|[1.22186685843403...|            0|[65.4764954611438...|[0.65476495461143...|       0.0|
|[1.38478243955857...|            0|[42.5156472266135...|[0.42515647226613...|       1.0|
|(5,[0,1],[1.38478...|            0|[63.9301262677363...|[0.63930126267736...|       0.0|
|[1.46624023012084...|            0|[64.8543858101797...|[0.64854385810179...|       0.0|
|[1.46624023012084...|            0|[62.7304327022885...|[0.62730432702288...|       0.0|
|[1.54769802068311...|            0|[63.7622754717455...|[0.63762275471745...|       0.0|
|(5,[0,1],[1.54769...|            0|[62.6302779420164...|[0.62630277942016...|       0.0|
|[1.629155

In [36]:
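predictions = prediction.select("is_subscriber", "rawPrediction")
# iterating over a DataFrame yields its Column objects, not its rows;
# use predictions.take(n) or predictions.collect() to get the rows
list(predictions)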

[Column<b'is_subscriber'>, Column<b'rawPrediction'>]

In [42]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

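# BinaryClassificationMetrics expects an RDD of (score, label) pairs;
# here the hard 0/1 prediction is used as the score
predictionLabelsRDD = prediction.select('is_subscriber','prediction').rdd.map(lambda row: (float(row['prediction']), float(row['is_subscriber'])))
# printing an RDD only shows its description, not its contents
print(predictionLabelsRDD)
# instantiate the metrics object
metrics = BinaryClassificationMetrics(predictionLabelsRDD)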

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

PythonRDD[5681] at RDD at PythonRDD.scala:53
Area under PR = 0.862475204506754
Area under ROC = 0.7738909517295289


In [50]:
cvModel.avgMetrics

[0.7293055045280169,
 0.7354573482910156,
 0.7231377520278637,
 0.7414176439847097]
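
`avgMetrics` holds one value per parameter combination in the grid, in the same order as the grid, each averaged over the cross-validation folds. The plain-Python sketch below shows that bookkeeping on toy data. Everything in it is hypothetical (`build_param_grid`, `k_fold_indices`, `cross_validate`, and the `threshold_accuracy` scorer, which does no real fitting), and the folds are assigned round-robin rather than at random as Spark does.

```python
from itertools import product
from statistics import mean


def build_param_grid(grid):
    """Expand {name: [values]} into a list of every parameter combination."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold (round-robin)."""
    for fold in range(num_folds):
        validation = list(range(fold, n_rows, num_folds))
        held_out = set(validation)
        train = [i for i in range(n_rows) if i not in held_out]
        yield train, validation


def cross_validate(rows, param_maps, num_folds, fit_and_score):
    """Return the per-combination average metric and the best combination."""
    avg_metrics = []
    for params in param_maps:
        scores = [
            fit_and_score(params, [rows[i] for i in train], [rows[i] for i in valid])
            for train, valid in k_fold_indices(len(rows), num_folds)
        ]
        avg_metrics.append(mean(scores))
    best = max(range(len(param_maps)), key=lambda i: avg_metrics[i])
    return avg_metrics, param_maps[best]


def threshold_accuracy(params, train_rows, valid_rows):
    """Toy scorer: ignores the training rows and scores a fixed threshold rule."""
    hits = sum(
        (x + params["bias"] > params["threshold"]) == bool(label)
        for x, label in valid_rows
    )
    return hits / len(valid_rows)


# Toy rows of (num_streams, label) and a 2 x 2 grid, giving four combinations.
rows = [(4, 0), (5, 0), (11, 1), (14, 1), (7, 0), (9, 1)]
param_maps = build_param_grid({"threshold": [6, 10], "bias": [0, 2]})
avg_metrics, best_params = cross_validate(rows, param_maps, 3, threshold_accuracy)

print(len(param_maps))
print(avg_metrics)
print(best_params)
```

Read this way, the four values in the output above correspond to four parameter combinations, and the largest one identifies the combination that the cross-validator selects.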