# Writing Continuous Applications with Structured Streaming Python APIs in Apache Spark

Tutorial for <img src="https://databricks.com/wp-content/uploads/2018/12/pydata-logo-4.png" alt="PyData" width="6%"/> Miami

At first glance, building a distributed streaming engine might seem as simple as launching a set of servers and pushing data between them. Unfortunately, distributed stream processing runs into multiple complications that don’t affect simpler computations like batch jobs. Fortunately, PySpark 2.4 and Databricks make this simple!

This notebook shows how to train a model using Apache Spark and MLlib, then deploy that model using Spark's structured streaming to make predictions as a continuous application.

This example uses a credit card fraud use case to demonstrate how MLlib models and structured streaming can be combined to constitute a continuous application. In our hypothetical use case, we have some historical data of credit card transactions, some of which have been identified as fraud. We want to train a model on this historical data that can flag potentially fraudulent transactions coming in as a live stream. We then want to deploy that model as part of a data pipeline that works with a stream of transaction data to identify potential fraud hotspots in a continuous manner.

<div style="line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/04/PySparkStructuredStreaming-1.jpg" alt="Structured Streaming" width="50%">
</div>

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Please cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

A copy of this data and its licence are available at https://s3-us-west-2.amazonaws.com/ml-team-public-read/credit-card-fraud.zip

This dataset has 3 columns we'll be using.

**pcaVector:** The PCA transformation of raw transaction data. The main idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of many variables correlated with each other. Put simply, it is a method of summarizing data.

**amountRange:** This column is a value between 0 and 7 and tells us the approximate amount of a transaction. The values correspond to 0-1, 1-5, 5-10, 10-20, 20-50, 50-100, 100-200, and 200+ in dollars.

**label:** 0 or 1, whether a transaction was fraudulent.
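
To make the _amountRange_ encoding concrete, here is a minimal sketch of how a dollar amount could be mapped to its range index. The dataset already ships with this column, so the function below is purely illustrative: the name `amount_to_range` is hypothetical, and the choice to place an amount that sits exactly on a boundary (for example 5.00) in the higher bucket is an assumption, since the ranges listed above do not say which side a boundary belongs to.

```python
from bisect import bisect_right

# Upper edges (in dollars) of buckets 0..6, taken from the ranges listed above.
# Bucket 7 is the open-ended "200+" range.
BUCKET_EDGES = [1, 5, 10, 20, 50, 100, 200]
BUCKET_LABELS = ["0-1", "1-5", "5-10", "10-20", "20-50", "50-100", "100-200", "200+"]


def amount_to_range(amount):
    """Map a dollar amount to an amountRange index between 0 and 7.

    Assumption: an amount exactly on an edge goes to the higher bucket.
    """
    if amount < 0:
        raise ValueError("amount must be non-negative")
    # bisect_right counts how many edges are <= amount, which is the bucket index.
    return bisect_right(BUCKET_EDGES, amount)


for amount in (0.50, 3.20, 75.00, 250.00):
    index = amount_to_range(amount)
    print(amount, "->", index, BUCKET_LABELS[index])
```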

We want to build a model that predicts the label from the pcaVector and amountRange data. We'll do this with an ML pipeline with 3 stages:
* 1) A **OneHotEncoder** to build a vector from our _amountRange_ column. One-hot encoding converts a categorical variable into a vector form that ML algorithms can consume more effectively.
* 2) A **VectorAssembler** to merge our _pcaVector_ and _amountRange_ vector into a single features vector. It is a transformer that combines a given list of columns into a single vector column, which is useful for combining raw features and features generated by different feature transformers in order to train ML models such as logistic regression and decision trees.
* 3) A **GBTClassifier** to serve as our Estimator. It is a learning algorithm for classification that supports binary labels, as well as both continuous and categorical features.
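
The first two stages can be sketched in plain Python to show what the resulting feature vector looks like. This is a conceptual illustration, not Spark's implementation: `one_hot` and `assemble` are hypothetical helpers, plain lists stand in for Spark's (sparse) vectors, and the all-zero `pca_vector` is made-up placeholder data. The sketch assumes 8 _amountRange_ categories and a 28-element _pcaVector_, and it mirrors the encoder's default `dropLast` behavior, where the last category is represented by an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is encoded as all zeros.
    """
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


def assemble(*vectors):
    """Concatenate several vectors into one flat feature vector."""
    features = []
    for vector in vectors:
        features.extend(vector)
    return features


amount_vect = one_hot(2, num_categories=8)   # amountRange == 2, i.e. the 5-10 bucket
pca_vector = [0.0] * 28                      # placeholder values, not real data
features = assemble(amount_vect, pca_vector)

print(amount_vect)      # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(len(features))    # 35
```

The classifier then trains on vectors shaped like `features`, with the one-hot slots first and the PCA components after them.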

## Set up input and output files

In [6]:
input_data = "/databricks-datasets/credit-card-fraud/data"
output_test_parquet_data = "/tmp/pydata/credit-card-frauld-test-data"

In [7]:
#
# Let's take a look at the historical dataset we'll be working with today
#
data = spark.read.parquet(input_data)
display(data)

In [8]:
data.count()

We are using PySpark, so import the appropriate classes.

In [10]:
from pyspark.ml.feature import OneHotEncoderEstimator, VectorAssembler, VectorSizeHint
from pyspark.ml.classification import GBTClassifier

from pyspark.sql.types import *
from pyspark.sql.functions import count, rand, collect_list, explode, struct

Building this pipeline will be very familiar to anyone who has used MLlib, but because we intend to use this model in a streaming context, there are a few things we should be aware of.

First, you may notice that we use a `OneHotEncoderEstimator`, which is new in Spark 2.3, and not a `OneHotEncoder`, which has now been deprecated. This new estimator fixes several issues with the `OneHotEncoder` and also allows you to do one-hot encoding on streaming DataFrames.

Second, `VectorAssembler` has some limitations in a streaming context. Specifically, `VectorAssembler` can only work on Vector columns of known size. To address this, we explicitly specify the size of the pcaVector column with the `VectorSizeHint` transformer, so that we'll be able to use our pipeline with structured streaming.

In [12]:
oneHot = OneHotEncoderEstimator(inputCols=["amountRange"], outputCols=["amountVect"])

vectorAssembler = VectorAssembler(inputCols=["amountVect", "pcaVector"], outputCol="features")

estimator = GBTClassifier(labelCol="label", featuresCol="features")

In [13]:
from pyspark.ml.feature import VectorSizeHint

vectorSizeHint = VectorSizeHint(inputCol="pcaVector", size=28)

### Now we're ready to build our ML Pipeline and fit it.

In [15]:
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

pipeline = Pipeline(stages=[oneHot, vectorSizeHint, vectorAssembler, estimator])
#
# Let's split the data into training and testing datasets.
# We will save the test dataset for later.
#
train = data.filter(col("time") % 10 < 8)
test = data.filter(col("time") % 10 >= 8)
#
# save our data into partitions so we can read them as files
#
(test.repartition(20).write
  .mode("overwrite")
  .parquet(output_test_parquet_data))
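
The split above is deterministic: a row's `time` value alone decides which side it lands on, so rerunning the cell yields the same partition. The plain-Python sketch below illustrates the idea on made-up rows; `split_by_time` is a hypothetical helper, and the 80/20 proportion it prints holds only because the toy `time` values are spread evenly across the last digits 0 to 9, which is an assumption that may not hold exactly for the real dataset.

```python
def split_by_time(rows):
    """Split rows into (train, test) using the last digit of the time column."""
    train = [row for row in rows if row["time"] % 10 < 8]    # last digits 0..7
    test = [row for row in rows if row["time"] % 10 >= 8]    # last digits 8..9
    return train, test


# Toy rows with evenly spread time values (illustrative data only).
rows = [{"time": t} for t in range(100)]
train_rows, test_rows = split_by_time(rows)

print(len(train_rows), len(test_rows))  # 80 20
```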

In [16]:
train.count()

In [17]:
test.count()

## Let's fit the model with our training data

In [19]:
pipelineModel = pipeline.fit(train)

We can simulate a stream by reading our test data from files, since we don't have a Kafka cluster available for the demo.
But the effect is no different; you are still using the same PySpark APIs to read from the filesystem as you would to read from Kafka topics.

First, let's define the schema.

In [21]:
from pyspark.sql.types import *
from pyspark.ml.linalg import VectorUDT

schema = (StructType([StructField("time", IntegerType(), True), 
                      StructField("amountRange", IntegerType(), True), 
                      StructField("label", IntegerType(), True), 
                      StructField("pcaVector", VectorUDT(), True)]))

## **Start:** 
Read the files one at a time to simulate a Kafka stream.

In [23]:
streamingData = (spark.readStream 
                 .schema(schema) 
                 .option("maxFilesPerTrigger", 1) 
                 .parquet(output_test_parquet_data)) # our test data

Transform the streaming DataFrame using the model, then query the result with the PySpark DataFrame API.

In [25]:
from pyspark.sql.functions import *

stream = pipelineModel.transform(streamingData)

## Do aggregations using PySpark DataFrame APIs

1. _groupBy_("label", "prediction")
2. _sort_("label", "prediction")

And finally _display()_ the predictions as they are scored in real-time from the stream

In [27]:
streamPredictions = (pipelineModel.transform(streamingData) #infer or score against our test data
          .groupBy("label", "prediction")
          .count()
          .sort("label", "prediction"))

In [28]:
display(streamPredictions)
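
Conceptually, the displayed table is a running aggregate: each time a new file arrives, its rows are scored and the per-group counts are updated. The plain-Python sketch below mimics that behavior on made-up data. It is an illustration of the idea only, not how Spark executes the query; `running_counts` is a hypothetical helper, and each inner list stands in for the scored rows of one file.

```python
from collections import Counter


def running_counts(micro_batches):
    """Yield the sorted (label, prediction) counts after each micro-batch."""
    totals = Counter()
    for batch in micro_batches:
        totals.update(batch)            # add this batch's rows to the running totals
        yield sorted(totals.items())    # snapshot, ordered by label then prediction


# Two toy micro-batches of (label, prediction) pairs (illustrative data only).
batches = [
    [(0, 0), (0, 0), (1, 1)],
    [(0, 1), (1, 0), (0, 0)],
]

for snapshot in running_counts(batches):
    print(snapshot)
```

Each printed snapshot corresponds to one refresh of the displayed table: earlier counts are kept and incremented rather than recomputed from scratch.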

### HOMEWORK CHALLENGE-1:
Can you compute the Precision and Recall?

Note: The formulas to compute Precision (P) and Recall (R)

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2019/01/Screen-Shot-2019-01-02-at-9.38.02-AM.png" alt="Precision and Recall formulas" width="20%">
</div>
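
As a starting point for the challenge, here is a minimal sketch that applies the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), to counts keyed by (label, prediction), with fraud (1) as the positive class. The `precision_recall` helper is hypothetical and the counts are made-up numbers, not results from this dataset; in the notebook you would collect the real counts from the aggregated predictions first. Returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(counts):
    """Compute (precision, recall) from counts keyed by (label, prediction)."""
    tp = counts.get((1, 1), 0)   # fraud correctly flagged
    fp = counts.get((0, 1), 0)   # legitimate transaction flagged as fraud
    fn = counts.get((1, 0), 0)   # fraud that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up counts for illustration only.
example_counts = {(0, 0): 900, (0, 1): 10, (1, 0): 20, (1, 1): 80}
precision, recall = precision_recall(example_counts)

print(round(precision, 3), round(recall, 3))  # 0.889 0.8
```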

This notebook demonstrates that MLlib Transformers, including PipelineModels, can be applied to streaming DataFrames. Except for the minor differences described in this notebook, you can work with structured streaming DataFrames the same way you would with batch DataFrames using MLlib.

That's the beauty of the unified APIs in Apache Spark 2.x: you can work with streaming DataFrames just as you would with static ones.

Very cool!

### HOMEWORK CHALLENGE-2:
Can you compute the F1 score, after computing the Precision and Recall?

**Note**: F1 Score = 2PR / (P + R)
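
The F1 score is the harmonic mean of precision and recall, that is, 2PR divided by (P + R). A minimal sketch follows; `f1_score` is a hypothetical helper, the input values are made up, and returning 0.0 when both inputs are zero is a convention chosen here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(0.5, 0.5))               # 0.5
print(round(f1_score(1.0, 0.5), 3))     # 0.667
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a model cannot reach a high F1 by excelling at only one of precision or recall.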

In [32]:
dbutils.fs.rm(output_test_parquet_data, True)