# Structured Streaming + Machine Learning

In this notebook we run a more advanced streaming use case.

To keep it concise, I do not repeat explanations from the "structured streaming" notebook.

## The Task
You have an input stream of invoice data (Exciting!). You need to predict the unit quantity given [date, unit price, 
country].
You can train on some of the data (e.g. the first N=300K entries) and then predict the rest.
Because this is an exercise, you also get the unit quantity, so you can easily compute the quality of the prediction.

We do not aim for high accuracy here; the goal is just to demonstrate how to use ML with streaming.

## The plan

Accumulate N rows into a single dataframe, then train a (regression) model.<br>
Use the following rows (until the end of the input stream) as the test set: run the prediction and compare it to the actual value [unit quantity].
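
The plan can be sketched in plain Python, without Spark, as a small state machine. This is only an illustration: `MeanModel`, `make_handler` and the toy batches are hypothetical stand-ins, and the "model" simply predicts the mean of the training labels.

```python
class MeanModel:
    """Hypothetical stand-in for a regressor: always predicts the training mean."""

    def fit(self, labels):
        self.mean = sum(labels) / len(labels)
        return self

    def predict(self, n):
        return [self.mean] * n


def make_handler(num_train_batches):
    """Return (handle, state); handle(batch, epoch) follows accumulate -> train -> test."""
    state = {"train": [], "model": None, "errors": []}

    def handle(batch, epoch):
        if epoch < num_train_batches:
            state["train"].extend(batch)                       # phase 1: accumulate
        elif state["model"] is None:
            state["model"] = MeanModel().fit(state["train"])   # phase 2: train once
            state["train"] = []                                # training rows no longer needed
        else:
            preds = state["model"].predict(len(batch))         # phase 3: test
            mse = sum((p - y) ** 2 for p, y in zip(preds, batch)) / len(batch)
            state["errors"].append(mse ** 0.5)

    return handle, state


handle, state = make_handler(num_train_batches=2)
batches = [[1, 3], [5, 7], [100], [4, 4], [2, 6]]   # toy label values, one list per batch
for epoch, batch in enumerate(batches):
    handle(batch, epoch)
print(state["errors"])
```

Note that in this sketch the batch that triggers training (`[100]`) is used for neither training nor testing; only the batches after it are scored.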

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import *
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler

In [None]:
SCHEMA = "InvoiceNo INT, StockCode INT, Description STRING, Quantity INT, InvoiceDate DATE, UnitPrice FLOAT, CustomerID FLOAT, country STRING"

spark = SparkSession.builder.appName('streamingML')\
    .config("spark.kryoserializer.buffer.max", "512m")\
    .config('spark.jars', '/home/jars/*.jar')\
    .getOrCreate()

spark.sparkContext.setLogLevel("INFO")

In [None]:
kafka_server = "kafka:9092"
topic = "retail"

batchSize = 5000

In [None]:
streaming_df = spark.readStream\
                  .format("kafka")\
                  .option("kafka.bootstrap.servers", kafka_server)\
                  .option("subscribe", topic)\
                  .option("startingOffsets", "earliest")\
                  .option("failOnDataLoss",False)\
                  .option("maxOffsetsPerTrigger", batchSize )\
                  .load()\
                  .select(f.from_csv(f.decode("value", "US-ASCII"), schema=SCHEMA).alias("value")).select("value.*")

In [None]:
train_data = None
trainedModel = None
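numTrainRows = max(30000,batchSize)
# round up to a whole number of batches: epoch ids are integers, so a fractional value would never match
numEpochTrain = -(-numTrainRows // batchSize)
print(f"The first {numEpochTrain} epochs will be used for training")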
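
The number of training batches follows from the row budget and the batch size. A minimal sketch, assuming each micro-batch delivers at most `batch_size` rows; `train_epochs` is a hypothetical helper, not part of the notebook. Rounding up to a whole number matters because batch ids are integers, so a fractional count would never compare equal to one.

```python
def train_epochs(num_train_rows, batch_size):
    """Number of whole micro-batches needed to collect at least num_train_rows rows."""
    return -(-num_train_rows // batch_size)  # ceiling division on integers


print(train_epochs(30000, 5000))   # 6
print(train_epochs(30000, 7000))   # 5 (true division would give about 4.29)
```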

In [None]:
def transform(df):
    """
    Convert the raw input df to a format that can be used by the ML model.
    The RandomForestRegressor expects a 'features' column.
    
    Invalid entries are dropped, so this may return fewer rows (sometimes 0).
    """
    assembler = VectorAssembler(inputCols=['StockCode','UnitPrice','CustomerID'], outputCol="features", handleInvalid='skip')
    d = assembler.transform(df)
    return d
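
The effect of `handleInvalid='skip'` can be pictured with a plain-Python analogue (this is not how Spark implements it): rows with a missing value in any input column are dropped, and the remaining values are packed into one vector. `assemble_skip_invalid` and the sample rows are hypothetical.

```python
import math


def assemble_skip_invalid(rows, input_cols):
    """Drop rows with a None/NaN in any input column; add a 'features' list to the rest."""
    out = []
    for row in rows:
        values = [row.get(c) for c in input_cols]
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
            continue  # analogous to handleInvalid='skip'
        out.append({**row, "features": [float(v) for v in values]})
    return out


# made-up rows: the second has an unparsable stock code, the third a missing customer
rows = [
    {"StockCode": 22423, "UnitPrice": 2.55, "CustomerID": 17850.0, "Quantity": 6},
    {"StockCode": None, "UnitPrice": 3.39, "CustomerID": 17850.0, "Quantity": 8},
    {"StockCode": 71053, "UnitPrice": 3.39, "CustomerID": None, "Quantity": 2},
]
kept = assemble_skip_invalid(rows, ["StockCode", "UnitPrice", "CustomerID"])
print(len(kept), kept[0]["features"])
```

Only the first row survives, which is why a batch can shrink, or even become empty, after the transformation.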

## Setting hyperparameters
In this code we focus on the actual stream handling, so we just use a single set of parameters.

Look at the "MLlib" and sdg/advanced\* notebooks to learn how to automatically perform a grid search over a range of parameters.

In [None]:
numTrees, maxDepth = 30,20

In [None]:
def handle_batch(data, epoch_num):
    global train_data
    global trainedModel
    global numTrainRows
    
    #raw_count = data.count()
    data = transform(data)
    trans_count = data.count()
    #print(f"batch size before/after transform: {raw_count}/{trans_count}")
    if trans_count == 0:
        print(f"NOTHING to process in epoch {epoch_num}")
        return
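    if epoch_num < numEpochTrain:
        # collect training data; start a fresh set at epoch 0, or if nothing has been collected yet
        # (e.g. the first batch was empty after the transformation)
        if epoch_num == 0 or train_data is None:
            train_data = data
        else:
            train_data = train_data.union(data) # just collect more data (what about garbage collection?)
    elif epoch_num == numEpochTrain:
        # at last, train the model (the rows of this batch itself are not used)
        print("Got all the training data...")
        rf = RandomForestRegressor(numTrees=numTrees, maxDepth=maxDepth, labelCol='Quantity')
        # rf.setSeed(42) # just during debug!!! 
        trainedModel = rf.fit(train_data)
        train_data = None  # not needed any more, so drop the reference to free some memory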
        print("Finished model training")
    else:
        # apply the model
        test_predictions = trainedModel.transform(data.select(['Quantity','features']))
        evaluator = RegressionEvaluator(labelCol='Quantity', predictionCol='prediction')
        pred_error = evaluator.evaluate(test_predictions)
        print(f"{epoch_num}\t RMSE:{pred_error:4.3g}")
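
`RegressionEvaluator` reports the root mean squared error by default. As a sanity check on what that number means, here is a plain-Python sketch of the same quantity on made-up values; `rmse` is a hypothetical helper, not a Spark function.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((prediction - label)^2))."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# errors are -1, 2 and 0, so the result is sqrt(5/3), about 1.29
print(rmse([6, 8, 2], [5, 10, 2]))
```

The value is in the same units as `Quantity`, so an RMSE of 1.29 means predictions are off by roughly 1.3 units, with large misses weighted more heavily.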

## Finally! Run the next cell
Wait a few minutes until the data is collected and the model is trained, and then you will get the prediction errors for each batch.

While you wait, open the UI at http://localhost:4040/StreamingQuery/ and click the **Run ID**

In [None]:
streaming_df.writeStream\
    .foreachBatch(handle_batch)\
    .start()\
    .awaitTermination()


<img src="https://www.freepnglogos.com/uploads/the-end-png/the-end-photographe-ois-love-life-photography-27.png" width="200"/>
<hr>

# Tips for developing your code
- Start with simple data. Instead of using streaming, apply the code to a 'regular' dataframe.
- Start with a simple model (for the RandomForest, for example, use numTrees=2, maxDepth=2), then increase to get meaningful results.
- Save your source versions with meaningful comments (use git). You will want to get back to a version that worked before.
- Save a cache of a dataframe in a Parquet file -- see below.

## Running handle_batch() manually (for developing and debugging)
I wanted to control the calls and keep it fast, so I created a static df, read it from a file, and used it for training -- just to get the code working.

In [None]:
from pyspark.sql.utils import AnalysisException
def loadData(numLines = 600):
    """
    Load data from the cached file if it exists, and from Kafka otherwise
    :return: dataframe
    """
    fname = f"./retail{numLines}.parquet"
    # Here you see the EAFP pattern at work
    # https://realpython.com/python-lbyl-vs-eafp/#the-easier-to-ask-forgiveness-than-permission-eafp-style
    try:
        df = spark.read.parquet(fname)
    except AnalysisException: # catch only the relevant exception, so other causes will crash the code asap.
        df = spark.read.format("kafka")\
                  .option("kafka.bootstrap.servers", kafka_server).option("subscribe", topic).option("startingOffsets", "earliest")\
                  .load().limit(numLines)
        df = df.select(f.from_csv(f.decode("value", "US-ASCII"), schema=SCHEMA).alias("value")).select("value.*")
        df = df.drop('InvoiceNo').drop('Description').drop('InvoiceDate').drop('country')
        df.write.parquet(fname)

    return df
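
The same try-first, fall-back-on-failure caching idea in plain Python, for readers who want to see the pattern without Spark. A sketch only: `load_cached`, `slow_source` and the JSON file name are hypothetical.

```python
import json
import os
import tempfile


def load_cached(path, produce):
    """Return the cached value at path; on a cache miss, call produce() and cache its result."""
    try:
        with open(path) as fh:            # EAFP: just try to read the cache
            return json.load(fh)
    except FileNotFoundError:             # catch only the expected failure
        value = produce()
        with open(path, "w") as fh:
            json.dump(value, fh)
        return value


calls = []


def slow_source():
    calls.append(1)                       # count how often the slow path runs
    return [1, 2, 3]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "retail_sample.json")
    first = load_cached(cache, slow_source)    # miss: produces the data and writes the file
    second = load_cached(cache, slow_source)   # hit: reads the file
print(first, second, len(calls))
```

The slow source is called only once; the second call is served from the file. Catching only `FileNotFoundError` means a corrupt cache file or a permissions problem still fails loudly instead of being silently masked.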

I run the next cell dozens of times, changing the functions above, until the code works as expected.

During debugging I set `numTrees, maxDepth = 5,2`

In [None]:
x = loadData().randomSplit([1.0,1.0,1.0])

numEpochTrain = 1
handle_batch(x[0],0) # collect training data
handle_batch(x[1],1) # fit the model (the rows of this batch are not used)
handle_batch(x[2],2) # test

# Check yourself
- In handle_batch() we train the model after N records. What is a potential problem with this method if it were a live stream?
- Why should we remove statements such as `rf.setSeed(42)`?
- What happens if you set `batchSize = 5000000`? Why?

- Set `batchSize = 5000`. Find the **Process Rate** and **Batch Duration** in the UI link.
  - How long did the training take?
  - What is the fastest rate at which your code (on this machine) can handle data, in units of [records/sec]?