# Streaming Data Preparation

This notebook showcases the simple data preparation to be done on the raw streaming data from the `readings_raw` topic using PySpark. I am using Spark's Structured Streaming because it is the newer way to do streaming in Spark and offers SQL-like transformations, which eases development. However, one small thing worries me about the Spark and Kafka integration: even the latest Spark version, 2.4.x, still uses the integration for the old Kafka broker 0.10 version.

The topic streams records in a comma-separated value format, which we need to parse, deduplicate, and drop if any field value is null. The resulting clean data is then streamed back to the `readings_prepared` Kafka topic, to be consumed by Spark streaming jobs downstream.


## Setup

First, we import the required libraries and define the variables used to control the stream I/O. The Kafka bootstrap server lives on the `kafka-m` hostname, which is resolvable within the Google Cloud VPC network.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

In [None]:
kafkaBootstrapServer = "kafka-m:9092"
kafkaSourceTopic = "readings_raw"
kafkaTargetTopic = "readings_prepared"
checkpointLocation = "/tmp"
deduplicateWindow = "1 minute"

## Read Stream and Prepare The Data

### Connect to Kafka and Subscribe to `readings_raw` Topic

Connect to Kafka and create the streaming DataFrame, then take a look at the schema it produces. In Jupyter Notebooks or the PySpark shell, the `spark` variable is created by default, so there is no need to create another one.

In [None]:
kafkaSourceDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafkaBootstrapServer) \
    .option("subscribe", kafkaSourceTopic) \
    .option("failOnDataLoss", False) \
    .load()
kafkaSourceDF.printSchema()

#### Peek at the Source Streaming Data

We can peek at the Kafka source stream, `kafkaSourceDF`, to ensure that Spark is reading the stream properly. There were times when, somehow, it didn't read any data from the stream although the producer was producing data.

In [None]:
kafkaSourceQuery = kafkaSourceDF.writeStream\
    .queryName("kafka_source")\
    .format("memory")\
    .start()

In [None]:
# Wait 5 seconds before querying
time.sleep(5)
spark.sql("select * from kafka_source").show()
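
A fixed sleep can be too short when the stream is slow to start. As an alternative, here is a minimal sketch of a polling helper. `wait_until` and its parameters are hypothetical names of my own, not part of PySpark; the helper simply re-evaluates a condition until it holds or a timeout passes.

```python
import time


def wait_until(condition, timeout=30.0, interval=1.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in this notebook (needs the running `kafka_source` query):
# wait_until(lambda: spark.sql("select * from kafka_source").count() > 0)
```

If the helper returns `False`, the memory table is still empty after the timeout, which is a hint to check the producer and the topic name.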

In [None]:
kafkaSourceQuery.status

In [None]:
# Print the last progress, if necessary, and stop the query.
# If we leave the query running, it will eat the driver's memory unnecessarily.
# kafkaSourceQuery.lastProgress
kafkaSourceQuery.stop()

### Parse the Comma-separated Values and Output an Array

We then extract the comma-separated string from the `data` field of the JSON in the `value` column using `from_json()`, and use the `split()` function to turn it into a column holding an array of values.

In [None]:
# Get the CSV value from KafkaDF and turn it into an array
csvDF = kafkaSourceDF.select(
    from_json(col("value").cast("string"), "data STRING").alias("value")
).select(
    split(col("value.data"), ",").alias("value")
)
csvDF.printSchema()
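
To make the two steps concrete, here is a plain-Python sketch of what happens to a single message. The sample payload is made up for illustration; I am assuming each message value is a JSON object with a `data` field, as the `"data STRING"` schema passed to `from_json` implies.

```python
import json

# Hypothetical message value, shaped like the "data STRING" schema expects
raw_value = b'{"data": "1,1377986401,68.451,0,11,0,0"}'

# Rough equivalent of col("value").cast("string") followed by from_json(...)
payload = json.loads(raw_value.decode("utf-8"))

# Rough equivalent of split(col("value.data"), ",")
fields = payload["data"].split(",")

print(fields)
```

Every item in the resulting list is still a string at this point; the typed columns only appear in the next step.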

### Parse the Array Column and Sanitise

Drop duplicates within an **arbitrary** 1-minute watermark. This means Spark keeps state for the stream and uses it to deduplicate records whose `reading_ts` is no more than 1 minute older than the maximum `reading_ts` seen so far. The window duration is configurable, but the longer it is, the more memory Spark needs to keep the state. The `message_id` column is cast to string so that it can be used as the Kafka message key.

See the [Spark Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) for more details.
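
The idea of watermark-bounded deduplication can be sketched in plain Python. This is a simplified model, not Spark's implementation: it updates the watermark on every record, whereas Spark advances it between micro-batches. The function name `deduplicate` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta


def deduplicate(records, delay=timedelta(minutes=1)):
    """Yield each (event_time, payload) record once, keeping only bounded state."""
    seen = set()     # deduplication state: records inside the watermark window
    max_ts = None    # highest event time seen so far

    for record in records:
        ts = record[0]
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - delay

        # Evict state older than the watermark so memory use stays bounded
        seen = {r for r in seen if r[0] >= watermark}

        if ts < watermark:
            continue  # too late: older than the watermark, dropped
        if record in seen:
            continue  # duplicate within the window, dropped
        seen.add(record)
        yield record


t0 = datetime(2013, 9, 1, 0, 0, 0)
stream = [
    (t0, "a"),
    (t0, "a"),                           # duplicate, dropped
    (t0 + timedelta(seconds=30), "b"),
    (t0 + timedelta(minutes=5), "c"),    # moves the watermark to t0 + 4 minutes
    (t0, "a"),                           # older than the watermark, dropped
]
result = list(deduplicate(stream))
print(len(result))
```

A longer `delay` keeps more records in `seen`, which is the memory trade-off described above.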

In [None]:
# Parse the CSV and drop the duplicate values
# Kafka delivery is at-least-once, so the same message may arrive more than once
readingsDF = csvDF\
    .withColumn("message_id", col("value").getItem(0).cast("string"))\
    .withColumn("reading_ts", col("value").getItem(1).cast("integer").cast("timestamp"))\
    .withColumn("reading_value", col("value").getItem(2).cast("float"))\
    .withColumn("reading_type", col("value").getItem(3).cast("integer"))\
    .withColumn("plug_id", col("value").getItem(4).cast("integer"))\
    .withColumn("household_id", col("value").getItem(5).cast("integer"))\
    .withColumn("house_id", trim(col("value").getItem(6)).cast("integer"))\
    .drop("value")\
    .dropna()\
    .withWatermark("reading_ts", deduplicateWindow)\
    .dropDuplicates()

readingsDF.printSchema()
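
With Spark's default settings, casting a string that cannot be parsed yields null rather than an error, so a malformed field becomes null and the whole row is removed by `dropna()`. Here is a plain-Python sketch of that behaviour for one record. `cast_or_none` and `parse_reading` are hypothetical helpers, and Python's parsing rules are not identical to Spark's, so treat it as an illustration only.

```python
from datetime import datetime, timezone


def cast_or_none(text, caster):
    """Mimic a lenient cast: return None instead of raising on bad input."""
    try:
        return caster(text)
    except (ValueError, OverflowError, OSError):
        return None


def epoch_to_timestamp(text):
    """Interpret the text as whole seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(text), tz=timezone.utc)


def parse_reading(fields):
    """Turn a list of 7 strings into a typed dict, or None if any field is invalid."""
    if len(fields) < 7:
        return None  # a missing array item would be null in Spark
    reading = {
        "message_id": fields[0],
        "reading_ts": cast_or_none(fields[1], epoch_to_timestamp),
        "reading_value": cast_or_none(fields[2], float),
        "reading_type": cast_or_none(fields[3], int),
        "plug_id": cast_or_none(fields[4], int),
        "household_id": cast_or_none(fields[5], int),
        "house_id": cast_or_none(fields[6].strip(), int),
    }
    # Rough equivalent of dropna(): discard the row if any column is null
    if any(value is None for value in reading.values()):
        return None
    return reading


good = parse_reading(["1", "1377986401", "68.451", "0", "11", "0", "0"])
bad = parse_reading(["2", "not-a-number", "68.451", "0", "11", "0", "0"])
print(good is not None, bad is None)
```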

#### Peek at the Parsed and Prepared Data Stream

In [None]:
readingsQuery = readingsDF.writeStream\
    .queryName("readings")\
    .format("memory")\
    .start()

In [None]:
readingsQuery.status

In [None]:
# Sleep 40 seconds because we need to wait for the streaming state to initialise
time.sleep(40)
spark.sql("select * from readings").show()

In [None]:
# Don't forget to stop this later
# readingsQuery.stop()

## Write Back Prepared Data to Kafka

Now that the data is clean, we write back to Kafka -- to the `readings_prepared` topic to be consumed by alerting and ingestion streaming jobs downstream. Write as soon as data is available, without a time-based trigger.

In [None]:
# Key each message by message_id and serialise the whole row as JSON
kafkaWriteQuery = readingsDF.selectExpr("message_id AS key", "CAST(to_json(struct(*)) AS STRING) AS value")\
    .writeStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)\
    .option("checkpointLocation", checkpointLocation)\
    .option("topic", kafkaTargetTopic)\
    .start()
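
To picture what ends up on the topic, here is a plain-Python sketch of the key and value built for one row. The row contents are made up, and the exact timestamp rendering produced by `to_json` is an assumption on my part.

```python
import json

# Hypothetical prepared row; reading_ts is shown already rendered as a string
row = {
    "message_id": "1",
    "reading_ts": "2013-09-01T00:00:01.000Z",
    "reading_value": 68.451,
    "reading_type": 0,
    "plug_id": 11,
    "household_id": 0,
    "house_id": 0,
}

# The Kafka sink reads the `key` and `value` columns of each row
key = row["message_id"]
value = json.dumps(row, separators=(",", ":"))  # rough stand-in for to_json(struct(*))

print(key, value)
```

Downstream consumers can then parse `value` back into columns with `from_json` and a matching schema.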

We can check the status and last progress of the write to Kafka using `.status` and `.lastProgress`.

In [None]:
time.sleep(15)
kafkaWriteQuery.lastProgress

In [None]:
kafkaWriteQuery.status

In [None]:
kafkaWriteQuery.lastProgress
# kafkaWriteQuery.stop()
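
In PySpark, `lastProgress` is a plain dictionary, or `None` if no batch has completed yet. Here is a small sketch of a helper that condenses it into one line. `summarize_progress` is a hypothetical name, and the sample dictionary only contains the few fields the helper reads.

```python
def summarize_progress(progress):
    """Return a one-line summary of a streaming progress dict, or a notice if None."""
    if progress is None:
        return "no progress reported yet"
    return "batch {}: {} input rows, {:.1f} rows/s processed".format(
        progress.get("batchId"),
        progress.get("numInputRows", 0),
        # The rate may be missing from the report, so fall back to 0.0
        progress.get("processedRowsPerSecond", 0.0),
    )


# Hypothetical progress report, trimmed to the fields used above
sample = {"batchId": 3, "numInputRows": 120, "processedRowsPerSecond": 60.0}
print(summarize_progress(sample))
print(summarize_progress(None))

# Hypothetical usage in this notebook:
# print(summarize_progress(kafkaWriteQuery.lastProgress))
```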
