# Streaming Data Preparation

This notebook showcases the simple data preparation applied to the raw streaming data on the `readings_raw` topic using PySpark. I am using Spark's Structured Streaming because it is the newer streaming API in Spark and offers SQL-like transformations that ease development. One small thing worries me about the Spark and Kafka integration, though: even the latest Spark version, 2.4.x, still relies on the integration named for the old Kafka 0.10 release (`spark-sql-kafka-0-10`), which targets brokers at version 0.10.0 or higher.

The topic streams comma-separated values, which we need to parse and deduplicate; any record with a null field value is dropped. The resulting clean data is then streamed back to the `readings_prepared` Kafka topic, to be consumed by Spark streaming jobs downstream.


## Setup

First, we import the required libraries and define the variables that control the stream I/O. The Kafka bootstrap server lives on the `kafka-m` hostname, which is resolvable within the Google Cloud VPC network.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

In [2]:
kafkaBootstrapServer = "kafka-m:9092"
kafkaSourceTopic = "readings_raw"
kafkaTargetTopic = "readings_prepared"
checkpointLocation = "/tmp"
deduplicateWindow = "1 minute"

## Read Stream and Prepare The Data

### Connect to Kafka and Subscribe to `readings_raw` Topic

Connect to Kafka and create the streaming DataFrame, then take a look at the schema it produces. In Jupyter Notebooks or the PySpark shell, the `spark` variable is created by default, so there is no need to create another one.

In [3]:
kafkaSourceDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafkaBootstrapServer) \
    .option("subscribe", kafkaSourceTopic) \
    .option("failOnDataLoss", False) \
    .load()
kafkaSourceDF.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



#### Peek at the Source Streaming Data

We can peek at the Kafka source stream, `kafkaSourceDF`, to ensure that Spark is reading the stream properly. There were times when, for reasons I could not pin down, Spark read nothing from the stream even though the producer was producing data.

In [4]:
kafkaSourceQuery = kafkaSourceDF.writeStream\
    .queryName("kafka_source")\
    .format("memory")\
    .start()

In [18]:
# Wait 5 seconds before querying
time.sleep(5)
spark.sql("select * from kafka_source").show()

+----+--------------------+------------+---------+------+--------------------+-------------+
| key|               value|       topic|partition|offset|           timestamp|timestampType|
+----+--------------------+------------+---------+------+--------------------+-------------+
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74000|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74001|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74002|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74003|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74004|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74005|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74006|2020-03-02 15:26:...|            0|
|null|[7B 22 64 61 74 6...|readings_raw|        0| 74007|2020-03-02 15

In [6]:
kafkaSourceQuery.status

{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}

In [7]:
# Print the last progress, if necessary, and stop the query.
# If we keep the query running, the in-memory sink will eat the driver's memory unnecessarily.
# kafkaSourceQuery.lastProgress
kafkaSourceQuery.stop()
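
The `value` column is binary, which is why `show()` prints it as hex bytes. As a quick sanity check, the visible prefix can be decoded in plain Python. The snippet below is only an illustration and uses nothing beyond the bytes visible in the truncated output above.

```python
# Decode the hex prefix that show() printed for the binary `value` column.
# Only the complete bytes visible in the truncated output are used here.
visible_prefix = bytes.fromhex("7B 22 64 61 74")
decoded_prefix = visible_prefix.decode("ascii")
print(decoded_prefix)  # prints {"dat
```

The prefix suggests that each message is a JSON object with a `data` field, which matches the `from_json(..., "data STRING")` call used in the parsing step.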

### Parse the Comma-separated Values and Output an Array

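The `value` column holds a JSON object whose `data` field contains the comma-separated values. We parse the JSON with `from_json()`, then use the `split()` function to turn the comma-separated string into a column whose values are arrays.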

In [8]:
# Get the CSV value from KafkaDF and turn it into an array
csvDF = kafkaSourceDF.select(
    from_json(col("value").cast("string"), "data STRING").alias("value"),
    col("timestamp")
).select(
    split(col("value.data"), ",").alias("value"),
    col("timestamp")
)
csvDF.printSchema()

root
 |-- value: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- timestamp: timestamp (nullable = true)
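
To make the two-step parse concrete, here is a plain-Python model of what happens to a single message value. It is a sketch, not Spark code: `parse_kafka_value` and the sample payload are hypothetical, and the handling of malformed input is a simplification of Spark's behaviour.

```python
import json
from typing import List, Optional


def parse_kafka_value(raw_value: bytes) -> Optional[List[str]]:
    """Model of from_json(value, "data STRING") followed by split(data, ",")."""
    try:
        payload = json.loads(raw_value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # simplification: malformed input ends up as a null array
    if not isinstance(payload, dict) or not isinstance(payload.get("data"), str):
        return None  # simplification: a missing or non-string `data` field is treated as null
    return payload["data"].split(",")


# Hypothetical payload, shaped like the messages on `readings_raw`.
sample_value = b'{"data": "14240628,1377994960,3.216,0,1,0,3"}'
print(parse_kafka_value(sample_value))
```

Every element of the resulting array is still a string; the typed columns are derived from it in the next step.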



### Parse the Array Column and Sanitise

Drop duplicates using an **arbitrary** 1-minute watermark. Spark keeps previously seen records as streaming state and uses that state to drop repeats. The watermark trails the maximum `reading_ts` seen so far by 1 minute: records with a `reading_ts` older than the watermark are treated as too late and dropped, and state older than the watermark can be discarded. The watermark delay is configurable, but the longer it is, the more memory Spark needs to keep the state. Note that `dropDuplicates()` without arguments compares every column, including the Kafka `timestamp`. The `message_id` column is cast to string so that it can be used as a Kafka message key.

See the [Spark Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) for more details.
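
The interplay between the watermark and the deduplication state can be sketched in plain Python. This is a simplified, single-process model rather than Spark's implementation: the class `WatermarkDeduplicator` is hypothetical, records are reduced to `(message_id, reading_ts)` pairs, and the watermark is assumed to advance once per micro-batch.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Set, Tuple

Record = Tuple[str, datetime]  # (message_id, reading_ts)


class WatermarkDeduplicator:
    """Simplified model of deduplication bounded by an event-time watermark."""

    def __init__(self, delay: timedelta) -> None:
        self.delay = delay
        self.max_event_time: Optional[datetime] = None
        self.state: Set[Record] = set()

    def watermark(self) -> Optional[datetime]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process_batch(self, batch: List[Record]) -> List[Record]:
        watermark = self.watermark()  # fixed for the whole micro-batch
        emitted = []
        for record in batch:
            if watermark is not None and record[1] < watermark:
                continue  # too late: older than the watermark
            if record in self.state:
                continue  # duplicate of a record still held in state
            self.state.add(record)
            emitted.append(record)
        # After the batch, advance the maximum event time and evict old state.
        for _, event_time in batch:
            if self.max_event_time is None or event_time > self.max_event_time:
                self.max_event_time = event_time
        new_watermark = self.watermark()
        if new_watermark is not None:
            self.state = {r for r in self.state if r[1] >= new_watermark}
        return emitted


dedup = WatermarkDeduplicator(timedelta(minutes=1))
t0 = datetime(2013, 9, 1, 0, 0, 0)
first = dedup.process_batch([("a", t0), ("a", t0), ("b", t0 + timedelta(seconds=30))])
second = dedup.process_batch([("a", t0), ("c", t0 + timedelta(minutes=5))])
third = dedup.process_batch([("a", t0)])
```

In this model the repeated `"a"` record is dropped as a duplicate in the first two batches. By the third batch the watermark has moved past it, so it is dropped as late data and its state has already been evicted, which is what keeps memory bounded.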

In [9]:
# Parse the CSV and drop the duplicate values
# The stream from Kafka may deliver a message at least once
readingsDF = csvDF\
    .withColumn("message_id", col("value").getItem(0).cast("string"))\
    .withColumn("reading_ts", col("value").getItem(1).cast("integer").cast("timestamp"))\
    .withColumn("reading_value", col("value").getItem(2).cast("float"))\
    .withColumn("reading_type", col("value").getItem(3).cast("integer"))\
    .withColumn("plug_id", col("value").getItem(4).cast("integer"))\
    .withColumn("household_id", col("value").getItem(5).cast("integer"))\
    .withColumn("house_id", trim(col("value").getItem(6)).cast("integer"))\
    .drop("value")\
    .dropna()\
    .withWatermark("reading_ts", deduplicateWindow)\
    .dropDuplicates()

readingsDF.printSchema()

root
 |-- timestamp: timestamp (nullable = true)
 |-- message_id: string (nullable = true)
 |-- reading_ts: timestamp (nullable = true)
 |-- reading_value: float (nullable = true)
 |-- reading_type: integer (nullable = true)
 |-- plug_id: integer (nullable = true)
 |-- household_id: integer (nullable = true)
 |-- house_id: integer (nullable = true)
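
The casting and `dropna()` steps can be modelled in plain Python as follows. This is a sketch under assumptions: `to_reading` and the sample fields are hypothetical, the Kafka `timestamp` column is left out, and Python's `int()` and `float()` do not match Spark's `cast` in every edge case. The point is only that a value which cannot be cast becomes null, and a row with any null is dropped.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Column order mirrors the getItem(0..6) calls in the notebook.
FIELD_CASTS = [
    ("message_id", str),
    ("reading_ts", lambda s: datetime.fromtimestamp(int(s), tz=timezone.utc)),
    ("reading_value", float),
    ("reading_type", int),
    ("plug_id", int),
    ("household_id", int),
    ("house_id", lambda s: int(s.strip())),  # mirrors trim() before the cast
]


def to_reading(fields: Optional[List[str]]) -> Optional[Dict[str, object]]:
    """Return a typed reading, or None when any field would be null."""
    if fields is None or len(fields) < len(FIELD_CASTS):
        return None  # a missing array item would be null, so the row is dropped
    reading = {}
    for (name, cast), raw in zip(FIELD_CASTS, fields):
        try:
            reading[name] = cast(raw)
        except (ValueError, OverflowError, OSError):
            return None  # a failed cast would be null, so the row is dropped
    return reading


# Hypothetical field values, shaped like one parsed message.
good = to_reading(["14240628", "1377994960", "3.216", "0", "1", "0", " 3"])
bad_cast = to_reading(["14240629", "not-a-number", "3.216", "0", "1", "0", "3"])
too_short = to_reading(["14240630", "1377994960"])
```

Here `good` is a complete reading, while `bad_cast` and `too_short` are both `None`, which corresponds to rows that `dropna()` would remove.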



#### Peek at the Parsed and Prepared Data Stream

In [10]:
readingsQuery = readingsDF.writeStream\
    .queryName("readings")\
    .format("memory")\
    .start()

In [11]:
readingsQuery.status

{'message': 'Getting offsets from KafkaV2[Subscribe[readings_raw]]',
 'isDataAvailable': False,
 'isTriggerActive': True}

In [12]:
# Sleep 40 seconds because we need to wait for the streaming state to initialise
time.sleep(40)
spark.sql("select * from readings").show()

+--------------------+----------+-------------------+-------------+------------+-------+------------+--------+
|           timestamp|message_id|         reading_ts|reading_value|reading_type|plug_id|household_id|house_id|
+--------------------+----------+-------------------+-------------+------------+-------+------------+--------+
|2020-03-02 15:26:...|  14240628|2013-09-01 00:22:40|        3.216|           0|      1|           0|       3|
|2020-03-02 15:26:...|  14274618|2013-09-01 00:23:00|        0.788|           0|      0|           0|       4|
|2020-03-02 15:26:...|  14408067|2013-09-01 00:24:20|        0.788|           0|      0|           0|       4|
|2020-03-02 15:26:...|  14408242|2013-09-01 00:24:20|          0.0|           1|      1|           0|       7|
|2020-03-02 15:26:...|  15572218|2013-09-01 00:36:00|          0.0|           1|      2|           0|       7|
|2020-03-02 15:26:...|  15704595|2013-09-01 00:37:20|        3.351|           1|      1|           0|       1|
|

In [13]:
# Don't forget to stop this later
readingsQuery.stop()
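
A fixed `time.sleep()` is a guess at how long the streaming state takes to initialise. One alternative, sketched here, is to poll the query until a micro-batch reports input rows. The helper `wait_for_rows` is hypothetical; it relies only on the `lastProgress` and `isActive` properties of a streaming query, and reported input rows do not strictly guarantee that the in-memory table is already populated.

```python
import time


def wait_for_rows(query, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll `query.lastProgress` until a micro-batch reports input rows.

    Returns True once rows were reported, False if the timeout elapsed
    or the query stopped first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        progress = query.lastProgress  # None until the first progress update
        if progress and progress.get("numInputRows", 0) > 0:
            return True
        if not query.isActive:
            return False
        time.sleep(poll_s)
    return False


# Intended use in the notebook (not executed here):
# if wait_for_rows(readingsQuery):
#     spark.sql("select * from readings").show()
```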

## Write Back Prepared Data to Kafka

Now that the data is clean, we write it back to Kafka, to the `readings_prepared` topic, where it is consumed by the alerting and ingestion streaming jobs downstream. No trigger is set, so Spark writes as soon as data is available rather than on a time-based schedule. The Kafka `timestamp` is used as the message key in case I need to restart streaming jobs while iterating.

In [21]:
kafkaWriteQuery = readingsDF\
    .selectExpr(
        "CAST(timestamp AS STRING) AS key",
        "CAST(to_json(struct(*)) AS STRING) AS value")\
    .writeStream\
    .format("kafka")\
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)\
    .option("checkpointLocation", checkpointLocation)\
    .option("topic", kafkaTargetTopic)\
    .start()

We can check the status and last progress of the write to Kafka using `.lastProgress` and `.status`.

In [22]:
time.sleep(15)
kafkaWriteQuery.lastProgress

{'id': '0226cc16-5616-4ca7-a484-423e99021fe1',
 'runId': '93ce24a1-a2dc-4369-869e-491ccbf0b441',
 'name': None,
 'timestamp': '2020-03-02T15:28:26.107Z',
 'batchId': 0,
 'numInputRows': 0,
 'processedRowsPerSecond': 0.0,
 'durationMs': {'addBatch': 11606,
  'getBatch': 10,
  'queryPlanning': 120,
  'triggerExecution': 11828},
 'eventTime': {'watermark': '1970-01-01T00:00:00.000Z'},
 'stateOperators': [{'numRowsTotal': 0,
   'numRowsUpdated': 0,
   'memoryUsedBytes': 44599,
   'customMetrics': {'loadedMapCacheHitCount': 0,
    'loadedMapCacheMissCount': 0,
    'stateOnCurrentVersionSizeBytes': 15799}}],
 'sources': [{'description': 'KafkaV2[Subscribe[readings_raw]]',
   'startOffset': None,
   'endOffset': {'readings_raw': {'0': 65000}},
   'numInputRows': 0,
   'processedRowsPerSecond': 0.0}],
 'sink': {'description': 'org.apache.spark.sql.kafka010.KafkaSourceProvider@1b57f220'}}

In [23]:
kafkaWriteQuery.status

{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}
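
The progress dictionaries are verbose, so it can help to condense them to the few numbers worth watching. The helper below is a hypothetical sketch that works on the plain dictionary returned by `.lastProgress`; the sample is an abridged copy of the first progress report shown above.

```python
from typing import Any, Dict


def summarize_progress(progress: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a lastProgress dictionary to a handful of key metrics."""
    state_ops = progress.get("stateOperators", [])
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "rows_per_second": progress.get("processedRowsPerSecond", 0.0),
        "state_rows": sum(op.get("numRowsTotal", 0) for op in state_ops),
        "state_memory_bytes": sum(op.get("memoryUsedBytes", 0) for op in state_ops),
        "watermark": progress.get("eventTime", {}).get("watermark"),
    }


# Abridged copy of the first progress report in this section.
first_progress = {
    "batchId": 0,
    "numInputRows": 0,
    "processedRowsPerSecond": 0.0,
    "eventTime": {"watermark": "1970-01-01T00:00:00.000Z"},
    "stateOperators": [{"numRowsTotal": 0, "numRowsUpdated": 0, "memoryUsedBytes": 44599}],
}
summary = summarize_progress(first_progress)
```

Watching `state_rows` and `state_memory_bytes` across batches is a simple way to see how much state the deduplication step is holding.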

In [25]:
kafkaWriteQuery.lastProgress
# kafkaWriteQuery.stop()

{'id': '0226cc16-5616-4ca7-a484-423e99021fe1',
 'runId': '93ce24a1-a2dc-4369-869e-491ccbf0b441',
 'name': None,
 'timestamp': '2020-03-02T15:28:37.936Z',
 'batchId': 1,
 'numInputRows': 33000,
 'inputRowsPerSecond': 2789.753994420492,
 'processedRowsPerSecond': 2117.9641871510175,
 'durationMs': {'addBatch': 15369,
  'getBatch': 1,
  'getEndOffset': 0,
  'queryPlanning': 86,
  'setOffsetRange': 48,
  'triggerExecution': 15580,
  'walCommit': 29},
 'eventTime': {'avg': '2013-09-01T03:48:44.664Z',
  'max': '2013-09-01T18:08:00.000Z',
  'min': '2013-08-31T22:00:20.000Z',
  'watermark': '1970-01-01T00:00:00.000Z'},
 'stateOperators': [{'numRowsTotal': 33000,
   'numRowsUpdated': 33000,
   'memoryUsedBytes': 8204303,
   'customMetrics': {'loadedMapCacheHitCount': 200,
    'loadedMapCacheMissCount': 0,
    'stateOnCurrentVersionSizeBytes': 8146703}}],
 'sources': [{'description': 'KafkaV2[Subscribe[readings_raw]]',
   'startOffset': {'readings_raw': {'0': 65000}},
   'endOffset': {'readings_