# Data cleaning using Spark Streaming

This notebook reads data from the `ingest` topic on our Kafka distributed queue, cleans each message, and writes the result to the Kafka topic `ingest-cleaned`.

Spark Structured Streaming treats a live data stream as a table to which rows are continuously appended. We can run queries on this table just as we would on a static one. Each new event becomes a new row in the table, after which Spark updates the query results incrementally: rather than recomputing everything, it processes only the delta.

Spark is responsible for updating the result table when new data arrives. This relieves us from maintaining running aggregations ourselves and from ensuring data consistency and fault tolerance, so we can focus on the cleaning logic itself.
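
To make the "delta" idea concrete, here is a minimal plain-Python sketch of a running aggregate that is updated per micro-batch instead of being recomputed from the full table. It is only an illustration of the concept, not how Spark implements it; the function name `update_counts` and the sample events are hypothetical.

```python
from collections import Counter

def update_counts(state, new_events):
    """Fold only the newly arrived events into the existing running counts."""
    state.update(event["title"] for event in new_events)
    return state

running = Counter()
# First micro-batch: two events arrive.
update_counts(running, [{"title": "EMS"}, {"title": "Fire"}])
# Second micro-batch: only the new event is processed; earlier rows are not rescanned.
update_counts(running, [{"title": "EMS"}])
print(dict(running))
```

The state carried between batches is what Spark manages for us, including recovering it after a failure.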

#### Input

We will now use the Spark Structured Streaming API to clean our event stream. We will use a `DataStreamReader` to read from a Kafka source. We added events to our Kafka distributed queue in notebook `1_read_and_POST.ipynb` and are now ready to process them.



#### Output

We will output our resulting data to a Kafka sink. Each row of our DataFrame will be written to the Kafka topic `ingest-cleaned`. We will use the output mode `append`, which writes only the rows that were added to the result table since the last trigger.
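
As a rough mental model of `append` mode, the sketch below keeps a high-water mark and writes only the rows that appeared since the previous trigger. This is a toy model in plain Python under the assumption that existing rows never change; `append_new_rows` and the row labels are hypothetical names.

```python
def append_new_rows(result_table, sink, already_written):
    """Write only rows added since the last trigger; return the new high-water mark."""
    sink.extend(result_table[already_written:])
    return len(result_table)

result_table, sink, written = [], [], 0

result_table += ["row-1", "row-2"]   # rows arriving before the first trigger
written = append_new_rows(result_table, sink, written)

result_table += ["row-3"]            # row arriving before the second trigger
written = append_new_rows(result_table, sink, written)
```

Because a cleaning query transforms each row independently, rows never need to be revised after they are emitted, which is why `append` fits this notebook.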

## Cleaning the data

In [1]:
%%bash
# Ensure the required Python 3 dependencies are installed.
python3 -m pip install kafka-python



We will now create a Spark session (and its context) and specify that the Spark-Kafka connector package needs to be added.

In [2]:
from IPython.display import display, clear_output
from time import sleep

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 pyspark-shell'

import pyspark 
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *

# Create a local Spark session with two worker threads (if it doesn't already exist)
spark = SparkSession.builder.master('local[2]').getOrCreate()
sc = spark.sparkContext
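
The connector coordinate in `PYSPARK_SUBMIT_ARGS` has to match the installed Spark version and its Scala build. A small helper such as the hypothetical `kafka_package` below can build the string instead of hard-coding it; the default Scala version `2.12` is taken from the coordinate used above.

```python
def kafka_package(spark_version, scala_version="2.12"):
    """Build the Maven coordinate of the Spark-Kafka connector for a given Spark build."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"

# Reproduces the value used in this notebook for Spark 3.0.0.
submit_args = f"--packages {kafka_package('3.0.0')} pyspark-shell"
print(submit_args)
```

In a notebook one could pass `pyspark.__version__` as `spark_version`. The environment variable must be set before the Spark session is created, otherwise it has no effect.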


We will now create a streaming DataFrame that represents the events received from the Kafka topic `ingest`.

In [3]:
input = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers","localhost:9092")
    # Change `ingest-test` to your topic of choice.
    .option("subscribe", "ingest-test")
    # earliest: start reading from the beginning of the queue
    # this will also read all messages already present on the Kafka topic
    .option("startingOffsets", "earliest")
    .load()
)
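
The effect of `startingOffsets` can be pictured with a toy model of a single partition, shown below in plain Python. The function `starting_position` and the message labels are hypothetical; real Kafka offsets are tracked per partition.

```python
def starting_position(log, starting_offsets):
    """Return the index of the first message a newly started query would read."""
    if starting_offsets == "earliest":
        return 0              # replay everything already in the partition
    if starting_offsets == "latest":
        return len(log)       # skip existing messages and wait for new ones
    raise ValueError(f"unsupported value: {starting_offsets!r}")

log = ["msg-0", "msg-1", "msg-2"]   # messages already on the topic
replayed = log[starting_position(log, "earliest"):]
skipped = log[starting_position(log, "latest"):]
```

Note that this option only matters when a query starts fresh. A query that resumes from an existing checkpoint continues where it left off.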


We can't simply run the query and inspect its output, because a query on a streaming DataFrame never finishes on its own. For debugging purposes, we can start a query, wait a few seconds so that some results have arrived, and then show the contents of the in-memory table.

Afterwards we stop the running query so that the in-memory table does not keep growing until we run out of memory.

In [4]:
stream_decoded = (
    input
    .withColumn("value", input["value"].cast("string"))
    .select("value", "timestamp")
    )

In [6]:
try:
    # Stop the previous query if this cell was run before
    tq.stop()
except NameError:
    # No previous query exists yet
    pass

tq = (
    # Create an output stream
    stream_decoded.writeStream               
    # Only write new rows to the output
    # To clean data, we can only use the outputMode 'append'
    .outputMode("append")           
    # Write output stream to an in-memory Spark table (a DataFrame)
    .format("memory")               
    # The name of the output table will be the same as the name of the query
    .queryName("test_query")
    # Submit the query to Spark and execute it
    .start()
)

sleep(2)

# When the status says "Waiting for data to arrive", that means the query
# has finished its current iteration and is waiting for new messages from
# Kafka.
display(tq.status)

memory_sink = spark.table("test_query")
# Show result table in Jupyter Notebook. Since Jupyter Notebooks have native support for showing pandas tables,
# we convert the Spark DataFrame.
display(memory_sink.toPandas())

# Stop the query
tq.stop()

{'message': 'Waiting for data to arrive',
 'isDataAvailable': False,
 'isTriggerActive': False}

| | value | timestamp |
|---|---|---|
| 0 | {"lat": 40.297875899999994, "lng": -75.5812935... | 2020-12-03 15:35:48.143 |
| 1 | {"lat": 40.2580614, "lng": -75.26467990000002,... | 2020-12-03 15:35:48.157 |
| 2 | {"lat": 40.121181799999995, "lng": -75.3519752... | 2020-12-03 15:35:48.171 |
| 3 | {"lat": 40.116153000000004, "lng": -75.343513,... | 2020-12-03 15:35:48.186 |
| 4 | {"lat": 40.251492, "lng": -75.6033497, "desc":... | 2020-12-03 15:35:48.204 |
| ... | ... | ... |
| 9995 | {"lat": 40.075536299999996, "lng": -75.3046354... | 2020-12-03 15:38:07.331 |
| 9996 | {"lat": 40.2116628, "lng": -75.2759685, "desc"... | 2020-12-03 15:38:07.344 |
| 9997 | {"lat": 40.069013, "lng": -75.134458, "desc": ... | 2020-12-03 15:38:07.360 |
| 9998 | {"lat": 40.3126186, "lng": -75.31258270000001,... | 2020-12-03 15:38:07.374 |
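
The fixed `sleep(2)` is a guess at how long the first micro-batches take. One alternative is to poll the query status until it reports no active trigger and no pending data. The helper `wait_until_idle` below is a hypothetical sketch that relies only on the `status` keys shown in the output above; `FakeQuery` is a stand-in so the helper can be tried without Spark.

```python
import time

def wait_until_idle(query, timeout=30.0, poll_interval=0.5):
    """Poll query.status until no trigger is active and no data is pending.

    Returns True if the query became idle before the timeout, else False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = query.status
        if not status["isTriggerActive"] and not status["isDataAvailable"]:
            return True
        time.sleep(poll_interval)
    return False


class FakeQuery:
    """Stand-in for a StreamingQuery that reports 'busy' a fixed number of times."""
    def __init__(self, busy_polls):
        self._busy_polls = busy_polls

    @property
    def status(self):
        busy = self._busy_polls > 0
        self._busy_polls -= 1
        return {"isDataAvailable": busy, "isTriggerActive": busy}


became_idle = wait_until_idle(FakeQuery(busy_polls=2), timeout=5.0, poll_interval=0.01)
```

A query can also look idle right after `start()`, before its first trigger has fired, so this remains a sketch. For debugging, PySpark's `StreamingQuery.processAllAvailable()` blocks until all currently available data has been processed and may be the simpler choice.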


We use the `from_json` function to parse each JSON string into a struct held in a single column. We will later flatten this column so that each field of the struct becomes a column in our DataFrame.

In [7]:
# lat,lng,desc,zip,title,timeStamp,twp,addr,e

schema = StructType([
    StructField("lat", DoubleType()),
    StructField("lng", DoubleType()),
    StructField("desc", StringType()),
    StructField("zip", FloatType()),
    StructField("title", StringType()),
    StructField("timeStamp", TimestampType()),
    StructField("twp", StringType()),
    StructField("addr", StringType()),
    StructField("e", IntegerType()),
])

decoded_json_stream = (
    stream_decoded.withColumn("nineoneone", from_json(col("value"), schema))
)
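
For intuition, the plain-Python sketch below loosely mirrors what parsing against a schema does for a single message: it keeps only the schema fields and fills absent ones with `None`. It performs no type coercion, and the handling of unparseable input here is an assumption of the sketch; Spark's exact behavior depends on the version and parser mode. `parse_event` and `FIELDS` are hypothetical names.

```python
import json

# Field names taken from the schema defined above.
FIELDS = ["lat", "lng", "desc", "zip", "title", "timeStamp", "twp", "addr", "e"]

def parse_event(raw):
    """Parse a JSON string and keep only the schema fields; absent fields become None."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption for this sketch: an unparseable message yields all-None fields.
        record = {}
    if not isinstance(record, dict):
        record = {}
    return {name: record.get(name) for name in FIELDS}

ok = parse_event('{"lat": 40.25, "lng": -75.6, "title": "EMS: DIZZINESS", "extra": 1}')
bad = parse_event("not json")
```

Fields that are not in the schema, such as `extra` here, are dropped.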


In [8]:
flattened_stream = (
    decoded_json_stream
    .select("nineoneone.*")
)

# Create two requested columns from column 'title' by splitting on ':'
split_col = split(flattened_stream['title'], ':')
flattened_stream = (
    flattened_stream
    .withColumn('majorTitle', split_col.getItem(0))
    .withColumn('minorTitle', split_col.getItem(1))
)

# Replace NaN values with null, then cast zip from float to integer
flattened_stream = flattened_stream.replace(float('nan'), None)
flattened_stream = flattened_stream.withColumn("zip", col("zip").cast(IntegerType()))

# Derive the columns hour and date from the event timestamp
flattened_stream = flattened_stream.withColumn("hour", hour(col("timeStamp")).cast(IntegerType()))
flattened_stream = flattened_stream.withColumn("date", to_date(col("timeStamp")))
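
The same cleaning steps can be written for a single record in plain Python, which is handy for reasoning about edge cases. This is a sketch, not the notebook's method: `derive_columns` is a hypothetical name, and the timestamp format `%Y-%m-%d %H:%M:%S` is an assumption based on the values shown in the output.

```python
from datetime import datetime
import math

def derive_columns(event):
    """Add majorTitle, minorTitle, hour and date to one parsed event (a dict)."""
    parts = event["title"].split(":")
    zip_code = event.get("zip")
    if zip_code is not None and math.isnan(zip_code):
        zip_code = None  # treat NaN as a missing value
    ts = datetime.strptime(event["timeStamp"], "%Y-%m-%d %H:%M:%S")
    return {
        **event,
        "zip": None if zip_code is None else int(zip_code),
        # Assumption: whitespace is stripped here. Splitting on ':' alone
        # leaves the space that follows the colon in place.
        "majorTitle": parts[0].strip(),
        "minorTitle": parts[1].strip() if len(parts) > 1 else None,
        "hour": ts.hour,
        "date": ts.date().isoformat(),
    }

row = derive_columns({"title": "EMS: DIZZINESS", "zip": float("nan"),
                      "timeStamp": "2015-12-10 16:56:52"})
```

If leading whitespace in `minorTitle` matters downstream, Spark's `trim` function could be applied to the split result in the streaming version as well.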

Let's take a look at our flattened stream. We display the first rows using pandas and use `dtypes` to check that the columns are typed appropriately.

In [10]:
try:
    # Stop the previous query if this cell was run before
    tq.stop()
except NameError:
    # No previous query exists yet
    pass

tq = (
    # Create an output stream
    flattened_stream.writeStream               
    # Only write new rows to the output
    .outputMode("append")           
    # Write output stream to an in-memory Spark table (a DataFrame)
    .format("memory")               
    # The name of the output table will be the same as the name of the query
    .queryName("test_query")
    # Submit the query to Spark and execute it
    .start()
)

sleep(2)

# When the status says "Waiting for data to arrive", that means the query
# has finished its current iteration and is waiting for new messages from
# Kafka.
display(tq.status)

memory_sink = spark.table("test_query")


# Show result table in Jupyter Notebook. Since Jupyter Notebooks have native support for showing pandas tables,
# we convert the Spark DataFrame.
display(memory_sink.toPandas().head(10))
display(memory_sink.dtypes)

# Stop the query
tq.stop()

{'message': 'Getting offsets from KafkaV2[Subscribe[ingest-test]]',
 'isDataAvailable': False,
 'isTriggerActive': True}

| | lat | lng | desc | zip | title | timeStamp | twp | addr | e | majorTitle | minorTitle | hour | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:10:52 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | BACK PAINS/INJURY | 17 | 2015-12-10 |
| 1 | 40.258061 | -75.26468 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:29:21 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | DIABETIC EMERGENCY | 17 | 2015-12-10 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 14:39:21 | NORRISTOWN | HAWS AVE | 1 | Fire | GAS-ODOR/LEAK | 14 | 2015-12-10 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 16:47:36 | NORRISTOWN | AIRY ST & SWEDE ST | 1 | EMS | CARDIAC EMERGENCY | 16 | 2015-12-10 |
| 4 | 40.251492 | -75.60335 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | | EMS: DIZZINESS | 2015-12-10 16:56:52 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 | EMS | DIZZINESS | 16 | 2015-12-10 |
| 5 | 40.253473 | -75.283245 | CANNON AVE & W 9TH ST; LANSDALE; Station 345;... | 19446.0 | EMS: HEAD INJURY | 2015-12-10 15:39:04 | LANSDALE | CANNON AVE & W 9TH ST | 1 | EMS | HEAD INJURY | 15 | 2015-12-10 |
| 6 | 40.182111 | -75.127795 | LAUREL AVE & OAKDALE AVE; HORSHAM; Station 35... | 19044.0 | EMS: NAUSEA/VOMITING | 2015-12-10 16:46:48 | HORSHAM | LAUREL AVE & OAKDALE AVE | 1 | EMS | NAUSEA/VOMITING | 16 | 2015-12-10 |
| 7 | 40.217286 | -75.405182 | COLLEGEVILLE RD & LYWISKI RD; SKIPPACK; Stati... | 19426.0 | EMS: RESPIRATORY EMERGENCY | 2015-12-10 16:17:05 | SKIPPACK | COLLEGEVILLE RD & LYWISKI RD | 1 | EMS | RESPIRATORY EMERGENCY | 16 | 2015-12-10 |
| 8 | 40.289027 | -75.39959 | MAIN ST & OLD SUMNEYTOWN PIKE; LOWER SALFORD;... | 19438.0 | EMS: SYNCOPAL EPISODE | 2015-12-10 16:51:42 | LOWER SALFORD | MAIN ST & OLD SUMNEYTOWN PIKE | 1 | EMS | SYNCOPAL EPISODE | 16 | 2015-12-10 |
| 9 | 40.102398 | -75.291458 | BLUEROUTE & RAMP I476 NB TO CHEMICAL RD; PLYM... | 19462.0 | Traffic: VEHICLE ACCIDENT - | 2015-12-10 17:35:41 | PLYMOUTH | BLUEROUTE & RAMP I476 NB TO CHEMICAL RD | 1 | Traffic | VEHICLE ACCIDENT - | 17 | 2015-12-10 |


[('lat', 'double'),
 ('lng', 'double'),
 ('desc', 'string'),
 ('zip', 'int'),
 ('title', 'string'),
 ('timeStamp', 'timestamp'),
 ('twp', 'string'),
 ('addr', 'string'),
 ('e', 'int'),
 ('majorTitle', 'string'),
 ('minorTitle', 'string'),
 ('hour', 'int'),
 ('date', 'date')]

## Write entries to ingest-cleaned Kafka topic

Finally, we want to write the cleaned 911 entries to the `ingest-cleaned` Kafka topic. The Kafka sink expects a DataFrame with a `value` column and, optionally, a `key` column.

To create the `value` column, we first create a struct from all columns in the DataFrame using the `struct` function, serialize the result to JSON using `to_json`, and keep only the resulting column, named `value`, using `select` and `alias`.

In [11]:
output_stream = flattened_stream.select(to_json(struct("*")).alias("value"))
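
The plain-Python sketch below shows the shape of the resulting `value` for one row: all columns packed into a single JSON string. `to_kafka_value` is a hypothetical name, and rendering dates as ISO-8601 strings is an assumption of the sketch rather than a statement about Spark's exact output format.

```python
import json
from datetime import date, datetime

def to_kafka_value(row):
    """Serialize one row (a dict) into the JSON string used as the Kafka `value`."""
    def encode(obj):
        # Dates and timestamps are not JSON types; emit ISO-8601 strings (assumption).
        if isinstance(obj, (date, datetime)):
            return obj.isoformat()
        raise TypeError(f"cannot serialize {type(obj).__name__}")
    return json.dumps(row, default=encode)

value = to_kafka_value({"lat": 40.25, "majorTitle": "EMS", "hour": 16,
                        "date": date(2015, 12, 10)})
print(value)
```

Depending on the Spark version and settings, `to_json` may omit fields whose value is null, so consumers of `ingest-cleaned` should not rely on every key being present.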

In [12]:
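import shutil

try:
    # Stop the previous query if this cell was run before
    tq.stop()
except NameError:
    pass

# Remove the old checkpoint dir, otherwise you'll get weird runtime faults.
# os.rmdir only removes empty directories, so use shutil.rmtree instead.
shutil.rmtree("checkpoints-cleanup", ignore_errors=True)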

# Write the prepared DataFrame to the Kafka topic
tq = (
    output_stream
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "ingest-cleaned")
    .option("checkpointLocation", "checkpoints-cleanup")
    .start()
)

sleep(2)
display(tq.status)


{'message': 'Processing new data',
 'isDataAvailable': True,
 'isTriggerActive': True}
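
To check that cleaned messages actually arrive, one option is to read a few of them back with the `kafka-python` package installed at the top of this notebook. The sketch below assumes the same broker address and topic as above; `decode_message` and `read_cleaned` are hypothetical helper names, and the import is placed inside the function so the decoding helper can be tried without a broker.

```python
import json

def decode_message(raw):
    """Turn the raw bytes of one Kafka message value into a dict."""
    return json.loads(raw.decode("utf-8"))

def read_cleaned(topic="ingest-cleaned", servers="localhost:9092", limit=5):
    """Read up to `limit` messages from the topic and return them decoded."""
    from kafka import KafkaConsumer  # provided by the kafka-python package
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        auto_offset_reset="earliest",   # start at the beginning of the topic
        consumer_timeout_ms=5000,       # stop iterating after 5 s without messages
    )
    try:
        messages = []
        for message in consumer:
            messages.append(decode_message(message.value))
            if len(messages) >= limit:
                break
        return messages
    finally:
        consumer.close()

# Decoding can be checked without a running broker:
sample = decode_message(b'{"majorTitle": "EMS", "hour": 16}')
```

Calling `read_cleaned()` while the query is running should return the first few cleaned records. Remember to call `tq.stop()` once you are done, since this last query is left running.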