<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Working with Time Windows Lab

#### In this lesson, you will practice:
* Using sliding windows to aggregate over chunks of data rather than all data
* Applying watermarking to discard stale data that you do not have space to keep
* Plotting live graphs using `display`

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Getting Started</h2>

Run the following cell to configure our "classroom."

In [0]:
%run "../Includes/Classroom-Setup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 1: Read data into a stream</h2>

The dataset used in this exercise consists of information about flights to and from various airports in 2007.

The following cell shows what the streaming data will look like.

In [0]:
display(
  spark.read.parquet("dbfs:/mnt/training/asa/flights/2007-01-stream.parquet/part-00000-tid-9167815511861375854-22d81a30-d5b4-43d0-9216-0c20d14c3f54-178-c000.snappy.parquet")
)

DepartureAt,FlightDate,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,22:05,07:24,05:33,NW,96,N593NW,323.0,328,293.0,111.0,116,HNL,SEA,2677,5,25,0,,0,111,0,0,0,0
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,23:45,07:22,06:50,DL,704,N607DL,261.0,245,228.0,32.0,16,LAX,CVG,1900,7,26,0,,0,16,0,16,0,0
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,23:30,07:05,06:42,DL,748,N380DA,244.0,252,221.0,23.0,31,SEA,CVG,1964,7,16,0,,0,23,0,0,0,0
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,22:45,08:44,07:10,AA,192,N632AA,343.0,325,320.0,94.0,76,LAX,BOS,2611,6,17,0,,0,51,0,18,0,25
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,23:55,06:13,06:15,CO,434,N76504,252.0,260,232.0,-2.0,6,PHX,EWR,2133,5,15,0,,0,0,0,0,0,0
2007-01-01T00:03:00.000+0000,2007-01-01,00:03,23:55,02:48,02:32,YV,2876,N921FJ,105.0,97,80.0,16.0,8,LAS,ELP,584,4,21,0,,0,16,0,0,0,0
2007-01-01T00:04:00.000+0000,2007-01-01,00:04,00:10,07:34,07:38,NW,336,N523US,270.0,268,233.0,-4.0,-6,LAX,DTW,1979,13,24,0,,0,0,0,0,0,0
2007-01-01T00:04:00.000+0000,2007-01-01,00:04,23:57,07:32,07:16,US,49,N828AW,268.0,259,243.0,16.0,7,LAS,DCA,2089,3,22,0,,0,7,0,9,0,0
2007-01-01T00:04:00.000+0000,2007-01-01,00:04,20:50,02:04,23:00,AA,2376,N509AA,120.0,130,99.0,184.0,194,DFW,ORD,802,12,9,0,,0,0,0,0,0,184
2007-01-01T00:05:00.000+0000,2007-01-01,00:05,23:55,05:30,05:27,CO,229,N17344,205.0,212,180.0,3.0,10,DEN,EWR,1605,9,16,0,,0,0,0,0,0,0


For this exercise you will need to complete the following tasks:
0. Start a stream that reads parquet files dumped to the directory `dataPath`
0. Control the size of each micro-batch by forcing Spark to process only 1 file per trigger.

Other notes:
0. The source data has already been defined as `dataPath`
0. The schema has already been defined as `parquetSchema`

In [0]:
# TODO
dataPath = "/mnt/training/asa/flights/2007-01-stream.parquet/"

parquetSchema = "DepartureAt timestamp, FlightDate string, DepTime string, CRSDepTime string, ArrTime string, CRSArrTime string, UniqueCarrier string, FlightNum integer, TailNum string, ActualElapsedTime string, CRSElapsedTime string, AirTime string, ArrDelay string, DepDelay string, Origin string, Dest string, Distance string, TaxiIn string, TaxiOut string, Cancelled integer, CancellationCode string, Diverted integer, CarrierDelay string, WeatherDelay string, NASDelay string, SecurityDelay string, LateAircraftDelay string"
  
# Configure the shuffle partitions to match the number of cores  
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

streamDF = (spark                          # Start with the SparkSession
  .readStream                              # Get the DataStreamReader
  .format("parquet")                       # Configure the stream's source for the appropriate file type
  .schema(parquetSchema)                   # Specify the parquet files' schema
  .option("maxFilesPerTrigger", 1)         # Restrict Spark to processing only 1 file per trigger
  .load(dataPath)                          # Load the DataFrame specifying its location with dataPath
)

In [0]:
# TEST - Run this cell to test your solution.
schemaStr = str(streamDF.schema)

dbTest("SS-03-shuffles",  sc.defaultParallelism, spark.conf.get("spark.sql.shuffle.partitions"))

dbTest("SS-03-schema-1",  True, "(DepartureAt,TimestampType,true)" in schemaStr)
dbTest("SS-03-schema-2",  True, "(FlightDate,StringType,true)" in schemaStr)
dbTest("SS-03-schema-3",  True, "(DepTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-4",  True, "(CRSDepTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-5",  True, "(ArrTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-6",  True, "(CRSArrTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-7",  True, "(UniqueCarrier,StringType,true)" in schemaStr)
dbTest("SS-03-schema-8",  True, "(FlightNum,IntegerType,true)" in schemaStr)
dbTest("SS-03-schema-9",  True, "(TailNum,StringType,true)" in schemaStr)
dbTest("SS-03-schema-10",  True, "(ActualElapsedTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-11",  True, "(CRSElapsedTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-12",  True, "(AirTime,StringType,true)" in schemaStr)
dbTest("SS-03-schema-13",  True, "(ArrDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-14",  True, "(DepDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-15",  True, "(Origin,StringType,true)" in schemaStr)
dbTest("SS-03-schema-16",  True, "(Dest,StringType,true)" in schemaStr)
dbTest("SS-03-schema-17",  True, "(Distance,StringType,true)" in schemaStr)
dbTest("SS-03-schema-18",  True, "(TaxiIn,StringType,true)" in schemaStr)
dbTest("SS-03-schema-19",  True, "(TaxiOut,StringType,true)" in schemaStr)
dbTest("SS-03-schema-20",  True, "(Cancelled,IntegerType,true)" in schemaStr)
dbTest("SS-03-schema-21",  True, "(CancellationCode,StringType,true)" in schemaStr)
dbTest("SS-03-schema-22",  True, "(Diverted,IntegerType,true)" in schemaStr)
dbTest("SS-03-schema-23",  True, "(CarrierDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-24",  True, "(WeatherDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-25",  True, "(NASDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-26",  True, "(SecurityDelay,StringType,true)" in schemaStr)
dbTest("SS-03-schema-27",  True, "(LateAircraftDelay,StringType,true)" in schemaStr)

print("Tests passed!")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 2: Plot grouped events</h2>

Plot the count of all flights aggregated by a 30 minute window and `UniqueCarrier`. 

Ignore any events delayed by 300 minutes or more.
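To make the idea of discarding late events concrete, here is a minimal pure-Python sketch of a watermark. It is a simplification, not Spark's implementation: the function name `split_by_watermark` is hypothetical, the threshold is updated after every event rather than at micro-batch boundaries, and an event exactly at the threshold is kept as an arbitrary choice in this sketch.

```python
from datetime import datetime, timedelta

def split_by_watermark(event_times, delay_minutes):
    """Split events (in arrival order) into kept and dropped lists.

    Hypothetical helper: an event is dropped when its event time is older
    than (latest event time seen so far - delay).
    """
    delay = timedelta(minutes=delay_minutes)
    max_seen = None
    kept, dropped = [], []
    for t in event_times:
        # Compare against the threshold derived from earlier events only
        if max_seen is not None and t < max_seen - delay:
            dropped.append(t)
            continue
        kept.append(t)
        max_seen = t if max_seen is None else max(max_seen, t)
    return kept, dropped

base = datetime(2007, 1, 1)
arrivals = [
    base,                          # 00:00
    base + timedelta(hours=6),     # 06:00, moves the threshold to 01:00
    base + timedelta(minutes=30),  # 00:30, older than the threshold
    base + timedelta(minutes=90),  # 01:30, still within the 300 minute delay
]
kept, dropped = split_by_watermark(arrivals, 300)
print(len(kept), len(dropped))
```

In the stream itself, `withWatermark` plays this role on the `DepartureAt` column, which lets Spark bound how much aggregation state it has to keep.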

You will need to:
0. Use a watermark to discard events not received within 300 minutes
0. Configure the stream for a 30 minute window (when no slide duration is given, the windows do not overlap)
0. Aggregate by the 30 minute window and the column `UniqueCarrier`
0. Add the column `start` by extracting it from `window.start`
0. Sort the stream by `start`
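The windowing step above can be sketched in plain Python to show which windows an event falls into. This is an illustration under stated assumptions, not Spark's code: `window_starts` is a hypothetical helper, timestamps are naive datetimes, and windows are assumed to be half-open intervals aligned to the Unix epoch. When the slide equals the window length, each event lands in exactly one window; a shorter slide produces overlapping windows.

```python
from datetime import datetime, timedelta

def window_starts(event_time, window_minutes, slide_minutes=None):
    """Return the start of every window [start, start + window) containing event_time.

    Hypothetical helper; if slide_minutes is omitted, the slide equals the
    window length and the windows do not overlap.
    """
    window = timedelta(minutes=window_minutes)
    slide = timedelta(minutes=slide_minutes or window_minutes)
    epoch = datetime(1970, 1, 1)
    # Latest slide-aligned start that is not after the event
    start = epoch + ((event_time - epoch) // slide) * slide
    starts = []
    # Walk backwards while the window still covers the event
    while start + window > event_time:
        starts.append(start)
        start -= slide
    return sorted(starts)

event = datetime(2007, 1, 1, 0, 4)
print(window_starts(event, 30))      # one window starting at 00:00
print(window_starts(event, 30, 10))  # three overlapping windows
```

In PySpark, the third argument of `window` sets the slide duration, for example `window("DepartureAt", "30 minutes", "10 minutes")`.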

In order to create a LIVE bar chart of the data, you'll need to specify the following <b>Plot Options</b>:
* **Keys** is set to `start`
* **Series groupings** is set to `UniqueCarrier`
* **Values** is set to `count`

In [0]:
# TODO
from pyspark.sql.functions import window, col

countsDF = (streamDF                                                # Start with the DataFrame
  .withWatermark("DepartureAt", "300 minutes")                      # Specify the watermark
  .groupby(window("DepartureAt", "30 minutes"), 
           "UniqueCarrier")                                         # Aggregate the data
  .count()                                                          # Produce a count for each aggregate
  .withColumn("start", col("window.start"))                         # Add the column "start", extracting it from "window.start"
  .orderBy("start")                                                 # Sort the stream by "start"
)

display(countsDF, streamName = "flightCountStream")

window,UniqueCarrier,count,start
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",YV,8,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",DL,3,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",CO,5,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",WN,3,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",F9,3,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",NW,3,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",B6,4,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",AA,6,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",US,15,2007-01-01T00:00:00.000+0000
"List(2007-01-01T00:00:00.000+0000, 2007-01-01T00:30:00.000+0000)",UA,4,2007-01-01T00:00:00.000+0000


In [0]:
# TEST - Run this cell to test your solution.
schemaStr = str(countsDF.schema)

dbTest("SS-03-schema-1",  True, "(UniqueCarrier,StringType,true)" in schemaStr)
dbTest("SS-03-schema-2",  True, "(count,LongType,false)" in schemaStr)
dbTest("SS-03-schema-5",  True, "(start,TimestampType,true)" in schemaStr)

print("Tests passed!")

Wait until the stream is done initializing...

In [0]:
untilStreamIsReady("flightCountStream")

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 3: Stop streaming jobs</h2>

Before we can conclude, we need to shut down all active streams.

In [0]:
# TODO
for s in spark.streams.active:               # Iterate over all active streams
  print("stopping " + s.name)                # A little console output
  s.stop()                                   # Stop the stream

In [0]:
# TEST - Run this cell to test your solution.
dbTest("SS-03-numActiveStreams", 0, len(spark.streams.active))

print("Tests passed!")