<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Structured Streaming

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Run the code in a Databricks Community Edition workspace.


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the<br/>
start of the lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end.

In [None]:
%run "../Includes/Classroom-Setup"

Define the name of the stream used later in this lesson:

In [None]:
myStreamName = "lab02_ps"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png">Read Stream</h2>

The dataset used in this exercise contains information about flights to and from various airports in 2007.

Run the following cell to see what the streaming data will look like.

In [None]:
display(
  spark.read.parquet("dbfs:/mnt/training/asa/flights/2007-01-stream.parquet/part-00000-tid-9167815511861375854-22d81a30-d5b4-43d0-9216-0c20d14c3f54-178-c000.snappy.parquet")
)

DepartureAt,FlightDate,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,22:05,07:24,05:33,NW,96,N593NW,323.0,328,293.0,111.0,116,HNL,SEA,2677,5,25,0,,0,111,0,0,0,0
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,23:45,07:22,06:50,DL,704,N607DL,261.0,245,228.0,32.0,16,LAX,CVG,1900,7,26,0,,0,16,0,16,0,0
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,23:30,07:05,06:42,DL,748,N380DA,244.0,252,221.0,23.0,31,SEA,CVG,1964,7,16,0,,0,23,0,0,0,0
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,22:45,08:44,07:10,AA,192,N632AA,343.0,325,320.0,94.0,76,LAX,BOS,2611,6,17,0,,0,51,0,18,0,25
2007-01-01T00:01:00.000+0000,2007-01-01,00:01,23:55,06:13,06:15,CO,434,N76504,252.0,260,232.0,-2.0,6,PHX,EWR,2133,5,15,0,,0,0,0,0,0,0
2007-01-01T00:03:00.000+0000,2007-01-01,00:03,23:55,02:48,02:32,YV,2876,N921FJ,105.0,97,80.0,16.0,8,LAS,ELP,584,4,21,0,,0,16,0,0,0,0
2007-01-01T00:04:00.000+0000,2007-01-01,00:04,00:10,07:34,07:38,NW,336,N523US,270.0,268,233.0,-4.0,-6,LAX,DTW,1979,13,24,0,,0,0,0,0,0,0
2007-01-01T00:04:00.000+0000,2007-01-01,00:04,23:57,07:32,07:16,US,49,N828AW,268.0,259,243.0,16.0,7,LAS,DCA,2089,3,22,0,,0,7,0,9,0,0
2007-01-01T00:04:00.000+0000,2007-01-01,00:04,20:50,02:04,23:00,AA,2376,N509AA,120.0,130,99.0,184.0,194,DFW,ORD,802,12,9,0,,0,0,0,0,0,184
2007-01-01T00:05:00.000+0000,2007-01-01,00:05,23:55,05:30,05:27,CO,229,N17344,205.0,212,180.0,3.0,10,DEN,EWR,1605,9,16,0,,0,0,0,0,0,0


Start by reading the stream. 

For this step you will need to:
1. Start with `spark`, an instance of `SparkSession`, and get the `DataStreamReader`
2. Consume only 1 file per trigger (the `maxFilesPerTrigger` option)
3. Specify the stream's schema using the instance `dataSchema`
4. Use `dsr.parquet()` to specify the stream's file type and source directory, `dataPath`

When you are done, run the TEST cell that follows to verify your results.

In [None]:
# TODO
dataSchema = "DepartureAt timestamp, FlightDate string, DepTime string, CRSDepTime string, ArrTime string, CRSArrTime string, UniqueCarrier string, FlightNum integer, TailNum string, ActualElapsedTime string, CRSElapsedTime string, AirTime string, ArrDelay string, DepDelay string, Origin string, Dest string, Distance string, TaxiIn string, TaxiOut string, Cancelled integer, CancellationCode string, Diverted integer, CarrierDelay string, WeatherDelay string, NASDelay string, SecurityDelay string, LateAircraftDelay string"

dataPath = "dbfs:/mnt/training/asa/flights/2007-01-stream.parquet"

initialDF = (spark.readStream
             .option("maxFilesPerTrigger", 1)
             .schema(dataSchema)
             .parquet(dataPath)
)
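
Because `dataSchema` is one long DDL string, a misspelled or missing column is easy to overlook. The following is a minimal plain-Python sketch, not part of the course material, that splits such a string into (name, type) pairs so it can be inspected. The helper name `ddl_columns` is hypothetical, and it assumes flat, simple types like the ones in this lesson; it would not handle nested types or types that contain commas.

```python
def ddl_columns(ddl):
    """Split a flat DDL schema string into (column name, type) pairs."""
    pairs = []
    for field in ddl.split(","):
        name, col_type = field.split()   # e.g. "FlightNum integer"
        pairs.append((name, col_type))
    return pairs

# Shortened excerpt of dataSchema, used only to keep the example small
sample_schema = "DepartureAt timestamp, FlightDate string, FlightNum integer"
columns = ddl_columns(sample_schema)
print(columns)
```

Passing the full `dataSchema` should yield 27 pairs, one per column in the sample output shown earlier.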


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Calculate the total of all delays</h2>

We want to calculate (and later graph) the total delay of each flight:
1. Start with `initialDF` from the previous cell. 
2. Convert the following columns from `String` to `Integer`: `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay` and `LateAircraftDelay`
3. Add the column `TotalDelay` which is the sum of the other 5 delays
4. Filter the flights by `UniqueCarrier` down to the carriers **AS**, **AQ**, **HA** and **F9**
5. Filter the results to non-zero delays (`TotalDelay` > 0)
6. Assign the final DataFrame to `delaysDF`



In [None]:
# TODO
from pyspark.sql.functions import *
delaysDF = (initialDF
  .withColumn("CarrierDelay", col("CarrierDelay").cast("integer"))  # Convert CarrierDelay to an Integer
  .withColumn("WeatherDelay", col("WeatherDelay").cast("integer"))  # Convert WeatherDelay to an Integer
  .withColumn("NASDelay", col("NASDelay").cast("integer"))  # Convert NASDelay to an Integer
  .withColumn("SecurityDelay", col("SecurityDelay").cast("integer"))  # Convert SecurityDelay to an Integer
  .withColumn("LateAircraftDelay", col("LateAircraftDelay").cast("integer"))  # Convert LateAircraftDelay to an Integer
  .withColumn("TotalDelay", col("CarrierDelay")+col("WeatherDelay")+col("NASDelay")+col("SecurityDelay")+col("LateAircraftDelay"))  # Sum all five as TotalDelay
  .filter(col("UniqueCarrier").isin("AS","AQ","HA","F9"))   # Filter UniqueCarrier to only "AS", "AQ", "HA" and "F9"
  .filter(col("TotalDelay") > 0)  # TotalDelay to non-zero values
)
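
To check what `delaysDF` should contain without waiting for the stream, the same cast, sum, and filter steps can be sketched in plain Python. This is an illustration only, not the Spark implementation: the helper name `total_delays` is hypothetical, the rows are copied from the sample outputs in this lesson (delay columns only), and unlike Spark's `cast`, Python's `int()` raises an error on a non-numeric string instead of producing a null.

```python
CARRIERS = {"AS", "AQ", "HA", "F9"}
DELAY_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                 "SecurityDelay", "LateAircraftDelay"]

def total_delays(rows):
    """Mirror the delaysDF steps on plain dicts: cast, sum, filter."""
    result = []
    for row in rows:
        # Cast each delay from string to int, then sum the five values
        total = sum(int(row[c]) for c in DELAY_COLUMNS)
        # Keep only the four carriers of interest with a non-zero total
        if row["UniqueCarrier"] in CARRIERS and total > 0:
            result.append({**row, "TotalDelay": total})
    return result

def make_row(carrier, *delays):
    """Build a row dict with string-typed delays, as in the stream schema."""
    row = {"UniqueCarrier": carrier}
    row.update({c: str(d) for c, d in zip(DELAY_COLUMNS, delays)})
    return row

rows = [
    make_row("NW", 111, 0, 0, 0, 0),  # dropped: carrier not in the list
    make_row("CO", 0, 0, 0, 0, 0),    # dropped: carrier not in the list
    make_row("F9", 11, 0, 5, 0, 0),   # kept: 11 + 5 = 16
    make_row("AS", 39, 0, 1, 0, 0),   # kept: 39 + 1 = 40
]

for r in total_delays(rows):
    print(r["UniqueCarrier"], r["TotalDelay"])
```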

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png">Plot a LIVE graph</h2>

Plot `delaysDF` and name the stream "delays_python".

Once the data is loaded, render a line graph with:
* **Keys** set to `DepartureAt`
* **Series groupings** set to `UniqueCarrier`
* **Values** set to `TotalDelay`


In [None]:
initialDF.isStreaming

In [None]:
# TODO
myStreamName = "delays_python"
display(delaysDF, streamName = myStreamName)

DepartureAt,FlightDate,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,TotalDelay
2007-01-01T00:06:00.000+0000,2007-01-01,00:06,23:55,05:51,05:35,F9,380,N948FR,225,220,208,16,11,DEN,FLL,1703,4,13,0,,0,11,0,5,0,0,16
2007-01-01T00:34:00.000+0000,2007-01-01,00:34,23:55,01:32,00:52,AS,191,N765AS,58,57,37,40,39,ANC,FAI,261,4,17,0,,0,39,0,1,0,0,40
2007-01-01T07:05:00.000+0000,2007-01-01,07:05,07:00,08:39,08:23,AS,583,N622AS,94,83,56,16,5,LAX,SFO,337,21,17,0,,0,0,0,16,0,0,16
2007-01-01T08:06:00.000+0000,2007-01-01,08:06,08:10,15:45,15:25,HA,10,N590HA,459,435,432,20,-4,HNL,LAX,2556,10,17,0,,0,20,0,0,0,0,20
2007-01-01T08:34:00.000+0000,2007-01-01,08:34,07:25,10:39,09:30,AS,351,N302AS,125,125,109,69,69,SFO,SEA,679,3,13,0,,0,69,0,0,0,0,69
2007-01-01T08:35:00.000+0000,2007-01-01,08:35,08:30,11:48,11:30,AS,595,N972AS,193,180,168,18,5,SAN,SEA,1050,6,19,0,,0,0,0,18,0,0,18
2007-01-01T09:11:00.000+0000,2007-01-01,09:11,08:05,11:10,10:22,AS,232,N786AS,119,137,98,48,66,SEA,SFO,679,5,16,0,,0,48,0,0,0,0,48
2007-01-01T09:29:00.000+0000,2007-01-01,09:29,09:00,13:57,13:30,AS,196,N645AS,208,210,187,27,29,ANC,SEA,1449,7,14,0,,0,27,0,0,0,0,27
2007-01-01T09:42:00.000+0000,2007-01-01,09:42,09:15,12:29,11:56,AS,663,N649AS,167,161,146,33,27,LAS,SEA,866,3,18,0,,0,27,0,6,0,0,33
2007-01-01T09:47:00.000+0000,2007-01-01,09:47,08:25,12:54,12:00,HA,35,N594HA,187,215,166,54,82,PHX,HNL,2917,3,18,0,,0,54,0,0,0,0,54


In [None]:
# TEST - Run this cell to test your solution.
count = 0
for s in spark.streams.active:
  if (s.name == myStreamName):
    count = count + 1

dbTest("SS-02-runningCount", 1, count)

print("Tests passed!")

When you are done, stop the stream:

In [None]:
stopAllStreams()
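
`stopAllStreams()` is provided by `Classroom-Setup`, and its implementation is not shown in this lesson. Outside the course environment, a rough stand-in might look like the sketch below. It assumes the helper simply stops every active query; the name `stop_all_active_streams` is hypothetical, and the real helper may do more (for example, wait for termination).

```python
def stop_all_active_streams(spark):
    """Stop every active streaming query; return the names that were stopped."""
    stopped = []
    for query in spark.streams.active:   # all queries still running on this session
        query.stop()                     # blocks until the query has stopped
        stopped.append(query.name)       # name is None for unnamed queries
    return stopped
```

Returning the names makes it easy to confirm afterwards that the expected stream, here "delays_python", was among those stopped.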

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png">Write Stream</h2>

Write the stream to an in-memory table:
1. Use the appropriate `format`
2. Append new records to the results table
3. Configure a 15-second trigger
4. Name the query "delays_python"
5. Start the query
6. Assign the query to `delayQuery`

In [None]:
# TODO
delayQuery = (delaysDF.writeStream            # From the DataFrame get the DataStreamWriter
  .format("memory")                           # Specify the sink format as "memory"
  .outputMode("append")                       # Configure the output mode as "append"
  .queryName(myStreamName)                    # Name the query with myStreamName
  .trigger(processingTime = "15 seconds")     # Use a 15 second trigger
  .start()                                    # Start the query
)

In [None]:
# TEST - Run this cell to test your solution.
dbTest("SS-02-isActive", True, delayQuery.isActive)
dbTest("SS-02-name", myStreamName, delayQuery.name)
# The query's trigger is not available via the Python API

print("Tests passed!")

Wait until the stream is done initializing...

In [None]:
myStreamName = "delays_python"
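
With the `memory` sink, the query name doubles as the name of an in-memory table, so the results can be inspected with ordinary SQL once the first trigger has completed. The sketch below is one possible way to do that; the helper name `top_delays` and the choice of columns and ordering are assumptions made for illustration.

```python
def top_delays(spark, table_name="delays_python", limit=10):
    """Return the largest total delays written so far to the in-memory table."""
    query = (
        f"SELECT DepartureAt, UniqueCarrier, TotalDelay "
        f"FROM {table_name} "
        f"ORDER BY TotalDelay DESC LIMIT {limit}"
    )
    return spark.sql(query)   # a static DataFrame: a snapshot of the table at call time
```

In a notebook, `display(top_delays(spark))` would show the snapshot; re-running the cell picks up rows appended by later triggers.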


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Exercise 5: Stop streaming jobs</h2>

Before we can conclude, we need to shut down all active streams.

In [None]:
# TODO
for stream in spark.streams.active:           # Iterate over all active streams
  try:
    print("stopping " + str(stream.name))     # A little console output (name may be None)
    stream.stop()                             # Stop the stream
  except Exception as e:
    print("some error occurred: " + str(e))   # Report the failure and continue

In [None]:
# TEST - Run this cell to test your solution.
dbTest("SS-02-numActiveStreams", 0, len(spark.streams.active))

print("Tests passed!")

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [None]:
%run "../Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Next Steps</h2>

Start the next part, [Time Windows]($../SS 03 - Time Windows).