In [4]:
%run "./Includes/Classroom-Setup"

Set up relevant paths.

In [7]:
dataPath = "dbfs:/mnt/training/definitive-guide/data/activity-data"

## Streaming Concepts

<b>Stream processing</b> is where you continuously incorporate new data into a data lake and compute results.

In many cases the data arrives faster than it can be consumed.

Treat a <b>stream</b> of data as a table to which data is continuously appended.

This course assumes Structured Streaming on Databricks, which uses the DataFrame API.

There are other kinds of streaming systems.

Examples of streaming data are bank card transactions, Internet of Things (IoT) device data, and video game play events.

Data coming from a stream is typically not ordered in any way.

A streaming system consists of:
* <b>Input sources</b> such as Kafka, Azure Event Hubs, files on a distributed system or TCP/IP sockets
* <b>Sinks</b> such as Kafka, Azure Event Hubs, various file formats, `foreach` sinks, console sinks or memory sinks

### Streaming and Databricks Delta

In streaming, the problems of traditional data pipelines are exacerbated. 

Specifically, streaming causes frequent metadata refreshes, table repairs, and an accumulation of small files every second or every minute.

Many small files result because data may be streamed in at low volumes with short triggers.

Databricks Delta is uniquely designed to address these needs.

### READ Stream using Databricks Delta

The `readStream` method behaves like a transformation: it returns a DataFrame whose schema is the one passed to `.schema()`.

Each line of the streaming data becomes a row in the DataFrame once `writeStream` is invoked.

In this lesson, we limit the flow of the stream to one file per trigger with `option("maxFilesPerTrigger", 1)` so that you do not exceed any file quotas you may have on your end. For a Delta source the default value is 1000; the JSON file source used here has no limit by default.

Notice that nothing happens until you invoke an action, i.e. a `writeStream` operation a few cells down.

We also normalize the data:
* Convert `Arrival_Time` to a `timestamp` column named `event_time`.
* Rename `Index` to `User_ID`.

In [10]:
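# Read the data once as a static DataFrame to capture its schema;
# by default, streaming file sources need the schema supplied up front.
static = spark.read.json(dataPath)
dataSchema = static.schema

deltaStreamWithTimestampDF = (spark
  .readStream
  .option("maxFilesPerTrigger", 1)   # one file per micro-batch
  .schema(dataSchema)
  .json(dataPath)
  .withColumnRenamed('Index', 'User_ID')
  # Arrival_Time is divided by 1000 before the cast, i.e. treated as epoch milliseconds.
  .selectExpr("*","cast(cast(Arrival_Time as double)/1000 as timestamp) as event_time")
)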
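
The `event_time` expression above divides `Arrival_Time` by 1000 before casting, which suggests the column holds milliseconds since the Unix epoch. As a rough illustration of the same conversion outside Spark, here is a minimal plain-Python sketch. The helper name `arrival_time_to_timestamp` and the sample value are hypothetical and are not taken from the data set.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def arrival_time_to_timestamp(arrival_time_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Integer millisecond arithmetic avoids floating-point rounding.
    return EPOCH + timedelta(milliseconds=arrival_time_ms)

# Hypothetical sample value, chosen only for illustration.
sample_ms = 1424686735090
event_time = arrival_time_to_timestamp(sample_ms)
print(event_time.isoformat())
```

Skipping the division by 1000 would treat the value as seconds and produce a date far in the future.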

### WRITE Stream using Databricks Delta

#### General Notation
Use this format to write a streaming job to a Databricks Delta table.

> `(myDF` <br>
  `.writeStream` <br>
  `.format("delta")` <br>
  `.option("checkpointLocation", somePath)` <br>
  `.outputMode("append")` <br>
  `.table("my_table")` or `.start(path)` <br>
`)`

If you use the `.table()` notation, it will write output to a default location. 
* This would be in parquet files under `/user/hive/warehouse/default.db/my_table`

In this course, we want everyone to write data to their own directory; so, instead, we use the `.start()` notation.

#### Output Modes
Besides the "obvious" parameters, you also specify `outputMode`, which can take these values:
* `append`: add only new records to output sink
* `complete`: rewrite full output - applicable to aggregation operations
* `update`: update changed records in place
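
To make the difference between the three modes concrete, here is a toy plain-Python simulation. It is not Spark code: the function `emit` and the sample batches are hypothetical, and the sketch only mimics what a sink would receive for a running count per key.

```python
from collections import Counter

def emit(mode, running_counts, batch):
    """Return what a sink would receive for one trigger under the given mode."""
    if mode == "append":
        # No aggregation here: each new record is written once, as-is.
        return list(batch)
    batch_counts = Counter(batch)
    running_counts.update(batch_counts)  # fold this batch into the running totals
    if mode == "complete":
        # Rewrite the whole result table on every trigger.
        return sorted(running_counts.items())
    if mode == "update":
        # Only the keys whose counts changed in this trigger.
        return sorted((key, running_counts[key]) for key in batch_counts)
    raise ValueError("unknown output mode: " + mode)

counts = Counter()
print(emit("complete", counts, ["walk", "sit", "walk"]))
print(emit("update", counts, ["walk"]))
print(emit("append", counts, ["bike"]))
```

In this sketch `complete` resends every key on each trigger, while `update` sends only the keys touched by the latest batch.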

#### Checkpointing

When defining a Delta streaming query, one of the options that you need to specify is the location of a checkpoint directory.

`.writeStream.format("delta").option("checkpointLocation", <path-to-checkpoint-directory>) ...`

This is actually a Structured Streaming feature. It stores the current state of your streaming job.

Should your streaming job stop for some reason and you restart it, it will continue from where it left off.

If you do not have a checkpoint directory, when the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

Also note that every streaming job should have its own checkpoint directory: no sharing.
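
The idea behind checkpointing can be sketched with a toy example in plain Python. This is only an illustration of "remember how far you got". A real Structured Streaming checkpoint directory holds more than a single offset, and the function `process_new_records` and the file name `offsets.json` are hypothetical.

```python
import json
import os
import tempfile

def process_new_records(records, checkpoint_file):
    """Return the records not yet processed, then persist the new offset."""
    offset = 0
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as f:
            offset = json.load(f)["offset"]
    pending = records[offset:]
    # Real work on `pending` would happen here.
    with open(checkpoint_file, "w") as f:
        json.dump({"offset": len(records)}, f)
    return pending

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "offsets.json")
    first_run = process_new_records(["a", "b"], ckpt)
    # Restart with the checkpoint in place: only the new record is processed.
    second_run = process_new_records(["a", "b", "c"], ckpt)
    # Restart after losing the checkpoint: everything is processed again.
    os.remove(ckpt)
    third_run = process_new_records(["a", "b", "c"], ckpt)

print(first_run, second_run, third_run)
```

The sketch also hints at why checkpoints must not be shared: two jobs writing to the same `offsets.json` would overwrite each other's progress.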

### Let's Do Some Streaming

In the cell below, we write a streaming query to a Databricks Delta table.

Notice how we do not need to specify a schema for the output table: it is taken from the streaming DataFrame!

And to help us manage our streams better, we will make use of **`untilStreamIsReady()`**, **`stopAllStreams()`** and define the following, **`myStreamName`**:

In [14]:
myStreamName = "lesson05_ps"

In [15]:
writePath =      workingDir + "/output.delta"
checkpointPath = workingDir + "/output.checkpoint"

deltaStreamingQuery = (deltaStreamWithTimestampDF
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .outputMode("append")
  .queryName(myStreamName)
  .start(writePath)
)

List the active streams.

In [17]:
for s in spark.streams.active:
  print("{}: {}".format(s.name, s.id))

Wait until stream is done initializing...

In [19]:
untilStreamIsReady(myStreamName)

In [20]:
stopAllStreams()
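
The helpers `untilStreamIsReady()` and `stopAllStreams()` come from the classroom setup, and their implementation is not shown in this lesson. As a rough guess at what such helpers could look like, here is a sketch built only on `spark.streams.active` and the `name`, `recentProgress` and `stop()` members of a streaming query. The function names, the explicit `spark` argument and the timeout parameters are assumptions, not the course's actual code.

```python
import time

def stop_all_streams(spark):
    """Stop every active streaming query and return the names that were stopped."""
    stopped = []
    for query in spark.streams.active:
        query.stop()
        stopped.append(query.name)
    return stopped

def until_stream_is_ready(spark, name, timeout_s=60.0, poll_s=0.5):
    """Wait until the named query has reported progress; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for query in spark.streams.active:
            # A non-empty recentProgress means at least one progress update was reported.
            if query.name == name and len(query.recentProgress) > 0:
                return True
        time.sleep(poll_s)
    return False
```

In a notebook you would call these as `until_stream_is_ready(spark, myStreamName)` and `stop_all_streams(spark)`.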

# LAB

## Step 1: Table-to-Table Stream

Here we read a stream of data from `writePath` and write another stream to `activityPath`.

The data consists of a grouped count of `gt` events.

Make sure the stream using `deltaStreamingQuery` is still running!

To perform an aggregate operation, what kind of `outputMode` should you use?

In [23]:
# TODO
activityPath   = workingDir + "/activityCount.delta"
checkpointPath = workingDir + "/activityCount.checkpoint"

activityCountsQuery = (spark.readStream
  .format("delta")
  .load(str(writePath))   
  .groupBy("gt")
  .count()
  .writeStream
  .format("delta")
  .option("checkpointLocation", checkpointPath)
  .outputMode("complete")
  .queryName(myStreamName)
  .start(activityPath)
)

Wait until stream is done initializing...

In [25]:
untilStreamIsReady(myStreamName)

In [26]:
# TEST - Run this cell to test your solution.
activityQueryTruth = spark.streams.get(activityCountsQuery.id).isActive

dbTest("Delta-05-activityCountsQuery", True, activityQueryTruth)

print("Tests passed!")

In [27]:
stopAllStreams()

## Step 2

Plot the occurrence of all events grouped by `gt`.

Under <b>Plot Options</b>, use the following:
* <b>Series Groupings:</b> `gt`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>Bar Chart</b> and click <b>Apply</b>.

In the cell below, we use the `withWatermark` and `window` methods, which aren't covered in this course. 
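
Although these methods are outside the scope of the course, the grouping idea is simple: a 60-minute `window` puts every event into a fixed one-hour bucket based on its `event_time`, and the watermark tells Spark how long to keep waiting for late events. The plain-Python sketch below illustrates only the bucketing step and leaves watermark handling out. The helpers `tumbling_window` and `count_by_window` and the sample events are hypothetical, and the sketch assumes buckets are aligned to the Unix epoch.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_window(event_time, size=timedelta(minutes=60)):
    """Return the (start, end) of the fixed-size window containing event_time."""
    offset = (event_time - EPOCH) % size  # time elapsed since the window opened
    start = event_time - offset
    return start, start + size

def count_by_window(events):
    """Count events per (window start, gt) pair."""
    counts = Counter()
    for event_time, gt in events:
        start, _ = tumbling_window(event_time)
        counts[(start, gt)] += 1
    return counts

# Hypothetical events, for illustration only.
events = [
    (datetime(2015, 2, 23, 10, 18, 55, tzinfo=timezone.utc), "walk"),
    (datetime(2015, 2, 23, 10, 59, 59, tzinfo=timezone.utc), "walk"),
    (datetime(2015, 2, 23, 11, 0, 0, tzinfo=timezone.utc), "walk"),
]
window_counts = count_by_window(events)
for (start, gt), n in sorted(window_counts.items()):
    print(start.isoformat(), gt, n)
```

An event stamped exactly on the hour opens a new bucket, because each window includes its start and excludes its end.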

In [29]:
# TODO
from pyspark.sql.functions import hour, window, col

countsDF = (deltaStreamWithTimestampDF      
  .withWatermark("event_time", "180 minutes")
  .groupBy(window("event_time", "60 minute"), "gt")
  .count()
)

display(countsDF.withColumn('hour',hour(col('window.start'))), streamName = myStreamName)

| window | gt | count | hour |
|---|---|---|---|
| List(2015-02-23T10:00:00.000+0000, 2015-02-23T11:00:00.000+0000) | walk | 14268 | 10 |
| List(2015-02-23T10:00:00.000+0000, 2015-02-23T11:00:00.000+0000) | stairsup | 8685 | 10 |
| List(2015-02-24T13:00:00.000+0000, 2015-02-24T14:00:00.000+0000) | sit | 9827 | 13 |
| List(2015-02-24T13:00:00.000+0000, 2015-02-24T14:00:00.000+0000) | | 10909 | 13 |
| List(2015-02-23T14:00:00.000+0000, 2015-02-23T15:00:00.000+0000) | stairsdown | 9576 | 14 |
| List(2015-02-24T13:00:00.000+0000, 2015-02-24T14:00:00.000+0000) | walk | 14589 | 13 |
| List(2015-02-23T10:00:00.000+0000, 2015-02-23T11:00:00.000+0000) | bike | 10008 | 10 |
| List(2015-02-24T14:00:00.000+0000, 2015-02-24T15:00:00.000+0000) | bike | 27422 | 14 |
| List(2015-02-24T14:00:00.000+0000, 2015-02-24T15:00:00.000+0000) | walk | 14220 | 14 |
| List(2015-02-23T13:00:00.000+0000, 2015-02-23T14:00:00.000+0000) | stairsdown | 9403 | 13 |


In [30]:
untilStreamIsReady(myStreamName)

In [31]:
# TEST - Run this cell to test your solution.
schemaStr = str(countsDF.schema)

dbTest("Assertion #1", 3, len(countsDF.columns))
dbTest("Assertion #2", True, "(gt,StringType,true)" in schemaStr) 
dbTest("Assertion #3", True, "(count,LongType,false)" in schemaStr) 

dbTest("Assertion #5", True, "window,StructType" in schemaStr)
dbTest("Assertion #6", True, "(start,TimestampType,true)" in schemaStr) 
dbTest("Assertion #7", True, "(end,TimestampType,true)" in schemaStr) 

print("Tests passed!")

In [32]:
stopAllStreams()

In [34]:
%run "./Includes/Classroom-Cleanup"

## Review Questions
**Q:** Why is Databricks Delta so important for a data lake that incorporates streaming data?<br>
**A:** Streaming causes frequent metadata refreshes, table repairs, and an accumulation of small files every second or every minute. Databricks Delta is designed to address these problems.

**Q:** What happens if you shut off your stream before it has fully initialized and started and you try to `CREATE TABLE .. USING DELTA` ? <br>
**A:** You will get this: `Error in SQL statement: AnalysisException: The user specified schema is empty;`.

**Q:** When you issue a write stream command, what does the option `outputMode("append")` do?<br>
**A:** It adds only new records to the output sink. The `outputMode` option takes the following values:
* <b>append</b>: add only new records to output sink
* <b>complete</b>: rewrite full output - applicable to aggregation operations
* <b>update</b>: update changed records in place

**Q:** What happens if you do not specify `option("checkpointLocation", pointer-to-checkpoint directory)`?<br>
**A:** When the streaming job stops, you lose all state around your streaming job and upon restart, you start from scratch.

**Q:** How do you view the list of active streams?<br>
**A:** Invoke `spark.streams.active`.

**Q:** How do you verify whether `streamingQuery` is running (boolean output)?<br>
**A:** Invoke `spark.streams.get(streamingQuery.id).isActive`.