# Spark Structured Streaming

This notebook is about *stream processing*, i.e. *near real-time* or *low-latency* data processing. Typically, a data source continuously emits new events, and the task is to process these events as soon as they occur. Traditionally, Spark has been a *batch processing* framework optimized for throughput rather than latency. Nevertheless, the Spark developers implemented a streaming mechanism that processes incoming data in *micro-batches* (i.e. very small batches, possibly containing only tens of records).
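To make the idea of micro-batching concrete, here is a minimal plain-Python sketch that cuts a stream of events into small batches. It is only an illustration: the helper name `micro_batches` and the fixed batch size are assumptions made for this example, whereas Spark derives batch boundaries from the trigger and from the data that has arrived so far.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of at most `batch_size` events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            # The (finite) example stream is exhausted.
            return
        yield batch


# Ten events cut into batches of at most four events each.
batch_sizes = [len(batch) for batch in micro_batches(range(10), 4)]
print(batch_sizes)
```

Each batch is then processed with the usual batch machinery, which is why the streaming API looks so similar to the batch DataFrame API.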

In order to follow this notebook, you need a streaming source. We will use Kafka, which is a widely used platform. The data in this notebook comes from Twitter and is freely available from The Internet Archive at https://archive.org/details/archiveteam-twitter-stream-2016-07

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", 16)

# 1. Connect to data source

First you need to fill a Kafka topic, for example via

    s3cat.py -I1 -B10 s3://dimajix-training/data/twitter-sample/ | /opt/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic twitter

Then we connect to the Kafka topic as the data source by using the `DataStreamReader` API via `spark.readStream`. We need to specify the options `kafka.bootstrap.servers` and `subscribe`, and we need to use the format `kafka`. The Kafka topic will stream Twitter data samples in raw JSON format, i.e. one JSON document per record.
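One possible way to set up such a reader is sketched below. The helper names `kafka_source_options` and `read_kafka_stream` are made up for this example, and the `startingOffsets` setting is an additional assumption that is not required by the exercise. The sketch also assumes that the Kafka connector package for Spark is available on the classpath.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the options for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Assumption: start at the oldest retained record, so that data
        # produced before the query was started is processed as well.
        "startingOffsets": "earliest",
    }


def read_kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame backed by the given Kafka topic."""
    options = kafka_source_options(bootstrap_servers, topic)
    return spark.readStream.format("kafka").options(**options).load()
```

With the producer command from above, a call could look like `read_kafka_stream(spark, master + ":9092", "twitter")`, where `master` holds the host name of the Kafka broker.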

In [None]:
!hostname

In [None]:
# Fill in the correct AWS VPC address of your master host
master = # YOUR CODE HERE

In [None]:
# Connect to Kafka using the DataStreamReader API via spark.readStream. You need to specify the options `kafka.bootstrap.servers`, `subscribe` and you need to use the format `kafka`
# YOUR CODE HERE

## 1.1 Inspect Schema

The result of the `load` method is again a `DataFrame`, but a streaming one. Like any other `DataFrame`, it has a schema, which we can inspect with the usual method:
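As a point of reference, the Kafka source always delivers the same set of columns, regardless of the payload. The following sketch lists them with the type names that `printSchema` prints. The names `KAFKA_SOURCE_COLUMNS` and `missing_kafka_columns` are hypothetical helpers for this example.

```python
# Columns provided by Spark's Kafka source and their Spark SQL types.
# The payload arrives in `value` as raw bytes and has to be cast to a
# string before any JSON function can be applied to it.
KAFKA_SOURCE_COLUMNS = {
    "key": "binary",
    "value": "binary",
    "topic": "string",
    "partition": "integer",
    "offset": "long",
    "timestamp": "timestamp",
    "timestampType": "integer",
}


def missing_kafka_columns(df):
    """Return the expected Kafka columns that are absent from `df`."""
    return [name for name in KAFKA_SOURCE_COLUMNS if name not in df.columns]
```

An empty result from `missing_kafka_columns` indicates that the DataFrame still carries all columns of the Kafka source.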

In [None]:
# YOUR CODE HERE

# 2. Inspect Data

Of course we also want to inspect the data inside the DataFrame. But this time, we cannot simply invoke `show`, because normal actions do not (directly) work on streaming DataFrames. Instead we need to create a continuous query. Later, we will see a neat trick for turning a streaming query into a volatile table.

In order to create a continuous query, we need to perform the following steps:

1. Create a `DataStreamWriter` by using the `writeStream` method of a DataFrame
2. Specify the output format. We use `console` in our case
3. Specify a checkpoint location on HDFS. This is required for restarting
4. Optionally specify a processing period
5. Start the query
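The five steps above can be sketched as follows. The helper names `unique_checkpoint` and `start_console_query` as well as the default interval are assumptions made for this example. Appending the current time to the checkpoint path means that every run starts from scratch instead of resuming an earlier query.

```python
import time


def unique_checkpoint(prefix):
    """Append the current time so that every run gets a fresh directory."""
    return prefix + str(time.time())


def start_console_query(df, checkpoint_prefix, interval="1 seconds"):
    """Start a continuous query that prints each micro-batch to the console."""
    return (
        df.writeStream                                    # step 1: DataStreamWriter
        .format("console")                                # step 2: output format
        .option("checkpointLocation",
                unique_checkpoint(checkpoint_prefix))     # step 3: checkpoint
        .trigger(processingTime=interval)                 # step 4: processing period
        .start()                                          # step 5: start the query
    )
```

The returned object is a `StreamingQuery`, which is needed later for stopping the query.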

In [None]:
import time

# YOUR CODE HERE

## 2.1 Stop Query

In contrast to the older RDD-based streaming API (DStreams), where a whole `StreamingContext` has to be stopped, we can stop an individual query by calling the `stop` method on the query object. This makes working with streams much easier.

In [None]:
# YOUR CODE HERE

# 3. Counting Hash-Tags

We now want to create a streaming hashtag count. First we need to extract the tweet text from the JSON document. Then we extract the hashtags in a way similar to the traditional batch DataFrame word count example, i.e. we split every line into words, keep only the hashtags, group the words and count the sizes of the groups.

Each record looks as follows:

```
{ "contributors" : null,
  "coordinates" : null,
  "created_at" : "Fri Jul 29 12:46:00 +0000 2016",
  "entities" : { "hashtags" : [  ],
      "symbols" : [  ],
      "urls" : [ { "display_url" : "fb.me/ItnwZEhy",
            "expanded_url" : "http://fb.me/ItnwZEhy",
            "indices" : [ 33,
                56
              ],
            "url" : "https://t.co/mM0if95F1K"
          } ],
      "user_mentions" : [  ]
    },
  "favorite_count" : 0,
  "favorited" : false,
  "filter_level" : "low",
  "geo" : null,
  "id" : 759007065155117058,
  "id_str" : "759007065155117058",
  "in_reply_to_screen_name" : null,
  "in_reply_to_status_id" : null,
  "in_reply_to_status_id_str" : null,
  "in_reply_to_user_id" : null,
  "in_reply_to_user_id_str" : null,
  "is_quote_status" : false,
  "lang" : "en",
  "place" : null,
  "possibly_sensitive" : false,
  "retweet_count" : 0,
  "retweeted" : false,
  "source" : "<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook</a>",
  "text" : "I posted a new video to Facebook https://t.co/mM0if95F1K",
  "timestamp_ms" : "1469796360659",
  "truncated" : false,
  "user" : { "contributors_enabled" : false,
      "created_at" : "Sat Sep 08 08:28:55 +0000 2012",
      "default_profile" : false,
      "default_profile_image" : false,
      "description" : null,
      "favourites_count" : 0,
      "follow_request_sent" : null,
      "followers_count" : 0,
      "following" : null,
      "friends_count" : 0,
      "geo_enabled" : false,
      "id" : 810489374,
      "id_str" : "810489374",
      "is_translator" : false,
      "lang" : "zh-tw",
      "listed_count" : 0,
      "location" : null,
      "name" : "張冥閻",
      "notifications" : null,
      "profile_background_color" : "FFF04D",
      "profile_background_image_url" : "http://abs.twimg.com/images/themes/theme19/bg.gif",
      "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme19/bg.gif",
      "profile_background_tile" : false,
      "profile_image_url" : "http://pbs.twimg.com/profile_images/378800000157469481/0a267258c8ccd1bf53d01c115677dbd7_normal.jpeg",
      "profile_image_url_https" : "https://pbs.twimg.com/profile_images/378800000157469481/0a267258c8ccd1bf53d01c115677dbd7_normal.jpeg",
      "profile_link_color" : "0099CC",
      "profile_sidebar_border_color" : "FFF8AD",
      "profile_sidebar_fill_color" : "F6FFD1",
      "profile_text_color" : "333333",
      "profile_use_background_image" : true,
      "protected" : false,
      "screen_name" : "nineemperor1",
      "statuses_count" : 9652,
      "time_zone" : null,
      "url" : null,
      "utc_offset" : null,
      "verified" : false
    }
}
```

In order to extract a field from a JSON document, we can use the `get_json_object` function.

## 3.1 Extract Tweet

First we need to extract the tweet text itself via the `get_json_object` function and store it into a new column.
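A minimal sketch of this step is shown below, written as a SQL expression so that it needs no additional imports. The equivalent function call is `f.get_json_object(column, "$.text")`. The names `TEXT_EXPR` and `extract_text` are made up for this example, and keeping the Kafka `timestamp` column is an assumption that becomes useful for the time-windowed aggregation later on.

```python
# SQL expression that decodes the binary Kafka payload into a string and
# then picks the top-level `text` field of the JSON document.
TEXT_EXPR = "get_json_object(CAST(value AS STRING), '$.text') AS text"


def extract_text(lines):
    """Keep the Kafka timestamp and add the tweet text as column `text`."""
    return lines.selectExpr("timestamp", TEXT_EXPR)
```

Records without a `text` field (for example deletion notices) yield `NULL` in the new column.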

In [None]:
ts_text = # YOUR CODE HERE

In [None]:
ts_text.printSchema()

## 3.2 Extract Topics

Now that we have the Tweet text itself, we extract all topics with the following approach:
1. Split text along spaces using `split`
2. Create multiple records from all words using `explode`
3. Keep only hashtags (words that start with a `#`)
4. Filter out all empty topics (the topic name consists only of the `#` character itself)
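The four steps can be sketched as follows, once in plain Python for a single tweet and once for a streaming DataFrame. Both function names are hypothetical, and the Spark variant assumes that the input has the columns `timestamp` and `text`.

```python
def hashtags(text):
    """Plain-Python reference: hashtags contained in a single tweet."""
    words = text.split(" ")
    return [word for word in words if word.startswith("#") and len(word) > 1]


def extract_topics(ts_text):
    """Same logic for a DataFrame with the columns `timestamp` and `text`."""
    return (
        ts_text
        # Steps 1 and 2: split along spaces, one output record per word.
        .selectExpr("timestamp", "explode(split(text, ' ')) AS topic")
        # Step 3: keep only words that start with '#'.
        .filter("topic LIKE '#%'")
        # Step 4: drop topics that consist of the '#' alone.
        .filter("length(topic) > 1")
    )


sample = hashtags("Learning #spark and #kafka # today")
print(sample)
```

Note that `split` in Spark SQL interprets its second argument as a regular expression, which makes no difference for a single space.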

In [None]:
topics = # YOUR CODE HERE

## 3.3 Count Topics

Now that we have the hash tags (topics), we perform a simple aggregation as usual: Group by hashtag (`topic`) and count number of tweets (using `count` or `sum(1)`)
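A sketch of the aggregation, assuming the input DataFrame has a column `topic`. The function name `count_topics` is made up for this example. The `Counter` below shows what the same aggregation means for a single, finite batch of hashtags.

```python
from collections import Counter


def count_topics(topics):
    """Streaming aggregation: one row per hashtag with its running count."""
    return topics.groupBy("topic").count()


# Plain-Python equivalent for one finite batch of hashtags.
batch_counts = Counter(["#spark", "#kafka", "#spark"])
print(batch_counts.most_common())
```

In the streaming case the counts are not final: they keep growing as new micro-batches arrive.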

In [None]:
counts = # YOUR CODE HERE

In [None]:
counts.printSchema()

## 3.4 Print Results onto Console

Again we want to print the results onto the console.

In [None]:
query = counts.writeStream \
    .format("console") \
    .outputMode("update") \
    .option("truncate", False) \
    .option("checkpointLocation", "/tmp/zeppelin/checkpoint-twitter-count-" + str(time.time())) \
    .start()

In [None]:
query.stop()

# 4. Time-Windowed Aggregation

Another interesting (and probably more realistic) application is to perform time-windowed aggregations. This means that we define a sliding time window used in the `groupBy` clause. In addition we also define a so-called *watermark*, which tells Spark how long to wait for late arrivals of individual data points (we don't have them in our simple example).
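The following sketch shows one way to combine a watermark with a sliding window. The window size, slide and watermark delay are assumptions chosen for this example, as are the function names, and the input is assumed to have the columns `timestamp` and `topic`. The plain-Python helper illustrates which sliding windows a single timestamp (in seconds) falls into.

```python
def sliding_windows(ts, size, slide):
    """Return all [start, end) windows that contain the timestamp `ts`."""
    last_start = ts - ts % slide
    # Walk backwards as long as the window still covers `ts`.
    starts = range(last_start, ts - size, -slide)
    return sorted((start, start + size) for start in starts)


def windowed_counts(topics, size="10 seconds", slide="5 seconds",
                    delay="10 seconds"):
    """Count hashtags per sliding window, tolerating late data up to `delay`."""
    # Imported here so that the helper above also works without PySpark.
    from pyspark.sql.functions import window

    return (
        topics
        .withWatermark("timestamp", delay)
        .groupBy(window("timestamp", size, slide), "topic")
        .count()
    )


print(sliding_windows(12, 10, 5))
```

Because the windows overlap, every record contributes to `size / slide` windows, two in this example.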

## 4.1 Define Window and Watermark

In [None]:
windowedCounts = # YOUR CODE HERE

In [None]:
windowedCounts.printSchema()

## 4.2 Start Query

Let's output the data again. This time, we would also like to investigate the different output modes `append`, `complete` and `update`.

In [None]:
query = windowedCounts.writeStream \
    .outputMode("update") \
    .format("console") \
    .trigger(processingTime="1 seconds") \
    .option("checkpointLocation", "/tmp/zeppelin/checkpoint-twitter-console-" + str(time.time())) \
    .option("truncate", False) \
    .start()

In [None]:
query.stop()

# 5. Kafka Output

So far, we have only used Kafka for reading and dumped the result onto the console. Of course, this is not a realistic scenario. Instead, a good idea is to send the results back to Kafka. An additional system can then fetch the records from Kafka and process them further as it sees fit. This approach technically decouples Spark from the real sink.

## 5.1 Format Result

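Kafka only accepts simple messages: the Kafka sink expects the payload in a single column called `value`, which has to be a string or binary. We therefore store the result in a JSON object as follows: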

In [None]:
windowedCountsAsValues = windowedCounts.withColumn("value", 
            f.to_json(
                f.struct(
                    windowedCounts["window"],
                    windowedCounts["topic"],
                    windowedCounts["count"]
                )
            )
        )

## 5.2 Start Query

Now we use `kafka` as the output format. This requires some additional configuration (for example the address of some Kafka bootstrap servers and the Kafka topic name).
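A possible configuration of the Kafka sink is sketched below. The helper names are made up for this example, and selecting only the `value` column is a design choice of the sketch rather than a requirement of the exercise.

```python
def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Build the options for Spark's Kafka sink."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        "checkpointLocation": checkpoint_dir,
    }


def start_kafka_query(df, bootstrap_servers, topic, checkpoint_dir):
    """Write the `value` column of `df` to the given Kafka topic."""
    options = kafka_sink_options(bootstrap_servers, topic, checkpoint_dir)
    return (
        df.select("value")      # the sink only needs the message payload
        .writeStream
        .outputMode("update")
        .format("kafka")
        .options(**options)
        .start()
    )
```

The topic passed here has to match the one used by the console consumer when peeking into the results.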

In [None]:
# Parentheses instead of backslashes, because a comment line would
# otherwise break the line continuation.
query = (
    windowedCountsAsValues.writeStream
    .outputMode("update")
    # YOUR CODE HERE
    .trigger(processingTime="1 seconds")
    .option("checkpointLocation", "/tmp/zeppelin/checkpoint-twitter-console-" + str(time.time()))
    .start()
)

While the query is running, you can peek inside the Kafka topic via the following command line:

```shell
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic kku
```

In [None]:
query.stop()