In [4]:
%run "./Includes/Classroom-Setup"

<div>
  <h2>The Kafka Ecosystem</h2>
  <p>Kafka is software designed upon the <b>publish/subscribe</b> messaging pattern.
     Publish/subscribe messaging is where a sender (publisher) sends a message that is not specifically directed to a receiver (subscriber). 
     The publisher classifies the message somehow and the receiver subscribes to receive certain categories of messages.
     There are other usage patterns for Kafka, but this is the pattern we focus on in this course.
  </p>
  <p>Publisher/subscriber systems typically have a central point where messages are published, called a <b>broker</b>. 
     The broker receives messages from publishers, assigns offsets to them and commits messages to storage.
  </p>

  <p>The Kafka version of a unit of data is an array of bytes called a <b>message</b>.</p>

  <p>A message can also contain a bit of information related to partitioning called a <b>key</b>.</p>

  <p>In Kafka, messages are categorized into <b>topics</b>.</p>
</div>
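
To make the vocabulary above concrete, here is a minimal in-memory sketch of a broker that keeps one append-only log per topic and hands out offsets. It is a toy model, not Kafka's implementation: the `ToyBroker` class and its `publish`/`read` methods are hypothetical names, and the message payloads are illustrative sample values.

```python
from collections import defaultdict


class ToyBroker:
    """Toy in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)   # topic name -> list of (key, value) messages

    def publish(self, topic, value, key=None):
        """Append a message to the topic's log and return its assigned offset."""
        log = self._logs[topic]
        log.append((key, value))
        return len(log) - 1              # an offset is the position in the log

    def read(self, topic, from_offset=0):
        """Return (offset, key, value) tuples from `from_offset` onward."""
        log = self._logs[topic]
        return [(i, k, v) for i, (k, v) in enumerate(log) if i >= from_offset]


broker = ToyBroker()
broker.publish("en", b'{"page": "Out Zone"}')
broker.publish("es", b'{"page": "Madrid"}')      # illustrative sample value
second_en_offset = broker.publish("en", b'{"page": "Hubble Space Telescope"}')

en_messages = broker.read("en")   # a subscriber to "en" never sees "es" messages
print(second_en_offset, [offset for offset, _, _ in en_messages])   # prints: 1 [0, 1]
```

The publisher never names a receiver; it only chooses a topic, and each topic numbers its own messages independently.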

<h2>The Kafka Server</h2>

The Kafka server is fed by a separate TCP server that reads the Wikipedia edits, in real time, from the various language-specific IRC channels to which Wikimedia posts them. 

That server parses the IRC data, converts the results to JSON, and sends the JSON to
a Kafka server, with the edits segregated by language. The various languages are <b>topics</b>.

For example, the Kafka topic "en" corresponds to edits for en.wikipedia.org.

### Required Options

When consuming from a Kafka source, you **must** specify at least two options:

<p>1. The Kafka bootstrap servers, for example:</p>
<p>`dsr.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`</p>
<p>2. Some indication of the topics you want to consume.</p>

#### Specifying a Topic

There are three mutually exclusive ways to specify the topics for consumption:

| Option        | Value                                          | Description                            | Example |
| ------------- | ---------------------------------------------- | -------------------------------------- | ------- |
| **subscribe** | A comma-separated list of topics               | A list of topics to which to subscribe | `dsr.option("subscribe", "topic1")` <br/> `dsr.option("subscribe", "topic1,topic2,topic3")` |
| **assign**    | A JSON string indicating topics and partitions | Specific topic-partitions to consume   | `dsr.option("assign", '{"topic1": [1,3], "topic2": [2,5]}')` |
| **subscribePattern**   | A (Java) regular expression           | A pattern to match desired topics      | `dsr.option("subscribePattern", "e[ns]")` <br/> `dsr.option("subscribePattern", "topic[123]")`|
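
As a rough illustration of how a pattern such as `e[ns]` selects topics, the sketch below filters a list of topic names with Python's `re` module. Two assumptions are worth stating: Python's regular expressions stand in for Java's here (the two agree for a pattern this simple, but not in general), and the pattern is assumed to be matched against the whole topic name. The topic list is hypothetical.

```python
import re

available_topics = ["en", "es", "eo", "de", "english"]   # hypothetical topic names

pattern = re.compile(r"e[ns]")

# fullmatch requires the entire topic name to match, not just a prefix of it
matched = [topic for topic in available_topics if pattern.fullmatch(topic)]
print(matched)   # prints: ['en', 'es']
```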

In the example to follow, we're using the "subscribe" option to select the topics we're interested in consuming. 
We've selected only the "en" topic, corresponding to edits for the English Wikipedia. 
If we wanted to consume multiple topics (multiple Wikipedia languages, in our case), we could just specify them as a comma-separated list:

```dsr.option("subscribe", "en,es,it,fr,de,eo")```
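
When the topic names live in a Python list, one way to build these option values is sketched below. It uses only the standard library; the variable names are illustrative, and the `assign` mapping reuses the example topics from the table above.

```python
import json

languages = ["en", "es", "it", "fr", "de", "eo"]

# "subscribe" expects one string with topic names separated by commas
subscribe_value = ",".join(languages)

# "assign" expects a JSON object mapping each topic to a list of partition numbers
assign_value = json.dumps({"topic1": [1, 3], "topic2": [2, 5]})

print(subscribe_value)   # prints: en,es,it,fr,de,eo
print(assign_value)      # prints: {"topic1": [1, 3], "topic2": [2, 5]}
```

Letting `json.dumps` produce the string avoids hand-written quoting mistakes in the `assign` value.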

There are other, optional, arguments you can give the Kafka source. 

<h2>The Kafka Schema</h2>

Reading from Kafka returns a `DataFrame` with the following fields:

| Field             | Type   | Description |
|------------------ | ------ |------------ |
| **key**           | binary | The key of the record (not needed) |
| **value**         | binary | Our JSON payload. We'll need to cast it to STRING |
| **topic**         | string | The topic this record is received from (not needed) |
| **partition**     | int    | The Kafka topic partition from which this record is received (not needed). This server only has one partition. |
| **offset**        | long   | The position of this record in the corresponding Kafka topic partition (not needed) |
| **timestamp**     | timestamp | The timestamp of this record  |
| **timestampType** | int    | The timestamp type of a record (not needed) |

In the example below, the only column we want to keep is `value`.
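
The reason for the cast is that Kafka hands over `value` as raw bytes. The plain-Python sketch below shows the same idea outside Spark: decode the bytes as UTF-8 text, then parse the text as JSON. The payload is a shortened, illustrative version of an edit record, not a full message from the server.

```python
import json

# A shortened payload as raw bytes, which is how the "value" field arrives
raw_value = b'{"page": "Out Zone", "wikipedia": "en", "delta": 120}'

text = raw_value.decode("utf-8")   # analogous to casting the binary column to STRING
record = json.loads(text)          # only now can individual fields be read

print(type(raw_value).__name__, type(text).__name__, record["page"], record["delta"])
# prints: bytes str Out Zone 120
```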

The default of `spark.sql.shuffle.partitions` is 200.
This setting is used in operations like `groupBy`.
In this case, we should be setting this value to match the current number of cores.

In [11]:
from pyspark.sql.functions import col
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

kafkaServer = "server1.databricks.training:9092"   # US (Oregon)
# kafkaServer = "server2.databricks.training:9092" # Singapore

editsDF = (spark.readStream                        # Get the DataStreamReader
  .format("kafka")                                 # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafkaServer)  # Configure the Kafka server name and port
  .option("subscribe", "en")                       # Subscribe to the "en" Kafka topic
  .option("startingOffsets", "earliest")           # Rewind stream to beginning when we restart notebook
  .option("maxOffsetsPerTrigger", 1000)            # Cap the offsets read per trigger to throttle the stream
  .load()                                          # Load the DataFrame
  .select(col("value").cast("STRING"))             # Cast the "value" column to STRING
)
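
To get a feel for what `maxOffsetsPerTrigger` does when the stream starts from the earliest offset, the sketch below splits a backlog into bounded ranges. This is a simplified model that assumes a single partition; `plan_batches` is a hypothetical helper, and the backlog size of 2500 is an invented number.

```python
def plan_batches(start_offset, end_offset, max_offsets_per_trigger):
    """Split offsets [start_offset, end_offset) into ranges of bounded size."""
    batches = []
    current = start_offset
    while current < end_offset:
        upper = min(current + max_offsets_per_trigger, end_offset)
        batches.append((current, upper))   # half-open range: current <= offset < upper
        current = upper
    return batches


batches = plan_batches(0, 2500, 1000)
print(batches)   # prints: [(0, 1000), (1000, 2000), (2000, 2500)]
```

Without a cap, the first trigger would try to read the whole backlog at once.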

Let's display some data.

In [13]:
myStreamName = "lesson04a_ps"
display(editsDF,  streamName = myStreamName)

value
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:34.438Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463875&oldid=950463326"",""isUnpatrolled"":false,""page"":""Out Zone"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Reception and legacy */"",""userURL"":""http://en.wikipedia.org/wiki/User:KGRAMR"",""pageURL"":""http://en.wikipedia.org/wiki/Out_Zone"",""delta"":120,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""KGRAMR"",""namespace"":""article""}"
"{""isRobot"":true,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:35.490Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463877&oldid=928751794"",""isUnpatrolled"":false,""page"":""Scott V. Edwards"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""[[Wikipedia:OABOT|Open access bot]]: doi added to citation with #oabot."",""userURL"":""http://en.wikipedia.org/wiki/User:OAbot"",""pageURL"":""http://en.wikipedia.org/wiki/Scott_V._Edwards"",""delta"":18,""flag"":""MB"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""OAbot"",""namespace"":""article""}"
"{""isRobot"":true,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:35.572Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463876&oldid=949175009"",""isUnpatrolled"":false,""page"":""Black River Township, Pennington County, Minnesota"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""Move 1 url. [[User:GreenC/WaybackMedic_2.5|Wayback Medic 2.5]]"",""userURL"":""http://en.wikipedia.org/wiki/User:GreenC bot"",""pageURL"":""http://en.wikipedia.org/wiki/Black_River_Township,_Pennington_County,_Minnesota"",""delta"":-137,""flag"":""B"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""GreenC bot"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:36.949Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463879&oldid=949434316"",""isUnpatrolled"":false,""page"":""Craig T. Nelson"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Career */"",""userURL"":""http://en.wikipedia.org/wiki/User:Pattydornoff"",""pageURL"":""http://en.wikipedia.org/wiki/Craig_T._Nelson"",""delta"":-5,""flag"":"""",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Pattydornoff"",""namespace"":""article""}"
"{""isRobot"":true,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:37.114Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463878&oldid=949175015"",""isUnpatrolled"":false,""page"":""Lodi (town), Wisconsin"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""Move 1 url. [[User:GreenC/WaybackMedic_2.5|Wayback Medic 2.5]]"",""userURL"":""http://en.wikipedia.org/wiki/User:GreenC bot"",""pageURL"":""http://en.wikipedia.org/wiki/Lodi_(town),_Wisconsin"",""delta"":-134,""flag"":""B"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""GreenC bot"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:38.464Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463881&oldid=950462934"",""isUnpatrolled"":false,""page"":""Ikue Ōtani"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Discography */"",""userURL"":""http://en.wikipedia.org/wiki/User:2601:681:0:2F70:79D1:AD3E:C5B9:2A18"",""pageURL"":""http://en.wikipedia.org/wiki/Ikue_Ōtani"",""delta"":156,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":""US"",""city"":null,""latitude"":39.7599983215332,""country"":""United States"",""longitude"":39.7599983215332,""stateProvince"":null,""countryCode3"":""USA""},""user"":""2601:681:0:2F70:79D1:AD3E:C5B9:2A18"",""namespace"":""article""}"
"{""isRobot"":true,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:38.636Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463880&oldid=949175020"",""isUnpatrolled"":false,""page"":""Pleasant Township, Van Wert County, Ohio"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""Reformat 1 archive link; Move 1 url. [[User:GreenC/WaybackMedic_2.5|Wayback Medic 2.5]]"",""userURL"":""http://en.wikipedia.org/wiki/User:GreenC bot"",""pageURL"":""http://en.wikipedia.org/wiki/Pleasant_Township,_Van_Wert_County,_Ohio"",""delta"":-134,""flag"":""B"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""GreenC bot"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:39.859Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463884&oldid=950463763"",""isUnpatrolled"":false,""page"":""UN mediation of the Kashmir dispute"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""/* Overview */"",""userURL"":""http://en.wikipedia.org/wiki/User:2409:4072:90C:5490:3438:C23C:89B2:7D1E"",""pageURL"":""http://en.wikipedia.org/wiki/UN_mediation_of_the_Kashmir_dispute"",""delta"":-5,""flag"":"""",""isNewPage"":false,""isAnonymous"":true,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""2409:4072:90C:5490:3438:C23C:89B2:7D1E"",""namespace"":""article""}"
"{""isRobot"":false,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:40.386Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463882&oldid=950461514"",""isUnpatrolled"":false,""page"":""Hubble Space Telescope"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""([[c:GR|GR]]) [[c:COM:FR|File renamed]]: [[File:HST in Orion.jpg]] → [[File:Hubble Space Telescope in Orion.jpg]] [[c:COM:FR#FR1|Criterion 1]] (original uploader’s request)"",""userURL"":""http://en.wikipedia.org/wiki/User:Masum Ibn Musa"",""pageURL"":""http://en.wikipedia.org/wiki/Hubble_Space_Telescope"",""delta"":19,""flag"":""M"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""Masum Ibn Musa"",""namespace"":""article""}"
"{""isRobot"":true,""channel"":""#en.wikipedia"",""timestamp"":""2020-04-12T06:47:40.446Z"",""url"":""https://en.wikipedia.org/w/index.php?diff=950463883&oldid=949175043"",""isUnpatrolled"":false,""page"":""New Woodville, Oklahoma"",""wikipedia"":""en"",""wikipediaURL"":""http://en.wikipedia.org"",""comment"":""Move 3 urls. [[User:GreenC/WaybackMedic_2.5|Wayback Medic 2.5]]"",""userURL"":""http://en.wikipedia.org/wiki/User:GreenC bot"",""pageURL"":""http://en.wikipedia.org/wiki/New_Woodville,_Oklahoma"",""delta"":-85,""flag"":""B"",""isNewPage"":false,""isAnonymous"":false,""geocoding"":{""countryCode2"":null,""city"":null,""latitude"":null,""country"":null,""longitude"":null,""stateProvince"":null,""countryCode3"":null},""user"":""GreenC bot"",""namespace"":""article""}"


Wait until stream is done initializing...

In [15]:
untilStreamIsReady(myStreamName)

Make sure to stop the stream before continuing.

In [17]:
stopAllStreams()

<h2>Use Kafka to display the raw data</h2>

The Kafka server acts as a sort of "firehose" (or asynchronous buffer) and displays raw data.

Since raw data coming in from a stream is transient, we'd like to save it to a more permanent data structure.

The first step is to define the schema for the JSON payload.

In [19]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType
from pyspark.sql.functions import from_json, unix_timestamp

schema = StructType([
  StructField("channel", StringType(), True),
  StructField("comment", StringType(), True),
  StructField("delta", IntegerType(), True),
  StructField("flag", StringType(), True),
  StructField("geocoding", StructType([                 # (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("countryCode2", StringType(), True),
    StructField("countryCode3", StringType(), True),
    StructField("stateProvince", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
  ]), True),
  StructField("isAnonymous", BooleanType(), True),      # (BOOLEAN): Whether or not the change was made by an anonymous user
  StructField("isNewPage", BooleanType(), True),
  StructField("isRobot", BooleanType(), True),
  StructField("isUnpatrolled", BooleanType(), True),
  StructField("namespace", StringType(), True),         # (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType(), True),              # (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType(), True),           # (STRING): URL of the page that was edited
  StructField("timestamp", StringType(), True),         # (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType(), True),
  StructField("user", StringType(), True),              # (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType(), True),
  StructField("wikipediaURL", StringType(), True),
  StructField("wikipedia", StringType(), True),         # (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
])

Next we can use the function `from_json` to parse out the full message with the schema specified above.

In [21]:
from pyspark.sql.functions import col, from_json

jsonEdits = editsDF.select(
  from_json("value", schema).alias("json"))  # Parse the column "value" and name it "json"

When parsing a value from JSON, we end up with a single column containing a complex object.

We can clearly see this by simply printing the schema.

In [23]:
jsonEdits.printSchema()

The fields of a complex object can be referenced with a "dot" notation as in:

`col("json.geocoding.countryCode3")` 
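
The dot notation simply walks the nested structure one level per name. The plain-Python sketch below mimics that walk on a dictionary; `get_path` is a hypothetical helper for illustration, not a Spark function, and the record is a trimmed sample.

```python
from functools import reduce

record = {
    "json": {
        "page": "Ikue Ōtani",
        "geocoding": {"countryCode2": "US", "countryCode3": "USA"},
    }
}


def get_path(obj, dotted_path):
    """Follow a dotted path such as 'json.geocoding.countryCode3' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), obj)


print(get_path(record, "json.geocoding.countryCode3"))   # prints: USA
```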
 

A large number of these fields/columns can become unwieldy.

For that reason, it is common to extract the sub-fields and represent them as first-level columns as seen below:

In [25]:
from pyspark.sql.functions import col, isnull

anonDF = (jsonEdits
  .select(col("json.wikipedia").alias("wikipedia"),      # Promoting from sub-field to column
          col("json.isAnonymous").alias("isAnonymous"),  #     "       "      "      "    "
          col("json.namespace").alias("namespace"),      #     "       "      "      "    "
          col("json.page").alias("page"),                #     "       "      "      "    "
          col("json.pageURL").alias("pageURL"),          #     "       "      "      "    "
          col("json.geocoding").alias("geocoding"),      #     "       "      "      "    "
          col("json.user").alias("user"),                #     "       "      "      "    "
          col("json.timestamp").cast("timestamp"))       # Promoting and converting to a timestamp
  .filter(col("namespace") == "article")                 # Limit result to just articles
  .filter(~isnull(col("geocoding.countryCode3")))        # We only want results that are geocoded
)
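
The logic of the cell above (convert the timestamp, keep articles, keep geocoded edits) can be sketched on plain Python dictionaries as follows. The first two records are trimmed from the sample output shown earlier; the third is invented to exercise the namespace filter, and `parse_iso` is a hypothetical helper.

```python
from datetime import datetime, timezone

edits = [
    {"namespace": "article", "timestamp": "2020-04-12T06:47:38.464Z",
     "geocoding": {"countryCode3": "USA"}},
    {"namespace": "article", "timestamp": "2020-04-12T06:47:39.859Z",
     "geocoding": {"countryCode3": None}},    # not geocoded, so it is dropped
    {"namespace": "talk", "timestamp": "2020-04-12T06:47:40.000Z",
     "geocoding": {"countryCode3": "IND"}},   # invented record, not an article
]


def parse_iso(ts):
    """Parse an ISO-8601 UTC string with milliseconds and a trailing 'Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


kept = [
    {**edit, "timestamp": parse_iso(edit["timestamp"])}
    for edit in edits
    if edit["namespace"] == "article"
    and edit["geocoding"]["countryCode3"] is not None
]

print(len(kept), kept[0]["timestamp"].isoformat())
# prints: 1 2020-04-12T06:47:38.464000+00:00
```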

<h2>Mapping Anonymous Editors' Locations</h2>

When you run the query, the default is a live HTML table.

The geocoded information allows us to associate an anonymous edit with a country.

We can then use that geocoded information to plot edits on a live world map.

In order to create a slick world map visualization of the data, you'll need to click on the item below.

Under <b>Plot Options</b>, use the following:
* <b>Keys:</b> `countryCode3`
* <b>Values:</b> `count`

In <b>Display type</b>, use <b>World map</b> and click <b>Apply</b>.

By invoking a `display` action on a DataFrame created from a `readStream` transformation, we can generate a LIVE visualization!

In [27]:
mappedDF = (anonDF
  .groupBy("geocoding.countryCode3") # Aggregate by country (code)
  .count()                           # Produce a count of each aggregate
)
display(mappedDF, streamName = myStreamName)

countryCode3,count
IND,89
GRC,1
NZL,9
KEN,2
CHL,4
CRI,1
DEU,7
AUS,34
NGA,1
PHL,7


Wait until stream is done initializing...

In [29]:
untilStreamIsReady(myStreamName)

Stop the streams.

In [31]:
stopAllStreams()

<h2>Review Questions</h2>

**Q:** What `format` should you use with Kafka?<br>
**A:** `format("kafka")`

**Q:** How do you specify a Kafka server?<br>
**A:** `.option("kafka.bootstrap.servers", "server1.databricks.training:9092")`

**Q:** What verb should you use in conjunction with `readStream` and Kafka to create the streaming DataFrame?<br>
**A:** `load()`, but with no parameters since we are pulling from a Kafka server.

**Q:** What fields are returned in a Kafka DataFrame?<br>
**A:** Reading from Kafka returns a DataFrame with the following fields:
key, value, topic, partition, offset, timestamp, timestampType

In [34]:
%run "./Includes/Classroom-Cleanup"