In [4]:
%run "./Includes/Classroom-Setup"

## Lambda Architecture

The Lambda architecture is a big data processing architecture that combines both batch and real-time processing methods.
It features an append-only immutable data source that serves as the system of record. Timestamped events are appended to 
existing events (nothing is overwritten). Data is implicitly ordered by time of arrival. 
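
As a rough illustration of the idea, the sketch below models the append-only log as a plain Python list and answers a query by merging a batch view with a real-time ("speed") view. The function names and the event fields `ts` and `user` are made up for this example and are not part of any Lambda framework.

```python
from collections import Counter

def append_event(log, ts, user):
    """Append a timestamped event; existing events are never updated or deleted."""
    log.append({"ts": ts, "user": user})

def batch_view(log, last_batch_ts):
    """Batch pipeline: recompute counts over all events up to the last batch run."""
    return Counter(e["user"] for e in log if e["ts"] <= last_batch_ts)

def speed_view(log, last_batch_ts):
    """Streaming pipeline: count only the events that arrived after the last batch run."""
    return Counter(e["user"] for e in log if e["ts"] > last_batch_ts)

def query(log, last_batch_ts):
    """Serving step: merge the two views to answer over all data seen so far."""
    return batch_view(log, last_batch_ts) + speed_view(log, last_batch_ts)

log = []
append_event(log, 1, "alice")
append_event(log, 2, "bob")
append_event(log, 3, "alice")   # arrives after the last batch run at ts=2
print(query(log, last_batch_ts=2))
```

The two view functions contain nearly the same logic, which hints at the maintenance cost of keeping two pipelines in sync.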

Notice how there are really two pipelines here, one batch and one streaming, hence the name <i>lambda</i> architecture.

Combining the processing of batch and real-time data is very difficult, as the diagram below shows.

## Databricks Delta Architecture

The Databricks Delta Architecture is a vast improvement upon the traditional Lambda architecture.

Text files, RDBMS data and streaming data are all collected into a <b>raw</b> table (also known as a "bronze" table at Databricks).

A raw table is then parsed into <b>query</b> tables (also known as "silver" tables at Databricks). These may be joined with dimension tables.

<b>Summary</b> tables (also known as "gold" tables at Databricks) are business-level aggregates often used for reporting and dashboarding. 
Examples include aggregations such as daily active website users.

The end outputs are actionable insights, dashboards and reports of business metrics.
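
To make the three levels concrete, here is a toy sketch in plain Python (no Spark) that moves a few in-memory records from raw to query to summary form. The sample records, field names and the functions `to_silver` and `to_gold` are invented for illustration only.

```python
import json
from collections import defaultdict

# Bronze: raw records exactly as they arrived (here, JSON strings).
bronze = [
    '{"user": "alice", "timestamp": "2020-04-15T10:46:28Z", "page": "/home"}',
    '{"user": "bob",   "timestamp": "2020-04-15T11:02:10Z", "page": "/docs"}',
    '{"user": "alice", "timestamp": "2020-04-16T09:15:00Z", "page": "/docs"}',
]

def to_silver(raw_rows):
    """Parse raw strings into typed, query-friendly rows."""
    rows = []
    for raw in raw_rows:
        rec = json.loads(raw)
        rows.append({"user": rec["user"], "day": rec["timestamp"][:10], "page": rec["page"]})
    return rows

def to_gold(silver_rows):
    """Aggregate to a business-level metric: daily active users."""
    users_by_day = defaultdict(set)
    for row in silver_rows:
        users_by_day[row["day"]].add(row["user"])
    return {day: len(users) for day, users in sorted(users_by_day.items())}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```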

## Databricks Delta Architecture

We use the following terminology:
* "bronze" (instead of "raw")
* "silver" (instead of "query")
* "gold" (instead of "summary")
* "platinum" (another level of refinement)

This terminology is not standard in the industry.

Set up relevant paths.

In [10]:
bronzePath           = workingDir + "/wikipedia/bronze.delta"
bronzeCheckpointPath = workingDir + "/wikipedia/bronze.checkpoint"

silverPath           = workingDir + "/wikipedia/silver.delta"
silverCheckpointPath = workingDir + "/wikipedia/silver.checkpoint"

To help us manage our streams, we will use **`untilStreamIsReady()`** and **`stopAllStreams()`**, and define the stream names **`bronzeStreamName`**, **`silverStreamName`** and **`goldStreamName`**:

In [12]:
bronzeStreamName = "bronze_stream_ps"
silverStreamName = "silver_stream_ps"
goldStreamName = "gold_stream_ps"
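
The implementation of `untilStreamIsReady()` is not shown in this notebook. The sketch below is one plausible way such a helper could work, not the actual code: it polls a list of active queries until the named one has reported progress. The name `wait_until_stream_ready` and its parameters are hypothetical; it relies only on the `name` and `recentProgress` attributes that Spark's `StreamingQuery` objects expose.

```python
import time

def wait_until_stream_ready(active_queries, name, min_batches=1, timeout_s=60.0, poll_s=0.5):
    """Block until the query called `name` has reported at least `min_batches` progress updates.

    `active_queries` is a zero-argument callable returning the current queries,
    for example `lambda: spark.streams.active` (assumption: a SparkSession named `spark`).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for query in active_queries():
            if query.name == name and len(query.recentProgress) >= min_batches:
                return query
        time.sleep(poll_s)
    raise TimeoutError(f"stream {name!r} was not ready after {timeout_s} s")
```

In a notebook with a live session this might be called as `wait_until_stream_ready(lambda: spark.streams.active, bronzeStreamName)`.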

## Save to RAW table (aka "bronze table")

<b>Raw data</b> is unaltered data that is collected into a data lake, either via bulk upload or through streaming sources.

The following code reads the Wikipedia IRC channel data that has been dumped into our Kafka server.

The Kafka server acts as a sort of "firehose" and dumps raw data into our data lake.

Below, the first step is to set up the schema. The fields we use further down in the notebook are commented.

In [14]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType

schema = StructType([
  StructField("channel", StringType(), True),
  StructField("comment", StringType(), True),
  StructField("delta", IntegerType(), True),
  StructField("flag", StringType(), True),
  StructField("geocoding", StructType([                 # (OBJECT): Added by the server, field contains IP address geocoding information for anonymous edit.
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("countryCode2", StringType(), True),
    StructField("countryCode3", StringType(), True),
    StructField("stateProvince", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("longitude", DoubleType(), True),
  ]), True),
  StructField("isAnonymous", BooleanType(), True),      # (BOOLEAN): Whether or not the change was made by an anonymous user
  StructField("isNewPage", BooleanType(), True),
  StructField("isRobot", BooleanType(), True),
  StructField("isUnpatrolled", BooleanType(), True),
  StructField("namespace", StringType(), True),         # (STRING): Page's namespace. See https://en.wikipedia.org/wiki/Wikipedia:Namespace 
  StructField("page", StringType(), True),              # (STRING): Printable name of the page that was edited
  StructField("pageURL", StringType(), True),           # (STRING): URL of the page that was edited
  StructField("timestamp", StringType(), True),         # (STRING): Time the edit occurred, in ISO-8601 format
  StructField("url", StringType(), True),
  StructField("user", StringType(), True),              # (STRING): User who made the edit or the IP address associated with the anonymous editor
  StructField("userURL", StringType(), True),
  StructField("wikipediaURL", StringType(), True),
  StructField("wikipedia", StringType(), True),         # (STRING): Short name of the Wikipedia that was edited (e.g., "en" for the English)
])
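
To see what applying a schema to a single message amounts to, the following plain-Python sketch parses one JSON payload and keeps only a few named fields, filling in `None` where a field is absent. It is only an analogy for schema-based parsing, not how Spark implements it; the function `parse_edit`, the reduced field lists and the sample message are made up for this example.

```python
import json

# A small subset of the fields from the schema above (assumption: chosen for brevity).
FIELDS = ["wikipedia", "isAnonymous", "namespace", "page", "user"]
GEO_FIELDS = ["city", "country", "countryCode3"]

def parse_edit(value: bytes):
    """Decode one message payload into a flat dict; return None if it is not valid JSON."""
    try:
        doc = json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    if not isinstance(doc, dict):
        return None
    row = {name: doc.get(name) for name in FIELDS}      # missing fields become None
    geo = doc.get("geocoding")
    geo = geo if isinstance(geo, dict) else {}
    row["geocoding"] = {name: geo.get(name) for name in GEO_FIELDS}
    return row

sample = b'{"wikipedia": "en", "isAnonymous": true, "namespace": "article", "page": "DOCSIS"}'
print(parse_edit(sample))
```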

Next, stream into the bronze Databricks Delta directory.

Notice how we are invoking the `.start(path)` method. 

This is so that the data is streamed into the path we want (and not a default directory).

In [16]:
from pyspark.sql.functions import from_json, col
(spark.readStream
  .format("kafka")  
  .option("kafka.bootstrap.servers", "server1.databricks.training:9092")  # Oregon
  #.option("kafka.bootstrap.servers", "server2.databricks.training:9092") # Singapore
  .option("subscribe", "en")
  .load()
  .withColumn("json", from_json(col("value").cast("string"), schema))
  .select(col("timestamp").alias("kafka_timestamp"), col("json.*"))
  .writeStream
  .format("delta")
  .option("checkpointLocation", bronzeCheckpointPath)
  .outputMode("append")
  .queryName(bronzeStreamName)
  .start(bronzePath)
)

In [17]:
# Wait until the stream is done initializing...
untilStreamIsReady(bronzeStreamName)

Take a look at the first row of the raw table without explicitly creating a table.

In [19]:
bronzeDF = spark.sql(f"SELECT * FROM delta.`{bronzePath}` limit 1")
display(bronzeDF)

kafka_timestamp,channel,comment,delta,flag,geocoding,isAnonymous,isNewPage,isRobot,isUnpatrolled,namespace,page,pageURL,timestamp,url,user,userURL,wikipediaURL,wikipedia
1969-12-31T23:59:59.999+0000,#en.wikipedia,/* Comparison */,13,,"List(null, null, null, null, null, null, null)",False,False,False,False,article,DOCSIS,http://en.wikipedia.org/wiki/DOCSIS,2020-04-15T10:46:28.513Z,https://en.wikipedia.org/w/index.php?diff=951077732&oldid=950219702,Xose.vazquez,http://en.wikipedia.org/wiki/User:Xose.vazquez,http://en.wikipedia.org,en


## Create QUERY tables (aka "silver tables")

Notice how `WikipediaEditsRaw` has JSON encoding. For example `{"city":null,"country":null,"countryCode2":null,"c..`

To be able to parse the data in human-readable form, create query tables out of the raw data using the columns<br>
`wikipedia`, `isAnonymous`, `namespace`, `page`, `pageURL`, `geocoding`, `timestamp` and `user`.

Stream into a Databricks Delta query directory.

In [21]:
from pyspark.sql.functions import unix_timestamp, col

(spark.readStream
  .format("delta")
  .load(bronzePath)
  .select(col("wikipedia"),
          col("isAnonymous"),
          col("namespace"),
          col("page"),
          col("pageURL"),
          col("geocoding"),
          unix_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSX").cast("timestamp").alias("timestamp"),
          col("user"))
  .writeStream
  .format("delta")
  .option("checkpointLocation", silverCheckpointPath)
  .outputMode("append")
  .queryName(silverStreamName)
  .start(silverPath)
)
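
The timestamp conversion above goes through `unix_timestamp`, which yields whole seconds since the epoch, so the millisecond part of the original string is dropped before the cast back to a timestamp. The plain-Python sketch below mimics that round trip for one value; the function name `parse_edit_timestamp` is made up, and it assumes Python 3.7 or later, where `%z` accepts a trailing `Z`.

```python
from datetime import datetime, timezone

def parse_edit_timestamp(text):
    """Parse an ISO-8601 string such as '2020-04-15T10:46:28.513Z' at whole-second precision."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")
    epoch_seconds = int(parsed.timestamp())          # truncates the fractional seconds
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(parse_edit_timestamp("2020-04-15T10:46:28.513Z").isoformat())
```

This matches the `.000` millisecond part visible in the silver output.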

In [22]:
# Wait until the stream is done initializing...
untilStreamIsReady(silverStreamName)

Take a peek at the streaming query view without explicitly creating tables.

Notice how the fields are more meaningful than the fields in the bronze data set.

Notice that we are explicitly creating a DataFrame. This is so we can pass it to the `display` function.

In [24]:
silverDF = spark.sql("SELECT * FROM delta.`{}` limit 3".format(silverPath))
display(silverDF)

wikipedia,isAnonymous,namespace,page,pageURL,geocoding,timestamp,user
en,False,article,"Cleeve Hill, Gloucestershire","http://en.wikipedia.org/wiki/Cleeve_Hill,_Gloucestershire","List(null, null, null, null, null, null, null)",2020-04-15T10:51:29.000+0000,Imaginatorium
en,False,article,Harmonic coordinates,http://en.wikipedia.org/wiki/Harmonic_coordinates,"List(null, null, null, null, null, null, null)",2020-04-15T10:51:30.000+0000,OAbot
en,False,user talk,User talk:2405:6E00:2ED6:3D00:7CF6:D68E:362B:9DFC,http://en.wikipedia.org/wiki/User_talk:2405:6E00:2ED6:3D00:7CF6:D68E:362B:9DFC,"List(null, null, null, null, null, null, null)",2020-04-15T10:51:33.000+0000,Passengerpigeon


## Create SUMMARY (aka "gold") level data 

Summary queries can take a long time.

Instead of repeatedly running the query below against the data under `silverPath`, let's create a summary query.

We are interested in a breakdown of which countries are producing anonymous edits.

In [26]:
from pyspark.sql.functions import col

goldDF = (spark.readStream
  .format("delta")
  .load(silverPath)
  .withColumn("countryCode", col("geocoding.countryCode3"))
  .filter(col("namespace") == "article")     # Keep only edits to article pages
  .filter(col("countryCode").isNotNull())    # Drop rows without geocoding (a real null check, not a comparison with the string "null")
  .filter(col("isAnonymous"))                # Keep only anonymous edits
  .groupBy(col("countryCode"))
  .count() 
  .withColumnRenamed("count", "total")       # Number of anonymous article edits per country
  .orderBy(col("total").desc())
)
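
The same aggregation logic can be checked on a handful of in-memory rows without Spark. The sketch below is a plain-Python equivalent under the assumption that each row is a dict shaped like the silver data; the function name and the sample rows are invented for illustration.

```python
from collections import Counter

def anonymous_edits_by_country(silver_rows):
    """Count anonymous article edits per country code, most frequent first."""
    counts = Counter(
        row["geocoding"].get("countryCode3")
        for row in silver_rows
        if row["namespace"] == "article"
        and row["isAnonymous"]
        and row["geocoding"].get("countryCode3") is not None
    )
    return counts.most_common()

rows = [
    {"namespace": "article", "isAnonymous": True, "geocoding": {"countryCode3": "GBR"}},
    {"namespace": "article", "isAnonymous": True, "geocoding": {"countryCode3": "USA"}},
    {"namespace": "article", "isAnonymous": True, "geocoding": {"countryCode3": "GBR"}},
    {"namespace": "user talk", "isAnonymous": True, "geocoding": {"countryCode3": "GBR"}},
    {"namespace": "article", "isAnonymous": False, "geocoding": {"countryCode3": None}},
]
print(anonymous_edits_by_country(rows))
```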

## Creating Visualizations (aka "platinum" level) 

#### Mapping Anonymous Editors' Locations

Use that geocoding information to figure out the countries associated with the editors.

When you run the query, the default output is a (live) HTML table.

In order to create a slick world map visualization of the data, you'll need to click on the item below.

Under <b>Plot Options</b>, use the following:
* <b>Keys:</b> `countryCode`
* <b>Values:</b> `total`

In <b>Display type</b>, use <b>World Map</b> and click <b>Apply</b>.


By invoking a `display` action on a DataFrame created from a `readStream` transformation, we can generate a LIVE visualization!

Keep an eye on the plot for a minute or two and watch the colors change.

LIVE means you can see the colors change if you watch the plot.

In [29]:
display(goldDF, streamName = goldStreamName)

countryCode,total
GBR,41
USA,24
AUS,7
IND,6
BEL,5
PHL,5
ITA,5
IDN,4
DEU,4
MYS,3


In [30]:
# Wait until the stream is done initializing...
untilStreamIsReady(goldStreamName)

When you are all done, make sure to stop all the streams.

In [32]:
stopAllStreams()
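
The body of `stopAllStreams()` is not shown in this notebook. A helper of this kind could be as small as the sketch below, which is an assumption about its behavior rather than the actual implementation. The name `stop_all_streams` is hypothetical; it relies only on the `name` attribute and `stop()` method that Spark's `StreamingQuery` objects provide.

```python
def stop_all_streams(active_queries):
    """Stop every query in `active_queries` and return the names of the stopped queries."""
    stopped = []
    for query in list(active_queries):   # copy, since the active list can change while stopping
        query.stop()
        stopped.append(query.name)
    return stopped
```

With a live session this might be called as `stop_all_streams(spark.streams.active)`, assuming a SparkSession named `spark`.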

In [34]:
%run "./Includes/Classroom-Cleanup"

## Review Questions
**Q:** What is the difference between the Lambda and Databricks Delta architectures?<br>
**A:** The principal difference is that 
* In a Databricks Delta architecture, output queries can be performed on streaming and historical data at the same time.
* In a Lambda architecture, streaming and historical data are treated as two separate branches feeding output queries.

**Q:** What is the role of raw (bronze) tables?<br>
**A:** Raw tables capture streaming and historical data into a permanent record (streaming data tends to disappear after a short while). However, raw data is generally hard to query.

**Q:** What is the role of query (silver) tables?<br>
**A:** Query tables consist of normalized raw data that is easier to query.

**Q:** What is the role of summary (gold) tables?<br>
**A:** Summary tables contain aggregated key business metrics that are queried frequently; computing them directly from the silver tables each time would take too long.