## Demo - Use of Databricks Structured Streaming to process streaming data
### Load the data from Azure Event Hubs to Delta Lake

This notebook shows how to use a Databricks notebook to consume real-time event data from Azure Event Hubs.

An ADLS Gen2 container is mounted so that the event data can be stored in the data lake.

In [0]:
#Input parameters
#mount point path
dbutils.widgets.text("mount_point_path", "/mnt/stream-data")
#file_system
dbutils.widgets.text("file_system", "stream-data")
#account_name
dbutils.widgets.text("account_name", "dbcoedl")

In [0]:
#Read Parameters

mount_point_path = dbutils.widgets.get("mount_point_path")
file_system = dbutils.widgets.get("file_system")
account_name = dbutils.widgets.get("account_name")

In [0]:
#Mounting with an account access key is not the recommended approach
dbutils.fs.mount(
  source = "wasbs://{}@{}.blob.core.windows.net".format(file_system,account_name),
  mount_point = mount_point_path,
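  # The key is read from a secret scope instead of being hardcoded; the scope and key names below are placeholders
  extra_configs = {"fs.azure.account.key.{}.blob.core.windows.net".format(account_name): dbutils.secrets.get(scope="<secret-scope>", key="<storage-account-key-name>")})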

In [0]:
#Check mount points
dbutils.fs.mounts()
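
Re-running the mount cell fails when the path is already mounted, so it can help to check the output of `dbutils.fs.mounts()` first. The following is a minimal sketch of such a check. The helper name `is_mounted` and the `MountInfo` stand-in are assumptions made for illustration so that the code also runs outside Databricks; only the `mountPoint` attribute of the real entries is relied on.

```python
from collections import namedtuple


def is_mounted(mounts, mount_point):
    """Return True if mount_point appears in a list of mount entries.

    Each entry is expected to expose a `mountPoint` attribute, as the
    entries returned by dbutils.fs.mounts() do.
    """
    target = mount_point.rstrip("/")
    return any(m.mountPoint.rstrip("/") == target for m in mounts)


# Stand-in for dbutils.fs.mounts() (hypothetical data for illustration).
MountInfo = namedtuple("MountInfo", ["mountPoint", "source"])
sample_mounts = [
    MountInfo("/databricks-datasets", "databricks-datasets"),
    MountInfo("/mnt/stream-data", "wasbs://stream-data@dbcoedl.blob.core.windows.net"),
]

print(is_mounted(sample_mounts, "/mnt/stream-data"))
print(is_mounted(sample_mounts, "/mnt/other"))

# In the notebook, the guard could wrap the mount cell:
# if not is_mounted(dbutils.fs.mounts(), mount_point_path):
#     dbutils.fs.mount(...)
```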

#### Preparation (Set up Event Hub and library installation)
Before starting:

- Create an Event Hubs namespace resource in the Azure portal
- Create a new event hub in that namespace
- Create a SAS policy on the event hub entity and copy its connection string
- Install the Event Hubs library on your cluster
  - Go to Cluster -> Libraries -> Install New and select Maven. Install "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22" from the "Maven" source
- Install the "org.mongodb.spark:mongo-spark-connector_2.12:3.0.2" library in the same way
- Add the following config to the cluster

`spark.mongodb.output.uri mongodb+srv://<username>:<password>@cluster0.s5tuva0.mongodb.net/events_db?retryWrites=true&w=majority`
`spark.mongodb.input.uri mongodb+srv://<username>:<password>@cluster0.s5tuva0.mongodb.net/events_db?retryWrites=true&w=majority`

Read the stream from Azure Event Hubs as a streaming DataFrame using `readStream`.  
You must set your namespace, entity, policy name, and key for Azure Event Hub in the following command.
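
As a rough illustration of how those four values fit together, the sketch below assembles a connection string in the same `Endpoint=...;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=...` layout used in this notebook. The helper name `build_eventhub_connection_string` is an assumption introduced here, and the key is deliberately left as a placeholder.

```python
def build_eventhub_connection_string(namespace, entity, policy_name, key):
    """Assemble an Event Hubs connection string from its four parts."""
    return (
        f"Endpoint=sb://{namespace}.servicebus.windows.net/;"
        f"SharedAccessKeyName={policy_name};"
        f"SharedAccessKey={key};"
        f"EntityPath={entity}"
    )


conn_str = build_eventhub_connection_string(
    namespace="events-feed",
    entity="demo-topic",
    policy_name="manage_user_access_policy",
    key="<your-shared-access-key>",  # placeholder, not a real key
)
print(conn_str)
```

In the notebook, the resulting string would be passed to `EventHubsUtils.encrypt(...)` in place of a literal.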

In [0]:
# Read Event Hub's stream
conf = {}
# Replace <your-shared-access-key> with the key from your SAS policy (ideally read from a secret scope)
conf["eventhubs.connectionString"] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("Endpoint=sb://events-feed.servicebus.windows.net/;SharedAccessKeyName=manage_user_access_policy;SharedAccessKey=<your-shared-access-key>;EntityPath=demo-topic")


In [0]:
read_df = (
  spark
    .readStream
    .format("eventhubs")
    .options(**conf)
    .load()
)

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

claims_schema = StructType([
    StructField("id", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("phone_number", StringType(), True),
    StructField("country", StringType(), True),
    StructField("claim_amount", IntegerType(), True),
    StructField("type_id", StringType(), True),
    StructField("status", StringType(), True)
])


In [0]:
# Read the event body
decoded_df = read_df.select(from_json(col("body").cast("string"), claims_schema).alias("payload"))
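
To make the decoding step concrete, here is a plain-Python sketch of roughly what it does to a single event: the binary body is read as a string, parsed as JSON, and projected onto the schema fields, with missing fields becoming null. It is an approximation only: it does not coerce types the way Spark does, and it returns `None` for a malformed body, whereas Spark produces nulls without raising. The sample body and the helper name `decode_claim` are hypothetical.

```python
import json

# Field names taken from claims_schema.
CLAIM_FIELDS = [
    "id", "customer_name", "phone_number", "country",
    "claim_amount", "type_id", "status",
]


def decode_claim(body):
    """Parse a raw event body (bytes) into a dict with one key per schema field."""
    try:
        record = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None  # malformed body
    if not isinstance(record, dict):
        return None
    # Fields absent from the JSON come back as None (null).
    return {field: record.get(field) for field in CLAIM_FIELDS}


# Hypothetical event body for illustration.
sample_body = b'{"id": "c-001", "country": "IN", "claim_amount": 1200, "type_id": "2", "status": "open"}'
decoded = decode_claim(sample_body)
print(decoded)
print(decode_claim(b"not json"))
```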

In [0]:
# Flatten the parsed payload struct into top-level columns
claims_df = decoded_df.select("payload.*")

claims_df.printSchema()

In [0]:
display(claims_df, processingTime = "5 seconds")

In [0]:
#Data enhancement: add a processing timestamp
claims_df = claims_df.withColumn("processed", current_timestamp())

In [0]:
claims_types_df = spark.read.format("delta").table("events_db.insurance_types")
display(claims_types_df)

In [0]:
# Enrich each claim with its insurance type (stream-static inner join)
claims_df = claims_df.join(claims_types_df, claims_df.type_id == claims_types_df.id, "inner").drop(claims_types_df.id)
display(claims_df)

For a real IoT or sales data stream, you would also drop duplicates, aggregate over time windows with the `window` function, and so on.  
As an example,
```
def aggregateSalesRevenue(df, watermarkLateness, timeWindowSize, aggregationKey):
    # Sum sales per time window and key; events arriving later than the watermark are dropped
    return (
        df.withWatermark("timestamp", watermarkLateness)
        .groupBy(window("timestamp", timeWindowSize), col(aggregationKey))
        .agg(sum(col("sales")).alias("sales"))
    )
```
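
To show what such a windowed aggregation computes, the following plain-Python sketch buckets events into fixed, non-overlapping (tumbling) windows and sums `sales` per window and key. It is not Spark code and does not model watermarking or late data. The function name, the `store` key, and the sample events are hypothetical, and timestamps are simplified to epoch seconds.

```python
from collections import defaultdict


def tumbling_window_sum(events, window_seconds, key_field, value_field="sales"):
    """Sum value_field per (window_start, key) over fixed-size windows."""
    totals = defaultdict(int)
    for event in events:
        ts = event["timestamp"]
        # Windows are aligned to the epoch: [0, size), [size, 2*size), ...
        window_start = ts - ts % window_seconds
        totals[(window_start, event[key_field])] += event[value_field]
    return dict(totals)


# Hypothetical events; timestamps are epoch seconds.
events = [
    {"timestamp": 5, "store": "A", "sales": 10},
    {"timestamp": 50, "store": "A", "sales": 5},
    {"timestamp": 61, "store": "A", "sales": 7},
    {"timestamp": 10, "store": "B", "sales": 3},
]

result = tumbling_window_sum(events, window_seconds=60, key_field="store")
print(result)
```

With 60-second windows, the first two events for store `A` fall into the window starting at 0 and are summed, while the event at second 61 starts a new window.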

#### Write the datastream to Delta Table for Data Analysis

We create a database and store the streaming data as a Delta table.

In [0]:
%sql
CREATE DATABASE IF NOT EXISTS events_db;
USE events_db;

In [0]:
check_point_path = "dbfs:/FileStore/events/_checkpoints/event_stream"

delta_write_query = claims_df.writeStream\
.format("delta")\
.outputMode("append")\
.option("checkpointLocation", check_point_path)\
.queryName("delta_write_query")\
.toTable("claims_data")

In [0]:
%sql
SELECT * FROM claims_data LIMIT 10;

In [0]:
%sql
SELECT count(*) FROM claims_data;

We start the streaming computation by defining the sink as a streaming query named "data_lake_query".  
`start()` kicks off the stream, which continues to run as a background job.

In [0]:
save_loc = f"{mount_point_path}/claims"

datalake_write_query = claims_df.writeStream\
.format("csv")\
.outputMode("append")\
.queryName("data_lake_query")\
.trigger(processingTime='30 seconds')\
.option("checkpointLocation", f"{save_loc}/_checkpoint")\
.start(save_loc)

When you are finished, stop the running streaming queries.

In [0]:
for s in spark.streams.active:
    s.stop()
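
The loop above stops every active streaming query in the session. If only specific queries should be stopped, one option is to filter by query name. The sketch below assumes nothing about a query beyond a `name` attribute and a `stop()` method; the helper `stop_queries_by_name` and the `FakeQuery` stand-in are hypothetical and exist only so that the example runs outside Databricks.

```python
def stop_queries_by_name(active_queries, names):
    """Stop only the queries whose name is in `names`; return the names stopped."""
    stopped = []
    for query in active_queries:
        if query.name in names:
            query.stop()
            stopped.append(query.name)
    return stopped


class FakeQuery:
    """Stand-in for a streaming query (only `name` and `stop()` are modeled)."""

    def __init__(self, name):
        self.name = name
        self.stopped = False

    def stop(self):
        self.stopped = True


queries = [
    FakeQuery("delta_write_query"),
    FakeQuery("data_lake_query"),
    FakeQuery("other_query"),
]
stopped = stop_queries_by_name(queries, {"delta_write_query", "data_lake_query"})
print(stopped)

# In the notebook:
# stop_queries_by_name(spark.streams.active, {"delta_write_query", "data_lake_query"})
```

Note that the `display` calls on streaming DataFrames also start streaming queries, which a name filter like this would leave running.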

In [0]:
#Unmount
dbutils.fs.unmount(mount_point_path)

##### ==== end of notebook ====

In [0]:
%sh
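# Clear the checkpoint directory; the explicit path avoids deleting from the wrong directory if cd fails
rm -rf /dbfs/FileStore/events/_checkpoints/event_stream/*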