# DAIS RTM Demo
* Small demo of the low-latency benefits of RealTimeMode vs MicroBatchMode
  * MicroBatchMode is the de facto default mode in Structured Streaming
  * RealTimeMode is the new low-latency mode in Spark that allows sub-second streaming
* In this demo we will do the following
  * Use the Spark rate source to generate data
  * Apply TransformWithState to do stateful operations on the data
  * Write to Kafka with both MicroBatch Mode (MBM) and RealTime Mode (RTM)
  * Calculate the latency difference between the two

## Load resource files
* SensorDataGenerator - Uses the Spark native rate source to generate records
* EnvironmentalMonitorListProcessor - TransformWithState operator (stateful operation)
* HelperFunctions - Functions to simplify the demo

In [0]:
%run ./resources/SensorDataGenerator

In [0]:
%run ./resources/EnvironmentalMonitorListProcessor

In [0]:
%run ./resources/HelperFunctions

## Rate Source
* Use the Spark native rate source to generate records
* Set to generate 200 rows per second
* The timestamp col is the record generation timestamp and is also used as the sensor timestamp
  * Effectively our "source timestamp"
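
For orientation, here is a minimal sketch of reading from Spark's built-in rate source directly. It is not the implementation of `SensorDataGenerator`, which is loaded from the resources folder and not shown here. It assumes a notebook or spark-shell session where `spark` already exists; the name `rateSketch` is hypothetical, and the partition count of 8 is an assumption chosen to match the last argument passed to `createStream` below.

```scala
// Sketch only: assumes `spark` (a SparkSession) is already defined, as in a notebook.
val rateSketch = spark.readStream
  .format("rate")                 // Spark's built-in test source
  .option("rowsPerSecond", 200)   // matches the demo default
  .option("numPartitions", 8)     // assumed to mirror the demo's setting
  .load()                         // columns: timestamp (TimestampType), value (LongType)
```

The `timestamp` column produced by the rate source is what the demo treats as the source timestamp.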


In [0]:
import com.databricks.dais2025.SensorDataGenerator

// Note: don't set rows per second too high, it will affect performance; this is set up as a demonstration, not a benchmark
// Default is 200 rows per second, can be changed via widget
val stream = SensorDataGenerator.createStream(spark, dbutils.widgets.get("rowsPerSecond").toInt, 8) 

// Used to track data in kafka
val runId = s"rtmRunID${scala.util.Random.alphanumeric.take(6).mkString}" 

## Apply TransformWithState to Stream
* On the stream we apply the [TransformWithState](https://docs.databricks.com/aws/en/stateful-applications/) operator to calculate the state of the sensors per city and create alerts based on the thresholds set in the processor
  * Group by city
  * Construct the columns we want to send to Kafka as a JSON string column called "value"
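
To make the idea of per-key state concrete, here is a plain-Scala illustration of what a stateful operator does conceptually: group rows by key, then fold each row into that key's state. It does not use the Spark API and is not the code of `EnvironmentalMonitorListProcessor`. The types `Reading` and `CityState`, the object `CityStateSketch`, and the threshold of 30.0 are all hypothetical and simplified.

```scala
// Hypothetical, simplified types; the demo's real Input/Output live in MyStructs.
final case class Reading(city: String, temperature: Double)

final case class CityState(count: Long, sumTemp: Double, highTempCount: Long) {
  def avgTemp: Double = if (count == 0) 0.0 else sumTemp / count
}

object CityStateSketch {
  // Assumed threshold, for illustration only.
  val HighTempThreshold = 30.0

  // One state update: applied for every incoming row of a key.
  def update(state: CityState, r: Reading): CityState =
    CityState(
      count = state.count + 1,
      sumTemp = state.sumTemp + r.temperature,
      highTempCount = state.highTempCount + (if (r.temperature > HighTempThreshold) 1 else 0)
    )

  // Emulates grouping by city followed by a per-key fold over the rows.
  def run(readings: Seq[Reading]): Map[String, CityState] =
    readings.groupBy(_.city).map { case (city, rows) =>
      city -> rows.foldLeft(CityState(0, 0.0, 0))(update)
    }
}
```

In the streaming case the state is not rebuilt from scratch per batch; it is kept in the state store between batches and updated as new rows arrive for each city.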

In [0]:
import com.databricks.dais2025.tws.EnvironmentalMonitorListProcessor
import org.apache.spark.sql.streaming.{OutputMode, TimeMode}
import org.apache.spark.sql.functions.{col, struct, to_json, array, lit}
import com.databricks.dais2025.MyStructs._

val twsStream = stream
  .as[Input]
  .groupByKey(x => x.city)
  .transformWithState(
    new EnvironmentalMonitorListProcessor(), 
    TimeMode.ProcessingTime(), 
    OutputMode.Update()
  )
  .as[Output]
  .withColumn("value", to_json(struct(
    col("sensor_id"),
    col("location"),
    col("city"),
    col("timestamp"),
    col("temperature"),
    col("humidity"),
    col("co2_level"),
    col("pm25_level"),
    col("hourly_avg_temp"),
    col("daily_avg_temp"),
    col("temperature_trend"),
    col("high_temp_count"),
    col("alerts")
  )))

## Shuffle Partitions
* Use 8 shuffle partitions for the shuffle stage introduced by TransformWithState (TWS)

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", 8)

## Create Kafka Topic to Write to
* This is set up for the oetrta Kafka cluster, to be used on e2-demo-field-eng
* Modify the Kafka props if the environment changes
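
The body of `HelperFunctions.createKafkaTopic` is not shown in this notebook. As a rough sketch of what such a helper might do, the standard Kafka `AdminClient` can create a topic as follows. The function name `createTopicSketch` is hypothetical, and it assumes the `kafka-clients` library is on the classpath; on some platforms the package is shaded under a different prefix, in which case the imports need adjusting.

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}

// Hypothetical helper, not the demo's HelperFunctions.createKafkaTopic.
def createTopicSketch(name: String, props: Properties, partitions: Int, replication: Short): Unit = {
  val admin = AdminClient.create(props)
  try {
    val topic = new NewTopic(name, partitions, replication)
    // Blocks until the broker confirms creation; throws if the request fails.
    admin.createTopics(Collections.singletonList(topic)).all().get()
  } finally {
    admin.close()
  }
}
```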

In [0]:
val topicName = s"daisrtm2025${runId}"
val bootstrapServers = dbutils.secrets.get("oetrta", "kafka-bootstrap-servers-tls")
val kafkaProps = {
  val props = new java.util.Properties()
    props.put("bootstrap.servers", bootstrapServers)
    props.put("security.protocol", "SSL")
  props
}
val partitionCount = 4

HelperFunctions.createKafkaTopic(
  topicName=topicName, props=kafkaProps,
  partitionCount=partitionCount, 
  replicationFactor=2) 

## WriteStream Options
* These are the Spark writeStream options we will be using
* Please update the checkpointLocation base path (via widget) to a volume you have access to

In [0]:
val checkpointLocationRaw = dbutils.widgets.get("checkpointLocation")
// Strip a trailing slash, if any, so the run-specific path below does not contain "//"
val checkpointLocation = checkpointLocationRaw.stripSuffix("/")

val writeStreamOptions = Map(
  "kafka.bootstrap.servers" -> bootstrapServers,
  "kafka.security.protocol" -> "SSL",
  "topic" -> topicName,
  "checkpointLocation" -> s"${checkpointLocation}/${runId}"
)

### Run Stream in MBM Mode
* First run 20 batches in MBM (MicroBatch Mode)
  * This is the de facto mode in Spark Structured Streaming
* Use a ProcessingTime trigger of 0 seconds so it runs as fast as possible

In [0]:
import org.apache.spark.sql.streaming.Trigger

val queryProgressBatch = HelperFunctions.stopStreamAfterBatchesCollectProgress(
  spark=spark,
  stream = twsStream,
  queryName = "RTMDemo",
  maxBatches = 20,
  trigger = Trigger.ProcessingTime("0 seconds"), //default
  writeStreamOptions = writeStreamOptions,
  runId = runId
)

## MBM Has No Latency Metrics
* MBM doesn't report latency metrics in the query progress after a batch
* We have to derive them by reading from Kafka and subtracting the source timestamp from the sink timestamp
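
The arithmetic behind that derivation can be sketched in plain Scala. This is an illustration only, not the code in `HelperFunctions`: the object `LatencySketch` and its function names are hypothetical, timestamps are assumed to be epoch milliseconds, and the percentile is an exact nearest-rank calculation, whereas Spark's `percentile_approx` is approximate and may differ slightly.

```scala
object LatencySketch {
  // Latency of one record in ms: Kafka (sink) timestamp minus source timestamp.
  def latencyMs(sourceTsMs: Long, kafkaTsMs: Long): Long = kafkaTsMs - sourceTsMs

  // Exact nearest-rank percentile for p in (0, 1].
  def percentile(values: Seq[Long], p: Double): Long = {
    require(values.nonEmpty, "values must not be empty")
    require(p > 0.0 && p <= 1.0, "p must be in (0, 1]")
    val sorted = values.sorted
    val rank = math.ceil(p * sorted.length).toInt // 1-based rank
    sorted(rank - 1)
  }
}

// Made-up (source, sink) timestamp pairs in epoch milliseconds.
val samplePairs = Seq((1000L, 1100L), (2000L, 2250L), (3000L, 3050L), (4000L, 5000L))
val sampleLatencies = samplePairs.map { case (src, sink) => LatencySketch.latencyMs(src, sink) }
val sampleP50 = LatencySketch.percentile(sampleLatencies, 0.5)
val sampleP99 = LatencySketch.percentile(sampleLatencies, 0.99)
```

With only a handful of records the p99 is simply the slowest record, which is why a single slow batch can dominate tail latency.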

In [0]:
HelperFunctions.printRTMStreamMetrics(queryProgressBatch)

## Now Run In RealTime Mode
* We set the trigger interval to 60 seconds (the default is 5 minutes)
  * This means we only attempt to checkpoint once every 60 seconds
  * With the default of 5 minutes the p99 would look even better, since less time is spent checkpointing
* We run 5 batches here, roughly 5 minutes in total


In [0]:
import org.apache.spark.sql.streaming.Trigger

val queryProgressRTM = HelperFunctions.stopStreamAfterBatchesCollectProgress(
  spark = spark,
  stream = twsStream,
  queryName = "RTMDemo",
  maxBatches = 5,
  trigger = Trigger.RealTime("60 seconds"),
  writeStreamOptions = writeStreamOptions,
  runId = runId
)

## RTM Latency Metrics
* RealTime Mode has latency metrics, most importantly e2eLatency, which tells us how long a record took to get from source to sink (in this case Kafka)
* [Latency Metrics Info](https://docs.databricks.com/aws/en/structured-streaming/real-time#use-streamingqueryprogress)
* Note: there is generally some overhead on the first batch in RealTime Mode (and in MBM), which is why we run multiple batches in the demo to show consistently low latency

In [0]:
HelperFunctions.printRTMStreamMetrics(queryProgressRTM)
// discuss numbers below

## MBM Latency Numbers
* To get the latency numbers for MBM, we batch read from Kafka and subtract the source timestamp (discussed earlier) from the timestamp Kafka adds to each message when it is received

In [0]:
import org.apache.spark.sql.functions.{avg, expr, percentile_approx, lit, dense_rank}
import org.apache.spark.sql.expressions.Window

// Read kafka as batch and filter for unique RunId we added to Kafka Header
val filteredKafkaDf = HelperFunctions.readAndFilterKafkaBatch(
  spark,
  Map(
    "kafka.bootstrap.servers" -> bootstrapServers,
    "kafka.security.protocol" -> "SSL",
    "includeHeaders" -> "true",
    "subscribe" -> topicName,
    "startingOffsets" -> "earliest",
    "endingOffsets" -> "latest"
  ),
  runId
).withColumn("batchnumberfortriggertype", dense_rank().over(Window.partitionBy("triggertype").orderBy("batchId")))

// Let's take a quick look at the actual data
display(filteredKafkaDf)

In [0]:
// Calculate mean, p50, p95, and p99 latency per batch

val latencyNumbers = filteredKafkaDf.groupBy("triggertype", "runId", "batchId", "batchnumberfortriggertype")
    .agg(
      avg("latency").alias("mean_latency"),
      percentile_approx(expr("latency"), lit(0.5), lit(100000)).alias("p50_latency"),
      percentile_approx(expr("latency"), lit(0.95), lit(100000)).alias("p95_latency"),
      percentile_approx(expr("latency"), lit(0.99), lit(100000)).alias("p99_latency")
    )
    .orderBy("triggertype", "batchId")

In [0]:
// processingtime0 rows are MBM batches; realtime rows are RTM batches
// discuss them (the e2e latency numbers for realtime mode may vary very slightly here vs StreamingQueryProgress)
display(latencyNumbers)

triggertype,runId,batchId,batchnumberfortriggertype,mean_latency,p50_latency,p95_latency,p99_latency
processingtime0,rtmRunIDMkQrhU,0,1,29833.510869565216,30051,31701,31882
processingtime0,rtmRunIDMkQrhU,1,2,18044.39925816024,18038,33199,34549
processingtime0,rtmRunIDMkQrhU,2,3,1556.1492063492065,1564,2266,2328
processingtime0,rtmRunIDMkQrhU,3,4,1764.0641711229946,1773,2210,2292
processingtime0,rtmRunIDMkQrhU,4,5,1456.2871287128712,1453,2139,2196
processingtime0,rtmRunIDMkQrhU,5,6,1065.775147928994,1068,1446,1482
processingtime0,rtmRunIDMkQrhU,6,7,1031.0448717948718,1030,1386,1417
processingtime0,rtmRunIDMkQrhU,7,8,1001.8516129032258,1000,1351,1378
processingtime0,rtmRunIDMkQrhU,8,9,998.8607594936708,994,1358,1380
processingtime0,rtmRunIDMkQrhU,9,10,1048.3885350318471,1040,1411,1432


In [0]:
// p99 graph, MBM vs RTM
// Ignore the first 2 batches of each trigger type: startup overhead skews their batch times (for MBM as well)

display(
  filteredKafkaDf
    .where("batchnumberfortriggertype not in (1,2)")
    .groupBy("triggertype")
    .agg(
      percentile_approx(expr("latency"), lit(0.99), lit(10000)).alias("p99_latency")
    )
)

triggertype,p99_latency
processingtime0,2270
realtime,211


Databricks visualization. Run in Databricks to view.

In [0]:
// Clean up kafka topic when done
HelperFunctions.deleteKafkaTopic(topicName, kafkaProps)
// delete checkpoint location
dbutils.fs.rm(s"${checkpointLocation}/${runId}", true)