# Case 1 Part 1: Calculating Running Average and Standard Deviation

One of the analytics use cases to be tackled is a real-time alerting system:

`Hourly consumption for a household is higher than 1 standard deviation of that household's historical mean consumption for that hour.`

To do that, I split the use case into three parts. 

First, read the data stream from the `readings_prepared` topic, group the readings by `house_id` and the **hour** of `reading_ts`, and calculate the statistics (average and standard deviation) of each group. Every time a value changes, or at a short regular interval, we persist the change to a database or stream. Let's name the target store `alert_1_stats`. This first task is covered in this notebook.

The second part reads the data streams from both `readings_prepared` and `alert_1_stats`, uses them to detect anomalies, and stores the anomalies in a persistent data store.

The third part is developing a scheduled job that checks that data store for anomalies and sends alerts to the related parties. **The last 2 tasks will have their own notebook / code.**
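
As a rough illustration of the alert rule quoted above, the check that the second part will eventually apply could look like the sketch below. It assumes the rule means "the reading exceeds the historical mean by more than one standard deviation"; the helper `isAnomalous` and its parameter `k` are hypothetical names that do not appear in this notebook.

```scala
// Hypothetical helper, not part of this notebook: flags a reading that is more than
// `k` standard deviations above the group's historical mean (k = 1 in this use case).
def isAnomalous(readingValue: Float, mean: Float, stdDev: Float, k: Float = 1.0f): Boolean =
  readingValue > mean + k * stdDev

// Made-up numbers: with mean 100 and standard deviation 20, the threshold is 120
val aboveThreshold = isAnomalous(130.0f, 100.0f, 20.0f)
val belowThreshold = isAnomalous(110.0f, 100.0f, 20.0f)
println(s"130 flagged: $aboveThreshold, 110 flagged: $belowThreshold")
```

The mean and standard deviation passed in would come from the `alert_1_stats` record for the same `house_id` and hour as the reading.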


## Setup

Import all the required libraries and set the stream configuration variables.

In [1]:
import spark.implicits._
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode, GroupState}
import org.apache.spark.sql.streaming.Trigger

Intitializing Scala interpreter ...

Spark Web UI available at http://spark-alert-1-stats-m:8088/proxy/application_1583161277657_0002
SparkContext available as 'sc' (version = 2.4.5, master = yarn, app id = application_1583161277657_0002)
SparkSession available as 'spark'



In [2]:
val kafkaBootstrapServer = "kafka-m:9092"
val kafkaSourceTopic = "readings_prepared"
val kafkaTargetTopic = "alert_1_stats"
val checkpointLocation = "/tmp"
val triggerTime = "1 minute"
val deduplicateWindow = "1 minute"

kafkaBootstrapServer: String = kafka-m:9092
kafkaSourceTopic: String = readings_prepared
kafkaTargetTopic: String = alert_1_stats
checkpointLocation: String = /tmp
triggerTime: String = 1 minute
deduplicateWindow: String = 1 minute


## Define The Required Schema and Classes

In [3]:
// This will be used to give the source `readings_prepared` stream data a schema
val mySchema = StructType(Seq(
    StructField("message_id", StringType, false),
    StructField("reading_ts", TimestampType, false),
    StructField("reading_value", FloatType, false),
    StructField("reading_type", IntegerType, false),
    StructField("plug_id", IntegerType, false),
    StructField("household_id", IntegerType, false),
    StructField("house_id", IntegerType, false)
))

mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(message_id,StringType,false), StructField(reading_ts,TimestampType,false), StructField(reading_value,FloatType,false), StructField(reading_type,IntegerType,false), StructField(plug_id,IntegerType,false), StructField(household_id,IntegerType,false), StructField(house_id,IntegerType,false))


In [4]:
// Somehow case classes need to be defined in a separate cell
// This will be used in the stateful streaming calculation as the input rows
case class ReadingsInput(
    house_id: Int, 
    hour: Int, 
    reading_value: Float,
    reading_ts: java.sql.Timestamp 
)

// This will be used in the stateful streaming calculation as the state store schema
case class StatsState(
    house_id: Int,
    hour: Int,
    mean: Float,
    m2: Float,
    variance: Float,
    std_dev: Float,
    count: Long, 
    last_ts: java.sql.Timestamp // Will be used in stream-to-stream join
)

defined class ReadingsInput
defined class StatsState


### Read and Parse The Input Stream

Only take the Current Load readings (`reading_type = 1`).

In [5]:
val readings = spark
    .readStream 
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)
    .option("subscribe", kafkaSourceTopic)
    .option("failOnDataLoss", false)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", mySchema).as("data"))
    .select($"data.*")
    .withWatermark("reading_ts", deduplicateWindow) 
    .dropDuplicates()
    .filter($"reading_type" === 1) // Only take the "current load" measurement

readings: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [message_id: string, reading_ts: timestamp ... 5 more fields]


#### Peek at The Input Data Stream

In [6]:
val streamQuery = readings.writeStream.format("memory").queryName("readings").start()

streamQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@767b3bb8


In [7]:
streamQuery.status

res0: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Getting offsets from KafkaV2[Subscribe[readings_prepared]]",
  "isDataAvailable" : false,
  "isTriggerActive" : true
}


In [16]:
// Thread.sleep(60000)
spark.sql("select * from readings").show()

+----------+-------------------+-------------+------------+-------+------------+--------+
|message_id|         reading_ts|reading_value|reading_type|plug_id|household_id|house_id|
+----------+-------------------+-------------+------------+-------+------------+--------+
| 118504114|2013-09-01 17:53:00|          0.0|           1|      0|           0|       9|
| 115998526|2013-09-01 17:27:20|          0.0|           1|      0|           0|       0|
| 116872749|2013-09-01 17:36:20|          0.0|           1|      1|           0|       9|
| 118371375|2013-09-01 17:51:40|          0.0|           1|      2|           0|       4|
| 117153098|2013-09-01 17:39:20|          0.0|           1|      0|           0|       5|
| 117306922|2013-09-01 17:41:00|       50.627|           1|      0|           0|       8|
| 118503302|2013-09-01 17:53:00|       41.056|           1|      2|           0|       0|
| 118868629|2013-09-01 17:56:40|      122.738|           1|      1|           0|       1|
| 11972589

In [9]:
// streamQuery.stop()
streamQuery.lastProgress

res2: org.apache.spark.sql.streaming.StreamingQueryProgress = null


## Calculate The Running Average and Standard Deviation

To complete the first task, one thing to keep in mind is that we are calculating a running average and standard deviation over an unbounded stream of data. With that in mind, the typical way of calculating an average over bounded/batch data, `average = sum_of_values / population_size`, is a poor fit.

Why? To compute the average that way we need to sum all the values on the stream, so we have to keep the running sum as state at every point in time. As the stream goes on and on, that sum keeps growing, which can lead to numeric overflow issues. Instead, we can calculate the average incrementally by keeping track of the current average and the current population size / record count, and then adjusting both as new data arrives. The standard deviation can be maintained incrementally in the same way. We can do that using [Welford's online algorithm](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm).
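
For reference, the update step of Welford's algorithm can be written as the recurrences below. The notation is my own choice for this sketch rather than something taken from the data or from Spark: $x_k$ is the $k$-th reading of a group, $\mu_k$ the running mean, $M_{2,k}$ the running sum of squared deviations, and $s_k$ the sample standard deviation.

$$
\begin{aligned}
\delta_k &= x_k - \mu_{k-1} \\
\mu_k &= \mu_{k-1} + \frac{\delta_k}{k} \\
M_{2,k} &= M_{2,k-1} + \delta_k \,(x_k - \mu_k) \\
s_k^2 &= \frac{M_{2,k}}{k - 1} \quad (k > 1), \qquad s_k = \sqrt{s_k^2}
\end{aligned}
$$

The starting values are $\mu_0 = 0$ and $M_{2,0} = 0$. As a small worked example with made-up readings 2, 4, 6: after the first reading $\mu_1 = 2$ and $M_{2,1} = 0$; after the second $\mu_2 = 3$ and $M_{2,2} = 2$; after the third $\mu_3 = 4$ and $M_{2,3} = 8$, so $s_3^2 = 8 / 2 = 4$ and $s_3 = 2$, which matches the usual two-pass sample standard deviation of those three values.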

#### Storing The Running Statistics

Once we have a stream of running averages and standard deviations, I would ideally store the stats in a data store with very low lookup latency, such as HBase or Google Bigtable. However, due to technical issues with HBase's Spark library, I write the data back to Kafka on the `alert_1_stats` topic instead. I did not want to spend too much time trying to resolve the HBase issue, so I decided to drop it.

This [blog post](https://medium.com/@robbinjain19/challenges-faced-while-integrating-apache-spark-with-hbase-and-the-solution-5c1c8a068808) suggests that I downgrade to Spark 2.1.0 to resolve the HBase connector issue, but I need to use the `foreachBatch` functionality from Spark 2.4.0 and stream-to-stream join from Spark 2.3.0.


### Define the Functions to Be Used in MapGroupsWithState

I created 2 functions: `updateHouseHourStats(state, input)` and `updateAllHouseHourStats(key, inputs, groupState)`. Spark's `mapGroupsWithState` calls the latter for every group that has new data in a micro-batch, and the latter calls the former on each individual row to update the state.

The case states "`that household's historical mean consumption for that hour.`", so the grouping keys used here are `house_id` and `hour(reading_ts)`. Looking at the data, `household_id` is always 0, so it does not make sense to use it.

In [10]:
/**
 * Applies one reading to the running statistics of a (house_id, hour) group.
 * To be called by updateAllHouseHourStats
 */
def updateHouseHourStats(state: StatsState, input: ReadingsInput): StatsState = {
    // This is implementing Welford's moving average / standard deviation algorithm
    val newCount = state.count + 1
    val delta = input.reading_value - state.mean
    val newMean = state.mean + (delta / newCount)
    val newDelta = input.reading_value - newMean
    val newM2 = state.m2 + (delta * newDelta)
    // Sample variance divides by (n - 1); note that state.count == newCount - 1
    val newVariance = if (newCount > 1) newM2 / state.count else 0f
    val newStdDev = if (newCount > 1) Math.sqrt(newVariance).toFloat else 0f

    StatsState(
        state.house_id,
        state.hour,
        newMean,
        newM2,
        newVariance,
        newStdDev,
        newCount,
        input.reading_ts // Timestamp of the latest reading applied to this state
    )
}

/**
 * To be called by mapGroupsWithState
 */
def updateAllHouseHourStats(
  key: (Int, Int),
  inputs: Iterator[ReadingsInput],
  groupState: GroupState[StatsState]
) : StatsState = {
    // Get the previous state if it exists, else create a new empty state
    var currentStatsState = groupState.getOption.getOrElse {
        StatsState(
            key._1, // house_id
            key._2, // hour
            0,  // Mean
            0,  // m2
            0,  // Variance
            0,  // STD Dev
            0,  // Count
            null // last_ts, set once the first reading is applied
        )
    }
    // Loop over the inputs of this group in this microbatch
    for (input <- inputs) {
        currentStatsState = updateHouseHourStats(currentStatsState, input)
    }
    // Update the current state
    groupState.update(currentStatsState)
    // Return the current state to the stream
    currentStatsState
}

updateHouseHourStats: (state: StatsState, input: ReadingsInput)StatsState
updateAllHouseHourStats: (key: (Int, Int), inputs: Iterator[ReadingsInput], groupState: org.apache.spark.sql.streaming.GroupState[StatsState])StatsState
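
One way to sanity-check the Welford update without starting a stream is to run the same arithmetic over a small in-memory sequence and compare it with a plain two-pass calculation. The sketch below is Spark-free and deliberately separate from the functions above: `RunningStats`, `welfordStep` and the sample values are hypothetical and only mirror the logic of `updateHouseHourStats`, using `Double` instead of `Float`.

```scala
// Hypothetical, Spark-free mirror of the notebook's state; names are illustrative only
case class RunningStats(mean: Double, m2: Double, count: Long) {
  // Sample variance (divides by n - 1); defined as 0 until there are two readings
  def variance: Double = if (count > 1) m2 / (count - 1) else 0.0
  def stdDev: Double = math.sqrt(variance)
}

// One Welford step: fold a single reading into the running statistics
def welfordStep(state: RunningStats, value: Double): RunningStats = {
  val newCount = state.count + 1
  val delta = value - state.mean
  val newMean = state.mean + delta / newCount
  val newM2 = state.m2 + delta * (value - newMean)
  RunningStats(newMean, newM2, newCount)
}

// Made-up readings, not taken from the data set
val sampleReadings = Seq(2.0, 4.0, 6.0)
val running = sampleReadings.foldLeft(RunningStats(0.0, 0.0, 0L))(welfordStep)

// Two-pass (batch) computation over the same values, for comparison
val batchMean = sampleReadings.sum / sampleReadings.size
val batchVariance =
  sampleReadings.map(v => math.pow(v - batchMean, 2)).sum / (sampleReadings.size - 1)

println(f"running: mean=${running.mean}%.3f stdDev=${running.stdDev}%.3f")
println(f"batch:   mean=$batchMean%.3f stdDev=${math.sqrt(batchVariance)}%.3f")
```

Both lines should report the same mean and standard deviation. In the streaming job the readings arrive one micro-batch at a time, but the fold is the same.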


### Send Updates to Kafka

The trigger fires every **1 minute**, an arbitrary interval set in `triggerTime`, since we may want to wait until we accumulate enough data for the statistics calculation. We are using the `GroupStateTimeout.NoTimeout` option, meaning that the stats state never expires and continues to be updated until the stream is stopped.

In [11]:
val stats = readings
    .select($"house_id", hour($"reading_ts").alias("hour"), $"reading_value", $"reading_ts")
    .toDF()
    .as[ReadingsInput]
    .groupByKey(
        input => (input.house_id, input.hour)
    )
    .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAllHouseHourStats)
    
val query = stats
    .selectExpr("CONCAT(house_id, '-', hour, '-', last_ts) AS key", "CAST(to_json(struct(*)) AS STRING) AS value")
    .writeStream
    .outputMode("update")
    .format("kafka")
    .trigger(Trigger.ProcessingTime(triggerTime))
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)
    .option("checkpointLocation", checkpointLocation)
    .option("topic", kafkaTargetTopic)
    .start()

stats: org.apache.spark.sql.Dataset[StatsState] = [house_id: int, hour: int ... 6 more fields]
query: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@13e3be6


In [14]:
// Thread.sleep(10000)
query.status

res5: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Processing new data",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}


In [17]:
// query.stop()
query.lastProgress

res8: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "8e479727-56f4-41c9-ab47-d5e50904d707",
  "runId" : "6f8234eb-ce83-4e55-a34a-625defdc8a24",
  "name" : null,
  "timestamp" : "2020-03-02T15:29:00.001Z",
  "batchId" : 1,
  "numInputRows" : 33000,
  "inputRowsPerSecond" : 549.9908334861086,
  "processedRowsPerSecond" : 692.7097546128172,
  "durationMs" : {
    "addBatch" : 46391,
    "getBatch" : 0,
    "getEndOffset" : 0,
    "queryPlanning" : 674,
    "setOffsetRange" : 1,
    "triggerExecution" : 47639,
    "walCommit" : 426
  },
  "eventTime" : {
    "avg" : "2013-09-01T03:48:44.664Z",
    "max" : "2013-09-01T18:08:00.000Z",
    "min" : "2013-08-31T22:00:20.000Z",
    "watermark" : "1970-01-01T00:00:00.000Z"
  },
  "stateOperators" : [ {
    "numRowsTotal" : 110...