# Case 1 Part 2: Anomaly Detection

This notebook covers the second part of the analytics use case for a real-time alerting system:

`Hourly consumption for a household is higher than 1 standard deviation of that household's historical mean consumption for that hour.`

This part reads the data streams from both `readings_prepared` and `alert_1_stats` and uses them to detect anomalies: readings that are more than 1 standard deviation above the mean. The detected anomalies are then stored in a persistent data store, which in this case is Google BigQuery.
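To make the rule concrete, here is a minimal plain-Scala sketch of the comparison, independent of Spark. The names `AnomalyRule`, `HourlyStats`, and `isAnomaly` are illustrative only and do not appear elsewhere in this notebook.

```scala
// Minimal sketch of the alert rule; all names here are hypothetical.
object AnomalyRule {
  // Per-house, per-hour statistics, as carried by the `alert_1_stats` topic.
  final case class HourlyStats(mean: Float, stdDev: Float)

  // A reading is anomalous when it is more than 1 standard deviation above the mean.
  def isAnomaly(readingValue: Float, stats: HourlyStats): Boolean =
    readingValue > stats.mean + stats.stdDev
}
```

With a mean of 10.0 and a standard deviation of 2.0, a reading of 12.5 is flagged, while a reading of exactly 12.0 is not, because the comparison is strict.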

BigQuery was chosen because it suits the analytical queries we may want to run later, such as BI or more advanced analysis. It is also serverless: we only need to define the datasets and tables.


## Setup

Import all the required libraries and set the stream configuration variables.

In [1]:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql._

Intitializing Scala interpreter ...

Spark Web UI available at http://spark-alert-1-detect-m:8088/proxy/application_1583313185168_0001
SparkContext available as 'sc' (version = 2.4.5, master = yarn, app id = application_1583313185168_0001)
SparkSession available as 'spark'


import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql._


In [2]:
val kafkaBootstrapServer = "kafka-m:9092"
val kafkaReadingsTopic = "readings_prepared"
val kafkaStatsTopic = "alert_1_stats"
val kafkaDedupWatermarkTime = "1 minute"
val joinWatermarkTime = "1 minute"
val bigQueryTargetTable = "smartplugs.alert_1_anomaly"
val bigQueryTempBucket = "pandora-sde-case/alert_1"
val outputTriggerTime = "1 minute"

kafkaBootstrapServer: String = kafka-m:9092
kafkaReadingsTopic: String = readings_prepared
kafkaStatsTopic: String = alert_1_stats
kafkaDedupWatermarkTime: String = 1 minute
joinWatermarkTime: String = 1 minute
bigQueryTargetTable: String = smartplugs.alert_1_anomaly
bigQueryTempBucket: String = pandora-sde-case/alert_1
outputTriggerTime: String = 1 minute


## Define The Required Schema

In [3]:
// This will be used to give the source `readings_prepared` stream data a schema
val readingsSchema = StructType(Seq(
    StructField("message_id", StringType, false),
    StructField("reading_ts", TimestampType, false),
    StructField("reading_value", FloatType, false),
    StructField("reading_type", IntegerType, false),
    StructField("plug_id", IntegerType, false),
    StructField("household_id", IntegerType, false),
    StructField("house_id", IntegerType, false)
))

val statsSchema = StructType(Seq(
    StructField("house_id", IntegerType, false),
    StructField("hour", IntegerType, false),
    StructField("mean", FloatType, false),
    StructField("m2", FloatType, false),
    StructField("variance", FloatType, false),
    StructField("std_dev", FloatType, false),
    StructField("count", LongType, false),
    StructField("last_ts", TimestampType, false)
))

readingsSchema: org.apache.spark.sql.types.StructType = StructType(StructField(message_id,StringType,false), StructField(reading_ts,TimestampType,false), StructField(reading_value,FloatType,false), StructField(reading_type,IntegerType,false), StructField(plug_id,IntegerType,false), StructField(household_id,IntegerType,false), StructField(house_id,IntegerType,false))
statsSchema: org.apache.spark.sql.types.StructType = StructType(StructField(house_id,IntegerType,false), StructField(hour,IntegerType,false), StructField(mean,FloatType,false), StructField(m2,FloatType,false), StructField(variance,FloatType,false), StructField(std_dev,FloatType,false), StructField(count,LongType,false), StructField(last_ts,TimestampType,false))
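The `count`, `mean`, and `m2` fields in `statsSchema` look like the running state of an online variance calculation. As a rough sketch only, assuming the first part of this use case uses a Welford-style update and a sample variance (neither is confirmed in this notebook), the fields could relate as follows. `RunningStats` and its members are hypothetical names, and `Double` is used here for simplicity although the schema declares `FloatType`.

```scala
// Sketch of a Welford-style running mean/variance; an assumption, not the confirmed upstream logic.
object RunningStats {
  // Mirrors the numeric state fields of `statsSchema`.
  final case class Stats(count: Long, mean: Double, m2: Double) {
    // Sample variance is undefined until at least 2 records have been seen.
    def variance: Option[Double] =
      if (count < 2) None else Some(m2 / (count - 1))

    def stdDev: Option[Double] = variance.map(v => math.sqrt(v))
  }

  val empty: Stats = Stats(0L, 0.0, 0.0)

  // Fold one new value into the running statistics.
  def update(s: Stats, x: Double): Stats = {
    val count = s.count + 1
    val delta = x - s.mean
    val mean = s.mean + delta / count
    // m2 accumulates the sum of squared deviations from the current mean.
    val m2 = s.m2 + delta * (x - mean)
    Stats(count, mean, m2)
  }
}
```

Folding a sequence of values with `values.foldLeft(RunningStats.empty)(RunningStats.update)` yields the same mean and variance as a two-pass calculation, without keeping the individual values.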


### Read and Parse The Input Data Streams

There are 2 input streams this time: the `readings_prepared` and `alert_1_stats` topics. They will be joined to detect anomalies.

In [4]:
// Drop duplicates that arrive within the watermark delay. The bound is necessary so that Spark
// does not keep ALL records in the state
val readings = spark
    .readStream 
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)
    .option("subscribe", kafkaReadingsTopic)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", readingsSchema).as("data"))
    .select($"data.*")
    .withWatermark("reading_ts", kafkaDedupWatermarkTime) 
    .dropDuplicates()
    .filter($"reading_type" === 1) // Only take the "current load" measurement

readings: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [message_id: string, reading_ts: timestamp ... 5 more fields]
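The comment in the cell above says the watermark bounds the deduplication state. The following plain-Scala model illustrates the idea under simplifying assumptions: it keys on a message id only and advances the watermark per record, whereas Spark advances it per micro-batch and, with `dropDuplicates()`, keys on all columns. `WatermarkDedup` and its members are hypothetical names, not Spark APIs.

```scala
// Simplified model of watermark-bounded deduplication; not Spark's actual implementation.
object WatermarkDedup {
  final case class Reading(messageId: String, eventTimeMs: Long)

  // State kept between records: the keys seen so far (with their event time) and the watermark.
  final case class State(seen: Map[String, Long], watermarkMs: Long)

  // The watermark starts at epoch 0.
  val empty: State = State(Map.empty, 0L)

  // Processes one record; returns the new state and the record if it should be emitted.
  def process(state: State, r: Reading, delayMs: Long): (State, Option[Reading]) =
    if (r.eventTimeMs < state.watermarkMs) {
      // Too late: the record is dropped and the state is left untouched.
      (state, None)
    } else {
      val isDuplicate = state.seen.contains(r.messageId)
      val withKey =
        if (isDuplicate) state.seen else state.seen + (r.messageId -> r.eventTimeMs)
      val newWatermark = math.max(state.watermarkMs, r.eventTimeMs - delayMs)
      // Evicting keys that fell behind the watermark is what keeps the state bounded.
      val kept = withKey.filter { case (_, ts) => ts >= newWatermark }
      (State(kept, newWatermark), if (isDuplicate) None else Some(r))
    }
}
```

Without the eviction step the `seen` map would grow with every distinct record, which is the unbounded state the watermark is meant to avoid. The trade-off is that a duplicate arriving after its original has been evicted is no longer recognized as a duplicate.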


In [5]:
// We don't need to deduplicate the stats, but we drop the ones with mean=0
val stats = spark
    .readStream 
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)
    .option("subscribe", kafkaStatsTopic)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", statsSchema).as("data"))
    .select($"data.house_id", $"data.hour", $"data.mean", $"data.std_dev", $"data.last_ts")
    .filter($"mean" > 0.0)


stats: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [house_id: int, hour: int ... 3 more fields]


#### Peek at The Input Data Streams

##### Readings

In [6]:
val readingsQuery = readings.writeStream.format("memory").queryName("readings").start()
Thread.sleep(10000)
readingsQuery.status

readingsQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@3e63118e
res0: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Processing new data",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}


In [7]:

spark.sql("select * from readings").show()

+----------+----------+-------------+------------+-------+------------+--------+
|message_id|reading_ts|reading_value|reading_type|plug_id|household_id|house_id|
+----------+----------+-------------+------------+-------+------------+--------+
+----------+----------+-------------+------------+-------+------------+--------+



In [8]:
// readingsQuery.stop()
readingsQuery.lastProgress

res2: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "4d7065fb-beda-47bc-8762-23343f78cfd2",
  "runId" : "b19775d4-c45a-4f23-acd0-842b2c6f8a07",
  "name" : "readings",
  "timestamp" : "2020-03-04T09:19:31.369Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getEndOffset" : 1,
    "setOffsetRange" : 51,
    "triggerExecution" : 63
  },
  "eventTime" : {
    "watermark" : "1970-01-01T00:00:00.000Z"
  },
  "stateOperators" : [ {
    "numRowsTotal" : 0,
    "numRowsUpdated" : 0,
    "memoryUsedBytes" : 44599,
    "customMetrics" : {
      "loadedMapCacheHitCount" : 0,
      "loadedMapCacheMissCount" : 0,
      "stateOnCurrentVersionSizeBytes" : 15799
    }
  } ],
  "sources" : [ {
    "descr...

In [9]:
readingsQuery.status

res3: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Getting offsets from KafkaV2[Subscribe[readings_prepared]]",
  "isDataAvailable" : false,
  "isTriggerActive" : true
}


##### Stats

In [10]:
val statsQuery = stats.writeStream.format("memory").queryName("stats").start()
statsQuery.status

statsQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@7ab28881
res4: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Getting offsets from KafkaV2[Subscribe[alert_1_stats]]",
  "isDataAvailable" : false,
  "isTriggerActive" : true
}


In [11]:
// Thread.sleep(10000)
spark.sql("select * from stats").show()

+--------+----+----+-------+-------+
|house_id|hour|mean|std_dev|last_ts|
+--------+----+----+-------+-------+
+--------+----+----+-------+-------+



In [12]:
// statsQuery.stop()
statsQuery.lastProgress

res6: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "ecd4dd6c-94a7-49c0-b50f-068c000459c0",
  "runId" : "bec3a72c-1e52-423f-8dc7-78dded1a5bb4",
  "name" : "stats",
  "timestamp" : "2020-03-04T09:19:34.075Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getEndOffset" : 0,
    "setOffsetRange" : 2,
    "triggerExecution" : 2
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[alert_1_stats]]",
    "startOffset" : {
      "alert_1_stats" : {
        "0" : 0
      }
    },
    "endOffset" : {
      "alert_1_stats" : {
        "0" : 0
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" ...

In [13]:
statsQuery.status

res7: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}


## Join Both Data Streams


We can only compare readings with past statistics, so the `reading_ts` value of the `readings_prepared` stream needs to be later than the `last_ts` value of the `alert_1_stats` stream. We also need to ensure that we are comparing against the stats for the right `house_id` and `hour`.

After joining, we need to ensure that we only compare the readings with their latest past statistics. The `alert_1_stats` stream is assumed to arrive more slowly than the `readings_prepared` stream, since it needs to wait for at least 2 records before it can start calculating, and the default write trigger is every 1 minute.
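The matching logic described above can be sketched in plain Scala, outside of Spark, to show what "latest past statistics" means for a single reading. This is an illustration under assumptions: the names `LatestStats`, `latestFor`, and `hourOf` are hypothetical, and the hour is taken in UTC, whereas Spark's `hour()` follows the session time zone.

```scala
import java.time.{Instant, ZoneOffset}

// Sketch of selecting the latest past statistics for one reading; all names are hypothetical.
object LatestStats {
  final case class Reading(houseId: Int, readingTs: Instant, value: Float)
  final case class Stats(houseId: Int, hour: Int, mean: Float, stdDev: Float, lastTs: Instant)

  // Hour of day in UTC (an assumption made for this sketch).
  def hourOf(ts: Instant): Int = ts.atZone(ZoneOffset.UTC).getHour

  // Among the stats for the same house and hour that were computed strictly
  // before the reading, keep only the most recent one.
  def latestFor(r: Reading, candidates: Seq[Stats]): Option[Stats] = {
    val matching = candidates.filter { s =>
      s.houseId == r.houseId &&
      s.hour == hourOf(r.readingTs) &&
      s.lastTs.isBefore(r.readingTs)
    }
    if (matching.isEmpty) None else Some(matching.maxBy(_.lastTs.toEpochMilli))
  }

  // A reading with no past statistics is never flagged.
  def isAnomaly(r: Reading, candidates: Seq[Stats]): Boolean =
    latestFor(r, candidates).exists(s => r.value > s.mean + s.stdDev)
}
```

Note that a join condition of the form `reading_ts > last_ts` on its own matches a reading with every earlier stats record for that house and hour, so selecting the most recent one is a separate step, for example by taking the maximum `last_ts` per reading.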



In [14]:
// Anomaly detection is done by getting the past std_dev and mean values for the reading's
// house and hour, then keeping the readings whose value exceeds mean + std_dev
val anomaly = readings
    .withColumnRenamed("house_id", "readings_house_id")
    .withWatermark("reading_ts", joinWatermarkTime)
    .join(
        stats.withWatermark("last_ts", joinWatermarkTime),
        // Join conditions
        $"reading_ts" > $"last_ts" &&
        hour($"reading_ts") === $"hour" &&
        $"readings_house_id" === $"house_id",
        // Join type
        "inner"
    )
    .filter($"reading_value" > $"mean" + $"std_dev")
    .drop("readings_house_id")

anomaly: org.apache.spark.sql.DataFrame = [message_id: string, reading_ts: timestamp ... 9 more fields]


In [15]:
val peekQuery = anomaly
    .writeStream
    .format("memory")
    .queryName("anomaly")
    .start()

peekQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5f3351a2


In [None]:
spark.sql("select * from anomaly").show()

In [17]:
peekQuery.lastProgress

res9: org.apache.spark.sql.streaming.StreamingQueryProgress = null


In [18]:
peekQuery.status

res10: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Processing new data",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}


In [19]:
// peekQuery.stop()

## Write to BigQuery

Write to BigQuery once a minute. BigQuery inserts work better with mini-batched data, and this interval also gives `alert_1_stats` enough time to accumulate. The [BigQuery connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) is installed during cluster setup and loaded automatically when the spark-shell session starts.

In [20]:
val ingestQuery = anomaly
    .writeStream
    .trigger(Trigger.ProcessingTime(outputTriggerTime))
    .foreachBatch{ 
        (batchDF: DataFrame, batchId: Long) =>
            batchDF.write.format("bigquery")
                .option("table", bigQueryTargetTable)
                .option("temporaryGcsBucket", bigQueryTempBucket)
                .mode(SaveMode.Append)
                .save()
    }.start()

ingestQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@79a1fc09


In [None]:
ingestQuery.lastProgress

In [None]:
ingestQuery.status

In [23]:
//ingestQuery.stop()