# Case 2 Part 2: Anomaly Detection

This notebook covers the second part of the following analytics use case for a real-time alerting system:

`Hourly consumption for a household is higher than 1 standard deviation of mean consumption across all households within that particular hour on that day.`

This second part reads the data streams from both `readings_prepared` and `alert_2_stats` and uses them to detect anomalies: readings that are more than 1 standard deviation above the mean. The detected anomalies are then stored in a persistent data store, which in this case is Google BigQuery.
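
As a minimal sketch, the rule above can be written as a plain Scala predicate. The object and parameter names are illustrative and do not appear in the notebook; the streaming version of the same comparison is the `filter` applied after the join.

```scala
// Minimal sketch of the alert rule; AnomalyRule and its parameter names are hypothetical.
object AnomalyRule {
  /** True when a reading exceeds the hourly mean by more than one standard deviation. */
  def isAnomaly(readingValue: Double, mean: Double, stdDev: Double): Boolean =
    readingValue > mean + stdDev
}
```

For instance, with the statistics for 2013-09-03, hour 8, that appear in the anomaly output of this notebook (mean 31.719673, standard deviation 96.85666), the threshold is 128.576333, so `AnomalyRule.isAnomaly(131.527, 31.719673, 96.85666)` holds while a reading of 50.627 would not be flagged.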

BigQuery is chosen because it suits the analytical queries we may want to run later, such as BI reporting or more advanced analysis. It is also serverless: we only need to define the datasets and tables.


## Setup

Import all the required libraries and set the stream configuration variables.

In [1]:
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql._

Intitializing Scala interpreter ...

Spark Web UI available at http://spark-alert-2-detect-m:8088/proxy/application_1583161282596_0002
SparkContext available as 'sc' (version = 2.4.5, master = yarn, app id = application_1583161282596_0002)
SparkSession available as 'spark'



In [2]:
val kafkaBootstrapServer = "kafka-m:9092"
val kafkaReadingsTopic = "readings_prepared"
val kafkaStatsTopic = "alert_2_stats"
val kafkaDedupWatermarkTime = "1 minute"
val joinWatermarkTime = "1 minute"
val bigQueryTargetTable = "smartplugs.alert_2_anomaly"
val bigQueryTempBucket = "pandora-sde-case/alert_2"
val outputTriggerTime = "1 minute"

kafkaBootstrapServer: String = kafka-m:9092
kafkaReadingsTopic: String = readings_prepared
kafkaStatsTopic: String = alert_2_stats
kafkaDedupWatermarkTime: String = 1 minute
joinWatermarkTime: String = 1 minute
bigQueryTargetTable: String = smartplugs.alert_2_anomaly
bigQueryTempBucket: String = pandora-sde-case/alert_2
outputTriggerTime: String = 1 minute


## Define The Required Schema

In [3]:
// This will be used to give the source `readings_prepared` stream data a schema
val readingsSchema = StructType(Seq(
    StructField("message_id", StringType, false),
    StructField("reading_ts", TimestampType, false),
    StructField("reading_value", FloatType, false),
    StructField("reading_type", IntegerType, false),
    StructField("plug_id", IntegerType, false),
    StructField("household_id", IntegerType, false),
    StructField("house_id", IntegerType, false)
))

val statsSchema = StructType(Seq(
    StructField("day", StringType, false),
    StructField("hour", IntegerType, false),
    StructField("mean", FloatType, false),
    StructField("m2", FloatType, false),
    StructField("variance", FloatType, false),
    StructField("std_dev", FloatType, false),
    StructField("count", LongType, false),
    StructField("last_ts", TimestampType, false)
))

readingsSchema: org.apache.spark.sql.types.StructType = StructType(StructField(message_id,StringType,false), StructField(reading_ts,TimestampType,false), StructField(reading_value,FloatType,false), StructField(reading_type,IntegerType,false), StructField(plug_id,IntegerType,false), StructField(household_id,IntegerType,false), StructField(house_id,IntegerType,false))
statsSchema: org.apache.spark.sql.types.StructType = StructType(StructField(day,StringType,false), StructField(hour,IntegerType,false), StructField(mean,FloatType,false), StructField(m2,FloatType,false), StructField(variance,FloatType,false), StructField(std_dev,FloatType,false), StructField(count,LongType,false), StructField(last_ts,TimestampType,false))


### Read and Parse The Input Data Streams

There are 2 input streams this time, the `readings_prepared` and `alert_2_stats` topics. They will be joined to detect anomalies.

In [4]:
// Drop exact duplicate records that arrive within the watermark window.
// The watermark bounds the deduplication state, so Spark does not keep ALL records in memory
val readings = spark
    .readStream 
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)
    .option("subscribe", kafkaReadingsTopic)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", readingsSchema).as("data"))
    .select($"data.*")
    .withWatermark("reading_ts", kafkaDedupWatermarkTime) 
    .dropDuplicates()
    .filter($"reading_type" === 1) // Only take the "current load" measurement

readings: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [message_id: string, reading_ts: timestamp ... 5 more fields]
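
To illustrate why the watermark keeps the deduplication state bounded, here is a simplified plain-Scala model of the idea. It is a sketch under assumptions, not Spark's implementation: `Event`, `Deduplicator`, and `delaySec` are hypothetical names, duplicates are matched on an id rather than on the whole row, and the watermark advances per record instead of per micro-batch.

```scala
import scala.collection.mutable

object BoundedDedup {
  // Hypothetical minimal record: an id plus an event time in epoch seconds.
  final case class Event(id: String, eventTimeSec: Long)

  /** Deduplicates events while remembering only ids newer than the watermark. */
  final class Deduplicator(delaySec: Long) {
    private val seen = mutable.Map.empty[String, Long] // id -> event time
    private var maxEventTimeSec = Long.MinValue

    def stateSize: Int = seen.size

    /** Returns true if the event is emitted, false if it is a duplicate or too late. */
    def offer(e: Event): Boolean = {
      // Watermark = latest event time seen so far minus the allowed delay.
      val watermark =
        if (maxEventTimeSec == Long.MinValue) Long.MinValue
        else maxEventTimeSec - delaySec
      val accepted = e.eventTimeSec > watermark && !seen.contains(e.id)
      if (accepted) seen(e.id) = e.eventTimeSec
      maxEventTimeSec = math.max(maxEventTimeSec, e.eventTimeSec)
      // Evict entries that fell behind the new watermark so memory stays bounded.
      val newWatermark = maxEventTimeSec - delaySec
      val expired = seen.collect { case (id, t) if t <= newWatermark => id }.toList
      seen --= expired
      accepted
    }
  }
}
```

With a 60-second delay, an event at time 100 is remembered only until a later event pushes the watermark past 100; after that its id is dropped from the state, and any record older than the watermark is rejected as late rather than checked against it.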


In [5]:
// We don't need to deduplicate the stats, but we drop records whose mean is not positive
val stats = spark
    .readStream 
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServer)
    .option("subscribe", kafkaStatsTopic)
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json($"value", statsSchema).as("data"))
    .select($"data.day", $"data.hour", $"data.mean", $"data.std_dev", $"data.last_ts")
    .filter($"mean" > 0.0)


stats: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [day: string, hour: int ... 3 more fields]


#### Peek at The Input Data Streams

##### Readings

In [6]:
val readingsQuery = readings.writeStream.format("memory").queryName("readings").start()
Thread.sleep(10000)
readingsQuery.status

readingsQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@b1b0b0d
res0: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Processing new data",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}


In [26]:

spark.sql("select * from readings").show()

+----------+-------------------+-------------+------------+-------+------------+--------+
|message_id|         reading_ts|reading_value|reading_type|plug_id|household_id|house_id|
+----------+-------------------+-------------+------------+-------+------------+--------+
| 118504114|2013-09-01 17:53:00|          0.0|           1|      0|           0|       9|
| 115998526|2013-09-01 17:27:20|          0.0|           1|      0|           0|       0|
| 116872749|2013-09-01 17:36:20|          0.0|           1|      1|           0|       9|
| 118371375|2013-09-01 17:51:40|          0.0|           1|      2|           0|       4|
| 117153098|2013-09-01 17:39:20|          0.0|           1|      0|           0|       5|
| 117306922|2013-09-01 17:41:00|       50.627|           1|      0|           0|       8|
| 118503302|2013-09-01 17:53:00|       41.056|           1|      2|           0|       0|
| 118868629|2013-09-01 17:56:40|      122.738|           1|      1|           0|       1|
| 11972589

In [8]:
// readingsQuery.stop()
readingsQuery.lastProgress

res2: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "de552970-b651-46c2-bca9-2b54ecb991ce",
  "runId" : "fddc0ba9-8f08-4048-9d1f-b3dae5f60edc",
  "name" : "readings",
  "timestamp" : "2020-03-02T15:27:34.977Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getEndOffset" : 0,
    "setOffsetRange" : 70,
    "triggerExecution" : 72
  },
  "eventTime" : {
    "watermark" : "1970-01-01T00:00:00.000Z"
  },
  "stateOperators" : [ {
    "numRowsTotal" : 0,
    "numRowsUpdated" : 0,
    "memoryUsedBytes" : 44599,
    "customMetrics" : {
      "loadedMapCacheHitCount" : 0,
      "loadedMapCacheMissCount" : 0,
      "stateOnCurrentVersionSizeBytes" : 15799
    }
  } ],
  "sources" : [ {
    "descr...

In [9]:
readingsQuery.status

res3: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}


##### Stats

In [10]:
val statsQuery = stats.writeStream.format("memory").queryName("stats").start()
statsQuery.status

statsQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5f31b855
res4: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Initializing sources",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}


In [27]:
// Thread.sleep(10000)
spark.sql("select * from stats").show()

+--------+----+---------+---------+-------------------+
|     day|hour|     mean|  std_dev|            last_ts|
+--------+----+---------+---------+-------------------+
|20130901|  21|  13.8408|22.331156|2013-09-01 21:25:20|
|20130901|  22|1.8393385|4.0363626|2013-09-01 22:45:40|
+--------+----+---------+---------+-------------------+



In [12]:
// statsQuery.stop()
statsQuery.lastProgress

res6: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "a0792e89-8486-4a84-b210-ff7d28f69dd2",
  "runId" : "013bb1f7-208a-4030-98df-65b79e51ca1a",
  "name" : "stats",
  "timestamp" : "2020-03-02T15:27:37.961Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getEndOffset" : 0,
    "setOffsetRange" : 2,
    "triggerExecution" : 2
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[alert_2_stats]]",
    "startOffset" : {
      "alert_2_stats" : {
        "0" : 0
      }
    },
    "endOffset" : {
      "alert_2_stats" : {
        "0" : 0
      }
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" ...

In [13]:
statsQuery.status

res7: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Getting offsets from KafkaV2[Subscribe[alert_2_stats]]",
  "isDataAvailable" : false,
  "isTriggerActive" : true
}


## Join Both Data Streams


We can only compare a reading with statistics computed before it: the `reading_ts` value from the `readings_prepared` stream must be later than the `last_ts` value from the `alert_2_stats` stream. The statistics must also come from the same `day` and `hour` as the reading.

After joining, we need to ensure that we only compare the readings with their latest past statistics.

The `alert_2_stats` stream is assumed to arrive more slowly than the `readings_prepared` stream, since it needs at least 2 records before it can start calculating, and its default write trigger fires once a minute.
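
The matching logic described above can be sketched on plain Scala collections, which makes the three join conditions and the "latest past statistics" selection easy to check in isolation. This is an illustration under assumptions: `Reading`, `Stats`, and the helper names are hypothetical, trimmed-down stand-ins for the stream schemas, and the code is not the streaming implementation.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object LatestStatsJoin {
  // Hypothetical, trimmed-down mirrors of the two stream schemas.
  final case class Reading(messageId: String, readingTs: LocalDateTime, readingValue: Double)
  final case class Stats(day: String, hour: Int, mean: Double, stdDev: Double, lastTs: LocalDateTime)

  private val dayFormat = DateTimeFormatter.ofPattern("yyyyMMdd")

  /** The three conditions from the text: past statistics, same hour, same day. */
  def matches(r: Reading, s: Stats): Boolean =
    r.readingTs.isAfter(s.lastTs) &&
      r.readingTs.getHour == s.hour &&
      r.readingTs.format(dayFormat) == s.day

  /** The matching statistics with the greatest last_ts, if any have arrived. */
  def latestStats(r: Reading, stats: Seq[Stats]): Option[Stats] =
    stats
      .filter(matches(r, _))
      .reduceOption((a, b) => if (b.lastTs.isAfter(a.lastTs)) b else a)

  /** A reading is anomalous when it exceeds mean + std_dev of its latest past statistics. */
  def isAnomaly(r: Reading, stats: Seq[Stats]): Boolean =
    latestStats(r, stats).exists(s => r.readingValue > s.mean + s.stdDev)
}
```

A reading with no earlier statistics for its day and hour yields `None` and is therefore never flagged, which mirrors the behaviour of an inner join.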

In [38]:
// Join each reading with the earlier statistics for the same day and hour,
// then keep the readings whose value exceeds mean + std_dev
val anomaly = readings
    .withWatermark("reading_ts", joinWatermarkTime)
    .join(
        stats.withWatermark("last_ts", joinWatermarkTime),
        // Join conditions
        $"reading_ts" > $"last_ts" &&
        hour($"reading_ts") === $"hour" &&
        date_format($"reading_ts", "yyyyMMdd") === $"day",
        // Join type
        "inner"
    )
    .filter($"reading_value" > $"mean" + $"std_dev")

anomaly: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [message_id: string, reading_ts: timestamp ... 10 more fields]


In [39]:
val peekQuery = anomaly
    .writeStream
    .format("memory")
    .queryName("anomaly")
    .start()

peekQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@212ee9e3


In [46]:
spark.sql("select * from anomaly").show()

+----------+-------------------+-------------+------------+-------+------------+--------+--------+----+---------+---------+-------------------+
|message_id|         reading_ts|reading_value|reading_type|plug_id|household_id|house_id|     day|hour|     mean|  std_dev|            last_ts|
+----------+-------------------+-------------+------------+-------+------------+--------+--------+----+---------+---------+-------------------+
| 348076248|2013-09-03 08:59:40|      131.527|           1|      1|           0|       1|20130903|   8|31.719673| 96.85666|2013-09-03 08:58:40|
| 348077554|2013-09-03 08:59:40|      405.686|           1|      1|           0|       8|20130903|   8|31.719673| 96.85666|2013-09-03 08:58:40|
| 352332060|2013-09-03 09:42:20|      389.533|           1|      1|           0|       8|20130903|   9|40.451565|104.27064|2013-09-03 09:23:00|
| 351101926|2013-09-03 09:30:00|      485.752|           1|      1|           0|       8|20130903|   9|40.451565|104.27064|2013-09-03 09

In [41]:
peekQuery.lastProgress

res30: org.apache.spark.sql.streaming.StreamingQueryProgress = null


In [42]:
peekQuery.status

res31: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Processing new data",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}


In [36]:
// peekQuery.stop()

## Write to BigQuery

Write to BigQuery once a minute. BigQuery inserts work better with batched data, and this interval gives `alert_2_stats` enough time to accumulate. The [BigQuery connector](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) is installed during cluster setup and loaded automatically when the spark-shell session starts.

In [43]:
val ingestQuery = anomaly
    .writeStream
    .trigger(Trigger.ProcessingTime(outputTriggerTime))
    .foreachBatch{ 
        (batchDF: DataFrame, batchId: Long) =>
            batchDF.write.format("bigquery")
                .option("table", bigQueryTargetTable)
                .option("temporaryGcsBucket", bigQueryTempBucket)
                .mode(SaveMode.Append)
                .save()
    }.start()

ingestQuery: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@6bd084a0


In [49]:
ingestQuery.lastProgress

res37: org.apache.spark.sql.streaming.StreamingQueryProgress =
{
  "id" : "71895998-54a9-41b1-8fa8-749db8b65f00",
  "runId" : "9c0e9bdb-4104-4030-ab21-6d711de8e531",
  "name" : null,
  "timestamp" : "2020-03-02T15:44:22.694Z",
  "batchId" : 1,
  "numInputRows" : 15711,
  "inputRowsPerSecond" : 193.30191813182083,
  "processedRowsPerSecond" : 179.41895256149647,
  "durationMs" : {
    "addBatch" : 87265,
    "getBatch" : 0,
    "getEndOffset" : 0,
    "queryPlanning" : 224,
    "setOffsetRange" : 1,
    "triggerExecution" : 87566,
    "walCommit" : 50
  },
  "eventTime" : {
    "avg" : "2013-09-03T11:46:43.977Z",
    "max" : "2013-09-03T14:04:40.000Z",
    "min" : "2013-09-03T09:23:40.000Z",
    "watermark" : "1970-01-01T00:00:00.000Z"
  },
  "stateOperators" : [ {
    "numRowsTotal" : 7...

In [50]:
ingestQuery.status

res38: org.apache.spark.sql.streaming.StreamingQueryStatus =
{
  "message" : "Processing new data",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}


In [37]:
// ingestQuery.stop()