# Anomaly Detection

1. For each model, compute the bi-monthly failure rate R, where F is the number of failures for that model, O is the cumulative number of days in operation for that model, and D is the number of days from Jan 1, 2019 to March 28, 2019, inclusive:


\\[R = 100 \times \frac{F}{O / D}\\]


2. Given R for each model, compute the mean M and standard deviation S of R across all models.

3. Use M and S to predict which models in operation on March 29, 2019 will fail: a failure is predicted if the model’s R exceeds M + 1S, that is, one standard deviation above the mean.
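The three steps above can be sketched in plain Python before moving to Spark. This is only an illustration: the model names and the F and O totals are hypothetical, D is taken as 87 (both endpoints counted), and the sample standard deviation is an assumption, since the steps do not say whether the sample or population form is intended.

```python
from statistics import mean, stdev

D = 87  # days from 2019-01-01 through 2019-03-28, inclusive

# Hypothetical per-model totals: (F = failures, O = cumulative days in operation)
totals = {
    "MODEL_A": (2, 8700),
    "MODEL_B": (1, 8700),
    "MODEL_C": (9, 8700),
    "MODEL_D": (0, 4350),
}

def failure_rate(failures, days_in_operation, period_days=D):
    """Step 1: R = 100 * F / (O / D)."""
    return 100.0 * failures / (days_in_operation / period_days)

rates = {model: failure_rate(f, o) for model, (f, o) in totals.items()}

# Step 2: mean and (sample, by assumption) standard deviation of R across models
values = list(rates.values())
M = mean(values)
S = stdev(values)

# Step 3: flag models whose R exceeds M + 1S
threshold = M + S
flagged = sorted(model for model, r in rates.items() if r > threshold)

print(rates)
print(f"M={M:.3f} S={S:.3f} threshold={threshold:.3f}")
print(flagged)
```

With these made-up totals only `MODEL_C` lies more than one standard deviation above the mean, so it is the only model flagged.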

Log data from Backblaze\
Source: https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2019/ \
Reference: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from datetime import datetime as dt

spark = SparkSession.builder.appName('dh3382-midterm-anomaly-detection').getOrCreate()

LOG_DIR_PATH = '/FileStore/tables/drive_stats_2019_Q1'

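# specified date range (inclusive on both ends), zero-padded ISO dates so that
# the bounds also compare correctly if the date column is read as a string
DATE_RANGE = ('2019-01-01', '2019-03-28')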

# calculate number of days in date range for D variable in failure rate (R);
# the range is inclusive, so add 1 to the difference between the two dates
DATE_RANGE_DAYS = (dt.strptime(DATE_RANGE[1], "%Y-%m-%d") - dt.strptime(DATE_RANGE[0], "%Y-%m-%d")).days + 1
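Because D is defined as an inclusive count, it is worth checking how date subtraction behaves. The short sketch below uses only the standard library and the two endpoint dates from the problem statement; the variable names are illustrative and not part of the notebook.

```python
from datetime import date

start = date(2019, 1, 1)
end = date(2019, 3, 28)

# Subtracting two dates counts the steps between them, not the days covered
exclusive_days = (end - start).days

# Counting both endpoints adds one day: 31 (Jan) + 28 (Feb) + 28 (Mar 1-28)
inclusive_days = exclusive_days + 1

print(exclusive_days, inclusive_days)
```

The two values are 86 and 87, so an inclusive D needs the extra day.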

In [0]:
# read in raw log data
log_df_raw = spark.read.options(inferSchema=True, header=True).csv(LOG_DIR_PATH)

# select only necessary columns
log_df = log_df_raw.select('date', 'serial_number', 'model', 'failure')

# filter entries outside of specified date range
log_df = log_df.filter(log_df.date.between(*DATE_RANGE))

log_df.show()

+----------+-------------+-----------+-------+
|      date|serial_number|      model|failure|
+----------+-------------+-----------+-------+
|2019-01-08|     Z305B2QN|ST4000DM000|      0|
|2019-01-09|     Z305B2QN|ST4000DM000|      0|
|2019-01-18|     Z305B2QN|ST4000DM000|      0|
|2019-01-19|     Z305B2QN|ST4000DM000|      0|
|2019-01-20|     Z305B2QN|ST4000DM000|      0|
|2019-01-21|     Z305B2QN|ST4000DM000|      0|
|2019-01-22|     Z305B2QN|ST4000DM000|      0|
|2019-01-23|     Z305B2QN|ST4000DM000|      0|
|2019-01-24|     Z305B2QN|ST4000DM000|      0|
|2019-01-25|     Z305B2QN|ST4000DM000|      0|
|2019-01-26|     Z305B2QN|ST4000DM000|      0|
|2019-01-27|     Z305B2QN|ST4000DM000|      0|
|2019-01-30|     Z305B2QN|ST4000DM000|      0|
|2019-01-31|     Z305B2QN|ST4000DM000|      0|
|2019-02-04|     Z305B2QN|ST4000DM000|      0|
|2019-02-10|     Z305B2QN|ST4000DM000|      0|
|2019-03-08|     Z305B2QN|ST4000DM000|      0|
|2019-03-09|     Z305B2QN|ST4000DM000|      0|
|2019-03-18| 