<h1 style="text-align: center;">
Anomaly Detection 
</h1>

1. For each model, compute the bi-monthly failure rate R, where F = number of failures for the model, O = cumulative number of days in operation for the model, and D = number of days between Jan 1, 2019 and March 28, 2019, inclusive:


\\[R = 100.0 * \left(\frac{1.0 * F}{O \div D}\right)\\]


2. Given R for every model, find the mean M and standard deviation S of R across all models.

3. Use M and S to predict which models in operation on March 29, 2019 will fail, with failure
predicted if the model’s R exceeds M + 1S.
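
As a reading aid, the rate in step 1 can be simplified. Since O counts drive-days and D counts days, O ÷ D can be read as the average number of drives of that model in operation over the period, so R is roughly the number of failures per 100 drives in operation. The worked figure below uses F = 97 and O = 1795336 (the ST4000DM000 row in the outputs further down) and assumes D = 86, which is the value the date subtraction in the setup code yields:

\\[R = 100 \cdot \frac{F}{O \div D} = \frac{100 \cdot F \cdot D}{O} = \frac{100 \cdot 97 \cdot 86}{1795336} \approx 0.4646\\]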

Log data from Backblaze\
Source: https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2019/ \
Reference: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, lit, mean, stddev, when

from datetime import datetime as dt

spark = SparkSession.builder.appName('dh3382-midterm-anomaly-detection').getOrCreate()

LOG_DIR_PATH = '/FileStore/tables/drive_stats_2019_Q1'

# specified date range
DATE_RANGE = ('2019-1-1', '2019-3-28')

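# number of days in the date range, used as D in the failure rate (R).
# Note: subtracting the two dates gives 86, one less than the inclusive count of 87
# described in step 1; the outputs in this notebook were produced with 86.
DATE_RANGE_DAYS = (dt.strptime(DATE_RANGE[1], "%Y-%m-%d") - dt.strptime(DATE_RANGE[0], "%Y-%m-%d")).days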

<h3 style="text-align: center;">
Basic Data Cleaning
</h3>
Reads in the raw data, selects only the columns needed for the anomaly detection analysis, and filters out all rows outside the specified date range.

In [0]:
# read in raw log data
log_df_raw = spark.read.options(header=True, inferSchema=True).csv(LOG_DIR_PATH)

# select only necessary columns
log_df = log_df_raw.select('date', 'model', 'failure')

# keep only entries inside the specified date range (both bounds inclusive).
# An initial check for nulls found none
log_df = log_df.filter(log_df.date.between(*DATE_RANGE) )

log_df.toPandas()

Unnamed: 0,date,serial_number,model,failure
0,2019-03-05,Z305B2QN,ST4000DM000,0
1,2019-03-05,ZJV0XJQ4,ST12000NM0007,0
2,2019-03-05,ZJV0XJQ3,ST12000NM0007,0
3,2019-03-05,ZJV0XJQ0,ST12000NM0007,0
4,2019-03-05,PL1331LAHG1S4H,HGST HMS5C4040ALE640,0
...,...,...,...,...
8617817,2019-02-27,PL1331LAHD1AWH,HGST HMS5C4040BLE640,0
8617818,2019-02-27,ZA10MCEQ,ST8000DM002,0
8617819,2019-02-27,ZCH0CRTK,ST12000NM0007,0
8617820,2019-02-27,PL1331LAHD1T5H,HGST HMS5C4040BLE640,0
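
The date filter can be illustrated in plain Python. This is a minimal sketch, not the notebook's Spark code: `parse`, `in_range`, and the sample dates are hypothetical, and it assumes the bounds are compared as dates with both endpoints included. It also shows why the unpadded bounds should not be compared as raw strings.

```python
from datetime import date, datetime

# bounds as written in the notebook (months and days are not zero-padded)
DATE_RANGE = ('2019-1-1', '2019-3-28')

def parse(text):
    """Hypothetical helper: parse 'YYYY-M-D' or 'YYYY-MM-DD' into a date."""
    return datetime.strptime(text, "%Y-%m-%d").date()

START, END = (parse(bound) for bound in DATE_RANGE)

def in_range(day):
    """Hypothetical helper: True if day lies in [START, END], endpoints included."""
    return START <= day <= END

# made-up sample of log dates: both endpoints, one interior day, one day past the range
sample_days = [date(2019, 1, 1), date(2019, 3, 5), date(2019, 3, 28), date(2019, 3, 29)]
kept = [day for day in sample_days if in_range(day)]
print(kept)

# comparing as strings gives a different answer because '0' sorts before '1'
string_comparison = '2019-03-05' >= DATE_RANGE[0]
print(string_comparison)
```

Compared as dates, March 5 falls inside the range; compared as strings, it appears to precede `'2019-1-1'`.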


<h3 style="text-align: center;">
Calculate Failures (F) and Accumulated Days of Operation (O)
</h3>

Groups by model, then adds two columns: every entry for a model with failure = 0 is counted as one day of operation (O), and every entry with failure = 1 is counted as a failure (F). Earlier data exploration showed that hard disks do not have logs for every day between their first entry and their failure entry, so the assumption is that a hard disk was only active on days for which it has a log entry. The grouped dataframe is then persisted because all subsequent actions depend on it; this avoids repeatedly recomputing log_df_grouped from the significantly larger raw log data.

In [0]:
# F: number of log entries per model with failure == 1
# O: number of log entries per model with failure == 0 (one entry = one day of operation)
log_df_grouped = log_df.groupBy('model').agg(
    count(when(col("failure") == 1, True)).alias("F"),
    count(when(col("failure") == 0, True)).alias("O"))

# persist so that subsequent actions do not recompute the aggregation over the large log table (8,500,000+ rows)
log_df_grouped.persist()

log_df_grouped.toPandas().head()

Unnamed: 0,model,F,O
0,ST4000DM000,97,1795336
1,ST12000NM0007,164,2654525
2,ST8000DM005,0,2025
3,ST320LT007,0,79
4,TOSHIBA MQ01ABF050M,3,29319
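
The grouping logic can be sketched in plain Python to make the definitions of F and O concrete. This is an illustration under stated assumptions, not the Spark implementation: `count_failures_and_days` is a hypothetical helper, and the rows are made-up (model, failure) pairs rather than real log entries.

```python
from collections import defaultdict

def count_failures_and_days(rows):
    """Hypothetical helper: return {model: {'F': ..., 'O': ...}} from (model, failure) rows.

    failure == 1 adds one to F; failure == 0 adds one to O (one logged day of operation).
    """
    counts = defaultdict(lambda: {'F': 0, 'O': 0})
    for model, failure in rows:
        if failure == 1:
            counts[model]['F'] += 1
        elif failure == 0:
            counts[model]['O'] += 1
    return dict(counts)

# made-up log rows: one row per drive per logged day
rows = [
    ('MODEL_A', 0), ('MODEL_A', 0), ('MODEL_A', 1),
    ('MODEL_B', 0), ('MODEL_B', 0),
]
grouped = count_failures_and_days(rows)
print(grouped)
```

As in the notebook, the entry on which a failure is logged counts toward F but not toward O.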


<h3 style="text-align: center;">
Calculate failure rate
</h3>
The failure rate is calculated according to the formula provided in the prompt, using the lit function to insert the DATE_RANGE_DAYS constant calculated in the first code cell.

In [0]:
# calculate failure rate
log_df_fail_rate = log_df_grouped.withColumn('R', lit(100) * (col('F') / (col('O')/lit(DATE_RANGE_DAYS) ) ) )

log_df_fail_rate.toPandas().head()

Unnamed: 0,model,F,O,R
0,ST4000DM000,97,1795336,0.464648
1,ST12000NM0007,164,2654525,0.531319
2,ST8000DM005,0,2025,0.0
3,ST320LT007,0,79,0.0
4,TOSHIBA MQ01ABF050M,3,29319,0.879975
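
The displayed rates can be checked by hand. The sketch below recomputes R for three rows of the output above, assuming D = 86 (the result of subtracting the two range dates); `failure_rate` is a hypothetical helper name.

```python
def failure_rate(failures, days_in_operation, days_in_range):
    """Hypothetical helper: R = 100 * F / (O / D)."""
    return 100.0 * failures / (days_in_operation / days_in_range)

D = 86  # assumed: number of days from 2019-01-01 to 2019-03-28 by date subtraction

# (model, F, O) copied from the output above
rows = [
    ('ST4000DM000', 97, 1795336),
    ('ST12000NM0007', 164, 2654525),
    ('TOSHIBA MQ01ABF050M', 3, 29319),
]
rates = {model: round(failure_rate(f, o, D), 6) for model, f, o in rows}
print(rates)
```

The rounded values agree with the R column shown in the output, which confirms that the displayed rates correspond to D = 86.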


<h3 style="text-align: center;">
Calculate Mean, Standard Deviation, Failure Prediction Threshold
</h3>

The mean and standard deviation are calculated using built-in PySpark functions and saved to a dataframe. They are then saved as constants, summed, and assigned to the FAIL_PREDICT constant, which is used to create the dataframe of failure predictions in the final cell. The final lines of the cell print each value for manual error checking.

In [0]:
# calculate mean and stddev of R across models (stddev is the sample standard deviation)
log_df_mean_stddev = log_df_fail_rate.select(mean(col('R')).alias('mean'), stddev(col('R')).alias('stddev'))

# fetch the single summary row once instead of running the job twice
stats_row = log_df_mean_stddev.first()

# mean
LOG_MEAN = stats_row['mean']

# standard deviation
LOG_STDDEV = stats_row['stddev']

# mean + 1 * standard deviation is used as the failure prediction threshold
FAIL_PREDICT = LOG_MEAN + LOG_STDDEV

# print results
print("Mean: " + str(LOG_MEAN) )
print("Standard Deviation: " + str(LOG_STDDEV) )
print("Failure Prediction Threshold: " + str(FAIL_PREDICT) )

Mean: 0.4006754240974075
Standard Deviation: 1.0769376594635043
Failure Prediction Threshold: 1.4776130835609118
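
PySpark's `stddev` returns the sample standard deviation (it is an alias of `stddev_samp`), which divides by n - 1 rather than n. The sketch below shows the difference on a few made-up rates using Python's `statistics` module, then repeats the threshold sum with the values printed above. The list `rates` is hypothetical.

```python
import statistics

# made-up failure rates for five hypothetical models
rates = [0.0, 0.0, 0.5, 0.9, 2.4]

mean_r = statistics.mean(rates)
sample_sd = statistics.stdev(rates)       # divides by n - 1, like PySpark's stddev
population_sd = statistics.pstdev(rates)  # divides by n, like PySpark's stddev_pop

print(mean_r, sample_sd, population_sd)

# threshold from the values printed by the notebook: M + 1 * S
LOG_MEAN = 0.4006754240974075
LOG_STDDEV = 1.0769376594635043
threshold = LOG_MEAN + LOG_STDDEV
print(threshold)
```

The sample estimate is always the larger of the two when the values are not all equal, so it gives a slightly higher threshold than the population estimate would.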


<h2 style="text-align: center;">
Hard Drive Failure Prediction Final Result
</h2>

The final result is calculated by filtering out all models at or below the failure rate threshold and showing the resulting dataframe, sorted in descending order by failure rate R.

In [0]:
# filter out models unlikely to fail
log_df_failure_prediction = log_df_fail_rate.filter(col('R') > lit(FAIL_PREDICT) )

# show result, sorted from highest to lowest failure rate
log_df_failure_prediction.sort('R', ascending=False).toPandas()

Unnamed: 0,model,F,O,R
0,ST500LM030,9,12934,5.984228
1,WDC WD5000LPCX,2,4427,3.88525
2,TOSHIBA MQ01ABF050,12,42274,2.441217
3,ST500LM012 HN,10,45658,1.883569
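
One property worth noting: D multiplies every model's R by the same factor, so M, S, and the threshold M + 1S all scale together, and the set of flagged models does not depend on whether D is taken as 86 (date subtraction) or 87 (inclusive count). The sketch below checks this on the nine (F, O) pairs that appear in this notebook's outputs. Because that is only a subset of all models, its mean, standard deviation, and flagged set differ from the notebook's result; `flag_models` is a hypothetical helper.

```python
import statistics

# (F, O) pairs copied from the outputs shown in this notebook (a subset of all models)
counts = {
    'ST4000DM000': (97, 1795336),
    'ST12000NM0007': (164, 2654525),
    'ST8000DM005': (0, 2025),
    'ST320LT007': (0, 79),
    'TOSHIBA MQ01ABF050M': (3, 29319),
    'ST500LM030': (9, 12934),
    'WDC WD5000LPCX': (2, 4427),
    'TOSHIBA MQ01ABF050': (12, 42274),
    'ST500LM012 HN': (10, 45658),
}

def flag_models(counts, days_in_range):
    """Hypothetical helper: models whose R exceeds mean + 1 sample standard deviation."""
    rates = {m: 100.0 * f / (o / days_in_range) for m, (f, o) in counts.items()}
    values = list(rates.values())
    threshold = statistics.mean(values) + statistics.stdev(values)
    return {m for m, r in rates.items() if r > threshold}

flagged_86 = flag_models(counts, 86)  # D from date subtraction
flagged_87 = flag_models(counts, 87)  # D counted inclusively
print(flagged_86 == flagged_87)
```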
