<h1 style="text-align: center;">
Anomaly Detection 
</h1>

1. For each model, compute the bi-monthly failure rate R, where F = number of failures for the model, O = cumulative number of days in operation for the model, and D = number of days between Jan 1, 2019 and March 28, 2019, inclusive:


\\[R = 100.0 * \left(\frac{1.0 * F}{O \div D}\right)\\]


2. Given R per model, find the mean M and standard deviation S of R across all models.

3. Use M and S to predict which models in operation on March 29, 2019 will fail: a failure is predicted if the model’s R exceeds M + 1S.
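
As a quick sanity check on the formula in step 1, the arithmetic can be reproduced in plain Python. This is only a sketch: the helper name `failure_rate` is hypothetical, and the F and O values are copied from the result tables later in this notebook.

```python
from datetime import date

# D: days from Jan 1, 2019 through March 28, 2019, counting both endpoints
D = (date(2019, 3, 28) - date(2019, 1, 1)).days + 1

def failure_rate(failures: int, operating_days: int, period_days: int) -> float:
    """R = 100 * F / (O / D); hypothetical helper, not part of the notebook."""
    return 100.0 * failures / (operating_days / period_days)

# F and O as reported for two models in the result tables of this notebook
r_st8000dm004 = failure_rate(1, 263, D)
r_toshiba = failure_rate(3, 31490, D)

print(D, round(r_st8000dm004, 6), round(r_toshiba, 6))
```

Since O / D is the average number of drives of a model running per day, R can be read as failures per 100 drives of that model over the 87-day window.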

Log data from Backblaze\
Source: https://www.backblaze.com/blog/backblaze-hard-drive-stats-q1-2019/ \
Reference: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data

In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, lit, mean, stddev, when

from datetime import datetime as dt

spark = SparkSession.builder.appName('dh3382-midterm-anomaly-detection').getOrCreate()

LOG_DIR_PATH = 'shared/midterm/drive_stats_2019_Q1'

# specified date range as zero-padded ISO dates, end date is inclusive
DATE_RANGE = ('2019-01-01', '2019-03-28')

# number of days in the date range (D in the failure rate formula); add 1 because the date difference excludes the end date
DATE_RANGE_DAYS = (dt.strptime(DATE_RANGE[1], "%Y-%m-%d") - dt.strptime(DATE_RANGE[0], "%Y-%m-%d")).days + 1
print("Number of days in range: " + str(DATE_RANGE_DAYS))

Number of days in range: 87


<h3 style="text-align: center;">
Basic Data Cleaning
</h3>
Reads in the raw data, selects only the columns needed for the anomaly detection analysis, and filters out all rows outside the specified date range.

In [5]:
# read in raw log data
log_df_raw = spark.read.options(header=True, inferSchema=True).csv(LOG_DIR_PATH)

# select only necessary columns
log_df = log_df_raw.select('date', 'model', 'failure')

# filter out entries outside the specified date range (between() is inclusive on both ends); an earlier check found no nulls
log_df = log_df.filter(log_df.date.between(*DATE_RANGE))

log_df.limit(5).toPandas()

                                                                                

Unnamed: 0,date,model,failure
0,2019-03-05,ST4000DM000,0
1,2019-03-05,ST12000NM0007,0
2,2019-03-05,ST12000NM0007,0
3,2019-03-05,ST12000NM0007,0
4,2019-03-05,HGST HMS5C4040ALE640,0
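
The filter above relies on the `date` column being compared as a date rather than as text. The sketch below, in plain Python rather than Spark, illustrates why that matters under the assumption that dates were held as strings: a non-padded bound such as '2019-1-1' does not order correctly against zero-padded values, whereas parsed dates do. The helper names `parse` and `in_range` are hypothetical.

```python
from datetime import date, datetime

def parse(day: str) -> date:
    # strptime accepts both '2019-1-1' and '2019-01-01' for %Y-%m-%d
    return datetime.strptime(day, "%Y-%m-%d").date()

def in_range(day: str, start: str, end: str) -> bool:
    """Inclusive range check on parsed dates (hypothetical helper)."""
    return parse(start) <= parse(day) <= parse(end)

logged = "2019-03-05"

# comparing as text: '0' sorts before '1' in the month position, so the row is wrongly excluded
as_text = "2019-1-1" <= logged <= "2019-3-28"

# comparing as dates: the row is inside the range, and the end date itself is included
as_dates = in_range(logged, "2019-1-1", "2019-3-28")
end_included = in_range("2019-03-28", "2019-1-1", "2019-3-28")
next_day_excluded = not in_range("2019-03-29", "2019-1-1", "2019-3-28")

print(as_text, as_dates, end_included, next_day_excluded)
```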


<h3 style="text-align: center;">
Calculate Failures (F) and Accumulated Days of Operation (O)
</h3>

Groups the log by model and adds two columns: every entry in which a drive does not fail counts as one day of operation (O), and every entry in which a drive fails counts as one failure (F). Earlier data exploration showed that hard disks do not have logs for every day between their first entry and their failure entry, so the assumption is that a disk was active only on days for which it has a log entry. The grouped dataframe is then persisted, since all subsequent actions depend on it; this avoids repeatedly recomputing log_df_grouped from the much larger log_df_raw dataset.

In [8]:
# F: number of log entries per model with failure == 1
# O: number of log entries per model with failure == 0 (one entry = one drive-day of operation)
log_df_grouped = log_df.groupBy('model').agg(
    count(when(col("failure") == 1, True)).alias("F"),
    count(when(col("failure") == 0, True)).alias("O")
)

# persist so that subsequent actions avoid recalculations with large table (9000000+ rows)
log_df_grouped.persist()

# show sample of results
log_df_grouped.limit(5).toPandas()


Unnamed: 0,model,F,O
0,ST4000DM000,104,1929966
1,ST12000NM0007,178,2850723
2,ST8000DM005,0,2175
3,ST320LT007,0,85
4,TOSHIBA MQ01ABF050M,3,31490
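
To make the counting rule concrete, here is a minimal plain-Python equivalent of the grouped aggregation, run on a handful of made-up log rows. The rows and the model names `MODEL_A` and `MODEL_B` are invented for illustration and do not come from the Backblaze data.

```python
from collections import defaultdict

# (date, model, failure) rows; invented sample data, one row per drive per day
rows = [
    ("2019-01-01", "MODEL_A", 0),
    ("2019-01-01", "MODEL_A", 0),
    ("2019-01-02", "MODEL_A", 0),
    ("2019-01-02", "MODEL_A", 1),  # failure entry: counts toward F, not O
    ("2019-01-01", "MODEL_B", 0),
    ("2019-01-02", "MODEL_B", 0),
]

# per-model tallies, mirroring the F and O columns
counts = defaultdict(lambda: {"F": 0, "O": 0})
for _day, model, failure in rows:
    if failure == 1:
        counts[model]["F"] += 1
    elif failure == 0:
        counts[model]["O"] += 1

print(dict(counts))
```

Under this rule, the day on which a drive fails adds to F but not to O.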


<h3 style="text-align: center;">
Calculate failure rate
</h3>
The failure rate is calculated according to the formula provided in the prompt, using the lit function to insert the DATE_RANGE_DAYS constant calculated earlier.

In [9]:
# calculate failure rate: R = 100 * F / (O / D)
log_df_fail_rate = log_df_grouped.withColumn('R', lit(100) * (col('F') / (col('O') / lit(DATE_RANGE_DAYS))))

log_df_fail_rate.toPandas().head()

Unnamed: 0,model,F,O,R
0,ST4000DM000,104,1929966,0.468817
1,ST12000NM0007,178,2850723,0.543231
2,ST8000DM005,0,2175,0.0
3,ST320LT007,0,85,0.0
4,TOSHIBA MQ01ABF050M,3,31490,0.828835


<h3 style="text-align: center;">
Calculate Mean, Standard Deviation, Failure Prediction Threshold
</h3>

The mean and standard deviation of R are calculated using built-in PySpark functions and returned in a single-row dataframe. Note that PySpark's stddev is the sample standard deviation (an alias for stddev_samp). Both values are then saved as constants and summed to give the FAIL_PREDICT constant, the threshold used to build the dataframe of model failure predictions. The final lines of the cell print each variable for manual error checking.

In [10]:
# calculate mean and stddev
log_df_mean_stddev = log_df_fail_rate.select(mean(col('R') ).alias('mean'), stddev(col('R') ).alias('stddev') )

# fetch the single summary row once so the aggregation is not run twice
stats_row = log_df_mean_stddev.first()

# save mean to variable
LOG_MEAN = stats_row['mean']

# save standard deviation to variable
LOG_STDDEV = stats_row['stddev']

# mean + (1 * standard deviation), calculate as failure prediction threshold
FAIL_PREDICT = LOG_MEAN + LOG_STDDEV

# print results
print("Mean: " + str(LOG_MEAN) )
print("Standard Deviation: " + str(LOG_STDDEV) )
print("Failure Prediction Threshold: " + str(FAIL_PREDICT) )

Mean: 1.0637228434381965
Standard Deviation: 4.780507362375504
Failure Prediction Threshold: 5.8442302058137
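
The threshold logic can be sketched with Python's `statistics` module on invented rates. `statistics.stdev` is the sample standard deviation, the same statistic PySpark's `stddev` computes. The model names and R values below are made up for illustration only.

```python
from statistics import mean, stdev

# invented failure rates R per model, with one clear outlier
rates = {"MODEL_A": 0.0, "MODEL_B": 0.5, "MODEL_C": 1.0, "MODEL_D": 0.5, "MODEL_E": 8.0}
values = list(rates.values())

m = mean(values)
s = stdev(values)      # sample standard deviation (n - 1 in the denominator)
threshold = m + 1 * s  # M + 1S

# a model is flagged only if its rate is strictly above the threshold
flagged = [model for model, r in rates.items() if r > threshold]

print(m, round(s, 4), round(threshold, 4), flagged)
```

A single extreme value raises both the mean and the standard deviation, so the threshold sits well above the typical rate; in the notebook's output only one model clears it.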


<h2 style="text-align: center;">
Hard Drive Failure Prediction By Model
</h2>

This result is calculated by filtering out all models at or below the failure rate threshold and showing the resulting dataframe, sorted by R in descending order.

In [12]:
# keep only models whose failure rate R exceeds the threshold
log_df_failure_prediction = log_df_fail_rate.filter(col('R') > lit(FAIL_PREDICT) )

# print out result, sorted by failure rate in descending order
log_df_failure_prediction.sort('R', ascending=False).toPandas()

Unnamed: 0,model,F,O,R
0,ST8000DM004,1,263,33.079848


<h2 style="text-align: center;">
Hard Drive Failure Predictions Saved to List
</h2>

The models predicted to fail are saved to a list. With this dataset a single string would suffice, but a list also handles datasets in which more than one model exceeds the threshold.

In [14]:
# save list of models from failure prediction table for isin() function
model_fails = [row['model'] for row in log_df_failure_prediction.select('model').collect()]
print(model_fails)

['ST8000DM004']


<h2 style="text-align: center;">
Create dataframe with all HDs in operation on March 29
</h2>

Goes back to the raw dataset to create a dataframe of all HDs in operation on March 29, with the serial number column included.

In [19]:
# reselect from the raw data to include serial numbers as well as date/model/failure
log_df_serial_num = log_df_raw.select('date', 'serial_number', 'model', 'failure')

# keep only drives in operation on March 29, which means excluding drives that failed on March 29
log_df_in_operation = log_df_serial_num.filter(col('date') == '2019-03-29')\
    .filter(col('failure') == 0)

<h1 style="text-align: center;">
Hard Drive Failure Predictions Final Result
</h1>

Filters the dataframe of all HDs in operation on March 29 down to those whose model is predicted to fail, using the isin() function with the model_fails list saved earlier.

In [20]:
# keep only drives whose model is in the list of models predicted to fail
log_df_res = log_df_in_operation.filter(col('model').isin(model_fails) )

log_df_res.select(col('serial_number'), col('model') ).limit(20).toPandas()

Unnamed: 0,serial_number,model
0,WCT0EJDJ,ST8000DM004
1,WCT0EKW3,ST8000DM004
2,WCT0EJY6,ST8000DM004
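
The final selection can be pictured as two filters applied in sequence. The plain-Python sketch below uses invented serial numbers and model names (`SN001`, `MODEL_E`, and so on are not from the Backblaze data), with a set standing in for the model_fails list.

```python
# invented drive-level rows for the prediction date: (serial_number, model, failure)
march_29_rows = [
    ("SN001", "MODEL_E", 0),
    ("SN002", "MODEL_A", 0),
    ("SN003", "MODEL_E", 1),  # failed on the day itself, so not "in operation"
    ("SN004", "MODEL_E", 0),
]

flagged_models = {"MODEL_E"}  # stands in for the model_fails list

# step 1: keep drives still in operation; step 2: keep drives whose model is flagged
in_operation = [row for row in march_29_rows if row[2] == 0]
predicted = [(serial, model) for serial, model, _failure in in_operation if model in flagged_models]

print(predicted)
```

The prediction is made at the model level: every operating drive of a flagged model is listed, regardless of that drive's individual history.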
