# Disk Failure Prediction for Backblaze HDDs

#### Problem statement - Disk failure prediction with the given Backblaze HDD dataset covering 1 April 2016 to 30 June 2016

##### Background information on the disk features

Each day in the Backblaze data center, we take a snapshot of each operational hard drive. This snapshot includes basic drive information along with the S.M.A.R.T. statistics reported by that drive. The daily snapshot of one drive is one record or row of data. 

Date – The date of the file in yyyy-mm-dd format

Serial Number – The manufacturer-assigned serial number of the drive

Model – The manufacturer-assigned model number of the drive

Capacity – The drive capacity in bytes

Failure – Contains a “0” if the drive is OK. Contains a “1” if this is the last day the drive was operational before failing

SMART Stats – 90 columns of data holding the raw and normalized values for 45 different SMART stats as reported by the given drive. Each value is the number reported by the drive.
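Because every SMART stat appears twice (once raw, once normalized), it can help to separate the two groups by column name before exploring them. The sketch below assumes the `smart_<id>_raw` / `smart_<id>_normalized` naming used in the CSV headers; the helper `split_smart_columns` and the tiny example frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_smart_columns(columns):
    """Group SMART column names into raw and normalized lists (hypothetical helper)."""
    raw = [c for c in columns if c.startswith("smart_") and c.endswith("_raw")]
    normalized = [c for c in columns if c.startswith("smart_") and c.endswith("_normalized")]
    return raw, normalized

# Empty example frame that only mimics the header layout of a daily snapshot
example = pd.DataFrame(columns=[
    "date", "serial_number", "model", "capacity_bytes", "failure",
    "smart_1_normalized", "smart_1_raw", "smart_2_normalized", "smart_2_raw",
])

raw_cols, norm_cols = split_smart_columns(example.columns)
print(raw_cols)
print(norm_cols)
```

On the real data, the same call could be applied to `df.columns` once the daily files are loaded.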

NOTE: Building a prediction model for the above problem is not required. The candidate is expected to describe the entire process of solving the problem, including

1) Data preprocessing
2) Model selection
3) Evaluation metrics

and to give supporting reasons for each choice.
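As one illustration of why the choice of evaluation metric needs a supporting reason, the sketch below compares accuracy with precision and recall on a made-up label vector in which failures are rare. The labels, the "always predict OK" baseline, and the helper `precision_recall` are assumptions for illustration only; they are not results from the Backblaze data.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (failure = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels: 98 healthy snapshots and 2 failures
y_true = [0] * 98 + [1] * 2
# Trivial baseline that never predicts a failure
always_ok = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_ok)) / len(y_true)
precision, recall = precision_recall(y_true, always_ok)
print(accuracy, precision, recall)
```

The baseline scores high on accuracy while detecting no failures at all, which is the kind of argument the evaluation-metrics step is expected to spell out.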

### Use the space below to explore the dataset to make valid decisions

In [1]:
# --------------------- Add more imports for necessary modules --------------------------

import pandas as pd
import numpy as np
import datetime

In [2]:
# Returns the daily disk snapshots between the start and end dates (inclusive) as one combined dataframe

def get_disk_stats_dataframe(start=datetime.date(2016, 4, 1), end=datetime.date(2016, 4, 5)):
    day = start
    daily_frames = []
    while day <= end:
        # Each daily snapshot is stored as <yyyy-mm-dd>.csv
        daily_frames.append(pd.read_csv('2016-Q2-backblaze/' + str(day) + '.csv'))
        day += datetime.timedelta(days=1)
    # DataFrame.append was removed in pandas 2.0, so concatenate the daily frames once
    return pd.concat(daily_frames) if daily_frames else pd.DataFrame()

In [3]:
df = get_disk_stats_dataframe()
df.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2016-04-01,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,3000592982016,0,100,0,135.0,108.0,143,...,,,,,,,,,,
1,2016-04-01,Z305B2QN,ST4000DM000,4000787030016,0,117,140875840,,,95,...,,,,,,,,,,
2,2016-04-01,MJ0351YNG9Z7LA,Hitachi HDS5C3030ALA630,3000592982016,0,100,0,136.0,104.0,123,...,,,,,,,,,,
3,2016-04-01,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,3000592982016,0,100,0,136.0,104.0,137,...,,,,,,,,,,
4,2016-04-01,WD-WMC4N2899475,WDC WD30EFRX,3000592982016,0,200,0,,,175,...,,,,,,,,,,
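The preview above shows many empty SMART cells and only `0` in the `failure` column, so a natural next exploration step is to measure how much of each column is missing and how the failure label is distributed. This is a minimal sketch on a small hand-built frame that copies a few values from the preview; the helper `summarize` is hypothetical, and on the real data it would be called on `df` instead.

```python
import numpy as np
import pandas as pd

def summarize(frame):
    """Return the fraction of missing values per column and the failure-label counts."""
    missing_fraction = frame.isna().mean().sort_values(ascending=False)
    failure_counts = frame["failure"].value_counts()
    return missing_fraction, failure_counts

# Three rows copied from the preview (a subset of columns only)
sample = pd.DataFrame({
    "serial_number": ["MJ0351YNG9Z0XA", "Z305B2QN", "WD-WMC4N2899475"],
    "failure": [0, 0, 0],
    "smart_1_raw": [0, 140875840, 0],
    "smart_2_raw": [108.0, np.nan, np.nan],
    "smart_255_raw": [np.nan, np.nan, np.nan],
})

missing_fraction, failure_counts = summarize(sample)
print(missing_fraction)
print(failure_counts)
```

Columns that are entirely empty for every model are candidates for removal during preprocessing, while partially empty ones may depend on the drive model; both are hypotheses to verify against the full dataset.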
