This notebook examines the hard drive data published by BackBlaze: https://www.backblaze.com/b2/hard-drive-test-data.html  BackBlaze is a cloud storage company that has made its hard drive data public.  The data was presumably obtained with open source software such as smartmontools (https://sourceforge.net/projects/smartmontools/).
Each record is a daily snapshot of one hard drive and its S.M.A.R.T counters.
The first part of this notebook examines the raw data; the data is then cleaned and resampled before being written out.


In [30]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import ensemble, metrics
from IPython.display import Image
from IPython.core.display import HTML 

In [31]:
# load the full dataset; add nrows=10000 to read only a subset while drafting
hdd = pd.read_csv('../input/harddrive.csv')
hdd.head()

Unnamed: 0,date,serial_number,model,capacity_bytes,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,...,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw,smart_255_normalized,smart_255_raw
0,2016-01-01,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,1.48249e-311,0,100,0,135.0,108.0,143,...,,,,,,,,,,
1,2016-01-01,Z305B2QN,ST4000DM000,1.976651e-311,0,113,54551400,,,96,...,,,,,,,,,,
2,2016-01-01,MJ0351YNG9Z7LA,Hitachi HDS5C3030ALA630,1.48249e-311,0,100,0,136.0,104.0,124,...,,,,,,,,,,
3,2016-01-01,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,1.48249e-311,0,100,0,136.0,104.0,137,...,,,,,,,,,,
4,2016-01-01,WD-WMC4N2899475,WDC WD30EFRX,1.48249e-311,0,200,0,,,175,...,,,,,,,,,,


In [32]:
# number of rows and columns in dataset
hdd.shape

(3179295, 95)

In [33]:
# number of hdd
hdd['serial_number'].value_counts().shape


(65993,)

In [34]:
# number of different model types of harddrives
hdd['model'].value_counts().shape

(69,)

In [35]:
#failed drives (model and count)
print(hdd.groupby('model')['failure'].sum().sort_values(ascending=False).iloc[:30])

model
ST4000DM000                139
ST320LT007                  15
Hitachi HDS722020ALA330     13
WDC WD800AAJS                6
WDC WD30EFRX                 6
Hitachi HDS5C3030ALA630      5
Hitachi HDS5C4040ALE630      4
WDC WD20EFRX                 3
ST3160318AS                  2
ST4000DX000                  2
WDC WD10EADS                 2
HGST HMS5C4040ALE640         2
WDC WD1600AAJS               2
WDC WD1600AAJB               2
ST6000DX000                  2
ST3160316AS                  1
TOSHIBA MD04ABA500V          1
WDC WD3200BEKT               1
TOSHIBA DT01ACA300           1
WDC WD3200BEKX               1
ST9250315AS                  1
HGST HMS5C4040BLE640         1
WDC WD60EFRX                 1
WDC WD800AAJB                1
WDC WD800BB                  1
ST3500320AS                  0
ST31500341AS                 0
ST31500541AS                 0
WDC WD800LB                  0
ST250LT007                   0
Name: failure, dtype: int64


Wikipedia has a table that explains all the S.M.A.R.T attributes and marks the ones that are critical for predicting hard drive failure.  The other S.M.A.R.T values are likely to act as noise, so we want to eliminate them.  The attributes we want to keep are:
5 - Reallocated sectors count
10 - Spin retry count
184 - End-to-end error (IOEDC)
187 - Reported Uncorrectable Errors
188 - Command timeout
196 - Reallocation Event Count
197 - Current pending sector count
198 - Uncorrectable sector count
201 - Soft read error rate

Link to the Wikipedia table:  https://en.wikipedia.org/wiki/S.M.A.R.T.
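The attribute list above can be turned into a column filter. The following is a minimal sketch, assuming the `smart_<id>_raw` naming seen in the dataset header; the helper `critical_columns` and the `example_columns` list are hypothetical names introduced for illustration. It is not what the cells below do, since they keep every raw column.

```python
# S.M.A.R.T attribute IDs taken from the list above
CRITICAL_IDS = [5, 10, 184, 187, 188, 196, 197, 198, 201]


def critical_columns(columns, ids=CRITICAL_IDS, suffix='raw'):
    """Return the smart_<id>_<suffix> names that are present in `columns`.

    Hypothetical helper: attributes that a drive model does not report
    (and that are therefore missing from `columns`) are skipped.
    """
    wanted = ['smart_{}_{}'.format(i, suffix) for i in ids]
    present = set(columns)
    return [name for name in wanted if name in present]


# A few column names in the style of the dataset (illustrative subset only)
example_columns = ['serial_number', 'model', 'failure',
                   'smart_1_raw', 'smart_5_raw', 'smart_5_normalized',
                   'smart_187_raw', 'smart_194_raw', 'smart_197_raw']

# Keep the identifier, the label, and the critical raw counters
keep = ['serial_number', 'failure'] + critical_columns(example_columns)
print(keep)
```

With a real frame, the same list could be built from `hdd.columns` and applied as `hdd[keep]`.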

In [36]:
'''
#columns_to_drop =['date','smart_1_normalized', 'smart_1_raw', 
#'smart_2_normalized', 'smart_2_raw',
#'smart_3_normalized', 'smart_3_raw', 'smart_4_normalized', 'smart_4_raw',
#'smart_7_normalized', 'smart_7_raw', 'smart_8_normalized', 'smart_8_raw', 
#'smart_9_normalized', 'smart_9_raw', 'smart_13_normalized', 'smart_13_raw',
#'smart_190_normalized', 'smart_190_raw', 'smart_191_normalized', 'smart_191_raw',
#'smart_192_normalized', 'smart_192_raw', 'smart_193_normalized', 'smart_193_raw', 
#'smart_194_normalized', 'smart_194_raw', 'smart_195_normalized', 'smart_195_raw', 
#'smart_199_normalized', 'smart_199_raw', 'smart_200_normalized', 'smart_200_raw',
#'smart_220_normalized', 'smart_220_raw', 'smart_222_normalized', 'smart_222_raw',
#'smart_223_normalized', 'smart_223_raw', 'smart_224_normalized', 'smart_224_raw', 
#'smart_225_normalized', 'smart_225_raw', 'smart_226_normalized', 'smart_226_raw', 
#'smart_240_normalized', 'smart_240_raw', 'smart_241_normalized', 'smart_241_raw', 
#'smart_242_normalized', 'smart_242_raw', 'smart_250_normalized', 'smart_250_raw', 
#'smart_251_normalized', 'smart_251_raw', 'smart_252_normalized', 'smart_252_raw', 
#'smart_254_normalized', 'smart_254_raw', 'smart_255_normalized', 'smart_255_raw']
'''
columns_to_drop =['date', 'capacity_bytes']
hdd.drop(columns_to_drop, inplace=True, axis=1)

# drop columns that are entirely null
hdd = hdd.loc[:, ~hdd.isnull().all()]
# remove normalized values that are left
#hdd = hdd.select(lambda x: x[-10:] != 'normalized', axis=1)

# the remaining columns still contain nulls (e.g. smart_2 above); fill them with -1 as a sentinel
hdd.fillna(-1, inplace=True)
#hdd = hdd.drop(['model', 'capacity_bytes'], axis=1)
hdd.head()

Unnamed: 0,serial_number,model,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,smart_3_raw,smart_4_normalized,...,smart_242_normalized,smart_242_raw,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw
0,MJ0351YNG9Z0XA,Hitachi HDS5C3030ALA630,0,100,0,135,108,143,540,100,...,-1,0.0,-1,-1,-1,-1,-1,-1,-1,-1
1,Z305B2QN,ST4000DM000,0,113,54551400,-1,-1,96,0,100,...,100,1.316882e-315,-1,-1,-1,-1,-1,-1,-1,-1
2,MJ0351YNG9Z7LA,Hitachi HDS5C3030ALA630,0,100,0,136,104,124,566,100,...,-1,0.0,-1,-1,-1,-1,-1,-1,-1,-1
3,MJ0351YNGABYAA,Hitachi HDS5C3030ALA630,0,100,0,136,104,137,507,100,...,-1,0.0,-1,-1,-1,-1,-1,-1,-1,-1
4,WD-WMC4N2899475,WDC WD30EFRX,0,200,0,-1,-1,175,6250,100,...,-1,0.0,-1,-1,-1,-1,-1,-1,-1,-1


In [37]:
hdd.shape

(3179295, 89)

In [38]:
hdd.columns

Index([u'serial_number', u'model', u'failure', u'smart_1_normalized',
       u'smart_1_raw', u'smart_2_normalized', u'smart_2_raw',
       u'smart_3_normalized', u'smart_3_raw', u'smart_4_normalized',
       u'smart_4_raw', u'smart_5_normalized', u'smart_5_raw',
       u'smart_7_normalized', u'smart_7_raw', u'smart_8_normalized',
       u'smart_8_raw', u'smart_9_normalized', u'smart_9_raw',
       u'smart_10_normalized', u'smart_10_raw', u'smart_11_normalized',
       u'smart_11_raw', u'smart_12_normalized', u'smart_12_raw',
       u'smart_13_normalized', u'smart_13_raw', u'smart_22_normalized',
       u'smart_22_raw', u'smart_183_normalized', u'smart_183_raw',
       u'smart_184_normalized', u'smart_184_raw', u'smart_187_normalized',
       u'smart_187_raw', u'smart_188_normalized', u'smart_188_raw',
       u'smart_189_normalized', u'smart_189_raw', u'smart_190_normalized',
       u'smart_190_raw', u'smart_191_normalized', u'smart_191_raw',
       u'smart_192_normalized', u'smart_192_

Since each vendor reports S.M.A.R.T data differently, we restrict the dataset to a single model. The Seagate disk ST4000DM000 has the most failures (139), so it is the best candidate.
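The choice of model can also be made programmatically rather than read off the printed table. This is a small sketch on a toy frame, assuming the same `model` and `failure` columns as the dataset; the frame `toy` and the names `failures_per_model`, `top_model` and `subset` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real data: one row per drive per day
toy = pd.DataFrame({
    'model': ['ST4000DM000', 'ST4000DM000', 'WDC WD30EFRX',
              'ST4000DM000', 'WDC WD30EFRX'],
    'failure': [0, 1, 0, 1, 1],
})

# Sum the failure flags per model and take the model with the largest count
failures_per_model = toy.groupby('model')['failure'].sum()
top_model = failures_per_model.idxmax()

# Keep only that model; .copy() gives an independent frame that is safe to modify
subset = toy[toy['model'] == top_model].copy()
print(top_model, subset.shape)
```

Taking a copy of the filtered frame avoids pandas' chained-assignment warning when columns are overwritten later.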

In [39]:
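# select specific model, since vendors differ on how SMART values are used
# .copy() so that the column assignment in the next cell does not act on a slice of the full frame
hdd = hdd.query('model == "ST4000DM000"').copy()
hdd.shape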

(1681473, 89)

In [40]:
from sklearn.preprocessing import LabelEncoder
serial_encoder = LabelEncoder()
hdd['serial_number'] = serial_encoder.fit_transform(hdd['serial_number'].astype('str'))
hdd.head()

Unnamed: 0,serial_number,model,failure,smart_1_normalized,smart_1_raw,smart_2_normalized,smart_2_raw,smart_3_normalized,smart_3_raw,smart_4_normalized,...,smart_242_normalized,smart_242_raw,smart_250_normalized,smart_250_raw,smart_251_normalized,smart_251_raw,smart_252_normalized,smart_252_raw,smart_254_normalized,smart_254_raw
1,30154,ST4000DM000,0,113,54551400,-1,-1,96,0,100,...,100,1.316882e-315,-1,-1,-1,-1,-1,-1,-1,-1
7,2976,ST4000DM000,0,107,13985080,-1,-1,97,0,100,...,100,1.761059e-314,-1,-1,-1,-1,-1,-1,-1,-1
8,2975,ST4000DM000,0,116,109242152,-1,-1,97,0,100,...,100,2.551732e-314,-1,-1,-1,-1,-1,-1,-1,-1
9,21345,ST4000DM000,0,112,46112000,-1,-1,92,0,100,...,100,2.356922e-313,-1,-1,-1,-1,-1,-1,-1,-1
10,16688,ST4000DM000,0,116,117245752,-1,-1,93,0,100,...,100,6.064623e-313,-1,-1,-1,-1,-1,-1,-1,-1


In [41]:
# number of unique hdd, counted two ways
serials_df = pd.DataFrame()
serials_df['serial_number'] = hdd['serial_number']
serials_df.drop_duplicates('serial_number', inplace=True)
print(len(hdd['serial_number'].unique()))
serials_df.shape[0]


35057


35057

In [42]:
# number of failed drives (rows with failure == 1)
print(hdd.loc[hdd['failure'] == 1].shape[0])

139


In [45]:
# remove normalized values that are left
# (DataFrame.select is gone from current pandas; boolean column indexing does the same job)
hdd = hdd.loc[:, ~hdd.columns.str.endswith('normalized')]
hdd.shape

(1681473, 46)

In [46]:
# remove model number
hdd.drop(['model'], inplace=True, axis=1)

## Resampling the data
This dataset is extremely skewed: only 139 failures among 35057 drives.  The total number of records is 1681473, and nearly all of them belong to the healthy class, so balancing the two classes by oversampling should come close to doubling the size of the data.
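The expected size can be checked with a little arithmetic. The sketch below assumes the oversampler adds synthetic failure rows until the two classes are equal in size (SMOTE's default behaviour as far as I know); the row counts are the ones printed in the cells above, and the variable names are illustrative.

```python
total_rows = 1681473    # records for ST4000DM000
failure_rows = 139      # records with failure == 1
healthy_rows = total_rows - failure_rows

# How many healthy records there are for each failure record
imbalance_ratio = healthy_rows / float(failure_rows)

# After balancing, the failure class is as large as the healthy class
expected_rows = 2 * healthy_rows
synthetic_rows = healthy_rows - failure_rows

print(healthy_rows, round(imbalance_ratio), expected_rows, synthetic_rows)
```

The expected total of 3362668 rows is slightly less than twice 1681473 because only the minority class grows.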

In [47]:
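from imblearn.over_sampling import SMOTE
# Apply SMOTE
# NOTE: X still includes the 'failure' column, so the label is also passed in as a feature
X = hdd
y = np.asarray(hdd['failure'])
# regular SMOTE is the default; newer imbalanced-learn releases no longer accept
# kind='regular' and use fit_resample() in place of fit_sample()
sm = SMOTE()
X_res, y_res = sm.fit_resample(X, y)
X_res.shape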

(3362668, 45)

We need to create a solution set so that we can compare predictions against it later on.

In [48]:
#save solution set
serial_df = pd.DataFrame()
serial_df['serial_number'] = hdd['serial_number']
serial_df['failure'] = hdd['failure']
serial_df.head()

Unnamed: 0,serial_number,failure
1,30154,0
7,2976,0
8,2975,0
9,21345,0
10,16688,0


In [49]:
serial_df.shape

(1681473, 2)

In [50]:
hdd.columns


Index([u'serial_number', u'failure', u'smart_1_raw', u'smart_2_raw',
       u'smart_3_raw', u'smart_4_raw', u'smart_5_raw', u'smart_7_raw',
       u'smart_8_raw', u'smart_9_raw', u'smart_10_raw', u'smart_11_raw',
       u'smart_12_raw', u'smart_13_raw', u'smart_22_raw', u'smart_183_raw',
       u'smart_184_raw', u'smart_187_raw', u'smart_188_raw', u'smart_189_raw',
       u'smart_190_raw', u'smart_191_raw', u'smart_192_raw', u'smart_193_raw',
       u'smart_194_raw', u'smart_195_raw', u'smart_196_raw', u'smart_197_raw',
       u'smart_198_raw', u'smart_199_raw', u'smart_200_raw', u'smart_201_raw',
       u'smart_220_raw', u'smart_222_raw', u'smart_223_raw', u'smart_224_raw',
       u'smart_225_raw', u'smart_226_raw', u'smart_240_raw', u'smart_241_raw',
       u'smart_242_raw', u'smart_250_raw', u'smart_251_raw', u'smart_252_raw',
       u'smart_254_raw'],
      dtype='object')

## Writing the files



In [51]:
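# save the cleaned data and the solution set
# NOTE: this writes the cleaned hdd frame; the SMOTE output (X_res, y_res) is not written here
hdd.to_csv('../input/harddrive_resampled.csv', index=False)
serial_df.to_csv('../input/solutions_resampled.csv', index=False)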