## Final Results
Run XGBoost and Random Forest classifiers on the cleaned data file, which contains only the following drive model: ST4000DM000

Conclusion: SMART data alone is not a "good enough" predictor for hard drive failure.
The test set used below contained data for over 11,000 hard drives.

SMART attributes used in this set: 5, 187, 188, 197, 198. Previous runs used all SMART attributes; this subset yielded the best results.

The false negative rate was high because many failed drives showed no relevant SMART indicators of failure: their SMART counters were all 0, yet the drive failed. 26% of failed drives fit that criterion.
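The share of "silent" failures can be measured directly from the last-day rows. The following is a minimal sketch on hypothetical toy data; it assumes one row per drive and the `smart_*_raw` column names used later in this notebook, and the toy values are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per drive, last-day raw SMART counters
drives = pd.DataFrame({
    "failure":       [1, 1, 1, 1, 0, 0],
    "smart_5_raw":   [0, 8, 0, 0, 0, 0],
    "smart_187_raw": [0, 0, 3, 0, 0, 30],
    "smart_188_raw": [0, 0, 0, 0, 0, 0],
    "smart_197_raw": [0, 16, 0, 0, 0, 0],
    "smart_198_raw": [0, 16, 0, 0, 0, 0],
})

smart_cols = ["smart_5_raw", "smart_187_raw", "smart_188_raw",
              "smart_197_raw", "smart_198_raw"]

# Keep only failed drives, then flag those whose counters are all zero
failed = drives[drives["failure"] == 1]
silent = (failed[smart_cols] == 0).all(axis=1)
silent_share = silent.mean()

print("Failed drives with all-zero counters: {:.0%}".format(silent_share))
```

No classifier trained on these five counters alone can separate such drives from healthy ones, so this share is a rough lower bound on the false negative rate.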

### Example results:
#### Total number of drives 11569
#### Number of failed drives 42
#### Number of predicted failures 25
#### Number of false positives 9
#### Number of false negatives 26
#### Number of true negatives 11518
#### Number of true positives 16
#### Fmeasure = 0.4776
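The metrics can be cross-checked from the reported counts alone. This is a small worked calculation, assuming the standard definitions of precision, recall, F-measure, and false positive rate; true negatives are taken as the drives that remain after subtracting the other three cells from the total.

```python
# Counts reported for the ST4000DM000 test set
total_drives = 11569
failed_drives = 42
true_pos = 16
false_pos = 9

false_neg = failed_drives - true_pos                        # 26
true_neg = total_drives - true_pos - false_pos - false_neg  # 11518

# float() keeps the divisions exact under Python 2 as well as Python 3
precision = true_pos / float(true_pos + false_pos)
recall = true_pos / float(true_pos + false_neg)
f_measure = 2 * precision * recall / (precision + recall)
fp_rate = false_pos / float(false_pos + true_neg)

print("precision = {:.4f}".format(precision))   # 0.6400
print("recall    = {:.4f}".format(recall))      # 0.3810
print("F-measure = {:.4f}".format(f_measure))   # 0.4776
print("FP rate   = {:.4f}".format(fp_rate))     # 0.0008
```

The parentheses in the precision formula matter: `tp / tp + fp` evaluates to `1 + fp`, which here would inflate the F-measure to about 0.73.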

In [12]:
import numpy as np
from numpy import loadtxt
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import ensemble, metrics 

# load data
hdd = pd.read_csv('../input/ST4000DM000_clean_SMART_harddrive.csv')

hdd_group = hdd.groupby('serial_number')
hdd_last_day = hdd_group.nth(-1) # take the last row from each group

# the number of drives in the dataset
uniq_serial_num = pd.Series(hdd_last_day.index.unique())
uniq_serial_num.shape

# hold out 33% of data for testing
test_ids = uniq_serial_num.sample(frac=0.33)

train = hdd_last_day.query('index not in @test_ids')
test = hdd_last_day.query('index in @test_ids')

print(test['failure'].value_counts())
print(train['failure'].value_counts())

0    11527
1       42
Name: failure, dtype: int64
0    23399
1       89
Name: failure, dtype: int64
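With well under 1% of drives failing, a purely random hold-out can shift the failure rate between train and test from run to run. One possible alternative, sketched here on hypothetical stand-in data and not the split used for the results in this notebook, is scikit-learn's `train_test_split` with `stratify` and a fixed `random_state`. The frame `last_day` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for hdd_last_day: one row per drive, 20 failures in 5000
n_drives = 5000
failure = np.zeros(n_drives, dtype=int)
failure[:20] = 1
last_day = pd.DataFrame(
    {"smart_5_raw": np.arange(n_drives) % 7, "failure": failure},
    index=pd.Index(["SN{:05d}".format(i) for i in range(n_drives)],
                   name="serial_number"))

# Stratify on the label so both parts keep roughly the same failure rate;
# random_state makes the split repeatable
train_df, test_df = train_test_split(
    last_day,
    test_size=0.33,
    stratify=last_day["failure"],
    random_state=42)

print(train_df["failure"].value_counts())
print(test_df["failure"].value_counts())
```

Because each drive contributes exactly one row here, splitting rows is equivalent to splitting by serial number.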


In [13]:

# Create prediction dataframe: failure label plus the five raw SMART counters
predict_df = pd.DataFrame({
    'failure': test['failure'], 'r5': test['smart_5_raw'],
    'r187': test['smart_187_raw'], 'r188': test['smart_188_raw'],
    'r197': test['smart_197_raw'], 'r198': test['smart_198_raw']})
# Sum the counters; r188 is cast to float and rounded before adding
predict_df['sum'] = (predict_df['r5'] + predict_df['r187']
                     + predict_df['r188'].astype(float).round()
                     + predict_df['r197'] + predict_df['r198']).astype(float).round()

print(predict_df.head())


               failure  r187           r188  r197  r198  r5  sum
serial_number                                                   
0                    0     0   0.000000e+00     0     0   0    0
2                    0     0  1.482197e-323     0     0   0    0
4                    0     0   0.000000e+00     0     0   0    0
5                    0     0   0.000000e+00     0     0   0    0
6                    0    30   0.000000e+00     0     0   0   30
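The `r188` value `1.482197e-323` in the output above is a subnormal double, far too small to be a real counter reading. One possible explanation, offered as an assumption rather than a confirmed diagnosis of this file, is that an integer counter was stored or read with its bytes interpreted as a float: the 64-bit integer 3 has exactly that bit pattern. The sketch below demonstrates the effect with NumPy's `view`.

```python
import numpy as np

# Three small integer counters stored as 64-bit integers
raw = np.array([0, 3, 30], dtype=np.int64)

# Reinterpret the same 8 bytes per element as doubles (no numeric conversion)
as_float = raw.view(np.float64)
print(as_float)  # tiny subnormal values rather than 0.0, 3.0, 30.0

# Rounding such values gives 0, so the counter information is lost
rounded = np.round(as_float)

# Viewing the bytes as integers again recovers the original counters
recovered = as_float.view(np.int64)
print(recovered)
```

If this is what happened, rounding `r188` treats every such reading as 0, and a view back to `int64` would recover the counters.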


In [14]:

train_labels = train['failure']
test_labels = test['failure']
train = train.drop('failure', axis=1)
test = test.drop('failure', axis=1)

X_train = train
y_train = train_labels
X_test = test
y_test = test_labels

# model

model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data
y_pred = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, y_pred)
print("XGB - Accuracy: %.2f%%" % (accuracy * 100.0))


XGB - Accuracy: 99.70%
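Accuracy is hard to interpret on a test set this imbalanced. As a point of comparison, the sketch below scores a trivial baseline that predicts "no failure" for every drive, using the test-set class counts printed earlier (11527 healthy, 42 failed). The label arrays are reconstructed from those counts, not loaded from the data file.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Labels rebuilt from the test-set class counts: 11527 healthy, 42 failed
y_true = np.array([0] * 11527 + [1] * 42)

# Trivial baseline: predict "healthy" for every drive
y_baseline = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, y_baseline)
print("Baseline accuracy: {:.2f}%".format(baseline_acc * 100.0))  # 99.64%
```

The baseline finds no failures at all and still scores within 0.1 percentage points of the model, which is why the confusion-matrix counts are more informative than accuracy.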


In [15]:
unique, counts = np.unique(y_pred, return_counts=True)
print(dict(zip(unique, counts)))


{0: 11544, 1: 25}


In [16]:

predict_df['xgb'] = y_pred
columns_to_drop = ['r5', 'r187', 'r188', 'r197', 'r198']
predict_df = predict_df.drop(columns_to_drop, axis=1)


predict_df['sum'] = predict_df['sum'].map(lambda x: 1 if x > 0 else 0)
#predict_df.to_csv("ST4000DM000_pred.csv", index='false')
predict_df['fp'] = (predict_df['xgb'] - predict_df['failure'])
predict_df['fp'] = predict_df['fp'].map(lambda x: 1 if x > 0 else 0)
predict_df['fn'] = predict_df['failure'] - predict_df['xgb']
predict_df['fn'] = predict_df['fn'].map(lambda x: 1 if x > 0 else 0)

predict_df.loc[(predict_df['failure'] == 1) & (predict_df['xgb'] == 1), 'tp'] = 1
predict_df.loc[(predict_df['failure'] == 0) & (predict_df['xgb'] == 0), 'tn'] = 1
predict_df.fillna(0, inplace=True)


print(predict_df.head())


               failure  sum  xgb  fp  fn  tp  tn
serial_number                                   
0                    0    0    0   0   0   0   1
2                    0    0    0   0   0   0   1
4                    0    0    0   0   0   0   1
5                    0    0    0   0   0   0   1
6                    0    1    0   0   0   0   1


In [17]:

print("Number of total drives {}".format(predict_df.shape[0]))
print("Number of failed drives {}".format(predict_df['failure'].sum()))
print("Number of predicted failures {}".format(predict_df['xgb'].sum()))
print("Number of false positives {}".format(predict_df['fp'].sum()))
print("Number of false negatives {}".format(predict_df['fn'].sum()))
print("Number of true negatives {}".format(predict_df['tn'].sum()))
print("Number of true positives {}".format(predict_df['tp'].sum()))


Number of total drives 11569
Number of failed drives 42
Number of predicted failures 25
Number of false positives 9
Number of false negatives 26
Number of true negatives 11518.0
Number of true positives 16.0
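The four confusion-matrix cells can also be obtained in one call instead of from per-row flag columns. This is a minimal sketch using `sklearn.metrics.confusion_matrix` on small hypothetical label arrays; in the notebook the inputs would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = failed drive, 0 = healthy drive
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# With labels=[0, 1], the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

print("tn={} fp={} fn={} tp={}".format(tn, fp, fn, tp))
```

Passing `labels=[0, 1]` fixes the cell order even when one class is absent from the predictions.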


In [18]:
fp = predict_df['fp'].sum()
fn = predict_df['fn'].sum()
tp = predict_df['tp'].sum()
# count true negatives directly from the labels; float() avoids integer division
tn = float(((predict_df['failure'] == 0) & (predict_df['xgb'] == 0)).sum())
recall = tp / (tp + fn)
FPrate = fp / (fp + tn)
precision = tp / (tp + fp)
Fmeasure = (2 * precision * recall) / (precision + recall)
print(" FPrate = {:.4f}".format(FPrate))
print(" Fmeasure = {:.4f}".format(Fmeasure))
print("Number of code predicted failures {}".format(predict_df['sum'].sum()))


 FPrate = 0.0008
 Fmeasure = 0.4776
Number of code predicted failures 144
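The `sum` column acts as a simple rule-based predictor: a drive is flagged when any of the five counters is nonzero, which is what produced the 144 "code predicted" failures. The sketch below scores such a rule with precision and recall on hypothetical toy data; the column names follow `predict_df`, but the values are illustrative only.

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy test set using the short column names from predict_df
toy = pd.DataFrame({
    "failure": [0, 0, 0, 1, 1, 0],
    "r5":      [0, 0, 0, 8, 0, 0],
    "r187":    [0, 30, 0, 2, 0, 0],
    "r188":    [0, 0, 0, 0, 0, 0],
    "r197":    [0, 0, 0, 16, 0, 0],
    "r198":    [0, 0, 0, 16, 0, 0],
})
counter_cols = ["r5", "r187", "r188", "r197", "r198"]

# Rule: flag a drive as failing if any counter is nonzero
toy["rule"] = (toy[counter_cols].sum(axis=1) > 0).astype(int)

rule_precision = precision_score(toy["failure"], toy["rule"])
rule_recall = recall_score(toy["failure"], toy["rule"])
print("rule precision = {:.2f}, recall = {:.2f}".format(rule_precision, rule_recall))
```

Scoring the rule with the same metrics as the classifiers shows whether the models add anything beyond a nonzero-counter check.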


In [19]:
predict_df.to_csv("ST4000DM000_pred_all_parms.csv", index=True)  # keep serial_number in the file
columns_to_drop = ['fp', 'fn', 'sum', 'tp', 'tn']
predict_df = predict_df.drop(columns_to_drop, axis=1)
rf = ensemble.RandomForestClassifier()
rf.fit(X_train, y_train)
preds_rf = rf.predict(X_test)

predict_df['RF'] = preds_rf
print('logloss', metrics.log_loss(y_true=y_test, y_pred=preds_rf))
print('roc_auc', metrics.roc_auc_score(y_true=y_test, y_score=preds_rf))

('logloss', 0.10747714984818632)
('roc_auc', 0.67818104078622865)
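Log loss and ROC AUC are designed for probabilities or scores. Given hard 0/1 predictions, `roc_auc_score` evaluates a single operating point and `log_loss` penalizes every miss with a clipped extreme value. The sketch below compares both inputs on synthetic data, assuming `predict_proba` output is acceptable for the analysis; the data and hyperparameters are hypothetical, so the numbers do not relate to the drive dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score

# Synthetic binary problem (hypothetical): feature 0 carries a noisy signal
rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = (X[:, 0] + 0.3 * rng.randn(2000) > 0.9).astype(int)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_tr, y_tr)

hard = rf_demo.predict(X_te)                # 0/1 labels
proba = rf_demo.predict_proba(X_te)[:, 1]   # estimated P(class 1)

auc_hard = roc_auc_score(y_te, hard)
auc_proba = roc_auc_score(y_te, proba)
ll_proba = log_loss(y_te, proba)

print("roc_auc on hard labels:   {:.3f}".format(auc_hard))
print("roc_auc on probabilities: {:.3f}".format(auc_proba))
print("logloss on probabilities: {:.3f}".format(ll_proba))
```

Probabilities also allow a decision threshold below 0.5, which may trade some false positives for fewer missed failures.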


In [20]:
# create file to see predicted vs actual
predict_df.to_csv("ST4000DM000_pred_limited.csv", index=True)  # keep serial_number in the file
print(predict_df.head())

               failure  xgb  RF
serial_number                  
0                    0    0   0
2                    0    0   0
4                    0    0   0
5                    0    0   0
6                    0    0   0
