In [2]:
import pandas as pd
import numpy as np
import utils_backblaze as utils
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from dldb import DLDB
import os

## 1. Load in the data

The data is loaded from many individual CSV files, and then cut off at times specified by the labels.

To keep this notebook interactive, and because the data is heavily imbalanced toward working hard drives, we downsample the negative class. A positive label means that a hard drive failed on the subsequent day, while a negative label means that it did not. To downsample, we keep only a small random fraction of the hard drives that never failed over the period covered by the available CSV files (set by `negative_downsample_frac`, which is 0.01 in the cell below).

In [7]:
data_dir = "/Users/bschreck/Google Drive File Stream/My Drive/Feature Labs Shared/EntitySets/entitysets/backblaze_harddrive/data"
df = utils.load_data_as_dataframe(data_dir=data_dir, csv_glob='*.csv',
                                  negative_downsample_frac=0.01)
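
The implementation of `utils.load_data_as_dataframe` is not shown in this notebook, so the following is only a minimal sketch of the downsampling idea. It assumes that `negative_downsample_frac` is the fraction of never-failed drives to keep; the function name, the toy serial numbers, and the `keep_frac=0.1` value are hypothetical.

```python
import random

def downsample_never_failed(ever_failed_by_drive, keep_frac, seed=0):
    """Keep every drive that failed plus a random `keep_frac` of the others.

    `ever_failed_by_drive` maps serial_number -> bool (did the drive ever fail).
    This is an illustrative helper, not the utility used in the notebook.
    """
    rng = random.Random(seed)
    failed = [s for s, f in ever_failed_by_drive.items() if f]
    healthy = sorted(s for s, f in ever_failed_by_drive.items() if not f)
    n_keep = round(len(healthy) * keep_frac)
    kept_healthy = rng.sample(healthy, n_keep)
    return set(failed) | set(kept_healthy)

# Toy fleet: 1000 healthy drives and 5 failed ones (made-up serial numbers).
fleet = {f"OK{i:04d}": False for i in range(1000)}
fleet.update({f"BAD{i}": True for i in range(5)})

kept = downsample_never_failed(fleet, keep_frac=0.1)
n_failed_kept = sum(fleet[s] for s in kept)
n_healthy_kept = len(kept) - n_failed_kept
print(n_failed_kept, n_healthy_kept)
```

Sampling whole drives, rather than individual daily rows, keeps each remaining drive's time series intact.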

In [19]:
df.groupby('serial_number')['failure'].last().value_counts()

False    853
True     321
Name: failure, dtype: int64

We still need to create an EntitySet in this notebook because the labeling utility function depends on it. In general, you don't need an EntitySet to run DLDB on a denormalized dataset, although creating one can help simplify and unify your code.

In [11]:
es = utils.load_entityset_from_dataframe(df)
es

Entityset: BackBlaze
  Entities:
    SMART_observations [Rows: 67424, Columns: 94]
    HDD [Rows: 1174, Columns: 4]
    models [Rows: 27, Columns: 1]
  Relationships:
    SMART_observations.serial_number -> HDD.serial_number
    HDD.model -> models.model

In [12]:
training_window = "20 days"
lead = pd.Timedelta('1 day')
prediction_window = pd.Timedelta('25 days')
min_training_data = pd.Timedelta('5 days')

In [13]:
labels = utils.create_labels(es,
                             lead,
                             min_training_data)

Creating labels...: 100%|██████████| 1175/1175 [00:03<00:00, 360.44it/s]


In [14]:
labels.value_counts()

False    852
True     282
Name: label, dtype: int64

In [15]:
cutoff_raw = utils.cutoff_raw_data(df, labels, training_window)
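
The internals of `utils.create_labels` and `utils.cutoff_raw_data` are not shown here. As a rough sketch of how `lead`, `training_window` and `min_training_data` could interact, the helper below assumes that the cutoff is the drive's last observed date minus the lead, that only observations inside the training window before the cutoff are kept, and that drives with too little history are skipped. The function name and these exact semantics are assumptions.

```python
from datetime import date, timedelta

def training_slice(observation_dates, last_date, lead,
                   training_window, min_training_data):
    """Return the dates usable for one drive, or None if history is too short.

    Hypothetical helper: the real utilities may define the cutoff differently.
    """
    cutoff = last_date - lead
    # Skip drives whose history before the cutoff is shorter than required.
    if cutoff - min(observation_dates) < min_training_data:
        return None
    window_start = cutoff - training_window
    return [d for d in observation_dates if window_start < d <= cutoff]

# One drive observed daily for 30 days (2017-01-01 through 2017-01-30).
first = date(2017, 1, 1)
dates = [first + timedelta(days=i) for i in range(30)]

window = training_slice(
    dates,
    last_date=dates[-1],
    lead=timedelta(days=1),
    training_window=timedelta(days=20),
    min_training_data=timedelta(days=5),
)

# A drive with only three days of data is dropped.
too_short = training_slice(
    dates[:3],
    last_date=dates[2],
    lead=timedelta(days=1),
    training_window=timedelta(days=20),
    min_training_data=timedelta(days=5),
)
```

Under these assumptions the model never sees data from the final day before the label, which is what prevents leakage from the failure day itself.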

## 2. Initialize DLDB with desired hyperparameters

In this example, we use two fairly small [LSTM](https://keras.io/layers/recurrent/) layers and two feed-forward layers (called "Dense" layers in Keras/TensorFlow terminology). DLDB has a simple API and exposes a large number of hyperparameters, so it is amenable to hyperparameter optimization algorithms.

Each categorical feature is mapped to a 12-dimensional embedding with a vocabulary of at most 20 unique values: the 20 most frequent values are kept, and all remaining values are mapped to a single shared token.

In [16]:
dl_model = DLDB(
    regression=False,
    classes=[False, True],
    recurrent_layer_sizes=(32, 32),
    dense_layer_sizes=(32, 16),
    dropout_fraction=0.2,
    recurrent_dropout_fraction=0.2,
    categorical_embedding_size=12,
    categorical_max_vocab=20)
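
The vocabulary capping described above can be sketched in plain Python. This is not DLDB's internal code: the helper names, the toy values, and the choice to reserve index 0 for the shared "rare value" token are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(values, max_vocab):
    """Map the `max_vocab` most frequent values to the indices 1..max_vocab."""
    counts = Counter(values)
    return {v: i + 1 for i, (v, _) in enumerate(counts.most_common(max_vocab))}

def encode(values, vocab, unknown_index=0):
    """Replace each value by its index; values outside the vocab share one index."""
    return [vocab.get(v, unknown_index) for v in values]

# Toy "model" column with four distinct values and a vocabulary capped at 2.
models = ["A", "A", "A", "B", "B", "C", "D"]
vocab = build_vocab(models, max_vocab=2)
encoded = encode(models, vocab)
print(vocab)    # {'A': 1, 'B': 2}
print(encoded)  # [1, 1, 1, 2, 2, 0, 0]
```

An embedding layer would then look up one trainable vector (12-dimensional with the settings above) per integer index.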

## 3. Train the model and test using cross-validation

We use a `batch_size` of 128 (for each gradient update step) and train over 3 passes of the dataset (epochs).
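
To give a sense of scale, the number of gradient updates per fold can be estimated with a little arithmetic. The sketch assumes that each label corresponds to one training sequence, that the 7-fold split used in this notebook leaves six sevenths of the labels for training, and that a final partial batch still counts as one update.

```python
import math

n_labels = 852 + 282   # label counts reported by labels.value_counts()
n_splits = 7
batch_size = 128
epochs = 3

# Each fold trains on all labels except the held-out seventh.
n_train = n_labels * (n_splits - 1) // n_splits
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(n_train, steps_per_epoch, total_steps)  # 972 8 24
```

With so few updates per fold, the per-fold scores can be expected to vary noticeably.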

In [17]:
n_splits=7
splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
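
Stratification means that every fold receives roughly the same share of positive labels. The pure-Python sketch below illustrates the idea with a simple round-robin assignment per class; it is not scikit-learn's algorithm, and the function name is hypothetical.

```python
import random

def stratified_folds(labels, n_splits, seed=0):
    """Assign sample indices to folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the folds in turn.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Same class balance as the labels in this notebook.
labels_list = [False] * 852 + [True] * 282
folds = stratified_folds(labels_list, n_splits=7)
positives_per_fold = [sum(labels_list[i] for i in fold) for fold in folds]
```

Each fold ends up with 40 or 41 of the 282 positive labels, so no test fold is left without failures to score against.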

Note that we tell DLDB explicitly which features are categorical.

In [18]:
cv_score = []

# Drop the time level of each index once, outside the loop
labels_no_cutoff = labels.reset_index('cutoff', drop=True)
ftens_no_date = cutoff_raw.reset_index('date', drop=True)

for train_index, test_index in splitter.split(labels, labels):
    train_labels = labels_no_cutoff.iloc[train_index]
    test_labels = labels_no_cutoff.iloc[test_index]
    train_ftens = ftens_no_date.loc[train_labels.index, :]
    test_ftens = ftens_no_date.loc[test_labels.index, :]

    dl_model.fit(
        train_ftens, train_labels,
        categorical_feature_names=['model'],
        batch_size=128,
        # Set this to the number of CPU cores available
        workers=8,
        use_multiprocessing=True,
        shuffle=False,
        epochs=3)

    predictions = dl_model.predict(test_ftens)
    score = roc_auc_score(test_labels, predictions)
    print("cv score: ", score)
    cv_score.append(score)

mean_score = np.mean(cv_score)
# Two standard errors of the mean (roughly the half-width of a 95% interval)
two_stderr = 2 * (np.std(cv_score) / np.sqrt(n_splits))

print("DENORM AUC %.2f +/- %.2f" % (mean_score, two_stderr))

Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.5981607357057177
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.5269892043182727
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.4364754098360656
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.6380122950819671
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.6737704918032787
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.4925619834710744
Epoch 1/3
Epoch 2/3
Epoch 3/3
Trans