In [1]:
import featuretools as ft
from featuretools.primitives import NumTrue, PercentTrue
from featuretools.selection import remove_low_information_features
import pandas as pd
import numpy as np
import utils_backblaze as utils
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.ensemble import (RandomForestClassifier,
                              RandomForestRegressor)
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from dldb import DLDB
import os
ft.__version__

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


'0.1.20'

# DLDB: Using DFS to Train Recurrent Neural Networks


### Brief DFS primer
Deep Feature Synthesis (DFS) works on time-varying, transactional-level data to generate powerful, interpretable features for machine learning. Raw data consists of many tables, with some columns acting as links between tables. We want to produce a fixed-size feature vector for each row of one of these tables, but taking advantage of the data contained in the other tables. DFS generates these feature vectors by applying many statistical functions, called primitives, across tables. And importantly, it generates these features at specific moments in time, taking precautions to only use data from before the desired time.

For instance, the data we will use in this notebook contains a table with a row for each Instacart user, and several other tables about their shopping behavior. DFS can apply the "sum" primitive to the dollar amount of each order per user, producing a feature for "the total amount spent on Instacart per user". Adding a *cutoff time* of March 1, 2015, the feature becomes the "total amount spent on Instacart per user before March 1, 2015". DFS can also combine several primitives, allowing it to form features like the "standard deviation of the number of items in each user's previous orders".

For a more in depth explanation of DFS, we encourage you to check out this [blog post](https://www.featurelabs.com/blog/deep-feature-synthesis/) and [this page](https://docs.featuretools.com/automated_feature_engineering/afe.html) in the Featuretools documentation.

### Producing a 3D tensor from DFS
DFS as described produces a 2-dimensional feature matrix that can be used for classic machine learning techniques, such as SVM or Random Forest. These techniques need a fixed-size feature matrix where any time-dependence is summarized into historical statistics (e.g. Number of items a customer purchased in the past 30 days).

We can take more explicit advantage of the time-dimension in this type of data using Recurrent Neural Networks. RNNs take in sequences of features, where the 3rd dimension in our case would represent time. Since RNNs learn high-level features on their own, the usual approach when using multiple tables is just to join all of them together and use the raw values. We will show that approach as a baseline here.

Instead, we can use DFS to produce high-level features at different points in time, and then learn from these sequences of features, rather than raw data. In this case, we would use DFS to produce a 3D tensor flattened as a 2D matrix, with multiple times for each instance. Combining DFS with RNNs essentially encodes prior human intuition and assumptions about relevant data transformations into the problem before letting the deep learning do its thing. Because the net doesn't have to learn every feature from scratch, we may be able to reduce training time, use a simpler net, not have to tweak as many hyperparameters, use less data, or boost performance. In this notebook, we will show a network using DFS features that produces higher scores with less variation than the same network trained on raw data.

We'll try it on [harddrive failure data from Backblaze](https://www.backblaze.com/b2/hard-drive-test-data.html).

## DLDB Library

[DLDB](https://github.com/HDI-Project/DL-DB) is a utility library for building recurrent neural networks from a feature matrix with multiple cutoff times per instance. Internally, it uses the [Keras](keras.io) library (which in turn uses [Tensor Flow](tensorflow.org)). 

It works by first imputing and scaling a sequence feature matrix (the result of calling `tdfs()`), and then separating the numeric features from the categoricals. Each categorical feature is mapped to a Keras Embedding layer in order to transform it into a dense, numeric vector. Then these embeddings are concatenated with the numeric features and fed into several recurrent layers (specified in hyperparameters) and several feed-forward layers (also specified in hyperparameters). It also includes an optional 1-D convolutional layer that will be applied before the recurrent layers. All the network layers, including the categorical embeddings, are trained end-to-end using any gradient update methods available in Keras.

We packaged DL-DB into a Python library that can be installed via pip:
    
```
pip install dldb
```

This library includes both a class to build these recurrent neural network models as well as the `tdfs()` function that creates time-series features as input.

## 1. Load in the data

The data is loaded from many individual CSV files, and then converted into the Featuretools Entityset format.

To make this notebook more interactive and because the data is heavily imbalanced toward working hard drives, we downsample the "negative class". A positive label means that a hard drive failed on the subsequent day, while a negative means that it did not. To do this downsampling, we remove 90% of the hard drives that never failed across the duration of the available CSV files.

In [2]:
data_dir = "/Users/bschreck/Google Drive File Stream/My Drive/Feature Labs Shared/EntitySets/entitysets/backblaze_harddrive/data"
df = utils.load_data_as_dataframe(data_dir=data_dir, csv_glob='*.csv',
                                  negative_downsample_frac=0.01)

In [3]:
df.groupby('serial_number')['failure'].last().value_counts()

Defaulting to column but this will raise an ambiguity error in a future version
  """Entry point for launching an IPython kernel.


False    850
True     321
Name: failure, dtype: int64

In [4]:
es = utils.load_entityset_from_dataframe(df)
es

Entityset: BackBlaze
  Entities:
    SMART_observations [Rows: 67839, Columns: 94]
    HDD [Rows: 1171, Columns: 4]
    models [Rows: 22, Columns: 1]
  Relationships:
    SMART_observations.serial_number -> HDD.serial_number
    HDD.model -> models.model

## 2. Construct labels

This utility function picks out a sampling of hard drives at particular days in their lifecycle, and labels each as True or False depending on whether they failed the following day.

In [5]:
training_window = "20 days"
lead = pd.Timedelta('1 day')
prediction_window = pd.Timedelta('25 days')
min_training_data = pd.Timedelta('5 days')

In [6]:
labels = utils.create_labels(es,
                             lead,
                             min_training_data)

Creating labels...: 100%|██████████| 1172/1172 [00:02<00:00, 415.31it/s]


In [7]:
labels.value_counts()

False    849
True     282
Name: label, dtype: int64

## Create time-stamped feature tensor using DFS

Here is where things start to get interesting. We use the [`make_temporal_cutoffs` function in Featuretools](https://github.com/HDI-Project/DL-DB/blob/master/dldb/tdfs.py) to produce a serious of preceding cutoff times for each label/cutoff time pair. We then provide these cutoffs to DFS to generate a feature tensor with several rows per harddrive serial number.

This `make_temporal_cutoffs` function has a few different ways of selecting these additional cutoff times. Here, we provide `window_size='1d'` and `starts` equal to the first recorded time for each drive. This produces sequences spaced out by 1 day (the frequency of recording in the actual dataset) from the first recording until the cutoff time, at which point we have to make a prediction.


In [8]:
instance_ids = labels.index.get_level_values('serial_number')
cutoffs = labels.index.get_level_values('cutoff')
starts = es['SMART_observations'].df.groupby('serial_number')['date'].min().loc[instance_ids].tolist()
temporal_cutoffs = ft.make_temporal_cutoffs(instance_ids=instance_ids,
                                            cutoffs=cutoffs,
                                            start=starts,
                                            window_size='1d')

In [9]:
trans_primitives = ["day", "days"]
label_feature = ft.Feature(es["SMART_observations"]["failure"]) == 1
seed_features = [label_feature,
                 NumTrue(label_feature, es["HDD"]), 
                 PercentTrue(label_feature, es["HDD"])]
ftens, fl = ft.dfs(entityset=es,
                target_entity="HDD",
                cutoff_time=temporal_cutoffs,
                cutoff_time_in_index=True,
                trans_primitives=["day", "days"],
                seed_features=seed_features,
                max_depth=2,
                verbose=True)
# Make sure ftens is sorted the same way as the labels
ftens = ftens.swaplevel(i=1, j=0).sort_index()

Built 1117 features
Elapsed: 44:43 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks


In [10]:
ftens.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,model,capacity_bytes,SUM(SMART_observations.smart_10_normalized),SUM(SMART_observations.smart_10_raw),SUM(SMART_observations.smart_11_normalized),SUM(SMART_observations.smart_11_raw),SUM(SMART_observations.smart_12_normalized),SUM(SMART_observations.smart_12_raw),SUM(SMART_observations.smart_13_normalized),SUM(SMART_observations.smart_13_raw),...,models.STD(HDD.NUM_TRUE(SMART_observations.failure = 1)),models.STD(HDD.PERCENT_TRUE(SMART_observations.failure = 1)),models.MAX(HDD.NUM_TRUE(SMART_observations.failure = 1)),models.MAX(HDD.PERCENT_TRUE(SMART_observations.failure = 1)),models.SKEW(HDD.NUM_TRUE(SMART_observations.failure = 1)),models.SKEW(HDD.PERCENT_TRUE(SMART_observations.failure = 1)),models.MIN(HDD.NUM_TRUE(SMART_observations.failure = 1)),models.MIN(HDD.PERCENT_TRUE(SMART_observations.failure = 1)),models.MEAN(HDD.NUM_TRUE(SMART_observations.failure = 1)),models.MEAN(HDD.PERCENT_TRUE(SMART_observations.failure = 1))
time,serial_number,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2017-01-01,45CHK11WFMYB,TOSHIBA MD04ABA400V,4000787000000.0,100.0,0.0,0.0,0.0,100.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-01-01,45D7K132FMYB,TOSHIBA MD04ABA400V,4000787000000.0,100.0,0.0,0.0,0.0,100.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-01-01,5641SFR3S,TOSHIBA MQ01ABF050,500107900000.0,100.0,0.0,0.0,0.0,100.0,4.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-01-01,5641SFRIS,TOSHIBA MQ01ABF050,500107900000.0,100.0,0.0,0.0,0.0,100.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2017-01-01,MJ0351YNG9P2NA,Hitachi HDS5C3030ALA630,3000593000000.0,100.0,0.0,0.0,0.0,100.0,14.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Selecting features
DFS generates over 1000 features for this dataset. Many of them won't be useful, so we an do a pass of supervised feature selection before building the deep learning model.

NOTE: if DFS produces fewer features (100 is a good maximum), then no feature selection is necessary

To do this, we use a Random Forest Classifier's built-in feature importances.

First, one-hot-encode categoricals and drop zero-variance features

In [11]:
ftens, fl = ft.encode_features(ftens, fl)
ftens, fl = remove_low_information_features(ftens, fl)

Now, impute missing values and train a Random Forest on the last cutoff time for each hard drive.

In [12]:
est = RandomForestClassifier(n_estimators=1000, class_weight='balanced', n_jobs=-1, verbose=True)
imputer = Imputer(missing_values='NaN', strategy="mean", axis=0)
selector = SelectFromModel(est, threshold="mean")
pipeline = Pipeline([("imputer", imputer),("selector", selector)])

In [13]:
fm = ftens.groupby(level='serial_number').last()

In [14]:
pipeline.fit(fm, labels)

[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    2.6s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:    3.2s finished


Pipeline(memory=None,
     steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('selector', SelectFromModel(estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None,...state=None, verbose=True, warm_start=False),
        norm_order=1, prefit=False, threshold='mean'))])

Get the selected features from the pipeline and subselect the feature list

In [15]:
selected = set(fm.loc[:, pipeline.steps[-1][1].get_support()].columns.tolist())

In [16]:
fl_selected = [f for f in fl if f.get_name() in selected]

We can save for reuse in the future

In [17]:
ft.save_features(fl_selected, "fl_backblaze_selected.p")

Select the important features from the original feature matrix/tensor

In [18]:
ftens = ftens[[f.get_name() for f in fl]]

In [19]:
ftens.to_csv('backblaze_ftens.csv')

## Initialize DLDB with desired hyperparameters

In this example, we use 2 fairly small [LSTM](https://keras.io/layers/recurrent/) layers and 2 feed-forward layers (called "Dense layers" in Keras/Tensor Flow terminology). DLDB has an extremely simple API, and exposes a large number of hyperparameters, so is amenable to hyperparameter optimization algorithms.

Each categorical feature will be mapped to a 12-dimensional embedding, with a maximum of 20 unique categorical values (the top 20 most frequent values will be chosen, and the rest will be converted to a single token).

In [20]:
dl_model = DLDB(
    regression=False,
    classes=[False, True],
    recurrent_layer_sizes=(32, 32),
    dense_layer_sizes=(32, 16),
    dropout_fraction=0.2,
    recurrent_dropout_fraction=0.2,
    categorical_embedding_size=12,
    categorical_max_vocab=20)

## Train the model and test using cross-validation

We use a `batch_size` of 128 (for each gradient update step) and train over 3 passes of the dataset (epochs).

In [21]:
n_splits=7
splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

In [22]:
cv_score = []

for train_test_index in splitter.split(labels, labels):
    train_labels = labels.reset_index('cutoff', drop=True).iloc[train_test_index[0]]
    test_labels = labels.reset_index('cutoff', drop=True).iloc[train_test_index[1]]
    train_ftens = ftens.reset_index('time', drop=True).loc[train_labels.index, :]
    test_ftens = ftens.reset_index('time', drop=True).loc[test_labels.index, :]

    dl_model.fit(
        train_ftens, train_labels, fl=fl,
        batch_size=128,
        # Set this to number of cores
        workers=8,
        use_multiprocessing=True,
        shuffle=False,
        epochs=3)

    predictions = dl_model.predict(test_ftens)
    score = roc_auc_score(test_labels, predictions)
    print("cv score: ", score)
    cv_score.append(score)
mean_score = np.mean(cv_score)
stderr = 2 * (np.std(cv_score) / np.sqrt(n_splits))

print("DFS AUC %.2f +/- %.2f" % (mean_score, stderr))

Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.6439424230307876
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.8159736105557777
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.7424586776859503
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.6475206611570248
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.8011363636363636
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.762293388429752
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming inpu