In [1]:
import featuretools as ft
from featuretools.primitives import Day, Weekend, Weekday, Percentile
import pandas as pd
import numpy as np
import utils_instacart as utils
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from dldb import DLDB, tdfs
import os
ft.__version__

Using TensorFlow backend.


'0.1.18'

# DLDB: Using DFS for Smaller, Easier to Train Recurrent Neural Networks

Deep Feature Synthesis (DFS) works on time-varying, transaction-level data to generate powerful, interpretable features for machine learning. Vanilla DFS produces a 2-dimensional feature matrix that can be used with classic machine learning techniques, such as SVMs or Random Forests. These techniques need a fixed-size feature matrix where any time-dependence is summarized into historical statistics (e.g., the number of items a customer purchased in the past 30 days).

We can take more explicit advantage of the time-dimension in this type of data using Recurrent Neural Networks. RNNs take in sequences of features, where the 3rd dimension in our case would represent time. Since RNNs learn high-level features on their own, the usual approach when using multiple tables is just to join all of them together and use the raw values. We will show that approach as a baseline here.

Instead, we can use DFS to produce high-level features at different points in time, and then learn from these sequences of features, rather than raw data. This strategy essentially encodes prior human intuition and assumptions about relevant data transformations into the problem before letting the deep learning do its thing. Because the net doesn't have to learn every feature from scratch, we may be able to reduce training time, use a simpler net, not have to tweak as many hyperparameters, use less data, or boost performance. In this notebook, we will show that with a relatively simple network we can achieve pretty good AUC with the DFS features before the raw data network learns much of anything.
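
To make the difference in input shape concrete, here is a minimal sketch that turns a toy feature matrix indexed by `(user_id, time)` into the 3-dimensional array an RNN consumes. The column names and values are made up, every user is assumed to have the same number of cutoff times, and this is not necessarily how DLDB performs the conversion internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 2 users, 3 cutoff times each, 2 numeric features.
toy_index = pd.MultiIndex.from_product(
    [[1, 2], pd.date_range("2015-02-23", periods=3, freq="3D")],
    names=["user_id", "time"],
)
toy_fm = pd.DataFrame(
    {
        "order_count": [1, 2, 4, 0, 1, 1],
        "mean_basket_size": [5.0, 6.0, 5.5, 0.0, 3.0, 3.0],
    },
    index=toy_index,
).sort_index()

n_users = toy_fm.index.get_level_values("user_id").nunique()
n_steps = toy_fm.index.get_level_values("time").nunique()

# Rows are ordered by (user_id, time), so a plain reshape yields
# an array of shape (instances, time steps, features).
sequences = toy_fm.to_numpy().reshape(n_users, n_steps, toy_fm.shape[1])
print(sequences.shape)  # (2, 3, 2)
```

A classic model would instead see one row per user, with the time axis already collapsed into summary statistics.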

We'll try it on Instacart data, solving the same problem as this [previous Featuretools demo](https://github.com/Featuretools/predict_next_purchase), which used a Random Forest to predict whether a customer will buy a banana in their next Instacart order.

## DLDB Library

[DLDB](https://github.com/HDI-Project/DL-DB) is a utility library for building recurrent neural networks from a feature matrix with multiple cutoff times per instance. Internally, it uses the [Keras](https://keras.io) library (which in turn uses [TensorFlow](https://tensorflow.org)).

It works by mapping each categorical feature to a Keras Embedding layer in order to transform it into a dense, numeric vector. Then all the inputs are fed into several recurrent layers (specified in hyperparameters) and several feed-forward layers (also specified in hyperparameters). It also includes an optional 1-D convolutional layer that will be applied before the recurrent layers. 
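
As a conceptual sketch of what an embedding does, the snippet below uses plain NumPy rather than Keras. An embedding is a lookup table with one dense row per category code. The sizes, the random values, and the choice to concatenate the embedded vector with the numeric features at each time step are assumptions made for illustration, not a description of DLDB's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5        # hypothetical number of distinct category codes
embedding_size = 3    # length of the dense vector for each category

# An embedding layer is a trainable lookup table: one row per category code.
embedding_table = rng.normal(size=(vocab_size, embedding_size))

# One instance with 4 time steps: a category code plus 2 numeric features per step.
category_codes = np.array([0, 3, 3, 1])
numeric_features = rng.normal(size=(4, 2))

# Look up each code, then concatenate with the numeric features so every
# time step becomes a single dense vector for the recurrent layers.
embedded = embedding_table[category_codes]
step_inputs = np.concatenate([embedded, numeric_features], axis=1)
print(step_inputs.shape)  # (4, 5)
```

In a real network the table entries are learned during training, so categories that behave similarly can end up with similar vectors.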

We packaged DL-DB into a Python library that can be installed via pip:
    
```
pip install dldb
```

This library includes both a class to build these recurrent neural network models as well as a wrapper function around Featuretools (`tdfs()`) to create time-series features as input.

## 1. Load in the data

The data is partitioned into chunks based on `user_id`, and loaded into the Featuretools EntitySet format. See [the original demo](https://github.com/Featuretools/predict_next_purchase) for more explanation of how the data is partitioned and how the EntitySet is formed.

In [2]:
es = utils.load_entityset('partitioned_data/part_0/')

## 2. Construct labels

This utility function picks out a window of time, and finds which users bought bananas. Again, more explanation in [the original demo](https://github.com/Featuretools/predict_next_purchase).
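
The idea behind the labeling step can be sketched with plain pandas. The table, its column names, and the exact window boundaries below are hypothetical; the real `utils.make_labels` may differ in its details.

```python
import pandas as pd

# Hypothetical order log; column names are illustrative only.
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "order_time": pd.to_datetime(
        ["2015-02-20", "2015-03-05", "2015-03-10", "2015-04-15"]
    ),
    "product_name": ["Milk", "Banana", "Milk", "Banana"],
})

toy_cutoff = pd.Timestamp("2015-03-01")
toy_prediction_window = pd.Timedelta("28 days")  # 4 weeks

# Keep only purchases inside [cutoff, cutoff + prediction window).
in_prediction_window = toy_orders[
    (toy_orders["order_time"] >= toy_cutoff)
    & (toy_orders["order_time"] < toy_cutoff + toy_prediction_window)
]

# A user is labeled True if any purchase in the window was a banana.
bought_banana = in_prediction_window.groupby("user_id")["product_name"].apply(
    lambda names: (names == "Banana").any()
)

# Users with no purchases in the window get an explicit False label.
all_users = sorted(toy_orders["user_id"].unique())
toy_labels = bought_banana.reindex(all_users, fill_value=False)
print(toy_labels.tolist())  # [True, False, False]
```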

In [3]:
cutoff_time = pd.Timestamp('March 1, 2015')
training_window = ft.Timedelta("60 days")

In [4]:
label_times = utils.make_labels(es,
                                product_name="Banana",
                                cutoff_time=cutoff_time,
                                prediction_window=ft.Timedelta("4 weeks"),
                                training_window=training_window)
labels = label_times.set_index('user_id').sort_index()['label']

## Create time-stamped feature matrix using DFS

Here is where things start to get interesting. We use the [`tdfs` function in DLDB](https://github.com/HDI-Project/DL-DB/blob/master/dldb/tdfs.py) to produce a feature matrix with several rows per user. It works by adding additional cutoff times in the past to each `(user_id, cutoff_time)` provided in `label_times`.

This function has a few different ways of selecting these additional cutoff times. Here, we provide `window_size='3d'` and `start=cutoff_time - training_window`, which will go back in time in increments of 3 days until 60 days before the cutoff time of March 1st. This produces a sequence of 20 cutoff times per user.

We could have also specified `num_windows=20` and `window_size='3d'` to produce the same result.
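
The arithmetic behind these cutoff times can be checked with pandas alone. This sketch only reproduces the date grid; it does not call `tdfs`. Counting both endpoints gives 21 timestamps, which matches the 21 cutoff times in the progress output, while dropping one endpoint gives the 20 mentioned above. Which convention `tdfs` follows is not verified here.

```python
import pandas as pd

grid_cutoff = pd.Timestamp("2015-03-01")
grid_window = pd.Timedelta("60 days")

grid_start = grid_cutoff - grid_window  # 2014-12-31

# Every 3 days from the start of the training window up to the cutoff,
# with both endpoints included.
cutoff_grid = pd.date_range(start=grid_start, end=grid_cutoff, freq="3D")

print(cutoff_grid[0].date(), cutoff_grid[-1].date(), len(cutoff_grid))
# 2014-12-31 2015-03-01 21
```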

The rest of the arguments are standard DFS arguments. For an overview of DFS, check out the [Featuretools documentation](https://docs.featuretools.com/automated_feature_engineering/afe.html).

In [5]:
trans_primitives = [Day, Weekend, Weekday, Percentile]
fm, fl = tdfs(entityset=es,
              target_entity="users",
              cutoffs=label_times,
              trans_primitives=trans_primitives,
              training_window=training_window,
              max_depth=2,
              window_size='3d',
              start=cutoff_time - training_window,
              verbose=True)

fm = fm.sort_index()

Building features: 121it [00:00, 6344.76it/s]
Progress: 100%|██████████| 21/21 [02:55<00:00,  8.37s/cutoff time]


In [6]:
# Can save/restore our work without having to recompute feature matrix
#fm.to_csv('fm_part_0.csv')
#fm = pd.read_csv('fm_part_0.csv', parse_dates=['time'], index_col=['user_id', 'time'])

In [7]:
#ft.save_features(fl, 'fl_part_0.p')
#fl = ft.load_features('fl_part_0.p', es)

### Create Baseline Input Data

This "feature matrix" is created by joining all entities in the data together into one dataframe.
As with the feature matrix created by `tdfs`, we cut off the data at the cutoff time and use only the preceding 60 days of data.
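
A minimal sketch of this kind of denormalization, using two hypothetical tables. The table names, columns, and the inclusive/exclusive window boundaries are assumptions; `utils.denormalize_entityset` works on the full EntitySet and may handle these details differently.

```python
import pandas as pd

# Hypothetical parent and child tables; names and columns are illustrative only.
raw_orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 1, 2],
    "order_time": pd.to_datetime(["2014-12-01", "2015-02-10", "2015-03-02"]),
})
raw_items = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "product_name": ["Milk", "Banana", "Bread", "Banana"],
    "reordered": [False, True, False, True],
})

raw_cutoff = pd.Timestamp("2015-03-01")
raw_window = pd.Timedelta("60 days")

# Join each child row to its parent order, then keep only rows inside
# the training window that ends at the cutoff time.
joined = raw_items.merge(raw_orders, on="order_id", how="left")
rows_in_window = joined[
    (joined["order_time"] > raw_cutoff - raw_window)
    & (joined["order_time"] <= raw_cutoff)
]

# Index by (user_id, time) so each user owns a sequence of raw rows.
toy_denormalized = rows_in_window.set_index(["user_id", "order_time"]).sort_index()
print(len(toy_denormalized))  # 2
```

Only the two items from the February order survive: the December order falls before the window and the March 2 order falls after the cutoff.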

In [8]:
fm_denormalized = utils.denormalize_entityset(es, cutoff_time, training_window)
fm_denormalized.sort_index(inplace=True)

## Initialize DLDB with desired hyperparameters

In this example, we use 2 fairly small [LSTM](https://keras.io/layers/recurrent/) layers and 2 feed-forward layers (called "Dense layers" in Keras/TensorFlow terminology). DLDB has an extremely simple API and exposes a large number of hyperparameters, so it is amenable to hyperparameter optimization algorithms.

Each categorical feature will be mapped to a 12-dimensional embedding, with a maximum of 20 unique categorical values (the top 20 most frequent values will be chosen, and the rest will be converted to a single token).
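
The vocabulary cap can be illustrated with a few lines of pandas. The helper name `cap_vocabulary` and the `<OTHER>` token are hypothetical and only show the idea of keeping the most frequent values; DLDB's own implementation may differ.

```python
import pandas as pd

def cap_vocabulary(values, max_vocab, other_token="<OTHER>"):
    """Keep the max_vocab most frequent values; replace the rest with one token."""
    top_values = values.value_counts().index[:max_vocab]
    return values.where(values.isin(top_values), other_token)

products = pd.Series(["Banana", "Milk", "Banana", "Bread", "Milk", "Banana", "Kale"])
capped = cap_vocabulary(products, max_vocab=2)
print(capped.tolist())
# ['Banana', 'Milk', 'Banana', '<OTHER>', 'Milk', 'Banana', '<OTHER>']
```

Capping the vocabulary keeps each embedding table small and stops rare values from getting their own poorly trained vectors.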

In [6]:
dl_model = DLDB(
    regression=False,
    classes=[False, True],
    recurrent_layer_sizes=(32, 32),
    dense_layer_sizes=(32, 16),
    dropout_fraction=0.2,
    recurrent_dropout_fraction=0.2,
    categorical_embedding_size=12,
    categorical_max_vocab=20)

### Compile the network

**Note:** Doing this step outside of the cross-validation loop is *slightly* cheating because the network sees all the categorical values, including those that appear only in test folds, ahead of time. It most likely won't make a difference, and this step takes some time.

Feel free to move it inside of the cross-validation for loop and see how much the results change.

In [7]:
dl_model.compile(fm, fl=fl)

## Train the model and test using cross-validation

We use a `batch_size` of 128 (for each gradient update step) and train over 3 passes of the dataset (epochs).
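
To get a feel for how little training this is, we can count the gradient updates per fold. The sketch uses the 678 training samples reported in the log for the first fold and assumes the final, smaller batch of each epoch is still used for an update.

```python
import math

n_train = 678      # training samples reported for the first fold in the log
batch_size = 128
epochs = 3

# Each epoch walks through the training set once, one batch per update;
# the last batch of an epoch may be smaller than batch_size.
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 6 18
```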

In [8]:
cv_score = []
n_splits = 10
# No random_state is set, so the fold assignments differ between runs
splitter = StratifiedKFold(n_splits=n_splits, shuffle=True)

for i, train_test_index in enumerate(splitter.split(labels, labels)):
    train_labels = labels.iloc[train_test_index[0]]
    test_labels = labels.iloc[train_test_index[1]]
    train_fm = fm.loc[(train_labels.index, slice(None)), :]
    test_fm = fm.loc[(test_labels.index, slice(None)), :]


    dl_model.fit(
        train_fm, train_labels,
        # Provide this many samples to the network at a time
        batch_size=128,
        epochs=3,
        # After each epoch, test on a held out 10% validation set
        validation_split=0.1)
    
    predictions = dl_model.predict(test_fm)
    cv_score.append(roc_auc_score(test_labels, predictions))
mean_score = np.mean(cv_score)
stderr = 2 * (np.std(cv_score) / np.sqrt(n_splits))

print("AUC %.2f +/- %.2f" % (mean_score, stderr))

Transforming input matrix into numeric sequences
Fitting Keras model
Train on 678 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix into numeric sequences
Fitting Keras model
Train on 679 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix into numeric sequences
Fitting Keras model
Train on 679 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix into numeric sequences
Fitting Keras model
Train on 679 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix in

## Train the baseline model over raw data and test using cross-validation

We use the same hyperparameters here. Note that we tell DL-DB explicitly which features are categorical.

In [12]:
# all columns are categorical except the Boolean "reordered"
categorical_feature_names=[c for c in fm_denormalized.columns if c != 'reordered']


dl_model.compile(fm_denormalized, categorical_feature_names=categorical_feature_names)

In [13]:
cv_score = []

for i, train_test_index in enumerate(splitter.split(labels, labels)):
    train_labels = labels.iloc[train_test_index[0]]
    test_labels = labels.iloc[train_test_index[1]]
    train_fm = fm_denormalized.loc[(train_labels.index, slice(None)), :]
    test_fm = fm_denormalized.loc[(test_labels.index, slice(None)), :]


    dl_model.fit(
        train_fm, train_labels,
        # Provide this many samples to the network at a time
        batch_size=128,
        epochs=3,
        # After each epoch, test on a held out 10% validation set
        validation_split=0.1)
    
    predictions = dl_model.predict(test_fm)
    cv_score.append(roc_auc_score(test_labels, predictions))
    
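    # Stop after the first 4 folds (i = 0..3)
    if i == 3:
        break
mean_score = np.mean(cv_score)
# Only 4 folds were scored, yet this interval still divides by sqrt(n_splits) = sqrt(10)
stderr = 2 * (np.std(cv_score) / np.sqrt(n_splits))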

print("AUC %.2f +/- %.2f" % (mean_score, stderr))

Transforming input matrix into numeric sequences
Fitting Keras model
Train on 678 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix into numeric sequences
Fitting Keras model
Train on 679 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix into numeric sequences
Fitting Keras model
Train on 679 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
Transforming input matrix into numeric sequences
Fitting Keras model
Train on 679 samples, validate on 76 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Transforming input matrix into numeric sequences
Predicting using Keras model
Transforming outputs
AUC 0.54 +/- 0.06
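
The "+/-" figure printed above is twice the standard deviation of the fold scores divided by the square root of a fold count. Because the baseline loop stops after four folds, a variant that divides by the number of folds actually scored may be more appropriate. The sketch below shows that variant with made-up scores; the helper name `summarize_auc` is hypothetical.

```python
import numpy as np

def summarize_auc(scores):
    """Return the mean and 2 * std / sqrt(n) for the folds actually scored."""
    scores = np.asarray(scores, dtype=float)
    half_width = 2 * scores.std() / np.sqrt(len(scores))
    return scores.mean(), half_width

# Made-up fold scores, for illustration only.
toy_scores = [0.50, 0.60, 0.50, 0.60]
mean_auc, half_width = summarize_auc(toy_scores)
print("AUC %.2f +/- %.2f" % (mean_auc, half_width))  # AUC 0.55 +/- 0.05
```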


### Conclusions

The model using DFS features scored an AUC over 30% higher than the raw data model with the same hyperparameters.

Try increasing the number of epochs for the raw data model; eventually its scores approach those of the DFS model. I found that increasing from 3 to 7 epochs raised the raw data score from 0.53 to 0.66.

This is an interesting result, and hints at the idea that using good features to start out with can reduce the training time of deep learning models.

There are many more ideas we can test here:
 * What happens as we increase/decrease the amount of data we use? Remember that we used only a single partition (out of over 100).
 * How many rounds of hyperparameter optimization do we have to do to achieve the same result?
 * What if we use a more complex network?
 * Is it possible to visualize the effect of the input features on the LSTM network? This is a hard problem in deep learning in general.