In [1]:
import pandas as pd
import numpy as np
import utils_instacart as utils
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from dldb import DLDB
import os

Using TensorFlow backend.


## 1. Load in the data



The data is partitioned into chunks based on `user_id` and loaded into the Featuretools EntitySet format. See [the original demo](https://github.com/Featuretools/predict_next_purchase) for more explanation of how the data is partitioned and how the EntitySet is formed.

In [2]:
es = utils.load_entityset('partitioned_data/part_0/')

### Create Baseline Input Data

We first load the data into an EntitySet, in the same way we do for the DFS case. Then we merge all the tables together to form a single denormalized table. Creating the EntitySet is not strictly necessary, but we do it here because the `load_entityset` function takes care of much of the data formatting. Alternatively, you can load the raw data, format it, and merge it yourself.

We make sure to cut off the data at the cutoff time, and only use the 60 days of data before it.

In [3]:
cutoff_time = pd.Timestamp('March 1, 2015')
training_window = pd.Timedelta("60 days")
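
To make the windowing concrete, here is a minimal sketch of the kind of filter that a cutoff time and training window imply. The toy `orders` table and the choice of which boundary is inclusive are assumptions made for illustration; `utils.denormalize_entityset` may handle the boundaries differently.

```python
import pandas as pd

# Toy order log; the column names mirror the notebook, the rows are made up.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "order_time": pd.to_datetime(
        ["2014-12-15", "2015-01-20", "2015-02-10", "2015-03-05"]),
    "product_name": ["Soda", "Banana", "Banana", "Soda"],
})

cutoff_time = pd.Timestamp("March 1, 2015")
training_window = pd.Timedelta("60 days")

# Keep only rows in (cutoff_time - training_window, cutoff_time].
# Which end is inclusive is an assumption of this sketch.
window_start = cutoff_time - training_window
in_window = (orders["order_time"] > window_start) & (orders["order_time"] <= cutoff_time)
windowed = orders[in_window]

print(windowed)
```

Rows dated after the cutoff are dropped so that no information from the prediction period leaks into the training input.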

In [4]:
ftens_denormalized = utils.denormalize_entityset(es, cutoff_time, training_window)
ftens_denormalized.sort_index(inplace=True)
# Since we will never have the same order_id or order_product_id in the test set, we can't learn much from them
ftens_denormalized.drop(['order_id', 'order_product_id'], axis=1, inplace=True)
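
If you want to skip the EntitySet and merge the raw tables yourself, the denormalization could look roughly like the sketch below. The `orders` and `order_products` tables and their rows are made up for illustration, and this is not the implementation of `utils.denormalize_entityset`.

```python
import pandas as pd

# Hypothetical raw tables, one row per order and one row per purchased item.
orders = pd.DataFrame({
    "order_id": [10, 11],
    "user_id": [1, 2],
    "order_time": pd.to_datetime(["2015-01-01 08:00", "2015-01-02 09:00"]),
})
order_products = pd.DataFrame({
    "order_product_id": [100, 101, 102],
    "order_id": [10, 10, 11],
    "product_name": ["Soda", "Original Beef Jerky", "Banana"],
    "reordered": [0, 0, 1],
})

# Attach each item to its order, index by (user_id, order_time),
# and drop the identifier columns that carry no predictive signal.
flat = (order_products
        .merge(orders, on="order_id", how="left")
        .set_index(["user_id", "order_time"])
        .sort_index()
        .drop(columns=["order_id", "order_product_id"]))

print(flat)
```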

In [5]:
ftens_denormalized.head()

| user_id | order_time | reordered | product_name | aisle_id | department |
|---|---|---|---|---|---|
| 1 | 2015-01-01 08:00:00 | 0 | Soda | 77 | beverages |
| 1 | 2015-01-01 08:00:00 | 0 | Organic Unsweetened Vanilla Almond Milk | 91 | dairy eggs |
| 1 | 2015-01-01 08:00:00 | 0 | Original Beef Jerky | 23 | snacks |
| 1 | 2015-01-01 08:00:00 | 0 | Aged White Cheddar Popcorn | 23 | snacks |
| 1 | 2015-01-01 08:00:00 | 0 | XL Pick-A-Size Paper Towel Rolls | 54 | household |


## 2. Construct labels

This utility function takes the prediction window that follows the cutoff time (28 days here) and finds which users bought bananas in it. See [the original demo](https://github.com/Featuretools/predict_next_purchase) for more explanation.

In [6]:
label_times = utils.make_labels(es,
                                product_name="Banana",
                                cutoff_time=cutoff_time,
                                prediction_window=pd.Timedelta("28d"),
                                training_window=training_window)
labels = label_times.set_index('user_id').sort_index()['label']
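
For intuition, the labeling step can be sketched with plain pandas as below. The `purchases` table is made up, and the real `utils.make_labels` may differ in details, for example in which users it keeps or how it treats the window boundaries.

```python
import pandas as pd

# Toy purchase log covering the period after the cutoff (made-up rows).
purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "order_time": pd.to_datetime(
        ["2015-03-03", "2015-03-20", "2015-03-10", "2015-04-15"]),
    "product_name": ["Banana", "Soda", "Soda", "Banana"],
})

cutoff_time = pd.Timestamp("March 1, 2015")
prediction_window = pd.Timedelta("28d")

# A user is labeled True if they bought the product inside
# (cutoff_time, cutoff_time + prediction_window].
in_window = ((purchases["order_time"] > cutoff_time)
             & (purchases["order_time"] <= cutoff_time + prediction_window))
bought = purchases[in_window & (purchases["product_name"] == "Banana")]

users = pd.Index(sorted(purchases["user_id"].unique()), name="user_id")
labels = pd.Series(users.isin(bought["user_id"]), index=users, name="label")

print(labels)
```

User 3 bought a banana, but only after the 28-day window closed, so that user is labeled `False`.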

## Initialize DLDB with desired hyperparameters

In this example, we use 2 fairly small [LSTM](https://keras.io/layers/recurrent/) layers and 2 feed-forward layers (called "Dense layers" in Keras/TensorFlow terminology). DLDB has an extremely simple API and exposes a large number of hyperparameters, so it is amenable to hyperparameter optimization algorithms.

Each categorical feature will be mapped to a 12-dimensional embedding, with a maximum of 20 unique categorical values (the top 20 most frequent values will be chosen, and the rest will be converted to a single token).

In [7]:
dl_model = DLDB(
    regression=False,
    classes=[False, True],
    recurrent_layer_sizes=(32, 32),
    dense_layer_sizes=(32, 16),
    dropout_fraction=0.2,
    recurrent_dropout_fraction=0.2,
    categorical_embedding_size=12,
    categorical_max_vocab=20)
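
The vocabulary cap described above can be illustrated with a small sketch. The helper `cap_vocabulary` and the `"<other>"` token are hypothetical names chosen for this example; they show the idea of keeping the most frequent values and are not DLDB's internal code.

```python
import pandas as pd

def cap_vocabulary(values, max_vocab, other_token="<other>"):
    """Keep the max_vocab most frequent values; map the rest to one token."""
    top = values.value_counts().index[:max_vocab]
    return values.where(values.isin(top), other_token)

# Made-up categorical column with four distinct values.
departments = pd.Series(
    ["snacks", "snacks", "dairy", "dairy", "dairy", "produce", "bakery"])

capped = cap_vocabulary(departments, max_vocab=2)
print(capped.tolist())
```

With `max_vocab=2`, only the two most frequent values survive and every rare value shares a single token, which bounds the number of embedding rows per feature.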

## Train the model and test using cross-validation

We use a `batch_size` of 128 (for each gradient update step) and train over 6 passes of the dataset (epochs).

In [8]:
n_splits=20
splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
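
As a quick illustration of what stratification does, the sketch below splits a made-up, imbalanced label vector and counts the positives in each test fold. The label vector and the smaller `n_splits=5` are chosen only to keep the example short.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 20 positives and 80 negatives.
y = np.array([True] * 20 + [False] * 80)

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each test fold should receive the same share of positives as the full set.
positives_per_fold = [int(y[test_idx].sum())
                      for _, test_idx in splitter.split(y, y)]

print(positives_per_fold)
```

With 20 folds, each test fold holds only one twentieth of the users, so the per-fold AUC values can vary widely from fold to fold.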

Note that we tell DLDB explicitly which feature names are categorical (all of them).

In [9]:
cv_score = []

# Index by user_id only, so each fold can be selected using the label index
ftens_by_user = ftens_denormalized.reset_index('order_time', drop=True)

for train_index, test_index in splitter.split(labels, labels):
    train_labels = labels.iloc[train_index]
    test_labels = labels.iloc[test_index]
    train_ftens = ftens_by_user.loc[train_labels.index, :]
    test_ftens = ftens_by_user.loc[test_labels.index, :]

    dl_model.fit(
        train_ftens, train_labels,
        categorical_feature_names=train_ftens.columns,
        # Provide this many samples to the network at a time
        batch_size=128,
        epochs=6,
        # Set this to number of cores
        workers=8,
        use_multiprocessing=True,
        shuffle=False)
    
    predictions = dl_model.predict(test_ftens)
    score = roc_auc_score(test_labels, predictions)
    print("cv score: ", score)
    cv_score.append(score)

mean_score = np.mean(cv_score)
# Two standard errors of the mean across folds (roughly a 95% interval)
two_stderr = 2 * (np.std(cv_score) / np.sqrt(n_splits))

print("AUC %.2f +/- %.2f" % (mean_score, two_stderr))

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.5946428571428571
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.41785714285714287
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.3214285714285714
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.49464285714285716
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.6017857142857143
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epo

Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.7773109243697479
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.7373949579831933
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.8067226890756303
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.5840336134453782
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor into numeric sequences
Predicting using Keras model
Transforming outputs
cv score:  0.5924369747899159
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Transforming input tensor