# Predicting a customer's next purchase using automated feature engineering

<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

**As customers use your product, they leave behind a trail of behaviors that indicate how they will act in the future. Through automated feature engineering we can identify the predictive patterns in granular customer behavioral data that can be used to improve the customer's experience and generate additional revenue for your business.**

In this tutorial, we show how [Featuretools](http://www.featuretools.com) can be used to perform feature engineering on a multi-table dataset of 3 million online grocery orders provided by Instacart. We will generate a feature matrix that can be used to train a machine learning model to predict what product a customer buys next.

*Note: This notebook requires a dataset from Instacart. You can download the dataset [here](https://www.instacart.com/datasets/grocery-shopping-2017). Once you have downloaded the data, be sure to place the CSV files contained in the archive in a directory called `data`. If you use a different directory name, you will need to update the code below to point to the proper location.*

## Highlights

* We automatically generate features using Deep Feature Synthesis and select the 20 most important features for predictive modeling
* We demonstrate how to generate features in a scalable manner using [Dask](http://dask.pydata.org/en/latest/)
* We automatically generate label times using [Compose](https://github.com/FeatureLabs/compose) which can be reused for numerous prediction problems
* We develop a model for predicting what a customer will buy next, starting with a sample of data and then scaling to the full dataset

## You must have Featuretools version 0.16.0 or greater installed to run this notebook

In [1]:
import os
import composeml as cp
import featuretools as ft
import dask.dataframe as dd
import numpy as np
import pandas as pd
import utils
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

from dask.distributed import Client
ft.__version__

'0.17.0'

## Step 1. Load and preprocess data

First, we will create a Dask distributed client so we can track the progress of our computation on the Dask dashboard that is created when the client is initialized.

In [2]:
client = Client(n_workers=2)
client

Client: Scheduler: tcp://127.0.0.1:63436, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 2, Cores: 16, Memory: 17.18 GB


Next, we will specify our input and output directories and set the blocksize we will be using to read the raw CSV files into Dask dataframes. When running on a machine with 16GB of memory available for two Dask workers, a `100MB` blocksize has worked well. This number may need to be adjusted based on your specific environment. Refer to the [Dask documentation](https://docs.dask.org/en/latest/best-practices.html) for additional info.

In [3]:
data_dir = os.path.join("data")
output_dir = os.path.join("data", "dask_data")
blocksize = "100MB"

Now we will read our data into Dask dataframes. This operation completes quickly because Dask evaluates lazily: no data is actually brought into memory at this stage.

In [4]:
order_products = dd.concat([dd.read_csv(os.path.join(data_dir, "order_products__prior.csv"), blocksize=blocksize),
                            dd.read_csv(os.path.join(data_dir, "order_products__train.csv"), blocksize=blocksize)])
orders = dd.read_csv(os.path.join(data_dir, "orders.csv"), blocksize=blocksize)
departments = dd.read_csv(os.path.join(data_dir, "departments.csv"), blocksize=blocksize)
products = dd.read_csv(os.path.join(data_dir, "products.csv"), blocksize=blocksize)

In the next few cells, we will perform some required preprocessing to clean up our data. We will merge together some of the raw dataframes and add absolute order time information from the relative times used in the raw data. This will allow us to use cutoff times as part of the Deep Feature Synthesis process.

In [5]:
order_products = order_products.merge(products).merge(departments)

In [6]:
def add_time(df):
    df = df.reset_index(drop=True)  # reset_index returns a new dataframe, so keep the result
    df["order_time"] = np.nan
    days_since = df.columns.tolist().index("days_since_prior_order")
    hour_of_day = df.columns.tolist().index("order_hour_of_day")
    order_time = df.columns.tolist().index("order_time")

    df.iloc[0, order_time] = pd.Timestamp('Jan 1, 2015') + pd.Timedelta(df.iloc[0, hour_of_day], "h")
    for i in range(1, df.shape[0]):
        df.iloc[i, order_time] = (df.iloc[i - 1, order_time]
                                  + pd.Timedelta(df.iloc[i, days_since], "d")
                                  + pd.Timedelta(df.iloc[i, hour_of_day], "h"))

    to_drop = ["order_number", "order_dow", "order_hour_of_day", "days_since_prior_order", "eval_set"]
    df.drop(to_drop, axis=1, inplace=True)

    return df
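
The loop in `add_time` amounts to a running sum: each order's timestamp is the previous order's timestamp plus its `days_since_prior_order` days and its `order_hour_of_day` hours. As a minimal plain-Python sketch of that logic for a single user (the helper name `absolute_order_times` and the input values are made up for illustration):

```python
from datetime import datetime, timedelta

def absolute_order_times(days_since_prior, hours_of_day, start=datetime(2015, 1, 1)):
    """Rebuild absolute timestamps for one user's orders (hypothetical helper).

    Mirrors add_time: the first order falls on the start date at its hour of day;
    every later order adds its day gap and its hour to the previous timestamp.
    """
    times = [start + timedelta(hours=hours_of_day[0])]
    for days, hour in zip(days_since_prior[1:], hours_of_day[1:]):
        times.append(times[-1] + timedelta(days=days, hours=hour))
    return times

# Made-up values; the first order has no prior order, so its gap is ignored.
order_times = absolute_order_times([None, 6, 12], [18, 16, 18])
for t in order_times:
    print(t)
```

Because each hour is added to a timestamp that already carries the previous order's hour, the results are best read as a consistent ordering of a user's orders rather than as true clock times.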

In [7]:
orders = orders.groupby("user_id").apply(add_time)
order_products = order_products.merge(orders[["order_id", "order_time"]])
order_products["order_product_id"] = order_products["order_id"] * 1000 + order_products["add_to_cart_order"]
order_products = order_products.drop(["product_id", "department_id", "add_to_cart_order"], axis=1)
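
The `order_product_id` column built above packs two values into one integer so that every row gets its own index. The sketch below shows the idea in plain Python; the helper name is hypothetical, and the range check encodes an assumption the notebook does not state, namely that `add_to_cart_order` always stays below 1000.

```python
def make_order_product_id(order_id, add_to_cart_order):
    """Combine an order id and a cart position into one integer key.

    Assumes the cart position is below 1000; otherwise keys could collide.
    """
    if not 0 <= add_to_cart_order < 1000:
        raise ValueError("cart position must be in [0, 1000) for the key to be unique")
    return order_id * 1000 + add_to_cart_order

key = make_order_product_id(5864, 2)
print(key)  # 5864002

# The two parts can be recovered with integer division and remainder.
order_id, cart_position = divmod(key, 1000)
```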



Now that the preprocessing work is complete, we will save the results to disk. This will allow us to start from this point in the process in the future, without having to repeat all of the preprocessing steps. If you have already saved the results to disk previously, you can skip the cell below.

#### Note: The process of saving to CSV is computationally intensive and may take 45 minutes or more, depending on the system you are using. You can use the Dask dashboard to monitor the progress.

In [None]:
# Save preprocessed data to disk
orders.to_csv(os.path.join(output_dir, "orders-*.csv"), index=False)
order_products.to_csv(os.path.join(output_dir, "order_products-*.csv"), index=False)

If you have already performed the preprocessing steps and saved the processed files to disk, you can read them in with the commands in the following cell.

In [8]:
# Read preprocessed data from disk
orders = dd.read_csv(os.path.join(output_dir, "orders-*.csv"), blocksize=blocksize)
order_products = dd.read_csv(os.path.join(output_dir, "order_products-*.csv"), blocksize=blocksize)

In [9]:
orders.head()

Unnamed: 0,order_id,user_id,order_time
0,2086598,6,2015-01-01 18:00:00
1,298250,6,2015-01-08 10:00:00
2,998866,6,2015-01-21 04:00:00
3,1528013,6,2015-02-12 20:00:00
4,2565571,7,2015-01-01 09:00:00


In [10]:
order_products.head()

Unnamed: 0,order_id,reordered,product_name,aisle_id,department,order_time,order_product_id
0,5864,0,Organic Egg Whites,86,dairy eggs,2015-03-01 03:00:00,5864002
1,5864,0,Feta Cheese Crumbles,21,dairy eggs,2015-03-01 03:00:00,5864007
2,5864,0,Organic Extra Large Grade AA Brown Eggs,86,dairy eggs,2015-03-01 03:00:00,5864001
3,5864,0,Total 0% Nonfat Plain Greek Yogurt,120,dairy eggs,2015-03-01 03:00:00,5864005
4,5864,0,Mini Babybel Light Semisoft Edam Cheeses,21,dairy eggs,2015-03-01 03:00:00,5864010


## Step 2. Create a Featuretools entityset

When using Dask dataframes to create an entityset, variable type inference is not performed as it is with entitysets created from pandas dataframes. As a result, users must specify the Featuretools variable types for all of the columns in the dataframes that make up the entityset when using Dask. In the following cell we define the data types for the `order_products` and `orders` entities.

In [11]:
order_products_vtypes = {
    "order_id": ft.variable_types.Id,
    "reordered": ft.variable_types.Boolean,
    "product_name": ft.variable_types.Categorical,
    "aisle_id": ft.variable_types.Categorical,
    "department": ft.variable_types.Categorical,
    "order_time": ft.variable_types.Datetime,
    "order_product_id": ft.variable_types.Index,
}

order_vtypes = {
    "order_id": ft.variable_types.Index,
    "user_id": ft.variable_types.Id,
    "order_time": ft.variable_types.DatetimeTimeIndex,
}

Now that we have defined the data types, we can create the entityset and establish the relationship between the two entities. For our initial run we will use a sample of the full data to determine what features are the best predictors. Once we have the feature importances established we will rerun on the full dataset using only the most important features.

First we will sample our data by taking the orders for the first 1000 distinct customers. Because we cannot pass a Dask series to `.isin()`, we must call `.compute()` on the ids to convert them into a pandas series.

In [12]:
ids = orders['user_id'].unique().compute()[0:1000]
orders_sample = orders[orders['user_id'].isin(ids)]

Next we will get the order products associated with the orders we sampled. 

In [13]:
order_products_sample = order_products[order_products['order_id'].isin(orders_sample['order_id'].compute())]
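
The sampling above is a two-stage filter: pick a set of users, keep their orders, then keep only the order products that belong to those orders. Here is a small plain-Python sketch of the same idea, using made-up rows and a sample size of 2 instead of 1000:

```python
# Made-up rows for illustration.
orders_rows = [(10, 1), (11, 2), (12, 1), (13, 3), (14, 2), (15, 4)]  # (order_id, user_id)
order_products_rows = [
    (10, "Banana"), (11, "Limes"), (13, "Banana"),
    (15, "Limes"), (12, "Organic Avocado"),
]  # (order_id, product_name)

# Stage 1: the first n distinct user ids, in order of appearance.
n = 2
sample_user_ids = set(list(dict.fromkeys(user for _, user in orders_rows))[:n])

# Stage 2: orders placed by the sampled users.
orders_sample = [row for row in orders_rows if row[1] in sample_user_ids]

# Stage 3: order products that belong to the sampled orders.
sample_order_ids = {order_id for order_id, _ in orders_sample}
order_products_sample = [row for row in order_products_rows if row[0] in sample_order_ids]

print(orders_sample)
print(order_products_sample)
```

Note that this takes the first ids encountered rather than a random draw, as the notebook's slice `[0:1000]` does.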

Now that we have sampled our data, we can create an entityset from these sampled dataframes.

In [14]:
es = ft.EntitySet("instacart_sample")
es.entity_from_dataframe(entity_id="order_products",
                         dataframe=order_products_sample,
                         index="order_product_id",
                         variable_types=order_products_vtypes,
                         time_index="order_time")

es.entity_from_dataframe(entity_id="orders",
                         dataframe=orders_sample,
                         index="order_id",
                         variable_types=order_vtypes,
                         time_index="order_time")

es.add_relationship(ft.Relationship(es["orders"]["order_id"], es["order_products"]["order_id"]))

Entityset: instacart_sample
  Entities:
    order_products [Rows: Delayed('int-416b06c4-f86b-4d71-bab4-b0bf71648a2b'), Columns: 7]
    orders [Rows: Delayed('int-bd26e6eb-531b-46e3-8408-3e4a2d0e6cf9'), Columns: 3]
  Relationships:
    order_products.order_id -> orders.order_id

Next, we will normalize the `orders` entity to create a new `users` entity that we will later use as the target entity during the deep feature synthesis process.

In [15]:
es.normalize_entity(base_entity_id="orders", new_entity_id="users", index="user_id")

Entityset: instacart_sample
  Entities:
    order_products [Rows: Delayed('int-c41f7235-1960-4192-9364-2bd42f1ae0e3'), Columns: 7]
    orders [Rows: Delayed('int-3f56ed1f-f5b8-4fc6-b77c-dee5a712a4df'), Columns: 3]
    users [Rows: Delayed('int-85264b0b-4050-48f8-b2f7-214c69dac75c'), Columns: 2]
  Relationships:
    order_products.order_id -> orders.order_id
    orders.user_id -> users.user_id
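
Conceptually, normalizing `orders` on `user_id` produces one row per distinct user. The two columns reported for `users` appear to be the id and the time of the user's first order (the `first_orders_time` column that shows up in feature names later on). A rough plain-Python equivalent, using a few rows from the `orders.head()` output above and a hypothetical helper name, could be:

```python
from datetime import datetime

orders_rows = [
    {"order_id": 2086598, "user_id": 6, "order_time": datetime(2015, 1, 1, 18)},
    {"order_id": 298250, "user_id": 6, "order_time": datetime(2015, 1, 8, 10)},
    {"order_id": 2565571, "user_id": 7, "order_time": datetime(2015, 1, 1, 9)},
]

def normalize_users(rows):
    """Build a users table: one row per user_id with the earliest order time."""
    first_seen = {}
    for row in rows:
        uid, t = row["user_id"], row["order_time"]
        if uid not in first_seen or t < first_seen[uid]:
            first_seen[uid] = t
    return [
        {"user_id": uid, "first_orders_time": t}
        for uid, t in sorted(first_seen.items())
    ]

users = normalize_users(orders_rows)
for user in users:
    print(user)
```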

To finish creating the entityset, we will add last time indexes and set some interesting values.

In [16]:
es.add_last_time_indexes()

In [17]:
es["order_products"]["department"].interesting_values = ['produce', 'dairy eggs', 'snacks', 'beverages', 'frozen', 'pantry', 'bakery', 'canned goods', 'deli', 'dry goods pasta']
es["order_products"]["product_name"].interesting_values = ['Banana', 'Bag of Organic Bananas', 'Organic Baby Spinach', 'Organic Strawberries', 'Organic Hass Avocado', 'Organic Avocado', 'Large Lemon', 'Limes', 'Strawberries', 'Organic Whole Milk']
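
Interesting values are what give rise to the conditional features seen later, such as `COUNT(order_products WHERE department = snacks)`: one count per listed value. A minimal sketch of that computation for a single user, with made-up purchase rows, might be:

```python
from collections import Counter

interesting_departments = ["produce", "dairy eggs", "snacks"]

# Made-up order products for one user: (product_name, department)
rows = [
    ("Banana", "produce"),
    ("Organic Whole Milk", "dairy eggs"),
    ("Limes", "produce"),
    ("Sparkling Water", "beverages"),
]

counts = Counter(department for _, department in rows)

# One feature per interesting value; values that never occur count as 0.
where_features = {
    f"COUNT(order_products WHERE department = {d})": counts.get(d, 0)
    for d in interesting_departments
}

for name, value in where_features.items():
    print(name, value)
```

Departments that are not listed as interesting (here `beverages`) produce no conditional feature of their own in this sketch.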

## Step 3. Use Compose to generate our cutoff times dataframe

In the cells that follow we will demonstrate how [Compose](https://github.com/FeatureLabs/compose) can be used to generate the label times dataframe that will be used as cutoff times for deep feature synthesis.

First we define a labeling function that returns `True` if a user bought a specific product within the window and `False` otherwise. We then create our `LabelMaker` using this function.

In [18]:
def bought_product(df, product_name):
    purchased = df.product_name.str.contains(product_name).any()
    return purchased

In [19]:
lm = cp.LabelMaker(
    target_entity='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='4w',
)

In [20]:
def denormalize(es):
    df = es['order_products'].df.merge(es['orders'].df).merge(es['users'].df)
    return df

Compose does not currently work on Dask dataframes, so we must first run `.compute()` on the denormalized entityset to switch to pandas.

In [21]:
df = denormalize(es).compute()

Now we can create our labels, indicating whether or not a user has purchased Bananas.

In [22]:
label_times = lm.search(
    df.sort_values('order_time'),
    minimum_data='2015-03-15',
    num_examples_per_instance=2,
    product_name='Banana',
    verbose=True,
)

Elapsed: 00:02 | Remaining: 00:00 | Progress: 100%|██████████| user_id: 2000/2000 


In [23]:
label_times.head()

id,user_id,time,bought_product
0,7,2015-03-15,False
1,7,2015-04-12,False
2,14,2015-03-15,False
3,14,2015-04-12,False
4,16,2015-03-15,True
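
The label times above can be understood as the result of sliding a 4-week window over each user's purchases and applying the labeling function to each window. The following is a simplified model of that process, not Compose's implementation: it assumes consecutive, non-overlapping windows starting at `minimum_data`, and it labels a window with no purchases as `False`. The helper name and the purchase rows are made up.

```python
from datetime import datetime, timedelta

def make_label_times(purchases, start, window=timedelta(weeks=4),
                     n_examples=2, product="Banana"):
    """Label consecutive windows for one user.

    purchases: list of (order_time, product_name) tuples.
    Returns a list of (cutoff_time, bought_product) pairs.
    """
    labels = []
    for i in range(n_examples):
        window_start = start + i * window
        window_end = window_start + window
        # Substring match, like str.contains in the labeling function above.
        bought = any(
            window_start <= t < window_end and product in name
            for t, name in purchases
        )
        labels.append((window_start, bought))
    return labels

purchases = [
    (datetime(2015, 3, 20, 10), "Banana"),
    (datetime(2015, 4, 20, 9), "Limes"),
]
labels = make_label_times(purchases, start=datetime(2015, 3, 15))
for cutoff, bought in labels:
    print(cutoff.date(), bought)
```

Because the match is on a substring, a product such as `Bag of Organic Bananas` would also count as a purchase of `Banana`.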


## Step 4. Run Deep Feature Synthesis

With our label times created, we are ready to run deep feature synthesis to generate our feature matrix. This will execute quickly and the resulting feature matrix will be returned as a Dask dataframe. This process does not cause the feature matrix to be computed or brought into memory.

When we use DFS, we specify
- `target_entity` - the table to build features for - `users` in this case
- `cutoff_time` - the point in time to calculate the features

A good way to think of the `cutoff_time` is that it lets us "pretend" we are at an earlier point in time when generating our features so we can simulate making predictions. We get this time for each customer from the label times we generated above.
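
To make the idea of a cutoff time concrete, the sketch below counts a user's orders as seen from two different points in time. It uses the four order times shown for user 6 in the `orders.head()` output; the helper name is hypothetical, and treating an order placed exactly at the cutoff as visible is an assumption of this sketch.

```python
from datetime import datetime

def count_orders_as_of(order_times, cutoff):
    """Count only the orders that are visible at the cutoff time."""
    return sum(1 for t in order_times if t <= cutoff)

order_times = [
    datetime(2015, 1, 1, 18),
    datetime(2015, 1, 8, 10),
    datetime(2015, 1, 21, 4),
    datetime(2015, 2, 12, 20),
]

early = count_orders_as_of(order_times, datetime(2015, 1, 15))
late = count_orders_as_of(order_times, datetime(2015, 3, 15))
print(early, late)  # 2 4
```

The same user therefore yields different feature values at different cutoff times, which is what keeps information from after the prediction point out of the features.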

For this initial run we will not specify any primitives, which will result in all the default primitives being used to create features.

In [24]:
feature_matrix, features = ft.dfs(target_entity="users",
                                  cutoff_time=label_times,
                                  entityset=es,
                                  verbose=True)

Built 59 features
Elapsed: 00:04 | Progress: 100%|██████████


Now that we have a Dask feature matrix, we can save it to disk for future use along with the features we generated. This process could take some time, but you can monitor the progress using the Dask dashboard.

In [25]:
ft.save_features(features, os.path.join(output_dir, "initial_features.txt"))
feature_matrix.to_csv(os.path.join(output_dir, "initial_feature_matrix-*.csv"), index=False)

['/Users/nate.parsons/dev/featuretools-demos/predict-next-purchase/data/dask_data/initial_feature_matrix-0.csv',
 '/Users/nate.parsons/dev/featuretools-demos/predict-next-purchase/data/dask_data/initial_feature_matrix-1.csv',
 '/Users/nate.parsons/dev/featuretools-demos/predict-next-purchase/data/dask_data/initial_feature_matrix-2.csv']

Next, let's read back in the feature matrix we just saved, compute it and take a look at what we created.

In [26]:
features = ft.load_features(os.path.join(output_dir, "initial_features.txt"))
fm = dd.read_csv(os.path.join(output_dir, "initial_feature_matrix-*.csv"), assume_missing=True).compute()
fm.head()

Unnamed: 0,COUNT(orders),COUNT(order_products),PERCENT_TRUE(order_products.reordered),NUM_UNIQUE(order_products.department),NUM_UNIQUE(order_products.product_name),NUM_UNIQUE(order_products.aisle_id),DAY(first_orders_time),YEAR(first_orders_time),MONTH(first_orders_time),WEEKDAY(first_orders_time),...,COUNT(order_products WHERE product_name = Organic Whole Milk),COUNT(order_products WHERE department = snacks),COUNT(order_products WHERE department = bakery),COUNT(order_products WHERE product_name = Organic Hass Avocado),COUNT(order_products WHERE product_name = Organic Avocado),COUNT(order_products WHERE product_name = Banana),COUNT(order_products WHERE department = beverages),NUM_UNIQUE(order_products.orders.user_id),user_id,bought_product
0,4.0,73.0,0.493151,11.0,37.0,22.0,1.0,2015.0,1.0,3.0,...,0.0,9.0,4.0,0.0,0.0,0.0,21.0,1.0,7.0,False
1,4.0,40.0,0.175,14.0,33.0,25.0,1.0,2015.0,1.0,3.0,...,0.0,2.0,1.0,0.0,0.0,0.0,1.0,1.0,14.0,False
2,4.0,50.0,0.28,8.0,36.0,14.0,1.0,2015.0,1.0,3.0,...,0.0,6.0,0.0,0.0,0.0,1.0,0.0,1.0,16.0,True
3,15.0,110.0,0.454545,13.0,60.0,29.0,1.0,2015.0,1.0,3.0,...,0.0,3.0,2.0,0.0,0.0,0.0,22.0,1.0,17.0,False
4,7.0,46.0,0.347826,9.0,30.0,18.0,1.0,2015.0,1.0,3.0,...,0.0,10.0,4.0,0.0,0.0,0.0,14.0,1.0,21.0,False


Before we use this feature matrix to build a predictive model, we will first encode any categorical features using `ft.encode_features()`. Note, at this time `ft.encode_features()` does not work with a Dask feature matrix, so we will use the pandas version we read from disk and computed above.

In [27]:
fm_encoded, features_encoded = ft.encode_features(fm,
                                                  features)

print("Number of features %s" % len(features_encoded))
fm_encoded.head()

Number of features 88


Unnamed: 0,COUNT(orders),COUNT(order_products),PERCENT_TRUE(order_products.reordered),NUM_UNIQUE(order_products.department),NUM_UNIQUE(order_products.product_name),NUM_UNIQUE(order_products.aisle_id),DAY(first_orders_time) = 1.0,DAY(first_orders_time) = 8.0,DAY(first_orders_time) = 9.0,DAY(first_orders_time) = 3.0,...,COUNT(order_products WHERE product_name = Organic Whole Milk),COUNT(order_products WHERE department = snacks),COUNT(order_products WHERE department = bakery),COUNT(order_products WHERE product_name = Organic Hass Avocado),COUNT(order_products WHERE product_name = Organic Avocado),COUNT(order_products WHERE product_name = Banana),COUNT(order_products WHERE department = beverages),NUM_UNIQUE(order_products.orders.user_id),user_id,bought_product
0,4.0,73.0,0.493151,11.0,37.0,22.0,1,0,0,0,...,0.0,9.0,4.0,0.0,0.0,0.0,21.0,1.0,7.0,False
1,4.0,40.0,0.175,14.0,33.0,25.0,1,0,0,0,...,0.0,2.0,1.0,0.0,0.0,0.0,1.0,1.0,14.0,False
2,4.0,50.0,0.28,8.0,36.0,14.0,1,0,0,0,...,0.0,6.0,0.0,0.0,0.0,1.0,0.0,1.0,16.0,True
3,15.0,110.0,0.454545,13.0,60.0,29.0,1,0,0,0,...,0.0,3.0,2.0,0.0,0.0,0.0,22.0,1.0,17.0,False
4,7.0,46.0,0.347826,9.0,30.0,18.0,1,0,0,0,...,0.0,10.0,4.0,0.0,0.0,0.0,14.0,1.0,21.0,False
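
The encoded matrix replaces each categorical column with indicator columns such as `DAY(first_orders_time) = 1.0`. The sketch below illustrates that kind of one-hot encoding in plain Python. It is a simplification rather than the library's implementation, and the `top_n` parameter and helper name are assumptions of this sketch.

```python
from collections import Counter

def one_hot(values, name, top_n=10):
    """Create one 0/1 indicator column per most frequent value."""
    most_common = [value for value, _ in Counter(values).most_common(top_n)]
    return {
        f"{name} = {value}": [int(x == value) for x in values]
        for value in most_common
    }

# Made-up day-of-month values for five users.
days = [1.0, 1.0, 8.0, 1.0, 9.0]
encoded = one_hot(days, "DAY(first_orders_time)", top_n=2)
for column, indicators in encoded.items():
    print(column, indicators)
```

Values outside the most frequent ones (here `9.0`) get no column of their own, which keeps the width of the encoded matrix bounded.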


## Step 5. Machine Learning

Using the default parameters, we generated dozens of potential features for our prediction problem. With a few simple commands, this feature matrix can be used for machine learning.

In [28]:
X = fm_encoded.merge(label_times)
X.drop(["user_id", "time"], axis=1, inplace=True)
X = X.fillna(0)
y = X.pop("bought_product").astype('bool')

In [29]:
X.head()

Unnamed: 0,COUNT(orders),COUNT(order_products),PERCENT_TRUE(order_products.reordered),NUM_UNIQUE(order_products.department),NUM_UNIQUE(order_products.product_name),NUM_UNIQUE(order_products.aisle_id),DAY(first_orders_time) = 1.0,DAY(first_orders_time) = 8.0,DAY(first_orders_time) = 9.0,DAY(first_orders_time) = 3.0,...,COUNT(order_products WHERE department = deli),COUNT(order_products WHERE product_name = Strawberries),COUNT(order_products WHERE product_name = Organic Whole Milk),COUNT(order_products WHERE department = snacks),COUNT(order_products WHERE department = bakery),COUNT(order_products WHERE product_name = Organic Hass Avocado),COUNT(order_products WHERE product_name = Organic Avocado),COUNT(order_products WHERE product_name = Banana),COUNT(order_products WHERE department = beverages),NUM_UNIQUE(order_products.orders.user_id)
0,4.0,73.0,0.493151,11.0,37.0,22.0,1,0,0,0,...,8.0,0.0,0.0,9.0,4.0,0.0,0.0,0.0,21.0,1.0
1,4.0,73.0,0.493151,11.0,37.0,22.0,1,0,0,0,...,8.0,0.0,0.0,9.0,4.0,0.0,0.0,0.0,21.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,10.0,0.9,8.0,6.0,10.0,0,0,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0


In [31]:
y.tail()

3453    False
3454    False
3455     True
3456    False
3457    False
Name: bought_product, dtype: bool

Let's train a Random Forest and validate it using 3-fold cross-validation.

In [32]:
clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
scores = cross_val_score(estimator=clf, X=X, y=y, cv=3,
                         scoring="roc_auc", verbose=True)

"AUC %.2f +/- %.2f" % (scores.mean(), scores.std())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    3.7s finished


'AUC 0.76 +/- 0.04'

With a mean AUC of 0.76, against 0.5 for random guessing, this model predicts the next purchase considerably better than chance.
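
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition in plain Python (scikit-learn uses a different but equivalent computation); the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

labels = [True, False, True, False, False]
informative_scores = [0.9, 0.2, 0.6, 0.7, 0.1]
constant_scores = [0.5] * 5  # a model that cannot tell the classes apart

auc_informative = roc_auc(labels, informative_scores)
auc_constant = roc_auc(labels, constant_scores)
print(round(auc_informative, 3), auc_constant)
```

A scorer that cannot separate the classes lands at 0.5, which is the baseline the cross-validated score should be compared against.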

Next we will identify the top 20 features so we can use them to later perform machine learning on the whole dataset.

In [33]:
clf.fit(X, y)
top_features = utils.feature_importances(clf, features_encoded, n=20)

1: Feature: COUNT(order_products WHERE product_name = Banana), 0.073
2: Feature: COUNT(order_products WHERE product_name = Bag of Organic Bananas), 0.068
3: Feature: COUNT(order_products WHERE department = produce), 0.041
4: Feature: COUNT(order_products WHERE department = dairy eggs), 0.031
5: Feature: MEAN(orders.NUM_UNIQUE(order_products.product_name)), 0.029
6: Feature: MEAN(orders.COUNT(order_products)), 0.027
7: Feature: SUM(orders.NUM_UNIQUE(order_products.product_name)), 0.026
8: Feature: COUNT(order_products), 0.026
9: Feature: MEAN(orders.NUM_UNIQUE(order_products.aisle_id)), 0.025
10: Feature: MEAN(orders.NUM_UNIQUE(order_products.department)), 0.024
11: Feature: NUM_UNIQUE(order_products.product_name), 0.023
12: Feature: PERCENT_TRUE(order_products.reordered), 0.023
13: Feature: SUM(orders.NUM_UNIQUE(order_products.aisle_id)), 0.022
14: Feature: NUM_UNIQUE(order_products.aisle_id), 0.022
15: Feature: STD(orders.NUM_UNIQUE(order_products.department)), 0.022
16: Feature: MAX(
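
The `utils.feature_importances` helper is not shown in this notebook. As a guess at its ranking step, the sketch below pairs feature names with importance scores (with scikit-learn these would come from `clf.feature_importances_`) and keeps the highest-scoring ones. The function name is hypothetical; the three names and scores are taken from the output above.

```python
def top_n_features(names, importances, n=20):
    """Return the n (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

names = [
    "COUNT(order_products WHERE department = produce)",
    "COUNT(order_products WHERE product_name = Banana)",
    "COUNT(order_products WHERE product_name = Bag of Organic Bananas)",
]
importances = [0.041, 0.073, 0.068]

top_two = top_n_features(names, importances, n=2)
for rank, (name, score) in enumerate(top_two, start=1):
    print(f"{rank}: Feature: {name}, {score:.3f}")
```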

To persist these features, we can save them to disk.

In [34]:
ft.save_features(top_features, os.path.join(data_dir, "top_features.txt"))

### Understanding feature engineering in Featuretools

Before moving forward, take a look at the features we created. You will see that they are more than just simple transformations of columns in our raw data. Instead, they perform aggregations (and sometimes stacking of aggregations) across the relationships in our dataset. If you're curious how this works, learn about the Deep Feature Synthesis algorithm in our documentation [here](https://docs.featuretools.com/en/stable/automated_feature_engineering/afe.html).

DFS is powerful because, with no manual work, it discovered that historical purchases of bananas are important for predicting future purchases. It also surfaced purchases from the dairy and eggs department, as well as reordering behavior, as important features.

Even though these features are intuitive, Deep Feature Synthesis will automatically adapt as we change the prediction problem, saving us the time of manually brainstorming and implementing these data transformations.
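
A stacked feature such as `MEAN(orders.NUM_UNIQUE(order_products.product_name))` is an aggregation of an aggregation: first count the distinct products in each order, then average those counts over each user's orders. The sketch below spells this out in plain Python on made-up rows. It illustrates what the feature means and is not how Featuretools computes it.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (user_id, order_id, product_name)
rows = [
    (7, 1, "Banana"),
    (7, 1, "Limes"),
    (7, 1, "Banana"),
    (7, 2, "Limes"),
    (14, 3, "Banana"),
]

# Level 1: NUM_UNIQUE(order_products.product_name) for each order.
products_per_order = defaultdict(set)
order_owner = {}
for user_id, order_id, product in rows:
    products_per_order[order_id].add(product)
    order_owner[order_id] = user_id

# Level 2: MEAN of the per-order values for each user.
unique_counts_per_user = defaultdict(list)
for order_id, products in products_per_order.items():
    unique_counts_per_user[order_owner[order_id]].append(len(products))

feature = {user: mean(counts) for user, counts in unique_counts_per_user.items()}
print(feature)
```

User 7 has orders with 2 and 1 distinct products, giving a mean of 1.5, while user 14 has a single order with 1 distinct product.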

## Step 6. Scale to Full Dataset

Now that we have established the most important features, we will repeat the process of creating a feature matrix, using only these features, and then make predictions using our full dataset.

To start, we will create a new entityset containing our full dataset.

In [35]:
es = ft.EntitySet("instacart_full")
es.entity_from_dataframe(entity_id="order_products",
                         dataframe=order_products,
                         index="order_product_id",
                         variable_types=order_products_vtypes,
                         time_index="order_time")

es.entity_from_dataframe(entity_id="orders",
                         dataframe=orders,
                         index="order_id",
                         variable_types=order_vtypes,
                         time_index="order_time")

es.add_relationship(ft.Relationship(es["orders"]["order_id"], es["order_products"]["order_id"]))

Entityset: instacart_full
  Entities:
    order_products [Rows: Delayed('int-9dff13ee-2a39-4053-8720-d76cc2503051'), Columns: 7]
    orders [Rows: Delayed('int-5693dd65-efa3-42b1-9b75-ef030b4845a5'), Columns: 3]
  Relationships:
    order_products.order_id -> orders.order_id

In [36]:
es.normalize_entity(base_entity_id="orders", new_entity_id="users", index="user_id")

Entityset: instacart_full
  Entities:
    order_products [Rows: Delayed('int-901238c3-515c-4ffb-842a-f29ce794941c'), Columns: 7]
    orders [Rows: Delayed('int-f762af4b-e24d-41d6-9b9a-ea6f19d1e0fa'), Columns: 3]
    users [Rows: Delayed('int-fda85e6a-8537-4af0-9559-520137e82242'), Columns: 2]
  Relationships:
    order_products.order_id -> orders.order_id
    orders.user_id -> users.user_id

In [37]:
es.add_last_time_indexes()
es["order_products"]["department"].interesting_values = ['produce', 'dairy eggs', 'snacks', 'beverages', 'frozen', 'pantry', 'bakery', 'canned goods', 'deli', 'dry goods pasta']
es["order_products"]["product_name"].interesting_values = ['Banana', 'Bag of Organic Bananas', 'Organic Baby Spinach', 'Organic Strawberries', 'Organic Hass Avocado', 'Organic Avocado', 'Large Lemon', 'Limes', 'Strawberries', 'Organic Whole Milk']

We will use our previously defined Compose label maker to create cutoff times for the full dataset.

In [38]:
df = denormalize(es).compute()

In [39]:
label_times = lm.search(
    df.sort_values('order_time'),
    minimum_data='2015-03-15',
    num_examples_per_instance=2,
    product_name='Banana',
    verbose=True,
)

Elapsed: 09:12 | Remaining: 00:00 | Progress: 100%|██████████| user_id: 412418/412418  


In [41]:
label_times.head()

id,user_id,time,bought_product
0,1,2015-03-15,True
1,1,2015-04-12,False
2,2,2015-03-15,True
3,2,2015-04-12,True
4,3,2015-03-15,False


Next we will read in the top 20 features we identified previously and calculate a feature matrix using only these features.

In [40]:
top_features = ft.load_features(os.path.join(data_dir, "top_features.txt"))
top_features

[<Feature: COUNT(order_products WHERE product_name = Banana)>,
 <Feature: COUNT(order_products WHERE product_name = Bag of Organic Bananas)>,
 <Feature: COUNT(order_products WHERE department = produce)>,
 <Feature: COUNT(order_products WHERE department = dairy eggs)>,
 <Feature: MEAN(orders.NUM_UNIQUE(order_products.product_name))>,
 <Feature: MEAN(orders.COUNT(order_products))>,
 <Feature: SUM(orders.NUM_UNIQUE(order_products.product_name))>,
 <Feature: COUNT(order_products)>,
 <Feature: MEAN(orders.NUM_UNIQUE(order_products.aisle_id))>,
 <Feature: MEAN(orders.NUM_UNIQUE(order_products.department))>,
 <Feature: NUM_UNIQUE(order_products.product_name)>,
 <Feature: PERCENT_TRUE(order_products.reordered)>,
 <Feature: SUM(orders.NUM_UNIQUE(order_products.aisle_id))>,
 <Feature: NUM_UNIQUE(order_products.aisle_id)>,
 <Feature: STD(orders.NUM_UNIQUE(order_products.department))>,
 <Feature: MAX(orders.NUM_UNIQUE(order_products.product_name))>,
 <Feature: STD(orders.COUNT(order_products))>,
 

Having read in the top features we want to use, we can now create our feature matrix on the full dataset with a call to `ft.calculate_feature_matrix()`.

In [42]:
fm = ft.calculate_feature_matrix(top_features, entityset=es, cutoff_time=label_times, verbose=True)

Elapsed: 00:01 | Progress: 100%|██████████


Next, we will compute our feature matrix to bring the results into memory, allowing us to encode categorical features and make our predictions.

In [43]:
fm = fm.compute()

In [44]:
fm_encoded, features_encoded = ft.encode_features(fm, top_features)

print("Number of features %s" % len(features_encoded))
fm_encoded.head()

Number of features 20


Unnamed: 0,COUNT(order_products WHERE product_name = Banana),COUNT(order_products WHERE product_name = Bag of Organic Bananas),COUNT(order_products WHERE department = produce),COUNT(order_products WHERE department = dairy eggs),MEAN(orders.NUM_UNIQUE(order_products.product_name)),MEAN(orders.COUNT(order_products)),SUM(orders.NUM_UNIQUE(order_products.product_name)),COUNT(order_products),MEAN(orders.NUM_UNIQUE(order_products.aisle_id)),MEAN(orders.NUM_UNIQUE(order_products.department)),...,SUM(orders.NUM_UNIQUE(order_products.aisle_id)),NUM_UNIQUE(order_products.aisle_id),STD(orders.NUM_UNIQUE(order_products.department)),MAX(orders.NUM_UNIQUE(order_products.product_name)),STD(orders.COUNT(order_products)),STD(orders.PERCENT_TRUE(order_products.reordered)),MEAN(orders.PERCENT_TRUE(order_products.reordered)),STD(orders.NUM_UNIQUE(order_products.product_name)),user_id,bought_product
0,0.0,0.0,10.0,11.0,18.25,18.25,73,73,12.5,7.75,...,50,22,1.5,24,5.315073,0.410944,0.443452,5.315073,7,False
1,0.0,0.0,8.0,3.0,10.0,10.0,40,40,8.75,6.25,...,35,25,3.947573,27,11.372481,0.288057,0.253704,11.372481,14,False
2,1.0,0.0,32.0,4.0,12.5,12.5,50,50,6.75,4.5,...,27,14,1.732051,15,1.732051,0.207044,0.301136,1.732051,16,True
3,0.0,0.0,3.0,15.0,7.333333,7.333333,110,110,5.8,4.8,...,87,29,1.373213,15,3.265986,0.253803,0.448565,3.265986,17,False
4,0.0,0.0,2.0,10.0,6.571429,6.571429,46,46,5.857143,4.0,...,41,18,1.914854,14,3.55233,0.292543,0.363605,3.55233,21,False


In [45]:
X = fm_encoded.merge(label_times)
X.drop(["user_id", "time"], axis=1, inplace=True)
X = X.fillna(0)
y = X.pop("bought_product").astype('bool')

In [46]:
clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
scores = cross_val_score(estimator=clf, X=X, y=y, cv=3,
                         scoring="roc_auc", verbose=True)

"AUC %.2f +/- %.2f" % (scores.mean(), scores.std())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  3.2min finished


'AUC 0.87 +/- 0.04'

In [47]:
clf.fit(X, y)
top_features = utils.feature_importances(clf, top_features, n=20)

1: Feature: COUNT(order_products WHERE product_name = Banana), 0.186
2: Feature: COUNT(order_products WHERE product_name = Bag of Organic Bananas), 0.130
3: Feature: COUNT(order_products WHERE department = produce), 0.061
4: Feature: STD(orders.PERCENT_TRUE(order_products.reordered)), 0.043
5: Feature: MEAN(orders.PERCENT_TRUE(order_products.reordered)), 0.043
6: Feature: SUM(orders.NUM_UNIQUE(order_products.product_name)), 0.041
7: Feature: PERCENT_TRUE(order_products.reordered), 0.040
8: Feature: COUNT(order_products), 0.039
9: Feature: COUNT(order_products WHERE department = dairy eggs), 0.039
10: Feature: STD(orders.NUM_UNIQUE(order_products.department)), 0.039
11: Feature: SUM(orders.NUM_UNIQUE(order_products.aisle_id)), 0.037
12: Feature: MEAN(orders.COUNT(order_products)), 0.037
13: Feature: MEAN(orders.NUM_UNIQUE(order_products.product_name)), 0.037
14: Feature: STD(orders.NUM_UNIQUE(order_products.product_name)), 0.036
15: Feature: STD(orders.COUNT(order_products)), 0.036
16: 

We can see that the ranking has shifted somewhat, but the three most important features (the purchase counts for Banana, for Bag of Organic Bananas and for the produce department) are unchanged.

Now that we have finished, we can close our Dask client.

In [48]:
client.close()

## Next Steps

While this is an end-to-end example of going from raw data to a trained machine learning model, it is necessary to do further exploration before claiming we've built something impactful.

Fortunately, Featuretools makes it easy to build a structured data science pipeline. As next steps, you could:

- Further validate these results by creating feature vectors at different cutoff times
- Perform feature selection on a larger subset of the original data to improve results
- Define other prediction problems for this dataset (you can even change the entity you are making predictions on!)
- Save feature matrices to disk as CSVs so they can be reused with different problems without recalculating
- Experiment with parameters to Deep Feature Synthesis