# Predict Next Purchase

In this example, we build a machine learning application that predicts whether customers will purchase a product within the next shopping period. This application is structured into three important steps:

* Prediction Engineering
* Feature Engineering
* Machine Learning

In the first step, we generate our own labels from the data by using [Compose](https://compose.alteryx.com/). In the second step, we generate the features for labels by using [Featuretools](https://docs.featuretools.com/). In the third step, we search for the best machine learning pipeline for the features and labels by using [EvalML](https://evalml.alteryx.com/). After working through these steps, you will learn how to build machine learning applications for real-world problems like predicting consumer spending. Let's get started.

In [None]:
from demo.predict_next_purchase import load_sample
from evalml import AutoMLSearch
from evalml.preprocessing import split_data
import composeml as cp
import featuretools as ft
import matplotlib as mpl
import matplotlib.pyplot  # load the submodule so that mpl.pyplot is available

We will use historical data of online grocery orders provided by Instacart.

In [None]:
df = load_sample()

df.head()

## Prediction Engineering

Note that the prediction problem has two parameters:

* The name of the product.
* The length of the shopping period.

We can change these parameters to create different prediction problems. For example, will a customer purchase an avocado within the next day, or a banana within the next 5 days? Each variation only requires changing a parameter value, which lets us explore different scenarios before committing to one prediction problem.


### Defining the Labeling Process

In each shopping period, we will check whether a customer bought a product. Let’s define this as a labeling function with a parameter for the product name.

In [None]:
def bought_product(ds, product_name):
    return ds.product_name.str.contains(product_name).any()
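
To see how the labeling function behaves, here is a minimal sketch that applies it to a small, made-up slice of orders. The `window` data frame and its values are assumptions for illustration and are not part of the Instacart sample. Note that `str.contains` matches substrings and is case-sensitive, so `'Banana'` also matches `'Organic Banana'`.

```python
import pandas as pd


def bought_product(ds, product_name):
    # True if any product name in the slice contains the given name.
    return ds.product_name.str.contains(product_name).any()


# Hypothetical slice of orders for one customer in one shopping period.
window = pd.DataFrame({
    "order_time": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "product_name": ["Organic Banana", "Whole Milk"],
})

has_banana = bought_product(window, "Banana")
has_avocado = bought_product(window, "Avocado")
print(has_banana, has_avocado)
```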

### Representing the Prediction Problem

We will represent the prediction problem using a label maker. This way, we can run searches on the online grocery orders to generate the training examples. This is done by setting the following parameters:

* The `target_entity` as the customer, because we want to label orders for each individual customer.
* The `labeling_function` as the function we defined previously.
* The `time_index` as the order time, because shopping periods are based on the order time.
* The `window_size` as the length of a shopping period. We can tweak this parameter to create variations of the prediction problem.

In [None]:
lm = cp.LabelMaker(
    target_entity='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='3d',
)

### Finding the Training Examples

Now, we can run a search to find purchases of the product within the shopping periods of each customer. This is done using the following parameters:

* The online grocery orders sorted by the order time.
* The `num_examples_per_instance` to find the number of training examples per customer. We search for all existing examples.
* The `product_name` as the product that we will check for purchases.
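* The `minimum_data` as the amount of data to skip before the first shopping period of each customer. This offset sets the first cutoff time for building features.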

The output from the search is a label times table with three columns:

* The user ID associated with the online grocery orders.
* The start time of the shopping period. This is also known as a cutoff time for building features. Only data that existed before the shopping period is valid to use for making predictions about the outcome.
* Whether or not the product was purchased in the shopping period. This is calculated by our labeling function.

In [None]:
lt = lm.search(
    df.sort_values('order_time'),
    num_examples_per_instance=-1,
    product_name='Banana',
    minimum_data='3d',
    verbose=False,
)

lt.head()
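
To build intuition for how the window size, the minimum data, and the cutoff times relate, the following is a simplified sketch in plain Python. It is not how Compose is implemented. The helper `label_windows` and the toy orders are hypothetical, and skipping windows that contain no orders is a simplification made for this sketch.

```python
from datetime import datetime, timedelta


def label_windows(orders, product, window, minimum_data):
    """Return (cutoff_time, label) pairs for one customer's sorted orders."""
    if not orders:
        return []
    start = orders[0][0] + minimum_data  # first cutoff time
    end = orders[-1][0]
    labels = []
    while start <= end:
        # Product names ordered inside the window [start, start + window).
        names = [name for t, name in orders if start <= t < start + window]
        if names:  # simplification: skip windows without any orders
            labels.append((start, any(product in name for name in names)))
        start += window
    return labels


# Hypothetical orders for a single customer, sorted by order time.
orders = [
    (datetime(2015, 1, 1), "Whole Milk"),
    (datetime(2015, 1, 5), "Organic Banana"),
    (datetime(2015, 1, 8), "Bread"),
]
labels = label_windows(orders, "Banana", timedelta(days=3), timedelta(days=3))
for cutoff_time, label in labels:
    print(cutoff_time.date(), label)
```

Each pair holds the start of a shopping period, which serves as the cutoff time, and the label computed from the orders that fall inside that period.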

It can become difficult to track the parameters that were used to create the labels. As a helpful reference, we can look at the label description to see how the labels were created. The description also shows the label distribution, which we can check for imbalanced labels.


In [None]:
lt.describe()
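
As a rough illustration of what checking for imbalance means, the sketch below computes the share of each label value in plain Python. The helper `label_distribution` and the toy labels are hypothetical and are not taken from the label times above.

```python
from collections import Counter


def label_distribution(labels):
    # Fraction of examples for each label value.
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels: two purchases out of eight shopping periods.
toy_labels = [True, False, False, False, True, False, False, False]
dist = label_distribution(toy_labels)
print(dist)
```

When one class is much rarer than the other, accuracy alone can be misleading, which is one reason to score models with an objective such as F1.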

We can get a better look at the labels by plotting the distribution and the cumulative count across time.

In [None]:
%matplotlib inline
fig = mpl.pyplot.figure(figsize=(5, 8))
ax0 = fig.add_subplot(211)
ax1 = fig.add_subplot(212)
fig.tight_layout()

lt.plot.distribution(ax=ax0)
lt.plot.count_by_time(ax=ax1);

## Feature Engineering

In the previous step, we generated the labels. The next step is to generate the features.

### Representing the Data

We will represent the online grocery orders using an entity set. This way, we can generate features based on the relational structure of the dataset. We currently have a single table of orders where one user can have many orders. This one-to-many relationship can be represented in an entity set by normalizing an entity for the users. The same can be done for products, departments, and so on.


In [None]:
es = ft.EntitySet('instacart')

es.entity_from_dataframe(
    dataframe=df.reset_index(),
    entity_id='order_products',
    time_index='order_time',
    index='id',
)

es.normalize_entity(
    base_entity_id='order_products',
    new_entity_id='orders',
    index='order_id',
    additional_variables=['user_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='orders',
    new_entity_id='users',
    index='user_id',
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='order_products',
    new_entity_id='products',
    index='product_id',
    additional_variables=['aisle_id', 'department_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='products',
    new_entity_id='aisles',
    index='aisle_id',
    additional_variables=['department_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='aisles',
    new_entity_id='departments',
    index='department_id',
    make_time_index=False,
)

es["order_products"]["department"].interesting_values = ['produce']
es["order_products"]["product_name"].interesting_values = ['Banana']
es.plot()
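
Conceptually, normalizing an entity extracts the unique values of a column into their own table, which makes the one-to-many relationship explicit. The sketch below shows that idea with plain pandas on a made-up table. It is an illustration under that assumption and does not reproduce what Featuretools does internally.

```python
import pandas as pd

# Hypothetical flat table: one row per product in an order.
order_products = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "order_id": [10, 10, 11, 12],
    "user_id": [1, 1, 1, 2],
    "product_name": ["Banana", "Whole Milk", "Bread", "Banana"],
})

# One row per order, keeping the user that placed it.
orders = (
    order_products[["order_id", "user_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# One row per user.
users = orders[["user_id"]].drop_duplicates().reset_index(drop=True)

# The one-to-many relationship: each user can have many orders.
orders_per_user = orders.groupby("user_id")["order_id"].count()
print(orders_per_user)
```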

### Calculating the Features

Now, we can generate features by using a method called Deep Feature Synthesis (DFS). This will automatically build features by stacking and applying mathematical operations called primitives across relationships in an entity set. The more structured an entity set is, the better DFS can leverage the relationships to generate better features. Let’s run DFS using the following parameters:

* The target entity as the user, because we want to generate features for each user. 
* The cutoff time as the labels that we created previously. 

There are two outputs from DFS: a feature matrix and feature definitions. The feature matrix is a table that contains the calculated feature values based on cutoff times from our labels. Feature definitions are features in a list that can be stored and reused later to calculate the same set of features on new data.


In [None]:
fm, fd = ft.dfs(
    entityset=es,
    target_entity='users',
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)

fm.head()
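
To make the role of the cutoff time concrete, here is a minimal sketch of a single count feature computed by hand. The helper `count_before` and the toy order times are hypothetical. The sketch only mirrors the idea that, with `include_cutoff_time=False`, data recorded exactly at the cutoff time is left out of the calculation.

```python
from datetime import datetime


def count_before(order_times, cutoff, include_cutoff_time=False):
    # Count the orders that are valid to use for a given cutoff time.
    if include_cutoff_time:
        return sum(t <= cutoff for t in order_times)
    return sum(t < cutoff for t in order_times)


# Hypothetical order times for one user.
order_times = [
    datetime(2015, 1, 1),
    datetime(2015, 1, 4),
    datetime(2015, 1, 5),
]
cutoff = datetime(2015, 1, 4)

exclusive = count_before(order_times, cutoff)
inclusive = count_before(order_times, cutoff, include_cutoff_time=True)
print(exclusive, inclusive)
```

Excluding the order placed at the cutoff time keeps the start of the shopping period out of the features, so the features describe only what was known before the period began.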

## Machine Learning

Now, we can create a machine learning model. Let's extract the labels from the feature matrix and split the data into training and holdout sets.

In [None]:
y = fm.pop('bought_product')
splits = split_data(fm, y, test_size=0.2, random_state=0)
X_train, X_holdout, y_train, y_holdout = splits

### Train Model

Next, we search for the optimal pipeline by trying out different models on the training set.

In [None]:
automl = AutoMLSearch(problem_type='binary', objective='f1', random_state=0)
automl.search(X_train, y_train, data_checks='disabled', show_iteration_plot=False)

In [None]:
automl.best_pipeline.describe()
automl.best_pipeline.graph()

### Test Model

Finally, we score the model performance by evaluating predictions on the holdout set.

In [None]:
best_pipeline = automl.best_pipeline.fit(X_train, y_train)
score = best_pipeline.score(X_holdout, y_holdout, objectives=['f1'])
dict(score)
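
For reference, the F1 objective is the harmonic mean of precision and recall. The sketch below computes it by hand on made-up predictions. The helper `f1_score` and the toy values are assumptions for illustration, and this is not EvalML's implementation.

```python
def f1_score(y_true, y_pred):
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels and predictions for five shopping periods.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
f1 = f1_score(y_true, y_pred)
print(round(f1, 3))
```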