# Predict Next Purchase

In this example, we will generate labels on online grocery orders provided by Instacart using Compose. The labels can be used to train a machine learning model to predict whether a customer will buy a specific product within the next month.

If you plan to run this notebook, you can use the following command at the root directory of the repository.

```bash
jupyter notebook docs/source/examples/predict-next-purchase/example.ipynb
```

## Load Data

In [None]:
%matplotlib inline
import composeml as cp
import featuretools as ft
from demo.predict_next_purchase import load_sample
from evalml import AutoMLSearch
from evalml.preprocessing import split_data

The data hosted [here](https://www.instacart.com/datasets/grocery-shopping-2017) will be downloaded automatically into the `data` module of this notebook unless it already exists. Once the data is in place, we can preview the grocery orders to see how they look.

In [None]:
df = load_sample()

df.head()

## Generate Labels
Now with the grocery orders loaded, we are ready to generate labels for our prediction problem.

### Create Labeling Function
To get started, we define the labeling function that will return whether a customer purchased the product in a given month.

In [None]:
def bought_product(df, product_name):
    # df holds one customer's orders for a single window.
    return df.product_name.str.contains(product_name).any()
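
To make the return value concrete without loading the dataset, here is a minimal plain-Python sketch of the same check. The helper `bought_product_plain` and the product names are hypothetical. The sketch assumes the names contain no regular-expression metacharacters, since pandas `str.contains` treats its pattern as a case-sensitive regular expression by default.

```python
def bought_product_plain(product_names, product_name):
    """Return True if any name in the window contains product_name (case-sensitive)."""
    return any(product_name in name for name in product_names)


# Hypothetical product names from one customer's window of orders.
window = ['Organic Whole Milk', 'Bag of Organic Bananas', 'Limes']

print(bought_product_plain(window, 'Banana'))   # True: substring of 'Bag of Organic Bananas'
print(bought_product_plain(window, 'banana'))   # False: the match is case-sensitive
print(bought_product_plain(window, 'Avocado'))  # False: no matching product
```

Because the match is a substring match, searching for `Banana` also labels purchases of products such as `Bag of Organic Bananas` as positive.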

### Construct Label Maker

With the labeling function, we create the label maker for our prediction problem. To process a window of orders for each customer, we set the `target_entity` to the customer ID and the `window_size` to the length of the window. The code below uses a seven-day window (`7d`). To label one calendar month at a time instead, set the window size to `1MS`, in which case each window ends on the first day of the next month. Alias definitions are listed [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).

In [None]:
lm = cp.LabelMaker(
    target_entity='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='7d',
)
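
The cell above uses a fixed seven-day window (`'7d'`). For comparison, the month-start behavior described for `1MS` can be sketched with the standard library alone. The helper `next_month_start` is hypothetical and only illustrates where such a window would end; it is not how Compose or pandas computes the offset internally.

```python
from datetime import datetime


def next_month_start(moment):
    """Return midnight on the first day of the month after `moment`."""
    year = moment.year + (moment.month // 12)  # roll over to next year after December
    month = moment.month % 12 + 1
    return datetime(year, month, 1)


# A window that starts mid-January ends at the start of February.
print(next_month_start(datetime(2015, 1, 15, 9, 30)))  # 2015-02-01 00:00:00
# A window that starts in December ends at the start of the next year.
print(next_month_start(datetime(2015, 12, 31)))        # 2016-01-01 00:00:00
```

A month-start window therefore varies in length depending on when it begins, whereas a `7d` window always spans exactly seven days.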

### Search Labels
Next, the label maker will search through the data continuously to label whether a customer bought bananas in each window. This happens when we use `LabelMaker.search` and set the `product_name` to bananas. If you are running this code yourself, feel free to experiment with other products (e.g. limes, avocados, etc.) and different time frames!

In [None]:
lt = lm.search(
    df.sort_values('order_time'),
    minimum_data='3d',
    num_examples_per_instance=-1,
    product_name='Banana',
    gap='3d',
    verbose=True,
)

lt.head()
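
The interaction between `minimum_data`, `gap`, and `window_size` in the search above can be sketched as follows. This is a simplified model under stated assumptions, not Compose's implementation: it assumes the first cutoff falls `minimum_data` after a customer's first order, that each later cutoff is `gap` after the previous one, and that each window covers `window_size` starting at its cutoff. The function `sketch_windows` and the dates are hypothetical, and the library may treat boundaries and empty windows differently.

```python
from datetime import datetime, timedelta


def sketch_windows(first_order, last_order, minimum_data, gap, window_size):
    """Yield (cutoff, window_end) pairs under a simplified model of the search."""
    cutoff = first_order + minimum_data  # skip the initial data requirement
    while cutoff <= last_order:
        yield cutoff, cutoff + window_size
        cutoff += gap  # the next window starts one gap later


windows = list(sketch_windows(
    first_order=datetime(2015, 1, 1),
    last_order=datetime(2015, 1, 12),
    minimum_data=timedelta(days=3),
    gap=timedelta(days=3),
    window_size=timedelta(days=7),
))

for cutoff, end in windows:
    print(cutoff.date(), '->', end.date())
```

Under these assumptions, a gap of three days with a seven-day window produces overlapping windows, so a single purchase can contribute to more than one label.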

### Describe Labels

With the generated label times, we can use `LabelTimes.describe` to print out the distribution with the settings and transforms that were used to make these labels. This is useful as a reference for understanding how the labels were generated from raw data. Also, the label distribution is helpful for determining if we have imbalanced labels.

In [None]:
lt.describe()
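
As a rough illustration of what an imbalance check looks at, the sketch below counts a hypothetical list of boolean labels and computes the share of positives. The label values are made up for illustration and are not taken from the Instacart data.

```python
from collections import Counter

# Hypothetical labels: True means the customer bought the product in that window.
labels = [False, False, True, False, True, False, False, False]

counts = Counter(labels)
positive_rate = counts[True] / len(labels)

print(counts[True], counts[False])  # 2 6
print(f'{positive_rate:.2f}')       # 0.25
```

When positives are a small share of the labels, accuracy alone can be misleading, which is one reason to prefer an objective such as F1 for evaluation.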

### Plot Labels

Additionally, there are plots available for insight into the labels.

#### Distribution

This plot shows the label distribution.

In [None]:
lt.plot.distribution();

#### Count by Time

This plot shows the label distribution across cutoff times.

In [None]:
lt.plot.count_by_time();

In [None]:
es = ft.EntitySet('instacart')

es.entity_from_dataframe(
    dataframe=df.reset_index(),
    entity_id='order_products',
    time_index='order_time',
    index='id',
)

es.normalize_entity(
    base_entity_id='order_products',
    new_entity_id='orders',
    index='order_id',
    additional_variables=['user_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='orders',
    new_entity_id='users',
    index='user_id',
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='order_products',
    new_entity_id='products',
    index='product_id',
    additional_variables=['aisle_id', 'department_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='products',
    new_entity_id='aisles',
    index='aisle_id',
    additional_variables=['department_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='aisles',
    new_entity_id='departments',
    index='department_id',
    make_time_index=False,
)

es["order_products"]["department"].interesting_values = ['produce']
es["order_products"]["product_name"].interesting_values = ['Banana']
es.plot()

In [None]:
X, features = ft.dfs(
    entityset=es,
    target_entity='users',
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)

X.head()
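
Passing the label times as `cutoff_time` restricts the data used to compute features for each example. The plain-Python sketch below shows the effect of `include_cutoff_time=False` on a hypothetical list of timestamped rows. The helper `rows_for_features` is an illustration, not Featuretools code. The assumption here is that each label's window begins at its cutoff time, so a row stamped exactly at the cutoff belongs to the labeled window and is excluded from the features.

```python
from datetime import datetime

# Hypothetical (order_time, product_name) rows for one customer.
orders = [
    (datetime(2015, 1, 2), 'Limes'),
    (datetime(2015, 1, 4), 'Banana'),
    (datetime(2015, 1, 6), 'Avocado'),
]
cutoff = datetime(2015, 1, 4)


def rows_for_features(rows, cutoff, include_cutoff_time):
    """Keep rows at or before the cutoff, or strictly before it when excluded."""
    if include_cutoff_time:
        return [row for row in rows if row[0] <= cutoff]
    return [row for row in rows if row[0] < cutoff]


print([name for _, name in rows_for_features(orders, cutoff, False)])  # ['Limes']
print([name for _, name in rows_for_features(orders, cutoff, True)])   # ['Limes', 'Banana']
```

Excluding the row at the cutoff keeps the banana purchase that determines the label from also appearing in the features.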

In [None]:
y = X.pop('bought_product')

y.head()

In [None]:
X_train, X_holdout, y_train, y_holdout = split_data(
    X=X,
    y=y,
    test_size=0.2,
    random_state=0,
)

In [None]:
automl = AutoMLSearch(
    problem_type='binary',
    objective='f1',
    random_state=0,
)

automl.search(X_train, y_train, data_checks=None)

In [None]:
automl.best_pipeline.describe()
automl.best_pipeline.graph()

In [None]:
best_pipeline = automl.best_pipeline.fit(X_train, y_train)

score = best_pipeline.score(
    X=X_holdout,
    y=y_holdout,
    objectives=['f1'],
)

dict(score)
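
For reference, the F1 objective used above is the harmonic mean of precision and recall. The sketch below computes it by hand on hypothetical labels and predictions. The function `f1_score_plain` is illustrative and is not EvalML's implementation.

```python
def f1_score_plain(y_true, y_pred):
    """Compute F1 for boolean labels as the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for truth, pred in pairs if truth and pred)
    false_pos = sum(1 for truth, pred in pairs if not truth and pred)
    false_neg = sum(1 for truth, pred in pairs if truth and not pred)
    if true_pos == 0:
        return 0.0  # no correct positive predictions
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical holdout labels and predictions.
y_true = [True, True, False, False, True]
y_pred = [True, False, False, True, True]

print(f'{f1_score_plain(y_true, y_pred):.3f}')  # 0.667
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is F1.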