# Predict Next Purchase
In this tutorial, build a machine learning application that predicts whether customers will purchase a product within the next shopping period. This application is structured into three important steps:

* Prediction Engineering

* Feature Engineering

* Machine Learning

In the first step, you generate new labels from the data by using Compose. In the second step, you generate features for the labels by using Featuretools. In the third step, you search for the best machine learning pipeline by using EvalML. After working through these steps, you should understand how to build machine learning applications for real-world problems like predicting consumer spending.

Note: To run this example, you need Featuretools 1.4.0 or newer.

In [97]:
import os
import pandas as pd
def load_sample(data_dir):
    """Merge the Instacart CSV files in data_dir into a single dataframe."""
    def read(name):
        return pd.read_csv(os.path.join(data_dir, name))

    # Merge aisles.csv with products.csv (shared column: aisle_id)
    df = pd.merge(read("aisles.csv"), read("products.csv"))
    # Merge in departments.csv (shared column: department_id)
    df = pd.merge(df, read("departments.csv"))
    # Merge in the order-product rows (file.csv here; the order_products__prior.csv data)
    df = pd.merge(df, read("file.csv"))
    # Merge in orders.csv (shared column: order_id)
    df = pd.merge(df, read("orders.csv"))
    return df

Use this historical data of online grocery orders provided by Instacart.

In [98]:
df = load_sample("/")

In [99]:
df.isnull().sum()

aisle_id                  0
aisle                     0
product_id                0
product_name              0
department_id             0
department                0
order_id                  0
add_to_cart_order         0
reordered                 0
user_id                   0
eval_set                  0
order_number              0
order_dow                 0
order_hour_of_day         0
days_since_prior_order    0
dtype: int64

In [100]:
df.head()

Unnamed: 0,aisle_id,aisle,product_id,product_name,department_id,department,order_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1,prepared soups salads,209,Italian Pasta Salad,20,deli,36461,6.0,1.0,50978,train,7.0,6.0,13.0,20.0
1,1,prepared soups salads,12398,Caprese Salad,20,deli,36461,7.0,0.0,50978,train,7.0,6.0,13.0,20.0
2,13,prepared meals,25407,Mashed Potatoes,20,deli,36461,5.0,0.0,50978,train,7.0,6.0,13.0,20.0
3,108,other creams cheeses,27323,Pure & Natural Sour Cream,16,dairy eggs,36461,1.0,1.0,50978,train,7.0,6.0,13.0,20.0
4,108,other creams cheeses,40593,Cream Cheese,16,dairy eggs,36461,11.0,1.0,50978,train,7.0,6.0,13.0,20.0


In [107]:
import pandas as pd
import numpy as np

def add_time(df):
    # reset_index returns a new dataframe, so keep the result
    df = df.reset_index(drop=True)
    df["order_time"] = np.nan

    # Look up column positions once for positional (iloc) access
    columns = df.columns.tolist()
    days_since = columns.index("days_since_prior_order")
    hour_of_day = columns.index("order_hour_of_day")
    order_time = columns.index("order_time")

    # The first row starts on Jan 1, 2015 at its order hour
    df.iloc[0, order_time] = pd.Timestamp('Jan 1, 2015') + pd.to_timedelta(df.iloc[0, hour_of_day], unit='h')
    # Every later row adds its day gap and order hour to the previous row's time
    for i in range(1, df.shape[0]):
        try:
            time_difference_days = pd.to_timedelta(df.iloc[i, days_since], unit='d')
            time_difference_hours = pd.to_timedelta(df.iloc[i, hour_of_day], unit='h')
            df.iloc[i, order_time] = df.iloc[i - 1, order_time] + time_difference_days + time_difference_hours
        except pd.errors.OutOfBoundsDatetime:
            # The running total no longer fits in a pandas timestamp
            df.iloc[i, order_time] = np.nan

    # Drop the columns that were only needed to build order_time
    to_drop = ["order_number", "order_dow", "order_hour_of_day", "days_since_prior_order", "eval_set"]
    df.drop(to_drop, axis=1, inplace=True)

    return df




In [108]:
df = add_time(df)

In [48]:
df.head()

Unnamed: 0,aisle_id,aisle,product_id,product_name,department_id,department,order_id,add_to_cart_order,reordered,user_id,order_time
0,1,prepared soups salads,209,Italian Pasta Salad,20,deli,195206,18,1.0,1519,2015-01-01 09:00:00
1,1,prepared soups salads,26047,Tuna Salad,20,deli,195206,16,1.0,1519,2015-01-06 18:00:00
2,1,prepared soups salads,26714,Chicken Salad,20,deli,195206,15,1.0,1519,2015-01-12 03:00:00
3,1,prepared soups salads,47979,Butternut Squash Bisque,20,deli,195206,17,1.0,1519,2015-01-17 12:00:00
4,21,packaged cheese,7781,Organic Sticks Low Moisture Part Skim Mozzarel...,16,dairy eggs,195206,10,0.0,1519,2015-01-22 21:00:00


In [109]:
df.isnull().sum()

aisle_id                 0
aisle                    0
product_id               0
product_name             0
department_id            0
department               0
order_id                 0
add_to_cart_order        0
reordered                0
user_id                  0
order_time           33326
dtype: int64
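
The rule inside `add_time` can be checked by hand. The sketch below uses only the standard library and assumes the first five rows printed by `df.head()` earlier, each with `days_since_prior_order` of 20 and `order_hour_of_day` of 13. It reproduces the running timestamps and computes the latest instant that fits in signed 64-bit nanoseconds since 1970, which is the range pandas documents for nanosecond-resolution timestamps. Because every row adds to the previous row's time regardless of customer, the running total eventually passes that limit, which is a plausible explanation for the missing `order_time` values above.

```python
from datetime import datetime, timedelta

# Assumed inputs: (days_since_prior_order, order_hour_of_day) for the first
# five rows printed by df.head() earlier.
rows = [(20.0, 13.0)] * 5


def running_order_times(rows, start=datetime(2015, 1, 1)):
    """Mimic add_time: the first row is start + hour; every later row adds
    its days and hours to the previous row's time."""
    times = [start + timedelta(hours=rows[0][1])]
    for days, hour in rows[1:]:
        times.append(times[-1] + timedelta(days=days, hours=hour))
    return times


order_times = running_order_times(rows)
for t in order_times:
    print(t)

# Latest instant representable as signed 64-bit nanoseconds since 1970-01-01,
# truncated to microseconds (the resolution of datetime).
INT64_NS_LIMIT = datetime(1970, 1, 1) + timedelta(microseconds=(2**63 - 1) // 1000)
print("limit:", INT64_NS_LIMIT)
```

The five timestamps agree with the `order_time` values of the first five rows in the cleaned dataframe shown later in this tutorial.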

# Prediction Engineering
Will customers purchase a product within the next shopping period?

In this prediction problem, there are two parameters:

* The product that a customer can purchase.

* The length of the shopping period.

You can change these parameters to create different prediction problems. For example, will a customer purchase a banana within the next three days, or an avocado within the next three weeks? Each variation only requires changing these two parameters, so you can explore different scenarios without rewriting the rest of the application.

# Defining the Labeling Function
Start by defining a labeling function that checks whether a customer bought a given product. Make the product a parameter of the function. A label maker then uses this labeling function to extract the training examples.

In [110]:
def bought_product(ds, product_name):
    return ds.product_name.str.contains(product_name).any()
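
By default, `Series.str.contains` treats its argument as a regular expression and matches anywhere in the string, so `'Banana'` also matches longer names that contain it. The library-free sketch below mirrors that behavior with `re.search` and contrasts it with an exact comparison. The helper names and the product names in `window` are made up for illustration.

```python
import re


def bought_product_substring(product_names, pattern):
    """True if any name contains the pattern (regex search, like str.contains)."""
    return any(re.search(pattern, name) for name in product_names)


def bought_product_exact(product_names, product_name):
    """True only if some name equals product_name exactly."""
    return any(name == product_name for name in product_names)


# Illustrative contents of one shopping period.
window = ["Bag of Organic Bananas", "Cream Cheese"]

substring_label = bought_product_substring(window, "Banana")
exact_label = bought_product_exact(window, "Banana")
print(substring_label, exact_label)
```

If only the exact product should count, an equality check such as `ds.product_name.eq(product_name).any()` is one option in pandas.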

In [10]:
# %pip install composeml
# Run this cell if this is the first execution

# Representing the Prediction Problem
Represent the prediction problem by creating a label maker with the following parameters:

* target_dataframe_index as the column for the customer ID, since you want to process orders for each customer.

* labeling_function as the function you defined previously.

* time_index as the column for the order time. The shopping periods are based on this time index.

* window_size as the length of a shopping period. You can easily change this parameter to create variations of the prediction problem.

In [111]:
import composeml as cp

lm = cp.LabelMaker(
    target_dataframe_index='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='3d',
)

In [112]:
df = df.dropna()

In [113]:
df.isnull().sum()

aisle_id             0
aisle                0
product_id           0
product_name         0
department_id        0
department           0
order_id             0
add_to_cart_order    0
reordered            0
user_id              0
order_time           0
dtype: int64

In [114]:
df

Unnamed: 0,aisle_id,aisle,product_id,product_name,department_id,department,order_id,add_to_cart_order,reordered,user_id,order_time
0,1,prepared soups salads,209,Italian Pasta Salad,20,deli,36461,6.0,1.0,50978,2015-01-01 13:00:00
1,1,prepared soups salads,12398,Caprese Salad,20,deli,36461,7.0,0.0,50978,2015-01-22 02:00:00
2,13,prepared meals,25407,Mashed Potatoes,20,deli,36461,5.0,0.0,50978,2015-02-11 15:00:00
3,108,other creams cheeses,27323,Pure & Natural Sour Cream,16,dairy eggs,36461,1.0,1.0,50978,2015-03-04 04:00:00
4,108,other creams cheeses,40593,Cream Cheese,16,dairy eggs,36461,11.0,1.0,50978,2015-03-24 17:00:00
...,...,...,...,...,...,...,...,...,...,...,...
5328,83,fresh vegetables,22935,Organic Yellow Onion,4,produce,263224,1.0,1.0,20085,2262-03-02 08:00:00
5329,83,fresh vegetables,27104,Fresh Cauliflower,4,produce,263224,14.0,0.0,20085,2262-03-06 23:00:00
5330,83,fresh vegetables,48679,Organic Garnet Sweet Potato (Yam),4,produce,263224,13.0,1.0,20085,2262-03-11 14:00:00
5331,116,frozen produce,2228,Organic Frozen Mango Chunks,1,frozen,263224,10.0,0.0,20085,2262-03-16 05:00:00


In [115]:
df['product_name'].mode()

0    Banana
Name: product_name, dtype: object

# Finding the Training Examples
Run a search to get the training examples by using the following parameters:

* The grocery orders sorted by the order time, since the search expects the orders to be sorted chronologically. Otherwise, an error is raised.

* num_examples_per_instance as the number of training examples to find per customer. Setting it to -1 returns all existing examples.

* product_name as the product to check for purchases. This parameter is passed directly to the labeling function.

* minimum_data as the amount of data that is used to make features for the first training example.

In [116]:
lt = lm.search(
    df.sort_values('order_time'),
    num_examples_per_instance=-1,
    product_name='Banana',
    minimum_data='3d',
    verbose=False,
)

lt.head()

Unnamed: 0,user_id,time,bought_product
0,341,2037-12-03 19:00:00,False
1,341,2038-01-02 19:00:00,False
2,341,2038-02-01 19:00:00,False
3,341,2038-03-03 19:00:00,False
4,341,2038-04-02 19:00:00,False


The output from the search is a label times table with three columns:

* The customer ID associated with the orders. There can be many training examples generated from each customer.

* The start time of the shopping period. This is also the cutoff time for building features. Only data that existed beforehand is valid to use for predictions.

* Whether the product was purchased during the shopping period window. This is calculated by our labeling function.
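
The relationship between the cutoff time, the shopping period, and the label can be sketched in plain Python. The function below is a simplified illustration of the idea for a single customer, not Compose's actual search algorithm: it reserves `minimum_data` at the start, then walks forward in fixed windows of `window_size` and labels each one. Compose handles details such as empty windows and gaps in its own way. The function name and the orders are made up for illustration.

```python
from datetime import datetime, timedelta


def label_windows(orders, pattern, minimum_data, window_size):
    """orders: (order_time, product_name) pairs sorted by time for ONE customer.
    Returns (cutoff_time, label) pairs for consecutive fixed-size windows."""
    first_time, last_time = orders[0][0], orders[-1][0]
    cutoff = first_time + minimum_data  # data before this is reserved for features
    labels = []
    while cutoff <= last_time:
        window_end = cutoff + window_size
        in_window = [name for t, name in orders if cutoff <= t < window_end]
        labels.append((cutoff, any(pattern in name for name in in_window)))
        cutoff = window_end  # the next shopping period starts where this one ends
    return labels


# Illustrative orders for one customer.
orders = [
    (datetime(2015, 1, 1, 9), "Cream Cheese"),
    (datetime(2015, 1, 5, 18), "Banana"),
    (datetime(2015, 1, 8, 12), "Mashed Potatoes"),
]

labels = label_windows(orders, "Banana", timedelta(days=3), timedelta(days=3))
for cutoff, label in labels:
    print(cutoff, label)
```

Each cutoff marks both the start of a shopping period and the latest point from which feature data may be drawn.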

In [117]:
uni = lt['bought_product'].unique()

In [118]:
uni

array([False,  True])

In [119]:
lt.describe()


Label Distribution
------------------
False     4739
True       145
Total:    4884


Settings
--------
gap                                    None
maximum_data                           None
minimum_data                             3d
num_examples_per_instance                -1
target_column                bought_product
target_dataframe_index              user_id
target_type                        discrete
window_size                              3d


Transforms
----------
No transforms applied



In [16]:
# %pip install featuretools

# Representing the Data
Start by representing the data with an EntitySet. That way, you can generate features based on the relational structure of the dataset. You currently have a single table of orders where one customer can have many orders. This one-to-many relationship can be represented by normalizing a customer dataframe. The same can be done for other one-to-many relationships like aisle-to-products. Because you want to make predictions based on the customer, you should use this customer dataframe as the target for generating features.

In [122]:
import featuretools as ft
es = ft.EntitySet('instacart')

es.add_dataframe(
    dataframe=df.reset_index(),
    dataframe_name='order_products',
    time_index='order_time',
    index='id',
)

es.normalize_dataframe(
    base_dataframe_name='order_products',
    new_dataframe_name='orders',
    index='order_id',
    additional_columns=['user_id'],
    make_time_index=False,
)

es.normalize_dataframe(
    base_dataframe_name='orders',
    new_dataframe_name='customers',
    index='user_id',
    make_time_index=False,
)

es.normalize_dataframe(
    base_dataframe_name='order_products',
    new_dataframe_name='products',
    index='product_id',
    additional_columns=['aisle_id', 'department_id'],
    make_time_index=False,
)

es.normalize_dataframe(
    base_dataframe_name='products',
    new_dataframe_name='aisles',
    index='aisle_id',
    additional_columns=['department_id'],
    make_time_index=False,
)

es.normalize_dataframe(
    base_dataframe_name='aisles',
    new_dataframe_name='departments',
    index='department_id',
    make_time_index=False,
)

es.add_interesting_values(dataframe_name='order_products',
                          values={'department': ['produce'],
                                  'product_name': ['Banana']})

index id not found in dataframe, creating new integer column
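
Conceptually, normalizing a dataframe pulls the distinct values of a key column out into a parent table and keeps the key in the child table as a foreign key. The sketch below shows that idea with plain dictionaries on a few made-up rows. It illustrates the concept only and is not how Featuretools implements `normalize_dataframe`; the `normalize` helper is hypothetical.

```python
def normalize(rows, index, additional_columns=()):
    """Split rows into (child_rows, parent_rows).

    parent_rows has one row per distinct value of `index` and takes the
    additional columns with it; child_rows keep `index` as a foreign key.
    """
    parents = {}
    children = []
    for row in rows:
        key = row[index]
        # Keep the first value seen for each additional column.
        parents.setdefault(key, {index: key, **{c: row[c] for c in additional_columns}})
        children.append({k: v for k, v in row.items() if k not in additional_columns})
    return children, list(parents.values())


# Illustrative rows in the shape of the order_products dataframe.
order_products = [
    {"id": 0, "order_id": 36461, "user_id": 50978, "product_name": "Caprese Salad"},
    {"id": 1, "order_id": 36461, "user_id": 50978, "product_name": "Cream Cheese"},
    {"id": 2, "order_id": 195206, "user_id": 1519, "product_name": "Tuna Salad"},
]

# order_products -> orders -> customers, mirroring the first two calls above.
order_products, orders = normalize(order_products, "order_id", ["user_id"])
orders, customers = normalize(orders, "user_id")
print(orders)
print(customers)
```

After both steps, each table holds one row per entity, and the one-to-many links run from customers to orders to order products.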


# Calculating the Features
Now you can generate features by using a method called Deep Feature Synthesis (DFS). That method automatically builds features by stacking and applying mathematical operations called primitives across relationships in an entityset. The more structured an entityset is, the better DFS can leverage the relationships to generate better features. Let’s run DFS using the following parameters:

* entityset as the entityset we structured previously.

* target_dataframe_name as the customer dataframe.

* cutoff_time as the label times that we generated previously. The label values are appended to the feature matrix.

In [123]:
fm, fd = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)

fm.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,COUNT(orders),COUNT(order_products),MAX(order_products.add_to_cart_order),MAX(order_products.index),MAX(order_products.reordered),MEAN(order_products.add_to_cart_order),MEAN(order_products.index),MEAN(order_products.reordered),MIN(order_products.add_to_cart_order),MIN(order_products.index),...,SUM(orders.NUM_UNIQUE(order_products.department)),SUM(orders.SKEW(order_products.add_to_cart_order)),SUM(orders.SKEW(order_products.index)),SUM(orders.SKEW(order_products.reordered)),SUM(orders.STD(order_products.add_to_cart_order)),SUM(orders.STD(order_products.index)),SUM(orders.STD(order_products.reordered)),COUNT(order_products WHERE product_name = Banana),COUNT(order_products WHERE department = produce),bought_product
user_id,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
341,2037-12-03 19:00:00,1,1,8.0,536.0,1.0,8.0,536.0,1.0,8.0,536.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,False
341,2038-01-02 19:00:00,1,2,11.0,537.0,1.0,9.5,536.5,1.0,8.0,536.0,...,2.0,0.0,0.0,0.0,2.12132,0.707107,0.0,0,0,False
341,2038-02-01 19:00:00,1,3,11.0,538.0,1.0,7.333333,537.0,1.0,3.0,536.0,...,3.0,-0.722109,0.0,0.0,4.041452,1.0,0.0,0,0,False
341,2038-03-03 19:00:00,1,4,18.0,539.0,1.0,10.0,537.5,0.75,3.0,536.0,...,4.0,0.437807,0.0,-2.0,6.271629,1.290994,0.5,0,0,False
341,2038-04-02 19:00:00,1,5,18.0,540.0,1.0,11.4,538.0,0.8,3.0,536.0,...,4.0,-0.285748,0.0,-2.236068,6.268971,1.581139,0.447214,0,0,False


There are two outputs from DFS: a feature matrix and feature definitions. The feature matrix is a table that contains the feature values with the corresponding labels based on the cutoff times. Feature definitions are features in a list that can be stored and reused later to calculate the same set of features on future data.
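
The role of the cutoff time can be sketched without Featuretools. For each customer and cutoff pair, an aggregation such as `COUNT(order_products)` only sees rows whose time index is strictly before the cutoff, which corresponds to `include_cutoff_time=False` in the call above. The rows, cutoffs, and the `count_before` helper below are made up for illustration.

```python
from datetime import datetime

# (user_id, order_time, product_name), illustrative rows.
order_products = [
    (341, datetime(2015, 1, 1, 19), "Cream Cheese"),
    (341, datetime(2015, 1, 31, 19), "Banana"),
    (341, datetime(2015, 3, 2, 19), "Tuna Salad"),
    (50978, datetime(2015, 1, 1, 13), "Banana"),
]


def count_before(rows, user_id, cutoff, product=None):
    """COUNT of a customer's rows strictly before the cutoff,
    optionally restricted to one product (a WHERE-style feature)."""
    return sum(
        1
        for uid, t, name in rows
        if uid == user_id and t < cutoff and (product is None or name == product)
    )


cutoffs = [(341, datetime(2015, 1, 31, 19)), (341, datetime(2015, 3, 2, 19))]
feature_rows = [
    {
        "user_id": uid,
        "time": cutoff,
        "COUNT(order_products)": count_before(order_products, uid, cutoff),
        "COUNT(order_products WHERE product_name = Banana)": count_before(
            order_products, uid, cutoff, product="Banana"
        ),
    }
    for uid, cutoff in cutoffs
]
for row in feature_rows:
    print(row)
```

The banana order placed exactly at the first cutoff is excluded from that row's features, so the features never include data from the period being predicted.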

In [19]:
# %pip install evalml

False    178
True       1
Name: bought_product, dtype: int64


In [67]:
# %pip install imbalanced-learn



# Machine Learning
In the previous steps, you generated the labels and features. The final step is to build the machine learning pipeline.

## Splitting the Data
Start by extracting the labels from the feature matrix and splitting the data into a training set and a holdout set.

In [124]:
import evalml

fm.reset_index(drop=True, inplace=True)
y = fm.ww.pop('bought_product')

splits = evalml.preprocessing.split_data(
    X=fm,
    y=y,
    test_size=0.2,
    random_seed=0,
    problem_type='binary',
)

X_train, X_holdout, y_train, y_holdout = splits
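
With only 145 positive labels out of 4884, a purely random split could leave the holdout set with very few positives. The sketch below shows one way to split while preserving the class ratio, using the label counts reported by `lt.describe()`. It illustrates the idea of a stratified split with a hypothetical helper and is not EvalML's implementation.

```python
import random


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, holdout_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * test_size)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Label counts taken from the label distribution shown earlier.
labels = [False] * 4739 + [True] * 145
train_idx, holdout_idx = stratified_split(labels)

holdout_positives = sum(labels[i] for i in holdout_idx)
print(len(train_idx), len(holdout_idx), holdout_positives)
```

Splitting each class separately guarantees that both sets contain positives in roughly the original proportion.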

# Finding the Best Model
Run a search on the training set to find the best machine learning model. During the search process, predictions from several different pipelines are evaluated.

In [125]:
automl = evalml.AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='binary',
    objective='f1',
    random_seed=0,
    allowed_model_families=['catboost', 'random_forest'],
    max_iterations=3,
)

automl.search()

{1: {'Random Forest Classifier w/ Label Encoder + Drop Null Columns Transformer + Imputer + One Hot Encoder + Oversampler': 15.315903663635254,
  'Total time of batch': 15.481111764907837},
 2: {'Random Forest Classifier w/ Label Encoder + Drop Null Columns Transformer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model': 17.621840000152588,
  'Total time of batch': 17.78543519973755}}
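
The search above optimizes F1, the harmonic mean of precision and recall, which is a common choice when positive labels are rare. The sketch below computes it from confusion-matrix counts. The counts are invented for illustration and are not results from this tutorial.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall); 0.0 if there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical holdout results: 29 actual positives, 100 predicted positives.
some_hits = f1_score(tp=20, fp=80, fn=9)
# A model that never predicts the positive class finds no true positives.
never_positive = f1_score(tp=0, fp=0, fn=29)

print(round(some_hits, 3), never_positive)
```

A model that always predicts `False` scores zero on F1, so the search cannot win by ignoring the rare class.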

In [126]:
automl.best_pipeline.describe()


*********************************************************************************************************************************************************
* Random Forest Classifier w/ Label Encoder + Drop Null Columns Transformer + Imputer + One Hot Encoder + Oversampler + RF Classifier Select From Model *
*********************************************************************************************************************************************************

Problem Type: binary
Model Family: Random Forest
Number of features: 70

Pipeline Steps

1. Label Encoder
	 * positive_label : None
2. Drop Null Columns Transformer
	 * pct_null_threshold : 1.0
3. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : mean
	 * boolean_impute_strategy : most_frequent
	 * categorical_fill_value : None
	 * numeric_fill_value : None
	 * boolean_fill_value : None
4. One Hot Encoder
	 * top_n : 10
	 * features_to_encode : None
	 * categories : None
	 * drop : if_binary
	 * handle_unknown : ignore
	 * handle_missing : error
5. Oversampler
	 * sampling_ratio : 0.25
	 * k_neighbors_default : 5
	 * n_jobs : -1
	 * sampling_ratio_dict : None
	 * categorical_features : [100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139]
	 * k_neighbors : 5
6. RF Classifier Select From Model
	 * number_features : None
	 * n_estimators : 10
	 * max_depth : None
	 * percent_features : 0.5
	 * threshold : median
	 * n_jobs : -1
7. Random Forest Classifier
	 * n_estimators : 100
	 * max_depth : 6
	 * n_jobs : -1

In [133]:
best_pipeline = automl.best_pipeline.fit(X_train, y_train)

score = best_pipeline.score(
    X=X_holdout,
    y=y_holdout,
    objectives=['accuracy binary'],
)

dict(score)

{'Accuracy Binary': 0.8996929375639714}
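
Accuracy needs a reference point when the classes are imbalanced. As a rough check, the sketch below computes the accuracy of always predicting the majority class, using the label counts from `lt.describe()`. It assumes the holdout set has about the same class ratio as the full set of labels.

```python
# Label counts from the label distribution shown earlier.
n_false, n_true = 4739, 145

# Accuracy of a baseline that always predicts False.
baseline_accuracy = n_false / (n_false + n_true)

# Holdout accuracy reported above.
holdout_accuracy = 0.8996929375639714

print(f"baseline: {baseline_accuracy:.4f}")
print(f"model:    {holdout_accuracy:.4f}")
print("beats baseline:", holdout_accuracy > baseline_accuracy)
```

Under that assumption, the reported accuracy sits below the majority-class baseline, so scoring the holdout set with objectives such as F1, precision, or recall gives a more informative picture than accuracy alone.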

In [128]:
feature_importance = best_pipeline.feature_importance
feature_importance = feature_importance.set_index('feature')['importance']
top_k = feature_importance.abs().sort_values().tail(20).index

# Making Predictions
You are ready to make predictions with your trained model. Start by calculating the same set of features by using the feature definitions. Also, use a cutoff time based on the latest information available in the dataset.

In [129]:
fm = ft.calculate_feature_matrix(
    features=fd,
    entityset=es,
    cutoff_time=pd.Timestamp('2015-03-02'),
    cutoff_time_in_index=True,
    verbose=False,
)

fm.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,COUNT(orders),COUNT(order_products),MAX(order_products.add_to_cart_order),MAX(order_products.index),MAX(order_products.reordered),MEAN(order_products.add_to_cart_order),MEAN(order_products.index),MEAN(order_products.reordered),MIN(order_products.add_to_cart_order),MIN(order_products.index),...,SUM(orders.NUM_UNIQUE(order_products.aisle)),SUM(orders.NUM_UNIQUE(order_products.department)),SUM(orders.SKEW(order_products.add_to_cart_order)),SUM(orders.SKEW(order_products.index)),SUM(orders.SKEW(order_products.reordered)),SUM(orders.STD(order_products.add_to_cart_order)),SUM(orders.STD(order_products.index)),SUM(orders.STD(order_products.reordered)),COUNT(order_products WHERE product_name = Banana),COUNT(order_products WHERE department = produce)
user_id,time,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
50978,2015-03-02,1,3,7.0,2.0,1.0,6.0,1.0,0.333333,5.0,0.0,...,2.0,1.0,0.0,0.0,1.732051,1.0,1.0,0.57735,0,0
52115,2015-03-02,1,0,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
34652,2015-03-02,1,0,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
2588,2015-03-02,1,0,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
38892,2015-03-02,1,0,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0


Predict whether customers will purchase bananas within the next 3 days.

In [130]:
y_pred = best_pipeline.predict(fm)
y_pred = y_pred.values

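# Keep only the (user_id, time) index; copy so the new column is added to an independent frame
prediction = fm[[]].copy()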
prediction['bought_product (estimate)'] = y_pred
prediction.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,bought_product (estimate)
user_id,time,Unnamed: 2_level_1
50978,2015-03-02,False
52115,2015-03-02,True
34652,2015-03-02,True
2588,2015-03-02,True
38892,2015-03-02,True


# Next Steps
You have completed this tutorial. You can revisit each step to explore and fine-tune the model using different parameters until it is ready for production. For more information about how to work with the features produced by Featuretools, take a look at the Featuretools documentation.