# Exploration Phase - Linear Models Only - Fast Proof of Concept (POC)

**GOAL: Build a classification model that predicts whether a user will buy an item we offer them (via push notification), based on behavioral and personal features of that user (user id, ordered before, etc.), features of that specific order (date, etc.) and features of the items themselves (popularity, price, avg days to buy, etc.).**

Note that sending too many notifications would hurt the user experience.


## Imports

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
path = r'/home/carleondel/data-zrive-ds/box_builder_dataset/feature_frame.csv'
df = pd.read_csv(path)

## Data

This dataset contains 

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2880549 entries, 0 to 2880548
Data columns (total 27 columns):
 #   Column                            Dtype  
---  ------                            -----  
 0   variant_id                        int64  
 1   product_type                      object 
 2   order_id                          int64  
 3   user_id                           int64  
 4   created_at                        object 
 5   order_date                        object 
 6   user_order_seq                    int64  
 7   outcome                           float64
 8   ordered_before                    float64
 9   abandoned_before                  float64
 10  active_snoozed                    float64
 11  set_as_regular                    float64
 12  normalised_price                  float64
 13  discount_pct                      float64
 14  vendor                            object 
 15  global_popularity                 float64
 16  count_adults                      fl

In [5]:
df.head()

Unnamed: 0,variant_id,product_type,order_id,user_id,created_at,order_date,user_order_seq,outcome,ordered_before,abandoned_before,...,count_children,count_babies,count_pets,people_ex_baby,days_since_purchase_variant_id,avg_days_to_buy_variant_id,std_days_to_buy_variant_id,days_since_purchase_product_type,avg_days_to_buy_product_type,std_days_to_buy_product_type
0,33826472919172,ricepastapulses,2807985930372,3482464092292,2020-10-05 16:46:19,2020-10-05 00:00:00,3,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,33.0,42.0,31.134053,30.0,30.0,24.27618
1,33826472919172,ricepastapulses,2808027644036,3466586718340,2020-10-05 17:59:51,2020-10-05 00:00:00,2,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,33.0,42.0,31.134053,30.0,30.0,24.27618
2,33826472919172,ricepastapulses,2808099078276,3481384026244,2020-10-05 20:08:53,2020-10-05 00:00:00,4,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,33.0,42.0,31.134053,30.0,30.0,24.27618
3,33826472919172,ricepastapulses,2808393957508,3291363377284,2020-10-06 08:57:59,2020-10-06 00:00:00,2,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,33.0,42.0,31.134053,30.0,30.0,24.27618
4,33826472919172,ricepastapulses,2808429314180,3537167515780,2020-10-06 10:37:05,2020-10-06 00:00:00,3,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,33.0,42.0,31.134053,30.0,30.0,24.27618


For each order we have one row for every item in the inventory. The 'outcome' column is 1 if the item was bought in that order and 0 if it was not. Since an order only includes a small part of the inventory, most rows have an 'outcome' of 0.
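To make the row structure concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and only mimic the `order_id`, `variant_id` and `outcome` columns of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: 2 orders x 3 inventory items,
# one row per (order, item) pair, as in the real feature frame.
toy = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 2],
    "variant_id": [10, 11, 12, 10, 11, 12],
    "outcome":    [1.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# Every order contributes one row per inventory item.
rows_per_order = toy.groupby("order_id").size()

# Because 'outcome' is 0/1, its mean is the share of rows that were bought.
bought_share = toy["outcome"].mean()

print(rows_per_order.to_dict())
print(f"{100 * bought_share:.2f}% of rows were bought")
```

In the real data the inventory is far larger than a typical basket, so this share is expected to be much smaller than in the toy frame.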

In [6]:
df['outcome'].value_counts()

outcome
0.0    2847317
1.0      33232
Name: count, dtype: int64

In [7]:
print(f"pct of items bought out of all inventory on each order: {100 * df['outcome'].sum() / len(df) :.2f}%")

pct of items bought out of all inventory on each order: 1.15%


In [8]:
df.columns

Index(['variant_id', 'product_type', 'order_id', 'user_id', 'created_at',
       'order_date', 'user_order_seq', 'outcome', 'ordered_before',
       'abandoned_before', 'active_snoozed', 'set_as_regular',
       'normalised_price', 'discount_pct', 'vendor', 'global_popularity',
       'count_adults', 'count_children', 'count_babies', 'count_pets',
       'people_ex_baby', 'days_since_purchase_variant_id',
       'avg_days_to_buy_variant_id', 'std_days_to_buy_variant_id',
       'days_since_purchase_product_type', 'avg_days_to_buy_product_type',
       'std_days_to_buy_product_type'],
      dtype='object')

In [9]:
# Group by 'order_id' and sum the outcomes to get the number of items bought in each order
df_grouped = df.groupby('order_id')['outcome'].sum().reset_index()
df_grouped

Unnamed: 0,order_id,outcome
0,2807985930372,9.0
1,2808027644036,6.0
2,2808099078276,9.0
3,2808393957508,13.0
4,2808429314180,3.0
...,...,...
3441,3643254800516,9.0
3442,3643274788996,5.0
3443,3643283734660,21.0
3444,3643294515332,7.0


In [10]:
# Keep only the ids of orders with at least 5 items bought (outcome sum >= 5)
filtered_orders = df_grouped[df_grouped['outcome'] >= 5]['order_id']
filtered_orders

0       2807985930372
1       2808027644036
2       2808099078276
3       2808393957508
5       2808434524292
            ...      
3438    3643241300100
3441    3643254800516
3442    3643274788996
3443    3643283734660
3444    3643294515332
Name: order_id, Length: 2603, dtype: int64

In [11]:
# We filter the original df using the filtered order ids 
df_filtered = df[df['order_id'].isin(filtered_orders)]

In [12]:
print(f"We have kept {100*len(df_filtered) / len(df):.2f} % of the original dataset")

We have kept 75.12 % of the original dataset


In [13]:
# Quick check to make sure we filtered properly
sum(df_filtered.groupby('order_id')['outcome'].sum() <5)

0
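The group, filter and `isin` steps above could also be written as a single step with `groupby(...).transform`. The sketch below is one possible alternative, not the approach used in this notebook. The helper name `keep_orders_with_min_items` and the toy data are hypothetical.

```python
import pandas as pd


def keep_orders_with_min_items(frame: pd.DataFrame, min_items: int = 5) -> pd.DataFrame:
    """Keep only the rows of orders with at least `min_items` bought items.

    Hypothetical helper: assumes `frame` has 'order_id' and a 0/1 'outcome' column.
    """
    # transform('sum') broadcasts each order's total back onto all of its rows,
    # so the result is aligned with `frame` and can be used as a boolean mask.
    items_per_order = frame.groupby("order_id")["outcome"].transform("sum")
    return frame[items_per_order >= min_items]


# Toy data: order 1 has 5 bought items, order 2 has only 2.
toy = pd.DataFrame({
    "order_id": [1] * 6 + [2] * 6,
    "outcome":  [1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

toy_filtered = keep_orders_with_min_items(toy, min_items=5)
print(toy_filtered["order_id"].unique())
```

This avoids building the intermediate `df_grouped` and `filtered_orders` objects, at the cost of being a little less explicit about which order ids were kept.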