# Step 5 Instacart Exclusion Flag for Customers

# Table of Contents
* [1. Introduction](#1.-Introduction)
* [2. Importing Data](#2.-Importing-Data)
    * [2.1 Importing Libraries](#2.1-Importing-Libraries)
    * [2.2 Importing Data](#2.2-Importing-Data)
* [3. Exclusion Flag](#3.-Exclusion-Flag)
    * [3.1 Aggregate Variables](#3.1-Aggregate-Variables)
    * [3.2 Derived Variables](#3.2-Derived-Variables)
    * [3.3 Final Dataframe](#3.3-Final-Dataframe)
* [4. Exporting Data](#4.-Exporting-Data) 

# 1. Introduction

All of the pertinent dataframes have been combined into a single, extensive dataframe in order to answer Instacart's research question. Because the Instacart CFO is focused on customers who make a substantial contribution to the app's revenue, we'll use a filtering approach: customers with fewer than five orders are treated as low-activity customers and removed from the 'orders_products_all' dataframe. The goal of this step is to produce a final dataframe, representing the behavior of higher-value customers, for more focused analysis.


# 2. Importing Data

## 2.1 Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

## 2.2 Importing Data

In [2]:
path = r'C:\Users\kuohe\Instacart Basket Analysis 12-18-2023'

In [3]:
orders_products_all = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared data', 'orders_products_all_step4.pkl'))

In [4]:
orders_products_all.shape

(32434489, 22)

In [5]:
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,prices,First Name,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423


# 3. Exclusion Flag

## 3.1 Aggregate Variables

In [6]:
# Finding the maximum order number for each user
# The string 'max' is passed instead of np.max to avoid the pandas FutureWarning
orders_products_all['max_order_number'] = orders_products_all.groupby(['user_id'])['order_number'].transform('max')


In [8]:
# Checking results
orders_products_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,First Name,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income,max_order_number
0,2539329,1,1,2,8,,196,1,0,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10


In [9]:
# Checking aggregations
orders_products_all.groupby(['user_id'])['order_number'].max()

user_id
1         10
2         14
3         12
4          5
5          4
          ..
206205     3
206206    67
206207    16
206208    49
206209    13
Name: order_number, Length: 206209, dtype: int8
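
To illustrate why the new column is built with `transform` while the check uses `.max()`, here is a minimal sketch on toy data. The `toy` dataframe and its values are made up for illustration and are not taken from the Instacart data.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Instacart dataset)
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# transform('max') returns one value per row, aligned with the original index,
# so it can be assigned directly as a new column
toy['max_order_number'] = toy.groupby('user_id')['order_number'].transform('max')

# .max() returns one value per group, indexed by user_id,
# which is convenient for checking the column above
per_user_max = toy.groupby('user_id')['order_number'].max()

print(toy)
print(per_user_max)
```

Every row of a user carries that user's maximum, which is what allows the flag in the next section to be applied row by row.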

In [10]:
# Creating a dataframe for checking the max order number
df = orders_products_all[['user_id', 'max_order_number']]

In [11]:
# Subsetting two user_ids to spot-check the result
df.loc[df['user_id'].isin([5, 206208])]

Unnamed: 0,user_id,max_order_number
16344673,206208,49
16344674,206208,49
16344675,206208,49
16344676,206208,49
16344677,206208,49
...,...,...
24652894,5,4
24652895,5,4
24652896,5,4
24652897,5,4


## 3.2 Derived Variables

In [19]:
# Creating the exclusion flag for low activity customers
# This code is for customers that have fewer than 5 orders
orders_products_all.loc[orders_products_all['max_order_number'] < 5, 'activity_customer'] = 'low activity customer'

In [14]:
# This code is for customers that have 5 orders or more
orders_products_all.loc[orders_products_all['max_order_number'] >= 5, 'activity_customer'] = 'regular customer'

In [20]:
# Checking if applied
orders_products_all.groupby(['user_id'])['activity_customer'].max()

user_id
1              regular customer
2              regular customer
3              regular customer
4              regular customer
5         low activity customer
                  ...          
206205    low activity customer
206206         regular customer
206207         regular customer
206208         regular customer
206209         regular customer
Name: activity_customer, Length: 206209, dtype: object
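
As an alternative to the two `.loc` assignments, the same flag could be written in a single step with `np.where`. This is only a sketch on toy data; the `toy` dataframe and its values are hypothetical, while the threshold and labels follow the cells above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Instacart dataset)
toy = pd.DataFrame({
    'user_id':          [1, 1, 2, 3],
    'max_order_number': [10, 10, 4, 5],
})

# Fewer than 5 orders -> low activity; 5 or more -> regular
toy['activity_customer'] = np.where(
    toy['max_order_number'] < 5,
    'low activity customer',
    'regular customer',
)

print(toy)
```

Because every row falls on one side of the threshold, a single expression assigns both labels at once and keeps their spelling in one place.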

In [16]:
# Creating another dataframe for checking activity of customer
df = orders_products_all[['user_id', 'max_order_number', 'activity_customer']] 

In [21]:
# Subsetting two user_ids to spot-check the flag
df.loc[df['user_id'].isin([5, 206208])]

Unnamed: 0,user_id,max_order_number,activity_customer
16344673,206208,49,regular customer
16344674,206208,49,regular customer
16344675,206208,49,regular customer
16344676,206208,49,regular customer
16344677,206208,49,regular customer
...,...,...,...
24652894,5,4,low activity customers
24652895,5,4,low activity customers
24652896,5,4,low activity customers
24652897,5,4,low activity customers


In [18]:
# Checking frequency of flag of "activity_customer"
orders_products_all['activity_customer'].value_counts(dropna = False) 

activity_customer
regular customer          30992966
low activity customers     1441523
Name: count, dtype: int64

## 3.3 Final Dataframe

In [27]:
# Applying the exclusion flag: keeping only regular customers
orders_products_final = orders_products_all.loc[orders_products_all['activity_customer'] == 'regular customer' ]

In [28]:
orders_products_final.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income,max_order_number,activity_customer
0,2539329,1,1,2,8,,196,1,0,Soda,...,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10,regular customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10,regular customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10,regular customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10,regular customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10,regular customer


In [29]:
# Checking the frequency for "activity_customer"
orders_products_final['activity_customer'].value_counts(dropna = False)

activity_customer
regular customer    30992966
Name: count, dtype: int64

In [30]:
orders_products_final.shape

(30992966, 24)
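
As a quick sanity check, the row counts reported above should add up: the rows kept plus the rows excluded should equal the original row count. The sketch below hard-codes the numbers from the outputs in this notebook; the variable names are only illustrative.

```python
# Row counts copied from the outputs above
total_rows = 32434489         # orders_products_all.shape[0]
regular_rows = 30992966       # rows flagged as regular customer
low_activity_rows = 1441523   # rows flagged as low activity

# The two groups should account for every row in the original dataframe
rows_reconcile = (regular_rows + low_activity_rows) == total_rows
print(rows_reconcile)

# Share of rows removed by the exclusion flag
share_removed = low_activity_rows / total_rows
print(f'{share_removed:.2%}')
```

In the notebook itself, the same check could be made against `orders_products_all.shape[0]` and `orders_products_final.shape[0]` rather than hard-coded values.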

In [31]:
# The data now contains only regular customers, so the flag column is constant and can be dropped
# (the CFO does not want to see low activity customers)
orders_products_final = orders_products_final.drop(columns = ['activity_customer'])

In [32]:
# Check column 
orders_products_final.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,...,First Name,Surname,Gender,STATE,Age,date_joined,No._of_dependants,family_status,income,max_order_number
0,2539329,1,1,2,8,,196,1,0,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Linda,Nguyen,Female,Alabama,31,2019-02-17,3,married,40423,10


# 4. Exporting Data

In [33]:
# Exporting to prepared data folder
# Since there are about 31 million rows of data, it is better to export in pickle format for easier access
orders_products_final.to_pickle(os.path.join(path, '02 Data','Prepared data', 'orders_products_final_step5.pkl'))
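
After exporting, it can be worth confirming that the pickle file reloads intact. The sketch below shows the idea on a small toy dataframe written to a temporary folder; the file name `toy_step5.pkl` and the data are hypothetical, and the real export would be checked against `path` instead.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical values, not from the Instacart dataset)
toy = pd.DataFrame({'user_id': [1, 2], 'max_order_number': [10, 14]})

# Write to a temporary folder and read the file straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy_step5.pkl')
    toy.to_pickle(file_path)
    restored = pd.read_pickle(file_path)

# equals() compares shape, values, and dtypes
same = restored.equals(toy)
print(same)
```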