This notebook contains the following sections:

01 - Importing libraries

02 - Establishing a path and importing the dataframe

03 - Dealing with PII issues

04 - Comparing behaviour in different geographical areas 

05 - Locating and excluding low activity customers from the data frame (data exported to file)


01. Importing libraries

In [1]:
# Importing analytical and visualization libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

02. Establishing path and importing dataframe

In [2]:
# Establishing a path to import data frames
path = r'C:\Users\Paola\Desktop\Shaul\Data\CF\August-2023 Instacart Basket Analysis'

In [3]:
path

'C:\\Users\\Paola\\Desktop\\Shaul\\Data\\CF\\August-2023 Instacart Basket Analysis'

In [4]:
# Importing the 'orders_products_customers.pkl' data set
dfready = pd.read_pickle(os.path.join(path, '2-Data', 'Prepared Data', 'orders_products_customers.pkl'))

In [13]:
# Viewing the missing values
dfready.isnull().sum()

order_id                      0
user_id                       0
order_number                  0
orders_day_of_week            0
order_hour_of_day             0
days_since_prior_order        0
product_id                    0
add_to_cart_order             0
reordered                     0
product_name              30200
aisle_id                  30200
department_id             30200
prices                    30200
price_range_loc           30200
Busiest days                  0
Busiest hours                 0
max_order                     0
loyalty_flag                  0
avg_prices                    0
spending_flag                 0
median_prior_order            0
frequency_flag                0
gender                        0
state                         0
age                           0
dependants                    0
civil_status                  0
income                        0
dtype: int64

In [14]:
# Looking for duplicates
df_dups = dfready[dfready.duplicated()]

In [15]:
df_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,avg_prices,spending_flag,median_prior_order,frequency_flag,gender,state,age,dependants,civil_status,income


No duplicates were found. Missing values (30200 rows in each of five product-related columns) make up about 0.09% of the data and cannot be imputed. They are left as they are, since they do not influence the overall analysis of the data.
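The share of missing values quoted above can be checked directly rather than by hand. A minimal sketch on a made-up toy frame (the name `toy` and its values are hypothetical; on the real data the same calls would be applied to `dfready`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dfready
toy = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'product_name': ['Milk', np.nan, 'Eggs', 'Bread'],
    'prices': [2.5, np.nan, 3.0, 1.2],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Percentage of rows with at least one missing value
rows_missing_pct = toy.isnull().any(axis=1).mean() * 100

print(missing_pct)
print(rows_missing_pct)
```

Taking the mean of the boolean mask gives the fraction of missing entries, so no separate division by the row count is needed.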

03. Dealing with PII issues

In [5]:
dfready.columns

Index(['order_id', 'user_id', 'order_number', 'orders_day_of_week',
       'order_hour_of_day', 'days_since_prior_order', 'product_id',
       'add_to_cart_order', 'reordered', 'product_name', 'aisle_id',
       'department_id', 'prices', 'price_range_loc', 'Busiest days',
       'Busiest hours', 'max_order', 'loyalty_flag', 'avg_prices',
       'spending_flag', 'median_prior_order', 'frequency_flag', 'gender',
       'state', 'age', 'dependants', 'civil_status', 'income'],
      dtype='object')

PII issues: the original dataframe had two additional columns, 'first_name' and 'last_name', which would allow us to identify particular customers. I also found 11259 missing values in the 'first_name' column. While it would be possible to look into those missing values and replace them (by creating a subset with the 'user_id', 'first_name' and 'last_name' columns to look them up), it is time consuming and would add little to the overall value of the analysis. Hence, these two columns were removed from the new 'dfready' dataframe (although they were kept in the original data set), so the new data is now more secure.
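The removal step itself is not shown in this notebook. A minimal sketch of how such columns could be dropped, assuming a made-up toy frame (`toy`, `pii_columns` and `toy_safe` are hypothetical names, not the ones used in the original preparation):

```python
import pandas as pd

# Hypothetical toy frame with the two identifying columns still present
toy = pd.DataFrame({
    'user_id': [1, 2],
    'first_name': ['Ann', None],
    'last_name': ['Lee', 'Diaz'],
    'state': ['Texas', 'Ohio'],
})

# Columns treated as PII in this sketch
pii_columns = ['first_name', 'last_name']

# drop() returns a new frame; the original frame keeps the columns
toy_safe = toy.drop(columns=pii_columns)

print(toy_safe.columns.tolist())
```

Because `drop` returns a copy by default, the original data stays intact, which matches the note that the columns were kept in the original data set.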

04. The Instacart officers are interested in comparing customer behavior in different geographic areas. Create a regional segmentation of the data.

In [6]:
# Creating 'region' column
Northeast =['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']
Midwest = ['Wisconsin', 'Michigan','Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']
South = ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']
West = ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona', 'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii']

In [7]:
dfready.loc[dfready['state'].isin(Northeast), 'region'] = 'Northeast'

In [8]:
dfready.loc[dfready['state'].isin(Midwest), 'region'] = 'Midwest'

In [9]:
dfready.loc[dfready['state'].isin(South), 'region'] = 'South'

In [10]:
dfready.loc[dfready['state'].isin(West), 'region'] = 'West'
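The four `.loc` assignments could also be written as a single lookup. A minimal sketch, assuming shortened illustrative state lists and a made-up toy frame (`regions`, `state_to_region` and `toy` are hypothetical names; the full lists are the ones defined in the cell above):

```python
import pandas as pd

# Shortened, illustrative lists; not the full region definitions
regions = {
    'Northeast': ['Maine', 'New York'],
    'Midwest': ['Ohio', 'Iowa'],
    'South': ['Texas', 'Florida'],
    'West': ['California', 'Hawaii'],
}

# Invert the mapping: one entry per state pointing at its region
state_to_region = {
    state: region
    for region, states in regions.items()
    for state in states
}

# 'Atlantis' is a deliberately unknown value to show what happens to unmapped states
toy = pd.DataFrame({'state': ['Texas', 'Ohio', 'Hawaii', 'Maine', 'Atlantis']})
toy['region'] = toy['state'].map(state_to_region)

print(toy)
```

States missing from the mapping come out as NaN, so a `value_counts(dropna=False)` check afterwards would reveal any misspelled or unassigned state.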

In [11]:
dfready['region'].head(100)

0       South
1       South
2       South
3       South
4       South
       ...   
95    Midwest
96    Midwest
97    Midwest
98    Midwest
99    Midwest
Name: region, Length: 100, dtype: object

In [12]:
dfready['region'].value_counts(dropna=False)

South        10801982
West          8300617
Midwest       7604016
Northeast     5728444
Name: region, dtype: int64

In [13]:
# Checking spending habits of different regions
crosstab = pd.crosstab(dfready['spending_flag'], dfready['region'], dropna = False)

In [14]:
crosstab.to_clipboard()

Results: The 'South' region has the highest count of all regions (33%), followed by the 'West' (26%), the 'Midwest' (23%), and finally the 'Northeast' (18%).
However, when comparing the proportions of High and Low spender customers, all regions give the same result: only 2% of their customers are High spenders, while the vast majority (98%) are Low spenders. On the one hand, this means (from a marketing perspective) that all regions should be approached with the same priority, as they are likely to give similar results proportionally. On the other hand, it also means that each region's return will be roughly in line with its share of the data, with the South giving the highest return.
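The percentages above were worked out from the crosstab after copying it to the clipboard. They could also be computed in pandas itself. A minimal sketch on a made-up toy frame (the frame `toy`, its values, and the flag labels 'High spender' and 'Low spender' are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame standing in for dfready
toy = pd.DataFrame({
    'spending_flag': ['Low spender', 'Low spender', 'High spender',
                      'Low spender', 'Low spender', 'Low spender'],
    'region': ['South', 'South', 'South', 'West', 'West', 'West'],
})

# Share of each region in the data, in percent
region_share = toy['region'].value_counts(normalize=True) * 100

# Share of each spending flag within a region, in percent
# (normalize='columns' makes every region column sum to 1)
spending_share = pd.crosstab(
    toy['spending_flag'], toy['region'], normalize='columns'
) * 100

print(region_share)
print(spending_share.round(2))
```

Normalizing by column is what makes regions of different sizes directly comparable.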

05. The Instacart CFO isn’t interested in customers who don’t generate much revenue for the app. Create an exclusion flag for low-activity customers (customers with less than 5 orders) and exclude them from the data. Make sure you export this sample.

In [15]:
# Creating an exclusion flag column based on the customer's number of orders (max_order)
dfready.loc[dfready['max_order'] > 4, 'exclusion_flag'] = 'normal customer'

In [16]:
dfready.loc[dfready['max_order'] <= 4, 'exclusion_flag'] = 'low activity customer'
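The two assignments above can be combined into one step, and the split into kept and excluded rows can be checked before exporting. A minimal sketch with `np.where` on a made-up toy frame (`toy`, `low` and `kept` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one row per ordered product, max_order per customer
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'max_order': [10, 10, 3, 5],
})

# Fewer than 5 orders -> low activity; everything else -> normal
toy['exclusion_flag'] = np.where(
    toy['max_order'] < 5, 'low activity customer', 'normal customer'
)

# Split into the two subsets
low = toy[toy['exclusion_flag'] == 'low activity customer']
kept = toy[toy['exclusion_flag'] != 'low activity customer']

# Sanity check: no rows are lost or duplicated by the split
print(len(low), len(kept), len(toy))
```

One difference from the `.loc` version: with `np.where`, a missing `max_order` would be labelled 'normal customer' instead of staying NaN, so this variant assumes the column has no missing values.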

In [21]:
dfready['exclusion_flag'].value_counts(dropna=False)

normal customer    30993489
Name: exclusion_flag, dtype: int64

In [22]:
# Creating a subset to export low activity customers' data to file
dfready_low = dfready[dfready['exclusion_flag'] == 'low activity customer']

In [24]:
# Exporting data to file
dfready_low.to_pickle(os.path.join(path, '2-Data','Prepared Data', 'low_activity_customers_data.pkl'))

In [25]:
# Deleting low activity customers' data from general dataframe
dfready = dfready[dfready.exclusion_flag != 'low activity customer']

In [26]:
# Checking that low activity customers' data was removed from the dataframe
dfready.shape

(30993489, 30)

In [27]:
# Exporting the updated dataframe to file
dfready.to_pickle(os.path.join(path, '2-Data','Prepared Data', 'orders_products_customers_updated.pkl'))