# Project: Instacart Basket Analysis
## Author: Cassy Stunkel
## Task 4.10, Part 1

## Table of Contents
### 01. Import Libraries and Data Set
### 02. Security Implications Consideration
### 03. Regional Customer Segmentation
### 04. Low-Activity Customer Exclusion

## 01. Import libraries and data set.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# Define path
path = r'/Users/cassystunkel/Documents/Instacart Basket Analysis'

In [3]:
# Import data set
df = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_cust.pkl'))

## 02. Consider any security implications that might exist for this new data.

In [4]:
# Check column headers to confirm that no identifying information is present in the dataframe

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434212 entries, 0 to 32434211
Data columns (total 30 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   product_id              int64  
 1   product_name            object 
 2   aisle_id                int64  
 3   department_id           int64  
 4   prices                  float64
 5   order_id                int64  
 6   user_id                 int64  
 7   order_number            int64  
 8   orders_day_of_week      int64  
 9   order_hour_of_day       int64  
 10  days_since_prior_order  float64
 11  first_order             bool   
 12  add_to_cart_order       int64  
 13  reordered               int64  
 14  Busiest day             object 
 15  Busiest days            object 
 16  busiest_period_of_day   object 
 17  max_order               int64  
 18  loyalty_flag            object 
 19  mean_spend              float64
 20  spend_flag              object 
 21  median_order_frequency  float

#### No personally identifiable information is present in the dataframe.
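
A check like this can also be scripted rather than done by eye. The following is a minimal sketch on a made-up toy dataframe; the list `pii_like_names` and the columns `first_name` and `surname` are illustrative assumptions, not columns confirmed to exist in this data set.

```python
import pandas as pd

# Toy dataframe standing in for the real data (column names are hypothetical)
toy = pd.DataFrame({
    'user_id': [1, 2],
    'first_name': ['Ann', 'Bob'],
    'prices': [5.9, 3.2],
})

# Hypothetical list of column names that would suggest identifying information
pii_like_names = ['first_name', 'surname', 'email', 'phone', 'address']

# Keep only the columns whose names appear in the list
flagged = [col for col in toy.columns if col.lower() in pii_like_names]
print(flagged)
```

A name-based check only catches obviously labeled columns, so it would complement a manual review of the column list rather than replace it.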

## 03. The Instacart officers are interested in comparing customer behavior in different geographic areas. Create a regional segmentation of the data.

In [6]:
# Create new column 'region' based on the 'state' column using for-loop
# Lists of states belonging to each region
northeast = ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']
midwest = ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']
south = ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']
west = ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona', 'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii']

# Assign each row's state to a region; anything unrecognized is labeled 'missing'
result = []
for value in df['state']:
    if value in northeast:
        result.append('Northeast')
    elif value in midwest:
        result.append('Midwest')
    elif value in south:
        result.append('South')
    elif value in west:
        result.append('West')
    else:
        result.append('missing')

In [7]:
# Combine results with dataframe
df['region'] = result

In [8]:
# Print frequency of new 'region' column
df['region'].value_counts(dropna = False)

region
South        10801610
West          8300445
Midwest       7603810
Northeast     5728347
Name: count, dtype: int64
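
The same state-to-region assignment could also be done without a loop, using a dictionary and `Series.map`. This is a minimal sketch on made-up toy data; the dictionary `state_to_region` is hypothetical and deliberately shortened to one state per region, and 'Atlantis' is a made-up value used to show the fallback label.

```python
import pandas as pd

# Hypothetical, shortened lookup: one state per region for illustration only
state_to_region = {
    'Maine': 'Northeast',
    'Ohio': 'Midwest',
    'Texas': 'South',
    'Oregon': 'West',
}

toy = pd.DataFrame({'state': ['Texas', 'Maine', 'Oregon', 'Ohio', 'Atlantis']})

# map() returns NaN for states missing from the dictionary; fillna labels them
toy['region'] = toy['state'].map(state_to_region).fillna('missing')
print(toy['region'].tolist())
```

On a dataframe with tens of millions of rows, a vectorized lookup like this is typically faster than appending to a Python list row by row.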

### Determine whether there's a difference in spending habits between the different US regions.

In [10]:
# Create crosstab to compare 'region' and 'spend_flag' columns
crosstab = pd.crosstab(df['region'], df['spend_flag'], dropna = False)

In [11]:
crosstab

spend_flag  High spender  Low spender
region
Midwest           156129      7447681
Northeast         108343      5620004
South             210182     10591428
West              160807      8139638


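#### The South has the highest counts of both high spenders and low spenders, which mainly reflects its larger share of rows. As a share of each region's rows, high spenders are similar everywhere: roughly 1.9% to 2.1%.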
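
To compare regions of different sizes, the crosstab can be normalized by row so that each region's values sum to 1. A minimal sketch on made-up toy data (the values below are not from the Instacart data):

```python
import pandas as pd

# Toy data: four rows for one region, two for another
toy = pd.DataFrame({
    'region': ['South', 'South', 'South', 'South', 'West', 'West'],
    'spend_flag': ['High spender', 'Low spender', 'Low spender', 'Low spender',
                   'High spender', 'Low spender'],
})

# normalize='index' divides each row of the crosstab by its row total
shares = pd.crosstab(toy['region'], toy['spend_flag'], normalize='index')
print(shares)
```

With row-wise shares, a region with more rows no longer looks larger in every category simply because of its size.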

## 04. The Instacart CFO isn't interested in customers who don't generate much revenue for the app. Create an exclusion flag for low-activity customers (customers with fewer than 5 orders) and exclude them from the data.

In [12]:
# Create new column 'active_status' using loc: rows with order_number < 5 are flagged 'low-activity' and rows with order_number >= 5 are flagged 'active'
df.loc[df['order_number'] < 5, 'active_status'] = 'low-activity'
df.loc[df['order_number'] >= 5, 'active_status'] = 'active'

In [17]:
# Check frequency of new column
df['active_status'].value_counts(dropna = False)

active_status
active          24436791
low-activity     7997421
Name: count, dtype: int64
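
Note that `order_number` is the position of an individual order in a customer's history, so a condition on it flags rows rather than customers. The sketch below, on made-up toy data, contrasts it with a condition on `max_order`, under the assumption that `max_order` (column 17 in the `df.info()` output) holds each customer's highest order number; that definition is not shown in this notebook.

```python
import pandas as pd

# Toy data: user 1 placed 6 orders, user 2 placed 3 orders
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'order_number': [1, 2, 3, 4, 5, 6, 1, 2, 3],
})

# Assumed definition of max_order: highest order_number per customer
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

# Row-level condition: flags the first four orders of every customer
by_order_number = toy['order_number'] < 5

# Customer-level condition: flags every row of customers with fewer than 5 orders
by_max_order = toy['max_order'] < 5

print(by_order_number.sum())
print(by_max_order.sum())
```

In this toy example the first condition flags 7 rows, including four rows that belong to the six-order customer, while the second flags only the 3 rows of the three-order customer.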

In [18]:
# Create new dataframe that excludes rows flagged 'low-activity'
df_no_low_activity = df[df['active_status'] != 'low-activity']

In [19]:
# Check frequency of 'active_status' column in new dataframe
df_no_low_activity['active_status'].value_counts(dropna = False)

active_status
active    24436791
Name: count, dtype: int64

In [20]:
# Export new dataset
df_no_low_activity.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'no_low_activity.pkl'))
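
An export can be sanity-checked by reading the file back and comparing it with the dataframe in memory. A minimal sketch on made-up toy data, written to a temporary folder; the file name `toy.pkl` is hypothetical and unrelated to the project files.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'user_id': [1, 2], 'active_status': ['active', 'active']})

# Write to a temporary folder so the sketch does not touch the project path
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(file_path)
    reloaded = pd.read_pickle(file_path)

# True when the reloaded dataframe matches the original in values and dtypes
print(reloaded.equals(toy))
```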