# Table of Contents

1. **Introduction to 4.10 Part 1b**
   - Objectives
2. **Creating the 'opc_with_dept' dataframe (granularity: product purchase)**
   - Loading 'opc_filtered.pkl'
   - Data Transformation: add full department name
   - Data Transformation: add department grouping
   - Export dataframe: 'opc_with_dept' / update POPULATION FLOW

Next: 4.10 Part 1c.


# 1. Introduction to 4.10 Part 1b

#### Objectives: 
Following the removal of PII, the exclusion of low-activity customers (those whose maximum order number is below 5), and the addition of 'region' information based on state, this notebook aims to finalize the construction of comprehensive datasets at three levels:
- Product Purchase Level ('opc_all')
- Customer Level ('customer_activity')
- Order Level ('order_activity')
  
The goal is to streamline subsequent analyses and visualizations by ensuring these datasets are detailed, well-structured, and ready for exploration.

# 2. Creating the 'opc_with_dept' dataframe (granularity: product purchase)

In [16]:
# Import libraries. 
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [18]:
path = r'/Users/amyzhang/Desktop/Instacart Basket Analysis/'

In [20]:
opc_all= pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data 2', 'opc_filtered.pkl'))

In [22]:
opc_all.shape

(30991542, 30)

In [27]:
opc_all.columns

Index(['order_id', 'user_id', 'order_number', 'order_day_of_week',
       'order_time', 'days_since_prior_order', 'product_id', 'cart_position',
       'reorder_status', 'product_name', 'department_id', 'prices',
       'delinquent_status', 'price_range_loc', 'day_label',
       'busiest_period_of_day', 'max_order', 'loyalty_flag', 'avg_order_price',
       'spending_flag', 'median_days_since_prior_order', 'frequency_flag',
       'gender', 'state', 'age', 'date_joined', 'n_dependants', 'fam_status',
       'income', 'region'],
      dtype='object')

### Data Transformations: full department name + department groupings
#### Both transformations are added to the small 'dept' lookup table first, so the large purchase table only needs to be merged once instead of being processed twice.

In [37]:
dept= pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'))

In [39]:
dept.shape

(21, 2)

In [44]:
dept.columns

Index(['Unnamed: 0', 'department'], dtype='object')

#### 1) Create the department grouping map. 

In [53]:
# Define categories and generate map
categories = {
    'Basic Necessities': [
        'produce', 'dairy eggs', 'meat seafood', 'pantry', 
        'dry goods pasta', 'canned goods', 'breakfast'
    ],
    'Treats & Indulgence': [
        'snacks', 'frozen', 'bakery', 'alcohol', 
        'beverages', 'deli'
    ],
    'Health & Personal Care': [
        'personal care', 'household', 'babies'
    ],
    'Specialty & Lifestyle': [
        'international', 'bulk', 'pets'
    ],
    'Missing/Uncategorized': [
        'missing'
    ]
}

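# Invert the category lists into a department -> category lookup.
# The loop variable is named 'dept_name' so it is not confused with the 'dept' dataframe.
department_grouping = {
    dept_name: category
    for category, departments in categories.items()
    for dept_name in departments
}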
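
A dict comprehension keeps only the last value for a repeated key, so a department accidentally listed under two categories would be silently assigned to the later one. The following is a minimal sketch of a check for that, using toy category lists; the names `toy_categories`, `name_counts`, and `duplicated` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Toy category lists (hypothetical); 'frozen' is deliberately listed twice.
toy_categories = {
    'Basic Necessities': ['produce', 'pantry'],
    'Treats & Indulgence': ['snacks', 'frozen'],
    'Convenience': ['frozen'],
}

# Count how often each department name appears across all category lists.
name_counts = Counter(
    name for names in toy_categories.values() for name in names
)
duplicated = sorted(name for name, n in name_counts.items() if n > 1)

# Invert to a department -> category lookup, as in the cell above.
toy_grouping = {
    name: category
    for category, names in toy_categories.items()
    for name in names
}

print(duplicated)              # departments listed more than once
print(toy_grouping['frozen'])  # the later category wins
```

An empty `duplicated` list would suggest that each department belongs to exactly one category.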


#### 2) Apply mapping to dept dataframe. 

In [49]:
dept['department_group'] = dept['department'].map(department_grouping)

In [55]:
dept.head(21)

| | Unnamed: 0 | department | department_group |
|---|---|---|---|
| 0 | 1 | frozen | Treats & Indulgence |
| 1 | 2 | other | NaN |
| 2 | 3 | bakery | Treats & Indulgence |
| 3 | 4 | produce | Basic Necessities |
| 4 | 5 | alcohol | Treats & Indulgence |
| 5 | 6 | international | Specialty & Lifestyle |
| 6 | 7 | beverages | Treats & Indulgence |
| 7 | 8 | pets | Specialty & Lifestyle |
| 8 | 9 | dry goods pasta | Basic Necessities |
| 9 | 10 | bulk | Specialty & Lifestyle |
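
In the output above, 'other' has no group because it is not a key in the mapping. Rather than scanning the table by eye, the unmapped departments could be listed directly. This is a small sketch on a toy lookup table; `toy_dept`, `toy_map`, and `unmapped` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the dept lookup table (values are illustrative).
toy_dept = pd.DataFrame({'department': ['frozen', 'other', 'bakery']})
toy_map = {'frozen': 'Treats & Indulgence', 'bakery': 'Treats & Indulgence'}

toy_dept['department_group'] = toy_dept['department'].map(toy_map)

# Series.map returns NaN for values that are not keys in the dict,
# so isna() picks out the departments the mapping does not cover.
unmapped = toy_dept.loc[
    toy_dept['department_group'].isna(), 'department'
].tolist()

print(unmapped)
```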


In [65]:
# Check what products are in 'other'

# Filter opc_all for department_id 2
products_in_dept_2 = opc_all[opc_all['department_id'] == 2]

# View unique product names in this department
unique_product_names = products_in_dept_2['product_name'].unique()

# Print the first 50 unique product names
print(unique_product_names[:50])

['Zero Calorie Tonic Water' 'Crushed Chili' '93/7 Ground Beef'
 'Whole Bay Leaves' 'Raw Walnuts' 'Organic Whole Wheat Couscous'
 '3mg Melatonin Dietary Supplement Tablets - 240 CT' 'Deluxe Nut Mix'
 'Traditional Panettone' 'PM Simply Sleep Nighttime Sleep Aid Caplets'
 'Cherry Nighttime Instant Teething Pain Relief Gel'
 'Facial Tissues with Lotion' 'Melatonin TR, Time Release, 1 mg, Tablets'
 'Original Pickle' 'Giraffes Diapers Size 4 L'
 'Kick It Immune For Kids Drops' 'Max AAA Batteries'
 'Peppermint Essential Oil' 'Light CocoWhip! Coconut Whipped Topping'
 'Roasted Salted Pistachios' 'Walnuts'
 'PM Pain Reliever and Nighttime Sleep Aid Caplets'
 'Roasted Almond Butter' 'BabyRub® Soothing Ointment'
 'Rapid Relief Creamy Diaper Rash Ointment' 'Coconut Flour'
 'Quick Dry White Correction Fluid' 'Melatonin, 3 mg, Tablets'
 'Organic Garam Masala' 'Sunflower Seeds'
 'Coffee Mate French Vanilla Creamer Packets'
 'Hazelnut Liquid Coffee Creamer'
 'Cinnamon Vanilla Creme Liquid Coffee Cream

In [63]:
print(len(unique_product_names))

548


In [67]:
# Check what products are in 'missing'

# Filter opc_all for department_id 21
products_in_dept_21 = opc_all[opc_all['department_id'] == 21]

# View unique product names in this department
unique_product_names_21 = products_in_dept_21['product_name'].unique()

# Print the first 50 unique results
print(unique_product_names_21[:50])


['Strained Non-Fat Strawberry Icelandic Style Skyr Yogurt'
 'Cilantro Bunch' 'Whole Grain Thin Spaghetti'
 'Kings Hawaiian Smoked Bacon Bbq Sauce' 'Riced Cauliflower & Broccoli'
 'Organic Mango Yogurt' 'Organic Riced Cauliflower'
 'Unsweetened Original Almond Milk'
 "Pull Up's Boy's Nighttime Training Pants Size 2T 3T Jumbo Pack"
 "Women's Complete Multi-Vitamin Gummies"
 'Rippled Red Heirloom Potato Chips' 'Fresh Organic Carrots'
 'Organic Nondairy Lemon Cashew Yogurt' 'Peanut Butter Ice Cream Cup'
 'Organic Chocolate Chip Cookie Dough'
 'Soy & Dairy Free Plain Unsweetened Almond Milk Yogurt'
 'Natural Uncured Beef Hot Dog'
 'Plain Dairy-Free Probiotic Drinkable Cashewgurt'
 'Organic Plain Unsweetened Nondairy Cashew Yogurt' "S'mores Ice Cream"
 'Organic Vanilla Grassmilk Yogurt' 'Sugar'
 'Eat Your Colors Purples Puree Baby Food'
 "Children's Multi-Symptom Cold Relief Dye-Free Grape Flavored Syrup"
 'Limited Edition Entree, Chicken Tiki Masala' 'Organic Celery Bunch'
 'Plain Organic G

In [61]:
print(len(unique_product_names_21))


1247
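
The two inspection cells above differ only in the department ID, so the filter could be wrapped in a small helper. This is a sketch on toy data; the function name `unique_products_in_department` and the frame `toy_opc` are assumptions made for illustration.

```python
import pandas as pd


def unique_products_in_department(df, department_id):
    """Return the unique product names for one department_id."""
    mask = df['department_id'] == department_id
    return df.loc[mask, 'product_name'].unique()


# Toy stand-in for opc_all (values are illustrative).
toy_opc = pd.DataFrame({
    'department_id': [2, 2, 21, 2],
    'product_name': ['Walnuts', 'Sunflower Seeds', 'Sugar', 'Walnuts'],
})

other_products = unique_products_in_department(toy_opc, 2)
missing_products = unique_products_in_department(toy_opc, 21)

print(len(other_products), list(other_products))
print(len(missing_products), list(missing_products))
```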


#### Decision: Redefine the department groupings so that 'missing' and 'other' are combined under a new 'Miscellaneous' category. Both departments contain a diverse mix of products (food, supplements, baby care, household items), so a high proportion of purchases from this category will indicate the need for more manual profiling.
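
Once the group label is attached to the purchase rows, the proportion mentioned in this decision could be measured along the following lines. This is a hedged sketch on toy data, not part of the notebook; `toy_purchases`, `group_share`, and `misc_share_by_user` are hypothetical names.

```python
import pandas as pd

# Toy purchase rows (illustrative values only).
toy_purchases = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 2, 2],
    'department_group': [
        'Basic Necessities', 'Miscellaneous', 'Miscellaneous',
        'Basic Necessities', 'Basic Necessities', 'Treats & Indulgence',
        'Treats & Indulgence', 'Miscellaneous',
    ],
})

# Overall share of purchase rows per department group.
group_share = toy_purchases['department_group'].value_counts(normalize=True)

# Per-customer share of purchases that fall under 'Miscellaneous':
# the mean of a boolean flag is the fraction of True values.
misc_share_by_user = (
    toy_purchases['department_group']
    .eq('Miscellaneous')
    .groupby(toy_purchases['user_id'])
    .mean()
)

print(group_share)
print(misc_share_by_user)
```

Customers with a high `misc_share_by_user` value would be the ones flagged for manual profiling under this decision.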

#### 3) Re-create department grouping and re-apply to dept dataframe. 

In [71]:
categories = {
    'Basic Necessities': [
        'produce', 'dairy eggs', 'meat seafood', 'pantry', 
        'dry goods pasta', 'canned goods', 'breakfast'
    ],
    'Treats & Indulgence': [
        'snacks', 'frozen', 'bakery', 'alcohol', 
        'beverages', 'deli'
    ],
    'Health & Personal Care': [
        'personal care', 'household', 'babies'
    ],
    'Specialty & Lifestyle': [
        'international', 'bulk', 'pets'
    ],
    'Miscellaneous': [  # Added category for missing and other
        'missing', 'other'
    ]
}

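# Invert the category lists into a department -> category lookup.
# The loop variable is named 'dept_name' so it is not confused with the 'dept' dataframe.
department_grouping = {
    dept_name: category
    for category, departments in categories.items()
    for dept_name in departments
}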

In [73]:
dept['department_group'] = dept['department'].map(department_grouping)

In [75]:
dept.head(21)

| | Unnamed: 0 | department | department_group |
|---|---|---|---|
| 0 | 1 | frozen | Treats & Indulgence |
| 1 | 2 | other | Miscellaneous |
| 2 | 3 | bakery | Treats & Indulgence |
| 3 | 4 | produce | Basic Necessities |
| 4 | 5 | alcohol | Treats & Indulgence |
| 5 | 6 | international | Specialty & Lifestyle |
| 6 | 7 | beverages | Treats & Indulgence |
| 7 | 8 | pets | Specialty & Lifestyle |
| 8 | 9 | dry goods pasta | Basic Necessities |
| 9 | 10 | bulk | Specialty & Lifestyle |


In [105]:
# Rename column 'department' in order to be able to merge with opc_all on 'department_id'
dept.rename(columns={'department': 'department_id'}, inplace=True)

In [107]:
dept.head()

| | Unnamed: 0 | department_id | department_group |
|---|---|---|---|
| 0 | 1 | frozen | Treats & Indulgence |
| 1 | 2 | other | Miscellaneous |
| 2 | 3 | bakery | Treats & Indulgence |
| 3 | 4 | produce | Basic Necessities |
| 4 | 5 | alcohol | Treats & Indulgence |


In [115]:
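# Correction: the previous cell renamed the wrong column.
# Restore the name 'department', then rename 'Unnamed: 0' (the actual ID column) to 'department_id'.
dept.rename(columns={'department_id': 'department'}, inplace=True)
dept.rename(columns={'Unnamed: 0': 'department_id'}, inplace=True)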

In [117]:
dept.head()

| | department_id | department | department_group |
|---|---|---|---|
| 0 | 1 | frozen | Treats & Indulgence |
| 1 | 2 | other | Miscellaneous |
| 2 | 3 | bakery | Treats & Indulgence |
| 3 | 4 | produce | Basic Necessities |
| 4 | 5 | alcohol | Treats & Indulgence |
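
The rename-and-correct sequence above could be avoided by renaming only the ID column and checking the result before moving on. This is a minimal sketch on a toy frame that mimics the column layout of the loaded CSV; `toy_dept` and `renamed` are hypothetical names.

```python
import pandas as pd

# Toy frame with the same column layout as the loaded departments file.
toy_dept = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3],
    'department': ['frozen', 'other', 'bakery'],
})

# Rename only the ID column; without inplace=True the original frame is
# left untouched, so a wrong mapping is easy to discard and redo.
renamed = toy_dept.rename(columns={'Unnamed: 0': 'department_id'})

print(renamed.columns.tolist())

# A lookup table should have exactly one row per key.
print(renamed['department_id'].is_unique)
```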


#### 4) Merge 'dept' with 'opc_all'.

In [119]:
# Ensure both 'department_id' columns are of the same type (either int or string)
opc_all['department_id'] = opc_all['department_id'].astype(str)
dept['department_id'] = dept['department_id'].astype(str)

# Now, perform the merge
opc_with_dept = opc_all.merge(dept, on='department_id', how='left')

In [122]:
opc_with_dept[['product_id', 'product_name', 'department_id', 'department', 'department_group']].head(25)

| | product_id | product_name | department_id | department | department_group |
|---|---|---|---|---|---|
| 0 | 196 | Soda | 7 | beverages | Treats & Indulgence |
| 1 | 14084 | Organic Unsweetened Vanilla Almond Milk | 16 | dairy eggs | Basic Necessities |
| 2 | 12427 | Original Beef Jerky | 19 | snacks | Treats & Indulgence |
| 3 | 26088 | Aged White Cheddar Popcorn | 19 | snacks | Treats & Indulgence |
| 4 | 26405 | XL Pick-A-Size Paper Towel Rolls | 17 | household | Health & Personal Care |
| 5 | 196 | Soda | 7 | beverages | Treats & Indulgence |
| 6 | 10258 | Pistachios | 19 | snacks | Treats & Indulgence |
| 7 | 12427 | Original Beef Jerky | 19 | snacks | Treats & Indulgence |
| 8 | 13176 | Bag of Organic Bananas | 4 | produce | Basic Necessities |
| 9 | 26088 | Aged White Cheddar Popcorn | 19 | snacks | Treats & Indulgence |


In [124]:
opc_with_dept.shape

(30991542, 32)
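
The unchanged row count above suggests the left merge did not duplicate purchases. The same guarantees could be made explicit with the `validate` and `indicator` parameters of `DataFrame.merge`. This is a sketch on toy data, with the hypothetical frames `toy_opc` and `toy_dept`; department ID 99 is deliberately missing from the lookup.

```python
import pandas as pd

# Toy purchase rows and lookup table (illustrative values only).
toy_opc = pd.DataFrame({
    'product_id': [196, 14084, 999],
    'department_id': [7, 16, 99],
})
toy_dept = pd.DataFrame({
    'department_id': [7, 16],
    'department': ['beverages', 'dairy eggs'],
})

# validate='many_to_one' raises a MergeError if the lookup has duplicate keys;
# indicator=True adds a '_merge' column recording where each row matched.
merged = toy_opc.merge(
    toy_dept,
    on='department_id',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# A left merge against a unique key should preserve the row count.
rows_preserved = len(merged) == len(toy_opc)

# Purchases whose department_id has no match in the lookup table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'department_id'].tolist()

print(rows_preserved, unmatched)
```

In this sketch both key columns are integers, so no cast is needed; the merge only requires that the two key columns share a dtype.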

### Export opc_with_dept dataframe.

In [131]:
# Export dataframe. 
opc_with_dept.to_pickle(os.path.join(path, '02 Data','Prepared Data 2', 'opc_with_dept.pkl'))