Links to previous EDA work:

- eda1: general exploration of the full, original dataset, mostly histograms comparing reorders to all orders feature by feature: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/eda1_w_data_direct_from_wrangling.ipynb

- eda2: for the biggest user, adds rows so that each order records the newly ordered items, the reorders, and every previously ordered item that was not reordered that time; also explores some of this user's purchasing habits: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/eda2_single_user.ipynb

In [1]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn import metrics

from library.sb_utils import save_file
import json

In [2]:
# Get df with practice user's orders plus rows for non-reorders
df = pd.read_csv('../data/processed/practice_user.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34021 entries, 0 to 34020
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   order_id                34021 non-null  int64  
 1   user_id                 34021 non-null  int64  
 2   order_by_user_sequence  34021 non-null  int64  
 3   order_dow               34021 non-null  int64  
 4   order_hour_of_day       34021 non-null  int64  
 5   days_since_prior_order  33995 non-null  float64
 6   product_id              34021 non-null  int64  
 7   add_to_cart_sequence    1992 non-null   float64
 8   reordered               34021 non-null  float64
 9   product_name            34021 non-null  object 
 10  aisle_name              34021 non-null  object 
 11  dept_name               34021 non-null  object 
 12  aisle_id                34021 non-null  float64
 13  department_id           34021 non-null  float64
 14  eval_set                1992 non-null 

In [3]:
# Get dictionaries connecting product-aisle-dept
with open('../data/processed/dicts/aisle_dept_dict.txt', 'r') as ad_file:
    ad_dict = json.load(ad_file)

with open('../data/processed/dicts/prod_aisle_dict.txt', 'r') as pa_file:
    pa_dict = json.load(pa_file)

with open('../data/processed/dicts/dept_id_name_dict.txt', 'r') as dd_file:
    dd_dict = json.load(dd_file)

with open('../data/processed/dicts/aisle_id_name_dict.txt', 'r') as aa_file:
    aa_dict = json.load(aa_file)

with open('../data/processed/dicts/prod_id_name_dict.txt', 'r') as pp_file:
    pp_dict = json.load(pp_file)

dd_dict

{'7': 'beverages',
 '16': 'dairy eggs',
 '19': 'snacks',
 '17': 'household',
 '4': 'produce',
 '14': 'breakfast',
 '13': 'pantry',
 '20': 'deli',
 '1': 'frozen',
 '11': 'personal care',
 '12': 'meat seafood',
 '6': 'international',
 '3': 'bakery',
 '15': 'canned goods',
 '9': 'dry goods pasta',
 '5': 'alcohol',
 '8': 'pets',
 '18': 'babies',
 '2': 'other',
 '21': 'missing',
 '10': 'bulk'}

In [4]:
# JSON object keys are always loaded as strings; convert them back to ints
pp_dict = {int(k): v for k, v in pp_dict.items()}
aa_dict = {int(k): v for k, v in aa_dict.items()}
dd_dict = {int(k): v for k, v in dd_dict.items()}
pa_dict = {int(k): v for k, v in pa_dict.items()}
ad_dict = {int(k): v for k, v in ad_dict.items()}

dd_dict

{7: 'beverages',
 16: 'dairy eggs',
 19: 'snacks',
 17: 'household',
 4: 'produce',
 14: 'breakfast',
 13: 'pantry',
 20: 'deli',
 1: 'frozen',
 11: 'personal care',
 12: 'meat seafood',
 6: 'international',
 3: 'bakery',
 15: 'canned goods',
 9: 'dry goods pasta',
 5: 'alcohol',
 8: 'pets',
 18: 'babies',
 2: 'other',
 21: 'missing',
 10: 'bulk'}

In the initial notebooks I was mostly exploring to see what data exists. Here, the goal is to work through what it would actually take to predict this user's reorders, before moving on to feature engineering on a larger portion of the full dataset.

Questions to answer:
- Given that a person has ordered a product (here, this user; later, many users), how likely are they to ever reorder it? To reorder it again and again?
- Within a department or aisle, what portion of items get reordered at all? Reordered many times? (In notebook 1 I calculated only the overall ratio of reorders, without regard for whether one user was reordering something many times or many people were reordering it occasionally.)
- How might I make new features that indicate specific reorder practices, i.e. not just "this is a reorder" but "this is a reorder, and it's this person's Nth time reordering the item" or "up until now, this person has reordered this item in p percent of all their orders"?
- In addition to the reorder column, what meaningful features can be engineered from other existing features (e.g. a 0/1 column for whether the product name contains "organic")?
- Which methods are best for engineering features, getting dummies, etc.?
- Which models might be best for predicting this one user's reorders? What can I infer about whether similar methods will be useful on a dataset with more users?
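
As a first pass at the feature questions above, here is a minimal sketch on a tiny invented frame, not the real practice-user data. It assumes `reordered` is 1 on rows where the item was reordered in that order and 0 otherwise, and that `order_by_user_sequence` starts at 1. The column names `nth_reorder`, `prior_reorder_rate`, and `is_organic` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the practice-user frame; all values are invented.
toy = pd.DataFrame({
    "order_by_user_sequence": [1, 1, 2, 2, 3, 3],
    "product_name": ["Organic Banana", "Black Beans"] * 3,
    "reordered": [0, 0, 1, 0, 1, 1],
})
toy = toy.sort_values(["product_name", "order_by_user_sequence"])

# Running count of reorders of each product, up to and including this order.
# On a row with reordered == 1, this is "the Nth time reordering the item".
toy["nth_reorder"] = toy.groupby("product_name")["reordered"].cumsum()

# Share of the user's earlier orders in which the item was reordered.
# The first order has no earlier orders, so its rate is left as NaN.
prior_reorders = toy["nth_reorder"] - toy["reordered"]
prior_orders = toy["order_by_user_sequence"] - 1
toy["prior_reorder_rate"] = prior_reorders / prior_orders.where(prior_orders > 0)

# 0/1 flag for whether the product name contains "organic" (case-insensitive).
toy["is_organic"] = (
    toy["product_name"].str.contains("organic", case=False).astype(int)
)

print(toy)
```

The rate uses only earlier orders so that the feature does not leak the current order's label. Whether the denominator should count all earlier orders or only those since the item was first bought is an open design choice.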

In [11]:
# What portion of the items this person has ordered do they ever reorder?

products = set(df['product_name'].unique())
reordered_ever = set(df[df['reordered']==1]['product_name'].unique())
reordered_ever

{'100% Pineapple Juice',
 '100% Pure Apple Juice',
 '100% Pure Pumpkin',
 '100% Tangerine Juice',
 '2% Reduced Fat Milk',
 '360 Dusters Refills Unscented',
 '4% Milkfat Small Curd Grade A Pasteurized Cottage Cheese',
 '50% Less Sodium Garbanzo Beans',
 'All Natural Four Cheese Ravioli',
 'All Natural Marinara Sauce',
 'All Purpose Flour',
 'Almond & Apricot Bar',
 'Almond Walnut Macadamia Plus Bar',
 'Apple Pie with Cinnamon Sugar',
 'Baby Swiss Cheese',
 'Bag of Organic Bananas',
 'Baguette Sourdough',
 'Barbecue Sauce Original',
 'Bear Clover Premium Honey',
 'Beef Franks',
 'Beefsteak Tomato',
 'Biscuits for Dogs less than 20 lbs',
 'Black Beans',
 'Black Forest Ham',
 'Blueberry Pecan Plus Fiber Fruit & Nut Bar',
 'Blueberry on the Bottom Nonfat Greek Yogurt',
 'Bread Rolls',
 'Bread, Sliced, Extra Sourdough',
 'Buttermilk, Cultured Low Fat',
 'Caramel Almond and Sea Salt Nut Bar',
 'Cavatappi',
 'Cheese Enchilada Meal',
 'Cherubs Heavenly Salad Tomatoes',
 'Chicken Broth',
 'Chive

In [12]:
reordered_never = products - reordered_ever
reordered_never

{'100 Calorie  Per Bag Popcorn',
 '100% Natural Tomato Sauce',
 '100% White Grape Juice',
 '12\\" Aluminium Foil',
 '30 Gallon Large Trash Bags',
 '30% Less Sodium Chili Seasoning Mix',
 '3D White Clasicc Vivid Whitestrips',
 'Americone Dream® Ice Cream',
 'Anti-Slip Grip Cardboard Applicator Regular Absorbency Tampons',
 'Apple Cider Vinegar',
 'Apple Smoked Gruyere Cheese',
 'Aquarium Pump Hand Soap',
 'Asparation/Broccolini/Baby Broccoli',
 'Atlantic Salmon Fillet',
 'Banana',
 'Beef Stew Spices & Seasonings',
 "Bits o' Brickle Baking Chips",
 'Blackberries',
 'Brie Cheese',
 'Brilliance Premium Strength Special Occasion Flatware',
 'Bunched Carrots',
 'Buttermilk Complete Pancake & Waffle Mix',
 'Cabaret Crisp And Buttery Crackers',
 'Canola Oil',
 'Caramel Sauce',
 'Caramels',
 'Carnation Sweetened Condensed Milk',
 'Cherry Strawberry',
 'Cherry Vanilla Cherry On The Bottom Cream Top',
 'Chicken & Apple Smoked Chicken Sausage',
 'Chicken Curry Salad',
 'Chicken Rice Pilaf Mix',
 '

In [24]:
print('number of products = ', len(products))
print('number of products never reordered = ', len(reordered_never))
print('number of products reordered at least once = ', len(reordered_ever))
print('portion of products reordered at least once = ', len(reordered_ever)/len(products))

number of products =  519
number of products never reordered =  269
number of products reordered at least once =  250
portion of products reordered at least once =  0.4816955684007707
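
The same "reordered at least once" portion could be broken down by department or aisle, which is one of the questions listed earlier. The sketch below is one possible approach on a small invented frame. The department assignments and `reordered` values are made up for illustration, and `ever` and `portion_by_dept` are hypothetical names.

```python
import pandas as pd

# Toy rows; department assignments and reorder flags are invented.
toy = pd.DataFrame({
    "dept_name": ["produce", "produce", "produce", "pantry", "pantry"],
    "product_name": [
        "Bag of Organic Bananas", "Bag of Organic Bananas", "Blackberries",
        "Canola Oil", "Caramel Sauce",
    ],
    "reordered": [0, 1, 0, 0, 0],
})

# 1 if the product was reordered at least once, else 0.
ever = toy.groupby(["dept_name", "product_name"])["reordered"].max()

# Portion of each department's products that were ever reordered.
portion_by_dept = ever.groupby(level="dept_name").mean()

print(portion_by_dept)
```

Replacing `max()` with `sum()` in the first step would give the number of reorders per product, which is a starting point for the "reordered many times" question.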


Preprocessing can begin once this one user's reorders are being predicted in a way I feel good about. I will then apply what I learn here about engineering features to a larger, random set of users from the full dataset.