Links to previous EDA work:

- eda1: general exploration of the full, original dataset, mostly histograms comparing reorders to all orders feature by feature: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/eda1_w_data_direct_from_wrangling.ipynb

- eda2: creating new rows for the biggest user so that each order contains all newly ordered items and reorders, plus a record of all items not reordered this time. Also generally explores some of this user's purchasing practices: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/eda2_single_user.ipynb

In [1]:
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn import metrics

from library.sb_utils import save_file
import json

In [2]:
# Get df with practice user's orders plus rows for non-reorders
df = pd.read_csv('../data/processed/practice_user.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34021 entries, 0 to 34020
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   order_id                34021 non-null  int64  
 1   user_id                 34021 non-null  int64  
 2   order_by_user_sequence  34021 non-null  int64  
 3   order_dow               34021 non-null  int64  
 4   order_hour_of_day       34021 non-null  int64  
 5   days_since_prior_order  33995 non-null  float64
 6   product_id              34021 non-null  int64  
 7   add_to_cart_sequence    1992 non-null   float64
 8   reordered               34021 non-null  float64
 9   product_name            34021 non-null  object 
 10  aisle_name              34021 non-null  object 
 11  dept_name               34021 non-null  object 
 12  aisle_id                34021 non-null  float64
 13  department_id           34021 non-null  float64
 14  eval_set                1992 non-null 

In [3]:
# Drop the irrelevant eval_set column
df = df.drop(columns = 'eval_set')
df.head()

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id
0,2959648,32099,1,0,17,,40285,18.0,0.0,Traditional Snack Mix,trail mix snack mix,snacks,125.0,19.0
1,2959648,32099,1,0,17,,27966,15.0,0.0,Organic Raspberries,packaged vegetables fruits,produce,123.0,4.0
2,2959648,32099,1,0,17,,34969,16.0,0.0,Red Vine Tomato,fresh vegetables,produce,83.0,4.0
3,2959648,32099,1,0,17,,7419,12.0,0.0,Sweet Red Grape Tomatoes,fresh vegetables,produce,83.0,4.0
4,2959648,32099,1,0,17,,26209,9.0,0.0,Limes,fresh fruits,produce,24.0,4.0


In [4]:
# Get dictionaries connecting product-aisle-dept

with open('../data/processed/dicts/aisle_dept_dict.txt',
          'r') as ad_file:
    ad_dict = json.load(ad_file)

with open('../data/processed/dicts/prod_aisle_dict.txt',
          'r') as pa_file:
    pa_dict = json.load(pa_file)

with open('../data/processed/dicts/dept_id_name_dict.txt',
          'r') as dd_file:
    dd_dict = json.load(dd_file)

with open('../data/processed/dicts/aisle_id_name_dict.txt',
          'r') as aa_file:
    aa_dict = json.load(aa_file)

with open('../data/processed/dicts/prod_id_name_dict.txt',
          'r') as pp_file:
    pp_dict = json.load(pp_file)

dd_dict

{'7': 'beverages',
 '16': 'dairy eggs',
 '19': 'snacks',
 '17': 'household',
 '4': 'produce',
 '14': 'breakfast',
 '13': 'pantry',
 '20': 'deli',
 '1': 'frozen',
 '11': 'personal care',
 '12': 'meat seafood',
 '6': 'international',
 '3': 'bakery',
 '15': 'canned goods',
 '9': 'dry goods pasta',
 '5': 'alcohol',
 '8': 'pets',
 '18': 'babies',
 '2': 'other',
 '21': 'missing',
 '10': 'bulk'}

In [5]:
# JSON object keys always load as strings; convert them back to int
pp_dict = {int(k):v for k,v in pp_dict.items()}
aa_dict = {int(k):v for k,v in aa_dict.items()}
dd_dict = {int(k):v for k,v in dd_dict.items()}
pa_dict = {int(k):v for k,v in pa_dict.items()}
ad_dict = {int(k):v for k,v in ad_dict.items()}

dd_dict

{7: 'beverages',
 16: 'dairy eggs',
 19: 'snacks',
 17: 'household',
 4: 'produce',
 14: 'breakfast',
 13: 'pantry',
 20: 'deli',
 1: 'frozen',
 11: 'personal care',
 12: 'meat seafood',
 6: 'international',
 3: 'bakery',
 15: 'canned goods',
 9: 'dry goods pasta',
 5: 'alcohol',
 8: 'pets',
 18: 'babies',
 2: 'other',
 21: 'missing',
 10: 'bulk'}

In the initial notebooks I was mostly exploring to see what data exists. Here, the goal is to work through what it would actually take to predict this user's reorders, before moving on to feature engineering on a bigger portion of the full dataset.

Questions to answer:
- Given an order of a product by a person (here, this person; later, many users), what is the likelihood that they will ever reorder it? How soon will they order it again, and again?
- Within a department/aisle, what portion of items ever get reordered? What portion get reordered many times? (This differs from what I calculated in notebook 1, which was the overall ratio of reorders, with no regard for whether one user reorders an item many times or many people reorder it occasionally.)
- How might I make new features that indicate specific reorder practices, i.e. not just "this is a reorder" but "this is a reorder, and it's this person's Nth time reordering the item" or "up until now, this person has reordered this item in p percent of all their orders"?
- In addition to the reorder column, what meaningful features can be engineered from other existing features (e.g. a 0/1 column for whether the product name contains "organic")?
- What methods are best for engineering features, getting dummies, etc.?
- Which models might be best for predicting this one user's reorders? What can I infer about whether similar methods will be useful on a dataset with more users?
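
As a minimal sketch of the name-based feature mentioned in the questions above, a 0/1 flag can be derived with a case-insensitive substring test. The toy frame and the column name `is_organic` are assumptions for illustration only; they are not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for the product_name column (hypothetical rows)
toy = pd.DataFrame({
    'product_name': ['Organic Raspberries', 'Limes',
                     'Large Organic Omega3 Brown Eggs', 'Coke Classic']
})

# 1 if the name contains "organic" (case-insensitive), else 0
toy['is_organic'] = (
    toy['product_name']
    .str.contains('organic', case=False, regex=False)
    .astype(int)
)
print(toy)
```

The same pattern would apply to any other keyword that seems worth testing as a feature.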

In [6]:
# What portion of the items this person orders do they ever reorder?

products = set(df['product_name'].unique())
reordered_ever = set(df[df['reordered']==1]['product_name'].unique())
reordered_ever

{'100% Pineapple Juice',
 '100% Pure Apple Juice',
 '100% Pure Pumpkin',
 '100% Tangerine Juice',
 '2% Reduced Fat Milk',
 '360 Dusters Refills Unscented',
 '4% Milkfat Small Curd Grade A Pasteurized Cottage Cheese',
 '50% Less Sodium Garbanzo Beans',
 'All Natural Four Cheese Ravioli',
 'All Natural Marinara Sauce',
 'All Purpose Flour',
 'Almond & Apricot Bar',
 'Almond Walnut Macadamia Plus Bar',
 'Apple Pie with Cinnamon Sugar',
 'Baby Swiss Cheese',
 'Bag of Organic Bananas',
 'Baguette Sourdough',
 'Barbecue Sauce Original',
 'Bear Clover Premium Honey',
 'Beef Franks',
 'Beefsteak Tomato',
 'Biscuits for Dogs less than 20 lbs',
 'Black Beans',
 'Black Forest Ham',
 'Blueberry Pecan Plus Fiber Fruit & Nut Bar',
 'Blueberry on the Bottom Nonfat Greek Yogurt',
 'Bread Rolls',
 'Bread, Sliced, Extra Sourdough',
 'Buttermilk, Cultured Low Fat',
 'Caramel Almond and Sea Salt Nut Bar',
 'Cavatappi',
 'Cheese Enchilada Meal',
 'Cherubs Heavenly Salad Tomatoes',
 'Chicken Broth',
 'Chive

In [7]:
reordered_never = products - reordered_ever
reordered_never

{'100 Calorie  Per Bag Popcorn',
 '100% Natural Tomato Sauce',
 '100% White Grape Juice',
 '12\\" Aluminium Foil',
 '30 Gallon Large Trash Bags',
 '30% Less Sodium Chili Seasoning Mix',
 '3D White Clasicc Vivid Whitestrips',
 'Americone Dream® Ice Cream',
 'Anti-Slip Grip Cardboard Applicator Regular Absorbency Tampons',
 'Apple Cider Vinegar',
 'Apple Smoked Gruyere Cheese',
 'Aquarium Pump Hand Soap',
 'Asparation/Broccolini/Baby Broccoli',
 'Atlantic Salmon Fillet',
 'Banana',
 'Beef Stew Spices & Seasonings',
 "Bits o' Brickle Baking Chips",
 'Blackberries',
 'Brie Cheese',
 'Brilliance Premium Strength Special Occasion Flatware',
 'Bunched Carrots',
 'Buttermilk Complete Pancake & Waffle Mix',
 'Cabaret Crisp And Buttery Crackers',
 'Canola Oil',
 'Caramel Sauce',
 'Caramels',
 'Carnation Sweetened Condensed Milk',
 'Cherry Strawberry',
 'Cherry Vanilla Cherry On The Bottom Cream Top',
 'Chicken & Apple Smoked Chicken Sausage',
 'Chicken Curry Salad',
 'Chicken Rice Pilaf Mix',
 '

In [8]:
print('number of products = ', len(products))
print('number of products never reordered = ', len(reordered_never))
print('number of products reordered at least once = ', len(reordered_ever))
print('portion of products reordered at least once = ', len(reordered_ever)/len(products))

number of products =  519
number of products never reordered =  269
number of products reordered at least once =  250
portion of products reordered at least once =  0.4816955684007707
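
The same ratio could also be broken down by department or aisle, which is one of the questions listed earlier. The following is a rough sketch on a toy frame with hypothetical rows; `share_by_dept` and `reordered_ever` are illustrative names, and the column names are assumed to match `df`.

```python
import pandas as pd

# Toy data: one row per product per order (hypothetical values)
toy = pd.DataFrame({
    'dept_name':    ['produce', 'produce', 'produce', 'snacks', 'snacks'],
    'product_name': ['Limes', 'Limes', 'Banana', 'Corn Chips', 'Corn Chips'],
    'reordered':    [0.0, 1.0, 0.0, 0.0, 1.0],
})

# One value per product: 1.0 if it was reordered at least once, else 0.0
ever = (toy.groupby(['dept_name', 'product_name'])['reordered']
           .max()
           .rename('reordered_ever'))

# Share of each department's distinct products that were ever reordered
share_by_dept = ever.groupby(level='dept_name').mean()
print(share_by_dept)
```

Swapping `dept_name` for `aisle_name` would give the aisle-level version.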


Now do the same, but for a product's third order (reordered twice), fourth, and so on. In fact, make a column that gives the running reorder count at each reorder. This will help in understanding reorder practices and will also be a useful column later on for preprocessing.

Start with finding total reorders per product, as that's easy. Then copy the "reordered" column into a new "reorders_so_far" column: wherever "reordered" is 0, "reorders_so_far" is also 0, and the values of 1 are already correct for each product's first reorder early in the "order_by_user_sequence", with 1 to be added for every subsequent reorder.

In [9]:
# Getting the total reorders is easy

tot_reorders = df[['product_name', 'reordered']].groupby(
    'product_name', as_index=False).sum()

tot_reorders.sort_values('reordered', ascending=False)

Unnamed: 0,product_name,reordered
90,Coconut Blended Greek Yogurt,46.0
455,Strawberry on the Bottom Nonfat Greek Yogurt,45.0
295,Non Fat Black Cherry on the Bottom Greek Yogurt,44.0
368,Pineapple on the Bottom Greek Yogurt,42.0
8,2% Reduced Fat Milk,39.0
...,...,...
222,Imported Unsalted Butter,0.0
219,Ice Cream Light Chocolate Chip,0.0
218,Ibuprofen Tablets,0.0
216,Honey Vanilla Chamomile Caffeine-Free Herbal Tea,0.0


In [10]:
# OK, so this person is constantly buying Greek yogurt lol,
# in almost every other order (they made 100 orders)

df['tot_reorders'] = tot_reorders.loc[:,'reordered']
df.head()

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id,tot_reorders
0,2959648,32099,1,0,17,,40285,18.0,0.0,Traditional Snack Mix,trail mix snack mix,snacks,125.0,19.0,0.0
1,2959648,32099,1,0,17,,27966,15.0,0.0,Organic Raspberries,packaged vegetables fruits,produce,123.0,4.0,0.0
2,2959648,32099,1,0,17,,34969,16.0,0.0,Red Vine Tomato,fresh vegetables,produce,83.0,4.0,2.0
3,2959648,32099,1,0,17,,7419,12.0,0.0,Sweet Red Grape Tomatoes,fresh vegetables,produce,83.0,4.0,7.0
4,2959648,32099,1,0,17,,26209,9.0,0.0,Limes,fresh fruits,produce,24.0,4.0,3.0
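
One caveat about assigning a column this way: pandas aligns a Series assignment on the row index, not on `product_name`. A frame built with `groupby(..., as_index=False)` has one row per product and a fresh 0-based index, so assigning its column to a longer frame pairs values by row position and leaves the remaining rows as NaN. The sketch below shows one way to attach per-product totals by name instead, using `transform`. The toy rows are hypothetical.

```python
import pandas as pd

# Toy data: several rows per product (hypothetical values)
toy = pd.DataFrame({
    'product_name': ['Limes', 'Banana', 'Limes', 'Limes', 'Banana'],
    'reordered':    [0.0, 0.0, 1.0, 1.0, 0.0],
})

# transform('sum') returns one value per original row, so every row
# carries the total reorder count of its own product
toy['tot_reorders'] = (
    toy.groupby('product_name')['reordered'].transform('sum')
)
print(toy)
```

Merging the grouped totals back on `product_name` would be an equivalent option.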


In [11]:
# Prepare a column showing, for each row, where it falls in the
# product's reorder sequence

df['reorders_so_far'] = df.loc[:,'reordered']
df.head()

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id,tot_reorders,reorders_so_far
0,2959648,32099,1,0,17,,40285,18.0,0.0,Traditional Snack Mix,trail mix snack mix,snacks,125.0,19.0,0.0,0.0
1,2959648,32099,1,0,17,,27966,15.0,0.0,Organic Raspberries,packaged vegetables fruits,produce,123.0,4.0,0.0,0.0
2,2959648,32099,1,0,17,,34969,16.0,0.0,Red Vine Tomato,fresh vegetables,produce,83.0,4.0,2.0,0.0
3,2959648,32099,1,0,17,,7419,12.0,0.0,Sweet Red Grape Tomatoes,fresh vegetables,produce,83.0,4.0,7.0,0.0
4,2959648,32099,1,0,17,,26209,9.0,0.0,Limes,fresh fruits,produce,24.0,4.0,3.0,0.0


In [12]:
# For order_by_user_sequence 1-2, at least, the "reorders_so_far"
# column is already accurate. What will it take to add 1 to the 
# first items reordered twice?

reordered2 = set(df[(df['order_by_user_sequence']==2) & 
                   (df['reordered']==1)]['product_name'])
reordered2

{'Large Organic Omega3 Brown Eggs'}

In [13]:
reordered3 = set(df[(df['order_by_user_sequence']==3) & 
                   (df['reordered']==1)]['product_name'])
reordered3

{'Classic Hummus',
 'Coconut Blended Greek Yogurt',
 'European Cucumber',
 'Iceberg Lettuce',
 'Lime Sparkling Water',
 'Organic Raspberries',
 'Thick Bacon'}

In [14]:
# No items reordered in both orders 2 and 3, try 4

reordered4 = set(df[(df['order_by_user_sequence']==4) & 
                   (df['reordered']==1)]['product_name'])
reordered4

{'Black Forest Ham',
 'Blueberry on the Bottom Nonfat Greek Yogurt',
 'Bread, Sliced, Extra Sourdough',
 'Coconut Blended Greek Yogurt',
 'Coke Classic',
 'Corn Chips',
 'European Cucumber',
 'Non Fat Black Cherry on the Bottom Greek Yogurt',
 'Organic Boneless Skinless Chicken Breast',
 'Pineapple on the Bottom Greek Yogurt',
 'Pink Lady Apple Kombucha',
 'Red Onion',
 'Strawberry on the Bottom Nonfat Greek Yogurt',
 'Traditional Snack Mix'}

In [15]:
reordered2_or3 = reordered2 | reordered3
reordered2_or3

{'Classic Hummus',
 'Coconut Blended Greek Yogurt',
 'European Cucumber',
 'Iceberg Lettuce',
 'Large Organic Omega3 Brown Eggs',
 'Lime Sparkling Water',
 'Organic Raspberries',
 'Thick Bacon'}

In [16]:
# Identify items that have been reordered twice so far now

second_reorder = reordered2_or3.intersection(reordered4)
second_reorder

{'Coconut Blended Greek Yogurt', 'European Cucumber'}

In [17]:
# Update reorders_so_far column

rows_to_change =  df.loc[(df.loc[:,'order_by_user_sequence']==4) & (
    df.loc[:,'product_name'].isin(second_reorder))]

rows_to_change

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id,tot_reorders,reorders_so_far
87,931900,32099,4,1,20,4.0,45,12.0,1.0,European Cucumber,fresh vegetables,produce,83.0,4.0,0.0,1.0
101,931900,32099,4,1,20,4.0,48626,6.0,1.0,Coconut Blended Greek Yogurt,yogurt,dairy eggs,120.0,16.0,0.0,1.0


In [18]:
df.loc[rows_to_change.index, 'reorders_so_far'] = 2
df[df['order_by_user_sequence']==4].sort_values('reorders_so_far')

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id,tot_reorders,reorders_so_far
2064,931900,32099,4,1,20,4.0,25392,,0.0,Plus Downy Clean Breeze Scent Liquid Laundry D...,laundry,household,75.0,17.0,,0.0
2083,931900,32099,4,1,20,4.0,6517,,0.0,Vitaminwater Zero Glow Strawberry Guanabana,energy sports drinks,beverages,64.0,7.0,,0.0
2082,931900,32099,4,1,20,4.0,35695,,0.0,Maple & Brown Sugar Instant Oatmeal,hot cereal pancake mixes,breakfast,130.0,14.0,,0.0
2081,931900,32099,4,1,20,4.0,48131,,0.0,Versatile Stain Remover 65 Loads,laundry,household,75.0,17.0,,0.0
2080,931900,32099,4,1,20,4.0,10869,,0.0,Total All Purpose Grease Cutting Lemon Cleanser,cleaning products,household,114.0,17.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104,931900,32099,4,1,20,4.0,7349,26.0,1.0,"Bread, Sliced, Extra Sourdough",bread,bakery,112.0,3.0,1.0,1.0
92,931900,32099,4,1,20,4.0,44156,7.0,1.0,Strawberry on the Bottom Nonfat Greek Yogurt,yogurt,dairy eggs,120.0,16.0,0.0,1.0
84,931900,32099,4,1,20,4.0,40285,16.0,1.0,Traditional Snack Mix,trail mix snack mix,snacks,125.0,19.0,0.0,1.0
87,931900,32099,4,1,20,4.0,45,12.0,1.0,European Cucumber,fresh vegetables,produce,83.0,4.0,0.0,2.0


In [19]:
# Create loop to repeat this for all the orders
# I'll need a dictionary of each order and items reordered therein

products_reordered_each_order = {}

for order in range(2,101):
    items = set(df[(df['order_by_user_sequence']==order) & (
        df['reordered']==1)]['product_name'])
    products_reordered_each_order[order]=items

print(products_reordered_each_order[2])
print(products_reordered_each_order[3])
print(products_reordered_each_order[4])

{'Large Organic Omega3 Brown Eggs'}
{'Classic Hummus', 'Lime Sparkling Water', 'European Cucumber', 'Coconut Blended Greek Yogurt', 'Thick Bacon', 'Organic Raspberries', 'Iceberg Lettuce'}
{'Traditional Snack Mix', 'Organic Boneless Skinless Chicken Breast', 'Corn Chips', 'Bread, Sliced, Extra Sourdough', 'Strawberry on the Bottom Nonfat Greek Yogurt', 'Pineapple on the Bottom Greek Yogurt', 'Black Forest Ham', 'Coke Classic', 'Red Onion', 'Coconut Blended Greek Yogurt', 'Blueberry on the Bottom Nonfat Greek Yogurt', 'European Cucumber', 'Non Fat Black Cherry on the Bottom Greek Yogurt', 'Pink Lady Apple Kombucha'}
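
A possible variation, sketched on toy data with hypothetical rows, builds the same kind of dictionary with a `groupby` and reads the last order number from the data rather than hardcoding 101. The names `by_order` and `products_reordered` are illustrative only.

```python
import pandas as pd

# Toy data: one row per product per order (hypothetical values)
toy = pd.DataFrame({
    'order_by_user_sequence': [1, 1, 2, 2, 3, 3],
    'product_name': ['Limes', 'Banana', 'Limes', 'Banana', 'Limes', 'Banana'],
    'reordered':    [0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
})

# Set of reordered product names for each order that has any reorders
reordered_rows = toy[toy['reordered'] == 1]
by_order = (reordered_rows
            .groupby('order_by_user_sequence')['product_name']
            .apply(set))

# Cover every order from 2 up to the last one, using an empty set
# for orders in which nothing was reordered
last_order = int(toy['order_by_user_sequence'].max())
products_reordered = {order: by_order.get(order, set())
                      for order in range(2, last_order + 1)}
print(products_reordered)
```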


In [20]:
# Label every reorder after a product's first as 2 for now; later
# passes raise the rows that should be 3 or more

items_so_far = reordered2_or3 | reordered4

for order in range(5,101):
    items_this_order = products_reordered_each_order[order]
    reordered_twice_now = items_so_far & items_this_order
    rows_to_change = df.loc[
        (df['order_by_user_sequence']==order)
        & (df['product_name'].isin(reordered_twice_now))]
    df.loc[rows_to_change.index, 'reorders_so_far'] = 2
    items_so_far = items_so_far | items_this_order

df['reorders_so_far'].value_counts()

0.0    32548
2.0     1223
1.0      250
Name: reorders_so_far, dtype: int64

In [21]:
len(items_so_far)

250

In [22]:
df[df['reorders_so_far']==2].head(9)

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id,tot_reorders,reorders_so_far
87,931900,32099,4,1,20,4.0,45,12.0,1.0,European Cucumber,fresh vegetables,produce,83.0,4.0,0.0,2.0
101,931900,32099,4,1,20,4.0,48626,6.0,1.0,Coconut Blended Greek Yogurt,yogurt,dairy eggs,120.0,16.0,0.0,2.0
114,2154511,32099,5,5,16,4.0,35221,2.0,1.0,Lime Sparkling Water,water seltzer sparkling water,beverages,115.0,7.0,3.0,2.0
115,2154511,32099,5,5,16,4.0,33129,3.0,1.0,Classic Hummus,fresh dips tapenades,deli,67.0,20.0,0.0,2.0
118,2154511,32099,5,5,16,4.0,23296,5.0,1.0,Blueberry on the Bottom Nonfat Greek Yogurt,yogurt,dairy eggs,120.0,16.0,0.0,2.0
120,2154511,32099,5,5,16,4.0,11576,21.0,1.0,Corn Chips,chips pretzels,snacks,107.0,19.0,0.0,2.0
121,2154511,32099,5,5,16,4.0,4462,22.0,1.0,Pink Lady Apple Kombucha,tea,beverages,94.0,7.0,0.0,2.0
123,2154511,32099,5,5,16,4.0,16168,4.0,1.0,Large Organic Omega3 Brown Eggs,eggs,dairy eggs,86.0,16.0,0.0,2.0
135,825019,32099,6,0,8,2.0,28993,9.0,1.0,Iceberg Lettuce,fresh vegetables,produce,83.0,4.0,0.0,2.0


It looks like this came out accurately. Repeat for 3 reorders_so_far.

It will get trickier to find the rows that need replacing. If I just put the current code in another for loop and replace the 2 with a 3, rows that should remain labeled reorders_so_far==2 will also fit the criteria for being relabeled as 3. That wasn't a problem before, because an item from order 5 could only stay the same or change to 2. But an item from order 6 could stay as 0 or 1, stay as 2, or change to 3. I need code that identifies when a row needs to stay as 2.

In [23]:
# Identify the rows whose reorders_so_far value should stay at 2:
# for each product, the earliest order currently labeled 2

grouped_by_product = df[df['reorders_so_far']==2].groupby(
    'product_name')['order_by_user_sequence']

# Rows not labeled 2 get NaN in 'keep' through index alignment
df['keep'] = grouped_by_product.transform('min')

df[df['order_by_user_sequence']==df['keep']]

Unnamed: 0,order_id,user_id,order_by_user_sequence,order_dow,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_sequence,reordered,product_name,aisle_name,dept_name,aisle_id,department_id,tot_reorders,reorders_so_far,keep
87,931900,32099,4,1,20,4.0,45,12.0,1.0,European Cucumber,fresh vegetables,produce,83.0,4.0,0.0,2.0,4.0
101,931900,32099,4,1,20,4.0,48626,6.0,1.0,Coconut Blended Greek Yogurt,yogurt,dairy eggs,120.0,16.0,0.0,2.0,4.0
114,2154511,32099,5,5,16,4.0,35221,2.0,1.0,Lime Sparkling Water,water seltzer sparkling water,beverages,115.0,7.0,3.0,2.0,5.0
115,2154511,32099,5,5,16,4.0,33129,3.0,1.0,Classic Hummus,fresh dips tapenades,deli,67.0,20.0,0.0,2.0,5.0
118,2154511,32099,5,5,16,4.0,23296,5.0,1.0,Blueberry on the Bottom Nonfat Greek Yogurt,yogurt,dairy eggs,120.0,16.0,0.0,2.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1924,2403204,32099,98,4,9,7.0,37637,41.0,1.0,Pure Cane Confectioners Powdered Sugar,baking ingredients,pantry,17.0,13.0,,2.0,98.0
1929,2403204,32099,98,4,9,7.0,19039,17.0,1.0,Honey Roasted Almonds,nuts seeds dried fruit,snacks,117.0,19.0,,2.0,98.0
1946,1629423,32099,100,1,14,1.0,43772,41.0,1.0,Cherubs Heavenly Salad Tomatoes,fresh vegetables,produce,83.0,4.0,,2.0,100.0
1972,1629423,32099,100,1,14,1.0,40603,29.0,1.0,Fabric Softener Sheets,laundry,household,75.0,17.0,,2.0,100.0


In [24]:
# The number of rows where order == keep is the number of rows
# that should end up with a reorders_so_far value of 2.
len(df[df['order_by_user_sequence']==df['keep']])

176

In [25]:
# Relabel as 3 every row currently labeled 2 that is not its product's
# first such row (the 'keep' row); for now, 3 means "three or more".

previous_round_items = set(df[df['reorders_so_far']==2]['product_name'])

for order in range(6,101):
    items_this_order = products_reordered_each_order[order]
    reordered_this_iteration = previous_round_items & items_this_order
    rows_to_change = df.loc[
        (df['order_by_user_sequence']==order)
        & (df['reorders_so_far']==2)
        & (df['order_by_user_sequence']!=df['keep'])
        & (df['product_name'].isin(reordered_this_iteration))]
    df.loc[rows_to_change.index, 'reorders_so_far'] = 3

df['reorders_so_far'].value_counts()

0.0    32548
3.0     1047
1.0      250
2.0      176
Name: reorders_so_far, dtype: int64

In [26]:
# Create nested loop for remaining values of reorders_so_far.



I'll be ready for preprocessing once this one user's orders are being predicted in a way I feel good about. Then I'll use what I learn here about engineering features to do the same on a larger, random set of users from the full dataset.