This notebook builds on EDA done here: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/eda3_1user_modeling.ipynb


Here, I need to add rows for all non-orders. That is, if an item was ordered in order 1 but not in order 2, order 2 should get a new row with the same order details (day, time, etc.), the previously ordered product, and a 0 in add_to_cart_sequence (and in 'reordered') to indicate a non-order. This needs to be done for every subsequent order, so that a user's final order contains everything they bought in that order plus non-order rows for every other product they have ever ordered.
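
To make the target shape concrete, here is a minimal sketch on a made-up user with three orders. The helper name `add_non_order_rows` and the toy data are assumptions for illustration only; this is not the code used later in the notebook, and it fills in only the identifying columns plus the two 0 markers.

```python
import pandas as pd

def add_non_order_rows(user_df):
    """Return user_df plus one row per previously bought product missing from each order."""
    seen = set()      # every product this user has bought in earlier orders
    new_rows = []
    # groupby iterates over order numbers in ascending order by default
    for order, group in user_df.groupby('order_by_user_sequence'):
        bought_now = set(group['product_name'])
        for product in sorted(seen - bought_now):
            new_rows.append({'user_id': group['user_id'].iloc[0],
                             'order_by_user_sequence': order,
                             'product_name': product,
                             'add_to_cart_sequence': 0,   # 0 marks a non-order
                             'reordered': 0})
        seen |= bought_now
    return pd.concat([user_df, pd.DataFrame(new_rows)], ignore_index=True)

# Hypothetical toy data: one user, three orders
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1],
    'order_by_user_sequence': [1, 1, 2, 3],
    'product_name': ['milk', 'eggs', 'milk', 'bread'],
    'add_to_cart_sequence': [1, 2, 1, 1],
    'reordered': [0, 0, 1, 0],
})
result = add_non_order_rows(toy)
print(result.sort_values(['order_by_user_sequence', 'add_to_cart_sequence']))
```

In this toy case, order 2 gains a non-order row for eggs, and order 3 gains non-order rows for both eggs and milk.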

In [1]:
import pandas as pd
import numpy as np
import os
import random

from library.sb_utils import save_file

In [2]:
from IPython.display import Audio
sound_file = './alert.wav'

In [3]:
# Import the original full df and drop redundant columns
df = pd.read_csv('../data/processed/full_data_cleaned.csv')
df = df.drop(columns = ['product_id', 'aisle_id', 'department_id', 'eval_set']).copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33819106 entries, 0 to 33819105
Data columns (total 11 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   order_by_user_sequence  int64  
 3   order_dow               int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
 6   add_to_cart_sequence    int64  
 7   reordered               int64  
 8   product_name            object 
 9   aisle_name              object 
 10  dept_name               object 
dtypes: float64(1), int64(7), object(3)
memory usage: 2.8+ GB


Decide what chunk of data to work with for the remainder of the project. Randomly select enough users to leave a df small enough for this computer to handle. Don't split it into train/test sets yet.

In [4]:
# How many total users are there?
len(df['user_id'].unique())

206209

In [5]:
# Deal with null values
df['days_since_prior_order'] = df['days_since_prior_order'].fillna(-1)
df.isnull().any()

order_id                  False
user_id                   False
order_by_user_sequence    False
order_dow                 False
order_hour_of_day         False
days_since_prior_order    False
add_to_cart_sequence      False
reordered                 False
product_name              False
aisle_name                False
dept_name                 False
dtype: bool

In [18]:
# After playing around, I found the computer was able to handle adding rows to a df of appx. 
# ***** users in a reasonable amount of time. Randomly select users. Repeat and then concatenate
# later after rows have been added, if I want to do pre-processing & modeling with more data.

all_users = set(df['user_id'].unique())
users1 = random.sample(list(all_users), 2000)
df1 = df.loc[df['user_id'].isin(users1), :].copy()
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 321100 entries, 6854 to 33813907
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   order_id                321100 non-null  int64  
 1   user_id                 321100 non-null  int64  
 2   order_by_user_sequence  321100 non-null  int64  
 3   order_dow               321100 non-null  int64  
 4   order_hour_of_day       321100 non-null  int64  
 5   days_since_prior_order  321100 non-null  float64
 6   add_to_cart_sequence    321100 non-null  int64  
 7   reordered               321100 non-null  int64  
 8   product_name            321100 non-null  object 
 9   aisle_name              321100 non-null  object 
 10  dept_name               321100 non-null  object 
dtypes: float64(1), int64(7), object(3)
memory usage: 29.4+ MB


In [19]:
not_users1 = all_users - set(users1)
users2 = random.sample(list(not_users1), 20)  
df2 = df.loc[df['user_id'].isin(users2), :].copy()

not_users1or2 = not_users1 - set(users2)
users3 = random.sample(list(not_users1or2), 20)  
df3 = df.loc[df['user_id'].isin(users3), :].copy()

not_users1to3 = not_users1or2 - set(users3)
users4 = random.sample(list(not_users1to3), 20)  
df4 = df.loc[df['user_id'].isin(users4), :].copy()

not_users1to4 = not_users1to3 - set(users4)
users5 = random.sample(list(not_users1to4), 20)  
df5 = df.loc[df['user_id'].isin(users5), :].copy()

print(len(users1), len(users2), len(users3), len(users4), len(users5))

2000 20 20 20 20
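
The chained set subtractions above can also be written as one shuffle followed by slicing, which keeps the batches disjoint by construction. The cell above calls random.sample without a seed, so each run draws different users; the sketch below fixes a seed so the draw can be repeated. The seed value, the `batch_sizes` list, and the stand-in `all_user_ids` range are all assumptions for illustration, not values from this project.

```python
import random

all_user_ids = list(range(1, 1001))   # stand-in for df['user_id'].unique().tolist()
batch_sizes = [200, 20, 20, 20, 20]   # hypothetical batch sizes

rng = random.Random(42)               # fixed seed: the same users are drawn on every run
shuffled = all_user_ids.copy()
rng.shuffle(shuffled)

# Slice consecutive, non-overlapping runs out of the shuffled list
batches = []
start = 0
for size in batch_sizes:
    batches.append(shuffled[start:start + size])
    start += size

print([len(b) for b in batches])
```

Each batch can then be turned into its own df with `df.loc[df['user_id'].isin(batch), :].copy()`, as in the cell above.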


In [20]:
# Adapt the row-adding process I'd used for a single user so that it iterates over multiple users.
# The code I'd used for a single user:
# for n in range(2,100):
    # Get items from order n not reordered in order n+1
    #order_n = df1[df1['order_by_user_sequence']==n
                           #]['product_id'].unique().tolist()
    #order_n1 = df1[df1['order_by_user_sequence']==(
        #n+1)]['product_id'].unique().tolist()
    #only_n = [x for x in order_n if x not in order_n1]
    # Get n1 deets from the big deets dict
    #order_n1_deets = orders_deets.get(n+1)
    # Add to n1 deets dict with product ids from order_n
    #order_n1_deets.update({'product_id': only_n})
    # Turn dict into df of new rows
    #order_n1_new_rows = pd.DataFrame.from_dict(order_n1_deets)
    # Add new rows to practice_user df
    #practice_user = pd.concat([practice_user, order_n1_new_rows])

#practice_user.info()

In [None]:
# Try doing this same thing except without a dictionary. All I really need is to duplicate 
# user_id and order_by_user_sequence and add rows with any products non-ordered in a given 
# order, and I can deal with filling in all the rest of the info later. 

for user in users1:
    rows_to_work_w = df1.loc[df1['user_id']==user, :].copy()
    for order in range(2,101):
        # Get the list of items that need new rows in this order: items ordered (or carried
        # over as non-orders) in the previous order but not ordered here. 
        items_from_this_order = rows_to_work_w[rows_to_work_w[
            'order_by_user_sequence']==order]['product_name'].unique().tolist()
        items_ordered_prior = rows_to_work_w[rows_to_work_w[
            'order_by_user_sequence']==(order-1)]['product_name'].unique().tolist()
        non_orders_this_order = list(set(items_ordered_prior) - set(items_from_this_order))
        
        # Create rows containing just this user, order, and non_ordered products
        new_rows = pd.DataFrame({'user_id': user, 'order_by_user_sequence': order, 
                                 'product_name': non_orders_this_order})
        
        # Fill in other columns with a placeholder 'x' (for now) for easy concatenation with 
        # existing rows.
        new_rows[['order_id', 'order_dow', 'order_hour_of_day', 'days_since_prior_order', 
                  'add_to_cart_sequence', 'reordered', 'aisle_name', 'dept_name']] = 'x'
        
        # Add these new rows to rows_to_work_with so these values are here when loop goes to
        # next order for this user and these products get duplicated there, as well, if they're 
        # not reordered.
        rows_to_work_w = pd.concat([rows_to_work_w, new_rows])
        
    # Once all a user's new rows are added, add these new rows to the full df before moving 
    # to the next user. These are rows where any of the columns have a value of x
    user_new_rows = rows_to_work_w.loc[rows_to_work_w['order_id']=='x', :].copy()
    df1 = pd.concat([df1, user_new_rows])

Audio(sound_file, autoplay=True)

In [None]:
df1.info()

In [None]:
# It looks like the length is what I hoped for. See if some random orders look correct.
# Each order>1 should have new rows (holding the placeholder 'x' for now, rather than 
# add_to_cart_sequence=0) for all items ordered previously but not in that order
df1['user_id'].unique()

In [None]:
df1[(df1['user_id']==3691) & (df1['order_by_user_sequence']==4)]

In [None]:
order_4_items = df1[(df1['user_id']==3691) & (df1['order_by_user_sequence']==4)][
    'product_name'].unique()
order_3_items = df1[(df1['user_id']==3691) & (df1['order_by_user_sequence']==3)][
    'product_name'].unique()
only_new_items_order_4_should_be = set(order_4_items) - set(order_3_items)
only_new_items_order_4_should_be

This worked. Items in this user's 4th order are all present, and every row is either a non-order or a re-order except for the two new items, skim milk and sweet kale salad kit.
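
The same spot check can be generalized to every user and order. The sketch below is one possible way to do that, not part of the original workflow: the function name `find_missing_carryovers` and the toy frames are hypothetical. It reports any order that lacks a product the same user bought in an earlier order.

```python
import pandas as pd

def find_missing_carryovers(df):
    """List (user_id, order, missing_products) for orders lacking a product from an earlier order."""
    problems = []
    for user, user_df in df.groupby('user_id'):
        seen = set()   # products present in this user's earlier orders
        # groupby iterates over order numbers in ascending order by default
        for order, group in user_df.groupby('order_by_user_sequence'):
            present = set(group['product_name'])
            missing = seen - present
            if missing:
                problems.append((user, order, sorted(missing)))
            seen |= present
    return problems

# Hypothetical toy data: 'good' carries milk into order 2, 'bad' does not
good = pd.DataFrame({'user_id': [1, 1, 1],
                     'order_by_user_sequence': [1, 2, 2],
                     'product_name': ['milk', 'eggs', 'milk']})
bad = good.iloc[:2]

print(find_missing_carryovers(good))
print(find_missing_carryovers(bad))
```

An empty list means every order contains a row, ordered or not, for each product the user had bought before it.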

New rows take a long time to create, so save the result to a file immediately and work from that file in a new notebook. That way the kernel can be restarted during later manipulation without rerunning this slow step. Repeat for df1, df2, df3, etc., then concatenate them in the next notebook before moving on. 

In [None]:
users_incorrect_deets = df1
datapath = '../data/processed'
save_file(users_incorrect_deets, 'users_incorrect_deets.csv', datapath)
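
The save-then-concatenate plan could look roughly like the sketch below. Because save_file comes from this project's own library.sb_utils, the sketch uses plain DataFrame.to_csv instead; the chunk contents, the file names, and the temporary directory are all assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-ins for two processed batches of users
chunk_a = pd.DataFrame({'user_id': [1, 1], 'product_name': ['milk', 'eggs']})
chunk_b = pd.DataFrame({'user_id': [2], 'product_name': ['bread']})

with tempfile.TemporaryDirectory() as datapath:
    paths = []
    for name, chunk in [('chunk1.csv', chunk_a), ('chunk2.csv', chunk_b)]:
        path = os.path.join(datapath, name)
        chunk.to_csv(path, index=False)   # index=False avoids an extra unnamed column on reload
        paths.append(path)

    # In the next notebook: read each saved file and stack them into one df
    combined = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)

print(combined)
```

Passing ignore_index=True gives the combined df a fresh 0..n-1 index, which avoids duplicate index labels across batches.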

Continue this in preprocessing1b: https://github.com/fractaldatalearning/Capstone2/blob/main/notebooks/preprocessing1b_get_usable_df.ipynb