# Table Summary

Here is my understanding of the data structure.

* `orders`
    * one row per order (index = order_id)
    * does not contain information about reorders
    * `eval_set` indicates which set the order belongs to: `train`, `test`, or `prior`
        * the `test` set is data reserved for testing our final model
        * the `prior` and `train` sets are described below
    * columns:
        * `order_id`: order identifier
        * `user_id`: customer identifier
        * `eval_set`: which evaluation set this order belongs to (see `prior_orders` and `train_orders` below)
        * `order_number`: the order sequence number for this user (1 = first, n = nth)
        * `order_dow`: the day of the week the order was placed on
        * `order_hour_of_day`: the hour of the day the order was placed on
        * `days_since_prior_order`: days since the user's previous order, capped at 30 (NA when `order_number` = 1)

* `prior_orders`
    * information about all orders placed before each user's most recent order (~3.2M orders)
    * contains one row per item per order & whether or not each item is a 'reorder'
        * reorder: 1 if the product has been ordered by this user in the past, 0 otherwise
    * columns:
        * `order_id`: foreign key
        * `product_id`: foreign key
        * `add_to_cart_order`: order in which each product was added to cart
        * `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise
        
    
* `train_orders`
    * training data supplied to participants of the Kaggle competition
    * this table represents the users' most recent orders
    * contains one row per item per order & whether or not each item is a 'reorder' (for the training data)
    * none of the rows in `train_orders` will be found in `prior_orders`
    * columns:
        * `order_id`: foreign key
        * `product_id`: foreign key
        * `add_to_cart_order`: order in which each product was added to cart
        * `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise
        
* `prod_detail`
    * this table is a combination of `products.csv`, `aisles.csv`, and `departments.csv`
        * created via SQL script (see `db_create.sql`)
    * contains all product details for each product
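
To make the relationships above concrete, here is a minimal sketch on made-up rows (none of these values come from the actual data). It assumes the item-level tables relate to `orders` through `order_id`, as the column lists above describe; the names `prior_joined` and `overlap` are only illustrative.

```python
import pandas as pd

# Toy rows for illustration only (not real data).
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 1, 1],
    "eval_set": ["prior", "prior", "train"],
    "order_number": [1, 2, 3],
})
prior_orders = pd.DataFrame({
    "order_id": [10, 10, 11],
    "product_id": [196, 500, 196],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [0, 0, 1],
})

# One row per item per order, enriched with the order-level columns.
prior_joined = prior_orders.merge(orders, on="order_id", how="left")

# The user's most recent ('train') order has no rows in prior_orders.
train_order_ids = set(orders.loc[orders["eval_set"] == "train", "order_id"])
overlap = train_order_ids & set(prior_orders["order_id"])

print(prior_joined)
print("train orders found in prior_orders:", overlap)
```

The joined frame has the same shape of information as the `df_prior` DataFrame loaded below: item-level columns followed by order-level columns.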

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import psycopg2


from db_config import get_db_params
from query_dfs import create_dfs

pd.set_option("display.max_columns", 101)

In [2]:
db_params = get_db_params()
conn = psycopg2.connect(**db_params)

In [3]:
# IMPORTANT: USE subset=True UNLESS YOU ARE SURE YOU HAVE ~25GB AVAILABLE MEM
df_orders, df_train, df_prior, df_prod_detail = create_dfs(subset=False)

**Let's take a look at how our DataFrames are structured.**

In [4]:
df_orders.shape, df_train.shape, df_prior.shape, df_prod_detail.shape

((3421083, 7), (1384617, 10), (32434489, 10), (49688, 5))

In [5]:
df_orders.head(3)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0


In [6]:
df_train.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,6129,24852,1,1,38907,train,7,1,14,30.0
1,6129,48364,2,1,38907,train,7,1,14,30.0
2,6129,21903,3,1,38907,train,7,1,14,30.0


In [7]:
df_prior.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,114,24954,1,0,91891,prior,1,0,11,
1,114,1688,2,0,91891,prior,1,0,11,
2,114,37371,3,0,91891,prior,1,0,11,


In [8]:
df_prod_detail.head(3)

product_id,aisle_id,department_id,product_name,aisle,department
30843,1,20,Detox Salad,prepared soups salads,deli
16618,1,20,Classic Potato Salad,prepared soups salads,deli
14864,1,20,Low-Fat Chicken Tortilla Soup,prepared soups salads,deli


In [9]:
user_prod_counts = (df_prior
                    .groupby(["product_id", "user_id"], as_index=False)
                    .agg({"order_id": "count"})
                    .rename(columns={'order_id': 'user_total_prod_orders'}))

user_prod_counts.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders
0,1,138,2
1,1,709,1
2,1,764,2


In [10]:
# add 'cart' column to df_prior
df_prior = (df_prior
 .merge((df_prior
         .groupby(['user_id', 'order_id'], as_index=False)
             .agg({'product_id': 'unique'})
             .rename(columns={'product_id': 'cart'})),
        
        on=['user_id', 'order_id']))

df_prior.to_pickle("./pickle/df_prior.pickle")

In [11]:
# add 'in_cart' column to df_prior
# (always 1 here: each row's product is, by construction, in its own order's cart)
df_prior['in_cart'] = (df_prior.apply(lambda row: row['product_id'] in row['cart'], axis=1)
                       .astype(int))

df_prior.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,cart,in_cart
0,114,24954,1,0,91891,prior,1,0,11,,"[24954, 1688, 37371, 5782, 1263, 23763, 24385,...",1
1,114,1688,2,0,91891,prior,1,0,11,,"[24954, 1688, 37371, 5782, 1263, 23763, 24385,...",1
2,114,37371,3,0,91891,prior,1,0,11,,"[24954, 1688, 37371, 5782, 1263, 23763, 24385,...",1


## Feature Engineering

**We will use the `df_orders`, `df_train`, `df_prior`, and `df_prod_detail` DataFrames to populate a new DataFrame, `X`.**

**`X` will contain all of the features we'll use for our modeling.**

In [12]:
train_user_ids = df_train['user_id'].unique() 
X = user_prod_counts[user_prod_counts['user_id'].isin(train_user_ids)]
X.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders
0,1,138,2
1,1,709,1
3,1,777,1


In [13]:
train_user_ids

array([38907, 87425, 64764, ..., 84438, 36854,  9808])

In [14]:
X.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders
0,1,138,2
1,1,709,1
3,1,777,1


In [15]:
train_carts = (df_train.groupby('user_id', as_index=False)
                                      .agg({'product_id': 'unique'})
                                      .rename(columns={'product_id': 'cart'}))
train_carts.head(3)

Unnamed: 0,user_id,cart
0,1,"[196, 25133, 38928, 26405, 39657, 10258, 13032..."
1,2,"[22963, 7963, 16589, 32792, 41787, 22825, 1364..."
2,5,"[15349, 19057, 16185, 21413, 20843, 20114, 482..."


In [16]:
# DO NOT RE-RUN THIS CELL: a second merge would add duplicate cart columns (cart_x, cart_y).
# If you do re-run it, re-run everything from the cell where X is instantiated.
X = X.merge(train_carts, on="user_id")
X.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders,cart
0,1,138,2,[42475]
1,907,138,2,[42475]
2,1000,138,1,[42475]


In [17]:
X['in_cart'] = (X.apply(lambda row: row['product_id'] in row['cart'], axis=1).astype(int))

X.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders,cart,in_cart
0,1,138,2,[42475],0
1,907,138,2,[42475],0
2,1000,138,1,[42475],0
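
The row-wise `apply` above works, but it can be slow on millions of rows. One possible alternative, sketched here on made-up toy data, is a left merge with `indicator=True` against the distinct (user, product) pairs in the train orders. `X_toy` and `train_toy` are hypothetical stand-ins for `X` and `df_train`; this is not what the notebook actually runs.

```python
import pandas as pd

# Toy stand-ins for X and df_train (illustrative values only).
X_toy = pd.DataFrame({
    "product_id": [1, 907, 42475, 196],
    "user_id": [138, 138, 138, 1],
})
train_toy = pd.DataFrame({
    "user_id": [138, 1, 1],
    "product_id": [42475, 196, 25133],
})

# Distinct (user, product) pairs that appear in the train orders.
train_pairs = train_toy[["user_id", "product_id"]].drop_duplicates()

# Left merge with indicator=True: "_merge" == "both" means the pair
# exists in that user's train cart. Because train_pairs is unique,
# the merge keeps exactly one output row per row of X_toy, in order.
merged = X_toy.merge(train_pairs, on=["user_id", "product_id"],
                     how="left", indicator=True)
X_toy["in_cart"] = (merged["_merge"] == "both").astype(int).to_numpy()

print(X_toy)
```

This avoids storing a list-valued `cart` column on every row, which may also reduce memory use.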


**Let's make sure we understand exactly what 'reordered' means, since it's our target.**

Let's zoom in on one user's history with a particular product.

In [18]:
mask = (df_train.order_id == 2845485) & (df_train.product_id == 4957)
df_train[mask]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
592194,2845485,4957,4,1,166435,train,8,0,18,13.0


In [19]:
mask = (df_prior.user_id == 166435) & (df_prior.product_id == 4957)
df_prior[mask].sort_values(by="order_number", ascending=False).head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,cart,in_cart
26655846,539282,4957,7,1,166435,prior,4,3,22,11.0,"[42719, 10892, 14858, 24852, 4210, 33754, 4957...",1
515385,117818,4957,11,1,166435,prior,3,6,14,20.0,"[42719, 41844, 38185, 35909, 10892, 22093, 176...",1
19286956,520422,4957,6,1,166435,prior,2,0,22,5.0,"[42719, 38185, 24852, 4210, 33754, 4957, 21137...",1


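It turns out `reordered` is not limited to the user's most recent order. In the example above, the product appears in train order number 8 with `reordered` = 1, even though the latest prior order containing it is order number 4. So if the user has ordered the product at any point in the past, it is classified as a reorder.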
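
This reading of `reordered` can be checked on a toy history. The sketch below uses made-up rows and a hypothetical column name, `reordered_check`; it flags every occurrence of a (user, product) pair after the first, no matter how many orders lie in between.

```python
import pandas as pd

# Toy purchase history for one user (illustrative values only).
history = pd.DataFrame({
    "user_id": [7, 7, 7, 7, 7, 7],
    "order_number": [1, 2, 2, 3, 4, 4],
    "product_id": [4957, 4957, 100, 200, 4957, 100],
})

history = history.sort_values(["user_id", "order_number"])

# cumcount() numbers each (user, product) occurrence 0, 1, 2, ...;
# any occurrence after the first counts as a reorder.
history["reordered_check"] = (
    history.groupby(["user_id", "product_id"]).cumcount() > 0
).astype(int)

print(history)
```

Product 100 is flagged in order 4 even though it was absent from order 3, which matches the "ever ordered before" interpretation.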

Since we are trying to predict whether a product will be reordered in the user's **next** order (and not some future order), we should add a feature that states whether or not an item was in the user's most recent `prior` order.

Let's do that now.

In [20]:
# highest prior order_number for each user
last_prior_carts = (df_prior[['user_id', 'order_number']]
                    .groupby("user_id", as_index=False)
                    .agg({"order_number": "max"}))

# attach the cart of that last prior order (df_prior has one row per item, so duplicates remain)
last_prior_carts = (last_prior_carts
         .merge(df_prior[['user_id', 'order_number', 'cart']],
                on=["user_id", "order_number"]))

# keep a single row per user
last_prior_carts.drop(columns="order_number", inplace=True)
last_prior_carts.drop_duplicates(subset="user_id", inplace=True)
last_prior_carts.head(4)

Unnamed: 0,user_id,cart
0,1,"[196, 46149, 39657, 38928, 25133, 10258, 35951..."
9,2,"[24852, 16589, 1559, 19156, 18523, 22825, 2741..."
25,3,"[39190, 18599, 23650, 21903, 47766, 24810]"
31,4,"[26576, 25623, 21573]"
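
As a rough alternative to the merge-then-deduplicate approach above, the last prior cart per user could also be selected with a boolean mask built from `groupby(...).transform("max")`. The sketch below runs on made-up data; `prior_toy` and `last_carts_toy` are hypothetical names standing in for `df_prior` and `last_prior_carts`.

```python
import pandas as pd

# Toy stand-in for df_prior after the 'cart' column was added
# (one row per item, cart repeated on every row of the same order).
prior_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 2, 1, 1],
    "cart": [[196], [196, 500], [196, 500], [300, 301], [300, 301]],
})

# Mark the rows that belong to each user's highest order_number ...
is_last = (prior_toy["order_number"]
           == prior_toy.groupby("user_id")["order_number"].transform("max"))

# ... then keep one row per user.
last_carts_toy = (prior_toy.loc[is_last, ["user_id", "cart"]]
                  .drop_duplicates(subset="user_id")
                  .reset_index(drop=True))

print(last_carts_toy)
```

Deduplicating on `user_id` only (via `subset`) matters here, because the list-valued `cart` column is not hashable.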


In [21]:
X = (X.merge(last_prior_carts, how='left', on="user_id", suffixes=[None, "_last"]))
del last_prior_carts
X.rename(columns={"cart_last": "last_cart"}, inplace=True)

X.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders,cart,in_cart,last_cart
0,1,138,2,[42475],0,"[46802, 22128, 40199, 21573, 26152, 12341]"
1,907,138,2,[42475],0,"[46802, 22128, 40199, 21573, 26152, 12341]"
2,1000,138,1,[42475],0,"[46802, 22128, 40199, 21573, 26152, 12341]"


In [22]:
X.shape

(8474661, 6)

In [23]:
X['in_last_cart'] = (X.apply(lambda row: row['product_id'] in row['last_cart'], axis=1).astype(int))

X.head(3)

Unnamed: 0,product_id,user_id,user_total_prod_orders,cart,in_cart,last_cart,in_last_cart
0,1,138,2,[42475],0,"[46802, 22128, 40199, 21573, 26152, 12341]",0
1,907,138,2,[42475],0,"[46802, 22128, 40199, 21573, 26152, 12341]",0
2,1000,138,1,[42475],0,"[46802, 22128, 40199, 21573, 26152, 12341]",0


In [24]:
# pickle all DFs for ease of use later
# comment this cell after executing once (provided no changes to DB/query_dfs)
X.to_pickle("./pickle/X_7.pickle")
df_orders.to_pickle("./pickle/df_orders.pickle")
df_prior.to_pickle("./pickle/df_prior.pickle")
df_train.to_pickle("./pickle/df_train.pickle")
df_prod_detail.to_pickle("./pickle/df_prod_detail.pickle")

Now that we've done this, let's move to `feature_engineering_2.ipynb`, where we will pick up where we left off by reading these pickled files.

Side note: check your computer's memory. You may need to shut down this notebook's kernel before running `feature_engineering_2.ipynb`.
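
One way to see how much memory a DataFrame occupies before deciding whether to shut down the kernel is `DataFrame.memory_usage(deep=True)`. The sketch below uses a tiny made-up frame (`toy`) in place of `X`; the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy frame standing in for X; the list-valued column mimics 'cart'.
toy = pd.DataFrame({
    "product_id": [1, 907, 1000],
    "user_id": [138, 138, 138],
    "cart": [[42475], [42475], [42475]],
})

# deep=True inspects object columns instead of counting only pointers.
bytes_per_column = toy.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1024 ** 2

print(bytes_per_column)
print(f"total: {total_mb:.6f} MB")

# Dropping list-valued columns reduces the footprint, at the cost of
# having to rebuild them later (a trade-off, not a recommendation).
slim = toy.drop(columns=["cart"])
print(slim.memory_usage(deep=True).sum(), "bytes without 'cart'")
```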