# Table Summary

Here is my understanding of the data structure.

* `orders`
    * one row per order (index = order_id)
    * does not contain information about reorders
    * `eval_set` indicates which set the order belongs to: `train`, `test`, or `prior`
        * the `test` set is reserved for testing our final model
        * the `prior` and `train` sets are described under `prior_orders` and `train_orders` below
    * columns:
        * `order_id`: order identifier
        * `user_id`: customer identifier
        * `eval_set`: which evaluation set this order belongs to (`prior`, `train`, or `test`)
        * `order_number`: the order sequence number for this user (1 = first, n = nth)
        * `order_dow`: the day of the week the order was placed on
        * `order_hour_of_day`: the hour of the day the order was placed on
        * `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

* `prior_orders`
    * information about orders prior to each user's most recent order (~3.2M orders)
    * contains one row per item per order & whether or not each item is a 'reorder'
        * reorder: 1 if the product has been ordered by this user in the past, 0 otherwise
    * columns:
        * `order_id`: foreign key
        * `product_id`: foreign key
        * `add_to_cart_order`: order in which each product was added to cart
        * `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise
        
    
* `train_orders`
    * training data supplied to participants of the Kaggle competition
    * this table represents the users' most recent orders
    * contains one row per item per order & whether or not each item is a 'reorder' (for training data)
    * none of the rows in `train_orders` will be found in `prior_orders`
    * columns:
        * `order_id`: foreign key
        * `product_id`: foreign key
        * `add_to_cart_order`: order in which each product was added to cart
        * `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise
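
To make the relationships between these tables concrete, here is a minimal sketch using toy DataFrames. The column names come from the summary above, but every value is made up, and the checks only illustrate what I expect to hold in the real data.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values below are made up.
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 1, 1],
    "eval_set": ["prior", "prior", "train"],
    "order_number": [1, 2, 3],
})
prior_orders = pd.DataFrame({
    "order_id": [10, 10, 11],
    "product_id": [100, 200, 100],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [0, 0, 1],
})
train_orders = pd.DataFrame({
    "order_id": [12, 12],
    "product_id": [100, 300],
    "add_to_cart_order": [1, 2],
    "reordered": [1, 0],
})

# Attach order-level info to each item row via the order_id foreign key.
prior_items = prior_orders.merge(orders, on="order_id", how="left")
train_items = train_orders.merge(orders, on="order_id", how="left")

# Each item table should only reference orders from its own eval_set.
prior_ok = (prior_items["eval_set"] == "prior").all()
train_ok = (train_items["eval_set"] == "train").all()

# The two item tables should share no order_ids.
overlap = set(prior_orders["order_id"]) & set(train_orders["order_id"])
```

The same three checks can be run against the real tables once they are loaded.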

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import psycopg2


from db_config import get_db_params
from query_dfs import create_dfs

pd.set_option("display.max_columns", 101)

In [2]:
db_params = get_db_params()
conn = psycopg2.connect(**db_params)

In [3]:
df_orders, df_train, df_prior, df_prod_detail = create_dfs()

In [None]:
# pickle all DFs for ease of use later
# comment this cell after executing once (provided no changes to DB/query_dfs)

# df_orders.to_pickle("./pickle/df_orders.pickle")
# df_train.to_pickle("./pickle/df_train.pickle")
# df_prior.to_pickle("./pickle/df_prior.pickle")
# df_prod_detail.to_pickle("./pickle/df_prod_detail.pickle")
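
An alternative to commenting and uncommenting the pickle lines is a small load-or-build helper. The sketch below is only an illustration: `load_or_build` and `build_demo` are hypothetical names, not part of `query_dfs`, and since `create_dfs()` returns four DataFrames at once it would need a thin wrapper per table to be used this way.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(path, build_fn):
    """Return the DataFrame pickled at `path`; build and pickle it first if missing."""
    path = Path(path)
    if path.exists():
        return pd.read_pickle(path)
    df = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(path)
    return df


# Demo with a made-up builder that records how often it is called.
calls = []


def build_demo():
    calls.append(1)
    return pd.DataFrame({"a": [1, 2]})


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pickle" / "demo.pickle"
    first = load_or_build(target, build_demo)   # builds, then saves
    second = load_or_build(target, build_demo)  # loads from disk
```

The builder runs only on the first call. To force a rebuild after a change to the DB or the queries, delete the pickle file.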


**Let's take a look at how our DataFrames are structured.**

In [None]:
df_orders.shape, df_train.shape, df_prior.shape, df_prod_detail.shape

In [None]:
df_orders.head(3)

In [None]:
df_train.head(3)

In [None]:
df_prior.head(3)

In [None]:
df_prod_detail.head(3)

In [None]:
user_prod_counts = (df_prior
                    .groupby(["product_id", "user_id"], as_index=False)
                    .agg({"order_id": "count"})
                    .rename(columns={'order_id': 'num_orders'}))

user_prod_counts.head(3)

**Let's make sure we understand exactly what 'reordered' means, since it's our target.**

Let's zoom in on one particular user's history with a particular product.

In [None]:
mask = (df_train.order_id == 2845485) & (df_train.product_id == 4957)
df_train[mask]

In [None]:
mask = (df_prior.user_id == 166435) & (df_prior.product_id == 4957)
df_prior[mask].sort_values(by="order_number", ascending=False).head(3)

It turns out that `reordered` is not defined relative to the user's most recent order. Instead, a product is classified as a reorder if the user has ordered it at any point in the past.
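
As a sanity check on this reading of `reordered`, here is a small sketch on a made-up order history. It flags every occurrence of a (user, product) pair after the first one; the `reordered_check` column name is hypothetical.

```python
import pandas as pd

# Hypothetical order history for one user; values are made up.
history = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1],
    "order_number": [1, 1, 2, 3, 3],
    "product_id":   [100, 200, 200, 100, 300],
})

history = history.sort_values(["user_id", "order_number"])

# cumcount() numbers each (user, product) occurrence 0, 1, 2, ...
# so every occurrence after the first counts as a reorder.
history["reordered_check"] = (
    history.groupby(["user_id", "product_id"]).cumcount().gt(0).astype(int)
)
```

Product 100 is flagged in order 3 even though it was skipped in order 2, which matches the "ever ordered" interpretation.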

Since we are trying to predict whether a product will be reordered in the user's **next** order (and not some future order), we should add a feature that states whether or not an item was in the user's previous order. We'll start by adding `next_order_num` and `cart` columns to `df_prior`.

In [None]:
df_prior['next_order_num'] = df_prior.order_number + 1
df_prior.head(3)

In [None]:
# one row per (user, order) holding the array of products in that order
order_carts = (df_prior
               .groupby(['user_id', 'order_id'], as_index=False)
               .agg({'product_id': 'unique'})
               .rename(columns={'product_id': 'cart'}))

df_prior = df_prior.merge(order_carts, on=['user_id', 'order_id'])

df_prior.to_pickle("./pickle/df_prior.pickle")
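
One way a "was this item in the previous order?" flag could be built is sketched below on made-up data. It shifts each purchase to the following order number and self-merges, the same idea behind `next_order_num`. The `in_prev_order` name is hypothetical, and this is not necessarily how the later notebooks derive the feature.

```python
import pandas as pd

# Made-up item-level history shaped like df_prior (one row per item per order).
items = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1],
    "order_number": [1, 1, 2, 3, 3],
    "product_id":   [100, 200, 200, 100, 200],
})

# Re-key each purchase to the order that follows it.
prev = items[["user_id", "order_number", "product_id"]].copy()
prev["order_number"] = prev["order_number"] + 1
prev = prev.assign(in_prev_order=1)

# Left-merge: a match means the product was also in the previous order.
items = items.merge(
    prev, on=["user_id", "order_number", "product_id"], how="left"
)
items["in_prev_order"] = items["in_prev_order"].fillna(0).astype(int)
```

This assumes each product appears at most once per order; otherwise the merge would duplicate rows.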

## Feature Engineering

**We will use the `df_orders`, `df_train`, `df_prior`, and `df_prod_detail` DataFrames to populate a new DataFrame, `X`.**

**`X` will contain all of the features we'll use for our modeling.**

In [None]:
train_user_ids = df_train['user_id'].unique() 
X = user_prod_counts[user_prod_counts['user_id'].isin(train_user_ids)]
X.head(3)

In [None]:
train_user_ids

In [None]:
train_carts = (df_train
               .groupby('user_id', as_index=False)
               .agg({'product_id': 'unique'})
               .rename(columns={'product_id': 'cart'}))
train_carts.head(3)

In [None]:
# DO NOT RE-RUN THIS CELL: a second merge would add duplicate cart_x/cart_y columns.
# If you do re-run it, re-run everything from the cell where X is instantiated.
X = X.merge(train_carts, on="user_id")
X.head(3)

**CAUTION**: the cell below takes a couple of minutes to run, so I pickle the result immediately afterwards. After pickling it once, comment out this line (two cells down):

`X.to_pickle(X_5_pickle_path)`

This avoids re-running the expensive operation.

In [None]:
X['in_cart'] = (X.apply(lambda row: row['product_id'] in row['cart'], axis=1).astype(int))

X.head(3)
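
The row-wise `apply` is what makes this step slow. A possible faster alternative, sketched here on made-up stand-ins (`X_demo`, `train_demo`), is to left-merge against the unique (user, product) pairs in the training data. I have not timed this against the real tables, so treat it as an idea rather than a drop-in replacement.

```python
import pandas as pd

# Made-up stand-ins for X (before the cart merge) and df_train.
X_demo = pd.DataFrame({
    "product_id": [100, 200, 300],
    "user_id":    [1, 1, 2],
    "num_orders": [3, 1, 2],
})
train_demo = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "product_id": [100, 400, 500],
})

# Unique (user, product) pairs that appear in the training carts.
train_pairs = (train_demo[["user_id", "product_id"]]
               .drop_duplicates()
               .assign(in_cart=1))

# Left-merge keeps every row of X_demo; unmatched pairs become 0.
X_demo = X_demo.merge(train_pairs, on=["user_id", "product_id"], how="left")
X_demo["in_cart"] = X_demo["in_cart"].fillna(0).astype(int)
```

Unlike the `apply` version, this does not need the `cart` column at all, only the item-level training rows.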

In [None]:
# save this DF to pickle (X_5 since X has 5 columns at this checkpoint)
X_5_pickle_path = "./pickle/X_5.pickle"
# X.to_pickle(X_5_pickle_path)

Now that we've done this, let's move to `feature_engineering_2`, where we will pick up where we left off by reading this pickled file.