# Table Summary

Here is my understanding of the data structure.

* `orders`
    * one row per order (index = order_id)
    * does not contain information about reorders
    * `eval_set` indicates which set the order belongs to: `train`, `test`, or `prior`
        * the `test` set is data reserved for testing our final model
        * the `prior` and `train` sets are described under `prior_orders` and `train_orders` below
    * columns:
        * `order_id`: order identifier
        * `user_id`: customer identifier
        * `eval_set`: which evaluation set this order belongs to (`prior`, `train`, or `test`)
        * `order_number`: the order sequence number for this user (1 = first, n = nth)
        * `order_dow`: the day of the week the order was placed
        * `order_hour_of_day`: the hour of the day the order was placed
        * `days_since_prior_order`: days since the user's previous order, capped at 30 (NA when `order_number` = 1)

* `prior_orders`
    * information about each user's orders before their most recent order (~3.2M orders)
    * contains one row per item per order, plus a flag for whether each item is a 'reorder'
        * reorder: 1 if the product has been ordered by this user in the past, 0 otherwise
    * columns:
        * `order_id`: foreign key
        * `product_id`: foreign key
        * `add_to_cart_order`: order in which each product was added to cart
        * `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise
        
    
* `train_orders`
    * training data supplied to participants of the Kaggle competition
    * contains one row per item per order, plus a flag for whether each item is a 'reorder' (for training data)
    * no row in `train_orders` appears in `prior_orders`
    * columns:
        * `order_id`: foreign key
        * `product_id`: foreign key
        * `add_to_cart_order`: order in which each product was added to cart
        * `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise
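
The properties listed above can be checked directly once the tables are loaded. The following is a minimal sketch on hypothetical toy rows (the frames `orders`, `prior_orders`, and `train_orders` here are made up for illustration and are not the real data); the same expressions should apply to the full tables if the description holds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the schema described above.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "user_id": [10, 10, 10, 20, 20],
    "eval_set": ["prior", "prior", "train", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 30.0, np.nan, 7.0],
})
prior_orders = pd.DataFrame({
    "order_id": [1, 1, 2, 4],
    "product_id": [100, 200, 100, 100],
    "add_to_cart_order": [1, 2, 1, 1],
    "reordered": [0, 0, 1, 0],
})
train_orders = pd.DataFrame({
    "order_id": [3, 3],
    "product_id": [100, 300],
    "add_to_cart_order": [1, 2],
    "reordered": [1, 0],
})

# eval_set only takes the three documented values.
valid_eval_sets = set(orders["eval_set"]) <= {"prior", "train", "test"}

# A user's first order has no preceding order, so the gap is missing.
first_orders = orders[orders["order_number"] == 1]
first_gap_is_na = first_orders["days_since_prior_order"].isna().all()

# The gap is capped at 30 days (max() skips missing values).
gap_capped = orders["days_since_prior_order"].max() <= 30

# No order should appear in both item-level tables.
shared_order_ids = set(prior_orders["order_id"]) & set(train_orders["order_id"])
```

If any of these checks fails on the real data, the summary above needs revisiting before building features on it.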

In [1]:
import numpy as np
import pandas as pd
import psycopg2

from db_config import get_db_params
from query_dfs import create_dfs

In [2]:
db_params = get_db_params()
conn = psycopg2.connect(**db_params)

In [3]:
df_orders, df_train, df_prior, df_prod_detail = create_dfs()

In [30]:
df_orders.shape, df_train.shape, df_prior.shape, df_prod_detail.shape

((3421083, 7), (1384617, 10), (32434489, 10), (49688, 6))

In [4]:
df_orders.head(3)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0


In [5]:
df_train.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,6129,24852,1,1,38907,train,7,1,14,30.0
1,6129,48364,2,1,38907,train,7,1,14,30.0
2,6129,21903,3,1,38907,train,7,1,14,30.0


In [6]:
df_prior.head(3)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,114,24954,1,0,91891,prior,1,0,11,
1,114,1688,2,0,91891,prior,1,0,11,
2,114,37371,3,0,91891,prior,1,0,11,
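
`df_train` and `df_prior` each have 10 columns: the four item-level columns plus the order-level columns from `orders`. The body of `create_dfs` is not shown here, so the following is only a sketch of one way such a shape could be produced in pandas, using hypothetical toy rows and made-up frame names (`order_products`, `items_with_order_info`).

```python
import pandas as pd

# Hypothetical toy rows; the real tables are far larger.
orders = pd.DataFrame({
    "order_id": [114, 115],
    "user_id": [91891, 91891],
    "eval_set": ["prior", "prior"],
    "order_number": [1, 2],
})
order_products = pd.DataFrame({
    "order_id": [114, 114, 115],
    "product_id": [24954, 1688, 24954],
    "add_to_cart_order": [1, 2, 1],
    "reordered": [0, 0, 1],
})

# A left join keeps every item row and copies the order-level columns onto it.
# validate="many_to_one" raises if order_id is not unique in `orders`.
items_with_order_info = order_products.merge(
    orders, on="order_id", how="left", validate="many_to_one"
)
```

The `validate` argument is a cheap guard: it fails loudly if `orders` ever contains duplicate `order_id` values, which would otherwise silently multiply item rows.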


In [7]:
df_prod_detail.head(3)

Unnamed: 0,product_id,aisle_id,department_id,product_name,aisle,department
0,30843,1,20,Detox Salad,prepared soups salads,deli
1,16618,1,20,Classic Potato Salad,prepared soups salads,deli
2,14864,1,20,Low-Fat Chicken Tortilla Soup,prepared soups salads,deli


In [None]:


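# Count how many prior orders contain each product, per user.
user_prod_counts = (df_prior
                    .groupby(["product_id", "user_id"], as_index=False)
                    .agg({"order_id": "count"})
                    .rename(columns={"order_id": "num_orders"}))

# Keep only users who have a train order. The ids must come from df_train;
# ids taken from df_prior would match every row and filter nothing.
train_user_ids = df_train["user_id"].unique()
X = user_prod_counts[user_prod_counts["user_id"].isin(train_user_ids)]
X.head()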

In [None]:
X.num_orders.value_counts()
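
To make the two cells above concrete, here is a small sketch of the same count-then-tabulate pattern on hypothetical toy rows (`toy_prior` is made up for illustration). It assumes a product appears at most once per order, so counting `order_id` per (product, user) pair gives the number of orders that contained the product.

```python
import pandas as pd

# Hypothetical item-level rows in the same shape as df_prior.
toy_prior = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 3, 4],
    "user_id":    [10, 10, 10, 10, 10, 20],
    "product_id": [100, 200, 100, 200, 100, 100],
})

# One row per (product, user) pair with the number of orders containing it.
toy_counts = (toy_prior
              .groupby(["product_id", "user_id"], as_index=False)
              .agg({"order_id": "count"})
              .rename(columns={"order_id": "num_orders"}))

# How many (product, user) pairs were ordered exactly k times.
pairs_per_count = toy_counts["num_orders"].value_counts().sort_index()
```

Here user 10 ordered product 100 three times and product 200 twice, and user 20 ordered product 100 once, so each count from 1 to 3 occurs for exactly one pair.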