## Kaggle Comp: Instacart Market Basket Analysis
### July 13, 2017
### Pablo Felgueres

The dataset for this competition is a relational set of files describing customers' orders over time. The goal is to predict which previously purchased products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 orders are provided, with the sequence of products purchased in each order, the day of week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying the public release (linked below).

Notes: 

- orders.csv

This file tells which set (prior, train, test) each order belongs to. Predictions are required only for reordered items in the test set orders. 'order_dow' is the day of the week.

- order_products__*.csv

These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. 'reordered' indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit 'None' value for orders with no reordered items. See the evaluation page for full details.
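
The explicit 'None' rule can be sketched as a small formatting helper. This is a minimal sketch, assuming the submission expects one space-separated string of product ids per order (as in sample_submission.csv). The helper name `format_products` and the example predictions are hypothetical.

```python
def format_products(product_ids):
    """Return a space-separated product string, or 'None' for an empty prediction."""
    if not product_ids:
        return 'None'
    return ' '.join(str(pid) for pid in product_ids)


# Hypothetical predictions: order_id -> predicted reordered product ids.
predictions = {17: [13107, 21463], 34: []}

# One (order_id, products) row per order, sorted by order_id.
rows = [(order_id, format_products(pids))
        for order_id, pids in sorted(predictions.items())]
```

Each tuple in `rows` corresponds to one line of the submission file. Check the evaluation page for the authoritative format.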

Useful links:

https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

https://www.kaggle.com/c/instacart-market-basket-analysis/data

In [14]:
from os import path, listdir
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [15]:
# Get CSV file names from the data folder (sorted, because later cells index into this list)
datapath = '../data/'
files = sorted(path.join(datapath, f) for f in listdir(datapath) if f.endswith('.csv'))

In [16]:
for x in files:
    print(x)

../data/aisles.csv
../data/departments.csv
../data/order_products__prior.csv
../data/order_products__train.csv
../data/orders.csv
../data/products.csv
../data/sample_submission.csv


In [29]:
# Load each CSV into a DataFrame (indices follow the file listing printed above)
df_aisles = pd.read_csv(files[0])
df_dpts = pd.read_csv(files[1])
df_order_prior = pd.read_csv(files[2])
df_order_train = pd.read_csv(files[3])
df_orders = pd.read_csv(files[4])
df_products = pd.read_csv(files[5])

In [32]:
df_order_prior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtypes: int64(4)
memory usage: 989.8 MB
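
Nearly 1 GB for four int64 columns suggests that narrower dtypes could help. The sketch below passes a `dtype` mapping to `pd.read_csv` on a tiny made-up CSV. The chosen widths are assumptions (ids fitting in 32 bits, cart position in 16 bits, the flag in 8 bits) and should be checked against the real value ranges before use.

```python
import io

import numpy as np
import pandas as pd

# Tiny stand-in for order_products__prior.csv (values are made up).
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
3,33754,1,0
"""

# Assumed narrower dtypes; verify the real column ranges first.
dtypes = {
    'order_id': np.int32,
    'product_id': np.int32,
    'add_to_cart_order': np.int16,
    'reordered': np.int8,
}

df_small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

# Bytes per row with the narrow dtypes (four int64 columns would need 32).
bytes_per_row = sum(dt.itemsize for dt in df_small.dtypes)
```

Under these assumptions a row needs 11 bytes instead of 32, roughly a third of the memory.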


In [34]:
df_order_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384617 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             1384617 non-null int64
product_id           1384617 non-null int64
add_to_cart_order    1384617 non-null int64
reordered            1384617 non-null int64
dtypes: int64(4)
memory usage: 42.3 MB


In [36]:
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
order_id                  int64
user_id                   int64
eval_set                  object
order_number              int64
order_dow                 int64
order_hour_of_day         int64
days_since_prior_order    float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB


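- 'days_since_prior_order' is float64 while the other numeric columns are int64, which suggests it contains NaNs to watch for (presumably each user's first order, which has no prior order).
- Products, aisles and departments are lookup tables, so they can probably be merged into one.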
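
A quick way to confirm where the NaNs sit is to count nulls per column and check which orders they belong to. This is a minimal sketch on a toy frame that mimics `df_orders`. The values are made up, and the idea that NaNs only occur on a user's first order is an assumption to verify on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_orders (values are made up).
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'user_id': [1, 1, 2, 2],
    'order_number': [1, 2, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, np.nan, 30.0],
})

# Number of missing values per column.
nan_counts = orders.isnull().sum()

# Check the assumption that NaNs appear only on each user's first order.
missing = orders.days_since_prior_order.isnull()
first_order_only = (orders.loc[missing, 'order_number'] == 1).all()
```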

In [38]:
# Merge aisle and department names into products (products is the left table).
df_products = df_products.merge(df_aisles, on='aisle_id')
df_products = df_products.merge(df_dpts, on='department_id')
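
Since aisles and departments act as lookup tables, the merge should neither drop nor duplicate products. The sketch below shows one way to guard against that with the `validate` argument of `DataFrame.merge`, which assumes a pandas version that supports it (0.21 or later). The toy tables are made up.

```python
import pandas as pd

# Toy stand-ins for the three lookup tables (values are made up).
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['banana', 'milk', 'bread'],
    'aisle_id': [10, 20, 30],
    'department_id': [100, 200, 300],
})
aisles = pd.DataFrame({'aisle_id': [10, 20, 30],
                       'aisle': ['fruit', 'dairy', 'bakery']})
departments = pd.DataFrame({'department_id': [100, 200, 300],
                            'department': ['produce', 'dairy eggs', 'bakery']})

# how='left' keeps every product; validate raises if a lookup key is duplicated.
merged = (products
          .merge(aisles, on='aisle_id', how='left', validate='many_to_one')
          .merge(departments, on='department_id', how='left',
                 validate='many_to_one'))

# The row count must be unchanged after both merges.
same_row_count = len(merged) == len(products)
```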

### Divide dataset for training, validation and testing.

In [70]:
users_train = df_orders.loc[(df_orders.eval_set == "train")].user_id

In [72]:
users_test = df_orders.loc[(df_orders.eval_set == "test")].user_id

In [107]:
df_orders_train = df_orders.loc[df_orders.user_id.isin(users_train)].copy()
df_orders_test = df_orders.loc[df_orders.user_id.isin(users_test)].copy()

Build the training dataset by merging df_order_prior with df_orders_train on order_id.

In [118]:
# Prior order contents for users in the training set only.
# An inner join drops prior orders of test users, which a left join would keep with NaN order fields.
df = df_order_prior.merge(df_orders_train, on='order_id', how='inner')
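
The choice of `how` matters for this merge, because df_order_prior holds prior orders for all users, not only the training users. A toy sketch of the difference between a left and an inner join (all values are made up):

```python
import pandas as pd

# Toy prior-order contents: order 3 belongs to a user outside the train set.
order_prior = pd.DataFrame({'order_id': [1, 1, 2, 3],
                            'product_id': [11, 12, 11, 13]})

# Toy order table restricted to train users.
orders_train = pd.DataFrame({'order_id': [1, 2],
                             'user_id': [7, 7]})

left = order_prior.merge(orders_train, on='order_id', how='left')
inner = order_prior.merge(orders_train, on='order_id', how='inner')

# Rows from non-train users survive the left join with a missing user_id.
n_unmatched = left.user_id.isnull().sum()
```

Here `left` keeps four rows, one with a NaN `user_id`, while `inner` keeps only the three rows that belong to train users.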

### Building a greedy baseline model

In [152]:
train_users = users_train.sample(frac=0.7, random_state=15)

In [153]:
# Validation users are the training users that were not sampled into train_users
val_users = users_train[~users_train.isin(train_users)]
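
Splitting by user rather than by row keeps all of a user's orders on the same side of the split, which avoids leaking a user's history into validation. The sketch below is one possible way to carry such a split over to a dataframe. The toy data and the names `df_toy`, `df_train_part` and `df_val_part` are hypothetical, not the notebook's own.

```python
import pandas as pd

# Toy user ids standing in for users_train.
users = pd.Series(range(1, 11), name='user_id')

# 70/30 split at the user level, with the same seed as above.
train_users = users.sample(frac=0.7, random_state=15)
val_users = users[~users.isin(train_users)]

# Toy order rows: two rows per user (values are made up).
df_toy = pd.DataFrame({
    'user_id': [u for u in users for _ in range(2)],
    'product_id': list(range(100, 120)),
})

# Partition the rows by which group their user belongs to.
df_train_part = df_toy[df_toy.user_id.isin(train_users)]
df_val_part = df_toy[df_toy.user_id.isin(val_users)]
```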

In [None]:
#partition dataframe
df_
