In [18]:
import numpy as np
import pandas as pd

from os import path
import re

import pickle

In [2]:
fd = ['..','data','raw']

data = {}
fn_list = ['orders.csv', 'products.csv', 'order_products__prior.csv', 'order_products__train.csv', 'departments.csv', 'aisles.csv']

for fn in fn_list:
    fp = path.join(*fd, fn)

    # Key each table by its file name without the .csv extension
    label = re.sub(r'\.csv$', '', fn)

    with open(file=fp, mode='r', encoding='utf8') as file:
        data[label] = pd.read_csv(file)

In [3]:
data['orders']['eval_set'].value_counts()

prior    3214874
train     131209
test       75000
Name: eval_set, dtype: int64

The naming conventions as per the Kaggle description are as follows:
- Prior: historic order data to be used in training models.
- Train: historic order data to be used in evaluating trained models.
- Test: "new" orders on which to make recommendations to submit to the competition for ultimate performance scoring. Data on products ordered for these orders is not provided.

Since this project is not submitting predictions to the Kaggle competition (which has finished), we will simply use the "prior" and "train" sets as our train and test sets, respectively.

This assumes we take a simple approach of evaluating recommendations by their ability to predict items that will be in the set of test orders. If we shift to using an alternative evaluation method/metric, then this split approach may need to be revisited.
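As a rough illustration of what "ability to predict items in the test orders" could mean, the sketch below scores a single order by comparing a recommended set of products with the products actually ordered. The function name `order_scores`, the choice of precision/recall/F1, and the product ids are assumptions for illustration only; the metric for this project has not been fixed.

```python
def order_scores(predicted, actual):
    """Precision, recall and F1 for one order's predicted vs. actual product ids."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)  # products that were both recommended and ordered
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical product ids for a single test order
p, r, f = order_scores(predicted=[1, 2, 3, 4], actual=[2, 4, 5])
print(p, r, f)
```

Averaging such per-order scores over all test orders would give one overall number; other aggregations are equally possible.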

For the time being, we will create full datasets (joining all tables) for our training ("prior") and testing ("train") sets for convenience in future work and analysis:

In [10]:
data_train = pd.merge(data['orders'], data['order_products__prior'], on='order_id')\
               .merge(data['products'].merge(data['departments'], on='department_id').merge(data['aisles'], on='aisle_id'), on='product_id')

In [21]:
data_train.shape

(32434489, 15)

In [12]:
data_train['eval_set'].value_counts()

prior    32434489
Name: eval_set, dtype: int64

In [13]:
data_test = pd.merge(data['orders'], data['order_products__train'], on='order_id')\
               .merge(data['products'].merge(data['departments'], on='department_id').merge(data['aisles'], on='aisle_id'), on='product_id')

In [23]:
data_test.shape

(1384617, 15)

In [14]:
data_test['eval_set'].value_counts()

train    1384617
Name: eval_set, dtype: int64
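Both joins above follow the same pattern, and because `merge` defaults to an inner join, only orders that have rows in the chosen order-products table survive, which is why each result contains a single `eval_set` value. A minimal sketch of that pattern as a reusable helper is shown below. The name `build_full_table` and the toy frames are hypothetical, and the `validate` arguments assume each key is unique in its lookup table, which we have not checked on the real data.

```python
import pandas as pd


def build_full_table(orders, order_products, products, departments, aisles):
    """Join orders, their products, and product metadata into one table."""
    # Attach department and aisle names to each product
    product_info = (
        products
        .merge(departments, on='department_id', validate='many_to_one')
        .merge(aisles, on='aisle_id', validate='many_to_one')
    )
    # Inner joins: orders with no rows in order_products are dropped
    return (
        orders
        .merge(order_products, on='order_id', validate='one_to_many')
        .merge(product_info, on='product_id', validate='many_to_one')
    )


# Toy data (made up) mirroring the column names of the raw tables
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'eval_set': ['prior', 'prior', 'train']})
order_products = pd.DataFrame({'order_id': [1, 1, 2],
                               'product_id': [10, 11, 10]})
products = pd.DataFrame({'product_id': [10, 11],
                         'product_name': ['Banana', 'Milk'],
                         'aisle_id': [100, 101],
                         'department_id': [5, 6]})
departments = pd.DataFrame({'department_id': [5, 6],
                            'department': ['produce', 'dairy']})
aisles = pd.DataFrame({'aisle_id': [100, 101],
                       'aisle': ['fresh fruits', 'milk']})

toy = build_full_table(orders, order_products, products, departments, aisles)
print(toy.shape)
print(toy['eval_set'].value_counts())
```

With `validate` set, pandas raises a `MergeError` if a key that should be unique turns out to be duplicated, rather than silently multiplying rows.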

We will save these datasets to the interim folder for now, since our approach may still change and further cleaning and feature engineering may follow.

In [19]:
# Create pickle file for train data

f = 'train.p'
d = '../data/interim'
fp = path.join(d,f)

with open(fp, 'wb') as file:
    pickle.dump(data_train, file)

In [20]:
# Create pickle file for test data

f = 'test.p'
d = '../data/interim'
fp = path.join(d,f)

with open(fp, 'wb') as file:
    pickle.dump(data_test, file)
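For later notebooks, the saved files can be read back with `pd.read_pickle`. The sketch below demonstrates the round trip on a small made-up frame in a temporary directory rather than touching `../data/interim`; the file name `toy.p` is hypothetical.

```python
import tempfile
from os import path

import pandas as pd

# Small made-up frame standing in for data_train / data_test
frame = pd.DataFrame({'order_id': [1, 2], 'product_id': [10, 11]})

with tempfile.TemporaryDirectory() as tmp_dir:
    fp = path.join(tmp_dir, 'toy.p')
    frame.to_pickle(fp)            # same pickle format as pickle.dump(frame, file)
    restored = pd.read_pickle(fp)  # would be e.g. '../data/interim/train.p' for the real data

print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.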