# Instacart: Feature Engineering

This notebook constructs matrices $\{X_s\}_{s \in \mathrm{DSets}}$, where $\mathrm{DSets} = \{\mathrm{train, test, kaggle}\}$. These matrices are the inputs to the random forest classifier that we tune in [Instacart: Random Forest ParameterGrid Search](./instacart-random-forest-parametergrid-search/) and train in [Instacart: Top-N Random Forest Model](./instacart-top-n-random-forest-model/). Most features – columns of $X_s$ – are computed via aggregations and transformations of the raw data provided and studied in [Instacart: Exploratory Data Analysis](./instacart-exploratory-data-analysis.ipynb). In addition, some features are computed via an unsupervised learning technique, [Latent Dirichlet Allocation (LDA)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation), introduced in \cite{bleiLatentDirichletAllocation2003}. LDA is a probabilistic generative model, applied here to the matrix of user-product purchase counts. Although these counts already enter the random forest classifier as a column of $X_s$, the distributional assumptions of LDA can yield a bit more predictive power. We can view this as a simple [model-based collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering#Model-based) technique.

The matrices $X_s$ are limited to roughly 50 columns on the Kaggle platform, on which the collection of notebooks comprising this project was run. The limiting resource is memory, though the limit is hit at the `sklearn.ensemble.RandomForestClassifier` calls in the subsequent notebooks rather than here. Kaggle provides instances with 16GB of memory and no virtual memory (on disk), so this is a hard limit on memory availability.

Additional, more complex kinds of features would require a modest increase in memory. For example, [non-negative matrix factorization (NMF)](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) techniques have traditionally been used in recommendation systems. Such techniques may be appropriate for the user-user matrix of, say, counts of common product purchases. One application of NMF to this matrix is dimensionality reduction: creating a relatively small number of user "topics" based on the common purchase count matrix. Although this matrix is sparse, experiments constructing such matrices on subsets of the overall dataset suggest that it is tractable, at a size of roughly 10GB for $s=\text{'train'}$.

## Feature Dictionary

This notebook constructs a few groups of features, called Profiles. The User Profile, for example, consists of operations and aggregations grouped by user, so that the index for the User Profile is $U_s$, the list of users. The rows of $X_s$ are not merely users but user-product pairs, which means that the User Profile is broadcast to the user-product index $I_s$ via a `.join()` operation. That is, the values of the User Profile are repeated across all products in the user-product index for any given user. An analogous statement holds for the Product Profile. Therefore, the User-Product Profile will have the features with the greatest information content (and the Aisle and Department Profiles the least). The list of feature groups and features follows.

| feature group prefix | name |
|---|---|
| `U` | User Profile |
| `P` | Product Profile |
| `UP` | User-Product Profile |
| `AD` | Aisle and Department Profiles (ignored) |
| `LDA` | Latent Dirichlet Allocation User Features |


| feature | dtype | description |
|---|---|---|
| `U_ultimate_order_dow` | `float16` | dow of user's ultimate order |
| `U_ultimate_order_hour_of_day` | `float16` | hour of user's ultimate order |
| `U_ultimate_days_since_prior_order` | `float16` | days since user's previous order (from ultimate) |
| `U_orders_num` | `uint8` | number of orders a given user has placed |
| `U_items_total` | `uint16` | number of total items a given user has purchased |
| `U_order_size_mean` | `float16` | mean basket size for a given user |
| `U_order_size_std` | `float16` | std basket size for a given user |
| `U_unique_products` | `uint16` | number of unique products a given user has purchased |
| `U_reordered_num` | `uint16` | number of total items a given user has purchased which are reorders |
| `U_reorder_size_mean` | `float16` | mean reorders per basket |
| `U_reorder_size_std` | `float16` | std reorders per basket |
| `U_reordered_ratio` | `float16` | proportion of items a given user has purchased which are reorders |
| `U_order_dow_mean` | `float16` | mean order_dow |
| `U_order_dow_var` | `float16` | var order_dow |
| `U_order_dow_score` | `float16` | ultimate score for order_dow using circstd = sqrt(-2ln(circvar)) |
| `U_order_hour_of_day_mean` | `float16` | mean order_hour_of_day |
| `U_order_hour_of_day_var` | `float16` | var order_hour_of_day |
| `U_order_hour_of_day_score` | `float16` | ultimate score for order_hour_of_day using circstd = sqrt(-2ln(circvar)) |
| `U_days_since_prior_order_mean` | `float16` | mean days since prior order (mean user order time interval) |
| `U_days_since_prior_order_std` | `float16` | std days since prior order (std user order time interval) |
| `P_orders_num` | `uint32` | number of total purchases |
| `P_unique_users` | `uint16` | number of purchasers |
| `P_reorder_ratio` | `float16` | reorder ratio |
| `P_order_hour_of_day_mean` | `float16` | mean order_hour_of_day |
| `P_order_hour_of_day_var` | `float16` | var order_hour_of_day |
| `P_order_dow_mean` | `float16` | mean order_dow |
| `P_order_dow_var` | `float16` | var order_dow |
| `UP_orders_num` | `uint8` | number of times particular user has ordered particular product |
| `UP_orders_since_previous` | `uint8` | number of orders since previous purchase of product by user |
| `UP_days_since_prior_order` | `uint16` | days since user last ordered product |
| `UP_days_since_prior_order_score` | `float16` | normalize above by user's days_since_prior_order |
| `UP_reordered` | `bool` | boolean indicating whether the product was ever reordered by user |
| `UP_order_ratio` | `float16` | fraction of baskets in which a given product appears for a given user (count of orders in which product appears divided by total orders) |
| `UP_penultimate` | `bool` | products in user's penultimate (previous) order as `bool` (`train` and `test` sets contain ultimate order) |
| `UP_antepenultimate` | `bool` | products in user's antepenultimate order as `bool` |
| `UP_order_dow_score` | `float16` | ultimate score for order_dow using (`U_ultimate` - `P_order_dow_mean`) / `P_order_dow_std` (intuitively, how 'far' is a user's ultimate order dow from the mean dow product is ordered) |
| `UP_order_hour_of_day_score` | `float16` | ultimate score for order_hour_of_day using (`U_ultimate` - `P_order_hour_of_day_mean`) / `P_order_hour_of_day_std` (intuitively, how 'far' is a user's ultimate order hour_of_day from the mean hour_of_day product is ordered) |
| `LDA_1` | `float16` | Latent Dirichlet Allocation Feature 1 |
| `LDA_2` | `float16` | Latent Dirichlet Allocation Feature 2 |
| `LDA_3` | `float16` | Latent Dirichlet Allocation Feature 3 |
| `LDA_4` | `float16` | Latent Dirichlet Allocation Feature 4 |
| `LDA_5` | `float16` | Latent Dirichlet Allocation Feature 5 |
| `LDA_6` | `float16` | Latent Dirichlet Allocation Feature 6 |
| `LDA_7` | `float16` | Latent Dirichlet Allocation Feature 7 |
| `LDA_8` | `float16` | Latent Dirichlet Allocation Feature 8 |
| `LDA_9` | `float16` | Latent Dirichlet Allocation Feature 9 |
| `LDA_10` | `float16` | Latent Dirichlet Allocation Feature 10 |
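
The broadcast of a user-level profile onto the user-product index described above can be sketched on toy data. The frames and values below are hypothetical stand-ins rather than the notebook's data; the point is only that joining on the shared `user_id` level repeats each user's row once per product.

```python
import pandas as pd

# Hypothetical user-level profile indexed by user_id
user_profile = pd.DataFrame(
    {'U_orders_num': [3, 5]},
    index=pd.Index([1, 2], name='user_id'))

# Hypothetical user-product index: user 1 bought products 10 and 20,
# user 2 bought product 10
up_index = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 20), (2, 10)], names=['user_id', 'product_id'])
up_empty_df = pd.DataFrame(index=up_index)

# The join matches the single index name 'user_id' against the
# MultiIndex level of the same name, so user 1's value appears twice
broadcast = up_empty_df.join(user_profile)
print(broadcast)
```

Because the profile value is constant across a user's products, such a column carries no information that distinguishes one product from another for the same user.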

## Load Data

In [1]:
!ls instacart_data/all/

aisles.csv	 order_products__prior.csv  orders.csv
departments.csv  order_products__train.csv  products.csv


In [2]:
import pandas as pd
import numpy as np
pd.options.display.latex.repr=True

file_path = 'instacart_data/all/'

load_data_dtype = {
    'order_id': np.uint32,
    'user_id': np.uint32,
    'eval_set': 'category',
    'order_number': np.uint8,
    'order_dow': np.uint8,
    'order_hour_of_day': np.uint8,
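    # pandas 'gotcha': the column has NaN for each user's first order,
    # and NumPy integer dtypes cannot hold NaN, so leave it as float:
    'days_since_prior_order': np.float16,
    'product_id': np.uint16,
    'add_to_cart_order': np.uint8,
    # builtin bool: the np.bool alias is missing from some NumPy releases
    'reordered': bool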
}

df_aisles = pd.read_csv(file_path + 'aisles.csv')
df_departments = pd.read_csv(file_path + 'departments.csv')
df_products = pd.read_csv(file_path + 'products.csv')

# Specify dtype to reduce memory utilization
df_order_products_prior = pd.read_csv(file_path + 'order_products__prior.csv',
                                      dtype=load_data_dtype)
df_order_products_train = pd.read_csv(file_path + 'order_products__train.csv',
                                      dtype=load_data_dtype)
df_orders = pd.read_csv(file_path + 'orders.csv', dtype=load_data_dtype)

# df_prior = full products from all prior orders
df_prior = pd.merge(df_orders[df_orders['eval_set'] == 'prior'],
                    df_order_products_prior,
                    on='order_id')

# # Useful DataFrame for aisle and department feature construction
# df_ad = pd.merge(df_prior, df_products, how='left',
#                  on='product_id').drop('product_name', axis=1)

## Train, Test, and Kaggle Sets

As this dataset comes from a (completed) Kaggle competition, the set of users whose ultimate order has `eval_set == 'test'` forms the test set for the competition; Kaggle withholds the ultimate orders of these users so that participants can submit predictions, which Kaggle scores against the withheld orders.

Partitions of the dataset by user are defined by
* $U_\text{train}$: 80% of the 131,209 users whose ultimate orders are available.
* $U_\text{test}$: 20% of the 131,209 users whose ultimate orders are available.
* $U_\text{kaggle}$: The 75,000 users whose ultimate orders are withheld by Kaggle. This project does not explicitly use this set; predictions on this set merely serve as a sanity check via submission to Kaggle.

The [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) utility defines the partition between $U_\text{train}$ and $U_\text{test}$. The list of strings of datasets, `dsets = ['train', 'test', 'kaggle']`, instantiates $\mathrm{DSets}$. `users` is a dictionary of lists of `user_ids` keyed by `dsets`, initialized by 
>`users = dict.fromkeys(dsets)`,

so that `users[ds]` for `ds in dsets` instantiates $U_s$ for $s \in \mathrm{DSets}$. Similarly, this notebook constructs matrices `X[ds]`, initialized by
> `X = dict.fromkeys(dsets)`,

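to instantiate $X_s$. Aside from the analogy between dictionary keys and subscripts, the `dict` type offers a coherent way to partition the dataset $D_s$ of user orders and baskets in `orders.csv` and `order_products*.csv` into separate DataFrames `orders[ds]` and `prior[ds]` for `ds in dsets` at the outset, so as to avoid potential data leaks. The following is a partial notational dictionary.

| notation | code | meaning |
|---|---|---|
| $\mathrm{DSets}$ | `dsets` | names of the dataset partitions |
| $U_s$ | `users[ds]` | list of users in partition $s$ |
| $D_s$ | `orders[ds]`, `prior[ds]` | orders and baskets of the users in $U_s$ |
| $X_s$ | `X[ds]` | feature matrix for partition $s$ |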

In [3]:
from sklearn.model_selection import train_test_split

# Names of dataset partitions
dsets = ['train', 'test', 'kaggle']

# Partition of users into dsets
users = dict.fromkeys(dsets)

# Use sklearn utility to partition project users into train and test user lists.
users['train'], users['test'] = train_test_split(list(
    df_orders[df_orders.eval_set == 'train']['user_id']),
                                                 test_size=0.2,
                                                 random_state=20190502)

# Kaggle submissions test set
users['kaggle'] = list(
    df_orders[df_orders.eval_set == 'test']['user_id'])  #.to_list()

In [4]:
# Split DataFrames we will use in feature construction into dicts of DataFrames
prior = dict.fromkeys(dsets)
orders = dict.fromkeys(dsets)
orders_full = dict.fromkeys(dsets)

# ad = dict.fromkeys(dsets)

for ds in dsets:
    prior[ds] = df_prior[df_prior['user_id'].isin(users[ds])]
    orders[ds] = df_orders[df_orders['user_id'].isin(users[ds])
                           & (df_orders.eval_set == 'prior')]
    orders_full[ds] = df_orders[df_orders['user_id'].isin(users[ds])]
#     ad[ds] = df_ad[df_ad['user_id'].isin(users[ds])]
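
A quick way to convince oneself that splitting by user avoids leaks is to check that the user lists are pairwise disjoint and that every order row lands in exactly one partition. The sketch below does this on hypothetical toy data; the user lists and `df_orders_toy` are stand-ins for the objects built above.

```python
import pandas as pd

dsets = ['train', 'test', 'kaggle']

# dict.fromkeys gives every key the value None until it is assigned
users = dict.fromkeys(dsets)
users['train'], users['test'], users['kaggle'] = [1, 2, 3], [4], [5, 6]

# Hypothetical stand-in for df_orders
df_orders_toy = pd.DataFrame({
    'order_id': range(1, 9),
    'user_id': [1, 1, 2, 3, 4, 4, 5, 6],
})

orders_toy = {
    ds: df_orders_toy[df_orders_toy['user_id'].isin(users[ds])]
    for ds in dsets
}

# No user may appear in two partitions
user_sets = {ds: set(users[ds]) for ds in dsets}
partitions_disjoint = all(
    user_sets[a].isdisjoint(user_sets[b])
    for i, a in enumerate(dsets) for b in dsets[i + 1:])

# Every order row belongs to exactly one partition
rows_accounted_for = (
    sum(len(orders_toy[ds]) for ds in dsets) == len(df_orders_toy))

print(partitions_disjoint, rows_accounted_for)
```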

### Indexes

It will be useful to have an `Index` of users and a `MultiIndex` of user-product pairs.

The "full" index could include a few dozen user-product pairs which appear in 'train' but not 'prior'. Since the aim is to consider only products that users have previously ordered, these pairs are excluded. Further, the technical complication of reindexing and choosing good `fillna` values is unlikely to be worth the additional predictive ability gained by including them.

In [5]:
pd.__version__

'0.23.4'

In [6]:
# Create Index of all users
# for pandas 0.24:
# u_index[ds], _ = pd.MultiIndex.from_frame(orders[ds]['user_id']).sortlevel()
# for pandas 0.23.4:

u_index = dict.fromkeys(dsets)

for ds in dsets:
    u_index[ds], _ = pd.Index(list(orders[ds]['user_id'].values),
                              name='user_id').sortlevel()
    u_index[ds] = u_index[ds].drop_duplicates()

In [7]:
# Create MultiIndex of all (nonempty) (user, product) pairs
# and empty DataFrame with that MultiIndex for joins with
# features with user index or product index
# for pandas 0.24:
# up_index[ds], _ = pd.MultiIndex.from_frame(prior[ds][['user_id', 'product_id']]).sortlevel()
# for pandas 0.23.4:

up_index = dict.fromkeys(dsets)
up_empty_df = dict.fromkeys(dsets)

for ds in dsets:
    up_index[ds], _ = pd.MultiIndex.from_tuples(
        list(prior[ds][['user_id', 'product_id']].values),
        names=prior[ds][['user_id', 'product_id']].columns).sortlevel()
    up_index[ds] = up_index[ds].drop_duplicates()
    up_empty_df[ds] = pd.DataFrame(index=up_index[ds])
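
For readers on pandas 0.24 or later, the commented-out `from_frame` route mentioned above can be sketched as follows. The toy frame is hypothetical, and chaining `drop_duplicates()` with `sort_values()` is one possible way to obtain a sorted index of distinct pairs, not necessarily the form the notebook would use.

```python
import pandas as pd

# Hypothetical basket rows; pairs repeat when a product is bought again
prior_toy = pd.DataFrame({
    'user_id':    [2, 1, 1, 1, 2],
    'product_id': [10, 20, 10, 20, 10],
})

# Distinct (user, product) pairs in sorted order
up_index_toy = (pd.MultiIndex
                .from_frame(prior_toy[['user_id', 'product_id']])
                .drop_duplicates()
                .sort_values())

# Column-less frame that later profile joins can attach to
up_empty_toy = pd.DataFrame(index=up_index_toy)
print(list(up_index_toy))
```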

### $X_s$ Ultimate
These DataFrames are helpful in building the features prefixed by `U_ultimate`.

In [8]:
# The ultimate orders
ultimate = dict.fromkeys(dsets)

ultimate['train'] = df_orders[(df_orders['eval_set'] == 'train')
                              & df_orders['user_id'].isin(users['train'])]
# 'eval_set' == 'train' is correct here since that is *Kaggle's* train:
ultimate['test'] = df_orders[(df_orders['eval_set'] == 'train')
                             & df_orders['user_id'].isin(users['test'])]
ultimate['kaggle'] = df_orders[(df_orders['eval_set'] == 'test')
                               & df_orders['user_id'].isin(users['kaggle'])]

### $y_\text{train}$ and $y_\text{test}$

The true $y$-vectors are below. Kaggle withholds the data $y_\text{kaggle}$; instead, Kaggle competitors may submit a prediction $\hat{y}_\text{kaggle}$, which Kaggle scores against $y_\text{kaggle}$.

In [9]:
# Build y['train'] and y['test']
# df_y = products in the ultimate orders of the train and test users
df_y = pd.merge(df_orders[df_orders['eval_set'] == 'train'],
                df_order_products_train,
                on='order_id')

In [10]:
y = dict.fromkeys(dsets)

y['train'] = (
    pd.DataFrame(
        [[True]],
        index=pd.MultiIndex.from_tuples(
            list(
                # (user, product) pairs of purchases in 'train' df -> list
                df_y[df_y['user_id'].isin(users['train'])]
                [['user_id', 'product_id']].values)))
    # Fill unpurchased items in overall up_index as False
    .reindex(up_index['train']).fillna(False))

y['test'] = (
    pd.DataFrame(
        [[True]],
        index=pd.MultiIndex.from_tuples(
            list(
                # (user, product) pairs of purchases in 'test' df -> list
                df_y[df_y['user_id'].isin(users['test'])]
                [['user_id', 'product_id']].values)))
    # Fill unpurchased items in overall up_index as False
    .reindex(up_index['test']).fillna(False))

# Placeholder only: Kaggle withholds y['kaggle']
y['kaggle'] = pd.DataFrame(data=['foo'])
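
The labelling step amounts to marking purchased pairs as `True` and reindexing onto the user-product index with `False` as the fill value. A minimal sketch on hypothetical toy indexes, written with a `Series` for brevity, shows the effect, including how a pair that never occurred in the prior orders drops out.

```python
import pandas as pd

# Hypothetical index of previously ordered (user, product) pairs
up_index_toy = pd.MultiIndex.from_tuples(
    [(1, 10), (1, 20), (2, 10)], names=['user_id', 'product_id'])

# Hypothetical ultimate-order purchases; (2, 30) was never ordered before
purchased = pd.MultiIndex.from_tuples(
    [(1, 20), (2, 30)], names=['user_id', 'product_id'])

# True for purchased pairs, False for the rest of the index
y_toy = (pd.Series(True, index=purchased, name='y')
         .reindex(up_index_toy, fill_value=False))
print(y_toy)
```

Only `(1, 20)` is labelled `True`; the first-time purchase `(2, 30)` is not in the index and therefore cannot be predicted by a model built on these rows.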

#### Save $y$

In [11]:
pd.set_option('io.hdf.default_format', 'table')

In [12]:
store = pd.HDFStore('io.h5')

In [13]:
for dset, dframe in y.items():
    store['/y/' + str(dset)] = dframe

In [14]:
store.close()

In [15]:
store.is_open

False

In [16]:
# Cleanup y
del df_y, y

## Features

In [17]:
# dimensions
users_num = df_orders['user_id'].max()
products_num = df_products['product_id'].max()

In [18]:
# Make a dict to collect groups of features (e.g. profiles, clusterings, etc)
groups_dict = {}

Since both `order_dow` and `order_hour_of_day` are cyclic temporal features, it may help the model to encode them as such. To do so, transform the cyclic features to angles in radians and use circular statistics as described in [Directional Statistics](https://en.wikipedia.org/wiki/Directional_statistics#Measures_of_location_and_spread) and the first sections of [NCSS Circular Data Analysis](https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Circular_Data_Analysis.pdf).

Note also that [the implementation of the scipy circular variance calculation is suspect](https://stackoverflow.com/questions/52856232/scipy-circular-variance), while the [astropy.stats.circstats](http://docs.astropy.org/en/stable/stats/circ.html) calculation seems correct.

In [19]:
from astropy.stats import circmean, circvar

def angle_transform(series, period):
    return series.multiply(2 * np.pi / period).sub(np.pi).astype('float16')
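
To see why circular statistics are worth the trouble, consider order hours clustered around midnight. The sketch below uses plain NumPy rather than astropy, with hypothetical hours and a helper `to_angle` that mirrors `angle_transform` without the `float16` cast. It computes the mean resultant length $R$, the circular mean, the circular variance $V = 1 - R$, and the circular standard deviation $\sqrt{-2\ln R}$, following the textbook definitions in the references above.

```python
import numpy as np

def to_angle(values, period):
    # Same mapping as angle_transform: [0, period) -> [-pi, pi)
    return np.asarray(values, dtype=float) * (2 * np.pi / period) - np.pi

hours = [22, 23, 1]  # hypothetical order hours around midnight
theta = to_angle(hours, 24)

# Average the unit vectors (cos theta, sin theta)
c, s = np.cos(theta).mean(), np.sin(theta).mean()
resultant_length = np.hypot(c, s)      # R, between 0 and 1
mean_angle = np.arctan2(s, c)          # circular mean in radians
circ_var = 1 - resultant_length        # V = 1 - R
circ_std = np.sqrt(-2 * np.log(resultant_length))

# Map the circular mean back to an hour of day for comparison
mean_hour = ((mean_angle + np.pi) * 24 / (2 * np.pi)) % 24
print(mean_hour, np.mean(hours))
```

The circular mean lands shortly before midnight, whereas the arithmetic mean of the raw hours falls in mid-afternoon, far from every observation. Under these definitions the circular standard deviation is $\sqrt{-2\ln R}$ with $R = 1 - V$, which is worth keeping in mind when reading expressions that apply $\sqrt{-2\ln(\cdot)}$ to a circular variance directly.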

### Ultimate User Features

These are the (known) `order_dow`, `order_hour_of_day`, and `days_since_prior_order` of the ultimate order.

In [20]:
from collections import defaultdict

# dictionary to store given user features
u_given_dict = defaultdict(dict)

In [21]:
# Compute each feature separately for 'train', 'test', and 'kaggle' in dsets
for ds in dsets:

    # ultimate order_dow
    u_given_dict['U_ultimate_order_dow'][ds] = angle_transform(
        ultimate[ds].set_index('user_id').order_dow, 7)

    # ultimate order_hour_of_day
    u_given_dict['U_ultimate_order_hour_of_day'][ds] = angle_transform(
        ultimate[ds].set_index('user_id').order_hour_of_day, 24)

    # ultimate days_since_prior_order
    u_given_dict['U_ultimate_days_since_prior_order'][ds] = (
        ultimate[ds].set_index('user_id').days_since_prior_order)

In [22]:
# Rename feature columns/pandas Series object by u_given_dict key name pointing to it.

for ds in dsets:
    for k, v in u_given_dict.items():
        v[ds].rename(k, inplace=True)

In [23]:
# Combine given user features; store as key 'U_given'

groups_dict['U_given'] = {
    ds: pd.concat([u_given_dict[k][ds] for k in u_given_dict.keys()], axis=1)
    for ds in dsets
}
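
As an aside, `pd.concat` also accepts a mapping of `Series`, in which case the keys become the column labels. The sketch below, on hypothetical values, shows this as a possible alternative to renaming each `Series` in place before concatenating; it is an illustration, not the notebook's approach.

```python
import pandas as pd

idx = pd.Index([1, 2], name='user_id')

# Hypothetical features keyed by the desired column names
features = {
    'U_ultimate_order_dow': pd.Series([0.5, -1.2], index=idx),
    'U_ultimate_order_hour_of_day': pd.Series([1.0, 2.0], index=idx),
}

# The dict keys label the columns of the combined frame
profile = pd.concat(features, axis=1)
print(profile.columns.tolist())
```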

### User Profile

In [24]:
# dictionary to store user features
u_dict = defaultdict(dict)

for ds in dsets:

    # number of orders a given user has placed
    u_dict['U_orders_num'][ds] = (
        prior[ds]
        .groupby(by='user_id')['order_number']
        .max().apply(pd.to_numeric,
                     downcast='unsigned'))

    # number of total items a given user has purchased
    u_dict['U_items_total'][ds] = (
        prior[ds].groupby('user_id')['product_id'].count().apply(
            pd.to_numeric, downcast='unsigned'))

    # mean basket size for a given user
    u_dict['U_order_size_mean'][ds] = (u_dict['U_items_total'][ds].div(
        u_dict['U_orders_num'][ds]).astype('float16'))

    # std basket size for a given user
    u_dict['U_order_size_std'][ds] = (prior[ds].groupby([
        'user_id', 'order_number'
    ]).add_to_cart_order.max().groupby('user_id').std().astype('float16'))

    # number of unique products a given user has purchased
    u_dict['U_unique_products'][ds] = (
        prior[ds].groupby('user_id')['product_id'].nunique().apply(
            pd.to_numeric, downcast='unsigned'))

    # number of total items a given user has purchased which are reorders
    u_dict['U_reordered_num'][ds] = (
        prior[ds].groupby('user_id')['reordered'].sum().apply(
            pd.to_numeric, downcast='unsigned'))

    # mean reorders per basket
    u_dict['U_reorder_size_mean'][ds] = (u_dict['U_reordered_num'][ds].div(
        u_dict['U_orders_num'][ds]).astype('float16'))

    # std reorders per basket
    u_dict['U_reorder_size_std'][ds] = (prior[ds].groupby([
        'user_id', 'order_number'
    ]).reordered.sum().groupby('user_id').std().astype('float16'))

    # proportion of items a given user has purchased which are reorders
    u_dict['U_reordered_ratio'][ds] = (u_dict['U_reordered_num'][ds].div(
        u_dict['U_items_total'][ds]).astype('float16'))

    # mean order_dow
    u_dict['U_order_dow_mean'][ds] = pd.concat(
        [
            orders[ds]['user_id'],
            angle_transform(
                # load-bearing .rename(). Fix.
                orders[ds]['order_dow'].rename('U_order_dow_mean'),
                7)
        ],
        axis=1).groupby('user_id').aggregate(circmean).astype(
            'float16').U_order_dow_mean

    # var order_dow
    u_dict['U_order_dow_var'][ds] = pd.concat(
        [
            orders[ds]['user_id'],
            angle_transform(
                # load-bearing .rename(). Fix.
                orders[ds]['order_dow'].rename('U_order_dow_var'),
                7)
        ],
        axis=1).groupby('user_id').aggregate(circvar).astype(
            'float16').U_order_dow_var

    # ultimate score for order_dow using circstd = sqrt(-2ln(circvar))
    u_dict['U_order_dow_score'][ds] = (
        u_given_dict['U_ultimate_order_dow'][ds]
        .sub(u_dict['U_order_dow_mean'][ds])
        .div(u_dict['U_order_dow_var'][ds]
             .apply(lambda x: np.sqrt(-2 * np.log(x))))
        .fillna(0)
        .clip(-20, 20)
        .astype('float16'))

    # mean order_hour_of_day
    u_dict['U_order_hour_of_day_mean'][ds] = (
        pd.concat(
            [orders[ds]['user_id'],
                angle_transform(
                    orders[ds]['order_hour_of_day']
                    # load-bearing .rename(). Fix.
                    .rename('U_order_hour_of_day_mean'),
                    24)
            ],
            axis=1)
        .groupby('user_id')
        .aggregate(circmean)
        .astype('float16')
        .U_order_hour_of_day_mean)

    # var order_hour_of_day
    u_dict['U_order_hour_of_day_var'][ds] = (
        pd.concat(
            [
                orders[ds]['user_id'],
                angle_transform(
                    orders[ds]['order_hour_of_day']
                    # load-bearing .rename(). Fix.
                    .rename('U_order_hour_of_day_var'),
                    24)
            ],
            axis=1)
        .groupby('user_id')
        .aggregate(circvar)
        .astype('float16')
        .U_order_hour_of_day_var)

    # ultimate score for order_hour_of_day using circstd = sqrt(-2ln(circvar))
    u_dict['U_order_hour_of_day_score'][ds] = (
        u_given_dict['U_ultimate_order_hour_of_day'][ds]
        .sub(u_dict['U_order_hour_of_day_mean'][ds])
        .div(u_dict['U_order_hour_of_day_var'][ds]
             .apply(lambda x: np.sqrt(-2 * np.log(x))))
        .fillna(0)
        .clip(-20, 20)
        .astype('float16')
    )

    # mean days since prior order (mean user order time interval)
    u_dict['U_days_since_prior_order_mean'][ds] = (
        orders_full[ds]
        .groupby('user_id')
        .days_since_prior_order
        .mean()
        .astype('float16')
    )

    # std days since prior order (std user order time interval)
    u_dict['U_days_since_prior_order_std'][ds] = (
        orders_full[ds]
        .groupby('user_id')
        .days_since_prior_order
        .std()
        .astype('float16')
    )

In [25]:
# Rename feature columns/pandas Series object by u_dict key name pointing to it.

for ds in dsets:
    for k, v in u_dict.items():
        v[ds].rename(k, inplace=True)

In [26]:
# Combine user features; store as key 'U'

groups_dict['U'] = {ds : pd.concat([u_dict[k][ds] for k in u_dict.keys()], axis=1) for ds in dsets}
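
The basket-size statistics follow a two-step pattern: first reduce to one number per order, then aggregate those numbers per user. A small sketch on hypothetical data, using `.size()` as an assumed stand-in for the basket size, makes the intermediate step visible.

```python
import pandas as pd

# Hypothetical baskets: user 1 has orders of 2 and 4 items,
# user 2 has a single order of 3 items
prior_toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 1, 1, 2, 2, 2],
    'order_number': [1, 1, 2, 2, 2, 2, 1, 1, 1],
    'product_id':   [10, 20, 10, 20, 30, 40, 10, 50, 60],
})

# Step 1: one basket size per (user, order)
basket_sizes = prior_toy.groupby(['user_id', 'order_number']).size()

# Step 2: aggregate the basket sizes per user
size_mean = basket_sizes.groupby(level='user_id').mean()
size_std = basket_sizes.groupby(level='user_id').std()
print(size_mean, size_std, sep='\n')
```

Note that the sample standard deviation is `NaN` for a user with a single order, so downstream code has to decide how to treat such users.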

### Product Profile

In [27]:
# dictionary to store product features
p_dict = defaultdict(dict)

for ds in dsets:

    # number of total purchases
    p_dict['P_orders_num'][ds] = (
        prior[ds]
        .groupby('product_id')['order_id']
        .count()
        .apply(pd.to_numeric, downcast='unsigned'))

    # number of purchasers
    p_dict['P_unique_users'][ds] = (
        prior[ds]
        .groupby('product_id')['user_id']
        .nunique()
        .apply(pd.to_numeric, downcast='unsigned'))

    # reorder ratio
    p_dict['P_reorder_ratio'][ds] = (
        prior[ds]
        .groupby(['product_id'])['reordered']
        .mean()
        .astype('float16'))

    # mean order_hour_of_day
    p_dict['P_order_hour_of_day_mean'][ds] = angle_transform(
        prior[ds]
        .set_index('product_id')
        .order_hour_of_day,
        24).groupby('product_id').aggregate(circmean)

    # var order_hour_of_day
    p_dict['P_order_hour_of_day_var'][ds] = angle_transform(
        prior[ds]
        .set_index('product_id')
        .order_hour_of_day,
        24).groupby('product_id').aggregate(circvar)

    # mean order_dow
    p_dict['P_order_dow_mean'][ds] = angle_transform(
        prior[ds]
        .set_index('product_id')
        .order_dow,
        7).groupby('product_id').aggregate(circmean)

    # var order_dow
    p_dict['P_order_dow_var'][ds] = angle_transform(
        prior[ds]
        .set_index('product_id')
        .order_dow,
        7).groupby('product_id').aggregate(circvar)

In [28]:
# Rename feature columns/pandas Series objects by p_dict key name pointing to it.

for ds in dsets:
    for k, v in p_dict.items():
        v[ds].rename(k, inplace=True)

In [29]:
# Combine product features; store as key 'P'

groups_dict['P'] = {
    ds: pd.concat([p_dict[k][ds] for k in p_dict.keys()], axis=1)
    for ds in dsets
}

### User-Product Profile

In [30]:
# dictionary to store user-product features
up_dict = defaultdict(dict)

for ds in dsets:

    # number of times particular user has ordered particular product
    up_dict['UP_orders_num'][ds] = (
        prior[ds]
        .groupby(['user_id', 'product_id'])['order_id']
        .count()
        .apply(pd.to_numeric, downcast='unsigned'))

    # number of orders since previous purchase of product by user
    # fill_value = infty?
    up_dict['UP_orders_since_previous'][ds] = (
        prior[ds].groupby(['user_id'])['order_number']
        .max()
        - prior[ds]
        .groupby(['user_id', 'product_id'])['order_number']
        .max()
        .apply(pd.to_numeric, downcast='unsigned'))

    # days since user last ordered product
    # groups of days_since_prior_order by user_id
    days_gpby_user = (
        orders_full[ds]
        .groupby('user_id')
        .days_since_prior_order
    )

    # given 'order_number' is UP_orders_since_previous
    # sum last orders_ago+1 days_since_prior_order
    def days_ago(row):
        orders_ago = int(row['order_number'])
        user = row['user_id']
        return (days_gpby_user
                .get_group(user)
                .iloc[-(orders_ago + 1):]
                .sum())

    # apply days_ago to UP_orders_since_previous
    up_dict['UP_days_since_prior_order'][ds] = (pd.Series(
        data=up_dict['UP_orders_since_previous'][ds]
        .reset_index()
        .apply(days_ago, axis=1)
        .values,
        index=up_dict['UP_orders_since_previous'][ds].index)
    .astype('uint16'))

    # clean-up
    del days_gpby_user

    # normalize above by user's days_since_prior_order
    # maybe use t-score instead?
    up_dict['UP_days_since_prior_order_score'][ds] = (
        up_dict['UP_days_since_prior_order'][ds]
        .sub(up_empty_df[ds].join(
            u_dict['U_days_since_prior_order_mean'][ds]).iloc[:, 0])
        .div(up_empty_df[ds].join(
            u_dict['U_days_since_prior_order_std'][ds]).iloc[:, 0])
        .fillna(0).clip(-20, 20).astype('float16'))
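
The logic of `days_ago` is easiest to follow on a single user. In the sketch below, `intervals` is a hypothetical sequence of `days_since_prior_order` values ordered by `order_number`, with the ultimate order last, and `days_since_last_purchase` is an illustrative helper that sums the trailing gaps.

```python
import pandas as pd

# First order has no predecessor (NaN); last entry is the ultimate order
intervals = pd.Series([float('nan'), 7.0, 10.0, 5.0])

def days_since_last_purchase(intervals, orders_ago):
    """Sum the gaps between the ultimate order and the prior order placed
    `orders_ago` orders before the most recent prior order."""
    return intervals.iloc[-(orders_ago + 1):].sum()

# Product was in the most recent prior order: only the final gap counts
print(days_since_last_purchase(intervals, 0))  # 5.0

# Product was last bought one order earlier: the last two gaps count
print(days_since_last_purchase(intervals, 1))  # 15.0
```

This relies on the rows for each user being sorted by order number, since the slice is positional.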

In [31]:
for ds in dsets:

    # reordered as `bool`
    up_dict['UP_reordered'][ds] = (
        prior[ds]
        .groupby(['user_id', 'product_id'])['reordered']
        .any())

    # fraction of baskets in which a given product appears for a given user,
    # count of orders in which product appears divided by total orders
    up_dict['UP_order_ratio'][ds] = (
        prior[ds].groupby(['user_id', 'product_id'])['order_number']
        .count()
        .div(prior[ds].groupby(['user_id'])['order_number']
             .max())
        .astype('float16')
    )

    # products in user's penultimate (previous) order as `bool`
    # (`train` and `test` sets contain ultimate order)

    up_dict['UP_penultimate'][ds] = (
        prior[ds].groupby(['user_id', 'product_id'])
        .order_number
        .max() 
        == prior[ds].groupby(['user_id'])
        .order_number
        .max()
        .reindex(up_index[ds], level=0)
    )

    # products in user's antepenultimate order as `bool`
    # index = UP pair (not distinct) with data = order_number
    past_orders = (
        prior[ds][['user_id', 'order_number', 'product_id']]
        .set_index(['user_id', 'product_id'])
    )
    
    # all UP pairs with max order_number - 1
    max_order_number_sub1 = (
        prior[ds].groupby(['user_id'])
        .order_number
        .max()
        .sub(1)
        .reindex(up_index[ds], level=0)
        .to_frame()
    )
    
    # intersection
    up_dict['UP_antepenultimate'][ds] = (
        pd.merge(
            past_orders,
            max_order_number_sub1,
            on=['user_id', 'product_id', 'order_number'])
        .reindex(up_index[ds], fill_value=False)
        .astype('bool')
        .iloc[:, 0]
    )
    
    # cleanup
    del past_orders, max_order_number_sub1

    # ultimate score for order_dow using circstd = sqrt(-2ln(circvar))
    # using (U_ultimate - P_order_dow_mean) / P_order_dow_std
    # broadcast to up_index
    # intuitively, how 'far' is a user's ultimate order dow from the mean dow product is ordered
    up_dict['UP_order_dow_score'][ds] = (
        pd.DataFrame(
            data=(up_empty_df[ds]
                      .join(u_given_dict['U_ultimate_order_dow'][ds])
                      .iloc[:, 0]
                  .sub(up_empty_df[ds]
                       .join(p_dict['P_order_dow_mean'][ds])
                       .iloc[:, 0])
                  .div(up_empty_df[ds]
                       .join(p_dict['P_order_dow_var'][ds]
                             .apply(lambda x: 
                                    np.sqrt(-2 * np.log(x))))
                       .iloc[:, 0])
                  ),
            index=up_index[ds])    
        .fillna(0)
        .clip(-20, 20)
        .astype('float16')
        .iloc[:, 0]
    )
        
    # ultimate score for order_hour_of_day using circstd = sqrt(-2ln(circvar))
    # using (U_ultimate - P_order_hour_of_day_mean) / P_order_hour_of_day_std
    # broadcast to up_index
    # intuitively, how 'far' is a user's ultimate order hour_of_day from the mean hour_of_day product is ordered
    # ndarray instead of pandas; couldn't resolve an arithmetic issue
    up_dict['UP_order_hour_of_day_score'][ds] = (
        pd.DataFrame(
            data=(up_empty_df[ds]
                      .join(u_given_dict['U_ultimate_order_hour_of_day'][ds])
                      .iloc[:, 0]
                  .sub(up_empty_df[ds]
                       .join(p_dict['P_order_hour_of_day_mean'][ds])
                       .iloc[:, 0])
                  .div(up_empty_df[ds]
                       .join(p_dict['P_order_hour_of_day_var'][ds]
                             .apply(lambda x: 
                                    np.sqrt(-2 * np.log(x))))
                       .iloc[:, 0])
                  ),
            index=up_index[ds])    
        .fillna(0)
        .clip(-20, 20)
        .astype('float16')
        .iloc[:, 0]
    )
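
The two score features above divide a difference of day-of-week or hour-of-day values by a circular standard deviation. As a minimal sketch of the underlying quantities, assuming only NumPy, the following computes a circular mean and a circular standard deviation from the mean resultant length $R$, using $\sqrt{-2\ln R}$. The helper names (`circ_mean`, `circ_std`, `circ_score`) are hypothetical and are not the notebook's `angle_transform`, `circmean` or `circvar`. Unlike the cell above, `circ_score` also wraps the difference onto the circle; that is an assumption about what a wrap-aware variant could look like, not the notebook's method. How $R$ relates to a stored circular variance depends on the definition of circular variance in use, so the conversion is worth checking against the library version at hand.

```python
import numpy as np


def _unit_vector_mean(values, period):
    """Mean of the unit vectors obtained by mapping periodic values to angles."""
    angles = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.mean(np.exp(1j * angles))


def circ_mean(values, period):
    """Circular mean, returned in the original units, in [0, period)."""
    mean_angle = np.angle(_unit_vector_mean(values, period)) % (2 * np.pi)
    return mean_angle * period / (2 * np.pi)


def circ_std(values, period):
    """Circular standard deviation sqrt(-2 ln R), in the original units."""
    # Guard against R exceeding 1 by rounding error.
    R = min(np.abs(_unit_vector_mean(values, period)), 1.0)
    return np.sqrt(-2 * np.log(R)) * period / (2 * np.pi)


def circ_score(x, values, period):
    """Signed, wrapped distance of x from the circular mean, in units of circ_std."""
    mu = circ_mean(values, period)
    # Wrap the difference into [-period/2, period/2).
    diff = (x - mu + period / 2) % period - period / 2
    return diff / circ_std(values, period)


# Hours clustered around midnight: an arithmetic mean would give 8, not 0.
hours = [23, 0, 1]
score_22h = circ_score(22, hours, 24)  # 22:00 is two hours before the mean
```

With a wrapped difference, an order at 22:00 is scored as two hours before a midnight mean rather than 22 hours after it.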

In [32]:
# Name each feature Series after the up_dict key that points to it.

for ds in dsets:
    for k, v in up_dict.items():
        v[ds].rename(k, inplace=True)

In [33]:
# Combine user-product features; store as key 'UP'

groups_dict['UP'] = {
    ds: pd.concat([up_dict[k][ds] for k in up_dict.keys()], axis=1)
    for ds in dsets
}

### Latent Dirichlet Allocation Features

The parameter values passed to [`sklearn.decomposition.LatentDirichletAllocation`](https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda) below were selected in, and are discussed in, the notebooks:
* [Instacart: LDA GridSearchCV (Coarse)](./instacart-lda-gridsearchcv-course)
* [Instacart: LDA GridSearchCV (Fine)](./instacart-lda-gridsearchcv-fine)

In [34]:
# scipy sparse matrix (users x products) counting how many times each user
# has ordered each product
UP_count_matrix = dict.fromkeys(dsets)

for ds in dsets:
    UP_count_matrix[ds], _, _ = (groups_dict['UP'][ds]['UP_orders_num'].apply(
        pd.to_numeric, downcast='unsigned').to_sparse().to_coo())
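
`Series.to_sparse` was removed in pandas 1.0, so the cell above only runs on older versions. As a hedged sketch of one way to build the same kind of users-by-products count matrix on newer versions, the following constructs a `scipy.sparse.coo_matrix` directly from the `(user_id, product_id)` index. The function name `counts_to_coo` and the toy data are hypothetical; the sorted row and column ordering is a choice made here and may differ from the ordering produced by the original call.

```python
import pandas as pd
from scipy.sparse import coo_matrix


def counts_to_coo(counts):
    """Build a users x products sparse matrix from a Series of counts
    indexed by a (user_id, product_id) MultiIndex."""
    # Map each id to a consecutive integer code; uniques are sorted.
    user_codes, user_ids = pd.factorize(
        counts.index.get_level_values('user_id'), sort=True)
    product_codes, product_ids = pd.factorize(
        counts.index.get_level_values('product_id'), sort=True)
    matrix = coo_matrix(
        (counts.to_numpy(), (user_codes, product_codes)),
        shape=(len(user_ids), len(product_ids)))
    return matrix, user_ids, product_ids


# Toy stand-in for groups_dict['UP'][ds]['UP_orders_num'] (made-up values).
toy_index = pd.MultiIndex.from_tuples(
    [(1, 196), (1, 10258), (2, 196), (3, 48742)],
    names=['user_id', 'product_id'])
toy_counts = pd.Series([10, 9, 1, 3], index=toy_index,
                       name='UP_orders_num', dtype='uint8')

toy_matrix, toy_users, toy_products = counts_to_coo(toy_counts)
```

Returning the row and column labels alongside the matrix keeps it possible to line the rows up with a user index afterwards.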

In [35]:
from sklearn.decomposition import LatentDirichletAllocation

LDA_features = dict.fromkeys(dsets)

for ds in dsets:
    lda = LatentDirichletAllocation(n_components=10,
                                    max_iter=10,
                                    learning_decay=0.85,
                                    n_jobs=1,
                                    learning_method='online')

    LDA_features[ds] = lda.fit_transform(UP_count_matrix[ds])
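
To make the shape of `LDA_features[ds]` concrete, here is a small sketch on made-up counts, using the same estimator settings with fewer components. The toy matrix and the `random_state` argument are assumptions added for illustration; the cell above does not fix a seed, so its features can vary from run to run. A dense array is used here for brevity; the estimator also accepts sparse input.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up purchase counts: 20 users (rows) by 8 products (columns).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(20, 8))

toy_lda = LatentDirichletAllocation(n_components=3,
                                    max_iter=10,
                                    learning_decay=0.85,
                                    n_jobs=1,
                                    learning_method='online',
                                    random_state=0)  # seed is an addition

# One row per user, one column per topic.
toy_topics = toy_lda.fit_transform(toy_counts)
row_sums = toy_topics.sum(axis=1)
```

Each row of `toy_topics` is a distribution over topics for one user, so the entries are non-negative and each row sums to one. Because topics are only identified up to relabeling, topic $k$ from a model fit on one dataset need not correspond to topic $k$ from a model fit separately on another.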

In [36]:
groups_dict['LDA'] = {
    ds: pd.DataFrame(data=LDA_features[ds],
                     index=u_index[ds],
                     columns=[
                         'LDA_' + str(k + 1)
                         for k in range(LDA_features[ds].shape[1])
                     ]).astype('float16')
    for ds in dsets
}

### Aisle and Department Features

The features below did not score well in previous versions of [Instacart: Top-N Random Forest Model](./instacart-top-n-random-forest-model/), so the cells in this section are commented out. User-aisle and user-department features, defined by analogy with the user-product features above, would likely perform considerably better than the aisle and department features below, which are defined by analogy with the product features.

In [37]:
# # dictionary to store aisle features
# a_dict = defaultdict(dict)

# for ds in dsets:

#     # mean order_hour_of_day
#     a_dict['A_order_hour_of_day_mean'][ds] = angle_transform(ad[ds].set_index('aisle_id')
#                                                 .order_hour_of_day,
#                                                 24
#                                                 ).groupby('aisle_id').aggregate(circmean)

#     # var order_hour_of_day
#     a_dict['A_order_hour_of_day_var'][ds] = angle_transform(ad[ds].set_index('aisle_id')
#                                                .order_hour_of_day,
#                                                24
#                                                ).groupby('aisle_id').aggregate(circvar)

#     # mean order_dow
#     a_dict['A_order_dow_mean'][ds] = angle_transform(ad[ds].set_index('aisle_id')
#                                         .order_dow,
#                                         7
#                                         ).groupby('aisle_id').aggregate(circmean)

#     # var order_dow
#     a_dict['A_order_dow_var'][ds] = angle_transform(ad[ds].set_index('aisle_id')
#                                        .order_dow,
#                                        7
#                                        ).groupby('aisle_id').aggregate(circvar)

#     # reorder ratio
#     a_dict['A_reorder_ratio'][ds] = (ad[ds].groupby(['aisle_id'])['reordered']
#                        .mean()
#                        .astype('float16')
#                        )

In [38]:
# # Name each feature Series after the a_dict key that points to it.

# for ds in dsets:
#     for k, v in a_dict.items():
#         v[ds].rename(k, inplace=True)

In [39]:
# # Combine aisle features into a_features
# # Reindex to products index for join with up_index

# #a_features = {ds : pd.DataFrame(index=groups_dict['P'][ds].index).join(
# #    pd.concat([a_dict[k][ds] for k in a_dict.keys()], axis=1)) for ds in dsets}

# # a_features = {ds : pd.concat([a_dict[k][ds] for k in a_dict.keys()], axis=1) for ds in dsets}

# groups_dict['A'] = {ds : 
#     # "dict" from product_id -> aisle_id (index=product_id, col=aisle_id)
#                     df_ad[['aisle_id', 'product_id']]
#                     .drop_duplicates()
#                     .set_index('product_id')
#                     .sort_index()
#     # join with aisle features with aisle_id as column
#                     .join(
#                         pd.concat([feature[ds] for feature in a_dict.values()], axis=1),
#         on='aisle_id')
#     .drop('aisle_id', axis=1)
#                     for ds in dsets}

# for ds in dsets:
#     groups_dict['A'][ds].index.rename('product_id', inplace=True)

In [40]:
# # dictionary to store department features
# d_dict = defaultdict(dict)

# for ds in dsets:
    
#     # mean order_hour_of_day
#     d_dict['D_order_hour_of_day_mean'][ds] = angle_transform(ad[ds].set_index('department_id')
#                                                 .order_hour_of_day,
#                                                 24
#                                                 ).groupby('department_id').aggregate(circmean)

#     # var order_hour_of_day
#     d_dict['D_order_hour_of_day_var'][ds] = angle_transform(ad[ds].set_index('department_id')
#                                                .order_hour_of_day,
#                                                24
#                                                ).groupby('department_id').aggregate(circvar)

#     # mean order_dow
#     d_dict['D_order_dow_mean'][ds] = angle_transform(ad[ds].set_index('department_id')
#                                         .order_dow,
#                                         7
#                                         ).groupby('department_id').aggregate(circmean)

#     # var order_dow
#     d_dict['D_order_dow_var'][ds] = angle_transform(ad[ds].set_index('department_id')
#                                        .order_dow,
#                                        7
#                                        ).groupby('department_id').aggregate(circvar)

#     # reorder ratio
#     d_dict['D_reorder_ratio'][ds] = (ad[ds].groupby(['department_id'])['reordered']
#                        .mean()
#                        .astype('float16')
#                        )

In [41]:
# # Name each feature Series after the d_dict key that points to it.

# for ds in dsets:
#     for k, v in d_dict.items():
#         v[ds].rename(k, inplace=True)

In [42]:
# # Combine department features into d_features
# # Reindex to products index for join with up_index

# #d_features = {ds : pd.DataFrame(index=groups_dict['P'][ds].index).join(
# #    pd.concat([d_dict[k][ds] for k in d_dict.keys()], axis=1)) for ds in dsets}

# # d_features = {ds : pd.concat([d_dict[k][ds] for k in d_dict.keys()], axis=1) for ds in dsets}

# groups_dict['D'] = {ds : 
#     # "dict" from product_id -> department_id (index=product_id, col=department_id)
#                     df_ad[['department_id', 'product_id']]
#                     .drop_duplicates()
#                     .set_index('product_id')
#                     .sort_index()
#     # join with department features with department_id as column
#                     .join(
#                         pd.concat([feature[ds] for feature in d_dict.values()], axis=1),
#         on='department_id')
#     .drop('department_id', axis=1)
#                     for ds in dsets}

# for ds in dsets:
#     groups_dict['D'][ds].index.rename('product_id', inplace=True)

In [43]:
# Cleanup intermediate dicts
del (
    u_given_dict,
    u_dict,
    p_dict,
    up_dict,
    #     a_dict,
    #     d_dict
)

# Cleanup dataframes
del (  #df_ad,
    df_aisles, df_departments, df_order_products_prior,
    df_order_products_train, df_orders, df_prior, df_products)

In [44]:
%who

LDA_features	 LatentDirichletAllocation	 UP_count_matrix	 angle_transform	 circmean	 circvar	 days_ago	 defaultdict	 dframe	 
ds	 dset	 dsets	 file_path	 groups_dict	 k	 lda	 load_data_dtype	 np	 
orders	 orders_full	 pd	 prior	 products_num	 store	 train_test_split	 u_index	 ultimate	 
up_empty_df	 up_index	 users	 users_num	 v	 


## Concatenate

Combine the features constructed above into a `dict` `X`, keyed by dataset, so that `X[ds]` instantiates $X_s$ for each $s \in \mathrm{DSets}$.

In [45]:
# For each dset, broadcast every feature group to up_index, then concatenate
# the groups column-wise
X = {
    ds: pd.concat(
        [
            pd.DataFrame(index=up_index[ds]).join(group[ds])
            for group in groups_dict.values()
        ],
        axis=1)
    for ds in dsets
}
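
The join in the cell above relies on pandas matching the name of a single-level index (`user_id` or `product_id`) to the level of the same name in the `(user_id, product_id)` MultiIndex. A minimal sketch on made-up data follows; `toy_up_index` and `toy_groups` are hypothetical stand-ins for `up_index[ds]` and `groups_dict`.

```python
import pandas as pd

toy_up_index = pd.MultiIndex.from_tuples(
    [(1, 196), (1, 10258), (2, 196)],
    names=['user_id', 'product_id'])

toy_groups = {
    # user-level features, indexed by user_id
    'U': pd.DataFrame({'U_orders_num': [10, 4]},
                      index=pd.Index([1, 2], name='user_id')),
    # product-level features, indexed by product_id
    'P': pd.DataFrame({'P_orders_num': [500, 70]},
                      index=pd.Index([196, 10258], name='product_id')),
    # user-product features, already on the full index
    'UP': pd.DataFrame({'UP_orders_num': [9, 1, 3]}, index=toy_up_index),
}

# Broadcast each group to the user-product index, then concatenate columns.
toy_X = pd.concat(
    [pd.DataFrame(index=toy_up_index).join(group)
     for group in toy_groups.values()],
    axis=1)
```

Every row of `toy_X` then carries its user's features, its product's features and its own user-product features side by side.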

In [46]:
# Nulls make sklearn unhappy
[X[ds].isnull().any().any() for ds in dsets]

[False, False, False]

Somewhere in the steps above, some of the earlier `uint` downcasts were undone: several count columns are back to `int64` or `float64`. For now, fix this manually.

In [47]:
X['train'].info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6760791 entries, (1, 196) to (206209, 48742)
Data columns (total 47 columns):
U_ultimate_order_dow                 float16
U_ultimate_order_hour_of_day         float16
U_ultimate_days_since_prior_order    float16
U_orders_num                         int64
U_items_total                        int64
U_order_size_mean                    float16
U_order_size_std                     float16
U_unique_products                    int64
U_reordered_num                      float64
U_reorder_size_mean                  float16
U_reorder_size_std                   float16
U_reordered_ratio                    float16
U_order_dow_mean                     float16
U_order_dow_var                      float16
U_order_dow_score                    float16
U_order_hour_of_day_mean             float16
U_order_hour_of_day_var              float16
U_order_hour_of_day_score            float16
U_days_since_prior_order_mean        float16
U_days_since_prior_orde

In [48]:
cols = [
    'U_orders_num', 'U_items_total', 'U_unique_products', 'U_reordered_num',
    'P_orders_num', 'P_unique_users', 'UP_orders_num',
    'UP_orders_since_previous'
]

for ds in dsets:
    X[ds][cols] = X[ds][cols].apply(pd.to_numeric,
                                    errors='coerce',
                                    downcast='unsigned')

# for ds in dsets:
#     X[ds]['UP_order_dow_score'] = np.nan_to_num(X[ds]['UP_order_dow_score'])
#     X[ds]['UP_order_hour_of_day_score'] = np.nan_to_num(X[ds]['UP_order_hour_of_day_score'])
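
As a small illustration of what the manual downcast buys, the sketch below applies the same `pd.to_numeric(..., downcast='unsigned')` call to a made-up frame and compares memory use. The column values are invented; only the column names are borrowed from `cols`.

```python
import pandas as pd

# Made-up count columns that have drifted to 64-bit dtypes.
toy = pd.DataFrame({
    'U_orders_num': pd.Series([4, 10, 99], dtype='int64'),
    'U_reordered_num': pd.Series([0.0, 3.0, 300.0], dtype='float64'),
})
bytes_before = toy.memory_usage(index=False).sum()

# Each column is downcast to the smallest unsigned integer dtype that holds
# its values; whole-number floats are converted as well.
toy_small = toy.apply(pd.to_numeric, errors='coerce', downcast='unsigned')
bytes_after = toy_small.memory_usage(index=False).sum()
```

Here the first column fits in `uint8` and the second, whose largest value is 300, needs `uint16`. At millions of rows, this difference per column is what keeps the frame within the memory limit described at the top of the notebook.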

In [49]:
X['train'].info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6760791 entries, (1, 196) to (206209, 48742)
Data columns (total 47 columns):
U_ultimate_order_dow                 float16
U_ultimate_order_hour_of_day         float16
U_ultimate_days_since_prior_order    float16
U_orders_num                         uint8
U_items_total                        uint16
U_order_size_mean                    float16
U_order_size_std                     float16
U_unique_products                    uint16
U_reordered_num                      uint16
U_reorder_size_mean                  float16
U_reorder_size_std                   float16
U_reordered_ratio                    float16
U_order_dow_mean                     float16
U_order_dow_var                      float16
U_order_dow_score                    float16
U_order_hour_of_day_mean             float16
U_order_hour_of_day_var              float16
U_order_hour_of_day_score            float16
U_days_since_prior_order_mean        float16
U_days_since_prior_ord

## Save $\{X_s\}$

In [50]:
store.open()

In [51]:
store.is_open

True

In [52]:
for dset, dframe in X.items():
    store['/X/' + str(dset)] = dframe

In [53]:
store.keys()

['/X/kaggle', '/X/test', '/X/train', '/y/kaggle', '/y/test', '/y/train']

In [54]:
store.close()

# References

(<a id="cit-bleiLatentDirichletAllocation2003" href="#call-bleiLatentDirichletAllocation2003">Blei, Ng <em>et al.</em>, 2003</a>) David M. Blei, Andrew Y. Ng and Michael I. Jordan, "_Latent Dirichlet Allocation_", Journal of Machine Learning Research, vol. 3 (Jan), pp. 993-1022, 2003.

