# Model to Predict Elo Customer Loyalty

_Note! If you want to commit any changes to this document, please strip all output (Cell > Current Outputs > Clear, or set up [nbstripout](https://github.com/kynan/nbstripout) as a git filter) from this notebook before doing so. Thanks!_


## Import Libraries

First we import the Python libraries we'll need. If any of these are missing for you, you can install them with e.g. `pip3 install pandas` on the command line.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

## Load Data

Load the data into Pandas data frames and look at their structure.

Later on we'll split the training data into a training and a validation set, hence the name `train_and_validation_df`. (The given test set is what we'll later make our predictions on and upload, but only after we are fully satisfied with our model.)

In [None]:
hist_trans_df = pd.read_csv('data/unzipped/historical_transactions.csv',
                            parse_dates=['purchase_date'])
merchants_df = pd.read_csv('data/unzipped/merchants.csv',
                           index_col='merchant_id')
merch_trans_df = pd.read_csv('data/unzipped/new_merchant_transactions.csv',
                             parse_dates=['purchase_date'])
train_and_validation_df = pd.read_csv('data/unzipped/train.csv',
                                      index_col='card_id',
                                      parse_dates=['first_active_month'])
test_df = pd.read_csv('data/unzipped/test.csv',
                      index_col='card_id',
                      parse_dates=['first_active_month'])

In [None]:
for v in ['feature_1', 'feature_2', 'feature_3']:
    train_and_validation_df[v] = train_and_validation_df[v].astype('category').cat.as_ordered()

In [None]:
for v in ['authorized_flag', 'category_1', 'category_2', 'category_3', 'merchant_id', 'merchant_category_id',
          'subsector_id', 'city_id', 'state_id']:
    hist_trans_df[v] = hist_trans_df[v].astype('category').cat.as_ordered()
    merch_trans_df[v] = merch_trans_df[v].astype('category').cat.as_ordered()

In [None]:
hist_trans_df.head()

In [None]:
merchants_df.head()

In [None]:
merch_trans_df.head()

In [None]:
train_and_validation_df.head()

## Create Features

Next we want to combine and shape all of our raw data to create useful features in the train (and validation and test) data set.

In [None]:
from fastai import *
from fastai.tabular import *
from fastai.metrics import *
from feature_engineering import *

Fastai has a useful helper function called `add_datepart()`, which takes a date field and turns it into a bunch of useful columns, such as "day of week", "is month end", etc.

In [None]:
add_datepart(hist_trans_df, 'purchase_date')
add_datepart(merch_trans_df, 'purchase_date')
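
To give a rough idea of the kind of columns `add_datepart()` produces, here is a minimal sketch using plain pandas on a made-up two-row table. The names `purchase_Year`, `purchase_Month`, `purchase_Day`, `purchase_Dayofweek` and `purchase_Elapsed` are used later in this notebook; `purchase_Is_month_end` and the exact definitions below are assumptions, not a copy of the fastai implementation.

```python
import pandas as pd

# Toy stand-in for a transactions table (values are made up).
toy = pd.DataFrame({'purchase_date': pd.to_datetime(['2018-01-31 13:45:00',
                                                     '2018-02-03 09:10:00'])})

dates = toy['purchase_date'].dt
toy['purchase_Year'] = dates.year
toy['purchase_Month'] = dates.month
toy['purchase_Day'] = dates.day
toy['purchase_Dayofweek'] = dates.dayofweek        # Monday is 0
toy['purchase_Is_month_end'] = dates.is_month_end  # assumed column name
# Whole seconds since the Unix epoch, handy for sorting and for time differences.
toy['purchase_Elapsed'] = (toy['purchase_date'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
```

The elapsed column is what makes the "time since last purchase" feature further down a simple subtraction.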

Let's also calculate, for each purchase, the time passed since _the last purchase for that card and merchant combination_, e.g. for a transaction of me buying toilet paper at Rewe (true story, by the way), it would be the time passed since I last bought anything at Rewe, or `nan` had I never bought anything there.

We don't do this for the new merchant transactions, because that table only ever contains one purchase for each card-merchant combination, so this value would always be `nan`.

In [None]:
hist_trans_df.sort_values(by=['purchase_Elapsed'], inplace=True)

Since `merchant_id` can be nan, we need to treat those as non-nan temporarily in order not to drop rows when grouping. We also filter the data frame, since we are only interested in these three columns.

In [None]:
# Copy the selection so that the assignment below doesn't modify a view of hist_trans_df.
filtered = hist_trans_df[['card_id', 'merchant_id', 'purchase_Elapsed']].copy()
filtered['merchant_id'] = filtered['merchant_id'].cat.add_categories('temp_nan').fillna('temp_nan')

In [None]:
grouped = filtered.groupby(['card_id', 'merchant_id'], sort=False)

Now we compute, within each card-merchant group, the `diff` (i.e. the value at row `n` minus the value at row `n - 1`). The first purchase in each group has no predecessor and gets `nan`.

_**Warning!** This takes a very long time ..._

In [None]:
# diff() is a per-row transform, so the result aligns with hist_trans_df by index.
hist_trans_df['elapsed_since_merchant_purchase'] = grouped['purchase_Elapsed'].diff()

In [None]:
hist_trans_df.head()
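
The whole "time since the last purchase at the same merchant" step can be tried out on a tiny made-up table. This is only a sketch of the idea (the data and the `toy` and `keys` names are invented for illustration); it also shows how the temporary `temp_nan` category groups purchases with a missing merchant together.

```python
import numpy as np
import pandas as pd

# Made-up transactions: card C1 buys twice at M1 and twice at an unknown merchant.
toy = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C1', 'C1', 'C2'],
    'merchant_id': pd.Categorical(['M1', 'M1', np.nan, np.nan, 'M1']),
    'purchase_Elapsed': [100, 250, 300, 700, 400],
}).sort_values('purchase_Elapsed')

# Replace missing merchants by a placeholder category so they form their own group.
keys = toy[['card_id', 'merchant_id', 'purchase_Elapsed']].copy()
keys['merchant_id'] = keys['merchant_id'].cat.add_categories('temp_nan').fillna('temp_nan')

# Difference to the previous purchase within each (card, merchant) group.
toy['elapsed_since_merchant_purchase'] = (
    keys.groupby(['card_id', 'merchant_id'], sort=False, observed=True)['purchase_Elapsed'].diff())
```

Because the frame is sorted by `purchase_Elapsed` first, "previous row in the group" means "previous purchase in time".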

### Aggregate Transaction Data

Next we'll use the functions defined in `feature_engineering.py` to aggregate the historical transactions for each card into single values for that card, for instance the mean of all purchase amounts, etc.

_Note: these functions can take quite a long time to complete._

In [None]:
hist_trans_df.columns

In [None]:
aggregators = {
    'purchase_amount': ['sum', 'mean', 'min', 'max', 'std'],
    'installments': ['sum', 'mean', 'min', 'max', 'std'],
    'month_lag': ['mean', 'min', 'max'],
    'merchant_id': ['nunique'],
    'merchant_category_id': ['nunique'],
    'state_id': ['nunique'],
    'city_id': ['nunique'],
    'subsector_id': ['nunique'],
    'elapsed_since_merchant_purchase': ['sum', 'mean', 'min', 'max', 'std'],
}

In [None]:
add_aggregated_numerical_fields(train_and_validation_df, hist_trans_df, aggregators=aggregators)
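
`feature_engineering.py` is not shown in this notebook, so the following is only a guess at what such an aggregation could look like, written as a hypothetical `sketch_add_aggregates()` function on made-up data. It produces column names of the form `purchase_amount_sum`, matching the names used further down.

```python
import pandas as pd

# Made-up cards and transactions.
cards = pd.DataFrame(index=pd.Index(['C1', 'C2'], name='card_id'))
transactions = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C2'],
    'purchase_amount': [10.0, 30.0, 5.0],
    'merchant_id': ['M1', 'M2', 'M1'],
})
toy_aggregators = {'purchase_amount': ['sum', 'mean'], 'merchant_id': ['nunique']}

def sketch_add_aggregates(cards_df, transactions_df, aggregators, prefix=''):
    """Hypothetical stand-in for add_aggregated_numerical_fields()."""
    aggregated = transactions_df.groupby('card_id').agg(aggregators)
    # Flatten ('purchase_amount', 'sum') into 'purchase_amount_sum'.
    aggregated.columns = [f'{prefix}{column}_{function}' for column, function in aggregated.columns]
    # Add the new columns in place, aligned on the card_id index.
    for name in aggregated.columns:
        cards_df[name] = aggregated[name]

sketch_add_aggregates(cards, transactions, toy_aggregators)
```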

For the categorical fields, we can't aggregate by taking the mean or sum values, so let's count the occurrences of each possible categorical value instead. _(In other words, for a category that can be either YES or NO, we count the number of YESes and the number of NOs and use those values.)_

In [None]:
add_aggregated_categorical_fields(train_and_validation_df,
                                  hist_trans_df,
                                  column_names=['authorized_flag', 'category_1', 'category_2', 'category_3'])
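
The column names used later (e.g. `authorized_flag_Y_ratio`) suggest that the helper stores the share of each value per card rather than the raw count. Under that assumption, a minimal sketch on made-up data could look like this; it is not the actual code from `feature_engineering.py`.

```python
import pandas as pd

# Made-up transactions.
transactions = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C1', 'C2'],
    'authorized_flag': ['Y', 'Y', 'N', 'Y'],
})

# Share of each category value per card; normalize='index' makes each row sum to 1.
shares = pd.crosstab(transactions['card_id'], transactions['authorized_flag'], normalize='index')
shares.columns = [f'authorized_flag_{value}_ratio' for value in shares.columns]
```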

In [None]:
# category_2 and category_3 contain nan values, so let's skip those for now.
add_top_categories(train_and_validation_df,
                   hist_trans_df,
                   column_names=['authorized_flag', 'category_1', 'subsector_id', 'city_id', 'state_id',
                                 'purchase_Year', 'purchase_Month', 'purchase_Week', 'purchase_Day',
                                 'purchase_Dayofweek'])
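
Assuming that "top category" means the most frequent value per card (the helper itself is not shown here, so this is a guess), the idea can be sketched on made-up data as follows. The `_top` suffix mirrors the column names used later in this notebook.

```python
import pandas as pd

# Made-up transactions.
transactions = pd.DataFrame({
    'card_id': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'city_id': [69, 69, 17, 88, 88],
})

# Most frequent value per card; on ties, mode() returns sorted values, so the smallest wins.
city_id_top = (transactions.groupby('card_id')['city_id']
               .agg(lambda values: values.mode().iloc[0])
               .rename('city_id_top'))
```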

Now let's do the same aggregating for the `new_merchant_transactions` data.

In [None]:
add_aggregated_numerical_fields(train_and_validation_df, merch_trans_df, aggregators=aggregators, prefix='merch_')

This table concerns only authorized transactions, so we don't need to include that column here.

In [None]:
add_aggregated_categorical_fields(train_and_validation_df,
                                  merch_trans_df,
                                  column_names=['category_1', 'category_2', 'category_3'],
                                  prefix='merch_')

As before, we skip `category_2` and `category_3` here because they contain nan values.

For some reason `add_top_categories()` fails for the `merch_trans_df` data frame, so the call is commented out for now.

In [None]:
#add_top_categories(train_and_validation_df,
#                   merch_trans_df,
#                   column_names=['category_1', 'subsector_id', 'city_id', 'state_id'],
#                   prefix='merch_')

In [None]:
train_and_validation_df.head()

### Repeat for Test Set

In [None]:
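add_aggregated_numerical_fields(test_df, hist_trans_df, aggregators=aggregators)
add_aggregated_categorical_fields(test_df,
                                  hist_trans_df,
                                  column_names=['authorized_flag', 'category_1', 'category_2', 'category_3'])
# category_2 and category_3 contain nan values, so let's skip those for now.
# Use the same columns as for the training data, so the test set has every feature the model expects.
add_top_categories(test_df,
                   hist_trans_df,
                   column_names=['authorized_flag', 'category_1', 'subsector_id', 'city_id', 'state_id',
                                 'purchase_Year', 'purchase_Month', 'purchase_Week', 'purchase_Day',
                                 'purchase_Dayofweek'])
# The new merchant aggregations are needed for the test set as well.
add_aggregated_numerical_fields(test_df, merch_trans_df, aggregators=aggregators, prefix='merch_')
add_aggregated_categorical_fields(test_df, merch_trans_df,
                                  column_names=['category_1', 'category_2', 'category_3'], prefix='merch_')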

## Split Into Train and Validation Sets

Split our data into a training set (80%) and a validation set (20%).

In [None]:
from sklearn.model_selection import train_test_split
train_df, validate_df = train_test_split(train_and_validation_df, test_size=0.2, random_state=238923)

In [None]:
train_df.shape

In [None]:
validate_df.shape

In [None]:
train_df.head()

## Remove Outliers

We shouldn't actually ever do this manually, except for experimental purposes. Spoiler: the outliers have a large impact on the final performance of our model.

In [None]:
# train_df = train_df[train_df.target > -25]

## A Quick Look at Correlations

In [None]:
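# Restrict to numeric columns explicitly; newer pandas versions raise on non-numeric ones.
train_df.select_dtypes('number').corr()['target'].sort_values(ascending=False)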

## Set Up Model

We'll use the fastai tabular learner here, which is built for exactly this kind of problem: predicting a value from a mix of categorical and continuous columns.

### Create Data Bunch

A fastai DataBunch more or less contains the data that we'll feed to our model.

First, as the data bunch takes one data frame containing both the training and validation samples, we need to get the indices for our validation samples.

Then we tell the model which of the columns are categorical features, which are continuous features, and also which of the columns contains the target (the value we want to predict).

In [None]:
valid_idx = range(len(train_df), len(train_df) + len(validate_df)); valid_idx

Let's have a look at which columns we have. We will need to tell fastai which ones are categorical and which ones are continuous.

In [None]:
train_df.columns

In [None]:
category_names = ['feature_1',
                  'feature_2',
                  'feature_3',
                  'authorized_flag_top',
                  'category_1_top',
                  'subsector_id_top',
                  'city_id_top',
                  'state_id_top',
                  'purchase_Year_top',
                  'purchase_Month_top',
                  'purchase_Week_top',
                  'purchase_Day_top',
                  'purchase_Dayofweek_top']
continuous_names = ['purchase_amount_sum',
                    'purchase_amount_mean',
                    'purchase_amount_min',
                    'purchase_amount_max',
                    'purchase_amount_std',
                    'installments_sum',
                    'installments_mean',
                    'installments_min',
                    'installments_max',
                    'installments_std',
                    'month_lag_mean',
                    'month_lag_min',
                    'month_lag_max',
                    'merchant_id_nunique',
                    'merchant_category_id_nunique',
                    'state_id_nunique',
                    'city_id_nunique',
                    'subsector_id_nunique',
                    'elapsed_since_merchant_purchase_sum',
                    'elapsed_since_merchant_purchase_mean',
                    'elapsed_since_merchant_purchase_min',
                    'elapsed_since_merchant_purchase_max',
                    'elapsed_since_merchant_purchase_std',
                    'authorized_flag_Y_ratio',
                    'category_1_Y_ratio',
                    'category_2_1.0_ratio',
                    'category_2_2.0_ratio',
                    'category_2_3.0_ratio',
                    'category_2_4.0_ratio',
                    'category_2_5.0_ratio',
                    'category_3_A_ratio',
                    'category_3_B_ratio',
                    'category_3_C_ratio',
                    'merch_purchase_amount_sum',
                    'merch_purchase_amount_mean',
                    'merch_purchase_amount_min',
                    'merch_purchase_amount_max',
                    'merch_purchase_amount_std',
                    'merch_installments_sum',
                    'merch_installments_mean',
                    'merch_installments_min',
                    'merch_installments_max',
                    'merch_installments_std',
                    'merch_month_lag_mean',
                    'merch_month_lag_min',
                    'merch_month_lag_max',
                    'merch_merchant_id_nunique',
                    'merch_merchant_category_id_nunique',
                    'merch_state_id_nunique',
                    'merch_city_id_nunique',
                    'merch_subsector_id_nunique',
                    'merch_category_1_Y_ratio',
                    'merch_category_2_1.0_ratio',
                    'merch_category_2_3.0_ratio',
                    'merch_category_2_2.0_ratio',
                    'merch_category_2_4.0_ratio',
                    'merch_category_2_5.0_ratio',
                    'merch_category_3_A_ratio',
                    'merch_category_3_B_ratio',
                    'merch_category_3_C_ratio']
dep_var = 'target'

Since we picked our validation samples randomly from the initial data set, and since fastai requires us to give the indices of the validation samples in a data frame containing both the training and validation samples, we just concatenate them together with training samples first and the validation samples at the end.

In [None]:
df = pd.concat([train_df, validate_df]).reset_index()[category_names + continuous_names + [dep_var]]
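
To convince ourselves that the index range really points at the validation rows after concatenating, here is a minimal sketch on made-up frames (the `toy_` names are invented for illustration).

```python
import pandas as pd

# Made-up training and validation frames with arbitrary index labels.
toy_train = pd.DataFrame({'target': [0.1, 0.2, 0.3]}, index=['a', 'b', 'c'])
toy_valid = pd.DataFrame({'target': [0.9, 0.8]}, index=['d', 'e'])

# Positions of the validation rows once they are appended after the training rows.
toy_valid_idx = range(len(toy_train), len(toy_train) + len(toy_valid))
combined = pd.concat([toy_train, toy_valid]).reset_index(drop=True)

# Selecting by position with that range gives back exactly the validation data.
recovered = combined.iloc[list(toy_valid_idx)]
```

This only works because the training rows come first; swapping the order in `pd.concat` would make the range point at the wrong rows.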

In [None]:
data = (TabularList.from_df(df,
                            path='data/unzipped',
                            cat_names=category_names,
                            cont_names=continuous_names,
                            procs=[FillMissing, Categorify, Normalize])
                .split_by_idx(valid_idx)
                .label_from_df(cols=dep_var, label_cls=FloatList)
                .databunch())

Let's have a look at a random batch of data to see how it looks after the processing done by the fastai library.

In [None]:
data.show_batch()

### Create Learner

This is what we actually use to train the model and make predictions.

First we decide how large we want to make the embeddings of our categorical features (the number of category options divided by 2 is a good heuristic, apparently).

Then we tell the model the range within which we expect all predictions to fall. Internally the model uses a sigmoid function, which only approaches its bounds asymptotically, so in order to actually get predictions near the observed minimum and maximum in practice, we set both bounds a little beyond those values.

The competition uses root mean squared error to evaluate the entries, so we'll use that, too.

In [None]:
min_y = np.min(train_df['target']) * 1.2
max_y = np.max(train_df['target']) * 1.2
y_range = torch.tensor([min_y, max_y], device=defaults.device); y_range

In [None]:
np.min(train_df['target']), np.max(train_df['target'])
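
Why pad the range at all? The following sketch assumes the final layer's output is squashed as `low + (high - low) * sigmoid(x)`, which is a common way to implement such a range; the numbers are made up and this is not a copy of the fastai code.

```python
import numpy as np

def scale_to_range(raw_output, low, high):
    """Squash unbounded values into the open interval (low, high) with a sigmoid."""
    return low + (high - low) / (1.0 + np.exp(-raw_output))

target_min, target_max = -30.0, 15.0       # made-up target range
raw = np.array([-10.0, 0.0, 5.0, 10.0])    # made-up network outputs

# With the exact range, the bounds themselves can never be reached.
tight = scale_to_range(raw, target_min, target_max)
# With a padded range, moderate raw outputs already cover the observed extremes.
padded = scale_to_range(raw, target_min * 1.2, target_max * 1.2)
```

Note that multiplying by 1.2 only widens the range when the minimum is negative and the maximum is positive, which the cell above lets us check.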

In [None]:
learn = tabular_learner(data,
                        layers=[200, 100],
                        ps=[1e-2, 1e-1],
                        emb_drop=0.04,
                        y_range=y_range,
                        metrics=rmse)

In [None]:
learn.model
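
The "number of options divided by 2" heuristic for embedding sizes mentioned above can be sketched like this. The cardinalities are made up and the cap of 50 is an arbitrary assumption; the learner above does not set embedding sizes explicitly.

```python
# Made-up cardinalities (number of distinct values) for three categorical columns.
cardinalities = {'feature_1': 5, 'feature_3': 2, 'city_id_top': 300}

# Half the number of options, rounded up, with an assumed cap of 50 dimensions.
embedding_sizes = {name: min(50, (count + 1) // 2) for name, count in cardinalities.items()}
```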

### Figure Out Learning Rate

To figure out which learning rate to use, we use fastai's learning rate finder.

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

### Train Model

Finally we train the model, with weight decay to penalize large weights and thereby reduce overfitting, and then show some results.

In [None]:
learn.fit_one_cycle(1, 1e-3, wd=0.6)
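
Weight decay shrinks every weight a little on each update step. A minimal sketch with plain SGD, which is a simplification and not the optimizer fastai actually uses; the learning rate of 0.1 and the `sgd_step` helper are made up for illustration:

```python
import numpy as np

def sgd_step(weights, gradient, learning_rate, weight_decay):
    """One plain SGD step where the decay term pulls each weight towards zero."""
    return weights - learning_rate * (gradient + weight_decay * weights)

weights = np.array([2.0, -1.0, 0.5])
zero_gradient = np.zeros_like(weights)

# With a zero loss gradient only the decay acts: each step multiplies by 1 - 0.1 * 0.6.
for _ in range(3):
    weights = sgd_step(weights, zero_gradient, learning_rate=0.1, weight_decay=0.6)
```

Weights that don't help reduce the loss therefore drift towards zero over time.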

In [None]:
learn.recorder.plot_losses()

In [None]:
# show_results() is a method of the learner, not of its recorder.
learn.show_results()

## Make Predictions

Now that we have trained our model, let's make some predictions to see whether or not our metrics lie to us.

In [None]:
predictions, targets = [x.numpy().flatten() for x in learn.get_preds(DatasetType.Valid)]
prediction_df = pd.DataFrame({'prediction': predictions, 'target': targets})

In [None]:
(np.amin(predictions), np.amax(predictions))

In [None]:
prediction_df.head()

In [None]:
prediction_df.tail()

### Calculate RMSE On Validation Set

Get the root mean squared error for the validation set only. We can compare this value, more or less, against the public leaderboard on Kaggle.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [None]:
sqrt(mean_squared_error(prediction_df.target, prediction_df.prediction))
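
For reference, the same metric computed by hand on a few made-up numbers: take the differences, square them, average, and take the square root.

```python
import numpy as np

# Made-up targets and predictions.
toy_targets = np.array([1.0, -2.0, 0.5, 3.0])
toy_predictions = np.array([0.0, -2.0, 1.5, 1.0])

errors = toy_predictions - toy_targets     # [-1, 0, 1, -2]
toy_rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1 + 0 + 1 + 4) / 4)
```

Because the errors are squared, a single large miss (such as an outlier in the target) contributes far more than several small ones.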

## Make Submission Predictions

Finally, we need to run our model against the test set that is used by the competition's organizers to evaluate the competitors. We save the result to a zipped `submission.csv.zip` file which we'll then upload to Kaggle.

_Note: we should only do this at the very end, when we are happy with our hyperparameters. Otherwise, if we change our model based on our results on the public leaderboard, we risk overfitting our model to the 30% of samples used for the public leaderboard, and will fail to generalize for the remaining 70% of samples._

In [None]:
out_df = test_df.copy(); out_df.head()

In [None]:
# Warning -- this takes quite a long time.
out_df['target'] = [learn.predict(row)[2].numpy().flatten()[0] for _, row in out_df.iterrows()]

In [None]:
out_df['target'].to_csv('submission.csv.zip', header=['target'], index_label='card_id', compression='zip')