# Model to Predict Elo Customer Loyalty

_Note! If you want to commit any changes to this document, please strip all output (Cell > Current Outputs > Clear, or set up [nbstripout](https://github.com/kynan/nbstripout) as a git filter) from this notebook before doing so. Thanks!_


## Import Libraries

First we import the Python libraries we'll need. If any of them are missing, you can install them from the command line, e.g. `pip3 install pandas`.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

## Load Data

Load the data into Pandas data frames and look at their structure.

Before training, we'll split the training data into a train and a validation set. (The given test set is what we'll later make our predictions on and upload, but only once we are fully satisfied with our model.)

In [None]:
hist_trans_df = pd.read_csv('data/unzipped/historical_transactions.csv')
merchants_df = pd.read_csv('data/unzipped/merchants.csv', index_col='merchant_id')
merch_trans_df = pd.read_csv('data/unzipped/new_merchant_transactions.csv')
train_and_validation_df = pd.read_csv('data/unzipped/train.csv', index_col='card_id')
test_df = pd.read_csv('data/unzipped/test.csv', index_col='card_id')

In [None]:
hist_trans_df.head()

In [None]:
merchants_df.head()

In [None]:
merch_trans_df.head()

In [None]:
train_and_validation_df.head()

## Create Features

Next we want to combine and shape all of our raw data to create useful features in the train data set.

For now, I'll do this step by step here to make it a bit easier to understand (for myself), but later we'll probably want to move this into some Python classes, or this file will become huge. // Erich

In [None]:
merged = train_and_validation_df.merge(hist_trans_df, how='left', on=['card_id']); merged.head()

In [None]:
means = merged.groupby('card_id')['purchase_amount'].mean(); means.head()

In [None]:
sums = merged.groupby('card_id')['purchase_amount'].sum(); sums.head()

In [None]:
train_and_validation_df['purchase_amount_mean'] = means
train_and_validation_df['purchase_amount_sum'] = sums
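
To make the merge-and-aggregate step a bit more tangible, here is a minimal sketch on toy data. The frames `cards` and `transactions` and all their values are made up for illustration; they only stand in for `train.csv` and `historical_transactions.csv`.

```python
import pandas as pd

# Toy stand-ins for train.csv and historical_transactions.csv (made-up values).
cards = pd.DataFrame({'target': [0.5, -1.0, 2.0]},
                     index=pd.Index(['C1', 'C2', 'C3'], name='card_id'))
transactions = pd.DataFrame({'card_id': ['C1', 'C1', 'C2'],
                             'purchase_amount': [-0.5, 0.25, 0.75]})

# The left merge keeps every card, even C3, which has no transactions.
toy_merged = cards.reset_index().merge(transactions, how='left', on='card_id')
per_card = toy_merged.groupby('card_id')['purchase_amount']

# Assignment aligns on the card_id index, so row order does not matter.
cards['purchase_amount_mean'] = per_card.mean()
cards['purchase_amount_sum'] = per_card.sum()
print(cards)
```

Note that a card without any transactions ends up with a missing mean but a sum of zero, so the two features treat "no history" differently.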

### Repeat for Test Set

In [None]:
test_merged = test_df.merge(hist_trans_df, how='left', on=['card_id'])
test_means = test_merged.groupby('card_id')['purchase_amount'].mean()
test_sums = test_merged.groupby('card_id')['purchase_amount'].sum()
test_df['purchase_amount_mean'] = test_means
test_df['purchase_amount_sum'] = test_sums

## Split Into Train and Validation Sets

Split our data into a train set (70%) and a validation set (30%).

In [None]:
from sklearn.model_selection import train_test_split
train_df, validate_df = train_test_split(train_and_validation_df, test_size=0.3, random_state=238923)

In [None]:
train_df.shape

In [None]:
validate_df.shape

In [None]:
train_df.head()
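
For intuition, a shuffled split can be sketched by hand with NumPy. This is not how scikit-learn implements `train_test_split`, just a rough equivalent on a made-up frame, with a hypothetical validation fraction of 0.3.

```python
import numpy as np
import pandas as pd

# Toy frame with ten made-up rows.
toy = pd.DataFrame({'target': np.arange(10.0)})

# Shuffle the row positions reproducibly, then cut off the validation part.
rng = np.random.RandomState(0)
shuffled = rng.permutation(len(toy))
n_valid = int(round(0.3 * len(toy)))

toy_valid = toy.iloc[shuffled[:n_valid]]
toy_train = toy.iloc[shuffled[n_valid:]]
print(toy_train.shape, toy_valid.shape)
```

The important property is that the two parts are disjoint and together cover every row, so no sample is both trained on and validated against.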

## Explore Data

### Correlations

In [None]:
sns.heatmap(train_df.corr(), vmin=-1, vmax=1, cmap='PiYG')

### Distributions

In [None]:
sns.distplot(train_df.target)

In [None]:
sns.countplot(x='feature_1', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='feature_2', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='feature_3', palette='Set2', data=train_df)

In [None]:
sns.distplot(train_df.purchase_amount_mean)

## Set Up Model

We'll use the fastai tabular regressor, for which we'll need some additional imports.

In [None]:
from fastai import *
from fastai.tabular import *
from fastai.metrics import *

### Create Data Bunch

A fastai DataBunch more or less contains the data that we'll feed to our model.

First, as the data bunch takes one data frame containing both the training and validation samples, we need to get the indices for our validation samples.

Then we tell the model which of the columns are categorical features, which are continuous features, and also which of the columns contains the target (the value we want to predict).

In [None]:
valid_idx = range(len(train_and_validation_df) - len(validate_df), len(train_and_validation_df)); valid_idx

We picked our validation samples randomly from the initial data set, but fastai wants them given as indices into a single data frame that holds both the training and the validation samples. So we just concatenate the two, with the training samples first and the validation samples at the end.
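
To see why a plain `range` is enough here, consider this small sketch. The frames `toy_train` and `toy_valid` and their values are invented for illustration only.

```python
import pandas as pd

# Toy stand-ins for train_df and validate_df (made-up values, shuffled labels).
toy_train = pd.DataFrame({'target': [0.1, 0.2, 0.3]}, index=['c', 'a', 'e'])
toy_valid = pd.DataFrame({'target': [9.1, 9.2]}, index=['d', 'b'])

# Training rows first, validation rows last, with a fresh 0..n-1 index.
toy_df = pd.concat([toy_train, toy_valid]).reset_index(drop=True)

# The validation rows are therefore the last len(toy_valid) positions.
toy_valid_idx = range(len(toy_df) - len(toy_valid), len(toy_df))
print(list(toy_valid_idx))
print(toy_df.iloc[list(toy_valid_idx)])
```

Because the validation rows sit at the end after concatenation, their positions form one contiguous range, regardless of how randomly they were picked in the first place.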

In [None]:
category_names = ['feature_1', 'feature_2', 'feature_3']
continuous_names = ['purchase_amount_mean', 'purchase_amount_sum']
dep_var = 'target'

In [None]:
df = pd.concat([train_df, validate_df]).reset_index()[category_names + continuous_names + [dep_var]]

In [None]:
data = (TabularList.from_df(df,
                            path='data/unzipped',
                            cat_names=category_names,
                            cont_names=continuous_names,
                            procs=[FillMissing, Categorify, Normalize])
                .split_by_idx(valid_idx)
                .label_from_df(cols=dep_var, label_cls=FloatList)
                .databunch())

### Create Learner

This is what we actually use to train the model and make predictions.

First we decide how large we want to make the embeddings of our categorical features (the number of category options divided by 2 is a good heuristic, apparently).

Then we tell the model the range within which we expect all predictions to fall (internally the model uses a sigmoid function, so in order for us, in practice, to actually get predictions near the expected maximum value, we set the upper bound to be a little higher than the expected maximum).

The competition uses root mean squared error to evaluate the entries, so we'll use that, too.

In [None]:
category_szs = {'feature_1': 5,
                'feature_2': 3,
                'feature_3': 2}
emb_szs = {k: (v + 1) // 2 for k, v in category_szs.items()}
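
Instead of hard-coding the number of options per category, we could probably count them from the data. A minimal sketch, assuming a made-up frame `toy` in place of `train_df`:

```python
import pandas as pd

# Toy frame with made-up values; the real one would be train_df.
toy = pd.DataFrame({'feature_1': [1, 2, 3, 4, 5, 5],
                    'feature_2': [1, 2, 3, 3, 1, 2],
                    'feature_3': [0, 1, 1, 0, 0, 1]})

# Count the distinct options per categorical column...
toy_category_szs = {col: toy[col].nunique() for col in toy.columns}
# ...and halve them, rounding up, following the same heuristic.
toy_emb_szs = {col: (n + 1) // 2 for col, n in toy_category_szs.items()}
print(toy_emb_szs)
```

This only counts the options that actually occur in the frame, so it would undercount if a category value is present in the test set but missing from the training data.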

In [None]:
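# The targets are not log-transformed, so the bounds are in target units,
# padded by 20% because the sigmoid only approaches its limits.
max_abs_y = np.abs(train_df['target']).max() * 1.2
y_range = torch.tensor([-max_abs_y, max_abs_y], device=defaults.device); y_range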
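
Why pad the range at all? The sketch below assumes, as described above, that the final layer is a sigmoid rescaled to the given range; the function `squash` and the value `expected_max` are hypothetical and only illustrate the effect.

```python
import numpy as np

def squash(x, low, high):
    """Map raw activations into (low, high) with a scaled sigmoid."""
    return low + (high - low) / (1.0 + np.exp(-x))

raw = np.array([-4.0, 0.0, 4.0])   # made-up raw activations
expected_max = 10.0                # hypothetical largest target

tight = squash(raw, -expected_max, expected_max)
padded = squash(raw, -1.2 * expected_max, 1.2 * expected_max)
print(tight)   # stays strictly inside the unpadded range
print(padded)  # the largest activation now lands beyond expected_max
```

With the unpadded range, even a fairly large activation falls short of the expected maximum, while the padded range lets the same activation reach it.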

In [None]:
learn = tabular_learner(data,
                        layers=[1000, 500],
                        emb_szs=emb_szs,
                        ps=[0.2, 0.5],
                        emb_drop=0.1,
                        y_range=y_range,
                        metrics=rmse)

In [None]:
learn.model

### Figure Out Learning Rate

To figure out which learning rate to use, we use fastai's learning rate finder.

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

### Train Model

Finally we train the model, with weight decay to regularize it by penalizing large weights, and then show some results.

In [None]:
learn.fit_one_cycle(1, 1e-5, wd=0.2)
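
As a rough illustration of what weight decay does, here is a plain SGD step with decoupled weight decay. This is only a sketch: fastai's actual optimizer is more involved, and `sgd_step` and all values are made up.

```python
import numpy as np

def sgd_step(w, grad, lr, wd):
    """One plain SGD step with decoupled weight decay."""
    return w - lr * grad - lr * wd * w

w = np.array([2.0, -4.0])          # made-up weights
zero_grad = np.zeros_like(w)

# With no gradient signal, decay alone pulls every weight toward zero.
for _ in range(3):
    w = sgd_step(w, zero_grad, lr=0.5, wd=0.2)
print(w)
```

Each step shrinks the weights by a factor of `1 - lr * wd`, so weights only stay large if the gradient keeps pushing them there.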

In [None]:
learn.recorder.plot_losses()

## Make Predictions

Now that we have trained our model, let's make some predictions to see whether or not our metrics lie to us.

In [None]:
predictions, targets = [x.numpy().flatten() for x in learn.get_preds(DatasetType.Valid)]
prediction_df = pd.DataFrame({'prediction': predictions, 'target': targets})

In [None]:
(np.amin(predictions), np.amax(predictions))

In [None]:
prediction_df.head()

In [None]:
prediction_df.tail()

### Calculate RMSE On Validation Set

Get the root mean squared error for the validation set only. We can compare this value, more or less, against the public leaderboard on Kaggle.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [None]:
sqrt(mean_squared_error(prediction_df.target, prediction_df.prediction))
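
For reference, the same metric can be written out by hand with NumPy. The helper `rmse_np` and the toy values are made up for illustration.

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

toy_targets = [0.0, 2.0, -1.0, 3.0]
toy_predictions = [1.0, 2.0, 1.0, 1.0]
print(rmse_np(toy_targets, toy_predictions))
```

Since the errors are squared before averaging, a few badly mispredicted cards can dominate the score.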

## Make Submission Predictions

Finally, we need to run our model against the test set that is used by the competition's organizers to evaluate the competitors. We save the result to a zipped `submission.csv.zip` file which we'll then upload to Kaggle.

_Note: we should only do this at the very end, when we are happy with our hyperparameters. Otherwise, if we change our model based on our results on the public leaderboard, we risk overfitting our model to the 30% of samples used for the public leaderboard, and will fail to generalize for the remaining 70% of samples._

In [None]:
out_df = test_df.copy(); out_df.head()

In [None]:
# Warning -- this takes quite a long time.
out_df['target'] = [learn.predict(row)[2].numpy().flatten()[0] for _, row in out_df.iterrows()]

In [None]:
out_df['target'].to_csv('submission.csv.zip', header=['target'], index_label='card_id', compression='zip')
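
Before uploading, it may be worth reading the file back to check its shape. A minimal sketch on made-up predictions, written to a temporary directory so it doesn't touch the real submission:

```python
import os
import tempfile
import pandas as pd

# Toy predictions with made-up card ids and values.
toy_out = pd.DataFrame({'target': [0.25, -1.5]},
                       index=pd.Index(['C_ID_a', 'C_ID_b'], name='card_id'))

path = os.path.join(tempfile.mkdtemp(), 'submission.csv.zip')
toy_out[['target']].to_csv(path, index_label='card_id', compression='zip')

# Read the archive back and check the columns and missing values.
check = pd.read_csv(path)
print(check.columns.tolist())
print(check.isna().sum().sum())
```

The same check on the real file would catch a missing `card_id` column or stray missing predictions before they cost us a submission.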