# Model to Predict Elo Customer Loyalty

_Note! If you want to commit any changes to this document, please strip all output (Cell > Current Outputs > Clear, or set up [nbstripout](https://github.com/kynan/nbstripout) as a git filter) from this notebook before doing so. Thanks!_


## Import Libraries

First we import the Python libraries we'll need. If any of these are missing for you, you can install them from the command line with, e.g., `pip3 install pandas`.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
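
If you'd rather check up front which libraries are missing, here is a minimal sketch using the standard library's `importlib.util.find_spec`. The helper name `missing_modules` is hypothetical, not part of any library.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# Hypothetical usage for this notebook's top-level imports:
# print(missing_modules(["pandas", "numpy", "seaborn"]))
```

An empty list means everything is importable; any name it returns is a candidate for `pip3 install`.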

## Load Data

Load the data into pandas data frames and look at their structure.

The first thing we'll do with the training data is split it into a training set and a validation set. (The given test set is what we'll later make our predictions on and upload, but only once we are fully satisfied with our model.)

In [None]:
hist_trans_df = pd.read_csv('data/unzipped/historical_transactions.csv')
merchants_df = pd.read_csv('data/unzipped/merchants.csv')
merch_trans_df = pd.read_csv('data/unzipped/new_merchant_transactions.csv')
train_and_validation_df = pd.read_csv('data/unzipped/train.csv')
test_df = pd.read_csv('data/unzipped/test.csv')

In [None]:
from sklearn.model_selection import train_test_split
train_df, validate_df = train_test_split(train_and_validation_df, test_size=0.2, random_state=238923)

In [None]:
train_df.shape

In [None]:
validate_df.shape

In [None]:
hist_trans_df.head()

In [None]:
merchants_df.head()

In [None]:
merch_trans_df.head()

In [None]:
train_df.head()

## Explore Data

### Correlations

In [None]:
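# Correlate numeric columns only; recent pandas raises on non-numeric ones (e.g. string IDs).
sns.heatmap(train_df.select_dtypes('number').corr(), vmin=-1, vmax=1, cmap='PiYG')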

### Distributions

In [None]:
sns.distplot(train_df.target)

In [None]:
sns.countplot(x='feature_1', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='feature_2', palette='Set2', data=train_df)

In [None]:
sns.countplot(x='feature_3', palette='Set2', data=train_df)

In [None]:
sns.distplot(hist_trans_df.purchase_amount)

## Set Up Model

We'll use the fastai tabular regressor, for which we'll need some additional imports.

In [None]:
from fastai import *
from fastai.tabular import *
from fastai.metrics import *

### Create Data Bunch

A fastai DataBunch more or less contains the data that we'll feed to our model.

First, as the data bunch takes one data frame containing both the training and validation samples, we need to get the indices of our validation samples within that frame.

Then we tell the model which of the columns are categorical features, which are continuous features, and also which of the columns contains the target (the value we want to predict).

In [None]:
# train_test_split shuffles rows, so use validate_df's own index (equal to row positions for a RangeIndex).
valid_idx = validate_df.index.tolist()
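
One thing to keep in mind here: a shuffled split generally does not put the validation rows at the end of the combined frame. The sketch below uses a made-up toy frame and hand-picked positions (standing in for whatever a shuffled split might select) to show that the subset's index recovers the right rows, while "the last N rows" does not. All names and values in it are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the combined training/validation frame (values are made up).
full_df = pd.DataFrame({"feature_1": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
                        "target": [0.1 * i for i in range(10)]})

# Pretend a shuffled split picked these two rows for validation.
picked = [7, 2]
toy_validate_df = full_df.iloc[picked]

# With a default RangeIndex, index labels equal row positions.
valid_idx = sorted(toy_validate_df.index.tolist())
last_rows = list(range(len(full_df) - len(toy_validate_df), len(full_df)))

print(valid_idx)   # [2, 7]
print(last_rows)   # [8, 9]
```

This only works as long as the combined frame keeps its default integer index; after a `reset_index` or a filter, the labels and positions can drift apart.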

In [None]:
category_names = ['feature_1', 'feature_2', 'feature_3']
continuous_names = []
dep_var = 'target'

In [None]:
train_df[dep_var].head()

In [None]:
data = (TabularList.from_df(train_and_validation_df,
                            path='data/unzipped',
                            cat_names=category_names,
                            cont_names=continuous_names,
                            procs=[FillMissing, Categorify, Normalize])
                .split_by_idx(valid_idx)
                .label_from_df(cols=dep_var, label_cls=FloatList)
                .databunch())

### Create Learner

This is what we actually use to train the model and make predictions.

First we decide how large we want to make the embeddings of our categorical features (half the number of categories, rounded up, is apparently a good heuristic).

Then we tell the model the range within which we expect all predictions to fall. Internally the model squashes its output into that range with a sigmoid function, which only approaches the bounds asymptotically, so to actually get predictions near the expected maximum in practice we set the upper bound a little higher than that maximum.

The competition uses root mean squared error to evaluate the entries, so we'll use that, too.

In [None]:
category_szs = {'feature_1': 5,
                'feature_2': 3,
                'feature_3': 2}
emb_szs = {k: (v + 1) // 2 for k, v in category_szs.items()}
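
Rather than hard-coding the category counts, they can also be read off the data. Here is a minimal sketch on a made-up toy frame (the values are assumptions chosen to give 5, 3 and 2 distinct categories), applying the same round-up-half heuristic:

```python
import pandas as pd

# Toy stand-in for the categorical columns (values are made up).
toy_df = pd.DataFrame({"feature_1": [1, 2, 3, 4, 5, 5],
                       "feature_2": [1, 2, 3, 3, 1, 2],
                       "feature_3": [0, 1, 1, 0, 0, 1]})

# Number of distinct values per categorical column.
category_szs = {col: int(toy_df[col].nunique()) for col in toy_df.columns}

# Half the number of categories, rounded up.
emb_szs = {col: (n + 1) // 2 for col, n in category_szs.items()}

print(category_szs)  # {'feature_1': 5, 'feature_2': 3, 'feature_3': 2}
print(emb_szs)       # {'feature_1': 3, 'feature_2': 2, 'feature_3': 1}
```

Note that `nunique` ignores missing values by default, so a column with NaNs would be counted without them.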

In [None]:
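# Labels are not log-transformed above, so bound the raw target values, padded at both ends.
y_min, y_max = float(train_df['target'].min()), float(train_df['target'].max())
pad = 0.1 * (y_max - y_min)
y_range = torch.tensor([y_min - pad, y_max + pad], device=defaults.device)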
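
To see why some headroom on the range matters, here is a small standalone sketch of a sigmoid scaled into `(low, high)`, which is our understanding of what the model does with `y_range`. The function names and the numbers are made up for illustration. It computes how large a raw output (logit) the network would need in order to predict the observed maximum.

```python
import math

def scaled_sigmoid(logit, low, high):
    """Map an unbounded logit into the open interval (low, high)."""
    return low + (high - low) / (1.0 + math.exp(-logit))

def logit_needed(value, low, high):
    """Logit required to predict `value`; needs low < value < high."""
    p = (value - low) / (high - low)
    return math.log(p / (1.0 - p))

observed_max = 10.0                       # made-up number for illustration
tight = (0.0, observed_max * 1.001)       # almost no headroom
padded = (0.0, observed_max * 1.2)        # 20% headroom

print(logit_needed(observed_max, *tight))   # about 6.9
print(logit_needed(observed_max, *padded))  # about 1.6
```

With a tight bound the network has to push its raw output far into the saturated part of the sigmoid, where gradients are tiny; with padding the same prediction is reachable from a much more moderate logit.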

In [None]:
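def rmse(pred:Tensor, targ:Tensor)->Rank0Tensor:
    "Root mean squared error between `pred` and `targ`."
    # Flatten both so a (batch, 1) prediction cannot broadcast against a (batch,) target.
    return torch.sqrt(F.mse_loss(pred.reshape(-1), targ.reshape(-1)))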
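
As a worked example of the metric, here is a NumPy sketch with made-up numbers. It assumes the predictions arrive as a column of shape `(n, 1)` while the targets are a flat vector of shape `(n,)`, and shows why flattening first is the safe choice. The name `rmse_np` is hypothetical.

```python
import numpy as np

def rmse_np(pred, targ):
    """RMSE after flattening both arrays so their shapes cannot broadcast."""
    pred = np.asarray(pred, dtype=float).reshape(-1)
    targ = np.asarray(targ, dtype=float).reshape(-1)
    return float(np.sqrt(np.mean((pred - targ) ** 2)))

pred = np.array([[1.0], [2.0], [4.0]])   # shape (3, 1), like a model's output column
targ = np.array([1.0, 2.0, 2.0])         # shape (3,)

print(rmse_np(pred, targ))               # sqrt(4/3), about 1.1547
print((pred - targ).shape)               # (3, 3): unflattened shapes broadcast
```

The errors here are 0, 0 and 2, so the mean squared error is 4/3. Without the flattening, the subtraction silently produces a 3×3 matrix of all pairwise differences and the resulting number no longer means what we want.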

In [None]:
learn = tabular_learner(data,
                        layers=[1000, 500],
                        emb_szs=emb_szs,
                        ps=[0.2, 0.5],
                        emb_drop=0.1,
                        y_range=y_range,
                        metrics=rmse)

In [None]:
learn.model

### Figure Out Learning Rate

To figure out which learning rate to use, we use fastai's learning rate finder.

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()
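
Roughly speaking, a learning rate finder trains for a short while with an exponentially increasing learning rate and records the loss at each step; we then look for a rate where the loss is still falling steeply, well before it blows up. The sketch below illustrates that idea in plain Python. The helper names, the range of rates and the loss curve are all made up for illustration and are not fastai's implementation.

```python
def lr_schedule(start_lr, end_lr, num_it):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]

def steepest_descent_lr(lrs, losses):
    """Learning rate at which the loss falls fastest between consecutive steps."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    return lrs[drops.index(min(drops))]

lrs = lr_schedule(1e-7, 10.0, 9)          # one value per decade, 1e-7 ... 1e1
toy_losses = [4.0, 4.0, 3.9, 3.6, 2.8, 2.5, 2.6, 5.0, 40.0]  # made-up curve

print(steepest_descent_lr(lrs, toy_losses))  # about 1e-4
```

In practice the recorded losses are noisy, so the plot is usually read by eye (or smoothed first) rather than taking the single steepest step literally.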

### Train Model

Finally we train the model for one cycle, with weight decay to penalize large weights and so reduce overfitting, and then plot the training and validation losses.

In [None]:
learn.fit_one_cycle(1, 1e-5, wd=0.2)
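
For intuition about what the `wd` argument does: weight decay shrinks every weight toward zero by a small fraction on each step, in addition to the usual gradient update. Here is a minimal sketch with plain SGD and made-up numbers; the optimizer fastai actually uses may apply the decay differently, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(weights, grads, lr, wd):
    """One plain SGD update with weight decay applied directly to the weights."""
    return [w - lr * (g + wd * w) for w, g in zip(weights, grads)]

weights = [2.0, -1.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]      # isolate the effect of the decay term

for _ in range(3):
    weights = sgd_step(weights, zero_grads, lr=0.1, wd=0.2)

print(weights)   # each weight multiplied by (1 - 0.1 * 0.2) ** 3, about 0.941
```

Note that the weights get smaller but never exactly zero, so this regularizes the model without switching individual features off.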

In [None]:
learn.recorder.plot_losses()