# Model to Classify Elo Customer Loyalty Outliers

_Note! If you want to commit any changes to this document, please strip all output (Cell > Current Outputs > Clear, or set up [nbstripout](https://github.com/kynan/nbstripout) as a git filter) from this notebook before doing so. Thanks!_

For more detailed descriptions of some of these steps, see the `elo_loyalty_prediction` notebook.

PS. For now, this model is not very successful at predicting outliers. We may want to come back to it once we have done more feature engineering.

## Load Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from feature_engineering import *

In [None]:
hist_trans_df = pd.read_csv('data/unzipped/historical_transactions.csv',
                            parse_dates=['purchase_date'])
merchants_df = pd.read_csv('data/unzipped/merchants.csv',
                           index_col='merchant_id')
merch_trans_df = pd.read_csv('data/unzipped/new_merchant_transactions.csv',
                             parse_dates=['purchase_date'])
train_and_validation_df = pd.read_csv('data/unzipped/train.csv',
                                      index_col='card_id',
                                      parse_dates=['first_active_month'])
test_df = pd.read_csv('data/unzipped/test.csv',
                      index_col='card_id',
                      parse_dates=['first_active_month'])

## Create Features

In [None]:
add_aggregated_numerical_fields(train_and_validation_df,
                                hist_trans_df,
                                column_names=['purchase_amount', 'installments', 'month_lag'],
                                aggregator=np.mean)

In [None]:
add_aggregated_numerical_fields(train_and_validation_df,
                                hist_trans_df,
                                column_names=['purchase_amount', 'installments'],
                                aggregator=np.sum)

In [None]:
add_aggregated_categorical_fields(train_and_validation_df,
                                  hist_trans_df,
                                  column_names=['authorized_flag', 'category_1', 'category_2', 'category_3'])
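The helpers imported from `feature_engineering` are not shown in this notebook. As a rough sketch of what a per-card aggregation of this kind could look like, here is a hypothetical stand-in. The function name `aggregate_per_card`, the `<aggregator>_<column>` naming scheme, and the toy data are assumptions for illustration, not the actual implementation. Unlike the calls above, which appear to modify the DataFrame in place, this sketch returns a new frame.

```python
import pandas as pd

def aggregate_per_card(cards_df, transactions_df, column_names, aggregator="mean"):
    """Hypothetical helper: join per-card aggregates onto cards_df (indexed by card_id)."""
    # One row per card_id, one aggregated value per requested column.
    aggregated = transactions_df.groupby("card_id")[column_names].agg(aggregator)
    # Prefix the new columns so they do not clash with existing ones (assumed naming).
    aggregated.columns = [f"{aggregator}_{name}" for name in column_names]
    return cards_df.join(aggregated)

# Toy data standing in for train.csv and historical_transactions.csv.
cards = pd.DataFrame({"target": [0.5, -1.0]},
                     index=pd.Index(["C1", "C2"], name="card_id"))
transactions = pd.DataFrame({
    "card_id": ["C1", "C1", "C2"],
    "purchase_amount": [1.0, 3.0, 5.0],
})

features = aggregate_per_card(cards, transactions, ["purchase_amount"])
print(features)
```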

In [None]:
from sklearn.model_selection import train_test_split
train_df, validate_df = train_test_split(train_and_validation_df, test_size=0.2, random_state=238923)

In [None]:
train_df.head()

## The Issue

If we look at the loyalty scores that we want to predict for this competition, we'll see a cluster of outliers at a loyalty score of around -30. (NB: this field is likely normalised to have mean 0 and standard deviation 1, so these outliers are probably 0, or possibly NaN, in the original data set.)

If we could disregard these, our model for predicting the loyalty score would have a much easier time. But of course we don't know in advance which of the incoming samples are outliers in this sense and which aren't. So let's try to build a classifier that predicts whether or not a sample (a `card_id`) is an outlier!

In [None]:
sns.distplot(train_df.target)

Let's call any sample with a loyalty score below -25 an outlier.

In [None]:
train_df['is_outlier'] = train_df.target < -25
validate_df['is_outlier'] = validate_df.target < -25

In [None]:
sns.countplot(x='is_outlier', data=train_df)

## Set Up Model

Again, for more detailed comments on some of these steps, have a look at the `elo_loyalty_prediction` notebook.

In [None]:
for v in ['feature_1', 'feature_2', 'feature_3', 'is_outlier']:
    train_df[v] = train_df[v].astype('category').cat.as_ordered()
    validate_df[v] = validate_df[v].astype('category').cat.as_ordered()

In [None]:
from fastai import *
from fastai.tabular import *
from fastai.metrics import *

In [None]:
valid_idx = range(len(train_and_validation_df) - len(validate_df), len(train_and_validation_df)); valid_idx

In [None]:
category_names = ['feature_1', 'feature_2', 'feature_3']
dep_var = 'is_outlier'
continuous_names = [col for col in train_df.columns if col not in (
    ['first_active_month', 'target'] + category_names + [dep_var])]

In [None]:
df = pd.concat([train_df, validate_df]).reset_index()[category_names + continuous_names + [dep_var]]

In [None]:
data = TabularDataBunch.from_df('data/unzipped',
                                df,
                                dep_var,
                                valid_idx=valid_idx,
                                procs=[FillMissing, Categorify, Normalize],
                                cat_names=category_names,
                                cont_names=continuous_names)

In [None]:
data.show_batch()

In [None]:
learn = tabular_learner(data,
                        layers=[200,100],
                        ps=[1e-3, 1e-2],
                        emb_drop=0.05,
                        metrics=[accuracy, Precision(), Recall()])

In [None]:
learn.model

## Train Model

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(3, 1e-3, wd=0.2)

In [None]:
learn.recorder.plot_losses()

## Make Predictions

Now that we have trained our model, let's make some predictions to check whether the metrics reported during training are telling us the truth.

Even though our model never assigns over 50% probability that any row is an outlier, we can consider, for instance, all samples for which the probability is over 5% to be an outlier.

In [None]:
predictions, targets = [x.numpy() for x in learn.get_preds(DatasetType.Valid)]
# Each element of `predictions` is an array of two values: the probability of
# `False` (not an outlier) and the probability of `True` (an outlier).
outlier_predictions = predictions[:, 1] > 0.05
outlier_targets = targets == 1
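To see how the choice of cut-off changes what gets flagged, here is a minimal sketch on made-up probabilities. The arrays and the `flag_outliers` helper are assumptions for illustration, not output from the model above. It counts how many rows are flagged at each threshold and how many of those are real outliers.

```python
import numpy as np

# Toy stand-ins for the model output: column 1 is the predicted outlier probability.
predictions = np.array([[0.99, 0.01],
                        [0.90, 0.10],
                        [0.97, 0.03],
                        [0.80, 0.20],
                        [0.94, 0.06]])
targets = np.array([0, 1, 0, 1, 0])

def flag_outliers(predictions, threshold):
    """Flag rows whose predicted outlier probability exceeds the threshold."""
    return predictions[:, 1] > threshold

summary = {}
for threshold in (0.5, 0.15, 0.05):
    flagged = flag_outliers(predictions, threshold)
    caught = int(np.sum(flagged & (targets == 1)))
    summary[threshold] = (int(flagged.sum()), caught)
    print(f"threshold={threshold}: flagged={flagged.sum()}, true outliers caught={caught}")
```

Lowering the threshold flags more rows, which catches more real outliers at the cost of more false alarms.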

In [None]:
prediction_df = pd.DataFrame({'prediction': outlier_predictions, 'target': outlier_targets})

In [None]:
prediction_df.head()

In [None]:
prediction_df.prediction.value_counts()

In [None]:
prediction_df.target.value_counts()

In [None]:
prediction_df[prediction_df.prediction].head()

Calculate **precision** _(fraction of relevant instances among the retrieved instances)_ and **recall** _(fraction of relevant instances that have been retrieved over the total amount of relevant instances)_.
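In terms of confusion-matrix counts, where $TP$ is the number of true positives (outliers correctly flagged), $FP$ the number of false positives (non-outliers flagged as outliers), and $FN$ the number of false negatives (outliers that were missed), these definitions can be written as:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$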

In [None]:
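prediction_counts = prediction_df[prediction_df.prediction].target.value_counts()
# Look the counts up by their boolean labels rather than by 0/1, and fall back
# to 0 when a class is absent among the flagged rows.
false_positives = prediction_counts.get(False, 0)
true_positives = prediction_counts.get(True, 0)
precision = true_positives / (false_positives + true_positives); precision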

In [None]:
# `target` is boolean, so its sum is the number of actual outliers.
total_positives = prediction_df.target.sum()
true_positives = prediction_counts.get(True, 0)
recall = true_positives / total_positives; recall