# Model to Classify Elo Customer Loyalty Outliers

_Note! If you want to commit any changes to this document, please strip all output (Cell > Current Outputs > Clear, or set up [nbstripout](https://github.com/kynan/nbstripout) as a git filter) from this notebook before doing so. Thanks!_

For more detailed descriptions of some of these steps, see the `elo_loyalty_prediction` notebook.

PS. For now this model is not very successful at predicting outliers. Perhaps we can come back to it once we have done more feature engineering.

## Load Libraries and Data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
hist_trans_df = pd.read_csv('data/processed/historical_transactions.csv')
merch_trans_df = pd.read_csv('data/processed/new_merchant_transactions_with_merchants.csv')
train_and_validation_df = pd.read_csv('data/unzipped/train.csv',
                                      index_col='card_id',
                                      parse_dates=['first_active_month'])
test_df = pd.read_csv('data/unzipped/test.csv',
                      index_col='card_id',
                      parse_dates=['first_active_month'])

## Create Features

In [None]:
from fastai import *
from fastai.tabular import *
from fastai.metrics import *
from feature_engineering import *

In [None]:
add_datepart(train_and_validation_df, 'first_active_month')
train_and_validation_df.drop(['first_active_monthDay', 'first_active_monthDayofweek',
                              'first_active_monthDayofyear', 'first_active_monthIs_month_end',
                              'first_active_monthIs_month_start', 'first_active_monthIs_quarter_end',
                              'first_active_monthIs_year_end'], axis=1, inplace=True)

In [None]:
add_datepart(test_df, 'first_active_month')
test_df.drop(['first_active_monthDay', 'first_active_monthDayofweek', 'first_active_monthDayofyear',
              'first_active_monthIs_month_end', 'first_active_monthIs_month_start',
              'first_active_monthIs_quarter_end', 'first_active_monthIs_year_end'], axis=1, inplace=True)

In [None]:
aggs = {
    'purchase_amount': ['sum', 'mean', 'min', 'max', 'std'],
    'installments': ['sum', 'mean', 'min', 'max', 'std'],
    'month_lag': ['mean', 'min', 'max'],
    'merchant_id': ['nunique'],
    'state_id': ['nunique'],
    'city_id': ['nunique'],
}
# Here are the aggregators we only want to use for the `historical_transactions` data.
hist_trans_aggs = {
    'merchant_category_id': ['nunique'],
    'subsector_id': ['nunique'],
    'elapsed_since_last_purchase': ['sum', 'mean', 'min', 'max', 'std'],
    'elapsed_since_last_merch_purchase': ['sum', 'mean', 'min', 'max', 'std'],
}
# Here are the aggregators we only want to use for the `new_merchant_transactions` data.
merch_trans_aggs = {
    'category_1_transaction': ['nunique'],
    'category_2': ['nunique'],
    'category_3': ['nunique'],
    'category_4': ['nunique'],
    'merchant_category_id_transaction': ['nunique'],
    'merchant_category_id_merchant': ['nunique'],
    'merchant_group_id': ['nunique'],
    'subsector_id_merchant': ['nunique'],
    'category_1_merchant': ['nunique'],
    'state_id': ['nunique'],
    'elapsed_since_last_purchase': ['sum', 'mean', 'min', 'max', 'std'],
    'numerical_1': ['sum', 'mean', 'min', 'max', 'std'],
    'numerical_2': ['sum', 'mean', 'min', 'max', 'std'],
    'avg_sales_lag3': ['sum', 'mean', 'min', 'max', 'std'],
    'avg_purchases_lag3': ['sum', 'mean', 'min', 'max', 'std'],
    'active_months_lag3': ['sum', 'mean', 'min', 'max', 'std'],
    'avg_sales_lag6': ['sum', 'mean', 'min', 'max', 'std'],
    'avg_purchases_lag6': ['sum', 'mean', 'min', 'max', 'std'],
    'active_months_lag6': ['sum', 'mean', 'min', 'max', 'std'],
    'avg_sales_lag12': ['sum', 'mean', 'min', 'max', 'std'],
    'avg_purchases_lag12': ['sum', 'mean', 'min', 'max', 'std'],
    'active_months_lag12': ['sum', 'mean', 'min', 'max', 'std'],
}

In [None]:
add_aggregated_numerical_fields(train_and_validation_df, hist_trans_df, aggregators={**aggs, **hist_trans_aggs})
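
The helper `add_aggregated_numerical_fields` comes from the `feature_engineering` module, which is not shown in this notebook. As a rough illustration of what such a step might do, and not the actual implementation, the sketch below groups a toy transaction table by `card_id`, applies a dictionary of aggregators, flattens the column names to the `<column>_<aggregate>` pattern seen in `continuous_names`, and joins the result onto a frame indexed by `card_id`. The function name `aggregate_by_card` and the toy data are hypothetical, and unlike the notebook's helper (whose result is never assigned, so it presumably works in place) this sketch returns a new frame.

```python
import pandas as pd

def aggregate_by_card(cards_df, trans_df, aggregators, prefix=''):
    """Hypothetical stand-in: per-card aggregates joined onto cards_df."""
    agg_df = trans_df.groupby('card_id').agg(aggregators)
    # groupby.agg with a dict of lists yields MultiIndex columns such as
    # ('purchase_amount', 'sum'); flatten them to 'purchase_amount_sum'.
    agg_df.columns = [f'{prefix}{col}_{func}' for col, func in agg_df.columns]
    # Left join keeps every card, leaving NaN where a card has no transactions.
    return cards_df.join(agg_df, how='left')

# Made-up data for illustration only.
toy_cards = pd.DataFrame({'target': [0.5, -33.2]},
                         index=pd.Index(['C_1', 'C_2'], name='card_id'))
toy_trans = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_2'],
    'purchase_amount': [1.0, 3.0, 2.0],
    'merchant_id': ['M_1', 'M_2', 'M_1'],
})
toy_aggs = {'purchase_amount': ['sum', 'mean'], 'merchant_id': ['nunique']}

toy_features = aggregate_by_card(toy_cards, toy_trans, toy_aggs)
print(toy_features)
```

Passing `prefix='merch_'` in this sketch would give names like `merch_purchase_amount_sum`, matching the prefixed columns used later for the new-merchant data.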

In [None]:
add_aggregated_categorical_fields(train_and_validation_df,
                                  hist_trans_df,
                                  column_names=['authorized_flag', 'category_1', 'category_2', 'category_3'])

In [None]:
add_top_categories(train_and_validation_df,
                   hist_trans_df,
                   column_names=['authorized_flag', 'category_1', 'subsector_id', 'city_id', 'state_id',
                                 'purchase_Year', 'purchase_Month', 'purchase_Week', 'purchase_Day',
                                 'purchase_Dayofweek'])

In [None]:
add_aggregated_numerical_fields(train_and_validation_df, merch_trans_df, aggregators={**aggs, **merch_trans_aggs},
                                prefix='merch_')

In [None]:
from sklearn.model_selection import train_test_split
train_df, validate_df = train_test_split(train_and_validation_df, test_size=0.2, random_state=238923)

In [None]:
train_df.head()

## The Issue

If we have a look at the loyalty scores that we want to predict for this competition, we'll see that there is a cluster of outliers at a loyalty score of around -30. (NB. This field is likely normalised to have mean 0 and standard deviation 1, meaning these outliers are probably 0 \[or NaN?] in the original data set.)

If we could disregard these, our model for predicting the loyalty score would have a much easier time. But of course we don't know in advance which of the incoming samples are outliers in this sense and which aren't. So let's try to make a classifier model to predict whether or not a sample (a `card_id`) is an outlier!

In [None]:
sns.distplot(train_df.target)

Let's call any sample with a loyalty score below -25 an outlier.

In [None]:
train_df['is_outlier'] = train_df.target < -25
validate_df['is_outlier'] = validate_df.target < -25
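
To get a feel for how imbalanced the resulting label is, the share of outliers can be read straight off the boolean column. The sketch below uses a small made-up `target` column (not the competition data) with the same -25 threshold; `toy_df` and the other names are assumptions for illustration.

```python
import pandas as pd

# Made-up loyalty scores for illustration only; two of them fall below -25.
toy_df = pd.DataFrame({'target': [0.3, -1.2, -33.2, 2.1, 0.0, -33.2, 1.4, -0.7]})
toy_df['is_outlier'] = toy_df.target < -25

outlier_count = int(toy_df.is_outlier.sum())
# The mean of a boolean column is the fraction of True values.
outlier_fraction = toy_df.is_outlier.mean()
print(outlier_count, outlier_fraction)
```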

In [None]:
sns.countplot(x='is_outlier', data=train_df)

## Set Up Model

Again, for more detailed comments on some of these steps, have a look at the `elo_loyalty_prediction` notebook.

We will upsample the outliers in the training set, that is, append extra copies of those rows so that they don't drown among the far more numerous non-outlier samples.

In [None]:
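# Append 10 extra copies of every outlier row; pd.concat also works where DataFrame.append has been removed.
upsampled_train_df = pd.concat([train_df] + [train_df[train_df.is_outlier]] * 10, ignore_index=True)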
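
As a small illustration of what this upsampling does to the class balance, the sketch below applies the same idea to a made-up frame using `pd.concat`. Appending 10 extra copies means each outlier row appears 11 times in total. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up training frame: 9 ordinary rows and 1 outlier.
toy_train_df = pd.DataFrame({
    'feature': range(10),
    'is_outlier': [False] * 9 + [True],
})

extra_copies = 10
toy_upsampled_df = pd.concat(
    [toy_train_df] + [toy_train_df[toy_train_df.is_outlier]] * extra_copies,
    ignore_index=True)

before = toy_train_df.is_outlier.mean()
after = toy_upsampled_df.is_outlier.mean()
print(f'outlier share before: {before:.2f}, after: {after:.2f}')
```

Only the training rows are upsampled here; the validation rows are added afterwards and keep their natural class balance, so the metrics still reflect the real distribution.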

In [None]:
valid_idx = range(len(upsampled_train_df), len(upsampled_train_df) + len(validate_df)); valid_idx

In [None]:
category_names = ['feature_1',
                  'feature_2',
                  'feature_3',
                  'authorized_flag_top',
                  'category_1_top',
                  'subsector_id_top',
                  'city_id_top',
                  'state_id_top',
                  'purchase_Year_top',
                  'purchase_Month_top',
                  'purchase_Week_top',
                  'purchase_Day_top',
                  'purchase_Dayofweek_top']
continuous_names = ['first_active_monthYear',
                    'first_active_monthMonth',
                    'first_active_monthWeek',
                    'first_active_monthIs_quarter_start',
                    'first_active_monthIs_year_start',
                    'first_active_monthElapsed',
                    'purchase_amount_sum',
                    'purchase_amount_mean',
                    'purchase_amount_min',
                    'purchase_amount_max',
                    'purchase_amount_std',
                    'installments_sum',
                    'installments_mean',
                    'installments_min',
                    'installments_max',
                    'installments_std',
                    'month_lag_mean',
                    'month_lag_min',
                    'month_lag_max',
                    'merchant_id_nunique',
                    'state_id_nunique',
                    'city_id_nunique',
                    'merchant_category_id_nunique',
                    'subsector_id_nunique',
                    'elapsed_since_last_purchase_sum',
                    'elapsed_since_last_purchase_mean',
                    'elapsed_since_last_purchase_min',
                    'elapsed_since_last_purchase_max',
                    'elapsed_since_last_purchase_std',
                    'elapsed_since_last_merch_purchase_sum',
                    'elapsed_since_last_merch_purchase_mean',
                    'elapsed_since_last_merch_purchase_min',
                    'elapsed_since_last_merch_purchase_max',
                    'elapsed_since_last_merch_purchase_std',
                    'authorized_flag_Y_ratio',
                    'category_1_N_ratio',
                    'category_2_1.0_ratio',
                    'category_2_3.0_ratio',
                    'category_2_4.0_ratio',
                    'category_2_2.0_ratio',
                    'category_2_5.0_ratio',
                    'category_3_A_ratio',
                    'category_3_B_ratio',
                    'category_3_C_ratio',
                    'merch_purchase_amount_sum',
                    'merch_purchase_amount_mean',
                    'merch_purchase_amount_min',
                    'merch_purchase_amount_max',
                    'merch_purchase_amount_std',
                    'merch_installments_sum',
                    'merch_installments_mean',
                    'merch_installments_min',
                    'merch_installments_max',
                    'merch_installments_std',
                    'merch_month_lag_mean',
                    'merch_month_lag_min',
                    'merch_month_lag_max',
                    'merch_merchant_id_nunique',
                    'merch_state_id_nunique',
                    'merch_city_id_nunique',
                    'merch_category_1_transaction_nunique',
                    'merch_category_2_nunique',
                    'merch_category_3_nunique',
                    'merch_category_4_nunique',
                    'merch_merchant_category_id_transaction_nunique',
                    'merch_merchant_category_id_merchant_nunique',
                    'merch_merchant_group_id_nunique',
                    'merch_subsector_id_merchant_nunique',
                    'merch_category_1_merchant_nunique',
                    'merch_elapsed_since_last_purchase_sum',
                    'merch_elapsed_since_last_purchase_mean',
                    'merch_elapsed_since_last_purchase_min',
                    'merch_elapsed_since_last_purchase_max',
                    'merch_elapsed_since_last_purchase_std',
                    'merch_numerical_1_sum',
                    'merch_numerical_1_mean',
                    'merch_numerical_1_min',
                    'merch_numerical_1_max',
                    'merch_numerical_1_std',
                    'merch_numerical_2_sum',
                    'merch_numerical_2_mean',
                    'merch_numerical_2_min',
                    'merch_numerical_2_max',
                    'merch_numerical_2_std',
                    'merch_avg_sales_lag3_sum',
                    'merch_avg_sales_lag3_mean',
                    'merch_avg_sales_lag3_min',
                    'merch_avg_sales_lag3_max',
                    'merch_avg_sales_lag3_std',
                    'merch_avg_purchases_lag3_sum',
                    'merch_avg_purchases_lag3_mean',
                    'merch_avg_purchases_lag3_min',
                    'merch_avg_purchases_lag3_max',
                    'merch_avg_purchases_lag3_std',
                    'merch_active_months_lag3_sum',
                    'merch_active_months_lag3_mean',
                    'merch_active_months_lag3_min',
                    'merch_active_months_lag3_max',
                    'merch_active_months_lag3_std',
                    'merch_avg_sales_lag6_sum',
                    'merch_avg_sales_lag6_mean',
                    'merch_avg_sales_lag6_min',
                    'merch_avg_sales_lag6_max',
                    'merch_avg_sales_lag6_std',
                    'merch_avg_purchases_lag6_sum',
                    'merch_avg_purchases_lag6_mean',
                    'merch_avg_purchases_lag6_min',
                    'merch_avg_purchases_lag6_max',
                    'merch_avg_purchases_lag6_std',
                    'merch_active_months_lag6_sum',
                    'merch_active_months_lag6_mean',
                    'merch_active_months_lag6_min',
                    'merch_active_months_lag6_max',
                    'merch_active_months_lag6_std',
                    'merch_avg_sales_lag12_sum',
                    'merch_avg_sales_lag12_mean',
                    'merch_avg_sales_lag12_min',
                    'merch_avg_sales_lag12_max',
                    'merch_avg_sales_lag12_std',
                    'merch_avg_purchases_lag12_sum',
                    'merch_avg_purchases_lag12_mean',
                    'merch_avg_purchases_lag12_min',
                    'merch_avg_purchases_lag12_max',
                    'merch_avg_purchases_lag12_std',
                    'merch_active_months_lag12_sum',
                    'merch_active_months_lag12_mean',
                    'merch_active_months_lag12_min',
                    'merch_active_months_lag12_max',
                    'merch_active_months_lag12_std',]
dep_var = 'is_outlier'

In [None]:
df = pd.concat([upsampled_train_df, validate_df]).reset_index()[category_names + continuous_names + [dep_var]]

In [None]:
data = TabularDataBunch.from_df('data/unzipped',
                                df,
                                dep_var,
                                valid_idx=valid_idx,
                                procs=[FillMissing, Categorify, Normalize],
                                cat_names=category_names,
                                cont_names=continuous_names)

In [None]:
data.show_batch()

In [None]:
learn = tabular_learner(data,
                        layers=[200, 100],
                        ps=[1e-2, 1e-1],
                        emb_drop=0.04,
                        metrics=[accuracy, Precision(), Recall()])

In [None]:
learn.model

## Train Model

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(3, 1e-2, wd=0.5)

In [None]:
learn.recorder.plot_losses()

## Make Predictions

Now that we have trained our model, let's make some predictions to see whether or not our metrics lie to us.

Let's only take those predictions where the model was more than 90% confident that the sample is an outlier, to increase our precision at the expense of recall.

In [None]:
predictions, targets = [x.numpy() for x in learn.get_preds(DatasetType.Valid)]
# Each element in `predictions` is an array of two values, the likelihood of
# `False` (not an outlier) and the likelihood of `True` (an outlier).
outlier_predictions = [x[1] > 0.9 for x in predictions]
outlier_targets = targets == 1
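
The trade-off described above can be seen on a handful of made-up probabilities: raising the threshold keeps only the most confident predictions, which tends to raise precision and lower recall. The arrays below are invented for illustration and are not model output, and `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

# Invented class probabilities: column 0 = not an outlier, column 1 = outlier.
toy_predictions = np.array([[0.95, 0.05],
                            [0.40, 0.60],
                            [0.15, 0.85],
                            [0.05, 0.95],
                            [0.30, 0.70],
                            [0.02, 0.98]])
toy_targets = np.array([0, 0, 1, 1, 0, 1])

def precision_recall_at(threshold):
    """Precision and recall when predicting 'outlier' above the threshold."""
    predicted = toy_predictions[:, 1] > threshold
    actual = toy_targets == 1
    true_positives = np.sum(predicted & actual)
    # Precision is undefined when nothing is predicted positive.
    precision = true_positives / predicted.sum() if predicted.sum() else float('nan')
    recall = true_positives / actual.sum()
    return precision, recall

for threshold in (0.5, 0.9):
    precision, recall = precision_recall_at(threshold)
    print(f'threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}')
```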

In [None]:
prediction_df = pd.DataFrame({'prediction': outlier_predictions, 'target': outlier_targets})

In [None]:
prediction_df.head()

In [None]:
prediction_df.prediction.value_counts()

In [None]:
prediction_df.target.value_counts()

In [None]:
prediction_df[prediction_df.prediction].head()

Calculate **precision** _(fraction of relevant instances among the retrieved instances)_ and **recall** _(fraction of relevant instances that have been retrieved over the total amount of relevant instances)_.

In [None]:
prediction_counts = prediction_df[prediction_df.prediction].target.value_counts()
# Look up counts by boolean label; `.get` falls back to 0 when a label is absent.
false_positives = prediction_counts.get(False, 0)
true_positives = prediction_counts.get(True, 0)
precision = true_positives / (false_positives + true_positives); precision

In [None]:
total_positives = prediction_df.target.sum()
true_positives = prediction_counts.get(True, 0)
recall = true_positives / total_positives; recall
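
Another way to arrive at the same two numbers, sketched here on a made-up `prediction`/`target` frame, is to count true positives, false positives and false negatives with boolean masks, which stays well defined even when one of the classes is absent. The toy frame and the helper name `precision_and_recall` are assumptions for illustration.

```python
import pandas as pd

def precision_and_recall(df):
    """Precision and recall from boolean `prediction` and `target` columns."""
    true_positives = int((df.prediction & df.target).sum())
    false_positives = int((df.prediction & ~df.target).sum())
    false_negatives = int((~df.prediction & df.target).sum())
    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives
    # Guard against division by zero when nothing is predicted or present.
    precision = true_positives / predicted_positives if predicted_positives else float('nan')
    recall = true_positives / actual_positives if actual_positives else float('nan')
    return precision, recall

# Made-up predictions and targets for illustration only.
toy_prediction_df = pd.DataFrame({
    'prediction': [True, True, False, False, True, False],
    'target':     [True, False, True, False, True, False],
})
toy_precision, toy_recall = precision_and_recall(toy_prediction_df)
print(toy_precision, toy_recall)
```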