# FCA

Technical Challenge for Data Science Candidates

This workbook loads a pickle file from the prior notebook.

Naive Model

In [330]:
import numpy as np
import pandas as pd
import math
import json

from os import path

from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import ensemble

from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFE

import scipy.stats as st
import statsmodels as sm
import statsmodels.api as smi
from sklearn import svm

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

from cycler import cycler
import matplotlib.pyplot as plt
import seaborn as sns

pd.__version__

'0.24.2'

In [331]:
# If you turn this feature on, you can display each result as it happens.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [332]:
# This is the local Utility module; it is reloaded each time.
from fca import Utility

In [334]:
%load_ext autoreload
%autoreload 1
%aimport fca

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [335]:
# My utility singleton.
i0 = Utility.instance()

## Modelling

There are some highly correlated features that might be confusing.
Build a naive model that uses all the features and refine it down.

Filter the dataset and then apply a selection of models.
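
One way to locate the highly correlated features before deciding what to drop is to scan the upper triangle of the absolute correlation matrix. The sketch below runs on synthetic data rather than the real pickle; the helper name correlated_pairs and the 0.9 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, |r|) for each column pair with |r| above threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (the masked triangle) never pass this comparison.
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Synthetic stand-in for the real dataset: 'b' is nearly a copy of 'a'.
rng = np.random.RandomState(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})
high = correlated_pairs(demo, threshold=0.9)
print(high)
```

Applied to the real frame, the second column of each reported pair would be a candidate for removal.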

In [336]:
# df2 = pd.read_pickle("coded.pickle")
df2 = pd.read_pickle("scaled.pickle")

In [337]:
# Reinstate the boolean (binary) outcome
df2.y = df2.y > 0
df2.y;

In [338]:
## Low incidence rate requires some up-sampling.
np.sum(df2.y.values) / len(df2.y)

0.11265417111780131
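
With an incidence of roughly 11%, one option is to up-sample the positive class with replacement. The sketch below is a minimal version on synthetic data, assuming the positive class is the minority; the helper name upsample_minority is hypothetical. If a held-out set is used, the resampling should be applied to the training rows only, otherwise duplicated rows leak into the evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample


def upsample_minority(X, y, random_state=0):
    """Resample the positive rows with replacement until the classes balance.

    Assumes the positive class is the minority.
    """
    X = pd.DataFrame(X)
    y = np.asarray(y, dtype=bool)
    X_pos, X_neg = X[y], X[~y]
    X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg),
                        random_state=random_state)
    X_bal = pd.concat([X_neg, X_pos_up]).reset_index(drop=True)
    y_bal = pd.Series(np.r_[np.zeros(len(X_neg), dtype=bool),
                            np.ones(len(X_pos_up), dtype=bool)])
    return X_bal, y_bal


# Synthetic data with roughly the same incidence as the real outcome.
rng = np.random.RandomState(0)
y_demo = rng.rand(1000) < 0.11
X_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])

X_bal, y_bal = upsample_minority(X_demo, y_demo)
print(y_demo.mean(), y_bal.mean())
```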

### Reload the dataset for the model

Re-run this section after you have chosen and evaluated a model, so that the dataset is reset.
No further scaling is needed for the models chosen.
If you run all cells, the default is to evaluate the never-active case with logistic regression.

### Splitting

The sample0 flag selects how the dataset is split:
 1. use a train/test split
 2. use everything and let cross-validation make the splits

In [339]:
sample0 = 2  # 1: train/test split, 2: use everything

### Case 0: Prescient data

Null hypothesis.

I include the outcome variable, which forces over-fitting. All the models should converge to a score of one; if they do not, the data is structurally unsound (mis-scaled) or complete noise. I then remove each of the prescient variables.

In [340]:
y = df2.y
# X = df2.drop(columns=['y'])
X = df2
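
The prescient-data check can be illustrated on synthetic data: with the outcome left in the feature matrix, cross-validation should score near one even when every other feature is noise. This is only a sketch of the idea; the column names and the 400-row sample are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Three pure-noise features and a roughly balanced binary outcome.
noise = pd.DataFrame(rng.normal(size=(400, 3)), columns=["n0", "n1", "n2"])
y_demo = rng.rand(400) < 0.5

# Case 0: the outcome is leaked into the features.
leaky = noise.assign(y=y_demo.astype(float))

clf_demo = LogisticRegression(C=1e4, solver="lbfgs", max_iter=10000)
with_leak = cross_val_score(clf_demo, leaky, y_demo, cv=5)
without_leak = cross_val_score(clf_demo, noise, y_demo, cv=5)
print(with_leak.mean(), without_leak.mean())
```

A score near one with the leak and near chance without it suggests the pipeline itself is sound and that any later perfect score points to a remaining prescient column.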

### Case 1: All features

Remove the outcome and get a baseline.

In [320]:
y = df2.y
X = df2.drop(columns=['y'])
# X = df2
X.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed'],
      dtype='object')

### Case 2: case 1 and other refinements


In [321]:
# Remove highly-correlated features
# 'default' is a historical field: if a loan was given, did they default?
pcols = ['default', 'poutcome', 'pdays', 'previous' ]
pcols = pcols + ['euribor3m', 'emp.var.rate',  'cons.conf.idx']
pcols

['default',
 'poutcome',
 'pdays',
 'previous',
 'euribor3m',
 'emp.var.rate',
 'cons.conf.idx']

In [322]:
X = X.drop(columns=pcols)
X.columns

Index(['age', 'job', 'marital', 'education', 'housing', 'loan', 'contact',
       'month', 'day_of_week', 'duration', 'campaign', 'cons.price.idx',
       'nr.employed'],
      dtype='object')

### Split the Data

In [341]:
Xcols = list(X.columns)
if sample0 == 1:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
elif sample0 == 2:
    X_train, X_test, y_train, y_test = ( X, None, y, None )
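
Because the positive rate is low, a plain random split can shift the incidence between the train and test sets. A possible refinement, shown here on synthetic data, is to pass stratify to train_test_split so both sides keep roughly the same rate; the random_state value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.normal(size=(2000, 4)), columns=list("abcd"))
y_demo = pd.Series(rng.rand(2000) < 0.11)

# stratify=y_demo keeps the class proportions the same on both sides.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```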

In [342]:
X_train.shape
y_train.shape
Xcols

(41188, 29)

(41188,)

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed',
 'y',
 'job0',
 'marital0',
 'education0',
 'default0',
 'housing0',
 'loan0',
 'poutcome0',
 'pdays0']

## Model cases

Again, choose one model to apply to the dataset you selected above.

These are a couple of models to evaluate.

The first, SVC, is not practical for "big" data (in width or row count), but works well for smaller datasets.

The Multi-Layer Perceptron Classifier is a useful neural network and can get reasonable results quickly; it parallelizes well. The hidden_layer_sizes parameter needs tuning to get a cross-validation score over 0.5.

The logistic regression is a very solid performer.
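
Rather than re-running one cell per model, the candidates can be scored in a single loop. The sketch below uses a small synthetic dataset from make_classification and lighter settings than the cells that follow, so it finishes quickly; the dictionary keys are just labels.

```python
from sklearn import ensemble, svm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic stand-in with a low positive rate.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.89, 0.11], random_state=0)

candidates = {
    "svc": svm.SVC(kernel="linear", C=1),
    "logistic": LogisticRegression(solver="lbfgs", max_iter=10000),
    "random_forest": ensemble.RandomForestClassifier(
        n_estimators=50, max_depth=2, random_state=0),
}

# Mean 5-fold accuracy per model.
results = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
           for name, model in candidates.items()}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```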

In [325]:
clf = svm.SVC(kernel='linear', C=1)

In [292]:
# Useful method: shrink the neural network to find a path through the prescient features.
# floor(log2(number of features)) should be the minimal layer width.
x0 = math.floor(math.log(X.shape[1], 2))
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(x0, x0, 2),
                    random_state=1, warm_start=True)

In [344]:
original_params = {'n_estimators': 1000, 'max_leaf_nodes': 4, 'max_depth': None, 'random_state': 2,
                   'min_samples_split': 5}
params = original_params

clf = ensemble.GradientBoostingClassifier(**params)

In [343]:
clf = ensemble.RandomForestClassifier(max_depth=2, random_state=0)

In [239]:
# max_iter is two orders of magnitude above the default of 100
clf = LogisticRegression(C=1e4, solver='lbfgs', multi_class='auto', max_iter=10000)

## From cross-validation results, choose the most accurate model

To evaluate the model under a cross-validation scheme, run this cell.

In [345]:
scores = cross_val_score(clf, X_train, y_train, cv=5, error_score=np.nan)
scores

array([1., 1., 1., 1., 1.])
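
With an incidence of about 11%, plain accuracy can flatter a model: predicting "no" for every row already scores close to 89%. As a hedged illustration on synthetic data, the sketch below scores a majority-class DummyClassifier with both the default accuracy and scoring='balanced_accuracy', which averages the recall of each class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.rand(1000) < 0.11).astype(int)

# Always predicts the majority class, ignoring the features.
baseline = DummyClassifier(strategy="most_frequent")

acc = cross_val_score(baseline, X_demo, y_demo, cv=5)
bal = cross_val_score(baseline, X_demo, y_demo, cv=5,
                      scoring="balanced_accuracy")
print(acc.mean(), bal.mean())
```

Any real model's accuracy is worth reading against this baseline rather than against 50%.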

## Feature selection

Try to determine best features

In [328]:
# create the RFE model and select 5 attributes
rfe = RFE(clf, n_features_to_select=5)
rfe = rfe.fit(X_train, y_train)
# summarize the selection of the attributes
rfe.support_
rfe.ranking_



array([False, False, False, False, False, False,  True,  True, False,
        True, False,  True,  True])

array([2, 8, 9, 7, 6, 5, 1, 1, 4, 1, 3, 1, 1])

In [329]:
x0 = np.array(Xcols)
list(x0[rfe.support_])

['contact', 'month', 'duration', 'cons.price.idx', 'nr.employed']
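
Indexing a separately stored column list with the support mask only works while that list matches the fitted matrix. A slightly safer sketch, assuming a DataFrame is passed to fit, labels the ranking with the frame's own columns; the data and feature names here are synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rfe_demo = RFE(LogisticRegression(solver="lbfgs", max_iter=1000),
               n_features_to_select=3)
rfe_demo.fit(X_demo, y_demo)

# Rank 1 marks the selected features; larger ranks were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_demo.columns).sort_values()
selected = list(X_demo.columns[rfe_demo.support_])
print(ranking)
print(selected)
```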

In [256]:
idxs = list(np.where(rfe.support_)[0])
idxs
Xcols

[7, 8, 10, 14, 17]

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'cons.price.idx',
 'cons.conf.idx',
 'nr.employed']

# Summary

Case 1. Never active. The models all fit quickly and over cross-validation manage to get to over 90% accuracy.
Only MLPC struggles.

Case 2. Inactive for over a year. SVM really struggles to converge. Logistic regression needs 10000 iterations but gets to over 90%. MLPC gets there but is probably over-fitting.

Case 3. Inactive for over a month. MLPC gets 30% with full layers. Logistic regression can get to over 75%.

Running through the models and looking at the feature selection, it seems that, in all cases, a user who has had a transaction declined, reverted, or declared as pending proves to be inactive. For some reason, a notification of ONBOARDING_TIPS_ACTIVATED_USERS seems to be a good indicator. This might be because account-holders who are having trouble ask for help in the form of onboarding tips.