In [None]:
# default_exp philander2014

# Philander 2014

> Full replication

This notebook shows how gamba can be used to reproduce findings from Philander's 2014 study on data mining methods for detecting high-risk gamblers.

- [Data Download (thetransparencyproject.org)](http://www.thetransparencyproject.org/download_index.php)
- [Data Description]()
- [Original Paper](https://www.tandfonline.com/doi/abs/10.1080/14459795.2013.841721)

It uses data available through the Transparency Project linked above, and applies eight distinct supervised machine learning techniques.

**Note:** given the high dimensionality of this data (17), the sample size (530) doesn't meet the [rule of thumb](https://youtu.be/Dc0sr0kdBVI?t=3414) that 10x17 (or 1700) observations are required for learning to be generalisable. This means that the outputs of the methods below may change drastically between runs, and comparison to the original may not be meaningful.

To begin, import gamba as usual:

In [None]:
import gamba as gb

In [None]:
philander_data = gb.data.prepare_philander_data('AnalyticDataSet_HighRisk.txt', loud=True)
train_measures, test_measures = gb.measures.split_measures_table(philander_data, frac=.696, loud=True)
display(train_measures.head(3))

530 players loaded
train:test
 369 : 161 ready


| | player_id | country | gender | age | total_wagered | num_bets | frequency | duration | bets_per_day | net_loss | intensity | variability | frequency_1m | trajectory | z_intensity | z_variability | z_frequency | z_trajectory | self_exclude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 1325917 | 276 | 1 | 53 | 93.34 | 53 | 22 | 419 | 2.409091 | 14.99 | 1.0 | 3.685557 | 4 | -0.230259 | -0.62285 | -0.243764 | -0.239622 | -0.21847 | 1 |
| 283 | 1366865 | 792 | 1 | 24 | 160.7257 | 14 | 5 | 260 | 2.8 | 105.5627 | 3.666667 | 32.369628 | 3 | 0.630566 | -0.111173 | -0.057883 | -0.415612 | 0.980193 | 1 |
| 253 | 1363496 | 300 | 1 | 29 | 1834.56 | 196 | 71 | 390 | 2.760563 | 420.32 | 3.0 | 7.972459 | 3 | -0.385017 | -0.239092 | -0.215983 | -0.415612 | -0.433965 | 1 |
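
The internals of `split_measures_table` are not shown in this notebook. As a rough sketch only, assuming it randomly samples a fraction of the rows for training and keeps the remainder for testing, the same split can be written with plain pandas. The helper name `split_by_fraction`, the `seed` argument, and the toy table are hypothetical and are not part of gamba.

```python
import pandas as pd

def split_by_fraction(measures, frac, seed=0):
    """Randomly assign `frac` of the rows to a training set and the rest to a test set."""
    train = measures.sample(frac=frac, random_state=seed)
    test = measures.drop(train.index)
    return train, test

# toy table with the same number of players as the study sample (530)
toy = pd.DataFrame({'player_id': range(530), 'self_exclude': [0, 1] * 265})
train, test = split_by_fraction(toy, frac=0.696)
print(len(train), ':', len(test))
```

With 530 rows and `frac=.696` this yields 369 training rows and 161 test rows, the same counts as the split reported above.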


In [None]:
log_r = gb.machine_learning.logistic_regression(train_measures, test_measures, 'self_exclude')
lasso_l = gb.machine_learning.lasso_logistic_regression(train_measures, test_measures, 'self_exclude')
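
For readers unfamiliar with the two models, the sketch below fits a plain and a lasso (L1-penalised) logistic regression with scikit-learn on synthetic data. It is an illustration only: whether gamba's functions wrap scikit-learn, and which penalty strengths they use, is an assumption here, and the random data and `C` values are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the measures tables: 17 features, binary label
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(369, 17)), rng.normal(size=(161, 17))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=369) > 0).astype(int)

# plain logistic regression; a very large C effectively switches off
# scikit-learn's default L2 penalty
plain = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)

# lasso logistic regression; a smaller C means stronger shrinkage,
# which pushes uninformative coefficients to exactly zero
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_train, y_train)

plain_pred = plain.predict(X_test)
lasso_pred = lasso.predict(X_test)
print('non-zero coefficients:',
      np.count_nonzero(plain.coef_), 'vs', np.count_nonzero(lasso.coef_))
```

The practical difference is that the lasso variant performs variable selection as a side effect of fitting, which matters when there are many measures relative to the number of players.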

## Neural Networks
The following cell uses the [Keras](https://keras.io) library to create and train a neural network as described in the study. The original study uses the R [nnet](https://cran.r-project.org/web/packages/nnet/nnet.pdf#Rfn.optim) and [caret](https://cran.r-project.org/web/packages/caret/vignettes/caret.html) packages; [this Stack Overflow post](https://stackoverflow.com/questions/42417948/how-to-use-size-and-decay-in-nnet) was helpful in understanding the original parameters.

Framing the `self_exclude` label (0 or 1) as a regression problem means creating a neural network with a single output node and thresholding its continuous prediction to recover a 0 or 1. The classification version of the neural network used in the original analysis has an identical network topology but is given two strings as label values instead of a 1 or 0. In theory this should make no substantial difference to the performance of the network (given the sample size and identical architectures).
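
As a small illustration of the thresholding step, the sketch below converts continuous outputs into 0/1 labels with a cut-off of 0.5. The five values are hypothetical, chosen only to show what happens on either side of the cut-off.

```python
import numpy as np

# hypothetical network outputs for five players, shaped (n, 1) like Keras predictions
raw_prediction = np.array([[0.08], [0.49], [0.50], [0.73], [0.99]])

# values at or above 0.5 become 1 (self-excluder), everything else becomes 0
prediction = np.where(raw_prediction >= 0.5, 1, 0).ravel().tolist()
print(prediction)  # [0, 0, 1, 1, 1]
```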

In [None]:
import numpy as np
import pandas as pd
from keras.layers import Dense
from keras.models import Sequential

# classification = discrete output
# regression = continuous output

def simple_classification_neural_network(train_measures, test_measures, label):
    """Train a small feed-forward network and return 0/1 predictions for the test set."""
    # separate the behavioural measures (inputs) from the label (target)
    train_data = train_measures.drop(['player_id', label], axis=1)
    train_labels = train_measures[label]
    test_data = test_measures.drop(['player_id', label], axis=1)
    test_labels = test_measures[label]

    n_features = train_data.shape[1]  # 17 for the measures table used here

    model = Sequential()
    model.add(Dense(17, activation='relu', input_dim=n_features))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(units=1, activation='sigmoid'))

    model.compile(optimizer='adam',
                  loss='mean_squared_error',
                  metrics=['accuracy'])

    model.fit(train_data, train_labels,
              batch_size=20, epochs=100,
              validation_data=(test_data, test_labels),
              verbose=False)

    # threshold the sigmoid output at 0.5 to get a 0 or 1, as in the original code
    raw_prediction = model.predict(test_data)
    prediction = np.where(raw_prediction >= 0.5, 1, 0).ravel().tolist()

    return prediction

test_labels = test_measures['self_exclude']

# the regression variant is disabled: shallow_neural_network is not defined in this notebook
#nn_r = shallow_neural_network(train_measures, test_measures, 'self_exclude')
nn_c = simple_classification_neural_network(train_measures, test_measures, 'self_exclude')



| | sensitivity | specificity | accuracy | precision | auc | odds_ratio |
|---|---|---|---|---|---|---|
| nn_r | 1.0 | 0.0 | 0.354 | 0.354 | 0.5 | 0 |
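
The `nn_r` row is consistent with a model that labels every test player as a self-excluder. The sketch below works through that case from the confusion-matrix counts. It assumes 57 self-excluders among the 161 test players (inferred from 57/161 ≈ 0.354, not stated in the notebook), and the `performance` helper is hypothetical rather than gamba's `compute_performance`. For hard 0/1 predictions the AUC reduces to the mean of sensitivity and specificity.

```python
import numpy as np

def performance(y_true, y_pred):
    """Confusion-matrix metrics for hard 0/1 predictions (hypothetical helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        'sensitivity': sensitivity,
        'specificity': specificity,
        'accuracy': (tp + tn) / len(y_true),
        'precision': tp / (tp + fp),  # undefined if nothing is predicted positive
        'auc': (sensitivity + specificity) / 2,
    }

# assumption: 57 self-excluders and 104 controls in the test set
y_true = np.array([1] * 57 + [0] * 104)
y_pred = np.ones(161, dtype=int)  # every player predicted as a self-excluder

result = {name: round(value, 3) for name, value in performance(y_true, y_pred).items()}
print(result)
```

Under these assumptions the rounded values are sensitivity 1.0, specificity 0.0, accuracy 0.354, precision 0.354 and AUC 0.5, which suggests the network in that run had collapsed to a constant prediction.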


## Support Vector Machines (SVMs)
The following cell uses [scikit-learn's SVM](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm) methods to create and train three SVMs. The original paper uses Dimitriadou et al.'s [implementations in R described here](https://www.researchgate.net/profile/Friedrich_Leisch/publication/221678005_E1071_Misc_Functions_of_the_Department_of_Statistics_E1071_TU_Wien/links/547305880cf24bc8ea19ad1d/E1071-Misc-Functions-of-the-Department-of-Statistics-E1071-TU-Wien.pdf).

In [None]:
svm_e = gb.machine_learning.svm_eps_regression(train_measures, test_measures, 'self_exclude')
svm_c = gb.machine_learning.svm_c_classification(train_measures, test_measures, 'self_exclude')
svm_o = gb.machine_learning.svm_one_classification(train_measures, test_measures, 'self_exclude')
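
The three SVM variants have direct scikit-learn counterparts: `SVR` (eps-regression), `SVC` (C-classification) and `OneClassSVM` (one-classification). The sketch below shows how each could produce 0/1 predictions on synthetic data. The kernels, hyperparameters, the 0.5 cut-off for the regression output, and fitting the one-class model on positive cases only are all assumptions for illustration, not a description of gamba's implementation.

```python
import numpy as np
from sklearn.svm import SVR, SVC, OneClassSVM

# synthetic stand-in for the measures tables: 17 features, binary label
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(369, 17)), rng.normal(size=(161, 17))
y_train = (X_train[:, 0] > 0).astype(int)

# eps-regression: predict a continuous score, then threshold it at 0.5
svr = SVR(kernel='rbf', epsilon=0.1).fit(X_train, y_train)
eps_pred = np.where(svr.predict(X_test) >= 0.5, 1, 0)

# C-classification: predicts the class label directly
c_pred = SVC(kernel='rbf', C=1.0).fit(X_train, y_train).predict(X_test)

# one-classification: learn the region occupied by one class only;
# predict() returns +1 for inliers and -1 for outliers, mapped here to 1 and 0
one = OneClassSVM(kernel='rbf', nu=0.5).fit(X_train[y_train == 1])
one_pred = np.where(one.predict(X_test) == 1, 1, 0)

print(eps_pred[:5], c_pred[:5], one_pred[:5])
```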

## Random Forest
This section uses [scikit-learn's ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html#forest) to create random forests for classification and regression.

In [None]:
rf_r = gb.machine_learning.rf_regression(train_measures, test_measures, 'self_exclude')
rf_c = gb.machine_learning.rf_classification(train_measures, test_measures, 'self_exclude')
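
As with the SVMs, the regression and classification forests differ mainly in how a 0/1 prediction is obtained. The sketch below uses scikit-learn's `RandomForestRegressor` and `RandomForestClassifier` on synthetic data. The number of trees, the seeds, and the 0.5 cut-off are assumptions for illustration and may not match what gamba does internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# synthetic stand-in for the measures tables: 17 features, binary label
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(369, 17)), rng.normal(size=(161, 17))
y_train = (X_train[:, 0] > 0).astype(int)

# regression forest: averages the trees' outputs into a score in [0, 1],
# which is then thresholded at 0.5
rf_reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
reg_pred = np.where(rf_reg.predict(X_test) >= 0.5, 1, 0)

# classification forest: predicts the class label directly
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
clf_pred = rf_clf.predict(X_test)

print(reg_pred[:5], clf_pred[:5])
```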

## All Methods Together

In [None]:
test_labels = test_measures['self_exclude']
all_results = [
    gb.machine_learning.compute_performance('Logistic Regression', test_labels, log_r),
    gb.machine_learning.compute_performance('Lasso Logistic Regression', test_labels, lasso_l),
    # 'NN Regression' is omitted: nn_r is never computed above, so including it raises a NameError
    gb.machine_learning.compute_performance('NN Classification', test_labels, nn_c),
    gb.machine_learning.compute_performance('SVM eps-Regression', test_labels, svm_e),
    gb.machine_learning.compute_performance('SVM c-Classification', test_labels, svm_c),
    gb.machine_learning.compute_performance('SVM one-Classification', test_labels, svm_o),
    gb.machine_learning.compute_performance('RF Regression', test_labels, rf_r),
    gb.machine_learning.compute_performance('RF Classification', test_labels, rf_c)
]

# stack the one-row performance tables into a single comparison table
all_results_df = pd.concat(all_results)
display(all_results_df)

## Visualisations

In [None]:
import matplotlib.pyplot as plt

def plot_individual(measures_table, player_id):
    """Plot each measure's distribution with a vertical line at the given player's value."""
    measure_cols = measures_table.columns.drop('player_id')
    player = measures_table[measures_table['player_id'] == player_id]
    if player.empty:
        raise ValueError(f'player_id {player_id} not found in measures table')

    for measure in measure_cols:
        plt.figure(figsize=[4, 2])
        plt.hist(measures_table[measure].values, alpha=0.5)
        # mark where this player sits within the cohort's distribution
        plt.axvline(player[measure].values[0], color='black', label=measure)
        plt.legend()
        plt.show()

plot_individual(philander_data, 1324368)

