In [None]:
# default_exp philander2014

# Philander 2014

> Full replication

This notebook shows how gamba can be used to reproduce findings from Philander's 2014 study on data mining methods for detecting high-risk gamblers.

- [Data Download (thetransparencyproject.org)](http://www.thetransparencyproject.org/download_index.php)
- [Data Description]()
- [Original Paper](https://www.tandfonline.com/doi/abs/10.1080/14459795.2013.841721)

It uses data available through the Transparency Project linked above and applies eight distinct supervised machine learning techniques.

**Note:** given the high dimensionality of this data (17), the sample size (530) doesn't meet the [rule of thumb](https://youtu.be/Dc0sr0kdBVI?t=3414) that 10x17 (or 1700) observations are required for learning to be generalisable. This means that the outputs of the methods below may change drastically between executions, and comparison to the original may not be meaningful. With that in mind, this notebook shows how to do this kind of analysis using identical methods.

To begin, import gamba as usual:

In [2]:
import gamba as gb

In [3]:
measures_table = gb.data.prepare_philander_data('AnalyticDataSet_HighRisk.txt', loud=True)

530 players loaded


## Logistic Regressions
The machine learning module has wrappers for two logistic regression functions which can be used here. As with other machine learning methods in the gamba library, they return both the actual test labels and the predicted labels so that performance metrics can be computed.

> Note: throughout this page, the test labels are stored under an abbreviated version of the method's name, and the predicted labels under the same name with a 'p' appended (for example `log_r` and `log_rp`).

> Note: you'll also notice that every method is passed a train_test_split parameter of 0.696. This makes sure that exactly the same train/test split proportion is used as in the paper (the parameter defaults to 0.7).
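
To make the split concrete, the sketch below works out how many of the 530 players would land in the training and test sets for 0.696 and for the 0.7 default. The helper `split_sizes` is hypothetical, and rounding the training share to the nearest whole player is an assumption; gamba's internal rounding is not documented here.

```python
# Sketch only: the rounding rule is an assumption, not gamba's documented behaviour.
def split_sizes(n_samples, train_fraction):
    """Return (n_train, n_test), rounding the training share to the nearest whole player."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train

paper_split = split_sizes(530, 0.696)    # proportion used throughout this notebook
default_split = split_sizes(530, 0.7)    # gamba's default proportion
print(paper_split, default_split)        # (369, 161) (371, 159)
```

The accuracy values reported at the end of this notebook are consistent with a test set of 161 players (for example, 110/161 ≈ 0.683).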

In [4]:
log_r, log_rp = gb.machine_learning.logistic_regression(measures_table, 'self_exclude', train_test_split=0.696)
lasso_l, lasso_lp = gb.machine_learning.lasso_logistic_regression(measures_table, 'self_exclude', train_test_split=0.696)

## Neural Networks
The following cell uses the [Keras](https://keras.io) library to create and train some neural networks as described in the study. The original study uses the R [nnet](https://cran.r-project.org/web/packages/nnet/nnet.pdf#Rfn.optim) and [caret](https://cran.r-project.org/web/packages/caret/vignettes/caret.html) packages; [this Stack Overflow post](https://stackoverflow.com/questions/42417948/how-to-use-size-and-decay-in-nnet) was helpful in understanding the original parameters.

Framing the self_exclude label (0 or 1) as a regression problem means creating a neural network that returns a continuous value rather than a class. The classification version of the neural network used in the original analysis has an identical network topology but is given two strings as class labels instead of 1 and 0. In theory this should make no substantial difference to the performance of the network (given the sample size and identical architectures).

By contrast, the gamba library's neural network methods have subtly different topologies for classification and regression as described in [Deep Learning with Python](http://faculty.neu.edu.cn/yury/AAI/Textbook/Deep%20Learning%20with%20Python.pdf), which are used here.
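
Because a regression network returns a continuous score rather than a 0/1 label, some cut-off is needed before classification metrics such as sensitivity can be computed. The following is a minimal sketch of that step, assuming a cut-off of 0.5; the helper `to_labels` and the example scores are hypothetical, and the cut-off gamba applies is not stated in this notebook.

```python
def to_labels(scores, cutoff=0.5):
    """Map continuous predictions to 0/1 labels; scores at or above the cutoff become 1."""
    return [1 if score >= cutoff else 0 for score in scores]

example_scores = [0.12, 0.50, 0.81, 0.49]   # made-up regression outputs
example_labels = to_labels(example_scores)
print(example_labels)                       # [0, 1, 1, 0]
```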

In [5]:
nn_c, nn_cp = gb.machine_learning.neural_network_classification(measures_table, 'self_exclude', train_test_split=0.696)
nn_r, nn_rp = gb.machine_learning.neural_network_regression(measures_table, 'self_exclude', train_test_split=0.696)

## Support Vector Machines (SVMs)
The following cell uses [scikit-learn's SVM](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm) methods to create and train some SVMs. The original paper uses Dimitriadou et al.'s [implementations in R described here](https://www.researchgate.net/profile/Friedrich_Leisch/publication/221678005_E1071_Misc_Functions_of_the_Department_of_Statistics_E1071_TU_Wien/links/547305880cf24bc8ea19ad1d/E1071-Misc-Functions-of-the-Department-of-Statistics-E1071-TU-Wien.pdf).

In [6]:
svm_e, svm_ep = gb.machine_learning.svm_eps_regression(measures_table, 'self_exclude', train_test_split=0.696)
svm_c, svm_cp = gb.machine_learning.svm_c_classification(measures_table, 'self_exclude', train_test_split=0.696)
svm_o, svm_op = gb.machine_learning.svm_one_classification(measures_table, 'self_exclude', train_test_split=0.696)

## Random Forest
This section uses [scikit-learn's ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html#forest) to create random forests for classification and regression.

In [7]:
rf_r, rf_rp = gb.machine_learning.rf_regression(measures_table, 'self_exclude', train_test_split=0.696)
rf_c, rf_cp = gb.machine_learning.rf_classification(measures_table, 'self_exclude', train_test_split=0.696)

## All Methods Together

Finally, let's present the performance of each of the machine learning techniques using a number of metrics. Not all of the metrics apply to all of the methods, but it's a good way to see roughly how they compare.

In [9]:
all_results = [
    gb.machine_learning.compute_performance('Logistic Regression', log_r, log_rp),
    gb.machine_learning.compute_performance('Lasso Logistic Regression', lasso_l, lasso_lp),
    gb.machine_learning.compute_performance('NN Regression', nn_r, nn_rp),
    gb.machine_learning.compute_performance('NN Classification', nn_c, nn_cp),
    gb.machine_learning.compute_performance('SVM eps-Regression', svm_e, svm_ep),
    gb.machine_learning.compute_performance('SVM c-Classification', svm_c, svm_cp),
    gb.machine_learning.compute_performance('SVM one-Classification', svm_o, svm_op),
    gb.machine_learning.compute_performance('RF Regression', rf_r, rf_rp),
    gb.machine_learning.compute_performance('RF Classification', rf_c, rf_cp)
]

all_results_df = gb.concat(all_results)
display(all_results_df)

| Method | sensitivity | specificity | accuracy | precision | auc | odds_ratio |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.08 | 0.901 | 0.646 | 0.267 | 0.49 | 0.791 |
| Lasso Logistic Regression | 0.094 | 0.972 | 0.683 | 0.625 | 0.533 | 3.646 |
| NN Regression | 0.431 | 0.545 | 0.509 | 0.306 | 0.488 | 0.91 |
| NN Classification | 0.0 | 1.0 | 0.64 | NaN | 0.5 | 0.0 |
| SVM eps-Regression | 0.02 | 0.991 | 0.683 | 0.5 | 0.505 | 2.18 |
| SVM c-Classification | 0.0 | 1.0 | 0.646 | NaN | 0.5 | 0.0 |
| SVM one-Classification | 0.462 | 0.505 | 0.491 | 0.308 | 0.483 | 0.873 |
| RF Regression | 0.373 | 0.755 | 0.634 | 0.413 | 0.564 | 1.825 |
| RF Classification | 0.241 | 0.869 | 0.658 | 0.481 | 0.555 | 2.106 |
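
The columns above can all be derived from a confusion matrix. The sketch below is not gamba's `compute_performance`; it is a hypothetical helper, `confusion_metrics`, shown only to make the definitions concrete. The counts passed to it (5 true positives, 3 false positives, 48 false negatives, 105 true negatives) were back-calculated from the Lasso Logistic Regression row, so they are an inference rather than reported values. The sketch assumes that AUC is computed on hard 0/1 predictions, where it reduces to the mean of sensitivity and specificity, and that the odds ratio is the diagnostic odds ratio TP·TN / (FP·FN).

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the table's metrics from confusion-matrix counts (illustrative only)."""
    nan = float('nan')
    sensitivity = tp / (tp + fn)                 # share of actual positives found
    specificity = tn / (tn + fp)                 # share of actual negatives found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision is undefined when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else nan
    # For hard 0/1 predictions, ROC AUC equals the mean of sensitivity and specificity.
    auc = (sensitivity + specificity) / 2
    # Diagnostic odds ratio; left undefined here when a denominator count is zero.
    odds_ratio = (tp * tn) / (fp * fn) if (fp * fn) else nan
    return {
        'sensitivity': sensitivity,
        'specificity': specificity,
        'accuracy': accuracy,
        'precision': precision,
        'auc': auc,
        'odds_ratio': odds_ratio,
    }

# Counts inferred from the Lasso Logistic Regression row (an assumption).
lasso = confusion_metrics(tp=5, fp=3, fn=48, tn=105)
rounded = {name: round(value, 3) for name, value in lasso.items()}
print(rounded)
```

Rounded to three decimals, these counts reproduce the Lasso row (0.094, 0.972, 0.683, 0.625, 0.533, 3.646). A missing precision value, as in the NN Classification and SVM c-Classification rows, corresponds to the case where no player is predicted positive, which is also why sensitivity is 0.0 and specificity is 1.0 for those rows.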
