## A Toe In The Tensorflow DNN Water

This notebook is supplementary material for the talk _Deep Neural Networks for
Scalable Prediction_ given at the ASA's __Conference on Statistical Practice__, Portland OR, February 2018.

It provides a runnable example of a _very_ simple multilayer perceptron binary classifier deep neural network, with training and evaluation done using the Tensorflow library.  The purpose is to introduce to statistics practitioners some basic deep neural network concepts.  Also included is an "ordinary" binary logistic regression model that produces results for comparison purposes.

The example provided here doesn't necessarily represent the "best" way of training and evaluating the network described, nor do the methods used illustrate the best ways of using the capabilities of the packages employed.  In fact, this example doesn't work exceedingly well for the real world data employed.  Try to improve it by changing the number of layers or by making other modifications.

To run what's here you'll need to be using a Python 3.x kernel, and also to install a couple of Python packages.  You'll need to download some publicly available data.

## License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the terms of the GNU General Public License for more details. The GNU General Public License can be found at <http://www.gnu.org/licenses/>.

## The Data

The data is from a Portuguese bank.  It was used for direct marketing as described by Moro, Cortez & Rita (2014) _A data driven approach to predict the success of bank telemarketing_. __Decision Support Systems__, 62, 22-31.  

It is available from the University of California Irvine (UCI) Machine Learning Repository at:  

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

## Some Libraries Needed

You'll need to have the packages below installed before you can import them.  The code in this notebook uses the Tensorflow 1.x graph API (`tf.placeholder`, `tf.Session`).  Note that if you are using tensorflow 1.4 and Python 3.6.x, you may get a warning when you import tensorflow.  A possible reason is that the 1.4 module was compiled under Python 3.5.  It's just a warning, and shouldn't affect what will run.

In [16]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn import preprocessing

# The following allows output from multiple statements to come out in a single output cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Basic Deep Learning Approach

Generally speaking, most Deep Learning applications involve four main elements: data, a model or models, an optimization method, and a cost or loss function.  You provide or specify each of these, and then you train your model to do what it's supposed to do.  Training is typically accomplished using a gradient descent method that attempts to minimize the value of the loss function by adjusting the values of the model's weights.  A commonly used loss function for classification is what's called *cross entropy loss*.
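
To make the cross entropy idea concrete, here is a minimal sketch in plain Python for the binary (0/1) case.  The function name `binary_cross_entropy` and the example probabilities are made up for illustration; they are not part of the model trained later in this notebook.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross entropy loss for 0/1 labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 0, 1]                      # illustrative labels
confident_right = [0.9, 0.1, 0.2, 0.8]     # high probability on the true class
confident_wrong = [0.1, 0.9, 0.8, 0.2]     # high probability on the wrong class

loss_good = binary_cross_entropy(labels, confident_right)
loss_bad = binary_cross_entropy(labels, confident_wrong)
print(loss_good, loss_bad)
```

The loss is small when the predicted probabilities put most of their weight on the observed class, and it grows quickly when they don't, which is what gives gradient descent something to push against.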

Some computational libraries used to apply deep learning methods, like Tensorflow and Torch, use *tensors* to contain the data used in training and testing deep learning algorithms. A tensor is a grid of numbers with a variable number of dimensions, or axes.
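
As a rough illustration, numpy arrays behave like tensors in this sense: the same kind of object can have zero, one, two, or more axes.  The shapes below are arbitrary examples, not the shapes of the bank data.

```python
import numpy as np

scalar = np.array(3.0)                 # 0 axes
vector = np.array([1.0, 2.0, 3.0])     # 1 axis
matrix = np.zeros((100, 6))            # 2 axes, e.g. 100 cases by 6 features
cube = np.zeros((10, 100, 6))          # 3 axes, e.g. 10 batches of such matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "axes:", t.ndim, "shape:", t.shape)
```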

## Get The Data and Documentation

Download the `bank.zip` file from the UCI repository and unzip it in a directory that this notebook can access.  The zip file should contain three files: `bank.csv`, `bank-full.csv`, and `bank-names.txt`.  The text file provides variable definitions and some background information.

The `bank-full` file is a semicolon-delimited file with 45,211 records having 17 columns or variables. It has a header record containing variable names.  The `bank` file contains a randomly selected 10% of the records in the `bank-full` file.  In what follows the full data set is used, but you can use the smaller dataset if you like.

In [2]:
# Read the data into a Pandas DataFrame.  Read `bank.csv` if you want less data
bankDat=pd.read_csv('bank-full.csv',sep=';')

## Types of Variables

As read in, most of the 17 variables are type "object" (i.e., character) and some are integer.  The variable we want to predict is `y`, which is coded "yes" or "no", indicating whether a customer responded to the most recent marketing campaign.  

To keep this example very simple, we're going to use a subset of the variables, those that can be used with the least amount of transformation. Deep neural networks implemented in Tensorflow like to have continuous, rather than discrete, input data, as their inputs are used in mathematical operations like addition and multiplication. Discrete or categorical inputs are usually transformed in some way first, for example by dummy or "one hot" coding, binning ("bucketizing"), hashing, or embedding. See the Tensorflow documentation about encoding categorical features, e.g. https://www.tensorflow.org/versions/master/get_started/feature_columns
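
For readers who haven't seen "one hot" coding, here is a small numpy sketch.  The values are made up for illustration, and this encoding is not used in the rest of the notebook.

```python
import numpy as np

# Illustrative values only; a categorical variable with three levels.
marital = np.array(["married", "single", "divorced", "single"])

# np.unique returns the sorted levels and an integer code for each record.
levels, codes = np.unique(marital, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for level i.
one_hot = np.eye(len(levels), dtype=int)[codes]

print(levels)
print(one_hot)
```

Each record becomes a row with a single 1 in the column for its level, so a three-level variable turns into three 0/1 inputs.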

We're going to "fudge" a little here by treating some binary input variables as truncated continuous variables by coding them as 0 or 1.

Note that the percentage of "yes's" in the `y` data, the percent responding to the campaign, is $\approx$ 12%.

## Selecting Variables as Inputs

We're going to use just the "features" (inputs) that we'll treat as continuous measures.

In [3]:
varsToUse=['age','balance','housing','loan','campaign','previous','y']

campCusts=bankDat[varsToUse]

`campCusts` is a DataFrame with the vars we're going to use. `y` will be the binary dependent variable, or *output*.  The other variables will be our inputs, or *features*.

## A Simple Transformation

`housing`, `loan`, and `y` are character types with values "yes" or "no". We'll convert these to numeric 1|0 variables. 

In [4]:
# Here's a stone simple function for creating new 0/1 vars.
# Another way to do this would be to use numpy's where() function.

def zero1(z):
    # 'no' becomes 0; anything else (here, 'yes') becomes 1
    if z=='no':
        return 0
    return 1

cVars=['housing','loan','y']   # vars to be made 0|1

# Create a new df that's 0|1.  (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map.)
new01Vars=campCusts[cVars].applymap(zero1)

contVars=campCusts.loc[:,~(campCusts.columns.isin(cVars))]

campCusts2=contVars.join(new01Vars)  # join on the DataFrame indexes.
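
The `np.where()` alternative mentioned in the comment above can be sketched as follows.  The array here is a made-up stand-in for a "yes"/"no" column.

```python
import numpy as np

housing = np.array(["yes", "no", "no", "yes"])   # made-up values

# Where the condition holds use 0, otherwise use 1 (same rule as zero1).
housing01 = np.where(housing == "no", 0, 1)
print(housing01)
```

The same call should work on a DataFrame column, for example `np.where(campCusts['housing'] == 'no', 0, 1)`, and it is vectorized rather than applied value by value.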

### Make Input Data Numpy Arrays

In [5]:
campCustX=campCusts2.iloc[:,0:6].values    #Here are the input, or predictor, variables

convY=campCusts2.y.values          # Our 0|1 output, or dependent, variable. Approx. 12% 1's.

### Random Split into Training and Test Data

We could use a numpy function to make "coin flips" to get an approximately 80/20 split.  Here, instead, we'll use a method from `sklearn` to do the split.  We'll use the 80% for training and the 20% for testing, that is, for evaluating the NN's predictive accuracy using held-out data.

In [6]:
Xtrain, Xtest, yTrain, yTest = train_test_split(campCustX, convY,
                                   test_size=0.20, random_state=23)
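
For comparison, the numpy "coin flip" approach might look like the sketch below.  The arrays `X_demo` and `y_demo` are randomly generated stand-ins for `campCustX` and `convY`, and the names are made up for this example.

```python
import numpy as np

rng = np.random.RandomState(23)          # seeded so the split is repeatable
n = 1000
X_demo = rng.normal(size=(n, 6))         # stand-in for campCustX
y_demo = rng.binomial(1, 0.12, size=n)   # stand-in for convY

in_train = rng.uniform(size=n) < 0.80    # one "coin flip" per record
X_tr, X_te = X_demo[in_train], X_demo[~in_train]
y_tr, y_te = y_demo[in_train], y_demo[~in_train]
print(X_tr.shape, X_te.shape)
```

Unlike `train_test_split` with `test_size=0.20`, independent coin flips give a split that is only approximately 80/20.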

### Scaling (Standardizing) Predictor Variables

It's a common machine learning practice to rescale continuous input variables so that they have the same mean and variance.  We'll do that here using `sklearn` methods.  You can skip this rescaling to see if it makes a difference in the results, below, if you want.

In [7]:
Xtrain=Xtrain.astype(float)

scaler = preprocessing.StandardScaler().fit(Xtrain)

# After this, Xtrain's columns should have mean = 0, std = 1
Xtrain = scaler.transform(Xtrain)

# Xtest is rescaled based on Xtrain's means and std devs
Xtest = scaler.transform(Xtest.astype(float))
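
The same rescaling can be written out by hand in numpy, which may make it clearer what the scaler is doing.  This is a sketch on randomly generated data; `train` and `test` are stand-ins, not the bank data.

```python
import numpy as np

rng = np.random.RandomState(0)
train = rng.normal(loc=50.0, scale=10.0, size=(500, 3))   # stand-in training inputs
test = rng.normal(loc=50.0, scale=10.0, size=(100, 3))    # stand-in test inputs

mu = train.mean(axis=0)        # column means from the training data only
sigma = train.std(axis=0)      # column std devs (ddof=0) from the training data only

train_z = (train - mu) / sigma
test_z = (test - mu) / sigma   # test data reuses the training means and std devs

print(train_z.mean(axis=0), train_z.std(axis=0))
print(test_z.mean(axis=0), test_z.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 (up to rounding).  The test columns will only be close to that, since they are scaled with the training statistics.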

### "Ordinary" Binary Logistic Regression

For the sake of comparison to results that follow, let's use some `scikit-learn` methods to estimate a logistic regression using the training data.  Then, check the model's predictive accuracy and AUC using the training data and the test data.

In [8]:
from sklearn import linear_model
logReg=linear_model.LogisticRegression()
logMod1=logReg.fit(Xtrain,yTrain)

In [9]:
print('training data accuracy {:5.3f}'.format(logMod1.score(Xtrain,yTrain)))
trainPredProbs=logMod1.predict_proba(Xtrain)
regTrainAUC=roc_auc_score(yTrain,trainPredProbs[:,1])
print('training data auc {:5.3f}'.format(regTrainAUC))

print('test data accuracy {:5.3f}'.format(logMod1.score(Xtest,yTest)))
testPredProbs=logMod1.predict_proba(Xtest)
regTestAUC=roc_auc_score(yTest,testPredProbs[:,1])
print('test data auc {:5.3f}'.format(regTestAUC))

training data accuracy 0.882
training data auc 0.688
test data accuracy 0.881
test data auc 0.689
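
When reading these numbers, it helps to remember that only about 12% of the `y` values are 1, so a rule that always predicts "no" would already be right about 88% of the time.  The sketch below illustrates this with made-up data, and also shows one way to think about AUC: the share of (positive, negative) pairs that the scores rank correctly.  The helper `pairwise_auc` is written for this example and is not an `sklearn` function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_demo = [0] * 88 + [1] * 12      # made-up labels with 12% ones
always_no = [0.0] * 100           # a "model" that gives everyone the same score

accuracy_always_no = sum(1 for y in y_demo if y == 0) / len(y_demo)
auc_always_no = pairwise_auc(y_demo, always_no)
print(accuracy_always_no, auc_always_no)
```

Under these assumptions the uninformative rule reaches 88% accuracy but an AUC of only 0.5, which is why AUC is the more telling of the two measures here.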


### Mini-Batch Data Feeder

We're going to do "mini-batch" training.  In each iteration of our training algorithm we'll use a sample ("mini-batch") selected at random, with replacement, from our training data.  Once we're done, we'll use the observations we've set aside (the test data) to assess our network's predictive accuracy using data it didn't learn its parameters, i.e. weights, from.  The function below grabs a random subset of the training data on each iteration of the algorithm.

In [10]:
def get_batch(epoch, ncases, b_ndx, b_size):
    # epoch is the alg iteration, b_ndx is the batch no.
    # (both are accepted but not used, since each batch is an independent random draw)
    # b_size is batch size
    
    ndxs= np.random.randint(ncases,size=b_size)
    X_bat=Xtrain[ndxs]
    y_bat=yTrain[ndxs]
    return X_bat, y_bat
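
Sampling with replacement, as `get_batch` does, is one of two common ways to form mini-batches.  The sketch below contrasts it with the other: shuffling the record indexes once per epoch and slicing them, so each record is used exactly once per epoch.  The `_demo` names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)
ncases_demo, b_size_demo = 10, 4

# With replacement, as in get_batch: a record may appear more than once in a batch.
with_repl = rng.randint(ncases_demo, size=b_size_demo)

# Without replacement: shuffle once per epoch, then take consecutive slices.
perm = rng.permutation(ncases_demo)
batches = [perm[i:i + b_size_demo] for i in range(0, ncases_demo, b_size_demo)]

print(with_repl)
print(batches)
```

With the second approach the last batch can be smaller than the others when the batch size doesn't divide the number of cases evenly.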


### Two Hidden Layer NN

A two hidden layer fully connected NN with four nodes per hidden layer is specified here.  Feel free to fiddle around with it, of course.

There are more elegant and more efficient ways of doing what follows using methods in the `tensorflow` library.

In [11]:
# Parameters  - change 'em if you wish

learn_rate = 0.1
b_size = 100                 # batch size
n_epochs = 100               # no. of epochs
ncases=Xtrain.shape[0]       #no. of records
n_bats = int(np.ceil(ncases/b_size))  # no. of batches

# Inputs, nodes in hidden layers, classes in the output

n_hid_1 = 4 # 1st layer number of neurons
n_hid_2 = 4 # 2nd layer number of neurons
num_inp = 6 # selected campaign predictors
num_class = 2 # y values, 0 or 1


# tf placeholders for input data

X = tf.placeholder("float32", shape=(None, num_inp),name="X")
y = tf.placeholder("int32", shape=(None),name="y")

# Layer weights & biases

weights = {
    'h1': tf.Variable(tf.random_normal([num_inp, n_hid_1])),
    'h2': tf.Variable(tf.random_normal([n_hid_1, n_hid_2])),
    'out': tf.Variable(tf.random_normal([n_hid_2, num_class]))
     }
biases = {
    'b1': tf.Variable(tf.random_normal([n_hid_1])),
    'b2': tf.Variable(tf.random_normal([n_hid_2])),
    'out': tf.Variable(tf.random_normal([num_class]))
     }


Definitions of network layers.  A "slicker" way would be to use "named scopes."

In [12]:
def neural_net(x):
    # Two fully connected hidden layers with 4 neurons each and relu activations.
    # Note: the biases defined above are not added in any layer here.
    layer_1 = tf.nn.relu(tf.matmul(x, weights['h1']))
    # Second hidden fully connected layer with 4 neurons
    layer_2 = tf.nn.relu(tf.matmul(layer_1, weights['h2']))
    # Output fully connected layer with a neuron for each class
    out_layer = tf.matmul(layer_2, weights['out'])
    return out_layer
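
If the Tensorflow graph code feels opaque, the same arithmetic can be sketched in numpy.  This is only an illustration of the forward pass with random weights of the same shapes (6 inputs, 4 and 4 hidden nodes, 2 outputs); like the notebook's layers, it multiplies by the weights without adding biases.  The names `relu`, `forward`, and `x_demo` are made up for this example.

```python
import numpy as np

def relu(a):
    """Rectified linear unit: negative values become 0."""
    return np.maximum(a, 0.0)

def forward(x, W1, W2, W_out):
    """Forward pass: two relu hidden layers, then a linear output layer."""
    h1 = relu(x @ W1)
    h2 = relu(h1 @ W2)
    return h2 @ W_out          # logits, one column per class

rng = np.random.RandomState(0)
W1 = rng.normal(size=(6, 4))       # inputs -> hidden layer 1
W2 = rng.normal(size=(4, 4))       # hidden layer 1 -> hidden layer 2
W_out = rng.normal(size=(4, 2))    # hidden layer 2 -> class logits

x_demo = rng.normal(size=(5, 6))   # 5 made-up cases with 6 inputs each
logits_demo = forward(x_demo, W1, W2, W_out)
print(logits_demo.shape)
```

Each case goes in as a row of 6 numbers and comes out as a row of 2 logits, one per class.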


#### Specification of the loss function, optimization method, and accuracy calculation

In [13]:
logits = neural_net(X)

loss_op = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits, labels=y))

optimizer = tf.train.AdamOptimizer(learning_rate=learn_rate)
train_op = optimizer.minimize(loss_op)

correct_pred=tf.nn.in_top_k(logits,y,1)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
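
As a rough guide to what the loss and accuracy ops compute, here is a numpy sketch of softmax cross entropy with integer ("sparse") class labels, plus a top-1 accuracy.  With `k` equal to 1, the `in_top_k` check amounts to asking whether the true class has the largest logit (setting ties aside).  The `_demo` values are made up, and this is an illustration rather than Tensorflow's actual implementation.

```python
import numpy as np

def softmax(logits):
    """Turn each row of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def sparse_cross_entropy(logits, labels):
    """Mean of -log(probability assigned to the true class)."""
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked).mean()

logits_demo = np.array([[2.0, 0.0], [0.5, 1.5], [1.0, 1.2]])
labels_demo = np.array([0, 1, 0])

loss_demo = sparse_cross_entropy(logits_demo, labels_demo)
acc_demo = (logits_demo.argmax(axis=1) == labels_demo).mean()
print(loss_demo, acc_demo)
```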

### Training and Evaluation

#### Initialization of All Variables

In [14]:
init = tf.global_variables_initializer()

#### Learning Algorithm in Tensorflow Session

Here's a `Tensorflow` session that runs the specified number of epochs with mini-batches of the specified size.  When it's done iterating, it prints out classification accuracies and AUC estimates based on all of the training data, and all of the test data.

This part might take a bit to run, especially if you are running a non-GPU version of Tensorflow.
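
The overall shape of the training loop (epochs on the outside, mini-batches on the inside, one weight update per batch) can be seen in a much smaller setting.  The sketch below fits a logistic regression to simulated data with plain mini-batch gradient descent in numpy.  It is not the Adam optimizer used in this notebook, and all of the names and settings in it are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n, p = 400, 2
X_demo = rng.normal(size=(n, p))
true_w = np.array([2.0, -1.0])                       # weights used to simulate y
p_true = 1.0 / (1.0 + np.exp(-(X_demo @ true_w)))
y_demo = (rng.uniform(size=n) < p_true).astype(float)

def mean_loss(w):
    """Mean cross entropy loss over all of the simulated data."""
    prob = np.clip(1.0 / (1.0 + np.exp(-(X_demo @ w))), 1e-12, 1 - 1e-12)
    return -np.mean(y_demo * np.log(prob) + (1 - y_demo) * np.log(1 - prob))

w = np.zeros(p)
loss_start = mean_loss(w)
lr, batch, epochs = 0.1, 50, 20
for epoch in range(epochs):
    for _ in range(n // batch):
        idx = rng.randint(n, size=batch)             # mini-batch, with replacement
        prob = 1.0 / (1.0 + np.exp(-(X_demo[idx] @ w)))
        grad = X_demo[idx].T @ (prob - y_demo[idx]) / batch
        w -= lr * grad                               # one gradient descent step
loss_end = mean_loss(w)
print(loss_start, loss_end, w)
```

The loss on the full simulated data set should be lower after training than at the all-zero starting weights, which is the same thing the Tensorflow session below is trying to achieve for the network.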

In [15]:
with tf.Session() as sess:
    init.run()
    print('epoch ',end='')
    for epoch in range(n_epochs):
        
        for b_ndx in range(n_bats):
            X_batch, y_batch=get_batch(epoch,ncases, b_ndx, b_size)
            sess.run(train_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 20 == 0:
            print(epoch, end=', ')
    
    print('done!\n')
    accTrain=accuracy.eval(feed_dict={X: Xtrain, y: yTrain})
    print('\ntraining data accuracy {:5.3f}'.format(accTrain))
    trainProbs=tf.nn.softmax(logits).eval(feed_dict={X: Xtrain, y: yTrain})
    print('training data auc {:5.3f}'.format(roc_auc_score(yTrain,trainProbs[:,1])))
    
    accTest=accuracy.eval(feed_dict={X: Xtest, y: yTest})
    print('test data accuracy {:5.3f}'.format(accTest))
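    # Score the held-out test data here.  (The output shown below came from a run that
    # fed Xtrain/yTrain at this step, so its "test data auc" repeats the training AUC.)
    testProbs=tf.nn.softmax(logits).eval(feed_dict={X: Xtest, y: yTest})
    print('test data auc {:5.3f}'.format(roc_auc_score(yTest,testProbs[:,1])))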
   


epoch 0, 20, 40, 60, 80, done!


training data accuracy 0.887
training data auc 0.699
test data accuracy 0.884
test data auc 0.699
