## Data exploration and feature selection

In this notebook I explore the credit card transaction data downloaded from https://www.kaggle.com/mlg-ulb/creditcardfraud/data and use several techniques to select features for modeling. Feature selection reduces noise and lets the model train on the most informative inputs. This often improves both training speed and model performance.

In [1]:
import pandas as pd
import numpy as np 

import tensorflow as tf

from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedShuffleSplit

In [2]:
df = pd.read_csv("creditcard.csv")

In [3]:
# `~` on a Python bool is bitwise NOT and always yields a truthy integer, so use `not`
if not df.isnull().values.any():
    print('No missing values.')

No missing values.


In [4]:
df.describe()

statistic,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.16598e-15,3.416908e-16,-1.37315e-15,2.086869e-15,9.604066e-16,1.490107e-15,-5.556467e-16,1.177556e-16,-2.406455e-15,...,1.656562e-16,-3.44485e-16,2.578648e-16,4.471968e-15,5.340915e-16,1.687098e-15,-3.666453e-16,-1.220404e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


A few things to notice:
1. There are 30 features: Time, Amount, and 28 others that are masked.
2. The data is highly imbalanced: from the mean value of Class we see that only 0.17% of the transactions are fraud.
3. The standard deviations of V1 to V28 are already sorted in decreasing order. This suggests that V1-V28 have already been PCA transformed.
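
Observations 2 and 3 can be checked numerically. The sketch below uses only the `mean` and `std` values visible in the `describe()` output above (V10 to V20 are elided there, so they are not checked); the variable and function names are illustrative.

```python
import numpy as np

# Standard deviations copied from the describe() output above.
# V10..V20 are elided in that output, so only the two visible runs are checked.
std_v1_to_v9 = np.array([1.958696, 1.651309, 1.516255, 1.415869, 1.380247,
                         1.332271, 1.237094, 1.194353, 1.098632])
std_v21_to_v28 = np.array([0.734524, 0.7257016, 0.6244603, 0.6056471,
                           0.5212781, 0.482227, 0.4036325, 0.3300833])

def is_non_increasing(values):
    """Return True if each value is <= the one before it."""
    return bool(np.all(np.diff(values) <= 0))

sorted_desc = (is_non_increasing(std_v1_to_v9)
               and is_non_increasing(std_v21_to_v28)
               and bool(std_v1_to_v9[-1] >= std_v21_to_v28[0]))

# Class is 0/1, so its mean is the fraction of fraudulent transactions.
class_mean = 0.001727
fraud_percent = 100 * class_mean
print(sorted_desc, f"{fraud_percent:.2f}%")
```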

In [5]:
array = df.values
X = array[:,1:29]
pca = PCA()
X_new = pca.fit_transform(X)

In [6]:
abs(X_new)-abs(X)

array([[-1.72528658e-13,  7.87286902e-14, -2.88213897e-13, ...,
         1.11577414e-14,  8.29891711e-15,  8.00817745e-14],
       [-8.88178420e-16, -8.74855743e-14,  1.28147493e-13, ...,
        -2.17326157e-14, -4.03271166e-14, -4.84855212e-15],
       [ 1.33226763e-15, -8.88178420e-15,  7.79376563e-14, ...,
         3.72479825e-14,  2.97747937e-14,  2.44179676e-14],
       ...,
       [ 0.00000000e+00,  1.12132525e-14, -7.99360578e-15, ...,
         3.81639165e-15,  3.92047506e-16, -1.57859836e-15],
       [ 1.41553436e-15,  3.66373598e-15, -4.32986980e-15, ...,
         1.33226763e-15,  3.09474668e-15,  3.53883589e-15],
       [ 1.22124533e-15, -8.32667268e-16,  3.21964677e-15, ...,
        -1.44328993e-15,  3.73832909e-16, -4.57966998e-16]])

Applying Principal Component Analysis doesn't change the features except for some sign flips (due to the ambiguity in choosing a sign for the eigenvectors), so this does confirm that the given data has been PCA transformed.
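
The same idea can be reproduced on synthetic data. This is a minimal sketch that implements PCA with a NumPy SVD rather than scikit-learn's `PCA`; `pca_transform` and the random data are assumptions made for illustration. Applying the transform a second time should change nothing except possibly the sign of each column.

```python
import numpy as np

def pca_transform(X):
    """Project centered data onto its principal axes (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T

rng = np.random.default_rng(0)
# Synthetic data: independent columns with clearly different spreads...
base = rng.normal(size=(500, 4)) * np.array([5.0, 3.0, 2.0, 1.0])
# ...rotated by a random orthogonal matrix so the raw columns are correlated.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
raw = base @ Q

once = pca_transform(raw)
twice = pca_transform(once)

# Compare absolute values, because each component's sign is arbitrary.
same_up_to_sign = bool(np.allclose(np.abs(once), np.abs(twice), atol=1e-8))
print(same_up_to_sign)
```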

In [7]:
n_PCA = 28 # number of features to keep after PCA
features_PCA = ['V' + str(i) for i in range(1,n_PCA+1)]
print(features_PCA)

['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28']


Next I will do supervised feature selection by choosing the features that are most strongly associated with the class label. This can be done with the SelectKBest class from scikit-learn. Here I use f_classif (the ANOVA F-value) as the score function, because chi2 requires non-negative features. Optionally, an ExtraTreesClassifier can be fitted instead and its feature importances used for the ranking.
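
To make the scoring concrete, here is a minimal NumPy sketch of the one-way ANOVA F statistic, the quantity that `f_classif` reports per feature. It is not scikit-learn's implementation; `anova_f_scores` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each column of X, grouped by label y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((group - group_mean) ** 2).sum(axis=0)
    # Ratio of between-class to within-class mean squares.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 200)
X_demo = rng.normal(size=(400, 3))
X_demo[:, 1] += 3.0 * y_demo  # only column 1 depends on the label

scores = anova_f_scores(X_demo, y_demo)
best_column = int(np.argmax(scores))
print(best_column)
```

A feature whose class means are far apart relative to its within-class spread gets a large score, which is why the shifted column wins here.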

In [8]:
X = array[:,1:29]
y = array[:,30]
select = SelectKBest(score_func=f_classif, k = 'all')
fit = select.fit(X, y)
scores_SKB = pd.DataFrame(fit.scores_)

n_selectK = 15 # number of features to keep after SelectK
features_selectK = ['V'+ str(i+1) for i in scores_SKB.nlargest(n_selectK,0).index]
print(features_selectK)

#model = ExtraTreesClassifier()
#model.fit(X,y)
#scores_ET = pd.DataFrame(model.feature_importances_)

#n_ET = 15 # number of important features to keep after ExtraTrees
#features_ET = ['V'+ str(i+1) for i in scores_ET.nlargest(n_ET,0).index]
#print(features_ET)


['V17', 'V14', 'V12', 'V10', 'V16', 'V3', 'V7', 'V11', 'V4', 'V18', 'V1', 'V9', 'V5', 'V2', 'V6']


In [9]:
kept_features = list(set(features_selectK).intersection(features_PCA))
print(kept_features)

['V11', 'V2', 'V7', 'V12', 'V4', 'V14', 'V1', 'V3', 'V5', 'V6', 'V9', 'V10', 'V17', 'V16', 'V18']
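
Because `set.intersection` returns an unordered set, the printed `kept_features` no longer follow the SelectKBest ranking. If the ranking matters downstream, one possible variation (not what the cell above does) is to filter the ranked list instead. The list below is copied from the SelectKBest output above.

```python
# Ranked list printed by the SelectKBest cell above (best score first).
features_selectK = ['V17', 'V14', 'V12', 'V10', 'V16', 'V3', 'V7', 'V11',
                    'V4', 'V18', 'V1', 'V9', 'V5', 'V2', 'V6']
features_PCA = ['V' + str(i) for i in range(1, 29)]

# Filter the ranked list instead of intersecting sets, so the ranking survives.
allowed = set(features_PCA)
kept_in_rank_order = [f for f in features_selectK if f in allowed]
print(kept_in_rank_order)
```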


In [10]:
dropped_features = ['V'+str(i) for i in range(1,29) if ('V'+str(i)) not in kept_features]
#print(dropped_features)
df = df.drop(dropped_features, axis=1)

In [11]:
# prepare training and testing data

df.rename(columns={'Class':'Fraud'}, inplace=True)
df['NonFraud'] = 1-df['Fraud']

X = df.drop(['Fraud','NonFraud'], axis=1)
y = df[['Fraud','NonFraud']]

# normalize the data to help the optimization
for col in X.columns:
    temp = X.loc[:,col]
    mean, std = temp.mean(), temp.std()
    X.loc[:,col] = (temp - mean) / std

# shuffle and split data into training and testing sets
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for idx_train, idx_test in sss.split(X, y):
    X_train = X.iloc[idx_train]
    y_train = y.iloc[idx_train]
    X_test = X.iloc[idx_test]
    y_test = y.iloc[idx_test]
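
The idea behind a stratified shuffle split can be illustrated without scikit-learn. This is a minimal NumPy sketch, not the `StratifiedShuffleSplit` implementation; `stratified_split` and the toy labels are hypothetical. Each class is shuffled separately and the same fraction of each class goes to the test set, so the rare class keeps its share in both parts.

```python
import numpy as np

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with roughly equal class ratios in both."""
    rng = np.random.default_rng(seed)
    test_parts = []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_test = int(round(test_size * idx.size))
        test_parts.append(idx[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_mask = np.ones(labels.size, dtype=bool)
    train_mask[test_idx] = False
    return np.flatnonzero(train_mask), test_idx

# Toy labels: 990 negatives and 10 positives (1% positive class).
labels = np.array([0] * 990 + [1] * 10)
train_idx, test_idx = stratified_split(labels, test_size=0.3)

print(labels[train_idx].mean(), labels[test_idx].mean())
```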

## Train and test the neural network

In [16]:
n_epochs = 200 # number of training epochs
steps_to_check = 5

batch_size = 1000
learning_rate = 0.01
keep_prob = 0.9 # for dropout method, to reduce overfitting

n_nodes0 = len(kept_features) + 2 # number of input nodes
n_nodes1 = 20 # number of nodes in the 1st hidden layer
n_nodes2 = 25 # number of nodes in the 2nd hidden layer

# set up the variables
pkeep = tf.placeholder(tf.float32)
x = tf.placeholder(tf.float32, [None, n_nodes0])

W1 = tf.Variable(tf.random_normal([n_nodes0, n_nodes1], stddev=0.1))
b1 = tf.Variable(tf.zeros([n_nodes1]))
y1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)

W2 = tf.Variable(tf.random_normal([n_nodes1, n_nodes2], stddev=0.1))
b2 = tf.Variable(tf.zeros([n_nodes2]))
y2 = tf.nn.sigmoid(tf.matmul(y1, W2) + b2)
y2 = tf.nn.dropout(y2, pkeep)

W3 = tf.Variable(tf.random_normal([n_nodes2, 2], stddev=0.1)) 
b3 = tf.Variable(tf.zeros([2]))
y = tf.nn.softmax(tf.matmul(y2, W3) + b3)
y_ = tf.placeholder(tf.float32, [None, 2])

# use cross entropy as cost function
cost_function = -tf.reduce_sum(y_*tf.log(y))

optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost_function)
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
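
The cell above uses the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`). To show what the graph computes, here is a minimal NumPy sketch of the same forward pass and cross-entropy cost with randomly initialized weights. It is an illustration only: dropout is omitted (equivalent to `pkeep = 1`), and the demo inputs and names such as `x_demo` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row maximum before exponentiating to avoid overflow.
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 17, 20, 25, 2  # 15 kept features + Time + Amount

# Same shapes and initial scale (stddev=0.1) as the TensorFlow variables.
W1, b1 = rng.normal(scale=0.1, size=(n_in, n_h1)), np.zeros(n_h1)
W2, b2 = rng.normal(scale=0.1, size=(n_h1, n_h2)), np.zeros(n_h2)
W3, b3 = rng.normal(scale=0.1, size=(n_h2, n_out)), np.zeros(n_out)

x_demo = rng.normal(size=(4, n_in))           # four made-up input rows
y_true = np.array([[0, 1], [0, 1], [1, 0], [0, 1]], dtype=float)  # [Fraud, NonFraud]

h1 = sigmoid(x_demo @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
probs = softmax(h2 @ W3 + b3)

# Summed cross entropy, as in cost_function above.
cost = -np.sum(y_true * np.log(probs))
print(probs.shape, cost)
```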

In [None]:
with tf.Session() as sess:
    # initialize the variables
    sess.run(tf.global_variables_initializer())
    
    # train the network
    for epoch in range(n_epochs): 
        for batch in range(int(len(X_train)/batch_size)):
            X_batch = X_train[batch*batch_size:(1+batch)*batch_size]
            y_batch = y_train[batch*batch_size:(1+batch)*batch_size]
            sess.run([optimizer], {x: X_batch, y_: y_batch, pkeep: keep_prob})

        # check cost and accuracy every steps_to_check epochs (dropout disabled for evaluation)
        if epoch % steps_to_check == 0:
            acc, cost = sess.run([accuracy, cost_function], 
                                 {x: X_train, y_: y_train, pkeep: 1.0})
            print('Epoch: %4d    Cost: %8.2f    Accuracy: %.5f' %(epoch, cost, acc))
    
    print('\nTraining completed.\n')
    print('Testing the model on test data.\n')
    
    # test on test data
    output = sess.run(y, {x: X_test, y_: y_test, pkeep:1})
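
The training loop above runs `int(len(X_train)/batch_size)` batches per epoch, so rows beyond the last full batch are never used. A possible variation, sketched here with a hypothetical `batch_slices` helper, also yields the final short batch.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) pairs covering all rows, including a final short batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# Example with 2500 rows and batches of 1000: the last batch has 500 rows.
slices = list(batch_slices(2500, 1000))
print(slices)
```

Each pair can be used as `X_train[start:stop]` and `y_train[start:stop]` in place of the manual index arithmetic.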

In [None]:
target = (y_test['Fraud'] > 0).astype(int)
predict = (output[:,0] > output[:,1]).astype(int)
from evaluation import confusion
confusion(predict,target)
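
The `evaluation` module is local to this project and is not shown here. As an assumption about what such a helper might report, the sketch below computes the confusion counts plus precision and recall for the positive class with NumPy; the function names and demo arrays are hypothetical.

```python
import numpy as np

def confusion_counts(predict, target):
    """Return (tp, fp, fn, tn) for binary 0/1 predictions and targets."""
    predict = np.asarray(predict).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(predict & target))
    fp = int(np.sum(predict & ~target))
    fn = int(np.sum(~predict & target))
    tn = int(np.sum(~predict & ~target))
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class (NaN if undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    recall = tp / (tp + fn) if (tp + fn) else float('nan')
    return precision, recall

predict_demo = np.array([1, 0, 1, 0, 0, 1])
target_demo = np.array([1, 0, 0, 0, 1, 1])

tp, fp, fn, tn = confusion_counts(predict_demo, target_demo)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall)
```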

There are many parameters to tune in order to optimize the performance of the model (the size of the neural network, the number of training epochs, the number of important features to keep, etc.). Generally the accuracy is above 99.92%. The recall and precision for the positive class (fraud) can be around 80%, and those for the negative class (non-fraud) are above 99.96%.
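
To put the accuracy figure in context, it helps to compare it with a trivial classifier that always predicts non-fraud. The arithmetic below uses only the mean of `Class` from the `describe()` output and the 99.92% lower bound quoted above; the variable names are illustrative.

```python
class_mean = 0.001727        # fraction of fraud, from the describe() output
reported_accuracy = 0.9992   # lower bound quoted in the text above

# A classifier that always answers "non-fraud" is wrong only on fraud rows.
baseline_accuracy = 1.0 - class_mean

# Misclassified transactions per 100,000 for each case.
baseline_errors = class_mean * 100_000
model_errors = (1.0 - reported_accuracy) * 100_000

print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"errors per 100k: baseline {baseline_errors:.1f}, model at most {model_errors:.1f}")
```

Since the baseline already scores very high on accuracy, precision and recall for the fraud class are the more informative numbers on data this imbalanced.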