# Credit Card Fraud

This notebook uses data from Kaggle to practice supervised learning on data that has already been transformed with PCA. The dataset can be found [here](https://www.kaggle.com/dalpozz/creditcardfraud).

In [1]:
import pandas as pd
frame = pd.read_csv("./data/creditcard.csv")
frame.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Initial impressions

The data in this CSV doesn't represent recognizable signals from the relevant transactions; each "V" column is already a principal component produced by PCA, and the original features are intentionally withheld for confidentiality reasons. This means I have no way to develop any intuition about what's important just from knowing the domain, which is kind of a neat new challenge.
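To make the idea of a "PCA'd" column concrete, here is a minimal sketch of projecting data onto its principal components with NumPy. The `raw` matrix is a made-up stand-in for the undisclosed original features, and `pca_components` is a hypothetical helper, not the procedure the dataset's authors used. Components produced this way are centered on zero and ordered by decreasing spread.

```python
import numpy as np

def pca_components(data):
    """Project the rows of `data` onto its principal directions using an SVD."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt.T

rng = np.random.default_rng(0)
# Hypothetical correlated "raw" features (invented for illustration).
raw = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))
components = pca_components(raw)

component_means = components.mean(axis=0)  # all approximately zero
component_stds = components.std(axis=0)    # non-increasing from first to last
print(component_means)
print(component_stds)
```

Near-zero means and a steadily shrinking standard deviation are the pattern to look for when summarizing the V columns.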

In [2]:
frame.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.16598e-15,3.416908e-16,-1.37315e-15,2.086869e-15,9.604066e-16,1.490107e-15,-5.556467e-16,1.177556e-16,-2.406455e-15,...,1.656562e-16,-3.44485e-16,2.578648e-16,4.471968e-15,5.340915e-16,1.687098e-15,-3.666453e-16,-1.220404e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


There *are* two labeled columns in the dataset: one is the time elapsed since the start of the interval this dataset covers, and the other is the value of the transaction. Let's take a minute and see if either of them is noticeably different between fraudulent and regular transactions.

In [3]:
frame.Amount[frame.Class == 1].describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

In [4]:
frame.Amount[frame.Class == 0].describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

For amount, unsurprisingly, fraudulent transactions span a much narrower range: the largest is 2,125.87, versus 25,691.16 for valid transactions. I suppose that, to a fraudster, large transactions feel likely to get noticed.
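The same side-by-side comparison can be produced in a single call with `groupby`. This is a minimal sketch on a tiny made-up frame; the `toy` values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Tiny made-up frame that reuses the notebook's column names.
toy = pd.DataFrame({
    "Amount": [1.0, 9.0, 100.0, 5.0, 22.0, 80.0, 2500.0],
    "Class":  [1, 1, 1, 0, 0, 0, 0],
})

# One row per class, one column per statistic.
summary = toy.groupby("Class")["Amount"].agg(["count", "median", "max"])
print(summary)
```

On the real data, the equivalent call would be `frame.groupby("Class")["Amount"].describe()`.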

In [5]:
frame.Time[frame.Class == 1].describe()

count       492.000000
mean      80746.806911
std       47835.365138
min         406.000000
25%       41241.500000
50%       75568.500000
75%      128483.000000
max      170348.000000
Name: Time, dtype: float64

In [6]:
frame.Time[frame.Class == 0].describe()

count    284315.000000
mean      94838.202258
std       47484.015786
min           0.000000
25%       54230.000000
50%       84711.000000
75%      139333.000000
max      172792.000000
Name: Time, dtype: float64

At a glance, there isn't much of interest in the time column. However, the classes are very unbalanced: there are about 284,000 valid transactions and just under 500 fraudulent ones. A purely random train/test split could easily leave one of the sets with very few fraud examples. We should split each class separately and control how many of each go into each set.
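As an aside, scikit-learn can enforce the same per-class control through the `stratify` argument of `train_test_split`. The sketch below uses synthetic data (the `features` and `labels` arrays are invented), and it is an alternative to the manual per-class split used in this notebook, not a description of it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(10000, 3))
labels = np.zeros(10000, dtype=int)
labels[:20] = 1  # 0.2% positives, loosely mimicking the imbalance

# stratify=labels keeps the class ratio (approximately) equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.15, stratify=labels, random_state=0
)
print("positives in train:", y_train.sum())
print("positives in test:", y_test.sum())
```

The manual approach that follows gives finer control, because each class can then be over- or undersampled independently.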

In [7]:
fraud = frame[frame.Class == 1]
valid = frame[frame.Class == 0]
len(fraud)

492

In [8]:
training_percentage = 0.85
validation_percentage_of_test = 0.3

fraud_train = fraud.sample(frac=training_percentage)
fraud_train_inverse = fraud.drop(fraud_train.index)
fraud_validation = fraud_train_inverse.sample(frac=validation_percentage_of_test)
fraud_test = fraud_train_inverse.drop(fraud_validation.index)
# oversample the minority class: 4 * 4 * 4 = 64 copies of every training fraud row
fraud_train = pd.concat([fraud_train] * 64)


valid_train = valid.sample(frac=training_percentage)
valid_train_inverse = valid.drop(valid_train.index)
valid_validation = valid_train_inverse.sample(frac=validation_percentage_of_test)
valid_test = valid_train_inverse.drop(valid_validation.index)
# undersample the majority class
valid_train = valid_train.sample(frac=0.5)
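It is worth working out what class balance the resampling above produces. The arithmetic below assumes that `sample(frac=...)` rounds `frac * len(df)` to the nearest whole number of rows; the class counts 492 and 284,315 come from the summaries earlier in this notebook.

```python
n_fraud, n_valid = 492, 284315
training_percentage = 0.85

# Fraud: take 85% of the rows, then repeat them 4 * 4 * 4 = 64 times.
fraud_train_rows = round(n_fraud * training_percentage)
oversampled_fraud = fraud_train_rows * 4 * 4 * 4

# Valid: take 85% of the rows, then keep half of those.
valid_train_rows = round(n_valid * training_percentage)
undersampled_valid = round(valid_train_rows * 0.5)

total_train_rows = oversampled_fraud + undersampled_valid
fraud_share = oversampled_fraud / total_train_rows
print(oversampled_fraud, undersampled_valid, total_train_rows)
print(round(fraud_share, 3))
```

Under that rounding assumption the training set has 147,586 rows, of which roughly 18% are fraud, compared with about 0.17% in the raw data.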


And now they can be concatenated into complete sets and shuffled:

In [9]:
from sklearn.utils import shuffle
train_features = shuffle(pd.concat([fraud_train, valid_train]))
test_features = shuffle(pd.concat([fraud_test, valid_test]))
validation_features = shuffle(pd.concat([fraud_validation, valid_validation]))

Now we can pull out the labels and drop them from the feature frames.

In [10]:
train_labels = train_features.Class
test_labels = test_features.Class
validation_labels = validation_features.Class
train_features = train_features.drop(['Class'], axis=1)
test_features = test_features.drop(['Class'], axis=1)
validation_features = validation_features.drop(['Class'], axis=1)

In [11]:
train_features.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
count,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,...,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0,147586.0
mean,92083.879081,-0.822654,0.63129,-1.221204,0.805675,-0.52694,-0.253155,-0.957899,0.093502,-0.450655,...,0.065634,0.11251,0.002233,-0.010763,-0.019136,0.007926,0.010054,0.029891,0.015904,93.021534
std,47934.8436,3.806442,2.705999,4.197309,2.456529,2.813501,1.53208,3.825485,3.017389,1.735221,...,0.922374,1.715664,0.900441,0.894272,0.590721,0.580118,0.478784,0.681384,0.376441,248.840715
min,2.0,-46.855047,-63.344698,-32.965346,-5.519697,-42.147898,-23.496714,-43.557242,-73.216718,-13.434066,...,-28.009635,-34.830382,-10.933144,-30.26972,-2.822684,-7.495741,-2.53433,-22.565679,-11.710896,0.0
25%,51426.25,-1.298443,-0.477496,-1.615401,-0.685276,-0.934655,-0.971244,-0.892838,-0.209479,-1.026896,...,-0.206396,-0.215443,-0.545816,-0.182292,-0.372216,-0.317991,-0.314087,-0.069208,-0.054008,3.79
50%,83243.5,-0.2725,0.250864,-0.168438,0.285916,-0.148289,-0.38844,-0.083415,0.057332,-0.201091,...,-0.039046,0.012646,0.005741,-0.016927,0.028828,0.0305,-0.03968,0.01437,0.017171,20.0
75%,137769.5,1.233204,1.145381,0.855995,1.384394,0.579607,0.296348,0.49381,0.459822,0.478908,...,0.215385,0.278251,0.53963,0.161528,0.41087,0.368708,0.272505,0.169662,0.111651,84.0
max,172788.0,2.45493,22.057729,9.382558,16.715537,34.099309,22.529298,36.877368,20.007208,15.594995,...,39.420904,27.202839,10.50309,19.228169,4.022866,7.519589,3.220178,12.152401,22.620072,19656.53


Scale the features so they all have comparable variance.

In [12]:
cols = train_features.columns.values

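for col_name in cols:
    # standardize every split with the same per-column mean and standard deviation
    avg, dev = frame[col_name].mean(), frame[col_name].std()
    train_features.loc[:, col_name] = (train_features[col_name] - avg) / dev
    test_features.loc[:, col_name] = (test_features[col_name] - avg) / dev
    validation_features.loc[:, col_name] = (validation_features[col_name] - avg) / dev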
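A common variant of this step computes the statistics from the training split only and reuses them for the other splits, so that nothing about the held-out rows influences preprocessing. The sketch below is one way to write that; `standardize` and the toy frames are hypothetical names and values introduced for illustration, and this differs from the cell above, which takes its statistics from the full `frame`.

```python
import pandas as pd

def standardize(train, others):
    """Scale `train` to zero mean and unit variance, and apply the same shift and scale to `others`."""
    means = train.mean()
    stds = train.std()
    scaled_train = (train - means) / stds
    scaled_others = [(df - means) / stds for df in others]
    return scaled_train, scaled_others

# Made-up values, only to exercise the helper.
toy_train = pd.DataFrame({"Time": [0.0, 10.0, 20.0, 30.0],
                          "Amount": [1.0, 2.0, 3.0, 6.0]})
toy_test = pd.DataFrame({"Time": [15.0], "Amount": [3.0]})

scaled_train, (scaled_test,) = standardize(toy_train, [toy_test])
print(scaled_train)
print(scaled_test)
```

Because the test row here sits exactly at the training means, it maps to zero in both columns.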

Build model

In [61]:
import numpy as np

# convert the pandas objects to NumPy arrays (as_matrix() was removed in pandas 1.0)
trainX = train_features.values
testX = test_features.values
validX = validation_features.values

# reshape the labels into column vectors to match the network's [None, 1] output
trainY = np.reshape(train_labels.values, (-1, 1))
testY = np.reshape(test_labels.values, (-1, 1))
validY = np.reshape(validation_labels.values, (-1, 1))

In [71]:
# layer size configuration
input_nodes = len(trainX[0])
hidden_layer_1 = 200
hidden_layer_2 = 50
hidden_layer_3 = 13
output_nodes = 1
dropout_keep_percentage = 0.75
learning_rate = 0.01
training_epochs = 2000
display_step = 100
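The `dropout_keep_percentage` setting is the probability that each unit survives a training step. A minimal NumPy sketch of inverted dropout, the scheme in which surviving activations are scaled up by `1 / keep_prob` so their expected value is unchanged, might look like this. The `dropout` helper is a hypothetical illustration, not TensorFlow's implementation.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return np.where(mask, activations / keep_prob, 0.0)

rng = np.random.default_rng(0)
activations = np.ones((1000, 13))

dropped = dropout(activations, 0.75, rng)   # training-time behaviour
untouched = dropout(activations, 1.0, rng)  # keep_prob = 1.0 disables dropout
print(dropped.mean())
```

This is why the evaluation calls later in the notebook feed `keep_prob: 1.0`: every unit stays active when measuring accuracy.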

In [72]:
import tensorflow as tf
x_inputs = tf.placeholder(tf.float32, [None, input_nodes], name="x_inputs")
y_outputs = tf.placeholder(tf.float32,[None,1], name="y_outputs")
keep_prob = tf.placeholder(tf.float32, name="keep_prob")

weights1 = tf.Variable(tf.truncated_normal([input_nodes, hidden_layer_1], stddev=0.1))
bias1 = tf.Variable(tf.zeros([hidden_layer_1]))
hidden1 = tf.nn.relu(tf.matmul(x_inputs, weights1) + bias1)

weights2 = tf.Variable(tf.truncated_normal([hidden_layer_1, hidden_layer_2], stddev=0.1))
bias2 = tf.Variable(tf.zeros([hidden_layer_2]))
hidden2 = tf.nn.relu(tf.matmul(hidden1, weights2) + bias2)

weights3 = tf.Variable(tf.truncated_normal([hidden_layer_2, hidden_layer_3], stddev=0.1))
bias3 = tf.Variable(tf.zeros([hidden_layer_3]))
hidden3 = tf.nn.relu(tf.matmul(hidden2, weights3) + bias3)
hidden3 = tf.nn.dropout(hidden3, keep_prob)

output_weights = tf.Variable(tf.truncated_normal([hidden_layer_3, output_nodes], stddev=0.1))
output_bias = tf.Variable(tf.zeros([output_nodes]))
output = tf.nn.sigmoid(tf.matmul(hidden3, output_weights) + output_bias)
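To check the shapes in this architecture without TensorFlow, the forward pass can be sketched in NumPy. This assumes 30 input features (Time, V1 through V28, and Amount), uses a plain normal initializer instead of a truncated one, and leaves out dropout, which matches evaluation with `keep_prob` set to 1.0. The helper names are hypothetical.

```python
import numpy as np

layer_sizes = [30, 200, 50, 13, 1]  # input, three hidden layers, output

def init_params(sizes, rng):
    """Create one (weights, bias) pair per layer transition."""
    return [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    for weights, bias in params[:-1]:
        x = np.maximum(x @ weights + bias, 0.0)
    weights, bias = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

rng = np.random.default_rng(0)
params = init_params(layer_sizes, rng)
batch = rng.normal(size=(5, 30))  # five made-up input rows
probabilities = forward(batch, params)
n_parameters = sum(w.size + b.size for w, b in params)
print(probabilities.shape, n_parameters)
```

With these layer sizes the network has 16,927 trainable parameters, and each input row maps to a single value between 0 and 1.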

In [73]:
cost = tf.reduce_mean(tf.square(output - y_outputs))
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
correct_prediction = tf.equal(tf.round(output), tf.round(y_outputs))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
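Accuracy on its own can be misleading with classes this unbalanced. As a rough illustration, the sketch below scores a trivial classifier that labels every transaction as valid, using the class counts of the full dataset (284,315 valid and 492 fraudulent). Recall on the fraud class is included as one possible complementary metric; it is not something the notebook computes.

```python
import numpy as np

n_valid, n_fraud = 284315, 492
labels = np.concatenate([np.zeros(n_valid), np.ones(n_fraud)])
predictions = np.zeros_like(labels)  # always predict "valid"

accuracy = np.mean(predictions == labels)
true_positives = np.sum((predictions == 1) & (labels == 1))
recall = true_positives / np.sum(labels == 1)
print(accuracy, recall)
```

The trivial classifier scores above 99.8% accuracy while catching no fraud at all, so any accuracy measured on data with the natural class ratio should be read against that baseline.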

In [74]:
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)


In [75]:
print('epoch: ', end="")
for i in range(training_epochs):  
    print(i, '...', end="")
    sess.run([optimizer], feed_dict={x_inputs: trainX, y_outputs: trainY, keep_prob: dropout_keep_percentage})
    if (i) % display_step == 0:
        train_acc, train_cost = sess.run([accuracy, cost], feed_dict={x_inputs: trainX, y_outputs: trainY, keep_prob: 1.0})
        test_acc = sess.run([accuracy], feed_dict={x_inputs: testX, y_outputs: testY, keep_prob: 1.0})
        print("TRAINING:", train_acc)
        print("COST:", train_cost)
        print("TEST:", test_acc)
        print('epoch: ', end="")

epoch: 0 ...TRAINING: 0.568279
COST: 0.247112
TEST: [0.99789333]
epoch: 1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 ...12 ...13 ...14 ...15 ...16 ...17 ...18 ...19 ...20 ...21 ...22 ...23 ...24 ...25 ...26 ...27 ...28 ...29 ...30 ...31 ...32 ...33 ...34 ...35 ...36 ...37 ...38 ...39 ...40 ...41 ...42 ...43 ...44 ...45 ...46 ...47 ...48 ...49 ...50 ...51 ...52 ...53 ...54 ...55 ...56 ...57 ...58 ...59 ...60 ...61 ...62 ...63 ...64 ...65 ...66 ...67 ...68 ...69 ...70 ...71 ...72 ...73 ...74 ...75 ...76 ...77 ...78 ...79 ...80 ...81 ...82 ...83 ...84 ...85 ...86 ...87 ...88 ...89 ...90 ...91 ...92 ...93 ...94 ...95 ...96 ...97 ...98 ...99 ...100 ...TRAINING: 0.900953
COST: 0.219499
TEST: [0.99792677]
epoch: 101 ...102 ...103 ...104 ...105 ...106 ...107 ...108 ...109 ...110 ...111 ...112 ...113 ...114 ...115 ...116 ...117 ...118 ...119 ...120 ...121 ...122 ...123 ...124 ...125 ...126 ...127 ...128 ...129 ...130 ...131 ...132 ...133 ...134 ...135 ...136 ...137 ...138 ...139 ...140