#  2013 Credit Card Fraud

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.

### Importing Basic Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Reading In Credit Card Data

In [2]:
df = pd.read_csv('creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [3]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64
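
The class counts above can be turned into the imbalance figure quoted in the introduction. This is a minimal sketch in which the two counts are copied by hand from the output rather than read from `df`; the variable names are only illustrative.

```python
# Counts copied from the value_counts() output above.
n_normal = 284315
n_fraud = 492

n_total = n_normal + n_fraud
fraud_pct = 100 * n_fraud / n_total  # share of fraud rows, in percent

print(f"{n_fraud} frauds out of {n_total:,} transactions ({fraud_pct:.3f}%)")
```

The result is roughly 0.17%, in line with the figure given in the dataset description.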

### Balancing Target Column

In [4]:
from sklearn.utils import resample

In [9]:
# Separate majority and minority classes
df_majority = df[df.Class==0]
df_minority = df[df.Class==1]

In [10]:
# Upsample the minority class (sampling with replacement) to match the majority class size
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=123)

In [11]:
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [13]:
# Display new class counts
df_upsampled.Class.value_counts()

1    284315
0    284315
Name: Class, dtype: int64
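
One thing worth checking with this approach is the order of upsampling and splitting. Upsampling with replacement creates exact copies of fraud rows, so a train/test split done afterwards can place copies of the same row on both sides. The sketch below illustrates this with hypothetical toy rows and the standard library only; `split` and `count_leaked` are made-up helper names, not part of the notebook or of scikit-learn.

```python
import random

random.seed(0)

# Hypothetical toy rows: each row is just a (label, id) pair.
normal_rows = [("normal", i) for i in range(1000)]
fraud_rows = [("fraud", i) for i in range(10)]


def split(rows, test_fraction=0.33):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    random.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def count_leaked(train, test):
    """Count test rows that also occur in the training rows."""
    train_set = set(train)
    return sum(1 for row in test if row in train_set)


# Order A: upsample the minority class first, then split.
fraud_upsampled = random.choices(fraud_rows, k=len(normal_rows))
train_a, test_a = split(normal_rows + fraud_upsampled)
leaked_a = count_leaked(train_a, test_a)

# Order B: split each class first, then upsample only the training fraud rows.
train_normal, test_normal = split(normal_rows)
train_fraud, test_fraud = split(fraud_rows)
train_b = train_normal + random.choices(train_fraud, k=len(train_normal))
test_b = test_normal + test_fraud
leaked_b = count_leaked(train_b, test_b)

print("upsample then split: leaked test rows =", leaked_a)
print("split then upsample: leaked test rows =", leaked_b)
```

In order B the test set keeps its original imbalance and shares no rows with the training set, so scores measured on it may be a more conservative estimate than scores measured after order A.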

## Logistic Regression Model

In [32]:
# sklearn.cross_validation was removed in newer scikit-learn releases; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [26]:
lr = LogisticRegression()

In [29]:
# Separate input features (X) and target variable (y)
y = df_upsampled.Class
X = df_upsampled.drop('Class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Train model
Lrc = LogisticRegression().fit(X_train, y_train)
 
# Estimate accuracy with 10-fold cross-validation on the full upsampled data
cross_val_score(Lrc, X, y, cv=10)

array([ 0.87315349,  0.92297411,  0.92772228,  0.93384215,  0.94479811,
        0.94108543,  0.9457986 ,  0.9443741 ,  0.94970279,  0.95817945])
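
The ten fold scores are easier to compare across models once they are reduced to a mean and a spread. A minimal sketch, assuming the scores are accuracies (the default for a classifier in `cross_val_score`) and copying them by hand from the output above:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
fold_scores = [
    0.87315349, 0.92297411, 0.92772228, 0.93384215, 0.94479811,
    0.94108543, 0.9457986, 0.9443741, 0.94970279, 0.95817945,
]

mean_score = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across folds

print(f"mean accuracy: {mean_score:.3f} (+/- {spread:.3f})")
```

In the notebook itself the same summary could be obtained by calling `.mean()` and `.std()` on the returned NumPy array.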

### Logistic Regression Metrics and Analysis

In [28]:
from sklearn.metrics import classification_report,confusion_matrix

In [35]:
lr_pred = Lrc.predict(X_test)

In [39]:
print(confusion_matrix(y_test, lr_pred))
print('\n')
print(classification_report(y_test, lr_pred))

[[91598  2418]
 [ 9271 84361]]


             precision    recall  f1-score   support

          0       0.91      0.97      0.94     94016
          1       0.97      0.90      0.94     93632

avg / total       0.94      0.94      0.94    187648



The logistic regression model performs well on the test set, but it leaves room for improvement: it mislabels 9,271 fraudulent transactions as normal and 2,418 normal transactions as fraudulent. Because the data has already had PCA applied to it, we cannot use the coefficients to rank the original features, which removes one of this model's strengths. Let's try a random forest model and see if we can improve the results.
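
The figures in the classification report follow directly from the four cells of the confusion matrix. The sketch below recomputes the fraud-class (class 1) metrics by hand, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes; the counts are copied from the output above.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
tn, fp = 91598, 2418   # actual normal: predicted normal, predicted fraud
fn, tp = 9271, 84361   # actual fraud:  predicted normal, predicted fraud

precision = tp / (tp + fp)   # of the rows flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual fraud rows, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these match the class 1 row of the report (0.97, 0.90, 0.94).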

## Random Forest Model

In [24]:
from sklearn.ensemble import RandomForestClassifier

In [25]:
# Separate input features (X) and target variable (y)
y = df_upsampled.Class
X = df_upsampled.drop('Class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Train model
Rfc = RandomForestClassifier()
Rfc.fit(X_train, y_train)
 
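# Estimate accuracy with 10-fold cross-validation on the full upsampled data
# (cross_val_score clones and refits the estimator for each fold, so no extra fit is needed)
cross_val_score(Rfc, X, y, cv=10)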

array([ 0.9738323 ,  0.99992966,  0.99996483,  1.        ,  0.99992966,
        0.99984172,  0.99996483,  1.        ,  0.99998241,  0.99996483])

### Random Forest Metrics and Analysis

In [37]:
rf_pred = Rfc.predict(X_test)

In [40]:
print(confusion_matrix(y_test, rf_pred))
print('\n')
print(classification_report(y_test, rf_pred))

[[94011     5]
 [    0 93632]]


             precision    recall  f1-score   support

          0       1.00      1.00      1.00     94016
          1       1.00      1.00      1.00     93632

avg / total       1.00      1.00      1.00    187648



As we can see, the random forest model is extremely accurate, misclassifying only 5 out of 187,648 cases. All 5 were Type I errors (false positives), meaning our model caught every fraudulent transaction but flagged 5 normal transactions as fraudulent. Overall, I think the random forest model is a great fit for this situation: because the values we were working with already had PCA applied to them, we couldn't tell which original features had the most influence on the target anyway, which makes the random forest's black-box problem irrelevant.
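
The Type I and Type II error rates mentioned above can be read off the confusion matrix in the same way. A minimal sketch, with the counts copied by hand from the random forest output and the same rows-are-actual convention assumed:

```python
# Counts copied from the random forest confusion matrix above.
tn, fp = 94011, 5      # actual normal: predicted normal, predicted fraud
fn, tp = 0, 93632      # actual fraud:  predicted normal, predicted fraud

false_positive_rate = fp / (fp + tn)   # Type I: normal flagged as fraud
false_negative_rate = fn / (fn + tp)   # Type II: fraud missed

print(f"Type I  (false positive) rate: {false_positive_rate:.6f}")
print(f"Type II (false negative) rate: {false_negative_rate:.6f}")
```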