# Credit Card Fraud Detection

### Goal

Train a model to predict fraudulent credit card transactions.

### Tasks
1. Load [data from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud/downloads/creditcardfraud.zip) and do exploratory data analysis to understand our data
2. Clean data and prepare it in X (2-d array), y (1-d array) format for modeling
3. Select algorithm and train model
4. Evaluate model
5. Tune/improve model
6. Use model to predict fraudulent transactions

### Approach
1. Sample the data so as to reduce the skew
2. Start with a logistic regression classifier
3. Repeat the modeling with other classification algorithms.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import imblearn

%matplotlib inline
pd.options.display.max_columns = 40

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

### Load data

In [2]:
df = pd.read_csv('./data/creditcard.csv')

### Exploratory data analysis

In [3]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Preparing our data for modeling

In [4]:
# .ix has been removed from pandas; .loc accepts the same boolean column masks
X = df.loc[:, df.columns != 'Class']
y = df.loc[:, df.columns == 'Class'].values.ravel()

[`.ravel()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) converts y, which starts out as a column vector of shape (n, 1), to a 1-dimensional array so that scikit-learn won't raise a `DataConversionWarning`. The code also works without `.values.ravel()`, but fitting a model then emits that warning.
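To make the shape change concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not part of the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({"Amount": [149.62, 2.69, 378.66], "Class": [0, 0, 1]})

# Selecting with a boolean column mask keeps the result 2-d: three rows, one column.
y_column = toy.loc[:, toy.columns == "Class"].values
print(y_column.shape)  # (3, 1)

# .ravel() flattens it to the 1-d shape scikit-learn expects for y.
y_flat = y_column.ravel()
print(y_flat.shape)  # (3,)
```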

In [5]:
### Split our data into train and test sets
X_train_v1, X_test_v1, y_train_v1, y_test_v1 = train_test_split(X, y, test_size = 0.25, random_state=0)

In [None]:
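from imblearn.under_sampling import RandomUnderSampler
import collections

# Newer imbalanced-learn releases no longer accept fit_sample() or return_indices=True;
# fit_resample() and the sample_indices_ attribute are their replacements.
rus = RandomUnderSampler()
X_undersampled, y_undersampled = rus.fit_resample(X, y)
idx_resampled = rus.sample_indices_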
print('length of X and y:', len(X_undersampled), len(y_undersampled))
print('Count of y values:', collections.Counter(y_undersampled))

In [None]:
X_train_v2, X_test_v2, y_train_v2, y_test_v2 = train_test_split(X_undersampled, y_undersampled, random_state=0)

In [None]:
X_train_v3, X_test_v3, y_train_v3, y_test_v3 = train_test_split(X_undersampled, y_undersampled, random_state=0)

## Iteration 1: Logistic regression model (with no sampling or thresholding)

### Train our model

### Evaluate our model
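Training and evaluation for this iteration could be sketched roughly as follows. This is a sketch under stated assumptions, not the notebook's own solution: it runs on synthetic data from `make_classification` (with about 2% positives) rather than on `creditcard.csv`, and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `logreg` are hypothetical. On the real data, `X_train_v1` and `y_train_v1` would take the place of the synthetic split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the card data: roughly 2% positives (an assumption).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit on the unsampled training data; predict() uses the default 0.5 threshold.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_tr, y_tr)
y_pred = logreg.predict(X_te)

# Accuracy is dominated by the majority class, so also look at recall
# and the confusion matrix to see how many positives were caught.
print("accuracy:", logreg.score(X_te, y_te))
print("recall:  ", recall_score(y_te, y_pred))
print(confusion_matrix(y_te, y_pred))
```

With heavily skewed classes, a high accuracy can coexist with a low recall, which is why the confusion matrix is worth printing alongside the score.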

## Iteration 2: Logistic regression model (with undersampled data)

To improve the model's ability to detect the fraud class, we can undersample the majority class so that cases of y=0 and y=1 are split 50-50 instead of roughly 99.8-0.2.

**`imblearn`** (imbalanced-learn) is a library that provides methods for this kind of undersampling.

In [33]:
from imblearn.under_sampling import RandomUnderSampler

import collections

In [34]:
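# Newer imbalanced-learn releases no longer accept fit_sample() or return_indices=True;
# fit_resample() and the sample_indices_ attribute are their replacements.
rus = RandomUnderSampler()
X_undersampled, y_undersampled = rus.fit_resample(X, y)
idx_resampled = rus.sample_indices_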
print('length of X and y:', len(X_undersampled), len(y_undersampled))
print('Count of y values:', collections.Counter(y_undersampled))

length of X and y: 984 984
Count of y values: Counter({1: 492, 0: 492})
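The balanced counts above come from keeping every minority-class row and a random, equally sized subset of the majority class. The idea can be sketched in plain NumPy as below. This illustrates the technique only and is not imbalanced-learn's actual implementation; the toy arrays `X_toy` and `y_toy` are hypothetical.

```python
import collections
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 1000 legitimate (0) and 20 fraudulent (1) transactions.
y_toy = np.array([0] * 1000 + [1] * 20)
X_toy = rng.normal(size=(len(y_toy), 3))

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep every minority row and an equally sized random sample of majority rows.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([kept_majority, minority_idx]))

X_bal, y_bal = X_toy[keep], y_toy[keep]
print(collections.Counter(y_bal.tolist()))
```

The price of the balance is that most majority-class rows are discarded, so the model trains on far less data.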


In [None]:
### Split our data into train and test sets using the smaller balanced dataset
### Tip: Don't overwrite X_train, X_test, y_train, y_test, so that you can still use these other variables later
### if you want to


### Train our model

### Evaluate our model

## Iteration 3: Logistic regression model (with GridSearchCV)
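One way this iteration might look is sketched below, again on synthetic data rather than `creditcard.csv`. The parameter grid, the choice of `scoring="recall"`, the stratified split, and all variable names are assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with about 5% positives (an assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validation on the training set, ranking candidates by recall.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="recall", cv=5
)
search.fit(X_tr, y_tr)

print("best params:   ", search.best_params_)
print("best CV recall:", search.best_score_)
# score() reuses the recall scorer with the refitted best estimator.
print("test recall:   ", search.score(X_te, y_te))
```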

## Iteration 4: Logistic regression model (with undersampled data and GridSearchCV)

## Iteration 5: Random Forest

In [50]:
from sklearn.ensemble import RandomForestClassifier
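A minimal sketch of fitting and evaluating a random forest follows, using synthetic data in place of `creditcard.csv`. The hyperparameters (`n_estimators=100`) and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in with about 5% positives (an assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit an ensemble of 100 trees; random_state makes the run repeatable.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
y_pred = forest.predict(X_te)

# Per-class precision, recall and F1; zero_division=0 avoids a warning
# if a class is never predicted.
print(classification_report(y_te, y_pred, zero_division=0))
```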

## Iteration 6: Random Forest (with undersampling)

## Iteration 7: Random Forest (with undersampled data, and 40% train_test_split ratio)

## Iteration 8: Random Forest (with undersampling, and 40% train_test_split ratio, and optimization with GridSearchCV)