# Credit Card Fraud Detection

### Goal

Train a model to predict fraudulent credit card transactions

### Tasks
1. Load [data from Kaggle](https://www.kaggle.com/dalpozz/creditcardfraud/downloads/creditcardfraud.zip) and do exploratory data analysis to understand our data
2. Clean data and prepare it in X (2-d array), y (1-d array) format for modeling
3. Select algorithm and train model
4. Evaluate model
5. Tune/improve model
6. Use model to predict fraudulent transactions

### Approach
1. Sample the data so as to reduce the skew
2. Start with a logistic regression classifier
3. Build classification models using other classification algorithms


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import imblearn

%matplotlib inline
pd.options.display.max_columns = 40

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Load data

### Exploratory data analysis

### Preparing our data for modeling

In [25]:
# .ix was removed in pandas 1.0; .loc does the same boolean column selection
X = df.loc[:, df.columns != 'Class']
y = df.loc[:, df.columns == 'Class'].values.ravel()

[`.ravel()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html) flattens y, which is originally a column vector of shape (n, 1), into a 1-dimensional array so that scikit-learn doesn't raise a `DataConversionWarning`. The code also works without `.values.ravel()`, but it then prints that warning.
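To see what `.ravel()` does in isolation, here is a small sketch on a made-up array. The names `y_column` and `y_flat` are illustrative only and are not part of the notebook's data.

```python
import numpy as np

# A column vector: 4 rows, 1 column (the shape a one-column DataFrame's .values has)
y_column = np.array([[0], [1], [0], [0]])

# ravel() flattens it into a 1-d array, the shape scikit-learn expects for y
y_flat = y_column.ravel()

print(y_column.shape)
print(y_flat.shape)
```

The first shape has two dimensions and the second has one, which is the only difference scikit-learn cares about here.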

In [26]:
### Split our data into train and test sets
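
One possible way to do the split is sketched below on synthetic stand-in data, since the real `X` and `y` depend on the earlier cells. The `_demo` names, the 2% positive rate, `test_size=0.25` and `random_state=0` are all assumptions made for illustration. Passing `stratify` is a choice worth considering on skewed data, because it keeps the fraud ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 5 features, 20 positives (2%)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the class ratio in the train and test sets
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(X_train_demo.shape, X_test_demo.shape)
print(y_train_demo.sum(), y_test_demo.sum())
```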


## Iteration 1: Logistic regression model (with no sampling or thresholding)

### Train our model

### Evaluate our model
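
When evaluating on data this skewed, accuracy alone can be misleading. The sketch below uses hypothetical labels (1000 transactions, 2 of them fraudulent) and a baseline that never predicts fraud. It is not the notebook's model, only an illustration of why recall and the confusion matrix are worth checking alongside accuracy.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 1000 transactions, 2 of them fraudulent (0.2%)
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# A useless baseline that predicts "not fraud" for every transaction
y_pred = np.zeros(1000, dtype=int)

accuracy = metrics.accuracy_score(y_true, y_pred)  # share of correct predictions
recall = metrics.recall_score(y_true, y_pred)      # share of frauds that were caught
cm = metrics.confusion_matrix(y_true, y_pred)      # rows: actual, columns: predicted

print('accuracy:', accuracy)
print('recall:', recall)
print(cm)
```

The baseline scores 99.8% accuracy while catching none of the fraudulent transactions, so its recall is 0.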

## Iteration 2: Logistic regression model (with undersampled data)

To improve the model's ability to detect fraud, we can undersample the majority class so that the cases of y=0 and y=1 are split 50-50, instead of 99.8-0.2.

**`imblearn`** (the import name of the imbalanced-learn package) is a library that provides methods for this kind of undersampling.

In [33]:
from imblearn.under_sampling import RandomUnderSampler

import collections

In [34]:
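# return_indices and fit_sample were removed in newer imbalanced-learn releases;
# fit_resample and the sample_indices_ attribute replace them
rus = RandomUnderSampler()
X_undersampled, y_undersampled = rus.fit_resample(X, y)
idx_resampled = rus.sample_indices_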
print('length of X and y:', len(X_undersampled), len(y_undersampled))
print('Count of y values:', collections.Counter(y_undersampled))

length of X and y: 984 984
Count of y values: Counter({1: 492, 0: 492})


In [None]:
### Split our data into train and test sets using the smaller balanced dataset
### Tip: Don't overwrite X_train, X_test, y_train, y_test, so that you can still use these other variables later
### if you want to


### Train our model

### Evaluate our model

## Iteration 3: Logistic regression model (with GridSearchCV)
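
A minimal sketch of what a grid search over logistic regression could look like, run on synthetic data from `make_classification` rather than the credit card data. The grid of `C` values, `scoring='recall'`, `cv=5` and `max_iter=1000` are assumptions chosen for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, moderately imbalanced stand-in data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Candidate regularization strengths to try (smaller C = stronger regularization)
param_grid = {'C': [0.01, 0.1, 1, 10]}

# 5-fold cross-validation, picking the C with the best mean recall
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring='recall', cv=5
)
grid.fit(X_demo, y_demo)

print(grid.best_params_)
print(grid.best_score_)
```

After fitting, `grid.best_estimator_` holds a model refit on all the data passed to `fit` with the winning parameters.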

## Iteration 4: Logistic regression model (with undersampled data and GridSearchCV)

## Iteration 5: Random Forest

In [50]:
from sklearn.ensemble import RandomForestClassifier
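
As a rough sketch of how the classifier imported above can be trained and evaluated, the example below fits a random forest on synthetic data. The `_rf` names, `n_estimators=100`, the 25% test split and `random_state=0` are illustrative assumptions and not the notebook's settings.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, moderately imbalanced stand-in data (about 10% positives)
X_rf, y_rf = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(
    X_rf, y_rf, test_size=0.25, stratify=y_rf, random_state=0
)

# Fit the forest on the training split and predict on the held-out split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train_rf, y_train_rf)
y_pred_rf = rf.predict(X_test_rf)

cm_rf = metrics.confusion_matrix(y_test_rf, y_pred_rf)
recall_rf = metrics.recall_score(y_test_rf, y_pred_rf)
print(cm_rf)
print('recall:', recall_rf)
```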

## Iteration 6: Random Forest (with undersampling)

## Iteration 7: Random Forest (with undersampled data and 40% train_test_split ratio)

## Iteration 8: Random Forest (with undersampling, 40% train_test_split ratio, and optimization with GridSearchCV)