## Fraud Detection Modeling

The purpose of this notebook is to build an ensemble classifier that assesses whether fraudulent activity has occurred in a transaction. We first fit a Gradient Boosting Classifier to establish a baseline, then run a hyperparameter search to find the best-performing model for our dataset.

### Imports

In [1]:
import pandas as pd

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

In [2]:
#loading our data
df = pd.read_csv('../data/fraud.csv')

df.head()

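| Unnamed: 0 | CASH_IN | CASH_OUT | DEBIT | PAYMENT | TRANSFER | amount | newbalanceOrig | oldbalanceDest | isFraud |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 9839.64 | 160296.36 | 0.0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1864.28 | 19384.72 | 0.0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 181.0 | 0.0 | 0.0 | 1 |
| 3 | 0 | 1 | 0 | 0 | 0 | 181.0 | 0.0 | 21182.0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 | 11668.14 | 29885.86 | 0.0 | 0 |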


### Data Preparation

Here we separate the features from the target variable, then split the data into training and testing sets.

In [3]:
#separating features and target
X = df.drop('isFraud', axis=1)
y = df['isFraud']
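
The `Unnamed: 0` column in the preview above is usually a row index written out by an earlier `to_csv` call. Because `X` keeps every column except `isFraud`, that index would be passed to the model as a feature. The sketch below assumes the column really is just a saved index and shows one way to keep it out of the features. The names `csv_text`, `raw_demo`, `clean_demo`, and `X_demo` are illustrative, and the three rows are copied from the preview.

```python
import io

import pandas as pd

# Three rows copied from the preview; the empty first header field is what
# pandas labels "Unnamed: 0" when the file is read with default settings.
csv_text = (
    ",CASH_IN,CASH_OUT,DEBIT,PAYMENT,TRANSFER,amount,newbalanceOrig,oldbalanceDest,isFraud\n"
    "0,0,0,0,1,0,9839.64,160296.36,0.0,0\n"
    "1,0,0,0,1,0,1864.28,19384.72,0.0,0\n"
    "2,0,0,0,0,1,181.0,0.0,0.0,1\n"
)

# Default read: the saved index shows up as an ordinary column.
raw_demo = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to treat the first column as the row index instead.
clean_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# The feature matrix now holds only the eight real features.
X_demo = clean_demo.drop("isFraud", axis=1)
print(list(X_demo.columns))
```

An alternative with the same effect is to drop the column explicitly, for example `df.drop(['Unnamed: 0', 'isFraud'], axis=1)`.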

In [4]:
#splitting our dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
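
With a rare positive class, a purely random split can leave the train and test sets with different fraud rates, and without a fixed seed the scores change from run to run. The sketch below uses synthetic data (not the fraud dataset) to show the `stratify` and `random_state` arguments of `train_test_split`. The array names and the 2% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: 1,000 rows, 20 positives (2%).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo preserves the class ratio in both splits;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test:", y_te.mean())
```

Both splits end up with the same share of positives, so a score computed on the test set reflects the same class balance the model was trained on.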

### Initial Gradient Boosting Classifier

Creating an initial Gradient Boosting Classifier on our training set to establish a baseline.

In [7]:
#creating a base classifier
gbc = GradientBoostingClassifier(n_estimators=10)

In [8]:
#fitting our classifier on our training data
gbc.fit(X_train, y_train)

GradientBoostingClassifier(n_estimators=10)

In [10]:
#get train and testing predictions from initial classifier
y_train_pred = gbc.predict(X_train)
y_test_pred = gbc.predict(X_test)

In [11]:
#view f1 score for train and testing
print("f1_score for training set predictions:", f1_score(y_train, y_train_pred))
print("f1_score for test set predictions:", f1_score(y_test, y_test_pred))

f1_score for training set predictions: 0.21777120580723217
f1_score for test set predictions: 0.2011111111111111


In [13]:
#view mean accuracy on given test data and labels
print("mean accuracy on given test data:", gbc.score(X_test, y_test))

mean accuracy on given test data: 0.9988699623739906
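
The gap between the accuracy above and the F1 scores reported earlier is typical of imbalanced problems: when almost every transaction is legitimate, a model can score very high accuracy while catching little or no fraud. A small sketch with hypothetical labels (not taken from the dataset) makes this concrete by scoring a "model" that never predicts fraud.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 10,000 transactions, 10 of them fraudulent.
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A degenerate "model" that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0.0 (without a warning) when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print("accuracy:", acc)
print("f1:", f1)
```

This is why the F1 score on the fraud class, rather than mean accuracy, is the more informative number for comparing models in this notebook.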


### Randomized Search For Hyperparameter Tuning

In [14]:
#parameters to test
params = {
    'learning_rate': [0.1, 0.5, 1.0, 10],
}

In [16]:
#building a new classifier using randomized search
clf = RandomizedSearchCV(GradientBoostingClassifier(n_estimators=10), params)

In [None]:
#begin our hyperparameter tuning
search = clf.fit(X_train, y_train)
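
By default, `RandomizedSearchCV` ranks candidates with the estimator's own `score` method, which for a classifier is accuracy, while this notebook evaluates models with F1. The sketch below, run on synthetic data from `make_classification` rather than the fraud dataset, shows how the search could select on F1 instead. The parameter values, `cv=3`, and the `*_demo` names are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data standing in for the fraud set (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Illustrative candidate values; all lists, so sampling is without replacement.
param_distributions = {"learning_rate": [0.05, 0.1, 0.5, 1.0]}

search_demo = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=10, random_state=0),
    param_distributions,
    n_iter=4,        # only 4 candidates exist, so every one is tried
    scoring="f1",    # rank candidates by F1 instead of the default accuracy
    cv=3,
    random_state=0,  # repeatable sampling of candidates
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(search_demo.best_score_)
```

Note that when the number of candidates is smaller than `n_iter` (whose default is 10), scikit-learn warns and simply evaluates every candidate, which makes the search equivalent to a grid search.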



In [None]:
#get our best parameters
search.best_params_

In [None]:
#get predictions from our best parameter model
y_hat_train = search.predict(X_train)
y_hat_test = search.predict(X_test)

In [None]:
#get the f1 scores of our best model
print("f1_score for training set predictions:", f1_score(y_train, y_hat_train))
print("f1_score for test set predictions:", f1_score(y_test, y_hat_test))
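
F1 folds precision and recall into a single number, so it cannot show whether a low score comes from missed fraud or from false alarms. A minimal sketch with hypothetical labels (not output from the model above) shows how the confusion matrix and the two component scores could be inspected alongside F1.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions for ten transactions.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
f1_demo = f1_score(y_true_demo, y_pred_demo)           # harmonic mean of the two

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall, "f1:", f1_demo)
```

In this toy case recall is lower than precision, meaning most of the loss in F1 comes from fraudulent transactions that were not flagged.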