# Credit Card Fraud Detection with Random Forest

## Problem Statement
This is a binary classification problem: detect fraudulent credit card transactions in a dataset of 284,807 transactions. Only 492 of them (about 0.17%) are fraudulent, so the dataset is highly imbalanced.
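
To make the imbalance concrete, here is a small sketch of the arithmetic using only the two counts quoted above. The "always non-fraud" model is a hypothetical baseline used for illustration, not something this notebook trains.

```python
# Counts taken from the problem statement.
n_total = 284_807
n_fraud = 492

# Share of transactions that are fraudulent.
fraud_rate = n_fraud / n_total

# Accuracy of a hypothetical model that labels every transaction as non-fraudulent.
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-non-fraud accuracy: {baseline_accuracy:.4%}")
```

A baseline this high is the reason accuracy on its own is a weak yardstick for this dataset.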

## Downloading and unzipping the dataset
The dataset is available from the [Kaggle website](https://www.kaggle.com/mlg-ulb/creditcardfraud). Downloading it requires a Kaggle account, so register or sign in first.

![download](img/data_download.png)

The downloaded archive is named `310_23498_bundle_archive.zip` and contains a single CSV file, `creditcard.csv`.

In [1]:
# import all packages
from zipfile import ZipFile
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, accuracy_score, precision_score, recall_score,
    f1_score, matthews_corrcoef, confusion_matrix,
)

In [2]:
# unzip the downloaded archive
with ZipFile('310_23498_bundle_archive.zip', 'r') as zipObj:
    zipObj.extractall()
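
If you want to confirm what an archive contains before extracting it, `ZipFile.namelist()` lists the member names. The sketch below builds a tiny in-memory archive so it runs without the Kaggle download; the CSV content in it is a made-up placeholder, not real data.

```python
import io
from zipfile import ZipFile

# Build a small in-memory archive as a stand-in for the downloaded file.
# The single row of CSV content is a placeholder for illustration only.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zip_out:
    zip_out.writestr('creditcard.csv', 'Time,V1,Amount,Class\n0,0.1,1.0,0\n')

# Inspect the archive before extracting anything.
with ZipFile(buffer, 'r') as zip_in:
    members = zip_in.namelist()
    has_csv = 'creditcard.csv' in members

print(members)
```

With the real download, the same check would use the archive path in place of `buffer`.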

## Dataset exploration

Read the creditcard.csv file into a pandas dataframe:

In [3]:
df = pd.read_csv('creditcard.csv')

Check the number of rows and columns. There should be 284,807 rows, one for each transaction, and 31 columns. `Time`, `V1` to `V28` and `Amount` are the feature columns. The `Class` column indicates whether a transaction is non-fraudulent (Class=0) or fraudulent (Class=1).

In [4]:
print('The dataframe has',df.shape[0],'rows and',df.shape[1],'columns.')

The dataframe has 284807 rows and 31 columns.


Check for missing values:

In [5]:
print('There are',df.isnull().sum().sum(),'missing values.')

There are 0 missing values.


Check class balance:

In [6]:
# count the rows in each class once, then look up each label
class_counts = df['Class'].value_counts()
class_0 = class_counts[0]
class_1 = class_counts[1]
print('There are',class_0,'non-fraudulent transactions and',class_1,'fraudulent transactions.')

There are 284315 non-fraudulent transactions and 492 fraudulent transactions.


## Training the Random Forest Classifier

First, we need to separate the dataframe into features (X) and class labels (y):

In [7]:
X = df.iloc[:,:30]
y = df.iloc[:,30]
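
The positional slices above rely on `Class` being the last of the 31 columns. As an alternative sketch, the same split can be written by column name, which keeps working if the column order changes. The toy frame below is made up and has only a few columns, purely to keep the example self-contained.

```python
import pandas as pd

# Toy frame mimicking the layout (features plus a Class label); values are invented.
toy = pd.DataFrame({
    'Time': [0, 1, 2],
    'V1': [0.1, -0.2, 0.3],
    'Amount': [9.99, 120.0, 5.5],
    'Class': [0, 0, 1],
})

# Features are every column except the label; the label is selected by name.
X_toy = toy.drop(columns='Class')
y_toy = toy['Class']

print(list(X_toy.columns))
```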

Now use scikit-learn's `train_test_split` utility to divide X and y into train and test sets, holding out 20% of the rows for testing:

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
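
The split above is purely random. One option the notebook does not use is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both subsets. The sketch below shows the effect on synthetic stand-in data (1,000 rows, 10 positives); applying it to the real data would change the split and therefore the metrics reported later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, 10 of them positive (1%).
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.zeros(1000, dtype=int)
y_demo[:10] = 1

# stratify=y_demo asks for the same positive rate in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

print('positives in train:', y_tr.sum(), '| positives in test:', y_te.sum())
```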

Train a random forest with 200 trees on the training set:

In [9]:
crf = RandomForestClassifier(n_estimators=200)
crf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
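
The printed parameters show `random_state=None` and `class_weight=None`, so repeated runs can give slightly different forests and both classes are weighted equally. As a hedged sketch of two settings one might experiment with (they are not used in this notebook), the example below fixes `random_state` for repeatability and sets `class_weight='balanced'`. It runs on synthetic data from `make_classification`, which is an assumption made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (roughly 5% positives) standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

def fit_forest(seed):
    # class_weight='balanced' weights classes inversely to their frequency.
    model = RandomForestClassifier(
        n_estimators=50, class_weight='balanced', random_state=seed
    )
    return model.fit(X_demo, y_demo)

# Same seed and same data: both forests make identical predictions.
pred_a = fit_forest(0).predict(X_demo)
pred_b = fit_forest(0).predict(X_demo)
same = np.array_equal(pred_a, pred_b)
print('Identical predictions with a fixed seed:', same)
```

Whether class weighting helps on the real dataset would need to be checked against the metrics below; it is not guaranteed to improve them.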

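Now make predictions on the test set and evaluate them. Because fraud makes up so little of the data, accuracy alone says little, so precision, recall, F1-score and the Matthews correlation coefficient are reported as well: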

In [10]:
predictions = crf.predict(X_test)

# number of misclassified test transactions
n_errors = (predictions != y_test).sum()
print('Errors:', n_errors)

acc = accuracy_score(y_test, predictions)
print("The accuracy is  {}".format(acc))
prec = precision_score(y_test, predictions)
print("The precision is {}".format(prec))
rec = recall_score(y_test, predictions)
print("The recall is {}".format(rec))
f1 = f1_score(y_test, predictions)
print("The F1-Score is {}".format(f1))
MCC = matthews_corrcoef(y_test, predictions)
print("The Matthews correlation coefficient is {}".format(MCC))

The accuracy is  0.9995084442259752
The precision is 0.9195402298850575
The recall is 0.7920792079207921
The F1-Score is 0.851063829787234
The Matthews correlation coefficient is 0.8531958042156231
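
To see how these five numbers relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are an assumption: they are inferred from the printed ratios (precision equals 80/87, recall equals 80/101) and from a test set of 56,962 rows implied by `test_size=0.20`, and were not printed by the notebook itself.

```python
import math

# Assumed counts, inferred from the printed precision and recall (not notebook output).
tp, fp, fn = 80, 7, 21
n_test = 56_962  # ceil(0.20 * 284807), the test-set size implied by the split
tn = n_test - tp - fp - fn

accuracy = (tp + tn) / n_test
precision = tp / (tp + fp)   # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)      # share of actual frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(accuracy, precision, recall, f1, mcc)
```

Under these assumed counts, the model misses 21 of the 101 frauds in the test set while raising 7 false alarms, which is what the gap between recall and precision reflects.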
