# Fraud Detection with Machine Learning
This notebook explores a fraud detection problem with financial transactions.

**Dictionary**<br/>
These are the column definitions of the referenced synthetic dataset.
<br/><br/>

| Column Name | Description |
| ----------- | ----------- | 
| step | Maps to a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps: 744 (744 hours, i.e. 31 days of simulated time). |
| type | Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER. |
| amount | Amount of the transaction in local currency. |
| nameOrig | customer who started the transaction |
| oldbalanceOrg | initial balance before the transaction |
| newbalanceOrig | new balance after the transaction |
| nameDest | customer who is the recipient of the transaction |
| oldbalanceDest | initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants). |
| newbalanceDest | new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants). |
<br/><br/>

**CashIn**	A Client moves money into the network via a Merchant <br/>
**CashOut**	A Client moves money out of the network via a Merchant <br/>
**Debit**	A Client moves money into a Bank <br/>
**Transfer**	A Client sends money to another Client <br/>
**Payment**	A Client exchanges money for something from a Merchant <br/>
<br/>
Courtesy of Jamie Hoyzer, DS & ML-Feb Mar 2021

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
from sklearn.model_selection import train_test_split

In [None]:
# Importing the machine learning model
from sklearn.ensemble import RandomForestClassifier

# Import GridSearchCV to find the model with the best parameters
from sklearn.model_selection import GridSearchCV

# Import the functions used to compute metrics for the model
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
accountDf = pd.read_csv('./PS_subset.csv')

In [None]:
accountDf.drop(['isFlaggedFraud'], axis=1, inplace=True)

In [None]:
accountDf.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0
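
The five rows above are consistent with a simple bookkeeping identity on the originator side: `newbalanceOrig` equals `oldbalanceOrg - amount`. The sketch below checks this on those five rows only. Whether it holds across the full dataset is an assumption to verify, and `sample` and `origBalanceConsistent` are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head() output above.
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
})

# Balance we would expect if the full amount left the originator's account.
expectedNew = sample["oldbalanceOrg"] - sample["amount"]

# Compare with a small tolerance to absorb floating-point rounding.
sample["origBalanceConsistent"] = np.isclose(
    expectedNew, sample["newbalanceOrig"], atol=0.01
)
print(sample[["type", "amount", "origBalanceConsistent"]])
```

Rows where such a check fails could be worth inspecting separately, but that is a hypothesis rather than something this notebook establishes.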


In [None]:
accountDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   step            5000 non-null   int64  
 1   type            5000 non-null   object 
 2   amount          5000 non-null   float64
 3   nameOrig        5000 non-null   object 
 4   oldbalanceOrg   5000 non-null   float64
 5   newbalanceOrig  5000 non-null   float64
 6   nameDest        5000 non-null   object 
 7   oldbalanceDest  5000 non-null   float64
 8   newbalanceDest  5000 non-null   float64
 9   isFraud         5000 non-null   int64  
dtypes: float64(5), int64(2), object(3)
memory usage: 390.8+ KB


In [None]:
# Look at the volume for different types of transactions 
fig = px.histogram(accountDf, x="type")
fig.show()

In [None]:
# Look at the extent of fraudulent transactions: count rows per isFraud class
accountDf.groupby('isFraud').size()

In [None]:
fig = px.pie(accountDf, values='amount', names='type', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

In [None]:
# Compute total amounts and transaction counts, split by fraud status
totalFraud = accountDf[accountDf['isFraud'] == 1]['amount'].sum()
totalNonFraud = accountDf[accountDf['isFraud'] == 0]['amount'].sum()

totalAmount = accountDf['amount'].sum()

totalFraudTransactions = len(accountDf[accountDf['isFraud'] == 1])
totalNonFraudTransactions = len(accountDf[accountDf['isFraud'] == 0])

In [None]:
ratioAmountDf = pd.DataFrame({
    'sum': [totalFraud, totalNonFraud], 
    'isFraud': [1, 0]
})

ratioTransactionDf = pd.DataFrame({
    'sum': [totalFraudTransactions, totalNonFraudTransactions], 
    'isFraud': [1, 0]
})

In [None]:
# Ratio of fraudulent transactions in 'transactions'
fig = px.pie(ratioTransactionDf, values='sum', names='isFraud')
fig.show()

In [None]:
# Ratio of fraudulent transactions in 'amount'
fig = px.pie(ratioAmountDf, values='sum', names='isFraud')
fig.show()
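
The two pie charts can also be summarised numerically as the share of fraud by transaction count and by amount. This is a minimal sketch that uses only the five rows shown by `head()`, so the shares are illustrative and not the dataset's actual ratios. The names `sample`, `summary`, and `shares` are hypothetical.

```python
import pandas as pd

# Amount and fraud flag for the five rows shown by head().
sample = pd.DataFrame({
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "isFraud": [0, 0, 1, 1, 0],
})

# Transaction count and total amount per class.
summary = sample.groupby("isFraud")["amount"].agg(
    transactions="count", totalAmount="sum"
)

# Divide each column by its column total to get per-class shares.
shares = summary / summary.sum()
print(shares)
```

In this tiny sample fraud makes up a much larger share of the transaction count than of the amount, which is why it helps to look at both views.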

In [None]:
accountDf["type"]=="CASH_OUT"

0       False
1       False
2       False
3        True
4       False
        ...  
4995    False
4996    False
4997    False
4998    False
4999    False
Name: type, Length: 5000, dtype: bool

In [None]:
# Build boolean masks: CASH_OUT transactions and fraudulent transactions
filter1 = accountDf["type"]=="CASH_OUT"
filter2 = accountDf["isFraud"]==1

# Keep only the rows that are both CASH_OUT and fraudulent
accountDfSubset = accountDf[filter1 & filter2]
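
Filtering one transaction type at a time answers a single question. A cross-tabulation shows the fraud counts for every type at once. The sketch below uses `pd.crosstab` on the five rows shown by `head()`; the names `sample` and `fraudByType` are hypothetical, and the counts are illustrative only.

```python
import pandas as pd

# Type and fraud flag for the five rows shown by head().
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [0, 0, 1, 1, 0],
})

# Rows are transaction types, columns are isFraud values, cells are counts.
fraudByType = pd.crosstab(sample["type"], sample["isFraud"])
print(fraudByType)
```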

### Rules-Based vs Machine Learning

In [None]:
# Hand-written fraud rules: a transaction matching any rule is labelled 1
conditions = [
    (accountDf['oldbalanceOrg'] <= 56900) & (accountDf['type'] == 'TRANSFER') & (accountDf['newbalanceDest'] <= 105),
    (accountDf['oldbalanceOrg'] < 56900) & (accountDf['newbalanceDest'] <= 12),
    (accountDf['oldbalanceOrg'] > 56900) & (accountDf['newbalanceOrig'] > 12) & (accountDf['amount'] > 1160000)
]

mapping = [1, 1, 1]

accountDf['label'] = np.select(conditions, mapping, default=0)
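
`np.select` assigns, for each row, the choice that belongs to the first condition that is true, and falls back to `default` when none match. Because every choice here is `1`, the result is the same as a logical OR of the three rules. The sketch below illustrates this on the five rows shown by `head()`, using the same thresholds; `sample`, `selected`, and `anyRule` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Relevant columns of the five rows shown by head().
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Same three rules as in the cell above.
conditions = [
    (sample["oldbalanceOrg"] <= 56900) & (sample["type"] == "TRANSFER") & (sample["newbalanceDest"] <= 105),
    (sample["oldbalanceOrg"] < 56900) & (sample["newbalanceDest"] <= 12),
    (sample["oldbalanceOrg"] > 56900) & (sample["newbalanceOrig"] > 12) & (sample["amount"] > 1160000),
]

# First matching condition wins; rows matching nothing get the default 0.
selected = np.select(conditions, [1, 1, 1], default=0)

# Equivalent formulation when every choice is 1: OR the rules together.
anyRule = (conditions[0] | conditions[1] | conditions[2]).astype(int)

print(selected.tolist())
print(anyRule.tolist())
```

The separate conditions only matter if the rules should map to different values, for example a rule identifier per row.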

In [None]:
totalFraudTransactions = len(accountDf[accountDf['label'] == 1])
totalNonFraudTransactions = len(accountDf[accountDf['label'] == 0])

ratioTransactionDf = pd.DataFrame({
    'sum': [totalFraudTransactions, totalNonFraudTransactions], 
    'label': [1, 0]
})

# Ratio of fraudulent transactions in 'transactions'
fig = px.pie(ratioTransactionDf, values='sum', names='label')
fig.show()

In [None]:
accountDf.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,label
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,1
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,1
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,1
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,1
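
Since the dataset carries the ground-truth `isFraud` column, the rule-based `label` can be scored against it with the same metrics used later for the model. The sketch below does this for the five rows shown above only, so the numbers are illustrative and say nothing about the rules' quality on the full dataset. The names `rows`, `cm`, `rulePrecision`, and `ruleRecall` are hypothetical.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# isFraud and label values copied from the head() output above.
rows = pd.DataFrame({
    "isFraud": [0, 0, 1, 1, 0],
    "label": [0, 1, 1, 1, 1],
})

# Rows are true classes (0, 1); columns are rule-predicted classes (0, 1).
cm = confusion_matrix(rows["isFraud"], rows["label"])

# Precision: share of flagged rows that are truly fraud.
rulePrecision = precision_score(rows["isFraud"], rows["label"])
# Recall: share of truly fraudulent rows that the rules flagged.
ruleRecall = recall_score(rows["isFraud"], rows["label"])

print(cm)
print(rulePrecision, ruleRecall)
```

On these five rows the rules catch both fraudulent transactions but also flag two legitimate payments.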


In [None]:
# The target is the rule-based 'label'; note that 'isFraud' stays in the feature set
features = accountDf.drop(['label'], axis=1)
label = accountDf['label']

In [None]:
features = pd.get_dummies(features, columns=['type'])

In [None]:
features.head()

Unnamed: 0,step,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,1,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,0,0,1,0
1,1,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,0,0,1,0
2,1,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,0,0,0,1
3,1,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,1,0,0,0
4,1,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,0,0,1,0


In [None]:
features.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)
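
The two preprocessing steps above (one-hot encoding `type`, then dropping the identifier columns) can be seen in isolation on a small frame. This is a sketch on three hypothetical rows. Passing `dtype=int` is an assumption made here so the indicators come out as 0/1 as in the output above, since recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Three illustrative rows with one categorical and one identifier column.
sample = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0],
    "nameOrig": ["C1231006815", "C1305486145", "C840083671"],
})

# Replace 'type' with one 0/1 indicator column per category.
encoded = pd.get_dummies(sample, columns=["type"], dtype=int)

# Drop the identifier column, which is not a numeric feature.
encoded = encoded.drop(columns=["nameOrig"])
print(encoded)
```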

In [None]:
# Split 60% train / 40% held out, then halve the held-out part into test and validation (20% each)
trainF, testF, trainL, testL = train_test_split(features, label, test_size=0.4, random_state=42)
testF, valF, testL, valL = train_test_split(testF, testL, test_size=0.5, random_state=42)
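
When one class is rare, a purely random split can leave very few positive examples in the smaller partitions. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal across splits. The sketch below applies it to synthetic data with 10% positives; all variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 10 positives and 90 negatives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

# 60% train, 40% held out, preserving the class ratio.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Split the held-out part evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold
)

print(len(y_train), len(y_val), len(y_test))
print(y_train.sum(), y_val.sum(), y_test.sum())
```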

In [None]:
# Create a RandomForestClassifier with 50 trees and a maximum depth of 10
rfModel1 = RandomForestClassifier(n_estimators=50, max_depth=10)
rfModel1.fit(trainF, trainL.values.ravel())

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=10, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
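
`GridSearchCV` is imported at the top of the notebook but the hyperparameters above are set by hand. A minimal sketch of how a search might look follows. It runs on synthetic data from `make_classification` rather than on `trainF`, and the grid values and the choice of `scoring="recall"` are assumptions for illustration, not the settings used to pick the model above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced binary classification data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Candidate hyperparameter values (illustrative only).
param_grid = {"n_estimators": [10, 50], "max_depth": [5, 10]}

# 3-fold cross-validation over every combination, ranked by recall.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="recall",
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` with the best combination found.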

In [None]:
# Predict on the validation set; these predictions are the basis for adjusting the hyperparameters
predLabel = rfModel1.predict(valF)
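
The validation predictions only help with tuning once they are scored and compared across candidate settings. The sketch below shows one way to do that on synthetic data: fit one model per candidate `max_depth`, score each on the validation split, and keep the best. The candidate depths, the use of recall as the selection metric, and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data split into train and validation parts.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Score each candidate depth on the validation set.
valRecall = {}
for depth in (2, 5, 10):
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    valRecall[depth] = recall_score(y_val, model.predict(X_val))

# Pick the depth with the highest validation recall.
bestDepth = max(valRecall, key=valRecall.get)
print(valRecall, bestDepth)
```

The test set stays untouched during this loop, so the final test metrics remain an unbiased check of the chosen model.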

In [None]:
# Use the test data set with the best available model
predLabel = rfModel1.predict(testF)

accuracy = round(accuracy_score(testL, predLabel), 3)
precision = round(precision_score(testL, predLabel), 3)
recall = round(recall_score(testL, predLabel), 3)

print(
    'Max depth: {} and Estimators: {} ---> Accuracy: {}, Precision: {}, Recall: {}'
    .format(rfModel1.max_depth, rfModel1.n_estimators, accuracy, precision, recall)
)

Max depth: 10 and Estimators: 50 ---> Accuracy: 0.998, Precision: 1.0, Recall: 0.995
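
Accuracy is reported alongside precision and recall because, on imbalanced data, accuracy alone can look good for a model that never finds the rare class. The sketch below illustrates this with a hypothetical predictor that always outputs 0 on synthetic labels with 5% positives; it is not a statement about the model trained above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic ground truth: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A trivial predictor that never flags anything.
y_pred = np.zeros(100, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positives are predicted.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)

print(accuracy, precision, recall)
```

The accuracy is high even though no positive case is detected, which is why precision and recall are the more informative numbers for fraud detection.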
