# Fraud Detection with Machine Learning
This notebook explores a fraud detection problem with financial transactions.

**Dictionary**<br/>
These are the column definitions of the referenced synthetic dataset.
<br/><br/>

| Column Name | Description |
| ----------- | ----------- | 
| step | Maps a unit of time in the real world. In this case 1 step is 1 hour. Total steps: 744 (744 hours, i.e. a 31-day simulation). |
| type | Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER. |
| amount |  amount of the transaction in local currency. |
| nameOrig | customer who started the transaction |
| oldbalanceOrg | initial balance before the transaction |
| newbalanceOrig | new balance after the transaction |
| nameDest | customer who is the recipient of the transaction |
| oldbalanceDest | initial balance of the recipient before the transaction. Note that there is no information for customers whose name starts with M (Merchants). |
| newbalanceDest | new balance of the recipient after the transaction. Note that there is no information for customers whose name starts with M (Merchants). |
<br/><br/>

**CashIn**: A Client moves money into the network via a Merchant <br/>
**CashOut**: A Client moves money out of the network via a Merchant <br/>
**Debit**: A Client moves money into a Bank <br/>
**Transfer**: A Client sends money to another Client <br/>
**Payment**: A Client exchanges money for something from a Merchant <br/>
<br/>
Courtesy of Jamie Hoyzer, DS & ML-Feb Mar 2021
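
Two derived columns follow directly from the dictionary: whether the recipient is a merchant (names starting with `M`) and which simulated day a `step` falls on. The sketch below is only illustrative: the toy rows are made up, the column names `destIsMerchant`, `day` and `hourOfDay` are hypothetical, and it assumes steps are counted from 1.

```python
import pandas as pd

# Toy rows using column names from the dictionary above (values are made up).
toyDf = pd.DataFrame({
    "step": [1, 24, 25, 744],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 181.0, 181.0, 1864.28],
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "nameDest": ["M10", "C20", "C30", "M40"],
})

# Recipients whose name starts with "M" are merchants (no balance information).
toyDf["destIsMerchant"] = toyDf["nameDest"].str.startswith("M")

# One step is one hour; assuming steps start at 1, derive a 0-based day and hour.
toyDf["day"] = (toyDf["step"] - 1) // 24
toyDf["hourOfDay"] = (toyDf["step"] - 1) % 24

print(toyDf[["step", "nameDest", "destIsMerchant", "day", "hourOfDay"]])
```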

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
from sklearn.model_selection import train_test_split

In [None]:
# Importing the machine learning model
from sklearn.ensemble import RandomForestClassifier

# Import GridSearchCV to find the model with the best parameters
from sklearn.model_selection import GridSearchCV

# Import the functions that measure the model's metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
#accountDf = pd.read_csv('./PS_subset.csv')

In [None]:
# (2) Import data file(s)
# Use the SQL method from Spark to import the data as DataFrames. 
# See the reference code below. Also see the DataBricks Fraud Detection code as example.
# For basic exploration, you can also use Pandas.
# bs140513_032310.csv
# bsNET140513_032310.csv
!curl -O https://storage.googleapis.com/datascience-practice/bs140513_032310.csv.zip
!curl -O https://storage.googleapis.com/datascience-practice/bsNET140513_032310.csv.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7251k  100 7251k    0     0  10.0M      0 --:--:-- --:--:-- --:--:-- 10.0M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6139k  100 6139k    0     0  5658k      0  0:00:01  0:00:01 --:--:-- 5658k


In [None]:
# (2.a) See the downloaded files
!ls

bs140513_032310.csv.zip  bsNET140513_032310.csv.zip  PS_subset.csv  sample_data


In [None]:
# (2.b) Unzip the files
!unzip bs140513_032310.csv.zip
!unzip bsNET140513_032310.csv.zip

Archive:  bs140513_032310.csv.zip
  inflating: bs140513_032310.csv     
Archive:  bsNET140513_032310.csv.zip
  inflating: bsNET140513_032310.csv  


In [None]:
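# (2.c) Read transaction data into Pandas DataFrame
accountDf = pd.read_csv('bs140513_032310.csv')
# (2.d) Read network data into Pandas DataFrame
# (networkDf is an illustrative name; later cells only use accountDf)
networkDf = pd.read_csv('bsNET140513_032310.csv')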

In [None]:
accountDf.head()

Unnamed: 0,step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud
0,0,'C1093826151','4','M','28007','M348934600','28007','es_transportation',4.55,0
1,0,'C352968107','2','M','28007','M348934600','28007','es_transportation',39.68,0
2,0,'C2054744914','4','F','28007','M1823072687','28007','es_transportation',26.89,0
3,0,'C1760612790','3','M','28007','M348934600','28007','es_transportation',17.25,0
4,0,'C757503768','5','M','28007','M348934600','28007','es_transportation',35.72,0
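
The string values in the preview above still carry their single quotes (`'C1093826151'`), which suggests the file quotes its fields with `'` rather than `"`. A minimal sketch, assuming that is the case, passes `quotechar` to `pd.read_csv`. The inline sample reuses two rows from the preview instead of the real file, and `cleanDf` is a hypothetical name.

```python
import io
import pandas as pd

# Two rows copied from the preview above, in the file's assumed raw form.
sampleCsv = io.StringIO(
    "step,customer,age,gender,zipcodeOri,merchant,zipMerchant,category,amount,fraud\n"
    "0,'C1093826151','4','M','28007','M348934600','28007','es_transportation',4.55,0\n"
    "0,'C352968107','2','M','28007','M348934600','28007','es_transportation',39.68,0\n"
)

# quotechar="'" makes pandas treat the single quotes as quoting, not as data.
cleanDf = pd.read_csv(sampleCsv, quotechar="'")

print(cleanDf.loc[0, "customer"])
print(cleanDf[["customer", "merchant", "category", "amount", "fraud"]])
```

With the quotes stripped at load time, later comparisons such as `cleanDf["category"] == "es_transportation"` match without having to include the quote characters in the literal.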


In [None]:
accountDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 594643 entries, 0 to 594642
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   step         594643 non-null  int64  
 1   customer     594643 non-null  object 
 2   age          594643 non-null  object 
 3   gender       594643 non-null  object 
 4   zipcodeOri   594643 non-null  object 
 5   merchant     594643 non-null  object 
 6   zipMerchant  594643 non-null  object 
 7   category     594643 non-null  object 
 8   amount       594643 non-null  float64
 9   fraud        594643 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 45.4+ MB


In [None]:
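# Look at the volume for different types of transactions.
# `type` and `isFraud` belong to the dataset described in the dictionary, not to the
# bs140513_032310.csv file loaded in (2.c), so reload the subset first
# (assuming PS_subset.csv has the columns listed in the dictionary).
accountDf = pd.read_csv('./PS_subset.csv')
fig = px.histogram(accountDf, x="type")
fig.show()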

In [None]:
# Look at the extent of fraudulent transactions
# (numeric_only=True skips string columns such as the customer names)
accountDf.groupby('isFraud').sum(numeric_only=True)

In [None]:
# Share of the total transaction amount by transaction type
fig = px.pie(accountDf, values='amount', names='type', color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

In [None]:
# Check ratio of fraudulent transactions
totalFraud = accountDf[accountDf['isFraud'] == 1]['amount'].sum()
totalNonFraud = accountDf[accountDf['isFraud'] == 0]['amount'].sum()
totalAmount = accountDf['amount'].sum()

totalFraudTransactions = len(accountDf[accountDf['isFraud'] == 1])
totalNonFraudTransactions = len(accountDf[accountDf['isFraud'] == 0])

In [None]:
ratioAmountDf = pd.DataFrame({
    'sum': [totalFraud, totalNonFraud], 
    'isFraud': [1, 0]
})

ratioTransactionDf = pd.DataFrame({
    'sum': [totalFraudTransactions, totalNonFraudTransactions], 
    'isFraud': [1, 0]
})
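
The same two ratios can be read as numbers before plotting them. A minimal sketch on made-up rows, assuming only the `amount` and `isFraud` columns; `countShare` and `amountShare` are hypothetical names:

```python
import pandas as pd

# Toy transactions (made-up values) with the `amount` and `isFraud` columns.
toyDf = pd.DataFrame({
    "amount": [100.0, 250.0, 50.0, 600.0],
    "isFraud": [0, 0, 0, 1],
})

# Share of transactions per class: fraction of rows with each isFraud value.
countShare = toyDf["isFraud"].value_counts(normalize=True)

# Share of money per class: summed amount per class over the total amount.
amountShare = toyDf.groupby("isFraud")["amount"].sum() / toyDf["amount"].sum()

print(countShare)
print(amountShare)
```

In this toy data fraud is a quarter of the transactions but more than half of the money moved, which is why the notebook looks at both views.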

In [None]:
# Share of fraudulent transactions by number of transactions
fig = px.pie(ratioTransactionDf, values='sum', names='isFraud')
fig.show()

In [None]:
# Share of fraudulent transactions by transaction amount
fig = px.pie(ratioAmountDf, values='sum', names='isFraud')
fig.show()

In [None]:
accountDf["type"]=="CASH_OUT"

In [None]:
# Build boolean masks: CASH_OUT transactions and fraudulent transactions
filter1 = accountDf["type"]=="CASH_OUT"
filter2 = accountDf["isFraud"]==1

# Keep only the rows that satisfy both masks (fraudulent CASH_OUT transactions)
accountDfSubset = accountDf[filter1 & filter2]
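
Filtering one type at a time works, but a grouped aggregate gives the fraud count and fraud rate for every type in one pass. A small sketch on made-up rows, assuming the `type` and `isFraud` columns; `fraudByType` is a hypothetical name:

```python
import pandas as pd

# Toy rows (made-up) with a transaction type and a 0/1 fraud flag.
toyDf = pd.DataFrame({
    "type": ["CASH_OUT", "CASH_OUT", "TRANSFER", "PAYMENT", "TRANSFER", "PAYMENT"],
    "isFraud": [1, 0, 1, 0, 0, 0],
})

# Same idea as the cell above: combine two boolean masks with &.
mask = (toyDf["type"] == "CASH_OUT") & (toyDf["isFraud"] == 1)
cashOutFraud = toyDf[mask]

# For a 0/1 flag, sum is the number of frauds and mean is the fraud rate.
fraudByType = toyDf.groupby("type")["isFraud"].agg(["sum", "mean"])

print(cashOutFraud)
print(fraudByType)
```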

### Rule-Based vs Machine Learning

In [None]:
# Hand-written rules: a transaction that matches any condition is labelled 1
conditions = [
    (accountDf['oldbalanceOrg'] <= 56900) & (accountDf['type'] == 'TRANSFER') & (accountDf['newbalanceDest'] <= 105),
    (accountDf['oldbalanceOrg'] < 56900) & (accountDf['newbalanceDest'] <= 12),
    (accountDf['oldbalanceOrg'] > 56900) & (accountDf['newbalanceOrig'] > 12) & (accountDf['amount'] > 1160000)
]

# Every rule maps to label 1; rows that match no rule get the default, 0
mapping = [1, 1, 1]

accountDf['label'] = np.select(conditions, mapping, default=0)
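
To see how `np.select` turns these rules into labels, the sketch below applies the same three conditions to four made-up rows, one per rule plus one that matches none. The values are illustrative only. Because every rule maps to 1, the result is the same as OR-ing the conditions together, which the last lines check.

```python
import numpy as np
import pandas as pd

# Four made-up transactions: rows 0-2 each trigger one rule, row 3 triggers none.
toyDf = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT", "PAYMENT"],
    "amount": [5000.0, 300.0, 2000000.0, 120.0],
    "oldbalanceOrg": [5000.0, 40000.0, 3000000.0, 90000.0],
    "newbalanceOrig": [0.0, 39700.0, 1000000.0, 89880.0],
    "newbalanceDest": [100.0, 0.0, 2500000.0, 5000.0],
})

conditions = [
    (toyDf["oldbalanceOrg"] <= 56900) & (toyDf["type"] == "TRANSFER") & (toyDf["newbalanceDest"] <= 105),
    (toyDf["oldbalanceOrg"] < 56900) & (toyDf["newbalanceDest"] <= 12),
    (toyDf["oldbalanceOrg"] > 56900) & (toyDf["newbalanceOrig"] > 12) & (toyDf["amount"] > 1160000),
]

# np.select takes the value of the first matching condition, else the default.
toyDf["label"] = np.select(conditions, [1, 1, 1], default=0)

# Equivalent formulation when all rules share one label: a logical OR.
orLabel = (conditions[0] | conditions[1] | conditions[2]).astype(int)

print(toyDf[["type", "amount", "label"]])
```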

In [None]:
accountDf.head()

In [None]:
features = accountDf.drop(['label'], axis=1)
label = accountDf['label']

In [None]:
features = pd.get_dummies(features, columns=['type'])
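
`pd.get_dummies` replaces the `type` column with one indicator column per transaction type. A short sketch on made-up rows shows the resulting column names; depending on the pandas version the indicators are 0/1 integers or booleans.

```python
import pandas as pd

# Toy frame (made-up) with one categorical and one numeric column.
toyDf = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [10.0, 20.0, 30.0],
})

# One-hot encode `type`; other columns are kept as they are.
encoded = pd.get_dummies(toyDf, columns=["type"])

print(encoded.columns.tolist())
print(encoded)
```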

In [None]:
features.head()

In [None]:
features.drop(['nameOrig', 'nameDest'], axis=1, inplace=True)

In [None]:
# Hold out 40% of the rows, then split that 40% evenly:
# 60% train, 20% test, 20% validation
trainF, testF, trainL, testL = train_test_split(features, label, test_size=0.4, random_state=42)
testF, valF, testL, valL = train_test_split(testF, testL, test_size=0.5, random_state=42)
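
The two calls above give a 60/20/20 split. The sketch below reproduces the sizes on 100 made-up rows and additionally passes `stratify`, which is not in the original cell: it is shown as an option because fraud labels are usually rare and a plain random split can leave a small set with very few positives. `restF` and `restL` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with a 10% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# First split: 60% train, 40% held out. stratify keeps the class ratio.
trainF, restF, trainL, restL = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Second split: halve the held-out part into test and validation sets.
testF, valF, testL, valL = train_test_split(
    restF, restL, test_size=0.5, random_state=42, stratify=restL
)

print(len(trainF), len(testF), len(valF))
```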

In [None]:
# Train a RandomForestClassifier with 50 trees and a maximum depth of 10
rfModel1 = RandomForestClassifier(n_estimators=50, max_depth=10)
rfModel1.fit(trainF, trainL.values.ravel())
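
`GridSearchCV` is imported at the top of the notebook but is not used in the cells shown. A minimal sketch of how it could choose `n_estimators` and `max_depth` follows. It rests on assumptions: the data is synthetic (from `make_classification`), the grid is a small hypothetical one, and recall is picked as the selection metric because missed frauds are usually the costly error.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the transaction features and labels.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Hypothetical grid: every combination is scored with 3-fold cross-validation.
paramGrid = {"n_estimators": [10, 50], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    paramGrid,
    scoring="recall",
    cv=3,
)
search.fit(X, y)

# best_estimator_ is refit on all of X, y with the winning parameters.
rfBest = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```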

In [None]:
# Predict on the validation set
predLabel = rfModel1.predict(valF)
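
The validation predictions are not scored anywhere in the cells shown. One way to use a validation set, sketched here on synthetic data, is to train a few candidate forests and keep the one with the best validation score. The three `(n_estimators, max_depth)` pairs and the choice of recall are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data split into train and validation parts.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)
trainF, valF, trainL, valL = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical candidates: (n_estimators, max_depth).
candidates = [(10, 5), (50, 10), (100, 20)]
valScores = {}
for nEstimators, maxDepth in candidates:
    model = RandomForestClassifier(
        n_estimators=nEstimators, max_depth=maxDepth, random_state=0
    )
    model.fit(trainF, trainL)
    # Score each candidate on data it has not been trained on.
    valScores[(nEstimators, maxDepth)] = recall_score(valL, model.predict(valF))

# Keep the candidate with the highest validation recall.
bestParams = max(valScores, key=valScores.get)
print(valScores)
print(bestParams)
```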

In [None]:
# Use the test data set with the best available model
predLabel = rfModel1.predict(testF)

accuracy = round(accuracy_score(testL, predLabel), 3)
precision = round(precision_score(testL, predLabel), 3)
recall = round(recall_score(testL, predLabel), 3)

print(
    'Max depth: {} and Estimators: {} ---> Accuracy: {}, Precision: {}, Recall: {}'
    .format(rfModel1.max_depth, rfModel1.n_estimators, accuracy, precision, recall)
)
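
The three metrics printed above answer different questions, which matters when the positive class is rare. The sketch below computes them by hand from hypothetical confusion-matrix counts (the numbers are invented, not results from this notebook) to show how accuracy can look strong while recall stays low.

```python
# Hypothetical counts for a test set of 1,000 transactions with 20 positives.
truePos, falsePos, falseNeg, trueNeg = 8, 2, 12, 978

# Accuracy: share of all predictions that are correct.
accuracy = (truePos + trueNeg) / (truePos + falsePos + falseNeg + trueNeg)

# Precision: of the transactions flagged as positive, how many really are.
precision = truePos / (truePos + falsePos)

# Recall: of the real positives, how many were flagged.
recall = truePos / (truePos + falseNeg)

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Here 98.6% of predictions are correct, yet only 40% of the positives are caught, so recall and precision are the more informative numbers for a problem like this one.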