# Import

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
data = pd.read_csv('Fraud.csv')

# Data Description

In [None]:
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [None]:
data.info()

# Data cleaning




The code below checks for missing values and finds none in the dataset.
The isFraud variable is read as an integer. Since this is the class variable, we convert it to object type.

In [None]:
# Convert class variables type to object
data['isFraud'] = data['isFraud'].astype('object')

In [None]:
# Test if there is any missing values in dataset
data.isnull().values.any()

False

Examining the 'type' variable shows which transaction types exist and which of them can be fraudulent.
Of all the transaction types, only CASH_OUT and TRANSFER contain fraudulent transactions.
It therefore makes sense to retain only these two types and remove PAYMENT, CASH_IN and DEBIT, which also reduces the size of the dataset.
This reduces the data from over 6 million transactions to about 2.8 million.

In [None]:
# Retaining only CASH-OUT and TRANSFER transactions
data = data.loc[data['type'].isin(['CASH_OUT', 'TRANSFER']),:]
print('The new data now has ', len(data), ' transactions.')


The new data now has  2770409  transactions.
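
The claim that fraud appears only in two transaction types can be checked with a group-wise count before filtering. The sketch below uses a small hypothetical frame (`toy`, with made-up rows) rather than the real data; on the real data the same `groupby` would need to run before the filtering step and on a numeric `isFraud` column.

```python
import pandas as pd

# Hypothetical toy rows with the same 'type' and 'isFraud' columns as the dataset.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "DEBIT", "TRANSFER"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Count fraudulent rows per transaction type.
fraud_by_type = toy.groupby("type")["isFraud"].sum()

# Keep only the types that contain at least one fraudulent transaction.
fraud_types = sorted(fraud_by_type[fraud_by_type > 0].index)
print(fraud_types)  # ['CASH_OUT', 'TRANSFER']
```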


### Negative or zero transaction amount

First, we check whether the amount column is always positive. The following two code snippets count the transactions where the amount is negative and those where the amount is 0.
There are only a few cases in which the transacted amount is 0. Exploring these transactions shows that they are all fraudulent. So, we can assume that a transaction with an amount of 0 is fraudulent. We remove these transactions from the data and apply this condition as a rule when making the final predictions.

In [None]:
# Check that there are no negative amounts
print('Number of transactions where the transaction amount is negative: ' +
str(sum(data['amount'] < 0)))


Number of transactions where the transaction amount is negative: 0


In [None]:
print('Number of transactions where the transaction amount is zero: ' +
str(sum(data['amount'] == 0)))

Number of transactions where the transaction amount is zero: 16


In [None]:
# Remove 0 amount values
data = data.loc[data['amount'] > 0,:]
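
One way to apply the zero-amount condition at prediction time is a thin wrapper around the model. The sketch below is only an illustration: `predict_with_zero_rule` and the stand-in `AlwaysGenuine` model are hypothetical names, and `amounts` is assumed to hold the raw (unscaled) transaction amounts.

```python
import numpy as np

def predict_with_zero_rule(model, X, amounts):
    """Hypothetical helper: model predictions, overridden to 1 (fraud) where amount == 0."""
    amounts = np.asarray(amounts)
    preds = np.asarray(model.predict(X)).copy()
    preds[amounts == 0] = 1  # rule taken from the exploration: zero amount means fraud
    return preds

class AlwaysGenuine:
    """Stand-in model used only for this illustration; predicts 0 for every row."""
    def predict(self, X):
        return np.zeros(len(X), dtype=int)

X_demo = np.zeros((3, 2))
result = predict_with_zero_rule(AlwaysGenuine(), X_demo, amounts=[100.0, 0.0, 50.0])
print(result)  # [0 1 0]
```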


### Fraud transaction analysis

We noticed that there are inaccuracies in how the ‘balance’ variables are captured for both originator and recipient. We also observed that in almost half the cases, the originator’s initial balance is recorded as 0.
We check the inaccuracy in the balance variables and compare it between fraudulent and non-fraudulent transactions. The inaccuracy is defined as the difference between what the balance should be after accounting for the transaction amount and what is recorded as the new balance.

We calculate the balance inaccuracies for both the originator and destination as follows:

In [None]:
# Defining inaccuracies in originator and recipient balances
data['origBalance_inacc'] = (data['oldbalanceOrg'] - data['amount']) - data['newbalanceOrig']
data['destBalance_inacc'] = (data['oldbalanceDest'] + data['amount']) - data['newbalanceDest']
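
As a worked example, the same two formulas can be applied to rows 2 and 3 of the `data.head()` output shown earlier (both labelled as fraud). The frame `rows` is just a hand-copied excerpt of those two rows.

```python
import pandas as pd

# Rows 2 and 3 from the data.head() output (a TRANSFER and a CASH_OUT of 181.0).
rows = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT"],
    "amount": [181.0, 181.0],
    "oldbalanceOrg": [181.0, 181.0],
    "newbalanceOrig": [0.0, 0.0],
    "oldbalanceDest": [0.0, 21182.0],
    "newbalanceDest": [0.0, 0.0],
})

# Same definitions as in the cell above.
orig_inacc = (rows["oldbalanceOrg"] - rows["amount"]) - rows["newbalanceOrig"]
dest_inacc = (rows["oldbalanceDest"] + rows["amount"]) - rows["newbalanceDest"]

print(orig_inacc.tolist())  # [0.0, 0.0]
print(dest_inacc.tolist())  # [181.0, 21363.0]
```

The originator balances are consistent in both rows, while the destination balances are off by a positive amount.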


Overall, we identified a few dimensions along which fraudulent transactions can be distinguished from non-fraudulent transactions. These are as follows:

Time step - fraudulent transactions are equally likely to occur in all time steps, but genuine transactions peak in specific time steps.

Balances - the initial balance of the originator is much more likely to be 0 for genuine transactions than for fraudulent transactions.

Inaccuracies in balance - the inaccuracy in the destination balance is likely to be negative for genuine transactions but positive for fraudulent transactions.
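
Comparisons like these can be produced with a per-class summary. The sketch below is an assumption about how such a summary might be computed, not the analysis actually run here: `summarize_by_class` is a hypothetical helper and the four rows in `toy` are made up for illustration.

```python
import pandas as pd

def summarize_by_class(df):
    """Hypothetical helper: per-class share of zero initial balances
    and of positive destination-balance inaccuracies."""
    return df.groupby("isFraud").agg(
        zero_initial_balance=("oldbalanceOrg", lambda s: (s == 0).mean()),
        positive_dest_inacc=("destBalance_inacc", lambda s: (s > 0).mean()),
    )

# Made-up rows, only to show the shape of the output.
toy = pd.DataFrame({
    "isFraud": [0, 0, 1, 1],
    "oldbalanceOrg": [0.0, 500.0, 181.0, 181.0],
    "destBalance_inacc": [-200.0, 0.0, 181.0, 21363.0],
})

summary = summarize_by_class(toy)
print(summary)
```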

# Predictive modeling for fraud detection

In this section, we choose the variables needed for the ML model, encode categorical variables as numeric, and standardize the data.
Let us recall the columns in the dataset:

In [None]:
data.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud', 'origBalance_inacc', 'destBalance_inacc'],
      dtype='object')

The name (or ID) of the originator and destination are not needed for classification. So, we remove them.


In [None]:
# Modeling dataset creation
data = data.drop(['nameOrig', 'nameDest'], axis=1)


We have one categorical variable in the dataset: the transaction type. This feature needs to be encoded as binary dummy variables. The following code snippet performs this encoding.

In [None]:
# Creating dummy variables through one hot encoding for 'type' column
data = pd.get_dummies(data, columns=['type'], prefix=['type'])

This creates two binary dummy variables – type_CASH_OUT and type_TRANSFER
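
A small illustration of what the encoding produces, using a hypothetical three-row frame with the same `type` values:

```python
import pandas as pd

# Hypothetical toy rows; only 'type' is categorical.
toy = pd.DataFrame({
    "type": ["CASH_OUT", "TRANSFER", "CASH_OUT"],
    "amount": [181.0, 181.0, 500.0],
})

# Same call as above: the 'type' column is replaced by one column per category.
encoded = pd.get_dummies(toy, columns=["type"], prefix=["type"])
print(encoded.columns.tolist())  # ['amount', 'type_CASH_OUT', 'type_TRANSFER']
```

With only two categories, the two dummy columns carry the same information (each is the complement of the other); passing `drop_first=True` to `get_dummies` would keep just one of them.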

This transformation standardizes every feature column to zero mean and unit variance so that all features are on a comparable scale. It uses the StandardScaler class from scikit-learn. The following code snippet performs the transformation.

In [None]:
# Standardization of the dataset
from sklearn.preprocessing import StandardScaler

# Scale every column except the target; keep the original row index so the labels align
feature_cols = data.columns[data.columns != 'isFraud']
std_scaler = StandardScaler()
data_scaled = pd.DataFrame(std_scaler.fit_transform(data[feature_cols]),
                           columns=feature_cols, index=data.index)
data_scaled['isFraud'] = data['isFraud']
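
A common variant, shown here only as a sketch on synthetic data, fits the scaler on the training rows alone and reuses those statistics for the test rows, so that test data does not influence the transformation. The arrays `X_demo` and `y_demo` are random placeholders, not the fraud data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=1000.0, scale=250.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                        # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)   # mean and std come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training mean and std

# Training columns now have (approximately) zero mean and unit variance.
print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
```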


We split the scaled dataset into training and testing datasets. We decide to use 70% of the original data for training and the remaining 30% for testing.
The following code snippet is used to create training and testing datasets.

In [None]:
# Splitting dataset into train and test
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
X = data_scaled.loc[:, data_scaled.columns != 'isFraud']
y = data_scaled.loc[:, data_scaled.columns == 'isFraud']
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(X, y, test_size=0.3, random_state=0)
# Fit the encoder on the training labels only, then reuse it for the test labels
label_encoder = LabelEncoder()
y_train_original = label_encoder.fit_transform(y_train_original.values.ravel())
y_test_original = label_encoder.transform(y_test_original.values.ravel())
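
Because fraud is a rare class, it can help to pass `stratify` to `train_test_split` so that both parts keep the same fraud ratio. This is an optional variation, sketched on synthetic labels (20 positives in 1000 rows), not the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 20 positives out of 1000 rows.
y_demo = np.array([1] * 20 + [0] * 980)
X_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # both close to the overall rate of 0.02
```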


# Model training

For this prediction I used a random forest classifier with the Gini index as the split criterion, a maximum depth of 5, and 10 estimators.

In [None]:
from sklearn.ensemble import RandomForestClassifier
#from sklearn.model_selection import StratifiedKFold
#scr = 'recall'
model_rf = RandomForestClassifier(criterion='gini', max_depth=5, n_estimators=10)
#skf = StratifiedKFold(5)


In [None]:
# Cross validation
#from sklearn.model_selection import cross_val_score
#sc_rf = cross_val_score(model_rf, X_train_original, y_train_original, cv=skf, scoring=scr)
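
The commented-out lines above outline a stratified 5-fold cross-validation scored on recall. A runnable sketch of that idea follows, using synthetic data from `make_classification` as a stand-in for `X_train_original` and `y_train_original`; the `random_state` values are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (about 10% positives) standing in for the training set.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

model = RandomForestClassifier(criterion="gini", max_depth=5, n_estimators=10, random_state=0)
skf = StratifiedKFold(n_splits=5)  # each fold keeps the class ratio

# One recall score per fold.
recall_per_fold = cross_val_score(model, X_demo, y_demo, cv=skf, scoring="recall")
print(recall_per_fold, recall_per_fold.mean())
```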


In [None]:
model_rf.fit(X_train_original, y_train_original)

RandomForestClassifier()

In [None]:
y_pred = model_rf.predict(X_test_original)

The key factors that predict fraudulent transactions are the balance of the originator, the inaccuracies in the balances, and the time step, among others.
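
A fitted random forest exposes a `feature_importances_` attribute that can be used to rank predictors. The sketch below shows the mechanics on synthetic data with placeholder column names (`f0` to `f3`); applied to `model_rf` and the columns of `X_train_original`, the same two lines would give a ranking for the fraud features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

model = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances, one per feature, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```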

# Result

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
print(f'Confusion Matrix: \n\n{confusion_matrix(y_test_original, y_pred)}')
print("\nAccuracy score: ", accuracy_score(y_test_original, y_pred) * 100)
print("\nPrecision score: ", precision_score(y_test_original, y_pred, average=None) * 100)
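
On imbalanced data, accuracy alone can look good even when frauds are missed, which is why recall on the fraud class is worth reporting alongside it. A small illustration with hypothetical labels (2 frauds among 10 transactions, one of them caught):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: the last two transactions are fraudulent.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predictions: one fraud is caught, the other is missed.
y_hat = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)    # 0.9: nine of ten labels are correct
prec = precision_score(y_true, y_hat)  # 1.0: every flagged transaction is a fraud
rec = recall_score(y_true, y_hat)      # 0.5: only half of the frauds are caught
print(acc, prec, rec)
```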


# Conclusion


For prevention, the company should focus on the type of transaction, particularly CASH_OUT and TRANSFER.

The above model can identify fraudulent transactions beforehand, allowing the team to take preemptive action before the transactions clear. This could increase the company's internal efficiency, by an estimated 10x, and save significant transaction costs.

Another way to fight financial fraud is to get foresight into why and when it might happen. Using predictive analytics, a machine learning model can identify the factors that contribute to fraud and produce accurate forecasts. This way, early intervention is possible, and the risks of fraud can be managed and reduced appropriately.



### Leveraging Machine Learning to manage financial fraud

With visibility into the variables that are likely to characterize fraudulent transactions, financial organizations can not only detect fraud but anticipate its occurrence.

This allows for early intervention to mitigate financial losses, avoid damage to company reputation and customer experience, and even keep company morale in check. For example, financial organizations can set up auto alerts for suspicious activity, launch educational campaigns for customers and users, and facilitate consistent monitoring to reduce risk.