## **Problem Statement**

Develop a machine learning model using supervised learning on the proposed dataset. This is a binary classification problem with the target variable 'isFraud'. The model should predict the positive class, which indicates that a transaction is fraudulent. The model must also be implemented in the TensorFlow framework.

## **Exploratory Data Analysis (EDA)**

### Dataset Size: The First Challenge

The dataset is the [Synthetic Financial Transaction Dataset](https://www.kaggle.com/datasets/ealaxi/paysim1) from Kaggle, with approximately 6,500,000 rows and 11 features. The first challenge we ran into was that, as students, doing any kind of EDA on data of this magnitude was too time-consuming: our smaller machines had trouble running the Python cells and scripts that process this amount of data. So our first action was to downsample the dataset to roughly 500,000 rows.

It was important to preserve the ratio of classes during the resize (the dataset is imbalanced), so we used scikit-learn's train_test_split function with the 'stratify' parameter. We first tried doing this in notebook cells, but the cells were too slow and VS Code would freeze. As a quick workaround, we moved the resizing into a standalone Python script that caches the loaded dataframe.
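
To illustrate why stratifying matters, here is a minimal sketch on a small synthetic stand-in dataset (not the real transactions; the 1% fraud rate and the `amount` values are made up for the example). It keeps 8% of the rows and checks that the fraud rate in the kept subset matches the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100,000 rows with a 1% positive class.
# These numbers are assumptions chosen only to make the effect visible.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=100_000),
    "isFraud": np.r_[np.ones(1_000, dtype=int), np.zeros(99_000, dtype=int)],
})

# Keep 8% of the rows; stratifying on the label carries the class ratio over.
kept, _ = train_test_split(
    toy, test_size=0.92, stratify=toy["isFraud"], random_state=777
)

full_rate = toy["isFraud"].mean()
kept_rate = kept["isFraud"].mean()
print(f"fraud rate, full: {full_rate:.4f}  kept: {kept_rate:.4f}")
```

Without `stratify`, the kept subset is a plain random sample, so with a class this rare the fraud rate can drift noticeably from one random seed to the next.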

### Resizing the Dataset by Python Script (do not run cell)

#### resize_dataset.py

In [None]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# Cache libraries
import joblib
import os
import time

CSV_PATH = './data/financial_transactions.csv'
CSV_RESIZED_PATH = './data/financial_transactions_resized.csv'
DF_CACHE = './data/df_cache.pkl'


def read_dataset(dataset_file_path: str):
    '''
    Rename columns then return dataframe. Our dataset was too large (around
    6,500,000 rows) to run on our machines. So caching is used to speed up 
    execution.
    '''

    # Check if the dataframe is cached
    if os.path.exists(DF_CACHE):

        # Cache exists, load dataframe from cache
        df = joblib.load(DF_CACHE)
    else:

        df = pd.read_csv(dataset_file_path)

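        # Intended to give the columns better-fitting names and remove some of
        # the typos in the original dataset column names.
        # NOTE: rename() returns a new DataFrame and the result is not
        # assigned here, so the original column names are kept (as the
        # df.head() output further down shows). Assign the result back
        # (df = df.rename(...)) to actually apply the rename.
        df.rename(columns={'type': 'paymentType',
                           'nameOrig': 'accSender',
                           'oldbalanceOrg': 'oldBalanceSender',
                           'newbalanceOrig': 'newBalanceSender',
                           'nameDest': 'accRecipient',
                           'oldbalanceDest': 'oldBalanceRecipient',
                           'newbalanceDest': 'newBalanceRecipient'})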

        # Cache the dataframe for quick repeated script execution
        joblib.dump(df, DF_CACHE)

    # Print class imbalance
    print(df['isFraud'].value_counts())

    return df

def resize_dataset(df: pd.DataFrame):
    ''' 
    Resize the dataset to have 500,000 entries.
    '''

    X = df.drop('isFraud', axis=1)
    y = df['isFraud']

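    # test_size=0.92 discards 92% of the rows; the remaining 8% (roughly
    # 500,000 rows) is kept. stratify=y preserves the class ratio.
    X_resized, _, y_resized, _ = train_test_split(X, y,
                                                  test_size=0.92,
                                                  stratify=y,
                                                  random_state=777)

    # Re-attach the target column before writing the subset out
    X_resized['isFraud'] = y_resized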

    # Write to file.
    X_resized.to_csv(CSV_RESIZED_PATH, index=False)

if __name__ == "__main__":

    df = read_dataset(CSV_PATH)
    resize_dataset(df)

### Findings on the Resized Dataset

First, set the location of the dataset according to where you stored it.

In [1]:
CSV_LOCATION = './data/financial_transactions_resized.csv'

Now let us do some EDA on our resized dataset.

In [2]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

df = pd.read_csv(CSV_LOCATION)
df.head()

|  | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFlaggedFraud | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 522 | CASH_OUT | 239530.21 | C466629885 | 0.0 | 0.0 | C1784299460 | 633346.88 | 872877.09 | 0 | 0 |
| 1 | 233 | CASH_IN | 351816.64 | C714945383 | 52728.0 | 404544.64 | C679086556 | 16479219.3 | 16127402.66 | 0 | 0 |
| 2 | 160 | PAYMENT | 6761.76 | C86193329 | 0.0 | 0.0 | M84953887 | 0.0 | 0.0 | 0 | 0 |
| 3 | 181 | CASH_OUT | 45639.34 | C1783941943 | 0.0 | 0.0 | C1153411207 | 49385.07 | 95024.4 | 0 | 0 |
| 4 | 277 | PAYMENT | 14687.36 | C1712987399 | 17055.0 | 2367.64 | M132453964 | 0.0 | 0.0 | 0 | 0 |


## **Data Preprocessing**

In [None]:
from sklearn.model_selection import train_test_split