# UTS Deep Learning and Optimization
## Case: Fraud Detection
---
Name: Felicia Ferren

Student ID (NIM): 2440013071

OneDrive link: https://binusianorg-my.sharepoint.com/personal/felicia_ferren_binus_ac_id/_layouts/15/guestaccess.aspx?docid=03802a58f6bdb450c9deec8ab7a717522&authkey=ASsRA8sA5MwMJ3XGLINL5Ec&e=k5UMzN 

Alternate link: https://youtu.be/_xGoGMvdvIA 



## Case I: Fraud Detection

### About Dataset
> It is important that credit card companies are **able to recognize fraudulent credit card transactions** so that customers are not charged for items that they did not purchase.  
>  



#### About Data
> The dataset contains **transactions made by credit cards in September 2013 by European cardholders**.
This dataset presents transactions that occurred in two days, where we have **492 frauds out of 284,807 transactions**. The dataset is **highly unbalanced, the positive class (frauds) account for 0.172% of all transactions**.
>
> It contains **only numerical input variables** which are **the result of a PCA transformation**. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. **Features V1, V2, … V28** are the principal components obtained with PCA, the only features which have not been transformed with PCA are **'Time' and 'Amount'**. **Feature 'Time'** contains the seconds elapsed between each transaction and the first transaction in the dataset. **The feature 'Amount'** is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. **Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise**.
>
> Given the class **imbalance ratio**, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.


#### Acknowledgements
> The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.
More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

### Import from Kaggle

First, we mount Google Drive to access the kaggle.json file, which holds the API token needed to download data from Kaggle.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
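# set up the Kaggle CLI so the dataset can be downloaded
!pip install kaggle
!mkdir -p ~/.kaggle # -p avoids an error if the folder already exists
!cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json # API token stored in my google drive
!chmod 600 ~/.kaggle/kaggle.json # restrict permissions on the token file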

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Then, import the specific dataset and unzip the dataset.

In [None]:
# import dataset from kaggle
!kaggle datasets download -d whenamancodes/fraud-detection

Downloading fraud-detection.zip to /content
 74% 49.0M/66.0M [00:00<00:00, 63.7MB/s]
100% 66.0M/66.0M [00:00<00:00, 72.0MB/s]


In [None]:
!unzip fraud-detection.zip

Archive:  fraud-detection.zip
  inflating: creditcard.csv          


Now, we have our dataset inside colab!

### Data Preparation

Import the libraries needed and seed the random number generators so the notebook gives stable output across runs.

In [None]:
#import libraries
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

import numpy as np

In [None]:
# seeding, to make this notebook output stable across runs 
def seed_everything(seed: int):
    import random, os
    import numpy as np
    import torch
    
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False # benchmark mode may pick different kernels between runs, so keep it off
    
seed_everything(42)
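
As a quick sanity check of what seeding provides, the sketch below re-seeds PyTorch and compares the random numbers that come out. The helper name `draw_sample` is hypothetical and only used for this illustration.

```python
import torch

def draw_sample(seed: int) -> torch.Tensor:
    """Seed PyTorch's CPU generator, then draw three random numbers."""
    torch.manual_seed(seed)
    return torch.rand(3)

first = draw_sample(42)
second = draw_sample(42)   # same seed, so the same numbers are expected
other = draw_sample(7)     # different seed, so different numbers are expected

print(torch.equal(first, second))
print(torch.equal(first, other))
```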

Now, load the dataset (a CSV file) and look at the first rows.

In [None]:
# load dataset
cc = pd.read_csv('creditcard.csv')

cc.head()
# 30 features
# labels: 1 - fraud; 0 - not fraud

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Let's look at the shape of the table, the descriptive statistics, and the info of the dataframe.

In [None]:
# data shape
print(cc.shape)

(284807, 31)


In [None]:
# descriptive statistics
cc.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,1.168375e-15,3.416908e-16,-1.379537e-15,2.074095e-15,9.604066e-16,1.487313e-15,-5.556467e-16,1.213481e-16,-2.406331e-15,...,1.654067e-16,-3.568593e-16,2.578648e-16,4.473266e-15,5.340915e-16,1.683437e-15,-3.660091e-16,-1.22739e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [None]:
# data summary
cc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

From the results above, we can see that the data shape is (284807, 31), meaning there are 284807 observations and 31 columns: 30 features plus the Class target. All of the columns are numerical, as stated in the case description.

Now, we will create our own dataset class.

In [None]:
# create our own dataset class
from torch.utils.data import Dataset

class CreditCardDataset(Dataset):
  def __init__(self, X, y):
    self.X = X
    self.y = y

  def __getitem__(self, index):
    X = torch.Tensor(self.X[index])
    y = torch.LongTensor(self.y[index, None])

    return X, y
  
  def __len__(self):
    return len(self.X)
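
To make the shapes concrete, here is a small usage sketch on made-up data. The class is repeated so the block runs on its own, and the `toy_` names are hypothetical. It shows that each label comes back with shape `[1]`, which is why the training loop later flattens `y_batch` before computing the loss.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class CreditCardDataset(Dataset):
    """Same structure as the class above, repeated so the sketch is self-contained."""
    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __getitem__(self, index):
        X = torch.Tensor(self.X[index])            # float32 feature vector
        y = torch.LongTensor(self.y[index, None])  # int64 label with shape [1]
        return X, y

    def __len__(self):
        return len(self.X)

# toy data (made up): 5 rows with 30 features each, binary labels
toy_X = np.random.rand(5, 30)
toy_y = np.array([0, 0, 1, 0, 1], dtype=np.int64)

toy_ds = CreditCardDataset(toy_X, toy_y)
features, label = toy_ds[2]
print(features.shape, label.shape)

toy_loader = DataLoader(toy_ds, batch_size=2, shuffle=False)
X_batch, y_batch = next(iter(toy_loader))
print(X_batch.shape, y_batch.shape, y_batch.flatten().shape)
```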

Then, separate the features from the labels. The predictors (X variable) are Time, V1-V28, and Amount. The dependent variable (y variable) is the Class column.

In [None]:
# get the features and labels from the dataset 
X = cc[cc.columns[0:30]].values # set time, V1-V28, and amount feature as X variable
y = cc.Class.values.astype(np.int64) # set class feature as y variable

print(X)
print(y)

[[ 0.00000000e+00 -1.35980713e+00 -7.27811733e-02 ...  1.33558377e-01
  -2.10530535e-02  1.49620000e+02]
 [ 0.00000000e+00  1.19185711e+00  2.66150712e-01 ... -8.98309914e-03
   1.47241692e-02  2.69000000e+00]
 [ 1.00000000e+00 -1.35835406e+00 -1.34016307e+00 ... -5.53527940e-02
  -5.97518406e-02  3.78660000e+02]
 ...
 [ 1.72788000e+05  1.91956501e+00 -3.01253846e-01 ...  4.45477214e-03
  -2.65608286e-02  6.78800000e+01]
 [ 1.72788000e+05 -2.40440050e-01  5.30482513e-01 ...  1.08820735e-01
   1.04532821e-01  1.00000000e+01]
 [ 1.72792000e+05 -5.33412522e-01 -1.89733337e-01 ... -2.41530880e-03
   1.36489143e-02  2.17000000e+02]]
[0 0 0 ... 0 0 0]


### Data Pre-Processing

We continue with pre-processing. In this section, we explore the data further and prepare it for modelling.

First, we check whether the data has missing values.

In [None]:
# find out missing values
cc.isna().sum() # no missing values

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

So, there are no missing values. Now, we check the class counts. Are they the same as stated in the case description?

In [None]:
# see the count in class feature // target feature
cc[cc.columns[-1]].value_counts() # the target class is imbalanced

0    284315
1       492
Name: Class, dtype: int64

It is true that there are only 492 frauds out of 284,807 transactions, so the classes are heavily imbalanced. To counter that, we will pass class weights to the loss function.
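
The arithmetic behind the imbalance can be sketched as follows, using the counts printed above. It also shows the accuracy that a trivial classifier would reach by always predicting "not fraud", which is why accuracy alone says little on this dataset. The variable names are illustrative only.

```python
n_total = 284807
n_fraud = 492
n_legit = n_total - n_fraud            # 284315, as in the value_counts output

fraud_share = n_fraud / n_total
baseline_accuracy = n_legit / n_total  # accuracy of always predicting class 0

print("fraud share: {:.4%}".format(fraud_share))
print("accuracy of an always-'not fraud' classifier: {:.4%}".format(baseline_accuracy))
```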

Now, we standardize the features with StandardScaler, which rescales each feature to zero mean and unit variance (it does not change the shape of the distribution). This matters because we do not want features with large ranges, such as Time and Amount, to dominate the others.

In [None]:
# transformation using standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
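
The cell above fits the scaler on the full dataset before splitting. A common variant, sketched below on made-up data, fits the scaler on the training rows only, so that statistics from held-out rows do not leak into training. All `demo_` names are hypothetical, and this is only one possible arrangement.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# made-up data with a large scale, standing in for columns such as Amount
demo_X = rng.normal(loc=100.0, scale=25.0, size=(1000, 3))
demo_y = rng.integers(0, 2, size=1000)

demo_train_X, demo_test_X, demo_train_y, demo_test_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
demo_train_X = demo_scaler.fit_transform(demo_train_X)  # learn mean/std from training rows only
demo_test_X = demo_scaler.transform(demo_test_X)        # reuse those statistics on held-out rows

print(demo_train_X.mean(axis=0).round(3), demo_train_X.std(axis=0).round(3))
```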


Now, we split the data into train, validation, and test sets using the train_test_split function. The ratio for train, validation, and test data is 80:10:10.

To obtain this ratio, we first split the data into train and test sets using an 80:20 ratio. Then, we split the test set into validation and test sets using a 50:50 ratio.

Then, load the data using DataLoader and continue to modelling (with batch size = 16).

In [None]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2) #train 80; test 20
test_X, valid_X, test_y, valid_y = train_test_split(test_X, test_y, 
                                                      test_size=0.5) # divide test 1:1 with valid

from torch.utils.data import DataLoader

train_ds = CreditCardDataset(train_X, train_y) #load our dataset; with batch size = 16
train_loader = DataLoader(train_ds, batch_size=16, 
                             shuffle=True, num_workers=0)

valid_ds = CreditCardDataset(valid_X, valid_y)
valid_loader = DataLoader(valid_ds, batch_size=16, 
                             shuffle=False, num_workers=0)

test_ds = CreditCardDataset(test_X, test_y)
test_loader = DataLoader(test_ds, batch_size=16, 
                            shuffle=False, num_workers=0)
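
With so few frauds, a purely random split can leave the validation and test sets with slightly different fraud rates. `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in every split. The sketch below shows the same 80:10:10 scheme on made-up data; the short variable names are hypothetical and this is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 10,000 rows with 1% positives
demo_X = np.arange(10000).reshape(-1, 1)
demo_y = np.zeros(10000, dtype=np.int64)
demo_y[:100] = 1

# 80:20 split first, then the 20% is halved into test and validation
tr_X, hold_X, tr_y, hold_y = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)
te_X, va_X, te_y, va_y = train_test_split(
    hold_X, hold_y, test_size=0.5, stratify=hold_y, random_state=42
)

print(len(tr_y), len(va_y), len(te_y))     # rows per split
print(tr_y.sum(), va_y.sum(), te_y.sum())  # positives per split
```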

### Modelling 1

Now, we start building our model. The input layer takes 30 features because there are 30 features in our data. The first fully connected layer doubles the width to 60 units, and the hidden layer doubles it again to 120 units. The final layer has 2 output features, one per class (0 - not fraud and 1 - fraud).

ReLU is applied after the first layer and after the hidden layer. The final layer returns raw logits.

In [None]:
# architecture
# n nodes input layer, 1 hidden layer with 2 × n initial nodes, and a final layer of k classes
# n = 30; k = 2 (class: 0-not fraud & 1-fraud)
class Net(nn.Module):
    # define nn
    def __init__(self):
        super(Net, self).__init__()
        # input layer
        self.fc1 = nn.Linear(30, 60) # input features = 30, output features = 2 x 30 = 60
        # hidden layer
        self.fc2 = nn.Linear(60, 120) # input features = 60, output features = 2 x 60 = 120
        # output layer
        self.fc3 = nn.Linear(120, 2) #input features = 120 , output features = 2 (class: 0-not fraud & 1-fraud)
        
    def forward(self, X):
        X = self.fc1(X)
        X = F.relu(X)
        X = self.fc2(X)
        X = F.relu(X)
        X = self.fc3(X)

        return X
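
A quick shape check helps confirm the layer sizes. The sketch below repeats the same 30 -> 60 -> 120 -> 2 layout in compact form, pushes a made-up batch through it, and counts the trainable parameters. The names `demo_net` and `dummy_batch` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """Same layer sizes as the class above: 30 -> 60 -> 120 -> 2."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(30, 60)
        self.fc2 = nn.Linear(60, 120)
        self.fc3 = nn.Linear(120, 2)

    def forward(self, X):
        X = F.relu(self.fc1(X))
        X = F.relu(self.fc2(X))
        return self.fc3(X)  # raw logits, one per class

demo_net = Net()
dummy_batch = torch.randn(4, 30)  # 4 made-up transactions
logits = demo_net(dummy_batch)

# weights + biases: (30*60 + 60) + (60*120 + 120) + (120*2 + 2)
n_params = sum(p.numel() for p in demo_net.parameters())

print(logits.shape)
print(n_params)
```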

After defining the model, we can instantiate it. Then, define the class weights (because this is an imbalanced classification case).

We know the number of samples for each class (0 - not fraud and 1 - fraud). The weight for each class is obtained by dividing the total number of samples by 2 (the number of classes) times the number of samples in that class.

Then, we use cross-entropy loss with these weights and Stochastic Gradient Descent as the optimizer, with a learning rate of 0.001.

In [None]:
# Instantiating the model
net = Net()

#class weights for two-class classification
class_weights = torch.tensor([(284807/(2*284315)), (284807/(2*492))])
# Choosing the loss function
criterion = nn.CrossEntropyLoss(weight = class_weights) # imbalanced classification
# Choosing the optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.001)
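
The weight formula used above can be written as a small function. The sketch below is only an illustration with a hypothetical helper name, `balanced_class_weights`; it follows the same rule as scikit-learn's "balanced" class weight heuristic, n_samples / (n_classes * n_samples_in_class).

```python
def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * n_samples_in_class_c)"""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]

# counts taken from the value_counts output: class 0 (not fraud), class 1 (fraud)
weights = balanced_class_weights([284315, 492])
print(weights)
```

The rare class receives a weight several hundred times larger than the common class, so each misclassified fraud contributes far more to the loss.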

We will track the training and validation loss for 20 epochs.

In [None]:
epochs = 20
 
train_mean_losses = []
valid_mean_losses = []

valid_best_loss = np.inf

for i in range(epochs):  
    #===============================================================
    # training 
    train_losses = []
    
    print("=========================================================")
    print("Epoch {}".format(i))

    for iteration, batch_data in enumerate(train_loader):
        X_batch, y_batch = batch_data 
        
        optimizer.zero_grad() # zero out gradients so the gradients wont accumulate
        
        out = net(X_batch) # give prediction // forward pass
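        loss = criterion(out, y_batch.flatten()) # compute loss; logits are [batch, 2], targets are [batch]
        
        loss.backward() # back propagation
        optimizer.step() # update weight and bias using the computed gradients
        
        train_losses.append(loss.detach()) # detach so the autograd graph of each batch is not kept in memory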
    
    train_mean_loss = torch.mean(torch.stack(train_losses)) # average loss
    print('training loss: {:10.8f}'.format(train_mean_loss))
    
    train_mean_losses.append(train_mean_loss) # keep the per-epoch mean training loss
    
    #===============================================================
    # validation
    valid_losses = []
    with torch.set_grad_enabled(False):
        for iteration, batch_data in enumerate(valid_loader):
            X_batch, y_batch = batch_data

            out = net(X_batch) # forward pass
            loss = criterion(out, y_batch.flatten()) # compute loss
            valid_losses.append(loss) 
            
        valid_mean_loss = torch.mean(torch.stack(valid_losses)) # average loss
        print('validation loss: {:10.8f}'.format(valid_mean_loss))
        
        valid_mean_losses.append(valid_mean_loss) # keep the per-epoch mean validation loss
        
        if valid_mean_loss.item() < valid_best_loss: # keep the checkpoint with the lowest validation loss
            valid_best_loss = valid_mean_loss.item()
            torch.save(net.state_dict(), "best_model.pth")
            best_epoch = i
    #===============================================================

Epoch 0
training loss: 0.08191586
validation loss: 0.03670596
Epoch 1
training loss: 0.02856855
validation loss: 0.02936846
Epoch 2
training loss: 0.02533012
validation loss: 0.02779216
Epoch 3
training loss: 0.02376500
validation loss: 0.02637728
Epoch 4
training loss: 0.02265439
validation loss: 0.02582286
Epoch 5
training loss: 0.02187897
validation loss: 0.02517576
Epoch 6
training loss: 0.02121170
validation loss: 0.02445884
Epoch 7
training loss: 0.02059289
validation loss: 0.02401854
Epoch 8
training loss: 0.02039365
validation loss: 0.02391064
Epoch 9
training loss: 0.01993328
validation loss: 0.02362168
Epoch 10
training loss: 0.01947279
validation loss: 0.02347326
Epoch 11
training loss: 0.01926268
validation loss: 0.02278720
Epoch 12
training loss: 0.01914011
validation loss: 0.02288328
Epoch 13
training loss: 0.01879090
validation loss: 0.02228023
Epoch 14
training loss: 0.01861238
validation loss: 0.02235446
Epoch 15
training loss: 0.01747836
validation loss: 0.02215145
Ep

The training and validation loss indicate whether the model is overfitting or underfitting. A model is overfitting if the training loss keeps decreasing while the validation loss stops decreasing or starts to rise.

From the result above, we can see that both training and validation loss kept decreasing over the epochs shown. Hence, there is no clear sign of overfitting or underfitting.

However, we should not jump to conclusions. To check for overfitting further, we will evaluate the model on the test data.

### Modelling 2: Architecture Modification & Hyperparameter Tuning

In this section, we modify the architecture with dropout and tune the hyperparameters using Optuna to try to obtain a better model.

In [None]:
!pip install optuna


For the architecture modification, dropout is placed after the ReLU that follows the first fully connected layer.

Dropout is used to prevent a model from overfitting. During training, it randomly sets the outputs of hidden units (the neurons that make up hidden layers) to 0 at each update, here with probability 0.5. At evaluation time, dropout is switched off.
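
The difference between training mode and evaluation mode can be seen directly on a small tensor. The sketch below is a minimal illustration with made-up input; `drop` and `ones` are hypothetical names. In training mode, `nn.Dropout` zeroes units at random and scales the remaining ones by 1 / (1 - p); in evaluation mode it passes the input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
ones = torch.ones(1, 10)

drop.train()            # training mode: units are dropped at random
train_out = drop(ones)  # surviving units are scaled by 1 / (1 - 0.5) = 2

drop.eval()             # evaluation mode: dropout does nothing
eval_out = drop(ones)

print(train_out)
print(eval_out)
```

For a full model, the same switch is made with `model.train()` before the training batches and `model.eval()` before validation or testing.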

In [None]:
# architecture modification
# n nodes input layer, 1 hidden layer with 2 × n initial nodes, and a final layer of k classes
# n = 30; k = 2 (class: 0-not fraud & 1-fraud)
class NetModif(nn.Module):
    # define nn
    def __init__(self):
        super(NetModif, self).__init__()
        # input layer
        self.fc1 = nn.Linear(30, 60) # input features = 30, output features = 2 x 30 = 60
        # hidden layer
        self.fc2 = nn.Linear(60, 120) # input features = 60, output features = 2 x 60 = 120
        # output layer
        self.fc3 = nn.Linear(120, 2) #input features = 120 , output features = 2 (class: 0-not fraud & 1-fraud)
        
        self.drop = nn.Dropout(0.50)
        
    def forward(self, X):
        X = self.fc1(X)
        X = F.relu(X)
        X = self.drop(X)
        X = self.fc2(X)
        X = F.relu(X)
        X = self.fc3(X)

        return X

Then, for hyperparameter tuning, we let Optuna sample from:
- Adam, Adadelta, Adagrad, and SGD for the optimizer.
- 10^-5 to 10^-1 for the learning rate (sampled on a log scale).
- 16 to 64 for the batch size, with step = 16.

In [None]:
# hyperparameter tuning process
import optuna
from torch import optim

def objective(trial):

    # Generate the model.
    model2 = NetModif()

    # Generate the optimizers.

    # try RMSprop and SGD
    '''
    optimizer_name = trial.suggest_categorical("optimizer", ["RMSprop", "SGD"])
    momentum = trial.suggest_float("momentum", 0.0, 1.0)
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    optimizer = getattr(optim, optimizer_name)(model.parameters(), lr=lr,momentum=momentum)
    '''
    #try Adam, AdaDelta and Adagrad
    
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "Adadelta","Adagrad", "SGD"])
    lr = trial.suggest_float("lr", 1e-5, 1e-1,log=True)
    optimizer = getattr(optim, optimizer_name)(model2.parameters(), lr=lr)
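    # note: batch_size is sampled here, but the loops below iterate over the global
    # train_loader / valid_loader (batch size 16), so it does not affect training
    batch_size=trial.suggest_int("batch_size", 16, 64,step=16)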

    criterion=nn.CrossEntropyLoss(weight = class_weights)
    
    N_EPOCHS = 10
    # Training of the model.
    for epoch in range(N_EPOCHS):
        model2.train()
       
        for batch_idx, (images, labels) in enumerate(train_loader):
            # Limiting training images for faster epochs.
            #if batch_idx * BATCHSIZE >= N_TRAIN_EXAMPLES:
            #    break

            images, labels = images, labels

            optimizer.zero_grad()
            output = model2(images)
            loss = criterion(output.squeeze(), labels.flatten())
            loss.backward()
            optimizer.step()

        # Validation of the model.
        model2.eval()
        correct = 0
        with torch.no_grad():
            for batch_idx, (images, labels) in enumerate(valid_loader):
                # Limiting validation images.
               # if batch_idx * BATCHSIZE >= N_VALID_EXAMPLES:
                #    break
                images, labels = images, labels
                output = model2(images)
                # Get the index of the largest logit, i.e. the predicted class.
                pred = output.argmax(dim=1, keepdim=True)
                correct += pred.eq(labels.view_as(pred)).sum().item()

        accuracy = correct / len(valid_loader.dataset)

        trial.report(accuracy, epoch)

        # Handle pruning based on the intermediate value.
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return accuracy
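
In the objective above, the sampled batch size is never handed to a DataLoader. One way to wire it in would be to build the loaders inside each trial. The sketch below shows the idea with a hypothetical helper, `make_loader`, and made-up tensors; it leaves out Optuna itself so that it runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(features, labels, batch_size, shuffle):
    """Hypothetical helper: wrap tensors in a DataLoader with the given batch size."""
    dataset = TensorDataset(features, labels)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

# made-up tensors standing in for the scaled training split
demo_features = torch.randn(128, 30)
demo_labels = torch.randint(0, 2, (128,))

# 16, 32, 48, 64 are the values a 16-to-64 range with step 16 can produce
for sampled_batch_size in (16, 32, 48, 64):
    loader = make_loader(demo_features, demo_labels, sampled_batch_size, shuffle=True)
    X_batch, y_batch = next(iter(loader))
    print(sampled_batch_size, tuple(X_batch.shape), len(loader))
```

Inside an objective function, the value returned by `trial.suggest_int` would take the place of `sampled_batch_size`.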

Now, we run the hyperparameter tuning and find the best hyperparameters to use.

In [None]:
# create study object to maximize the objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

trial = study.best_trial

print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

[I 2022-11-24 06:53:30,972] A new study created in memory with name: no-name-899b929b-a5fc-4524-a5a0-6a6ec92db026
[I 2022-11-24 06:56:14,084] Trial 0 finished with value: 0.9982093325374811 and parameters: {'optimizer': 'Adadelta', 'lr': 0.004076801138829891, 'batch_size': 16}. Best is trial 0 with value: 0.9982093325374811.
[I 2022-11-24 06:58:40,697] Trial 1 finished with value: 0.999403110845827 and parameters: {'optimizer': 'Adagrad', 'lr': 0.01701203721145084, 'batch_size': 16}. Best is trial 1 with value: 0.999403110845827.
[I 2022-11-24 07:01:23,084] Trial 2 finished with value: 0.9982093325374811 and parameters: {'optimizer': 'Adadelta', 'lr': 0.0001830526635861672, 'batch_size': 16}. Best is trial 1 with value: 0.999403110845827.
[I 2022-11-24 07:03:48,846] Trial 3 finished with value: 0.9993328885923949 and parameters: {'optimizer': 'Adagrad', 'lr': 0.04500723620418616, 'batch_size': 32}. Best is trial 1 with value:

Accuracy: 0.999403110845827
Best hyperparameters: {'optimizer': 'Adagrad', 'lr': 0.01701203721145084, 'batch_size': 16}


The result says that the best hyperparameters are the Adagrad optimizer with a learning rate of 0.01701203721145084 and a batch size of 16.

Now, we train and validate the model using these hyperparameters, and track the training and validation loss for 20 epochs.


In [None]:
# Instantiating the model
model2 = NetModif()

#class weights for two-class classification
class_weights = torch.tensor([(284807/(2*284315)), (284807/(2*492))])
# Choosing the loss function
criterion = nn.CrossEntropyLoss(weight = class_weights) # imbalanced classification
# Choosing the optimizer
optimizer = torch.optim.Adagrad(model2.parameters(), lr=0.01701203721145084) # apply from the hyperparam tuning result

In [None]:
epochs = 20
 
train_mean_losses = []
valid_mean_losses = []

valid_best_loss = np.inf

for i in range(epochs):  
    #===============================================================
    # training 
    train_losses = []
    
    print("=========================================================")
    print("Epoch {}".format(i))

    for iteration, batch_data in enumerate(train_loader):
        X_batch, y_batch = batch_data
        
        optimizer.zero_grad()
        
        out = model2(X_batch)
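        loss = criterion(out, y_batch.flatten()) # logits are [batch, 2], targets are [batch]
        
        loss.backward()
        optimizer.step()
        
        train_losses.append(loss.detach()) # detach so the autograd graph of each batch is not kept in memory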
    
    train_mean_loss = torch.mean(torch.stack(train_losses))
    print('training loss: {:10.8f}'.format(train_mean_loss))
    
    train_mean_losses.append(train_mean_loss)
    
    #===============================================================
    # validation
    valid_losses = []
    with torch.set_grad_enabled(False):
        for iteration, batch_data in enumerate(valid_loader):
            X_batch, y_batch = batch_data

            out = model2(X_batch)
            loss = criterion(out, y_batch.flatten())
            valid_losses.append(loss)
            
        valid_mean_loss = torch.mean(torch.stack(valid_losses))
        print('validation loss: {:10.8f}'.format(valid_mean_loss))
        
        valid_mean_losses.append(valid_mean_loss)
        
        if valid_mean_loss.item() < valid_best_loss:
            valid_best_loss = valid_mean_loss.item()
            torch.save(model2.state_dict(), "best_model2.pth") # separate file so model 1's checkpoint is not overwritten
            best_epoch = i
    #===============================================================

Epoch 0
training loss: 0.05544159
validation loss: 0.03398507
Epoch 1
training loss: 0.03172327
validation loss: 0.02986004
Epoch 2
training loss: 0.02876923
validation loss: 0.03136512
Epoch 3
training loss: 0.02800628
validation loss: 0.03323418
Epoch 4
training loss: 0.02638127
validation loss: 0.03354193
Epoch 5
training loss: 0.02498032
validation loss: 0.02895638
Epoch 6
training loss: 0.02416300
validation loss: 0.03061480
Epoch 7
training loss: 0.02451184
validation loss: 0.03089141
Epoch 8
training loss: 0.02322871
validation loss: 0.03163657
Epoch 9
training loss: 0.02249948
validation loss: 0.02717743
Epoch 10
training loss: 0.02289980
validation loss: 0.03098697
Epoch 11
training loss: 0.02309910
validation loss: 0.02991005
Epoch 12
training loss: 0.02264206
validation loss: 0.02778618
Epoch 13
training loss: 0.02142965
validation loss: 0.03184785
Epoch 14
training loss: 0.02156622
validation loss: 0.02787832
Epoch 15
training loss: 0.02341561
validation loss: 0.02753935
Ep

From the result above, we can see that the training loss decreased over the epochs. On the other hand, the validation loss did not show a clear downward trend and fluctuated between roughly 0.027 and 0.034. This may mean that the tuned model is starting to overfit (even though dropout and tuning are meant to reduce overfitting). Part of the fluctuation may also come from dropout itself: the loop above never calls model2.eval(), so dropout is still active while the validation loss is computed.

Still, we should not jump to conclusions. To check for overfitting further, we will evaluate the model on the test data.
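
One simple way to read a noisy validation curve is to look for the epoch with the lowest loss and count how many epochs have passed since then. The sketch below applies this to the validation losses for epochs 0-15 copied from the log above. The helper name `best_epoch_and_patience` is hypothetical.

```python
# validation losses for epochs 0-15, copied from the training log above
recorded_valid_losses = [
    0.03398507, 0.02986004, 0.03136512, 0.03323418,
    0.03354193, 0.02895638, 0.03061480, 0.03089141,
    0.03163657, 0.02717743, 0.03098697, 0.02991005,
    0.02778618, 0.03184785, 0.02787832, 0.02753935,
]

def best_epoch_and_patience(valid_losses):
    """Return the epoch with the lowest loss and the number of epochs since then."""
    best = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    return best, len(valid_losses) - 1 - best

best, epochs_since_best = best_epoch_and_patience(recorded_valid_losses)
print(best, epochs_since_best)
```

An early-stopping rule would halt training once the second number exceeds a chosen patience.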


### Evaluation

Now, we evaluate our models on the test data. Let's look at the accuracy, precision, and recall of both models, compare them, and choose the best one.

In [None]:
# Model 1
test_predictions = np.empty((0,2))
with torch.no_grad():
    for iteration, batch_data in enumerate(test_loader):
        X_batch, y_batch = batch_data        
        out = net(X_batch)
        
        test_predictions = np.append(test_predictions, out.numpy(), 
                                     axis=0)
        
# accuracy, precision, and recall
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

test_predictions = np.array(test_predictions)
test_predictions = np.argmax(np.array(test_predictions), axis=1)


print("\n=========================================================\n")
accuracy = accuracy_score(test_y, test_predictions) # accuracy
print("Accuracy: {}".format(accuracy))

precision = precision_score(test_y, test_predictions) # precision 
print("Precision: ", precision)

recall = recall_score(test_y, test_predictions) # recall
print("Recall: ", recall)

# in addition, add confusion matrix to see the actual and predicted class
from sklearn.metrics import confusion_matrix
print("\n=========================================================\n")
print("Confusion Matrix:")
print(confusion_matrix(test_y, test_predictions)) # confusion matrix




Accuracy: 0.9991222218320986
Precision:  0.6774193548387096
Recall:  0.8936170212765957


Confusion Matrix:
[[28414    20]
 [    5    42]]


In [None]:
# Model 2
test_predictions = np.empty((0,2))
with torch.no_grad():
    for iteration, batch_data in enumerate(test_loader):
        X_batch, y_batch = batch_data        
        out = model2(X_batch)
        
        test_predictions = np.append(test_predictions, out.numpy(), 
                                     axis=0)
        
# accuracy, precision, and recall
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

test_predictions = np.array(test_predictions)
test_predictions = np.argmax(np.array(test_predictions), axis=1)


print("\n=========================================================\n")
accuracy = accuracy_score(test_y, test_predictions)
print("Accuracy: {}".format(accuracy))

precision = precision_score(test_y, test_predictions)
print("Precision: ", precision)

recall = recall_score(test_y, test_predictions)
print("Recall: ", recall)

from sklearn.metrics import confusion_matrix
print("\n=========================================================\n")
print("Confusion Matrix:")
print(confusion_matrix(test_y, test_predictions))



Accuracy: 0.9988413328183702
Precision:  0.6029411764705882
Recall:  0.8723404255319149


Confusion Matrix:
[[28407    27]
 [    6    41]]


From the result above, we can see that **there is no large difference between model1 and model2 (the tuned and modified model)**. We also saw that the first model reached a lower validation loss.

Summary: 
- **Model1 performs slightly better on accuracy**: 99.91% vs. 99.88% for model2.
- **Model1 performs slightly better on precision**: 67.74% vs. 60.29% for model2.
- **Model1 performs slightly better on recall**: 89.36% vs. 87.23% for model2.
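
The reported scores can be re-derived from the two confusion matrices printed above, which is a useful check on which number is precision and which is recall. The sketch below uses a hypothetical helper, `metrics_from_confusion`, and reads each matrix as [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted classes).

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision and recall for the positive (fraud) class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)  # share of predicted frauds that are real frauds
    recall = tp / (tp + fn)     # share of real frauds that were caught
    return accuracy, precision, recall

# values copied from the confusion matrices above
model1 = metrics_from_confusion(tn=28414, fp=20, fn=5, tp=42)
model2 = metrics_from_confusion(tn=28407, fp=27, fn=6, tp=41)

for name, (acc, prec, rec) in (("model1", model1), ("model2", model2)):
    print("{}: accuracy={:.4f} precision={:.4f} recall={:.4f}".format(name, acc, prec, rec))
```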

For fraud detection, it is important to **consider how many of the Actual Positives** our model captures and labels as Positive (True Positives). We don't want the model to label frauds as non-frauds. Hence, **the model has to minimize False Negatives**.

Following the same reasoning, **Recall** is the metric we use to select our best model, because a False Negative carries a high cost (accuracy serves as a supporting metric).
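
The dataset notes recommend the Area Under the Precision-Recall Curve (AUPRC) for this class imbalance. A minimal sketch of computing it with scikit-learn's `average_precision_score` is shown below on made-up labels and scores. For the networks in this notebook, the scores could be the softmax probability of class 1, which is an assumption here and not something the cells above compute.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# made-up labels and fraud scores for four transactions
demo_true = np.array([0, 0, 1, 1])
demo_scores = np.array([0.1, 0.4, 0.35, 0.8])  # e.g. predicted probability of class 1

# average precision summarises the precision-recall curve in one number
auprc = average_precision_score(demo_true, demo_scores)
print("AUPRC (average precision): {:.4f}".format(auprc))
```

Unlike accuracy, this score depends only on how well the frauds are ranked above the non-frauds, so it is not inflated by the large number of true negatives.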

**In conclusion, Model1 is chosen since it has slightly better results on all three metrics.**

Finally, save our complete/final model.

In [None]:
# Saving complete model
torch.save(net, "complete_model.pth")
# torch.save(model2, "complete_model.pth")
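
Saving the whole module with `torch.save(net, ...)` pickles the object, so loading it later needs the same class definition to be importable. A common alternative is to save only the `state_dict` and load it into a freshly built model. The sketch below illustrates this with a stand-in `nn.Linear` layer and a hypothetical file name in a temporary folder.

```python
import os
import tempfile

import torch
import torch.nn as nn

demo_model = nn.Linear(30, 2)  # stand-in for the trained network

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_state_dict.pth")  # hypothetical file name
    torch.save(demo_model.state_dict(), path)            # save only the learned tensors

    restored = nn.Linear(30, 2)                          # rebuild the architecture first
    restored.load_state_dict(torch.load(path))           # then load the weights into it

restored.eval()  # switch to inference mode before predicting
same_weights = torch.equal(demo_model.weight, restored.weight)
print(same_weights)
```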

In [None]:
from google.colab import files
files.download("complete_model.pth")
