# Creation of a baseline model

The baseline model sets the minimum performance that every model we build should outperform.
<br>
It is a naive approach that classifies all scans as no-fraud. Because it never flags a scan, it produces no true or false positives and misses every fraud case, so its achieved monetary score (achieved monetary value divided by the maximum achievable value) is ALWAYS -1.0. This value can be compared to the cross-validation scores in the following notebooks.
<br>
The train_test_split here is used only as an example to illustrate the performance of the baseline model.
Cross-validating the baseline model will also always yield an average monetary score of -1.0, as long as every fold contains at least one fraud case.
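
The constant score of -1.0 follows directly from the payoffs used in `get_monetary_value` (TN: 0, FN: -5, FP: -25, TP: +5). A short derivation, assuming the evaluated set contains at least one fraud case so that the denominator is non-zero:

$$
\begin{aligned}
\text{TP} = \text{FP} &= 0 \quad \text{(no scan is ever flagged as fraud)} \\
\text{value} &= 0 \cdot \text{TN} - 5 \cdot \text{FN} - 25 \cdot \text{FP} + 5 \cdot \text{TP} = -5 \cdot \text{FN} \\
\text{max} &= 5 \cdot (\text{FN} + \text{TP}) = 5 \cdot \text{FN} \\
\text{score} &= \frac{\text{value}}{\text{max}} = \frac{-5 \cdot \text{FN}}{5 \cdot \text{FN}} = -1
\end{aligned}
$$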

In [1]:
# define needed functions

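# Monetary value achieved by the model's predictions.
# The confusion matrix follows the sklearn layout (rows = actual, columns = predicted):
#   [0, 0] = TN, [0, 1] = FP, [1, 0] = FN, [1, 1] = TP
def get_monetary_value(confusion_matrix):
    monetary_value = (
        (confusion_matrix[0, 0] * 0)      # TN: no gain, no loss
        + (confusion_matrix[1, 0] * -5)   # FN: missed fraud
        + (confusion_matrix[0, 1] * -25)  # FP: honest customer flagged as fraud
        + (confusion_matrix[1, 1] * 5)    # TP: fraud detected
    )
    return monetary_value

# Maximum achievable monetary value: every fraud case is detected -> (FN + TP) * 5
def get_max_monetary_value(confusion_matrix):
    max_monetary_value = (confusion_matrix[1, 0] + confusion_matrix[1, 1]) * 5
    return max_monetary_value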
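
To make the scoring concrete, here is a small worked example. It is only a sketch: the confusion matrix values are hypothetical and not taken from the dataset, the names `PAYOFF`, `monetary_value`, `max_monetary_value` and `example_cm` are illustrative, and plain nested lists stand in for the NumPy array that `confusion_matrix` returns.

```python
# Payoff per confusion-matrix cell, mirroring get_monetary_value:
# rows = actual class, columns = predicted class.
PAYOFF = [[0, -25],  # actual no-fraud: TN, FP
          [-5, 5]]   # actual fraud:    FN, TP


def monetary_value(cm):
    """Sum of cell counts weighted by their payoff."""
    return sum(cm[i][j] * PAYOFF[i][j] for i in range(2) for j in range(2))


def max_monetary_value(cm):
    """Value reached if every fraud case (FN + TP) were detected."""
    return (cm[1][0] + cm[1][1]) * 5


# Hypothetical confusion matrix: 350 TN, 5 FP, 6 FN, 15 TP
example_cm = [[350, 5],
              [6, 15]]

achieved = monetary_value(example_cm)      # 5 * -25 + 6 * -5 + 15 * 5 = -80
maximum = max_monetary_value(example_cm)   # (6 + 15) * 5 = 105
score = round(achieved / maximum, 4)       # -0.7619

print(achieved, maximum, score)
```

The five false positives alone cost more than the fifteen detected fraud cases earn, which is why a model can score below zero even with good recall.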

In [2]:
# Import of used packages

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

import os
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

#Set a random state
rs = 4

In [3]:
# get current directory 
path = os.getcwd() 

# get parent directory 
parent = os.path.dirname(path)

# build the path to the training data
train_csv = os.path.join(parent, "data", "train.csv")

#Import our dataset
dataset = pd.read_csv(train_csv, delimiter = '|')

In [4]:
#Split X and y
X = dataset.drop('fraud', axis=1)
y = dataset.fraud

In [5]:
X.head()

Unnamed: 0,trustLevel,totalScanTimeInSeconds,grandTotal,lineItemVoids,scansWithoutRegistration,quantityModifications,scannedLineItemsPerSecond,valuePerSecond,lineItemVoidsPerPosition
0,5,1054,54.7,7,0,3,0.027514,0.051898,0.241379
1,3,108,27.36,5,2,4,0.12963,0.253333,0.357143
2,3,1516,62.16,3,10,5,0.008575,0.041003,0.230769
3,6,1791,92.31,8,4,4,0.016192,0.051541,0.275862
4,5,430,81.53,3,7,2,0.062791,0.189605,0.111111


In [6]:
# establish train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = rs, stratify=y) 

In [7]:
# classify everything as no-fraud (0)
naive_preds = np.zeros(y_test.shape[0], dtype=int)

# class distribution of the test set (without adding the label to X_test)
y_test.value_counts()

0    355
1     21
Name: fraud, dtype: int64

In [8]:
# establish the confusion matrix
cm = confusion_matrix(y_test, naive_preds)

#Visualize the results
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, " FP:", fp, "\nFN:", fn, "\t TP:", tp)
#TN FP
#FN TP

TN: 355  FP: 0 
FN: 21 	 TP: 0


In [9]:
# get the classification report
print(classification_report(y_test, naive_preds))

              precision    recall  f1-score   support

           0       0.94      1.00      0.97       355
           1       0.00      0.00      0.00        21

    accuracy                           0.94       376
   macro avg       0.47      0.50      0.49       376
weighted avg       0.89      0.94      0.92       376



In [13]:
#How well did our model perform in terms of Monetary Value
print("Achieved monetary value:\t", get_monetary_value(cm))
print("Maximum monetary value:\t\t", get_max_monetary_value(cm))
print("Achieved monetary score:\t", round(get_monetary_value(cm) / get_max_monetary_value(cm), 4))

Achieved monetary value:	 -105
Maximum monetary value:		 105
Achieved monetary score:	 -1.0
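
The introduction notes that cross-validating the baseline also gives an average monetary score of -1.0. The sketch below illustrates this under stated assumptions: the labels are synthetic (1000 scans, 50 of them fraud), and `stratified_folds` is a hand-rolled, dependency-free stand-in for a stratified splitter such as scikit-learn's `StratifiedKFold`. The helper names `monetary_score` and `stratified_folds` are hypothetical.

```python
import random


def monetary_score(y_true, y_pred):
    """Achieved monetary value divided by the maximum achievable value."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    achieved = 5 * tp - 25 * fp - 5 * fn
    maximum = 5 * (tp + fn)  # requires at least one fraud case in y_true
    return achieved / maximum


def stratified_folds(y, n_splits, seed):
    """Deal the shuffled indices of each class round-robin into n_splits folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds


# Synthetic labels (assumption, not the real dataset): 50 fraud, 950 no-fraud
y = [1] * 50 + [0] * 950

fold_scores = []
for test_idx in stratified_folds(y, n_splits=5, seed=4):
    y_fold = [y[i] for i in test_idx]
    naive_preds = [0] * len(y_fold)  # baseline: everything is no-fraud
    fold_scores.append(monetary_score(y_fold, naive_preds))

mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

Stratification matters here: a fold without any fraud case would have a maximum monetary value of 0, and the score would be undefined.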
