## TRANSACTION DATA (CREDIT FRAUD) - HYPERPARAMETER

Dataset source: https://drive.google.com/file/d/1sr0k8_k7huFuHiR_C5P_r60VaBDzwoTb/view

This notebook searches for the best hyperparameters, which will later be used in the **Machine Learning** notebook. It is deliberately kept separate because, in practice, the *hyperparameter tuning* in this notebook alone takes about **5 to 6 hours**, so combining it with the **Machine Learning** notebook would be very inconvenient.

## IMPORT LIBRARIES

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn import tree
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.svm import SVC

from sklearn.metrics import confusion_matrix, classification_report, recall_score
from sklearn.model_selection import RandomizedSearchCV

import warnings
warnings.filterwarnings('ignore')

## OVERVIEW

In [2]:
df = pd.read_csv('transaction_HP.csv')

In [3]:
df.head()

| | transactionAmount | currentBalance | availableMoney | creditLimit | posConditionCode_8.0 | posEntryMode_9.0 | posEntryMode_5.0 | cardPresent_Yes | posEntryMode_90.0 | posEntryMode_80.0 | isFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.162764 | -0.478167 | 0.201811 | -0.25 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | -0.389651 | -0.453445 | 0.18601 | -0.25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.648121 | -0.447949 | 0.182498 | -0.25 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0.902954 | -0.406334 | 0.155901 | -0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | -0.547566 | -0.355851 | 0.123636 | -0.25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633846 entries, 0 to 633845
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   transactionAmount     633846 non-null  float64
 1   currentBalance        633846 non-null  float64
 2   availableMoney        633846 non-null  float64
 3   creditLimit           633846 non-null  float64
 4   posConditionCode_8.0  633846 non-null  int64  
 5   posEntryMode_9.0      633846 non-null  int64  
 6   posEntryMode_5.0      633846 non-null  int64  
 7   cardPresent_Yes       633846 non-null  int64  
 8   posEntryMode_90.0     633846 non-null  int64  
 9   posEntryMode_80.0     633846 non-null  int64  
 10  isFraud               633846 non-null  int64  
dtypes: float64(4), int64(7)
memory usage: 53.2 MB


## PARAMETER TUNING WITH PIPELINING

In [5]:
# Split predictors (X) and target (y)

X = df.drop(['isFraud'], axis=1)
y = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
X_train.head()

| | transactionAmount | currentBalance | availableMoney | creditLimit | posConditionCode_8.0 | posEntryMode_9.0 | posEntryMode_5.0 | cardPresent_Yes | posEntryMode_90.0 | posEntryMode_80.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 614975 | -0.510623 | 1.079725 | -0.439062 | 0.0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 608484 | 0.011676 | 0.966522 | -0.011898 | 0.25 | 0 | 1 | 0 | 1 | 0 | 0 |
| 217286 | 0.454412 | 0.151156 | 1.21885 | 0.75 | 0 | 1 | 0 | 0 | 0 | 0 |
| 536757 | 1.149939 | 2.463425 | -0.258977 | 0.75 | 0 | 0 | 1 | 0 | 0 | 0 |
| 156415 | 1.42838 | -0.409801 | 0.51293 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |


In [7]:
X_test.head()

| | transactionAmount | currentBalance | availableMoney | creditLimit | posConditionCode_8.0 | posEntryMode_9.0 | posEntryMode_5.0 | cardPresent_Yes | posEntryMode_90.0 | posEntryMode_80.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 231593 | 0.910993 | -0.475314 | 0.554801 | 0.0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 516251 | -0.509985 | -0.451679 | 0.184882 | -0.25 | 1 | 0 | 1 | 0 | 0 | 0 |
| 586857 | 0.706438 | -0.279312 | 1.493973 | 0.75 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4752 | 0.152619 | -0.422549 | -0.401438 | -0.65 | 0 | 0 | 1 | 0 | 0 | 0 |
| 539908 | -0.022204 | -0.342789 | 0.824915 | 0.25 | 0 | 0 | 1 | 1 | 0 | 0 |


In [9]:
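# Logistic Regression: pipeline (scaling + classifier) with randomized hyperparameter search

LRG_pipe = Pipeline([('scale', RobustScaler()),
                     ('clf', LogisticRegression())])

# Note: not every penalty/solver pair is supported (e.g. 'l1' with 'lbfgs');
# unsupported pairs fail to fit and are scored as NaN by the search.
# 'none' is the spelling used by older scikit-learn releases; newer ones expect None.
LRG_param = {'clf__penalty': ['l1', 'l2', 'none'],
             'clf__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
             'clf__max_iter' : [100,200]}

# n_iter is left at its default (10), so 10 sampled candidates are evaluated with 5-fold CV
RSCV_LRG = RandomizedSearchCV(LRG_pipe, LRG_param, cv=5, scoring='accuracy')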
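
The search space above mixes penalties and solvers freely, although each solver accepts only some penalties. The sketch below is one possible way to keep only the supported pairs. The `SUPPORTED` table and the names `valid_pairs` and `valid_grid` are ours, not part of the notebook. The table follows the scikit-learn `LogisticRegression` documentation as we understand it and keeps the notebook's `'none'` spelling.

```python
from itertools import product

# Search space as written in the notebook.
penalties = ["l1", "l2", "none"]
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
max_iters = [100, 200]

# Penalties each solver accepts ("elasticnet" is left out because
# the notebook does not search over it). Assumption: table taken from
# the LogisticRegression documentation, using the older 'none' spelling.
SUPPORTED = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2", "none"},
    "saga": {"l1", "l2", "none"},
}

all_pairs = list(product(penalties, solvers))
valid_pairs = [(p, s) for p, s in all_pairs if p in SUPPORTED[s]]

# One dict per supported pair; max_iter is still searched inside each dict.
valid_grid = [
    {"clf__penalty": [p], "clf__solver": [s], "clf__max_iter": max_iters}
    for p, s in valid_pairs
]

print(len(all_pairs), len(valid_pairs))  # 15 11
```

Since `RandomizedSearchCV` also accepts a list of dicts as its parameter distributions, `valid_grid` could be passed in place of `LRG_param` so that no sampled candidate is wasted on a combination that cannot be fitted.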

In [10]:
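# K-Nearest Neighbors: pipeline (scaling + classifier) with randomized hyperparameter search

KNN_pipe = Pipeline([('scale', RobustScaler()),
                     ('clf', KNeighborsClassifier())])

# leaf_size mainly affects the speed and memory of the tree-based neighbor search,
# while n_neighbors and p (1 = Manhattan, 2 = Euclidean) change the predictions
KNN_param = {'clf__leaf_size': list(range(1, 50)),
             'clf__n_neighbors' : list(range(1, 30)),
             'clf__p' : [1,2]}

RSCV_KNN = RandomizedSearchCV(KNN_pipe, KNN_param, cv=5, scoring='accuracy')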
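
To get a feel for how much of each search space is actually explored, the arithmetic can be sketched as below. It assumes `n_iter` and `refit` are left at their `RandomizedSearchCV` defaults (10 and `True`), as in the two cells above. The variable names are ours.

```python
# Sizes of the two search spaces defined in the notebook.
lrg_space = 3 * 5 * 2                                   # penalty x solver x max_iter
knn_space = len(range(1, 50)) * len(range(1, 30)) * 2   # leaf_size x n_neighbors x p

n_iter = 10  # RandomizedSearchCV default when n_iter is not passed
cv = 5       # number of folds used in the notebook

for name, space in [("LogisticRegression", lrg_space), ("KNN", knn_space)]:
    sampled = min(n_iter, space)
    # Each sampled candidate is fitted once per fold, plus one final
    # refit of the best candidate on the whole training set.
    fits = sampled * cv + 1
    print(f"{name}: {sampled}/{space} candidates ({sampled / space:.2%}), {fits} fits")
```

Under these assumptions, each search performs 51 fits, and the KNN search visits only a small fraction of its 2842 possible combinations. Passing a larger `n_iter` would widen the coverage at the cost of a longer run.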

In [11]:
# Fit each search on the training set, then report its best CV score and parameters
models = ['LogisticRegression', 'KNNeighbors']
pipes = [RSCV_LRG, RSCV_KNN]
for model, pipe in zip(models, pipes):
    print(model, '\n')
    pipe.fit(X_train, y_train)
    print('Best Score : ', pipe.best_score_)
    print('Best Params : ', pipe.best_params_)
    print('\n')

LogisticRegression 

Best Score :  0.9828783062024643
Best Params :  {'clf__solver': 'lbfgs', 'clf__penalty': 'l2', 'clf__max_iter': 200}


KNNeighbors 

Best Score :  0.9829019713018663
Best Params :  {'clf__p': 2, 'clf__n_neighbors': 25, 'clf__leaf_size': 20}
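
Both best scores above are accuracy values. On fraud data, where the positive class is usually rare, a high accuracy can also be reached by a model that never flags fraud, so it may be worth comparing the scores against that baseline. A minimal sketch with made-up labels (the 2% fraud rate is an assumption for illustration, not a figure from this dataset):

```python
# Hypothetical labels: 1000 transactions, 20 of them fraud (assumed rate).
y_true = [1] * 20 + [0] * 980

# A trivial "model" that predicts non-fraud for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual fraud cases that were flagged as fraud.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")  # accuracy=0.980, recall=0.000
```

In this notebook, the same check could be done by comparing the best scores with `y_train.value_counts(normalize=True)`, or by passing `scoring='recall'` to `RandomizedSearchCV`.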




> We now have the best parameters; next, we will use them for the models in the Machine Learning notebook.
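
Because the tuning lives in its own notebook, the results need to be carried over somehow. One possible way, sketched below, is to write them to a JSON file and load them in the other notebook. The file name `best_params.json` and the helper `strip_step_prefix` are hypothetical and do not appear in the original notebooks.

```python
import json
import os
import tempfile

# Best parameters as printed in the notebook output above.
best_params = {
    "LogisticRegression": {"clf__solver": "lbfgs", "clf__penalty": "l2", "clf__max_iter": 200},
    "KNNeighbors": {"clf__p": 2, "clf__n_neighbors": 25, "clf__leaf_size": 20},
}

# Hypothetical file name; a temp directory keeps this sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "best_params.json")

with open(path, "w") as f:
    json.dump(best_params, f, indent=2)

# Later, in the other notebook: read the parameters back.
with open(path) as f:
    loaded = json.load(f)


def strip_step_prefix(params, prefix="clf__"):
    """Drop the pipeline step prefix so keys match the estimator's own argument names."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v for k, v in params.items()}


knn_kwargs = strip_step_prefix(loaded["KNNeighbors"])
print(knn_kwargs)  # {'p': 2, 'n_neighbors': 25, 'leaf_size': 20}
```

The stripped dict could then be used as `KNeighborsClassifier(**knn_kwargs)`, while the unstripped one fits a pipeline with the same step names via `set_params(**loaded["KNeighbors"])`.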