## Model Training

Credit cards are among the most widely used financial products for online purchases and payments. Although credit cards can be a convenient way to manage finances, they also carry risk. Credit card fraud is the unauthorized use of someone else's credit card or credit card information to make purchases or withdraw cash.

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. 

The dataset contains transactions made by European cardholders with credit cards in September 2013. It covers two days of transactions, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

We have to build a classification model to predict whether a transaction is fraudulent or not.
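
To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It also shows the accuracy of a hypothetical baseline (not part of this notebook) that labels every transaction as legitimate, which suggests why accuracy alone says little on this dataset.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# Hypothetical baseline (an assumption for illustration): always predict "not fraud".
baseline_accuracy = (n_transactions - n_frauds) / n_transactions
baseline_recall = 0 / n_frauds  # it never catches a fraud

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-legitimate accuracy: {baseline_accuracy:.4%}")
print(f"Always-legitimate recall on frauds: {baseline_recall:.0%}")
```

A model therefore has to be judged on how well it finds the rare positive class, not only on overall accuracy.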


In [1]:

import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, classification_report, f1_score
from xgboost import XGBClassifier, XGBRegressor
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

#### Import the CSV Data as Pandas DataFrame

In [4]:
df = pd.read_csv('data/creditcard.csv')  # forward slash avoids the invalid '\c' escape and works on all platforms

#### Show Top 5 Records

In [5]:
df.head(5)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


#### Preparing X and Y variables

In [6]:
X = df.drop(columns=['Class'])

In [7]:
X.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99


In [8]:
y = df['Class']

In [9]:
y

0         0
1         0
2         0
3         0
4         0
         ..
284802    0
284803    0
284804    0
284805    0
284806    0
Name: Class, Length: 284807, dtype: int64

### Handling the data imbalance

We calculate the average count of the two classes, oversample the minority class up to that value, and then undersample the majority class down to the same value.

Oversampler: SMOTE creates synthetic samples by interpolating between existing minority class samples. Because it generates new points rather than duplicating existing ones, it reduces the overfitting risk associated with random oversampling and can help reduce the model's bias toward the majority class.

Undersampler: RandomUnderSampler randomly removes samples from the majority class until the requested count is reached.
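
The interpolation idea behind SMOTE can be sketched in a few lines of standard-library Python. This is a simplified illustration, not imblearn's implementation: `smote_like_sample`, the toy points, and the choice of `k=2` are all assumptions made for the example.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Nearest neighbours of `base` among the other minority points (squared Euclidean distance).
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment between the two points.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Toy minority-class points (illustrative only, not taken from the dataset).
minority_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Each synthetic point lies on a line segment between two real minority points, so it stays inside the region the minority class already occupies.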


In [10]:
# Count each class and target the midpoint of the two counts
class_counts = y.value_counts()
count_0 = class_counts[0]
count_1 = class_counts[1]
average_count = int((count_0 + count_1) / 2)

# Oversample frauds up to the target, then undersample non-frauds down to it
resampling_pipeline = Pipeline([
    ('oversampler', SMOTE(sampling_strategy={1: average_count})),
    ('undersampler', RandomUnderSampler(sampling_strategy={0: average_count}))
])

X_resampled, y_resampled = resampling_pipeline.fit_resample(X, y)

X, y = X_resampled, y_resampled
df_resampled = pd.concat([X, y], axis=1)
df_resampled.head(5)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
32447,36808.0,1.218465,-0.607446,1.073094,0.96117,-1.32295,0.108251,-0.900172,0.222699,0.021664,...,-0.446894,-0.503524,0.075382,0.351624,0.412391,-0.27103,0.091048,0.026417,1.99,0
186900,127301.0,0.489344,-2.897521,-2.089704,0.752418,-0.878168,-0.183567,0.907671,-0.270685,0.905803,...,0.588326,0.173171,-0.70206,-0.308573,-0.111789,0.118026,-0.180651,0.060101,796.3,0
240712,150715.0,2.020479,-0.20365,-1.440525,0.044468,0.15093,-0.653931,0.037303,-0.152042,0.180686,...,0.263852,0.761016,0.065818,0.831924,0.025478,0.730463,-0.103488,-0.071043,19.95,0
151378,95468.0,-4.34977,-0.658404,0.426053,-0.030472,1.695829,-0.143122,0.659388,-1.417816,3.751194,...,-0.905606,-0.706089,1.696976,0.659373,0.741013,0.159791,1.680936,-0.593322,0.01,0
113610,73134.0,1.326213,0.379177,-0.091221,0.774639,0.413574,-0.060075,0.167722,-0.168945,-0.038247,...,-0.129748,-0.28564,-0.210156,-0.98859,0.770022,-0.356715,0.028106,0.014737,7.47,0
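
As a quick check on the target size, the arithmetic below reproduces `average_count` from the class counts implied by the dataset description (492 frauds out of 284,807 transactions). It is a sketch that uses plain integers rather than the DataFrame, and it assumes SMOTE keeps every original fraud row and only adds synthetic ones.

```python
# Class counts implied by the dataset description (492 frauds of 284,807 rows).
count_1 = 492
count_0 = 284_807 - count_1

# Same formula as in the resampling cell.
average_count = int((count_0 + count_1) / 2)

# After resampling, both classes are brought to average_count.
rows_after_resampling = 2 * average_count

# Fraud rows that would have to be generated synthetically.
synthetic_frauds = average_count - count_1

print("Target per class:", average_count)
print("Rows after resampling:", rows_after_resampling)
print("Synthetic fraud rows:", synthetic_frauds)
```

On these counts, nearly all fraud rows in the resampled data are synthetic, which is worth keeping in mind when reading the evaluation scores.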


### Scaling the numeric features 

In [11]:
# Create a ColumnTransformer that applies StandardScaler to every numeric column
num_features = X.select_dtypes(exclude="object").columns

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    [
         ("StandardScaler", numeric_transformer, num_features),        
    ]
)

In [12]:
X = preprocessor.fit_transform(X)

In [13]:
X.shape

(284806, 30)

In [14]:
X

array([[-1.06164739,  0.68075376, -0.69369853, ...,  0.00863689,
        -0.03333977, -0.41368064],
       [ 0.81888287,  0.54665621, -1.32372187, ..., -0.27705495,
         0.04620839,  3.09197769],
       [ 1.30544796,  0.82825776, -0.58261019, ..., -0.19591836,
        -0.26350462, -0.33441483],
       ...,
       [ 1.34584608,  0.52674526, -0.53965953, ..., -0.62079644,
         0.23400354,  0.85453497],
       [ 0.28267369, -4.39639923,  4.89272235, ..., -5.40327668,
        -1.01032875, -0.4147258 ],
       [-0.97117867, -0.6173819 ,  0.71354533, ...,  1.86879814,
         0.82528708, -0.34406668]])
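
For reference, standardization rescales each column to zero mean and unit variance using z = (x - mean) / std. The sketch below applies that formula to the five `Amount` values shown in the `df.head(5)` output earlier, using only the standard library. It uses the population standard deviation, which is what `StandardScaler` uses; the results will not match the array above, because that array was fit on the full resampled data.

```python
from statistics import fmean, pstdev

# Amount values from the first five rows shown earlier.
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]

mean = fmean(amounts)
std = pstdev(amounts)  # population standard deviation (ddof=0)

# z-score each value with the column's own mean and standard deviation.
scaled = [(x - mean) / std for x in amounts]
print([round(z, 3) for z in scaled])
```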

In [15]:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((227844, 30), (56962, 30))
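
The cells above resample and scale the full dataset before splitting, so the test set contains synthetic SMOTE rows and the scaler has seen test data. A common alternative is to split first and fit the resampler and scaler on the training portion only. The sketch below illustrates just the splitting step with the standard library; `stratified_split` and the toy labels are hypothetical and not part of the notebook.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices into train/test while keeping each class's share."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        # Reserve the same fraction of every class for the test set.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Toy labels: 95 legitimate rows and 5 frauds (not the real dataset).
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels)

print("train frauds:", sum(labels[i] for i in train_idx))
print("test frauds:", sum(labels[i] for i in test_idx))
```

With the indices separated this way, resampling and scaling would be fit on the `train_idx` rows only, leaving the test rows as untouched original transactions.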

#### Create an Evaluation Function to Report All Metrics after Model Training

In [16]:
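def evaluate_model(true, predicted):
    """Return accuracy, ROC AUC, and F1 computed from hard class predictions."""
    accuracy = accuracy_score(true, predicted)
    # ROC AUC is computed here from 0/1 labels, not from predicted probabilities
    roc = roc_auc_score(true, predicted)
    f1 = f1_score(true, predicted)
    return accuracy, roc, f1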
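
`evaluate_model` passes hard 0/1 predictions to `roc_auc_score`. ROC AUC is usually computed from scores such as predicted probabilities, because thresholding discards ranking information. The standard-library sketch below illustrates the difference on toy values; `pairwise_auc` is a hypothetical helper that computes AUC as the share of correctly ranked (positive, negative) pairs, and it is not how scikit-learn implements the metric internally.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (values are illustrative, not from the notebook).
y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_probabilities, auc_from_labels)
```

In this toy case the probabilities rank every fraud above every legitimate row, but the thresholded labels lose part of that ordering and give a lower AUC.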

In [17]:
models = {
    "Random Forest": RandomForestClassifier(),
    "XGBClassifier": XGBClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}
model_list = []
accuracy_list = []

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate on the train and test sets
    model_train_accuracy, model_train_roc, model_train_f1 = evaluate_model(y_train, y_train_pred)
    model_test_accuracy, model_test_roc, model_test_f1 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Accuracy Score: {:.4f}".format(model_train_accuracy))
    print("- Roc Auc Score: {:.4f}".format(model_train_roc))
    print("- F1 Score: {:.4f}".format(model_train_f1))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Accuracy Score: {:.4f}".format(model_test_accuracy))
    print("- Roc Auc Score: {:.4f}".format(model_test_roc))
    print("- F1 Score: {:.4f}".format(model_test_f1))
    accuracy_list.append(model_test_accuracy)
    
    print('='*35)
    print('\n')

Random Forest
Model performance for Training set
- Accuracy Score: 1.0000
- Roc Auc Score: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy Score: 0.9999
- Roc Auc Score: 0.9999
- F1 Score: 0.9999


XGBClassifier
Model performance for Training set
- Accuracy Score: 1.0000
- Roc Auc Score: 1.0000
- F1 Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy Score: 0.9998
- Roc Auc Score: 0.9998
- F1 Score: 0.9998


KNeighborsClassifier
Model performance for Training set
- Accuracy Score: 0.9989
- Roc Auc Score: 0.9989
- F1 Score: 0.9989
----------------------------------
Model performance for Test set
- Accuracy Score: 0.9984
- Roc Auc Score: 0.9984
- F1 Score: 0.9984




### Results

In [18]:
pd.DataFrame(list(zip(model_list, accuracy_list)), columns=['Model Name', 'Accuracy']).sort_values(by=["Accuracy"],ascending=False)

| | Model Name | Accuracy |
|---|---|---|
| 0 | Random Forest | 0.999895 |
| 1 | XGBClassifier | 0.999807 |
| 2 | KNeighborsClassifier | 0.99842 |


### Perform Hyperparameter Tuning on the Model with the Highest Accuracy Score

In [19]:
param_grid = {
    'n_estimators': [20,40,60,100],  
    'max_depth': [5, 10, 15,20],
}

rf_classifier = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, 
                           cv=2, n_jobs=-1, verbose=2, scoring='accuracy')

grid_search.fit(X_train, y_train)
best_hyperparameters = grid_search.best_params_
print("Best Hyperparameters: ", best_hyperparameters)
best_rf_model = grid_search.best_estimator_
y_pred = best_rf_model.predict(X_test)

Fitting 2 folds for each of 16 candidates, totalling 32 fits
Best Hyperparameters:  {'max_depth': 20, 'n_estimators': 100}
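
The "16 candidates, totalling 32 fits" line in the log follows directly from the grid size and the number of folds. The sketch below enumerates the same grid with `itertools.product`; it only counts combinations and does not train anything.

```python
from itertools import product

# Same grid as in the tuning cell above.
param_grid = {
    'n_estimators': [20, 40, 60, 100],
    'max_depth': [5, 10, 15, 20],
}
cv_folds = 2

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the whole training set, which is the model returned as `best_estimator_`.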


#### Results of the Best Model

In [20]:
# scikit-learn metrics expect (y_true, y_pred) in that order
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_pred)
print(f"ROC AUC: {roc_auc:.2f}")

Accuracy: 1.00
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28483
           1       1.00      1.00      1.00     28479

    accuracy                           1.00     56962
   macro avg       1.00      1.00      1.00     56962
weighted avg       1.00      1.00      1.00     56962

ROC AUC: 1.00
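
scikit-learn's metric functions take the true labels first and the predictions second. The standard-library sketch below illustrates why the order matters for the classification report: swapping the two arguments exchanges precision and recall. `precision_recall` is a hypothetical helper and the labels are toy values, not results from this notebook.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

# Toy labels (illustrative only): 4 actual frauds, 3 predicted frauds.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

correct_order = precision_recall(y_true, y_pred)
swapped_order = precision_recall(y_pred, y_true)
print("correct (precision, recall):", correct_order)
print("swapped (precision, recall):", swapped_order)
```

When a model is nearly perfect, as in the report above, the two orderings give almost identical numbers, so the difference is easy to miss.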


#### Difference between Actual and Predicted Values

In [21]:
pred_df=pd.DataFrame({'Actual Value':y_test,'Predicted Value':y_pred})
pred_df

| | Actual Value | Predicted Value |
|---|---|---|
| 195613 | 0 | 0 |
| 187876 | 0 | 0 |
| 246134 | 0 | 0 |
| 418392 | 1 | 1 |
| 420757 | 1 | 1 |
| ... | ... | ... |
| 128527 | 0 | 0 |
| 403762 | 1 | 1 |
| 363157 | 1 | 1 |
| 259282 | 0 | 0 |


### Conclusion

Given the highly imbalanced nature of the dataset, we applied a combination of oversampling and undersampling using SMOTE and RandomUnderSampler. This produced a dataset with equal numbers of fraudulent and legitimate transactions.

After training multiple models, the Random Forest classifier emerged as the best performer, with the highest test accuracy among the models tested.
To further optimize the model, we performed hyperparameter tuning using GridSearchCV, which selected `max_depth=20` and `n_estimators=100`. With these parameters, both the accuracy score and the ROC AUC score round to 1.00 on the test set.

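In conclusion, the tuned Random Forest model predicts fraudulent transactions with high accuracy on the test data. Because resampling and scaling were applied before the train/test split, the test set includes synthetic SMOTE samples, so these scores are likely optimistic compared with performance on untouched, imbalanced transactions.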