# Intro

This is an end-to-end ML project on credit card fraud detection. The dataset is freely available [on Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). We start with some EDA and then test several models. The main challenge is the huge class imbalance in the data, so techniques such as cost-sensitive learning and resampling will be tried out.

# Import Data and Libraries

Let's begin by importing the required libraries and then take a first look at the data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
plt.style.use('fivethirtyeight')
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold, RepeatedStratifiedKFold, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, balanced_accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, f_classif, RFE
from imblearn.over_sampling import RandomOverSampler, SMOTE, SVMSMOTE, KMeansSMOTE
from imblearn.under_sampling import RandomUnderSampler
import imblearn.pipeline
from xgboost import XGBClassifier
import hyperopt 
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import joblib
import time
from lightgbm import LGBMClassifier

In [None]:
df=pd.read_csv('../data/creditcard.csv')
df.head()

At first glance, the data looks like a bunch of random numbers. Indeed, the feature names (except for `Time`, `Amount` and `Class`) don't seem to make sense. There's a reason for that: according to the dataset description on Kaggle, the dataset contains transactions made by credit cards in September 2013 by European cardholders, and it contains only numerical input variables which are the result of a PCA ([Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis)) transformation. So, due to confidentiality issues, we have to work with PCA-derived features rather than the original ones.
\
`Time` contains the seconds elapsed between each transaction and the first transaction in the dataset. `Amount` is the transaction amount, and `Class` is the target variable: 1=Fraud, 0=Non-Fraud.

In [None]:
df.shape

There are 280K+ transactions in the dataset. What about data types and null values?

In [None]:
df.info()

As all the data is numeric, there is no need for encoding. That was to be expected, as PCA returns only numeric data.

In [None]:
df.isnull().values.any()

Great, no Nulls at all! Then we can just move on to EDA.

# Exploratory Data Analysis

As all features except for `Time`, `Amount` and `Class` are PCA-generated, I don't think we could find anything interesting in their descriptive statistics.

In [None]:
df.describe()[['Time', 'Amount', 'Class']]

With the mean of `Class` being very close to 0, we can already see a problem: the target is very imbalanced. Let's explore further.

In [None]:
target_vals=pd.DataFrame({'Total Count':df['Class'].value_counts()})
target_vals['Total %']=target_vals['Total Count'].map(lambda x: x/df.shape[0]*100)
target_vals

And there it is: a highly imbalanced problem. Out of 284807 samples, there are only 492 cases of fraud, less than 0.18%! This problem must be tackled during the modelling part of the project, using various techniques (resampling, changing class weights, etc.).
Now let's look at the differences between fraudulent and normal transactions in detail.

In [None]:
fraud, normal=df[df['Class']==1], df[df['Class']==0]
fraud.shape, normal.shape

In [None]:
fig=make_subplots(rows=2, cols=1)

fig.add_trace(go.Histogram(x=fraud['Amount'], nbinsx=100), row=1, col=1)

fig.add_trace(go.Histogram(x=fraud['Time'], nbinsx=50), row=2, col=1)

fig.update_layout(height=600, width=1200, title_text="Amount and Time Distribution (Fraudulent Transactions)")
fig.update_xaxes(title_text="Amount ($)", row=1, col=1)
fig.update_xaxes(title_text="Time", row=2, col=1)
fig.show()

In [None]:
fig=make_subplots(rows=2, cols=1)

fig.add_trace(go.Histogram(x=normal['Amount'], nbinsx=100), row=1, col=1)

fig.add_trace(go.Histogram(x=normal['Time'], nbinsx=50), row=2, col=1)

fig.update_layout(height=600, width=1200, title_text="Amount and Time Distribution (Normal Transactions)")
fig.update_xaxes(title_text="Amount ($)", row=1, col=1)
fig.update_xaxes(title_text="Time", row=2, col=1)
fig.show()

There's not much of a difference in the `Amount` distributions, but `Time` does differ: fraudulent transactions are more uniformly distributed, while normal transactions follow a cyclical pattern. Now what about descriptive statistics?
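One way to make the cyclical pattern easier to inspect is to fold `Time` into an hour bucket. The sketch below is only an illustration: `seconds_to_hour` is a hypothetical helper, and because the dataset description does not say at what time of day the first transaction occurred, the resulting values are hours relative to the first transaction, not true clock hours.

```python
import numpy as np

def seconds_to_hour(seconds):
    """Fold elapsed seconds into a 0-23 'hour since first transaction' bucket."""
    seconds = np.asarray(seconds)
    return ((seconds // 3600) % 24).astype(int)

# Toy values: 0 s, 1 h and 25 h after the first transaction
hours = seconds_to_hour([0, 3600, 90000])
print(hours)
```

On the real data this could be applied as `seconds_to_hour(fraud['Time'])` and `seconds_to_hour(normal['Time'])`, and the per-hour counts of the two groups compared.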

In [None]:
fraud_data=fraud.describe()[['Time', 'Amount']]
normal_data=normal.describe()[['Time', 'Amount']]
compare=pd.concat([fraud_data, normal_data], axis=1)
compare.columns=['Fraud_Time', 'Fraud_Amount', 'Normal_Time', 'Normal_Amount']
compare

From here we see that the average amount of a fraudulent transaction is about 1.4x higher than that of a normal one, while the standard deviations are almost the same.
\
Now let's compare PCA-generated features and look for some valuable insights.

In [None]:
v_features=df.columns.drop(['Time', 'Amount', 'Class'])
v_features

In [None]:
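plt.figure(figsize=(20, 14*4))
gs = gridspec.GridSpec(ncols=2, nrows=14)
for i, cn in enumerate(v_features):
    ax = plt.subplot(gs[i])
    # histplot replaces the deprecated sns.distplot; density scaling keeps both classes comparable
    sns.histplot(df.loc[df['Class']==1, cn], bins=50, stat='density', kde=True, color='C1', label='Fraud', ax=ax)
    sns.histplot(df.loc[df['Class']==0, cn], bins=50, stat='density', kde=True, color='C0', label='Normal', ax=ax)
    ax.set_xlabel('')
    ax.set_title(f'histogram of feature: {cn}')
    ax.legend()
plt.show()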

That's a lot of plots, but they contain some valuable information. Features V1-V7, V9-V12, V14, V16-V19 and V21 seem to be distributed differently for classes 0 and 1. This means they could help our future ML model distinguish the classes (we'll test this claim while evaluating feature importance). The distributions of the other features are pretty much the same for both classes. One could even argue in favor of dropping them, but I won't do that.

The last thing is to find Pearson's correlation between features and target.

In [None]:
df.corr()['Class'].sort_values(ascending=False)

V11, V4 and V2 have some positive correlation with `Class`, while V17, V14 and V12 have moderate negative correlations. As no feature has a very strong correlation (0.8+) with the target, there's no need to drop anything.

# Feature Engineering & Selection

In this part I'll look at feature engineering and selection. I considered scaling the non-PCA features with sklearn's `RobustScaler`, but (as noted below) scaling was left out in the end. As for feature engineering, there's not much one can do manually here (again, because of the anonymized PCA-created features). Some automatic FE can be tried using `PolynomialFeatures` or `featuretools`. Later, feature selection can be performed to keep the most useful features.

## Train/test split

In [None]:
X, y=df.drop(columns=['Class']), df['Class']
X.shape, y.shape

In [None]:
X_train, X_test, y_train, y_test=train_test_split(X, y, stratify=y)

for y_split in (y_train, y_test):
    zeros, ones=np.unique(y_split, return_counts=True)[1]
    print(zeros/(zeros+ones)*100, ones/(zeros+ones)*100) # see the share of train and test

It's important to use a stratified split to preserve the class ratio in both sets.
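As a small illustration of what `stratify` does, the sketch below splits synthetic labels with 1% positives (made-up data, not the real dataset; `X_demo` and `y_demo` are hypothetical names) and compares the positive share before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 100 positives out of 10000 samples
y_demo = np.array([1] * 100 + [0] * 9900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Share of positives in the full data, the train part and the test part
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a rare class can end up over- or under-represented in the test part purely by chance, which makes evaluation scores noisier.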

## Polynomial Features

Now let's try the `PolynomialFeatures` transformer mentioned above. It creates polynomial and interaction features. The idea is, given features with some influence on the target, to generate even stronger features by raising them to the power of 2 or 3, multiplying them with each other, etc. But one must be careful, as the number of features can explode very easily. I'll use a degree of 2, so I won't get too many new features. Also, I'll save a copy of the original data to use it later.
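To get a feeling for how quickly the feature count grows, here is a rough sketch. It relies on the standard counting result that a full polynomial expansion of degree `d` over `n` features (bias column included) has C(n + d, d) columns; `n_poly_features` is a hypothetical helper introduced only for this illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Number of output columns (bias included) of a full polynomial expansion."""
    return comb(n_features + degree, degree)

for degree in (2, 3):
    print(degree, n_poly_features(30, degree))

# Cross-check degree 2 against sklearn on a dummy row with 30 columns
n_out = PolynomialFeatures(2).fit_transform(np.zeros((1, 30))).shape[1]
print(n_out)
```

Going from degree 2 to degree 3 multiplies the number of columns by roughly eleven, which is why degree 2 is a sensible upper limit here.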

In [None]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig=X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()

In [None]:
poly=PolynomialFeatures(2)
X_train_poly=poly.fit_transform(X_train)
X_test_poly=poly.transform(X_test)

X_train_poly.shape, X_test_poly.shape

We went from 30 to almost 500 features. Now let's fit the original and the transformed data to a simple `LogisticRegression` classifier and compare the results. For evaluation, the ROC AUC score is used, as plain accuracy is not a good option for imbalanced data.
\
(NOTE: No Scaling (for `Amount` and `Time`) was used, but was tested out. Surprisingly, it gave slightly worse results. It [can happen](https://stackoverflow.com/questions/26668316/effect-of-feature-scaling-on-accuracy) sometimes. My guess, it's because of mixing non-scaled PCA features with scaled non-PCA ones, but it may be something else too.)
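To see why plain accuracy is a poor choice here, consider a "model" that labels every transaction as normal. The sketch below rebuilds the label vector from the class counts reported earlier (284315 normal, 492 fraud) and scores that trivial predictor. It also illustrates a side point worth keeping in mind: when `roc_auc_score` is given hard 0/1 predictions instead of probabilities, it reduces to balanced accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# Class counts taken from the dataset: 284315 normal and 492 fraudulent transactions
y_true = np.array([0] * 284315 + [1] * 492)
y_always_normal = np.zeros_like(y_true)  # a "model" that never flags fraud

acc = accuracy_score(y_true, y_always_normal)
bal_acc = balanced_accuracy_score(y_true, y_always_normal)
auc_hard = roc_auc_score(y_true, y_always_normal)
print(f'accuracy={acc:.4f}, balanced accuracy={bal_acc:.2f}, ROC AUC on hard labels={auc_hard:.2f}')
```

The useless predictor reaches an accuracy above 99.8%, while balanced accuracy and ROC AUC both sit at chance level.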

In [None]:
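def simple_train(xtrain, xtest, ytrain, ytest, data_type):
    lr=LogisticRegression()
    start_time=time.time()
    lr.fit(xtrain, ytrain)
    # NOTE: predict() returns hard 0/1 labels, so this score equals balanced accuracy.
    # Passing lr.predict_proba(xtest)[:, 1] gives a threshold-free ROC AUC (the numbers below would change).
    roc=roc_auc_score(ytest, lr.predict(xtest))
    print(f'({data_type}) ROC AUC Score: {roc}')
    print(f'Time spent: {time.time()-start_time}')
    return roc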

In [22]:
for data_type, x_tr, x_te in ('orig', X_train_orig, X_test_orig), ('poly', X_train_poly, X_test_poly):
    simple_train(x_tr, x_te, y_train, y_test, data_type)

(orig) ROC AUC Score: 0.853327918540655
Time spent: 2.272566318511963
(poly) ROC AUC Score: 0.8857990027585246
Time spent: 22.178084135055542


Well, there's clearly some performance boost. But there's a problem: a much longer training time. That's no surprise, as we now have many more features to deal with. Some feature selection techniques should be used here.

## Feature Selection

Feature Selection is the process of reducing the number of input variables when developing an ML model. It can help us get rid of non-informative or redundant features. There are many ways to do FS: correlation, RFE, feature importance, etc. I'll use a statistical method (namely, the [ANOVA F measure](https://en.wikipedia.org/wiki/F-test)) to select a subset of features based on their relationship with the target.
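As a toy illustration of how the ANOVA F-score ranks features, the sketch below builds synthetic data (not the real dataset) in which only one of ten columns depends on the class, then keeps the top 30% of columns. The column index, the shift of 3 and the percentile are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 200)
X_demo = rng.normal(size=(400, 10))
X_demo[:, 3] += 3 * y_demo  # only column 3 carries information about the class

selector = SelectPercentile(score_func=f_classif, percentile=30).fit(X_demo, y_demo)
kept = np.flatnonzero(selector.get_support())

# Indices of the surviving columns and the F-score of every column
print(kept, selector.scores_.round(1))
```

The informative column gets by far the highest F-score, so it survives the cut, while most of the pure-noise columns are dropped.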

In [23]:
fs=SelectPercentile(score_func=f_classif, percentile=70)
X_train_poly_fs=fs.fit_transform(X_train_poly, y_train)
X_test_poly_fs=fs.transform(X_test_poly)
X_train_poly_fs.shape

(213605, 347)

In [24]:
simple_train(X_train_poly_fs, X_test_poly_fs, y_train, y_test, 'poly_fs')

(poly_fs) ROC AUC Score: 0.868342987654753
Time spent: 10.47500205039978


0.868342987654753

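We've dropped 30% of the features and still get a performance boost over the original data (0.868 vs. 0.853), although the score is lower than with all polynomial features (0.886). Pretty good if you ask me.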

## Transformation Pipeline

Now we can build an end-to-end pipeline for all transformations using `Pipeline`. This is for the sake of demonstration. Later a final pipeline will be constructed and used in deployment.

In [25]:
pipeline=Pipeline([
                ('polynom', PolynomialFeatures(2)),
                ('feature_select', SelectPercentile(score_func=f_classif, percentile=70)),
                ('logistic_regression', LogisticRegression())
])

pipeline.fit(X_train, y_train)
roc_auc_score(y_test, pipeline.predict(X_test))

0.868342987654753

# Model Selection

This is (in my opinion) the most fascinating part of this project. That's because here the imbalance problem of the dataset will be tackled. Various methods and models will be used and one will be chosen for later optimization and deployment.

Imbalanced learning techniques can be broadly categorized into cost-sensitive and resampling methods. In **cost-sensitive** methods, algorithms are adjusted to favor the detection of the minority class. This usually implies a modification of the optimization function in the training step of the learning algorithm. 
\
In contrast, **resampling methods** operate at the data level, by adding a pre-processing step to rebalance the dataset before the training algorithm is applied. Resampling can be achieved by removing examples of the majority class (undersampling techniques), adding examples of the minority class (oversampling techniques), etc.

## Cost-Sensitive 

Cost-sensitive learning is a subfield of machine learning that addresses classification problems where the misclassification costs are not equal. Here an imbalance ratio (IR) is used to make the model penalize minority-class mistakes more. In sklearn we can use `class_weight` to make a model cost-sensitive. To get the IR, we simply divide the minority class share by the majority class share.
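The effect of class weights on the loss can be sketched with a few lines of NumPy. This is a simplified illustration under stated assumptions, not sklearn's actual implementation (which also includes regularization): `weighted_log_loss` is a hypothetical helper that scales each sample's log loss by the weight of its true class, and the class counts are the ones reported earlier for the full dataset.

```python
import numpy as np

n_major, n_minor = 284315, 492      # class counts from the dataset
IR = n_minor / n_major              # same value as minority share / majority share
class_weight = {0: IR, 1: 1.0}      # errors on the majority class are down-weighted by IR

def weighted_log_loss(y_true, p_fraud, class_weight):
    """Per-sample log loss scaled by the weight of each sample's true class."""
    y_true = np.asarray(y_true)
    p_fraud = np.clip(np.asarray(p_fraud, dtype=float), 1e-12, 1 - 1e-12)
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    losses = -(y_true * np.log(p_fraud) + (1 - y_true) * np.log(1 - p_fraud))
    return weights * losses

# The same degree of "being wrong" (0.9 confidence in the wrong class) for one sample of each class
missed_fraud = weighted_log_loss([1], [0.1], class_weight)[0]
false_alarm = weighted_log_loss([0], [0.9], class_weight)[0]
print(IR, missed_fraud / false_alarm)
```

With these weights, a missed fraud costs the model several hundred times more than an equally confident false alarm, which pushes the decision boundary towards catching the minority class.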

In [26]:
X.shape, y.shape

((284807, 30), (284807,))

In [27]:
class_counts=np.unique(y, return_counts=True)[1]  # [majority (0) count, minority (1) count]
minor_share, major_share=class_counts[1]/df.shape[0], class_counts[0]/df.shape[0]
print(minor_share, major_share)

IR=minor_share/major_share
print(IR)

0.001727485630620034 0.9982725143693799
0.00173047500131896


Now let's compare 2 `LogisticRegression` models, one default and another cost-sensitive. First, we must get our transformed X.

In [28]:
poly_feats=PolynomialFeatures(2)
feat_select=SelectPercentile(score_func=f_classif, percentile=70)

data_pipe=Pipeline([
                ('polynom', poly_feats),
                ('feature_select', feat_select)])

X_transformed=data_pipe.fit_transform(X_train, y_train)
X_transformed.shape

(213605, 347)

I'll save this pipeline for deployment.

In [29]:
joblib.dump(data_pipe, '../models/data_pipe.joblib')

['../models/data_pipe.joblib']

In [30]:
def train_simple_cv(X, y, class_weight={0:1,1:1}):
    model=LogisticRegression(class_weight=class_weight)

    cv=StratifiedKFold(n_splits=3)
    cv_results_=cross_validate(model, X, y, cv=cv,
                            scoring=['roc_auc',
                                    'average_precision',
                                    'balanced_accuracy'],
                            return_estimator=True)
    return pd.DataFrame(cv_results_).agg('mean')

In [31]:
lr_default_res=train_simple_cv(X_transformed, y_train)
lr_default_res

fit_time                  6.835541
score_time                0.149103
test_roc_auc              0.910031
test_average_precision    0.730929
test_balanced_accuracy    0.890110
dtype: float64

In [32]:
lr_cost_sense_res=train_simple_cv(X_transformed, y_train, {0:IR, 1:1})
lr_cost_sense_res

fit_time                  5.583052
score_time                0.156197
test_roc_auc              0.959604
test_average_precision    0.704659
test_balanced_accuracy    0.928720
dtype: float64

There is a significant boost in ROC AUC (0.910 to 0.960) and Balanced Accuracy (0.890 to 0.929), at the price of a slight decrease in AP (0.731 to 0.705). Overall, cost-sensitivity helps.
\
We can use this technique not only with `LogisticRegression` but with other sklearn classifiers too. Now let's move on and take a look at resampling techniques.

## Resampling

Resampling strategies address class imbalance at the data level, by resampling the dataset to reduce the imbalance ratio. The resampling of an imbalanced dataset occurs before the training of the prediction model and can be seen as a data preprocessing step. Let's look at resampling methods one-by-one.

### Oversampling

Oversampling techniques aim at rebalancing the dataset by creating new samples for the minority class. The two most widely used methods are random oversampling and SMOTE. I'll start with the latter.

SMOTE ([Synthetic Minority Oversampling Technique](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)) oversamples the minority class by generating synthetic examples in the neighborhood of observed ones. The idea is to form new minority examples by interpolating between samples of the same class. 
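The interpolation idea can be sketched in plain NumPy. This is a simplified illustration, not imblearn's implementation: `smote_like_samples` is a hypothetical helper, it uses brute-force distances (fine for a toy minority class, too slow for large ones), and the toy data is random.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, len(X_minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))             # interpolation factor in [0, 1)
    return X_minority[base_idx] + gap * (X_minority[neigh_idx] - X_minority[base_idx])

X_min = np.random.default_rng(1).normal(size=(20, 2))  # toy minority class
X_new = smote_like_samples(X_min, n_new=50)
print(X_new.shape)
```

Every synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbours, so the new points stay inside the region already occupied by the minority class.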

In [None]:
def train_cv(model, X, y, splits=5):
    cv=StratifiedKFold(n_splits=splits)
    cv_results_=cross_validate(model, X, y, cv=cv,
                            scoring=['roc_auc',
                                    'average_precision',
                                    'balanced_accuracy'],
                            return_estimator=True,
                            n_jobs=-1)
    return pd.DataFrame(cv_results_).agg('mean')

In [None]:
model_smote=imblearn.pipeline.Pipeline([('smote', SMOTE()), ('model', LogisticRegression())])
smote_res=train_cv(model_smote, X_transformed, y_train)
smote_res

Really good results here. There are a couple of modified versions of vanilla SMOTE, such as SVM-SMOTE and KMeans-SMOTE (both available in imblearn), which could be tried in the same way.

Now we can use a simpler resampling method called Random Oversampling. Here we don't generate new samples, but just over-sample the minority class(es) by picking samples at random with replacement.

In [None]:
model_ros=imblearn.pipeline.Pipeline([('ros', RandomOverSampler()), ('model', LogisticRegression())])
ros_res=train_cv(model_ros, X_transformed, y_train)
ros_res

ROC AUC is almost the same, but AP and Balanced Accuracy did drop a bit.

### Undersampling

Undersampling refers to the process of reducing the number of samples in the majority class. The naive approach, called random undersampling (RUS), consists in randomly removing samples from the majority class until the desired imbalance ratio is achieved.
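A minimal NumPy sketch of random undersampling is shown below. It assumes the minority class is labelled 1 and targets a 1:1 ratio; `random_undersample` is a hypothetical helper written for illustration, not the imblearn implementation, and the data is synthetic.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    # Sample majority rows without replacement so that no row is duplicated
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)
X_res, y_res = random_undersample(X_demo, y_demo)
print(np.bincount(y_res))
```

The obvious downside is that most of the majority class is thrown away, which can discard useful information when the minority class is as small as it is here.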

In [None]:
model_rus=imblearn.pipeline.Pipeline([('rus', RandomUnderSampler()), ('model', LogisticRegression())])
rus_res=train_cv(model_rus, X_transformed, y_train)
rus_res

### Combination

We can actually combine over- and undersampling techniques. Let's take a look at a SMOTE-RUS combination.

In [None]:
model_smote_rus=imblearn.pipeline.Pipeline([('smote', SMOTE(sampling_strategy=.5)), 
                                            ('rus', RandomUnderSampler(sampling_strategy=1.0)), 
                                            ('model', LogisticRegression())])
smote_rus_res=train_cv(model_smote_rus, X_transformed, y_train)
smote_rus_res
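The two `sampling_strategy` values in the cell above are target minority/majority ratios after each step. The sketch below works through the arithmetic with the class counts of the full dataset, purely as an illustration: each CV training fold is smaller, `combined_resampling_counts` is a hypothetical helper, and imblearn's exact rounding may differ slightly.

```python
def combined_resampling_counts(n_major, n_minor, over_ratio=0.5, under_ratio=1.0):
    """Class counts after oversampling the minority, then undersampling the majority."""
    # Oversampling: the minority grows until minority / majority == over_ratio
    n_minor_after = max(n_minor, int(over_ratio * n_major))
    # Undersampling: the majority shrinks until minority / majority == under_ratio
    n_major_after = min(n_major, int(n_minor_after / under_ratio))
    return n_major_after, n_minor_after

# Illustrative counts: 284315 normal and 492 fraudulent transactions
counts = combined_resampling_counts(284315, 492)
print(counts)
```

So SMOTE first creates synthetic fraud samples until there is one for every two normal transactions, and random undersampling then removes normal transactions until both classes have the same size.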

## Ensemble

Ensemble methods consist of training multiple prediction models for the same prediction task and combining their outputs to make the final prediction. There are ways to incorporate cost-sensitivity or resampling methods into ensembles (see [BalancedBaggingClassifier](https://imbalanced-learn.org/dev/references/generated/imblearn.ensemble.BalancedBaggingClassifier.html), [BalancedRandomForestClassifier](https://imbalanced-learn.org/dev/references/generated/imblearn.ensemble.BalancedRandomForestClassifier.html)). Here I'll use one method, Weighted XGBoost. It's basically XGBoost plus cost-sensitivity. We'll compare the default XGBoost with a weighted one.

In [None]:
default_xgb=XGBClassifier()
default_xgb_res=train_cv(default_xgb, X_transformed, y_train)
default_xgb_res

In [None]:
weighted_xgb=XGBClassifier(scale_pos_weight=1/IR)
weighted_xgb_res=train_cv(weighted_xgb, X_transformed, y_train)
weighted_xgb_res

There is a slight boost in Balanced Accuracy, while ROC AUC and AP are almost the same. Let's now put all results side-by-side.

In [None]:
df_comp=pd.DataFrame(pd.concat([lr_default_res,
            lr_cost_sense_res,
            smote_res,
            ros_res,
            rus_res,
            smote_rus_res,
            default_xgb_res,
            weighted_xgb_res], axis=1))
df_comp.columns=['LR', 'Weighted LR', 'SMOTE LR', 'ROS LR', 'RUS LR', 'SMOTE-RUS LR', 'XGBoost', 'Weighted XGBoost']
df_comp.index=['Fit Time', 'Score Time', 'ROC AUC', 'Average Precision', 'Balanced Accuracy']
df_comp

# Final Model

This is the last section of the project. Here I'll finalize the model, train it on the full dataset and save it for future use.
\
(NOTE: Hyperparameters were chosen by hand. Hyperparameter optimisation could be done, but it would take A LOT of time (approx. 15 hours), hence I had to find somewhat optimal parameters manually.)

In [33]:
params={'max_depth':10,
        'learning_rate':0.2,
        'subsample':0.7,
        'colsample_bytree':0.6,
        'n_estimators':500,
        'scale_pos_weight':1/IR}

xgb_clf=XGBClassifier()
xgb_clf.set_params(**params)
xgb_clf.fit(X_transformed, y_train)

In [34]:
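def eval_model(model, X, y):
    data_pipe_loaded=joblib.load('../models/data_pipe.joblib')
    X_trans=data_pipe_loaded.transform(X)
    # Predict once and reuse. These are hard 0/1 labels, so ROC AUC here equals balanced accuracy;
    # model.predict_proba(X_trans)[:, 1] would give threshold-free ROC AUC and AP (different numbers).
    y_pred=model.predict(X_trans)
    roc_auc=roc_auc_score(y, y_pred)
    average_precision=average_precision_score(y, y_pred)
    balanced_accuracy=balanced_accuracy_score(y, y_pred)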

    return {'ROC AUC': roc_auc,
            'Average Precision': average_precision, 
            'Balanced Accuracy': balanced_accuracy}

final_res=eval_model(xgb_clf, X_test, y_test)
final_res

{'ROC AUC': 0.906454824055268,
 'Average Precision': 0.7601437069531122,
 'Balanced Accuracy': 0.9064548240552679}

Now I'll create and save a final model trained on the entire dataset.

In [35]:
xgb_clf_final=XGBClassifier()
xgb_clf_final.set_params(**params)

In [36]:
data_pipe_loaded=joblib.load('../models/data_pipe.joblib')
X_final=data_pipe_loaded.transform(X)

X_final.shape, y.shape

((284807, 347), (284807,))

In [37]:
xgb_clf_final.fit(X_final, y)

In [43]:
joblib.dump(xgb_clf_final, '../models/xgb_final_model.sav')

['../models/xgb_final_model.sav']

That's all! In this project we've seen several ways to deal with imbalanced data. Hope you've learned something new and found the project helpful!