## Credit Card Fraud Detection

### Problem Statement
The goal of this project is to predict fraudulent credit card transactions using machine learning models.

In this project, you will analyse customer-level data collected during a research collaboration between Worldline and the Machine Learning Group.

### Business problem overview
For many banks, retaining highly profitable customers is the number one business goal. Banking fraud, however, poses a significant threat to this goal. Because it causes substantial financial losses and erodes trust and credibility, it is a concern for banks and customers alike.

The Nilson Report estimated that banking fraud would account for $30 billion worldwide by 2020. With the rise of digital payment channels, fraudulent transactions are also increasing in number and taking new forms.

In the banking industry, credit card fraud detection using machine learning is not just a trend but a necessity: banks need proactive monitoring and fraud prevention mechanisms in place. Machine learning helps these institutions reduce time-consuming manual reviews, costly chargebacks and fees, and denials of legitimate transactions.

In [None]:
import numpy as np
import pandas as pd
from collections import Counter


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# import machine learning and stats libraries:
from scipy import stats
from scipy.stats import norm, skew
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax

import sklearn
from sklearn import metrics
from sklearn.metrics import roc_curve, auc, f1_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import average_precision_score, precision_recall_curve

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, PowerTransformer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.linear_model import Ridge, Lasso, LogisticRegression, LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC

import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance

from skopt import BayesSearchCV

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

#!pip install xgboost imbalanced-learn scikit-optimize

## Exploratory data analysis

In [None]:
df = pd.read_csv('creditcard.csv')
df.head()

In [None]:
# Check Row and Column Count
print(df.shape)
# Check distribution of data:
df.describe()

In [None]:
# Check Column Names, Types, Counts, Null counts
print(df.info())

In [None]:
# Check the fraud / non-fraud record counts
print(df['Class'].value_counts())
# Share of each class as a percentage of all records
(df.groupby('Class')['Class'].count()/df['Class'].count()) *100
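The class shares computed above can also be obtained in one call. The following is a minimal sketch on made-up labels (`toy_class` is a hypothetical stand-in for `df['Class']`, not the real data), assuming only that the target is a 0/1 column:

```python
import pandas as pd

# Toy labels standing in for df['Class']: 3 fraud rows out of 1000 (made-up numbers).
toy_class = pd.Series([0] * 997 + [1] * 3, name="Class")

# normalize=True returns fractions instead of raw counts; multiply by 100 for percentages.
class_share = toy_class.value_counts(normalize=True) * 100
print(class_share.round(2))
```

On the real data, the same call on `df['Class']` should give the fraud and non-fraud percentages directly.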

In [None]:
# Plot Correlation Matrix
cor = df.corr()
plt.figure(figsize=(24,18))
sns.heatmap(cor, cmap="YlGnBu", annot=True)
plt.show()

Next, we look at the distribution of the two classes.

In [None]:
classes=df['Class'].value_counts()
normal_share=classes[0]/df['Class'].count()*100
fraud_share=classes[1]/df['Class'].count()*100

In [None]:
# Create a bar plot for the number of fraudulent vs non-fraudulent transactions

plt.figure(figsize=(7,5))
sns.countplot(x='Class', data=df)
plt.title("Class Count", fontsize=18)
plt.xlabel("Record counts by class", fontsize=15)
plt.ylabel("Count", fontsize=15)
plt.show()

In [None]:
# Create a scatter plot to observe the distribution of classes with time
plt.figure(figsize=(10,3))
cmap = sns.color_palette('Set2')
sns.scatterplot(x=df['Time'], y='Class', palette=cmap, data=df)
plt.xlabel('Time', size=18)
plt.ylabel('Class', size=18)
plt.tick_params(axis='x', labelsize=16)
plt.tick_params(axis='y', labelsize=16) 
plt.title('Time vs Class Distribution', size=20, y=1.05)

In [None]:
# Create a scatter plot to observe the distribution of classes with Amount
plt.figure(figsize=(10,3))
sns.scatterplot(x=df['Amount'], y='Class', palette=cmap, data=df)
plt.xlabel('Amount', size=18)
plt.ylabel('Class', size=18)
plt.tick_params(axis='x', labelsize=16)
plt.tick_params(axis='y', labelsize=16) 
plt.title('Amount vs Class Distribution', size=20, y=1.05)

In [None]:
# Drop unnecessary columns
df = df.drop('Time', axis = 1)

### Splitting the data into train & test data

In [None]:
y= df['Class']
X = df.drop(['Class'], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.20)
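With so few fraud rows, a plain random split can leave the train and test sets with noticeably different fraud shares. One common option, shown here only as a sketch on synthetic data (`X_toy` and `y_toy` are made up, and the notebook's own split does not use it), is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 20 of them positive (0.2%).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the positive share the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=100, stratify=y_toy
)
print("positives in train:", y_tr.sum(), "| positives in test:", y_te.sum())
```

Whether stratification changes the results on the real data would need to be checked by rerunning the notebook.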

##### Preserve X_test & y_test to evaluate on the test data once you build the model

In [None]:
# Number of fraud cases overall, in the training set and in the test set
print(np.sum(y))
print(np.sum(y_train))
print(np.sum(y_test))

### Plotting the distribution of a variable

In [None]:
# plot the histogram of a variable from the dataset to see the skewness
normal_records = df.Class == 0
fraud_records = df.Class == 1
cols = list(X.columns.values)

plt.figure(figsize=(20, 60))
for n, col in enumerate(cols):
    plt.subplot(10,3,n+1)
    sns.distplot(X[col][normal_records], color='green')
    sns.distplot(X[col][fraud_records], color='red')
    plt.title(col, fontsize=17)
plt.show()


### If there is skewness present in the distribution use:
- <b>Power Transformer</b> package present in the <b>preprocessing library provided by sklearn</b> to make distribution more gaussian

In [None]:
# Fit PowerTransformer on the training data only, then transform both train and test data
pt = PowerTransformer()
pt.fit(X_train)                       ## Fit the PT on training data
X_train_pt = pt.transform(X_train)    ## Apply the fitted transform to the training data
X_test_pt = pt.transform(X_test)      ## ... and to the test data

X_train_pt_df = pd.DataFrame(data=X_train_pt, columns=cols)
X_test_pt_df = pd.DataFrame(data=X_test_pt, columns=cols)
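To illustrate what the power transform does to a skewed column, here is a small sketch on synthetic log-normal data (the arrays `train_col` and `test_col` are made up and are not columns of the credit card dataset). It follows the same pattern as above: fit on the training part, then reuse the fitted transformer on the test part.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic, strongly right-skewed feature (log-normal), split into train and test parts.
rng = np.random.default_rng(42)
train_col = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))
test_col = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Default method is Yeo-Johnson; the output is also standardized by default.
pt_toy = PowerTransformer()
train_t = pt_toy.fit_transform(train_col)   # learn lambda on the training part only
test_t = pt_toy.transform(test_col)         # reuse the learned lambda on the test part

skew_before = skew(train_col[:, 0])
skew_after = skew(train_t[:, 0])
print(f"skew before: {skew_before:.2f}, skew after: {skew_after:.2f}")
```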

In [None]:
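# plot the histogram of each variable from the transformed training data to see the result
# X_train_pt_df has a fresh 0..n-1 index, so build the class masks from y_train by position
train_normal = (y_train == 0).to_numpy()
train_fraud = (y_train == 1).to_numpy()
plt.figure(figsize=(20, 60))
for n, col in enumerate(cols):
    plt.subplot(10,3,n+1)
    sns.distplot(X_train_pt_df[col][train_normal], color='green')
    sns.distplot(X_train_pt_df[col][train_fraud], color='red')
    plt.title(col, fontsize=17)
plt.show()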

In [None]:
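# plot the histogram of each variable from the transformed test data to see the result
# X_test_pt_df has a fresh 0..n-1 index, so build the class masks from y_test by position
test_normal = (y_test == 0).to_numpy()
test_fraud = (y_test == 1).to_numpy()
plt.figure(figsize=(20, 60))
for n, col in enumerate(cols):
    plt.subplot(10,3,n+1)
    sns.distplot(X_test_pt_df[col][test_normal], color='green')
    sns.distplot(X_test_pt_df[col][test_fraud], color='red')
    plt.title(col, fontsize=17)
plt.show()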

## Model Building
- Build different models on the imbalanced dataset and see the result

### Helper Functions

In [None]:
# List to store model performance results (one row per sampling method, model and data split)
overall_results = []

# Creating function to display ROC-AUC score, f1 score and classification report
def display_scores(y_test, y_pred, y_test_pred_proba):
    '''
    Display ROC-AUC score, f1 score and classification report of a model.
    '''
    f1score = f1_score(y_test, y_pred)
    print(f"F1 Score: {round(f1score*100, 2)}%") 
    print(f"Classification Report: \n {classification_report(y_test, y_pred)}")
    
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_pred_proba,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score(y_test, y_test_pred_proba)
    print(f"AUC Score: {round(auc_score*100, 2)}%")
    plt.figure(figsize=(5, 5))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
    
    result = [f1score, auc_score]
    
    return result
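The notebook imports `average_precision_score` but `display_scores` reports only F1 and ROC AUC. On heavily imbalanced data, a precision-recall summary is often reported alongside ROC AUC. The sketch below uses made-up labels and scores (`y_true_toy`, `y_score_toy`) purely to show the two calls side by side; it is not part of the original helper.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up example: 2 positives among 10 samples, with one negative scored fairly high.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score_toy = np.array([0.10, 0.20, 0.15, 0.30, 0.05, 0.80, 0.25, 0.35, 0.70, 0.90])

roc_auc_toy = roc_auc_score(y_true_toy, y_score_toy)
avg_prec_toy = average_precision_score(y_true_toy, y_score_toy)

# ROC AUC counts correctly ordered (positive, negative) pairs: 15 of 16 here.
# Average precision weights precision at each recall step: 0.5 * 1 + 0.5 * (2/3).
print(f"ROC AUC: {roc_auc_toy:.4f}, average precision: {avg_prec_toy:.4f}")
```

A single high-scoring negative lowers average precision much more than ROC AUC in this toy case, which is why the two metrics can tell different stories when positives are rare.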

## Model 1 - Logistic Regression Model

In [None]:
def cv_logistic_regression(X_train, y_train):
    # Logistic Regression parameters for K-fold cross validation
    params = {"C": [0.01, 0.1, 1, 10, 100, 1000]}
    folds = KFold(n_splits=5, shuffle=True, random_state=4)

    #perform cross validation
    model_cv = GridSearchCV(estimator = LogisticRegression(),
                            param_grid = params, 
                            scoring= 'roc_auc', 
                            cv = folds, 
                            n_jobs=-1,
                            verbose = 1,
                            return_train_score=True) 
    #perform hyperparameter tuning
    model_cv.fit(X_train, y_train)
    #print the evaluation result for the chosen evaluation metric
    print('Best ROC AUC score: ', model_cv.best_score_)
    #print the optimum value of hyperparameters
    print('Best hyperparameters: ', model_cv.best_params_)
    
    cv_results = pd.DataFrame(model_cv.cv_results_)
    
    # plot of C versus train and validation scores
    plt.figure(figsize=(8, 6))
    plt.plot(cv_results['param_C'], cv_results['mean_test_score'])
    plt.plot(cv_results['param_C'], cv_results['mean_train_score'])
    plt.xlabel('C')
    plt.ylabel('ROC AUC')
    plt.legend(['test result', 'train result'], loc='upper left')
    plt.xscale('log')
    plt.show()
    return model_cv.best_params_
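The cross-validation above uses `KFold`, while `StratifiedKFold` is imported but unused. As a hedged illustration of the difference (on made-up labels, not the notebook's data), the sketch below counts how many positives land in each validation fold under both splitters:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up labels: 10 positives among 1000 samples; the features are irrelevant here.
y_toy = np.zeros(1000, dtype=int)
y_toy[:10] = 1
X_toy = np.zeros((1000, 1))

def positives_per_fold(splitter):
    """Return the number of positive samples in each validation fold."""
    return [int(y_toy[val_idx].sum()) for _, val_idx in splitter.split(X_toy, y_toy)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=4))
strat = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=4))

print("KFold:          ", plain)
print("StratifiedKFold:", strat)
```

With `StratifiedKFold`, every fold receives the same number of positives, so the per-fold ROC AUC estimates are based on comparable class mixes.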

In [None]:
def model_logistic_regression(X_train, y_train):
    best_params = cv_logistic_regression(X_train, y_train)
    
    # Instantiating the model with best C
    log_reg_imb_model = LogisticRegression(C=best_params.get('C'))
    
    # Fitting the model on train dataset
    log_reg_imb_model.fit(X_train, y_train)
    
    # Predictions on the train set
    print("-" * 40)
    print(f"{'-'*10} Predict on Train Set {'-'*10}")
    print("-" * 40)
    y_train_pred = log_reg_imb_model.predict(X_train)
    
    # Predicted probability
    y_train_pred_proba = log_reg_imb_model.predict_proba(X_train)[:,1]

    result_train = display_scores(y_train, y_train_pred, y_train_pred_proba)
    
    # Making prediction on the test set
    print("-" * 45)
    print(f"{'-'*10} Predict on Test Set {'-'*10}")
    print("-" * 45)
    y_test_pred = log_reg_imb_model.predict(X_test)

    # Predicted probability
    y_test_pred_proba = log_reg_imb_model.predict_proba(X_test)[:,1]

    result_test = display_scores(y_test, y_test_pred, y_test_pred_proba)
    
    print(f"{'='*43}\n{'='*15}Model Summary{'='*15}\n{'='*43}")
    # display_scores returns [f1score, auc_score], so index by position
    print(f"Train Set:\n---------\nF1 Score - {round(result_train[0]*100, 2)}%\nAUC Score - {round(result_train[1]*100, 2)}% ")
    print(f"Test Set:\n--------\nF1 Score - {round(result_test[0]*100, 2)}%\nAUC Score - {round(result_test[1]*100, 2)}% ")
    
    return log_reg_imb_model, result_train, result_test

In [None]:
log_reg_imb_model, result_train, result_test = model_logistic_regression(X_train, y_train)
overall_results.extend([ ['Imbalanced', 'Logistic regression', 'train', *result_train],
                         ['Imbalanced', 'Logistic regression', 'test', *result_test]
                       ])

### Similarly explore other algorithms by building models like:
- KNN
- SVM
- Decision Tree
- Random Forest
- XGBoost

## Model 2 - SVM Model

#### Cross Validation with K-Fold

In [None]:
def cv_svm(X_train, y_train):
    # Hyperparameters for Grid Search
    # (the grid is limited to a single combination; C = 10, 100 and kernel = 'rbf' are further candidates)
    param_grid = {'C': [0.1],
                  'kernel': ['sigmoid']
                 } 
  
    svm_model = SVC(probability=True, gamma='auto')
    
    folds = 2

    #perform cross validation
    model_cv = GridSearchCV(estimator = svm_model, 
                        param_grid = param_grid, 
                        scoring= 'roc_auc', 
                        cv = folds,
                        verbose = 3,
                        return_train_score=True)  
    # fit the model
    model_cv.fit(X_train, y_train)
    #print the evaluation result for the chosen evaluation metric
    print('Best ROC AUC score: ', model_cv.best_score_)
    #print the optimum value of hyperparameters
    print('Best hyperparameters: ', model_cv.best_params_)
    print('Best Estimator: ', model_cv.best_estimator_)

    return model_cv.best_params_
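`SVC(probability=True)` slows down fitting because scikit-learn runs an internal cross-validation to calibrate the probabilities. ROC AUC only needs a ranking score, so `decision_function` can be used instead when calibrated probabilities are not required. The following is a sketch on synthetic data (`X_toy`, `y_toy` come from `make_classification`, not from the notebook), shown as an option rather than as the notebook's method:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Small synthetic, imbalanced binary problem (roughly 10% positives).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, weights=[0.9, 0.1], random_state=0
)

# probability is left at its default (False), so no calibration step is run.
svm_toy = SVC(kernel="rbf", gamma="auto")
svm_toy.fit(X_toy, y_toy)

# Signed distance to the separating surface; higher means "more likely class 1".
scores_toy = svm_toy.decision_function(X_toy)
auc_toy = roc_auc_score(y_toy, scores_toy)
print(f"Train ROC AUC from decision_function: {auc_toy:.3f}")
```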


#### Fit Model

In [None]:
def model_svm(X_train, y_train):
    best_params = cv_svm(X_train, y_train)
    
    # fit model on training data
    imb_model = SVC(C=best_params['C'],
                        gamma='auto', 
                        kernel=best_params['kernel'],
                        probability=True)
    
    # Fitting the model on train dataset
    imb_model.fit(X_train, y_train)
    
    # Predictions on the train set
    print("-" * 40)
    print(f"{'-'*10} Predict on Train Set {'-'*10}")
    print("-" * 40)
    y_train_pred = imb_model.predict(X_train)
    
    # Predicted probability
    y_train_pred_proba = imb_model.predict_proba(X_train)[:,1]

    # Plot the ROC curve
    result_train = display_scores(y_train, y_train_pred, y_train_pred_proba)
    
    # Making prediction on the test set
    print("-" * 45)
    print(f"{'-'*10} Predict on Test Set {'-'*10}")
    print("-" * 45)
    y_test_pred = imb_model.predict(X_test)

    # Predicted probability
    y_test_pred_proba = imb_model.predict_proba(X_test)[:,1]

    # Plot the ROC curve
    result_test = display_scores(y_test, y_test_pred, y_test_pred_proba)
    
    print(f"{'='*43}\n{'='*15}Model Summary{'='*15}\n{'='*43}")
    # display_scores returns [f1score, auc_score], so index by position
    print(f"Train Set:\n---------\nF1 Score - {round(result_train[0]*100, 2)}%\nAUC Score - {round(result_train[1]*100, 2)}% ")
    print(f"Test Set:\n--------\nF1 Score - {round(result_test[0]*100, 2)}%\nAUC Score - {round(result_test[1]*100, 2)}% ")

    # Return the train and test results separately, as the calling cells expect three values
    return imb_model, result_train, result_test

In [None]:
svm_imb_model, result_train, result_test = model_svm(X_train, y_train)
overall_results.extend([ ['Imbalanced', 'SVM', 'train', *result_train],
                         ['Imbalanced', 'SVM', 'test', *result_test]
                       ])

## Model 3 - XGBoost

In [None]:
def cv_xgboost(X_train, y_train):
    # Hyperparameters for Grid Search
    param_grid = {'learning_rate': [0.2, 0.6], 
             'subsample': [0.3, 0.6, 0.9]}
    folds = 3

    xgb_model = XGBClassifier(max_depth=2, n_estimators=200, eval_metric='logloss')
    
    #perform cross validation
    model_cv = GridSearchCV(estimator = xgb_model, 
                        param_grid = param_grid, 
                        scoring= 'roc_auc', 
                        cv = folds, 
                        verbose = 1,
                        return_train_score=True)  
    # fit the model
    model_cv.fit(X_train, y_train)
    #print the evaluation result for the chosen evaluation metric
    print('Best ROC AUC score: ', model_cv.best_score_)
    #print the optimum value of hyperparameters
    print('Best hyperparameters: ', model_cv.best_params_)
    print('Best Estimator: ', model_cv.best_estimator_)

    return model_cv.best_params_

In [None]:
def model_xgboost(X_train, y_train):
    best_params = cv_xgboost(X_train, y_train)
    
    # fit model on training data
    xgb_imb_model = XGBClassifier(learning_rate=best_params['learning_rate'],
                                  max_depth= 2, 
                                  n_estimators=200,
                                  subsample=best_params['subsample'],
                                  objective='binary:logistic',
                                  eval_metric='logloss')
    
    # Fitting the model on train dataset
    xgb_imb_model.fit(X_train, y_train)
    
    # Predictions on the train set
    print("-" * 40)
    print(f"{'-'*10} Predict on Train Set {'-'*10}")
    print("-" * 40)
    y_train_pred = xgb_imb_model.predict(X_train)
    
    # Predicted probability
    y_train_pred_proba = xgb_imb_model.predict_proba(X_train)[:,1]

    # Plot the ROC curve
    result_train = display_scores(y_train, y_train_pred, y_train_pred_proba)
    
    # Making prediction on the test set
    print("-" * 45)
    print(f"{'-'*10} Predict on Test Set {'-'*10}")
    print("-" * 45)
    y_test_pred = xgb_imb_model.predict(X_test)

    # Predicted probability
    y_test_pred_proba = xgb_imb_model.predict_proba(X_test)[:,1]

    # Plot the ROC curve
    result_test = display_scores(y_test, y_test_pred, y_test_pred_proba)
    
    print(f"{'='*43}\n{'='*15}Model Summary{'='*15}\n{'='*43}")
    # display_scores returns [f1score, auc_score], so index by position
    print(f"Train Set:\n---------\nF1 Score - {round(result_train[0]*100, 2)}%\nAUC Score - {round(result_train[1]*100, 2)}% ")
    print(f"Test Set:\n--------\nF1 Score - {round(result_test[0]*100, 2)}%\nAUC Score - {round(result_test[1]*100, 2)}% ")

    # Return the train and test results separately, as the calling cells expect three values
    return xgb_imb_model, result_train, result_test
    

In [None]:
xgb_imb_model, result_train, result_test = model_xgboost(X_train, y_train)

overall_results.extend([ ['Imbalanced', 'XGBoost', 'train', *result_train],
                         ['Imbalanced', 'XGBoost', 'test', *result_test]
                       ])

#### Proceed with the model which shows the best result 
- Apply the best hyperparameter on the model
- Predict on the test dataset

In [None]:
clf = xgb_imb_model

### Print the important features of the best model to understand the dataset
- Because the features are already PCA-transformed, the importances say little about the original variables
- On a dataset that is not PCA-transformed, this step helps explain which variables drive the predictions

In [None]:
var_imp = []
for i in clf.feature_importances_:
    var_imp.append(i)
print('Top var =', var_imp.index(np.sort(clf.feature_importances_)[-1])+1)
print('2nd Top var =', var_imp.index(np.sort(clf.feature_importances_)[-2])+1)
print('3rd Top var =', var_imp.index(np.sort(clf.feature_importances_)[-3])+1)

# Variable on Index-16 and Index-13 seems to be the top 2 variables
top_var_index = var_imp.index(np.sort(clf.feature_importances_)[-1])
second_top_var_index = var_imp.index(np.sort(clf.feature_importances_)[-2])

X_train_1 = X_train.to_numpy()[np.where(y_train==1.0)]
X_train_0 = X_train.to_numpy()[np.where(y_train==0.0)]

np.random.shuffle(X_train_0)

plt.rcParams['figure.figsize'] = [20, 20]

plt.scatter(X_train_1[:, top_var_index], X_train_1[:, second_top_var_index], label='Actual Class-1 Examples')
plt.scatter(X_train_0[:X_train_1.shape[0], top_var_index], X_train_0[:X_train_1.shape[0], second_top_var_index],
            label='Actual Class-0 Examples')
plt.legend()
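The top features can also be read off with `np.argsort`, which returns names rather than 1-based positions. This is a sketch on made-up importances (`feature_names_toy` and `importances_toy` are hypothetical; on the real model they would be `X_train.columns` and `clf.feature_importances_`):

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only.
feature_names_toy = ["V1", "V2", "V3", "V4", "Amount"]
importances_toy = np.array([0.10, 0.40, 0.05, 0.30, 0.15])

# argsort is ascending, so reverse it to list the most important feature first.
order = np.argsort(importances_toy)[::-1]
top3 = [(feature_names_toy[i], float(importances_toy[i])) for i in order[:3]]

for name, value in top3:
    print(f"{name}: {value:.2f}")
```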

## Model building with balancing Classes

##### Perform class balancing with :
- Random Oversampling
- SMOTE
- ADASYN

### Random Oversampling

In [None]:
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)

# Class distribution before sampling
print('Before sampling class distribution:-',Counter(y_train))
# new class distribution 
print('New class distribution:-',Counter(y_train_over))
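Conceptually, random oversampling duplicates minority rows, drawn with replacement, until both classes have the same size. The NumPy-only sketch below shows that idea on a tiny made-up array. It is not the imbalanced-learn implementation, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up dataset: 8 majority rows (class 0) and 2 minority rows (class 1).
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until the classes are balanced.
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_bal = np.vstack([X_toy, X_toy[extra_idx]])
y_bal = np.concatenate([y_toy, y_toy[extra_idx]])

print("before:", np.bincount(y_toy), "after:", np.bincount(y_bal))
```

Because the added rows are exact copies, no new information is created. The method only changes how much weight the minority class carries during training.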

### Model Building with Balanced Dataset (Random Oversampling)

#### Logistic Regression Model

In [None]:
log_reg_over_model, result_train, result_test = model_logistic_regression(X_train_over, y_train_over)

overall_results.extend([ ['Random Oversampling', 'Logistic Regression', 'train', *result_train],
                         ['Random Oversampling', 'Logistic Regression', 'test', *result_test]
                       ])

#### SVM Model

In [None]:
svm_over_model, result_train, result_test = model_svm(X_train_over, y_train_over)

overall_results.extend([ ['Random Oversampling', 'SVM', 'train', *result_train],
                         ['Random Oversampling', 'SVM', 'test', *result_test]
                       ])

#### XGBoost Model

In [None]:
xgb_over_model, result_train, result_test = model_xgboost(X_train_over, y_train_over)

overall_results.extend([ ['Random Oversampling', 'XGBoost', 'train', *result_train],
                         ['Random Oversampling', 'XGBoost', 'test', *result_test]
                       ])

### SMOTE (Synthetic Minority Oversampling Technique)

### Print the class distribution after applying SMOTE 

In [None]:
sm = SMOTE(random_state=0)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)
print('Class distribution after SMOTE:-', Counter(y_train_smote))
# Artificial minority samples and corresponding minority labels from SMOTE are appended
# below X_train and y_train respectively
# So to exclusively get the artificial minority samples from SMOTE, we do
X_train_smote_1 = X_train_smote[X_train.shape[0]:]

X_train_1 = X_train.to_numpy()[np.where(y_train==1.0)]
X_train_0 = X_train.to_numpy()[np.where(y_train==0.0)]


plt.rcParams['figure.figsize'] = [20, 20]
fig = plt.figure()

plt.subplot(3, 1, 1)
plt.scatter(X_train_1[:, 0], X_train_1[:, 1], label='Actual Class-1 Examples')
plt.legend()

plt.subplot(3, 1, 2)
plt.scatter(X_train_1[:, 0], X_train_1[:, 1], label='Actual Class-1 Examples')
plt.scatter(X_train_smote_1.iloc[:X_train_1.shape[0], 0], X_train_smote_1.iloc[:X_train_1.shape[0], 1],
            label='Artificial SMOTE Class-1 Examples')
plt.legend()

plt.subplot(3, 1, 3)
plt.scatter(X_train_1[:, 0], X_train_1[:, 1], label='Actual Class-1 Examples')
plt.scatter(X_train_0[:X_train_1.shape[0], 0], X_train_0[:X_train_1.shape[0], 1], label='Actual Class-0 Examples')
plt.legend()
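The synthetic points plotted above come from interpolation: SMOTE picks a minority sample, picks one of its k nearest minority neighbours, and places a new point somewhere on the line segment between them. The sketch below reproduces that core step in NumPy on made-up points. It is a simplified illustration, not the imbalanced-learn implementation, and `smote_like_sample` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority points inside the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
k = 2  # number of nearest minority neighbours to choose from

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random sample and one of its k neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]   # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()                        # interpolation factor in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k, rng) for _ in range(100)])
print(synthetic[:3])
```

Every synthetic point is a convex combination of two real minority points, so it always lies inside the region already spanned by the minority class.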

### Model Building with Balanced Dataset (SMOTE)

#### Logistic Regression Model

In [None]:
log_reg_smote_model, result_train, result_test = model_logistic_regression(X_train_smote, y_train_smote)

overall_results.extend([ ['SMOTE', 'Logistic regression', 'train', *result_train],
                         ['SMOTE', 'Logistic regression', 'test', *result_test]
                       ])


#### SVM Model

In [None]:
svm_smote_model, result_train, result_test = model_svm(X_train_smote, y_train_smote)

overall_results.extend([ ['SMOTE', 'SVM', 'train', *result_train],
                         ['SMOTE', 'SVM', 'test', *result_test]
                       ])

#### XGBoost Model

In [None]:
xgb_smote_model, result_train, result_test = model_xgboost(X_train_smote, y_train_smote)

overall_results.extend([ ['SMOTE', 'XGBoost', 'train', *result_train],
                         ['SMOTE', 'XGBoost', 'test', *result_test]
                       ])

### ADASYN (Adaptive Synthetic Sampling)

### Print the class distribution after applying ADASYN

In [None]:
ada = ADASYN(random_state=0)
X_train_adasyn, y_train_adasyn = ada.fit_resample(X_train, y_train)
print('Class distribution after ADASYN:-', Counter(y_train_adasyn))

# Artificial minority samples and corresponding minority labels from ADASYN are appended
# below X_train and y_train respectively
# So to exclusively get the artificial minority samples from ADASYN, we do
X_train_adasyn_1 = X_train_adasyn[X_train.shape[0]:]

X_train_1 = X_train.to_numpy()[np.where(y_train==1.0)]
X_train_0 = X_train.to_numpy()[np.where(y_train==0.0)]

plt.rcParams['figure.figsize'] = [20, 20]
fig = plt.figure()

plt.subplot(3, 1, 1)
plt.scatter(X_train_1[:, 0], X_train_1[:, 1], label='Actual Class-1 Examples')
plt.legend()

plt.subplot(3, 1, 2)
plt.scatter(X_train_1[:, 0], X_train_1[:, 1], label='Actual Class-1 Examples')
plt.scatter(X_train_adasyn_1.iloc[:X_train_1.shape[0], 0], X_train_adasyn_1.iloc[:X_train_1.shape[0], 1],
            label='Artificial ADASYN Class-1 Examples')
plt.legend()

plt.subplot(3, 1, 3)
plt.scatter(X_train_1[:, 0], X_train_1[:, 1], label='Actual Class-1 Examples')
plt.scatter(X_train_0[:X_train_1.shape[0], 0], X_train_0[:X_train_1.shape[0], 1], label='Actual Class-0 Examples')
plt.legend()
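ADASYN differs from SMOTE mainly in where it generates points: minority samples surrounded by many majority neighbours receive more synthetic samples. The sketch below shows only that weighting step, in NumPy, on made-up points. It is a simplification under stated assumptions (Euclidean distance, k = 3, hypothetical variable names) and not the imbalanced-learn implementation.

```python
import numpy as np

# Made-up data: two minority points far from the majority, one in the middle of it.
X_min = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
X_maj = np.array([[2.1, 2.0], [2.0, 2.1], [1.9, 2.0], [5.0, 5.0], [5.1, 5.0]])
X_all = np.vstack([X_min, X_maj])
is_majority = np.array([False] * len(X_min) + [True] * len(X_maj))

k = 3                # neighbours inspected around each minority point
n_to_generate = 10   # total number of synthetic samples wanted

# For each minority point: share of majority samples among its k nearest neighbours.
ratios = []
for x in X_min:
    dists = np.linalg.norm(X_all - x, axis=1)
    nearest = np.argsort(dists)[1:k + 1]   # position 0 is the point itself
    ratios.append(is_majority[nearest].mean())
ratios = np.array(ratios)

# Normalize the ratios into weights and allocate the synthetic samples accordingly.
weights = ratios / ratios.sum()
per_point = np.rint(weights * n_to_generate).astype(int)
print("synthetic samples per minority point:", per_point)
```

The third minority point sits among majority samples, so it receives most of the synthetic budget. The actual points would then be generated by interpolation, as in SMOTE.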

### Model Building with Balanced Dataset (ADASYN)

#### Logistic Regression Model

In [None]:
log_reg_adasyn_model, result_train, result_test = model_logistic_regression(X_train_adasyn, y_train_adasyn)

overall_results.extend([ ['ADASYN', 'Logistic Regression', 'train', *result_train],
                         ['ADASYN', 'Logistic Regression', 'test', *result_test]
                       ])

#### SVM Model

In [None]:
svm_adasyn_model, result_train, result_test = model_svm(X_train_adasyn, y_train_adasyn)

overall_results.extend([ ['ADASYN', 'SVM', 'train', *result_train],
                         ['ADASYN', 'SVM', 'test', *result_test]
                       ])

#### XGBoost Model

In [None]:
xgb_adasyn_model, result_train, result_test = model_xgboost(X_train_adasyn, y_train_adasyn)

overall_results.extend([ ['ADASYN', 'XGBoost', 'train', *result_train],
                         ['ADASYN', 'XGBoost', 'test', *result_test]
                       ])

### Select the oversampling method which shows the best result on a model
- Apply the best hyperparameter on the model
- Predict on the test dataset

In [None]:
# Keep the model trained with the chosen oversampling method (here: XGBoost on ADASYN data).
# It was already fitted on the balanced training set and evaluated on X_test above.

clf = xgb_adasyn_model

### Print the important features of the best model to understand the dataset

In [None]:
var_imp = []
for i in clf.feature_importances_:
    var_imp.append(i)
print('Top var =', var_imp.index(np.sort(clf.feature_importances_)[-1])+1)
print('2nd Top var =', var_imp.index(np.sort(clf.feature_importances_)[-2])+1)
print('3rd Top var =', var_imp.index(np.sort(clf.feature_importances_)[-3])+1)

# Variable on Index-13 and Index-9 seems to be the top 2 variables
top_var_index = var_imp.index(np.sort(clf.feature_importances_)[-1])
second_top_var_index = var_imp.index(np.sort(clf.feature_importances_)[-2])

X_train_1 = X_train.to_numpy()[np.where(y_train==1.0)]
X_train_0 = X_train.to_numpy()[np.where(y_train==0.0)]

np.random.shuffle(X_train_0)

plt.rcParams['figure.figsize'] = [20, 20]

plt.scatter(X_train_1[:, top_var_index], X_train_1[:, second_top_var_index], label='Actual Class-1 Examples')
plt.scatter(X_train_0[:X_train_1.shape[0], top_var_index], X_train_0[:X_train_1.shape[0], second_top_var_index],
            label='Actual Class-0 Examples')
plt.legend()

## Conclusion and Results

In [None]:
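# Collect all results in one table with named columns
results_df = pd.DataFrame(overall_results,
                          columns=['Sampling', 'Model', 'Dataset', 'F1 Score', 'AUC Score'])
print(results_df)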
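To compare models, it helps to look only at the test rows and sort them by the chosen metric. The sketch below does this on placeholder rows. The numbers in `toy_results` are invented for illustration and are not results from this notebook, and the column names are an assumption about how the five fields of each row could be labelled.

```python
import pandas as pd

# Placeholder rows in the same five-field layout as overall_results (values are made up).
toy_results = [
    ["Imbalanced", "Logistic regression", "train", 0.70, 0.95],
    ["Imbalanced", "Logistic regression", "test", 0.68, 0.94],
    ["SMOTE", "XGBoost", "train", 0.99, 0.99],
    ["SMOTE", "XGBoost", "test", 0.80, 0.97],
]
toy_df = pd.DataFrame(
    toy_results, columns=["Sampling", "Model", "Dataset", "F1 Score", "AUC Score"]
)

# Keep only test-set rows and rank them by AUC, best first.
test_ranked = (
    toy_df[toy_df["Dataset"] == "test"]
    .sort_values("AUC Score", ascending=False)
    .reset_index(drop=True)
)
print(test_ranked)
```

Ranking on test rows only avoids picking a model because of a high training score, which mostly reflects how well it memorised the (possibly oversampled) training data.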