## Handling Imbalanced Dataset with Machine Learning
### Context
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

### Content
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
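
As a quick check of the imbalance figures quoted above, the fraud rate can be recomputed from the two counts. This is plain arithmetic on the numbers in the text; the variable names are illustrative only.

```python
# Counts quoted in the dataset description.
n_fraud = 492
n_total = 284_807

# Share of fraudulent transactions (roughly 0.17%).
fraud_rate = n_fraud / n_total
print(f"Fraud rate: {fraud_rate:.3%}")

# How many legitimate transactions there are for every fraud.
legit_per_fraud = (n_total - n_fraud) / n_fraud
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```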

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.

### Inspiration
Identify fraudulent credit card transactions.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
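
To make the point about accuracy concrete, the sketch below computes the accuracy of a trivial model that labels every transaction as legitimate, then shows how an AUPRC estimate can be obtained with scikit-learn's `average_precision_score`. The toy labels and scores are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import average_precision_score

# Accuracy of a model that never predicts fraud, using the class counts
# quoted in the dataset description. It catches zero frauds.
n_fraud, n_total = 492, 284_807
baseline_accuracy = (n_total - n_fraud) / n_total
print(f"All-negative baseline accuracy: {baseline_accuracy:.4f}")

# Toy labels and scores (hypothetical, for illustration only).
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# average_precision_score summarises the precision-recall curve and needs
# continuous scores (e.g. probabilities), not hard 0/1 predictions.
auprc_demo = average_precision_score(y_true_demo, y_score_demo)
print(f"Average precision on the toy data: {auprc_demo:.3f}")
```

On the notebook's data, the scores would come from something like `predict_proba(X_test)[:, 1]` of a fitted classifier rather than from `predict`.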

### Acknowledgements
The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project.

Please download the dataset creditcard.csv from the following link.
https://nottinghamedu1-my.sharepoint.com/:x:/g/personal/bixxb2_nottingham_edu_cn/EYyfDZydTxBIg-SnPd1EifkBy872mqHM0k5synbzBYDp7g?e=HdQHb6

In [1]:
import pandas as pd
df=pd.read_csv('creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [2]:
## look at the dataset shape (number of rows, number of columns)
df.shape

(284807, 31)

In [3]:
## look at the target class distribution; the dataset is highly unbalanced
df['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [4]:
## look at the data types
df.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

In [5]:
### Create Independent and Dependent Features
X=df.drop("Class",axis=1)
y=df.Class

In [6]:
# Since we are going to automatically tune our model parameters 
# we need to split our data into training and testing sets. 
# Do this here. 

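# Note: test_size=0.70 holds out 70% of the data for testing,
# leaving 30% for training (the outputs below reflect this split).
# Use a random_state of 42 and stratify on y to preserve the class ratio.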

# Use the following variable names:
# X_train : The input features for the training set
# y_train : The output feature for the training set
# X_test : The input features for the test set
# y_test : The output feature for the test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.70, random_state=42,stratify=y)
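
The effect of `test_size=0.70` on the split sizes can be checked by hand. The sketch below assumes scikit-learn rounds the test-set size up for a fractional `test_size`; the resulting numbers agree with the supports and class counts printed in the later cells.

```python
import math

n_samples = 284_807   # number of rows reported by df.shape
test_size = 0.70      # fraction passed to train_test_split

# Assumption: for a float test_size the test-set size is rounded up,
# and the training set receives the remaining rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
print(f"train share: {n_train / n_samples:.1%}")
```

To actually keep 70% of the rows for training, `test_size=0.30` would be the value to pass, at the cost of different numbers in every output that follows.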

## Model Setup

In [7]:
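# Set up the logistic regression classifier.
# Tune the penalty type ("l1", "l2") and C over [1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]
# with GridSearchCV and cross-validation; set n_jobs=-1.
# Note: the default lbfgs solver does not support the "l1" penalty, so those
# grid points fail to fit and score nan (visible in the output below).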

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
from sklearn.model_selection import KFold
import numpy as np
from sklearn.model_selection import GridSearchCV

log_class=LogisticRegression()
grid={'C':10.0 **np.arange(-2,3),'penalty':['l1','l2']}
cv=KFold(n_splits=5,random_state=None,shuffle=False)
clf=GridSearchCV(log_class,grid,cv=cv,n_jobs=-1,scoring='f1_macro')
clf.fit(X_train,y_train)

        nan 0.85879849        nan 0.85382335]
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
                         'penalty': ['l1', 'l2']},
             scoring='f1_macro')
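
The `nan` scores and the convergence warning above suggest two adjustments: a solver that supports both penalties, and feature scaling. The following is a minimal sketch on synthetic data, not the notebook's run; `X_demo`, `y_demo`, the `liblinear` solver, and the stratified folds are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic imbalanced problem (hypothetical stand-in for the real data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Scaling helps the optimiser converge; liblinear supports both l1 and l2.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", max_iter=1000),
)
param_grid = {
    "logisticregression__C": 10.0 ** np.arange(-2, 3),
    "logisticregression__penalty": ["l1", "l2"],
}

# Stratified folds keep the rare class present in every fold.
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv_demo, scoring="f1_macro")
search.fit(X_demo, y_demo)

scores = search.cv_results_["mean_test_score"]
print(search.best_params_)
print("any nan scores:", bool(np.isnan(scores).any()))
```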

In [8]:
## evaluate the logistic regression's performance in terms of the confusion matrix,
## accuracy, precision, recall, F1, etc.
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

y_pred=clf.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[198919    102]
 [    91    253]]
0.9990319263662127
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    199021
           1       0.71      0.74      0.72       344

    accuracy                           1.00    199365
   macro avg       0.86      0.87      0.86    199365
weighted avg       1.00      1.00      1.00    199365
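
The figures in the report can be reproduced by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The sketch below only uses the four numbers printed above and shows why accuracy stays near 1.0 while the fraud-class scores are much lower.

```python
# Confusion matrix printed above, laid out as [[TN, FP], [FN, TP]].
tn, fp = 198_919, 102
fn, tp = 91, 253

# Accuracy is dominated by the huge number of true negatives.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Fraud-class metrics ignore the true negatives entirely.
precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```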



In [9]:
y_train.value_counts()

0    85294
1      148
Name: Class, dtype: int64

## Incorporate Cost-sensitive Learning

In [10]:
## Use the class_weight argument of the [sklearn] RandomForestClassifier to incorporate
## cost-sensitive learning; set the weights to {0:1,1:100}
from sklearn.ensemble import RandomForestClassifier

class_weight=dict({0:1,1:100})
classifier=RandomForestClassifier(class_weight=class_weight)
classifier.fit(X_train,y_train)

RandomForestClassifier(class_weight={0: 1, 1: 100})
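
For context on the choice of `{0: 1, 1: 100}`, the sketch below works out what scikit-learn's `class_weight="balanced"` heuristic, `n_samples / (n_classes * count)`, would give for the training-set counts shown earlier. It is only a comparison point and not what the notebook ran.

```python
# Training-set class counts reported by y_train.value_counts().
counts = {0: 85_294, 1: 148}
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight = n_samples / (n_classes * count_of_class).
balanced = {label: n_samples / (n_classes * n) for label, n in counts.items()}
ratio = balanced[1] / balanced[0]

print({label: round(w, 3) for label, w in balanced.items()})
print(f"balanced fraud/legitimate weight ratio: {ratio:.0f} (the notebook uses 100)")
```

A ratio of 100 is therefore a milder correction than fully balancing the classes; which value works best is an empirical question.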

In [11]:
## evaluate the performance and compare it to the naive Logistic Regression's performance above

y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[199007     14]
 [    99    245]]
0.9994332004113059
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    199021
           1       0.95      0.71      0.81       344

    accuracy                           1.00    199365
   macro avg       0.97      0.86      0.91    199365
weighted avg       1.00      1.00      1.00    199365



## Under Sampling Method

https://miro.medium.com/max/700/1*ENvt_PTaH5v4BXZfd-3pMA.png

In [12]:
y_train.value_counts()

0    85294
1      148
Name: Class, dtype: int64

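NearMiss is a family of under-sampling techniques that choose which majority-class examples to keep based on their distance to minority-class examples. There are three versions of the technique, named NearMiss-1, NearMiss-2, and NearMiss-3.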
#### Unbalanced data 
https://machinelearningmastery.com/wp-content/uploads/2019/10/Scatter-Plot-of-Imbalanced-Classification-Dataset.png

#### NearMiss-1 
Selects examples from the majority class that have the smallest average distance to the N closest examples from the minority class. (https://machinelearningmastery.com/wp-content/uploads/2019/10/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-1.png)

#### NearMiss-2 
Selects examples from the majority class that have the smallest average distance to the N furthest examples from the minority class. (https://machinelearningmastery.com/wp-content/uploads/2019/10/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-2.png)

#### NearMiss-3 
Selects, for each example in the minority class, a given number of the closest majority class examples. (https://machinelearningmastery.com/wp-content/uploads/2019/10/Scatter-Plot-of-Imbalanced-Dataset-Undersampled-with-NearMiss-3.png)
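
The NearMiss-1 rule can be sketched in a few lines of NumPy. This is an illustration of the selection rule as described above, not imblearn's implementation; the function name `nearmiss1_sketch` and the toy points are hypothetical.

```python
import numpy as np

def nearmiss1_sketch(X_major, X_minor, n_keep, n_neighbors=3):
    """Return indices of the majority points kept by a NearMiss-1 style rule."""
    # Pairwise Euclidean distances, shape (n_major, n_minor).
    dists = np.linalg.norm(X_major[:, None, :] - X_minor[None, :, :], axis=2)
    # Average distance from each majority point to its closest minority points.
    closest = np.sort(dists, axis=1)[:, :n_neighbors]
    avg_dist = closest.mean(axis=1)
    # Keep the majority points with the smallest average distance.
    return np.sort(np.argsort(avg_dist)[:n_keep])

# Toy data: three minority points near the origin, four majority points.
X_minor = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
X_major = np.array([[0.5, 0.5], [5.0, 5.0], [1.0, 1.0], [10.0, 10.0]])

kept = nearmiss1_sketch(X_major, X_minor, n_keep=2)
print(kept)  # [0 2]: the two majority points closest to the minority cluster
```

Because the rule keeps the majority examples that sit closest to the minority class, the retained sample is concentrated near the class boundary rather than being representative of the majority class as a whole.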



In [13]:
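## Under-sample the dataset with the [imblearn] NearMiss method.
## The default version is 1 and the default n_neighbors is 3.
## sampling_strategy=0.8 is the target minority/majority ratio after resampling;
## newer imblearn versions expect it to be passed as a keyword argument.
from imblearn.under_sampling import NearMiss
ns=NearMiss(sampling_strategy=0.8)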
X_train_ns,y_train_ns=ns.fit_resample(X_train,y_train)



In [14]:
## print the class 0,1 distribution before and after the under sampling
print("The number of classes before fit {}".format(y_train.value_counts().to_dict()))
print("The number of classes after fit {}".format(y_train_ns.value_counts().to_dict()))

The number of classes before fit {0: 85294, 1: 148}
The number of classes after fit {0: 185, 1: 148}
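
The count of 185 follows from the ratio passed to NearMiss: for under-sampling, the value is the desired minority/majority ratio after resampling, so the majority class is cut down to the minority count divided by that ratio. A small arithmetic check, using only the numbers printed above:

```python
n_minority = 148           # fraud examples in the training set
n_majority_before = 85_294 # legitimate examples before under-sampling
sampling_strategy = 0.8    # target minority/majority ratio

# Majority examples that remain so that 148 / n_majority_after is about 0.8.
n_majority_after = round(n_minority / sampling_strategy)
discarded = n_majority_before - n_majority_after

print(n_majority_after)
print(f"majority examples discarded: {discarded} "
      f"({discarded / n_majority_before:.2%})")
```

Almost all of the majority class is thrown away, which is worth keeping in mind when reading the results of the model trained on this sample.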


In [15]:
## train a [sklearn] RandomForestClassifier and fit it with the under-sampled data

classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [16]:
## print the performance
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[168558  30463]
 [    27    317]]
0.8470644295638653
              precision    recall  f1-score   support

           0       1.00      0.85      0.92    199021
           1       0.01      0.92      0.02       344

    accuracy                           0.85    199365
   macro avg       0.51      0.88      0.47    199365
weighted avg       1.00      0.85      0.92    199365



## Over Sampling Method

In [17]:
## Use the [imblearn] RandomOverSampler to over-sample the dataset randomly.
## Set the ratio of the number of samples in the minority class over
## the number of samples in the majority class after resampling to 0.75.
from imblearn.over_sampling import RandomOverSampler

os=RandomOverSampler(sampling_strategy=0.75)
X_train_ns,y_train_ns=os.fit_resample(X_train,y_train)
print("The number of classes before fit {}".format(y_train.value_counts().to_dict()))
print("The number of classes after fit {}".format(y_train_ns.value_counts().to_dict()))

The number of classes before fit {0: 85294, 1: 148}
The number of classes after fit {0: 85294, 1: 63970}
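
Random over-sampling simply repeats existing minority rows until the requested ratio is reached. The sketch below illustrates the idea with NumPy on toy data; `random_oversample_sketch` is a hypothetical helper, not imblearn's code, and it assumes binary labels with the requested ratio above the current one.

```python
import numpy as np

def random_oversample_sketch(X, y, ratio, seed=0):
    """Duplicate minority rows (with replacement) up to ratio * majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_target = int(ratio * counts.max())
    minority_idx = np.flatnonzero(y == minority)
    # Number of extra copies needed; never negative.
    n_extra = max(n_target - minority_idx.size, 0)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(y.size), extra_idx])
    return X[idx], y[idx]

# Toy data: 16 majority rows and 4 minority rows.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_res, y_res = random_oversample_sketch(X, y, ratio=0.75)
print({int(label): int(n) for label, n in zip(*np.unique(y_res, return_counts=True))})
```

With the notebook's counts the same rule gives `int(0.75 * 85294) = 63970` minority rows, all of them copies of the original 148 frauds.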




In [18]:
## train a [sklearn] RandomForestClassifier and fit it with the over-sampled data
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

RandomForestClassifier()

In [19]:
## print the performance
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[198995     26]
 [    97    247]]
0.9993830411556692
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    199021
           1       0.90      0.72      0.80       344

    accuracy                           1.00    199365
   macro avg       0.95      0.86      0.90    199365
weighted avg       1.00      1.00      1.00    199365



## SMOTETomek Method
SMOTE stands for Synthetic Minority Over-sampling Technique. SMOTETomek combines SMOTE over-sampling with the removal of Tomek links, which are pairs of examples from opposite classes that are each other's nearest neighbours.

For SMOTE, you select a minority-class observation and use a distance measure to synthetically generate a new instance from it and its neighbours. Working through the features, SMOTE takes the difference between the observation and one of its nearest neighbours, multiplies that difference by a random number between zero and one, and adds the result to the observation's feature value. When the same random number is used for every feature, the new point lies on the line segment between the two observations. This way, SMOTE does not copy observations but creates new, synthetic ones.
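
The interpolation step can be written out in a few lines. This is a sketch of the formula `x_new = x + gap * (neighbour - x)` under the assumption that one random `gap` is shared by all features of a synthetic point; `smote_point_sketch` and the two toy points are hypothetical.

```python
import numpy as np

def smote_point_sketch(x, neighbour, rng):
    """Create one synthetic point between x and one of its neighbours."""
    gap = rng.random()                 # uniform random number in [0, 1)
    return x + gap * (neighbour - x)   # move part of the way towards the neighbour

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])           # a minority observation
neighbour = np.array([3.0, 6.0])   # one of its nearest minority neighbours

synthetic = smote_point_sketch(x, neighbour, rng)
print(synthetic)
```

The synthetic point is new (it is not a copy of either input) but always stays between the two observations it was built from.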

https://miro.medium.com/max/700/1*FcM03wUtW_dB2YGZXyVb7Q.png

In [20]:
## Use the [imblearn] SMOTETomek to generate artificial minority data points.
## Set the ratio of the number of samples in the minority class over
## the number of samples in the majority class after resampling to 0.75.
from imblearn.combine import SMOTETomek
os=SMOTETomek(sampling_strategy=0.75)



In [21]:
X_train_os,y_train_os=os.fit_resample(X_train,y_train)
print("The number of classes before fit {}".format(y_train.value_counts().to_dict()))
print("The number of classes after fit {}".format(y_train_os.value_counts().to_dict()))

The number of classes before fit {0: 85294, 1: 148}
The number of classes after fit {0: 84692, 1: 63368}
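
The counts after SMOTETomek can be read as two steps. Assuming SMOTE first raises the minority class to `int(0.75 * 85294)` examples (the same target the random over-sampler reached), the remaining gap to the printed counts is what the Tomek-link cleaning removed. This is an interpretation of the printed numbers, not output from the library.

```python
n_majority = 85_294
sampling_strategy = 0.75
after = {0: 84_692, 1: 63_368}   # counts printed above

# Assumed minority count right after the SMOTE step.
n_minority_after_smote = int(sampling_strategy * n_majority)

# Examples missing from each class in the final result.
removed_majority = n_majority - after[0]
removed_minority = n_minority_after_smote - after[1]

print(n_minority_after_smote, removed_majority, removed_minority)
```

Both classes lose the same number of examples, which is consistent with both members of each Tomek link being dropped.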


In [22]:
## train a [sklearn] RandomForestClassifier and fit it with the SMOTETomek-resampled data

classifier=RandomForestClassifier()
classifier.fit(X_train_os,y_train_os)

RandomForestClassifier()

In [23]:
## print the performance
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[198982     39]
 [    76    268]]
0.9994231685601785
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    199021
           1       0.87      0.78      0.82       344

    accuracy                           1.00    199365
   macro avg       0.94      0.89      0.91    199365
weighted avg       1.00      1.00      1.00    199365



## Ensemble Techniques

EasyEnsemble is a specific method that uses AdaBoost classifiers as the base learners in a bagging classifier. The EasyEnsembleClassifier bags AdaBoost learners that are trained on balanced bootstrap samples; the balancing is achieved by random under-sampling.
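
The idea can be sketched with scikit-learn alone: draw several balanced subsets by randomly under-sampling the majority class, fit one AdaBoost model per subset, and average their predicted probabilities. This is a simplified illustration on synthetic data, not imblearn's implementation; `easy_ensemble_sketch` and `predict_sketch` are hypothetical helpers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_sketch(X, y, n_estimators=5, seed=0):
    """Fit one AdaBoost model per balanced, randomly under-sampled subset."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    models = []
    for i in range(n_estimators):
        # Draw as many majority rows as there are minority rows.
        sampled = rng.choice(majority_idx, size=minority_idx.size, replace=False)
        idx = np.concatenate([minority_idx, sampled])
        model = AdaBoostClassifier(n_estimators=10, random_state=i)
        models.append(model.fit(X[idx], y[idx]))
    return models

def predict_sketch(models, X):
    """Average the fraud probabilities of all models and threshold at 0.5."""
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)

# Synthetic imbalanced data (hypothetical stand-in for the real training set).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
models = easy_ensemble_sketch(X_demo, y_demo)
preds = predict_sketch(models, X_demo)
print(len(models), int(preds.sum()))
```

Each base model sees every minority example but only a small slice of the majority class, which tends to favour recall over precision.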

In [24]:
## Using [imblearn] EasyEnsembleClassifier to do classification
from imblearn.ensemble import EasyEnsembleClassifier

easy=EasyEnsembleClassifier()
easy.fit(X_train,y_train)

EasyEnsembleClassifier()

In [25]:
## print the performance
y_pred=easy.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[191665   7356]
 [    33    311]]
0.962937326010082
              precision    recall  f1-score   support

           0       1.00      0.96      0.98    199021
           1       0.04      0.90      0.08       344

    accuracy                           0.96    199365
   macro avg       0.52      0.93      0.53    199365
weighted avg       1.00      0.96      0.98    199365
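
To compare the approaches side by side, the fraud-class metrics can be recomputed from the confusion matrices printed in this notebook. The sketch below only reuses those printed numbers; the method labels are shorthand chosen here.

```python
# Confusion matrices copied from the outputs above, as (TN, FP, FN, TP).
results = {
    "Logistic regression (grid search)": (198_919, 102, 91, 253),
    "Random forest, class_weight 1:100": (199_007, 14, 99, 245),
    "Random forest + NearMiss": (168_558, 30_463, 27, 317),
    "Random forest + RandomOverSampler": (198_995, 26, 97, 247),
    "Random forest + SMOTETomek": (198_982, 39, 76, 268),
    "EasyEnsembleClassifier": (191_665, 7_356, 33, 311),
}

def fraud_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the fraud (positive) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

summary = {name: fraud_metrics(*cm) for name, cm in results.items()}

# Print the methods ordered by fraud-class F1, best first.
for name, (p, r, f1) in sorted(summary.items(), key=lambda kv: kv[1][2], reverse=True):
    print(f"{name:<36} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On these runs, the under-sampling based methods (NearMiss and EasyEnsemble) reach the highest recall but flag thousands of legitimate transactions, while SMOTETomek gives the best fraud-class F1.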

