# **Project Name - Credit Card Fraud Detection**

**Contribution - Individual**

**GitHub Link - **

# **Problem Statement**

Imagine yourself as a freelance data scientist ready for the next project adventure. Your task is to select a machine learning project from the list provided or propose an original project idea that resonates with you. Your objective is to identify a specific challenge within the chosen industry domain and design a machine-learning solution to address it. Whether you're predicting customer behavior, optimizing processes, or making healthcare more efficient, your project should demonstrate your ability to approach complex problems, preprocess and analyze relevant data, develop and fine-tune models, and interpret results in a meaningful way. Your project will be a testament to your adaptability, curiosity, and aptitude for machine learning.
Execute an end-to-end data science project by following the steps below:

Step 1: Define the Problem Statement

Step 2: Data Collection

Step 3: Data Preprocessing

Step 4: Exploratory Data Analysis (EDA)

Step 5: Model Selection, Training & Evaluation




***Let's Begin***

The project follows the five steps above. Along the way, I'll compare several algorithms on the data to find out which suits it best.

In [1]:
# import numpy and pandas
import numpy as np
import pandas as pd

# to plot within notebook
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import homogeneity_score
from sklearn.metrics import silhouette_score
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import MiniBatchKMeans
# from sklearn.cluster import DBSCAN
from imblearn.over_sampling import SMOTE
# from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import plotly.figure_factory as ff
import plotly.graph_objects as go

import sklearn

*Let's import the dataset*

In [2]:
# load the dataset
df = pd.read_csv("creditcard.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,258647,1.725265,-1.337256,-1.012687,-0.361656,-1.431611,-1.098681,-0.842274,-0.026594,-0.032409,...,0.414524,0.793434,0.028887,0.419421,-0.367529,-0.155634,-0.015768,0.01079,189.0,0
1,69263,0.683254,-1.681875,0.533349,-0.326064,-1.455603,0.101832,-0.52059,0.114036,-0.60176,...,0.116898,-0.304605,-0.125547,0.244848,0.069163,-0.460712,-0.017068,0.063542,315.17,0
2,96552,1.067973,-0.656667,1.029738,0.253899,-1.172715,0.073232,-0.745771,0.249803,1.383057,...,-0.189315,-0.426743,0.079539,0.129692,0.002778,0.970498,-0.035056,0.017313,59.98,0
3,281898,0.119513,0.729275,-1.678879,-1.551408,3.128914,3.210632,0.356276,0.920374,-0.160589,...,-0.335825,-0.906171,0.10835,0.593062,-0.424303,0.164201,0.245881,0.071029,0.89,0
4,86917,1.271253,0.275694,0.159568,1.003096,-0.128535,-0.60873,0.088777,-0.145336,0.156047,...,0.031958,0.123503,-0.174528,-0.147535,0.735909,-0.26227,0.015577,0.015955,6.53,0


In [3]:
df['Class'].value_counts()

Class
0    5000
1      50
Name: count, dtype: int64

### The Metric Trap:
One of the major issues when dealing with unbalanced datasets relates to the metrics used to evaluate the model. Simple metrics such as accuracy can be misleading. Here only about 1% of the transactions are fraudulent, so a classifier that always predicts the majority class, without looking at any feature, would still score roughly 99% accuracy while catching no fraud at all. Precision, recall, and the confusion matrix are therefore more informative than accuracy alone.
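To make the trap concrete, here is a minimal sketch that scores a do-nothing "classifier" on the class counts reported above (5000 legitimate, 50 fraud). The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from df['Class'].value_counts() above
n_legit, n_fraud = 5000, 50

# A "model" that labels every transaction as legitimate (class 0)
y_true = [0] * n_legit + [1] * n_fraud
y_pred = [0] * (n_legit + n_fraud)

# Accuracy: share of all predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: caught frauds / all frauds
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraud

print(f"accuracy     = {accuracy:.4f}")  # 0.9901
print(f"fraud recall = {recall:.2f}")    # 0.00
```

The accuracy looks excellent even though not a single fraud case is caught, which is why the rest of the analysis reports recall and precision as well.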

In [4]:
# explore the features available in the dataframe
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5050 entries, 0 to 5049
Data columns (total 31 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  5050 non-null   int64  
 1   V1          5050 non-null   float64
 2   V2          5050 non-null   float64
 3   V3          5050 non-null   float64
 4   V4          5050 non-null   float64
 5   V5          5050 non-null   float64
 6   V6          5050 non-null   float64
 7   V7          5050 non-null   float64
 8   V8          5050 non-null   float64
 9   V9          5050 non-null   float64
 10  V10         5050 non-null   float64
 11  V11         5050 non-null   float64
 12  V12         5050 non-null   float64
 13  V13         5050 non-null   float64
 14  V14         5050 non-null   float64
 15  V15         5050 non-null   float64
 16  V16         5050 non-null   float64
 17  V17         5050 non-null   float64
 18  V18         5050 non-null   float64
 19  V19         5050 non-null  

In [5]:
# summary statistics
df.describe()

Unnamed: 0.1,Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,...,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0,5050.0
mean,142870.01703,-0.014675,0.044325,-0.035144,0.002494,-0.046625,-0.04634,-0.04302,-0.008398,-0.027331,...,-0.003516,-0.009421,-0.004147,-0.0012,-0.003314,-0.004836,-0.005726,0.002482,86.117232,0.009901
std,82574.683531,1.952784,1.558235,1.691458,1.493592,1.32132,1.254301,1.33817,1.323247,1.134506,...,0.756735,0.724749,0.601276,0.5994,0.517363,0.481913,0.411055,0.302719,227.210259,0.09902
min,5.0,-25.266355,-18.701995,-26.823673,-4.575708,-18.664251,-6.357009,-23.78347,-41.484823,-8.504285,...,-20.262054,-5.532541,-17.026156,-2.307453,-3.308049,-1.71564,-7.9761,-5.048979,0.0,0.0
25%,71817.75,-0.926226,-0.589562,-0.873696,-0.871759,-0.733235,-0.777552,-0.571678,-0.211263,-0.651215,...,-0.231508,-0.558904,-0.161166,-0.354973,-0.316947,-0.331584,-0.070963,-0.052133,4.99,0.0
50%,142544.0,0.009592,0.088726,0.168377,-0.027034,-0.060932,-0.304225,0.036753,0.000985,-0.052724,...,-0.035204,-0.013332,-0.011305,0.038272,0.0192,-0.059882,0.003521,0.012842,20.26,0.0
75%,215019.0,1.310062,0.809298,1.017166,0.763626,0.603678,0.356664,0.594029,0.313264,0.568374,...,0.196481,0.509243,0.146835,0.441278,0.348177,0.228486,0.095662,0.077357,75.0,0.0
max,284782.0,2.422508,14.323254,3.760965,11.885313,9.880564,7.47397,9.288494,16.633103,8.054123,...,19.283602,5.805795,13.218751,3.535179,3.590787,2.961609,4.623508,9.876371,4584.88,1.0


In [6]:
# check for missing values
df.isnull().sum()

Unnamed: 0    0
V1            0
V2            0
V3            0
V4            0
V5            0
V6            0
V7            0
V8            0
V9            0
V10           0
V11           0
V12           0
V13           0
V14           0
V15           0
V16           0
V17           0
V18           0
V19           0
V20           0
V21           0
V22           0
V23           0
V24           0
V25           0
V26           0
V27           0
V28           0
Amount        0
Class         0
dtype: int64

In [7]:
df["Class"].value_counts()

Class
0    5000
1      50
Name: count, dtype: int64

In [8]:
# ratio of fraud and no fraud cases
df["Class"].value_counts(normalize=True)

Class
0    0.990099
1    0.009901
Name: proportion, dtype: float64

In [9]:
# get the mean for each group
df.groupby("Class").mean()

Class,Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,143084.8702,0.03503,0.011553,0.037444,-0.04576,-0.013825,-0.030885,0.014315,-0.022432,-0.002227,...,-0.002896,-0.010583,-0.010206,-0.003305,-0.000918,-0.002613,-0.004651,-0.009584,0.002414,85.843714
1,121384.7,-4.985211,3.321539,-7.293909,4.827952,-3.326587,-1.591882,-5.776541,1.395058,-2.537728,...,0.19458,0.703182,0.069065,-0.088374,-0.029425,-0.073336,-0.023377,0.380072,0.009304,113.469


In [10]:
# implement a rule for stating which cases are flagged as fraud
df["flag_as_fraud"] = np.where(np.logical_and(df["V1"] < -3, df["V3"] < -5), 1, 0)
df["flag_as_fraud"].head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: flag_as_fraud, dtype: int64

In [11]:
# create a crosstab of flagged fraud cases versus the actual fraud cases
print(pd.crosstab(df.Class, df.flag_as_fraud, rownames=["Actual Fraud"], colnames=["Flagged Fraud"]))

Flagged Fraud     0   1
Actual Fraud           
0              4984  16
1                28  22


With this rule, we detect 22 of the 50 fraud cases, miss the other 28, and raise 16 false positives. Next, we'll see how this measures up against a machine learning model.
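As a reference point for the models that follow, the crosstab counts can be turned into precision and recall. This is a small sketch using only the four numbers printed above; the variable names are illustrative.

```python
# Counts read off the crosstab above (rows: actual, columns: flagged)
tn, fp = 4984, 16
fn, tp = 28, 22

precision = tp / (tp + fp)   # share of flagged cases that are real fraud
recall = tp / (tp + fn)      # share of real fraud that the rule flags

print(f"precision = {precision:.3f}")  # 0.579
print(f"recall    = {recall:.3f}")     # 0.440
```

So the hand-written rule is right a little more than half of the time when it raises a flag, and it finds fewer than half of the fraud cases.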

## **Supervised Machine Learning**
### **Machine learning model to catch fraud**
When we have labelled data, we can use supervised machine learning techniques to flag fraudulent transactions. We can use classifiers, adjust them and compare them to find the most efficient fraud detection model.

In [12]:
# create input and target variable
X = df.drop(["Unnamed: 0", "Class", "flag_as_fraud"], axis=1)
y = df["Class"]

In [13]:
# create the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [14]:
print("Distribution of classes of dependent variable in train :")
print(y_train.value_counts())

print("\n Distribution of classes of dependent variable in test :")
print(y_test.value_counts())

Distribution of classes of dependent variable in train :
Class
0    3495
1      40
Name: count, dtype: int64

 Distribution of classes of dependent variable in test :
Class
0    1505
1      10
Name: count, dtype: int64


In [15]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
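
The warning above suggests raising `max_iter` or scaling the data. A minimal sketch of both, assuming a synthetic dataset from `make_classification` as a stand-in for the credit card data; the names `X_demo`, `Xd_train`, and `scaled_lr` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit card data (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
# Inflate one column so the features sit on very different scales, like Amount
X_demo[:, 0] *= 1000

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The scaler is fit on the training data only, inside the pipeline
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xd_train, yd_train)

# Fraud probabilities for the first five test rows
print(scaled_lr.predict_proba(Xd_test)[:5, 1])
```

Putting the scaler inside the pipeline means the test data never influences the scaling parameters.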


In [16]:
# obtain model predictions
predicted = model.predict(X_test)

In [17]:
# predict probabilities
probs = model.predict_proba(X_test)

In [18]:
# print the accuracy score
print("Accuracy Score: {}".format(accuracy_score(y_test, predicted)))

Accuracy Score: 0.998019801980198


In [19]:
# print the ROC score
print("ROC score: {}\n".format(roc_auc_score(y_test, probs[:,1])))

# print the classification report
print("Classification report:\n{}\n".format(classification_report(y_test, predicted)))

# print confusion matrix
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print("Confusion matrix:\n{}\n".format(conf_mat))

ROC score: 0.9982724252491695

Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1505
           1       0.89      0.80      0.84        10

    accuracy                           1.00      1515
   macro avg       0.94      0.90      0.92      1515
weighted avg       1.00      1.00      1.00      1515


Confusion matrix:
[[1504    1]
 [   2    8]]



In [20]:
# Creating the Plotly heatmap
fig = ff.create_annotated_heatmap(
    z=conf_mat,
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    annotation_text=conf_mat,
    colorscale='Blues',
    showscale=False
)

# Adding titles and labels
fig.update_layout(
    title='Confusion Matrix of the Classifier',
    xaxis=dict(title='Predicted Label'),
    yaxis=dict(title='Actual Label')
)

# Display the plot
fig.show()

As the confusion matrix above shows, we caught 8 of the 10 fraud cases, with only 1 false positive and 2 false negatives. Not bad for our first machine learning model.

## **Data Resampling**
Synthetic Minority Oversampling Technique (SMOTE):

This technique generates synthetic data for the minority class. It works by randomly picking a point from the minority class and computing the k-nearest neighbors of this point within that class. The synthetic points are added between the chosen point and its neighbors.

**SMOTE algorithm works in 4 simple steps:**

*   Choose a minority-class sample as the input vector
*   Find its k nearest neighbors (k_neighbors is specified as an argument in
    the SMOTE() function)
*   Choose one of these neighbors and place a synthetic point anywhere on the
    line joining the point under consideration and its chosen neighbor
*   Repeat the steps until the data is balanced
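
The four steps above can be sketched in plain NumPy. This is an illustration of the interpolation idea only, not imblearn's implementation; the function `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))           # step 1: pick a minority sample
        j = rng.choice(neighbours[i])          # steps 2-3: pick one of its k neighbours
        gap = rng.random()                     # position along the joining line
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic                           # step 4: n_new controls the final balance

# Toy minority class: 20 points in 3 dimensions (made-up data)
X_minority = np.random.default_rng(1).normal(size=(20, 3))
new_points = smote_sketch(X_minority, n_new=30)
print(new_points.shape)  # (30, 3)
```

Because every synthetic row is a convex combination of two real minority rows, it always lies on the segment between them.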

One thing to note: apply resampling methods to the training set only, never to the test set. Always make sure the test set is free of duplicate or synthetic data.

In [21]:
# import SMOTE
from imblearn.over_sampling import SMOTE

# Creating an instance of SMOTE
method = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=5)

# Applying SMOTE to the dataset
X_resampled, y_resampled = method.fit_resample(X, y)

In [22]:
# check before and after resample
print("Before resampling:\n{}\n".format(y_train.value_counts()))
print("After resampling:\n{}\n".format(pd.Series(y_resampled).value_counts()))

Before resampling:
Class
0    3495
1      40
Name: count, dtype: int64

After resampling:
Class
0    5000
1    5000
Name: count, dtype: int64



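The result above shows how the balance between the two classes has changed with SMOTE. Unlike random over-sampling, SMOTE does not create exact copies of observations; it creates new, synthetic samples that are quite similar to the existing observations in the minority class. Note that the "before" counts come from `y_train` (3,535 rows), while the "after" counts come from resampling the full `X` and `y` (5,050 rows), so the resampled data also contains the test rows and synthetic points derived from them. Fitting on `X_resampled` and scoring on `X_test`, as the next cell does, therefore gives optimistic results; resampling `X_train` and `y_train` instead would keep the test set independent.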

In [23]:
# fit the model
model = LogisticRegression(solver="liblinear")
model.fit(X_resampled, y_resampled)

# make predictions
predicted = model.predict(X_test)
probs = model.predict_proba(X_test)

# print the accuracy score
print("Accuracy Score: {}\n".format(accuracy_score(y_test, predicted)))

# print the ROC score
print("ROC score: {}\n".format(roc_auc_score(y_test, probs[:,1])))

# print the classification report
print("Classification report:\n{}\n".format(classification_report(y_test, predicted)))

# print confusion matrix
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print("Confusion matrix:\n{}\n".format(conf_mat))

Accuracy Score: 0.9986798679867986

ROC score: 0.9995348837209302

Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1505
           1       0.83      1.00      0.91        10

    accuracy                           1.00      1515
   macro avg       0.92      1.00      0.95      1515
weighted avg       1.00      1.00      1.00      1515


Confusion matrix:
[[1503    2]
 [   0   10]]



As we can see, SMOTE changes the results: recall on the fraud class rises from 0.80 to 1.00, while precision dips from 0.89 to 0.83.

With that, let's explore other techniques such as Random Forest and Decision Trees to find out which performs best.


## **Random Forest**:

*   In Random Forest, we grow multiple trees, as opposed to the single tree
    of a CART model.
*   We construct the trees from subsets of the original dataset. These subsets
    can contain a fraction of the columns as well as of the rows.
*   To classify a new object based on its attributes, each tree gives a
    classification and we say that the tree “votes” for that class.
*   The forest chooses the classification having the most votes (over all the
    trees in the forest). In the case of regression, it takes the average of
    the outputs of the different trees.
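
As a small illustration of the voting step: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities, a soft form of voting. The sketch below checks this on made-up data from `make_classification`; the names `X_demo`, `forest`, and `per_tree` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, used only to illustrate the mechanism (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging the per-tree class probabilities reproduces the forest's output
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_demo)))
```

The predicted class is then the one with the highest averaged probability.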


In [24]:
# define the model as the random forest
model = RandomForestClassifier(class_weight="balanced_subsample", random_state=0)

# fit the model to our training set
model.fit(X_train, y_train)

# obtain predictions from the test data
predicted = model.predict(X_test)

# predict probabilities
probs = model.predict_proba(X_test)

# print the accuracy score, ROC score, classification report and confusion matrix
print("Accuracy Score: {}\n".format(accuracy_score(y_test, predicted)))
print("ROC score = {}\n".format(roc_auc_score(y_test, probs[:,1])))
print("Classification Report:\n{}\n".format(classification_report(y_test, predicted)))
print("Confusion Matrix:\n{}\n".format(confusion_matrix(y_test, predicted)))

Accuracy Score: 0.9986798679867986

ROC score = 0.999468438538206

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1505
           1       0.90      0.90      0.90        10

    accuracy                           1.00      1515
   macro avg       0.95      0.95      0.95      1515
weighted avg       1.00      1.00      1.00      1515


Confusion Matrix:
[[1504    1]
 [   1    9]]



In [25]:
conf_mat = confusion_matrix(y_test, predicted)

# Creating the Plotly heatmap
fig = ff.create_annotated_heatmap(
    z=conf_mat,
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    annotation_text=conf_mat,
    colorscale='Blues',
    showscale=False
)

# Adding titles and labels
fig.update_layout(
    title='Confusion Matrix of the Classifier',
    xaxis=dict(title='Predicted Label'),
    yaxis=dict(title='Actual Label')
)

# Display the plot
fig.show()

In [26]:
# define the parameter sets to test
param_grid = {"n_estimators": [10, 50],
              # "auto" is the deprecated alias of "sqrt" for classifiers (see the warning below)
              "max_features": ["auto", "log2"],
              # "min_samples_leaf": [1, 10],
              "max_depth": [4, 8],
              "criterion": ["gini", "entropy"],
              "class_weight": [None, {0:1, 1:12}]
}

# define the model to use
model = RandomForestClassifier(random_state=0)

# combine the parameter sets with the defined model
CV_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring="recall", n_jobs=-1)

# fit the model to our training data and obtain best parameters
CV_model.fit(X_train, y_train)


`max_features='auto'` has been deprecated in 1.1 and will be removed in 1.3. To keep the past behaviour, explicitly set `max_features='sqrt'` or remove this parameter as it is also the default value for RandomForestClassifiers and ExtraTreesClassifiers.



In [27]:
# show best parameters
CV_model.best_params_

{'class_weight': None,
 'criterion': 'gini',
 'max_depth': 4,
 'max_features': 'auto',
 'n_estimators': 10}

In [28]:
# obtain predictions from the test data
predicted = CV_model.predict(X_test)

# predict probabilities
probs = CV_model.predict_proba(X_test)

# print the accuracy score, ROC score, classification report and confusion matrix
print("Accuracy Score: {}\n".format(accuracy_score(y_test, predicted)))
print("ROC score = {}\n".format(roc_auc_score(y_test, probs[:,1])))
print("Classification Report:\n{}\n".format(classification_report(y_test, predicted)))
print("Confusion Matrix:\n{}\n".format(confusion_matrix(y_test, predicted)))

Accuracy Score: 0.9993399339933994

ROC score = 0.999734219269103

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1505
           1       0.91      1.00      0.95        10

    accuracy                           1.00      1515
   macro avg       0.95      1.00      0.98      1515
weighted avg       1.00      1.00      1.00      1515


Confusion Matrix:
[[1504    1]
 [   0   10]]



In [29]:
conf_mat = confusion_matrix(y_test, predicted)

# Create annotations for the heatmap
annotations = []
for i in range(len(conf_mat)):
    for j in range(len(conf_mat)):
        annotations.append(
            dict(
                text=str(conf_mat[i][j]),
                x=j,
                y=i,
                xref='x1',
                yref='y1',
                showarrow=False,
                font=dict(size=16, color='black')
            )
        )

# Create the Plotly heatmap
fig = go.Figure(data=go.Heatmap(
    z=conf_mat,
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    colorscale='Blues',
    showscale=False,
    text=conf_mat,
    hovertemplate='%{text}',
    hoverinfo='skip'
))

# Adding titles and labels
fig.update_layout(
    title='Confusion Matrix of the Classifier',
    xaxis=dict(title='Predicted Label'),
    yaxis=dict(title='Actual Label'),
    annotations=annotations
)

# Display the plot
fig.show()

## **Logistic Regression**

In [30]:
# define the Logistic Regression model with weights
lr_model = LogisticRegression(class_weight={0:1, 1:15}, random_state=5, solver="liblinear")

# fit the model to our training data
lr_model.fit(X_train, y_train)

# obtain predictions from the test data
predicted = lr_model.predict(X_test)

# predict probabilities
probs = lr_model.predict_proba(X_test)

# print the accuracy score, ROC score, classification report and confusion matrix
print("Accuracy Score: {}\n".format(accuracy_score(y_test, predicted)))
print("ROC score = {}\n".format(roc_auc_score(y_test, probs[:,1])))
print("Classification Report:\n{}\n".format(classification_report(y_test, predicted)))
print("Confusion Matrix:\n{}\n".format(confusion_matrix(y_test, predicted)))

Accuracy Score: 0.9973597359735974

ROC score = 0.9992691029900332

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1505
           1       0.75      0.90      0.82        10

    accuracy                           1.00      1515
   macro avg       0.87      0.95      0.91      1515
weighted avg       1.00      1.00      1.00      1515


Confusion Matrix:
[[1502    3]
 [   1    9]]



In [31]:
conf_mat = confusion_matrix(y_test, predicted)

# Creating the Plotly heatmap
fig = ff.create_annotated_heatmap(
    z=conf_mat,
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    annotation_text=conf_mat,
    colorscale='Blues',
    showscale=False,
    font_colors=['black']
)

# Adding titles and labels
fig.update_layout(
    title=dict(text='Confusion Matrix of the Classifier', x=0.5),
    xaxis=dict(title='Predicted Label'),
    yaxis=dict(title='Actual Label')
)

# Display the plot
fig.show()

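Logistic Regression performs quite differently from the Random Forest: it raises more false positives (3 versus 1) and, with one missed fraud case, its recall (0.90) is below the tuned Random Forest's 1.00. Because its errors may differ from those of the tree-based models, it could still be a useful addition to the Random Forest in an ensemble model.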

Now let's see what Decision Tree has to offer.

## **Decision Tree**

In [32]:
# define the Decision Tree model with balanced weight
tree_model = DecisionTreeClassifier(random_state=0, class_weight="balanced")

# fit the model to our training data
tree_model.fit(X_train, y_train)

# obtain predictions from the test data
predicted = tree_model.predict(X_test)

# predict probabilities
probs = tree_model.predict_proba(X_test)

# print the accuracy score, ROC score, classification report and confusion matrix
print("Accuracy Score: {}\n".format(accuracy_score(y_test, predicted)))
print("ROC score = {}\n".format(roc_auc_score(y_test, probs[:,1])))
print("Classification Report:\n{}\n".format(classification_report(y_test, predicted)))
print("Confusion Matrix:\n{}\n".format(confusion_matrix(y_test, predicted)))

Accuracy Score: 0.9966996699669967

ROC score = 0.9486710963455148

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1505
           1       0.69      0.90      0.78        10

    accuracy                           1.00      1515
   macro avg       0.85      0.95      0.89      1515
weighted avg       1.00      1.00      1.00      1515


Confusion Matrix:
[[1501    4]
 [   1    9]]



In [33]:
conf_mat = confusion_matrix(y_test, predicted)

# Create the annotated heatmap
fig = ff.create_annotated_heatmap(
    z=conf_mat,
    x=['Predicted Negative', 'Predicted Positive'],
    y=['Actual Negative', 'Actual Positive'],
    annotation_text=conf_mat,
    colorscale='Blues',
    showscale=False
)

# Adding titles and labels
fig.update_layout(
    title=dict(text='Confusion Matrix of the Classifier', x=0.5),
    xaxis=dict(title='Predicted Label'),
    yaxis=dict(title='Actual Label'),
    font=dict(size=12)
)

# Display the plot
fig.show()

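Compared to the tuned Random Forest, the Decision Tree also has more false positives (4 versus 1) and a lower recall (0.90 versus 1.00), and its ROC score (0.949) is clearly below that of the other models. It might still add diversity to an ensemble model alongside the Random Forest.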
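
The ensemble idea mentioned above could look like the following sketch, which combines the three model types with scikit-learn's `VotingClassifier`. It runs on synthetic data from `make_classification`, not on the credit card data, so its output is not a result of this project; the names `X_demo`, `Xd_train`, and `ensemble` are hypothetical, and the class weights are copied from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit card data (assumption)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.97, 0.03], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(class_weight="balanced_subsample", random_state=0)),
        ("lr", LogisticRegression(class_weight={0: 1, 1: 15}, solver="liblinear")),
        ("dt", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
    ],
    voting="soft",  # average the predicted probabilities of the three models
)
ensemble.fit(Xd_train, yd_train)

ensemble_cm = confusion_matrix(yd_test, ensemble.predict(Xd_test))
print(ensemble_cm)
```

Soft voting is used here because all three models expose `predict_proba`; `voting="hard"` would count class votes instead.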

In [34]:
# Compute ROC curve and ROC area for each model
fpr, tpr, _ = roc_curve(y_test, CV_model.predict_proba(X_test)[:,1])
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_model.predict_proba(X_test)[:,1])
tree_fpr, tree_tpr, _ = roc_curve(y_test, tree_model.predict_proba(X_test)[:,1])

# Create a Plotly figure
fig = go.Figure()

# Add Random Forest ROC curve
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name='Random Forest (AUC = {:1.4f})'.format(roc_auc_score(y_test, CV_model.predict_proba(X_test)[:,1]))))

# Add Logistic Regression ROC curve
fig.add_trace(go.Scatter(x=lr_fpr, y=lr_tpr, mode='lines', name='Logistic Regression (AUC = {:1.4f})'.format(roc_auc_score(y_test, lr_model.predict_proba(X_test)[:,1]))))

# Add Decision Tree ROC curve
fig.add_trace(go.Scatter(x=tree_fpr, y=tree_tpr, mode='lines', name='Decision Tree (AUC = {:1.4f})'.format(roc_auc_score(y_test, tree_model.predict_proba(X_test)[:,1]))))

# Add Baseline
fig.add_trace(go.Scatter(x=[0,1], y=[0,1], mode='lines', name='Baseline (AUC = 0.5000)', line=dict(dash='dash')))

# Update layout
fig.update_layout(
    title='ROC Curve',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate',
    legend_title='Classifier',
    width=800,
    height=600
)

# Show the figure
fig.show()

# **Conclusion:**
In this project, we used supervised machine learning techniques to detect fraud cases, which is possible because the fraud cases are labelled. On the test set of 1,515 transactions with 10 fraud cases, the tuned Random Forest performed best: it caught all 10 fraud cases with a single false positive. Logistic Regression and Decision Tree each missed one fraud case and raised 3 and 4 false positives, respectively. Combining the classifiers in an ensemble could take the best of multiple models, but that was not evaluated on this data. With only 10 fraud cases in the test set, the differences between the models are small and should be read with caution.




| Model               | Precision | Recall | F1-score | Accuracy | AUC ROC | TP | FP | FN | TN   |
|---------------------|-----------|--------|----------|----------|---------|----|----|----|------|
| Random Forest       | 0.91      | 1.00   | 0.95     | 0.9993   | 0.9997  | 10 | 1  | 0  | 1504 |
| Logistic Regression | 0.75      | 0.90   | 0.82     | 0.9973   | 0.9992  | 9  | 3  | 1  | 1502 |
| Decision Tree       | 0.69      | 0.90   | 0.78     | 0.9966   | 0.9486  | 9  | 4  | 1  | 1501 |

### Summary:
**Precision:** Random Forest has the highest precision (0.91), followed by Logistic Regression (0.75) and Decision Tree (0.69). Precision measures the accuracy of positive predictions.

**Recall:** Random Forest achieved perfect recall (1.00), indicating it correctly identified all positive instances. Logistic Regression and Decision Tree both have a recall of 0.90.

**F1-score:** Random Forest also has the highest F1-score (0.95), which is the harmonic mean of precision and recall. Logistic Regression follows with 0.82, and Decision Tree has 0.78.

**Accuracy:** Random Forest has the highest accuracy (0.9993), which measures overall model correctness. Logistic Regression and Decision Tree follow closely with accuracies of 0.9973 and 0.9966, respectively.

**AUC ROC:** Random Forest has the highest AUC ROC (0.9997), which measures the area under the receiver operating characteristic curve. Logistic Regression follows with 0.9992, and Decision Tree has 0.9486.

**Confusion Matrix Elements:**

TP (True Positives): Fraud cases correctly flagged as fraud.

FP (False Positives): Legitimate transactions incorrectly flagged as fraud.

FN (False Negatives): Fraud cases incorrectly classified as legitimate.

TN (True Negatives): Legitimate transactions correctly classified as legitimate.
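
The summary metrics follow directly from these four counts. A short sketch that recomputes them from the TP, FP, FN, and TN values in the comparison table above; the helper `summarize` is illustrative and not part of the notebook.

```python
# (TP, FP, FN, TN) copied from the comparison table above
counts = {
    "Random Forest": (10, 1, 0, 1504),
    "Logistic Regression": (9, 3, 1, 1502),
    "Decision Tree": (9, 4, 1, 1501),
}

def summarize(tp, fp, fn, tn):
    """Return precision, recall, F1-score and accuracy for one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

results = {name: summarize(*c) for name, c in counts.items()}
for name, (p, r, f1, acc) in results.items():
    print(f"{name:<20} precision={p:.2f} recall={r:.2f} f1={f1:.2f} accuracy={acc:.4f}")
```

The accuracy values in the table are truncated rather than rounded, so the fourth decimal printed here can differ from the table by one.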




### **1.   Performance Comparison:**


*   Random Forest demonstrates superior performance across most metrics compared to Logistic Regression and Decision Tree. It achieves perfect recall (1.00), indicating it correctly identifies all positive instances, along with high precision (0.91), indicating accurate positive predictions. This leads to the highest F1-score (0.95), reflecting a balance between precision and recall.

*   Logistic Regression performs moderately well with a precision of 0.75, recall of 0.90, and F1-score of 0.82. It shows good overall accuracy (0.9973) and AUC ROC (0.9992), indicating robust performance in classifying positive and negative instances.

*   Decision Tree exhibits the lowest performance among the three models with precision of 0.69, recall of 0.90, and F1-score of 0.78. It has lower AUC ROC (0.9486) compared to the other models, suggesting it may struggle more with distinguishing between positive and negative instances.



### **2.   Accuracy and Reliability:**


*   Random Forest stands out as the most accurate model with an accuracy of 0.9993, indicating its predictions align closely with actual outcomes. Its high AUC ROC (0.9997) further confirms its robustness in classification tasks.

*   Logistic Regression and Decision Tree also show high accuracies (0.9973 and 0.9966, respectively). Logistic Regression's AUC ROC (0.9992) is only slightly below that of Random Forest, whereas the Decision Tree's (0.9486) is noticeably lower.



### **3.   Practical Considerations:**

*   Random Forest may be the preferred model when high recall and precision are crucial, such as in medical diagnostics or fraud detection, due to its balanced performance and perfect recall.

*   Logistic Regression offers a good balance between interpretability and performance, making it suitable for scenarios where understanding feature importance is important.

*   Decision Tree could be used when transparency in decision-making is required, but its performance may need improvement for more critical applications.






In conclusion, the choice of model should align with specific application requirements and priorities, balancing between performance metrics such as precision, recall, F1-score, accuracy, and AUC ROC, to ensure optimal performance in real-world scenarios.
