# Objectives

Perform exploratory data analysis and determine training labels

*   Create a column for the class
*   Standardize the data
*   Split into training data and test data

*   Find the best hyperparameters for **Logistic Regression**, **Decision Trees**, **K-Nearest-neighbours (KNN)**, **Support Vector Machine (SVM)**, and **Random Forest**

*   Find the method that performs best using test data


## Import Libraries and Define Auxiliary Functions

#### We will import the following libraries for this project

In [6]:
# Pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
import numpy as np
# Matplotlib is a plotting library for Python, and pyplot gives us a MATLAB-like plotting framework. We will use this to plot data.
import matplotlib.pyplot as plt
# Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics
import seaborn as sns
# Preprocessing allows us to standardize our data
from sklearn import preprocessing
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
# Allows us to test parameters of classification algorithms and find the best one
#from sklearn.model_selection import GridSearchCV  ## We did not use
# Allows us to test accuracy by jaccard_score
from sklearn.metrics import jaccard_score
# Allows us to see the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix
# Allows us to see the classification_report or F1 Score
from sklearn.metrics import classification_report
# Allows us to see the log loss score
from sklearn.metrics import log_loss
# Logistic Regression classification algorithm
from sklearn.linear_model import LogisticRegression
# Support Vector Machine classification algorithm
from sklearn.svm import SVC
from sklearn import svm
# Decision Tree classification algorithm
from sklearn.tree import DecisionTreeClassifier
# K Nearest Neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier
# Import Library for Random Forest
from sklearn.ensemble import RandomForestClassifier

## Load the dataframe

#### Load the data

In [7]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [8]:
data=pd.read_csv("../input/creditcardfraud/creditcard.csv")

## Explore the dataset

#### Shape of the data, head and tail

In [9]:
print('1. DATA SHAPE: ', data.shape)
print('2. DATA HEAD: '"\n",  data.head(5))
print('3. DATA TAIL: '"\n",  data.tail(5))

#### Check the types of the data

In [10]:
data.info()

#### Checking missing values

In [11]:
data.isnull().sum()

#### Describe the summary statistics

In [12]:
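# Use the full option name: the bare 'precision' key is ambiguous in recent pandas versions
pd.set_option('display.precision', 2)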
data.describe()

#### Checking transaction distribution

In [13]:
Total_transactions = len(data)
secured = len(data[data.Class == 0])
fraud = len(data[data.Class == 1])
# Share of fraud among ALL transactions (not relative to the secured ones)
fraud_percentage = round(fraud/Total_transactions*100, 2)
print('Total number of transactions:', Total_transactions)
print('Total number of secured transactions:', secured)
print('Total number of fraud transactions:', fraud)
print('Percentage of fraud transactions:', fraud_percentage)
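
The class shares can also be read directly from pandas. The sketch below uses a small, hypothetical label column (not the credit card data) to show `value_counts` with and without `normalize=True`.

```python
import pandas as pd

# Toy labels (hypothetical): 0 = normal, 1 = fraud
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Class")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of all rows per class

# Fraud share as a percentage of all transactions
fraud_share_pct = round(shares.loc[1] * 100, 2)
print(counts.to_dict())
print("Fraud share (%):", fraud_share_pct)
```

Because the shares are computed over all rows, they always sum to 1, which makes the degree of class imbalance easy to see.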

#### Independent Variables

In [14]:
X = data.loc[:, data.columns != 'Class']
X[0:5]

#### Standardize the data in X 

In [15]:
x_scaled = preprocessing.scale(X)
X = pd.DataFrame(x_scaled)
X

#### Dependent Variable

In [16]:
y = data["Class"].to_numpy()
y[0:5]

#### Standardize the dataset with StandardScaler

In [17]:
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]

## Splitting Data into Train and Test Set

In [18]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=2)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
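
With a class as rare as fraud, it can help to keep the class ratio identical in both splits and to fit the scaler on the training part only. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and the 5% positive rate are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # synthetic features
y_demo = np.array([1] * 50 + [0] * 950)                   # 5% positives (assumed)

# stratify=y_demo keeps the positive rate the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print("Positive rate train/test:", y_tr.mean(), y_te.mean())
```

Fitting the scaler after the split means no statistics from the test rows influence the training features.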

## Modeling

### Logistic Regression with Scikit-learn

The **C** parameter is the **inverse of regularization strength** and must be a positive float; smaller values mean stronger regularization.

In [19]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
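
To see what **C** does, the sketch below fits the same model with three values of `C` on synthetic data and compares the size of the learned coefficients. The data set and the chosen values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (assumed), unrelated to the credit card data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, solver="liblinear").fit(X_demo, y_demo)
    # Stronger regularization (small C) shrinks the coefficient vector
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C:<6} coefficient norm={norms[C]:.3f}")
```

A small `C` such as 0.01 pulls the coefficients toward zero, which is the regularization effect described above.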

Now we can predict using our test set:

In [20]:
yhat = LR.predict(X_test)
yhat

**predict_proba**  returns estimates for all classes, ordered by the label of classes. So, the first column is the probability of class 0, P(Y=0|X), and second column is probability of class 1, P(Y=1|X):

In [21]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob
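
The column order of `predict_proba` follows `classes_`, and each row sums to 1. A minimal sketch on synthetic data (the data set is an assumption) that checks this and rebuilds the hard predictions from the probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)   # shape (n_samples, 2)
print("Column order:", model.classes_)
print("First rows:\n", proba[:3])

# The predicted label is the class whose column holds the larger probability
pred_from_proba = model.classes_[np.argmax(proba, axis=1)]
```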

#### Accuracy Evaluation by Jaccard Index

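We can define the Jaccard index as the size of the intersection divided by the size of the union of the predicted and true label sets. For a binary problem, `jaccard_score` computes TP / (TP + FP + FN) for the class given by `pos_label`; here `pos_label=0`, so the score describes the normal (non-fraud) class. A score of 1.0 means the predictions for that class match the true labels exactly.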

In [22]:
LR_AS = round(jaccard_score(y_test, yhat,pos_label=0)*100,2)

LR_AS
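
To make the Jaccard index concrete, here is a small hand-written version for a single class. The helper name `jaccard_for_class` and the toy labels are hypothetical; the formula is TP / (TP + FP + FN) for the chosen class.

```python
def jaccard_for_class(y_true, y_pred, pos_label):
    """Jaccard index of one class: TP / (TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    return tp / (tp + fp + fn)

# Toy labels (hypothetical): 8 normal (0) and 2 fraud (1) transactions
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

j_normal = jaccard_for_class(y_true_toy, y_pred_toy, pos_label=0)
j_fraud = jaccard_for_class(y_true_toy, y_pred_toy, pos_label=1)
print("Jaccard for class 0:", round(j_normal, 3))
print("Jaccard for class 1:", round(j_fraud, 3))
```

The same predictions give a much higher score for the majority class than for the minority class, so the choice of `pos_label` matters on imbalanced data.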

#### Accuracy Evaluation by Confusion Matrix

Another way of looking at the accuracy of the classifier is to look at the **confusion matrix**.

In [23]:
#yhat=logreg_cv.predict(X_test)
cm = confusion_matrix(y_test, yhat, labels=[1,0])
print('Confusion matrix:''\n', cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

disp.plot(cmap=plt.cm.Blues)
plt.show()

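Because the matrix is built with `labels=[1,0]`, the first row holds the transactions that are actually fraudulent (class 1) and the second row holds the transactions that are actually normal (class 0); the columns give the predicted class in the same order. Treating fraud as the positive class, the second row shows 56870 true negatives and 8 false positives: out of 56870 + 8 = 56878 normal transactions in the test set, 56870 were correctly classified as normal and 8 were wrongly flagged as fraudulent.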
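
The row and column order of the matrix follows the `labels` argument. The sketch below uses toy labels (hypothetical) to show how the four cells map to TP, FN, FP, and TN when `labels=[1, 0]` and class 1 is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = fraud, 0 = normal
y_true_toy = [1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes, both ordered [1, 0]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[1, 0])

# With this ordering the flattened matrix reads: TP, FN, FP, TN
tp, fn, fp, tn = (int(v) for v in cm_toy.ravel())
print(cm_toy)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```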

#### Checking F1 Score

In [24]:
print (classification_report(y_test, yhat))

Based on the count of each section, we can calculate precision and recall of each label:

*   **Precision** is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)

*   **Recall** is the true positive rate. It is defined as: Recall =  TP / (TP + FN)

So, we can calculate the precision and recall of each class.

**F1 score:**
Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.

Finally, the report also averages the F1-scores of both labels. The weighted average rounds to 1.0 in our case, but because normal transactions vastly outnumber fraud, it is dominated by the normal class, so the per-class scores for fraud (label 1) are the more informative ones.
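
The definitions above can be checked with a few lines of arithmetic. The counts in this sketch are made up for illustration and are not taken from the report.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for one class
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The F1 score sits between precision and recall and is pulled toward the lower of the two.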

### Decision Trees with Scikit-learn

We will first create an instance of the **DecisionTreeClassifier** called **fraudTree**. Inside the classifier, specify *criterion="entropy"* so that each split is chosen by its information gain.

In [25]:
fraudTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
fraudTree # it shows the default parameters
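
As a rough illustration of what the entropy criterion measures, the sketch below computes entropy and information gain for two candidate splits of a toy node. The helper names and labels are hypothetical; scikit-learn performs this computation internally.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect_gain = information_gain(parent, [[0, 0, 0, 0], [1, 1, 1, 1]])
weak_gain = information_gain(parent, [[0, 0, 1, 1], [0, 0, 1, 1]])
print("Gain of a perfect split:", perfect_gain)
print("Gain of an uninformative split:", weak_gain)
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves both children as mixed as the parent gains nothing.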

Next, we will fit the data with the training feature matrix <b> X_train </b> and training response vector <b> y_train </b>

In [26]:
fraudTree.fit(X_train,y_train)

Let's make some <b>predictions</b> on the testing dataset and store it into a variable called <b>predTree</b>.

In [27]:
predTree = fraudTree.predict(X_test)

You can print out <b>predTree</b> and <b>y_test</b> if you want to visually compare the predictions to the actual values.

In [28]:
print (predTree [0:5])
print (y_test [0:5])

#### Accuracy Evaluation by Jaccard Index

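We can define the Jaccard index as the size of the intersection divided by the size of the union of the predicted and true label sets. For a binary problem, `jaccard_score` computes TP / (TP + FP + FN) for the class given by `pos_label`; here `pos_label=0`, so the score describes the normal (non-fraud) class. A score of 1.0 means the predictions for that class match the true labels exactly.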

In [29]:
DT_AS = round(jaccard_score(y_test, predTree,pos_label=0)*100,2)

DT_AS

#### Accuracy Evaluation by Confusion Matrix

Another way of looking at the accuracy of the classifier is to look at the **confusion matrix**.

In [30]:
cm = confusion_matrix(y_test, predTree, labels=[1,0])
print('Confusion matrix:''\n', cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

disp.plot(cmap=plt.cm.Blues)
plt.show()

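Because the matrix is built with `labels=[1,0]`, the first row holds the transactions that are actually fraudulent (class 1) and the second row holds the transactions that are actually normal (class 0); the columns give the predicted class in the same order. Treating fraud as the positive class, the second row shows 56867 true negatives and 11 false positives: out of 56867 + 11 = 56878 normal transactions in the test set, 56867 were correctly classified as normal and 11 were wrongly flagged as fraudulent.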

#### Checking F1 Score

In [31]:
print (classification_report(y_test, predTree))

Based on the count of each section, we can calculate precision and recall of each label:

*   **Precision** is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)

*   **Recall** is the true positive rate. It is defined as: Recall =  TP / (TP + FN)

So, we can calculate the precision and recall of each class.

**F1 score:**
Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.

Finally, the report also averages the F1-scores of both labels. The weighted average rounds to 1.0 in our case, but because normal transactions vastly outnumber fraud, it is dominated by the normal class, so the per-class scores for fraud (label 1) are the more informative ones.

### K-Nearest-neighbours with Scikit-learn

Let's start the training algorithm with k=4 for now:

In [32]:
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
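
The value k=4 is only a starting point. One possible way to compare several values of k is a cross-validated grid search, sketched here on synthetic data; the grid, the data set, and the `f1` scoring choice are assumptions, and this search is not part of the notebook's own run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem (assumed), unrelated to the credit card data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Candidate neighbourhood sizes (assumed grid)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# 5-fold cross-validation, scored by F1 of the positive class
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
print("Best k:", best_k, "with mean CV F1:", round(search.best_score_, 3))
```

On the full credit card data a search like this can take a long time, since KNN prediction cost grows with the size of the training set.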

We can use the model to make predictions on the test set:

In [33]:
yhat = neigh.predict(X_test)
yhat[0:5]

#### Accuracy Evaluation by Jaccard Index

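We can define the Jaccard index as the size of the intersection divided by the size of the union of the predicted and true label sets. For a binary problem, `jaccard_score` computes TP / (TP + FP + FN) for the class given by `pos_label`; here `pos_label=0`, so the score describes the normal (non-fraud) class. A score of 1.0 means the predictions for that class match the true labels exactly.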

In [34]:
KNN_AS = round(jaccard_score(y_test, yhat,pos_label=0)*100,2)

KNN_AS

#### Accuracy Evaluation by Confusion Matrix

Another way of looking at the accuracy of the classifier is to look at the **confusion matrix**.

In [35]:
cm = confusion_matrix(y_test, yhat, labels=[1,0])
print('Confusion matrix:''\n', cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

disp.plot(cmap=plt.cm.Blues)
plt.show()

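Because the matrix is built with `labels=[1,0]`, the first row holds the transactions that are actually fraudulent (class 1) and the second row holds the transactions that are actually normal (class 0); the columns give the predicted class in the same order. Treating fraud as the positive class, the second row shows 56875 true negatives and 3 false positives: out of 56875 + 3 = 56878 normal transactions in the test set, 56875 were correctly classified as normal and 3 were wrongly flagged as fraudulent.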

#### Checking F1 Score

In [42]:
print (classification_report(y_test, yhat))

Based on the count of each section, we can calculate precision and recall of each label:

*   **Precision** is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)

*   **Recall** is the true positive rate. It is defined as: Recall =  TP / (TP + FN)

So, we can calculate the precision and recall of each class.

**F1 score:**
Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.

Finally, the report also averages the F1-scores of both labels. The weighted average rounds to 1.0 in our case, but because normal transactions vastly outnumber fraud, it is dominated by the normal class, so the per-class scores for fraud (label 1) are the more informative ones.

### Support Vector Machine with Scikit-learn

Let's start with Radial Basis Function (RBF) kernel.

In [43]:
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 
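
For reference, the RBF kernel measures similarity as exp(-gamma * ||x - z||^2). The sketch below evaluates it by hand for two toy points and compares the result with scikit-learn's `rbf_kernel`; the points and the `gamma` value are assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, z, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])
gamma = 0.5  # assumed value for illustration

manual = rbf(a, b, gamma)  # squared distance is 5, so this is exp(-2.5)
library = float(rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0])
print("manual:", manual, "library:", library)
```

Identical points give a kernel value of 1, and the value decays toward 0 as the points move apart; `gamma` controls how fast.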

After being fitted, the model can then be used to predict new values:

In [44]:
yhat = clf.predict(X_test)
yhat [0:5]

#### Accuracy Evaluation by Jaccard Index

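We can define the Jaccard index as the size of the intersection divided by the size of the union of the predicted and true label sets. For a binary problem, `jaccard_score` computes TP / (TP + FP + FN) for the class given by `pos_label`; here `pos_label=0`, so the score describes the normal (non-fraud) class. A score of 1.0 means the predictions for that class match the true labels exactly.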

In [45]:
SVM_AS = round(jaccard_score(y_test, yhat,pos_label=0)*100,2)

SVM_AS

#### Accuracy Evaluation by Confusion Matrix

Another way of looking at the accuracy of the classifier is to look at the **confusion matrix**.

In [41]:
cm = confusion_matrix(y_test, yhat, labels=[1,0])
print('Confusion matrix:''\n', cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

disp.plot(cmap=plt.cm.Blues)
plt.show()

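Because the matrix is built with `labels=[1,0]`, the first row holds the transactions that are actually fraudulent (class 1) and the second row holds the transactions that are actually normal (class 0); the columns give the predicted class in the same order. Treating fraud as the positive class, the second row shows 56874 true negatives and 4 false positives: out of 56874 + 4 = 56878 normal transactions in the test set, 56874 were correctly classified as normal and 4 were wrongly flagged as fraudulent.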

#### Checking F1 Score

In [46]:
print (classification_report(y_test, yhat))

Based on the count of each section, we can calculate precision and recall of each label:

*   **Precision** is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)

*   **Recall** is the true positive rate. It is defined as: Recall =  TP / (TP + FN)

So, we can calculate the precision and recall of each class.

**F1 score:**
Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.

Finally, the report also averages the F1-scores of both labels. The weighted average rounds to 1.0 in our case, but because normal transactions vastly outnumber fraud, it is dominated by the normal class, so the per-class scores for fraud (label 1) are the more informative ones.

### Random Forest with Scikit-learn

Initialize the Random Forest

In [47]:
RF = RandomForestClassifier()

#Train the model using Training Dataset
RF.fit(X_train, y_train)
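
A random forest is randomized, so results can differ slightly between runs unless a seed is fixed. The sketch below, on synthetic data, shows that two forests with the same `random_state` agree; the `n_estimators` and seed values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem (assumed), unrelated to the credit card data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Same seed, same data: the two forests are built identically
rf_a = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

same_predictions = bool((rf_a.predict(X_demo) == rf_b.predict(X_demo)).all())
importance_total = float(rf_a.feature_importances_.sum())  # importances sum to 1
print("Identical predictions:", same_predictions)
print("Feature importances:", rf_a.feature_importances_.round(3))
```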

After being fitted, the model can then be used to predict new values:

In [48]:
yhat = RF.predict(X_test)
yhat [0:5]

#### Accuracy Evaluation by Jaccard Index

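We can define the Jaccard index as the size of the intersection divided by the size of the union of the predicted and true label sets. For a binary problem, `jaccard_score` computes TP / (TP + FP + FN) for the class given by `pos_label`; here `pos_label=0`, so the score describes the normal (non-fraud) class. A score of 1.0 means the predictions for that class match the true labels exactly.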

In [49]:
RF_AS = round(jaccard_score(y_test, yhat,pos_label=0)*100,2)
RF_AS


#### Accuracy Evaluation by Confusion Matrix

Another way of looking at the accuracy of the classifier is to look at the **confusion matrix**.

In [50]:
cm = confusion_matrix(y_test, yhat, labels=[1,0])
print('Confusion matrix:''\n', cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)

disp.plot(cmap=plt.cm.Blues)
plt.show()

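Because the matrix is built with `labels=[1,0]`, the first row holds the transactions that are actually fraudulent (class 1) and the second row holds the transactions that are actually normal (class 0); the columns give the predicted class in the same order. Treating fraud as the positive class, the second row shows 56873 true negatives and 5 false positives: out of 56873 + 5 = 56878 normal transactions in the test set, 56873 were correctly classified as normal and 5 were wrongly flagged as fraudulent.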

#### Checking F1 Score

In [51]:
print (classification_report(y_test, yhat))

Based on the count of each section, we can calculate precision and recall of each label:

*   **Precision** is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)

*   **Recall** is the true positive rate. It is defined as: Recall =  TP / (TP + FN)

So, we can calculate the precision and recall of each class.

**F1 score:**
Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label.

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has a good value for both recall and precision.

Finally, the report also averages the F1-scores of both labels. The weighted average rounds to 1.0 in our case, but because normal transactions vastly outnumber fraud, it is dominated by the normal class, so the per-class scores for fraud (label 1) are the more informative ones.

## Best-Performing Method

In [52]:
BPM = pd.DataFrame({
    'Models': ['Logistic Regression', 'Decision Trees','K-Nearest-neighbours (KNN)', 
               'Support Vector Machine (SVM)', 'Random Forest'],
    'Accuracy_Score': [LR_AS, DT_AS, KNN_AS, SVM_AS, RF_AS]})

BPM.sort_values(by='Accuracy_Score', ascending=False)
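
When reading a ranking like this one, it helps to remember how a score on the majority class behaves under heavy imbalance. The sketch below uses made-up labels (990 normal, 10 fraud) and a trivial predictor that always answers "normal"; it is an illustration, not a result from this notebook.

```python
import numpy as np
from sklearn.metrics import jaccard_score, recall_score

# Hypothetical, heavily imbalanced labels: 990 normal (0), 10 fraud (1)
y_true_toy = np.array([0] * 990 + [1] * 10)

# Trivial predictor that never flags fraud
y_pred_toy = np.zeros_like(y_true_toy)

jaccard_normal = jaccard_score(y_true_toy, y_pred_toy, pos_label=0)  # 990 / 1000
recall_fraud = recall_score(y_true_toy, y_pred_toy, pos_label=1)     # 0 / 10
print("Jaccard for the normal class:", jaccard_normal)
print("Recall for the fraud class:", recall_fraud)
```

A predictor that misses every fraud case still scores 0.99 on the normal class, so fraud-class recall or F1 is a useful companion metric when comparing models.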

## Conclusion

*   We found that **0.17%** of the transactions were fraudulent.
*   **Random Forest** reached an accuracy score (Jaccard index for the normal class) of **99.96%**.
 

# Thank you very much for reading this notebook. Please share your comments and suggestions for improvement.