# Before Starting:

If you liked this kernel please don't forget to upvote the project, this will keep me motivated to other kernels in the future. I hope you enjoy our deep exploration into this dataset. Let's begin!

# **Credit Card Fraud Detection**
**Anonymized credit card transactions labeled as fraudulent or genuine**

# __Introduction__

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Eventually it is also important for companies NOT to detect transactions which are genuine as fradulent, otherwise companies whould keep blocking the credit card, and which may lead to customer dissatisfaction. So here are two important expects of this analysis:

* What would happen when company will not able to detect the fradulent transation and would not confirm from customer about this recent transaction wheather it was made by him/her.

* In contract, what would happen when company will detect a genuine transaction as fradulent and keep calling customer for confirmation or might block card.

The datasets contains transactions that have 492 frauds out of 284,807 transactions. So the dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions. When we try to build the prediction model with these kind of unbalanced dataset, then model will be more inclined towards to detect new unseen transaction as genuine as out dataset contains about 99% genuine data.

# **Load Data**

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import math
import matplotlib
import sklearn

# Print versions of libraries
print(f"Numpy version : Numpy {np.__version__}")
print(f"Pandas version : Pandas {pd.__version__}")
print(f"Matplotlib version : Matplotlib {matplotlib.__version__}")
print(f"Seaborn version : Seaborn {sns.__version__}")
print(f"SkLearn version : SkLearn {sklearn.__version__}")

# Magic Functions for In-Notebook Display
%matplotlib inline

# Setting seabon style
sns.set(style='darkgrid', palette='deep')

## Import the Dataset

In [None]:
df = pd.read_csv('../input/creditcardfraud/creditcard.csv', encoding='latin_1')

In [None]:
# Converting all column names to lower case
df.columns = df.columns.str.lower()

In [None]:
df.head()

In [None]:
df.tail()

* **Due to confidentiality issue, original features V1, V2,... V28 have been transformed with PCA, however we may guess that these features might be orginally credit card number, expirary date, CVV, card holder name, transaction location, transaction date time, etc.** 

* The only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. 

* Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

In [None]:
# Customising default values to view all columns
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

# pd.set_option('display.max_rows',1000)

In [None]:
df.head(10)

In [None]:
# import inspect

In [None]:
# inspect.getfullargspec(pd.value_counts)

# **Exploratory Data Analysis**

Once the data is read into python, we need to explore/clean/filter it before processing it for machine learning It involves adding/deleting few colums or rows, joining some other data, and handling qualitative variables like dates.

Now that we have the data, I wanted to run a few initial comparisons between the three columns - Time, Amount, and Class.

## Checking concise summary of dataset

It is also a good practice to know the features and their corresponding data types,along with finding whether they contain null values or not.

In [None]:
df.info()

**Highlights**

* Dataset contains details of 284807 transactions with 31 features.
* There is no missing data in out dataset, every columns contain excatly 284807 rows.
* All data types are float64 ,except 1 : Class 
* All data types are float64 ,except 1 : Class 
* 28 columns have Sequential Names and values that don't make any logical sense - > V1 , V2 ....V28
* 3 columns : TIME , AMOUNT and CLASS which can be analysed for various INSIGHTS ! 
* Memory Usage : 67 MB , not so Harsh !!

## Count unique values of label

In [None]:
print(df['class'].value_counts())
print('\n')
print(df['class'].value_counts(normalize=True))

In [None]:
df["class"].value_counts().plot(kind = 'pie',explode=[0, 0.1],figsize=(6, 6),autopct='%1.1f%%',shadow=True)
plt.title("Fraudulent and Non-Fraudulent Distribution",fontsize=20)
plt.legend(["Fraud", "Genuine"])
plt.show()

**Highlights**

This dataset have 492 frauds out of 284,807 transactions. The dataset is **highly unbalanced**, the positive class (frauds) account for 0.172% of all transactions. Most of the transactions are non-fraud. If we use this dataframe as the base for our predictive models and analysis, our algorithms will probably overfit since it will "assume" that most transactions are not fraud. But we don't want our model to assume, we want our model to detect patterns that give signs of fraud!

## Generate descriptive statistics

Lets summarize the central tendency, dispersion and shape of a dataset's distribution. Out of all the columns, the only ones that made the most sense were Time, Amount, and Class (fraud or not fraud). The other 28 columns were transformed using what seems to be a PCA dimensionality reduction in order to protect user identities.

The data itself is short in terms of time (it’s only 2 days long), and these transactions were made by European cardholders.

In [None]:
df[['time','amount','class']].describe()

**Highlights**
* On an average, credict card transaction is happening at every 94813.86 seconds.
* Average transation amount is 88.35 with standard deviation of 250, with minimum amount of 0.0 and maximum amount 25,691.16. By seeing the 75% and maximum amount, it look like the feature 'Amount' is higly **positive skewed**. We will check the distribution graph of amount to get more clarity.

## Finding null values

In [None]:
# Dealing with missing data
df.isnull().sum().max()

**Highlights**

There are no missing values present in the dataset. It is not necessary that missing values are present in the dataset in the form of  NA, NAN, Zeroes etc, it may be present by some other values also that can be explored by analysising the each features.

## Removing duplicate data

In [None]:
# Count the duplicate data
# ?print("No of duplicate data : ",len(df[df.duplicated()]))  
# print("\n")
print("Percentage of duplicate data : ",round(len(df[df.duplicated()])/len(df),4)*100, "%")  
print("\n")
print("Duplicate vs Non duplicate counts :")
print(df.duplicated().value_counts())

There are 1081 duplicate rows present in the dataset.

In [None]:
# Removing the Duplicate Values
df.drop_duplicates(inplace = True)

In [None]:
# Check for duplicate data if they exist 
print(df.duplicated().value_counts())

# Reset the index
df.reset_index(drop = True , inplace = True)

In [None]:
df.shape

In [None]:
df.reset_index(inplace = True , drop = True)

## Transaction for zero amount

In [None]:
df[df['amount'] == 0]['amount'].count()

In [None]:
df[(df['amount'] == 0) & (df['class'] == 1)]['amount'].count()

It is impossibile to have transation of amount zero from credit card. So these 1808 zero value transaction are actually null values and need to remove. However out of 1808 zero value transations, 25 are actually recognized as fradulent and rest as geninue.

### Remove the zero value non-fraud transactions only
Out data is highly unbalanced, and deleting the fraud transaction will make it more unbalanced.So we will delete only the genuine transactions of zero value.

In [None]:
# Remove the zero value non-fraud transactions only
df.drop(df[(df['amount'] == 0) & (df['class'] == 1)].index, inplace = True) 

In [None]:
# Check if zero transactions are removed or not
df[(df['amount'] == 0) & (df['class'] == 1)]['amount'].count()

In [None]:
df.reset_index(inplace = True , drop = True)

## Distribution of Amount

In [None]:
plt.figure(figsize=(8,6))
plt.title('Distribution of Transaction Amount', fontsize=14)
sns.distplot(df['amount'], bins=100)
plt.plot()

Most the transaction amount falls between 0 and about 3000 and we have some outliers for really big amount transactions and it may actually make sense to drop those outliers in our analysis if they are just a few points that are very extreme.

### Distribution of Amount for Fradulent & Genuine transactions

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(16,4))
sns.distplot(df[df['class'] == 1]['amount'], bins=100, ax=axs[0])
axs[0].set_title("Distribution of Fraud Transactions")

sns.distplot(df[df['class'] == 0]['amount'], bins=100, ax=axs[0])
axs[1].set_title("Distribution of Genuine Transactions")

plt.plot()

This graph shows that most of the fraud transaction amount is less than 500 dollor. This also shows that the fraud transaction is very high for an amount near to 0, lets find that amount.

In [None]:
print("Fraud Transaction distribution : \n",df[(df['class'] == 1)]['amount'].value_counts().head())
print("\n")
print("Maximum amount of fraud transaction - ",df[(df['class'] == 1)]['amount'].max())
print("Minimum amount of fraud transaction - ",df[(df['class'] == 1)]['amount'].min())

So there are 105 fraud transactions for just one dollor and 27 fraud transaction for $99.99. And higest fraud transaction amount was 2125.87 and lowest was just 0.01.

### Distribution of Amount w.r.t Class

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x='class', y='amount',data = df)
plt.title('Amount Distribution for Fraud and Genuine transactions')
plt.plot()

Most the transaction amount falls between 0 and about 3000 and we have some outliers for really big amount transactions and it may actually make sense to drop those outliers in our analysis if they are just a few points that are very extreme. Also we have should be conscious about that these **outlier should not be the fradulent transaction**. Generally, fradulent trasactions can of big amount and removing them from the data, can make the predicting model bais. 

So we can essentially build a model that realistically predicts transation as fraud without affected by outliers. It may not be really useful to actually have our model train on these extreme outliers.

## Distribution of Time

In [None]:
plt.figure(figsize=(8,6))
plt.title('Distribution of Transaction Time', fontsize=14)
sns.distplot(df['time'], bins=100)
plt.show()

By seeing graph, we can see there are two peaks in the graph and even there are some local peaks. We can think of these as the time of the day, like the peak is the day time when most people do the transactions and the depth is the night time when most people just sleeps. We aleady know that out data contains credit card transaction for only two days, so there are two peaks for day time and one depth for one night time.

### Distribution of time w.r.t. transactions types

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(16,4))

sns.distplot(df[(df['class'] == 1)]['time'], bins=100, ax=axs[0])
axs[0].set_title("Distribution of Fraud Transactions")

sns.distplot(df[(df['class'] == 0)]['time'], bins=100, ax=axs[1])
axs[1].set_title("Distribution of Genue Transactions")

plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(x='class', y='time',data = df)
plt.title('Time Distribution for Fraud and Genuine transactions')
plt.show()

### Distribution of transaction type w.r.t amount

In [None]:
fig, axs = plt.subplots(nrows=2,sharex=True,figsize=(16,6))

sns.scatterplot(x='time',y='amount', data=df[df['class']==1], ax=axs[0])
axs[0].set_title("Distribution of Fraud Transactions")

sns.scatterplot(x='time',y='amount', data=df[df['class']==0], ax=axs[1])
axs[1].set_title("Distribution of Genue Transactions")

plt.show()

## Removal of Outliers

In [None]:
Q3 = np.percentile(df['amount'], 75)
Q1 = np.percentile(df['amount'], 25)

# calculate interquartrile range - IQR = thirdQuartile - firstQuartile
IQR = (Q3 - Q1)
    
# Usually we take scale value of 1.5 times IQR to calculate. But this scale depends on the distribution followed by the data. 
# Say if my data seem to follow exponential distribution then this scale would change.   
# So I am taking sacle from 1.5 to 5.

# Lower outlier boundry (LOB) / Lower Whisker 
LOB = Q1 - (IQR * 5.0)
print(f"Lower Whisker : {LOB}")
    
# Upper outlier boundry (UOB) / Upper Whisker
UOB = Q3 + (IQR * 5.0)
print(f"Upper Whisker : {UOB}")

amtAllOutliers = df[(df['amount'] < LOB) | (df['amount'] > UOB)]['amount']
amtFrdOutliers = df[(df['class'] == 1) & ((df['amount'] < LOB) | (df['amount'] > UOB))]['amount']
amtGenuOutliers = df[(df['class'] == 0) & ((df['amount'] < LOB) | (df['amount'] > UOB))]['amount']

print('\n')
print("No of all type of transaction outliers : ", amtAllOutliers.count())
print("No of fraud transaction outliers : ", amtFrdOutliers.count())
print("No of genuine transaction outliers : ", amtGenuOutliers.count())
print("Percentage of outliers : ", round((amtGenuOutliers.count()/len(df))*100,2))

**There are total number of 11,213 outliers out of which 3.94% are fradulent.** 

### Checking proportion of data for fraud vs genuine

In [None]:
# Check the balace of data including outliers
print("Balace of data including outliers")
print(df['class'].value_counts(normalize=True))

print('\n')
# Check the balance of data excluding outliers
print("Balace of data excluding outliers")
print(df[(df['amount'] < LOB) | (df['amount'] > UOB)]['class'].value_counts(normalize=True))

Now we have check the total number of outliers in Amount feature and how many of them are fradulents. We found that balance of fraud vs genuine is not impacted much, so we can remove the outlier.

### Delete Outliers

In [None]:
# check shape before deleting outliers
df.shape

In [None]:
# Removing outliers
df = df.drop(amtAllOutliers.index)

In [None]:
# check shape after deleting outliers
df.shape

In [None]:
df.reset_index(inplace = True , drop = True)

### Check Amount Distribution after deleting outliers

In [None]:
fig, axs = plt.subplots(ncols=2,figsize=(12,6))

sns.distplot(df['amount'], bins=100, ax=axs[0])
axs[0].set_title("Amount Distribution")

sns.boxplot(x='class', y='amount',data = df, ax=axs[1])
axs[1].set_title("Dsitribution of Amount wrt Class")
plt.show()

## Categorical vs Continuous Features

Finging unique values for each column to understand which column is categorical and which one is Continuous

In [None]:
# Finging unique values for each column
df[['time','amount','class']].nunique()

## Correlation Among Explanatory Variables

Having **too many features** in a model is not always a good thing because it might cause overfitting and worser results when we want to predict values for a new dataset. Thus, **if a feature does not improve your model a lot, not adding it may be a better choice.**

Another important thing is **correlation. If there is very high correlation between two features, keeping both of them is not a good idea most of the time not to cause overfitting.** However, this does not mean that you must remove one of the highly correlated features. 

Lets find out top 10 features which are highly correlaed with price.

In [None]:
df[['time','amount','class']].corr()['class'].sort_values(ascending=False).head(10)

In [None]:
plt.title('Pearson Correlation Matrix')
sns.heatmap(df[['time', 'amount','class']].corr(),linewidths=0.25,vmax=0.7,square=True,cmap="viridis",
            linecolor='w',annot=True);

It looks like that no features are highly correlated with any other features.

## Lets check the data again after cleaning

In [None]:
df.shape

In [None]:
df['class'].value_counts(normalize=True)

# **Feature Engineering** 

## Feature engineering on Time

### Converting time from second to hour

In [None]:
# Converting time from second to hour
df['time'] = df['time'].apply(lambda sec : (sec/3600))

### Calculating hour of the day

In [None]:
# Calculating hour of the day
df['hour'] = df['time']%24   # 2 days of data
df['hour'] = df['hour'].apply(lambda x : math.floor(x))

### Calculating First and Second Day

In [None]:
# Calculating First and Second day
df['day'] = df['time']/24   # 2 days of data
df['day'] = df['day'].apply(lambda x : 1 if(x==0) else math.ceil(x))

In [None]:
df[['time','hour','day','amount','class']]

### Fraud and Genuine transaction Day wise

In [None]:
# calculating fraud transaction daywise
dayFrdTran = df[(df['class'] == 1)]['day'].value_counts()
# calculating genuine transaction daywise
dayGenuTran = df[(df['class'] == 0)]['day'].value_counts()
# calculating total transaction daywise
dayTran = df['day'].value_counts()

print("No of transaction Day wise:")
print(dayTran)

print("\n")

print("No of fraud transaction Day wise:")
print(dayFrdTran)

print("\n")

print("No of genuine transactions Day wise:")
print(dayGenuTran)

print("\n")

print("Percentage of fraud transactions Day wise:")
print((dayFrdTran/dayTran)*100)

* Total number of transaction on Day 1 was 1,38,355, out of which 242 was fraud and 1,38,113 was genuie. Fraud transation was 0.17% of total transaction on day 1.

* Total number of transaction on Day 2 was 1,34,133, out of which 166 was fraud and 1,33,967 was genuie. Fraud transation was 0.12% of total transaction on day 2.

* Most of the transaction including the fraud transaction happened on day 1.

Lets see the above numbers in graph.

In [None]:
fig, axs = plt.subplots(ncols=3, figsize=(16,4))

sns.countplot(df['day'], ax=axs[0])
axs[0].set_title("Distribution of Total Transactions")

sns.countplot(df[(df['class'] == 1)]['day'], ax=axs[1])
axs[1].set_title("Distribution of Fraud Transactions")

sns.countplot(df[(df['class'] == 0)]['day'], ax=axs[2])
axs[2].set_title("Distribution of Genuine Transactions")

plt.show()

In [None]:
# Time plots 
fig , axs = plt.subplots(nrows = 1 , ncols = 2 , figsize = (15,8))

sns.distplot(df[df['class']==0]['time'].values , color = 'green' , ax = axs[0])
axs[0].set_title('Genuine Transactions')

sns.distplot(df[df['class']==1]['time'].values , color = 'red' ,ax = axs[1])
axs[1].set_title('Fraud Transactions')

fig.suptitle('Comparison between Transaction Frequencies vs Time for Fraud and Genuine Transactions')
plt.show()

In [None]:
# Let's see if we find any particular pattern between time ( in hours ) and Fraud vs Genuine Transactions

plt.figure(figsize=(12,10))

sns.distplot(df[df['class'] == 0]["hour"], color='g') # Genuine - green
sns.distplot(df[df['class'] == 1]["hour"], color='r') # Fraudulent - Red

plt.title('Fraud vs Genuine Transactions by Hours', fontsize=15)
plt.xlim([0,25])
plt.show()

**Above graph shows that most of the Fraud transactions are happening at night time (0 to 7 hours) when most of the people are sleeping and Genuine transaction are happening during day time (9 to 21 hours).**

In [None]:
df[['time','hour','day','amount','class']].groupby('hour').count()['class'].plot()

### Visualising Data for detecting any particular Pattern or Anomaly using Histogram Plots

Finally visulaising all columns once and for all to observe any abnormality

In [None]:
df.hist(figsize = (25,25))
plt.show()

## Reset the index

In [None]:
df.reset_index(inplace = True , drop = True)

# **Scale Amount Feature**

* It is good idea to scale the data, so that the column(feature) with lesser significance might not end up dominating the objective function due to its larger range. like a column like age has a range between 0 to 80, but a column like salary has range from thousands to lakhs, hence, salary column will dominate to predict the outcome even if it may not be important.
* In addition, features having different unit should also be scaled thus providing each feature equal initial weightage. Like Age in years and Sales in Dollars must be brought down to a common scale before feeding it to the ML algorithm
* This will result in a better prediction model.


**Scaling using the log** : There are two main reasons to use logarithmic scales in charts and graphs. 
* The first is to respond to skewness towards large values; i.e., cases in which one or a few points are much larger than the bulk of the data. 
* The second is to show percent change or multiplicative factors. 

**PCA Transformation**: The description of the data says that all the features went through a PCA transformation (Dimensionality Reduction technique) except for time and amount.

**Scaling**: Keep in mind that in order to implement a PCA transformation features need to be previously scaled.

In [None]:
# # Since most of our data has already been scaled we should scale the columns that are left to scale (Amount and Time)

# from sklearn.preprocessing import MinMaxScaler

# scaler = MinMaxScaler()
# df['ScaledAmount'] = scaler.fit_transform(df[['Amount']])
# df['ScaledAmount'].tail()

In [None]:
# df['ScaledTime'] = scaler.fit_transform(df[['Time']])
# df['ScaledTime'].tail()

In [None]:
# df[['Time','ScaledTime','Amount','ScaledAmount','Class']].tail(10)
# df.head().T

## Scale amount by Log

In [None]:
# Scale amount by log
df['amount_log'] = np.log(df.amount + 0.01)

## Scale the Amount Column

In [None]:
from sklearn.preprocessing import StandardScaler # importing a class from a module of a library

ss = StandardScaler() # object of the class StandardScaler ()
df['amount_scaled'] = ss.fit_transform(df['amount'].values.reshape(-1,1))

In [None]:
#Feature engineering to a better visualization of the values
plt.figure(figsize=(14,6))
# Let's explore the Amount by Class and see the distribuition of Amount transactions
plt.subplot(121)
ax = sns.boxplot(x ="class",y="amount",data=df)
ax.set_title("Class x Amount", fontsize=20)
ax.set_xlabel("Is Fraud?", fontsize=16)
ax.set_ylabel("Amount", fontsize = 16)

plt.subplot(122)
ax1 = sns.boxplot(x ="class",y="amount_log", data=df)
ax1.set_title("Class x Log Amount", fontsize=20)
ax1.set_xlabel("Is Fraud?", fontsize=16)
ax1.set_ylabel("Amount(Log)", fontsize = 16)

# plt.subplot(123)
# ax1 = sns.boxplot(x ="class",y="amount_scaled", data=df)
# ax1.set_title("Class x Scaled Amount", fontsize=20)
# ax1.set_xlabel("Is Fraud?", fontsize=16)
# ax1.set_ylabel("Amount(Log)", fontsize = 16)

plt.subplots_adjust(hspace = 0.6, top = 0.8)

plt.show()

* We can see a slightly difference in log amount of our two Classes. 
* The IQR of fraudulent transactions are higher than normal transactions, but normal transactions have highest values

## Comparing Amount and Transaction Class

In [None]:
# We need to bin amounts first , but the problem is the skewness of the amount , still , let's try
legit_list = df[df['class']==0]['amount'].describe().tolist()
fraud_list = df[df['class']==1]['amount'].describe().tolist()
pd.DataFrame(np.transpose(legit_list) , np.transpose(fraud_list))

In [None]:
comp_df = pd.DataFrame([df[df['class']==0]['amount'].describe().to_dict() , df[df['class']==1]['amount'].describe().to_dict()])
comp_df = comp_df.T
comp_df

In [None]:
comp_df.columns = ['Legit' , 'Fraud']
comp_df.plot(kind = 'barh' , figsize = (10,10))

In [None]:
df[['time','hour','day','amount','amount_log','amount_scaled','class']]

# __Saving preprossed data as serialized files__
* To deploy the predictive models built we save them along with the required data files as serialized file objects
* We save cleaned and processed input data, tuned predictive models as files so that they can later be re-used/shared

In [None]:
### Save the processed data 

# Lets save the processed data so that we can use it later without running the preprocessing technique again and again.
# df.to_csv('elementary_data_processed.csv' , index = False)

In [None]:
import pickle
import os

In [None]:
CreditCardFraudDataCleaned = df

# Saving the Python objects as serialized files can be done using pickle library
# Here let us save the Final Data set after all the transformations as a file
with open('CreditCardFraudDataCleaned.pkl', 'wb') as fileWriteStream:
    pickle.dump(CreditCardFraudDataCleaned, fileWriteStream)
    # Don't forget to close the filestream!
    fileWriteStream.close()
    
print('pickle file is saved at Location:',os.getcwd())

> ### Load preprocessed data

In [None]:
# Reading a Pickle file
with open('CreditCardFraudDataCleaned.pkl', 'rb') as fileReadStream:
    CreditCardFraudDataFromPickle = pickle.load(fileReadStream)
    # Don't forget to close the filestream!
    fileReadStream.close()
    
# Checking the data read from pickle file. It is exactly same as the DiamondPricesData
df = CreditCardFraudDataFromPickle
df.head()

In [None]:
# df = pd.read_csv('elementary_data_processed.csv')

In [None]:
df.shape

In [None]:
df.head()

# **Splitting data into Training and Testing samples**

We dont use the full data for creating the model. Some data is randomly selected and kept aside for checking how good the model is. This is known as Testing Data and the remaining data is called Training data on which the model is built. Typically 70% of data is used as Training data and the rest 30% is used as Tesing data.

In [None]:
df.columns

In [None]:
# Separate Target Variable and Predictor Variables
X = df.drop(['time','class','hour','day','amount','amount_log','amount_scaled'],axis=1)
y = df['class']

In [None]:
X

In [None]:
# Load the library for splitting the data
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

In [None]:
# Quick sanity check with the shapes of Training and testing datasets
print("X_train - ",X_train.shape)
print("y_train - ",y_train.shape)
print("X_test - ",X_test.shape)
print("y_test - ",y_test.shape)

# __Baseline for models__

# Let's Discuss Next Steps - 

1  __Classification Models__

- Logistic Regression
- XG Boost
- SVM 's
- Decision Trees
- Random Forest

2  __Class Imbalance Solutions__

- Under Sampling
- Over Sampling
- SMOTE
- ADASYN

3  __Metrics__

- Accuracy Score
- Confusion Matrix
- ROC_AUC
- F1 Score

# __Model Building__

##### We are aware that our dataset is highly imbalanced, however we check the performance of imbalance dataset first and later we implement some techniques to balance the dataset and again check the performance of balanced dataset. Finally we will compare each regression models performance.

# __1. Logistic Regression__

## 1.1 Logistic Regression with __imbalanced__ data

In [None]:
from sklearn.linear_model import LogisticRegression # Importing Classifier Step

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) # Sequence for splitting

logreg = LogisticRegression(solver='lbfgs') # () towards the end
logreg.fit(X_train, y_train) 

### Predict from Test set

In [None]:
y_pred = logreg.predict(X_test)

### Model Evolution

In [None]:
from sklearn import metrics

In [None]:
# https://en.wikipedia.org/wiki/Precision_and_recall
print(metrics.classification_report(y_test, y_pred))

In [None]:
print("Accuracy : ",metrics.accuracy_score(y_pred , y_test))

In [None]:
# Predicted values counts for fraud and genuine of test dataset
pd.Series(y_pred).value_counts()

**Our model predicted 76 transaction as fraud and 81671 transactions as genuine from test dataset.**

In [None]:
# Actual values counts for fraud and genuine of test dataset
pd.Series(y_test).value_counts()

**There are originally 121 fraud transactions and our model predicted only 76 fraud transaction. So the accuracy of our model should be ${76}\over{121}$, right?**

In [None]:
76/121

So 62.81% should be our accuracy.

**However, this not the case. Actually there are originally 121 fraud transactions and 81626 genuine transactions in test dataset. However our model predicted only 76 fraud transaction. Also it should be keep in mind that these 76 predicted fraud transaction may not be identified correctly. It means that these predicted 76 fraud transactions are NOT only from 121 originall fraud transaction, however they may be from genuine transactions as well.**

We will see our real accuracy in below cells.

## __Model Evolution Matrix__

## Confusion Matrix

__Why and When__?

__Every problem is different and derives a different set of values for a particular business use case , thus every model must be evaluated differently.__

## Let's get to know the terminology and Structure first

A confusion matrix is defined into four parts : __{ TRUE , FALSE } (Actual) ,{POSITIVE , NEGATIVE} (Predicted)__
Positive and Negative is what you predict , True and False is what you are told

Which brings us to 4 relations : True Positive , True Negative , False Positive , False Negative <br>
__P__ redicted - __R__ ows and __A__ ctual as __C__ olumns <br>

<img src = 'https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/final_cnf.png?raw=true'>

![](https://imgur.com/om1pl02)


## __Accuracy , Precision and Recall__

##### __Accuracy__ : The most used and classic classification metric : Suited for binary classification problems.

$$  \text{Accuracy} = \frac{( TP + TN ) }{ (TP + TN + FP + FN )}$$

Basically Rightly predicted results amongst all the results , used when the classes are balanced

##### __Precision__ : What proportion of predicted positives are truly positive ? Used when we need to predict the positive thoroughly, sure about it !

$$ \text{Precision} = \frac{( TP )}{( TP + FP )} $$

##### __Sensitivity or Recall__ : What proportion of actual positives is correctly classified ? choice when we want to capture as many positives as possible

$$ \text{Recall} = \frac{(TP)}{( TP + FN )} $$

##### __F1 Score__ : Harmonic mean of Precision and Recall. It basically maintains a balance between the precision and recall for your classifier

$$ F1 = \frac{2 * (\text{ precision } * \text{ recall })}{(\text{ precision } + \text{ recall } )} $$



### Confusion Matrix

In [2]:
cnf_matrix = metrics.confusion_matrix(y_test,y_pred)
cnf_matrix

NameError: name 'metrics' is not defined

In [None]:
# Heatmap for Confusion Matrix
# ax= plt.subplot()

p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.ylabel('Actual',fontsize = 18)
plt.xlabel('Predicted',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

**There are 61 transaction recognised as True Postive, means they are orignally fraud transactions and our model precited them as fraud.**

**True Negative** - 81611 (truely saying negative - genuine transaction correctly identified as genuine)

**True Postive** - 61 (truely saying positive - fraud transaction correctly identified as fraud)

**False Negative** - 60 ( falsely saying negative - fraud transaction incorrectly identified as genuine)

**False Positive** - 15 ( falsely saying positive - genuine transaction incorrectly identified as fraud)

#### We already know that we have 121 fraud transaction in our test dataset, but our model predicted only 61 fraud transaction. So the real accuracy of our model is ${61}\over{121}$

In [None]:
61/121

So, **50.41%** is real accuracny of our model.

### __ROC AUC Curve__

It is an evaluation metric that helps identify the strength of the model to distinguish between two outcomes. It defines if a model can create a clear boundary between the postive and the negative class. 

Let's talk about some definitions first: 

##### __Sensitivity__ or __Recall__

The sensitivity of a model is defined by the proportion of actual positives that are classified as Positives , i.e = TP / ( TP + FN )

$$ \text{Recall or Sensitivity} = \frac{(TP)}{( TP + FN )} $$

<img src = "https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/sens.png?raw=true">

##### __Specificity__

The specificity of a model is defined by the proportion of actual negatives that are classified as Negatives , i.e = TN / ( TN + FP )

$$ \text{Specificity} = \frac{(TN)}{( TN + FP )} $$

<img src = "https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/spec.png?raw=true">

As we can see that both are independent of each other and lie in teo different quadrants , we can understand that they are inversely related to each other. Thus as Sensitivity goes up , Specificity goes down and vice versa.

### ROC CURVE

It is a plot between Sesitivity and ( 1 - Specificity ) , which intuitively is a plot between True Positive Rate and False Positive Rate. 
It depicts if a model can clearly identify each class or not

Higher the area under the curve , better the model and it's ability to seperate the positive and negative class.

<img src = "https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/tpfpfntn.jpeg?raw=true">
<img src = "https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/auc.png?raw=true">
<img src = "https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/auc2.png?raw=true">

In [None]:
metrics.roc_auc_score(y_test , y_pred) 

In [None]:
y_pred_proba = logreg.predict_proba(X_test)
y_pred_proba

In [None]:
# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)
print("AUC - ",auc,"\n")

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

# __Class Imbalance__

Let's Fix the class Imbalance and apply some sampling techniques

# Under Sampling and Over Sampling

<img src = 'https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/under_over_sampling.jpg?raw=true'>

# Synthetic Minority OverSampling Technique (SMOTE)
<img src='https://github.com/dktalaicha/Kaggle/blob/master/CreditCardFraudDetection/images/smote.png?raw=true'>

# ADASYN 

fdsfsfsdgdfsgds
fgdsgdfgdfgfd


## Import imbalace technique algorithims

In [None]:
# Import imbalace technique algorithims
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, classification_report
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler

In [None]:
df.head()

## 1.2.Logistic Regression with __Undersampling__ data

In [None]:
from collections import Counter # counter takes values returns value_counts dictionary
from sklearn.datasets import make_classification

print('Original dataset shape %s' % Counter(y))

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=0)

# Undersampling with Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))
# Accuracy is surely reduced , let's look at the roc curve now

In [None]:
# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)
print("AUC - ",auc,"\n")

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 1.3.Logistic Regression with __Oversampling__ data

In [None]:
from imblearn.over_sampling import RandomOverSampler

In [None]:
print('Original dataset shape %s' % Counter(y))
random_state = 42

ros = RandomOverSampler(random_state=random_state)
X_res, y_res = ros.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=0)

# Oversampling with Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

In [None]:
# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)
print("AUC - ",auc,"\n")

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a breast cancer classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 1.4 Logistic Regression with __SMOTE__ data

In [None]:
from imblearn.over_sampling import SMOTE, ADASYN

In [None]:
print('Original dataset shape %s' % Counter(y))

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=0)

# SMOTE Sampling with Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

In [None]:
# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)
print("AUC - ",auc,"\n")

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a breast cancer classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 1.5 Logistic Regression with __ADASYN__ data

In [None]:
print('Original dataset shape %s' % Counter(y))

adasyn = ADASYN(random_state=42)

X_res, y_res = adasyn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=0)

#  ADASYN Sampling with Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

In [None]:
# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)
print("AUC - ",auc,"\n")

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a breast cancer classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## Principal Component Analysis

##### reduce 29 columns - 2 columns , so that I can look at them in a plot !

In [None]:
from sklearn.decomposition import PCA # SVD , t-SNE , Linear Discrimant Analysis
X_reduced_pca = PCA(n_components=2, random_state=42).fit_transform(X_res)

In [None]:
#f,ax = plt.figure(figsize=(24,6))

plt.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y_res== 0), cmap='coolwarm', label='No Fraud', linewidths=2)
plt.scatter(X_reduced_pca[:,0], X_reduced_pca[:,1], c=(y_res == 1), cmap='coolwarm', label='Fraud', linewidths=2)
plt.show()

# Building different models with different balanced datasets 
Let's now try either different models , first by creating multiple datsets for undersampled , oversampled and SMOTE sampled

## 1. Undersampled Data

In [None]:
print('Original dataset shape %s' % Counter(y))

rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_under))

## 2. Oversampled Data

In [None]:
print('Original dataset shape %s' % Counter(y))

ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_over))

## 3. SMOTE Data

In [None]:
print('Original dataset shape %s' % Counter(y))

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_smote))

## 4. ADASYN Data

In [None]:
print('Original dataset shape %s' % Counter(y))

adasyn = ADASYN(random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_adasyn))

# Now applying different models and evaluating the dataset

In [None]:
from sklearn.svm import SVC # Support Vector Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# 2 Classifier - Decision Tree Classifier

## 2.1 Decision Tree Classifier with __imbalanced__ data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

dte = DecisionTreeClassifier()
dte.fit( X_train, y_train )

y_pred = dte.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 2.2 Decision Tree Classifier with __Undersampling__ data

In [None]:
# Undersampled data with Decision Tree Classifiers

X_train, X_test, y_train, y_test = train_test_split(X_under, y_under, test_size=0.3, random_state=0)

dte = DecisionTreeClassifier()
dte.fit( X_train, y_train )

y_pred = dte.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 2.3 Decision Tree Classifier with __Oversampling__ data

In [None]:
# Oversampled data with Decision Tree Classifiers # Best model after Classifier - DTE

X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, test_size=0.3, random_state=0)

dte = DecisionTreeClassifier()
dte.fit( X_train, y_train )

y_pred = dte.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 2.4 Decision Tree Classifier with __SMOTE__ data

In [None]:
# SMOTE data with Decision Tree Classifiers

X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, random_state=0)

dte = DecisionTreeClassifier()
dte.fit( X_train, y_train )

y_pred = dte.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

# 3 Random Forest Classifier

## 3.1 Random Forest Classifier with __imbalance__ data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rfc = RandomForestClassifier()
rfc.fit( X_train, y_train )

y_pred = rfc.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 3.2 Random Forest Classifier with __Undersampling__ data

In [None]:
# Undersampled data with Decision Tree Classifiers

X_train, X_test, y_train, y_test = train_test_split(X_under, y_under, test_size=0.3, random_state=0)

rfc = RandomForestClassifier()
rfc.fit( X_train, y_train )

y_pred = rfc.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 3.3 Random Forest Classifier with __Oversampling__ data

In [None]:
# Oversampled data with Decision Tree Classifiers # Best model after Classifier - DTE

X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, test_size=0.3, random_state=0)

rfc = RandomForestClassifier()
rfc.fit( X_train, y_train )

y_pred = rfc.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 3.4 Random Forest Classifier with __SMOTE__ data

In [None]:
# SMOTE data with Decision Tree Classifiers

X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, random_state=0)

rfc = RandomForestClassifier()
rfc.fit( X_train, y_train )

y_pred = rfc.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## For Loop

In [None]:
# from sklearn.linear_model import LogisticRegression
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.svm import SVC
# from sklearn.neighbors import KNeighborsClassifier


# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


# # 4.2 Build Models
# # Let’s test 6 different algorithms:

# # Logistic Regression (LR)
# # Linear Discriminant Analysis (LDA)
# # K-Nearest Neighbors (KNN).
# # Classification and Regression Trees (CART).
# # Gaussian Naive Bayes (NB).
# # Support Vector Machines (SVM).

# # Spot Check Algorithms
# models = []
# models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
# models.append(('DT', DecisionTreeClassifier()))
# models.append(('RF', RandomForestClassifier()))
# models.append(('SVM', SVC(gamma='auto')))
# models.append(('KNN', KNeighborsClassifier()))

# # evaluate each model in turn
# results = []
# names = []

# for name, model in models:
#     kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
#     cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
#     results.append(cv_results)
#     names.append(name)
#     print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
    
# print("\n")

# 4 K Nearest Classifier

## 4.1 KNN Classifier with __imbalance__ data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier()
knn.fit( X_train, y_train )

y_pred = knn.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 4.2 KNN Classifier with __Undersampling__ data

In [None]:
# Undersampled data with Decision Tree Classifiers

X_train, X_test, y_train, y_test = train_test_split(X_under, y_under, test_size=0.3, random_state=0)

knn = KNeighborsClassifier()
knn.fit( X_train, y_train )

y_pred = knn.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 4.3 KNN with __Oversampling__ data

In [None]:
# Oversampled data with Decision Tree Classifiers # Best model after Classifier - DTE

X_train, X_test, y_train, y_test = train_test_split(X_over, y_over, test_size=0.3, random_state=0)

knn = KNeighborsClassifier()
knn.fit( X_train, y_train )

y_pred = knn.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()

## 4.4 KNN Classifier with __SMOTE__ data

In [None]:
# SMOTE data with Decision Tree Classifiers

X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.3, random_state=0)

knn = KNeighborsClassifier()
knn.fit( X_train, y_train )

y_pred = knn.predict(X_test)

print("Accuracy :",metrics.accuracy_score(y_pred , y_test))  
print("AUC : ",metrics.roc_auc_score(y_test , y_pred))

# plot ROC Curve

plt.figure(figsize=(8,6))

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)

plt.plot(fpr,tpr,linewidth=2, label="data 1, auc="+str(auc))
plt.legend(loc=4)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12
plt.title('ROC curve for Predicting a credit card fraud detection')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')

In [None]:
# Heatmap for Confusion Matrix

cnf_matrix = metrics.confusion_matrix(y_test , y_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')

plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.xlabel('Predicted',fontsize = 18)
plt.ylabel('Actual',fontsize = 18)

# ax.xaxis.set_ticklabels(['Genuine', 'Fraud']); 
# ax.yaxis.set_ticklabels(['Genuine', 'Fraud']);

plt.show()