# Capstone Project - Credit Card Fraud Detection

## 5. DBSCAN

This technique is based on the DBSCAN clustering method. DBSCAN is a non-parametric, density-based clustering method that can also be used for outlier detection in a feature space of one or more dimensions.

In the DBSCAN clustering technique, all data points are defined either as Core Points, Border Points or Noise Points.
- Core Points are data points that have at least MinPts neighboring data points within a distance ℇ.
- Border Points are neighbors of a Core Point within the distance ℇ but have fewer than MinPts neighbors within the distance ℇ themselves.
- All other data points are Noise Points, also identified as outliers.
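
The three point types above can be read off a fitted scikit-learn `DBSCAN` model. The following is a minimal sketch on hand-made toy data, not part of the original analysis; the helper `classify_points` and the array `X_toy` are hypothetical names introduced for illustration. Note that scikit-learn counts the point itself toward `min_samples`, which differs slightly from the "neighboring points" wording above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def classify_points(X, eps, min_pts):
    """Return 'core', 'border' or 'noise' for each row of X."""
    db = DBSCAN(eps=eps, min_samples=min_pts).fit(X)
    # Default: the point belongs to a cluster but is not a core sample
    kinds = np.full(len(X), "border", dtype=object)
    # scikit-learn labels noise points with -1
    kinds[db.labels_ == -1] = "noise"
    # Core samples are listed by index on the fitted model
    kinds[db.core_sample_indices_] = "core"
    return kinds

# Toy data (assumption): five tightly packed points, one point on the
# edge of that group, and one point far away from everything else
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
    [0.3, 0.0],
    [5.0, 5.0],
])

point_kinds = classify_points(X_toy, eps=0.24, min_pts=4)
print(point_kinds)
```

In this toy case the five packed points should come out as core points, the edge point as a border point, and the distant point as noise.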

Outlier detection thus depends on the required number of neighbors MinPts, the distance ℇ and the selected distance measure, such as Euclidean or Manhattan distance.
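
To make this dependence concrete, here is a small sketch that counts noise points on the same toy data while varying ℇ (`eps`), MinPts (`min_samples`) and the distance measure (`metric`). The data `X_diag` and the helper `count_noise` are hypothetical and chosen only so that the effect is easy to see.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def count_noise(X, eps, min_pts, metric):
    """Number of points DBSCAN labels as noise (-1) for the given settings."""
    labels = DBSCAN(eps=eps, min_samples=min_pts, metric=metric).fit_predict(X)
    return int(np.sum(labels == -1))

# Toy data (assumption): four points on a diagonal plus one distant point.
# Diagonal neighbors are sqrt(2) apart in Euclidean distance but 2 apart
# in Manhattan distance.
X_diag = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])

noise_counts = {
    ("euclidean", 1.5, 2): count_noise(X_diag, 1.5, 2, "euclidean"),
    ("manhattan", 1.5, 2): count_noise(X_diag, 1.5, 2, "manhattan"),
    ("manhattan", 2.5, 2): count_noise(X_diag, 2.5, 2, "manhattan"),
    ("euclidean", 1.5, 4): count_noise(X_diag, 1.5, 4, "euclidean"),
}

for (metric, eps, min_pts), n_noise in noise_counts.items():
    print(f"metric={metric:<9} eps={eps} min_samples={min_pts} -> {n_noise} noise points")
```

With the same data, switching the metric or tightening either parameter changes which points count as outliers, so all three choices need to be tuned together.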

Compared to K-means clustering, DBSCAN does not need the number of clusters to be defined in advance. The algorithm finds core samples of high density and expands clusters from them. DBSCAN works well on data containing clusters of similar density. Fraud can then be identified as very small clusters.
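
The contrast with K-means can be sketched on synthetic data: two large dense groups and one very small group. This is an illustration under assumed toy data, not the credit card dataset; `grid`, `X_demo` and the parameter values are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def grid(x0, y0, n, step=0.1):
    """n x n block of evenly spaced points starting at (x0, y0)."""
    xs, ys = np.meshgrid(x0 + step * np.arange(n), y0 + step * np.arange(n))
    return np.column_stack([xs.ravel(), ys.ravel()])

# Two large dense groups and one tiny group standing in for "fraud"
X_demo = np.vstack([
    grid(0.0, 0.0, 6),
    grid(5.0, 5.0, 6),
    np.array([[10.0, 0.0], [10.1, 0.0], [10.0, 0.1], [10.1, 0.1], [10.05, 0.05]]),
])

# DBSCAN: the number of clusters is an output, not an input
db_labels = DBSCAN(eps=0.15, min_samples=4).fit_predict(X_demo)
n_found = len(set(db_labels)) - (1 if -1 in db_labels else 0)
sizes = np.bincount(db_labels[db_labels >= 0])
smallest = int(np.argmin(sizes))

# K-means: the number of clusters must be chosen up front
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

print(f"DBSCAN found {n_found} clusters with sizes {sizes.tolist()}")
print(f"Smallest DBSCAN cluster: {smallest}")
print(f"K-means returned {len(set(km_labels))} clusters because k=2 was requested")
```

Here DBSCAN isolates the tiny group as its own cluster, whereas K-means with `k=2` has to merge it into one of the large groups.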

Since this dataset is too large for DBSCAN to run on in full, we will take a subsample of it.

In [3]:
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

%matplotlib inline

from sklearn.preprocessing import StandardScaler
from matplotlib.patches import Rectangle
from pprint import pprint as pp
import csv
from pathlib import Path
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline 
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score, classification_report, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.metrics import homogeneity_score, silhouette_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import MiniBatchKMeans, DBSCAN
from itertools import product

In [42]:
df = pd.read_csv('../assets/creditcard.csv')

In [43]:
df.shape

(284807, 31)

In [45]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

## Train/Test split

In [46]:
X = df.drop('Class',axis=1) # independent columns - features
y = df.loc[:,'Class']       # target column - Class

In [47]:
print("Input Shape : ", X.shape)
print("Output Shape : ", y.shape)

Input Shape :  (284807, 30)
Output Shape :  (284807,)


In [48]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.04, random_state = 42)

In [49]:
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

(273414, 30) (11393, 30) (273414,) (11393,)


## Sub-sampling

I used train_test_split to split the original dataset into train and test sets. I will create a new dataframe from the test set and run DBSCAN on it.
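
With so few fraud cases, a purely random subsample can by chance contain too few or too many of them. One possible variation, shown here only as a sketch on synthetic data and not as the approach used in this notebook, is to pass `stratify` to `train_test_split` so that the subsample keeps the class ratio of the full data. The arrays `X_all` and `y_all` are hypothetical stand-ins for the features and the `Class` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 20 of them "fraud" (2%)
X_all = np.arange(2000, dtype=float).reshape(1000, 2)
y_all = np.zeros(1000, dtype=int)
y_all[:20] = 1

# Keep 10% of the rows as the subsample, preserving the class ratio
_, X_sub, _, y_sub = train_test_split(
    X_all, y_all, test_size=0.1, stratify=y_all, random_state=42
)

print("Subsample shape:", X_sub.shape)
print("Fraud cases in subsample:", int(y_sub.sum()))
```

Stratifying on the label makes the number of fraud cases in the subsample predictable, which matters when the result is later judged on a handful of positives.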

In [50]:
new_df = pd.concat([pd.DataFrame(X_test), pd.DataFrame(y_test)], axis=1)

In [51]:
new_df.shape

(11393, 31)

In [52]:
new_df['Class'].value_counts()

0    11373
1       20
Name: Class, dtype: int64

In [53]:
new_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
43428,41505.0,-16.526507,8.584972,-18.649853,9.505594,-13.793819,-2.832404,-16.701694,7.517344,-8.507059,...,1.190739,-1.12767,-2.358579,0.673461,-1.4137,-0.462762,-2.018575,-1.042804,364.19,1
49906,44261.0,0.339812,-2.743745,-0.13407,-1.385729,-1.451413,1.015887,-0.524379,0.22406,0.899746,...,-0.213436,-0.942525,-0.526819,-1.156992,0.311211,-0.746647,0.040996,0.102038,520.12,0
29474,35484.0,1.39959,-0.590701,0.168619,-1.02995,-0.539806,0.040444,-0.712567,0.002299,-0.971747,...,0.102398,0.168269,-0.166639,-0.81025,0.505083,-0.23234,0.011409,0.004634,31.0,0
276481,167123.0,-0.432071,1.647895,-1.669361,-0.349504,0.785785,-0.630647,0.27699,0.586025,-0.484715,...,0.358932,0.873663,-0.178642,-0.017171,-0.207392,-0.157756,-0.237386,0.001934,1.5,0
278846,168473.0,2.01416,-0.137394,-1.015839,0.327269,-0.182179,-0.956571,0.043241,-0.160746,0.363241,...,-0.238644,-0.6164,0.347045,0.061561,-0.360196,0.17473,-0.078043,-0.070571,0.89,0


### Selecting the features

In [54]:
labels = new_df.Class

In [55]:
cols = list((new_df.drop(['Time', 'Amount', 'Class'], axis=1).columns.values))

In [56]:
cols

['V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V8',
 'V9',
 'V10',
 'V11',
 'V12',
 'V13',
 'V14',
 'V15',
 'V16',
 'V17',
 'V18',
 'V19',
 'V20',
 'V21',
 'V22',
 'V23',
 'V24',
 'V25',
 'V26',
 'V27',
 'V28']

In [57]:
# Extract the selected feature columns as a float array
# (np.float was removed from NumPy; the built-in float is equivalent)
X = new_df[cols].values.astype(float)

In [69]:
X.shape

(11393, 28)

### Preprocessing

In [70]:
# Define the scaler and apply to the data
ss = StandardScaler()
X_scaled = ss.fit_transform(X)

### DBSCAN

In [77]:
# Initialize and fit the DBscan model
db = DBSCAN(eps=0.9, min_samples=10, n_jobs=-1).fit(X_scaled)

# Obtain the predicted labels and calculate number of clusters
# (label -1 marks noise points and is not counted as a cluster)
pred_labels = db.labels_
n_clusters = len(set(pred_labels)) - (1 if -1 in pred_labels else 0)

In [78]:
# Print performance metrics for DBscan
print(f'Estimated number of clusters: {n_clusters}')
print(f'Homogeneity: {homogeneity_score(labels, pred_labels):0.3f}')
print(f'Silhouette Coefficient: {silhouette_score(X_scaled, pred_labels):0.3f}')

Estimated number of clusters: 31
Homogeneity: 0.008
Silhouette Coefficient: -0.332


### Assessing smallest clusters

In [79]:
# Count observations in each cluster number
counts = np.bincount(pred_labels[pred_labels >= 0])

# Print the result
print(counts)

[12 61 45 22 26 17 15 38 45 24 36 25 34 27 21 36 10 21 10 15 13 15 10 15
 13 13 10 19 10 16 10]


In [80]:
# Sort the clusters by sample count and take the three smallest
# (several clusters tie at 10 samples, so which three are picked is arbitrary)
smallest_clusters = np.argsort(counts)[:3]

In [81]:
# Print the results 
print(f'The smallest clusters are clusters: {smallest_clusters}')

The smallest clusters are clusters: [30 28 26]


In [82]:
# Print the counts of the smallest clusters only
print(f'Their counts are: {counts[smallest_clusters]}')

Their counts are: [10 10 10]


### Results verification

In [83]:
# Create a dataframe of the predicted cluster numbers and fraud labels 
df = pd.DataFrame({'clusternr':pred_labels,'fraud':labels})

# Create a condition flagging fraud for the smallest clusters
df['predicted_fraud'] = np.where(df['clusternr'].isin(smallest_clusters), 1, 0)

In [84]:
# Run a crosstab on the results 
print(pd.crosstab(df['fraud'], df['predicted_fraud'], rownames=['Actual Fraud'], colnames=['Flagged Fraud']))

Flagged Fraud      0   1
Actual Fraud            
0              11343  30
1                 20   0


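None of the 20 fraud cases are detected: all of them fall outside the three flagged clusters, while 30 legitimate transactions are flagged as fraud. Flagging the smallest DBSCAN clusters therefore has 0% recall on this subsample. The low homogeneity (0.008) and the negative silhouette coefficient (-0.332) also indicate that, with these parameters, the clusters do not separate fraud from normal transactions.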
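
The crosstab can be summarised as precision, recall and accuracy to put numbers on the result. The following is a minimal sketch with the four counts copied by hand from the crosstab output; the helper `safe_ratio` is a hypothetical name introduced here to avoid division by zero.

```python
# Counts copied from the crosstab output (rows: actual, columns: flagged)
tn, fp = 11343, 30   # actual non-fraud: not flagged, flagged
fn, tp = 20, 0       # actual fraud:     not flagged, flagged

def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

precision = safe_ratio(tp, tp + fp)   # share of flagged cases that are fraud
recall = safe_ratio(tp, tp + fn)      # share of fraud cases that were flagged
accuracy = safe_ratio(tp + tn, tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.4f}")
```

Accuracy comes out high only because almost every transaction is legitimate, so precision and recall are the more informative measures for a dataset this imbalanced.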