In [16]:
import pandas as pd

# Read the credit card transactions into a DataFrame
data = pd.read_csv('Resources/creditcard.csv')

# Unsupervised Outlier Detection

Now that we have processed our data, we can begin applying our machine learning algorithms. We will compare the following two techniques:

## Local Outlier Factor (LOF)

The local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points. The anomaly score it assigns to each sample, also called the Local Outlier Factor, measures the local deviation of the sample's density with respect to its neighbors. The score is local in that it depends on how isolated the sample is relative to its surrounding neighborhood.

The local outlier factor is based on a concept of local density, where locality is given by the $k$ nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.

The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters. 
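
To make these definitions concrete, here is a minimal pure-Python sketch of the LOF computation on a toy 2-D dataset. The function names (`k_nearest`, `lof_scores`) and the data are illustrative assumptions, and the sketch ignores ties and duplicate points, which a full implementation such as scikit-learn's `LocalOutlierFactor` handles.

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    """Compute a simplified Local Outlier Factor for every point."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor
    k_dist = [nbrs[-1][0] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], distance(i, j))
    lrd = []
    for i in range(n):
        mean_reach = sum(max(k_dist[j], d) for d, j in neighbors[i]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: mean density of the neighbors divided by the point's own density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)
    ]

# A 3x3 grid of inliers plus one distant point (hypothetical data)
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, 10.0))

scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Points inside the grid score close to 1 because their density matches that of their neighbors, while the distant point scores far higher because its neighbors are much denser than it is.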

## Isolation Forest Algorithm

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
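
The path-length idea can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: the data is one-dimensional, so there is no random feature selection, there is no subsampling, and the helper names (`isolation_path_length`, `mean_path_length`) are hypothetical. It is not scikit-learn's implementation.

```python
import random

def isolation_path_length(values, target, rng, max_depth=50):
    """Count the random splits needed to isolate `target` within `values`."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # identical values cannot be separated further
        # Pick a random split value between the minimum and maximum
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_path_length(values, target, n_trees=200, seed=0):
    """Average the path length over many random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_path_length(values, target, rng) for _ in range(n_trees))
    return total / n_trees

# Twenty tightly clustered values plus one distant value (hypothetical data)
values = [i / 10 for i in range(20)] + [10.0]

inlier_len = mean_path_length(values, target=1.0)
outlier_len = mean_path_length(values, target=10.0)
print("inlier mean path length: ", inlier_len)
print("outlier mean path length:", outlier_len)
```

The distant value is usually cut off by the very first split, whereas a value inside the cluster needs several splits before it stands alone, so its average path is longer.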

In [11]:
# Take a reproducible 90% random sample of the data and print its shape
cc = data.sample(frac=0.9, random_state=42)
print(cc.shape)
# print(cc.describe())

(256326, 31)


In [12]:
# Determine number of fraud cases in dataset

Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# Ratio of fraud to valid transactions in the full dataset,
# used later as the contamination estimate for both models
outlier_fraction = len(Fraud) / len(Valid)
print(outlier_fraction)

print('Fraud Cases: {}'.format(len(Fraud)))
print('Valid Transactions: {}'.format(len(Valid)))

0.0017304750013189597
Fraud Cases: 492
Valid Transactions: 284315
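
scikit-learn documents `contamination` as the proportion of outliers in the data set, whereas the value printed above is the ratio of fraud to valid transactions. The short sketch below, which uses only the counts printed above, suggests that the two quantities are nearly identical at this level of class imbalance. The variable names are illustrative.

```python
# Counts taken from the output above
n_fraud = 492
n_valid = 284315

# Ratio of fraud to valid transactions (what the cell above computes)
ratio_to_valid = n_fraud / n_valid

# Fraud as a proportion of all transactions
share_of_total = n_fraud / (n_fraud + n_valid)

print(ratio_to_valid)
print(share_of_total)
print(ratio_to_valid - share_of_total)
```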


In [13]:


# Store the variable we'll be predicting on
target = "Class"

# Use every column except the label as a feature
columns = [c for c in cc.columns if c != target]

X = cc[columns]
Y = cc[target]

# Print shapes
print(X.shape)
print(Y.shape)

(256326, 30)
(256326,)


In [14]:
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# fix the random seed so the Isolation Forest results are reproducible
state = 1

# define outlier detection tools to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=30,
        contamination=outlier_fraction)}

In [20]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
# ignore all deprecation warnings
simplefilter(action='ignore', category=DeprecationWarning)

In [21]:
# Fit each model and compare its predictions with the true labels
n_outliers = len(Fraud)


for clf_name, clf in classifiers.items():

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        # In outlier-detection mode LOF labels its training data via fit_predict
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Remap predictions from sklearn's convention (1 = inlier, -1 = outlier)
    # to the dataset's labels: 0 for valid, 1 for fraud.
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    
    n_errors = (y_pred != Y).sum()
    
    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(f"Accuracy Score: {accuracy_score(Y, y_pred)}")
    print(f"Classification Report:\n {classification_report(Y, y_pred)}")

Isolation Forest: 613
Accuracy Score: 0.9976085141577522
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    255887
           1       0.30      0.31      0.31       439

   micro avg       1.00      1.00      1.00    256326
   macro avg       0.65      0.65      0.65    256326
weighted avg       1.00      1.00      1.00    256326

Local Outlier Factor: 859
Accuracy Score: 0.9966487987952841
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    255887
           1       0.03      0.03      0.03       439

   micro avg       1.00      1.00      1.00    256326
   macro avg       0.51      0.51      0.51    256326
weighted avg       1.00      1.00      1.00    256326

