In [1]:
import os
import numpy as np
import pandas as pd
import networkx as nx
from matplotlib import pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from sklearn import metrics

## Anomaly Detection in Financial Transaction Graphs

### Baseline Model

An anomaly is, by nature, rare: the vast majority of data points are not anomalous. A natural baseline model therefore simply assumes that there are zero anomalies in the data. We will see that such a model has exceptionally high accuracy, but is not effective according to the metric we end up choosing.

### Metric for Success

Because anomalies are rare, accuracy (ratio of correct predictions to total predictions) is a rather misleading metric for determining the success of an anomaly detecting system. A baseline model that always predicts that the data has no anomalies can get an accuracy of $99\%$ or more! Instead, we should be more focused on the precision and recall of the model:

$$
\text{Precision} = \frac{|\{\text{flagged}\} \cap \{\text{anomaly}\}|}{|\{\text{flagged}\}|}
$$

$$
\text{Recall} = \frac{|\{\text{flagged}\} \cap \{\text{anomaly}\}|}{|\{\text{anomaly}\}|}
$$

Notice that while our toy baseline model has extremely high accuracy, both its precision and recall are 0: it never flags an anomaly, so it has no true positives. (Strictly speaking, its precision is $0/0$ and therefore undefined; we take it to be 0 by convention.)
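To make the gap between accuracy and precision/recall concrete, here is a minimal sketch on made-up labels: 1,000 hypothetical transactions, 10 of which are anomalies. The data, the helper name `precision_recall`, and the second "noisy" detector are illustrative assumptions and not part of the notebook. Precision is reported as 0 when nothing is flagged, which is a convention rather than a mathematical fact.

```python
def precision_recall(flagged, anomaly):
    """Precision and recall from two sets of indices, following the formulas above."""
    hits = len(flagged & anomaly)
    precision = hits / len(flagged) if flagged else 0.0  # 0/0 treated as 0 by convention
    recall = hits / len(anomaly) if anomaly else 0.0
    return precision, recall


n = 1000
anomaly = set(range(10))  # indices of the true anomalies (made up)

# Baseline: never flag anything.
flagged = set()
correct = sum((i in flagged) == (i in anomaly) for i in range(n))
accuracy = correct / n
precision, recall = precision_recall(flagged, anomaly)
print("baseline:", accuracy, precision, recall)

# A hypothetical noisy detector: 20 flags, 5 of them true anomalies.
noisy_flagged = set(range(5)) | set(range(100, 115))
noisy_precision, noisy_recall = precision_recall(noisy_flagged, anomaly)
print("noisy detector:", noisy_precision, noisy_recall)
```

The baseline is right on 990 of 1,000 points yet finds nothing, while the noisy detector is far from perfect but is at least rewarded for the anomalies it does catch.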

Similar to the creation of a [receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC) curve, as the threshold at which a point is deemed an anomaly loosens, recall and precision change. Tracing these values over all thresholds creates a curve that characterizes the performance of the model.

Our metric will be the area under the curve (AUC) of the [precision-recall](https://en.wikipedia.org/wiki/Precision_and_recall) curve, where 0 is the worst possible performance and 1 is the best possible performance.
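The threshold sweep that traces out a precision-recall curve can be sketched in a few lines of plain Python. This is only an illustration on made-up labels and scores. The helper `pr_points` is a hypothetical name, it assumes that a higher score means "more anomalous" and that at least one anomaly exists, and it is not the routine the notebook uses for its metric.

```python
def pr_points(labels, scores):
    """Return (threshold, precision, recall) tuples as the threshold loosens.

    A point is flagged when its score is >= threshold; labels are 1 for
    anomalies and 0 otherwise.
    """
    total_anomalies = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [label for label, score in zip(labels, scores) if score >= threshold]
        true_positives = sum(flagged)
        points.append(
            (threshold, true_positives / len(flagged), true_positives / total_anomalies)
        )
    return points


labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
points = pr_points(labels, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Recall never decreases as the threshold loosens, while precision can move in either direction. That is why the two are summarized together as an area rather than read off at a single threshold.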

In [2]:
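def evaluate(y_gt, y_probs):
    """Area under the precision-recall curve; higher y_probs means more anomalous."""
    precision, recall, _ = metrics.precision_recall_curve(y_gt, y_probs)
    # scikit-learn appends a final (recall=0, precision=1) point. Replace it with
    # a copy of the preceding point so the curve ends at the last real threshold.
    precision = np.concatenate([precision[:-1], [precision[-2]]])
    recall = np.concatenate([recall[:-1], [recall[-2]]])
    return metrics.auc(recall, precision)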

## Data

In [3]:
data_root = "./data"

In [4]:
sample = 1000000
filepath = os.path.join(data_root, "synthetic_financial_transactions/", "data.csv")
# Draw a random subset of transactions (no random_state is set, so each run differs).
df = pd.read_csv(filepath).sample(n=sample).reset_index(drop=True)
# Despite the names, this is not a train/test split: train_* holds the features
# and test_* holds the isFraud labels for the same rows.
train_df = df[["amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig", "nameDest", "oldbalanceDest", "newbalanceDest"]]
test_df = df[["nameOrig", "nameDest", "isFraud"]]
train = train_df.drop(columns=["nameOrig", "nameDest"]).to_numpy()
test = test_df.drop(columns=["nameOrig", "nameDest"]).to_numpy()

In [5]:
train_df.head()

| | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|---|
| 0 | 117552.58 | C1867229710 | 0.0 | 0.0 | C934138900 | 10043318.83 | 10160871.41 |
| 1 | 8959.66 | C1901296898 | 64277.92 | 55318.25 | M1039727393 | 0.0 | 0.0 |
| 2 | 217460.12 | C1638223066 | 43.0 | 0.0 | C654871113 | 436602.5 | 654062.63 |
| 3 | 247287.39 | C2058009043 | 9031.0 | 0.0 | C2082236953 | 336709.02 | 583996.41 |
| 4 | 4890.11 | C323470761 | 55307.0 | 50416.89 | M983826542 | 0.0 | 0.0 |


In [6]:
test_df.head()

| | nameOrig | nameDest | isFraud |
|---|---|---|---|
| 0 | C1867229710 | C934138900 | 0 |
| 1 | C1901296898 | M1039727393 | 0 |
| 2 | C1638223066 | C654871113 | 0 |
| 3 | C2058009043 | C2082236953 | 0 |
| 4 | C323470761 | M983826542 | 0 |


## Models

### Baseline

In [7]:
y_prob = np.zeros(len(test))
auc = evaluate(test, y_prob)
print("Baseline Model AUC: {:.2e}".format(auc))

Baseline Model AUC: 0.00e+00


As expected, according to our metric, the baseline (assume there are no outliers) does not give any useful results at all.

### Global Outlier Detection

In [8]:
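knn = 2
clf = LocalOutlierFactor(n_neighbors=knn)
# fit() returns the fitted estimator itself, not a list of outliers.
%time outliers = clf.fit(train)
# NOTE: negative_outlier_factor_ is close to -1 for inliers and lower (more negative)
# for outliers, whereas evaluate() treats higher scores as more anomalous. The AUC
# printed below was obtained with the scores as-is; negating them may be more appropriate.
y_prob = clf.negative_outlier_factor_
auc = evaluate(test, y_prob)
print("LOF Model AUC: {:.2e}".format(auc))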

Wall time: 5min 8s
LOF Model AUC: 1.37e-03


With an AUC of roughly $1.4 \times 10^{-3}$, our LOF model does not show very convincing results.
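One detail worth checking when feeding LOF output into a ranking metric is the sign convention of the scores. The sketch below uses six made-up 2-D points (not the transaction data) to show that scikit-learn's `negative_outlier_factor_` is lowest for the most anomalous point, so a score where "larger means more anomalous" is obtained by negating it. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Five tightly grouped points plus one obvious outlier at index 5 (made-up data).
X = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [10.0, 10.0]]
)

clf = LocalOutlierFactor(n_neighbors=2).fit(X)
raw = clf.negative_outlier_factor_  # close to -1 for inliers, much lower for outliers
anomaly_score = -raw                # flipped so that larger means more anomalous

most_negative = int(np.argmin(raw))
most_anomalous = int(np.argmax(anomaly_score))
print(most_negative, most_anomalous)
```

Both indices point at the isolated point, which suggests that a flipped score such as `anomaly_score` is the natural input for a metric that ranks higher values as more anomalous.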

### Direct Neighbor Outlier Detection

In [9]:
node_G = nx.from_pandas_edgelist(train_df, source="nameOrig", target="nameDest", edge_attr=True, create_using=nx.MultiDiGraph)
edge_G = nx.line_graph(node_G)

In [10]:
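# Each node of the line graph is an edge (source, target, key) of the multigraph;
# copy that specific edge's attributes onto it.
attr = {}
for source, target, key in edge_G.nodes():
    attr[(source, target, key)] = node_G[source][target][key]
nx.set_node_attributes(edge_G, attr)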
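The idea behind the line graph is that each transaction becomes a node, and transaction *i* points to transaction *j* whenever the destination account of *i* is the origin account of *j*. The dependency-free sketch below builds those links by hand for four hypothetical transactions. The account names and amounts are made up, and transactions are identified by their list index rather than by the `(source, target, key)` tuples that `networkx` uses.

```python
from collections import defaultdict

# Hypothetical transactions: (origin, destination, amount).
transactions = [
    ("A", "B", 100.0),
    ("B", "C", 40.0),
    ("B", "D", 60.0),
    ("D", "A", 10.0),
]

# Index transactions by the account they leave from.
outgoing = defaultdict(list)
for tx_id, (orig, dest, amount) in enumerate(transactions):
    outgoing[orig].append(tx_id)

# Transaction i points to transaction j when i's destination is j's origin.
line_edges = [
    (tx_id, next_id)
    for tx_id, (orig, dest, amount) in enumerate(transactions)
    for next_id in outgoing.get(dest, [])
]
print(line_edges)
```

In this representation, the direct neighbors of a transaction are the transactions that could have funded it or been funded by it, which is the neighborhood a direct-neighbor outlier score would compare against.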

### Community Outlier Detection
### OddBall
### Hybrid Model