ANOMALY DETECTION MODELS
1. Isolation Forest
2. One-Class SVM
3. Autoencoders


ISOLATION FOREST

- How it works: The Isolation Forest is an unsupervised tree-ensemble method that “isolates” anomalies via random splits. It constructs many binary trees by recursively partitioning the data: at each node, a random feature and split value are chosen. Observations that lie in sparse regions (outliers) tend to require fewer splits to isolate, yielding shorter average path lengths from root to leaf. The model’s anomaly score is based on these path lengths: points with unusually short paths are flagged as anomalies.

- Handling imbalance: Isolation Forest never uses label frequencies, so class imbalance does not bias it the way it can bias a supervised classifier. It isolates each point using only the feature distribution, which makes it well suited to detecting rare fraud without labeled examples. You can set the contamination parameter to a small value (e.g. 0.01) to state the expected fraction of outliers, or leave it at its default and determine the anomaly threshold empirically after scoring.
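
To make the isolation idea concrete, here is a minimal sketch on synthetic two-dimensional data (an assumption for illustration, not real transaction features) with one point placed far from the cluster. The names X_demo and forest are hypothetical. score_samples returns lower values for points that are isolated in fewer splits.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): a dense cluster plus one distant point
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
outlier = np.array([[10.0, 10.0]])
X_demo = np.vstack([inliers, outlier])

# max_samples=len(X_demo) lets every tree see the distant point
forest = IsolationForest(n_estimators=100, max_samples=len(X_demo), random_state=0)
forest.fit(X_demo)

# score_samples: lower (more negative) means a shorter average path, i.e. more anomalous
sample_scores = forest.score_samples(X_demo)
outlier_score = sample_scores[-1]
typical_inlier_score = np.median(sample_scores[:-1])

print(f"outlier score:       {outlier_score:.3f}")
print(f"median inlier score: {typical_inlier_score:.3f}")
```

The distant point should receive a clearly lower score than a typical cluster point, because a random split is likely to separate it from the rest after very few cuts.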

In [None]:
from sklearn.ensemble import IsolationForest
# Train the model (assuming X_train is a NumPy array of features)
iso = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.01, random_state=42)
iso.fit(X_train)

# Score new data (e.g. test set) for anomalies
# decision_function is negative for predicted outliers; negate it so that
# a higher score means more anomalous
scores = -iso.decision_function(X_test)
preds = iso.predict(X_test)  # +1 for inliers, -1 for outliers

# Boolean mask of the points that predict() labels as outliers
anomaly_indices = (preds == -1)


In practice, one often inspects the scores and chooses a threshold directly: for example, if the contamination rate is unknown, mark the top 1–5% highest scores as anomalies. By default, the points flagged are those with a negative decision_function value, and predict() returns the same decision as +1 (normal) or -1 (anomaly).
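
The top-fraction rule described above can be sketched as follows. The helper flag_top_fraction and the random demo_scores are hypothetical stand-ins for illustration; in the notebook the scores would come from -iso.decision_function(X_test).

```python
import numpy as np

def flag_top_fraction(scores, fraction):
    """Return (mask, threshold), where mask marks the `fraction` highest scores."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.quantile(scores, 1.0 - fraction)
    return scores > threshold, threshold

rng = np.random.default_rng(1)
demo_scores = rng.normal(size=1000)  # stand-in for real anomaly scores

flags, threshold = flag_top_fraction(demo_scores, fraction=0.02)
print(f"threshold={threshold:.3f}, flagged {flags.sum()} of {flags.size} points")
```

Choosing the fraction is a judgment call: a smaller fraction reduces the review workload but risks missing fraud that scores just below the cutoff.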

Evaluation: If you have some labeled fraud cases for testing, evaluate with precision, recall, and ROC-AUC. For example, sklearn.metrics.roc_auc_score(y_true, scores) computes the AUC from the negated scores above. Because fraud is rare, focus on precision and recall at the tail of the score distribution; ROC-AUC alone can look optimistic under heavy imbalance. Note that Isolation Forest can sometimes mark normal border points (those near the edge of the training distribution) as anomalous. Tune the hyperparameters (n_estimators, max_samples, max_features) against a labeled validation set; scikit-learn's IsolationForest has no explicit depth parameter, because the tree depth is capped based on max_samples. Since shorter path lengths indicate anomalies, adjusting contamination (or the score threshold) moves the cutoff along that ranking and so trades false alarms against misses. Overall, Isolation Forest is fast and scales well (cost grows at most roughly linearly with the number of samples when each tree is built on a fixed-size subsample), which makes it a practical first choice for unsupervised fraud detection on large blockchain data.
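
As a rough illustration of this evaluation, the sketch below computes ROC-AUC, average precision, and precision/recall among the top-k scores. The labels and scores are synthetic assumptions (2% fraud, scored higher on average), and precision_recall_at_k is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(2)

# Synthetic labels and scores (illustrative only): 1 = fraud, higher score = more anomalous
y_true = np.concatenate([np.zeros(980, dtype=int), np.ones(20, dtype=int)])
scores = np.concatenate([rng.normal(0.0, 1.0, 980), rng.normal(3.0, 1.0, 20)])

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)

def precision_recall_at_k(y_true, scores, k):
    """Precision and recall when the k highest-scoring points are flagged."""
    top_k = np.argsort(scores)[::-1][:k]
    hits = y_true[top_k].sum()
    return hits / k, hits / y_true.sum()

precision_at_20, recall_at_20 = precision_recall_at_k(y_true, scores, k=20)
print(f"ROC-AUC={roc_auc:.3f}  AP={avg_precision:.3f}  "
      f"P@20={precision_at_20:.2f}  R@20={recall_at_20:.2f}")
```

Average precision and the top-k numbers describe the tail of the score distribution, which is usually where the decisions about rare fraud are made.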

ONE-CLASS SVM

- Concept: The One-Class SVM (Support Vector Machine) is a novelty/outlier detector that learns the boundary of the “normal” class. It finds a (potentially non-linear) hypersurface that encloses the training data, so new points that fall outside this region are considered anomalies. In scikit-learn, OneClassSVM supports several kernels; the default, kernel='rbf' (Gaussian), is the usual choice for capturing complex data shapes. A key parameter is nu, which sets an upper bound on the fraction of training points treated as outliers and a lower bound on the fraction of training points used as support vectors. For highly imbalanced or unlabeled blockchain data, you typically train the OCSVM on transactions believed to be normal and treat everything else as test data to be scored.

- Handling imbalance: Since it is inherently one-class, OCSVM does not require labeled fraud examples. You train on “normal” data, and any deviation in new data is flagged. The nu parameter tunes sensitivity: a smaller nu (e.g. 0.01) assumes very few anomalies in the training data, so the boundary encloses nearly all of it and fewer points are flagged; a larger nu tightens the boundary and flags more. This gives some control over the false-positive rate in an imbalanced scenario.

In [None]:
from sklearn.svm import OneClassSVM

# Fit on (mostly) normal training data
ocsvm = OneClassSVM(kernel='rbf', gamma=0.01, nu=0.01)
ocsvm.fit(X_train)  

# Score/test on new transactions
preds = ocsvm.predict(X_test)      # +1 = normal, -1 = anomaly
scores = ocsvm.decision_function(X_test)  
# (Higher scores indicate normal; anomalies get negative scores)

# Identify anomalies
anomaly_indices = (preds == -1)


The gamma (for RBF) and nu parameters require tuning: e.g. try gamma='auto' (1/n_features) or the default gamma='scale' (1/(n_features * X.var())), or run a grid search (e.g. GridSearchCV with a custom scorer, which needs some labeled examples). For blockchain data, scaling is crucial before fitting the SVM: apply a StandardScaler or MinMaxScaler so that features such as gas price and transaction value contribute comparably.
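
One way to keep scaling and the SVM together is a scikit-learn pipeline, sketched below. The two synthetic columns (a gas-price-like feature in wei and a transaction-value-like feature in ETH) and their magnitudes are assumptions chosen only to show very different scales.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)

# Synthetic "normal" transactions (illustrative only) with very different feature scales
X_normal = np.column_stack([
    rng.normal(30e9, 5e9, size=500),  # gas-price-like feature
    rng.normal(1.0, 0.2, size=500),   # transaction-value-like feature
])

# The pipeline fits the scaler on the training data and reuses it at predict time
detector = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.01),
)
detector.fit(X_normal)

X_new = np.array([
    [30e9, 1.0],   # typical on both features
    [30e9, 50.0],  # typical gas price, extreme transaction value
])
new_preds = detector.predict(X_new)             # +1 = normal, -1 = anomaly
new_scores = detector.decision_function(X_new)  # negative = outside the learned boundary
print(new_preds, new_scores)
```

Without the scaler, the RBF distance would be dominated by the gas-price column, and the extreme transaction value in the second row would barely register.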


Evaluation: With some known fraud labels, evaluate with ROC-AUC as above. Since decision_function yields scores where anomalies are negative, compute the AUC as roc_auc_score(y_true, -scores), flipping the sign so that higher means more anomalous. Also check the fraction of points flagged: nu bounds the fraction of training points that fall outside the boundary, so predict labels roughly nu*100% of the training data as outliers, and a test set drawn from the same distribution should show a similar baseline rate. In summary, One-Class SVM provides a flexible kernelized decision boundary, but it is sensitive to the kernel choice and gamma and does not scale well to very large datasets. It works best when you have (mostly) “clean” normal data and need a principled novelty detector.
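
The link between nu and the flagged fraction can be checked empirically. This is a small sketch on synthetic Gaussian data (an assumption for illustration); the exact fractions depend on the data and on solver tolerance, so treat them as approximate.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(500, 2))  # synthetic, already on a common scale

flagged_fraction = {}
for nu in (0.05, 0.2):
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_demo)
    # Fraction of the training points that the fitted model labels as outliers
    flagged_fraction[nu] = float(np.mean(model.predict(X_demo) == -1))

for nu, fraction in flagged_fraction.items():
    print(f"nu={nu:.2f} -> {fraction:.1%} of training points flagged")
```

If the flagged fraction on new data is far above the training baseline, that may indicate drift in the feature distribution rather than a sudden wave of fraud.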

AUTOENCODERS

Encoder–Decoder Structure: An autoencoder is a neural network trained to reproduce its input. It has an encoder that maps the input X to a low-dimensional code z (the “bottleneck”), and a decoder that reconstructs the output X' from z.

By training on normal transaction data only, the autoencoder learns how to compress and reconstruct typical patterns. At inference, we compute the reconstruction error for each sample (e.g. the mean squared error between X and X'). Fraudulent or anomalous transactions tend to have larger reconstruction errors because they do not conform to the learned normal patterns. Concretely: if x is an input vector with d features and x̂ is its autoencoder output, define the anomaly score as ||x - x̂||, or equivalently for ranking purposes the mean squared error ||x - x̂||² / d.
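
The two error definitions can be compared on a tiny worked example. The input and reconstruction values below are made-up numbers for illustration, not outputs of a trained model.

```python
import numpy as np

# Illustrative inputs and reconstructions (made-up numbers, 3 samples x 4 features)
x = np.array([[0.2, 0.5, 0.1, 0.9],
              [0.4, 0.4, 0.3, 0.7],
              [0.9, 0.1, 0.8, 0.2]])
x_hat = np.array([[0.2, 0.5, 0.1, 0.8],   # close reconstruction
                  [0.4, 0.2, 0.3, 0.7],   # fairly close reconstruction
                  [0.5, 0.4, 0.4, 0.6]])  # poor reconstruction

residual = x - x_hat
l2_error = np.linalg.norm(residual, axis=1)       # ||x - x_hat|| per sample
mse_error = np.mean(np.square(residual), axis=1)  # ||x - x_hat||^2 / d per sample

# One score is a monotone function of the other, so both pick the same sample
most_anomalous = int(np.argmax(l2_error))
print(l2_error, mse_error, most_anomalous)
```

Because the ranking is the same, the choice between the norm and the MSE only changes the numeric value of the threshold, not which samples end up flagged at a given quantile.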

In [None]:
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim = X_train.shape[1]
# Define encoder
inputs = tf.keras.Input(shape=(input_dim,))
encoded = layers.Dense(64, activation='relu')(inputs)
encoded = layers.Dense(32, activation='relu')(encoded)
# Bottleneck code
code = layers.Dense(16, activation='relu')(encoded)
# Define decoder
decoded = layers.Dense(32, activation='relu')(code)
decoded = layers.Dense(64, activation='relu')(decoded)
outputs = layers.Dense(input_dim, activation='linear')(decoded)

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')

# Train only on normal data
autoencoder.fit(X_train, X_train, epochs=50, batch_size=32, validation_split=0.1)


Here, loss='mse' minimizes reconstruction error on normal examples. We ensure the latent dimension is small (16 in this example) to force a compact representation. After training, compute reconstruction errors on the test set:

In [None]:
import numpy as np

reconstructions = autoencoder.predict(X_test)
mse = np.mean(np.square(X_test - reconstructions), axis=1)  # reconstruction error per sample


Threshold Selection and Evaluation: To decide which errors indicate fraud, one may set a threshold. Two common approaches:

- Statistical threshold: Use a high quantile of errors on a held-out normal set (e.g. flag anything above the 95th percentile).

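- ROC-based threshold: If a small labeled subset of fraud is available, choose the threshold that maximizes F1 (or another operating-point criterion on the ROC or precision-recall curve). ROC-AUC itself does not depend on any threshold; it measures how well the score ranks fraud above normal. For example: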

In [None]:
import numpy as np
from sklearn.metrics import roc_auc_score

# y_test is assumed to hold the known labels for X_test (0 = normal, 1 = fraud)
y_true = np.asarray(y_test).astype(int)
roc_auc = roc_auc_score(y_true, mse)

# This gives the area under the ROC curve when mse is used as the anomaly score.
# Even without labels, one can inspect the error distribution and look for a "knee" to set a cutoff.
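
Both threshold strategies can be sketched with scikit-learn and NumPy. The reconstruction errors below are synthetic (gamma-distributed, with the fraud errors shifted upward) and the variable names are hypothetical; with a real model they would come from a held-out normal set and a labeled test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Statistical threshold: 95th percentile of errors on held-out normal data (synthetic here)
val_normal_errors = rng.gamma(2.0, 0.05, size=2000)
quantile_threshold = np.quantile(val_normal_errors, 0.95)

# Label-based threshold: maximize F1 on a small labeled set (synthetic here)
test_errors = np.concatenate([
    rng.gamma(2.0, 0.05, size=950),        # normal transactions
    rng.gamma(2.0, 0.05, size=50) + 0.5,   # fraud, shifted to larger errors
])
y_true = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

precision, recall, thresholds = precision_recall_curve(y_true, test_errors)
# precision/recall have one more entry than thresholds; drop the final (1, 0) point
f1 = 2 * precision[:-1] * recall[:-1] / np.maximum(precision[:-1] + recall[:-1], 1e-12)
f1_threshold = thresholds[np.argmax(f1)]

flags = test_errors >= f1_threshold
print(f"quantile threshold={quantile_threshold:.3f}, "
      f"F1 threshold={f1_threshold:.3f}, best F1={f1.max():.3f}")
```

The quantile rule fixes the false-positive rate on normal data in advance, while the F1 rule adapts to the labeled fraud that happens to be available, so it can overfit when that labeled set is very small.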

Limitations: Autoencoders can be very powerful but have pitfalls. They require substantial training data and are hard to interpret. Crucially, recent studies warn that autoencoders can reconstruct even anomalous inputs well, which violates the basic assumption that anomalies produce large errors. In other words, a fraud pattern far from the training data might still be mapped to a latent code that decodes accurately and so produce a low error. Careful validation is therefore needed. In practice, one may use a small set of known fraud examples (semi-supervised) to calibrate the threshold. Regularization (e.g. dropout, sparsity penalties) and keeping the network small also help prevent the model from learning a near-identity mapping that reconstructs everything.

Comparison and Best Practices:

- Isolation Forest is fast and easy to use on large blockchain datasets. It copes well with many features, makes no assumption about the data distribution, and directly provides anomaly scores. However, it may erroneously flag edge cases (normal points near the boundary of the training distribution) as anomalies, and its scores are hard to interpret feature by feature. It works purely unsupervised, which suits unlabeled ledger data.

- One-Class SVM provides a principled “normal class” boundary with kernel flexibility. It can be effective when you have many clean (non-fraud) examples to train on, and nu gives explicit control over the fraction of training points treated as outliers. On the downside, OCSVM can be slow on very large samples (kernel SVM training typically scales between O(n²) and O(n³)), and choosing the right kernel/gamma is nontrivial. It assumes most training data are normal, so fraud contamination in the training set can distort its boundary. Still, in semi-supervised settings where labeled fraud is scarce, One-Class SVM is a sound choice.

- Autoencoders capture complex non-linear patterns in the data and can, in principle, detect subtle anomalies. They offer architectural flexibility (e.g. convolutional layers for transaction time-series). But they require careful tuning and much data. A big limitation (recently highlighted in the literature) is that autoencoders may reconstruct anomalies too well, making fraud hard to detect. They also risk overfitting if the network is too powerful. Best practice: train on as pure a set of normal transactions as possible, use a validation set to pick the reconstruction-error threshold (via quantiles of the normal errors or, with labels, the ROC or precision-recall curve), and monitor both training and validation loss for signs of overfitting.

Handling Imbalanced/Unlabeled Data: All three methods are naturally geared toward imbalanced settings because they do not require balanced class labels. They either treat anomalies as outliers (Isolation Forest), model only the normal class (One-Class SVM), or rely on reconstruction of normal data (Autoencoder). In practice, it’s wise to combine approaches: use robust feature engineering (as above), normalize features, and possibly ensemble multiple detectors. For example, one might flag a transaction only if all methods mark it anomalous, which reduces false positives at the cost of some recall. If any labeled fraud examples exist, use them for evaluation and threshold tuning (e.g. via the ROC or precision-recall curve) but avoid using them to train the unsupervised models. Remember to re-scale new incoming data using the same scalers fitted on the training data.
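
The "flag only if all methods agree" rule can be sketched as below. combine_flags is a hypothetical helper, and the three prediction arrays are made-up outputs for four transactions; in practice they would come from predict() of each detector (for the autoencoder, from comparing the reconstruction error with its threshold).

```python
import numpy as np

def combine_flags(predictions, rule="unanimous"):
    """Combine +1/-1 predictions from several detectors into one boolean anomaly mask.

    predictions: array-like of shape (n_detectors, n_samples), with -1 = anomaly.
    rule: "unanimous" flags a sample only if every detector flags it;
          "majority" flags it if more than half of the detectors do.
    """
    votes = np.asarray(predictions) == -1
    if rule == "unanimous":
        return votes.all(axis=0)
    if rule == "majority":
        return votes.sum(axis=0) > votes.shape[0] / 2
    raise ValueError(f"unknown rule: {rule!r}")

# Hypothetical outputs of three detectors on four transactions
iso_preds = np.array([-1, 1, -1, 1])
svm_preds = np.array([-1, -1, -1, 1])
ae_preds = np.array([-1, 1, 1, 1])

stacked = np.vstack([iso_preds, svm_preds, ae_preds])
unanimous = combine_flags(stacked, rule="unanimous")
majority = combine_flags(stacked, rule="majority")
print("unanimous:", unanimous)
print("majority: ", majority)
```

The unanimous rule is the most conservative choice; a majority vote is a middle ground when missing fraud is costlier than reviewing a few extra alerts.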

Overall, the choice depends on data scale and interpretability needs. Isolation Forest is a good default for large-scale exploratory analysis, One-Class SVM offers a clear mathematical framework when the dataset is smaller, and Autoencoders bring deep learning power (with caution) for capturing complex fraud patterns. With careful feature engineering and preprocessing, each method can better cope with rare fraud patterns in unlabeled blockchain datasets.