# Comparing Isolation Forest implementations

This notebook briefly compares the Isolation Forest implementations in three libraries ([IsoTree](https://github.com/david-cortes/isotree), [EIF](https://github.com/sahandha/eif) and [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)). For each dataset, models are first fitted with the default hyperparameters and then tuned over a small hyperparameter grid, whose contents depend on which hyperparameters can be varied in each library.

The datasets are all taken from the [Outlier Detection DataSets (ODDS)](http://odds.cs.stonybrook.edu) webpage.

The only evaluation metric used here is the area under the ROC curve (AUC), with outliers as the positive class; they are the minority in each dataset. The outliers come already labelled, and some of them are artificially produced. See the link above for details.
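
As a rough illustration of what this metric measures: AUC equals the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, with ties counted as one half. The helper `pairwise_auc` and the toy labels and scores below are made up for illustration and are not part of any of the libraries compared here.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (outlier, inlier) pairs in which the outlier
    gets the higher score; tied pairs count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]  # outlier scores
    neg = [s for l, s in zip(labels, scores) if l == 0]  # inlier scores
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (made up): 1 marks an outlier, higher score = more anomalous.
toy_labels = [0, 0, 0, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 7 of the 8 outlier/inlier pairs are ordered correctly
```

The quadratic loop is only meant to make the definition visible; `roc_auc_score`, used throughout this notebook, computes the same quantity far more efficiently.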

*(For a speed comparison with larger datasets see [this other notebook](https://github.com/david-cortes/isotree/blob/master/timings/timings_python.ipynb))*

Datasets compared:
* [Satellite (6435 rows, 36 columns)](#p1)
* [Annthyroid (7200 rows, 6 columns)](#p2)
* [Pendigits (6870 rows, 16 columns)](#p3)

In [1]:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from isotree import IsolationForest as IsolationForestIsoTree
from sklearn.ensemble import IsolationForest as IsolationForestSKL
from eif import iForest
from scipy.io import loadmat

<a id="p1"></a>
## Satellite (6435 rows, 36 columns)

In [2]:
satellite = loadmat("satellite.mat")
X = np.asfortranarray(satellite["X"]).astype(np.float64)
y = satellite["y"].astype(np.float64).reshape(-1)
X.shape

(6435, 36)

Checking isotree library:

In [3]:
pred_default = IsolationForestIsoTree().fit_predict(X)
roc_auc_score(y, pred_default)

0.7002590560187147

In [4]:
params_try = {
    "sample_size" : [256, 1024, 5000],
    "prob_pick_pooled_gain" : [0, 1],
    "weigh_by_kurtosis" : [True, False],
}
cv_model = GridSearchCV(estimator=IsolationForestIsoTree(ntrees=100, ndim=2,
                                                         penalize_range=False,
                                                         missing_action="fail",
                                                         random_seed=1),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X,y)
cv_model.best_params_

{'prob_pick_pooled_gain': 1, 'sample_size': 256, 'weigh_by_kurtosis': True}
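
To give a sense of the cost of this search, the sketch below enumerates the same grid with the standard library and counts the model fits that an exhaustive search with 10 folds and `refit=True` implies. The variable names (`params_grid`, `combos`, and so on) are illustrative only.

```python
from itertools import product

# Same grid as in the cell above.
params_grid = {
    "sample_size": [256, 1024, 5000],
    "prob_pick_pooled_gain": [0, 1],
    "weigh_by_kurtosis": [True, False],
}
n_folds = 10

# Every combination of the listed values is tried.
names = list(params_grid)
combos = [dict(zip(names, values)) for values in product(*params_grid.values())]

n_cv_fits = len(combos) * n_folds  # one fit per combination and fold
n_total_fits = n_cv_fits + 1       # refit=True adds a final fit on all the data
print(len(combos), n_cv_fits, n_total_fits)
```

Because of the final refit on the full data, scores computed afterwards on the same `X` are in-sample; the cross-validated score of the selected combination should be available as `cv_model.best_score_`.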

In [5]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, pred_tuned)

0.8399789803094202

Checking scikit-learn library:

In [6]:
pred_default = IsolationForestSKL(n_jobs=-1, random_state=1).fit(X).decision_function(X)
roc_auc_score(y, -pred_default)

0.6872106805842193
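
The score is negated above because scikit-learn's `decision_function` returns lower values for more abnormal points, whereas `roc_auc_score` expects higher values for the positive class (the outliers here). Negating the scores and flipping the labels are equivalent, which is why the grid search that follows can pass `1-y` together with `scoring="roc_auc"`. A small check on made-up values (the arrays `y_toy` and `normality` are illustrative, not taken from the datasets):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data (made up): 1 marks an outlier.
y_toy = np.array([0, 0, 0, 0, 1, 1])
# Scores in scikit-learn's convention: higher = more normal.
normality = np.array([0.9, 0.8, 0.7, 0.2, 0.3, 0.1])

# Option 1: keep outliers as the positive class and negate the scores.
auc_negated = roc_auc_score(y_toy, -normality)
# Option 2: keep the scores and make inliers the positive class.
auc_flipped = roc_auc_score(1 - y_toy, normality)

print(auc_negated, auc_flipped)  # both give the same AUC
```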

In [7]:
params_try = {
    "max_samples" : [256, 1024, 5000],
}
cv_model = GridSearchCV(estimator=IsolationForestSKL(n_estimators=100, random_state=1, n_jobs=-1),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X, 1-y)
cv_model.best_params_

{'max_samples': 5000}

In [8]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, -pred_tuned)

0.7392743305207337

Checking EIF library. Note that it doesn't have default arguments, so it is only evaluated through the grid search:

In [9]:
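### The EIF library is not scikit-learn-compatible, so it needs a wrapper
from sklearn.base import BaseEstimator
class EIF_sk_compat(BaseEstimator):
    # BaseEstimator supplies get_params/set_params, which GridSearchCV uses
    # to clone the estimator and set each hyperparameter combination
    def __init__(self, sample_size=256, ntrees=100, seed=1, ExtensionLevel=0):
        self.sample_size = sample_size
        self.ntrees = ntrees
        self.seed = seed
        self.ExtensionLevel = ExtensionLevel
    def fit(self, X, y=None):
        # eif builds the forest in the constructor; y is accepted only for API compatibility
        self._model = iForest(X, ntrees=self.ntrees, sample_size=self.sample_size,
                              ExtensionLevel=self.ExtensionLevel, seed=self.seed)
        return self
    def decision_function(self, X):
        # Anomaly scores from eif: higher means more anomalous
        return self._model.compute_paths(X_in=X)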
    
params_try = {
    "sample_size" : [256, 1024, 5000],
    "ExtensionLevel" : [0, 1]
}
cv_model = GridSearchCV(estimator=EIF_sk_compat(),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X,y)
cv_model.best_params_

{'ExtensionLevel': 1, 'sample_size': 5000}

In [10]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, pred_tuned)

0.7142364915048116

<a id="p2"></a>
## Annthyroid (7200 rows, 6 columns)

In [11]:
annthyroid = loadmat("annthyroid.mat")
X = np.asfortranarray(annthyroid["X"]).astype(np.float64)
y = annthyroid["y"].astype(np.float64).reshape(-1)
X.shape

(7200, 6)

Checking isotree library:

In [12]:
pred_default = IsolationForestIsoTree().fit_predict(X)
roc_auc_score(y, pred_default)

0.8001538917936738

In [13]:
params_try = {
    "sample_size" : [256, 1024, 5000],
    "prob_pick_pooled_gain" : [0, 1],
    "weigh_by_kurtosis" : [True, False],
}
cv_model = GridSearchCV(estimator=IsolationForestIsoTree(ntrees=100, ndim=2,
                                                         penalize_range=False,
                                                         missing_action="fail",
                                                         random_seed=1),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X,y)
cv_model.best_params_

{'prob_pick_pooled_gain': 1, 'sample_size': 256, 'weigh_by_kurtosis': True}

In [14]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, pred_tuned)

0.9817450846208218

Checking scikit-learn library:

In [15]:
pred_default = IsolationForestSKL(n_jobs=-1, random_state=1).fit(X).decision_function(X)
roc_auc_score(y, -pred_default)

0.836075461478732

In [16]:
params_try = {
    "max_samples" : [256, 1024, 5000],
}
cv_model = GridSearchCV(estimator=IsolationForestSKL(n_estimators=100, random_state=1, n_jobs=-1),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X, 1-y)
cv_model.best_params_

{'max_samples': 256}

In [17]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, -pred_tuned)

0.836075461478732

Checking EIF library - note that it doesn't have default arguments:

In [18]:
params_try = {
    "sample_size" : [256, 1024, 5000],
    "ExtensionLevel" : [0, 1]
}
cv_model = GridSearchCV(estimator=EIF_sk_compat(),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X,y)
cv_model.best_params_

{'ExtensionLevel': 0, 'sample_size': 256}

In [19]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, pred_tuned)

0.807920117854482

<a id="p3"></a>
## Pendigits (6870 rows, 16 columns)

In [20]:
pendigits = loadmat("pendigits.mat")
X = np.asfortranarray(pendigits["X"]).astype(np.float64)
y = pendigits["y"].astype(np.float64).reshape(-1)
X.shape

(6870, 16)

Checking isotree library:

In [21]:
pred_default = IsolationForestIsoTree().fit_predict(X)
roc_auc_score(y, pred_default)

0.9393651230112356

In [22]:
params_try = {
    "sample_size" : [256, 1024, 5000],
    "prob_pick_pooled_gain" : [0, 1],
    "weigh_by_kurtosis" : [True, False],
}
cv_model = GridSearchCV(estimator=IsolationForestIsoTree(ntrees=100, ndim=2,
                                                         penalize_range=False,
                                                         missing_action="fail",
                                                         random_seed=1),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X,y)
cv_model.best_params_

{'prob_pick_pooled_gain': 1, 'sample_size': 256, 'weigh_by_kurtosis': False}

In [23]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, pred_tuned)

0.9634517999129261

Checking scikit-learn library:

In [24]:
pred_default = IsolationForestSKL(n_jobs=-1, random_state=1).fit(X).decision_function(X)
roc_auc_score(y, -pred_default)

0.9562758262490165

In [25]:
params_try = {
    "max_samples" : [256, 1024, 5000],
}
cv_model = GridSearchCV(estimator=IsolationForestSKL(n_estimators=100, random_state=1, n_jobs=-1),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X, 1-y)
cv_model.best_params_

{'max_samples': 1024}

In [26]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, -pred_tuned)

0.9485098111103473

Checking EIF library - note that it doesn't have default arguments:

In [27]:
params_try = {
    "sample_size" : [256, 1024, 5000],
    "ExtensionLevel" : [0, 1]
}
cv_model = GridSearchCV(estimator=EIF_sk_compat(),
                        param_grid=params_try,
                        scoring="roc_auc", refit=True,
                        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1))
cv_model.fit(X,y)
cv_model.best_params_

{'ExtensionLevel': 0, 'sample_size': 1024}

In [28]:
pred_tuned = cv_model.decision_function(X)
roc_auc_score(y, pred_tuned)

0.9579447461484994