# `ECOD` Trial Notebook

Let's get familiar with the `ECOD` model from `pyod` before implementing all the experiments with it.

In [30]:
import sys
sys.path.append("..")
import matplotlib.pyplot as plt
from utils_reboot.datasets import Dataset
from utils_reboot.utils import * 
from model_reboot.interpretability_module import *
import os
import numpy as np
import time 

`ECOD` is easy to use since it has only 2 input parameters:
- `contamination`: the expected proportion of outliers in the dataset (a fraction such as `0.1`, not a percentage)
- `n_jobs`: the number of parallel jobs to run for the model; here we leave it at its default value of 1
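
To make the role of `contamination` concrete, here is a minimal NumPy sketch. It assumes the usual `pyod` convention that samples scoring above the `(1 - contamination)` quantile of the training scores are labelled as outliers. The scores are synthetic and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic anomaly scores standing in for a detector's training scores
scores = rng.random(1000)

contamination = 0.1  # expected fraction of outliers, not a percentage

# Threshold at the (1 - contamination) quantile of the scores
threshold = np.quantile(scores, 1 - contamination)

# Samples scoring above the threshold are flagged as outliers (label 1)
labels = (scores > threshold).astype(int)

print(f"threshold: {threshold:.3f}, flagged: {labels.sum()} of {labels.size}")
```

In other words, `contamination` does not change the scores themselves, only where the cut between inliers and outliers is placed.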

## `ECOD` Experiments

In [31]:
dataset = Dataset('wine', path = '../data/real',feature_names_filepath='../data/')
dataset.drop_duplicates()

`scenario_1`

In [3]:
dataset.initialize_train_test()

In [5]:
dataset.X_train

array([[1.329e+01, 1.970e+00, 2.680e+00, ..., 1.070e+00, 2.840e+00,
        1.270e+03],
       [1.430e+01, 1.920e+00, 2.720e+00, ..., 1.070e+00, 2.650e+00,
        1.280e+03],
       [1.368e+01, 1.830e+00, 2.360e+00, ..., 1.230e+00, 2.870e+00,
        9.900e+02],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [4]:
I=ECOD()
I.fit(dataset.X_train)
scores=I.predict(dataset.X_test)

# Correlation Experiments 

Experiment to compute the correlation between `LFI` importance scores and Anomaly Scores

In [1]:
import sys
sys.path.append("..")
import matplotlib.pyplot as plt
from utils_reboot.datasets import Dataset
from utils_reboot.utils import * 
import os
import numpy as np
import time 



In [7]:
dataset=Dataset('wine',path='../data/real/')
dataset.drop_duplicates()
dataset.pre_process()

In [8]:
I=ExtendedIsolationForest(1,200)
I.fit(dataset.X_train)

## Compute Anomaly Scores 

In [11]:
scores=I.predict(dataset.X_test)
scores.shape

(129,)

## Compute `LFI` scores

In [13]:
lfi_scores=I.local_importances(dataset.X_test)
lfi_scores.shape

(129, 13)

### Compute the sum of each row in the `LFI` scores

In [15]:
# Compute the sum of each row in lfi_scores
lfi_scores_sum = np.sum(lfi_scores, axis=1)
lfi_scores_sum.shape

(129,)

#### N.B. → It makes sense

The intuition we started from seems to hold → the sample with the maximum sum of `LFI` scores is also the sample with the highest Anomaly Score, and the same is true for the minimum

In [17]:
# Find the index where lfi_scores_sum is the highest
index = np.argmax(lfi_scores_sum)
# Find the index where scores is the highest
index2 = np.argmax(scores)
print(f'Index of max lfi_scores_sum: {index}')
print(f'Index of max scores: {index2}')

Index of max lfi_scores_sum: 72
Index of max scores: 72


In [18]:
# Find the index where lfi_scores_sum is the lowest
index = np.argmin(lfi_scores_sum)
# Find the index where scores is the lowest
index2 = np.argmin(scores)
print(f'Index of min lfi_scores_sum: {index}')
print(f'Index of min scores: {index2}')

Index of min lfi_scores_sum: 99
Index of min scores: 99


## Compute Correlation

The correlation is about 0.96, which supports the intuition we started from → the larger the sum of a sample's feature importances, the higher its Anomaly Score

In [19]:
# Compute the correlation between scores and lfi_scores_sum
correlation = np.corrcoef(scores, lfi_scores_sum)
print(f'Correlation between scores and lfi_scores_sum: {correlation[0,1]}')

Correlation between scores and lfi_scores_sum: 0.9621052097500566


In [22]:
correlation

array([[1.        , 0.96210521],
       [0.96210521, 1.        ]])
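
For reference, `np.corrcoef` returns the matrix of Pearson correlation coefficients. Its off-diagonal entry is the quantity below, where $s_i$ is the Anomaly Score and $\ell_i$ the summed `LFI` score of test sample $i$ (these symbols are introduced only for this note):

```latex
r = \frac{\sum_{i=1}^{n} (s_i - \bar{s})(\ell_i - \bar{\ell})}
         {\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2}\,\sqrt{\sum_{i=1}^{n} (\ell_i - \bar{\ell})^2}}
```

The diagonal entries are 1 because each variable is perfectly correlated with itself, and the matrix is symmetric because $r$ does not depend on the order of the two variables. Note that $r$ only measures linear association.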

### Check `corr_exp.pickle`

In [28]:
basepath=os.path.dirname(os.getcwd())
cor_path=os.path.join(basepath,'utils_reboot','corr_exp.pickle')
cor_dict=open_element(cor_path)

In [29]:
cor_dict.keys()

dict_keys(['EXIFFI+', 'EXIFFI', 'DIFFI', 'IF_RandomForest', 'EIF_RandomForest', 'EIF+_RandomForest', 'RandomForest'])

In [26]:
cor_dict['EXIFFI+'].keys()

dict_keys(['Xaxis'])

In [27]:
cor_dict['EXIFFI+']['Xaxis']

[0.9284276502821726,
 0.936314335211612,
 0.9158891873347039,
 0.9364476213950881,
 0.9314721781297911,
 0.9289341147883154,
 0.918314225129243,
 0.9168395835342296,
 0.9356587095881654,
 0.9271933197199017]

## Test with `DIFFI` 

In [1]:
import sys
sys.path.append("..")
import matplotlib.pyplot as plt
from utils_reboot.datasets import Dataset
from utils_reboot.utils import * 
from model_reboot.interpretability_module import *
import os
import numpy as np
import time 



In [2]:
dataset=Dataset('wine',path='../data/real/')
dataset.drop_duplicates()
dataset.pre_process()

In [3]:
I=sklearn_IsolationForest(n_estimators=200)
I.fit(dataset.X_train)

Compute the Anomaly Scores with the Isolation Forest and the `LFI` scores with `DIFFI`, then check the correlation between the two

In [4]:
scores=I.predict(dataset.X_test)

In [5]:
lfi=np.zeros((dataset.X_test.shape[0],dataset.X_test.shape[1]))
for i in range(dataset.X_test.shape[0]):
    lfi[i],_=local_diffi(I,dataset.X_test[i,:])

lfi.shape

(129, 13)

In [6]:
lfi_sum=np.sum(lfi,axis=1)
lfi_sum.shape

(129,)

In [7]:
correlation = np.corrcoef(scores, lfi_sum)
print(f'Correlation between scores and lfi_scores_sum: {correlation[0,1]}')

Correlation between scores and lfi_scores_sum: 0.9406390411642097


### `RandomForest` Importance

The strategy used throughout the paper to compute the importance with `RandomForest` is to create a `RandomForestRegressor()` and fit it to the task of predicting the Anomaly Scores produced by the `AD` model we want to explain. Then we compute the feature importances with the `feature_importances_` attribute of the `RandomForestRegressor` model.

This is however only a `GFI` score. 

In [9]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(dataset.X_test, I.predict(dataset.X_test))
fi = rf.feature_importances_

In [11]:
fi

array([0.06552875, 0.02920797, 0.07393253, 0.0663468 , 0.19107696,
       0.1586969 , 0.02763913, 0.04929794, 0.1489197 , 0.04123027,
       0.04694226, 0.0642726 , 0.03690819])

#### N.B. `RandomForest` `LFI` Score

One possible idea to compute the `LFI` scores with the `RandomForest` surrogate model is to fit the model only on a single sample and then compute the feature importances. This way we can get the `LFI` scores for a single sample.

In [28]:
rf = RandomForestRegressor()
rf.fit(dataset.X_test[10,:].reshape(1,-1), [I.predict(dataset.X_test)[10]])
fi = rf.feature_importances_
print(f'Local Feature Importance: {fi}')

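Local Feature Importance: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

All importances are zero because a tree fitted on a single sample has no split to make, so `feature_importances_` carries no information. Fitting the surrogate on one sample therefore cannot provide `LFI` scores.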


In [25]:
dataset.X_test[0,:].reshape(1,-1)

array([[ 0.76126556, -0.39331144,  1.22362398, -1.20140666,  0.30838901,
         1.5701678 ,  1.73351092, -0.63975504,  0.33709255,  0.42461294,
         0.61469876,  0.65569271,  2.92914239]])

In [26]:
[I.predict(dataset.X_test)[0]]

[0.4777249886806976]

## `ECOD` Feature Importance



In [1]:
import sys
sys.path.append("..")
import matplotlib.pyplot as plt
from utils_reboot.datasets import Dataset
from utils_reboot.utils import * 
from model_reboot.interpretability_module import *
import os
import numpy as np
import time 



In [2]:
dataset = Dataset('breastw', path = '../data/real',feature_names_filepath='../data/')
dataset.drop_duplicates()
dataset.initialize_train_test()

In [3]:
I=ECOD()
I.fit(dataset.X_train)
scores=I.predict(dataset.X_test)

In [4]:
dataset.shape

(449, 9)

Test Global Importance

In [5]:
gfi=I.ecod_global_importance(dataset.X_test)
gfi.shape

(449, 9)

In [6]:
gfi

array([[0.37449071, 0.55697087, 0.54058565, ..., 0.28729595, 0.45068159,
        0.24038257],
       [0.37449071, 0.4592484 , 0.41984699, ..., 0.28729595, 0.42203224,
        0.24038257],
       [0.5145696 , 0.55697087, 0.54058565, ..., 0.28729595, 0.45068159,
        0.24038257],
       ...,
       [0.37449071, 1.        , 1.        , ..., 0.48035235, 1.        ,
        0.31846463],
       [0.44556845, 0.70386223, 0.50540924, ..., 1.        , 0.54996113,
        0.24038257],
       [0.44556845, 0.70386223, 0.67821533, ..., 1.        , 0.48568515,
        0.24038257]])

## N.B. 

- Attribute `O` → Outlier score for each feature → take only the first half of the scores (in the cells below, `I.O[0]` and `I.O[449]` coincide)
- Formula for feature importance
  - For each feature `f`
    - Compute the distance between a sample's outlier score on feature `f` and the 99th percentile of the outlier scores of feature `f`
    - The importance score is `1 / (1 + distance)` → in this way the closer a sample is to the 99th percentile the higher the importance score
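
Written out, and consistent with the implementation at the end of this notebook, the weight can be expressed as follows. Here $O_{i,f}$ is the outlier score of sample $i$ on feature $f$ and $q_f$ is the 99th percentile of column $f$ of `O` (the symbols $w$ and $q$ are introduced here only for illustration):

```latex
w_{i,f} = \frac{1}{1 + \left(q_f - O_{i,f}\right)},
\qquad
q_f = \operatorname{quantile}_{0.99}\left(O_{\cdot,f}\right)
```

So $w_{i,f} = 1$ when the score sits exactly at the 99th percentile, and it decreases toward 0 as the score falls further below it.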

In [24]:
I.O.shape

(898, 9)

In [25]:
I.O[0]

array([0.50490407, 1.13720959, 1.23182556, 0.97122445, 0.85474946,
       0.95952841, 0.63055934, 0.79381691, 0.30793023])

In [28]:
I.O[449]

array([0.50490407, 1.13720959, 1.23182556, 0.97122445, 0.85474946,
       0.95952841, 0.63055934, 0.79381691, 0.30793023])

In [29]:
imp_df=pd.DataFrame(I.O)
imp_df.describe()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 898.0 | 898.0 | 898.0 | 898.0 | 898.0 | 898.0 | 898.0 | 898.0 | 898.0 |
| mean | 1.275273 | 1.167302 | 1.206074 | 1.13432 | 1.222775 | 1.000107 | 1.295681 | 1.069922 | 0.771913 |
| std | 0.55916 | 0.397239 | 0.428647 | 0.450031 | 0.706675 | 0.182251 | 0.635557 | 0.456634 | 0.859938 |
| min | 0.504904 | 0.634752 | 0.686488 | 0.686488 | 0.554063 | 0.668944 | 0.630559 | 0.643191 | 0.30793 |
| 25% | 0.925239 | 0.875914 | 0.914066 | 0.947968 | 0.854749 | 0.959528 | 0.759915 | 0.793817 | 0.30793 |
| 50% | 1.231826 | 1.13721 | 1.231826 | 0.971224 | 0.854749 | 0.959528 | 1.130289 | 0.793817 | 0.30793 |
| 75% | 1.872916 | 1.336338 | 1.319531 | 1.336338 | 1.472294 | 1.239488 | 1.830357 | 1.370824 | 1.327899 |
| max | 2.175197 | 1.932636 | 2.081671 | 2.09969 | 2.848926 | 1.239488 | 3.111291 | 2.012678 | 3.467966 |


In [33]:
I.O - np.quantile(I.O, 0.99, axis=0)

array([[-1.67029319, -0.79542603, -0.84984563, ..., -2.48073128,
        -1.21886142, -3.16003532],
       [-1.67029319, -1.17747086, -1.38182008, ..., -2.48073128,
        -1.36948724, -3.16003532],
       [-0.94337169, -0.79542603, -0.84984563, ..., -2.48073128,
        -1.21886142, -3.16003532],
       ...,
       [-1.67029319,  0.        ,  0.        , ..., -1.08180517,
         0.        , -2.14006616],
       [-1.2443241 , -0.42073258, -0.97859462, ...,  0.        ,
        -0.81831032, -3.16003532],
       [-1.2443241 , -0.42073258, -0.47445798, ...,  0.        ,
        -1.05894703, -3.16003532]])

### FINAL FORMULA TO IMPLEMENT

In [36]:
imp_weight=1/(1+(np.quantile(I.O, 0.99, axis=0) - I.O))
imp_weight[70]

array([0.37449071, 0.43518309, 0.54058565, 0.41669469, 0.3339816 ,
       0.7812744 , 0.33545773, 0.45068159, 0.24038257])
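
A self-contained sketch of the same formula on synthetic data can help check its behaviour. The matrix `O` below is random and only stands in for `I.O`. The clipping step is an assumption added here, not part of the notebook's formula: it guards against scores above the 99th percentile, where the denominator would otherwise drop below 1 (and reach 0 when a score exceeds the percentile by exactly 1).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic per-feature outlier scores standing in for I.O (rows: samples)
O = rng.exponential(scale=1.0, size=(200, 5))

# 99th percentile of each feature's outlier scores
q99 = np.quantile(O, 0.99, axis=0)

# Distance below the percentile; clipped at 0 so that scores above it
# are treated as if they sat exactly on it (assumed guard, see lead-in)
distance = np.clip(q99 - O, 0.0, None)

# Same formula as in the notebook: weight in (0, 1], equal to 1 at distance 0
imp_weight = 1.0 / (1.0 + distance)

print(imp_weight.shape)
print(imp_weight.min(), imp_weight.max())
```

Dropping the `np.clip` call recovers the notebook's expression unchanged.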