Same category is always imputed when enough trees are grown #43

AnotherSamWilson · 2022-08-16T19:39:39Z

See this example:

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = pd.concat(load_iris(return_X_y=True, as_frame=True), axis=1)
iris["target"] = iris["target"].astype("category")

amp_iris = iris.copy()
na_where = {}
for c in iris.columns:
    na_where[c] = sorted(np.random.choice(amp_iris.shape[0], size=25, replace=False))
    amp_iris.loc[na_where[c], c] = np.nan

# Only class 0 was imputed
from isotree import IsolationForest
imputer = IsolationForest(
    ntrees=100,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
t = "target"
imp_iris.loc[na_where[t], t].unique()

# With fewer trees, the imputation is much more accurate
imputer = IsolationForest(
    ntrees=10,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
(imp_iris.loc[na_where[t], t] == iris.loc[na_where[t], t]).mean()

With 100 or more trees, only the first class (0) is ever imputed. With only 10 trees, the imputation is usually much more accurate. I tried different values of max_depth, but to no avail. Is there an obvious parameter I am missing that would make the categorical imputation more accurate?
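To make the symptom easier to quantify, the imputed cells can be compared against the values that were blanked out. The helper below is a minimal sketch using only the standard library. The name `imputation_report` and the toy data are hypothetical and not part of isotree.

```python
from collections import Counter


def imputation_report(true_values, imputed_values):
    """Compare imputed cells against the values that were blanked out."""
    true_values = list(true_values)
    imputed_values = list(imputed_values)
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    if not true_values:
        raise ValueError("inputs must be non-empty")
    hits = sum(a == b for a, b in zip(true_values, imputed_values))
    return {
        "accuracy": hits / len(true_values),
        "imputed_counts": Counter(imputed_values),
        "true_counts": Counter(true_values),
    }


# Toy data mimicking the failure mode: every missing cell is given class 0
report = imputation_report([0, 1, 2, 1, 0, 2], [0, 0, 0, 0, 0, 0])
print(report["accuracy"])
print(report["imputed_counts"])
```

With the frames from the example above, the call would be `imputation_report(iris.loc[na_where[t], t], imp_iris.loc[na_where[t], t])`. A single key in `imputed_counts` indicates that the imputer has collapsed onto one category.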


david-cortes · 2022-08-16T20:08:22Z

Thanks again for the bug report. There is a bug in the calculations that causes some numbers to become infinite, so in the meantime it is better not to use fit_transform.
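One possible reading of this advice is to fit the model first and impute in a separate pass. The sketch below assumes that the bug is confined to the `fit_transform` code path, which the comment suggests but does not state. The helper name `fit_then_transform` is hypothetical. The values it returns may not match what a working `fit_transform` would produce, since that method imputes while the trees are being built.

```python
def fit_then_transform(model, X):
    """Fit the model, then impute in a separate pass.

    Assumption: `model` exposes fit() and transform(), as IsolationForest
    does when it is created with build_imputer=True.
    """
    model.fit(X)
    return model.transform(X)
```

With the objects from the original example, this would be called as `imp_iris = fit_then_transform(imputer, amp_iris)`.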

david-cortes · 2022-08-16T21:12:55Z

Should be fixed now in the latest version:

pip install -U isotree

david-cortes · 2022-08-16T21:15:57Z

By the way, the imputation is meant to be used together with prob_pick_pooled_gain. Chances are that the results won't be very good otherwise.
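As a rough sketch of that suggestion, the settings from the original example could be extended with `prob_pick_pooled_gain`. The value `1.0` is an assumption, since the thread does not say which value to use, and the name `imputer_params` is only illustrative.

```python
# Settings from the original example plus the suggested parameter.
# Assumption: 1.0 is used here for illustration only; tune it for real data.
imputer_params = dict(
    ntrees=100,
    build_imputer=True,
    ndim=1,
    missing_action="impute",
    prob_pick_pooled_gain=1.0,
)
```

After upgrading, these would be passed as `IsolationForest(**imputer_params).fit_transform(amp_iris)`.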

david-cortes closed this as completed in 40ccb42 Aug 16, 2022
