Same category is always imputed when enough trees are grown #43

Closed
AnotherSamWilson opened this issue Aug 16, 2022 · 3 comments

@AnotherSamWilson

See this example:

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = pd.concat(load_iris(return_X_y=True, as_frame=True), axis=1)
iris["target"] = iris["target"].astype("category")

amp_iris = iris.copy()
na_where = {}
for c in iris.columns:
    na_where[c] = sorted(np.random.choice(amp_iris.shape[0], size=25, replace=False))
    amp_iris.loc[na_where[c], c] = np.nan  # np.NaN was removed in NumPy 2.0

# Only class 0 was imputed
from isotree import IsolationForest
imputer = IsolationForest(
    ntrees=100,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
t = "target"
imp_iris.loc[na_where[t], t].unique()

# With fewer trees, the imputation is much more accurate
imputer = IsolationForest(
    ntrees=10,
    build_imputer=True,
    ndim=1,
    missing_action="impute"
)
imp_iris = imputer.fit_transform(amp_iris)
(imp_iris.loc[na_where[t], t] == iris.loc[na_where[t], t]).mean()

Using 100 or more trees caused only the first class (0) to ever be imputed. Using only 10 trees usually makes the imputation much more accurate. I tried different max_depth values, but to no avail. Are there any obvious parameters I am missing that would make the categorical imputation more accurate?
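The two checks in the example (which classes were imputed, and how often they match the true values) can be wrapped in a small helper. This is only a sketch: `imputation_report` is a hypothetical name, not part of isotree, and it uses toy data so it runs without any dependencies.

```python
from collections import Counter


def imputation_report(true_values, imputed_values):
    """Summarize categorical imputations at the masked positions.

    Hypothetical helper for illustration; not part of isotree.
    """
    true_values = list(true_values)
    imputed_values = list(imputed_values)
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    if not true_values:
        raise ValueError("inputs must not be empty")
    # Count positions where the imputed category equals the original one.
    matches = sum(t == i for t, i in zip(true_values, imputed_values))
    return {
        "accuracy": matches / len(true_values),
        "imputed_counts": dict(Counter(imputed_values)),
    }


# Toy data mimicking the symptom: every masked value imputed as class 0.
report = imputation_report([0, 1, 2, 1], [0, 0, 0, 0])
print(report)
```

With the data from the example, the call would presumably be `imputation_report(iris.loc[na_where[t], t], imp_iris.loc[na_where[t], t])`. A single key in `imputed_counts` is the symptom described here.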

@david-cortes
Owner

Thanks again for the bug report. There is an issue in the calculations where some numbers become infinite, so in the meantime it is better not to use fit_transform.

@david-cortes
Owner

Should be fixed now in the latest version:

pip install -U isotree

@david-cortes
Owner

And by the way, the imputation is meant to be used alongside prob_pick_pooled_gain; chances are the results wouldn't be very good otherwise.
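As a rough sketch of that suggestion, the settings from the original example could be extended with prob_pick_pooled_gain. The value 1.0 is an assumption chosen for illustration, not a value recommended in this thread, and `make_imputer` is a hypothetical helper name.

```python
# Settings from the original example plus prob_pick_pooled_gain.
# Assumption: 1.0 is illustrative only; the thread does not give a value.
imputer_params = dict(
    ntrees=100,
    build_imputer=True,
    ndim=1,
    missing_action="impute",
    prob_pick_pooled_gain=1.0,
)


def make_imputer(params=imputer_params):
    """Build the model (hypothetical helper, not part of isotree)."""
    # Imported lazily so the settings can be inspected without isotree installed.
    from isotree import IsolationForest
    return IsolationForest(**params)


# Usage with the amp_iris frame from the example at the top of the issue:
# imp_iris = make_imputer().fit_transform(amp_iris)
```

Whether this improves accuracy on the iris example would need to be checked on an isotree version that includes the fix.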
