Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] _DummyModel's predict function should be able to predict on dataset.data #2202

Closed
OlivierBinette opened this issue Dec 17, 2022 · 3 comments
Labels
bug needs triage Issue needs to be labeled and prioritized

Comments

@OlivierBinette
Copy link
Contributor

Describe the bug

A _DummyModel created on a given dataset is not able to predict on dataset.data. It can only predict on dataset.features_columns.

This behavior is inconsistent with sklearn models which can predict on any dataframe that contains the necessary columns, even if there are extraneous ones. Data validation in _DummyModel's predict function should be more forgiving.

This can cause issues in checks where y_pred is passed instead of a model and the run_logic expects the context's model (which is a dummy model) to be able to predict on dataset.data like a sklearn model would.

To Reproduce

from deepchecks.tabular.context import _DummyModel
from deepchecks.tabular.datasets.regression import wine_quality

dataset = wine_quality.load_data()[0]
model = wine_quality.load_fitted_model()
y_pred = model.predict(dataset.data)

dummy =_DummyModel(train = dataset, test=None, y_pred_train=y_pred)

dummy.predict(dataset.features_columns)
## Works as expected

dummy.predict(dataset.data)
## raise DeepchecksValueError('Data that has not been seen before passed for inference with static '
##                            'predictions. Pass a real model to resolve this')

Expected behavior

Dummy model should be able to predict on dataset.data, and on any dataframe which contains the necessary columns.

Environment (please complete the following information):

  • OS: Linux (Ubuntu 22)
  • Python Version: 3.10
  • Deepchecks Version: 0.10.0
@github-actions github-actions bot added the needs triage Issue needs to be labeled and prioritized label Dec 17, 2022
@noamzbr
Copy link
Collaborator

noamzbr commented Dec 18, 2022

Thanks for noticing that @OlivierBinette!
Indeed sklearn models can predict when additional (unneeded) columns are present, but it's an ability we do not require from user models (see our model guide). The user may pass a custom model, and in that case it's not required to be able to predict when additional columns are present.

In that light, all internal calls to model.predict() in the tabular package always pass dataset.feature_columns, rather than dataset.data, and that's why we never run into an issue with _DummyModel only working on the features (and actually it is helpful for internal development that errors are raised in tests when model.predict(dataset.data) is mistakenly used).

Given that it's an internal mechanism, is there a case in your opinion that this behavior may cause a bug?

@OlivierBinette
Copy link
Contributor Author

I think the current behavior could cause a confusing bug for a user defining a custom check. If the run logic uses context.model.predict(dataset.data), this will work when a sklearn model is passed to the check but throw an error when y_pred is passed to the check.

So from a end-user perspective it might be more convenient for the dummy model to be able to predict from dataset.data. But I see the point that it's not an issue from a development perspective.

@noamzbr
Copy link
Collaborator

noamzbr commented Dec 18, 2022

I think we still want to keep the current behavior (on the contrary - because it keeps us from assuming this about models), but we can solve your concern by making the error in this case more informative.
Closing this issue in place of this new one.

@noamzbr noamzbr closed this as completed Dec 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs triage Issue needs to be labeled and prioritized
Projects
None yet
Development

No branches or pull requests

2 participants