Describe the bug
A _DummyModel created on a given dataset is not able to predict on dataset.data. It can only predict on dataset.features_columns.
This behavior is inconsistent with sklearn models which can predict on any dataframe that contains the necessary columns, even if there are extraneous ones. Data validation in _DummyModel's predict function should be more forgiving.
This can cause issues in checks where y_pred is passed instead of a model and the run_logic expects the context's model (which is a dummy model) to be able to predict on dataset.data like a sklearn model would.
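To illustrate what "more forgiving" validation could look like, here is a minimal, self-contained sketch of a prediction-lookup model that tolerates extraneous columns by selecting only the known features before looking up the stored predictions. `ForgivingDummyModel` is a hypothetical name for illustration, not the deepchecks implementation:

```python
import pandas as pd

class ForgivingDummyModel:
    """Sketch (hypothetical, not the deepchecks _DummyModel) of a static-
    prediction model whose validation tolerates extraneous columns."""

    def __init__(self, features: pd.DataFrame, y_pred):
        self.feature_names = list(features.columns)
        # Static predictions keyed by the row index they were computed for.
        self._predictions = pd.Series(list(y_pred), index=features.index)

    def predict(self, data: pd.DataFrame):
        missing = set(self.feature_names) - set(data.columns)
        if missing:
            raise ValueError(f'Missing feature columns: {sorted(missing)}')
        # Ignore extraneous columns (e.g. the label column in dataset.data),
        # mirroring how an sklearn pipeline would only consume its features.
        data = data[self.feature_names]
        return self._predictions.loc[data.index].to_numpy()
```

Under this scheme, passing the full dataframe (features plus label) would return the same predictions as passing only the feature columns.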
To Reproduce
```python
from deepchecks.tabular.context import _DummyModel
from deepchecks.tabular.datasets.regression import wine_quality

dataset = wine_quality.load_data()[0]
model = wine_quality.load_fitted_model()
y_pred = model.predict(dataset.data)

dummy = _DummyModel(train=dataset, test=None, y_pred_train=y_pred)

dummy.predict(dataset.features_columns)
## Works as expected

dummy.predict(dataset.data)
## raise DeepchecksValueError('Data that has not been seen before passed for inference with static '
##                            'predictions. Pass a real model to resolve this')
```
Expected behavior
Dummy model should be able to predict on dataset.data, and on any dataframe which contains the necessary columns.
Environment (please complete the following information):
- OS: Linux (Ubuntu 22)
- Python Version: 3.10
- Deepchecks Version: 0.10.0
Thanks for noticing that @OlivierBinette!
Indeed sklearn models can predict when additional (unneeded) columns are present, but it's an ability we do not require from user models (see our model guide). The user may pass a custom model, and in that case it's not required to be able to predict when additional columns are present.
In that light, all internal calls to model.predict() in the tabular package pass dataset.features_columns rather than dataset.data, which is why we never run into an issue with _DummyModel only working on the features. In fact, it is helpful for internal development that tests raise errors when model.predict(dataset.data) is mistakenly used.
Given that it's an internal mechanism, is there a case in your opinion that this behavior may cause a bug?
I think the current behavior could cause a confusing bug for a user defining a custom check. If the run logic uses context.model.predict(dataset.data), this will work when a sklearn model is passed to the check but throw an error when y_pred is passed to the check.
So from an end-user perspective it might be more convenient for the dummy model to be able to predict from dataset.data. But I see the point that it's not an issue from a development perspective.
I think we still want to keep the current behavior (on the contrary, because it keeps us from assuming this ability of models), but we can address your concern by making the error in this case more informative.
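The suggestion above, keeping strict validation but raising a clearer error, could look roughly like this sketch. `StrictDummyModel` and the exact message wording are hypothetical, not the deepchecks code:

```python
import pandas as pd

class StrictDummyModel:
    """Sketch (hypothetical, not the deepchecks _DummyModel) that keeps
    strict column validation but raises a more informative error."""

    def __init__(self, features: pd.DataFrame, y_pred):
        self.feature_names = list(features.columns)
        self._predictions = list(y_pred)

    def predict(self, data: pd.DataFrame):
        extra = set(data.columns) - set(self.feature_names)
        if extra:
            # Name the offending columns and the fix, instead of a generic
            # "data that has not been seen before" message.
            raise ValueError(
                'Static predictions were supplied instead of a model, so '
                'inference is only possible on the exact feature columns. '
                f'Unexpected columns: {sorted(extra)}. Pass '
                'dataset.features_columns, or a real model, to resolve this.')
        # Return the stored static predictions for the seen data.
        return list(self._predictions)
```

A user who calls predict on dataset.data would then see exactly which columns tripped the validation and what to pass instead.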
Closing this issue in place of this new one.