
Raise ValueError when predict/predict_proba input types don't match fit input #3036

Merged
merged 27 commits into from
Nov 17, 2021

Conversation

bchen1116
Contributor

@bchen1116 bchen1116 commented Nov 11, 2021

fix #2855

Running

import pandas as pd
from evalml import AutoMLSearch
import logging
logging.basicConfig(level=logging.ERROR)

path = "/Users/bryan.chen/Downloads/string_nan_ex/1625078186889-mushroom_subset.csv"

df = pd.read_csv(path)
y_train = df['class']
X_train = df.drop('class', axis=1)

aml = AutoMLSearch(X_train, y_train, 'binary', verbose=False)
aml.search()


pipeline = aml.best_pipeline
holdout = pd.read_csv("/Users/bryan.chen/Downloads/string_nan_ex/mushroom_holdout.csv")
pipeline.predict(holdout)

now results in:

[screenshot: the new ValueError is raised, reporting that the input X data types differ from the types the pipeline was fitted on]
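The behavior this PR adds can be sketched in plain Python. `PipelineSketch` below is hypothetical and uses a `{column: type-name}` dict to stand in for a woodwork schema; the real check lives in `ComponentGraph._transform_features`:

```python
# Minimal sketch (hypothetical class, not evalml's API): record the
# input types at fit time and compare them on predict.
class PipelineSketch:
    def fit(self, X_types):
        # Remember the input types seen at fit time.
        self._input_types = dict(X_types)
        return self

    def predict(self, X_types):
        # Raise if predict-time types differ from fit-time types.
        if dict(X_types) != self._input_types:
            raise ValueError(
                "Input X data types are different from the input types "
                "the pipeline was fitted on."
            )
        return "predictions"

pipeline = PipelineSketch().fit({"cap-shape": "Categorical"})
print(pipeline.predict({"cap-shape": "Categorical"}))  # predictions
try:
    pipeline.predict({"cap-shape": "Double"})
except ValueError as err:
    print(err)
```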

@bchen1116 bchen1116 self-assigned this Nov 11, 2021
@codecov

codecov bot commented Nov 11, 2021

Codecov Report

Merging #3036 (76f733b) into main (d6682a4) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3036     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%     
=======================================
  Files        312     312             
  Lines      30340   30421     +81     
=======================================
+ Hits       30249   30330     +81     
  Misses        91      91             
Impacted Files                                         Coverage Δ
evalml/utils/__init__.py                               100.0% <ø> (ø)
evalml/pipelines/component_graph.py                    99.8% <100.0%> (+0.1%) ⬆️
...valml/tests/pipeline_tests/test_component_graph.py  99.9% <100.0%> (+0.1%) ⬆️
evalml/tests/pipeline_tests/test_pipelines.py          99.8% <100.0%> (+0.1%) ⬆️
evalml/tests/utils_tests/test_woodwork_utils.py        100.0% <100.0%> (ø)
evalml/utils/woodwork_utils.py                         100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d6682a4...76f733b.

@@ -642,7 +643,8 @@ def test_score_nonlinear_regression(

@patch("evalml.pipelines.BinaryClassificationPipeline.fit")
@patch("evalml.pipelines.components.Estimator.predict")
@patch("evalml.pipelines.component_graph._schema_is_equal", return_value=True)
def test_score_binary_single(mock_predict, mock_fit, X_y_binary):
Contributor Author
Need to add this patch when we mock pipeline fit so that the ValueError isn't raised.
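A stdlib-only sketch of why the extra patch is needed: when `fit` is mocked, the fitted input types are never recorded, so the schema comparison has to be patched to pass. `demo_cg` below is a stand-in module, not evalml:

```python
# Stand-in module for the schema check (hypothetical; mirrors the idea
# of evalml.pipelines.component_graph._schema_is_equal).
import sys
import types
from unittest import mock

demo_cg = types.ModuleType("demo_cg")
demo_cg._schema_is_equal = lambda first, other: first == other
sys.modules["demo_cg"] = demo_cg

def transform(X_schema, fitted_schema):
    # Mirrors the new guard in _transform_features.
    if not demo_cg._schema_is_equal(X_schema, fitted_schema):
        raise ValueError("Input X data types are different from fit.")
    return "ok"

# With fit mocked out, fitted_schema stays None and the guard would
# raise; patching the comparison to return True lets the test proceed.
with mock.patch("demo_cg._schema_is_equal", return_value=True):
    print(transform({"a": "Double"}, None))  # ok
```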

if first.types.index.tolist() != other.types.index.tolist():
    return False
logical = [
    x if x != "Integer" else "Double"
Contributor Author

After discussion with @freddyaboulton and @angela97lin, we decided to treat Integer and Double as the same logical type until @chukarsten's work on NullableInteger goes in. At that point, we can revisit this and see how best to accommodate it.
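The coercion can be sketched standalone. The helper below is hypothetical and works on `{column: logical-type-name}` dicts rather than actual woodwork schemas:

```python
def schema_is_equal_sketch(first, other):
    """Compare two {column: logical type name} mappings, treating
    Integer and Double as the same logical type."""
    def normalize(schema):
        # Coerce Integer to Double on both sides before comparing.
        return {
            col: "Double" if ltype == "Integer" else ltype
            for col, ltype in schema.items()
        }
    return normalize(first) == normalize(other)

print(schema_is_equal_sketch({"a": "Integer"}, {"a": "Double"}))   # True
print(schema_is_equal_sketch({"a": "Integer"}, {"a": "Boolean"}))  # False
```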

@@ -378,6 +384,14 @@ def _transform_features(
        dict: Outputs from each component.
        """
        X = infer_feature_types(X)
        if not fit:
            if not _schema_is_equal(X.ww.schema, self._input_types):
Contributor

@bchen1116 I think we can get rid of `_schema_is_equal` and let woodwork do the comparison for us if we use `exclude`. Something like this:

        if not fit:
            if X.ww.select(exclude=['integer'], return_schema=True) != self._input_types:
                raise ValueError(
                    "Input X data types are different from the input types the pipeline was fitted on."
                )
        else:
            self._input_types = X.ww.select(exclude=['integer'], return_schema=True)

Contributor Author

@bchen1116 commented Nov 15, 2021

@freddyaboulton I had to implement this because the schema includes logical type objects (i.e. Categorical(), Integer()), and the equality check fails here when those objects are different instances.
[screenshot: schema equality returning False for otherwise matching schemas]
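The pitfall can be illustrated in plain Python (not woodwork itself; whether a class defines `__eq__` determines whether equivalent instances compare equal):

```python
# A class without __eq__ falls back to identity comparison, so two
# equivalent instances are not equal.
class Categorical:
    pass

print(Categorical() == Categorical())  # False: different instances

# Defining __eq__ (here: equal if same type) restores the comparison
# that schema equality would need.
class CategoricalEq:
    def __eq__(self, other):
        return type(self) is type(other)

print(CategoricalEq() == CategoricalEq())  # True
```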

Contributor

Wild, thanks for explaining! Can we add a unit test for this scenario? The unit tests on this branch pass with the schema equality method above. I think we should also file a woodwork issue; I feel like schema equality should work in this case.

Contributor Author

Can add a test for this case! I brought it up to the woodwork team, but it seems like this is something they want to keep.

Contributor

I looked into this yesterday and I think it's because the DateTime format is None in X2 in the repro you shared.

I will try to come up with a minimal repro and share it with the ww team. This ends up working:

from evalml.demos import load_fraud

X, y = load_fraud(1000)

X2 = X.ww.copy()

X.ww.schema == X2.ww.schema

Contributor

@freddyaboulton left a comment

Thank you @bchen1116 ! I am excited to get this out the door, hopefully it'll help debug some tricky problems that can happen between fit and predict.

evalml/utils/woodwork_utils.py (outdated review thread, resolved)
@bchen1116 bchen1116 merged commit 401457c into main Nov 17, 2021
@chukarsten chukarsten mentioned this pull request Nov 29, 2021
@freddyaboulton freddyaboulton deleted the bc_2855_types branch May 13, 2022 15:03