Stored predict_proba results in .x for intermediate estimators in ComponentGraph #2629

Merged: 5 commits into main, Aug 18, 2021

Conversation

christopherbunn (Contributor)

As part of the component graph changes necessary to support building our new ensembler (#1930), this PR stores the prediction probabilities (when available) for non-final estimators in a component graph. If predict_proba is not available or if the final estimator in a component graph is being evaluated, then predict is used.
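To make the behavior concrete, here is a minimal sketch of the selection logic described above. The helper name _estimator_output and its signature are made up for illustration; this is not the code added to component_graph.py in this PR.

import pandas as pd

def _estimator_output(estimator, X, is_final_component):
    """Return what gets stored as this component's `.x` output (illustrative only).

    Intermediate estimators pass along predict_proba when they support it;
    the final estimator, or any estimator without predict_proba, falls back
    to predict.
    """
    if not is_final_component and hasattr(estimator, "predict_proba"):
        return pd.DataFrame(estimator.predict_proba(X))
    return pd.DataFrame(estimator.predict(X))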

@christopherbunn christopherbunn changed the title Store predict_proba results in .x for intermediate estimators in ComponentGraph Stored predict_proba results in .x for intermediate estimators in ComponentGraph Aug 13, 2021
codecov bot commented Aug 13, 2021

Codecov Report

Merging #2629 (b849cc6) into main (0485d7c) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2629     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        298     298             
  Lines      27024   27086     +62     
=======================================
+ Hits       26980   27042     +62     
  Misses        44      44             
Impacted Files Coverage Δ
evalml/pipelines/component_graph.py 99.7% <100.0%> (+0.1%) ⬆️
...valml/tests/pipeline_tests/test_component_graph.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.9% <100.0%> (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@christopherbunn christopherbunn marked this pull request as ready for review August 16, 2021 15:03
@christopherbunn christopherbunn requested review from angela97lin and a team August 16, 2021 15:03
@@ -458,36 +458,43 @@ def test_component_graph_fit_features(
mock_X_t = pd.DataFrame(np.ones(pd.DataFrame(X).shape))
mock_fit_transform.return_value = mock_X_t
mock_fit.return_value = Estimator
mock_predict.return_value = pd.Series(y)
mock_predict_proba.return_value = pd.Series(y)
Contributor:
Should we make this a DataFrame instead, to mimic real output and make sure we handle an output with more than one column?

Contributor:

(oops, pressed add single comment too quickly!)

angela97lin (Contributor) left a comment:

LGTM! 😁

I had a suggestion: we should convert our mock_predict_proba to a DataFrame to more closely mimic real output. I commented directly on some but not all the places 😂

I'm curious if we have tests for calling _compute_features for a RegressionPipeline (via fit/predict/compute_features/or other public methods) and for MulticlassClassificationPipeline (since predict_proba would not drop a column).

Are there tests that already exist (and hence that test not erroring on this PR is confirmation) for multiclass/regression? Otherwise, it could be good to write some :)
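(For reference, a minimal sketch of the DataFrame mock being suggested, assuming the patched mock_predict_proba and the y array from the test fixtures above; the two-column shape mimics real binary predict_proba output.)

import pandas as pd

# One column per class; rows sum to 1, like real binary predict_proba output.
positive_proba = pd.Series(y).astype(float)
mock_predict_proba.return_value = pd.DataFrame(
    {0: 1 - positive_proba, 1: positive_proba}
)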

evalml/tests/pipeline_tests/test_pipelines.py (outdated review comment, resolved)
):
X, y = X_y_binary
mock_fit_transform.return_value = pd.DataFrame(X)
mock_predict.return_value = pd.Series(y)
mock_predict_proba.return_value = pd.Series(y)
Contributor:
Same comment as below, let's change this to a DataFrame?

):
X, y = X_y_binary
mock_imputer.return_value = pd.DataFrame(X)
mock_ohe.return_value = pd.DataFrame(X)
mock_en_predict.return_value = pd.Series(np.ones(X.shape[0]))
mock_rf_predict.return_value = pd.Series(np.zeros(X.shape[0]))
mock_en_predict_proba.return_value = pd.Series(np.ones(X.shape[0]))
Contributor:
Same comment, let's return dfs!

chukarsten (Contributor) left a comment:

I think this looks good. Addressing Angela's concerns, and perhaps changing the naming scheme, might be a good move.

@@ -802,8 +815,8 @@ def test_input_feature_names(example_graph):
"column_1_c",
]
assert input_feature_names["Logistic Regression"] == [
"Random Forest.x",
"Elastic Net.x",
"1_Random Forest.x",
Contributor:
Do we have any other thoughts on options for naming this? Prepending with the underscore is kind of weird, especially with the space after it. I think an appended suffix might look better, and adding something like a "col" to it might be a bit clearer.

christopherbunn (Contributor, Author):

I chose the column_name_component scheme since, to my knowledge, we aren't using underscores anywhere else in pipeline names. However, I don't have strong preferences for the naming convention and I'm definitely open to ideas. Your idea of something like [Col 0 Random Forest.x, Col 1 Random Forest.x, etc.] makes sense to me, and I'm good to move forward with it if there aren't any other suggestions.
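For illustration, a sketch of the [Col 0 Random Forest.x, ...] renaming style floated here; the helper name name_proba_columns is hypothetical and not part of this PR.

import pandas as pd

def name_proba_columns(proba, component_name):
    """Rename predict_proba output columns to the 'Col <i> <component>.x' style."""
    renamed = proba.copy()
    renamed.columns = [
        f"Col {i} {component_name}.x" for i in range(renamed.shape[1])
    ]
    return renamed

proba = pd.DataFrame({"a": [0.2, 0.7], "b": [0.8, 0.3]})
print(list(name_proba_columns(proba, "Random Forest").columns))
# ['Col 0 Random Forest.x', 'Col 1 Random Forest.x']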

christopherbunn (Contributor, Author):

> LGTM! 😁
>
> I had a suggestion: we should convert our mock_predict_proba to a DataFrame to more closely mimic real output. I commented directly on some but not all the places 😂

Thanks! I covered most of these test cases; lmk if I missed any, but we should be using DataFrames as the output for intermediate estimators in tests now.

> I'm curious if we have tests for calling _compute_features for a RegressionPipeline (via fit/predict/compute_features/or other public methods) and for MulticlassClassificationPipeline (since predict_proba would not drop a column).
>
> Are there tests that already exist (and hence that test not erroring on this PR is confirmation) for multiclass/regression? Otherwise, it could be good to write some :)

For RegressionPipeline, there's some coverage from tests that use the nonlinear_regression_pipeline_class fixture (like test_score_nonlinear_regression), since they use the example_regression_graph, which computes predict_proba for intermediate estimators. The same goes for MulticlassClassificationPipeline, nonlinear_multiclass_pipeline_class, and test_score_nonlinear_multiclass. I think the score() test case covers the fit/predict case for these pipeline types, but lmk if you think it would be worth having a more explicit case for them.

angela97lin (Contributor) left a comment:

LGTM, thanks for making the changes!

RE the output as DataFrames, it'd be nice if they had multiple columns, like the output of predict_proba usually would, but that's not a blocking change 😁

X, y = X_y_binary
mock_predict_proba.return_value = pd.DataFrame(y)
Contributor:

Bahaha, RE the comment to change the output to DataFrames: I'm curious about the case when it's a df with multiple columns! But not a big change :)

christopherbunn (Contributor, Author):

Oops I see 😅. I added the two test cases with different predict_proba outputs to address this.

@christopherbunn christopherbunn merged commit 7231a38 into main Aug 18, 2021
@freddyaboulton freddyaboulton deleted the predict_proba_component_graph branch May 13, 2022 15:02