Stored predict_proba results in .x for intermediate estimators in ComponentGraph #2629
christopherbunn merged 5 commits into main
Codecov Report

@@           Coverage Diff           @@
##            main   #2629   +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%
=======================================
  Files        298     298
  Lines      27024   27086     +62
=======================================
+ Hits       26980   27042     +62
  Misses        44      44
Force-pushed from 84dc8e3 to fbfe8c7
    mock_fit_transform.return_value = mock_X_t
    mock_fit.return_value = Estimator
    mock_predict.return_value = pd.Series(y)
    mock_predict_proba.return_value = pd.Series(y)
Should we make this a DataFrame instead, to mimic real output and make sure we handle the case where the output has more than one column?
(oops, pressed add single comment too quickly!)
angela97lin
left a comment
LGTM! 😁
I had a suggestion: we should convert our mock_predict_proba to a DataFrame to more closely mimic real output. I commented directly on some but not all of the places 😂
I'm curious if we have tests for calling _compute_features for a RegressionPipeline (via fit/predict/compute_features/other public methods) and for MulticlassClassificationPipeline (since predict_proba would not drop a column).
Are there tests that already exist for multiclass/regression (in which case those tests not erroring on this PR is confirmation)? Otherwise, it could be good to write some :)
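To illustrate the reviewer's suggestion, here is a hypothetical sketch (the variable names are illustrative, not from the PR) of the difference between mocking predict_proba as a Series versus as a multi-column DataFrame: only the DataFrame form exercises the path where the output has one probability column per class.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=10)

# A Series-shaped mock hides multi-column handling entirely:
series_mock = pd.Series(y)

# A DataFrame with one probability column per class is closer to real
# predict_proba output for a binary problem:
proba = np.column_stack([1 - y, y]).astype(float)
df_mock = pd.DataFrame(proba, columns=[0, 1])

assert series_mock.ndim == 1
assert df_mock.shape == (10, 2)               # one column per class
assert np.allclose(df_mock.sum(axis=1), 1.0)  # rows are valid probabilities
```

A test asserting on df_mock would catch code that silently assumes a single column.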
    X, y = X_y_binary
    mock_fit_transform.return_value = pd.DataFrame(X)
    mock_predict.return_value = pd.Series(y)
    mock_predict_proba.return_value = pd.Series(y)
Same comment as below, let's change this to a DataFrame?
    mock_ohe.return_value = pd.DataFrame(X)
    mock_en_predict.return_value = pd.Series(np.ones(X.shape[0]))
    mock_rf_predict.return_value = pd.Series(np.zeros(X.shape[0]))
    mock_en_predict_proba.return_value = pd.Series(np.ones(X.shape[0]))
Same comment, let's return dfs!
chukarsten
left a comment
I think this looks good; addressing Angela's concerns and perhaps changing the naming scheme might be a good move.
    assert input_feature_names["Logistic Regression"] == [
        "Random Forest.x",
        "Elastic Net.x",
        "1_Random Forest.x",
Do we have any other options for naming this? Prepending with the underscore is kinda weird, especially with the space after it. I think an appended suffix might look better, and adding "col" or something to it might be a bit clearer.
I chose to do column_name_component scheme since to my knowledge we aren't using underscores anywhere else for the pipeline name. However, in general I don't really have any strong preferences for what the naming convention should be and I'm definitely open to ideas. Your idea of something like [Col 0 Random Forest.x, Col 1 Random Forest.x, etc.] makes sense to me and I'm good to move forward with this if there aren't any other suggestions.
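The "Col 0 Random Forest.x" scheme discussed above could be sketched like this (an illustrative helper, not evalml's actual implementation — the function name is hypothetical):

```python
import pandas as pd

def proba_feature_names(component_name: str, proba: pd.DataFrame) -> list:
    """Name each stored probability column, e.g. 'Col 0 Random Forest.x'.

    One '.x' feature name is produced per predict_proba column, prefixed
    with 'Col <i>' so multi-column outputs stay unambiguous downstream.
    """
    return [f"Col {i} {component_name}.x" for i in range(proba.shape[1])]

# Two-class probability output from an intermediate estimator:
proba = pd.DataFrame([[0.7, 0.3], [0.2, 0.8]])
names = proba_feature_names("Random Forest", proba)
# names == ["Col 0 Random Forest.x", "Col 1 Random Forest.x"]
```

Using a readable prefix rather than a bare `0_`/`1_` underscore prefix avoids the awkward "number, underscore, space" sequence raised in the review.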
Thanks! I covered most of these test cases; lmk if I missed some, but we should be using DataFrames as the output for intermediate estimators in tests now.
Force-pushed from abd3947 to ca82cff
angela97lin
left a comment
LGTM, thanks for making the changes!
RE the output as DataFrames, it'd be nice if they had multiple columns like the output of predict_proba usually would, but not a blocking change 😁
- def test_predict(mock_predict, mock_fit, example_graph, X_y_binary):
+ def test_predict(mock_predict, mock_predict_proba, mock_fit, example_graph, X_y_binary):
      X, y = X_y_binary
      mock_predict_proba.return_value = pd.DataFrame(y)
Bahaha, RE the comment to change output to DataFrames: I'm curious about the case when it's a df with multiple columns! But not a big change :)
Oops I see 😅. I added the two test cases with different predict_proba outputs to address this.
Force-pushed from f19d66e to 65a54f0
Force-pushed from 65a54f0 to da45cfa
Force-pushed from da45cfa to b849cc6
As part of the component graph changes necessary to support building our new ensembler (#1930), this PR stores the prediction probabilities (when available) for non-final estimators in a component graph. If `predict_proba` is not available, or if the final estimator in the component graph is being evaluated, then `predict` is used.
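A minimal sketch of that fallback, assuming hypothetical helper and class names (not evalml's actual internals): intermediate estimators feed their predict_proba output forward when they support it; otherwise, or for the final estimator, predict is used.

```python
import numpy as np
import pandas as pd

def intermediate_output(estimator, X, is_final: bool) -> pd.DataFrame:
    """Return the features an estimator passes to downstream components.

    Non-final estimators with predict_proba forward one column per class;
    everything else (no predict_proba, or the final estimator) forwards
    a single column of predictions.
    """
    if not is_final and hasattr(estimator, "predict_proba"):
        return pd.DataFrame(estimator.predict_proba(X))
    return pd.DataFrame(estimator.predict(X))

class DummyClassifier:
    """Toy stand-in for a fitted sklearn-style binary classifier."""
    def predict(self, X):
        return np.zeros(len(X), dtype=int)
    def predict_proba(self, X):
        return np.tile([0.5, 0.5], (len(X), 1))

X = pd.DataFrame({"a": [1, 2, 3]})
est = DummyClassifier()
assert intermediate_output(est, X, is_final=False).shape == (3, 2)  # proba columns
assert intermediate_output(est, X, is_final=True).shape == (3, 1)   # predictions
```

This is also why the mocked predict_proba values in the tests above matter: the forwarded frame can have more than one column.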