Handle categorical columns in a separate chain for DefaultAlgorithm pipelines #2986
Codecov Report
@@ Coverage Diff @@
## main #2986 +/- ##
=======================================
+ Coverage 99.0% 99.7% +0.8%
=======================================
Files 312 312
Lines 29979 30137 +158
=======================================
+ Hits 29656 30041 +385
+ Misses 323 96 -227
"""{}"""
needs_fitting = False

def _check_input_for_columns(self, X):
Needed the same changes that I made to SelectColumns in #2944 due to the split in pipeline structure (sometimes the columns to be dropped are in separate chains or only in one chain).
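A minimal sketch of the "only drop what's available" behavior described in this comment. It uses plain Python lists rather than the actual component; `drop_available_columns` and the column names are hypothetical and exist only for illustration.

```python
def drop_available_columns(chain_columns, columns_to_drop):
    """Drop only the requested columns that this chain actually contains.

    Hypothetical helper: columns missing from the chain are skipped
    instead of raising an error.
    """
    present = set(chain_columns)
    dropped = [col for col in columns_to_drop if col in present]
    remaining = [col for col in chain_columns if col not in dropped]
    return remaining, dropped


# Assumed example data: two chains that each see a different subset of columns.
numeric_chain_columns = ["age", "income"]
categorical_chain_columns = ["city", "plan"]
requested = ["income", "city"]

numeric_remaining, numeric_dropped = drop_available_columns(
    numeric_chain_columns, requested
)
categorical_remaining, categorical_dropped = drop_available_columns(
    categorical_chain_columns, requested
)
```

Each chain drops only the requested column it holds, so the same request can be passed to both chains without either one failing on a missing column.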
)

def _make_pipeline_from_multiple_graphs(
Can you explain what the problem was with the final y input?
@freddyaboulton the implementation for the stacked ensembler assumes that each input pipeline has an estimator, so the final y input chosen for the stacked estimator is the y input going into each pipeline's estimator. This usually means that the label encoder or the original y is selected as the final y input. This implementation instead always chooses the last component that modifies y in the last pipeline. @christopherbunn could give a better explanation of how the stacked implementation works.
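A rough sketch of the selection rule described in this comment, assuming each pipeline is represented as an ordered list of `(component_name, modifies_y)` pairs. That representation, the `select_final_y_input` helper, and the component names are hypothetical; the actual component graph structure may differ.

```python
def select_final_y_input(pipelines):
    """Return the y input for the final estimator.

    Hypothetical rule: take the last component that modifies y in the
    last pipeline, and fall back to the original "y" if there is none.
    """
    last_pipeline = pipelines[-1]
    for name, modifies_y in reversed(last_pipeline):
        if modifies_y:
            return f"{name}.y"
    return "y"


# Assumed example: two chains, each starting with a label encoder.
categorical_chain = [
    ("Categorical Pipeline - Label Encoder", True),
    ("Categorical Pipeline - One Hot Encoder", False),
]
numeric_chain = [
    ("Numeric Pipeline - Label Encoder", True),
    ("Numeric Pipeline - Imputer", False),
]

final_y = select_final_y_input([categorical_chain, numeric_chain])
no_encoder_y = select_final_y_input([[("Imputer", False)]])
```

Under these assumptions the final y comes from the numeric chain's label encoder, and a chain with no y-modifying component falls back to the original target.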
@jeremyliweishih Looks good to me! Thank you for making the changes. I think this unblocks further development by getting rid of the pesky bugs from categorical features but I think we can improve the implementation regarding the two samplers and splitting numeric vs categorical further so I'm looking forward to the follow-up issues!
Also, I'm wondering if this code would be substantially simpler if we had a first-class API for combining pipelines. Maybe we need to prioritize that? Seems like we'll have to do it for data check actions? FYI @angela97lin @dsherry @chukarsten
…l into js_default_component_graph
bchen1116
left a comment
Haven't gone through the tests yet, but this is looking good so far! Left many nits and some comments on things that I think could be changed to make this more robust and efficient.
Hopefully can finish looking through the tests tomorrow.
bchen1116
left a comment
Thanks for making the changes! Looks good to me, left a question for my own understanding.
I remember there was some messy stuff with how the pipeline names are displayed in these longer split pipelines. Was there an issue filed to track fixing that for a future PR?
new_names[name] = old_names[component_name]
return new_names

def _rename_pipeline_parameters_custom_hyperparameters(self, pipelines):
"Select Columns Transformer": {"columns": self._selected_cat_cols}
}
categorical_pipeline = make_pipeline(
self.X,
add_result(algo, batch)

batch = algo.next_batch()
add_result(algo, batch)
do we need to do 2 calls for this?
yes, it's to get it to the correct batch!
Can you add a comment on that for future reference? I'm sure I'll forget lol
@bchen1116 will file that issue, alongside the sampler issue and the cat-num issue!
Waiting on latest perf tests before merging.
…l into js_default_component_graph
Couple major changes:
This requires some additional logic in DefaultAlgorithm due to the renaming of components and handling when to create this two-pronged pipeline.
I made the same changes I made to Select Columns Transformer to Drop Columns Transformer as well. This was because if certain columns were to be dropped, sometimes they would exist in only one chain, or different columns would be in different chains. Thus, we only drop what's available.
Added _make_pipeline_from_multiple_graphs, which essentially has the same implementation as the stacked ensembler helper but with a couple of logic changes on naming and on selecting which final y output to take for the estimator. I left this as a separate function due to the discussions in Label Encoder appears twice for AutoML-generated stacked ensemble pipeline #2987 and Update label encoder in pipeline implementation to look for LabelEncoder class instead of looking for "Label Encoder" #2906; those limitations apply here as well.
Perf tests:
report.html.zip
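For reference, the component renaming mentioned in the description can be sketched roughly as follows. This assumes parameters are keyed by component name and that combined components receive a pipeline-name prefix; the `rename_parameters` helper, the `" - "` separator, and the example values are hypothetical and not taken from the PR.

```python
def rename_parameters(parameters, pipeline_name):
    """Re-key a parameters dict so each component name carries a prefix.

    Hypothetical helper: "<pipeline_name> - <component_name>" is an
    assumed naming scheme used only for this illustration.
    """
    return {
        f"{pipeline_name} - {component_name}": dict(values)
        for component_name, values in parameters.items()
    }


# Assumed example parameters for the categorical chain.
categorical_parameters = {
    "Select Columns Transformer": {"columns": ["city", "plan"]},
}

renamed = rename_parameters(categorical_parameters, "Categorical Pipeline")
```

The inner dicts are copied, so later changes to the renamed parameters do not leak back into the original mapping.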