Handle categorical columns in a separate chain for DefaultAlgorithm pipelines #2986
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2986 +/- ##
=======================================
+ Coverage 99.0% 99.7% +0.8%
=======================================
Files 312 312
Lines 29979 30137 +158
=======================================
+ Hits 29656 30041 +385
+ Misses 323 96 -227
Continue to review full report at Codecov.
@@ -81,8 +81,14 @@ class DropColumns(ColumnSelector):
    """{}"""
    needs_fitting = False

def _check_input_for_columns(self, X):
Needed the same changes that I made to SelectColumns
in #2944 due to the split in pipeline structure (sometimes the columns to be dropped are in separate chains or only in one chain).
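The "drop only what's available" behavior can be sketched as follows. This is a hypothetical, minimal stand-in (not EvalML's actual `DropColumns` implementation): after the pipeline is split into chains, a chain may only receive a subset of the original columns, so only the requested columns actually present in `X` get dropped.

```python
import pandas as pd

class DropOnlyPresentColumns:
    """Hypothetical transformer: drop only the configured columns
    that actually appear in the incoming frame, since a split chain
    may never see some of them."""

    def __init__(self, columns=None):
        self.columns = columns or []

    def transform(self, X):
        # Intersect the configured columns with what this chain received.
        present = [c for c in self.columns if c in X.columns]
        return X.drop(columns=present)

X = pd.DataFrame({"num": [1, 2], "cat": ["a", "b"]})
# "other" lives in a different chain, so it is silently skipped here.
dropped = DropOnlyPresentColumns(columns=["cat", "other"]).transform(X)
print(list(dropped.columns))  # → ['num']
```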
@@ -369,6 +386,126 @@ def _make_new_component_name(model_type, component_name, idx=None):
)

def _make_pipeline_from_multiple_graphs(
Can you explain what the problem was with the final y input?
@freddyaboulton the implementation for the stacked ensembler assumes that each input pipeline has an estimator, so the final y input chosen for the stacked estimator is the y going into each pipeline's estimator. This usually means that the label encoder output or the original y is selected as the final y input. This implementation instead always chooses the last component that modifies y in the last pipeline. @christopherbunn could give a better explanation of how the stacked implementation works.
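The "last component that modifies y in the last pipeline" rule described above can be illustrated with a toy model. This is a hypothetical sketch (component graphs modeled as ordered lists of `(name, modifies_y)` tuples, not EvalML's real data structures):

```python
def final_y_last_pipeline(pipelines):
    """Sketch of the strategy described above: take the last component
    that modifies y in the *last* pipeline, falling back to the
    original target if no component in that pipeline touches y."""
    for name, modifies_y in reversed(pipelines[-1]):
        if modifies_y:
            return name
    return "y"  # original target

# The numeric chain ends with a Label Encoder that modifies y; the
# categorical chain does not touch y at all.
numeric = [("Imputer", False), ("Label Encoder", True)]
categorical = [("Select Columns Transformer", False), ("One Hot Encoder", False)]

print(final_y_last_pipeline([numeric, categorical]))  # → 'y'
print(final_y_last_pipeline([categorical, numeric]))  # → 'Label Encoder'
```

Note how the result depends on which chain happens to come last, which is the behavior the question above is probing.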
@jeremyliweishih Looks good to me! Thank you for making the changes. I think this unblocks further development by getting rid of the pesky bugs from categorical features, but I think we can further improve the implementation around the two samplers and the numeric vs. categorical split, so I'm looking forward to the follow-up issues!
Also, I'm wondering if this code would be substantially simpler if we had a first-class API for combining pipelines. Maybe we need to prioritize that? Seems like we'll have to do it for data check actions anyway. FYI @angela97lin @dsherry @chukarsten
…l into js_default_component_graph
Haven't gone through the tests yet, but this is looking good so far! Left many nits and some comments on things that I think could be changed to make this more robust and efficient.
Hopefully can finish looking through the tests tomorrow.
Thanks for making the changes! Looks good to me, left a question for my own understanding.
I remember there was some messy stuff with how the pipeline names are displayed in these longer split pipelines. Was there an issue filed to track fixing that for a future PR?
new_names[name] = old_names[component_name]
return new_names

def _rename_pipeline_parameters_custom_hyperparameters(self, pipelines):
🎉
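The renaming step this helper performs can be sketched roughly like this. A hypothetical reduction (the real `_rename_pipeline_parameters_custom_hyperparameters` also handles custom hyperparameters, and the prefix shown is made up):

```python
def rename_parameters(parameters, name_map):
    """Remap a parameters dict keyed by component name onto the new
    (prefixed) component names; names not in the map pass through."""
    return {name_map.get(name, name): params
            for name, params in parameters.items()}

params = {"Imputer": {"numeric_impute_strategy": "mean"}}
# Hypothetical prefix for the categorical chain's copy of the component.
name_map = {"Imputer": "Categorical Pipeline - Imputer"}
renamed = rename_parameters(params, name_map)
print(renamed)
# → {'Categorical Pipeline - Imputer': {'numeric_impute_strategy': 'mean'}}
```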
"Select Columns Transformer": {"columns": self._selected_cat_cols}
}
categorical_pipeline = make_pipeline(
    self.X,
@jeremyliweishih was this filed?
add_result(algo, batch)

batch = algo.next_batch()
add_result(algo, batch)
do we need to do 2 calls for this?
Yes, it's to get it to the correct batch!
Can you add a comment on that for future reference? I'm sure I'll forget, lol.
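The "two calls to reach the correct batch" point can be shown with a toy stand-in for the algorithm (hypothetical; not EvalML's `DefaultAlgorithm`, which yields real pipeline batches):

```python
class ToyBatchAlgorithm:
    """Toy stand-in: batches come in a fixed order, so a test must
    advance past earlier batches to reach the one under test."""

    def __init__(self):
        self._batch_number = 0

    def next_batch(self):
        self._batch_number += 1
        return self._batch_number

algo = ToyBatchAlgorithm()
algo.next_batch()          # batch 1 (e.g. the baseline pipelines)
batch = algo.next_batch()  # call again to advance to the batch under test
print(batch)  # → 2
```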
@bchen1116 will file that issue, alongside the sampler issue and the cat-num issue!
Waiting on latest perf tests before merging.
…l into js_default_component_graph
Couple of major changes:

- This requires some additional logic in `DefaultAlgorithm` due to the renaming of components and deciding when to create this two-pronged pipeline.
- I made the same changes I made to `Select Columns Transformer` to `Drop Columns Transformer` as well. This was because if certain columns were to be dropped, sometimes they would exist in only one chain, or different columns would be in different chains. Thus, we only drop what's available.
- Added `_make_pipeline_from_multiple_graphs`, which essentially has the same implementation as the stacked ensembler helper but with a couple of logic changes on naming and on selecting which final y output to feed the estimator. I left this as a separate function due to the discussions in #2987 (Label Encoder appears twice for AutoML-generated stacked ensemble pipeline) and #2906 (Update label encoder in pipeline implementation to look for the LabelEncoder class instead of looking for "Label Encoder"); those limitations apply here as well.

Perf tests: report.html.zip
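The graph-combining idea can be sketched with a minimal example. This is a hypothetical reduction (component graphs as dicts mapping a component name to its input names; the real `_make_pipeline_from_multiple_graphs` also handles the final y selection and uses EvalML's own naming helpers):

```python
def combine_graphs(graphs):
    """Merge several component graphs into one, prefixing component
    names per sub-pipeline so they cannot collide."""
    combined = {}
    for idx, graph in enumerate(graphs):
        prefix = f"Pipeline {idx} - "
        for name, inputs in graph.items():
            # Rename edges that point at components inside this graph;
            # external inputs like "X" are left untouched.
            combined[prefix + name] = [
                prefix + i if i in graph else i for i in inputs
            ]
    return combined

numeric = {"Imputer": ["X"], "Standard Scaler": ["Imputer"]}
categorical = {"Imputer": ["X"], "One Hot Encoder": ["Imputer"]}
merged = combine_graphs([numeric, categorical])
print(sorted(merged))
# → ['Pipeline 0 - Imputer', 'Pipeline 0 - Standard Scaler',
#    'Pipeline 1 - Imputer', 'Pipeline 1 - One Hot Encoder']
```

Both chains keep their own `Imputer` without a name clash, which is the collision the renaming logic in the PR exists to prevent.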