Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle categorical columns in a separate chain for DefaultAlgorithm pipelines #2986

Merged
merged 46 commits into from
Nov 8, 2021

Conversation

jeremyliweishih
Copy link
Collaborator

@jeremyliweishih jeremyliweishih commented Oct 28, 2021

Couple major changes:

  1. When applicable (i.e categorical columns exist), the pipeline structure changes s.t categoricals are handled in a separate preprocessing chain. Heres, an example of what the change looks like. Note: the real pipeline has an additional label encoder which is detailed by Label Encoder appears twice for AutoML-generated stacked ensemble pipeline #2987.

graphviz (1)

This requires some additional logic in DefaultAlgorithm due to renaming of components and handling when to create this two-pronged pipeline.

  1. I made the same changes I made to Select Columns Transformer to Drop Columns Transformer as well. This was because if certain columns were to be dropped, sometimes it would only exist in one chain or different columns would be in different chains. Thus, we only drop whats available.

  2. Added _make_pipeline_from_multiple_graphs which essentially has the same implementation of the stacked ensembler helper but with a couple logic changes on naming and selecting which final y output to take for the estimator. I left this as a separate function due to the discussions in Label Encoder appears twice for AutoML-generated stacked ensemble pipeline #2987 and Update label encoder in pipeline implementation to look for LabelEncoder class instead of looking for "Label Encoder" #2906 and those limitations apply here as well.

Perf tests:
report.html.zip

@codecov
Copy link

codecov bot commented Oct 28, 2021

Codecov Report

Merging #2986 (19f9fb1) into main (a28484e) will increase coverage by 0.8%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2986     +/-   ##
=======================================
+ Coverage   99.0%   99.7%   +0.8%     
=======================================
  Files        312     312             
  Lines      29979   30137    +158     
=======================================
+ Hits       29656   30041    +385     
+ Misses       323      96    -227     
Impacted Files Coverage Δ
evalml/automl/automl_algorithm/automl_algorithm.py 100.0% <100.0%> (ø)
...valml/automl/automl_algorithm/default_algorithm.py 100.0% <100.0%> (ø)
...elines/components/transformers/column_selectors.py 92.9% <100.0%> (-7.1%) ⬇️
evalml/pipelines/utils.py 99.6% <100.0%> (+0.1%) ⬆️
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <100.0%> (ø)
...mponent_tests/test_column_selector_transformers.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipeline_utils.py 99.7% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py 99.5% <0.0%> (+0.1%) ⬆️
evalml/automl/automl_search.py 99.9% <0.0%> (+0.2%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a28484e...19f9fb1. Read the comment docs.

@jeremyliweishih jeremyliweishih marked this pull request as ready for review November 1, 2021 18:54
@@ -81,8 +81,14 @@ class DropColumns(ColumnSelector):
"""{}"""
needs_fitting = False

def _check_input_for_columns(self, X):
Copy link
Collaborator Author

@jeremyliweishih jeremyliweishih Nov 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed the same changes that I made to SelectColumns in #2944 due to the split in pipeline structure (sometimes the columns to be dropped are in separate chains or only in one chain).

@@ -369,6 +386,126 @@ def _make_new_component_name(model_type, component_name, idx=None):
)


def _make_pipeline_from_multiple_graphs(
Copy link
Collaborator Author

@jeremyliweishih jeremyliweishih Nov 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this new function but most of it is the same as _make_stacked_ensemble_pipeline with some new logic around which final y input to take. I can put up an issue to refactor this moving forward but mainly kept this separate due to the conversations surrounding #2987 and #2906.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what the problem was with the final y input?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton the implementation for the stacked ensembler assumes that each input pipeline has an estimator so that the final y input chosen to the stacked estimator is the y input going into each pipelines estimator. This usually means that the label encoder or the original y is selected as the final y input. This implementation always chooses the last component that modifies y of the last pipeline. @christopherbunn could give a better explanation of how the stacked implementation works.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremyliweishih Looks good to me! Thank you for making the changes. I think this unblocks further development by getting rid of the pesky bugs from categorical features but I think we can improve the implementation regarding the two samplers and splitting numeric vs categorical further so I'm looking forward to the follow-up issues!

Also, I'm wondering if this code will be substantially simpler if we had a first-class api for combining pipelines. Maybe we need to prioritize that? Seems like we'll have to do it for data check actions? FYI @angela97lin @dsherry @chukarsten

evalml/pipelines/utils.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't gone through the tests yet, but this is looking good so far! Left many nits and some comments on things that I think could be changed to make this more robust and efficient.

Hopefully can finish looking through the tests tomorrow.

evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
evalml/pipelines/utils.py Outdated Show resolved Hide resolved
evalml/pipelines/utils.py Show resolved Hide resolved
evalml/pipelines/utils.py Outdated Show resolved Hide resolved
evalml/pipelines/utils.py Show resolved Hide resolved
evalml/pipelines/utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes! Looks good to me, left a question for my own understanding.

I remember there was some messy stuff with how the pipeline names are displayed in these longer split pipelines. Was there an issue filed to track fixing that for a future PR?

new_names[name] = old_names[component_name]
return new_names

def _rename_pipeline_parameters_custom_hyperparameters(self, pipelines):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

"Select Columns Transformer": {"columns": self._selected_cat_cols}
}
categorical_pipeline = make_pipeline(
self.X,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremyliweishih was this filed?

add_result(algo, batch)

batch = algo.next_batch()
add_result(algo, batch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to do 2 calls for this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, its to get it to the correct batch!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment on that for future reference? I'm sure i'll forget lol

@jeremyliweishih
Copy link
Collaborator Author

@bchen1116 will file that issue, along side the sampler issue and the cat-num issue!

@jeremyliweishih
Copy link
Collaborator Author

waiting on latest perf tests before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants