Handle categorical columns in a separate chain for `DefaultAlgorithm` pipelines #2986

jeremyliweishih · 2021-10-28T18:16:16Z

A couple of major changes:

When applicable (i.e., when categorical columns exist), the pipeline structure changes so that categoricals are handled in a separate preprocessing chain. Here's an example of what the change looks like. Note: the real pipeline has an additional label encoder, which is detailed in #2987 (Label Encoder appears twice for AutoML-generated stacked ensemble pipeline).

This requires some additional logic in `DefaultAlgorithm` to rename components and to decide when to create this two-pronged pipeline.

I made the same changes to Drop Columns Transformer that I made to Select Columns Transformer. This was needed because the columns to be dropped sometimes exist in only one chain, or different columns end up in different chains. Thus, we only drop what's available.

Added `_make_pipeline_from_multiple_graphs`, which has essentially the same implementation as the stacked ensembler helper, but with a couple of logic changes around naming and around selecting which final y output to take for the estimator. I left this as a separate function because of the discussions in #2987 (Label Encoder appears twice for AutoML-generated stacked ensemble pipeline) and #2906 (Update label encoder in pipeline implementation to look for LabelEncoder class instead of looking for "Label Encoder"); those limitations apply here as well.
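
The two-pronged structure, and the renaming it forces, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code in this PR: `chain_parameters`, the `column_types` mapping, and the "Categorical Pipeline" / "Numeric Pipeline" prefixes are assumptions made for the example, and the non-categorical chain here simply takes every remaining column.

```python
def chain_parameters(column_types):
    """Build Select Columns Transformer parameters for one or two chains.

    `column_types` maps column name -> "categorical" or anything else.
    The chain prefixes below are hypothetical names chosen for this sketch.
    """
    categorical = [col for col, kind in column_types.items() if kind == "categorical"]
    if not categorical:
        # No categoricals: keep the single-chain structure and the original name.
        return {"Select Columns Transformer": {"columns": list(column_types)}}

    other = [col for col in column_types if col not in categorical]
    # Both chains contain a component with the same base name, so each key
    # is prefixed with its chain name to stay unique in the combined graph.
    return {
        "Categorical Pipeline - Select Columns Transformer": {"columns": categorical},
        "Numeric Pipeline - Select Columns Transformer": {"columns": other},
    }


mixed = chain_parameters({"age": "numeric", "city": "categorical", "income": "numeric"})
numeric_only = chain_parameters({"age": "numeric", "income": "numeric"})
```

Because the parameter keys change when the split happens, anything keyed by component name (pipeline parameters, custom hyperparameters) has to be renamed the same way.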

Perf tests:
report.html.zip

codecov · 2021-10-28T18:22:15Z

Codecov Report

Merging #2986 (19f9fb1) into main (a28484e) will increase coverage by 0.8%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2986     +/-   ##
=======================================
+ Coverage   99.0%   99.7%   +0.8%     
=======================================
  Files        312     312             
  Lines      29979   30137    +158     
=======================================
+ Hits       29656   30041    +385     
+ Misses       323      96    -227

Impacted Files	Coverage Δ
evalml/automl/automl_algorithm/automl_algorithm.py	`100.0% <100.0%> (ø)`
...valml/automl/automl_algorithm/default_algorithm.py	`100.0% <100.0%> (ø)`
...elines/components/transformers/column_selectors.py	`92.9% <100.0%> (-7.1%)`	⬇️
evalml/pipelines/utils.py	`99.6% <100.0%> (+0.1%)`	⬆️
...valml/tests/automl_tests/test_default_algorithm.py	`100.0% <100.0%> (ø)`
...mponent_tests/test_column_selector_transformers.py	`100.0% <100.0%> (ø)`
evalml/tests/pipeline_tests/test_pipeline_utils.py	`99.7% <100.0%> (+0.1%)`	⬆️
evalml/tests/automl_tests/test_automl.py	`99.5% <0.0%> (+0.1%)`	⬆️
evalml/automl/automl_search.py	`99.9% <0.0%> (+0.2%)`	⬆️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a28484e...19f9fb1. Read the comment docs.

jeremyliweishih · 2021-11-01T18:55:39Z

evalml/pipelines/components/transformers/column_selectors.py

@@ -81,8 +81,14 @@ class DropColumns(ColumnSelector):
    """{}"""
    needs_fitting = False

+    def _check_input_for_columns(self, X):


This needed the same changes I made to `SelectColumns` in #2944 because of the split in pipeline structure: the columns to be dropped are sometimes in separate chains, or in only one chain.
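
As a rough sketch of the "only drop what's available" behavior, assuming a table represented as a plain dict of column name to values (the helper name `drop_available_columns` is made up for this example and is not the `DropColumns` implementation):

```python
def drop_available_columns(table, columns_to_drop):
    """Return a copy of `table` without the requested columns.

    `table` maps column name -> list of values. Requested columns that are
    not present are skipped instead of raising an error.
    """
    to_drop = set(columns_to_drop) & set(table)
    return {name: values for name, values in table.items() if name not in to_drop}


categorical_chain_input = {"city": ["NYC", "SF"], "color": ["red", "blue"]}
# "age" lives in the other chain, so only "color" is actually dropped here.
result = drop_available_columns(categorical_chain_input, ["color", "age"])
```

The trade-off is that a misspelled column name no longer fails loudly, which is the price of letting the same drop list be applied to both chains.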

jeremyliweishih · 2021-11-01T18:58:01Z

evalml/pipelines/utils.py

@@ -369,6 +386,126 @@ def _make_new_component_name(model_type, component_name, idx=None):
    )


+def _make_pipeline_from_multiple_graphs(


I'm adding this new function, but most of it is the same as `_make_stacked_ensemble_pipeline`, with some new logic around which final y input to take. I can put up an issue to refactor this going forward, but I mainly kept it separate because of the conversations surrounding #2987 and #2906.

freddyaboulton · 2021-11-02

Can you explain what the problem was with the final y input?

jeremyliweishih · 2021-11-02

@freddyaboulton the implementation for the stacked ensembler assumes that each input pipeline has an estimator, so the final y input chosen for the stacked estimator is the y input going into each pipeline's estimator. This usually means that the label encoder output or the original y is selected as the final y input. The new implementation instead always chooses the last component that modifies y in the last pipeline. @christopherbunn could give a better explanation of how the stacked implementation works.
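
The rule described in that reply (take the last component that modifies y in the last pipeline) could look something like the sketch below. The dict layout is only loosely modeled on a component graph, and `final_y_edge`, `modifies_target`, and the component names are assumptions for illustration, not the code in `_make_pipeline_from_multiple_graphs`.

```python
def final_y_edge(component_graphs, modifies_target):
    """Pick the y edge to feed into the final estimator.

    Walks the last graph in insertion order and returns "<name>.y" for the
    last component listed in `modifies_target`; falls back to the raw "y".
    """
    last_graph = component_graphs[-1]
    y_edge = "y"
    for name in last_graph:
        if name in modifies_target:
            y_edge = f"{name}.y"
    return y_edge


# Hypothetical chains: each maps component name -> [component, x input, y input].
categorical_chain = {
    "Categorical Pipeline - Label Encoder": ["Label Encoder", "X", "y"],
    "Categorical Pipeline - One Hot Encoder": [
        "One Hot Encoder",
        "Categorical Pipeline - Label Encoder.x",
        "Categorical Pipeline - Label Encoder.y",
    ],
}
numeric_chain = {
    "Numeric Pipeline - Label Encoder": ["Label Encoder", "X", "y"],
    "Numeric Pipeline - Imputer": [
        "Imputer",
        "Numeric Pipeline - Label Encoder.x",
        "Numeric Pipeline - Label Encoder.y",
    ],
}
target_modifiers = {
    "Categorical Pipeline - Label Encoder",
    "Numeric Pipeline - Label Encoder",
}

chosen = final_y_edge([categorical_chain, numeric_chain], target_modifiers)
```

Unlike the stacked-ensemble helper described above, this does not require the input chains to end in an estimator, since it only inspects which components touch y.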

freddyaboulton

@jeremyliweishih Looks good to me! Thank you for making the changes. I think this unblocks further development by getting rid of the pesky bugs from categorical features, but I think we can further improve the implementation regarding the two samplers and the numeric vs. categorical split, so I'm looking forward to the follow-up issues!

Also, I'm wondering if this code would be substantially simpler if we had a first-class API for combining pipelines. Maybe we need to prioritize that? It seems like we'll have to do it for data check actions. FYI @angela97lin @dsherry @chukarsten

evalml/automl/automl_algorithm/default_algorithm.py

evalml/pipelines/utils.py

evalml/automl/automl_algorithm/default_algorithm.py

bchen1116

Haven't gone through the tests yet, but this is looking good so far! Left many nits and some comments on things that I think could be changed to make this more robust and efficient.

Hopefully can finish looking through the tests tomorrow.

evalml/automl/automl_algorithm/default_algorithm.py

evalml/pipelines/components/transformers/column_selectors.py

evalml/pipelines/utils.py

bchen1116

Thanks for making the changes! Looks good to me, left a question for my own understanding.

I remember there was some messy stuff with how the pipeline names are displayed in these longer split pipelines. Was there an issue filed to track fixing that for a future PR?

bchen1116 · 2021-11-05T18:25:24Z

evalml/automl/automl_algorithm/default_algorithm.py

+                            new_names[name] = old_names[component_name]
+        return new_names
+
+    def _rename_pipeline_parameters_custom_hyperparameters(self, pipelines):


bchen1116 · 2021-11-05T18:26:06Z

evalml/automl/automl_algorithm/default_algorithm.py

+                "Select Columns Transformer": {"columns": self._selected_cat_cols}
+            }
+            categorical_pipeline = make_pipeline(
+                self.X,


@jeremyliweishih was this filed?

bchen1116 · 2021-11-05T19:13:17Z

evalml/tests/automl_tests/test_default_algorithm.py

+    add_result(algo, batch)
+
+    batch = algo.next_batch()
+    add_result(algo, batch)


Do we need to do 2 calls for this?

jeremyliweishih · 2021-11-05

Yes, it's to get it to the correct batch!

bchen1116 · 2021-11-05

Can you add a comment on that for future reference? I'm sure I'll forget lol

jeremyliweishih · 2021-11-05T20:00:02Z

@bchen1116 will file that issue, alongside the sampler issue and the cat-num issue!

jeremyliweishih · 2021-11-05T20:08:57Z

waiting on latest perf tests before merging.

jeremyliweishih added 3 commits October 28, 2021 11:35

Split into seperate function and call in default algo

7ebbbdb

Add tests for make split pipeline

9511d0c

Merge branch 'main' of github.com:alteryx/evalml into js_default_component_graph

d066709

jeremyliweishih added 7 commits October 28, 2021 16:51

Fix tests

f53214b

Merge branch 'main' of github.com:alteryx/evalml into js_default_component_graph

259968e

lint

99b8587

RL

d9e1227

lint

d86103d

lint

1d2d4aa

fix coverage

c7dd019

jeremyliweishih commented Oct 29, 2021

jeremyliweishih added 9 commits October 29, 2021 11:19

only create split if there are cat columns

61c63ad

Make drop columns no-op if columns do not exist'

8cb89d0

Use pipeline subnames

c8a14c3

Create new function and add logic for selecting y input for estimator

89a0316

remove uncessary label encoder

3d2b2fc

Merge branch 'main' of github.com:alteryx/evalml into js_default_component_graph

0b9777f

fix last y logic

0c4fa3b

Add back label encoder at the front

f960eba

Merge branch 'main' of github.com:alteryx/evalml into js_default_component_graph

049ba33

jeremyliweishih marked this pull request as ready for review November 1, 2021 18:54

auto-assign bot assigned jeremyliweishih Nov 1, 2021

jeremyliweishih commented Nov 1, 2021

jeremyliweishih added 5 commits November 1, 2021 15:01

lint

7ecebf0

lint

17380af

lint

8477611

Fix coverage

81a0e34

fix broken test

8ae2794

jeremyliweishih added 3 commits November 2, 2021 16:01

Merge branch 'main' of github.com:alteryx/evalml into js_default_component_graph

a9a4e1c

Clean up logic grabbing OHE

16a68fb

Merge branch 'main' into js_default_component_graph

6e07821

freddyaboulton approved these changes Nov 3, 2021

jeremyliweishih added 5 commits November 3, 2021 14:21

Change default parameter for create pipeline from multiple cg

ee6df95

Fix ensembling parameters

e66a82f

Merge branch 'js_default_component_graph' of github.com:alteryx/evalml into js_default_component_graph

1f172ce

Make similar changes to custom hyperparameters

561dae3

Add coverage for custom hyper parameters

401a42a

bchen1116 requested changes Nov 3, 2021

jeremyliweishih added 3 commits November 5, 2021 10:31

Merge branch 'main' of github.com:alteryx/evalml into js_default_component_graph

6229979

Address comments

d0157ce

Fix RL

d04f57e

jeremyliweishih requested a review from bchen1116 November 5, 2021 15:44

Merge branch 'main' into js_default_component_graph

813096d

bchen1116 approved these changes Nov 5, 2021

Merge branch 'main' into js_default_component_graph

7e5033e

jeremyliweishih mentioned this pull request Nov 8, 2021

Default Algorithm: fully split numeric and categorical columns in preprocessing split #3020

Closed

jeremyliweishih added 5 commits November 8, 2021 11:44

RL

6d8e58f

Merge branch 'js_default_component_graph' of github.com:alteryx/evalml into js_default_component_graph

695906c

Merge branch 'main' into js_default_component_graph

3e0db7d

Merge branch 'main' into js_default_component_graph

85fed4d

Merge branch 'main' into js_default_component_graph

19f9fb1

jeremyliweishih merged commit 79b9d22 into main Nov 8, 2021

chukarsten mentioned this pull request Nov 9, 2021

Release v0.37.0 #3029

Merged

jeremyliweishih mentioned this pull request Nov 18, 2021

Use one sampler for split preprocessing pipeline #3076

Closed

freddyaboulton mentioned this pull request Mar 31, 2022

Use only one sampler for split pipelines in default algorithm #3430

Closed

freddyaboulton deleted the js_default_component_graph branch May 13, 2022 15:03
