Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully separate numeric and categorical data in DefaultAlgorithm #3209

Merged
merged 38 commits into from
Jan 24, 2022

Conversation

jeremyliweishih
Copy link
Collaborator

@jeremyliweishih jeremyliweishih commented Jan 10, 2022

Fixes #3020

Example:
split

You can see that a one hot encoder does not exist in the numeric pipeline.

Perf test results:
split_fixed_no_change.html.zip

@@ -152,8 +152,7 @@ class SelectByType(ColumnSelector):
def __init__(self, column_types=None, random_seed=0, **kwargs):
parameters = {"column_types": column_types}
parameters.update(kwargs)
Transformer.__init__(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing parameters to disappear. Still slightly unsure of the exact mechanism of how the parameters disappear but my hypothesis is that the parameters field in SelectByType you can see here is different from the parameters field created by this init logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that they're disappearing, it's that they're put in the parameters dictionary because that's that the Transformer init does.

'Numeric Pipeline - Select Columns By Type Transformer': {'columns': None, 'parameters': {'column_types': ['numeric']}, 'component_obj': None}

I agree with the redesign. ColumnSelectors are meant to take a list of columns so we need to either need to rename column_types to columns or not make this component a subclass. What we here doing before of trying to skip the parent init and get to the grandparent init is probably not the right thing to do.

Interesting find!

@@ -136,7 +136,7 @@ def _modify_columns(self, cols, X, y=None):
return X.ww[column_intersection]


class SelectByType(ColumnSelector):
class SelectByType(Transformer):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to just inherit from Transformer and override fit as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always good to consider

@@ -397,8 +397,8 @@ def make_pipeline(
An empty dictionary or None implies using all default values for component parameters.
sampler_name (str): The name of the sampler component to add to the pipeline. Only used in classification problems.
Defaults to None
extra_components (list[ComponentBase]): List of extra components to be added after preprocessing components. Defaults to None.
extra_components_position (str): Where to put extra components. Defaults to "before_preprocessing" and any other value will put components after preprocessing components.
extra_components_before (list[ComponentBase]): List of extra components to be added before preprocessing components. Defaults to None.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised this API here to just have two lists for components in both locations. I was thinking about changing it to a component to index mapping but thought it would be too complicated (I would need to figure out index locations and how they would change as we insert each component).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a solid change. I think you should let the use cases drive the API. Do we currently have (or foresee) a reason to be able to inject components at different locations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any short term usage other than in DefaultAlgorithm but I can imagine sometime in the future we would want more control over where exact components go in our pipelines. I agree that we should let the use cases drive the API so let's leave this for now!

@codecov
Copy link

codecov bot commented Jan 10, 2022

Codecov Report

Merging #3209 (6c90a07) into main (1a3f32e) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3209     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%     
=======================================
  Files        326     326             
  Lines      31497   31563     +66     
=======================================
+ Hits       31406   31472     +66     
  Misses        91      91             
Impacted Files Coverage Δ
evalml/automl/automl_algorithm/automl_algorithm.py 100.0% <100.0%> (ø)
...valml/automl/automl_algorithm/default_algorithm.py 100.0% <100.0%> (ø)
...elines/components/transformers/column_selectors.py 92.9% <100.0%> (ø)
evalml/pipelines/utils.py 99.4% <100.0%> (-<0.1%) ⬇️
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <100.0%> (ø)
...mponent_tests/test_column_selector_transformers.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a3f32e...6c90a07. Read the comment docs.

}
numeric_pipeline = make_pipeline(
self.X,
self.X.ww.select(exclude="category"),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I exclude all category semantic tag columns so there won't be preprocessing components for those components.

@@ -447,14 +451,13 @@ def _make_split_pipeline(self, estimator, pipeline_name=None):
"Select Columns Transformer": {"columns": self._selected_cat_cols}
}
categorical_pipeline = make_pipeline(
self.X,
self.X.ww.select(include=["category"]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise I do the opposite here.

@jeremyliweishih jeremyliweishih marked this pull request as ready for review January 11, 2022 21:20
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, Jeremy! Just a copy pasta and a nit. Do with them what you want!

@@ -136,7 +136,7 @@ def _modify_columns(self, cols, X, y=None):
return X.ww[column_intersection]


class SelectByType(ColumnSelector):
class SelectByType(Transformer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always good to consider

evalml/pipelines/utils.py Outdated Show resolved Hide resolved
@@ -397,8 +397,8 @@ def make_pipeline(
An empty dictionary or None implies using all default values for component parameters.
sampler_name (str): The name of the sampler component to add to the pipeline. Only used in classification problems.
Defaults to None
extra_components (list[ComponentBase]): List of extra components to be added after preprocessing components. Defaults to None.
extra_components_position (str): Where to put extra components. Defaults to "before_preprocessing" and any other value will put components after preprocessing components.
extra_components_before (list[ComponentBase]): List of extra components to be added before preprocessing components. Defaults to None.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a solid change. I think you should let the use cases drive the API. Do we currently have (or foresee) a reason to be able to inject components at different locations?

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool work! Just left a few small nits

docs/source/release_notes.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@ParthivNaresh ParthivNaresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work, the perf tests look great!

docs/source/release_notes.rst Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/default_algorithm.py Outdated Show resolved Hide resolved
evalml/tests/automl_tests/test_default_algorithm.py Outdated Show resolved Hide resolved
@jeremyliweishih jeremyliweishih merged commit 2cb356d into main Jan 24, 2022
@angela97lin angela97lin mentioned this pull request Jan 25, 2022
@freddyaboulton freddyaboulton deleted the js_3020_fully_split branch May 13, 2022 15:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Default Algorithm: fully split numeric and categorical columns in preprocessing split
5 participants