-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fully separate numeric and categorical data in DefaultAlgorithm
#3209
Conversation
@@ -152,8 +152,7 @@ class SelectByType(ColumnSelector): | |||
def __init__(self, column_types=None, random_seed=0, **kwargs): | |||
parameters = {"column_types": column_types} | |||
parameters.update(kwargs) | |||
Transformer.__init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was causing parameters to disappear. Still slightly unsure of the exact mechanism of how the parameters disappear but my hypothesis is that the parameters
field in SelectByType
you can see here is different from the parameters
field created by this init logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that they're disappearing, it's that they're put in the parameters
dictionary because that's that the Transformer
init does.
'Numeric Pipeline - Select Columns By Type Transformer': {'columns': None, 'parameters': {'column_types': ['numeric']}, 'component_obj': None}
I agree with the redesign. ColumnSelectors
are meant to take a list of columns so we need to either need to rename column_types
to columns
or not make this component a subclass. What we here doing before of trying to skip the parent init and get to the grandparent init is probably not the right thing to do.
Interesting find!
@@ -136,7 +136,7 @@ def _modify_columns(self, cols, X, y=None): | |||
return X.ww[column_intersection] | |||
|
|||
|
|||
class SelectByType(ColumnSelector): | |||
class SelectByType(Transformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to just inherit from Transformer
and override fit
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always good to consider
@@ -397,8 +397,8 @@ def make_pipeline( | |||
An empty dictionary or None implies using all default values for component parameters. | |||
sampler_name (str): The name of the sampler component to add to the pipeline. Only used in classification problems. | |||
Defaults to None | |||
extra_components (list[ComponentBase]): List of extra components to be added after preprocessing components. Defaults to None. | |||
extra_components_position (str): Where to put extra components. Defaults to "before_preprocessing" and any other value will put components after preprocessing components. | |||
extra_components_before (list[ComponentBase]): List of extra components to be added before preprocessing components. Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised this API here to just have two lists for components in both locations. I was thinking about changing it to a component to index mapping but thought it would be too complicated (I would need to figure out index locations and how they would change as we insert each component).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a solid change. I think you should let the use cases drive the API. Do we currently have (or foresee) a reason to be able to inject components at different locations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any short term usage other than in DefaultAlgorithm
but I can imagine sometime in the future we would want more control over where exact components go in our pipelines. I agree that we should let the use cases drive the API so let's leave this for now!
Codecov Report
@@ Coverage Diff @@
## main #3209 +/- ##
=======================================
+ Coverage 99.8% 99.8% +0.1%
=======================================
Files 326 326
Lines 31497 31563 +66
=======================================
+ Hits 31406 31472 +66
Misses 91 91
Continue to review full report at Codecov.
|
} | ||
numeric_pipeline = make_pipeline( | ||
self.X, | ||
self.X.ww.select(exclude="category"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here I exclude all category
semantic tag columns so there won't be preprocessing components for those components.
@@ -447,14 +451,13 @@ def _make_split_pipeline(self, estimator, pipeline_name=None): | |||
"Select Columns Transformer": {"columns": self._selected_cat_cols} | |||
} | |||
categorical_pipeline = make_pipeline( | |||
self.X, | |||
self.X.ww.select(include=["category"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise I do the opposite here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, Jeremy! Just a copy pasta and a nit. Do with them what you want!
@@ -136,7 +136,7 @@ def _modify_columns(self, cols, X, y=None): | |||
return X.ww[column_intersection] | |||
|
|||
|
|||
class SelectByType(ColumnSelector): | |||
class SelectByType(Transformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
always good to consider
@@ -397,8 +397,8 @@ def make_pipeline( | |||
An empty dictionary or None implies using all default values for component parameters. | |||
sampler_name (str): The name of the sampler component to add to the pipeline. Only used in classification problems. | |||
Defaults to None | |||
extra_components (list[ComponentBase]): List of extra components to be added after preprocessing components. Defaults to None. | |||
extra_components_position (str): Where to put extra components. Defaults to "before_preprocessing" and any other value will put components after preprocessing components. | |||
extra_components_before (list[ComponentBase]): List of extra components to be added before preprocessing components. Defaults to None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a solid change. I think you should let the use cases drive the API. Do we currently have (or foresee) a reason to be able to inject components at different locations?
…js_3020_fully_split
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool work! Just left a few small nits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work, the perf tests look great!
Fixes #3020
Example:

You can see that a one hot encoder does not exist in the numeric pipeline.
Perf test results:
split_fixed_no_change.html.zip