Fully separate numeric and categorical data in DefaultAlgorithm #3209
jeremyliweishih merged 38 commits into main
| def __init__(self, column_types=None, random_seed=0, **kwargs): | ||
| parameters = {"column_types": column_types} | ||
| parameters.update(kwargs) | ||
| Transformer.__init__( |
This was causing parameters to disappear. I'm still slightly unsure of the exact mechanism, but my hypothesis is that the parameters field in SelectByType you can see here is different from the parameters field created by this init logic.
It's not that they're disappearing; it's that they're put in the parameters dictionary, because that's what the Transformer init does.
'Numeric Pipeline - Select Columns By Type Transformer': {'columns': None, 'parameters': {'column_types': ['numeric']}, 'component_obj': None}
I agree with the redesign. ColumnSelectors are meant to take a list of columns, so we need to either rename column_types to columns or not make this component a subclass. What we were doing before, trying to skip the parent init and get to the grandparent init, is probably not the right thing to do.
Interesting find!
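For readers following along, here is a minimal sketch of one way the nested dictionary shown above could arise. The three classes are simplified stand-ins written for illustration, not the library's actual code, and `SelectByTypeOld` is a hypothetical name; the assumption is that the parent init collects every extra keyword into its own parameters dict.

```python
class Transformer:
    """Stand-in base class: stores whatever parameters dict it is given."""

    def __init__(self, parameters=None, component_obj=None, random_seed=0, **kwargs):
        self.parameters = parameters or {}
        self.component_obj = component_obj
        self.random_seed = random_seed


class ColumnSelector(Transformer):
    """Stand-in parent: builds its own dict from `columns` plus any kwargs."""

    def __init__(self, columns=None, random_seed=0, **kwargs):
        parameters = {"columns": columns}
        parameters.update(kwargs)  # every extra keyword lands in the dict
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )


class SelectByTypeOld(ColumnSelector):
    """Hypothetical subclass that forwards a ready-made dict to its parent."""

    def __init__(self, column_types=None, random_seed=0, **kwargs):
        parameters = {"column_types": column_types}
        parameters.update(kwargs)
        # `parameters=` and `component_obj=` are not named arguments of the
        # parent init, so they are absorbed by its **kwargs and nested.
        super().__init__(
            parameters=parameters, component_obj=None, random_seed=random_seed
        )


old = SelectByTypeOld(column_types=["numeric"])
print(old.parameters)
```

Under these assumptions `column_types` is still present, just one level deeper than a caller would look for it, which matches the "not disappearing" reading above.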
| class SelectByType(ColumnSelector): | ||
| class SelectByType(Transformer): |
Decided to just inherit from Transformer and override fit as well.
always good to consider
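A rough sketch of what the redesign described above might look like: inherit from Transformer directly and override fit as a no-op. The `Transformer` class, the `_kind` helper, and the dict-of-lists table are simplified stand-ins assumed for illustration; the real component works on dataframes with typing information rather than plain dicts.

```python
class Transformer:
    """Stand-in base class: stores the parameters dict it is given."""

    def __init__(self, parameters=None, component_obj=None, random_seed=0, **kwargs):
        self.parameters = parameters or {}
        self.component_obj = component_obj
        self.random_seed = random_seed


def _kind(values):
    """Hypothetical type inference: all ints/floats means numeric."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    return "category"


class SelectByType(Transformer):
    """Keeps only the columns whose inferred type is in `column_types`."""

    def __init__(self, column_types=None, random_seed=0, **kwargs):
        parameters = {"column_types": column_types}
        parameters.update(kwargs)
        # Only one init sits above this class, so nothing gets nested.
        Transformer.__init__(
            self, parameters=parameters, component_obj=None, random_seed=random_seed
        )

    def fit(self, X, y=None):
        # Selection by type has nothing to learn from the data.
        return self

    def transform(self, X, y=None):
        wanted = set(self.parameters["column_types"] or [])
        return {name: values for name, values in X.items() if _kind(values) in wanted}


X = {"age": [31, 45], "city": ["NYC", "SF"], "income": [52000.0, 61000.0]}
selector = SelectByType(column_types=["numeric"])
selected = selector.fit(X).transform(X)
print(selector.parameters, list(selected))
```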
| Defaults to None | ||
| extra_components (list[ComponentBase]): List of extra components to be added after preprocessing components. Defaults to None. | ||
| extra_components_position (str): Where to put extra components. Defaults to "before_preprocessing" and any other value will put components after preprocessing components. | ||
| extra_components_before (list[ComponentBase]): List of extra components to be added before preprocessing components. Defaults to None. |
Revised this API to just have two lists, one for components in each location. I was thinking about changing it to a component-to-index mapping but thought it would be too complicated (I would need to figure out index locations and how they would change as we insert each component).
I think this is a solid change. I think you should let the use cases drive the API. Do we currently have (or foresee) a reason to be able to inject components at different locations?
I don't see any short-term use other than in DefaultAlgorithm, but I can imagine that sometime in the future we would want more control over exactly where components go in our pipelines. I agree that we should let the use cases drive the API, so let's leave this for now!
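To make the two-list idea concrete, here is a small sketch of how the component order could be assembled. `assemble_components` is a hypothetical helper, and `extra_components_after` is an assumed name for the second list, since only `extra_components_before` is visible in the hunk above.

```python
def assemble_components(
    preprocessing,
    estimator,
    extra_components_before=None,
    extra_components_after=None,
):
    """Return the ordered component list for a pipeline.

    Hypothetical helper: extras are placed either before or after the
    preprocessing components, and the estimator always comes last.
    """
    before = list(extra_components_before or [])
    after = list(extra_components_after or [])
    return before + list(preprocessing) + after + [estimator]


order = assemble_components(
    preprocessing=["Imputer", "One Hot Encoder"],
    estimator="Random Forest Classifier",
    extra_components_before=["Select Columns By Type Transformer"],
)
print(order)
```

Compared with a component-to-index mapping, no positions need to be recomputed as components are inserted, at the cost of supporting only two insertion points.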
Codecov Report
@@           Coverage Diff           @@
##            main   #3209     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%
=======================================
  Files        326     326
  Lines      31497   31563     +66
=======================================
+ Hits       31406   31472     +66
  Misses        91      91
| } | ||
| numeric_pipeline = make_pipeline( | ||
| self.X, | ||
| self.X.ww.select(exclude="category"), |
Here I exclude all columns with the category semantic tag, so there won't be preprocessing components for those columns.
| } | ||
| categorical_pipeline = make_pipeline( | ||
| self.X, | ||
| self.X.ww.select(include=["category"]), |
Likewise, I do the opposite here.
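The two selections above partition the columns by the category semantic tag. A minimal sketch of that partition, using a plain dict of column name to tag set in place of a typed dataframe; `select_columns` is a hypothetical stand-in that only imitates the include/exclude behaviour described in these comments.

```python
def _as_set(tags):
    """Accept None, a single tag string, or an iterable of tags."""
    if tags is None:
        return set()
    if isinstance(tags, str):
        return {tags}
    return set(tags)


def select_columns(tags_by_column, include=None, exclude=None):
    """Return column names matching `include` and not matching `exclude`."""
    include_tags = _as_set(include)
    exclude_tags = _as_set(exclude)
    selected = []
    for name, tags in tags_by_column.items():
        if include is not None and not (tags & include_tags):
            continue  # an include filter was given and this column misses it
        if tags & exclude_tags:
            continue  # the column carries an excluded tag
        selected.append(name)
    return selected


# Hypothetical schema: column name -> semantic tags.
schema = {"age": {"numeric"}, "city": {"category"}, "income": {"numeric"}}

numeric_cols = select_columns(schema, exclude="category")
categorical_cols = select_columns(schema, include=["category"])
print(numeric_cols, categorical_cols)
```

Because the two calls are complements of each other, every column ends up in exactly one sub-pipeline, which is why no categorical preprocessing appears on the numeric side.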
chukarsten left a comment
Looks good, Jeremy! Just a copy pasta and a nit. Do with them what you want!
…js_3020_fully_split
eccabay left a comment
This is really cool work! Just left a few small nits.
ParthivNaresh left a comment
Excellent work, the perf tests look great!
Fixes #3020
Example:

You can see that a one hot encoder does not exist in the numeric pipeline.
Perf test results:
split_fixed_no_change.html.zip