Conversation
Codecov Report
@@ Coverage Diff @@
## main #2531 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 291 291
Lines 26587 26644 +57
=======================================
+ Hits 26544 26601 +57
Misses 43 43
Continue to review full report at Codecov.
|
freddyaboulton
left a comment
There was a problem hiding this comment.
@eccabay Looks good! Thanks for using the existing ColumnSelector abstraction!
One comment I'd like to resolve before merge is that if we make _check_input_for_columns a no-op we get semantic tag selection for free! I think that'll make this more flexible.
| needs_fitting = False | ||
|
|
||
| def _check_input_for_columns(self, X): | ||
| cols = self.parameters.get("columns") or [] |
There was a problem hiding this comment.
What are your thoughts on making this a no-op?
If we do that, then we get selection by semantic tag for free. I feel like that would be useful for workflows that don't use the "standard" set of types. For example, a user might be interested in selecting all numeric features (including Age and AgeNullable) or all categorical features (CountryCode and Ordinal). It would also make it easy for users to select based on their own criteria if they set the tags correctly. For example, users may be interested in selecting based on whether a feature was created by another component or not.
Another argument for doing that is that check_input_for_columns was implemented because X.ww[cols] would error if the columns were not in the dataframe. However, Xww.select does not error out if the resulting dataframe is empty. I think it's ok if this returns an empty dataframe - maybe we can just raise a warning instead?
There was a problem hiding this comment.
Agreed. I think this could definitely be a no-op, since it does allow us to select by semantic tags automatically.
Additionally,, currently it doesn't allow for selection if you use the capital string logical types (ie "NaturalLanguage" vs "natural_language")

I would think this should work since I can call X.ww.select() with both and get valid results
bchen1116
left a comment
There was a problem hiding this comment.
Looking good! Agreed with Freddy, might be nice to not limit it to selecting by logical types. Left some comments on additional tests that could be helpful, as well as some clean up
| needs_fitting = False | ||
|
|
||
| def _check_input_for_columns(self, X): | ||
| cols = self.parameters.get("columns") or [] |
There was a problem hiding this comment.
Agreed. I think this could definitely be a no-op, since it does allow us to select by semantic tags automatically.
Additionally,, currently it doesn't allow for selection if you use the capital string logical types (ie "NaturalLanguage" vs "natural_language")

I would think this should work since I can call X.ww.select() with both and get valid results
evalml/tests/component_tests/test_column_selector_transformers.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/test_column_selector_transformers.py
Outdated
Show resolved
Hide resolved
| n_components = 37 | ||
| n_components = 38 | ||
| elif is_using_conda: | ||
| n_components = 48 |
There was a problem hiding this comment.
Putting a pin in this after our conversation during office hours 😅
|
@eccabay I triggered @dsherry You might have to merge this to main since it won't pass the codecov test until we can address a solution for #2555 |
| def _modify_columns(self, cols, X, y=None): | ||
| selected_columns = X.ww.select(cols) | ||
| if len(cols) > 0 and selected_columns.empty: | ||
| raise ValueError(f"Column(s) of type {cols} not found in input data") |
There was a problem hiding this comment.
This is the same type of error we raise in ColumnSelector._check_input_for_columns. Do we need to define it in two places? I think the intent with the current layout of ColumnSelector methods was for _check_input_for_columns to do this sort of validation, and for _modify_columns to only include logic to make modifications (i.e. dropping columns for the DropColumns transformer)
I think we should define this error only in _check_input_for_columns.
If you want the error to include the column types for SelectByType, but omit the column types for ColumnSelector and other subclasses, I would suggest something like:
- Define a class attribute like
_VALIDATION_ERROR_FORMAT_STRING = "Columns {columns} not found in input data." - Override the value of
_VALIDATION_ERROR_FORMAT_STRINGin theSelectByTypeclass definition to"Columns {columns} of type {column_types} not found in input data" - Update
ColumnSelector._check_input_for_columnsto do
raise ValueError(
self._VALIDATION_ERROR_FORMAT_STRING.format(**self.parameters)
)- Update
SelectByType._modify_columnsto just runreturn X.ww.select(zip(columns, self.column_types))(my example here prob not 100% correct, but close)
Alternatively, we could simplify this, just delete SelectColumns._modify_columns and not have the error message say anything about types.
Make sense?
dsherry
left a comment
There was a problem hiding this comment.
@eccabay looks good! I left some specific suggestions on how to reduce the complexity of this code change even further. Not blocking merge, feel free to merge this PR and get another up to address, if you are interested in doing so.
|
@ParthivNaresh thanks for the heads up. Thankfully, since we made the codecov patch check optional, and that is the one which is going red on this PR, @eccabay doesn't need me to do the merge! The codecov project check is still green here. |
angela97lin
left a comment
There was a problem hiding this comment.
Thanks for cleaning this up and taking in all of our comments, Becca! This looks great 😁
I do have some final comments before this is merged regarding _check_input_for_columns--I'm not certain it's doing exactly what we want right now in SelectByType. It checks for column names via the columns parameter, but SelectByType checks for column types, so I'm not sure we can just use it as is?
|
|
||
| def transform(self, X, y=None): | ||
| X = infer_feature_types(X) | ||
| self._check_input_for_columns(X) |
There was a problem hiding this comment.
Interesting... does _check_input_for_columns do anything here? Since we don't define it, I'm assuming it uses ColumnSelector's, which has the line:
cols = self.parameters.get("columns") or []
That returns an empty list for this component and therefore doesn't do any checks, right?
There was a problem hiding this comment.
It's true that that line will return an empty list and therefore the missing column name check will do nothing - that's by design here. This new bit of code should check if there are column types and if they match though:
col_types = self.parameters.get("column_types")
if col_types:
missing_cols = X.ww.select(col_types).empty
I agree, it does feel a little weird to do this here, I'm thinking it may make more sense to move this to check_input_for_columns within SelectByType itself.
| def __init__(self, column_types=None, random_seed=0, **kwargs): | ||
| parameters = {"column_types": column_types} | ||
| parameters.update(kwargs) | ||
| Transformer.__init__( |
There was a problem hiding this comment.
This, and the following comment, has me questioning if it's worth subclassing ColumnSelector rather than just Transformer; it seems like we're not gaining much value from ColumnSelector 😂
There was a problem hiding this comment.
That's true, the way this is written there's not a lot of benefit to having SelectByType subclass this implementation of ColumnSelector. That being said, it still makes logical sense to have it be a subclass, since it does function to select columns. With that, I see this more as a reason to consider refactoring ColumnSelector rather than removing SelectByType as a subclass.
| transformer = class_to_test(columns=["not in data"]) | ||
| with pytest.raises(ValueError, match="'not in data' not found in input data"): | ||
| with pytest.raises(ValueError, match="not found in input data"): |
There was a problem hiding this comment.
This test doesn't actually test that a column type does not exist, but rather that a column does not exist! Since we're passing in the keyword columns, we trigger self._check_input_for_columns(X) to raise a ValueError, but don't we actually want to test the following:
transformer = SelectByType(column_types=["some fake column type"])
and see that the ValueError is raised? I think this is related to my previous comment about using _check_input_for_columns :)
Closes #2479
I'm totally open to changing the name of the class, it was just the first thing I thought of