Skip to content

Add SelectDtypeColumns component#2531

Merged
eccabay merged 34 commits intomainfrom
2479_dtype_columns
Jul 30, 2021
Merged

Add SelectDtypeColumns component#2531
eccabay merged 34 commits intomainfrom
2479_dtype_columns

Conversation

@eccabay
Copy link
Contributor

@eccabay eccabay commented Jul 20, 2021

Closes #2479

I'm totally open to changing the name of the class, it was just the first thing I thought of

@CLAassistant
Copy link

CLAassistant commented Jul 20, 2021

CLA assistant check
All committers have signed the CLA.

@codecov
Copy link

codecov bot commented Jul 20, 2021

Codecov Report

Merging #2531 (eaf766d) into main (2e0551b) will increase coverage by 0.1%.
The diff coverage is 97.6%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2531     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        291     291             
  Lines      26587   26644     +57     
=======================================
+ Hits       26544   26601     +57     
  Misses        43      43             
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_utils.py 97.5% <50.0%> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <100.0%> (ø)
...elines/components/transformers/column_selectors.py 100.0% <100.0%> (ø)
...mponent_tests/test_column_selector_transformers.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2e0551b...eaf766d. Read the comment docs.

@eccabay eccabay self-assigned this Jul 20, 2021
Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay Looks good! Thanks for using the existing ColumnSelector abstraction!

One comment I'd like to resolve before merge is that if we make _check_input_for_columns a no-op we get semantic tag selection for free! I think that'll make this more flexible.

needs_fitting = False

def _check_input_for_columns(self, X):
cols = self.parameters.get("columns") or []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on making this a no-op?

If we do that, then we get selection by semantic tag for free. I feel like that would be useful for workflows that don't use the "standard" set of types. For example, a user might be interested in selecting all numeric features (including Age and AgeNullable) or all categorical features (CountryCode and Ordinal). It would also make it easy for users to select based on their own criteria if they set the tags correctly. For example, users may be interested in selecting based on whether a feature was created by another component or not.

Another argument for doing that is that check_input_for_columns was implemented because X.ww[cols] would error if the columns were not in the dataframe. However, Xww.select does not error out if the resulting dataframe is empty. I think it's ok if this returns an empty dataframe - maybe we can just raise a warning instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think this could definitely be a no-op, since it does allow us to select by semantic tags automatically.

Additionally,, currently it doesn't allow for selection if you use the capital string logical types (ie "NaturalLanguage" vs "natural_language")
image
I would think this should work since I can call X.ww.select() with both and get valid results

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Agreed with Freddy, might be nice to not limit it to selecting by logical types. Left some comments on additional tests that could be helpful, as well as some clean up

needs_fitting = False

def _check_input_for_columns(self, X):
cols = self.parameters.get("columns") or []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think this could definitely be a no-op, since it does allow us to select by semantic tags automatically.

Additionally,, currently it doesn't allow for selection if you use the capital string logical types (ie "NaturalLanguage" vs "natural_language")
image
I would think this should work since I can call X.ww.select() with both and get valid results

@eccabay eccabay requested a review from bchen1116 July 22, 2021 13:58
n_components = 37
n_components = 38
elif is_using_conda:
n_components = 48
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting a pin in this after our conversation during office hours 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue filed #2555

@eccabay eccabay requested a review from dsherry July 26, 2021 18:37
@ParthivNaresh
Copy link
Contributor

ParthivNaresh commented Jul 26, 2021

@eccabay I triggered build_conda_pkg on the PR just to verify that it fails without the bump in n_components. I went ahead and added +1 to is_using_conda and is_running_py_39_or_above.

@dsherry You might have to merge this to main since it won't pass the codecov test until we can address a solution for #2555

def _modify_columns(self, cols, X, y=None):
selected_columns = X.ww.select(cols)
if len(cols) > 0 and selected_columns.empty:
raise ValueError(f"Column(s) of type {cols} not found in input data")
Copy link
Contributor

@dsherry dsherry Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same type of error we raise in ColumnSelector._check_input_for_columns. Do we need to define it in two places? I think the intent with the current layout of ColumnSelector methods was for _check_input_for_columns to do this sort of validation, and for _modify_columns to only include logic to make modifications (i.e. dropping columns for the DropColumns transformer)

I think we should define this error only in _check_input_for_columns.

If you want the error to include the column types for SelectByType, but omit the column types for ColumnSelector and other subclasses, I would suggest something like:

  • Define a class attribute like _VALIDATION_ERROR_FORMAT_STRING = "Columns {columns} not found in input data."
  • Override the value of _VALIDATION_ERROR_FORMAT_STRING in the SelectByType class definition to "Columns {columns} of type {column_types} not found in input data"
  • Update ColumnSelector._check_input_for_columns to do
raise ValueError(
                self._VALIDATION_ERROR_FORMAT_STRING.format(**self.parameters)
            )
  • Update SelectByType._modify_columns to just run return X.ww.select(zip(columns, self.column_types)) (my example here prob not 100% correct, but close)

Alternatively, we could simplify this, just delete SelectColumns._modify_columns and not have the error message say anything about types.

Make sense?

Copy link
Contributor

@dsherry dsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay looks good! I left some specific suggestions on how to reduce the complexity of this code change even further. Not blocking merge, feel free to merge this PR and get another up to address, if you are interested in doing so.

@dsherry
Copy link
Contributor

dsherry commented Jul 29, 2021

@ParthivNaresh thanks for the heads up. Thankfully, since we made the codecov patch check optional, and that is the one which is going red on this PR, @eccabay doesn't need me to do the merge! The codecov project check is still green here.

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning this up and taking in all of our comments, Becca! This looks great 😁

I do have some final comments before this is merged regarding _check_input_for_columns--I'm not certain it's doing exactly what we want right now in SelectByType. It checks for column names via the columns parameter, but SelectByType checks for column types, so I'm not sure we can just use it as is?


def transform(self, X, y=None):
X = infer_feature_types(X)
self._check_input_for_columns(X)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... does _check_input_for_columns do anything here? Since we don't define it, I'm assuming it uses ColumnSelector's, which has the line:

        cols = self.parameters.get("columns") or []

That returns an empty list for this component and therefore doesn't do any checks, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that that line will return an empty list and therefore the missing column name check will do nothing - that's by design here. This new bit of code should check if there are column types and if they match though:

col_types = self.parameters.get("column_types")
if col_types:
       missing_cols = X.ww.select(col_types).empty

I agree, it does feel a little weird to do this here, I'm thinking it may make more sense to move this to check_input_for_columns within SelectByType itself.

def __init__(self, column_types=None, random_seed=0, **kwargs):
parameters = {"column_types": column_types}
parameters.update(kwargs)
Transformer.__init__(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This, and the following comment, has me questioning if it's worth subclassing ColumnSelector rather than just Transformer; it seems like we're not gaining much value from ColumnSelector 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, the way this is written there's not a lot of benefit to having SelectByType subclass this implementation of ColumnSelector. That being said, it still makes logical sense to have it be a subclass, since it does function to select columns. With that, I see this more as a reason to consider refactoring ColumnSelector rather than removing SelectByType as a subclass.

Comment on lines 49 to +50
transformer = class_to_test(columns=["not in data"])
with pytest.raises(ValueError, match="'not in data' not found in input data"):
with pytest.raises(ValueError, match="not found in input data"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't actually test that a column type does not exist, but rather that a column does not exist! Since we're passing in the keyword columns, we trigger self._check_input_for_columns(X) to raise a ValueError, but don't we actually want to test the following:

transformer = SelectByType(column_types=["some fake column type"])

and see that the ValueError is raised? I think this is related to my previous comment about using _check_input_for_columns :)

@eccabay eccabay merged commit 53eaf22 into main Jul 30, 2021
@eccabay eccabay deleted the 2479_dtype_columns branch July 30, 2021 19:42
@chukarsten chukarsten mentioned this pull request Aug 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create a new component which selects using feature types

8 participants