Adds drop column transformer component #827
Codecov Report
@@           Coverage Diff            @@
##           master    #827    +/-    ##
========================================
  Coverage   99.66%   99.67%
========================================
  Files         184      186      +2
  Lines        7190     7295    +105
========================================
+ Hits         7166     7271    +105
  Misses         24       24
evalml/pipelines/components/transformers/drop_columns_transformer.py
Left some impl and test comments.
I think the questions you asked in the PR description need resolution. I think this component should error if it can't find the specified columns, hence if a np array is provided it should error. As for pandas dfs, what do you think should happen? We could have the column names fall back to trying to use the index if one exists, but I'm worried that will be nonintuitive behavior, and that instead we should require users to set column names in their dfs if they want to use this component.
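As a rough illustration of the stricter behavior proposed above, here is a minimal sketch. The helper `drop_columns_strict` is hypothetical and is not the `DropColumns` implementation in this PR; it only assumes pandas and NumPy.

```python
import numpy as np
import pandas as pd


def drop_columns_strict(X, columns=None):
    """Hypothetical helper (not evalml's DropColumns): drop named columns or raise."""
    # A NumPy array carries no column names, so this sketch refuses it outright.
    if not isinstance(X, pd.DataFrame):
        raise TypeError("column names are only defined for pandas DataFrames")
    if columns is None:
        # Nothing to drop; return a copy so the caller's frame is never aliased.
        return X.copy()
    # Fail loudly if any requested column is absent instead of silently skipping it.
    missing = [c for c in columns if c not in X.columns]
    if missing:
        raise ValueError(f"columns not found in input data: {missing}")
    return X.drop(columns=columns)
```

Whether the real component should raise or fall back to positional indices is exactly the open question in this thread; the sketch only shows what the "error out" option could look like.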
made some comments, but otherwise looks good to me once someone else also approves
@dsherry @kmax12 Thanks for the comments! RE my questions: It'd be nice to support np.arrays and pd.DataFrames without specific string names by using indices or, in the case of DataFrames, integer column names. Looks like unnamed columns are basically integer column names, so I don't see anything wrong with choosing a specific column by its index? Or at least I don't see it being unintuitive enough that we should just error out 🤔 What do you think? What confusion are you worried this may cause?
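The premise that unnamed columns behave like integer column names can be checked with a short sketch. It uses only pandas and NumPy, and the variable names are made up for illustration; it says nothing about how the component in this PR handles these inputs.

```python
import numpy as np
import pandas as pd

X_np = np.arange(6).reshape(2, 3)

# A DataFrame built without column names gets integer labels 0, 1, 2.
X_df = pd.DataFrame(X_np)
assert list(X_df.columns) == [0, 1, 2]

# Dropping by integer label works like dropping by name.
dropped_df = X_df.drop(columns=[1])
assert list(dropped_df.columns) == [0, 2]

# The positional equivalent for a plain NumPy array.
dropped_np = np.delete(X_np, [1], axis=1)
assert dropped_np.shape == (2, 2)

# A custom row index leaves the column labels untouched.
X_custom = pd.DataFrame(X_np, index=["a", "b"])
assert list(X_custom.columns) == [0, 1, 2]
```

Note that after a drop the remaining integer labels are no longer contiguous, so label-based and position-based lookups can disagree on the result.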
LGTM pending @dsherry review
@angela97lin yeah, makes sense. I like the idea of having that support. My concern is that pandas DFs can sometimes have a custom index set up, and that adding support for that case increases complexity and will need more unit testing. If you wanna add it, it would definitely be useful, but consider doing it in a separate PR so we can get this merged first.
def test_drop_column_transformer_transform():
    X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})
    drop_transformer = DropColumns(columns=None)
    assert drop_transformer.transform(X).equals(X)
Would be nice to also assert that X has been unmodified by transform
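One way to write the suggested check is to snapshot the input before calling transform and compare afterwards. The sketch below uses a stand-in class, `DropColumnsStandIn`, which is hypothetical and only mimics the `columns` argument and `transform` call seen in the test; it is not the component under review.

```python
import pandas as pd


class DropColumnsStandIn:
    """Stand-in for illustration only; not the evalml component."""

    def __init__(self, columns=None):
        self.columns = columns

    def transform(self, X):
        if self.columns is None:
            return X.copy()
        # DataFrame.drop returns a new frame and leaves X as it was.
        return X.drop(columns=self.columns)


def test_transform_does_not_modify_input():
    X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})
    X_before = X.copy()  # snapshot taken before transform runs
    drop_transformer = DropColumnsStandIn(columns=["one"])
    result = drop_transformer.transform(X)
    # The input must still match the snapshot exactly.
    pd.testing.assert_frame_equal(X, X_before)
    assert list(result.columns) == ["two", "three"]
```

Comparing against a copy catches in-place mutation that a plain `transform(X).equals(X)` assertion would miss when `columns=None`.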
    assert drop_transformer.transform(X).equals(X)

    drop_transformer = DropColumns(columns=["one"])
    assert drop_transformer.transform(X).equals(X.loc[:, X.columns != "one"])
Cool. Could also say X.drop(columns=['one']), right?
Yup! I don't know why, I had wanted to access via a different way than our transformer implementation haha but I think this is more clear so updated!
LGTM!
Closes #774
Questions:
Currently, my implementation supports np.arrays + indices.