Adds drop column transformer component #827

Merged
angela97lin merged 13 commits into master from 774_drop_component on Jun 4, 2020

Contributor

@angela97lin angela97lin commented Jun 1, 2020

Closes #774

Questions:

  • What if the input is not a pd.DataFrame, e.g. an np.array?
  • What if the columns don't have names?
  • To address both: do we accept indices?

Currently, my implementation supports np.arrays and indices.
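
To make the questions above concrete, here is a minimal sketch of one way a drop-columns transformer could accept both DataFrames and arrays. `DropColumnsSketch` is a hypothetical stand-in written for illustration; it is not the component added in this PR, and wrapping arrays in a DataFrame is an assumption about how index support might work.

```python
import numpy as np
import pandas as pd


class DropColumnsSketch:
    """Hypothetical stand-in for the component under review (not evalml's code)."""

    def __init__(self, columns=None):
        # Treat None as "drop nothing".
        self.columns = list(columns) if columns is not None else []

    def transform(self, X):
        # Wrap arrays in a DataFrame so columns get integer labels 0..n-1.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # DataFrame.drop returns a new object; the input is left untouched.
        return X.drop(columns=self.columns)


# Array input: columns are addressed by integer index.
arr = np.array([[1, 2, 3], [4, 5, 6]])
dropped_from_array = DropColumnsSketch(columns=[0]).transform(arr)

# DataFrame input: columns are addressed by name.
df = pd.DataFrame({"one": [1, 2], "two": [3, 4]})
dropped_from_frame = DropColumnsSketch(columns=["one"]).transform(df)
```

In this sketch an integer in `columns` is always interpreted as a column label, which only coincides with the position when the labels are the default 0..n-1.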

@angela97lin angela97lin self-assigned this Jun 1, 2020

codecov bot commented Jun 1, 2020

Codecov Report

Merging #827 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master     #827    +/-   ##
========================================
  Coverage   99.66%   99.67%            
========================================
  Files         184      186     +2     
  Lines        7190     7295   +105     
========================================
+ Hits         7166     7271   +105     
  Misses         24       24            
Impacted Files                                         Coverage Δ
evalml/pipelines/components/__init__.py                100.00% <ø> (ø)
evalml/pipelines/components/utils.py                   100.00% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py  100.00% <100.00%> (ø)
.../pipelines/components/transformers/drop_columns.py  100.00% <100.00%> (ø)
evalml/tests/component_tests/test_components.py        100.00% <100.00%> (ø)
...s/component_tests/test_drop_columns_transformer.py  100.00% <100.00%> (ø)
evalml/tests/component_tests/test_utils.py             96.42% <100.00%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py          99.74% <100.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f7d220e...ac117dd.

@angela97lin angela97lin requested a review from dsherry June 2, 2020 13:02
@angela97lin angela97lin marked this pull request as ready for review June 2, 2020 13:02
Contributor

@dsherry dsherry left a comment

Left some implementation and test comments.

I think the questions you asked in the PR description need resolution. I think this component should error if it can't find the specified columns; hence, if an np.array is provided, it should error. As for pandas DataFrames, what do you think should happen? We could have the column names fall back to using the index if one exists, but I'm worried that would be nonintuitive behavior, and that we should instead require users to set column names in their DataFrames if they want to use this component.
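
The stricter behavior proposed here could look roughly like the following sketch. `check_columns_present` and its error messages are hypothetical names chosen for illustration; the PR's actual validation may differ.

```python
import numpy as np
import pandas as pd


def check_columns_present(X, columns):
    """Hypothetical validation helper: raise if any requested column is absent."""
    if not isinstance(X, pd.DataFrame):
        raise ValueError("Expected a pandas DataFrame with named columns")
    missing = [col for col in columns if col not in X.columns]
    if missing:
        raise ValueError(f"Columns not found in input data: {missing}")


def raises_value_error(data, columns):
    """Small helper so the outcomes below can be recorded as booleans."""
    try:
        check_columns_present(data, columns)
    except ValueError:
        return True
    return False


X = pd.DataFrame({"one": [1, 2], "two": [3, 4]})
existing_column_rejected = raises_value_error(X, ["one"])
missing_column_rejected = raises_value_error(X, ["three"])
array_rejected = raises_value_error(np.array([[1, 2], [3, 4]]), [0])
```

Under this policy a missing name fails loudly at the point of use rather than silently dropping nothing.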

Contributor

@kmax12 kmax12 left a comment

Made some comments, but otherwise this looks good to me once someone else also approves.

@angela97lin
Contributor Author

@dsherry @kmax12 Thanks for the comments!

RE my questions: It'd be nice to support np.arrays, and pd.DataFrames without string column names, by using indices (or, in the case of DataFrames, integer column names). It looks like unnamed columns are basically integer column names, so I don't see anything wrong with choosing a specific column by its index. Or at least I don't see it being unintuitive enough that we should just error out 🤔 What do you think? What confusion are you worried this may cause?
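
The observation about unnamed columns can be checked directly in pandas. This is a small illustrative snippet, not code from the PR.

```python
import numpy as np
import pandas as pd

# A DataFrame built without column names gets integer labels 0..n-1.
unnamed = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
column_labels = list(unnamed.columns)

# Those integer labels can be passed to drop like any other column name.
without_first = unnamed.drop(columns=[0])
remaining_labels = list(without_first.columns)
```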

@angela97lin angela97lin requested review from dsherry and kmax12 June 3, 2020 18:58
Contributor

@kmax12 kmax12 left a comment

LGTM pending @dsherry review

Contributor

dsherry commented Jun 3, 2020

@angela97lin yeah, makes sense. I like the idea of having that support. My concern is that pandas DataFrames can sometimes have a custom index set up, and that adding support for that case increases complexity and will need more unit testing. If you want to add it, it would definitely be useful, but consider doing it in a separate PR so we can get this merged first.
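
One possible reading of this concern, assuming "custom index" covers integer column labels that do not match column positions, is that "drop column 0" becomes ambiguous. The sketch below is illustrative only and shows the two interpretations giving different results.

```python
import pandas as pd

# Integer column labels that are deliberately out of positional order.
X = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[2, 0, 1])

# Interpreting 0 as a label drops the middle column.
by_label = X.drop(columns=[0])

# Interpreting 0 as a position drops the first column (whose label is 2).
by_position = X.drop(columns=X.columns[[0]])
```

Supporting indices would mean picking one of these interpretations and testing it, which is the extra complexity mentioned above.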

def test_drop_column_transformer_transform():
    X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})
    drop_transformer = DropColumns(columns=None)
    assert drop_transformer.transform(X).equals(X)
Contributor

It would be nice to also assert that X is left unmodified by transform.
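
A test along the lines suggested here might look like the sketch below. `DropColumnsSketch` is a hypothetical stand-in so the example runs on its own; in the PR the assertion would target the real `DropColumns` component.

```python
import pandas as pd


class DropColumnsSketch:
    """Hypothetical stand-in for DropColumns, used only to make this runnable."""

    def __init__(self, columns=None):
        self.columns = list(columns) if columns is not None else []

    def transform(self, X):
        # drop returns a new DataFrame rather than mutating X.
        return X.drop(columns=self.columns)


def test_transform_does_not_modify_input():
    X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})
    X_before = X.copy()  # snapshot taken before calling transform

    transformed = DropColumnsSketch(columns=["one"]).transform(X)

    assert list(transformed.columns) == ["two", "three"]
    # The original frame should be identical to the snapshot.
    pd.testing.assert_frame_equal(X, X_before)


test_transform_does_not_modify_input()
```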

    assert drop_transformer.transform(X).equals(X)

    drop_transformer = DropColumns(columns=["one"])
    assert drop_transformer.transform(X).equals(X.loc[:, X.columns != "one"])
Contributor

Cool. Could also say X.drop(columns=['one']), right?

Contributor Author

Yup! I had wanted to select the columns in a different way than our transformer implementation does, haha, but I think this is clearer, so I've updated it!
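
For reference, the two ways of building the expected frame discussed in this thread can be compared directly. One pandas detail worth keeping in mind: `DataFrame.drop` targets row labels by default, so dropping a column needs `columns=` (or `axis=1`). The snippet is illustrative and not taken from the PR.

```python
import pandas as pd

X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})

# Boolean mask over the column labels, as in the original test.
via_mask = X.loc[:, X.columns != "one"]

# Explicit column drop, the more readable alternative.
via_drop = X.drop(columns=["one"])

same_result = via_mask.equals(via_drop)

# Without columns= or axis=1, drop looks for a row labeled "one" and fails.
try:
    X.drop(["one"])
    bare_drop_raised = False
except KeyError:
    bare_drop_raised = True
```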

Contributor

@dsherry dsherry left a comment

LGTM!

@angela97lin angela97lin merged commit c55e109 into master Jun 4, 2020
@angela97lin angela97lin deleted the 774_drop_component branch June 4, 2020 15:06
@kmax12 kmax12 mentioned this pull request Jun 4, 2020
@angela97lin angela97lin mentioned this pull request Jun 30, 2020
Development

Successfully merging this pull request may close these issues.

Add "drop column" component