Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements SelectColumns Transformer. Updating unit tests and docs. #873

Merged
merged 6 commits into from Jun 23, 2020

Conversation

freddyaboulton
Copy link
Contributor

@freddyaboulton freddyaboulton commented Jun 23, 2020

Pull Request Description

Addresses #837 by implementing a SelectColumns transformer. To reduce code duplication between DropColumns and SelectColumns I created a base class called ColumnSelector. Since this PR is only a minor modification of the code written for DropColumns, I did not prepare a design document.


After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of docs/source/changelog.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton linked an issue Jun 23, 2020 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Jun 23, 2020

Codecov Report

Merging #873 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #873   +/-   ##
=======================================
  Coverage   99.75%   99.75%           
=======================================
  Files         195      195           
  Lines        8467     8503   +36     
=======================================
+ Hits         8446     8482   +36     
  Misses         21       21           
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
...alml/pipelines/components/transformers/__init__.py 100.00% <100.00%> (ø)
...elines/components/transformers/column_selectors.py 100.00% <100.00%> (ø)
...mponent_tests/test_column_selector_transformers.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 56c4521...e342e23. Read the comment docs.

@freddyaboulton freddyaboulton marked this pull request as ready for review Jun 23, 2020
Copy link
Collaborator

@dsherry dsherry left a comment

@freddyaboulton great first PR!! 🎊 The common base class is a helpful addition.

I left some comments. Only one blocking merge was about calling fit in transform. Others were just a couple style notes.

hyperparameter_ranges = {}

def _modify_columns(self, cols, X, y=None):
return X.drop(columns=cols, axis=1)
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton this is great! It's nice having the base class.

return X.drop(columns=cols, axis=1)

def transform(self, X, y=None):
"""Transforms data X by dropping columns.
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you add this method just so the docstring would be different? 👍 that's great!

if not isinstance(X, pd.DataFrame):
X = pd.DataFrame(X)

self.fit(X, y)
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton our general pattern is to keep fit and transform separate, and to use fit_transform to combine them. Its nice to have all our components keep that convention so let's do that here too.

Copy link
Contributor Author

@freddyaboulton freddyaboulton Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

@@ -17,6 +17,7 @@ Changelog
* Added utility methods to calculate and graph permutation importances :pr:`860`
* Added new utility functions necessary for generating dynamic preprocessing pipelines :pr:`852`
* Added kwargs to all components :pr:`863`
* Added SelectColumns transformer :pr:`873`
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -4,5 +4,5 @@
from .feature_selection import FeatureSelector, RFClassifierSelectFromModel, RFRegressorSelectFromModel
from .imputers import PerColumnImputer, SimpleImputer
from .scalers import StandardScaler
from .drop_columns import DropColumns
from .column_selectors import DropColumns, SelectColumns
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

with pytest.raises(ValueError, match="Parameter columns must be a list."):
_ = DropColumns(columns="Column1")
with pytest.raises(ValueError, match="Parameter columns must be a list."):
_ = SelectColumns(columns="Column2")
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

For maintainability, may be nice to split each of these unit tests into two, i.e. test_drop_columns_init and test_select_columns_init.

In fact, lately I've been getting into using pytest.mark.parametrize for tests which have two or three configuration options but would otherwise have mostly identical code. That could be helpful here.

Copy link
Contributor

@angela97lin angela97lin Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, these tests looks like a good candidate for parametrize :)

Copy link
Contributor Author

@freddyaboulton freddyaboulton Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out! I've reworked the tests to use parametrize .

@dsherry dsherry requested a review from jeremyliweishih Jun 23, 2020
Copy link
Contributor

@angela97lin angela97lin left a comment

Minor comments but I think once you separate fit/transform, LGTM! 👍

def fit(self, X, y=None):
"""'Fits' the transformer by checking if the column names are present in the dataset.

Args:
Copy link
Contributor

@angela97lin angela97lin Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, and honestly we probably have "Args" floating around elsewhere but we're trying to go with "Arguments" for docstrings! (Same with other places in this PR)

Copy link
Contributor Author

@freddyaboulton freddyaboulton Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out!

with pytest.raises(ValueError, match="Parameter columns must be a list."):
_ = DropColumns(columns="Column1")
with pytest.raises(ValueError, match="Parameter columns must be a list."):
_ = SelectColumns(columns="Column2")
Copy link
Contributor

@angela97lin angela97lin Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, these tests looks like a good candidate for parametrize :)

@freddyaboulton freddyaboulton merged commit b69e1ed into master Jun 23, 2020
2 checks passed
@freddyaboulton freddyaboulton deleted the 837-select-columns branch Jun 23, 2020

assert check2(class_to_test(columns=["one"]).fit_transform(X), X)

assert check3(class_to_test(columns=list(X.columns)).fit_transform(X), X)
Copy link
Collaborator

@dsherry dsherry Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@freddyaboulton oh cool, I hadn't considered abstracting the checks out too! It may be easier to read in the future if we did something like

@pytest.mark.parametrize("class_to_test", [DropColumns, SelectColumns])
def test_column_transformer_fit_transform(class_to_test):
    X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})
    assert df.empty(<stuff>)
    assert X == class_to_test(columns=[]).fit_transform(X)
    if class_to_test == DropColumns:
        assert X == class_to_test().fit_transform(X.drop(columns=["one"]))
    if class_to_test == SelectColumns:
        assert X == X[["one"]]

I think I'm screwing up the checks but that's a pattern I've used elsewhere recently.

But your call!

@angela97lin angela97lin mentioned this pull request Jun 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Select Column Component
3 participants