-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements SelectColumns Transformer. Updating unit tests and docs. #873
Conversation
Codecov Report
@@ Coverage Diff @@
## master #873 +/- ##
=======================================
Coverage 99.75% 99.75%
=======================================
Files 195 195
Lines 8467 8503 +36
=======================================
+ Hits 8446 8482 +36
Misses 21 21
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton great first PR!! 🎊 The common base class is a helpful addition.
I left some comments. Only one blocking merge was about calling fit
in transform
. Others were just a couple style notes.
hyperparameter_ranges = {} | ||
|
||
def _modify_columns(self, cols, X, y=None): | ||
return X.drop(columns=cols, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton this is great! It's nice having the base class.
return X.drop(columns=cols, axis=1) | ||
|
||
def transform(self, X, y=None): | ||
"""Transforms data X by dropping columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you add this method just so the docstring would be different? 👍 that's great!
if not isinstance(X, pd.DataFrame): | ||
X = pd.DataFrame(X) | ||
|
||
self.fit(X, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton our general pattern is to keep fit
and transform
separate, and to use fit_transform
to combine them. Its nice to have all our components keep that convention so let's do that here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
@@ -17,6 +17,7 @@ Changelog | |||
* Added utility methods to calculate and graph permutation importances :pr:`860` | |||
* Added new utility functions necessary for generating dynamic preprocessing pipelines :pr:`852` | |||
* Added kwargs to all components :pr:`863` | |||
* Added SelectColumns transformer :pr:`873` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -4,5 +4,5 @@ | |||
from .feature_selection import FeatureSelector, RFClassifierSelectFromModel, RFRegressorSelectFromModel | |||
from .imputers import PerColumnImputer, SimpleImputer | |||
from .scalers import StandardScaler | |||
from .drop_columns import DropColumns | |||
from .column_selectors import DropColumns, SelectColumns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
with pytest.raises(ValueError, match="Parameter columns must be a list."): | ||
_ = DropColumns(columns="Column1") | ||
with pytest.raises(ValueError, match="Parameter columns must be a list."): | ||
_ = SelectColumns(columns="Column2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
For maintainability, may be nice to split each of these unit tests into two, i.e. test_drop_columns_init
and test_select_columns_init
.
In fact, lately I've been getting into using pytest.mark.parametrize
for tests which have two or three configuration options but would otherwise have mostly identical code. That could be helpful here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, these tests looks like a good candidate for parametrize
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out! I've reworked the tests to use parametrize
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments but I think once you separate fit/transform, LGTM! 👍
def fit(self, X, y=None): | ||
"""'Fits' the transformer by checking if the column names are present in the dataset. | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, and honestly we probably have "Args" floating around elsewhere but we're trying to go with "Arguments" for docstrings! (Same with other places in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out!
with pytest.raises(ValueError, match="Parameter columns must be a list."): | ||
_ = DropColumns(columns="Column1") | ||
with pytest.raises(ValueError, match="Parameter columns must be a list."): | ||
_ = SelectColumns(columns="Column2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, these tests looks like a good candidate for parametrize
:)
|
||
assert check2(class_to_test(columns=["one"]).fit_transform(X), X) | ||
|
||
assert check3(class_to_test(columns=list(X.columns)).fit_transform(X), X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton oh cool, I hadn't considered abstracting the checks out too! It may be easier to read in the future if we did something like
@pytest.mark.parametrize("class_to_test", [DropColumns, SelectColumns])
def test_column_transformer_fit_transform(class_to_test):
X = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [2, 3, 4, 5], 'three': [1, 2, 3, 4]})
assert df.empty(<stuff>)
assert X == class_to_test(columns=[]).fit_transform(X)
if class_to_test == DropColumns:
assert X == class_to_test().fit_transform(X.drop(columns=["one"]))
if class_to_test == SelectColumns:
assert X == X[["one"]]
I think I'm screwing up the checks but that's a pattern I've used elsewhere recently.
But your call!
Pull Request Description
Addresses #837 by implementing a
SelectColumns
transformer. To reduce code duplication betweenDropColumns
andSelectColumns
I created a base class calledColumnSelector
. Since this PR is only a minor modification of the code written forDropColumns
, I did not prepare a design document.After creating the pull request: in order to pass the changelog_updated check you will need to update the "Future Release" section of
docs/source/changelog.rst
to include this pull request by adding :pr:123
.