-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DelayedFeatureTransformer encodes categorical features and targets #1691
DelayedFeatureTransformer encodes categorical features and targets #1691
Conversation
@@ -173,3 +226,40 @@ def test_target_delay_when_gap_is_0(gap, delayed_features_data): | |||
answer = answer.drop(columns=["target_delay_0"]) | |||
|
|||
pd.testing.assert_frame_equal(transformer.fit_transform(None, y), answer) | |||
|
|||
|
|||
@pytest.mark.parametrize('use_woodwork', [True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realized we didn't have coverage for woodwork
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woot woot 🥳
Codecov Report
@@ Coverage Diff @@
## main #1691 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 18767 18853 +86
=========================================
+ Hits 18759 18845 +86
Misses 8 8
Continue to review full report at Codecov.
|
38edb46
to
bc0a53d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left a few comments, but good tests!
if y is not None: | ||
y = _convert_to_woodwork_structure(y) | ||
|
||
if y.logical_type == logical_types.Categorical: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this check fail if y
is None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed! 🙏
if col_name in categorical_columns: | ||
col = LabelEncoder().fit_transform(col) | ||
col = pd.Series(col, index=X.index) | ||
X = X.assign(**{f"{col_name}_delay_{t}": col.shift(t) for t in range(1, self.max_delay + 1)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. Just a small change to take out a for loop - take it or leave it, lol. Not worth holding up a merge over. Nice job.
for col_name in X: | ||
col = X[col_name] | ||
if col_name in categorical_columns: | ||
col = LabelEncoder().fit_transform(col) | ||
col = pd.Series(col, index=X.index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered doing all the encoding over a subset of the data frame:
X[categorical_columns] = X[categorical_columns].apply(LabelEncoder().fit_transform)
Or something like that...I didn't run it ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give this a shot! Great suggestion :)
else: | ||
y = _convert_woodwork_types_wrapper(y.to_series()) | ||
|
||
categorical_columns = {name for name, column in X.columns.items() if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if we were using straight up pandas, we'd do something like df.select_dtypes("object")
. Does woodwork not have anything like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a select method we can use! This would pair nicely with your suggestion to apply all of the encoding at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to not go with select because ultimately we just need the column names for categorical columns. Using select returns a datatable so we'd need an extra step to return the column names. I figure this is more direct.
b313ae2
to
25d4764
Compare
…TS pipeline code.
25d4764
to
49dd106
Compare
Pull Request Description
Fixes #1581
Fixes #1685
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rst
to include this pull request by adding :pr:123
.