DelayedFeatureTransformer encodes categorical features and targets #1691
pd.testing.assert_frame_equal(transformer.fit_transform(None, y), answer)

@pytest.mark.parametrize('use_woodwork', [True, False])
Realized we didn't have coverage for woodwork
Codecov Report
@@ Coverage Diff @@
## main #1691 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 18767 18853 +86
=========================================
+ Hits 18759 18845 +86
Misses 8 8
38edb46 to bc0a53d
bchen1116 left a comment
LGTM! Left a few comments, but good tests!
if y is not None:
y = _convert_to_woodwork_structure(y)

if y.logical_type == logical_types.Categorical:
Wouldn't this check fail if y is None?
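For context, here is a minimal sketch of the kind of guard this question is about. It uses plain pandas rather than woodwork, and the helper name `encode_target_if_categorical` and its dtype check are assumptions made up for the example, not the code in this PR.

```python
from typing import Optional

import pandas as pd


def encode_target_if_categorical(y: Optional[pd.Series]) -> Optional[pd.Series]:
    """Hypothetical helper: integer-encode a categorical target, pass None through."""
    if y is None:
        # Nothing to encode. Reading an attribute such as a dtype or
        # logical type on None would raise AttributeError, so return early.
        return None
    if isinstance(y.dtype, pd.CategoricalDtype) or y.dtype == object:
        # Category codes are integers assigned in sorted category order.
        codes = y.astype("category").cat.codes
        return pd.Series(codes, index=y.index, name=y.name)
    # Non-categorical targets are returned unchanged.
    return y
```

Keeping the type check behind the `None` check means a call with no target never reaches the attribute access.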
if col_name in categorical_columns:
col = LabelEncoder().fit_transform(col)
col = pd.Series(col, index=X.index)
X = X.assign(**{f"{col_name}_delay_{t}": col.shift(t) for t in range(1, self.max_delay + 1)})
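To make the `assign` line concrete, here is a small standalone sketch of what it produces for one column: one shifted copy per delay from 1 to `max_delay`. The function name `add_delayed_columns` and the sample data are assumptions for illustration only.

```python
import pandas as pd


def add_delayed_columns(X: pd.DataFrame, col_name: str, max_delay: int) -> pd.DataFrame:
    """Hypothetical helper: append col_name_delay_1 .. col_name_delay_<max_delay>."""
    col = X[col_name]
    # shift(t) moves values down by t rows and leaves NaN in the first t rows.
    delayed = {f"{col_name}_delay_{t}": col.shift(t) for t in range(1, max_delay + 1)}
    return X.assign(**delayed)


X = pd.DataFrame({"feature": [10, 20, 30, 40]})
delayed = add_delayed_columns(X, "feature", max_delay=2)
print(delayed)
```

Because `shift` introduces missing values, an integer column comes back as floating point in the delayed copies.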
chukarsten left a comment
This looks good. Just a small change to take out a for loop - take it or leave it, lol. Not worth holding up a merge over. Nice job.
for col_name in X:
col = X[col_name]
if col_name in categorical_columns:
col = LabelEncoder().fit_transform(col)
col = pd.Series(col, index=X.index)
Have you considered doing all the encoding over a subset of the data frame:
X[categorical_columns] = X[categorical_columns].apply(LabelEncoder().fit_transform)
Or something like that...I didn't run it ;)
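A runnable sketch of that idea, under a few assumptions: the helper name `encode_categorical_columns` is made up for the example, `categorical_columns` is passed as a list, and each column gets a fresh `LabelEncoder` whose output is rewrapped in a Series so the original index survives the assignment.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categorical_columns(X: pd.DataFrame, categorical_columns: list) -> pd.DataFrame:
    """Hypothetical helper: label-encode the given columns in one apply call."""

    def encode(col: pd.Series) -> pd.Series:
        # fit_transform returns a NumPy array, so rewrap it with the
        # column's own index to keep row alignment on assignment.
        return pd.Series(LabelEncoder().fit_transform(col), index=col.index)

    X = X.copy()  # leave the caller's frame untouched
    if categorical_columns:
        X[categorical_columns] = X[categorical_columns].apply(encode)
    return X


X = pd.DataFrame({"color": ["red", "blue", "red"], "n": [1, 2, 3]}, index=[5, 6, 7])
encoded = encode_categorical_columns(X, ["color"])
```

Rewrapping inside `encode` is a design choice for frames with a non-default index; passing `LabelEncoder().fit_transform` directly hands back bare arrays, and whether those line up with the original index is worth checking before relying on it.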
I'll give this a shot! Great suggestion :)
else:
y = _convert_woodwork_types_wrapper(y.to_series())

categorical_columns = {name for name, column in X.columns.items() if
So if we were using straight up pandas, we'd do something like df.select_dtypes("object"). Does woodwork not have anything like that?
There is a select method we can use! This would pair nicely with your suggestion to apply all of the encoding at once.
I decided to not go with select because ultimately we just need the column names for categorical columns. Using select returns a datatable so we'd need an extra step to return the column names. I figure this is more direct.
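The trade-off can be illustrated with a pandas-only analogy. This is not the woodwork API; the frame and column names are invented for the example. Selecting by dtype builds an intermediate frame whose column names still have to be extracted, while a comprehension over the dtypes yields the names directly.

```python
import pandas as pd

X = pd.DataFrame({
    "color": pd.Series(["red", "blue"], dtype="category"),
    "size": [1, 2],
    "shape": pd.Series(["square", "round"], dtype=object),
})

# Select-style approach: filter the frame by dtype, then read off the names.
via_select = set(X.select_dtypes(include=["object", "category"]).columns)

# Direct approach: walk the dtypes and keep only the matching names.
via_dtypes = {
    name
    for name, dtype in X.dtypes.items()
    if isinstance(dtype, pd.CategoricalDtype) or dtype == object
}
```

Both produce the same set of names here; the direct version just skips building the filtered frame.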
b313ae2 to 25d4764
…TS pipeline code.
25d4764 to 49dd106
Pull Request Description
Fixes #1581
Fixes #1685
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.