DelayedFeatureTransformer encodes categorical features and targets #1691

Merged
merged 10 commits into from
Jan 19, 2021

freddyaboulton
Contributor

Pull Request Description

Fixes #1581
Fixes #1685


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@freddyaboulton changed the title from "1581 delayed feature transformer encodes targets" to "DelayedFeatureTransformer Encodes categorical features and targets" on Jan 13, 2021
@@ -173,3 +226,40 @@ def test_target_delay_when_gap_is_0(gap, delayed_features_data):
    answer = answer.drop(columns=["target_delay_0"])

    pd.testing.assert_frame_equal(transformer.fit_transform(None, y), answer)


@pytest.mark.parametrize('use_woodwork', [True, False])
Contributor Author

Realized we didn't have coverage for woodwork

Contributor

Woot woot 🥳

@codecov

codecov bot commented Jan 13, 2021

Codecov Report

Merging #1691 (e5e9a63) into main (c41df79) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1691     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18767    18853     +86     
=========================================
+ Hits        18759    18845     +86     
  Misses          8        8             
Impacted Files | Coverage Δ
...mponents/transformers/preprocessing/time_series.py | 100.0% <100.0%> (ø)
.../pipelines/time_series_classification_pipelines.py | 100.0% <100.0%> (ø)
...mponent_tests/test_delayed_features_transformer.py | 100.0% <100.0%> (ø)
...peline_tests/test_time_series_baseline_pipeline.py | 100.0% <100.0%> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py | 100.0% <100.0%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c41df79...e5e9a63.

@freddyaboulton force-pushed the 1581-delayed-feature-transformer-encodes-targets branch from 38edb46 to bc0a53d on January 13, 2021 21:55
@freddyaboulton marked this pull request as ready for review on January 13, 2021 22:33
@freddyaboulton added the enhancement label (An improvement to an existing feature.) on Jan 13, 2021
Contributor

@bchen1116 left a comment

LGTM! Left a few comments, but good tests!

if y is not None:
    y = _convert_to_woodwork_structure(y)

if y.logical_type == logical_types.Categorical:
Contributor

Wouldn't this check fail if y is None?

Contributor Author

Fixed! 🙏
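
A minimal sketch of the kind of guard this thread is about, assuming plain pandas rather than woodwork. The helper name `encode_target` is hypothetical and not from the PR, and sorted pandas category codes stand in for the label encoding. The point is only that the categorical check has to sit behind the `None` check:

```python
from typing import Optional

import pandas as pd


def encode_target(y: Optional[pd.Series]) -> Optional[pd.Series]:
    """Hypothetical helper: integer-encode a non-numeric target, tolerating y=None."""
    if y is None:
        # Nothing to inspect or encode, so return early before touching y's dtype.
        return None
    if not pd.api.types.is_numeric_dtype(y):
        # Sorted category codes stand in for a label encoder here.
        return pd.Series(y.astype("category").cat.codes, index=y.index)
    return y


labels = pd.Series(["b", "a", "b"])
numbers = pd.Series([1.5, 2.5, 3.5])
```

Calling `encode_target(None)` returns `None` instead of raising, while the string target is mapped to integer codes and the numeric target passes through unchanged.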

if col_name in categorical_columns:
    col = LabelEncoder().fit_transform(col)
    col = pd.Series(col, index=X.index)
X = X.assign(**{f"{col_name}_delay_{t}": col.shift(t) for t in range(1, self.max_delay + 1)})
Contributor

nice!
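
For readers following along, here is a rough, self-contained sketch of the encode-then-shift pattern in the snippet above. It is not the PR's implementation: `delay_features` is a hypothetical free function, `max_delay` is passed as an argument instead of read from `self`, and pandas category codes are swapped in for `LabelEncoder` so the example needs only pandas.

```python
import pandas as pd


def delay_features(X: pd.DataFrame, categorical_columns: set, max_delay: int) -> pd.DataFrame:
    """Append `<col>_delay_<t>` columns, integer-encoding categorical columns first."""
    out = X.copy()
    for col_name in X.columns:
        col = X[col_name]
        if col_name in categorical_columns:
            # Sorted category codes stand in for LabelEncoder().fit_transform(col).
            col = pd.Series(col.astype("category").cat.codes, index=X.index)
        # shift(t) moves values down by t rows, leaving NaN in the first t rows.
        out = out.assign(**{f"{col_name}_delay_{t}": col.shift(t) for t in range(1, max_delay + 1)})
    return out


X = pd.DataFrame({"color": ["red", "blue", "red", "green"], "value": [1.0, 2.0, 3.0, 4.0]})
delayed = delay_features(X, categorical_columns={"color"}, max_delay=2)
```

Encoding before shifting matters because the shifted columns contain NaN in their leading rows, which is awkward to mix with string categories but unproblematic for numeric codes.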

Contributor

@chukarsten left a comment

This looks good. Just a small change to take out a for loop - take it or leave it, lol. Not worth holding up a merge over. Nice job.

Comment on lines 84 to 88
for col_name in X:
    col = X[col_name]
    if col_name in categorical_columns:
        col = LabelEncoder().fit_transform(col)
        col = pd.Series(col, index=X.index)
Contributor

Have you considered doing all the encoding over a subset of the data frame:

X[categorical_columns] = X[categorical_columns].apply(LabelEncoder().fit_transform)

Or something like that...I didn't run it ;)

Contributor Author

I'll give this a shot! Great suggestion :)
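
A hedged sketch of what the "encode everything at once" suggestion could look like, untested against the PR itself. Two assumptions: the column names are held in a list rather than a set, since recent pandas versions reject sets as indexers, and pandas category codes replace `LabelEncoder().fit_transform` so the example depends only on pandas.

```python
import pandas as pd

X = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": ["S", "L", "M", "S"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# A list, not a set, so it can be used directly as a column indexer.
categorical_columns = ["color", "size"]

encoded = X.copy()
# apply() calls the function once per column, so each column gets its own code mapping.
encoded[categorical_columns] = X[categorical_columns].apply(
    lambda col: col.astype("category").cat.codes
)
```

This removes the per-column `if` from the loop: after the bulk encoding, every column can be shifted the same way.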

else:
    y = _convert_woodwork_types_wrapper(y.to_series())

categorical_columns = {name for name, column in X.columns.items() if
Contributor

So if we were using straight up pandas, we'd do something like df.select_dtypes("object"). Does woodwork not have anything like that?

Contributor Author

There is a select method we can use! This would pair nicely with your suggestion to apply all of the encoding at once.

Contributor Author

I decided not to go with select because ultimately we just need the names of the categorical columns. select returns a DataTable, so we'd need an extra step to get the column names from it. I figure this is more direct.
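
The trade-off can be illustrated with a plain pandas analogue. This is only an analogy under stated assumptions: woodwork's `select` and DataTable are not used here, `select_dtypes` plays the role of "select a sub-table, then read its column names", and "non-numeric" is used as a stand-in for "categorical".

```python
import pandas as pd

X = pd.DataFrame({
    "color": ["red", "blue"],
    "size": pd.Categorical(["S", "L"]),
    "value": [1.0, 2.0],
})

# Option 1: build a sub-frame of the non-numeric columns, then take its column names.
names_via_select = set(X.select_dtypes(exclude="number").columns)

# Option 2: walk the (name, dtype) pairs and keep only the names that are needed.
names_direct = {
    name for name, dtype in X.dtypes.items()
    if not pd.api.types.is_numeric_dtype(dtype)
}
```

Both routes give the same set of names; the second skips the intermediate sub-frame, which matches the reasoning given above for the set comprehension.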

@freddyaboulton force-pushed the 1581-delayed-feature-transformer-encodes-targets branch from b313ae2 to 25d4764 on January 15, 2021 17:33
@freddyaboulton changed the title from "DelayedFeatureTransformer Encodes categorical features and targets" to "DelayedFeatureTransformer encodes categorical features and targets" on Jan 15, 2021
@freddyaboulton force-pushed the 1581-delayed-feature-transformer-encodes-targets branch from 25d4764 to 49dd106 on January 15, 2021 21:09
@freddyaboulton merged commit f727942 into main on Jan 19, 2021
@freddyaboulton deleted the 1581-delayed-feature-transformer-encodes-targets branch on January 19, 2021 16:04
@bchen1116 mentioned this pull request on Jan 26, 2021
Labels
enhancement An improvement to an existing feature.
4 participants