Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DateTimeFeaturizer encodes features as ints #1479

Merged
merged 7 commits into from Dec 8, 2020

Conversation

freddyaboulton
Copy link
Contributor

Pull Request Description

Fixes #1477


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch 2 times, most recently from 00263ea to b1d058b Compare November 30, 2020 22:29
@codecov
Copy link

codecov bot commented Nov 30, 2020

Codecov Report

Merging #1479 (8ec13b1) into main (c3eb47b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1479     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         230      230             
  Lines       15776    15824     +48     
=========================================
+ Hits        15768    15816     +48     
  Misses          8        8             
Impacted Files Coverage Δ
.../transformers/preprocessing/datetime_featurizer.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
.../tests/component_tests/test_datetime_featurizer.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c3eb47b...8ec13b1. Read the comment docs.

@freddyaboulton freddyaboulton marked this pull request as ready for review November 30, 2020 22:46
@@ -7,19 +7,33 @@

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation takes 150 ms to create features for 150k rows. I think that's reasonable.

image

return X_t.drop(self._date_time_col_names, axis=1)

def get_feature_names(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named it get_feature_names to match the OHE.

@@ -105,6 +105,8 @@ def score(self, X, y, objectives):
X = pd.DataFrame()
X = _convert_to_woodwork_structure(X)
X = _convert_woodwork_types_wrapper(X.to_dataframe())
y = _convert_to_woodwork_structure(y)
y = _convert_woodwork_types_wrapper(y.to_series())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this was missing 😬

@@ -7,19 +7,33 @@


def _extract_year(col):
return col.dt.year
return col.dt.year, None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I debated whether or not we should encode the years starting at 0 but that sounds like a performance test? I can file an issue but it'd be blocked by #1383

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would encoding years take away contextual information about the datetime later on in the pipeline? January will always be mapped to 0 but 1981 being mapped to 0 would take information wouldn't it? Or maybe I'm misunderstanding!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a thought that linear models would work best with scaled features. Now that I think about it again, we automatically add a standard scaler for linear models at least in AutoML so maybe we can keep as is for now!

@freddyaboulton freddyaboulton added this to the December 2020 milestone Dec 1, 2020
@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch from b1d058b to feeb200 Compare December 1, 2020 18:44
@ParthivNaresh
Copy link
Contributor

Looks great! I don't know if you want a test checking the get_feature_names output with an input that has just year and hour datetime information because function mapping return None for those. Not really a big deal anyways

@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch from feeb200 to 0277a7f Compare December 1, 2020 22:36
Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm curious and don't know too much about the pros / cons of encoding dates as ints vs categories as they are right now. I assume for time series regression problems, it makes sense to encode as int, as there's valuable information about how close two dates are (1 and 2 are much closer than 1 and 30). But would it also be useful to use categories for day of the week, where the magnitude might not necessarily matter? 🤔

@freddyaboulton
Copy link
Contributor Author

@angela97lin and I just chatted about her comment about whether or not day-of-week should be treated as a category. We agreed that we shouldn't do that by default because it forces the user to have a ohe downstream (or else some estimators would crash, like xgboost) which could lead to an explosion of columns (if there are many delayed day-of-week features, which is likely). These extra columns may not increase performance and would have a negative impact on downstream modules, like permutation importance in model understanding.

For the short-term, we'll add a encode_as_categories parameter to init that defaults to False. In the long term, Automl should be "smart enough" try different encodings for the same columns per estimator.

@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch 2 times, most recently from 64159c5 to 5641803 Compare December 2, 2020 20:51
@@ -387,7 +387,7 @@
"\n",
"class CustomPipeline(MulticlassClassificationPipeline):\n",
" name = \"Custom Pipeline\"\n",
" component_graph = ['Imputer', MyDropNullColumns, 'DateTime Featurization Component', 'One Hot Encoder', 'Random Forest Classifier']\n",
" component_graph = ['Imputer', MyDropNullColumns, 'One Hot Encoder', 'Random Forest Classifier']\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While #1495 gets fixed.

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

docs/source/user_guide/pipelines.ipynb Outdated Show resolved Hide resolved
@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch from 80e8f25 to b09bbda Compare December 7, 2020 15:59
@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch from b09bbda to b8779f0 Compare December 7, 2020 17:59
@freddyaboulton freddyaboulton force-pushed the 1477-datetime-should-encode-columns branch from b8779f0 to 1c434b1 Compare December 8, 2020 22:15
@freddyaboulton freddyaboulton merged commit 70226e1 into main Dec 8, 2020
1 check passed
@freddyaboulton freddyaboulton deleted the 1477-datetime-should-encode-columns branch December 8, 2020 22:40
@kmax12
Copy link
Contributor

kmax12 commented Dec 11, 2020

im jumping in a little late to the convo here, but in case this nuance wasn't discussed, wanted to chime in.

one of the reasons we might encode the day of the week or month as a category, rather than int is to not confuse the the estimator of the cyclic nature of the variable e.g january (0) and december (11) are actually only one month a part. A linear estimator can learn that january and december are close in time when encode as 0 and 11.

on the other hand, the numeric encoding makes it easier in other cases for the classifier to learn that january and febuary are close in time compared to the categorical approach.

all that being said, I'm not sure what the the best default option is for us. mainly documenting the situation.

here's SO post with some more discussion: https://datascience.stackexchange.com/questions/17759/encoding-features-like-month-and-hour-as-categorial-or-numeric

@dsherry dsherry mentioned this pull request Dec 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DateTimeFeaturizer should return integer-valued features
5 participants