DateTimeFeaturizer encodes features as ints #1479
00263ea
to
b1d058b
Compare
Codecov Report
@@ Coverage Diff @@
## main #1479 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 230 230
Lines 15776 15824 +48
=========================================
+ Hits 15768 15816 +48
Misses 8 8
@@ -7,19 +7,33 @@
    return X_t.drop(self._date_time_col_names, axis=1)

def get_feature_names(self):
I named it get_feature_names to match the OHE.
@@ -105,6 +105,8 @@ def score(self, X, y, objectives):
  X = pd.DataFrame()
  X = _convert_to_woodwork_structure(X)
  X = _convert_woodwork_types_wrapper(X.to_dataframe())
+ y = _convert_to_woodwork_structure(y)
+ y = _convert_woodwork_types_wrapper(y.to_series())
I noticed this was missing 😬
@@ -7,19 +7,33 @@
  def _extract_year(col):
-     return col.dt.year
+     return col.dt.year, None
I debated whether or not we should encode the years starting at 0, but that sounds like a question for a performance test. I can file an issue, but it'd be blocked by #1383.
Would encoding years take away contextual information about the datetime later on in the pipeline? January will always be mapped to 0, but mapping 1981 to 0 would lose information, wouldn't it? Or maybe I'm misunderstanding!
I had a thought that linear models would work best with scaled features. Now that I think about it again, we automatically add a standard scaler for linear models, at least in AutoML, so maybe we can keep it as is for now!
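To make the scaling point concrete, here is a minimal sketch (not code from this PR) showing that standardizing a year column gives the same values whether or not the years are first shifted to start at 0. The `standardize` helper is a hypothetical stand-in for what a standard scaler does (subtract the mean, divide by the population standard deviation), and the year values are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    # Hypothetical helper: zero mean, unit (population) standard deviation.
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up year values, used only for illustration.
years = [1981, 1990, 2005, 2020]

# The alternative that was debated: encode years starting at 0.
shifted = [y - min(years) for y in years]

raw_scaled = standardize(years)
shifted_scaled = standardize(shifted)

# Standardization removes any constant offset, so both encodings agree.
same = all(abs(a - b) < 1e-9 for a, b in zip(raw_scaled, shifted_scaled))
```

Under these assumptions, a model that sits behind a standard scaler sees identical inputs for both encodings, which is consistent with keeping the raw years for now.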
b1d058b
to
feeb200
Compare
Looks great! I don't know if you want a test checking the get_feature_names output with an input that has just year and hour datetime information, because the function mapping returns None for those. Not really a big deal anyways.
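For readers following along, the convention in the diff above (each extractor returns the encoded values plus an optional mapping, with None for year) could be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation: `extract_year` and `extract_month` are hypothetical names, and the month mapping (January to 0 through December to 11) is an assumption based on the discussion in this thread.

```python
import calendar

import pandas as pd


def extract_year(col):
    # Years are already numeric, so there is no category mapping to report.
    return col.dt.year, None


def extract_month(col):
    # Assumed mapping for illustration: January -> 0, ..., December -> 11.
    mapping = {name: i for i, name in enumerate(calendar.month_name[1:])}
    # pandas months are 1-based, so shift them to match the mapping.
    return col.dt.month - 1, mapping


dates = pd.Series(pd.to_datetime(["1981-01-15", "2020-12-31"]))
years, year_map = extract_year(dates)
months, month_map = extract_month(dates)
```

A test like the one suggested here would then check that features with a None mapping (such as year) are handled sensibly by whatever consumes these mappings.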
feeb200
to
0277a7f
Compare
Hmm, I'm curious and don't know too much about the pros / cons of encoding dates as ints vs categories as they are right now. I assume for time series regression problems, it makes sense to encode as int, as there's valuable information about how close two dates are (1 and 2 are much closer than 1 and 30). But would it also be useful to use categories for day of the week, where the magnitude might not necessarily matter? 🤔
@angela97lin and I just chatted about her comment about whether or not day-of-week should be treated as a category. We agreed that we shouldn't do that by default because it forces the user to have an OHE downstream (or else some estimators, like xgboost, would crash), which could lead to an explosion of columns (if there are many delayed day-of-week features, which is likely). These extra columns may not increase performance and would have a negative impact on downstream modules, like permutation importance in model understanding. For the short term, we'll add a
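The column-explosion concern can be illustrated with a small sketch. It assumes plain pandas rather than the library's own One Hot Encoder, and the lag columns are hypothetical stand-ins for the "delayed day-of-week features" mentioned above.

```python
import pandas as pd

# Two weeks of daily dates, so every day of the week appears.
dates = pd.Series(pd.date_range("2021-01-04", periods=14, freq="D"))
dow = dates.dt.dayofweek  # Monday -> 0, ..., Sunday -> 6

# Integer encoding: one column per feature, including the delayed copies.
int_encoded = pd.DataFrame({
    "dow": dow,
    "dow_lag_1": (dow - 1) % 7,  # hypothetical delayed feature
    "dow_lag_2": (dow - 2) % 7,  # hypothetical delayed feature
})

# Treating each column as a category and one-hot encoding it
# produces up to seven columns per feature.
one_hot = pd.get_dummies(int_encoded, columns=list(int_encoded.columns))

n_int_columns = int_encoded.shape[1]
n_one_hot_columns = one_hot.shape[1]
```

With three day-of-week features, the integer encoding keeps three columns while the one-hot version grows to twenty-one, and the gap widens with every additional delayed feature.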
64159c5
to
5641803
Compare
@@ -387,7 +387,7 @@
  "\n",
  "class CustomPipeline(MulticlassClassificationPipeline):\n",
  " name = \"Custom Pipeline\"\n",
- " component_graph = ['Imputer', MyDropNullColumns, 'DateTime Featurization Component', 'One Hot Encoder', 'Random Forest Classifier']\n",
+ " component_graph = ['Imputer', MyDropNullColumns, 'One Hot Encoder', 'Random Forest Classifier']\n",
While #1495 gets fixed.
eeaaaf9
to
80e8f25
Compare
LGTM!
80e8f25
to
b09bbda
Compare
b09bbda
to
b8779f0
Compare
… handle bools yet.
b8779f0
to
1c434b1
Compare
i'm jumping in a little late to the convo here, but in case this nuance wasn't discussed, wanted to chime in. one of the reasons we might encode the day of the week or month as a category, rather than an int, is to not confuse the estimator about the cyclic nature of the variable, e.g. january (0) and december (11) are actually only one month apart. A linear estimator can't easily learn that january and december are close in time when they are encoded as 0 and 11. on the other hand, the numeric encoding makes it easier in other cases for the estimator to learn that january and february are close in time, compared to the categorical approach. all that being said, I'm not sure what the best default option is for us. mainly documenting the situation. here's an SO post with some more discussion: https://datascience.stackexchange.com/questions/17759/encoding-features-like-month-and-hour-as-categorial-or-numeric
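A small sketch of the cyclic-distance issue described above, using only the standard library. The helper names are hypothetical, and the sine/cosine transform at the end is one commonly discussed alternative, not something proposed or implemented in this PR.

```python
import math


def int_distance(a, b):
    # Distance as an estimator would see it with plain integer encoding.
    return abs(a - b)


def cyclic_distance(a, b, period=12):
    # Distance on the calendar, where the months wrap around.
    d = abs(a - b) % period
    return min(d, period - d)


def sin_cos(month, period=12):
    # Place each month on the unit circle (an alternative encoding,
    # shown only for comparison).
    angle = 2 * math.pi * month / period
    return math.sin(angle), math.cos(angle)


JAN, FEB, DEC = 0, 1, 11

jan_dec_int = int_distance(JAN, DEC)        # integer encoding: far apart
jan_dec_cyclic = cyclic_distance(JAN, DEC)  # calendar: adjacent months

# On the circle, January is as close to December as it is to February.
jan_feb_gap = math.dist(sin_cos(JAN), sin_cos(FEB))
jan_dec_gap = math.dist(sin_cos(JAN), sin_cos(DEC))
```

The integer encoding preserves the January to February closeness but overstates the January to December gap, which is the trade-off this comment documents.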
Pull Request Description
Fixes #1477
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.