Fix timezone datetime edgecase #2538
try:
    X_datetime[datetime_feature + '.' + feature] = getattr(X_datetime[datetime_feature].dt, feature).astype(int)
except AttributeError:
    # Strange datetime object, try removing timezone info
    X_datetime[datetime_feature + '.' + feature] = getattr(self._remove_timezones(X_datetime[datetime_feature]).dt, feature).astype(int)

try:
    X_datetime[datetime_feature] = pd.to_numeric(X_datetime[datetime_feature])
except TypeError:
    # Strange datetime object, try removing timezone info
Stripping known time zones would change the timestamp for those values. The method below can handle it correctly.
I think we can handle the edge cases explicitly by separating good rows from bad ones. Passing utc=True to pd.to_datetime normalizes all time zones to UTC (though values without a time zone would be treated as UTC):
def normalize_timeseries(series):
    series = pd.to_datetime(series, utc=True)
    broken_idx = series[(series == 'NaT') | series.isna() | series.isnull()].index
    bad_rows = series.iloc[broken_idx]
    good_rows = series[~series.isin(bad_rows)].astype(int)
    series.iloc[good_rows.index] = good_rows
    series[broken_idx] = int(good_rows.mean())
    return series
series = pd.to_datetime(series, utc=True)
Crashes.
../../../src/autogluon/features/generators/datetime.py:131: in _remove_timezones
return normalize_timeseries(datetime_as_object)
../../../src/autogluon/features/generators/datetime.py:122: in normalize_timeseries
series = pd.to_datetime(series, utc=True)
../../../../../../virtual/autogluon38/lib/python3.8/site-packages/pandas/core/tools/datetimes.py:1068: in to_datetime
values = convert_listlike(arg._values, format)
../../../../../../virtual/autogluon38/lib/python3.8/site-packages/pandas/core/tools/datetimes.py:438: in _convert_listlike_datetimes
result, tz_parsed = objects_to_datetime64ns(
../../../../../../virtual/autogluon38/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py:2177: in objects_to_datetime64ns
result, tz_parsed = tslib.array_to_datetime(
pandas/_libs/tslib.pyx:427: in pandas._libs.tslib.array_to_datetime
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: mixed datetimes and integers in passed array
1533140820000000000,
1600055337034500096,
1600055337034500096,
1628528828659000000,
1628528895541000000,
1610022803938000000
This is not correct because the known time zones were stripped. The values should be:
0 1533140820000000000
1 1600067037034500096
2 1600067037034500096
3 1628543228659000000
4 1628543295541000000
5 1610040803938000000
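The gap between the two lists can be reproduced on a single value. The sketch below assumes a UTC-04:00 offset for the fourth row; that offset is inferred from the 4-hour difference between 1628528828659000000 and 1628543228659000000 and is not stated anywhere in this thread.

```python
import pandas as pd

# Assumed input: a wall-clock time carrying a -04:00 offset (hypothetical value).
ts = pd.Timestamp("2021-08-09T17:07:08.659-04:00")

# Stripping the time zone keeps the wall-clock time and reinterprets it as UTC.
stripped_ns = ts.tz_localize(None).value

# Converting to UTC keeps the actual instant in time.
converted_ns = ts.tz_convert("UTC").value

# The two integers differ by exactly the dropped offset (4 hours in nanoseconds).
offset_ns = converted_ns - stripped_ns
```

Stripping is therefore only lossless for values that are already in UTC or carry no time zone at all.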
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.