Fix timezone datetime edgecase #2538

Merged
merged 3 commits into from
Dec 8, 2022

Conversation

Innixma
Contributor

@Innixma Innixma commented Dec 8, 2022

Issue #, if available:

Description of changes:

  • Addresses a complicated preprocessing edge case in which a single datetime column contains timezone-aware values (including multiple timezones, and a mix of timezone-aware and timezone-naive rows) alongside NaNs in multiple formats. Previously this crashed.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added this to the 0.6.1 Release milestone Dec 8, 2022
Comment on lines 80 to 89
try:
    X_datetime[datetime_feature + '.' + feature] = getattr(X_datetime[datetime_feature].dt, feature).astype(int)
except AttributeError:
    # Strange datetime object, try removing timezone info
    X_datetime[datetime_feature + '.' + feature] = getattr(self._remove_timezones(X_datetime[datetime_feature]).dt, feature).astype(int)

try:
    X_datetime[datetime_feature] = pd.to_numeric(X_datetime[datetime_feature])
except TypeError:
    # Strange datetime object, try removing timezone info
Contributor

Stripping known time zones would change the timestamp for those values. The method below can handle it correctly.

I think we can handle the edge cases explicitly by separating good rows from bad ones. Using utc=True lets us normalize all time zones (though values without a time zone would be treated as UTC):

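import pandas as pd

def normalize_timeseries(series):
    # utc=True converts tz-aware values to UTC; tz-naive values are assumed to be UTC
    series = pd.to_datetime(series, utc=True)
    # Rows that are missing (NaT) after parsing
    broken = series.isna()
    # Nanoseconds since the epoch for the valid rows
    good_rows = series[~broken].astype('int64')
    # Integer result; broken rows are filled with the mean of the valid rows
    result = pd.Series(int(good_rows.mean()), index=series.index, dtype='int64')
    result[~broken] = good_rows
    return result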
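
To make the point about stripping concrete, here is a minimal sketch comparing the two operations on a single value. The sample timestamp and its fixed -04:00 offset are made up for illustration and are not taken from this PR's test data.

```python
import pandas as pd

# Hypothetical sample value with a fixed -04:00 offset (not from the PR).
ts = pd.Timestamp("2021-08-09 13:07:08-04:00")

# Stripping the offset keeps the wall-clock time and discards the zone.
stripped_ns = ts.tz_localize(None).value

# Converting to UTC keeps the instant and only changes its representation.
utc_ns = ts.tz_convert("UTC").value

# The two epoch values differ by exactly the offset that was dropped.
NS_PER_HOUR = 3_600 * 10**9
offset_hours = (utc_ns - stripped_ns) / NS_PER_HOUR
print(stripped_ns, utc_ns, offset_hours)
```

In this sketch the stripped value ends up four hours earlier than the UTC-normalized one, which is the kind of shift the comment above warns about.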

Contributor Author

series = pd.to_datetime(series, utc=True)

Crashes.


../../../src/autogluon/features/generators/datetime.py:131: in _remove_timezones
    return normalize_timeseries(datetime_as_object)
../../../src/autogluon/features/generators/datetime.py:122: in normalize_timeseries
    series = pd.to_datetime(series, utc=True)
../../../../../../virtual/autogluon38/lib/python3.8/site-packages/pandas/core/tools/datetimes.py:1068: in to_datetime
    values = convert_listlike(arg._values, format)
../../../../../../virtual/autogluon38/lib/python3.8/site-packages/pandas/core/tools/datetimes.py:438: in _convert_listlike_datetimes
    result, tz_parsed = objects_to_datetime64ns(
../../../../../../virtual/autogluon38/lib/python3.8/site-packages/pandas/core/arrays/datetimes.py:2177: in objects_to_datetime64ns
    result, tz_parsed = tslib.array_to_datetime(
pandas/_libs/tslib.pyx:427: in pandas._libs.tslib.array_to_datetime
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ValueError: mixed datetimes and integers in passed array
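
As the message indicates, the failure comes from parsing an array that mixes datetime objects and integers in one call. One possible way around whole-array parsing is to convert each cell on its own. The sketch below is only an illustration: `to_utc_ns` is a hypothetical helper, not what this PR merged, and treating bare integers as epoch nanoseconds is an assumption.

```python
import pandas as pd

def to_utc_ns(value):
    """Hypothetical helper: one cell -> UTC epoch nanoseconds, or None if missing."""
    # Assumption: bare integers are already epoch nanoseconds.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    # Each scalar is parsed separately, so mixed types never share one array.
    ts = pd.to_datetime(value, utc=True, errors="coerce")
    # None, NaN and unparseable strings all end up missing here.
    return None if pd.isna(ts) else ts.value

cells = ["1970-01-01 01:00:00+01:00", "1970-01-01 00:00:01", None, float("nan"), 42]
converted = [to_utc_ns(c) for c in cells]
print(converted)
```

Per-cell conversion is slower than a vectorized call, so it would likely make sense only as a fallback after the fast path fails.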

Comment on lines 119 to 124
1533140820000000000,
1600055337034500096,
1600055337034500096,
1628528828659000000,
1628528895541000000,
1610022803938000000
Contributor

This is not correct because the known time zones were stripped.

0    1533140820000000000
1    1600067037034500096
2    1600067037034500096
3    1628543228659000000
4    1628543295541000000
5    1610040803938000000
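
The gap between the two lists of values in this thread can be checked with plain arithmetic. The sketch below copies both lists verbatim and expresses each difference in hours; reading those gaps as the dropped UTC offsets is an interpretation of the comment, not something the thread states outright.

```python
# Values from the lines under review (nanoseconds since the epoch).
under_review = [
    1533140820000000000,
    1600055337034500096,
    1600055337034500096,
    1628528828659000000,
    1628528895541000000,
    1610022803938000000,
]

# Values from the reviewer's comment.
from_reviewer = [
    1533140820000000000,
    1600067037034500096,
    1600067037034500096,
    1628543228659000000,
    1628543295541000000,
    1610040803938000000,
]

NS_PER_HOUR = 3_600 * 10**9
gap_hours = [(r - u) / NS_PER_HOUR for u, r in zip(under_review, from_reviewer)]
print(gap_hours)  # [0.0, 3.25, 3.25, 4.0, 4.0, 5.0]
```

Only the first row matches in both lists; the others differ by whole or fractional hours, which is consistent with a timezone offset having been dropped rather than converted.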

@Innixma Innixma merged commit c849c45 into master Dec 8, 2022
@Innixma Innixma deleted the fix_datetime_edgecases branch January 18, 2023 18:03