-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-54285][PYTHON] Cache timezone info to avoid expensive timestamp conversion #52980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
f6b2a8a
26528da
83971c5
5e60040
9c62278
3601655
61da8a1
29c5cd5
69249c2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -441,6 +441,12 @@ def __repr__(self) -> str: | |
| class TimestampType(DatetimeType, metaclass=DataTypeSingleton): | ||
| """Timestamp (datetime.datetime) data type.""" | ||
|
|
||
| # We need to cache the timezone info for datetime.datetime.fromtimestamp | ||
| # otherwise the forked process will be extremely slow to convert the timestamp. | ||
| # This is probably a glibc issue - the forked process will have a bad cache/lock | ||
| # status for the timezone info. | ||
| tz_info = None | ||
|
|
||
| def needConversion(self) -> bool: | ||
| return True | ||
|
|
||
|
|
@@ -454,7 +460,12 @@ def toInternal(self, dt: datetime.datetime) -> int: | |
| def fromInternal(self, ts: int) -> datetime.datetime: | ||
| if ts is not None: | ||
| # using int to avoid precision loss in float | ||
| return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000) | ||
| # If TimestampType.tz_info is not None, we need to use it to convert the timestamp. | ||
| # Otherwise, we need to use the default timezone. | ||
| # We need to replace the tzinfo to None to keep backward compatibility | ||
| return datetime.datetime.fromtimestamp(ts // 1000000, self.tz_info).replace( | ||
| microsecond=ts % 1000000, tzinfo=None | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add a comment on why dropping the tzinfo here?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Basically for backward compatibility. We are still working on the details of the behavior so this might not be the final implementation. |
||
| ) | ||
|
|
||
|
|
||
| class TimestampNTZType(DatetimeType, metaclass=DataTypeSingleton): | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,6 +18,7 @@ | |
| """ | ||
| Worker that receives input from Piped RDD. | ||
| """ | ||
| import datetime | ||
| import itertools | ||
| import os | ||
| import sys | ||
|
|
@@ -71,7 +72,7 @@ | |
| ArrowStreamUDTFSerializer, | ||
| ArrowStreamArrowUDTFSerializer, | ||
| ) | ||
| from pyspark.sql.pandas.types import to_arrow_type | ||
| from pyspark.sql.pandas.types import to_arrow_type, TimestampType | ||
| from pyspark.sql.types import ( | ||
| ArrayType, | ||
| BinaryType, | ||
|
|
@@ -3302,6 +3303,11 @@ def main(infile, outfile): | |
| if split_index == -1: # for unit tests | ||
| sys.exit(-1) | ||
| start_faulthandler_periodic_traceback() | ||
|
|
||
| # Use the local timezone to convert the timestamp | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we add a TODO comment? I think it's wrong to use local timezone, but it's the current behavior today. We should revisit it.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree it's wrong to use the local timezone. I'm thinking to start a PR very soon and we can discuss the details so we don't have to leave trace in our code. There is also a JIRA that's already opened for the issue. |
||
| tz = datetime.datetime.now().astimezone().tzinfo | ||
| TimestampType.tz_info = tz | ||
|
|
||
| check_python_version(infile) | ||
|
|
||
| # read inputs only for a barrier task | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to make timezone an optional field in
TimestampType?if it is not set, then respect the config "spark.sql.session.timeZone"
@cloud-fan @HyukjinKwon
e.g. in pyarrow,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The missing timezone causes a lot of confusion when we have to convert the timezone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIMESTAMP WITH TIMEZONE type may be added in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So where is the value set from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if TIMESTAMP WITH TIMEZONE is supported in the future, I think the TIMEZONE should be explicitly set by users, otherwise default to "spark.sql.session.timeZone"