[SPARK-33863][PYTHON] Respect session timezone in udf workers #53161
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -27,6 +27,7 @@` |
| | | `import inspect` |
| | | `import itertools` |
| | | `import json` |
| | | `+ import zoneinfo` |
| | | `from typing import Any, Callable, Iterable, Iterator, Optional, Tuple` |
| | | |
| | | `from pyspark.accumulators import (` |
| | | `@@ -3304,8 +3305,12 @@ def main(infile, outfile):` |
| | | `sys.exit(-1)` |
| | | `start_faulthandler_periodic_traceback()` |
| | | |
| | | `- # Use the local timezone to convert the timestamp` |
| | | `- tz = datetime.datetime.now().astimezone().tzinfo` |
| | | `+ tzname = os.environ.get("SPARK_SESSION_LOCAL_TIMEZONE", None)` |
Contributor
To confirm, we will hit this branch for every UDF execution, not just once per Python worker initialization, right?
Contributor
Author
That's correct, but it doesn't seem like
Member
@gaogaotiantian We can get the runtime config the same way as the other configs, like
Contributor
Author
Ah, so basically overwrite this for every subclassed worker?
Member
Yes. Also, if we have a flag, the subclasses should decide whether it returns the session local timezone or
Contributor
Author
So the flag should be a conf at the same level as the session local timezone? Or just at the Python UDF level? Will it default to the original behavior or the new behavior?
Member
Yes, the flag should be at the same level as the session local timezone, a runtime conf in
Member
The Arrow-based UDFs already handle the session local timezone, so it may be OK to just update
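The flag discussed in this thread could look roughly like the sketch below. Everything here is hypothetical: the two conf keys are placeholders rather than real Spark configuration names, `udf_timezone_name` is an invented helper, returning `None` to mean "keep the worker's local timezone" is an assumption, and defaulting the flag to off is only one of the two options the thread leaves open.

```python
from typing import Mapping, Optional

# Placeholder keys for illustration only; not real Spark configuration names.
TZ_KEY = "example.session.timeZone"
FLAG_KEY = "example.pythonUDF.respectSessionTimeZone"


def udf_timezone_name(conf: Mapping[str, str]) -> Optional[str]:
    """Return the session timezone name when the flag is on, else None.

    None is assumed to mean "fall back to the worker's local timezone".
    """
    # Assumption: the flag defaults to off, i.e. the original behavior.
    if conf.get(FLAG_KEY, "false").lower() == "true":
        return conf.get(TZ_KEY)
    return None
```

With this shape, a runner that wants the new behavior would send both values, and a runner that does not would simply leave the flag unset.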
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `+ if tzname:` |
| | | `+     tz = zoneinfo.ZoneInfo(tzname)` |
| | | `+ else:` |
| | | `+     # Use the local timezone to convert the timestamp` |
| | | `+     tz = datetime.datetime.now().astimezone().tzinfo` |
| | | `TimestampType.tz_info = tz` |
| | | |
| | | `check_python_version(infile)` |
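The timezone selection in this hunk can be tried in isolation with a small standalone sketch. The helper name `resolve_worker_tz` and its `environ` parameter are assumptions added for illustration; only the environment variable name and the fallback expression come from the diff.

```python
import datetime
import os
import zoneinfo
from typing import Mapping, Optional


def resolve_worker_tz(environ: Optional[Mapping[str, str]] = None) -> datetime.tzinfo:
    """Return the session timezone if the variable is set, else the local one.

    Hypothetical helper: it mirrors the branch in the diff but is not part of it.
    """
    env = os.environ if environ is None else environ
    tzname = env.get("SPARK_SESSION_LOCAL_TIMEZONE")
    if tzname:
        # Raises zoneinfo.ZoneInfoNotFoundError if the key is unknown
        # or no time zone database is available on the machine.
        return zoneinfo.ZoneInfo(tzname)
    # Unset or empty: use the worker's local timezone, as before the change.
    return datetime.datetime.now().astimezone().tzinfo
```

For example, `resolve_worker_tz({"SPARK_SESSION_LOCAL_TIMEZONE": "Asia/Tokyo"})` would return the `ZoneInfo` for that key, while `resolve_worker_tz({})` returns the local timezone of the process.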
For Arrow-based UDFs, `sessionLocalTimeZone` is actually already passed to the Python side:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala
Line 153 in ed23cc3
However, this `workerConf` is not available in vanilla Python UDFs; we can probably consider supporting it there in the future. Also cc @HeartSaVioR

Yeah, it's better to pass the configs via a proper protocol instead of system variables. But that is already the case for the vanilla Python runner, and I think it's fine to follow it.
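To illustrate what "passing the configs via a proper protocol" could mean in contrast to environment variables, here is a minimal sketch that writes key/value pairs to a binary stream and reads them back. The wire format (big-endian 4-byte counts and lengths, UTF-8 strings) and the function names are assumptions made for this example and are not claimed to match Spark's actual worker protocol.

```python
import io
import struct
from typing import BinaryIO, Dict


def write_conf(conf: Dict[str, str], out: BinaryIO) -> None:
    """Write the number of entries, then each key and value length-prefixed."""
    out.write(struct.pack(">i", len(conf)))
    for key, value in conf.items():
        for text in (key, value):
            data = text.encode("utf-8")
            out.write(struct.pack(">i", len(data)))
            out.write(data)


def _read_exact(src: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes or raise EOFError on a short read."""
    data = src.read(size)
    if len(data) != size:
        raise EOFError("stream ended before the config was fully read")
    return data


def _read_str(src: BinaryIO) -> str:
    """Read one length-prefixed UTF-8 string."""
    (length,) = struct.unpack(">i", _read_exact(src, 4))
    return _read_exact(src, length).decode("utf-8")


def read_conf(src: BinaryIO) -> Dict[str, str]:
    """Read back what write_conf produced."""
    (count,) = struct.unpack(">i", _read_exact(src, 4))
    conf: Dict[str, str] = {}
    for _ in range(count):
        key = _read_str(src)
        conf[key] = _read_str(src)
    return conf
```

Compared with an environment variable, values sent this way are scoped to one stream rather than to the worker process's environment.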