13 changes: 12 additions & 1 deletion python/pyspark/sql/types.py
@@ -441,6 +441,12 @@ def __repr__(self) -> str:
 class TimestampType(DatetimeType, metaclass=DataTypeSingleton):
     """Timestamp (datetime.datetime) data type."""
 
+    # We need to cache the timezone info for datetime.datetime.fromtimestamp,
+    # otherwise the forked process will be extremely slow to convert the timestamp.
+    # This is probably a glibc issue - the forked process will have a bad cache/lock
+    # status for the timezone info.
+    tz_info = None
zhengruifeng (Contributor) commented on Nov 13, 2025:
Is it possible to make timezone an optional field in TimestampType? If it is not set, then respect the config "spark.sql.session.timeZone". @cloud-fan @HyukjinKwon

E.g. in pyarrow:

    In [29]: pa.timestamp('us', tz='America/Los_Angeles')
    Out[29]: TimestampType(timestamp[us, tz=America/Los_Angeles])
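For comparison only (not part of this PR), the timezone-aware behavior the comment points at can be sketched with the standard library; this assumes the IANA tzdata for "America/Los_Angeles" is available on the system:

```python
import datetime
from zoneinfo import ZoneInfo

# Roughly analogous to pyarrow's timestamp('us', tz='America/Los_Angeles'):
# an explicit IANA timezone is attached instead of relying on a session default.
la = ZoneInfo("America/Los_Angeles")
aware = datetime.datetime.fromtimestamp(1_700_000_000, tz=la)
# 2023-11-14 falls after the Nov 5 DST transition, so the offset is PST (UTC-8).
```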

Contributor:

The missing timezone causes a lot of confusion when we have to convert the timezone

Contributor:

A TIMESTAMP WITH TIMEZONE type may be added in the future.

Contributor:

So where is the value set from?

Contributor:

If TIMESTAMP WITH TIMEZONE is supported in the future, I think the timezone should be explicitly set by users; otherwise it should default to "spark.sql.session.timeZone".


     def needConversion(self) -> bool:
         return True
 
@@ -454,7 +460,12 @@ def toInternal(self, dt: datetime.datetime) -> int:
     def fromInternal(self, ts: int) -> datetime.datetime:
         if ts is not None:
             # using int to avoid precision loss in float
-            return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+            # If TimestampType.tz_info is not None, we need to use it to convert the timestamp.
+            # Otherwise, we need to use the default timezone.
+            # We need to replace the tzinfo with None to keep backward compatibility.
+            return datetime.datetime.fromtimestamp(ts // 1000000, self.tz_info).replace(
+                microsecond=ts % 1000000, tzinfo=None
+            )

Contributor:

Can you add a comment on why the tzinfo is dropped here?

Contributor (PR author):

Basically for backward compatibility. We are still working on the details of the behavior, so this might not be the final implementation.
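To make the conversion above concrete, here is a standalone sketch of the new fromInternal logic (a hypothetical helper, not Spark's actual class) showing how the cached tzinfo is passed in and then stripped:

```python
import datetime

# Hypothetical standalone version of fromInternal above.
# ts is Spark's internal timestamp value: microseconds since the Unix epoch.
def from_internal(ts, tz_info=None):
    if ts is None:
        return None
    # Integer division avoids the precision loss of a float division.
    # Passing a cached tz_info skips the per-call timezone lookup; the tzinfo
    # is then replaced with None so the result stays a naive datetime,
    # preserving the old behavior.
    return datetime.datetime.fromtimestamp(ts // 1000000, tz_info).replace(
        microsecond=ts % 1000000, tzinfo=None
    )

# With an explicit UTC tzinfo the result is deterministic:
dt = from_internal(1_700_000_000_123_456, datetime.timezone.utc)
# dt is the naive datetime 2023-11-14 22:13:20.123456
```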


class TimestampNTZType(DatetimeType, metaclass=DataTypeSingleton):
8 changes: 7 additions & 1 deletion python/pyspark/worker.py
@@ -18,6 +18,7 @@
 """
 Worker that receives input from Piped RDD.
 """
+import datetime
 import itertools
 import os
 import sys
@@ -71,7 +72,7 @@
     ArrowStreamUDTFSerializer,
     ArrowStreamArrowUDTFSerializer,
 )
-from pyspark.sql.pandas.types import to_arrow_type
+from pyspark.sql.pandas.types import to_arrow_type, TimestampType
 from pyspark.sql.types import (
     ArrayType,
     BinaryType,
@@ -3302,6 +3303,11 @@ def main(infile, outfile):
     if split_index == -1:  # for unit tests
         sys.exit(-1)
     start_faulthandler_periodic_traceback()
+
+    # Use the local timezone to convert the timestamp
+    tz = datetime.datetime.now().astimezone().tzinfo
+    TimestampType.tz_info = tz

Contributor:

Shall we add a TODO comment? I think it's wrong to use the local timezone, but it's the current behavior today. We should revisit it.

Contributor (PR author):

I agree it's wrong to use the local timezone. I'm thinking of starting a PR very soon so we can discuss the details there, and we don't have to leave a trace in our code. There is also a JIRA already opened for the issue.

     check_python_version(infile)
 
     # read inputs only for a barrier task
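The two lines added to main() above follow a capture-once pattern; this standalone sketch (assumed names, not the actual worker module) shows the idea:

```python
import datetime

# Capture the process's local timezone once at startup, mirroring the
# `tz = datetime.datetime.now().astimezone().tzinfo` line added above.
local_tz = datetime.datetime.now().astimezone().tzinfo

# Later conversions reuse the cached tzinfo instead of re-resolving the
# system timezone on every fromtimestamp call, which is the slow path the
# PR's comment attributes to glibc state in forked processes.
ts = 1_700_000_000_123_456  # microseconds since the epoch
dt = datetime.datetime.fromtimestamp(ts // 1000000, local_tz).replace(
    microsecond=ts % 1000000, tzinfo=None
)
```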