[SPARK-22010][PySpark] Change fromInternal method of TimestampType #19234
Conversation
Test build #81786 has finished for PR 19234 at commit
Test build #81787 has finished for PR 19234 at commit
Test build #81788 has finished for PR 19234 at commit
python/pyspark/sql/types.py
```diff
-        return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
+        y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
+                                        else time.localtime(ts // 1000000))
+        return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
```
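As an illustration (not part of the PR): a quick self-check that the new construction matches the old `fromtimestamp`-based code on the local-time branch. The sample value is an arbitrary microsecond count, not one from the thread:

```python
import datetime
import time

ts = 1505383200123456  # hypothetical internal value: microseconds since the epoch
old = datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)
y, m, d, hh, mm, ss, _, _, _ = time.localtime(ts // 1000000)
new = datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
assert old == new  # fold (PEP 495) is ignored when comparing naive datetimes
```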
I'd explain how the current change is equivalent to `fromtimestamp` in the PR description.
I think the only difference is this: `ss = min(ss, 59)`.
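For context, a minimal sketch (mine, not from the PR) of why that clamp exists: `datetime` rejects a seconds value of 60, which a platform clock with leap-second support can report:

```python
import datetime

ss = 60           # hypothetical leap second reported by time.gmtime/localtime
ss = min(ss, 59)  # same guard as CPython's fromtimestamp
datetime.datetime(2016, 12, 31, 23, 59, ss)  # fine now; ss == 60 would raise ValueError
```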
Could we have more details in the PR description? I can follow what it fixes, but I believe we should describe the details better so that others can follow the history in the future. To me, it has always been hard to track such changes and to pin down the root cause of a bug.
BTW, the change itself seems fine to me.
I added some description and support for leap seconds.
Test build #81845 has finished for PR 19234 at commit
Could you mark [PySpark] in the title? cc @ueshin
LGTM.
Any idea why we're not using …? I'm asking because such code is 2x faster than my solution.
I think it was changed back in #7363. I think that avoids precision loss from Python's division:

```python
>>> decimal.Decimal(1 / 1e6)
Decimal('9.99999999999999954748111825886258685613938723690807819366455078125E-7')
>>> decimal.Decimal(1 / 100000)
Decimal('0')
>>> decimal.Decimal(1 / 100000.)
Decimal('0.000010000000000000000818030539140313095458623138256371021270751953125')
```

Also, I think `/` and `//` behave differently:

```python
>>> 5 / 2
2.5
>>> 5 // 2
2
```
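To make that point concrete (the sample value below is mine, not from the thread): on Python 3, `/` always returns a float, so the conversion code uses `//` and `%` to stay in exact integer arithmetic:

```python
ts = 1505383200123456                    # hypothetical internal value in microseconds
assert ts // 1000000 == 1505383200       # exact int, safe to pass to time.gmtime
assert ts % 1000000 == 123456            # microseconds recovered exactly
assert isinstance(ts / 1000000, float)   # true division would leak a float in
```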
I checked with some samples, and the code using floats can trigger errors.
Seems fine to me too as is. @maver1ck, I think you could take out …
OK. It passed all tests, so let's merge it.
Hm, BTW, do we handle this?

```python
if tz is None:
    # As of version 2015f max fold in IANA database is
    # 23 hours at 1969-09-30 13:00:00 in Kwajalein.
    # Let's probe 24 hours in the past to detect a transition:
    max_fold_seconds = 24 * 3600
    y, m, d, hh, mm, ss = converter(t - max_fold_seconds)[:6]
    probe1 = cls(y, m, d, hh, mm, ss, us, tz)
    trans = result - probe1 - timedelta(0, max_fold_seconds)
    if trans.days < 0:
        y, m, d, hh, mm, ss = converter(t + trans // timedelta(0, 1))[:6]
        probe2 = cls(y, m, d, hh, mm, ss, us, tz)
        if probe2 == result:
            result._fold = 1
```

Or do you see it as ignorable, since it was fixed only recently (in 3.6.x) and is a corner case versus the improvement? Looking at the performance improvement in the PR description, it sounds pretty trivial. If it is safe to go, I am okay, but if we are missing anything, I doubt it is worth fixing.
It was introduced with this PEP (PEP 495).
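For reference, a small demonstration (mine, requires Python 3.6+) of the `fold` attribute that PEP 495 added and that the simplified conversion in this PR does not set:

```python
import datetime

# 01:30 on a fall-back night occurs twice; PEP 495 disambiguates the two
# occurrences with fold=0 (first) and fold=1 (second).
first = datetime.datetime(2017, 11, 5, 1, 30, fold=0)
second = datetime.datetime(2017, 11, 5, 1, 30, fold=1)
print(first == second)          # True: fold is ignored for naive datetimes
print(first.fold, second.fold)  # 0 1
```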
I am looking at the CPython 3.5 branch: https://github.com/python/cpython/blob/3.5/Lib/datetime.py#L1375-L1387 and https://github.com/python/cpython/blob/3.5/Lib/datetime.py#L1398

```python
@classmethod
def fromtimestamp(cls, t, tz=None):
    """Construct a datetime from a POSIX timestamp (like time.time()).

    A timezone info object may be passed in as well.
    """
    _check_tzinfo_arg(tz)
    result = cls._fromtimestamp(t, tz is not None, tz)
    if tz is not None:
        result = tz.fromutc(result)
    return result
```

and ...

```python
frac, t = _math.modf(t)
us = round(frac * 1e6)
if us >= 1000000:
    t += 1
    us -= 1000000
elif us < 0:
    t -= 1
    us += 1000000
converter = _time.gmtime if utc else _time.localtime
y, m, d, hh, mm, ss, weekday, jday, dst = converter(t)
ss = min(ss, 59)    # clamp out leap seconds if the platform has them
return cls(y, m, d, hh, mm, ss, us, tz)
```
So, this saves one comparison:

```python
if tz is not None:
    ...
```

and the fraction calculation:

```python
frac, t = _math.modf(t)
us = round(frac * 1e6)
if us >= 1000000:
    t += 1
    us -= 1000000
elif us < 0:
    t -= 1
    us += 1000000
```

The change itself seems fine, as we save some (small) amount of computation anyway.
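In other words (a sketch of my own, not from the thread): because the internal value is an integer count of microseconds, `divmod` recovers seconds and microseconds exactly, so the `modf`/rounding dance on a float timestamp becomes unnecessary:

```python
ts = 1505383200123456             # hypothetical internal value in microseconds
seconds, us = divmod(ts, 1000000)
assert (seconds, us) == (1505383200, 123456)  # exact; no float fraction involved
```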
However, looking through https://www.python.org/dev/peps/pep-0495 (in Python 3.6.0) and the code without a super close look, it looks like this code:

```python
if tz is None:
    # As of version 2015f max fold in IANA database is
    # 23 hours at 1969-09-30 13:00:00 in Kwajalein.
    # Let's probe 24 hours in the past to detect a transition:
    max_fold_seconds = 24 * 3600
    y, m, d, hh, mm, ss = converter(t - max_fold_seconds)[:6]
    probe1 = cls(y, m, d, hh, mm, ss, us, tz)
    trans = result - probe1 - timedelta(0, max_fold_seconds)
    if trans.days < 0:
        y, m, d, hh, mm, ss = converter(t + trans // timedelta(0, 1))[:6]
        probe2 = cls(y, m, d, hh, mm, ss, us, tz)
        if probe2 == result:
            result._fold = 1
```

handles a (super) corner case of a certain time with DST in Kwajalein, if I read it correctly.

So, I am neutral on this. I don't actively support this PR, but I wouldn't object if someone merges it in its current state.
Probably, more compelling reasons would help, e.g., evidence that the actual performance improvement is not trivial. Closing this one should also be fine with me.
@maver1ck, if you'd like to leave it as is, I think we'd better close it.
I think porting the changes from Python 3.6 gives us code that is too complicated.
What changes were proposed in this pull request?
This PR changes the way PySpark converts timestamps from the internal format to the Python representation. The code is based on the Python datetime implementation and the udatetime library:

https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458
https://github.com/freach/udatetime/blob/08f95c233188a6a3d316e9905cfb3379278376f5/udatetime/_pure.py#L30-L42
Benchmarks
Before change:
4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
After change:
System with UTC timezone
1.49 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Other timezones:
3.15 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
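A rough way to reproduce numbers like these (a hypothetical harness, not the one used for the PR; assumes a local PySpark installation):

```python
import timeit

setup = "from pyspark.sql.types import TimestampType; t = TimestampType()"
stmt = "t.fromInternal(1505383200123456)"  # microseconds since the epoch
# Average time per call over 100,000 invocations, in seconds.
print(timeit.timeit(stmt, setup=setup, number=100000) / 100000)
```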
How was this patch tested?
Existing tests.
Performance benchmarks.