
[SPARK-22010][PySpark] Change fromInternal method of TimestampType #19234

Closed
wants to merge 6 commits

Conversation

maver1ck
Contributor

@maver1ck maver1ck commented Sep 14, 2017

What changes were proposed in this pull request?

This PR changes the way PySpark converts timestamps from their internal representation (microseconds since the epoch) to Python datetime objects.

The code is based on Python datetime and udatetime library.
https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458
https://github.com/freach/udatetime/blob/08f95c233188a6a3d316e9905cfb3379278376f5/udatetime/_pure.py#L30-L42

Benchmarks
Before change:
4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

After change:
System with UTC timezone
1.49 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Other timezones:
3.15 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

How was this patch tested?

Existing tests.
Performance benchmarks.
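The numbers above could be reproduced with a small harness along these lines (a sketch only: `timeit` as the timer and the sample `ts` value are my assumptions, not taken from the PR):

```python
import datetime
import time
import timeit

# An arbitrary internal timestamp: microseconds since the Unix epoch.
ts = 1505395200123456

def from_internal_old(ts):
    # Old path: fromtimestamp on whole seconds, then patch in microseconds.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def from_internal_new(ts):
    # New path: unpack a struct_time and build the datetime directly.
    y, m, d, hh, mm, ss, _, _, _ = time.localtime(ts // 1000000)
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)

if __name__ == "__main__":
    # Both paths must agree before their speed is worth comparing.
    assert from_internal_old(ts) == from_internal_new(ts)
    for fn in (from_internal_old, from_internal_new):
        elapsed = timeit.timeit(lambda: fn(ts), number=100000)
        print(fn.__name__, "%.2f us per call" % (elapsed * 10))
```

The absolute timings depend on the machine and timezone, so only the ratio between the two paths is meaningful.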

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81786 has finished for PR 19234 at commit 02301eb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81787 has finished for PR 19234 at commit 25bf50d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81788 has finished for PR 19234 at commit f18f4c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Old:

    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

New:

    y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
                                    else time.localtime(ts // 1000000))
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
Member

I'd explain how the current change is equivalent to fromtimestamp in the PR description.

https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458

Contributor Author

I think the only difference is this ss = min(ss, 59)
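For context, the `ss = min(ss, 59)` line in CPython exists because a platform's `localtime` may report a leap second (`ss == 60`), which the `datetime` constructor rejects. A minimal illustration (the leap-second input here is contrived, since most platforms never actually report one):

```python
import datetime

def make_dt(y, m, d, hh, mm, ss, us):
    # Clamp a possible leap second so datetime() does not raise ValueError.
    ss = min(ss, 59)
    return datetime.datetime(y, m, d, hh, mm, ss, us)

try:
    datetime.datetime(2016, 12, 31, 23, 59, 60)
except ValueError:
    print("seconds must be in 0..59")  # an unclamped leap second is rejected

print(make_dt(2016, 12, 31, 23, 59, 60, 0))  # clamped to 23:59:59
```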

Member

Could we have more details in the PR description? I can follow what it fixes, but I believe we should describe the details better so that others can follow the history in the future. To me, it has always been hard to track such changes and verify the root cause of a bug.

Member

BTW, the change itself seems fine to me.

Contributor Author

I added some description and support for leap seconds.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81845 has finished for PR 19234 at commit bed6193.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Could you add [PySpark] to the title? cc @ueshin

@maver1ck maver1ck changed the title [SPARK-22010] Change fromInternal method of TimestampType [SPARK-22010][PySpark] Change fromInternal method of TimestampType Sep 17, 2017
@ueshin
Member

ueshin commented Sep 18, 2017

LGTM.

@maver1ck
Contributor Author

Any idea why we're not using
datetime.datetime.fromtimestamp(ts / 100000.)?
There is a comment about overflow, but does that overflow actually occur?

@maver1ck
Contributor Author

I'm asking because that code is about 2x faster than my solution.

@maver1ck maver1ck changed the title [SPARK-22010][PySpark] Change fromInternal method of TimestampType [WIP][SPARK-22010][PySpark] Change fromInternal method of TimestampType Sep 18, 2017
@HyukjinKwon
Member

I think it was changed back in #7363. I think that avoids precision loss from Python's float limitation:

>>> decimal.Decimal(1 / 1e6)
Decimal('9.99999999999999954748111825886258685613938723690807819366455078125E-7')
>>> decimal.Decimal(1 / 100000)
Decimal('0')
>>> decimal.Decimal(1 / 100000.)
Decimal('0.000010000000000000000818030539140313095458623138256371021270751953125')

Also, I think // is used for Python 3 behaviour:

>>> 5 / 2
2.5
>>> 5 // 2
2
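The difference is easy to observe directly. Dividing the internal microsecond value with float division can shift the microsecond component, while floor division plus modulo keeps it exact (a sketch; the sample timestamp is arbitrary):

```python
import datetime

def exact(ts):
    # Integer floor division keeps full microsecond precision.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def lossy(ts):
    # Float division can round the microsecond component.
    return datetime.datetime.fromtimestamp(ts / 1e6)

ts = 1505395200123456
print(exact(ts).microsecond)  # always 123456
print(lossy(ts).microsecond)  # may be off by one for some values

# Count how often the float path disagrees over a range of timestamps.
mismatches = sum(exact(t) != lossy(t) for t in range(ts, ts + 1000))
print("mismatched timestamps out of 1000:", mismatches)
```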

@maver1ck
Contributor Author

I checked with some samples, and the float-based code can trigger errors.

@HyukjinKwon
Member

Seems fine to me too as is. @maver1ck, I think you could take out [WIP] and let it be merged.

@maver1ck maver1ck changed the title [WIP][SPARK-22010][PySpark] Change fromInternal method of TimestampType [SPARK-22010][PySpark] Change fromInternal method of TimestampType Sep 18, 2017
@maver1ck
Contributor Author

OK. It passed all tests, so let's merge it.

@HyukjinKwon
Member

Hm, BTW, do we handle
https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1443-L1455:

        if tz is None:
            # As of version 2015f max fold in IANA database is
            # 23 hours at 1969-09-30 13:00:00 in Kwajalein.
            # Let's probe 24 hours in the past to detect a transition:
            max_fold_seconds = 24 * 3600
            y, m, d, hh, mm, ss = converter(t - max_fold_seconds)[:6]
            probe1 = cls(y, m, d, hh, mm, ss, us, tz)
            trans = result - probe1 - timedelta(0, max_fold_seconds)
            if trans.days < 0:
                y, m, d, hh, mm, ss = converter(t + trans // timedelta(0, 1))[:6]
                probe2 = cls(y, m, d, hh, mm, ss, us, tz)
                if probe2 == result:
                    result._fold = 1

Or do you think it can be ignored, given that the fix is quite new (in 3.6.x) and it is a corner case compared to the improvement?

Looking at the performance improvement in the PR description, it sounds fairly small. If it is safe to go, I am okay with it, but if we miss anything, I doubt it is worth the change.

@maver1ck
Contributor Author

It was introduced with this PEP.
https://www.python.org/dev/peps/pep-0495/
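For reference, PEP 495 (Python 3.6+) added the fold attribute so that the two occurrences of an ambiguous wall time during a backward clock shift can be told apart. A minimal illustration (the instant chosen is arbitrary; note that fold is ignored when comparing two naive datetimes):

```python
from datetime import datetime

# The same naive wall time, tagged as the first vs. second occurrence.
first = datetime(2017, 11, 5, 1, 30, fold=0)
second = datetime(2017, 11, 5, 1, 30, fold=1)
print(first.fold, second.fold)  # 0 1
print(first == second)          # True: fold does not affect naive equality
```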

Member

@HyukjinKwon HyukjinKwon left a comment

I am looking at the CPython 3.5 branch: https://github.com/python/cpython/blob/3.5/Lib/datetime.py#L1375-L1387 and https://github.com/python/cpython/blob/3.5/Lib/datetime.py#L1398

    @classmethod
    def fromtimestamp(cls, t, tz=None):
        """Construct a datetime from a POSIX timestamp (like time.time()).
        A timezone info object may be passed in as well.
        """
        _check_tzinfo_arg(tz)

        result = cls._fromtimestamp(t, tz is not None, tz)
        if tz is not None:
            result = tz.fromutc(result)
        return result

and ...

        frac, t = _math.modf(t)
        us = round(frac * 1e6)
        if us >= 1000000:
            t += 1
            us -= 1000000
        elif us < 0:
            t -= 1
            us += 1000000

        converter = _time.gmtime if utc else _time.localtime
        y, m, d, hh, mm, ss, weekday, jday, dst = converter(t)
        ss = min(ss, 59)    # clamp out leap seconds if the platform has them
        return cls(y, m, d, hh, mm, ss, us, tz)

So, this saves one comparison:

if tz is not None:
    ...

and fraction calculation:

        frac, t = _math.modf(t)
        us = round(frac * 1e6)
        if us >= 1000000:
            t += 1
            us -= 1000000
        elif us < 0:
            t -= 1
            us += 1000000

The change itself seems fine, as we save some (small amount of) computation anyway.



However, looking through https://www.python.org/dev/peps/pep-0495 (added in Python 3.6.0) and the code without a super close look, it looks like this code:

        if tz is None:
            # As of version 2015f max fold in IANA database is
            # 23 hours at 1969-09-30 13:00:00 in Kwajalein.
            # Let's probe 24 hours in the past to detect a transition:
            max_fold_seconds = 24 * 3600
            y, m, d, hh, mm, ss = converter(t - max_fold_seconds)[:6]
            probe1 = cls(y, m, d, hh, mm, ss, us, tz)
            trans = result - probe1 - timedelta(0, max_fold_seconds)
            if trans.days < 0:
                y, m, d, hh, mm, ss = converter(t + trans // timedelta(0, 1))[:6]
                probe2 = cls(y, m, d, hh, mm, ss, us, tz)
                if probe2 == result:
                    result._fold = 1

handles a (super) corner case of a certain time with DST in Kwajalein, if I read correctly.

So, I am neutral on this. I don't actively support this PR, but I wouldn't object if someone merged it in its current state.

More compelling reasons would probably help, e.g., showing that the actual performance improvement is not trivial.

Closing this one would also be fine with me.

@HyukjinKwon
Member

@maver1ck, if you'd like to leave it as is, I think we'd better close it.

@maver1ck
Contributor Author

maver1ck commented Nov 7, 2017

I think that porting the changes from Python 3.6 gives us overly complicated code.
I'm closing it.

@maver1ck maver1ck closed this Nov 7, 2017