
[SPARK-22010][PySpark] Change fromInternal method of TimestampType #19234

Closed
wants to merge 6 commits

Conversation

maver1ck
Contributor

@maver1ck maver1ck commented Sep 14, 2017

What changes were proposed in this pull request?

This PR changes the way PySpark converts timestamps from their internal representation (microseconds since the epoch) to Python datetime objects.

The code is based on Python datetime and udatetime library.
https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458
https://github.com/freach/udatetime/blob/08f95c233188a6a3d316e9905cfb3379278376f5/udatetime/_pure.py#L30-L42

Benchmarks
Before change:
4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

After change:
System with UTC timezone
1.49 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Other timezones:
3.15 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

How was this patch tested?

Existing tests.
Performance benchmarks.
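The numbers above could be reproduced with a small harness along these lines (a sketch only: `timeit` as the timer and the sample `ts` value are my assumptions, not taken from the PR):

```python
import datetime
import time
import timeit

# An arbitrary internal timestamp: microseconds since the Unix epoch.
ts = 1505395200123456

def from_internal_old(ts):
    # Old path: fromtimestamp on whole seconds, then patch in microseconds.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def from_internal_new(ts):
    # New path: unpack a struct_time and build the datetime directly.
    y, m, d, hh, mm, ss, _, _, _ = time.localtime(ts // 1000000)
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)

if __name__ == "__main__":
    # Both paths must agree before their speed is worth comparing.
    assert from_internal_old(ts) == from_internal_new(ts)
    for fn in (from_internal_old, from_internal_new):
        elapsed = timeit.timeit(lambda: fn(ts), number=100000)
        print(fn.__name__, "%.2f us per call" % (elapsed * 10))
```

The absolute timings depend on the machine and timezone, so only the ratio between the two paths is meaningful.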

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81786 has finished for PR 19234 at commit 02301eb.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81787 has finished for PR 19234 at commit 25bf50d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2017

Test build #81788 has finished for PR 19234 at commit f18f4c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Old:

    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

New:

    y, m, d, hh, mm, ss, _, _, _ = (time.gmtime(ts // 1000000) if _is_utc
                                    else time.localtime(ts // 1000000))
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
Member

I'd explain how the current change is equivalent to fromtimestamp in the PR description.

https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458

Contributor Author

I think the only difference is this ss = min(ss, 59)
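For context, the `ss = min(ss, 59)` line in CPython exists because a platform's `localtime` may report a leap second (`ss == 60`), which the `datetime` constructor rejects. A minimal illustration (the leap-second input here is contrived, since most platforms never actually report one):

```python
import datetime

def make_dt(y, m, d, hh, mm, ss, us):
    # Clamp a possible leap second so datetime() does not raise ValueError.
    ss = min(ss, 59)
    return datetime.datetime(y, m, d, hh, mm, ss, us)

try:
    datetime.datetime(2016, 12, 31, 23, 59, 60)
except ValueError:
    print("seconds must be in 0..59")  # an unclamped leap second is rejected

print(make_dt(2016, 12, 31, 23, 59, 60, 0))  # clamped to 23:59:59
```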

Member

Could we have more details in the PR description? I can follow what it fixes, but I believe we should describe the details better so that others can follow the history in the future. To me, it has always been hard to track such changes and verify the root cause of a bug.

Member

BTW, the change itself seems fine to me.

Contributor Author

I added some description and support for leap seconds.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81845 has finished for PR 19234 at commit bed6193.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Could you add [PySpark] to the title? cc @ueshin

@maver1ck maver1ck changed the title [SPARK-22010] Change fromInternal method of TimestampType [SPARK-22010][PySpark] Change fromInternal method of TimestampType Sep 17, 2017
@ueshin
Member

ueshin commented Sep 18, 2017

LGTM.

@maver1ck
Contributor Author

Any idea why we're not using
datetime.datetime.fromtimestamp(ts / 100000.)?
There is a comment about overflow, but does that overflow actually occur?

@maver1ck
Contributor Author

I'm asking because that code is about 2x faster than my solution.

@maver1ck maver1ck changed the title [SPARK-22010][PySpark] Change fromInternal method of TimestampType [WIP][SPARK-22010][PySpark] Change fromInternal method of TimestampType Sep 18, 2017
@HyukjinKwon
Member

I think it was changed back in #7363. I think that avoids precision loss from Python's float limitation:

>>> decimal.Decimal(1 / 1e6)
Decimal('9.99999999999999954748111825886258685613938723690807819366455078125E-7')
>>> decimal.Decimal(1 / 100000)
Decimal('0')
>>> decimal.Decimal(1 / 100000.)
Decimal('0.000010000000000000000818030539140313095458623138256371021270751953125')

Also, I think // is used for Python 3 behaviour:

>>> 5 / 2
2.5
>>> 5 // 2
2
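The difference is easy to observe directly. Dividing the internal microsecond value with float division can shift the microsecond component, while floor division plus modulo keeps it exact (a sketch; the sample timestamp is arbitrary):

```python
import datetime

def exact(ts):
    # Integer floor division keeps full microsecond precision.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def lossy(ts):
    # Float division can round the microsecond component.
    return datetime.datetime.fromtimestamp(ts / 1e6)

ts = 1505395200123456
print(exact(ts).microsecond)  # always 123456
print(lossy(ts).microsecond)  # may be off by one for some values

# Count how often the float path disagrees over a range of timestamps.
mismatches = sum(exact(t) != lossy(t) for t in range(ts, ts + 1000))
print("mismatched timestamps out of 1000:", mismatches)
```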

@maver1ck
Contributor Author

I checked with some samples, and the float-based code can trigger errors.

@HyukjinKwon
Member

Seems fine to me too as is. @maver1ck, I think you could take out [WIP] and let it be merged.

@maver1ck maver1ck changed the title [WIP][SPARK-22010][PySpark] Change fromInternal method of TimestampType [SPARK-22010][PySpark] Change fromInternal method of TimestampType Sep 18, 2017
@maver1ck
Contributor Author

OK. It passed all tests, so let's merge it.

@HyukjinKwon
Member

Hm, BTW, do we handle
https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1443-L1455:

        if tz is None:
            # As of version 2015f max fold in IANA database is
            # 23 hours at 1969-09-30 13:00:00 in Kwajalein.
            # Let's probe 24 hours in the past to detect a transition:
            max_fold_seconds = 24 * 3600
            y, m, d, hh, mm, ss = converter(t - max_fold_seconds)[:6]
            probe1 = cls(y, m, d, hh, mm, ss, us, tz)
            trans = result - probe1 - timedelta(0, max_fold_seconds)
            if trans.days < 0:
                y, m, d, hh, mm, ss = converter(t + trans // timedelta(0, 1))[:6]
                probe2 = cls(y, m, d, hh, mm, ss, us, tz)
                if probe2 == result:
                    result._fold = 1

Or do you think it can be ignored, given that the fix is quite new (in 3.6.x) and it is a corner case compared to the improvement?

Looking at the performance improvement in the PR description, it sounds fairly small. If it is safe to go, I am okay with it, but if we miss anything, I doubt it is worth the change.

@maver1ck
Contributor Author

It was introduced with this PEP.
https://www.python.org/dev/peps/pep-0495/
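For reference, PEP 495 (Python 3.6+) added the fold attribute so that the two occurrences of an ambiguous wall time during a backward clock shift can be told apart. A minimal illustration (the instant chosen is arbitrary; note that fold is ignored when comparing two naive datetimes):

```python
from datetime import datetime

# The same naive wall time, tagged as the first vs. second occurrence.
first = datetime(2017, 11, 5, 1, 30, fold=0)
second = datetime(2017, 11, 5, 1, 30, fold=1)
print(first.fold, second.fold)  # 0 1
print(first == second)          # True: fold does not affect naive equality
```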

Member

@HyukjinKwon HyukjinKwon left a comment

I am looking at the CPython 3.5 branch: https://github.com/python/cpython/blob/3.5/Lib/datetime.py#L1375-L1387 and https://github.com/python/cpython/blob/3.5/Lib/datetime.py#L1398

    @classmethod
    def fromtimestamp(cls, t, tz=None):
        """Construct a datetime from a POSIX timestamp (like time.time()).
        A timezone info object may be passed in as well.
        """
        _check_tzinfo_arg(tz)

        result = cls._fromtimestamp(t, tz is not None, tz)
        if tz is not None:
            result = tz.fromutc(result)
        return result

and ...

        frac, t = _math.modf(t)
        us = round(frac * 1e6)
        if us >= 1000000:
            t += 1
            us -= 1000000
        elif us < 0:
            t -= 1
            us += 1000000

        converter = _time.gmtime if utc else _time.localtime
        y, m, d, hh, mm, ss, weekday, jday, dst = converter(t)
        ss = min(ss, 59)    # clamp out leap seconds if the platform has them
        return cls(y, m, d, hh, mm, ss, us, tz)

So, this saves one comparison:

if tz is not None:
    ...

and fraction calculation:

        frac, t = _math.modf(t)
        us = round(frac * 1e6)
        if us >= 1000000:
            t += 1
            us -= 1000000
        elif us < 0:
            t -= 1
            us += 1000000

The change itself seems fine, as we save some (small amount of) computation anyway.



However, looking through https://www.python.org/dev/peps/pep-0495 (added in Python 3.6.0) and the code without a super close look, it looks like this code:

        if tz is None:
            # As of version 2015f max fold in IANA database is
            # 23 hours at 1969-09-30 13:00:00 in Kwajalein.
            # Let's probe 24 hours in the past to detect a transition:
            max_fold_seconds = 24 * 3600
            y, m, d, hh, mm, ss = converter(t - max_fold_seconds)[:6]
            probe1 = cls(y, m, d, hh, mm, ss, us, tz)
            trans = result - probe1 - timedelta(0, max_fold_seconds)
            if trans.days < 0:
                y, m, d, hh, mm, ss = converter(t + trans // timedelta(0, 1))[:6]
                probe2 = cls(y, m, d, hh, mm, ss, us, tz)
                if probe2 == result:
                    result._fold = 1

handles a (super) corner case of a certain time with DST in Kwajalein, if I read correctly.

So, I am neutral on this. I don't actively support this PR, but I wouldn't object if someone merged it in its current state.

More compelling reasons would probably help, e.g., showing that the actual performance improvement is not trivial.

Closing this one would also be fine with me.

@HyukjinKwon
Member

@maver1ck, if you'd like to leave it as is, I think we'd better close it.

@maver1ck
Contributor Author

maver1ck commented Nov 7, 2017

I think that porting the changes from Python 3.6 gives us overly complicated code.
I'm closing it.

@maver1ck maver1ck closed this Nov 7, 2017