
Tweak to parser skipped_idx + PEP8 cleanup #435

Merged: 3 commits merged into dateutil:master from pganssle:skipped_idx_list on Aug 4, 2017

Conversation

@pganssle (Member) commented Aug 2, 2017

One part of the cleanup mentioned in PR #419. I implemented the _recombine_skipped_tokens function as indicated here (extend), using a token queue (queue) and compared that to @jbrockmendel's version with a set. Here is the code for the queue version:

def _recombine_skipped_queue(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]

    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_idxs:
        if idx_queue and idx - 1 != idx_queue[-1]:
            # Not consecutive with the previous skipped index; flush the
            # queued indices into a single recombined string.
            skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))
            idx_queue = []

        idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))

    return skipped_tokens

I think we should keep skipped_idxs as a list, since it doesn't really need to be a set (and it currently starts out sorted anyway). Interestingly, I couldn't figure out why the queue version was consistently slower than Brock's version with skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]. Since strings are immutable, I was expecting that growing a string incrementally with + would be much slower than queuing up all the parts and combining them at the end, but lo and behold, when I switched over to the += style, the list version is faster than the set version! I believe that's because CPython special-cases this sort of string extension and tries to extend the string in place. See this StackOverflow question.
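
For reference, here's roughly what the extend-style version looks like (a minimal sketch based on the description above, assuming skipped_idxs comes in as a sorted list; the function name is just for illustration):

def _recombine_skipped_extend(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = [0, 1, 2, 5]
    >>> _recombine_skipped_extend(tokens, skipped_idxs)
    ['foo bar', 'baz']
    """
    skipped_tokens = []
    for i, idx in enumerate(skipped_idxs):
        if i > 0 and idx - 1 == skipped_idxs[i - 1]:
            # Consecutive with the previous skipped index, so grow the last
            # string; this is the += pattern that CPython special-cases by
            # trying to extend the string in place.
            skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])
    return skipped_tokens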

Despite the fact that this optimization is not part of the Python spec, it seems to be present in Python 2.7 and 3.6 as well as pypy2 and pypy3. Using a loop that randomly generates token strings and skipped indices to test this, here are some profiling results:

Python 2.7:

$ python2.7 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.884 ms per loop, 1.767 us per set
Running queue timing test
1000 loops, 500 sets: 1.263 ms per loop, 2.526 us per set
Running set timing test
1000 loops, 500 sets: 1.192 ms per loop, 2.384 us per set

Python 3.6:

$ python3.6 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 1.075 ms per loop, 2.150 us per set
Running queue timing test
1000 loops, 500 sets: 1.469 ms per loop, 2.938 us per set
Running set timing test
1000 loops, 500 sets: 1.238 ms per loop, 2.477 us per set

pypy2:

$ pypy test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.132 ms per loop, 0.263 us per set
Running queue timing test
1000 loops, 500 sets: 0.425 ms per loop, 0.851 us per set
Running set timing test
1000 loops, 500 sets: 0.313 ms per loop, 0.627 us per set

pypy3:

$ pypy3 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.149 ms per loop, 0.297 us per set
Running queue timing test
1000 loops, 500 sets: 0.339 ms per loop, 0.677 us per set
Running set timing test
1000 loops, 500 sets: 0.236 ms per loop, 0.472 us per set

The code to run this can be found here.
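
For context, a minimal harness along those lines might look something like the sketch below (an illustrative sketch, not the actual linked script; it times the extend-style sketch from earlier, and the token/index generation is made up):

import random
import string
import timeit

def _make_case(n_tokens=10, n_skipped=5):
    # Hypothetical test-case generator: random short tokens plus a sorted
    # list of skipped indices, roughly mimicking the parser's inputs.
    tokens = [''.join(random.choice(string.ascii_lowercase) for _ in range(3))
              for _ in range(n_tokens)]
    skipped_idxs = sorted(random.sample(range(n_tokens), n_skipped))
    return tokens, skipped_idxs

cases = [_make_case() for _ in range(500)]

def run_extend():
    for tokens, skipped_idxs in cases:
        _recombine_skipped_extend(tokens, skipped_idxs)

ms_per_loop = timeit.timeit(run_extend, number=1000) / 1000 * 1000
print("1000 loops, 500 sets: %.3f ms per loop, %.3f us per set"
      % (ms_per_loop, ms_per_loop / 500 * 1000))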

@pganssle pganssle added the parser label Aug 2, 2017
@pganssle pganssle added this to the 2.7.0 milestone Aug 2, 2017
@pganssle pganssle force-pushed the pganssle:skipped_idx_list branch from 27544d7 to dfd5c71 Aug 2, 2017
@pganssle pganssle requested a review from jbrockmendel Aug 2, 2017
@pganssle (Member, Author) commented Aug 2, 2017

@jbrockmendel You should be able to merge this if you want to take a look.

The substantive change is in the first commit. The second commit is just some PEP8 cleanup.

@@ -1328,9 +1327,6 @@ def _parse_hms(i, l, info, res):
return i


# TODO: require len(token) >= 3 like we do for the between-parens version?

@pganssle (Author, Member) commented Aug 2, 2017
Note: I removed this because I don't think we can require len(token) >= 3 here, since Z is most definitely a valid potential tzname.

@jbrockmendel (Contributor) commented Aug 2, 2017
OK. There are two things I'd like to accomplish here (not necessarily in this PR):

  1. Get as much of the could-this-be-a-tzname logic into one place as possible. So if Z is allowed, but aside from that special case there are length-based criteria, that would be easy to put into _could_be_tzname (see the sketch after this list). You've put much more thought than I have into timezone stuff, so I'm going to defer to you on what this logic should actually be.

  2. Unify/merge the logic used here and in the between-parens tzname criteria. Or if it can't be unified, at least clearly document the relevant differences.
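
For item 1, a consolidated check might look something like this (purely illustrative; the exact length/uppercase criteria are assumptions, not what the parser currently does):

import string

def _could_be_tzname(token):
    # Illustrative criteria only: allow the 'Z' special case mentioned
    # above, otherwise require a short, all-uppercase alphabetic token
    # (e.g. 'EST', 'UTC').
    if token == "Z":
        return True
    return (3 <= len(token) <= 5 and
            all(c in string.ascii_uppercase for c in token))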

@pganssle (Author, Member) commented Aug 2, 2017
Yeah, the time zone stuff needs a pretty serious overhaul. The reason I haven't really dug into it yet is that I think it involves some opportunistic loading of time zone data from various sources, and I'm not sure how heavy a burden that will be in resource-constrained environments, so I've been trying to think through the right way to do it.

I'm not sure why the between-parens tzname criteria are different, but one thing I've learned is that a lot of the stuff that seems crazy in the parser is there because there's some spec, or some subset of people generating date strings in a very specific way (like the fact that `12:00:00+0300` and `12:00:00 UTC+0300` parse to UTC+3 and UTC-3, respectively), so it's probably worth trying to dig up why they were separate in the first place before unifying them.

That said, we should definitely maximize code reuse, if only so that downstream users have to patch in a single place if they want different behavior; so if there's some reason they are different, it's worth factoring out everything except that difference.
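
To illustrate the sign-flip convention being described (the expected offsets in the comments are what the behavior above implies; this is a sketch, not independently verified output):

from dateutil import parser

# A bare numeric offset keeps its sign...
print(parser.parse("12:00:00+0300").utcoffset())       # expected: 3:00:00
# ...while the POSIX-style named-zone form flips it, per the convention
# described above.
print(parser.parse("12:00:00 UTC+0300").utcoffset())   # expected: -1 day, 21:00:00 (UTC-3)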

-                (yearfirst and self[1] <= 12 and self[2] <= 31):
+        if (self[0] > 31 or
+                self.find_probable_year_index(_timelex.split(self.tzstr)) == 0 or
+                (yearfirst and self[1] <= 12 and self[2] <= 31)):

@jbrockmendel (Contributor) commented Aug 2, 2017
+1. This is much prettier.

-        if i+1 < len_l and l[i+1] in ('+', '-'):
-            l[i+1] = ('+', '-')[l[i+1] == '+']
+        if i + 1 < len_l and l[i + 1] in ('+', '-'):
+            l[i + 1] = ('+', '-')[l[i + 1] == '+']

@jbrockmendel (Contributor) commented Aug 2, 2017
Not strictly pertinent, but in a perfect world we should avoid altering the tokens.

@pganssle (Author, Member) commented Aug 2, 2017
Agreed. The flow of this code is somewhat confusing, so I'm not sure why it needs to be altered, but presumably once we have something that looks like a zone name we'd move this logic into a tzparser() that specifically parses time zones anyway.

We'll also need to be able to turn off this logic, because the POSIX spec is incredibly counter-intuitive, so I think a bunch of sources emit strings with the exact opposite convention.

@@ -1401,19 +1403,20 @@ def _parsems(value):


def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """

@jbrockmendel (Contributor) commented Aug 2, 2017
Can you update the docstring example? I think it just requires changing set([0, 1, 2, 5]) to [0, 1, 2, 5].

@pganssle (Author, Member) commented Aug 2, 2017
Oops!

@jbrockmendel (Contributor) left a review:

lgtm

@pganssle pganssle merged commit 123bcd2 into dateutil:master Aug 4, 2017
3 checks passed:
  codecov/project: 90.84% (target 80%)
  continuous-integration/appveyor/pr: AppVeyor build succeeded
  continuous-integration/travis-ci/pr: The Travis CI build passed
@pganssle pganssle added this to Style and Organization in Codebase Cleanup Nov 13, 2017
@pganssle pganssle mentioned this pull request Mar 11, 2018