
Tweak to parser skipped_idx + PEP8 cleanup #435


Merged
merged 3 commits into dateutil:master on Aug 4, 2017

Conversation


@pganssle pganssle commented Aug 2, 2017

One part of the cleanup mentioned in PR #419. I implemented the _recombine_skipped_tokens function as indicated here (extend) and with a token queue (queue), and compared both to @jbrockmendel's version with a set. Here is the code for the queue version:

def _recombine_skipped_queue(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped_queue(tokens, skipped_idxs)
    ['foo bar', 'baz']

    """
    # This groups consecutive values
    skipped_tokens = []
    idx_queue = []
    for idx in skipped_idxs:
        if idx_queue and idx - 1 != idx_queue[-1]:
            skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))
            idx_queue = []

        idx_queue.append(idx)

    if idx_queue:
        skipped_tokens.append(''.join(map(tokens.__getitem__, idx_queue)))

    return skipped_tokens

I think we should keep skipped_idxs as a list, since it doesn't really need to be a set (and currently it starts out sorted anyway). Interestingly, I couldn't figure out why the queue version was consistently slower than Brock's version with skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]. Since strings are immutable, I expected growing a string incrementally with + to be much slower than queuing up all the parts and combining them at the end, but lo and behold, once I switched over to the += style, the list version is faster than the set version! I believe that's because CPython special-cases this sort of string extension so that it can try to extend the string in place. See this StackOverflow question.
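For reference, here is a rough sketch of the extend-style version being compared against, based on the description above (skipped_idxs is assumed to be a sorted list; this is not a verbatim copy of the PR code):

def _recombine_skipped_extend(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = [0, 1, 2, 5]
    >>> _recombine_skipped_extend(tokens, skipped_idxs)
    ['foo bar', 'baz']
    """
    skipped_tokens = []
    for i, idx in enumerate(skipped_idxs):
        if i > 0 and idx - 1 == skipped_idxs[i - 1]:
            # Consecutive skipped index: grow the current run in place.
            # CPython special-cases this += on strings with a single
            # reference, which is what makes it cheap in practice.
            skipped_tokens[-1] += tokens[idx]
        else:
            # Start a new run of skipped tokens.
            skipped_tokens.append(tokens[idx])

    return skipped_tokens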

Despite the fact that this optimization is not part of the Python spec, it seems to be present in Python 2.7 and 3.6 as well as in pypy2 and pypy3. Testing with a loop that randomly generates token strings and skipped indices, here are some profiling results:

Python 2.7:

$ python2.7 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.884 ms per loop, 1.767 us per set
Running queue timing test
1000 loops, 500 sets: 1.263 ms per loop, 2.526 us per set
Running set timing test
1000 loops, 500 sets: 1.192 ms per loop, 2.384 us per set

Python 3.6:

$ python3.6 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 1.075 ms per loop, 2.150 us per set
Running queue timing test
1000 loops, 500 sets: 1.469 ms per loop, 2.938 us per set
Running set timing test
1000 loops, 500 sets: 1.238 ms per loop, 2.477 us per set

pypy2:

$ pypy test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.132 ms per loop, 0.263 us per set
Running queue timing test
1000 loops, 500 sets: 0.425 ms per loop, 0.851 us per set
Running set timing test
1000 loops, 500 sets: 0.313 ms per loop, 0.627 us per set

pypy3:

$ pypy3 test_recombination_speed.py 
Running extend timing test
1000 loops, 500 sets: 0.149 ms per loop, 0.297 us per set
Running queue timing test
1000 loops, 500 sets: 0.339 ms per loop, 0.677 us per set
Running set timing test
1000 loops, 500 sets: 0.236 ms per loop, 0.472 us per set

The code to run this can be found here.
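The linked script isn't reproduced in this thread, but a minimal sketch of that kind of harness (the names, sizes, and random-case generator are illustrative, not the actual test_recombination_speed.py) might look like the following, reusing the queue version above:

import random
import string
import timeit

# Assumes _recombine_skipped_queue (defined above) is in scope.

def make_case(n_tokens=20, n_skipped=8):
    # Randomly generate a token list and a sorted list of skipped indices.
    tokens = [''.join(random.choice(string.ascii_lowercase) for _ in range(3))
              for _ in range(n_tokens)]
    skipped_idxs = sorted(random.sample(range(n_tokens), n_skipped))
    return tokens, skipped_idxs

cases = [make_case() for _ in range(500)]

def run_queue():
    for tokens, skipped_idxs in cases:
        _recombine_skipped_queue(tokens, skipped_idxs)

total = timeit.timeit(run_queue, number=1000)   # seconds for 1000 loops
ms_per_loop = total / 1000 * 1e3
us_per_set = total / 1000 / 500 * 1e6
print("1000 loops, 500 sets: %.3f ms per loop, %.3f us per set"
      % (ms_per_loop, us_per_set))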

@pganssle pganssle added the parser label Aug 2, 2017
@pganssle pganssle added this to the 2.7.0 milestone Aug 2, 2017

pganssle commented Aug 2, 2017

@jbrockmendel You should be able to merge this if you want to take a look.

The substantive change is in the first commit. The second commit is just some PEP8 cleanup.

@@ -1328,9 +1327,6 @@ def _parse_hms(i, l, info, res):
return i


# TODO: require len(token) >= 3 like we do for the between-parens version?
Member Author


Note: I removed this because I don't think we can require len(token) >= 3 here, since Z is most definitely a valid potential tzname.
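For concreteness, the one-character tzname in question shows up in completely ordinary ISO 8601 input (expected output shown as a comment):

from dateutil import parser

# "Z" is the entire tzname here, so a len(token) >= 3 requirement
# would reject a very common kind of input.
dt = parser.parse("2017-08-02T12:00:00Z")
print(dt.tzinfo)   # tzutc()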

Contributor


OK. There are two things I'd like to accomplish here (not necessarily in this PR):

  1. Get as much of the could-this-be-a-tzname logic as possible into one place. So if Z is allowed, but aside from that special case there are length-based criteria, that would be easy to put into _could_be_tzname (there's a sketch of one possible shape after this list). You've put much more thought than I have into timezone stuff, so I'm going to defer to you on what this logic should actually be.

  2. Unify/merge the logic used here and in the between-parens tzname criteria. Or if it can't be unified, at least clearly document the relevant differences.
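A purely hypothetical sketch of what point 1 might look like (illustrative only, not the actual dateutil implementation; the exact criteria are what's being deferred above):

# Placeholder names and thresholds, just to make the discussion concrete.
UTCZONE = {"UTC", "GMT", "Z"}

def _could_be_tzname(token):
    # Special-case the UTC aliases, including the one-character "Z".
    if token in UTCZONE:
        return True
    # Otherwise fall back to a length-based criterion: typical tz
    # abbreviations are a few ASCII uppercase letters (e.g. "EST").
    return 3 <= len(token) <= 5 and token.isalpha() and token.isupper()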

Member Author


Yeah, the time zone stuff needs a pretty serious overhaul. The reason I haven't really dug into it yet is that I think it involves some opportunistic loading of time zone data from various sources, and I'm not sure how heavy a burden that will be in resource-constrained environments, so I've been trying to think through the right way to do it.

I'm not sure why the between-parens tzname criteria are different, but one thing I've learned is that a lot of the stuff that seems crazy in the parser is there because there's some spec, or some subset of people generating date strings in a very specific way (like the fact that 12:00:00+0300 and 12:00:00 UTC+0300 parse to UTC+3 and UTC-3, respectively), so it's probably worth trying to dig up why they were separate in the first place before unifying them.
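To make that sign inversion concrete (the date is arbitrary, and the expected output, shown as comments, follows the behavior described above):

from dateutil import parser

# Bare numeric offset: +0300 means UTC+3.
print(parser.parse("2017-08-02 12:00:00+0300").utcoffset())
# 3:00:00

# Named zone plus offset: "UTC+0300" follows the POSIX convention,
# so it comes out as UTC-3, the opposite sign.
print(parser.parse("2017-08-02 12:00:00 UTC+0300").utcoffset())
# -1 day, 21:00:00  (i.e. -3 hours)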

That said, we should definitely maximize code-reuse if only so that downstream users only have to patch in a single place if they want a different behavior, so if there's some reason they are different, it's worth factoring out everything except that difference.

(yearfirst and self[1] <= 12 and self[2] <= 31):
if (self[0] > 31 or
self.find_probable_year_index(_timelex.split(self.tzstr)) == 0 or
(yearfirst and self[1] <= 12 and self[2] <= 31)):
Contributor


+1. This is much prettier.

if i+1 < len_l and l[i+1] in ('+', '-'):
l[i+1] = ('+', '-')[l[i+1] == '+']
if i + 1 < len_l and l[i + 1] in ('+', '-'):
l[i + 1] = ('+', '-')[l[i + 1] == '+']
Contributor


Not strictly pertinent, but in a perfect world we should avoid altering the tokens.

Member Author


Agreed. The flow of this code is somewhat confusing, so I'm not sure why it needs to be altered, but presumably once we have something that looks like a zone name we'd move this logic into a tzparser() that specifically parses time zones anyway.

We'll also need to be able to turn off this logic, because the POSIX spec is incredibly counter-intuitive, so I think a lot of software emits strings with the exact opposite convention.
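As a side note on the hunk above: a non-mutating sketch of the same sign flip as l[i + 1] = ('+', '-')[l[i + 1] == '+'] (hypothetical helper and token list), just to illustrate what "not altering the tokens" could look like:

def flip_sign(sign):
    # Equivalent to the tuple-indexing idiom ('+', '-')[sign == '+']:
    # the boolean picks '-' when the original sign is '+', and vice versa.
    return '-' if sign == '+' else '+'

tokens = ['12', ':', '00', ':', '00', ' ', 'UTC', '+', '0300']
i = 6                                 # index of the tzname token in this example
flipped = flip_sign(tokens[i + 1])    # '-', and tokens itself is left untouched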

@@ -1401,19 +1403,20 @@ def _parsems(value):

def _recombine_skipped(tokens, skipped_idxs):
"""

>>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
>>> skipped_idxs = set([0, 1, 2, 5])
>>> _recombine_skipped(tokens, skipped_idxs)
["foo bar", "baz"]

"""
Contributor


Can you update the docstring example? I think it just requires changing set([0, 1, 2, 5]) to [0, 1, 2, 5].

Member Author


Oops!

Contributor

@jbrockmendel jbrockmendel left a comment


lgtm

@pganssle pganssle merged commit 123bcd2 into dateutil:master Aug 4, 2017
@pganssle pganssle mentioned this pull request Mar 11, 2018