Tweak to parser skipped_idx + PEP8 cleanup #435
Conversation

Commits 27544d7 to dfd5c71
@jbrockmendel You should be able to merge this if you want to take a look. The substantive change is in the first commit. The second commit is just some PEP8 cleanup.
@@ -1328,9 +1327,6 @@ def _parse_hms(i, l, info, res):
        return i

-    # TODO: require len(token) >= 3 like we do for the between-parens version?
Note: I removed this because I don't think we can require `len(token) >= 3` here, since `Z` is most definitely a valid potential tzname.
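As a quick illustration of why the one-character case matters (this snippet is mine, not part of the PR), the parser already accepts `Z` as the UTC designator:

```python
from dateutil import parser

# "Z" is the ISO 8601 / RFC 3339 UTC designator, so a length-based rule
# requiring len(token) >= 3 would have to special-case it.
dt = parser.parse("2018-01-01T12:00:00Z")
print(dt.tzinfo)        # tzutc()
print(dt.utcoffset())   # 0:00:00
```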
OK. There are two things I'd like to accomplish here (not necessarily in this PR):

- Get as much of the could-this-be-a-tzname logic as possible into one place. So if `Z` is allowed but, aside from that special case, there are length-based criteria, that would be easy to put into `_could_be_tzname` (see the sketch below). You've put much more thought than I have into timezone stuff, so I'm going to defer to you on what this logic should actually be.
- Unify/merge the logic used here and in the between-parens tzname criteria. Or, if it can't be unified, at least clearly document the relevant differences.
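A minimal sketch of the kind of single-place check being described; the exact criteria (and the helper name `_could_be_tzname`) are assumptions here, not something settled in this PR:

```python
def _could_be_tzname(token):
    # Hypothetical consolidation of the tzname heuristics discussed above:
    # accept "Z" as a special case, otherwise fall back to a length-based
    # rule. The real criteria are exactly what is still open for debate.
    if token == "Z":
        return True
    return token.isalpha() and 3 <= len(token) <= 5
```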
Yeah, the time zone stuff needs a pretty serious overhaul. The reason I haven't really started on it yet is that I think it involves some opportunistic loading of time zone data from various sources, and I'm not sure how heavy a burden that will be in resource-constrained environments, so I've been trying to think through the right way to do it.

I'm not sure why the between-parens tzname criteria are different, but one thing I've learned is that a lot of the stuff that seems crazy in the parser is there because there is some spec, or some subset of people generating date strings in a very specific way (like the fact that `12:00:00+0300` and `12:00:00 UTC+0300` parse to UTC+3 and UTC-3, respectively), so it's probably worth trying to dig up why they were separate in the first place before unifying them.

That said, we should definitely maximize code reuse, if only so that downstream users have to patch in a single place if they want different behavior; so if there is some reason they are different, it's worth factoring out everything except that difference.
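To make the offset-sign point above concrete, here is a small snippet (mine, not part of the PR); the printed offsets assume the current parser behavior described in the comment:

```python
from dateutil import parser

# A bare offset is taken at face value: three hours east of UTC.
print(parser.parse("12:00:00+0300").utcoffset())       # 3:00:00

# "UTC+0300" follows the POSIX-style convention, so the sign is flipped
# and the result is three hours *behind* UTC.
print(parser.parse("12:00:00 UTC+0300").utcoffset())   # -1 day, 21:00:00
```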
            (yearfirst and self[1] <= 12 and self[2] <= 31):

        if (self[0] > 31 or
                self.find_probable_year_index(_timelex.split(self.tzstr)) == 0 or
                (yearfirst and self[1] <= 12 and self[2] <= 31)):
+1. This is much prettier.
        if i+1 < len_l and l[i+1] in ('+', '-'):
            l[i+1] = ('+', '-')[l[i+1] == '+']

        if i + 1 < len_l and l[i + 1] in ('+', '-'):
            l[i + 1] = ('+', '-')[l[i + 1] == '+']
Not strictly pertinent, but in a perfect world we should avoid altering the tokens.
Agreed. The flow of this code is somewhat confusing, so I'm not sure why the token needs to be altered, but presumably once we have something that looks like a zone name we'd move this logic into a `tzparser()` that specifically parses time zones anyway.

We'll also need to be able to turn off this logic, because the POSIX spec is incredibly counter-intuitive, so I think a bunch of software emits strings with the exact opposite convention.
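As a hedged illustration of the "avoid altering the tokens" point (this helper is hypothetical, not something from the PR), the same decision could be expressed as a pure function that returns a sign instead of rewriting `l[i + 1]` in place:

```python
def offset_sign_after_zone_name(tokens, i, len_l):
    """Return +1 or -1 for an offset following a POSIX-style zone name,
    or None if no offset follows.  Flipping the sign here mirrors the
    ('+', '-')[...] trick in the diff above, but leaves the token list alone.
    """
    if i + 1 < len_l and tokens[i + 1] in ('+', '-'):
        # POSIX convention: "GMT+3" means three hours *behind* UTC.
        return -1 if tokens[i + 1] == '+' else 1
    return None
```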
dateutil/parser.py (outdated)
@@ -1401,19 +1403,20 @@ def _parsems(value):


def _recombine_skipped(tokens, skipped_idxs):
    """
    >>> tokens = ["foo", " ", "bar", " ", "19June2000", "baz"]
    >>> skipped_idxs = set([0, 1, 2, 5])
    >>> _recombine_skipped(tokens, skipped_idxs)
    ["foo bar", "baz"]
    """
Can you update the docstring example? I think it just requires changing `set([0, 1, 2, 5])` to `[0, 1, 2, 5]`.
Oops!
lgtm
One part of the cleanup mentioned in PR #419. I implemented the `_recombine_skipped_tokens` function as indicated here (`extend`), using a token queue (`queue`), and compared that to @jbrockmendel's version with a `set`. Here is the code for the `queue` version:
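The snippet itself did not survive in this thread; the sketch below is a rough reconstruction of the two variants being compared, based on the docstring shown in the diff above. The function names are mine, not the PR's:

```python
def _recombine_skipped_queue(tokens, skipped_idxs):
    """Queue variant: buffer each run of adjacent skipped tokens and join
    the pieces once the run ends.

    >>> _recombine_skipped_queue(["foo", " ", "bar", " ", "19June2000", "baz"],
    ...                          [0, 1, 2, 5])
    ['foo bar', 'baz']
    """
    sorted_idxs = sorted(skipped_idxs)
    skipped_tokens = []
    queue = []
    for i, idx in enumerate(sorted_idxs):
        if i > 0 and idx - 1 != sorted_idxs[i - 1]:
            # Gap in the indices: the previous run is complete, flush it.
            skipped_tokens.append(''.join(queue))
            queue = []
        queue.append(tokens[idx])
    if queue:
        skipped_tokens.append(''.join(queue))
    return skipped_tokens


def _recombine_skipped_concat(tokens, skipped_idxs):
    """String-concatenation variant, in the style of
    ``skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]``.
    """
    sorted_idxs = sorted(skipped_idxs)
    skipped_tokens = []
    for i, idx in enumerate(sorted_idxs):
        if i > 0 and idx - 1 == sorted_idxs[i - 1]:
            # Adjacent skipped token: grow the current string.  CPython can
            # often resize the string in place here, which is why this ends
            # up being surprisingly fast.
            skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]
        else:
            skipped_tokens.append(tokens[idx])
    return skipped_tokens
```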
I think we should keep `skipped_idxs` as a `list`, since it doesn't really need to be a `set` (and it currently starts out sorted anyway).

Interestingly, I couldn't figure out why the queue version was consistently slower than Brock's version with `skipped_tokens[-1] = skipped_tokens[-1] + tokens[idx]`. Since strings are immutable, I was expecting growing a string incrementally with `+` to be much slower than queuing up all the parts and combining them at the end, but lo and behold, when I switched over to the `+=`-style method, the list version is faster than the set version! I believe that's because CPython special-cases this sort of string extension and tries to extend the string in place. See this StackOverflow question.

Despite the fact that this behavior is not part of the Python spec, it seems to be implemented in Python 2.7 and 3.6 as well as pypy2 and pypy3. Using a loop that randomly generates token strings and skipped indices to test this, here are some profiling results:
Python 2.7:
Python 3.6:
pypy2:
pypy3:
The code to run this can be found here.