Invalid datetime can produce partial parse. Should return None instead #26
I came across this silent failure when parsing a third party datetime with an unusual separator:
>>> import ciso8601
>>> ciso8601.parse_datetime('2017-05-29 17:36:00Z')
datetime.datetime(2017, 5, 29, 17, 36, tzinfo=<UTC>)
# Now for a completely invalid string
>>> print(ciso8601.parse_datetime('Completely invalid!'))
None
# Now for an out of spec string (invalid separator)
>>> ciso8601.parse_datetime('2017-05-29_17:36:00Z')
datetime.datetime(2017, 5, 29, 0, 0)
It seems that the parser sees a character that it doesn't recognize and stops parsing?
I would have expected that if the entire string cannot be parsed, it would return None.
This led to a sneaky bug in the code.
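To make the failure mode concrete, here is a toy pure-Python model of a parser that stops at the first character it does not recognise and returns whatever it has built so far. The function name `lenient_parse` and its logic are assumptions for illustration only; this is not ciso8601's actual C code, and it ignores seconds and time zones.

```python
import datetime


def lenient_parse(s):
    """Toy parser that returns whatever it has parsed so far.

    Hypothetical illustration only; this is not ciso8601's C implementation.
    """
    try:
        if s[4:5] != "-" or s[7:8] != "-":
            return None
        date = datetime.datetime(int(s[0:4]), int(s[5:7]), int(s[8:10]))
    except ValueError:
        # Nothing usable was parsed at all.
        return None

    # An unrecognised separator ends parsing, but the date parsed so far
    # is still returned. This is the silent partial result.
    if s[10:11] not in ("T", " "):
        return date

    try:
        return date.replace(hour=int(s[11:13]), minute=int(s[14:16]))
    except ValueError:
        return date
```

With this model, `lenient_parse('2017-05-29_17:36:00Z')` yields midnight on 2017-05-29, which mirrors the surprising result in the session above.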
I'm not sure what sort of PR you'd be expecting. If it were my decision (i.e. my library), I think I would change it to return
If that's what you'd like to see, I could take a shot at that. But it would be a change of behaviour, so anyone who has written code that relies on this behaviour would need to change their code.
As an example of usage that would change, this test would change to expect
I don't think changing the behavior is bad as long as we increase the major/minor version number, or maybe introduce a
This library was originally designed to take valid (trusted) input and its main focus is on parsing it as efficiently as possible into a datetime, which is why parsing invalid strings may sometimes result in unexpected behavior (and there are several issues open about this).
@suhailpatel Do you have any thoughts on this?
I think being strict is the best approach (e.g. return None if you see anything 'invalid' whilst parsing the string). If you are too liberal, you end up with bugs like the ones mentioned, where you get partial dates or partial times.
Also, as we found in #30, the CPython PyDatetime interface does accept invalid input in some cases, which then causes errors when you manipulate badly parsed datetimes. So we definitely need to ensure we have good checking in our C parsing code on our side.
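As a rough sketch of what "strict" could mean in practice, the following pure-Python function only returns a datetime when the whole string matches and every field is in range. The name `strict_parse`, the accepted format (full date, `T` or space separator, seconds required, optional `Z` or `±hh:mm` offset) and the regex are all assumptions for illustration; the real library is written in C and accepts more ISO 8601 variants.

```python
import datetime
import re

# Hypothetical, deliberately narrow grammar: YYYY-MM-DD, 'T' or ' ',
# hh:mm:ss, then an optional 'Z' or +hh:mm / -hh:mm offset.
_STRICT = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})"
    r"[T ]"
    r"(\d{2}):(\d{2}):(\d{2})"
    r"(Z|[+-]\d{2}:\d{2})?",
    re.ASCII,
)


def strict_parse(s):
    """Return a datetime only if the entire string is valid, else None."""
    m = _STRICT.fullmatch(s)  # fullmatch rejects any trailing garbage
    if m is None:
        return None
    fields = [int(g) for g in m.groups()[:6]]
    tz = m.group(7)
    try:
        tzinfo = None
        if tz == "Z":
            tzinfo = datetime.timezone.utc
        elif tz is not None:
            hours, minutes = int(tz[1:3]), int(tz[4:6])
            if minutes > 59:
                raise ValueError("offset minutes out of range")
            delta = datetime.timedelta(hours=hours, minutes=minutes)
            tzinfo = datetime.timezone(delta if tz[0] == "+" else -delta)
        # The Python-level constructor range-checks every field and
        # raises ValueError for things like month 13 or February 30.
        return datetime.datetime(*fields, tzinfo=tzinfo)
    except ValueError:
        return None
```

Here the string with the `_` separator is rejected outright instead of producing a date at midnight, and out-of-range fields such as `2017-02-30` also give None.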
I would argue that returning
I've made an attempt at making this change. I'll be making a PR as soon as I've resolved anything that comes out of this discussion. You can see my initial comments and a [largely outdated] draft PR here.
The problem with the current version of
As a developer with a given timestamp string, I cannot predict a priori what
My v2 Invariant
In my mind, the best way to resolve this ambiguity (in the case of invalid timestamps) is to premise version 2 of
My current work seems to indicate that this imposes a <10% performance penalty over v1.0.7 (but I'll continue profiling to make sure there's no more optimization to do).
FYI I'm fine with that, but I'm also okay if we skip the exact reason if it benefits performance or simplifies the code.
It also exists for the case where your code doesn't (need to) deal with aware datetimes. People have different opinions on whether to use aware or naive datetimes, but e.g. http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/ suggests that "internally always use offset naive datetime objects". Personally, I also prefer to use naive datetimes throughout the code (which are assumed to be in UTC), and most of my use cases actually use
An idea is to still ensure that the string is valid ISO8601 in its entirety (i.e. you reject strings with an invalid time zone offset), but ignore the timezone information when returning the Python string (or add the offset to the naive datetime so it's always UTC). Would have to check what the actual performance penalty is.
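The two options in that idea (drop the offset, or fold it in so the result is always UTC) can be sketched with the standard library alone. The helper names `drop_offset` and `to_naive_utc` are hypothetical, and the sketch assumes the string has already been validated and parsed into an aware datetime.

```python
import datetime


def drop_offset(dt):
    """Option 1: ignore the time zone and keep the wall-clock fields."""
    return dt.replace(tzinfo=None)


def to_naive_utc(dt):
    """Option 2: apply the offset so the naive result is always UTC."""
    if dt.tzinfo is None:
        # Already naive; assumed to be UTC by convention.
        return dt
    return dt.astimezone(datetime.timezone.utc).replace(tzinfo=None)


# Example input: 17:36 at UTC+02:00.
aware = datetime.datetime(
    2017, 5, 29, 17, 36,
    tzinfo=datetime.timezone(datetime.timedelta(hours=2)),
)
print(drop_offset(aware))   # 2017-05-29 17:36:00
print(to_naive_utc(aware))  # 2017-05-29 15:36:00
```

The two options give different naive values for the same input, so whichever one is chosen would need to be documented clearly.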
Just to be clear, would
I really like the idea of replacing "unaware" with "naive".
I guess the use of the term "unsafe" is because a date time like
Ah, I see. You are perfectly right. I've profiled it and the performance penalty is almost entirely in the
I think that clears things up for me. There will be
That's fine, though I see this as a case where you should be using
This is why I want to change the name to
Alternative names could be