Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in num2date cinversion #134

Closed
rkouznetsov opened this issue Nov 20, 2019 · 22 comments
Closed

Error in num2date cinversion #134

rkouznetsov opened this issue Nov 20, 2019 · 22 comments

Comments

@rkouznetsov
Copy link

Originally posted to Unidata/netcdf4-python#981, but is still reproducible with freshly-cloned cftime.

  #!/usr/bin/env python
 import numpy as np
 import cftime as nc4
 iarr=np.arange(86400, dtype=np.int32)
 units='hours since 2018-01-01 00:00:00 UTC'
 dates1 = nc4.num2date(iarr,units)[-5:]
 dates2 = nc4.num2date(iarr[-5:],units)
 print dates1 == dates2

Results in
[ True False True True False]
Should be
[ True True True True True]

Environment:
Ubuntu 18.04, python-numpy 1:1.13.3-2ubuntu1, cftime (revision ea68823).

Thank you!

@jswhit
Copy link
Collaborator

jswhit commented Nov 20, 2019

Microseconds are now included in the comparison (#118) - and the 2nd and 5th dates are off by 13 microseconds. The calculation itself has an accuracy of about 100 milliseconds. Perhaps we should modify the comparison to ignore differences of less than a specified tolerance? @davidhassell - what do you think?

@rkouznetsov
Copy link
Author

Hi, Thank you for prompt response. I do not think that the comparison is an issue. I can imagine a case when 13 microseconds difference matters, so comparison did a good job. I would say, that in given example the issue is in handling dates as float/double. The integer number of seconds from midnight should never result in non-zero microseconds. In unix the internal format for time (struct timeval) is two integers: seconds and microseconds. It would be great to have something like that in cftime...

@jswhit
Copy link
Collaborator

jswhit commented Nov 21, 2019

cftime time uses julian day/fractional julian day for it's arithmetic - so it's subject to usual issues of floating point precision/comparison. Probably a better way to compare dates given a specified tolerance would be something like this:

import numpy as np
import cftime as nc4
iarr=np.arange(86400, dtype=np.int32)
units='hours since 2018-01-01 00:00:00 UTC'
dates1 = nc4.num2date(iarr,units)[-5:]
dates2 = nc4.num2date(iarr[-5:],units)
datediff = [diff.microseconds for diff in dates1-dates2]
print(np.abs(datediff) < 100) # tolerance of 100 microseconds

[ True  True  True  True  True]

@rkouznetsov
Copy link
Author

Thank you for the workaround! I have already implemented something like this, but such solutions in my applications tend to strike back after some time... So, please, keep the comparison. It works perfectly well.

If floating-point julian day is something that is not easy to fix, probably the dates produced by num2date could be rounded internally to, say, nearest millisecond? Then a millisecond accuracy could be declared as a feature of num2date. Makes sense?

@jswhit
Copy link
Collaborator

jswhit commented Nov 21, 2019

This is a very complicated issue that has been discussed often - there are no easy solutions. Rounding to the nearest millisecond will not give you millisecond accuracy, since a 64 bit julian day is only accurate to about O(10) milliseconds. There are certainly more accurate algorithms out there, but none that I have found so far that work will all the calendars available in the CF metadata standard.

@rkouznetsov
Copy link
Author

Thanks! Indeed, climatologists are quite creative in inventing oteher-planet calendars, and there even a few-days per year accuracy ("360_day" calendar) would be tolerable by design. But for calendars like proleptic_georgian accuracy does matter. Actually, I would be fine even with one-second accuracy, as long as it results in exact values when it is possible.

The issue in the example above is that num2date returns different values of datetime depending on the length of the input array it is fed with. The input values are exactly the same...

With the latest cftime even the returned objects are different.

In [1]: import numpy as np
   ...: import cftime as nc4
   ...: iarr=np.arange(86400, dtype=np.int32)
   ...: units='hours since 2018-01-01 00:00:00 UTC'
   ...: dates1 = nc4.num2date(iarr,units)[-5:]
   ...: dates2 = nc4.num2date(iarr[-5:],units)
   ...: print dates1 == dates2
   ...: 
[ True False  True  True False]

In [2]: dates1
Out[2]: 
array([real_datetime(2027, 11, 9, 19, 0),
       real_datetime(2027, 11, 9, 20, 0, 0, 13),
       real_datetime(2027, 11, 9, 21, 0),
       real_datetime(2027, 11, 9, 22, 0),
       real_datetime(2027, 11, 9, 23, 0, 0, 13)], dtype=object)

In [3]: dates2
Out[3]: 
array([datetime.datetime(2027, 11, 9, 19, 0),
       datetime.datetime(2027, 11, 9, 20, 0),
       datetime.datetime(2027, 11, 9, 21, 0),
       datetime.datetime(2027, 11, 9, 22, 0),
       datetime.datetime(2027, 11, 9, 23, 0)], dtype=object)

In [4]: nc4.__version__
Out[4]: '1.0.4.2'

I wonder if it is desirable behaviour...

@jswhit
Copy link
Collaborator

jswhit commented Nov 22, 2019

That is caused by this

# round to nearest second if within ms_eps microseconds

which was intended to fix another problem (#78).

If you comment out the if-block here

if indxms.any():

you will see that all the dates are the same, but the calendar formatting may be messed up because the date will not land exactly on midnight.

The question is - which behaviour is less desirable?

@jswhit
Copy link
Collaborator

jswhit commented Nov 22, 2019

Sorry, I was mistaken - the cause is actually the fact that python datetime calculations are used in one instance and cftime calculations in the other. If you use only_use_cftime_datetimes=True in num2date, you will get the same answer, including the extra 13 ms. The python datetime module is used for the calculations whenever possible. Since the python datetime module only allows for positive times, the first calculation cannot use it. The logic is here

elif postimes and ((calendar == 'proleptic_gregorian' and basedate.year >= MINYEAR) or \

The issue with negative times was discovered here

Unidata/netcdf4-python#659

Arguably, only_use_cftime_datetimes=True should be the default to avoid surprises like this.

@rkouznetsov
Copy link
Author

Looks like a mess... Certainly, returning different objects depending on subtle differences in the input is confusing. Having two time-handling libraries, one precise, but limited in time, another imprecise, but more flexible and universal, i would prefer to explicitly force use of one or another depending on my application, may be even with different functoins. In the current implementation one can force cftime, but, unfortunately, one can not force using datetime. That seems to be a missing feature.

Automatic choice and/or fallbacks is a way to an unexpected behaviour, so if datetime is forced, an exception should be generated.

Makes sense?

@rkouznetsov
Copy link
Author

rkouznetsov commented Nov 23, 2019

Since the python datetime module only allows for positive times, the first calculation cannot use it.

That is something I did not understand. The check for postimes checks for positive interval rather than positive time. The datetime module is perfectly capable of using negative intervals (timedeltas) or multiply positive intervals with negative numbers.

The reason for this check was, probably, to guard against getting into before the datetime's MINYEAR with negative offsets.

@jswhit
Copy link
Collaborator

jswhit commented Nov 23, 2019

Yes, that was the reason.

@rkouznetsov
Copy link
Author

Thank you! Just coming back to the issue. Is there a way (or can it be implemented?) to force num2date to return datetime.datetime objects?

@jswhit
Copy link
Collaborator

jswhit commented Dec 24, 2019

If use_only_cftime_datetimes=False, python datetime instances will be returned where possible (but not for negative times - extra logic would need to be added check datetime's MINYEAR for that to work).

@rkouznetsov
Copy link
Author

Thank you! Indeed we regularly deal with negative times. I seek for much simpler option to get datetime instance where possible, and an exception when resulting date does not fit the datetime range. Not much logic is needed for that.

@jswhit
Copy link
Collaborator

jswhit commented Dec 28, 2019

Are you looking for a use_only_python_datetimes option for num2date and date2num that returns an exception of calendar != propleptic_gregorian and always returns python datetime instances?

Or, perhaps instead of adding a new kwarg, we could use use_only_cftimes=None.

@rkouznetsov
Copy link
Author

Yep. That's what i was after. I guess, it should also work for calendar == 'standard'.

date2num is of no issue for cftime vs datetime choice. Currently it makes double, and np.round works fine for the range of datetime if integer output needed.

While it is possible to instruct num2date what is the needed return type, i wonder if there is any way for num2date to consistently guess what is the right object to return. Consider processing a bunch of files with different times. If the intelligent guess based on time values is implemented, it is perfectly possible to imagine a set of files that would result in different time objects. Such inconsistency might lead to an unexpected behaviour that is quite difficult to debug.

@jswhit
Copy link
Collaborator

jswhit commented Dec 30, 2019

We don't want to have different objects returned depending on the input - as you suggest, that leads to surprises and hard to debug errors. I can add have num2date return python datetime objects if use_only_python_datetimes=True as long as calendar=proleptic_gregorian or calendar=standard and the reference date is after 1582.

@rkouznetsov
Copy link
Author

Great! That should be fine for my applications.
I guess, gregorian should be fine even before 1582 (if someone needs it). In any case, i would check resulting values after conversion rather than just the reference date..

@jswhit
Copy link
Collaborator

jswhit commented Dec 31, 2019

@rkouznetsov - could you test PR #139 (branch issue134) to make sure it does what you expect?

@jswhit2
Copy link
Contributor

jswhit2 commented Jan 31, 2020

@rkouznetsov - have you had a chance to try PR #139 (branch issue134) yet?

@rkouznetsov
Copy link
Author

Yep. Sorry. Got a bit confused with github interface. Please see my comment at #139

@jswhit
Copy link
Collaborator

jswhit commented Feb 4, 2020

PR #139 merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants