Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MaskedQuantity breaking np.nanmean etc. methods in aggregate_downsample #12435

Closed
Tracked by #12430
dhomeier opened this issue Nov 7, 2021 · 18 comments
Closed
Tracked by #12430
Assignees
Milestone

Comments

@dhomeier
Copy link
Contributor

dhomeier commented Nov 7, 2021

Description

A downstream issue lightkurve/lightkurve#1157 appears to have been triggered by the use of Masked objects, probably following #11127. The same code that returned a timeseries subclass with Quantity columns, "masked" with np.nan where applicable, with astropy 5.0rc1 is creating columns of MaskedQuantity. This is breaking any code using functions like np.nanmean that are currently not supported.

Expected behaviour

MaskedQuantity should either fully support required array functions or not be used where still unsupported functions are required.

Actual behaviour

For a more minimal example that can also be constructed in 4.3 (i.e. the aggregate_downsample failure is independently from its updates in 5.0) – note that the array functions do work for np.ma.array, but the NaNs don't seem to be correctly caught in the downsampling:

Steps to Reproduce

import numpy as np
from astropy import units as u
from astropy.time import Time
from astropy.timeseries import Timeseries, aggregate_downsample
from astropy.utils.masked import Masked

ma = np.ma.array([[1, 2, 13, 248, np.nan, np.nan, 303, 12, 21]], mask=np.array([0, 0, 0, 1, 0, 1, 1, 0, 0])).T
np.nanmean(ma)
9.8

np.nanmean(ma.data)
85.71428571428571

mq = Masked(ma) * u.m
np.nanmean(qa)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in nanmean
TypeError: no implementation found for 'numpy.nanmean' on types that implement __array_function__: [<class 'astropy.utils.masked.core.MaskedQuantity'>]

tsa = TimeSeries(time=Time(np.arange(9)+4e4, format='mjd'), data=ma)
tsq = TimeSeries(time=Time(np.arange(9)+4e4, format='mjd'), data=mq)
aggregate_downsample(tsa, time_bin_size=3*u.d)
<BinnedTimeSeries length=3>
time_bin_start time_bin_size        col0      
                     s                        
    object        float64         float64     
-------------- ------------- -----------------
       40000.0      259200.0 5.333333333333333
40002.99999991      259200.0               nan
40005.99999982      259200.0              16.5

aggregate_downsample(tsq, time_bin_size=3*u.d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/lib/python3.10/site-packages/astropy/timeseries/downsample.py", line 127, in aggregate_downsample
    data[unique_indices] = u.Quantity(reduceat(values.value, groups, aggregate_func),
  File "/opt/lib/python3.10/site-packages/astropy/timeseries/downsample.py", line 28, in reduceat
    result.append(function(array[indices[i]:indices[i+1]]))
  File "<__array_function__ internals>", line 5, in nanmean
TypeError: no implementation found for 'numpy.nanmean' on types that implement __array_function__: [<class 'astropy.utils.masked.core.MaskedNDArray'>]

I am not sure how exactly the timeseries subclass in the downstream issue is instantiated, but the change to MaskedQuantity occurred with 5.0rc1 without any other code changes (@barentsen may have more background on the read-in details). So the code should probably be more conservative in using Masked objects for now, or functions like aggregate_downsample will need to escape those methods applying them either to the unmasked instances or their numpy content.

np.nanmean(np.ma.array(mq.value.data, mask=mq.mask))
9.8

Better still would of course be to directly support those functions. A naïve attempt to just add them to function_helpers.MASKED_SAFE_FUNCTIONS actually (sort of) worked for np.nanmedian, but np.nanmean, np.nansum, np.nanprod` then fail with

File "/opt/lib/python3.10/site-packages/numpy/lib/nanfunctions.py", line 937, in nanmean
    arr, mask = _replace_nan(a, 0)
  File "/opt/lib/python3.10/site-packages/numpy/lib/nanfunctions.py", line 108, in _replace_nan
    np.copyto(a, val, where=mask)
  File "<__array_function__ internals>", line 5, in copyto
TypeError: no implementation found for 'numpy.copyto' on types that implement __array_function__: [<class 'astropy.utils.masked.core.MaskedNDArray'>]

so they will probably need a dedicated implementation – analogously to MaskedIterator.mean?

System Details

Python: 3.8.12 / 3.9.8 / 3.10.0
Platform: macOS-10.14.6-x86_64-i386-64bit
Astropy: 5.0rc1
Numpy: 1.21.2
Scipy: 1.7.1
Matplotlib: 3.5.0rc1

@mhvk
Copy link
Contributor

mhvk commented Nov 7, 2021

@dhomeier - thanks, this example definitely proves we should support nanmean etc for Masked! Note that what to do with those is actually the first item in the meta follow-up issue #11539. For float data, it seems best would be to simply treat all masked items as nan, but what to do for int, etc.? Maybe a logical option is to let the functions imply conversion to float...

Another question is what should happen with all-NaN columns. I guess most logical is to just mask those.

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 7, 2021

I thought one could just apply the nanfunctions to the numerical content (as in the np.ma.array(mq.value.data, mask=mq.mask) above) and then restore the NDData, Quantity etc. from the result. But I have seen you made some distinctly different implementation choices for some of the other functions on MaskedQuantity, so a consistent pattern would perhaps still to be sorted out. Interestingly, with the simple addition of np.nanmedian I also found some different behaviour from the operation on np.ma.array.
dtype='int' ndarrays don't accept np.nan, for those one might just fall back on the non-nan versions?

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 7, 2021

Another question is what should happen with all-NaN columns. I guess most logical is to just mask those.

What output would one expect for them? Technically it's a function on an empty sequence/array either way; np.nanfunctions generally are consistent in that, only for the all-masked case it looks a bit off:

# all-NaN, should be 0.0 / 0
np.nanmean(ma[4:6])
<stdin>:1: RuntimeWarning: Mean of empty slice
nan

# all masked
np.nanmean(ma[5:7])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in nanmean
  File "/opt/lib/python3.10/site-packages/numpy/lib/nanfunctions.py", line 950, in nanmean
    avg = _divide_by_count(tot, cnt, out=out)
  File "/opt/lib/python3.10/site-packages/numpy/lib/nanfunctions.py", line 212, in _divide_by_count
    return np.divide(a, b, out=a, casting='unsafe')
ValueError: output array is read-only

@mhvk
Copy link
Contributor

mhvk commented Nov 8, 2021

@dhomeier - I tried to think back a little more what the problem was, and remembered that I had a half-finished implementation. For the inexact types, this is easy: Just fill all masked values with NaN, run np.nan<whatever> and then create a new Masked instance with NaN values masked.

The trick was the integer and other non-exact dtypes. As you noted, the implementation for Masked and numpy's MaskedArray are not the same: in particular, I really did not want to continue what I felt was a big mistake: to have things like ma.sum() ignore masked elements - since this implies a choice of setting those elements to 0 (the sum becomes a meaningless quantity unless you also know the number of elements; see https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray).

But for the nanfunctions, that would not be correct. For instance, the docstring of nansum states that "Return the sum ... treating Not a Numbers (NaNs) as zero.", so if a masked value is treated as NaN, then the outcome is prescribed. I thought this was good in that it also gave users the option to get the Masked = 0 case if they wanted. But then I got unsure as I was implementing it, and also failed to write the test cases (most of the work, really...).

Anyway, I think your issue means this should just get done... I've started updating the old PR I had, but I'm still getting errors...

Also, I am now seeing that np.nansum never returns NaN, not even for all-NaN arrays, so I guess it should not return Masked arrays either. Must admit I find this somewhat inconsistent...

p.s. Note that the nanfunctions do not properly support np.ma.MaskedArray:

In [6]: np.nanmedian(np.ma.array([1., 2., 3., 4., 5.], mask=[True, True, True, False, False]))
/usr/lib/python3/dist-packages/numpy/core/fromnumeric.py:748: UserWarning: Warning: 'partition' will ignore the 'mask' of the MaskedArray.
  a.partition(kth, axis=axis, kind=kind, order=order)
Out[6]: masked

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 8, 2021

Yes, the way they are ignoring masked elements or not does not make much sense to me.
np.nanmedian in particular seems completely erratic. Note that using the Masked version by simply adding it to MASKED_SAFE_FUNCTIONS looks very sensible!

>>> np.nanmedian(np.ma.array([1, 2, 3, 100, np.nan, np.nan], mask=[0, 0, 0, 1, 1, 0]))
/Users/derek/opt/lib/python3.10/site-packages/numpy/lib/function_base.py:3685: UserWarning: Warning: 'partition' will ignore the 'mask' of the MaskedArray.
  part.partition(kth)
3.0
>>> np.nanmedian(np.ma.array([1, 2, 3, 100, np.nan, np.nan, 200], mask=[0, 0, 0, 1, 1, 0, 0]))
nan
>>> np.nanmedian(np.ma.array([1, 2, 3, 100, np.nan, np.nan, 200], mask=[0, 0, 0, 1, 1, 1, 0]))
/Users/derek/opt/lib/python3.10/site-packages/numpy/core/fromnumeric.py:755: UserWarning: Warning: 'partition' will ignore the 'mask' of the MaskedArray.
  a.partition(kth, axis=axis, kind=kind, order=order)
nan
>>> np.nanmedian(Masked(np.ma.array([1, 2, 3, 100, np.nan, np.nan, 200], mask=[0, 0, 0, 1, 1, 1, 0])))
MaskedNDArray(2.5)
>>> np.nanmedian(Masked(np.ma.array([1, 2, 3, 100, np.nan, np.nan, 200], mask=[0, 0, 0, 1, 1, 1, 1])))
MaskedNDArray(2.)

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 8, 2021

Also, I am now seeing that np.nansum never returns NaN, not even for all-NaN arrays, so I guess it should not return Masked arrays either. Must admit I find this somewhat inconsistent...

I understand that as sum([]) just as with all-masked arrays, so returning 0.0 looks logical to me. Likewise for nanmean, which would then turn into 0.0 / len([]) = np.nan.

@dhomeier dhomeier mentioned this issue Nov 8, 2021
6 tasks
@astrofrog
Copy link
Member

@mhvk @dhomeier it sounds like fixing this might be a pretty big task so how do we want to approach this with respect to the 5.0 release? Is there a quick temporary fix we can do to minimise breakage to other packages?

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 9, 2021

I have not tracked down exactly how the timeseries is initialised here, but I think the critical step is the change in reading in the QTable, which previously had the columns in question as NaN-valued Quantities:

>>> qt = table.QTable.read('tess2019128220341-0000000410458113-0016-s_lc.fits')
WARNING: dropping mask in Quantity column 'PSF_CENTR2_ERR': masked Quantity not supported [astropy.table.table]
>>> qt[:3]['PSF_CENTR1_ERR']
<Quantity [nan, nan, nan] pix>

whereas in 5.0x this becomes

<MaskedColumn name='PSF_CENTR1_ERR' dtype='float32' unit='pix' format='{:14.7e}' length=3>
--
--
--
>>> qt[:3]['PSF_CENTR1_ERR'].data
masked_array(data=[--, --, --],
             mask=[ True,  True,  True],
       fill_value=1e+20,
            dtype=float32)

Now the original file of course has no record of a mask, but simply

>>> hdulist[1].data[:3]['PSF_CENTR1_ERR']
array([nan, nan, nan], dtype=float32)

so it feels to me overly imposing to automatically convert the values to masked elements. My suggestion is to provide some option mask_nan for reading in a table, and even defaulting to False at least until Masked has full nanfunctions support.

@mhvk
Copy link
Contributor

mhvk commented Nov 9, 2021

I'll try to push my fix for the nanfunctions later today - it was mostly done...

@mhvk
Copy link
Contributor

mhvk commented Nov 9, 2021

@dhomeier - the FITS standard is to use NaN for masked elements, so the idea was that we finally lived up to the standard!

@astrofrog
Copy link
Member

@mhvk - thanks!

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 9, 2021

Sorry, I had not checked that part of the standard. As it was already treated this way in non-Q Tables, it's certainly a step forward to enable this for QTable in the same way; if the full functionality is around the corner (and better than in numpy.ma!), all the better.
In case there are any remaining problems, a convenience function to convert the entire [Q]Table back as tab.unmasked could be helpful – basically a method for
QTable([getattr(col, 'unmasked', col) for col in masked_table.itercols()])

@mhvk
Copy link
Contributor

mhvk commented Nov 9, 2021

@dhomeier - indeed, it will be good if all types of tables behave the same way! And no surprise you would have missed that; there really should be a what's new entry about it (see #11914 (comment)...). Of course, that hopefully will mean that nanmean & co are simply not needed any more!

But in the meantime, hopefully #12454 will fix the regression...

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 9, 2021

Of course, that hopefully will mean that nanmean & co are simply not needed any more!

@mhvk that's actually a good workaround for now, to replace them with their standard counterparts assuming that all NaN values would be masked anyway!
So putting in a

rmse_func = lambda x: np.sum(x) if hasattr(x, 'mask') and np.all(np.isfinite(x)) else np.nansum(x) if np.any(np.isfinite(x)) else np.nan

in LightCurve.bin() fixes this issue in lightkurve for now.

@astrofrog
Copy link
Member

@dhomeier - i think #12454 does fix the initial issue you ran into FYI

@dhomeier
Copy link
Contributor Author

dhomeier commented Nov 9, 2021

But in the meantime, hopefully #12454 will fix the regression...

Missed again that you already had the PR in. It (or the 5.0.x backport) does fix the lightkurve regression, thanks a lot. Only thing I noted is that the returned columns from aggregate_downsample are regular Quantity columns now, not masked. But that might make sense as it's not obvious how the mask itself would be downsampled.

@mhvk
Copy link
Contributor

mhvk commented Nov 9, 2021

Great to hear that the issue is fixed, and even better that with the new MaskedQuantity, the whole checking for finite no longer is necessary!!

@mhvk
Copy link
Contributor

mhvk commented Nov 10, 2021

I guess we should close this issue - while I'm indeed not completely sure the present behaviour is entirely consistent, that is better as a separate discussion. At least this is fixed!

@mhvk mhvk closed this as completed Nov 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants