Using Masked inside Time #15231

mhvk · 2023-08-26T13:20:18Z

To do:

Make mask read-only (at least as gotten from .mask, perhaps also internally).
Add configuration option for having np.ma.MaskedArray always on output.
Add documentation (change-log entry, what's new, documentation about config)

@taldcroft - when I introduced Masked, I had hoped to soon use it to have masked SkyCoord, etc. Of course, things never go very fast. But even at the time I had made a start by seeing how things would work for Time, where it would replace using np.nan for jd2. This PR is to actually do that (as draft for now). This is triggered in part by #15230 which led back to #6509, a request to have an option for "NaT", ie., a marker for a bad time (as opposed to a masked one).

I remember vaguely that at the time you were rather hesitant to go this route, but do not really remember why, except that it probably involved backward compatibility! The approach taken here is that if Time gets np.ma.MaskedArray as input, that's also what it will give as output (reusing the _shaped_like_input method as before), otherwise it will be Masked. I think this is a reasonable option, though I think we should add a configuration option for those who want to just set a given type. If you think that makes sense, I'll add that.

Note that the PR consists of 4 commits all of which are working states, so perhaps useful to review independently:

Move to use Masked everywhere. Conversion upon inputs, all output as Masked. EDIT: also a small bug fix in fitstime, uncovered because now jd2 is no longer nan if an element is masked.
Since formats do not really have to know about masking any more, move the mask stuff to Time proper.
For outputs, use the masked class used for the inputs. Note: this is incomplete: if one does t2 = t-t[0], then t2 just uses Masked again. Probably should take _masked_cls from the left-most argument or so. Or maybe this should be an attribute on info and let that do the merging?!
Extra tests for all formats and a fix for a bug found as a result (can mix in 1 and 3 if desired).

One API change is that the mask is now writeable if masked values were passed in. This is consistent with how masked arrays generally work (but could be changed). One consequence of that is that the masked property can no longer be cached. (Obviously, this could be changed. Since the mask is always a copy of that from the inputs, we could write-protect it.)

Out of scope here:

Solving NaN equivalent for astropy.time.Time #6509 - though if we use Masked internally, we become free to use np.nan to indicate "NaT";
Allowing input of masked data where the underlying values do not get overridden (right now, one would have to initialize t = Time(val.unmasked, ...) and then set the mask after, with t[val.mask] = np.ma.masked).
Might be nice for Masked(times, mask=...) to work too.

Small trial:

from astropy.time import Time
from astropy.utils.masked import Masked
t = Time(Masked([50000., 50001, 50002], mask=[False, True, False]), format='mjd')
t
# <Time object: scale='utc' format='mjd' value=[50000.    ——— 50002.]>
t.isot
# MaskedNDArray(['1995-10-10T00:00:00.000',                       ———,
#                '1995-10-12T00:00:00.000'], dtype='<U23')
(t-t[0]).to('s')
# <MaskedQuantity [     0.,     ———, 172800.] s>

from astropy.table import QTable
QTable([t])
 
<QTable length=3>
  col0 
  Time 
-------
50000.0
    ———
50002.0

By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions · 2023-08-26T13:20:53Z

github-actions · 2023-08-26T13:20:57Z

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

taldcroft · 2023-08-27T10:52:22Z

@mhvk - you got it that my big concern would be about breaking existing code. The minimal impact to existing astropy tests is a good start, but some quick questions without really looking at the code yet:

You mentioned that the mask is writeable, so would that cause problems for the caching of format conversions? (e.g. tm.iso is cached).
To what extent is your MaskedNDArray API-compatible with the current numpy MaskedArray? What is the potential for incompatibility there?
What about speed and memory performance changes?
Does this fix TypeError when comparing TimeDelta and a nan Quantity #15230 ?

mhvk · 2023-08-27T11:24:54Z

@mhvk - you got it that my big concern would be about breaking existing code. The minimal impact to existing astropy tests is a good start, but some quick questions without really looking at the code yet:
* You mentioned that the `mask` is writeable, so would that cause problems for the caching of format conversions? (e.g. `tm.iso` is cached).

I do indeed worry about that, and have not checked. I do not have a strong opinion on whether the mask should be writeable or not, so we can stick more to the Time-is-almost-immutable case that we have and make the mask itself (or perpaps what is returned by .mask) read-only again. Indeed, I just realized that since the mask is shared between jd1 and jd2, making it read-only might give some extra protection against them somehow getting out of sync.

* To what extent is your `MaskedNDArray` API-compatible with the current numpy `MaskedArray`? What is the potential for incompatibility there?

See https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray

I think that if we implement the config item I suggested that enforces np.ma.MaskedArray output, users of Time can ensure there are no differences whatsoever.

* What about speed and memory performance changes?

I would expect a (slight) degradation in both cases, since there now is some overhead in propagating and storing the mask (latter only 1 byte for 16 bytes of data, though, since mask is shared). Given that actual time calculations are expensive, I think the time overhead will be minimal. Would be good if we still had functioning benchmarks! (astropy-benchmarks seems pretty defunct, and sadly astropy.time never made it on them). Of course, this only concerns the case that Time is actually masked!

* Does this fix [`TypeError` when comparing TimeDelta and a nan Quantity #15230](https://github.com/astropy/astropy/issues/15230) ?

Not directly. But it will enable using np.nan as a marker for NaT (though I think we could fix #15230 also by simply allowing TimeDelta to have non-finite numbers; unlike for Time, that might break just about nothing).

Note that the larger goal I still have in mind here is to add masking to SkyCoord (and everything below it), and potentially even modelling. Since SkyCoord uses Quantity, I will have to use Masked there. A bit similarly, I have been working on Distribution recently, to allow it to be used in representations, etc., and thus get "automatic" error propagation also for coordinates. To get that to work for Time needs an overhaul very similar to what I did here, so that jd1 and jd2 are no longer required to be base ndarray.

Anyway, am happy to make changes, but would like to be sure you think this is OK at least in principle...

taldcroft · 2023-08-27T12:20:42Z

@mhvk - in principle I am fine going ahead. In part this is because I think SkyCoord really needs masking and so doing things in a unified way is the right approach. The nan approach for Time was always meant as a "good-enough" approach in lieu of another mature solution. Since Masked has been out for some time now it should be solid enough.

Thanks for the work!

mhvk · 2023-08-27T14:34:37Z

OK, sounds good. I add two checks on top for things I should implement.

mhvk · 2023-10-05T12:55:14Z

@taldcroft - I pushed an update to this effort to use Masked inside Time, which includes updates to the documentation (plus changelog entry and what's-new), adds a configuration option to get np.ma.MaskedArray output, and makes the .mask read-only.

Compared to what I had before, the main change is that I kept the option simple: choose either Masked or np.ma.MaskedArray output, i.e., I dropped my original idea that the output should by default depend on what class was used for input. This mostly because propagating that information through operations seemed needlessly complex, but also just to "refuse the temptation to guess".

This is fairly complete, but there are a few related things I'm wondering about; would be great to get your opinion!

Should .masked now just depend on whether the internal arrays are Masked, rather than on whether there is any element masked? When using np.nan we really didn't have much choice, but always having a mask when the data is Masked might help avoid surprises, that, e.g., one can only do time.jd.filled(np.nan) if any element is masked, even if one passed Masked input (but perhaps had no masked elements).
Related: should it be possible make Time masked? E.g., by setting time.masked = True. Or, perhaps nicer, by making Masked(time, mask=...) work? The latter may need a bit more thinking, but should be do-able (normally, one would have separate MaskedTime and Time classes, but I don't think it is best here and it should be possible to avoid it).
Related too: should Time gain a .filled() method and/or an unmasked property? (Latter could also be done by time.masked = False.)

Here, all could be done as follow-up, but I think at least the first point would be best addressed before 6.0, since otherwise people get another API change in 6.1. The other items just add features, so could be done in a later version too (but good to think through a bit what we want, since it would affect SkyCoord as well).

mhvk · 2023-10-05T15:15:49Z

p.s. I checked, and it turns out to be fairly trivial to get Masked(time, mask=...) and to even have a MaskedTime = Masked(Time) which takes a mask argument. One could imagine that this MaskedTime has the unmasked property and filled() method, while regular Time can have a mask, but does not have those "extras". In that case, we could leave the behaviour of Time.masked exactly as it is now, since one can do ifinstance(possibly_masked_time, Masked).

p.s.2 The scheme I used to create MaskedTime would work trivially on other classes like SkyCoord!

taldcroft · 2023-10-05T17:11:46Z

This makes me think about a big picture question and overall API consistency.

The current Time class can be masked and that is indicated by the masked attribute. Because Time is a container class it has the option of converting to a "masked" object in-place, and presumably the same would apply to SkyCoord. However, Column and Quantity are different and for those they require a distinct Masked... version.

If there will be a MaskedTime which is slight different from Time with masked JD values, then would a table join with missing values create a Time or a MaskedTime? Would one choice be preferred?

Writing it out like this makes me think that we should adhere to "one and only one way" to doing masking for each class at the expense of inconsistency in how masking is done.

That would imply answers to your questions (of course up for more discussion):

Should .masked now just depend on whether the internal arrays are Masked? Yes.
Should Masked(time, mask=...) work? Whether or not it works, astropy core should not document that as part of core astropy nor define or use a MaskedTime class anywhere.
Should Time gain a .filled() method and/or an unmasked property? Yes, those are both well defined even for an object that is not masked so the API can be consistent.

mhvk · 2023-10-07T09:17:53Z

Thanks, that's super helpful! I agree with your logic. Shall I add .filled() and unmasked to this PR, then?

One further question: it would seem sensible to allow setting Time.masked -- otherwise, to make an instance masked without setting any masked elements would require something a bit silly like doing t[:0] = np.ma.masked. Setting to False would then be equivalent to doing .unmasked inplace.

As for working with Masked, I made a few trials and it is possible to recognize Time as a Masked instance if it is in fact Masked, which I think will be useful. The MaskedTime class is easily made into a pure factory class, which just adds a mask argument to Time, but produces a Time instance (for which instance(t, MaskedTime) will work as long as t.masked holds. But that is for another PR -- I think much of the code can live on a new MaskableShapedLikeNDarray class, which can then be used by SkyCoord, etc., as well.

taldcroft · 2023-10-07T10:55:37Z

@mhvk - About the .masked attribute. Seeing your question makes me think again and wonder if there will be any internal or downstream issues with the subtle API change in what .masked implies (has some masked items vs might have masked items). I took a quick look through astropy time and table and nothing jumped out.

In any case, I suppose that .masked could be settable, but the more important question is whether that should ever be necessary. I would expect that doing tm[2] = np.ma.masked for an unmasked Time would auto-convert the _time object to Masked and that would be it. In other words why would you want to set .masked without actually masking any item?

Conceptually this gets to the original implementation and analogy with Pandas that from the user perspective there doesn't need to be much (if any) distinction between masked or unmasked. This is the benefit of having a container class (or using sentinel values).

In the other direction, doing tm.masked = False will have the consequence of unmasking values in-place, which might be surprising to some users. The explicit way to write this would be tm = tm.unmasked.

Upshot: I'm on the fence about whether to allow setting .masked, but either way I don't think we should document this is a feature of the API.

mhvk · 2023-10-07T20:50:18Z

OK, why don't I leave settable masked out of this PR -- it is something that can be added later anyway.

mhvk · 2023-10-07T20:59:22Z

OK, I pushed the further commits that make time.masked reflect whether or not the internal jd1 is Masked.

mhvk · 2023-10-07T21:00:35Z

Note that the test on scalars uncovered a small bug in Masked (bca63dc) - I'll make a separate PR with that, since that probably should still be backported.

taldcroft · 2023-10-19T13:08:23Z

Here is where we come back to the fundamental issue of masked vs missing. This is surprising to me because of the implication that the data are valid but simply masked:

>>> dates = Masked(["2000-01-01", "2003-12-30"], mask=[True, False])
>>> print(Time(dates, format="iso", out_subfmt="date").unmasked)
['2003-12-30' '2003-12-30']

Most people would probably expect ['2000-01-01', '2003-12-30']. But of course the code should also work for the example of missing or bad data (and indeed it does work but with a result that might surprise some people):

>>> dates = Masked(["", "2003-12-30"], mask=[True, False])
>>> print(Time(dates, format="iso", out_subfmt="date").unmasked)
['2003-12-30' '2003-12-30']

I wonder if we could address this ambiguity with this:

Define a cached property for each format class that provides a default value which is equivalent to "2000-01-01".
Use that default value consistently to replace any masked / missing values.
Document that new property and in particular make it clear for Time that any input masked values are always replaced with the default value.

Note that these issues are not necessarily new with this PR, but are somehow more obvious to me now.

mhvk · 2023-10-26T17:58:46Z

OK, the final push worked locally at least...

This replaces the use of nan for jd2 with Masked arrays for jd1 and jd2. For now, the mask is not writeable if nothing Masked was input, since we do not want to change the nature of jd1 and jd2 without explicit input. Along the way, this uncovered a bug in how masked elements were treated in fitstime. The old implementation merged jd1 and jd2 together using np.stack, which removes any mask. But because jd2 was nan for any element anyway, the element still was recognized as masked. Now it is treated properly.

Moving the cache to Time itself seems more logical, since apart from the mask, all the handling of cache state is done in Time. Indeed, it was on Time before setting of Time elements was introduced in astropygh-6028. This just moves it back now that the mask handling is much simpler. One resulting change is that any cached format information still needs be deleted when changing format, since it can be out of date.

Also check that the mask is shared internally between jd1 and jd2, but not with any inputs or outputs.

taldcroft

Looks good, approved!!

mhvk added time utils.masked labels Aug 26, 2023

mhvk added this to the v6.0 milestone Aug 26, 2023

mhvk mentioned this pull request Aug 26, 2023

NaN equivalent for astropy.time.Time #6509

Open

mhvk force-pushed the masked-time-ma-optional branch 3 times, most recently from 65ebd26 to 8522edb Compare August 27, 2023 09:30

mhvk force-pushed the masked-time-ma-optional branch from 8522edb to 2b47bcf Compare October 5, 2023 12:15

mhvk marked this pull request as ready for review October 7, 2023 20:59

mhvk requested review from taldcroft and saimn as code owners October 7, 2023 20:59

mhvk mentioned this pull request Oct 7, 2023

Ensure mask calculation is correct also for multi-op ufunc with scalars #15450

Merged

1 task

mhvk force-pushed the masked-time-ma-optional branch from 33162b2 to 8c6b23b Compare October 7, 2023 23:46

mhvk mentioned this pull request Oct 7, 2023

Ensure Masked scalars print the same way as regular array scalars #15451

Merged

1 task

mhvk force-pushed the masked-time-ma-optional branch from 8c6b23b to 4376e66 Compare October 8, 2023 00:23

mhvk and others added 22 commits October 26, 2023 19:03

Test all formats and fix bug for ymdhms

bb67cba

Ensure the mask from the mask attribute is always read-only.

9e58a26

Also check that the mask is shared internally between jd1 and jd2, but not with any inputs or outputs.

Allow unmasked by assigning np.ma.nomask

55cb8f7

Allow the output masked class to be configurable.

81887a3

Changelog and whatsnew entries, updates to the docs

d7f96aa

Fix case where format already masks some entries

609df0c

Let Time.masked reflect whether _time.jd1 is Masked

a9b2e42

Add unmasked property and filled method

d6326cd

Fix TimeDeltaQuantityString for Masked

0708d7a

Add a fill_value class method to TimeFormat

025c7f3

New strategy for Time masking / filling

3cf845d

Add class method to fill masked values

d154975

Use in_subfmt for determining fill_value

bb08f68

Update narrative docs

728d59b

Change default fill value to J2000.0 (jd=2451545.0)

25bd233

Ensure we give each format the original values

4e898c5

Ensure that unmasked works also with an array-valued location

e3574c1

Make cache depend on masked_array_type

5f21301

Give explicit link to astropy.utils.masked package in whatsnew

afe9a30

Move fill value discussion to warning and clarify scale

de5736b

mhvk force-pushed the masked-time-ma-optional branch from 696b60e to de5736b Compare October 26, 2023 23:03

taldcroft approved these changes Oct 27, 2023

View reviewed changes

taldcroft mentioned this pull request Oct 27, 2023

Ensure cache coherence for views of time instances #15453

Merged

1 task

mhvk merged commit f2cd545 into astropy:main Oct 27, 2023
27 checks passed

mhvk deleted the masked-time-ma-optional branch October 27, 2023 13:01

mhvk mentioned this pull request Jan 11, 2024

Meta issue for Masked class follow-up #11539

Open

8 tasks

neutrinoceros mentioned this pull request May 2, 2024

TypeError on write when stacking an empty table with a table that has a Time column #16370

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using Masked inside Time #15231

Using Masked inside Time #15231

mhvk commented Aug 26, 2023 •

edited

github-actions bot commented Aug 26, 2023 •

edited by taldcroft

github-actions bot commented Aug 26, 2023

taldcroft commented Aug 27, 2023

mhvk commented Aug 27, 2023

taldcroft commented Aug 27, 2023

mhvk commented Aug 27, 2023

mhvk commented Oct 5, 2023

mhvk commented Oct 5, 2023

taldcroft commented Oct 5, 2023

mhvk commented Oct 7, 2023

taldcroft commented Oct 7, 2023

mhvk commented Oct 7, 2023

mhvk commented Oct 7, 2023

mhvk commented Oct 7, 2023

taldcroft commented Oct 19, 2023

mhvk commented Oct 26, 2023

taldcroft left a comment

Using Masked inside Time #15231

Using Masked inside Time #15231

Conversation

mhvk commented Aug 26, 2023 • edited

github-actions bot commented Aug 26, 2023 • edited by taldcroft

github-actions bot commented Aug 26, 2023

taldcroft commented Aug 27, 2023

mhvk commented Aug 27, 2023

taldcroft commented Aug 27, 2023

mhvk commented Aug 27, 2023

mhvk commented Oct 5, 2023

mhvk commented Oct 5, 2023

taldcroft commented Oct 5, 2023

mhvk commented Oct 7, 2023

taldcroft commented Oct 7, 2023

mhvk commented Oct 7, 2023

mhvk commented Oct 7, 2023

mhvk commented Oct 7, 2023

taldcroft commented Oct 19, 2023

mhvk commented Oct 26, 2023

taldcroft left a comment

Choose a reason for hiding this comment

mhvk commented Aug 26, 2023 •

edited

github-actions bot commented Aug 26, 2023 •

edited by taldcroft