Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using Masked inside Time #15231

Merged
merged 22 commits into from Oct 27, 2023
Merged

Using Masked inside Time #15231

merged 22 commits into from Oct 27, 2023

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented Aug 26, 2023

To do:

  • Make mask read-only (at least as gotten from .mask, perhaps also internally).
  • Add configuration option for having np.ma.MaskedArray always on output.
  • Add documentation (change-log entry, what's new, documentation about config)

@taldcroft - when I introduced Masked, I had hoped to soon use it to have masked SkyCoord, etc. Of course, things never go very fast. But even at the time I had made a start by seeing how things would work for Time, where it would replace using np.nan for jd2. This PR is to actually do that (as draft for now). This is triggered in part by #15230 which led back to #6509, a request to have an option for "NaT", ie., a marker for a bad time (as opposed to a masked one).

I remember vaguely that at the time you were rather hesitant to go this route, but do not really remember why, except that it probably involved backward compatibility! The approach taken here is that if Time gets np.ma.MaskedArray as input, that's also what it will give as output (reusing the _shaped_like_input method as before), otherwise it will be Masked. I think this is a reasonable option, though I think we should add a configuration option for those who want to just set a given type. If you think that makes sense, I'll add that.

Note that the PR consists of 4 commits all of which are working states, so perhaps useful to review independently:

  1. Move to use Masked everywhere. Conversion upon inputs, all output as Masked. EDIT: also a small bug fix in fitstime, uncovered because now jd2 is no longer nan if an element is masked.
  2. Since formats do not really have to know about masking any more, move the mask stuff to Time proper.
  3. For outputs, use the masked class used for the inputs. Note: this is incomplete: if one does t2 = t-t[0], then t2 just uses Masked again. Probably should take _masked_cls from the left-most argument or so. Or maybe this should be an attribute on info and let that do the merging?!
  4. Extra tests for all formats and a fix for a bug found as a result (can mix in 1 and 3 if desired).

One API change is that the mask is now writeable if masked values were passed in. This is consistent with how masked arrays generally work (but could be changed). One consequence of that is that the masked property can no longer be cached. (Obviously, this could be changed. Since the mask is always a copy of that from the inputs, we could write-protect it.)

Out of scope here:

  1. Solving NaN equivalent for astropy.time.Time #6509 - though if we use Masked internally, we become free to use np.nan to indicate "NaT";
  2. Allowing input of masked data where the underlying values do not get overridden (right now, one would have to initialize t = Time(val.unmasked, ...) and then set the mask after, with t[val.mask] = np.ma.masked).
  3. Might be nice for Masked(times, mask=...) to work too.

Small trial:

from astropy.time import Time
from astropy.utils.masked import Masked
t = Time(Masked([50000., 50001, 50002], mask=[False, True, False]), format='mjd')
t
# <Time object: scale='utc' format='mjd' value=[50000.    ——— 50002.]>
t.isot
# MaskedNDArray(['1995-10-10T00:00:00.000',                       ———,
#                '1995-10-12T00:00:00.000'], dtype='<U23')
(t-t[0]).to('s')
# <MaskedQuantity [     0.,     ———, 172800.] s>

from astropy.table import QTable
QTable([t])
 
<QTable length=3>
  col0 
  Time 
-------
50000.0
    ———
50002.0
  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@mhvk mhvk added this to the v6.0 milestone Aug 26, 2023
@github-actions
Copy link

github-actions bot commented Aug 26, 2023

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@github-actions
Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@mhvk mhvk force-pushed the masked-time-ma-optional branch 3 times, most recently from 65ebd26 to 8522edb Compare August 27, 2023 09:30
@taldcroft
Copy link
Member

@mhvk - you got it that my big concern would be about breaking existing code. The minimal impact to existing astropy tests is a good start, but some quick questions without really looking at the code yet:

  • You mentioned that the mask is writeable, so would that cause problems for the caching of format conversions? (e.g. tm.iso is cached).
  • To what extent is your MaskedNDArray API-compatible with the current numpy MaskedArray? What is the potential for incompatibility there?
  • What about speed and memory performance changes?
  • Does this fix TypeError when comparing TimeDelta and a nan Quantity #15230 ?

@mhvk
Copy link
Contributor Author

mhvk commented Aug 27, 2023

@mhvk - you got it that my big concern would be about breaking existing code. The minimal impact to existing astropy tests is a good start, but some quick questions without really looking at the code yet:

* You mentioned that the `mask` is writeable, so would that cause problems for the caching of format conversions? (e.g. `tm.iso` is cached).

I do indeed worry about that, and have not checked. I do not have a strong opinion on whether the mask should be writeable or not, so we can stick more to the Time-is-almost-immutable case that we have and make the mask itself (or perpaps what is returned by .mask) read-only again. Indeed, I just realized that since the mask is shared between jd1 and jd2, making it read-only might give some extra protection against them somehow getting out of sync.

* To what extent is your `MaskedNDArray` API-compatible with the current numpy `MaskedArray`? What is the potential for incompatibility there?

See https://docs.astropy.org/en/latest/utils/masked/index.html#utils-masked-vs-numpy-maskedarray

I think that if we implement the config item I suggested that enforces np.ma.MaskedArray output, users of Time can ensure there are no differences whatsoever.

* What about speed and memory performance changes?

I would expect a (slight) degradation in both cases, since there now is some overhead in propagating and storing the mask (latter only 1 byte for 16 bytes of data, though, since mask is shared). Given that actual time calculations are expensive, I think the time overhead will be minimal. Would be good if we still had functioning benchmarks! (astropy-benchmarks seems pretty defunct, and sadly astropy.time never made it on them). Of course, this only concerns the case that Time is actually masked!

* Does this fix [`TypeError` when comparing TimeDelta and a nan Quantity #15230](https://github.com/astropy/astropy/issues/15230) ?

Not directly. But it will enable using np.nan as a marker for NaT (though I think we could fix #15230 also by simply allowing TimeDelta to have non-finite numbers; unlike for Time, that might break just about nothing).

Note that the larger goal I still have in mind here is to add masking to SkyCoord (and everything below it), and potentially even modelling. Since SkyCoord uses Quantity, I will have to use Masked there. A bit similarly, I have been working on Distribution recently, to allow it to be used in representations, etc., and thus get "automatic" error propagation also for coordinates. To get that to work for Time needs an overhaul very similar to what I did here, so that jd1 and jd2 are no longer required to be base ndarray.

Anyway, am happy to make changes, but would like to be sure you think this is OK at least in principle...

@taldcroft
Copy link
Member

@mhvk - in principle I am fine going ahead. In part this is because I think SkyCoord really needs masking and so doing things in a unified way is the right approach. The nan approach for Time was always meant as a "good-enough" approach in lieu of another mature solution. Since Masked has been out for some time now it should be solid enough.

Thanks for the work!

@mhvk
Copy link
Contributor Author

mhvk commented Aug 27, 2023

OK, sounds good. I add two checks on top for things I should implement.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 5, 2023

@taldcroft - I pushed an update to this effort to use Masked inside Time, which includes updates to the documentation (plus changelog entry and what's-new), adds a configuration option to get np.ma.MaskedArray output, and makes the .mask read-only.

Compared to what I had before, the main change is that I kept the option simple: choose either Masked or np.ma.MaskedArray output, i.e., I dropped my original idea that the output should by default depend on what class was used for input. This mostly because propagating that information through operations seemed needlessly complex, but also just to "refuse the temptation to guess".

This is fairly complete, but there are a few related things I'm wondering about; would be great to get your opinion!

  1. Should .masked now just depend on whether the internal arrays are Masked, rather than on whether there is any element masked? When using np.nan we really didn't have much choice, but always having a mask when the data is Masked might help avoid surprises, that, e.g., one can only do time.jd.filled(np.nan) if any element is masked, even if one passed Masked input (but perhaps had no masked elements).
  2. Related: should it be possible make Time masked? E.g., by setting time.masked = True. Or, perhaps nicer, by making Masked(time, mask=...) work? The latter may need a bit more thinking, but should be do-able (normally, one would have separate MaskedTime and Time classes, but I don't think it is best here and it should be possible to avoid it).
  3. Related too: should Time gain a .filled() method and/or an unmasked property? (Latter could also be done by time.masked = False.)

Here, all could be done as follow-up, but I think at least the first point would be best addressed before 6.0, since otherwise people get another API change in 6.1. The other items just add features, so could be done in a later version too (but good to think through a bit what we want, since it would affect SkyCoord as well).

@mhvk
Copy link
Contributor Author

mhvk commented Oct 5, 2023

p.s. I checked, and it turns out to be fairly trivial to get Masked(time, mask=...) and to even have a MaskedTime = Masked(Time) which takes a mask argument. One could imagine that this MaskedTime has the unmasked property and filled() method, while regular Time can have a mask, but does not have those "extras". In that case, we could leave the behaviour of Time.masked exactly as it is now, since one can do ifinstance(possibly_masked_time, Masked).

p.s.2 The scheme I used to create MaskedTime would work trivially on other classes like SkyCoord!

@taldcroft
Copy link
Member

This makes me think about a big picture question and overall API consistency.

The current Time class can be masked and that is indicated by the masked attribute. Because Time is a container class it has the option of converting to a "masked" object in-place, and presumably the same would apply to SkyCoord. However, Column and Quantity are different and for those they require a distinct Masked... version.

If there will be a MaskedTime which is slight different from Time with masked JD values, then would a table join with missing values create a Time or a MaskedTime? Would one choice be preferred?

Writing it out like this makes me think that we should adhere to "one and only one way" to doing masking for each class at the expense of inconsistency in how masking is done.

That would imply answers to your questions (of course up for more discussion):

  1. Should .masked now just depend on whether the internal arrays are Masked? Yes.
  2. Should Masked(time, mask=...) work? Whether or not it works, astropy core should not document that as part of core astropy nor define or use a MaskedTime class anywhere.
  3. Should Time gain a .filled() method and/or an unmasked property? Yes, those are both well defined even for an object that is not masked so the API can be consistent.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 7, 2023

Thanks, that's super helpful! I agree with your logic. Shall I add .filled() and unmasked to this PR, then?

One further question: it would seem sensible to allow setting Time.masked -- otherwise, to make an instance masked without setting any masked elements would require something a bit silly like doing t[:0] = np.ma.masked. Setting to False would then be equivalent to doing .unmasked inplace.

As for working with Masked, I made a few trials and it is possible to recognize Time as a Masked instance if it is in fact Masked, which I think will be useful. The MaskedTime class is easily made into a pure factory class, which just adds a mask argument to Time, but produces a Time instance (for which instance(t, MaskedTime) will work as long as t.masked holds. But that is for another PR -- I think much of the code can live on a new MaskableShapedLikeNDarray class, which can then be used by SkyCoord, etc., as well.

@taldcroft
Copy link
Member

@mhvk - About the .masked attribute. Seeing your question makes me think again and wonder if there will be any internal or downstream issues with the subtle API change in what .masked implies (has some masked items vs might have masked items). I took a quick look through astropy time and table and nothing jumped out.

In any case, I suppose that .masked could be settable, but the more important question is whether that should ever be necessary. I would expect that doing tm[2] = np.ma.masked for an unmasked Time would auto-convert the _time object to Masked and that would be it. In other words why would you want to set .masked without actually masking any item?

Conceptually this gets to the original implementation and analogy with Pandas that from the user perspective there doesn't need to be much (if any) distinction between masked or unmasked. This is the benefit of having a container class (or using sentinel values).

In the other direction, doing tm.masked = False will have the consequence of unmasking values in-place, which might be surprising to some users. The explicit way to write this would be tm = tm.unmasked.

Upshot: I'm on the fence about whether to allow setting .masked, but either way I don't think we should document this is a feature of the API.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 7, 2023

OK, why don't I leave settable masked out of this PR -- it is something that can be added later anyway.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 7, 2023

OK, I pushed the further commits that make time.masked reflect whether or not the internal jd1 is Masked.

@mhvk mhvk marked this pull request as ready for review October 7, 2023 20:59
@mhvk
Copy link
Contributor Author

mhvk commented Oct 7, 2023

Note that the test on scalars uncovered a small bug in Masked (bca63dc) - I'll make a separate PR with that, since that probably should still be backported.

@taldcroft
Copy link
Member

Here is where we come back to the fundamental issue of masked vs missing. This is surprising to me because of the implication that the data are valid but simply masked:

>>> dates = Masked(["2000-01-01", "2003-12-30"], mask=[True, False])
>>> print(Time(dates, format="iso", out_subfmt="date").unmasked)
['2003-12-30' '2003-12-30']

Most people would probably expect ['2000-01-01', '2003-12-30']. But of course the code should also work for the example of missing or bad data (and indeed it does work but with a result that might surprise some people):

>>> dates = Masked(["", "2003-12-30"], mask=[True, False])
>>> print(Time(dates, format="iso", out_subfmt="date").unmasked)
['2003-12-30' '2003-12-30']

I wonder if we could address this ambiguity with this:

  1. Define a cached property for each format class that provides a default value which is equivalent to "2000-01-01".
  2. Use that default value consistently to replace any masked / missing values.
  3. Document that new property and in particular make it clear for Time that any input masked values are always replaced with the default value.

Note that these issues are not necessarily new with this PR, but are somehow more obvious to me now.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 26, 2023

OK, the final push worked locally at least...

mhvk and others added 22 commits October 26, 2023 19:03
This replaces the use of nan for jd2 with Masked arrays for jd1 and jd2.
For now, the mask is not writeable if nothing Masked was input, since
we do not want to change the nature of jd1 and jd2 without explicit input.

Along the way, this uncovered a bug in how masked elements were
treated in fitstime. The old implementation merged jd1 and jd2
together using np.stack, which removes any mask. But because jd2 was
nan for any element anyway, the element still was recognized as masked.
Now it is treated properly.
Moving the cache to Time itself seems more logical, since apart from
the mask, all the handling of cache state is done in Time. Indeed, it
was on Time before setting of Time elements was introduced in astropygh-6028.
This just moves it back now that the mask handling is much simpler.

One resulting change is that any cached format information still needs
be deleted when changing format, since it can be out of date.
Also check that the mask is shared internally between jd1 and jd2,
but not with any inputs or outputs.
Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, approved!!

@mhvk mhvk merged commit f2cd545 into astropy:main Oct 27, 2023
27 checks passed
@mhvk mhvk deleted the masked-time-ma-optional branch October 27, 2023 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants