Time: add setitem and missing value support (masking) #6028

taldcroft · 2017-05-09T20:55:52Z

This is an implementation of the necessary bits to fully support table join, hstack, and vstack for the Time class.

It includes new functionality for Time masking and setting, along with an implementation of the info class new_like() method.

Masking: there was discussion about whether to do this with np.nan internally or using MaskedArray. I first did the latter and got something mostly working (passing most but not all tests), but it ended up getting messy. In particular the interactions with Quantity got ugly because Quantity knows how to play with ndarray but not MaskedArray in arithmetic etc.

But it looks like the np.nan solution is fine, with the unexpected twist that using this strictly in jd2 makes life a lot easier because it is easy to locally set the nan values to 0 before entering any ERFA functions. This avoids the dubious values problem.
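To make the jd2-only NaN approach concrete, here is a minimal NumPy sketch. It is not the code from this PR: the arrays and the `to_mjd` function are hypothetical stand-ins for the internal two-part Julian date and for an ERFA routine.

```python
import numpy as np

# Two-part Julian dates; NaN in jd2 marks a missing (masked) element.
jd1 = np.array([2450000.0, 2450001.0, 2450002.0])
jd2 = np.array([0.25, np.nan, 0.75])

# The mask is derived from jd2 rather than stored separately.
mask = np.isnan(jd2)

# Replace NaN by 0 locally before calling numerical routines, so they
# only ever see valid day fractions.
jd2_filled = np.where(mask, 0.0, jd2)


def to_mjd(jd1, jd2):
    """Hypothetical stand-in for an ERFA routine that must not see NaN."""
    return (jd1 - 2400000.5) + jd2


# Re-apply the mask on output so the missing element stays hidden.
values = np.ma.array(to_mjd(jd1, jd2_filled), mask=mask)
print(values)
```

Because jd1 is never touched, the filled inputs always describe a valid date, which is presumably what avoids the dubious-values problem mentioned above.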

Setting: I think it is time to make Time mutable! Admittedly I haven't thought super hard about pitfalls yet, but I think just being careful about clearing the cache is enough. This is going to be needed to make Time work in table operations (and probably Time series) where you need masking and item setting.
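The point about clearing the cache can be illustrated with a small pure-Python sketch. The `CachedSeries` class and its `doubled` property are hypothetical and only show the pattern of invalidating derived values inside `__setitem__`; they are not taken from the Time implementation.

```python
class CachedSeries:
    """Hypothetical mutable container that caches derived values."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = {}

    @property
    def doubled(self):
        # Compute once, then serve from the cache.
        if 'doubled' not in self._cache:
            self._cache['doubled'] = [2 * v for v in self._values]
        return self._cache['doubled']

    def __setitem__(self, index, value):
        self._values[index] = value
        # Any cached result may now be stale, so drop them all.
        self._cache.clear()


s = CachedSeries([1, 2, 3])
before = s.doubled
s[0] = 10
after = s.doubled
print(before, after)
```

Without the `clear()` call the second read would still return the values computed before the assignment.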

To do:

Tests of setting
Docs
Changelog

mhvk

This looks quite good. I tried to focus on high-level concerns only for now, and these are partially addressed in the comments. Mostly,

We need to be careful with how we interact with location
We should ensure we don't get performance regressions
We need a way to pass in a mask; I'm not sure I like using jd2=nan on input -- that would be an internal implementation detail.

A few more comments are in the main thread.

mhvk · 2017-05-10T13:26:16Z

astropy/time/core.py

+                except Exception:
+                    raise ValueError('cannot convert value to a compatible Time object')
+
+        if self.location != value.location:


location can be an array, in which case this statement would break. Now it is guaranteed that location is either a single element or an array with the correct shape, but the shape may have been broadcast, so one cannot just set an element. So, this needs some trickery.... But it also means in principle a different location is fine. Similarly, if value.location is None, I think it might be treated as OK if self.location is not None (but arguably only if self.location is a single element).

I think that a non-scalar location is a corner case and that we could just raise an exception if either the self or value location is not a scalar or None. At least for now this would be fine and we can see if any users actually complain.
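The broadcasting concern can be reproduced with plain NumPy arrays. In this sketch a float array stands in for the location (an assumption made for illustration; the real attribute is an EarthLocation), and it shows why an element of a broadcast array cannot be set directly.

```python
import numpy as np

# One location shared by three times, expanded without copying.
single = np.array([10.0])
location = np.broadcast_to(single, (3,))
print(location.flags.writeable)  # broadcast views are read-only

try:
    location[1] = 20.0
    assigned = True
except ValueError:
    # NumPy refuses to write into the read-only broadcast view.
    assigned = False

# Copying removes the broadcasting, after which one element can be set.
own = location.copy()
own[1] = 20.0
print(own)
```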

mhvk · 2017-05-10T13:26:36Z

astropy/time/formats.py

+
+    @property
+    def jd2_filled(self):
+        return np.nan_to_num(self.jd2)


to ensure that the regular no-nan case is not slowed down too much:

return np.nan_to_num(self.jd2) if self.masked else self.jd2

Yup, I had also thought about that.
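A self-contained sketch of the suggested short-circuit follows. The `TwoPartDate` class and its `masked` property are assumptions made for illustration and are not the classes in astropy/time/formats.py.

```python
import numpy as np


class TwoPartDate:
    """Hypothetical holder for jd1/jd2 with NaN-in-jd2 masking."""

    def __init__(self, jd1, jd2):
        self.jd1 = np.asarray(jd1, dtype=float)
        self.jd2 = np.asarray(jd2, dtype=float)

    @property
    def masked(self):
        return bool(np.any(np.isnan(self.jd2)))

    @property
    def jd2_filled(self):
        # Only pay for a filled copy when something is actually masked.
        return np.nan_to_num(self.jd2) if self.masked else self.jd2


plain = TwoPartDate([2450000.0, 2450001.0], [0.1, 0.2])
holes = TwoPartDate([2450000.0, 2450001.0], [0.1, np.nan])
print(plain.jd2_filled is plain.jd2)
print(holes.jd2_filled)
```

In the unmasked case the original array is returned with no copy. Note that in this sketch the `masked` check still scans jd2 on every access, so a real implementation might want to cache that result.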

mhvk · 2017-05-10T13:28:57Z

astropy/time/formats.py

@@ -72,6 +73,11 @@ def _regexify_subfmts(subfmts):
    return tuple(new_subfmts)


+class TimeMaskedArray(np.ma.MaskedArray):


Why do we need a new class? It seems you want to avoid repr, but where is that a problem?

Otherwise you get this:

In [9]: tm.cxcsec
Out[9]:
masked_array(data = [99.99999999999805 -- 299.999999999999],
             mask = [False True False],
       fill_value = 1e+20)

Instead of:

In [4]: tm.cxcsec
Out[4]: [99.99999999999805 -- 299.999999999999]

But I think that's OK. Right now, if I don't mask anything, I get:

In [23]: Time(np.arange(50000, 50010), format='mjd').cxcsec
Out[23]:
array([-70329538.816, -70243138.816, -70156738.816, -70070338.816,
       -69983938.816, -69897538.816, -69811138.816, -69724738.816,
       -69638338.816, -69551938.816])

so why shouldn't a masked Time return a MaskedArray?

I don't like the masked_array repr because:

I don't need to see an explicit data = .., which has -- marking masked elements, AND mask = [..], which is basically the same information.

The fill_value is just a silly thing to include. It is irrelevant most of the time.

The data = is redundant. I know it is the data.

One can say this uses an old-school definition of repr, that it gives you the info to reproduce the object (although there it fails because it doesn't include dtype). But I think in this age of Jupyter notebook / interactive analysis, it is more important to have the object repr be a concise and visually-friendly representation of the object.

For consistency it could be:

def __repr__(self):
    return 'masked_array({})'.format(self)

OK, got rid of the TimeMaskedArray wrapper.

mhvk · 2017-05-10T14:14:36Z

For location, this is a Quantity, so in principle one can set elements already. I think one needs a function something like the following in __setitem__:

if self.location is None:
    if value.location is not None:
        # Cannot have some locations unknown and some set.
        # EDIT: this should be OK if self does not yet contain anything
        raise ValueError('cannot set a location on a Time that has none')
    # no locations involved; just go on
else:
    if value.location is None:
        if self.location.size > 1:
            # have different locations, but nothing set
            raise ValueError('value has no location but self has several')
        # value does not set location, so keep the single one we have
    else:
        if np.any(self.location[item] != value.location):
            # remove any broadcasting: a broadcast view is read-only,
            # so copy it before assigning
            new_location = self.location
            if not new_location.flags.writeable:
                new_location = new_location.copy()
            new_location[item] = value.location
            # maybe do it at end, so `self` remains unchanged if there are errors.
            self.location = new_location

Now the above is obviously a bit elaborate, but while we perhaps could get rid of broadcasting, I think we should continue to allow single locations, especially as we will need to do the same for SkyCoord and its many attributes anyway... Which, indeed, suggests this function could live on ShapedLikeNDArray (which already implements __getitem__).

Now, given the need for dealing with attributes that do not necessarily share the object's shape, I think one way to make mask easier is to treat it as such an attribute as well, i.e., it can be None or a (partially broadcast) array (a single item seems not so useful in this case).

Also a suggestion: I think at some level setting items and having a mask are different issues; would it be an idea to decouple the two? If we do __setitem__ first, and ensure it is tested thoroughly, we will be in a better place to see if my idea of treating mask like another attribute will work or not.

mhvk · 2017-05-10T14:29:19Z

p.s. Obviously, just having __setitem__ will at least allow vstack of tables, which probably is a main use. And it would allow defining insert as well...

For the mask, I think one decision we need to make is the extent to which this is a work-around for table, or general. E.g., you return masked array for all the properties, but what about time_delta.to(u.s)?

taldcroft · 2017-05-10T17:17:12Z

> For the mask, I think one decision we need to make is the extent to which this is a work-around for table, or general. E.g., you return masked array for all the properties, but what about time_delta.to(u.s)?

My thinking is that this is general. Time properties (in particular all format properties) will be a masked array. Any Quantity output will just be a Quantity with nan:

In [8]: dt
Out[8]: <TimeDelta object: scale='tt' format='jd' value=[0.001041666666666663 -- 0.003124999999999989]>

In [9]: dt.to(u.s)
Out[9]: <Quantity [  90.,  nan, 270.] s>

In case you haven't guessed, I'm pushing for nan as the formal indicator of masking for Quantity. It's something we can easily do for 2.0. I admit I always thought Quantity had to be float until you mentioned it recently, so this is not a 100% solution. But for this (and probably 98% or more) of use cases, this is fine.

taldcroft · 2017-05-10T17:29:08Z

I think one principle that will keep the Time masking implementation manageable and simple is to not apply the mask to all non-format attributes, e.g. location. As it currently stands mask is a derived attribute which is necessarily the same shape as the data. I think that trying to shoehorn this into every other Time attribute, which may or may not share the shape is going to make this much messier and potentially introduce unexpected changes to users.

taldcroft · 2017-05-10T17:34:25Z

About splitting this into two PRs, in principle I agree that is the right thing. In practice it is going to make things go faster for me if I don't. I have a limited amount of time before the day-job pulls me back and substantially slows development. In fact I changed the title and you'll see that I'm about to include what could reasonably be a 3rd PR. Note that the 3rd would require 1 (setting) and 2 (masking), and 2 would require 1, so you can see how developing them independently just slows the process.

mhvk · 2017-05-10T18:22:05Z

@taldcroft - I'm much more uncomfortable with the mask than with __setitem__, as the particular way of doing the mask has implications through other parts of astropy (Quantity, clearly, which may still more logically be done with MaskedArray, but also SkyCoord, where the trick of just doing jd2 is not available).

Perhaps more importantly, just having __setitem__ addresses at least one large table problem (the largest?) -- that vstack and insert are not possible. And, as said, I can also see how it nearly trivially extends to SkyCoord as long as we treat attributes consistently from the get-go.

mhvk · 2017-05-10T18:25:11Z

p.s. I do worry quite a bit about unexpected consequences of making Time mutable. As is, internal arrays can be shared between instances, so if one changes, another will too, but all code written will have implicitly assumed this never happens. We may want a mutable flag which by default is False...
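The sharing hazard is ordinary NumPy view behaviour, sketched below with bare arrays. This only illustrates the mechanism; which Time operations return views rather than copies is not shown here.

```python
import numpy as np

jd = np.array([1.0, 2.0, 3.0])

# Slicing returns a view, so `first_two` shares memory with `jd`.
first_two = jd[:2]
print(np.shares_memory(jd, first_two))

# Writing through one name is visible through the other.
jd[0] = 99.0
print(first_two[0])

# An explicit copy is independent of later changes.
independent = jd[:2].copy()
jd[1] = -1.0
print(first_two, independent)
```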

taldcroft · 2017-05-11T01:42:38Z

> as the particular way of doing the mask has implications through other parts of astropy

From my perspective this is completely internal and the fact that _time.jd2 has NaN's is not exposed at all (at least that is the intent). So we would be free to change the internal implementation in the future at any time.

Now it is true that in this implementation, a Quantity output like the .to(..) method assumes that Quantity implements masking by the NaN sentinel. We could certainly hold off on this decision by making .to(..) etc raise an exception for a masked Time object. I'm OK with that, but I will say:

I think that you are underestimating the difficulty of making a MaskedContainer or MaskedQuantity. My gut feeling is that at the end it will be a lot of work to develop and maintain.
For the case of float arrays, I think that using NaN is just the better way. Everything is already built-in and it is fast and memory-efficient.

> Perhaps more importantly, just having __setitem__ addresses at least one large table problem (the largest?) -- that vstack and insert are not possible.

I'm tired of mixins being crippled. By adopting the NaN protocol they can be fully functional! I'm mostly thinking of join(), which is actually very useful and (except for inner) needs masking.

taldcroft · 2017-05-11T01:46:21Z

> SkyCoord, where the trick of just doing jd2 is not available

True, but setting the representation values to nan could work, though currently this generates warnings from the Angle classes. Admittedly this is a bigger beast than Time.

taldcroft · 2017-05-11T14:07:38Z

About changing Time to be mutable, I have sent email to astropy-dev to request input. I agree on the idea of some attribute (writeable like the numpy flag?) that can be used to prevent setting. However I would have that default to writeable = True.

In general I think developers are very much used to the pitfalls of shared memory in numpy array objects, so I think we should enable writing by default. Since no existing code writes to Time objects, on the transition to 2.0 nothing would immediately break. For the small minority of cases where mutability is a problem, developers would have the opportunity to fix their code.

This could be as simple (and back-compatible) as t.writeable = False. For pre-2.0 this would just set an attribute that is not used anywhere. For post-2.0 this could be a property that is a view of the flags.WRITEABLE property of the underlying _time.jd1,2. (That's just an idea, haven't thought it through all the way, but the goal is to avoid a new Time attribute that needs to be manually managed).
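One way the post-2.0 idea could look is sketched below, under the assumption that the object owns two plain NumPy arrays. `TinyTime` is a hypothetical class, not the astropy API; it only shows a `writeable` property that mirrors the arrays' own flags instead of storing a separate attribute.

```python
import numpy as np


class TinyTime:
    """Hypothetical container; not the astropy Time class."""

    def __init__(self, jd1, jd2):
        self._jd1 = np.array(jd1, dtype=float)
        self._jd2 = np.array(jd2, dtype=float)

    @property
    def writeable(self):
        # No separate attribute to manage: report the arrays' own flags.
        return self._jd1.flags.writeable and self._jd2.flags.writeable

    @writeable.setter
    def writeable(self, value):
        self._jd1.flags.writeable = value
        self._jd2.flags.writeable = value

    def __setitem__(self, index, jd_pair):
        # NumPy raises ValueError here when the arrays are read-only.
        self._jd1[index], self._jd2[index] = jd_pair


t = TinyTime([2450000.0, 2450001.0], [0.0, 0.5])
t.writeable = False
try:
    t[0] = (2450002.0, 0.25)
    changed = True
except ValueError:
    changed = False
print(changed, t.writeable)
```

With this design the default follows NumPy (arrays start writeable), which matches the default argued for above; a default of False would need one extra line in `__init__`.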

dkirkby · 2017-05-11T14:18:07Z

I like the idea of mirroring the numpy WRITEABLE flag since that allows new functionality to be implemented gradually but probably also requires that the default is WRITEABLE = False, at least initially.

An alternative paradigm would be to base a new MutableTime class off Time, but I would only recommend that in a strongly typed language.

mhvk · 2017-06-12T16:01:17Z

@eteq - this is the PR for making Time (somewhat) mutable -- as you expressed a strong opinion about possibly doing the same for coordinates, now would be the moment to think about an API that would make sense (see discussion above; part of this is needed to be able to merge two sets of times -- the same would be useful for coordinates; for both, this is already somewhat in place for merging a list of scalars).

taldcroft · 2017-06-23T02:02:43Z

I've been working on this but I see that making it for 2.0 is perhaps not likely at this point.

astrofrog

I left a few comments/questions below. In general, I agree we should merge this in soon and if others want to propose a better solution, then there is still time to change things.

But first, a couple of big picture questions/concerns:

Why can e.g. t.mask[1] = True not be supported? For a Time object in a Table, it will be confusing for users if some columns support this and not others.
If I mask a value, then unmask it, won't precision be lost due to jd2 being set to NaN temporarily? EDIT: I guess the point of the API here is that one can't 'unmask'?

I do worry a bit about having Time diverge too much from SkyCoord, but overall I think this PR does make table operations possible that weren't before, so to me it seems like a net improvement.

astrofrog · 2018-01-26T22:05:25Z

astropy/time/core.py

+            if axis is not None:
+                approx = np.expand_dims(approx, axis)
+        else:
+            approx = np.max(jd, axis, keepdims=True)


Note that for 3.1 we are dropping Numpy < 1.13 so these kinds of blocks can be simplified. However #7058 will need a little work before being merged, so we can always just merge this as-is then remove later.

astrofrog · 2018-01-26T22:06:40Z

astropy/time/formats.py

-                (val2 is None or
-                 val2.dtype == np.double and np.all(np.isfinite(val2)))):
+        # val1 cannot contain nan, but val2 can contain nan
+        ok1 = val1.dtype == np.double and np.all(np.isfinite(val1))


Why isfinite and not isnan? This will catch inf/-inf

This line for ok1 is the same as it has always been and says that the user input must be double and finite and not NaN. (isfinite(nan) is False).

The next line for ok2 is mostly the same but requires that all values be a finite number or NaN. (isinf(nan) is False).

Summary: same as before except that NaN is allowed for val2.

Ok sounds good!
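The distinction discussed in this thread can be checked directly with NumPy. The `ok1` line below is the one from the diff; the `ok2` expression is an assumption, since the thread describes it ("finite or NaN") without showing it.

```python
import numpy as np

val1 = np.array([1.0, 2.0, 3.0])
val2 = np.array([0.5, np.nan, 0.25])
bad = np.array([0.5, np.inf, 0.25])

# isfinite is False for NaN as well as +/-inf, so one test rejects both.
ok1 = val1.dtype == np.double and np.all(np.isfinite(val1))

# For val2, NaN is allowed but infinities are not (assumed form of ok2).
ok2 = val2.dtype == np.double and not np.any(np.isinf(val2))
ok2_bad = bad.dtype == np.double and not np.any(np.isinf(bad))

print(np.isfinite(np.nan), np.isinf(np.nan))
print(ok1, ok2, ok2_bad)
```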

astrofrog · 2018-01-26T22:09:24Z

astropy/time/tests/test_basic.py

+    # Fails because the right hand side has location=None
+    with pytest.raises(ValueError) as err:
+        t[0, 0] = Time(-1, format='cxcsec')
+    assert 'cannot set to Time with different location' in str(err)


The error message doesn't seem quite right here - isn't the issue more that on the right the location isn't set so it's ambiguous?

It's different in the sense that the left side has location = EarthLocation(...) while the right side has location = None. I think the exception will let the user diagnose the problem by hinting to print the location of left and right sides. I could make the exception message be .. different location attribute to be even more clear? Or something else?

Yes, saying different location attribute - maybe could you explicitly say what the location is in each case? (expected ..., got ...)

astrofrog · 2018-01-26T22:15:10Z

docs/time/index.rst

+
+  >>> t.mask
+  array([False, False,  True, False]...)
+  >>> t[:2] = np.ma.masked


I think you should explain that it's not possible (and why) to set the mask using e.g. t.mask[:2] = True

astrofrog

I've re-reviewed this after having taken some time to think. Apart from some of the minor inline comments above which still hold, I think we should actually go ahead and merge this. My main argument for this is that this doesn't introduce any API that I think we would want to revert in future. For example, I think that regardless of the internal implementation, doing things like:

>>> t[1] = Time.now()

or

>>> t[1] = np.ma.masked

are great. If anything, we'd probably want to allow similar behavior in SkyCoord (that is, setting a single SkyCoord element to another SkyCoord).

I would like to consider in future that we allow e.g.:

t.mask[1] = True

i.e. a mutable mask, and this would require a separate mask array (so that unmasking works) but my point is that the API additions here are already a step forward and I don't see the point in delaying this. I don't think there are API additions here that we'll want to revert (and as far as I can tell there is no API being broken).

taldcroft · 2018-02-02T18:41:44Z

The syntax of t.mask[1] = True is trivial, but of course the data behind the mask are lost and you cannot unmask. Is there an obvious use-case for masking then unmasking Time? I note that numpy MaskedArray discourages manipulation of the mask directly as you showed, and I've gone in that direction.

astrofrog · 2018-02-02T18:51:14Z

> The syntax of t.mask[1] = True is trivial, but of course the data behind the mask are lost and you cannot unmask. Is there an obvious use-case for masking then unmasking Time? I note that numpy MaskedArray discourages manipulation of the mask directly as you showed, and I've gone in that direction.

Right, I think that given the NaN-based implementation we shouldn't allow this since unmasking doesn't work. I was not aware that Numpy were moving away from that. In any case, my point is that t.mask[1] = True isn't possible anyway right now, so we're not taking away something users can do already. I think we should discuss whether we want to move to something that allows that after this PR.
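Why unmasking cannot work with a NaN sentinel, and why it can with a separate mask array, is shown in the following NumPy sketch. The `jd2` array is a stand-in for the internal storage, not the real attribute.

```python
import numpy as np

# NaN as the sentinel: masking overwrites the stored number.
jd2 = np.array([0.25, 0.5, 0.75])
jd2[1] = np.nan
# There is no way to get 0.5 back from jd2 alone.

# MaskedArray keeps a separate mask, so the data survive masking.
m = np.ma.array([0.25, 0.5, 0.75])
m[1] = np.ma.masked
hidden = m.data[1]

# Clearing the mask exposes the original value again.
m.mask = np.ma.nomask
restored = m[1]
print(jd2[1], hidden, restored)
```

A mutable mask for Time would therefore need storage along the lines of the second half of the sketch, which is the separate mask array mentioned earlier in the review.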

eteq · 2018-02-02T19:06:15Z

Realizing I've had several out-of-band conversations on this without commenting in this PR... I haven't been able to review in detail but I have some general thoughts. To summarize, now that @astrofrog clarified that the NaN part was an implementation detail and not visible to the user, I'm fine with this. I'm not convinced we don't want mutable masks in a general sense (e.g. one could argue this might be good in SkyCoord). And I agree with @mhvk that we want similar behavior as much as possible across packages like time, table, and coordinates. But I think it's safe to start with an immutable mask (which, as I understand it, is what this does), and potentially make it mutable later if we think that's reasonable across the packages. That's safer than the converse, so I'm OK with it.

> I note that numpy MaskedArray discourages manipulation of the mask directly as you showed

I didn't appreciate that either... I'm also not sure that I agree - I think one of the advantages of table is that it allows this. But I think it's a moot point for this PR (see above).

mhvk · 2018-02-02T19:51:31Z

> I note that numpy MaskedArray discourages manipulation of the mask directly as you showed

I don't know that this is true. It can just be messy for arrays with a structured dtype (which part does one mask).

taldcroft · 2018-02-02T19:59:54Z

@mhvk - agreed it is not critical to this discussion nor necessarily the viewpoint of astropy, but I stand by my statement. 😄

From https://docs.scipy.org/doc/numpy-1.13.0/reference/maskedarray.generic.html#modifying-the-mask:

mhvk · 2018-02-02T22:09:17Z

I agree with you (and the documentation) that it is better to just write masked to an element. I guess my main point, rather indirectly, is that MaskedArray does its utmost to preserve the masked values (arguably too much effort!) and that this PR breaks that expectation. Anyway, not directly relevant indeed.

taldcroft · 2018-02-03T21:51:32Z

@astrofrog - I think I addressed your comments.

Thanks to all for your reviews!

astrofrog · 2018-02-04T12:04:39Z

@taldcroft - looks good! This is good to go :)

Moving the cache to Time itself seems more logical, since apart from the mask, all the handling of cache state is done in Time. Indeed, it was on Time before setting of Time elements was introduced in astropygh-6028. This just moves it back now that the mask handling is much simpler. One resulting change is that any cached format information still needs to be deleted when changing format, since it can be out of date.

taldcroft added Affects-release Enhancement time labels May 9, 2017

taldcroft added this to the v2.0.0 milestone May 9, 2017

taldcroft requested review from astrofrog and mhvk May 9, 2017 20:55

mhvk reviewed May 10, 2017

taldcroft changed the title ~~WIP: Time masking and setting~~ Support join, hstack, vstack for Time May 10, 2017

taldcroft added the table label May 10, 2017

taldcroft mentioned this pull request May 13, 2017

POC: minimal masking for Quantity #6031

Closed

taldcroft force-pushed the time-nan branch from 681d73c to 6bac636 Compare June 10, 2017 10:49

taldcroft changed the title ~~Support join, hstack, vstack for Time~~ Time: add setitem, masking, and support join, hstack, vstack Jun 10, 2017

taldcroft added the Priority-High label Jun 12, 2017

taldcroft added this to High priority in 2.0 Feature Planning Jun 12, 2017

taldcroft force-pushed the time-nan branch from 6bac636 to a6fb0e7 Compare June 19, 2017 11:34

taldcroft moved this from High priority to Postpone to future versions in 2.0 Feature Planning Jun 24, 2017

taldcroft added this to High Priority in 3.0 Feature Planning Jun 24, 2017

astrofrog reviewed Jan 26, 2018

astrofrog approved these changes Feb 2, 2018

Improve exception message and add note about setting mask

9789fe5

taldcroft force-pushed the time-nan branch from 07d592b to 9789fe5 Compare February 3, 2018 19:59

taldcroft merged commit dc560f6 into astropy:master Feb 4, 2018

taldcroft deleted the time-nan branch February 4, 2018 13:36

bsipocz mentioned this pull request Jun 19, 2019

astropy.time.Time does not support missing values #4032

Closed

This was referenced Nov 17, 2019

Missing values in Time obj raise error when format is changed #9612

Closed

Removing MAGIC_TIME variable in favor of masked Times astropy/astroplan#435

Merged

pllim mentioned this pull request Jul 21, 2021

TST: astropy/time/tests/test_methods.py might trigger race condition in parallel job #11966

Closed

mhvk mentioned this pull request Oct 22, 2023

Using Masked inside Time #15231

Merged
