New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broken FITS table round-trip with masked Tables #4708

Open
sbailey opened this Issue Mar 18, 2016 · 13 comments

Comments

Projects
None yet
6 participants
@sbailey

sbailey commented Mar 18, 2016

When the fill_value of a masked Table Column is equal to an unmasked value, the round-trip to a FITS format file is broken [EDIT taldcroft example changed]. Note that #7481 in astropy 3.1 provides a work-around by specifying a serialize_method, but does not fix the default behavior.

In [23]: from astropy.table.table_helpers import simple_table
In [24]: t = simple_table(masked=True)
In [25]: t['a'].fill_value = 2
In [26]: t.write('junk.fits', overwrite=True)
In [27]: t2 = Table.read('junk.fits')

In [28]: t
Out[28]: 
<Table masked=True length=3>
  a      b     c  
int64 float64 str1
----- ------- ----
   --     1.0    c
    2     2.0   --
    3      --    e

In [29]: t2
Out[29]: 
<Table masked=True length=3>
  a      b      c   
int64 float64 bytes1
----- ------- ------
   --     1.0      c
   --     2.0      N
    3     nan      e

This is because the Table is storing the fill_value in the TNULLn keywords to flag mask values, resulting in masking any row that has that value, even if it wasn't originally masked. [EDIT taldcroft] In addition, handling of floating point masking via NaN is broken as well as string-type masking.

To fix this, it seems that the Table needs to separate out the concept of what values are masked (using TNULLn following the FITS standard, vs. what the fill_value should be for those masked elements. AFAIK, the FITS standard doesn't define how to store a fill_value, but TFILLn would seem reasonable.

The current behavior is especially problematic when coupled with issue #4707 where filtering masked tables resets the fill_value to 1 -- a common number that can result in any other 1 getting masked after round tripping to a FITS file.

@eteq

This comment has been minimized.

Member

eteq commented Apr 26, 2016

@taldcroft - do you agree this is a bug, and not in some subtle way intended behavior? (Also note #4707)

@taldcroft

This comment has been minimized.

Member

taldcroft commented Apr 26, 2016

This looks like a bug. It is not a bug in Table per se but rather the io.fits I/O registry function that writes out a Table as FITS. The fill_value attribute is basically just saying what to use when making a filled version of the table via the filled() method. Thus it should not (AFAIK) be directly represented in the FITS output except perhaps in the metadata.

@saimn

This comment has been minimized.

Contributor

saimn commented May 15, 2018

The fill_value attribute is basically just saying what to use when making a filled version of the table via the filled() method. Thus it should not (AFAIK) be directly represented in the FITS output except perhaps in the metadata.

While I agree that fill_value and FITS's TNULL do not carry exactly the same meaning, it seems logical to use both together as it is done currently. But this also means that one should not use a fill_value value that is also used for correct values present in the table (as done in the example above).

So what could be the good way to store masks/missing values in FITS with Table ?

  • keep the current behavior, using TNULL to identify masked values that have been replaced with fill_value. Then we could issue a warning if the fill_value is also used for non-masked values? This would be easy to do and seems to be the more straightforward solution.

  • find another way to store the mask, e.g. in an additional extension, and use the serialization feature for fill_value ? This would be more specific to Astropy, but storing masks in separate extension is a quite common way to proceed, we just need a reliable way to identify the mask extension associated with a table.

  • other ideas ?

@sbailey

This comment has been minimized.

sbailey commented May 15, 2018

Rather than defining a new extension for the mask, how about continuing to use TNULLn for the mask (as defined by the FITS standard) and then use a new keyword TFILLn to store the fill values for each column? I think that would be better than a completely new extension for the mask.

@taldcroft

This comment has been minimized.

Member

taldcroft commented May 16, 2018

I think that the serialization machinery in Table could be used effectively here. It's supports a capability where the mask attribute would be included in a "encoded" version of the table that actually gets written to FITS.

This does not create a new extension, but instead writes the mask as a boolean column in the same extension table. This does circumvent the FITS standard somewhat, but has the advantage of being completely general and not prone to problems like where the fill value exists in the column. For dtypes like uint8 or a single character string this problem is a likely occurrence.

@taldcroft

This comment has been minimized.

Member

taldcroft commented May 16, 2018

This would be treating MaskedColumn as a quasi-mixin. I need to look at the details, but this might be a very small patch to MaskColumnInfo.

Thoughts @mhvk?

@sbailey

This comment has been minimized.

sbailey commented May 16, 2018

Please be very wary of inventing non-standard FITS usage that would cause the table to be interpreted differently by non-astropy readers in other languages (or fitsio for that matter). I acknowledge that my TFILLn flirts with that too, but at least it doesn't circumvent the FITS standard keyword of TNULLn or add extra columns.

@taldcroft

This comment has been minimized.

Member

taldcroft commented May 16, 2018

@sbailey - yes, we indeed strive hard to avoid non-standard FITS conventions. If there is a FITS convention that can accommodate masked data of any type (including e.g. uint8 where all 256 values are represented in the data), then that would be important information to know.

And although it is a matter of some philosophical controversy, numpy masked arrays (and MaskedColumn) do allow turning off the mask and restoring a masked element to its original value. AFAIK that cannot be done via FITS (which relies on sentinel values) without adding extra columns. So if you want a hi-fi round trip of a MaskedColumn through FITS, the only option is adding columns with the mask and values stored separately.

I have had long discussions with people (including FITS gurus like Arnold Rots), and there are certainly good arguments for saying that FITS is simply not the right container for serializing astropy objects and we should accept such limitations. But the fact on the ground is that astronomers want/expect FITS to round-trip, as evidenced by the title of this issue you opened. And I am highly in favor of lossless round-tripping wherever possible.

@taldcroft

This comment has been minimized.

Member

taldcroft commented May 18, 2018

See #7481.

@taldcroft

This comment has been minimized.

Member

taldcroft commented May 21, 2018

To add to this issue, even without the corner case of the int NULL value overlapping with a valid data value, it appears that masked table support is fairly broken. It does not seem to round trip for float or string columns. I spent a fair bit of time on #7481 to provide and optional work-around for these problems, but somebody should work on the core FITS support for missing data. I think this is a FITS issue, not Table. 😄.

In [1]: astro
astropy=3.0.1

In [2]: t = simple_table(masked=True)

In [3]: t
Out[3]: 
<Table masked=True length=3>
  a      b     c  
int64 float64 str1
----- ------- ----
   --     1.0    c
    2     2.0   --
    3      --    e

In [4]: t.write('junk.fits', overwrite=True)

In [5]: t2 = Table.read('junk.fits')

In [6]: t2
Out[6]: 
<Table masked=True length=3>
  a      b      c   
int64 float64 bytes1
----- ------- ------
   --     1.0      c
    2     2.0      N
    3     nan      e

In [7]: t2['b'].mask
Out[7]: array([False, False, False], dtype=bool)

In [8]: t2['c'].mask
Out[8]: array([False, False, False], dtype=bool)
@saimn

This comment has been minimized.

Contributor

saimn commented May 23, 2018

It does not seem to round trip for float or string columns. I spent a fair bit of time on #7481 to provide and optional work-around for these problems, but somebody should work on the core FITS support for missing data. I think this is a FITS issue, not Table. smile.

For float it could make sense to mask NaNs by default I guess ? Or at least with an optional keyword (Table.read(..., mask_invalid=True) ?).

@SaOgaz

This comment has been minimized.

Contributor

SaOgaz commented Jun 5, 2018

I'm not sure if this is related to this issue (maybe the same underlying problem?), if it's not I can open a separate issue but here goes:

When round tripping a binary table, and adding a row when the data is in it's Astropy Table incarnation, the TNULL value actually gets changed to 999999. The only seems to happen when a row is added, not when editing any individual values in the table, or new columns are added. For example, the fitsdiff shows afterwords:

Keyword TNULL19  has different values:
        a> -2147483647
        b> 999999
@saimn

This comment has been minimized.

Contributor

saimn commented Jun 5, 2018

@SaOgaz - probably not the same issue, but it's hard to say without a minimal example to reproduce ;). I think it deserves its own issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment