Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TNULL not respected/retained by astropy.table when reading/writing from FITS file #14693

Closed
gpsgibb opened this issue Apr 26, 2023 · 5 comments · Fixed by #14723
Closed

TNULL not respected/retained by astropy.table when reading/writing from FITS file #14693

gpsgibb opened this issue Apr 26, 2023 · 5 comments · Fixed by #14723

Comments

@gpsgibb
Copy link
Contributor

gpsgibb commented Apr 26, 2023

Description

When astropy.table reads or writes FITS tables, the TNULL parameter for each integer column is always reset to 999999.

When reading in a table, the reader respects the TNULL in the FITS header, such that the returned table (with masked columns) has the masks where the values in the column were TNULL, however the fill_value attribute of the column is set to 999999.

When writing a table with masked columns (with an arbitrary fill_value) the TNULL for the column is always set as 999999, regardless of the choice of fill_value.

This causes problems where a table read in, then written out may have its TNULL values changed, such that the input table is no longer equal to the output table. Not to mention, for an integer column, one may expect 999999 to be a valid (e.g. non-NULL) value, but with astropy.table this value can be inadvertently converted to NULL upon write.

I note that if a column with an arbitrary fill value has its filled() method called, the masked rows are filled with the correct value. Given that fill values seem to be respected in table operations within astropy.table, I suspect that in table I/O numpy's masked array default fill values are being used rather than those set for the column (e.g. column.fill_value), as numpy.ma's default fill value for ints is 999999. Also of note is that numpy.ma has the default float mask value to be 1E20. This seems to be the default fill_value for a masked column containing floats (e.g. see my example code).

My best guess is when reading a table/column in, it creates the masked array properly, but fails to propagate TNULL to MaskedColumn.fill_value. When writing the table, instead of using MaskedColumn.fill_value it uses np.ma.default_fill_value)

Expected behavior

For a table being read in, if an integer column's TNULL ise set to, say, -1, then that column's fill_value attribute should be equal to -1, not 999999.

For a table being written out, if a given column's fill_value attribute is set to, say, -1, then TNULL for that column should be -1 in the FITS header, not 999999

How to Reproduce

The below code illustrates the behaviour.

First a table with a float32 column and an int32 column is created with some masked values (using BinTableHDU which correctly fills in the NULL values). The null value for the integer column is set to -1. This table then read in by astropy.table, and we find the TNULL for the integer column has been changed to 999999. Upon writing this table out again, TNULL is set to 999999 in the output table.

Secondly we create a table using astropy.table with masked columns. Again, the integer column has its null/fill value set to -1. Using Table.filled(), we note that the masked integer values are correctly filled with -1. Upon writing this table out, the masked elements are filled with 999999, and the TNULL property for that column is set to 999999.

import numpy as np
from astropy.table import Table, Column, MaskedColumn
from astropy.io import fits
from astropy import __version__

INT_NULL = -1

print(f"Using Astropy {__version__}")
print("")

print(f"NULL for integers has been defined as {INT_NULL}")

rng = np.random.RandomState(seed=1)

# generate some integer and float data
int_data = np.asarray([i for i in range(10)], dtype = np.int32)
float_data = np.asarray([float(i) for i in range(10)], dtype = np.float32)
mask = rng.randint(2,size=10, dtype = bool)

masked_rows = [ i for i, masked in enumerate(mask) if masked]

# First of all, write a (correct) table with NULL values in columns, and see how astropy.Table handles these
print("")
print("Create table with BinTableHDU:")

# make table via astropy.io.fits 
floats = np.copy(float_data)
floats[mask] = np.NaN
ints = np.copy(int_data)
ints[mask] = -1
c1 = fits.Column(name="float_col", format="D", array=floats)
# This column's NULL will be defined in the TNULL2 card in the table's FITS header
c2 = fits.Column(name="int_col", format="K", null=INT_NULL, array=ints) 
hdu = fits.BinTableHDU.from_columns([c1,c2])
hdul = fits.HDUList([fits.PrimaryHDU(), hdu])
hdul.writeto("t_io_fits.fits", overwrite=True)

# open this table (with astropy.io.fits) and check what TNULL2 is (should be INT_NULL as defined above)
with fits.open("t_io_fits.fits") as hdul:
    header = hdul[1].header
null = header["TNULL2"]
print(f"BinTableHDU table has TNULL2 as {null}. Correct? {null == INT_NULL}")


# Now read this table in with astropy.table
t = Table.read("t_io_fits.fits")
print(f"Rows that should be masked are {masked_rows}")
print("Table read in by astropy.table:")
print(t)
null = t["int_col"].fill_value
print(f"Table read in by astropy.table has TNULL2 as {null}. Correct? {null == INT_NULL}") 
float_null = t["float_col"].fill_value
print(f"Float column has default fill value of {float_null}")

# write the table out with astropy.table, reading in with astropy.io.fits and checking what null is
t.write("t_table.fits", overwrite=True)
with fits.open("t_table.fits") as hdul:
    header = hdul[1].header
null = header["TNULL2"]
print(f"Table written by astropy.table has TNULL2 as {null}. Correct? {null == INT_NULL}")


# Now try creating a table with astropy.table
print("")
print("Create table with astropy.Table:")

c1 = MaskedColumn(name="float_col", data=float_data, mask=mask, fill_value=np.NaN)
c2 = MaskedColumn(name="int_col", data=int_data, mask=mask, fill_value=-1)
t2 = Table(data = [c1,c2])
null = t2["int_col"].fill_value
print(f"Table created with Table has TNULL2 as {null}. Correct? {null == INT_NULL}")

# Try the filled method of the table
t3 = t2.filled()
masked_vals = t3["int_col"][mask][0]
print(f"Filled table has NULL values as {masked_vals}. Correct? {masked_vals==INT_NULL}")

# write the (unfilled) table out
t2.write("t2.fits", overwrite=True)
with fits.open("t2.fits") as hdul:
    header = hdul[1].header
null = header["TNULL2"]
print(f"Output Table has TNULL2 as {null}. Correct? {null == INT_NULL}")

Output of code snippet:

Using Astropy 5.1.1

NULL for integers has been defined as -1

Create table with BinTableHDU:
BinTableHDU table has TNULL2 as -1. Correct? True
Rows that should be masked are [0, 2, 5]
Table read in by astropy.table:
float_col int_col
--------- -------
       --      --
      1.0       1
       --      --
      3.0       3
      4.0       4
       --      --
      6.0       6
      7.0       7
      8.0       8
      9.0       9
Table read in by astropy.table has TNULL2 as 999999. Correct? False
Float column has default fill value of 1e+20
Table written by astropy.table has TNULL2 as 999999. Correct? False

Create table with astropy.Table:
Table created with Table has TNULL2 as -1. Correct? True
Filled table has NULL values as -1. Correct? True
Output Table has TNULL2 as 999999. Correct? False

Versions

Linux-5.19.0-40-generic-x86_64-with-glibc2.35
Python 3.10.6
astropy 5.1.1
Numpy 1.21.5
pyerfa 2.0.0.1
Scipy 1.8.0

@gpsgibb gpsgibb added the Bug label Apr 26, 2023
@gpsgibb
Copy link
Contributor Author

gpsgibb commented Apr 26, 2023

Looks like when reading in a table and creating a masked column, the fill value is not set: https://github.com/astropy/astropy/blob/main/astropy/io/fits/connect.py#L281

@pllim
Copy link
Member

pllim commented Apr 26, 2023

Thank you for reporting and looking into this! Since you basically know where the bug is, would you be interested in submitting a patch to get credit for fixing it? Usually it would also come with a change log and a regression test.

@gpsgibb
Copy link
Contributor Author

gpsgibb commented Apr 26, 2023

When writing a table out, it looks like Table.as_array() does not respect fill_value for each column, and uses the numpy.ma defaults

@gpsgibb
Copy link
Contributor Author

gpsgibb commented Apr 26, 2023

I can have a go. Of course, if my solution doesn't work/breaks other things then I may need some extra input!

@pllim
Copy link
Member

pllim commented Apr 26, 2023

We would be happy to assist in that case. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants