Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening a FITS file in update mode changes file content, even when only reading data #11312

Open
astro-friedel opened this issue Feb 9, 2021 · 11 comments

Comments

@astro-friedel
Copy link

astro-friedel commented Feb 9, 2021

Description

General Description
I have a set of python code that I maintain which reads part of a Binary table and updates the primary HDU header fields based on data in the table. The file is opened in 'update' mode as we expect to be updating the header values. After updating from pyfits to astropy to read the files it was noticed that the resulting FITS files were becoming corrupted.

Specifics
The python code reads in data from a FITS binary table in one of the secondary HDUs, grabs some specific values, and then updates the header of the primary HDU based on those values. The header for the table extension is:

XTENSION= 'BINTABLE'           / THIS IS A BINARY TABLE (FROM THE LDACTOOLS)    
BITPIX  =                    8 /                                                
NAXIS   =                    2 /                                                
NAXIS1  =                28800 / BYTES PER ROW                                  
NAXIS2  =                    1 / NUMBER OF ROWS                                 
PCOUNT  =                    0 / RANDOM PARAMETER COUNT                         
GCOUNT  =                    1 / GROUP COUNT                                    
TFIELDS =                    1 / FIELDS PER ROWS                                
EXTNAME = 'LDAC_IMHEAD'        / TABLE NAME                                     
TTYPE1  = 'Field Header Card'                                                   
TFORM1  = '28800A  '                                                            
TDIM1   = '(80, 360)'                                                           
END                                                                             

The table is essentially one long row which consists of 80 character strings (mimicking a header).
(Note: I am aware that it is not the best way to store the data but I do not have control over the format of the table)

It was noticed that after the header update the FITS table the file was becoming unreadable to down stream programs. In the original file the 80 character strings are all padded with spaces on the right hand side to ensure that they are all the correct length. But after the update all of the padding spaces are replaced with NULL, and not just a single NULL per string, but a NULL per space. I know the FITS standard states that text based columns in a table are to be NULL terminated with no space padding and I am guessing that when the table data are read the astropy code converts the data to the standard, however this breaks the structure of the table (especially since the strings are not just NULL terminated, but terminated by a group of NULLs). So one issue is that the resulting data should not be corrupted, but this may be a very complex issue due to the way the table is constructed.

After doing extensive testing it was found that it was not the updating of the header that caused the corruption, but just the reading of the table. My primary issue is that if an operation is just reading data, regardless of the mode the file is opened in, the content of the file should not change.

This is still present in the development version.

Expected behavior

I would expect that the file would not change if the data were just read, regardless of the mode that was used to open the file.

Actual behavior

After reading data from the file and closing the file handle, the table data were changed.

Steps to Reproduce

This file can be used as an example: testfile.fits

  1. make a copy of the input FITS file
  2. run the python code below
  3. diff the testfile.fits and the copy to see that it changed
from astropy.io import fits
hdulist = fits.open('testfile.fits', 'update')
a = hdulist['LDAC_IMHEAD'].data[0][0]
hdulist.close()

System Details

>>> import platform; print(platform.platform())
Linux-3.10.0-1160.11.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core
>>> import sys; print("Python", sys.version)
Python 3.7.1 (default, Feb 18 2020, 11:23:17) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
>>> import numpy; print("Numpy", numpy.__version__)
Numpy 1.19.1
>>> import astropy; print("astropy", astropy.__version__)
astropy 4.0
>>> import scipy; print("Scipy", scipy.__version__)
Scipy 1.5.2
>>> import matplotlib; print("Matplotlib", matplotlib.__version__)
Matplotlib 3.1.2
@github-actions
Copy link

github-actions bot commented Feb 9, 2021

Welcome to Astropy 👋 and thank you for your first issue!

A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details.

If you feel that this issue has not been responded to in a timely manner, please leave a comment mentioning our software support engineer @embray, or send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.

@pllim pllim added the io.fits label Feb 9, 2021
@embray
Copy link
Member

embray commented Feb 10, 2021

Thank you for the detailed report. I'll have to chew in this one a bit, but in short it's not a bug and the behavior is as intended. I don't know what the downstream software is, but it shouldn't be relying on what are technically malformatted FITS files in order to function correctly. You mentioned the data being "corrupted" but it is not actually corrupt as far as FITS is concerned. A better way to implement this use case might be with a text TABLE HDU which does pad fields with spaces.

However, I agree that merely updating the header would also change the data is surprising. I think the behavior here is a blunt: Some updates the the header will result in having to update the data anyways, so when it sees that the header has changed it does so. But not all modifications to a header necessitate rewriting the data (unless it changes the overall size of the header, in which case it will at least require moving the data). This could be more intelligent. Either way, whether the data needs to be moved or modified or not, astropy.io.fits tries very hard to NOT allow writing invalid FITS files, precisely to discourage the kinds of use cases that rely on malformatted files.

This also falls under the rubric of FITS verification overhaul discussed in other tickets which might allow more fine-grained control over what rules are and aren't enforced when reading and writing FITS files.

@embray
Copy link
Member

embray commented Feb 10, 2021

I meant to add:

My primary issue is that if an operation is just reading data, regardless of the mode the file is opened in, the content of the file should not change.

The tricky thing here is if the file is opened in update mode, IIRC numpy doesn't provide an easy way to check whether or not the data has actually been modified. But I'd like to know if there is a way to do that. Some FITS files have checksums embedded in which case the data can be compared against the previous checksum, but without that it would be onerous to compute one when opening the file.

It might also be useful to have a "header-only" update mode, but something like this should disallow updating header keywords that would change the structure of the data.

@pllim pllim added the question label Feb 10, 2021
@astro-friedel
Copy link
Author

After some careful reading of the FITS standards I believe that both the input data and output data meet the standards. Using TDIM to specify a subarray of character strings is ok, and the dimensions specified in TDIM agree with the TFORM value. Also 7.3.3.1 of the standard states that character strings may be NULL terminated, but do not have to be. Thus both formats meet the standard. That being said, it could be argued that any code reading the data and then writing it back out should not change the format (terminating space to NULL), as either are valid, but that argument is for another time.

As far as numpy not knowing whether data has changed, I agree that it has no mechanism for this. However, I think that it is not on numpy's shoulders to track this, it is merely a data storage mechanism. I believe that it is astropy's responsibility to track this, possibly by subclassing the numpy data to be able to detect if something has changed. I go back to my original assertion, that just reading the data, regardless of the mode used, should not change the contents of the file.

@embray
Copy link
Member

embray commented Feb 10, 2021

Yes, we are mostly in agreement I think. Will need a closer look.

@embray
Copy link
Member

embray commented Feb 15, 2021

For reference, here is the exact text from the FITS standard, as you mentioned, from section 7.3.3.1:

Character. If the value of the TFORMn keyword specifies Data Type ’A’, Field n shall contain a character string of zero-or-more members, composed of the restricted set of ASCII-text characters. This character string may be terminated before the length specified by the repeat count by an ASCII NULL (hexadecimal code 00). Characters after the first ASCII NULL are not defined. A string with the number of characters specified by the repeat count is not NULL terminated. Null strings are defined by the presence of an ASCII NULL as the first character.

So indeed, there is nothing here about stripping trailing whitespace at all, and I can't find anything about that in the FITS standard after all. I'll have to look at the code and see if there is any indication as to why this is being done.

@embray
Copy link
Member

embray commented Feb 15, 2021

There appears to be quite a bit of legacy to this--the issue definitely sounded familiar to me but I couldn't remember exactly when or why it came about.

Part of it has to do with the fact that Numpy recarrays, which as far back as when I first started working on PyFITS were used internally for representing data in FITS binary tables. Recarrays had (until later versions of Numpy) a feature that string fields were returned using the old chararray type, which is largely considered obsolete. It has this feature of automatically r-stripping whitespace when displayed:

>>> s = np.char.array([b'abc', b'ab ', b'a  ', b'   '])
>>> s
chararray([b'abc', b'ab', b'a', ''], dtype='|S3')

IIRC this is just for display purposes though; the underlying data buffer is not modified:

>>> s.tofile('s.dat')
>>> !hexdump -C s.dat
00000000  61 62 63 61 62 20 61 20  20 20 20 20              |abcab a     |
0000000c

But there is also a bit in the PyFITS code the first reference to which I found here (though it's been moved about a lot since then) about explicitly stripping whitespace at the end of string fields and replacing them with nulls. I can't find anything in the commit messages explaining this, save for references to some issues in the old Trac bug tracker. Does anyone at STScI know if that is still archived somewhere?

Aside from the broader issue of not touching the data in update mode if it isn't modified, I wonder if we should remove this functionality. It's not mentioned in the FITS standard, and can in fact break data like in @astro-friedel's case.

This functionality has been around so long though, I also wonder what it would break. I think if anything it might be a user convenience. In most cases I think users would not want trailing whitespace in string fields. But obviously if there is a use case for that it should be supported, and in fact should probably be the default.

@saimn
Copy link
Contributor

saimn commented Feb 15, 2021

So that would be this ?

# Ensure that blanks at the end of each string are
# converted to nulls instead of spaces, see Trac #15
# and #111
_rstrip_inplace(output_field)

I remember some discussions with @pllim about the Trac, maybe she knows ?

@pllim
Copy link
Member

pllim commented Feb 15, 2021

Re: Trac -- Oh, dear... If you really need that info, please send in a help call to hsthelp.stsci.edu . I think that's all I am allowed to say officially.

@embray
Copy link
Member

embray commented Feb 22, 2021

That's really too bad about the Trac site....valuable information potentially lost.

I'm not so sure what to do about this. I think the rstrip stuff should be removed, but I worry what impact that will have on data products output by existing code that are relying on this (perhaps carelessly).

Here's a doable approach that would at least take some care:

  1. Add an option to BinTableHDU/FITS_rec for r-stripping string fields, by default set it to True (for now).
  2. Check if any string fields actually contain trailing whitespace before stripping it: in this case (and only in this case) issue a DeprecationWarning that automatic r-stripping of whitespace will go away in the future.
  3. After a deprecation period set the default to False.

The DeprecationWarning might just get ignored, of course, but it's better than nothing.

The bigger issue here, which is that updating header fields that don't actually require a change to the data should not trigger any modifications to the data, even if it has to be relocated. I think I will open a new issue about the r-strip stuff.

@bhavyakh
Copy link
Contributor

bhavyakh commented Apr 16, 2021

@embray,
By "adding an option to BinTableHDUs", do you refer to adding a boolean parameter in the functions (from_columns, fromstring, load, readfrom,open, etc.) for r-stripping or something else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants