Opening a FITS file in update mode changes file content, even when only reading data #11312
Comments
Welcome to Astropy 👋 and thank you for your first issue! A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details. If you feel that this issue has not been responded to in a timely manner, please leave a comment mentioning our software support engineer @embray, or send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.
Thank you for the detailed report. I'll have to chew on this one a bit, but in short it's not a bug and the behavior is as intended. I don't know what the downstream software is, but it shouldn't be relying on what are technically malformatted FITS files in order to function correctly. You mentioned the data being "corrupted", but it is not actually corrupt as far as FITS is concerned. A better way to implement this use case might be with a text [...] However, I agree that it is surprising that merely updating the header would also change the data. I think the behavior here is a bit blunt: some updates to the header will result in having to update the data anyway, so when it sees that the header has changed it does so. But not all modifications to a header necessitate rewriting the data (unless they change the overall size of the header, in which case the data will at least have to be moved). This could be more intelligent. Either way, whether the data needs to be moved or modified or not, [...] This also falls under the rubric of the FITS verification overhaul discussed in other tickets, which might allow more fine-grained control over what rules are and aren't enforced when reading and writing FITS files.
I meant to add:
The tricky thing here is that if the file is opened in update mode, IIRC numpy doesn't provide an easy way to check whether or not the data has actually been modified. But I'd like to know if there is a way to do that. Some FITS files have checksums embedded, in which case the data can be compared against the previous checksum, but without that it would be onerous to compute one when opening the file. It might also be useful to have a "header-only" update mode, but something like this should disallow updating header keywords that would change the structure of the data.
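As a rough illustration of the checksum comparison mentioned above, here is a minimal sketch that records a digest of a data buffer when it is first seen and compares it again before writing. It is only a sketch under stated assumptions: `data_digest` is a hypothetical helper, the `bytearray` stands in for a table's raw data buffer, and SHA-256 is used for simplicity rather than the checksum algorithm that FITS files embed.

```python
import hashlib


def data_digest(buffer):
    """Return a SHA-256 hex digest of a bytes-like data buffer (hypothetical helper)."""
    return hashlib.sha256(bytes(buffer)).hexdigest()


# Stand-in for the raw bytes of a table's data (made-up contents).
table_bytes = bytearray(b"KEY = 1 " + b" " * 8)

# Digest recorded when the data is first loaded.
digest_at_open = data_digest(table_bytes)

# A read-only access leaves the digest unchanged.
_ = table_bytes[:3]
unchanged = data_digest(table_bytes) == digest_at_open

# An in-place write changes the digest, so a rewrite would be warranted.
table_bytes[0:3] = b"NEW"
changed = data_digest(table_bytes) != digest_at_open
```

The cost is the concern raised above: hashing touches every byte of the data, so doing this unconditionally on open could be expensive for large files.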
After some careful reading of the FITS standard I believe that both the input data and output data meet the standard. Using TDIM to specify a subarray of character strings is ok, and the dimensions specified in TDIM agree with the TFORM value. Also, section 7.3.3.1 of the standard states that character strings may be NULL terminated, but do not have to be. Thus both formats meet the standard. That being said, it could be argued that any code reading the data and then writing it back out should not change the format (terminating space to NULL), as either is valid, but that argument is for another time. As far as numpy not knowing whether data has changed, I agree that it has no mechanism for this. However, I think that it is not on numpy's shoulders to track this; it is merely a data storage mechanism. I believe that it is astropy's responsibility to track this, possibly by subclassing the numpy data to be able to detect if something has changed. I go back to my original assertion, that just reading the data, regardless of the mode used, should not change the contents of the file.
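To make the subclassing idea concrete, here is a minimal sketch of an array that records direct element writes. `TrackedArray` and its `modified` attribute are hypothetical names for illustration only; they are not part of numpy or astropy. The sketch catches only assignments made through `__setitem__` on this object: writes made through a plain `ndarray` view of the same buffer, or through in-place ufuncs with `out=`, would go unnoticed.

```python
import numpy as np


class TrackedArray(np.ndarray):
    """Hypothetical ndarray subclass that flags direct element writes."""

    def __new__(cls, input_array):
        obj = np.asarray(input_array).view(cls)
        obj.modified = False
        return obj

    def __array_finalize__(self, obj):
        # Also called for views and slices; inherit the parent's flag if any.
        self.modified = getattr(obj, "modified", False)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.modified = True


data = TrackedArray(np.array([b"ab ", b"a  "], dtype="S3"))

first = data[0]                  # reading an element does not set the flag
flag_after_read = data.modified

data[1] = b"xyz"                 # writing an element does
flag_after_write = data.modified
```

A writer could then skip rewriting the data section whenever the flag is still unset, subject to the limitations noted above.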
Yes, we are mostly in agreement I think. Will need a closer look.
For reference, here is the exact text from the FITS standard, as you mentioned, from section 7.3.3.1:
So indeed, there is nothing here about stripping trailing whitespace at all, and I can't find anything about that in the FITS standard after all. I'll have to look at the code and see if there is any indication as to why this is being done.
There appears to be quite a bit of legacy to this. The issue definitely sounded familiar to me, but I couldn't remember exactly when or why it came about. Part of it has to do with Numpy recarrays, which, as far back as when I first started working on PyFITS, were used internally for representing data in FITS binary tables. Recarrays had (until later versions of Numpy) a feature that string fields were returned using the old chararray type:

    >>> s = np.char.array([b'abc', b'ab ', b'a  ', b'   '])
    >>> s
    chararray([b'abc', b'ab', b'a', ''], dtype='|S3')

IIRC this is just for display purposes though; the underlying data buffer is not modified:

    >>> s.tofile('s.dat')
    >>> !hexdump -C s.dat
    00000000  61 62 63 61 62 20 61 20  20 20 20 20              |abcab a     |
    0000000c

But there is also a bit in the PyFITS code, the first reference to which I found here (though it's been moved about a lot since then), about explicitly stripping whitespace at the end of string fields and replacing it with nulls. I can't find anything in the commit messages explaining this, save for references to some issues in the old Trac bug tracker. Does anyone at STScI know if that is still archived somewhere? Aside from the broader issue of not touching the data in update mode if it isn't modified, I wonder if we should remove this functionality. It's not mentioned in the FITS standard, and can in fact break data like in @astro-friedel's case. This functionality has been around so long, though, that I also wonder what it would break. I think if anything it might be a user convenience. In most cases I think users would not want trailing whitespace in string fields. But obviously if there is a use case for that it should be supported, and in fact should probably be the default.
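The effect of stripping trailing whitespace and padding back with nulls can be sketched in plain Python. This mimics the behavior described in this thread on fixed-width byte fields; it is not the actual PyFITS/astropy code, and `WIDTH` and `fields` are made-up values for illustration.

```python
WIDTH = 3  # fixed field width in bytes (illustrative value)

# Four space-padded, fixed-width string fields.
fields = [b"abc", b"ab ", b"a  ", b"   "]
original = b"".join(fields)

# Strip trailing spaces, then pad each field back to WIDTH with NUL bytes,
# so every padding space ends up as a NUL in the rewritten buffer.
rewritten = b"".join(f.rstrip(b" ").ljust(WIDTH, b"\x00") for f in fields)

print(original)   # b'abcab a     '
print(rewritten)  # b'abcab\x00a\x00\x00\x00\x00\x00'
```

The buffer keeps its length, but a reader that expects space padding no longer sees the same bytes, which matches the one-NULL-per-space symptom in the report.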
So that would be this? astropy/astropy/io/fits/fitsrec.py, lines 1215 to 1218 in 84bbbc1
I remember some discussions with @pllim about the Trac, maybe she knows?
Re: Trac -- Oh, dear... If you really need that info, please send in a help call to hsthelp.stsci.edu. I think that's all I am allowed to say officially.
That's really too bad about the Trac site... valuable information potentially lost. I'm not so sure what to do about this. I think the rstrip stuff should be removed, but I worry what impact that will have on data products output by existing code that are relying on this (perhaps carelessly). Here's a doable approach that would at least take some care:
The bigger issue here is that updating header fields that don't actually require a change to the data should not trigger any modifications to the data, even if it has to be relocated. I think I will open a new issue about the rstrip stuff.
@embray,
Description
General Description
I maintain a set of Python code that reads part of a binary table and updates the primary HDU header fields based on data in the table. The file is opened in 'update' mode, as we expect to be updating the header values. After switching from pyfits to astropy to read the files, it was noticed that the resulting FITS files were becoming corrupted.
Specifics
The Python code reads in data from a FITS binary table in one of the secondary HDUs, grabs some specific values, and then updates the header of the primary HDU based on those values. The header for the table extension is:
The table is essentially one long row which consists of 80 character strings (mimicking a header).
(Note: I am aware that it is not the best way to store the data but I do not have control over the format of the table)
It was noticed that after the header update the FITS file was becoming unreadable to downstream programs. In the original file the 80-character strings are all padded with spaces on the right-hand side to ensure that they are all the correct length. But after the update all of the padding spaces are replaced with NULL, and not just a single NULL per string, but a NULL per space. I know the FITS standard states that text-based columns in a table are to be NULL terminated with no space padding, and I am guessing that when the table data are read the astropy code converts the data to the standard; however, this breaks the structure of the table (especially since the strings are not just NULL terminated, but terminated by a group of NULLs). So one issue is that the resulting data should not be corrupted, but this may be a very complex issue due to the way the table is constructed.
After doing extensive testing it was found that it was not the updating of the header that caused the corruption, but just the reading of the table. My primary issue is that if an operation is just reading data, regardless of the mode the file is opened in, the content of the file should not change.
This is still present in the development version.
Expected behavior
I would expect that the file would not change if the data were just read, regardless of the mode that was used to open the file.
Actual behavior
After reading data from the file and closing the file handle, the table data were changed.
Steps to Reproduce
This file can be used as an example: testfile.fits
System Details