including input table as metadata in Table like in astropy.io.fits and io.fits HDUList.data/header #6429

Open
richardgmcmahon opened this issue Aug 7, 2017 · 7 comments

Comments

@richardgmcmahon

astropy.io.fits has a method HDUList.filename() which returns the name of the file that was read in.

From a data provenance point of view this is very useful, and it would be good if the same functionality were also available in astropy Table.

It would also be convenient if the filename were carried forward into the Header and Data 'objects' in astropy.io.fits.

i.e. print(data.filename()) would print the name of the file that the data came from.

http://docs.astropy.org/en/stable/io/fits/api/hdulists.html

Thanks

@MSeifert04
Contributor

I think it's not worth it to subclass np.ndarray just to add a filename attribute to the data property of HDUs. So that's a 👎

However, I'm undecided about having it on HDUs or headers. It probably wouldn't be hard to add it but it could be tricky if one puts HDUs from one file in a new HDUList. That could lead to inconsistent filenames inside one HDUList. That depends on how rigorous we want to keep the filenames in sync.

No idea about Tables. I thought they already had some meta in case it was read from a file.

@richardgmcmahon
Author

Thanks for the response, just to add that in principle any user can use the meta attribute in Table to store the input filename.

table.meta['filename'] = filename

I am advocating that, from a scientific point of view, it would be good practice for the input filename to be a defined attribute. I would argue that this is just as important as image or table column units; the input filename is a fundamental descriptor.

@mhvk
Contributor

mhvk commented Aug 8, 2017

Agreed on the importance as well as on the difficulty: what happens when you create a new table, or join two tables? Or read from a filehandle? And should this particular bit of metadata be saved if you write to a file? If so, what should the filename be when the table is read back in? MIDAS partially solved this by having a HISTORY keyword that logged everything that happened, and even though this had obvious limitations, I found this to be an incredibly useful thing. But it partially worked so well since every larger object always was a file, so it was meaningful to refer to things by their name. For python/astropy, this is much less clear - column names are well-defined, but table names are not.

@MSeifert04
Contributor

MSeifert04 commented Aug 8, 2017

I'm not arguing that the filename isn't important. That's one thing I really liked about ccdproc's ImageFileCollection.

However, cascading the filename can't be the only option. I agree that it might be important for Table and NDData, but for the low-level io.fits objects having the filename on the HDUList is (in my opinion) enough.

But I'm open to discussion about this, except for adding it to the HDU's data, because that would require subclassing np.ndarray just to add that attribute.

@pllim
Member

pllim commented Aug 9, 2017

Most STScI FITS files have a FILENAME keyword in the primary header (EXT 0), which, if read into a Table, would be in mytable.meta['header']['FILENAME']. However, if I recall correctly, that keyword was added manually (please correct me if I am wrong) and can be outdated if a file is renamed.

But my point is, if you want the filename in Table metadata, the current solution is to use a FILENAME keyword in your FITS header.

I agree with the above point that it is difficult to keep a "filename" attribute in sync if it is carried around "officially". If I read in a file and then modify its buffer without saving it back to the same filename, the "filename" attribute can be misleading.
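As a rough illustration of the stale-keyword problem, the sketch below compares a FILENAME keyword in the primary header with the name of the file on disk. The function name `filename_keyword_matches` is hypothetical, and the check assumes the keyword stores a bare file name rather than a full path.

```python
import os

from astropy.io import fits


def filename_keyword_matches(path):
    """Compare the FILENAME keyword in EXT 0 with the file's name on disk.

    Returns None if the keyword is absent, otherwise True or False.
    Hypothetical helper, not part of astropy.
    """
    header = fits.getheader(path, 0)
    recorded = header.get('FILENAME')
    if recorded is None:
        return None
    # Assumes the keyword holds a bare file name, not a directory path.
    return str(recorded).strip() == os.path.basename(path)
```

A `False` result only says that the keyword and the current name differ, for example after a rename; it says nothing about whether the contents changed.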

@richardgmcmahon
Author

For info, I have been reading an HDF5 file with h5py, and the File object has a filename attribute which stores the name of the input file. I think it would be good shared practice to have filename metadata in a Table.

e.g.
import h5py
h5 = h5py.File(infile, 'r')  # infile: path to an existing HDF5 file
print('h5.filename:', h5.filename)

I accept the point that there is some danger if the table content is changed. Maybe this needs to be managed in some way with another piece of metadata that indicates that there has been a table change. This could just be binary; True/False.

@MSeifert04
Contributor

Yeah, but an h5py.File is really associated with a file. A Table reads the file, but after reading it is a different entity (maybe not if it's memory-mapped, but I'm not completely sure on that point).

And it's very easy to add it manually, so a simple custom wrapper would be enough to keep the filename around if one needs it. That also makes it clearer that it's the user's responsibility to keep the contents in sync.
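One possible shape for such a wrapper, as a sketch only. The names `SourcedTable` and `load` are hypothetical, and the `modified` flag (following the True/False idea suggested earlier in the thread) is set by hand; nothing here detects edits automatically.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SourcedTable:
    """Hypothetical wrapper pairing table-like data with its source file."""
    data: Any
    filename: Optional[str] = None
    modified: bool = False  # the user must set this; edits are not tracked

    def mark_modified(self) -> None:
        """Record that the in-memory data no longer matches the file."""
        self.modified = True

    def provenance(self) -> str:
        """Describe where the data came from, in a human-readable form."""
        if self.filename is None:
            return "no source file recorded"
        suffix = " (modified since read)" if self.modified else ""
        return f"{self.filename}{suffix}"


def load(path: str, reader: Callable[[str], Any]) -> SourcedTable:
    """Read `path` with any reader callable and remember the path."""
    return SourcedTable(data=reader(path), filename=path)
```

With astropy installed, something like `load('catalog.fits', Table.read)` would wrap a Table, and the wrapper stays independent of whichever reader is passed in.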
