Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate fibermap units through all steps of pipeline #2068

Closed
weaverba137 opened this issue Jun 7, 2023 · 11 comments · Fixed by #2176
Closed

Propagate fibermap units through all steps of pipeline #2068

weaverba137 opened this issue Jun 7, 2023 · 11 comments · Fixed by #2176
Assignees

Comments

@weaverba137
Copy link
Member

The DESI data model claims that columns in redrock, zpix, ztile and zall files have units, but in fact they do not. There are no TUNIT keywords set in any of the files in the fuji zcatalog directory, and a spot check shows that they are not set in individual redrock files either.

This also makes it much more difficult to propagate units even further downstream to VACs derived from these files.

For the purposes of EDR, I think the response has to be documentation: "even though the data model claims these columns have units, in fact, the files do not contain unit metadata."

All along we've been strongly encouraging VAC authors to put units in data files, but the pipeline is not actually doing that for some of the most critical outputs, so I think for the future, we should really be more careful about propagating units.

@weaverba137
Copy link
Member Author

To speed up the discussion a little bit, I know that units are dropped internally for performance reasons, which explains why units are not present in the redrock files. However, it seems like the assembly of zpix and ztile files would be IO-bound, not CPU-bound, so there would be no performance penalty to adding units to those files.

A better-phrased warning might be: "the units described in the data model for DESI_SPECTRO_REDUX/SPECPROD are the units of the values in those files, however, the files do not necessarily contain the unit metadata themselves (TUNITxxx header cards)."

@sbailey
Copy link
Contributor

sbailey commented Jun 7, 2023

A better-phrased warning might be: "the units described in the data model for DESI_SPECTRO_REDUX/SPECPROD are the units of the values in those files, however, the files do not necessarily contain the unit metadata themselves (TUNITxxx header cards)."

That is how I have always interpreted the meaning of the datamodel — describing the actual units of the data, not the existence (or not) of TUNIT and BUNIT keywords. I don't think that requires a specific caveat in the text of the data model.

We can separately address if there are any unintended side effects from adding TUNIT keywords to the data files in the future. I agree that in general it is preferable for us to propagate those and it was an oversight (not purposeful) that they are not included in the final redshift catalogs.

@weaverba137
Copy link
Member Author

weaverba137 commented Jun 7, 2023

That is how I have always interpreted the meaning of the datamodel.

But will the public interpret it that way? Especially since some files do have units and some don't.

@sbailey
Copy link
Contributor

sbailey commented Jun 8, 2023

Adding units to the files is a post-EDR activity for future specprods / releases.

For the EDR Countdown dashboard I'm moving this into the optional category for adding caveats to the datamodel and/or the known issues section of desidatadocs.

@weaverba137
Copy link
Member Author

I have removed this ticket from EDR Countdown and opened desihub/desidatamodel#178 to document the specific issue of the presence or absence of units in existing files.

@sbailey
Copy link
Contributor

sbailey commented Aug 29, 2023

The units are being dropped far earlier than redrock — even the original fibermap generated by assemble_fibermap doesn't have units.

Short term fix for DR1:

Long term fix for Y3 reprocessing:

  • fix assemble_fibermap so that it outputs units in the first place
  • check each of the stages that propagates the fibermap -> preproc -> frame -> cframe -> spectra -> coadd -> redrock -> zcatalog

The lowest-level place to fix this is in desispec.io.table.read_table, which uses fitsio under the hood and loses units. The original motivation of read_table was to work around the astropy 5.0 "feature" of masking NaNs instead of propagating them (fixed in later astropy versions). However, note that Table.read(filename, 'FIBERMAP') is ~2x slower than Table(fitsio.read(filename, 'FIBERMAP')) on compressed files, so even if NaNs weren't an issue, it would be preferable to use fitsio and then re-parse the header to add units to maintain I/O efficiency.

@weaverba137
Copy link
Member Author

@sbailey, in terms of closing out the 'DR1 Patching' project, what is the status of this ticket?

@sbailey
Copy link
Contributor

sbailey commented Jan 8, 2024

This is done for the purposes of DR1, but not yet for future productions, so I'll remove it from "DR1 Patching" and move it to "Jura". For DR1, units were added to the redshift catalogs at the end as part of running desi_zcatalog; for future productions we want this units propagation to be automatic through all intermediate files. Steps to check/fix:

  • bin/assemble_fibermap
  • desispec.io.table.read_table
  • desispec.io.read_frame / .write_frame round trip
  • desispec.io.read_spectra, desispec.coaddition.coadd, desispec.io.write_spectra
  • bin/desi_group_spectra
  • Redrock bin/rrdesi
  • bin/desi_zcatalog propagating units from input files, without having to specify --add-units (which is intended for when the input files don't have units in the first place, like iron)
  • and probably something else...

Contributions to fix any of those steps would be valuable, even if someone else needs to dig deeper to fix another step.

Note that anywhere that we use fitsio, it is for a specific purposeful reason, so we shouldn't fix this issue just by replacing fitsio with astropy.table.Table stuff.

@sbailey sbailey changed the title Units are not being propagated into redrock files and downstream Propagate fibermap units through all steps of pipeline Jan 8, 2024
@sbailey
Copy link
Contributor

sbailey commented Jan 29, 2024

Additional note: I originally thought that we'd have assemble_fibermap output the correct units in the first place, and then have all subsequent read+write steps propagate those units. That might be technically messy due to astropy Tables vs. astropy BinTableHDUs vs. fitsio etc. We additionally need to support pipeline steps adding new columns including their units.

An alternative to propagate+add would be to have every writer use desispec.io.fibermap.fibermap_columns to re-assert what the units should be and add those while writing, since that is likely what assemble_fibermap would be using in the first place for the original columns. This would also keep the units for all columns in a single place instead of spread over the N>1 steps that add columns.

As belt-and-suspenders, readers could also complain if they read in a fibermap column with a claimed unit different from fibermap_columns.

@weaverba137 weaverba137 self-assigned this Jan 29, 2024
@weaverba137
Copy link
Member Author

Linking to relevant issue #2171.

@weaverba137
Copy link
Member Author

Since we're working on assemble_fibermap anyway, look into taking care of #2149.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants