Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] EDR cleanup #167

Merged
merged 37 commits into from Jun 1, 2023
Merged

[WIP] EDR cleanup #167

merged 37 commits into from Jun 1, 2023

Conversation

@weaverba137 weaverba137 added the WIP Work in Progress label May 24, 2023
@weaverba137 weaverba137 requested a review from sbailey May 24, 2023 23:18
@weaverba137 weaverba137 self-assigned this May 24, 2023
@coveralls
Copy link

coveralls commented May 24, 2023

Coverage Status

coverage: 100.0%. remained the same when pulling 8a398d0 on edr-cleanup into 832e348 on main.

@weaverba137
Copy link
Member Author

I added a unit test that I know will fail (at least for now), because these columns do not have types or descriptions:

IN_RADIUS,,arcsec,
NEAR_RADIUS,,arcsec,
PROBA_RF,,,
REF_MAG,,mag,

None of these columns appear in any data model file, so it's not obvious where these come from.

@weaverba137
Copy link
Member Author

@aureliocarnero do you know where the mystery columns listed above come from? Were they possibly in an early version of the LSS catalog?

@geordie666
Copy link
Contributor

geordie666 commented May 25, 2023

Porting a message I just sent Ben on Slack:

It's likely these columns may have been inherited from the desitarget units database?

If so:

IN_RADIUS and NEAR_RADIUS are the separations used to set the IN_BRIGHT_OBJECT and NEAR_BRIGHT_OBJECT bits for bright star masks. The default values of these separations (which are all that were ever used for DESI) are set in desitarget.brightmask.radii().

REF_MAG is the reference magnitude used to inform the bright star mask. As noted in the link to the desitarget.brightmask.radii() function, above, IN_BRIGHT_OBJECT is set if "a certain magnitude" is < 12; but, that magnitude is derived from different catalogs. The magnitude that is used to populate REF_MAG can be, in order of preference G-band from Gaia, then HP or BT or VT magnitude from Tycho. The reason this can be taken from different source catalogs is that some stars are too bright to appear in Gaia.

PROBA_RF is the probability of being a quasar generated by the quasar Random Forest targeting algorithm. The quasar RF targeting algorithm is something of a black box, but presumably it's described in sufficient detail in the quasar target selection paper.

@weaverba137
Copy link
Member Author

Thank you @geordie666.

  • For IN_RADIUS and NEAR_RADIUS, it's clear that function defines those, but the columns must be pre-existing in the input argument mag, correct? That's my reading of the docstring. In what file are those inputs or outputs? Ultimately, if those columns will never actually appear in any file ever used by DESI, do we need them in the data model?
  • Similar question for REF_MAG: It seems like that's mag, but why doesn't the function call mag['REF_MAG']?
  • I looked at the QSO paper, but could not find an explicit reference to PROBA_RF; the closest match was QSO_RF_PASS4 and QSO_RF_PASS8.

@weaverba137
Copy link
Member Author

Note: I'm using the definition recarray is a numpy array that contains named columns.

@geordie666
Copy link
Contributor

Based on a comment from Ben on Slack, I think the resolution is straightforward, here.

IN_RADIUS, NEAR_RADIUS, PROBA_RF and REF_MAG are quantities that are in the desitarget units yaml file because they could be written by desitarget code, if certain command line flags are passed. In other words, the desitarget units database is careful to account for cases where someone decides to write out a non-standard column.

I don't think these columns will be included in the EDR. So, there's no need to include them in the data model. As I noted above, I suspect these quantities may have been blindly inherited from the desitarget units yaml file, even though they don't correspond to columns that are written to the production targeting files that were published with the ETS.

@weaverba137
Copy link
Member Author

Thank you @geordie666!

@aureliocarnero
Copy link
Contributor

Dear @weaverba137 et al. I think you found the origin to these columns. They are not used in LSS. Cheers

@weaverba137
Copy link
Member Author

Update: as of now there are no longer any "default" descriptions as described in #163. Instead, every remaining instance now has TODO: description needed. This includes column descriptions that were formerly blank and therefore almost impossible to find. Now we can just

grep -r TODO doc

to find missing information.

At this point there is nothing special about coordinates or pm files (#120) in DESI_SPECTRO_DATA that is not shared by other files that need their TODOs filled in.

I believe I have also satisfied all the suggestions in #140, except where TODOs with paper links need to be filled in.

That just leaves verification and remaining TODOs to work on.

@weaverba137
Copy link
Member Author

Mapping data model files to files that are actually in EDR:

# Already tested recently
DESI_ROOT/vac/RELEASE/lss
# To be tested: disk path is obvious
DESI_ROOT/survey/fiberassign
DESI_SPECTRO_CALIB = DESI_ROOT/spectro/desi_spectro_calib
DESI_SPECTRO_DATA = DESI_ROOT/spectro/data
DESI_SPECTRO_REDUX/SPECPROD = DESI_ROOT/spectro/redux/fuji
DESI_TARGET/fiberassign/tiles
DESI_TARGET/TARG_DIR
DESI_TARGET/SCND_DIR
# To be tested: disk path is missing or not obvious.
DESI_ROOT/survey/GFA -- has no data model?
DESI_ROOT/survey/ops -- has no data model? Is this DESI_SURVEYOPS?
DESI_SURVEYOPS = DESI_ROOT/survey/ops? It doesn't look the same
DESI_TARGET/skybricks/ -- has no data model?
DESI_TARGET/skyhealpixs/ -- has no data model?
# Unknown status
DESI_BASIS_TEMPLATES
MTL files?
# Will not be tested.
DESI_SPECTRO_SIM is not in EDR.
DESIMODEL is not in EDR. Plus it's a can of worms.
DESISURVEY_OUTPUT is not in EDR.
PROTODESI is not in EDR.

@weaverba137
Copy link
Member Author

None of the intermediate fiberassign files matches what is on disk for EDR. The directory path is significantly different.

@weaverba137
Copy link
Member Author

weaverba137 commented May 31, 2023

In the (non-intermediate) fiberassign files, the uncompressed fiberassign-EXPID.fits files have a lot of differences from the compressed fiberassign-EXPID.fits.gz files. But the data model for the compressed files is good.

@araichoor
Copy link
Contributor

sorry, just checking: what do you mean by "a lot of differences"?
also, I only see one such case:

find -L /global/cfs/cdirs/desi/survey/fiberassign/{SV1,SV2,SV3} -name "fiberassign-??????.fits*" | grep -v gz

/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080609.fits

this folder is a bit "special", in the sense that we discovered that the fiberassign-TILEID.fits.gz files there were not the ones that ended up being used... i.e. we must have re-designed those (first day of sv1, big rush...)
that s why files there are labelled "bugged" there.
I see that fiberassign-080609.fits is the only uncompressed file there... maybe a leftover of when we looked at the files to discover that; if useful, I can try to dig the history here, please let me know.

ls -1 /global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign*fits*

/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080605-bugged.fits
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080605.fits.gz
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080606-bugged.fits
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080606.fits.gz
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080607-bugged.fits
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080607.fits.gz
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080608-bugged.fits
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080608.fits.gz
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080609-bugged.fits
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080609.fits
/global/cfs/cdirs/desi/survey/fiberassign/SV1/20201212/fiberassign-080610.fits.gz

@weaverba137
Copy link
Member Author

weaverba137 commented May 31, 2023

@araichoor, please be careful here.

The uncompressed non-intermediate files have many differences from the compressed files. Those are the differences I am talking about.

That said,

None of the intermediate fiberassign files matches what is on disk for EDR. The directory path is significantly different.

That is an issue with the directory path not the files. The intermediate files are described in this directory: https://github.com/desihub/desidatamodel/tree/main/doc/DESI_ROOT/survey/fiberassign/SURVEY/TILEXX.

However, the pattern in EDR is:

$DESI_ROOT/survey/fiberassign/SURVEY/NIGHT/

And SURVEY is different from other instances of SURVEY because capitalization is used.

@weaverba137
Copy link
Member Author

@araichoor, PS I am using this space to take notes on the many data model tasks left to perform before EDR, and the notes are very raw. They may even be edited after being written. I would not spend a lot of time reading this thread until the PR is actually ready for review.

@weaverba137
Copy link
Member Author

Every *targets* file in DESI_TARGET/TARG_DIR matches this:

find catalogs/ -type f -name \*targets\*.fits | grep -E -v 'catalogs/(dr[89]|gaiadr2)/[0-9.]+/targets/(sv1|sv2|sv3|main|main2|cmx)/(resolve|secondary)/(bright|dark|no-obscon|backup|supp)/(cmx|sv1|sv2|sv3|main2|)targets-(bright|dark|no-obscon|backup|supp)-(hp-[0-9]+|secondary)(-dr9photometry|)\.fits'

@weaverba137
Copy link
Member Author

Merging this after discussion with @sbailey. Remaining issues can be settled with more targeted PRs.

@weaverba137 weaverba137 merged commit 69f82aa into main Jun 1, 2023
22 checks passed
@weaverba137 weaverba137 deleted the edr-cleanup branch June 1, 2023 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work in Progress
Projects
None yet
6 participants