New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Emlinefit #1386
Conversation
Thanks @araichoor for this PR and for including the full descriptions and example output files. @moustakas could you take a look at this? It's fine for this to be simpler than fastspecfit, but I also want to avoid arbitrary incompatibilities in variable names, wavelength definitions, 10e-17 prefixes or not, etc. Any cross-checks of conceptually similar quantities from fastspecfit would also be welcome. Questions that occur to me that I haven't looked into:
|
thanks for those remarks! answering your specific points: - SB: "how are masked but ivar != 0 input wavelengths handled?" - SB: "do fits include ivar weighting?" desispec/bin/desi_emlinefit_afterburner Lines 305 to 314 in 7e5c17b
additionally, following David s suggestion, I consider all pixels in the spectrographs overlap region, ie coming from both arms. - SB: "what is in output for a line if the redshifted wavelength is outside the wavelength range of the spectrum (NaN?)? completely masked? not completely masked, but e.g. only one pixel is unmasked such that you can't do an N parameter fit?" desispec/bin/desi_emlinefit_afterburner Lines 254 to 266 in 7e5c17b
I first discard pixels with infinite flux or ivar>0; then if I do not forget cases, I think I return NaN if one of the following happen:
those are kind of ad hoc choices, but look overall ok for my needs for the ELGs. |
about consistency with fastspecfit for line wavelengths: JM: https://github.com/desihub/fastspecfit/blob/everest-1/py/fastspecfit/data/emlines.ecsv
for other quantities: |
I had not realized that masked pixels did not have |
In earlier steps of the pipeline we purposefully do not immediately zero-out the ivars of masked pixels so that we don't lose information. However, now that your ask/wonder/mention it, I remembered that we do set ivar=0 for masked pixels when creating the cframe files, and thus the spectra and coadds as well, because we thought end-users would expect that. i.e. I think this is a non-issue (but you are welcome to double check me on that and report any cases with a mask but ivar != 0 in cframe / spectra / coadds) |
Thanks for the explanations @araichoor . @moustakas any more comments on this? In particular for data model names (SHARE, RFEW, ...) and units? Algorithmically fastspecfit and emlinefit can be different, but let's try to have common variable names and units for the same concepts. Could you explicitly cross check that and many any suggestions? Thanks. |
Apologies for the delay. Algorithmically, @araichoor has already shown that there is good agreement between his measurements and my independent measurements with
Finally, a broader question (especially for @sbailey) that is semi-related to this PR but which should not be a blocking factor. We now have two after-burners for emission lines, With more capabilities, of course, there is a computational cost. |
thanks a lot for all the relevant comments/questions! I answer below to each item (sorry for the long message). But first a general comment relevant for several items: this afterburner is primarily designed to be used for the ELG zreliable criterion , which currently relies on the OII SNR; maybe the OIII SNR will be used later, but probably not much more. Now going per item:
|
for info, I hope to be able to implement those changes today/tomorrow. |
Minor: please put the I/O functions in desispec.io.emlinefit to be grouped with the other I/O sub-modules under desispec.io. |
oh right, thanks for noticing that; sure, will do. |
As said, for conveniency, I also propagate those columns from the redrock file: However, I notice that - obviously - |
With further thinking, I think I ll actually remove the
That way, along with the Besides, I also plan to propagate the arguments of
|
@araichoor I think you should consider continuing to fit all the target classes, at least for the time being. You've shown that the compute cost is negligible and it would be really helpful to have a comparison dataset for |
I see, thanks for the feedback. Another point if we go this way: with comparing BGS/LRG measurements with fastspecfit, I notice that the default SIGMA (observed Gaussian width) settings I adopted, based on the ELGs, are no more valid for those different tracers; I set SIGMA < 10 A, whereas it can be larger for BGS/LRGs; I can see if I can allow larger SIGMA, but with not affecting the ELG results.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for delayed review. This is nice and fast and meets the core needs, but I've put some comments inline related to code structure and future maintainability.
The core algorithm is isolated in the emlines_gaussfit
function (good), but other pieces seem to mix I/O, algorithms, and plotting. Could you try to separate these more so that the outer wrapper script looks more like:
stuff = desispec.io.emlinefit.read_stuff(coadd_filename, redrock_filename)
results = desispec.emlinefit.emlines_gaussfit(stuff)
desispec.io.emlinefit.write_emlines(outfile, results)
if args.outpdf:
desispec.io.emlinefit.plot_emlines(args.outpdf, results)
(don't actually use the name "stuff", that's a placeholder for whatever would be read and returned)
Maybe that's what get_emlines
is supposed to be? But in that case get_targetids
should be working on stuff that get_emlines
is already reading instead of working off of filenames directly. Please try to restructure this into:
- desispec.io.emlinefit - I/O and convenience reformatting, but no calculations
- desispec.emlinefit - the core algorithms but no I/O, in a form that would be useful to someone calling on a simulated spectrum or from a Jupyter notebook.
- bin/desi_emlinefit_afterburner - puts those two together
If that doesn't make sense let me know and we can chat in person to clarify. Thanks.
py/desispec/io/emlinefit.py
Outdated
mydict[emname]["fitivar"][i] = iv | ||
mydict[emname]["model"][i] = m | ||
# AR plot? | ||
if outpdf is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually we want to separate calculation from plotting, e.g. so that if you want to update the plots you can call it with some data to get a new plot without having to redo intermediate calculations. Could you move out this plotting into a separate function that would take the outputs of emlines_gaussfit
and make a plot, or does it rely upon some internal variable that isn't returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, I've moved that to desispec.io.emlinefit.plot_emlines()
.
py/desispec/io/emlinefit.py
Outdated
|
||
allowed_emnames = ["OII", "HDELTA", "HGAMMA", "HBETA", "OIII", "HALPHA"] | ||
|
||
def get_targetids(redrockfn, bitnames, log=get_logger()): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function opens and reads the redrock file twice to perform a selection of targetids. I think it would be cleaner to make this a non-I/O function that takes a fibermap and bitnames and returns targetids (or maybe the indices of the fibermap or a boolean array of matches).
Some of this logic looks like it would be cleaner to use desitarget.targets.main_cmx_or_sv
to figure out what the relevant target column is and what mask to import. Please check if that would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I've kept the function as is (for the time being).
Maybe that doesn't really matter as the redrock*fits
files are small, but by my experience, I think it's faster to proceed this way:
- first only read the file structure with
fits.open()
, but not the data, to inspect the column content (that step is super-quick) - then read only the relevant columns.
than just reading the whole data.
Same fordesitarget.targets.main_cmx_or_sv
: that function assumes that the whole data have been read.
Note that in the new code version, I also use that two-steps approach, e.g.:
desispec/py/desispec/io/emlinefit.py
Lines 148 to 164 in febddee
h = fits.open(redrock) | |
extnames = [h[i].header["EXTNAME"] for i in range(1, len(h))] | |
if "REDSHIFTS" in extnames: | |
rr_extname = "REDSHIFTS" | |
elif "ZBEST" in extnames: | |
rr_extname = "ZBEST" | |
else: | |
msg = "{} has neither REDSHIFTS or ZBEST extension".format(redrock) | |
log.error(msg) | |
raise RuntimeError(msg) | |
# AR rr_keys, fm_keys: restrict to existing ones | |
rmv_rr_keys = [key for key in rr_keys.split(",") if key not in h[rr_extname].columns.names] | |
if len(rmv_rr_keys) > 0: | |
log.info("{} removed from rr_keys, as not present in {}".format(",".join(rmv_rr_keys), rr_extname)) | |
rr_keys = ",".join([key for key in rr_keys.split(",") if key not in rmv_rr_keys]) | |
# AR redrock: reading | |
rr = fitsio.read(redrock, ext=rr_extname, columns=rr_keys.split(",")) |
Please let me know if there is a better way to proceed.
Thanks a lot @sbailey for detailed+pedagogical review! Except the comment on I've checked on very cases that the new version produces the same answer as the old version, but I'll test more tomorrow; though I'm already pushing/sharing the code, so that you have time to re-review when that suits you best. I now reply to each comment. |
And before I forget: besides the code-structure changes, one key question is if we do want to provide those fits for all spectra or not. |
Thanks for the very quick updates. The new structure looks better. By default for the pipeline let's run it on all targets, and document that the file is primarily intended for ELGs. Complete failures have NaN, and you also propagate IVARs and CHI2s if someone wants to check goodness / significance of fit when applying to BGS or LRGs, so even if they aren't "good" fits, there are ways to inspect how good they are.
So, my proposed structure is:
Background motivation: our experience at NERSC is that when the filesystems work, they are so fast that in most cases it doesn't matter (looping over tractor and target selection inputs is a case that gets big enough where it does matter). But there is a small probability that basic operations like open, close, read will hang and we are more dominated by those outliers than whether a given read takes 3ms or 30ms. FWIW, the best structure for I/O is to use fitsio and open the file once and do all the necessary reads before closing, e.g. using a "context manager" that will auto-cleanup if something goes wrong during the read: with fitsio.FITS(redrock_filename) as rr:
redshifts = rr['REDSHIFTS'].read()
fibermap = rr['FIBERMAP'].read() instead of redshifts = fitsio.read(redrock_filename, 'REDSHIFTS')
fibermap = fitsio.read(redrock_filename, 'FIBERMAP') Full disclosure: I've certainly written a lot of code that looks like the second case instead of first, but I acknowledge that the first is better from an I/O metadata load standpoint. |
Again, thanks a lot for the reply (+testing the speed of different calls!). I support the idea of keeping Lastly, as long as you provide me useful tips in using fitsio: if I read the redrock/fibermap catalogs with all the columns, I still need to then remove all columns except rr_keys/fm_keys, as those will be propagated in the final output catalog. |
yes, sounds good.
I had forgotten about the issue of what gets propagated to the output catalog. You can filter down to just the columns you want with e.g. blat = fibermap['TARGETID', 'DESI_TARGET']
|
I should have addressed your last batch of comments:
Plus few code syntax cleanings. |
Thanks for the updates; the new structure looks great. As compensation for your patience, I have added the final set of changes to add unit tests:
These aren't sophisticated tests for correctness, but are basic "does it run without crashing" tests that will help us in the future to catch any problems with changes to numpy, astropy, fitsio, etc. before running this in the full pipeline. Speaking of the full pipeline -- I think this is ready to merge now, but I'll leave it open for you to look at what I did with the tests and complain/question if needed. After cori is back let's add the final hooks to include this in the daily pipeline desi_tile_redshifts (which can be a separate PR). My structural requests have caused your original single-file-script to become multiple files:
While that is admittedly somewhat of a pain, it is motivated by:
Final notes:
|
Interesting. Then an algorithm choice question: I was expecting that if the data were negative at the location of a line, you'd just get a negative FLUX rather than forcing it to NaN. To me that is different from a "failed" fit from not having enough good pixels at that wavelength (or having the line completely redshifted out of the wavelength coverage). Is there a specific reason to not propagate a negative flux fit? To me this is conceptually equivalent to not clipping photometric fluxes to 0 when making an imaging catalog -- let the data fluctuate, and sometimes the fit will be negative even though you know the object photons are strictly positive, but the negative tail can be useful in assessing the noise/statistical properties of the dataset. Also, can you point to the piece of code that sets negative fits to NaN (or otherwise doesn't overwrite the default NaN because the fit was negative)? From a quick glance that isn't obvious to me where that happens. |
I acknowledge my approach may be a bit "brutal", but the idea was end-user oriented: if there is a not-NaN measurement, then it can be trusted (at least for the ELGs :)); no need to make several cuts to decide if the fit is trustable or not; that of course has been driven by my work on ELGs. In that spirit, I've put several requirements;
Around that place in the code you may find other informations on that topic; happy to give more precision / discuss about that. |
This PR adds the
desi_emlinefit_afterburner
script.This script computes standard emission lines for all 500 spectra of a given set of {redrock,coadd} file.
Expected calling sequence in the pipeline:
desi_emlinefit_afterburner --redrockfn $RRFN --coaddfn $COADDFN --outfn $OUTROOT.fits
Functions:
get_emlines()
: called bymain()
, reads redrock/coadd files, callsemlines_gaussfit()
, outputs results, in the format of a dictionary;emlines_gaussfit()
: called byget_emlines()
, performs the fit for a single spectrum and a single emission line;get_rf_em_waves()
: called byemlines_gaussfit()
, defines the vacuum, rest-frame emission wavelengths.Emission lines:
Current fitted lines are: OII (doublet, free ratio), Hdelta, Hgamma, Hbeta, OIII (doublet, fixed ratio), Halpha.
For each line, I report the columns {EMNAME}_{QUANT} with QUANT in:
I also propagate some columns from the REDSHIFTS/FIBERMAP extensions:
Example files:
In
/global/cscratch1/sd/raichoor/tmpdir/desispec_emlinefit/
, there are fits and log files:I also provide in the same folder few examples of the (optional) output pdf; it has been useful to me to make diagnoses (and debug!):
Speed:
On the logging node, processing one file takes 10s-20s; generating the pdf takes much longer (~6mins).
Choices: