
Emlinefit #1386

Merged: 32 commits merged into master from emlinefit on Dec 21, 2021

Conversation

araichoor
Contributor

This PR adds the desi_emlinefit_afterburner script.
This script fits standard emission lines for all 500 spectra of a given {redrock, coadd} file pair.

Expected calling sequence in the pipeline:
desi_emlinefit_afterburner --redrockfn $RRFN --coaddfn $COADDFN --outfn $OUTROOT.fits

Functions:

  • get_emlines(): called by main(), reads redrock/coadd files, calls emlines_gaussfit(), outputs results, in the format of a dictionary;
  • emlines_gaussfit(): called by get_emlines(), performs the fit for a single spectrum and a single emission line;
  • get_rf_em_waves(): called by emlines_gaussfit(), defines the vacuum, rest-frame emission wavelengths.

Emission lines:
Current fitted lines are: OII (doublet, free ratio), Hdelta, Hgamma, Hbeta, OIII (doublet, fixed ratio), Halpha.
For each line, I report the columns {EMNAME}_{QUANT} with QUANT in:

  • OBSEMWAVES: observed wavelengths at which the fit is done
  • CHI2, NDOF: chi2 and nb of degrees of freedom
  • CONT, CONT_IVAR: continuum in 1e-17 * erg/cm2/s/A
  • FLUX, FLUX_IVAR: flux in 1e-17 * erg/cm2/s/A
  • SIGMA, SIGMA_IVAR: line width in A (observed frame)
  • SHARE, SHARE_IVAR: f1/(f0+f1) for OII and OIII doublets
  • RFEW, RFEW_IVAR: rest-frame equivalent width
  • wave, data, ivar, model: data used for fitting + fitted model, used in case plotting is requested

I also propagate some columns from the REDSHIFTS/FIBERMAP extensions: TARGETID, TARGET_RA, TARGET_DEC, OBJTYPE, FIBER, Z, ZWARN, SPECTYPE, DELTACHI2.
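
As a usage illustration (not part of the PR code itself), here is a minimal sketch of reading one output catalog and building the OII SNR that the ELG reliability criterion relies on; the EMLINEFIT extension name follows the convention adopted later in this thread, and the file name is taken from the examples below:

import numpy as np
import fitsio

# Minimal usage sketch: read one output catalog and compute the OII SNR.
# NaN values mark failed/dubious fits (see "Choices" below).
d = fitsio.read("emlinefit-0-80606-thru20201219.fits", ext="EMLINEFIT")
oii_snr = d["OII_FLUX"] * np.sqrt(d["OII_FLUX_IVAR"])
good = np.isfinite(oii_snr)
print("{} / {} targets with a measured OII line".format(good.sum(), len(d)))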

Example files:
In /global/cscratch1/sd/raichoor/tmpdir/desispec_emlinefit/, there are fits and log files:

denali/tiles/cumulative/80606/20201219/emlinefit-?-80606-thru20201219.{fits,log}
denali/tiles/cumulative/80613/20210324/emlinefit-?-80606-thru20201219.{fits,log}
everest/tiles/cumulative/1/20210406/emlinefit-?-1-thru20210406.{fits,log}
everest/tiles/cumulative/1000/20210517/emlinefit-?-1000-thru220210517.{fits,log}
everest/tiles/cumulative/20000/20210609/emlinefit-?-20000-thru20210609.{fits,log}
everest/tiles/cumulative/80606/20201219/emlinefit-?-80606-thru20201219.{fits,log}

I also provide in the same folder a few examples of the (optional) output pdf; it has been useful to me for making diagnoses (and debugging!):

denali/tiles/cumulative/80606/20201219/emlinefit-0-80606-thru20201219.pdf
denali/tiles/cumulative/80613/20210324/emlinefit-0-80613-thru20210324.pdf
everest/tiles/cumulative/80606/20201219/emlinefit-0-80606-thru20201219.pdf
everest/tiles/cumulative/80613/20210324/emlinefit-0-80613-thru20210324.pdf

Speed:
On the login node, processing one file takes 10-20 s; generating the pdf takes much longer (~6 min).

Choices:

  • I try to output NaNs for failed/dubious fits
  • for simplicity, the code generates outputs for all 500 spectra (though many fits will be NaN, or won't make sense)
  • emlines_gaussfit() has the flexibility to be run on a subset of TARGETIDs only; I kept that in case it is useful for testing; it could be removed (that would simplify the code)
  • I propagate as input arguments some of the emission-line settings (rf_fit_hw, min_rf_fit_hw, rf_cont_w, rv, balmerfit), in case people want to test those; these could be removed
  • in get_emlines(), I allow the pre-everest format (i.e. a ZBEST instead of a REDSHIFTS extension in redrockfn)
  • the information in the log files is quite minimal; happy to add more if relevant/useful
  • I do not propagate the {DESI,BGS,MWS,SCND}_TARGET columns, as that would require handling SV1/SV2/SV3/Main; I could do it if useful
  • in args.outpdf, spectra are ordered by increasing redshift.

@araichoor araichoor requested a review from sbailey August 26, 2021 21:54
@coveralls

coveralls commented Aug 26, 2021

Coverage Status

Coverage decreased (-1.2%) to 25.587% when pulling 3c29232 on emlinefit into 7e8b39f on master.

@sbailey
Contributor

sbailey commented Aug 26, 2021

Thanks @araichoor for this PR and for including the full descriptions and example output files.

@moustakas could you take a look at this? It's fine for this to be simpler than fastspecfit, but I also want to avoid arbitrary incompatibilities in variable names, wavelength definitions, 10e-17 prefixes or not, etc. Any cross-checks of conceptually similar quantities from fastspecfit would also be welcome.

Questions that occur to me that I haven't looked into:

  • how are masked but ivar != 0 input wavelengths handled?
  • do fits include ivar weighting?
  • what is in output for a line if the redshifted wavelength is
    • outside the wavelength range of the spectrum (NaN?)?
    • completely masked?
    • not completely masked, but e.g. only one pixel is unmasked such that you can't do an N parameter fit?

@sbailey sbailey requested a review from moustakas August 26, 2021 22:26
@sbailey sbailey added this to In progress in Fuji via automation Aug 26, 2021
@araichoor
Contributor Author

thanks for those remarks!

answering your specific points:

- SB: "how are masked but ivar != 0 input wavelengths handled?"
=> I currently don't use the {BRZ}_MASK extensions; shall I e.g. set ivar=0 for the masked pixels?

- SB: "do fits include ivar weighting?"
=> yes, the fit is done with scipy.optimize.curve_fit:

popt, pcov = curve_fit(
    myfunc,
    waves[keep_line],
    fluxes[keep_line],
    p0=p0,
    sigma=1. / np.sqrt(ivars[keep_line]),
    maxfev=10000000,
    gtol=1.49012e-8,
    bounds=bounds,
)

additionally, following David's suggestion, I consider all pixels in the spectrograph overlap regions, i.e. coming from both arms.
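
(For illustration only, here is a self-contained version of this kind of ivar-weighted fit, with a single Gaussian line plus a constant continuum; the model function, initial guess and bounds are simplified stand-ins, not the afterburner's actual myfunc/p0/bounds:)

import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_cont(ws, cont, flux, sigma, wave):
    # constant continuum + Gaussian of integrated flux `flux`, width `sigma`, centered on `wave`
    return cont + flux / (sigma * np.sqrt(2. * np.pi)) * np.exp(-0.5 * ((ws - wave) / sigma) ** 2)

rng = np.random.default_rng(1234)
waves = np.arange(7000., 7050., 0.8)
ivars = np.full(waves.size, 4.0)
fluxes = gauss_plus_cont(waves, 1.0, 8.0, 2.0, 7025.0) + rng.normal(scale=1. / np.sqrt(ivars))

popt, pcov = curve_fit(
    lambda ws, cont, flux, sigma: gauss_plus_cont(ws, cont, flux, sigma, 7025.0),
    waves,
    fluxes,
    p0=[1.0, 5.0, 3.0],
    sigma=1. / np.sqrt(ivars),   # ivar weighting, as in the snippet above
    bounds=([0., 1e-5, 0.1], [1e3, 1e9, 10.]),
)
flux_fit, flux_ivar = popt[1], 1. / pcov[1, 1]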

- SB: "what is in output for a line if the redshifted wavelength is outside the wavelength range of the spectrum (NaN?)? completely masked? not completely masked, but e.g. only one pixel is unmasked such that you can't do an N parameter fit?"
=> in those cases, NaN should be returned.
e.g.

keep_cont &= ~keep_line
# AR discarding flux=nan and ivars == 0 pixels
keep_line &= (np.isfinite(fluxes)) & (ivars > 0)
keep_cont &= (np.isfinite(fluxes)) & (ivars > 0)
# AR
models = np.nan + np.zeros(keep_line.sum())
# AR enough pixels to fit?
succeed = False
if (
    (keep_line.sum() >= 3) &
    (keep_cont.sum() >= 3) &
    (n_cov_lines >= min_n_lines)
):

I first discard pixels with non-finite flux or ivar=0; then, if I am not forgetting any cases, I think I return NaN if one of the following happens:

  • less than 3 pixels to estimate the continuum (hard-coded);
  • less than 3 pixels to fit the line (hard-coded);
  • edge of the spectrograph coverage: less than 20 A in the rest frame on each side of the line (min_rf_fit_hw argument);
  • the flux value(s) at the line position(s) is smaller than the estimated continuum value.

If none of the above happens, I perform a fit with: popt, pcov = curve_fit().
I then further require:

  • np.diag(pcov).sum() > 0
  • the np.diag(pcov) value for the flux parameter is > 0 (so actually that makes the previous requirement useless..)
  • the fitted flux value is at least 1% away from the allowed fit boundaries (arguments: min_flux=1e-5, max_flux=1e9, in 1e-17 * erg/cm2/s/A)

If those are not satisfied, I return NaN.

Those are somewhat ad hoc choices, but overall they look OK for my ELG needs.

@araichoor
Contributor Author

about consistency with fastspecfit

for line wavelengths:
it overall looks OK to me, see below; though, shall I just replace my values with the fastspecfit ones?

JM: https://github.com/desihub/fastspecfit/blob/everest-1/py/fastspecfit/data/emlines.ecsv
AR : http://classic.sdss.org/dr6/algorithms/linestable.html

LINE        JM         AR
oii_3726    3727.092   3727.092
oii_3729    3729.874   3729.875
hdelta      4102.892   4102.89
hgamma      4341.684   4341.68
hbeta       4862.683   4862.68
oiii_4959   4960.295   4960.295
oiii_5007   5008.239   5008.240
halpha      6564.613   6564.61

for other quantities:
I think fastspecfit also provides fluxes in 1e-17 erg/..., so we should be consistent here.
For the other quantities (sigma, share, rfew), I'm happy to modify my code to be more consistent with fastspecfit; I just need guidelines.

@moustakas
Member

how are masked but ivar != 0 input wavelengths handled?

I had not realized that masked pixels did not have ivar = 0 either, so this is something I'll need to implement as well.

@sbailey
Contributor

sbailey commented Aug 27, 2021

how are masked but ivar != 0 input wavelengths handled?

I had not realized that masked pixels did not have ivar = 0 either, so this is something I'll need to implement as well.

In earlier steps of the pipeline we purposefully do not immediately zero-out the ivars of masked pixels so that we don't lose information. However, now that you ask/wonder/mention it, I remembered that we do set ivar=0 for masked pixels when creating the cframe files, and thus the spectra and coadds as well, because we thought end-users would expect that. i.e. I think this is a non-issue (but you are welcome to double-check me on that and report any cases with a mask but ivar != 0 in cframe / spectra / coadds)

@sbailey
Contributor

sbailey commented Sep 1, 2021

Thanks for the explanations @araichoor .

@moustakas any more comments on this? In particular for data model names (SHARE, RFEW, ...) and units? Algorithmically fastspecfit and emlinefit can be different, but let's try to have common variable names and units for the same concepts. Could you explicitly cross-check that and make any suggestions? Thanks.

@moustakas
Member

Apologies for the delay.

Algorithmically, @araichoor has already shown that there is good agreement between his measurements and my independent measurements with fastspecfit, so I don't intend to review the code. I'll just add some notes and questions. Apologies in advance for the length of my report!

  • First, I agree that we should make the rest-frame wavelengths consistent. I've been using PyNeb, which uses the latest atomic data, to get these wavelengths and in any case our disagreements are mostly at the third decimal point.

  • My second question is: why these lines and not others? For example, the [OII] and [OIII] doublets and H-beta make sense for a code focused on ELGs, but if you're fitting H-alpha why not fit the adjacent [NII] doublet? Presumably it's too weak, but then so are H-gamma and H-delta. In addition, as you know, the Balmer lines are all affected by stellar absorption (becoming fractionally more important for the higher-order lines), so that will affect the measured Balmer fluxes except in galaxies with very large specific star formation rates (equivalent widths). But maybe it doesn't matter given the goals of emlinefit?

  • Third, regarding the data model, to make it easy to compare our two codes, I've pasted below the reported measurements from emlinefit for just the [OII] doublet and for the [OII] 3726 line from fastspecfit (see also the fastspec data model). Here are my reactions / thoughts / comments but please feel free to ignore any and all of my suggestions!

    • It's odd to me that OBSEMWAVES is a byte array rather than a 2-element single-precision array. Also, the observed wavelengths are derivable from the redshift and the rest-frame wavelengths so perhaps they don't need to be stored (although it's clearly convenient!)?
    • NDOF should be an integer, not float array.
    • Is your CHI2 the reduced chi^2 (mine is reduced)?
    • How strongly do you feel about the suffix RFEW (emlinefit) vs EW (fastspecfit)? In my upbringing as an optical spectroscopist, equivalent widths are always rest-frame, hence my simpler suffix.
    • In my code, the peak amplitude of the line is the critical measurement, not the integrated flux. (The reason is that you can have a broad line totally buried in the noise which is "significant" based on the integrated flux but which is insignificant based on the ratio of the line-amplitude to the rms of the surrounding continuum. I've done a lot of visual inspection on these two approaches and have found using the amplitude much more akin to what you and I would do to identify significant lines by eye.)
    • I'm a little unclear about the SHARE suffix and whether it should be a reported quantity in the core measurements (as opposed to a quantity-of-convenience that an end-user should construct themselves). Since you're fitting each component of the [OII] doublet independently, in order to construct this quantity, presumably you have to make some choices about the significance of each of the components of the doublet. For example, if the [OII] 3726 line-flux is negative (let's say it's affected by problematic sky-subtraction) but the [OII] 3729 line-flux is positive (and real), do you still use both lines to combine them? In fastspecfit I opted to keep each line separate (although in my current branch I've decided to fix the doublet ratio) and let the end-user combine the individual lines however they need or want depending on their application(s).

    emlinefit data model (one line)

           name        dtype  n_bad
    ----------------- ------- -----
       OII_OBSEMWAVES bytes20     0
             OII_FLUX float32   129
        OII_FLUX_IVAR float32   129
            OII_SIGMA float32   129
       OII_SIGMA_IVAR float32   129
             OII_CONT float32    61
        OII_CONT_IVAR float32    61
            OII_SHARE float32   129
       OII_SHARE_IVAR float32   129
             OII_RFEW float32   129
        OII_RFEW_IVAR float32   129
             OII_CHI2 float32   129
             OII_NDOF float32   129
    

    fastspecfit data model (one line)

             name           dtype  shape
    ---------------------- ------- -----
              OII_3726_AMP float32
         OII_3726_AMP_IVAR float32
             OII_3726_FLUX float32
        OII_3726_FLUX_IVAR float32
          OII_3726_BOXFLUX float32
           OII_3726_VSHIFT float32
            OII_3726_SIGMA float32
             OII_3726_CONT float32
        OII_3726_CONT_IVAR float32
               OII_3726_EW float32
          OII_3726_EW_IVAR float32
       OII_3726_FLUX_LIMIT float32
         OII_3726_EW_LIMIT float32
             OII_3726_CHI2 float32
             OII_3726_NPIX   int32
    
    • It's probably my history with IDL, but I'm not at all a fan of using nan to represent non-measurements. In my world, nan means something went terribly wrong and needs to be investigated. In fastspecfit if the ivar of a measurement is zero then it means no measurement was possible.
    • It looks like you don't allow a velocity-shift of the line-center with respect to the redrock redshift, which fastspecfit does. I haven't investigated how much this flexibility is needed, but in Denali I found that the 1-sigma scatter in the Balmer-line redshifts compared to the nominal redrock redshifts were about 30 km/s (less than a pixel over most of our wavelength range); for the forbidden lines the scatter was a bit larger, roughly 60 km/s, although with a tail out to +/- 100 km/s (>3 pixels). It may be worth thinking about this point more (but not for this PR).
  • It doesn't look like you use the resolution matrix---why not? I guess this means that your measured line-widths (which are in observed-frame Angstroms) need to be deconvolved for the instrumental resolution before they're converted to km/s (a minimal conversion sketch follows after this list).

  • Why do you fit all 500 spectra, including the sky spectra? How are the fits to the non-TGT spectra used?

  • You write that it takes 10-20 seconds to fit a single petal (500 spectra), which is shockingly fast---nice! However, I don't see any multiprocessing in the code. Presumably with multiprocessing you could fit 32 spectra simultaneously and get this time down to <1 second. Along the same lines, how is the code MPI-parallelized, if at all?
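
(For reference, a minimal sketch of the deconvolution and km/s conversion mentioned in the resolution-matrix item above; the instrumental width value used here is purely illustrative, not taken from the DESI resolution data.)

import numpy as np

c_kms = 299792.458                       # speed of light [km/s]
sigma_obs = 2.5                          # fitted observed-frame line width [A]
sigma_inst = 0.8                         # assumed Gaussian instrumental width [A] (illustrative)
lambda_obs = 3727.092 * (1. + 1.1)       # e.g. [OII] 3726 observed at z = 1.1

# remove the instrumental width in quadrature, then convert to a velocity width
sigma_intr = np.sqrt(max(sigma_obs**2 - sigma_inst**2, 0.))
sigma_kms = c_kms * sigma_intr / lambda_obs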

Finally, a broader question (especially for @sbailey) that is semi-related to this PR but which should not be a blocking factor. We now have two after-burners for emission lines, emlinefit for ELGs and the MgII after-burner for QSOs. It has taken some time to evolve, but fastspecfit now does both these tasks and much more, delivering quantities that are basically science-ready for all extragalactic objects (well, the QSOs are hard!).

With more capabilities, of course, there is a computational cost. fastspecfit is significantly slower than either of these existing after-burners, although my latest refactor sped things up by about a factor of 10 (desihub/fastspecfit#37) and I think with more streamlining (read: remove as many of the astropy wrappers as possible!) the code can be sped up even more. Something to discuss and think about.

@araichoor
Contributor Author

thanks a lot for all the relevant comments/questions!

I answer each item below (sorry for the long message).

But first, a general comment relevant to several items: this afterburner is primarily designed to be used for the ELG "zreliable" criterion, which currently relies on the OII SNR; maybe the OIII SNR will be used later, but probably not much more.
I know it is very tempting to improve/refine things, but then this would just end up reproducing fastspecfit, which is way more precise/developed.
That's why I'm more inclined to keep things as simple as possible, knowing of course that the provided measurements are coarse; but those are sufficient for the ELG zreliable criterion.
The other provided measurements may be useful for users looking for simple diagnostics.

Now going per item:

  • rest-frame wavelengths => no problem, I'll switch to your (PyNeb) values;
  • why these lines and not others? => as said, that was initially for ELGs; so I started with OII, then added OIII; and the first Balmer lines were suggested then..
  • data model:
  • OBSEMWAVES is a byte array => I came up with a string because I have one entry for the doublets, so two wavelengths stored in one array element; as you mentioned, that is mainly for convenience; one could always go back to the code, and get it with the redshifts; I can remove that column;
  • NDOF should be an integer, not float array => that's because of the np.nan, I believe; if that's an issue I could try to find a work-around
  • Is your CHI2 the reduced chi^2 (mine is reduced)? => no; I could provide CHI2/NDOF (to me it would sound better to then call it RCHI2, but I guess you're not going to change that in fastspecfit);
  • How strongly do you feel about the suffix RFEW (emlinefit) vs EW (fastspecfit)? => sure, I can switch to EW, I don't mind;
  • peak amplitude vs. integrated flux => in case it's relevant here, I'm requiring SIGMA < 10 A, so that may prevent such cases, no? (looking at the OII_SIGMA distribution, it peaks at ~2 A, has a std of ~2 A, and a tail extending to 10 A)
  • SHARE: it is only relevant for the OII doublet, where I let it free; for the OIII doublet, I fix it during the fit (and obviously it is not relevant for the Balmer lines); the distribution is centered on the initial guess (=expected value) of 0.58, with a std of ~0.2; it happens that my approach is the same as the one recommended by B. Weiner in [desi-galaxies 703]; for any valid fit, I require that the flux at the observed emission line(s) is above the continuum; so, for the OII doublet, I'm not sure I make any choice about the significance of each component; for the OII and OIII doublets, I use this simple functional form (a self-contained sketch follows after this list):
    myfunc = lambda ws, sigma, F0, sh : mydict["CONT"] + gauss_nocont(ws, sigma, (1-sh) * F0, obs_em_waves[0]) + gauss_nocont(ws, sigma, sh * F0, obs_em_waves[1])
  • I'm not at all a fan of using nan to represent non-measurements => to make my life simpler, I just set to np.nan all cases where I don't have a "reliable" fit, be it because the line is outside the spectral range, there are not enough pixels to fit, the fit hit the boundaries, etc. I agree it would be best to proceed as you suggest, but that may require more coding work, so if that's fine, I'll keep the NaNs; besides, it also simplifies things for the end-user;
  • velocity shift of the line center with respect to the redrock redshift => correct, I don't allow that, I just use the redrock redshift; for the ELG purpose I guess it's fine;
  • It doesn't look like you use the resolution matrix---why not? I guess this means that your measured line-widths (which are in observed-frame Angstroms) need to be deconvolved for the instrumental resolution before they're converted to km/s => that is correct; I guess the reason is that my main product is the OII SNR, all the other products come "for free"; I'd sooner remove those other products than refine the analysis...
  • Why do you fit all 500 spectra => yes, I fit everything; that is simpler from a file-structure point of view; I agree that the non-TGT fits are nonsense; I could force np.nan for those.
  • multiprocessing/MPI: my understanding is that the pipeline launches one task per redrock file, so the parallelization is already done in the pipeline, no?
  • fastspecfit now does both these tasks and much more => sure, I agree! the primary initial goal of this emlinefit afterburner is to get the OII SNR in the daily reduction.
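
(To make the SHARE functional form above concrete, here is a self-contained sketch; gauss_nocont below is a stand-in with the same call signature as in the quoted lambda, i.e. a Gaussian of given integrated flux, width, and center with no continuum, which is an assumption about the actual implementation:)

import numpy as np

def gauss_nocont(ws, sigma, flux, wave):
    # Gaussian of integrated flux `flux`, width `sigma` (A), centered on `wave` (A), no continuum
    return flux / (sigma * np.sqrt(2. * np.pi)) * np.exp(-0.5 * ((ws - wave) / sigma) ** 2)

def oii_doublet(ws, cont, sigma, F0, sh, obs_em_waves):
    # F0 = total doublet flux; sh = SHARE = f1 / (f0 + f1) (free for OII, fixed for OIII);
    # the initial guess for sh is 0.58, as mentioned above
    return (cont
            + gauss_nocont(ws, sigma, (1. - sh) * F0, obs_em_waves[0])
            + gauss_nocont(ws, sigma, sh * F0, obs_em_waves[1]))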

@araichoor
Contributor Author

for info, I hope to be able to implement those changes today/tomorrow.
also, I plan to factor out the functions into an emlinefit_io.py, so that one can load them through "from desispec.emlinefit_io import .."; and also call get_emlines() for just a set of provided targetids.

@sbailey
Contributor

sbailey commented Sep 23, 2021

Minor: please put the I/O functions in desispec.io.emlinefit to be grouped with the other I/O sub-modules under desispec.io.

@araichoor
Contributor Author

oh right, thanks for noticing that; sure, will do.

@araichoor
Contributor Author

As said, for convenience, I also propagate these columns from the redrock file:
TARGETID, TARGET_RA, TARGET_DEC, OBJTYPE, FIBER, Z, ZWARN, SPECTYPE, DELTACHI2

However, I notice that - obviously - FIBER is not present in the healpix FIBERMAP extension.
So I plan to remove FIBER from that list, i.e. have the same data model for the tiles/ and healpix/ catalogs (rather than adding the FIBER column only for the tiles/ catalogs).

@araichoor
Contributor Author

On further thought, I will actually remove the OBSEMWAVES column, and instead put that information in the EMLINEFIT extension header.
I plan to use the following format: first list the fitted lines in EMNAMES, then report for each line the rest-frame wavelength(s) in RFWAVE00, RFWAVE01, etc., following the order of appearance in EMNAMES:

EMNAMES = 'OII,HDELTA,HGAMMA,HBETA,OIII,HALPHA'                                 
RFWAVE00= '3727.092,3729.874'                                                   
RFWAVE01= '4102.892'                                                            
RFWAVE02= '4341.684'                                                            
RFWAVE03= '4862.683'                                                            
RFWAVE04= '4960.295,5008.239'                                                   
RFWAVE05= '6564.613'                                                            

That way, along with the Z column, one can reconstruct the fitted observed wavelengths; it offers flexibility and keeps the header keys at 8 characters (provided we don't fit more than 99 lines!)
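
(For illustration, a minimal sketch, with a hypothetical file name, of reconstructing the fitted observed wavelengths from these header keywords and the Z column:)

import fitsio

with fitsio.FITS("emlinefit-0-80606-thru20201219.fits") as h:
    hdr = h["EMLINEFIT"].read_header()
    d = h["EMLINEFIT"].read()

emnames = [name.strip() for name in hdr["EMNAMES"].split(",")]
rf_waves = {
    emname: [float(w) for w in hdr["RFWAVE{:02d}".format(i)].split(",")]
    for i, emname in enumerate(emnames)
}
# observed wavelength(s) of the fitted line(s) for the first target: (1 + Z) * rest-frame wavelength(s)
z = d["Z"][0]
obs_oii_waves = [(1. + z) * w for w in rf_waves["OII"]]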

Besides, I also plan to propagate the arguments of desi_emlinefit_afterburner relevant to the fit:

RRFN    = '/global/cfs/cdirs/desi/spectro/redux/everest/healpix/main/dark/100/&'
CONTINUE  '10000/redrock-main-dark-10000.fits'                                  
COADDFN = '/global/cfs/cdirs/desi/spectro/redux/everest/healpix/main/dark/100/&'
CONTINUE  '10000/coadd-main-dark-10000.fits'                                    
RFHW    =                   40                                                  
MINRFHW =                   20                                                  
RFCONTW =                  200                                                  
RV      =                  3.1                                                  
BALMFIT = 'em      '                                                            

@moustakas
Member

@araichoor I think you should consider continuing to fit all the target classes, at least for the time being. You've shown that the compute cost is negligible and it would be really helpful to have a comparison dataset for fastspecfit for more than just the ELGs (which, as you know, are the faintest and hardest objects we're observing!)

@araichoor
Contributor Author

I see, thanks for the feedback.
My worry is that we would then provide measurements, some of which we know are not reliable / well-tested.

Another point if we go this way: when comparing BGS/LRG measurements with fastspecfit, I notice that the default SIGMA (observed Gaussian width) settings I adopted, based on the ELGs, are no longer valid for those different tracers; I set SIGMA < 10 A, whereas it can be larger for BGS/LRGs; I can see if I can allow a larger SIGMA without affecting the ELG results..
I'll put that at the top of my to-do list.


@sbailey sbailey left a comment


Apologies for delayed review. This is nice and fast and meets the core needs, but I've put some comments inline related to code structure and future maintainability.

The core algorithm is isolated in the emlines_gaussfit function (good), but other pieces seem to mix I/O, algorithms, and plotting. Could you try to separate these more so that the outer wrapper script looks more like:

stuff = desispec.io.emlinefit.read_stuff(coadd_filename, redrock_filename)
results = desispec.emlinefit.emlines_gaussfit(stuff)
desispec.io.emlinefit.write_emlines(outfile, results)
if args.outpdf:
    desispec.io.emlinefit.plot_emlines(args.outpdf, results)

(don't actually use the name "stuff", that's a placeholder for whatever would be read and returned)

Maybe that's what get_emlines is supposed to be? But in that case get_targetids should be working on stuff that get_emlines is already reading instead of working off of filenames directly. Please try to restructure this into:

  • desispec.io.emlinefit - I/O and convenience reformatting, but no calculations
  • desispec.emlinefit - the core algorithms but no I/O, in a form that would be useful to someone calling on a simulated spectrum or from a Jupyter notebook.
  • bin/desi_emlinefit_afterburner - puts those two together

If that doesn't make sense let me know and we can chat in person to clarify. Thanks.

mydict[emname]["fitivar"][i] = iv
mydict[emname]["model"][i] = m
# AR plot?
if outpdf is not None:
Contributor


Usually we want to separate calculation from plotting, e.g. so that if you want to update the plots you can call it with some data to get a new plot without having to redo intermediate calculations. Could you move out this plotting into a separate function that would take the outputs of emlines_gaussfit and make a plot, or does it rely upon some internal variable that isn't returned?

Contributor Author


done, I've moved that to desispec.io.emlinefit.plot_emlines().


allowed_emnames = ["OII", "HDELTA", "HGAMMA", "HBETA", "OIII", "HALPHA"]

def get_targetids(redrockfn, bitnames, log=get_logger()):
Contributor


This function opens and reads the redrock file twice to perform a selection of targetids. I think it would be cleaner to make this a non-I/O function that takes a fibermap and bitnames and returns targetids (or maybe the indices of the fibermap or a boolean array of matches).

Some of this logic looks like it would be cleaner to use desitarget.targets.main_cmx_or_sv to figure out what the relevant target column is and what mask to import. Please check if that would work.
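
(For reference, a rough sketch of what such a non-I/O get_targetids could look like; the exact main_cmx_or_sv return values and the bitnames handling are assumptions here, not the final implementation:)

import numpy as np
from desitarget.targets import main_cmx_or_sv

def get_targetids(fibermap, bitnames):
    # fibermap: the already-read FIBERMAP table; bitnames: e.g. "ELG" or "ELG,QSO"
    # main_cmx_or_sv is assumed to return the relevant target column names, masks, and survey
    target_cols, target_masks, survey = main_cmx_or_sv(fibermap)
    desi_col, desi_mask = target_cols[0], target_masks[0]   # e.g. DESI_TARGET or SV3_DESI_TARGET
    sel = np.zeros(len(fibermap), dtype=bool)
    for bitname in bitnames.split(","):
        sel |= (fibermap[desi_col] & desi_mask[bitname]) > 0
    return fibermap["TARGETID"][sel]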

Contributor Author


Here, I've kept the function as is (for the time being).

Maybe that doesn't really matter as the redrock*fits files are small, but from my experience, I think it's faster to proceed this way:

  • first only read the file structure with fits.open(), but not the data, to inspect the column content (that step is super-quick)
  • then read only the relevant columns,

rather than just reading the whole data.
The same goes for desitarget.targets.main_cmx_or_sv: that function assumes that the whole data have been read.

Note that in the new code version, I also use that two-step approach, e.g.:

h = fits.open(redrock)
extnames = [h[i].header["EXTNAME"] for i in range(1, len(h))]
if "REDSHIFTS" in extnames:
    rr_extname = "REDSHIFTS"
elif "ZBEST" in extnames:
    rr_extname = "ZBEST"
else:
    msg = "{} has neither REDSHIFTS or ZBEST extension".format(redrock)
    log.error(msg)
    raise RuntimeError(msg)
# AR rr_keys, fm_keys: restrict to existing ones
rmv_rr_keys = [key for key in rr_keys.split(",") if key not in h[rr_extname].columns.names]
if len(rmv_rr_keys) > 0:
    log.info("{} removed from rr_keys, as not present in {}".format(",".join(rmv_rr_keys), rr_extname))
rr_keys = ",".join([key for key in rr_keys.split(",") if key not in rmv_rr_keys])
# AR redrock: reading
rr = fitsio.read(redrock, ext=rr_extname, columns=rr_keys.split(","))

Please let me know if there is a better way to proceed.

@araichoor
Contributor Author

Thanks a lot @sbailey for the detailed + pedagogical review!

Except for the comment on get_targetids(), I think I've addressed all of those.
Apologies for the "big commit" (fe63705), but I didn't find an easy way to split it into smaller commits that wouldn't leave the code in a broken state.

I've checked on a few cases that the new version produces the same answer as the old version, but I'll test more tomorrow; still, I'm already pushing/sharing the code, so that you have time to re-review when it suits you best.

I now reply to each comment.

@araichoor
Contributor Author

And before I forget: besides the code-structure changes, one key question is whether we want to provide those fits for all spectra or not.
My worry in doing so is that I'm not super-confident in the fits for the non-ELG tracers, and I don't have time to experiment to see whether changing the default fitting parameters could help.
But maybe that is still better than nothing...

@sbailey
Contributor

sbailey commented Dec 14, 2021

Thanks for the very quick updates. The new structure looks better.

By default for the pipeline let's run it on all targets, and document that the file is primarily intended for ELGs. Complete failures have NaN, and you also propagate IVARs and CHI2s if someone wants to check goodness / significance of fit when applying to BGS or LRGs, so even if they aren't "good" fits, there are ways to inspect how good they are.

get_targetids: it looks like you are trying to be more efficient with I/O by reading only the minimal set of columns needed instead of all of the columns. That can be a big win when reading huge files with lots of columns, but it doesn't seem to matter for these small FIBERMAP HDUs where we are dominated by overhead. I tested reading the FIBERMAP HDU from redrock files and the timing varied between 3ms and 30ms regardless of whether I was reading all columns or just ['TARGETID', 'DESI_TARGET']. The second time I read a file is always way faster than the first time, but the first read of a new file fluctuated a lot regardless of how many columns I was reading.

So, my proposed structure is:

  • read_emlines_inputs takes args.bitnames as an input instead of targetids
    • even better would be to leave targetids as an additional option to enable reading a specific set of pre-selected targets regardless of their bits
  • read_emlines_inputs would read the full FIBERMAP HDU without trying to pre-filter to only specific columns.
  • it would then call get_targetids with that table, and get_targetids itself wouldn't do any I/O
  • ideally get_targetids would use desitarget.targets.main_cmx_or_sv instead of doing similar logic so that we only have to maintain that kind of column name parsing in one place instead of several (e.g. if N years from now we start observing DESI-II tiles that have yet another column name, ideally we'd only have to add support for that in main_cmx_or_sv.)

Background motivation: our experience at NERSC is that when the filesystems work, they are so fast that in most cases it doesn't matter (looping over tractor and target selection inputs is a case that gets big enough where it does matter). But there is a small probability that basic operations like open, close, read will hang and we are more dominated by those outliers than whether a given read takes 3ms or 30ms.

FWIW, the best structure for I/O is to use fitsio and open the file once and do all the necessary reads before closing, e.g. using a "context manager" that will auto-cleanup if something goes wrong during the read:

with fitsio.FITS(redrock_filename) as rr:
    redshifts = rr['REDSHIFTS'].read()
    fibermap = rr['FIBERMAP'].read()

instead of

redshifts = fitsio.read(redrock_filename, 'REDSHIFTS')
fibermap = fitsio.read(redrock_filename, 'FIBERMAP')

Full disclosure: I've certainly written a lot of code that looks like the second case instead of first, but I acknowledge that the first is better from an I/O metadata load standpoint.

@araichoor
Contributor Author

Again, thanks a lot for the reply (+testing the speed of different calls!).
All that you write makes sense, and should be easily doable, so I'll go for that.

I support the idea of keeping targetids as an argument, as it can be useful to be able to call the line fitter for only a few desired targets.
I think that, in the case where both bitnames and targetids are provided, I'll make a selection on both of them (i.e. take the provided targetids which pass the bitname selection); does that make sense?

Lastly, since you're providing useful tips on using fitsio: if I read the redrock/fibermap catalogs with all the columns, I still need to then remove all columns except rr_keys/fm_keys, as those will be propagated into the final output catalog.
My question: once I've read a catalog with fitsio, how do I properly remove a given column from that catalog?

@sbailey
Contributor

sbailey commented Dec 14, 2021

I support the idea of keeping targetids as an argument, as it can be useful to be able to call the line fitter for only a few desired targets.
I think that, in the case where both bitnames and targetids are provided, I'll make a selection on both of them (i.e. take the provided targetids which pass the bitname selection); does that make sense?

yes, sounds good.

Lastly, since you're providing useful tips on using fitsio: if I read the redrock/fibermap catalogs with all the columns, I still need to then remove all columns except rr_keys/fm_keys, as those will be propagated into the final output catalog.
My question: once I've read a catalog with fitsio, how do I properly remove a given column from that catalog?

I had forgotten about the issue of what gets propagated to the output catalog. You can filter down to just the columns you want with e.g.

blat = fibermap[['TARGETID', 'DESI_TARGET']]

numpy.lib.recfunctions also has a set of functions for adding and removing columns from a numpy structured array. Another option is to cast to an astropy Table and then add/remove columns (which I think is more memory efficient but less CPU efficient, but in this case it probably doesn't matter).
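
(A couple of short, hedged examples of those two options; the file name is taken from an earlier comment in this thread and the column names are just the ones discussed above:)

import fitsio
import numpy.lib.recfunctions as rfn
from astropy.table import Table

fibermap = fitsio.read("redrock-main-dark-10000.fits", ext="FIBERMAP")
keep = ["TARGETID", "TARGET_RA", "TARGET_DEC", "OBJTYPE"]

# numpy: multi-field selection + repack_fields to drop the unused columns
fm_small = rfn.repack_fields(fibermap[[c for c in keep if c in fibermap.dtype.names]])

# astropy: cast to a Table and remove the unwanted columns
t = Table(fibermap)
t.remove_columns([c for c in t.colnames if c not in keep])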

@araichoor
Contributor Author

I should have addressed your last batch of comments:

  • default to fitting all 500 fibers in bin/desi_emlinefit_afterburner;
  • use desitarget.targets.main_cmx_or_sv in get_targetids()
  • nest the call to get_targetids() inside read_emlines_inputs();
  • keep the targetids=None as argument of read_emlines_inputs() (though that is not propagated as an argument of bin/desi_emlinefit_afterburner, as I don't feel it's reasonable to have this at the command line level);
  • cleaner use of fitsio to read files.

Plus a few code-syntax cleanups.

@sbailey
Contributor

sbailey commented Dec 15, 2021

Thanks for the updates; the new structure looks great.

As compensation for your patience, I have added the final set of changes to add unit tests:

  • Moved most of bin/desi_emlinefit_afterburner into py/desispec/scripts/emlinefit.py so that the functions become available to call from unit tests
  • Added py/desispec/test/test_emlinefits.py
    • at GitHub, this tests the algorithms in desispec.emlinefit with 95% coverage
    • at NERSC, it additionally calls the full script on everest inputs to also test desispec.scripts.emlinefit (75%) and desispec.io.emlinefit (61%).
    • making the pdf and filtering by targetid are the pieces not tested

These aren't sophisticated tests for correctness, but are basic "does it run without crashing" tests that will help us in the future to catch any problems with changes to numpy, astropy, fitsio, etc. before running this in the full pipeline.

Speaking of the full pipeline -- I think this is ready to merge now, but I'll leave it open for you to look at what I did with the tests and complain/question if needed. After cori is back let's add the final hooks to include this in the daily pipeline desi_tile_redshifts (which can be a separate PR).

My structural requests have caused your original single-file-script to become multiple files:

  • bin/desi_emlinefit_afterburner
  • py/desispec/scripts/emlinefit.py
  • py/desispec/emlinefit.py
  • py/desispec/io/emlinefit.py

While that is admittedly somewhat of a pain, it is motivated by:

  • separating algorithms from I/O from plotting
  • which makes it easier to reuse the algorithms from Jupyter notebooks or other future MPI-parallelized wrappers
  • and makes it easier to test (e.g. algorithms at GitHub without requiring any input data and full tests at NERSC)

Final notes:

  • a new github sphinx failure has appeared, though full docs build without warnings or errors at NERSC. I think this is related to pinning the version of sphinx that is already fixed in a previous PR that came after this branch, so I'll ignore it here.
  • coverage is still complaining, but that's because the full script tests requires files at NERSC that aren't available to the coverage test at GitHub. At NERSC I checked coverage with: pytest --cov=desispec --cov-report=html py/desispec/test/test_emlinefit.py and then copied the "htmlcov" dir to $CFS/desi/users/sjbailey so that I could view it via data.desi.lbl.gov .

@araichoor
Contributor Author

Thanks a lot for the support; I've learned quite a lot during this exercise.

Just one remark about the test:
I see this test/comment:

#- both redshifts should have an answer for OII

However, the code will return a NaN if it fails, e.g. if there are not enough valid pixels or if the flux is negative at the fitted location.
As the flux is a random variable centered on zero, it can happen that it is negative at the [OII] location.
I checked by copy-pasting the code lines; it looks like the two random fluxes are positive there, so it should be ok:
(attached plots: tmp-0, tmp-1)

Maybe we could add a positive offset to the fluxes, to be sure / explicit about that, e.g.:

 fluxes = 10 + rand.normal(size=(nspec, nwave))

@sbailey
Contributor

sbailey commented Dec 15, 2021

Interesting. Then an algorithm choice question: I was expecting that if the data were negative at the location of a line, you'd just get a negative FLUX rather than forcing it to NaN. To me that is different from a "failed" fit from not having enough good pixels at that wavelength (or having the line completely redshifted out of the wavelength coverage). Is there a specific reason to not propagate a negative flux fit? To me this is conceptually equivalent to not clipping photometric fluxes to 0 when making an imaging catalog -- let the data fluctuate, and sometimes the fit will be negative even though you know the object photons are strictly positive, but the negative tail can be useful in assessing the noise/statistical properties of the dataset.

Also, can you point to the piece of code that sets negative fits to NaN (or otherwise doesn't overwrite the default NaN because the fit was negative)? From a quick glance that isn't obvious to me where that happens.

@araichoor
Contributor Author

I acknowledge my approach may be a bit "brutal", but the idea was end-user oriented: if there is a non-NaN measurement, then it can be trusted (at least for the ELGs :)); no need to make several cuts to decide whether the fit is trustworthy or not; that of course has been driven by my work on ELGs.

In that spirit, I've put in place several requirements;
I described them at the beginning of this PR (Aug 26! #1386 (comment)); for what we discussed, I think the relevant ones are:

  • the flux at the line position is above the estimated continuum (so what I wrote in the previous message is not exactly correct; but the two random fluxes in the test still pass that requirement, I think):

    # AR is the flux above continuum for at least one line?
    if obs_em_fluxes.max() > emdict["CONT"]:

  • I set some boundaries for the flux fitting (min_flux=1e-5, max_flux=1e9), and if the fitted flux is close to the boundaries (1% margin), I decide it is a failure:

    # AR fit succeeded?
    # AR - pcov.__class__ criterion => dates from JC, not sure how relevant.. keeping it
    # AR - then we decide that the fit is no good if:
    # AR - var = 0,
    # AR - or var(flux) = 0
    # AR - or flux closer than 1% to the allowed boundaries
    if pcov.__class__ == np.ndarray:
        diag = np.diag(pcov)
        if (diag.sum() > 0) & (diag[1] > 0) & (popt[1] > 1.01 * min_flux) & (popt[1] < 0.99 * max_flux):
            succeed = True

Around that place in the code you may find more information on that topic; happy to give more detail / discuss it further.

@sbailey sbailey merged commit 4129b1a into master Dec 21, 2021
Fuji automation moved this from In progress to Done Dec 21, 2021
@sbailey sbailey deleted the emlinefit branch December 21, 2021 05:05