
Emlinefit #1386

Merged: 32 commits merged into master from emlinefit on Dec 21, 2021

Conversation

araichoor
Contributor

This PR adds the desi_emlinefit_afterburner script.
This script fits standard emission lines for all 500 spectra of a given {redrock, coadd} file pair.

Expected calling sequence in the pipeline:
desi_emlinefit_afterburner --redrockfn $RRFN --coaddfn $COADDFN --outfn $OUTROOT.fits

Functions:

  • get_emlines(): called by main(), reads redrock/coadd files, calls emlines_gaussfit(), outputs results, in the format of a dictionary;
  • emlines_gaussfit(): called by get_emlines(), performs the fit for a single spectrum and a single emission line;
  • get_rf_em_waves(): called by emlines_gaussfit(), defines the vacuum, rest-frame emission wavelengths.

Emission lines:
Current fitted lines are: OII (doublet, free ratio), Hdelta, Hgamma, Hbeta, OIII (doublet, fixed ratio), Halpha.
For each line, I report the columns {EMNAME}_{QUANT} with QUANT in:

  • OBSEMWAVES: observed wavelengths at which the fit is done
  • CHI2, NDOF: chi2 and nb of degrees of freedom
  • CONT, CONT_IVAR: continuum in 1e-17 * erg/cm2/s/A
  • FLUX, FLUX_IVAR: flux in 1e-17 * erg/cm2/s/A
  • SIGMA, SIGMA_IVAR: line width in A (observed frame)
  • SHARE, SHARE_IVAR: f1/(f0+f1) for OII and OIII doublets
  • RFEW, RFEW_IVAR: rest-frame equivalent width
  • wave, data, ivar, model: data used for fitting + fitted model, used in case plotting is requested

I also propagate some columns from the REDSHIFTS/FIBERMAP extensions: TARGETID, TARGET_RA, TARGET_DEC, OBJTYPE, FIBER, Z, ZWARN, SPECTYPE, DELTACHI2.
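
As a usage illustration (not part of the PR code itself), here is a minimal sketch of reading one output catalog and building the OII SNR that the ELG reliability criterion relies on; the EMLINEFIT extension name follows the convention adopted later in this thread, and the file name is taken from the examples below:

import numpy as np
import fitsio

# Minimal usage sketch: read one output catalog and compute the OII SNR.
# NaN values mark failed/dubious fits (see "Choices" below).
d = fitsio.read("emlinefit-0-80606-thru20201219.fits", ext="EMLINEFIT")
oii_snr = d["OII_FLUX"] * np.sqrt(d["OII_FLUX_IVAR"])
good = np.isfinite(oii_snr)
print("{} / {} targets with a measured OII line".format(good.sum(), len(d)))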

Example files:
In /global/cscratch1/sd/raichoor/tmpdir/desispec_emlinefit/, there are fits and log files:

denali/tiles/cumulative/80606/20201219/emlinefit-?-80606-thru20201219.{fits,log}
denali/tiles/cumulative/80613/20210324/emlinefit-?-80606-thru20201219.{fits,log}
everest/tiles/cumulative/1/20210406/emlinefit-?-1-thru20210406.{fits,log}
everest/tiles/cumulative/1000/20210517/emlinefit-?-1000-thru220210517.{fits,log}
everest/tiles/cumulative/20000/20210609/emlinefit-?-20000-thru20210609.{fits,log}
everest/tiles/cumulative/80606/20201219/emlinefit-?-80606-thru20201219.{fits,log}

I also provide in the same folder a few examples of the (optional) output pdf; it has been useful to me for making diagnoses (and debugging!):

denali/tiles/cumulative/80606/20201219/emlinefit-0-80606-thru20201219.pdf
denali/tiles/cumulative/80613/20210324/emlinefit-0-80613-thru20210324.pdf
everest/tiles/cumulative/80606/20201219/emlinefit-0-80606-thru20201219.pdf
everest/tiles/cumulative/80613/20210324/emlinefit-0-80613-thru20210324.pdf

Speed:
On the login node, processing one file takes 10-20 s; generating the pdf takes much longer (~6 min).

Choices:

  • I try to output NaNs for failed/dubious fits
  • for simplicity, the code generates outputs for all 500 spectra (though many fits will be NaN, or won't make sense)
  • emlines_gaussfit() has the flexibility to be run on a subset of TARGETIDs only; I kept that in case it is useful for testing; it could be removed (that would simplify the code)
  • I propagate as input arguments some of the emission-line settings (rf_fit_hw, min_rf_fit_hw, rf_cont_w, rv, balmerfit), in case people want to test those; these could be removed
  • in get_emlines(), I allow the pre-everest format (i.e. a ZBEST instead of a REDSHIFTS extension in redrockfn)
  • the information in the log files is quite minimal; happy to add more if relevant/useful
  • I do not propagate the {DESI,BGS,MWS,SCND}_TARGET columns, as that would require handling SV1/SV2/SV3/Main; I could do it if useful
  • in args.outpdf, spectra are ordered by increasing redshift.

@araichoor araichoor requested a review from sbailey August 26, 2021 21:54
@coveralls

coveralls commented Aug 26, 2021

Coverage Status

Coverage decreased (-1.2%) to 25.587% when pulling 3c29232 on emlinefit into 7e8b39f on master.

@sbailey
Contributor

sbailey commented Aug 26, 2021

Thanks @araichoor for this PR and for including the full descriptions and example output files.

@moustakas could you take a look at this? It's fine for this to be simpler than fastspecfit, but I also want to avoid arbitrary incompatibilities in variable names, wavelength definitions, 10e-17 prefixes or not, etc. Any cross-checks of conceptually similar quantities from fastspecfit would also be welcome.

Questions that occur to me that I haven't looked into:

  • how are masked but ivar != 0 input wavelengths handled?
  • do fits include ivar weighting?
  • what is in output for a line if the redshifted wavelength is
    • outside the wavelength range of the spectrum (NaN?)?
    • completely masked?
    • not completely masked, but e.g. only one pixel is unmasked such that you can't do an N parameter fit?

@sbailey sbailey requested a review from moustakas August 26, 2021 22:26
@sbailey sbailey added this to In progress in Fuji via automation Aug 26, 2021
@araichoor
Contributor Author

thanks for those remarks!

answering your specific points:

- SB: "how are masked but ivar != 0 input wavelengths handled?"
=> I currently don't use the {BRZ}_MASK extensions; shall I e.g. set ivar=0 for the masked pixels?

- SB: "do fits include ivar weighting?"
=> yes, the fit is done with scipy.optimize.curve_fit:

popt, pcov = curve_fit(
    myfunc,
    waves[keep_line],
    fluxes[keep_line],
    p0=p0,
    sigma=1. / np.sqrt(ivars[keep_line]),
    maxfev=10000000,
    gtol=1.49012e-8,
    bounds=bounds,
)

additionally, following David's suggestion, I consider all pixels in the spectrograph overlap regions, i.e. coming from both arms.
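
(For illustration only, here is a self-contained version of this kind of ivar-weighted fit, with a single Gaussian line plus a constant continuum; the model function, initial guess and bounds are simplified stand-ins, not the afterburner's actual myfunc/p0/bounds:)

import numpy as np
from scipy.optimize import curve_fit

def gauss_plus_cont(ws, cont, flux, sigma, wave):
    # constant continuum + Gaussian of integrated flux `flux`, width `sigma`, centered on `wave`
    return cont + flux / (sigma * np.sqrt(2. * np.pi)) * np.exp(-0.5 * ((ws - wave) / sigma) ** 2)

rng = np.random.default_rng(1234)
waves = np.arange(7000., 7050., 0.8)
ivars = np.full(waves.size, 4.0)
fluxes = gauss_plus_cont(waves, 1.0, 8.0, 2.0, 7025.0) + rng.normal(scale=1. / np.sqrt(ivars))

popt, pcov = curve_fit(
    lambda ws, cont, flux, sigma: gauss_plus_cont(ws, cont, flux, sigma, 7025.0),
    waves,
    fluxes,
    p0=[1.0, 5.0, 3.0],
    sigma=1. / np.sqrt(ivars),   # ivar weighting, as in the snippet above
    bounds=([0., 1e-5, 0.1], [1e3, 1e9, 10.]),
)
flux_fit, flux_ivar = popt[1], 1. / pcov[1, 1]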

- SB: "what is in output for a line if the redshifted wavelength is outside the wavelength range of the spectrum (NaN?)? completely masked? not completely masked, but e.g. only one pixel is unmasked such that you can't do an N parameter fit?"
=> in those cases, NaN should be returned.
e.g.

keep_cont &= ~keep_line
# AR discarding flux=nan and ivars == 0 pixels
keep_line &= (np.isfinite(fluxes)) & (ivars > 0)
keep_cont &= (np.isfinite(fluxes)) & (ivars > 0)
# AR
models = np.nan + np.zeros(keep_line.sum())
# AR enough pixels to fit?
succeed = False
if (
    (keep_line.sum() >= 3) &
    (keep_cont.sum() >= 3) &
    (n_cov_lines >= min_n_lines)
):

I first discard pixels with non-finite flux or ivar=0; then, if I am not forgetting any cases, I think I return NaN if one of the following happens:

  • less than 3 pixels to estimate the continuum (hard-coded);
  • less than 3 pixels to fit the line (hard-coded);
  • edge of the spectrograph coverage: less than 20 A in the rest frame on each side of the line (min_rf_fit_hw argument);
  • the flux value(s) at the line position(s) is smaller than the estimated continuum value.

If none of the above happens, I perform a fit with: popt, pcov = curve_fit().
I then further require:

  • np.diag(pcov).sum() > 0
  • the np.diag(pcov) value for the flux parameter is > 0 (so actually that makes the previous requirement useless..)
  • the fitted flux value is at least 1% away from the allowed fit boundaries (arguments: min_flux=1e-5, max_flux=1e9, in 1e-17 * erg/cm2/s/A)

If those are not satisfied, I return NaN.

Those are somewhat ad hoc choices, but overall they look OK for my ELG needs.

@araichoor
Contributor Author

about consistency with fastspecfit

for line wavelengths:
it overall looks OK to me, see below; though, shall I just replace my values with the fastspecfit ones?

JM: https://github.com/desihub/fastspecfit/blob/everest-1/py/fastspecfit/data/emlines.ecsv
AR : http://classic.sdss.org/dr6/algorithms/linestable.html

LINE        JM         AR
oii_3726    3727.092   3727.092
oii_3729    3729.874   3729.875
hdelta      4102.892   4102.89
hgamma      4341.684   4341.68
hbeta       4862.683   4862.68
oiii_4959   4960.295   4960.295
oiii_5007   5008.239   5008.240
halpha      6564.613   6564.61

for other quantities:
I think fastspecfit also provides fluxes in 1e-17 erg/..., so we should be consistent here.
For the other quantities (sigma, share, rfew), I'm happy to modify my code to be more consistent with fastspecfit; I just need guidelines.

@moustakas
Member

how are masked but ivar != 0 input wavelengths handled?

I had not realized that masked pixels did not have ivar = 0 either, so this is something I'll need to implement as well.

@sbailey
Contributor

sbailey commented Aug 27, 2021

how are masked but ivar != 0 input wavelengths handled?

I had not realized that masked pixels did not have ivar = 0 either, so this is something I'll need to implement as well.

In earlier steps of the pipeline we purposefully do not immediately zero-out the ivars of masked pixels so that we don't lose information. However, now that you ask/wonder/mention it, I remembered that we do set ivar=0 for masked pixels when creating the cframe files, and thus the spectra and coadds as well, because we thought end-users would expect that. i.e. I think this is a non-issue (but you are welcome to double-check me on that and report any cases with a mask but ivar != 0 in cframe / spectra / coadds)

@sbailey
Contributor

sbailey commented Sep 1, 2021

Thanks for the explanations @araichoor .

@moustakas any more comments on this? In particular for data model names (SHARE, RFEW, ...) and units? Algorithmically fastspecfit and emlinefit can be different, but let's try to have common variable names and units for the same concepts. Could you explicitly cross-check that and make any suggestions? Thanks.

@moustakas
Member

Apologies for the delay.

Algorithmically, @araichoor has already shown that there is good agreement between his measurements and my independent measurements with fastspecfit, so I don't intend to review the code. I'll just add some notes and questions. Apologies in advance for the length of my report!

  • First, I agree that we should make the rest-frame wavelengths consistent. I've been using PyNeb, which uses the latest atomic data, to get these wavelengths and in any case our disagreements are mostly at the third decimal point.

  • My second question is: why these lines and not others? For example, the [OII] and [OIII] doublets and H-beta make sense for a code focused on ELGs, but if you're fitting H-alpha why not fit the adjacent [NII] doublet? Presumably it's too weak, but then so are H-gamma and H-delta. In addition, as you know, the Balmer lines are all affected by stellar absorption (becoming fractionally more important for the higher-order lines), so that will affect the measured Balmer fluxes except in galaxies with very large specific star formation rates (equivalent widths). But maybe it doesn't matter given the goals of emlinefit?

  • Third, regarding the data model, to make it easy to compare our two codes, I've pasted below the reported measurements from emlinefit for just the [OII] doublet and for the [OII] 3726 line from fastspecfit (see also the fastspec data model). Here are my reactions / thoughts / comments but please feel free to ignore any and all of my suggestions!

    • It's odd to me that OBSEMWAVES is a byte array rather than a 2-element single-precision array. Also, the observed wavelengths are derivable from the redshift and the rest-frame wavelengths so perhaps they don't need to be stored (although it's clearly convenient!)?
    • NDOF should be an integer, not float array.
    • Is your CHI2 the reduced chi^2 (mine is reduced)?
    • How strongly do you feel about the suffix RFEW (emlinefit) vs EW (fastspecfit)? In my upbringing as an optical spectroscopist, equivalent widths are always rest-frame, hence my simpler suffix.
    • In my code, the peak amplitude of the line is the critical measurement, not the integrated flux. (The reason is that you can have a broad line totally buried in the noise which is "significant" based on the integrated flux but which is insignificant based on the ratio of the line-amplitude to the rms of the surrounding continuum. I've done a lot of visual inspection on these two approaches and have found using the amplitude much more akin to what you and I would do to identify significant lines by eye.)
    • I'm a little unclear about the SHARE suffix and whether it should be a reported quantity in the core measurements (as opposed to a quantity-of-convenience that an end-user should construct themselves). Since you're fitting each component of the [OII] doublet independently, in order to construct this quantity, presumably you have to make some choices about the significance of each of the components of the doublet. For example, if the [OII] 3726 line-flux is negative (let's say it's affected by problematic sky-subtraction) but the [OII] 3729 line-flux is positive (and real), do you still use both lines to combine them? In fastspecfit I opted to keep each line separate (although in my current branch I've decided to fix the doublet ratio) and let the end-user combine the individual lines however they need or want depending on their application(s).

    emlinefit data model (one line)

           name        dtype  n_bad
    ----------------- ------- -----
       OII_OBSEMWAVES bytes20     0
             OII_FLUX float32   129
        OII_FLUX_IVAR float32   129
            OII_SIGMA float32   129
       OII_SIGMA_IVAR float32   129
             OII_CONT float32    61
        OII_CONT_IVAR float32    61
            OII_SHARE float32   129
       OII_SHARE_IVAR float32   129
             OII_RFEW float32   129
        OII_RFEW_IVAR float32   129
             OII_CHI2 float32   129
             OII_NDOF float32   129
    

    fastspecfit data model (one line)

             name           dtype  shape
    ---------------------- ------- -----
              OII_3726_AMP float32
         OII_3726_AMP_IVAR float32
             OII_3726_FLUX float32
        OII_3726_FLUX_IVAR float32
          OII_3726_BOXFLUX float32
           OII_3726_VSHIFT float32
            OII_3726_SIGMA float32
             OII_3726_CONT float32
        OII_3726_CONT_IVAR float32
               OII_3726_EW float32
          OII_3726_EW_IVAR float32
       OII_3726_FLUX_LIMIT float32
         OII_3726_EW_LIMIT float32
             OII_3726_CHI2 float32
             OII_3726_NPIX   int32
    
    • It's probably my history with IDL, but I'm not at all a fan of using nan to represent non-measurements. In my world, nan means something went terribly wrong and needs to be investigated. In fastspecfit if the ivar of a measurement is zero then it means no measurement was possible.
    • It looks like you don't allow a velocity-shift of the line-center with respect to the redrock redshift, which fastspecfit does. I haven't investigated how much this flexibility is needed, but in Denali I found that the 1-sigma scatter in the Balmer-line redshifts compared to the nominal redrock redshifts were about 30 km/s (less than a pixel over most of our wavelength range); for the forbidden lines the scatter was a bit larger, roughly 60 km/s, although with a tail out to +/- 100 km/s (>3 pixels). It may be worth thinking about this point more (but not for this PR).
  • It doesn't look like you use the resolution matrix---why not? I guess this means that your measured line-widths (which are in observed-frame Angstroms) need to be deconvolved for the instrumental resolution before they're converted to km/s (a minimal conversion sketch follows after this list).

  • Why do you fit all 500 spectra, including the sky spectra? How are the fits to the non-TGT spectra used?

  • You write that it takes 10-20 seconds to fit a single petal (500 spectra), which is shockingly fast---nice! However, I don't see any multiprocessing in the code. Presumably with multiprocessing you could fit 32 spectra simultaneously and get this time down to <1 second. Along the same lines, how is the code MPI-parallelized, if at all?
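
(For reference, a minimal sketch of the deconvolution and km/s conversion mentioned in the resolution-matrix item above; the instrumental width value used here is purely illustrative, not taken from the DESI resolution data.)

import numpy as np

c_kms = 299792.458                       # speed of light [km/s]
sigma_obs = 2.5                          # fitted observed-frame line width [A]
sigma_inst = 0.8                         # assumed Gaussian instrumental width [A] (illustrative)
lambda_obs = 3727.092 * (1. + 1.1)       # e.g. [OII] 3726 observed at z = 1.1

# remove the instrumental width in quadrature, then convert to a velocity width
sigma_intr = np.sqrt(max(sigma_obs**2 - sigma_inst**2, 0.))
sigma_kms = c_kms * sigma_intr / lambda_obs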

Finally, a broader question (especially for @sbailey) that is semi-related to this PR but which should not be a blocking factor. We now have two after-burners for emission lines, emlinefit for ELGs and the MgII after-burner for QSOs. It has taken some time to evolve, but fastspecfit now does both these tasks and much more, delivering quantities that are basically science-ready for all extragalactic objects (well, the QSOs are hard!).

With more capabilities, of course, there is a computational cost. fastspecfit is significantly slower than either of these existing after-burners, although my latest refactor sped things up by about a factor of 10 (desihub/fastspecfit#37) and I think with more streamlining (read: remove as many of the astropy wrappers as possible!) the code can be sped up even more. Something to discuss and think about.

@araichoor
Contributor Author

thanks a lot for all the relevant comments/questions!

I answer each item below (sorry for the long message).

But first, a general comment relevant to several items: this afterburner is primarily designed to be used for the ELG "zreliable" criterion, which currently relies on the OII SNR; maybe the OIII SNR will be used later, but probably not much more.
I know it is very tempting to improve/refine things, but then this would just end up reproducing fastspecfit, which is way more precise/developed.
That's why I'm more inclined to keep things as simple as possible, knowing of course that the provided measurements are coarse; but those are sufficient for the ELG zreliable criterion.
The other provided measurements may be useful for users looking for simple diagnostics.

Now going per item:

  • rest-frame wavelengths => no problem, I'll switch to your (PyNeb) values;
  • why these lines and not others? => as said, that was initially for ELGs; so I started with OII, then added OIII; and the first Balmer lines were suggested then..
  • data model:
  • OBSEMWAVES is a byte array => I came up with a string because I have one entry for the doublets, so two wavelengths stored in one array element; as you mentioned, that is mainly for convenience; one could always go back to the code, and get it with the redshifts; I can remove that column;
  • NDOF should be an integer, not float array => that's because of the np.nan, I believe; if that's an issue I could try to find a work-around
  • Is your CHI2 the reduced chi^2 (mine is reduced)? => no; I could provide CHI2/NDOF (to me it would sound better to then call it RCHI2, but I guess you're not going to change that in fastspecfit);
  • How strongly do you feel about the suffix RFEW (emlinefit) vs EW (fastspecfit)? => sure, I can switch to EW, I don't mind;
  • peak amplitude vs. integrated flux => in case it's relevant here, I'm requiring SIGMA < 10 A, so that may prevent such cases, no? (looking at the OII_SIGMA distribution, it peaks at ~2 A, has a std of ~2 A, and a tail extending to 10 A)
  • SHARE: it is only relevant for the OII doublet, where I let it free; for the OIII doublet, I fix it during the fit (and obviously it is not relevant for the Balmer lines); the distribution is centered on the initial guess (=expected value) of 0.58, with a std of ~0.2; it happens that my approach is the same as the one recommended by B. Weiner in [desi-galaxies 703]; for any valid fit, I require that the flux at the observed emission line(s) is above the continuum; so, for the OII doublet, I'm not sure I make any choice about the significance of each component; for the OII and OIII doublets, I use this simple functional form (a self-contained sketch follows after this list):
    myfunc = lambda ws, sigma, F0, sh : mydict["CONT"] + gauss_nocont(ws, sigma, (1-sh) * F0, obs_em_waves[0]) + gauss_nocont(ws, sigma, sh * F0, obs_em_waves[1])
  • I'm not at all a fan of using nan to represent non-measurements => to make my life simpler, I just set to np.nan all cases where I don't have a "reliable" fit, be it because the line is outside the spectral range, there are not enough pixels to fit, the fit hit the boundaries, etc. I agree it would be best to proceed as you suggest, but that may require more coding work, so if that's fine, I'll keep the NaNs; besides, it also simplifies things for the end-user;
  • velocity shift of the line center with respect to the redrock redshift => correct, I don't allow that, I just use the redrock redshift; for the ELG purpose I guess it's fine;
  • It doesn't look like you use the resolution matrix---why not? I guess this means that your measured line-widths (which are in observed-frame Angstroms) need to be deconvolved for the instrumental resolution before they're converted to km/s => that is correct; I guess the reason is that my main product is the OII SNR, all the other products come "for free"; I'd sooner remove those other products than refine the analysis...
  • Why do you fit all 500 spectra => yes, I fit everything; that is simpler from a file-structure point of view; I agree that the non-TGT fits are nonsense; I could force np.nan for those.
  • multiprocessing/MPI: my understanding is that the pipeline launches one task per redrock file, so the parallelization is already done in the pipeline, no?
  • fastspecfit now does both these tasks and much more => sure, I agree! the primary initial goal of this emlinefit afterburner is to get the OII SNR in the daily reduction.
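
(To make the SHARE functional form above concrete, here is a self-contained sketch; gauss_nocont below is a stand-in with the same call signature as in the quoted lambda, i.e. a Gaussian of given integrated flux, width, and center with no continuum, which is an assumption about the actual implementation:)

import numpy as np

def gauss_nocont(ws, sigma, flux, wave):
    # Gaussian of integrated flux `flux`, width `sigma` (A), centered on `wave` (A), no continuum
    return flux / (sigma * np.sqrt(2. * np.pi)) * np.exp(-0.5 * ((ws - wave) / sigma) ** 2)

def oii_doublet(ws, cont, sigma, F0, sh, obs_em_waves):
    # F0 = total doublet flux; sh = SHARE = f1 / (f0 + f1) (free for OII, fixed for OIII);
    # the initial guess for sh is 0.58, as mentioned above
    return (cont
            + gauss_nocont(ws, sigma, (1. - sh) * F0, obs_em_waves[0])
            + gauss_nocont(ws, sigma, sh * F0, obs_em_waves[1]))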

@araichoor
Contributor Author

for info, I hope to be able to implement those changes today/tomorrow.
also, I plan to factor out the functions into an emlinefit_io.py, so that one can load them through "from desispec.emlinefit_io import .."; and also call get_emlines() for just a set of provided targetids.

@sbailey
Contributor

sbailey commented Sep 23, 2021

Minor: please put the I/O functions in desispec.io.emlinefit to be grouped with the other I/O sub-modules under desispec.io.

@araichoor
Contributor Author

oh right, thanks for noticing that; sure, will do.

@araichoor
Contributor Author

As said, for convenience, I also propagate these columns from the redrock file:
TARGETID, TARGET_RA, TARGET_DEC, OBJTYPE, FIBER, Z, ZWARN, SPECTYPE, DELTACHI2

However, I notice that - obviously - FIBER is not present in the healpix FIBERMAP extension.
So I plan to remove FIBER from that list, i.e. have the same data model for the tiles/ and healpix/ catalogs (rather than adding the FIBER column only for the tiles/ catalogs).

@araichoor
Contributor Author

On further thought, I will actually remove the OBSEMWAVES column, and instead put that information in the EMLINEFIT extension header.
I plan to use the following format: first list the fitted lines in EMNAMES, then report for each line the rest-frame wavelength(s) in RFWAVE00, RFWAVE01, etc., following the order of appearance in EMNAMES:

EMNAMES = 'OII,HDELTA,HGAMMA,HBETA,OIII,HALPHA'                                 
RFWAVE00= '3727.092,3729.874'                                                   
RFWAVE01= '4102.892'                                                            
RFWAVE02= '4341.684'                                                            
RFWAVE03= '4862.683'                                                            
RFWAVE04= '4960.295,5008.239'                                                   
RFWAVE05= '6564.613'                                                            

That way, along with the Z column, one can reconstruct the fitted observed wavelengths; it offers flexibility and keeps the header keys at 8 characters (provided we don't fit more than 99 lines!)
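
(For illustration, a minimal sketch, with a hypothetical file name, of reconstructing the fitted observed wavelengths from these header keywords and the Z column:)

import fitsio

with fitsio.FITS("emlinefit-0-80606-thru20201219.fits") as h:
    hdr = h["EMLINEFIT"].read_header()
    d = h["EMLINEFIT"].read()

emnames = [name.strip() for name in hdr["EMNAMES"].split(",")]
rf_waves = {
    emname: [float(w) for w in hdr["RFWAVE{:02d}".format(i)].split(",")]
    for i, emname in enumerate(emnames)
}
# observed wavelength(s) of the fitted line(s) for the first target: (1 + Z) * rest-frame wavelength(s)
z = d["Z"][0]
obs_oii_waves = [(1. + z) * w for w in rf_waves["OII"]]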

Besides, I also plan to propagate the arguments of desi_emlinefit_afterburner relevant to the fit:

RRFN    = '/global/cfs/cdirs/desi/spectro/redux/everest/healpix/main/dark/100/&'
CONTINUE  '10000/redrock-main-dark-10000.fits'                                  
COADDFN = '/global/cfs/cdirs/desi/spectro/redux/everest/healpix/main/dark/100/&'
CONTINUE  '10000/coadd-main-dark-10000.fits'                                    
RFHW    =                   40                                                  
MINRFHW =                   20                                                  
RFCONTW =                  200                                                  
RV      =                  3.1                                                  
BALMFIT = 'em      '                                                            

@moustakas
Member

@araichoor I think you should consider continuing to fit all the target classes, at least for the time being. You've shown that the compute cost is negligible and it would be really helpful to have a comparison dataset for fastspecfit for more than just the ELGs (which, as you know, are the faintest and hardest objects we're observing!)

@araichoor
Contributor Author

I see, thanks for the feedback.
My worry is that we would then provide measurements, some of which we know are not reliable / well-tested.

Another point if we go this way: when comparing BGS/LRG measurements with fastspecfit, I notice that the default SIGMA (observed Gaussian width) settings I adopted, based on the ELGs, are no longer valid for those different tracers; I set SIGMA < 10 A, whereas it can be larger for BGS/LRGs; I can see if I can allow a larger SIGMA without affecting the ELG results..
I'll put that at the top of my to-do list.


@sbailey sbailey left a comment


Apologies for delayed review. This is nice and fast and meets the core needs, but I've put some comments inline related to code structure and future maintainability.

The core algorithm is isolated in the emlines_gaussfit function (good), but other pieces seem to mix I/O, algorithms, and plotting. Could you try to separate these more so that the outer wrapper script looks more like:

stuff = desispec.io.emlinefit.read_stuff(coadd_filename, redrock_filename)
results = desispec.emlinefit.emlines_gaussfit(stuff)
desispec.io.emlinefit.write_emlines(outfile, results)
if args.outpdf:
    desispec.io.emlinefit.plot_emlines(args.outpdf, results)

(don't actually use the name "stuff", that's a placeholder for whatever would be read and returned)

Maybe that's what get_emlines is supposed to be? But in that case get_targetids should be working on stuff that get_emlines is already reading instead of working off of filenames directly. Please try to restructure this into:

  • desispec.io.emlinefit - I/O and convenience reformatting, but no calculations
  • desispec.emlinefit - the core algorithms but no I/O, in a form that would be useful to someone calling on a simulated spectrum or from a Jupyter notebook.
  • bin/desi_emlinefit_afterburner - puts those two together

If that doesn't make sense let me know and we can chat in person to clarify. Thanks.

mydict[emname]["fitivar"][i] = iv
mydict[emname]["model"][i] = m
# AR plot?
if outpdf is not None:
Contributor


Usually we want to separate calculation from plotting, e.g. so that if you want to update the plots you can call it with some data to get a new plot without having to redo intermediate calculations. Could you move out this plotting into a separate function that would take the outputs of emlines_gaussfit and make a plot, or does it rely upon some internal variable that isn't returned?

Contributor Author


done, I've moved that to desispec.io.emlinefit.plot_emlines().


allowed_emnames = ["OII", "HDELTA", "HGAMMA", "HBETA", "OIII", "HALPHA"]

def get_targetids(redrockfn, bitnames, log=get_logger()):
Contributor


This function opens and reads the redrock file twice to perform a selection of targetids. I think it would be cleaner to make this a non-I/O function that takes a fibermap and bitnames and returns targetids (or maybe the indices of the fibermap or a boolean array of matches).

Some of this logic looks like it would be cleaner to use desitarget.targets.main_cmx_or_sv to figure out what the relevant target column is and what mask to import. Please check if that would work.
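
(For reference, a rough sketch of what such a non-I/O get_targetids could look like; the exact main_cmx_or_sv return values and the bitnames handling are assumptions here, not the final implementation:)

import numpy as np
from desitarget.targets import main_cmx_or_sv

def get_targetids(fibermap, bitnames):
    # fibermap: the already-read FIBERMAP table; bitnames: e.g. "ELG" or "ELG,QSO"
    # main_cmx_or_sv is assumed to return the relevant target column names, masks, and survey
    target_cols, target_masks, survey = main_cmx_or_sv(fibermap)
    desi_col, desi_mask = target_cols[0], target_masks[0]   # e.g. DESI_TARGET or SV3_DESI_TARGET
    sel = np.zeros(len(fibermap), dtype=bool)
    for bitname in bitnames.split(","):
        sel |= (fibermap[desi_col] & desi_mask[bitname]) > 0
    return fibermap["TARGETID"][sel]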

Contributor Author


Here, I've kept the function as is (for the time being).

Maybe that doesn't really matter as the redrock*fits files are small, but from my experience, I think it's faster to proceed this way:

  • first only read the file structure with fits.open(), but not the data, to inspect the column content (that step is super-quick)
  • then read only the relevant columns,

rather than just reading the whole data.
The same goes for desitarget.targets.main_cmx_or_sv: that function assumes that the whole data have been read.

Note that in the new code version, I also use that two-step approach, e.g.:

h = fits.open(redrock)
extnames = [h[i].header["EXTNAME"] for i in range(1, len(h))]
if "REDSHIFTS" in extnames:
    rr_extname = "REDSHIFTS"
elif "ZBEST" in extnames:
    rr_extname = "ZBEST"
else:
    msg = "{} has neither REDSHIFTS or ZBEST extension".format(redrock)
    log.error(msg)
    raise RuntimeError(msg)
# AR rr_keys, fm_keys: restrict to existing ones
rmv_rr_keys = [key for key in rr_keys.split(",") if key not in h[rr_extname].columns.names]
if len(rmv_rr_keys) > 0:
    log.info("{} removed from rr_keys, as not present in {}".format(",".join(rmv_rr_keys), rr_extname))
rr_keys = ",".join([key for key in rr_keys.split(",") if key not in rmv_rr_keys])
# AR redrock: reading
rr = fitsio.read(redrock, ext=rr_extname, columns=rr_keys.split(","))

Please let me know if there is a better way to proceed.

@araichoor
Contributor Author

Thanks a lot @sbailey for the detailed + pedagogical review!

Except for the comment on get_targetids(), I think I've addressed all of those.
Apologies for the "big commit" (fe63705), but I didn't find an easy way to split it into smaller commits that wouldn't leave the code in a broken state.

I've checked on a few cases that the new version produces the same answer as the old version, but I'll test more tomorrow; still, I'm already pushing/sharing the code, so that you have time to re-review when it suits you best.

I now reply to each comment.

@araichoor
Contributor Author

And before I forget: besides the code-structure changes, one key question is whether we want to provide those fits for all spectra or not.
My worry in doing so is that I'm not super-confident in the fits for the non-ELG tracers, and I don't have time to experiment to see whether changing the default fitting parameters could help.
But maybe that is still better than nothing...

@sbailey
Contributor

sbailey commented Dec 14, 2021

Thanks for the very quick updates. The new structure looks better.

By default for the pipeline let's run it on all targets, and document that the file is primarily intended for ELGs. Complete failures have NaN, and you also propagate IVARs and CHI2s if someone wants to check goodness / significance of fit when applying to BGS or LRGs, so even if they aren't "good" fits, there are ways to inspect how good they are.

get_targetids: it looks like you are trying to be more efficient with I/O by reading only the minimal set of columns needed instead of all of the columns. That can be a big win when reading huge files with lots of columns, but it doesn't seem to matter for these small FIBERMAP HDUs where we are dominated by overhead. I tested reading the FIBERMAP HDU from redrock files and the timing varied between 3ms and 30ms regardless of whether I was reading all columns or just ['TARGETID', 'DESI_TARGET']. The second time I read a file is always way faster than the first time, but the first read of a new file fluctuated a lot regardless of how many columns I was reading.

So, my proposed structure is:

  • read_emlines_inputs takes args.bitnames as an input instead of targetids
    • even better would be to leave targetids as an additional option to enable reading a specific set of pre-selected targets regardless of their bits
  • read_emlines_inputs would read the full FIBERMAP HDU without trying to pre-filter to only specific columns.
  • it would then call get_targetids with that table, and get_targetids itself wouldn't do any I/O
  • ideally get_targetids would use desitarget.targets.main_cmx_or_sv instead of doing similar logic so that we only have to maintain that kind of column name parsing in one place instead of several (e.g. if N years from now we start observing DESI-II tiles that have yet another column name, ideally we'd only have to add support for that in main_cmx_or_sv.)

Background motivation: our experience at NERSC is that when the filesystems work, they are so fast that in most cases it doesn't matter (looping over tractor and target selection inputs is a case that gets big enough where it does matter). But there is a small probability that basic operations like open, close, read will hang and we are more dominated by those outliers than whether a given read takes 3ms or 30ms.

FWIW, the best structure for I/O is to use fitsio and open the file once and do all the necessary reads before closing, e.g. using a "context manager" that will auto-cleanup if something goes wrong during the read:

with fitsio.FITS(redrock_filename) as rr:
    redshifts = rr['REDSHIFTS'].read()
    fibermap = rr['FIBERMAP'].read()

instead of

redshifts = fitsio.read(redrock_filename, 'REDSHIFTS')
fibermap = fitsio.read(redrock_filename, 'FIBERMAP')

Full disclosure: I've certainly written a lot of code that looks like the second case instead of first, but I acknowledge that the first is better from an I/O metadata load standpoint.

@araichoor
Contributor Author

Again, thanks a lot for the reply (+testing the speed of different calls!).
All that you write makes sense, and should be easily doable, so I'll go for that.

I support the idea of keeping targetids as an argument, as it can be useful to be able to call the line fitter for only a few desired targets.
I think that, in the case where both bitnames and targetids are provided, I'll make a selection on both of them (i.e. take the provided targetids which pass the bitname selection); does that make sense?

Lastly, since you're providing useful tips on using fitsio: if I read the redrock/fibermap catalogs with all the columns, I still need to then remove all columns except rr_keys/fm_keys, as those will be propagated into the final output catalog.
My question: once I've read a catalog with fitsio, how do I properly remove a given column from that catalog?

@sbailey
Contributor

sbailey commented Dec 14, 2021

I support the idea of keeping targetids as an argument, as it can be useful to be able to call the line fitter for only a few desired targets.
I think that, in the case where both bitnames and targetids are provided, I'll make a selection on both of them (i.e. take the provided targetids which pass the bitname selection); does that make sense?

yes, sounds good.

Lastly, since you're providing useful tips on using fitsio: if I read the redrock/fibermap catalogs with all the columns, I still need to then remove all columns except rr_keys/fm_keys, as those will be propagated into the final output catalog.
My question: once I've read a catalog with fitsio, how do I properly remove a given column from that catalog?

I had forgotten about the issue of what gets propagated to the output catalog. You can filter down to just the columns you want with e.g.

blat = fibermap[['TARGETID', 'DESI_TARGET']]

numpy.lib.recfunctions also has a set of functions for adding and removing columns from a numpy structured array. Another option is to cast to an astropy Table and then add/remove columns (which I think is more memory efficient but less CPU efficient, but in this case it probably doesn't matter).
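
(A couple of short, hedged examples of those two options; the file name is taken from an earlier comment in this thread and the column names are just the ones discussed above:)

import fitsio
import numpy.lib.recfunctions as rfn
from astropy.table import Table

fibermap = fitsio.read("redrock-main-dark-10000.fits", ext="FIBERMAP")
keep = ["TARGETID", "TARGET_RA", "TARGET_DEC", "OBJTYPE"]

# numpy: multi-field selection + repack_fields to drop the unused columns
fm_small = rfn.repack_fields(fibermap[[c for c in keep if c in fibermap.dtype.names]])

# astropy: cast to a Table and remove the unwanted columns
t = Table(fibermap)
t.remove_columns([c for c in t.colnames if c not in keep])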

@araichoor
Contributor Author

I should have addressed your last batch of comments:

  • default to fitting all 500 fibers in bin/desi_emlinefit_afterburner;
  • use desitarget.targets.main_cmx_or_sv in get_targetids()
  • nest the call to get_targetids() inside read_emlines_inputs();
  • keep the targetids=None as argument of read_emlines_inputs() (though that is not propagated as an argument of bin/desi_emlinefit_afterburner, as I don't feel it's reasonable to have this at the command line level);
  • cleaner use of fitsio to read files.

Plus a few code-syntax cleanups.

@sbailey
Contributor

sbailey commented Dec 15, 2021

Thanks for the updates; the new structure looks great.

As compensation for your patience, I have added the final set of changes to add unit tests:

  • Moved most of bin/desi_emlinefit_afterburner into py/desispec/scripts/emlinefit.py so that the functions become available to call from unit tests
  • Added py/desispec/test/test_emlinefits.py
    • at GitHub, this tests the algorithms in desispec.emlinefit with 95% coverage
    • at NERSC, it additionally calls the full script on everest inputs to also test desispec.scripts.emlinefit (75%) and desispec.io.emlinefit (61%).
    • making the pdf and filtering by targetid are the pieces not tested

These aren't sophisticated tests for correctness, but are basic "does it run without crashing" tests that will help us in the future to catch any problems with changes to numpy, astropy, fitsio, etc. before running this in the full pipeline.

Speaking of the full pipeline -- I think this is ready to merge now, but I'll leave it open for you to look at what I did with the tests and complain/question if needed. After cori is back let's add the final hooks to include this in the daily pipeline desi_tile_redshifts (which can be a separate PR).

My structural requests have caused your original single-file-script to become multiple files:

  • bin/desi_emlinefit_afterburner
  • py/desispec/scripts/emlinefit.py
  • py/desispec/emlinefit.py
  • py/desispec/io/emlinefit.py

While that is admittedly somewhat of a pain, it is motivated by:

  • separating algorithms from I/O from plotting
  • which makes it easier to reuse the algorithms from Jupyter notebooks or other future MPI-parallelized wrappers
  • and makes it easier to test (e.g. algorithms at GitHub without requiring any input data and full tests at NERSC)

Final notes:

  • a new github sphinx failure has appeared, though full docs build without warnings or errors at NERSC. I think this is related to pinning the version of sphinx that is already fixed in a previous PR that came after this branch, so I'll ignore it here.
  • coverage is still complaining, but that's because the full script tests requires files at NERSC that aren't available to the coverage test at GitHub. At NERSC I checked coverage with: pytest --cov=desispec --cov-report=html py/desispec/test/test_emlinefit.py and then copied the "htmlcov" dir to $CFS/desi/users/sjbailey so that I could view it via data.desi.lbl.gov .

@araichoor
Contributor Author

Thanks a lot for the support; I've learned quite a lot during this exercise.

Just one remark about the test:
I see this test/comment:

#- both redshifts should have an answer for OII

However, the code will return a NaN if it fails, e.g. if there are not enough valid pixels or if the flux is negative at the fitted location.
As the flux is a random variable centered on zero, it can happen that it is negative at the [OII] location.
I checked by copy-pasting the code lines; it looks like the two random fluxes are positive there, so it should be ok:
(attached plots: tmp-0, tmp-1)

Maybe we could add a positive offset to the fluxes, to be sure / explicit about that, e.g.:

 fluxes = 10 + rand.normal(size=(nspec, nwave))

@sbailey
Contributor

sbailey commented Dec 15, 2021

Interesting. Then an algorithm choice question: I was expecting that if the data were negative at the location of a line, you'd just get a negative FLUX rather than forcing it to NaN. To me that is different from a "failed" fit from not having enough good pixels at that wavelength (or having the line completely redshifted out of the wavelength coverage). Is there a specific reason to not propagate a negative flux fit? To me this is conceptually equivalent to not clipping photometric fluxes to 0 when making an imaging catalog -- let the data fluctuate, and sometimes the fit will be negative even though you know the object photons are strictly positive, but the negative tail can be useful in assessing the noise/statistical properties of the dataset.

Also, can you point to the piece of code that sets negative fits to NaN (or otherwise doesn't overwrite the default NaN because the fit was negative)? From a quick glance that isn't obvious to me where that happens.

@araichoor
Contributor Author

I acknowledge my approach may be a bit "brutal", but the idea was end-user oriented: if there is a non-NaN measurement, then it can be trusted (at least for the ELGs :)); no need to make several cuts to decide whether the fit is trustworthy or not; that of course has been driven by my work on ELGs.

In that spirit, I've put in place several requirements;
I described them at the beginning of this PR (Aug 26! #1386 (comment)); for what we discussed, I think the relevant ones are:

  • the flux at the line position is above the estimated continuum (so what I wrote in the previous message is not exactly correct; but the two random fluxes in the test still pass that requirement, I think):

    # AR is the flux above continuum for at least one line?
    if obs_em_fluxes.max() > emdict["CONT"]:

  • I set some boundaries for the flux fitting (min_flux=1e-5, max_flux=1e9), and if the fitted flux is close to the boundaries (1% margin), I decide it is a failure:

    # AR fit succeeded?
    # AR - pcov.__class__ criterion => dates from JC, not sure how relevant.. keeping it
    # AR - then we decide that the fit is no good if:
    # AR - var = 0,
    # AR - or var(flux) = 0
    # AR - or flux closer than 1% to the allowed boundaries
    if pcov.__class__ == np.ndarray:
        diag = np.diag(pcov)
        if (diag.sum() > 0) & (diag[1] > 0) & (popt[1] > 1.01 * min_flux) & (popt[1] < 0.99 * max_flux):
            succeed = True

Around that place in the code you may find more information on that topic; happy to give more detail / discuss it further.

@sbailey sbailey merged commit 4129b1a into master Dec 21, 2021
Fuji automation moved this from In progress to Done Dec 21, 2021
@sbailey sbailey deleted the emlinefit branch December 21, 2021 05:05