Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cross-correlation redshift/rv-fiding #544

Merged
merged 27 commits into from
Mar 18, 2020
Merged

Implement cross-correlation redshift/rv-fiding #544

merged 27 commits into from
Mar 18, 2020

Conversation

ibusko
Copy link
Contributor

@ibusko ibusko commented Oct 22, 2019

This code implements part of the workflow necessary for actually finding the redshift/RV: the cross-correlation computation. The correlation peak finding is supposed to happen elsewhere, maybe in a notebook where the user will have his/hers favorite workflow laid out. Also, any preparation of the spectra required by the actual science (e.g. removal of emission lines, masking of regions, and the like) is assumed to take place before the template_correlate function is called.

@ibusko
Copy link
Contributor Author

ibusko commented Oct 23, 2019

@eteq @nmearl @camipacifici can you review this? I am not being able to tag reviewers via the normal access point in the github page. Perhaps because of permissions or credentials or something.

Copy link
Contributor

@nmearl nmearl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from a very minor docstring change, this looks good to me.

specutils/analysis/correlation.py Outdated Show resolved Hide resolved
Copy link
Contributor

@nmearl nmearl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, this is going to need documentation in the docs.

@nmearl nmearl changed the title JDAT-57: Implement cross-correlation redshift/rv-fiding Implement cross-correlation redshift/rv-fiding Nov 11, 2019
docs/analysis.rst Outdated Show resolved Hide resolved
docs/analysis.rst Outdated Show resolved Hide resolved
docs/analysis.rst Outdated Show resolved Hide resolved
docs/analysis.rst Outdated Show resolved Hide resolved
An example of how to get the cross correlation follows. Note that the observed spectrum must have a rest wavelength
value set in.

.. code-block:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a question for @eteq or @camipacifici: would be want to demonstrate loading in a simple ASCII file with e.g. wavelength/flux values to use as the template? (That is, for someone perusing the docs, would that be more of a helpful example?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think the more useful one is something that actually uses real templates that are loaded as Spectrum1D objects. That is, a fits file with an actual spectrum.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. take a galaxy template from synphot and convert it to ascii and load in into a Spectrum1D object to use as the template.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some templates in the Synphot library (Bruzual-Charlot galaxy templates for example) https://pysynphot.readthedocs.io/en/latest/ $PYSYN_CDBS/grid/bc95/. I'd suggest bc95_g_10E9.fits. I don't think there is a URL available for this single file, but @pllim might know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In developing a test case which uses a real template spectrum (wave and flux from bc95_g_10E9.fits converted to ASCII format), I realized that we need also an observed spectrum to correlate against this template. Any idea on how to get one of those without resorting to too many manipulations? The goal is to have this in the docs, so I believe it has to be straightforward to get the observed spectrum, as it is to get the template.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ibusko - For an observed spectrum, I think the easiest thing is to use the SDSS spectrum that's already an example from the specutils docs - see https://specutils.readthedocs.io/en/stable/ (or the index.rst here) for the exact URL. That way we don't have to keep track of multiple examples.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although @camipacifici pointed out to me out-of-band that she has some actual observed spectra and matched templates for them that is part of the notebook she's working on. Maybe we should use those as the template/observed combo?

Unfortunately I'm not sure it's going to be easy to access these publicly until she gets a chance to put them in a reasonable data location. So it might be best to follow the above just to get this PR in, but with a plan to move to @camipacifici's data once we can get it (and use it also in the chi2-template-fitting section?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data is served by RedCat, independent of Synphotses code base. Pick your poison at https://ssb.stsci.edu/cdbs/grid/bc95/templates/

Copy link
Contributor Author

@ibusko ibusko Nov 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in place a unit test function in a branch in my fork for people to review and continue the discussion. The goal is to have something that can be put in the docs as an example on how to use the correlation function with real data. I found that it is perhaps trickier and lengthier than it should be, since it involves other aspects of the process besides the correlation itself. The code is in function test_correlation_real_data in ibusko:redshift-measurement-sandbox.

Temporarily I have both the observation and template spectra files added to the tests directory, because my dev environment (in PyCharm) can't run the test (it skips it) if I use the remote file loading code. It runs from the command line tough, so it's not a real problem. More important, remote loading works for the observed SDSS spectrum, but not for the synphot template. This file has to be manually changed into ECSV format to be usable. I think there is a way to use the file as is, but I didn't invest time pursuing it since that's not the goal of this exercise. These are topics to be solved later on. For now, we have to decide how to translate the code in the test function into a doc-usable code snippet.

Note that the test function is not doing all possible steps that are eventually needed to pre-process spectra before they can be cross-correlated, such as continuum subtraction or rectification. Should they be included in the example?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I implemented API for synphot to read from and write to Spectrum1D. Maybe it'll make things easier for you. See spacetelescope/synphot_refactor#243

@ibusko ibusko requested review from eteq and nmearl February 26, 2020 15:44
Copy link
Contributor

@javerbukh javerbukh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great documentation in the code! I had a problem running pytest on my machine on your code, however. This could be my conda environment or the config file, I'm not entirely sure. Once you get that fixed everything else looks good!

Copy link
Member

@eteq eteq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I reviewed this in light of my experimentation for the stellar cross-correlation use case, which you can look at here:
https://github.com/eteq/specutils/blob/redshift-stars/docs/Stellar%20cross-correlation.ipynb

the short version is: it doesn't work 😢. But fortunately I think not for core-algorithm reasons, but rather for lack-of-options reasons. Specifically, the following issues:

  1. There's currently no way to skip the resampling step. That's a deal-breaker because in my stellar x-corr example, I already have done the resampling in a way that's specific to that particular use case and need to be able to skip the internal step.
  2. I could be mis-understanding, but I think the current resampling step is subtly wrong in that it's resampling onto two different spectral_axis grids
  3. There's no way to say "I only want a sub-set of the lags" as far as I can tell. This is a problem because it means the correlation step is very slow when I try to do it at 10x resolution, which is roughly what's needed for the stellar x-corr case. So I would want a way to be able to say something like "only do the correlation for lags that are between -1000 and 1000 km/s", so that it's fast enough to work for this use case.

The first two I think are pretty easily fixed (and I've left comments inline about that). The third is a bit more, but perhaps we should have it be left as a follow-on PR, since this PR still has relevant use cases that do not require that.

EDIT/UPDATE: I realized I was missing one part of the workflow (continuum normalization), and once I did that it is working 👏, although not optimially: all the concerns above still apply because it's still clear that I need at least some level of sub-pixel resolution in the lags. But it is working in that it's getting the right ballpark answer!

# that both spectrum and template share a common spectral axis.
resampler = FluxConservingResampler()
observed_log_spectrum = _log_resampler(observed_spectrum, resampler)
template_log_spectrum = _log_resampler(template_spectrum, resampler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing to me. It looks like the _log_resampler is resampling onto log-spaced grids but not the same log-spaced grids. So there's no guarantee they will be on the same grid. So I would have thought instead you would create a log grid based on whichever is the higher resolution or possibly just the observed one (also see my other comment about allowing a way to short-circuit this), and then use that same grid on the template?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fixed in branch https://github.com/ibusko/specutils/tree/log_wavelength_2. You're right that the code above is wrong. The reason it did work in the galaxy workflow is that we originally had the resampling to a common wavelength scale done in the notebook. The new code (in the branch pointed above) resamples both spectrum and template into a common log-wavelength array.

return num/denom


def template_correlate(observed_spectrum, template_spectrum):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To follow in the same model as the resamplers, this should probably be a class. Right now it would be a trivial class - something like

class TemplateCorrelate:
    def __init__(self, ... any new arguments added from my comments...):
        self.newarg = newarg

    def __call__(self, observed_spectrum, templat_spectrum):
        ... all the same as this current function, using self.newarg where needed...

that seems more consistent with the resamplers and the consensus from #266

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just watched this video last night Stop Using Classes. This suggestion sets off alarm bells.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's worth considering, @hcferguson, but we had that discussion in #266 . To be fair I think we said we'd allow simple functions that we might wrap in classes later, so that doesn't have to block this PR, I suppose though.


# Resample both spectrum and template to log wavelength. This is assuming
# that both spectrum and template share a common spectral axis.
resampler = FluxConservingResampler()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised to see that this always uses a resampler. In my use case in https://github.com/eteq/specutils/blob/redshift-stars/docs/Stellar%20cross-correlation.ipynb, I already explicitly re-sample to put the two spectra on a shared wavelength grid, and the results depend critically on how that is done. So I think there should be a keyword argument (or class attribute, as discussed in my other comment) that makes it possible to skip the resampling phase.

Either way, it is critical that the docstring states that resampling takes place, since I wouldn't have known the data are being transformed without looking at the code (which many users won't do).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also may not be essential to conserve flux for this application, since it is just trying to measure the shift. A simpler resampler might be faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, easy to add a keyword. As for the simpler resampler, I ended up using the flux conserving one at some point in the development, because it gave better results. But that could have been due to something else as well. Once I got the function+galaxy notebook working, I forgot to revisit the resampler type issue. I will try with the linear type and see how it goes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hcferguson indeed, replacing the flux conserving by the linear resampler speeded up the calculation significantly. And didn't result in any visible side effects (as we should expect anyways).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this did change the results quite noticeably for the stellar case. Not in a way that significantly affected the final answer, but in a way I strongly suspect matters a lot more for fainter spectra than my test. So that just emphasizes the importance of allowing the user to choose "let me do my own resampling, I know what I'm doing" if they want to.

wave_l = observed_log_spectrum.spectral_axis.value
delta_log_wave = np.log10(wave_l[1]) - np.log10(wave_l[0])
deltas = (np.array(range(len(corr))) - len(corr)/2 + 0.5) * delta_log_wave
lags = (np.power(10., deltas) - 1. ) * const.c.to('km/s')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the lags should be in "redshift" units instead of km/s?

Or alternatively, perhaps a keyword argument of lag_units that defaults to km/s, but can alternatively be a velocity unit (in which case you'd just replace this line with ...to(lag_units) or the dimensionless_unscaled unit, which is then interpreted as redshift? I.e. this line but without multiplying by c?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds easy to implement as well. The reason I have the lags done in that way is just because the IRAF xcorr task does it in that way. I used xcorr's code to guide me into implementing the steps in the notebook and function.

@ibusko
Copy link
Contributor Author

ibusko commented Mar 12, 2020

@eteq as for having just a subset of the lags, it may require a little more thought so I won't be adding it to the next iteration. Lets have the other issues fixed first.

Something that may help in that regard is that the new code I am moving into this PR allows the user to optionally define either, or both, the blue and red wavelength extremes to use in the resampling. These limits are used to define the common wavelength scale that both spectrum and template will be resampled into. They can be either present in the spectrum/template, or be off their limits, in which case the resampled data gets zero-padded. As for the sampling step to use to build the low-wavelength scale, the code right now is using the smaller wavelength spacing it can find in the observed spectrum. We could add that sampling step as a keyword parameter as well. That may allow the kind of oversampling you require in your high resolution work.

@ibusko
Copy link
Contributor Author

ibusko commented Mar 13, 2020

I put in place all suggested modifications except two: lags selected by the user (subject of another ticket TBD), and didn't implement the function as a class (following @hcferguson 's alert). Although we can address that in a TBD ticket as well.

I tested this with the galaxy redshift workflow and the unit tests only. @eteq 's workflow still need to be tested with this code.

@ibusko ibusko requested a review from eteq March 18, 2020 17:32
@eteq
Copy link
Member

eteq commented Mar 18, 2020

I pushed up a few commits making updates to the changes made such my last review - I don't think we want to write-in all the keyword arguments specifically for the resampling, so I re-worked it into a separate (public) function that does the log-resampling. But following the principle of "sensible defaults", I worked it so that the resampler still works as it did before by default but the user can control how they want the resampler to work or just do it themselves.

I also re-ran this using my notebook in https://github.com/eteq/specutils/blob/redshift-stars/docs/Stellar%20cross-correlation.ipynb and it all seems to be working as expected now, so that's great!

I ran into a couple of other items that I don't have time to fix but will make follow-on issues for those. None are blockers so assuming the tests pass I think all is good to merge.

@eteq
Copy link
Member

eteq commented Mar 18, 2020

the failures in the last commit are all unrelated due to remote data outages not this PR, so I think it can be considered passing. I'll give this a bit for @nmearl or @rosteen to review my last set of commits, but if they are ok this can be merged.

@rosteen
Copy link
Contributor

rosteen commented Mar 18, 2020

I looked over the recent commits by @eteq and they look good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants