Landy-Szalay estimator #477
Conversation
```diff
@@ -47,7 +47,14 @@ def __init__(self, mode, edges, first, second, Nmu, pimax, show_progress=False):
         # store the total size of the sources
         self.attrs['N1'] = first.csize
         self.attrs['N2'] = second.csize if second is not None else None
 
+        if second is None or second==first:
```
Looks like a tab/space mixture issue -- we only use 4 spaces for indentation, no tabs.
Also, to test whether the two fields are the same object, we can probably use `is` instead of `==`.
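For illustration, a minimal sketch of the difference (plain lists standing in for the catalog objects):

```python
# `is` tests object identity, `==` tests equality (and for array-like
# catalog columns, `==` may even return an elementwise mask, not a bool)
a = [1, 2, 3]
b = a            # same object
c = [1, 2, 3]    # equal, but a distinct object

print(a is b)    # True
print(a is c)    # False
print(a == c)    # True
```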
```diff
@@ -143,7 +143,7 @@ def run(chunk):
 
         # run in chunks
         pc = None
-        chunks = numpy.array_split(range(loads[self.comm.rank]), N, axis=0)
+        chunks = numpy.array_split(numpy.arange(loads[self.comm.rank]), N, axis=0)
```
`arange(..., dtype='intp')` may be better?

I am not so sure about the N(N-1)/2 part. A correctness test case seems to be failing due to this?
Hi, I agree with all you said, and made the subsequent changes. I also imported the weightedpairs attribute from PairCountBase into BasePairCount2PCF, as it can be useful for users who want to recompute the LS estimator (or others) themselves. By the way, please do not hesitate to submit your changes directly, as you are a far better programmer than I am (plus it is your project).
I can of course add some cleanups after we settle on N(N-1)/2. The formulas in section 2 of the paper you gave are (apparently) for pair counts over (i, j) with i < j. Do you know if the pair count here includes or excludes reversed pairs, i.e. (i, j) with no constraints? In nbodykit, before the switch to Corrfunc, we counted all pairs (no constraints). The results with N*N looked more correct: https://travis-ci.org/bccp/nbodykit/jobs/368171877#L1420 (the first row is from other libraries, using N*N; the second row is after the patch). The first number is sqrt(2) in the N*N case, which is not exactly a nice integer.
Well, what matters in N(N-1)/2 is the -1: we do not count pairs formed by any object with itself, because that is simply not a pair (these "false pairs" are rejected anyway, since you require bin edges to be > 0). It is not a matter of reversed pairs. Since you compute only the cross-correlation, we in fact get the total number of pairs (i, j) with i != j (and I think Corrfunc chooses this convention for the autocorrelation as well). By the way, you could just as well remove the 0.5 factors in the "weightedpairs" calculation; it would give the same result, as the overall normalization is given by ratios of "weightedpairs". I just added these 0.5 factors to match the somewhat intuitive definition of pair counts (no double counting). Cheers
I see what you mean. The integral constraint for the case of pairs (i, j) with i < j is indeed n(n-1)/2. But the integral constraint for the case of pairs (i, j) with any i, j is n*n. I see the argument in the Landy-Szalay paper (http://adsabs.harvard.edu/abs/1993ApJ...412...64L) for n(n-1). But it would equally apply to the second definition of pairs, and would give n(n-1) there as well. So I think I am still confused about this (and leaning towards the n*n convention).
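For what it's worth, a brute-force toy check of the three conventions (a sketch, not nbodykit code):

```python
import numpy

# Total pair counts for N points under the different conventions:
#   (i, j) with i < j     -> N*(N-1)/2
#   (i, j) with i != j    -> N*(N-1)
#   (i, j) unconstrained  -> N*N (includes self-pairs)
N = 5
idx = numpy.arange(N)
i, j = numpy.meshgrid(idx, idx, indexing='ij')

print((i < j).sum())    # 10 == N*(N-1)/2
print((i != j).sum())   # 20 == N*(N-1)
print(i.size)           # 25 == N*N
```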
This looks good to me. We can do some code cleanup, but I think the N*(N-1) normalization is the proper thing to do here. Take a look at these notes in the Corrfunc docs:
http://corrfunc.readthedocs.io/en/master/modules/rr_autocorrelations.html#weighted-rr
The last equation of those notes seems to be the same as what is added here.
```python
else:
    wpairs1, wpairs2 = self.comm.allreduce(first.compute(first[weight].sum())), self.comm.allreduce(second.compute(second[weight].sum()))
    self.attrs['weightedpairs'] = 0.5*wpairs1*wpairs2
```
I think I like `weighted_npairs` better here. But the calculation looks correct to me.
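For reference, a minimal sketch of the totals under the no-double-count convention discussed above (`w1`, `w2` are hypothetical per-object weight arrays):

```python
import numpy

w1 = numpy.random.uniform(size=100)   # weights of catalog 1 (made up)
w2 = numpy.random.uniform(size=80)    # weights of catalog 2 (made up)

# cross-correlation: each (i, j) pair once, with the 0.5 convention above
cross = 0.5 * w1.sum() * w2.sum()

# auto-correlation: sum over pairs with i < j, i.e. drop self-pairs and
# halve; this reduces to N*(N-1)/2 when all weights are 1
auto = 0.5 * (w1.sum()**2 - (w1**2).sum())
```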
```python
fN2 = float(NR2)/ND2
nonzero = R1R2['npairs'] > 0
Error = numpy.zeros(D1D2.pairs.shape)
Error[:] = numpy.nan
```
a Poisson error calculation would be a nice addition, I agree
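For the record, a sketch of the standard approximation (not code from this PR; `xi` and `npairs` stand for the per-bin correlation and raw DD pair counts):

```python
import numpy

def poisson_error(xi, npairs):
    """Standard Poisson approximation sigma_xi ~ (1 + xi) / sqrt(DD),
    with NaN left in bins that have no pairs."""
    error = numpy.full(xi.shape, numpy.nan)
    nonzero = npairs > 0
    error[nonzero] = (1 + xi[nonzero]) / numpy.sqrt(npairs[nonzero])
    return error
```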
```python
logger.info("computing randoms1 - randoms2 pair counts")
R1R2 = pair_counter(first=randoms1, second=randoms2, **kwargs).pairs

fDD, fDR, fRD = R1R2.attrs['weightedpairs']/D1D2.attrs['weightedpairs'], R1R2.attrs['weightedpairs']/D1R2.attrs['weightedpairs'], R1R2.attrs['weightedpairs']/D2R1.attrs['weightedpairs']
```
we should break this up into multiple lines, but the calculation looks right to me
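For example, something like (the same calculation, just unpacked):

```python
wRR = R1R2.attrs['weightedpairs']
fDD = wRR / D1D2.attrs['weightedpairs']
fDR = wRR / D1R2.attrs['weightedpairs']
fRD = wRR / D2R1.attrs['weightedpairs']
```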
```python
if mu_sel is None:
    if mu_range: mu_sel = numpy.nonzero((self.corr.edges['mu'][:-1]>=mu_range[0]) & (self.corr.edges['mu'][1:]<=mu_range[1]))[0]
    else: mu_sel = slice(len(self.corr.edges['mu'])-1)
```
We can slice the BinnedStatistic by mu using the `.sel()` function of `self.corr` here. Something like:

```python
sliced = self.corr.sel(mu=slice(*mu_range), method='nearest')
```
I agree we could use that. Actually, I wanted to be conservative in the selected mu bins, i.e., if the bin edges are [0.1, 0.2] and mu_range starts at 0.18, I wanted that bin to be excluded from the calculation (to avoid polluting the result with systematics).
But I agree your solution would suit most people.
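For instance, the conservative variant boils down to keeping only bins fully contained in the range (a self-contained sketch of the logic in the diff above):

```python
import numpy

def conservative_mu_bins(edges, mu_range):
    """Indices of mu bins fully inside mu_range: a bin [0.1, 0.2] is
    dropped when mu_range starts at 0.18."""
    edges = numpy.asarray(edges)
    return numpy.flatnonzero((edges[:-1] >= mu_range[0]) &
                             (edges[1:] <= mu_range[1]))

print(conservative_mu_bins([0.0, 0.1, 0.2, 0.3], (0.18, 0.3)))  # [2]
```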
As for tests, I agree that halotools uses the same normalization as the original version in nbodykit.
```python
PC = getattr(self, pc)
self.attrs['weightedpairs'][pc] = PC.attrs['weightedpairs']
setattr(self, pc, PC.pairs)
```
I do not know if you want to keep the possibility of passing R1R2 (a SurveyDataPairCount or SimulatedBoxPairCount instance) as an additional argument of BasePairCount2PCF. If so, I would advocate keeping self.D1D2, self.D1R2, etc. as PairCountBase instances all the way (instead of extracting the BinnedStatistic object `pairs`), and handling them in `__getstate__` and `__setstate__` (using the `__getstate__` and `__setstate__` of PairCountBase). That way, the user can access the PairCountBase instances R1R2, D1R2, etc. again when loading BasePairCount2PCF, to do whatever they want with the pair counts (calculate another estimator, etc.).
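A rough sketch of what that could look like (method bodies illustrative only, assuming PairCountBase provides the `__getstate__`/`__setstate__` mentioned above):

```python
# Keep self.D1D2, self.D1R2, ... as PairCountBase instances and delegate
# their (de)serialization, so users recover full pair-count objects on load.
def __getstate__(self):
    state = self.__dict__.copy()
    for name in ('D1D2', 'D1R2', 'D2R1', 'R1R2'):
        if state.get(name) is not None:
            state[name] = state[name].__getstate__()
    return state

def __setstate__(self, state):
    # here one would rebuild the PairCountBase instances from their
    # stored state, e.g. via PairCountBase.__setstate__
    self.__dict__.update(state)
```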
Sorry about the late response. The calculation in Corrfunc adds in the self-pairs only iff pairs at zero distance are considered real (i.e., rmin==0.0). Check out the corresponding code and comments here: https://github.com/manodeep/Corrfunc/blob/master/theory/wp/countpairs_wp_impl.c.src#L543-L568. @lgarrison demonstrated the need to add in this extra N term via some gist. @lgarrison - was it this notebook: http://nbviewer.jupyter.org/gist/lgarrison/1efabe4430429996733a9d29397423d2 ?
Thanks for the clarification(s)!
@nickhand let's try to merge this. What can we do about the failed test case?
Do you want anything from me, e.g. updating the branch according to the previous comments?
Yes. It would be great if you addressed some of the naming and conventions Nick suggested. Since we agree the test results are correct, I would suggest hard-coding the current values as the correct values for the comparison, and bypassing the halotools test as a knownfailure, or commenting it out.
Yes, I agree with Yu. I think the only way to fix the tests at the moment is either to mark them as a knownfailure, or to load hard-coded test results from a file, computed either with nbodykit or, ideally, TreeCorr. We do a similar loading of test results in the 3PCF tests to compare against an external C code.
Hi @rainwoodman and @nickhand - sorry that @duncandc and I are late in replying to this; it's been an extremely busy couple of weeks. I still haven't had a chance to fully digest the problem in this issue. Does this correlation function normalization problem apply only to the case where brute-force randoms are used together with Landy-Szalay? Or does it also apply to the case of analytical randoms inside a periodic box?
It appears the normalization affects DD, DR and RR the same way. If the (-1) is skipped, then it is biased.
Nick's conventions have been applied. Tests that failed due to the N*(N-1)/2 normalization of auto-paircounts have been modified to read reference csv files instead, produced using nbodykit (see make_ref.py)... I do not have TreeCorr installed. These tests were test_survey_auto in test_1d.py, test_2d.py, test_angular.py and test_projected.py.
```python
----------
ells : array_like
    the list of multipoles to compute
mu_range : array_like, optional
```
This API seems to be a bit redundant with `mu_sel`. Is this absolutely necessary? Perhaps passing in `range(mu_range[0], mu_range[1])` would work the same?
Then you can probably also get rid of `return_mu_sel`, because `mu_sel` is always an input.
We will need the input from @nickhand on the NaturalEstimator. It looked suspicious to my eyes as well.

I am worried about having a new script to create the test data (both the data and the results). We were using a similar technique previously (pre-generated data files), and found it quickly became painful to manage as the number of test cases scaled up. Eventually we decided to always regenerate the test data within the individual test cases, so that they can simply be copied and run without the help of external data.

In that spirit, and noting the correct result is only 10 numbers, what about hard-coding them directly in the test case? This way we avoid committing some 40k data files that don't mean anything into the repo, and also avoid the duplication of code between make_ref.py and the test cases (I see the data files are used to reduce that duplication as well). If the test case later breaks, we'll take a look why, and if it is supposed to break, we can simply run it with --pdb, get the 10 updated numbers, and paste them into the test case again.
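A sketch of that pattern (the numbers below are placeholders, not real reference values):

```python
import numpy
from numpy.testing import assert_allclose

# hard-coded reference values, pasted from a known-good run
expected_corr = numpy.array([10.0, 3.2, 1.1, 0.4, 0.1])

def check_result(corr):
    # `corr` is the correlation array computed inside the test case
    assert_allclose(corr, expected_corr, rtol=1e-5)
```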
@rainwoodman I'll take a look at the natural estimator tomorrow... very possible that there's a bug
```python
# compute 2PCF
r = SurveyData2PCF('2d', data1, randoms2, redges, Nmu=Nmu, cosmo=cosmo, data2=data2, randoms2=randoms2)
xil = r.to_xil(ells=ells)
```
FFTPower and FFTCorr both take a `poles=[]` argument for computing multipoles. Would it match the existing API better if we called this `r.to_poles()` here?
```python
dims = [x]
edges = [self.corr.edges[x]]

if mu_range: sliced = self.corr.sel(mu=slice(*mu_range), method='nearest')
```
Looks like mu_range is used only once here, to perform a selection on the corr, and the only object the to_xil function operates on is self.corr. It sort of feels like the function may belong somewhere else, e.g. what if we make corr an instance of a subclass of BinnedStatistic, which adds a method to_xil (or to_poles) on top of the existing methods of BinnedStatistic and returns a new BinnedStatistic of poles?

The benefit is that the to_poles() method no longer needs to couple to the selection logic.

```python
r = SurveyDataBox2PCF(.....)
r.corr.sel(mu=slice(0.3, 0.5), method='nearest').to_poles(ells=[0, 2, 4])
```

I think this may require some changes in the BinnedStatistic class, allowing sel() to return a subclass instance.
This sounds like a good way forward to me
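A minimal sketch of the subclass idea (import path assumed from nbodykit's layout; the projection body is left abstract):

```python
from nbodykit.binned_statistic import BinnedStatistic

class WedgeBinnedStatistic(BinnedStatistic):
    """A BinnedStatistic over (r, mu) bins whose sel() preserves the
    subclass, so a mu selection can still be projected to poles."""

    def to_poles(self, ells):
        # integrate self['corr'] against Legendre polynomials in mu and
        # return a new BinnedStatistic of the multipoles (sketch only)
        ...
```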
Also asserts subclass type is preserved during sel(). This paves the road for the plan described in bccp#477 (comment)
```diff
@@ -71,7 +71,7 @@ def __call__(self, NR1, NR2=None):
         else:
```
This natural estimator calculation also needs to be fixed, I think. If not a cross-correlation, it should be N1*(N1-1) to match the other calculation.

I think we will need to add an `is_cross_corr` keyword to the NaturalEstimator constructor, which is called from here:

```python
self.R1R2, self.corr = NaturalEstimator(DD)
```

And then, if it's not a cross-correlation, only pass `ND1` here:

```python
_R1R2 = AnalyticUniformRandoms(mode, edges, BoxSize)(ND1, ND2)
```

(currently, we always pass `ND1` and `ND2`)
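The normalization branch itself would be something like (a sketch of the proposed logic, function name hypothetical):

```python
def analytic_RR_norm(ND1, ND2=None):
    # auto-correlation: exclude self-pairs -> N1*(N1-1);
    # cross-correlation: all (i, j) pairs  -> N1*N2
    if ND2 is None:
        return ND1 * (ND1 - 1)
    return ND1 * ND2
```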
Let's merge this PR as is, and I will do some follow-ups before we tag a release. These features are useful.
I've moved this to #488, with two follow-up changes to fix the NaturalEstimator (by adding proper metadata to RR) and adding WedgeBinnedStatistic, which can be converted to poles -- if the tests are all green I will merge it there.
Merged. Thanks @adematti!
I updated the normalization of the Landy-Szalay estimator. Indeed, one should normalize D1D2, D1R2, R1D2, R1R2 by the total number of (weighted) pairs involved in D1D2, D1R2, R1D2, R1R2 (instead of the product of the sizes of D1 and D2, D2 and R1, etc.). More specifically, if D1=D2, one should subtract the double count of pairs: the normalization is given by N1*(N1-1)/2 instead of N1*N2.
Since Corrfunc does not take care of this subtlety, test_survey_auto of paircount_tpcf/test/test_1d.py (and 2d, angular and projected) will fail.
I also added the option R1R2 to BasePairCount2PCF and children, so that one can provide R1R2 if already calculated (once for all mocks). No consistency check between the provided R1R2 and the attributes of BasePairCount2PCF is performed, though.
I added the function to_xil, which computes multipoles of the r-mu correlation function.
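For reference, a sketch of that projection (assuming xi is binned on a regular mu grid in [0, 1], so the prefactor is (2l+1) rather than (2l+1)/2, valid for even multipoles):

```python
import numpy
from scipy.special import legendre

def to_xil(xi, mu_edges, ells):
    """Multipoles xi_l(r) = (2l+1) * sum_mu xi(r, mu) L_l(mu) dmu,
    for xi of shape (Nr, Nmu) and mu edges spanning [0, 1]."""
    mu = 0.5 * (mu_edges[1:] + mu_edges[:-1])   # bin centers
    dmu = numpy.diff(mu_edges)
    return numpy.array([(2*ell + 1) * (xi * legendre(ell)(mu) * dmu).sum(axis=-1)
                        for ell in ells])
```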
Please tell me if anything looks weird to you, and, of course, please make the changes you want (it is just a proposal).