
Landy Szalay estimator #477

Merged: 9 commits merged into bccp:master on May 9, 2018

Conversation

@adematti
Collaborator

I updated the normalization of the Landy-Szalay estimator. Indeed, one should normalize D1D2, D1R2, R1D2, R1R2 by the total number of (weighted) pairs involved in D1D2, D1R2, R1D2, R1R2 (instead of the product of the sizes of D1 and D2, D2 and R1, etc.). More specifically, if D1=D2, one should subtract the double counting of pairs: the normalization is given by N1*(N1-1)/2 instead of N1*N2.
Since corrfunc does not take care of this subtlety, test_survey_auto of paircount_tpcf/test/test_1d.py (and of the 2d, angular and projected variants) will fail.
I also added the option R1R2 to BasePairCount2PCF and children, so that one can provide R1R2 if already calculated (once for all mocks). No consistency check between the provided R1R2 and the attributes of BasePairCount2PCF is performed, though.
I added the function to_xil which computes multipoles of the r-mu correlation function.
Please tell me if anything looks weird to you, and, of course, please make the changes you want (it is just a proposal).
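
To make the normalization concrete, here is a minimal numerical sketch of the argument above (illustrative only, not the nbodykit code):

    # total number of pairs used to normalize each raw pair count (unit weights)
    N1, N2 = 1000, 2000
    cross_norm = N1 * N2             # D1R2, R1D2, R1R2: all pairs between two distinct catalogs
    auto_norm = N1 * (N1 - 1) // 2   # D1D2 when D1 is D2: no self-pairs, no double counting
    # the old convention used N1 * N1 / 2 for the auto case; the relative difference
    # is ~1/N1, negligible for large catalogs but not for small ones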

@@ -47,7 +47,14 @@ def __init__(self, mode, edges, first, second, Nmu, pimax, show_progress=False):
# store the total size of the sources
self.attrs['N1'] = first.csize
self.attrs['N2'] = second.csize if second is not None else None


if second is None or second==first:
Member

Looks like a tab/space mixture issue -- we only use 4 spaces for indentation, no tabs.

Member

Also, to test whether the two fields are the same object, we can probably use 'is' instead of ==.
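
i.e., the quoted condition would become something like (a sketch of the suggestion):

    if second is None or second is first: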

@@ -143,7 +143,7 @@ def run(chunk):

# run in chunks
pc = None
chunks = numpy.array_split(range(loads[self.comm.rank]), N, axis=0)
chunks = numpy.array_split(numpy.arange(loads[self.comm.rank]), N, axis=0)
Member

arange(..., dtype='intp') may be better?
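
That would be a small change on the same line, roughly (a sketch of the suggestion):

    chunks = numpy.array_split(numpy.arange(loads[self.comm.rank], dtype='intp'), N, axis=0)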

@rainwoodman
Member

I am not so sure about the N(N-1)/2 part. A correctness test case seems to be failing due to this?

@adematti
Collaborator Author

adematti commented Apr 18, 2018

Hi,

I agree with all you said, and made the corresponding changes. I also propagated the weightedpairs attribute from PairCountBase to BasePairCount2PCF, as it can be useful for users who want to recompute the LS estimator (or others) themselves. By the way, please do not hesitate to submit your changes directly, as you are a far better programmer than I am (plus it is your project).
Yes, as I said, the auto-correlation tests with the LS estimator will fail, since they compare to either corrfunc's convert_3d_counts_to_cf or halotools's tpcf, which normalize auto pair counts by N*N/2 instead of N(N - 1)/2. About this normalization of DD or RR, see e.g. https://arxiv.org/pdf/1211.6211v2.pdf, or the implementation of CUTE by David Alonso: https://github.com/damonge/CUTE/blob/master/CUTE/src/io.c, l. 136-137.
Maybe I am missing some subtlety about the bias of the LS estimator, though.
Cheers,
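
For reference, a compact sketch of the estimator with this normalization (generic names, not the nbodykit implementation):

    def landy_szalay(D1D2, D1R2, D2R1, R1R2, nD1D2, nD1R2, nD2R1, nR1R2):
        # each raw pair count is divided by its own total number of (weighted) pairs,
        # e.g. nD1D2 = N1*(N1-1)/2 for an auto count and N1*N2 for a cross count
        dd = D1D2 / nD1D2
        dr = D1R2 / nD1R2
        rd = D2R1 / nD2R1
        rr = R1R2 / nR1R2
        return (dd - dr - rd + rr) / rr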

@rainwoodman
Member

rainwoodman commented Apr 18, 2018

I can of course add some cleanups after we settle on N(N-1)/2.

The formulas in section 2 of the paper you cited are for pair counts (i, j) with i < j (apparently).

Do you know whether the pair count here includes or excludes reversed pairs, i.e. (i, j) with no constraints? In nbodykit, before the switch to corrfunc, we counted all pairs (no constraints).
I am not so sure after the switch to corrfunc; but this must be relevant.

Those with NN looked more correct:

https://travis-ci.org/bccp/nbodykit/jobs/368171877#L1420

The first row is from other libraries, using NN; the second row is after the patch. The first number is sqrt(2) in the NN case, which is a suspiciously neat value.

@adematti
Collaborator Author

Well, what matters in N(N-1)/2 is the -1: we do not count pairs formed by any object with itself, because that is simply not a pair (these "false pairs" are rejected anyway, since you require bin edges to be >0). It is not a matter of reversed pairs. Since you only compute cross-correlations, we in fact get the total number of pairs (i,j) with i!=j (anyway, I think corrfunc also chooses this convention for autocorrelations). By the way, you could just as well remove the 0.5 factors in the "weightedpairs" calculation; it would give the same result, since the overall normalization is given by ratios of "weightedpairs". I just added these 0.5 factors to match the somewhat intuitive definition of pair counts (no double counting).
Concerning the test that fails: as I said, it uses halotools's tpcf, which uses the exact same formula as you did: line 58 of https://github.com/astropy/halotools/blob/master/halotools/mock_observables/two_point_clustering/tpcf_estimators.py. That explains why the test succeeded before, and why it fails now. I am just arguing that this formula is not fully correct.

Cheers
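
A toy check of the point about the 0.5 factors (made-up numbers, not from the PR):

    import numpy
    DD_tot, RR_tot = 4950.0, 499500.0          # N*(N-1)/2 for N=100 data and N=1000 randoms
    f = RR_tot / DD_tot                         # ratio used to normalize DD against RR
    f_no_half = (2 * RR_tot) / (2 * DD_tot)     # dropping the common 0.5 factor changes nothing
    assert numpy.isclose(f, f_no_half)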

@rainwoodman
Member

rainwoodman commented Apr 18, 2018

I see what you mean. The integral constraint for the case of pairs (i, j)|{i < j} is indeed n(n-1)/2. But the integral constraint for the case of pairs (i, j)|{any i, j} is n**2. In principle the self pairs should have affected only the zero-lag bin, but in this case it is affecting the way we define the integral constraint.

I see the argument in the Landy-Szalay paper (http://adsabs.harvard.edu/abs/1993ApJ...412...64L ) on n(n-1). But it would equally apply to the second definition of pairs and would give n**2.

So I think I am still confused about this (and leaning towards n**2). In the large-n limit it shouldn't matter, but in the low-n limit it does; so I would still like to get this right. I wonder if @manodeep or @aphearin have thought about this as well?
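
Spelling out the two conventions being weighed here (a restatement of the argument, nothing new):

    n = 10
    # pairs (i, j) with i < j, excluding self-pairs: summing the counts over all bins gives
    pairs_i_lt_j = n * (n - 1) // 2   # 45
    # all ordered pairs (i, j), including self-pairs: summing the counts over all bins gives
    pairs_all = n ** 2                # 100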

@nickhand (Member) left a comment

This looks good to me. We can do some code cleanup, but I think the N*(N-1) normalization is the proper thing to do here. Take a look at these notes in the Corrfunc docs:

http://corrfunc.readthedocs.io/en/master/modules/rr_autocorrelations.html#weighted-rr

The last equation of those notes seems to be the same as what is added here.
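
The auto branch of the diff is not quoted below, but if I read those Corrfunc notes correctly, the weighted analogue of N*(N-1)/2 is something along these lines (a sketch, not the nbodykit code):

    import numpy

    def weighted_auto_norm(w):
        # weighted generalization of N*(N-1)/2: [(sum w)^2 - sum(w^2)] / 2
        s = numpy.sum(w)
        return 0.5 * (s * s - numpy.sum(w * w))

    # with unit weights this reduces to N*(N-1)/2
    assert weighted_auto_norm(numpy.ones(50)) == 50 * 49 / 2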

else:
    wpairs1, wpairs2 = self.comm.allreduce(first.compute(first[weight].sum())), self.comm.allreduce(second.compute(second[weight].sum()))
    self.attrs['weightedpairs'] = 0.5*wpairs1*wpairs2

Member

I think I like weighted_npairs better here. But the calculation looks correct to me

fN2 = float(NR2)/ND2
nonzero = R1R2['npairs'] > 0
Error = numpy.zeros(D1D2.pairs.shape)
Error[:] = numpy.nan

Member

a Poisson error calculation would be a nice addition, I agree

logger.info("computing randoms1 - randoms2 pair counts")
R1R2 = pair_counter(first=randoms1, second=randoms2, **kwargs).pairs

fDD, fDR, fRD = R1R2.attrs['weightedpairs']/D1D2.attrs['weightedpairs'], R1R2.attrs['weightedpairs']/D1R2.attrs['weightedpairs'], R1R2.attrs['weightedpairs']/D2R1.attrs['weightedpairs']
Member

we should break this up into multiple lines, but the calculation looks right to me

if mu_sel is None:
    if mu_range: mu_sel = numpy.nonzero((self.corr.edges['mu'][:-1]>=mu_range[0]) & (self.corr.edges['mu'][1:]<=mu_range[1]))[0]
    else: mu_sel = slice(len(self.corr.edges['mu'])-1)

Member

We can slice the BinnedStatistic object by mu using the .sel() function of self.corr here. Something like:

sliced = self.corr.sel(mu=slice(*mu_range), method='nearest')

Collaborator Author

I agree we could use that. Actually, I wanted to be conservative in the selected mu bins, i.e., if the bin edges are [0.1,0.2] and mu_range starts at 0.18, I wanted that bin to be excluded from the calculation (to avoid polluting the result with systematics).
But I agree your solution would suit most people.

@nickhand
Member

As for tests, I agree that halotools uses the same normalization as the original version in nbodykit. I think the xi or wp functions of Corrfunc use the proper normalization (we are only using the pair-counting functions in Corrfunc right now), so we could do an offline calculation with CUTE or those Corrfunc functions and compare on Travis.

PC = getattr(self, pc)
self.attrs['weightedpairs'][pc] = PC.attrs['weightedpairs']
setattr(self,pc,PC.pairs)

Collaborator Author

I do not know if you want to keep the possibility of passing R1R2 (a SurveyDataPairCount or SimulatedBoxPairCount instance) as an additional argument of BasePairCount2PCF. If so, I would advocate keeping self.D1D2, self.D1R2, etc. as PairCountBase instances all the way through (instead of extracting the BinnedStatistic object pairs), and handling them in __getstate__ and __setstate__ (using the __getstate__ and __setstate__ of PairCountBase). That way, users can access the PairCountBase instances R1R2, D1R2, etc. again when loading a BasePairCount2PCF, and do what they want with the pair counts (calculate another estimator, etc.).
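
A toy illustration of the serialization pattern being proposed (simplified stand-in classes, not the actual nbodykit code):

    class ToyPairCount(object):
        # stand-in for a PairCountBase-like object: raw pair counts plus metadata
        def __init__(self, pairs, attrs):
            self.pairs, self.attrs = pairs, attrs
        def __getstate__(self):
            return {'pairs': self.pairs, 'attrs': self.attrs}
        def __setstate__(self, state):
            self.__dict__.update(state)

    class Toy2PCF(object):
        # keep the full pair-count objects, so users can rebuild other estimators after loading
        def __init__(self, D1D2, R1R2):
            self.D1D2, self.R1R2 = D1D2, R1R2
        def __getstate__(self):
            return {name: getattr(self, name).__getstate__() for name in ('D1D2', 'R1R2')}
        def __setstate__(self, state):
            for name, s in state.items():
                pc = ToyPairCount.__new__(ToyPairCount)
                pc.__setstate__(s)
                setattr(self, name, pc)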

@manodeep

Sorry about the late response. The calculation in Corrfunc adds in the self-pairs only if pairs at zero distance are considered real (i.e., rmin==0.0). Check out the corresponding code and comments here.

@lgarrison demonstrated the need to add in this extra N term via some gist. @lgarrison - was it this notebook?

@rainwoodman
Member

rainwoodman commented Apr 23, 2018 via email

@rainwoodman
Member

@nickhand let's try to merge this. What can we do about the failed test case?

@adematti
Collaborator Author

Do you want anything from me, e.g. update the branch according to the previous comments?

@rainwoodman
Member

Yes. It would be great if you could address some of the naming and convention changes Nick suggested. Since we agree the test results are correct, I would suggest hard-coding the current values as the reference values for the comparison, and bypassing the halotools test as a known failure, or commenting it out.

@nickhand
Member

nickhand commented May 3, 2018 via email

@aphearin

aphearin commented May 3, 2018

Hi @rainwoodman and @nickhand - sorry that @duncandc and I are late in replying to this; it's been an extremely busy couple of weeks. I still haven't had a chance to fully digest the problem in this issue. Does this correlation function normalization problem apply only to the case where brute-force randoms are used together with Landy-Szalay? Or does it also apply to the case of analytical randoms inside a periodic box?

@rainwoodman
Member

It appears the normalization affects DD, DR and RR in the same way. If the (-1) is skipped, then the result is biased.

@adematti
Collaborator Author

adematti commented May 3, 2018

Nick's conventions have been applied. Tests that failed due to the N*(N-1)/2 normalization of auto pair counts have been modified to read reference csv files instead (produced using nbodykit; see make_ref.py... I do not have TreeCorr installed). These tests were:

  • auto-correlations of non-periodic simulated data, in test_1d.py and test_angular.py
  • auto-correlations of survey data

Note: the auto-correlations of non-periodic simulated data in test_2d.py and test_projected.py did not fail only because the tolerance is loose enough. I can modify these tests to use reference csv files as well.

Looking quickly at the NaturalEstimator, I am not 100% sure it is correct: I think (though I have never used it) that weights are not taken into account in the normalization of DD/RR.

Cheers

----------
ells : array_like
    the list of multipoles to compute
mu_range : array_like, optional
Member

This API seems to be a bit redundant with mu_sel. Is this absolutely necessary? Perhaps passing in range(mu_range[0], mu_range[1]) would work the same?

Member

Then you can probably also get rid of return_mu_sel because mu_sel is always an input.

@rainwoodman
Member

rainwoodman commented May 4, 2018

We will need input from @nickhand on the NaturalEstimator. It looked suspicious to my eyes as well.

I am worried about having a new script to create the test data (both the data and the results). We used a similar technique previously (pre-generated data files) and found it quickly became painful to manage as the number of test cases scaled up. Eventually we decided to always regenerate the test data within the individual test cases, so that they can simply be copied and run without the help of external data.

In that spirit, and noting that the correct result is only 10 numbers, what about hard-coding them in the test case directly?

This way we avoid committing some 40k data files that don't mean anything into the repo, and also avoid the duplication of code between make_ref.py and the test cases. (I see the data files are also used to reduce the duplication.)

If the test case later breaks, we'll take a look at why, and if it is supposed to break, then we can simply run it with --pdb, get the 10 updated numbers, and paste them into the test case again.
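
A hard-coded test along those lines might look roughly like this (placeholder values and a hypothetical helper name):

    import numpy
    from numpy.testing import assert_allclose

    def test_survey_auto_reference():
        # reference values hard-coded from a previously verified run
        # (the numbers below are placeholders; paste the actual 10 values)
        cf_ref = numpy.array([12.4, 7.8, 4.9, 3.1, 2.0, 1.3, 0.82, 0.51, 0.33, 0.21])
        cf = run_survey_auto_2pcf()   # hypothetical helper that recomputes the estimator
        assert_allclose(cf, cf_ref, rtol=1e-5)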

@nickhand
Member

nickhand commented May 4, 2018

@rainwoodman I'll take a look at the natural estimator tomorrow... very possible that there's a bug.


# compute 2PCF
r = SurveyData2PCF('2d', data1, randoms2, redges, Nmu=Nmu, cosmo=cosmo, data2=data2, randoms2=randoms2)
xil = r.to_xil(ells=ells)
Member

FFTPower and FFTCorr both take a poles=[] argument for computing multipoles. Would it match the existing API better if we called this r.to_poles() here?
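
For context, the multipoles are projections of xi(r, mu) onto Legendre polynomials; a minimal standalone sketch (not the to_xil implementation):

    import numpy
    from scipy.special import eval_legendre

    def multipoles(xi_rmu, mu_edges, ells=(0, 2, 4)):
        # xi_rmu has shape (Nr, Nmu), tabulated on mu bins covering [0, 1]
        mu = 0.5 * (mu_edges[1:] + mu_edges[:-1])
        dmu = numpy.diff(mu_edges)
        # for even ell, xi_ell(r) = (2*ell + 1) * int_0^1 xi(r, mu) L_ell(mu) dmu
        return [(2 * ell + 1) * numpy.sum(xi_rmu * eval_legendre(ell, mu) * dmu, axis=-1)
                for ell in ells]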

dims = [x]
edges = [self.corr.edges[x]]

if mu_range: sliced = self.corr.sel(mu=slice(*mu_range), method='nearest')
Member

Looks like mu_range is used only once here, to perform a selection on the corr, and the only object the to_xil function operates on is self.corr. It sort of feels like the function may belong somewhere else. e.g. what if we make corr an instance of a subclass of BinnedStatistic, which adds a method to_xil (or to_poles) on top of the existing methods of BinnedStatistic and returns a new BinnedStatistic of poles?

The benefit is that the to_poles() method no longer needs to couple to the selection logic.

     r = SurveyDataBox2PCF(.....)
     r.corr.sel(mu=slice(0.3, 0.5), method='nearest').to_poles(ells=[0, 2, 4])

I think this may require some changes in the BinnedStatistic class, allowing sel() to return a subclass instance.
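
A toy sketch of that idea, with sel() constructing type(self) so the subclass survives (simplified stand-ins, not the real BinnedStatistic API):

    import numpy

    class ToyBinnedStatistic(object):
        def __init__(self, mu, xi):
            self.mu, self.xi = mu, xi
        def sel(self, mu_range):
            keep = (self.mu >= mu_range[0]) & (self.mu <= mu_range[1])
            # build type(self), so a subclass instance survives the selection
            return type(self)(self.mu[keep], self.xi[..., keep])

    class ToyWedgeStatistic(ToyBinnedStatistic):
        def to_poles(self, ells=(0, 2, 4)):
            # placeholder: project xi(r, mu) onto Legendre multipoles here
            return {ell: None for ell in ells}

    corr = ToyWedgeStatistic(numpy.linspace(0.05, 0.95, 10), numpy.ones((20, 10)))
    poles = corr.sel((0.3, 0.5)).to_poles(ells=[0, 2, 4])   # sel() preserved the subclass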

Member

This sounds like a good way forward to me

rainwoodman added a commit to rainwoodman/nbodykit that referenced this pull request May 4, 2018
Also asserts subclass type is preserved during sel(). This paves
the road for the plan described in

bccp#477 (comment)
@@ -71,7 +71,7 @@ def __call__(self, NR1, NR2=None):
else:
Member

This natural estimator calculation also needs to be fixed I think. If not a cross-correlation it should be N1*(N1-1) to match the other calculation.

I think we will need to add an is_cross_corr keyword to the NaturalEstimator constructor, which is called from here:

self.R1R2, self.corr = NaturalEstimator(DD)

And then if it's not a cross-correlation, only pass ND1 here:

_R1R2 = AnalyticUniformRandoms(mode, edges, BoxSize)(ND1, ND2)

(currently, we always pass ND1 and ND2)
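
A minimal sketch of the switch being suggested (illustrative names, not the actual AnalyticUniformRandoms code):

    def analytic_rr_norm(ND1, ND2=None, is_cross_corr=False):
        # cross-correlation: normalize the analytic R1R2 by ND1 * ND2
        # auto-correlation: normalize by ND1 * (ND1 - 1), matching the pair-count convention
        if is_cross_corr:
            return ND1 * ND2
        return ND1 * (ND1 - 1)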

@rainwoodman
Member

Let's merge this PR as is, and I will do some follow-ups before we tag a release. These features are useful.

@rainwoodman mentioned this pull request May 9, 2018
@rainwoodman
Member

I've moved this to #488, with two follow-up changes: fixing the NaturalEstimator (by adding proper metadata to RR) and adding 'WedgeBinnedStatistic', which can be converted to poles -- if the tests are all green I will merge it there.

@rainwoodman merged commit 5e4394e into bccp:master on May 9, 2018
@rainwoodman
Member

Merged. Thanks @adematti!
