
Landy Szalay estimator #477

Merged: 9 commits merged into bccp:master on May 9, 2018

Conversation

@adematti
Collaborator

I updated the normalization of the Landy-Szalay estimator. Indeed, one should normalize D1D2, D1R2, R1D2, R1R2 by the total number of (weighted) pairs involved in D1D2, D1R2, R1D2, R1R2 (instead of the product of the sizes of D1 and D2, D2 and R1, etc.). More specifically, if D1=D2, one should subtract the double counting of pairs: the normalization is given by N1*(N1-1)/2 instead of N1*N2.
Since corrfunc does not take care of this subtlety, test_survey_auto of paircount_tpcf/test/test_1d.py (and of the 2d, angular and projected variants) will fail.
I also added the option R1R2 to BasePairCount2PCF and children, so that one can provide R1R2 if already calculated (once for all mocks). No consistency check between the provided R1R2 and the attributes of BasePairCount2PCF is performed, though.
I added the function to_xil which computes multipoles of the r-mu correlation function.
Please tell me if anything looks weird to you, and, of course, please make the changes you want (it is just a proposal).
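
To make the normalization concrete, here is a minimal numerical sketch of the argument above (illustrative only, not the nbodykit code):

    # total number of pairs used to normalize each raw pair count (unit weights)
    N1, N2 = 1000, 2000
    cross_norm = N1 * N2             # D1R2, R1D2, R1R2: all pairs between two distinct catalogs
    auto_norm = N1 * (N1 - 1) // 2   # D1D2 when D1 is D2: no self-pairs, no double counting
    # the old convention used N1 * N1 / 2 for the auto case; the relative difference
    # is ~1/N1, negligible for large catalogs but not for small ones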

@@ -47,7 +47,14 @@ def __init__(self, mode, edges, first, second, Nmu, pimax, show_progress=False):
# store the total size of the sources
self.attrs['N1'] = first.csize
self.attrs['N2'] = second.csize if second is not None else None


if second is None or second==first:
Member

Looks like a tab/space mixture issue -- we only use 4 spaces for indentation, no tabs.

Member

Also, to test whether the two fields are the same object, we can probably use 'is' instead of ==.
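
i.e., the quoted condition would become something like (a sketch of the suggestion):

    if second is None or second is first: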

@@ -143,7 +143,7 @@ def run(chunk):

# run in chunks
pc = None
chunks = numpy.array_split(range(loads[self.comm.rank]), N, axis=0)
chunks = numpy.array_split(numpy.arange(loads[self.comm.rank]), N, axis=0)
Member

arange(..., dtype='intp') may be better?
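
That would be a small change on the same line, roughly (a sketch of the suggestion):

    chunks = numpy.array_split(numpy.arange(loads[self.comm.rank], dtype='intp'), N, axis=0)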

@rainwoodman
Member

I am not so sure about the N(N-1)/2 part. A correctness test case seems to be failing due to this?

@adematti
Collaborator Author

adematti commented Apr 18, 2018

Hi,

I agree with all you said, and made the corresponding changes. I also propagated the weightedpairs attribute from PairCountBase to BasePairCount2PCF, as it can be useful for users who want to recompute the LS estimator (or others) themselves. By the way, please do not hesitate to submit your changes directly, as you are a far better programmer than I am (plus it is your project).
Yes, as I said, the auto-correlation tests with the LS estimator will fail, since they compare to either corrfunc's convert_3d_counts_to_cf or halotools's tpcf, which normalize auto pair counts by N*N/2 instead of N(N - 1)/2. About this normalization of DD or RR, see e.g. https://arxiv.org/pdf/1211.6211v2.pdf, or the implementation of CUTE by David Alonso: https://github.com/damonge/CUTE/blob/master/CUTE/src/io.c, l. 136-137.
Maybe I am missing some subtlety about the bias of the LS estimator, though.
Cheers,
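
For reference, a compact sketch of the estimator with this normalization (generic names, not the nbodykit implementation):

    def landy_szalay(D1D2, D1R2, D2R1, R1R2, nD1D2, nD1R2, nD2R1, nR1R2):
        # each raw pair count is divided by its own total number of (weighted) pairs,
        # e.g. nD1D2 = N1*(N1-1)/2 for an auto count and N1*N2 for a cross count
        dd = D1D2 / nD1D2
        dr = D1R2 / nD1R2
        rd = D2R1 / nD2R1
        rr = R1R2 / nR1R2
        return (dd - dr - rd + rr) / rr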

@rainwoodman
Member

rainwoodman commented Apr 18, 2018

I can of course add some cleanups after we settle on N(N-1)/2.

The formulas in section 2 of the paper you cited are for pair counts (i, j) with i < j (apparently).

Do you know whether the pair count here includes or excludes reversed pairs, i.e. (i, j) with no constraints? In nbodykit, before the switch to corrfunc, we counted all pairs (no constraints).
I am not so sure after the switch to corrfunc; but this must be relevant.

Those with NN looked more correct:

https://travis-ci.org/bccp/nbodykit/jobs/368171877#L1420

The first row is from other libraries, using NN; the second row is after the patch. The first number is sqrt(2) in the NN case, which is a suspiciously neat value.

@adematti
Collaborator Author

Well, what matters in N(N-1)/2 is the -1: we do not count pairs formed by any object with itself, because that is simply not a pair (these "false pairs" are rejected anyway, since you require bin edges to be >0). It is not a matter of reversed pairs. Since you only compute cross-correlations, we in fact get the total number of pairs (i,j) with i!=j (anyway, I think corrfunc also chooses this convention for autocorrelations). By the way, you could just as well remove the 0.5 factors in the "weightedpairs" calculation; it would give the same result, since the overall normalization is given by ratios of "weightedpairs". I just added these 0.5 factors to match the somewhat intuitive definition of pair counts (no double counting).
Concerning the test that fails: as I said, it uses halotools's tpcf, which uses the exact same formula as you did: line 58 of https://github.com/astropy/halotools/blob/master/halotools/mock_observables/two_point_clustering/tpcf_estimators.py. That explains why the test succeeded before, and why it fails now. I am just arguing that this formula is not fully correct.

Cheers
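
A toy check of the point about the 0.5 factors (made-up numbers, not from the PR):

    import numpy
    DD_tot, RR_tot = 4950.0, 499500.0          # N*(N-1)/2 for N=100 data and N=1000 randoms
    f = RR_tot / DD_tot                         # ratio used to normalize DD against RR
    f_no_half = (2 * RR_tot) / (2 * DD_tot)     # dropping the common 0.5 factor changes nothing
    assert numpy.isclose(f, f_no_half)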

@rainwoodman
Member

rainwoodman commented Apr 18, 2018

I see what you mean. The integral constraint for the case of pairs (i, j)|{i < j} is indeed n(n-1)/2. But the integral constraint for the case of pairs (i, j)|{any i, j} is n**2. In principle the self pairs should have affected only the zero-lag bin, but in this case it is affecting the way we define the integral constraint.

I see the argument in the Landy-Szalay paper (http://adsabs.harvard.edu/abs/1993ApJ...412...64L ) on n(n-1). But it would equally apply to the second definition of pairs and would give n**2.

So I think I am still confused about this (and leaning towards n**2). In the large-n limit it shouldn't matter, but in the low-n limit it does; so I would still like to get this right. I wonder if @manodeep or @aphearin have thought about this as well?
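
Spelling out the two conventions being weighed here (a restatement of the argument, nothing new):

    n = 10
    # pairs (i, j) with i < j, excluding self-pairs: summing the counts over all bins gives
    pairs_i_lt_j = n * (n - 1) // 2   # 45
    # all ordered pairs (i, j), including self-pairs: summing the counts over all bins gives
    pairs_all = n ** 2                # 100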

@nickhand (Member) left a comment

This looks good to me. We can do some code cleanup, but I think the N*(N-1) normalization is the proper thing to do here. Take a look at these notes in the Corrfunc docs:

http://corrfunc.readthedocs.io/en/master/modules/rr_autocorrelations.html#weighted-rr

The last equation of those notes seems to be the same as what is added here.
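
The auto branch of the diff is not quoted below, but if I read those Corrfunc notes correctly, the weighted analogue of N*(N-1)/2 is something along these lines (a sketch, not the nbodykit code):

    import numpy

    def weighted_auto_norm(w):
        # weighted generalization of N*(N-1)/2: [(sum w)^2 - sum(w^2)] / 2
        s = numpy.sum(w)
        return 0.5 * (s * s - numpy.sum(w * w))

    # with unit weights this reduces to N*(N-1)/2
    assert weighted_auto_norm(numpy.ones(50)) == 50 * 49 / 2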

else:
    wpairs1, wpairs2 = self.comm.allreduce(first.compute(first[weight].sum())), self.comm.allreduce(second.compute(second[weight].sum()))
    self.attrs['weightedpairs'] = 0.5*wpairs1*wpairs2

Member

I think I like weighted_npairs better here. But the calculation looks correct to me

fN2 = float(NR2)/ND2
nonzero = R1R2['npairs'] > 0
Error = numpy.zeros(D1D2.pairs.shape)
Error[:] = numpy.nan

Member

a Poisson error calculation would be a nice addition, I agree

logger.info("computing randoms1 - randoms2 pair counts")
R1R2 = pair_counter(first=randoms1, second=randoms2, **kwargs).pairs

fDD, fDR, fRD = R1R2.attrs['weightedpairs']/D1D2.attrs['weightedpairs'], R1R2.attrs['weightedpairs']/D1R2.attrs['weightedpairs'], R1R2.attrs['weightedpairs']/D2R1.attrs['weightedpairs']
Member

we should break this up into multiple lines, but the calculation looks right to me

if mu_sel is None:
    if mu_range: mu_sel = numpy.nonzero((self.corr.edges['mu'][:-1]>=mu_range[0]) & (self.corr.edges['mu'][1:]<=mu_range[1]))[0]
    else: mu_sel = slice(len(self.corr.edges['mu'])-1)

Member

We can slice the BinnedStatistic object by mu using the .sel() function of self.corr here. Something like:

sliced = self.corr.sel(mu=slice(*mu_range), method='nearest')

Collaborator Author

I agree we could use that. Actually, I wanted to be conservative in the selected mu bins, i.e., if the bin edges are [0.1,0.2] and mu_range starts at 0.18, I wanted that bin to be excluded from the calculation (to avoid polluting the result with systematics).
But I agree your solution would suit most people.

@nickhand
Member

As for tests, I agree that halotools uses the same normalization as the original version in nbodykit. I think the xi or wp functions of Corrfunc use the proper normalization (we are only using the pair-counting functions in Corrfunc right now), so we could do an offline calculation with CUTE or those Corrfunc functions and compare on Travis.

PC = getattr(self, pc)
self.attrs['weightedpairs'][pc] = PC.attrs['weightedpairs']
setattr(self,pc,PC.pairs)

Collaborator Author

I do not know if you want to keep the possibility of passing R1R2 (a SurveyDataPairCount or SimulatedBoxPairCount instance) as an additional argument of BasePairCount2PCF. If so, I would advocate keeping self.D1D2, self.D1R2, etc. as PairCountBase instances all the way through (instead of extracting the BinnedStatistic object pairs), and handling them in __getstate__ and __setstate__ (using the __getstate__ and __setstate__ of PairCountBase). That way, users can access the PairCountBase instances R1R2, D1R2, etc. again when loading a BasePairCount2PCF, and do what they want with the pair counts (calculate another estimator, etc.).
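
A toy illustration of the serialization pattern being proposed (simplified stand-in classes, not the actual nbodykit code):

    class ToyPairCount(object):
        # stand-in for a PairCountBase-like object: raw pair counts plus metadata
        def __init__(self, pairs, attrs):
            self.pairs, self.attrs = pairs, attrs
        def __getstate__(self):
            return {'pairs': self.pairs, 'attrs': self.attrs}
        def __setstate__(self, state):
            self.__dict__.update(state)

    class Toy2PCF(object):
        # keep the full pair-count objects, so users can rebuild other estimators after loading
        def __init__(self, D1D2, R1R2):
            self.D1D2, self.R1R2 = D1D2, R1R2
        def __getstate__(self):
            return {name: getattr(self, name).__getstate__() for name in ('D1D2', 'R1R2')}
        def __setstate__(self, state):
            for name, s in state.items():
                pc = ToyPairCount.__new__(ToyPairCount)
                pc.__setstate__(s)
                setattr(self, name, pc)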

@manodeep

Sorry about the late response. The calculation in Corrfunc adds in the self-pairs only if pairs at zero distance are considered real (i.e., rmin==0.0). Check out the corresponding code and comments here.

@lgarrison demonstrated the need to add in this extra N term via some gist. @lgarrison - was it this notebook?

@rainwoodman
Member

rainwoodman commented Apr 23, 2018 via email

@rainwoodman
Member

@nickhand let's try to merge this. What can we do about the failed test case?

@adematti
Collaborator Author

Do you want anything from me, e.g. update the branch according to the previous comments?

@rainwoodman
Member

Yes. It would be great if you could address some of the naming and convention changes Nick suggested. Since we agree the test results are correct, I would suggest hard-coding the current values as the reference values for the comparison, and bypassing the halotools test as a known failure, or commenting it out.

@nickhand
Member

nickhand commented May 3, 2018 via email

@aphearin

aphearin commented May 3, 2018

Hi @rainwoodman and @nickhand - sorry that @duncandc and I are late in replying to this; it's been an extremely busy couple of weeks. I still haven't had a chance to fully digest the problem in this issue. Does this correlation function normalization problem apply only to the case where brute-force randoms are used together with Landy-Szalay? Or does it also apply to the case of analytical randoms inside a periodic box?

@rainwoodman
Member

It appears the normalization affects DD, DR and RR in the same way. If the (-1) is skipped, then the result is biased.

@adematti
Collaborator Author

adematti commented May 3, 2018

Nick's conventions have been applied. Tests that failed due to the N*(N-1)/2 normalization of auto pair counts have been modified to read reference csv files instead (produced using nbodykit; see make_ref.py... I do not have TreeCorr installed). These tests were:

  • auto-correlations of non-periodic simulated data, in test_1d.py and test_angular.py
  • auto-correlations of survey data

Note: the auto-correlations of non-periodic simulated data in test_2d.py and test_projected.py did not fail only because the tolerance is loose enough. I can modify these tests to use reference csv files as well.

Looking quickly at the NaturalEstimator, I am not 100% sure it is correct: I think (though I have never used it) that weights are not taken into account in the normalization of DD/RR.

Cheers

----------
ells : array_like
    the list of multipoles to compute
mu_range : array_like, optional
Member

This API seems to be a bit redundant with mu_sel. Is this absolutely necessary? Perhaps passing in range(mu_range[0], mu_range[1]) would work the same?

Member

Then you can probably also get rid of return_mu_sel because mu_sel is always an input.

@rainwoodman
Member

rainwoodman commented May 4, 2018

We will need input from @nickhand on the NaturalEstimator. It looked suspicious to my eyes as well.

I am worried about having a new script to create the test data (both the data and the results). We used a similar technique previously (pre-generated data files) and found it quickly became painful to manage as the number of test cases scaled up. Eventually we decided to always regenerate the test data within the individual test cases, so that they can simply be copied and run without the help of external data.

In that spirit, and noting that the correct result is only 10 numbers, what about hard-coding them in the test case directly?

This way we avoid committing some 40k data files that don't mean anything into the repo, and also avoid the duplication of code between make_ref.py and the test cases. (I see the data files are also used to reduce the duplication.)

If the test case later breaks, we'll take a look at why, and if it is supposed to break, then we can simply run it with --pdb, get the 10 updated numbers, and paste them into the test case again.
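
A hard-coded test along those lines might look roughly like this (placeholder values and a hypothetical helper name):

    import numpy
    from numpy.testing import assert_allclose

    def test_survey_auto_reference():
        # reference values hard-coded from a previously verified run
        # (the numbers below are placeholders; paste the actual 10 values)
        cf_ref = numpy.array([12.4, 7.8, 4.9, 3.1, 2.0, 1.3, 0.82, 0.51, 0.33, 0.21])
        cf = run_survey_auto_2pcf()   # hypothetical helper that recomputes the estimator
        assert_allclose(cf, cf_ref, rtol=1e-5)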

@nickhand
Member

nickhand commented May 4, 2018

@rainwoodman I'll take a look at the natural estimator tomorrow... very possible that there's a bug.


# compute 2PCF
r = SurveyData2PCF('2d', data1, randoms2, redges, Nmu=Nmu, cosmo=cosmo, data2=data2, randoms2=randoms2)
xil = r.to_xil(ells=ells)
Member

FFTPower and FFTCorr both take a poles=[] argument for computing multipoles. Would it match the existing API better if we called this r.to_poles() here?
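
For context, the multipoles are projections of xi(r, mu) onto Legendre polynomials; a minimal standalone sketch (not the to_xil implementation):

    import numpy
    from scipy.special import eval_legendre

    def multipoles(xi_rmu, mu_edges, ells=(0, 2, 4)):
        # xi_rmu has shape (Nr, Nmu), tabulated on mu bins covering [0, 1]
        mu = 0.5 * (mu_edges[1:] + mu_edges[:-1])
        dmu = numpy.diff(mu_edges)
        # for even ell, xi_ell(r) = (2*ell + 1) * int_0^1 xi(r, mu) L_ell(mu) dmu
        return [(2 * ell + 1) * numpy.sum(xi_rmu * eval_legendre(ell, mu) * dmu, axis=-1)
                for ell in ells]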

dims = [x]
edges = [self.corr.edges[x]]

if mu_range: sliced = self.corr.sel(mu=slice(*mu_range), method='nearest')
Member

Looks like mu_range is used only once here, to perform a selection on the corr, and the only object the to_xil function operates on is self.corr. It sort of feels like the function may belong somewhere else. e.g. what if we make corr an instance of a subclass of BinnedStatistic, which adds a method to_xil (or to_poles) on top of the existing methods of BinnedStatistic and returns a new BinnedStatistic of poles?

The benefit is that the to_poles() method no longer needs to couple to the selection logic.

     r = SurveyDataBox2PCF(.....)
     r.corr.sel(mu=slice(0.3, 0.5), method='nearest').to_poles(ells=[0, 2, 4])

I think this may require some changes in the BinnedStatistic class, allowing sel() to return a subclass instance.
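
A toy sketch of that idea, with sel() constructing type(self) so the subclass survives (simplified stand-ins, not the real BinnedStatistic API):

    import numpy

    class ToyBinnedStatistic(object):
        def __init__(self, mu, xi):
            self.mu, self.xi = mu, xi
        def sel(self, mu_range):
            keep = (self.mu >= mu_range[0]) & (self.mu <= mu_range[1])
            # build type(self), so a subclass instance survives the selection
            return type(self)(self.mu[keep], self.xi[..., keep])

    class ToyWedgeStatistic(ToyBinnedStatistic):
        def to_poles(self, ells=(0, 2, 4)):
            # placeholder: project xi(r, mu) onto Legendre multipoles here
            return {ell: None for ell in ells}

    corr = ToyWedgeStatistic(numpy.linspace(0.05, 0.95, 10), numpy.ones((20, 10)))
    poles = corr.sel((0.3, 0.5)).to_poles(ells=[0, 2, 4])   # sel() preserved the subclass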

Member

This sounds like a good way forward to me

rainwoodman added a commit to rainwoodman/nbodykit that referenced this pull request May 4, 2018
Also asserts subclass type is preserved during sel(). This paves
the road for the plan described in

bccp#477 (comment)
@@ -71,7 +71,7 @@ def __call__(self, NR1, NR2=None):
else:
Member

This natural estimator calculation also needs to be fixed I think. If not a cross-correlation it should be N1*(N1-1) to match the other calculation.

I think we will need to add an is_cross_corr keyword to the NaturalEstimator constructor, which is called from here:

self.R1R2, self.corr = NaturalEstimator(DD)

And then if it's not a cross-correlation, only pass ND1 here:

_R1R2 = AnalyticUniformRandoms(mode, edges, BoxSize)(ND1, ND2)

(currently, we always pass ND1 and ND2)
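
A minimal sketch of the switch being suggested (illustrative names, not the actual AnalyticUniformRandoms code):

    def analytic_rr_norm(ND1, ND2=None, is_cross_corr=False):
        # cross-correlation: normalize the analytic R1R2 by ND1 * ND2
        # auto-correlation: normalize by ND1 * (ND1 - 1), matching the pair-count convention
        if is_cross_corr:
            return ND1 * ND2
        return ND1 * (ND1 - 1)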

@rainwoodman
Member

Let's merge this PR as is, and I will do some follow-ups before we tag a release. These features are useful.

@rainwoodman mentioned this pull request May 9, 2018
@rainwoodman
Member

I've moved this to #488, with two follow-up changes: fixing the NaturalEstimator (by adding proper metadata to RR) and adding 'WedgeBinnedStatistic', which can be converted to poles -- if the tests are all green I will merge it there.

@rainwoodman merged commit 5e4394e into bccp:master on May 9, 2018
@rainwoodman
Member

Merged. Thanks @adematti!
