
stats.effhist - estimate efficiency and uncertainties #754

Merged
merged 7 commits into astropy:master on Mar 1, 2013

Conversation

kbarbary
Member

This implements a function stats.effhist() that is very useful for estimating detection efficiency as a function of some parameter (e.g., magnitude) in Monte Carlo experiments. Importantly, reasonable uncertainties on the detection efficiency are calculated. This is done with a helper function stats.binom_conf_interval().

These functions both require scipy.

Built docstrings for the two functions can be viewed here:
http://kbarbary.github.com/astropy/stats/index.html

@eteq
Member

eteq commented Feb 11, 2013

These look great and are well tested, but my only hesitation is that I'm not sure what we settled on for what goes in astropy.stats versus affiliated packages or other places. Am I right in assuming you submitted this because it's not present anywhere else, @kbarbary? In that case, I think I'm fine including it, but there might be others who disagree...

@perwin

perwin commented Feb 11, 2013

The binomial confidence interval is something fundamental and widely usable that ought to be in something like scipy, but isn't, so making it available in astropy.stats seems like a good idea for the time being.

Apropos of that -- Is there a particular reason you're using a flat prior for the binomial confidence interval? I ask because previous discussions of a Bayesian approach to the problem seem to use a Jeffreys prior.

(As a side comment, using the Wilson [1927] confidence intervals would probably be just as accurate, and wouldn't require scipy, since it's just a simple bit of algebra.)

@kbarbary
Member Author

@eteq It may belong in the core package. But if we decide astropy.stats shouldn't contain this sort of thing, I'm fine with that too. I think astropy makes sense over an affiliated package, because this is useful across many astronomy disciplines, as @perwin says.

@perwin Both good questions. You're probably right that this should use a Jeffreys prior (I don't really know, but that's what Bayesian experts seem to think...). Also, the confidence interval I'm estimating is reasonably well-motivated, but seems non-standard. Perhaps I should use a more "standard" one, such as the Jeffreys interval (which would also use scipy.special.betaincinv) or the Wilson interval, as you suggest. Or there could be multiple intervals, with a keyword option: binom_conf_interval(k, n, conf=0.683, interval='wilson')? This would make clear that the interval choice is not unique.

I think the reason this sort of thing isn't included in scipy is that the choice of interval is not a unique one, and many interval choices are easy to estimate using existing tools in scipy (if you know how to do it). The two functions in this PR are fairly simple in that they only string together existing numpy and scipy tools; their real value is their names and docs. Users who don't know a lot about binomial confidence intervals or statistics wouldn't know to use scipy.special.betaincinv or to look up the Wilson interval and implement it. They know that if they observe 1 success out of 1 trial, saying p = 1. +/- 0. or p = 1. +/- 1. is wrong, but knowing how to find a reasonable answer requires a lot of additional knowledge.
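To illustrate how little code the Jeffreys interval takes on top of scipy.special.betaincinv, here is a minimal sketch; the function name and the k=0/k=n edge handling are my own and not necessarily what the PR implements:

```python
from scipy.special import betaincinv

def jeffreys_interval(k, n, conf=0.683):
    """Jeffreys binomial confidence interval: equal-tailed credible
    interval of the Beta(k + 1/2, n - k + 1/2) posterior."""
    alpha = 1.0 - conf
    lo = betaincinv(k + 0.5, n - k + 0.5, alpha / 2.0)
    hi = betaincinv(k + 0.5, n - k + 0.5, 1.0 - alpha / 2.0)
    # Conventional edge handling: always include the maximum-likelihood
    # value when every trial succeeded (or failed).
    if k == 0:
        lo = 0.0
    if k == n:
        hi = 1.0
    return lo, hi
```

Note that for 1 success in 1 trial this gives a nonzero-width interval, which is exactly the behavior the naive p = 1. +/- 0. answer lacks.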

@perwin

perwin commented Feb 12, 2013

@kbarbary

My perspective is largely based on Brown, Cai, & Dasgupta (2001, Statistical Science); you can find a PDF of the article here:
http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1009213286
They compare the various different methods of estimating binomial confidence intervals and conclude that the Wilson and Jeffreys-prior approaches are the two best; they seem to be equally good. There isn't really any explanation of why the Jeffreys prior is used instead of the uniform prior, though, other than a vague suggestion that this is perhaps the most common Bayesian approach to the problem. (Cameron 2011 [http://adsabs.harvard.edu/abs/2011PASA...28..128C] seems to be using a uniform prior, though he mentions the Jeffreys prior as well, making it sound as though it's entirely a matter of taste which you use...)

I have a slight preference for the Wilson interval mainly because it's relatively easy to compute and doesn't require any additional libraries, but if people are happier with a Bayesian derivation and don't mind requiring the scipy library, there's not much reason to object. (I have Python code for computing the Wilson interval, but I haven't bothered properly generalizing it to handle arbitrary percentages for the interval; it only works with multiples of the 68.3% interval.)

Given how many astronomers are still using the "standard" Gaussian approximation (what Brown et al. refer to as the "Wald interval"), pretty much any improvement would be a good thing!

It might be useful to do some testing to see if the confidence intervals vary appreciably between using the Jeffreys prior and the uniform prior (and the Wilson intervals); if the variation is slight, then in most cases people won't care which they're using, as long as it's clearly better than the Wald interval.
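For concreteness, the Wilson interval really is just a bit of algebra; a minimal sketch, with z = 1.0 corresponding to the 68.3% interval mentioned above (the function name is mine):

```python
import math

def wilson_interval(k, n, z=1.0):
    """Wilson (1927) score interval for a binomial proportion.
    z is the Gaussian quantile; z = 1.0 gives the ~68.3% interval."""
    denom = n + z * z
    center = (k + z * z / 2.0) / denom
    halfwidth = (z / denom) * math.sqrt(k * (n - k) / n + z * z / 4.0)
    return center - halfwidth, center + halfwidth
```

For 1 success in 1 trial this gives (0.5, 1.0) rather than a zero-width interval, one reason it outperforms the Wald approximation at the edges.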

@eteq
Member

eteq commented Feb 12, 2013

I like the idea of an interval option that lets the user choose which one they want. But the above discussion still applies in that you need to decide which is the default (and I agree with @perwin that anything better than the gaussian approximation would be an improvement!)

@kbarbary
Member Author

@perwin Thanks for the reference! I'll have a closer look at it. One question I have about the Jeffreys interval: it seems like for k=0 and k=n, the interval contains 1 - alpha/2 rather than 1 - alpha (alpha being, e.g. 0.317 for conf=0.683). Is that right?

I like the idea of an interval option too. Some may not agree: you might argue that since the different intervals will be nearly completely different implementations underneath, they should be separate functions, such as binom_conf_interval_wilson(...), binom_conf_interval_jeffreys(...), etc.

But since no one here is arguing that, I'm going to go ahead and implement the kwarg. I think 'wilson' should be the default because it doesn't require scipy. As you both say, most people won't really care, as long as it is better than the Wald interval.

@astrofrog
Member

Just a minor comment - can we think of a better/more explicit name than effhist? How about detection_efficiency? (or something like that). I'd rather have longer but explicit names than short but ambiguous ones, if possible.

@eteq
Member

eteq commented Feb 13, 2013

👍 to astrofrog's suggestion here.

@kbarbary
Member Author

effhist derives from "efficiency histogram" (histogram because the values are binned), but it doesn't actually give you a histogram, so that's not a great name, I agree.

"detection" might be too specific. This works for anything where you are estimating probability of an outcome as a function of some parameter (for example, the probability of passing some cut).

How about binned_efficiency?

cc @sosey because she mentioned completeness studies in the photometry API discussion.

@perwin

perwin commented Feb 14, 2013

@kbarbary -- sounds like a good idea. (I agree that a single function with an option to specify how the interval is calculated is simpler and the way to go.)

I'm not sure I know the proper answer to your question about the Jeffreys interval definition (but presumably you could code up both versions and test which matches the Wilson interval better!).

@kbarbary
Member Author

I've updated this PR in response to the comments. Rebuilt docs are in the same place as before:

http://kbarbary.github.com/astropy/stats/index.html

Two notes:

  • I still had to use scipy for the Wilson interval. If you want to allow arbitrary confidence (other than 0.6829, for example), you have to use the inverse error function, which is not available in numpy.
  • DOCS: I noticed that the problem with the sidebar not extending to the bottom of the page is noticeable on pages with latex math, such as the binom_conf_interval page. This is because the sidebar javascript runs before the images load. @astrofrog fixed this on the index page by specifying the image height, but that's not possible here, as far as I know.
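The inverse-error-function point can be sketched in a couple of lines; the helper name here is mine, not the PR's:

```python
import math
from scipy.special import erfinv  # not available in numpy

def conf_to_z(conf):
    """Convert a two-sided confidence level to a Gaussian z value,
    e.g. conf = 0.68269 -> z = 1.0."""
    return math.sqrt(2.0) * erfinv(conf)
```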

Desired probability content contained in confidence interval
(p - perr[0], p + perr[1]). Default is 0.68269.
interval : {'wilson', 'jeffreys'}, optional
Formula used for confidence interval. Default is 'wilson'.
Member

Maybe you should add a sentence here that directs users to the binom_conf_interval docstrings for a detailed description of what these intervals mean? When I first looked at it, I thought you hadn't documented that. I see that the link is below, but someone reading this over without background info won't necessarily realize that's where they should go for the background needed to understand these options.

@eteq
Member

eteq commented Feb 17, 2013

It's fine to use scipy for the Wilson interval - that's the whole point of having the special functions there, after all, and there's no reason to go out of the way to avoid it.

Can you add a #TODO above each function that just says something like "note scipy dependency"? At some point in 0.3 we'll hopefully have a formal way of indicating which functions have optional dependencies, and having a TODO will hopefully keep us from forgetting to add that here.

And what about adding a 'gaussian' or 'normal' to the list of interval options? It would be rather silly to use it seeing as how you've already implemented 'wilson' and 'jeffreys', but as @perwin mentioned, there are probably a fair number of people used to doing that, and they may want to at least compare to other results.

@kbarbary
Member Author

@eteq Excellent comments, thanks. The plot directive is especially cool. I think the example is a lot better now as well.

I implemented all your suggestions with the exception of the 'gaussian' / 'normal' (or 'wald') interval. I personally want to demote its use as much as possible, so I think that it should not even be an option, lest someone use it by accident. It gives nonsensical results, particularly in typical astronomy use cases: When the detection efficiency is near 0 or 1 (see binned efficiency example) the interval gives an uncertainty of 0, even when there has only been a single trial. You can argue about priors and whatnot with the Jeffreys and Wilson and other intervals, but the Wald interval is clearly wrong.

In any case, it's only a single line of code

perr = np.sqrt(k / n * (1 - k / n) / n)  # for conf=0.683 (e.g., 1 "sigma")

so users can compare it to binom_conf_interval results pretty easily if they want to.
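To make the failure mode concrete, here is the one-liner wrapped in a hypothetical helper (the name wald_halfwidth is mine) showing the zero-width intervals at the edges:

```python
import numpy as np

def wald_halfwidth(k, n):
    """1-sigma half-width from the normal (Wald) approximation,
    i.e. the single line of code quoted above."""
    p = k / n
    return np.sqrt(p * (1.0 - p) / n)

# The interval collapses to zero width whenever k = 0 or k = n,
# regardless of how many trials were run.
print(wald_halfwidth(1, 1))   # 0.0
print(wald_halfwidth(0, 50))  # 0.0
```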

@kbarbary
Member Author

@eteq Before this is merged, should I squash these commits into one? Will that remove the png image from the repo history? (probably doesn't matter much but...)

@eteq
Member

eteq commented Feb 20, 2013

@kbarbary - Good point about the commits. I would say don't squish all the commits, just remove the one that added the image in the first place. You should be able to do something like this: git rebase -i HEAD^^^^ and then, when your editor pops up, just delete the 8122067 commit where you added the image (smart of you to do that in its own commit!). You may have to do something on the last commit where you removed it, but you probably just have to use normal rebase conflict resolution (in theory...).

Does anyone else (@astrofrog or @perwin) have strong opinions on including/not including a 'normal' option? I agree it gives wrong results, but I still think people will want to use it for the "what not to do" sorts of examples, as well as comparing to earlier work that does use it. And anyway, it does work fine for the large n case, right? (even if that's very rare in astronomy)

An alternative might be to add another example showing why you should not use the Wald interval. If that example includes your single line of code here, then anyone who actually reads it should have no problem reproducing the same thing (in which case I agree there's less need for it as an explicit option.)

@kbarbary
Member Author

@eteq when the efficiency is near 0 or 1, the Wald interval gives wrong results even for large n.

The example of why not to use it is a good idea though; I'll do that.

@perwin

perwin commented Feb 20, 2013

I don't have any strong feelings one way or the other about making the Wald interval an available option (as long as it's not the default!). I agree that not including it at all is potentially a good way of encouraging people not to use it -- and that mentioning/illustrating the disadvantages of the Wald interval is a good idea.

As a minor comment, it might be useful to alternate "frequency" with "efficiency" in the discussion/examples for binom_conf_interval, just because people may not realize that it's the same thing. E.g., someone wanting confidence intervals for the frequency of galaxies with some property, measured from their sample, may find the term "detection efficiency" somewhat confusing.

@kbarbary
Member Author

I'm reconsidering the 'wald' interval. If it isn't the default and there's a warning in the docs about why it probably shouldn't be used, then it might be fine to use it as an option. As long as we call it the 'wald' interval and not 'normal' or 'gaussian'. I don't want a user to see 'normal' and think, "I'm familiar with that; I know what it means... I'll use that." (and having it as an option in binom_conf_interval() will make it easier to construct the example of why not to use it in binned_efficiency())

As for terminology @perwin, I was battling with this a bit. You're right that "efficiency" doesn't really apply in many potential use cases. I see how you're using "frequency" but "frequency of galaxies with some property" could mean something else, such as number per unit volume. I think the most general term is probably "proportion", but I was worried that this sounds a bit more opaque to a common user, so I tried to use both "proportion" and "efficiency" in places. Do you think binned_efficiency() should be called binned_proportion()? It's technically a more accurate description...

@kbarbary
Member Author

@eteq I implemented the Wald interval and removed the image commit. Let me know what you think. Docs are getting long! http://kbarbary.github.com/astropy/stats/index.html

Still to do:

  • clean up terminology, as per @perwin (possibly change name of binned_efficiency())
  • tests for wald interval, maybe additional tests for wilson interval as well?

@eteq
Member

eteq commented Feb 21, 2013

Looks great - long docs are a good thing, especially for statistics!

@perwin

perwin commented Feb 21, 2013

@kbarbary -- sure, "proportion" is probably just as good as "frequency" (and the Wikipedia entry on binomial statistics seems to use "proportion", so there's that...).

The revised documentation for binom_conf_interval looks good!

@kbarbary
Member Author

Any preference between binned_efficiency() and binned_proportion(), @eteq ?

@perwin, do you prefer binned_proportion() or was your last comment referring only to discussion in the docstrings?

I think I'm leaning towards the latter as a more general description.

@eteq
Member

eteq commented Feb 21, 2013

From the name, it's not apparent to me what binned_proportion() means - "proportion" of what? So I'd be a little worried it would confuse people by being too general. But maybe that's ok if the examples are clear. But I don't have a terribly strong opinion.

@perwin

perwin commented Feb 21, 2013

@kbarbary, I guess I would prefer binned_proportion, since it seems more straightforwardly descriptive (my first reaction to the initial code was "I wonder what an 'efficiency histogram' is..." -- I guess this is a particular usage that makes perfect sense to you, but it is not something I've come across).

I'll admit to still being a little confused as to what that function is actually doing... maybe a slightly simpler example of usage?

@eteq
Member

eteq commented Feb 22, 2013

When I think about it a bit more, I think a more fully descriptive name is better to remove the ambiguity of the word "proportion." Something like binned_success_to_failure_ratio? If that seems too awkwardly long, maybe binned_bernoulli_ratio (i.e., the ratio of the two possible outcomes of a Bernoulli trial)?

--------
Suppose we wish to estimate the detection efficiency of an astronomical
source as a function of magnitude using a Monte Carlo experiment with
100 trials between magnitude 20. and 30. Suppose the true 50% detection
Member

misplaced period here, I think. Also, above, I think you mean "efficiency of a survey in detecting an astronomical source", right? Right now it sounds like you're estimating how good some star is at detecting things :)

Member Author

Yep. Will change.

@kbarbary
Member Author

@perwin I'll explain my use case for binned_efficiency() and then maybe you can advise on how to improve the docs to make it more clear what the function is doing:

For supernova surveys, we often want to find the efficiency at which SNe are detected as a function of magnitude. To do this, I'd insert a bunch of "fake stars" with magnitudes spread over a wide range at the start of the data reduction pipeline. Then at the other end of the pipeline, I check whether each fake star was detected or not. So I have a list of magnitudes and a list of True/False values ("detected"/"not detected"). binned_efficiency() takes those two lists and gives you what you really want: the detection efficiency as a function of magnitude.
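A rough numpy-only sketch of that last step, with made-up data and a toy brighter-than-magnitude-25 detection rule (none of these variable names come from the PR), shows the binning the function wraps:

```python
import numpy as np

# Hypothetical fake-star results: a list of magnitudes and a matching
# list of detected / not-detected flags from a toy pipeline.
rng = np.random.default_rng(0)
mag = rng.uniform(20.0, 30.0, size=1000)
detected = mag < 25.0  # toy rule: everything brighter than mag 25

# The core of the binned-efficiency calculation: bin by magnitude,
# then take the fraction of successes in each bin.
bins = np.linspace(20.0, 30.0, 11)
idx = np.digitize(mag, bins) - 1
n_tot = np.bincount(idx, minlength=10)
n_det = np.bincount(idx, weights=detected, minlength=10)
eff = n_det / n_tot  # detection efficiency per magnitude bin
```

With this toy rule the bright bins come out at efficiency 1.0 and the faint bins at 0.0; the real function would attach binomial confidence intervals to each bin.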

@kbarbary
Member Author

@eteq Name-wise, I think we should stick with binned_efficiency(). binned_success_to_failure_ratio() is technically not correct (it's a ratio of successes to total trials) and too long (descriptive is good, but there's a limit...). A lot of people aren't familiar with what a Bernoulli trial is, and "Bernoulli ratio" doesn't appear on the wikipedia page for Bernoulli trial. I believe "efficiency" will align with most (if not all) use cases for the function, and so this name should make it easiest to find for most people wanting to use it.

@perwin

perwin commented Feb 25, 2013

@kbarbary -- Ah, OK, that makes more sense now. It wasn't quite clear to me that you were doing something like "start with a set of data points where x values are continuously distributed and y values are binomial, then bin them according to x value, then calculate proportions and confidence intervals" -- right?

I could still argue that there are use cases for which the term "efficiency" seems a bit odd -- e.g., a lot of the plots in a couple of recent papers I was co-author on were just this kind of histogram w/ binomial confidence intervals, but we were plotting "fraction of galaxies in each bin with property X" (spiral morphology, detected bar, AGN spectrum) vs e.g. galaxy absolute magnitude.

@perwin

perwin commented Feb 25, 2013

Two comments on the current help text for binned_efficiency() [as seen on the current documentation page]:

  1. "For a set of success/failure trials, each corresponding to some value of the parameter x, place the trials into bins and return an estimate of the efficiency (number of successes / number of failures) and its uncertainty in each bin."

that should be "number of successes / total number", no? (Or even just "fraction of successes"?)

  1. A suggested slightly more verbose introduction to the example:

"Suppose we wish to estimate the detection efficiencies for astronomical sources as a function of magnitude (i.e., the probability of detecting a source given its magnitude). In a realistic case, we might prepare a large number of sources with randomly selected magnitudes, inject them into simulated images, and then record which were detected at the end of the reduction pipeline. As a toy example, we generate 100 data points with randomly selected magnitudes between 20 and 30 and "observe" them with a known detection function (here, the error function, with 50% detection probability at magnitude 25):"
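That toy setup might look like this in code (the seed and variable names are my own; scipy's erf supplies the detection function, with the 50% point at magnitude 25 as suggested):

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(42)
mag = rng.uniform(20.0, 30.0, size=100)

# Error-function detection model: probability falls from ~1 at bright
# magnitudes to ~0 at faint ones, crossing 50% at magnitude 25.
p_detect = 0.5 * (1.0 - erf((mag - 25.0) / np.sqrt(2.0)))

# "Observe" each source: detected with probability p_detect.
detected = rng.random(100) < p_detect
```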

@kbarbary
Member Author

@perwin Exactly right. I should add "continuously distributed" to the description for x.
Your comments on the help text sound good, thanks.

binned_binom_proportion()? :)

@eteq
Member

eteq commented Feb 26, 2013

I guess I still see "efficiency" and "proportion" as fairly ambiguous words that could mean multiple things (although binned_binom_proportion() is better in that regard, so I do like that over efficiency). What about binned_success_ratio()? Or binned_success_rate()?

Although I admit this name stuff is all quibbling, as people probably need to read the docs for this one, anyway...

@kbarbary
Member Author

Well, as @perwin's example points out, it's not always going to be describing a success/failure scenario.

@eteq
Member

eteq commented Feb 26, 2013

Well, to me that's still a "success"/total ratio, in the statistical sense of the word. That is, fraction with property X might be treated as the "success" rate of that property. That's actually why I suggested "bernoulli trial" before - it gets across the stats concept without necessarily tying it directly to a specific example (which is good in the docs, but not necessarily in the name).

But it is becoming clear to me that this is probably just we each have slightly different hazy memories of the language used in our earliest stats textbook or something :) And that means there may not be any good answer. So having said my piece (and perhaps a bit more...), I'll leave it to you to contemplate the discussion and I'm fine with whatever you settle on.

@kbarbary
Member Author

Thanks for all the comments both of you. I've updated docstrings according to the comments, and implemented basic tests for the wald interval.

Naming: I agree @eteq there is probably not a name that satisfies: (1) unambiguous (2) completely general (3) not opaque (4) not > 80 characters :) . I changed the name to binned_binom_proportion() as a compromise that at least satisfies being general and corresponding to language on the Wikipedia page for "binomial proportion confidence interval". That said, I think these functions might get further review once astropy.stats is being filled out in earnest and more people are paying attention. I'm happy to reconsider the naming with more input from others at that time (presumably before the 0.3 release).

If you look at the built docs table: http://kbarbary.github.com/astropy/stats/index.html at least the descriptions for the two functions match up nicely.

As far as I'm concerned this is ready to merge.

>>> plt.errorbar(bins, p, xerr=binshw, yerr=perr, ls='none', marker='o',
... label='estimate')

.. plot::
Member

I wonder if maybe you should add a title to this plot (and perhaps the above one)? The text above this is easily confused with the previous plot. Maybe even title this one "Why the Gaussian approximation is bad" so it's obvious you're anti-recommending it?

@eteq
Member

eteq commented Feb 26, 2013

Aside from that possible one-line change I just suggested, looks good to me. I'll give a day or so and then merge, in case anyone else wants to comment and/or you want to do something regarding the above comment.

@perwin

perwin commented Feb 26, 2013

Looks pretty good to me...

@kbarbary
Member Author

@eteq good idea. I added titles and also added to the discussion before the second plot example.

http://kbarbary.github.com/astropy/_generated/astropy.stats.funcs.binned_binom_proportion.html

@eteq
Member

eteq commented Feb 26, 2013

Looks great! As I said, I'll give it a day and merge if there's nothing else.

@eteq
Member

eteq commented Feb 26, 2013

Whoops, it appears travis is unhappy because it doesn't have matplotlib, so it can't do the plot command.

I'll have to take some time to fiddle with the travis configuration to make this work. It should be fine to just install matplotlib - if so, I'll merge this alongside a commit to add that.

@astrofrog
Member

@eteq - note that if you want matplotlib to work on Travis, you'll have issues with Python 3 unless you use the latest master: matplotlib/matplotlib#1781 (comment)

@eteq
Member

eteq commented Feb 28, 2013

@astrofrog - I think that's not an issue because our sphinx build is on 2.7, but I'll test it on my fork before merging anything

@eteq
Member

eteq commented Mar 1, 2013

See https://travis-ci.org/eteq/astropy/builds/5148687, which passed. The sphinx build now takes quite some time (~25 minutes) because it needs scipy, so I moved it higher up in the build order.

So I'll merge that into master, which should close this. Thanks @kbarbary!

eteq added a commit that referenced this pull request Mar 1, 2013
…est fixes also included from eteq/effhist
@eteq eteq merged commit 8c24fe3 into astropy:master Mar 1, 2013
@kbarbary kbarbary deleted the effhist branch November 22, 2013 23:49