This repository has been archived by the owner on Nov 10, 2017. It is now read-only.

quantile_curve(curves, quantile, weights) is broken when weights is None #577

micheles opened this issue Jan 26, 2017 · 9 comments · Fixed by gem/oq-engine#2492
micheles commented Jan 26, 2017

Looking at https://github.com/gem/oq-hazardlib/blob/engine-2.2/openquake/hazardlib/stats.py#L35 one can see that there is a special case when weights is None; here is the code, which has been used for 5+ years:

    if weights is None:
        # this implementation is an alternative to
        # numpy.array(mstats.mquantiles(curves, prob=quantile, axis=0))[0]
        # more or less copied from the scipy mquantiles function, just special
        # cased for what we need (and a lot faster)
        arr = numpy.array(curves).reshape(len(curves), -1)
        p = numpy.array(quantile)
        m = 0.4 + p * 0.2
        n = len(arr)
        aleph = n * p + m
        k = numpy.floor(aleph.clip(1, n - 1)).astype(int)
        gamma = (aleph - k).clip(0, 1)
        data = numpy.sort(arr, axis=0).transpose()
        qcurve = (1.0 - gamma) * data[:, k - 1] + gamma * data[:, k]
        return qcurve
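For reference, the branch above can be lifted into a standalone function (a sketch; `quantile_curve_unweighted` is a hypothetical name, not engine code). The constants 0.4 and 0.2 correspond to scipy's `mquantiles` defaults (`alphap = betap = 0.4`):

```python
import numpy

def quantile_curve_unweighted(curves, quantile):
    # standalone copy of the weights=None branch quoted above
    arr = numpy.array(curves).reshape(len(curves), -1)
    p = numpy.array(quantile)
    m = 0.4 + p * 0.2  # alphap = betap = 0.4, i.e. the mquantiles defaults
    n = len(arr)
    aleph = n * p + m
    k = numpy.floor(aleph.clip(1, n - 1)).astype(int)
    gamma = (aleph - k).clip(0, 1)
    data = numpy.sort(arr, axis=0).transpose()
    return (1.0 - gamma) * data[:, k - 1] + gamma * data[:, k]

curves = [[.98161, .97837, .95579],
          [.97309, .96857, .93853]]
print(quantile_curve_unweighted(curves, 0.75))
# with only two curves and p = 0.75, aleph clips so that gamma = 1 and the
# result is the row-wise maximum, i.e. [0.98161, 0.97837, 0.95579]
```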

This code is used in the case of sampling. I submit that it is broken, since for weights=None one would expect to get the same result as assigning identical weights to all realizations, consistently with how mean_curve works; this is not the case.
The problem is that quantile_curve(curves, quantile, weights) uses a different algorithm if the weights are not None, so the numbers are different. For instance:

In [1]: import numpy

In [2]: from openquake.hazardlib.stats import quantile_curve

In [3]: quantile = 0.75

In [4]: curves = numpy.array([
            [.98161, .97837, .95579],
            [.97309, .96857, .93853],
        ])

In [5]: quantile_curve(curves, quantile, None)
Out[5]: array([ 0.98161,  0.97837,  0.95579])

In [6]: quantile_curve(curves, quantile, [.5, .5])
Out[6]: array([ 0.97735,  0.97347,  0.94716])

Given that we do not have a performance problem (the postprocessing is fast compared to the real computation) I would use the same algorithm in all cases, i.e. I would use the algorithm used for the full enumeration case, with nontrivial weights, which is based on interpolation.
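The proposal could be sketched as follows (`quantile_curve_unified` is a hypothetical name, not the engine's actual code): treat weights=None as identical weights and always use the interpolation-based algorithm.

```python
import numpy

def quantile_curve_unified(curves, quantile, weights=None):
    # sketch of the proposal: weights=None means identical weights, and the
    # interpolation algorithm used for full enumeration applies in all cases
    curves = numpy.array(curves)
    n = len(curves)
    if weights is None:
        weights = numpy.full(n, 1.0 / n)
    weights = numpy.asarray(weights)
    result = numpy.empty(curves.shape[1])
    for i in range(curves.shape[1]):
        poes = curves[:, i]
        order = numpy.argsort(poes)
        cum_weights = numpy.cumsum(weights[order])
        result[i] = numpy.interp(quantile, cum_weights, poes[order])
    return result

curves = [[.98161, .97837, .95579],
          [.97309, .96857, .93853]]
# None and identical weights now agree:
print(quantile_curve_unified(curves, 0.75, None))      # [0.97735 0.97347 0.94716]
print(quantile_curve_unified(curves, 0.75, [.5, .5]))  # same result
```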

PS: the code for weights=None was built to be compatible with scipy.mstats.mquantiles(curves, prob=quantile); however, the tests have this comment:

        # TODO(LB): Check with our hazard experts to see if this is reasonable
        # tolerance. Better yet, get a fresh set of test data. (This test data
        # was just copied verbatim from from some old tests in
        # `tests/hazard_test.py`.
@micheles micheles added the bug label Jan 26, 2017
@micheles micheles added this to the Engine 2.3.0 milestone Jan 26, 2017
@mmpagani (Member) commented:

I agree we must be consistent in the way we calculate quantiles; hence I would eliminate the 'customised' implementation.

raoanirudh commented Jan 26, 2017

When the individual curves are obtained by sampling the logic tree and no weights are assigned, we do not know the empirical distribution of the sampled values. To estimate statistics from the sampled values, we would need to assign a plotting position based on assumptions regarding the underlying distribution and depending upon the statistic we are trying to estimate. The plotting position recommended by Cunnane (1978) for unbiased estimation of quantiles when the underlying distribution type is unknown seems to be the one implemented in the snippet above. The equation recommended in that paper by Cunnane is listed here.

We don't have to go through the same process to estimate the mean because the sample mean is already an unbiased estimator.
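To make the connection concrete (a sketch, assuming the 1978 Cunnane paper is the one meant above), the Cunnane plotting position is p_k = (k - 0.4) / (n + 0.2):

```python
import numpy

# Cunnane (1978) plotting position: p_k = (k - 0.4) / (n + 0.2), k = 1..n.
# In the snippet above, aleph = n*p + m with m = 0.4 + 0.2*p; setting
# aleph = k and solving for p recovers exactly this plotting position.
n = 5
k = numpy.arange(1, n + 1)
p = (k - 0.4) / (n + 0.2)
print(p)  # the middle of 5 samples sits at p = 0.5, as expected for a median
```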

@micheles (Contributor, Author) commented:

Here is an example that uses the algorithm currently implemented for the full enumeration case:

  quantile = 0.75
  poes = numpy.array([.99, .95, .93, .87, .60])
  weights = numpy.array([.4, .1, .3, .1, .1])
  sorted_poe_idxs = numpy.argsort(poes)  # array([4, 3, 2, 1, 0])
  sorted_poes = poes[sorted_poe_idxs]  # array([ 0.6 ,  0.87,  0.93,  0.95,  0.99])
  sorted_weights = weights[sorted_poe_idxs]  # array([ 0.1,  0.1,  0.3,  0.1,  0.4])
  cum_weights = numpy.cumsum(sorted_weights)  # array([ 0.1,  0.2,  0.5,  0.6,  1. ])
  numpy.interp(quantile, cum_weights, sorted_poes)  # 0.965

Does anybody have an idea of the origin of this algorithm? It would be nice to have a link to some standard reference.

@raoanirudh (Member) commented:

I believe that's obtaining the quantiles simply by linear interpolation on the empirical CDF of the poes. For reference, see the Type 4 quantile as computed by R.
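A quick numerical check (a sketch; boundary cases with h < 1 or h > n are ignored here) that interpolating on the cumulative equal weights reproduces R's type-4 rule h = n*p, Q = x[h] with linear interpolation between order statistics:

```python
import numpy

x = numpy.sort(numpy.array([.99, .95, .93, .87, .60]))
n = len(x)
p = 0.75

# empirical-CDF interpolation with equal weights 1/n, as in the engine example
cum_weights = numpy.cumsum(numpy.full(n, 1.0 / n))  # [0.2, 0.4, ..., 1.0]
via_interp = numpy.interp(p, cum_weights, x)

# R type-4 quantile: h = n*p (1-based indexing in R, 0-based here)
h = n * p
j = int(numpy.floor(h))
via_type4 = x[j - 1] + (h - j) * (x[j] - x[j - 1])

print(via_interp, via_type4)  # both 0.945
```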

micheles commented Jan 26, 2017

Yes, it makes sense since it does not make any assumption on the underlying distribution. I would keep this algorithm, since it is compatible with the past. BTW, even if we remove the other algorithm for weights=None there is no impact on the risk tests, since there are no risk tests for sampling.

mmpagani commented Jan 26, 2017

Thanks, this is all very interesting. I have (at least) one question: isn't the scipy function taking into account the case where the data don't have weights associated?

micheles commented Jan 26, 2017

It is used in that case. The issue is that the scipy function gives different numbers than the interpolated-CDF approach when the weights are all equal. That makes the calculation of quantiles surprising and inconsistent with the mean computation.

mmpagani commented Jan 26, 2017

I'll prepare some additional tests on sampling. This is also interesting: https://phobson.github.io/mpl-probscale/tutorial/closer_look_at_plot_pos.html

@micheles (Contributor, Author) commented:

If you want to see the effect on the current tests, see gem/oq-engine#2492
