quantile_curve(curves, quantile, weights) is broken when weights is None #577

micheles · 2017-01-26T09:40:26Z

By looking at https://github.com/gem/oq-hazardlib/blob/engine-2.2/openquake/hazardlib/stats.py#L35 one can see that there is a special case when the weights are None; here is the code which has been used for 5+ years:

    if weights is None:
        # this implementation is an alternative to
        # numpy.array(mstats.mquantiles(curves, prob=quantile, axis=0))[0]
        # more or less copied from the scipy mquantiles function, just special
        # cased for what we need (and a lot faster)
        arr = numpy.array(curves).reshape(len(curves), -1)
        p = numpy.array(quantile)
        m = 0.4 + p * 0.2
        n = len(arr)
        aleph = n * p + m
        k = numpy.floor(aleph.clip(1, n - 1)).astype(int)
        gamma = (aleph - k).clip(0, 1)
        data = numpy.sort(arr, axis=0).transpose()
        qcurve = (1.0 - gamma) * data[:, k - 1] + gamma * data[:, k]
        return qcurve

This code is used in the case of sampling. I submit that it is broken, since for weights=None one would expect to get the same result than assigning identical weights to all realizations, consistently with how it works in mean_curve, which is not the case.
The problem is that quantile_curve(curves, quantile, weights) uses a different algorithm if the weights are not None, so the numbers are different. For instance:

In[1]: import numpy

In [2]: from openquake.hazardlib.stats import quantile_curve

In [3]: quantile = 0.75

In [4]: curves = numpy.array([
            [.98161, .97837, .95579],
            [.97309, .96857, .93853],
        ])

In [5]: quantile_curve(curves, quantile, None)
Out[5]: array([ 0.98161,  0.97837,  0.95579])

In [6]: quantile_curve(curves, quantile, [.5, .5])
Out[6]: array([ 0.97735,  0.97347,  0.94716])

Given that we do not have a performance problem (the postprocessing is fast compared to the real computation) I would use the same algorithm in all cases, i.e. I would use the algorithm used for the full enumeration case, with nontrivial weights, which is based on interpolation.

PS: the code for weights=None was built to be compatible with scipy.mstats.mquantiles(curves, prob=quantile), however the tests has this comment:

        # TODO(LB): Check with our hazard experts to see if this is reasonable
        # tolerance. Better yet, get a fresh set of test data. (This test data
        # was just copied verbatim from from some old tests in
        # `tests/hazard_test.py`.

The text was updated successfully, but these errors were encountered:

mmpagani · 2017-01-26T14:21:56Z

I agree we must be consistent in the way we calculate quantiles hence I would eliminate the 'customised' implementation

raoanirudh · 2017-01-26T14:34:01Z

When the individual curves are obtained by sampling the logic-tree and no weights are assigned, we do not know the empirical distribution of the sampled values. To estimate statistics from the sampled values, we would need to assign a plotting position based on assumptions regarding the underlying distribution and depending upon the statistic we are trying to estimate. The plotting position recommended by Cunnae (1978) for unbiased estimation of quantiles when the underlying distribution type is unknown seems to be the one implemented in the snippet above. The equation recommended in that paper by Cunnae is listed here.

We don't have to go through the same process to estimate the mean because the sample mean is already an unbiased estimator.

micheles · 2017-01-26T14:40:32Z

Here is an example that uses the algorithm currently implemented for the full enumeration case:

  quantile = 0.75
  poes = numpy.array([.99, .95, .93, .87, .60])
  weights = numpy.array([.4, .1, .3, .1, .1])
  sorted_poe_idxs = numpy.argsort(poes)  # array([4, 3, 2, 1, 0])
  sorted_poes = poes[sorted_poe_idxs]  # array([ 0.6 ,  0.87,  0.93,  0.95,  0.99])
  sorted_weights = weights[sorted_poe_idxs]  # array([ 0.1,  0.1,  0.3,  0.1,  0.4])
  cum_weights = numpy.cumsum(sorted_weights)  # array([ 0.1,  0.2,  0.5,  0.6,  1. ])
  numpy.interp(quantile, cum_weights, sorted_poes)  # 0.965

Does anybody have an idea of the origin of this algorithm? It would be nice to have a link to some standard reference.

raoanirudh · 2017-01-26T14:53:02Z

I believe that's obtaining the quantiles simply by linear interpolation on the empirical CDF of the poes. For reference, see the Type 4 quantile as computed by R.

micheles · 2017-01-26T14:57:44Z

Yes, it makes sense since it does not make any assumption on the underlying distribution. I would keep this algorithm, since it is compatible with the past. BTW, even if we remove the other algorithm for weights=None there is no impact on the risk tests, since there are no risk tests for sampling.

mmpagani · 2017-01-26T15:51:33Z

Thanks this is all very intesting. I have (at least) one question. Isn't the scipy function taking into account the fact case where data doesn't have weight associated?

micheles · 2017-01-26T15:55:47Z

It is used in that case. The issue is that the scipy function gives different numbers than the interpolated CDF approach when the weights are all equal. That make the calculation of quantiles surprising and inconsistent with the mean computation.

mmpagani · 2017-01-26T16:30:14Z

I'll prepare some additional tests on sampling. This is also interesting: https://phobson.github.io/mpl-probscale/tutorial/closer_look_at_plot_pos.html

micheles · 2017-01-26T17:56:33Z

If you want to see the effect on the current tests, see gem/oq-engine#2492

micheles added the bug label Jan 26, 2017

micheles added this to the Engine 2.3.0 milestone Jan 26, 2017

micheles assigned mmpagani and raoanirudh Jan 26, 2017

This was referenced Jan 26, 2017

Using the same quantile algorithm both for weight=None and not None #578

Merged

Fixed the expected quantile files for the tests with sampling gem/oq-engine#2492

Merged

micheles closed this as completed in #578 Jan 31, 2017

raoanirudh mentioned this issue Feb 14, 2019

In event based with sampling, should we go back to identical realization weights? gem/oq-engine#4308

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quantile_curve(curves, quantile, weights) is broken when weights is None #577

quantile_curve(curves, quantile, weights) is broken when weights is None #577

micheles commented Jan 26, 2017 •

edited

Loading

mmpagani commented Jan 26, 2017

raoanirudh commented Jan 26, 2017 •

edited

Loading

micheles commented Jan 26, 2017

raoanirudh commented Jan 26, 2017

micheles commented Jan 26, 2017 •

edited

Loading

mmpagani commented Jan 26, 2017 •

edited

Loading

micheles commented Jan 26, 2017 •

edited

Loading

mmpagani commented Jan 26, 2017 •

edited

Loading

micheles commented Jan 26, 2017

quantile_curve(curves, quantile, weights) is broken when weights is None #577

quantile_curve(curves, quantile, weights) is broken when weights is None #577

Comments

micheles commented Jan 26, 2017 • edited Loading

mmpagani commented Jan 26, 2017

raoanirudh commented Jan 26, 2017 • edited Loading

micheles commented Jan 26, 2017

raoanirudh commented Jan 26, 2017

micheles commented Jan 26, 2017 • edited Loading

mmpagani commented Jan 26, 2017 • edited Loading

micheles commented Jan 26, 2017 • edited Loading

mmpagani commented Jan 26, 2017 • edited Loading

micheles commented Jan 26, 2017

micheles commented Jan 26, 2017 •

edited

Loading

raoanirudh commented Jan 26, 2017 •

edited

Loading

micheles commented Jan 26, 2017 •

edited

Loading

mmpagani commented Jan 26, 2017 •

edited

Loading

micheles commented Jan 26, 2017 •

edited

Loading

mmpagani commented Jan 26, 2017 •

edited

Loading