
Knuth’s rule fails with simple and small array (eats up system's memory) #11357

Open
Gabriel-p (Contributor) opened this issue Feb 28, 2021 · 15 comments

Description

The knuth_bin_width function cannot handle a small, simple array.

Expected behavior

A histogram should be generated, or an error shown explaining why it was not possible to obtain it.

Actual behavior

The function starts to gobble up the system's memory.

Steps to Reproduce

import numpy as np
import matplotlib.pyplot as plt
from astropy.visualization import hist
arr = np.array([0.05555556, 0., 0., 0., 0., 1., 0., 0., 0., 0.5])
ax = plt.subplot(111)
hist(arr, bins='knuth', ax=ax)

System Details

Linux-5.5.0-050500-generic-x86_64-with-glibc2.10
>>> Python 3.8.8 (default, Feb 24 2021, 21:46:12) 
[GCC 7.3.0]
>>> Numpy 1.19.2
>>> astropy 4.2
>>> Scipy 1.5.2
>>> Matplotlib 3.3.1
@pllim (Member) commented Mar 1, 2021

Did you see this error? I see it when I run with 4.3.dev:

ValueError: Histogram has too many bins: 45298337. Use max_bins to increase
the number of allowed bins or range to restrict the histogram range.
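
For reference, the same optimization can be exercised without matplotlib by calling the bin-width routine directly. A minimal sketch using the public knuth_bin_width (expect it to take a long time on this input):

import numpy as np
from astropy.stats import knuth_bin_width

arr = np.array([0.05555556, 0, 0, 0, 0, 1, 0, 0, 0, 0.5])

# Runs the same fmin-based likelihood fit as hist(bins='knuth'); for this
# array it grinds for a long time and returns a huge number of bin edges.
width, bins = knuth_bin_width(arr, return_bins=True)
print(width, len(bins))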

@Gabriel-p (Contributor, Author)

Nope, no errors, no warnings. Is that branch accessible? I cannot see it here

@pllim (Member) commented Mar 1, 2021

You can get 4.3.dev by installing the development version from master. It does take a while to compute. It didn't eat up all of my machine's memory, so the consumption does stop at some point. I think this might actually be a bug in:

def calculate_bin_edges(a, bins=10, range=None, weights=None):

cc @larrybradley

pllim added the stats label on Mar 1, 2021
@pllim (Member) commented Mar 1, 2021

I think I tracked it down to here:

M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]

But I don't know how to fix it.

>>> import numpy as np
>>> from scipy import optimize
>>> from astropy.stats.histogram import _KnuthF, freedman_bin_width
>>> x = np.array([0.05555556, 0, 0, 0, 0, 1, 0, 0, 0, 0.5])
>>> a = np.asarray(x).ravel()
>>> dx0, bins0 = freedman_bin_width(a, True)
>>> dx0
0.03867991004116571
>>> bins0
array([0.        , 0.03867991, 0.07735982, 0.11603973, 0.15471964,
       0.19339955, 0.23207946, 0.27075937, 0.30943928, 0.34811919,
       0.3867991 , 0.42547901, 0.46415892, 0.50283883, 0.54151874,
       0.58019865, 0.61887856, 0.65755847, 0.69623838, 0.73491829,
       0.7735982 , 0.81227811, 0.85095802, 0.88963793, 0.92831784,
       0.96699775, 1.00567766])
>>> len(bins0)
27
>>> knuthF = _KnuthF(a)
>>> zzz = optimize.fmin(knuthF, len(bins0), disp=True)  # This takes a long time
Optimization terminated successfully.
         Current function value: -11.814026
         Iterations: 63
         Function evaluations: 153
>>> zzz
array([45298336.05000024])
>>> M = zzz[0]
>>> bins = knuthF.bins(M)
>>> len(bins)  # over 45 million elements
45298337

@larrybradley (Member)

It looks like something in _KnuthF needs to be fixed.

@Gabriel-p (Contributor, Author) commented Mar 13, 2021

An easy (hacky) fix would be to raise an error when the number of bins gets too large here.

Something like:

...
M = int(M)

if M > 1000000:
    raise ValueError(
        "Max bins limit reached (1e6). Use max_bins to increase the "
        "number of allowed bins or range to restrict the histogram "
        "range.")
if M <= 0:
    return np.inf
...

@pllim (Member) commented Mar 13, 2021

Why not do the check here (after this line)?

M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]

Then you can actually have max_bins be set by the user.
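
A minimal sketch of what that might look like as a standalone wrapper, assuming a hypothetical, user-settable max_bins cap (the slow fmin call still runs; the cap only prevents the subsequent memory blow-up):

import numpy as np
from scipy import optimize
from astropy.stats.histogram import _KnuthF, freedman_bin_width


def knuth_nbins_capped(data, max_bins=1_000_000, quiet=True):
    # Hypothetical wrapper: run the existing fmin optimization, then reject
    # absurd results instead of letting a giant histogram get built.
    a = np.asarray(data).ravel()
    _, bins0 = freedman_bin_width(a, True)
    knuthF = _KnuthF(a)
    M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]
    if M > max_bins:
        raise ValueError(
            f"Knuth's rule selected {int(M)} bins, exceeding max_bins="
            f"{max_bins}; increase max_bins or restrict the data range.")
    return int(M)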

@Gabriel-p (Contributor, Author)

Because the issue happens inside the optimize call; that's where M grows without bound.

@mirca (Member) commented Jul 10, 2021

The issue is that for this particular instance of data = arr, the objective function of the underlying optimization problem (the negative log-likelihood _KnuthF, which is being minimized) has no minimum, i.e., _KnuthF.eval keeps decreasing as M increases. You can see from Figure 1 of the paper that the shape of this log-likelihood changes a lot as a function of the input data: https://arxiv.org/pdf/physics/0605197.pdf
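
One way to see this concretely is to evaluate the objective on a coarse grid of bin counts; for this array the value keeps dropping as M grows. A quick sketch using the private _KnuthF helper (so it may change between astropy versions):

import numpy as np
from astropy.stats.histogram import _KnuthF  # private helper; API may change

arr = np.array([0.05555556, 0, 0, 0, 0, 1, 0, 0, 0, 0.5])
knuthF = _KnuthF(arr)  # the negative log-likelihood that fmin tries to minimize

# The value keeps decreasing as the bin count M grows, so fmin wanders off
# toward ever-larger M instead of settling on a finite minimizer.
for M in (10, 100, 1000, 10_000, 100_000):
    print(M, knuthF(M))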

@mirca (Member) commented Jul 10, 2021

Since this log-likelihood function is one-dimensional, it would be more sensible to just evaluate it on a bounded grid of bin counts and then pick the value of M that optimizes it. What do y'all think?

@Gabriel-p (Contributor, Author)

@mirca could you write a quick example of how your fix would work?

@mirca (Member) commented Jul 14, 2021

@Gabriel-p The solution I propose is to replace

M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]

by an integer grid search over knuthF.
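
A minimal sketch of that replacement, with an illustrative (not astropy's) max_bins cap and default:

import numpy as np
from astropy.stats.histogram import _KnuthF


def knuth_nbins_grid(data, max_bins=1000):
    # Evaluate the Knuth objective on the integer grid 2..max_bins and
    # return the bin count that minimizes it.
    a = np.asarray(data).ravel()
    knuthF = _KnuthF(a)
    candidates = np.arange(2, max_bins + 1)
    values = np.array([knuthF(m) for m in candidates])
    return int(candidates[np.argmin(values)])

For pathological inputs like the one in this issue, the result is then bounded by max_bins instead of blowing up.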

@Gabriel-p (Contributor, Author)

@mirca sounds reasonable, but I'm worried that it could slow down the evaluation. Over what domain do you propose to run the grid search? From 1 to 1000000? Do you propose to use optimize.brute?

@mirca (Member) commented Sep 27, 2021

@Gabriel-p Even if that is the case, I think it's better than the current code, which tries to optimize a highly nonconvex function that can possibly have no minimum. A grid search with a sensible upper bound on the number of bins seems a reasonable solution. What do you think?

PS: The max number of bins could even be an input parameter with some sensible default.

@Gabriel-p (Contributor, Author)

Ok, I've been running some tests with optimize.brute() and comparing the results with the default optimize.fmin() (code is below). I tested the results for a uniform and a normal distribution, with N=50 and N=500 random realizations (10 points each). This is what I get:

[Scatter plots for N=50 and N=500: M_fmin on the x axis, M_fmin − M_brute on the y axis, colored by len(bins0); produced by the script below as 50.png and 500.png]

Two things are apparent in these simple tests:

  1. The number of bins returned by minimizing knuthF() is very similar to that from just using len(bins0) (i.e., the output from freedman_bin_width())
  2. M_fmin and M_brute start to diverge for larger estimated M_fmin

What do you guys think?


import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize
from astropy.stats.histogram import _KnuthF, freedman_bin_width


def findM(x):
    """Compare the fmin- and brute-force estimates of the Knuth bin count."""
    a = np.asarray(x).ravel()
    dx0, bins0 = freedman_bin_width(a, True)
    knuthF = _KnuthF(a)

    # fmin method
    M_fmin = optimize.fmin(knuthF, len(bins0), disp=False)[0]

    # HARDCODED?
    bmin, bmax = max(2, .5 * len(bins0)), 2 * len(bins0)
    step = (bmax + bmin) / 100
    # brute method
    M_brute = optimize.brute(
        knuthF, (slice(bmin, bmax, step),), full_output=False, finish=None)

    return int(M_fmin), M_brute, len(bins0)


Npoints = 500
res = []
for _ in range(Npoints):
    x = np.random.uniform(0, 1, 10)
    M_fmin, M_brute, Nbins0 = findM(x)
    res.append([M_fmin, M_brute, Nbins0])
res_uniform = np.array(res).T

res = []
for _ in range(Npoints):
    x = np.random.normal(0, 1, 10)
    M_fmin, M_brute, Nbins0 = findM(x)
    res.append([M_fmin, M_brute, Nbins0])
res_normal = np.array(res).T

fig = plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.title("uniform (Npoints={})".format(Npoints))
plt.scatter(res_uniform[0], res_uniform[0] - res_uniform[1], c=res_uniform[2])
plt.axhline(0, ls=":")
plt.xlabel("M_fmin")
plt.ylabel("M_fmin - M_brute")
cb = plt.colorbar()
cb.set_label("len(bins0)")

plt.subplot(122)
plt.title("normal (Npoints={})".format(Npoints))
plt.scatter(res_normal[0], res_normal[0] - res_normal[1], c=res_normal[2])
plt.axhline(0, ls=":")
plt.xlabel("M_fmin")
plt.ylabel("M_fmin - M_brute")
cb = plt.colorbar()
cb.set_label("len(bins0)")

fig.tight_layout()
plt.savefig("{}.png".format(Npoints), dpi=150)
