
Knuth’s rule fails with simple and small array (eats up system's memory) #11357

Open
Gabriel-p (Contributor) opened this issue Feb 28, 2021 · 15 comments

Description

The knuth_bin_width function cannot handle a small, simple array.

Expected behavior

A histogram should be generated, or an error shown explaining why it was not possible to obtain it.

Actual behavior

The function starts to gobble up the system's memory.

Steps to Reproduce

import numpy as np
import matplotlib.pyplot as plt
from astropy.visualization import hist
arr = np.array([0.05555556, 0., 0., 0., 0., 1., 0., 0., 0., 0.5])
ax = plt.subplot(111)
hist(arr, bins='knuth', ax=ax)

System Details

Linux-5.5.0-050500-generic-x86_64-with-glibc2.10
>>> Python 3.8.8 (default, Feb 24 2021, 21:46:12) 
[GCC 7.3.0]
>>> Numpy 1.19.2
>>> astropy 4.2
>>> Scipy 1.5.2
>>> Matplotlib 3.3.1
@pllim (Member) commented Mar 1, 2021

Did you see this error? I see it when I run with 4.3.dev:

ValueError: Histogram has too many bins: 45298337. Use max_bins to increase
the number of allowed bins or range to restrict the histogram range.
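
For reference, the same optimization can be exercised without matplotlib by calling the bin-width routine directly. A minimal sketch using the public knuth_bin_width (expect it to take a long time on this input):

import numpy as np
from astropy.stats import knuth_bin_width

arr = np.array([0.05555556, 0, 0, 0, 0, 1, 0, 0, 0, 0.5])

# Runs the same fmin-based likelihood fit as hist(bins='knuth'); for this
# array it grinds for a long time and returns a huge number of bin edges.
width, bins = knuth_bin_width(arr, return_bins=True)
print(width, len(bins))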

@Gabriel-p (Contributor, Author)

Nope, no errors, no warnings. Is that branch accessible? I cannot see it here

@pllim (Member) commented Mar 1, 2021

You can get 4.3.dev by installing the development version from master. It does take a while to compute. It didn't eat up all of my machine's memory, so the consumption does stop at some point. I think this might actually be a bug in:

def calculate_bin_edges(a, bins=10, range=None, weights=None):

cc @larrybradley

pllim added the stats label on Mar 1, 2021
@pllim (Member) commented Mar 1, 2021

I think I tracked it down to here:

M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]

But I don't know how to fix it.

>>> import numpy as np
>>> from scipy import optimize
>>> from astropy.stats.histogram import _KnuthF, freedman_bin_width
>>> x = np.array([0.05555556, 0, 0, 0, 0, 1, 0, 0, 0, 0.5])
>>> a = np.asarray(x).ravel()
>>> dx0, bins0 = freedman_bin_width(a, True)
>>> dx0
0.03867991004116571
>>> bins0
array([0.        , 0.03867991, 0.07735982, 0.11603973, 0.15471964,
       0.19339955, 0.23207946, 0.27075937, 0.30943928, 0.34811919,
       0.3867991 , 0.42547901, 0.46415892, 0.50283883, 0.54151874,
       0.58019865, 0.61887856, 0.65755847, 0.69623838, 0.73491829,
       0.7735982 , 0.81227811, 0.85095802, 0.88963793, 0.92831784,
       0.96699775, 1.00567766])
>>> len(bins0)
27
>>> knuthF = _KnuthF(a)
>>> zzz = optimize.fmin(knuthF, len(bins0), disp=True)  # This takes a long time
Optimization terminated successfully.
         Current function value: -11.814026
         Iterations: 63
         Function evaluations: 153
>>> zzz
array([45298336.05000024])
>>> M = zzz[0]
>>> bins = knuthF.bins(M)
>>> len(bins)  # over 45 million elements
45298337

@larrybradley (Member)

It looks like something in _KnuthF needs to be fixed.

@Gabriel-p (Contributor, Author) commented Mar 13, 2021

An easy (hacky) fix would be to raise an error when the number of bins gets too large here.

Something like:

...
M = int(M)

if M > 1000000:
    raise ValueError(
        "Max bins limit reached (1e6). Use max_bins to increase the "
        "number of allowed bins or range to restrict the histogram "
        "range.")
if M <= 0:
    return np.inf
...

@pllim (Member) commented Mar 13, 2021

Why not do the check here (after this line)?

M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]

Then you can actually have max_bins be set by the user.
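
A minimal sketch of what that might look like as a standalone wrapper, assuming a hypothetical, user-settable max_bins cap (the slow fmin call still runs; the cap only prevents the subsequent memory blow-up):

import numpy as np
from scipy import optimize
from astropy.stats.histogram import _KnuthF, freedman_bin_width


def knuth_nbins_capped(data, max_bins=1_000_000, quiet=True):
    # Hypothetical wrapper: run the existing fmin optimization, then reject
    # absurd results instead of letting a giant histogram get built.
    a = np.asarray(data).ravel()
    _, bins0 = freedman_bin_width(a, True)
    knuthF = _KnuthF(a)
    M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]
    if M > max_bins:
        raise ValueError(
            f"Knuth's rule selected {int(M)} bins, exceeding max_bins="
            f"{max_bins}; increase max_bins or restrict the data range.")
    return int(M)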

@Gabriel-p (Contributor, Author)

Because the issue happens inside the optimize call; that's where M grows without bound.

@mirca (Member) commented Jul 10, 2021

The issue is that for this particular instance of data = arr, the objective function of the underlying optimization problem (the negative log-likelihood _KnuthF, which is being minimized) has no minimum, i.e., _KnuthF.eval keeps decreasing as M increases. You can see from Figure 1 of the paper that the shape of this log-likelihood changes a lot as a function of the input data: https://arxiv.org/pdf/physics/0605197.pdf
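
One way to see this concretely is to evaluate the objective on a coarse grid of bin counts; for this array the value keeps dropping as M grows. A quick sketch using the private _KnuthF helper (so it may change between astropy versions):

import numpy as np
from astropy.stats.histogram import _KnuthF  # private helper; API may change

arr = np.array([0.05555556, 0, 0, 0, 0, 1, 0, 0, 0, 0.5])
knuthF = _KnuthF(arr)  # the negative log-likelihood that fmin tries to minimize

# The value keeps decreasing as the bin count M grows, so fmin wanders off
# toward ever-larger M instead of settling on a finite minimizer.
for M in (10, 100, 1000, 10_000, 100_000):
    print(M, knuthF(M))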

@mirca (Member) commented Jul 10, 2021

Since this log-likelihood function is one-dimensional, it would be more sensible to just evaluate it on a bounded grid of bin counts and then pick the value of M that optimizes it. What do y'all think?

@Gabriel-p (Contributor, Author)

@mirca could you write a quick example of how your fix would work?

@mirca (Member) commented Jul 14, 2021

@Gabriel-p The solution I propose is to replace

M = optimize.fmin(knuthF, len(bins0), disp=not quiet)[0]

by an integer grid search over knuthF.
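
A minimal sketch of that replacement, with an illustrative (not astropy's) max_bins cap and default:

import numpy as np
from astropy.stats.histogram import _KnuthF


def knuth_nbins_grid(data, max_bins=1000):
    # Evaluate the Knuth objective on the integer grid 2..max_bins and
    # return the bin count that minimizes it.
    a = np.asarray(data).ravel()
    knuthF = _KnuthF(a)
    candidates = np.arange(2, max_bins + 1)
    values = np.array([knuthF(m) for m in candidates])
    return int(candidates[np.argmin(values)])

For pathological inputs like the one in this issue, the result is then bounded by max_bins instead of blowing up.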

@Gabriel-p (Contributor, Author)

@mirca sounds reasonable, but I'm worried that it could slow down the evaluation. Over what domain do you propose to run the grid search? From 1 to 1000000? Do you propose to use optimize.brute?

@mirca (Member) commented Sep 27, 2021

@Gabriel-p Even if that is the case, I think it's better than the current code, which tries to optimize a highly nonconvex function that can possibly have no minimum. A grid search with a sensible upper bound on the number of bins seems a reasonable solution. What do you think?

PS: The max number of bins could even be an input parameter with some sensible default.

@Gabriel-p (Contributor, Author)

Ok, I've been running some tests with optimize.brute() and comparing the results with the default optimize.fmin() (code is below). I tested the results for a uniform and a normal distribution, with N=50 and N=500 random realizations (10 points each). This is what I get:

[Scatter plots for N=50 and N=500: M_fmin on the x axis, M_fmin − M_brute on the y axis, colored by len(bins0); produced by the script below as 50.png and 500.png]

Two things are apparent in these simple tests:

  1. The number of bins returned by minimizing knuthF() is very similar to that from just using len(bins0) (i.e., the output from freedman_bin_width())
  2. M_fmin and M_brute start to diverge for larger estimated M_fmin

What do you guys think?


import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize
from astropy.stats.histogram import _KnuthF, freedman_bin_width


def findM(x):
    """Compare the fmin- and brute-force estimates of the Knuth bin count."""
    a = np.asarray(x).ravel()
    dx0, bins0 = freedman_bin_width(a, True)
    knuthF = _KnuthF(a)

    # fmin method
    M_fmin = optimize.fmin(knuthF, len(bins0), disp=False)[0]

    # HARDCODED?
    bmin, bmax = max(2, .5 * len(bins0)), 2 * len(bins0)
    step = (bmax + bmin) / 100
    # brute method
    M_brute = optimize.brute(
        knuthF, (slice(bmin, bmax, step),), full_output=False, finish=None)

    return int(M_fmin), M_brute, len(bins0)


Npoints = 500
res = []
for _ in range(Npoints):
    x = np.random.uniform(0, 1, 10)
    M_fmin, M_brute, Nbins0 = findM(x)
    res.append([M_fmin, M_brute, Nbins0])
res_uniform = np.array(res).T

res = []
for _ in range(Npoints):
    x = np.random.normal(0, 1, 10)
    M_fmin, M_brute, Nbins0 = findM(x)
    res.append([M_fmin, M_brute, Nbins0])
res_normal = np.array(res).T

fig = plt.figure(figsize=(15, 5))

plt.subplot(121)
plt.title("uniform (Npoints={})".format(Npoints))
plt.scatter(res_uniform[0], res_uniform[0] - res_uniform[1], c=res_uniform[2])
plt.axhline(0, ls=":")
plt.xlabel("M_fmin")
plt.ylabel("M_fmin - M_brute")
cb = plt.colorbar()
cb.set_label("len(bins0)")

plt.subplot(122)
plt.title("normal (Npoints={})".format(Npoints))
plt.scatter(res_normal[0], res_normal[0] - res_normal[1], c=res_normal[2])
plt.axhline(0, ls=":")
plt.xlabel("M_fmin")
plt.ylabel("M_fmin - M_brute")
cb = plt.colorbar()
cb.set_label("len(bins0)")

fig.tight_layout()
plt.savefig("{}.png".format(Npoints), dpi=150)
