Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upper limit to number of bins in hist and refactor bin edge calculation #7991

Merged
merged 13 commits into from Nov 13, 2018

Conversation

mwcraig
Copy link
Member

@mwcraig mwcraig commented Oct 26, 2018

This PR fixes #7758 by separating the calculation of the bin edges for a histogram from the calculation of the histogram and raising an exception if the number of bins is very large. The default value I chose for the maximum number of bins is ridiculously large (1e6) but I was afraid that setting a value too low (around 3-4000 one sees matplotlib slow down) might start to raise exceptions in code people are already running.

FWIW, I first hit #7758 when I tried to histogram the pixel values of a calibrated image which turned out to have some extreme values. That led to ~5e8 bins, followed by a computer freeze as all the memory was slowly consumed. That same image dropped into this fix generates an error instead.

Edit: Also fixes #8010.

@astropy-bot
Copy link

astropy-bot bot commented Oct 26, 2018

Hi there @mwcraig 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

@mwcraig
Copy link
Member Author

mwcraig commented Oct 26, 2018

Happy to add a changelog entry once a milestone is set...

@bsipocz bsipocz added this to the v2.0.10 milestone Oct 27, 2018
@astrofrog
Copy link
Member

@mwcraig - can you rebase?

@mwcraig
Copy link
Member Author

mwcraig commented Oct 27, 2018

@astrofrog -- rebased -- does the changelog entry go in the section for 3.1 or 2.10?

@astrofrog
Copy link
Member

@mwcraig - I think 2.0.10 makes sense as I think this was really a bug. Note that there are some real failures here in the doc build:

TypeError: _calculate_bin_edges() got an unexpected keyword argument 'histtype'

@crawfordsm
Copy link
Member

Looks like a reasonable change to me. Though I think that _calculate_bin_edges could be a public function, and then in that case, it could use a docstring and a test.

@mwcraig
Copy link
Member Author

mwcraig commented Oct 28, 2018

I made the new function public and added a doctstring.

A couple more changes are incoming because it turns out there is a bug when calculating the histogram with a range...unlike in numpy, where range specifies the upper and lower limit of the histogram edges, our histogram restricts the data to fall within the range but produces bin edges that can be smaller than the range or outside the range.

@bsipocz bsipocz added the Bug label Nov 2, 2018
@bsipocz
Copy link
Member

bsipocz commented Nov 2, 2018

@crawfordsm or @larrybradley - could you please review this?

Copy link
Member

@larrybradley larrybradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mwcraig. I did find one typo in a docstring.

astropy/stats/histogram.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Nov 13, 2018

Codecov Report

Merging #7991 into master will increase coverage by 0.04%.
The diff coverage is 84.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7991      +/-   ##
==========================================
+ Coverage   86.87%   86.92%   +0.04%     
==========================================
  Files         384      383       -1     
  Lines       57860    57855       -5     
  Branches     1078     1056      -22     
==========================================
+ Hits        50265    50288      +23     
+ Misses       6957     6953       -4     
+ Partials      638      614      -24
Impacted Files Coverage Δ
astropy/visualization/hist.py 100% <100%> (ø) ⬆️
astropy/stats/histogram.py 92% <81.25%> (-5.6%) ⬇️
astropy/stats/funcs.py 89.54% <0%> (-0.87%) ⬇️
astropy/modeling/parameters.py 89.11% <0%> (-0.49%) ⬇️
astropy/convolution/convolve.py 85.21% <0%> (-0.31%) ⬇️
astropy/nddata/nduncertainty.py 93.1% <0%> (-0.29%) ⬇️
astropy/io/fits/header.py 96.11% <0%> (-0.13%) ⬇️
astropy/coordinates/representation.py 95.15% <0%> (-0.06%) ⬇️
astropy/time/formats.py 97.47% <0%> (-0.04%) ⬇️
astropy/io/fits/hdu/table.py 92.87% <0%> (-0.02%) ⬇️
... and 33 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ca593a...ce61ee1. Read the comment docs.

it will be (x.min(), x.max()). However, if bins is a list it is
returned unmodified regardless of the range argument.

weights : array_like, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for a random comment from the sidelines, but why include weights if it does not influence the calculation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was because of this statement in np.histogram_bin_edges, which takes a weights argument:

weights : array_like, optional

An array of weights, of the same shape as a. Each value in a only contributes its associated weight towards the bin count (instead of 1). This is currently not used by any of the bin estimators, but may be in the future.

Right now it does no harm, but if, in the future, there is some sort of way of finding histogram edges in numpy that uses weights we won't need to make any changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if we just mimic np.histogram_bin_edges, then that's totally fine. Maybe add the same sentence, that it might be used in future.

Copy link
Member

@astrofrog astrofrog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@astrofrog astrofrog added the 💤 merge-when-ci-passes Do not use: We have auto-merge option now. label Nov 13, 2018
@bsipocz bsipocz merged commit 35dc079 into astropy:master Nov 13, 2018
@bsipocz
Copy link
Member

bsipocz commented Nov 13, 2018

Thank you @mwcraig!

bsipocz added a commit that referenced this pull request Nov 13, 2018
Add upper limit to number of bins in hist and refactor bin edge calculation
@bsipocz
Copy link
Member

bsipocz commented Nov 13, 2018

Looking at the diffs here again, now it this looks more borderline of a slight API change (with the addition of a new kwarg, and well as a new function exposed to the users) and a pure bugfix. If the tests go all well with the backport I'm on the side of letting it end up in 2.0.10, but let me know if anyone would like to veto it.
@eteq?

bsipocz added a commit that referenced this pull request Nov 13, 2018
Add upper limit to number of bins in hist and refactor bin edge calculation
@mwcraig mwcraig deleted the fix-7758 branch April 22, 2019 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants