How to bin integral vs floating point correctly (precision issue and general question) #336
Comments
Thanks for the kind praise! This is a recurring question and the standard answer is to use an So would using If for some reason your input values are really floating point, you could add half the bin width to avoid the rounding errors. This is not an elegant solution, but it works. This is also not an elegant solution, but you can even make your own axis class that does this step automatically. See https://www.boost.org/doc/libs/develop/libs/histogram/doc/html/histogram/guide.html#histogram.guide.expert.user_defined_axes for examples. Basically, you would override the |
Come to think of it, you could also override the |
The correct solution for this is for us to support |
Thanks a lot @HDembinski for the help! What you say makes sense. We just looked at the integer axis, and it most likely would help. But I understand it can not be used with arbitrary bin counts, correct? This is not a huge impediment but it would be a bit sad. Also the suggestion to add half a bin width would seem feasible and possibly a good idea! But then I guess we could not use the range-based Support for |
Can I also ask a related question, in case you know: from a completely different point of view, is it sensible (or common) to shift the bin bounds down by half a bin width? We were considering this idea that for example bin |
That is correct,
If you make your own axis which implements this hack, it will work with range-based fill().
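The half-bin-width shift mentioned above can be illustrated with a minimal sketch in plain Python. The helper names `regular_index` and `shifted_index` are made up for this illustration and are not part of Boost.Histogram; the formula is the normalise-then-scale form discussed in this thread.

```python
import math

def regular_index(x, a, b, n):
    """Bin index via z = (x - a) / (b - a), then z * n (hypothetical helper)."""
    z = (x - a) / (b - a)
    return math.floor(z * n)

def shifted_index(x, a, b, n):
    """Same calculation, but the input is first moved to the bin centre."""
    half_width = 0.5 * (b - a) / n
    return regular_index(x + half_width, a, b, n)

# Integer data with bin width 1 on [0, 49): the value 1 belongs in bin 1.
a, b, n = 0.0, 49.0, 49
plain = regular_index(1.0, a, b, n)    # 1/49*49 rounds to just below 1.0
shifted = shifted_index(1.0, a, b, n)  # 1.5/49*49 is far away from any edge
print(plain, shifted)
```

The shift only helps when the inputs really are integer-valued: a genuinely fractional input such as 0.7 would be moved into the next bin.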
I don't know, but I don't think the speed penalty would be terrible. It is not hard to implement, I think, so we should really support this and then I can check the benchmarks. Yet another option that I forgot is to use the |
Thanks a million for this fast and very helpful reply. I need to digest this new input a bit. Support for |
Thinking more about it and reading your original message again: is it correct that you only have an issue with the very last bin edge? If so, your need may be resolved if we implement #319. Boost.Histogram by default gives you underflow and overflow bins for each axis, and the interval you specify is then semi-open, [a, b). This makes sense if your input is real-valued and drawn from a continuous distribution. In your case, you probably want a closed interval [a, b]. I have been planning for a while to support this if the user creates a regular axis with the overflow bin turned off. |
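To make the difference between the semi-open and the closed interval concrete, here is a small sketch in plain Python. The `closed_upper` flag is a hypothetical name chosen for this illustration; it is not an existing Boost.Histogram option.

```python
import math

def index(x, a, b, n, closed_upper=False):
    """Bin index for n bins on [a, b); -1 means underflow, n means overflow.

    closed_upper is a hypothetical flag: when set, x == b is counted
    in the last bin instead of the overflow bin.
    """
    if x < a:
        return -1
    if x > b or (x == b and not closed_upper):
        return n
    if x == b:
        return n - 1
    # Clamp so that rounding can never push an in-range value into overflow.
    return min(math.floor((x - a) / (b - a) * n), n - 1)

print(index(10.0, 0.0, 10.0, 5))                     # upper edge, semi-open
print(index(10.0, 0.0, 10.0, 5, closed_upper=True))  # upper edge, closed
```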
Sorry for the delay in feedback. We've just tested #344 and it does solve a corner case for us, thanks for this. However the original problem still remains because we commonly do use overflow bins.
We have not observed the problem with other bins yet. But I am under the impression that this is just by chance. The problem should happen with any bin, when just by chance the raw index position falls by a margin below the lower bound of the "correct" bin. We've discussed this internally and can't make up our minds between the multiple possible workarounds. What I see so far:
|
After a longer discussion we are considering a third option that I wanted to get your feedback on. The best of all worlds for us seems to be an axis that supports both integer and floating point and can automatically switch between the two. It should behave like the integer axis as long as the bounds and the scale are (precisely) integral values. In this case, it correctly bins integer and floating point values by using the precise (and faster) integer arithmetic. Only when any of the relevant parameters (bounds or scale) is set to a floating point value does it automatically switch to floating point arithmetic. This seems to bring a set of nice benefits, and we could not see any downsides (yet). Benefits:
Did we overlook some obstacles here? What do you think? |
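The idea of an axis that switches automatically could look roughly like the following sketch in plain Python. The class `AutoAxis` and its members are hypothetical and only illustrate the switching logic; this is not the implementation discussed later in the thread, and it assumes finite input values.

```python
import math

class AutoAxis:
    """Hypothetical axis: integer arithmetic when bounds and bin width are integral."""

    def __init__(self, n, lower, upper):
        self.n = n
        self.lower = lower
        self.upper = upper
        bounds_integral = float(lower).is_integer() and float(upper).is_integer()
        # The bin width is integral if the bin count divides the span exactly.
        self.integral = bounds_integral and (int(upper) - int(lower)) % n == 0
        if self.integral:
            self.int_lower = int(lower)
            self.int_width = (int(upper) - int(lower)) // n

    def index(self, x):
        if self.integral:
            # All edges are integers, so floor(x) lies in the same bin as x.
            raw = (math.floor(x) - self.int_lower) // self.int_width
        else:
            raw = math.floor((x - self.lower) / (self.upper - self.lower) * self.n)
        # -1 is the underflow bin, n is the overflow bin.
        return max(-1, min(raw, self.n))

exact = AutoAxis(49, 0.0, 49.0)    # integral bounds and width: integer path
fallback = AutoAxis(3, 0.0, 1.0)   # width 1/3: floating point path
print(exact.integral, exact.index(1.0), fallback.integral, fallback.index(0.5))
```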
Hi @HDembinski , sorry to bother you with this question again, but it would be great to get your feedback on the above implementation idea. We've implemented such an axis as a test (an axis that supports both integer and floating point precision, and automatically switches between the two, depending on the user parameters). |
Sorry for not responding earlier. I think as a practical fix, it is fine to just shift the bins by half a bin width if you are still considering this. The axis that you mention, which uses exact integer arithmetic when everything is integral: this is already how floating point numbers with integral values work to my knowledge. Could you perhaps produce a small code example to show the wrong behavior and the expected? |
In any case, if you can make a prototype for such an axis with these properties, I am happy to review and build a PR to include it. That being said, I still think that the clean solution is to support boost::rational in the regular axis. I don't think that the performance hit is terrible, but we have to benchmark it. Thanks for your endurance in trying to work out a solution. |
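As a rough illustration of what rational arithmetic buys, the sketch below uses Python's `fractions.Fraction` as a stand-in for boost::rational. The function names are made up for this example, and the performance of this Python version says nothing about a C++ implementation.

```python
import math
from fractions import Fraction

def float_index(x, a, b, n):
    """Floating point version: normalise to [0, 1), then scale by n."""
    return math.floor((x - a) / (b - a) * n)

def rational_index(x, a, b, n):
    """Exact version: every operand is converted to a rational number first."""
    z = (Fraction(x) - Fraction(a)) / (Fraction(b) - Fraction(a))
    return math.floor(z * n)

# Value 1 on an axis with 49 unit-width bins over [0, 49).
print(float_index(1.0, 0.0, 49.0, 49), rational_index(1.0, 0.0, 49.0, 49))
```

Converting a float to `Fraction` is exact, so the rational version has no rounding step at all; the cost is arbitrary-precision integer arithmetic on every fill.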
Awesome! I'll take a look at it in the evening, thanks a lot already! |
@emmenlau Could you please post a minimal example that shows the bug you are experiencing? What I said earlier, that integer arithmetic with doubles is exact to my knowledge, is only true when no fractions are computed in intermediate results. I would like to understand your issue better. Consider these two mathematically identical calculations of an index from a value:
These two calculations, when calculated with floating point types, are not giving the exact same result even when x, a, b, n are integer values. In my understanding, however, at least the first index1 should give a result identical to a calculation with integer types. I need to test this, though. |
axis::regular currently uses yet another version of this calculation:
This calculation also uses non-integer values as intermediate results, so it is not expected to give the exact same result as an integer calculation even when x, a, b, n are integer values. |
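The formulas themselves did not survive in the text above. The sketch below reuses the names index1, index2 and index3 from the numerical test further down in this thread; that index3 matches the form described for axis::regular is an assumption based on that test. The helper `count_mismatches` is made up for this illustration.

```python
def index1(x, a, b, n):
    # Multiply first, divide last: no fraction appears before the final step.
    return int((x - a) * n / (b - a))

def index2(x, a, b, n):
    # Divide by the bin width.
    return int((x - a) / ((b - a) / n))

def index3(x, a, b, n):
    # Normalise to [0, 1) first, then scale by the bin count.
    return int((x - a) / (b - a) * n)

def count_mismatches(formula, max_bins=100):
    """Count integer inputs that land in the wrong unit-width bin."""
    wrong = 0
    for n in range(1, max_bins + 1):
        for x in range(n):
            if formula(float(x), 0.0, float(n), n) != x:
                wrong += 1
    return wrong

mismatches = [count_mismatches(f) for f in (index1, index2, index3)]
print(mismatches)
```

With unit-width bins starting at zero, only the third form produces a fractional intermediate result, so it is the only one that can round to just below an integer here.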
For reference, numpy.histogram uses a rather expensive algorithm to handle values which end up exactly on bin edges: |
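The quoted NumPy code is missing above. The following sketch shows the general idea of such a correction step: compute a fast index estimate, then compare against the actual bin edges and move the index by one if needed. It is written for illustration and is not copied from NumPy's source; `corrected_indices` is a made-up name, and the inputs are assumed to satisfy a <= x <= b.

```python
import numpy as np

def corrected_indices(x, a, b, n):
    """Fast index estimate followed by a check against the actual edges."""
    edges = np.linspace(a, b, n + 1)
    idx = ((x - a) / (b - a) * n).astype(np.intp)
    idx[idx == n] -= 1                   # x == b goes into the last bin
    idx[x < edges[idx]] -= 1             # estimate landed one bin too high
    too_low = (x >= edges[idx + 1]) & (idx != n - 1)
    idx[too_low] += 1                    # estimate landed one bin too low
    return idx

x = np.arange(49, dtype=float)
result = corrected_indices(x, 0.0, 49.0, 49)
print(np.array_equal(result, np.arange(49)))
```

The extra comparisons and edge lookups are what make this approach more expensive than a single multiply and cast.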
Here is a little numerical test. It looks like the calculation that axis::regular uses is the worst of the aforementioned alternatives. Surprisingly, index2, index4, and index5 work the best. index5 is the best version, because it avoids the division in the hot calculation, which is slower than multiplication.

    import numpy as np

    rng = np.random.default_rng(1)

    for i in range(10):
        info = np.iinfo(np.int32)
        # for large ranges we also get round-off errors in index2
        a = rng.integers(info.min // 2, info.max // 2, size=1000000)
        b = rng.integers(info.min // 2, info.max // 2, size=1000000)
        m = a < b
        a = a[m]
        b = b[m]
        n = b - a  # bin width is 1
        # cast to integer rounds toward 0, not down, so the formulas below
        # are only correct for x >= a
        x = rng.integers(a, b + 1, size=len(a))
        # reference, uses integer calculation
        index0 = (x - a) * n // (b - a)
        x = x.astype(float)
        a = a.astype(float)
        b = b.astype(float)
        index1 = (((x - a) * n) / (b - a)).astype(np.int64)
        index2 = ((x - a) / ((b - a) / n)).astype(np.int64)
        index3 = (((x - a) / ((b - a))) * n).astype(np.int64)
        index4 = ((x - a) * (1.0 / ((b - a) / n))).astype(np.int64)
        index5 = ((x - a) * (n / (b - a))).astype(np.int64)
        m = [
            index0 != index1,
            index0 != index2,
            index0 != index3,
            index0 != index4,
            index0 != index5,
        ]
        for j, mj in enumerate(m):
            if np.any(mj):
                print(f"index{j+1}", x[mj][0], a[mj][0], b[mj][0])

This returns:
|
That index2, index4 and index5 give the same answers is not a surprise, since by construction (b-a) / n == 1.0. |
@HDembinski thanks for all your input, here is an example of the problem we are experiencing: godbolt. |
@HDembinski your input and discussion are very enlightening, thanks! As @vakokako shows, the problem for us typically arises when using the axis with larger integral bounds, which cause small rounding errors. If a bin in the middle of the histogram is affected (as in @vakokako's example), the problem is not so visible. But in other, more common cases, the last bin can be affected, and it will remain completely empty. This typically causes some confusion among our users. Is this a suitable motivation for the issue? When looking into this further, @vakokako found that we can use the more precise integral arithmetic when:
|
Here is our solution axis to this: godbolt. This is achieved by adding a bool flag Example of axes where
Example of axes where
The additional members for lower/upper bounds are not necessary; they are there only because we wanted to keep the dependency on |
Hi, I was sent here by @henryiii from the above issue, and just wanted to ask whether you anticipate being able to implement one of the fixes discussed here soon. My issue comes from integer values in the middle of the distribution falling on the wrong side of integer bin edges, so I think either the solution with boost::rational or a change in the calculation as you discussed should fix this. |
Hi @Dominic-Stafford! We have an implementation based on what you find in the godbolt link above and it works quite well for us; all our problems with precision are gone. However, @HDembinski is much more knowledgeable about the possible pros and cons with respect to, e.g., the solution with |
Ok, I think I get it. To fix this, we would indeed have to copy numpy's slow but exact algorithm to sort values into the right bins. |
I also think that boost::rational would be slower, because one needs to spend more CPU cycles to do the same work. Your solution should be pretty fast since |
Thanks for the thumbs up! I should say that all the nice work was done by @vakokako .
From a brief look at the code I agree that may not be required. Possibly it's just an artifact from a previous test. I'll check with @vakokako and get back to you today! |
I did the check, and it turns out that the implementation of value() also needs to be modified. The version that I use ensures that the upper edge of the last bin is equal to the upper edge that the user supplied. The calculation that @vakokako uses only guarantees that for bin widths that are integral values, but in that case it gives correct bin edges, while my version does not. |
Thanks a lot @HDembinski for the quick check, now I recall the same from our discussions! Let me kindly ask, do you think something like this (our custom axis with implicit integral support) could be included into the official library? We would be more than happy to provide a reasonable PR. Or if you can do something based on our work, please feel free to copy and/or use the code at godbolt to your liking. Consider it to be public domain, or any license that suits your needs. The axis is nice because at little extra code, and virtually no runtime overhead, it transparently supports the integral (precise) case but also the floating cases. No conscious action from the user required. |
Yes, I plan to finally fix this and I will use your research. I am very grateful for the offer of a PR but I think I need to use this opportunity to also refactor regular which became quite complex with all the optional functionality. Good design is an iterative process which I can do more quickly on my own. |
Thanks a lot for your consideration and help! I'm more than happy about this approach! |
Thanks @HDembinski and @emmenlau! It would be really nice to have a solution in the official library since we're using this via the python bindings, but I agree it's worth taking the time to settle on a good solution. |
Dear @HDembinski, I just wanted to kindly promote this issue and your nice suggestion to implement this in a future version of the library. We would be delighted to have a more "official" support for this feature, which we use a lot. |
Thanks for pinging me on this. I want to fix this, but I am not sure what is the best way. |
The use case you have is covered by I cannot just copy your implementation for the issue that you supplied via godbolt, because using an I started working on that feature a while ago but got distracted by other projects. |
I think the precision problems discussed in this issue may be addressed by PR draft #386. I added a unit test based on @vakokako's example that demonstrates exact
This method is cheap, as it does not correct for numerical errors. Could this address some of the precision problems raised in this issue? Note: this method is only exact when numbers take few floating point bits to represent. For example, it will be exact for a bin width of 1/2 = 0.5, but not for 1/3 = 0.333333333333333... |
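Whether a given bin width is exactly representable can be checked with a short sketch like the one below, which uses Python's `fractions.Fraction`; the helper name `is_exact` is made up for this illustration.

```python
from fractions import Fraction

def is_exact(numerator, denominator):
    """True if numerator/denominator survives conversion to a double unchanged."""
    return Fraction(numerator / denominator) == Fraction(numerator, denominator)

for num, den in [(1, 2), (3, 4), (1, 3), (1, 10)]:
    print(f"{num}/{den}", is_exact(num, den))
```

Widths whose denominator is a power of two, such as 1/2 or 3/4, pass the check; widths like 1/3 or 1/10 do not, so edges built from them carry a rounding error.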
Dear @HDembinski and other authors, thanks for this awesome library. We have just started using it and the speed and versatility are impressive! We have a problem that manifests as a precision problem, but also raises a more general question.

The problem: We use the floating point histogram to bin any kind of data. When we bin data of integer precision (gray value camera images), we set the lower and upper bounds based on the data range to [min, max+1] and the number of bins to upper bound - lower bound. However, there are sometimes cases where the numeric precision then leads to an empty last bin. This is due to rounding error: the values fall into max-1 instead of max. We checked, and the bin index is then computed as (max-1).999..., which finally resolves to max-1. Is there a good solution to this? The empty last bin leads to confusion, and it is also a bit problematic when writing tests.

The underlying question: We are not completely certain what the "recommended" way is to set histogram bounds for integral data. Are there common guidelines? In floating point precision it is more understandable what data range is covered by each bin. But in integral space, it seems the value 0 could be best represented by bin [-0.5, 0.5) rather than the more obvious [0, 1). Any help, links or pointers to documentation would be highly appreciated!