Statistical assessment of spatial distributions #12835
## Objective

In #12484, the question arose of how we would actually go about testing the point-sampling methods introduced there. This PR introduces statistical tools for assessing the quality of spatial distributions in general and, in particular, of the `ShapeSample` implementations that presently exist.

### Background and approach
A uniform probability distribution is one where the probability density is proportional to the area — that is, for any given region, the probability of a sample being drawn from that region is equal to the proportion of the total area that region occupies.
It follows from this that, if one discretizes the sample space by partitioning it into labeled regions and assigning to each sample the label of the region it falls into, the discrete probability distribution sampled from the labels is a multinomial distribution with probabilities given by the proportions of the total area taken by each region of the partition.
Given, then, some probability distribution which is supposed to be uniform on some region, we can attempt to assess its uniformity by discretizing — as described above — and then performing statistical analysis of the resulting discrete distribution using Pearson's chi-squared test. The point is that, if the distribution exhibits some bias, it might be detected in the discrete distribution, which will fail to conform adequately to the associated multinomial density.
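As a minimal illustration of this discretize-and-test procedure (a sketch, not code from this PR; a tiny LCG stands in for a real RNG so the example is self-contained), one can bin samples from the unit square into its four quadrants and compute the Pearson statistic directly:

```rust
// Illustrative sketch only: bin points sampled from the unit square into
// its four quadrants and compute Pearson's chi-squared statistic against
// the uniform expectation.

/// Pearson's chi-squared statistic for equal expected counts per bin.
fn pearson_chi2(observed: &[u64], expected: f64) -> f64 {
    observed
        .iter()
        .map(|&c| {
            let d = c as f64 - expected;
            d * d / expected
        })
        .sum()
}

/// Minimal linear congruential generator producing floats in [0, 1).
fn next_f64(state: &mut u64) -> f64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 11) as f64 / (1u64 << 53) as f64
}

fn main() {
    let mut state = 0x853c49e6748fea9b_u64;
    let n = 100_000u64;
    let mut counts = [0u64; 4];
    for _ in 0..n {
        let (x, y) = (next_f64(&mut state), next_f64(&mut state));
        // Label each sample with the quadrant it falls into.
        counts[(x < 0.5) as usize + 2 * ((y < 0.5) as usize)] += 1;
    }
    // Under uniformity each quadrant has probability 1/4, so the statistic
    // is approximately chi-squared distributed with 4 - 1 = 3 degrees of
    // freedom; large values indicate bias.
    let chi2 = pearson_chi2(&counts, n as f64 / 4.0);
    println!("chi-squared = {chi2:.3}");
}
```

A biased sampler would inflate one quadrant's count relative to the others, driving the statistic far above the range expected for 3 degrees of freedom.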
## Solution

This branch contains a small library that supports this process in a few parts:

- A `Binned` trait models the discretization process for arbitrary spatial distributions, but provides no metadata about what the associated multinomial densities should be; that is supplied by a second, companion trait.
- Next, an `N`-dimensional histogram type is used to actually aggregate samples for the purposes of comparison.
- Finally, chi-squared analysis functions take these histograms (or their projections) as input and produce the actual chi-squared values.
Presently, the actual testing implemented by this branch includes `Binned` implementations for the interiors and boundaries of `Circle` and `Sphere`. Two wrapper types, `InteriorOf<T>` and `BoundaryOf<T>`, have been introduced for implementors of `ShapeSample`, with the purpose of allowing the constituent sampling methods to be used directly as `Distribution`s. This adds modularity; the library itself also operates at the level of `Distribution`s.

## Changelog
- Moved `shape_sampling.rs` into a new `sampling` submodule of `bevy_math` that holds all of the `rand` dependencies.
- Added `InteriorOf<T>` and `BoundaryOf<T>`, which allow conversion of `ShapeSample` implementors into `Distribution`s.

## Discussion
### Caveat emptor

The statistical tests in `sampling/statistical_tests/impls.rs` are marked `#[ignore]` so that they do not run in CI testing. They must never, ever, ever run in CI testing. The purpose of these statistical tests is that they reliably fail when something is wrong, not that they always succeed when everything is fine.

Presently, the alpha level of each individual test is .001, meaning that each constituent check fails 1/1000 of the time even when the code is correct; with the current volume of tests, this means that about 1% of the time, a failure would occur even if everything were perfect.
On the other hand, chi-squared error has the property that it grows with sample size for mismatched distributions, while remaining constant for matched ones. That is to say: statistical biases in the output should lead to the tests failing quite reliably, meaning they do not need to be run particularly often. We can use very large sample sizes to ensure this if need be.
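This growth can be made precise with a standard result about Pearson's statistic: if the true bin probabilities are $p_i$ but the test assumes proportions $q_i$ over $k$ bins, then with $N$ samples the statistic is approximately noncentral chi-squared, with

$$\mathbb{E}[\chi^2] \approx (k - 1) + N \sum_{i=1}^{k} \frac{(p_i - q_i)^2}{q_i}.$$

When $p = q$, the second term vanishes and the expectation stays near $k - 1$ regardless of sample size; under any mismatch, it grows linearly in $N$, so a large enough sample makes detection essentially certain.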
Personally, I am not sure what the best way of using these tests would be other than running them manually. Presently, this can be done as follows:
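For example (the package name here is an assumption, not taken verbatim from the PR), Cargo's test harness can be asked to run only the `#[ignore]`-marked tests:

```shell
# Run only the ignored (statistical) tests; `-p bevy_math` is assumed.
cargo test -p bevy_math -- --ignored
```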
### What?

I'm sure this looks like building a death ray to kill an ant. In a sense, it is. Frankly, I made this not because I wanted to (not that I didn't enjoy myself), but because I couldn't think of any other way to externally assess the quality of our sampling code that was actually meaningful. For example, using a fixed-seed RNG and comparing output to known values doesn't really demonstrate anything (and, in fact, breaks spuriously when the code is refactored).