Dimension statistics #19
Require this, or make it optional?
+1 to include statistics in COPC. It is just a few bytes extra in the output file and the computation overhead for the writer is negligible too, yet it makes things much easier for readers like QGIS: no need to run heuristics to estimate the range of values for dimensions.
+1. I'd propose to make these dimension statistics applicable to a voxel/page as well, so you get virtual points. With a quadtree layout these would be comparable to overviews in COGs. These overviews are quite useful for quick visualisation, or to be ignored in certain analyses. Probably of interest to @m-schuetz if used in progressive rendering. edit: I'd say these statistics could be required on a file level (like we already store the min/max of the coordinates) and optional when applied to the remaining voxels, but if applied, they need to be present for all voxels in the hierarchy.
@evetion Could you please provide some specification language or a struct definition that implements voxel stats? Could you provide some concrete use scenarios for these statistics? This seems more like a 'feature' than a 'need'. My counterpoints:
It is a design goal of COPC to be as simple as possible, have only what is needed for clients to be able to operate, and not to lard the spec up with features that most end consumers of COPC will not end up using. I'm not saying that voxel stats is fatback, but the case hasn't been made yet.
Thanks for the feedback. As a developer I can appreciate a lean standard instead of yet another VLR/compatibility/check.
The whole point of these Cloud Optimized formats is to selectively read parts of interest instead of needing to download the whole file. The current draft only allows those selections based on the spatial (XYZ) hierarchy.
Let's say we store only the bounds (min/max) of each dimension for each voxel, this would be the size of two points. Doing the same for parent voxels in an octree will roughly add 33%, say 3 points. As long as the voxels store roughly thousand(s) of points, which is common, this would be less than a percent compared to the uncompressed data and 3% when compared to compressed data (assuming laszip results in 10% of original size). Note that COGs require overviews, taking a 33% size increase for granted.
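The back-of-envelope estimate above can be checked in a few lines. This sketch uses the same assumptions as the comment (bounds per voxel cost about two point records, parents add roughly a third, laszip output is ~10% of raw):

```python
def stats_overhead(points_per_voxel, parent_factor=1/3, laszip_ratio=0.10):
    """Relative size cost of per-voxel min/max stats.

    Per-dimension bounds cost two doubles, so bounds for every dimension
    cost roughly two point records per voxel; parent voxels in the
    hierarchy add roughly another third (the figure used above).
    """
    stats_points = 2 * (1 + parent_factor)          # ~2.7 "virtual points" per leaf voxel
    vs_uncompressed = stats_points / points_per_voxel
    vs_compressed = vs_uncompressed / laszip_ratio  # laszip assumed ~10% of raw size
    return vs_uncompressed, vs_compressed

# With ~1000 points per voxel this comes out under 0.3% of raw size and
# roughly 2.7% of the compressed size, matching the "<1%" and "~3%" above.
```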
These statistics can (or at least should) be aggregated from the leaf voxels with points and up from there.
Let's make it required then.
It could follow the example by @connormanning, but then with multiple consecutive objects in the same order as the pages and their entries. Since variable-length records would require offsets, a fixed-size layout keeps things simple:

```c
struct DimensionBounds
{
    double minimum;
    double maximum;
};

struct EntryStats
{
    DimensionBounds bounds[number_of_dimensions_in_pointformat];
    // VoxelKey key; implicit by using the same order as the entries
};
```

With the assumption that readers will read the complete octree into memory (or at least until they find the voxels of interest), no explicit keys or offsets are needed.
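Because every such stats record has the same fixed size, a reader can locate any entry's stats by arithmetic alone, without an offset table. A minimal sketch of that lookup (the layout and names follow the struct idea above and are assumptions, not spec language):

```python
import struct

BOUNDS_FMT = "<dd"  # minimum, maximum as little-endian doubles (assumed)

def entry_stats_offset(entry_index, num_dims):
    """Byte offset of the i-th entry's stats within a stats payload."""
    return entry_index * num_dims * struct.calcsize(BOUNDS_FMT)

def read_entry_stats(payload, entry_index, num_dims):
    """Return [(min, max), ...] per dimension for one hierarchy entry."""
    off = entry_stats_offset(entry_index, num_dims)
    size = struct.calcsize(BOUNDS_FMT)
    return [struct.unpack_from(BOUNDS_FMT, payload, off + k * size)
            for k in range(num_dims)]

# Example payload: 2 entries x 2 dimensions.
payload = struct.pack("<8d", 0.0, 1.0, -5.0, 5.0,   # entry 0
                      2.0, 3.0, 10.0, 20.0)          # entry 1
```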
@connormanning I absolutely do see the use case for a
With recent years' baby steps towards support for arbitrary attributes in formats and viewers, statistical data can be handy to provide useful defaults for the visualization of attributes with unknown semantics or ranges. I'm already using min and max in Potree for the gradient ranges of attributes, but that approach frequently fails if there are outliers. For example, using min and max for intensity often breaks if the majority of samples are between 0 and 1,000, but a couple of samples with values over 10,000 screw up the range.
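A common workaround for the outlier problem described above is to clamp the gradient range to percentiles rather than the raw min/max. A sketch (the 2nd/98th percentiles are an arbitrary illustrative choice, not anything from the proposal):

```python
def robust_range(values, lo=0.02, hi=0.98):
    """Percentile-based range: ignores extreme outliers that would
    otherwise stretch a min/max-based color gradient."""
    s = sorted(values)
    n = len(s)
    return s[int(lo * (n - 1))], s[int(hi * (n - 1))]

# Intensity mostly in 0..999 with a few outliers above 10,000:
samples = list(range(0, 1000)) + [15000, 20000]
# min/max would give (0, 20000); robust_range(samples) stays near the bulk
```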
What does
It would be a very useful optional hint, especially in order to distinguish between scalars and enums/integers/flags. I was planning to provide materials for several standard mappings from attributes to colors such as:
Perhaps there are some more that might be useful. |
This might also be very useful for gps-time. The sparseness between flight-lines makes it a bit tough to implement useful gps-time filters via sliders. If some empty gps ranges between consecutive flight-lines can be identified with the counts, then the slider could ignore those. It wouldn't be perfect since it likely misses many gaps, but it may or may not help a bit.
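The slider idea above can be sketched with binned counts: given per-bin point counts over the gps-time range, runs of empty bins mark gaps between flight lines that a slider could skip. The bin layout and counts here are made-up illustrative values:

```python
def empty_ranges(bin_counts, t_min, t_max):
    """Yield (start, end) gps-time ranges covered only by empty bins."""
    n = len(bin_counts)
    width = (t_max - t_min) / n
    start = None
    for i, count in enumerate(bin_counts):
        if count == 0 and start is None:
            start = t_min + i * width          # gap begins
        elif count != 0 and start is not None:
            yield (start, t_min + i * width)   # gap ends
            start = None
    if start is not None:
        yield (start, t_max)                   # trailing gap

# Two flight lines with a quiet period in between:
counts = [40, 55, 0, 0, 0, 60, 30, 0]
```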
A fair point, but are LAS files often queried and segmented by dimensions other than XYZ? Maybe a little, but I'm not convinced that the complication of voxel statistics on the writer side is worth the cost to everything else, when only a few really sophisticated clients can benefit from it. We are not building the "ultimate cloud friendly point cloud format" here. We are building "a LAZ file that can behave reasonably for incremental, spatially segmented remote access".
Personally, I have not had an issue using EPT without additional statistics w.r.t. Potree and other libs. The one exception was gps-time: when implementing my Potree EPT reader I had to store gps-time min/max in the file metadata.
It feels like there's support for a VLR describing statistics about the entire domain of points (not per-voxel) that is required. I have some questions about its composition:
Any other questions? Rendering-type folks, please chime in.
+1 |
On Fri, Sep 3, 2021 at 4:00 PM Guilhem Villemin wrote:

> +1
> Stats can be useful to estimate how many points you'll get for a given spatial request, or how they could be distributed. Having more stats and more info might look cool, but I can't really see how you could exploit them for point queries, as you will still have to get all the points in the selected tiles.

The point counts for cells are already available, so I'm not sure what you're suggesting here.

-- Andrew Bell
Maybe I got mixed up by the above discussion about stats per cell, for which I can't clearly see how you could use them efficiently. Having global stats can give hints for rendering, but maybe you'll need a more specific per-dimension description. I guess that's what @connormanning intended. IMHO @hobu is right: you'll need a specific description for the kind of data stored in a given dimension. Buckets work fine for enumerations; a histogram could work with discrete/continuous data to get an approximate representation of the distribution. EDIT: a histogram is just a collection of buckets.
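The bucket/histogram distinction above can be sketched briefly: for enumerations the buckets are the values themselves, while continuous data needs chosen bin edges. Nothing here is spec language, just an illustration:

```python
from collections import Counter

def enum_buckets(values):
    """Exact per-value counts: suitable for Classification-like enums."""
    return dict(Counter(values))

def histogram(values, edges):
    """Counts per half-open bin [edges[i], edges[i+1]) for continuous data."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts
```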
I am happy with Connor's original proposal. One of the things EPT lacked was codified statistics. As I'm implementing Potree support for COPC, the one thing we definitely need is GPS Time stats, and as Markus said, the stats would be helpful for other dimensions as well.
I think this is the best way to go about this within LAS. Would it be better to store all the CopcStatistics objects in one VLR?
I would say yes, so that the consumer doesn't have to compute which statistics are present and which aren't.
For simplicity's sake, I would again say yes. It's too hard within the LAS spec to pick and choose dimensions.
Assuming we store statistics for every dimension in order, I think this should be implied based on the knowledge of the header format.
Not sure about this one. @hobu @abellgithub, let me know your thoughts on adding this to the spec so we can move forward with the Potree side of things.
We talked about this a lot over the last day or two. I think we settled on pretty much this original proposal, with stats entries being required for all dimensions in order, including bit fields (e.g. "Classification Flags" is split out into its 4 constituent fields), and importantly removing the per-value counts.

While I still see a lot of utility in having these for Classification, it a) complicates the core specification and b) is a half-measure for the functionality you'd really want for binned data counts. For example, for GpsTime or Intensity you probably want to be able to bin histogram data.

The other part of the discussion was "which dimensions are required to have statistics (and do bit-fields get entries as well)?". We eventually came to "all of them (and yes)". We talked about perhaps requiring only a subset of the ones that "make sense", because as asked above, what exactly does the mean of an enum mean, or the variance of a bit field? However, sometimes the seemingly silly values can be useful.
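Splitting a packed flags field into its constituent entries can be sketched as follows (the bit positions are the LAS 1.4 Classification Flags assignments for point formats 6-10):

```python
def split_classification_flags(flags_byte):
    """Split the LAS 1.4 Classification Flags nibble into its 4 bit fields,
    each of which would get its own stats entry."""
    return {
        "Synthetic": (flags_byte >> 0) & 1,
        "Key-Point": (flags_byte >> 1) & 1,
        "Withheld":  (flags_byte >> 2) & 1,
        "Overlap":   (flags_byte >> 3) & 1,
    }

# A point flagged Synthetic and Overlap is encoded as 0b1001.
```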
Oh and also, remove |
Agreed with Connor's suggestions! As for the use case for visualization in QGIS:
Just for reference, these are the options offered in QGIS for the min/max range when visualizing raster layers. It would be nice to have the same choices for point clouds as well. With the suggested options we would only miss the percentile-based min/max range, which helps in situations with outliers like Markus has mentioned. But different people may want to use different percentiles (90? 95? 98? 99?), which would require more granular entries, so it makes sense to skip all that in the initial spec and possibly introduce some optional advanced dimension stats later if really needed.
@connormanning Sounds good to me! Would we be able to get a draft into the spec, and I can go ahead and start implementing it within copclib/potree? We'll need access to it within copc.js as well, if you don't mind updating that.
+1 on min/max statistics for the entire domain of points. I don't think stddev, mean and variance are used much. |
Nowadays the Entwine builder adds detailed dimension statistics to its schema (see here for an example), including minimum, maximum, mean, stddev, and variance. Currently this is sort of an undocumented extension, intended to eventually be codified as an optional (in order to be backward-compatible) extension to the EPT specification. Would it be worth specifying a statistics VLR to capture this information? I think the "number of points by return" array in the LAS 1.4 header adds precedent for this kind of thing.
An example might be:
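As a hedged illustration only (the field names, ordering, and little-endian double layout are assumptions matching the Entwine stats mentioned above, not spec language), one per-dimension record might look like:

```python
import struct

# Hypothetical per-dimension statistics record: five little-endian doubles.
STATS_FMT = "<5d"  # minimum, maximum, mean, stddev, variance

def pack_stats(minimum, maximum, mean, stddev, variance):
    """Serialize one per-dimension statistics record."""
    return struct.pack(STATS_FMT, minimum, maximum, mean, stddev, variance)

def unpack_stats(buf, offset=0):
    """Deserialize one record back into named fields."""
    keys = ("minimum", "maximum", "mean", "stddev", "variance")
    return dict(zip(keys, struct.unpack_from(STATS_FMT, buf, offset)))
```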
These statistics would then be stored in the order that the dimensions appear in the point data record format header, followed by statistics for extra-bytes dimensions in the order that they appear. Is there enough demand for this type of information to put it in the spec? Enough to require it?
One example of their usage I found is wonder-sk/point-cloud-experiments#60: "QGIS implementation is able to read those stats and use them to set up renderers (min/max values are especially useful to correctly set ranges for rendering)".
If anyone has other concrete use-cases where such data would be used, that would be useful.