New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate lat/lon BKD and spatial3d [LUCENE-6699] #7757
Comments
Karl Wright (@DaddyWri) (migrated from JIRA) Hi Mike, The coordinates you'd need to store for geospatial3d are: (x,y,z), rather than (lat,lon,z). The values are floating-point numbers that range between -1.0 and 1.0 for a sphere, and slightly more than that in abs value for a WGS84 ellipsoid. There was a ticket where I attached code for a packing scheme that would ram all three values into a 64-bit long; I'll see if I can find it. That packing scheme basically gave you resolution of about a meter. However, as you have pointed out, it's not strictly necessary to stick to 64-bit longs either, so you're free to propose anything that makes sense. ;-) |
Karl Wright (@DaddyWri) (migrated from JIRA)
@mikemccand Ok, so now you have me confused a bit as to what your requirements are for BKD. If you want to split the globe up into lat/lon rectangles, and use BKD that way to descend, then obviously you'd need points to be stored in lat/lon. But that would make less sense for geospatial3d, because what you're really trying to do is assess membership in a shape, or distance also in regards to a shape, both of which require (x,y,z) not lat/lon. Yes, you can convert to (x,y,z) from lat/lon, but the conversion is relatively expensive. Instead, I could imagine just staying natively in (x,y,z), and doing your splits in that space, e.g. split in x, then in y, then in z. So you'd have a GeoPoint3D which would pack (x,y,z) in a format you could rapidly extract, and a splitting algorithm that would use the known ranges for these values. Does that make sense to you? Would that work with BKD? |
Michael McCandless (@mikemccand) (migrated from JIRA) OK, thanks @DaddyWri, I think I understand. Yes BKD tree can easily split on the 3 (x,y,z) dimensions ... I just need to generalize it (it's currently "hardwired" for just 2 dimensions). So for this to work, I need to be able to efficiently ask the query shape whether it separately overlaps with a range in each of the 3 dimensions. E.g. "do you intersect 0.3 <= x < 0.7", and same for y and z. I assume this is fast/simple to do with the spatial3d APIs? And then of course with each indexed x,y,z point I also need to test whether the shape includes it. Hmm, one source of efficiency for the BKD tree is it can recognize at search time when a given indexed cell is fully contained by the query shape, and then it doesn't need to test every point against the shape (it just blindly collects all docIDs in that cell). But in the 3D case, I think that optimization would never apply? I.e, the leaf cells would all be little cubes on the earth's surface, which include a volume (above and below earth's surface) that the shape does not accept? |
Karl Wright (@DaddyWri) (migrated from JIRA)
Hmm. There's a "bounds()" method which obtains the lat/lon bounds of a shape. But for 3d we'd not be able to use that. However:
Well, IF we presume that the records all lie on the surface of the world, then you can ask the question, "do these x, y, z bounds, when intersected with the world surface, all lie within the shape, or overlap the shape?" I'll have to think about how efficient that can be made to work, but you ARE after all describing a second surface shape by your x-y-z ranges, so in theory this should be doable. In particular, the GeoBBox construct allows you to find the relationship between a specific kind of surface shape (currently only lat/lon rectangles and their degenerate 3d equivalents) and any other GeoShape. We could try to introduce x-y-z 3d rectangles as shapes; they'd be completely describable by planes, so really all the same logic would apply. I would need to create a new GeoAreaFactory method, and a family of x-y-z bounding shapes, e.g. GeoXYZArea, and it should all work. To be sure, have a look at the functionality that GeoArea gives you. If that looks like what you need, and you can convince yourself that it would be reasonably efficient as far as BKD is concerned, then I'll go ahead and create a patch that does what you need. Thinking in more detail about BKD efficiency, I realize that GeoXYZArea could describe a huge number of 3D rectangles that have NO intersection with the world surface. That's worrisome because it implies that BKD over the (x,y,z) space may not be a very good way of organizing records? Or does it? |
Karl Wright (@DaddyWri) (migrated from JIRA) More analysis: (1) Your BKD search would need to construct a GeoArea object each time it descends a level. XYZArea Note: these are not named "Geo" objects, because they do not have anything to do with the surface of the world. |
Michael McCandless (@mikemccand) (migrated from JIRA)
That's exactly it. Essentially the BKD tree needs a way to recursively chop up the earth surface into smaller and smaller curved-rectangle-like (I think?) shapes on the earth's surface, both at indexing time and at search time. At search time, for a given cell, it needs to "relate" to the query shape to know if the query shape full contains the cell, does not overlap with the cell, or partially overlaps. But I don't understand the "degenerate in X/Y" Area objects... it seems like GeoArea would be used for the lat/lon "approximation" (outer bounding box?) to a GeoShape? Seems like it's better if we can do all functions using proper earth-surface shapes? Sorry this is still all very new to me :) |
Karl Wright (@DaddyWri) (migrated from JIRA)
With x/y/z splitting, actually you're not quite doing that. Instead, you are chopping up the space that the entire world lives in (not just its surface). One these 3d rectangles may or may not actually intersect the surface (which is where all the geo shapes actually lie). If it does intersect, it might intersect on only one side of the world, or it might intersect on two sides of the world. A long, thin 3d rectangle could well encompass a little bit of territory in (say) the UK as well as Alaska, for instance. But as you describe the algorithm, I'm not sure that this is important at all to know.
The GeoArea interface gives you all of that, which is why I wanted to implement objects that aren't limited to the surface but do implement GeoArea.
GeoArea objects are not constrained to be surface objects. Right now the only implementers of GeoArea are bounding boxes, bounded in latitude and longitude, but that's merely due to lack of need for anything else. The relationship types GeoArea objects can determine are against general GeoShape objects (which are surface objects), so the semantics are perfect for what you are trying to do. You can determine whether a lat/lon rectangle overlaps, contains, is within, or doesn't intersect at all with, any arbitrary spatial3d surface shape. But let's be clear: for BKD in the x,y,z space, GeoArea objects bounding in lat/lon are useless. So we need to invent GeoArea objects that represent 3d rectangles. If we try to talk about a general XYZ-bounded area, then there will be up to two bounds for X (min and max), two bounds for Y (min and max), and two bounds for Z (min and max). The degenerate cases come into play when min-X == max-X, or min-Y == max-Y, etc, or when there is no min-X bound but just a max-X one, for instance. This can be conditional logic but the whole thing is faster and more efficient if there's an object for each funky case. Hope this helps. @mikemccand If this all is clear now, and you want to proceed, let me know and I can start working on it tonight. |
Nick Knize (@nknize) (migrated from JIRA) I had thrown this code together a while ago before I put the GeoPointField / Geo3d integration work on hold. This rough draft converts 3D LLA coordinates into ECEF cartesian coordinates using the WGS84 based non-iterative approach in GeoProjectionUtils (derived from the conversion approach illustrated in the Manual of Photogrammetry using 2 sin/cos and 1 sqrt). The ECEF cartesian coordinate is scaled to a unit spheroid (since this is presumably what Geo3D requires) and each of the 32 bits are interleaved (XYZ) akin to MortonEncoding. Decoding procedures are also provided. While this encoding is not nicely represented as a long (not yet convinced 21 bits per will preserve "acceptable" precision, though we could allocate bits differently since larger altitudes degrade w/ conversion) it is BinaryDocValue friendly and enables a space partitioning/prefix coded approach similar to the way GeoPointField currently works. The Most-Significant 3 bits represent an Oct-Cell at the first level, next 3 for level 2, etc. |
Michael McCandless (@mikemccand) (migrated from JIRA)
Right, but, that approach seems wasteful? I mean it should work correctly, but as you said some cells will span strange parts of the surface. It should work fine but seems inefficient ... Like, if there is a way instead to make smaller and smaller surface shapes in the BKD tree that would be best? I.e. fix BKD to operate entirely on the surface ...
Thank you for being so eager and volunteering here but maybe wait until we figure out the likely best approach here! I am still a newbie/lost in the details... I lack intuition. |
Karl Wright (@DaddyWri) (migrated from JIRA)
So, if you turn the BKD descent into something other than (x,y,z), it would imply that you store points for records in something other than (x,y,z) too, no? I am unsure whether the additional cost of representing records with, say, lat/lon, and doing the conversion to (x,y,z) at search time for every records encountered would be more or less expensive than having a somewhat wasteful representation for the tree itself. I expect the conversion for every record would turn out to be more expensive, but maybe this is something we'd just have to try. The obvious alternative would be to store just lat/lon, exactly like is done for Nicholas's code, and use the same BKD tree. The only difference would be: (a) you'd need to construct a GeoBBox, using GeoBBoxFactory, representing your area, and then use getRelationship() to determine intersection, contains, within, or disjoint for the shape, and (b) you'd have to create a GeoPoint at search time for every record encountered. |
Nick Knize (@nknize) (migrated from JIRA)
Just to clarify, so there's no confusion, the attached code converts from lat/lon to ECEF (either full range in meters, or scaled to the unit spheroid) and stores the result in a 96 bit BitSet. GeoPointField stores encoded lat/lon (if that's what you were referring to). The attached was intended to be used for a Geo3d integration w/ GeoPointField.
Just to note again, the conversion from lat/lon to ECEF XYZ is a non-iterative conversion (2 sines, 2 cosines, 1 sqrt). Conversion cost vs. wateful representation is the interesting question, so I threw together a quick encoding/decoding benchmark (on my i7 16GB System76) and ran it on 100M points (converting from lla to ECEF and back). The law of large numbers took over at around 35M. For interest the results are provided:
Again, those are both conversions to ECEF and back to LLA. So roughly 1 minute for 100M points. Halve those numbers and you have the cost for converting either direction. Next we could benchmark tree access and compare? I suspect traversal cost should be a function of the on-disk representation and chosen split method (that's where RTrees tend to become costly). Since BKD splits a sorted space, and it can exploit file system cache? I suspect there aren't many random seeks? Seems benchmarking might be relatively straightforward? |
Karl Wright (@DaddyWri) (migrated from JIRA)
Right, but you'd be comparing 2 sines, 2 cosines, and 1 sqrt against only the cost of unpacking, which for 1 billion records would be a noticeable amount of time.
Many Lucene systems (including ours) store the index in memory, using memory mapping, and SSDs would be expected too even when not, so I'd think a benchmarking should not overemphasize seeks as being costly. But Mike is the expert as far as that is concerned. |
Nick Knize (@nknize) (migrated from JIRA)
Encoding/Decoding ECEF into 96 Bits:
Compared to packing/unpacking lat/lon into 64 bits using using GeoPointField morton bit twiddling:
So using the 3D BitSet approach is 10 times longer, with the obvious culprit being the for loop for each bit. This can be optimized, though, using a 3-way bit twiddle and 2 longs if a 64 bit 3D packing yields unacceptable loss of precision.
Maybe not. But it does depend on the on-disk representation of the tree (and I typically don't use SSDs as an excuse for not paying attention to good file layout). I was mentioning this in the context of index size as a function of a "wasteful" vs. efficient encoding. |
Karl Wright (@DaddyWri) (migrated from JIRA) If the decision is made to use (x,y,z) encoding rather than (lat,lon), then FWIW I think the right way forward with spatial3d would be to extend the current Bounds infrastructure to allow the finding of x,y,z bounds for a shape, rather than just lat/lon. I've thought about this for a day and have concluded it would be far and away the fastest solution, since computing the Bounds for any shape would need to be done only once. Once x, y, and z bounds are known for a shape, the BKD code itself can make most of its decisions itself, without much computation at all, for most of the recursive steps. Only when descending within the computed bounds would it be necessary to construct a GeoArea XYZArea object to determine the exact relationship between a 3d rectangle and the specified GeoShape. |
Michael McCandless (@mikemccand) (migrated from JIRA) I can't follow all the spatial-heavy discussion here, but:
BKD tree's "tree" (the internal nodes) is already forced into heap, at least in the current impl. So walking that tree should always be fast, and then there's only seeking once we hit the leaf nodes. But that seeking should be sequential walk through the file...
I think the encoding for each point's value can be different from what the BKD tree uses? I suspect the points shoudl use (x,y,z) encoding when stored in doc-values per document, because when we need to filter hits for the BKD leaf cells overlapping the shape's boundary, we want to take advantage of the fast math in (x,y,z) space that spatial3d gives us? But then, for the inner-nodes (split values) for the BKD tree, we could use either (lat,lon) or (x,y,z), whichever gives us fastest shape relation computations? At each step in the recursion, BKD tree must check a split in one dimension against the query shape. |
Karl Wright (@DaddyWri) (migrated from JIRA) Patch to geo3d creating an XYZArea object, which can be used to determine intersection with any GeoShape. This is a preliminary patch; I will be including unit tests as well as a more general class of XYZArea degenerate objects if that should prove necessary. |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks @DaddyWri, I'll try to generalize BKD to 3 dimensions soon ... |
Karl Wright (@DaddyWri) (migrated from JIRA) @mikemccand: Turns out that my class is having problems properly relating to GeoWorld. I'm going to have to make some modifications to make things play nicely. Stay tuned. |
Karl Wright (@DaddyWri) (migrated from JIRA) Revised patch that fixes some of the corner cases, especially interactions with the GeoWorld shape. |
Michael McCandless (@mikemccand) (migrated from JIRA) Initial dirty patch, but its testBasic and testRandom seem to be passing! I just made a 3D version of BKD, and the random test tests across the full cube of +/- 1.022. I did a naive double -> int encoding for each dimension, storing in 12 byte binary DV per doc. Multi-valued docs aren't supported yet but I think it shouldn't be too hard. There are tons of nocommits because I don't know how to tie into the geo3d APIs... I think we should make a branch here, commit the two patches, and then iterate? |
Karl Wright (@DaddyWri) (migrated from JIRA) If you create a branch I can generate patches against it. |
ASF subversion and git services (migrated from JIRA) Commit 1695368 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: make branch |
ASF subversion and git services (migrated from JIRA) Commit 1695369 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: Karl's patch to extend geo3d apis to 3d rectangles |
ASF subversion and git services (migrated from JIRA) Commit 1695370 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: initial 3D BKD implementation |
Karl Wright (@DaddyWri) (migrated from JIRA) Ok, for a start – the way you get X, Y, and Z ranges for a given planet model is via PlanetModel.getMinimumX(), getMaximumX(), getMinimumY(), getMaximumY(), getMinimumZ(), getMaximumZ(). The GeoShape interface does not provide a means of obtaining the PlanetModel, so you will need to pass this in to your constructor in addition to what you currently have. Second, the following code: + double x = BKD3DTreeDocValuesFormat.decodeValue(BKD3DTreeDocValuesFormat.readInt(bytes.bytes, bytes.offset));
+ double y = BKD3DTreeDocValuesFormat.decodeValue(BKD3DTreeDocValuesFormat.readInt(bytes.bytes, bytes.offset+4));
+ double z = BKD3DTreeDocValuesFormat.decodeValue(BKD3DTreeDocValuesFormat.readInt(bytes.bytes, bytes.offset+8));
+ //return GeoUtils.pointInPolygon(polyLons, polyLats, lat, lon);
+ // nocommit fixme!
+ return true; ... should call GeoShape.isWithin(x,y,z) to determine membership within the shape. Finally, + public BKD3DTreeReader.Relation compare(int xMin, int xMax, int yMin, int yMax, int zMin, int zMax) {
+ // nocommit fixme!
+ return BKD3DTreeReader.Relation.INSIDE;
+ } ... should do the following: GeoArea xyzSolid = new XYZSolid(planetModel, xMin, xMax, yMin, yMax, zMin, zMax);
return xyzSolid.getRelationship(geoShape) == GeoArea.<WHATEVER>?xxx:yyy |
Michael McCandless (@mikemccand) (migrated from JIRA) OK I created the branch and committed our two latest patches! https://svn.apache.org/repos/asf/lucene/dev/branches/lucene6699 |
Karl Wright (@DaddyWri) (migrated from JIRA) ok, I'll try to work my proposed changes into them. The biggest thing, though, is that we need access to a PlanetModel instance inside the compare() method. So some plumbing will be necessary to set that up. Stay tuned. ;-) |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks @DaddyWri, I started on one part, lemme go commit so we don't stomp on each other! |
ASF subversion and git services (migrated from JIRA) Commit 1695377 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: fold in some feedback |
Michael McCandless (@mikemccand) (migrated from JIRA) OK I committed my change, hack away! |
ASF subversion and git services (migrated from JIRA) Commit 1698232 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: store planet's max in the index, and validate at search time |
ASF subversion and git services (migrated from JIRA) Commit 1698321 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: fix encode to round instead of truncate to reduce quantization error; fix quantixed bbox decode |
ASF subversion and git services (migrated from JIRA) Commit 1698454 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: add test case |
ASF subversion and git services (migrated from JIRA) Commit 1698455 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: fix rename |
ASF subversion and git services (migrated from JIRA) Commit 1700049 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: iterate |
ASF subversion and git services (migrated from JIRA) Commit 1700102 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: iterate |
ASF subversion and git services (migrated from JIRA) Commit 1700165 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: randomize other shapes too; fix a nocommit |
ASF subversion and git services (migrated from JIRA) Commit 1700313 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: make encode/decodeValue symmetric |
ASF subversion and git services (migrated from JIRA) Commit 1700450 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: increase bounds fudge factor |
ASF subversion and git services (migrated from JIRA) Commit 1700457 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: only print verbose details of the failing iter |
ASF subversion and git services (migrated from JIRA) Commit 1700459 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: include the shape in the failure message |
ASF subversion and git services (migrated from JIRA) Commit 1700587 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: increase fudge factor further |
ASF subversion and git services (migrated from JIRA) Commit 1700788 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: remove nocommits |
ASF subversion and git services (migrated from JIRA) Commit 1700800 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: merge trunk |
ASF subversion and git services (migrated from JIRA) Commit 1700820 from @mikemccand in branch 'dev/branches/lucene6699' LUCENE-6699: fix precommit |
ASF subversion and git services (migrated from JIRA) Commit 1700883 from @mikemccand in branch 'dev/trunk' LUCENE-6699: add geo3d + BKD for fast, accurate earth-surface point-in-shape queries |
ASF subversion and git services (migrated from JIRA) Commit 1700887 from @mikemccand in branch 'dev/branches/branch_5x' LUCENE-6699: add geo3d + BKD for fast, accurate earth-surface point-in-shape queries |
Michael McCandless (@mikemccand) (migrated from JIRA) Thanks @DaddyWri! |
ASF subversion and git services (migrated from JIRA) Commit 1703537 from @mikemccand in branch 'dev/trunk' LUCENE-6699: fix silly test bug |
ASF subversion and git services (migrated from JIRA) Commit 1703538 from @mikemccand in branch 'dev/branches/branch_5x' LUCENE-6699: fix silly test bug |
Terry Smith (migrated from JIRA) Karl, were you able to find that packing scheme? I'm interested in poking the x,y,z values into a SortedNumericDocValuesField to see how well it would perform. |
Karl Wright (@DaddyWri) (migrated from JIRA) no time, I'm afraid... |
Michael McCandless (@mikemccand) (migrated from JIRA) Terry Smith maybe it's this comment? https://issues.apache.org/jira/browse/LUCENE-6480?focusedCommentId=14543396&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14543396 |
Nick Knize (@nknize) (migrated from JIRA) Terry Smith I updated the Geo3DPacking code some time ago to avoid the overhead of BitSet and use raw morton bit twiddling. The intent is to use it in #7539. Since that issue has stalled a bit I went ahead and attached the standalone class (with benchmarks) to the #7539 issue if you're interested in tinkering. |
Terry Smith (migrated from JIRA) Thanks guys. I was hoping to squeeze those x,y,z values into a 64 bits instead of 96. I'm not a bit twiddler but I'll take a look at Nicholas' patch and see if I can adapt it. |
Nick Knize (@nknize) (migrated from JIRA) I started by packing all 3 values into a 64 bit long - in fact those MAGIC numbers are still there (MAGIC[7:12]). The problem with this is precision loss from compressing each value into 21 bits. The decoded values will give, at best, 3 decimal precision accuracy - which is ~110m at the equator. If you're fine with course grain accuracy then copy the BitUtil.{interleave | deinterleave} methods and use MAGIC[7:12]. The code is much simpler with the accuracy trade-off. |
I'm opening this for discussion, because I'm not yet sure how to do
this integration, because of my ignorance about spatial in general and
spatial3d in particular :)
Our BKD tree impl is very fast at doing lat/lon shape intersection
(bbox, polygon, soon distance: LUCENE-6698) against previously indexed
points.
I think to integrate with spatial3d, we would first need to record
lat/lon/z into doc values. Somewhere I saw discussion about how we
could stuff all 3 into a single long value with acceptable precision
loss? Or, we could use BinaryDocValues? We need all 3 dims available
to do the fast per-hit query time filtering.
But, second: what do we index into the BKD tree? Can we "just" index
earth surface lat/lon, and then at query time is spatial3d able to
give me an enclosing "surface lat/lon" bbox for a 3d shape? Or
... must we index all 3 dimensions into the BKD tree (seems like this
could be somewhat wasteful)?
Migrated from LUCENE-6699 by Michael McCandless (@mikemccand), 1 vote, resolved Sep 02 2015
Attachments: Geo3DPacking.java, LUCENE-6699.patch (versions: 26)
Linked issues:
The text was updated successfully, but these errors were encountered: