-
Notifications
You must be signed in to change notification settings - Fork 923
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add query and query_bulk to sindex #1401
ENH: Add query and query_bulk to sindex #1401
Conversation
geopandas/sindex.py
Outdated
@@ -81,22 +223,119 @@ def is_empty(self): | |||
|
|||
if compat.HAS_PYGEOS: | |||
|
|||
from pygeos import STRtree, box, points # noqa | |||
import geopandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick note, (I'll do a proper look later), can you import only classes you need from their files instead of whole geopandas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was running into circular imports with geopandas.geoseries.GeoSeries
. I think the best we can do is:
from . import geoseries
from .array import GeometryArray
...
geoseries.GeoSeries
Would that be okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
Didn't look yet in detail, but some quick first comments
with_objects = namedtuple("with_objects", "object id") | ||
|
||
# set of valid predicates for this spatial index | ||
# by default, the global set | ||
valid_query_predicates = VALID_QUERY_PREDICATES |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
geopandas/sindex.py
Outdated
if predicate == "within": | ||
# since these are inverse, we can flip the operation | ||
# and test with prepared predicates from tree | ||
predicate = "contains" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the geometries get flipped in order as well? (because it seems you are now using the same order for both contains and within ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is:
if "within":
tree[i].contains(geometry)
if "contains":
geometry.contains(tree[i])
So indeed, the geometries do get flipped. What gets appended to the results is i
, which in both cases is the index of the geometry in the tree. query
only outputs indexes for tree geometries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to add more comments to document this, if there is any specific way I can clarify this please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-organized this section and added some more comments. Please let me know if it is clear now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I missed they were using a different order in the actual call (I actually assumed this block was for within and contains, but it was actually for within and intersects, and contains was done below differently).
But added some new comments now ;)
geopandas/sindex.py
Outdated
# handle shapely geometries | ||
if compat.PYGEOS_SHAPELY_COMPAT: | ||
geometry = from_shapely(geometry) | ||
# fallback going through WKB | ||
elif geometry.is_empty and geometry.geom_type == "Point": | ||
# empty point does not roundtrip through WKB | ||
geometry = from_wkb("POINT EMPTY") | ||
else: | ||
geometry = from_wkb(geometry.wkb) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is duplicating the code in _shapely_to_geom
/ _shapely_to_pygeos
, it is fine to use those directly instead of duplicating the code.
(yes, those were coded as "private", but it's fine to use them for now within geopandas)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, with that okay I went ahead and imported _shapely_to_geom
(we don't really need _shapely_to_pygeos
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adriangb sorry for the slow rate of my review here; I've only made it through the Rtree implementation so far.
I don't know if it would lead to more proliferation of files than is desirable here, but it struck me that the Rtree and STRtree implementations could be in separate files, to keep their imports cleaner? Maybe making sindex
into a package instead of a single file...
geopandas/sindex.py
Outdated
if getattr(self._prepared_geometries[i], predicate)(geometry): | ||
res.append(i) | ||
tree_query = res | ||
elif predicate == "contains" and len(tree_query) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limiting this case to len(tree_query) > 1
is not immediately obvious. Am I correct in thinking that even if there is 1 hit from the tree, it does not guarantee that the input geometry contains
the tree geometry? As in, their bounding boxes intersect, but without evaluating the predicate, we don't know how the actual geometries relate to each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disregard earlier comment, there was no bug, but obviously I need more comments if even I get confused.
Note that the condition below this is elif predicate is not None:
so if predicate
== contains
but there was only 1 hit, the predicate is still checked but the input geometry is not prepared. I am going to re-organize as follows:
elif predicate is not None:
if len(tree_idx) > 1 and predicate == "contains":
# prepare this geometry
geometry = prep(geometry)
tree_idx = [
i
for i in tree_idx
if getattr(geometry, predicate)(self._geometries[i])
]
Separately, I know that the choice of thresholding prepping the input geometry for len(tree_idx) > 1
is somewhat arbitrary. Do you think we should just always prep if predicate == "contains"
, or maybe there is a better threshold than 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche answered below. We are going to always use prepared geometries.
geopandas/sindex.py
Outdated
predicate : {None, 'intersects', 'within', 'contains', 'overlaps', 'crosses', 'touches'}, optional | ||
If predicate is provided, a prepared version of the input geometry is tested using | ||
the predicate function against each item in the index whose extent intersects the | ||
envelope of the input geometry: predicate(geometry, tree_geometry). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't technically correct, per below implementation.
If predicate is provided, the input geometry is tested using the predicate function against each item in the index whose extent intersects the envelope of the input geometry: predicate(geometry, tree_geometry). If possible,
prepared geometries are used to help speed up the predicate operation.
(sorry for bad prior suggestion here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will add that to the rtree sindex implementation.
I am going to leave any mention of prepared geometries out of the pygeos sindex docstrings since it currently does not use prepared geometries and if it does, that might happen internally within pygeos.
No worries, I know that I've been putting a lot of stuff up for review!
I think that loops back to the discussion in #1344. I see keeping them in the same file as the "default" option. If a consensus is reached on an alternative, I'd be happy to implement it. |
I would keep it as it is for now, expecting that major changes will happen in future. |
Yes, let's keep it in the current file for now, to minimize the diff. |
geopandas/sindex.py
Outdated
# Since only certain predicates support prepared geometries, | ||
# the fallback is to check non-prepared geometries. | ||
if predicate in ("intersects", "within"): | ||
# For these two predicates, we compare tree_geom.predicate(input_geom) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For "within" I understand now why we are flipping this around. But why do this for "intersects"? I think it will be more efficient to prepare the input geometry, since this is called multiple times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well we can cache the tree geometry, but not the input geometries. So I guess it depends on the specific dataset. I did it this way so that we are caching whenever possible, but I'm open to doing it the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming a single call to query_bulk, each geometry will be passed once to query, so I think it is OK they are not cached.
(we currently also don't do this, and caching prepared geoms in general is a separate topic we can discuss with pygeos, I would say. We shouldn't put too much effort in optimizing the current shapely code, at least not beyond what we have now. As it will become obsolete soon)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(BTW, this is also how it is done in pygeos now (prepare the input geometry, not the tree geometry)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, no problem, I can move intersects to the other loop so that the input geometry is prepared.
For within, we still need to prep the tree geometry instead of the input geometry. Shapely prepared geometries support "contains" but not "within", flipping the comparison allows us to also use prepared geometries for "within". So that said, should I just get rid of caching prepared tree geometries altogether? I agree generally that we shouldn't worry about improving shapely performance, but this improvement does not seem very complicated. I leave the choice up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping it for "within" might be fine. We do switch the order in sjoin
, so in practice are using prepared geoms there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I got the idea from sjoin. Once we put bulk_query
into sjoin, my idea would be to remove prepared geometries from sjoin (they're not compatible with pygeos) which would also allow us to get rid of the order flipping logic.
geopandas/sindex.py
Outdated
if predicate == "within": | ||
# since these are inverse, we can flip the operation | ||
# and test with prepared predicates from tree | ||
predicate = "contains" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I missed they were using a different order in the actual call (I actually assumed this block was for within and contains, but it was actually for within and intersects, and contains was done below differently).
But added some new comments now ;)
…parameters from tests.
Thank you the review @jorisvandenbossche! Aside from the formatting things and unused parameters in tests, the main issue (which had also been brought up by @brendan-ward) was the logic in prepping geometries and such for the I think I've simplified it now and as discussed we are now always comparing Let me know if there is anything else I missed! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Added some small comments
geopandas/sindex.py
Outdated
predicate : {None, 'intersects', 'within', 'contains', \ | ||
'overlaps', 'crosses', 'touches'}, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
predicate : {None, 'intersects', 'within', 'contains', \ | |
'overlaps', 'crosses', 'touches'}, optional | |
predicate : {None, 'intersects', 'within', 'contains', \ | |
'overlaps', 'crosses', 'touches'}, optional |
(a bit ugly, but that's how it works to avoid having the whitespace in the actual docstring)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change it, but just so I know, this is for readthedocs to display it correctly?
I found a possible bug: for both implementations, there are crashes/errors when non-valid geometries are used. I see three options here:
Let me know what you think. All should be relatively easy to implement. |
How do you get an error for the rtree implementation with invalid geometries? (does getting the bounds error?) We shouldn't skip invalid geometries, IMO those are the responsibility of the user. It would be good that there is a decent error message of course. But ideally that can happen without having to check for them (at least on this level of the sindex).
The that sounds something to solve in pygeos. Can you open an issue there with an example? |
I got the error during the predicate check I think. Probably from trying to check if something is contained in a self touching polygon or something like that. Let me try to make a simpler geometry than the ones in the naturalearth dataset and test. |
Ah, yes, I forgot the predicate check. In principle, it's "just" the predicate function that should raise an informative error message. And then we don't need to care about those here in the spatial index. |
So I found the geometry causing the issue, I forgot to remove Antartica after reprojecting 👎 , so really this should be a non-issue for most datasets. I'm not even going to open an issue in pygeos because this is not a real world use case. Sorry for the distraction! |
So back to this PR. Here are the things I think may need double checking:
|
That's still welcome I would say. Invalid geometries are already an annoying part of daily life for gis analysts / geopandas users, so we should make sure it is not even more confusing with bad error messages. |
Ok, posted the issue pygeos/pygeos#139 |
Back to the actual PR. Regarding the tests, that would be nice to check (but I have the impression there is already quite a good coverage?). For the benchmarks, I think that is fine to leave as a follow-up. |
I personally think we are covering what we need to, but there is certainly stuff that we are supporting but not testing. For example, we are not testing a lot of predicates (ex: "touches"). I also don't want to just duplicate all of the testing within pygeos if we are just wrapping it. But I leave it up to you and others to decide how much testing we want.
Agreed, they are pretty informative for now, but it's probably worth discussing performance in detail once we start using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adriangb this is looking good - thanks for the updates!
I added a few minor comments re: tests, but overall I'm 👍 on not repeating a bunch of predicate tests here if they are already well-covered in pygeos, so long as they are also well-covered here for the rtree implementation.
It might be good to add a test case that uses the countries / capitals, with varying predicates, to verify that the size of the results is the same between rtree and pygeos.
For the predicates where pygeos is slower than expected (e.g., time_query_bulk('intersects', 'polygons', 'polygons')
), it might be useful to know the size of the result sets. It could be that we're constructing the result arrays in a suboptimal fashion in pygeos.
Otherwise, I didn't see anything obvious that was causing the benchmarks for a few of the cases to perform worse for pygeos, and I think by and large those can be optimized later.
geopandas/tests/test_sindex.py
Outdated
(None, box(-1, -1, -0.5, -0.5), []), | ||
(None, box(-0.5, -0.5, 0.5, 0.5), [0]), | ||
(None, box(0, 0, 1, 1), [0, 1]), | ||
# bbox intersects but not geometry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't seem to explain the following cases.
It might make these easier to group mentally by the input geometries instead of by predicate, so that it is easier to see the variations in output based on just the predicate:
(None, LineString([(0, 1), (1, 0)]), [0, 1]) # bounding box intersects
("intersects", LineString([(0, 1), (1, 0)]), []), # geometry does not intersect
("within", LineString([(0, 1), (1, 0)]), []), # intersects but not within
("contains", LineString([(0, 1), (1, 0)]), []), # intersects but not contains
Might be good to add a touches
variant to this one too, but overall I agree with other comments that predicate testing does not need to be exhaustive here if appropriately covered already for rtree implementation (since most predicates are at least partly covered by pygeos tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments were badly organized, that was supposed to refer to just the geometry under it. I added more comments and reorganized them.
I think I am going to skip reorganizing by input geometry. I do see the argument for it, but it would require reorganizing all test cases to be consistent. Let me know if this is okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine not to reorganize, so long as the comments make sense for the conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Let me know if the new comments help.
Thanks @brendan-ward. I will address the specific comments soon. A general question, regarding testing of predicates: we are currently using the same tests for both rtree and pygeos, this means there are predicates we are not testing with rtree. The predicates we are testing are All of the missing predicates share the same logic, so do you think we can get away with testing a single one (ex: For example, I am thinking adding two test cases to ("touches", box(-1, -1, 0, 0), [0]), # bbox intersects and touches
("touches", box(-0.5, -0.5, 1.5, 1.5), []), # bbox intersects but geom does not touch |
Are you suggesting we directly compare (dynamically setting the flag within the testcase or something), or that we hardcode the expected output size into the tests? |
Hardcode the expected size. Using |
I'll add an integration test with hardcoded sizes for the |
Tests added. Let's get this merged! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor comments in the most recent test, but otherwise this looks good to me. 👍
Thanks @adriangb !
geopandas/tests/test_sindex.py
Outdated
"""Tests output sizes for the naturalearth datasets.""" | ||
world = read_file(datasets.get_path("naturalearth_lowres")) | ||
capitals = read_file(datasets.get_path("naturalearth_cities")) | ||
# Reproject to Mercator (after dropping Antartica) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is reprojection necessary? I would assume that geographic coordinates should work fine for this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's the right thing to do. That said, the tests would work without it, but then the results might be more confusing if checked by hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the reprojection and tested, all working. It is nicer to have less things being tested, so I think this is a good idea.
Pinging @jorisvandenbossche and @martinfleis to re-review or merge if ready. |
@adriangb sorry for the delay, and thanks a lot for this PR !! |
No problem, excited to move on to actually implementation this stuff now! |
Based on suggestion at geopandas#1401 (comment).
The next step in integration of
pygeos.strtree
. xref #1404A couple of notes:
A general comment:
I think it would be cool to have a spatial index abstract base class that lives inside of
sindex.py
that starts to sketch out what API we expect for spatial indexes. Any thoughts on that are welcome.CCing @jorisvandenbossche @martinfleis @brendan-ward