ENH: Add query and query_bulk to sindex #1401

Merged: 22 commits, May 13, 2020

Contributor

@adriangb adriangb commented Apr 27, 2020

The next step in integration of pygeos.strtree. xref #1404

A couple of notes:

  1. I included sorting for now, but am happy to remove it if it adds too much complexity.
  2. I included support for other predicates (and tests, it works) but will be happy to remove it for now if we want to do that separately.

A general comment:
I think it would be cool to have a spatial index abstract base class that lives inside of sindex.py that starts to sketch out what API we expect for spatial indexes. Any thoughts on that are welcome.

CCing @jorisvandenbossche @martinfleis @brendan-ward
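
The abstract base class idea floated above could be sketched roughly as follows. This is only an illustration: the names `BaseSpatialIndex` and `BruteForceIntervalIndex`, the list-based return values, and the 1-D interval "geometries" are assumptions made for this example and are not part of the PR.

```python
from abc import ABC, abstractmethod


class BaseSpatialIndex(ABC):
    """Hypothetical interface; names and signatures are assumptions."""

    @abstractmethod
    def query(self, geometry, predicate=None):
        """Return positions of tree geometries that match `geometry`."""

    def query_bulk(self, geometries, predicate=None):
        """Return [input_positions, tree_positions] for many inputs.

        Default implementation: call `query` once per input geometry.
        """
        input_idx, tree_idx = [], []
        for i, geometry in enumerate(geometries):
            hits = self.query(geometry, predicate=predicate)
            input_idx.extend([i] * len(hits))
            tree_idx.extend(hits)
        return [input_idx, tree_idx]


class BruteForceIntervalIndex(BaseSpatialIndex):
    """Toy 1-D stand-in: each 'geometry' is a closed interval (lo, hi)."""

    def __init__(self, intervals):
        self._intervals = list(intervals)

    def query(self, geometry, predicate=None):
        if predicate is not None:
            # Only the extent (bounding interval) test exists in this toy.
            raise ValueError(f"unsupported predicate: {predicate!r}")
        lo, hi = geometry
        return [
            i
            for i, (a, b) in enumerate(self._intervals)
            if a <= hi and lo <= b
        ]


index = BruteForceIntervalIndex([(0, 1), (2, 3), (5, 8)])
single = index.query((1, 2))
bulk = index.query_bulk([(1, 2), (6, 7)])
print(single, bulk)
```

A concrete rtree- or pygeos-backed class would only need to override `query` (and could override `query_bulk` with a vectorized version).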

@martinfleis martinfleis changed the title from "Add query and query_bulk to sindex" to "ENH: Add query and query_bulk to sindex" Apr 28, 2020
@@ -81,22 +223,119 @@ def is_empty(self):

if compat.HAS_PYGEOS:

    from pygeos import STRtree, box, points  # noqa
    import geopandas
Member

Just a quick note, (I'll do a proper look later), can you import only classes you need from their files instead of whole geopandas?

Contributor Author

@adriangb adriangb Apr 28, 2020

I was running into circular imports with geopandas.geoseries.GeoSeries. I think the best we can do is:

from . import geoseries
from .array import GeometryArray
...
geoseries.GeoSeries

Would that be okay?

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for working on this!
Didn't look yet in detail, but some quick first comments

with_objects = namedtuple("with_objects", "object id")

# set of valid predicates for this spatial index
# by default, the global set
valid_query_predicates = VALID_QUERY_PREDICATES

This comment was marked as off-topic.

if predicate == "within":
    # since these are inverse, we can flip the operation
    # and test with prepared predicates from tree
    predicate = "contains"
Member

Shouldn't the geometries get flipped in order as well? (because it seems you are now using the same order for both contains and within ?)

Contributor Author

The logic here is:

if "within":
    tree[i].contains(geometry)
if "contains":
    geometry.contains(tree[i])

So indeed, the geometries do get flipped. What gets appended to the results is i, which in both cases is the index of the geometry in the tree. query only outputs indexes for tree geometries.
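
The flip described above can be illustrated with a toy example. This is a minimal sketch, assuming axis-aligned boxes as geometries; the `Box` tuple and the `contains` and `query` helpers are hypothetical stand-ins, not the code in this PR. It shows that "within" is answered by asking each tree geometry whether it contains the input, and that in both cases the result holds positions of tree geometries.

```python
from collections import namedtuple

Box = namedtuple("Box", "minx miny maxx maxy")


def contains(a, b):
    """True if box `a` contains box `b` (boundaries may coincide)."""
    return (
        a.minx <= b.minx
        and a.miny <= b.miny
        and a.maxx >= b.maxx
        and a.maxy >= b.maxy
    )


def query(tree, geometry, predicate):
    """Return positions in `tree` where predicate(geometry, tree[i]) holds."""
    if predicate == "within":
        # geometry within tree[i]  <=>  tree[i] contains geometry
        return [i for i, t in enumerate(tree) if contains(t, geometry)]
    if predicate == "contains":
        return [i for i, t in enumerate(tree) if contains(geometry, t)]
    raise ValueError(f"unsupported predicate: {predicate!r}")


tree = [Box(0, 0, 10, 10), Box(2, 2, 3, 3), Box(20, 20, 30, 30)]
geometry = Box(1, 1, 5, 5)

within_hits = query(tree, geometry, "within")  # tree boxes enclosing the input
contains_hits = query(tree, geometry, "contains")  # tree boxes inside the input
print(within_hits, contains_hits)
```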

Contributor Author

I will try to add more comments to document this, if there is any specific way I can clarify this please let me know.

Contributor Author

I re-organized this section and added some more comments. Please let me know if it is clear now.

Member

Yeah, I missed they were using a different order in the actual call (I actually assumed this block was for within and contains, but it was actually for within and intersects, and contains was done below differently).

But added some new comments now ;)

Comment on lines 354 to 362
# handle shapely geometries
if compat.PYGEOS_SHAPELY_COMPAT:
    geometry = from_shapely(geometry)
# fallback going through WKB
elif geometry.is_empty and geometry.geom_type == "Point":
    # empty point does not roundtrip through WKB;
    # "POINT EMPTY" is WKT, so it is parsed with from_wkt
    geometry = from_wkt("POINT EMPTY")
else:
    geometry = from_wkb(geometry.wkb)
Member

Since this is duplicating the code in _shapely_to_geom / _shapely_to_pygeos, it is fine to use those directly instead of duplicating the code.
(yes, those were coded as "private", but it's fine to use them for now within geopandas)

Contributor Author

Ok, with that okay I went ahead and imported _shapely_to_geom (we don't really need _shapely_to_pygeos).

Member

@brendan-ward brendan-ward left a comment

@adriangb sorry for the slow rate of my review here; I've only made it through the Rtree implementation so far.

I don't know if it would lead to more proliferation of files than is desirable here, but it struck me that the Rtree and STRtree implementations could be in separate files, to keep their imports cleaner? Maybe making sindex into a package instead of a single file...

        if getattr(self._prepared_geometries[i], predicate)(geometry):
            res.append(i)
    tree_query = res
elif predicate == "contains" and len(tree_query) > 1:
Member

Limiting this case to len(tree_query) > 1 is not immediately obvious. Am I correct in thinking that even if there is 1 hit from the tree, it does not guarantee that the input geometry contains the tree geometry? As in, their bounding boxes intersect, but without evaluating the predicate, we don't know how the actual geometries relate to each other.

Contributor Author

@adriangb adriangb Apr 29, 2020

Disregard earlier comment, there was no bug, but obviously I need more comments if even I get confused.

Note that the condition below this is elif predicate is not None: so if predicate == contains but there was only 1 hit, the predicate is still checked but the input geometry is not prepared. I am going to re-organize as follows:

elif predicate is not None:
    if len(tree_idx) > 1 and predicate == "contains":
        # prepare this geometry
        geometry = prep(geometry)
    tree_idx = [
        i
        for i in tree_idx
        if getattr(geometry, predicate)(self._geometries[i])
    ]

Separately, I know that the choice of thresholding prepping the input geometry for len(tree_idx) > 1 is somewhat arbitrary. Do you think we should just always prep if predicate == "contains", or maybe there is a better threshold than 1?

Contributor Author

@jorisvandenbossche answered below. We are going to always use prepared geometries.

predicate : {None, 'intersects', 'within', 'contains', 'overlaps', 'crosses', 'touches'}, optional
If predicate is provided, a prepared version of the input geometry is tested using
the predicate function against each item in the index whose extent intersects the
envelope of the input geometry: predicate(geometry, tree_geometry).
Member

This isn't technically correct, per below implementation.

If predicate is provided, the input geometry is tested using the predicate function against each item in the index whose extent intersects the envelope of the input geometry: predicate(geometry, tree_geometry).  If possible,
prepared geometries are used to help speed up the predicate operation.

(sorry for bad prior suggestion here)

Contributor Author

@adriangb adriangb Apr 29, 2020

Thanks, I will add that to the rtree sindex implementation.

I am going to leave any mention of prepared geometries out of the pygeos sindex docstrings since it currently does not use prepared geometries and if it does, that might happen internally within pygeos.

@adriangb
Contributor Author

@adriangb sorry for the slow rate of my review here; I've only made it through the Rtree implementation so far.

No worries, I know that I've been putting a lot of stuff up for review!

I don't know if it would lead to more proliferation of files than is desirable here, but it struck me that the Rtree and STRtree implementations could be in separate files, to keep their imports cleaner? Maybe making sindex into a package instead of a single file...

I think that loops back to the discussion in #1344. I see keeping them in the same file as the "default" option. If a consensus is reached on an alternative, I'd be happy to implement it.

@martinfleis
Member

I don't know if it would lead to more proliferation of files than is desirable here, but it struck me that the Rtree and STRtree implementations could be in separate files, to keep their imports cleaner? Maybe making sindex into a package instead of a single file...

I think that loops back to the discussion in #1344. I see keeping them in the same file as the "default" option. If a consensus is reached on an alternative, I'd be happy to implement it.

I would keep it as it is for now, expecting that major changes will happen in future.

@jorisvandenbossche
Member

Yes, let's keep it in the current file for now, to minimize the diff.

# Since only certain predicates support prepared geometries,
# the fallback is to check non-prepared geometries.
if predicate in ("intersects", "within"):
    # For these two predicates, we compare tree_geom.predicate(input_geom)
Member

For "within" I understand now why we are flipping this around. But why do this for "intersects"? I think it will be more efficient to prepare the input geometry, since this is called multiple times

Contributor Author

Well we can cache the tree geometry, but not the input geometries. So I guess it depends on the specific dataset. I did it this way so that we are caching whenever possible, but I'm open to doing it the other way around.

Member

Assuming a single call to query_bulk, each geometry will be passed once to query, so I think it is OK they are not cached.

(we currently also don't do this, and caching prepared geoms in general is a separate topic we can discuss with pygeos, I would say. We shouldn't put too much effort in optimizing the current shapely code, at least not beyond what we have now. As it will become obsolete soon)

Member

(BTW, this is also how it is done in pygeos now: prepare the input geometry, not the tree geometry.)

Contributor Author

@adriangb adriangb Apr 30, 2020

Okay, no problem, I can move intersects to the other loop so that the input geometry is prepared.

For within, we still need to prep the tree geometry instead of the input geometry. Shapely prepared geometries support "contains" but not "within", flipping the comparison allows us to also use prepared geometries for "within". So that said, should I just get rid of caching prepared tree geometries altogether? I agree generally that we shouldn't worry about improving shapely performance, but this improvement does not seem very complicated. I leave the choice up to you.

Member

Keeping it for "within" might be fine. We do switch the order in sjoin, so in practice are using prepared geoms there.

Contributor Author

@adriangb adriangb Apr 30, 2020

Right, I got the idea from sjoin. Once we put query_bulk into sjoin, my idea would be to remove prepared geometries from sjoin (they're not compatible with pygeos) which would also allow us to get rid of the order flipping logic.
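
The caching of prepared tree geometries discussed in this thread could look roughly like the sketch below. It is an illustration under assumptions: `PreparedCache` is a hypothetical helper, and `fake_prepare` stands in for `shapely.prepared.prep` so the example runs without shapely. The point is only that each tree geometry is prepared at most once, no matter how many queries hit it.

```python
class PreparedCache:
    """Lazily build and store one 'prepared' object per tree position."""

    def __init__(self, geometries, prepare):
        self._geometries = geometries
        self._prepare = prepare
        self._prepared = {}

    def get(self, i):
        # Prepare on first access only; later accesses reuse the stored object.
        if i not in self._prepared:
            self._prepared[i] = self._prepare(self._geometries[i])
        return self._prepared[i]


calls = []


def fake_prepare(geom):
    """Stand-in for shapely.prepared.prep that records every call."""
    calls.append(geom)
    return ("prepared", geom)


cache = PreparedCache(["a", "b", "c"], fake_prepare)

# Simulate several queries whose bounding-box hits overlap.
for i in [0, 2, 0, 0, 2]:
    cache.get(i)

print(calls)
```

Input geometries, by contrast, are seen once per `query` call, which is why caching them was considered less useful above.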


@adriangb
Contributor Author

Thank you for the review @jorisvandenbossche!

Aside from the formatting things and unused parameters in tests, the main issue (which had also been brought up by @brendan-ward) was the logic in prepping geometries and such for the rtree query.

I think I've simplified it now and as discussed we are now always comparing input_geom.predicate(tree_geom) except for predicate=within in which case it is flipped so that we can use contains (tree_geom.contains(input_geom)).

Let me know if there is anything else I missed!

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good! Added some small comments

Comment on lines 111 to 112
predicate : {None, 'intersects', 'within', 'contains', \
'overlaps', 'crosses', 'touches'}, optional
Member

Suggested change
predicate : {None, 'intersects', 'within', 'contains', \
'overlaps', 'crosses', 'touches'}, optional
predicate : {None, 'intersects', 'within', 'contains', \
'overlaps', 'crosses', 'touches'}, optional

(a bit ugly, but that's how it works to avoid having the whitespace in the actual docstring)

Contributor Author

@adriangb adriangb May 1, 2020

I'll change it, but just so I know, this is for readthedocs to display it correctly?
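
The whitespace effect mentioned in this thread can be checked in plain Python. A minimal sketch with two hypothetical functions (`indented` and `flush_left`, using a shortened predicate list): a backslash at the end of a line inside a regular string joins it with the next line, so any indentation on the continuation line ends up inside the docstring text itself.

```python
def indented():
    """predicate : {None, 'intersects', \
        'touches'}, optional
    """


def flush_left():
    """predicate : {None, 'intersects', \
'touches'}, optional
    """


def first_line(func):
    """Return the first line of a function's docstring."""
    return func.__doc__.splitlines()[0]


print(repr(first_line(indented)))
print(repr(first_line(flush_left)))
```

The indented variant carries a run of spaces in the middle of the parameter line, which is what the flush-left continuation avoids.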

@adriangb
Contributor Author

adriangb commented May 1, 2020

I found a possible bug: for both implementations, there are crashes/errors when non-valid geometries are used. I see three options here:

  1. Don't even check. Likely the fastest, but could result in confusion (with pygeos, the error message is internal).
  2. Check and raise a useful error.
  3. Check and skip invalid geometries (we already do this for empty geometries).

Let me know what you think. All should be relatively easy to implement.

@jorisvandenbossche
Member

How do you get an error for the rtree implementation with invalid geometries? (does getting the bounds error?)

We shouldn't skip invalid geometries, IMO those are the responsibility of the user. It would be good that there is a decent error message of course. But ideally that can happen without having to check for them (at least on this level of the sindex).

with pygeos, the error message is internal).

Then that sounds like something to solve in pygeos. Can you open an issue there with an example?

@adriangb
Contributor Author

adriangb commented May 2, 2020

How do you get an error for the rtree implementation with invalid geometries? (does getting the bounds error?)

We shouldn't skip invalid geometries, IMO those are the responsibility of the user. It would be good that there is a decent error message of course. But ideally that can happen without having to check for them (at least on this level of the sindex).

with pygeos, the error message is internal).

Then that sounds like something to solve in pygeos. Can you open an issue there with an example?

I got the error during the predicate check I think. Probably from trying to check if something is contained in a self touching polygon or something like that. Let me try to make a simpler geometry than the ones in the naturalearth dataset and test.

@jorisvandenbossche
Member

Ah, yes, I forgot the predicate check. In principle, it's "just" the predicate function that should raise an informative error message. And then we don't need to care about those here in the spatial index.

@adriangb
Contributor Author

adriangb commented May 2, 2020

So I found the geometry causing the issue: I forgot to remove Antarctica after reprojecting 👎, so really this should be a non-issue for most datasets. I'm not even going to open an issue in pygeos because this is not a real world use case. Sorry for the distraction!

@adriangb
Contributor Author

adriangb commented May 2, 2020

So back to this PR. Here are the things I think may need double checking:

  • Make sure the tests are covering all corner cases.
  • Make sure the benchmarks make sense and we don't have any unnecessary slowdowns.

@jorisvandenbossche
Member

I'm not even going to open an issue in pygeos because this is not a real world use case. Sorry for the distraction!

That's still welcome I would say. Invalid geometries are already an annoying part of daily life for gis analysts / geopandas users, so we should make sure it is not even more confusing with bad error messages.

@adriangb
Contributor Author

adriangb commented May 2, 2020

That's still welcome I would say. Invalid geometries are already an annoying part of daily life for gis analysts / geopandas users, so we should make sure it is not even more confusing with bad error messages.

Ok, posted the issue pygeos/pygeos#139

@jorisvandenbossche
Member

Back to the actual PR. Regarding the tests, that would be nice to check (but I have the impression there is already quite a good coverage?). For the benchmarks, I think that is fine to leave as a follow-up.

@adriangb
Contributor Author

adriangb commented May 5, 2020

Regarding the tests, that would be nice to check (but I have the impression there is already quite a good coverage?).

I personally think we are covering what we need to, but there is certainly stuff that we are supporting but not testing. For example, we are not testing a lot of predicates (ex: "touches"). I also don't want to just duplicate all of the testing within pygeos if we are just wrapping it. But I leave it up to you and others to decide how much testing we want.

For the benchmarks, I think that is fine to leave as a follow-up.

Agreed, they are pretty informative for now, but it's probably worth discussing performance in detail once we start using query_bulk in other parts.

Member

@brendan-ward brendan-ward left a comment

@adriangb this is looking good - thanks for the updates!

I added a few minor comments re: tests, but overall I'm 👍 on not repeating a bunch of predicate tests here if they are already well-covered in pygeos, so long as they are also well-covered here for the rtree implementation.

It might be good to add a test case that uses the countries / capitals, with varying predicates, to verify that the size of the results is the same between rtree and pygeos.

For the predicates where pygeos is slower than expected (e.g., time_query_bulk('intersects', 'polygons', 'polygons')), it might be useful to know the size of the result sets. It could be that we're constructing the result arrays in a suboptimal fashion in pygeos.

Otherwise, I didn't see anything obvious that was causing the benchmarks for a few of the cases to perform worse for pygeos, and I think by and large those can be optimized later.

(None, box(-1, -1, -0.5, -0.5), []),
(None, box(-0.5, -0.5, 0.5, 0.5), [0]),
(None, box(0, 0, 1, 1), [0, 1]),
# bbox intersects but not geometry
Member

This comment doesn't seem to explain the following cases.

It might make these easier to group mentally by the input geometries instead of by predicate, so that it is easier to see the variations in output based on just the predicate:

(None, LineString([(0, 1), (1, 0)]), [0, 1])  # bounding box intersects
("intersects", LineString([(0, 1), (1, 0)]), []),  # geometry does not intersect
("within", LineString([(0, 1), (1, 0)]), []),  # intersects but not within
("contains", LineString([(0, 1), (1, 0)]), []),  # intersects but not contains

Might be good to add a touches variant to this one too, but overall I agree with other comments that predicate testing does not need to be exhaustive here if appropriately covered already for rtree implementation (since most predicates are at least partly covered by pygeos tests).
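
The difference between a bounding-box hit and an actual geometry match for the LineString case above can be reproduced with a small pure-Python sketch. Assumptions: the tree holds the points (0, 0), (1, 1) and (2, 2), which is consistent with the expected outputs listed above but is not stated in this thread, and the helpers `bbox_hits`, `on_segment` and `query` are hypothetical stand-ins for the index and predicate checks.

```python
def bbox_hits(points, segment):
    """Positions of points inside the segment's bounding box (the tree step)."""
    (x1, y1), (x2, y2) = segment
    minx, maxx = min(x1, x2), max(x1, x2)
    miny, maxy = min(y1, y2), max(y1, y2)
    return [
        i
        for i, (x, y) in enumerate(points)
        if minx <= x <= maxx and miny <= y <= maxy
    ]


def on_segment(point, segment):
    """True if `point` lies on `segment` (exact for integer coordinates)."""
    (x1, y1), (x2, y2) = segment
    x, y = point
    # Zero cross product means the point is collinear with the segment.
    cross = (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1)
    in_box = min(x1, x2) <= x <= max(x1, x2) and min(y1, y2) <= y <= max(y1, y2)
    return cross == 0 and in_box


def query(points, segment, predicate=None):
    """Bounding-box filter first, then an optional exact predicate check."""
    candidates = bbox_hits(points, segment)
    if predicate is None:
        return candidates
    if predicate == "intersects":
        return [i for i in candidates if on_segment(points[i], segment)]
    raise ValueError(f"unsupported predicate: {predicate!r}")


points = [(0, 0), (1, 1), (2, 2)]  # assumed tree geometries
line = ((0, 1), (1, 0))  # same coordinates as the LineString above

bbox_only = query(points, line)
exact = query(points, line, "intersects")
print(bbox_only, exact)
```

Both corner points fall inside the line's bounding box, but neither lies on the diagonal, so the predicate step removes them.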

Contributor Author

My comments were badly organized, that was supposed to refer to just the geometry under it. I added more comments and reorganized them.

I think I am going to skip reorganizing by input geometry. I do see the argument for it, but it would require reorganizing all test cases to be consistent. Let me know if this is okay.

Member

It is fine not to reorganize, so long as the comments make sense for the conditions.

Contributor Author

Ok. Let me know if the new comments help.

@adriangb
Contributor Author

adriangb commented May 6, 2020

Thanks @brendan-ward. I will address the specific comments soon. A general question, regarding testing of predicates: we are currently using the same tests for both rtree and pygeos, this means there are predicates we are not testing with rtree.

The predicates we are testing are intersects, within, contains and None. We are not testing overlaps, crosses or touches.

All of the missing predicates share the same logic, so do you think we can get away with testing a single one (ex: touches)? And can we do it only in query (query_bulk has no predicate logic)? I think this is what you are implying in your comments, but I just want to clarify.

For example, I am thinking adding two test cases to query might suffice:

("touches", box(-1, -1, 0, 0), [0]),  # bbox intersects and touches
("touches", box(-0.5, -0.5, 1.5, 1.5), []),  # bbox intersects but geom does not touch

@adriangb
Contributor Author

adriangb commented May 6, 2020

It might be good to add a test case that uses the countries / capitals, with varying predicates, to verify that the size of the results is the same between rtree and pygeos.

Are you suggesting we directly compare (dynamically setting the flag within the testcase or something), or that we hardcode the expected output size into the tests?

@brendan-ward
Member

Are you suggesting we directly compare (dynamically setting the flag within the testcase or something), or that we hardcode the expected output size into the tests?

Hardcode the expected size.

Using touches is fine. The goal here is less about the correctness of the predicates, and more that we're correctly catching where they differ from the bounding box intersection. More tests can always get added later, you've done a lot here!
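
The goal stated here, checking that a predicate result differs from the plain bounding-box result, can be sketched as a small table-driven check. Assumptions: the tree holds points at (0, 0), (1, 1) and (2, 2), and `Box`, `in_closed_box`, `on_boundary` and `query` are hypothetical pure-Python stand-ins for the real geometries and index. The cases mirror the ones proposed earlier in this conversation.

```python
from collections import namedtuple

Box = namedtuple("Box", "minx miny maxx maxy")


def in_closed_box(point, box):
    """True if the point is inside the box or on its boundary."""
    x, y = point
    return box.minx <= x <= box.maxx and box.miny <= y <= box.maxy


def on_boundary(point, box):
    """True if the point lies exactly on the box boundary ('touches')."""
    x, y = point
    on_edge = x in (box.minx, box.maxx) or y in (box.miny, box.maxy)
    return in_closed_box(point, box) and on_edge


def query(points, box, predicate=None):
    """Bounding-box candidates, optionally filtered by 'touches'."""
    candidates = [i for i, p in enumerate(points) if in_closed_box(p, box)]
    if predicate is None:
        return candidates
    if predicate == "touches":
        return [i for i in candidates if on_boundary(points[i], box)]
    raise ValueError(f"unsupported predicate: {predicate!r}")


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # assumed tree geometries

cases = [
    (None, Box(-1, -1, -0.5, -0.5), []),
    (None, Box(-0.5, -0.5, 0.5, 0.5), [0]),
    (None, Box(0, 0, 1, 1), [0, 1]),
    ("touches", Box(-1, -1, 0, 0), [0]),  # bbox intersects and touches
    ("touches", Box(-0.5, -0.5, 1.5, 1.5), []),  # bbox intersects, no touch
]

results = [query(points, box, predicate) for predicate, box, _ in cases]
expected = [exp for _, _, exp in cases]
print(results == expected)
```

The last case is the informative one: the bounding-box step alone would return two candidates, and only the predicate check reduces the result to an empty list.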

@adriangb
Contributor Author

adriangb commented May 6, 2020

I'll add an integration test with hardcoded sizes for the naturalearth dataset. And then I think we should be done?

@adriangb
Contributor Author

adriangb commented May 6, 2020

Tests added. Let's get this merged!

Member

@brendan-ward brendan-ward left a comment

A couple minor comments in the most recent test, but otherwise this looks good to me. 👍

Thanks @adriangb !

"""Tests output sizes for the naturalearth datasets."""
world = read_file(datasets.get_path("naturalearth_lowres"))
capitals = read_file(datasets.get_path("naturalearth_cities"))
# Reproject to Mercator (after dropping Antarctica)
Member

Is reprojection necessary? I would assume that geographic coordinates should work fine for this test.

Contributor Author

I think it's the right thing to do. That said, the tests would work without it, but then the results might be more confusing if checked by hand.

Contributor Author

I removed the reprojection and tested, all working. It is nicer to have fewer things being tested, so I think this is a good idea.

@adriangb
Contributor Author

adriangb commented May 7, 2020

Pinging @jorisvandenbossche and @martinfleis to re-review or merge if ready.

@jorisvandenbossche jorisvandenbossche merged commit 2c22a26 into geopandas:master May 13, 2020
@jorisvandenbossche
Member

@adriangb sorry for the delay, and thanks a lot for this PR !!

@adriangb
Contributor Author

No problem, excited to move on to actually implementing this stuff now!

@adriangb adriangb deleted the sindex-bulk-query branch May 13, 2020 14:39
weiji14 added a commit to weiji14/geopandas that referenced this pull request Aug 7, 2024