-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: spatial partitioning of the GeoDataFrame #8
Comments
What do you think about potentially partitioning by attribute? For example, if I have a set of points (maybe cities) in the US with the state as a feature, and I want to spatially join it with a set of counties (polygons) in the US that also has the state as a feature, the computation could be significantly optimized by not having to worry about overlap/other geometric nuances in the borders, and most computation would be within the partition. |
FWIW I found that spatial partitioning with a GeoPandas series felt pretty clean. Many of the operations like spatial joins were relativly straightforward to write. Ideally some the implementation in the previous version could be reused. Things do become odd when you start considering geometries that can cross partition boundaries though. (This will be a problem regardless of which partitioning scheme you use). I think that there are ways to handle it, but it requires some thinking. For point-wise geometries though everything is pretty straightforward. |
I think that's something that is already supported by dask (but @mrocklin can correct me if I am wrong). See eg the doc page on shuffling at https://docs.dask.org/en/latest/dataframe-groupby.html The spatial partitioning will then mostly be useful for cases when you don't have such attribute that can be used as index (or where the regional extent of the attribute is not exactly the same accross datasets). Of course, if you know you have an attribute that describes certain regions, we could make it possible to specify this as input to the "spatial re-partitioning" function as a basis on how to repartition (instead of using some other logic to divide the space into regions). |
I think copying Matt's original implementation of using a geoseries to represent the divisions makes sense. Is there a cost associated with that that makes the spatialpandas path more compelling? |
Only boxes are in principle a bit simpler, but on the other hand, storing it as a geoseries might make working with it easier. And is also more general (and actually gives a spatial index on those partition bounds directly as well, as the geoseries has it). So certainly on board with starting with such an approach. Regarding naming, I actually like the "partition_bounds", only "bounds" is typically pointing to a bounding box. |
Can we just call them |
I am not sure that will play nicely with dask? Eg does dask expect it to be a tuple? Also, this are not divisions for the index, but a different concept, which might be confusing? |
I might be looking at this issue too much from one side, but I am really drawn towards the bounding box-only approach. Such a partition looks more like an array chunk. For vector-raster analyses this match will potentially allow more optimalizations. Maybe this would also allow borrowing some (battle-tested) logic from dask.array? |
In the analogy with dask.array, I think some problems are that 1) we don't have a regular grid with rectangles (in dask, the chunksize can vary in each dimension, but it's still are all rectangles in a grid, I think), and 2) we will typically have overlapping bounding boxes / spatial partitions (with polygons of all shapes, and each polygon having to fit completely in one of the bounding boxes, I think it is unavoidable that the boxes will typically overlap slightly). (that's not to say that a simpler bounding-box-only scheme couldn't be interesting, just pointing out it's quite a bit more complicated as dask.array's chunking scheme) |
You are correct that dask.array works with rectangular grids. I'd propose to make that the default option also for the spatial partitioning of geometries. There are roughly two options: 1) a partition consists of all geometries that are intersecting with it or 2) a partition consists of all geometries that have some deterministically assigned point (e.g. centroid) in it. The first will give duplicates, but the second doesn't allow pruning partitions in some spatial operations like rasterizing or overlay. It depends on your typical geometry size if the duplicates are acceptable; in many cases they are and you still get a big performance increase because of the parallelization. So I'd propose the first one (intersections, with duplicates) but keep the door open for developing more advanced schemes (carrying the convex hull of the parition along, quadtree, space filling curves, ...). Advantages of this approach (in my personal experience) are:
|
Interesting! That seems to point to a fundamental difference in thinking about "what is a spatial partition (bounds)". Because in my mind, it's still something else as the two options you list. In your two options, you seem to start from the spatial extent, and then determine which geometries belong to that spatial partition (either all geometries intersecting with it, or having a representative point lying within it). In my mind, and thinking about how a dask.dataframe works, another option (and what is currently done in the dummy implementation) is to start from the actual physical data: a dask dataframe is splitted into different partitions (sub-dataframes), and so the spatial partitioning is determined by the spatial extent (total bounding box for simplicity, or a more complex polygon) of the geometries that are stored in a certain partition. So the spatial bounds of a partition is a bounding box (or polygon) that at least fully covers all geometries of its partition. In that mindset, for non-point geometries and when working with simple bounding boxes, you will have by definition overlapping bounding boxes of the spatial partitions (unless all the polygon geometries are rectangular as wel, but let's not consider such special case for the general design ;)) I want to get back to the two options you listed above as well, and discuss some potential problems (or to explain "properties" that I think are important for a spatial partitioning system):
As you mention, this will give duplicates. However, I am not sure this is actually practical with dask.dataframe to have duplicates. Either you need to accept that things like
For this one, you mention that it doesn't allow pruning partitions in some spatial operations. And I think that's indeed an essential property of the spatial partitioning is that the bounding box (or polygon) representing a given partition should fully cover the actual geometries (and not only contain a representative point, or only intersect with the polygon). |
Interesting discussion indeed!
I agree that there is this difference of having (or not having) the data determine the partitions. Having to duplicate all of the input vector data into a partitioned form before an analysis can be done can be quite a show-stopper. IO is expensive, especially if the data is accessed over a network (let's say an SQL database or web API). At the same time, data that is accessible over the network is ideally suited for distributed computations. In my ideal world, you would be able to construct a computation on your laptop and send it to a cluster for computation without ever having to have all the IO flowing to your laptop. I think this is a really important point to decide on: does
So this is a real problem indeed. Maybe
This approach is indeed not suitable for spatial operations. But it does have use cases: for example, when you want to do a polygon-raster operation in which the raster is averaged inside each polygon. Both datasets are partitioned. You don't want duplicates in the polygons, but you do want the geometries in a partition to be spatially close so that you can prune the raster partitions. Summarizing: we could define non-overlapping partitions (e.g. a rectangular grid). Each partition contains at least all geometries that intersect with it, with a flag indicating if it is in our out bounds. Also it has 2 bounding boxes as metadata: the partition box and the box that contains all geometries (including the out-of-bounds ones). One more thoughts that popped up while writing this: I think the problem here is not so much in the fact that we are working in 2D, but mainly in the fact that the partitioning column consists of "range" objects instead of scalar objects. When partitioning a dask.dataframe, you take a column of scalars (typically, time) and divide this in blocks. For geometries, this can't be done easily because e.g. a polygon covers ranges of x and y. With dask.dataframes, you might have this issue too if your data consists of "events" with a start and end time. How would you partition that by time? Are there examples out there? Maybe somebody already dealt with the problem that partitions of events overlap in time? |
One approach that looks promising, at least in theory, is bounding interval hierarchy. A bit more on the approach here and here. It uses a recursive splitting algorithm like kdtree (points), but extends this to ranges and explicitly handles geometries that overlap the split line - so that once complete, those geometries are still assigned uniquely to a partition. It looks like it might take a bit of figuring to work out how to handle range queries against this data structure, since most of the examples are from single raycasting. |
Two options I encountered:
|
Spatial partitioning in Apache Sedona seems to be based on KDB-Tree, Quad-Tree and R-Tree. https://sedona.apache.org/tutorial/core-python/#use-spatial-partitioning Their example reminds me that we should have a method that repartitions a dataframe A based on partitions of a dataframe B. |
I'll jump in here again and promote using polygons. Writing the code the first time around felt really natural. Just to be clear, they don't necessarily have to fully partition the data, as someone said above you don't want to force a model where data must be organized, because in many cases this won't be the case, and it's expensive to implement. Instead, you want possibly overlapping bounding polygons for each partition, with the infinity polygon being the default option. In the original dask-geopandas I intentionally tried to solve some of the harder problems to make sure that the approach would work well. Spatial joins and whatever-the-groupby-thing-is called (please forgive my memory) were both interesting to write, but also both clean in the end. It felt very much like the right abstraction. We got to reuse a lot of the logic in geopandas. A parallel spatial join between two parallel geodataframes turned into a spatial join of the high level regions of each, followed by mapping spatial-join across those intersections. We got to leverage the power of the already written geopandas code. It was great. Trees are also good, but I personally would start with a tree of depth 1, and only expand that depth once computation/indexing proved troublesome. I wouldn't expect this to happen until you were well beyond thousands of partitions, which I think already covers the vast majority of use cases.
Operations like this are already written in the original version. If people haven't already I encourage them to look through it. |
Thanks Matthew for chiming in. I agree with having the spatial partitions as a GeoSeries, it makes writing code very natural (I updated your original spatial join implementation last week -> #54, mostly an update for dask API changes for now). I also want to emphasize again that it is really not straightforward to have "duplicated" rows (to support non-overlapping spatial partitions). That will certainly be an interesting model for certain applications, but that also means that you basically have to implement that model with dask from scratch, instead of basing it on It might still be useful to have readers that are using a spatial intersection to query data from some source, but then the "deduplication" step (eg based on containment of representative point) will need to happen during IO, so that once the data is read and materialized as a geopandas.GeoDataFrame inside a partition, it can be used as is.
Indeed, for a number of partitions in the thousands, doing a plain
Yes, we indeed should. As mentioned above, I already ported the |
We are speaking about "spatial indexing / partitioning" a lot here, but I think there are two distinct aspects:
I think several mentions of spatial indexing methods above are rather for the second aspect, but I would propose to open a separate issue for the repartitioning goal, and focus in this issue on the actual spatial partitioning concept and mechanism of the GeoDataFrame (the first aspect). |
Agreed. I really liked the idea at first, but you're right. I'll stop talking about this. As @mrocklin said, partitioning data will be expensive for many datasources. We could allow some There seems to be a lot of progress in this area (e.g. https://coiled.io/dask-under-the-hood-scheduler-refactor/) and I wonder if we can make this such that the partitions evaluate lazily as an iterator. I think this is a question that is in scope here, because choosing for e.g. a Geoseries to implement the partitions may limit our options. Another question that someone else may be able to answer: does the scheduler have access to the same packages (GEOS, PyGEOS, geopandas) as the workers? Coming back to the discussion about bbox vs. polygons. Initially, I had the feeling that packing partitions inside a Geoseries (which contains an I am still slightly worried about the polygons though. I have encountered some cases of severe "oversampling" in sourcedata, e.g. a arc that gets a point every centimeter, giving single polygons ~100 megabytes. When computing the convex hull of that, these terrible things might get into the partitions. Using bounding boxes, you just can't have that issue. So we could leave this option open and require |
Yes, I think it will indeed be interesting to investigate how we can leverage the work around high-level graphs to optimize this.
We can assume that, yes, but that is something you need to ensure it's the case as the user (or dev ops providing the client/scheduler environments).
At the moment just the |
What's the plan for dask.dataframe's |
I don't think there are any. GeometryArray currently doesn't support sorting. I'd say that the best solution, for now, would be to raise |
For making spatial joins or overlays, spatial predicates, reading from spatially partitioned datasets, etc more efficient, we can have spatially partitioned dataframes: the bounds of each partition is known, and thus it can be checked based on those bounds whether on operation needs to involve that partition or not.
And then geodataframes can also be re-partitioned to optimize the bounds (minimize the overlap) as much as possible (initial costly shuffle operation, but can pay-off later).
This complicates the implementation (we need to keep track of the spatial partitioning, the partitions can change during spatial operations, ..), but I think it will also be critical for improving performance on large datasets.
How can we add this?
In the previous iteration at https://github.com/mrocklin/dask-geopandas, the dataframes had an additional
_regions
attribute, which was a geopandas.GeoSeries with the "regions" of each partition (solen(regions) == npartitions
).See https://github.com/mrocklin/dask-geopandas/blob/8133969bf03d158f51faf85d020641e86c9a7e28/dask_geopandas/core.py#L50
I think one advantage of using a GeoSeries is that this makes it easy to work with (eg it is easy to check which partitions would intersect with a given geometry).
In
spatialpandas
(https://github.com/holoviz/spatialpandas), there is a combo ofpartition_bounds
andpartition_sindex
.The
partition_bounds
is basically thetotal_bounds
of each partition (so you could see it as the_regions
but limited to a rectangular box and stores as the 4 (minx, miny, maxx, maxy) numbers). And thenpartition_sindex
is a spatial index built on thepartition_bounds
.See https://github.com/holoviz/spatialpandas/blob/master/spatialpandas/dask.py
I suppose starting with a basic "partition bounds" should be fine, and allows to later expand it with a spatial index or with more fine-grained shapes.
The text was updated successfully, but these errors were encountered: