Implement `explode_geometries` method #111

edpft · 2022-09-18T19:23:00Z

Closes #26 by implementing an explode_geometries method that takes a GeoSeries with multipart geometries and return a GeoSeries with those geometries split into multiple individual geometries.

I've taken the approach suggested in the original issue, of iterating through the geometries and appending them to a new series.

I've added a test cribbed from the geopandas example to prove this works and a very minimal docstring, also copied from geopandas.

A few things to note:

I went with explode_geometries instead of explode because when I used explode I was just getting the polars method of the same name. This is probably just ignorance on my part!
I'm not handling Geometry::GeometryCollection because I wasn't sure if I should be but I could do this recursively, if I abstract the match expression into a function.
My naming conversions are a little more verbose than yours (e.g. geometry instead of geom) but I'd be happy to conform if needed

For what it's worth, If I was going to implement this for your GeoDataFrame, I'd probably duplicate the other information and add a "geometry index" Series with the index of the geometry within the original multipart geometry.

kylebarron · 2022-09-19T08:58:09Z

geopolars/src/geoseries.rs

@@ -70,6 +70,9 @@ pub trait GeoSeries {
    /// Applies to GeoSeries containing only Polygons. Returns `None` for other geometry types.
    fn exterior(&self) -> Result<Series>;

+    /// Explodes multi-part geometries into multiple single geometries.
+    fn explode_geometries(&self) -> Result<Series>;


I see that polars already has an explode function, but in this case, the meaning is the same for both, so maybe we could call this fn explode as well?

Addressed in 28dd535

kylebarron · 2022-09-19T09:11:20Z

geopolars/src/geoseries.rs

+                    let points: Vec<Geometry> = geometry.into_iter().map(Geometry::Point).collect();
+                    nested_vector.push(points)


Instead of having nested_vector being vec![vec![geometry]], could we push individual parts into that function?

for geom in geometry.into_iter() { output_vector.push(Geometry::Point(geom)) }

That means

we avoid the extra vec construction for each geometry

we avoid an extra iteration down below when you flatten

Addressed in 43f9edb

kylebarron · 2022-09-19T09:12:02Z

Thanks for the PR!

I've never actually used explode from GeoPandas, so I'm just reading up on it now. My first thought is that it might make sense for us to consider the GeoDataFrame.explode and GeoSeries.explode implementations at the same time, so that we don't implement it for Series and then find that it's difficult to implement for GeoDataFrame.

In GeoPandas, it looks like explode sets a multiindex to keep track of which rows the geometries came from. Polars doesn't have a concept of an index, so we'll have to record indexes separately somehow.

I went with explode_geometries instead of explode because when I used explode I was just getting the polars method of the same name. This is probably just ignorance on my part!

Hmm I have to look up the syntax but I think there's a way for users to declare which trait a function comes from. So if we name it the same, I think users would have to "opt-in" to the GeoSeries trait's explode

I'm not handling Geometry::GeometryCollection because I wasn't sure if I should be but I could do this recursively, if I abstract the match expression into a function.

That's ok. I think for now it's ok to ignore GeometryCollection and we can come back in the future when we want to make things more stable.

For what it's worth, If I was going to implement this for your GeoDataFrame, I'd probably duplicate the other information and add a "geometry index" Series with the index of the geometry within the original multipart geometry.

For the GeoDataFrame case, I'm curious if there's an existing polars function that would allow us to provide a list of indexes and have polars automatically duplicate rows. Maybe we could leverage join here somehow

brendan-ward · 2022-09-19T16:19:30Z

I'm not yet familiar enough with GeoPolars to be sure I'm saying the right thing here, so take this with a grain of salt - these are just thoughts from the GeoPandas implementation in case those are useful here (I regularly use .explode() and the underlying get_parts).

In case it helps, the implementation for exploding geometries used by GeoPandas comes from PyGEOS / Shapely here (in Cython).

We're able to pre-calculate the size of the resulting vector by summing the results from GEOSGetNumGeometries_r for each geometry in the input vector, then we fill each slot in the vector via GEOSGetGeometryN_r. Performance-wise, I don't know how it would differ in Rust to build the vector as you iterate vs pre-allocate and fill, since the latter involves additional function calls but theoretically less moving of memory.

If I recall correctly, both the count and the individual geometry accessor return the next level down and does not recurse to the most granular geometry type. So if you have a GeometryCollection(MultiPolygon(Polygon)), you'd get back a MultiPolygon. Hopefully deeply-nested types are awkward, and a user can always detect that by inspecting the resulting types and loop until there are no more container types; seems just fine to not recurse in the implementation here either.

More important than the MultiIndex used by GeoPandas is simply getting back the index of the original geometries, so that you can join back to the original attributes. So long as the individual (inner) geometries are returned in the same order as they occur in the input geometries, it is possible to calculate their inner index. I'm not sure how often those inner indexes are used, but they seem less critical overall.

edpft · 2022-09-19T20:28:56Z

To assess the options I wrote a benchmark exploding a series of 1 million two-point MultiPoints into a series of 2 million Points.

My initial implementation:

explode                 time:   [1.1605 s 1.1769 s 1.1977 s]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

With @kylebarron's suggestion of iterating with the match arms and pushing directly to an exploded_vector:

explode                 time:   [1.0562 s 1.0753 s 1.0974 s]
                        change: [-10.774% -8.6362% -6.1199%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

I also tried pre-allocating as, I think (apologies if I'm misinterpreting!), @brendan-ward is suggesting:

- let mut exploded_vector = Vec::new();
+ let capacity = self.estimated_size();
+ let mut exploded_vector = Vec::with_capacity(capacity);

But it didn't didn't make a noticeable difference:

explode                 time:   [1.0825 s 1.1029 s 1.1260 s]
                        change: [-0.1609% +2.5676% +5.6027%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

I've since changed the benchmark to exploding a series of 45k two-point MultiPoints into a series of 90k Points so that cargo bench can complete 100 samples in 5s.

edpft · 2022-09-19T20:39:13Z

For the GeoDataFrame case, I'm curious if there's an existing polars function that would allow us to provide a list of indexes and have polars automatically duplicate rows. Maybe we could leverage join here somehow

More important than the MultiIndex used by GeoPandas is simply getting back the index of the original geometries, so that you can join back to the original attributes. So long as the individual (inner) geometries are returned in the same order as they occur in the input geometries, it is possible to calculate their inner index. I'm not sure how often those inner indexes are used, but they seem less critical overall.

So, could a solution be for GeoSeries::explode to return a GeoDataFrame of two series: the index of the original geometries and the exploded geometries? I can see the benefit when GeoSeries::explode is called in a future GeoDataFrame::explode method but is that necessary in the GeoSeries implementation?

I'm happy to take a swing at it but I feel it should be a separate issue / pull request.

kylebarron · 2022-09-19T21:01:33Z

Thanks @brendan-ward for chiming in!

In case it helps, the implementation for exploding geometries used by GeoPandas comes from PyGEOS / Shapely here (in Cython).

I figure when we get around to focusing on perf improvements we'll be looking at shapely code often 🙂.

@edpft Awesome! Thanks for making a benchmark! Looks great and good to see that removing that extra vec allocation did make a difference! (I tried to figure out an example benchmarking setup previously but didn't get around to making many benches. btw is that output just cargo bench?)

I think @brendan-ward's suggestion here is more involved than just changing Vec::with_capacity. I think it would consist of two loops over the input data: one quick one just to count how big the output arrays should be. Then we can allocate arrays and just fill coordinates on the second loop. I do think this would likely be faster, but maybe not until we can move to an arrow-native geometry encoding (#31)

Edit: to be clear, definitely not necessary for this PR; at some time in the future when we're more focused on performance would be a good time to revisit

edpft · 2022-09-19T21:11:29Z

Hi @kylebarron, yes, it's just cargo bench.

I thought @brendan-ward's suggestion might be more involved, but I was hoping I could approximate it by allocating a Vec with capacity. Is this something you'd like me to look at? I'm here for the learning more than anything, so I'm happy to keep working on this PR.

kylebarron · 2022-09-19T21:13:14Z

So, could a solution be for GeoSeries::explode to return a GeoDataFrame of two series: the index of the original geometries and the exploded geometries?

What I had in mind was returning a second uint column of indexes that mapped the row index of existing geometries into the row index of the exploded geometries (or the reverse, not sure which would be easier to work with).

But I don't know in general the best practices for returning two columns from a polars function. Like in pandas you can assign two columns onto an existing table by doing something like

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5]})
df[['col1_dup', 'col2_dup']] = df

and then df has 4 columns. Not sure what the usual way to do that in polars is. This is also why I got stuck on #11

I'm happy to take a swing at it but I feel it should be a separate issue / pull request.

Yeah I agree, just saying it's useful to discuss more the end goal of what the function's API should be, which we can revisit in future PRs.

kylebarron · 2022-09-19T21:16:18Z

Is this something you'd like me to look at?

Honestly I don't think it's necessary at this stage. Most importantly your implementation is correct and plenty fast. My personal preference is to first focus on just getting things working, and then in the future we can come back and optimize performance

kylebarron · 2022-09-19T21:20:26Z

Sometime in the future we'll also want to add this to the Python bindings, but it wouldn't hurt to discuss the explode API a bit more first

initial commit

7372ebf

kylebarron reviewed Sep 19, 2022

View reviewed changes

edpft added 4 commits September 19, 2022 21:00

iterate and push within the match arms

43f9edb

use constructor method instead of macro

df8ffd2

rename method

28dd535

add benchmark for explode

670eedb

edpft requested a review from kylebarron September 19, 2022 20:39

kylebarron approved these changes Sep 19, 2022

View reviewed changes

kylebarron merged commit 028a3f9 into geopolars:master Sep 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `explode_geometries` method #111

Implement `explode_geometries` method #111

edpft commented Sep 18, 2022

kylebarron Sep 19, 2022

edpft Sep 19, 2022

kylebarron Sep 19, 2022

edpft Sep 19, 2022

kylebarron commented Sep 19, 2022

brendan-ward commented Sep 19, 2022

edpft commented Sep 19, 2022 •

edited

edpft commented Sep 19, 2022 •

edited

kylebarron commented Sep 19, 2022 •

edited

edpft commented Sep 19, 2022

kylebarron commented Sep 19, 2022

kylebarron commented Sep 19, 2022

kylebarron commented Sep 19, 2022

		let points: Vec<Geometry> = geometry.into_iter().map(Geometry::Point).collect();
		nested_vector.push(points)

Implement explode_geometries method #111

Implement explode_geometries method #111

Conversation

edpft commented Sep 18, 2022

kylebarron Sep 19, 2022

Choose a reason for hiding this comment

edpft Sep 19, 2022

Choose a reason for hiding this comment

kylebarron Sep 19, 2022

Choose a reason for hiding this comment

edpft Sep 19, 2022

Choose a reason for hiding this comment

kylebarron commented Sep 19, 2022

brendan-ward commented Sep 19, 2022

edpft commented Sep 19, 2022 • edited

edpft commented Sep 19, 2022 • edited

kylebarron commented Sep 19, 2022 • edited

edpft commented Sep 19, 2022

kylebarron commented Sep 19, 2022

kylebarron commented Sep 19, 2022

kylebarron commented Sep 19, 2022

Implement `explode_geometries` method #111

Implement `explode_geometries` method #111

edpft commented Sep 19, 2022 •

edited

edpft commented Sep 19, 2022 •

edited

kylebarron commented Sep 19, 2022 •

edited