Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement explode_geometries method #111

Merged
merged 5 commits into from Sep 19, 2022
Merged

Conversation

edpft
Copy link
Contributor

@edpft edpft commented Sep 18, 2022

Closes #26 by implementing an explode_geometries method that takes a GeoSeries with multipart geometries and return a GeoSeries with those geometries split into multiple individual geometries.

I've taken the approach suggested in the original issue, of iterating through the geometries and appending them to a new series.

I've added a test cribbed from the geopandas example to prove this works and a very minimal docstring, also copied from geopandas.

A few things to note:

  • I went with explode_geometries instead of explode because when I used explode I was just getting the polars method of the same name. This is probably just ignorance on my part!
  • I'm not handling Geometry::GeometryCollection because I wasn't sure if I should be but I could do this recursively, if I abstract the match expression into a function.
  • My naming conversions are a little more verbose than yours (e.g. geometry instead of geom) but I'd be happy to conform if needed

For what it's worth, If I was going to implement this for your GeoDataFrame, I'd probably duplicate the other information and add a "geometry index" Series with the index of the geometry within the original multipart geometry.

@@ -70,6 +70,9 @@ pub trait GeoSeries {
/// Applies to GeoSeries containing only Polygons. Returns `None` for other geometry types.
fn exterior(&self) -> Result<Series>;

/// Explodes multi-part geometries into multiple single geometries.
fn explode_geometries(&self) -> Result<Series>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that polars already has an explode function, but in this case, the meaning is the same for both, so maybe we could call this fn explode as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 28dd535

Comment on lines 323 to 324
let points: Vec<Geometry> = geometry.into_iter().map(Geometry::Point).collect();
nested_vector.push(points)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having nested_vector being vec![vec![geometry]], could we push individual parts into that function?

for geom in geometry.into_iter() {
    output_vector.push(Geometry::Point(geom))
}

That means

  1. we avoid the extra vec construction for each geometry
  2. we avoid an extra iteration down below when you flatten

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 43f9edb

@kylebarron
Copy link
Member

Thanks for the PR!

I've never actually used explode from GeoPandas, so I'm just reading up on it now. My first thought is that it might make sense for us to consider the GeoDataFrame.explode and GeoSeries.explode implementations at the same time, so that we don't implement it for Series and then find that it's difficult to implement for GeoDataFrame.

In GeoPandas, it looks like explode sets a multiindex to keep track of which rows the geometries came from. Polars doesn't have a concept of an index, so we'll have to record indexes separately somehow.

  • I went with explode_geometries instead of explode because when I used explode I was just getting the polars method of the same name. This is probably just ignorance on my part!

Hmm I have to look up the syntax but I think there's a way for users to declare which trait a function comes from. So if we name it the same, I think users would have to "opt-in" to the GeoSeries trait's explode

  • I'm not handling Geometry::GeometryCollection because I wasn't sure if I should be but I could do this recursively, if I abstract the match expression into a function.

That's ok. I think for now it's ok to ignore GeometryCollection and we can come back in the future when we want to make things more stable.

For what it's worth, If I was going to implement this for your GeoDataFrame, I'd probably duplicate the other information and add a "geometry index" Series with the index of the geometry within the original multipart geometry.

For the GeoDataFrame case, I'm curious if there's an existing polars function that would allow us to provide a list of indexes and have polars automatically duplicate rows. Maybe we could leverage join here somehow

@brendan-ward
Copy link

I'm not yet familiar enough with GeoPolars to be sure I'm saying the right thing here, so take this with a grain of salt - these are just thoughts from the GeoPandas implementation in case those are useful here (I regularly use .explode() and the underlying get_parts).

In case it helps, the implementation for exploding geometries used by GeoPandas comes from PyGEOS / Shapely here (in Cython).

We're able to pre-calculate the size of the resulting vector by summing the results from GEOSGetNumGeometries_r for each geometry in the input vector, then we fill each slot in the vector via GEOSGetGeometryN_r. Performance-wise, I don't know how it would differ in Rust to build the vector as you iterate vs pre-allocate and fill, since the latter involves additional function calls but theoretically less moving of memory.

If I recall correctly, both the count and the individual geometry accessor return the next level down and does not recurse to the most granular geometry type. So if you have a GeometryCollection(MultiPolygon(Polygon)), you'd get back a MultiPolygon. Hopefully deeply-nested types are awkward, and a user can always detect that by inspecting the resulting types and loop until there are no more container types; seems just fine to not recurse in the implementation here either.

More important than the MultiIndex used by GeoPandas is simply getting back the index of the original geometries, so that you can join back to the original attributes. So long as the individual (inner) geometries are returned in the same order as they occur in the input geometries, it is possible to calculate their inner index. I'm not sure how often those inner indexes are used, but they seem less critical overall.

@edpft
Copy link
Contributor Author

edpft commented Sep 19, 2022

To assess the options I wrote a benchmark exploding a series of 1 million two-point MultiPoints into a series of 2 million Points.

My initial implementation:

explode                 time:   [1.1605 s 1.1769 s 1.1977 s]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

With @kylebarron's suggestion of iterating with the match arms and pushing directly to an exploded_vector:

explode                 time:   [1.0562 s 1.0753 s 1.0974 s]
                        change: [-10.774% -8.6362% -6.1199%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

I also tried pre-allocating as, I think (apologies if I'm misinterpreting!), @brendan-ward is suggesting:

- let mut exploded_vector = Vec::new();
+ let capacity = self.estimated_size();
+ let mut exploded_vector = Vec::with_capacity(capacity);

But it didn't didn't make a noticeable difference:

explode                 time:   [1.0825 s 1.1029 s 1.1260 s]
                        change: [-0.1609% +2.5676% +5.6027%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

I've since changed the benchmark to exploding a series of 45k two-point MultiPoints into a series of 90k Points so that cargo bench can complete 100 samples in 5s.

@edpft
Copy link
Contributor Author

edpft commented Sep 19, 2022

For the GeoDataFrame case, I'm curious if there's an existing polars function that would allow us to provide a list of indexes and have polars automatically duplicate rows. Maybe we could leverage join here somehow

More important than the MultiIndex used by GeoPandas is simply getting back the index of the original geometries, so that you can join back to the original attributes. So long as the individual (inner) geometries are returned in the same order as they occur in the input geometries, it is possible to calculate their inner index. I'm not sure how often those inner indexes are used, but they seem less critical overall.

So, could a solution be for GeoSeries::explode to return a GeoDataFrame of two series: the index of the original geometries and the exploded geometries? I can see the benefit when GeoSeries::explode is called in a future GeoDataFrame::explode method but is that necessary in the GeoSeries implementation?

I'm happy to take a swing at it but I feel it should be a separate issue / pull request.

@kylebarron
Copy link
Member

kylebarron commented Sep 19, 2022

Thanks @brendan-ward for chiming in!

In case it helps, the implementation for exploding geometries used by GeoPandas comes from PyGEOS / Shapely here (in Cython).

I figure when we get around to focusing on perf improvements we'll be looking at shapely code often 🙂.


@edpft Awesome! Thanks for making a benchmark! Looks great and good to see that removing that extra vec allocation did make a difference! (I tried to figure out an example benchmarking setup previously but didn't get around to making many benches. btw is that output just cargo bench?)

I think @brendan-ward's suggestion here is more involved than just changing Vec::with_capacity. I think it would consist of two loops over the input data: one quick one just to count how big the output arrays should be. Then we can allocate arrays and just fill coordinates on the second loop. I do think this would likely be faster, but maybe not until we can move to an arrow-native geometry encoding (#31)

Edit: to be clear, definitely not necessary for this PR; at some time in the future when we're more focused on performance would be a good time to revisit

@edpft
Copy link
Contributor Author

edpft commented Sep 19, 2022

Hi @kylebarron, yes, it's just cargo bench.

I thought @brendan-ward's suggestion might be more involved, but I was hoping I could approximate it by allocating a Vec with capacity. Is this something you'd like me to look at? I'm here for the learning more than anything, so I'm happy to keep working on this PR.

@kylebarron
Copy link
Member

So, could a solution be for GeoSeries::explode to return a GeoDataFrame of two series: the index of the original geometries and the exploded geometries?

What I had in mind was returning a second uint column of indexes that mapped the row index of existing geometries into the row index of the exploded geometries (or the reverse, not sure which would be easier to work with).

But I don't know in general the best practices for returning two columns from a polars function. Like in pandas you can assign two columns onto an existing table by doing something like

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5]})
df[['col1_dup', 'col2_dup']] = df

and then df has 4 columns. Not sure what the usual way to do that in polars is. This is also why I got stuck on #11

I'm happy to take a swing at it but I feel it should be a separate issue / pull request.

Yeah I agree, just saying it's useful to discuss more the end goal of what the function's API should be, which we can revisit in future PRs.

@kylebarron
Copy link
Member

Is this something you'd like me to look at?

Honestly I don't think it's necessary at this stage. Most importantly your implementation is correct and plenty fast. My personal preference is to first focus on just getting things working, and then in the future we can come back and optimize performance

@kylebarron
Copy link
Member

Sometime in the future we'll also want to add this to the Python bindings, but it wouldn't hurt to discuss the explode API a bit more first

@kylebarron kylebarron merged commit 028a3f9 into geopolars:master Sep 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement: Explode
3 participants