New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement explode_geometries
method
#111
Conversation
geopolars/src/geoseries.rs
Outdated
@@ -70,6 +70,9 @@ pub trait GeoSeries { | |||
/// Applies to GeoSeries containing only Polygons. Returns `None` for other geometry types. | |||
fn exterior(&self) -> Result<Series>; | |||
|
|||
/// Explodes multi-part geometries into multiple single geometries. | |||
fn explode_geometries(&self) -> Result<Series>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that polars already has an explode
function, but in this case, the meaning is the same for both, so maybe we could call this fn explode
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 28dd535
geopolars/src/geoseries.rs
Outdated
let points: Vec<Geometry> = geometry.into_iter().map(Geometry::Point).collect(); | ||
nested_vector.push(points) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having nested_vector
being vec![vec![geometry]]
, could we push individual parts into that function?
for geom in geometry.into_iter() {
output_vector.push(Geometry::Point(geom))
}
That means
- we avoid the extra vec construction for each geometry
- we avoid an extra iteration down below when you flatten
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 43f9edb
Thanks for the PR! I've never actually used In GeoPandas, it looks like
Hmm I have to look up the syntax but I think there's a way for users to declare which trait a function comes from. So if we name it the same, I think users would have to "opt-in" to the
That's ok. I think for now it's ok to ignore GeometryCollection and we can come back in the future when we want to make things more stable.
For the |
I'm not yet familiar enough with GeoPolars to be sure I'm saying the right thing here, so take this with a grain of salt - these are just thoughts from the GeoPandas implementation in case those are useful here (I regularly use In case it helps, the implementation for exploding geometries used by GeoPandas comes from PyGEOS / Shapely here (in Cython). We're able to pre-calculate the size of the resulting vector by summing the results from If I recall correctly, both the count and the individual geometry accessor return the next level down and does not recurse to the most granular geometry type. So if you have a GeometryCollection(MultiPolygon(Polygon)), you'd get back a MultiPolygon. Hopefully deeply-nested types are awkward, and a user can always detect that by inspecting the resulting types and loop until there are no more container types; seems just fine to not recurse in the implementation here either. More important than the MultiIndex used by GeoPandas is simply getting back the index of the original geometries, so that you can join back to the original attributes. So long as the individual (inner) geometries are returned in the same order as they occur in the input geometries, it is possible to calculate their inner index. I'm not sure how often those inner indexes are used, but they seem less critical overall. |
To assess the options I wrote a benchmark exploding a series of 1 million two-point My initial implementation: explode time: [1.1605 s 1.1769 s 1.1977 s]
Found 5 outliers among 100 measurements (5.00%)
2 (2.00%) high mild
3 (3.00%) high severe With @kylebarron's suggestion of iterating with the match arms and pushing directly to an explode time: [1.0562 s 1.0753 s 1.0974 s]
change: [-10.774% -8.6362% -6.1199%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe I also tried pre-allocating as, I think (apologies if I'm misinterpreting!), @brendan-ward is suggesting: - let mut exploded_vector = Vec::new();
+ let capacity = self.estimated_size();
+ let mut exploded_vector = Vec::with_capacity(capacity); But it didn't didn't make a noticeable difference: explode time: [1.0825 s 1.1029 s 1.1260 s]
change: [-0.1609% +2.5676% +5.6027%] (p = 0.08 > 0.05)
No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
7 (7.00%) high mild
5 (5.00%) high severe I've since changed the benchmark to exploding a series of 45k two-point |
So, could a solution be for I'm happy to take a swing at it but I feel it should be a separate issue / pull request. |
Thanks @brendan-ward for chiming in!
I figure when we get around to focusing on perf improvements we'll be looking at shapely code often 🙂. @edpft Awesome! Thanks for making a benchmark! Looks great and good to see that removing that extra vec allocation did make a difference! (I tried to figure out an example benchmarking setup previously but didn't get around to making many benches. btw is that output just I think @brendan-ward's suggestion here is more involved than just changing Edit: to be clear, definitely not necessary for this PR; at some time in the future when we're more focused on performance would be a good time to revisit |
Hi @kylebarron, yes, it's just I thought @brendan-ward's suggestion might be more involved, but I was hoping I could approximate it by allocating a |
What I had in mind was returning a second But I don't know in general the best practices for returning two columns from a polars function. Like in pandas you can assign two columns onto an existing table by doing something like df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 4, 5]})
df[['col1_dup', 'col2_dup']] = df and then
Yeah I agree, just saying it's useful to discuss more the end goal of what the function's API should be, which we can revisit in future PRs. |
Honestly I don't think it's necessary at this stage. Most importantly your implementation is correct and plenty fast. My personal preference is to first focus on just getting things working, and then in the future we can come back and optimize performance |
Sometime in the future we'll also want to add this to the Python bindings, but it wouldn't hurt to discuss the explode API a bit more first |
Closes #26 by implementing an
explode_geometries
method that takes aGeoSeries
with multipart geometries and return aGeoSeries
with those geometries split into multiple individual geometries.I've taken the approach suggested in the original issue, of iterating through the geometries and appending them to a new series.
I've added a test cribbed from the
geopandas
example to prove this works and a very minimal docstring, also copied fromgeopandas
.A few things to note:
explode_geometries
instead ofexplode
because when I usedexplode
I was just getting thepolars
method of the same name. This is probably just ignorance on my part!Geometry::GeometryCollection
because I wasn't sure if I should be but I could do this recursively, if I abstract the match expression into a function.geometry
instead ofgeom
) but I'd be happy to conform if neededFor what it's worth, If I was going to implement this for your
GeoDataFrame
, I'd probably duplicate the other information and add a "geometry index"Series
with the index of the geometry within the original multipart geometry.