ENH: Use geometry equals method when using drop_duplicates() on a geometry column #3098
Comments
You can use

    gdf["geometry"] = gdf.normalize()
    gdf.drop_duplicates()

See

    geopandas.GeoSeries([
        shapely.LineString([(0, 0), (1, 0), (2, 0)]),
        shapely.LineString([(2, 0), (1, 0), (0, 0)]),
    ]).normalize().to_wkt()

    0    LINESTRING (0 0, 1 0, 2 0)
    1    LINESTRING (0 0, 1 0, 2 0)
    dtype: object

Using |
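For a whole GeoDataFrame, the same idea can be applied without overwriting the stored geometries. The following is a minimal sketch rather than an official recipe: the sample rows and the names `plain`, `keep` and `deduped` are made up for illustration, and it assumes a GeoPandas version that provides `normalize()`.

```python
import geopandas
from shapely.geometry import LineString

# Hypothetical data: rows "a" and "b" trace the same line in opposite directions.
gdf = geopandas.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[
        LineString([(0, 0), (1, 0), (2, 0)]),
        LineString([(2, 0), (1, 0), (0, 0)]),
        LineString([(0, 0), (0, 1)]),
    ],
)

# Plain drop_duplicates treats the reversed line as a distinct geometry.
plain = gdf.drop_duplicates(subset="geometry")

# Normalize only for the comparison, so the stored geometries stay untouched.
# Comparing the WKB bytes turns the duplicate check into an ordinary pandas operation.
keep = ~gdf.geometry.normalize().to_wkb().duplicated()
deduped = gdf[keep]

print(len(plain), len(deduped))
```

Note that normalizing puts coordinates into a canonical order but does not add or remove vertices, so two lines that cover the same path with different vertices would still count as distinct here.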
Ah that's wonderful, thank you very much! Is there somewhere appropriate I could add this to the documentation? Or a kind of 'hints and tips' page or something? I can't see a GeoPandas drop_duplicates docs page because I think it just uses the pandas methods |
That is a good question. We don't have a proper place right now it seems. But maybe adding a new page like "How to..." to the user guide and populating it with tips like this one would be a good thing. |
That sounds good - I know a number of other projects have those sorts of pages and I often find them useful (usually through finding them via search for some problem I'm having, then reading the rest of the page and learning a lot of useful tips). Should I just create a PR to create such a page, or do you want to discuss it with the rest of the maintainers and/or work out an appropriate location in the docs for it? |
You can go ahead and add a new page under User Guide. We can potentially move it within the PR if others prefer a different location. |
While it's good to think about a how-to page, in addition we could also consider for this case to inherit |
I've submitted a PR to add a How to page. If someone can point me to another method that has been inherited just to update the docstring then I'd be happy to do a PR to add that for this situation. (I had a brief look but couldn't immediately see a place where that had been done). |
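As a rough idea of what a docstring-only override could look like, here is a sketch on a plain pandas subclass. `HintedFrame` is a hypothetical stand-in and not GeoPandas code, and the docstring wording is invented for illustration.

```python
import pandas as pd


class HintedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass used only to illustrate the pattern."""

    @property
    def _constructor(self):
        # Ask pandas to build results with this subclass where it can.
        return HintedFrame

    def drop_duplicates(self, *args, **kwargs):
        """Drop duplicate rows, deferring entirely to pandas.

        Hint: geometry duplicates are detected from their exact coordinate
        order, so consider normalizing geometries first if reversed
        duplicates should be removed as well.
        """
        # No behaviour change: the override exists only to carry the docstring.
        return super().drop_duplicates(*args, **kwargs)


frame = HintedFrame({"a": [1, 1, 2]})
result = frame.drop_duplicates()
print(len(result))
```

The behaviour stays identical to the pandas method; only `help(HintedFrame.drop_duplicates)` changes.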
This works for me; I have 28000 linestrings and it is really fast to process: |
Is your feature request related to a problem?
Yes, I've got a GeoDataFrame with a large number of rows. Some of the geometries are duplicates of other geometries, but in some cases the list of co-ordinates is reversed. For example, for a LineString, one of the geometries might be P1->P2->P3->P4 (where PX is a point in the LineString), but another geometry might be P4->P3->P2->P1. In practice, these produce the same geometry on a map; they're just the reverse of each other. In this case, geom1.equals(geom2) is True, but other methods of testing equality give False.
However, when using the drop_duplicates('geometry') method on that GeoDataFrame, these duplicates are not detected and removed. It seems like the duplicates are being checked by checking a WKB representation of the data, in which the order of points is taken into account.
Describe the solution you'd like
I'd like an option with drop_duplicates to use the geometry equals() method, so that these sorts of duplicates can be picked up and removed.
API breaking implications
This should probably be kept behind a keyword option, such as use_geometry_equals, so that it doesn't change existing behaviour.
Describe alternatives you've considered
At the moment I've implemented a very simple, inefficient uniqueness check like this:
It runs pretty slowly (well, it starts fast when the uniques list is short, and gets slower). I'm sure I can do a better version using spatial indexing, but it'd be great to have something built-in.
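As an illustration only (this is not the code from the report), the difference between the equality tests and a pairwise check of the kind described could look like the sketch below. The helper name `unique_by_equals` and the sample lines are assumptions.

```python
from shapely.geometry import LineString

forward = LineString([(0, 0), (1, 0), (2, 0)])
backward = LineString([(2, 0), (1, 0), (0, 0)])

# Topological equality ignores the direction of the coordinates...
print(forward.equals(backward))     # True
# ...while a byte-level comparison of the WKB does not.
print(forward.wkb == backward.wkb)  # False


def unique_by_equals(geoms):
    """Keep the first geometry of every group of topologically equal ones.

    Hypothetical helper: each geometry is compared against every unique
    geometry found so far, so the cost grows roughly quadratically.
    """
    uniques = []
    for geom in geoms:
        if not any(geom.equals(seen) for seen in uniques):
            uniques.append(geom)
    return uniques


lines = [forward, backward, LineString([(0, 0), (0, 1)])]
uniques = unique_by_equals(lines)
print(len(uniques))  # 2
```

Because every new geometry is tested against the growing `uniques` list, the loop slows down as more distinct geometries accumulate, which matches the behaviour described above.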