ENH: Use geometry equals method when using drop_duplicates() on a geometry column #3098

Open
robintw opened this issue Dec 4, 2023 · 8 comments

robintw commented Dec 4, 2023

Is your feature request related to a problem?

Yes, I've got a GeoDataFrame with a large number of rows. Some of the geometries are duplicates of other geometries, but in some cases the order of the coordinates is reversed. For example, for a LineString, one of the geometries might be P1->P2->P3->P4 (where PX is a point in the LineString), but another geometry might be P4->P3->P2->P1. In practice, these produce the same geometry on a map; they're just the reverse of each other. In this case, geom1.equals(geom2) is True, but other methods of testing equality give False.

However, when using the drop_duplicates('geometry') method on that GeoDataFrame, these duplicates are not detected and removed. It seems like the duplicates are being checked by checking a WKB representation of the data, in which the order of points is taken into account.
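
A minimal sketch of the behaviour described above, assuming shapely 2.x and a recent GeoPandas. The two-row frame and all variable names are made up for illustration:

```python
import geopandas as gpd
from shapely.geometry import LineString

# Two lines covering the same points, digitised in opposite directions
forward = LineString([(0, 0), (1, 0), (2, 0)])
backward = LineString([(2, 0), (1, 0), (0, 0)])

gdf = gpd.GeoDataFrame({"name": ["a", "b"]}, geometry=[forward, backward])

# Topological comparison: ignores the direction of the line
topologically_equal = forward.equals(backward)

# Structural comparison: sensitive to coordinate order
structurally_equal = forward == backward

# drop_duplicates treats the two rows as different geometries
deduplicated = gdf.drop_duplicates("geometry")

print(topologically_equal, structurally_equal, len(deduplicated))
```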

Describe the solution you'd like

I'd like an option with drop_duplicates to use the geometry equals() method, so that these sorts of duplicates can be picked up and removed.

API breaking implications

This should probably be kept behind a keyword option, such as use_geometry_equals, so that it doesn't change existing behaviour.

Describe alternatives you've considered

At the moment I've implemented a very simple, inefficient uniqueness check like this:

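from tqdm import tqdm

uniques = []

# Quadratic in the number of rows: each row is compared against every row kept so far
for index, row in tqdm(gdf.iterrows(), total=len(gdf)):
    if not any(unique_row.geometry.equals(row.geometry) for unique_row in uniques):
        uniques.append(row)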

It runs pretty slowly (well, it starts fast when the uniques list is short, and gets slower). I'm sure I can do a better version using spatial indexing, but it'd be great to have something built-in.
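
One way the spatial-indexing idea might look, as a rough sketch rather than a tuned implementation. The function name `drop_topological_duplicates` is hypothetical, and the sketch assumes there are no missing or empty geometries. It relies on the fact that two equal geometries must intersect, so the index is used to narrow down the candidates before calling equals():

```python
import geopandas as gpd
from shapely.geometry import LineString


def drop_topological_duplicates(gdf):
    """Keep the first row of each group of topologically equal geometries.

    Hypothetical helper; assumes no missing or empty geometries.
    """
    geoms = gdf.geometry.values  # positional access, independent of index labels
    sindex = gdf.sindex
    kept = set()  # positions of the rows kept so far

    for pos, geom in enumerate(geoms):
        # Only geometries that intersect `geom` can possibly be equal to it
        candidates = sindex.query(geom, predicate="intersects")
        is_duplicate = any(
            int(c) in kept and geoms[int(c)].equals(geom) for c in candidates
        )
        if not is_duplicate:
            kept.add(pos)

    return gdf.iloc[sorted(kept)]


example = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[
        LineString([(0, 0), (1, 0), (2, 0)]),
        LineString([(2, 0), (1, 0), (0, 0)]),  # reverse of "a"
        LineString([(0, 0), (0, 1)]),          # touches "a" but is a different line
    ],
)
result = drop_topological_duplicates(example)
```

The pairwise equals() calls are still there, but only between geometries that the index reports as intersecting, which is usually a small subset.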

@martinfleis
Member

You can use normalize() to ensure the order of coordinates follows the canonical form and then use drop_duplicates() in its current form.

gdf["geometry"] = gdf.normalize()
gdf.drop_duplicates()

See

geopandas.GeoSeries([
    shapely.LineString([(0, 0), (1, 0), (2, 0)]),
    shapely.LineString([(2, 0), (1, 0), (0, 0)]),
]).normalize().to_wkt()

0    LINESTRING (0 0, 1 0, 2 0)
1    LINESTRING (0 0, 1 0, 2 0)
dtype: object

Using equals would not be very performant compared to chaining normalize with the current drop_duplicates. What we could do is add a keyword controlling the normalization within drop_duplicates, but that would require overriding pandas' drop_duplicates, and I'd rather ask users to normalize manually when it is needed.
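
If overwriting the geometry column is not desirable, a variant of the same idea is to use the normalized geometries only as a comparison key. This is a sketch under the assumption that the WKB of the normalized geometry is an adequate key for the data at hand; the example frame is made up:

```python
import geopandas as gpd
from shapely.geometry import LineString

gdf = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[
        LineString([(2, 0), (1, 0), (0, 0)]),
        LineString([(0, 0), (1, 0), (2, 0)]),  # reverse of "a"
        LineString([(0, 0), (0, 1)]),
    ],
)

# Normalized WKB bytes act as the comparison key; the geometry column is untouched
keys = gdf.geometry.normalize().to_wkb()

# Keep the first row for each key
deduplicated = gdf[~keys.duplicated()]
```

Note that normalization only canonicalizes things like coordinate order. Geometries that are topologically equal but have different vertices (for example, a line with an extra collinear point) would still be treated as distinct, whereas equals() would report them as equal.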

@robintw
Author

robintw commented Dec 4, 2023

Ah that's wonderful, thank you very much!

Is there somewhere appropriate I could add this to the documentation? Or a kind of 'hints and tips' page or something? I can't see a GeoPandas drop_duplicates docs page, because I think it just uses the pandas method.

@martinfleis
Member

That is a good question. We don't have a proper place right now it seems. But maybe adding a new page like "How to..." to the user guide and populating it with tips like this one would be a good thing.

@robintw
Author

robintw commented Dec 4, 2023

That sounds good - I know a number of other projects have those sorts of pages and I often find them useful (usually through finding them via search for some problem I'm having, then reading the rest of the page and learning a lot of useful tips).

Should I just create a PR to create such a page, or do you want to discuss it with the rest of the maintainers and/or work out an appropriate location in the docs for it?

@martinfleis
Member

martinfleis commented Dec 4, 2023

> Should I just create a PR to create such a page, or do you want to discuss it with the rest of the maintainers and/or work out an appropriate location in the docs for it?

You can go ahead and add a new page under User Guide. We can potentially move it within the PR if others prefer a different location.

@jorisvandenbossche
Member

While it's good to think about a how-to page, we could also consider, for this case, inheriting drop_duplicates just to update the parent docstring. I think it would be useful for the method docstring to note how duplicates for the geometry dtype are determined, and then we can also give the hint of using normalize first.
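
The general pattern of overriding a method only to change its docstring could look roughly like this. `DocumentedFrame` is a hypothetical plain pandas subclass used purely for illustration; it is not how GeoPandas is implemented, and the docstring wording is an assumption based on this thread:

```python
import pandas as pd


class DocumentedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass, used only to illustrate the pattern."""

    @property
    def _constructor(self):
        # Make pandas operations return DocumentedFrame instead of DataFrame
        return DocumentedFrame

    def drop_duplicates(self, *args, **kwargs):
        """Return the frame with duplicate rows removed.

        Behaves like ``pandas.DataFrame.drop_duplicates``. Geometries with the
        same shape but a different coordinate order are not treated as
        duplicates; call ``normalize()`` on the geometry column first if they
        should be.
        """
        # No behaviour change: delegate everything to the parent implementation
        return super().drop_duplicates(*args, **kwargs)


frame = DocumentedFrame({"a": [1, 1, 2]})
result = frame.drop_duplicates()
```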

@robintw
Author

robintw commented Dec 4, 2023

I've submitted a PR to add a How to page.

If someone can point me to another method that has been inherited just to update the docstring then I'd be happy to do a PR to add that for this situation. (I had a brief look but couldn't immediately see a place where that had been done).

@ZhengRen91

ZhengRen91 commented Apr 11, 2024

This works for me. I have 28000 linestrings and it is really fast to process:

linegdf = gpd.GeoDataFrame(geometry=trilines)
linegdf['geometry'] = linegdf.normalize()
linegdf_single = linegdf.drop_duplicates('geometry')
