ENH: Implement Hilbert distance #70

Merged
merged 38 commits into from
Jul 22, 2021

tastatham
Contributor

Implementing hilbert_distance for Dask-GeoPandas - based on the _with_hilbert_distance_column function from SpatialPandas.

Details:

  • Calculating the hilbert distance will allow us to partition Dask-GeoPandas using this information.
  • This supports both GeoPandas & Dask-GeoPandas
  • The output only returns the calculated hilbert distance as a numpy array - not as a column
  • Numba is currently supported but I have commented out these parts - we need to discuss whether we want numba as a dependency.
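For readers unfamiliar with the idea, the sketch below shows what a 2D Hilbert distance is, using the classic bit-manipulation formulation. It is not the code from this PR (which is based on SpatialPandas); the function name `hilbert_distance_2d` and the curve orientation are assumptions made for illustration, so the actual values produced by the PR may differ.

```python
def hilbert_distance_2d(x, y, p):
    """Distance of integer cell (x, y) along a Hilbert curve on a 2**p x 2**p grid.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    n = 1 << p          # grid side length
    d = 0
    s = n >> 1          # size of the quadrant examined in this step
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Sorting grid cells by their distance walks the grid one neighbouring cell at a time,
# which is why the value is useful as a spatial sort/partition key.
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda c: hilbert_distance_2d(c[0], c[1], p=2))
```

Because consecutive distances always belong to adjacent cells, geometries that are close along the curve are also close in space, which is the property that makes the distance suitable for partitioning.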

Member

@martinfleis martinfleis left a comment

This looks really good and works like a charm! Thanks a lot!

The output only returns the calculated hilbert distance as a numpy array - not as a column

That is fine I think, you can always do ddf["hilbert"] = ddf.hilbert_distance() if you need it as a column.

Numba is currently supported but I have commented out these parts - we need to discuss whether we want numba as a dependency.

The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.

Can you ensure you follow the black & flake8 code styles to make CI happy?

Regarding tests, can we add one testing the correctness of the actual Hilbert distance (including different p) apart from the existing check that geopandas-based and dask-geopandas-based results are equal?

dask_geopandas/core.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
@tastatham
Contributor Author

tastatham commented Jul 7, 2021

The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.

Yes, we can discuss this tomorrow.

Can you ensure you follow the black & flake8 code styles to make CI happy?

I totally forgot to follow the styling rules! I will update this

Regarding tests, can we add one testing the correctness of the actual Hilbert distance (including different p) apart from the existing check that geopandas-based and dask-geopandas-based results are equal?

Yes, I can do this. I did think about this but didn't know how to define the "correctness of the hilbert distance" without referencing another package e.g. SpatialPandas or https://github.com/galtay/hilbertcurve

@jorisvandenbossche
Member

I totally forgot to follow the styling rules! I will update this

I can recommend using pre-commit! (https://geopandas.readthedocs.io/en/latest/community/contributing.html#style-guide-linting)

@martinfleis
Member

I did think about this but didn't know how to define the "correctness of the hilbert distance" without referencing another package

It can be indirect. You can define the expected order of geometries and test whether sorting according to Hilbert distance produces the same order. I'd say that for some dummy geometries, you can also hard-code the expected values (just to test that they do not unexpectedly change in some future PR).
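A rough sketch of what such tests could look like, written against a stand-in pure-Python function rather than the PR's API. The name `hilbert_distance_2d`, the grid cells used as dummy "geometries", and the hard-coded values are all assumptions tied to this stand-in; real tests would call the package's own function and pin its own outputs.

```python
def hilbert_distance_2d(x, y, p):
    """Stand-in Hilbert distance on a 2**p x 2**p grid (hypothetical, for illustration)."""
    n = 1 << p
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def test_sorting_by_hilbert_distance_gives_expected_order():
    # Indirect check: one dummy point per cell of a 2x2 grid, deliberately shuffled.
    points = {"d": (1, 0), "b": (0, 1), "a": (0, 0), "c": (1, 1)}
    order = sorted(points, key=lambda name: hilbert_distance_2d(*points[name], p=1))
    assert order == ["a", "b", "c", "d"]


def test_hard_coded_values_do_not_change():
    # Regression check: pinned values for a few cells at a different p (p=2).
    expected = {(0, 0): 0, (1, 1): 2, (0, 2): 4, (3, 3): 10, (3, 0): 15}
    for (x, y), distance in expected.items():
        assert hilbert_distance_2d(x, y, p=2) == distance
```

The first test does not depend on the exact numbers, only on the resulting order; the second guards against accidental changes to the values themselves.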

@tastatham
Contributor Author

I totally forgot to follow the styling rules! I will update this

I can recommend using pre-commit! (https://geopandas.readthedocs.io/en/latest/community/contributing.html#style-guide-linting)

Thanks! This is useful.

Member

@jorisvandenbossche jorisvandenbossche left a comment

Nice! Added a few comments.

Another idea for the tests:

  • add tests using different geometry types as input (polygon, line, point)

dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Show resolved Hide resolved
geom_mids = [
    ((bounds[:, 0] + bounds[:, 2]) / 2.0),
    ((bounds[:, 1] + bounds[:, 3]) / 2.0),
]
Member

In general, do we know if there is an advantage to using such midpoints vs "centroid" vs "representative point" ?

(If there is no theoretical/practical reason for one or the other, we should maybe check which one is typically the cheapest to compute. EDIT: based on a quick check, calculating the mids from the bounds seems much faster.)

Contributor Author

@tastatham tastatham Jul 8, 2021

My prior was that this approach would be faster - but I should have checked manually.

Member

This result makes sense: the operations are in increasing levels of complexity. Midpoint is simple math based on bounds, centroid is more complex based on calculating the "center of mass" of the geometry (varies by geometry type), and "representative point" (point on surface) is probably yet more complex since it needs to ensure the result intersects the polygon.
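To illustrate the difference in work involved, here is a small pure-Python sketch comparing the midpoint of the bounds with an area-weighted centroid (shoelace formula) for an L-shaped polygon. The helper names and the polygon are made up for the example; the PR itself works on arrays of bounds, not on vertex lists.

```python
def bounds_midpoint(ring):
    """Midpoint of the bounding box: only needs min/max of the coordinates."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)


def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon via the shoelace formula."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def in_l_shape(point):
    """Membership test specific to the example polygon below."""
    x, y = point
    return (0 <= x <= 4 and 0 <= y <= 1) or (0 <= x <= 1 and 1 <= y <= 4)


# L-shaped polygon: a 4x1 horizontal bar plus a 1x3 vertical bar.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
mid = bounds_midpoint(l_shape)
centroid = polygon_centroid(l_shape)
```

For this shape both the midpoint and the centroid fall outside the polygon, which is the situation a "representative point" is designed to avoid, at the price of a more involved computation.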

More of a theoretical question (longer term): the key thing to consider here is how representative the midpoints are for ordering geometries along the Hilbert curve: what is the tradeoff between how well the points represent the locations of the geometries for partitioning (e.g., suboptimal partitions) and the cost of spatial operations performed against those partitions? Put differently: using the midpoint for Hilbert may produce partitions quickly, but if those partitions are suboptimal for overlay operations and make them much slower, then maybe it is worth using a somewhat more expensive method for getting the representative points. To find out, one would need to compare the full compute time of calculating the Hilbert distance, repartitioning, and overlay. I'm thinking of a case like the admin boundaries of France, which include overseas territories. The midpoint of the bounds will be far away from any of those boundaries. The centroid will maybe be a bit better but still far away from those boundaries. The representative point would be in continental France (I think?) and thus be better suited for repartitioning and then overlay with other European polygons. Though this could easily be solved by exploding into single-part geometries...
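The multi-part case can be sketched with plain bounding boxes. The two boxes below are invented stand-ins for a "mainland" part and an "overseas" part, and the helper names are hypothetical; the point is only to show where the midpoint of the overall bounds ends up, and what exploding into single parts changes.

```python
def bounds_midpoint(bounds):
    """Midpoint of a (minx, miny, maxx, maxy) bounding box."""
    minx, miny, maxx, maxy = bounds
    return ((minx + maxx) / 2.0, (miny + maxy) / 2.0)


def union_bounds(parts):
    """Bounding box covering all parts of a multi-part geometry."""
    return (
        min(b[0] for b in parts),
        min(b[1] for b in parts),
        max(b[2] for b in parts),
        max(b[3] for b in parts),
    )


def contains(bounds, point):
    minx, miny, maxx, maxy = bounds
    x, y = point
    return minx <= x <= maxx and miny <= y <= maxy


# Invented coordinates: one large part and one small, distant part.
mainland = (0.0, 0.0, 10.0, 10.0)
overseas = (100.0, -50.0, 105.0, -45.0)
multipart = [mainland, overseas]

# Midpoint of the whole multi-part geometry: lands between the parts.
whole_mid = bounds_midpoint(union_bounds(multipart))

# "Exploded" into single parts: each midpoint lies within its own part.
part_mids = [bounds_midpoint(part) for part in multipart]
```

With the single midpoint, the row would be sorted to a location where none of its parts actually are; after exploding, each part gets a position along the curve that reflects where it is.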

Member

Midpoints (and the same goes for centroids / representative points) might also give suboptimal results if you have a mix of large and small polygons. I was thinking it could be something to explore (later) whether you could calculate the Hilbert distance for, e.g., the bounding box corner points, and consider those 4 points per row together when deciding the partitions. But then of course it's no longer a simple "sort the hilbert_distance column" to determine the partitions.

dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Member

jorisvandenbossche commented Jul 8, 2021

The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.

Using a point dataset (a subset of the OSM GPS points, first 1 million points) and only focusing on the actual algorithm (after calculating the integer coordinates and the total bounds), I get a 200x speed-up when using numba (with the one missing @ngjit added).
I also quickly compared with https://github.com/PrincetonLIPS/numpy-hilbert-curve, which is faster than the plain version here, but slower than the numba version: ~1min50 without numba, ~5s for numpy-hilbert-curve and ~500ms with numba.
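One way to keep numba optional while still benefiting from it when installed is a no-op fallback decorator. The sketch below is an assumption about how that could be wired up, not what the PR does: it uses `numba.njit` rather than the `ngjit` helper mentioned above, and the function names are hypothetical. It also shows the warm-up call needed so that compilation time is not counted in the measurement.

```python
import time

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        """Fallback: leave the function as plain Python when numba is missing."""
        return func


@njit
def hilbert_distance_2d(x, y, p):
    # Stand-in Hilbert distance on a 2**p x 2**p grid (illustrative only).
    n = 1 << p
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


@njit
def sum_of_distances(p):
    # Small benchmark kernel: visit every cell of the grid once.
    n = 1 << p
    total = 0
    for x in range(n):
        for y in range(n):
            total += hilbert_distance_2d(x, y, p)
    return total


sum_of_distances(2)  # warm-up call: triggers compilation if numba is present

start = time.perf_counter()
result = sum_of_distances(6)
elapsed = time.perf_counter() - start
print(f"sum over 64x64 grid: {result}, took {elapsed:.6f}s")
```

The same script runs with or without numba, so the speed-up can be measured by running it in two environments; the absolute numbers will of course differ from the ones reported above.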

Member

@brendan-ward brendan-ward left a comment

I didn't have a chance to fully review, but added a few comments to consider.

dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
@martinfleis martinfleis added this to the 0.1 milestone Jul 12, 2021
@martinfleis
Member

@tastatham can you give us a short summary of recent commits and an overview of what is still missing (at least tests for sure)?

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the updates! A few more comments

dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
continuous_integration/envs/38-latest.yaml Outdated Show resolved Hide resolved
continuous_integration/envs/39-dev.yaml Outdated Show resolved Hide resolved
continuous_integration/envs/39-latest.yaml Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Outdated Show resolved Hide resolved
tests/test_core.py Outdated Show resolved Hide resolved
tastatham and others added 10 commits July 15, 2021 11:05
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Explicitly call total_bounds in _continuous_to_discrete_coords

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@martinfleis
Member

martinfleis commented Jul 15, 2021

@tastatham I posted this on gitter but it got lost in the following conversation. Could you make a tiny PR adding backticks around "set_geometry" (to format it as code) in the README (to get "columns of x-y points to the `set_geometry` method")? I will merge that straight away, which ensures that CI workflows in this PR will run automatically. Right now we have to approve them after every single commit because you're not a contributor yet. I am trying to keep an eye on that and do it as soon as possible, but I'd rather not have to do that.

@tastatham
Contributor Author

@tastatham I posted this on gitter but it got lost in the following conversation. Could you make a tiny PR adding backticks around "set_geometry" (to format it as code) in the README (to get "columns of x-y points to the `set_geometry` method")? I will merge that straight away, which ensures that CI workflows in this PR will run automatically. Right now we have to approve them after every single commit because you're not a contributor yet. I am trying to keep an eye on that and do it as soon as possible, but I'd rather not have to do that.

PR created #83


Parameters
----------
p : Hilbert curve parameter
Member

It would be good to add some explanation of what this is, how it influences the result, and some guidance on whether you should set this or not
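As background for such an explanation: in the usual formulation, `p` is the number of bits per dimension, so coordinates are snapped onto a 2^p x 2^p grid before the distance is computed. The sketch below shows one plausible way to do that snapping and how `p` affects whether nearby values stay distinguishable. The function name and the exact scaling are assumptions, not necessarily what `_continuous_to_discrete_coords` in this PR does.

```python
def continuous_to_discrete(value, lo, hi, p):
    """Map a coordinate in [lo, hi] onto an integer cell in 0 .. 2**p - 1.

    Hypothetical helper; `lo` and `hi` play the role of the total bounds.
    """
    n = 1 << p
    if hi == lo:
        return 0  # degenerate extent: everything falls into one cell
    cell = int((value - lo) / (hi - lo) * n)
    return min(cell, n - 1)  # the upper bound itself belongs to the last cell


lo, hi = 0.0, 100.0
a, b = 10.0, 12.0  # two nearby x-coordinates

coarse = (continuous_to_discrete(a, lo, hi, p=2), continuous_to_discrete(b, lo, hi, p=2))
fine = (continuous_to_discrete(a, lo, hi, p=6), continuous_to_discrete(b, lo, hi, p=6))
```

With a small `p` the two values collapse into the same cell and would receive the same distance; a larger `p` separates them, at the cost of larger integers. That tradeoff is the kind of guidance the docstring could spell out.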

Member

@martinfleis martinfleis left a comment

I think we can essentially merge after some minor changes.

Before we do that, can you open an issue listing follow-up tasks? Some were discussed yesterday, and I've noted some in the code. Also mention the documentation notebook in there.

Can you add numba to requirements here?

dask-geopandas/setup.py

Lines 6 to 10 in 5984a6f

install_requires = [
    "geopandas",
    "dask>=2.18.0,!=2021.05.1",
    "distributed>=2.18.0,!=2021.05.1",
]

Almost there, good job!

.pre-commit-config.yaml Outdated Show resolved Hide resolved
Comment on lines 313 to 314
A function that calculates hilbert distance for each geometry
in each partition of a Dask-GeoDataFrame
Member

Can you add a sentence explaining what the Hilbert distance is and why it is useful, so that users do not have to google it to understand what the function does?

dask_geopandas/core.py Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Show resolved Hide resolved
dask_geopandas/hilbert_distance.py Show resolved Hide resolved
Member

@martinfleis martinfleis left a comment

I think we're ready to merge ;).

@martinfleis
Member

martinfleis commented Jul 22, 2021

Ehm, we're not. Can you make sure CI passes?

edit: I think I fixed it in e88f940.

setup.py Outdated Show resolved Hide resolved
@martinfleis martinfleis merged commit 837f2a3 into geopandas:master Jul 22, 2021
4 participants