ENH: Implement Hilbert distance #70

tastatham · 2021-07-07T09:44:37Z

Implementing hilbert_distance for Dask-GeoPandas - based on the _with_hilbert_distance_column function from SpatialPandas.

Details:

- Calculating the hilbert distance will allow us to partition Dask-GeoPandas using this information.
- This supports both GeoPandas & Dask-GeoPandas
- The output only returns the calculated hilbert distance as a numpy array - not as a column
- Numba is currently supported but I have commented out these parts - we need to discuss whether we want numba as a dependency.

martinfleis

This looks really good and works like a charm! Thanks a lot!

> The output only returns the calculated hilbert distance as a numpy array - not as a column

That is fine I think, you can always do ddf["hilbert"] = ddf.hilbert_distance() if you need it as a column.

> Numba is currently supported but I have commented out these parts - we need to discuss whether we want numba as a dependency.

The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.
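One way to get the speed-up without making numba a hard requirement is an optional import with a no-op fallback. The sketch below is an assumption about how that could look, not the code in this PR: the fallback decorator and `sum_of_squares` (a stand-in for the Hilbert kernel) are hypothetical names used only for illustration.

```python
try:
    # Use the real JIT decorator when numba is installed.
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        """Fallback that leaves the function as plain Python."""
        # Support both the bare form (@njit) and the called form (@njit(...)).
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]

        def decorator(func):
            return func

        return decorator


@njit
def sum_of_squares(n):
    # A tight integer loop: the kind of code that benefits from compilation.
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

With this pattern the result is the same with or without numba; only the run time differs, and the first call pays the compilation cost when numba is present.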

Can you ensure you follow the black & flake8 code styles to make CI happy?

Regarding tests, can we add one testing the correctness of the actual Hilbert distance (including different p) apart from the existing check that geopandas-based and dask-geopandas-based results are equal?

tastatham · 2021-07-07T13:50:53Z

> The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.

Yes, we can discuss this tomorrow.

> Can you ensure you follow the black & flake8 code styles to make CI happy?

I totally forgot to follow the styling rules! I will update this

> Regarding tests, can we add one testing the correctness of the actual Hilbert distance (including different p) apart from the existing check that geopandas-based and dask-geopandas-based results are equal?

Yes, I can do this. I did think about this but didn't know how to define the "correctness of the hilbert distance" without referencing another package e.g. SpatialPandas or https://github.com/galtay/hilbertcurve

jorisvandenbossche · 2021-07-07T14:01:22Z

> I totally forgot to follow the styling rules! I will update this

I can recommend to use pre-commit! (https://geopandas.readthedocs.io/en/latest/community/contributing.html#style-guide-linting)

martinfleis · 2021-07-07T14:06:21Z

> I did think about this but didn't know how to define the "correctness of the hilbert distance" without referencing another package

It can be indirect. You can have the expected order of geometries and test if sorting according to Hilbert distance produces the same order. I'd say that for some dummy geometries, you can also hard-code expected value (just to test it does not unexpectedly change by some future PR).
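A rough sketch of such an indirect test follows. It uses the textbook integer-grid Hilbert mapping, not the implementation in this PR, so the curve orientation and the hard-coded values may differ from what `hilbert_distance` returns; `hilbert_distance_cell` and `check_curve` are hypothetical names.

```python
def hilbert_distance_cell(x, y, p):
    """Distance along a Hilbert curve of order p for the integer cell (x, y)."""
    n = 2 ** p  # the grid has n x n cells
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Hard-coded expectation for the 2 x 2 grid (p=1): sorting by distance
# must reproduce the known visiting order of the curve.
cells = [(1, 0), (0, 0), (1, 1), (0, 1)]
ordered = sorted(cells, key=lambda c: hilbert_distance_cell(c[0], c[1], 1))
assert ordered == [(0, 0), (0, 1), (1, 1), (1, 0)]


def check_curve(p):
    """Property check: every cell is visited once, in unit steps."""
    n = 2 ** p
    by_distance = {
        hilbert_distance_cell(x, y, p): (x, y)
        for x in range(n)
        for y in range(n)
    }
    assert sorted(by_distance) == list(range(n * n))
    for d in range(n * n - 1):
        (x0, y0), (x1, y1) = by_distance[d], by_distance[d + 1]
        assert abs(x0 - x1) + abs(y0 - y1) == 1


for order in (1, 2, 3, 4):
    check_curve(order)
```

The property check covers different values of `p` without relying on another package, while the hard-coded order guards against accidental changes in a later PR.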

tastatham · 2021-07-07T14:07:59Z

> > I totally forgot to follow the styling rules! I will update this
>
> I can recommend to use pre-commit! (https://geopandas.readthedocs.io/en/latest/community/contributing.html#style-guide-linting)

Thanks! this is useful

jorisvandenbossche

Nice! Added a few comments.

Another idea for the tests:

- add tests using different geometry types as input (polygon, line, point)

jorisvandenbossche · 2021-07-08T09:11:31Z

dask_geopandas/hilbert_distance.py

+    geom_mids = [
+        ((bounds[:, 0] + bounds[:, 2]) / 2.0),
+        ((bounds[:, 1] + bounds[:, 3]) / 2.0),
+    ]


In general, do we know if there is an advantage to using such midpoints vs "centroid" vs "representative point"?

(if there is no theoretical/practical reason for one or the other, we should maybe check which one is typically the cheapest to compute. EDIT: and based on a quick check, calculating the mids based on the bounds seems much faster)

tastatham · 2021-07-08

My prior was that this approach would be faster - but I should have checked manually.

brendan-ward · 2021-07-08

This result makes sense: the operations are in increasing levels of complexity. Midpoint is simple math based on bounds, centroid is more complex based on calculating the "center of mass" of the geometry (varies by geometry type), and "representative point" (point on surface) is probably yet more complex since it needs to ensure the result intersects the polygon.

More of a theoretical question (longer term): the key thing to consider here is how representative the midpoints are for ordering geometries along the Hilbert curve: what is the tradeoff between how well the points represent the locations of the geometries for partitioning (e.g., suboptimal partitions) and the cost of spatial operations performed against those partitions? Put differently: using midpoints for Hilbert may produce partitions quickly, but if those partitions are suboptimal for overlay operations and make them much slower, then maybe it is worth a somewhat more expensive method for getting the representative points. To find out, one would need to compare the full compute time of calculating the Hilbert curve, repartitioning, and overlay. I'm thinking of a case like the admin boundaries of France, which include overseas territories. The midpoint of the bounds will be far away from any of those boundaries. The centroid will be maybe a bit better but still far away from those boundaries. The representative point would be in continental France (I think?) and thus be more optimal for repartitioning and then overlay with other European polygons. Though - this could easily be solved by exploding into single-part geometries...
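The multi-part concern can be illustrated with plain bounding boxes. The coordinates below are made up for illustration and are not real boundaries; `bounds_midpoint`, `union_bounds` and `box_contains` are hypothetical helpers, with bounds given as `(minx, miny, maxx, maxy)`.

```python
def bounds_midpoint(bounds):
    """Midpoint of a (minx, miny, maxx, maxy) bounding box."""
    minx, miny, maxx, maxy = bounds
    return ((minx + maxx) / 2.0, (miny + maxy) / 2.0)


def union_bounds(parts):
    """Bounding box enclosing several (minx, miny, maxx, maxy) boxes."""
    return (
        min(b[0] for b in parts),
        min(b[1] for b in parts),
        max(b[2] for b in parts),
        max(b[3] for b in parts),
    )


def box_contains(bounds, point):
    """True if the point lies inside or on the edge of the box."""
    minx, miny, maxx, maxy = bounds
    x, y = point
    return minx <= x <= maxx and miny <= y <= maxy


# Two far-apart rectangular parts of one multi-part geometry (made-up values).
mainland = (0.0, 40.0, 10.0, 50.0)
territory = (-60.0, 0.0, -50.0, 10.0)

# Midpoint of the bounds of the whole multi-part geometry.
multi_mid = bounds_midpoint(union_bounds([mainland, territory]))
in_any_part = any(box_contains(part, multi_mid) for part in (mainland, territory))

# Midpoints after "exploding" into single parts.
part_mids = [bounds_midpoint(part) for part in (mainland, territory)]

print(multi_mid, in_any_part)  # (-25.0, 25.0) False
print(part_mids)               # [(5.0, 45.0), (-55.0, 5.0)]
```

The midpoint of the combined bounds falls in neither part, while the per-part midpoints stay inside their own part, which is why exploding into single-part geometries sidesteps the problem.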

jorisvandenbossche · 2021-07-09

Midpoints (and the same for centroids / representative points) might also give suboptimal results if you have a mix of large and small polygons. I was thinking it could be something to explore (later) if you could calculate the hilbert distance for e.g. the bounding box points, and consider those 4 points per row together when deciding the partitions. But then of course it's not a simple "sort the hilbert_distance column" to determine the partitions.

jorisvandenbossche · 2021-07-08T09:26:57Z

> The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.

Using a point dataset (subset of the OSM GPS points, first 1 million points) and only focusing on the actual algorithm (after calculating the integer coordinates and the total bounds), I get a 200x speed-up when using numba (with the one missing @ngjit added).
I also quickly compared with https://github.com/PrincetonLIPS/numpy-hilbert-curve, which is faster than the plain version here but slower than the numba version: ~1min50 without numba, ~5s for numpy-hilbert-curve and ~500ms with numba.

brendan-ward

I didn't have a chance to fully review, but added a few comments to consider.

martinfleis · 2021-07-13T09:01:11Z

@tastatham can you give us a short summary of recent commits and an overview of what is still missing (at least tests for sure)?

jorisvandenbossche

Thanks for the updates! A few more comments

martinfleis · 2021-07-15T12:56:32Z

@tastatham I posted this on gitter but it got lost in the following conversation. Could you make a tiny PR adding backticks around "set_geometry" (to format it as a code) in ReadMe (to get "columns of x-y points to the set_geometry method")? I will merge that straight away which ensures that CI workflows in this PR will run automatically. Now we have to approve it after every single commit because you're not a contributor yet. I am trying to keep an eye on that and do it as soon as possible but I'd rather not have to do that.

tastatham · 2021-07-15T13:27:25Z

> @tastatham I posted this on gitter but it got lost in the following conversation. Could you make a tiny PR adding backticks around "set_geometry" (to format it as a code) in ReadMe (to get "columns of x-y points to the set_geometry method")? I will merge that straight away which ensures that CI workflows in this PR will run automatically. Now we have to approve it after every single commit because you're not a contributor yet. I am trying to keep an eye on that and do it as soon as possible but I'd rather not have to do that.

PR created #83

jorisvandenbossche · 2021-07-15T16:13:10Z

dask_geopandas/core.py

+
+        Parameters
+        ----------
+        p : Hilbert curve parameter


It would be good to add some explanation of what this is, how it influences the result, and some guidance on whether you should set this or not
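As a rough illustration of what such an explanation could cover: assuming the common convention that a Hilbert curve of order `p` covers a grid of `2**p` by `2**p` cells, `p` controls how finely coordinates are discretised before the distance is computed. The helper below is a hypothetical sketch of that discretisation for one axis, not the `_continuous_to_discrete_coords` function from this PR, which may scale or clip differently.

```python
def to_cell(value, lower, upper, p):
    """Map a coordinate in [lower, upper] to an integer cell in [0, 2**p - 1]."""
    side = 2 ** p            # number of cells along one axis
    width = upper - lower
    if width == 0:
        return 0             # degenerate extent: everything falls in one cell
    cell = int((value - lower) / width * side)
    # Clamp so that value == upper lands in the last cell, not one past it.
    return min(max(cell, 0), side - 1)


# Two nearby x-coordinates within total bounds of [0, 100] along that axis.
a, b = 10.0, 10.4

# Low order: both fall in the same cell, so they cannot be told apart.
print(to_cell(a, 0.0, 100.0, 1), to_cell(b, 0.0, 100.0, 1))  # 0 0

# Higher order: the grid is fine enough to separate them.
print(to_cell(a, 0.0, 100.0, 8), to_cell(b, 0.0, 100.0, 8))  # 25 26
```

Under this assumption, a larger `p` separates geometries that are close together, at the cost of larger distance values (up to `4**p - 1`), while a small `p` collapses nearby geometries onto the same distance.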

martinfleis

I think we can essentially merge after some minor changes.

Before we do that, can you open an issue listing follow-up tasks? Some discussed yesterday, I've noted some in the code. Mention also the documentation notebook in there.

Can you add numba to requirements here?

dask-geopandas/setup.py

Lines 6 to 10 in 5984a6f

    install_requires = [
        "geopandas",
        "dask>=2.18.0,!=2021.05.1",
        "distributed>=2.18.0,!=2021.05.1",
    ]

Almost there, good job!

martinfleis · 2021-07-16T08:52:12Z

dask_geopandas/core.py

+        A function that calculates hilbert distance for each geometry
+        in each partition of a Dask-GeoDataFrame


Can you add a sentence explaining what the Hilbert distance is and why it is useful? So that user does not have to google it to understand what the function does.

martinfleis

I think we're ready to merge ;).

martinfleis · 2021-07-22T11:01:00Z

Ehm, we're not. Can you make sure CI passes?

edit: I think I fixed it in e88f940.

tastatham added 3 commits July 7, 2021 10:32

calculate hilbert distances for geopandas

78403c0

calculate hilbert distances for dask-geopandas

825e31e

add test to check whether calculated hilbert distances match between geopandas as dask-geopandas

cb173d6

martinfleis reviewed Jul 7, 2021

dask_geopandas/core.py Outdated

dask_geopandas/hilbert_distance.py Outdated

dask_geopandas/hilbert_distance.py Outdated

dask_geopandas/hilbert_distance.py Outdated

tastatham added 2 commits July 7, 2021 16:22

reformatted previous commits using black

f5d5d52

updated _hilbert_distance with no for loops

bc6fd49

tastatham mentioned this pull request Jul 7, 2021

ENH: Implement partitioning based on hilbert distance #71

Closed

jorisvandenbossche reviewed Jul 8, 2021

dask_geopandas/hilbert_distance.py

jorisvandenbossche reviewed Jul 8, 2021

tests/test_core.py Outdated

jorisvandenbossche reviewed Jul 8, 2021

dask_geopandas/core.py Outdated

brendan-ward reviewed Jul 9, 2021

jorisvandenbossche reviewed Jul 9, 2021

dask_geopandas/core.py Outdated

jorisvandenbossche reviewed Jul 9, 2021

dask_geopandas/hilbert_distance.py Outdated

dask_geopandas/hilbert_distance.py Outdated

dask_geopandas/hilbert_distance.py Outdated

dask_geopandas/hilbert_distance.py Outdated

jorisvandenbossche mentioned this pull request Jul 9, 2021

FIX: set known shape on total_bounds result #75

Merged

tastatham added 6 commits July 12, 2021 11:25

drop test_hilbert_distance tmp

d0f8802

add numba acceleration

bc9b632

update hilbert_distance.py docstring

ce52794

updated black to avoid failing

92849da

add numba to continuous env yml files

fc898fa

updated hilbert_distance to lazily evaluate total_bounds

0fba9a8

martinfleis added this to the 0.1 milestone Jul 12, 2021

tastatham added 2 commits July 12, 2021 12:42

updated core.py docstring for hilbert_distance

9cdb098

reformat calculating hilbert distance for numba & cleaner syntax

251ae60

martinfleis reviewed Jul 14, 2021

tests/test_hilbert_curve.py Outdated

update hilbert test & ci env

c8fbd7f

martinfleis reviewed Jul 14, 2021

continuous_integration/envs/39-dev.yaml Outdated

jorisvandenbossche reviewed Jul 15, 2021

tastatham and others added 10 commits July 15, 2021 11:05

Preserve original gdf index using hilbert_distance

3fa1d01

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Update hilbert curve dependency in 39-dev.yaml

d8a2d46

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Use latest version of numba

dbfd011

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Use latest version of numba

7d750a2

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Use latest version of numba

a9cf761

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Explicitly call x & y mids in _continuous_to_discrete_coords

922addd

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Update dask_geopandas/hilbert_distance.py

e193f87

Explicitly call total_bounds in _continuous_to_discrete_coords Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Drop unnecessary () from pytest.fixture in test_core.py

aa47fdd

call total bounds and x&y mids in _continuous_discrete

cdf9458

merge _hilbert_distance and _calculate_hilbert_distance

98780cd

tastatham added 2 commits July 15, 2021 15:48

Use latest version of numba

1e1c196

add assert statement to check whether gdf indexes are equal

8a8b114

jorisvandenbossche reviewed Jul 15, 2021

martinfleis reviewed Jul 16, 2021

tastatham added 3 commits July 16, 2021 15:42

Merge remote-tracking branch 'origin/master' into hilbert_distance

7c19310

add numba to requirements

7ae61ca

update docstring

bc41634

martinfleis approved these changes Jul 22, 2021

martinfleis reviewed Jul 22, 2021

setup.py Outdated

Update setup.py

e88f940

martinfleis merged commit 837f2a3 into geopandas:master Jul 22, 2021

martinfleis mentioned this pull request Aug 4, 2021

ENH: Implement morton_distance #89

Closed

tastatham mentioned this pull request Aug 16, 2021

ENH: Implement Morton distance #90

Merged
