-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Implement Hilbert distance #70
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good and works like a charm! Thanks a lot!
The output only returns the calculated hilbert distance as a numpy array - not as a column
That is fine I think, you can always do ddf["hilbert"] = ddf.hilbert_distance()
if you need it as a column.
Numba is currently supported but I have commented out these parts - we need to discuss whether we want numba as a dependency.
The algorithm is already performant but with numba it is 5x faster (once compiled), at least with the GADM of US (3k polygons). So I'd be up for supporting numba but we can discuss the potential issues related to that tomorrow.
Can you ensure you follow the black
& flake8
code styles to make CI happy?
Regarding tests, can we add one testing the correctness of the actual Hilbert distance (including different p
) apart from the existing check that geopandas-based and dask-geopandas-based results are equal?
Yes, we can discuss this tomorrow.
I totally forgot to follow the styling rules! I will update this
Yes, I can do this. I did think about this but didn't know how to define the "correctness of the hilbert distance" without referencing another package e.g. SpatialPandas or https://github.com/galtay/hilbertcurve |
I can recommend to use pre-commit! (https://geopandas.readthedocs.io/en/latest/community/contributing.html#style-guide-linting) |
It can be indirect. You can have the expected order of geometries and test if sorting according to Hilbert distance produces the same order. I'd say that for some dummy geometries, you can also hard-code expected value (just to test it does not unexpectedly change by some future PR). |
Thanks! this is useful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Added a few comments.
Another idea for the tests:
- add tests using different geometry types as input (polygon, line, point)
dask_geopandas/hilbert_distance.py
Outdated
geom_mids = [ | ||
((bounds[:, 0] + bounds[:, 2]) / 2.0), | ||
((bounds[:, 1] + bounds[:, 3]) / 2.0), | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, do we know if there is an advantage to using such midpoints vs "centroid" vs "representative point" ?
(if there is not theoretical/practical reason for one or the other, we should maybe check which one is typically the cheapest to compute. EDIT: and based on a quick check, calculating the mids based on the bounds seems much faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My prior was that this approach would be faster - but I should have checked manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This result makes sense: the operations are in increasing levels of complexity. Midpoint is simple math based on bounds, centroid is more complex based on calculating the "center of mass" of the geometry (varies by geometry type), and "representative point" (point on surface) is probably yet more complex since it needs to ensure the result intersects the polygon.
More of a theoretical question (longer term): the key thing to consider here is how representative the midpoints are for ordinating geometries along the Hilbert curve: what is the tradeoff for how well the points represent the locations of the geometries for partitioning (e.g., suboptimal partitions) vs spatial operations performed against those partitions. Put differently: using midpoint for Hilbert may produce partitions quickly, but if those are suboptimal for overlay operations and makes those much slower, then maybe it is worth a somewhat more expensive method for getting the representative points. To do this, one would need to compare the full compute time of calculate Hilbert curve, repartition, overlay. I'm thinking of a case like the admin boundaries of France, which includes overseas territories. Midpoint of bounds will be far away from any of those boundaries. Centroid will be maybe a bit better but still far away from those boundaries. Representative point would be in continental France (I think?) and thus be more optimal for repartitioning and then overlay with other European polygons. Though - this could easily be solved by exploding into single-part geometries...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Midpoints (and the same for centroids / representative points) might also give suboptimal results if you have a mix of large and small polygons. I was thinking it could be something to explore (later) if you could calculate the hilbert distance for eg the bounding box points, and consider those 4 points per row together when deciding the partitions. But then of course it's not a simple "sort the hilbert_distance column" to determine the partitions.
Using a point dataset (subset of the OSM GPS points, first 1 million points) and only focusing on the actual algorithm (after calculating the integer coordinates and the total bounds), I get a 200x speed-up using when using numba (with the one missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't have a chance to fully review, but added a few comments to consider.
@tastatham can you give us a short summary of recent commits and an overview of what is still missing (at least tests for sure)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! A few more comments
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Explicitly call total_bounds in _continuous_to_discrete_coords Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@tastatham I posted this on gitter but it got lost in the following conversation. Could you make a tiny PR adding backticks around "set_geometry" (to format it as a code) in ReadMe (to get "columns of x-y points to the |
PR created #83 |
dask_geopandas/core.py
Outdated
|
||
Parameters | ||
---------- | ||
p : Hilbert curve parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add some explanation of what this is, how it influences the result, and some guidance on whether you should set this or not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can essentially merge after some minor changes.
Before we do that, can you open an issue listing follow-up tasks? Some discussed yesterday, I've noted some in the code. Mention also the documentation notebook in there.
Can you add numba
to requirements here?
Lines 6 to 10 in 5984a6f
install_requires = [ | |
"geopandas", | |
"dask>=2.18.0,!=2021.05.1", | |
"distributed>=2.18.0,!=2021.05.1", | |
] |
Almost there, good job!
dask_geopandas/core.py
Outdated
A function that calculates hilbert distance for each geometry | ||
in each partition of a Dask-GeoDataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a sentence explaining what the Hilbert distance is and why it is useful? So that user does not have to google it to understand what the function does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're ready to merge ;).
Ehm, we're not. Can you make sure CI passes? edit: I think I fixed it in e88f940. |
Implementing
hilbert_distance
for Dask-GeoPandas - based on the_with_hilbert_distance_column function
from SpatialPandas.Details: