-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Implement Hilbert distance #70
Changes from 5 commits
78403c0
825e31e
cb173d6
f5d5d52
bc6fd49
d0f8802
bc9b632
ce52794
92849da
fc898fa
0fba9a8
9cdb098
251ae60
4511b94
1266357
8971210
f7eb6d5
91465f3
8737e55
b86b5d2
947bf77
c8fbd7f
3fa1d01
d8a2d46
dbfd011
7d750a2
a9cf761
922addd
e193f87
aa47fdd
cdf9458
98780cd
1e1c196
8a8b114
7c19310
7ae61ca
bc41634
e88f940
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,261 @@ | ||
import geopandas | ||
import numpy as np | ||
|
||
# from numba import jit # optional - do we want numba as a dependency for dask-geopandas? | ||
# ngjit = jit(nopython=True, nogil=True) | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
|
||
def _hilbert_distance(gdf, total_bounds, p): | ||
|
||
""" | ||
Calculate the hilbert distance for a GeoDataFrame based on the mid-point of | ||
the bounds for each geom and total bounds of the collection of geoms | ||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/dask.py#L172 | ||
|
||
Parameters | ||
---------- | ||
gdf : GeoDataFrame | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
total_bounds : Total bounds of GeoDataFrame | ||
|
||
p : Hilbert curve parameter | ||
|
||
Returns | ||
--------- | ||
Array of hilbert distances for each geom | ||
""" | ||
|
||
if total_bounds is None: | ||
total_bounds = gdf.total_bounds | ||
|
||
# Calculate bounds of each geom | ||
bounds = gdf.bounds.to_numpy() | ||
martinfleis marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# Hilbert Side len | ||
side_length = 2 ** p | ||
|
||
# Calculate x and y range of total bound coords - returns array | ||
geom_ranges = [ | ||
(total_bounds[0], total_bounds[2]), | ||
(total_bounds[1], total_bounds[3]), | ||
] | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
# Calculate mid points for x and y bound coords - returns array | ||
geom_mids = [ | ||
((bounds[:, 0] + bounds[:, 2]) / 2.0), | ||
((bounds[:, 1] + bounds[:, 3]) / 2.0), | ||
] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In general, do we know if there is an advantage to using such midpoints vs "centroid" vs "representative point" ? (if there is not theoretical/practical reason for one or the other, we should maybe check which one is typically the cheapest to compute. EDIT: and based on a quick check, calculating the mids based on the bounds seems much faster) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My prior was that this approach would be faster - but I should have checked manually. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This result makes sense: the operations are in increasing levels of complexity. Midpoint is simple math based on bounds, centroid is more complex based on calculating the "center of mass" of the geometry (varies by geometry type), and "representative point" (point on surface) is probably yet more complex since it needs to ensure the result intersects the polygon. More of a theoretical question (longer term): the key thing to consider here is how representative the midpoints are for ordinating geometries along the Hilbert curve: what is the tradeoff for how well the points represent the locations of the geometries for partitioning (e.g., suboptimal partitions) vs spatial operations performed against those partitions. Put differently: using midpoint for Hilbert may produce partitions quickly, but if those are suboptimal for overlay operations and makes those much slower, then maybe it is worth a somewhat more expensive method for getting the representative points. To do this, one would need to compare the full compute time of calculate Hilbert curve, repartition, overlay. I'm thinking of a case like the admin boundaries of France, which includes overseas territories. Midpoint of bounds will be far away from any of those boundaries. Centroid will be maybe a bit better but still far away from those boundaries. Representative point would be in continental France (I think?) and thus be more optimal for repartitioning and then overlay with other European polygons. Though - this could easily be solved by exploding into single-part geometries... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Midpoints (and the same for centroids / representative points) might also give suboptimal results if you have a mix of large and small polygons. I was thinking it could be something to explore (later) if you could calculate the hilbert distance for eg the bounding box points, and consider those 4 points per row together when deciding the partitions. But then of course it's not a simple "sort the hilbert_distance column" to determine the partitions.
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# Empty coord array | ||
coords = np.zeros((bounds.shape[0], 2), dtype=np.int64) | ||
# Transform continuous int to discrete int for each dimension | ||
coords[:, 0] = _continuous_int_to_discrete_int( | ||
geom_mids[0], geom_ranges[0], side_length | ||
) | ||
coords[:, 1] = _continuous_int_to_discrete_int( | ||
geom_mids[1], geom_ranges[1], side_length | ||
) | ||
martinfleis marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# Calculate hilbert distance | ||
hilbert_distances = _distances_from_coordinates(p, coords) | ||
|
||
return hilbert_distances | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
|
||
# @ngjit | ||
def _continuous_int_to_discrete_int(vals, val_range, n): | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
""" | ||
Convert an array of values from continuous data coordinates to discrete | ||
martinfleis marked this conversation as resolved.
Show resolved
Hide resolved
|
||
int coordinates | ||
martinfleis marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/utils.py#L9 | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Parameters | ||
---------- | ||
vals : Array of continuous coordinates to be | ||
([([val_1, val_2,..., val_n]), array([val_1, val_2,..., val_n])]) | ||
|
||
val_range : Ranges of x and y values ([(xmin, xmax), (ymin, ymax)]) | ||
|
||
n : Number of discrete coords (int) | ||
|
||
Returns | ||
--------- | ||
Array of discrete int coords | ||
""" | ||
|
||
x_width = val_range[1] - val_range[0] | ||
martinfleis marked this conversation as resolved.
Show resolved
Hide resolved
|
||
res = ((vals - val_range[0]) * (n / x_width)).astype(np.int64) | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# clip | ||
res[res < 0] = 0 | ||
res[res > n - 1] = n - 1 | ||
return res | ||
|
||
|
||
def _distances_from_coordinates(p, coords): | ||
tastatham marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
""" | ||
Calculate hilbert distance for a set of coords | ||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/spatialindex/hilbert_curve.py#L173 | ||
|
||
Parameters | ||
---------- | ||
p : Hilbert curve param | ||
|
||
coords : Array of coordinates | ||
|
||
Returns | ||
--------- | ||
Array of hilbert distances for each geom | ||
""" | ||
|
||
# Create empty coord list | ||
# coords = np.atleast_2d(coords).copy() | ||
result = np.zeros(coords.shape[0], dtype=np.int64) | ||
# For each coord calculate hilbert distance | ||
for i in range(coords.shape[0]): | ||
coord = coords[i, :] | ||
result[i] = _distance_from_coordinate(p, coord) | ||
return result | ||
|
||
|
||
# @ngjit | ||
def _distance_from_coordinate(p, coord): | ||
|
||
""" | ||
Calculate hilbert distance for a single coord | ||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/spatialindex/rtree.py#L50 | ||
|
||
Parameters | ||
---------- | ||
p : Hilbert curve param | ||
|
||
coord : Array of coordinates | ||
|
||
Returns | ||
--------- | ||
Array of hilbert distances for a single coord | ||
""" | ||
|
||
n = len(coord) | ||
M = 1 << (p - 1) | ||
Q = M | ||
while Q > 1: | ||
P = Q - 1 | ||
for i in range(n): | ||
if coord[i] & Q: | ||
coord[0] ^= P | ||
else: | ||
t = (coord[0] ^ coord[i]) & P | ||
coord[0] ^= t | ||
coord[i] ^= t | ||
Q >>= 1 | ||
# Gray encode | ||
for i in range(1, n): | ||
coord[i] ^= coord[i - 1] | ||
t = 0 | ||
Q = M | ||
while Q > 1: | ||
if coord[n - 1] & Q: | ||
t ^= Q - 1 | ||
Q >>= 1 | ||
for i in range(n): | ||
coord[i] ^= t | ||
h = _transpose_to_hilbert_integer(p, coord) | ||
return h | ||
|
||
|
||
# @ngjit | ||
def _transpose_to_hilbert_integer(p, coord): | ||
|
||
""" | ||
Calculate hilbert distance for a single coord | ||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/spatialindex/hilbert_curve.py#L53 | ||
|
||
|
||
Parameters | ||
---------- | ||
p : Hilbert curve param | ||
|
||
coord : Array of coordinates | ||
|
||
Returns | ||
--------- | ||
Array of hilbert distances for a single coord | ||
""" | ||
|
||
n = len(coord) | ||
bins = [_int_2_binary(v, p) for v in coord] | ||
concat = np.zeros(n * p, dtype=np.uint8) | ||
for i in range(p): | ||
for j in range(n): | ||
concat[n * i + j] = bins[j][i] | ||
|
||
h = _binary_2_int(concat) | ||
return h | ||
|
||
|
||
# @ngjit | ||
def _int_2_binary(v, width): | ||
|
||
""" | ||
Convert an array of values from discrete int coordinates to binary byte | ||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/spatialindex/hilbert_curve.py#L12 | ||
|
||
Parameters | ||
---------- | ||
p : Hilbert curve param | ||
|
||
coord : Array of coordinates | ||
|
||
Returns | ||
--------- | ||
# Returns binary byte | ||
""" | ||
|
||
res = np.zeros(width, dtype=np.uint8) | ||
for i in range(width): | ||
res[width - i - 1] = v % 2 # zero-passed to width | ||
v = v >> 1 | ||
return res | ||
|
||
|
||
# @ngjit | ||
def _binary_2_int(bin_vec): | ||
|
||
""" | ||
Convert binary byte to int | ||
|
||
Based on: https://github.com/holoviz/spatialpandas/blob/ | ||
9252a7aba5f8bc7a435fffa2c31018af8d92942c/spatialpandas/spatialindex/hilbert_curve.py#L23 | ||
|
||
Parameters | ||
---------- | ||
p : Hilbert curve param | ||
|
||
coord : Array of coordinates | ||
|
||
Returns | ||
--------- | ||
# Returns discrete int | ||
""" | ||
|
||
res = 0 | ||
next_val = 1 | ||
width = len(bin_vec) | ||
for i in range(width): | ||
res += next_val * bin_vec[width - i - 1] | ||
next_val <<= 1 | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a sentence explaining what the Hilbert distance is and why it is useful? So that user does not have to google it to understand what the function does.