ENH: Implement Morton distance #90

tastatham · 2021-08-04T16:58:32Z

Implement morton_distance for Dask-GeoPandas. This provides another way of spatially partitioning Dask-DataFrames using another space-filling curve.

Details:

Calculating the hilbert distance will allow us to partition Dask-GeoPandas using this information.
This supports both GeoPandas & Dask-GeoPandas
The output is returned as a Pandas Series

When compared to the Hilbert distance, there is very little difference in compute time. However, from some basic comparisons, the Hilbert distance does appear to preserve locality better. Further work needs to be carried out to compare the two (as well as a third partitioning method; Geohash).

martinfleis

Thanks a lot! First quick pass (haven't played with it yet).

Can it be combined with numba? That loop in _distances_from_coordinates may benefit from that with large GeoDataFrames.

continuous_integration/envs/37-latest.yaml

dask_geopandas/morton_distance.py

martinfleis · 2021-08-04T17:37:24Z

dask_geopandas/morton_distance.py

+
+    """
+    bounds = gdf.bounds.to_numpy()
+    coords = _continuous_to_discrete_coords(total_bounds, bounds, p)


The p parameter is the same as the Hilbert curve has? I mean, denoting the same concept?

I believe the following logic holds true - correct me otherwise;
Using the _continuous_to_discrete_coords transforms each of the coordinate pairs from continuous to integers.
The p (or order of curve) value will control the number of iterations in constructing the [Hilbert/Morton] curve.
Then calculated distances along each curve work entirely in terms of integers, where the size of the hypercube will reflect the number of bits per dimension - controlled by p;

The 1st order curve supports values (0|1) on both x and y axis - giving us 2 bit binary numbers of the range from 0,1,2,3.

Then the 2nd order curve supports values (0|1|2|3) on ... - giving us 4 bit binary numbers of the range 0,1...,15 and so on...

Therefore, using the same p value (or n iterations) for calculated distances along both curves will be bound by the same size hypercube.

The maximum discrete distances along each order are as follows;

iterations = 16 # or p n = 2 # two-dimension max_discrete_distances = [] for p in range(iterations): max_discrete_distances.append(2 ** (n*p) - 1 ) # -1 to start index at 0 max_discrete_distances[1:] # from 1st to 15th order curve

long story short, yes it denotes the same concept.

Cool! Thanks for the great explanation!

The p (or order of curve) value will control the number of iterations in constructing the [Hilbert/Morton] curve.

But as far as I can see, to calculate the Morton curve, we are not using any iterations? (in contrast to Hilbert curve)
The encode_morton and part1by1 seem to do a fixed set of computations independent of p.

(the explanation that it controls the size of the hypercube (and thus the precision of the calculated distances) is of course still valid nonetheless)

(I suppose the "classical" implementation also uses loops, but the bit-shifting method is an optimized version that doesn't require loops)

jorisvandenbossche

Naming question: if I google "morton distance", I basically get no useful result (while "hilbert distance" does give something useful). For google it, you rather need something like "morton curve". So wondering if "morton distance" is the best name (although it's of course the "distance along the curve", I suppose?
Now, that still results in a wiki page for which the main title is "Z-order curve", so we could also consider using "Z-order" instead of "Morton" for the naming?

dask_geopandas/morton_distance.py

jorisvandenbossche · 2021-08-04T18:28:42Z

When compared to the Hilbert distance, there is very little difference in compute time.

I suppose that with using the numpy-based implementation, this will become faster. A
lso, best test the performance on somewhat larger data (eg I see in your notebook that you are timing it on the "naturalearth_lowres" data, which is tiny. If you profile that timing (eg with snakeviz), you'll see that you are mostly measuring overhead, and the actual hilbert/morton encoding only is a small part). You can also easily combine the data multiple times to artificially get bigger data (gdf=pd.concat([gdf]*1000, axis=0))

continuous_integration/envs/39-dev.yaml

tastatham · 2021-08-05T11:05:55Z

Naming question: if I google "morton distance", I basically get no useful result (while "hilbert distance" does give something useful). For google it, you rather need something like "morton curve". So wondering if "morton distance" is the best name (although it's of course the "distance along the curve", I suppose?
Now, that still results in a wiki page for which the main title is "Z-order curve", so we could also consider using "Z-order" instead of "Morton" for the naming?

I did think about this briefly and thought that the naming of the morton_distance was appropriate. This is because;

We are calculating the distance along the Morton/Z-order curve.
To not confuse Z-order with Z-dimension
It's consistent with the hilbert_distance function (which we can also change)

I'm happy to discuss this later on in the call.

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

tastatham · 2021-08-05T11:11:44Z

When compared to the Hilbert distance, there is very little difference in compute time.

I suppose that with using the numpy-based implementation, this will become faster. A
lso, best test the performance on somewhat larger data (eg I see in your notebook that you are timing it on the "naturalearth_lowres" data, which is tiny. If you profile that timing (eg with snakeviz), you'll see that you are mostly measuring overhead, and the actual hilbert/morton encoding only is a small part). You can also easily combine the data multiple times to artificially get bigger data (gdf=pd.concat([gdf]*1000, axis=0))

Good point, I will profile this and the numpy version using the artificial dataset.

dask_geopandas/core.py

brendan-ward

Good progress! The main changes are to update the docstrings.

If possible, it would be good to add the numpy version of the bitwise operations; hopefully that speeds it up significantly compared to Hilbert.

dask_geopandas/core.py

dask_geopandas/morton_distance.py

dask_geopandas/core.py

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

… into morton_distance

continuous_integration/envs/39-dev.yaml

dask_geopandas/core.py

dask_geopandas/morton_distance.py

jorisvandenbossche · 2021-08-07T09:50:36Z

dask_geopandas/morton_distance.py

+    )
+
+
+def _part1by1(n):


Do you have an idea where the "part1by1" name comes from? (I mean, it comes from the active state recipe, but what's the reason for it? (I find it a bit cryptic name))

I agree, the naming convention is a bit cryptic but seems to be widely adopted. See Real-Time Collision Detection book, Chapter 7.

I think it comes from the fact the part1by1 function inserts one zero bit between each original data bit (part), the part1by2 function inserts two zero bits between each original data bit (part) and so on...

jorisvandenbossche · 2021-08-07T09:54:44Z

dask_geopandas/morton_distance.py

+        Interleaved bits
+    """
+
+    n &= 0x0000FFFF


This line is not included in the code in http://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN (the reference you added now). Do you know what it is for? (it's eg also there in pymorton https://github.com/trevorprater/pymorton/blob/f248683c19d90e904193c58bbd03e77bc2c43768/pymorton/pymorton.py#L11)

yes, it's also not used in the [Real-Time Collision Detection book, Chapter 7, 317](http://www.r-5.org/files/books/computers/algo-list/realtime-3d/Christer_Ericson-Real-Time_Collision_Detection-EN.pdf

Updated in 27ca9e7

Looking a little bit into it, this might actually still be needed because we are using int64 for the integer coordinates. With the current value of p=15 that you are using for testing, the coordinates never become larger than the max value of an uint16.

But if you try to run the tests with eg p=20, do they still pass?

The example code in the book you reference uses uint32 type for the input values, and says it only uses the low 16 bits of those numbers (and then spreading those out with the bit interleaving). So if I am understanding correctly, if your coordinates are 16bit, you view then as 32bit when passing to the bit interleaving code. And that line n &= 0x0000FFFF might ensure that it only uses the lower 16 bits? (although I still don't fully understand why it is in there, as if you have a number that is larger than what can be represented by the 16bits (>65535), it gets truncated by n &= 0x0000FFFF anyway and then you also get a wrong result ..
But eg GEOS also has a Morton implementation, and includes it as well: https://github.com/libgeos/geos/blob/16610691611decd2b9545ea68ef41e8b94f859e6/src/shape/fractal/MortonCode.cpp#L95-L104

yes, that's correct. These examples assume the input values are of type uint32. Currently uint64 coords are passed to the _part1by1 function. This comes from the _continuous_to_discrete_coords function.

In #70 and #90 we discussed several follow up issues;

Setting the dtype for the _continuous_to_discrete_coords function

Centralizing several functions which are used by both Hilbert and Morton distance.

We could look into this further alongside these follow up issues and for now, simply pass the coords to the _part1by1 function as type unit32? - coords.astype("uint32")

I think this is the last component that needs updating prior to merging it. Any thoughts?

I would maybe revert the removal of this line for now, and then list it as a follow-up to check / test more.

Otherwise, I would at least eg add a test with a higher p value as mentioned above.

dask_geopandas/morton_distance.py

martinfleis

Final super tiny suggestion from me. Otherwise, this is good to go. Nice job @tastatham!

continuous_integration/envs/38-latest.yaml

continuous_integration/envs/39-latest.yaml

continuous_integration/envs/37-latest.yaml

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

dask_geopandas/core.py

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

jorisvandenbossche · 2021-08-17T16:47:35Z

The linting is failing (probably something white space in the suggested edit)

dask_geopandas/core.py

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

martinfleis

I think we're done here. Thanks a lot @tastatham!

brendan-ward

Thanks for working on this!

Eventually it would be good to add more guidance or at least an appropriate range for the p parameter, at least so that the user doesn't overflow data types by using too large a p parameter.

dask_geopandas/morton_distance.py

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

jorisvandenbossche · 2021-08-19T13:52:23Z

I added back one line in the algorithm, so it matches what pymorton does (see the discussion above #90 (comment)). We can then properly figure out if it is needed / if we need to cast the integer coordinates to a certain bitsize etc in a follow-up.

jorisvandenbossche · 2021-08-19T14:29:59Z

Thanks @tastatham !

reformat morton_distance

f257271

martinfleis reviewed Aug 4, 2021

View reviewed changes

jorisvandenbossche reviewed Aug 4, 2021

View reviewed changes

dask_geopandas/morton_distance.py Show resolved Hide resolved

dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved

tastatham added 2 commits August 5, 2021 10:10

use pip instead of conda for pymorton

937bbbf

resolve pymorton package not being found in dev env

2c90399

martinfleis reviewed Aug 5, 2021

View reviewed changes

continuous_integration/envs/39-dev.yaml Outdated Show resolved Hide resolved

continuous_integration/envs/39-dev.yaml Outdated Show resolved Hide resolved

remove pymorton as conda dependency in dev env

72bb05d

Drop for loop for vectorization

a1eae6c

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

jorisvandenbossche reviewed Aug 5, 2021

View reviewed changes

dask_geopandas/core.py Show resolved Hide resolved

brendan-ward requested changes Aug 5, 2021

View reviewed changes

tastatham and others added 5 commits August 6, 2021 12:25

Remove hilbert distance from morton distance docstring

60c64c3

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

update docstring using numpy & drop encode_morton

703b5b4

Merge branch 'morton_distance' of github.com:tastatham/dask-geopandas…

3117348

… into morton_distance

drop master from pymorton

9756156

change comment about what this script is based on

ba96592

martinfleis reviewed Aug 6, 2021

View reviewed changes

continuous_integration/envs/39-dev.yaml Show resolved Hide resolved

dask_geopandas/core.py Show resolved Hide resolved

dask_geopandas/core.py Show resolved Hide resolved

jorisvandenbossche reviewed Aug 7, 2021

View reviewed changes

tastatham added 3 commits August 8, 2021 17:19

update style & drop unnecessary parts

27ca9e7

update docstring

2c539e0

add pip as conda dependency

07fdecc

martinfleis reviewed Aug 14, 2021

View reviewed changes

continuous_integration/envs/38-latest.yaml Outdated Show resolved Hide resolved

continuous_integration/envs/39-latest.yaml Outdated Show resolved Hide resolved

continuous_integration/envs/37-latest.yaml Show resolved Hide resolved

tastatham and others added 3 commits August 16, 2021 10:40

Update 38-latest.yaml with pip dep

d5965b0

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Update 37-latest.yaml with pip dep

8acc95d

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Update 39-latest.yaml with pip dep

6bf2168

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

martinfleis reviewed Aug 16, 2021

View reviewed changes

dask_geopandas/core.py Outdated Show resolved Hide resolved

tastatham and others added 2 commits August 16, 2021 21:17

Merge branch 'geopandas:master' into morton_distance

c2ada12

update docstring to reflect output is a dask.Series, not pandas.Series

a08e85e

martinfleis reviewed Aug 17, 2021

View reviewed changes

dask_geopandas/core.py Outdated Show resolved Hide resolved

dask_geopandas/core.py Outdated Show resolved Hide resolved

tastatham and others added 2 commits August 17, 2021 12:40

Update morton_distance docstring

c1e1620

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

Update docstring for morton_distance

8dbc9e2

Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>

jorisvandenbossche reviewed Aug 17, 2021

View reviewed changes

dask_geopandas/core.py Outdated Show resolved Hide resolved

tastatham and others added 2 commits August 17, 2021 20:39

Update morton docstring in core.py

6f7bac1

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

update docstring to drop whitespace

ef62162

martinfleis approved these changes Aug 17, 2021

View reviewed changes

brendan-ward approved these changes Aug 18, 2021

View reviewed changes

dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved

tastatham and others added 2 commits August 18, 2021 20:51

update docstring of morton_distance

40ca74d

Co-authored-by: Brendan Ward <bcward@astutespruce.com>

add back first bitwise and operator

b9c282a

jorisvandenbossche merged commit bcbdfd1 into geopandas:master Aug 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Implement Morton distance #90

ENH: Implement Morton distance #90

tastatham commented Aug 4, 2021

martinfleis left a comment

martinfleis Aug 4, 2021

tastatham Aug 5, 2021

martinfleis Aug 5, 2021

jorisvandenbossche Aug 5, 2021

jorisvandenbossche Aug 5, 2021

jorisvandenbossche left a comment

jorisvandenbossche commented Aug 4, 2021

tastatham commented Aug 5, 2021

tastatham commented Aug 5, 2021

brendan-ward left a comment

jorisvandenbossche Aug 7, 2021

tastatham Aug 8, 2021

jorisvandenbossche Aug 7, 2021

tastatham Aug 8, 2021

jorisvandenbossche Aug 8, 2021

tastatham Aug 16, 2021

tastatham Aug 17, 2021

jorisvandenbossche Aug 17, 2021

martinfleis left a comment

jorisvandenbossche commented Aug 17, 2021

martinfleis left a comment

brendan-ward left a comment

jorisvandenbossche commented Aug 19, 2021

jorisvandenbossche commented Aug 19, 2021

		)


		def _part1by1(n):

ENH: Implement Morton distance #90

ENH: Implement Morton distance #90

Conversation

tastatham commented Aug 4, 2021

martinfleis left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 4, 2021

tastatham commented Aug 5, 2021

tastatham commented Aug 5, 2021

brendan-ward left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

martinfleis left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 17, 2021

martinfleis left a comment

Choose a reason for hiding this comment

brendan-ward left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Aug 19, 2021

jorisvandenbossche commented Aug 19, 2021