Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement Morton distance #90

Merged
merged 24 commits into from
Aug 19, 2021

Conversation

tastatham
Copy link
Contributor

Implement morton_distance for Dask-GeoPandas. This provides another way of spatially partitioning Dask-DataFrames using another space-filling curve.

Details:

  • Calculating the hilbert distance will allow us to partition Dask-GeoPandas using this information.
  • This supports both GeoPandas & Dask-GeoPandas
  • The output is returned as a Pandas Series

When compared to the Hilbert distance, there is very little difference in compute time. However, from some basic comparisons, the Hilbert distance does appear to preserve locality better. Further work needs to be carried out to compare the two (as well as a third partitioning method; Geohash).

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! First quick pass (haven't played with it yet).

Can it be combined with numba? That loop in _distances_from_coordinates may benefit from that with large GeoDataFrames.

continuous_integration/envs/37-latest.yaml Outdated Show resolved Hide resolved
dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved

"""
bounds = gdf.bounds.to_numpy()
coords = _continuous_to_discrete_coords(total_bounds, bounds, p)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The p parameter is the same as the Hilbert curve has? I mean, denoting the same concept?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the following logic holds true - correct me otherwise;
Using the _continuous_to_discrete_coords transforms each of the coordinate pairs from continuous to integers.
The p (or order of curve) value will control the number of iterations in constructing the [Hilbert/Morton] curve.
Then calculated distances along each curve work entirely in terms of integers, where the size of the hypercube will reflect the number of bits per dimension - controlled by p;

  • The 1st order curve supports values (0|1) on both x and y axis - giving us 2 bit binary numbers of the range from 0,1,2,3.
  • Then the 2nd order curve supports values (0|1|2|3) on ... - giving us 4 bit binary numbers of the range 0,1...,15 and so on...

Therefore, using the same p value (or n iterations) for calculated distances along both curves will be bound by the same size hypercube.

The maximum discrete distances along each order are as follows;

iterations = 16 # or p 
n = 2 # two-dimension
max_discrete_distances = []
for p in range(iterations):
    max_discrete_distances.append(2 ** (n*p)  - 1 ) # -1 to start index at 0
max_discrete_distances[1:]  # from 1st to 15th order curve

long story short, yes it denotes the same concept.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Thanks for the great explanation!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The p (or order of curve) value will control the number of iterations in constructing the [Hilbert/Morton] curve.

But as far as I can see, to calculate the Morton curve, we are not using any iterations? (in contrast to Hilbert curve)
The encode_morton and part1by1 seem to do a fixed set of computations independent of p.

(the explanation that it controls the size of the hypercube (and thus the precision of the calculated distances) is of course still valid nonetheless)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I suppose the "classical" implementation also uses loops, but the bit-shifting method is an optimized version that doesn't require loops)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming question: if I google "morton distance", I basically get no useful result (while "hilbert distance" does give something useful). For google it, you rather need something like "morton curve". So wondering if "morton distance" is the best name (although it's of course the "distance along the curve", I suppose?
Now, that still results in a wiki page for which the main title is "Z-order curve", so we could also consider using "Z-order" instead of "Morton" for the naming?

dask_geopandas/morton_distance.py Show resolved Hide resolved
dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

When compared to the Hilbert distance, there is very little difference in compute time.

I suppose that with using the numpy-based implementation, this will become faster. A
lso, best test the performance on somewhat larger data (eg I see in your notebook that you are timing it on the "naturalearth_lowres" data, which is tiny. If you profile that timing (eg with snakeviz), you'll see that you are mostly measuring overhead, and the actual hilbert/morton encoding only is a small part). You can also easily combine the data multiple times to artificially get bigger data (gdf=pd.concat([gdf]*1000, axis=0))

continuous_integration/envs/39-dev.yaml Outdated Show resolved Hide resolved
continuous_integration/envs/39-dev.yaml Outdated Show resolved Hide resolved
@tastatham
Copy link
Contributor Author

Naming question: if I google "morton distance", I basically get no useful result (while "hilbert distance" does give something useful). For google it, you rather need something like "morton curve". So wondering if "morton distance" is the best name (although it's of course the "distance along the curve", I suppose?
Now, that still results in a wiki page for which the main title is "Z-order curve", so we could also consider using "Z-order" instead of "Morton" for the naming?

I did think about this briefly and thought that the naming of the morton_distance was appropriate. This is because;

  • We are calculating the distance along the Morton/Z-order curve.
  • To not confuse Z-order with Z-dimension
  • It's consistent with the hilbert_distance function (which we can also change)

I'm happy to discuss this later on in the call.

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@tastatham
Copy link
Contributor Author

When compared to the Hilbert distance, there is very little difference in compute time.

I suppose that with using the numpy-based implementation, this will become faster. A
lso, best test the performance on somewhat larger data (eg I see in your notebook that you are timing it on the "naturalearth_lowres" data, which is tiny. If you profile that timing (eg with snakeviz), you'll see that you are mostly measuring overhead, and the actual hilbert/morton encoding only is a small part). You can also easily combine the data multiple times to artificially get bigger data (gdf=pd.concat([gdf]*1000, axis=0))

Good point, I will profile this and the numpy version using the artificial dataset.

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress! The main changes are to update the docstrings.

If possible, it would be good to add the numpy version of the bitwise operations; hopefully that speeds it up significantly compared to Hilbert.

dask_geopandas/core.py Outdated Show resolved Hide resolved
dask_geopandas/morton_distance.py Show resolved Hide resolved
dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved
dask_geopandas/morton_distance.py Show resolved Hide resolved
dask_geopandas/core.py Outdated Show resolved Hide resolved
continuous_integration/envs/39-dev.yaml Show resolved Hide resolved
dask_geopandas/core.py Show resolved Hide resolved
dask_geopandas/core.py Show resolved Hide resolved
dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved
)


def _part1by1(n):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an idea where the "part1by1" name comes from? (I mean, it comes from the active state recipe, but what's the reason for it? (I find it a bit cryptic name))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, the naming convention is a bit cryptic but seems to be widely adopted. See Real-Time Collision Detection book, Chapter 7.

I think it comes from the fact the part1by1 function inserts one zero bit between each original data bit (part), the part1by2 function inserts two zero bits between each original data bit (part) and so on...

Interleaved bits
"""

n &= 0x0000FFFF
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not included in the code in http://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN (the reference you added now). Do you know what it is for? (it's eg also there in pymorton https://github.com/trevorprater/pymorton/blob/f248683c19d90e904193c58bbd03e77bc2c43768/pymorton/pymorton.py#L11)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's also not used in the [Real-Time Collision Detection book, Chapter 7, 317](http://www.r-5.org/files/books/computers/algo-list/realtime-3d/Christer_Ericson-Real-Time_Collision_Detection-EN.pdf

Updated in 27ca9e7

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking a little bit into it, this might actually still be needed because we are using int64 for the integer coordinates. With the current value of p=15 that you are using for testing, the coordinates never become larger than the max value of an uint16.

But if you try to run the tests with eg p=20, do they still pass?

The example code in the book you reference uses uint32 type for the input values, and says it only uses the low 16 bits of those numbers (and then spreading those out with the bit interleaving). So if I am understanding correctly, if your coordinates are 16bit, you view then as 32bit when passing to the bit interleaving code. And that line n &= 0x0000FFFF might ensure that it only uses the lower 16 bits? (although I still don't fully understand why it is in there, as if you have a number that is larger than what can be represented by the 16bits (>65535), it gets truncated by n &= 0x0000FFFF anyway and then you also get a wrong result ..
But eg GEOS also has a Morton implementation, and includes it as well: https://github.com/libgeos/geos/blob/16610691611decd2b9545ea68ef41e8b94f859e6/src/shape/fractal/MortonCode.cpp#L95-L104

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's correct. These examples assume the input values are of type uint32. Currently uint64 coords are passed to the _part1by1 function. This comes from the _continuous_to_discrete_coords function.

In #70 and #90 we discussed several follow up issues;

  1. Setting the dtype for the _continuous_to_discrete_coords function
  2. Centralizing several functions which are used by both Hilbert and Morton distance.

We could look into this further alongside these follow up issues and for now, simply pass the coords to the _part1by1 function as type unit32? - coords.astype("uint32")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the last component that needs updating prior to merging it. Any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe revert the removal of this line for now, and then list it as a follow-up to check / test more.

Otherwise, I would at least eg add a test with a higher p value as mentioned above.

dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final super tiny suggestion from me. Otherwise, this is good to go. Nice job @tastatham!

continuous_integration/envs/38-latest.yaml Outdated Show resolved Hide resolved
continuous_integration/envs/39-latest.yaml Outdated Show resolved Hide resolved
continuous_integration/envs/37-latest.yaml Show resolved Hide resolved
tastatham and others added 3 commits August 16, 2021 10:40
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
dask_geopandas/core.py Outdated Show resolved Hide resolved
dask_geopandas/core.py Outdated Show resolved Hide resolved
dask_geopandas/core.py Outdated Show resolved Hide resolved
tastatham and others added 2 commits August 17, 2021 12:40
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
Co-authored-by: Martin Fleischmann <martin@martinfleischmann.net>
@jorisvandenbossche
Copy link
Member

The linting is failing (probably something white space in the suggested edit)

tastatham and others added 2 commits August 17, 2021 20:39
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're done here. Thanks a lot @tastatham!

Copy link
Member

@brendan-ward brendan-ward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

Eventually it would be good to add more guidance or at least an appropriate range for the p parameter, at least so that the user doesn't overflow data types by using too large a p parameter.

dask_geopandas/morton_distance.py Outdated Show resolved Hide resolved
tastatham and others added 2 commits August 18, 2021 20:51
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
@jorisvandenbossche
Copy link
Member

I added back one line in the algorithm, so it matches what pymorton does (see the discussion above #90 (comment)). We can then properly figure out if it is needed / if we need to cast the integer coordinates to a certain bitsize etc in a follow-up.

@jorisvandenbossche jorisvandenbossche merged commit bcbdfd1 into geopandas:master Aug 19, 2021
@jorisvandenbossche
Copy link
Member

Thanks @tastatham !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants