ENH: Add shuffle #104

tastatham · 2021-08-22T16:39:55Z

Implement shuffle for Dask-GeoPandas. This allows spatially shuffling/partitioning a Dask-GeoPandas object using one of three methods (hilbert, morton or geohash) or a user-defined column.
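
For context, each of these methods maps a geometry to a one-dimensional key so that features can be ordered along a space-filling curve. The sketch below shows the idea for a Morton (Z-order) key by interleaving the bits of integer grid coordinates. It is illustrative only and is not the Dask-GeoPandas implementation; the names interleave_bits and morton_key and the 16-bit grid are assumptions made for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Morton (Z-order) code of the integer grid cell (x, y)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits fill the even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits fill the odd positions
    return code


def morton_key(lon: float, lat: float, bounds, bits: int = 16) -> int:
    """Snap a point to a 2**bits x 2**bits grid over `bounds` and encode it."""
    minx, miny, maxx, maxy = bounds
    cells = (1 << bits) - 1
    x = int((lon - minx) / (maxx - minx) * cells)
    y = int((lat - miny) / (maxy - miny) * cells)
    return interleave_bits(x, y, bits)


# Ordering a few grid cells by their Morton code traces a "Z" shape
cells = [(1, 1), (0, 0), (1, 0), (0, 1)]
ordered = sorted(cells, key=lambda cell: interleave_bits(*cell))
```

Sorting or binning rows by such a key is what turns a one-dimensional partitioning routine into a spatial one.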

martinfleis

Thanks for looking into that.

It may also be useful to explore the shuffle method of dask.dataframe (https://github.com/dask/dask/blob/4229c16cf0cc7ed5f60f313618680bfd04e381ea/dask/dataframe/shuffle.py#L301). We may be able to use that instead of set_index to avoid some overhead.

martinfleis · 2021-08-23T14:52:30Z

Just dropping a note here from @mrocklin in dask/dask#8075 (comment) for reference:

Also, as a heads-up, I'm working on a newer algorithm here. The API will
still be the same (I saw that you're doing this for geopandas) but there
may be new algorithms to learn from in the future that may perform slightly
better at large scale.

TomAugspurger · 2021-08-26T18:20:13Z

An alternative to a separate spatial_shuffle is to override the parent shuffle method (use the same keywords) but have an additional keyword like method="hilbert". Then you'd call the parent method with on=self.hilbert_distance(self[on])

In dask.dataframe.DataFrame, on is required but I think it'd be fine to set the default to None and use the primary geometry column by default.

gjoseph92 · 2021-09-14T23:01:28Z

I also recommend @TomAugspurger's approach. Calling into super().shuffle(...) seems better than fully overriding the method. Plus, there are some improvements to shuffling coming up that it would be easier to benefit from with this method.

Seems like if on is None, then you want super().shuffle(hash_method(self), ...), otherwise just super().shuffle(on, ...). (Note that shuffling also isn't quite the same as set_index.)
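
The pattern suggested in the two comments above can be sketched with toy classes. Everything here is a stand-in: BaseFrame and GeoFrame are hypothetical and only mimic the shape of a parent shuffle method, rows are integer grid cells rather than geometries, and hilbert_index is the textbook Hilbert-curve index rather than the Dask-GeoPandas hilbert_distance.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Textbook Hilbert-curve index of cell (x, y) on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


class BaseFrame:
    """Toy stand-in for the parent dataframe class; NOT the real Dask class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def shuffle(self, on, npartitions):
        """Hash-partition rows by `on`, a list holding one key per row."""
        parts = [[] for _ in range(npartitions)]
        for row, key in zip(self.rows, on):
            parts[hash(key) % npartitions].append(row)
        return parts


class GeoFrame(BaseFrame):
    """Toy spatial subclass whose rows are (x, y) cells on a small grid."""

    def __init__(self, rows, grid_size=4):
        super().__init__(rows)
        self.grid_size = grid_size

    def hilbert_distance(self):
        return [hilbert_index(self.grid_size, x, y) for x, y in self.rows]

    def shuffle(self, on=None, npartitions=2, method="hilbert"):
        # Same keywords as the parent, plus `method`; `on` defaults to a spatial key
        if on is None:
            if method != "hilbert":
                raise ValueError(f"unsupported method in this sketch: {method!r}")
            on = self.hilbert_distance()
        return super().shuffle(on, npartitions)
```

Because the subclass only prepares the key and then delegates, any improvement to the parent's shuffle is picked up without further changes. Note that the parent here places rows by hash, so the output partitions are not sorted by the spatial key.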

tastatham · 2021-10-07T14:07:33Z

@martinfleis, I have made the changes based on your suggestions.

The function now uses Dask's shuffle method instead of set_index.
I have temporarily named the function spatial_shuffle instead of shuffle, but we can discuss this in our next meeting.
I have added partitions as an additional argument, which controls whether the spatial partitions are calculated.
I had to add set_geometry to ensure geometries are retained, as discussed in #116 (BUG: Computing a dask shuffle returns a pd.DataFrame, not gpd.GeoDataFrame).
I also dropped the column parameter and instead computed the partitioning information outside of the spatial_shuffle method. I felt this was both more interactive and cleaner, but I am happy to retain this within the function.

A quick example is shown below:

import geopandas
import dask_geopandas

p = 10
npartitions = 20

# Load a small example dataset into a single-partition Dask-GeoPandas frame
gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(gdf, npartitions=1)

# Compute the Hilbert distance outside the method and shuffle on that column
ddf["hilbert"] = ddf.hilbert_distance(p)
shuffled_ddf = ddf.spatial_shuffle(on="hilbert", npartitions=npartitions, partitions=True)
print(shuffled_ddf)

# Render the task graph
shuffled_ddf.visualize()

martinfleis · 2021-10-07T15:43:02Z

That wouldn't work, see the resulting spatial partitions:

shuffle doesn't work in the same way as set_index. See the docs:

Uses hashing of on to map rows to output partitions. After this operation, rows with the same value of on will be in the same partition.

It means that it doesn't do the sorting. What we would need here is to get bins and use bin labels as on in shuffle.
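
The difference can be illustrated without Dask. In this sketch (plain Python, with hypothetical helper names hash_partition and quantile_bin_labels), hashing consecutive keys scatters neighboring values across partitions, while equal-count bin labels keep each contiguous key range together. It assumes integer keys such as Hilbert distances and uses Python's built-in hash, which is not the hash function Dask uses.

```python
from bisect import bisect_right


def hash_partition(keys, npartitions):
    """Partition id per key when rows are placed by hash(key) % npartitions."""
    return [hash(k) % npartitions for k in keys]


def quantile_bin_labels(keys, npartitions):
    """Label each key with the index of the equal-count bin it falls into."""
    ordered = sorted(keys)
    # Cut points: the first key of every bin except the lowest one
    cuts = [ordered[len(ordered) * i // npartitions] for i in range(1, npartitions)]
    return [bisect_right(cuts, k) for k in keys]


keys = [0, 1, 2, 3, 4, 5, 6, 7]  # e.g. distances along a space-filling curve

by_hash = hash_partition(keys, 2)      # neighbors alternate between partitions
by_bin = quantile_bin_labels(keys, 2)  # low half gets label 0, high half label 1
```

Passing labels like by_bin as on would then send every row of a bin to the same output partition, since rows with equal on values always land together.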

Even if it worked, my proposal was slightly different. I was suggesting that we do the calculation of e.g. hilbert distance under the hood, something along these lines:

def spatial_shuffle(self, on="hilbert", npartitions=20, partitions=True, **kwargs):
    if on == "hilbert":
        on = self.hilbert_distance()
    elif on == "morton":
        on = self.morton_distance()

    # do the actual shuffle here

martinfleis · 2022-01-18T20:12:50Z

@tastatham To get closer to the actual release, I have opened #131 to implement spatial_shuffle based on set_index there. That should free a bit of your time and you can finish that notebook with documentation to wrap up GSoC.

martinfleis · 2022-01-28T23:19:15Z

Superseded by #131

add shuffle to core

f29f8b0

martinfleis reviewed Aug 23, 2021

martinfleis mentioned this pull request Sep 9, 2021

ENH: Implement partitioning based on hilbert distance #71

Closed

tastatham and others added 2 commits October 7, 2021 14:16

Merge branch 'geopandas:master' into shuffle

c8eb76a

update spatial_shuffle to use shuffle method instead of set_index

37dc964

martinfleis mentioned this pull request Jan 17, 2022

ENH: implement spatial_shuffle method #131

Merged

martinfleis closed this Jan 28, 2022
