ENH: Add shuffle #104
Thanks for looking into that.
It may also be useful to explore the shuffle method of dask.dataframe (https://github.com/dask/dask/blob/4229c16cf0cc7ed5f60f313618680bfd04e381ea/dask/dataframe/shuffle.py#L301). We may be able to use that instead of set_index to avoid some overhead.
Just dropping a note here from @mrocklin in dask/dask#8075 (comment) for reference.
An alternative to a separate In |
I also recommend @TomAugspurger's approach. Calling into Seems like if |
@martinfleis, I have made the changes based on your suggestions.
A quick example is shown below:
import geopandas
import dask_geopandas
p = 10
npartitions = 20
gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
ddf = dask_geopandas.from_geopandas(gdf, npartitions=1)
ddf["hilbert"] = ddf.hilbert_distance(p)
shuffled_ddf = ddf.spatial_shuffle(on="hilbert", npartitions=npartitions, partitions=True)
print(shuffled_ddf)
shuffled_ddf.visualize()
That wouldn't work, see the resulting spatial partitions:
It means that it doesn't do the sorting. What we would need here is to get bins and use bin labels as
Even if it worked, my proposal was slightly different. I was suggesting that we do the calculation of e.g. hilbert distance under the hood, something along these lines:
def spatial_shuffle(self, on="hilbert", npartitions=20, partitions=True, **kwargs):
    if on == "hilbert":
        on = self.hilbert_distance()
    elif on == "morton":
        on = self.morton_distance()
    # do the actual shuffle here
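The idea of computing a space-filling-curve key under the hood and then turning it into bin labels can be illustrated with a minimal, pure-Python sketch. It is not the Dask-GeoPandas implementation: the helper names (`interleave_bits`, `morton_keys`, `bin_labels`) are hypothetical, it works on plain coordinate tuples rather than geometries, and it uses a Morton (Z-order) key because that is the simplest of the curves to write down. Equal-count quantile cuts are assumed as the binning rule.

```python
from bisect import bisect_right


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) key: interleave the low `bits` bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


def morton_keys(points, bounds, bits=16):
    """Map (x, y) points inside `bounds` onto an integer grid and return one key per point."""
    minx, miny, maxx, maxy = bounds
    scale = (1 << bits) - 1
    keys = []
    for px, py in points:
        gx = int((px - minx) / (maxx - minx) * scale)
        gy = int((py - miny) / (maxy - miny) * scale)
        keys.append(interleave_bits(gx, gy, bits))
    return keys


def bin_labels(keys, npartitions):
    """Assign each key a partition label in [0, npartitions) using equal-count cut points."""
    ordered = sorted(keys)
    n = len(ordered)
    # Interior cut points taken at equal-count quantiles of the sorted keys.
    cuts = [ordered[(i * n) // npartitions] for i in range(1, npartitions)]
    return [bisect_right(cuts, k) for k in keys]


# Hypothetical usage: four points in a 10 x 10 extent, split into two partitions.
points = [(0, 0), (1, 1), (10, 10), (9, 9)]
labels = bin_labels(morton_keys(points, bounds=(0, 0, 10, 10)), npartitions=2)
```

In a real shuffle the labels would serve as the column the data is repartitioned on, so that points close together along the curve end up in the same partition.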
@tastatham To get closer to the actual release, I have opened #131 to implement |
Superseded by #131.
Implement shuffle for Dask-GeoPandas. This allows spatially shuffling/partitioning a Dask-GeoPandas object using one of three methods (hilbert, morton or geohash) or a user-defined column.