Dask Summit 2021 - "Scaling geospatial vector data" workshop #4

Open
jorisvandenbossche opened this issue Apr 30, 2021 · 6 comments

Comments

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 30, 2021

During the Dask Summit, we have a 2-hour workshop on scaling geospatial vector data, scheduled for Thursday, May 20th, 11:00-13:00 UTC (https://summit.dask.org/schedule/presentation/22/scaling-geospatial-vector-data/).

We can use this issue to further gather ideas and discuss the exact content of the workshop.

Workshop abstract:

The geospatial Python ecosystem provides a nice set of tools for working with vector data, including Shapely for geometry operations and GeoPandas for working with tabular data (and many other packages for IO, visualization, domain-specific processing, …). One of the limitations of those core tools is sub-optimal performance and limited scaling possibilities.

Over the last few years, effort has gone into improving performance through vectorized interfaces to GEOS, the underlying C library of Shapely. In turn, that enables releasing the GIL and makes the Dask - GeoPandas combination more interesting. GeoPandas is an extension to the pandas DataFrame, so the way Dask scales pandas can be applied to GeoPandas as well. An initial effort to build a bridge between Dask and GeoPandas is currently taking shape as the dask-geopandas library.

Other interesting efforts in this space are also popping up. The SpatialPandas package provides alternative pandas and Dask extensions for vectorized spatial and geometric operations. Libraries such as datashader and pydeckgl can be used to visualize larger spatial datasets.

This workshop will give a brief overview of some of the packages and ongoing efforts, and provide a place to discuss further improvements and interoperability between the libraries, with an emphasis on the conceptual design of distributed computation on inherently unpredictable vector data.

More detailed agenda:

  • Demo of dask-geopandas - Joris Van den Bossche
  • spatialpandas - Jon Mease
  • Datashader for visualizing geospatial data - Jim Bednar
  • Use cases:
    • Stefanie Lumnitz - GEDI data for biomass estimation
    • Anita Graser - movement data
    • Dani Arribas-Bel - areal interpolation
  • Partitioning of spatial data - Martin Fleischmann + discussion
  • IO - brief overview of current possibilities + open discussion about what is needed

cc @martinfleis @jsignell

@jorisvandenbossche
Member Author

Posting my initial brainstorm list of possible topics here:

  • On dask-geopandas itself:
    • Short overview / demo of the current status (what's already implemented / working)
    • Discussion about spatial partitioning (making better use of a spatial index, operations that require overlapping regions; see "Overlapping computations", dask-geopandas#40)
  • spatialpandas is another library also having a dask implementation: what can we learn from them? Ways to collaborate / share code?
  • IO: which data sources would be important? (general IO, pyogrio, parquet and geo-arrow-spec, postgis, ..)
  • Visualization (integration with datashader, pydeckgl, ..)
  • GPU / cuspatial (although very interesting, this might get us too far?)
  • Use cases: some people could briefly present their use case (the data, analyses, current bottlenecks, ..)
  • Are there interesting libraries to look at outside the Python ecosystem? (e.g. Apache Sedona, formerly GeoSpark)

@martinfleis
Member

Thanks for starting this!

I would leave GPU out of the discussion for now. The situation there is very different at the moment and it would probably require its own introduction and discussion topics, not necessarily linked to dask.

I would like to spend a reasonable amount of time on spatial partitioning and overlapping computations because figuring out this bit properly is key in my eyes. It is not a straightforward task at all because one approach needs to be used for postcode zones (contiguous compact polygons) and another one for, say, linestring trajectories.

Agree on IO. I guess that PostGIS links will be more important in dask-geopandas than they are in geopandas.
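To make the polygon-versus-trajectory contrast concrete, here is a minimal sketch (not taken from dask-geopandas or spatialpandas) that assigns a geometry to cells of a regular grid based only on its bounding box. The function name `cells_for_bounds`, the grid cell size, and the example bounds are all hypothetical and chosen purely for illustration.

```python
import math


def cells_for_bounds(bounds, cell_size):
    """Return the (column, row) grid cells that a bounding box touches.

    `bounds` is (minx, miny, maxx, maxy); `cell_size` is the side length
    of one square grid cell. Both are illustrative assumptions.
    """
    minx, miny, maxx, maxy = bounds
    col_start = math.floor(minx / cell_size)
    col_end = math.floor(maxx / cell_size)
    row_start = math.floor(miny / cell_size)
    row_end = math.floor(maxy / cell_size)
    return [
        (col, row)
        for col in range(col_start, col_end + 1)
        for row in range(row_start, row_end + 1)
    ]


# A compact polygon (e.g. a postcode zone) usually lands in a single cell.
compact_polygon_bounds = (1.0, 1.0, 2.0, 2.0)
# A long trajectory has a bounding box that spans many cells.
trajectory_bounds = (1.0, 1.0, 35.0, 28.0)

print(len(cells_for_bounds(compact_polygon_bounds, 10.0)))
print(len(cells_for_bounds(trajectory_bounds, 10.0)))
```

With these made-up numbers the compact polygon maps to one cell while the trajectory's bounding box covers a dozen, so a scheme that works well for the former would have to duplicate or split the latter.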

We can touch visualisation while talking about spatialpandas, since that is used as a direct interface to datashader. (As a side note, it may be useful to work out dask-based conversion between dask-geopandas and spatialpandas geometries.)

@jorisvandenbossche
Member Author

I updated the top post with a summary of what we discussed yesterday (to be completed once people confirm).

@martinfleis
Member

martinfleis commented May 13, 2021

Should we maybe switch use cases and my bit on partitioning and indexing? That way I can try to summarise them and open the floor for the main discussion in which we can reflect on real-life use cases along the way.

edit: I switched it above

@jorisvandenbossche
Member Author

Sounds good. I am only wondering whether we should then also move spatialpandas to just before your talk (since it will mainly touch on the spatial partitioning / Hilbert curve for repartitioning)? On the other hand, it also fits after my dask-geopandas explanation.
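For readers unfamiliar with the idea, the following is a minimal pure-Python sketch of repartitioning along a Hilbert curve. It uses the standard iterative cell-to-distance conversion and is not the spatialpandas or dask-geopandas implementation. The names `hilbert_distance` and `hilbert_partitions`, the grid `order`, and the sample points are assumptions made for illustration.

```python
def hilbert_distance(x, y, order):
    """Distance of integer grid cell (x, y) along a Hilbert curve.

    The grid has side 2**order; x and y must lie in [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has a standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def hilbert_partitions(points, npartitions, order=8):
    """Sort (x, y) points by Hilbert distance and cut into equal-sized chunks."""
    n = 1 << order
    minx = min(p[0] for p in points)
    maxx = max(p[0] for p in points)
    miny = min(p[1] for p in points)
    maxy = max(p[1] for p in points)

    def to_cell(value, lo, hi):
        # Scale a coordinate onto the integer grid, clamping the upper edge.
        if hi == lo:
            return 0
        return min(n - 1, int((value - lo) / (hi - lo) * n))

    ranked = sorted(
        points,
        key=lambda p: hilbert_distance(
            to_cell(p[0], minx, maxx), to_cell(p[1], miny, maxy), order
        ),
    )
    size = -(-len(ranked) // npartitions)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# Two small clusters of made-up points, interleaved in the input order.
points = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.2), (9.9, 9.8)]
for part in hilbert_partitions(points, npartitions=2):
    print(part)
```

Because nearby cells tend to have nearby Hilbert distances, cutting the sorted sequence into chunks tends to keep spatially close points in the same partition. This sketch only handles points; geometries with extent would need a representative point or another strategy.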

@martinfleis
Member

I'd leave it where it is to cover the existing packages first.
