Dask Summit 2021 - "Scaling geospatial vector data" workshop #4

jorisvandenbossche · 2021-04-30T17:23:40Z

During the Dask Summit, we have a 2-hour workshop scheduled about scaling geospatial vector data, on Thursday May 20th from 11:00 to 13:00 UTC (https://summit.dask.org/schedule/presentation/22/scaling-geospatial-vector-data/).

We can use this issue to further gather ideas and discuss the exact content of the workshop.

Workshop abstract:

The geospatial Python ecosystem provides a nice set of tools for working with vector data, including Shapely for geometry operations and GeoPandas for working with tabular data (and many other packages for IO, visualization, domain-specific processing, …). One of the limitations of those core tools is sub-optimal performance and limited scaling possibilities.

Over the last years, effort has been put into improving the performance through vectorized interfaces to GEOS, the underlying C library of Shapely. In turn, that enables releasing the GIL and makes the Dask - GeoPandas combination more interesting. GeoPandas is an extension to the pandas DataFrame, and thus the way Dask scales pandas can be applied to GeoPandas as well. An initial effort to build a bridge between Dask and GeoPandas is currently taking shape as the dask-geopandas library.

Other interesting efforts in this space are popping up as well. The SpatialPandas package provides alternative pandas and Dask extensions for vectorized spatial and geometric operations. Libraries such as datashader and pydeckgl can be used to visualize larger spatial datasets.

This workshop will give a brief overview of some of the packages and ongoing efforts, and provide a place to discuss further improvements and interoperability between the libraries, with an emphasis on the conceptual design of distributed computation on inherently unpredictable vector data.

More detailed agenda:

- Demo of dask-geopandas - Joris Van den Bossche
- spatialpandas - Jon Mease
- Datashader for visualizing geospatial data - Jim Bednar
- Use cases:
  - Stefanie Lumnitz - GEDI data for biomass estimation
  - Anita Graser - movement data
  - Dani Arribas-Bel - areal interpolation
- Partitioning of spatial data - Martin Fleischmann + discussion
- IO - brief overview of current possibilities + open discussion about what is needed

cc @martinfleis @jsignell

jorisvandenbossche · 2021-04-30T17:35:56Z

Posting my initial brainstorm list of possible topics here:

- On dask-geopandas itself:
  - Short overview / demo of the current status (what's already implemented / working)
  - Discussion about spatial partitioning (making better use of a spatial index, operations that need overlapping regions; see "Overlapping computations", dask-geopandas#40)
- spatialpandas is another library that also has a dask implementation: what can we learn from them? Ways to collaborate / share code?
- IO: which data sources would be important? (general IO, pyogrio, parquet and geo-arrow-spec, postgis, ..)
- Visualization (integration with datashader, pydeckgl, ..)
- GPU / cuspatial (although very interesting, this might get us too far?)
- Use cases: some people could briefly present their use case (the data, analyses, current bottlenecks, ..)
- Are there interesting libraries to look at outside the Python ecosystem? (e.g. Apache Sedona, formerly GeoSpark)

martinfleis · 2021-05-03T10:02:58Z

Thanks for starting this!

I would leave GPU out of the discussion for now. The situation there is very different at the moment and it would probably require its own introduction and discussion topics, not necessarily linked to dask.

I would like to spend a reasonable amount of time on spatial partitioning and overlapping computations because figuring out this bit properly is key in my eyes. It is not a straightforward task at all, because one approach needs to be used for postcode zones (contiguous compact polygons) and another one for, say, linestring trajectories.
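
To make the contrast concrete, here is a minimal sketch in plain Python (no dask-geopandas API involved). It assumes a regular grid as the partitioning scheme and approximates each geometry by its bounding box; the helper `grid_cells_touched` and the example bounds are hypothetical and only meant for illustration.

```python
import math


def grid_cells_touched(bounds, cell_size):
    """Return the set of (col, row) grid cells that a bounding box intersects.

    `bounds` is (minx, miny, maxx, maxy). This is a hypothetical helper for
    illustration, not part of any library.
    """
    minx, miny, maxx, maxy = bounds
    cols = range(math.floor(minx / cell_size), math.floor(maxx / cell_size) + 1)
    rows = range(math.floor(miny / cell_size), math.floor(maxy / cell_size) + 1)
    return {(col, row) for col in cols for row in rows}


# Made-up bounding boxes: a compact postcode polygon and a long trajectory.
postcode_bounds = (2.0, 2.0, 8.0, 8.0)
trajectory_bounds = (5.0, 5.0, 95.0, 35.0)

postcode_cells = grid_cells_touched(postcode_bounds, cell_size=10.0)
trajectory_cells = grid_cells_touched(trajectory_bounds, cell_size=10.0)

# The compact polygon falls in a single cell; the trajectory's bounding box
# spans many cells, most of which the line itself may never enter.
print(len(postcode_cells), len(trajectory_cells))  # prints: 1 40
```

With compact polygons, a grid (or any bounding-box based scheme) assigns each geometry to one or a few partitions. For a trajectory, the bounding box is a poor proxy for where the line actually is, which is one reason a different approach is likely needed there.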

Agree on IO. I guess that PostGIS links will be more important in dask-geopandas than they are in geopandas.

We can touch visualisation while talking about spatialpandas, since that is used as a direct interface to datashader. (As a side note, it may be useful to work out dask-based conversion between dask-geopandas and spatialpandas geometries.)

jorisvandenbossche · 2021-05-12T07:12:50Z

I updated the top post with a summary of what we discussed yesterday (to be completed once people confirm).

martinfleis · 2021-05-13T15:19:03Z

Should we maybe switch use cases and my bit on partitioning and indexing? That way I can try to summarise them and open the floor for the main discussion in which we can reflect on real-life use cases along the way.

edit: I switched it above

jorisvandenbossche · 2021-05-14T12:01:05Z

Sounds good. I am only wondering whether we should then also move spatialpandas to just before your talk (since it will mainly touch on spatial partitioning / the Hilbert curve for repartitioning)? On the other hand, it also fits after my dask-geopandas explanation.
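
For readers unfamiliar with the idea, below is a rough sketch of repartitioning along a Hilbert curve, in plain Python. It is not the spatialpandas implementation: `hilbert_distance` is the standard textbook conversion from grid coordinates to curve distance, and `partition_by_hilbert`, the point data, and the choice of `order` are assumptions made for illustration.

```python
def hilbert_distance(x, y, order):
    """Distance along a Hilbert curve of the given order for grid cell (x, y).

    The grid has 2**order cells per side; x and y must be integers in
    [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def partition_by_hilbert(points, npartitions, order=8):
    """Sort (x, y) points by Hilbert distance and cut them into equal chunks.

    Hypothetical helper: points are snapped to a 2**order grid covering
    their total bounds before the distance is computed.
    """
    n = 1 << order
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    minx, maxx, miny, maxy = min(xs), max(xs), min(ys), max(ys)

    def to_cell(value, lo, hi):
        if hi == lo:
            return 0
        return min(n - 1, int((value - lo) / (hi - lo) * n))

    def key(p):
        return hilbert_distance(
            to_cell(p[0], minx, maxx), to_cell(p[1], miny, maxy), order
        )

    ordered = sorted(points, key=key)
    size = -(-len(ordered) // npartitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Two made-up clusters of points, deliberately interleaved in the input.
points = [(0, 0), (10, 10), (1, 0), (9, 10), (0, 1), (10, 9), (1, 1), (9, 9)]
parts = partition_by_hilbert(points, npartitions=2)
```

Because points that are close along the curve tend to be close in space, cutting the sorted sequence into chunks yields partitions that are spatially coherent; in this example each cluster ends up in its own partition.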

martinfleis · 2021-05-14T12:32:06Z

I'd leave it where it is to cover the existing packages first.

jorisvandenbossche mentioned this issue May 7, 2021: SpatialPandas design and features holoviz/spatialpandas#1 (Open, 60 tasks)

jorisvandenbossche mentioned this issue May 12, 2021: Announcement: workshop on scaling geospatial vector data during Dask Summit (Thursday May 20th) geopandas/dask-geopandas#45 (Closed)
