
ENH: extract spatial partitioning information from partitioned Parquet dataset #28

Conversation

@jorisvandenbossche (Member, Author)

Follow-up to https://github.com/jsignell/dask-geopandas/pull/14: populate the spatial_partitions attribute when reading from Parquet.
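
For context, a minimal sketch of the intended behavior; the dataset path is hypothetical and assumes the data was previously written with dask-geopandas' to_parquet:

```python
import dask_geopandas

# Hypothetical dataset previously written with ddf.to_parquet("data.parquet")
ddf = dask_geopandas.read_parquet("data.parquet")

# With this change, the per-partition bounds stored in the Parquet metadata
# populate spatial_partitions (a GeoSeries with one bounding geometry per
# partition) without a separate pass over the data.
print(ddf.spatial_partitions)
```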

@jsignell (Member) left a comment

Do we want to merge this before the alpha release?

@jorisvandenbossche (Member, Author)

I have been using this in some demo tests (https://github.com/jorisvandenbossche/scipy2020_spatial_algorithms_at_scale), and it has been working nicely. So I think this is good enough for now.

@martinfleis (Member) left a comment

I would say that this is good enough for now, but we should find a way to store the actual GeoSeries alongside the partitioned dask GeoDataFrame. As it stands, reading back does not give you the same partitions you had before saving until you call calculate_spatial_partitions again (see the sketch below).

But that is for later. Can we open an issue to track it?
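
To illustrate the round-trip gap, a sketch (the path is hypothetical): the bbox-derived partitions reconstructed from the file metadata stand in for the original partition geometries until they are recomputed.

```python
import dask_geopandas

ddf = dask_geopandas.read_parquet("data.parquet")  # hypothetical path

# spatial_partitions now holds bounding boxes reconstructed from the Parquet
# metadata, not the original (possibly non-rectangular) partition geometries.
# Recovering exact extents requires recomputing them from the data:
ddf.calculate_spatial_partitions()
```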

@jorisvandenbossche (Member, Author)

We currently rely fully on the metadata saved in the Parquet file as written by GeoPandas / defined in https://github.com/geopandas/geo-arrow-spec, and this currently only has a bbox (and not a generic extent). So we could start a discussion there to expand this, or in the short term add some custom metadata for dask-geopandas in to_parquet (this should be easy to do, but of course custom to dask-geopandas). A sketch of what the current metadata gives us is below.
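
For reference, a sketch of pulling the bbox out of the "geo" metadata that GeoPandas writes per the geo-arrow-spec; the part-file path is hypothetical:

```python
import json
import pyarrow.parquet as pq
from shapely.geometry import box

# "part.parquet" is a hypothetical path to one part file of the dataset.
meta = pq.read_metadata("part.parquet").metadata  # schema-level key/value metadata
geo = json.loads(meta[b"geo"])
col = geo["primary_column"]
bbox = geo["columns"][col].get("bbox")  # [minx, miny, maxx, maxy], if present

# The bbox is all the spec stores today, so the reconstructed partition
# geometry is necessarily a rectangle:
partition_geom = box(*bbox)
```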

@martinfleis (Member)

I know. We should first resolve #8 to ensure we know how we want to store partitions here, and then try to make them part of geo-arrow-spec.

@jorisvandenbossche (Member, Author)

Opened #73 to keep track of this discussion.

@TomAugspurger (Contributor)

This broke reading from remote filesystems, at least for fsspec + adlfs.

```python
>>> import dask_geopandas

>>> dask_geopandas.read_parquet(
...     "abfs://gbif/occurrence/2021-09-01/occurrence.parquet",
...     storage_options={"account_name": "ai4edataeuwest"}
... )
Traceback (most recent call last):
  File "/home/taugspurger/src/geopandas/dask-geopandas/foo.py", line 3, in <module>
    dask_geopandas.read_parquet(
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 117, in read_parquet
    result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py", line 318, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 67, in read_metadata
    regions = geopandas.GeoSeries([_get_partition_bounds(part) for part in parts])
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 67, in <listcomp>
    regions = geopandas.GeoSeries([_get_partition_bounds(part) for part in parts])
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 37, in _get_partition_bounds
    pq_metadata = read_metadata(path)
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/pyarrow/parquet.py", line 2232, in read_metadata
    return ParquetFile(where, memory_map=memory_map).metadata
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/pyarrow/parquet.py", line 228, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 966, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/io.pxi", line 1531, in pyarrow.lib.get_reader
  File "pyarrow/io.pxi", line 1522, in pyarrow.lib.get_native_file
  File "pyarrow/io.pxi", line 886, in pyarrow.lib.OSFile.__cinit__
  File "pyarrow/io.pxi", line 896, in pyarrow.lib.OSFile._open_readable
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'gbif/occurrence/2021-09-01/occurrence.parquet/000000'. Detail: [errno 2] No such file or directory
```

I'm looking into it now.
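
The traceback shows _get_partition_bounds handing the bare path string to pyarrow.parquet.read_metadata, which interprets it as a local file. A sketch of one possible fix (not necessarily what #103 does; the helper name is hypothetical) is to route the read through the fsspec filesystem:

```python
import json
import pyarrow.parquet as pq

def get_partition_bbox(path, filesystem):
    """Hypothetical helper: read the geo-metadata bbox for one part file,
    opening it through an fsspec filesystem so remote paths (abfs://,
    s3://, ...) work instead of being treated as local files."""
    with filesystem.open(path, "rb") as f:
        metadata = pq.read_metadata(f).metadata
    if metadata is None or b"geo" not in metadata:
        return None
    geo = json.loads(metadata[b"geo"])
    return geo["columns"][geo["primary_column"]].get("bbox")
```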

@TomAugspurger (Contributor)

Actually, #103 might fix it (it's hitting a different error now, but it seemed to fix the previous issue).
