
ENH: extract spatial partitioning information from partitioned Parquet dataset #28

Conversation

@jorisvandenbossche (Member, Author)

Follow-up to https://github.com/jsignell/dask-geopandas/pull/14: populate the spatial_partitions attribute when reading from Parquet.
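
For context, a minimal sketch of the intended behavior; the dataset path is hypothetical and assumes the data was previously written with dask-geopandas' to_parquet:

```python
import dask_geopandas

# Hypothetical dataset previously written with ddf.to_parquet("data.parquet")
ddf = dask_geopandas.read_parquet("data.parquet")

# With this change, the per-partition bounds stored in the Parquet metadata
# populate spatial_partitions (a GeoSeries with one bounding geometry per
# partition) without a separate pass over the data.
print(ddf.spatial_partitions)
```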

@jsignell (Member) left a comment

Do we want to merge this before the alpha release?

@jorisvandenbossche (Member, Author)

I have been using this in some demo tests (https://github.com/jorisvandenbossche/scipy2020_spatial_algorithms_at_scale), and it has been working nicely. So I think this is good enough for now.

@martinfleis (Member) left a comment

I would say that this is good enough for now, but we should find a way to store the actual GeoSeries alongside the partitioned dask GeoDataFrame. As it stands, reading back does not give you the same partitions you had before saving until you call calculate_spatial_partitions again (see the sketch below).

But that is for later. Can we open an issue to track it?
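
To illustrate the round-trip gap, a sketch (the path is hypothetical): the bbox-derived partitions reconstructed from the file metadata stand in for the original partition geometries until they are recomputed.

```python
import dask_geopandas

ddf = dask_geopandas.read_parquet("data.parquet")  # hypothetical path

# spatial_partitions now holds bounding boxes reconstructed from the Parquet
# metadata, not the original (possibly non-rectangular) partition geometries.
# Recovering exact extents requires recomputing them from the data:
ddf.calculate_spatial_partitions()
```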

@jorisvandenbossche (Member, Author)

We currently rely fully on the metadata saved in the Parquet file as written by GeoPandas / defined in https://github.com/geopandas/geo-arrow-spec, and this currently only has a bbox (and not a generic extent). So we could start a discussion there to expand this, or in the short term add some custom metadata for dask-geopandas in to_parquet (this should be easy to do, but of course custom to dask-geopandas). A sketch of what the current metadata gives us is below.
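
For reference, a sketch of pulling the bbox out of the "geo" metadata that GeoPandas writes per the geo-arrow-spec; the part-file path is hypothetical:

```python
import json
import pyarrow.parquet as pq
from shapely.geometry import box

# "part.parquet" is a hypothetical path to one part file of the dataset.
meta = pq.read_metadata("part.parquet").metadata  # schema-level key/value metadata
geo = json.loads(meta[b"geo"])
col = geo["primary_column"]
bbox = geo["columns"][col].get("bbox")  # [minx, miny, maxx, maxy], if present

# The bbox is all the spec stores today, so the reconstructed partition
# geometry is necessarily a rectangle:
partition_geom = box(*bbox)
```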

@martinfleis (Member)

I know. We should first resolve #8 to ensure we know how we want to store partitions here, and then try to make them part of geo-arrow-spec.

@jorisvandenbossche (Member, Author)

Opened #73 to keep track of this discussion.

@TomAugspurger (Contributor)

This broke reading from remote filesystems, at least for fsspec + adlfs.

```python
>>> import dask_geopandas

>>> dask_geopandas.read_parquet(
...     "abfs://gbif/occurrence/2021-09-01/occurrence.parquet",
...     storage_options={"account_name": "ai4edataeuwest"}
... )
Traceback (most recent call last):
  File "/home/taugspurger/src/geopandas/dask-geopandas/foo.py", line 3, in <module>
    dask_geopandas.read_parquet(
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 117, in read_parquet
    result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py", line 318, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 67, in read_metadata
    regions = geopandas.GeoSeries([_get_partition_bounds(part) for part in parts])
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 67, in <listcomp>
    regions = geopandas.GeoSeries([_get_partition_bounds(part) for part in parts])
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 37, in _get_partition_bounds
    pq_metadata = read_metadata(path)
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/pyarrow/parquet.py", line 2232, in read_metadata
    return ParquetFile(where, memory_map=memory_map).metadata
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/pyarrow/parquet.py", line 228, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 966, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/io.pxi", line 1531, in pyarrow.lib.get_reader
  File "pyarrow/io.pxi", line 1522, in pyarrow.lib.get_native_file
  File "pyarrow/io.pxi", line 886, in pyarrow.lib.OSFile.__cinit__
  File "pyarrow/io.pxi", line 896, in pyarrow.lib.OSFile._open_readable
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'gbif/occurrence/2021-09-01/occurrence.parquet/000000'. Detail: [errno 2] No such file or directory
```

I'm looking into it now.
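
The traceback shows _get_partition_bounds handing the bare path string to pyarrow.parquet.read_metadata, which interprets it as a local file. A sketch of one possible fix (not necessarily what #103 does; the helper name is hypothetical) is to route the read through the fsspec filesystem:

```python
import json
import pyarrow.parquet as pq

def get_partition_bbox(path, filesystem):
    """Hypothetical helper: read the geo-metadata bbox for one part file,
    opening it through an fsspec filesystem so remote paths (abfs://,
    s3://, ...) work instead of being treated as local files."""
    with filesystem.open(path, "rb") as f:
        metadata = pq.read_metadata(f).metadata
    if metadata is None or b"geo" not in metadata:
        return None
    geo = json.loads(metadata[b"geo"])
    return geo["columns"][geo["primary_column"]].get("bbox")
```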

@TomAugspurger (Contributor)

Actually, #103 might fix it (it's hitting a different error now, but it seemed to fix the previous issue).
