ENH: extract spatial partitioning information from partitioned Parquet dataset #28
Conversation
Do we want to merge this before the alpha release?
I have been using this on some demo tests (https://github.com/jorisvandenbossche/scipy2020_spatial_algorithms_at_scale), and it has been working nicely so far. So I think this is good enough for now.
I would say that this is good enough for now, but we should find a way to store the actual GeoSeries alongside the partitioned dask GeoDataFrame. This does not give you the same partitions you had before saving until you call calculate_spatial_partitions again.

But that is for later. Can we open an issue to track that?
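To illustrate what the spatial partitioning information amounts to, here is a minimal sketch using plain (x, y) coordinate tuples in place of real geometries; in dask-geopandas the equivalent information is a bounding box (or geometry) per partition stored in the spatial_partitions attribute:

```python
# Minimal sketch: the spatial partitioning info is one bounding box per
# partition. Plain coordinate tuples stand in for real geometries here.

def partition_bounds(points):
    """Return (minx, miny, maxx, maxy) for one partition's points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

partitions = [
    [(0, 0), (1, 2)],          # partition 0
    [(5, 5), (7, 9), (6, 4)],  # partition 1
]

# One bbox per partition -- this is what gets lost on a round-trip
# through Parquet unless it is recomputed or stored in metadata.
bounds = [partition_bounds(p) for p in partitions]
# bounds[0] == (0, 0, 1, 2); bounds[1] == (5, 4, 7, 9)
```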
We currently rely fully on the metadata saved in the Parquet file as written by GeoPandas / defined in https://github.com/geopandas/geo-arrow-spec, and this currently only has a bbox (and not a generic extent). So we could start a discussion there to expand this, or in the short term add some custom metadata for dask_geopandas in the Parquet file.
I know. We should first resolve #8 so that we know how we want to store partitions here, and then try to make them part of geo-arrow-spec.
Opened #73 to keep track of this discussion.
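For reference, extracting the bbox from the geo-arrow-spec style metadata is a small JSON lookup. A sketch of the metadata shape, with the structure following the spec (a "geo" key whose JSON value records a "bbox" per geometry column); the concrete values here are made up:

```python
import json

# Example "geo" metadata payload as geopandas writes it into the
# Parquet file-level metadata (values invented for illustration).
geo_metadata = json.dumps({
    "primary_column": "geometry",
    "columns": {
        "geometry": {"crs": "EPSG:4326", "bbox": [0.0, 0.0, 10.0, 20.0]},
    },
})

meta = json.loads(geo_metadata)
primary = meta["primary_column"]
bbox = meta["columns"][primary]["bbox"]  # [minx, miny, maxx, maxy]
```

Since the spec only records this bbox, anything richer (an arbitrary extent geometry per partition) would need either a spec extension or custom dask_geopandas metadata.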
This broke reading from remote filesystems, at least for fsspec + adlfs.

```pycon
>>> import dask_geopandas
>>> dask_geopandas.read_parquet(
...     "abfs://gbif/occurrence/2021-09-01/occurrence.parquet",
...     storage_options={"account_name": "ai4edataeuwest"}
... )
Traceback (most recent call last):
  File "/home/taugspurger/src/geopandas/dask-geopandas/foo.py", line 3, in <module>
    dask_geopandas.read_parquet(
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 117, in read_parquet
    result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs)
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py", line 318, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 67, in read_metadata
    regions = geopandas.GeoSeries([_get_partition_bounds(part) for part in parts])
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 67, in <listcomp>
    regions = geopandas.GeoSeries([_get_partition_bounds(part) for part in parts])
  File "/home/taugspurger/src/geopandas/dask-geopandas/dask_geopandas/io/parquet.py", line 37, in _get_partition_bounds
    pq_metadata = read_metadata(path)
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/pyarrow/parquet.py", line 2232, in read_metadata
    return ParquetFile(where, memory_map=memory_map).metadata
  File "/home/taugspurger/miniconda3/envs/geopandas-dev/lib/python3.9/site-packages/pyarrow/parquet.py", line 228, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 966, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/io.pxi", line 1531, in pyarrow.lib.get_reader
  File "pyarrow/io.pxi", line 1522, in pyarrow.lib.get_native_file
  File "pyarrow/io.pxi", line 886, in pyarrow.lib.OSFile.__cinit__
  File "pyarrow/io.pxi", line 896, in pyarrow.lib.OSFile._open_readable
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 112, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file 'gbif/occurrence/2021-09-01/occurrence.parquet/000000'. Detail: [errno 2] No such file or directory
```

I'm looking into it now.
Actually, #103 might fix it (it's hitting a different error now, but it seems to fix the previous issue).
Follow-up on https://github.com/jsignell/dask-geopandas/pull/14 to populate the spatial_partitions attribute when reading from Parquet.