ENH: support storage_options argument in read_parquet #2071

Closed · knaaptime opened this issue Aug 20, 2021 · 8 comments · Fixed by #2107

@knaaptime commented Aug 20, 2021

Is your feature request related to a problem?

I store lots of data in a quilt bucket (i.e. S3 storage) and use s3fs with geopandas to read data directly from the wire, like:

gpd.read_parquet("s3://spatial-ucr/census/administrative/counties.parquet")

Often, that works perfectly. But depending on the botocore/s3fs/aiobotocore/fsspec version combination, it can throw botocore.exceptions.NoCredentialsError: Unable to locate credentials.

Describe the solution you'd like

The pandas version of read_parquet supports passing storage_options={"anon": True}, which I believe would get around that particular error, but in geopandas that argument fails with TypeError: read_table() got an unexpected keyword argument 'storage_options'. It would be great if gpd.read_parquet accepted that argument as well.
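For reference, the desired call would mirror pandas (a sketch of the proposed API, not something geopandas supports yet at the time of writing):

import geopandas as gpd

gpd.read_parquet(
    "s3://spatial-ucr/census/administrative/counties.parquet",
    storage_options={"anon": True},  # anonymous / unsigned S3 access
)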

API breaking implications

None

Describe alternatives you've considered

I could probably read the file directly with pandas, then convert the serialized geometry column myself, but that would skirt the nice efficient implementation already in the geopandas version of read_parquet :)
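Roughly, that manual fallback would look like this (a sketch, assuming the geometry column holds WKB-encoded bytes as in geoparquet; note the file-level CRS metadata that geopandas' reader restores would be lost):

import pandas as pd
import geopandas as gpd
from shapely import wkb

df = pd.read_parquet(
    "s3://spatial-ucr/census/administrative/counties.parquet",
    storage_options={"anon": True},
)
# geoparquet serializes geometries as WKB; decode the bytes by hand
df["geometry"] = df["geometry"].apply(wkb.loads)
gdf = gpd.GeoDataFrame(df, geometry="geometry")  # CRS metadata is not recovered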

@martinfleis (Member) commented Aug 23, 2021

I am guessing a bit here, but you may be able to read directly. Our parquet IO uses pyarrow.parquet.read_table(path, columns=columns, **kwargs), so passing a filesystem keyword with a proper S3FileSystem object should do the trick:

from pyarrow import fs
import geopandas
s3 = fs.S3FileSystem(anonymous=True)
geopandas.read_parquet("s3://spatial-ucr/census/administrative/counties.parquet", filesystem=s3)

However, in my env this fails with "AWS Error [code 100]: No response body", while reading directly with no filesystem specified works.

Can you pin down the specification of the environment in which this fails?

Since our parquet IO is different from pandas under the hood, I am not sure to which degree we can reasonably mirror pandas API here.

@knaaptime (Author)

I'll keep looking, but I think I've narrowed it to fsspec<=0.8.3. Anything higher than that raises this issue, but I still need to test a bit. I'll give your solution a shot and see if that works. Thanks!

@martinfleis (Member)

fsspec 0.8.2 still works for me.

@knaaptime (Author)

Sorry, yeah, I wrote that a little backwards: pinning fsspec<=0.8.3 works. It's going higher than that which breaks.

@martinfleis (Member)

Well, the latest fsspec 2021.7.0 also works in my env, so it has to be something else, or a combination of versions, causing the issue.

@knaaptime (Author)

I'll keep digging. Thanks again.

As you can imagine, things usually work with conda, but when I need to resort to pip this issue pops up, and it's hard to diagnose which combination of packages is responsible.

@jorisvandenbossche (Member)

[on passing a filesystem object as a work-around] However, in my env this fails with "AWS Error [code 100]: No response body", while reading directly with no filesystem specified works.

One guess: it might be that if you pass an explicit filesystem object, you need to leave out the s3:// from the file path.
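That guess would look something like this (same bucket as the original report, with the scheme stripped from the path):

from pyarrow import fs
import geopandas

s3 = fs.S3FileSystem(anonymous=True)
# no "s3://" prefix when an explicit filesystem object is passed
geopandas.read_parquet("spatial-ucr/census/administrative/counties.parquet", filesystem=s3)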

@jorisvandenbossche (Member) commented Aug 27, 2021

And on the original topic: I think it's a good idea to add support for the storage_options keyword (there are other aspects you might want to tweak, like the region or endpoint).

Although it's in theory superfluous with passing an actual filesystem object (you can create an s3fs filesystem with those same storage_options, and pyarrow will accept an s3fs filesystem as well), it gives consistency with pandas and dask (and dask-geopandas).

Implementation-wise, I think we can do something like:

if storage_options is not None:
    if filesystem is not None:
        raise ValueError("cannot provide both 'storage_options' and 'filesystem'")
    filesystem, _, paths = fsspec.get_fs_token_paths(path, storage_options=storage_options)
    path = paths[0]
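Under that sketch, these two calls would be equivalent ways to request anonymous access (assuming, as noted above, that pyarrow accepts the s3fs filesystem object):

import s3fs
import geopandas

path = "s3://spatial-ucr/census/administrative/counties.parquet"

# explicit filesystem object built from the same options
gdf = geopandas.read_parquet(path, filesystem=s3fs.S3FileSystem(anon=True))

# proposed keyword, resolved internally via fsspec
gdf = geopandas.read_parquet(path, storage_options={"anon": True})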

TomAugspurger added a commit to TomAugspurger/geopandas that referenced this issue Sep 10, 2021