ENH: support storage_options argument in read_parquet #2071

Closed · knaaptime opened this issue Aug 20, 2021 · 8 comments · Fixed by #2107

@knaaptime commented Aug 20, 2021

Is your feature request related to a problem?

I store lots of data in a quilt bucket (i.e. S3 storage) and use s3fs with geopandas to read data directly from the wire, like:

gpd.read_parquet("s3://spatial-ucr/census/administrative/counties.parquet")

Often, that works perfectly. But depending on the botocore/s3fs/aiobotocore/fsspec version combination, it can throw botocore.exceptions.NoCredentialsError: Unable to locate credentials.

Describe the solution you'd like

The pandas version of read_parquet supports passing storage_options={"anon": True}, which I believe would get around that particular error, but in geopandas that argument fails with TypeError: read_table() got an unexpected keyword argument 'storage_options'. It would be great if gpd.read_parquet accepted that argument as well.
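For reference, the desired call would mirror pandas (a sketch of the proposed API, not something geopandas supports yet at the time of writing):

import geopandas as gpd

gpd.read_parquet(
    "s3://spatial-ucr/census/administrative/counties.parquet",
    storage_options={"anon": True},  # anonymous / unsigned S3 access
)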

API breaking implications

None

Describe alternatives you've considered

I could probably read the file directly with pandas, then convert the serialized geometry column myself, but that would skirt the nice efficient implementation already in the geopandas version of read_parquet :)
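Roughly, that manual fallback would look like this (a sketch, assuming the geometry column holds WKB-encoded bytes as in geoparquet; note the file-level CRS metadata that geopandas' reader restores would be lost):

import pandas as pd
import geopandas as gpd
from shapely import wkb

df = pd.read_parquet(
    "s3://spatial-ucr/census/administrative/counties.parquet",
    storage_options={"anon": True},
)
# geoparquet serializes geometries as WKB; decode the bytes by hand
df["geometry"] = df["geometry"].apply(wkb.loads)
gdf = gpd.GeoDataFrame(df, geometry="geometry")  # CRS metadata is not recovered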

@martinfleis (Member) commented Aug 23, 2021

I am guessing a bit here, but you may be able to read directly. Our parquet IO uses pyarrow.parquet.read_table(path, columns=columns, **kwargs), so passing a filesystem keyword with a proper S3FileSystem object should do the trick:

from pyarrow import fs
import geopandas
s3 = fs.S3FileSystem(anonymous=True)
geopandas.read_parquet("s3://spatial-ucr/census/administrative/counties.parquet", filesystem=s3)

However, in my env this fails with "AWS Error [code 100]: No response body", while reading directly with no filesystem specified works.

Can you pin down the specification of the environment in which this fails?

Since our parquet IO is different from pandas under the hood, I am not sure to which degree we can reasonably mirror pandas API here.

@knaaptime (Author)

I'll keep looking, but I think I've narrowed it to fsspec<=0.8.3. Anything higher than that raises this issue, but I still need to test a bit. I'll give your solution a shot and see if that works. Thanks!

@martinfleis (Member)

fsspec 0.8.2 still works for me.

@knaaptime (Author)

Sorry, yeah, I wrote that a little backwards: pinning fsspec<=0.8.3 works. It's going higher than that which breaks.

@martinfleis (Member)

Well, the latest fsspec 2021.7.0 also works in my env, so it has to be something else, or a combination of versions, causing the issue.

@knaaptime (Author)

I'll keep digging. Thanks again.

As you can imagine, things usually work with conda, but when I need to resort to pip this issue pops up, and it's hard to diagnose which combination of packages is responsible.

@jorisvandenbossche (Member)

[on passing a filesystem object as a work-around] However, in my env this fails with "AWS Error [code 100]: No response body", while reading directly with no filesystem specified works.

One guess: it might be that if you pass an explicit filesystem object, you need to leave out the s3:// from the file path.
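That guess would look something like this (same bucket as the original report, with the scheme stripped from the path):

from pyarrow import fs
import geopandas

s3 = fs.S3FileSystem(anonymous=True)
# no "s3://" prefix when an explicit filesystem object is passed
geopandas.read_parquet("spatial-ucr/census/administrative/counties.parquet", filesystem=s3)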

@jorisvandenbossche (Member) commented Aug 27, 2021

And on the original topic: I think it's a good idea to add support for the storage_options keyword (there are other aspects you might want to tweak, like the region or endpoint).

Although it's in theory superfluous with passing an actual filesystem object (you can create an s3fs filesystem with those same storage_options, and pyarrow will accept an s3fs filesystem as well), it gives consistency with pandas and dask (and dask-geopandas).

Implementation-wise, I think we can do something like:

if storage_options is not None:
    if filesystem is not None:
        raise ValueError("cannot provide both 'storage_options' and 'filesystem'")
    filesystem, _, paths = fsspec.get_fs_token_paths(path, storage_options=storage_options)
    path = paths[0]
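Under that sketch, these two calls would be equivalent ways to request anonymous access (assuming, as noted above, that pyarrow accepts the s3fs filesystem object):

import s3fs
import geopandas

path = "s3://spatial-ucr/census/administrative/counties.parquet"

# explicit filesystem object built from the same options
gdf = geopandas.read_parquet(path, filesystem=s3fs.S3FileSystem(anon=True))

# proposed keyword, resolved internally via fsspec
gdf = geopandas.read_parquet(path, storage_options={"anon": True})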

TomAugspurger added a commit to TomAugspurger/geopandas that referenced this issue Sep 10, 2021