[Python] FileSystem.from_uri doesn't decode %-encoded characters in path #33598

Closed
asfimport opened this issue Dec 14, 2022 · 2 comments

@asfimport

When attempting to create a new filesystem object from a public dataset in S3 whose object key contains a space (the bucket name itself, nyc-tlc, has none), an error is raised.

Here's a minimal reproducer:

from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")

which fails with the following traceback:

Traceback (most recent call last):
  File "/Users/james/projects/dask/dask/test.py", line 3, in <module>
    result = FileSystem.from_uri("s3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet")
  File "pyarrow/_fs.pyx", line 470, in pyarrow._fs.FileSystem.from_uri
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Cannot parse URI: 's3://nyc-tlc/trip data/fhvhv_tripdata_2022-06.parquet'

Note that things work if I use a different dataset that doesn't have a space in the URI, or if I replace the portion of the URI that contains the space with a * wildcard:

from pyarrow.fs import FileSystem
result = FileSystem.from_uri("s3://ursa-labs-taxi-data/2009/01/data.parquet") # works
result = FileSystem.from_uri("s3://nyc-tlc/*/fhvhv_tripdata_2022-06.parquet") # works

The wildcard isn't necessarily equivalent to the original failing URI, but I think it highlights that the space is somehow problematic.

Environment: OS: macOS

PRs and other links:

Note: This issue was originally created as ARROW-18436. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
That's because the space needs to be percent-encoded (as %20) for the URI to parse. However, there is a genuine issue: the path isn't decoded on return, so it still contains %20 instead of the space:

>>> result = FileSystem.from_uri("s3://nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet")
>>> result
(<pyarrow._s3fs.S3FileSystem at 0x7f6c114fc730>,
 'nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet')
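
Until the fix is available, a possible workaround is to percent-encode the key before calling from_uri and to decode the returned path yourself. The sketch below uses only the standard library's urllib.parse; the helper names encode_s3_uri and decode_returned_path are hypothetical and not part of pyarrow.

```python
from urllib.parse import quote, unquote


def encode_s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI with the key percent-encoded (hypothetical helper)."""
    # safe="/" keeps the path separators intact; a space becomes %20.
    return f"s3://{bucket}/{quote(key, safe='/')}"


def decode_returned_path(path: str) -> str:
    """Undo percent-encoding in a path returned by from_uri (hypothetical helper)."""
    return unquote(path)


uri = encode_s3_uri("nyc-tlc", "trip data/fhvhv_tripdata_2022-06.parquet")
print(uri)
print(decode_returned_path("nyc-tlc/trip%20data/fhvhv_tripdata_2022-06.parquet"))

# With pyarrow installed and S3 reachable, this could be combined as
# follows (not executed here):
#   from pyarrow.fs import FileSystem
#   fs, path = FileSystem.from_uri(uri)
#   path = decode_returned_path(path)
```

Decoding by hand is only appropriate on versions where from_uri returns the encoded path. On a version that already decodes it, a second unquote could corrupt keys that contain a literal % character.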

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 14974
#14974
