Describe the bug, including details regarding any error messages, version, and platform.
I use pyarrow.dataset.dataset often to load data from S3. However, it's always been a bit finicky regarding named AWS profiles. For example, let's say I have the following ~/.aws/credentials file that defines two identical profiles and uses an external process to get credentials (which shouldn't matter, but I'm adding it for completeness):
[default]
region = us-east-1
credential_process = /path/to/get-creds.sh
[dev]
region = us-east-1
credential_process = /path/to/get-creds.sh
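Since the report hinges on the two profiles being identical, one quick sanity check is to parse the file and compare the two sections directly. This is only a minimal sketch using the standard-library `configparser`; the helper names `profile_settings` and `profiles_match` are made up for illustration, and it assumes the file is plain INI as shown above.

```python
import configparser


def profile_settings(path, profile):
    """Return the key/value pairs of one profile section as a dict."""
    # interpolation=None so a literal '%' in a value is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)
    return dict(parser[profile])  # raises KeyError if the profile is missing


def profiles_match(path, first, second):
    """True when both profiles define exactly the same settings."""
    return profile_settings(path, first) == profile_settings(path, second)


# Example call against the file from this report (hypothetical usage):
# import os
# profiles_match(os.path.expanduser("~/.aws/credentials"), "default", "dev")
```

If this returns `True`, any difference in behaviour would have to come from somewhere other than the contents of the two sections.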
These commands both work, which validates that both accounts have access to S3:
aws s3 ls s3://bucket/path --profile default
aws s3 ls s3://bucket/path --profile dev
Now, as I understand it, the only way to specify a named AWS profile with pyarrow is to use environment variables. To test the access, I use a trivial script:
# script.py
import pyarrow.dataset as ds
dataset = ds.dataset("s3://bucket/path/")
Now, here's the strangeness:
# Check that no AWS environment variables are set in this shell (expect no output):
env | grep AWS
# This works:
python ./script.py
# So does this:
AWS_PROFILE=default python ./script.py
# But this fails:
AWS_PROFILE=dev python ./script.py
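The same three runs can be scripted so that each one starts from an environment with no `AWS_*` variables at all, which rules out anything leaking in from the shell. This is just a sketch using the standard library; `run_script` is a hypothetical helper, and `script.py` is assumed to be the trivial script above.

```python
import os
import subprocess
import sys


def run_script(script, profile=None):
    """Run `script` with a clean AWS environment and an optional AWS_PROFILE.

    Returns (exit_code, stderr_text).
    """
    # Drop every AWS_* variable, mirroring the `env | grep AWS` check.
    env = {k: v for k, v in os.environ.items() if not k.startswith("AWS_")}
    if profile is not None:
        env["AWS_PROFILE"] = profile
    result = subprocess.run(
        [sys.executable, script], env=env, capture_output=True, text=True
    )
    return result.returncode, result.stderr


# Example usage (hypothetical):
# for profile in (None, "default", "dev"):
#     code, err = run_script("./script.py", profile)
#     print(profile, code, err.strip().splitlines()[-1:] )
```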
The error is:
Traceback (most recent call last):
File "./script.py", line 3, in <module>
dataset = ds.dataset("s3://bucket/path/")
File "/home/antonstv/miniconda3/envs/pdna/lib/python3.8/site-packages/pyarrow/dataset.py", line 763, in dataset
return _filesystem_dataset(source, **kwargs)
File "/home/antonstv/miniconda3/envs/pdna/lib/python3.8/site-packages/pyarrow/dataset.py", line 446, in _filesystem_dataset
fs, paths_or_selector = _ensure_single_source(source, filesystem)
File "/home/antonstv/miniconda3/envs/pdna/lib/python3.8/site-packages/pyarrow/dataset.py", line 413, in _ensure_single_source
file_info = filesystem.get_file_info(path)
File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When getting information for key 'path' in bucket 'bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
I'm really at a loss as to how this could happen. And since the underlying code isn't Python, it's tricky for me to help debug. It's possible that I'm doing something dumb, but I can't imagine why two identical profiles would give different results like this.
The pyarrow version I'm using is 12.0.1.
Component(s)
Python