[Python] Dataset doesn't seem to load named AWS profiles correctly #36416

@stevenmanton

Describe the bug, including details regarding any error messages, version, and platform.

I often use pyarrow.dataset.dataset to load data from S3, but it has always been a bit finicky about named AWS profiles. For example, suppose I have the following ~/.aws/credentials file, which defines two identical profiles and uses an external process to obtain credentials (this shouldn't matter, but I'm including it for completeness):

[default]
region = us-east-1
credential_process = /path/to/get-creds.sh

[dev]
region = us-east-1
credential_process = /path/to/get-creds.sh
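
One way to double-check that the two profiles really are identical is to parse the file and compare the sections. The following is a minimal sketch using only the Python standard library; the helper names `load_profile` and `profiles_identical` are hypothetical, and it assumes the file uses the plain `[name]` section headers shown above.

```python
import configparser
from pathlib import Path


def load_profile(path, name):
    """Return the key/value pairs of one profile section as a plain dict."""
    # interpolation=None leaves any '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    # open() raises if the file is missing, instead of silently skipping it.
    with open(Path(path).expanduser()) as handle:
        parser.read_file(handle)
    # Raises KeyError if the profile section does not exist.
    return dict(parser[name])


def profiles_identical(path, first, second):
    """True if both profile sections contain exactly the same settings."""
    return load_profile(path, first) == load_profile(path, second)


# Example call (path and profile names taken from the report above):
# profiles_identical("~/.aws/credentials", "default", "dev")
```

If this returns True, any difference in behaviour between the profiles has to come from how the profile name is resolved rather than from the settings themselves.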

Both of these commands work, which confirms that both profiles have access to S3:

aws s3 ls s3://bucket/path --profile default
aws s3 ls s3://bucket/path --profile dev

Now, as I understand it, the only way to specify a named AWS profile with pyarrow is to use environment variables. To test the access, I use a trivial script:

# script.py
import pyarrow.dataset as ds

dataset = ds.dataset("s3://bucket/path/")

Here is the strange part:

# Check that no environment variables are set in this shell:
env | grep AWS

# This works:
python ./script.py
# So does this:
AWS_PROFILE=default python ./script.py
# But this fails:
AWS_PROFILE=dev python ./script.py
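
The three runs above can also be driven from a single Python harness, which makes it easier to guarantee that no stray AWS_* variables leak in from the shell. This is only a sketch: `env_for_profile` and `run_with_profile` are hypothetical helper names, and the script path is the one from this report.

```python
import os
import subprocess
import sys


def env_for_profile(profile, base=None):
    """Copy the environment, drop every AWS_* variable, then set AWS_PROFILE.

    Passing profile=None leaves AWS_PROFILE unset, like the first run above.
    """
    source = os.environ if base is None else base
    env = {key: value for key, value in source.items() if not key.startswith("AWS_")}
    if profile is not None:
        env["AWS_PROFILE"] = profile
    return env


def run_with_profile(script, profile):
    """Run the script in a child interpreter and return (exit code, stderr)."""
    result = subprocess.run(
        [sys.executable, script],
        env=env_for_profile(profile),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


# Example loop over the three cases from this report:
# for profile in (None, "default", "dev"):
#     code, err = run_with_profile("./script.py", profile)
#     print(profile, code, err.strip().splitlines()[-1:] )
```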

The error is:

Traceback (most recent call last):
  File "./script.py", line 3, in <module>
    dataset = ds.dataset("s3://bucket/path/")
  File "/home/antonstv/miniconda3/envs/pdna/lib/python3.8/site-packages/pyarrow/dataset.py", line 763, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/antonstv/miniconda3/envs/pdna/lib/python3.8/site-packages/pyarrow/dataset.py", line 446, in _filesystem_dataset
    fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/home/antonstv/miniconda3/envs/pdna/lib/python3.8/site-packages/pyarrow/dataset.py", line 413, in _ensure_single_source
    file_info = filesystem.get_file_info(path)
  File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When getting information for key 'path' in bucket 'bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

I'm at a loss as to how this could happen. Since the underlying code isn't Python, it's tricky for me to help debug. It's possible that I'm doing something dumb, but I can't imagine why two identical profiles would give different results like this.

The pyarrow version I'm using is 12.0.1.
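
One way to narrow this down might be to resolve the credentials for the named profile outside of pyarrow and pass them in explicitly, which takes pyarrow's own profile handling out of the picture. The sketch below assumes boto3 is installed alongside pyarrow (it is not mentioned in this report); `s3_kwargs` and `dataset_for_profile` are hypothetical helper names, and the bucket path is the placeholder used above.

```python
def s3_kwargs(creds, region=None):
    """Translate a credentials object into pyarrow S3FileSystem keyword arguments.

    `creds` needs access_key, secret_key and token attributes, which is the
    shape boto3 returns from get_frozen_credentials().
    """
    kwargs = {"access_key": creds.access_key, "secret_key": creds.secret_key}
    if creds.token:
        kwargs["session_token"] = creds.token
    if region:
        kwargs["region"] = region
    return kwargs


def dataset_for_profile(path, profile):
    """Open a dataset using credentials that boto3 resolved for `profile`."""
    # Imported lazily so the helper above works without these packages.
    import boto3  # assumption: installed separately from pyarrow
    import pyarrow.dataset as ds
    from pyarrow import fs

    session = boto3.Session(profile_name=profile)
    credentials = session.get_credentials()
    if credentials is None:
        raise RuntimeError(f"no credentials found for profile {profile!r}")
    frozen = credentials.get_frozen_credentials()
    filesystem = fs.S3FileSystem(**s3_kwargs(frozen, session.region_name))
    # With an explicit filesystem the path is given without the s3:// scheme.
    return ds.dataset(path, filesystem=filesystem)


# Example call:
# dataset = dataset_for_profile("bucket/path/", "dev")
```

If this succeeds for the dev profile while AWS_PROFILE=dev still fails, that would point at how the profile name is resolved inside pyarrow's S3 layer rather than at the credentials themselves. Note that credentials passed this way are static and will not refresh if they expire.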

Component(s)

Python
