
remove the explicit bucket and derive from url like normal s3 #20

Merged: 3 commits into datafusion-contrib:main on Jan 23, 2022

Conversation

seddonm1
Collaborator

As mentioned in #19, the current implementation is incorrect because it requires us to provide a specific bucket for the data to be read from.

This does not work because the DataFusion ObjectStoreRegistry registers object stores by URI scheme (like s3:// or file://), which means it would be impossible to register s3:// sources from two different buckets.

By encoding the bucket name into the file path as the first segment before the /, we can satisfy the ObjectStoreRegistry requirements and support multiple buckets.
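For illustration, a minimal sketch of the derivation, assuming the bucket is simply the first path segment (the helper name and exact parsing in the crate may differ):

    // Hypothetical helper: everything before the first "/" is the bucket,
    // the remainder is the object key within that bucket.
    fn split_bucket_and_key(path: &str) -> (&str, &str) {
        path.split_once('/').unwrap_or((path, ""))
    }

    fn main() {
        // e.g. an s3:// URI whose path portion is "bucket_one/data/file.parquet"
        // (example names are made up)
        assert_eq!(
            split_bucket_and_key("bucket_one/data/file.parquet"),
            ("bucket_one", "data/file.parquet")
        );
    }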

@seddonm1
Collaborator Author

FYI, this means you do this:

    // Register a single S3 object store for the "s3" URI scheme; the bucket is
    // now derived from each object URI (s3://<bucket>/<key>) rather than being
    // configured on the store itself. Passing `None` for every option uses the
    // crate's defaults.
    let execution_config = ExecutionConfig::new().with_batch_size(32768);
    let mut execution_ctx = ExecutionContext::with_config(execution_config);
    execution_ctx.register_object_store(
        "s3",
        Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
    );
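Continuing the snippet above, reads from different buckets can then go through that single registration. A rough sketch, assuming read_parquet resolves the registered store from the s3:// URI and that the fragment sits inside an async fn returning a DataFusion Result; bucket and file names are made up:

    // Both paths use the same "s3" object store; the bucket is taken from the
    // first segment of each URI rather than from the store's configuration.
    let df_one = execution_ctx
        .read_parquet("s3://bucket_one/file.parquet")
        .await?;
    let df_two = execution_ctx
        .read_parquet("s3://bucket_two/file.parquet")
        .await?;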

@matthewmturner
Collaborator

matthewmturner commented Jan 22, 2022

Do you think we could test this by mapping the testing data twice as two different volumes, so we could test two different "buckets"?

Like:

--volume "$(pwd)/parquet-testing:/bucket_one" \
--volume "$(pwd)/parquet-testing:/bucket_two" \

@seddonm1
Collaborator Author

Yes, I can add some more tests like you suggest.

@seddonm1
Collaborator Author

@matthewmturner I added a test. Unfortunately, given how MinIO works, we can only mount one location:

ERROR Invalid command line arguments: Cross-device mounts detected on path (/data) at following locations [/data/data1 /data/data0]. Export path should not have any sub-mounts, refusing to start.

I have added a test that reads from the bad_data path and ensures the panic message relates to a corrupt Parquet file rather than a file-not-found type of error.
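For reference, a rough sketch of the kind of test described above (the bucket name, file name, and expected error substring are assumptions; the actual test in this PR may differ):

    // The object exists in the bucket, so any failure must come from decoding
    // the corrupt Parquet file, not from an object-not-found lookup.
    #[tokio::test]
    #[should_panic(expected = "Parquet")] // exact substring is an assumption
    async fn corrupt_parquet_reports_parquet_error_not_missing_file() {
        let mut ctx = ExecutionContext::new();
        ctx.register_object_store(
            "s3",
            Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
        );
        let df = ctx
            .read_parquet("s3://data/bad_data/PARQUET-1481.parquet")
            .await
            .unwrap();
        df.collect().await.unwrap();
    }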

@matthewmturner
Collaborator

@seddonm1 ah, that's unfortunate, but that's definitely a good workaround!

@matthewmturner linked an issue (API not quite correct) on Jan 22, 2022 that may be closed by this pull request
@seddonm1
Collaborator Author

@matthewmturner I have rebased, so this is ready to merge once tests pass 👍

@matthewmturner merged commit 366bb6c into datafusion-contrib:main on Jan 23, 2022
@houqp
Member

houqp commented Jan 23, 2022

Good work 👍 Just a note for future improvements: it's very common to set up different access controls for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

@seddonm1
Collaborator Author

@houqp agreed and understood. Perhaps the ObjectStoreRegistry needs something more than the URI scheme to allow for this use case?

@houqp
Member

houqp commented Jan 23, 2022

Yeah, I can think of two different approaches to address this problem:

  1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one for each bucket.
  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity onto all object stores. For example, the local file object store will never need to dispatch to different clients based on the file path. @yjshen curious what your thoughts are on this.
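A minimal sketch of what option 1 could look like, with a per-bucket client map inside the S3 object store (the types and names here are illustrative, not the crate's actual API):

    use std::collections::HashMap;

    // Stand-in for a fully configured S3 client (credentials/role, region, endpoint).
    struct BucketClient;

    struct MultiBucketS3Store {
        // bucket name -> client built with that bucket's specific configuration
        clients: HashMap<String, BucketClient>,
        // used for any bucket without a dedicated entry
        default_client: BucketClient,
    }

    impl MultiBucketS3Store {
        fn client_for(&self, path: &str) -> &BucketClient {
            // the bucket is the first path segment, as introduced in this PR
            let bucket = path.split('/').next().unwrap_or("");
            self.clients.get(bucket).unwrap_or(&self.default_client)
        }
    }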

@matthewmturner
Collaborator

@houqp @seddonm1 for my info, can you provide more detail on how access control is handled with configs? In my experience, I've controlled access to buckets via AWS IAM policies, and the attendant access_key / secret_key are linked to that. Are there other credential providers where that isn't the case? Or cases where it's not that straightforward with IAM and access / secret keys?

@houqp
Member

houqp commented Jan 24, 2022

IAM policies attached to IAM users (via access/secret keys) are easier to get started with. For a more secure and production-ready setup, you would want to use IAM roles instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write; otherwise the bucket owner account won't be able to truly own the newly written objects. The result is that the bucket owner won't be able to further share the objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads, we don't have this problem, so you can usually get by with a single IAM role.
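For illustration only, the per-bucket role selection could be as simple as a lookup table consulted before each write (bucket names and role ARNs below are made up):

    use std::collections::HashMap;

    fn main() {
        // Which IAM role to assume before writing to each bucket; the store
        // would call STS AssumeRole for the matching ARN and build a client
        // from the returned temporary credentials. ARNs here are placeholders.
        let write_roles: HashMap<&str, &str> = HashMap::from([
            ("bucket-in-account-a", "arn:aws:iam::111111111111:role/s3-writer"),
            ("bucket-in-account-b", "arn:aws:iam::222222222222:role/s3-writer"),
        ]);
        assert!(write_roles.contains_key("bucket-in-account-a"));
    }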

@matthewmturner
Collaborator

@houqp makes sense, thanks for the explanation!
