remove the explicit bucket and derive from url like normal s3 #20
Conversation
FYI this means you do this:

```rust
// Imports added for context; AmazonS3FileSystem's module path is assumed from this repo's src/object_store/aws.rs.
use std::sync::Arc;
use datafusion::execution::context::{ExecutionConfig, ExecutionContext};
use datafusion_objectstore_s3::object_store::aws::AmazonS3FileSystem;

let execution_config = ExecutionConfig::new().with_batch_size(32768);
let mut execution_ctx = ExecutionContext::with_config(execution_config);
execution_ctx.register_object_store(
    "s3",
    Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
);
```
Do you think we could test this by mapping the testing data twice as two different volumes, so we could test two different "buckets"? Like:
Yes, I can add some more tests like you suggest.
@matthewmturner I added a test. Unfortunately, given how MinIO works, we can only mount one location:
I have added a test that reads from the
@seddonm1 ah, that's unfortunate, but that's definitely a good workaround!
```
# Conflicts:
#	src/object_store/aws.rs
```
@matthewmturner I have rebased so this is ready to merge once tests pass 👍
Good work 👍 Just a note for future improvements: it's very common to set up different access control for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.
@houqp agreed and understood. Perhaps the
Yeah, I can think of two different approaches to address this problem:
I am leaning towards option 1 because it doesn't force this complexity into all object stores. For example, the local file object store will never need to dispatch to different clients based on file path. @yjshen, curious what your thoughts are on this.
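As a rough sketch of what per-bucket dispatch inside the S3 store could look like (the type and field names below are hypothetical, and the two options themselves are not quoted above):

```rust
// Hypothetical sketch: an S3 store holding one configured client per
// bucket (e.g. each with its own credentials or assumed IAM role) and
// dispatching on the first path segment. Generic over the client type
// so the sketch stays self-contained.
use std::collections::HashMap;
use std::sync::Arc;

struct BucketScopedStore<C> {
    clients: HashMap<String, Arc<C>>, // bucket name -> configured client
    default_client: Arc<C>,           // fallback for unlisted buckets
}

impl<C> BucketScopedStore<C> {
    fn client_for(&self, path: &str) -> Arc<C> {
        // With the bucket encoded as the first path segment, everything
        // before the first '/' identifies which client to use.
        let bucket = path.split('/').next().unwrap_or("");
        self.clients
            .get(bucket)
            .cloned()
            .unwrap_or_else(|| Arc::clone(&self.default_client))
    }
}
```

Keeping this lookup inside the S3 store means the shared `ObjectStoreRegistry` can stay a simple scheme-to-store map.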
@houqp @seddonm1 for my info, can you provide more detail on how access control is handled with configs? From my experience, I've controlled access to buckets via AWS IAM policies, and the attendant access_key / secret_key are linked to that. Are there other credential providers where that isn't the case? Or cases where it's not that straightforward with IAM and access / secret keys?
An IAM policy attached to IAM users (via access/secret key) is easier to get started with. For a more secure and production-ready setup, you would want to use IAM roles instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write; otherwise the bucket owner account won't truly own the newly written objects, and as a result the bucket owner won't be able to further share those objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads, we don't have this problem, so you can usually get by with a single IAM role.
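A hedged sketch of the bucket-to-role selection this implies (the type and role ARNs are placeholders; real code would call STS AssumeRole with the chosen ARN before building the S3 client):

```rust
use std::collections::HashMap;

// Hypothetical per-bucket role table for cross-account S3 writes.
struct WriteRoles {
    by_bucket: HashMap<String, String>, // bucket -> role ARN in the owner account
    default_role: String,               // shared role; usually fine for reads
}

impl WriteRoles {
    fn role_for_write(&self, bucket: &str) -> &str {
        // Writing into another account's bucket: assume the owner-account
        // role so that account truly owns (and can re-share) new objects.
        self.by_bucket
            .get(bucket)
            .map(String::as_str)
            .unwrap_or(&self.default_role)
    }
}
```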
@houqp makes sense, thanks for the explanation!
As mentioned in #19, the current implementation is incorrect, as we need to provide a specific bucket for the data to be read from. This is wrong because the DataFusion `ObjectStoreRegistry` registers by URI scheme (like `s3://` or `file://`), so it would be impossible to register `s3://` sources from two different buckets. By encoding the bucket name into the file path as the first value before the `/`, we can satisfy the `ObjectStoreRegistry` requirements while supporting multiple buckets.
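A minimal sketch of what this change enables, reusing the constructor shown earlier in the thread (bucket names and paths are placeholders, and the exact DataFusion API surface may vary by version):

```rust
use std::sync::Arc;

use datafusion::execution::context::ExecutionContext;
// Module path assumed from src/object_store/aws.rs in this repo.
use datafusion_objectstore_s3::object_store::aws::AmazonS3FileSystem;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();

    // A single registration for the "s3" scheme...
    ctx.register_object_store(
        "s3",
        Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
    );

    // ...can now serve any bucket, because the bucket is just the first
    // segment of the path rather than part of the store's configuration.
    let df_a = ctx.read_parquet("s3://bucket-a/data.parquet").await?;
    let df_b = ctx.read_parquet("s3://bucket-b/data.parquet").await?;
    let _ = (df_a.collect().await?, df_b.collect().await?);

    Ok(())
}
```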