
remove the explicit bucket and derive from url like normal s3 #20

Merged: 3 commits into datafusion-contrib:main on Jan 23, 2022

Conversation

seddonm1
Collaborator

As mentioned in #19, the current implementation is incorrect because it requires us to provide a specific bucket for the data to be read from.

This does not work because the DataFusion ObjectStoreRegistry registers object stores by URI scheme (like s3:// or file://), which means it would be impossible to register s3:// sources from two different buckets.

By encoding the bucket name into the file path as the first segment before the /, we can satisfy the ObjectStoreRegistry requirements and support multiple buckets.
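For illustration, a minimal sketch of the derivation, assuming the bucket is simply the first path segment (the helper name and exact parsing in the crate may differ):

    // Hypothetical helper: everything before the first "/" is the bucket,
    // the remainder is the object key within that bucket.
    fn split_bucket_and_key(path: &str) -> (&str, &str) {
        path.split_once('/').unwrap_or((path, ""))
    }

    fn main() {
        // e.g. an s3:// URI whose path portion is "bucket_one/data/file.parquet"
        // (example names are made up)
        assert_eq!(
            split_bucket_and_key("bucket_one/data/file.parquet"),
            ("bucket_one", "data/file.parquet")
        );
    }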

@seddonm1
Collaborator Author

FYI, this means you do this:

    // Register a single S3 object store for the "s3" URI scheme; the bucket is
    // now derived from each object URI (s3://<bucket>/<key>) rather than being
    // configured on the store itself. Passing `None` for every option uses the
    // crate's defaults.
    let execution_config = ExecutionConfig::new().with_batch_size(32768);
    let mut execution_ctx = ExecutionContext::with_config(execution_config);
    execution_ctx.register_object_store(
        "s3",
        Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
    );
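Continuing the snippet above, reads from different buckets can then go through that single registration. A rough sketch, assuming read_parquet resolves the registered store from the s3:// URI and that the fragment sits inside an async fn returning a DataFusion Result; bucket and file names are made up:

    // Both paths use the same "s3" object store; the bucket is taken from the
    // first segment of each URI rather than from the store's configuration.
    let df_one = execution_ctx
        .read_parquet("s3://bucket_one/file.parquet")
        .await?;
    let df_two = execution_ctx
        .read_parquet("s3://bucket_two/file.parquet")
        .await?;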

@matthewmturner
Collaborator

matthewmturner commented Jan 22, 2022

Do you think we could test this by mapping the testing data twice as two different volumes, so we could test two different "buckets"?

Like:

--volume "$(pwd)/parquet-testing:/bucket_one" \
--volume "$(pwd)/parquet-testing:/bucket_two" \

@seddonm1
Collaborator Author

Yes, I can add some more tests like you suggest.

@seddonm1
Collaborator Author

@matthewmturner I added a test. Unfortunately, given how MinIO works, we can only mount one location:

ERROR Invalid command line arguments: Cross-device mounts detected on path (/data) at following locations [/data/data1 /data/data0]. Export path should not have any sub-mounts, refusing to start.

I have added a test that reads from the bad_data path and ensures the panic message relates to a corrupt Parquet file rather than a file-not-found type of error.
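For reference, a rough sketch of the kind of test described above (the bucket name, file name, and expected error substring are assumptions; the actual test in this PR may differ):

    // The object exists in the bucket, so any failure must come from decoding
    // the corrupt Parquet file, not from an object-not-found lookup.
    #[tokio::test]
    #[should_panic(expected = "Parquet")] // exact substring is an assumption
    async fn corrupt_parquet_reports_parquet_error_not_missing_file() {
        let mut ctx = ExecutionContext::new();
        ctx.register_object_store(
            "s3",
            Arc::new(AmazonS3FileSystem::new(None, None, None, None, None, None).await),
        );
        let df = ctx
            .read_parquet("s3://data/bad_data/PARQUET-1481.parquet")
            .await
            .unwrap();
        df.collect().await.unwrap();
    }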

@matthewmturner
Collaborator

@seddonm1 ah, that's unfortunate, but that's definitely a good workaround!

@matthewmturner linked an issue (API not quite correct) on Jan 22, 2022 that may be closed by this pull request
@seddonm1
Collaborator Author

@matthewmturner I have rebased, so this is ready to merge once tests pass 👍

@matthewmturner merged commit 366bb6c into datafusion-contrib:main on Jan 23, 2022
@houqp
Member

houqp commented Jan 23, 2022

Good work 👍 Just a note for future improvements: it's very common to set up different access controls for different buckets, so we will need to support creating different clients with specific configs for different buckets in the future. For example, in our production environment, we have Spark jobs that access different buckets hosted in different AWS accounts.

@seddonm1
Collaborator Author

@houqp agreed and understood. Perhaps the ObjectStoreRegistry needs something more than the URI scheme to allow for this use case?

@houqp
Member

houqp commented Jan 23, 2022

Yeah, I can think of two different approaches to address this problem:

  1. Maintain a set of protocol-specific clients internally within the S3 object store implementation, one for each bucket.
  2. Extend the ObjectStore abstraction in DataFusion to support a hierarchy-based object store lookup, i.e. first look up an object-store-specific URI key generator by scheme, then calculate a unique object store key for the given URI for the actual object store lookup.

I am leaning towards option 1 because it doesn't force this complexity onto all object stores. For example, the local file object store will never need to dispatch to different clients based on the file path. @yjshen curious what your thoughts are on this.
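A minimal sketch of what option 1 could look like, with a per-bucket client map inside the S3 object store (the types and names here are illustrative, not the crate's actual API):

    use std::collections::HashMap;

    // Stand-in for a fully configured S3 client (credentials/role, region, endpoint).
    struct BucketClient;

    struct MultiBucketS3Store {
        // bucket name -> client built with that bucket's specific configuration
        clients: HashMap<String, BucketClient>,
        // used for any bucket without a dedicated entry
        default_client: BucketClient,
    }

    impl MultiBucketS3Store {
        fn client_for(&self, path: &str) -> &BucketClient {
            // the bucket is the first path segment, as introduced in this PR
            let bucket = path.split('/').next().unwrap_or("");
            self.clients.get(bucket).unwrap_or(&self.default_client)
        }
    }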

@matthewmturner
Collaborator

@houqp @seddonm1 for my info, can you provide more detail on how access control is handled with configs? In my experience, I've controlled access to buckets via AWS IAM policies, and the attendant access_key / secret_key are linked to that. Are there other credential providers where that isn't the case? Or cases where it's not that straightforward with IAM and access / secret keys?

@houqp
Member

houqp commented Jan 24, 2022

IAM policies attached to IAM users (via access/secret keys) are easier to get started with. For a more secure and production-ready setup, you would want to use IAM roles instead of IAM users so there are no long-lived secrets. The place where things get complicated is cross-account S3 write access. In order to do this, you need to assume an IAM role in the S3 bucket owner's account to perform the write; otherwise the bucket owner account won't be able to truly own the newly written objects. The result is that the bucket owner won't be able to further share the objects with other accounts. In short, in some cases the object store needs to assume and switch to different IAM roles depending on which bucket it is writing to. For cross-account S3 reads, we don't have this problem, so you can usually get by with a single IAM role.
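For illustration only, the per-bucket role selection could be as simple as a lookup table consulted before each write (bucket names and role ARNs below are made up):

    use std::collections::HashMap;

    fn main() {
        // Which IAM role to assume before writing to each bucket; the store
        // would call STS AssumeRole for the matching ARN and build a client
        // from the returned temporary credentials. ARNs here are placeholders.
        let write_roles: HashMap<&str, &str> = HashMap::from([
            ("bucket-in-account-a", "arn:aws:iam::111111111111:role/s3-writer"),
            ("bucket-in-account-b", "arn:aws:iam::222222222222:role/s3-writer"),
        ]);
        assert!(write_roles.contains_key("bucket-in-account-a"));
    }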

@matthewmturner
Collaborator

@houqp makes sense, thanks for the explanation!
