feat: consolidate S3 client creation and enable ARN role for MSQ export#19317

Merged
gianm merged 7 commits into apache:master from cecemei:export2
May 22, 2026
Contributor

@cecemei cecemei commented Apr 15, 2026

Description

Consolidates the creation of ServerSideEncryptingAmazonS3 into a single static builder method to reduce code duplication and improve maintainability.
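
As a rough illustration of the "single static builder method" idea, here is a minimal sketch. The class `SketchS3Client` and its fields are hypothetical stand-ins, not the actual `ServerSideEncryptingAmazonS3` API; only the parameter names `assumeRoleArn` and `assumeRoleExternalId` come from this PR.

```java
import java.util.Optional;

// Hypothetical stand-in for an S3 client wrapper; not Druid's actual class.
final class SketchS3Client {
  final Optional<String> assumeRoleArn;
  final Optional<String> assumeRoleExternalId;

  private SketchS3Client(Builder b) {
    this.assumeRoleArn = Optional.ofNullable(b.assumeRoleArn);
    this.assumeRoleExternalId = Optional.ofNullable(b.assumeRoleExternalId);
  }

  // Single static entry point, so every caller constructs the client the same way.
  static Builder builder() {
    return new Builder();
  }

  static final class Builder {
    private String assumeRoleArn;
    private String assumeRoleExternalId;

    Builder assumeRoleArn(String arn) {
      this.assumeRoleArn = arn;
      return this;
    }

    Builder assumeRoleExternalId(String externalId) {
      this.assumeRoleExternalId = externalId;
      return this;
    }

    SketchS3Client build() {
      return new SketchS3Client(this);
    }
  }
}
```

Callers that previously assembled the client by hand would instead go through `SketchS3Client.builder()...build()`, which keeps optional settings such as the role ARN in one place.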

Release Note

  • Exporting query results to S3 now supports a role ARN, e.g.:
INSERT INTO
EXTERN(
  s3(bucket => 'cecemei-test2', prefix => 'export', assumeRoleArn => 'arn:aws:iam::00000:role/cecemei-test-20260520'))
AS CSV
SELECT ...

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei changed the title from "feat: support ARN role in MSQ export" to "feat: consolidate S3 client creation and enable ARN role for MSQ export" Apr 15, 2026
@cecemei cecemei marked this pull request as ready for review April 15, 2026 04:32
@FrankChen021
Member

The changes LGTM, no correctness issues found.

Member

@FrankChen021 FrankChen021 left a comment

Severity Findings
P0 0
P1 0
P2 1
P3 0
Total 1

This is an automated review by Codex GPT-5

return StsAssumeRoleCredentialsProvider.builder()
                                       .stsClient(stsBuilder.build())
                                       .refreshRequest(assumeRoleRequestBuilder.build())
                                       .asyncCredentialUpdateEnabled(true)
Member

[P2] Avoid leaking assume-role refresh resources

When an MSQ export specifies assumeRoleArn, S3ExportStorageProvider.createStorageConnector builds a fresh ServerSideEncryptingAmazonS3 for each connector creation. That path creates an StsAssumeRoleCredentialsProvider with asyncCredentialUpdateEnabled(true) plus a new STS client, but neither the provider, STS client, nor S3 clients are lifecycle-managed or closed after the export connector is done. Since exports create connectors for the empty-location check, worker writes, and manifest writing, each ARN export can leave background refresh/client resources behind. Reuse a lifecycle-managed role-specific client/provider or make the connector own and close the resources it creates.
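
The second option in this comment, having the connector own and close what it creates, could look roughly like the sketch below. `OwningConnector` and `own` are hypothetical names used only for illustration; this is not how Druid's `S3StorageConnector` is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical connector that tracks the resources it creates and releases them on close().
final class OwningConnector implements AutoCloseable {
  private final Deque<AutoCloseable> owned = new ArrayDeque<>();

  // Register a resource (for example an STS client or a credentials provider) for later cleanup.
  <T extends AutoCloseable> T own(T resource) {
    owned.push(resource);
    return resource;
  }

  @Override
  public void close() throws Exception {
    Exception first = null;
    // Close in reverse creation order and keep going if one close() fails.
    while (!owned.isEmpty()) {
      try {
        owned.pop().close();
      } catch (Exception e) {
        if (first == null) {
          first = e;
        } else {
          first.addSuppressed(e);
        }
      }
    }
    if (first != null) {
      throw first;
    }
  }
}
```

Used with try-with-resources, this would stop background refresh threads and client pools when the export connector is done, provided something actually calls `close()`.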

}
return new S3StorageConnector(
    s3OutputConfig,
    ServerSideEncryptingAmazonS3.builder(
Contributor

Frank's automated review makes a good point about this here: https://github.com/apache/druid/pull/19317/files#r3161338202

Each of these ServerSideEncryptingAmazonS3 is going to leak some resources: thread pools, connection pools, and the like. This was a pre-existing problem with S3InputSource when an S3InputDataConfig is provided (via the properties field). Because we have an S3InputSource per file (due to splitting), I think even master potentially creates a lot of ServerSideEncryptingAmazonS3 instances when properties is set.

This PR isn't making it a ton worse, but it also isn't making it much better. I suppose it would be OK to merge it given that it's not a ton worse, but, could you please memoize the client in the constructor of S3ExportStorageProvider (similar to what S3InputSource does). That will at least allow it to be shared across calls to createStorageConnector.

Could you please also put some comments about the problem in ServerSideEncryptingAmazonS3#builder. Ultimately fixing it would I think involve making ServerSideEncryptingAmazonS3 closeable, and arranging for close to actually be called, which would be a larger change.

Contributor Author

Updated to use Suppliers.memoize and added javadoc for ServerSideEncryptingAmazonS3.Builder.build(). PTAL!
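
A minimal sketch of the memoization described here, under stated assumptions: `Memoized` is a JDK-only stand-in for Guava's `Suppliers.memoize` so the example is self-contained, and `SketchExportProvider` is a hypothetical class, not the real `S3ExportStorageProvider`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// JDK-only stand-in for Guava's Suppliers.memoize: computes the value once, then caches it.
final class Memoized<T> implements Supplier<T> {
  private final Supplier<T> delegate;
  private volatile boolean initialized;
  private T value;

  Memoized(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public T get() {
    if (!initialized) {
      synchronized (this) {
        if (!initialized) {
          value = delegate.get();
          initialized = true;
        }
      }
    }
    return value;
  }
}

// Hypothetical provider: the memoized supplier is set up once in the constructor.
final class SketchExportProvider {
  final AtomicInteger clientsBuilt = new AtomicInteger();
  private final Supplier<Object> clientSupplier;

  SketchExportProvider() {
    this.clientSupplier = new Memoized<>(() -> {
      clientsBuilt.incrementAndGet();
      return new Object(); // placeholder for an expensive S3 client
    });
  }

  // Every call shares the same client instead of building a new one.
  Object createStorageConnector() {
    return clientSupplier.get();
  }
}
```

This shares one client across calls on the same provider instance. It does not close the client, so the underlying leak discussed above would still need a separate fix.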

Contributor

@gianm gianm left a comment

Main code looks good to me, just some docs are missing.

INSERT INTO
EXTERN(
-  s3(bucket => 'your_bucket', prefix => 'prefix/to/files'))
+  s3(bucket => 'your_bucket', prefix => 'prefix/to/files', assumeRoleArn => 'arn:aws:iam::some-role'))
Contributor

Please also update docs/multi-stage-query/reference.md. The table of s3 options should include assumeRoleArn and assumeRoleExternalId.

Member

@FrankChen021 FrankChen021 left a comment

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 7 of 7 changed files.


This is an automated review by Codex GPT-5.5

@gianm gianm merged commit a946950 into apache:master May 22, 2026
38 checks passed
@github-actions github-actions Bot added this to the 38.0.0 milestone May 22, 2026