-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BULK-22, #398: Support reading s3 urls #399
Changes from 1 commit
ce3e88f
e61c276
d29a0b4
983c767
9973b16
b01d0b0
0e4e27c
c53448d
e0682eb
1cc9bd4
8239cc8
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1435,18 +1435,13 @@ Default: **true**. | |
|
||
Settings applicable for reading from AWS S3 URLs. | ||
|
||
#### --s3.profile<br />--dsbulk.s3.profile _<string>_ | ||
#### --s3.clientCacheSize<br />--dsbulk.s3.clientCacheSize _<number>_ | ||
|
||
Which profile to use for AWS S3 credentials. See the [AWS SDK documentation](https://docs.aws. | ||
amazon.com/sdkref/latest/guide/file-format.html) for details on setting up profiles. | ||
The size (count) of the S3Client cache. Since each S3 URL | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd like to request that we put in additional guidance here. It's not readily obvious to me why For example, what happens if we set the cache size to 100? or 5? Thoughts? 🙂 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I pretty much just picked an arbitrary number. 🤷♂️ In our use case, we're using the same credentials for all the URLs, so the cache only ever has one entry. I suspect using the same credentials for all the URLs will be the usual use case, and it would be a bit of a pain (and somewhat unexpected) to supply more than 20 different sets of credentials in a single I'm happy to add a sentence explaining this if you want, though. Do note, however, that this file (settings.md) and application.template.conf are automatically generated from dsbulk-reference.conf, so the same explanatory statement will be shared between the three files. |
||
must contain the credentials for the target bucket, we cache | ||
the clients to prevent rebuilding the same client over and over. | ||
|
||
Default: **<unspecified>**. | ||
|
||
#### --s3.region<br />--dsbulk.s3.region _<string>_ | ||
|
||
Which AWS region to use. | ||
|
||
Default: **"us-east-1"**. | ||
Default: **20**. | ||
|
||
<a name="stats"></a> | ||
## Stats Settings | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,65 +19,25 @@ | |
import edu.umd.cs.findbugs.annotations.NonNull; | ||
import java.net.URLStreamHandler; | ||
import java.util.Optional; | ||
import java.util.concurrent.atomic.AtomicBoolean; | ||
import org.slf4j.Logger; | ||
import org.slf4j.LoggerFactory; | ||
import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider; | ||
import software.amazon.awssdk.http.SdkHttpClient; | ||
import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient; | ||
import software.amazon.awssdk.regions.Region; | ||
import software.amazon.awssdk.services.s3.S3Client; | ||
import software.amazon.awssdk.services.s3.S3ClientBuilder; | ||
import software.amazon.awssdk.utils.StringUtils; | ||
|
||
public class S3URLStreamHandlerProvider implements URLStreamHandlerProvider { | ||
|
||
private static final String REGION_PATH = "dsbulk.s3.region"; | ||
private static final String PROFILE_PATH = "dsbulk.s3.profile"; | ||
|
||
private static final Logger LOGGER = LoggerFactory.getLogger(S3URLStreamHandlerProvider.class); | ||
private static final String S3CLIENT_CACHE_SIZE_PATH = "dsbulk.s3.clientCacheSize"; | ||
private static final int DEFAULT_S3CLIENT_CACHE_SIZE = 20; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In-line with my request above, could we add a one-line comment on why |
||
|
||
/** The protocol for AWS S3 URLs. I.e., URLs beginning with {@code s3://} */ | ||
public static final String S3_STREAM_PROTOCOL = "s3"; | ||
|
||
private static final AtomicBoolean INITIALIZED = new AtomicBoolean(false); | ||
|
||
private S3Client s3Client; | ||
|
||
private void init(Config config) { | ||
if (INITIALIZED.get()) { | ||
LOGGER.warn("Ignoring additional attempts to initialize the S3URLStreamHandlerProvider."); | ||
return; | ||
} | ||
|
||
String region; | ||
if (config.hasPath(REGION_PATH)) { | ||
region = config.getString(REGION_PATH); | ||
} else { | ||
throw new IllegalArgumentException("You must supply an AWS region to use S3 URls."); | ||
} | ||
String profile = config.hasPath(PROFILE_PATH) ? config.getString(PROFILE_PATH) : null; | ||
|
||
SdkHttpClient httpClient = UrlConnectionHttpClient.builder().build(); | ||
S3ClientBuilder builder = S3Client.builder().httpClient(httpClient).region(Region.of(region)); | ||
|
||
if (!StringUtils.isBlank(profile)) { | ||
LOGGER.info("Using AWS profile {} to connect to S3.", profile); | ||
builder.credentialsProvider(ProfileCredentialsProvider.create(profile)); | ||
} else { | ||
LOGGER.info("Using default credentials provider to connect to S3."); | ||
} | ||
|
||
this.s3Client = builder.build(); | ||
} | ||
|
||
@Override | ||
@NonNull | ||
public Optional<URLStreamHandler> maybeCreateURLStreamHandler( | ||
@NonNull String protocol, Config config) { | ||
if (S3_STREAM_PROTOCOL.equalsIgnoreCase(protocol)) { | ||
init(config); | ||
return Optional.of(new S3URLStreamHandler(s3Client)); | ||
int s3ClientCacheSize = | ||
config.hasPath(S3CLIENT_CACHE_SIZE_PATH) | ||
? config.getInt(S3CLIENT_CACHE_SIZE_PATH) | ||
: DEFAULT_S3CLIENT_CACHE_SIZE; | ||
return Optional.of(new S3URLStreamHandler(s3ClientCacheSize)); | ||
} | ||
return Optional.empty(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever additional guidance gets included in
settings.md
, please add them here too. 🙏