Add Azure config options for segment prefix and max listing length #9356
jon-wei merged 7 commits into apache:master from
Conversation
Added configuration options to allow the user to specify the prefix within the segment container to store the segment files. Also added a configuration option to allow the user to specify the maximum number of input files to stream for each iteration.
    private String container;

    @JsonProperty
    @Nonnull
Why Nonnull when the previous one is annotated with 'NotNull'?
Also, is prefix a required config? Why is it assigned an empty string?
Perhaps using a @JsonCreator constructor with Precondition checks would make it clearer which fields are required:
@JsonCreator
public AzureDataSegmentConfig(
    ...
    @JsonProperty("prefix") String prefix)
{
  Preconditions.checkState(!StringUtils.isEmpty(prefix), "prefix must be non empty");
  this.prefix = prefix;
  ...
}
Then you don't need all the setters
prefix is not required. Before adding this option, segments were written to the root directory within the specified segment container, in a directory named after the datasource. Do we want to change the behavior here and specify a non-empty default prefix? I'm not sure how this change would affect users already using the Azure extension whose data is already written; would we be unable to find the segment data in that case?
I will fix this to @NotNull.
suneet-s
left a comment
Looks good - 👍
Some suggestions. My main concern is with the changed behavior of AzureUtils#AZURE_RETRY.
    segmentConfig.getContainer(),
    accountConfig.getAccount(),
    AzureUtils.AZURE_STORAGE_HOST_ADDRESS,
    segmentConfig.getPrefix().isEmpty() ? "" : segmentConfig.getPrefix() + '/'
What if prefix ends with a /? Is there a util that will build the path with only one separator at the end? Is there any harm if the path ends with two /?
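One way to answer the trailing-separator question above is a small joining helper that guarantees exactly one / between the prefix and the rest of the path. This is a hypothetical sketch (the class and method names are invented, not part of the PR):

```java
class AzurePaths
{
  // Join a possibly-empty, possibly-slash-terminated prefix to a blob path,
  // guaranteeing exactly one '/' separator between the two parts.
  static String joinPrefix(String prefix, String rest)
  {
    if (prefix == null || prefix.isEmpty()) {
      return rest;
    }
    // Strip a single trailing '/' so "seg/" and "seg" behave identically.
    String trimmed = prefix.endsWith("/") ? prefix.substring(0, prefix.length() - 1) : prefix;
    return trimmed + "/" + rest;
  }
}
```

With a helper like this, callers no longer need the `isEmpty() ? "" : prefix + '/'` ternary quoted in the hunk above.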
    public String getStorageDir(DataSegment dataSegment, boolean useUniquePath)
    {
      String prefix = segmentConfig.getPrefix();
      boolean prefixIsNullOrEmpty = (prefix == null || prefix.isEmpty());
org.apache.commons.lang.StringUtils.isEmpty(prefix)
    Throwable t = e;
    for (Throwable t2 = e.getCause(); t2 != null; t2 = t2.getCause()) {
      t = t2;
    }
This unravels the exception's cause chain; should we check to an unlimited depth?
This also changes the current behavior: previously, if the top-level throwable was a "retryable" exception we'd retry, but with this change, if a StorageException is caused by a RuntimeException we won't retry. Is this intentional?
I think the below if clauses should be checked in the above for loop.
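Moving the checks into the loop, as suggested, would mean testing every level of the cause chain, including the top-level throwable, rather than only the root cause. A hedged sketch of that shape (the helper name is invented, and the predicate stands in for AzureUtils.AZURE_RETRY):

```java
import java.util.function.Predicate;

class RetryCheck
{
  // Walk the whole cause chain, starting from the top-level throwable,
  // and treat the exception as retryable if ANY level matches --
  // instead of applying the check only to the innermost cause.
  static boolean isRetryable(Throwable e, Predicate<Throwable> retryable)
  {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (retryable.test(t)) {
        return true;
      }
    }
    return false;
  }
}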
    addExpectedGetObjectMock(EXPECTED_URIS.get(1));
    EasyMock.expect(CONFIG.getMaxListingLength()).andReturn(EXPECTED_MAX_LISTING_LENGTH);
    EasyMock.replay(STORAGE);
    EasyMock.replay(CONFIG);
nit: looks like this is repeated in multiple tests; maybe move it to a helper function?
    public class GoogleCloudStorageInputSourceTest extends InitializedNullHandlingTest
    {
-     private static final long EXPECTED_MAX_LISTING_LENGTH = 1024L;
+     private static final int EXPECTED_MAX_LISTING_LENGTH = 10;
nit: rename to MAX_LISTING_LENGTH, since we're mocking maxListingLength() to this value.
Added "Design Review" since this PR adds a new user-facing configuration.
    if (AzureUtils.AZURE_RETRY.apply(e)) {
      throw new IOException("Recoverable exception", e);
    }
    log.warn("Exception when opening stream to azure resource, containerName: %s, blobPath: %s, Error: %s",
Should the log level be error instead of warn?
    @JsonProperty
    @Min(1)
    private int maxListingLength = 1024;
I think this should be in a separate class rather than in the class for deep storage configuration. I would suggest adding a new class, AzureReadConfig (there may be a better name), that holds only the new configuration, so that we can add more read-related configurations in the future.
Same for other cloud storage types.
Also please add docs for the new configurations.
How about AzureInputDataConfig? And similar classes for AWS and Google
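A sketch of what such a separate class might look like. The name AzureInputDataConfig comes from the comment above, but the body is an assumption, not the merged implementation; the Jackson/validation annotations it would carry in Druid are noted in comments so the sketch compiles on its own:

```java
// Hypothetical read-path config class, separate from the deep storage config,
// so future read-related options have an obvious home.
class AzureInputDataConfig
{
  // In Druid this field would be annotated with @JsonProperty and @Min(1);
  // annotations are omitted here to keep the sketch dependency-free.
  private int maxListingLength = 1024;

  public int getMaxListingLength()
  {
    return maxListingLength;
  }
}
```

The input source would then take this class via injection instead of reaching into the segment pusher configuration.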
    @@ -43,17 +44,20 @@ public class GoogleCloudStorageInputSource extends CloudObjectInputSource<Google
    private static final int MAX_LISTING_LENGTH = 1024;
This variable is not used anymore.
    private Iterable<S3ObjectSummary> getIterableObjectsFromPrefixes()
    {
-     return () -> S3Utils.objectSummaryIterator(s3Client, getPrefixes(), MAX_LISTING_LENGTH);
+     return () -> S3Utils.objectSummaryIterator(s3Client, getPrefixes(), segmentPusherConfig.getMaxListingLength());
MAX_LISTING_LENGTH is defined in the parent class (CloudObjectInputSource) and is not used anymore. Please remove it.
LGTM
jon-wei
left a comment
Can you also update the S3 and GCS docs?
    |`druid.azure.container`||Azure Storage container name.|Must be set.|
    |`druid.azure.protocol`|http or https||https|
    |`druid.azure.maxTries`||Number of tries before cancel an Azure operation.|3|
    |`druid.azure.prefix`|prefix to use, i.e. what directory.| |""|
Suggest:
"A prefix string that will be prepended to the blob names for the segments published to Azure deep storage"
    |`druid.azure.maxTries`||Number of tries before cancel an Azure operation.|3|
    |`druid.azure.prefix`|prefix to use, i.e. what directory.| |""|
    |`druid.azure.protocol`|the protocol to use|http or https|https|
    |`druid.azure.maxTries`|Number of tries before cancel an Azure operation.| |3|
    To configure connectivity to google cloud, run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.

    |Property|Description|Possible Values|Default|
Hm, this "Required Configuration" section and the "Configuration" section that starts at line 56 should probably be merged. The new wording you have is better, so let's use that.