Add Azure config options for segment prefix and max listing length#9356

Merged
jon-wei merged 7 commits into apache:master from zachjsh:azure-config
Feb 21, 2020

Conversation

@zachjsh
Contributor

@zachjsh zachjsh commented Feb 13, 2020

Added configuration options to allow the user to specify the prefix
within the segment container to store the segment files. Also
added a configuration option to allow the user to specify the
maximum number of input files to stream for each iteration.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh requested a review from jihoonson February 13, 2020 22:35
private String container;

@JsonProperty
@Nonnull
Contributor

Why Nonnull when the previous one is annotated with 'NotNull'?

Also, is prefix a required config? Why is it assigned to an empty string?

Perhaps using a @JsonCreator constructor with Precondition checks will make it clearer what is required in each field

 @JsonCreator
  public AzureDataSegmentConfig(
      ...
      @JsonProperty("prefix") String prefix)
  {
    Preconditions.checkState(!StringUtils.isEmpty(prefix), "prefix must be non empty");
    this.prefix = prefix;
    ...
  }

Then you don't need all the setters

Contributor Author

prefix is not required. Before this option was added, segments were written to the root directory of the specified segment container, in a directory named after the datasource. Do we want to change the behavior here and specify a non-empty default prefix? I'm not sure how that would affect users already running the Azure extension with data already written; would we then be unable to find their existing segment data?

I will fix to @NotNull

Contributor Author

fixed

Contributor

@suneet-s suneet-s left a comment

Looks good - 👍

Some suggestions. My main concern is with the behavior change for AzureUtils#AZURE_RETRY.

segmentConfig.getContainer(),
accountConfig.getAccount(),
AzureUtils.AZURE_STORAGE_HOST_ADDRESS,
segmentConfig.getPrefix().isEmpty() ? "" : segmentConfig.getPrefix() + '/'
Contributor

What if prefix ends with a `/`? Is there a util that will build the path with only one separator at the end? Is there any harm if the path ends with two `/`?
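For illustration, a minimal helper along the lines being suggested (the class and method names here are hypothetical, not from the PR) that yields exactly one trailing separator however the configured prefix ends:

```java
public class AzurePaths
{
  // Hypothetical sketch, not the PR's code: normalize a configured prefix so
  // the resulting path component ends with exactly one '/' separator.
  public static String withSingleTrailingSlash(String prefix)
  {
    if (prefix == null || prefix.isEmpty()) {
      return "";
    }
    int end = prefix.length();
    // Drop any trailing slashes the user may have supplied...
    while (end > 0 && prefix.charAt(end - 1) == '/') {
      end--;
    }
    // ...then append exactly one.
    return prefix.substring(0, end) + "/";
  }
}
```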

Contributor Author

good catch! fixed

public String getStorageDir(DataSegment dataSegment, boolean useUniquePath)
{
String prefix = segmentConfig.getPrefix();
boolean prefixIsNullOrEmpty = (prefix == null || prefix.isEmpty());
Contributor

org.apache.commons.lang.StringUtils.isEmpty(prefix)

Contributor Author

fixed

Throwable t = e;
for (Throwable t2 = e.getCause(); t2 != null; t2 = t2.getCause()) {
t = t2;
}
Contributor

This tests unraveling the cause chain. Should we check to an unlimited depth?

This also changes the current behavior: previously, if the top-level throwable was a "retryable" exception, we'd retry; with this change, if a StorageException is caused by a RuntimeException, we won't retry. Is this intentional?

Contributor

I think the below if clauses should be checked in the above for loop.
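A sketch of that suggestion (names here are illustrative, not Druid's actual code): apply the retryability predicate at every level of the cause chain rather than only at the innermost cause, so a retryable exception wrapped in a RuntimeException is still retried.

```java
import java.util.function.Predicate;

public class RetryCheck
{
  // Illustrative sketch: walk the whole cause chain and retry if ANY
  // throwable in it matches the retryable predicate. A depth cap guards
  // against pathological (cyclic) cause chains.
  private static final int MAX_DEPTH = 30;

  public static boolean isRetryable(Throwable e, Predicate<Throwable> retryable)
  {
    int depth = 0;
    for (Throwable t = e; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
      if (retryable.test(t)) {
        return true;
      }
    }
    return false;
  }
}
```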

Contributor Author

fixed

addExpectedGetObjectMock(EXPECTED_URIS.get(1));
EasyMock.expect(CONFIG.getMaxListingLength()).andReturn(EXPECTED_MAX_LISTING_LENGTH);
EasyMock.replay(STORAGE);
EasyMock.replay(CONFIG);
Contributor

nit: looks like this is repeated in multiple tests, maybe move to a helper function?

public class GoogleCloudStorageInputSourceTest extends InitializedNullHandlingTest
{
- private static final long EXPECTED_MAX_LISTING_LENGTH = 1024L;
+ private static final int EXPECTED_MAX_LISTING_LENGTH = 10;
Contributor

nit: MAX_LISTING_LENGTH since we're mocking the maxListingLength() to this value

Contributor Author

fixed

@jihoonson
Contributor

Added "Design Review" since this PR adds a new user-facing configuration.

if (AzureUtils.AZURE_RETRY.apply(e)) {
throw new IOException("Recoverable exception", e);
}
log.warn("Exception when opening stream to azure resource, containerName: %s, blobPath: %s, Error: %s",
Contributor

Should the log level be error instead of warn?

Contributor Author

fixed


@JsonProperty
@Min(1)
private int maxListingLength = 1024;
Contributor

I think this should be in a separate class rather than in the class for deep storage configuration. I would suggest adding a new class AzureReadConfig (there may be a better name) that holds only the new configuration, so that we can add more read-related configurations in the future.

Contributor

Same for other cloud storage types.

Contributor

Also please add docs for the new configurations.

Contributor Author

@zachjsh zachjsh Feb 19, 2020

How about AzureInputDataConfig? And similar classes for AWS and Google
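A minimal sketch of what such a class might look like (the field name and default are taken from the diff above; the class shape itself is an assumption, not the merged code):

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import javax.validation.constraints.Min;

public class AzureInputDataConfig
{
  // Maximum number of input files to fetch per listing iteration;
  // the default mirrors the value shown in the diff above.
  @JsonProperty
  @Min(1)
  private int maxListingLength = 1024;

  public int getMaxListingLength()
  {
    return maxListingLength;
  }
}
```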

Contributor Author

fixed


@@ -43,17 +44,20 @@ public class GoogleCloudStorageInputSource extends CloudObjectInputSource<Google
private static final int MAX_LISTING_LENGTH = 1024;
Contributor

This variable is not used anymore.

Contributor Author

fixed

private Iterable<S3ObjectSummary> getIterableObjectsFromPrefixes()
{
-  return () -> S3Utils.objectSummaryIterator(s3Client, getPrefixes(), MAX_LISTING_LENGTH);
+  return () -> S3Utils.objectSummaryIterator(s3Client, getPrefixes(), segmentPusherConfig.getMaxListingLength());
Contributor

MAX_LISTING_LENGTH is defined in the parent class (CloudObjectInputSource) and is not used anymore. Please remove it.

Contributor Author

fixed

@jihoonson
Contributor

LGTM

Contributor

@jon-wei jon-wei left a comment

Can you also update the S3 and GCS docs?

|`druid.azure.container`||Azure Storage container name.|Must be set.|
|`druid.azure.protocol`|http or https||https|
|`druid.azure.maxTries`||Number of tries before cancel an Azure operation.|3|
|`druid.azure.prefix`|prefix to use, i.e. what directory.| |""|
Contributor

Suggest:

"A prefix string that will be prepended to the blob names for the segments published to Azure deep storage"

Contributor Author

fixed

-|`druid.azure.maxTries`||Number of tries before cancel an Azure operation.|3|
-|`druid.azure.prefix`|prefix to use, i.e. what directory.| |""|
+|`druid.azure.protocol`|the protocol to use|http or https|https|
+|`druid.azure.maxTries`|Number of tries before cancel an Azure operation.| |3|
Contributor

cancel -> canceling

Contributor Author

fixed


To configure connectivity to google cloud, run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.

|Property|Description|Possible Values|Default|
Contributor

Hm, this "Required Configuration" section and the "Configuration" section that starts at line 56 should probably be merged; the new wording you have is better, so let's use that.

Contributor Author

fixed

Contributor

@jon-wei jon-wei left a comment

👍 after CI

@jon-wei jon-wei merged commit f707064 into apache:master Feb 21, 2020
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020

4 participants