Add Azure config options for segment prefix and max listing length #9356
jon-wei merged 7 commits into apache:master from
Conversation
Added configuration options to allow the user to specify the prefix within the segment container to store the segment files. Also added a configuration option to allow the user to specify the maximum number of input files to stream for each iteration.
    private String container;

    @JsonProperty
    @Nonnull
Why Nonnull when the previous one is annotated with 'NotNull'?
Also, is prefix a required config? Why is it assigned an empty string?
Perhaps using a @JsonCreator constructor with Precondition checks would make it clearer which fields are required:
@JsonCreator
public AzureDataSegmentConfig(
    ...
    @JsonProperty("prefix") String prefix)
{
  Preconditions.checkState(!StringUtils.isEmpty(prefix), "prefix must be non empty");
  this.prefix = prefix;
  ...
}
Then you don't need all the setters
prefix is not required. Before adding this option, segments were written to the root directory within the specified segment container, in a directory named after the datasource. Do we want to change the behavior here and specify a non-empty default prefix? I'm not sure how this change would affect users already using the Azure extension whose data is already written; would we be unable to find the segment data in that case?
I will fix this to @NotNull.
suneet-s
left a comment
Looks good - 👍
Some suggestions. My main concern is with the changed behavior of AzureUtils#AZURE_RETRY.
    segmentConfig.getContainer(),
    accountConfig.getAccount(),
    AzureUtils.AZURE_STORAGE_HOST_ADDRESS,
    segmentConfig.getPrefix().isEmpty() ? "" : segmentConfig.getPrefix() + '/'
What if prefix ends with a /? Is there a util that will build the path with only one separator at the end? Is there any harm if the path ends with two /?
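One way to answer the trailing-separator question above is a small joining helper that guarantees exactly one / between the prefix and the rest of the path. This is a hypothetical sketch (the class and method names are invented, not part of the PR):

```java
class AzurePaths
{
  // Join a possibly-empty, possibly-slash-terminated prefix to a blob path,
  // guaranteeing exactly one '/' separator between the two parts.
  static String joinPrefix(String prefix, String rest)
  {
    if (prefix == null || prefix.isEmpty()) {
      return rest;
    }
    // Strip a single trailing '/' so "seg/" and "seg" behave identically.
    String trimmed = prefix.endsWith("/") ? prefix.substring(0, prefix.length() - 1) : prefix;
    return trimmed + "/" + rest;
  }
}
```

With a helper like this, callers no longer need the `isEmpty() ? "" : prefix + '/'` ternary quoted in the hunk above.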
    public String getStorageDir(DataSegment dataSegment, boolean useUniquePath)
    {
      String prefix = segmentConfig.getPrefix();
      boolean prefixIsNullOrEmpty = (prefix == null || prefix.isEmpty());
org.apache.commons.lang.StringUtils.isEmpty(prefix)
    Throwable t = e;
    for (Throwable t2 = e.getCause(); t2 != null; t2 = t2.getCause()) {
      t = t2;
    }
This unravels the exception's cause chain; should we check to an unlimited depth?
This also changes the current behavior: previously, if the top-level throwable was a "retryable" exception we'd retry, but with this change, if a StorageException is caused by a RuntimeException we won't retry. Is this intentional?
I think the below if clauses should be checked in the above for loop.
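Moving the checks into the loop, as suggested, would mean testing every level of the cause chain, including the top-level throwable, rather than only the root cause. A hedged sketch of that shape (the helper name is invented, and the predicate stands in for AzureUtils.AZURE_RETRY):

```java
import java.util.function.Predicate;

class RetryCheck
{
  // Walk the whole cause chain, starting from the top-level throwable,
  // and treat the exception as retryable if ANY level matches --
  // instead of applying the check only to the innermost cause.
  static boolean isRetryable(Throwable e, Predicate<Throwable> retryable)
  {
    for (Throwable t = e; t != null; t = t.getCause()) {
      if (retryable.test(t)) {
        return true;
      }
    }
    return false;
  }
}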
    addExpectedGetObjectMock(EXPECTED_URIS.get(1));
    EasyMock.expect(CONFIG.getMaxListingLength()).andReturn(EXPECTED_MAX_LISTING_LENGTH);
    EasyMock.replay(STORAGE);
    EasyMock.replay(CONFIG);
nit: looks like this is repeated in multiple tests; maybe move it to a helper function?
    public class GoogleCloudStorageInputSourceTest extends InitializedNullHandlingTest
    {
-     private static final long EXPECTED_MAX_LISTING_LENGTH = 1024L;
+     private static final int EXPECTED_MAX_LISTING_LENGTH = 10;
nit: rename to MAX_LISTING_LENGTH, since we're mocking maxListingLength() to this value.
Added "Design Review" since this PR adds a new user-facing configuration.
    if (AzureUtils.AZURE_RETRY.apply(e)) {
      throw new IOException("Recoverable exception", e);
    }
    log.warn("Exception when opening stream to azure resource, containerName: %s, blobPath: %s, Error: %s",
Should the log level be error instead of warn?
    @JsonProperty
    @Min(1)
    private int maxListingLength = 1024;
I think this should be in a separate class rather than in the class for deep storage configuration. I would suggest adding a new class, AzureReadConfig (there may be a better name), that holds only the new configuration, so that we can add more read-related configurations in the future.
Same for other cloud storage types.
Also please add docs for the new configurations.
How about AzureInputDataConfig? And similar classes for AWS and Google
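A sketch of what such a separate class might look like. The name AzureInputDataConfig comes from the comment above, but the body is an assumption, not the merged implementation; the Jackson/validation annotations it would carry in Druid are noted in comments so the sketch compiles on its own:

```java
// Hypothetical read-path config class, separate from the deep storage config,
// so future read-related options have an obvious home.
class AzureInputDataConfig
{
  // In Druid this field would be annotated with @JsonProperty and @Min(1);
  // annotations are omitted here to keep the sketch dependency-free.
  private int maxListingLength = 1024;

  public int getMaxListingLength()
  {
    return maxListingLength;
  }
}
```

The input source would then take this class via injection instead of reaching into the segment pusher configuration.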
    @@ -43,17 +44,20 @@ public class GoogleCloudStorageInputSource extends CloudObjectInputSource<Google
    private static final int MAX_LISTING_LENGTH = 1024;
This variable is not used anymore.
    private Iterable<S3ObjectSummary> getIterableObjectsFromPrefixes()
    {
-     return () -> S3Utils.objectSummaryIterator(s3Client, getPrefixes(), MAX_LISTING_LENGTH);
+     return () -> S3Utils.objectSummaryIterator(s3Client, getPrefixes(), segmentPusherConfig.getMaxListingLength());
MAX_LISTING_LENGTH is defined in the parent class (CloudObjectInputSource) and is not used anymore. Please remove it.
LGTM
jon-wei
left a comment
Can you also update the S3 and GCS docs?
    |`druid.azure.container`||Azure Storage container name.|Must be set.|
    |`druid.azure.protocol`|http or https||https|
    |`druid.azure.maxTries`||Number of tries before cancel an Azure operation.|3|
    |`druid.azure.prefix`|prefix to use, i.e. what directory.| |""|
Suggest:
"A prefix string that will be prepended to the blob names for the segments published to Azure deep storage"
    |`druid.azure.maxTries`||Number of tries before cancel an Azure operation.|3|
    |`druid.azure.prefix`|prefix to use, i.e. what directory.| |""|
    |`druid.azure.protocol`|the protocol to use|http or https|https|
    |`druid.azure.maxTries`|Number of tries before cancel an Azure operation.| |3|
    To configure connectivity to google cloud, run druid processes with `GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_keyfile` in the environment.

    |Property|Description|Possible Values|Default|
Hm, this "Required Configuration" section and the "Configuration" section that starts at line 56 should probably be merged. The new wording you have is better, so let's use that.