Account for data format and compression in MSQ auto taskAssignment #14307
Conversation
In LocalInputSource and CloudObjectInputSource:
@@ -198,7 +240,7 @@
        throw Throwables.propagate(e);
      }
    } else {
-     final File tmpFile = File.createTempFile("compressionUtilZipCache", ZIP_SUFFIX);
+     final File tmpFile = File.createTempFile("compressionUtilZipCache", Format.ZIP.getSuffix());
Check warning (Code scanning / CodeQL): Local information disclosure in a temporary directory
Other than the line comments, some comments about how this is handled in MSQ:
- Let's lower Limits#DEFAULT_MAX_INPUT_BYTES_PER_WORKER as well. The default value is a bit high even for uncompressed JSON.
- Documentation for taskAssignment in MSQ's reference.md will need an update.
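A minimal sketch of the lowered default in Limits, assuming it tracks the 10 GiB to 512 MiB change described in the PR description (the exact declaration in the PR may differ):

public class Limits
{
  // Lowered from 10 GiB: with weighted input sizes, the old default was a
  // bit high even for uncompressed JSON. Value assumed from the description.
  public static final long DEFAULT_MAX_INPUT_BYTES_PER_WORKER = 512L * 1024 * 1024;
}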
 *
 * @return The weighted size of the input object.
 */
@JsonIgnore
There should be no need for @JsonIgnore here. Typically our ObjectMapper is configured to ignore all methods that aren't explicitly annotated with @JsonProperty. Are you seeing something different?
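For context, a minimal sketch of the kind of ObjectMapper configuration the reviewer is describing (an assumption about the setup, not Druid's actual wiring), where only explicitly annotated members are (de)serialized:

import com.fasterxml.jackson.annotation.JsonAutoDetect;
import com.fasterxml.jackson.annotation.PropertyAccessor;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MapperConfig
{
  public static ObjectMapper strictMapper()
  {
    ObjectMapper mapper = new ObjectMapper();
    // Turn off auto-detection entirely: fields and getters are only picked up
    // when annotated with @JsonProperty, making @JsonIgnore on an unannotated
    // method redundant.
    mapper.setVisibility(PropertyAccessor.ALL, JsonAutoDetect.Visibility.NONE);
    return mapper;
  }
}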
 * @return The weighted size of the input object.
 */
@JsonIgnore
default long getWeightedSize(@Nullable CompressionUtils.Format compressionFormat, long size)
IIRC, decompression is under the control of the InputFormat itself. So the caller shouldn't be passing in the CompressionUtils.Format, as it doesn't really know. It should pass in the filename and let the InputFormat decide what it wants to do.
fixed
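Presumably the fixed signature passes the file name instead, along these lines (a sketch; the actual fixed code appears in later snippets in this review, which derive the format via CompressionUtils.Format.fromFileName):

/**
 * @return The weighted size of the input object, derived from the file name
 *         rather than a caller-supplied compression format.
 */
default long getWeightedSize(String path, long size)
{
  // Default: no weighting. Formats override this to scale compressed input.
  return size;
}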
public long getWeightedSize(@Nullable CompressionUtils.Format compressionFormat, long size)
{
  if (CompressionUtils.Format.GZ == compressionFormat) {
    return size * 4L;
Better to have this be a constant in CompressionUtils
fixed
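The constant presumably ended up in CompressionUtils looking something like the following (the name appears in the fixed snippet later in this review; the value 4 matches the gzip scale factor from the PR description):

// Weight applied per byte of gzip-compressed text input: each compressed
// byte is treated as roughly 4 bytes of work when splitting input.
public static final int COMPRESSED_TEXT_WEIGHT_FACTOR = 4;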
@@ -156,6 +158,16 @@ public InputEntityReader createReader(InputRowSchema inputRowSchema, InputEntity
   }
 }

+  @JsonIgnore
+  @Override
+  public long getWeightedSize(@Nullable CompressionUtils.Format compressionFormat, long size)
I see this implemented for json, csv, and parquet. Please implement it for Avro, ORC, regex, and delimited as well. Delimited and regex can be the same as CSV. Avro and ORC can be the same as Parquet.
good call, added for those as well.
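For example, the delimited and regex overrides can presumably mirror the CSV weighting (a sketch under that assumption; the exact code in the PR may differ):

@Override
public long getWeightedSize(String path, long size)
{
  // Delimited text compresses like CSV/JSON: weight gzip input by the shared
  // compressed-text factor, otherwise count raw bytes 1:1.
  CompressionUtils.Format compressionFormat = CompressionUtils.Format.fromFileName(path);
  if (CompressionUtils.Format.GZ == compressionFormat) {
    return size * CompressionUtils.COMPRESSED_TEXT_WEIGHT_FACTOR;
  }
  return size;
}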
item -> {
  if (null != inputFormat) {
    InputFileAttribute inputFileAttribute = inputAttributeExtractor.apply(item);
    return inputFormat.getWeightedSize(
It'd be nice for callers to not need the InputFormat here. How about having InputFileAttribute carry both getSize and getWeightedSize?
Good idea, fixed
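A sketch of the resulting InputFileAttribute shape, assuming the weighted size is computed once at construction so callers no longer need the InputFormat:

public class InputFileAttribute
{
  private final long size;
  private final long weightedSize;

  public InputFileAttribute(long size, long weightedSize)
  {
    this.size = size;
    this.weightedSize = weightedSize;
  }

  // Raw size reported by the file system.
  public long getSize()
  {
    return size;
  }

  // Format- and compression-aware size used when splitting input for tasks.
  public long getWeightedSize()
  {
    return weightedSize;
  }
}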
{
  CompressionUtils.Format compressionFormat = CompressionUtils.Format.fromFileName(path);
  if (CompressionUtils.Format.GZ == compressionFormat) {
    return size * CompressionUtils.COMPRESSED_TEXT_WEIGHT_FACTOR;
Is COMPRESSED_TEXT_WEIGHT_FACTOR specific to GZIP? If so, maybe rename the constant to be specific to GZ. Also, do we want to consider other compression formats here and in similar places?
It's not; gzip is just the most common, and the one that we have data for. We can add others as the need arises.
This PR catches the console up to all the backend changes for Druid 27. Specifically: Add page information to SqlStatementResource API #14512; Allow empty tiered replicants map for load rules #14432; Adding Interactive API's for MSQ engine #14416; Add replication factor column to sys table #14403; Account for data format and compression in MSQ auto taskAssignment #14307; Errors take 3 #14004. Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
Description

This change allows the input format and compression format to be taken into account when computing how to split input files among available tasks in MSQ ingestion, as governed by the maxInputBytesPerWorker query context parameter. This parameter lets users control the maximum number of bytes, at the granularity of an input file/object, that each ingestion task is assigned to ingest. With this change, the parameter denotes the estimated weighted size in bytes of the input to split on, with consideration for input format and compression format, rather than the actual file size reported by the file system. We assume uncompressed newline-delimited JSON as the baseline, with a scaling factor of 1: when computing the byte weight a file contributes toward input splitting, an uncompressed JSON file's size is counted 1:1. It was found during testing that gzip-compressed JSON and Parquet have scale factors of 4 and 8 respectively, meaning each byte of such data is weighted 4x and 8x when computing input splits. This weighted byte scaling is currently only applied to MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource. The default value of the maxInputBytesPerWorker query context parameter has been updated from 10 GiB to 512 MiB.
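To make the weighting concrete, here is a small self-contained sketch of the computation described above (the scale factors and the 512 MiB default come from the description; the class and method names are illustrative, not the PR's actual code):

public class WeightedSizeExample
{
  // Scale factors from the PR description: uncompressed JSON is the 1x
  // baseline; gzip-compressed JSON weighs 4x; Parquet weighs 8x.
  static long weightedSize(String fileName, long sizeBytes)
  {
    if (fileName.endsWith(".gz")) {
      return sizeBytes * 4L;
    } else if (fileName.endsWith(".parquet")) {
      return sizeBytes * 8L;
    }
    return sizeBytes; // uncompressed JSON and other text counted 1:1
  }

  public static void main(String[] args)
  {
    long maxInputBytesPerWorker = 512L * 1024 * 1024; // new 512 MiB default

    // A 100 MiB gzip JSON file counts as 400 MiB toward the split limit,
    // so only one such file fits under the per-worker budget.
    long w = weightedSize("events.json.gz", 100L * 1024 * 1024);
    System.out.printf("weighted=%d MiB, fits=%d per worker%n",
        w / (1024 * 1024), maxInputBytesPerWorker / w);
  }
}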