Account for data format and compression in MSQ auto taskAssignment #14307

Merged
merged 12 commits into from
Jun 1, 2023

Contributor

@zachjsh zachjsh commented May 17, 2023

Description

This change makes MSQ ingestion consider the input format and compression when deciding how to split input files among available tasks under the maxInputBytesPerWorker query context parameter. This parameter lets users control the maximum number of bytes, at the granularity of an input file / object, that each ingestion task is assigned to ingest. With this change, the parameter denotes the estimated weighted size in bytes of the input to split on, accounting for input format and compression format, rather than the actual file size reported by the file system.

We assume uncompressed newline-delimited JSON as the baseline, with a scale factor of 1: when computing the byte weight that a file contributes toward input splitting, the size of an uncompressed JSON file is taken as is (1:1). Testing found that gzip-compressed JSON and Parquet have scale factors of 4 and 8 respectively, meaning that each byte of such data is weighted 4x and 8x respectively when computing input splits.

This weighted byte scaling is currently applied only to MSQ ingestion that uses either LocalInputSource or CloudObjectInputSource. The default value of the maxInputBytesPerWorker query context parameter has been lowered from 10 GiB to 512 MiB.
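To make the weighting concrete, here is a minimal sketch of how weighted sizes could drive file-granularity splitting. It is not the code in this PR: the class name, the suffix checks, and the greedy grouping strategy are all assumptions made for illustration. Only the factors 1, 4, and 8 come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. Names and the greedy strategy are hypothetical,
// not Druid's actual API or splitting algorithm.
final class WeightedSplitSketch
{
  static final long GZ_JSON_FACTOR = 4L;  // gzip-compressed JSON, per the PR description
  static final long PARQUET_FACTOR = 8L;  // Parquet, per the PR description

  // Estimated weight that one file contributes toward maxInputBytesPerWorker.
  static long weightedSize(String path, long sizeBytes)
  {
    if (path.endsWith(".parquet")) {
      return sizeBytes * PARQUET_FACTOR;
    }
    if (path.endsWith(".gz")) {
      return sizeBytes * GZ_JSON_FACTOR;
    }
    return sizeBytes; // uncompressed JSON baseline, weighted 1:1
  }

  // Greedy, order-preserving grouping: start a new group whenever adding the
  // next file would push the group's weighted size over the limit. Files are
  // never divided, so a single oversized file still gets a group of its own.
  static List<List<String>> split(List<String> paths, List<Long> sizes, long maxWeightedBytes)
  {
    List<List<String>> groups = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentWeight = 0L;
    for (int i = 0; i < paths.size(); i++) {
      long weight = weightedSize(paths.get(i), sizes.get(i));
      if (!current.isEmpty() && currentWeight + weight > maxWeightedBytes) {
        groups.add(current);
        current = new ArrayList<>();
        currentWeight = 0L;
      }
      current.add(paths.get(i));
      currentWeight += weight;
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}
```

With a 512 MiB limit, a 300 MiB JSON file, a 100 MiB gzipped JSON file, and a 64 MiB Parquet file total only 464 MiB on disk, but their weighted sizes (300, 400, and 512 MiB) place each file in its own group under this sketch.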

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

  in LocalInputSource CloudObjectInputSource
@zachjsh zachjsh requested a review from gianm May 17, 2023 22:40
@@ -198,7 +240,7 @@
         throw Throwables.propagate(e);
       }
     } else {
-      final File tmpFile = File.createTempFile("compressionUtilZipCache", ZIP_SUFFIX);
+      final File tmpFile = File.createTempFile("compressionUtilZipCache", Format.ZIP.getSuffix());

Check warning

Code scanning / CodeQL

Local information disclosure in a temporary directory

Local information disclosure vulnerability due to use of file readable by other local users.
@clintropolis clintropolis added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label May 18, 2023
@zachjsh zachjsh marked this pull request as ready for review May 23, 2023 19:45
Contributor

@gianm gianm left a comment

Other than the line comments, some comments about how this is handled in MSQ:

  • Let's lower Limits#DEFAULT_MAX_INPUT_BYTES_PER_WORKER as well. The default value is a bit high even for uncompressed JSON.
  • Documentation for taskAssignment in MSQ's reference.md will need an update.

*
* @return The weighted size of the input object.
*/
@JsonIgnore
Contributor

There should be no need for @JsonIgnore here. Typically our ObjectMapper is configured to ignore all methods that aren't explicitly annotated with @JsonProperty. Are you seeing something different?

* @return The weighted size of the input object.
*/
@JsonIgnore
default long getWeightedSize(@Nullable CompressionUtils.Format compressionFormat, long size)
Contributor

IIRC, decompression is under the control of the InputFormat itself. So the caller shouldn't be passing in the CompressionUtils.Format, as it doesn't really know. It should pass in the filename and let the InputFormat decide what it wants to do.

Contributor Author

fixed
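
A rough sketch of the shape suggested in this thread, where the caller passes only the file name and size and each format decides how to weight it. The interface and classes below are hypothetical stand-ins, not Druid's InputFormat or CompressionUtils, and the suffix check is an assumed simplification of compression detection. The factors 4 and 8 are the ones quoted in the PR description.

```java
// Hypothetical stand-ins for illustration; not Druid's real InputFormat API.
interface FormatSketch
{
  // Default behavior: no scaling, the on-disk size is the weight.
  default long getWeightedSize(String path, long size)
  {
    return size;
  }
}

// Text-like format (for example JSON): weight depends on whether the file is gzipped.
final class TextFormatSketch implements FormatSketch
{
  static final long COMPRESSED_TEXT_WEIGHT_FACTOR = 4L; // value from the PR description

  @Override
  public long getWeightedSize(String path, long size)
  {
    // The format inspects the file name itself, so the caller never has to
    // know which compression, if any, the format will apply when reading.
    return path.endsWith(".gz") ? size * COMPRESSED_TEXT_WEIGHT_FACTOR : size;
  }
}

// Columnar format (for example Parquet): always scaled, regardless of file name.
final class ColumnarFormatSketch implements FormatSketch
{
  static final long COLUMNAR_WEIGHT_FACTOR = 8L; // Parquet value from the PR description

  @Override
  public long getWeightedSize(String path, long size)
  {
    return size * COLUMNAR_WEIGHT_FACTOR;
  }
}
```

Keeping the decision inside the format means a new format or compression scheme can change its weighting without touching the callers that compute splits.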

public long getWeightedSize(@Nullable CompressionUtils.Format compressionFormat, long size)
{
  if (CompressionUtils.Format.GZ == compressionFormat) {
    return size * 4L;
Contributor

Better to have this be a constant in CompressionUtils.

Contributor Author

fixed

@@ -156,6 +158,16 @@ public InputEntityReader createReader(InputRowSchema inputRowSchema, InputEntity
}
}

@JsonIgnore
@Override
public long getWeightedSize(@Nullable CompressionUtils.Format compressionFormat, long size)
Contributor

I see this implemented for json, csv, and parquet. Please implement it for Avro, ORC, regex, and delimited as well. Delimited and regex can be the same as CSV. Avro and ORC can be the same as Parquet.

Contributor Author

Good call, added for those as well.

item -> {
  if (null != inputFormat) {
    InputFileAttribute inputFileAttribute = inputAttributeExtractor.apply(item);
    return inputFormat.getWeightedSize(
Contributor

It'd be nice for callers to not need the InputFormat here. How about having InputFileAttribute expose getSize and getWeightedSize?

Contributor Author

Good idea, fixed
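
One way to read this suggestion, sketched with a hypothetical value class (the real InputFileAttribute may look different): the attribute object carries both the raw size and the precomputed weighted size, so code that sums or splits inputs never needs a reference to the input format.

```java
import java.util.List;

// Hypothetical stand-in for illustration; not Druid's actual InputFileAttribute.
final class FileAttributeSketch
{
  private final long size;          // size reported by the file system
  private final long weightedSize;  // size after format/compression scaling

  FileAttributeSketch(long size, long weightedSize)
  {
    this.size = size;
    this.weightedSize = weightedSize;
  }

  long getSize()
  {
    return size;
  }

  long getWeightedSize()
  {
    return weightedSize;
  }

  // A caller can total the weights without knowing anything about formats.
  static long totalWeightedSize(List<FileAttributeSketch> files)
  {
    long total = 0L;
    for (FileAttributeSketch file : files) {
      total += file.getWeightedSize();
    }
    return total;
  }
}
```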

@zachjsh zachjsh requested a review from gianm May 24, 2023 22:09
@zachjsh zachjsh requested a review from jon-wei June 1, 2023 18:39
{
  CompressionUtils.Format compressionFormat = CompressionUtils.Format.fromFileName(path);
  if (CompressionUtils.Format.GZ == compressionFormat) {
    return size * CompressionUtils.COMPRESSED_TEXT_WEIGHT_FACTOR;
Contributor

Is COMPRESSED_TEXT_WEIGHT_FACTOR specific to GZIP?

If so, maybe rename the constant to be specific to GZ. Also, do we want to consider other compression formats here and in similar places?

Contributor Author

It's not; gzip is just the most common format, and the one that we have data for. We can add others as the need arises.

@zachjsh zachjsh merged commit e75fb8e into apache:master Jun 1, 2023
@zachjsh zachjsh deleted the msq-auto-data-format-estimate branch June 1, 2023 19:53
@vogievetsky vogievetsky added the Needs web console change Backend API changes that would benefit from frontend support in the web console label Jul 3, 2023
abhishekagarwal87 pushed a commit that referenced this pull request Jul 17, 2023
This PR catches the console up to all the backend changes for Druid 27

Specifically:

Add page information to SqlStatementResource API #14512
Allow empty tiered replicants map for load rules #14432
Adding Interactive API's for MSQ engine #14416
Add replication factor column to sys table #14403
Account for data format and compression in MSQ auto taskAssignment #14307
Errors take 3 #14004
AmatyaAvadhanula pushed a commit to AmatyaAvadhanula/druid that referenced this pull request Jul 17, 2023
abhishekagarwal87 pushed a commit that referenced this pull request Jul 17, 2023

Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
@vogievetsky vogievetsky removed the Needs web console change Backend API changes that would benefit from frontend support in the web console label Aug 8, 2023
Labels
Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262