Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reasoning for choosing shardSpec to the MSQ report #16175

Merged
merged 7 commits into from Apr 9, 2024

Conversation

adarshsanjeev
Copy link
Contributor

@adarshsanjeev adarshsanjeev commented Mar 20, 2024

MSQ chooses the shard spec based on certain criteria. However, this criteria is not very transparent to the user. The only way to find the shard spec which was chosen is to search for a segment in the segment UI after the ingestion is finished.

This PR logs the segment type and reason chosen. It also adds it to the query report, to be displayed in the UI.

This PR adds a new section to the reports, segmentReport. This contains the segment type created, if the query is an ingestion, and null otherwise.

The shardSpec mentions the shardSpec type generated. MSQ prefers to use RangedShardSpec when possible. For inserts and replace queries, the default shard spec is NumberedShardSpec and DimensionRangeShardSpec respectively. If a ranged shard spec cannot be chosen for the replace query, the details field will contain the reason why it could not be used.

{
  "multiStageQuery": {
    "type": "multiStageQuery",
    "taskId": "query-3dc0c45d-34d7-4b15-86c9-cdb2d3ebfc4e",
    "payload": {
        ... ,
        "segmentReport": {
          "shardSpec": "NumberedShardSpec",
          "details": "Cannot use RangeShardSpec, RangedShardSpec only supports string CLUSTER BY keys. Using NumberedShardSpec instead."
        },
        ...
  }
}

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Mar 20, 2024
@@ -299,7 +299,7 @@ The response shows an example report for a query.
},
"pendingTasks": 0,
"runningTasks": 2,
"segmentLoadStatus": {
"segmentLoadWaiterStatus": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a correction to the doc right (and not a backward incompatible API change)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I noticed that the docs were incorrect

)
{
if (mayHaveMultiValuedClusterByFields) {
// DimensionRangeShardSpec cannot handle multi-valued fields.
return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTER BY clause contains a multivalues. Using NumberedShardSpec instead.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: grammar
Also, if its possible to pinpoint the multiValue fields without much refactoring, then we can mention that here.

Suggested change
return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTER BY clause contains a multivalues. Using NumberedShardSpec instead.");
return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTERED BY clause contains multivalues in column [%s]. Using NumberedShardSpec instead.");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have the column name at this point, we only store a boolean mayContainMultivalues. Updated the message a bit

}

// DimensionRangeShardSpec only handles columns that appear as-is in the output.
if (outputColumns.isEmpty()) {
return Collections.emptyList();
return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, RangeShardSpec only supports columns that appear as-is in the output. Using NumberedShardSpec instead.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does as-is mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the message to "Could not find output column name for column [%s]" to include the column name. I'm not sure what conditions would cause the output column to not be found here.

final List<KeyColumn> clusterByColumns = clusterBy.getColumns();
final List<String> shardColumns = new ArrayList<>();
final boolean boosted = isClusterByBoosted(clusterBy);
final int numShardColumns = clusterByColumns.size() - clusterBy.getBucketByCount() - (boosted ? 1 : 0);

if (numShardColumns == 0) {
return Collections.emptyList();
return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, as there are no shardColumns. Using NumberedShardSpec instead.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the user doesn't supply the clustered by. In that case, the reason doesn't seem necessary, or it can be reworded.

Suggested change
return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, as there are no shardColumns. Using NumberedShardSpec instead.");
return Pair.of(Collections.emptyList(), "Using NumberedShardSpec as no columns are supplied in the 'CLUSTERED BY' clause.");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@LakshSingla
Copy link
Contributor

I missed it before, but we should also add MSQTests where these cases are getting tripped, and assert the reason in the report.

@cryptoe cryptoe added the Needs web console change Backend API changes that would benefit from frontend support in the web console label Apr 5, 2024
@cryptoe
Copy link
Contributor

cryptoe commented Apr 5, 2024

cc @vogievetsky for the web console changes.

@cryptoe cryptoe merged commit e2e0cb9 into apache:master Apr 9, 2024
85 checks passed
@adarshsanjeev adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - Documentation Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Needs web console change Backend API changes that would benefit from frontend support in the web console
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants