Add reasoning for choosing shardSpec to the MSQ report #16175

adarshsanjeev · 2024-03-20T08:04:29Z

MSQ chooses the shard spec based on certain criteria. However, these criteria are not transparent to the user. The only way to find out which shard spec was chosen is to look up a segment in the segment UI after the ingestion has finished.

This PR logs the segment type that was chosen and the reason for choosing it. It also adds both to the query report so that they can be displayed in the UI.

The information is exposed through a new section in the reports, segmentReport. This section contains the segment type created if the query is an ingestion, and is null otherwise.

The shardSpec field reports the shard spec type that was generated. MSQ prefers a range shard spec (DimensionRangeShardSpec) when possible. For INSERT and REPLACE queries, the default shard specs are NumberedShardSpec and DimensionRangeShardSpec respectively. If a range shard spec cannot be used for a REPLACE query, the details field contains the reason why.

{
  "multiStageQuery": {
    "type": "multiStageQuery",
    "taskId": "query-3dc0c45d-34d7-4b15-86c9-cdb2d3ebfc4e",
    "payload": {
      ... ,
      "segmentReport": {
        "shardSpec": "NumberedShardSpec",
        "details": "Cannot use RangeShardSpec, RangedShardSpec only supports string CLUSTER BY keys. Using NumberedShardSpec instead."
      },
      ...
    }
  }
}
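As a rough illustration of how a client might consume the two fields, here is a minimal sketch. The class `SegmentReportSketch` and its `describe` method are hypothetical names made up for this example and are not part of Druid; only the field names `shardSpec` and `details` come from the report above. Treating a missing `details` value as "nothing to explain" is also an assumption.

```java
import java.util.Objects;

/** Hypothetical model of the segmentReport section; not Druid's actual class. */
public final class SegmentReportSketch {
    private final String shardSpec; // e.g. "NumberedShardSpec"
    private final String details;   // free-text reason; assumed to be optional here

    public SegmentReportSketch(String shardSpec, String details) {
        this.shardSpec = Objects.requireNonNull(shardSpec, "shardSpec");
        this.details = details;
    }

    public String getShardSpec() {
        return shardSpec;
    }

    public String getDetails() {
        return details;
    }

    /** Builds a one-line summary, for example for a log line or a UI tooltip. */
    public String describe() {
        if (details == null || details.isEmpty()) {
            return "Segments use " + shardSpec + ".";
        }
        return "Segments use " + shardSpec + ": " + details;
    }

    public static void main(String[] args) {
        // Values copied from the example report in the PR description.
        SegmentReportSketch report = new SegmentReportSketch(
            "NumberedShardSpec",
            "Cannot use RangeShardSpec, RangedShardSpec only supports string CLUSTER BY keys. "
                + "Using NumberedShardSpec instead.");
        System.out.println(report.describe());
    }
}
```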

LakshSingla · 2024-04-01T05:06:56Z

docs/api-reference/sql-ingestion-api.md

@@ -299,7 +299,7 @@ The response shows an example report for a query.
        },
        "pendingTasks": 0,
        "runningTasks": 2,
-        "segmentLoadStatus": {
+        "segmentLoadWaiterStatus": {


This is a correction to the docs, right (and not a backward-incompatible API change)?

Yup, I noticed that the docs were incorrect.

LakshSingla · 2024-04-01T05:09:16Z

extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java

  )
  {
+    if (mayHaveMultiValuedClusterByFields) {
+      // DimensionRangeShardSpec cannot handle multi-valued fields.
+      return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTER BY clause contains a multivalues. Using NumberedShardSpec instead.");


nit: grammar
Also, if it's possible to pinpoint the multi-value fields without much refactoring, we can mention them here.

Suggested change

- return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTER BY clause contains a multivalues. Using NumberedShardSpec instead.");
+ return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, the fields in the CLUSTERED BY clause contains multivalues in column [%s]. Using NumberedShardSpec instead.");

I don't think we have the column name at this point; we only store a boolean, mayContainMultivalues. I've updated the message a bit.

LakshSingla · 2024-04-01T05:09:59Z

extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java

      }

      // DimensionRangeShardSpec only handles columns that appear as-is in the output.
      if (outputColumns.isEmpty()) {
-        return Collections.emptyList();
+        return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, RangeShardSpec only supports columns that appear as-is in the output. Using NumberedShardSpec instead.");


What does as-is mean?

Changed the message to "Could not find output column name for column [%s]" to include the column name. I'm not sure what conditions would cause the output column to not be found here.
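For illustration, the column name can be substituted into the quoted template with plain `String.format`. This is only a sketch: the class and method names are hypothetical, and the real change may use a Druid helper and a longer message than the fragment quoted in the reply above.

```java
import java.util.Locale;

/** Hypothetical helper showing how the quoted template could be filled in. */
public final class ReasonMessageSketch {
    // Only this fragment is quoted in the thread; the full message may be longer.
    private static final String TEMPLATE = "Could not find output column name for column [%s]";

    static String missingOutputColumn(String columnName) {
        // Locale.ROOT keeps formatting independent of the JVM's default locale.
        return String.format(Locale.ROOT, TEMPLATE, columnName);
    }

    public static void main(String[] args) {
        // "v0" is a made-up column name for the example.
        System.out.println(missingOutputColumn("v0"));
    }
}
```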

LakshSingla · 2024-04-01T05:14:08Z

extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/exec/ControllerImpl.java

    final List<KeyColumn> clusterByColumns = clusterBy.getColumns();
    final List<String> shardColumns = new ArrayList<>();
    final boolean boosted = isClusterByBoosted(clusterBy);
    final int numShardColumns = clusterByColumns.size() - clusterBy.getBucketByCount() - (boosted ? 1 : 0);

    if (numShardColumns == 0) {
-      return Collections.emptyList();
+      return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, as there are no shardColumns. Using NumberedShardSpec instead.");


What happens if the user doesn't supply a CLUSTERED BY clause? In that case, the reason doesn't seem necessary, or it can be reworded.

Suggested change

- return Pair.of(Collections.emptyList(), "Cannot use RangeShardSpec, as there are no shardColumns. Using NumberedShardSpec instead.");
+ return Pair.of(Collections.emptyList(), "Using NumberedShardSpec as no columns are supplied in the 'CLUSTERED BY' clause.");
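To make the arithmetic behind `numShardColumns` concrete, here is a simplified, self-contained sketch of the checks discussed in this review. It is not the `ControllerImpl` code: `ShardColumnsSketch` and `Result` are hypothetical stand-ins, plain strings replace `KeyColumn`, and the order of the checks, the position of the bucket and boost columns (bucket columns first, boost column last), the multi-value message wording, and the success message are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified sketch of the shard-column checks; not the ControllerImpl code. */
public final class ShardColumnsSketch {

    /** Minimal stand-in for the Pair of (shard columns, reason) used in the diff. */
    public static final class Result {
        public final List<String> shardColumns;
        public final String reason;

        Result(List<String> shardColumns, String reason) {
            this.shardColumns = Collections.unmodifiableList(new ArrayList<>(shardColumns));
            this.reason = reason;
        }
    }

    public static Result computeShardColumns(
        List<String> clusterByColumns,
        int bucketByCount,
        boolean boosted,
        boolean mayHaveMultiValuedClusterByFields
    ) {
        if (mayHaveMultiValuedClusterByFields) {
            // Illustrative wording only; only a boolean is known here, so no column is named.
            return new Result(
                Collections.emptyList(),
                "Cannot use RangeShardSpec, the CLUSTERED BY fields may contain multi-valued columns. "
                    + "Using NumberedShardSpec instead.");
        }

        // Same arithmetic as the diff: drop the bucket columns and the boost column, if any.
        final int numShardColumns = clusterByColumns.size() - bucketByCount - (boosted ? 1 : 0);
        if (numShardColumns == 0) {
            return new Result(
                Collections.emptyList(),
                "Using NumberedShardSpec as no columns are supplied in the 'CLUSTERED BY' clause.");
        }

        // Assumption: bucket columns come first and the boost column comes last.
        final List<String> shardColumns =
            clusterByColumns.subList(bucketByCount, bucketByCount + numShardColumns);
        return new Result(shardColumns, "Using RangeShardSpec."); // hypothetical success message
    }

    public static void main(String[] args) {
        // Made-up column names: one bucket column, two user columns, one boost column.
        Result ranged = computeShardColumns(
            Arrays.asList("__bucket", "country", "city", "__boost"), 1, true, false);
        System.out.println(ranged.shardColumns + " -> " + ranged.reason);

        // Only the bucket and boost columns remain: 2 - 1 - 1 = 0 shard columns.
        Result numbered = computeShardColumns(Arrays.asList("__bucket", "__boost"), 1, true, false);
        System.out.println(numbered.shardColumns + " -> " + numbered.reason);
    }
}
```

Returning the reason alongside the (possibly empty) column list keeps the decision and its explanation in one place, which is what lets the report surface it later.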

LakshSingla · 2024-04-04T07:42:13Z

I missed it before, but we should also add MSQTests where these cases are getting tripped, and assert the reason in the report.

cryptoe · 2024-04-05T06:13:42Z

cc @vogievetsky for the web console changes.

Add reason for choosing shardSpec to the report.

910f729

github-actions bot added the Area - Batch Ingestion and Area - MSQ (for multi-stage queries, https://github.com/apache/druid/issues/12262) labels Mar 20, 2024

Update docs

921aa61

github-actions bot added the Area - Documentation label Mar 21, 2024

LakshSingla reviewed Apr 1, 2024

Merge remote-tracking branch 'origin/master' into segment-reason-report

cd050bd

adarshsanjeev added 2 commits April 4, 2024 14:27

Update log messages

1958764

Add tests

156e8aa

cryptoe added the Needs web console change label (backend API changes that would benefit from frontend support in the web console) Apr 5, 2024

adarshsanjeev added 2 commits April 6, 2024 12:19

Merge remote-tracking branch 'origin/master' into segment-reason-report

d50bcab

Add more details

dd25a0f

LakshSingla approved these changes Apr 8, 2024

cryptoe merged commit e2e0cb9 into apache:master Apr 9, 2024
85 checks passed

adarshsanjeev added this to the 30.0.0 milestone May 6, 2024
