A new includeAllDimension flag for dimensionsSpec #12276

jihoonson · 2022-02-23T07:16:51Z

Description

Today, dimensionsSpec has two modes when used with flattenSpec:

If dimensions is set in dimensionsSpec, only those explicit dimensions are ingested.
If dimensions is not set, all dimensions that flattenSpec provides are ingested.

However, one use case is missing: ingesting both the dimensions listed in dimensionsSpec and any other dimensions discovered through flattenSpec during ingestion. To address this, this PR adds a new flag, includeAllDimensions, to dimensionsSpec. When the flag is set, MapInputRowParser puts all explicit dimensions first in the InputRow, followed by any other dimensions found in the input data.
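
To make the ordering concrete, here is a minimal sketch of the selection logic described above. It is not the actual MapInputRowParser code: the class name DimensionOrderSketch, the method signature, and the sample field names are assumptions made for illustration, and dimensions are reduced to plain strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the dimension selection described in this PR.
// It is NOT the real MapInputRowParser implementation.
final class DimensionOrderSketch
{
  static List<String> findDimensions(
      List<String> explicitDimensions,
      Set<String> dimensionExclusions,
      boolean includeAllDimensions,
      Map<String, Object> rawInputRow
  )
  {
    if (includeAllDimensions) {
      // Explicit dimensions come first, in their declared order.
      LinkedHashSet<String> ordered = new LinkedHashSet<>(explicitDimensions);
      // Then every other non-excluded field, in the order it appears in the row.
      for (String field : rawInputRow.keySet()) {
        if (!dimensionExclusions.contains(field)) {
          ordered.add(field);
        }
      }
      return new ArrayList<>(ordered);
    }
    if (!explicitDimensions.isEmpty()) {
      // Flag unset and explicit dimensions given: ingest only those.
      return new ArrayList<>(explicitDimensions);
    }
    // Flag unset and no explicit dimensions: ingest whatever the row provides.
    List<String> discovered = new ArrayList<>();
    for (String field : rawInputRow.keySet()) {
      if (!dimensionExclusions.contains(field)) {
        discovered.add(field);
      }
    }
    return discovered;
  }

  public static void main(String[] args)
  {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("time", "2022-02-23T00:00:00Z");
    row.put("country", "US");
    row.put("page", "Druid");
    row.put("user", "alice");
    // prints [page, country, user]
    System.out.println(findDimensions(List.of("page"), Set.of("time"), true, row));
  }
}
```

The sketch uses a LinkedHashSet so that a field already listed explicitly keeps its explicit position and is not added a second time when it is also found in the row.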

To reviewers, my apologies for an invasive PR. The number of files is huge, but they are mostly unit tests as tons of them are creating dimensionsSpec. To avoid such an invasive change in the future, I added a builder for dimensionsSpec and fixed unit tests to use the builder instead.

Key changed/added classes in this PR

DimensionsSpec
MapInputRowParser

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

kfaraz

Thanks for the changes, @jihoonson .
The DimensionsSpec.Builder and DimensionsSpec.EMPTY shorthand make the code much more readable.

Overall LGTM, have added some minor comments.

kfaraz · 2022-02-23T08:54:36Z

core/src/test/java/org/apache/druid/data/input/impl/MapInputRowParserTest.java

 public class MapInputRowParserTest
 {
  @Rule
  public ExpectedException expectedException = ExpectedException.none();

  private final TimestampSpec timestampSpec = new TimestampSpec("time", null, null);
-  private final List<String> dimensions = ImmutableList.of("dim");
-  private final Set<String> dimensionExclusions = ImmutableSet.of();


Is the behaviour of dimensionExclusions changing too?
This set was originally initialized to empty here and the "time" was being sent only in theMap.
But with this change, the dimensionExclusions now contain the "time" column.

The behavior hasn't changed. I made the test more realistic since the timestamp field name is always in dimensionExclusions in production. See https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/segment/indexing/DataSchema.java#L162-L177.

kfaraz · 2022-02-23T09:02:38Z

core/src/main/java/org/apache/druid/data/input/impl/MapInputRowParser.java

@@ -69,29 +66,32 @@ public static InputRow parse(InputRowSchema inputRowSchema, Map<String, Object>
    return parse(inputRowSchema.getTimestampSpec(), inputRowSchema.getDimensionsSpec(), theMap);
  }

-  private static InputRow parse(
-      TimestampSpec timestampSpec,
+  private static List<String> findDimensions(


Maybe add some comment here on

how we are using the flag and fields to include/exclude dimensions

what is theMap expected to contain (I guess we could use a better name for this field too)

I added javadoc and renamed theMap to rawInputRow.

kfaraz · 2022-02-23T09:08:58Z

...c/test/java/org/apache/druid/indexing/firehose/IngestSegmentFirehoseFactoryTimelineTest.java

@@ -91,9 +91,7 @@
          new JSONParseSpec(
              new TimestampSpec(TIME_COLUMN, "auto", null),
              new DimensionsSpec(
-                  DimensionsSpec.getDefaultSchemas(Arrays.asList(DIMENSIONS)),


new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensionNames))
seems like a very frequent construct (at least in tests).
Should we just add a List<String> constructor to DimensionsSpec?
Or maybe a static DimensionsSpec.forDimensionNames(dimensionNames)?

I could have done that, but I added something similar to the builder (Builder.setDefaultSchemaDimensions()) to encourage people to use the builder. Does this make sense to you?

Yes, that makes sense.
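
For readers unfamiliar with the pattern being discussed, the following is a rough sketch of how a builder keeps call sites stable when a new constructor argument such as includeAllDimensions is added. DimensionsSpecSketch is a hypothetical miniature, not the real DimensionsSpec; only the method name setDefaultSchemaDimensions comes from this thread, and the other setters and fields are assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a spec class with a builder. This is NOT the real
// DimensionsSpec; dimensions are reduced to plain names for illustration.
final class DimensionsSpecSketch
{
  final List<String> dimensionNames;
  final List<String> dimensionExclusions;
  final boolean includeAllDimensions;

  private DimensionsSpecSketch(Builder builder)
  {
    this.dimensionNames = List.copyOf(builder.dimensionNames);
    this.dimensionExclusions = List.copyOf(builder.dimensionExclusions);
    this.includeAllDimensions = builder.includeAllDimensions;
  }

  static Builder builder()
  {
    return new Builder();
  }

  static final class Builder
  {
    private List<String> dimensionNames = new ArrayList<>();
    private List<String> dimensionExclusions = new ArrayList<>();
    // A new option only needs a field with a default and a setter here;
    // existing callers that never set it keep compiling unchanged.
    private boolean includeAllDimensions = false;

    // Name taken from the review thread; the String-only signature is an assumption.
    Builder setDefaultSchemaDimensions(List<String> names)
    {
      this.dimensionNames = new ArrayList<>(names);
      return this;
    }

    // Assumed setter, for illustration only.
    Builder setDimensionExclusions(List<String> exclusions)
    {
      this.dimensionExclusions = new ArrayList<>(exclusions);
      return this;
    }

    // Assumed setter, for illustration only.
    Builder setIncludeAllDimensions(boolean includeAllDimensions)
    {
      this.includeAllDimensions = includeAllDimensions;
      return this;
    }

    DimensionsSpecSketch build()
    {
      return new DimensionsSpecSketch(this);
    }
  }
}
```

A test written as `DimensionsSpecSketch.builder().setDefaultSchemaDimensions(List.of("dim")).build()` would not need to change if another option were added later, which is the maintenance benefit the PR description mentions.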

kfaraz · 2022-02-23T09:20:24Z

core/src/main/java/org/apache/druid/data/input/impl/DimensionsSpec.java

      @JsonProperty("dimensions") List<DimensionSchema> dimensions,
      @JsonProperty("dimensionExclusions") List<String> dimensionExclusions,
-      @Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions
+      @Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions,
+      @JsonProperty("includeAllDimensions") boolean includeAllDimensions


Nit: I forget the Jackson behaviour, but does having a primitive boolean here break deserialization if the includeAllDimensions flag is absent?

boolean variables are initialized to false when they are missing.

kfaraz · 2022-02-23T09:34:56Z

docs/ingestion/ingestion-spec.md

+| dimensions           | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details.                           | `[]`    |
+| dimensionExclusions  | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details.                                                                                                                                        | `[]`    |
+| spatialDimensions    | An array of [spatial dimensions](../development/geo.md).                                                                                                                                                                                                                                                                                                                                                                                                     | `[]`    |
+| includeAllDimensions | When you use a [`flattenSpec`](./data-formats.html#flattenspec), you can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that `flattenSpec` provides. If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, only the dimensions that `flattenSpec` provides will be ingested. | false   |


Nit: should we also mention the order in which the explicit and implicit dimensions would be added?

I updated the doc to clarify the dimension order.

kfaraz

LGTM 👍🏻

jihoonson · 2022-02-26T02:27:42Z

@kfaraz thanks for your review!

jihoonson added 2 commits February 22, 2022 22:48

includeAllDimensions in dimensionsSpec

6830948

doc

9cb341d

jihoonson added the Area - Ingestion label Feb 23, 2022

kfaraz reviewed Feb 23, 2022

address comments

575fae5

kfaraz approved these changes Feb 24, 2022

unused import and doc spelling

1eef130

jihoonson merged commit e5ad862 into apache:master Feb 26, 2022

jihoonson mentioned this pull request Apr 12, 2022

Fix indexMerger to respect the includeAllDimensions flag #12428

Merged

3 tasks

abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022

abhishekagarwal87 mentioned this pull request May 25, 2022

[Draft] 0.23.0 Release notes #12510

Closed
