-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A new includeAllDimension flag for dimensionsSpec #12276
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, @jihoonson .
The DimensionsSpec.Builder
and DimensionsSpec.EMPTY
shorthand make the code much more readable.
Overall LGTM, have added some minor comments.
public class MapInputRowParserTest | ||
{ | ||
@Rule | ||
public ExpectedException expectedException = ExpectedException.none(); | ||
|
||
private final TimestampSpec timestampSpec = new TimestampSpec("time", null, null); | ||
private final List<String> dimensions = ImmutableList.of("dim"); | ||
private final Set<String> dimensionExclusions = ImmutableSet.of(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the behaviour of dimensionExclusions
changing too?
This set was originally initialized to empty here and the "time"
was being sent only in theMap
.
But with this change, the dimensionExclusions
now contain the "time"
column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior hasn't changed. I made the test more realistic since the timestamp field name is always in dimensionExclusions
in production. See https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/segment/indexing/DataSchema.java#L162-L177.
@@ -69,29 +66,32 @@ public static InputRow parse(InputRowSchema inputRowSchema, Map<String, Object> | |||
return parse(inputRowSchema.getTimestampSpec(), inputRowSchema.getDimensionsSpec(), theMap); | |||
} | |||
|
|||
private static InputRow parse( | |||
TimestampSpec timestampSpec, | |||
private static List<String> findDimensions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some comment here on
- how we are using the flag and fields to include/exclude dimensions
- what is
theMap
expected to contain (I guess we could use a better name for this field too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added javadoc and renamed theMap
to rawInputRow
.
@@ -91,9 +91,7 @@ | |||
new JSONParseSpec( | |||
new TimestampSpec(TIME_COLUMN, "auto", null), | |||
new DimensionsSpec( | |||
DimensionsSpec.getDefaultSchemas(Arrays.asList(DIMENSIONS)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensionNames))
seems like a very frequent construct (atleast in tests).
Should we just a List<String>
constructor to DimensionsSpec
?
Or maybe a static DimensionsSpec.forDimensionNames(dimensionNames)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could have done it, but added something similar for the builder instead (Builder.setDefaultSchemaDimensions()
) to encourage people to use the builder instead. Does this make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense.
@JsonProperty("dimensions") List<DimensionSchema> dimensions, | ||
@JsonProperty("dimensionExclusions") List<String> dimensionExclusions, | ||
@Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions | ||
@Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions, | ||
@JsonProperty("includeAllDimensions") boolean includeAllDimensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I forget the jackson behaviour, but does having a primitive boolean
here break the deserialization if the flag includeAllDimensions
is absent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
boolean
variables are initialized to false when they are missing.
docs/ingestion/ingestion-spec.md
Outdated
| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` | | ||
| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` | | ||
| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` | | ||
| includeAllDimensions | When you use a [`flattenSpec`](./data-formats.html#flattenspec), you can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that `flattenSpec` provides. If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, only the dimensions that `flattenSpec` provides will be ingested. | false | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should we also mention the order in which the explicit and implicit dimensions would be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the doc to clarify the dimension order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍🏻
@kfaraz thanks for your review! |
Description
Today,
dimensionsSpec
has two modes withflattenSpec
as below.dimensions
is set indimensionsSpec
, only those explicit dimensions are ingested.dimensions is not set, all dimensions that
flattenSpec` provides are ingested.However, there is a missing use case that I want to ingest both the dimensions in
dimensionsSpec
and other dimensions discovered usingflattenSpec
during ingestion. To address this, this PR adds a new flag,includeAllDimensions
indimensionsSpec
.MapInputRowParser
will put all explicit dimensions first inInputRow
and then any other dimensions found in input data.To reviewers, my apologies for an invasive PR. The number of files is huge, but they are mostly unit tests as tons of them are creating
dimensionsSpec
. To avoid such an invasive change in the future, I added a builder fordimensionsSpec
and fixed unit tests to use the builder instead.Key changed/added classes in this PR
DimensionsSpec
MapInputRowParser
This PR has: