Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new includeAllDimension flag for dimensionsSpec #12276

Merged
merged 4 commits into from
Feb 26, 2022

Conversation

jihoonson
Copy link
Contributor

@jihoonson jihoonson commented Feb 23, 2022

Description

Today, dimensionsSpec has two modes with flattenSpec as below.

  1. If dimensions is set in dimensionsSpec, only those explicit dimensions are ingested.
  2. If dimensions is not set, all dimensions that flattenSpec` provides are ingested.

However, there is a missing use case that I want to ingest both the dimensions in dimensionsSpec and other dimensions discovered using flattenSpec during ingestion. To address this, this PR adds a new flag, includeAllDimensions in dimensionsSpec. MapInputRowParser will put all explicit dimensions first in InputRow and then any other dimensions found in input data.

To reviewers, my apologies for an invasive PR. The number of files is huge, but they are mostly unit tests as tons of them are creating dimensionsSpec. To avoid such an invasive change in the future, I added a builder for dimensionsSpec and fixed unit tests to use the builder instead.


Key changed/added classes in this PR
  • DimensionsSpec
  • MapInputRowParser

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, @jihoonson .
The DimensionsSpec.Builder and DimensionsSpec.EMPTY shorthand make the code much more readable.

Overall LGTM, have added some minor comments.

public class MapInputRowParserTest
{
@Rule
public ExpectedException expectedException = ExpectedException.none();

private final TimestampSpec timestampSpec = new TimestampSpec("time", null, null);
private final List<String> dimensions = ImmutableList.of("dim");
private final Set<String> dimensionExclusions = ImmutableSet.of();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the behaviour of dimensionExclusions changing too?
This set was originally initialized to empty here and the "time" was being sent only in theMap.
But with this change, the dimensionExclusions now contain the "time" column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior hasn't changed. I made the test more realistic since the timestamp field name is always in dimensionExclusions in production. See https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/segment/indexing/DataSchema.java#L162-L177.

@@ -69,29 +66,32 @@ public static InputRow parse(InputRowSchema inputRowSchema, Map<String, Object>
return parse(inputRowSchema.getTimestampSpec(), inputRowSchema.getDimensionsSpec(), theMap);
}

private static InputRow parse(
TimestampSpec timestampSpec,
private static List<String> findDimensions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some comment here on

  • how we are using the flag and fields to include/exclude dimensions
  • what is theMap expected to contain (I guess we could use a better name for this field too)

Copy link
Contributor Author

@jihoonson jihoonson Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added javadoc and renamed theMap to rawInputRow.

@@ -91,9 +91,7 @@
new JSONParseSpec(
new TimestampSpec(TIME_COLUMN, "auto", null),
new DimensionsSpec(
DimensionsSpec.getDefaultSchemas(Arrays.asList(DIMENSIONS)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensionNames))
seems like a very frequent construct (atleast in tests).
Should we just a List<String> constructor to DimensionsSpec?
Or maybe a static DimensionsSpec.forDimensionNames(dimensionNames)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have done it, but added something similar for the builder instead (Builder.setDefaultSchemaDimensions()) to encourage people to use the builder instead. Does this make sense to you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense.

@JsonProperty("dimensions") List<DimensionSchema> dimensions,
@JsonProperty("dimensionExclusions") List<String> dimensionExclusions,
@Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions
@Deprecated @JsonProperty("spatialDimensions") List<SpatialDimensionSchema> spatialDimensions,
@JsonProperty("includeAllDimensions") boolean includeAllDimensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I forget the jackson behaviour, but does having a primitive boolean here break the deserialization if the flag includeAllDimensions is absent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boolean variables are initialized to false when they are missing.

| dimensions | A list of [dimension names or objects](#dimension-objects). Cannot have the same column in both `dimensions` and `dimensionExclusions`.<br><br>If this and `spatialDimensions` are both null or empty arrays, Druid will treat all non-timestamp, non-metric columns that do not appear in `dimensionExclusions` as String-typed dimension columns. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
| dimensionExclusions | The names of dimensions to exclude from ingestion. Only names are supported here, not objects.<br><br>This list is only used if the `dimensions` and `spatialDimensions` lists are both null or empty arrays; otherwise it is ignored. See [inclusions and exclusions](#inclusions-and-exclusions) below for details. | `[]` |
| spatialDimensions | An array of [spatial dimensions](../development/geo.md). | `[]` |
| includeAllDimensions | When you use a [`flattenSpec`](./data-formats.html#flattenspec), you can set `includeAllDimensions` to true to ingest both explicit dimensions in the `dimensions` field and other dimensions that `flattenSpec` provides. If this is not set and the `dimensions` field is not empty, Druid will ingest only explicit dimensions. If this is not set and the `dimensions` field is empty, only the dimensions that `flattenSpec` provides will be ingested. | false |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we also mention the order in which the explicit and implicit dimensions would be added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the doc to clarify the dimension order.

Copy link
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍🏻

@jihoonson
Copy link
Contributor Author

@kfaraz thanks for your review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants