-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve dimension order across indexes during ingestion #2006
Conversation
indexSchema, | ||
tuningConfig.getRowFlushBoundary() | ||
); | ||
|
||
if(oldIndex != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loadDimensionOrder looks a bit hacky way to do the initialization of dimensionsOrder
can we pass in the schema with the correct dimensionsOrder ? ,
e.g newIndex = new OnheapIncrementalIndex(schema.withDimensionsSpec(dimensionsSpec with custom order from oldIndex),... )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered implementing it that way initially, but it would require a change to druid-api.
The presence of a dimensions list within the DimensionsSpec currently determines the result of hasCustomDimensions(), which controls if schema-less ingestion is used.
Perhaps there should be a later PR that changes the DimensionsSpec semantics so that the user can specify a dimensions list but still have schema-less dimension discovery enabled. Then the index initialization could be changed to use DimensionsSpec to initialize the dimensions order; this would also give the user some control over the dimension ordering during schema-less ingestion.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to have ability to specify ordering for a subset of dimensions which are known prior to ingestion and rest of the dimensions can be discovered during ingestion,
can you submit a github issue to track this and add a comment to the issue here in code indicating that it can be cleaned up once the issue is resolved ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nishantmonu51 will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c7c3a79
to
d6e1693
Compare
@@ -624,6 +624,16 @@ public void convertV8toV9(File v8Dir, File v9Dir, IndexSpec indexSpec) | |||
if (rowValue.size() > 1) { | |||
onlyOneValue = false; | |||
} | |||
|
|||
if (rowValue.size() == 1) { | |||
if(rowValue.get(0) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (rowValue.size() == 1 && rowValue.get(0) == ) {
..
} ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@himanshug merged the two if statements
besides minor comments, looks good to me overall. however, this might increase the size of segments for some users. |
@@ -385,6 +385,9 @@ public IndexedInts seek(String value) | |||
return new EmptyIndexedInts(); | |||
} | |||
if (currVal == null) { | |||
if(currIndex == dimSet.size()) { | |||
return new EmptyIndexedInts(); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@himanshug It handles the case where the only entry in dimSet is the null string; the seek needs to stop and return if the dimSet has been exhausted. The index == size check is there to distinguish between the initial state of the seeker and when it has retrieved the null string's entry (in both cases currVal is null)
eebb452
to
b0fa2f3
Compare
@@ -520,7 +530,7 @@ protected void reduce( | |||
int runningTotalLineCount = 0; | |||
long startTime = System.currentTimeMillis(); | |||
|
|||
Set<String> allDimensionNames = Sets.newHashSet(); | |||
Set<String> allDimensionNames = Sets.newLinkedHashSet(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This set gets populated by inputRow.getDimensions()
which depends on the order in which rows appear. Are we sue the order gets preserved here? If I ask for [a,b,c]
in my indexing spec, and if my first row only has dimension c
, my second row only has dimension b
, and my third row only has dimension a
, then this set may end up with [c,b,a]
instead of [a,b,c]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xvrl In the case where the user has specified a list of dimensions in the indexing spec, the MapInputRowParser initializes all of the MapBasedInputRows that it creates with the specified dimension list, so inputRow.getDimensions() would return the same order every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jon-wei does it always return the same dimensions for every single row, and is that also the case for other InputRowParsers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xvrl yes, every row returns the same dimensions if the user specifies the dimension list in the ingestion spec. There's only StringInputRowParser and MapInputRowParser which are coupled together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case, why do we need to add the dimensions to the set for every single row?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xvrl for the schemaless case where dimensions are discovered on the fly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jon-wei do we have a test for hadoop indexing that covers both:
- schema-less indexing to make sure that dimension are persisted in the order they were seen
- schema-full indexing where we ensuer the order in which dimensions appear when read is different than the ones specified in the spec, and we test that the persisted order corresponds to the spec ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xvrl The test I added to IndexGeneratorJobTest covers the first case, I'll add another test that covers the second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xvrl Added a test to IndexGeneratorJobTest that checks original order is maintained for schema-full indexing
0abe702
to
76375e6
Compare
@@ -578,6 +579,43 @@ public Integer getDimensionIndex(String dimension) | |||
return dimensionOrder.get(dimension); | |||
} | |||
|
|||
public LinkedHashMap<String, Integer> getDimensionOrder() | |||
{ | |||
return dimensionOrder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
synchronize on dimensionOrder, return a copy? (dimensionOrder is not a thread safe map)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually… we don't need dimensionOrder
at all for this, we can just return dimensions
itself. So this can return a List<String>
that is actually just ImmutableList.copyOf(dimensions)
.
That also makes the loading simpler since you can pass that list directly to the one that takes an Iterable.
I think it should be possible to do this without writing out the null columns. To avoid that we need to be sure that:
|
293cc52
to
1116f4d
Compare
@jon-wei Looks like some more conflicts |
280b9c3
to
37c4b34
Compare
merging, also updating IndexMergerV9 with similar logic (still WIP) |
3a501fb
to
c05058c
Compare
Finished the merge and updated IndexMergerV9. I've also updated IndexMergerV9Test such that the use of V9/legacy IndexMerger is controlled by a test parameter since the tests are identical to those in IndexMergerTest (now deprecated in the PR). |
@jon-wei is this a blocker for 0.9.0? |
@drcrallen I don't think it's a blocker but I think we should try to get it in. |
@@ -644,7 +644,7 @@ private void mergeIndexesAndWriteColumns( | |||
if (dimensionSkipFlag.get(i)) { | |||
continue; | |||
} | |||
if (dims[i] == null || dims[i].length == 0) { | |||
if (dims[i] == null || dims[i].length == 0 || dims[i][0] == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is dim val 0 guaranteed to be a null/emptystr?
Even if it is, this last check should be dims[i].length == 1 && dims[i][0] == 0
as the dim value could potentially be an empty str + something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it is guaranteed due to the block below. but, the other check should still be modified, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm added length == 1 check
@jon-wei new changes look good other than the one comment I had |
c05058c
to
cefb536
Compare
cefb536
to
747343e
Compare
rebased again |
👍 |
Preserve dimension order across indexes during ingestion
Sorry for late incursion but this makes invalid cardinality for segment meta query. If this could not be fixed we might rollback this. |
@navis Can you elaborate on the issue you are seeing? |
with druid.sample.tsv, before
after
|
I can see |
Partially addresses issue #658
Related to druid-io/druid-api#68
This changes the RealtimePlumber and IndexGeneratorJob so that when an index is persisted, the next index created will inherit the dimension order built up by the previous index.
This is to better support arbitrary dimension orders when schema-less ingestion is used.
When merging indexes, the index merger will now attempt to find the largest common dimension ordering across the indexes, e.g.:
Index 1 dims: DimA, DimB
Index 2 dims: DimC
Index 3 dims: DimC, DimA, DimB
The ordering that will be used will be DimC, DimA, DimB as that encompasses all of the shorter orderings.
If no common order is found, the merger will fall back to the original lexicographic dimension ordering.
To support this change, the persisted indexes now include null dimensions and the dimension encoding dictionaries will now always have an entry for the empty string. This is done so that previously seen dimensions are not dropped in case an index does not receive any events with values for that dimension.
LIMITATIONS: