Preserve dimension order across indexes during ingestion #2006

jon-wei · 2015-11-24T02:53:39Z

Partially addresses issue #658

This changes the RealtimePlumber and IndexGeneratorJob so that when an index is persisted, the next index created will inherit the dimension order built up by the previous index.

This is to better support arbitrary dimension orders when schema-less ingestion is used.

When merging indexes, the index merger will now attempt to find the largest common dimension ordering across the indexes, e.g.:

Index 1 dims: DimA, DimB
Index 2 dims: DimC
Index 3 dims: DimC, DimA, DimB

The ordering that will be used will be DimC, DimA, DimB as that encompasses all of the shorter orderings.

If no common order is found, the merger will fall back to the original lexicographic dimension ordering.
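For illustration, the merge rule described above can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the class and method names (`SharedDimOrderSketch`, `mergedOrder`, `isSubsequence`) are hypothetical, and it assumes "encompasses" means that every shorter ordering appears, in the same relative order, inside the longest one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the "largest common ordering, else lexicographic" rule.
final class SharedDimOrderSketch
{
  // True if every element of `shorter` appears in `longer` in the same relative order.
  static boolean isSubsequence(List<String> shorter, List<String> longer)
  {
    int pos = 0;
    for (String dim : longer) {
      if (pos < shorter.size() && shorter.get(pos).equals(dim)) {
        pos++;
      }
    }
    return pos == shorter.size();
  }

  static List<String> mergedOrder(List<List<String>> perIndexDims)
  {
    // Candidate: the longest ordering seen in any index.
    List<String> longest = new ArrayList<>();
    for (List<String> dims : perIndexDims) {
      if (dims.size() > longest.size()) {
        longest = dims;
      }
    }
    for (List<String> dims : perIndexDims) {
      if (!isSubsequence(dims, longest)) {
        // No common order: fall back to lexicographic ordering of all dimensions.
        TreeSet<String> all = new TreeSet<>();
        for (List<String> d : perIndexDims) {
          all.addAll(d);
        }
        return new ArrayList<>(all);
      }
    }
    return new ArrayList<>(longest);
  }
}
```

With the three indexes from the example, the third ordering contains the other two, so it is used as-is. If one index had seen DimB before DimA, the sketch would fall back to the sorted list.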

To support this change, the persisted indexes now include null dimensions and the dimension encoding dictionaries will now always have an entry for the empty string. This is done so that previously seen dimensions are not dropped in case an index does not receive any events with values for that dimension.

LIMITATIONS:

Dimension order can be passed between the FireHydrants within a Sink, but not across Sinks.
Dimension orders can be passed between indexes within a single partition during batch ingestion, but not across partitions.

nishantmonu51 · 2015-11-24T16:34:39Z

indexing-hadoop/src/main/java/io/druid/indexer/IndexGeneratorJob.java

        indexSchema,
        tuningConfig.getRowFlushBoundary()
    );
+
+    if(oldIndex != null) {


loadDimensionOrder looks like a somewhat hacky way to initialize dimensionsOrder.
Can we pass in the schema with the correct dimensionsOrder instead?
e.g. newIndex = new OnheapIncrementalIndex(schema.withDimensionsSpec(dimensionsSpec with custom order from oldIndex), ...)

@nishantmonu51

I considered implementing it that way initially, but it would require a change to druid-api.

The presence of a dimensions list within the DimensionsSpec currently determines the result of hasCustomDimensions(), which controls if schema-less ingestion is used.

Perhaps there should be a later PR that changes the DimensionsSpec semantics so that the user can specify a dimensions list but still have schema-less dimension discovery enabled. Then the index initialization could be changed to use DimensionsSpec to initialize the dimensions order; this would also give the user some control over the dimension ordering during schema-less ingestion.

Thoughts?

I think it would be nice to be able to specify an ordering for the subset of dimensions that are known prior to ingestion, with the rest of the dimensions discovered during ingestion.
Can you submit a GitHub issue to track this, and add a comment in the code here that references the issue and notes that this can be cleaned up once the issue is resolved?

@nishantmonu51 will do

himanshug · 2015-12-03T06:10:16Z

processing/src/main/java/io/druid/segment/IndexIO.java

@@ -624,6 +624,16 @@ public void convertV8toV9(File v8Dir, File v9Dir, IndexSpec indexSpec)
            if (rowValue.size() > 1) {
              onlyOneValue = false;
            }
+
+            if (rowValue.size() == 1) {
+              if(rowValue.get(0) == 0) {


if (rowValue.size() == 1 && rowValue.get(0) == 0) {
..
} ?

@himanshug merged the two if statements

himanshug · 2015-12-04T06:04:16Z

besides minor comments, looks good to me overall. however, this might increase the size of segments for some users.

himanshug · 2015-12-04T06:26:40Z

processing/src/main/java/io/druid/segment/QueryableIndexIndexableAdapter.java

@@ -385,6 +385,9 @@ public IndexedInts seek(String value)
          return new EmptyIndexedInts();
        }
        if (currVal == null) {
+          if(currIndex == dimSet.size()) {
+            return new EmptyIndexedInts();
+          }


why is this change needed?

@himanshug It handles the case where the only entry in dimSet is the null string; the seek needs to stop and return if the dimSet has been exhausted. The index == size check is there to distinguish between the initial state of the seeker and when it has retrieved the null string's entry (in both cases currVal is null)

xvrl · 2015-12-08T02:14:12Z

indexing-hadoop/src/main/java/io/druid/indexer/IndexGeneratorJob.java

@@ -520,7 +530,7 @@ protected void reduce(
        int runningTotalLineCount = 0;
        long startTime = System.currentTimeMillis();

-        Set<String> allDimensionNames = Sets.newHashSet();
+        Set<String> allDimensionNames = Sets.newLinkedHashSet();


This set gets populated by inputRow.getDimensions(), which depends on the order in which rows appear. Are we sure the order gets preserved here? If I ask for [a,b,c] in my indexing spec, and my first row only has dimension c, my second row only has dimension b, and my third row only has dimension a, then this set may end up with [c,b,a] instead of [a,b,c].

@xvrl In the case where the user has specified a list of dimensions in the indexing spec, the MapInputRowParser initializes all of the MapBasedInputRows that it creates with the specified dimension list, so inputRow.getDimensions() would return the same order every time.

@jon-wei does it always return the same dimensions for every single row, and is that also the case for other InputRowParsers?

@xvrl yes, every row returns the same dimensions if the user specifies the dimension list in the ingestion spec. There's only StringInputRowParser and MapInputRowParser which are coupled together.

in that case, why do we need to add the dimensions to the set for every single row?

@xvrl for the schemaless case where dimensions are discovered on the fly
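To make the two cases in this exchange concrete, here is a small sketch of how a `LinkedHashSet` accumulates dimension names across rows. The class and method names are hypothetical, and each row is reduced to just its dimension list; it is not the reducer code from IndexGeneratorJob.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: accumulate dimension names in first-seen order.
final class DiscoveredDimOrderSketch
{
  static List<String> discover(List<List<String>> dimsPerRow)
  {
    // LinkedHashSet keeps insertion order and ignores re-insertions,
    // so the result is the order in which dimensions were first seen.
    Set<String> allDimensionNames = new LinkedHashSet<>();
    for (List<String> rowDims : dimsPerRow) {
      allDimensionNames.addAll(rowDims);
    }
    return new ArrayList<>(allDimensionNames);
  }
}
```

If every row reports the full spec order [a, b, c], the set ends up as [a, b, c] no matter which values are present. If rows report only the dimensions they contain (the schema-less case), rows [c], [b], [a] yield [c, b, a], which is the first-seen order.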

@jon-wei do we have a test for hadoop indexing that covers both:

schema-less indexing, to make sure that dimensions are persisted in the order they were seen

schema-full indexing, where we ensure the order in which dimensions appear when read is different from the one specified in the spec, and we test that the persisted order corresponds to the spec?

@xvrl The test I added to IndexGeneratorJobTest covers the first case, I'll add another test that covers the second

@xvrl Added a test to IndexGeneratorJobTest that checks original order is maintained for schema-full indexing

gianm · 2015-12-11T19:25:58Z

processing/src/main/java/io/druid/segment/incremental/IncrementalIndex.java

@@ -578,6 +579,43 @@ public Integer getDimensionIndex(String dimension)
    return dimensionOrder.get(dimension);
  }

+  public LinkedHashMap<String, Integer> getDimensionOrder()
+  {
+    return dimensionOrder;


synchronize on dimensionOrder, return a copy? (dimensionOrder is not a thread safe map)

Actually… we don't need dimensionOrder at all for this, we can just return dimensions itself. So this can return a List<String> that is actually just ImmutableList.copyOf(dimensions).

That also makes the loading simpler since you can pass that list directly to the one that takes an Iterable.
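A rough sketch of the copy-on-read getter suggested here, using only JDK collections so it stands alone (the suggestion above uses Guava's ImmutableList.copyOf, which plays the same role). The holder class and `addDimension` are hypothetical stand-ins, not IncrementalIndex itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the part of an index that tracks dimension order.
final class DimensionHolderSketch
{
  private final List<String> dimensions = new ArrayList<>();

  void addDimension(String dim)
  {
    synchronized (dimensions) {
      if (!dimensions.contains(dim)) {
        dimensions.add(dim);
      }
    }
  }

  // Returns an unmodifiable snapshot; later additions do not affect it.
  List<String> getDimensionOrder()
  {
    synchronized (dimensions) {
      return Collections.unmodifiableList(new ArrayList<>(dimensions));
    }
  }
}
```

The snapshot can then be handed to whatever initializes the next index without exposing the mutable internal list to other threads.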

gianm · 2015-12-11T20:47:32Z

I think it should be possible to do this without writing out the null columns. To avoid that we need to be sure that:

Whenever we merge indexes, the dimension orders of the individual indexes (some of which might be missing some dimensions) should be merged such that if any dimension A appears before a dimension B in all indexes, it must also appear before B in the final order.
Whenever we create a new index, it's provided with an order that is based on all prior indexes, not just the immediately previous one.
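One possible way to get the first property is a topological merge of the per-index orders. The sketch below is an assumption about how that could look, not something from this PR: the names are hypothetical, and it returns null when the orders contradict each other so that a caller could fall back to some other ordering.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: merge partial dimension orders with Kahn's algorithm.
final class OrderPreservingMergeSketch
{
  // Returns a merged order that respects every input order, or null on conflict.
  static List<String> merge(List<List<String>> orders)
  {
    Map<String, Set<String>> successors = new LinkedHashMap<>();
    Map<String, Integer> inDegree = new LinkedHashMap<>();
    for (List<String> order : orders) {
      for (int i = 0; i < order.size(); i++) {
        String dim = order.get(i);
        successors.computeIfAbsent(dim, k -> new LinkedHashSet<>());
        inDegree.putIfAbsent(dim, 0);
        // Record "previous dimension comes before this one" once per distinct pair.
        if (i > 0 && successors.get(order.get(i - 1)).add(dim)) {
          inDegree.merge(dim, 1, Integer::sum);
        }
      }
    }
    Deque<String> ready = new ArrayDeque<>();
    for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
      if (e.getValue() == 0) {
        ready.add(e.getKey());
      }
    }
    List<String> merged = new ArrayList<>();
    while (!ready.isEmpty()) {
      String dim = ready.poll();
      merged.add(dim);
      for (String next : successors.get(dim)) {
        if (inDegree.merge(next, -1, Integer::sum) == 0) {
          ready.add(next);
        }
      }
    }
    // A leftover dimension means the input orders form a cycle (they disagree).
    return merged.size() == inDegree.size() ? merged : null;
  }
}
```

Because every adjacent pair in every input becomes an edge, each index's full order is preserved in the result whenever the inputs are mutually consistent, even if no single index has seen all dimensions.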

fjy · 2016-01-16T19:40:19Z

@jon-wei Looks like some more conflicts

jon-wei · 2016-01-18T21:41:40Z

merging, also updating IndexMergerV9 with similar logic (still WIP)

jon-wei · 2016-01-19T02:36:44Z

Finished the merge and updated IndexMergerV9.

I've also updated IndexMergerV9Test such that the use of V9/legacy IndexMerger is controlled by a test parameter since the tests are identical to those in IndexMergerTest (now deprecated in the PR).

drcrallen · 2016-01-19T03:24:32Z

@jon-wei is this a blocker for 0.9.0?

gianm · 2016-01-19T03:57:16Z

@drcrallen I don't think it's a blocker but I think we should try to get it in.

gianm · 2016-01-19T19:19:47Z

processing/src/main/java/io/druid/segment/IndexMergerV9.java

@@ -644,7 +644,7 @@ private void mergeIndexesAndWriteColumns(
        if (dimensionSkipFlag.get(i)) {
          continue;
        }
-        if (dims[i] == null || dims[i].length == 0) {
+        if (dims[i] == null || dims[i].length == 0 || dims[i][0] == 0) {


Is dim val 0 guaranteed to be a null/emptystr?

Even if it is, this last check should be dims[i].length == 1 && dims[i][0] == 0 as the dim value could potentially be an empty str + something else.

Looks like it is guaranteed due to the block below. but, the other check should still be modified, I think.

@gianm added length == 1 check
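The condition after this fix can be read as a small predicate. This is a sketch with a hypothetical helper name, and it assumes (as discussed above) that dictionary id 0 is the null/empty-string entry.

```java
// Hypothetical helper mirroring the corrected condition in the diff above.
final class NullRowCheckSketch
{
  // True when a row has no real value for the dimension:
  // missing, empty, or exactly one value whose dictionary id is 0 (null/empty string).
  static boolean isNullOnly(int[] dimValues)
  {
    return dimValues == null
           || dimValues.length == 0
           || (dimValues.length == 1 && dimValues[0] == 0);
  }
}
```

Without the length check, a multi-value row such as {0, 3} (empty string plus another value) would be treated as null even though it carries a real value.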

gianm · 2016-01-19T19:48:43Z

@jon-wei new changes look good other than the one comment I had

jon-wei · 2016-01-19T21:34:40Z

rebased again

gianm · 2016-01-19T22:35:41Z

👍

navis · 2016-01-20T00:56:13Z

Sorry for the late interruption, but this change produces invalid cardinality in segment metadata query results. If this cannot be fixed, we might need to roll this back.

jon-wei · 2016-01-20T01:17:06Z

@navis Can you elaborate on the issue you are seeing?

navis · 2016-01-20T01:44:52Z

with druid.sample.tsv,

before

io.druid.segment.IndexMerger - Starting dimension[market] with cardinality[3]
io.druid.segment.IndexMerger - Starting dimension[null_column] with cardinality[0]
io.druid.segment.IndexMerger - Starting dimension[partial_null_column] with cardinality[2]
io.druid.segment.IndexMerger - Starting dimension[placement] with cardinality[1]
io.druid.segment.IndexMerger - Starting dimension[placementish] with cardinality[9]
io.druid.segment.IndexMerger - Starting dimension[quality] with cardinality[9]

after

io.druid.segment.IndexMerger - Starting dimension[market] with cardinality[4]
io.druid.segment.IndexMerger - Starting dimension[null_column] with cardinality[0]
io.druid.segment.IndexMerger - Starting dimension[partial_null_column] with cardinality[2]
io.druid.segment.IndexMerger - Starting dimension[placement] with cardinality[2]
io.druid.segment.IndexMerger - Starting dimension[placementish] with cardinality[10]
io.druid.segment.IndexMerger - Starting dimension[quality] with cardinality[10]

navis · 2016-01-20T01:46:59Z

I can see that null is added to every dimension that has values, but this can cause problems on the user side.
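A toy model of the pattern in the log above, for illustration only. It assumes that the empty string is added to the dictionary of any dimension that has at least one value, which matches the observed numbers but is not taken from the merger code; the class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical model of dictionary cardinality before and after the change.
final class NullEntryCardinalitySketch
{
  static int cardinalityBefore(Collection<String> values)
  {
    return new TreeSet<>(values).size();
  }

  static int cardinalityAfter(Collection<String> values)
  {
    TreeSet<String> dictionary = new TreeSet<>(values);
    // Assumption: only dimensions with at least one value gain the "" entry.
    if (!dictionary.isEmpty()) {
      dictionary.add("");
    }
    return dictionary.size();
  }
}
```

Under this model a dimension with three distinct non-empty values goes from 3 to 4, a dimension that already contains the empty string is unchanged, and a dimension with no values stays at 0, which is the same shape as the market, partial_null_column, and null_column lines in the log.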

jon-wei mentioned this pull request Nov 24, 2015

Remove lexicographic sorting of dimensions in DimensionsSpec druid-io/druid-api#68

Merged

nishantmonu51 reviewed Nov 24, 2015

jon-wei force-pushed the inherit_dim_order branch from c7c3a79 to d6e1693 Compare November 25, 2015 17:39

himanshug reviewed Dec 3, 2015

himanshug reviewed Dec 4, 2015

jon-wei force-pushed the inherit_dim_order branch from eebb452 to b0fa2f3 Compare December 4, 2015 20:41

jon-wei closed this Dec 7, 2015

jon-wei reopened this Dec 7, 2015

jon-wei closed this Dec 7, 2015

jon-wei reopened this Dec 7, 2015

xvrl reviewed Dec 8, 2015

jon-wei force-pushed the inherit_dim_order branch 3 times, most recently from 0abe702 to 76375e6 Compare December 10, 2015 02:34

gianm reviewed Dec 11, 2015

jon-wei force-pushed the inherit_dim_order branch from 293cc52 to 1116f4d Compare December 14, 2015 22:20

jon-wei closed this Dec 14, 2015

jon-wei reopened this Dec 14, 2015

fjy added this to the 0.9.0 milestone Jan 16, 2016

jon-wei force-pushed the inherit_dim_order branch from 280b9c3 to 37c4b34 Compare January 18, 2016 21:40

jon-wei force-pushed the inherit_dim_order branch 3 times, most recently from 3a501fb to c05058c Compare January 19, 2016 02:29

gianm reviewed Jan 19, 2016

jon-wei force-pushed the inherit_dim_order branch from c05058c to cefb536 Compare January 19, 2016 21:04

Preserve dimension order across indexes during ingestion

747343e

jon-wei force-pushed the inherit_dim_order branch from cefb536 to 747343e Compare January 19, 2016 21:34

fjy added a commit that referenced this pull request Jan 19, 2016

Merge pull request #2006 from jon-wei/inherit_dim_order

cb8f714

Preserve dimension order across indexes during ingestion

fjy merged commit cb8f714 into apache:master Jan 19, 2016

This was referenced Jan 20, 2016

Support multi-dimensional filters #2217

Closed

Support min/max values for metadata query #2208

Merged

jon-wei mentioned this pull request Jan 20, 2016

More specific null/empty str handling in IndexMerger #2306

Merged

gianm mentioned this pull request Jan 21, 2016

Indexer does not merge properly when dimensions are provided in non-lexicographic order #658

Closed

fjy mentioned this pull request Feb 5, 2016

druid-0.9.0 release notes #2404

Closed

jon-wei added the Improvement label Mar 31, 2016

jon-wei deleted the inherit_dim_order branch October 6, 2017 22:21
