Preserve dimension order across indexes during ingestion #2006

Merged (1 commit) on Jan 19, 2016

@jon-wei jon-wei commented Nov 24, 2015

Partially addresses issue #658

Related to druid-io/druid-api#68

This changes the RealtimePlumber and IndexGeneratorJob so that when an index is persisted, the next index created will inherit the dimension order built up by the previous index.

This is to better support arbitrary dimension orders when schema-less ingestion is used.

When merging indexes, the index merger will now attempt to find the largest common dimension ordering across the indexes, e.g.:

Index 1 dims: DimA, DimB
Index 2 dims: DimC
Index 3 dims: DimC, DimA, DimB

The ordering that will be used will be DimC, DimA, DimB as that encompasses all of the shorter orderings.

If no common order is found, the merger will fall back to the original lexicographic dimension ordering.
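The rule above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: it assumes each index is reduced to a plain list of dimension names, and the class and method names (`DimensionOrderSketch`, `longestSharedOrder`, `mergedOrder`) are hypothetical. The longest list is taken as the candidate and accepted only if every other list appears inside it in the same relative order; otherwise the dimensions are sorted lexicographically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: each index is represented only by its list of dimension names.
class DimensionOrderSketch {

    // Returns a copy of the longest ordering if every other ordering is an
    // in-order subsequence of it, or null if no such shared ordering exists.
    public static List<String> longestSharedOrder(List<List<String>> orderings) {
        List<String> candidate = null;
        for (List<String> order : orderings) {
            if (candidate == null || order.size() > candidate.size()) {
                candidate = order;
            }
        }
        if (candidate == null || candidate.isEmpty()) {
            return null;
        }
        for (List<String> order : orderings) {
            if (!isSubsequence(order, candidate)) {
                return null;
            }
        }
        return new ArrayList<>(candidate);
    }

    // True if the elements of 'shorter' occur in 'longer' in the same relative order.
    public static boolean isSubsequence(List<String> shorter, List<String> longer) {
        Iterator<String> it = longer.iterator();
        for (String dim : shorter) {
            boolean found = false;
            while (it.hasNext()) {
                if (dim.equals(it.next())) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // Uses the shared ordering when one exists, else falls back to lexicographic order.
    public static List<String> mergedOrder(List<List<String>> orderings) {
        List<String> shared = longestSharedOrder(orderings);
        if (shared != null) {
            return shared;
        }
        TreeSet<String> all = new TreeSet<>();
        for (List<String> order : orderings) {
            all.addAll(order);
        }
        return new ArrayList<>(all);
    }

    public static void main(String[] args) {
        List<List<String>> indexes = Arrays.asList(
            Arrays.asList("DimA", "DimB"),
            Arrays.asList("DimC"),
            Arrays.asList("DimC", "DimA", "DimB"));
        System.out.println(mergedOrder(indexes)); // prints [DimC, DimA, DimB]
    }
}
```

With two indexes that disagree, such as `[DimB, DimA]` and `[DimA, DimB]`, `longestSharedOrder` returns null in this sketch and `mergedOrder` yields the sorted list `[DimA, DimB]`.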

To support this change, the persisted indexes now include null dimensions and the dimension encoding dictionaries will now always have an entry for the empty string. This is done so that previously seen dimensions are not dropped in case an index does not receive any events with values for that dimension.

LIMITATIONS:

  • Dimension order can be passed between the FireHydrants within a Sink, but not across Sinks.
  • Dimension orders can be passed between indexes within a single partition during batch ingestion, but not across partitions.

indexSchema,
tuningConfig.getRowFlushBoundary()
);

if(oldIndex != null) {
Member

loadDimensionOrder looks like a somewhat hacky way to initialize dimensionsOrder.
Can we pass in the schema with the correct dimensionsOrder instead?
E.g. newIndex = new OnheapIncrementalIndex(schema.withDimensionsSpec(dimensionsSpec with custom order from oldIndex), ...)

Contributor Author

@nishantmonu51

I considered implementing it that way initially, but it would require a change to druid-api.

The presence of a dimensions list within the DimensionsSpec currently determines the result of hasCustomDimensions(), which controls if schema-less ingestion is used.

Perhaps there should be a later PR that changes the DimensionsSpec semantics so that the user can specify a dimensions list but still have schema-less dimension discovery enabled. Then the index initialization could be changed to use DimensionsSpec to initialize the dimensions order; this would also give the user some control over the dimension ordering during schema-less ingestion.

Thoughts?

Member

I think it would be nice to have the ability to specify the ordering for a subset of dimensions that are known prior to ingestion, with the rest of the dimensions discovered during ingestion.
Can you submit a GitHub issue to track this, and add a comment in the code here that references the issue and notes that this can be cleaned up once the issue is resolved?

Contributor Author

@nishantmonu51 will do

@@ -624,6 +624,16 @@ public void convertV8toV9(File v8Dir, File v9Dir, IndexSpec indexSpec)
if (rowValue.size() > 1) {
onlyOneValue = false;
}

if (rowValue.size() == 1) {
if(rowValue.get(0) == 0) {
Contributor

if (rowValue.size() == 1 && rowValue.get(0) == 0) {
..
} ?

Contributor Author

@himanshug merged the two if statements

@himanshug
Contributor

Besides the minor comments, this looks good to me overall. However, it might increase the size of segments for some users.

@@ -385,6 +385,9 @@ public IndexedInts seek(String value)
return new EmptyIndexedInts();
}
if (currVal == null) {
if(currIndex == dimSet.size()) {
return new EmptyIndexedInts();
}
Contributor

why is this change needed?

Contributor Author

@himanshug It handles the case where the only entry in dimSet is the null string; the seek needs to stop and return if the dimSet has been exhausted. The index == size check is there to distinguish between the initial state of the seeker and when it has retrieved the null string's entry (in both cases currVal is null)

@jon-wei jon-wei closed this Dec 7, 2015
@jon-wei jon-wei reopened this Dec 7, 2015
@jon-wei jon-wei closed this Dec 7, 2015
@jon-wei jon-wei reopened this Dec 7, 2015
@@ -520,7 +530,7 @@ protected void reduce(
int runningTotalLineCount = 0;
long startTime = System.currentTimeMillis();

- Set<String> allDimensionNames = Sets.newHashSet();
+ Set<String> allDimensionNames = Sets.newLinkedHashSet();
Member

This set gets populated by inputRow.getDimensions(), which depends on the order in which rows appear. Are we sure the order gets preserved here? If I ask for [a,b,c] in my indexing spec, and my first row only has dimension c, my second row only has dimension b, and my third row only has dimension a, then this set may end up as [c,b,a] instead of [a,b,c].

Contributor Author

@xvrl In the case where the user has specified a list of dimensions in the indexing spec, the MapInputRowParser initializes all of the MapBasedInputRows that it creates with the specified dimension list, so inputRow.getDimensions() would return the same order every time.

Member

@jon-wei does it always return the same dimensions for every single row, and is that also the case for other InputRowParsers?

Contributor Author

@xvrl Yes, every row returns the same dimensions if the user specifies the dimension list in the ingestion spec. There are only StringInputRowParser and MapInputRowParser, which are coupled together.

Member

in that case, why do we need to add the dimensions to the set for every single row?

Contributor Author

@xvrl for the schemaless case where dimensions are discovered on the fly
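To make the two cases in this thread concrete, here is a small sketch using only the JDK `LinkedHashSet` rather than the Guava `Sets.newLinkedHashSet()` call in the diff. The class and method names are hypothetical, and each row is reduced to the dimension list it reports. It shows that the set keeps first-seen order: when every row reports the full spec list, that order is preserved, and when dimensions are discovered row by row, the result is the discovery order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a "row" is just the list of dimension names it reports.
class DimensionDiscoverySketch {

    // Accumulates dimension names across rows, keeping first-seen order.
    public static List<String> discoverDimensions(List<List<String>> rowDimensions) {
        Set<String> allDimensionNames = new LinkedHashSet<>();
        for (List<String> dims : rowDimensions) {
            allDimensionNames.addAll(dims);
        }
        return new ArrayList<>(allDimensionNames);
    }

    public static void main(String[] args) {
        // Schema-full: every row reports the dimension list from the spec.
        List<List<String>> schemaFull = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("a", "b", "c"));
        System.out.println(discoverDimensions(schemaFull)); // prints [a, b, c]

        // Schema-less: each row reports only the dimensions it contains.
        List<List<String>> schemaLess = Arrays.asList(
            Arrays.asList("c"),
            Arrays.asList("b"),
            Arrays.asList("a"));
        System.out.println(discoverDimensions(schemaLess)); // prints [c, b, a]
    }
}
```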

Member

@jon-wei do we have a test for hadoop indexing that covers both:

  • schema-less indexing, to make sure that dimensions are persisted in the order they were seen
  • schema-full indexing, where we ensure that the order in which dimensions appear when read differs from the order specified in the spec, and we test that the persisted order corresponds to the spec?

Contributor Author

@xvrl The test I added to IndexGeneratorJobTest covers the first case. I'll add another test that covers the second.

Contributor Author

@xvrl Added a test to IndexGeneratorJobTest that checks original order is maintained for schema-full indexing

@jon-wei force-pushed the inherit_dim_order branch 3 times, most recently from 0abe702 to 76375e6, on December 10, 2015 02:34
@@ -578,6 +579,43 @@ public Integer getDimensionIndex(String dimension)
return dimensionOrder.get(dimension);
}

public LinkedHashMap<String, Integer> getDimensionOrder()
{
return dimensionOrder;
Contributor

synchronize on dimensionOrder, return a copy? (dimensionOrder is not a thread safe map)

Contributor

Actually… we don't need dimensionOrder at all for this, we can just return dimensions itself. So this can return a List<String> that is actually just ImmutableList.copyOf(dimensions).

That also makes the loading simpler since you can pass that list directly to the one that takes an Iterable.
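A rough sketch of the "return a copy" idea from the two comments above, written with JDK collections only. The suggestion in the thread is Guava's `ImmutableList.copyOf(dimensions)`; `Collections.unmodifiableList(new ArrayList<>(...))` is swapped in here so the example has no dependencies. The class `DimensionListSketch` and its methods are hypothetical and do not reflect IncrementalIndex's actual fields or locking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for an ordered, growing list of dimension names.
class DimensionListSketch {
    private final List<String> dimensions = new ArrayList<>();

    // Appends a dimension if it has not been seen before.
    public void addDimension(String dimension) {
        synchronized (dimensions) {
            if (!dimensions.contains(dimension)) {
                dimensions.add(dimension);
            }
        }
    }

    // Returns a read-only snapshot, so callers never observe later mutations
    // and cannot modify the internal list.
    public List<String> getDimensionOrder() {
        synchronized (dimensions) {
            return Collections.unmodifiableList(new ArrayList<>(dimensions));
        }
    }
}
```

Because the snapshot is an ordinary `List<String>`, it could be handed directly to whatever initializes the next index with an `Iterable` of dimension names, which is the simplification the comment points at.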

@gianm
Contributor

gianm commented Dec 11, 2015

I think it should be possible to do this without writing out the null columns. To avoid that we need to be sure that:

  • Whenever we merge indexes, the dimension orders of the individual indexes (some of which might be missing some dimensions) should be merged such that if any dimension A appears before a dimension B in all indexes, it must also appear before B in the final order.
  • Whenever we create a new index, it's provided with an order that is based on all prior indexes, not just the immediately previous one.
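One way to read the first requirement is as a topological merge of the per-index orders. The sketch below is an assumption about how that could look, not something proposed in this thread or implemented in this PR; the class `OrderMergeSketch` and method `mergeOrders` are hypothetical. It records "A comes directly before B" edges from every index and emits dimensions with Kahn's algorithm, breaking ties by first-seen order. It is stricter than the rule stated above: any disagreement between indexes is treated as a conflict and returns null, at which point a caller could fall back to lexicographic order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: merge per-index dimension orders with a topological sort.
class OrderMergeSketch {

    // Returns an order in which every index's own order is preserved,
    // or null if the indexes contradict each other (a cycle).
    // Assumes no index lists the same dimension twice.
    public static List<String> mergeOrders(List<List<String>> orderings) {
        Map<String, Set<String>> successors = new LinkedHashMap<>();
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (List<String> order : orderings) {
            for (int i = 0; i < order.size(); i++) {
                String dim = order.get(i);
                successors.putIfAbsent(dim, new LinkedHashSet<>());
                inDegree.putIfAbsent(dim, 0);
                // Add an edge from the previous dimension, counting it only once.
                if (i > 0 && successors.get(order.get(i - 1)).add(dim)) {
                    inDegree.merge(dim, 1, Integer::sum);
                }
            }
        }
        List<String> result = new ArrayList<>();
        Set<String> remaining = new LinkedHashSet<>(successors.keySet());
        while (!remaining.isEmpty()) {
            // Pick the first-seen dimension that has no unplaced predecessor.
            String next = null;
            for (String dim : remaining) {
                if (inDegree.get(dim) == 0) {
                    next = dim;
                    break;
                }
            }
            if (next == null) {
                return null; // conflicting orders
            }
            remaining.remove(next);
            result.add(next);
            for (String succ : successors.get(next)) {
                inDegree.merge(succ, -1, Integer::sum);
            }
        }
        return result;
    }
}
```

For example, indexes with orders `[A, C]` and `[B, C]` merge to `[A, B, C]` here even though neither index contains all three dimensions.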

@fjy fjy added this to the 0.9.0 milestone Jan 16, 2016
@fjy
Contributor

fjy commented Jan 16, 2016

@jon-wei Looks like some more conflicts

@jon-wei
Contributor Author

jon-wei commented Jan 18, 2016

merging, also updating IndexMergerV9 with similar logic (still WIP)

@jon-wei force-pushed the inherit_dim_order branch 3 times, most recently from 3a501fb to c05058c, on January 19, 2016 02:29
@jon-wei
Contributor Author

jon-wei commented Jan 19, 2016

Finished the merge and updated IndexMergerV9.

I've also updated IndexMergerV9Test such that the use of V9/legacy IndexMerger is controlled by a test parameter since the tests are identical to those in IndexMergerTest (now deprecated in the PR).

@drcrallen
Contributor

@jon-wei is this a blocker for 0.9.0?

@gianm
Contributor

gianm commented Jan 19, 2016

@drcrallen I don't think it's a blocker but I think we should try to get it in.

@@ -644,7 +644,7 @@ private void mergeIndexesAndWriteColumns(
if (dimensionSkipFlag.get(i)) {
continue;
}
- if (dims[i] == null || dims[i].length == 0) {
+ if (dims[i] == null || dims[i].length == 0 || dims[i][0] == 0) {
Contributor

Is dim val 0 guaranteed to be a null/emptystr?

Even if it is, this last check should be dims[i].length == 1 && dims[i][0] == 0 as the dim value could potentially be an empty str + something else.

Contributor

Looks like it is guaranteed due to the block below. But the other check should still be modified, I think.

Contributor Author

@gianm added length == 1 check
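For reference, the check discussed in this thread can be sketched as a standalone predicate. This is an illustration only: the helper name `isNullOnly` is hypothetical, and it relies on the assumption stated above that dictionary id 0 is the null/empty-string entry.

```java
// Hypothetical helper illustrating the corrected check on one row's encoded values.
class NullRowCheckSketch {

    // True when the row has no value for the dimension, or only the null/empty
    // entry (assumed to be dictionary id 0). A multi-value row that merely
    // starts with id 0 is not treated as null.
    public static boolean isNullOnly(int[] dimValueIds) {
        return dimValueIds == null
            || dimValueIds.length == 0
            || (dimValueIds.length == 1 && dimValueIds[0] == 0);
    }
}
```

Without the `length == 1` guard, a multi-value row such as `{0, 3}` (empty string plus another value) would be treated as null.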

@gianm
Contributor

gianm commented Jan 19, 2016

@jon-wei new changes look good other than the one comment I had

@jon-wei
Contributor Author

jon-wei commented Jan 19, 2016

rebased again

@gianm
Contributor

gianm commented Jan 19, 2016

👍

fjy added a commit that referenced this pull request Jan 19, 2016
Preserve dimension order across indexes during ingestion
@fjy fjy merged commit cb8f714 into apache:master Jan 19, 2016
@navis
Contributor

navis commented Jan 20, 2016

Sorry for the late comment, but this produces invalid cardinality in segment metadata queries. If this cannot be fixed, we might need to roll this back.

@jon-wei
Contributor Author

jon-wei commented Jan 20, 2016

@navis Can you elaborate on the issue you are seeing?

@navis
Contributor

navis commented Jan 20, 2016

with druid.sample.tsv,

before

io.druid.segment.IndexMerger - Starting dimension[market] with cardinality[3]
io.druid.segment.IndexMerger - Starting dimension[null_column] with cardinality[0]
io.druid.segment.IndexMerger - Starting dimension[partial_null_column] with cardinality[2]
io.druid.segment.IndexMerger - Starting dimension[placement] with cardinality[1]
io.druid.segment.IndexMerger - Starting dimension[placementish] with cardinality[9]
io.druid.segment.IndexMerger - Starting dimension[quality] with cardinality[9]

after

io.druid.segment.IndexMerger - Starting dimension[market] with cardinality[4]
io.druid.segment.IndexMerger - Starting dimension[null_column] with cardinality[0]
io.druid.segment.IndexMerger - Starting dimension[partial_null_column] with cardinality[2]
io.druid.segment.IndexMerger - Starting dimension[placement] with cardinality[2]
io.druid.segment.IndexMerger - Starting dimension[placementish] with cardinality[10]
io.druid.segment.IndexMerger - Starting dimension[quality] with cardinality[10]

@navis
Contributor

navis commented Jan 20, 2016

I can see that null is added to every dimension that has values, but this can cause problems on the user side.
