Use Combiner to merge InputRows at the Mapper during Hadoop Batch Ingestion #1472
Conversation
while (iter.hasNext()) {
  context.progress();
  InputRow value = InputRowHelper.fromBytes(iter.next().getBytes(), config.getSchema().getDataSchema().getAggregators());
  index.add(value);
Do we need to check index size bounds here?
@nishantmonu51 can you point me to some code doing the bound checking? I don't see that happening in IndexGeneratorReducer.reduce(..) either.
The reducer checks whether a row can be appended to the index by calling index.canAppendRow(); if the index is full, it persists the current one and creates another index.
Thanks, I see that now.
In this case persist(..) is not really an option; what we can do is flush rows from that index into context.write(..) and create another index if/when index.canAppendRow() returns false.
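As a minimal sketch of that flush-on-full pattern (flushIndexToContext is a hypothetical helper name; makeIncrementalIndex is the method shown later in this diff, and the deserialization call mirrors the snippet above), not the exact code in this PR:

```java
// Instead of persisting to disk, emit the merged rows via context.write(..) and start a
// fresh in-memory index whenever canAppendRow() reports the current one is full.
IncrementalIndex index = makeIncrementalIndex(bucket, aggs, bufferPool, config);
while (iter.hasNext()) {
  context.progress();
  InputRow row = InputRowHelper.fromBytes(iter.next().getBytes(), aggs);
  if (!index.canAppendRow()) {
    flushIndexToContext(key, index, context);  // hypothetical: write out what has been merged so far
    index = makeIncrementalIndex(bucket, aggs, bufferPool, config);
  }
  index.add(row);
}
flushIndexToContext(key, index, context);      // flush the final index
```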
done
How much did using the combiner help in improving ingestion time?
@nishantmonu51 I haven't done much perf testing on this because the performance improvement really depends on how much merging is possible; however, I have ensured that if there is only one row for a key then the combiner is effectively a no-op with almost no overhead.
@@ -193,14 +200,39 @@ public boolean run()
  }
}

public static class IndexGeneratorMapper extends HadoopDruidIndexerMapper<BytesWritable, Writable>
private static IncrementalIndex makeIncrementalIndex(Bucket theBucket, AggregatorFactory[] aggs, StupidPool bufferPool, HadoopDruidIndexerConfig config)
are there any changes in this function or did it just get moved around? It might be easier to leave it in place for review and worry about formatting separately.
Just moved so that it can be used in the combiner class too.
It looks like the tack taken was to create IncrementalIndex objects in the combiner and then persist those. I agree that this will functionally work, but I fear that it will be really difficult to manage and tune from the MR side. There are new configurations to manage about how much memory to give the combiner, etc., that will make it difficult for someone who doesn't know MR to have this "just work". I think we need to move the dimension names and values into the actual Key object and have each combiner work on only a single entry.
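Purely as illustration of that suggestion (class and field names here are hypothetical, not part of this PR): folding the truncated timestamp and the serialized dimensions into the shuffle key would mean every record reaching a single combiner/reducer call already shares identical dimensions and can be merged without building an IncrementalIndex.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: truncated timestamp plus dimension names/values serialized
// into one Text. A matching Partitioner/hashCode would also be needed in practice.
public class DimsAndTimeKey implements WritableComparable<DimsAndTimeKey> {
  private long truncatedTimestamp;
  private Text serializedDims = new Text();

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(truncatedTimestamp);
    serializedDims.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    truncatedTimestamp = in.readLong();
    serializedDims.readFields(in);
  }

  @Override
  public int compareTo(DimsAndTimeKey other) {
    int cmp = Long.compare(truncatedTimestamp, other.truncatedTimestamp);
    return cmp != 0 ? cmp : serializedDims.compareTo(other.serializedDims);
  }
}
```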
ByteArrayDataOutput out = ByteStreams.newDataOutput();

List<String> dimList = row.getDimensions();
String[] dims = dimList == null ? EMPTY_STR_ARRAY : dimList.toArray(EMPTY_STR_ARRAY);
Is it possible to have dimList be the list used in most parts of the function, and dims be absent? The ArrayWritable can take dimList.toArray(new String[dimList.size()]) or similar.
sure, will do
private static final String[] EMPTY_STR_ARRAY = new String[0];

public final static byte[] toBytes(InputRow row, AggregatorFactory[] aggs)
Instead of relying on Hadoop serialization here, would it make sense for us to define our own so we can potentially re-use that serde elsewhere? For instance, it might speed up serializing and passing large groupBy results between nodes, and might be useful for on-disk merging of groupBy results in the future. Just a thought.
That makes sense. However, from the toBytes(..)/fromBytes(..) contract perspective, there is no dependency on Hadoop libs; when I wrote it, the Hadoop libs just felt like the most obvious and simple choice since they were already included in this module.
The important thing is that the serde implementation can be changed completely at any time with no functional user-facing impact, so we can do it whenever needed.
@drcrallen I have cleaned up the commit history already. The commits in the history now represent key logical stages of the development, which I would like to preserve.
StringArrayWritable sw = new StringArrayWritable(dims);
sw.write(out);

out.writeLong(row.getTimestampFromEpoch());
I'd move this to the beginning. It's the only fixed-width portion of what is being serialized and having it in a known location can make it easier to re-use this format for other things (like partitioning, etc.) if we ever want to.
Done, moved the timestamp to be first in the serialized byte[].
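For reference, a rough sketch of the resulting layout, reusing the names from the diff above; the dimension values and aggregated metric values that follow are omitted here since they are not shown in the snippet under review.

```java
// Guava's ByteStreams/ByteArrayDataOutput, as used in the diff above.
ByteArrayDataOutput out = ByteStreams.newDataOutput();
out.writeLong(row.getTimestampFromEpoch());      // fixed-width timestamp first, as suggested
StringArrayWritable sw = new StringArrayWritable(dims);
sw.write(out);                                   // dimension names
// ... dimension values and aggregated metric values follow (not shown in this snippet) ...
byte[] serialized = out.toByteArray();
```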
Aside from the bits in serialization, I'm 👍. @himanshug can you verify those bits and then merge at will?
…tricSerde This provides the alternative to using ComplexMetricSerde.getObjectStrategy() and using the serde methods from ObjectStrategy as that usage pattern is deprecated.
This allows for arbitrary InputFormat while hadoop batch ingestion that can return records of value type other than Text
…dd a hadoop combiner to hadoop batch ingestion to do merges on the mappers if possible
@himanshug I can merge this since it has 2 +1s now. Do you still want to fix anything else up?
@fjy please go ahead, I'm done.
@himanshug did we ever do some performance testing to get a sense of how much speed improvement we can get for things that aggregate well, and how much impact it may have on things that don't aggregate well? It would be useful to know at which point it makes sense to start using the combiner or not.
@xvrl not yet, but I will do that.
@xvrl I've just moved my old Avro Hadoop indexing module over to this, based on @himanshug's unification of the Hadoop indexing code in 0.8.1; the code is much cleaner compared with our old hacky way. Here are some performance testing results for the combiner. Obviously the combiner is wonderful for my data, which I think is low cardinality: it cut indexing time by 75%. Great job!

Avro indexing w/ map output compressed and w/o combiner: used 310 mappers and 2 reducers, with average times map=51s, reduce=43m, shuffle=11m, merge=4m. Counters are:

Avro indexing w/ map output compressed and w/ combiner: used 310 mappers and 2 reducers, with average times map=1m, reduce=10m, shuffle=3m, merge=16s. Counters are:
@zhaown thanks for reporting the numbers, glad that it's working well for you. @drcrallen I don't think adding the numbers straight away is an indicator of total job time; you would have to look at something like: ignoring overhead, % improvement ~ 75%.
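For example, treating @zhaown's reported phase averages as roughly additive and ignoring per-task overhead:

```
w/o combiner: 51s + 43m + 11m + 4m  ≈ 58.9 min
w/  combiner: 1m  + 10m + 3m  + 16s ≈ 14.3 min
improvement:  (58.9 - 14.3) / 58.9  ≈ 75%
```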
@drcrallen @himanshug I think adding the numbers straight away gives the total CPU-clock time instead of the wall-clock time. After all, Druid has to do what it needs to do for indexing; the combiner does no magic that reduces work, aside from the shuffle-like overhead. It's like moving some work from the reducer to the mapper, and while we cannot have many reducers (since we don't want many output files), we can have many mappers, so using the combiner increases parallelism and reduces wall-clock time. Actually, to further increase parallelism, I changed targetPartitionSize from 5,000,000 to 1,000,000, so I now have 6 reducers instead of 2, which reduced total indexing time by a further ~50%.
@zhaown @himanshug So is it safe to say that this patch helps an individual job running on its own cluster, but the amortized run time of a multi-tenant cluster which is run at or above capacity might not change (or not change much)?
@drcrallen An individual job finishing faster means slots get freed up quicker, and it's also good because your ingestion finishes faster. However, since this mainly reduces the time taken by the reducers, which are far fewer in number than the mappers, overall cluster capacity might not change that much.
@himanshug As such does it also change the memory requirements of the mapper?
@drcrallen Yes, but just enough to hold one "merged" row in an index object plus the overhead associated with the combiner. I'm not sure how to calculate per-mapper overhead from the numbers above.
@himanshug Sure, no problem, I just want to make sure the potential impact areas are communicated.
This patch:
(1) takes control of the serialization format of InputRow from mapper to reducer, which in turn allows the InputFormat to return records of arbitrary type rather than just Text (the only requirement is that the configured InputRowParser can produce an InputRow from that arbitrary type)
(2) introduces a combiner that merges input rows at the mapper nodes when possible