METRON-678: Multithread the flat file loader #428
Testing Plan
Preliminaries
Verify 100k import with 5 threads:
Verify 100k import with 5 threads and a batch of 1000:
Verify 100k import with 1 thread:
Verify 1 entry import with 5 threads:
Verify 1 entry import with 1 thread:
Note that the batch size is just the split size (the number of lines processed in each split) for the spliterator. The default of 128 is probably good enough for almost all cases.
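To make the split-size idea concrete, here is a minimal sketch of a spliterator that hands out splits of at most `batchSize` lines. It is only an illustration: `BatchSplitSketch`, `BatchingSpliterator`, and `batchSizes` are hypothetical names, not the classes in this PR, and the real loader may split differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;

// Hypothetical illustration only; not the spliterator from this PR.
final class BatchSplitSketch {

  /** Wraps a line source and hands out splits of at most batchSize lines. */
  static final class BatchingSpliterator implements Spliterator<String> {
    private final Spliterator<String> source;
    private final int batchSize;

    BatchingSpliterator(Spliterator<String> source, int batchSize) {
      this.source = source;
      this.batchSize = batchSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
      return source.tryAdvance(action);
    }

    @Override
    public Spliterator<String> trySplit() {
      // Pull up to batchSize lines off the source; they become one split.
      List<String> batch = new ArrayList<>(batchSize);
      while (batch.size() < batchSize && source.tryAdvance(batch::add)) {
        // keep reading until the batch is full or the source is exhausted
      }
      return batch.isEmpty() ? null : batch.spliterator();
    }

    @Override
    public long estimateSize() {
      return source.estimateSize();
    }

    @Override
    public int characteristics() {
      // Drop the size guarantees, since this wrapper does not maintain them.
      return source.characteristics() & ~(SIZED | SUBSIZED);
    }
  }

  /** Returns the size of each split produced for the given lines. */
  static List<Integer> batchSizes(List<String> lines, int batchSize) {
    Spliterator<String> rest = new BatchingSpliterator(lines.spliterator(), batchSize);
    List<Integer> sizes = new ArrayList<>();
    Spliterator<String> batch;
    while ((batch = rest.trySplit()) != null) {
      sizes.add((int) batch.getExactSizeIfKnown());
    }
    return sizes;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i");
    System.out.println(batchSizes(lines, 2)); // prints [2, 2, 2, 2, 1]
  }
}
```

A larger batch means fewer, bigger units of work per thread; a smaller one spreads the work more evenly at the cost of more splitting overhead.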
Just as a comment: on my Vagrant box, importing the Alexa 1M took 4 minutes with 6 threads and 14 minutes with 1 thread.
Tested in Vagrant quick-dev and all numbers return as expected. Reviewing code now.
.collect(Collectors.toMap(s -> s, s -> 1, Integer::sum));
Assert.assertEquals(5, count.size());
Assert.assertEquals(3, (int) count.get("foo"));
Assert.assertEquals(2, (int) count.get("bar"));
Minor Q - was grok intentionally excluded?
hah, no, not intentionally. I can add it in.
@Test
public void testActuallyParallel() throws ExecutionException, InterruptedException, FileNotFoundException {
//With 9 elements and a batch of 2, we should only have ceil(9/2) = 5 batches, so at most min(5, 2) = 2 threads will be used
Good idea for a test
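For anyone following the arithmetic in that test comment, a small sketch of the bound, assuming a pool of 2 threads as the comment implies. `ThreadBoundSketch` and its method names are hypothetical and do not appear in the PR.

```java
// Hypothetical helper illustrating the bound from the test comment.
final class ThreadBoundSketch {

  /** Number of batches needed for the given element count: ceil(elements / batchSize). */
  static int numBatches(int elements, int batchSize) {
    return (elements + batchSize - 1) / batchSize;
  }

  /** No more threads can be busy than there are batches, or than the pool holds. */
  static int maxThreadsUsed(int elements, int batchSize, int poolSize) {
    return Math.min(numBatches(elements, batchSize), poolSize);
  }

  public static void main(String[] args) {
    System.out.println(numBatches(9, 2));        // prints 5
    System.out.println(maxThreadsUsed(9, 2, 2)); // prints 2
  }
}
```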
Manual tests all checked out. Code looks good to me. The parallelism tests were a nice addition. +1
Currently the flat file loader is single-threaded when writing to HBase. We could make this a lot faster by multithreading the HBase puts.
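As a rough sketch of the general idea (not the code in this PR), the lines can be cut into batches and each batch written from a worker thread. Everything here is an assumption for illustration: `ParallelLoadSketch`, `load`, and the `writeBatch` callback are hypothetical, and `writeBatch` stands in for whatever thread-safe code converts a batch of lines to HBase puts and sends them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical illustration only; the loader in this PR may be structured differently.
final class ParallelLoadSketch {

  /**
   * Writes the lines in batches across numThreads workers and returns the number written.
   * writeBatch is a stand-in for the real HBase write and must be safe to call concurrently.
   */
  static int load(List<String> lines, int numThreads, int batchSize,
                  Consumer<List<String>> writeBatch) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<Integer>> pending = new ArrayList<>();
      for (int start = 0; start < lines.size(); start += batchSize) {
        List<String> batch = lines.subList(start, Math.min(start + batchSize, lines.size()));
        pending.add(pool.submit(() -> {
          writeBatch.accept(batch);
          return batch.size();
        }));
      }
      int written = 0;
      for (Future<Integer> result : pending) {
        written += result.get(); // blocks until that batch is done
      }
      return written;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while loading", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A batch failed to write", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> lines = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      lines.add("domain" + i + ".com," + i);
    }
    AtomicInteger seen = new AtomicInteger(); // stand-in sink that only counts lines
    int written = load(lines, 5, 128, batch -> seen.addAndGet(batch.size()));
    System.out.println(written + " lines written, " + seen.get() + " seen by the sink");
  }
}
```

Collecting the futures before returning means a failed batch surfaces as an exception rather than being silently dropped.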
Executing this on a single-node Vagrant box with the following configuration for a 100k-line, 2-column CSV enrichment import:
A reasonable speedup was achieved: