METRON-678: Multithread the flat file loader #428

cestella · 2017-01-27T23:28:01Z

Currently the flat file loader is single-threaded when writing to HBase. We could make this a lot faster by multithreading the HBase puts.
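As a rough illustration of the idea (not the code in this PR), the sketch below splits the input lines into batches and hands each batch to a fixed-size thread pool. Everything in it is an assumption made for illustration: the class and method names are hypothetical, a ConcurrentHashMap stands in for the HBase table so that no real HBase client calls are shown, and the input is assumed to be "rank,domain" lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a ConcurrentHashMap stands in for the HBase table,
// so no real HBase client calls appear here.
class ParallelPutSketch {

  /** Loads "rank,domain" lines into the stand-in table using numThreads workers. */
  static Map<String, String> load(List<String> lines, int numThreads, int batchSize) {
    Map<String, String> table = new ConcurrentHashMap<>();
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (int start = 0; start < lines.size(); start += batchSize) {
        // Each batch is a read-only view over a slice of the input.
        List<String> batch = lines.subList(start, Math.min(start + batchSize, lines.size()));
        pending.add(pool.submit(() -> {
          for (String line : batch) {
            String[] cols = line.split(",", 2);
            table.put(cols[1], cols[0]); // key on the domain, store the rank
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // wait for the batch and surface any worker failure
      }
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("load failed", e);
    } finally {
      pool.shutdown();
    }
    return table;
  }

  public static void main(String[] args) {
    List<String> lines = new ArrayList<>();
    for (int rank = 1; rank <= 1000; rank++) {
      lines.add(rank + ",site" + rank + ".example");
    }
    System.out.println(load(lines, 5, 128).size()); // prints 1000
  }
}
```

Waiting on every Future before returning means a failed batch is reported rather than silently dropped, which matters when the row count is the check for a correct import.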

I ran a 100k-row, 2-column CSV enrichment import on single-node Vagrant with the following configuration:

a batch size of 128
number of threads varying between 1 and 6

A reasonable speedup was achieved:

Number of Threads	Time (in seconds)
1	91.019
2	76.07
3	39.974
4	35.039
5	30.531
6	30.559
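To put the table in terms of speedup, the timings can be divided into the single-threaded baseline. The short sketch below does that arithmetic; the class and method names are hypothetical, and the only data used are the numbers from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for reading the timing table; not part of the PR.
class SpeedupSketch {

  /** Speedup of a run relative to the single-threaded baseline. */
  static double speedup(double baselineSeconds, double seconds) {
    return baselineSeconds / seconds;
  }

  /** Parallel efficiency: speedup divided by thread count (1.0 means perfect scaling). */
  static double efficiency(double baselineSeconds, double seconds, int threads) {
    return speedup(baselineSeconds, seconds) / threads;
  }

  public static void main(String[] args) {
    // Timings copied from the table above (threads -> seconds).
    Map<Integer, Double> timings = new LinkedHashMap<>();
    timings.put(1, 91.019);
    timings.put(2, 76.07);
    timings.put(3, 39.974);
    timings.put(4, 35.039);
    timings.put(5, 30.531);
    timings.put(6, 30.559);

    double baseline = timings.get(1);
    for (Map.Entry<Integer, Double> e : timings.entrySet()) {
      int threads = e.getKey();
      double seconds = e.getValue();
      System.out.printf("%d threads: %.2fx speedup, %.0f%% efficiency%n",
          threads, speedup(baseline, seconds), 100 * efficiency(baseline, seconds, threads));
    }
  }
}
```

By this measure the import is roughly 3x faster at 5 threads than at 1, and a 6th thread adds nothing on this setup.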

cestella · 2017-01-27T23:35:59Z

Testing Plan

Preliminaries

Download the alexa 1m dataset:

wget http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
unzip top-1m.csv.zip

Create a 100k and single entry selection:

head -n 100000 top-1m.csv > top-100k.csv
head -n 1 top-1m.csv > top-1.csv

Create an extractor.json for the CSV data by editing extractor.json and pasting in these contents:

{
  "config" : {
    "columns" : {
      "domain" : 1,
      "rank" : 0
    },
    "indicator_column" : "domain",
    "type" : "alexa",
    "separator" : ","
  },
  "extractor" : "CSV"
}

Verify 100k import with 5 threads:

# truncate hbase
echo "truncate 'enrichment'" | hbase shell
# import data into hbase using 5 threads
/usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-100k.csv -t enrichment -c t -e ./extractor.json -p 5 -b 128
# count data written and verify it's 100k
echo "count 'enrichment'" | hbase shell

Verify 100k import with 5 threads and a batch of 1000:

# truncate hbase
echo "truncate 'enrichment'" | hbase shell
# import data into hbase using 5 threads and a batch size of 1000
/usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-100k.csv -t enrichment -c t -e ./extractor.json -p 5 -b 1000
# count data written and verify it's 100k
echo "count 'enrichment'" | hbase shell

Verify 100k import with 1 thread:

# truncate hbase
echo "truncate 'enrichment'" | hbase shell
# import data into hbase using 1 thread
/usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-100k.csv -t enrichment -c t -e ./extractor.json -p 1 -b 128
# count data written and verify it's 100k
echo "count 'enrichment'" | hbase shell

Verify 1 entry import with 5 threads:

# truncate hbase
echo "truncate 'enrichment'" | hbase shell
# import data into hbase using 5 threads
/usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-1.csv -t enrichment -c t -e ./extractor.json -p 5 -b 128
# count data written and verify it's 1
echo "count 'enrichment'" | hbase shell

Verify 1 entry import with 1 thread:

# truncate hbase
echo "truncate 'enrichment'" | hbase shell
# import data into hbase using 1 thread
/usr/metron/0.3.0/bin/flatfile_loader.sh -i ./top-1.csv -t enrichment -c t -e ./extractor.json -p 1 -b 128
# count data written and verify it's 1
echo "count 'enrichment'" | hbase shell

cestella · 2017-01-27T23:38:44Z

Note: the batch size is just the split size for the spliterator, i.e. the number of lines processed in each split. I think the default of 128 is probably fine for almost all cases.
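To make "split size" concrete, here is a minimal sketch of a spliterator that peels off a fixed-size batch each time the stream framework asks it to split. It is an illustration under stated assumptions, not the ReaderSpliterator from this PR: the class names, the helper methods and the integer source are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

// Hypothetical illustration of batch-sized splits; NOT the ReaderSpliterator from this PR.
class BatchSplitSketch {

  /** Wraps a sequential source and peels off up to batchSize elements per split. */
  static final class BatchingSpliterator<T> implements Spliterator<T> {
    private final Spliterator<T> source;
    private final int batchSize;

    BatchingSpliterator(Spliterator<T> source, int batchSize) {
      this.source = source;
      this.batchSize = batchSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
      return source.tryAdvance(action);
    }

    @Override
    public Spliterator<T> trySplit() {
      List<T> batch = new ArrayList<>(batchSize);
      while (batch.size() < batchSize && source.tryAdvance(batch::add)) {
        // keep pulling until the batch is full or the source is exhausted
      }
      return batch.isEmpty() ? null : batch.spliterator();
    }

    @Override
    public long estimateSize() {
      return Long.MAX_VALUE; // size unknown up front, as with a file being read
    }

    @Override
    public int characteristics() {
      return ORDERED | NONNULL;
    }
  }

  /** Sizes of the batches produced when `elements` items are split by `batchSize`. */
  static List<Integer> batchSizes(int elements, int batchSize) {
    Spliterator<Integer> s = new BatchingSpliterator<>(
        IntStream.range(0, elements).boxed().spliterator(), batchSize);
    List<Integer> sizes = new ArrayList<>();
    for (Spliterator<Integer> b = s.trySplit(); b != null; b = s.trySplit()) {
      sizes.add((int) b.getExactSizeIfKnown());
    }
    return sizes;
  }

  /** Sums 0..elements-1 through a parallel stream built on the batching spliterator. */
  static long parallelSum(int elements, int batchSize) {
    Spliterator<Integer> s = new BatchingSpliterator<>(
        IntStream.range(0, elements).boxed().spliterator(), batchSize);
    return StreamSupport.stream(s, true).mapToLong(Integer::longValue).sum();
  }

  public static void main(String[] args) {
    System.out.println(batchSizes(9, 2));              // prints [2, 2, 2, 2, 1]
    System.out.println(batchSizes(100000, 128).size()); // prints 782
    System.out.println(parallelSum(9, 2));              // prints 36
  }
}
```

In this sketch a 100k-line file with the default batch size of 128 yields 782 batches, far more than the handful of threads being tested, so every thread has work to pick up.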

cestella · 2017-01-28T00:06:49Z

Just as a comment: on my vagrant box, importing the alexa 1m took 4 minutes with 6 threads and 14 minutes with 1 thread.

mmiklavc · 2017-01-30T21:37:42Z

Tested in Vagrant quick-dev and all numbers return as expected. Reviewing code now.

mmiklavc · 2017-01-30T22:34:05Z

...m/metron-common/src/test/java/org/apache/metron/common/utils/file/ReaderSpliteratorTest.java

+                      .collect(Collectors.toMap(s -> s, s -> 1, Integer::sum));
+      Assert.assertEquals(5, count.size());
+      Assert.assertEquals(3, (int) count.get("foo"));
+      Assert.assertEquals(2, (int) count.get("bar"));


Minor Q - was grok intentionally excluded?

cestella · Jan 30, 2017 (reply)

Hah, no, not intentionally. I can add it in.

mmiklavc · 2017-01-30T22:35:05Z

...m/metron-common/src/test/java/org/apache/metron/common/utils/file/ReaderSpliteratorTest.java

+
+  @Test
+  public void testActuallyParallel() throws ExecutionException, InterruptedException, FileNotFoundException {
+    //With 9 elements and a batch of 2, we should only ceil(9/2) = 5 batches, so at most min(5, 2) = 2 threads will be used


Good idea for a test
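The arithmetic in that test comment can be written out as a small sketch. It assumes each batch is handled by a single thread, and reads the 2 in min(5, 2) as the size of the thread pool; the class and method names are hypothetical.

```java
// Hypothetical helper mirroring the arithmetic in the test comment; not part of the PR.
class BatchMathSketch {

  /** Number of batches needed for `elements` items: ceil(elements / batchSize). */
  static int batches(int elements, int batchSize) {
    return (elements + batchSize - 1) / batchSize; // integer ceiling division
  }

  /** Upper bound on busy threads, assuming one thread per batch at a time. */
  static int maxThreadsUsed(int elements, int batchSize, int poolSize) {
    return Math.min(batches(elements, batchSize), poolSize);
  }

  public static void main(String[] args) {
    System.out.println(batches(9, 2));              // prints 5
    System.out.println(maxThreadsUsed(9, 2, 2));    // prints 2
    System.out.println(maxThreadsUsed(1, 128, 5));  // prints 1
  }
}
```

The last call corresponds to the single-entry case in the testing plan: with one line and a batch size of 128 there is only one batch, so under this assumption at most one of the 5 threads has anything to do.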

mmiklavc · 2017-01-30T22:39:43Z

Manual tests all checked out. Code looks good to me. The parallelism tests were a nice addition. +1

cestella added 2 commits January 27, 2017 18:15

Multithreading the SimpleEnrichmentFlatFileLoader

47d814e

doc changes.

918d4ce

Updating docs.

c6ca3a8

cestella closed this Jan 28, 2017

cestella reopened this Jan 28, 2017

cestella added 6 commits January 27, 2017 22:36

Investigating integration tests.

8c9a79c

Update integration test to be a proper integration test.

315bd18

Adding spliterator unit test for completeness

004c6f4

Updating test to use a proper file

f8dd48e

Updating docs and renaming a few things.

9b04f97

Update one more test case.

eb5b82c

cestella closed this Jan 28, 2017

cestella reopened this Jan 28, 2017

mmiklavc reviewed Jan 30, 2017

asfgit closed this in ad8724e Jan 31, 2017

tiborm mentioned this pull request Oct 19, 2018

METRON-1830: Re-implement Alerts dialog box without jQuery #1240

Closed

10 tasks
