ACCUMULO-4813 New bulk import process and API #436

keith-turner · 2018-04-20T14:49:01Z

This is a big change to how bulk import works. The new API does the following for bulk import:

Computes the mapping of files to tablets on the client side. This mapping is stored in HDFS as JSON (so it's human readable). Because the mapping is stored in HDFS, it does not need to be read into memory on the master.
The FATE operation streams the mapping from HDFS (without loading it into memory) together with the metadata table and sends one-way Thrift load messages to tablet servers. This makes the FATE operation very fast. Using this new API, 10K files were loaded in 20 secs on a single node with 100 tablets, and the FATE op took 5 secs.
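
To illustrate the client-side mapping step, here is a minimal sketch of finding which tablets a file's row range overlaps. It is not the code from this PR: the class `FileToTabletSketch`, the method `overlappingTablets`, and the `"<default>"` label for the last tablet are all hypothetical, and JSON serialization is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: find which tablets a file's row range overlaps. */
class FileToTabletSketch {

  /**
   * Tablets are identified by their sorted end rows (split points). A tablet
   * covers the range (previous end row, end row]. The last tablet has no end
   * row and is represented here by the made-up label "<default>".
   */
  static List<String> overlappingTablets(TreeSet<String> endRows, String firstRow,
      String lastRow) {
    List<String> result = new ArrayList<>();
    // Only tablets whose end row is >= firstRow can hold data from the file.
    for (String endRow : endRows.tailSet(firstRow, true)) {
      result.add(endRow);
      if (endRow.compareTo(lastRow) >= 0) {
        return result; // this tablet contains lastRow, so stop here
      }
    }
    // lastRow sorts after every split point, so the file reaches the last tablet.
    result.add("<default>");
    return result;
  }

  public static void main(String[] args) {
    TreeSet<String> endRows = new TreeSet<>(Arrays.asList("g", "m", "t"));
    System.out.println(overlappingTablets(endRows, "h", "p"));
    System.out.println(overlappingTablets(endRows, "u", "z"));
  }
}
```

A file spanning rows `h` through `p` overlaps the tablets ending at `m` and `t`, so it would be listed under both in the mapping.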

The new API and functionality are complete; however, the following still needs to be done:

Make the metadata scan split tolerant (so it never skips a tablet). The way the GC scans is split tolerant.
Decide how to make the shell support this.
Possibly allow users to pass their own mapping file.
Retry when a concurrent merge happens.
Document the new API.
Support bulk import to offline tables.
When removing load flags, only scan the portion of the metadata table where loads happened. Currently this scans the entire table range in the metadata table.
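
For the split-tolerance item in the list above, one possible way to notice a skipped tablet is to check that each scanned tablet's previous end row matches the end row of the tablet before it. This is only a sketch under that assumption, not necessarily how the GC scan works; `TabletContinuitySketch`, `Tablet`, and `firstGap` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch: detect a gap in a sequence of scanned tablets. */
class TabletContinuitySketch {

  /** Made-up stand-in for one tablet entry read from the metadata table. */
  static final class Tablet {
    final String prevEndRow; // null for the first tablet of a table
    final String endRow; // null for the last tablet of a table

    Tablet(String prevEndRow, String endRow) {
      this.prevEndRow = prevEndRow;
      this.endRow = endRow;
    }
  }

  /**
   * Returns the index of the first tablet whose prevEndRow does not match the
   * endRow of the tablet scanned before it, or -1 if the sequence is contiguous.
   * Assumes the scan starts at the first tablet of the table.
   */
  static int firstGap(List<Tablet> scanned) {
    String expectedPrev = null;
    for (int i = 0; i < scanned.size(); i++) {
      Tablet t = scanned.get(i);
      if (!Objects.equals(t.prevEndRow, expectedPrev)) {
        return i; // a tablet was skipped (or changed) before this one
      }
      expectedPrev = t.endRow;
    }
    return -1;
  }

  public static void main(String[] args) {
    List<Tablet> ok = Arrays.asList(new Tablet(null, "g"), new Tablet("g", "m"),
        new Tablet("m", null));
    List<Tablet> gap = Arrays.asList(new Tablet(null, "g"), new Tablet("m", null));
    System.out.println(firstGap(ok));
    System.out.println(firstGap(gap));
  }
}
```

When a gap is found, a scanner could re-read the metadata starting from the last row it saw consistently rather than continuing past the hole.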

After this PR is finished with review, I will open issues for these and anything else that comes up.

* Provide initial API
* Provide some initial client code

* New Bulk Import Fate REPO
* GSON serialize & implement Bulk class
* Deprecated old BulkImporter
* Created BulkLoadIT for new technique
* TODO add more tests to BulkLoadIT

* In load RPC now sending multiple files per tablet instead of one
* Now saving mapping file to HDFS on client side
* Simplified JSON serialization
* Refactored new and old bulk import code into two packages
* Now only iterate over mapping in master, no longer load into memory

* Fix bug for single tablet ingest
* Fix BulkImportMove
* Created 2 new tests in BulkLoadIT
* Catch ExecutionException in Move

keith-turner · 2018-04-20T15:36:20Z

core/src/main/java/org/apache/accumulo/core/client/admin/TableOperations.java

+     *          Load files from this directory
+     * @return ImportSourceOptions
+     */
+    ImportSourceOptions from(String directory);


I think this should return ImportExecutorOptions

milleruntime · 2018-04-20T16:50:20Z

core/src/main/java/org/apache/accumulo/core/client/impl/BulkSerialize.java

+
+  /**
+   * Read Json array of Bulk.Mapping objects and return SortedMap of the bulk load mapping
+   */


Need to update this javadoc

milleruntime · 2018-04-20T16:51:34Z

core/src/main/java/org/apache/accumulo/core/client/impl/thrift/TableOperationExceptionType.java

+  OTHER(7),
+  NAMESPACE_EXISTS(8),
+  NAMESPACE_NOTFOUND(9),
+  INVALID_NAME(10);


Need to put the new type at the END of the list so the existing numbers don't change.
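
To show why the position matters, here is a small sketch of an enum shaped like a Thrift-generated one, where each constant carries an explicit wire value. `StableEnumIdsSketch`, `OpType`, and `NEW_TYPE` are hypothetical; only the four existing constants and their numbers come from the diff above.

```java
/** Hypothetical sketch: appending an enum constant keeps existing wire values stable. */
class StableEnumIdsSketch {

  /** Made-up enum mirroring the shape of a Thrift-generated Java enum. */
  enum OpType {
    OTHER(7),
    NAMESPACE_EXISTS(8),
    NAMESPACE_NOTFOUND(9),
    INVALID_NAME(10),
    NEW_TYPE(11); // hypothetical new constant, appended so 7..10 keep their meaning

    private final int value;

    OpType(int value) {
      this.value = value;
    }

    int getValue() {
      return value;
    }

    /** Looks up a constant by its wire value, returning null when unknown. */
    static OpType findByValue(int value) {
      for (OpType t : values()) {
        if (t.value == value) {
          return t;
        }
      }
      return null;
    }
  }

  public static void main(String[] args) {
    // A peer built before NEW_TYPE existed still sends 10 for INVALID_NAME.
    System.out.println(OpType.findByValue(10));
    System.out.println(OpType.findByValue(11));
  }
}
```

If the new constant were inserted in the middle and the later constants renumbered, a value sent by an older client or server would decode to a different constant on the other side.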

milleruntime · 2018-04-20T16:52:19Z

core/src/main/java/org/apache/accumulo/core/master/thrift/FateOperation.java

-  TABLE_CANCEL_COMPACT(12),
-  NAMESPACE_CREATE(13),
-  NAMESPACE_DELETE(14),
-  NAMESPACE_RENAME(15);


Need to put the new type at the END of the list so the existing numbers don't change.

milleruntime · 2018-04-20T16:55:36Z

server/master/src/main/java/org/apache/accumulo/master/tableOps/TableRangeOpWait.java

- * Normal operations, like bulk imports, will grab the read lock and prevent merges (writes) while
- * they run. Merge operations will lock out some operations while they run.
+ * Normal operations, like bulkDir imports, will grab the read lock and prevent merges (writes)
+ * while they run. Merge operations will lock out some operations while they run.


Looks like find/replace unintentionally changed some javadoc. We didn't modify this file.

ctubbsii

Just a few comments so far. Might do a longer review if I get time later.

ctubbsii · 2018-04-20T22:09:18Z

core/src/main/java/org/apache/accumulo/core/client/impl/BulkImport.java

+      // TODO need to handle case of file existing
+      BulkSerialize.writeLoadMapping(mappings, srcPath.toString(), p -> fs.create(p));
+
+      List<ByteBuffer> args = Arrays.asList(ByteBuffer.wrap(tableName.getBytes(UTF_8)),


This should probably send tableId instead of name.

ctubbsii · 2018-04-20T22:16:36Z

core/src/main/java/org/apache/accumulo/core/client/impl/BulkImport.java

+      }
+
+      int nThreads;
+      if (threads.toUpperCase().endsWith("C")) {


Should probably move this configuration type parsing into the ConfigurationTypeHelper.
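
As a rough picture of what such a shared helper could look like, the sketch below parses a thread-count setting that is either a plain number or a number followed by `C`. It assumes the `C` suffix means "multiple of the available cores", which the diff above only hints at; `ThreadCountSketch` and `parseThreads` are hypothetical names, not the actual ConfigurationTypeHelper API.

```java
/** Hypothetical sketch: parse a thread count such as "8" or "2C". */
class ThreadCountSketch {

  /**
   * Parses a thread-count setting. A trailing "C" (any case) is assumed to
   * mean "multiply by the number of cores"; anything else is an absolute
   * count. The core count is a parameter so the method is easy to test.
   */
  static int parseThreads(String threads, int cores) {
    String trimmed = threads.trim();
    if (trimmed.toUpperCase().endsWith("C")) {
      String number = trimmed.substring(0, trimmed.length() - 1);
      double multiplier = Double.parseDouble(number);
      // Never return fewer than one thread, even for small multipliers.
      return Math.max(1, (int) Math.round(multiplier * cores));
    }
    return Integer.parseInt(trimmed);
  }

  public static void main(String[] args) {
    int cores = Runtime.getRuntime().availableProcessors();
    System.out.println(parseThreads("8", cores));
    System.out.println(parseThreads("2C", cores));
  }
}
```

Keeping the parsing in one helper means every property that accepts this format is validated the same way.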

* Update ordering of new thrift objects
* Move Property parsing to ConfigTypeHelper
* Fix Javadoc

* Updated documentation
* Fixed bug in API
* Unit tested merge detection code (and found bugs in the process)
* Updated existing bulk IT to use new and old APIs
* Addressed many TODOs

ctubbsii · 2018-05-08T17:45:33Z

Since this PR was merged, should the ACCUMULO-4813 JIRA issue be closed now?

keith-turner and others added 4 commits April 10, 2018 16:42

ACCUMULO-4813 New bulk import method (incomplete start...)

352d6b2

* Provide initial API
* Provide some initial client code

ACCUMULO-4813 New bulk import method (continued...)

bafb802

* New Bulk Import Fate REPO
* GSON serialize & implement Bulk class
* Deprecated old BulkImporter
* Created BulkLoadIT for new technique
* TODO add more tests to BulkLoadIT

ACCUMULO-4813 Add permission checks to client (continued...)

8fa789d

* Fix bug for single tablet ingest
* Fix BulkImportMove
* Created 2 new tests in BulkLoadIT
* Catch ExecutionException in Move

keith-turner changed the title ~~Accumulo 4813 fmt~~ ACCUMULO-4813 Sped up bulk import Apr 20, 2018

keith-turner changed the title ~~ACCUMULO-4813 Sped up bulk import~~ ACCUMULO-4813 New bulk import process and API Apr 20, 2018

milleruntime reviewed Apr 20, 2018

ctubbsii reviewed Apr 20, 2018

PR Updates

a8423f6

* Update ordering of new thrift objects
* Move Property parsing to ConfigTypeHelper
* Fix Javadoc

keith-turner mentioned this pull request May 5, 2018

Cache rfile file lengths #467

Closed

ACCUMULO-4813 New bulk import method (continued...)

76c0cc1

* Updated documentation
* Fixed bug in API
* Unit tested merge detection code (and found bugs in the process)
* Updated existing bulk IT to use new and old APIs
* Addressed many TODOs

keith-turner merged commit 76c0cc1 into apache:master May 5, 2018

milleruntime mentioned this pull request May 7, 2018

Monitor 2.0 Bulk Import State is funky #475

Closed

ctubbsii added the v2.0.0 label Sep 15, 2018

keith-turner deleted the ACCUMULO-4813-fmt branch December 6, 2018 15:17

ctubbsii added this to the 2.0.0 milestone Jul 12, 2024
