Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACCUMULO-4813 New bulk import process and API #436

Merged
merged 6 commits into from May 5, 2018

Conversation

@keith-turner
Copy link
Contributor

commented Apr 20, 2018

This is a big change to how bulk import works. This new API does the following for bulk import.

  • Computes mapping of files to tablets on the client side. This mapping is stored in HDFS as JSON (so its human readable). Stored mapping in HDFS so it does need to be read in memory on master.
  • The Fate operation streams the mapping from HDFS (does not load into memory) and metadata table and sends oneway thrift load message to tablet servers. This makes the fate operation very fast. Using this new API 10K files were loaded in 20 secs on a single node with 100 tablets and the FATE op took 5 secs.

The new API and functionality is complete, however the following still needs to be done.

  • Make metadata scan split tolerant (so it never skips a tablet) The way GC scans is split tolerant.
  • Decide how to make shell support this.
  • Possibly allow user to pass their own mapping file.
  • Retry when concurrent merge happens.
  • Document new API
  • Support bulk import to offline tables
  • When removing load flags, only scan portion of metadata table where loads happened. Currently scans entire table range in metadata table.

After this PR is finished with review, will open issues for these and anything else that comes up.

keith-turner and others added 4 commits Apr 10, 2018
ACCUMULO-4813 New bulk import method (incomplete start...)
 * Provide initial API
 * Provide some initial client code
ACCUMULO-4813 New bulk import method (continued...)
 * New Bulk Import Fate REPO
 * GSON serialize & implement Bulk class
 * Deprecated old BulkImporter
 * Created BulkLoadIT for new technique
 * TODO add more tests to BulkLoadIT
ACCUMULO-4813 New bulk import method (continued...)
 * In load RPC aow sending multiple file per tablet instead of one
 * Now saving mapping file to HDFS on client side
 * Simplified JSON serialization
 * Refactored new and old bulk import code into two packages
 * Now only iterate over mapping in master, no longer load into memory
ACCUMULO-4813 Add permission checks to client (continued...)
* Fix bug for single tablet ingest
* Fix BulkImportMove
* Created 2 new tests in BulkLoadIT
* Catch ExecutionException in Move

@keith-turner keith-turner changed the title Accumulo 4813 fmt ACCUMULO-4813 Sped up bulk import Apr 20, 2018

@keith-turner keith-turner changed the title ACCUMULO-4813 Sped up bulk import ACCUMULO-4813 New bulk import process and API Apr 20, 2018

* Load files from this directory
* @return ImportSourceOptions
*/
ImportSourceOptions from(String directory);

This comment has been minimized.

Copy link
@keith-turner

keith-turner Apr 20, 2018

Author Contributor

I think this should return ImportExecutorOptions


/**
* Read Json array of Bulk.Mapping objects and return SortedMap of the bulk load mapping
*/

This comment has been minimized.

Copy link
@milleruntime

milleruntime Apr 20, 2018

Contributor

Need to update this javadoc

OTHER(7),
NAMESPACE_EXISTS(8),
NAMESPACE_NOTFOUND(9),
INVALID_NAME(10);

This comment has been minimized.

Copy link
@milleruntime

milleruntime Apr 20, 2018

Contributor

Need to put the new type at the END of the list so the existing numbers don't change.

TABLE_CANCEL_COMPACT(12),
NAMESPACE_CREATE(13),
NAMESPACE_DELETE(14),
NAMESPACE_RENAME(15);

This comment has been minimized.

Copy link
@milleruntime

milleruntime Apr 20, 2018

Contributor

Need to put the new type at the END of the list so the existing numbers don't change.

* Normal operations, like bulk imports, will grab the read lock and prevent merges (writes) while
* they run. Merge operations will lock out some operations while they run.
* Normal operations, like bulkDir imports, will grab the read lock and prevent merges (writes)
* while they run. Merge operations will lock out some operations while they run.

This comment has been minimized.

Copy link
@milleruntime

milleruntime Apr 20, 2018

Contributor

Looks like find/replace caught some unintentional javadoc. We didn't modify this file

@ctubbsii
Copy link
Member

left a comment

Just a few comments so far. Might do a longer review if I get time later.

// TODO need to handle case of file existing
BulkSerialize.writeLoadMapping(mappings, srcPath.toString(), p -> fs.create(p));

List<ByteBuffer> args = Arrays.asList(ByteBuffer.wrap(tableName.getBytes(UTF_8)),

This comment has been minimized.

Copy link
@ctubbsii

ctubbsii Apr 20, 2018

Member

This should probably send tableId instead of name.

}

int nThreads;
if (threads.toUpperCase().endsWith("C")) {

This comment has been minimized.

Copy link
@ctubbsii

ctubbsii Apr 20, 2018

Member

Should probably move this configuration type parsing into the ConfigurationTypeHelper.

PR Updates
* Update ordering of new thrift objects
* Move Property parsing to ConfigTypeHelper
* Fix Javadoc
ACCUMULO-4813 New bulk import method (continued...)
 * Updated documentation
 * Fixed bug in API
 * Unit tested merge detection code (and found bugs in the proccess)
 * Updated existing bulk IT to use new and old APIs
 * Addressed many TODOs
@ctubbsii

This comment has been minimized.

Copy link
Member

commented May 8, 2018

Since this PR was merged, should the ACCUMULO-4813 JIRA issue be closed now?

@ctubbsii ctubbsii added the v2.0.0 label Sep 15, 2018

@keith-turner keith-turner deleted the keith-turner:ACCUMULO-4813-fmt branch Dec 6, 2018

@ctubbsii ctubbsii added this to Done in 2.0.0 Jun 14, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
3 participants
You can’t perform that action at this time.