Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACCUMULO-4813 New bulk import process and API #436

Merged
merged 6 commits into from
May 5, 2018

Conversation

keith-turner
Copy link
Contributor

@keith-turner keith-turner commented Apr 20, 2018

This is a big change to how bulk import works. This new API does the following for bulk import.

  • Computes mapping of files to tablets on the client side. This mapping is stored in HDFS as JSON (so its human readable). Stored mapping in HDFS so it does need to be read in memory on master.
  • The Fate operation streams the mapping from HDFS (does not load into memory) and metadata table and sends oneway thrift load message to tablet servers. This makes the fate operation very fast. Using this new API 10K files were loaded in 20 secs on a single node with 100 tablets and the FATE op took 5 secs.

The new API and functionality is complete, however the following still needs to be done.

  • Make metadata scan split tolerant (so it never skips a tablet) The way GC scans is split tolerant.
  • Decide how to make shell support this.
  • Possibly allow user to pass their own mapping file.
  • Retry when concurrent merge happens.
  • Document new API
  • Support bulk import to offline tables
  • When removing load flags, only scan portion of metadata table where loads happened. Currently scans entire table range in metadata table.

After this PR is finished with review, will open issues for these and anything else that comes up.

keith-turner and others added 4 commits April 10, 2018 16:42
 * Provide initial API
 * Provide some initial client code
 * New Bulk Import Fate REPO
 * GSON serialize & implement Bulk class
 * Deprecated old BulkImporter
 * Created BulkLoadIT for new technique
 * TODO add more tests to BulkLoadIT
 * In load RPC aow sending multiple file per tablet instead of one
 * Now saving mapping file to HDFS on client side
 * Simplified JSON serialization
 * Refactored new and old bulk import code into two packages
 * Now only iterate over mapping in master, no longer load into memory
* Fix bug for single tablet ingest
* Fix BulkImportMove
* Created 2 new tests in BulkLoadIT
* Catch ExecutionException in Move
@keith-turner keith-turner changed the title Accumulo 4813 fmt ACCUMULO-4813 Sped up bulk import Apr 20, 2018
@keith-turner keith-turner changed the title ACCUMULO-4813 Sped up bulk import ACCUMULO-4813 New bulk import process and API Apr 20, 2018
* Load files from this directory
* @return ImportSourceOptions
*/
ImportSourceOptions from(String directory);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should return ImportExecutorOptions


/**
* Read Json array of Bulk.Mapping objects and return SortedMap of the bulk load mapping
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to update this javadoc

OTHER(7),
NAMESPACE_EXISTS(8),
NAMESPACE_NOTFOUND(9),
INVALID_NAME(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to put the new type at the END of the list so the existing numbers don't change.

TABLE_CANCEL_COMPACT(12),
NAMESPACE_CREATE(13),
NAMESPACE_DELETE(14),
NAMESPACE_RENAME(15);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to put the new type at the END of the list so the existing numbers don't change.

* Normal operations, like bulk imports, will grab the read lock and prevent merges (writes) while
* they run. Merge operations will lock out some operations while they run.
* Normal operations, like bulkDir imports, will grab the read lock and prevent merges (writes)
* while they run. Merge operations will lock out some operations while they run.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like find/replace caught some unintentional javadoc. We didn't modify this file

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments so far. Might do a longer review if I get time later.

// TODO need to handle case of file existing
BulkSerialize.writeLoadMapping(mappings, srcPath.toString(), p -> fs.create(p));

List<ByteBuffer> args = Arrays.asList(ByteBuffer.wrap(tableName.getBytes(UTF_8)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably send tableId instead of name.

}

int nThreads;
if (threads.toUpperCase().endsWith("C")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably move this configuration type parsing into the ConfigurationTypeHelper.

* Update ordering of new thrift objects
* Move Property parsing to ConfigTypeHelper
* Fix Javadoc
 * Updated documentation
 * Fixed bug in API
 * Unit tested merge detection code (and found bugs in the proccess)
 * Updated existing bulk IT to use new and old APIs
 * Addressed many TODOs
@ctubbsii
Copy link
Member

ctubbsii commented May 8, 2018

Since this PR was merged, should the ACCUMULO-4813 JIRA issue be closed now?

@keith-turner keith-turner deleted the ACCUMULO-4813-fmt branch December 6, 2018 15:17
@ctubbsii ctubbsii added this to the 2.0.0 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants