Bulk import process can result in a lost file #800

Closed
ivakegg opened this issue Dec 5, 2018 · 3 comments

Comments

@ivakegg
Contributor

commented Dec 5, 2018

In the 1.8 and 1.9 lines of code (unsure about 2.0), we can get into a situation where a major compaction starts failing because of a missing file. The scenario is as follows:

1. master is asked to import a directory
2. tserver1 is asked to assign a file
3. tserver1 asks tserver2 to load a file
4. tserver2 loads the file successfully but does not respond to tserver1 in a timely manner
5. tserver2 major compacts and adds a ~del entry for the file
6. tserver1 times out and retries asking tserver2 to load the file
7. tserver2 starts to load the same file a second time
8. the garbage collector deletes the file
9. tserver2 attempts to major compact the file away, but now it does not exist
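
For illustration, here is a minimal sketch of the caller-side retry pattern that can trigger this; the names (RemoteLoader, loadFile) are hypothetical stand-ins, not the actual Accumulo classes or RPCs:

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only: a lost or late response plus a retry can make
// tserver1 re-send a load that tserver2 already applied.
public class RetryRaceSketch {

  interface RemoteLoader {
    // asks the destination tserver to load one bulk file into a tablet
    void loadFile(String tablet, String file, long timeoutMs) throws TimeoutException;
  }

  static void assignFile(RemoteLoader tserver2, String tablet, String file) {
    while (true) {
      try {
        tserver2.loadFile(tablet, file, 30_000);
        return; // success acknowledged
      } catch (TimeoutException e) {
        // tserver2 may have already loaded the file, major compacted it, and
        // written a ~del entry before this timeout fired; retrying re-loads a
        // file the garbage collector is now free to delete.
      }
    }
  }
}
```

The key point is that a timeout on the caller side says nothing about whether the load actually happened on tserver2.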

@keith-turner keith-turner self-assigned this Dec 5, 2018

@keith-turner

Contributor

commented Dec 7, 2018

There are mechanisms in Accumulo that are supposed to prevent this. Not sure where those are falling down. I am going to create an IT that attempts to reproduce this, by making the IT directly call the RPC to load a file. If that does not yield anything, then I will review the code paths for preventing multiple loads.

@keith-turner

Contributor

commented Dec 14, 2018

Looking at the code, I found a potential problem. Each bulk import has a unique FATE transaction id stored in ZooKeeper. There is code that prevents RPCs related to the bulk import from running if the FATE transaction id is deleted from ZK. There is also code to wait for all active RPCs to finish. So the bulk import FATE op will delete the id from ZK and then wait for all tservers to complete any RPCs that were active before it was deleted.
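
As a rough, in-process model of that gating (the real code uses ZooKeeper and TransactionWatcher; this sketch only paraphrases the behavior described above and is not the real class):

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model: an RPC for a bulk import transaction only runs while the
// transaction id still exists, and a counter tracks in-flight RPCs so the
// FATE op can wait for them to finish after deleting the id.
public class TxGateSketch {
  private final Map<Long, AtomicInteger> active = new ConcurrentHashMap<>();
  private final Map<Long, Boolean> txExists = new ConcurrentHashMap<>();

  public void beginTx(long tid) {
    txExists.put(tid, true);
  }

  // FATE side: delete the id, then wait until no RPCs for it are in flight
  public void completeTx(long tid) throws InterruptedException {
    txExists.remove(tid);
    while (active.getOrDefault(tid, new AtomicInteger(0)).get() > 0) {
      Thread.sleep(50);
    }
  }

  // tserver side: run an RPC body only while the transaction is still live
  public <T> T run(long tid, Callable<T> rpcBody) throws Exception {
    if (!txExists.containsKey(tid)) {
      throw new IllegalStateException("transaction " + tid + " is no longer active");
    }
    AtomicInteger count = active.computeIfAbsent(tid, k -> new AtomicInteger(0));
    count.incrementAndGet();
    try {
      return rpcBody.call();
    } finally {
      count.decrementAndGet();
    }
  }
}
```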

The bulk import FATE op makes RPCs to an intermediate tserver that inspects files. Once the intermediate tserver determines where a file goes, it makes an RPC to the tserver that should load the file. The problem is that only the intermediate RPC checks whether the transaction id is active. Really, the final RPC should be doing this check.

Below are some places in the code where this is all happening.

  • CompleteBulkImport line 42: deletes the transaction id from ZK
  • CopyFailed line 75: waits for active RPCs to complete
  • ClientServiceHandler line 345: runs the intermediate RPC to inspect files. This RPC is run using TransactionWatcher. TransactionWatcher will not run the RPC if the id does not exist in ZK; if it does exist, it increments a counter for the number of RPCs running for that transaction.
  • TabletServer line 463: this is the RPC that actually loads a file into a tablet. This RPC is not run using TransactionWatcher, and I think it should be.

I think the intermediate RPC should stop using TransactionWatcher and the final RPC should start using TransactionWatcher. I don't think there is a benefit to the intermediate one using it and it puts more load on ZK.
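
Using the simplified gate from the sketch above, the proposed change would roughly look like this; inspectFiles() and loadFiles() are hypothetical stand-ins for the real handlers in ClientServiceHandler and TabletServer:

```java
// Sketch only: shows the shape of the proposed change in terms of TxGateSketch.
public class BulkRpcsSketch {
  private final TxGateSketch gate = new TxGateSketch();

  // current behavior: only the intermediate "inspect files" RPC is gated
  void currentBehavior(long tid) throws Exception {
    gate.run(tid, () -> inspectFiles());
    loadFiles(); // ungated: can still run after the FATE op deleted the tid
  }

  // proposed behavior: gate the final "load files" RPC instead
  void proposedBehavior(long tid) throws Exception {
    inspectFiles(); // no gate needed here, which also reduces load on ZK
    gate.run(tid, () -> loadFiles());
  }

  private Object inspectFiles() { return "inspected"; } // placeholder
  private Object loadFiles() { return "loaded"; }       // placeholder
}
```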

@keith-turner

Contributor

commented Dec 14, 2018

Also, I have been working on an IT that sends a load message to a tablet, compacts the tablet, and then sends another load message. This IT shows that the current code that tracks loads prevents a load after a compaction.
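
A skeleton of that test flow might look like the following; the helper methods are hypothetical stand-ins, not the actual test utilities or Thrift calls:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical outline of the IT described above: load, compact, load again,
// then check that the second load did not re-add the file.
public class DuplicateLoadSketchIT {

  @Test
  public void loadCompactLoadDoesNotDuplicate() throws Exception {
    long tid = 42L;              // illustrative bulk transaction id
    String file = "/bulk/f1.rf"; // illustrative bulk file path

    loadFileRpc(tid, file); // first load succeeds
    majorCompact();         // tablet compacts; the file gets a ~del entry
    loadFileRpc(tid, file); // retried load message

    // the load tracking in the tserver should ignore the second load
    assertEquals(0, countFileReferences(file));
  }

  private void loadFileRpc(long tid, String file) { /* placeholder */ }
  private void majorCompact() { /* placeholder */ }
  private int countFileReferences(String file) { return 0; } // placeholder
}
```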

keith-turner added a commit to keith-turner/accumulo that referenced this issue Dec 18, 2018

@ctubbsii ctubbsii added this to Done in 1.9.3 Jun 14, 2019

@ctubbsii ctubbsii added this to Done in 2.0.0 Jun 14, 2019
