Added ZK cleanup thread to Manager for Scan Server nodes #4562
server/manager/src/main/java/org/apache/accumulo/manager/Manager.java
I started to comment on the loop where the lock data was read in the loop from
Taking no action and allowing that node to be processed on the next try would be safer. But could there be a race condition between the server lock code and this cleaner? If the server lock code creates the host:port node and then writes the lock, there will be a period where the lock does not exist but host:port is expected to be there. What would happen if the cleaner deletes host:port and the server lock write is then attempted? It may be possible to use the creation time of the host:port node (ZK stat ctime) and check that it is older than the loop retry period. This would delay the removal for at least one cleaner cycle. Alternatively, the service lock code could try to recreate the host:port node.
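To make the ctime idea concrete, here is a minimal sketch of the age check. It is not code from this PR: the class and method names are hypothetical, and it assumes the caller has already read the node's creation time (for example from ZooKeeper's `Stat.getCtime()`, which reports milliseconds since the epoch) and already knows whether a lock exists under the node.

```java
// Hypothetical helper, not part of Accumulo: decides whether a host:port
// node is old enough for the cleaner to remove it.
final class StaleNodeCheck {

  /**
   * Returns true only if the node holds no lock AND was created more than one
   * cleaner period ago. A node younger than one period may still be waiting
   * for its lock to be written, so it is left for the next cleaner cycle.
   */
  static boolean safeToDelete(long ctimeMillis, long nowMillis, long cleanerPeriodMillis,
      boolean hasLock) {
    if (hasLock) {
      return false; // a live server owns this node
    }
    long ageMillis = nowMillis - ctimeMillis;
    return ageMillis > cleanerPeriodMillis;
  }
}
```

With a check like this, a node created just before the cleaner runs survives that run and is only removed on a later cycle if it still has no lock.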
I did test the code using a small, single-node instance with multiple scan servers. The scan server entries were removed, as expected, both on a cluster restart and when killing individual scan server processes.
EdColeman
left a comment
I tested the code and it works as expected, and it follows the same pattern as other code. My reservations regarding a potential race condition are optional.
Another mitigation may be to delay the first run of the cleaners so that initialization and scan server assignments have a better chance to complete before the cleaner runs. That way, the cleaner should not see a lock while it is still being built.
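As a rough illustration of the delayed first run, the standard `ScheduledExecutorService.scheduleWithFixedDelay` call accepts an initial delay separate from the repeat period. The wrapper class and method below are hypothetical and only show the shape of the call; how the Manager actually schedules its cleaner is not assumed here.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper, not part of Accumulo: schedules a cleaner task so that
// its first run happens only after an initial delay.
final class DelayedCleanerStart {

  static ScheduledFuture<?> schedule(ScheduledExecutorService pool, Runnable cleaner,
      long initialDelayMillis, long periodMillis) {
    // The first execution waits initialDelayMillis; later executions start
    // periodMillis after the previous run finishes.
    return pool.scheduleWithFixedDelay(cleaner, initialDelayMillis, periodMillis,
        TimeUnit.MILLISECONDS);
  }
}
```

Choosing an initial delay at least as long as the expected scan server startup time would give new servers a chance to write their locks before the first cleanup pass.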
The Manager and CompactionCoordinator don't have an initial delay when calling
Are you suggesting wrapping the following line with a try/catch to catch Keeper.NO_NODE? I don't think that method throws that exception.
You are correct: it does not throw NO_NODE. I incorrectly assumed that getLockData would echo the ZK exceptions, but it does not,
Closes #4559