Conversation

@dlmarion
Contributor

Closes #4559

@dlmarion dlmarion self-assigned this May 15, 2024
@dlmarion dlmarion linked an issue May 15, 2024 that may be closed by this pull request
@dlmarion dlmarion requested a review from EdColeman May 15, 2024 19:02
@EdColeman
Contributor

EdColeman commented May 15, 2024

I started to comment on the loop where the lock data is read from getChildren with the following:

Would it be worth it to wrap this call with another try...catch (Keeper.NO_NODE ex) to allow it to handle the case where the ephemeral lock was removed while in the main getChildren loop? With no lock node, it could either delete the host:port node then, or at least continue processing the other nodes. As is, it will retry, but handling NO_NODE could make it more responsive by processing the remaining nodes in the list.

Taking no action and allowing that node to be processed on the next try would be safer.

But could there be a race condition between the server lock code and this cleaner? If the server lock creates the host:port node and then writes the lock, there will be a period where the lock does not exist but host:port is expected to be there. What would happen if the cleaner deletes host:port and then the server lock write is attempted?

It may be possible to use the creation time of the host:port node (ZK stat ctime) and check that it is older than the loop retry period. This would delay the removal for at least one cleaner cycle. Or, the service lock code could try to recreate the host:port node.
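To make the two ideas concrete, here is a minimal sketch of a cleanup pass that tolerates a NO_NODE race and applies the ctime age check. It uses the raw ZooKeeper client rather than the ZooCache/ServiceLock wrappers the PR actually touches, and the class name, scanServersRoot path, and MIN_AGE_MS threshold are illustrative only:

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.KeeperException.NoNodeException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ScanServerCleanerSketch {

  // Hypothetical threshold: how long a host:port node must exist before it is
  // eligible for removal (at least one cleaner cycle).
  private static final long MIN_AGE_MS = 60_000;

  static void cleanOrphanedNodes(ZooKeeper zk, String scanServersRoot)
      throws KeeperException, InterruptedException {
    List<String> hostPorts = zk.getChildren(scanServersRoot, false);
    for (String hostPort : hostPorts) {
      String nodePath = scanServersRoot + "/" + hostPort;
      try {
        // No ephemeral lock children under host:port means the server is gone.
        List<String> locks = zk.getChildren(nodePath, false);
        if (!locks.isEmpty()) {
          continue; // a lock exists, the server is alive
        }
        Stat stat = zk.exists(nodePath, false);
        if (stat == null) {
          continue; // node vanished since the outer getChildren; nothing to do
        }
        // Only delete nodes older than one cleaner cycle, so a server that just
        // created host:port but has not yet written its lock is left alone.
        long ageMs = System.currentTimeMillis() - stat.getCtime();
        if (ageMs > MIN_AGE_MS) {
          zk.delete(nodePath, stat.getVersion());
        }
      } catch (NoNodeException e) {
        // The node was removed while we were iterating; move on to the next
        // entry instead of failing the whole pass.
      }
    }
  }
}
```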

@EdColeman
Contributor

I did test the code using a small, single-node instance with multiple scan servers - the scan server entries were removed as expected on a cluster restart and when killing individual scan server processes.

@EdColeman
Contributor

EdColeman left a comment

I tested the code, it works as expected, and it is consistent with similar code elsewhere. My reservations regarding a potential race condition are optional to address.

@EdColeman
Contributor

Another mitigation may be to delay the first run of the cleaners so that initialization and scan server assignments have a better chance to complete before the cleaner runs - that way seeing a lock as it is being built should not occur.
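A minimal sketch of what such an initial delay could look like, assuming the cleaner is scheduled as a Runnable on a ScheduledExecutorService; the task body and the five-minute intervals are placeholders, not values taken from Accumulo:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CleanerScheduling {
  public static void main(String[] args) {
    // Stand-in for the real cleanup task.
    Runnable cleaner = () -> System.out.println("cleaner pass");

    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // A non-zero initial delay gives initialization and scan server lock
    // creation a chance to finish before the first cleanup pass runs.
    scheduler.scheduleWithFixedDelay(cleaner, 5, 5, TimeUnit.MINUTES);
  }
}
```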

@dlmarion
Contributor Author

> Another mitigation may be to delay the first run of the cleaners so that initialization and scan server assignments have a better chance to complete before the cleaner runs - that way seeing a lock as it is being built should not occur.

The Manager and CompactionCoordinator don't have an initial delay when calling LiveTServerSet.startListeningForTabletServerChanges or CompactionCoordinator.startCompactionCleaner. I could make this change, but if we do, then we should do it for all of them for the same reason.

@dlmarion
Contributor Author

> I started to comment on the loop where the lock data is read from getChildren with the following:
>
> Would it be worth it to wrap this call with another try...catch (Keeper.NO_NODE ex) to allow it to handle the case where the ephemeral lock was removed while in the main getChildren loop? With no lock node, it could either delete the host:port node then, or at least continue processing the other nodes. As is, it will retry, but handling NO_NODE could make it more responsive by processing the remaining nodes in the list.
>
> Taking no action and allowing that node to be processed on the next try would be safer.
>
> But could there be a race condition between the server lock code and this cleaner? If the server lock creates the host:port node and then writes the lock, there will be a period where the lock does not exist but host:port is expected to be there. What would happen if the cleaner deletes host:port and then the server lock write is attempted?
>
> It may be possible to use the creation time of the host:port node (ZK stat ctime) and check that it is older than the loop retry period. This would delay the removal for at least one cleaner cycle. Or, the service lock code could try to recreate the host:port node.

Are you suggesting wrapping the following line with a try/catch to catch Keeper.NO_NODE?

            byte[] lockData = ServiceLock.getLockData(getContext().getZooCache(), zLockPath, stat);

I don't think that method throws that Exception.

@EdColeman
Contributor

You are correct - it does not throw NO_NODE. I incorrectly assumed that getLockData would echo the ZK exceptions, but it does not.
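If getLockData instead signals a missing lock by returning null (an assumption based on this exchange, not verified against the Accumulo source), the absent-lock case could be handled with a null check rather than a try/catch; the variables are those from the line quoted above:

```java
// Hedged sketch, not the PR's actual handling: assumes getLockData returns null
// when the lock node is absent rather than throwing a ZooKeeper exception.
byte[] lockData = ServiceLock.getLockData(getContext().getZooCache(), zLockPath, stat);
if (lockData == null) {
  // No lock under this host:port entry; skip it this pass, or delete it only if
  // it is older than one cleaner cycle (the ctime idea discussed above).
}
```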

@dlmarion dlmarion merged commit 4b5234b into apache:2.1 May 16, 2024
@dlmarion dlmarion deleted the 4559-manager-sserver-cleanup branch May 16, 2024 18:54
@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024

Development

Successfully merging this pull request may close these issues.

Scan Server ZooKeeper entries are not removed on shutdown.
