Conversation

@dlmarion
Contributor

Closes #4559

@dlmarion dlmarion self-assigned this May 15, 2024
@dlmarion dlmarion linked an issue May 15, 2024 that may be closed by this pull request
@dlmarion dlmarion requested a review from EdColeman May 15, 2024 19:02
@EdColeman
Contributor

EdColeman commented May 15, 2024

I started to comment on the loop where the lock data is read from getChildren with the following:

Would it be worth it to wrap this call with another try...catch (Keeper.NO_NODE ex) to allow it to handle the case where the ephemeral lock was removed while in the main getChildren loop? With no lock node, it could either delete the host:port node then, or at least continue processing the other nodes. As is, it will retry, but handling NO_NODE could make it more responsive by processing the remaining nodes in the list.

Taking no action and allowing that node to be processed on the next try would be safer.

But could there be a race condition between the server lock code and this cleaner? If the server lock creates the host:port node and then writes the lock, there will be a period where the lock does not exist but host:port is expected to be there. What would happen if the cleaner deletes host:port and then the server lock write is attempted?

It may be possible to use the creation time of the host:port node (ZK stat ctime) and check that it is older than the loop retry period. This would delay the removal for at least one cleaner cycle. Or, the service lock code could try to recreate the host:port node.
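To make the two ideas concrete, here is a minimal sketch of a cleanup pass that tolerates a NO_NODE race and applies the ctime age check. It uses the raw ZooKeeper client rather than the ZooCache/ServiceLock wrappers the PR actually touches, and the class name, scanServersRoot path, and MIN_AGE_MS threshold are illustrative only:

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.KeeperException.NoNodeException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ScanServerCleanerSketch {

  // Hypothetical threshold: how long a host:port node must exist before it is
  // eligible for removal (at least one cleaner cycle).
  private static final long MIN_AGE_MS = 60_000;

  static void cleanOrphanedNodes(ZooKeeper zk, String scanServersRoot)
      throws KeeperException, InterruptedException {
    List<String> hostPorts = zk.getChildren(scanServersRoot, false);
    for (String hostPort : hostPorts) {
      String nodePath = scanServersRoot + "/" + hostPort;
      try {
        // No ephemeral lock children under host:port means the server is gone.
        List<String> locks = zk.getChildren(nodePath, false);
        if (!locks.isEmpty()) {
          continue; // a lock exists, the server is alive
        }
        Stat stat = zk.exists(nodePath, false);
        if (stat == null) {
          continue; // node vanished since the outer getChildren; nothing to do
        }
        // Only delete nodes older than one cleaner cycle, so a server that just
        // created host:port but has not yet written its lock is left alone.
        long ageMs = System.currentTimeMillis() - stat.getCtime();
        if (ageMs > MIN_AGE_MS) {
          zk.delete(nodePath, stat.getVersion());
        }
      } catch (NoNodeException e) {
        // The node was removed while we were iterating; move on to the next
        // entry instead of failing the whole pass.
      }
    }
  }
}
```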

@EdColeman
Contributor

I did test the code using a small, single-node instance with multiple scan servers - the scan server entries were removed as expected on a cluster restart and when killing individual scan server processes.

@EdColeman
Contributor

EdColeman left a comment

I tested the code, it works as expected, and it is consistent with similar code elsewhere. My reservations regarding a potential race condition are optional to address.

@EdColeman
Contributor

Another mitigation may be to delay the first run of the cleaners so that initialization and scan server assignments have a better chance to complete before the cleaner runs - that way seeing a lock as it is being built should not occur.
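A minimal sketch of what such an initial delay could look like, assuming the cleaner is scheduled as a Runnable on a ScheduledExecutorService; the task body and the five-minute intervals are placeholders, not values taken from Accumulo:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CleanerScheduling {
  public static void main(String[] args) {
    // Stand-in for the real cleanup task.
    Runnable cleaner = () -> System.out.println("cleaner pass");

    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // A non-zero initial delay gives initialization and scan server lock
    // creation a chance to finish before the first cleanup pass runs.
    scheduler.scheduleWithFixedDelay(cleaner, 5, 5, TimeUnit.MINUTES);
  }
}
```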

@dlmarion
Contributor Author

> Another mitigation may be to delay the first run of the cleaners so that initialization and scan server assignments have a better chance to complete before the cleaner runs - that way seeing a lock as it is being built should not occur.

The Manager and CompactionCoordinator don't have an initial delay when calling LiveTServerSet.startListeningForTabletServerChanges or CompactionCoordinator.startCompactionCleaner. I could make this change, but if we do, then we should do it for all of them for the same reason.

@dlmarion
Contributor Author

> I started to comment on the loop where the lock data is read from getChildren with the following:
>
> Would it be worth it to wrap this call with another try...catch (Keeper.NO_NODE ex) to allow it to handle the case where the ephemeral lock was removed while in the main getChildren loop? With no lock node, it could either delete the host:port node then, or at least continue processing the other nodes. As is, it will retry, but handling NO_NODE could make it more responsive by processing the remaining nodes in the list.
>
> Taking no action and allowing that node to be processed on the next try would be safer.
>
> But could there be a race condition between the server lock code and this cleaner? If the server lock creates the host:port node and then writes the lock, there will be a period where the lock does not exist but host:port is expected to be there. What would happen if the cleaner deletes host:port and then the server lock write is attempted?
>
> It may be possible to use the creation time of the host:port node (ZK stat ctime) and check that it is older than the loop retry period. This would delay the removal for at least one cleaner cycle. Or, the service lock code could try to recreate the host:port node.

Are you suggesting wrapping the following line with a try/catch to catch Keeper.NO_NODE?

            byte[] lockData = ServiceLock.getLockData(getContext().getZooCache(), zLockPath, stat);

I don't think that method throws that Exception.

@EdColeman
Contributor

You are correct - it does not throw NO_NODE. I incorrectly assumed that getLockData would echo the ZK exceptions, but it does not.
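If getLockData instead signals a missing lock by returning null (an assumption based on this exchange, not verified against the Accumulo source), the absent-lock case could be handled with a null check rather than a try/catch; the variables are those from the line quoted above:

```java
// Hedged sketch, not the PR's actual handling: assumes getLockData returns null
// when the lock node is absent rather than throwing a ZooKeeper exception.
byte[] lockData = ServiceLock.getLockData(getContext().getZooCache(), zLockPath, stat);
if (lockData == null) {
  // No lock under this host:port entry; skip it this pass, or delete it only if
  // it is older than one cleaner cycle (the ctime idea discussed above).
}
```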

@dlmarion dlmarion merged commit 4b5234b into apache:2.1 May 16, 2024
@dlmarion dlmarion deleted the 4559-manager-sserver-cleanup branch May 16, 2024 18:54
@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024

Development

Successfully merging this pull request may close these issues.

Scan Server ZooKeeper entries are not removed on shutdown.
