
OAK-10006 Prevent writes to Azure storage if lock is not held #760

Draft. Wants to merge 5 commits into base branch trunk from OAK-10006_prevent_writes_after_lease_lost.
Conversation

@jelmini (Contributor) commented on Nov 18, 2022:

If the lock is lost to another instance, all write attempts should fail with an exception.
If the lock fails to be renewed, all writes are suspended (blocked) until the renewal succeeds.
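The described behaviour can be sketched as a small write-access guard: writers block while renewal is failing and fail fast once the lease is lost. This is a minimal illustration; the class name and method signatures below are assumptions, not necessarily the code in this PR (only `notifyLockStatusChange` and `LockStatus.RENEWAL_FAILED` are mentioned later in this conversation).

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class WriteAccessController {
    public enum LockStatus { ACQUIRED, RENEWED, RENEWAL_FAILED, LOST }

    private volatile LockStatus status = LockStatus.ACQUIRED;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition statusChanged = lock.newCondition();

    // Called by the lease-renewal routine whenever the lock status changes.
    public void notifyLockStatusChange(LockStatus newStatus) {
        lock.lock();
        try {
            status = newStatus;
            statusChanged.signalAll(); // wake up any suspended writers
        } finally {
            lock.unlock();
        }
    }

    // Called before every write to the remote storage.
    public void checkWritingAllowed() {
        lock.lock();
        try {
            while (status == LockStatus.RENEWAL_FAILED) {
                // writes are suspended until the renewal succeeds
                statusChanged.awaitUninterruptibly();
            }
            if (status == LockStatus.LOST) {
                throw new IllegalStateException(
                        "Lease lost to another instance: writes are not allowed");
            }
        } finally {
            lock.unlock();
        }
    }
}
```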


```java
void doRenewLease() throws StorageException {
    BlobRequestOptions blobRequestOptions = new BlobRequestOptions();
    blobRequestOptions.setMaximumExecutionTimeInMs(Math.max(INTERVAL - renewalInterval - 1, INTERVAL / 2 - 1));
```
A reviewer (Contributor) commented:

Could you add a comment here that explains this?

@jelmini (Contributor, Author) replied:

Ok, I will add a comment.
In summary, previously renewLease() was called with the default BlobRequestOptions. However, those defaults can be set globally for the client instance and may not be a good fit for calling renewLease(), as it could block for too long. Since the lease is acquired for 60 seconds and renewLease() is called by default every 30 seconds, I think it should time out after at most 29 seconds (giving a 1-second margin), so that we can call notifyLockStatusChange(LockStatus.RENEWAL_FAILED) and suspend writes before the lease becomes available to be acquired by another instance.
BTW, I have just noticed that the values should be in millis and thus multiplied by 1000...
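Put together, the corrected computation (including the seconds-to-millis conversion noted above) could look like the sketch below. INTERVAL and renewalInterval come from the snippet under discussion; the helper name is illustrative, not code from the PR.

```java
public class LeaseTimeouts {
    // INTERVAL (lease duration) and renewalInterval are expressed in
    // seconds, but setMaximumExecutionTimeInMs expects milliseconds,
    // hence the multiplication by 1000.
    static int executionTimeoutMs(int intervalSec, int renewalIntervalSec) {
        // Keep a 1-second margin so RENEWAL_FAILED can be signalled and
        // writes suspended before the lease expires and becomes available
        // to another instance.
        return Math.max(intervalSec - renewalIntervalSec - 1,
                        intervalSec / 2 - 1) * 1000;
    }
}
```

With INTERVAL = 60 and renewalInterval = 30, this yields a 29,000 ms maximum execution time, matching the "at most 29 seconds" reasoning above.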

```java
case LOST:
    try {
        if (tarFiles != null) {
            tarFiles.close();
```
@ahanikel (Contributor) commented:

This will write the graph and references to the tar file (archive.close() in TarWriter), shouldn't we avoid that? Also it will try to flush the write queue (q.flush() in AbstractRemoteSegmentArchiveWriter), so if it's not empty, it will wait there forever, right?

@jelmini (Contributor, Author) replied:

Right. Thanks for spotting this, @ahanikel !
I suppose we should add an unsafeShutdown() method to TarFiles, TarWriter and AbstractRemoteSegmentArchiveWriter, so that we can shut them down without attempting any finalisation writes to the remote storage, which would anyway fail or corrupt the data.
What do you think?
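The distinction between a regular close() and the proposed unsafeShutdown() could be sketched as follows. This is only an illustration of the idea under discussion: the class and its members are hypothetical stand-ins, not the actual TarFiles/TarWriter/AbstractRemoteSegmentArchiveWriter code.

```java
public class RemoteArchiveWriter {
    private volatile boolean closed = false;
    private int pendingWrites = 3; // simulated non-empty write queue

    // Regular close(): flushes the write queue and performs finalisation
    // writes (graph and references) to the remote storage. This is what
    // should be avoided once the lease is lost, since those writes would
    // fail or corrupt the data.
    public void close() {
        flushQueue();
        writeGraphAndReferences();
        closed = true;
    }

    // Unsafe shutdown: release resources without flushing the queue and
    // without touching the remote storage at all.
    public void unsafeShutdown() {
        pendingWrites = 0; // drop queued entries instead of flushing them
        closed = true;
    }

    private void flushQueue() { pendingWrites = 0; }
    private void writeGraphAndReferences() { /* remote writes omitted */ }

    public boolean isClosed() { return closed; }
    public int pendingWrites() { return pendingWrites; }
}
```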

@jelmini force-pushed the OAK-10006_prevent_writes_after_lease_lost branch from 067b62e to d434cde on December 22, 2022.