Block repository writes if repo lock is lost #1204
Conversation
I am wondering what happens if:
@@ -100,7 +105,11 @@ private void refreshLease() {
    try {
        long timeSinceLastUpdate = (System.currentTimeMillis() - lastUpdate) / 1000;
        if (timeSinceLastUpdate > INTERVAL / 2) {
            writeAccessController.disableWriting();
What about the performance impact of this change?
My concern is that we are blocking writes at every renewal, even though the lease has not expired and writes are still safe. Even a successful call to blob.acquireLease() could take a long time if Azure is slow.
I would suggest introducing a RenewalDeadline timeout while increasing the rate of renewals: for example, if the lease duration is 60 seconds, we could renew the lease every 5 seconds, but only block writes once a deadline of 40 seconds has passed without a successful renewal.
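In pseudocode, roughly (a sketch only; RENEWAL_INTERVAL_SEC, RENEWAL_DEADLINE_SEC, the lastSuccessfulRenewal field, and the sleepSeconds helper are illustrative names, not code from this PR):

```java
// Sketch: renew often, but only block writes once a deadline has passed
// without a single successful renewal.
private static final int LEASE_DURATION_SEC = 60;
private static final int RENEWAL_INTERVAL_SEC = 5;   // attempt a renewal every 5s
private static final int RENEWAL_DEADLINE_SEC = 40;  // block writes only after 40s without one

private volatile long lastSuccessfulRenewal = System.currentTimeMillis();

private void refreshLease() {
    while (doUpdate) {
        try {
            blob.renewLease(AccessCondition.generateLeaseCondition(leaseId));
            lastSuccessfulRenewal = System.currentTimeMillis();
            writeAccessController.enableWriting();
        } catch (StorageException e) {
            long secondsWithoutRenewal =
                    (System.currentTimeMillis() - lastSuccessfulRenewal) / 1000;
            if (secondsWithoutRenewal > RENEWAL_DEADLINE_SEC) {
                // only now do writers pay the cost of being blocked
                writeAccessController.disableWriting();
            }
        }
        sleepSeconds(RENEWAL_INTERVAL_SEC);  // hypothetical helper
    }
}
```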
> Even a successful call to blob.acquireLease() could take a long time if Azure is slow.
It is possible to do that optimization, but if Azure, as you say, is slow, then the threads in SegmentWriteQueue will be slow to write to the remote repo as well.
If we block writes only 40 seconds after issuing blob.renewLease, the lease will by then be 5 + 40 seconds old. A writer thread might pass the WriteAccessController#checkWritingAllowed condition 44 seconds after the last lease renewal, struggle for the next 20 seconds to write the segment, and eventually succeed, but by that time the lease has already been acquired by another Oak process.
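To spell out the race with those numbers (illustrative pseudocode, not code from this PR; writeSegment stands in for whatever the writer thread actually calls):

```java
// lease duration 60s, renewals every 5s, writes blocked 40s after renewLease
writeAccessController.checkWritingAllowed(); // t = 44s: passes, writes not yet blocked
writeSegment(data);                          // Azure is slow: takes ~20s, finishes at t ≈ 64s
// by t = 60s the lease has expired and another Oak process may already hold it,
// so this late write races with writes from the new lease holder
```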
5 and 40 seconds were just examples and could be adjusted to, say, 10 and 30.
With those values, under my proposal we would block writes only after lease renewal has failed multiple times (after the first renewal failure with an OperationTimeout, the retry happens immediately), but at most 30 seconds after the last successful renewal.
Currently we renew only after 30 seconds, which seems too late to recover from safely; that's why I argue we should renew more often and have a separate deadline timeout that blocks writes well before the lease expires.
Also see my other comment, where I argue we should block the SegmentWriteQueue as well, to prevent the scenario you mention of a writer thread still trying to write because it passed the check just before the block.
@ahanikel that should not be possible
What about the …
No: those writes and journal updates would compete with writes from the new Oak process that has successfully acquired the lease.
It is also possible to configure synchronous writes when there is no queue (Line 59 in e96cfcb).
The check happens in the method below, which is also invoked when the queue is not configured (Line 62 in e96cfcb).
I'm not sure I understand. As far as I can see, …
Threads picking up items from the queue eventually call AzureSegmentArchiveWriter#doWriteArchiveEntry, and writeAccessController.checkWritingAllowed() is invoked there.
Ah, now I understand! Thanks!
this.shutdownHook = shutdownHook;
this.blob = blob;
this.executor = Executors.newSingleThreadExecutor();
this.timeoutSec = timeoutSec;
this.writeAccessController = writeAccessController;
if (INTERVAL < RENEWAL_FREQUENCY || INTERVAL < TIME_TO_WAIT_BEFORE_WRITE_BLOCK) {
I would also check that RENEWAL_FREQUENCY is lower than TIME_TO_WAIT_BEFORE_WRITE_BLOCK (which actually makes the check INTERVAL < RENEWAL_FREQUENCY redundant).
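Something like this (a sketch, keeping the constant names from this PR):

```java
// Enforce RENEWAL_FREQUENCY < TIME_TO_WAIT_BEFORE_WRITE_BLOCK < INTERVAL:
// renewals must be attempted more often than the write-block deadline fires,
// and the deadline must fire before the lease itself expires.
if (RENEWAL_FREQUENCY >= TIME_TO_WAIT_BEFORE_WRITE_BLOCK
        || TIME_TO_WAIT_BEFORE_WRITE_BLOCK >= INTERVAL) {
    throw new IllegalStateException(String.format(
            "Invalid lease settings: expected renewal frequency (%ds) < "
                    + "write block timeout (%ds) < lease duration (%ds)",
            RENEWAL_FREQUENCY, TIME_TO_WAIT_BEFORE_WRITE_BLOCK, INTERVAL));
}
```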
private static int INTERVAL = Integer.getInteger(INTERVAL_PROP, 60);
public static final String RENEWAL_FREQUENCY_PROP = "oak.segment.azure.lock.leaseRenewalFrequency";
private static int RENEWAL_FREQUENCY = Integer.getInteger(RENEWAL_FREQUENCY_PROP, 5);
nitpick: the name frequency might be misleading, as a frequency is a number of events per unit of time, like 5 renewals per second, but here we actually mean a renewal every 5 seconds. Maybe RENEWAL_INTERVAL?
private static int INTERVAL = 60;
public static final String INTERVAL_PROP = "oak.segment.azure.lock.leaseDuration";
For all the new system properties, I would indicate in the name whether the value is in seconds or millis, otherwise it can be confusing. Something like oak.segment.azure.lock.leaseDurationInSec.
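For example (illustrative only, keeping the current 60s default):

```java
// unit encoded in the property name instead of being implicit
public static final String INTERVAL_PROP = "oak.segment.azure.lock.leaseDurationInSec";
private static int INTERVAL = Integer.getInteger(INTERVAL_PROP, 60);
```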
@@ -465,6 +475,80 @@ public void testCollectBlobReferencesDoesNotFailWhenFileIsMissing() throws URISy
        }
    }

    @Test
    public void testWriteAfterLoosingRepoLock() throws URISyntaxException, InvalidFileStoreVersionException, IOException, CommitFailedException, StorageException, InterruptedException {
Typo: should be losing
@@ -59,9 +58,9 @@ public void setup() throws StorageException, InvalidKeyException, URISyntaxExcep
    @Test
Suggestion: add a test to validate the behaviour where it blocks writes after several renewals have failed.
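A possible shape for such a test (a sketch only: the AzureRepositoryLock constructor argument order, the WriteAccessController no-arg constructor, and the 35-second wait past an assumed 30s write-block deadline are all assumptions, and the mocks are illustrative):

```java
import static java.util.concurrent.TimeUnit.SECONDS;
import static org.junit.Assert.assertFalse;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.*;

import java.util.concurrent.atomic.AtomicBoolean;

import com.microsoft.azure.storage.StorageException;
import com.microsoft.azure.storage.blob.CloudBlockBlob;
import org.junit.Test;

public class AzureRepositoryLockRenewalFailureTest {

    @Test
    public void testWritesBlockedAfterRepeatedRenewalFailures() throws Exception {
        CloudBlockBlob blob = mock(CloudBlockBlob.class);
        when(blob.acquireLease(any(), any())).thenReturn("lease-id");
        // every renewal attempt fails, as if Azure were unreachable
        doThrow(new StorageException("OperationTimedOut", "simulated timeout", 500, null, null))
                .when(blob).renewLease(any());

        WriteAccessController controller = new WriteAccessController();
        // assumed constructor: (blob, shutdownHook, writeAccessController, timeoutSec)
        new AzureRepositoryLock(blob, () -> {}, controller, 0).lock();

        // wait past the write-block deadline (assumed 30s) plus some slack
        Thread.sleep(SECONDS.toMillis(35));

        AtomicBoolean writerGotThrough = new AtomicBoolean(false);
        Thread writer = new Thread(() -> {
            controller.checkWritingAllowed(); // should park until writing is re-enabled
            writerGotThrough.set(true);
        });
        writer.start();
        writer.join(1000);

        assertFalse("writes should be blocked after repeated renewal failures",
                writerGotThrough.get());
    }
}
```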
@@ -465,6 +475,80 @@ public void testCollectBlobReferencesDoesNotFailWhenFileIsMissing() throws URISy
        }
    }

    @Test
    public void testWriteAfterLoosingRepoLock() throws URISyntaxException, InvalidFileStoreVersionException, IOException, CommitFailedException, StorageException, InterruptedException {
polish: I suggest replacing the long list of exceptions in the throws clause with just throws Exception.
// wait till lease expires
Thread.sleep(70000);
Now that we can configure the lease duration with the system property oak.segment.azure.lock.leaseDuration, I would set it to the minimum of 15 seconds, so that we can reduce the sleep time here and make this test take less time. Probably we can do the same in AzureRepositoryLockTests as well.
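For example, in the test setup (a sketch assuming JUnit 4 @Before/@After in this test class; 15 seconds is the minimum lease duration Azure Blob Storage supports):

```java
@Before
public void shortenLease() {
    // shorten the lease so the expiry test doesn't need to sleep 70s
    System.setProperty("oak.segment.azure.lock.leaseDuration", "15");
}

@After
public void restoreLease() {
    // don't leak the shortened lease into other tests
    System.clearProperty("oak.segment.azure.lock.leaseDuration");
}
```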
this.isWritingAllowed = true;

synchronized (this) {
    this.notifyAll();
Nitpick: even though the risk here is small, it's generally best practice to avoid locking on this. I suggest using a private final Object lock = new Object();
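i.e. roughly (a sketch of the whole controller with a dedicated lock object; enableWriting/disableWriting/checkWritingAllowed are the names used in this PR, the bodies are illustrative):

```java
public class WriteAccessController {

    private volatile boolean isWritingAllowed = false;
    private final Object lock = new Object();  // dedicated monitor instead of 'this'

    public void enableWriting() {
        isWritingAllowed = true;
        synchronized (lock) {
            lock.notifyAll();  // wake up writers parked in checkWritingAllowed()
        }
    }

    public void disableWriting() {
        isWritingAllowed = false;
    }

    public void checkWritingAllowed() {
        synchronized (lock) {
            while (!isWritingAllowed) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(
                            "Interrupted while waiting for write access", e);
                }
            }
        }
    }
}
```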
Kudos, SonarCloud Quality Gate passed! 0 Bugs, 90.3% Coverage. The version of Java (11.0.21) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
https://issues.apache.org/jira/browse/OAK-10006