
Block repository writes if repo lock is lost #1204

Merged
merged 42 commits into from Nov 21, 2023

Conversation

smiroslav
Contributor

smiroslav and others added 30 commits October 2, 2023 17:09
@ahanikel
Contributor

I am wondering what happens if:

  • instance 1 loses the lease
  • instance 2 gets it and writes happily
  • instance 1 waits in checkWritingAllowed()
  • instance 2 terminates
  • instance 1 gets the lease and writes everything it had queued up

Or is that not a possible scenario?

@@ -100,7 +105,11 @@ private void refreshLease() {
    try {
        long timeSinceLastUpdate = (System.currentTimeMillis() - lastUpdate) / 1000;
        if (timeSinceLastUpdate > INTERVAL / 2) {
            writeAccessController.disableWriting();
Contributor

What about the performance impact of this change?
My concern is that we are blocking writes at every renewal, even though the lease has not expired and writes are still safe. Even a successful call to blob.acquireLease() could take a long time if Azure is slow.

I would suggest introducing a RenewalDeadline timeout while increasing the rate of renewals: for example, if the lease duration is 60 seconds, we renew the lease every 5 seconds, but if after a deadline of 40 seconds we still couldn't renew the lease, then and only then do we block the writes.
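Roughly something like this (just a sketch, not the actual code: RENEWAL_INTERVAL, RENEWAL_DEADLINE, leaseId and enableWriting() are illustrative names):

    private void refreshLease() {
        long secondsSinceLastRenewal = (System.currentTimeMillis() - lastUpdate) / 1000;
        if (secondsSinceLastRenewal > RENEWAL_INTERVAL) {
            try {
                // renew as often as every RENEWAL_INTERVAL seconds
                blob.renewLease(AccessCondition.generateLeaseCondition(leaseId));
                lastUpdate = System.currentTimeMillis();
            } catch (StorageException e) {
                // a failed renewal is simply retried on the next tick; writes are blocked
                // only once the RENEWAL_DEADLINE has passed and lease expiry is getting close
                if (secondsSinceLastRenewal > RENEWAL_DEADLINE) {
                    writeAccessController.disableWriting();
                }
            }
        }
    }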

Contributor Author

@smiroslav Nov 13, 2023

Even a successful call to blob.acquireLease() could take a long time if Azure is slow.

It is possible to do that optimization, but if Azure, as you said, is slow, then threads in SegmentWriteQueue will be slow to write to the remote repo as well.

If we block writes only 40 seconds after issuing blob.renewLease, the lease at that time will be 5 + 40 seconds old.

The writer thread might pass the WriteAccessController#checkWritingAllowed condition 44 seconds after the lease renewal and then struggle for the next 20 seconds to write the segment, eventually succeeding, but by that time the lease may already have been acquired by another Oak process.

Contributor

5 and 40 seconds were just examples and could be adjusted to, say, 10 and 30.
With those values, my proposal blocks writes only after lease renewal has failed multiple times (after the first renewal failure with an OperationTimeout, the retry happens immediately), but at most after 30 seconds.
Currently we renew only after 30 seconds, which seems too late to recover from safely; that's why I argue that we should renew more often and have a separate deadline timeout so that writes are blocked well before the lease expires.

Also see my other comment, where I argue we should block the SegmentWriteQueue as well, to prevent the scenario you mention of a writer thread still trying to write because it just passed the check before the block.

@smiroslav
Contributor Author

I am wondering what happens if:

  • instance 1 loses the lease
  • instance 2 gets it and writes happily
  • instance 1 waits in checkWritingAllowed()
  • instance 2 terminates
  • instance 1 gets the lease and writes everything it had queued up

Or is that not a possible scenario?

@ahanikel that should not be possible.
In the test below, I tried to renew a lease after it had expired; in the meantime, a new lease had been created (and had expired as well):

$ az storage blob lease acquire  --account-name smiljani  -b repo.lock -c oak --lease-duration 15
"803fedca-09cd-4c43-b026-58297c12c66a"
$ az storage blob lease renew  --account-name smiljani  -b repo.lock -c oak --lease-id 803fedca-09cd-4c43-b026-58297c12c66a
"803fedca-09cd-4c43-b026-58297c12c66a"
$ az storage blob lease acquire  --account-name smiljani  -b repo.lock -c oak --lease-duration 15
"9edaa378-934c-4843-b4b1-595763b62772"
$ az storage blob lease renew  --account-name smiljani  -b repo.lock -c oak --lease-id 803fedca-09cd-4c43-b026-58297c12c66a
The lease ID specified did not match the lease ID for the blob.
RequestId:0f20142b-701e-0080-7547-164489000000
Time:2023-11-13T15:39:35.6274726Z
ErrorCode:LeaseIdMismatchWithLeaseOperation

@jelmini
Contributor

jelmini commented Nov 13, 2023

What about the SegmentWriteQueue? All the segments already in the queue can still be sent to Azure even though the lease cannot be renewed.
Instead of blocking calls to AzureSegmentArchiveWriter, shouldn't we block adding new segments to the SegmentWriteQueue and block consuming its internal queue?

@smiroslav
Contributor Author

What about the SegmentWriteQueue? All the segments already in the queue can still be sent to Azure even though the lease cannot be renewed.

No, they must not be sent: those writes and journal updates would compete with the writes from the new Oak process that has successfully acquired the lease.

Instead of blocking calls to AzureSegmentArchiveWriter, shouldn't we block adding new segment to the SegmentWriteQueue and block consuming its internal queue?

It is also possible to configure synchronous writes, in which case there is no queue:

public void writeSegment(long msb, long lsb, @NotNull byte[] data, int offset, int size, int generation,

The check happens in the method below, which is also invoked when the queue is not configured:

protected void doWriteArchiveEntry(RemoteSegmentArchiveEntry indexEntry, byte[] data, int offset, int size) throws IOException {

@jelmini
Contributor

jelmini commented Nov 14, 2023

What about the SegmentWriteQueue? All the segments already in the queue can still be sent to Azure even though the lease cannot be renewed.

No, they must not be sent: those writes and journal updates would compete with the writes from the new Oak process that has successfully acquired the lease.

I'm not sure I understand. As far as I can see, SegmentWriteQueue does not check if the lease is held by the current instance when writing to Azure. Thus, if the queue is full when writes are blocked by WriteAccessController and Azure is slow, there could still be segments uploaded after the lease is lost.
For uploads already in progress when writes are blocked, we should probably just let them finish. But we should block all new uploads from starting.

@smiroslav
Contributor Author

As far as I can see, SegmentWriteQueue does not check if the lease is held by the current instance when writing to Azure.

Threads picking up items from the queue are eventually calling AzureSegmentArchiveWriter#doWriteArchiveEntry and writeAccessController.checkWritingAllowed() is being invoked there.
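In simplified form, the relevant part looks like this (just a sketch of where the gate sits; the real method does more):

    protected void doWriteArchiveEntry(RemoteSegmentArchiveEntry indexEntry, byte[] data, int offset, int size) throws IOException {
        // blocks while writing is disabled, so both the queued and the
        // synchronous write paths have to pass this gate before uploading
        writeAccessController.checkWritingAllowed();

        // ... upload of the segment blob to Azure happens here ...
    }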

@jelmini
Contributor

jelmini commented Nov 14, 2023

Threads picking up items from the queue are eventually calling AzureSegmentArchiveWriter#doWriteArchiveEntry and writeAccessController.checkWritingAllowed() is being invoked there.

Ah, now I understand! Thanks!

this.shutdownHook = shutdownHook;
this.blob = blob;
this.executor = Executors.newSingleThreadExecutor();
this.timeoutSec = timeoutSec;
this.writeAccessController = writeAccessController;

if (INTERVAL < RENEWAL_FREQUENCY || INTERVAL < TIME_TO_WAIT_BEFORE_WRITE_BLOCK) {
Contributor

I would also check that RENEWAL_FREQUENCY is lower than TIME_TO_WAIT_BEFORE_WRITE_BLOCK (which actually makes the check INTERVAL < RENEWAL_FREQUENCY redundant)
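For example, something along these lines in the constructor (exception type and messages are just an illustration):

    if (RENEWAL_FREQUENCY >= TIME_TO_WAIT_BEFORE_WRITE_BLOCK) {
        throw new IllegalStateException(String.format(
                "RENEWAL_FREQUENCY (%ds) must be lower than TIME_TO_WAIT_BEFORE_WRITE_BLOCK (%ds)",
                RENEWAL_FREQUENCY, TIME_TO_WAIT_BEFORE_WRITE_BLOCK));
    }
    if (TIME_TO_WAIT_BEFORE_WRITE_BLOCK >= INTERVAL) {
        throw new IllegalStateException(String.format(
                "TIME_TO_WAIT_BEFORE_WRITE_BLOCK (%ds) must be lower than the lease duration INTERVAL (%ds)",
                TIME_TO_WAIT_BEFORE_WRITE_BLOCK, INTERVAL));
    }
    // together these two checks also imply INTERVAL > RENEWAL_FREQUENCY,
    // so the existing INTERVAL < RENEWAL_FREQUENCY check becomes redundant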

private static int INTERVAL = Integer.getInteger(INTERVAL_PROP, 60);

public static final String RENEWAL_FREQUENCY_PROP = "oak.segment.azure.lock.leaseRenewalFrequency";
private static int RENEWAL_FREQUENCY = Integer.getInteger(RENEWAL_FREQUENCY_PROP, 5);
Contributor

nitpick: the name frequency might be misleading, as a frequency is defined as a number of events per unit of time, e.g. 5 renewals per second. But here we actually mean a renewal every 5 seconds. Maybe RENEWAL_INTERVAL?


private static int INTERVAL = 60;
public static final String INTERVAL_PROP = "oak.segment.azure.lock.leaseDuration";
Contributor

For all the new system properties, I would indicate in the name whether the value is in seconds or millis, otherwise it can be confusing. Something like oak.segment.azure.lock.leaseDurationInSec.

@@ -465,6 +475,80 @@ public void testCollectBlobReferencesDoesNotFailWhenFileIsMissing() throws URISy
}
}

@Test
public void testWriteAfterLoosingRepoLock() throws URISyntaxException, InvalidFileStoreVersionException, IOException, CommitFailedException, StorageException, InterruptedException {
Contributor

Typo: should be losing

@@ -59,9 +58,9 @@ public void setup() throws StorageException, InvalidKeyException, URISyntaxExcep
@Test
Contributor

Suggestion: add a test to validate the behaviour where it blocks writes after several renewals have failed.
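For illustration, something along these lines (a sketch only: createRepositoryLock is a hypothetical helper standing in for whatever the existing test setup does, the container field and the no-arg WriteAccessController constructor are assumptions, and breaking the lease from the outside is just one way to make every renewal attempt fail):

    @Test
    public void testWritesBlockedAfterSeveralFailedRenewals() throws Exception {
        CloudBlockBlob blob = container.getBlockBlobReference("repo.lock");
        WriteAccessController writeAccessController = new WriteAccessController();

        // hypothetical helper: acquires the repo lock and wires it to the controller
        createRepositoryLock(blob, writeAccessController);

        // break the lease from the outside so that every renewal attempt fails
        blob.breakLease(0);

        // sleep past the write-block deadline (assuming a TIME_TO_WAIT_BEFORE_WRITE_BLOCK of ~30s)
        Thread.sleep(35_000);

        // checkWritingAllowed() should now block; verify by racing it against a timeout
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> writeAttempt = executor.submit(() -> {
            writeAccessController.checkWritingAllowed();
            return null;
        });
        try {
            writeAttempt.get(5, TimeUnit.SECONDS);
            fail("writes should be blocked after repeated failed lease renewals");
        } catch (TimeoutException expected) {
            // still blocked, as expected
        }
    }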

@@ -465,6 +475,80 @@ public void testCollectBlobReferencesDoesNotFailWhenFileIsMissing() throws URISy
}
}

@Test
public void testWriteAfterLoosingRepoLock() throws URISyntaxException, InvalidFileStoreVersionException, IOException, CommitFailedException, StorageException, InterruptedException {
Contributor

polish: I suggest replacing the long list of exceptions in the throws clause with just throws Exception.



// wait till lease expires
Thread.sleep(70000);
Contributor

Now that we can configure the lease duration with the system property oak.segment.azure.lock.leaseDuration, I would set it to the minimum of 15 seconds, so that we can reduce the sleep time here and make this test take less time.
Probably we can do the same in AzureRepositoryLockTests as well.
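For example (a sketch; note that INTERVAL is read in a static initializer, so the property has to be set before AzureRepositoryLock is first loaded, e.g. in a @BeforeClass method or via the surefire configuration):

    @BeforeClass
    public static void useShortLease() {
        // 15 seconds is the minimum finite lease duration Azure allows
        System.setProperty(AzureRepositoryLock.INTERVAL_PROP, "15");
    }

    // ... and later in the test:
    // wait till the (now 15 s) lease expires
    Thread.sleep(16000);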

this.isWritingAllowed = true;

synchronized (this) {
this.notifyAll();
Contributor

Nitpick: even though the risk here is small, it's generally best practice to avoid locking on this. I suggest using a private final Object lock = new Object();
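Roughly like this (a sketch only; enableWriting() is assumed to be the counterpart of disableWriting(), and the real class may differ in detail):

    public class WriteAccessController {

        private volatile boolean isWritingAllowed = false;

        // dedicated monitor object, so code synchronizing on the controller itself cannot interfere
        private final Object lock = new Object();

        public void enableWriting() {
            synchronized (lock) {
                isWritingAllowed = true;
                lock.notifyAll();
            }
        }

        public void disableWriting() {
            synchronized (lock) {
                isWritingAllowed = false;
            }
        }

        public void checkWritingAllowed() {
            synchronized (lock) {
                while (!isWritingAllowed) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        // restore the interrupt flag and let the caller decide what to do
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }
    }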


sonarcloud bot commented Nov 21, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
13 Code Smells

90.3% Coverage
0.0% Duplication

Warning: The version of Java (11.0.21) used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

@smiroslav smiroslav merged commit 8879671 into trunk Nov 21, 2023
3 of 4 checks passed