
Block repository writes if repo lock is lost #1204

Merged
merged 42 commits into from Nov 21, 2023

Conversation

smiroslav
Contributor

smiroslav and others added 30 commits October 2, 2023 17:09
@ahanikel
Contributor

I am wondering what happens if:

  • instance 1 loses the lease
  • instance 2 gets it and writes happily
  • instance 1 waits in checkWritingAllowed()
  • instance 2 terminates
  • instance 1 gets the lease and writes everything it had queued up

Or is that not a possible scenario?

@@ -100,7 +105,11 @@ private void refreshLease() {
    try {
        long timeSinceLastUpdate = (System.currentTimeMillis() - lastUpdate) / 1000;
        if (timeSinceLastUpdate > INTERVAL / 2) {
            writeAccessController.disableWriting();
Contributor

What about the performance impact of this change?
My concern is that we are blocking writes at every renewal, even though the lease has not expired and writes are still safe. Even a successful call to blob.acquireLease() could take a long time if Azure is slow.

I would suggest introducing a RenewalDeadline timeout while increasing the rate of renewals: for example, if the lease duration is 60 seconds, we renew the lease every 5 seconds, but if after a deadline of 40 seconds we still couldn't renew the lease, then and only then do we block the writes.
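Roughly something like this (just a sketch, not the actual code: RENEWAL_INTERVAL, RENEWAL_DEADLINE, leaseId and enableWriting() are illustrative names):

    private void refreshLease() {
        long secondsSinceLastRenewal = (System.currentTimeMillis() - lastUpdate) / 1000;
        if (secondsSinceLastRenewal > RENEWAL_INTERVAL) {
            try {
                // renew as often as every RENEWAL_INTERVAL seconds
                blob.renewLease(AccessCondition.generateLeaseCondition(leaseId));
                lastUpdate = System.currentTimeMillis();
            } catch (StorageException e) {
                // a failed renewal is simply retried on the next tick; writes are blocked
                // only once the RENEWAL_DEADLINE has passed and lease expiry is getting close
                if (secondsSinceLastRenewal > RENEWAL_DEADLINE) {
                    writeAccessController.disableWriting();
                }
            }
        }
    }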

Contributor Author

@smiroslav Nov 13, 2023

Even a successful call to blob.acquireLease() could take a long time if Azure is slow.

It is possible to do that optimization, but if Azure, as you said, is slow, then threads in SegmentWriteQueue will be slow to write to the remote repo as well.

If we block writes only 40 seconds after issuing blob.renewLease, the lease at that time will be 5 + 40 seconds old.

The writer thread might pass the WriteAccessController#checkWritingAllowed condition 44 seconds after the lease renewal and then struggle for the next 20 seconds to write the segment, eventually succeeding, but by that time the lease may already have been acquired by another Oak process.

Contributor

5 and 40 seconds were just examples and could be adjusted to, say, 10 and 30.
With those values, my proposal blocks writes only after lease renewal has failed multiple times (after the first renewal failure with an OperationTimeout, the retry happens immediately), but at most after 30 seconds.
Currently we renew only after 30 seconds, which seems too late to recover from safely; that's why I argue that we should renew more often and have a separate deadline timeout so that writes are blocked well before the lease expires.

Also see my other comment, where I argue we should block the SegmentWriteQueue as well, to prevent the scenario you mention of a writer thread still trying to write because it just passed the check before the block.

@smiroslav
Contributor Author

I am wondering what happens if:

  • instance 1 loses the lease
  • instance 2 gets it and writes happily
  • instance 1 waits in checkWritingAllowed()
  • instance 2 terminates
  • instance 1 gets the lease and writes everything it had queued up

Or is that not a possible scenario?

@ahanikel that should not be possible.
In the test below, I tried to renew a lease after it had expired; in the meantime, a new lease had been created (and had expired as well):

$ az storage blob lease acquire  --account-name smiljani  -b repo.lock -c oak --lease-duration 15
"803fedca-09cd-4c43-b026-58297c12c66a"
$ az storage blob lease renew  --account-name smiljani  -b repo.lock -c oak --lease-id 803fedca-09cd-4c43-b026-58297c12c66a
"803fedca-09cd-4c43-b026-58297c12c66a"
$ az storage blob lease acquire  --account-name smiljani  -b repo.lock -c oak --lease-duration 15
"9edaa378-934c-4843-b4b1-595763b62772"
$ az storage blob lease renew  --account-name smiljani  -b repo.lock -c oak --lease-id 803fedca-09cd-4c43-b026-58297c12c66a
The lease ID specified did not match the lease ID for the blob.
RequestId:0f20142b-701e-0080-7547-164489000000
Time:2023-11-13T15:39:35.6274726Z
ErrorCode:LeaseIdMismatchWithLeaseOperation

@jelmini
Contributor

jelmini commented Nov 13, 2023

What about the SegmentWriteQueue? All the segments already in the queue can still be sent to Azure even though the lease cannot be renewed.
Instead of blocking calls to AzureSegmentArchiveWriter, shouldn't we block adding new segments to the SegmentWriteQueue and block consuming its internal queue?

@smiroslav
Contributor Author

What about the SegmentWriteQueue? All the segments already in the queue can still be sent to Azure even though the lease cannot be renewed.

No, they must not be sent: those writes and journal updates would compete with the writes from the new Oak process that has successfully acquired the lease.

Instead of blocking calls to AzureSegmentArchiveWriter, shouldn't we block adding new segment to the SegmentWriteQueue and block consuming its internal queue?

It is also possible to configure synchronous writes, in which case there is no queue:

public void writeSegment(long msb, long lsb, @NotNull byte[] data, int offset, int size, int generation,

The check happens in the method below, which is also invoked when the queue is not configured:

protected void doWriteArchiveEntry(RemoteSegmentArchiveEntry indexEntry, byte[] data, int offset, int size) throws IOException {

@jelmini
Contributor

jelmini commented Nov 14, 2023

What about the SegmentWriteQueue? All the segments already in the queue can still be sent to Azure even though the lease cannot be renewed.

No, they must not be sent: those writes and journal updates would compete with the writes from the new Oak process that has successfully acquired the lease.

I'm not sure I understand. As far as I can see, SegmentWriteQueue does not check if the lease is held by the current instance when writing to Azure. Thus, if the queue is full when writes are blocked by WriteAccessController and Azure is slow, there could still be segments uploaded after the lease is lost.
For uploads already in progress when writes are blocked, we should probably just let them finish. But we should block all new uploads from starting.

@smiroslav
Contributor Author

As far as I can see, SegmentWriteQueue does not check if the lease is held by the current instance when writing to Azure.

Threads picking up items from the queue are eventually calling AzureSegmentArchiveWriter#doWriteArchiveEntry and writeAccessController.checkWritingAllowed() is being invoked there.
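In simplified form, the relevant part looks like this (just a sketch of where the gate sits; the real method does more):

    protected void doWriteArchiveEntry(RemoteSegmentArchiveEntry indexEntry, byte[] data, int offset, int size) throws IOException {
        // blocks while writing is disabled, so both the queued and the
        // synchronous write paths have to pass this gate before uploading
        writeAccessController.checkWritingAllowed();

        // ... upload of the segment blob to Azure happens here ...
    }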

@jelmini
Contributor

jelmini commented Nov 14, 2023

Threads picking up items from the queue are eventually calling AzureSegmentArchiveWriter#doWriteArchiveEntry and writeAccessController.checkWritingAllowed() is being invoked there.

Ah, now I understand! Thanks!

this.shutdownHook = shutdownHook;
this.blob = blob;
this.executor = Executors.newSingleThreadExecutor();
this.timeoutSec = timeoutSec;
this.writeAccessController = writeAccessController;

if (INTERVAL < RENEWAL_FREQUENCY || INTERVAL < TIME_TO_WAIT_BEFORE_WRITE_BLOCK) {
Contributor

I would also check that RENEWAL_FREQUENCY is lower than TIME_TO_WAIT_BEFORE_WRITE_BLOCK (which actually makes the check INTERVAL < RENEWAL_FREQUENCY redundant)
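For example, something along these lines in the constructor (exception type and messages are just an illustration):

    if (RENEWAL_FREQUENCY >= TIME_TO_WAIT_BEFORE_WRITE_BLOCK) {
        throw new IllegalStateException(String.format(
                "RENEWAL_FREQUENCY (%ds) must be lower than TIME_TO_WAIT_BEFORE_WRITE_BLOCK (%ds)",
                RENEWAL_FREQUENCY, TIME_TO_WAIT_BEFORE_WRITE_BLOCK));
    }
    if (TIME_TO_WAIT_BEFORE_WRITE_BLOCK >= INTERVAL) {
        throw new IllegalStateException(String.format(
                "TIME_TO_WAIT_BEFORE_WRITE_BLOCK (%ds) must be lower than the lease duration INTERVAL (%ds)",
                TIME_TO_WAIT_BEFORE_WRITE_BLOCK, INTERVAL));
    }
    // together these two checks also imply INTERVAL > RENEWAL_FREQUENCY,
    // so the existing INTERVAL < RENEWAL_FREQUENCY check becomes redundant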

private static int INTERVAL = Integer.getInteger(INTERVAL_PROP, 60);

public static final String RENEWAL_FREQUENCY_PROP = "oak.segment.azure.lock.leaseRenewalFrequency";
private static int RENEWAL_FREQUENCY = Integer.getInteger(RENEWAL_FREQUENCY_PROP, 5);
Contributor

nitpick: the name frequency might be misleading, as a frequency is defined as a number of events per unit of time, e.g. 5 renewals per second. But here we actually mean a renewal every 5 seconds. Maybe RENEWAL_INTERVAL?


private static int INTERVAL = 60;
public static final String INTERVAL_PROP = "oak.segment.azure.lock.leaseDuration";
Contributor

For all the new system properties, I would indicate in the name whether the value is in seconds or millis, otherwise it can be confusing. Something like oak.segment.azure.lock.leaseDurationInSec.

@@ -465,6 +475,80 @@ public void testCollectBlobReferencesDoesNotFailWhenFileIsMissing() throws URISy
}
}

@Test
public void testWriteAfterLoosingRepoLock() throws URISyntaxException, InvalidFileStoreVersionException, IOException, CommitFailedException, StorageException, InterruptedException {
Contributor

Typo: should be losing

@@ -59,9 +58,9 @@ public void setup() throws StorageException, InvalidKeyException, URISyntaxExcep
@Test
Contributor

Suggestion: add a test to validate the behaviour where it blocks writes after several renewals have failed.
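For illustration, something along these lines (a sketch only: createRepositoryLock is a hypothetical helper standing in for whatever the existing test setup does, the container field and the no-arg WriteAccessController constructor are assumptions, and breaking the lease from the outside is just one way to make every renewal attempt fail):

    @Test
    public void testWritesBlockedAfterSeveralFailedRenewals() throws Exception {
        CloudBlockBlob blob = container.getBlockBlobReference("repo.lock");
        WriteAccessController writeAccessController = new WriteAccessController();

        // hypothetical helper: acquires the repo lock and wires it to the controller
        createRepositoryLock(blob, writeAccessController);

        // break the lease from the outside so that every renewal attempt fails
        blob.breakLease(0);

        // sleep past the write-block deadline (assuming a TIME_TO_WAIT_BEFORE_WRITE_BLOCK of ~30s)
        Thread.sleep(35_000);

        // checkWritingAllowed() should now block; verify by racing it against a timeout
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> writeAttempt = executor.submit(() -> {
            writeAccessController.checkWritingAllowed();
            return null;
        });
        try {
            writeAttempt.get(5, TimeUnit.SECONDS);
            fail("writes should be blocked after repeated failed lease renewals");
        } catch (TimeoutException expected) {
            // still blocked, as expected
        }
    }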

@@ -465,6 +475,80 @@ public void testCollectBlobReferencesDoesNotFailWhenFileIsMissing() throws URISy
}
}

@Test
public void testWriteAfterLoosingRepoLock() throws URISyntaxException, InvalidFileStoreVersionException, IOException, CommitFailedException, StorageException, InterruptedException {
Contributor

polish: I suggest replacing the long list of exceptions in the throws clause with just throws Exception.



// wait till lease expires
Thread.sleep(70000);
Contributor

Now that we can configure the lease duration with the system property oak.segment.azure.lock.leaseDuration, I would set it to the minimum of 15 seconds, so that we can reduce the sleep time here and make this test take less time.
Probably we can do the same in AzureRepositoryLockTests as well.
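For example (a sketch; note that INTERVAL is read in a static initializer, so the property has to be set before AzureRepositoryLock is first loaded, e.g. in a @BeforeClass method or via the surefire configuration):

    @BeforeClass
    public static void useShortLease() {
        // 15 seconds is the minimum finite lease duration Azure allows
        System.setProperty(AzureRepositoryLock.INTERVAL_PROP, "15");
    }

    // ... and later in the test:
    // wait till the (now 15 s) lease expires
    Thread.sleep(16000);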

this.isWritingAllowed = true;

synchronized (this) {
this.notifyAll();
Contributor

Nitpick: even though the risk here is small, it's generally best practice to avoid locking on this. I suggest using a private final Object lock = new Object();
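Roughly like this (a sketch only; enableWriting() is assumed to be the counterpart of disableWriting(), and the real class may differ in detail):

    public class WriteAccessController {

        private volatile boolean isWritingAllowed = false;

        // dedicated monitor object, so code synchronizing on the controller itself cannot interfere
        private final Object lock = new Object();

        public void enableWriting() {
            synchronized (lock) {
                isWritingAllowed = true;
                lock.notifyAll();
            }
        }

        public void disableWriting() {
            synchronized (lock) {
                isWritingAllowed = false;
            }
        }

        public void checkWritingAllowed() {
            synchronized (lock) {
                while (!isWritingAllowed) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        // restore the interrupt flag and let the caller decide what to do
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        }
    }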


sonarcloud bot commented Nov 21, 2023

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
13 Code Smells

90.3% Coverage
0.0% Duplication

Warning: The version of Java (11.0.21) used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

@smiroslav smiroslav merged commit 8879671 into trunk Nov 21, 2023
3 of 4 checks passed