
Fix BatchingSqlJournal emitting WriteMessageSuccess before transaction was complete. #4953

Conversation

@Arkatufus Arkatufus commented Apr 19, 2021

Should also fix #4265, but would need further field testing.

var array = new IJournalRequest[operationsCount];
for (int i = 0; i < operationsCount; i++)
// Need a lock here to ensure that buffer doesn't change during this operation
lock (_lock)
Contributor Author

This is the very janky code that splits the queued requests into either a read request chunk or a write request chunk.
The explanation for why it is coded in this peculiar way is in the method comment.
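For illustration, the splitting logic boils down to something like the minimal sketch below; the class, field names, and Enqueue method are hypothetical stand-ins, not the PR's actual code:

```csharp
// A minimal sketch (not the PR's code): split incoming journal requests into a
// write buffer (writes + deletes) and a read buffer (queries), tagged with a
// monotonically increasing id so they can later be dispatched roughly in order.
// WriteMessages and DeleteMessagesTo are Akka.Persistence message types; the
// buffer fields and Enqueue method here are illustrative only.
using System.Collections.Generic;
using Akka.Persistence;

public class RequestBufferSketch
{
    private readonly object _lock = new object();
    private readonly Queue<(long id, IJournalRequest request)> _writeBuffer
        = new Queue<(long id, IJournalRequest request)>();
    private readonly Queue<(long id, IJournalRequest request)> _readBuffer
        = new Queue<(long id, IJournalRequest request)>();
    private long _idCounter;

    public void Enqueue(IJournalRequest request)
    {
        lock (_lock) // the buffers are shared with the batching loop
        {
            var id = _idCounter++;
            if (request is WriteMessages || request is DeleteMessagesTo)
                _writeBuffer.Enqueue((id, request));
            else
                _readBuffer.Enqueue((id, request));
        }
    }
}
```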

@Arkatufus Arkatufus marked this pull request as draft April 19, 2021 20:00
Member

@Aaronontheweb Aaronontheweb left a comment

Let's not modify the Akka.Persistence API surface area just to save the BatchingSqlJournal.

/// <summary>
/// Marker for batched write operations
/// </summary>
public interface IJournalWrite : IJournalRequest { }
Member

Let's remove this and try to avoid making the BatchingJournal's problem all of Akka.Persistence's problem.

@Arkatufus Arkatufus marked this pull request as ready for review April 21, 2021 19:17
@Arkatufus Arkatufus changed the title [WIP] Fix BatchingSqlJournal emitting WriteMessageSuccess before transaction was complete. Fix BatchingSqlJournal emitting WriteMessageSuccess before transaction was complete. Apr 21, 2021
@Aaronontheweb Aaronontheweb self-requested a review April 21, 2021 19:21
Member

@Aaronontheweb Aaronontheweb left a comment

LGTM overall but left some questions for @Arkatufus

{
if (cause is DbException)
{
// database-related exceptions should result in failure
Member

LGTM

{
var id = _bufferIdCounter.GetAndIncrement();
// Enqueue writes and delete operation requests into the write queue,
// else if they are query operations, enqueue them into the read queue
Member

LGTM

for (int i = 0; i < operationsCount; i++)
var currentBuffer = _buffers
.Where(q => q.Count > 0)
.OrderBy(q => q.Peek().id).First();
Member

Is the buffer already sorted this way?

Contributor Author

This is just a way to pick the buffer whose first item has the lowest id. We're trying to make sure that requests are executed roughly in order when there is a choice between a read and a write.
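A self-contained sketch of what that selection does (hypothetical helper and parameter names; the real code works on the journal's internal buffers):

```csharp
// Sketch: among the non-empty buffers, pick the one whose oldest (first) item
// has the lowest id, so reads and writes are dispatched roughly in arrival order.
// Assumes at least one buffer is non-empty, mirroring the .First() in the diff.
using System.Collections.Generic;
using System.Linq;
using Akka.Persistence;

public static class BufferSelection
{
    public static Queue<(long id, IJournalRequest request)> PickNextBuffer(
        IEnumerable<Queue<(long id, IJournalRequest request)>> buffers)
    {
        return buffers
            .Where(q => q.Count > 0)
            .OrderBy(q => q.Peek().id)
            .First();
    }
}
```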

@@ -1388,29 +1389,111 @@ protected void AddParameter(TCommand command, string paramName, DbType dbType, o
/// <param name="param">Parameter to customize</param>
protected virtual void PreAddParameterToCommand(TCommand command, DbParameter param) { }

/// <summary>
/// Select the buffer that has the smallest id on its first item, retrieve a maximum Setup.MaxBatchSize
Member

Because this indicates the "oldest" outstanding operation?

Contributor Author

Yes, it is "roughly first in, first out". Question is, how sensitive is journal request ordering between read and write operations, are they strict or can operations happen in asynchrony?
If they're tightly coupled, I'd have to change this logic so that only consecutive requests of the same operations are allowed to be batched, ie. all queries, all writes, and all deletes.

Contributor Author

This is actually a big question that needs to be addressed.

Member

@to11mtm to11mtm Apr 21, 2021

FWIW, I know that persistence-jdbc makes a point of blocking ReadHighestSequenceNrAsync for a given PersistenceId (there is an example in the ported code).

I think this is done because the first step of Recovery in AsyncWriteJournal is to read the max sequence number from which to replay. This way, if the PersistentActor somehow crashes before a PersistAsync write completes, the Recovery sequence will still read up to the correct sequence number.

Edit for clarity: in short, ReadHighestSequenceNrAsync calls for a given PersistenceId need to be queued until any pending writes for that PersistenceId have completed; for other operations you shouldn't need to worry as much about ordering. Otherwise you get edge cases like the one I pointed out above around PersistAsync.
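For illustration only, per-PersistenceId gating along these lines could look like the sketch below; the WriteGate type and its members are assumptions, not persistence-jdbc's or this PR's implementation:

```csharp
// Sketch: gate ReadHighestSequenceNrAsync on any in-flight write for the same
// PersistenceId, so recovery never reads a sequence number behind a pending write.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class WriteGate
{
    private readonly ConcurrentDictionary<string, Task> _pendingWrites =
        new ConcurrentDictionary<string, Task>();

    // Record an in-flight write so later reads for the same persistenceId can wait on it.
    public Task TrackWrite(string persistenceId, Task writeTask)
    {
        _pendingWrites[persistenceId] = writeTask;
        return writeTask;
    }

    // Wait for any pending write before reading the highest sequence number.
    public async Task<long> ReadHighestSequenceNrAsync(string persistenceId, Func<Task<long>> readFromDb)
    {
        if (_pendingWrites.TryGetValue(persistenceId, out var pending))
        {
            try { await pending; }
            catch { /* the write's own failure is reported elsewhere; only ordering matters here */ }
        }
        return await readFromDb();
    }
}
```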

Contributor Author

@Arkatufus Arkatufus Apr 22, 2021

So, to make things as safe as possible, I guess I should only batch consecutive similar commands, unless it is a ReadHighestSequenceNr, in which case I should stop. This is turning into a NotBatchedSqlJournal real fast...

Contributor Author

Well, all query types should be safe to batch together as long as they're consecutive, right? The danger comes when they are interleaved with writes; on the write side, consecutive writes and consecutive deletes are fine.
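A sketch of that "consecutive only" batching rule, under the assumption that the buffer is a simple queue (the helper below is hypothetical, not the code that was merged):

```csharp
// Sketch: dequeue a batch of consecutive requests of the same kind (all reads,
// or all writes/deletes), stopping at the first kind change or at maxBatchSize.
using System.Collections.Generic;
using Akka.Persistence;

public static class ConsecutiveBatching
{
    private enum Kind { Read, Write }

    private static Kind KindOf(IJournalRequest req) =>
        req is WriteMessages || req is DeleteMessagesTo ? Kind.Write : Kind.Read;

    public static List<IJournalRequest> DequeueConsecutiveChunk(Queue<IJournalRequest> buffer, int maxBatchSize)
    {
        var chunk = new List<IJournalRequest>();
        if (buffer.Count == 0) return chunk;

        var kind = KindOf(buffer.Peek());
        while (buffer.Count > 0 && chunk.Count < maxBatchSize && KindOf(buffer.Peek()) == kind)
            chunk.Add(buffer.Dequeue());

        return chunk;
    }
}
```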

Member

After pondering it, I think we're probably safe, since BatchingSqlJournal is not derived from AsyncJournalBase and therefore does recovery differently.

At worst, I think we would see a scenario where, after recovery, the actor crashes because it didn't pick up the right sequence number. Even then, I can't imagine it would take much more than restarting the actor (or, at worst, the whole system).

That's still a better worst case than the current state of BatchingSqlJournal (where writes are confirmed while they may still fail). Under heavy loads I would expect to see torn-write behavior (i.e. missing sequence numbers in the DB) basically corrupting the journal state.

Member

@to11mtm do you approve this set of changes then?


return new RequestChunk(chunkId, array);
return new RequestChunk(chunkId, operations.ToArray());
Member

Why not just pass the List<> instead?

Contributor Author

RequestChunk is a struct. I'm not sure if there is a specific reason why it uses an array and not a list; I assumed it is for optimization reasons, so I didn't change it.
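For context, the type being discussed is shaped roughly like the sketch below (an approximation; the real RequestChunk in Akka.Persistence.Sql.Common may differ in members and accessibility). Materializing the List<> into an array gives the struct a fixed-size, effectively immutable payload:

```csharp
// Approximate shape of the chunk type under discussion; illustrative only.
using Akka.Persistence;

public readonly struct RequestChunk
{
    public int ChunkId { get; }
    public IJournalRequest[] Requests { get; }

    public RequestChunk(int chunkId, IJournalRequest[] requests)
    {
        ChunkId = chunkId;
        Requests = requests;
    }
}
```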

@Aaronontheweb
Member

cc @ismaelhamed

{
cause = e;
tx.Rollback();
Member

Note: .Rollback() may throw. I'm not certain whether we're worried about losing the original exception in that case...

Member

It would probably be a good idea to include it

Contributor Author

Added a try...catch around the rollback; if the rollback throws, we now throw an aggregate exception containing both the rollback exception and the original exception.
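A sketch of that pattern, assuming a DbTransaction in tx and the original failure captured in cause (not the exact code that was committed):

```csharp
// Sketch: don't let a failing Rollback() swallow the original exception.
using System;
using System.Data.Common;

public static class RollbackHelper
{
    public static void RollbackPreservingCause(DbTransaction tx, Exception cause)
    {
        try
        {
            tx.Rollback();
        }
        catch (Exception rollbackException)
        {
            // Surface both failures instead of losing either one.
            throw new AggregateException(cause, rollbackException);
        }
    }
}
```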

@Aaronontheweb Aaronontheweb self-requested a review April 21, 2021 23:01
@ismaelhamed
Member

LGTM. This was trickier than I thought, nice job!

Would a -beta of the Akka.Persistence.Sql.Common package with this branch be appropriate until we're sure this won't backfire in production?

I'm not concerned about @Arkatufus's coding at all ;) but, given that this has deviated from the original implementation and that the original author was not involved to pick his brain on some of the issues that came up, it may be the more sensible thing to do.


foreach (var req in message.Requests)
Log.Error(cause, "An error occurred during event batch processing (chunkId: {0})", message.ChunkId);

Member

I'd keep the number of requests (i.e., (chunkId: {0}) of {1} requests)

Contributor Author

Done

@Aaronontheweb
Member

Yep, we can do that @ismaelhamed - I can make those into beta releases

Member

@Aaronontheweb Aaronontheweb left a comment

LGTM

@Aaronontheweb Aaronontheweb merged commit e5d079a into akkadotnet:dev May 3, 2021
@Arkatufus Arkatufus deleted the #4941_Fix_WriteMessageSuccess_batching_issue branch February 27, 2023 17:25
Successfully merging this pull request may close these issues.

Some persistent actors are stuck with RecoveryTimedOutException after circuit breaker opens