ARTEMIS-1570 Flush appendExecutor before take journal snapshot#1742
ARTEMIS-1570 Flush appendExecutor before take journal snapshot#1742shoukunhuai wants to merge 3 commits into
Conversation
When live start replication, it must make sure there is no pending write in message & bindings journal, or we may lost journal records during initial replication. So we need flush append executor after acquire StorageManager's write lock, before Journal's write lock. Also we set a 10 seconds timeout when flush, the same as Journal::flushExecutor. If we failed to flush in 10 seconds, we abort replication, backup will try again later. Use OrderedExecutorFactory::flushExecutor to flush executor
3d4c459 to
6f04846
Compare
|
How did u find this ? Nice job. I will see if I can change the logic to not need the wait. Any tests you can share. I’m out this week for the Xmas break. Will take a look on next week. Ok ? |
|
@clebertsuconic That will be great if we can avoid wait. |
|
@shoukunhuai I thought about developing a byteman test with some rules on stopping during the execution, and then releasing it. |
Test consistency between live and backup, espacially on a slow live. The test use MessagePersister::encode to simulate slow IO condition. After live started, we send 5 message with a delay(default 500ms), then start backup, wait until replicated, then send more message without delay. If all message sent successfully, the backup should has the same messages as live. We assert the message number only. For message's delay, you can set it via system property, i test it with `-Dmessage.property.delay=5000`, the test will fail.
|
@clebertsuconic just pushed my test for your reference. |
|
@shoukunhuai looks like some check-style bits need tidying up. looking at the PR build failure. |
|
@clebertsuconic, what's the status of this? |
|
I didn't have the time for it yet.. still working on compatibility issues. |
| ClientProducer producer = session.createProducer("slow"); | ||
| ClientMessage message = session.createMessage(true); | ||
| // this will make journal's append executor busy | ||
| message.putLongProperty("delay", Long.getLong("message.property.delay",500L)); |
There was a problem hiding this comment.
Where? how this property is affecting the semantic of the server. I didn't find it anywhere.
There was a problem hiding this comment.
Never mind.. I see.. it's on your extended persister...
I don't think we need a system property.. the test can just keep it constant at 500L
|
nice job on this one!!! impressed!!! |
|
do you want to contribute more????????? I can help you in any way you need. |
|
@clebertsuconic sorry, almost forgot this, being busy on something else. |
When live start replication, it must make sure there is
no pending write in message & bindings journal, or we may
lost journal records during initial replication.
So we need flush append executor after acquire StorageManager's
write lock, before Journal's write lock.
Also we set a 10 seconds timeout when flush, the same as
Journal::flushExecutor. If we failed to flush in 10 seconds,
we abort replication, backup will try again later.