Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FAILED: CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins #14

Closed
ktoso opened this issue Aug 21, 2019 · 6 comments
Closed
Assignees
Labels
failed 💥 Failed tickets are CI or benchmarking failures, should be investigated as soon as possible
Milestone

Comments

@ktoso
Copy link
Member

ktoso commented Aug 21, 2019

Failed in validation

https://ci.swiftserver.group/job/swift-distributed-actors-swift50-p1rb/6/consoleFull

19:36:55 Test Case 'CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins' started at 2019-08-21 10:36:55.100
19:36:58 /code/Tests/DistributedActorsTests/CRDT/CRDTActorOwnedTests.swift:125: error: CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins : failed - 
19:36:58         try ownerEventPB.expectMessage(.ownerDefinedOnUpdate)
19:36:58                           ^~~~~~~~~~~~~~
19:36:58 error: Did not receive expected [ownerDefinedOnUpdate]:OwnerEventProbeProtocol within [3s], error: noMessagesInQueue
19:36:58 <EXPR>:0: error: CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins : threw error "error(message: "\n        try ownerEventPB.expectMessage(.ownerDefinedOnUpdate)\n                          \u{1B}[0;31m^~~~~~~~~~~~~~\nerror: Did not receive expected [ownerDefinedOnUpdate]:OwnerEventProbeProtocol within [3s], error: noMessagesInQueue\u{1B}[0;0m")"
19:36:58 2019-08-21T10:36:58+0000 info: deadLetter=1 [CRDTActorOwnedTests][ActorSystem.swift:215][/dead/letters] Dead letter: [command(DistributedActors.ClusterShell.CommandMessage.unbind(DistributedActors.BlockingReceptacle<()>))]:DistributedActors.ClusterShell.Message was not delivered to ["/dead/letters"].
19:36:58 Test Case 'CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins' failed (3.307 seconds)
@ktoso ktoso added failed 💥 Failed tickets are CI or benchmarking failures, should be investigated as soon as possible t:crdt labels Aug 21, 2019
@ktoso
Copy link
Member Author

ktoso commented Aug 21, 2019

FYI @yim-lee

Likely we may get some failures now that we moved to new CI, perhaps timing issues? 🤔

@yim-lee
Copy link
Member

yim-lee commented Aug 21, 2019

One possibility is that the test starts modifying the ActorOwned<GCounter> before registration is complete, so the replicator doesn't know about the owner and doesn't send it notifications.

I think the owning actor should stay in "not ready" state until it's registered. Currently the owner has no way of knowing when that happens. Let me add logic to allow that to be done and update tests.

This might be the problem here:
https://github.com/apple/swift-distributed-actors/blob/master/Sources/DistributedActors/CRDT/CRDT.swift

            // Register as owner of the CRDT with local replicator
            _ = replicator.ask(for: RegisterResult.self, timeout: .milliseconds(100)) { replyTo in
                .localCommand(.register(ownerRef: subReceive, id: id, data: data.asAnyStateBasedCRDT, replyTo: replyTo))
            }

We don't wait for the .register reply.

@yim-lee
Copy link
Member

yim-lee commented Aug 22, 2019

Unlikely #16 would help.

Both .register and the underlying .write for gcounter.increment go to the same local replicator, and there is a message ordering guarantee so .register should be processed before .write, which means there is low chance that the owner doesn't receive message because it hasn't been registered yet.

@ktoso
Copy link
Member Author

ktoso commented Aug 22, 2019

Does not reproduce on mac, I think it may be a timing issue... SOMEHOW.

Agree that #16 does not inherently solve it.

I was able to reproduce it twice now in docker:

  • docker-compose -f docker/docker-compose.yaml -f docker/docker-compose.1604.51.yaml run shell
  • and run tests there (swift is in $HOME/.swift).

Sadly without logs; I will keep it running for a while with logs now;

Notes:

  • CPU load does not seem to matter; with burning cpu id did not reproduce on a mac, even if the test took ages to complete
  • TSAN does not panic; it seems to be something else going on..

Will try to get us a tracelog

@ktoso
Copy link
Member Author

ktoso commented Aug 22, 2019

root@27f1dc043936:/code# ./scripts/dev/until_failure $HOME/.swift/usr/bin/swift test -Xswiftc -DSACT_TRACE_REPLICATOR --filter test_actorOwned_theLastWrittenOnUpdateCallbackWins --sanitize thread

and added tracelog for the local messages


repro, no logs, takes 3 seconds

Test Case 'CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins' started at 2019-08-22 03:36:37.011
/code/Tests/DistributedActorsTests/CRDT/CRDTActorOwnedTests.swift:125: error: CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins : failed -
        try ownerEventPB.expectMessage(.ownerDefinedOnUpdate)
                          ^~~~~~~~~~~~~~
error: Did not receive expected [ownerDefinedOnUpdate]:OwnerEventProbeProtocol within [3s] at probe [/user/$testProbe-1/$adapter-y], error: noMessagesInQueue
<EXPR>:0: error: CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins : threw error "error(message: "\n        try ownerEventPB.expectMessage(.ownerDefinedOnUpdate)\n                          \u{1B}[0;31m^~~~~~~~~~~~~~\nerror: Did not receive expected [ownerDefinedOnUpdate]:OwnerEventProbeProtocol within [3s] at probe [/user/$testProbe-1/$adapter-y], error: noMessagesInQueue\u{1B}[0;0m")"
2019-08-22T03:36:40+0000 info: deadLetter=1 [CRDTActorOwnedTests][ActorSystem.swift:215][/dead/letters] Dead letter: [command(DistributedActors.ClusterShell.CommandMessage.unbind(DistributedActors.BlockingReceptacle<()>))]:DistributedActors.ClusterShell.Message was not delivered to ["/dead/letters"].
Test Case 'CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins' failed (3.366 seconds)
Test Suite 'CRDTActorOwnedTests' failed at 2019-08-22 03:36:40.378
	 Executed 1 test, with 2 failures (1 unexpected) in 3.366 (3.366) seconds
Test Suite 'Selected tests' failed at 2019-08-22 03:36:40.378
	 Executed 1 test, with 2 failures (1 unexpected) in 3.366 (3.366) seconds

correct execution with tracelog

Test Case 'CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins' started at 2019-08-22 03:44:01.819
2019-08-22T03:44:01+0000 warning: [CRDTActorOwnedTests][CRDTReplicatorShell.swift:66][thread:140319883364096][/system/replicator] [tracelog:replicator] L-RECV: register(ownerRef: ActorRef<CRDT.Replication.DataOwnerMessage>(/user/$anonymous-y/$subReceive-test-gcounter-y), id: DistributedActors.CRDT.Identity(id: "test-gcounter"), data: AnyCvRDT(GCounter(replicaId: actor:/user/$anonymous-y, state: [:], delta: nil)), replyTo: ActorRef<CRDT.Replicator.LocalCommand.RegisterResult>(/user/$ask-b/$adapter-y))
2019-08-22T03:44:01+0000 warning: [CRDTActorOwnedTests][CRDTReplicatorShell.swift:66][thread:140319883364096][/system/replicator] [tracelog:replicator] L-RECV: write(DistributedActors.CRDT.Identity(id: "test-gcounter"), AnyCvRDT(GCounter(replicaId: actor:/user/$anonymous-y, state: [actor:/user/$anonymous-y: 1], delta: Optional(DistributedActors.CRDT.GCounterDelta(state: [actor:/user/$anonymous-y: 1])))), consistency: DistributedActors.CRDT.OperationConsistency.local, replyTo: ActorRef<CRDT.Replicator.LocalCommand.WriteResult>(/user/$ask-n/$adapter-y))
2019-08-22T03:44:01+0000 warning: [CRDTActorOwnedTests][CRDTReplicatorShell.swift:236][thread:140319883364096][/system/replicator] [tracelog:replicator] UP-OWN($subReceive-test-gcounter-y): updated(DistributedActors.CRDT.GCounter(replicaId: actor:/user/$anonymous-y, state: [actor:/user/$anonymous-y: 1], delta: nil))
2019-08-22T03:44:01+0000 info: deadLetter=1 [CRDTActorOwnedTests][ActorSystem.swift:215][/dead/letters] Dead letter: [command(DistributedActors.ClusterShell.CommandMessage.unbind(DistributedActors.BlockingReceptacle<()>))]:DistributedActors.ClusterShell.Message was not delivered to ["/dead/letters"].
Test Case 'CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins' passed (0.361 seconds)

@ktoso ktoso added this to the 0.1.0 milestone Aug 22, 2019
@ktoso ktoso added the 3 - in progress Ticket is being worked on label Aug 22, 2019
@ktoso
Copy link
Member Author

ktoso commented Aug 22, 2019

Reproduced with tracelog:

2019-08-22T04:02:06+0000 warning: [CRDTActorOwnedTests][CRDTReplicatorShell.swift:66][thread:140146414651136][/system/replicator] [tracelog:replicator] L-RECV: write(DistributedActors.CRDT.Identity(id: "test-gcounter"), AnyCvRDT(GCounter(replicaId: actor:/user/$anonymous-y, state: [actor:/user/$anonymous-y: 1], delta: Optional(DistributedActors.CRDT.GCounterDelta(state: [actor:/user/$anonymous-y: 1])))), consistency: DistributedActors.CRDT.OperationConsistency.local, replyTo: ActorRef<CRDT.Replicator.LocalCommand.WriteResult>(/user/$ask-n/$adapter-y))
2019-08-22T04:02:06+0000 warning: [CRDTActorOwnedTests][CRDTReplicatorShell.swift:66][thread:140146414651136][/system/replicator] [tracelog:replicator] L-RECV: register(ownerRef: ActorRef<CRDT.Replication.DataOwnerMessage>(/user/$anonymous-y/$subReceive-test-gcounter-y), id: DistributedActors.CRDT.Identity(id: "test-gcounter"), data: AnyCvRDT(GCounter(replicaId: actor:/user/$anonymous-y, state: [:], delta: nil)), replyTo: ActorRef<CRDT.Replicator.LocalCommand.RegisterResult>(/user/$ask-b/$adapter-y))
/code/Tests/DistributedActorsTests/CRDT/CRDTActorOwnedTests.swift:125: error: CRDTActorOwnedTests.test_actorOwned_theLastWrittenOnUpdateCallbackWins : failed -
        try ownerEventPB.expectMessage(.ownerDefinedOnUpdate)
                          ^~~~~~~~~~~~~~
error: Did not receive expected [ownerDefinedOnUpdate]:OwnerEventProbeProtocol within [3s] at probe [/user/$testProbe-1/$adapter-y], error: noMessagesInQueue

So yeah, it is as you said, the write happens before register; This is worrying and that should be fixed, as we are "owned" we should guarantee ordering of those two.

ktoso pushed a commit that referenced this issue Aug 22, 2019
…ateCallbackWins (#21)

Motivation:
The aforementioned test fails intermittently. See #14.

Modifications:
There is an existing issue with `.ask`. Use `.tell` instead 1) as a workaround; 2) since we don't need the result anyway.

Also made some minor improvements to code docs.

Result:
No more flaky test hopefully!
@ktoso ktoso closed this as completed Aug 22, 2019
ktoso pushed a commit that referenced this issue Aug 31, 2019
…ateCallbackWins (#21)

Motivation:
The aforementioned test fails intermittently. See #14.

Modifications:
There is an existing issue with `.ask`. Use `.tell` instead 1) as a workaround; 2) since we don't need the result anyway.

Also made some minor improvements to code docs.

Result:
No more flaky test hopefully!
@ktoso ktoso removed the 3 - in progress Ticket is being worked on label Sep 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
failed 💥 Failed tickets are CI or benchmarking failures, should be investigated as soon as possible
Projects
None yet
Development

No branches or pull requests

2 participants