Persisted events are not always reported via EventsByTagQuery #383
I'm going to investigate this now. I'll start by putting in a multi-JVM test with https://doc.akka.io/docs/akka/2.5/multi-jvm-testing.html
A couple of initial questions @bvingerh:
Do you have full logs for it that I can take a look at?
I have logs and Cassandra dumps, but the logs are at INFO level and contain very little about Akka (the only Akka-specific log lines are similar to the ones above: 'EventsByTagStage' lines about missing events).
We have attempted to provide a reproduction case ourselves. The result can be found below in the zip file as a Java 8 Maven project. The setup is as follows:
What is the observed behavior:
What is the setup:
Note that the zip file below also contains a logs folder with the output of the test for both a successful and a failed run. Let us know if more information is required.
I'm seeing the same issue with 0.90. I persist and tag 100 events (100 hits locally on a 2-node cluster via Gatling over 10 seconds). I create one manually and it gets replayed. Another batch of 100 is persisted and tagged. This is the only setting being set at the moment:
Can we get any official answer on this?
I'm struggling with the same issue. I have 5-6 Akka clusters that use Akka Persistence. Events are persisted to Cassandra, but the eventsByTag query in the projector (which projects from Cassandra to Elasticsearch) does not project all events. The projector opens 10 streams that query by tag.
For me a restart doesn't solve the problem, because most likely I persist the offset of a newer event, and when restarting, the stream is started from that offset.
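A minimal sketch of the resume-from-stored-offset pattern just described, and of why it loses events, assuming Akka 2.5-era APIs (the offset store, projection function, and tag name are hypothetical placeholders, not part of the library):

```scala
import java.util.UUID
import scala.concurrent.Future
import akka.actor.ActorSystem
import akka.persistence.cassandra.query.scaladsl.CassandraReadJournal
import akka.persistence.query.{EventEnvelope, Offset, PersistenceQuery, TimeBasedUUID}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink

object ProjectorSketch extends App {
  implicit val system: ActorSystem = ActorSystem("projector")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical offset store and projection function.
  def loadLatestOffset(tag: String): Option[UUID] = None
  def saveOffset(tag: String, offset: Offset): Future[Unit] = Future.successful(())
  def project(env: EventEnvelope): Future[Unit] = Future.successful(())

  val queries = PersistenceQuery(system)
    .readJournalFor[CassandraReadJournal](CassandraReadJournal.Identifier)

  val start: Offset =
    loadLatestOffset("BCE").map(TimeBasedUUID(_)).getOrElse(Offset.noOffset)

  // If event N was never delivered but the offset of event N+1 was already
  // stored, restarting from `start` skips event N forever -- which is the
  // problem described above.
  queries
    .eventsByTag("BCE", start)
    .mapAsync(1)(env => project(env).flatMap(_ => saveOffset("BCE", env.offset)))
    .runWith(Sink.ignore)
}
```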
@toerni I ran your reproducer and it does fail sometimes. However, when I changed it to store the message number per persistenceId so I could dig into where the event went, I couldn't get it to fail. Also, when running with a very large number of messages/broadcasters, storing all the messages requires a large heap, so I tried it with an AtomicInteger and can't get that to fail either. If anyone can provide debug logs along with which events were missed, that would be very helpful.
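For anyone trying the same, here is a sketch of the cheap per-persistenceId gap check described above (the tag name and actor-system wiring are illustrative assumptions; the read-journal API itself is real):

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong
import akka.actor.ActorSystem
import akka.persistence.cassandra.query.scaladsl.CassandraReadJournal
import akka.persistence.query.{NoOffset, PersistenceQuery}
import akka.stream.ActorMaterializer

object GapDetector extends App {
  implicit val system: ActorSystem = ActorSystem("gap-detector")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val queries = PersistenceQuery(system)
    .readJournalFor[CassandraReadJournal](CassandraReadJournal.Identifier)

  // Track the last sequence number seen per persistenceId; a jump of more
  // than one means the stream skipped (or reordered) an event.
  val lastSeen = new ConcurrentHashMap[String, AtomicLong]()

  queries.eventsByTag("tag1", NoOffset).runForeach { env =>
    val prev = lastSeen
      .computeIfAbsent(env.persistenceId, _ => new AtomicLong(0L))
      .getAndSet(env.sequenceNr)
    if (env.sequenceNr != prev + 1)
      system.log.warning("Gap for {}: expected seqNr {}, got {}",
        env.persistenceId, prev + 1, env.sequenceNr)
  }
}
```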
I've had a modified version of the reproducer fail. In this case events were delayed due to the Cassandra/Akka JVMs being overloaded at the end of a time bucket. The stage then goes on to the next bucket and doesn't look back into the previous bucket, as no new events come in for that tag/pid combination. If others have different scenarios with logs, please attach them. The solution could also be #263
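For readers following along: the timebucket values in the logs below are just the event's timestamp floored to the bucket boundary (Hour in this issue, per the TimeBucket(..., Hour, ...) log entries). A sketch of that derivation, assuming hour-sized buckets:

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// Floor an event's timestamp (epoch millis) to its hour bucket.
def timeBucket(eventTimeMillis: Long): Long =
  Instant.ofEpochMilli(eventTimeMillis).truncatedTo(ChronoUnit.HOURS).toEpochMilli

// 1534406400000 is 2018-08-16T08:00:00Z, the bucket holding the
// events discussed further down.
assert(timeBucket(1534407000000L) == 1534406400000L)
```

Once the stage decides the current bucket is exhausted it moves on to the next one, so an event that lands in the old bucket late (e.g. under load) can be passed over.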
Found another possible issue: as Cassandra responds more slowly, the tag writes buffer up, meaning that live eventsByTag queries will have moved past their offset by the time they've been written. I've added in some extra visibility for this while implementing #263, with a log like:
For most use cases the tag write is more efficient than the normal message writes, as tag writes are batched via their partition key. The batch size is 150 by default, meaning there's only 1 tag write per 150 event writes. But if you have a lot of persistenceIds all writing the same tag, this balance will shift. Raised #406 as a possible improvement.
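For reference, the batch size mentioned above is configurable; a sketch assuming the 0.9x configuration layout (the path and default are from memory, so treat them as assumptions and check reference.conf for your version):

```hocon
cassandra-journal {
  events-by-tag {
    # One tag-view write per 150 journalled events by default. Lowering
    # this trades write efficiency for fresher eventsByTag results.
    max-message-batch-size = 150
  }
}
```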
0.91 has been released with an eventual consistency delay. I have run a modified version of the reproducer (not storing the events in memory, so it can use larger numbers; a one-minute time bucket, to make bucket-changing issues more likely) hundreds of times.
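The new delay and the bucket size used in the modified reproducer are both configuration settings; a sketch assuming the 0.91 configuration layout (paths and values are assumptions from memory, check reference.conf for your version):

```hocon
# Delay eventsByTag results so that delayed/buffered tag writes can land
# before the query moves past their offset.
cassandra-query-journal.events-by-tag.eventual-consistency-delay = 5s

# Smaller buckets, as used in the modified reproducer, make
# bucket-boundary issues easier to reproduce.
cassandra-journal.events-by-tag.bucket-size = "Minute"
```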
We use Akka in a two-node cluster. On one side we have 40 persistent actors, with persistence IDs 'Sender-01' up to 'Sender-40', that persist various types of events in Cassandra, each tagged 'BCE'.
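For context, a minimal sketch of how events could end up tagged 'BCE' via the standard WriteEventAdapter mechanism (the adapter name is hypothetical; our application may wire this differently):

```scala
import akka.persistence.journal.{Tagged, WriteEventAdapter}

// Wraps every journalled event in Tagged so the plugin also writes it
// to the tag view under the 'BCE' tag.
class BceTaggingAdapter extends WriteEventAdapter {
  override def manifest(event: Any): String = ""
  override def toJournal(event: Any): Any = Tagged(event, Set("BCE"))
}
```

The adapter is then registered under the journal's event-adapters / event-adapter-bindings configuration.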
On the other side we have a set of 11 persistent actors, with persistence IDs 'Receiver-01' up to 'Receiver-11', that start a stream using an EventsByTagQuery. Each receiver actor is configured with a set of event types which it should process; this is used as a filter on the tagged event stream.
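A sketch of the receiver side as just described, with one eventsByTag stream per receiver, filtered by the event types that receiver handles (handledTypes and the wiring are illustrative assumptions):

```scala
import akka.actor.ActorSystem
import akka.persistence.cassandra.query.scaladsl.CassandraReadJournal
import akka.persistence.query.{NoOffset, PersistenceQuery}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink

object ReceiverSketch extends App {
  implicit val system: ActorSystem = ActorSystem("node1")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  val queries = PersistenceQuery(system)
    .readJournalFor[CassandraReadJournal](CassandraReadJournal.Identifier)

  // Per-receiver filter: the set of event types this receiver processes.
  val handledTypes: Set[Class[_]] = Set(classOf[String])

  queries
    .eventsByTag("BCE", NoOffset)                              // every 'BCE' event
    .filter(env => handledTypes.contains(env.event.getClass))  // receiver's filter
    .runWith(Sink.foreach(env => println(s"processing $env")))
}
```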
For the purposes of testing/debugging this problem, both Akka nodes and a single Cassandra node run on one physical machine (i.e. no real network traffic, VMs or containers are involved).
The problem we encounter is that not all tagged events are seen by each of the 11 receiver actors (even before filtering).
Using additional logging and analyzing Cassandra tables we've reconstructed the situation (a sample verification query is sketched after this list):
Actor 'Sender-25' persists an event of a certain type (let's say 'X')
This event is saved in Cassandra with sequence number 206, timestamp bffcad70-a130-11e8-9cab-6ba70b39e234, timebucket 1534406400000, and writer UUID a4e385a5-e28a-46e7-961d-02ca51a8b5d7
This event is of interest to two receivers: 9 and 11. Ten of the eleven receivers encounter the event on their streams and filter it accordingly; receiver 9 receives and processes the event, but unfortunately receiver 11 happens to be the one that doesn't encounter the event on its stream, so it can't process it (which is a problem in our application)
Our application continues and actor 'Sender-25' persists another event, this time of type 'Y'
This event is saved in Cassandra with sequence number 207, timestamp c0151770-a130-11e8-9cab-6ba70b39e234, timebucket 1534406400000 (the same), and writer UUID a4e385a5-e28a-46e7-961d-02ca51a8b5d7 (the same)
This event is of interest to only one receiver: Receiver-01; all receivers except Receiver-11 see this event
Next, 'Sender-25' persists another event of type 'X'
This event is saved in Cassandra with sequence number 208, timestamp c1c649e0-a130-11e8-9cab-6ba70b39e234, timebucket 1534406400000 (the same), and writer UUID a4e385a5-e28a-46e7-961d-02ca51a8b5d7 (the same)
This event is seen by all receivers except 9 and 11, exactly the two receivers who are interested in events of this type (more work gets stuck in our application)
In fact, the last event that Receiver-11 has seen from Sender-25 is the one with sequence number 205 (received on 2018-08-16 10:45:39.522), which was filtered out because Receiver-11 was not interested in events of that type.
The last event that Receiver-11 has seen from any sender is one sent by Sender-08 (received on 2018-08-16 10:47:00.514).
So we can assume that the Receiver-11 actor and its associated EventsByTagQuery stream are still live. Cassandra also contains all the necessary event information, but somehow some streams refuse to pick up new messages from certain senders.
As a side note: all 11 receivers have seen the event with sequence number 205 sent by Sender-25.
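To illustrate the table analysis used above: a query along these lines (assuming the default akka keyspace and the 0.9x tag_views layout) lists everything the stage should have delivered from the bucket in question:

```sql
SELECT persistence_id, sequence_nr, timestamp, writer_uuid
FROM akka.tag_views
WHERE tag_name = 'BCE' AND timebucket = 1534406400000;
```

Events 206-208 from Sender-25 are present in this table, yet Receiver-11's stream never delivered them.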
All Receiver actors are cluster singletons that reside on Akka node 1. The Sender actors are distributed over the two Akka nodes, but Sender-25 and Sender-08 also happen to reside on Akka node 1.
Now our application idles for some time while I analyse logs and write the text above. Then I launch some more work.
I expect Sender-25 to be invoked at some point to persist an event with sequence number 209. This is indeed the case.
Between timestamps 2018-08-16 12:42:23.558 and 2018-08-16 12:52:16.897, Receivers 1 to 10 (including receiver 9) see the event and act accordingly. Then I get the following log:
2018-08-16 12:53:37.515 INFO - akka.persistence.cassandra.query.EventsByTagStage - BCE: Missing event for new persistence id: Sender-25. Expected sequence nr: 1, actual: 209.
2018-08-16 12:53:37.515 INFO - akka.persistence.cassandra.query.EventsByTagStage - BCE: Executing query to look for missing. Timebucket: TimeBucket(1534410000000, Hour, true, false). From: c1672800-a132-11e8-8080-808080808080 (2018-08-16 09:00:00:000). To: 0e6c39c0-a141-11e8-9cab-6ba70b39e234 (2018-08-16 10:42:22:172)
2018-08-16 12:53:37.518 INFO - akka.persistence.cassandra.query.EventsByTagStage - BCE: Still looking for missing. Some(LookingForMissing{previousOffset=c1672800-a132-11e8-8080-808080808080 bucket=TimeBucket(1534410000000, Hour, true, false) queryPrevious=false maxOffset=0e6c39c0-a141-11e8-9cab-6ba70b39e234 persistenceId=Sender-25 maxSequenceNr=209 missing=Set(69, 138, 101, 88, 170, 115, 5, 120, 202, 10, 56, 142, 153, 174, 185, 42, 24, 37, 25, 52, 14, 184, 110, 125, 196, 157, 189, 20, 46, 93, 152, 57, 78, 29, 164, 179, 106, 121, 84, 147, 61, 132, 89, 133, 116, 1, 74, 206, 6, 60, 117, 85, 201, 102, 28, 38, 160, 70, 192, 21, 137, 165, 33, 92, 197, 65, 97, 156, 9, 188, 53, 169, 141, 109, 124, 77, 193, 96, 173, 13, 129, 41, 134, 73, 128, 105, 2, 205, 166, 32, 34, 148, 45, 161, 64, 180, 17, 149, 176, 191, 22, 44, 59, 118, 204, 27, 71, 12, 54, 144, 49, 181, 86, 159, 187, 172, 113, 81, 76, 7, 39, 98, 208, 103, 140, 91, 66, 155, 198, 108, 130, 135, 3, 80, 167, 35, 162, 112, 123, 194, 145, 48, 63, 18, 150, 95, 50, 67, 199, 177, 182, 16, 127, 31, 154, 11, 72, 175, 143, 43, 99, 87, 203, 104, 40, 26, 158, 186, 55, 114, 171, 139, 23, 8, 75, 119, 58, 207, 82, 151, 36, 168, 146, 30, 51, 190, 183, 19, 107, 4, 126, 136, 79, 195, 94, 131, 47, 15, 163, 200, 68, 62, 178, 90, 111, 122, 83, 100) deadline=Deadline(707031245806532 nanoseconds) failIfNotFound=false). Waiting for next poll.
2018-08-16 12:53:40.524 INFO - akka.persistence.cassandra.query.EventsByTagStage - BCE: Failed to find missing sequence nr: Some(LookingForMissing{previousOffset=c1672800-a132-11e8-8080-808080808080 bucket=TimeBucket(1534413600000, Hour, false, true) queryPrevious=true maxOffset=0e6c39c0-a141-11e8-9cab-6ba70b39e234 persistenceId=Sender-25 maxSequenceNr=209 missing=Set(69, 138, 101, 88, 170, 115, 5, 120, 202, 10, 56, 142, 153, 174, 185, 42, 24, 37, 25, 52, 14, 184, 110, 125, 196, 157, 189, 20, 46, 93, 152, 57, 78, 29, 164, 179, 106, 121, 84, 147, 61, 132, 89, 133, 116, 1, 74, 206, 6, 60, 117, 85, 201, 102, 28, 38, 160, 70, 192, 21, 137, 165, 33, 92, 197, 65, 97, 156, 9, 188, 53, 169, 141, 109, 124, 77, 193, 96, 173, 13, 129, 41, 134, 73, 128, 105, 2, 205, 166, 32, 34, 148, 45, 161, 64, 180, 17, 149, 176, 191, 22, 44, 59, 118, 204, 27, 71, 12, 54, 144, 49, 181, 86, 159, 187, 172, 113, 81, 76, 7, 39, 98, 208, 103, 140, 91, 66, 155, 198, 108, 130, 135, 3, 80, 167, 35, 162, 112, 123, 194, 145, 48, 63, 18, 150, 95, 50, 67, 199, 177, 182, 16, 127, 31, 154, 11, 72, 175, 143, 43, 99, 87, 203, 104, 40, 26, 158, 186, 55, 114, 171, 139, 23, 8, 75, 119, 58, 207, 82, 151, 36, 168, 146, 30, 51, 190, 183, 19, 107, 4, 126, 136, 79, 195, 94, 131, 47, 15, 163, 200, 68, 62, 178, 90, 111, 122, 83, 100) deadline=Deadline(707031245806532 nanoseconds) failIfNotFound=false)
2018-08-16 12:53:40.524 INFO - akka.persistence.cassandra.query.EventsByTagStage - No more missing events. Sending buffered events. BufferedEvents(List(UUIDRow(Sender-25,209,0e6c39c0-a141-11e8-9cab-6ba70b39e234,209,Row[BCE, 1534413600000, 0e6c39c0-a141-11e8-9cab-6ba70b39e234, Sender-25, 209, java.nio.HeapByteBuffer[pos=0 lim=139 cap=139], , NULL, NULL, NULL, 209, 135421337, cx, a4e385a5-e28a-46e7-961d-02ca51a8b5d7])))
And then at timestamp 2018-08-16 12:53:40.524 finally Receiver-11 sees event 209 by Sender-25.
For some reason 'Sender-25' is suddenly seen as a new persistence ID, even though at least sequence numbers 206-208 have been generated (and been seen by some Receivers) during the same application run.
Also, sequence numbers 206-208 are still not picked up by Receiver-11. So these events seem to be lost forever.
Some questions that come to mind:
We recently upgraded from Akka Persistence Cassandra version 0.60 to 0.87 in order to have multiple-tag support, but if event delivery reliability is problematic, we'll have to downgrade again.
The above analysis was performed with version 0.88, so the problem we encountered in 0.87 is still present.