Seti on new/restarted comet not receiving full list of userIds from other nodes #640

Closed
giantninja opened this Issue Mar 11, 2016 · 22 comments

Comments

I have noticed that when spinning up a new comet/node (or restarting one), the new comet joining the Oort cluster doesn't seem to receive the full list of all userIds associated with Oort/Seti, and the count of userIds reported by Seti.getUserIds() on the new comet will not match the other comet, resulting in the new node not being aware of all the users actually associated in Seti.

FWIW: in my two comet/node setup, the only way I've reliably been able to ensure the association counts on each node are accurate is to bring them both up at the same time.
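A minimal diagnostic sketch of the kind of count comparison described above, assuming a configured Oort and Seti instance on each node; the SetiDiagnostics class and dumpState method are illustrative names, not CometD APIs:

```java
import java.util.Set;

import org.cometd.oort.Oort;
import org.cometd.oort.Seti;

public class SetiDiagnostics {
    // Log the local view of the cluster so the counts on each node can be compared.
    public static void dumpState(Oort oort, Seti seti) {
        Set<String> userIds = seti.getUserIds();
        System.out.printf("%s knows %d comets and %d userIds: %s%n",
                oort.getURL(), oort.getKnownComets().size(), userIds.size(), userIds);
    }
}
```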

Link to Google groups thread with Simone:

sbordet commented Apr 6, 2016

Can you please detail the steps you are taking to make Seti go out of sync?

Do you set up the first node, associate a bunch of users, then - with the first node idle - start the second node, and find that it is not in sync?

Or is the first node not idle, continuously associating and disassociating users while the second node starts?

Thanks!

giantninja commented Apr 6, 2016

A common situation where I would see this is as follows (this is a two server/comet setup using OortMulticastConfigServlet):

Fire up both comets (after a new build, let's say) by building the .war file and starting Jetty to deploy the war/application. After client reconnects have settled down and we're back to business as usual, say I want to test another build on one of the comets, or just restart it in general, by stopping Jetty and starting it again via Supervisord (a daemonizing utility we use for managing daemon services). Since the .war is already built, it restarts rather quickly, but the comet that was just brought back up will not have the same number of userIds associated when checking Seti.getUserIds().

It seems like a quick restart would trigger this, but if I took a comet out of service to do some more serious work on it, or system upgrades etc., it would be fine when I brought it back up.

sbordet commented Apr 6, 2016

What are the chances that you are hitting #623? What CometD version are you using?

giantninja commented Apr 6, 2016

That issue (#623) does seem like it could be related and/or the cause of what I am seeing. I'm using CometD 3.0.9 and the latest Jetty 9.2.x.

sbordet commented Apr 6, 2016

#623 was fixed in 3.0.9, so if you still see the issue, then it's something else.

giantninja commented Apr 6, 2016

Gotcha... that issue did sound very similar, but I've been on 3.0.9 since it came out and haven't noticed any difference regarding Seti associations being out of sync.

sbordet commented Apr 6, 2016

The more details you can add, the better... if you can reproduce the issue (even manually) with exact steps, that would be great. Even better if you can reproduce it with DEBUG logs.

Thanks!

giantninja commented Apr 6, 2016

I need to get a valid load test going, because the DEBUG logs in production are insanely large and too much to sift through. I am planning to modify the cometd-java-benchmark test that's in the official distribution, but if you have a better suggestion for load testing a CometD app, I would definitely be interested in it.

giantninja commented Apr 6, 2016

Also, about the DEBUG logging: I think that (at least for the Oort/Seti packages) it would be really useful to bump some of the crucial logging calls (connects/disconnects, relevant listener events, etc.) from DEBUG to INFO, so we could log at the INFO level and see the Oort connects/disconnects and Seti listener events we're talking about, without the huge amount of data that DEBUG obviously generates. Just something that I think would be beneficial, IMHO.

sbordet commented Jun 8, 2017

Is this still an issue? Other Oort users who reported Seti synchronization problems in the past are now happy, so perhaps this issue is solved?

fdummert commented Aug 10, 2017

Hi Simone,

Long time no issue report from me - it is about time to provide one :-) I'll jump on this issue because the symptoms match. I'm running CometD 3.1.1 with Jetty 9.2.22, and we have different cluster setups running (each one pre-configured, no multicasting). I still have Oort/Seti/OortObject issues when restarting Jetty servers in Oort clusters: after restarts, I can see from time to time that neither Seti entries nor OortMap entries match on all nodes, I'm missing join/left events, and broadcast messages do not reach all nodes. Effectively the cluster is broken and the nodes need to be restarted again. I could now reproduce a failure in a test environment consisting of three CometD servers.

N1: http://192.168.255.143:8082
N2: http://192.168.255.143:8083
N3: http://192.168.251.56:8080

I attached an archive with logs, logs.tar.gz (DEBUG level on oort, INFO on most other packages), from N1 and N2. As it is still very noisy, I also attached a filtered version of the N2 log file, oort-overview.log, concentrating on the connection between N2 and N1.

I'll describe roughly what I can see:

Example of a successful restart of N1:
At 2017-08-09 18:21:30, N2 is restarted. It handshakes and is finally connected with N1 at 2017-08-09 18:21:31,929. After some traffic back and forth, N1 is restarted at around 2017-08-09 18:21:31 and is back again at 2017-08-09 18:21:37. In N2's log, we see the connection failure and N1 appearing again around 2017-08-09 18:21:39. N2 needs to re-handshake and is connected back at 2017-08-09 18:21:42,242.
Great.

Failure when restarting N1 another time:
N1 is restarted again around 2017-08-09 18:21:45, noticed by N2 at 2017-08-09 18:21:46,992. We see N1 joining back at 2017-08-09 18:21:53,280, but I can't see any state updates in org.cometd.oort.OortComet.54336c81 anymore. It stays disconnected and all messages are failing. I suppose this would basically explain all the out-of-sync symptoms.

logs.tar.gz
oort-overview.log.gz

Best Regards,
Florian

sbordet commented Sep 12, 2017

Florian, sorry for the delay, I'm now looking at this.

sbordet commented Sep 18, 2017

Train of thought.

In the logs, N2 sees N1 going away at 18:21:47,010.
The N2->N1 connection is maintained by OortComet@54336c81.
The log lines are printed from Oort.OortCometDisconnectListener.removed().
There, we remove the ServerCometInfo by matching the ServerSession id, and we remove the ClientCometInfo by matching the remoteOortId taken from the ServerCometInfo.
The comet left event is notified at 18:21:47,015.

At 18:21:53,075 N2 sees an incoming handshake from N1.
Because of the incoming handshake from N1, N2 wants to connect back to N1.
However, the logs show that N2 is already connected to N1 via OortComet@54336c81, which does not make sense, since it should have been removed 6 seconds earlier.

Now, the removal is done by remoteOortId, while the existence of an already connected OortComet is matched by remoteOortURL.
The only explanation would be that 2 different ClientCometInfos, with different remoteOortIds but the same remoteOortURL and the same OortComet instance, have been registered.

In the logs there is an indirect trace of this double registration under 2 different remoteOortIds: at 18:21:31,881 for remoteOortId=1f908564-ade9-44f8-86a3-ab95aa7a9745 and at 18:21:42,127 for remoteOortId=f18ed9a9-e0ad-45cc-a007-f8bcf67236a2.
This matches with the report that N1 was restarted at about those times.

The only place where the ClientCometInfo is removed is from Oort.OortCometDisconnectListener, but only if it is explicitly disconnected (not in case of timeout because we want the OortComet to continue to try to reconnect to a node that evidently was not properly shutdown).
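A conceptual sketch of the bookkeeping mismatch described above. This is not the actual Oort implementation, just an illustration of how registering and removing by remoteOortId, while checking for an existing connection by remoteOortURL, can leave a stale entry (ClientCometRegistry and its members are simplified stand-ins):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ClientCometRegistry {
    // Simplified stand-in for the real ClientCometInfo.
    static class ClientCometInfo {
        final String remoteOortId;
        final String remoteOortURL;
        final Object oortComet;

        ClientCometInfo(String remoteOortId, String remoteOortURL, Object oortComet) {
            this.remoteOortId = remoteOortId;
            this.remoteOortURL = remoteOortURL;
            this.oortComet = oortComet;
        }
    }

    private final Map<String, ClientCometInfo> clientComets = new ConcurrentHashMap<>();

    // Registration on handshake reply is keyed by remoteOortId.
    void register(ClientCometInfo info) {
        clientComets.put(info.remoteOortId, info);
    }

    // Removal on explicit disconnect is also done by remoteOortId...
    void removeById(String remoteOortId) {
        clientComets.remove(remoteOortId);
    }

    // ...but "are we already connected to that node?" is answered by remoteOortURL.
    ClientCometInfo findByURL(String remoteOortURL) {
        for (ClientCometInfo info : clientComets.values()) {
            if (info.remoteOortURL.equals(remoteOortURL)) {
                return info;
            }
        }
        return null;
    }
}
```

If the same OortComet ends up registered twice under two different remoteOortIds (same URL, new id after the restart of N1), removeById() only removes one entry, and findByURL() still reports a connection that is in fact dead, which matches the log trace above.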

sbordet commented Sep 19, 2017

I was able to reproduce this in a test, and indeed there is a double registration under 2 different remoteOortIds.

Upon detecting a handshake reply, we create and store the ClientCometInfo, and we should check for stale remoteOortIds and remove them.
In the case above, this would have avoided the double registration, so that at 18:21:47,010 the only remaining ClientCometInfo would be removed, and at 18:21:53,075 N2 would connect back to N1 with a new, different OortComet.
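A sketch of the fix direction, continuing the hypothetical ClientCometRegistry above: before storing the new ClientCometInfo on a handshake reply, drop any stale entry that points to the same remoteOortURL under an old remoteOortId.

```java
// Hypothetical continuation of the ClientCometRegistry sketch above.
void registerOnHandshakeReply(ClientCometInfo fresh) {
    // Purge stale registrations for the same node URL under old remoteOortIds.
    clientComets.values().removeIf(existing ->
            existing.remoteOortURL.equals(fresh.remoteOortURL)
                    && !existing.remoteOortId.equals(fresh.remoteOortId));
    clientComets.put(fresh.remoteOortId, fresh);
}
```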

sbordet added a commit that referenced this issue Sep 19, 2017

Issue #640 - Seti on new/restarted comet not receiving full list of userIds from other nodes.

Now removing stale remoteOortIds from _clientComets when processing
handshake replies.
This ensures that there is never a double registration of the same
OortComet for different remoteOortIds.
sbordet commented Sep 19, 2017

@fdummert can you build the latest code and try it out to see if the problem is solved for you?

fdummert commented Sep 21, 2017

Hi Simone,

I actually planned to reproduce the Seti issue with 3.1.2 in a real-life scenario, then patch and verify. So far I haven't been able to see the Seti issue again under 3.1.2 (nor with the patched version). I'll keep concentrating on nailing it down.

However, I found another Oort map sync issue, with both 3.1.2 and the patched version. I'm not sure whether I should abuse this ticket for it or whether we should handle the Oort map issue in a separate ticket (or reopen ticket #643)?

Let me first explain what I'm doing: my SecurityPolicy.canHandshake authenticates an external service connecting to CometD. If successful, I create a UUID and associate the ServerSession of the service with Seti. The authentication process yields additional data that I want to keep in the session and share across nodes. I have an OortMap named "passports" (actually I have two Oort maps, but let's concentrate on the "passports" map), so basically after the Seti association I execute passports.putAndShare(uuid, additionalData, null). After a server restart, all external services reconnect and invoke canHandshake, so in a full-blown setup with many clients there is quite a message storm after a restart.
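A minimal sketch of the flow just described, assuming the CometD 3.1 Oort APIs mentioned in this thread (Seti.associate and OortStringMap.putAndShare); the class name PassportSecurityPolicy, the authenticate() helper and the map value type are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.UUID;

import org.cometd.bayeux.server.BayeuxServer;
import org.cometd.bayeux.server.ServerMessage;
import org.cometd.bayeux.server.ServerSession;
import org.cometd.oort.OortStringMap;
import org.cometd.oort.Seti;
import org.cometd.server.DefaultSecurityPolicy;

public class PassportSecurityPolicy extends DefaultSecurityPolicy {
    private final Seti seti;
    private final OortStringMap<Map<String, Object>> passports;

    public PassportSecurityPolicy(Seti seti, OortStringMap<Map<String, Object>> passports) {
        this.seti = seti;
        this.passports = passports;
    }

    @Override
    public boolean canHandshake(BayeuxServer server, ServerSession session, ServerMessage message) {
        // Authenticate the external service from the handshake message (placeholder).
        Map<String, Object> additionalData = authenticate(message);
        if (additionalData == null) {
            return false;
        }
        String uuid = UUID.randomUUID().toString();
        // Associate the service's session with Seti so other nodes can route to it...
        seti.associate(uuid, session);
        // ...and share the authentication data across the cluster via the OortMap.
        passports.putAndShare(uuid, additionalData, null);
        return true;
    }

    private Map<String, Object> authenticate(ServerMessage message) {
        // Hypothetical placeholder for the real authentication call.
        return Collections.emptyMap();
    }
}
```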

Here are the logs: logs.tar.gz; we're logging the Oort maps in separate -oort files.

I'll explain the issue briefly. We have two nodes, N1: 192.168.144.85:8081 and N2: 192.168.255.143:8082 (both patched with your latest changes in the oort.jar).
After the last restarts (8081 at 13:30:03,630 and 8082 at 13:30:12,010),
8081 has the following local clients:

  • 721a5039-31f1-40e7-8493-d3a9407e97f2
  • 6dc7dead-9ed4-47c1-b778-ce0a15ee3a99
  • 512ceb46-671c-4d58-83a5-531bebc8f6a7

and 8082 has:

  • 155cc60a-3587-4fdf-9a35-b7436cca8a3f
  • eb68c47e-dd75-405e-83f7-53c11e7e0294

The Seti associations are correctly synchronized in this test run. Both "passports" OortMaps have all their local clients, but the following remote entries on top:
8081:

  • 721a5039-31f1-40e7-8493-d3a9407e97f2 (local)
  • 6dc7dead-9ed4-47c1-b778-ce0a15ee3a99 (local)
  • 512ceb46-671c-4d58-83a5-531bebc8f6a7 (local)
  • 7ce2ed8b-2ff7-4ae0-8fcc-eff0ef017105 (stale left-over from N2 before its last restart)
  • dbf52346-4983-48f2-a969-3cc92e4ff7ee (stale left-over from N2 before its last restart)

8082:

  • 155cc60a-3587-4fdf-9a35-b7436cca8a3f (local)
  • eb68c47e-dd75-405e-83f7-53c11e7e0294 (local)
  • 721a5039-31f1-40e7-8493-d3a9407e97f2 (correct entry from N1)
  • 6dc7dead-9ed4-47c1-b778-ce0a15ee3a99 (correct entry from N1)
  • 512ceb46-671c-4d58-83a5-531bebc8f6a7 (correct entry from N1)

So 8082 sees the correct state, but 8081 still sees the entries 8082 had before it was restarted, and the new entries were not added successfully. During the restarts captured in these logs, I'm missing log output like, for example:

2017-09-21 13:29:41,885 [qtp519569038-22] DEBUG org.cometd.oort.OortComet.51bde877 - State updated: DISCONNECTING -> TERMINATING
2017-09-21 13:29:41,886 [qtp519569038-22] DEBUG org.cometd.oort.OortComet.51bde877 - State updated: TERMINATING -> DISCONNECTED
2017-09-21 13:29:41,886 [qtp519569038-22] DEBUG org.cometd.oort.OortComet.51bde877 - Notifying threads in waitFor()
2017-09-21 13:29:41,886 [qtp519569038-22] DEBUG org.cometd.oort.Oort.http_192_168_255_143_8082_comet_server_cometd - Comet left: 64ff6f2a-9135-4ae5-ad73-4f2a3fe17745|http://192.168.144.85:8081/comet-server/cometd
2017-09-21 13:29:41,886 [qtp519569038-22] DEBUG org.cometd.oort.Seti.http_192_168_255_143_8082_comet_server_cometd - Comet left: http://192.168.144.85:8081/comet-server/cometd
2017-09-21 13:29:41,886 [qtp519569038-22] DEBUG org.cometd.oort.Seti.http_192_168_255_143_8082_comet_server_cometd - Removing presences of comet http://192.168.144.85:8081/comet-server/cometd for users [18db6926-b362-465a-b06d-8185b36c1142, 9bae33c4-e4b4-4b17-860b-dd27697bb20e, 51661b46-1108-4921-9715-7d72803fcfc7]

Consequently, the OortMap also does not receive a cometLeft event (which is probably not good). Second, when you search the logs for "eb68c47e" and "155cc60a", you can see that 8081 actually receives both entries, but in reverse order: 8082 sends out 155cc60a with version 1, then eb68c47e with version 2, but 8081 receives first eb68c47e with version 2, then 155cc60a with version 1. Looking at the version check (version > current), could this also lead to inconsistencies?

Best Regards,
Florian
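To make the ordering question in the last paragraph concrete, here is a conceptual sketch (not the actual OortMap code) of how a monotonic version > current guard behaves when updates arrive out of order:

```java
import java.util.HashMap;
import java.util.Map;

class VersionedReceiver {
    private long current;
    private final Map<String, Object> entries = new HashMap<>();

    // Apply an update only if its version is newer than the last applied one.
    boolean apply(long version, String key, Object value) {
        if (version <= current) {
            return false; // an older update arriving late is silently dropped
        }
        current = version;
        entries.put(key, value);
        return true;
    }
}
```

Under such a guard, applying eb68c47e (version 2) first would cause the later arrival of 155cc60a (version 1) to be rejected, which would match the missing entry; whether OortMap really behaves this way is the open question above.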

sbordet commented Sep 26, 2017

I'm making progress on this with your latest logs. I know why the comet left event was not fired.
Will keep this issue posted.

sbordet commented Sep 29, 2017

So this is what is going on regarding the missing comet events.

From your logs, N2 starts, tries to connect to N1, which is still down, enters a reconnect loop towards N1, and eventually is up and running.
Then N1 starts and connects to N2 via OortComet12; N2 sees the handshake from N1 and replies to it; N1 subscribes to channels and sends the join message to N2; N2 receives the subscriptions and the join message and emits the "comet joined" event on N2 for N1.
At this point, data can flow from N2 to N1 via OortComet12, while N2 is not yet connected to N1 via its OortComet21. Now N2 crashes, so it was never able to connect to N1, never able to subscribe, and never able to send the join message, so N1 was never fully aware of N2, although it received data from it and recorded its existence in some internal data structure.
N1 enters a reconnect loop; N2 restarts, connects to N1 via OortComet21, subscribes, sends the join message, and N1 sees a "comet joined" event for N2. Since N1 never fully saw N2 before, there is no "comet left" event on N1 for N2.
Because N2 is still starting up, N1 cannot yet connect to it via OortComet12.
Eventually N2 is up, N1 tries to connect, gets a 402::unknown_client, handshakes again, subscribes, sends the join message, and N2 emits the "comet joined" event on N2 for N1.

sbordet commented Sep 29, 2017

The primary issue is the way you connect your cloud. You probably do this during application startup from Spring. This means that the starting node can connect to an existing node, but the existing node cannot yet connect to the starting node.
See for example how we solve this with the standard OortConfigServlet: https://github.com/cometd/cometd/blob/master/cometd-java/cometd-java-oort/src/main/java/org/cometd/oort/OortConfigServlet.java#L156.

However, it is not ideal that the nodes are in a situation where A is connected to B, B is not yet connected to A (because it is starting), but data can flow from B to A via the AB connection.
That is very likely the reason you see stale data.
However, forcing applications to delay the startup like it's done by OortConfigServlet may not be the right solution, as it imposes a burden on the application developer.

To solve that, I have fixed the internal startup sequence so that now 2 nodes will wait first to be fully connected, and only then will subscribe and send the join message, which will then trigger the "comet joined" event on the receiving node. In this way, data cannot flow between the 2 nodes before they are fully connected.
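For applications that prefer to gate their own startup until the cluster view is established, here is a small sketch using Oort.addCometListener and the "comet joined" callback discussed above; the ClusterReadiness class and awaitFirstComet helper are illustrative, not part of CometD:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

import org.cometd.oort.Oort;

public class ClusterReadiness {
    // Block until at least one other comet has joined this node's cloud,
    // or the timeout expires.
    public static boolean awaitFirstComet(Oort oort, long timeout, TimeUnit unit) throws InterruptedException {
        CountDownLatch joined = new CountDownLatch(1);
        oort.addCometListener(new Oort.CometListener() {
            @Override
            public void cometJoined(Event event) {
                joined.countDown();
            }

            @Override
            public void cometLeft(Event event) {
                // Not relevant for startup gating.
            }
        });
        return joined.await(timeout, unit);
    }
}
```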

sbordet added a commit that referenced this issue Sep 29, 2017

Issue #640 - Seti on new/restarted comet not receiving full list of userIds from other nodes.

Reviewed startup sequence by waiting first for full communication
between nodes in both directions, then sending subscriptions and
join message.
sbordet commented Sep 29, 2017

@fdummert if you can try the latest code, that would be great. Let me know! Thanks!

fdummert commented Oct 6, 2017

Hi Simone,

I tried hard to reproduce any inconsistency, but so far I have not encountered any issues (at least in my clean test environment). So it looks like it has been fixed, thanks a lot!

Best Regards,
Florian

sbordet commented Oct 17, 2017

@fdummert thanks for the feedback and help on this issue!

sbordet closed this Oct 17, 2017

sbordet added this to the 3.1.3 milestone Oct 19, 2017
