Oort synchronization issues, part II. #795

Open
fdummert opened this Issue Oct 11, 2018 · 0 comments


fdummert commented Oct 11, 2018

Hi Simone,

We discussed OortObject sync issues (#667) some two years ago. Things seemed fine after you fixed some loopholes for that particular installation. We now have two more installations active (one 4-node Oort cluster, one 2-node Oort cluster), and unfortunately we are again seeing the NPE described in that ticket, with the Oort cluster unable to resync properly after network issues. It happens in both installations (totally different environments), about once or twice a month. We have no clue what the physical root cause is; it may be router issues between the nodes, which are not under our control. As the live logs are not helpful, we struggled for a while to come up with a reproducible test case. We came back to the idea from ticket #667 (halfway disconnect), and we can now reliably simulate a condition with the same symptoms as follows:

  • We started two CometD machines connected via Oort (192.168.251.50 and 192.168.251.73).
  • Note: two browsers, each sending requests to one node, show up in the logs from time to time with the usual exceptions caused by pressing reload (broken pipe, connection reset by peer). These can be ignored.
  • We then prevented new connections from being established towards 192.168.251.73 and, later, dropped the existing connections towards 192.168.251.73.
  • The first obvious sign in the logs is probably the TimeoutException that appears on both sides, starting at 15:37:52,970 on 192.168.251.50 and at 15:38:42,315 on 192.168.251.73.
  • At around 15:39:40, connectivity was restored.

Things to note:

  • The NPE in OortObjectMergers$ConcurrentMapUnionMerger.merge occurs on 192.168.251.73 during the outage, but also after connectivity has been restored (it is triggered by a browser request trying to access information from an OortMap).
  • Most interesting is probably the period from 15:39:50 onwards, when the nodes should see each other again but do not manage to rejoin.
  • On 192.168.251.50, I can see "Comet left" twice without a "joined" in between (not sure whether this is expected or harmful).
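
To illustrate the kind of failure we suspect behind the first point, here is a minimal, self-contained sketch of a union merge over per-node maps. It is only an assumption on our side that the NPE comes from one node's map being null while the cluster is out of sync; the logs do not confirm this. `NullSafeMapUnion` and `union` are hypothetical names and not part of the CometD API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not CometD code: unions the maps contributed by each node.
public class NullSafeMapUnion {
    public static <K, V> ConcurrentMap<K, V> union(List<? extends Map<K, V>> parts) {
        ConcurrentMap<K, V> result = new ConcurrentHashMap<>();
        for (Map<K, V> part : parts) {
            // Assumption: a node that has not (re)synced yet contributes null.
            // Calling putAll(null) would throw a NullPointerException, so skip it.
            if (part == null) {
                continue;
            }
            // Note: ConcurrentHashMap also rejects null keys and null values.
            result.putAll(part);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> local = new HashMap<>();
        local.put("a", 1);
        Map<String, Integer> remote = new HashMap<>();
        remote.put("b", 2);
        // The null entry stands in for a node whose data is missing.
        ConcurrentMap<String, Integer> merged = union(Arrays.asList(local, null, remote));
        System.out.println(merged.size());
    }
}
```

Without the null check, the same call would fail in `putAll`, which matches the symptom we see but not necessarily its cause.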

We reproduced it with CometD 3.1.5 and Jetty 9.2.26.v20180806. Upgrading to CometD 4.0.0 is planned, but it requires considerable effort on our side. Would you expect these issues to be gone in 4.0.0?

comet-oort.tar.gz

Best Regards,
Florian
