Errors in log when 3rd repo connects to cluster #8
The same error shows up again when repo4 is connected to the cluster.
I have not observed this error before. Are all nodes using the exact same
AMP and Ignite versions?
Yes - I'm pretty sure, since all the instances have been set up with the same Ansible script, but I will double-check that I have not made some stupid Ansible config mistake. I will get back to you.
Hmm - as far as I can see, Ansible looks ok and all the repos are using the same AMPs. I did an md5sum of the AMPs installed on the 4 servers, and this was the (identical) output from all of them:
And since the version of Ignite is determined by the AMPs, it must also be the same on all the servers. The aldica properties in
I will try to perform this test tomorrow: make a permutation of the start order of the repos, i.e. start them in the order repo1, repo3, repo2, repo4. If this fails when repo3 is started, then there must be something wrong with repo3. If the error does not occur until repo2 is started, then we must assume that the problem is somewhere in the module.
The test described above has been performed, i.e. the repos (only the first three) were started in the order repo1, repo3, repo2. Starting up repo1 and repo3 worked fine and there were no errors in the log, but when repo2 was started, the same errors as yesterday showed up. So I guess this means that repo3 is fine and it is safe to assume that all the repos are configured in the same way. The problem apparently arises when the third repo is connected to the cluster (regardless of which repo is used as the third repo).
I tried to run a load test with 4 nodes in the cluster despite the above errors, just to see what happened. The result was that the 4 nodes performed worse than the case with just two nodes, so the issue is affecting performance.
I have expanded my local Docker-based test to include 3 repository nodes, started each of them up in turn and checked their logs. The newly started nodes did not show any errors in their logs, but the primary / first node did indeed show an error message about deserialization. Interestingly, I had done the same test with the current state of my commercial module, and it did not show errors as far as I can remember. Not all of the recent changes in aldica have been ported back to the commercial module, since that one has been mostly frozen due to ongoing customer acceptance tests.
Ok - I experienced the same, i.e. the logs of the new nodes look fine, and the error reported only showed up in the primary node. Do you have time to investigate the issue further? I'm also trying to look into it, but I must admit that my aldica/Ignite skills are currently not sufficient :-) I don't know what the best debugging strategy is... all the errors are coming from Ignite itself, so it is hard to see from where in aldica the problem is arising.
Ok, so this looks related to Ignite event handling and the need to handle discovery events, especially when dealing with potential address translation scenarios (Docker / virtualised network). Since we can never know if a server with such mapping information might join at some time, we always register listeners for these events. Our code itself is not the cause of the issue, it just seems to trigger it. That is also the reason why it only happens with more than 2 servers: when the 3rd server joins the data grid, the discovery event for the join needs to be sent from the primary / 1st node to all other nodes already in the data grid, which means to the 2nd node. Before the 3rd server joins, there is no need for the event to be sent.

Searching the Ignite JIRA for some of the details in the long stack trace, I found IGNITE-11352, which addresses a binary incompatibility in CacheMetricsSnapshot. This class is reported in our stack trace at the lowest level / as the last "Caused by" exception. That JIRA issue has already been marked as resolved for the next 2.8 release. I am going to try performing a test with a SNAPSHOT of Ignite to see if the fix for IGNITE-11352 also fixes our issue.
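For context, a minimal sketch of how such a discovery event listener is typically registered against the Ignite API (illustrative only, not aldica's actual code; the class, method name and placeholder logic are assumptions):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class DiscoveryEventListenerSketch {

    public static void registerDiscoveryListener(final Ignite ignite) {
        // Discovery event types must be enabled beforehand via
        // IgniteConfiguration.setIncludeEventTypes(...), otherwise no events are recorded.
        final IgnitePredicate<DiscoveryEvent> listener = evt -> {
            // Hypothetical placeholder: this is where address translation data
            // for a newly joined node could be processed.
            System.out.println("Discovery event from node: " + evt.eventNode().id());
            return true; // returning true keeps the listener registered
        };

        ignite.events().localListen(listener,
                EventType.EVT_NODE_JOINED,
                EventType.EVT_NODE_LEFT,
                EventType.EVT_NODE_FAILED);
    }
}
```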
Thank you for your fast response! Good news that the problem is not in aldica and that the JIRA issue has been marked as resolved. It will be interesting to see the outcome of your test with the snapshot version of Ignite - let's hope that their fix works.
I tested with a 2.8 SNAPSHOT built locally from current Ignite master, and there is no more exception in any log when the 3rd server joins the data grid. |
We can avoid this exception until 2.8 comes out by disabling statistics on all the caches created by aldica. This would mean we can no longer provide details about cache hits / misses, ratios and average duration of cache operations in the admin console tool. |
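A minimal sketch of what that workaround amounts to in Ignite terms (the cache name and helper method are illustrative assumptions, not aldica's actual code):

```java
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheStatisticsWorkaround {

    // Builds a cache configuration with statistics disabled, so that
    // CacheMetricsSnapshot instances need not be exchanged between nodes
    // (their (de)serialisation is what triggers the reported exception).
    public static CacheConfiguration<Object, Object> withoutStatistics(final String cacheName) {
        final CacheConfiguration<Object, Object> cacheCfg = new CacheConfiguration<>(cacheName);
        cacheCfg.setStatisticsEnabled(false);
        return cacheCfg;
    }
}
```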
This is great news! I'm very happy to hear that it is working with the latest Ignite.
Closing, as the issue has been addressed via a workaround for the current Ignite release.
I have set up 4 repos on AWS with Ansible. The repos were started up one at a time, starting with repo1, but when I got to repo3 I got a lot of errors in the log:
Looking in the "Grids" section on the Ignite admin page it actually says that Topology=3 and Modes=3.
Have you seen these errors before?