Snapshot Restore may leave repositories half-initialized #8879
Comments
This may be due to new timings in Felix 7.0, because the initialization code is a copy/paste of the audit log initialization.
No, actually other nodes would eventually see that repository as initialized and start working. Since they fail, it means the repository was really half-initialized by previous failed initializations and cannot progress any further.
The cluster got into serious trouble earlier and could not complete the repository initialization. We could probably make the initialization more robust, but in general that would not help, because updates during an unhealthy cluster are unpredictable. I'll leave this issue open to investigate; maybe we can eliminate the problematic repository entry service and cache.
Workaround for such cases: restore from a snapshot.
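For clusters where the embedded Elasticsearch is reachable directly, such a restore could be triggered through the cluster admin API. A minimal sketch, assuming a transport-client-style `Client` and hypothetical repository/snapshot names (`my-backup`, `snapshot-1`); an XP installation would normally use XP's own snapshot tooling instead:

```java
import org.elasticsearch.action.admin.cluster.snapshots.restore.RestoreSnapshotResponse;
import org.elasticsearch.client.Client;

public class RestoreWorkaround {

    // Sketch: restore a whole snapshot and block until the restore finishes.
    // "my-backup" and "snapshot-1" are hypothetical names.
    static void restoreFromSnapshot(Client client) {
        RestoreSnapshotResponse response = client.admin().cluster()
                .prepareRestoreSnapshot("my-backup", "snapshot-1")
                .setWaitForCompletion(true)
                .get();
        System.out.println("Restore finished with status: " + response.status());
    }
}
```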
Just had this problem with one of our customer clusters. I was very careful when restarting the cluster, so I am dumbfounded as to why this could have happened. The issue was fixed by deleting the system.scheduler index and then restarting the master nodes again. This process needs to be more robust, i.e. check whether the index is present before creating it.
Fun part: there is a check whether the index already exists.
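For illustration, a minimal sketch of such a guard against a transport-client-style API; the index name passed in is hypothetical, and this is not XP's actual initialization code:

```java
import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsRequest;
import org.elasticsearch.client.Client;

public class IdempotentInit {

    // Sketch: create the index only when it is not already there.
    // The caller supplies the index name; XP's real names are not used here.
    static void createIfAbsent(Client client, String indexName) {
        boolean exists = client.admin().indices()
                .exists(new IndicesExistsRequest(indexName))
                .actionGet()
                .isExists();
        if (!exists) {
            client.admin().indices().prepareCreate(indexName).get();
        }
        // As noted above, XP already performs such an existence check; the
        // real failure mode is an index that exists but carries the wrong
        // (auto-created) mappings, which this guard alone cannot detect.
    }
}
```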
XP version 7.7.1.
We get this problem on a customer using Ansible as well. New cluster environment: loading old XP 7.4.1 data into version 7.6.1 works fine, but when upgrading to 7.7.1 we get the same "Could not initialize" error as mentioned above, and the cluster gives us status 503 Service Unavailable. The funny thing is, we can actually log in to the admin on individual data nodes on port 8080, but then there seems to be no indexed data (e.g. no apps after logging in), despite the data being present before the upgrade to 7.7.1. @sigdestad and I tried to delete the scheduler repo index using an Elasticsearch command, and the command completes, but after restarting the cluster we get the same problem once again.
The question is how we get into the half-initialized state. It should not be possible under normal conditions. If I change the code to tolerate index existence, various other problems may arise, because it may be an index with incorrect mappings (if it was created automatically). So, better safe than sorry.
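To tolerate a pre-existing index more safely, the initializer could compare the live mapping against the expected one and refuse on a mismatch. A rough sketch, assuming the transport-client mappings API (the response shape varies across ES versions) and a hypothetical `EXPECTED_MAPPING` constant:

```java
import java.util.Collections;
import java.util.Map;

import org.elasticsearch.action.admin.indices.mapping.get.GetMappingsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.cluster.metadata.MappingMetaData;
import org.elasticsearch.common.collect.ImmutableOpenMap;

public class MappingGuard {

    // Hypothetical expected mapping, serialized as a Map for comparison.
    static final Map<String, Object> EXPECTED_MAPPING = Collections.emptyMap();

    // Sketch: reuse an existing index only when its mapping matches what the
    // initializer would have created; an auto-created index would differ.
    static boolean mappingLooksCorrect(Client client, String index, String type) throws Exception {
        GetMappingsResponse response = client.admin().indices()
                .prepareGetMappings(index)
                .get();
        ImmutableOpenMap<String, MappingMetaData> byType = response.getMappings().get(index);
        if (byType == null || byType.get(type) == null) {
            return false; // no mapping at all: clearly half-initialized
        }
        Map<String, Object> actual = byType.get(type).sourceAsMap();
        return EXPECTED_MAPPING.equals(actual);
    }
}
```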
Bottom line: I need logs from the servers to see what created that half-initialized repo in the first place.
One way to get into the half-initialized state is to do a snapshot restore on 7.7 from a snapshot taken on version 7.6 or earlier. The system repository on 7.6 has no record of the scheduler index, but the index is certainly there (because a snapshot restore does not delete existing indices).
The main fix is to check for orphan indices (ones that were missing in the snapshot) right after an ES snapshot restore, and delete them; a sketch follows after this comment. Other tunings:
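A minimal sketch of that orphan-index cleanup, assuming a transport-client-style API and hypothetical repository/snapshot names (`my-backup`, `snapshot-1`): it diffs the indices currently in the cluster against the indices recorded in the snapshot and deletes the leftovers.

```java
import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.snapshots.SnapshotInfo;

public class OrphanIndexCleanup {

    // Sketch: after restoring "snapshot-1" from repository "my-backup"
    // (hypothetical names), delete every index that exists in the cluster
    // but was not part of the snapshot, since a restore never deletes
    // pre-existing indices. A real implementation would also skip
    // internal/system indices rather than deleting them blindly.
    static void deleteOrphans(Client client) {
        GetSnapshotsResponse snapshots = client.admin().cluster()
                .prepareGetSnapshots("my-backup")
                .setSnapshots("snapshot-1")
                .get();

        Set<String> inSnapshot = new HashSet<>();
        for (SnapshotInfo info : snapshots.getSnapshots()) {
            inSnapshot.addAll(info.indices());
        }

        String[] inCluster = client.admin().indices()
                .prepareGetIndex()
                .get()
                .getIndices();

        for (String index : inCluster) {
            if (!inSnapshot.contains(index)) {
                client.admin().indices().prepareDelete(index).get();
            }
        }
    }
}
```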
Master node fails creating repo:
Other nodes are left in limbo: