Corrupted shard uncovered during node decommissioning #8827
Found some info about cluster status during shard movement:
Shard relocation started ~19:00:
Shard is currently trying to restore, loading at least one core at 100%:
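For reference, this is roughly how I've been pulling that status (localhost:9200 is a placeholder for one of the nodes):

```
# Watch which shards are moving and how recovery is progressing.
curl -s 'localhost:9200/_cat/shards?v' | grep -E 'RELOCATING|INITIALIZING'
curl -s 'localhost:9200/_cat/recovery?v'
```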
This index was actually fully scrolled with Spark a couple of days ago and there were no issues. I have logs from this node going back a year (since 0.90.5); hope that could help with the investigation.
Sorry, this issue got lost. Given that it is from a year ago, I assume you've resolved the issue already. So much has changed since then that there's no point in investigating further.
Yeah, fixed that by changing company and country, thanks!
:D
I removed a node from allocation by IP (the exclusion call is sketched right after the log excerpt below), and one shard turned up corrupted at the end of the relocation. Logs from the last restart:
... and so on.
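The exclusion itself was done roughly like this (the IP is a placeholder):

```
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.1"
  }
}'
```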
"Maybe that's just a glitch that could disappear with restart" – was my first thought.
... and so on.
"Gee, good job on making backups, myself!" — was my second thought.
Tried an older snapshot:
...and in the logs, as usual:
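For completeness, the restore attempts looked roughly like this (repository, snapshot, and index names are placeholders):

```
curl -XPOST 'localhost:9200/_snapshot/my_s3_repo/snapshot_1/_restore' -d '{
  "indices": "my_old_index"
}'
```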
Well, maybe removing replicas for old indices wasn't such a great decision.
I thought checksums were supposed to be verified during backups. If you have a live replica at the time of a backup, you can at least start recovering early. If you removed that healthy (?) replica after making the snapshot, you're doomed.
I'm using 1.4.1 with the AWS plugin for S3 snapshots on Ceph (which has checksums too).
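For context, the repository was registered along these lines (bucket name and region are placeholders):

```
curl -XPUT 'localhost:9200/_snapshot/my_s3_repo' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-backup-bucket",
    "region": "us-east-1"
  }
}'
```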
Is there a way to "fix" a failed shard by removing its data? If I can't recover the shard, I at least want my cluster to be green.
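What I have in mind is something like the reroute API's allocate command with allow_primary, which (if I understand correctly) allocates a fresh, empty primary and discards whatever data the corrupted shard held (index, shard, and node values are placeholders):

```
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "my_old_index",
        "shard": 0,
        "node": "node-1",
        "allow_primary": true
      }
    }
  ]
}'
```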
cc @imotov