Broken translog on most indexes like NoSuchFileException elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog/translog-226.ckp #16495
Comments
@s1monw @jasontedor please take a look at this related question if you have any time. |
There are tons of bugs fixed along those lines. I don't think upgrading will heal what the bug did to you. The only way to help you would have been to get directory listings of your translog files. But apparently you deleted them all?
There are no known translog issues in the later 2.x versions, so I am pretty confident you will be OK. Can you tell me why you did this:
and also, why didn't you open an issue before you deleted stuff on your filesystem? :) |
By the way, the solution to your problem would have been simple if you still had your tlog files. You ran into the same issue reported here: https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336/6 and fixed here: #15788 |
@s1monw thank you for the response; no worries, I have backups :) Now to answer your questions: I didn't delete the original problematic indexes, except the ELK indexes older than 1.5 months (I don't need those). The other indexes are backed up in a separate folder and copied back. Here is the list of .ckp files for a sample problematic index:
here is the list of .tlog files:
Those were possible solutions for older versions that I found while researching.
So I need to copy each .ckp file by hand in each daily index (38 in my case) on each shard that may fail (5 in my case)? Doesn't any automatic function for that exist in Elastic? Also, #15788 was merged on Jan 6 and I set up ES 2.0.0 several months before that, so that bug probably could have occurred on 2.0.0, yes? Can't 2.2.x detect problems left over from previous versions and resolve them automatically? |
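For reference, the manual workaround from the linked discuss thread amounts to copying the current checkpoint under the name of the missing per-generation checkpoint. A sketch, using the shard path and generation number from this issue's title (both are examples, not fixed values):

# Copy the current checkpoint to stand in for the missing generation checkpoint.
cd elasticsearch/data/dev-cluster/nodes/0/indices/logstash-2016.01.04/2/translog
cp translog.ckp translog-226.ckp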
@s1monw also, the solution suggested here https://discuss.elastic.co/t/cannot-recover-index-because-of-missing-tanslog-files/38336/6 won't work because I have almost 40 indexes (it could be 400 indexes if I kept all my ELK data), and translog-226.ckp is not the only file missing, even within a single index. |
There is no automatic solution for this, but maybe I can come up with a tool that can help you? |
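A tool like that would presumably automate the per-shard copy shown above. A hypothetical bash sketch, assuming this issue's data layout; note that reusing translog.ckp for every missing generation is an assumption that may not hold for all kinds of corruption:

DATA=elasticsearch/data/dev-cluster/nodes/0/indices
for tdir in "$DATA"/*/*/translog; do
  for tlog in "$tdir"/translog-*.tlog; do
    [ -e "$tlog" ] || continue            # glob matched nothing; skip
    gen=$(basename "$tlog" .tlog)         # e.g. translog-226
    gen=${gen#translog-}                  # e.g. 226
    # Recreate any missing per-generation checkpoint from the current one.
    [ -f "$tdir/translog-$gen.ckp" ] || cp "$tdir/translog.ckp" "$tdir/translog-$gen.ckp"
  done
done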
@s1monw okay :( I just thought I had missed something in the Elastic API / CLI docs that could solve such problems automatically. |
@k-vladyslav just to double-check, can I get a copy of |
@s1monw it was sent to your GitHub profile email. |
…ranslog There is simply a coding bug that only happens if translog views are closed after the translog itself is closed. This can happen, for instance, if we hit a disk-full exception and try to repeatedly recover the translog. This will cause a translog-N.ckp file to be deleted, since the wrong generation is used to generate the path to delete. This seems like a copy/paste problem. This bug doesn't affect 5.0. Relates to elastic#16495
…ranslog (#19035) There is simply a coding bug that only happens if translog views are closed after the translog itself is closed. This can happen, for instance, if we hit a disk-full exception and try to repeatedly recover the translog. This will cause a translog-N.ckp file to be deleted, since the wrong generation is used to generate the path to delete. This seems like a copy/paste problem. This bug doesn't affect 5.0. Relates to #16495
Hi @s1monw, can you please suggest actions that would help solve this issue? This is what we're seeing in the logs: that directory is at /var/lib/elasticsearch/elasticsearch/nodes/0/indices/lindex/0/translog/. As you can see, translog-148.ckp is indeed missing, so ES is unable to finish initializing the shard because of this missing transaction-log file. I find this a critical bug, as it can cause denial of service on an otherwise totally working index. I'm asking for two kinds of help here:
Thanks in advance, |
Upgrade to the latest version so this shouldn't happen anymore; we fixed a bunch of bugs related to this. In 5.0, which will go GA soon, we have a command-line tool. For now, you have to restore your index. Again, please upgrade to the latest 2.4 version ASAP. |
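The comment doesn't name the tool, but ES 5.x ships it as elasticsearch-translog. With the node shut down, a corrupt shard translog can be truncated roughly like this (the path is the one from this thread, used here only as an example; truncation discards whatever operations were in the damaged translog):

# Run only while the node is stopped; data in the broken translog is lost.
bin/elasticsearch-translog truncate -d /var/lib/elasticsearch/elasticsearch/nodes/0/indices/lindex/0/translog/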
Currently I have version 2.4.0, and ES can still enter an endless loop. I store ES data on a separately mounted SSD. Sometimes the SSD can be dropped, and after I mount it again, ES enters a recovery loop: it tries to load a transaction-log file that is empty at that moment, and as a result I get an EOFException. |
I have a very similar setup to @igormasternoy's on my dev machine and get a similar error. In my case, my laptop battery died and the computer shut down automatically. When I booted, the translog was corrupted. I am using ES 2.3.3 and will be updating to the latest 2.4. Unfortunately, I am not able to upgrade to 5.x yet because one of the third-party plugins I use hasn't been updated yet. Here is an excerpt from the log:
|
Act 1 - Preface
How I ran into this issue
It is functioning: Kibana works, but one (or more) of my shards can't be initialized, and Elastic keeps trying to start them in an infinite loop. Because of that, there is a negative impact on the server:
So I would suggest that it does not work as it should.
Act 2 - Pathetic attempts to fix that thing on my own
I've read related bug reports and articles, like
#14989, #15021, #9699
and have tried several things:
find . -type f -name '*.ckp' -delete
didn't help
find . -type f -name '*.tlog' -delete
didn't help
didn't help
I also read articles about
_cluster/reroute
like https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html
didn't help
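The reroute recipe from that article boils down to something like the following on ES 2.x. The index, shard, and node values are placeholders drawn from this issue, and note that "allow_primary": true creates an empty primary, losing that shard's data:

# Force-allocate a stuck primary shard (2.x reroute syntax; destructive).
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [{
    "allocate": {
      "index": "logstash-2016.01.04",
      "shard": 2,
      "node": "SOME_NODE_NAME",
      "allow_primary": true
    }
  }]
}'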
http://www.jillesvangurp.com/2015/02/18/elasticsearch-failed-shard-recovery/
I didn't try the "org.apache.lucene.index.CheckIndex" approach because there might be another workaround.
Examples of the logs I'm getting:
Normal start
Failed start
GET /_cat/shards
outputs:
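The output listing is not included here; one quick way to narrow such a listing to the problem shards is to filter out the healthy ones (assuming anything not STARTED is what's being hunted):

# Show only shards that are INITIALIZING, RELOCATING, or UNASSIGNED.
curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED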
Other failed indexes were temporarily moved out of the data folder to a backup folder.
Act 3 - What's next?
This issue reproduces across all the logstash indexes I have; some have these errors, some don't. So it's not a single-index issue.
How I suggest solving this issue
I need some kind of API that would stop Elastic's infinite reports of broken/missing translog files, ignore all previous translog errors, and run without errors, even if that requires losing/dropping/deleting some data, but not all my ELK indexes.
Otherwise, I can't use my logs for the last 1.5 months, because I'm getting many errors about almost every logstash index.
I already deleted all logs older than 1.5 months while trying to solve this issue, but that didn't help either.
Worst case scenario
I've already tried to run Elastic with a clean data directory, and then the ELK stack runs as usual: no CPU overhead, no tons of logs, everything clean and smooth.
I can drop my existing ELK logs this time. BUT! I won't be able to do so every time I get translog errors or something like that.
So, guys: any advice on how to force Elasticsearch to ignore those damn translog errors? :)