Translog recovery can repeatedly fail if we run out of disk #14695

Closed · 2 commits · 7 participants
@s1monw
Contributor

s1monw commented Nov 11, 2015

If we run out of disk while recovering the transaction log,
we repeatedly fail since we expect the latest translog to be uncommitted.
This change adds two safety levels:

  • uncommitted checkpoints are first written to a temp file and then atomically
    renamed into a committed (recovered) checkpoint
  • if the latest uncommitted checkpoint's generation is already recovered, it has to be
    identical; if not, the recovery fails

This allows recovery to fail safely between recovering the latest uncommitted checkpoint and moving
the checkpoint generation to N+1, which can for instance happen when
we run out of disk. If we run out of disk while recovering the uncommitted checkpoint,
either the temp-file write or the atomic rename will fail, so we never end up with a
half-written or corrupted recovered checkpoint.
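The write-temp/fsync/atomic-rename scheme described above can be sketched as follows. This is a minimal, hypothetical example; the class and method names (AtomicCheckpointWrite, writeCheckpointAtomically) are illustrative, not the actual Translog code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the pattern the PR describes, not the real Translog API.
public class AtomicCheckpointWrite {

    public static void writeCheckpointAtomically(Path dir, String fileName, byte[] payload) throws IOException {
        Path tempFile = Files.createTempFile(dir, "checkpoint", ".tmp");
        try {
            // If the disk fills up, this write or the fsync below fails and the
            // real checkpoint file is never touched.
            Files.write(tempFile, payload, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
            try (FileChannel channel = FileChannel.open(tempFile, StandardOpenOption.WRITE)) {
                channel.force(true); // fsync the temp file before making it visible
            }
            // Atomic rename: readers see the old checkpoint or the new one,
            // never a half-written file.
            Files.move(tempFile, dir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tempFile); // no-op if the move already consumed it
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("translog-demo");
        writeCheckpointAtomically(dir, "checkpoint.recovered", "generation=42".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(Files.readAllBytes(dir.resolve("checkpoint.recovered")), StandardCharsets.UTF_8));
    }
}
```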


@martin-g reviewed core/src/main/java/org/elasticsearch/index/translog/Translog.java
try {
    Files.delete(tempFile);
} catch (IOException ex) {
    logger.warn("failed to delete temp file {}", tempFile);
}
martin-g Nov 11, 2015

maybe it would be useful to log the exception message with this warning, just to know what could be the reason for the failure

jpountz Nov 12, 2015

Contributor

+1 to not swallow the original exception

@jpountz reviewed core/src/main/java/org/elasticsearch/index/translog/Checkpoint.java
public int hashCode() {
    int result = (int) (offset ^ (offset >>> 32));
    result = 31 * result + numOps;
    result = 31 * result + (int) (generation ^ (generation >>> 32));
    return result;
}

jpountz Nov 12, 2015

Contributor

can you use Long.hashCode(long) for offset and generation?
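For context on this suggestion, a small sketch (the class LongHashDemo is illustrative): Long.hashCode(long), available since JDK 8, is specified to compute exactly the manual (int) (value ^ (value >>> 32)) folding used in the snippet above, just more readably:

```java
// Demonstrates that Long.hashCode(long) equals the hand-written fold
// from the hashCode() under review, for any input.
public class LongHashDemo {

    // The manual folding used in Checkpoint.hashCode().
    public static int manualHash(long value) {
        return (int) (value ^ (value >>> 32));
    }

    public static void main(String[] args) {
        long[] samples = {0L, -1L, 42L, 123456789012345L, Long.MIN_VALUE};
        for (long value : samples) {
            System.out.println(value + " -> " + (manualHash(value) == Long.hashCode(value)));
        }
    }
}
```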

} else {
    // we first copy this into the temp-file and then fsync it followed by an atomic move into the target file
    // that way if we hit a disk-full here we are still in a consistent state.
    Files.copy(location.resolve(CHECKPOINT_FILE_NAME), tempFile, StandardCopyOption.REPLACE_EXISTING);

jpountz Nov 12, 2015

Contributor

REPLACE_EXISTING should not be needed since we just created the temp file?

s1monw Nov 12, 2015

Contributor

if I remove that you will get a java.nio.file.FileAlreadyExistsException since it's an existing file

jpountz Nov 12, 2015

Contributor

oops right, createTempFile not only creates a file object but also creates the file on the filesystem
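The point settled above can be demonstrated directly: Files.createTempFile creates the (empty) file on disk, so copying into it without REPLACE_EXISTING throws FileAlreadyExistsException. A small sketch with illustrative file names:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Shows why the REPLACE_EXISTING option is needed in the diff above.
public class TempFileCopyDemo {

    // Returns true if a plain copy into an existing target throws, as s1monw observed.
    public static boolean plainCopyThrows(Path source, Path existingTarget) throws IOException {
        try {
            Files.copy(source, existingTarget);
            return false;
        } catch (FileAlreadyExistsException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("demo");
        Path source = Files.write(dir.resolve("checkpoint"), new byte[]{1, 2, 3});
        Path tempFile = Files.createTempFile(dir, "translog", ".tmp");
        System.out.println("temp file already on disk: " + Files.exists(tempFile));
        System.out.println("copy without REPLACE_EXISTING throws: " + plainCopyThrows(source, tempFile));
        Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING); // succeeds
        System.out.println("size after copy: " + Files.size(tempFile));
    }
}
```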

Checkpoint checkpointFromDisk = Checkpoint.read(commitCheckpoint);
if (checkpoint.equals(checkpointFromDisk) == false) {
throw new IllegalStateException("Checkpoint file " + commitCheckpoint.getFileName() + " already exists but has corrupted content expected: " + checkpoint + " but got: " + checkpointFromDisk);
}

jpountz Nov 12, 2015

Contributor

I'm wondering if we still need to fsync in that case, e.g. if ES has been shut down after the atomic move that we do in the else block, but before the two calls to IOUtils.fsync?

s1monw Nov 12, 2015

Contributor

well I think we don't even need the second fsync, IOUtils.fsync(commitCheckpoint, false), since we fsynced the tmp file already, so no need to fsync again. i.e. if the file exists, it's fsynced already. @rmuir WDYT?

jpountz Nov 16, 2015

Contributor

Just discussed it with Robert and indeed this fsync is not necessary.

@jpountz

jpountz commented Nov 12, 2015

Contributor

I left some comments but the change looks good to me!

s1monw added some commits Nov 11, 2015

Translog recovery can repeatedly fail if we run out of disk
(commit message repeats the PR description above)

s1monw added a commit that referenced this pull request Nov 16, 2015: Translog recovery can repeatedly fail if we run out of disk (Close #14695)

s1monw added a commit that referenced this pull request Nov 16, 2015: Translog recovery can repeatedly fail if we run out of disk (Close #14695)

s1monw added a commit that referenced this pull request Nov 16, 2015: Translog recovery can repeatedly fail if we run out of disk (Close #14695)

@s1monw s1monw closed this in 1bdf29e Nov 16, 2015

@jpountz

jpountz commented Nov 16, 2015

Contributor

Merged via 1bdf29e

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Nov 19, 2015

Don't delete temp recovered checkpoint file if it was renamed
#14695 introduced more careful handling when recovering translog checkpoints. Part of it introduced a temp file which is used to write a new checkpoint if needed. That temp file is not always used and thus needs to be cleaned up. However, if it is used, we currently log an ugly warn message about failing to delete it.
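The failure mode bleskes describes can be shown in a few lines. When the temp file was consumed by the rename it no longer exists, so a plain Files.delete throws NoSuchFileException (the source of the warn messages reported in this thread), while Files.deleteIfExists is a quiet no-op. Paths and the class name are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Sketch of the cleanup difference: delete vs. deleteIfExists on a missing file.
public class TempFileCleanupDemo {

    public static boolean plainDeleteThrows(Path missing) throws IOException {
        try {
            Files.delete(missing); // what the old cleanup code effectively did
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("demo");
        Path renamedAway = dir.resolve("translog-123.tlog"); // never created: simulates a file consumed by the rename
        System.out.println("deleteIfExists: " + Files.deleteIfExists(renamedAway)); // false, but no exception
        System.out.println("plain delete throws: " + plainDeleteThrows(renamedAway));
    }
}
```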
@asifalisoomro


asifalisoomro Dec 3, 2015

Hi experts,

I need expert guidance to resolve my issue; I am using the following versions:
kibana 4.2.0
elasticsearch 2.1

The elasticsearch logs are:

[root@centos elasticsearch]# more elasticsearch.log
[2015-12-03 11:21:45,534][WARN ][bootstrap ] unable to install syscall filter: seccomp unavailable: CONFIG_SECCOMP not compiled into kernel, CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER are needed
[2015-12-03 11:21:45,987][INFO ][node ] [Lin Sun] version[2.1.0], pid[3991], build[72cd1f1/2015-11-18T22:40:03Z]
[2015-12-03 11:21:45,987][INFO ][node ] [Lin Sun] initializing ...
[2015-12-03 11:21:46,108][INFO ][plugins ] [Lin Sun] loaded [], sites []
[2015-12-03 11:21:46,149][INFO ][env ] [Lin Sun] using [1] data paths, mounts [[/ (/dev/mapper/vg_centos-lv_root)]], net usable_space [7.7gb], net total_space [17.1gb], spins? [possibly], types [ext4]
[2015-12-03 11:21:48,854][INFO ][node ] [Lin Sun] initialized
[2015-12-03 11:21:48,855][INFO ][node ] [Lin Sun] starting ...
[2015-12-03 11:21:49,128][INFO ][transport ] [Lin Sun] publish_address {192.168.48.63:9300}, bound_addresses {192.168.48.63:9300}
[2015-12-03 11:21:49,159][INFO ][discovery ] [Lin Sun] elasticsearch/pNZpYIdPQq-X4_OFlqUOPg
[2015-12-03 11:21:52,234][INFO ][cluster.service ] [Lin Sun] new_master {Lin Sun}{pNZpYIdPQq-X4_OFlqUOPg}{192.168.48.63}{192.168.48.63:9300}, reason
: zen-disco-join(elected_as_master, [0] joins received)
[2015-12-03 11:21:52,354][INFO ][http ] [Lin Sun] publish_address {192.168.48.63:9200}, bound_addresses {192.168.48.63:9200}
[2015-12-03 11:21:52,355][INFO ][node ] [Lin Sun] started
[2015-12-03 11:21:52,452][INFO ][gateway ] [Lin Sun] recovered [1] indices into cluster_state
[2015-12-03 11:21:53,276][WARN ][index.translog ] [Lin Sun] [.kibana][0] failed to delete temp file /var/lib/elasticsearch/elasticsearch/nodes/0/indices/.kibana/0/translog/translog-8220187682635755947.tlog
java.nio.file.NoSuchFileException: /var/lib/elasticsearch/elasticsearch/nodes/0/indices/.kibana/0/translog/translog-8220187682635755947.tlog
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:324)
at org.elasticsearch.index.translog.Translog.(Translog.java:166)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:209)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:152)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1408)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1403)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:906)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:883)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:245)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


I tried to use localhost instead of the IP in the configuration file, but no luck.

My kibana console shows:
Installed Plugins
Name Status
plugin:kibana Ready

plugin:elasticsearch Unable to connect to Elasticsearch at http://0.0.0.0:9200. Retrying in 2.5 seconds.

plugin:kbn_vislib_vis_types Ready
plugin:markdown_vis Ready
plugin:metric_vis Ready
plugin:spyModes Ready

Regards,

Asif,


@clintongormley

clintongormley commented Dec 3, 2015

Member

@asifalis please ask questions like these in the forum http://discuss.elastic.co/

@daimoniac

daimoniac May 12, 2016

Currently experiencing this or similar issue with Version: 2.2.1, Build: d045fc2/2016-03-09T09:38:54Z, JVM: 1.8.0_66.

Stacktrace:
[2016-05-12 11:20:38,861][WARN ][indices.cluster ] [spx-elastic-FPA2-02] [[releases_2][1]] marking and sending shard failed due to [failed recovery]
[releases_2][[releases_2][1]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/opt/esdata/spxA/nodes/0/indices/releases_2/1/translog/translog-867.ckp];
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [releases_2][[releases_2][1]] EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/opt/esdata/spxA/nodes/0/indices/releases_2/1/translog/translog-867.ckp];
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:155)
at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1510)
at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1494)
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:969)
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:941)
at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
... 5 more
Caused by: java.nio.file.NoSuchFileException: /opt/esdata/spxA/nodes/0/indices/releases_2/1/translog/translog-867.ckp
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:82)
at org.elasticsearch.index.translog.Translog.recoverFromFiles(Translog.java:330)
at org.elasticsearch.index.translog.Translog.(Translog.java:179)
at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:208)
at org.elasticsearch.index.engine.InternalEngine.(InternalEngine.java:151)
... 11 more

/opt/esdata/spxA/nodes/0/indices/releases_2/1/translog# ls -al
total 60
drwxr-xr-x 2 elasticsearch elasticsearch 4096 May 12 11:31 .
drwxr-xr-x 5 elasticsearch elasticsearch 4096 May 10 13:40 ..
-rw-r--r-- 1 elasticsearch elasticsearch 20 May 12 04:44 translog-865.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 20 May 12 04:44 translog-866.ckp
-rw-r--r-- 1 elasticsearch elasticsearch 43 May 12 04:44 translog-867.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 34570 May 12 04:44 translog-868.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 20 May 12 04:44 translog.ckp


@jinleileiking

jinleileiking Aug 20, 2016

I encounter this. But I have disk space.
ENV:

JVM: 1.8.0_45 ES: 2.1.0