Translog based flushes can be disabled after replication relocation or slow recovery #15830
#10624 decoupled translog flush from ongoing recoveries. In the process, translog creation was delayed to the moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, which is in charge of translog based flushes, starts a background checker as soon as the shard is allocated. That checker performs its first check after 5s, expecting the translog to be there. However, if the file copying phase of the recovery takes >5s (likely!) or local recovery is slow, the check can run into an exception and never recover. The end result is that the translog based flush is completely disabled.
Note that this is mitigated by shard inactivity, which triggers a synced flush after 5m of no indexing.
Also, this is already fixed in master, where we don't have this background check.
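To make the failure mode concrete, here is a minimal sketch of a self-rescheduling checker that dies on its first exception. It is not the actual TranslogService code: the class `TranslogCheckSketch`, the queue-based `scheduler`, the tick loop, and `defensiveCheck` are all hypothetical, and the assumption that the real checker reschedules itself only after a successful check is mine. The defensive variant is only there for contrast; it is not the fix that went into master, which dropped the background check altogether.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class TranslogCheckSketch {
    // Hypothetical stand-in for a thread pool: one queued task runs per tick.
    static final Queue<Runnable> scheduler = new ArrayDeque<>();
    // Stays null until the (simulated) engine is created at the end of recovery.
    static Object translog = null;
    static int flushChecks = 0;

    // Fragile pattern: the task reschedules itself only after a successful check.
    static void fragileCheck() {
        if (translog == null) {
            throw new IllegalStateException("translog not created yet");
        }
        flushChecks++;
        scheduler.add(TranslogCheckSketch::fragileCheck);
    }

    // Defensive variant (an assumption, for contrast): always reschedule.
    static void defensiveCheck() {
        try {
            if (translog != null) {
                flushChecks++;
            }
        } finally {
            scheduler.add(TranslogCheckSketch::defensiveCheck);
        }
    }

    // Runs `ticks` scheduler ticks; the translog appears at `engineReadyAtTick`.
    // Returns how many flush checks completed successfully.
    static int simulate(Runnable first, int ticks, int engineReadyAtTick) {
        scheduler.clear();
        translog = null;
        flushChecks = 0;
        scheduler.add(first);
        for (int tick = 0; tick < ticks; tick++) {
            if (tick == engineReadyAtTick) {
                translog = new Object();
            }
            Runnable task = scheduler.poll();
            if (task == null) {
                continue; // nothing scheduled any more: the checker is dead
            }
            try {
                task.run();
            } catch (RuntimeException e) {
                // A thread pool would log or swallow this; nothing reschedules the task.
            }
        }
        return flushChecks;
    }
}
```

In this sketch, a recovery that finishes after the first check leaves the fragile checker with zero successful checks for the rest of the run, while the same checker works normally when the translog exists before the first tick.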