A failed after write refresh can prevent advancing the local checkpoint even when the operations were made durable by the translog #108190

fcofdez · 2024-05-02T12:14:25Z

Today, when an after write refresh fails, we consider the AsyncAfterWriteAction as failed:

elasticsearch/server/src/main/java/org/elasticsearch/action/support/replication/TransportWriteAction.java

Lines 466 to 471 in 7f83189

    
    // TODO: Temporary until we fail unpromotable shard
    if (refreshFailure.get() != null) {
        respond.onFailure(refreshFailure.get());
    } else {
        respond.onSuccess(refreshed.get());
    }

This prevents the local checkpoint from advancing to cover the safely persisted sequence numbers:

elasticsearch/server/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java

Lines 174 to 195 in 4076aa7

    
        primaryResult.runPostReplicationActions(new ActionListener<>() {
            @Override
            public void onResponse(Void aVoid) {
                successfulShards.incrementAndGet();
                updateCheckPoints(
                    primary.routingEntry(),
                    primary::localCheckpoint,
                    primary::globalCheckpoint,
                    () -> decPendingAndFinishIfNeeded()
                );
            }

            @Override
            public void onFailure(Exception e) {
                logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
                // TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
                // go out of sync with the primary
                finishAsFailed(e);
            }
        });
    }

We should reconsider this behaviour and maybe advance the local checkpoint when the refresh fails for an unpromotable shard, since the operations are already durable in the translog.
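To make the difference concrete, here is a minimal, self-contained sketch of the two behaviours. It is a toy model, not Elasticsearch code: `AfterWriteRefreshSketch`, `Tracker`, `finishCurrent` and `finishProposed` are hypothetical names invented for illustration, and the choice to still report the refresh failure to the caller after advancing the checkpoint is an assumption, not something decided in this issue.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model only: every name in this class is hypothetical, not an Elasticsearch API. */
public class AfterWriteRefreshSketch {

    /** Stand-in for the primary's local checkpoint bookkeeping; starts at -1 (nothing persisted). */
    static final class Tracker {
        private final AtomicLong localCheckpoint = new AtomicLong(-1L);

        /** Moves the checkpoint forward to seqNo; it never moves backwards. */
        void markPersisted(long seqNo) {
            localCheckpoint.accumulateAndGet(seqNo, Math::max);
        }

        long localCheckpoint() {
            return localCheckpoint.get();
        }
    }

    /**
     * Behaviour described in this issue: a refresh failure fails the whole
     * post-write step, so the checkpoint update is skipped.
     * Returns true if the write is reported as successful.
     */
    static boolean finishCurrent(Tracker tracker, long persistedSeqNo, Exception refreshFailure) {
        if (refreshFailure != null) {
            return false; // checkpoint is left lagging
        }
        tracker.markPersisted(persistedSeqNo);
        return true;
    }

    /**
     * One possible alternative (an assumption): the operation is already durable,
     * so advance the checkpoint first and only then report the refresh outcome.
     */
    static boolean finishProposed(Tracker tracker, long persistedSeqNo, Exception refreshFailure) {
        tracker.markPersisted(persistedSeqNo);
        return refreshFailure == null;
    }

    public static void main(String[] args) {
        Exception refreshFailure = new RuntimeException("simulated refresh failure");

        Tracker current = new Tracker();
        boolean currentOk = finishCurrent(current, 41L, refreshFailure);
        System.out.println("current:  ok=" + currentOk + " localCheckpoint=" + current.localCheckpoint());

        Tracker proposed = new Tracker();
        boolean proposedOk = finishProposed(proposed, 41L, refreshFailure);
        System.out.println("proposed: ok=" + proposedOk + " localCheckpoint=" + proposed.localCheckpoint());
    }
}
```

In both variants the caller sees the refresh failure; the only difference is whether the checkpoint reflects the operation that was already persisted. Whether the write should instead be reported as successful is left open here.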


elasticsearchmachine · 2024-05-02T12:14:48Z

Pinging @elastic/es-distributed (Team:Distributed)

fcofdez added the >enhancement, :Distributed/CRUD, and Team:Distributed labels on May 2, 2024
