GC deleted live tserver's WAL node in ZooKeeper. #6298

@keith-turner

Describe the bug

Saw this while testing #6217 3908c7e, but I suspect it's a more general issue in main, since that change did not modify the WAL, tserver, or GC code.

At startup, tservers create a node in ZK to track their active WALs. Later, this node must exist when the tserver creates a WAL.
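
To illustrate why the node must exist, here is a minimal in-memory sketch of the ZooKeeper rule that a child node cannot be created under a missing parent. The class `WalNodeSketch` and its methods are hypothetical stand-ins written for this issue, not Accumulo or ZooKeeper APIs, and the WAL name `wal-1` is made up.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory stand-in for the ZooKeeper parent-must-exist rule. */
public class WalNodeSketch {
  // Paths of nodes that currently exist; "/wals" plays the role of the WAL root.
  private final Set<String> nodes = new HashSet<>();

  public WalNodeSketch() {
    nodes.add("/wals");
  }

  /** Creates a node, failing when its parent is absent (mirrors a NoNode error). */
  public void create(String path) {
    String parent = path.substring(0, path.lastIndexOf('/'));
    if (!nodes.contains(parent)) {
      throw new IllegalStateException("NoNode for " + path);
    }
    nodes.add(path);
  }

  /** Removes a node without checking for children; enough for this sketch. */
  public void delete(String path) {
    nodes.remove(path);
  }

  public static void main(String[] args) {
    WalNodeSketch zk = new WalNodeSketch();
    String tserverNode = "/wals/localhost:9800[10000ff8e2b001a]";

    zk.create(tserverNode); // done once, at tserver startup
    zk.delete(tserverNode); // something else removes the node later

    try {
      zk.create(tserverNode + "/wal-1"); // registering a new WAL now fails
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

In this sketch nothing recreates the per-tserver node after startup, so once it is removed every later WAL registration fails.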

In the tserver logs, I saw that it obtained its lock:

2026-04-03T19:44:16,303 Thread[57] [tserver.TabletServer] DEBUG: Obtained tablet server lock /tservers/accumulo/localhost:9800/zlock#3e1f9d39-5a26-4794-95c1-5238103b9ac8#0000000000 localhost:9800[10000ff8e2b001a]

Then, a bit later, the GC process removed the tserver's WAL node in ZK. The GC only does this if the tserver is not in the live tserver set and it has no WALs registered in ZK.

2026-04-03T19:44:50,545 Thread[52] [gc.GarbageCollectWriteAheadLogs] INFO : Removing znode for localhost:9800[10000ff8e2b001a]
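
The removal rule described above can be sketched as a simple predicate. This is only an illustration of the stated condition, not the actual `GarbageCollectWriteAheadLogs` code; the class name, method name, and parameters are hypothetical.

```java
import java.util.Set;

/** Hypothetical sketch of the GC's removal condition as described in this issue. */
public class WalGcConditionSketch {

  /**
   * Returns true when the tserver is absent from the live set AND has no WALs
   * registered, which is the condition under which the GC removes the znode.
   */
  public static boolean shouldRemoveWalNode(
      String tserver, Set<String> liveTservers, Set<String> walsForTserver) {
    return !liveTservers.contains(tserver) && walsForTserver.isEmpty();
  }

  public static void main(String[] args) {
    String ts = "localhost:9800[10000ff8e2b001a]";

    // A tserver with no WALs yet is protected only by being in the live set.
    System.out.println(shouldRemoveWalNode(ts, Set.of(ts), Set.of()));

    // If the same tserver is missing from the live set the GC sees,
    // its node qualifies for removal.
    System.out.println(shouldRemoveWalNode(ts, Set.of(), Set.of()));
  }
}
```

Under this rule, a tserver that holds its lock but has not yet registered a WAL relies entirely on the live tserver set to protect its node, so any case where the GC's view of the live set omits it would lead to the removal logged above.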

Much later, the tserver failed to create a new WAL because its node had been deleted in ZK. This caused all writes on the tserver to hang.

2026-04-03T21:28:19,818 Thread[142] [log.TabletServerLogger] ERROR: Failed to add new WAL marker for hdfs://10.113.13.85:8020/accumulo/wal/localhost+9800/0286a9ab-1735-428f-b091-263b2e42396b
org.apache.accumulo.server.log.WalStateManager$WalMarkerException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /wals/localhost:9800[10000ff8e2b001a]/0286a9ab-1735-428f-b091-263b2e42396b
	at org.apache.accumulo.server.log.WalStateManager.updateState(WalStateManager.java:138)
	at org.apache.accumulo.server.log.WalStateManager.addNewWalMarker(WalStateManager.java:124)
	at org.apache.accumulo.tserver.TabletServer.addNewLogMarker(TabletServer.java:1032)
	at org.apache.accumulo.tserver.log.TabletServerLogger.lambda$startLogMaker$0(TabletServerLogger.java:294)
	at org.apache.accumulo.core.util.threads.Threads.lambda$createCriticalThread$0(Threads.java:76)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /wals/localhost:9800[10000ff8e2b001a]/0286a9ab-1735-428f-b091-263b2e42396b
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:117)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1347)
	at org.apache.accumulo.core.zookeeper.ZooSession.create(ZooSession.java:272)
	at org.apache.accumulo.core.fate.zookeeper.ZooReaderWriter.lambda$putPersistentData$1(ZooReaderWriter.java:92)
	at org.apache.accumulo.core.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:174)
	at org.apache.accumulo.core.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:153)
	at org.apache.accumulo.core.fate.zookeeper.ZooReaderWriter.putPersistentData(ZooReaderWriter.java:90)
	at org.apache.accumulo.core.fate.zookeeper.ZooReaderWriter.putPersistentData(ZooReaderWriter.java:65)
	at org.apache.accumulo.server.log.WalStateManager.updateState(WalStateManager.java:136)


Labels: bug
