Describe the bug
Saw this while testing #6217 3908c7e, but I suspect it's a more general issue in main, as the WAL/tserver/GC code was not modified.
At startup, tservers create a node in ZK to track their active WALs. This node must still exist later when the tserver creates a WAL.
The tserver logs show that it obtained its lock:
2026-04-03T19:44:16,303 Thread[57] [tserver.TabletServer] DEBUG: Obtained tablet server lock /tservers/accumulo/localhost:9800/zlock#3e1f9d39-5a26-4794-95c1-5238103b9ac8#0000000000 localhost:9800[10000ff8e2b001a]
Then, a bit later, the GC process removed the tserver's WAL node in ZK. The GC only does this if the tserver is not in the live tserver set and it has no WALs registered in ZK.
2026-04-03T19:44:50,545 Thread[52] [gc.GarbageCollectWriteAheadLogs] INFO : Removing znode for localhost:9800[10000ff8e2b001a]
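To make the removal condition concrete, here is a minimal sketch of the check as described above. It is only a model: the class name `WalNodeGcSketch`, the method `shouldRemoveWalNode`, and the use of plain collections are hypothetical and are not taken from `GarbageCollectWriteAheadLogs`. It shows that a freshly started tserver with no WALs yet always satisfies the second half of the condition, so the decision depends entirely on whether the live tserver set the GC sees includes that tserver.

```java
import java.util.List;
import java.util.Set;

public class WalNodeGcSketch {

  // Hypothetical model of the condition described in this issue, not Accumulo code:
  // remove the WAL node only if the tserver is not live and has no WALs registered.
  static boolean shouldRemoveWalNode(String tserver, Set<String> liveTservers,
      List<String> registeredWals) {
    return !liveTservers.contains(tserver) && registeredWals.isEmpty();
  }

  public static void main(String[] args) {
    String tserver = "localhost:9800[10000ff8e2b001a]";

    // A tserver that has just started has not registered any WALs yet.
    List<String> noWals = List.of();

    // Assumption for illustration: the live set seen by the GC is missing the tserver.
    Set<String> liveSetMissingTserver = Set.of();
    System.out.println(shouldRemoveWalNode(tserver, liveSetMissingTserver, noWals)); // true

    // With the tserver present in the live set, the node is kept.
    Set<String> liveSetWithTserver = Set.of(tserver);
    System.out.println(shouldRemoveWalNode(tserver, liveSetWithTserver, noWals)); // false
  }
}
```

Why the GC did not see the tserver as live here is not established by the logs; the sketch only shows which input would lead to the removal.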
Much later, the tserver failed to create a new WAL because its node had been deleted in ZK. This caused all writes on the tserver to hang.
2026-04-03T21:28:19,818 Thread[142] [log.TabletServerLogger] ERROR: Failed to add new WAL marker for hdfs://10.113.13.85:8020/accumulo/wal/localhost+9800/0286a9ab-1735-428f-b091-263b2e42396b
org.apache.accumulo.server.log.WalStateManager$WalMarkerException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /wals/localhost:9800[10000ff8e2b001a]/0286a9ab-1735-428f-b091-263b2e42396b
at org.apache.accumulo.server.log.WalStateManager.updateState(WalStateManager.java:138)
at org.apache.accumulo.server.log.WalStateManager.addNewWalMarker(WalStateManager.java:124)
at org.apache.accumulo.tserver.TabletServer.addNewLogMarker(TabletServer.java:1032)
at org.apache.accumulo.tserver.log.TabletServerLogger.lambda$startLogMaker$0(TabletServerLogger.java:294)
at org.apache.accumulo.core.util.threads.Threads.lambda$createCriticalThread$0(Threads.java:76)
at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /wals/localhost:9800[10000ff8e2b001a]/0286a9ab-1735-428f-b091-263b2e42396b
at org.apache.zookeeper.KeeperException.create(KeeperException.java:117)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:53)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1347)
at org.apache.accumulo.core.zookeeper.ZooSession.create(ZooSession.java:272)
at org.apache.accumulo.core.fate.zookeeper.ZooReaderWriter.lambda$putPersistentData$1(ZooReaderWriter.java:92)
at org.apache.accumulo.core.fate.zookeeper.ZooReader.retryLoopMutator(ZooReader.java:174)
at org.apache.accumulo.core.fate.zookeeper.ZooReader.retryLoop(ZooReader.java:153)
at org.apache.accumulo.core.fate.zookeeper.ZooReaderWriter.putPersistentData(ZooReaderWriter.java:90)
at org.apache.accumulo.core.fate.zookeeper.ZooReaderWriter.putPersistentData(ZooReaderWriter.java:65)
at org.apache.accumulo.server.log.WalStateManager.updateState(WalStateManager.java:136)