Summary
A recurring NullPointerException is thrown from the monitorPendingAddOps
scheduled task when PendingAddOp.maybeTimeout dereferences clientCtx,
which has already been cleared to null by recyclePendAddOpObject().
Stack trace (verbatim)
WARNING: Unexpected throwable from task class com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask: Cannot invoke "org.apache.bookkeeper.client.ClientContext.getConf()" because "this.clientCtx" is null
java.lang.NullPointerException: Cannot invoke "org.apache.bookkeeper.client.ClientContext.getConf()" because "this.clientCtx" is null
at org.apache.bookkeeper.client.PendingAddOp.maybeTimeout(PendingAddOp.java:157)
at org.apache.bookkeeper.client.LedgerHandle.monitorPendingAddOps(LedgerHandle.java:2059)
at org.apache.bookkeeper.client.LedgerHandle.lambda$initializeWriteHandleState$0(LedgerHandle.java:290)
at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:625)
at org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService$SafeRunnable.run(SingleThreadSafeScheduledExecutorService.java:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:545)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:369)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:310)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:1474)
Affected version
BookKeeper 4.17.3 (first observed after BK 4.17.3 upgraded Netty from
4.1.121.Final to 4.1.130.Final via PR #4699). The same NPE potential exists
in 4.17.2 code but does not manifest in practice with Netty 4.1.121.
Root cause analysis
LedgerHandle.initializeWriteHandleState (line 290) schedules a recurring
task that calls monitorPendingAddOps at timeoutMonitorIntervalSec
intervals. That method iterates over the ConcurrentLinkedQueue<PendingAddOp> pendingAddOps and calls op.maybeTimeout() on each entry.
PendingAddOp.maybeTimeout (line 157) accesses this.clientCtx.getConf().
recyclePendAddOpObject() (called when an op completes) sets clientCtx = null. The op is removed from pendingAddOps by sendAddSuccessCallbacks
before the user callback fires, so in theory the monitor should never see
a recycled op. However, the ConcurrentLinkedQueue iterator is weakly
consistent: it can reflect concurrent modifications mid-traversal. Under high
write throughput with Netty 4.1.130 (faster I/O completion callbacks), the
window between the iterator "seeing" an op and the scheduler thread calling
maybeTimeout on it is wide enough for the recycling to complete first, so
clientCtx is already null when the call arrives.
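The weakly consistent iterator behavior at the heart of the race can be reproduced deterministically with plain JDK classes (this is a stand-alone sketch; the class and method names are illustrative, not BookKeeper code): an element removed from a ConcurrentLinkedQueue after an iterator was created can still be returned by that iterator, which is exactly how the monitor can hand a completed (and by then recycled) op to maybeTimeout.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedQueue;

// Demonstrates ConcurrentLinkedQueue's weakly consistent iterator:
// an element polled off the queue after the iterator was created is
// still returned by that iterator, because the iterator caches the
// next node at construction time.
class WeaklyConsistentIteratorDemo {
    static String staleElement() {
        ConcurrentLinkedQueue<String> pendingOps = new ConcurrentLinkedQueue<>();
        pendingOps.add("op-1");

        // Monitor thread obtains an iterator (as monitorPendingAddOps does).
        Iterator<String> it = pendingOps.iterator();

        // Meanwhile the I/O path completes the op and removes it
        // (analogous to sendAddSuccessCallbacks, after which the op
        // is recycled and its clientCtx nulled).
        pendingOps.poll();

        // The iterator still yields the already-removed element.
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        System.out.println("iterator returned: " + staleElement());
    }
}
```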
Observed impact
- No impact on write throughput: the NPE is swallowed by SafeRunnable and
  the scheduler keeps re-firing.
- The timeout monitor is silently disabled for any add op encountered in the
null-clientCtx state; a genuinely timed-out op may therefore hang
indefinitely rather than receive a BKTimeoutException.
- One WARNING log line + full stack trace every ~14 seconds floods pod logs.
Suggested fix
Add a null guard in PendingAddOp.maybeTimeout (or at the call site in
LedgerHandle.monitorPendingAddOps):
    // PendingAddOp.java — maybeTimeout
    boolean maybeTimeout() {
        if (clientCtx == null) {
            // op already recycled/reset; skip timeout check
            return false;
        }
        if (MathUtils.elapsedNanos(requestTimeNanos)
                >= clientCtx.getConf().addEntryQuorumTimeoutNanos) {
            timeoutQuorumWait();
            return true;
        }
        return false;
    }
Alternatively, the recyclerHandle.recycle(this) call at the end of
recyclePendAddOpObject could be deferred until after the op has been
confirmed absent from pendingAddOps, eliminating the race entirely.