Skip to content

NullPointerException in PendingAddOp.maybeTimeout when clientCtx is null after recycling #4759

@eolivelli

Description

@eolivelli

Summary

A recurring NullPointerException is thrown from the monitorPendingAddOps
scheduled task when PendingAddOp.maybeTimeout dereferences clientCtx,
which has already been cleared to null by recyclePendAddOpObject().

Stack trace (verbatim)

WARNING: Unexpected throwable from task class com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask: Cannot invoke "org.apache.bookkeeper.client.ClientContext.getConf()" because "this.clientCtx" is null
java.lang.NullPointerException: Cannot invoke "org.apache.bookkeeper.client.ClientContext.getConf()" because "this.clientCtx" is null
	at org.apache.bookkeeper.client.PendingAddOp.maybeTimeout(PendingAddOp.java:157)
	at org.apache.bookkeeper.client.LedgerHandle.monitorPendingAddOps(LedgerHandle.java:2059)
	at org.apache.bookkeeper.client.LedgerHandle.lambda$initializeWriteHandleState$0(LedgerHandle.java:290)
	at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:625)
	at org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService$SafeRunnable.run(SingleThreadSafeScheduledExecutorService.java:46)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:545)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:369)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:310)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1474)

Affected version

BookKeeper 4.17.3 (first observed after BK 4.17.3 upgraded Netty from
4.1.121.Final to 4.1.130.Final via PR #4699). The same NPE potential exists
in 4.17.2 code but does not manifest in practice with Netty 4.1.121.

Root cause analysis

LedgerHandle.initializeWriteHandleState (line 290) schedules a recurring
task that calls monitorPendingAddOps at timeoutMonitorIntervalSec
intervals. That method iterates over the ConcurrentLinkedQueue<PendingAddOp> pendingAddOps and calls op.maybeTimeout() on each entry.

PendingAddOp.maybeTimeout (line 157) accesses this.clientCtx.getConf().

recyclePendAddOpObject() (called when an op completes) sets clientCtx = null. The op is removed from pendingAddOps by sendAddSuccessCallbacks
before the user callback fires, so in theory the monitor should never see
a recycled op. However, the ConcurrentLinkedQueue iterator is weakly
consistent: it can reflect concurrent modifications mid-traversal. Under high
write throughput with Netty 4.1.130 (faster I/O completion callbacks), the
window between the iterator "seeing" an op and the scheduler thread calling
maybeTimeout on it is wide enough for the recycling to complete first, so
clientCtx is already null when the call arrives.

Observed impact

  • No impact on write throughput — the NPE is swallowed by SafeRunnable and
    the scheduler keeps re-firing.
  • The timeout monitor is silently disabled for any add op encountered in the
    null-clientCtx state; a genuinely timed-out op may therefore hang
    indefinitely rather than receive a BKTimeoutException.
  • One WARNING log line + full stack trace every ~14 seconds floods pod logs.

Suggested fix

Add a null guard in PendingAddOp.maybeTimeout (or at the call site in
LedgerHandle.monitorPendingAddOps):

// PendingAddOp.java — maybeTimeout
boolean maybeTimeout() {
    if (clientCtx == null) {
        // op already recycled/reset; skip timeout check
        return false;
    }
    if (MathUtils.elapsedNanos(requestTimeNanos) >= clientCtx.getConf().addEntryQuorumTimeoutNanos) {
        timeoutQuorumWait();
        return true;
    }
    return false;
}

Alternatively, the recyclerHandle.recycle(this) call at the end of
recyclePendAddOpObject could be deferred until after the op has been
confirmed absent from pendingAddOps, eliminating the race entirely.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions