NullPointerException in PendingAddOp.maybeTimeout when clientCtx is null after recycling

## Summary

A recurring `NullPointerException` is thrown from the `monitorPendingAddOps`
scheduled task when `PendingAddOp.maybeTimeout` dereferences `clientCtx`,
which has already been cleared to `null` by `recyclePendAddOpObject()`.

## Stack trace (verbatim)

```
WARNING: Unexpected throwable from task class com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask: Cannot invoke "org.apache.bookkeeper.client.ClientContext.getConf()" because "this.clientCtx" is null
java.lang.NullPointerException: Cannot invoke "org.apache.bookkeeper.client.ClientContext.getConf()" because "this.clientCtx" is null
	at org.apache.bookkeeper.client.PendingAddOp.maybeTimeout(PendingAddOp.java:157)
	at org.apache.bookkeeper.client.LedgerHandle.monitorPendingAddOps(LedgerHandle.java:2059)
	at org.apache.bookkeeper.client.LedgerHandle.lambda$initializeWriteHandleState$0(LedgerHandle.java:290)
	at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:625)
	at org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService$SafeRunnable.run(SingleThreadSafeScheduledExecutorService.java:46)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:545)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:369)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:310)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1090)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:614)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:1474)
```

## Affected version

**BookKeeper 4.17.3** (first observed after BK 4.17.3 upgraded Netty from
4.1.121.Final to 4.1.130.Final via PR #4699). The same NPE potential exists
in 4.17.2 code but does not manifest in practice with Netty 4.1.121.

## Root cause analysis

`LedgerHandle.initializeWriteHandleState` (line 290) schedules a recurring
task that calls `monitorPendingAddOps` at `timeoutMonitorIntervalSec`
intervals. That method iterates over the `ConcurrentLinkedQueue<PendingAddOp>
pendingAddOps` and calls `op.maybeTimeout()` on each entry.

`PendingAddOp.maybeTimeout` (line 157) accesses `this.clientCtx.getConf()`.

`recyclePendAddOpObject()` (called when an op completes) sets `clientCtx =
null`. The op is removed from `pendingAddOps` by `sendAddSuccessCallbacks`
*before* the user callback fires, so in theory the monitor should never see
a recycled op. However, the `ConcurrentLinkedQueue` iterator is weakly
consistent: it can reflect concurrent modifications mid-traversal. Under high
write throughput with Netty 4.1.130 (faster I/O completion callbacks), the
window between the iterator "seeing" an op and the scheduler thread calling
`maybeTimeout` on it is wide enough for the recycling to complete first, so
`clientCtx` is already null when the call arrives.

## Observed impact

- No impact on write throughput — the NPE is swallowed by `SafeRunnable` and
  the scheduler keeps re-firing.
- The timeout monitor is silently disabled for any add op encountered in the
  null-`clientCtx` state; a genuinely timed-out op may therefore hang
  indefinitely rather than receive a `BKTimeoutException`.
- One WARNING log line + full stack trace every ~14 seconds floods pod logs.

## Suggested fix

Add a null guard in `PendingAddOp.maybeTimeout` (or at the call site in
`LedgerHandle.monitorPendingAddOps`):

```java
// PendingAddOp.java — maybeTimeout
boolean maybeTimeout() {
    if (clientCtx == null) {
        // op already recycled/reset; skip timeout check
        return false;
    }
    if (MathUtils.elapsedNanos(requestTimeNanos) >= clientCtx.getConf().addEntryQuorumTimeoutNanos) {
        timeoutQuorumWait();
        return true;
    }
    return false;
}
```

Alternatively, the `recyclerHandle.recycle(this)` call at the end of
`recyclePendAddOpObject` could be deferred until after the op has been
confirmed absent from `pendingAddOps`, eliminating the race entirely.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NullPointerException in PendingAddOp.maybeTimeout when clientCtx is null after recycling #4759

Summary

Stack trace (verbatim)

Affected version

Root cause analysis

Observed impact

Suggested fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

NullPointerException in PendingAddOp.maybeTimeout when clientCtx is null after recycling #4759

Description

Summary

Stack trace (verbatim)

Affected version

Root cause analysis

Observed impact

Suggested fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions