@liangyepianzhou liangyepianzhou commented Sep 21, 2023

Motivation

The metadata of the ledger still exists, but the offloaded metadata is lost or damaged. Reading from the offloader then fails with a NullPointerException:

2023-08-28T02:35:23,512+0000 [offloader-OrderedScheduler-1-0] WARN  org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl - There hasn't enough data to read, current available data has 0 bytes, seek to the first entry 0 to avoid EOF exception
2023-08-28T02:35:23,541+0000 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl - Failed to read entries 0 - 0 from the offloader in ledger 1358058
java.io.IOException: Error reading from BlobStore
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedInputStreamImpl.refillBufferIfNeeded(BlobStoreBackedInputStreamImpl.java:91) ~[?:?]
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedInputStreamImpl.read(BlobStoreBackedInputStreamImpl.java:99) ~[?:?]
        at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[?:?]
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.lambda$readAsync$1(BlobStoreBackedReadHandleImpl.java:136) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.87.Final.jar:4.1.87.Final]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.NullPointerException
2023-08-28T02:35:23,542+0000 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Unknown exception for ManagedLedgerException.
java.io.IOException: Error reading from BlobStore
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedInputStreamImpl.refillBufferIfNeeded(BlobStoreBackedInputStreamImpl.java:91) ~[?:?]
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedInputStreamImpl.read(BlobStoreBackedInputStreamImpl.java:99) ~[?:?]
        at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[?:?]
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.lambda$readAsync$1(BlobStoreBackedReadHandleImpl.java:136) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.87.Final.jar:4.1.87.Final]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.NullPointerException
2023-08-28T02:35:23,544+0000 [offloader-OrderedScheduler-0-0] WARN  org.apache.bookkeeper.mledger.impl.OpReadEntry - [test/test/persistent/test-partition-1][test-consume] read failed from ledger at position:1358058:0
org.apache.bookkeeper.mledger.ManagedLedgerException: Other exception
Caused by: java.io.IOException: Error reading from BlobStore
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedInputStreamImpl.refillBufferIfNeeded(BlobStoreBackedInputStreamImpl.java:91) ~[?:?]
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedInputStreamImpl.read(BlobStoreBackedInputStreamImpl.java:99) ~[?:?]
        at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[?:?]
        at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.lambda$readAsync$1(BlobStoreBackedReadHandleImpl.java:136) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.87.Final.jar:4.1.87.Final]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.NullPointerException
2023-08-28T02:35:23,545+0000 [broker-topic-workers-OrderedExecutor-5-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://test/test/test-partition-1 / test-consume -Consumer{subscription=PersistentSubscription{topic=persistent://test/test/test-partition-1, name=test-consume}, consumerId=1, consumerName=07da3, address=/127.0.0.1:39850}] Error reading entries at 1358058:0 : Other exception - Retrying to read in 27.426 seconds
2023-08-28T02:35:25,999+0000 [offloader-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl - Failed to read entries 0 - 0 from the offloader in ledger 1358058
java.io.IOException: Error reading from BlobStore

Modification

Throw a NonRecoverableLedgerException instead of letting the read fail with a NullPointerException. If the user has configured autoSkipNonRecoverableData=true, the caller will skip this ledger.

} else if (cursor.config.isAutoSkipNonRecoverableData() && exception instanceof NonRecoverableLedgerException) {
    log.warn("[{}][{}] read failed from ledger at position:{} : {}", cursor.ledger.getName(), cursor.getName(),
            readPosition, exception.getMessage());
    final ManagedLedgerImpl ledger = (ManagedLedgerImpl) cursor.getManagedLedger();
    Position nexReadPosition;
    Long lostLedger = null;
    if (exception instanceof ManagedLedgerException.LedgerNotExistException) {
        // try to find and move to next valid ledger
        nexReadPosition = cursor.getNextLedgerPosition(readPosition.getLedgerId());
        lostLedger = readPosition.ledgerId;
    } else {
        // Skip this read operation
        nexReadPosition = ledger.getValidPositionAfterSkippedEntries(readPosition, count);
    }
    // fail callback if it couldn't find next valid ledger
    if (nexReadPosition == null) {
        callback.readEntriesFailed(exception, ctx);
        cursor.ledger.mbean.recordReadEntriesError();
        recycle();
        return;
    }
    updateReadPosition(nexReadPosition);
    if (lostLedger != null) {
        cursor.getManagedLedger().skipNonRecoverableLedger(lostLedger);
    }
    checkReadCompletion();
} else {
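
The idea behind the modification can be sketched in isolation. The following is a minimal, self-contained illustration and not the code from this PR: the class `OffloadReadSketch`, the in-memory `STORE`, the nested `NonRecoverableLedgerException`, and the methods `readOffloaded` and `readOrSkip` are all hypothetical stand-ins. It only shows a missing offloaded object being reported as a typed, non-recoverable error (rather than surfacing later as a NullPointerException), and a caller that skips when a flag modeled on autoSkipNonRecoverableData is set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class OffloadReadSketch {
    /** Hypothetical stand-in for a non-recoverable ledger exception type. */
    static class NonRecoverableLedgerException extends Exception {
        NonRecoverableLedgerException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory stand-in for the blob store: object key -> bytes. */
    static final Map<String, byte[]> STORE = new HashMap<>();

    /** Reads an offloaded object; a missing object becomes a typed error, not an NPE. */
    static byte[] readOffloaded(String key) throws NonRecoverableLedgerException {
        byte[] data = STORE.get(key);
        if (data == null) {
            throw new NonRecoverableLedgerException("Offloaded object not found: " + key);
        }
        return data;
    }

    /**
     * Returns the data if it could be read. When the object is missing and skipping
     * is enabled, returns an empty Optional so the caller can move on; otherwise the
     * exception is propagated.
     */
    static Optional<byte[]> readOrSkip(String key, boolean autoSkipNonRecoverableData)
            throws NonRecoverableLedgerException {
        try {
            return Optional.of(readOffloaded(key));
        } catch (NonRecoverableLedgerException e) {
            if (autoSkipNonRecoverableData) {
                return Optional.empty();
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        STORE.put("ledger-present", new byte[] {1, 2, 3});
        System.out.println(readOrSkip("ledger-present", true).isPresent());
        System.out.println(readOrSkip("ledger-missing", true).isPresent());
    }
}
```

With skipping disabled, `readOrSkip("ledger-missing", false)` propagates the exception, which corresponds to the branch in the excerpt above where the read callback is failed.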

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

Permalink for the code excerpt in the Modification section:
https://github.com/apache/pulsar/blob/66271e3bf3cf0699789c759d852c24e6f00f90cd/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java#L114-#L140
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 21, 2023
@liangyepianzhou liangyepianzhou self-assigned this Sep 22, 2023
log.warn("There hasn't enough data to read, current available data has {} bytes,"
        + " seek to the first entry {} to avoid EOF exception", inputStream.available(), firstEntry);
seekToEntry(firstEntry);
if (dataStream.available() < 12) {
We need to check if the data or index file exists
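
One way to read this suggestion, sketched under assumptions: before attempting a read, confirm that both the data object and the index object are present. The `ObjectStore` interface, the method `offloadedLedgerIsReadable`, and the key names below are hypothetical and are not the offloader's actual API.

```java
import java.util.Set;

public class OffloadExistenceCheckSketch {
    /** Hypothetical minimal view of an object store; not the real blob store API. */
    interface ObjectStore {
        boolean exists(String key);
    }

    /** Returns true only when both the data object and its index object are present. */
    static boolean offloadedLedgerIsReadable(ObjectStore store, String dataKey, String indexKey) {
        return store.exists(dataKey) && store.exists(indexKey);
    }

    public static void main(String[] args) {
        // Hypothetical key names; only the data object is present here.
        Set<String> keys = Set.of("ledger-data");
        ObjectStore store = keys::contains;
        System.out.println(offloadedLedgerIsReadable(store, "ledger-data", "ledger-index"));
    }
}
```

A check like this would let the read path fail early with a descriptive error when either object is missing.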

@zymap zymap left a comment

The NPE happened while reading the bucket; I don't think this PR fixes the NPE issue.

@liangyepianzhou liangyepianzhou marked this pull request as draft September 27, 2023 14:37
@liangyepianzhou liangyepianzhou closed this by deleting the head repository Sep 27, 2023

3 participants