Retrieving data from Tiered Storage (S3) fails initially then succeeds after 15 second retry (Pulsar 2.6.1) #9266

Closed
frankjkelly opened this issue Jan 21, 2021 · 10 comments
Labels: lifecycle/stale, type/bug

Comments

@frankjkelly (Contributor)

Describe the bug
Retrieving data from Tiered Storage (S3) fails initially according to the broker logs ("Unknown exception"), then succeeds after a 15-second retry, resulting in a minimum response time of 15 seconds.
I can reproduce this consistently.

To Reproduce
Steps to reproduce the behavior:

  1. Configure Tiered storage offload to S3
  2. Upload data to a topic
  3. Wait for Managed Ledger Rollover
  4. Offload Topic
  5. Attempt to retrieve data
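
A rough sketch of the pulsar-admin commands behind steps 1, 4 and 5 (the topic name below is a placeholder; with the namespace threshold set to 0, offload also triggers automatically on rollover):

$ pulsar-admin namespaces set-offload-threshold --size 0 cogito-dialog/wav
$ pulsar-admin namespaces set-offload-deletion-lag --lag 15m cogito-dialog/wav
$ pulsar-admin topics offload --size-threshold 0 persistent://cogito-dialog/wav/<topic>
$ pulsar-admin topics offload-status persistent://cogito-dialog/wav/<topic>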

Expected behavior
I expect some delay, but not an exception followed by a 15-second retry.

Logs

 17:50:18.735 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31] [PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] Updating cursor info ledgerId=1108490 mark-delete=1107259:-1
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31] Updated cursor PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7 with ledger id 1108490 md-position=1107259:-1 rd-position=1107259:0
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31] Opened new cursor: ManagedCursorImpl{ledger=cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31, name=PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7, ackPos=1107259:-1, readPos=1107259:0}
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31-PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] Rewind from 1047055:0 to 1047055:0
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31] There are no replicated subscriptions on the topic
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31][PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] Created new subscription for 0
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:43862] Created subscription on topic persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31 / PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7
17:50:18.771 [bookkeeper-ml-workers-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.mledger.impl.OpReadEntry - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31][PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] read failed from ledger at position:1047055:0 : Unknown exception
17:50:18.771 [broker-topic-workers-OrderedScheduler-4-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31 / PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7-Consumer{subscription=PersistentSubscription{topic=persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31, name=PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7}, consumerId=0, consumerName=8f423, address=/127.0.0.1:43862}] Error reading entries at 1047055:0 : Unknown exception - Retrying to read in 15.0 seconds

There was no stack trace for that error

Desktop (please complete the following information):

  • OS: Linux
  • Pulsar: 2.6.1

Additional context

Tiered Offload Settings

$ ./apache-pulsar-2.6.1/bin/pulsar-admin --admin-url http://platform-pulsar-broker:8080 namespaces get-offload-threshold cogito-dialog/wav
0
$ ./apache-pulsar-2.6.1/bin/pulsar-admin --admin-url http://platform-pulsar-broker:8080 namespaces get-offload-deletion-lag cogito-dialog/wav
15 minute(s)

Broker Overrides to defaults

        managedLedgerMaxLedgerRolloverTimeMinutes: "90"

Other Broker settings

# grep ffload broker.conf  | grep -v "#"
managedLedgerOffloadDeletionLagMs=14400000
managedLedgerOffloadAutoTriggerSizeThresholdBytes=-1
offloadersDirectory=./offloaders
managedLedgerOffloadDriver=aws-s3
managedLedgerOffloadMaxThreads=2
managedLedgerOffloadPrefetchRounds=1
s3ManagedLedgerOffloadRegion=us-east-1
s3ManagedLedgerOffloadBucket=eks-saas-pulsar-455275ef9748510192359f2f92675b2e
s3ManagedLedgerOffloadServiceEndpoint=
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864
s3ManagedLedgerOffloadReadBufferSizeInBytes=1048576
gcsManagedLedgerOffloadRegion=
gcsManagedLedgerOffloadBucket=
gcsManagedLedgerOffloadMaxBlockSizeInBytes=67108864
gcsManagedLedgerOffloadReadBufferSizeInBytes=1048576
gcsManagedLedgerOffloadServiceAccountKeyFile=
fileSystemProfilePath=../conf/filesystem_offload_core_site.xml
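
To confirm that the ledger being read (1047055 in the logs above) was actually offloaded, the per-ledger internal stats should show offloaded=true (topic name is a placeholder):

$ pulsar-admin topics stats-internal persistent://cogito-dialog/wav/<topic>
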
@apapia commented Jan 22, 2021

Is it possible to reduce the retry time (15s) as a workaround?

@frankjkelly (Contributor, Author)

Good idea - I looked through the broker conf files and could not find where this was specified.

@Renkai (Contributor) commented Jan 26, 2021

I have submitted a pull request, #9323, to show more information about the "Unknown exception".
It's better to decide how to deal with it (retry or do something extra) after we actually know what it is.

@codelipenghui (Contributor)

Thanks @Renkai

@frankjkelly (Contributor, Author)

Thanks indeed - we are using Pulsar 2.6.1 - is there any chance this could be back-ported to the next maintenance release (2.6.4)?

@codelipenghui (Contributor)

Hi, @frankjkelly. Sorry, the next release version is 2.7.1 and we do not have a plan for another 2.6.x release. If you want to apply this patch, we can help cherry-pick the fix to branch-2.6 and you can build it yourself.

Also, streamnative.io has a weekly release at https://github.com/streamnative/pulsar/releases and https://hub.docker.com/r/streamnative/pulsar/tags?page=1&ordering=last_updated. It is based on the official branch. We can start a quick 2.6.3.x release with this patch as well.

@frankjkelly (Contributor, Author)

@codelipenghui it appears now that there might be a plan for a 2.6.4 release (I noticed the release label on #7356). Any chance #9323 could be back-ported?
Thanks in advance.

@frankjkelly (Contributor, Author)

FYI could be a dupe of #8282

@codelipenghui (Contributor)

This issue has had no activity for 30 days; marking it with the Stale label.

@frankjkelly (Contributor, Author)

No longer an issue for us - we wrote our own S3 offloader for various reasons.
