Retrieving data from Tiered Storage (S3) fails initially then succeeds after 15 second retry (Pulsar 2.6.1) #9266

Closed
frankjkelly opened this issue Jan 21, 2021 · 10 comments
Labels: lifecycle/stale, type/bug

Comments

@frankjkelly (Contributor)

Describe the bug
Retrieving data from Tiered Storage (S3) fails initially according to the broker logs ("Unknown exception"), then succeeds after a 15-second retry, resulting in a minimum response time of 15 seconds.
I can reproduce this consistently.

To Reproduce
Steps to reproduce the behavior:

  1. Configure Tiered storage offload to S3
  2. Upload data to a topic
  3. Wait for Managed Ledger Rollover
  4. Offload Topic
  5. Attempt to retrieve data
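
A rough sketch of the pulsar-admin commands behind steps 1, 4 and 5 (the topic name below is a placeholder; with the namespace threshold set to 0, offload also triggers automatically on rollover):

$ pulsar-admin namespaces set-offload-threshold --size 0 cogito-dialog/wav
$ pulsar-admin namespaces set-offload-deletion-lag --lag 15m cogito-dialog/wav
$ pulsar-admin topics offload --size-threshold 0 persistent://cogito-dialog/wav/<topic>
$ pulsar-admin topics offload-status persistent://cogito-dialog/wav/<topic>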

Expected behavior
I expect some delay, but not an exception followed by a 15-second retry.

Logs

 17:50:18.735 [BookKeeperClientWorker-OrderedExecutor-2-0] INFO  org.apache.bookkeeper.mledger.impl.MetaStoreImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31] [PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] Updating cursor info ledgerId=1108490 mark-delete=1107259:-1
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31] Updated cursor PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7 with ledger id 1108490 md-position=1107259:-1 rd-position=1107259:0
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31] Opened new cursor: ManagedCursorImpl{ledger=cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31, name=PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7, ackPos=1107259:-1, readPos=1107259:0}
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31-PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] Rewind from 1047055:0 to 1047055:0
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31] There are no replicated subscriptions on the topic
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentTopic - [persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31][PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] Created new subscription for 0
17:50:18.739 [bookkeeper-ml-workers-OrderedExecutor-5-0] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:43862] Created subscription on topic persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31 / PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7
17:50:18.771 [bookkeeper-ml-workers-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.mledger.impl.OpReadEntry - [cogito-dialog/wav/persistent/57f5ea00-a7a2-49c3-95d3-c8237d893d31][PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7] read failed from ledger at position:1047055:0 : Unknown exception
17:50:18.771 [broker-topic-workers-OrderedScheduler-4-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31 / PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7-Consumer{subscription=PersistentSubscription{topic=persistent://cogito-dialog/wav/57f5ea00-a7a2-49c3-95d3-c8237d893d31, name=PulsarService-31955749-c92b-4bfe-a06a-2985d66b69c7}, consumerId=0, consumerName=8f423, address=/127.0.0.1:43862}] Error reading entries at 1047055:0 : Unknown exception - Retrying to read in 15.0 seconds

There was no stack trace for that error

Desktop (please complete the following information):

  • OS: Linux
  • Pulsar: 2.6.1

Additional context

Tiered Offload Settings

$ ./apache-pulsar-2.6.1/bin/pulsar-admin --admin-url http://platform-pulsar-broker:8080 namespaces get-offload-threshold cogito-dialog/wav
0
$ ./apache-pulsar-2.6.1/bin/pulsar-admin --admin-url http://platform-pulsar-broker:8080 namespaces get-offload-deletion-lag cogito-dialog/wav
15 minute(s)

Broker Overrides to defaults

        managedLedgerMaxLedgerRolloverTimeMinutes: "90"

Other Broker settings

# grep ffload broker.conf  | grep -v "#"
managedLedgerOffloadDeletionLagMs=14400000
managedLedgerOffloadAutoTriggerSizeThresholdBytes=-1
offloadersDirectory=./offloaders
managedLedgerOffloadDriver=aws-s3
managedLedgerOffloadMaxThreads=2
managedLedgerOffloadPrefetchRounds=1
s3ManagedLedgerOffloadRegion=us-east-1
s3ManagedLedgerOffloadBucket=eks-saas-pulsar-455275ef9748510192359f2f92675b2e
s3ManagedLedgerOffloadServiceEndpoint=
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864
s3ManagedLedgerOffloadReadBufferSizeInBytes=1048576
gcsManagedLedgerOffloadRegion=
gcsManagedLedgerOffloadBucket=
gcsManagedLedgerOffloadMaxBlockSizeInBytes=67108864
gcsManagedLedgerOffloadReadBufferSizeInBytes=1048576
gcsManagedLedgerOffloadServiceAccountKeyFile=
fileSystemProfilePath=../conf/filesystem_offload_core_site.xml
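
To confirm that the ledger being read (1047055 in the logs above) was actually offloaded, the per-ledger internal stats should show offloaded=true (topic name is a placeholder):

$ pulsar-admin topics stats-internal persistent://cogito-dialog/wav/<topic>
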
@apapia commented Jan 22, 2021

Is it possible to reduce the retry time (15s) as a workaround?

@frankjkelly (Contributor, Author)

Good idea - I looked through the broker conf files and could not find where this was specified.

@Renkai (Contributor) commented Jan 26, 2021

I have submitted a pull request, #9323, to show more information about the "Unknown exception".
It's better to decide how to deal with it (retry or do something extra) after we actually know what it is.

@codelipenghui (Contributor)

Thanks @Renkai

@frankjkelly (Contributor, Author)

Thanks indeed - we are using Pulsar 2.6.1 - is there any chance this could be back-ported to the next maintenance release (2.6.4)?

@codelipenghui (Contributor)

Hi, @frankjkelly. Sorry, the next release version is 2.7.1 and we do not have a plan for another 2.6.x release. If you want to apply this patch, we can help cherry-pick the fix to branch-2.6 and you can build it yourself.

Also, streamnative.io has a weekly release at https://github.com/streamnative/pulsar/releases and https://hub.docker.com/r/streamnative/pulsar/tags?page=1&ordering=last_updated. It is based on the official branch. We can start a quick 2.6.3.x release with this patch as well.

@frankjkelly (Contributor, Author)

@codelipenghui it appears now that there might be a plan for a 2.6.4 release (I noticed the release label on #7356). Any chance #9323 could be back-ported?
Thanks in advance.

@frankjkelly (Contributor, Author)

FYI could be a dupe of #8282

@codelipenghui (Contributor)

This issue has had no activity for 30 days; marking it with the Stale label.

@frankjkelly (Contributor, Author)

No longer an issue for us - we wrote our own S3 offloader for various reasons.
