Skip to content

fix pulsar test instability#8538

Closed
walterddr wants to merge 3 commits intoapache:masterfrom
walterddr:fix_pulsar_test
Closed

fix pulsar test instability#8538
walterddr wants to merge 3 commits intoapache:masterfrom
walterddr:fix_pulsar_test

Conversation

@walterddr
Copy link
Contributor

This fixes #8537

@xiangfu0 xiangfu0 requested a review from KKcorps April 13, 2022 22:00
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we know why this call may give fewer records? Is it a bug or something like messages not flushed entirely and available to consume or some messages are actually missing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we actually run a consumer to consume exact same number of messages then exit in setup() method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dunno why this fixes the bug. My guess is that pulsar flush is async but I am not sure.

GitHub Actions sometimes catch interesting bug b/c the machines are much less powerful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the publishRecordBatch was changed in #8017 - any chance @KKcorps can give some insights?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method was added to handle the scenarios where messages are published in batches which actually changes the format of MessageIds in Pulsar.

The only way to publish in batch is to enable async producer. In sync mode, all messages are flushed as soon as they are received.

Copy link
Contributor

@KKcorps KKcorps Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiangfu0 your suggestion also works. I should actually add a method like waitForAllRecordsToFlush and only then exit the setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the documentation https://pulsar.apache.org/api/client/org/apache/pulsar/client/api/Producer.html

  • producer.flush() should wait until all the message finish publishing before returning (or throw exception); thus we can't check on the producer side
  • only way to check and wait for all records reaches pulsar for consumption is to actually consume -- which is exactly what this test is doing.

so I guess we should do is wrap the test with

TestUtils.waitForCondition(aVoid -> { 
  // consume and assert 
}, timeoutMs, "")

sounds good?

Copy link
Contributor

@KKcorps KKcorps Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another approach to just get topic stats. Since we are not producing messages in topics in any other method, it works for us.
I have implemented it here - #8554

once this this PR is merged, I'll pull changes into my PR and we can merge that as well.

@richardstartin richardstartin added the flaky-test Tracks a test that intermittently fails label Apr 14, 2022
@walterddr
Copy link
Contributor Author

it still fails. I wonder if theres' anything wrong with the consumer itself that it didn't pull enough data out.

alternatively I can still make the wait for condition checker in the test function too so that we can at least get better against the flakiness

@codecov-commenter
Copy link

Codecov Report

Merging #8538 (4e23c04) into master (2704d88) will decrease coverage by 3.78%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #8538      +/-   ##
==========================================
- Coverage   29.52%   25.73%   -3.79%     
==========================================
  Files        1674     1674              
  Lines       87872    87872              
  Branches    13313    13313              
==========================================
- Hits        25942    22615    -3327     
- Misses      59551    63102    +3551     
+ Partials     2379     2155     -224     
Flag Coverage Δ
integration1 ?
integration2 25.73% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...g/apache/pinot/server/api/resources/ErrorInfo.java 0.00% <0.00%> (-100.00%) ⬇️
...apache/pinot/common/lineage/LineageEntryState.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
...ker/failuredetector/ConnectionFailureDetector.java 0.00% <0.00%> (-100.00%) ⬇️
...minion/tasks/mergerollup/MergeRollupTaskUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...ion/tasks/mergerollup/MergeRollupTaskExecutor.java 0.00% <0.00%> (-100.00%) ⬇️
...nverttorawindex/ConvertToRawIndexTaskExecutor.java 0.00% <0.00%> (-100.00%) ⬇️
...e/pinot/common/minion/MergeRollupTaskMetadata.java 0.00% <0.00%> (-94.74%) ⬇️
...rg/apache/pinot/common/lineage/SegmentLineage.java 0.00% <0.00%> (-91.31%) ⬇️
...ache/pinot/common/lineage/SegmentLineageUtils.java 11.11% <0.00%> (-88.89%) ⬇️
... and 191 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2704d88...4e23c04. Read the comment docs.

@walterddr
Copy link
Contributor Author

This doesn't work. It seems like not a flakiness due to delay async produce.

@walterddr walterddr closed this Apr 16, 2022
@walterddr walterddr deleted the fix_pulsar_test branch December 6, 2023 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flaky-test Tracks a test that intermittently fails

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pulsar test instability

5 participants