Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][monitor] Fix the partitioned publisher topic stat aggregation bug #18807

Merged
merged 6 commits into from
Jan 4, 2023

Conversation

heesung-sn
Copy link
Contributor

@heesung-sn heesung-sn commented Dec 7, 2022

Motivation

The index-based publisher stat aggregation(configured by aggregatePublisherStatsByProducerName=false, default, triggered by pulsar-admin topics partitioned-stats api) can burst memory and wrongly aggregate publisher metrics if each partition stat returns a different size or order of the publisher stat list.
In the worst case, if there are many partitions and publishers created and closed concurrently, the current code can create PublisherStatsImpl objects exponentially, and this can cause a high GC time or OOM.

Discussion Thread:

https://lists.apache.org/thread/vofv1oz0wvzlwk4x9vk067rhkscn8bqo

Issue Code reference:

2c428f7#diff-02e50674125a597f8ae3405a884590759f2fdaa10104cea511d5ea44b6ff6490R224-R247

Modifications

  • Fixed the bug code (the nested loop) that could exponentially create PublisherStatsImpl objects.

Verifying this change

  • Make sure that the change passes the CI checks.

This change added unit tests.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: heesung-sn#15

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 7, 2022
@Demogorgon314 Demogorgon314 added type/bug The PR fixed a bug or issue reported a bug area/metrics labels Dec 8, 2022
@Technoboy- Technoboy- added this to the 2.12.0 milestone Dec 11, 2022
@Technoboy- Technoboy- closed this Dec 11, 2022
@Technoboy- Technoboy- reopened this Dec 11, 2022
@Technoboy- Technoboy- changed the title [fix][metrics] fixed a partitioned publisher topic stat aggregation bug [fix][monitor] Fix a partitioned publisher topic stat aggregation bug Dec 11, 2022
@codecov-commenter
Copy link

codecov-commenter commented Dec 11, 2022

Codecov Report

Merging #18807 (f7e7f12) into master (492a9c3) will increase coverage by 1.14%.
The diff coverage is 28.57%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18807      +/-   ##
============================================
+ Coverage     47.46%   48.60%   +1.14%     
- Complexity    10727    11308     +581     
============================================
  Files           711      712       +1     
  Lines         69456    72833    +3377     
  Branches       7452     8303     +851     
============================================
+ Hits          32964    35404    +2440     
- Misses        32810    33577     +767     
- Partials       3682     3852     +170     
Flag Coverage Δ
unittests 48.60% <28.57%> (+1.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...va/org/apache/pulsar/client/impl/ConsumerImpl.java 15.06% <0.00%> (-0.04%) ⬇️
.../pulsar/broker/admin/impl/SchemasResourceBase.java 90.09% <100.00%> (+0.20%) ⬆️
...lsar/client/impl/conf/ClientConfigurationData.java 96.66% <100.00%> (ø)
...java/org/apache/pulsar/proxy/stats/TopicStats.java 58.82% <0.00%> (-41.18%) ⬇️
...a/org/apache/bookkeeper/mledger/ManagedCursor.java 50.00% <0.00%> (-35.72%) ⬇️
.../apache/pulsar/broker/admin/impl/PackagesBase.java 54.12% <0.00%> (-13.77%) ⬇️
...ava/org/apache/pulsar/broker/service/Consumer.java 68.86% <0.00%> (-4.10%) ⬇️
...roker/service/persistent/MessageDeduplication.java 50.65% <0.00%> (-3.06%) ⬇️
...rg/apache/pulsar/broker/service/AbstractTopic.java 62.06% <0.00%> (-3.00%) ⬇️
...pache/pulsar/broker/admin/v2/PersistentTopics.java 72.47% <0.00%> (-1.94%) ⬇️
... and 53 more

Comment on lines +256 to +263
if (index == this.publishers.size()) {
PublisherStatsImpl newStats = new PublisherStatsImpl();
newStats.setSupportsPartialProducer(false);
this.publishers.add(newStats);
}
this.publishers.get(index)
.add((PublisherStatsImpl) s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

The publishers field is not thread-safe from before. Potentially access to it from concurrent threads.
Moreover, it doesn't guarantee the size is index + 1 in L262 because it isn't held mutual lock between L259 and L262.


We want to avoid IndexOutOfBoundsException completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I see that *TopicStatsImpl is not a thread-safe class. Currently, the caller needs to externally synchronize the access if concurrent threads access the same instance.

Potentially access to it from concurrent threads.

Could you provide this example? I see that PartitionedTopicStatsImpl is instantiated by a single thread, and add(..) gets called from the same thread.

PartitionedTopicStatsImpl stats = new PartitionedTopicStatsImpl(partitionMetadata);
...
            FutureUtil.waitForAll(topicStatsFutureList).handle((result, exception) -> {
...
                            stats.add(statFuture.get());
                            if (perPartition) {
                                stats.getPartitions().put(topicName.getPartition(i).toString(), statFuture.get());
...
                    }
                }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case may not be in the current codes. If contributors already know "the caller needs to externally synchronize the access," we don't care in this section. I'm worried about misimplementation. So, I think it's okay to write comments about the current spec instead of fixing it to follow my comment.

Copy link
Contributor Author

@heesung-sn heesung-sn Dec 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, sorry. Do you mean we need to comment that PartitionedTopicStatsImpl is not thread-safe?

Otherwise, PTAL again (or could you approve this?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we need to comment that PartitionedTopicStatsImpl is not thread-safe?

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

@heesung-sn
Copy link
Contributor Author

@eolivelli could you review this PR by any chance?

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to split it into 2 PRs? One is to fix the aggregation issue, another change the default value.

So that we can ship the fix to release branches, and the default configuration change only applies to the master branch (release in the next major version).

Comment on lines +169 to +172
if (index == this.nonPersistentPublishers.size()) {
NonPersistentPublisherStatsImpl newStats = new NonPersistentPublisherStatsImpl();
newStats.setSupportsPartialProducer(false);
this.nonPersistentPublishers.add(newStats);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add some description for this if branch. Like why we need to use a newStats if index == this.nonPersistentPublishers.size(). And what does index == this.nonPersistentPublishers.size() exactly means. Frankly, it's not easy to understand. I think if we have some description, it will help the readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

@heesung-sn
Copy link
Contributor Author

Is it better to split it into 2 PRs? One is to fix the aggregation issue, another change the default value.

So that we can ship the fix to release branches, and the default configuration change only applies to the master branch (release in the next major version).

I reverted the config default change in this PR.
I raised another PR for the config default change here. #19114

@heesung-sn heesung-sn changed the title [fix][monitor] Fix a partitioned publisher topic stat aggregation bug [fix][monitor] Fix the partitioned publisher topic stat aggregation bug Jan 3, 2023
@codelipenghui codelipenghui reopened this Jan 4, 2023
@codelipenghui codelipenghui merged commit 8790ed1 into apache:master Jan 4, 2023
@poorbarcode
Copy link
Contributor

poorbarcode commented May 7, 2023

@heesung-sn

I have added the label [release/2.10.5, release/2.11.2] and cherry-picked this patch into these branches.

poorbarcode pushed a commit that referenced this pull request May 7, 2023
poorbarcode pushed a commit that referenced this pull request May 7, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request May 11, 2023
…ug (apache#18807)

(cherry picked from commit 8790ed1)
(cherry picked from commit b5b2de6)
@heesung-sn heesung-sn deleted the bug-fix-topic-stats-nested branch April 2, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants