-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393
Conversation
Looks like the pulsar/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/ResourceUsage.java Lines 27 to 68 in 73e0dbd
|
Nice work on this @michaeljmarshall . However, it seems that the flakiness remains after this change. Sometimes it's hard to reproduce the flaky test failures locally. One thing that seems to be a common theme is that the flaky test failures happen in CI, but can be hard to produce in local environments. While working on the fix for flaky test MessageIdTest, I found a way to reproduce some flaky test failures effectively by limiting the CPU resources to somewhat similar that there is in CI. The CI tests run on an Azure VM that has 2 cores and about 6GB free RAM (IIRC). Since I use Linux for development, the easiest way for me to constraint the resources of the test run was to use Docker. On other than Linux, VM tooling such as https://multipass.run/ could be helpful in creating an environment with limited cpu resources. These are the commands I used to test this change:
here's the output: https://gist.github.com/lhotari/7d8c7ae0a9e1a26d92599c585ba64e13
What comes into mind is that the test might start before both brokers are available. Lines 251 to 252 in 24f759c
|
@lhotari - thank you for your detailed explanation. I did not consider running this on a special environment to simulate the testing env. That is a great point, and something I'll keep in mind in the future. Based on the example logging output you provided, it looks like it is still the broker being overridden because it is considered overloaded. See the following:
Given the first log line, it does look like two brokers are considered for placement. However, the CPU is listed at 124%, which leads to an override. I mentioned a concern about this in my initial PR message:
Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores. Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested |
I guess we would have to experiment to find a solution that makes sense. I don't know this area of Pulsar so I could give a direct advice.
My assumption was simply that the test might start executing before both brokers are available. That has been a source of flakiness at least in DiscoveryServiceTest. I didn't confirm this assumption in any way. A simple approach would be to experiment and check whether the problem reproduces after making changes. Being able to experiment, would require having ways to reproduce the issue in environment where you can quickly run experiments. btw. I have published my "toolbox" as open source in https://github.com/lhotari/pulsar-contributor-toolbox . That contains shell script functions that I use for reproducing Pulsar flaky test failures. It works for me when I use Linux, zsh & sdkman (for JDK installation). |
@lhotari - thanks for the insight. I'll test this locally in a VM (I'm running on a Mac) and see how bumping the overloaded threshold affects the test. |
I took a quite look at the test and most likely my suggestion about checking the broker availability in setup doesn't make sense. What you already mentioned about the overload situation could be useful to validate by modifying the code and removing the logic temporarily from ModularLoadManagerImpl. For that, you would have to be able to first consistently reproduce the issue and then experimenting will be easy. |
f8bef98
to
079263b
Compare
@lhotari - I worked trying to get my environment set up tonight (just testing out setting the cpu/memory limits on my mac's docker desktop app, no extra VM yet), but I wasn't able to get to a place where I could test out this code change. I submitted a new commit with an updated threshold, which I think should work. If it doesn't work, I can continue tomorrow evening. |
@lhotari - After running this several times on my machine, I haven't been able to reproduce the failure with the limit set to 400. Although, I also wasn't able to get the broker's CPU utilization to go above 100%, but it looks like you were, so I'm not sure what that means because you had over 100% utilization. At what point do we consider this the solution to the flaky test? Thank you for your help. As an aside, it does seem that using docker to limit CPU/mem is an option on a mac. Here is my FROM openjdk:8-jdk-slim
RUN mkdir -p /usr/share/man/man1
RUN apt update && apt -y install maven
ENTRYPOINT ["/bin/bash"] and then I started the container: docker run --cpus=2 --memory=3g -it --entrypoint bash -v /Users/Michael:/root pulsar-test-env:0.0.0 and after navigating to my mvn test -Dtest=AntiAffinityNamespaceGroupTest -DfailIfNoTests=false -pl pulsar-broker (Because I mounted my home to the container's home, maven worked properly so I could run the build and install outside of the container.) |
Thanks for the great effort on fixing this @michaeljmarshall . I'll test these changes with my setup to see if the earlier problem that I was able to reproduce is fixed. |
@michaeljmarshall unfortunately, it remains flaky. The logs are at https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e . I reproduced with these commands on Linux (Ubuntu / PopOS 20.10):
just in case anyone is interested, the ptbx_until_test_fails_in_docker_with_logs captures the logs which contain ansi color control codes, I stripped the ansi codes with ansi2txt before uploading the logs to the gist:
|
@lhotari - thank you for running that for me. For what it's worth, it looks like the failure is related to the cpu usage again, although this time it's not necessarily a value I would have expected--Infinity.
I'm not positive what solutions make sense here, but it does seem like a surprising behavior that the logic calculated the CPU usage at |
I dug into this a bit more this morning, and it's interesting to point out that there are two fundamental differences between the system I am running tests on (macOS, or non linux) and @lhotari's system, which is linux. I'm guessing the pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/tools/LoadReportCommand.java Lines 94 to 102 in 7c68ade
I plan to take a look at the There could certainly be an issue in the It's worth mentioning that neither of these Usage classes are explicitly tested. |
@michaeljmarshall yes, it's a VM running Ubuntu Linux. There's GitHub Actions Runner VMs specs for more details about the environment. It's possible to debug such an environment by getting a shell inside a running VM with https://github.com/marketplace/actions/debugging-with-tmate in your own fork, for example. To debug the GitHub Actions VM, you'd enable GitHub Actions in your own fork in some repository and open a PR to your own fork. However the downside of that is that it's very time consuming. |
@michaeljmarshall very good direction. One detail that immediately caught my eye in |
The code in |
079263b
to
ce14ec4
Compare
this.isCGroupsEnabled = isCGroupsEnabled; | ||
calculateBrokerHostUsage(); | ||
|
||
executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhotari - because we were scheduling calculateBrokerHostUsage
with a delay of 0
minutes and then calling it only a few lines later, it is quite possible that there could have been a race internally that led to a calculated elapsedSeconds
of 0
. If elapsedSeconds
is 0
, we'll divide by by 0
in the method to calculate cpuUsage
, which could likely be our culprit for the "Infinite" cpu usage you saw in your test.
Here is the full method:
Lines 96 to 127 in ce14ec4
@Override | |
public void calculateBrokerHostUsage() { | |
List<String> nics = getNics(); | |
double totalNicLimit = getTotalNicLimitKbps(nics); | |
double totalNicUsageTx = getTotalNicUsageTxKb(nics); | |
double totalNicUsageRx = getTotalNicUsageRxKb(nics); | |
double totalCpuLimit = getTotalCpuLimit(); | |
SystemResourceUsage usage = new SystemResourceUsage(); | |
long now = System.currentTimeMillis(); | |
double elapsedSeconds = (now - lastCollection) / 1000d; | |
double cpuUsage = getTotalCpuUsage(elapsedSeconds); | |
if (lastCollection == 0L) { | |
usage.setMemory(getMemUsage()); | |
usage.setBandwidthIn(new ResourceUsage(0d, totalNicLimit)); | |
usage.setBandwidthOut(new ResourceUsage(0d, totalNicLimit)); | |
} else { | |
double nicUsageTx = (totalNicUsageTx - lastTotalNicUsageTx) / elapsedSeconds; | |
double nicUsageRx = (totalNicUsageRx - lastTotalNicUsageRx) / elapsedSeconds; | |
usage.setMemory(getMemUsage()); | |
usage.setBandwidthIn(new ResourceUsage(nicUsageRx, totalNicLimit)); | |
usage.setBandwidthOut(new ResourceUsage(nicUsageTx, totalNicLimit)); | |
} | |
lastTotalNicUsageTx = totalNicUsageTx; | |
lastTotalNicUsageRx = totalNicUsageRx; | |
lastCollection = System.currentTimeMillis(); | |
this.usage = usage; | |
usage.setCpu(new ResourceUsage(cpuUsage, totalCpuLimit)); | |
} |
Also, note that this method likely should not have been called before setting isCGroupsEnabled
because the internals of the method rely on that value. Perhaps @merlimat can provide a bit more context for whether or not this update adds value here (he added the isCGroupsEnabled
code this past July).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that any bad state, like an infinite cpu usage, would be stored for a minute until the cpu usage is recalculated. Considering that this test runs quickly, that would definitely be a culprit for our flaky test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaeljmarshall I was also assuming some race condition and I made an experiment on top of your changes where I replaced the Files.readAllBytes
usage and made ResourceUsage
immutable: master...lhotari:issue-6368-flaky-test . However, the same problem still reproduces. CPU shows as Infinity in the log lines:
06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbala
nce.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203-
3147486f2673/use/ns1/0x00000000_0xffffffff
06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%,
DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbala
nce.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203-
3147486f2673/use/ns2/0x00000000_0xffffffff
06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%,
DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%,
DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
!!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest]
.testBrokerSelectionForAntiAffinityGroup([])-------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look--it's good to know we're narrowing in on something. I'll take a look again tomorrow.
In looking at the class and the scheduling of the I also agree that the mix of longs and doubles seems error prone.
These classes do seem pretty fundamental to load balancing and I believe they are also essential to the load shedding mechanisms, which are pretty important. There are tests that implicitly cover this code, but perhaps we can add something more explicit in another PR. It's likely that these tests aren't covered because some of the logic is hard to test. Reworking some of the classes to make them more testable might be necessary.
Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the |
@michaeljmarshall yes, you are right about the race condition. Good catch. It's the Besides the race, it also seems that this is a clear bug in the I have a few commits in my fork where I experimented with various improvements: |
...broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
Show resolved
Hide resolved
...broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java
Outdated
Show resolved
Hide resolved
…pl class initialization
ce14ec4
to
cd109ef
Compare
...oker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java
Outdated
Show resolved
Hide resolved
cpuUsageSum = 0d; | ||
cpuUsageCount = 0; | ||
return cpuUsage; | ||
synchronized (this) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: what about moving the synchronized
keyword in the method declaration ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this, and I also moved the check for cpuUsageCount
being zero into this method. It makes more sense to me in the synchronized method, and it will ensure that if any other methods end up calling this one, there isn't a chance of accidentally dividing by 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
7f902e9
to
fb87dee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, good work @michaeljmarshall
@sijie, @eolivelli, @lhotari - given the discussion in today's community meeting about what branches should have LTS, I'm thinking this commit should be cherry picked to other release branches because the changes to the underlying host usage classes are concurrency bugs. (Even if we don't have true LTS versions, this patch should likely make it into 2.7.x and possibly 2.6.x) |
…hreshold (#9393) Fixes: #6368 Flaky-test: `AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup` TestClass is flaky. The testMethod test method fails sporadically. See #6368 for example failures as well as for my analysis and detailed justification for this change. In short, by setting the `LoadBalancerBrokerOverloadedThresholdPercentage` to `100`, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker. Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure. https://github.com/apache/pulsar/blob/85f3ff4edbaa10c7894af8ad823cbce37b13829c/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java#L214-L217 (cherry picked from commit 610b17d)
…hreshold (#9393) Fixes: #6368 Flaky-test: `AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup` TestClass is flaky. The testMethod test method fails sporadically. See #6368 for example failures as well as for my analysis and detailed justification for this change. In short, by setting the `LoadBalancerBrokerOverloadedThresholdPercentage` to `100`, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker. Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure. https://github.com/apache/pulsar/blob/85f3ff4edbaa10c7894af8ad823cbce37b13829c/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java#L214-L217 (cherry picked from commit 610b17d)
Fixes: #6368
Flaky-test:
AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup
TestClass is flaky. The testMethod test method fails sporadically.
See #6368 for example failures as well as for my analysis and detailed justification for this change.
In short, by setting the
LoadBalancerBrokerOverloadedThresholdPercentage
to100
, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker.Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure.
pulsar/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java
Lines 214 to 217 in 85f3ff4