Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393

Merged
merged 9 commits into from
Feb 11, 2021

Conversation

michaeljmarshall
Copy link
Member

Fixes: #6368

Flaky-test: AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup

TestClass is flaky. The testMethod test method fails sporadically.

See #6368 for example failures as well as for my analysis and detailed justification for this change.

In short, by setting the LoadBalancerBrokerOverloadedThresholdPercentage to 100, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker.

Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure.

public double getMaxResourceUsage() {
return max(cpu.percentUsage(), memory.percentUsage(), directMemory.percentUsage(), bandwidthIn.percentUsage(),
bandwidthOut.percentUsage()) / 100;
}

@michaeljmarshall
Copy link
Member Author

michaeljmarshall commented Feb 1, 2021

Looks like the ResourceUsage class does not require usage to be less than limit. However, in looking at this class's usage, I don't see any places where the usage would be more than the limit.

public class ResourceUsage {
public double usage;
public double limit;
public ResourceUsage(double usage, double limit) {
this.usage = usage;
this.limit = limit;
}
public ResourceUsage(ResourceUsage that) {
this.usage = that.usage;
this.limit = that.limit;
}
public ResourceUsage() {
}
public void reset() {
this.usage = -1;
this.limit = -1;
}
/**
* this may be wrong since we are comparing available and not the usage.
*
* @param o
* @return
*/
public int compareTo(ResourceUsage o) {
double required = o.limit - o.usage;
double available = limit - usage;
return Double.compare(available, required);
}
public float percentUsage() {
float proportion = 0;
if (limit > 0) {
proportion = ((float) usage) / ((float) limit);
}
return proportion * 100;
}
}

@sijie sijie added this to the 2.8.0 milestone Feb 1, 2021
@lhotari
Copy link
Member

lhotari commented Feb 1, 2021

Nice work on this @michaeljmarshall .

However, it seems that the flakiness remains after this change.

Sometimes it's hard to reproduce the flaky test failures locally. One thing that seems to be a common theme is that the flaky test failures happen in CI, but can be hard to produce in local environments. While working on the fix for flaky test MessageIdTest, I found a way to reproduce some flaky test failures effectively by limiting the CPU resources to somewhat similar that there is in CI. The CI tests run on an Azure VM that has 2 cores and about 6GB free RAM (IIRC).

Since I use Linux for development, the easiest way for me to constraint the resources of the test run was to use Docker. On other than Linux, VM tooling such as https://multipass.run/ could be helpful in creating an environment with limited cpu resources.

These are the commands I used to test this change:

$ gh pr checkout 9393
$ mvn clean install -DskipTests -Dspotbugs.skip=true
$ docker run --cpus=2 --memory=6g -u 1000:1000 --net=host -it --rm -v $HOME:$HOME -w $PWD -v /etc/passwd:/etc/passwd:ro ubuntu bash -c 'source "$HOME/.sdkman/bin/sdkman-init.sh"; counter=0; while mvn -Pcore-modules -pl pulsar-broker test -DfailIfNoTests=false -Dtest=AntiAffinityNamespaceGroupTest -DredirectTestOutputToFile=false -DtestRetryCount=0; do echo "----------- LOOP $counter ---------------"; ((counter++)); done; echo "LOOP $counter"' | tee docker_output_`date +%s`.log

here's the output: https://gist.github.com/lhotari/7d8c7ae0a9e1a26d92599c585ba64e13
and pulsar-broker/target/surefire-reports/testng-results.xml https://gist.github.com/lhotari/fbbcd1405d8f16be7106f8ece9f66084

java.lang.AssertionError: did not expect [localhost:42919] but found [localhost:42919]
at org.testng.Assert.fail(Assert.java:99)
at org.testng.Assert.failEquals(Assert.java:1041)
at org.testng.Assert.assertNotEqualsImpl(Assert.java:147)
at org.testng.Assert.assertNotEquals(Assert.java:1531)
at org.testng.Assert.assertNotEquals(Assert.java:1535)
at org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup(AntiAffinityNamespaceGroupTest.java:427)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

What comes into mind is that the test might start before both brokers are available.
Something like

Awaitility.await().until(()
-> service.getDiscoveryProvider().getAvailableBrokers().size() == number);
could help with that?

@michaeljmarshall
Copy link
Member Author

@lhotari - thank you for your detailed explanation. I did not consider running this on a special environment to simulate the testing env. That is a great point, and something I'll keep in mind in the future.

Based on the example logging output you provided, it looks like it is still the broker being overridden because it is considered overloaded. See the following:

07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-c8478edb-886e-43b7-8d43-12cc159c0eb9/use/ns1/0x00000000_0xffffffff
07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-c8478edb-886e-43b7-8d43-12cc159c0eb9/use/ns2/0x00000000_0xffffffff
07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
07:04:53.517 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
07:04:53.517 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%

Given the first log line, it does look like two brokers are considered for placement. However, the CPU is listed at 124%, which leads to an override. I mentioned a concern about this in my initial PR message:

Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure.

Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores.

Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested await command could help by giving the brokers time to stabilize.

@lhotari
Copy link
Member

lhotari commented Feb 2, 2021

Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores.

I guess we would have to experiment to find a solution that makes sense. I don't know this area of Pulsar so I could give a direct advice.

Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested await command could help by giving the brokers time to stabilize.

My assumption was simply that the test might start executing before both brokers are available. That has been a source of flakiness at least in DiscoveryServiceTest. I didn't confirm this assumption in any way. A simple approach would be to experiment and check whether the problem reproduces after making changes. Being able to experiment, would require having ways to reproduce the issue in environment where you can quickly run experiments.

btw. I have published my "toolbox" as open source in https://github.com/lhotari/pulsar-contributor-toolbox . That contains shell script functions that I use for reproducing Pulsar flaky test failures. It works for me when I use Linux, zsh & sdkman (for JDK installation).

@michaeljmarshall
Copy link
Member Author

@lhotari - thanks for the insight. I'll test this locally in a VM (I'm running on a Mac) and see how bumping the overloaded threshold affects the test.

@lhotari
Copy link
Member

lhotari commented Feb 2, 2021

I took a quite look at the test and most likely my suggestion about checking the broker availability in setup doesn't make sense. What you already mentioned about the overload situation could be useful to validate by modifying the code and removing the logic temporarily from ModularLoadManagerImpl. For that, you would have to be able to first consistently reproduce the issue and then experimenting will be easy.

@michaeljmarshall
Copy link
Member Author

@lhotari - I worked trying to get my environment set up tonight (just testing out setting the cpu/memory limits on my mac's docker desktop app, no extra VM yet), but I wasn't able to get to a place where I could test out this code change. I submitted a new commit with an updated threshold, which I think should work. If it doesn't work, I can continue tomorrow evening.

@michaeljmarshall
Copy link
Member Author

@lhotari - After running this several times on my machine, I haven't been able to reproduce the failure with the limit set to 400. Although, I also wasn't able to get the broker's CPU utilization to go above 100%, but it looks like you were, so I'm not sure what that means because you had over 100% utilization. At what point do we consider this the solution to the flaky test? Thank you for your help.

As an aside, it does seem that using docker to limit CPU/mem is an option on a mac. Here is my Dockerfile

FROM openjdk:8-jdk-slim

RUN mkdir -p /usr/share/man/man1

RUN apt update && apt -y install maven

ENTRYPOINT ["/bin/bash"]

and then I started the container:

docker run --cpus=2 --memory=3g -it --entrypoint bash -v /Users/Michael:/root pulsar-test-env:0.0.0

and after navigating to my pulsar dir, ran

mvn test -Dtest=AntiAffinityNamespaceGroupTest -DfailIfNoTests=false -pl pulsar-broker

(Because I mounted my home to the container's home, maven worked properly so I could run the build and install outside of the container.)

@lhotari
Copy link
Member

lhotari commented Feb 4, 2021

Thanks for the great effort on fixing this @michaeljmarshall . I'll test these changes with my setup to see if the earlier problem that I was able to reproduce is fixed.

@lhotari
Copy link
Member

lhotari commented Feb 4, 2021

@michaeljmarshall unfortunately, it remains flaky. The logs are at https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e .

I reproduced with these commands on Linux (Ubuntu / PopOS 20.10):

$ git branch -D issue-6368-flaky-test # delete earlier branch for #9393 since it won't fast-forward
$ gh pr checkout 9393
$ mvn clean install -DskipTests -Dspotbugs.skip=true
$ ptbx_until_test_fails_in_docker_with_logs -Pcore-modules -pl pulsar-broker -Dtest=AntiAffinityNamespaceGroupTest

ptbx_until_test_fails_in_docker_with_logs is part of the toolbox I am using. That function runs a mvn test in a loop with limited resource (docker run --cpus=2 --memory=6g) until the command fails.

just in case anyone is interested, the ptbx_until_test_fails_in_docker_with_logs captures the logs which contain ansi color control codes, I stripped the ansi codes with ansi2txt before uploading the logs to the gist:

$ cat output_2021-02-04-093600.log | ansi2txt > output_2021-02-04-093600.log.txt
$ gh gist create output_2021-02-04-093600.log.txt -d "testing PR 9393 https://github.com/apache/pulsar/commits/$(git rev-parse HEAD | colrm 9)"

gh is the github's command line tooling, https://cli.github.com/ .

@michaeljmarshall
Copy link
Member Author

michaeljmarshall commented Feb 4, 2021

@lhotari - thank you for running that for me. For what it's worth, it looks like the failure is related to the cpu usage again, although this time it's not necessarily a value I would have expected--Infinity.

07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns1/0x00000000_0xffffffff
07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
07:39:57.220 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns2/0x00000000_0xffffffff
07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
!!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])-------

I'm not positive what solutions make sense here, but it does seem like a surprising behavior that the logic calculated the CPU usage at Infinity%. Seems like something worth looking into. Ultimately, it's the same logic for overloading that is the problem here.

@michaeljmarshall
Copy link
Member Author

I dug into this a bit more this morning, and it's interesting to point out that there are two fundamental differences between the system I am running tests on (macOS, or non linux) and @lhotari's system, which is linux. I'm guessing the Azure VM @lhotari references above is a linux machine, but it'd be good to confirm. This might explain some of the differences we're seeing regarding reproducibility. The following conditional block determines the way a broker calculates cpu (and all other) host usage:

if (isLinux) {
hostUsage = new LinuxBrokerHostUsageImpl(
Integer.MAX_VALUE, Optional.empty(), scheduler
);
} else {
hostUsage = new GenericBrokerHostUsageImpl(
Integer.MAX_VALUE, scheduler
);
}

I plan to take a look at the LinuxBrokerHostUsageImpl class later. I wonder if there is a bug in that code because we saw that Infinity value for CPU usage in the logs above. I'm guessing we're dividing by 0 in some rare edge case.

There could certainly be an issue in the GenericBrokerHostUsageImpl class, too, but it seems to me that we're more consistently reproducing the problem in the linux version, so that's where I'll look at the linux class next.

It's worth mentioning that neither of these Usage classes are explicitly tested.

@lhotari
Copy link
Member

lhotari commented Feb 8, 2021

I'm guessing the Azure VM @lhotari references above is a linux machine, but it'd be good to confirm.

@michaeljmarshall yes, it's a VM running Ubuntu Linux. There's GitHub Actions Runner VMs specs for more details about the environment. It's possible to debug such an environment by getting a shell inside a running VM with https://github.com/marketplace/actions/debugging-with-tmate in your own fork, for example. To debug the GitHub Actions VM, you'd enable GitHub Actions in your own fork in some repository and open a PR to your own fork. However the downside of that is that it's very time consuming.

@lhotari
Copy link
Member

lhotari commented Feb 8, 2021

I plan to take a look at the LinuxBrokerHostUsageImpl class later. I wonder if there is a bug in that code because we saw that Infinity value for CPU usage in the logs above. I'm guessing we're dividing by 0 in some rare edge case.

@michaeljmarshall very good direction. One detail that immediately caught my eye in LinuxBrokerHostUsageImpl is the usage of System.currentTimeMillis(). Replacing that with System.nanoTime() could be helpful. The calculations that mix doubles and longs seem error prone. It would be useful to check how good the test coverage for this class is.

@lhotari
Copy link
Member

lhotari commented Feb 8, 2021

The code in LinuxBrokerHostUsageImpl rings some bells. I faced some issues earlier where the method to read the Linux /proc and /sys/fs files with Files.readAllBytes was causing odd issues. I'll have to check if I reported an issue back then or brought it up on Pulsar Slack etc.

this.isCGroupsEnabled = isCGroupsEnabled;
calculateBrokerHostUsage();

executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari - because we were scheduling calculateBrokerHostUsage with a delay of 0 minutes and then calling it only a few lines later, it is quite possible that there could have been a race internally that led to a calculated elapsedSeconds of 0. If elapsedSeconds is 0, we'll divide by by 0 in the method to calculate cpuUsage, which could likely be our culprit for the "Infinite" cpu usage you saw in your test.

Here is the full method:

@Override
public void calculateBrokerHostUsage() {
List<String> nics = getNics();
double totalNicLimit = getTotalNicLimitKbps(nics);
double totalNicUsageTx = getTotalNicUsageTxKb(nics);
double totalNicUsageRx = getTotalNicUsageRxKb(nics);
double totalCpuLimit = getTotalCpuLimit();
SystemResourceUsage usage = new SystemResourceUsage();
long now = System.currentTimeMillis();
double elapsedSeconds = (now - lastCollection) / 1000d;
double cpuUsage = getTotalCpuUsage(elapsedSeconds);
if (lastCollection == 0L) {
usage.setMemory(getMemUsage());
usage.setBandwidthIn(new ResourceUsage(0d, totalNicLimit));
usage.setBandwidthOut(new ResourceUsage(0d, totalNicLimit));
} else {
double nicUsageTx = (totalNicUsageTx - lastTotalNicUsageTx) / elapsedSeconds;
double nicUsageRx = (totalNicUsageRx - lastTotalNicUsageRx) / elapsedSeconds;
usage.setMemory(getMemUsage());
usage.setBandwidthIn(new ResourceUsage(nicUsageRx, totalNicLimit));
usage.setBandwidthOut(new ResourceUsage(nicUsageTx, totalNicLimit));
}
lastTotalNicUsageTx = totalNicUsageTx;
lastTotalNicUsageRx = totalNicUsageRx;
lastCollection = System.currentTimeMillis();
this.usage = usage;
usage.setCpu(new ResourceUsage(cpuUsage, totalCpuLimit));
}

Also, note that this method likely should not have been called before setting isCGroupsEnabled because the internals of the method rely on that value. Perhaps @merlimat can provide a bit more context for whether or not this update adds value here (he added the isCGroupsEnabled code this past July).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that any bad state, like an infinite cpu usage, would be stored for a minute until the cpu usage is recalculated. Considering that this test runs quickly, that would definitely be a culprit for our flaky test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaeljmarshall I was also assuming some race condition and I made an experiment on top of your changes where I replaced the Files.readAllBytes usage and made ResourceUsage immutable: master...lhotari:issue-6368-flaky-test . However, the same problem still reproduces. CPU shows as Infinity in the log lines:

06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbala
nce.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203-
3147486f2673/use/ns1/0x00000000_0xffffffff
06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, 
DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbala
nce.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203-
3147486f2673/use/ns2/0x00000000_0xffffffff
06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, 
DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity
06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbala
nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, 
DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
!!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest]
.testBrokerSelectionForAntiAffinityGroup([])-------

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look--it's good to know we're narrowing in on something. I'll take a look again tomorrow.

@michaeljmarshall
Copy link
Member Author

One detail that immediately caught my eye in LinuxBrokerHostUsageImpl is the usage of System.currentTimeMillis(). Replacing that with System.nanoTime() could be helpful. The calculations that mix doubles and longs seem error prone.

In looking at the class and the scheduling of the calculateBrokerHostUsage method, I can see that this class (and GenericBrokerHostUsageImpl ) are not really designed to be called in immediate succession. (Side note, that makes me wonder about the choice to use scheduleAtFixedRate instead of scheduleWithFixedDelay.) Because we don't intend to call the methods in immediate succession, millis look like they could be fine, but I could definitely see making a switch to nanos too. What do you think?

I also agree that the mix of longs and doubles seems error prone.

It would be useful to check how good the test coverage for this class is.

These classes do seem pretty fundamental to load balancing and I believe they are also essential to the load shedding mechanisms, which are pretty important. There are tests that implicitly cover this code, but perhaps we can add something more explicit in another PR. It's likely that these tests aren't covered because some of the logic is hard to test. Reworking some of the classes to make them more testable might be necessary.

The code in LinuxBrokerHostUsageImpl rings some bells. I faced some issues earlier where the method to read the Linux /proc and /sys/fs files with Files.readAllBytes was causing odd issues. I'll have to check if I reported an issue back then or brought it up on Pulsar Slack etc.

Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the LinuxBrokerHostUsageImpl by not calling it at nearly the same time.

@lhotari
Copy link
Member

lhotari commented Feb 9, 2021

Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the LinuxBrokerHostUsageImpl by not calling it at nearly the same time.

@michaeljmarshall yes, you are right about the race condition. Good catch. It's the calculateBrokerHostUsage call that is scheduled and the calculateBrokerHostUsage call in the constructor that cause the issue.

Besides the race, it also seems that this is a clear bug in the calculateBrokerHostUsage method itself. I'm now running the test loop (to see if it fails at some point) with these changes:
468550f

I have a few commits in my fork where I experimented with various improvements:
master...lhotari:issue-6368-flaky-test

cpuUsageSum = 0d;
cpuUsageCount = 0;
return cpuUsage;
synchronized (this) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about moving the synchronized keyword in the method declaration ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this, and I also moved the check for cpuUsageCount being zero into this method. It makes more sense to me in the synchronized method, and it will ensure that if any other methods end up calling this one, there isn't a chance of accidentally dividing by 0.

@eolivelli
Copy link
Contributor

@codelipenghui @jiazhai @wolfstudy PTAL

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work @michaeljmarshall

@lhotari
Copy link
Member

lhotari commented Feb 10, 2021

@sijie @merlimat PTAL. This is a fix to a common flaky test.

@sijie sijie merged commit 610b17d into apache:master Feb 11, 2021
@michaeljmarshall michaeljmarshall deleted the issue-6368-flaky-test branch February 17, 2021 05:33
@michaeljmarshall
Copy link
Member Author

@sijie, @eolivelli, @lhotari - given the discussion in today's community meeting about what branches should have LTS, I'm thinking this commit should be cherry picked to other release branches because the changes to the underlying host usage classes are concurrency bugs. (Even if we don't have true LTS versions, this patch should likely make it into 2.7.x and possibly 2.6.x)

@merlimat merlimat added type/bug The PR fixed a bug or issue reported a bug release/2.7.1 labels Feb 17, 2021
@codelipenghui codelipenghui added the cherry-picked/branch-2.7 Archived: 2.7 is end of life label Feb 18, 2021
codelipenghui pushed a commit that referenced this pull request Feb 18, 2021
…hreshold (#9393)

Fixes: #6368

Flaky-test: `AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup`

TestClass is flaky. The testMethod test method fails sporadically.

See #6368 for example failures as well as for my analysis and detailed justification for this change.

In short, by setting the `LoadBalancerBrokerOverloadedThresholdPercentage` to `100`, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker.

Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure. https://github.com/apache/pulsar/blob/85f3ff4edbaa10c7894af8ad823cbce37b13829c/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java#L214-L217

(cherry picked from commit 610b17d)
zymap pushed a commit that referenced this pull request Mar 2, 2021
…hreshold (#9393)

Fixes: #6368

Flaky-test: `AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup`

TestClass is flaky. The testMethod test method fails sporadically.

See #6368 for example failures as well as for my analysis and detailed justification for this change.

In short, by setting the `LoadBalancerBrokerOverloadedThresholdPercentage` to `100`, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker.

Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure. https://github.com/apache/pulsar/blob/85f3ff4edbaa10c7894af8ad823cbce37b13829c/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java#L214-L217

(cherry picked from commit 610b17d)
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.7 Archived: 2.7 is end of life release/2.6.4 release/2.7.1 type/bug The PR fixed a bug or issue reported a bug type/flaky-tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(Flaky-test) Intermittent failure of AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup()
7 participants