Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393

michaeljmarshall · 2021-02-01T05:11:46Z

Flaky-test: AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup

TestClass is flaky. The testMethod test method fails sporadically.

See #6368 for example failures as well as for my analysis and detailed justification for this change.

In short, by setting the LoadBalancerBrokerOverloadedThresholdPercentage to 100, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker.

Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure.

pulsar/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java

Lines 214 to 217 in 85f3ff4

    
           public double getMaxResourceUsage() { 
        
               return max(cpu.percentUsage(), memory.percentUsage(), directMemory.percentUsage(), bandwidthIn.percentUsage(), 
        
                       bandwidthOut.percentUsage()) / 100; 
        
           }

michaeljmarshall · 2021-02-01T05:17:26Z

Looks like the ResourceUsage class does not require usage to be less than limit. However, in looking at this class's usage, I don't see any places where the usage would be more than the limit.

pulsar/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/ResourceUsage.java

Lines 27 to 68 in 73e0dbd

    
           public class ResourceUsage { 
        
               public double usage; 
        
               public double limit; 
        
               public ResourceUsage(double usage, double limit) { 
        
                   this.usage = usage; 
        
                   this.limit = limit; 
        
               } 
        
               public ResourceUsage(ResourceUsage that) { 
        
                   this.usage = that.usage; 
        
                   this.limit = that.limit; 
        
               } 
        
               public ResourceUsage() { 
        
               } 
        
               public void reset() { 
        
                   this.usage = -1; 
        
                   this.limit = -1; 
        
               } 
        
               /** 
        
                * this may be wrong since we are comparing available and not the usage. 
        
                * 
        
                * @param o 
        
                * @return 
        
                */ 
        
               public int compareTo(ResourceUsage o) { 
        
                   double required = o.limit - o.usage; 
        
                   double available = limit - usage; 
        
                   return Double.compare(available, required); 
        
               } 
        
               public float percentUsage() { 
        
                   float proportion = 0; 
        
                   if (limit > 0) { 
        
                       proportion = ((float) usage) / ((float) limit); 
        
                   } 
        
                   return proportion * 100; 
        
               } 
        
           }

lhotari · 2021-02-01T07:23:09Z

Nice work on this @michaeljmarshall .

However, it seems that the flakiness remains after this change.

Sometimes it's hard to reproduce the flaky test failures locally. One thing that seems to be a common theme is that the flaky test failures happen in CI, but can be hard to produce in local environments. While working on the fix for flaky test MessageIdTest, I found a way to reproduce some flaky test failures effectively by limiting the CPU resources to somewhat similar that there is in CI. The CI tests run on an Azure VM that has 2 cores and about 6GB free RAM (IIRC).

Since I use Linux for development, the easiest way for me to constraint the resources of the test run was to use Docker. On other than Linux, VM tooling such as https://multipass.run/ could be helpful in creating an environment with limited cpu resources.

These are the commands I used to test this change:

$ gh pr checkout 9393
$ mvn clean install -DskipTests -Dspotbugs.skip=true
$ docker run --cpus=2 --memory=6g -u 1000:1000 --net=host -it --rm -v $HOME:$HOME -w $PWD -v /etc/passwd:/etc/passwd:ro ubuntu bash -c 'source "$HOME/.sdkman/bin/sdkman-init.sh"; counter=0; while mvn -Pcore-modules -pl pulsar-broker test -DfailIfNoTests=false -Dtest=AntiAffinityNamespaceGroupTest -DredirectTestOutputToFile=false -DtestRetryCount=0; do echo "----------- LOOP $counter ---------------"; ((counter++)); done; echo "LOOP $counter"' | tee docker_output_`date +%s`.log

here's the output: https://gist.github.com/lhotari/7d8c7ae0a9e1a26d92599c585ba64e13
and pulsar-broker/target/surefire-reports/testng-results.xml https://gist.github.com/lhotari/fbbcd1405d8f16be7106f8ece9f66084

java.lang.AssertionError: did not expect [localhost:42919] but found [localhost:42919]
at org.testng.Assert.fail(Assert.java:99)
at org.testng.Assert.failEquals(Assert.java:1041)
at org.testng.Assert.assertNotEqualsImpl(Assert.java:147)
at org.testng.Assert.assertNotEquals(Assert.java:1531)
at org.testng.Assert.assertNotEquals(Assert.java:1535)
at org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup(AntiAffinityNamespaceGroupTest.java:427)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

What comes into mind is that the test might start before both brokers are available.
Something like

pulsar/pulsar-discovery-service/src/test/java/org/apache/pulsar/discovery/service/DiscoveryServiceTest.java

Lines 251 to 252 in 24f759c

    
           Awaitility.await().until(() 
        
                   -> service.getDiscoveryProvider().getAvailableBrokers().size() == number);

could help with that?

michaeljmarshall · 2021-02-02T06:17:58Z

@lhotari - thank you for your detailed explanation. I did not consider running this on a special environment to simulate the testing env. That is a great point, and something I'll keep in mind in the future.

Based on the example logging output you provided, it looks like it is still the broker being overridden because it is considered overloaded. See the following:

07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-c8478edb-886e-43b7-8d43-12cc159c0eb9/use/ns1/0x00000000_0xffffffff
07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
07:04:53.511 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-c8478edb-886e-43b7-8d43-12cc159c0eb9/use/ns2/0x00000000_0xffffffff
07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
07:04:53.516 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%
07:04:53.517 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:33199 is overloaded: max usage=1.2440309524536133
07:04:53.517 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:33199 is overloaded: CPU: 124.40309%, MEMORY: 19.42325%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0%

Given the first log line, it does look like two brokers are considered for placement. However, the CPU is listed at 124%, which leads to an override. I mentioned a concern about this in my initial PR message:

Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure.

Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores.

Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested await command could help by giving the brokers time to stabilize.

lhotari · 2021-02-02T06:33:44Z

Perhaps it is worth bumping the limit to something like 300 in this one case? If we only have 2 cores, we won't exceed that. Although, that does leave us with the potential to see this flakiness again if we ever give the test more cores.

I guess we would have to experiment to find a solution that makes sense. I don't know this area of Pulsar so I could give a direct advice.

Do you think the broker's cpu utilization is high because they are still in the process of starting up? If so, perhaps your suggested await command could help by giving the brokers time to stabilize.

My assumption was simply that the test might start executing before both brokers are available. That has been a source of flakiness at least in DiscoveryServiceTest. I didn't confirm this assumption in any way. A simple approach would be to experiment and check whether the problem reproduces after making changes. Being able to experiment, would require having ways to reproduce the issue in environment where you can quickly run experiments.

btw. I have published my "toolbox" as open source in https://github.com/lhotari/pulsar-contributor-toolbox . That contains shell script functions that I use for reproducing Pulsar flaky test failures. It works for me when I use Linux, zsh & sdkman (for JDK installation).

michaeljmarshall · 2021-02-02T06:43:59Z

@lhotari - thanks for the insight. I'll test this locally in a VM (I'm running on a Mac) and see how bumping the overloaded threshold affects the test.

lhotari · 2021-02-02T11:50:18Z

I took a quite look at the test and most likely my suggestion about checking the broker availability in setup doesn't make sense. What you already mentioned about the overload situation could be useful to validate by modifying the code and removing the logic temporarily from ModularLoadManagerImpl. For that, you would have to be able to first consistently reproduce the issue and then experimenting will be easy.

michaeljmarshall · 2021-02-03T07:38:50Z

@lhotari - I worked trying to get my environment set up tonight (just testing out setting the cpu/memory limits on my mac's docker desktop app, no extra VM yet), but I wasn't able to get to a place where I could test out this code change. I submitted a new commit with an updated threshold, which I think should work. If it doesn't work, I can continue tomorrow evening.

michaeljmarshall · 2021-02-04T07:18:43Z

@lhotari - After running this several times on my machine, I haven't been able to reproduce the failure with the limit set to 400. Although, I also wasn't able to get the broker's CPU utilization to go above 100%, but it looks like you were, so I'm not sure what that means because you had over 100% utilization. At what point do we consider this the solution to the flaky test? Thank you for your help.

As an aside, it does seem that using docker to limit CPU/mem is an option on a mac. Here is my Dockerfile

FROM openjdk:8-jdk-slim

RUN mkdir -p /usr/share/man/man1

RUN apt update && apt -y install maven

ENTRYPOINT ["/bin/bash"]

and then I started the container:

docker run --cpus=2 --memory=3g -it --entrypoint bash -v /Users/Michael:/root pulsar-test-env:0.0.0

and after navigating to my pulsar dir, ran

mvn test -Dtest=AntiAffinityNamespaceGroupTest -DfailIfNoTests=false -pl pulsar-broker

(Because I mounted my home to the container's home, maven worked properly so I could run the build and install outside of the container.)

lhotari · 2021-02-04T07:22:19Z

Thanks for the great effort on fixing this @michaeljmarshall . I'll test these changes with my setup to see if the earlier problem that I was able to reproduce is fixed.

lhotari · 2021-02-04T07:54:35Z

@michaeljmarshall unfortunately, it remains flaky. The logs are at https://gist.github.com/lhotari/f75e3efaf5f7bf7536e6d98c1898789e .

I reproduced with these commands on Linux (Ubuntu / PopOS 20.10):

$ git branch -D issue-6368-flaky-test # delete earlier branch for #9393 since it won't fast-forward
$ gh pr checkout 9393
$ mvn clean install -DskipTests -Dspotbugs.skip=true
$ ptbx_until_test_fails_in_docker_with_logs -Pcore-modules -pl pulsar-broker -Dtest=AntiAffinityNamespaceGroupTest

ptbx_until_test_fails_in_docker_with_logs is part of the toolbox I am using. That function runs a mvn test in a loop with limited resource (docker run --cpus=2 --memory=6g) until the command fails.

just in case anyone is interested, the ptbx_until_test_fails_in_docker_with_logs captures the logs which contain ansi color control codes, I stripped the ansi codes with ansi2txt before uploading the logs to the gist:

$ cat output_2021-02-04-093600.log | ansi2txt > output_2021-02-04-093600.log.txt
$ gh gist create output_2021-02-04-093600.log.txt -d "testing PR 9393 https://github.com/apache/pulsar/commits/$(git rev-parse HEAD | colrm 9)"

gh is the github's command line tooling, https://cli.github.com/ .

michaeljmarshall · 2021-02-04T17:01:48Z

@lhotari - thank you for running that for me. For what it's worth, it looks like the failure is related to the cpu usage again, although this time it's not necessarily a value I would have expected--Infinity.

07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns1/0x00000000_0xffffffff
07:39:57.219 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
07:39:57.220 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-010508de-9e58-4c86-95d8-d9b28f60a7ec/use/ns2/0x00000000_0xffffffff
07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
07:39:57.224 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:42477 is overloaded: max usage=Infinity
07:39:57.225 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN  org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:42477 is overloaded: CPU: Infinity%, MEMORY: 17.458387%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN%
!!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])-------

I'm not positive what solutions make sense here, but it does seem like a surprising behavior that the logic calculated the CPU usage at Infinity%. Seems like something worth looking into. Ultimately, it's the same logic for overloading that is the problem here.

michaeljmarshall · 2021-02-08T16:54:25Z

I dug into this a bit more this morning, and it's interesting to point out that there are two fundamental differences between the system I am running tests on (macOS, or non linux) and @lhotari's system, which is linux. I'm guessing the Azure VM @lhotari references above is a linux machine, but it'd be good to confirm. This might explain some of the differences we're seeing regarding reproducibility. The following conditional block determines the way a broker calculates cpu (and all other) host usage:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/tools/LoadReportCommand.java

Lines 94 to 102 in 7c68ade

    
           if (isLinux) { 
        
               hostUsage = new LinuxBrokerHostUsageImpl( 
        
                   Integer.MAX_VALUE, Optional.empty(), scheduler 
        
               ); 
        
           } else { 
        
               hostUsage = new GenericBrokerHostUsageImpl( 
        
                   Integer.MAX_VALUE, scheduler 
        
               ); 
        
           }

I plan to take a look at the LinuxBrokerHostUsageImpl class later. I wonder if there is a bug in that code because we saw that Infinity value for CPU usage in the logs above. I'm guessing we're dividing by 0 in some rare edge case.

There could certainly be an issue in the GenericBrokerHostUsageImpl class, too, but it seems to me that we're more consistently reproducing the problem in the linux version, so that's where I'll look at the linux class next.

It's worth mentioning that neither of these Usage classes are explicitly tested.

lhotari · 2021-02-08T18:44:28Z

I'm guessing the Azure VM @lhotari references above is a linux machine, but it'd be good to confirm.

@michaeljmarshall yes, it's a VM running Ubuntu Linux. There's GitHub Actions Runner VMs specs for more details about the environment. It's possible to debug such an environment by getting a shell inside a running VM with https://github.com/marketplace/actions/debugging-with-tmate in your own fork, for example. To debug the GitHub Actions VM, you'd enable GitHub Actions in your own fork in some repository and open a PR to your own fork. However the downside of that is that it's very time consuming.

lhotari · 2021-02-08T18:53:48Z

I plan to take a look at the LinuxBrokerHostUsageImpl class later. I wonder if there is a bug in that code because we saw that Infinity value for CPU usage in the logs above. I'm guessing we're dividing by 0 in some rare edge case.

@michaeljmarshall very good direction. One detail that immediately caught my eye in LinuxBrokerHostUsageImpl is the usage of System.currentTimeMillis(). Replacing that with System.nanoTime() could be helpful. The calculations that mix doubles and longs seem error prone. It would be useful to check how good the test coverage for this class is.

lhotari · 2021-02-08T19:27:23Z

The code in LinuxBrokerHostUsageImpl rings some bells. I faced some issues earlier where the method to read the Linux /proc and /sys/fs files with Files.readAllBytes was causing odd issues. I'll have to check if I reported an issue back then or brought it up on Pulsar Slack etc.

michaeljmarshall · 2021-02-09T06:40:34Z

...broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java

        this.isCGroupsEnabled = isCGroupsEnabled;
-        calculateBrokerHostUsage();
+
+        executorService.scheduleAtFixedRate(this::calculateBrokerHostUsage, 0,


@lhotari - because we were scheduling calculateBrokerHostUsage with a delay of 0 minutes and then calling it only a few lines later, it is quite possible that there could have been a race internally that led to a calculated elapsedSeconds of 0. If elapsedSeconds is 0, we'll divide by by 0 in the method to calculate cpuUsage, which could likely be our culprit for the "Infinite" cpu usage you saw in your test.

Here is the full method:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java

Lines 96 to 127 in ce14ec4

@Override

public void calculateBrokerHostUsage() {

List<String> nics = getNics();

double totalNicLimit = getTotalNicLimitKbps(nics);

double totalNicUsageTx = getTotalNicUsageTxKb(nics);

double totalNicUsageRx = getTotalNicUsageRxKb(nics);

double totalCpuLimit = getTotalCpuLimit();

SystemResourceUsage usage = new SystemResourceUsage();

long now = System.currentTimeMillis();

double elapsedSeconds = (now - lastCollection) / 1000d;

double cpuUsage = getTotalCpuUsage(elapsedSeconds);

if (lastCollection == 0L) {

usage.setMemory(getMemUsage());

usage.setBandwidthIn(new ResourceUsage(0d, totalNicLimit));

usage.setBandwidthOut(new ResourceUsage(0d, totalNicLimit));

} else {

double nicUsageTx = (totalNicUsageTx - lastTotalNicUsageTx) / elapsedSeconds;

double nicUsageRx = (totalNicUsageRx - lastTotalNicUsageRx) / elapsedSeconds;

usage.setMemory(getMemUsage());

usage.setBandwidthIn(new ResourceUsage(nicUsageRx, totalNicLimit));

usage.setBandwidthOut(new ResourceUsage(nicUsageTx, totalNicLimit));

}

lastTotalNicUsageTx = totalNicUsageTx;

lastTotalNicUsageRx = totalNicUsageRx;

lastCollection = System.currentTimeMillis();

this.usage = usage;

usage.setCpu(new ResourceUsage(cpuUsage, totalCpuLimit));

}

Also, note that this method likely should not have been called before setting isCGroupsEnabled because the internals of the method rely on that value. Perhaps @merlimat can provide a bit more context for whether or not this update adds value here (he added the isCGroupsEnabled code this past July).

Also note that any bad state, like an infinite cpu usage, would be stored for a minute until the cpu usage is recalculated. Considering that this test runs quickly, that would definitely be a culprit for our flaky test.

@michaeljmarshall I was also assuming some race condition and I made an experiment on top of your changes where I replaced the Files.readAllBytes usage and made ResourceUsage immutable: master...lhotari:issue-6368-flaky-test . However, the same problem still reproduces. CPU shows as Infinity in the log lines:

06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbala nce.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203- 3147486f2673/use/ns1/0x00000000_0xffffffff 06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity 06:52:40.333 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN% 06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbala nce.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-98033a04-3a4d-4e5d-9203- 3147486f2673/use/ns2/0x00000000_0xffffffff 06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity 06:52:40.337 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN% 06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala nce.impl.LeastLongTermMessageRate - Broker http://localhost:35061 is overloaded: max usage=Infinity 06:52:40.338 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbala nce.impl.LeastLongTermMessageRate - Broker localhost:35061 is overloaded: CPU: Infinity%, MEMORY: 16.867092%, DIRECT MEMORY: 2.4597175%, BANDWIDTH IN: NaN%, BANDWIDTH OUT: NaN% !!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest] .testBrokerSelectionForAntiAffinityGroup([])-------

Thanks for taking a look--it's good to know we're narrowing in on something. I'll take a look again tomorrow.

michaeljmarshall · 2021-02-09T07:02:49Z

One detail that immediately caught my eye in LinuxBrokerHostUsageImpl is the usage of System.currentTimeMillis(). Replacing that with System.nanoTime() could be helpful. The calculations that mix doubles and longs seem error prone.

In looking at the class and the scheduling of the calculateBrokerHostUsage method, I can see that this class (and GenericBrokerHostUsageImpl ) are not really designed to be called in immediate succession. (Side note, that makes me wonder about the choice to use scheduleAtFixedRate instead of scheduleWithFixedDelay.) Because we don't intend to call the methods in immediate succession, millis look like they could be fine, but I could definitely see making a switch to nanos too. What do you think?

I also agree that the mix of longs and doubles seems error prone.

It would be useful to check how good the test coverage for this class is.

These classes do seem pretty fundamental to load balancing and I believe they are also essential to the load shedding mechanisms, which are pretty important. There are tests that implicitly cover this code, but perhaps we can add something more explicit in another PR. It's likely that these tests aren't covered because some of the logic is hard to test. Reworking some of the classes to make them more testable might be necessary.

The code in LinuxBrokerHostUsageImpl rings some bells. I faced some issues earlier where the method to read the Linux /proc and /sys/fs files with Files.readAllBytes was causing odd issues. I'll have to check if I reported an issue back then or brought it up on Pulsar Slack etc.

Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the LinuxBrokerHostUsageImpl by not calling it at nearly the same time.

lhotari · 2021-02-09T07:34:41Z

Let me know if you found anything regarding this issue. My most recent commit should definitely add some stability to the LinuxBrokerHostUsageImpl by not calling it at nearly the same time.

@michaeljmarshall yes, you are right about the race condition. Good catch. It's the calculateBrokerHostUsage call that is scheduled and the calculateBrokerHostUsage call in the constructor that cause the issue.

Besides the race, it also seems that this is a clear bug in the calculateBrokerHostUsage method itself. I'm now running the test loop (to see if it fails at some point) with these changes:
468550f

I have a few commits in my fork where I experimented with various improvements:
master...lhotari:issue-6368-flaky-test

...broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java

…hreshold

…pl class initialization

...oker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java

eolivelli · 2021-02-10T08:29:47Z

...oker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/GenericBrokerHostUsageImpl.java

-        cpuUsageSum = 0d;
-        cpuUsageCount = 0;
-        return cpuUsage;
+        synchronized (this) {


nit: what about moving the synchronized keyword in the method declaration ?

I did this, and I also moved the check for cpuUsageCount being zero into this method. It makes more sense to me in the synchronized method, and it will ensure that if any other methods end up calling this one, there isn't a chance of accidentally dividing by 0.

eolivelli · 2021-02-10T08:30:53Z

@codelipenghui @jiazhai @wolfstudy PTAL

lhotari

LGTM

eolivelli

LGTM

lhotari

LGTM, good work @michaeljmarshall

lhotari · 2021-02-10T20:43:50Z

@sijie @merlimat PTAL. This is a fix to a common flaky test.

michaeljmarshall · 2021-02-17T05:36:29Z

@sijie, @eolivelli, @lhotari - given the discussion in today's community meeting about what branches should have LTS, I'm thinking this commit should be cherry picked to other release branches because the changes to the underlying host usage classes are concurrency bugs. (Even if we don't have true LTS versions, this patch should likely make it into 2.7.x and possibly 2.6.x)

…hreshold (#9393) Fixes: #6368 Flaky-test: `AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup` TestClass is flaky. The testMethod test method fails sporadically. See #6368 for example failures as well as for my analysis and detailed justification for this change. In short, by setting the `LoadBalancerBrokerOverloadedThresholdPercentage` to `100`, we remove the main edge case that allows two namespaces in the same anti-affinity group to get placed on the same broker. Note that I am assuming the following method will never return a value greater than 1, which could lead to test failure. https://github.com/apache/pulsar/blob/85f3ff4edbaa10c7894af8ad823cbce37b13829c/pulsar-common/src/main/java/org/apache/pulsar/policies/data/loadbalancer/LocalBrokerData.java#L214-L217 (cherry picked from commit 610b17d)

…hreshold (apache#9393)

sijie assigned michaeljmarshall Feb 1, 2021

sijie added the type/flaky-tests label Feb 1, 2021

sijie added this to the 2.8.0 milestone Feb 1, 2021

michaeljmarshall force-pushed the issue-6368-flaky-test branch from f8bef98 to 079263b Compare February 3, 2021 07:33

michaeljmarshall force-pushed the issue-6368-flaky-test branch from 079263b to ce14ec4 Compare February 9, 2021 06:32

michaeljmarshall commented Feb 9, 2021

View reviewed changes

lhotari reviewed Feb 9, 2021

View reviewed changes

...broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java Show resolved Hide resolved

lhotari reviewed Feb 9, 2021

View reviewed changes

...broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/LinuxBrokerHostUsageImpl.java Outdated Show resolved Hide resolved

michaeljmarshall added 5 commits February 9, 2021 21:32

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedT…

bc08dfe

…hreshold

Add comment to test for future reference

5f273b4

Rebase and bump threshold to 400

6093b5c

Only call calculateBrokerHostUsage once during LinuxBrokerHostUsageIm…

fffac55

…pl class initialization

Call method to ensure class is initialized before constructor returns

cd109ef

michaeljmarshall force-pushed the issue-6368-flaky-test branch from ce14ec4 to cd109ef Compare February 10, 2021 04:32

michaeljmarshall added 2 commits February 9, 2021 23:45

Make GenericBrokerHostUsageImpl thread safe

1c07173

Fix long line

7a53be0

eolivelli reviewed Feb 10, 2021

View reviewed changes

lhotari approved these changes Feb 10, 2021

View reviewed changes

Make methods synchronized; move check for zero; rebase

fb87dee

michaeljmarshall force-pushed the issue-6368-flaky-test branch from 7f902e9 to fb87dee Compare February 10, 2021 16:00

eolivelli approved these changes Feb 10, 2021

View reviewed changes

Use method already available in GenericBrokerHostUsageImpl

be7afe2

lhotari approved these changes Feb 10, 2021

View reviewed changes

sijie approved these changes Feb 11, 2021

View reviewed changes

sijie merged commit 610b17d into apache:master Feb 11, 2021

michaeljmarshall deleted the issue-6368-flaky-test branch February 17, 2021 05:33

merlimat added type/bug The PR fixed a bug or issue reported a bug release/2.7.1 labels Feb 17, 2021

sijie added the release/2.6.4 label Feb 17, 2021

codelipenghui added the cherry-picked/branch-2.7 Archived: 2.7 is end of life label Feb 18, 2021

zymap added the cherry-picked/branch-2.6 label Mar 2, 2021

merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 6, 2021

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedT…

5315f61

…hreshold (apache#9393)

lordcheng10 mentioned this pull request Jan 4, 2022

Fix bug :Infinity value for CPU or Bandwidth usage #13609

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393

michaeljmarshall commented Feb 1, 2021

michaeljmarshall commented Feb 1, 2021 •

edited

Loading

lhotari commented Feb 1, 2021 •

edited

Loading

michaeljmarshall commented Feb 2, 2021

lhotari commented Feb 2, 2021

michaeljmarshall commented Feb 2, 2021

lhotari commented Feb 2, 2021

michaeljmarshall commented Feb 3, 2021

michaeljmarshall commented Feb 4, 2021

lhotari commented Feb 4, 2021

lhotari commented Feb 4, 2021 •

edited

Loading

michaeljmarshall commented Feb 4, 2021 •

edited

Loading

michaeljmarshall commented Feb 8, 2021

lhotari commented Feb 8, 2021

lhotari commented Feb 8, 2021

lhotari commented Feb 8, 2021

michaeljmarshall Feb 9, 2021

michaeljmarshall Feb 9, 2021

lhotari Feb 9, 2021

michaeljmarshall Feb 9, 2021

michaeljmarshall commented Feb 9, 2021

lhotari commented Feb 9, 2021 •

edited

Loading

eolivelli Feb 10, 2021

michaeljmarshall Feb 10, 2021

eolivelli commented Feb 10, 2021

lhotari left a comment

eolivelli left a comment

lhotari left a comment

lhotari commented Feb 10, 2021

michaeljmarshall commented Feb 17, 2021

	public double getMaxResourceUsage() {
	return max(cpu.percentUsage(), memory.percentUsage(), directMemory.percentUsage(), bandwidthIn.percentUsage(),
	bandwidthOut.percentUsage()) / 100;
	}

	@Override
	public void calculateBrokerHostUsage() {
	List<String> nics = getNics();
	double totalNicLimit = getTotalNicLimitKbps(nics);
	double totalNicUsageTx = getTotalNicUsageTxKb(nics);
	double totalNicUsageRx = getTotalNicUsageRxKb(nics);
	double totalCpuLimit = getTotalCpuLimit();

	SystemResourceUsage usage = new SystemResourceUsage();
	long now = System.currentTimeMillis();
	double elapsedSeconds = (now - lastCollection) / 1000d;
	double cpuUsage = getTotalCpuUsage(elapsedSeconds);

	if (lastCollection == 0L) {
	usage.setMemory(getMemUsage());
	usage.setBandwidthIn(new ResourceUsage(0d, totalNicLimit));
	usage.setBandwidthOut(new ResourceUsage(0d, totalNicLimit));
	} else {
	double nicUsageTx = (totalNicUsageTx - lastTotalNicUsageTx) / elapsedSeconds;
	double nicUsageRx = (totalNicUsageRx - lastTotalNicUsageRx) / elapsedSeconds;

	usage.setMemory(getMemUsage());
	usage.setBandwidthIn(new ResourceUsage(nicUsageRx, totalNicLimit));
	usage.setBandwidthOut(new ResourceUsage(nicUsageTx, totalNicLimit));
	}

	lastTotalNicUsageTx = totalNicUsageTx;
	lastTotalNicUsageRx = totalNicUsageRx;
	lastCollection = System.currentTimeMillis();
	this.usage = usage;
	usage.setCpu(new ResourceUsage(cpuUsage, totalCpuLimit));
	}

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393

Fix testBrokerSelectionForAntiAffinityGroup by increasing OverloadedThreshold #9393

Conversation

michaeljmarshall commented Feb 1, 2021

michaeljmarshall commented Feb 1, 2021 • edited Loading

lhotari commented Feb 1, 2021 • edited Loading

michaeljmarshall commented Feb 2, 2021

lhotari commented Feb 2, 2021

michaeljmarshall commented Feb 2, 2021

lhotari commented Feb 2, 2021

michaeljmarshall commented Feb 3, 2021

michaeljmarshall commented Feb 4, 2021

lhotari commented Feb 4, 2021

lhotari commented Feb 4, 2021 • edited Loading

michaeljmarshall commented Feb 4, 2021 • edited Loading

michaeljmarshall commented Feb 8, 2021

lhotari commented Feb 8, 2021

lhotari commented Feb 8, 2021

lhotari commented Feb 8, 2021

michaeljmarshall Feb 9, 2021

Choose a reason for hiding this comment

michaeljmarshall Feb 9, 2021

Choose a reason for hiding this comment

lhotari Feb 9, 2021

Choose a reason for hiding this comment

michaeljmarshall Feb 9, 2021

Choose a reason for hiding this comment

michaeljmarshall commented Feb 9, 2021

lhotari commented Feb 9, 2021 • edited Loading

eolivelli Feb 10, 2021

Choose a reason for hiding this comment

michaeljmarshall Feb 10, 2021

Choose a reason for hiding this comment

eolivelli commented Feb 10, 2021

lhotari left a comment

Choose a reason for hiding this comment

eolivelli left a comment

Choose a reason for hiding this comment

lhotari left a comment

Choose a reason for hiding this comment

lhotari commented Feb 10, 2021

michaeljmarshall commented Feb 17, 2021

michaeljmarshall commented Feb 1, 2021 •

edited

Loading

lhotari commented Feb 1, 2021 •

edited

Loading

lhotari commented Feb 4, 2021 •

edited

Loading

michaeljmarshall commented Feb 4, 2021 •

edited

Loading

lhotari commented Feb 9, 2021 •

edited

Loading