High CPU load when PulsarClient (Java) is idle #6414

Closed
racorn opened this issue Feb 24, 2020 · 4 comments
Labels: area/client, help wanted, type/bug

Comments

racorn (Contributor) commented Feb 24, 2020

Describe the bug
Running a simple consumer on Windows using the PulsarClient, I noticed considerable CPU usage while the consumer was just idling. Profiling showed that the 'pulsar-timer-x-y' thread is heavy on the CPU. Digging into the source code, we can see that the HashedWheelTimer is set up with a 1 ms tick time:

timer = new HashedWheelTimer(getThreadFactory("pulsar-timer"), 1, TimeUnit.MILLISECONDS);

A simple test with similar values shows that a running HashedWheelTimer can consume CPU equal to 100% of one hyperthread:

    // Requires io.netty.util.HashedWheelTimer, io.netty.util.Timeout,
    // io.netty.util.TimerTask and java.util.concurrent.TimeUnit.
    @Test
    public void testTimerWheel() throws Exception {
        HashedWheelTimer timer = new HashedWheelTimer(1, TimeUnit.MILLISECONDS);

        // Self-rescheduling no-op task: the CPU load comes from the wheel
        // spinning every 1 ms, not from the task itself.
        timer.newTimeout(new TimerTask() {
            @Override
            public void run(final Timeout timeout) throws Exception {
                timer.newTimeout(this, 500, TimeUnit.MILLISECONDS);
            }
        }, 500, TimeUnit.MILLISECONDS);

        Thread.sleep(30_000);
        timer.stop();
    }

My understanding of the HashedWheelTimer is that it is intended for a large number of approximate timeouts, not for millisecond-precision scheduling, but I may be mistaken.

To Reproduce
Run a simple consumer subscribing to a topic with no messages, or run the above simple JUnit test. Observe CPU usage.

Expected behavior
An idle consumer should not incur high CPU load. For example, when running 8 consumers on an i7-8700 (4 cores, 2 threads per core), Task Manager reported 100% CPU usage even though there was no traffic and no messages on the topics.

Discussion
High CPU load is a high price to pay for immediate batch dispatching (which may be the primary reason for the 1 ms tick time). It should be possible to relax immediate batch dispatching in favor of leaving more compute resources available to other tasks.

An alternative is to use a plain old ScheduledExecutorService for batch dispatching/reception; tests suggest it is also lighter on the CPU when millisecond precision is required and the number of tasks is not too high. For example, I could schedule 2000 simple tasks every millisecond (scheduleAtFixedRate) and still have compute to spare on a single thread. When scheduling 10k tasks every millisecond, the thread was working at 100%, but scheduling 10k tasks every 10 milliseconds was OK. So if a client is used to create a huge number of producers/consumers, it might be better to relax the batch timeout so the timeout thread has a chance to actually perform its work. Another observation is that the HashedWheelTimer with a 1 millisecond tick time is not able to run even a single simple task every millisecond, partly due to the HashedWheelTimer's approximate nature, and maybe partly because spinning the wheel competes with task expiration.

Note that when using a ScheduledExecutorService each task might not get exactly the same number of executions; for example, when scheduling 1000 tasks to execute every millisecond for 30 seconds, tasks were executed between 29,986 and 30,012 times.
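
For reference, here is a minimal sketch of the kind of measurement described above. The no-op lambda stands in for a batch-dispatch job, and the counts/periods simply mirror the experiment; this is not Pulsar code.

    // Requires java.util.concurrent.Executors, ScheduledExecutorService and TimeUnit.
    @Test
    public void testScheduledExecutor() throws Exception {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        // 2000 placeholder "batch dispatch" tasks, each running every 1 ms on a single thread.
        for (int i = 0; i < 2000; i++) {
            executor.scheduleAtFixedRate(() -> { /* flush batch */ }, 1, 1, TimeUnit.MILLISECONDS);
        }

        Thread.sleep(30_000);   // observe CPU usage while the tasks run
        executor.shutdownNow();
    }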

Some suggestions:

  1. The tick time could be configurable, with a default value of 1 millisecond to preserve current behavior (a minimal sketch follows this list).
  2. Use a default tick time of 100 milliseconds, and create a ScheduledExecutorService in PulsarClientImpl that is exposed the same way as the Timer object for other classes to use. ProducerImpl and ConsumerBase could then use the ScheduledExecutorService to schedule batch dispatching/reception.
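
As a rough illustration of suggestion 1, the timer construction in PulsarClientImpl could read the tick duration from the client configuration instead of hard-coding 1 ms. The accessor name below is hypothetical, not existing API.

    // Hypothetical accessor; today the tick duration is hard-coded to 1 ms.
    long tickDurationMs = conf.getTimerTickTimeMillis();   // default would stay at 1
    timer = new HashedWheelTimer(getThreadFactory("pulsar-timer"), tickDurationMs, TimeUnit.MILLISECONDS);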

Additional context
Observed under Windows 10 64-bit, running Java 11.

jiazhai (Member) commented Feb 28, 2020

Thanks a lot to @racorn for the detailed analysis. Your suggestions sound great. Would you like to provide a fix for this?

racorn (Contributor, Author) commented Feb 28, 2020

Yes, I can try to implement what I suggested by mid next week. Will that be OK with you?

racorn (Contributor, Author) commented Feb 29, 2020

I wish I had seen this earlier: netty/netty#9710, and the fix in netty/netty#9714.

As the Netty version has now been bumped to 4.1.45 on master, the CPU load is no longer an issue on Windows.

Anyway, @jiazhai, I made the suggested implementation in my forked repository; you can have a look to see if it is of any value. It could be an improvement in ProducerImpl, where client.scheduledExecutorService().scheduleAtFixedRate better reflects what you want to achieve than using a Timeout. Also, the keyGeneratorTask could use the ScheduledExecutorService instead of the Netty eventLoopGroup. On the other hand, if you are producing messages for a huge number of topics (10k+), the HashedWheelTimer may perform better. So maybe it is best to leave it as it is.
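
To make the contrast concrete, here is a simplified sketch of the two scheduling patterns. flushBatch(), batchDelayMs and scheduledExecutor are placeholders for illustration, not actual ProducerImpl members.

    // Current pattern: a TimerTask that re-arms itself on the shared HashedWheelTimer.
    timer.newTimeout(new TimerTask() {
        @Override
        public void run(Timeout timeout) {
            flushBatch();   // placeholder for the batch dispatch work
            timer.newTimeout(this, batchDelayMs, TimeUnit.MILLISECONDS);
        }
    }, batchDelayMs, TimeUnit.MILLISECONDS);

    // Alternative pattern: a fixed-rate task on a shared ScheduledExecutorService.
    scheduledExecutor.scheduleAtFixedRate(
            this::flushBatch, batchDelayMs, batchDelayMs, TimeUnit.MILLISECONDS);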

jiazhai (Member) commented Mar 3, 2020

Thanks @racorn for your great help. I agree with you that we can leave it as it is, and I would like to close this issue.
