
Improve visibility for DataCarrier. #3726

Closed (wants to merge 4 commits)
Conversation

wu-sheng (Member) commented Oct 28, 2019:

We have received several reports that the DataCarrier queue consume threads are not very busy, which can leave CPUs underused. I added one more layer to the DataCarrier queue to avoid array element changes being invisible across threads.

This is a compatible change. I hope someone can verify it in test or production environments. It should improve OAP performance and reduce latency in processing data.
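To illustrate, here is a minimal sketch of the extra layer described above, assuming a wrapper whose volatile field carries the cross-thread visibility (names are illustrative, not necessarily the exact PR code):

    // Hypothetical sketch: each slot of the queue array holds a BufferItem
    // instead of the raw data object. The volatile field guarantees that a
    // producer's write becomes visible to consumer threads reading the slot.
    public class BufferItem {
        private volatile Object item;

        public void setItem(Object item) {
            this.item = item;
        }

        public Object getItem() {
            return item;
        }
    }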

wu-sheng added the "core feature", "agent", "backend", and "feature" labels on Oct 28, 2019
wu-sheng added this to the 6.5.0 milestone on Oct 28, 2019
wu-sheng (Member, Author):

The old thread graph is like this.

(screenshot: thread graph)

We have improved many other performance points, but this one seems to have been missed. Since it was reported and no one submitted a PR, I am giving it a try.

wu-sheng (Member, Author):

@yantaowu I think you mentioned this before; could you check this too?

wu-sheng (Member, Author):

@lkxiaolou As you have worked on optimizing DataCarrier several times, could you help recheck this too?

Review comment on the changed code:

    public void setItem(Object item) {
        this.item = item;
    }
Contributor:

Setting the item as volatile may reduce the producer's performance.

wu-sheng (Member, Author):

Yes. The key is that I want to reduce the visibility delay to the consumer thread. I want to check whether that issue really exists.

wu-sheng (Member, Author):

The point is, if the visibility issue happens, it affects performance even more: the queue will be blocked longer than required.

kezhenxu94 (Member) left a comment:

Maybe we can simply reuse java.util.concurrent.atomic.AtomicReferenceArray; it has no thread visibility issues and is well optimized as part of the JDK.
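A hedged sketch of that alternative, assuming the buffer slots were backed by an AtomicReferenceArray (class and method names here are illustrative, not the actual DataCarrier code):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Hypothetical buffer using AtomicReferenceArray: its get/set calls have
    // volatile semantics, so element writes are visible across producer and
    // consumer threads without an explicit volatile wrapper per slot.
    public class AtomicBuffer<T> {
        private final AtomicReferenceArray<T> buffer;

        public AtomicBuffer(int bufferSize) {
            this.buffer = new AtomicReferenceArray<>(bufferSize);
        }

        public boolean save(int index, T data) {
            // only occupy an empty slot; returns false if the slot is taken
            return buffer.compareAndSet(index, null, data);
        }

        public T obtain(int index) {
            // atomically read and clear the slot
            return buffer.getAndSet(index, null);
        }
    }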

wu-sheng (Member, Author):

Does it make a real difference in our case? I have looked into that before too.

hanahmily (Contributor):

I'm missing a little of the context of this PR. Do you intend to make the consumer threads observe changes to the queue with the help of volatile?
If your assumption is correct, some consumers should be busier than others, but I couldn't confirm that from the graph.

wu-sheng (Member, Author):

@hanahmily It was reported by the community that the consumers are not fully working. That is the context.

hanahmily (Contributor):

> @hanahmily It was reported by the community that the consumers are not fully working. That is the context.

What's the state of these consumers? I think that's a clue for us to find out why.

wu-sheng (Member, Author):

> What's the state of these consumers? I think that's a clue for us to find out why.

We only know from the graph that the threads are not busy; we don't know why. Since we have already reduced the number of threads, this should be a problem.

wu-sheng modified the milestones: 6.5.0, 6.6.0 on Oct 31, 2019
yantaowu (Contributor) commented Nov 4, 2019:

> @yantaowu I think you mentioned this before; could you check this too?

I have tested this PR. The issue mentioned before has been resolved, but the CPU usage is higher than with our own implementation.

dmsolr (Member) commented Nov 4, 2019:

> I have tested this PR. The issue mentioned before has been resolved, but the CPU usage is higher than with our own implementation.

Hi @yantaowu
I am interested in this. Could you share how you tested? I also wonder whether it is related to the 20-millisecond consume cycle. Try changing 20 ms to 100 ms. Just a guess :)

    public DataCarrier consume(IConsumer<T> consumer, int num) {
        return this.consume(consumer, num, 20);
    }
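For reference, a hedged usage sketch of passing a longer consume cycle through the three-argument overload above (the consumer class and constructor arguments are assumptions, not actual project code):

    // Hypothetical usage: 2 consumer threads with a 100 ms consume cycle
    // instead of the default 20 ms shown above.
    DataCarrier<SomeData> carrier = new DataCarrier<>(channelSize, bufferSize);
    carrier.consume(new SomeDataConsumer(), 2, 100);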

wu-sheng (Member, Author) commented Nov 4, 2019:

@dmsolr He is running a cherry-pick test on an old release, 6 GA. The CPU usage is only a reference. I will try to do a comparison test between this and his implementation, which uses a blocking queue.

wu-sheng (Member, Author):

@hanahmily @kezhenxu94 @dmsolr I ran a local benchmark, and YES, in blocking mode (OAP is using this mode now) the ArrayBlockingQueue shows ~30% better performance.

I wrote a fake Buffer class to run this test:

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// BufferStrategy, AtomicRangeInteger, and QueueBlockingCallback come from the
// DataCarrier module (commons-datacarrier) and are omitted here.
public class Buffer2<T> {
    // Replaces the original BufferItem[] array with an ArrayBlockingQueue.
    private final ArrayBlockingQueue<T> buffer;
    private BufferStrategy strategy;
    private AtomicRangeInteger index;
    private List<QueueBlockingCallback<T>> callbacks;

    Buffer2(int bufferSize, BufferStrategy strategy) {
        buffer = new ArrayBlockingQueue<>(bufferSize);
        this.strategy = strategy;
        index = new AtomicRangeInteger(0, bufferSize);
        callbacks = new LinkedList<QueueBlockingCallback<T>>();
    }

    void setStrategy(BufferStrategy strategy) {
        this.strategy = strategy;
    }

    void addCallback(QueueBlockingCallback<T> callback) {
        callbacks.add(callback);
    }

    boolean save(T data) {
        try {
            buffer.put(data);   // blocks when the queue is full (BLOCKING strategy)
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return true;
    }

    public int getBufferSize() {
        return 0;
    }

    public void obtain(List<T> consumeList) {
        buffer.drainTo(consumeList);
    }

    public void obtain(List<T> consumeList, int start, int end) {
        buffer.drainTo(consumeList);
    }
}

And this is my benchmark

import java.util.ArrayList;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ChannelWritePerformanceTest {
    private static Buffer buffer = new Buffer(2000, BufferStrategy.BLOCKING);
    private static Buffer2 buffer2 = new Buffer2(2000, BufferStrategy.BLOCKING);

    @Benchmark
    public void testBuffer1() {
        // Fill 200 of the 2000 slots, then drain, so the queue never becomes full.
        for (int i = 0; i < 200; i++) {
            buffer.save(new Object());
        }
        buffer.obtain(new ArrayList<>(2000));
    }

    @Benchmark
    public void testBuffer2() {
        for (int i = 0; i < 200; i++) {
            buffer2.save(new Object());
        }
        buffer2.obtain(new ArrayList<>(2000));
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(ChannelWritePerformanceTest.class.getSimpleName())
            .forks(1)
            .warmupIterations(1)
            .measurementIterations(5)
            .build();

        new Runner(opt).run();
    }
}

The following is the test report:

# JMH version: 1.21
# VM version: JDK 1.8.0_91, Java HotSpot(TM) 64-Bit Server VM, 25.91-b14
# VM invoker: /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=54052:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer1

# Run progress: 0.00% complete, ETA 00:03:40
# Fork: 1 of 1
objc[74650]: Class JavaLaunchHelper is implemented in both /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java (0x109a6e4c0) and /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x109ae84e0). One of the two will be used. Which one is undefined.
# Warmup Iteration   1: 125181.486 ops/s
Iteration   1: 123915.279 ops/s
Iteration   2: 113945.502 ops/s
Iteration   3: 113637.598 ops/s
Iteration   4: 110179.474 ops/s
Iteration   5: 110066.439 ops/s
Iteration   6: 114997.944 ops/s
Iteration   7: 113754.850 ops/s
Iteration   8: 116012.443 ops/s
Iteration   9: 112262.401 ops/s
Iteration  10: 109717.243 ops/s


Result "org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer1":
  113848.917 ±(99.9%) 6269.203 ops/s [Average]
  (min, avg, max) = (109717.243, 113848.917, 123915.279), stdev = 4146.689
  CI (99.9%): [107579.715, 120118.120] (assumes normal distribution)


# JMH version: 1.21
# VM version: JDK 1.8.0_91, Java HotSpot(TM) 64-Bit Server VM, 25.91-b14
# VM invoker: /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=54052:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer2

# Run progress: 50.00% complete, ETA 00:01:51
# Fork: 1 of 1
objc[74661]: Class JavaLaunchHelper is implemented in both /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java (0x10baca4c0) and /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x10bb444e0). One of the two will be used. Which one is undefined.
# Warmup Iteration   1: 185663.422 ops/s
Iteration   1: 167183.740 ops/s
Iteration   2: 150882.897 ops/s
Iteration   3: 150184.345 ops/s
Iteration   4: 151788.304 ops/s
Iteration   5: 150627.348 ops/s
Iteration   6: 152071.145 ops/s
Iteration   7: 153232.832 ops/s
Iteration   8: 150855.043 ops/s
Iteration   9: 149277.780 ops/s
Iteration  10: 148825.612 ops/s


Result "org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer2":
  152492.905 ±(99.9%) 8046.102 ops/s [Average]
  (min, avg, max) = (148825.612, 152492.905, 167183.740), stdev = 5321.998
  CI (99.9%): [144446.802, 160539.007] (assumes normal distribution)


# Run complete. Total time: 00:03:41

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                 Mode  Cnt       Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  113848.917 ± 6269.203  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  152492.905 ± 8046.102  ops/s

Process finished with exit code 0

Do you have any suggestions? Do we need to update this, such as using a BlockingQueue in blocking mode?

wu-sheng (Member, Author):

When I changed the code to insert more data, something interesting happened.
I inserted 1200 records into a 2000-size queue for both:

for (int i = 0; i < 1200; i++) {
    buffer1/buffer2.save(new Object());   // same loop body, applied to both buffers
}

Benchmark                                 Mode  Cnt      Score     Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  30516.424 ± 745.733  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  33467.919 ± 384.183  ops/s

wu-sheng (Member, Author):

As @yantaowu told me offline, they are using Queue#offer, accepting the tradeoff that the queue may be full and data may be lost. The result is not very different, since we don't make the queue full in our tests. testBuffer2 uses the #offer method in this test.

Benchmark                                 Mode  Cnt      Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  29020.969 ±  613.798  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  32102.383 ± 1715.402  ops/s
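For reference, a hedged sketch of what that offer-based save might look like (an assumption about their implementation, not actual code from this PR):

    // Hypothetical non-blocking save: offer() returns false immediately when the
    // queue is full, trading data loss for never blocking the producer thread.
    boolean save(T data) {
        return buffer.offer(data);
    }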

wu-sheng (Member, Author):

I dug deeper into the ArrayBlockingQueue source code and noticed the user feedback is misleading: ArrayBlockingQueue never deals with cross-thread visibility explicitly. Everyone here can check the source code.

So after I removed the volatile but kept the BufferItem, the BlockingQueue has no advantage.

Benchmark                                 Mode  Cnt      Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  45184.551 ±  425.330  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  33548.799 ± 1793.056  ops/s

wu-sheng (Member, Author):

Furthermore, using the master code of DataCarrier, we get much higher throughput. The following result uses the master code as testBuffer1.

Benchmark                                 Mode  Cnt      Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  47261.798 ± 2850.445  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  34402.022 ±  356.511  ops/s

wu-sheng (Member, Author):

Following all these tests, unless we have very specific evidence showing a cross-thread visibility issue, I think we should close this PR for now.

  1. Visibility is not explicitly provided in BlockingQueue; it uses a global lock, which only allows lower traffic.
  2. Adding volatile could cost more CPU, and without evidence we should not change this.

wu-sheng closed this on Nov 13, 2019
wu-sheng added the "TBD" and "wontfix" labels on Nov 13, 2019
wu-sheng (Member, Author):

OK, an update: ArrayBlockingQueue does handle cross-thread visibility, by using the lock.

Refer to: https://stackoverflow.com/questions/12429818/does-explicit-lock-automatically-provide-memory-visibility
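A minimal sketch of why the lock is enough (illustrative only, not the JDK source): the producer's unlock happens-before the consumer's subsequent lock, so plain writes made while holding the lock become visible.

    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative: ArrayBlockingQueue guards its plain items[] array with a
    // single ReentrantLock; unlock() by the writer happens-before the reader's
    // lock(), which publishes the non-volatile array writes across threads.
    class LockVisibilitySketch {
        private final ReentrantLock lock = new ReentrantLock();
        private final Object[] items = new Object[16];  // no volatile needed

        void put(int i, Object v) {
            lock.lock();
            try {
                items[i] = v;       // write under the lock
            } finally {
                lock.unlock();      // release publishes the write
            }
        }

        Object get(int i) {
            lock.lock();            // acquire sees writes made before the last release
            try {
                return items[i];
            } finally {
                lock.unlock();
            }
        }
    }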

wu-sheng reopened this on Nov 13, 2019
wu-sheng closed this on Nov 13, 2019
wu-sheng (Member, Author):

I will submit another PR to provide a new buffer implementation for the OAP server side first. The client side needs more research.
