
Improve visibility for DataCarrier. #3726

Closed (wants to merge 4 commits)
Conversation

wu-sheng (Member) commented Oct 28, 2019:

We have received several reports that the DataCarrier queue consume threads are not very busy, which can leave CPUs underused. I added one more layer to the DataCarrier queue to avoid array element changes being invisible across threads.

This is a compatible change. I hope someone can verify it in test or production environments. It should improve OAP performance and reduce latency in processing data.
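To illustrate, here is a minimal sketch of the extra layer described above, assuming a wrapper whose volatile field carries the cross-thread visibility (names are illustrative, not necessarily the exact PR code):

    // Hypothetical sketch: each slot of the queue array holds a BufferItem
    // instead of the raw data object. The volatile field guarantees that a
    // producer's write becomes visible to consumer threads reading the slot.
    public class BufferItem {
        private volatile Object item;

        public void setItem(Object item) {
            this.item = item;
        }

        public Object getItem() {
            return item;
        }
    }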

wu-sheng added the "core feature", "agent", "backend", and "feature" labels on Oct 28, 2019
wu-sheng added this to the 6.5.0 milestone on Oct 28, 2019
wu-sheng (Member, Author):

The old thread graph is like this.

(screenshot: thread graph)

We have improved many other performance points, but this one seems to have been missed. Since it was reported and no one submitted a PR, I am giving it a try.

wu-sheng (Member, Author):

@yantaowu I think you mentioned this before; could you check this too?

wu-sheng (Member, Author):

@lkxiaolou As you have worked on optimizing DataCarrier several times, could you help recheck this too?

Review comment on the changed code:

    public void setItem(Object item) {
        this.item = item;
    }
Contributor:

Setting the item as volatile may reduce the producer's performance.

wu-sheng (Member, Author):

Yes. The key is that I want to reduce the visibility delay to the consumer thread. I want to check whether that issue really exists.

wu-sheng (Member, Author):

The point is, if the visibility issue happens, it affects performance even more: the queue will be blocked longer than required.

kezhenxu94 (Member) left a comment:

Maybe we can simply reuse java.util.concurrent.atomic.AtomicReferenceArray; it has no thread visibility issues and is well optimized as part of the JDK.
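A hedged sketch of that alternative, assuming the buffer slots were backed by an AtomicReferenceArray (class and method names here are illustrative, not the actual DataCarrier code):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Hypothetical buffer using AtomicReferenceArray: its get/set calls have
    // volatile semantics, so element writes are visible across producer and
    // consumer threads without an explicit volatile wrapper per slot.
    public class AtomicBuffer<T> {
        private final AtomicReferenceArray<T> buffer;

        public AtomicBuffer(int bufferSize) {
            this.buffer = new AtomicReferenceArray<>(bufferSize);
        }

        public boolean save(int index, T data) {
            // only occupy an empty slot; returns false if the slot is taken
            return buffer.compareAndSet(index, null, data);
        }

        public T obtain(int index) {
            // atomically read and clear the slot
            return buffer.getAndSet(index, null);
        }
    }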

wu-sheng (Member, Author):

Does it make a real difference in our case? I have looked into that before too.

hanahmily (Contributor):

I'm missing a little of the context of this PR. Do you intend to make the consumer threads observe changes to the queue with the help of volatile?
If your assumption is correct, some consumers should be busier than others, but I couldn't confirm that from the graph.

wu-sheng (Member, Author):

@hanahmily It was reported by the community that the consumers are not fully working. That is the context.

hanahmily (Contributor):

> @hanahmily It was reported by the community that the consumers are not fully working. That is the context.

What's the state of these consumers? I think that's a clue for us to find out why.

wu-sheng (Member, Author):

> What's the state of these consumers? I think that's a clue for us to find out why.

We only know from the graph that the threads are not busy; we don't know why. Since we have already reduced the number of threads, this should be a problem.

wu-sheng modified the milestones: 6.5.0, 6.6.0 on Oct 31, 2019
yantaowu (Contributor) commented Nov 4, 2019:

> @yantaowu I think you mentioned this before; could you check this too?

I have tested this PR. The issue mentioned before has been resolved, but the CPU usage is higher than with our own implementation.

dmsolr (Member) commented Nov 4, 2019:

> I have tested this PR. The issue mentioned before has been resolved, but the CPU usage is higher than with our own implementation.

Hi @yantaowu
I am interested in this. Could you share how you tested? I also wonder whether it is related to the 20-millisecond consume cycle. Try changing 20 ms to 100 ms. Just a guess :)

    public DataCarrier consume(IConsumer<T> consumer, int num) {
        return this.consume(consumer, num, 20);
    }
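For reference, a hedged usage sketch of passing a longer consume cycle through the three-argument overload above (the consumer class and constructor arguments are assumptions, not actual project code):

    // Hypothetical usage: 2 consumer threads with a 100 ms consume cycle
    // instead of the default 20 ms shown above.
    DataCarrier<SomeData> carrier = new DataCarrier<>(channelSize, bufferSize);
    carrier.consume(new SomeDataConsumer(), 2, 100);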

wu-sheng (Member, Author) commented Nov 4, 2019:

@dmsolr He is running a cherry-pick test on an old release, 6 GA. The CPU usage is only a reference. I will try to do a comparison test between this and his implementation, which uses a blocking queue.

wu-sheng (Member, Author):

@hanahmily @kezhenxu94 @dmsolr I ran a local benchmark, and YES, in blocking mode (OAP is using this mode now) the ArrayBlockingQueue shows ~30% better performance.

I wrote a fake Buffer class to run this test:

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// BufferStrategy, AtomicRangeInteger, and QueueBlockingCallback come from the
// DataCarrier module (commons-datacarrier) and are omitted here.
public class Buffer2<T> {
    // Replaces the original BufferItem[] array with an ArrayBlockingQueue.
    private final ArrayBlockingQueue<T> buffer;
    private BufferStrategy strategy;
    private AtomicRangeInteger index;
    private List<QueueBlockingCallback<T>> callbacks;

    Buffer2(int bufferSize, BufferStrategy strategy) {
        buffer = new ArrayBlockingQueue<>(bufferSize);
        this.strategy = strategy;
        index = new AtomicRangeInteger(0, bufferSize);
        callbacks = new LinkedList<QueueBlockingCallback<T>>();
    }

    void setStrategy(BufferStrategy strategy) {
        this.strategy = strategy;
    }

    void addCallback(QueueBlockingCallback<T> callback) {
        callbacks.add(callback);
    }

    boolean save(T data) {
        try {
            buffer.put(data);   // blocks when the queue is full (BLOCKING strategy)
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        return true;
    }

    public int getBufferSize() {
        return 0;
    }

    public void obtain(List<T> consumeList) {
        buffer.drainTo(consumeList);
    }

    public void obtain(List<T> consumeList, int start, int end) {
        buffer.drainTo(consumeList);
    }
}

And this is my benchmark

import java.util.ArrayList;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ChannelWritePerformanceTest {
    private static Buffer buffer = new Buffer(2000, BufferStrategy.BLOCKING);
    private static Buffer2 buffer2 = new Buffer2(2000, BufferStrategy.BLOCKING);

    @Benchmark
    public void testBuffer1() {
        // Fill 200 of the 2000 slots, then drain, so the queue never becomes full.
        for (int i = 0; i < 200; i++) {
            buffer.save(new Object());
        }
        buffer.obtain(new ArrayList<>(2000));
    }

    @Benchmark
    public void testBuffer2() {
        for (int i = 0; i < 200; i++) {
            buffer2.save(new Object());
        }
        buffer2.obtain(new ArrayList<>(2000));
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
            .include(ChannelWritePerformanceTest.class.getSimpleName())
            .forks(1)
            .warmupIterations(1)
            .measurementIterations(5)
            .build();

        new Runner(opt).run();
    }
}

The following is the test report:

# JMH version: 1.21
# VM version: JDK 1.8.0_91, Java HotSpot(TM) 64-Bit Server VM, 25.91-b14
# VM invoker: /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=54052:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer1

# Run progress: 0.00% complete, ETA 00:03:40
# Fork: 1 of 1
objc[74650]: Class JavaLaunchHelper is implemented in both /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java (0x109a6e4c0) and /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x109ae84e0). One of the two will be used. Which one is undefined.
# Warmup Iteration   1: 125181.486 ops/s
Iteration   1: 123915.279 ops/s
Iteration   2: 113945.502 ops/s
Iteration   3: 113637.598 ops/s
Iteration   4: 110179.474 ops/s
Iteration   5: 110066.439 ops/s
Iteration   6: 114997.944 ops/s
Iteration   7: 113754.850 ops/s
Iteration   8: 116012.443 ops/s
Iteration   9: 112262.401 ops/s
Iteration  10: 109717.243 ops/s


Result "org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer1":
  113848.917 ±(99.9%) 6269.203 ops/s [Average]
  (min, avg, max) = (109717.243, 113848.917, 123915.279), stdev = 4146.689
  CI (99.9%): [107579.715, 120118.120] (assumes normal distribution)


# JMH version: 1.21
# VM version: JDK 1.8.0_91, Java HotSpot(TM) 64-Bit Server VM, 25.91-b14
# VM invoker: /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Applications/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=54052:/Applications/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer2

# Run progress: 50.00% complete, ETA 00:01:51
# Fork: 1 of 1
objc[74661]: Class JavaLaunchHelper is implemented in both /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/bin/java (0x10baca4c0) and /Users/wusheng/Documents/applications/jdk1.8.0_91.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x10bb444e0). One of the two will be used. Which one is undefined.
# Warmup Iteration   1: 185663.422 ops/s
Iteration   1: 167183.740 ops/s
Iteration   2: 150882.897 ops/s
Iteration   3: 150184.345 ops/s
Iteration   4: 151788.304 ops/s
Iteration   5: 150627.348 ops/s
Iteration   6: 152071.145 ops/s
Iteration   7: 153232.832 ops/s
Iteration   8: 150855.043 ops/s
Iteration   9: 149277.780 ops/s
Iteration  10: 148825.612 ops/s


Result "org.apache.skywalking.apm.commons.datacarrier.buffer.ChannelWritePerformanceTest.testBuffer2":
  152492.905 ±(99.9%) 8046.102 ops/s [Average]
  (min, avg, max) = (148825.612, 152492.905, 167183.740), stdev = 5321.998
  CI (99.9%): [144446.802, 160539.007] (assumes normal distribution)


# Run complete. Total time: 00:03:41

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                 Mode  Cnt       Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  113848.917 ± 6269.203  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  152492.905 ± 8046.102  ops/s

Process finished with exit code 0

Do you have any suggestions? Do we need to update this, such as using a BlockingQueue in blocking mode?

wu-sheng (Member, Author):

When I changed the code to insert more data, something interesting happened.
I inserted 1200 records into a 2000-size queue for both:

for (int i = 0; i < 1200; i++) {
    buffer1/buffer2.save(new Object());   // same loop body, applied to both buffers
}

Benchmark                                 Mode  Cnt      Score     Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  30516.424 ± 745.733  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  33467.919 ± 384.183  ops/s

wu-sheng (Member, Author):

As @yantaowu told me offline, they are using Queue#offer, accepting the tradeoff that the queue may be full and data may be lost. The result is not very different, since we don't make the queue full in our tests. testBuffer2 uses the #offer method in this test.

Benchmark                                 Mode  Cnt      Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  29020.969 ±  613.798  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  32102.383 ± 1715.402  ops/s
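For reference, a hedged sketch of what that offer-based save might look like (an assumption about their implementation, not actual code from this PR):

    // Hypothetical non-blocking save: offer() returns false immediately when the
    // queue is full, trading data loss for never blocking the producer thread.
    boolean save(T data) {
        return buffer.offer(data);
    }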

wu-sheng (Member, Author):

I dug deeper into the ArrayBlockingQueue source code and noticed the user feedback is misleading: ArrayBlockingQueue never deals with cross-thread visibility explicitly. Everyone here can check the source code.

So after I removed the volatile but kept the BufferItem, the BlockingQueue has no advantage.

Benchmark                                 Mode  Cnt      Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  45184.551 ±  425.330  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  33548.799 ± 1793.056  ops/s

wu-sheng (Member, Author):

Furthermore, using the master code of DataCarrier, we get much higher throughput. The following result uses the master code as testBuffer1.

Benchmark                                 Mode  Cnt      Score      Error  Units
ChannelWritePerformanceTest.testBuffer1  thrpt   10  47261.798 ± 2850.445  ops/s
ChannelWritePerformanceTest.testBuffer2  thrpt   10  34402.022 ±  356.511  ops/s

wu-sheng (Member, Author):

Following all these tests, unless we have very specific evidence showing a cross-thread visibility issue, I think we should close this PR for now.

  1. Visibility is not explicitly provided in BlockingQueue; it uses a global lock, which only allows lower traffic.
  2. Adding volatile could cost more CPU, and without evidence we should not change this.

wu-sheng closed this on Nov 13, 2019
wu-sheng added the "TBD" and "wontfix" labels on Nov 13, 2019
wu-sheng (Member, Author):

OK, an update: ArrayBlockingQueue does handle cross-thread visibility, by using the lock.

Refer to: https://stackoverflow.com/questions/12429818/does-explicit-lock-automatically-provide-memory-visibility
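A minimal sketch of why the lock is enough (illustrative only, not the JDK source): the producer's unlock happens-before the consumer's subsequent lock, so plain writes made while holding the lock become visible.

    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative: ArrayBlockingQueue guards its plain items[] array with a
    // single ReentrantLock; unlock() by the writer happens-before the reader's
    // lock(), which publishes the non-volatile array writes across threads.
    class LockVisibilitySketch {
        private final ReentrantLock lock = new ReentrantLock();
        private final Object[] items = new Object[16];  // no volatile needed

        void put(int i, Object v) {
            lock.lock();
            try {
                items[i] = v;       // write under the lock
            } finally {
                lock.unlock();      // release publishes the write
            }
        }

        Object get(int i) {
            lock.lock();            // acquire sees writes made before the last release
            try {
                return items[i];
            } finally {
                lock.unlock();
            }
        }
    }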

wu-sheng reopened this on Nov 13, 2019
wu-sheng closed this on Nov 13, 2019
wu-sheng (Member, Author):

I will submit another PR to provide a new buffer implementation for the OAP server side first. The client side needs more research.
