Performance test and improvement of 3.2.6 #1596

hepyu · 2018-08-28T05:55:09Z

I read the performance test report:
https://skywalkingtest.github.io/Agent-Benchmarks/README_zh.html

I noticed that it only provides the machine configuration for the skywalking-agent.
What about the machine configuration of the collector node or cluster, and of the ES machine or cluster? In my view, this is very important in a performance test.

Is there any test report covering the collector and ES in the performance test?

wu-sheng · 2018-08-28T06:06:26Z

I noticed that it only provides the machine configuration for the skywalking-agent.
What about the machine configuration of the collector node or cluster, and of the ES machine or cluster? In my view, this is very important in a performance test.

No, it is not important in this case. This is an agent test, so the backend is mocked; we don't have the infrastructure to test against a real backend.

wu-sheng · 2018-08-28T06:06:50Z

Is there any test report covering the collector and ES in the performance test?

We haven't done that yet.

hepyu · 2018-08-28T06:58:36Z

Oh, I see what you mean.

But before using it in a production environment, we have to assess the performance of the collector cluster.

ES itself is fine because it is widely used as the de facto standard; we only need to care about tuning its configuration.

Does the data transfer between the agent and the collector use a long-lived TCP connection? (It seems to use a long-lived connection based on Jetty.)
Then what about the efficiency between the collector and ES? What is the mechanism? I mean, if the rate of incoming requests at the collector is greater than the rate at which data is written to ES, what happens? Is there a disk cache or something similar? Using memory for this does not seem appropriate, because there is too much data.

In an internet production environment this is very common: there are 100+ microservice instances, the total load is around 20K qps, a high-load instance handles at least 1K qps, and of course this will grow in the future.

According to the performance report, each instance handles 1K qps, so we need to assess the collector scale.
If the following conditions are satisfied:
a. Trace data transport is based on a long-lived TCP connection.
b. The trace packets are not large.
c. The collector has a mechanism to write trace data to ES efficiently.
Then 10K qps per collector should be fine, I think.
Does that mean we need at least three collector nodes in this environment, considering both the availability of the collector cluster and the request qps?

Finally, is there any advice or default configuration for sizing the collector cluster at the request scale above?
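
The sizing question above can be written out as simple arithmetic. This is only a rough sketch: the function name `collectors_needed` and the `spare_nodes` parameter are hypothetical, and the figures (about 20K qps in total, 10K qps per collector, one extra node for availability) are the assumptions stated in this comment, not measured SkyWalking limits.

```python
def collectors_needed(total_qps: int, per_collector_qps: int, spare_nodes: int = 1) -> int:
    """Estimate collector nodes: enough to carry the load, plus spares for availability."""
    if total_qps < 0 or per_collector_qps <= 0 or spare_nodes < 0:
        raise ValueError("qps values must be positive and spare_nodes non-negative")
    # Integer ceiling division: nodes required to carry the load alone.
    base_nodes = (total_qps + per_collector_qps - 1) // per_collector_qps
    return base_nodes + spare_nodes


# Assumed figures from this thread: ~20K qps total, 10K qps per collector.
print(collectors_needed(20_000, 10_000))  # 3
```

With these assumptions, two nodes carry the load and the third covers the loss of one node; a different per-collector throughput changes the answer accordingly.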

hepyu · 2018-08-28T07:03:51Z

You are right, the performance test report is only for the agent, so using a mock to imitate the collector is appropriate.

But I think we can't make the mock collector equivalent to the actual collector, and that seems likely to hide potential risks.

The efficiency of the collector cluster is very important in a production environment.

wu-sheng · 2018-08-28T07:11:14Z

Does the data transfer between the agent and the collector use a long-lived TCP connection? (It seems to use a long-lived connection based on Jetty.)

I think it should be gRPC over HTTP/2. But it is a long-lived connection, yes. 3.x doesn't focus much on performance; SkyWalking 5 and 6 focus on these areas.

In SkyWalking 5 beta2, we ran a test, and it looked like 10k~20k per collector if Elasticsearch is big enough.

hepyu · 2018-08-28T07:32:44Z

Oh, I got it: it uses gRPC, based on the protobuf protocol, over a long-lived connection for the transfer. Is that right?

Could you please give an assessment of collector performance in 3.2.6 compared with 5 and 6?

I mean, how reliable is the figure of 10K~20K per collector for version 3.2.6?
In my opinion, the gap in basic collector performance between 3.2.6 and 5/6 should not be large, because the mechanism is similar. Can I assume that?

wu-sheng · 2018-08-28T07:35:05Z

Oh, I got it: it uses gRPC, based on the protobuf protocol, over a long-lived connection for the transfer. Is that right?

Yes.

Could you please give an assessment of collector performance in 3.2.6 compared with 5 and 6?
I mean, how reliable is the figure of 10K~20K per collector for version 3.2.6?

We haven't run that before. You will have to do it yourself :)

In my opinion, the gap in basic collector performance between 3.2.6 and 5/6 should not be large, because the mechanism is similar. Can I assume that?

5.0.0-beta2 got a big performance upgrade through optimization since alpha. For a different version, especially an old one (3.2.6), I really don't know.

hepyu · 2018-09-06T02:18:44Z

I have done a real performance test on version 3.2.6.

Two points:

10K~20K qps per collector is OK.
Even when the CPU %us of the business machine reaches 100%, business requests are not affected; the only impact is on SkyWalking itself, whose trace record rate drops to 50%.

I think weak references or something similar are used in the agent when collecting trace data. I think that is why no full GC happens, only young GC.

wu-sheng · 2018-09-06T02:24:08Z

I think weak references or something similar are used in the agent when collecting trace data. I think that is why no full GC happens, only young GC.

We have a ring buffer to avoid memory overload. I think your result is perfect, and it is what we expect to happen.

The core concept is to do our best not to affect the business application.
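
To illustrate the idea of a ring buffer that bounds memory use, here is a minimal sketch. It is not the SkyWalking agent's actual implementation: the class name `BoundedTraceBuffer`, the `offer`/`poll` methods, and the policy of rejecting new items when the buffer is full are all assumptions made for illustration, and the sketch is single-threaded.

```python
from typing import Any, List, Optional


class BoundedTraceBuffer:
    """Fixed-capacity ring buffer; new items are rejected when it is full (assumed policy)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots: List[Optional[Any]] = [None] * capacity
        self._head = 0      # index of the oldest stored item
        self._size = 0      # number of items currently stored
        self.dropped = 0    # count of rejected items

    def offer(self, segment: Any) -> bool:
        """Store a segment if there is room; otherwise count it as dropped."""
        if self._size == len(self._slots):
            self.dropped += 1
            return False
        tail = (self._head + self._size) % len(self._slots)
        self._slots[tail] = segment
        self._size += 1
        return True

    def poll(self) -> Optional[Any]:
        """Remove and return the oldest segment, or None if the buffer is empty."""
        if self._size == 0:
            return None
        segment = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return segment


buffer = BoundedTraceBuffer(capacity=2)
accepted = [buffer.offer(f"segment-{i}") for i in range(4)]
print(accepted, buffer.dropped)  # [True, True, False, False] 2
print(buffer.poll())             # segment-0
```

Because memory is allocated once and never grows, a slow consumer costs dropped trace data rather than heap pressure on the business application, which would be consistent with the reduced trace record rate reported above.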

wu-sheng · 2018-09-06T02:26:18Z

I hope you can share the test result publicly, here or in a blog post. It would be another case showing that our project did good work.

hepyu · 2018-09-06T02:45:45Z

This is part of the performance test report. Here is the result, but the description is only in Chinese.
My English is poor, and it is hard work to translate it into English with accurate technical terms. Of course, if I have enough time later, I will translate it into English.

The goal of this test is to confirm that SkyWalking has little or no effect on the business. The result is yes.

wu-sheng · 2018-09-06T12:08:46Z

This is a very high-value test report from the community. Appreciate!!

Link from apache/skywalking#1596 (comment)

hepyu · 2018-09-10T02:43:57Z

Actually, we must be aware that this test is not standard. For example, the message size should be designed as 100B, 1K, 10K, 100K, etc. But limited by time and resources, I can't afford that.

And why do I still do this work? Because it is necessary: we all use RPC, and the qps seen by the skywalking-collector will be several times larger. I must assess the effect on the business when qps reaches a massive scale.

Later, if I have time, I plan to test what happens to the business when all skywalking-collectors crash.
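
A more standard test could enumerate message sizes against load levels, roughly as sketched below. Everything here is illustrative: the names `MESSAGE_SIZES`, `QPS_LEVELS`, `make_payload`, and `build_test_matrix` are hypothetical, "1K" is assumed to mean 1024 bytes, and the qps levels are simply the figures mentioned earlier in this thread.

```python
from typing import Dict, List, Tuple

# Message sizes named above; K is assumed to mean 1024 bytes.
MESSAGE_SIZES: Dict[str, int] = {
    "100B": 100,
    "1K": 1024,
    "10K": 10 * 1024,
    "100K": 100 * 1024,
}

# Load levels taken from figures discussed in this thread (assumed for illustration).
QPS_LEVELS: List[int] = [1_000, 10_000, 20_000]


def make_payload(size_bytes: int) -> bytes:
    """Build a dummy payload of exactly size_bytes bytes."""
    return b"x" * size_bytes


def build_test_matrix(sizes: Dict[str, int], qps_levels: List[int]) -> List[Tuple[str, int, int]]:
    """Return every (size label, size in bytes, qps) combination to run."""
    return [(label, size, qps) for label, size in sizes.items() for qps in qps_levels]


matrix = build_test_matrix(MESSAGE_SIZES, QPS_LEVELS)
print(len(matrix))                              # 12
print(len(make_payload(MESSAGE_SIZES["10K"])))  # 10240
```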

wu-sheng · 2018-09-10T02:55:37Z

Actually, we must be aware that this test is not standard. For example, the message size should be designed as 100B, 1K, 10K, 100K, etc. But limited by time and resources, I can't afford that.

You have done much more than others, and provided a very formal report.

Later, if I have time, I plan to test what happens to the business when all skywalking-collectors crash.

Let's see what happens there. #1637 is a new patch for the connection.

hepyu · 2018-10-08T12:52:52Z

This is the whole test, including the test of what happens to the business when all skywalking-collectors crash.
It is based on skywalking-3.2.6.

zhangzx1996 · 2024-09-06T09:15:29Z

Do you have any documentation on the performance after adding gRPC log reporting?
@wu-sheng

wu-sheng closed this as completed Aug 28, 2018

wu-sheng self-assigned this Aug 28, 2018

wu-sheng added the question End user question and discussion. label Aug 28, 2018

wu-sheng added this to the 5.0.0-GA milestone Aug 28, 2018

wu-sheng added a commit to SkyAPMTest/Agent-Benchmarks that referenced this issue Sep 6, 2018

Update README_zh.md

821a70a

Link from apache/skywalking#1596 (comment)

wu-sheng added high priority High priority issue, blocking next release. test Test requirements about performance, feature or before release. labels Oct 8, 2018

wu-sheng changed the title ~~How about machine resources in the performance test? version3.2.6~~ Performance test and improvement of 3.2.6 Oct 8, 2018

wu-sheng mentioned this issue Apr 28, 2019

Performance test report #2549

Closed

4 tasks
