Performance test and improvement of 3.2.6 #1596

Closed
hepyu opened this issue Aug 28, 2018 · 16 comments

Labels
high priority High priority issue, blocking next release. question End user question and discussion. test Test requirements about performance, feature or before release.

@hepyu
Contributor

hepyu commented Aug 28, 2018

I checked the performance test report:
https://skywalkingtest.github.io/Agent-Benchmarks/README_zh.html

I noticed that it only provides the machine configuration for the skywalking-agent.
What about the machine configuration of the collector node or cluster, and of the ES machine or cluster? In my view, this is very important in a performance test.

Is there a test report covering the collector and ES in the performance test?

@wu-sheng
Member

I noticed that it only provides the machine configuration for the skywalking-agent.
What about the machine configuration of the collector node or cluster, and of the ES machine or cluster? In my view, this is very important in a performance test.

No, it is not important in this case. This is an agent test, so the backend is a mock; we don't have the infrastructure environment to test against a real one.

@wu-sheng wu-sheng self-assigned this Aug 28, 2018
@wu-sheng wu-sheng added the question End user question and discussion. label Aug 28, 2018
@wu-sheng wu-sheng added this to the 5.0.0-GA milestone Aug 28, 2018
@wu-sheng
Member

Is there a test report covering the collector and ES in the performance test?

We haven't done that yet.

@hepyu
Contributor Author

hepyu commented Aug 28, 2018

Oh, I see what you mean.

But before using it in a production environment, we have to assess the performance of the collector cluster.

ES itself is fine, because it is widely used as the de facto standard; we only need to care about how to tune its config.

Does the data transfer between the agent and the collector use a TCP long connection? (It seems to use a long connection based on Jetty.)
Then how about the efficiency between the collector and ES? What is the mechanism? I mean, if the input request rate at the collector is greater than the rate at which it can write to ES, what will happen? Is there a disk cache or something similar to handle this? Using memory for this seems inappropriate, because there is too much data.

In an internet production environment, this is very common: there are 100+ microservice instances, the total QPS is around 20K, with at least 1K QPS on a high-load instance, and it will of course increase in the future.

According to the performance report, at 1K QPS per instance, we then need to assess the collector scale.
If the following conditions are satisfied:
a. Trace data transport is based on a TCP long connection.
b. The trace packets are not large.
c. The collector has a mechanism to write trace data to ES efficiently.
Then 10K QPS per collector is quite feasible, I think.
Does that mean we need at least three collector nodes in this environment, considering both the availability of the collector cluster and the request QPS scale?

Finally, is there any advice or default config for the collector cluster scale based on the request scale above?
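
As a rough illustration of the sizing question above, the arithmetic can be sketched in Python. The function name, the `spare_nodes` parameter, and the rule of adding one extra node for availability are assumptions made for this sketch, not an official SkyWalking sizing rule; only the 20K total QPS and 10K QPS per collector figures come from the discussion.

```python
import math


def collectors_needed(total_qps, per_collector_qps, spare_nodes=1):
    """Estimate collector nodes: enough to carry the load, plus spares.

    spare_nodes is a hypothetical availability margin (assumption).
    """
    if per_collector_qps <= 0:
        raise ValueError("per_collector_qps must be positive")
    # Round up so that a partial node of load still gets a full node.
    load_nodes = math.ceil(total_qps / per_collector_qps)
    return load_nodes + spare_nodes


# 20K QPS total at 10K QPS per collector, with one spare node.
print(collectors_needed(20_000, 10_000))
```

Under these assumptions the estimate matches the "at least three collector nodes" guess above: two nodes for load and one for availability.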

@hepyu
Contributor Author

hepyu commented Aug 28, 2018

You are right, the performance test report is just for the agent, so using a mock to imitate the collector is appropriate.

But I think we can't make the mock collector equivalent to the actual collector, so the mock may hide potential risks.

The efficiency of the collector cluster is very important in a production environment.

@wu-sheng
Member

Does the data transfer between the agent and the collector use a TCP long connection? (It seems to use a long connection based on Jetty.)

I think it should be gRPC over HTTP/2. But a long connection, yes. 3.x doesn't focus on performance so much; SkyWalking 5 and 6 focus on these areas.

In SkyWalking 5 beta2, we ran a test; it looked like 10k~20k per collector if ElasticSearch is big enough.

@hepyu
Contributor Author

hepyu commented Aug 28, 2018

Oh, I got it: it uses gRPC, based on the protobuf protocol, over a long connection for the transfer. Is that right?

Could you please give an assessment of collector performance between 3.2.6 and 5 & 6?

I mean, how reliable is the 10K~20K per collector figure for version 3.2.6?
In my opinion, the gap in basic collector performance between 3.2.6 and 5 & 6 should not be large, because the mechanism is similar. Can I assume this?

@wu-sheng
Member

Oh, I got it: it uses gRPC, based on the protobuf protocol, over a long connection for the transfer. Is that right?

Yes.

Could you please give an assessment of collector performance between 3.2.6 and 5 & 6?
I mean, how reliable is the 10K~20K per collector figure for version 3.2.6?

We haven't run that before. You will have to do it yourself :)

In my opinion, the gap in basic collector performance between 3.2.6 and 5 & 6 should not be large, because the mechanism is similar. Can I assume this?

5.0.0-beta2 got a big performance upgrade through optimization since alpha. Across different versions, especially an old version (3.2.6), I really don't know.

@hepyu
Contributor Author

hepyu commented Sep 6, 2018

I have done a real performance test on version 3.2.6.

Two points:

  1. 10K~20K QPS per collector is OK.
  2. Even when the CPU %us of the business machine reaches 100%, business requests are not affected; the only effect is on SkyWalking itself, where the trace record rate drops to 50%.

I think weak references or something similar are used in the agent when collecting trace data. I think this is why full GC does not happen, only young GC.

@wu-sheng
Member

wu-sheng commented Sep 6, 2018

I think weak references or something similar are used in the agent when collecting trace data. I think this is why full GC does not happen, only young GC.

We have a ring buffer to avoid memory overload. I think your result is perfect, and it is what we expected to happen.

The core concept is to do our best not to affect the business app.
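
To illustrate the idea of a fixed-size ring buffer that protects memory, here is a minimal Python sketch. It is not the SkyWalking agent's actual implementation: the class name, the method names, and the policy of rejecting new items when the buffer is full are all assumptions chosen for illustration. It only shows how a bounded buffer keeps memory constant and sacrifices trace data, rather than the business app, under overload.

```python
class BoundedTraceBuffer:
    """Fixed-capacity ring buffer that rejects new items when full (hypothetical)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity  # memory is allocated once, up front
        self._head = 0   # index of the oldest stored item
        self._size = 0   # number of stored items
        self.dropped = 0  # count of items rejected because the buffer was full

    def offer(self, segment):
        """Store a segment; return False and count a drop if the buffer is full."""
        if self._size == len(self._slots):
            self.dropped += 1
            return False
        tail = (self._head + self._size) % len(self._slots)
        self._slots[tail] = segment
        self._size += 1
        return True

    def poll(self):
        """Remove and return the oldest segment, or None if the buffer is empty."""
        if self._size == 0:
            return None
        segment = self._slots[self._head]
        self._slots[self._head] = None  # release the reference for the GC
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return segment
```

With a design like this, a producer that outpaces the consumer loses trace records instead of growing the heap, which would be consistent with a reduced trace record rate and no full GC.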

@wu-sheng
Member

wu-sheng commented Sep 6, 2018

I hope you could share the test result in public, here or in a blog post. It would be another case showing that our project did good work.

@hepyu
Contributor Author

hepyu commented Sep 6, 2018

This is part of the performance test report. Here is the result, but the description is only in Chinese.
My English is poor, and it is hard work to translate it into English with accurate technical terms. Of course, if I have enough time later, I will translate it.

The goal of this test is to confirm that SkyWalking has little or no effect on the business. The result is yes.

test-result

@wu-sheng
Member

wu-sheng commented Sep 6, 2018

This is a very high-value test report from the community. Much appreciated!!

wu-sheng added a commit to SkyAPMTest/Agent-Benchmarks that referenced this issue Sep 6, 2018
@hepyu
Contributor Author

hepyu commented Sep 10, 2018

Actually, we must note that this test is not standard. For example, the message sizes should be designed as 100B, 1K, 10K, 100K, etc. But limited by time and resources, I can't afford that.

And why do I still do this work? Because it is necessary: we all use RPC, and the QPS will be amplified several times for the skywalking-collector. I must assess the effect on the business when QPS reaches a massive scale.

Later, if I have time, I will test what happens to the business when all skywalking-collectors crash.
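
The message-size matrix mentioned above could be laid out as in the following Python sketch. The sizes are the ones listed in this comment; the helper names, the filler payload, and the QPS levels in the usage line are hypothetical and only show how the combinations multiply.

```python
# Message sizes named in the comment above, in bytes.
SIZES = {"100B": 100, "1K": 1024, "10K": 10 * 1024, "100K": 100 * 1024}


def make_payload(size_bytes):
    """Build a filler payload of exactly size_bytes bytes (placeholder content)."""
    return b"x" * size_bytes


def build_test_matrix(qps_levels):
    """Return every (size label, size in bytes, qps) combination to run."""
    return [
        (label, size, qps)
        for label, size in SIZES.items()
        for qps in qps_levels
    ]


# Hypothetical QPS levels; substitute the rates relevant to your environment.
for label, size, qps in build_test_matrix([1_000, 10_000]):
    print(label, len(make_payload(size)), qps)
```

Even this small matrix yields eight runs, which shows why a fully standard test costs considerable time and resources.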

@wu-sheng
Member

Actually, we must note that this test is not standard. For example, the message sizes should be designed as 100B, 1K, 10K, 100K, etc. But limited by time and resources, I can't afford that.

You have done much more than others, and provided a very formal report.

Later, if I have time, I will test what happens to the business when all skywalking-collectors crash.

Let's see what happens there. #1637 is a new patch for the connection.

@hepyu
Contributor Author

hepyu commented Oct 8, 2018

This is the whole test, including the test of what happens to the business when all skywalking-collectors crash.
It is based on skywalking-3.2.6.

all

@wu-sheng wu-sheng added high priority High priority issue, blocking next release. test Test requirements about performance, feature or before release. labels Oct 8, 2018
@wu-sheng wu-sheng changed the title How about machine resources in the performance test? version3.2.6 Performance test and improvement of 3.2.6 Oct 8, 2018
@wu-sheng wu-sheng mentioned this issue Apr 28, 2019
4 tasks
@zhangzx1996

Do you have any documentation on the performance after adding gRPC log reporting?
@wu-sheng
