Enhance gRPC plugin #4177

Merged
merged 43 commits into from Jan 30, 2020

Contributor

@devkanro devkanro commented Jan 5, 2020

SkyWalking with enhanced gRPC

This PR enhances the original gRPC plugin of SkyWalking.
I opened the PR early as a request for comments. @GuoDuanLZ and I will take feedback and improve this PR.

Checklist

This PR is still in WIP state; please merge it only after all checks are done.

  • Implementation
  • Test
  • Documents
  • CI build pass
  • Review

Description

The original gRPC plugin provided only very basic tracing; exceptions thrown by business code were not traced.

We refactored this plugin to provide the following:

  1. Server and client tracing use the same operations.
  2. Provide the internal or external gRPC server tracing mode.
  3. Streamline the span timeline.
  4. An error caused by business code makes the "/Response/onClose" span fail.
  5. Reduce the number of interception points and optimize the code.

Operations

The operation names this plugin creates look like this:

       service  traceSide             event
        ----┘      -----┘         --------┘
foo.bar.MyApi.echo/server/Request/onMessage
------┐       ---┐        ------┐
package      method        side

There are two sides of tracing for gRPC: client-side and server-side.
The trace side gives a view of the gRPC request, that is, what the request looks like from the client or from the server.

There are five combinations of sides and events.

/Request/onMessage
# Request message received on the server or sent on the client.

/Request/onComplete
# Client has finished sending requests; no more request messages will arrive.

/Request/onCancel
# Client has canceled the call.

/Response/onMessage
# Response message received on client or sent on server.

/Response/onClose
# Call closed with status and trailers.

To keep traces simple, this plugin creates onMessage spans only for streaming calls.

  • For unary calls, no onMessage span is created.
  • For client-streaming calls, only the /Request/onMessage span is created.
  • For server-streaming calls, only the /Response/onMessage span is created.
  • For bi-streaming calls, both onMessage spans are created.
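
The rule in the list above can be sketched as a small lookup. This is only an illustration, not the plugin's actual code: `OnMessageSpanPolicy`, `CallType`, `Side`, and `onMessageSides` are hypothetical names invented here.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the plugin itself. */
final class OnMessageSpanPolicy {
    private OnMessageSpanPolicy() { }

    /** The four gRPC call shapes named in the list above. */
    enum CallType { UNARY, CLIENT_STREAMING, SERVER_STREAMING, BIDI_STREAMING }

    /** The two message directions that can carry an onMessage span. */
    enum Side { REQUEST, RESPONSE }

    /** Returns the sides for which an onMessage span would be created. */
    static Set<Side> onMessageSides(CallType type) {
        switch (type) {
            case CLIENT_STREAMING:
                return EnumSet.of(Side.REQUEST);
            case SERVER_STREAMING:
                return EnumSet.of(Side.RESPONSE);
            case BIDI_STREAMING:
                return EnumSet.allOf(Side.class);
            case UNARY:
            default:
                // Unary calls get no onMessage span at all.
                return EnumSet.noneOf(Side.class);
        }
    }
}
```

Under this sketch, `onMessageSides(CallType.UNARY)` is empty, while a bi-streaming call yields both `REQUEST` and `RESPONSE`.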

For example:

foo.bar.MyApi.echo/server/Request/onMessage
# Server-side tracing: the server receives a request message from the client.

foo.bar.MyApi.echo/client/Request/onMessage
# Client-side tracing: the client sends a request message to the server.

foo.bar.MyApi.echo/server/Response/onMessage
# Server-side tracing: the server sends a response message to the client.

foo.bar.MyApi.echo/server/Response/onClose
# Server-side tracing: the server has completed all handling for a request.
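
As a rough illustration of how the parts of these names fit together, the sketch below assembles an operation name from its pieces. The class and method names (`OperationNames`, `of`) are hypothetical and are not taken from the plugin; only the resulting string format follows the examples above.

```java
/** Hypothetical helper; not part of the plugin. */
final class OperationNames {
    private OperationNames() { }

    /**
     * Joins the pieces in the order shown in the diagram above.
     *
     * @param service   package-qualified service name, e.g. "foo.bar.MyApi"
     * @param method    gRPC method name, e.g. "echo"
     * @param traceSide "client" or "server"
     * @param side      "Request" or "Response"
     * @param event     e.g. "onMessage", "onComplete", "onCancel", "onClose"
     */
    static String of(String service, String method, String traceSide, String side, String event) {
        return service + "." + method + "/" + traceSide + "/" + side + "/" + event;
    }
}
```

For instance, `OperationNames.of("foo.bar.MyApi", "echo", "server", "Request", "onMessage")` produces `foo.bar.MyApi.echo/server/Request/onMessage`.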

Internal and external gRPC server tracing

Suppose you have two gRPC servers, Server1 and Server2, and Server1 calls the Server2.test method in your business code. If Server2 is an internal server, we only need server-side tracing: Server2 provides the server-side view, which is more useful for tracing a call than the client-side view.

If Server2 is an external gRPC server, you cannot get the server-side view from it, so you need Server1's client-side view of the call.

Configure plugin

This plugin provides three configs for configuring internal and external gRPC servers.

/**
 * If this config is false, client spans are collected only for peers listed in
 * INCLUDED_CLIENT_TRACING_PEERS. In this mode EXCLUDED_CLIENT_TRACING_PEERS overrides
 * INCLUDED_CLIENT_TRACING_PEERS.
 * <p>
 * If this config is true, client spans are collected for every peer except those listed in
 * EXCLUDED_CLIENT_TRACING_PEERS. In this mode INCLUDED_CLIENT_TRACING_PEERS overrides
 * EXCLUDED_CLIENT_TRACING_PEERS.
 */
public static boolean DEFAULT_CLIENT_TRACING_ENABLE = false;

/**
 * Included client tracing peers. The gRPC plugin collects client spans for the peers configured here.
 */
public static List<String> INCLUDED_CLIENT_TRACING_PEERS = new LinkedList<>();

/**
 * Excluded client tracing peers. The gRPC plugin does not collect client spans for the peers configured here.
 */
public static List<String> EXCLUDED_CLIENT_TRACING_PEERS = new LinkedList<>();

By default, no gRPC server is treated as external, so no client-side spans are created. If you need the client-side view for third-party APIs, add the host and port, such as services.googleapis.com:11800, to INCLUDED_CLIENT_TRACING_PEERS.
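
Read literally, the Javadoc above describes a simple decision rule per peer. The sketch below is one interpretation of those comments, not the plugin's implementation: `ClientTracingFilter` and `shouldTrace` are hypothetical names, and exact string matching on `host:port` is an assumption.

```java
import java.util.List;

/** Hypothetical sketch of the include/exclude rule; not the plugin's code. */
final class ClientTracingFilter {
    private ClientTracingFilter() { }

    /**
     * @param peer           "host:port" of the remote gRPC server (exact match assumed)
     * @param defaultEnabled value of DEFAULT_CLIENT_TRACING_ENABLE
     * @param included       value of INCLUDED_CLIENT_TRACING_PEERS
     * @param excluded       value of EXCLUDED_CLIENT_TRACING_PEERS
     * @return true if client-side spans would be collected for this peer
     */
    static boolean shouldTrace(String peer, boolean defaultEnabled,
                               List<String> included, List<String> excluded) {
        if (defaultEnabled) {
            // Trace everything except excluded peers; an include entry wins over an exclude entry.
            return !excluded.contains(peer) || included.contains(peer);
        }
        // Trace only included peers; an exclude entry wins over an include entry.
        return included.contains(peer) && !excluded.contains(peer);
    }
}
```

With the default settings (flag false, both lists empty) this returns false for every peer, which matches the statement that no client-side spans are created by default.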

Screenshots

WIP

@codecov-io

codecov-io commented Jan 5, 2020

Codecov Report

Merging #4177 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4177      +/-   ##
==========================================
+ Coverage   26.36%   26.37%   +<.01%     
==========================================
  Files        1179     1179              
  Lines       25863    25863              
  Branches     3753     3753              
==========================================
+ Hits         6819     6821       +2     
+ Misses      18440    18438       -2     
  Partials      604      604

Impacted Files                                           Coverage Δ
...pm/agent/core/profile/ProfileTaskQueryService.java    48.71% <0%> (+5.12%) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a1d453...aef265d. Read the comment docs.

@wayilau added the plugin and enhancement labels Jan 6, 2020
@wayilau added this to the 7.0.0 milestone Jan 6, 2020
Member

@wayilau wayilau left a comment

If you enhance the plugin, I think you should adjust the gRPC test cases and let CI check them first.

@devkanro
Contributor Author

devkanro commented Jan 6, 2020

If you enhance the plugin, I think you should adjust the gRPC test cases and let CI check them first.

Yes, it is already in our checklist, and @GuoDuanLZ will help us with the tests.

@dmsolr
Member

dmsolr commented Jan 7, 2020

This is a very good pull request template for the plugin contribution. 👍

@wu-sheng
Member

wu-sheng commented Jan 7, 2020

This is a very good pull request template for the plugin contribution.

Does anyone want to discuss new issue and pull request templates?

@wu-sheng
Member

wu-sheng commented Jan 8, 2020

Provide the internal or external gRPC server tracing mode.

What is this? Could you explain a little more?

@devkanro
Contributor Author

devkanro commented Jan 8, 2020

What is this? Could you explain a little more?

I will update the PR description later; I am too busy to update the PR on working days.

GuoDuanLZ and others added 4 commits January 12, 2020 01:10
* Add grpc on error test

* Fixed bugs

* Add license and fix bugs

* Fixed bugs

* Fix bugs

* Override expect data

* Update expectedData.yaml

Co-authored-by: Kanro <higan@live.cn>
@devkanro
Contributor Author

@wu-sheng @wayilau @kezhenxu94
This PR is ready for reviews.

@wu-sheng
Member

Internal and external gRPC server tracing

This section is still unclear to me. Client-side and server-side tracing are both required in nearly every RPC plugin. What makes gRPC different? Setting this manually is very painful for the end user.

cc @kezhenxu94 Could you get the point of this?

@devkanro
Contributor Author

@wu-sheng
If you trace both the client and server sides, you get a trace like the one below for a bi-streaming call:

+
|
+-+(Server1) Exit: foo.bar.MyApi.echo
| |
| +-+(Server2) Entry: foo.bar.MyApi.echo
|   |
|   +-+(Server2) Local: foo.bar.MyApi.echo/server/Request/onMessage
|   | |
|   | +-+(Server2) Local: foo.bar.MyApi.echo/server/Response/onMessage
|   |
|   +-+(Server2) Local: foo.bar.MyApi.echo/server/Request/onMessage
|   | |
|   | +-+(Server2) Local: foo.bar.MyApi.echo/server/Response/onMessage
|   |
|   +-+(Server2) Local: foo.bar.MyApi.echo/server/Request/onComplete
|     |
|     +-+(Server2) Local: foo.bar.MyApi.echo/server/Response/onClose
|
+-+(Server1) Local: foo.bar.MyApi.echo/client/Request/onMessage
|
+-+(Server1) Local: foo.bar.MyApi.echo/client/Response/onMessage
|
+-+(Server1) Local: foo.bar.MyApi.echo/client/Request/onMessage
|
+-+(Server1) Local: foo.bar.MyApi.echo/client/Response/onMessage
|
+-+(Server1) Local: foo.bar.MyApi.echo/client/Request/onComplete
|
+-+(Server1) Local: foo.bar.MyApi.echo/client/Response/onClose

A call is traced both in Server1 (client side) and in Server2 (server side). The two views are redundant, and the server side is more accurate than the client side because of how the gRPC framework works. On the server side you can distinguish an active server message (its parent is the entry span) from a passive message (its parent is a request message span).

In the internal server case, we don't need client-side tracing, but in the external server case, client-side tracing is the only way to trace the gRPC call.

@devkanro
Contributor Author

devkanro commented Jan 13, 2020

In the default config, all gRPC calls are internal calls. No client-side tracing will be created.

I think the default config is enough for most cases (most external APIs are not gRPC). But if you need client-side tracing for specific peers, just add them to INCLUDED_CLIENT_TRACING_PEERS.

@wu-sheng
Member

If there is no client-side span, there are no client-side metrics in the topology. Then you would not be able to detect unreachable-service errors or unstable network performance from the trace and response time pages. What you posted is exactly what an APM should collect.

Back to my point, this is a basic design of SkyWalking. We should not argue about it within a single plugin. If you want to discuss it, the scope is larger than this PR: you would need to change the design and protocol of the project.

If you want to change this, I prefer that you keep that change private and push only the both-sides tracing upstream.

@devkanro
Contributor Author

I agree with you; unreachable-service errors and unstable network performance are also important for a service.

I can simplify client-side tracing for this, like no onMessage event, only Complete/Close event for internal client-side tracing.

@wu-sheng
Member

I can simplify client-side tracing for this, like no onMessage event, only Complete/Close event for internal client-side tracing.

I think in most cases both of them should exist, including onMessage/Complete/Close. Users could perform further operations, such as DB or cache access, in onMessage just like in any other callback.

@devkanro
Contributor Author

It seems the test case passes with client tracing.

@devkanro
Contributor Author

devkanro commented Jan 28, 2020

Some discussion not about this PR.

I found that the OAP server has poor throughput when collecting traces.
I have tried to use SkyWalking in my production env, but there are too many segments to collect.
Even after I scaled the OAP server to 8 instances of 8c16g, there were still many collection errors (gRPC client canceled).
CPU usage looks fine on both the OAP server and the ES server.

Maybe separating collectors from the OAP server is a good idea?
Agents would not call the OAP server directly to report spans; they would use a message queue or log collector instead, and collectors would consume those messages or logs. One collector might be 1c2g/2c4g, so we could run many collectors at low cost.

If you are interested in this, maybe we could open an issue to discuss it.

@wu-sheng
Member

You could use another issue or the mailing list to discuss this. But to be honest, I can't see much difference. Many discussions have already taken place about this.

parentSpanId: 2, spanId: 3, spanLayer: RPCFramework, startTime: nq 0,
endTime: nq 0, componentId: 23, componentName: '', isError: false,
spanType: Local, peer: '', peerId: 0}
- {operationName: GreeterBlocking.sayHello/client/Response/onClose, operationId: 0,
Member

Why do these two use different formats? I know they may be right but it seems strange.

Contributor

Why do these two use different formats? I know they may be right but it seems strange.

When the span is shorter, it is a simplified format.

Member

Could you follow the same pattern? I prefer we keep this consistent across all test cases.

Contributor

Could you follow the same pattern? I prefer we keep this consistent across all test cases.

Yes, of course.

spanType: Exit
peer: '127.0.0.1:18080'
peerId: 0
- {operationName: GreeterBlockingError.sayHello/client/Request/onComplete, operationId: 0,
Member

This one too.

Contributor

This one too.

Same as above.

Member

@wu-sheng wu-sheng left a comment

LGTM. @kezhenxu94 @arugal @dmsolr Please recheck.

@wu-sheng
Member

@kezhenxu94 Could you recheck this soon?

@wu-sheng wu-sheng merged commit 9795080 into apache:master Jan 30, 2020
Labels: agent, enhancement, plugin

7 participants