show process list and kill #12635

Closed
wants to merge 1 commit into from

@yangxuanjia yangxuanjia commented Jan 20, 2021

Add show process list and kill commands.
showprocesslist shows the put/get/delete/txn operation connections and the stream connections.
kill terminates a connection listed by showprocesslist, identified by its ID.

./bin/etcdctl --endpoints=127.0.0.1:2379 -w=table showprocesslist
+----------------+---------------------+-----------+-------------------------------------------+------------+---------------------+
|    ENDPOINT    |         ID          | SOURCEIP  |                FULLMETHOD                 | REQUESTSTR |      STARTTIME      |
+----------------+---------------------+-----------+-------------------------------------------+------------+---------------------+
| 127.0.0.1:2379 | 1610523442300796988 | 127.0.0.1 |                    /etcdserverpb.KV/Range |  key:"a1"  | 2021-01-13 15:37:22 |
| 127.0.0.1:2379 | 1610523444189009930 | 127.0.0.1 |                    /etcdserverpb.KV/Range |  key:"a2"  | 2021-01-13 15:37:24 |
| 127.0.0.1:2379 | 1610523450890053513 | 127.0.0.1 | /etcdserverpb.Maintenance/ShowProcessList |            | 2021-01-13 15:37:30 |
+----------------+---------------------+-----------+-------------------------------------------+------------+---------------------+

./bin/etcdctl --endpoints=127.0.0.1:2379 -w=table showprocesslist --stream=true
+----------------+---------------------+-----------+---------------------------+------------+---------------------+
|    ENDPOINT    |         ID          | SOURCEIP  |        FULLMETHOD         | REQUESTSTR |      STARTTIME      |
+----------------+---------------------+-----------+---------------------------+------------+---------------------+
| 127.0.0.1:2379 | 1611123776378030471 | 127.0.0.1 | /etcdserverpb.Watch/Watch |            | 2021-01-20 14:22:56 |
+----------------+---------------------+-----------+---------------------------+------------+---------------------+

./bin/etcdctl --endpoints=127.0.0.1:2379 kill 1610523442300796988
./bin/etcdctl --endpoints=127.0.0.1:2379 kill --stream=true 1611123776378030471

Screenshot from 2021-01-25 15-37-23
etcd exposes its request methods over gRPC. Short-connection requests cover the basic operations: adding, modifying, deleting, and querying key-value pairs. Long-connection requests are streamed and include watch subscriptions, lease keep-alives, and global locks.

At present, these connection requests cannot be inspected. When concurrency is high, or when internal lock contention degrades performance or increases request latency (especially with concurrent transactions, where read-write and write-write conflicts add delay), we cannot see what is happening inside etcd: which requests are reducing performance or raising latency, or, when memory suddenly spikes, which operations the business is performing. This has made operating etcd in production very painful. In particular, when the business asks why its operations have slowed down or why latency has gone up, we do not know and have no way to answer.

The first goal is to view all in-flight connection requests inside etcd, both short and long connections. For each one the output includes the source IP of the client, the method being executed with its request parameters, and the start time. Results are sorted by start time, so we can see which connections have been running longest.

The second goal is to provide a way to kill a connection inside etcd, as an operational tool. Once we have identified which connection is causing the performance degradation or latency increase, or which connection has held a lock for a long time without releasing it so that other connections cannot acquire the lock and proceed, we can use the kill command to forcibly terminate it and restore normal operation of etcd.

The design consists of three parts: the interceptor module (a short-connection interceptor and a stream long-connection interceptor), the view-connection-requests module, and the cancel-connection module.

1. Interceptor module
The interceptor intercepts all connection requests from clients, both short connections and stream long connections. The short-connection interceptor captures the client's basic operations: adding, modifying, deleting, and querying key-value pairs, and transactions. The long-connection interceptor captures the client's streaming requests: watch subscriptions, lease keep-alives, and global locks.

The interceptor module defines two global variables, both of a concurrent map type, which store short and long connections respectively. Because the interceptor sits at the entry point and every request passes through it, protecting a plain map with a single RWMutex could degrade performance under very high concurrency. For that reason a concurrent map is used here: it splits the data into 32 shards and guards each shard with its own RWMutex, which improves the map's concurrent performance.
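As a rough illustration of the sharding idea described above, here is a minimal sketch of a map split into 32 independently locked shards. The names (`ShardedMap`, `shardFor`) and the string value type are hypothetical and are not taken from the PR, which may well use an existing concurrent map library instead.

```go
package main

import "sync"

// shardCount follows the "32 shards" figure from the description.
const shardCount = 32

// shard is one slice of the map, guarded by its own lock.
type shard struct {
	mu    sync.RWMutex
	items map[int64]string
}

// ShardedMap spreads keys over shardCount shards so that requests
// touching different shards do not contend on the same lock.
type ShardedMap struct {
	shards [shardCount]*shard
}

// NewShardedMap allocates every shard up front.
func NewShardedMap() *ShardedMap {
	m := &ShardedMap{}
	for i := range m.shards {
		m.shards[i] = &shard{items: make(map[int64]string)}
	}
	return m
}

// shardFor picks a shard from the key. Converting to uint64 first keeps
// the index non-negative even for negative IDs.
func (m *ShardedMap) shardFor(id int64) *shard {
	return m.shards[uint64(id)%shardCount]
}

// Set stores a value under the write lock of a single shard.
func (m *ShardedMap) Set(id int64, v string) {
	s := m.shardFor(id)
	s.mu.Lock()
	s.items[id] = v
	s.mu.Unlock()
}

// Get reads a value under the read lock of a single shard.
func (m *ShardedMap) Get(id int64) (string, bool) {
	s := m.shardFor(id)
	s.mu.RLock()
	v, ok := s.items[id]
	s.mu.RUnlock()
	return v, ok
}

// Delete removes a key from its shard.
func (m *ShardedMap) Delete(id int64) {
	s := m.shardFor(id)
	s.mu.Lock()
	delete(s.items, id)
	s.mu.Unlock()
}

// Len sums the shard sizes, locking one shard at a time.
func (m *ShardedMap) Len() int {
	n := 0
	for _, s := range m.shards {
		s.mu.RLock()
		n += len(s.items)
		s.mu.RUnlock()
	}
	return n
}
```

With this layout a writer only blocks the readers and writers of one shard, so under a uniform key distribution lock contention drops by roughly the shard count compared with a single RWMutex.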

The objects stored in the map are of type ProcessList. Its core fields are:
Ctx: The context of the request.
Cancel: The function that cancels the in-flight request.
ID: The unique identifier of each connection.
StartTime: The time at which the connection was established.
SourceIP: The client's source IP.
FullMethod: The name of the interface called by the request.
RequestStr: The parameters passed with the request, as a string.

Screenshot from 2021-01-25 16-54-15

The interceptor's flow is simple: during interception it collects the required information and assigns it to the fields of a ProcessList value.
Ctx and Cancel are derived from the incoming context. The main purpose is to obtain a reference to the Cancel function, which the Kill feature described later relies on.
ID is the unique identifier of each connection, composed of a nanosecond timestamp plus a random number.
StartTime is used for sorting, so that we can tell which connections have been running longest and which have only just started. We generally want to kill the connections that have been running too long, since they tend to be the ones blocking other connections or hurting performance.
SourceIP is the client's source IP. Because an etcd deployment may sit behind a load balancer (LB), the logic tries to obtain the real client IP first rather than the LB's IP.
FullMethod stores the name of the interface the request calls, which tells us what the connection is doing.
RequestStr holds a human-readable string form of the request parameters for put, delete, range, and txn requests.
Finally, the ProcessList value is stored in the map under its ID and the business handler is called. When the interceptor returns, the connection is deleted from the map. That completes the ProcessList interceptor flow.

2. View ProcessList module
The view ProcessList module provides an interface that clients call to inspect connections. A few parameters determine what data is returned.
CountOnly returns only the number of connections, for cases where you just want to know how many there are.
Type selects whether to view short connections or stream long connections.
ID looks up one specific connection; if it is not specified, all connections are returned.
Finally, the results are sorted by StartTime and returned to the caller.

3. Kill module
The Kill module provides an interface that forcibly cancels a connection, given its ID and Type. This is straightforward: the interceptor module has already placed every connection in the map, and each entry holds a reference to its Cancel function. The module looks up the connection in the map by the ID and Type passed in the request and calls Cancel() to cancel it. The client then receives an error similar to "Connection has canceled by peer". After cancelling, the connection is deleted from the map.

@ptabor ptabor left a comment

It's interesting functionality and I see that it can be useful,
but it is also pretty dangerous.

I commented on the code in a few places, but I think this PR should rather go through a bigger decision process with the maintainers, with a proposal written up as a doc (although the PR is also a nice POC and feasibility study).

Assuming there is a decision to support this feature, the following pieces would be needed:

  • making it opt-in behind an experimental flag
  • an authentication model (who can list/kill processes)
  • extensive test coverage, including client behavior on cancellation.


Type type = 2;

int64 id = 3;

Is this a filter (process lookup)?

  • requires documentation
  • I would identify processes by strings rather than 'int64'.
    If it stays an int, it should rather be uint.

bool count_only = 1;

enum Type {
OP = 0;

Don't abbreviate in proto.


message ShowProcessListResponse {
ResponseHeader header = 1;
repeated mvccpb.ProcessList pls = 2;

Don't abbreviate in proto.

message ShowProcessListResponse {
ResponseHeader header = 1;
repeated mvccpb.ProcessList pls = 2;
int64 count = 3;

uint

int64 count = 3;
}

message KillRequest {

As this does ctx.Cancel I would stick to the 'cancellation' nomenclature.

return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
ctx, cancel := context.WithCancel(ctx)

startTime := time.Now()

I don't think nano-time is sufficiently guaranteed to be unique to serve as a request ID.
At the least it should contain a 'random' part in addition to the time.

@@ -231,16 +326,30 @@ func newStreamInterceptor(s *etcdserver.EtcdServer) grpc.StreamServerInterceptor
return rpctypes.ErrGRPCNoLeader
}

startTime := time.Now()
id := startTime.Local().UnixNano()

ditto

@yangxuanjia

@ptabor
I will update the code according to your suggestions when I have time.

@nate-double-u

Hi @yangxuanjia, we have finished migrating the Documentation to https://github.com/etcd-io/website/. Could you please open a new PR there with the Documentation changes from this PR?

@stale

stale bot commented Aug 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 11, 2021
@stale stale bot closed this Sep 2, 2021