Increase the ability to reset apps and instances #1585
Comments
This is a complex feature we are considering. Maybe 6.0.0-alpha.
If you have any ideas, submit your proposal as a Google Doc or as Markdown in this issue.
Re-registration scheme: when the registration data is found to be lost, the operations staff manually send an instruction to the collector. The collector then adds an instruction parameter to its response to the agent's next heartbeat, and the agent re-registers the app and instance after receiving the instruction.
First, a manual reset command does not seem reliable or robust to me. Why can't the collector detect this automatically? You also need to design how the reset works on the agent side. What changes are you planning to make to the protocol? I think this should be a compatible upgrade.
Of course, it can be discovered automatically, but metadata loss only occurs in certain situations. The question is whether the user needs to know that one of their operations has caused the loss of metadata. If the user needs to know, reset manually; if not, reset automatically. Regarding the agent reset, we need to consider which data the agent sends to the collector along with the metadata. When the metadata is not associated with the agent, the collector returns the registration command. The agent then clears the metadata in its cache, which triggers the registration action.
I am also OK with manual reset. But you should know that we will need the UI to add a settings tab to support this kind of command. For the agent reset, I mean the details of the process, including which data you will reset. By reviewing these, I can be sure that it is the right design. I need you to provide as much detail as possible.
Furthermore, I have doubts about doing the reset from the UI. Considering that the metadata (ID) has been deleted, how does the UI show the entry point for the reset? More questions are
After all, I need your scenarios, rather than simple reset requirements, to help me review your plans. Like I said, the proposal really should be a document.
1.1 Re-send
1.2 Re-registration
  1.2.1 What method is used?
    1.2.1.1 Automatic triggering
    1.2.1.2 Manual trigger
  1.2.2 How to identify the agent that needs to be re-registered
For me, this could be dangerous to tell people, SkyWalking has
If you can't do this manually, I doubt the meaning of this check. This is not pattern recognition, which AI may do better than humans. This is a fault check, and I am pretty sure that a human check is the most reliable thing and the final defence.
How could this identify the
My proposal: recheck your assumptions and make the plan more executable. In this discussion, we should focus on scenarios, code reactions, and manual/auto solutions.
The ID may change across a reset, so metrics aggregated by ID would no longer be accurate.
SkyWalking's consistency check adds an instance code to uniquely identify the instance, so that the absence of an exception guarantees the data is right.
I was wrong here: the server side can also send the reset instruction to the agent through the gRPC response. But considering that this is not a common operation and that it raises security concerns, the server-side reset is not used.
By adding an instanceCode to the agent configuration file, with a value chosen by the operations staff, they can also find the problem agent by its instanceCode.
I think you are still missing the proposal. The answers to my questions don't tell what should be done.
@xingxingyu please don't edit the comments you wrote, except to fix typos. Otherwise, this issue discussion can't be understood by any other reader. Since you have done so, I am closing this issue. Please submit a new one to discuss.
Please answer these questions before submitting your issue.
Question
Requirement or improvement
background
When the user upgrades SkyWalking, the data models of the old and new versions are inconsistent, causing the server to fail to start normally. At this point, the user may resort to clearing the database.
The registration data is then lost, and the UI cannot display the metrics reported by clients that lack registration information. Under the existing mechanism, the user needs to restart the client to complete re-registration, but restarting a business system because of a monitoring system problem is not acceptable.
So we need a mechanism to re-register without restarting the business system.
ideas
The registration data is lost on the server side and there are two compensation measures:
Solving such problems is costly.
Key issues
Unique identifier
At present, the client automatically generates a globally unique agentUUID as the unique identifier of the client instance. However, the operations staff cannot use this identifier to accurately locate the server where the client is running, so a client instance name attribute needs to be added to the startup file and startup parameters and specified manually when the user deploys.
Because recovery is not an essential function of the system, it stays optional: the original globally unique agentUUID is still generated automatically, and it is overwritten only when the user specifies the client instance name in the file or startup parameters at startup.
To avoid modifying the 5.x protocol, which would force probes in other languages to upgrade in step, the instance name attribute is added to the heartbeat interface of the 6.x protocol.
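The fallback rule described above can be sketched as follows. This is only an illustration, not agent code: the function name, the `instance_code` key, and the choice to let a startup parameter win over the file value are all assumptions made for the example.

```python
import uuid
from typing import Mapping


def resolve_instance_identifier(
    file_config: Mapping[str, str],
    startup_params: Mapping[str, str],
    key: str = "instance_code",  # hypothetical key name
) -> str:
    """Return the identifier the agent would register with."""
    # Assumption: a startup parameter takes precedence over the config file.
    for source in (startup_params, file_config):
        value = (source.get(key) or "").strip()
        if value:
            # A user-specified name overwrites the generated agentUUID.
            return value
    # Nothing specified: keep the original behaviour and generate a UUID.
    return uuid.uuid4().hex
```

With this shape, deployments that never set the name keep today's behaviour, and only those that opt in get an identifier the operations staff can recognise.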
Problem finding
A client whose registration data is missing is not aware of it. Only the server can find it, by parsing the data reported by the client. Checking in the trace detail reporting interface is ruled out, because the volume of trace details is too large and the check would cost too much performance. So consider using the instance heartbeat interface to find the problem client. However, the current heartbeat interface reports only the instance ID, which is not enough for the check or for a friendly prompt. The interface needs to be modified to add the instance name attribute, so that the server checks the ID and instance name together and writes the problematic instance's information to the error log.
Issuing the instruction
Considering that this is a rarely used function, the instruction does not need to be sent to the client through the server. Instead, the operator logs in to the client's host directly and issues the instruction to reset the registration data there. For security reasons, the client cannot open a network interface to receive instructions, so the instruction is issued by scanning and watching a file.
To give the operator feedback after the command is issued, the client updates the status information in the file to report how the reset command is progressing.
Solution
About unique identifier
Configure the instance_code field in agent.conf to ensure that it is globally unique and meaningful so that the operation and maintenance personnel can quickly identify the server where the agent is located.
About problem finding
The instance heartbeat protocol adds instance_code, and the server checks the id and the code together. If no match is found, the server prints a log entry (including instance_code) to notify the user to reset the agent.
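A minimal sketch of that server-side check, assuming the registration data can be viewed as a mapping from instance id to instance_code. The `registry` structure, the function name, and the log wording are hypothetical and are not part of the collector.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger("collector.heartbeat")


def check_heartbeat(
    registry: Dict[int, str],  # hypothetical view: instance id -> instance_code
    instance_id: int,
    instance_code: Optional[str],
) -> bool:
    """Return True when both the id and the code match a registration."""
    registered_code = registry.get(instance_id)
    if registered_code is not None and registered_code == instance_code:
        return True
    # Include instance_code so the operations staff can locate the agent.
    logger.error(
        "No registration found for instance id=%s, instance_code=%s; "
        "reset the agent to re-register it.",
        instance_id,
        instance_code,
    )
    return False
```

Checking both values catches the case where the database was cleared and the same numeric id was later handed to a different instance.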
About the instruction
The listener thread checks the value of is_register in the .trigger file every 3 seconds. If it is true, the service, instance, network, and endpoint caches are cleared.
Because segments generated on the agent side carry network_id and service_id, a segment produced after the cache is emptied is discarded directly if the network and endpoint have not yet been registered and returned. Until the service and instance have been registered and returned, the segment is not converted.
Reset status feedback: after the cache is cleared, status is set to running. After registration succeeds, status is set to finish. The status is written to .trigger each time it changes.
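The reset flow above can be sketched as a small state holder. Everything here is illustrative: the class and method names are hypothetical, the trigger file is modelled as an in-memory dict, and clearing is_register after a reset is an assumption added so that the next poll does not trigger again.

```python
from typing import Dict

CACHE_NAMES = ("service", "instance", "network", "endpoint")


class AgentResetState:
    """Hypothetical holder for the agent caches and the reset status."""

    def __init__(self) -> None:
        self.caches: Dict[str, Dict[str, int]] = {n: {} for n in CACHE_NAMES}
        self.status = "no_running"

    def poll_once(self, trigger: Dict[str, str]) -> None:
        """One pass of the listener thread (run every 3 seconds in the proposal)."""
        if trigger.get("is_register") == "true" and self.status != "running":
            for cache in self.caches.values():
                cache.clear()
            self.status = "running"
            # Assumption: clear the flag so the reset is not repeated.
            trigger["is_register"] = "false"
        # Report the current status back through the trigger data.
        trigger["status"] = self.status

    def on_registered(self, trigger: Dict[str, str]) -> None:
        """Called once re-registration has succeeded."""
        self.status = "finish"
        trigger["status"] = self.status
```

A real listener would call `poll_once` from a background thread on a 3 second interval and persist the dict to the .trigger file after each call.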
About .trigger file read and write
The file contains 2 attributes:
is_register -> if true, registration will be triggered
status -> reports the status of the registration; it takes one of four values: no_running, running, finish, failed
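Reading and writing such a file might look like the sketch below. The proposal does not specify the on-disk format, so the `key=value` lines used here are an assumption, as are the function names and the defaults applied when the file is missing.

```python
from pathlib import Path
from typing import Dict

VALID_STATUS = ("no_running", "running", "finish", "failed")


def read_trigger(path: str) -> Dict[str, str]:
    """Parse the .trigger file; a missing file means no reset was requested."""
    values = {"is_register": "false", "status": "no_running"}
    file = Path(path)
    if file.exists():
        for line in file.read_text(encoding="utf-8").splitlines():
            key, sep, value = line.partition("=")
            # Ignore lines that are malformed or carry unknown keys.
            if sep and key.strip() in values:
                values[key.strip()] = value.strip()
    return values


def write_trigger(path: str, is_register: bool, status: str) -> None:
    """Write both attributes, rejecting a status outside the known values."""
    if status not in VALID_STATUS:
        raise ValueError(f"unknown status: {status}")
    flag = "true" if is_register else "false"
    Path(path).write_text(
        f"is_register={flag}\nstatus={status}\n", encoding="utf-8"
    )
```

An operator would then request a reset by setting is_register to true in the file and watch status to see how the reset progresses.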