Increase the ability to reset apps and instances #1585

Closed
1 of 4 tasks
Liu-XinYuan opened this issue Aug 24, 2018 · 13 comments
Labels: agent (language agent related), backend (OAP backend related), feature (new feature), question (end user question and discussion), TBD (to be decided later, need more discussion or input), wontfix (this will not be worked on)

Comments

@Liu-XinYuan
Contributor

Liu-XinYuan commented Aug 24, 2018

Please answer these questions before submitting your issue.

  • Why do you submit this issue?
  • Question or discussion
  • Bug
  • Requirement
  • Feature or performance improvement

Question

  • What do you want to know?

Requirement or improvement

  • Please describe your requirements or improvement suggestions.

background

When a user upgrades SkyWalking, the data models of the old and new versions can be inconsistent, which prevents the server from starting normally. At that point the user may resort to clearing the database, and the registration data is lost. The UI can then no longer display the metrics reported by clients whose registration information is missing. Under the existing mechanism, the user has to restart the client to re-register, but restarting a business system because of a monitoring system problem is not acceptable.
So we need a mechanism to re-register without restarting the business system.

ideas

When registration data is lost on the server side, there are two possible compensation measures:

  • Push the registration data cached by the client to the server again. However, the IDs in the cached registration data may already have been taken by other, newly registered clients, and resolving such conflicts is costly.
  • Reset the registration data of the affected client. The key elements of this solution are how to identify the affected client and how to deliver the command to it.
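
The ID conflict in the first option can be illustrated with a small sketch. It assumes that the server hands out IDs sequentially, which the thread does not confirm; the `Registry` class and the instance names are hypothetical.

```python
class Registry:
    """Hypothetical in-memory registry; ids are assumed to be sequential."""

    def __init__(self):
        self._ids = {}       # instance name -> id
        self._next_id = 1

    def register(self, name):
        if name not in self._ids:
            self._ids[name] = self._next_id
            self._next_id += 1
        return self._ids[name]

    def owner_of(self, instance_id):
        for name, known_id in self._ids.items():
            if known_id == instance_id:
                return name
        return None


registry = Registry()
cached_id = registry.register("order-service@host-a")  # the agent keeps this id in memory

registry = Registry()                                   # registration data is cleared
new_id = registry.register("pay-service@host-b")        # a new agent registers afterwards

# The id cached by the old agent now belongs to a different instance.
collision_owner = registry.owner_of(cached_id)
```

Under this assumption, pushing the cached data back would require detecting and resolving the clash for every affected ID, which is the cost the first option refers to.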

Key issues

Unique identification

At present, the client automatically generates a globally unique agentUUID as the identifier of the client instance. However, operations staff cannot use this identifier to locate the server on which the client runs. We therefore need to add a client instance name attribute to the startup file and startup parameters, to be specified manually when the user deploys.
Because the recovery function is not essential to the system, the instance name is optional: the automatically generated agentUUID is retained, and it is overwritten only when the user specifies a client instance name in the startup file or startup parameters.
To avoid modifying the 5.x protocol, which would force agents in other languages to upgrade in step, the instance name attribute is added to the heartbeat interface of the 6.x protocol.
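
A minimal sketch of the fallback rule described above, not the agent's actual code. The function name, the dictionary-based configuration, and the precedence of startup parameters over the file are assumptions; the key `instance_code` is the field name proposed later in this issue.

```python
import uuid


def resolve_instance_identifier(file_config, startup_args):
    """Return (identifier, is_user_defined) for this agent instance."""
    # Assumed precedence: startup parameters win over the configuration file.
    name = startup_args.get("instance_code") or file_config.get("instance_code")
    if name and name.strip():
        return name.strip(), True
    # Nothing configured: keep the existing auto-generated unique identifier.
    return uuid.uuid4().hex, False
```

For example, `resolve_instance_identifier({"instance_code": "order-01"}, {})` yields the configured name, while two empty dictionaries yield a random identifier.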

Problem finding

A client whose registration data is missing is not aware of it. Only the server can detect the problem, by parsing the data the client reports. Checking in the trace detail reporting interface is not practical, because the volume of trace details is too large and the check would cost too much performance. The instance heartbeat interface is therefore used to find the affected client. However, the current heartbeat interface reports only the instance ID, which is not enough for the check or for a helpful message. The interface needs to be modified to carry the instance name attribute, so that the server checks the ID and instance name together and writes the affected instance's information to the error log.

Issuing the instruction

Because this function is expected to be used only rarely, the instruction does not need to be sent to the client through the server. Instead, the operator logs in to the client machine directly and issues the instruction to reset the registration data. For security reasons, the client cannot open a network interface to receive instructions, so the instruction is issued through a file that the client scans and listens to.
To give the operator feedback after the instruction is issued, the client updates the status information in the file to report how the reset is progressing.

Program

About unique identifier

Configure the instance_code field in agent.conf and make sure it is globally unique and meaningful, so that operations staff can quickly identify the server where the agent is located.

About problem finding

The instance heartbeat protocol adds instance_code, and the server checks id and code together. If no match is found, the server prints a log entry (including instance_code) to notify the user to reset the agent.
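
The server-side check could look roughly like the sketch below. The `registered` mapping, the function name, and the log wording are hypothetical; only the idea of checking id and instance_code together and logging the instance_code on failure comes from the proposal.

```python
import logging

logger = logging.getLogger("heartbeat-check")


def check_heartbeat(registered, instance_id, instance_code):
    """registered maps instance id -> instance_code known to the server."""
    known_code = registered.get(instance_id)
    if known_code == instance_code:
        return True
    # Covers both a missing id and an id that now belongs to another instance.
    logger.error(
        "Registration missing or mismatched: id=%s, instance_code=%s. "
        "Reset the agent to re-register.",
        instance_id,
        instance_code,
    )
    return False
```

Because the log line carries instance_code, operations staff can map the failure back to a specific server without needing the lost metadata.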

About the instruction

  • The listener thread checks the value of is_register in the .trigger file every 3 seconds. If it is true, the service, instance, network, and endpoint caches are cleared.

  • Because segments on the agent side carry network_id and service_id, segments produced after the caches are cleared are discarded until the network and endpoint registrations have returned. Until the service and instance registrations have returned, segments are not converted.

  • Reset status feedback. After the caches are cleared, status is set to running. After registration succeeds, status is set to finish. Each status change is written to .trigger.
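
The steps above can be sketched as a small state machine. This is an illustration under assumptions: the class and method names are hypothetical, the 3-second polling loop and the file I/O are left out, and the segment rule is simplified to "hold back segments until service and instance ids exist again".

```python
class AgentResetter:
    CACHES = ("service", "instance", "network", "endpoint")

    def __init__(self, service_id, instance_id):
        self.caches = {name: {} for name in self.CACHES}
        self.caches["service"]["id"] = service_id
        self.caches["instance"]["id"] = instance_id
        self.status = "no_running"

    def on_trigger(self, is_register):
        """Called by the listener thread on each poll of .trigger."""
        if is_register:
            for cache in self.caches.values():
                cache.clear()
            self.status = "running"
        return self.status

    def on_registered(self, service_id, instance_id):
        """Called when re-registration has returned new ids."""
        self.caches["service"]["id"] = service_id
        self.caches["instance"]["id"] = instance_id
        self.status = "finish"
        return self.status

    def can_send_segment(self):
        # Simplified reading of the segment rule: no ids, no segments.
        return "id" in self.caches["service"] and "id" in self.caches["instance"]
```

In a real agent the value returned by `on_trigger` and `on_registered` would be written back to .trigger so the operator can follow the progress.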

About .trigger file read and write

The file contains 2 attributes:
is_register -> if true, registration will be triggered
status -> reports the status of the registration; there are four values: no_running, running, finish, failed
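
The proposal does not specify the on-disk format of .trigger. The sketch below assumes a plain `key=value` layout and hypothetical helper names; the two attributes and the four status values are taken from the description above.

```python
import os
import tempfile

VALID_STATUS = ("no_running", "running", "finish", "failed")


def read_trigger(path):
    """Return (is_register, status) parsed from an assumed key=value file."""
    values = {"is_register": "false", "status": "no_running"}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip().lower()] = value.strip()
    return values["is_register"].lower() == "true", values["status"]


def write_trigger(path, is_register, status):
    """Write both attributes, replacing the file in one step."""
    if status not in VALID_STATUS:
        raise ValueError("unknown status: " + status)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("is_register=" + ("true" if is_register else "false") + "\n")
        handle.write("status=" + status + "\n")
    # os.replace swaps the file atomically, so a reader never sees a partial write.
    os.replace(tmp_path, path)
```

Writing to a temporary file and replacing the original is a design choice for this sketch: the operator and the listener thread both touch the same file, so partial writes should not be observable.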

@wu-sheng wu-sheng added this to the 6.0.0-alpha milestone Aug 24, 2018
@wu-sheng wu-sheng added core feature Core and important feature. Sometimes, break backwards compatibility. agent Language agent related. backend OAP backend related. feature New feature labels Aug 24, 2018
@wu-sheng
Member

This is a complex feature we are considering. Maybe 6.0.0-alpha.

@wu-sheng
Member

If you have any ideas, submit your proposal as a Google Doc or as markdown in this issue.

@Liu-XinYuan
Contributor Author

Liu-XinYuan commented Aug 25, 2018

Re-registration scheme

When the loss of registration data is discovered, operations staff manually send an instruction to the collector. The collector then adds an instruction parameter to its response when it receives the agent's heartbeat, and the agent re-registers the app and instance after receiving the instruction.
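
A rough sketch of this scheme, with the instruction piggybacked on the heartbeat response. The classes, the `re_register` command name, and the dictionary-shaped response are hypothetical and do not reflect the real gRPC protocol.

```python
class Collector:
    """Hypothetical collector that piggybacks commands on heartbeat responses."""

    def __init__(self):
        self._pending_resets = set()

    def request_reset(self, instance_id):
        # Called when operations staff send the manual reset instruction.
        self._pending_resets.add(instance_id)

    def on_heartbeat(self, instance_id):
        # Deliver the instruction once, on the next heartbeat from that instance.
        if instance_id in self._pending_resets:
            self._pending_resets.discard(instance_id)
            return {"command": "re_register"}
        return {"command": None}


class Agent:
    def __init__(self, app_id, instance_id):
        self.app_id = app_id
        self.instance_id = instance_id

    def handle_response(self, response):
        # Dropping the cached ids is what makes the agent register again.
        if response.get("command") == "re_register":
            self.app_id = None
            self.instance_id = None
            return True
        return False
```

Note that this sketch addresses the agent by its old instance id, which presumes the operator still knows that id after the metadata is lost.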

@wu-sheng
Member

wu-sheng commented Aug 25, 2018

First, doing the reset command manually does NOT seem very reliable or robust to me. Why can't the collector find that automatically?

We also need a design for how the reset works on the agent side.

What changes are you planning to make to the protocol? I think the upgrade should stay compatible.

@wu-sheng wu-sheng added the TBD To be decided later, need more discussion or input. label Aug 25, 2018
@Liu-XinYuan
Contributor Author

Of course, it can be discovered automatically, but the loss of metadata only occurs in certain situations. The question is whether the user needs to know that one of their operations has caused the loss of metadata. If the user needs to know, the reset should be manual; if not, it can be automatic.

Regarding the reset on the agent side, we need to consider which data the agent sends to the collector with its metadata. When the collector finds that the metadata is not associated with the agent, it returns the registration command. The agent then clears the metadata in its cache, which triggers the registration action.

@wu-sheng
Member

I am also OK with manual reset. But you should know that we will require the UI to add a settings tab to support this kind of command.

For agent reset, I mean the details of the process, including which data you will reset. By reviewing these, I can be sure that it is the right design.

I need you to provide as much detailed info as possible.

@wu-sheng
Member

Furthermore, I have doubts about doing the reset from the UI. Considering the metadata (ID) has been deleted, how does the UI show the entry point for the reset?

More questions:

  1. How do you want to reset your agents? One by one, or all at once?
  2. How does the backend know the agent has been reset?

After all, I need your scenarios, rather than just the reset requirement, to help me review your plans. Like I said, the proposal really should be a document.

@Liu-XinYuan
Contributor Author

Liu-XinYuan commented Sep 3, 2018

  1. Why develop this feature, and plan
    Misoperation and data model changes during the collector upgrade process may cause some registration data to be lost. If a new agent then registers, it may get the same id as an agent that lost its metadata, and the subsequent metrics and traces will be inaccurate.
    For the loss of metadata, there are two compensation mechanisms: the agent resends the data held in memory, or the agent re-registers.

1.1 Re-send
The agent sends the data held in memory to the collector, and the collector writes it to es. However, since registration is serial, the data may be overwritten by other registration data after it is written to es. Therefore, this scheme is not adopted.

1.2 Re-registration
Trigger re-registration by setting the id in the agent-side cache to null

1.2.1 What method is used?
There are two ways to re-register, manual triggering and automatic triggering.

1.2.1.1 Automatic triggering
Loss of metadata due to misoperation is not a defect in the SkyWalking system. If the reset causes the agent's ids before and after it to be inconsistent, an exception needs to be thrown to inform the user that their operation has caused the metadata to be lost. Therefore, the method of manually triggering the reset is adopted.

1.2.1.2 Manual trigger
If commands are sent manually from the management interface, we need to know which agents have lost their registration data. Obviously, the metadata has been lost and there is no way to get it. For the same reason, the manual trigger cannot be implemented on the collector side.

1.2.2 How to identify the agent that needs to be re-registered
Uniquely determine the instance location by adding the instance_code attribute to the agent.conf configuration file.

@wu-sheng
Member

wu-sheng commented Sep 3, 2018

> Loss of metadata due to misoperation is not a defect in the SkyWalking system. If the reset causes the agent's ids before and after it to be inconsistent, an exception needs to be thrown to inform the user that their operation has caused the metadata to be lost. Therefore, the method of manually triggering the reset is adopted.

  • Automatic triggering
    How do you detect the inconsistency? When metadata is missing or partially missing, there is no way the backend could know. Could you tell me more about the inconsistency check mechanism? The only thing I can think of that doesn't hurt performance is that when you can't get the name of the service/endpoint-ip from the cache, you know something is wrong. But that covers only part of what an inconsistency check would need.

For me, it could be dangerous to tell people that SkyWalking has an inconsistency check, because this mechanism can't guarantee that no exception = data is right.
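
The limitation can be shown with a small sketch of a cache-lookup check. The cache contents, ids, and function name are hypothetical; the point is only that a lookup hit does not prove the id belongs to the reporting agent.

```python
def looks_consistent(name_cache, service_id):
    """Cheap check: the id is considered fine if a name can be found for it."""
    return service_id in name_cache


# State after the metadata was wiped and a new service registered with id 1.
name_cache = {1: "pay-service"}

missing_id = 2   # an old agent still reports id 2, which no longer exists
reused_id = 1    # another old agent reports id 1, originally "order-service"

detected = not looks_consistent(name_cache, missing_id)
undetected = looks_consistent(name_cache, reused_id)
attributed_to = name_cache[reused_id]
```

The missing id is caught, but the reused id passes the check and its data is attributed to the wrong service, so the absence of an exception says nothing about correctness.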

> If commands are sent manually from the management interface, we need to know which agents have lost their registration data. Obviously, the metadata has been lost and there is no way to get it. For the same reason, the manual trigger cannot be implemented on the collector side.

If you can't do this manually, I doubt the value of this check. This is not pattern recognition, where AI may do better than a human. It is a fault check, and I am pretty sure a human check is the most reliable thing and the final defence.

> 1.2.2 How to identify the agent that needs to be re-registered
> Uniquely determine the instance location by adding the instance_code attribute to the agent.conf configuration file.

How could this identify the agent that needs to be re-registered? Could you explain more? "Uniquely determine the instance location": this is an ID of the service instance, yes. Then?

@wu-sheng
Member

wu-sheng commented Sep 3, 2018

My proposal: recheck your assumptions and make the plan more executable. In this discussion, we should focus on scenarios, code reactions, and manual/auto solutions.

@Liu-XinYuan
Contributor Author

> Automatic triggering
> How do you detect the inconsistency? When metadata is missing or partially missing, there is no way the backend could know. Could you tell me more about the inconsistency check mechanism? The only thing I can think of that doesn't hurt performance is that when you can't get the name of the service/endpoint-ip from the cache, you know something is wrong. But that covers only part of what an inconsistency check would need.

The id may change across the reset, in which case the metrics aggregated by id are inaccurate.

> For me, it could be dangerous to tell people that SkyWalking has an inconsistency check, because this mechanism can't guarantee that no exception = data is right.

SkyWalking's consistency check adds the instance code to uniquely determine the instance, which guarantees that no exception = data is right.

> If you can't do this manually, I doubt the value of this check. This is not pattern recognition, where AI may do better than a human. It is a fault check, and I am pretty sure a human check is the most reliable thing and the final defence.

I was wrong here: the server side can also send a reset instruction to the agent through the gRPC response. But considering that this is not a common operation, and considering the security problem, the server-side reset is not used.

> How could this identify the agent that needs to be re-registered? Could you explain more? "Uniquely determine the instance location": this is an ID of the service instance, yes. Then?

By adding an instanceCode to the agent configuration file, whose value is determined by operations staff, they can also find the affected agent according to the instanceCode.

@wu-sheng
Member

wu-sheng commented Sep 4, 2018

I think you are still missing the proposal. The answers to my questions don't tell me what should be done.

@wu-sheng
Member

wu-sheng commented Sep 5, 2018

@xingxingyu please don't edit the comments you wrote, except to fix typos. Otherwise, this issue discussion can't be understood by any other reader.

Since you have done so, I am closing this issue. Please submit a new one to discuss.

@wu-sheng wu-sheng closed this as completed Sep 5, 2018
@wu-sheng wu-sheng added question End user question and discussion. wontfix This will not be worked on and removed core feature Core and important feature. Sometimes, break backwards compatibility. labels Sep 5, 2018