Debug information collector for chaos-mesh #694

Closed · YangKeao opened this issue Jul 14, 2020 · 9 comments · Fixed by #1074

Comments
@YangKeao (Member) commented Jul 14, 2020

Feature Request

Nowadays it's really hard to collect debug information for some chaos. For example, to gather the debug information for #688, the user has to follow the four steps below (see the sketch after the list):

  1. Find the node the test pod is running on with kubectl get pods -o wide|grep PODNAME; the 7th column is the node.
  2. Find the chaos-daemon corresponding to this node with kubectl get pods -n chaos-testing -o wide|grep NODENAME.
  3. Find the pid of the test pod. There are several ways to do this: you can find the docker container id of the pod with kubectl and then use the docker CLI to get its pid, or you can read the related logs of the corresponding chaos-daemon, which contain the pid.
  4. Get the ipset and tc qdisc rules of the container.
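
For illustration, here is a minimal sketch (not existing chaos-mesh code) of what automating those steps could look like in Go, shelling out to kubectl the same way a user would. The chaos-daemon label selector and the idea of exec'ing into the daemon for node-level ipset/tc output are assumptions, and the per-container pid lookup from step 3 is omitted for brevity.

```go
// Hypothetical helper (not part of chaos-mesh) that automates steps 1, 2, and a
// simplified version of step 4 by shelling out to kubectl. The label selector
// for chaos-daemon pods is an assumption.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// run executes a command and returns its combined, trimmed output.
func run(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

func main() {
	podName, ns := "PODNAME", "default" // the pod under chaos; placeholder values

	// Step 1: find the node the test pod is running on.
	node, err := run("kubectl", "get", "pod", podName, "-n", ns,
		"-o", "jsonpath={.spec.nodeName}")
	if err != nil {
		panic(err)
	}

	// Step 2: find the chaos-daemon pod on that node.
	daemon, err := run("kubectl", "get", "pods", "-n", "chaos-testing",
		"-l", "app.kubernetes.io/component=chaos-daemon",
		"--field-selector", "spec.nodeName="+node,
		"-o", "jsonpath={.items[0].metadata.name}")
	if err != nil {
		panic(err)
	}

	// Step 4 (simplified): exec into the chaos-daemon, which runs privileged on
	// the node, and dump the node-level ipset and tc qdisc state.
	for _, cmd := range [][]string{{"ipset", "list"}, {"tc", "qdisc", "show"}} {
		args := append([]string{"exec", "-n", "chaos-testing", daemon, "--"}, cmd...)
		out, _ := run("kubectl", args...)
		fmt.Printf("== %s (node %s) ==\n%s\n", strings.Join(cmd, " "), node, out)
	}
}
```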

And it's also hard to write a script to collect the debug information, as one chaos can affect several nodes, and some information cannot be collected from the CLI (the containerd CLI is really feature-poor 😭).

So we need a process to help us collect the related debug information. This process can be shipped as a standalone executable file or embedded in the controller-manager image. It can be accessed by running it directly (with enough privilege) or through kubectl exec.

Here is a list of the information we need for different kinds of chaos:

  1. NetworkChaos: iptables, ipsets, and tc qdisc/filter configs, plus the related podnetworkchaos.

  2. StressChaos: cgroup configs for stress-ng and the target process/container.

  3. IoChaos: mounts information.

This list will likely grow as development progresses, so extensibility should be considered carefully.
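
One way to keep the collector extensible, sketched below under the assumption that each chaos kind gets its own collector: a small registry in Go that maps chaos kinds to collectors, so supporting a new kind only means registering one more implementation. All names here (Collector, Register, Collect) are hypothetical and not existing chaos-mesh APIs.

```go
// Hedged sketch of an extensible debug-info collector registry; the names are
// illustrative, not existing chaos-mesh code.
package collector

import (
	"context"
	"fmt"
)

// Collector gathers debug information for one kind of chaos.
type Collector interface {
	// Collect returns human-readable debug output for the named chaos object.
	Collect(ctx context.Context, namespace, name string) (string, error)
}

var registry = map[string]Collector{}

// Register binds a chaos kind (e.g. "NetworkChaos") to its collector.
func Register(kind string, c Collector) {
	registry[kind] = c
}

// Collect looks up the collector for the given kind and runs it.
func Collect(ctx context.Context, kind, namespace, name string) (string, error) {
	c, ok := registry[kind]
	if !ok {
		return "", fmt.Errorf("no debug collector registered for kind %q", kind)
	}
	return c.Collect(ctx, namespace, name)
}
```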

@cwen0 (Member) commented Sep 4, 2020

I support this feature. We can create a single command-line tool to help us collect the debug information.

@Yiyiyimu (Member) commented Oct 4, 2020

Hi @YangKeao, I tried to get the debug info for NetworkChaos and StressChaos by following your steps. But to me, the project still looks like a script that runs one command instead of several. For example, when several nodes are affected by one chaos, we could get the list of nodes and collect the debug info from each node the same way we do for a single node.

The only problem is that I can't see how to make it extensible... I think I need some guidance here. Could you provide some specific examples where a script would not work, like information that cannot be obtained from ctr?

@namco1992 (Contributor) commented

Hi @cwen0 @YangKeao, since we run the daemonset in privileged mode and, if I understand correctly, the examples you mentioned (NetworkChaos, StressChaos, and IOChaos) are also executed by the daemonset, would it be possible to have the daemonset collect the debug data we need and stream it back to the controller through gRPC? I believe this approach is extensible and could be helpful if we decide to expose the debug info on the dashboard to provide users with a one-stop service.
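
To make the idea concrete, here is a minimal, self-contained Go sketch of the daemon-side collection, assuming the daemon streams named sections of output back to its caller. DebugChunk and DebugStream are hypothetical stand-ins for whatever gRPC-generated types would actually be used, and the command list is only an example.

```go
// Hypothetical daemon-side collection that streams named debug sections back to
// the caller; not existing chaos-mesh code.
package main

import (
	"fmt"
	"os/exec"
)

// DebugChunk is one named piece of debug output.
type DebugChunk struct {
	Name    string
	Content string
}

// DebugStream abstracts the server side of a streaming RPC.
type DebugStream interface {
	Send(*DebugChunk) error
}

// collectDebugInfo runs a fixed set of node-local commands inside the daemon
// (which already runs privileged) and streams each result back.
func collectDebugInfo(stream DebugStream) error {
	cmds := map[string][]string{
		"ipset":    {"ipset", "list"},
		"tc-qdisc": {"tc", "qdisc", "show"},
		"mounts":   {"cat", "/proc/mounts"},
	}
	for name, cmd := range cmds {
		out, err := exec.Command(cmd[0], cmd[1:]...).CombinedOutput()
		if err != nil {
			out = append(out, []byte(fmt.Sprintf("\n(error: %v)", err))...)
		}
		if err := stream.Send(&DebugChunk{Name: name, Content: string(out)}); err != nil {
			return err
		}
	}
	return nil
}

// printStream is a trivial DebugStream that writes chunks to stdout, so the
// sketch can be exercised without any gRPC machinery.
type printStream struct{}

func (printStream) Send(c *DebugChunk) error {
	fmt.Printf("== %s ==\n%s\n", c.Name, c.Content)
	return nil
}

func main() {
	_ = collectDebugInfo(printStream{})
}
```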

@Yiyiyimu (Member) commented Oct 5, 2020

@namco1992 Integrating with the dashboard is really a great idea. That would be much easier to use, especially in complicated situations like one chaos affecting multiple nodes. Using the daemonset to collect debug info is also quite promising: given what we need right now, the debug info is usually the configuration produced by commands the daemonset executes, so it's very easy to collect. Actually, I don't think we need to transfer the data to the controller; we just need to save the latest snapshot somewhere and keep it ready to use.

Just one question: since we would use push mode, we might need to collect data at a high frequency; would that affect the performance of Chaos Mesh in production? We could also leave a switch for users to decide whether to collect the debug data.

@namco1992 (Contributor) commented Oct 5, 2020

Hello @Yiyiyimu, I think the debug mode probably doesn't have to be always on; most of the time it could just be a one-off data collection that takes a snapshot of the current status. Another approach could be to return a stream when calling ChaosDaemonServer so it can report status continuously. I think currently we either return nothing or just some basic metadata of the chaos. In the future, streaming could be useful if we want to collect more data/logs from the daemonsets and make the communication between the controller and the daemons bi-directional.
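
To make the two options concrete, here is a hypothetical Go interface contrasting the one-off snapshot with the streaming variant; neither method exists on chaos-mesh's ChaosDaemon service today, and the names are illustrative only.

```go
// Hypothetical shapes for the two options discussed above; not real chaos-mesh APIs.
package debugapi

import "context"

// DebugSnapshot is the full debug state at one point in time.
type DebugSnapshot struct {
	Sections map[string]string // e.g. "ipset" -> output of `ipset list`
}

// DebugCollector is what the chaos-daemon could expose (over gRPC in practice).
type DebugCollector interface {
	// Snapshot returns the current state once; enough for most chaos kinds,
	// whose debug info rarely changes while the chaos is running.
	Snapshot(ctx context.Context) (*DebugSnapshot, error)

	// Watch keeps sending snapshots on the returned channel until ctx is
	// cancelled; only needed if continuous reporting is ever wanted.
	Watch(ctx context.Context) (<-chan *DebugSnapshot, error)
}
```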

One use case I can think of: when running StressChaos, we currently just use kubectl top or something similar to monitor the state of the chaos. If the daemons streamed back some statistics, we could plot them nicely on the dashboard without the user switching between different screens.

IMO it's important to have an all-in-one dashboard for a better user experience, but I'm happy to discuss this further. I just feel this could be a good opportunity to define a standard data-collection interface, and it might be somewhat related to the "Support generating the report for each chaos scenario" item on the roadmap.

@YangKeao (Member, Author) commented Oct 9, 2020

@namco1992 I don't think streaming debug information (like logs or statistics) back to the controller-manager is a good choice for this issue, because:

  1. The controller doesn't have any storage and is still stateless. If the debug information is streamed back, it has to be stored somewhere.

  2. The "debug information" for most chaos doesn't change frequently.

IMO, in the first stage of the debug collector, we only need a simple tool that collects this information (for a chaos resource) and prints it to stdout or the dashboard, so that the user can C-c & C-v it into a GitHub issue 😸. In the future, it may diagnose chaos-mesh automatically, but I think it's too early to consider that here.
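
A minimal sketch of that tool shape, assuming a standalone binary that takes the chaos kind, namespace, and name and prints everything to stdout; collectFor is a hypothetical placeholder for the real per-kind collection.

```go
// Hypothetical CLI entry point for the simple collector described above.
package main

import (
	"flag"
	"fmt"
	"os"
)

func collectFor(kind, namespace, name string) (string, error) {
	// A real tool would dispatch to per-kind collectors here (ipset/tc for
	// NetworkChaos, cgroups for StressChaos, mounts for IOChaos).
	return fmt.Sprintf("debug info for %s %s/%s ...\n", kind, namespace, name), nil
}

func main() {
	kind := flag.String("kind", "NetworkChaos", "chaos kind")
	namespace := flag.String("namespace", "default", "namespace of the chaos resource")
	name := flag.String("name", "", "name of the chaos resource")
	flag.Parse()

	out, err := collectFor(*kind, *namespace, *name)
	if err != nil {
		fmt.Fprintln(os.Stderr, "collect failed:", err)
		os.Exit(1)
	}
	// Plain stdout output so the result can be pasted into a GitHub issue.
	fmt.Print(out)
}
```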

@YangKeao (Member, Author) commented Oct 9, 2020

@Yiyiyimu Extensibility means developers can easily extend the collector when they add a new kind of chaos resource.

@namco1992 (Contributor) commented

@YangKeao Yes, I agree with you: given the scope of this issue, streaming back the debug info is not the ideal way. I was trying to shove live status and reporting into this scenario, and it might not be a good idea. 😅

@github-actions commented
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 15 days
