Tracking Issue: Better observability for StressChaos #4405

kaaass · 2024-04-25T09:39:36Z

This is a tracking issue for the LFX Mentorship 2024 March-May Term (project link, related issue #3651). Our mentors are @STRRL, @g1eny0ung, and @cwen0. In this project, we aim to improve the observability of StressChaos by exposing the pod-level metrics (e.g., CPU usage seconds) related to StressChaos experiments and providing a way to visualize them. In the end, we hope Chaos Mesh end users can observe the injected stress at the pod level.

Steps

Step 1. Enabling Pod-level Metrics
In this step, we want to make sure the metrics related to StressChaos can be observed at the pod level. To achieve this, we need to evaluate the current implementation of StressChaos to see whether the stressor process is injected into the target container's PID namespace, cgroup, etc.
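One way to check the cgroup part of this evaluation by hand is to compare the `/proc/<pid>/cgroup` entries of the stressor process with those of a process in the target container. The sketch below only parses that file's `hierarchy-ID:controller-list:cgroup-path` line format. The function names and sample paths are made up for illustration and are not taken from Chaos Mesh.

```python
def parse_proc_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup into {controller-list: path}.

    Each line has the form "hierarchy-ID:controller-list:cgroup-path".
    On cgroup v2 the controller list is empty ("0::/some/path").
    """
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The path itself may contain ':', so split at most twice.
        _hierarchy_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result


def same_cgroup(stressor_text, container_text, controller=""):
    """Return True if both processes share the cgroup path for `controller`.

    Use controller="" for cgroup v2, or e.g. "cpu,cpuacct" for cgroup v1.
    """
    a = parse_proc_cgroup(stressor_text)
    b = parse_proc_cgroup(container_text)
    return controller in a and a.get(controller) == b.get(controller)


def read_proc_cgroup(pid):
    """Read /proc/<pid>/cgroup for a live process (Linux only)."""
    with open(f"/proc/{pid}/cgroup") as f:
        return f.read()
```

On a node, `same_cgroup(read_proc_cgroup(stressor_pid), read_proc_cgroup(container_pid))` would then indicate whether the two processes sit in the same cgroup v2 path, where both PIDs are values you look up yourself.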
Step 2. Exposing Pod-level Metrics
In this step, we want to provide a way to expose the pod-level metrics to an external observability stack (e.g., Prometheus). All the metrics should carry labels that identify the injected StressChaos.
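To make "metrics with labels about the injected StressChaos" concrete, here is a minimal sketch that renders one sample in the Prometheus text exposition format. The metric name and label names (`demo_container_cpu_usage_seconds_total`, `stress_chaos`, and so on) are placeholders chosen for this example, not the names used by the actual implementation.

```python
def escape_label_value(value):
    """Escape a label value for the Prometheus text format."""
    # Backslash must be escaped first so later escapes are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render a single sample line: name{k="v",...} value."""
    if not labels:
        return f"{name} {value}"
    label_text = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


line = render_sample(
    "demo_container_cpu_usage_seconds_total",  # hypothetical metric name
    {"namespace": "default", "pod": "web-0", "stress_chaos": "cpu-burn"},
    12.5,
)
print(line)
```

Sorting the labels keeps the output stable between scrapes, which makes the rendered text easier to compare in tests.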
Step 3. Visualizing the Metrics
In this step, we want to provide a way for end users to visualize the exposed experiment metrics. We expect to provide a Grafana dashboard, the related Prometheus configuration, and accompanying documentation.

Progress

Step 1. Enabling Pod-level Metrics (Done)
- Evaluating the current StressChaos implementation
  - We reviewed the current implementation of the namespace and cgroup mechanism.
    - The injected stress-ng is spawned in the same PID namespace as the Pod container.
      - By (*CommandBuilder) SetNS, which utilizes nsexec
    - The injected stress-ng is attached to the correct cgroup of the Pod container.
      - By (*AttachCGroupV1).AttachProcess or (*AttachCGroupV2).AttachProcess
  - We confirmed that the pod-level metrics reflect the injected StressChaos in both cgroup v1 and v2 environments.
- Fixing issues found during the evaluation
  - Failed to apply StressChaos in minikube with qemu driver: controller is not supported #4406
  - fix: fail to inject StressChaos in certain cgroup v1 environment because PidPath returns an unexpected error #4407
Step 2. Exposing Pod-level Metrics (Done)
- Proposing plan and demo
  - Plan A: Exporting an intermediate metric (Now discarded)
    - We proposed a rough plan (URL)
    - We built a proof-of-concept demo to implement this, which includes:
      - An exporter that implements the work planned for chaos-controller-manager.
      - A Helm chart that includes the exporter deployment and prometheus-operator configuration.
      - A Grafana dashboard for visualization.
    - After discussion, we found that this plan has several disadvantages and may not be good enough for all use cases.
  - Plan B: Exporting experiment metric directly
    - We proposed an RFC: RFC: Export Metrics related to StressChaos Experiments rfcs#47
    - We built a working demo
- Implementation
  - chore: upgrade the base image of chaos-dlv #4408
  - Chaos Daemon
    - containerd: feat: export container metrics in Chaos Daemon for containerd runtime #4416
    - docker, cri-o: code done; a PR will follow after the containerd part is merged
  - Chaos Controller Manager: feat: export relation between experiment and container to metrics #4415
  - Helm Charts: feat: add prometheus rules to export metrics that can be used to observe the impact of StressChaos #4418
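The three implementation pieces above divide the work: Chaos Daemon exports per-container metrics, Chaos Controller Manager exports which experiment targets which container, and the Prometheus rules combine the two. The sketch below mimics that combination step in plain Python to show the idea. All field names and sample values are hypothetical, and in a real deployment this join would be expressed in the Prometheus rules rather than in application code.

```python
def attach_experiment_labels(container_samples, relations):
    """Join container samples with experiment relations on container id.

    container_samples: list of {"container_id": str, "value": float}
    relations: list of {"container_id": str, "experiment": str, "namespace": str}
    Containers without a matching relation are dropped, like an inner join.
    """
    by_container = {}
    for rel in relations:
        by_container.setdefault(rel["container_id"], []).append(rel)

    joined = []
    for sample in container_samples:
        for rel in by_container.get(sample["container_id"], []):
            joined.append({
                "container_id": sample["container_id"],
                "experiment": rel["experiment"],
                "namespace": rel["namespace"],
                "value": sample["value"],
            })
    return joined


samples = [
    {"container_id": "c1", "value": 12.5},
    {"container_id": "c2", "value": 3.0},  # not targeted by any experiment
]
relations = [
    {"container_id": "c1", "experiment": "cpu-burn", "namespace": "default"},
]
result = attach_experiment_labels(samples, relations)
```

Only `c1` survives the join, so a dashboard built on the joined series would show just the containers that a StressChaos experiment actually targets.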
Step 3. Visualizing the Metrics (WIP)
- Current
- Design a Grafana dashboard
- Write documents


kaaass · 2024-04-25T09:39:53Z

/assign

g1eny0ung added the LFX-MENTORSHIP label Apr 26, 2024
