add more status for record #3170

Merged
merged 1 commit into from
May 11, 2022

Conversation

xlgao-zju
Member

@xlgao-zju xlgao-zju commented Apr 21, 2022

What problem does this PR solve?

Close #3127

What's changed and how it works?

Related changes

  • Need to update chaos-mesh/website
  • Need to update Dashboard UI
  • Need to cherry-pick to release branches
    • release-2.1
    • release-2.0

Checklist

CHANGELOG

  • I have updated the CHANGELOG.md
  • I have labeled this PR with "no-need-update-changelog"

Tests

  • Unit test
  • E2E test
  • No code
  • Manual test (add steps below)

create experiment, pause and resume it

image

Side effects

  • Breaking backward compatibility

Release note

Please add a release note.

You can safely ignore this section if you don't think this PR needs a release note.

DCO

If the DCO check fails, run commands like the ones below to fix it (adjust them to your situation, for example if the failing commit isn't the most recent one):

git commit --amend --signoff
git push --force

@ti-chi-bot
Member

ti-chi-bot commented Apr 21, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • STRRL
  • YangKeao

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@codecov

codecov bot commented Apr 21, 2022

Codecov Report

Merging #3170 (653c3e5) into master (22a1cf5) will decrease coverage by 0.05%.
The diff coverage is 60.00%.

❗ Current head 653c3e5 differs from pull request most recent head a8721fb. Consider uploading reports for the commit a8721fb to get more accurate results

@@            Coverage Diff             @@
##           master    #3170      +/-   ##
==========================================
- Coverage   41.04%   40.98%   -0.06%     
==========================================
  Files         161      164       +3     
  Lines       13640    13794     +154     
==========================================
+ Hits         5598     5653      +55     
- Misses       7629     7715      +86     
- Partials      413      426      +13     
Impacted Files                                         Coverage Δ
api/v1alpha1/common_types.go                           0.00% <0.00%> (ø)
controllers/common/step.go                             0.00% <0.00%> (ø)
controllers/common/fx.go                               59.04% <77.77%> (+5.88%) ⬆️
controllers/statuscheck/controller.go                  79.26% <0.00%> (-4.88%) ⬇️
pkg/workflow/controllers/abort_node_reconciler.go      15.11% <0.00%> (-3.49%) ⬇️
pkg/selector/generic/mode.go                           25.64% <0.00%> (-2.57%) ⬇️
.../workflow/controllers/workflow_entry_reconciler.go  54.05% <0.00%> (-2.17%) ⬇️
...g/workflow/controllers/parallel_node_reconciler.go  60.12% <0.00%> (-1.90%) ⬇️
pkg/workflow/controllers/serial_node_reconciler.go     59.58% <0.00%> (-1.56%) ⬇️
pkg/workflow/controllers/statuscheck_reconciler.go     7.57% <0.00%> (-1.52%) ⬇️
... and 6 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6111432...a8721fb.

@xlgao-zju
Member Author

/cc @cwen0 @STRRL @YangKeao @Hexilee
for review

@xlgao-zju xlgao-zju force-pushed the more-status branch 2 times, most recently from be46497 to 28e9c92 Compare April 21, 2022 11:27
Member

@STRRL STRRL left a comment

rest LGTM!

api/v1alpha1/common_types.go (outdated, resolved)
api/v1alpha1/common_types.go (outdated, resolved)
api/v1alpha1/common_types.go (outdated, resolved)
@STRRL
Member

STRRL commented Apr 22, 2022

After discussing with @xlgao-zju, the volume of `events` might be a problem now.

There are two reasons `events` could grow quickly:

  • "one-to-many" chaos-to-records relation: when a chaos experiment uses a selector that matches a large number of targets, there will be many "records" in the status
  • uncontrolled retry/reconcile when Apply/Recover fails: we always requeue the reconcile request if the operation fails, using only the default rate limiter of controller-runtime, so many events with "similar" messages appear in a short time

Maybe we need an aggregation or rotation policy in the future. I am still optimistic about the impact of this problem, and I think we do not need to address it in this PR.

What do you think? @YangKeao @Hexilee @xlgao-zju
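
For illustration only, here is a minimal sketch of what an aggregation plus rotation policy could look like. It is not code from this PR: the `Event` type, the `appendEvent` helper, and the limit of 3 are hypothetical names and values chosen for the example. Consecutive identical messages are folded into one entry with a counter, and the oldest entries are dropped once the list exceeds the limit.

```go
package main

import "fmt"

// Event is a hypothetical stand-in for one entry in a record's events list.
type Event struct {
	Message string
	Count   int // how many consecutive times this message was seen
}

// appendEvent adds msg to events. A message identical to the most recent
// entry only bumps that entry's counter (aggregation). Otherwise a new entry
// is appended, and the oldest entries are dropped once the list grows past
// limit (rotation). A limit of 0 or less disables rotation.
func appendEvent(events []Event, msg string, limit int) []Event {
	if n := len(events); n > 0 && events[n-1].Message == msg {
		events[n-1].Count++
		return events
	}
	events = append(events, Event{Message: msg, Count: 1})
	if limit > 0 && len(events) > limit {
		events = events[len(events)-limit:]
	}
	return events
}

func main() {
	var events []Event
	// Simulate a tight retry loop that keeps failing with the same message.
	for i := 0; i < 1000; i++ {
		events = appendEvent(events, "apply failed", 3)
	}
	events = appendEvent(events, "recovered", 3)
	for _, e := range events {
		fmt.Printf("%s (x%d)\n", e.Message, e.Count)
	}
}
```

With a policy like this, a retry storm adds one entry with a large counter instead of one entry per attempt, so the status size stays bounded regardless of how often the reconcile is requeued.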

@xlgao-zju
Member Author

At the beginning, I appended an event whenever apply failed, and I got far too many error messages. The reason is:

apply failed -> update the status -> the update causes the object to be reconciled -> apply failed

Since this is a cycle, we end up with a huge number of record.events 😂

To avoid this bug, I did the following:

  • make the status of the common CRDs a subresource (otherwise modifying the status increases the generation of the resource)
  • for common CRDs, use WithEventFilter to filter the events, so that an event is picked up only when the labels, annotations, or spec change
  • for the CRDs controlled by common CRDs, there is no filter, because when the status of a child CR changes we need to reconcile the corresponding parent CR
  • add another condition step after the record step: since the object is not reconciled again when only the status changes, the conditions calculated before the record step may no longer be correct after the record step
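
The filtering rule in the second bullet can be sketched as a plain function. This is a simplified model, not the code in this PR: the `Resource` type and `shouldReconcile` are hypothetical, and in a real controller the same check would presumably live in the update callback of a predicate passed to WithEventFilter.

```go
package main

import (
	"fmt"
	"reflect"
)

// Resource is a hypothetical, heavily simplified custom resource.
type Resource struct {
	Labels      map[string]string
	Annotations map[string]string
	Spec        map[string]string
	Status      map[string]string
}

// shouldReconcile reports whether an update event deserves a reconcile:
// only when labels, annotations, or spec differ. A status-only update is
// filtered out, which breaks the "apply failed -> update status ->
// reconcile -> apply failed" cycle described above.
func shouldReconcile(oldObj, newObj Resource) bool {
	return !reflect.DeepEqual(oldObj.Labels, newObj.Labels) ||
		!reflect.DeepEqual(oldObj.Annotations, newObj.Annotations) ||
		!reflect.DeepEqual(oldObj.Spec, newObj.Spec)
}

func main() {
	base := Resource{
		Spec:   map[string]string{"action": "latency"},
		Status: map[string]string{"phase": "Injected"},
	}

	statusOnly := base
	statusOnly.Status = map[string]string{"phase": "Not Injected"}

	specChanged := base
	specChanged.Spec = map[string]string{"action": "fault"}

	fmt.Println("status-only update:", shouldReconcile(base, statusOnly))
	fmt.Println("spec update:", shouldReconcile(base, specChanged))
}
```

Note that `reflect.DeepEqual` treats a nil map and an empty map as different, so a real implementation would likely need a semantic comparison instead.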

The PR is ready for review
@STRRL @YangKeao @Hexilee

@xlgao-zju
Member Author

kindly ping @STRRL @YangKeao @Hexilee for review

@STRRL
Member

STRRL commented May 6, 2022

PTAL @YangKeao, this PR introduces status subresources for all the chaos CRDs. But I forgot why we did NOT use status subresources before. Maybe after nirvana we can use the subresource without anything blocking us? 🤔

@STRRL
Member

STRRL commented May 6, 2022

After discussing with @xlgao-zju, the volume of `events` might be a problem now.

There are two reasons `events` could grow quickly:

  • "one-to-many" chaos-to-records relation: when a chaos experiment uses a selector that matches a large number of targets, there will be many "records" in the status
  • uncontrolled retry/reconcile when Apply/Recover fails: we always requeue the reconcile request if the operation fails, using only the default rate limiter of controller-runtime, so many events with "similar" messages appear in a short time

Maybe we need an aggregation or rotation policy in the future. I am still optimistic about the impact of this problem, and I think we do not need to address it in this PR.

What do you think? @YangKeao @Hexilee @xlgao-zju

It seems this issue is NOT properly resolved; I can still reproduce it with these manifests:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:latest
        imagePullPolicy: Always
        name: nginx
        resources: {}
---
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-delay-example
spec:
  action: latency
  mode: all
  selector:
    labelSelectors:
      app: nginx
  volumePath: /not-existed
  path: /not-existed/**/*
  delay: "10ms"
  percent: 10

@STRRL
Member

STRRL commented May 6, 2022

It seems the predicates DO work. But far too many events are still appended, and the resourceVersion has gone up by nearly thousands. I am still trying to figure it out from the logs. 🤔

controllers/common/fx.go (outdated, resolved)
controllers/common/fx.go (outdated, resolved)
@xlgao-zju
Member Author

@STRRL @YangKeao I have removed the "sub resource" part and added a predicate that skips events in which ONLY object.status.experiment.records[].events changed.

PTAL
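
One way such a predicate could decide that only the events changed is to compare both versions of the object with the events stripped out. The sketch below is an assumption about the approach, not the code merged in this PR; `Chaos`, `Record`, `withoutEvents`, and `onlyEventsChanged` are hypothetical names, and the types are reduced to the fields needed for the comparison.

```go
package main

import (
	"fmt"
	"reflect"
)

// Record is a hypothetical stand-in for one entry of
// status.experiment.records.
type Record struct {
	ID     string
	Phase  string
	Events []string
}

// Chaos is a hypothetical, simplified chaos object.
type Chaos struct {
	Spec    string
	Records []Record
}

// withoutEvents returns a copy of obj whose records carry no events.
func withoutEvents(obj Chaos) Chaos {
	out := Chaos{Spec: obj.Spec}
	if obj.Records != nil {
		out.Records = make([]Record, len(obj.Records))
	}
	for i, r := range obj.Records {
		out.Records[i] = Record{ID: r.ID, Phase: r.Phase}
	}
	return out
}

// onlyEventsChanged reports whether the two versions differ, and differ
// solely in the events attached to their records. An update for which this
// returns true could be skipped by the event filter.
func onlyEventsChanged(oldObj, newObj Chaos) bool {
	if reflect.DeepEqual(oldObj, newObj) {
		return false // nothing changed at all
	}
	return reflect.DeepEqual(withoutEvents(oldObj), withoutEvents(newObj))
}

func main() {
	before := Chaos{Spec: "latency", Records: []Record{
		{ID: "default/nginx", Phase: "Not Injected", Events: []string{"apply failed"}},
	}}
	after := Chaos{Spec: "latency", Records: []Record{
		{ID: "default/nginx", Phase: "Not Injected", Events: []string{"apply failed", "apply failed"}},
	}}
	fmt.Println("skip update:", onlyEventsChanged(before, after))
}
```

Stripping and comparing copies keeps the rule robust: any other change to the object, including a phase change inside a record, still triggers a reconcile.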

Member

@STRRL STRRL left a comment

rest LGTM!

PTAL @xlgao-zju

controllers/common/step.go (outdated, resolved)
controllers/common/fx.go (outdated, resolved)
@xlgao-zju
Member Author

@STRRL updated

Member

@YangKeao YangKeao left a comment

Rest LGTM

controllers/common/fx.go (outdated, resolved)
Signed-off-by: xianglingao <xianglingao@tencent.com>
@xlgao-zju
Member Author

@YangKeao Updated

Member

@YangKeao YangKeao left a comment

LGTM

Member

@STRRL STRRL left a comment

LGTM!

@STRRL
Member

STRRL commented May 11, 2022

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: a8721fb

Development

Successfully merging this pull request may close these issues.

more clear status to tell if the experiment succeed
4 participants