Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easier twophase controller #1221

Merged
merged 13 commits into from Dec 15, 2020
Merged

Conversation

YangKeao
Copy link
Member

What problem does this PR solve?

#1207 #1214 Make the twophase scheduler easier to understand.

I think the twophase reconciler is too complicated. There are three parts:

  1. The basic twophase scheduler and retry logic
  2. Support pause
  3. Support updating scheduler

I created two controllers for two-phase scenario:

  1. Normal controller supports pause, retry, and basic scheduler
  2. A SchedulerUpdater controller to update nextStart when the scheduler changes.

The SchedulerUpdater controller is quite easy to understand. I will describe the Normal controller design:

  1. All modifications on chaos resource and apply/recover chaos should be triggered by an intention to turn into another phase.

  2. The intention to turn into another phase could be triggered by time, delete or pause.

  3. An intention could fail to set the phase because all side effect actions could fail and turn into the Failed phase.

@YangKeao
Copy link
Member Author

/run-e2e-tests

@YangKeao YangKeao changed the title Easier twophase controller [WIP] Easier twophase controller Nov 25, 2020
@YangKeao YangKeao marked this pull request as draft November 25, 2020 09:27
@YangKeao
Copy link
Member Author

/run-e2e-tests

@YangKeao YangKeao marked this pull request as ready for review November 26, 2020 06:50
@YangKeao YangKeao changed the title [WIP] Easier twophase controller Easier twophase controller Nov 26, 2020
@YangKeao
Copy link
Member Author

/run-e2e-tests

@YangKeao
Copy link
Member Author

/run-e2e-tests

Comment on lines 94 to 96
// Pause from a running phase should set the nextStart to now
// so that the Reconciler will start another cycle of running right
// after resume
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the cron is like 30 * * * *, it means run chaos on the 30th minute every hour, can't run right after resume

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cwen0 I don't know the exact definition of pause. It seems that the pause has two different meanings for these two kinds of scheduler.

Maybe we should clarify it first, through an RFC or an issue... I could modify the implementation according to that later (in another PR). Confusing feature will lead to confusing codes, which we always hate.

Copy link
Member

@Yiyiyimu Yiyiyimu Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think pause has two meanings for two kinds of schedulers. For the current implementation, in both situations, if we resume at the waiting state of the scheduler, we would turn the state from paused to waiting (at "Recovering"), get the next start time from the scheduler ("Waiting"), wait for the next start time and go into "Starting".
So it's like, the scheduler keeps the same, and the pause period is dug out of it.

For #1207, I checked the original implementation. Maybe I missed something, but I think change nextRecover from now + duration to nextStart + duration could fix it. Got it

@chaos-mesh chaos-mesh deleted a comment from WangXiangUSTC Nov 30, 2020
controllers/twophase/state_machine.go Outdated Show resolved Hide resolved
controllers/twophase/state_machine.go Show resolved Hide resolved
controllers/twophase/state_machine.go Outdated Show resolved Hide resolved
m.Log.Error(err, "failed to get the next start time")
return updated, err
}
nextRecover := startTime.Add(*duration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most people will treat nextStart and nextRecover as a set of data. next start then next recover. In fact, here nextRecover represents the recover time, not the next recover time. So I suggest renaming NextRecover directly to RecoverTime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The risk of renaming schema is too high. I would like to try to do it in another PR.

controllers/twophase/state_machine.go Outdated Show resolved Hide resolved
controllers/twophase/types.go Show resolved Hide resolved
controllers/twophase/update_scheduler.go Show resolved Hide resolved
}
status := chaos.GetStatus()

targetPhase := status.Experiment.Phase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If status.Experiment.Phaseand nextStartTime both are empty, what will happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

targetPhase will be Running, as chaos.GetNextStart().Before(now) is true.

@codecov-io
Copy link

codecov-io commented Dec 1, 2020

Codecov Report

Merging #1221 (16d533c) into master (7e9ff3f) will decrease coverage by 0.49%.
The diff coverage is 52.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1221      +/-   ##
==========================================
- Coverage   55.78%   55.28%   -0.50%     
==========================================
  Files          68       83      +15     
  Lines        4383     5070     +687     
==========================================
+ Hits         2445     2803     +358     
- Misses       1768     2006     +238     
- Partials      170      261      +91     
Impacted Files Coverage Δ
api/v1alpha1/common_types.go 0.00% <0.00%> (ø)
api/v1alpha1/common_webhook.go 100.00% <ø> (ø)
api/v1alpha1/dnschaos_type.go 0.00% <0.00%> (ø)
api/v1alpha1/dnschaos_webhook.go 0.00% <0.00%> (ø)
api/v1alpha1/httpchaos_types.go 0.00% <0.00%> (ø)
api/v1alpha1/iochaos_types.go 0.00% <ø> (-40.00%) ⬇️
api/v1alpha1/jvmchaos_webhook.go 0.00% <0.00%> (ø)
api/v1alpha1/kernelchaos_types.go 0.00% <ø> (-20.00%) ⬇️
api/v1alpha1/kernelchaos_webhook.go 100.00% <ø> (+14.81%) ⬆️
api/v1alpha1/kinds.go 27.27% <ø> (+0.60%) ⬆️
... and 125 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 689005d...fc7f02d. Read the comment docs.

cwen0
cwen0 previously approved these changes Dec 1, 2020
Copy link
Member

@cwen0 cwen0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 134 to 135
// if the duration is too long, `nextRecover` could be after `nextStart`
// we can jump over a start to make sure `nextRecover` is before `nextStart`
Copy link
Member

@Yiyiyimu Yiyiyimu Dec 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @YangKeao I have another idea to achieve maybe an ideal situation, that for standard cron (* * * * *), we could follow the schedule in this cycle (which we missed the start time), rather than start at the next cycle (or the cycle after the next).

So there are two situation: we entered in running period, or waiting period. We could calculate the whole cron cycle duration, and get the waitDuration from cycle - duration, which would be helpful in the following.

  1. Running period
    When time.Now().Before(nextStart - waitDuration), we could see we are in the running period. Entering state machine, still in this 20 conditions, but nextRecover should be nextStart - waitTime. And keep the same for others.

  2. Waiting period
    The same, when time.Now().After(nextStart - waitDuration). This time we should separate a condition, from Uninitialized/Paused to Waiting, and the only thing to do is set next start. After that it could be the same with the every type cron.

Copy link
Member

@STRRL STRRL Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cycle is not a stable value, for example: 5,15 * * * *, it means execute at each <any hour>:05 and <any hour>:15.

That's not an issue. I still considering other unexpected situations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@STRRL Yes you're right I missed this point. Maybe we could implement a Previous, just like Next but get the previous start time from certain time, which would also need when implementing pause for duration.
Actually someone creates a PR about this feature in cron repo, but the maintainer thinks it's outside the scope of that repo so we need to do this ourselves.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cycle is not a stable value, for example: 5/15 * * * *, it means execute at each <any hour>:05 and <any hour>:15.

That's not an issue. I still considering other unexpected situations.

Yeah but something like 5/18 * * * * would still cause an unstable cycle (run at 5, 23, 41, 59 in each hour)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yiyiyimu I have come up with an idea: if the current phase is Paused, we should ignore the targetPhase. And loop and iterate forward the nextStart and nextRecover until at least one of them is bigger than now.

The behavior of @every 5min is different as it may not apply right after resume. However, it works like erasing a period of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YangKeao Yes it could work for pause, but that still could not solve the problem when we just start, since we don't know the last start time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yiyiyimu I have implemented the logic mentioned above. Could you have a look?

@STRRL
Copy link
Member

STRRL commented Dec 4, 2020

I could not find where the first call of Chaos.SetNextStart();

- When the Chaos Object just creates, its Status is empty, GetNextStart() will return an empty time.Time{}, and phase is
ExperimentPhaseUnitialized

- And only state machine update this field when state turn to Running; controllers/twophase/state_machine.go:150-151
- The condition to turn into Running is chaos.GetNextStart().Before(now)controllers/twophase/types.go:72-74; But I think when the object just commit to Kubernetes, chaos.GetNextStart() returns an empty time.Time{}, so that the two-phase will not start.

Did I miss something? 😰

Oops, the empty time.Time{} is indeed "Before()" the time.Now()

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
@YangKeao
Copy link
Member Author

YangKeao commented Dec 7, 2020

/run-e2e-tests

1 similar comment
@YangKeao
Copy link
Member Author

YangKeao commented Dec 7, 2020

/run-e2e-tests

@cwen0 cwen0 requested a review from STRRL December 8, 2020 04:52
@cwen0 cwen0 self-requested a review December 11, 2020 03:42
cwen0
cwen0 previously approved these changes Dec 11, 2020
controllers/twophase/types.go Show resolved Hide resolved
status := chaos.GetStatus()
// TODO: find a better way to solve the pause and resume problem.
// Or pause is a bad design for the scheduler :(
if !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused {
Copy link
Contributor

@Colstuwjx Colstuwjx Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little confusion here.

At previous commit, each if-else, e.g. !chaos.GetNextRecover().IsZero() && chaos.GetNextRecover().Before(now) and !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused wouldn't run at the same time, but now, it did.

Think about the following case:

  1. a chaos created and started;
  2. it has been paused before nextRecover reached;
  3. nextRecover reached but nothing changed since the chaos is still been Paused :(
  4. the chaos has been unpaused

and now, both !chaos.GetNextRecover().IsZero() && chaos.GetNextRecover().Before(now) and !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused would be true, the targetPhase become ExperimentPhaseRunning!

Although it would not make the chaos really run apply, because Line 149 if check would be false, and the targetPhase would be changed to ExperimentPhaseWaiting after did resume.

Is this acceptable?

Copy link
Member

@cwen0 cwen0 Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation is consistent with the master branch, you can check here https://github.com/chaos-mesh/chaos-mesh/blob/master/controllers/twophase/types.go#L138. I think this implementation is acceptable. If we want to improve this, we can do it in another PR

Copy link
Contributor

@Colstuwjx Colstuwjx Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not the same case, what you mean in L138 should be started -> paused -> now -> nextRecover -> nextStart, and when I unpaused at now, it do resuming.

But what I mean is that started -> paused -> nextRecover -> now -> nextStart, and when I unpaused at now, it will enter both Line 68 and Line 82, and targetPhase become ExperimentPhaseRunning at this moment 😅

And then, it enter resume() Line 156.

It's not a big problem since it's finally do noop, but it still will leave one line log say:

2020-12-11T10:25:24.033Z	INFO	controllers.networkchaos.default/web-show-network-delay	change phase	{"reconciler": "networkchaos", "resource name": "default/web-show-network-delay", "current phase": "Paused", "target phase": "Running"}

and then turn to Waiting soon:

2020-12-11T10:25:24.051Z	INFO	controllers.networkchaos.default/web-show-network-delay	change phase	{"reconciler": "networkchaos", "resource name": "default/web-show-network-delay", "current phase": "Waiting", "target phase": "Waiting"}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's the dirty part. I have commented that target Running and Recovering has the same behavior for resume. I haven't come up with a good idea on how to solve the problem 😭

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always think this is a feature that you could see the chaos experiments being resumed IMMEDIATELY after clicking unpause.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did think so, how about make this targetPhase to a temporary phase, e.g. v1alpha1.ExperimentPhaseResuming 😁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@STRRL This phase will not be updated to Kubernetes. As you can see in the state machine, though we passed targetPhase as Running, it will be updated to Waiting (or Failed). It can only be seen in the log.

Signed-off-by: YangKeao <keao.yang@yahoo.com>
@YangKeao
Copy link
Member Author

/run-e2e-tests

Signed-off-by: Yang Keao <keao.yang@yahoo.com>
@Colstuwjx
Copy link
Contributor

Merging this PR would make two benifits:

  1. easier to change business codes for two-phase controller in the future;

  2. fixed scheduler: nextRecover shouldn't be later than nextStart #1207

And I think we could leave some existing issues to another PRs to fix.

Copy link
Contributor

@Colstuwjx Colstuwjx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@YangKeao
Copy link
Member Author

/merge

@YangKeao
Copy link
Member Author

/merge

@YangKeao
Copy link
Member Author

/run-e2e-tests

@YangKeao
Copy link
Member Author

YangKeao commented Dec 15, 2020

The bot doesn't work 👿 After running the e2e test, I will merge by myself.

@ti-srebot ti-srebot merged commit b0e257c into chaos-mesh:master Dec 15, 2020
@Colstuwjx Colstuwjx mentioned this pull request Dec 18, 2020
@YangKeao YangKeao mentioned this pull request Jan 11, 2021
@dcalvin dcalvin mentioned this pull request Feb 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants