Easier twophase controller #1221

YangKeao · 2020-11-25T09:18:05Z

What problem does this PR solve?

#1207 #1214 Make the twophase scheduler easier to understand.

I think the twophase reconciler is too complicated. There are three parts:

The basic twophase scheduler and retry logic

Support pause

Support updating scheduler

I created two controllers for the two-phase scenario:

A Normal controller that supports pause, retry, and the basic scheduler.
A SchedulerUpdater controller that updates nextStart when the scheduler changes.

The SchedulerUpdater controller is quite easy to understand. I will describe the Normal controller design:

All modifications to the chaos resource, and every apply/recover of chaos, should be triggered by an intention to turn into another phase.
The intention to turn into another phase can be triggered by time, deletion, or pause.
An intention may fail to set the phase, because any side-effect action can fail; in that case the resource turns into the Failed phase.

YangKeao · 2020-11-25T09:20:39Z

/run-e2e-tests

YangKeao · 2020-11-26T06:49:12Z

/run-e2e-tests

YangKeao · 2020-11-26T08:04:08Z

/run-e2e-tests

YangKeao · 2020-11-26T12:36:57Z

/run-e2e-tests


WangXiangUSTC · 2020-11-30T06:50:00Z

controllers/twophase/state_machine.go

+		// Pause from a running phase should set the nextStart to now
+		// so that the Reconciler will start another cycle of running right
+		// after resume


If the cron is something like 30 * * * *, chaos runs on the 30th minute of every hour, so it can't run right after resume.

@cwen0 I don't know the exact definition of pause. It seems that pause has two different meanings for these two kinds of scheduler.

Maybe we should clarify it first, through an RFC or an issue... I could modify the implementation according to that later (in another PR). A confusing feature leads to confusing code, which we always hate.

I don't think pause has two meanings for the two kinds of schedulers. In the current implementation, in both situations, if we resume while the scheduler is in the waiting state, we turn the state from paused to waiting (at "Recovering"), get the next start time from the scheduler ("Waiting"), wait for the next start time, and go into "Starting".
So the schedule stays the same, and the pause period is simply cut out of it.

~~For #1207, I checked the original implementation. Maybe I missed something, but I think change nextRecover from now + duration to nextStart + duration could fix it.~~ Got it

cwen0 · 2020-11-30T11:12:59Z

controllers/twophase/state_machine.go

+				m.Log.Error(err, "failed to get the next start time")
+				return updated, err
+			}
+			nextRecover := startTime.Add(*duration)


Most people will treat nextStart and nextRecover as a pair: next start, then next recover. In fact, nextRecover here represents the recover time, not the next recover time. So I suggest renaming NextRecover directly to RecoverTime.

The risk of renaming the schema is too high. I would like to do it in another PR.

cwen0 · 2020-11-30T13:07:06Z

controllers/twophase/types.go

-	}
+	status := chaos.GetStatus()
+
+	targetPhase := status.Experiment.Phase


If status.Experiment.Phase and nextStartTime are both empty, what will happen?

targetPhase will be Running, as chaos.GetNextStart().Before(now) is true.

codecov-io · 2020-12-01T07:10:07Z

Codecov Report

Merging #1221 (16d533c) into master (7e9ff3f) will decrease coverage by 0.49%.
The diff coverage is 52.68%.

@@            Coverage Diff             @@
##           master    #1221      +/-   ##
==========================================
- Coverage   55.78%   55.28%   -0.50%     
==========================================
  Files          68       83      +15     
  Lines        4383     5070     +687     
==========================================
+ Hits         2445     2803     +358     
- Misses       1768     2006     +238     
- Partials      170      261      +91

| Impacted Files | Coverage Δ | |
|---|---|---|
| api/v1alpha1/common_types.go | `0.00% <0.00%> (ø)` | |
| api/v1alpha1/common_webhook.go | `100.00% <ø> (ø)` | |
| api/v1alpha1/dnschaos_type.go | `0.00% <0.00%> (ø)` | |
| api/v1alpha1/dnschaos_webhook.go | `0.00% <0.00%> (ø)` | |
| api/v1alpha1/httpchaos_types.go | `0.00% <0.00%> (ø)` | |
| api/v1alpha1/iochaos_types.go | `0.00% <ø> (-40.00%)` | ⬇️ |
| api/v1alpha1/jvmchaos_webhook.go | `0.00% <0.00%> (ø)` | |
| api/v1alpha1/kernelchaos_types.go | `0.00% <ø> (-20.00%)` | ⬇️ |
| api/v1alpha1/kernelchaos_webhook.go | `100.00% <ø> (+14.81%)` | ⬆️ |
| api/v1alpha1/kinds.go | `27.27% <ø> (+0.60%)` | ⬆️ |

... and 125 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 689005d...fc7f02d.

cwen0

LGTM

Yiyiyimu · 2020-12-03T14:47:34Z

controllers/twophase/state_machine.go

+			// if the duration is too long, `nextRecover` could be after `nextStart`
+			// we can jump over a start to make sure `nextRecover` is before `nextStart`


Hi @YangKeao, I have another idea that might get us to the ideal behavior: for standard cron (* * * * *), we could follow the schedule of the current cycle (whose start time we missed), rather than starting at the next cycle (or the cycle after next).

So there are two situations: we enter during the running period, or during the waiting period. We could calculate the duration of the whole cron cycle and derive waitDuration as cycle - duration, which is used below.

Running period
When time.Now().Before(nextStart - waitDuration), we are in the running period. We enter the state machine under the same 20 conditions, but nextRecover should be nextStart - waitDuration. Everything else stays the same.

Waiting period
Likewise, when time.Now().After(nextStart - waitDuration), we are in the waiting period. Here we need a separate transition, from Uninitialized/Paused to Waiting, whose only job is to set the next start. After that it behaves the same as an @every-type cron.

~~cycle is not a stable value, for example: 5,15 * * * *, it means execute at each <any hour>:05 and <any hour>:15.~~

That's not an issue. I am still considering other unexpected situations.

@STRRL Yes, you're right, I missed this point. Maybe we could implement a Previous, just like Next, that returns the previous start time before a given time; we would also need it when implementing pause for a duration.
Actually, someone created a PR for this feature in the cron repo, but the maintainer thinks it's outside the scope of that repo, so we need to do this ourselves.

~~cycle is not a stable value, for example: 5/15 * * * *, it means execute at each <any hour>:05 and <any hour>:15.~~

That's not an issue. I am still considering other unexpected situations.

Yeah but something like 5/18 * * * * would still cause an unstable cycle (run at 5, 23, 41, 59 in each hour)

@Yiyiyimu I have come up with an idea: if the current phase is Paused, we should ignore the targetPhase, and iterate nextStart and nextRecover forward until at least one of them is later than now.

The behavior of @every 5min is different as it may not apply right after resume. However, it works like erasing a period of time.

@YangKeao Yes, it could work for pause, but it still would not solve the problem at first start, since we don't know the last start time.

@Yiyiyimu I have implemented the logic mentioned above. Could you have a look?

STRRL · 2020-12-04T03:33:50Z

~~I could not find where the first call of Chaos.SetNextStart();~~

- When the Chaos object is first created, its Status is empty, GetNextStart() returns an empty time.Time{}, and the phase is ExperimentPhaseUninitialized
~~- And only state machine update this field when state turn to Running; controllers/twophase/state_machine.go:150-151~~
- The condition to turn into Running is chaos.GetNextStart().Before(now) (controllers/twophase/types.go:72-74). But I think that when the object has just been committed to Kubernetes, chaos.GetNextStart() returns an empty time.Time{}, so the two-phase controller will not start.

~~Did I miss something? 😰~~

Oops, the empty time.Time{} is indeed "Before()" the time.Now()

YangKeao · 2020-12-07T06:15:44Z

/run-e2e-tests

YangKeao · 2020-12-07T06:49:09Z

/run-e2e-tests

Colstuwjx · 2020-12-11T10:43:34Z

controllers/twophase/types.go

-	status := chaos.GetStatus()
+	// TODO: find a better way to solve the pause and resume problem.
+	// Or pause is a bad design for the scheduler :(
+	if !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused {


A little confusion here.

In the previous commit, the if-else branches, e.g. !chaos.GetNextRecover().IsZero() && chaos.GetNextRecover().Before(now) and !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused, could not both run in the same pass, but now they can.

Think about the following case:

a chaos is created and started;

it is paused before nextRecover is reached;

nextRecover is reached, but nothing changes since the chaos is still Paused :(

the chaos is unpaused

Now both !chaos.GetNextRecover().IsZero() && chaos.GetNextRecover().Before(now) and !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused are true, and the targetPhase becomes ExperimentPhaseRunning!

It would not actually apply the chaos, though, because the if check at Line 149 would be false, and the targetPhase would be changed to ExperimentPhaseWaiting after the resume.

Is this acceptable?

The current implementation is consistent with the master branch; you can check it here: https://github.com/chaos-mesh/chaos-mesh/blob/master/controllers/twophase/types.go#L138. I think this implementation is acceptable. If we want to improve this, we can do it in another PR.

I think it's not the same case. What you mean at L138 is started -> paused -> now -> nextRecover -> nextStart, where unpausing at now does the resume.

But what I mean is started -> paused -> nextRecover -> now -> nextStart: when I unpause at now, it enters both Line 68 and Line 82, and targetPhase becomes ExperimentPhaseRunning at this moment 😅

And then it enters resume() at Line 156.

It's not a big problem since it ends up as a no-op, but it still leaves one log line saying:

2020-12-11T10:25:24.033Z INFO controllers.networkchaos.default/web-show-network-delay change phase {"reconciler": "networkchaos", "resource name": "default/web-show-network-delay", "current phase": "Paused", "target phase": "Running"}

and then it turns to Waiting shortly after:

2020-12-11T10:25:24.051Z INFO controllers.networkchaos.default/web-show-network-delay change phase {"reconciler": "networkchaos", "resource name": "default/web-show-network-delay", "current phase": "Waiting", "target phase": "Waiting"}

Yes. It's the dirty part. I have commented that the Running and Recovering targets have the same behavior for resume. I haven't come up with a good idea for how to solve the problem 😭

I always thought of it as a feature: you can see the chaos experiment being resumed IMMEDIATELY after clicking unpause.

If we think so, how about making this targetPhase a temporary phase, e.g. v1alpha1.ExperimentPhaseResuming 😁

@STRRL This phase will not be updated to Kubernetes. As you can see in the state machine, though we passed targetPhase as Running, it will be updated to Waiting (or Failed). It can only be seen in the log.

YangKeao · 2020-12-14T06:40:08Z

/run-e2e-tests

Colstuwjx · 2020-12-15T05:48:58Z

Merging this PR would bring two benefits:

it becomes easier to change the business code of the two-phase controller in the future;
it fixes scheduler: nextRecover shouldn't be later than nextStart #1207

And I think we could leave the remaining issues to be fixed in other PRs.

Colstuwjx

LGTM

YangKeao · 2020-12-15T08:41:16Z

/merge

YangKeao · 2020-12-15T09:15:47Z

/merge

YangKeao · 2020-12-15T11:20:35Z

/run-e2e-tests

YangKeao · 2020-12-15T11:20:48Z

The bot doesn't work 👿 After the e2e tests finish, I will merge it myself.

YangKeao force-pushed the easier-twophase branch from b35cca2 to 7579b06 Compare November 25, 2020 09:20

YangKeao changed the title ~~Easier twophase controller~~ [WIP] Easier twophase controller Nov 25, 2020

YangKeao marked this pull request as draft November 25, 2020 09:27

YangKeao marked this pull request as ready for review November 26, 2020 06:50

YangKeao changed the title ~~[WIP] Easier twophase controller~~ Easier twophase controller Nov 26, 2020

YangKeao added the needs-cherry-pick-1.0 label Nov 26, 2020

WangXiangUSTC reviewed Nov 27, 2020

WangXiangUSTC mentioned this pull request Nov 27, 2020

scheduler: fix error nextRecover shouldn't be later than nextStart #1214

Closed

6 tasks

WangXiangUSTC reviewed Nov 30, 2020

chaos-mesh deleted a comment from WangXiangUSTC Nov 30, 2020

cwen0 reviewed Nov 30, 2020

cwen0 previously approved these changes Dec 1, 2020

Yiyiyimu mentioned this pull request Dec 3, 2020

feature: pause for certain duration #1016

Closed

7 tasks

cwen0 requested a review from WangXiangUSTC December 3, 2020 12:02

Yiyiyimu reviewed Dec 3, 2020

Colstuwjx reviewed Dec 4, 2020

YangKeao dismissed cwen0’s stale review via 120d40d December 7, 2020 06:01

YangKeao force-pushed the easier-twophase branch from a315baa to 986bf76 Compare December 7, 2020 06:07

easier twophase

e5779d0

Signed-off-by: Yang Keao <keao.yang@yahoo.com>

YangKeao force-pushed the easier-twophase branch from 986bf76 to e5779d0 Compare December 7, 2020 06:15

cwen0 requested a review from STRRL December 8, 2020 04:52

cwen0 self-requested a review December 11, 2020 03:42

cwen0 previously approved these changes Dec 11, 2020

Colstuwjx reviewed Dec 11, 2020

cwen0 added component/operator and removed status/can-merge labels Dec 11, 2020

cwen0 mentioned this pull request Dec 11, 2020

Update not working until chaos restart #1259

Open

Yiyiyimu reviewed Dec 11, 2020

YangKeao added 2 commits December 14, 2020 12:54

Merge remote-tracking branch 'upstream/master' into easier-twophase

6117cb9

Signed-off-by: YangKeao <keao.yang@yahoo.com>

Merge remote-tracking branch 'origin/easier-twophase' into easier-two…

4995a56

…phase

Colstuwjx reviewed Dec 14, 2020

fix pause to pause senerio

a7a8fa0

Signed-off-by: YangKeao <keao.yang@yahoo.com>

YangKeao dismissed cwen0’s stale review via a7a8fa0 December 14, 2020 06:39

YangKeao removed the needs-cherry-pick-1.0 label Dec 15, 2020

Merge remote-tracking branch 'upstream/master' into easier-twophase

8c5ff75

Signed-off-by: Yang Keao <keao.yang@yahoo.com>

Colstuwjx approved these changes Dec 15, 2020

WangXiangUSTC mentioned this pull request Dec 15, 2020

scheduler: fix error nextRecover shouldn't be later than nextStart #1284

Merged

6 tasks

Yiyiyimu mentioned this pull request Dec 15, 2020

Bugfix: Use Last to get current phase at start time of standard cron #1286

Merged

6 tasks

cwen0 approved these changes Dec 15, 2020

ti-srebot added the status/can-merge label Dec 15, 2020

Merge branch 'master' into easier-twophase

3487eee

ti-srebot merged commit b0e257c into chaos-mesh:master Dec 15, 2020

Colstuwjx mentioned this pull request Dec 18, 2020

Support one-time job #1301

Closed

YangKeao mentioned this pull request Jan 11, 2021

scheduler not on cron #1379

Closed

dcalvin mentioned this pull request Feb 5, 2021

promote committers #1492

Merged
