Easier twophase controller #1221
b35cca2 to 7579b06
/run-e2e-tests
// Pause from a running phase should set the nextStart to now
// so that the Reconciler will start another cycle of running right
// after resume
If the cron is like `30 * * * *`, it means running chaos at the 30th minute of every hour, so it can't run right after resume.
@cwen0 I don't know the exact definition of `pause`. It seems that `pause` has two different meanings for these two kinds of scheduler.
Maybe we should clarify it first, through an RFC or an issue... I could modify the implementation according to that later (in another PR). A confusing feature leads to confusing code, which we always hate.
I don't think `pause` has two meanings for the two kinds of schedulers. In the current implementation, in both situations, if we resume at the waiting state of the scheduler, we turn the state from paused to waiting (at "Recovering"), get the next start time from the scheduler ("Waiting"), wait for the next start time and go into "Starting".
So it's as if the schedule stays the same, and the pause period is cut out of it.
For #1207, I checked the original implementation. Maybe I missed something, but I think changing nextRecover from `now + duration` to `nextStart + duration` could fix it.
m.Log.Error(err, "failed to get the next start time")
return updated, err
}
nextRecover := startTime.Add(*duration)
Most people will treat `nextStart` and `nextRecover` as a pair: next start, then next recover. In fact, `nextRecover` here represents the recover time, not the next recover time. So I suggest renaming `NextRecover` directly to `RecoverTime`.
The risk of renaming the schema is too high. I would like to try to do it in another PR.
}
status := chaos.GetStatus()

targetPhase := status.Experiment.Phase
If `status.Experiment.Phase` and `nextStartTime` are both empty, what will happen?
`targetPhase` will be `Running`, as `chaos.GetNextStart().Before(now)` is true.
Codecov Report
@@            Coverage Diff             @@
##           master    #1221      +/-   ##
==========================================
- Coverage   55.78%   55.28%   -0.50%
==========================================
  Files          68       83      +15
  Lines        4383     5070     +687
==========================================
+ Hits         2445     2803     +358
- Misses       1768     2006     +238
- Partials      170      261      +91
LGTM
// if the duration is too long, `nextRecover` could be after `nextStart`
// we can jump over a start to make sure `nextRecover` is before `nextStart`
Hi @YangKeao, I have another idea that might achieve an ideal situation: for standard cron (`* * * * *`), we could follow the schedule in the current cycle (whose start time we missed), rather than start at the next cycle (or the cycle after the next).
So there are two situations: we entered in the running period, or in the waiting period. We could calculate the whole cron cycle duration and get `waitDuration` from `cycle - duration`, which will be helpful in the following.
- Running period
  When `time.Now().Before(nextStart - waitDuration)`, we know we are in the running period. We enter the state machine, still with these 20 conditions, but nextRecover should be `nextStart - waitDuration`. Everything else stays the same.
- Waiting period
  Likewise, when `time.Now().After(nextStart - waitDuration)`. This time we should separate out a condition, from `Uninitialized/Paused` to `Waiting`, and the only thing to do is set the next start. After that it could be the same as the `every` type cron.
`cycle` is not a stable value. For example, `5,15 * * * *` means executing at `<any hour>:05` and `<any hour>:15`.
That's not an issue. I am still considering other unexpected situations.
@STRRL Yes, you're right, I missed this point. Maybe we could implement a `Previous`, just like `Next`, but getting the previous start time from a certain time, which would also be needed when implementing pause for a duration.
Actually, someone created a PR for this feature in the cron repo, but the maintainer thinks it's outside the scope of that repo, so we need to do this ourselves.
> `cycle` is not a stable value, for example: `5/15 * * * *`, it means execute at each `<any hour>:05` and `<any hour>:15`.
> That's not an issue. I still considering other unexpected situations.

Yeah, but something like `5/18 * * * *` would still cause an unstable cycle (run at minutes 5, 23, 41, 59 of each hour).
@Yiyiyimu I have come up with an idea: if the current phase is `Paused`, we should ignore the `targetPhase`, and iterate `nextStart` and `nextRecover` forward until at least one of them is later than `now`.
The behavior of `@every 5min` is different, as it may not apply right after resume. However, it works like erasing a period of time.
@YangKeao Yes, it could work for `pause`, but that still could not solve the problem when we just start, since we don't know the last start time.
@Yiyiyimu I have implemented the logic mentioned above. Could you have a look?
Oops, the empty |
a315baa to 986bf76
Signed-off-by: Yang Keao <keao.yang@yahoo.com>
986bf76 to e5779d0
/run-e2e-tests
status := chaos.GetStatus()
// TODO: find a better way to solve the pause and resume problem.
// Or pause is a bad design for the scheduler :(
if !chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused {
A little confusion here.
In the previous commit, the if-else branches, e.g. `!chaos.GetNextRecover().IsZero() && chaos.GetNextRecover().Before(now)` and `!chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused`, could not both run in the same pass, but now they can.
Think about the following case:
- a chaos is created and started;
- it is paused before nextRecover is reached;
- nextRecover is reached but nothing changes, since the chaos is still `Paused` :(
- the chaos is unpaused

Now both `!chaos.GetNextRecover().IsZero() && chaos.GetNextRecover().Before(now)` and `!chaos.IsPaused() && status.Experiment.Phase == v1alpha1.ExperimentPhasePaused` are true, and the targetPhase becomes `ExperimentPhaseRunning`!
It would not make the chaos really run `apply`, though, because the `if` check at Line 149 would be false, and the targetPhase would be changed to `ExperimentPhaseWaiting` after `resume` is done.
Is this acceptable?
The current implementation is consistent with the master branch; you can check it here: https://github.com/chaos-mesh/chaos-mesh/blob/master/controllers/twophase/types.go#L138. I think this implementation is acceptable. If we want to improve this, we can do it in another PR.
I think it's not the same case. What you mean at L138 should be `started -> paused -> now -> nextRecover -> nextStart`, and when I unpause at now, it does the resuming.
But what I mean is `started -> paused -> nextRecover -> now -> nextStart`: when I unpause at now, it will enter both Line 68 and Line 82, and targetPhase becomes `ExperimentPhaseRunning` at this moment 😅
And then it enters `resume()` at Line 156.
It's not a big problem since it finally does a noop, but it still leaves one log line saying:
2020-12-11T10:25:24.033Z INFO controllers.networkchaos.default/web-show-network-delay change phase {"reconciler": "networkchaos", "resource name": "default/web-show-network-delay", "current phase": "Paused", "target phase": "Running"}
and then turns to `Waiting` soon:
2020-12-11T10:25:24.051Z INFO controllers.networkchaos.default/web-show-network-delay change phase {"reconciler": "networkchaos", "resource name": "default/web-show-network-delay", "current phase": "Waiting", "target phase": "Waiting"}
Yes. It's the dirty part. I have commented that the targets `Running` and `Recovering` have the same behavior for `resume`. I haven't come up with a good idea on how to solve the problem 😭
I have always thought of it as a feature that you can see the chaos experiments being resumed IMMEDIATELY after clicking unpause.
If we think so, how about making this targetPhase a temporary phase, e.g. `v1alpha1.ExperimentPhaseResuming` 😁
@STRRL This phase will not be updated to Kubernetes. As you can see in the state machine, though we pass `targetPhase` as `Running`, it will be updated to `Waiting` (or `Failed`). It can only be seen in the log.
Signed-off-by: YangKeao <keao.yang@yahoo.com>
Signed-off-by: YangKeao <keao.yang@yahoo.com>
/run-e2e-tests
Signed-off-by: Yang Keao <keao.yang@yahoo.com>
Merging this PR would bring two benefits:
And I think we could leave some existing issues to other PRs to fix.
LGTM
/merge
/run-e2e-tests
The bot doesn't work 👿 After the e2e tests run, I will merge it myself.
What problem does this PR solve?
#1207 #1214 Make the `twophase` scheduler easier to understand.
I created two controllers for the two-phase scenario:
`SchedulerUpdater` controller to update `nextStart` when the `scheduler` changes.
The `SchedulerUpdater` controller is quite easy to understand. I will describe the `Normal` controller design:
All modifications on the chaos resource and apply/recover chaos should be triggered by an intention to turn into another phase.
The intention to turn into another phase could be triggered by time, delete or pause.
An intention could fail to set the phase, because any side-effect action could fail and turn the phase into `Failed`.