Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bursts of scheduled tasks due to scheduler drift compensation #26910

Closed
patriknw opened this issue May 13, 2019 · 10 comments
Closed

Bursts of scheduled tasks due to scheduler drift compensation #26910

patriknw opened this issue May 13, 2019 · 10 comments
Assignees
Labels
1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:core
Milestone

Comments

@patriknw
Copy link
Member

When scheduling periodic tasks/messages the scheduler compensates for drift:

val driftNanos = clock() - getAndAdd(delay.toNanos)
if (self.get != null)
swap(schedule(executor, this, Duration.fromNanos(Math.max(delay.toNanos - driftNanos, 1))))

That is probably good for small variations, but it results in bursts when the process is suspended for long time (because of GC, or simulations with kill -STOP, and I can imagine that processes are suspended in virtualized or container environments)

Example to reproduce:

    val system = ActorSystem("DriftTest")

    import scala.concurrent.duration._
    import system.dispatcher
    @volatile var time = System.nanoTime()
    system.scheduler.schedule(1.second, 1.second) {
      val now = System.nanoTime()
      val duration = (now - time).nanos
      time = now
      println(s"# tick ${duration.toMillis} ms") // FIXME
    }

use kill -STOP pid, wait for 10 seconds and then kill -CONT pid

Example output (with some extra logging in the scheduler)

# scheduler drift 13 ms => schedule next in 986  ms
# tick 997 ms
# scheduler drift 10 ms => schedule next in 989  ms
# tick 1001 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 1000 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 1000 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 999 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 998 ms
# scheduler drift 11 ms => schedule next in 988  ms

[1]+  Stopped

[13:41:56] [patrik@pn2 /Users/patrik/dev/akka]
$ # tick 15005 ms
# scheduler drift 14017 ms => schedule next in 0  ms
# tick 0 ms
# scheduler drift 13017 ms => schedule next in 0  ms
# tick 3 ms
# scheduler drift 12020 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 11031 ms => schedule next in 0  ms
# tick 11 ms
# scheduler drift 10042 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 9052 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 8062 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 7072 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 6082 ms => schedule next in 0  ms
# tick 11 ms
# scheduler drift 5093 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 4103 ms => schedule next in 0  ms
# tick 8 ms
# scheduler drift 3112 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 2121 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 1131 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 141 ms => schedule next in 858  ms
# tick 870 ms
# scheduler drift 11 ms => schedule next in 988  ms
# tick 1002 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 998 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 1000 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 1000 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 999 ms
# scheduler drift 12 ms => schedule next in 987  ms

I think this is what causing the problem reported in #26786
but I can imagine all kind of things that will add massive load on the system when waking up and running all scheduled tasks repeatedly to catch up with missed ticks..

@patriknw patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:core labels May 13, 2019
@patriknw patriknw self-assigned this May 13, 2019
@patriknw patriknw added the 3 - in progress Someone is working on this ticket label May 13, 2019
@patriknw patriknw added this to the 2.6.0-M2 milestone May 14, 2019
@jrudolph
Copy link
Member

That behavior is at least partly specified in schedule's Scaladoc:

If the execution of the runnable takes longer than the interval, the subsequent execution will start immediately after the prior one completes

I wonder, though, if that behavior is really needed by most callers. It would be good if we could offer another API that just discards the missed runs and just carries on scheduling the next task with the original interval. Would be good if we could make this even the default behavior (but how to do so compatibly?).

Assuming that we need to keep the current behavior as well for compatibility reasons, could we improve it to lower the impact of the burst? Delaying the missed runs might just be shifting the problem to a later point if the system currently cannot keep up anyway.

@patriknw
Copy link
Member Author

With the config I added in the PR it's possible to disable the drift compensation completely by setting it to 0 s. That config is global, but I doubt users care to have different behavior for different things.

@jrudolph
Copy link
Member

With the config I added in the PR it's possible to disable the drift compensation completely by setting it to 0 s. That config is global, but I doubt users care to have different behavior for different things.

How safe is that? E.g. if you implement a clock with scheduled ticks, that will go wrong if there are GC pauses which are too long. The existing code, explicitly tries to keep an invariant that if you schedule something you will (in steady state) get total-time-elapsed / interval messages. A clock implemented that way might go wrong temporarily but will catch up eventually (as long as nanoTime is accurate).

Users might not have implemented a simple clock but something similar based on that invariant, so I'm not so sure we can just change the behavior in that regard.

@jrudolph
Copy link
Member

I think that drift is misleading because it's not about being late but about keeping the average rate.

@patriknw
Copy link
Member Author

ok, if that shall continue to be a strong invariant of schedule we should at least make the catch up more smooth than firing all missing immediately after suspension. By not scheduling with 0 delay to catch up but use something like half of the average (given) delay.

Then we would also need to add another schedule that doesn't have this property and use that in all places internally because I don't think there is one single place where we want/need the compensation.

Having both alternatives in end-user api feels like too many alternatives for something that very few would care about. Maybe we could change behavior in 2.6 but with a config that switch back to old behavior?

@patriknw
Copy link
Member Author

I did some more research of what we have in Akka and how it's named and documented in the JDK.

In Akka:

  • Scheduler
    • schedule (interval parameter)
      • variant for running function (Runnable)
      • variant for sending message
      • doc: to be run repeatedly with an initial delay and a frequency
      • doc: If the execution of the runnable takes longer than the interval, the
        subsequent execution will start immediately after the prior one completes
        (there will be no overlap of executions of the runnable). In such cases,
        the actual execution interval will differ from the interval passed to this
        method.
      • but this is not documented on the message based methods (but given the same name it should be assumed to have same semantics)
  • FSM
    • setTimer repeat=true (timeout parameter)
      • doc: delay of first message delivery and between subsequent messages)
  • Actor with Timers (TimerScheduler)
    • startPeriodicTimer (interval parameter)
    • doc: Start a periodic timer that will send msg to the self actor at a fixed interval.
  • TimerGraphStageLogic
    • schedulePeriodically (interval parameter)
      • periodically with the given interval
      • e.g. used by throttle GroupedWeightedWithin, TickSource
  • typed.TimerScheduler
    • startPeriodicTimer (interval parameter)
      • doc: Start a periodic timer that will send msg to the self actor at a fixed interval.

In JDK:

  • ScheduledExecutorService (ScheduledThreadPoolExecutor)
    • scheduleAtFixedRate (period parameter)
    • scheduleWithFixedDelay (delay parameter)
      • doc: If any execution of this task
        takes longer than its period, then subsequent executions
        may start late, but will not concurrently execute.
      • doc: the period between successive executions
  • java.util.Timer
    • schedule (period parameter) fixed-delay
    • scheduleAtFixedRate (delay parameter)
    • doc: If a timer task takes excessive time
      complete, it "hogs" the timer's task execution thread. This can, in
      turn, delay the execution of subsequent tasks, which may "bunch up" and
      execute in rapid succession when (and if) the offending task finally
      completes.

java.util.Timer also have some nice descriptions:

In fixed-delay execution, each execution is scheduled relative to
the actual execution time of the previous execution. If an execution
is delayed for any reason (such as garbage collection or other
background activity), subsequent executions will be delayed as well.
In the long run, the frequency of execution will generally be slightly
lower than the reciprocal of the specified period.

Fixed-delay execution is appropriate for recurring activities
that require "smoothness." In other words, it is appropriate for
activities where it is more important to keep the frequency accurate
in the short run than in the long run. This includes most animation
tasks, such as blinking a cursor at regular intervals. It also includes
tasks wherein regular activity is performed in response to human
input, such as automatically repeating a character as long as a key
is held down.

In fixed-rate execution, each execution is scheduled relative to the
scheduled execution time of the initial execution. If an execution is
delayed for any reason (such as garbage collection or other background
activity), two or more executions will occur in rapid succession to
"catch up." In the long run, the frequency of execution will be
exactly the reciprocal of the specified period.

Fixed-rate execution is appropriate for recurring activities that
are sensitive to absolute time, such as ringing a chime every
hour on the hour, or running scheduled maintenance every day at a
particular time. It is also appropriate for recurring activities
where the total time to perform a fixed number of executions is
important, such as a countdown timer that ticks once every second for
ten seconds. Finally, fixed-rate execution is appropriate for
scheduling multiple repeating timer tasks that must remain synchronized
with respect to one another.

Given these new insights I have changed my mind. My first approach is not good since it's a mix of the two. We can't change the existing semantics, at least not for Scheduler.schedule, and we have to support both both ways.

In Scheduler we can't easily add a new method (because it's a public interface that others may have implemented), but we can provide the fixed-delay variant as a utility method in Scheduler companion (can be implemented on top of scheduleOnce).

For the other APIs we can add new methods and maybe it's best to deprecate the old one since its semantics was not well defined regarding this and I think most users don't want current fixed-rate semantics and the deprecation warning can be a good signal for that there is a choice to be made. Adding two new methods such as startTimerAtFixedRate and startTimerWithFixedDelay.

@jrudolph
Copy link
Member

Great overview!

In Scheduler we can't easily add a new method (because it's a public interface that others may have implemented), but we can provide the fixed-delay variant as a utility method in Scheduler companion (can be implemented on top of scheduleOnce).

Shouldn't it be the end goal also for Scheduler to stear users away from the the current schedule to a new scheduleWithFixedDelay method? I wonder if that will work if it's in the companion object.

Is the concern about Scheduler that we might introduce a new method whose name is already taken by an implementing class? Otherwise, it seems we could also think about deprecating the current schedule method (somewhat drastically) and the provide default implementations of the new scheduleWithFixedDelay and scheduleWithFixedRate directly in Scheduler in terms of scheduleOnce. Could that work?

@patriknw
Copy link
Member Author

Shouldn't it be the end goal also for Scheduler to stear users away from the the current schedule to a new scheduleWithFixedDelay method? I wonder if that will work if it's in the companion object.

I agree, but Scheduler is not the primary end user api any more. We have higher level APIs such the various Timers APIs. I see Scheduler as a low level tool.

Is the concern about Scheduler that we might introduce a new method whose name is already taken by an implementing class?

That is one concern, but another more important is that adding a new abstract method will break existing implementations if we can't provide a default implementation of it. Looking closer... I thought Scheduler was a pure interface without implementations, but it's not, so maybe we can add a default implementation of that method.

However, that would not be binary backwards compatible given how traits are mixed into the concrete classes, or did this change in 2.12 compiler? (Akka calling that method on a concrete class that was compiled with older Akka version). That might be alright breakage (I don't expect many external implementations)?

Another observation is that implementations of Scheduler in Java is not supported. Can't really work with such trait. It should have been an abstract class.

@jrudolph
Copy link
Member

However, that would not be binary backwards compatible given how traits are mixed into the concrete classes, or did this change in 2.12 compiler? (Akka calling that method on a concrete class that was compiled with older Akka version). That might be alright breakage (I don't expect many external implementations)?

I think that probably has changed with 2.12 but we would have to test.

Another observation is that implementations of Scheduler in Java is not supported. Can't really work with such trait. It should have been an abstract class.

That might actually also work with 2.12 but it seems we also offer AbstractScheduler for Java purposes.

@patriknw
Copy link
Member Author

ok, seems promising and we have a few things to try out

@patriknw patriknw removed this from the 2.6.0-M2 milestone May 24, 2019
patriknw added a commit that referenced this issue May 27, 2019
* previous `schedule` method is trying to maintain a fixed average frequency
  over time, but that can undersired bursts of scheduled tasks after a long
  GC or if the JVM process has been suspended, same with all other periodic
  scheduled message sending via various Timer APIs
* most of the time "fixed delay" is more desirable
* we can't just change because it's too big behavioral change and some might
  depend on previous behavior
* deprecate the old `schedule` and introduce new `scheduleWithFixedDelay`
  and `scheduleAtFixedRate`, when fixing the deprecation warning users should
  make a concious decision of which behavior to use (scheduleWithFixedDelay in
  most cases)
patriknw added a commit that referenced this issue May 27, 2019
* previous `schedule` method is trying to maintain a fixed average frequency
  over time, but that can result in undesired bursts of scheduled tasks after a long
  GC or if the JVM process has been suspended, same with all other periodic
  scheduled message sending via various Timer APIs
* most of the time "fixed delay" is more desirable
* we can't just change because it's too big behavioral change and some might
  depend on previous behavior
* deprecate the old `schedule` and introduce new `scheduleWithFixedDelay`
  and `scheduleAtFixedRate`, when fixing the deprecation warning users should
  make a concious decision of which behavior to use (scheduleWithFixedDelay in
  most cases)
patriknw added a commit that referenced this issue May 28, 2019
* previous `schedule` method is trying to maintain a fixed average frequency
  over time, but that can result in undesired bursts of scheduled tasks after a long
  GC or if the JVM process has been suspended, same with all other periodic
  scheduled message sending via various Timer APIs
* most of the time "fixed delay" is more desirable
* we can't just change because it's too big behavioral change and some might
  depend on previous behavior
* deprecate the old `schedule` and introduce new `scheduleWithFixedDelay`
  and `scheduleAtFixedRate`, when fixing the deprecation warning users should
  make a concious decision of which behavior to use (scheduleWithFixedDelay in
  most cases)
patriknw added a commit that referenced this issue Jun 3, 2019
* previous `schedule` method is trying to maintain a fixed average frequency
  over time, but that can result in undesired bursts of scheduled tasks after a long
  GC or if the JVM process has been suspended, same with all other periodic
  scheduled message sending via various Timer APIs
* most of the time "fixed delay" is more desirable
* we can't just change because it's too big behavioral change and some might
  depend on previous behavior
* deprecate the old `schedule` and introduce new `scheduleWithFixedDelay`
  and `scheduleAtFixedRate`, when fixing the deprecation warning users should
  make a concious decision of which behavior to use (scheduleWithFixedDelay in
  most cases)
patriknw added a commit that referenced this issue Jun 5, 2019
* previous `schedule` method is trying to maintain a fixed average frequency
  over time, but that can result in undesired bursts of scheduled tasks after a long
  GC or if the JVM process has been suspended, same with all other periodic
  scheduled message sending via various Timer APIs
* most of the time "fixed delay" is more desirable
* we can't just change because it's too big behavioral change and some might
  depend on previous behavior
* deprecate the old `schedule` and introduce new `scheduleWithFixedDelay`
  and `scheduleAtFixedRate`, when fixing the deprecation warning users should
  make a concious decision of which behavior to use (scheduleWithFixedDelay in
  most cases)

* Streams
* SchedulerSpec
  * test both fixed delay and fixed rate
* TimerSpec
* FSM and PersistentFSM
* mima
* runnable as second parameter list, also in typed.Scheduler
* IllegalStateException vs SchedulerException
* deprecated annotations
* api and reference docs, all places
* migration guide
patriknw added a commit that referenced this issue Jun 5, 2019
scheduleWithFixedDelay vs scheduleAtFixedRate, #26910
@patriknw patriknw removed the 3 - in progress Someone is working on this ticket label Jun 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:core
Projects
None yet
Development

No branches or pull requests

2 participants