Bursts of scheduled tasks due to scheduler drift compensation #26910

patriknw · 2019-05-13T11:53:40Z

When scheduling periodic tasks/messages the scheduler compensates for drift:

akka/akka-actor/src/main/scala/akka/actor/LightArrayRevolverScheduler.scala

Lines 105 to 107 in e071077

    
           val driftNanos = clock() - getAndAdd(delay.toNanos) 
        
           if (self.get != null) 
        
             swap(schedule(executor, this, Duration.fromNanos(Math.max(delay.toNanos - driftNanos, 1))))

That is probably good for small variations, but it results in bursts when the process is suspended for long time (because of GC, or simulations with kill -STOP, and I can imagine that processes are suspended in virtualized or container environments)

Example to reproduce:

    val system = ActorSystem("DriftTest")

    import scala.concurrent.duration._
    import system.dispatcher
    @volatile var time = System.nanoTime()
    system.scheduler.schedule(1.second, 1.second) {
      val now = System.nanoTime()
      val duration = (now - time).nanos
      time = now
      println(s"# tick ${duration.toMillis} ms") // FIXME
    }

use kill -STOP pid, wait for 10 seconds and then kill -CONT pid

Example output (with some extra logging in the scheduler)

# scheduler drift 13 ms => schedule next in 986  ms
# tick 997 ms
# scheduler drift 10 ms => schedule next in 989  ms
# tick 1001 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 1000 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 1000 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 999 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 998 ms
# scheduler drift 11 ms => schedule next in 988  ms

[1]+  Stopped

[13:41:56] [patrik@pn2 /Users/patrik/dev/akka]
$ # tick 15005 ms
# scheduler drift 14017 ms => schedule next in 0  ms
# tick 0 ms
# scheduler drift 13017 ms => schedule next in 0  ms
# tick 3 ms
# scheduler drift 12020 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 11031 ms => schedule next in 0  ms
# tick 11 ms
# scheduler drift 10042 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 9052 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 8062 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 7072 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 6082 ms => schedule next in 0  ms
# tick 11 ms
# scheduler drift 5093 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 4103 ms => schedule next in 0  ms
# tick 8 ms
# scheduler drift 3112 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 2121 ms => schedule next in 0  ms
# tick 10 ms
# scheduler drift 1131 ms => schedule next in 0  ms
# tick 9 ms
# scheduler drift 141 ms => schedule next in 858  ms
# tick 870 ms
# scheduler drift 11 ms => schedule next in 988  ms
# tick 1002 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 998 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 1000 ms
# scheduler drift 12 ms => schedule next in 987  ms
# tick 1000 ms
# scheduler drift 13 ms => schedule next in 986  ms
# tick 999 ms
# scheduler drift 12 ms => schedule next in 987  ms

I think this is what causing the problem reported in #26786
but I can imagine all kind of things that will add massive load on the system when waking up and running all scheduled tasks repeatedly to catch up with missed ticks..

The text was updated successfully, but these errors were encountered:

…ed, #26910

jrudolph · 2019-05-14T11:31:05Z

That behavior is at least partly specified in schedule's Scaladoc:

If the execution of the runnable takes longer than the interval, the subsequent execution will start immediately after the prior one completes

I wonder, though, if that behavior is really needed by most callers. It would be good if we could offer another API that just discards the missed runs and just carries on scheduling the next task with the original interval. Would be good if we could make this even the default behavior (but how to do so compatibly?).

Assuming that we need to keep the current behavior as well for compatibility reasons, could we improve it to lower the impact of the burst? Delaying the missed runs might just be shifting the problem to a later point if the system currently cannot keep up anyway.

patriknw · 2019-05-14T13:51:22Z

With the config I added in the PR it's possible to disable the drift compensation completely by setting it to 0 s. That config is global, but I doubt users care to have different behavior for different things.

jrudolph · 2019-05-14T15:21:14Z

With the config I added in the PR it's possible to disable the drift compensation completely by setting it to 0 s. That config is global, but I doubt users care to have different behavior for different things.

How safe is that? E.g. if you implement a clock with scheduled ticks, that will go wrong if there are GC pauses which are too long. The existing code, explicitly tries to keep an invariant that if you schedule something you will (in steady state) get total-time-elapsed / interval messages. A clock implemented that way might go wrong temporarily but will catch up eventually (as long as nanoTime is accurate).

Users might not have implemented a simple clock but something similar based on that invariant, so I'm not so sure we can just change the behavior in that regard.

jrudolph · 2019-05-14T15:22:18Z

I think that drift is misleading because it's not about being late but about keeping the average rate.

patriknw · 2019-05-14T16:16:12Z

ok, if that shall continue to be a strong invariant of schedule we should at least make the catch up more smooth than firing all missing immediately after suspension. By not scheduling with 0 delay to catch up but use something like half of the average (given) delay.

Then we would also need to add another schedule that doesn't have this property and use that in all places internally because I don't think there is one single place where we want/need the compensation.

Having both alternatives in end-user api feels like too many alternatives for something that very few would care about. Maybe we could change behavior in 2.6 but with a config that switch back to old behavior?

patriknw · 2019-05-15T06:26:10Z

I did some more research of what we have in Akka and how it's named and documented in the JDK.

In Akka:

Scheduler
- schedule (interval parameter)
  - variant for running function (Runnable)
  - variant for sending message
  - doc: to be run repeatedly with an initial delay and a frequency
  - doc: If the execution of the runnable takes longer than the interval, the
    subsequent execution will start immediately after the prior one completes
    (there will be no overlap of executions of the runnable). In such cases,
    the actual execution interval will differ from the interval passed to this
    method.
  - but this is not documented on the message based methods (but given the same name it should be assumed to have same semantics)
FSM
- setTimer repeat=true (timeout parameter)
  - doc: delay of first message delivery and between subsequent messages)
Actor with Timers (TimerScheduler)
- startPeriodicTimer (interval parameter)
- doc: Start a periodic timer that will send msg to the self actor at a fixed interval.
TimerGraphStageLogic
- schedulePeriodically (interval parameter)
  - periodically with the given interval
  - e.g. used by throttle GroupedWeightedWithin, TickSource
typed.TimerScheduler
- startPeriodicTimer (interval parameter)
  - doc: Start a periodic timer that will send msg to the self actor at a fixed interval.

In JDK:

ScheduledExecutorService (ScheduledThreadPoolExecutor)
- scheduleAtFixedRate (period parameter)
- scheduleWithFixedDelay (delay parameter)
  - doc: If any execution of this task
    takes longer than its period, then subsequent executions
    may start late, but will not concurrently execute.
  - doc: the period between successive executions
java.util.Timer
- schedule (period parameter) fixed-delay
- scheduleAtFixedRate (delay parameter)
- doc: If a timer task takes excessive time
  complete, it "hogs" the timer's task execution thread. This can, in
  turn, delay the execution of subsequent tasks, which may "bunch up" and
  execute in rapid succession when (and if) the offending task finally
  completes.

java.util.Timer also have some nice descriptions:

In fixed-delay execution, each execution is scheduled relative to
the actual execution time of the previous execution. If an execution
is delayed for any reason (such as garbage collection or other
background activity), subsequent executions will be delayed as well.
In the long run, the frequency of execution will generally be slightly
lower than the reciprocal of the specified period.

Fixed-delay execution is appropriate for recurring activities
that require "smoothness." In other words, it is appropriate for
activities where it is more important to keep the frequency accurate
in the short run than in the long run. This includes most animation
tasks, such as blinking a cursor at regular intervals. It also includes
tasks wherein regular activity is performed in response to human
input, such as automatically repeating a character as long as a key
is held down.

In fixed-rate execution, each execution is scheduled relative to the
scheduled execution time of the initial execution. If an execution is
delayed for any reason (such as garbage collection or other background
activity), two or more executions will occur in rapid succession to
"catch up." In the long run, the frequency of execution will be
exactly the reciprocal of the specified period.

Fixed-rate execution is appropriate for recurring activities that
are sensitive to absolute time, such as ringing a chime every
hour on the hour, or running scheduled maintenance every day at a
particular time. It is also appropriate for recurring activities
where the total time to perform a fixed number of executions is
important, such as a countdown timer that ticks once every second for
ten seconds. Finally, fixed-rate execution is appropriate for
scheduling multiple repeating timer tasks that must remain synchronized
with respect to one another.

Given these new insights I have changed my mind. My first approach is not good since it's a mix of the two. We can't change the existing semantics, at least not for Scheduler.schedule, and we have to support both both ways.

In Scheduler we can't easily add a new method (because it's a public interface that others may have implemented), but we can provide the fixed-delay variant as a utility method in Scheduler companion (can be implemented on top of scheduleOnce).

For the other APIs we can add new methods and maybe it's best to deprecate the old one since its semantics was not well defined regarding this and I think most users don't want current fixed-rate semantics and the deprecation warning can be a good signal for that there is a choice to be made. Adding two new methods such as startTimerAtFixedRate and startTimerWithFixedDelay.

jrudolph · 2019-05-15T07:20:17Z

Great overview!

In Scheduler we can't easily add a new method (because it's a public interface that others may have implemented), but we can provide the fixed-delay variant as a utility method in Scheduler companion (can be implemented on top of scheduleOnce).

Shouldn't it be the end goal also for Scheduler to stear users away from the the current schedule to a new scheduleWithFixedDelay method? I wonder if that will work if it's in the companion object.

Is the concern about Scheduler that we might introduce a new method whose name is already taken by an implementing class? Otherwise, it seems we could also think about deprecating the current schedule method (somewhat drastically) and the provide default implementations of the new scheduleWithFixedDelay and scheduleWithFixedRate directly in Scheduler in terms of scheduleOnce. Could that work?

patriknw · 2019-05-15T08:30:40Z

Shouldn't it be the end goal also for Scheduler to stear users away from the the current schedule to a new scheduleWithFixedDelay method? I wonder if that will work if it's in the companion object.

I agree, but Scheduler is not the primary end user api any more. We have higher level APIs such the various Timers APIs. I see Scheduler as a low level tool.

Is the concern about Scheduler that we might introduce a new method whose name is already taken by an implementing class?

That is one concern, but another more important is that adding a new abstract method will break existing implementations if we can't provide a default implementation of it. Looking closer... I thought Scheduler was a pure interface without implementations, but it's not, so maybe we can add a default implementation of that method.

However, that would not be binary backwards compatible given how traits are mixed into the concrete classes, or did this change in 2.12 compiler? (Akka calling that method on a concrete class that was compiled with older Akka version). That might be alright breakage (I don't expect many external implementations)?

Another observation is that implementations of Scheduler in Java is not supported. Can't really work with such trait. It should have been an abstract class.

jrudolph · 2019-05-15T08:43:32Z

However, that would not be binary backwards compatible given how traits are mixed into the concrete classes, or did this change in 2.12 compiler? (Akka calling that method on a concrete class that was compiled with older Akka version). That might be alright breakage (I don't expect many external implementations)?

I think that probably has changed with 2.12 but we would have to test.

Another observation is that implementations of Scheduler in Java is not supported. Can't really work with such trait. It should have been an abstract class.

That might actually also work with 2.12 but it seems we also offer AbstractScheduler for Java purposes.

patriknw · 2019-05-15T09:08:59Z

ok, seems promising and we have a few things to try out

* previous `schedule` method is trying to maintain a fixed average frequency over time, but that can undersired bursts of scheduled tasks after a long GC or if the JVM process has been suspended, same with all other periodic scheduled message sending via various Timer APIs * most of the time "fixed delay" is more desirable * we can't just change because it's too big behavioral change and some might depend on previous behavior * deprecate the old `schedule` and introduce new `scheduleWithFixedDelay` and `scheduleAtFixedRate`, when fixing the deprecation warning users should make a concious decision of which behavior to use (scheduleWithFixedDelay in most cases)

* previous `schedule` method is trying to maintain a fixed average frequency over time, but that can result in undesired bursts of scheduled tasks after a long GC or if the JVM process has been suspended, same with all other periodic scheduled message sending via various Timer APIs * most of the time "fixed delay" is more desirable * we can't just change because it's too big behavioral change and some might depend on previous behavior * deprecate the old `schedule` and introduce new `scheduleWithFixedDelay` and `scheduleAtFixedRate`, when fixing the deprecation warning users should make a concious decision of which behavior to use (scheduleWithFixedDelay in most cases)

* previous `schedule` method is trying to maintain a fixed average frequency over time, but that can result in undesired bursts of scheduled tasks after a long GC or if the JVM process has been suspended, same with all other periodic scheduled message sending via various Timer APIs * most of the time "fixed delay" is more desirable * we can't just change because it's too big behavioral change and some might depend on previous behavior * deprecate the old `schedule` and introduce new `scheduleWithFixedDelay` and `scheduleAtFixedRate`, when fixing the deprecation warning users should make a concious decision of which behavior to use (scheduleWithFixedDelay in most cases) * Streams * SchedulerSpec * test both fixed delay and fixed rate * TimerSpec * FSM and PersistentFSM * mima * runnable as second parameter list, also in typed.Scheduler * IllegalStateException vs SchedulerException * deprecated annotations * api and reference docs, all places * migration guide

scheduleWithFixedDelay vs scheduleAtFixedRate, #26910

Jackson whitelist for deserialization of unbound class, #26910

patriknw added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:core labels May 13, 2019

patriknw self-assigned this May 13, 2019

patriknw added the 3 - in progress Someone is working on this ticket label May 13, 2019

patriknw added a commit that referenced this issue May 13, 2019

Avoid bursts of scheduled tasks when JVM wakes up after being suspend…

1eab02c

…ed, #26910

This was referenced May 13, 2019

Avoid bursts of scheduled tasks when JVM wakes up after being suspended, #26910 #26912

Closed

Unknown compression table version #26786

Closed

patriknw added this to the 2.6.0-M2 milestone May 14, 2019

patriknw mentioned this issue May 22, 2019

Specific minimal scheduler API for typed #26991

Closed

patriknw removed this from the 2.6.0-M2 milestone May 24, 2019

patriknw mentioned this issue May 27, 2019

scheduleWithFixedDelay vs scheduleAtFixedRate, #26910 #27034

Merged

8 tasks

patriknw added a commit that referenced this issue Jun 5, 2019

Jackson whitelist for deserialization of unbound class, #26910

cbe0b12

patriknw mentioned this issue Jun 5, 2019

Jackson whitelist for deserialization of unbound class, #26910 #27092

Merged

patriknw added a commit that referenced this issue Jun 5, 2019

Merge pull request #27034 from akka/wip-26910-schedule-patriknw

c4bf933

scheduleWithFixedDelay vs scheduleAtFixedRate, #26910

patriknw removed the 3 - in progress Someone is working on this ticket label Jun 5, 2019

patriknw added this to the 2.6.0-M3 milestone Jun 5, 2019

patriknw closed this as completed Jun 5, 2019

patriknw added a commit that referenced this issue Jun 7, 2019

Jackson whitelist for deserialization of unbound class, #26910

1b7c3d3

patriknw added a commit that referenced this issue Jun 7, 2019

Jackson whitelist for deserialization of unbound class, #26910

c62f428

patriknw added a commit that referenced this issue Jun 10, 2019

Merge pull request #27092 from akka/wip-26910-prefix-patriknw

6e6b157

Jackson whitelist for deserialization of unbound class, #26910

johanandren mentioned this issue Sep 24, 2020

Performance degradation after migrating from akka streams v2.5 to v2.6 #29663

Open

to11mtm mentioned this issue Nov 24, 2020

Discussion: Port scala LightArrayRevolverScheduler (HashedWheelTimerScheduler) changes for fixed rate vs Fixed Delay? akkadotnet/akka.net#4640

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bursts of scheduled tasks due to scheduler drift compensation #26910

Bursts of scheduled tasks due to scheduler drift compensation #26910

patriknw commented May 13, 2019

jrudolph commented May 14, 2019

patriknw commented May 14, 2019

jrudolph commented May 14, 2019

jrudolph commented May 14, 2019

patriknw commented May 14, 2019

patriknw commented May 15, 2019

jrudolph commented May 15, 2019

patriknw commented May 15, 2019

jrudolph commented May 15, 2019

patriknw commented May 15, 2019

Bursts of scheduled tasks due to scheduler drift compensation #26910

Bursts of scheduled tasks due to scheduler drift compensation #26910

Comments

patriknw commented May 13, 2019

jrudolph commented May 14, 2019

patriknw commented May 14, 2019

jrudolph commented May 14, 2019

jrudolph commented May 14, 2019

patriknw commented May 14, 2019

patriknw commented May 15, 2019

jrudolph commented May 15, 2019

patriknw commented May 15, 2019

jrudolph commented May 15, 2019

patriknw commented May 15, 2019