Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FreeRTOS timer callbacks occasionally executed twice (IDFGH-6590) #8234

Closed
dexterbg opened this issue Jan 13, 2022 · 13 comments
Closed

FreeRTOS timer callbacks occasionally executed twice (IDFGH-6590) #8234

dexterbg opened this issue Jan 13, 2022 · 13 comments
Assignees
Labels
Resolution: Cannot Reproduce Issue cannot be reproduced Status: Done Issue is done internally

Comments

@dexterbg
Copy link

Environment

  • Development Kit: none
  • Kit version (for WroverKit/PicoKit/DevKitC): none
  • Module or chip used: ESP32-WROVER
  • IDF versions: v3.3, v4.4 (release branches as of today)
  • Build Systems: make for v3.3, idf.py for v4.4
  • Compiler versions: gcc version 5.2.0 (crosstool-ng-1.22.0-97-gc752ad5) for idf v3.3, gcc version 8.4.0 (crosstool-NG esp-2021r2) for idf v4.4
  • Operating System: Linux, macOS
  • Using an IDE?: No
  • Power Supply: USB

Problem Description

FreeRTOS single shot timer callbacks occasionally are called twice. This occurs especially with CPU/MEM load and Wifi network traffic, but it also has been seen with Wifi disabled running a PPP network via modem. The issue can be reproduced by building & launching the attached test projects and starting some flood pings to the module, see below for details.

Source project / reference:

Additional note: it's possible this isn't limited to single shot timers. I haven't tested periodic timers for this behaviour yet.

Expected Behavior

A single shot timer callback is never executed twice on timer expiry.

Actual Behavior

Single shot timer callbacks occasionally are executed twice on timer expiry.

Steps to reproduce

  1. Build & run the test project(s)
  2. Start some flood pings to the module
  3. Monitor the output for "duplicate callback" messages

Code to reproduce this issue

See attached ZIP archive (contains two projects, one for idf 3.3 & one for 4.4):

Debug Logs

See attached ZIP archive, files "testrun.log".

Example excerpt:

I (5486) app_task: running on core 1
I (15846) app_task: checkpoint tid=1000
E (17226) timer_callback: duplicate callback detected on tid=1126
E (22196) timer_callback: duplicate callback detected on tid=1595
I (26396) app_task: checkpoint tid=2000
E (26746) timer_callback: duplicate callback detected on tid=2032
E (28906) timer_callback: duplicate callback detected on tid=2238
E (30216) timer_callback: duplicate callback detected on tid=2359
…

Other items if possible

See attached ZIP archive for sdkconfig.

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 13, 2022
@github-actions github-actions bot changed the title FreeRTOS single shot timer callbacks occasionally executed twice FreeRTOS single shot timer callbacks occasionally executed twice (IDFGH-6590) Jan 13, 2022
@dexterbg
Copy link
Author

Additional info: the effect is also present with SPIRAM disabled (and 64K for cpuload_task to fit in internal RAM), and it's even more frequent when running in single core mode.

Regards,
Michael

@dexterbg
Copy link
Author

The effect also affects periodic timers.

Extended test project including test results: test-timer-dupecall-2.zip

Detection method: there are 5 counting and 1 checkpoint timers, identified by their timer IDs. The counting timers are set to run every 1,2,3,4,5 ticks. The checkpoint timer is set to run every 120 ticks, it subtracts the expected counts and checks for remaining offsets.

Due to shuffling of the internal timer execution list, the counters can have a small offset on one checkpoint which will be equalized on the next. These get logged as warnings. If an offset appears to be stable, an error log line is written.

The checkpoint logs show the long term evolution of the offsets. Offsets should long term always return to zero if all timers are executed exactly as often as they should. If they execute more often than expected, offsets will accumulate to positive values, if less often to negative.

The test run shows these stabilized checkpoints:

I (125386) periodic_callback: checkpoint: offsets = 0, 0, 0, 0, 0
…
I (5525386) periodic_callback: checkpoint: offsets = 1, 0, 0, 0, 0
…
I (13325396) periodic_callback: checkpoint: offsets = 1, 0, 1, 0, 0
…
I (37325396) periodic_callback: checkpoint: offsets = 1, 0, 1, 1, 0
…

Conclusions:

  1. Periodic timer callbacks are occasionally executed twice as well.
  2. This happens with a much lower frequency than with the single shot callback dupes.

Regards,
Michael

@dexterbg dexterbg changed the title FreeRTOS single shot timer callbacks occasionally executed twice (IDFGH-6590) FreeRTOS timer callbacks occasionally executed twice (IDFGH-6590) Jan 15, 2022
dexterbg added a commit to openvehicles/Open-Vehicle-Monitoring-System-3 that referenced this issue Jan 15, 2022
dexterbg added a commit to openvehicles/Open-Vehicle-Monitoring-System-3 that referenced this issue Jan 15, 2022
dexterbg added a commit to openvehicles/Open-Vehicle-Monitoring-System-3 that referenced this issue Jan 15, 2022
dexterbg added a commit to openvehicles/Open-Vehicle-Monitoring-System-3 that referenced this issue Jan 15, 2022
  (ref: espressif/esp-idf#8234)

Note: this scheme allows for callback processing delays of up to 3 ticks,
  it will fail to detect duplicate callbacks for timers with shorter
  periods -- if necessary, use another scheme there.
@david-cermak
Copy link
Collaborator

Hi @dexterbg

Thanks for the report and sharing a very simple sample application that exhibits the problem

void app_task(void *userdata)
{
  ...
    while (xTimerChangePeriod(timer, 1, portMAX_DELAY) != pdPASS);  // [1]
    vTimerSetTimerID(timer, (void *) tid);   // [2]
    while (xTimerStart(timer, portMAX_DELAY) != pdPASS);   // [3]
  ...
}

I wonder why you keep changing the timer period [1] before starting it [3]? Isn't it possible (when the system is really heavy loaded) that the timer callback gets called before we update the timer id [2], i.e. with the older tid?
Assuming that [1] wakes the timer from its dormant state:

* // xTimer is not active, change its period to 500ms. This will also
* // cause the timer to start. Block for a maximum of 100 ticks if the

case tmrCOMMAND_CHANGE_PERIOD:
case tmrCOMMAND_CHANGE_PERIOD_FROM_ISR:
pxTimer->ucStatus |= tmrSTATUS_IS_ACTIVE;
pxTimer->xTimerPeriodInTicks = xMessage.u.xTimerParameters.xMessageValue;
configASSERT( ( pxTimer->xTimerPeriodInTicks > 0 ) );
/* The new period does not really have a reference, and can
* be longer or shorter than the old one. The command time is
* therefore set to the current time, and as the period cannot
* be zero the next expiry time can only be in the future,
* meaning (unlike for the xTimerStart() case above) there is
* no fail case that needs to be handled here. */
( void ) prvInsertTimerInActiveList( pxTimer, ( xTimeNow + pxTimer->xTimerPeriodInTicks ), xTimeNow, xTimeNow );
break;

@dexterbg
Copy link
Author

Hi David,

I seem to never have noticed that xTimerChangePeriod() actually also starts a stopped timer, although it's clearly described in the documentation (https://freertos.org/FreeRTOS-timers-xTimerChangePeriod.html). Feeling stupid now, sorry for wasting your time on this part. I'm pretty sure that's the cause for the single shot callback dupes.

Any idea on why the cyclic timers also seem to get an occasional extra callback? Note we have no conclusive evidence for this happening in our main application, only some occasional effects that may be caused by this -- or by something else. So there may be something wrong with my test design here.

Thanks & regards,
Michael

dexterbg added a commit to openvehicles/Open-Vehicle-Monitoring-System-3 that referenced this issue Jan 29, 2022
Att: xTimerChangePeriod() not only changes the period, it also starts
a dormant (stopped) timer. So the timer was actually already running
(and the callback could be executed) before the payload was assigned.

Ref: espressif/esp-idf#8234 (comment)
@david-cermak
Copy link
Collaborator

Hi Michael,

Sorry for the late reply, no idea about those extra callback of periodic timers. I think it might be tricky to (dis)prove the duplicated callbacks of a cyclic timer. I would perhaps try to match timer's callback occurrences with kernel ticks?

@Dazza0 Dazza0 self-assigned this Feb 14, 2022
@sudeep-mohanty
Copy link
Collaborator

@dexterbg Apologies for getting to this issues so late!
I tried to reproduce the problem with the test-timer-dupecall-2.zip code but was unable to reproduce it on latest release/v4.4 branch. I adapted the source code for the mainline as well and still could not see the periodic timer being called more than once.
At no point do I notice the periodic_callback offsets being greater than 0 in a run of about 20 minutes with Wi-Fi connected in both STA and AP mode -

(126628) cpuload_task: unable to allocate 1312 bytes SPIRAM
I (126628) periodic_callback: checkpoint: offsets = 0, 0, 0, 0, 0
W (127678) wifi:<ba-add>idx:5 (ifx:1, 66:76:2d:06:90:8b), tid: ssn:158, winSize:64
...
...
I (246628) periodic_callback: checkpoint: offsets = 0, 0, 0, 0, 0
...
...
I (486628) periodic_callback: checkpoint: osets = 0, 0, 0, 0, 0

Perhaps you could try it out again on the latest release/v4.4 branch and let us know whether you still see the problem? If so, then maybe we would have to deep dive in the setup differences to understand what might be causing the problem.

@sudeep-mohanty sudeep-mohanty self-assigned this Jul 8, 2022
@dexterbg
Copy link
Author

dexterbg commented Jul 9, 2022

@sudeep-mohanty Thanks for investigating this. I'll try v4.4 ASAP, but your quoted log's runtime was a magnitude too short for the effect to occur.

On my test runs, the first stable offset never occurred before about an hour runtime, with sometimes some hours between two offset jumps.

@sudeep-mohanty
Copy link
Collaborator

@dexterbg I redid the experiment for a (much) longer duration of about ~ 4 hrs and I still did not observe the timer callback being called more than once. Did you get a chance to test it at your end?
Logs -

I (14765825) app_task: checkpoint tid=1476000
I (14765835) periodic_callback: checkpoint: offsets = 0, 0, 0, 0, 0
I (14775825) app_task: checkpoint tid=1477000
I (14785825) app_task: checkpoint tid=1478000
I (14795825) app_task: checkpoint tid=1479000
I (14805825) app_task: checkpoint tid=1480000
I (14815825) app_task: checkpoint tid=1481000
I (14825825) app_task: checkpoint tid=1482000
I (14835825) app_task: checkpoint tid=1483000
I (14845825) app_task: checkpoint tid=1484000
I (14855825) app_task: checkpoint tid=1485000
I (14865825) app_task: checkpoint tid=1486000
I (14875825) app_task: checkpoint tid=1487000
I (14885825) app_task: checkpoint tid=1488000
I (14885835) periodic_callback: checkpoint: offsets = 0, 0, 0, 0, 0

@AxelLin
Copy link
Contributor

AxelLin commented Jul 15, 2022

Environment

  • IDF versions: v3.3, v4.4 (release branches as of today)

@dexterbg
It's difficult to know the exactly commit with above description.
Please share "git describe --tags" output so others can verify if the problem is reproducable on exactly the same commit.

PS. please always use "git describe --tags" for IDF versions when report issues.

@dexterbg
Copy link
Author

I can confirm to also not be able to reproduce this with the current v4.4 release & toolchain.

esp-idf: v4.4.1-581-gee2029c38a
toolchain: crosstool-NG esp-2021r2-patch3 8.4.0

I've been running the test using different build configurations on an ESP32/R1, so I'm pretty sure this has been resolved.

So I guess this issue can be closed.

Thanks everyone involved!

@sudeep-mohanty
Copy link
Collaborator

Thanks for confirming @dexterbg!

@espressif-bot espressif-bot added Resolution: Cannot Reproduce Issue cannot be reproduced Status: Done Issue is done internally and removed Status: Opened Issue is new labels Jul 18, 2022
@AxelLin
Copy link
Contributor

AxelLin commented Jul 18, 2022

@sudeep-mohanty
Which commit fixed this issue? (I'd like to know if it's also fixed in other v4.x branches)

@dexterbg
Copy link
Author

Sorry, I cannot tell if I actually did the periodic callback test on the 4.4 branch. My test logs archived only show the periodic test on v3.3.4-848-g1ff5e24b1b, the only test log I've kept from the idf4 test was with v4.4-beta1-284-gd83021a6e8-dirty, but that was for the single shot test, which was flawed by me.

So the fix for the periodic callbacks may have been in a much earlier idf commit, or may have been in the new toolchain as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Cannot Reproduce Issue cannot be reproduced Status: Done Issue is done internally
Projects
None yet
Development

No branches or pull requests

6 participants