Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NimBLE scan stops generating results, then crashes when resumed. (IDFGH-2001) #4196

Closed
jimmo opened this issue Oct 14, 2019 · 34 comments
Closed
Assignees
Labels
Status: Done Issue is done internally

Comments

@jimmo
Copy link
Contributor

jimmo commented Oct 14, 2019

Environment

  • Development Kit: TinyPICO
  • Module or chip used: ESP32-PICO-D4
  • IDF version (run git describe --tags to find it): v4.0-beta1, v4.1-dev
  • Build System: MicroPython
  • Compiler version (run xtensa-esp32-elf-gcc --version to find it): (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 5.2.0
  • Operating System: Linux
  • Power Supply: USB

Problem Description

Using NimBLE, running a ble_gap_disc, the stack eventually stops generating new scan results (i.e. the ble_gap_event_fn passed to ble_gap_disc no longer gets called).

If I stop scanning (i.e. ble_gap_disc_cancel), then start again, it fails with:

ASSERT_PARAM(64 0), in rwble.c at line 235
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC      : 0x40087aad  PS      : 0x00060034  A0      : 0x8008b964  A1      : 0x3ffbe4d0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x60008054  
A6      : 0x3ffbdb60  A7      : 0x60008050  A8      : 0x80087aad  A9      : 0x3ffbe4b0  
A10     : 0x00000004  A11     : 0x00000000  A12     : 0x6000804c  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x400879e5  LEND    : 0x400879ec  LCOUNT  : 0x00000000  
Core 0 was running in ISR context:
EPC1    : 0x401c2722  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40087aad

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40087aaa:0x3ffbe4d0 0x4008b961:0x3ffbe4f0 0x4008c54f:0x3ffbe510 0x40085ea5:0x3ffbe530 0x401c271f:0x3ffbbcb0 0x40152d2f:0x3ffbbcd0 0x40095495:0x3ffbbcf0 0x40093fe9:0x3ffbbd20

Core 1 register dump:
PC      : 0x401c2722  PS      : 0x00060f34  A0      : 0x80152d32  A1      : 0x3ffbc430  
A2      : 0x3ffc289d  A3      : 0x00060f20  A4      : 0x800941df  A5      : 0x3ffcfee0  
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x8010f168  A9      : 0x3ffbc420  
A10     : 0x00000003  A11     : 0x00060f23  A12     : 0x00060f20  A13     : 0x3ffcfec0  
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x401c271f:0x3ffbc430 0x40152d2f:0x3ffbc450 0x40095495:0x3ffbc470 0x40093fe9:0x3ffbc4a0

I have been investigating this issue that was reported by MicroPython users. The typical time to failure was several hours, but I was able to repro it much faster with some devices sending advertisements at a high rate (50 per second). Usually takes about 10-15 mins for the issue to occur.

The most important detail about my configuration is that it's running inside MicroPython, but I can repro this if I turn gap_scan_cb into a no-op, so that should hopefully exclude any issues with what we're doing in the callback. No other BLE operations were active. It's a passive scan, with interval=30ms, window=30ms, duration=forever. Another detail is that NimBLE is pinned to core 1 (same core as the MicroPython task), but I saw this problem when it was on the default core (0). The main MicroPython task is blocked on the UART (for the REPL) via ulTaskNotifyTake.

MicroPython doesn't use IDF's makefiles or cmake, but it does use sdkconfig. I've attached the input and generated sdkconfig.h below.

I wonder if this might be related to #4001 (but that was on Bluedroid, not NimBLE). I can repro with both v4.0-beta1 and v4.1-dev.

Here's a disassembly of r_rwble_isr which seems to be raising the assert above https://gist.github.com/jimmo/43ba1da440fcbebfab89c70e345f368b (I've marked the relevant line with ********) I've attached the elf below. Built from https://github.com/jimmo/micropython/tree/ble-fixes

Other items if possible

application.elf.zip
sdkconfig.zip

@github-actions github-actions bot changed the title NimBLE scan stops generating results, then crashes when resumed. NimBLE scan stops generating results, then crashes when resumed. (IDFGH-2001) Oct 14, 2019
@copercini
Copy link

copercini commented Oct 18, 2019

I'm not 100% sure if this is related, but it occurs in bluedroid too, if you have many results, the scan stuck after a while, so a workaround is to scans a short period of time (1 minute), stop some time (10 secs) and back to scan again instead try scan it forever

@prasad-alatkar
Copy link
Contributor

Hi @jimmo I am trying to reproduce this issue with the above mentioned settings using blecent example from $IDF_PATH/examples/bluetooth/nimble/blecent, unfortunately even after stressing it for >48Hrs I have not yet been able to see the crash.

Using NimBLE, running a ble_gap_disc, the stack eventually stops generating new scan results (i.e. the ble_gap_event_fn passed to ble_gap_disc no longer gets called).

Can you please confirm if you have not set disc_params.filter_duplicates = 1?

If I stop scanning (i.e. ble_gap_disc_cancel), then start again, it fails with:

Can you please tell me where exactly you are calling ble_gap_dic_cancel?

The main MicroPython task is blocked on the UART (for the REPL) via ulTaskNotifyTake.

I guess this might be differentiating factor between our setup, I will try to replicate this at my end and update you.

@jimmo
Copy link
Contributor Author

jimmo commented Oct 19, 2019

@prasad-alatkar Thanks for looking into this!

Can you please confirm if you have not set disc_params.filter_duplicates = 1?

We set filter_duplicates to 0.

Can you please tell me where exactly you are calling ble_gap_dic_cancel?

We call cancel from the main MicroPython task.

To elaborate -- the MicroPython firmware gives me a REPL prompt on the device, from which I call a Python function which internally calls ble_gap_disc, with interval and window both set to 30ms.

When the disc callback is invoked, it copies the data into a ringbuffer and notifies the MicroPython task, which then schedules a Python callback that prints out the scan data. (When I referred to making the callback a no-op, this is the code I'm talking about)

After some time interval (usually 10-15 mins in my testing at ~50 results/s, but I've had reports from users that this can take many hours at a slower rate), I stop seeing the print output from my Python callback.

So that's where I interactively stop the scan from the REPL, then re-start it, at which point I see the crash described in rwble.c.

The main MicroPython task is created from app_main().

    xTaskCreatePinnedToCore(mp_task, "mp_task", MP_TASK_STACK_LEN, NULL, MP_TASK_PRIORITY, &mp_main_task_handle, MP_TASK_COREID);

MP_TASK_COREID is 1. We also pin NimBLE to core 1.

@prasad-alatkar
Copy link
Contributor

Hi @jimmo, apologies for late reply. FYI, I was able to reproduce the issue with high advertisement rate. Though I am still looking into the root-cause, can you try disabling Bluetooth Modem Sleep ? From menuconfig just follow this path : Component config > Bluetooth > Bluetooth controller > MODEM SLEEP Options. Alternatively you can comment out CONFIG_BTDM_MODEM_SLEEP and CONFIG_BTDM_MODEM_SLEEP_MODE_ORIG from your sdkconfig.

Please share your valuable observations.

@jimmo
Copy link
Contributor Author

jimmo commented Oct 25, 2019

@prasad-alatkar Great that you were able to repro the issue, and thanks again for looking into it!

As you can probably see in our sdkconfig.h, we're using the default value of the sleep params (FYI MicroPython still uses tools/kconfig_new/confgen.py, so I just added CONFIG_BTDM_MODEM_SLEEP=n to our input config https://github.com/micropython/micropython/blob/master/ports/esp32/boards/sdkconfig.ble). Confirmed that this removed those two lines from the generated sdkconfig.h

Unfortunately I was still able to repro the issue. I actually got it to happen after a very short amount of time (less than a minute), and then again after about 15 minutes.

@0neblock
Copy link
Contributor

hi @prasad-alatkar, as @jimmo mentioned, this seems very similar to my issue #4001 which includes a coredump of the esp_32 when the error occurs which may give you greater insight into the operation of the BT system when it fails.
If you need any further information please let me know.

@prasad-alatkar
Copy link
Contributor

Hi @jimmo @0neblock apologies for delayed response. @csushantk is debugging this issue internally. I will update you once the issue is fixed.

@0neblock
Copy link
Contributor

Hi @prasad-alatkar
I noticed that a few commits were pushed to the Github repo recently, can you please confirm if your fix for this issue was included?

I am tracking release/v3.3 for my IDF toolchain.
I am testing this latest release now.
Thanks.

@prasad-alatkar
Copy link
Contributor

Hi @0neblock the issue is not yet fixed, I will update you as soon as it is fixed.

@pschlang
Copy link

@prasad-alatkar Apologies for pushing again, but is there any news here? Any workaround is highly appreciated.

@prasad-alatkar
Copy link
Contributor

Hi @pschlang , @csushantk is working on fixing the issue in controller. I am also working on workarounds in host, however it may take some time to fix this issue. I will update you as soon as it is fixed.

@pschlang
Copy link

Hi @prasad-alatkar, we also observe this issue in 4.0-rc1. We urgently need a workaround for this. Can you give some more details on how to avoid this issue or how we could detect and recover from it?

@ikalogic
Copy link

+1 if a workaround exists, that would be great until we have a solid fix.

@dmartauz
Copy link

@prasad-alatkar Any news regarding the fix?

@prasad-alatkar
Copy link
Contributor

Hi @dmartauz Can you please try and test on latest master ? commits 48bd2d74b4209bbad7c29d2d856810ee43a1a92c and 74c7db3f49e8605d7ec3736a3ba508a9f8d3fd9a are expected to resolve the crash issue. Please let me know if you still face the issue .

@dmartauz
Copy link

@prasad-alatkar Great, thanks. I will test in a week or two.

@0neblock
Copy link
Contributor

@prasad-alatkar is this excepted to fix bluedroid as well?

@jimmo
Copy link
Contributor Author

jimmo commented Mar 31, 2020

@prasad-alatkar thanks for this. When you say "are expected to resolve the crash issue." does that mean that your previously-failing test case now works?

Any chance you'd be able to back-port to v4.0? I don't have the time right now to move MicroPython to v4.1-beta1 (and we'd rather stick to the release version anyway). A few people have contacted us asking about this issue.

@prasad-alatkar
Copy link
Contributor

Hi @jimmo , the changes are backported to v4.0 as well.

When you say "are expected to resolve the crash issue." does that mean that your previously-failing test case now works?

These fixes address the bugs that were found while debugging this issue. Please let me know if it resolves your issue.

@0neblock sorry for delayed response. As the fix is in controller, I believe it will be applicable to bluedroid as well (CC: @csushantk )

@0neblock
Copy link
Contributor

Thanks @prasad-alatkar,
Do you know when/if these changes would be backported to v3.3?

@jimmo
Copy link
Contributor Author

jimmo commented Mar 31, 2020

@prasad-alatkar ah I was looking out for a new tag, but it seems that you have just moved the existing v4.0 tag? That's going to create some confusion. Can you do point releases in the future?

Thanks for the confirmation, I will try and get some people to test it ASAP. I can confirm that MicroPython builds cleanly against the new v4.0 revision tag, I will update here when I hear back from our users.

jimmo added a commit to jimmo/micropython that referenced this issue Apr 1, 2020
This is a "re-release" of v4.0. The "v4.0" tag was updated to include some backported fixes.

The main one is espressif/esp-idf#4196
jimmo added a commit to jimmo/micropython that referenced this issue Apr 1, 2020
This is a "re-release" of v4.0. The "v4.0" tag was updated to include some backported fixes.

The main fix relevant to MicroPython is espressif/esp-idf#4196
@prasad-alatkar
Copy link
Contributor

prasad-alatkar commented Apr 1, 2020

@0neblock actually the changes are backported to release/v3.3 as well. @jimmo by v4.0 I was referring to release/v4.0, sorry if there was any confusions because of this.

@0neblock
Copy link
Contributor

0neblock commented Apr 1, 2020

Hi @prasad-alatkar, the esp32-bt-lib On release/v3.3 hasn’t been updated for a month, is the fix definitely in there ?

@prasad-alatkar
Copy link
Contributor

Hi @0neblock I think commit 72fb22ed has the corresponding fix (cc: @csushantk @chinmaychhajed )

jimmo added a commit to jimmo/micropython that referenced this issue Jun 6, 2020
jimmo added a commit to jimmo/micropython that referenced this issue Jun 9, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 12, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 13, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 13, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 14, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 14, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 15, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 18, 2020
@0neblock
Copy link
Contributor

I will try to reproduce issue(#4001) and revert back if I find anything. Just to be sure, Can you confirm that your application just does scanning and not advertising? Can you please also try to disable CONFIG_BLE_ADV_REPORT_FLOW_CONTROL_SUPPORTED and see if you still face the issue ?

Sorry @prasad-alatkar for my delay, I can confirm that disabling 'CONFIG_BLE_ADV_REPORT_FLOW_CONTROL_SUPPORTED' does not fix the issue in #4001, anything else i can try?

pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 24, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jun 25, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jul 3, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jul 3, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jul 3, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Jul 3, 2020
@Alvin1Zhang
Copy link
Collaborator

Thanks for reporting and sorry for slow turnaround, the fix is available at 2f6f842. Feel free to reopen if the issue still happens. Thanks.

@GianlucaLoi
Copy link

Hello @prasad-alatkar ,

I'm having the same problem of @0neblock (issue #4001 ) using ESP-IDF v4.3-dev-907-g6c17e3a64.
There is any workaround or fix available for this esp-idf version?

Regards,

Gianluca.

pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Sep 21, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Dec 28, 2020
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Feb 1, 2021
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Feb 1, 2021
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Mar 10, 2021
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Mar 10, 2021
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Mar 10, 2021
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Apr 9, 2021
pfalcon pushed a commit to pfalcon/pycopy that referenced this issue Apr 18, 2021
@espressif-bot espressif-bot added the Status: Done Issue is done internally label May 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Done Issue is done internally
Projects
None yet
Development

No branches or pull requests